Screen scraping with SpookyJS

“Then I felt just like a fiend,
It wasn’t even close to Halloween.”
Geto Boys

I needed to build a screen scraper for a Node.js application and spent a good deal of time making it all work. I wanted to share some lessons that I would have found very helpful at the outset.

This post is about getting PhantomJS, CasperJS and SpookyJS playing nicely and understanding what role each one plays.

High Level – how it all works

PhantomJS does all the grunt work of scraping the screen. But to do anything remotely interesting, like logging in, clicking around, it quickly becomes cumbersome. That’s where CasperJS comes in. It sits on top of PhantomJS and lets you easily do things like logging in, following links etc. My scraper interacted solely with CasperJS, and it handled talking to PhantomJS.

PhantomJS and CasperJS are native processes running on the local server. As such, they cannot be directly accessed from a Node.js app, at least not without a lot of work.

That’s where SpookyJS comes in. SpookyJS is an npm module that lets you work with CasperJS directly from within your Node app. How it does this is beyond this post but it’s worth a read. It basically spins up a CasperJS process and talks to it via JSON/RPC calls. Neat stuff.

Know Thy Contexts! There are 3 of them

If I can pass on any knowledge in this post, it is this section. Know thy three contexts:

  • Node/SpookyJS context
  • CasperJS context
  • Page context

SpookyJS has a good write up on how to pass variables from one to the other. Examples always help me understand so let’s write one.

// In the Node app, where we require('spooky') and all that

var spooky = new Spooky({
  casper: {
    //configure casperjs here
  }
}, function (err) {
  // NODE CONTEXT
  console.log('We are in the Node context');
  spooky.start('http://www.mysite.com/');
  spooky.then(function() {
    // CASPERJS CONTEXT
    console.log('We are in the CasperJS context');
    this.emit('console', 'We can also emit events here.');
    this.click('a#somelink');
  });
  spooky.then(function() {
    // CASPERJS CONTEXT
    // capture() is a CasperJS method, so it is called here and not inside evaluate()
    this.capture('screenshot.png');
    var size = this.evaluate(function() {
      // PAGE CONTEXT
      console.log('....'); // DOES NOT GET PRINTED OUT
      __utils__.echo('We are in the Page context'); // Gets printed out
      var $selectsize = $('select#myselectlist option').length;
      return $selectsize;
    });
  });
  spooky.run(); // nothing executes until run() is called
});

CasperJS has a very convenient utils module it injects into each page. When you are in an evaluate function, you are essentially on the page itself, and that’s where you can use jQuery and the utils module. Using console.log there is pointless as the output does not get captured by your application.
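
One way to make the context boundaries concrete: as I understand it, a function handed to another context travels as source text and is rebuilt on the other side, so variables from the enclosing scope do not come along. The sketch below imitates that in plain JavaScript, with no SpookyJS or CasperJS involved. The names shipToOtherContext and demo are made up for illustration and are not part of any library.

```javascript
// Hypothetical helper: imitates sending a function to another context by
// keeping only its source text and rebuilding it from that text.
function shipToOtherContext(fn, vars) {
  var source = fn.toString();
  var names = Object.keys(vars || {});
  var values = names.map(function (name) { return vars[name]; });
  // Functions built with the Function constructor only see the global scope
  // plus their own parameters, so the original closure is gone.
  var body = 'return (' + source + ')();';
  var rebuilt = Function.apply(null, names.concat([body]));
  return function () { return rebuilt.apply(null, values); };
}

function demo() {
  var month = 'March'; // exists only in this scope

  // No variables passed along: the rebuilt function cannot see month.
  var lost = shipToOtherContext(function () {
    return typeof month;
  });

  // Variables passed explicitly: month is available as a parameter.
  var passed = shipToOtherContext(function () {
    return typeof month + ':' + month;
  }, { month: month });

  return { lost: lost(), passed: passed() };
}

console.log(demo());
```

This is why values have to be handed from one context to the next explicitly instead of being picked up from the surrounding scope.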

Quick Note: Insert jQuery

I don’t know why CasperJS didn’t just include jQuery. It is so much easier than trying to learn their own selector definitions. Add it when you set up the casper options.

var spooky = new Spooky({
 casper: {
   logLevel: 'error',
   verbose: false,
   options: {
     clientScripts: ['../public/javascripts/jquery.min.js']
   }
 }
 ...

Setting up SpookyJS

I’ll assume you know how to add the SpookyJS npm package to your app. Once you’ve done that, it’s time to start scraping! Here’s where I spent a lot of time.

Before you get too far, I would *strongly* encourage you to understand how CasperJS works. They have a good explanation of it in their docs, especially the section on the ‘evaluate’ method. This was my ‘aha’ moment.

The general flow of a SpookyJS app is that you chain together several spooky.then steps. Each one runs once the previous one has completed. Any plain Node code placed between spooky.then steps runs immediately, without waiting for the previous step to complete.

spooky.then(function() {
 this.wait(5000, function() {
   this.emit('console', 'step 1');
 });
});
console.log('step 2');
spooky.then(function() {
 this.wait(5000, function() {
  this.emit('console', 'step 3');
 });
});
// prints out
// step 2
// step 1
// step 3
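
If the ordering above looks surprising, it helps to think of then() as only adding a step to a list that gets worked through later. Here is a minimal sketch of that idea in plain JavaScript. makeQueue is a hypothetical stand-in I wrote for illustration; it is not the SpookyJS or CasperJS API and it leaves out the waiting entirely.

```javascript
// Hypothetical stand-in for a step queue: then() only records the step,
// run() executes the recorded steps in order.
function makeQueue(log) {
  var steps = [];
  return {
    then: function (fn) { steps.push(fn); },
    run: function () {
      steps.forEach(function (fn) { fn(log); });
    }
  };
}

var log = [];
var queue = makeQueue(log);

queue.then(function (out) { out.push('step 1'); }); // queued, not run yet
log.push('step 2');                                 // runs immediately
queue.then(function (out) { out.push('step 3'); }); // queued, not run yet
queue.run();                                        // now the steps execute

console.log(log.join(', '));
```

The code sitting between the two then() calls is ordinary synchronous code, so it finishes before any queued step starts.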

Getting Slightly Fancier

Say we want to click a link, and then download a file from the resulting page. Let’s say that the link has an id of #account, and on the resulting page, the file to download is a link with id #downloadfile and we want to download it to /tmp/file.pdf

spooky.then(function() {
 this.click('a#account');
});
// This next step will not start until the page is loaded
spooky.then(function() {
 this.download(this.getElementAttribute('a#downloadfile','href'), '/tmp/file.pdf');
});

Getting Even Fancier

Now let’s say we have a drop down list of 12 months. Selecting a month refreshes the page and we want to take a screenshot of each of the 12 months.

** Using globals. To pass variables within then() functions, you can take advantage of global variables, attaching them to the window object. See this in the example below.

//Not showing the set up and config. Find that above.
spooky.start('http://www.pge.com/');
spooky.then(function(){
  window.numMonths = this.evaluate(function() {
    // .length works on every jQuery version; .size() was removed in jQuery 3
    var $selectsize = $('select#month_select option').length;
    return $selectsize; // returns 12, the number of months
  });
});

 spooky.then(function() {
  var casperCount = 0;
  this.repeat(window.numMonths, function() {
   this.evaluate(function(i) {
    $('select#month_select').get(0).selectedIndex = i;
    // Refresh the page using one of the two ways below
    $('#month_selection_form').submit(); // If the select is within a form
    $('select#month_select').change(); // If the page has a trigger on the select
    return true;
   },{ i: casperCount });

   this.then(function() {
    this.capture('month' + casperCount + '.png');
    casperCount = casperCount + 1;
   });
  });
});

Hope That Helps

It took me a while to get all the contexts sorted out but once you understand how each piece plays together nicely with the others, SpookyJS with CasperJS can be very powerful. Add the npm cron package and then you’ve got a first class scraping application.
Happy scraping!


iOS – Updating application state from a UITableViewCell

One of my current clients is a new restaurant that wants an iOS app from which customers can order food.  One of the requirements is allowing the user to select a quantity, from 0 to n, for each of several dishes.  For example:

Side dishes

So the question is as the user is incrementing the values, where do we store this state?  The first thought is to just keep it in the UITableViewCell but this has major drawbacks, including:

  • the data is lost once the UITableViewCell is scrolled off screen, as it is re-used to display other dishes.
  • maintaining state in the view is just not clean design.  This is controller territory.

Protocols and Delegation

The sensible place to keep the data is in the controller.  Then the question becomes, “How do we call back to the controller from the UITableViewCell?”  

Using Protocols and Delegation, we can easily have the controller maintain the state.  Here’s how I did this.

SideDishTableViewCell -> UITableViewCell

I subclassed the view cell with SideDishTableViewCell (SDTVC).  In SDTVC, I defined a protocol:

// SideDishTableViewCell.h

@protocol SideDishTableViewCellDelegate;
@interface SideDishTableViewCell : UITableViewCell

// define a @property to hold a reference to the delegate
@property (assign, nonatomic) id <SideDishTableViewCellDelegate> delegate;
@end

// define the @protocol here
@protocol SideDishTableViewCellDelegate <NSObject>
@optional
- (int)addItemWithCell:(SideDishTableViewCell *)cell;
- (int)removeItemWithCell:(SideDishTableViewCell *)cell;
@end

Now the SDTVC has a reference to a delegate which we’ll call whenever the user adds or removes items.  One note: I chose to pass the entire SDTVC because it contains bits of data that the controller needs. Another option is to just pass the dish’s ID and have the controller do a bit more work to get its metadata.

Calling the delegate from IBAction

Each of the + and – buttons is tied to an IBAction, and the IBAction methods are where we call out to the delegate.

// SideDishTableViewCell.m

// The value 'q' is returned from the controller and is used to update the quantity displayed.  The View does no math.
- (IBAction)increaseQuantity:(id)sender {
    int q = [[self delegate] addItemWithCell:self];
    self.quantityLabel.text = [NSString stringWithFormat:@"%d", q];
}
- (IBAction)decreaseQuantity:(id)sender {
    int q = [[self delegate] removeItemWithCell:self];
    self.quantityLabel.text = [NSString stringWithFormat:@"%d", q];
}

Implementing the protocol from the View Controller

// SideDishViewController.h

#import "SideDishTableViewCell.h"
@interface SideDishViewController : UIViewController <SideDishTableViewCellDelegate>
@end

And the implementation.

// SideDishViewController.m
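-(int)addItemWithCell:(SideDishTableViewCell *)cell {
     // Update a collection that holds the dishes and their quantities.
     // Sketch only: self.quantities (an NSMutableDictionary) and cell.dishID
     // are illustrative names, not shown elsewhere in this post.
     int q = [self.quantities[cell.dishID] intValue] + 1;
     self.quantities[cell.dishID] = @(q);

     // Return this side dish's quantity
     return q;
}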

Summary

There you have it.  We’ve kept the view very simple to the point it doesn’t even have to do any math.  It simply lets the controller handle the increment/decrement and then just waits for the controller to tell it what the new quantity is.

 

Building a binary search tree in Javascript

“A tree’s a tree. How many more do you need to look at?” – Ronald Reagan

I am reading Secrets of the JavaScript Ninja by John Resig and wanted to try out some of the more advanced JavaScript concepts.  I also wanted to do something more than just a ‘hello world’ so I decided to build a binary search tree (BST).

Beauty and the BST

There are many articles out there on BSTs so I will skip going into that here.  What I am interested in building is a simple node ‘object’ in JS that can hold references to its left and right children.  To do this, I decided to use the JS prototype functionality.

// Name and value can be set at creation time so are passed into the constructor
function Node(name, value) {
     this.name = name;
     this.value = value;
}

Node.prototype.setLeft = function(left) {
     this.left = left;
}

Node.prototype.setRight = function(right) {
     this.right = right;
}

BST Insertion Logic

Next up is creating the logic that adds a new node to the right place in the BST.  We are not going to get into rebalancing so it is very possible that this tree is waaaay overweighted on one side.  We will live with that and maybe get to that in a future exercise.

// tree is the root node of the tree; node is the new node to add.
// If node is greater than tree, we add it as tree's right child when tree has
// no right child; otherwise we call insertNode again, this time passing tree's
// right child as the tree parameter. The same logic applies on the left side
// when node is less than or equal to tree.

function insertNode(tree, node) {
    if (tree) {
        if (tree.value < node.value) {
            if (tree.right) {
                insertNode(tree.right, node);
            } else {
                tree.setRight(node);
            }
        } else {
            if (tree.left) {
                insertNode(tree.left, node);
            } else {
                tree.setLeft(node);
            }
        }
    } else {
        tree = node;
    }
    return tree;
}

Testing the BST

Here we do some initial setup in setup, where we add several nodes in no particular order. Then we print out the tree with printTreeAsc to verify we can walk the tree from lowest to highest, starting from root.

function setup() {
    nodeA = new Node('a', 5);
    nodeB = new Node('b', 12);
    nodeC = new Node('c', 10);
    nodeD = new Node('d', 15);
    nodeE = new Node('e', 20);
    nodeF = new Node('f', 25);
    nodeG = new Node('g', 8);
    nodeH = new Node('h', 3);

    var tree = insertNode(tree, nodeA);
    tree = insertNode(tree, nodeB);
    tree = insertNode(tree, nodeC);
    tree = insertNode(tree, nodeD);
    tree = insertNode(tree, nodeE);
    tree = insertNode(tree, nodeF);
    tree = insertNode(tree, nodeG);    
    tree = insertNode(tree, nodeH);    
}

function printTreeAsc(root) {
    var currNode = root;
    if(currNode.left) {
        printTreeAsc(currNode.left);
    }

    console.log(currNode.value);

    if(currNode.right) {
        printTreeAsc(currNode.right);
    }
}

Running setup() and then printTreeAsc(nodeA) yields:

3
5
8
10
12
15
20
25

It works!
Lastly, how tall is my BST?
BSTs are a fun way to work with algorithms and recursion, so I decided to write a method to calculate the height of the tree. Basically this will return the number of nodes on the longest path from nodeA down to the lowest leaf. Perfect candidate for recursion!

function calcHeight(node) {
    if (node) {
        return 1 + Math.max(calcHeight(node.left), calcHeight(node.right));
    } else {
        return 0;
    }
}

Result: 5. Passing in nodeA gives a height of 5, because the longest path (5, 12, 15, 20, 25) contains five nodes.
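
To double check the numbers, here is a compact, self-contained sketch that rebuilds the same tree and collects the values into an array instead of printing them. It is a slightly condensed variant of the code above, not a copy of it: insertNode assigns left and right directly, and inOrder, heightInNodes and heightInEdges are names I added for illustration.

```javascript
function Node(name, value) {
  this.name = name;
  this.value = value;
}

// Recursive insert: values greater than the current node go right,
// everything else (including duplicates) goes left.
function insertNode(tree, node) {
  if (!tree) { return node; }
  if (tree.value < node.value) {
    tree.right = insertNode(tree.right, node);
  } else {
    tree.left = insertNode(tree.left, node);
  }
  return tree;
}

// Counts nodes on the longest root-to-leaf path (an empty tree is 0).
function calcHeight(node) {
  return node ? 1 + Math.max(calcHeight(node.left), calcHeight(node.right)) : 0;
}

// In-order walk that returns the values as an array, lowest first.
function inOrder(node, out) {
  out = out || [];
  if (node) {
    inOrder(node.left, out);
    out.push(node.value);
    inOrder(node.right, out);
  }
  return out;
}

var root = null;
[5, 12, 10, 15, 20, 25, 8, 3].forEach(function (value, i) {
  root = insertNode(root, new Node(String.fromCharCode(97 + i), value));
});

var sorted = inOrder(root);          // [3, 5, 8, 10, 12, 15, 20, 25]
var heightInNodes = calcHeight(root); // 5
var heightInEdges = heightInNodes - 1; // 4 steps from the root to the deepest leaf

console.log(sorted, heightInNodes, heightInEdges);
```

Counting nodes gives 5, while counting the steps between them gives 4, so it is worth being explicit about which definition of height is meant.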

Summary
So we got to see the JS prototype feature in action when we built the tree in insertNode. We also built a simple binary search tree and verified it works by iterating over it in ascending order. And last but not least, we wrote a simple recursive method to determine its height.

Using Node cronjobs to replace Heroku worker dynos

“Time keeps on slipping, slippin…into the future.”  – Steve Miller Band

Hopscotch.fm relies on data that is pulled in from several music related services.  I have created multiple tasks each of which can be run from the command line, like:

node run manualrun getShows sf

This will get the shows for San Francisco. On Heroku, I was using the Scheduler to automatically run this periodically. All good so far.

The Problem

The solution worked great up until I started needing several of these workers, to get shows, to get artist metadata, to get artist songs, cleanup, and more.  Easy enough I thought.  I just added more tasks to the Heroku Scheduler. Except there is a limit to the free tier on Heroku…

The Surprise (the Heroku bill!)

My Heroku bill was over $70!  How did this happen??  Turns out I had exceeded the free monthly hours with all the worker dynos I had been spinning up.  So I needed a solution quick.  I host hopscotch.fm on nodejitsu so I figured why not just use that.

The Solution (cron!)

Enter node-cron. If you’ve ever used Linux/UNIX cron jobs, it’s nearly identical.  The syntax for dates is the same.  All you need to do is specify the function to run.  Here is a cron job in file cronjobs.js that crunches radio stations for a city:

var cronJob = require('cron').CronJob;
var cruncher = require('./cruncher'); // internal class that does the crunching
// Pattern fields: second minute hour day-of-month month day-of-week
var jobMetroRadio = new cronJob('00 05 00 * * 1-7', (function() {
  console.log('Starting crunching shows.');
  return cruncher.crunchShows(function() {
    return console.log('Finished crunching shows.');
  });
}), (function() {}), true, 'America/Los_Angeles'); // start right away; time zone given by name
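
The pattern string is easier to read once each field has a name. The cron package takes six fields with seconds first, whereas a classic crontab line has five. Here is a small sketch that labels them. describeCronPattern is a helper I made up for illustration; it is not part of the cron package and it does no validation beyond counting fields.

```javascript
// Hypothetical helper: splits a six-field cron pattern into named parts.
function describeCronPattern(pattern) {
  var names = ['second', 'minute', 'hour', 'dayOfMonth', 'month', 'dayOfWeek'];
  var parts = pattern.trim().split(/\s+/);
  if (parts.length !== names.length) {
    throw new Error('Expected ' + names.length + ' fields, got ' + parts.length);
  }
  var fields = {};
  names.forEach(function (name, i) {
    fields[name] = parts[i];
  });
  return fields;
}

var fields = describeCronPattern('00 05 00 * * 1-7');
console.log(fields);
```

Read that way, '00 05 00 * * 1-7' means second 00 of minute 05 of hour 00, on any day of the month, in any month, on the listed days of the week.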

Then in app.js I just:

require('./cronjobs');

And lastly add these two packages to package.json and install them

cron  // add to package.json
time  // add to package.json
npm install

The Result
I’ve moved all of the tasks over from Heroku Scheduler onto the nodejitsu deployment and everything is running smoothly. Hooray for cron!

Hopscotch.fm, now with Artist Radio

I finally made some time to work on hopscotch.fm this past week.  I was able to make a couple of major improvements to the site, mostly based on what I found to be annoying/missing when I was using it.

Soundcloud! – First was Youtube, gone.  Terrible success rate in finding videos for artists.  Next was Tomahk, also gone.  Better success rate but kludgy ux.  Now, Soundcloud!   All the songs are now streamed from Soundcloud.  The success rate is great, and I can have much more control over the player.

Artist Radio – now when you like the song you’re listening to, you can turn on Artist Radio and listen to more songs by that artist.

Performance – the unsexy beast in the room.  It was taking up to a minute to build a radio station, making changing stations a task for only the very determined.  Now all the stations are pre-built so they load up almost instantly.

I am finding I listen to hopscotch daily now as it is a great background stream of music while I work.  And last week I went to two shows that were featured on hopscotch radio, Rodriguez at the Warfield (awesome show) and White Ring at Elbo Room (good but not nearly as awesome).

No more Posole, please!

I reached a major milestone for me last month in fully automating the data retrieval for hopscotch.fm.  I now query multiple music api services at differing intervals throughout the day.  And adding a new city is quite simple to do.  Welcome Chicago!

But the ui was making me nauseous every time I saw it. Cool font that lost its cool the 100th time you’ve seen it. The off white background that just begs to be more interesting.

And the rate of 1 in 4 incorrect videos needs to improve vastly, and it is one thing that will affect when I announce hopscotch.  Today we had Posole performing in San Francisco, and the video that you saw on hopscotch was educational yet ultimately not what should have been there.

Here is a correct one:

When I completed the last revision, I generally liked the ui, but now I see it is kinda lacking. There isn’t much a user can do with it. I want to incorporate more social and use that to crowdsource the video library collection.

I am considering having a designer help with the ux. First I would want to get the functionality working with the major components like the crowdsourcing piece. I haven’t thought about revenue much because I wouldn’t pay for this site as it is now.

So, no more eating Posole. It’s time to listen to it!

Getting to wireframing

I have always believed wireframing tools were too rigid to let you be creative.  And so I’ve always been a pen and paper sketcher and I like the drawing part of it.  It’s fun and it can be very creative.

But where it doesn’t help is when you’re trying to design to a higher level of precision.  I have started using Balsamiq and I was impressed with how limitless and simple it is.  Everything works as you’d expect it to, and in an hour I had a much clearer idea of what the ui I had sketched out could look like.  Things like proportions and size relative to other objects on the page are much easier to see.

My sketches.  Chris Isaak is in my test group.


Getting hopscotch.fm to version 1.0

This week I launched hopscotch.fm.  It is still a very early stage site and there are many things I’d like to add on and improve, but I decided I needed to draw a line in the sand and just ship something.  Anything.

In this blog post I want to give a general sense of how I went from idea to v1.0 going live.

Step 1: Proof of Concept

In this phase I wanted to rule out any potential major unknowns.  The two biggest ones were:

  • could I use Songkick’s API for my show listings
  • could I build a radio player around Youtube videos

The first one was answered fairly quickly.  The Songkick API features loads of shows and venues and while it’s not the cleanest data (dates are not always proper), it is a great start.

The second was a bit more involved.  I want to make hopscotch.fm a radio player so that when a song completes, it automatically moves on to the next song without the user doing anything.  I found a great library tubeplayer.js that did this.

Step 2: Get an ugly site up and running

It doesn’t have to be ugly, but I wasted no time in making anything look good.  I had buttons all over the page, images showing up over controls and other oddness.  But I got the functionality implemented.  The user hits play and:

  • The Youtube video starts playing
  • An artist image is downloaded via Youtube’s API
  • Venue info is retrieved from Songkick
  • Artist info is retrieved from Wikipedia.  I have since removed this as it was wrong half the time.

Step 3: Decide what constitutes v1.0 and do it

It’s probably no surprise that this step took the most time as now I needed to start caring about error handling and CSS and data validation and everything else that makes a real site, real.  I added Bootstrap, built a little admin page for syncing hopscotch data with Songkick.  Made a best guess effort at what a clean, simple user experience could be and from feedback I’ve gotten, I think it’s a decent start.

Step 4: Hosting and stuff

I originally put it on heroku as they have a free starter level.  But the 20 second startup time just to get the page to load is ridiculous so I’ve now switched to nodejitsu.  Data is on mongolab.  It’s all in the cloud.  It’s all happy.

Step 5: Go see a show!

That’s what hopscotch.fm is for so why not?

Node.js flash info messages

I was looking for a way to build a simple flash framework to easily display error and info messages throughout my webapp’s pages.  This is a pretty standard web development practice and sure enough there is a simple way to do this with node.js.  Technically this is using Express so if you’re using that then check out this write up: http://dailyjs.com/2011/01/03/node-tutorial-8/
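
The core idea behind flash messages is small enough to sketch without Express: a message is stored in the session during one request and removed the first time it is read on the next. This is only an illustration of that idea. The flash function and the plain session object below are assumptions of mine, not the Express API described in the linked write-up.

```javascript
// Hypothetical flash helper working on a plain object that stands in
// for a per-user session.
// flash(session, type, message) stores a message.
// flash(session, type) returns the stored messages and clears them.
function flash(session, type, message) {
  session.flash = session.flash || {};
  if (message === undefined) {
    var messages = session.flash[type] || [];
    delete session.flash[type]; // read once, then gone
    return messages;
  }
  session.flash[type] = session.flash[type] || [];
  session.flash[type].push(message);
  return session.flash[type].length;
}

var session = {};
flash(session, 'error', 'Login failed');
flash(session, 'info', 'Settings saved');

var firstRead = flash(session, 'error');  // ['Login failed']
var secondRead = flash(session, 'error'); // [] because the first read cleared it
var infoRead = flash(session, 'info');    // ['Settings saved']

console.log(firstRead, secondRead, infoRead);
```

Clearing on read is what keeps an error from one page load from showing up again on the next.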