Screen scraping with SpookyJS

“Then I felt just like a fiend,
It wasn’t even close to Halloween.”
Geto Boys

I needed to build a screen scraper for a Node.js application and spent a good deal of time making it all work. Here are the lessons I wish I had known at the outset.

This post is about getting PhantomJS, CasperJS and SpookyJS playing nicely and understanding what role each one plays.

High Level – how it all works

PhantomJS does all the grunt work of scraping the screen. But doing anything remotely interesting with it, like logging in or clicking around, quickly becomes cumbersome. That’s where CasperJS comes in. It sits on top of PhantomJS and makes it easy to log in, follow links and so on. My scraper interacted solely with CasperJS, which handled talking to PhantomJS.

PhantomJS and CasperJS run as separate native processes on the local server. As such, a Node.js app cannot access them directly, at least not without a lot of work.

That’s where SpookyJS comes in. SpookyJS is an npm module that lets you drive CasperJS from within your Node app. How it does this is beyond the scope of this post, but it’s worth a read. It basically spins up a CasperJS process and talks to it via JSON-RPC calls. Neat stuff.

Know Thy Contexts! There are 3 of them

If I can pass on any knowledge in this post, it is this section. Know thy three contexts:

  • Node/SpookyJS context
  • CasperJS context
  • Page context

SpookyJS has a good write-up on how to pass variables from one context to another. Examples always help me understand, so let’s write one.

// In the Node app
var Spooky = require('spooky');

var spooky = new Spooky({
  casper: {
    // configure CasperJS here
  }
}, function (err) {
  // NODE CONTEXT
  console.log('We are in the Node context');
  spooky.start('http://www.mysite.com/');
  spooky.then(function () {
    // CASPERJS CONTEXT
    console.log('We are in the CasperJS context');
    this.emit('console', 'We can also emit events here.');
    this.click('a#somelink');
  });
  spooky.then(function () {
    // CASPERJS CONTEXT
    var size = this.evaluate(function () {
      // PAGE CONTEXT
      console.log('....'); // DOES NOT GET PRINTED OUT
      __utils__.echo('We are in the Page context'); // Gets printed out
      return $('select#myselectlist option').length;
    });
    // Back in the CASPERJS CONTEXT: capture() is a Casper method, not a page one
    this.capture('screenshot.png');
  });
  // Nothing queued above runs until run() is called
  spooky.run();
});

CasperJS injects a very convenient utils module (__utils__) into each page. Inside an evaluate function you are essentially on the page itself, and that’s where you can use jQuery and the utils module. Using console.log there is pointless, as the output does not get captured by your application.
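One way to see why the contexts are separate: as I understand it, SpookyJS ships each function you hand to spooky.then over to the CasperJS process as source text, so variables from the surrounding Node scope do not travel with it. The sketch below is a simplified pure-Node model of that idea, not SpookyJS's actual implementation. shipToOtherContext and demo are hypothetical names made up for illustration.

```javascript
// Hypothetical helper: pretend to send a function to another process as text.
// Only the values listed in `context` are visible to the rebuilt function.
function shipToOtherContext(fn, context) {
  var names = Object.keys(context || {});
  var values = names.map(function (name) { return context[name]; });
  // Rebuilding from source drops whatever closure `fn` had in the caller.
  var rebuilt = new Function(names.join(','), 'return (' + fn.toString() + ')();');
  return rebuilt.apply(null, values);
}

function demo() {
  var greeting = 'hello from Node'; // lives only in this (Node-side) scope
  var step = function () {
    return typeof greeting === 'undefined' ? 'greeting is undefined' : greeting;
  };
  return {
    calledDirectly: step(),                  // closure intact
    shippedBare: shipToOtherContext(step),   // closure lost in transit
    shippedWithContext: shipToOtherContext(step, { greeting: greeting })
  };
}

console.log(demo());
```

The SpookyJS documentation describes a function tuple form, spooky.then([{ greeting: greeting }, function () { ... }]), for passing values across explicitly. Check it against the version you have installed.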

Quick Note: Insert JQuery

I don’t know why CasperJS didn’t just include jQuery. It is so much easier than learning their own selector definitions. Add it when you set up the casper options.

var spooky = new Spooky({
 casper: {
   logLevel: 'error',
   verbose: false,
   options: {
     clientScripts: ['../public/javascripts/jquery.min.js']
   }
 }
 ...

Setting up SpookyJS

I’ll assume you know how to add the SpookyJS npm package to your app. Once you’ve done that, it’s time to start scraping! Here’s where I spent a lot of time.

Before you get too far, I would *strongly* encourage you to understand how CasperJS works. They have a good explanation of it in their docs, especially the section on the ‘evaluate’ method. This was my ‘aha’ moment.

The general flow of a SpookyJS app is to chain together several spooky.then steps. Each step runs once the previous one has completed. Node code placed between spooky.then calls runs immediately, without waiting for the previous step to complete.

spooky.then(function () {
  this.wait(5000, function () {
    this.emit('console', 'step 1');
  });
});
console.log('step 2');
spooky.then(function () {
  this.wait(5000, function () {
    this.emit('console', 'step 3');
  });
});
// Prints (given a spooky.on('console', ...) listener that logs each line):
// step 2
// step 1
// step 3
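
The ordering can be modeled without SpookyJS at all. Here is a minimal sketch, assuming a toy queue (makeQueue is a hypothetical stand-in, not the SpookyJS API): then() only records a step, and nothing recorded runs until run() is called, so plain code between the calls executes first.

```javascript
// Toy stand-in for the step queue; NOT the SpookyJS API.
function makeQueue(log) {
  var steps = [];
  return {
    // Record the step; do not run it yet.
    then: function (step) { steps.push(step); },
    // Run the recorded steps in the order they were added.
    run: function () {
      steps.forEach(function (step) { step(log); });
    }
  };
}

var output = [];
var queue = makeQueue(output);

queue.then(function (log) { log.push('step 1'); });
output.push('step 2'); // plain code: runs immediately
queue.then(function (log) { log.push('step 3'); });
queue.run();

console.log(output.join(', ')); // step 2, step 1, step 3
```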

Getting Slightly Fancier

Say we want to click a link and then download a file from the resulting page. Let’s say the link has an id of account, the file to download on the resulting page is a link with id downloadfile, and we want to save it to /tmp/file.pdf.

spooky.then(function() {
 this.click('a#account');
});
// This next step will not start until the page is loaded
spooky.then(function() {
 this.download(this.getElementAttribute('a#downloadfile','href'), '/tmp/file.pdf');
});
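
If the download link is rendered by script after the page loads, waiting for the page alone may not be enough. A sketch for that case, using CasperJS's waitForSelector to hold the step until the element exists. addDownloadSteps is a hypothetical wrapper so the steps can be queued on any spooky instance; the selectors and target path are the ones from the example above.

```javascript
// Hypothetical wrapper: queues the click-then-download steps on `spooky`.
function addDownloadSteps(spooky) {
  spooky.then(function () {
    // CASPERJS CONTEXT
    this.click('a#account');
  });
  spooky.then(function () {
    // CASPERJS CONTEXT: wait until the link exists, then download it
    this.waitForSelector('a#downloadfile', function () {
      var url = this.getElementAttribute('a#downloadfile', 'href');
      this.download(url, '/tmp/file.pdf');
    }, function () {
      // Runs only if the selector never shows up before the timeout
      this.emit('console', 'Timed out waiting for a#downloadfile');
    });
  });
}
```

Call addDownloadSteps(spooky) after spooky.start(...) and before spooky.run().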

Getting Even Fancier

Now let’s say we have a drop down list of 12 months. Selecting a month refreshes the page and we want to take a screenshot of each of the 12 months.

A note on globals: to share a value between then() steps, you can attach it to the window object as a global. The steps run inside the CasperJS context, so this window belongs to CasperJS, not to the page being scraped. See this in the example below.

// Not showing the setup and config. Find that above.
spooky.start('http://www.pge.com/');
spooky.then(function () {
  // CASPERJS CONTEXT: store the count as a global for later steps
  window.numMonths = this.evaluate(function () {
    // PAGE CONTEXT
    return $('select#month_select option').length; // 12, the number of months
  });
});

spooky.then(function () {
  var casperCount = 0;
  this.repeat(window.numMonths, function () {
    this.evaluate(function (i) {
      $('select#month_select').get(0).selectedIndex = i;
      // Refresh the page using ONE of the two ways below
      $('#month_selection_form').submit(); // If the select is within a form
      // $('select#month_select').change(); // If the page has a trigger on the select
      return true;
    }, { i: casperCount });

    this.then(function () {
      this.capture('month' + casperCount + '.png');
      casperCount = casperCount + 1;
    });
  });
});
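
Globals stay inside the CasperJS context. To get a value back into the Node app, one option is to emit a custom event from a step and listen for it on the spooky object, the same way the 'console' event is used earlier. A sketch, assuming a hypothetical event name 'months.counted' and a hypothetical wrapper watchMonthCount; the selector is the one from the example above.

```javascript
// Hypothetical wrapper: wires a Node-side listener and a CasperJS-side step.
function watchMonthCount(spooky, onCount) {
  // NODE CONTEXT: this handler runs in the Node app when the event arrives
  spooky.on('months.counted', onCount);

  spooky.then(function () {
    // CASPERJS CONTEXT
    var count = this.evaluate(function () {
      // PAGE CONTEXT (assumes jQuery was injected via clientScripts)
      return $('select#month_select option').length;
    });
    this.emit('months.counted', count);
  });
}
```

Usage would look something like watchMonthCount(spooky, function (n) { console.log('months:', n); }) before calling spooky.run().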

Hope That Helps

It took me a while to get all the contexts sorted out but once you understand how each piece plays together nicely with the others, SpookyJS with CasperJS can be very powerful. Add the npm cron package and then you’ve got a first class scraping application.
Happy scraping!
