Advanced Web Scraping with Mechanize

Original author: Chris Mytton
Continuing the theme of web scraping in Ruby, I decided to translate the next article by the same author.

In a previous post, I covered the basics: an introduction to web scraping with Ruby. At the end of that post, I mentioned Mechanize, a tool used for more advanced scraping.

This article explains how to do advanced website scraping using Mechanize, which builds on top of Nokogiri for HTML processing.

Scraping reviews from Pitchfork


Out of the box, Mechanize provides tools for filling out form fields, following links, and honoring the robots.txt file. In this post, I'll show you how to use it to get the latest reviews from the Pitchfork website.

Scrape politely
You should always scrape politely. Read the article Is scraping legal? on the ScraperWiki blog for a discussion of this topic.


The reviews are spread across several pages, so we can't just fetch one page and parse it with Nokogiri. This is where we need Mechanize, with its ability to follow links to other pages.

Installation


First you need to install Mechanize and its dependencies via RubyGems.

$ gem install mechanize


Now you can start writing the scraper. Create a file called scraper.rb and add a few require statements to it, declaring the dependencies our script needs. date and json are part of the Ruby standard library, so there's no need to install them separately.

require 'mechanize'
require 'date'
require 'json'


Now we can start using Mechanize. The first thing to do is create a new instance of the Mechanize class (agent) and use it to download the page (page).

agent = Mechanize.new
page  = agent.get("http://pitchfork.com/reviews/albums/")


Find links to reviews


Now we can use the page object to find the links to the reviews.
Mechanize provides a .links_with method that, as the name implies, finds links with the given attributes. Here we are looking for links whose href matches a regular expression.

This will return an array of links, but we only want the review links, not the pagination ones. To filter those out, we can call .reject and discard any link that looks like pagination.

review_links = page.links_with(href: %r{^/reviews/albums/\w+})
review_links = review_links.reject do |link|
  parent_classes = link.node.parent['class'].split
  parent_classes.any? { |p| %w[next-container page-number].include?(p) }
end


For illustrative purposes, and to avoid putting load on the Pitchfork server, we'll only take the links to the first four reviews.

review_links = review_links[0...4]


Processing each review


We now have a list of links and want to process each one individually; for this we'll use .map and return a hash from each iteration.

The page object has a .search method that delegates to Nokogiri's .search method. This means we can pass a CSS selector as the argument to .search and it will return an array of matching elements.

First we grab the review metadata using the CSS selector #main .review-meta .info, and then we look inside the resulting review_meta element for the individual pieces of information we need.

reviews = review_links.map do |link|
  review = link.click
  review_meta = review.search('#main .review-meta .info')
  artist = review_meta.search('h1')[0].text
  album = review_meta.search('h2')[0].text
  label, year = review_meta.search('h3')[0].text.split(';').map(&:strip)
  reviewer = review_meta.search('h4 address')[0].text
  review_date = Date.parse(review_meta.search('.pub-date')[0].text)
  score = review_meta.search('.score').text.to_f
  {
    artist: artist,
    album: album,
    label: label,
    year: year,
    reviewer: reviewer,
    review_date: review_date,
    score: score
  }
end


Now we have an array of hashes with the reviews, which we can, for example, output in JSON format.

puts JSON.pretty_generate(reviews)


Putting it all together


The full script:

require 'mechanize'
require 'date'
require 'json'
agent = Mechanize.new
page = agent.get("http://pitchfork.com/reviews/albums/")
review_links = page.links_with(href: %r{^/reviews/albums/\w+})
review_links = review_links.reject do |link|
  parent_classes = link.node.parent['class'].split
  parent_classes.any? { |p| %w[next-container page-number].include?(p) }
end
review_links = review_links[0...4]
reviews = review_links.map do |link|
  review = link.click
  review_meta = review.search('#main .review-meta .info')
  artist = review_meta.search('h1')[0].text
  album = review_meta.search('h2')[0].text
  label, year = review_meta.search('h3')[0].text.split(';').map(&:strip)
  reviewer = review_meta.search('h4 address')[0].text
  review_date = Date.parse(review_meta.search('.pub-date')[0].text)
  score = review_meta.search('.score').text.to_f
  {
    artist: artist,
    album: album,
    label: label,
    year: year,
    reviewer: reviewer,
    review_date: review_date,
    score: score
  }
end
puts JSON.pretty_generate(reviews)


Save this code in scraper.rb and run it with the command:

$ ruby scraper.rb


We will get something like this:

[
  {
    "artist": "Viet Cong",
    "album": "Viet Cong",
    "label": "Jagjaguwar",
    "year": "2015",
    "reviewer": "Ian Cohen",
    "review_date": "2015-01-22",
    "score": 8.5
  },
  {
    "artist": "Lupe Fiasco",
    "album": "Tetsuo & Youth",
    "label": "Atlantic / 1st and 15th",
    "year": "2015",
    "reviewer": "Jayson Greene",
    "review_date": "2015-01-22",
    "score": 7.2
  },
  {
    "artist": "The Go-Betweens",
    "album": "G Stands for Go-Betweens: Volume 1, 1978-1984",
    "label": "Domino",
    "year": "2015",
    "reviewer": "Douglas Wolk",
    "review_date": "2015-01-22",
    "score": 8.2
  },
  {
    "artist": "The Sidekicks",
    "album": "Runners in the Nerved World",
    "label": "Epitaph",
    "year": "2015",
    "reviewer": "Ian Cohen",
    "review_date": "2015-01-22",
    "score": 7.4
  }
]


If you want, you can redirect this data to a file.

$ ruby scraper.rb > reviews.json
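Alternatively, you can write the file from Ruby itself instead of relying on shell redirection. A small sketch (the sample data here is made up, standing in for the scraped reviews array):

```ruby
require 'json'

# A stand-in for the reviews array produced by the scraper
reviews = [
  { artist: "Viet Cong", album: "Viet Cong", score: 8.5 }
]

# Serialize the array of hashes and write it straight to disk
File.write("reviews.json", JSON.pretty_generate(reviews))
```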


Conclusion


This only scratches the surface of what Mechanize can do. In this article I haven't even touched on Mechanize's ability to fill out and submit forms. If you're interested, I recommend reading the Mechanize documentation and usage examples.

Many people in the comments on the previous post said that I should have just used Mechanize. While I agree that Mechanize is a great tool, the example in the first post on this topic was simple, and using Mechanize for it would, it seems to me, have been overkill.

That said, given Mechanize's capabilities, I'm starting to think that even for simple scraping tasks it will often be the better choice.
