
Advanced Website Parsing with Mechanize
- Translation
- Tutorial
Continuing the theme of site parsing in Ruby , I decided to translate the next article by the same author.
In a previous post, I covered the basics: an introduction to web parsing in Ruby. At the end of that post, I mentioned Mechanize, a tool used for more advanced parsing.
This article explains how to do advanced website parsing with Mechanize, which in turn builds on Nokogiri for its HTML processing.
Parsing reviews from Pitchfork
Out of the box, Mechanize provides tools for filling out form fields, following links, and respecting the robots.txt file. In this post, I'll show how to use it to fetch the latest reviews from the Pitchfork website.
Parse responsibly
You should always parse responsibly. Read the article "Is scraping legal?" from the ScraperWiki blog for a discussion of this topic.
The reviews are spread across several pages, so we can't just fetch a single page and parse it with Nokogiri. This is where Mechanize comes in, with its ability to click links and follow them to other pages.
Installation
First, install Mechanize and its dependencies via RubyGems.
$ gem install mechanize
Now we can start writing our parser. Create a file scraper.rb and add a few require statements to it. These declare the dependencies our script needs. date and json are part of the Ruby standard library, so there is no need to install them separately.

require 'mechanize'
require 'date'
require 'json'
Now we can start using Mechanize. The first thing to do is create a new instance of the Mechanize class (agent) and use it to download a page (page).

agent = Mechanize.new
page = agent.get("http://pitchfork.com/reviews/albums/")
Find links to reviews
Now we can use the page object to find the links to reviews. Mechanize provides a links_with method that, as the name implies, finds links with the specified attributes. Here we look for all links whose href matches a regular expression. This returns an array of links, but we only want links to reviews, not to pagination. To remove the unwanted ones, we call .reject and discard any links that belong to the pagination controls.

review_links = page.links_with(href: %r{^/reviews/albums/\w+})
review_links = review_links.reject do |link|
  parent_classes = link.node.parent['class'].split
  parent_classes.any? { |p| %w[next-container page-number].include?(p) }
end
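To see what the regular expression actually matches, here is a small standalone check. The hrefs below are hypothetical examples of the kinds of paths found on such a listing page:

```ruby
pattern = %r{^/reviews/albums/\w+}

# Hypothetical hrefs of the kind found on the listing page
hrefs = [
  "/reviews/albums/20105-viet-cong/", # an individual review
  "/reviews/albums/",                 # the listing page itself: no \w after the slash
  "/reviews/albums/?page=2",          # pagination: '?' is not a word character
  "/news/57000-some-item/"            # a different section entirely
]

# Only paths with a word character right after /reviews/albums/ survive
matching = hrefs.select { |href| href.match?(pattern) }
```

The `\w+` after the directory is what separates individual review pages from the bare listing URL and its query-string pagination.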
For illustration, and so as not to overload the Pitchfork server, we will only take the links to the first 4 reviews.
review_links = review_links[0...4]
Processing each review
We now have a list of links and want to process each one individually; for this we use .map and return a hash from each iteration. The page object has a .search method that is delegated to Nokogiri's .search method. This means we can pass a CSS selector as the argument to .search and it will return an array of matching elements. First we grab the review metadata using the CSS selector #main .review-meta .info, and then we look inside the review_meta element for the individual pieces of information we need.

reviews = review_links.map do |link|
  review = link.click
  review_meta = review.search('#main .review-meta .info')

  artist = review_meta.search('h1')[0].text
  album = review_meta.search('h2')[0].text
  label, year = review_meta.search('h3')[0].text.split(';').map(&:strip)
  reviewer = review_meta.search('h4 address')[0].text
  review_date = Date.parse(review_meta.search('.pub-date')[0].text)
  score = review_meta.search('.score').text.to_f

  {
    artist: artist,
    album: album,
    label: label,
    year: year,
    reviewer: reviewer,
    review_date: review_date,
    score: score
  }
end
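A few of these fields need a little massaging on the way out. Here is a standalone sketch of those conversions, using hypothetical strings mirroring what the selectors above might return:

```ruby
require 'date'

# Hypothetical text of the h3 element: label and year separated by ';'
label, year = "Jagjaguwar; 2015".split(';').map(&:strip)

# Date.parse handles human-readable dates like those in .pub-date
review_date = Date.parse("January 22, 2015")

# The score arrives as text and is converted with to_f
score = "8.5".to_f
```

Splitting on the semicolon and stripping whitespace yields the label and year as separate strings, while Date.parse and to_f turn the date and score text into proper Ruby values.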
Now we have an array of hashes with reviews, which we can, for example, output in JSON format.
puts JSON.pretty_generate(reviews)
Putting it all together
The full script:
require 'mechanize'
require 'date'
require 'json'

agent = Mechanize.new
page = agent.get("http://pitchfork.com/reviews/albums/")

review_links = page.links_with(href: %r{^/reviews/albums/\w+})
review_links = review_links.reject do |link|
  parent_classes = link.node.parent['class'].split
  parent_classes.any? { |p| %w[next-container page-number].include?(p) }
end
review_links = review_links[0...4]

reviews = review_links.map do |link|
  review = link.click
  review_meta = review.search('#main .review-meta .info')

  artist = review_meta.search('h1')[0].text
  album = review_meta.search('h2')[0].text
  label, year = review_meta.search('h3')[0].text.split(';').map(&:strip)
  reviewer = review_meta.search('h4 address')[0].text
  review_date = Date.parse(review_meta.search('.pub-date')[0].text)
  score = review_meta.search('.score').text.to_f

  {
    artist: artist,
    album: album,
    label: label,
    year: year,
    reviewer: reviewer,
    review_date: review_date,
    score: score
  }
end

puts JSON.pretty_generate(reviews)
Save this code in our scraper.rb file and run it with the command:

$ ruby scraper.rb
We will get something similar to this:
[
  {
    "artist": "Viet Cong",
    "album": "Viet Cong",
    "label": "Jagjaguwar",
    "year": "2015",
    "reviewer": "Ian Cohen",
    "review_date": "2015-01-22",
    "score": 8.5
  },
  {
    "artist": "Lupe Fiasco",
    "album": "Tetsuo & Youth",
    "label": "Atlantic / 1st and 15th",
    "year": "2015",
    "reviewer": "Jayson Greene",
    "review_date": "2015-01-22",
    "score": 7.2
  },
  {
    "artist": "The Go-Betweens",
    "album": "G Stands for Go-Betweens: Volume 1, 1978-1984",
    "label": "Domino",
    "year": "2015",
    "reviewer": "Douglas Wolk",
    "review_date": "2015-01-22",
    "score": 8.2
  },
  {
    "artist": "The Sidekicks",
    "album": "Runners in the Nerved World",
    "label": "Epitaph",
    "year": "2015",
    "reviewer": "Ian Cohen",
    "review_date": "2015-01-22",
    "score": 7.4
  }
]
If you want, you can redirect this data to a file.
$ ruby scraper.rb > reviews.json
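Alternatively, the script itself could write the JSON to a file instead of relying on shell redirection. A sketch, with placeholder data standing in for the scraped reviews array:

```ruby
require 'json'
require 'tmpdir'

# Placeholder data standing in for the scraped reviews
reviews = [
  { artist: "Artist A", score: 8.5 },
  { artist: "Artist B", score: 7.2 }
]

# Writing from Ruby keeps any diagnostic output (warnings, progress
# messages printed to stdout) out of the JSON file, which plain shell
# redirection would capture as well
path = File.join(Dir.tmpdir, 'reviews.json')
File.write(path, JSON.pretty_generate(reviews))
```

In the real script you would replace the placeholder array with the reviews built above and pick a path of your own.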
Conclusion
This only scratches the surface of what Mechanize can do. I did not even touch on its ability to fill out and submit forms. If you are interested, I recommend reading the Mechanize documentation and usage examples.
Many people commented on the previous post that I should have just used Mechanize. While I agree that Mechanize is a great tool, the example in the first post on this topic was simple, and using Mechanize for it seemed excessive to me.
However, given Mechanize's capabilities, I'm starting to think that even for simple parsing tasks it will often be the better choice.
All articles in the series:
- Ruby Web Parsing
- Advanced Website Parsing with Mechanize
- Using morph.io for web parsing