tonymadbrain March 6, 2015 at 17:40

Ruby Web Parsing

This is a translation of the article “Web Scraping with Ruby,” which I found useful while learning the Ruby programming language. I am interested in parsing for my own projects, and it seems to me that it is not only a useful skill but also a good way to learn a language.

Parsing the web in Ruby is easier than you might think. Let's start with a simple example: I want to get a neatly formatted JSON array of objects representing the list of films shown on the website of a local independent cinema.

First, we need a way to download the html page that contains all the film listings. Ruby has a built-in HTTP client, Net::HTTP, as well as a wrapper on top of it, open-uri.

Open-uri

Open-uri is fine for basic tasks like the one in this tutorial, but it has some problems, so you may want to find another HTTP client for a production environment.

So, the first thing to do is download html from a remote server.

require 'open-uri'
url = 'http://www.cubecinema.com/programme'
# URI.open comes from open-uri; a bare open(url) no longer fetches URLs on Ruby 3.0+
html = URI.open(url)
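Since open-uri may not be what you want in production, here is a minimal sketch of the same download using only the built-in Net::HTTP. The method name fetch_html and the redirect_limit parameter are my own illustrative assumptions and do not come from the original article.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: download a page with Net::HTTP, following a limited
# number of redirects and raising on any other non-success status.
def fetch_html(url, redirect_limit: 3)
  uri = URI.parse(url)
  unless uri.is_a?(URI::HTTP) # URI::HTTPS is a subclass of URI::HTTP
    raise ArgumentError, "not an http(s) URL: #{url}"
  end
  raise ArgumentError, 'too many redirects' if redirect_limit.negative?

  response = Net::HTTP.get_response(uri)
  case response
  when Net::HTTPSuccess
    response.body
  when Net::HTTPRedirection
    # Resolve a relative Location header against the current URL.
    next_url = URI.join(url, response['location']).to_s
    fetch_html(next_url, redirect_limit: redirect_limit - 1)
  else
    raise "request failed: #{response.code} #{response.message}"
  end
end

# Usage (performs a real network request):
#   html = fetch_html('http://www.cubecinema.com/programme')
```

Unlike open-uri, this returns the body as a plain String, which Nokogiri accepts just as well.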

OK, now we have the page we want to parse, and we need to get some information out of it. The best tool for this is Nokogiri. We create a new Nokogiri document from the html we just downloaded.

require 'nokogiri'
doc = Nokogiri::HTML(html)

Nokogiri is cool because it allows you to access html using CSS selectors, which, in my opinion, is much more convenient than using xpath.

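OK, now we have a document from which we can extract the list of films. Each list item has roughly the following html structure, reduced here to the elements and attributes that the scraper relies on (element names other than h1, a and span are a best guess).

<div class="showing" id="event_7557">
  <span class="tags"><a>comedy</a> <a>dvd</a> <a>film</a></span>
  <h1>
    <a>
      <span>Comedy Combo presents</span>
      Live stand up + Monty Python and the Holy Grail
      <span>Rare screening from 35mm!</span>
    </a>
  </h1>
  <p class="start_and_pricing">
    Sat 20 December | 19:30
    <br>
  </p>
  <p class="copy">Brave (and not so brave) Knights of the Round Table! Gain shelter from the vicious chicken of Bristol as we gather to bear witness to this 100% factually accurate retelling ... [more...]</p>
</div>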

HTML processing

Each film has the css class .showing, so we can select all the showings and process them in turn.

showings = []
doc.css('.showing').each do |showing|
  showing_id = showing['id'].split('_').last.to_i
  tags = showing.css('.tags a').map { |tag| tag.text.strip }
  title_el = showing.at_css('h1 a')
  title_el.children.each { |c| c.remove if c.name == 'span' }
  title = title_el.text.strip
  dates = showing.at_css('.start_and_pricing').inner_html.strip
  dates = dates.split('<br>').map(&:strip).map { |d| DateTime.parse(d) }
  description = showing.at_css('.copy').text.gsub('[more...]', '').strip
  showings.push(
    id: showing_id,
    title: title,
    tags: tags,
    dates: dates,
    description: description
  )
end

Let's take a look at the parts of the code above.

showing_id = showing['id'].split('_').last.to_i

First, we take the unique identifier, which is helpfully set as the id attribute in the markup. Square brackets give access to an element's attributes, so for the html shown above, showing['id'] should be “event_7557”. We are only interested in the numeric part, so we split the value on the underscore with .split('_'), then take the last element of the resulting array and convert it to an integer with .last.to_i.
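The split-based approach works as long as every element has a well-formed id. As a sketch of a more defensive variant, the hypothetical helper below (showing_id_from is my own name, not from the article) uses a regular expression and returns nil when there is nothing to extract.

```ruby
# Hypothetical helper: pull the numeric part out of an id such as "event_7557".
# Returns nil when the attribute is missing or does not end in "_<digits>",
# whereas split('_').last.to_i raises on nil and returns 0 for "event_x".
def showing_id_from(id_attr)
  match = id_attr.to_s.match(/_(\d+)\z/)
  match && match[1].to_i
end

puts 'event_7557'.split('_').last.to_i      # the approach from the article
puts showing_id_from('event_7557').inspect  # same value for a well-formed id
puts showing_id_from(nil).inspect           # nil instead of an exception
```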

tags = showing.css('.tags a').map { |tag| tag.text.strip }

Here we find all the tags for the film using the .css method, which returns an array-like collection of matching elements (a NodeSet). We then map over the elements, taking the text of each one and stripping the surrounding whitespace. For our html, the result will be ["comedy", "dvd", "film"].

title_el = showing.at_css('h1 a')
title_el.children.each { |c| c.remove if c.name == 'span' }
title = title_el.text.strip

The code for the title is a bit more complicated, because this html element contains additional span elements holding a prefix and a suffix. We get the title element with .at_css, which returns a single matching element. Then we iterate over its children and remove each span. Once the spans are gone, we take the title text and strip the surrounding whitespace.

dates = showing.at_css('.start_and_pricing').inner_html.strip
dates = dates.split('<br>').map(&:strip).map { |d| DateTime.parse(d) }

Next comes the code for the date and time of the showing. It is a bit more complicated, because a film can be shown on several days, and sometimes the price appears in the same element. We split the element's inner html on the <br> line breaks, strip each piece, and map the results through DateTime.parse, which gives us an array of Ruby DateTime objects.
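As a sketch of how the date step could cope with extra text such as a price, the hypothetical helper below keeps only the pieces that contain a clock time before parsing them. The time-based filter is my own assumption about how to skip non-date text. The sample string includes the year so that the result does not depend on the current date.

```ruby
require 'date'

# Hypothetical helper: split the element's inner html on <br>, keep only the
# pieces that contain a clock time such as 20:00, and parse those.
def parse_show_dates(inner_html)
  inner_html
    .split(%r{<br\s*/?>}i)
    .map(&:strip)
    .select { |piece| piece.match?(/\d{1,2}:\d{2}/) }
    .map { |piece| DateTime.parse(piece) }
end

sample = 'Mon 19 January 2015 | 20:00<br>Tue 20 January 2015 | 20:00<br>'
parse_show_dates(sample).each { |date| puts date.iso8601 }
```

When the year is missing, as in the html shown earlier, DateTime.parse fills in the current year, which is worth remembering for listings that span the new year.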

description = showing.at_css('.copy').text.gsub('[more...]', '').strip

Getting the description is fairly simple; the only extra step is removing the text [more...] with .gsub.

showings.push(
  id: showing_id,
  title: title,
  tags: tags,
  dates: dates,
  description: description
)

Now that we have all the necessary parts in variables, we can push them as a hash onto the showings array, which collects all the films.

JSON output

Now that we have processed every film and collected them in an array, we can convert the result to JSON format.

require 'json'
puts JSON.pretty_generate(showings)

This code displays the showings array encoded in JSON format; when the script is run, the output can be redirected to a file or other program for further processing.
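One detail worth knowing if another Ruby program reads the output back: DateTime values are written as strings, so they have to be parsed again on the way in. A minimal sketch, with hypothetical helper names and a single hand-written showing:

```ruby
require 'json'
require 'date'

# Hypothetical helper: serialize the showings exactly as the article does.
def showings_to_json(showings)
  JSON.pretty_generate(showings)
end

# Hypothetical helper: read the JSON back and turn date strings into DateTime.
def showings_from_json(json)
  JSON.parse(json, symbolize_names: true).map do |showing|
    showing.merge(dates: showing[:dates].map { |d| DateTime.iso8601(d) })
  end
end

showings = [
  { id: 7686,
    title: 'Harry Dean Stanton - Partly Fiction',
    dates: [DateTime.new(2015, 1, 19, 20, 0, 0)] }
]

json = showings_to_json(showings)
puts json
puts showings_from_json(json).first[:dates].first.class
```

The json library has no built-in representation for DateTime, so it falls back to calling to_s, which produces the ISO 8601 strings seen in the output.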

Putting it all together

Having collected all the parts in one place, we get the full version of our script:

require 'open-uri'
require 'nokogiri'
require 'json'
url = 'http://www.cubecinema.com/programme'
html = URI.open(url)
doc = Nokogiri::HTML(html)
showings = []
doc.css('.showing').each do |showing|
  showing_id = showing['id'].split('_').last.to_i
  tags = showing.css('.tags a').map { |tag| tag.text.strip }
  title_el = showing.at_css('h1 a')
  title_el.children.each { |c| c.remove if c.name == 'span' }
  title = title_el.text.strip
  dates = showing.at_css('.start_and_pricing').inner_html.strip
  dates = dates.split('<br>').map(&:strip).map { |d| DateTime.parse(d) }
  description = showing.at_css('.copy').text.gsub('[more...]', '').strip
  showings.push(
    id: showing_id,
    title: title,
    tags: tags,
    dates: dates,
    description: description
  )
end
puts JSON.pretty_generate(showings)

If you save this to a file, for example scraper.rb, and run it with ruby scraper.rb, you should see output in JSON format. It should look like the example below.

[
  {
    "id": 7686,
    "title": "Harry Dean Stanton - Partly Fiction",
    "tags": [
      "dcp",
      "film",
      "ttt"
    ],
    "dates": [
      "2015-01-19T20:00:00+00:00",
      "2015-01-20T20:00:00+00:00"
    ],
    "description": "A mesmerizing, impressionistic portrait of the iconic actor in his intimate moments, with film clips from some of his 250 films and his own heart-breaking renditions of American folk songs. ..."
  },
  {
    "id": 7519,
    "title": "Bang the Bore Audiovisual Spectacle: VA AA LR + Stephen Cornford + Seth Cooke",
    "tags": [
      "music"
    ],
    "dates": [
      "2015-01-21T20:00:00+00:00"
    ],
    "description": "An evening of hacked TVs, 4 screen cinematic drone and electroacoustics. VAAALR: Vasco Alves, Adam Asnan and Louie Rice create spectacles using distress flares, C02 and junk electronics. Stephen Cornford: ..."
  }
]

That's all. This is just a basic example of parsing. Parsing a site that requires you to log in first is harder. For such cases, I recommend looking at mechanize, which is built on top of Nokogiri.

Hopefully this introduction to parsing gives you ideas about data you would like to see in a more structured format, which you can extract using the methods described above.

I also plan to translate another article on parsing from the same author.
