inossidabile March 20, 2012 at 02:51

Indexing AJAX Sites

With the development of Joosy, AJAX suddenly, though predictably, flooded every project we take on. The paradigm has proven extremely successful in every respect but one, the classic: "AJAX? Indexing? Pff...". As long as we are building online banking, that suits us perfectly. But how do you avoid giving up this exquisite pleasure on public web resources?

Here is how: Google AJAX Crawling is a Google standard that makes the crawler request a different, magic address in place of AJAX addresses formed in a special way (#!). At that address Google expects an HTML dump of the page, which it happily digests. Kind people have already written an article about how it works. What remains for us is to learn how to produce this dump efficiently, and to do so without touching the code of the application itself.
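To make the address substitution concrete, here is a minimal Ruby sketch of how a "#!" address maps to the address the crawler actually requests. The helper names crawler_url and escape_fragment are hypothetical, and the set of escaped characters is our reading of the AJAX Crawling specification, so treat it as an assumption and not as Hashbang's own code.

```ruby
# Sketch of the "#!" -> "_escaped_fragment_" mapping performed by crawlers.
# Both helper names are hypothetical and exist only for this illustration.

# Percent-escape the characters we assume the scheme requires:
# control characters and space, '#', '%', '&', '+', DEL and non-ASCII bytes.
def escape_fragment(fragment)
  fragment.gsub(/[\x00-\x20#%&+\x7F]|[^\x00-\x7F]/) do |ch|
    ch.bytes.map { |byte| format("%%%02X", byte) }.join
  end
end

# Turn a "pretty" AJAX URL into the URL a crawler would request.
def crawler_url(pretty_url)
  base, fragment = pretty_url.split("#!", 2)
  return pretty_url if fragment.nil? # no hashbang, nothing to rewrite

  # Append to an existing query string if there is one.
  separator = base.include?("?") ? "&" : "?"
  "#{base}#{separator}_escaped_fragment_=#{escape_fragment(fragment)}"
end

puts crawler_url("http://example.com/#!/posts/1")
```

The server's job is then to answer the rewritten address with the HTML that a browser would have built for the original one.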

Hashbang is a small Ruby proxy server built on the Rack protocol. This means that any web server that works with Ruby and/or Rails can run it. And for those who use Rails itself, we have prepared a couple of extras. But first things first.

How it works

On initialization, Hashbang creates a WebKit browser instance internally. When a request arrives with a URL specified, it opens that address, waits for a special JavaScript event, and returns the HTML as it was at the moment the event fired.

This means that all you need to change in the current application is to call

Sunscraper.finish()

when the page built by JavaScript can be considered ready.

In production mode, it will look like this:

[diagram not preserved]

About the internal browser and performance

We experimented a lot with possible implementations of a "headless" browser. We tried Watir and various existing Qt bindings. Nothing good came of it. In desperation, we wrote our own binding to Qt-WebKit, Sunscraper, which can do only one thing: return the HTML once an event fires. This marvel is written in a mix of C and C++ and connects to Ruby through FFI. This means that Sunscraper should work not only on MRI but also on JRuby and Rubinius. Unfortunately, it does not work with Rubinius yet because of bugs in its implementation of that same FFI.

Since all we launch is the WebKit engine itself, performance is close to the maximum achievable for this task. Real figures from live servers are still being collected.

Before installation

Sunscraper uses Qt, so you will need it to install the Hashbang gem. On a Mac we recommend Homebrew: brew install qt. On Linux, install any reasonably recent package.

Development mode for Rails users

If you are not developing on Rails, feel free to skip to the next section, which covers putting Hashbang into production.

To install Hashbang in a Rails project, follow these steps:

Add gem hashbang to the Gemfile
Generate the base application with rails g hashbang

Now the hashbang folder inside your Rails application contains a mini-application of Hashbang itself. This means you can skip the first paragraph of "Setup and Launch".

In the development environment, Hashbang inserts its middleware into the Rails middleware stack, where it intercepts every request containing the magic _escaped_fragment_ parameter and processes it automatically. There is one catch: WEBrick runs in a single thread, and since Hashbang requests the application "itself", this leads to a deadlock. So, to test the application locally, start it with rake hashbang:rails. This command runs your application under the Unicorn server with two workers. Once it is up, open localhost:3000/?_escaped_fragment_= and check the HTML. Just remember that the AJAX application itself must call Sunscraper.finish().
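The interception described above can be sketched as a plain Rack middleware. This is not Hashbang's actual source: the class name EscapedFragmentInterceptor and the renderer callable are hypothetical, and the renderer stands in for whatever opens the page in WebKit and returns its HTML. The sketch uses only the standard library and, for brevity, ignores any other query parameters.

```ruby
require "uri"

# Hypothetical middleware illustrating the idea; not Hashbang's own code.
class EscapedFragmentInterceptor
  # app      - the next Rack application in the stack
  # renderer - any callable taking a URL and returning an HTML string
  #            (an assumption standing in for the headless browser)
  def initialize(app, renderer)
    @app = app
    @renderer = renderer
  end

  def call(env)
    params = URI.decode_www_form(env["QUERY_STRING"].to_s).to_h

    # Ordinary requests go straight through to the application.
    return @app.call(env) unless params.key?("_escaped_fragment_")

    # Rebuild the "pretty" address a browser would have opened.
    fragment = params["_escaped_fragment_"]
    url = "#{env['rack.url_scheme']}://#{env['HTTP_HOST']}#{env['PATH_INFO']}"
    url += "#!#{fragment}" unless fragment.empty?

    [200, { "content-type" => "text/html" }, [@renderer.call(url)]]
  end
end

# Usage with stand-in objects:
app = ->(_env) { [200, { "content-type" => "text/plain" }, ["regular response"]] }
renderer = ->(url) { "<html><!-- rendered #{url} --></html>" }
stack = EscapedFragmentInterceptor.new(app, renderer)
```

The deadlock mentioned above follows directly from this shape: while the middleware waits for the renderer, the renderer is requesting a page from the same server, so a single-threaded server can never answer it.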

To simulate Hashbang running in production mode, where it works through /?url=http://..., use the command rake hashbang:standalone.

Setup and Launch

If you do not use Rails, the base application can be taken from a special repository. All you have to do is place it somewhere, make sure the bundler gem is installed, and run bundle install in the root of the application.

Inside the generated or copied Hashbang application lies the config.rb file, which you need to edit for it to work properly. It has only two directives:

url: a regular expression that the requested URL must match
timeout: the time in milliseconds that Hashbang will wait for the Sunscraper.finish() event
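To illustrate what these two directives mean in practice, here is a rough Ruby sketch of a URL check combined with a timeout. It is not the syntax of config.rb and not Hashbang's implementation: ALLOWED_URL, TIMEOUT_MS, render_if_allowed and the renderer callable are all hypothetical names chosen for this example.

```ruby
require "timeout"

# Hypothetical values standing in for the two directives.
ALLOWED_URL = %r{\Ahttps?://example\.com/} # only render our own pages
TIMEOUT_MS  = 5_000                        # how long to wait for the page

# Returns the rendered HTML, or nil when the URL is not allowed
# or the page does not report readiness in time.
# renderer is any callable taking a URL and returning HTML (an assumption).
def render_if_allowed(url, renderer, timeout_ms: TIMEOUT_MS)
  return nil unless ALLOWED_URL.match?(url)

  Timeout.timeout(timeout_ms / 1000.0) { renderer.call(url) }
rescue Timeout::Error
  nil
end
```

The regular expression keeps the service from becoming an open proxy for arbitrary sites, and the timeout keeps a page that never calls Sunscraper.finish() from holding a browser instance forever.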

Suppose we run the service with the Passenger module, which serves Rack applications under Nginx. In that case, for things to work correctly, we need to achieve the following:

The Hashbang application should run on a special internal address
All requests containing _escaped_fragment_ should be forwarded to this application, with the URI-escaped absolute URL passed in the url=... parameter
We need to limit the number of parallel requests to this application: it is unlikely that we will be indexed in a hundred threads, and WebKit is hungry for resources
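The second requirement, passing the escaped absolute URL in the url parameter, can be sketched in Ruby as follows. The helper name internal_render_path is hypothetical, and whether the forwarded address carries the "#!" form or the _escaped_fragment_ form is an assumption we leave open; the sketch only shows the escaping itself.

```ruby
require "uri"

# Build the path of the internal request to the Hashbang application.
# internal_render_path is a hypothetical helper for illustration only.
def internal_render_path(absolute_url)
  "/?" + URI.encode_www_form(url: absolute_url)
end

path = internal_render_path("http://example.com/?_escaped_fragment_=/posts/1")
puts path

# The receiving side can recover the original address unchanged:
query = path.sub("/?", "")
puts URI.decode_www_form(query).to_h["url"]
```

Escaping matters here because the forwarded address contains its own "?" and "=" characters, which would otherwise be parsed as part of the internal request's query string.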

Here is a configuration file you can use: https://gist.github.com/2127685 . It is an example of using Hashbang in a Rails application.

The sad part

Unfortunately, this standard has not yet reached our own home turf, Yandex. Google supports it, and Bing supports it (and therefore so does Yahoo). Even the Facebook crawler supports it. Yandex does not. This means that Hashbang will not help your indexing in the domestic segment of the Internet. At least for now. We send fierce rays of goodwill to the Yandex team and hope they soon turn their attention to such a rapidly developing technological segment of the Web :).

Finally

Although we are already using Hashbang in production, we have not yet tested it on every possible configuration. If you run into any problems building or configuring it, we are always happy to receive new issues on GitHub.

Thanks :).
