Another article about indexing ajax sites by search engines
Building a site on AJAX is all the rage these days: from the user's point of view it is fast and convenient, but search robots can have problems with such sites.
The most correct solution is to use regular links but load the content via ajax, keeping the ability to get the content from a regular link for users with JavaScript disabled (you never know) and for robots. That is, you develop the site the old-fashioned way, with regular links, layouts and views, and then enhance the links with javascript: hook ajax content loading onto them, using the URL from the href attribute of the a tag. In a very simplified form it looks something like this:
// Intercept clicks on ajax-enabled links and load the content without a full page reload
$(document).on('click', 'a.ajaxlinks', function (e) {
    e.stopPropagation();
    e.preventDefault();

    var pageurl = $(this).attr('href');

    $.ajax({
        url: pageurl,
        data: {
            ajax: 1 // tell the backend to return the content without the layout
        },
        success: function (resp) {
            $('#content').html(resp);
        }
    });
});
Here we simply load the same pages via ajax, and on the backend we check for the special ajax GET parameter: if it is present, we return the page without the layout (roughly speaking).
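As a minimal sketch of that backend check, assuming a Node.js/Express application (the article does not prescribe a backend; renderContent and renderLayout here are hypothetical helpers standing in for whatever templating you actually use):

var express = require('express');
var app = express();

// Hypothetical helpers: renderContent() builds the inner html of a page,
// renderLayout() wraps that html in the site-wide layout.
app.get('/cats/:slug', function (req, res) {
    var content = renderContent('cat', { slug: req.params.slug });
    if (req.query.ajax) {
        res.send(content);               // ?ajax=1 -> bare content for the js loader
    } else {
        res.send(renderLayout(content)); // regular request -> full page with layout
    }
});

app.listen(3000);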
But the architecture is not always designed for this; moreover, sites on angularjs and the like work a bit differently: they substitute content into a loaded html template with variables. For such sites (by now you can call them applications) search engines came up with the HashBang scheme. In short, it is a link like example.com/#!/cats/grumpy-cat: when the search robot sees #!, it requests example.com/?_escaped_fragment_=/cats/grumpy-cat from the server, i.e. it replaces "#!" with "?_escaped_fragment_=", and the server should return to the search engine generated html identical to what the user would see at the original link. But if the application uses the HTML5 History API and # links are not used, you need to add a special meta tag to the head section:
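<meta name="fragment" content="!">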
When it sees this tag, the search robot will understand that the site runs on ajax and will redirect all requests for the site's content to example.com/?_escaped_fragment_=/cats/grumpy-cat instead of example.com/cats/grumpy-cat.
You can handle these requests in the framework you use, but in a complex angularjs application that means a lot of redundant code.
The way we will go is described in the following diagram from Google:
To do this, we will catch all requests containing _escaped_fragment_ and hand them to phantomjs running on the server, which will use its headless webkit to render an html snapshot of the requested page and return it to the crawler. Regular users keep working with the site directly.
First, install the necessary software, if not already installed, something like this:
yum install screen
npm install phantomjs
ln -s /usr/local/node_modules/phantomjs/lib/phantom/bin/phantomjs /usr/local/bin/phantomjs
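If everything installed correctly, the phantomjs binary should now be on the PATH; a quick check:
phantomjs --version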
Next, we'll write (or take a ready-made) server-side js script (server.js) that will produce the html snapshots:
var system = require('system');

if (system.args.length < 3) {
    console.log("Missing arguments.");
    phantom.exit();
}

var server = require('webserver').create();
var port = parseInt(system.args[1]);
var urlPrefix = system.args[2];

// Parse the query string of a request URL into an object
var parse_qs = function (s) {
    var queryString = {};
    var a = document.createElement("a");
    a.href = s;
    a.search.replace(
        new RegExp("([^?=&]+)(=([^&]*))?", "g"),
        function ($0, $1, $2, $3) { queryString[$1] = $3; }
    );
    return queryString;
};

// Open the page in headless webkit and pass the rendered html to the callback
var renderHtml = function (url, cb) {
    var page = require('webpage').create();
    page.settings.loadImages = false;
    page.settings.localToRemoteUrlAccessEnabled = true;
    page.onCallback = function () {
        cb(page.content);
        page.close();
    };
    // page.onConsoleMessage = function(msg, lineNum, sourceId) {
    //     console.log('CONSOLE: ' + msg + ' (from line #' + lineNum + ' in "' + sourceId + '")');
    // };
    page.onInitialized = function () {
        page.evaluate(function () {
            // Give the application 10 seconds to render, then signal that the snapshot is ready
            setTimeout(function () {
                window.callPhantom();
            }, 10000);
        });
    };
    page.open(url);
};

server.listen(port, function (request, response) {
    var route = parse_qs(request.url)._escaped_fragment_;
    // var url = urlPrefix
    //     + '/' + request.url.slice(1, request.url.indexOf('?'))
    //     + (route ? decodeURIComponent(route) : '');
    var url = urlPrefix + '/' + request.url;
    renderHtml(url, function (html) {
        response.statusCode = 200;
        response.write(html);
        response.close();
    });
});

console.log('Listening on ' + port + '...');
console.log('Press Ctrl+C to stop.');
And run it inside screen using phantomjs:
screen -d -m phantomjs --disk-cache=no server.js 8888 http://example.com
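Before wiring it into the web server, you can hit the daemon directly and make sure it returns rendered html (a sanity check assuming the daemon was started with the parameters above; the path is just the example one):
curl 'http://127.0.0.1:8888/cats/grumpy-cat'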
Next, configure nginx (apache works similarly) to proxy crawler requests to the running daemon:
server {
    ...
    if ($args ~ "_escaped_fragment_=(.+)") {
        set $real_url $1;
        rewrite ^ /crawler$real_url;
    }
    location ^~ /crawler {
        proxy_pass http://127.0.0.1:8888/$real_url;
    }
    ...
}
Now, when example.com/cats/grumpy-cat is requested, search bots will go to example.com/?_escaped_fragment_=cats/grumpy-cat, that request will be intercepted by nginx and passed on to phantomjs, which will render the html on the server with a real browser engine and return it to the robot.
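You can emulate a crawler request yourself and check the whole chain through nginx (example.com and the path are placeholders for your own site):
curl 'http://example.com/?_escaped_fragment_=cats/grumpy-cat'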
Besides the Google, Yandex and Bing search robots, this also works when sharing links on Facebook.
Links:
https://developers.google.com/webmasters/ajax-crawling/docs/getting-started
https://help.yandex.ru/webmaster/robot-workings/ajax-indexing.xml
UPD (2.12.16):
Configs for apache2 from kot-ezhva:
In case html5mode is used:
RewriteEngine on
RewriteCond %{QUERY_STRING} (.*)_escaped_fragment_=
RewriteRule ^(.*) http://127.0.0.1:8888/$1 [P]
ProxyPassReverse / http://127.0.0.1:8888/
If the urls use hashes (#!):
RewriteEngine on
RewriteCond %{QUERY_STRING} _escaped_fragment_=(.*)
RewriteRule ^(.*) http://127.0.0.1:8888/$1 [P]
ProxyPassReverse / http://127.0.0.1:8888/
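Note that the [P] flag relies on mod_proxy and mod_proxy_http being enabled alongside mod_rewrite; on Debian-like systems (an assumption, adjust to your distribution) that is roughly:
a2enmod rewrite proxy proxy_http
service apache2 restart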