Breaking the web with '#!' (hash-bang)

Published on February 15, 2011


Original author: Mike Davies
  • Translation
The following is a translation of an article that draws attention to what I consider a rather acute problem of the web 2.0 era: the cleanliness of URLs.

Using Lifehacker as an example, it shows what happens when you blindly follow fashionable technologies, chase SEO, and reject the principle of "progressive enhancement".

Last Monday, Lifehacker was unavailable because of broken JavaScript. Along with the other Gawker sites, it displayed a blank home page with no content, no ads, nothing. Following a Google search result to any subpage redirected back to the main page.

Javascript dependent URLs

Gawker, like Twitter before it, has rebuilt its sites to be completely dependent on JavaScript, including the URLs of its pages. The JavaScript failed to load, resulting in missing content and broken URLs.

The new URLs look like this: lifehacker.com/#!5753509/hello-world-this-is-the-new-lifehacker. Until Monday the address was the same, only without the #!.

Fragment IDs

The # is a special URL character that tells the browser that the part of the address after it refers to an HTML element with that id (or a named anchor) on the current page. In Lifehacker's case, the current page is the home page.
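The practical consequence is easy to demonstrate: the fragment is purely client-side and is never sent to the server. A small sketch, runnable in Node.js, using the Lifehacker address from this article:

```javascript
// Parse one of the new-style Lifehacker URLs. The fragment (everything
// after #) stays in the browser; the HTTP request itself only carries
// the path, which here is just "/".
const url = new URL("http://lifehacker.com/#!5753509/hello-world-this-is-the-new-lifehacker");
console.log(url.pathname); // "/"  (all the server ever sees is the home page)
console.log(url.hash);     // "#!5753509/hello-world-this-is-the-new-lifehacker"
```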

It turns out that until Monday the site consisted of a million pages, and now it is one page with a million fragment identifiers.

Why? I don't know. Twitter answered this question when it switched to the same technique: so that Google could index tweets. True, but the same could have been achieved with the previous, correct address structure, at less cost.

Solution to the problem

The #! (hash-bang) address syntax first entered the web development arena when Google announced a way for web developers to make Ajax sites accessible to its indexing robot.

Before that, sound solutions were not widely known, and sites that used fashionable techniques like Ajax to load content saw poor indexing and ranking for relevant keywords, because the bot could not discover content hidden behind JavaScript calls.

Google spent a long time trying to solve this problem, failed, and decided to attack it from the other end: instead of trying to find this mythical content, let site owners declare it themselves. A specification was developed for this.

To Google's credit, it carefully pointed developers to the fact that they should build sites with "progressive enhancement" and not rely on JavaScript to deliver content:

If you're starting from scratch, one good approach is to build your site's structure and navigation using only HTML. Then, once you have the site's pages, links, and content in place, you can spice up the appearance and interface with Ajax. Googlebot will be happy looking at the HTML, while users with modern browsers can enjoy your Ajax bonuses.

In other words, the #! address syntax was designed specifically for sites that are broken by design with respect to this fundamental principle of web development, and it threw such sites a lifeline so that their content could still be discovered by the bot.

And now this lifeline is apparently being embraced as the One True Path of Web Development by Facebook, Twitter, and now Gawker's engineers.

Clean URLs

In Google's specification, #! addresses are referred to as "pretty URLs", and the bot transforms them into something considerably more grotesque.

Last Sunday, a Lifehacker address looked like this: lifehacker.com/5753509/hello-world-this-is-the-new-lifehacker

Not bad. The seven-digit code in the middle is the only opaque fragment, but the CMS requires it to uniquely identify the article. So it is practically a "clean" address.

Today the same article lives at lifehacker.com/#!5753509/hello-world-this-is-the-new-lifehacker. The address is now less "clean" than before, because adding #! fundamentally changes its structure:
  • The path /5753509/hello-world-this-is-the-new-lifehacker becomes simply /.
  • A new fragment identifier, !5753509/hello-world-this-is-the-new-lifehacker, is appended to the address.
Have we gained anything? No. But the mangling of the address does not stop there.

The Google specification says that this address is to be rewritten into an address with a query parameter, i.e. into: lifehacker.com/?_escaped_fragment_=5753509/hello-world-this-is-the-new-lifehacker

Thus it is this last address that actually returns the content; this address is the canonical one, and this is what the bot will index.
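The crawler-side rewrite the specification describes can be sketched roughly like this (a simplified illustration; the real spec also covers a meta-tag opt-in and other details):

```javascript
// Sketch of the rewrite in Google's Ajax-crawling scheme:
// "#!state" in a pretty URL becomes "?_escaped_fragment_=state".
function toEscapedFragment(prettyUrl) {
  const url = new URL(prettyUrl);
  if (!url.hash.startsWith("#!")) return prettyUrl; // nothing to rewrite
  // The spec requires the fragment value to be URL-escaped.
  const state = encodeURIComponent(url.hash.slice(2));
  url.hash = "";
  url.search = "_escaped_fragment_=" + state;
  return url.toString();
}

console.log(toEscapedFragment(
  "http://lifehacker.com/#!5753509/hello-world-this-is-the-new-lifehacker"
));
// → "http://lifehacker.com/?_escaped_fragment_=5753509%2Fhello-world-this-is-the-new-lifehacker"
```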

It looks as if Lifehacker, together with the rest of Gawker, has simply thrown away ten years of progress on clean URLs and gone back to a typical ASP-era site (how long before FrontPage makes its comeback?).

What is the problem?

The main problem is that the URLs no longer point to content. Every URL points to the home page. If you are lucky and JavaScript behaves, the content you asked for is then loaded into that home page.

That is more convoluted than an ordinary address, more error-prone, and more fragile.

So a client requesting the address associated with some content does not receive that content. In other words, the breakage is by design: Lifehacker intentionally prevents bots from following links to interesting content, unless, of course, they jump through the hoop Google invented.

So why do you need this hoop?

The purpose of #!

So why use #! addresses at all, if they are synthetic addresses that must be rewritten into others before they deliver any content?
Of all the reasons, the strongest is "Because it's cool!". I said the strongest, not the smartest.

Engineers will mumble something about preserving state in Ajax applications. Honestly, that is an idiotic reason to break URLs this way. The address in the href attribute can still be a properly formed link to the content. Since you are using JavaScript anyway, you can mangle it later, inside the click handler, adding the #! wherever you like. You get your state preservation without needlessly hiding the site from bots and non-JavaScript clients.
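A sketch of what that could look like. The `data-ajax` attribute and the URL shapes are illustrative assumptions; the pure helper shows the address rewriting on its own, and the browser wiring is guarded so the helper runs anywhere:

```javascript
// Keep real URLs in href; only rewrite to #! form inside the click
// handler, so bots and non-JavaScript clients still get working links.

// Pure helper: turn a real article path into the equivalent #! address.
function toHashBang(href) {
  const url = new URL(href, "http://lifehacker.com/");
  return url.origin + "/#!" + url.pathname.replace(/^\//, "");
}

// Browser-only wiring (skipped outside a DOM environment):
if (typeof document !== "undefined") {
  document.addEventListener("click", (event) => {
    const link = event.target.closest("a[data-ajax]");
    if (!link) return;
    event.preventDefault();                               // stay on this page
    location.hash = "!" + new URL(link.href).pathname.slice(1);
    // ...then fetch and render the article content via Ajax.
  });
}
```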

Deny all bots (except Googlebot)

All non-browser agents (spiders, aggregators, indexing crawlers) that correctly implement HTTP/1.1 and the URL specification (RFC 2396) are locked out; Googlebot is, of course, the exception.

Therefore, the following consequences should be considered:
  1. Caching no longer works: intermediate servers have no canonical representation of the content, so they cache nothing. Lifehacker opens more slowly, and Gawker pays for the extra traffic.
  2. Crawlers that comply with HTTP/1.1 and RFC 2396 see nothing but an empty home page, so any application or service built on such crawlers breaks accordingly.
  3. The usefulness of microformats is sharply reduced: only browsers and Google's crawler can see the microformat data.
  4. Facebook Like widgets, which use page URLs as identifiers, need extra work before a page can be Liked (by default the home page is the only page a direct URL points to, and every "crooked" #! address is understood as yet another link to the home page).
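The Like-widget problem in particular follows directly from URL semantics: any consumer that ignores fragments sees every article address as the same page. A tiny illustration (the function name is mine, for illustration only):

```javascript
// Strip the fragment, as any fragment-ignoring consumer effectively does.
// Every #! article URL then collapses to the bare home page URL.
function canonicalForFragmentIgnorers(href) {
  const url = new URL(href);
  url.hash = "";
  return url.toString();
}

console.log(canonicalForFragmentIgnorers(
  "http://lifehacker.com/#!5753509/hello-world-this-is-the-new-lifehacker"
));
// → "http://lifehacker.com/"
```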

Complete JavaScript dependency

If content cannot be retrieved by its URL, the site is broken. Gawker took this link-breaking step deliberately, leaving the availability of its sites at the mercy of every kind of JavaScript error.
  • A failed JavaScript download made all Gawker sites unavailable for five hours last Monday (02/07/2011).
  • A trailing comma at the end of an object or array literal throws an error in Internet Explorer.
  • A stray console.log() left in the code crashes the site for any user not armed with developer tools.
  • Ad inserts turn out to be buggy all the time. An error in an ad unit, and there is no site. And experienced web developers know that the worst code lives precisely in ad banners.
Such fragility, with no real justification and no benefit that outweighs all these drawbacks. There are far better approaches than the one Lifehacker used; even HTML5 and its History API would be a better solution.
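For comparison, a minimal sketch of the History API alternative. The `#content` container, the `article-link` class, and same-origin article URLs are assumptions for illustration; the key point is that href keeps the real, cacheable URL, and pushState keeps the address bar in sync without a reload:

```javascript
// History API sketch: real URLs everywhere, Ajax only as an enhancement.
// In browsers without pushState (or without JavaScript), links simply
// navigate normally, so nothing breaks.
if (typeof document !== "undefined" && window.history && history.pushState) {
  document.addEventListener("click", async (event) => {
    const link = event.target.closest("a.article-link");
    if (!link) return;
    event.preventDefault();
    const response = await fetch(link.href);          // real, cacheable URL
    document.querySelector("#content").innerHTML = await response.text();
    history.pushState(null, "", link.href);           // clean URL in the bar
  });
  // Restore content when the user navigates back or forward.
  window.addEventListener("popstate", () => location.reload());
}
```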

An architectural nightmare

Gawker/Lifehacker violated the principle of "progressive enhancement" and paid for it immediately, their sites falling over on launch day. Every JavaScript slip-up will cause an outage, directly affecting Gawker's revenue and its audience's trust.

Further reading