Twitter: 1 billion requests per day and a new search engine

    The load on Twitter's servers has grown to 1,000 TPS (tweets per second) and 12,000 QPS (queries per second), which is more than 1 billion queries per day. The current infrastructure still copes, but to build in headroom for several years ahead, the company decided to replace the search engine backend. "If we have done our job well, you should not have noticed anything in the last week," the Twitter developer blog reports.
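
    The "more than 1 billion per day" figure follows directly from the per-second rates. A quick check, assuming the quoted rates are sustained averages over a full 24-hour day (the article does not say whether they are averages or peaks):

```python
# Back-of-the-envelope check of the daily volumes implied by the quoted rates.
# Assumption: the rates are sustained averages over a full day.
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

tweets_per_second = 1_000
queries_per_second = 12_000

tweets_per_day = tweets_per_second * SECONDS_PER_DAY
queries_per_day = queries_per_second * SECONDS_PER_DAY

print(f"{tweets_per_day:,} tweets per day")    # 86,400,000
print(f"{queries_per_day:,} queries per day")  # 1,036,800,000
```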

    Until recently, the Twitter search backend was based on the old SQL-based Summize system. Twitter acquired Summize in July 2008 for exactly this purpose and took on five of its six developers. The need to upgrade Twitter became clear immediately after the presentation of the iPhone 3G, and that is when the cooperation with Summize began. Now it is time to upgrade again.

    About six months ago, the company decided to develop a new, modern search architecture based on an efficient inverted index instead of a relational database. Since Twitter loves open source, Apache Lucene, a search library written in Java, was chosen as the starting point for the solution.
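
    For readers unfamiliar with the term, an inverted index maps each term to the list of documents that contain it (its posting list), so a query only touches the lists for its own terms instead of scanning every row. The sketch below is a toy illustration in Python, not Lucene's or Twitter's implementation; the class `InvertedIndex`, the helper `tokenize`, and the whitespace tokenization are all assumptions made for the example.

```python
from collections import defaultdict


def tokenize(text):
    """Hypothetical tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


class InvertedIndex:
    """Toy in-memory inverted index: term -> posting list of document ids."""

    def __init__(self):
        self.postings = defaultdict(list)  # ids are appended in ascending order
        self.docs = []

    def add(self, text):
        """Index one document and return its id."""
        doc_id = len(self.docs)
        self.docs.append(text)
        for term in set(tokenize(text)):  # one posting per term per document
            self.postings[term].append(doc_id)
        return doc_id

    def search(self, query):
        """Return ids of documents containing every query term (AND query)."""
        terms = tokenize(query)
        if not terms:
            return []
        result = None
        for term in terms:
            ids = set(self.postings.get(term, ()))
            result = ids if result is None else result & ids
        return sorted(result)


index = InvertedIndex()
index.add("lucene is a search library")
index.add("twitter search is real time")
index.add("twitter loves open source")
print(index.search("twitter search"))  # [1]
```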

    The requirements for the new search engine were good scalability and maximum indexing speed. The goal was that no more than 10 seconds should pass between the moment a tweet is published and the moment it becomes available to full-text search. Since the indexer is only one part of the pipeline along this path, it has to work as quickly as possible (in under 1 second).

    To meet these goals, the developers had to rework Lucene somewhat, because out of the box it is not well suited to real-time search. The main in-memory data structures were rewritten, especially the posting lists, while support for the standard Lucene API was preserved, so almost none of the search part of the library had to be redone. Here are the key benefits of the modifications:

    * significantly improved garbage collection performance
    * lock-free data structures and algorithms
    * posting lists that can be traversed in reverse order
    * efficient early query termination
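
    The last two points work together: if document ids grow with time, walking posting lists from the end yields the newest matches first, and the query can stop as soon as it has enough results. The following is a minimal sketch of that idea, not Twitter's actual code. It assumes posting lists are sorted ascending with newer tweets having larger ids, and the function name `newest_matches` is hypothetical.

```python
def newest_matches(postings_a, postings_b, limit):
    """Intersect two ascending posting lists, newest ids first.

    Both lists are walked from the end, so the largest (newest) ids are
    seen first, and the loop stops once `limit` hits have been collected.
    """
    hits = []
    i, j = len(postings_a) - 1, len(postings_b) - 1
    while i >= 0 and j >= 0 and len(hits) < limit:
        a, b = postings_a[i], postings_b[j]
        if a == b:       # id present in both lists: a match
            hits.append(a)
            i -= 1
            j -= 1
        elif a > b:      # a cannot appear in the rest of postings_b
            i -= 1
        else:            # b cannot appear in the rest of postings_a
            j -= 1
    return hits


term_a = [1, 3, 5, 8, 9]
term_b = [2, 3, 8, 9, 10]
print(newest_matches(term_a, term_b, limit=2))   # [9, 8]
print(newest_matches(term_a, term_b, limit=10))  # [9, 8, 3]
```

    With `limit=2` the loop never looks at the older entries, which is the point of early termination for a search that shows the most recent tweets first.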

    According to the developers themselves, some of the techniques they applied may be interesting and useful to other programmers (not only in the search field), so a more detailed discussion of the topic may follow in the future.

    In any case, all modifications made to Lucene will be contributed back to Apache, and some are already included in the main Lucene code and in its new branch for real-time search.

    As a result of the search infrastructure upgrade, the load on the backend has dropped significantly (it now uses only 5% of resources), which leaves a good reserve for the future. The new indexer can index about 50 times more tweets per second than are published today, and the new search engine has been running stably, without any complaints.

    One long-standing annoyance of Twitter search has been the inability to search tweets older than a few days. Twitter attributed this to "lack of space." To get around the limit, users have to turn to third-party search engines that index tweets independently, for example Topsy.

    On January 14, 2010, Danny Sullivan checked the search results for the word [today] and found that the oldest tweet had been posted 7 days earlier.

    A similar test in mid-September showed that the depth of the index had shrunk to 4 days.

    With the introduction of the new search, Twitter announced "an increase in the index by half without any consequences on the speed of search queries." Apparently, this means a return to roughly the same seven-day limit.
