The Twi Journal

    For a long time I hesitated to write on Habr, if only because the project was technically unstable. Now that things are running smoothly (I sincerely hope so) and we have received some recognition in the form of a grant from Yuri Milner and Pavel Durov, I am ready to throw the project into the meat grinder.

    image

    My name is Nikita Likhachev, and I want to tell you about The Twi Journal website. It is a newspaper built on automatic analysis of Russian-language Twitter.

    Project idea


    The idea is to build a robot that can analyze the stream of the Russian-language segments of Twitter, Instagram and Foursquare, present this content in a convenient form on one site, and distribute it further by sending it to other social networks. Some people are interested in what is happening on Twitter but are reluctant to leave Vkontakte or Facebook. Others simply have little time to follow everything and want to get a sense of the agenda in ten minutes.

    Objectivity of the sample


    The project in no way claims absolute objectivity, because we took the liberty of excluding from the indexed database joke accounts and other accounts whose content carries no useful information. We also ignore mass followers (those who follow everyone indiscriminately) and people who inflate their rating with bots. The first database was assembled by hand by whitelisting top bloggers:

    If the blogger Navalny is whitelisted and passes the “not a mass follower” check, then the people he follows are automatically added to our database.

    The database now keeps growing both by hand and automatically: the robot finds new users in retweets made by accounts that are already in the database. So far nobody has accused us of presenting an incomplete picture, because an important topic is bound to be picked up by at least one user in our database.
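
    The whitelist rule described above can be sketched roughly as follows. This is an illustration, not the project's code: the `Account` fields, the `MAX_FOLLOWING_RATIO` threshold and the `is_mass_follower` heuristic are all assumptions, since the post does not say how the mass-follower check actually works.

```python
from dataclasses import dataclass, field

# Hypothetical threshold: an account that follows far more people than
# follow it back is treated as a mass follower.
MAX_FOLLOWING_RATIO = 2.0


@dataclass
class Account:
    name: str
    followers: int
    following: list = field(default_factory=list)  # names this account follows


def is_mass_follower(account):
    """Assumed heuristic, not the real check used by the project."""
    return len(account.following) > MAX_FOLLOWING_RATIO * max(account.followers, 1)


def expand_database(whitelist, accounts):
    """Index every whitelisted seed, plus everyone followed by a seed
    that passes the mass-follower check."""
    indexed = set()
    for name in whitelist:
        seed = accounts[name]
        indexed.add(name)
        if not is_mass_follower(seed):
            indexed.update(seed.following)
    return indexed
```

    Discovering new users through retweets could feed the same `indexed` set; that step is left out here.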

    Data processing


    Information


    The robot, which we call Adam, collects all indexed tweets and divides them into several types: ordinary tweets; tweets with a link to third-party resources or media; tweets with a link to well-known photo hosting sites; and tweets with a link to video hosting sites.
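
    A minimal sketch of such a classification, assuming the links have already been extracted from the tweet and expanded. The host sets and the priority order (video, then photo, then any other link) are assumptions made for the example; the video hosts in particular are not named in the post.

```python
from urllib.parse import urlparse

# Photo services the project supports; the exact domains are assumed.
PHOTO_HOSTS = {"twitpic.com", "yfrog.com", "pic.twitter.com",
               "flickr.com", "lockerz.com", "instagr.am"}
# Hypothetical list of video services.
VIDEO_HOSTS = {"youtube.com", "youtu.be", "vimeo.com"}


def host_of(url):
    """Return the lower-cased host of a URL without a leading 'www.'."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


def classify_tweet(urls):
    """Classify a tweet by its links as 'video', 'photo', 'link' or 'plain'."""
    hosts = {h for h in (host_of(u) for u in urls) if h}
    if hosts & VIDEO_HOSTS:
        return "video"
    if hosts & PHOTO_HOSTS:
        return "photo"
    if hosts:
        return "link"
    return "plain"
```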

    Thus the main page shows popular tweets and parsed links to media articles, each with its number of mentions, while photos and videos get separate sections:

    image

    We constantly try to come up with algorithms that surface as much fresh information as possible in a short time. For videos, for example, we set a limit on the upload date so that the newest ones are shown first. The robot also tracks what people say about each video on Twitter and displays these reactions as comments:

    image

    User rating


    Based on our database, we try to build an at least approximately objective rating of Russian microbloggers, dividing them into users, corporate accounts and the media. The rating combines several indicators in one formula: the average number of mentions of the user, the retweets of the user's posts, and the number of followers in relation to the number of lists to which the user has been added.

    All Twitter ratings fall into two groups: those that require users to authorize in order to take part, and those that do not. The former are considered more objective, since they have access to the user's mentions and retweets. But they also have a significant drawback: most popular bloggers will never log in to them, either out of mistrust or because they see no use in it. The second type does not have this drawback, but it is rarely objective, since it is almost always based only on the number of followers, the number of tweets and perhaps the age of the account. We tried to combine the best of both types.

    image

    Foursquare place rating


    This rating is built in real time and shows the places that are popular in the city right now. It is calculated as follows: once every 25 minutes a robot starts and lays a grid of points over predefined city boundaries (in Moscow, only the center and a couple of kilometers around it are covered so far). For each point, the Foursquare API is queried for popular places within a radius of two kilometers.
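
    The grid of points could be generated along these lines. This is a sketch under assumptions: the city is approximated by a bounding box, the function name `grid_points` and its parameters are invented for the example, and the Foursquare query for each point is left out.

```python
import math

EARTH_RADIUS_KM = 6371.0


def grid_points(south, west, north, east, step_km):
    """Return (lat, lon) points covering a bounding box with roughly
    step_km spacing between neighbours."""
    lat_step = math.degrees(step_km / EARTH_RADIUS_KM)
    points = []
    lat = south
    while lat <= north:
        # A degree of longitude shrinks with the cosine of the latitude.
        lon_step = math.degrees(
            step_km / (EARTH_RADIUS_KM * math.cos(math.radians(lat))))
        lon = west
        while lon <= east:
            points.append((lat, lon))
            lon += lon_step
        lat += lat_step
    return points
```

    On a square grid, circles of radius r leave no gaps as long as the step does not exceed r√2, so with a two-kilometer search radius a step of up to about 2.8 km would still cover the whole area.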

    image

    A bit about technology


    For now everything runs on a single server. The whole project, daemons included, is written in PHP. We use MySQL and MongoDB (the latter where write speed is critical); InnoDB's insert performance is more than enough for us, and we cache most database query results in memcached. Memcached is an ideal fit for us in general, since we work with a lot of data that can be cached without any loss. This let us bring the generation time of the main page down to 40 ms (I am afraid to predict how the site will behave under a probable Habr effect).
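
    The caching pattern described above is usually called cache-aside: look in the cache first and go to the database only on a miss. A minimal sketch follows; to keep it self-contained, an in-process `TTLCache` class stands in for memcached, and `cached_query` with its parameters is a hypothetical helper, not the project's code.

```python
import time


class TTLCache:
    """Tiny in-process stand-in for memcached with per-key expiry."""

    def __init__(self, clock=time.monotonic):
        self._store = {}
        self._clock = clock

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._store[key]  # expired entries are dropped lazily
            return None
        return value

    def set(self, key, value, ttl):
        self._store[key] = (value, self._clock() + ttl)


def cached_query(cache, key, ttl, run_query):
    """Return the cached result for key, running the query only on a miss."""
    value = cache.get(key)
    if value is None:
        value = run_query()
        cache.set(key, value, ttl)
    return value
```

    With a pattern like this, the main page can be assembled almost entirely from cached results, and the database is only touched when an entry expires.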

    Recently we started using Gearman to parallelize tasks such as processing tweets and calculating ratings, and to run background jobs such as saving pictures to Amazon S3.

    Adam checks the feed for updates every 15-180 minutes, depending on the time of day. Since items gain popularity gradually rather than immediately, it is important for us to keep tracking them for some time after publication. It is at this point that we split each tweet into its components: text, links, pictures and videos. Shortened links are expanded, and the content behind them is cleaned up the way the Reader feature in Safari does it (in the manner of Readability).

    When processing images, we support the photo hosting services twitpic, yfrog, pic.twitter.com, flickr, lockerz and instagr.am. For each of them we wrote a simple API handler that finds the picture preview, the author and the caption. For some photo hosting services we had to rely on undocumented features. Fortunately, programmers quite often think alike, especially when it comes to naming methods and their parameters.
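
    One way to organize such per-service handlers is a registry keyed by host name. The sketch below is an assumption about structure only: the `handler` decorator and `resolve_photo` are hypothetical names, and the handler body merely extracts the picture ID from the URL instead of calling the real service for the preview, author and caption.

```python
from urllib.parse import urlparse

HANDLERS = {}  # host name -> handler function


def handler(host):
    """Register the decorated function as the handler for one photo host."""
    def register(func):
        HANDLERS[host] = func
        return func
    return register


@handler("twitpic.com")
def twitpic(url):
    # Placeholder: a real handler would query the service here.
    return {"service": "twitpic", "id": urlparse(url).path.strip("/")}


def resolve_photo(url):
    """Dispatch a picture URL to its handler; None if the host is unknown."""
    func = HANDLERS.get(urlparse(url).netloc.lower())
    return func(url) if func else None
```

    Supporting another service then comes down to adding one more decorated function.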

    image

    Development plans


    We are currently running various experiments. For example, we plan to launch The Twi Football. In this project we want to try live coverage of matches based on an analysis of Russian-language Twitter. It will serve as a testing ground for technologies that we will later use in the main project: the server receives fans' opinions directly from Twitter through the Streaming API (new tweets for team hashtags will appear faster than on the “native” Twitter search page).

    In our free time we have fun with our symbols:

    image

    But seriously, we want to try to scale the project to other countries. We will start, of course, with the United States (we have bought the domain twijournal.com). If it works there, we will move on to other countries. There is not much time left, because the money that Durov and Milner gave us is running out rather quickly, even though we are not exactly living large.

    In our wildest dreams we build similar media on top of other social networks and then combine everything into one large content aggregator. But for now these are just dreams.

    The Twi Journal

    P.S. What if someone who wants to work with us, or a journalist from another country, is reading this post? Just in case, here is our e-mail: editors@tjournal.ru
