Roman Ivanov: “Blog search is not easy”

    Roman Ivanov, head of the communications services department at Yandex, talks to Habrahabr about the particulars of blog search and about the trends visible in the Russian blogosphere.

    How did you end up at Yandex?

    Before Yandex, I worked for the JetStyle company in Yekaterinburg. I was a developer, sysadmin and manager there, and among other things took part in creating the WackoWiki wiki engine and NPJ, a blog-and-wiki hosting service that was innovative but incomprehensible to ordinary people.

    In fact, it was because of those projects that Yandex noticed me: Kolya Yaremko (co-author of WackoWiki and the main author of NPJ) and I were invited first to talk, and then to work.

    By the way, we cooperate with JetStyle regularly.

    Why did you create NPJ? Was it some kind of experiment?

    Yes, it was an experiment: an attempt to build a service based on concepts rather than on user wishes. NPJ was created by a group of people with different goals who turned out to share one interest, or rather a need, for a tool that would help a group (or groups) of people work with each other and with various texts. One goal of the project was Kolya Yaremko’s scientific work; another was to create a communication environment for role-playing games; a third was a corporate tool for organizing work with knowledge and notifications; and the last was to occupy an interesting, innovative niche at the junction of blog hosting and wiki.

    Now the project is slowly drifting without clear direction. The developers who drove its ideas are busy with their own interesting work, and the community lives its own life. The main brain behind NPJ is Kolya Yaremko, but he no longer has much time for the project.

    Has anyone tried to buy NPJ?

    The project, the website or a license? Licenses have been bought several times. Nobody has tried to buy the site or the project.

    Can you name the buyers?
    I can name two companies: Electronic City and Abak-Press.

    Your business card says “head of communications services”. Can you explain what services these are?

    These are all the services related to communication on the Web. In addition, as it happens, I also lead the development of end-user software. Among the services currently open are Yandex.Mail (and its new version), blog search (we abbreviate it as PPB), People, Yandex.Tape and Bookmarks. The programs are “Bar”, “Yandex Personal Search” and “Spam Defense”.

    How long have you been managing the department?

    One and a half years, since January 2005.

    Is it a big department?

    Apart from me, there are now four people in the department, all of them managers. The developers have a parallel department, “communication services development”, and there are many more of them. Here, by the way, developers do not report to managers; the two work together on a common cause.

    Is a new version of Bookmarks likely to come out soon? Of all the services you listed, it is perhaps the most “ancient”, in the sense that it is out of step with the times.

    We traditionally do not talk about plans, so I won’t comment on whether a new version will come out. As for antiquity, that is not entirely true. The service was one of the very first to appear, in 2000, and from the start it had a social side, public bookmarks and so on; the only thing it lacked was tags.
    In 2004 it was completely redone, becoming the personal part of Yandex.Catalog and losing all its social functions.

    When will Yandex.Mail switch to the AJAX interface that is available at mail.ya.ru?

    Any user can already enable this interface in the settings as the default.
    We do not plan to force the new interface on everyone in the near future; the transition will be gradual.

    Why?

    Because you cannot forcibly replace a user interface with something completely new. You can tell users about new things and recommend them, but not force them.
    Hardly any Windows XP user would be happy to turn on the computer tomorrow and find Vista there instead of XP, without any warning.

    What is the size of the Russian-speaking blogosphere now, at the end of July 2006? How many new blogs in Russian appear every month? Do you have such statistics?

    The size of the blogosphere is difficult to estimate accurately. We know of almost 900 thousand blogs, but the systems we began to index some time after they launched, rather than from the very beginning, still hold a noticeable number of inactive blogs that are no longer updated. These systems include Liveinternet, “Lady” and Diary.Ru.

    There are also several blog hosts that still do not have RSS, such as darkdiary and gothicjournal.

    So it is safe to say there are more than a million, but how many more is not very clear.

    How fast are LiveInternet and Diary growing? By your estimate, when will they push Livejournal off the top of the chart of popular blog hosts?

    In June we discovered 85 thousand new blogs, of which 21 thousand are on Livejournal, 25.5 thousand on Liveinternet, 16.5 thousand on Blogs@Mail.Ru, 6 thousand on Diary.Ru and 5 thousand on Rambler-Planet.

    I won’t venture to predict when they will overtake it.

    Rambler-Planet and Blogs@Mail.Ru appeared at the same time, but the former, judging by the statistics, is many times “smaller” than the latter. Why do you think the blogosphere on Rambler is growing more slowly than the one on Mail.Ru?

    In fact, Planet started being advertised much later, about six months later, I think. But that is not the only reason: it seems to me that Mail.Ru has a larger audience on the services from which people move easily to blogs, namely dating and photo hosting. In addition, as far as I could see, Mail.Ru advertised its blogs on these services more.

    And finally, the positioning of Blogs@Mail.Ru is easier to understand. The Planet metaphor still has to be “mastered”, whereas with “Blogs” it is enough to learn one new word.

    Why do you think Rambler needs “Lady”?
    Rambler is a company whose strategy I will not undertake to comment on.

    I don’t know why Rambler needs love.rambler.ru, planeta.rambler.ru, mama.ru and damochka.ru all at once. Perhaps this is some kind of strategy.

    How does blog search work? How does indexing work? What is the blog spider called?

    Blog search is not easy. It differs fundamentally from web search: for web search, the amount of material accumulated over previous years hardly matters, because the database is completely refreshed in a very short time. For blog search, on the other hand, losing the archives would be a disaster, because blog search indexes only new entries: RSS feeds (the only source for indexing) usually contain just the last 10-20 entries, and there would be nowhere to get the old ones from.
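
    To illustrate why the archive matters, here is a minimal sketch in Python. It assumes a feed that exposes only its 10 newest entries and a poller that merges every fetch into a permanent store. The names FeedArchive and feed_window are hypothetical and are not taken from Yandex’s code.

```python
from dataclasses import dataclass, field


@dataclass
class FeedArchive:
    """Accumulates every entry ever seen in a feed, keyed by entry id."""
    entries: dict = field(default_factory=dict)

    def merge(self, fetched):
        """Store entries not seen before and return the ids that were new."""
        new_ids = []
        for entry_id, text in fetched:
            if entry_id not in self.entries:
                self.entries[entry_id] = text
                new_ids.append(entry_id)
        return new_ids


def feed_window(all_posts, size=10):
    """Simulate an RSS feed that exposes only the newest `size` posts."""
    return all_posts[-size:]


archive = FeedArchive()
posts = []
for day in range(1, 36):
    # The blogger publishes one post per day (an assumption for the demo).
    posts.append((f"post-{day}", f"text of post {day}"))
    if day % 5 == 0:
        # The poller fetches the feed every fifth day.
        archive.merge(feed_window(posts))

print(len(archive.entries))     # 35: every post was captured over time
print(len(feed_window(posts)))  # 10: the feed itself shows only the newest
```

    If the archive were lost, a fresh fetch would recover only the ten entries still in the feed window; the other 25 would be gone for good.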

    What does blog search consist of?

    1. A robot called blogindexd. It downloads RSS feeds and puts them in the repository. Its user-agent is YandexBlog/0.99.101 (compatible; DOS3.30; Mozilla/5.0; B; robot;) NN readers, where NN readers is the number of subscribers to the feed in Yandex.Tape; this information may be of interest to the feed’s author.
    2. The repository for entry texts, called bulca. It is a file-system-based repository developed at Yandex.
    3. Storage for meta-information (entry date, id of the entry’s feed, etc.). It uses MySQL.
    4. A full-text index and a search engine on top of it. This is essentially the ordinary Yandex.Server. Strictly speaking, there is not one index but several: constant indexes, which hold the archives; static indexes, which hold the entries of recent weeks and are updated fairly rarely, about once a day; and dynamic indexes, which are updated much more often, up to once every five minutes.
    5. A scheduler, which uses a feed’s history to decide when it should be downloaded again. It is a fairly intelligent program whose goal is to download feeds as often as possible without overloading the servers we download them from. In the first months of blog search, we sometimes downloaded RSS from Livejournal.com so actively that we brought the server down.
    6. A large number of additional scripts responsible for fighting spam (yes, blogs have spam too), for switching off news feeds (in blog search we try to keep only feeds that contain opinions: blogs, forums, groups and so on), and for much more.
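
    The scheduler in item 5 can be pictured with a small Python sketch. This is not Yandex’s algorithm, which the interview does not describe. It assumes a simple rule (halve the polling interval after a fetch that brought new entries, double it otherwise) and a per-host minimum gap between requests. All names and constants are hypothetical.

```python
from dataclasses import dataclass

MIN_INTERVAL = 5 * 60        # assumed floor: five minutes, in seconds
MAX_INTERVAL = 24 * 60 * 60  # assumed ceiling: one day, in seconds


@dataclass
class FeedState:
    interval: float = 60 * 60  # assumed starting point: poll hourly


def next_interval(state, had_new_entries):
    """Halve the interval after a productive fetch, double it otherwise."""
    if had_new_entries:
        state.interval = max(MIN_INTERVAL, state.interval / 2)
    else:
        state.interval = min(MAX_INTERVAL, state.interval * 2)
    return state.interval


class HostThrottle:
    """Enforces a minimum gap between requests to the same host."""

    def __init__(self, min_gap):
        self.min_gap = min_gap
        self.last_request = {}

    def earliest_start(self, host, now):
        """Reserve and return the earliest time a request to `host` may start."""
        last = self.last_request.get(host)
        start = now if last is None else max(now, last + self.min_gap)
        self.last_request[host] = start
        return start


busy = FeedState()
for _ in range(5):
    next_interval(busy, had_new_entries=True)

quiet = FeedState()
for _ in range(6):
    next_interval(quiet, had_new_entries=False)

throttle = HostThrottle(min_gap=2.0)
starts = [throttle.earliest_start("example.org", now=0.0) for _ in range(3)]

print(busy.interval)   # 300: an active feed ends up at the five-minute floor
print(quiet.interval)  # 86400: a silent feed backs off to once a day
print(starts)          # [0.0, 2.0, 4.0]: requests to one host are spaced out
```

    The per-host throttle is the part that addresses the overload problem mentioned above: however many feeds on one host are due at the same moment, their requests are spread out in time.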

    How many servers does blog search run on?

    Many. First, I don’t know the exact number, and second, I couldn’t tell you anyway. It all started with about ten servers; now there are more.

    As far as I know, you give every server a name, sometimes a funny one. What are the “blog” servers called?

    Not all blog search servers have fancy names. The servers with the “constant” indexes are called puzzle1 and so on, and the rest have names that are ordinary abbreviations (db, m1a, s1 ...).
    But with the front-end servers (shared by blog search and a bunch of other services) people traditionally let loose: plague, earthshake, shout, steemroll, soulcry, flamestrike, etc. As far as I understand, these are all spell names from ADnD.

    How much blog spam is there? How fast is its volume growing? Do you have such statistics?

    Now we know more than a thousand spam RSS feeds, mainly hosted on large blog hosting sites.

    Until March 2006, when blog search came out of beta, there was practically no spam at all, but the very next day after the “launch” we had to clean up the first timid attempts by hand. Since then we have built automated tools that let us say there is almost no spam in blog search. Of course, there is always room for improvement, and I can compose a search query that will show at least a dozen spam blogs, but the amount of spam in the visible part of the search is not growing, only shrinking. We detect about fifteen new spam feeds per day.

    It is worth noting that blog spam is almost always aimed not at visitors who come from Yandex blog search but at web search robots, Yandex’s and probably those of other search engines. These are attempts to get robots to discover new doorway pages or to inflate the link relevance of other doorways.

    There is also non-search spam, when off-topic messages are posted to communities, but that has nothing to do with blog search.

    How has the blogosphere in Russia changed over the past year? What trends are visible? What can you note?

    The most important change is that pillars of the blogosphere other than Livejournal have emerged and made themselves felt. A year ago there were no blogs on Mail.Ru or Rambler-Planet, and the size of diary.ru and liveinternet.ru was unclear. Over the same year, Liveinternet came to understand more about social services and the rest of Web 2.0, and began to change a lot.
    Over the same year, mobile operators (MTS and Megafon) also got into blogs.

    It is clear that many new people have come to the blogosphere. Many of them cannot write well: they are not journalists, writers or “geeks”, but ordinary people with ordinary concerns.

    Thanks to blog search, the blogosphere has become far more connected: previously there were separate large blog hosts and a handful (well, hundreds) of standalone bloggers, but now in two clicks you can find links to yourself on any blog or collect opinions about one event or another from across the blogosphere.

    I’m sure that many Internet-savvy companies carefully monitor bloggers’ opinions; in any case, I personally follow opinions and reviews about the Yandex services that are most interesting and important to me.

    Yandex.News is now broadcasting opinions from blogs next to news stories. When did you recognize the power of blogs?

    The power of blogs was recognized at Yandex when blog search was conceived, that is, even before I joined the company, probably in the first half of 2004. It was recognized publicly and fully when blog search came out of beta and took its place in the row of search “tabs” under the search bar, in early 2006.

    Further integration into other services is a matter of time. Integrating with news is an idea that lies on the surface; many people have come up with it over the lifetime of blog search. The catch is that it is not always easy to turn an idea into a concrete implementation. In this case it worked out, though not always “cleanly”. We are working on that.

    And when did you personally feel the power of the blogosphere? Do you remember this moment?

    For me personally, probably almost immediately, that is, in 2001, on LJ.
    A question asked in my blog often got a quick, good answer, and it could be on almost any topic, from medicine for my son to choosing a scanner.

    Power in a broader sense? Around the same time. On September 11, 2001, there was more information about what was happening in my friends feed and the fif feed (the combined feed of all Russian-speaking LiveJournal users, which existed at that time) than in any single media outlet.

    The topic of blogs fascinated me: I took part in developing the Reg][ster engine in 2003 and NPJ in 2003-2005. And then there was Yandex.

    Why did Reg][ster “stall”? The engine had every chance of growing into a large platform, but it didn’t work out?

    For two main reasons. First, the code written by Dima Smirnov was rather sloppy and hard to extend (almost no modularity, a procedural approach, etc.). Second, there was no enthusiast willing to take over development of “Register” once its creators ran out of enthusiasm. Mine ran out, in particular, because I found more interesting projects: WackoWiki and, later, NPJ.

    Corporate blogging is not very popular in Russia. Why do you think that is?

    For two reasons. First, our blog audience is not as large as in the West, although the growth in the number of people who know what a blog is has certainly been impressive: see the ROMIR data showing that it has doubled in the last nine months. Second, not all managers and PR departments are ready for the openness that a corporate blog implies.

    Who reads the comments on corporate blog posts?

    Quite a few people: the comments land in a shared mail folder that any employee is allowed to read. Judging by the replies, Elena Kolmanovskaya and Ilya Segalovich read them constantly, as do the technical support staff. And I read them constantly too.

    What do people write most often? Can you recall the strangest feedback?

    For a long time the most frequent comment was “afftar zhzhot”, in response to the post about the query-based spell checker. Comments like “I'm new I'm asking you for help” turn up regularly, and user support staff try to answer them as far as possible.

    The strangest?
    Perhaps this one, though it is rather long for an interview.

    Why do some Yandex hosts respond in ICMP Echo-reply with the same TTL with which they received the request?

    Just curious, example:
    # traceroute -P ya.ru 

    ix2-m9.yandex.net (193.232.244.93) 55.974 ms 37.562 ms 40.819 ms
    c3-vlan3.yandex.net (213.180.192.171) 63.987 ms 41.410 ms 80.810 ms
    9 * * *
    10 * * *
    11 * * *
    12 * * *
    13 * * *
    14 * * *
    15 * * *
    16 ya.ru (213.180.204.8) 61.545 ms! 48.058 ms! 49.508 ms!
    Hops 9 through 15 are, as I understand it, false: the host 213.180.204.8 (and maybe something else too) replies to ICMP with the same TTL that packets have when they reach it, so the replies do not make it back until the TTL is doubled.
    What is this for? If it’s not difficult, please answer ... Is this done for security reasons, or is it some tricky hardware; does some load balancer behave this way?
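
    The reader’s reasoning can be checked with a small model. The Python sketch below is a simplification under assumed values: a symmetric path, a host 9 hops away, and an initial reply TTL of 64 for an ordinary host. None of these figures come from Yandex’s network, and the function names are hypothetical.

```python
def host_reply_arrives(probe_ttl, hops_to_host, reply_ttl=None):
    """Return True if the probe reaches the host and its reply gets back.

    hops_to_host counts the host itself, so hops_to_host - 1 routers sit
    on the path, and each one decrements the TTL by one.
    reply_ttl=None models a host that copies the TTL of the incoming probe.
    """
    routers = hops_to_host - 1
    arrival_ttl = probe_ttl - routers
    if arrival_ttl < 1:
        return False  # the probe expired at a router before the host
    ttl = arrival_ttl if reply_ttl is None else reply_ttl
    # The reply has to survive the same number of routers on the way back.
    return ttl - routers >= 1


def first_visible_hop(hops_to_host, reply_ttl=None, max_ttl=64):
    """Smallest probe TTL at which the host's own reply is seen, or None."""
    for probe_ttl in range(1, max_ttl + 1):
        if host_reply_arrives(probe_ttl, hops_to_host, reply_ttl):
            return probe_ttl
    return None


print(first_visible_hop(9, reply_ttl=64))  # 9: an ordinary host
print(first_visible_hop(9))                # 17: a host echoing the probe TTL
```

    Under these assumptions an ordinary host answers as soon as the probe TTL reaches its distance, while a TTL-echoing host becomes visible only at roughly twice that distance, which is consistent with the run of unanswered hops in the letter.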


    And a short one:

    emails are sent to me in English. Is it possible for letters to come in Russian?

    Anton Antich wants to make “Blogus” the central place for studying the Russian-speaking blogosphere. What do you think about this?

    I have known about Blogus for a long time. Anton and I have met and discussed how best to give them link counts from blog search results.

    I think: let a hundred flowers bloom. Any meaningful resource built around the blogosphere is good for it.

    What do you think is the central place for studying the blogosphere? Is Yandex blog search such a place?

    I think in many ways our blog search is such a place. Of course, the ideal is unattainable, but you should strive for it. We think a lot about what other services need to be built to become such a center for studying the blogosphere, and we are building them.

    When should they be expected?

    I can’t talk about timing, you understand. But judging by how briskly things have been introduced and improved on the service over the past six months, one can assume it will be fairly soon. For example, the ability to search only blogs or only forums in one click, right from the search results page, appeared about a month ago without any announcement. I hope our users find it useful.
