GrabDuck: A New Look At Bookmarks

    Hello, reader. This is a short introductory post about our project, the search service GrabDuck: what it is, what problems we set out to solve, and what came of it all.



    Simply put, it is the story of a project in search of its grateful users, which we have tried to make interesting rather than boring. Whether we succeeded is for you to judge. If you are curious, read on under the cut.

    Introduction


    Offtopic 1, which can be skipped on a first reading
    Answers to frequently asked questions

    Yes, we know it is a goose, not a duck. It just so happened that when we were looking for an image for the front page, we liked this suspicious goose best, and believe us, we went through more than one stock photo bank. Besides, who said it has to be a duck? Yes, it is a goose, and it is watching you.

    No, we are not affiliated with the DuckDuckGo project. We would like to be, but no. We are not DDG, we are GD. Somehow it has become fashionable for alternative search services to put ducks front and center, and we did not escape it. Although, when we came up with the name, it never crossed my mind that it would resemble anything else. True, true!


    Most of us are familiar with the problem of saving bookmarks in the browser. We constantly read something new and set aside whatever looks interesting, hoping to find it again quickly later.

    I did that too. Everything that interested me went into bookmarks in Google Chrome. When the bookmarks piled up, the long process of classifying them began: that is how the "Programming" folder appeared, with "java" inside it and "php" and "javascript" next to it, followed by a pile of other technologies that interested me at one time or another. Over time it became hard to make sense of it all; the standard functionality does not let you do much. Some links belonged in several folders at once, and sometimes I simply forgot about the old classification and started a new one right next to it.

    When the bookmarks kept multiplying and there was no time to sort them right away, I found a way out: an Inbox folder. I promised myself I would clear it out on Fridays and started dumping everything I found into it. That did not work, of course; on Fridays there was always something more interesting to do. So most of my bookmarks ended up in the Inbox (at some point I counted well over 400 of them). Eventually I caught myself just going in and opening the things I needed and still remembered, while whatever I did not remember I simply never opened, searching for it in Google all over again instead.

    That is how the idea of GrabDuck was born. Why can't I search my bookmarks the same way we routinely search the web, by typing search queries? After all, when I try to find an article I need, most often (1) I know what the article was about (and can therefore sketch out a rough query) and (2) I remember for sure that I had the article somewhere (and maybe more than one).

    “Stop,” I said to myself, because search is exactly what I do at work. I know how it works; I have hands-on experience. And here I was, the proverbial shoemaker without shoes.
    And so our GrabDuck began.

    About the project


    Offtopic 2, which can also be skipped
    A few banalities (or the answers, continued)

    No, GrabDuck is not a startup. We are tired of the tradition of calling anything and everything by that big word, Startup. We do not want to join the endless battle for investors. We want to build a good service that benefits its users.

    Yes, we love the service and use it ourselves. We know everyone says that, but we really do use GD every day. A few months ago, when I showed an early prototype to a friend, he asked: “What will you do if it doesn't take off?” “Nothing scary; the server is cheap, I will pay for it and keep using it myself,” I replied. I have been using it ever since.


    What is GrabDuck and how do we see it?

    GrabDuck is a bookmark storage service where you can search "like on Google." The main idea of the service is to let the user toss a document into the "piggy bank" and forget about it; the system will help find it when it is needed.

    So GrabDuck is first and foremost a search engine. A proper full-text search across all the materials I have saved (not just the titles, but the entire article text). A search that matches the phrases of a query by their word forms, not just by the literal set of words the user typed. A search that is constantly learning from which queries I enter and which documents I open most often, and adapts to my preferences.

    In addition, as a “free” bonus, the system offers recommendations: articles from other users that might also interest me, again based on my queries and preferences. And if a recommended article really is interesting, I can add it to my own collection.

    What we use


    A short overview, with comments, of which technologies we use and for what.



    Our server is Apache Tomcat, which runs several Java applications. We tried to follow the principles of a microservices architecture and split the application into separate modules that communicate with each other. Right now a single server is enough for everything, but later, when the need arises, we can, for example, deploy an extra instance of the module that parses articles on a second machine, scaling up just that one component of the system without touching anything else.

    As the front-end server we use Nginx. We spent a long time choosing between Apache and Nginx and in the end settled on the latter. The reasons are simple: for us it turned out to be more lightweight and easier to configure.

    To store and work with the data we use a MySQL + Solr combination: a hybrid of sorts, where each component does what it does best.

    MySQL is responsible for data integrity and for storing the data in normalized form. We can always assemble all the necessary information about a document from several tables: the page content, metadata, who has the page bookmarked, and per-user details such as tags. The big drawback of this setup is that MySQL is slow for our purposes and offers almost no full-text search capability. In fairness, from a search point of view the full-text features that modern SQL products provide are, as a rule, so basic that building anything worthwhile on top of them is very hard and sometimes nearly impossible.
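
    As a rough illustration, assembling a document from several normalized tables might look like the JDBC sketch below. The schema (the documents, bookmarks and tags tables and their columns) is hypothetical, chosen just for the example, not our actual one.

    import java.sql.*;

    public class DocumentRepository {
        private final Connection conn;

        public DocumentRepository(Connection conn) {
            this.conn = conn;
        }

        /** Collects one user's view of a document: shared content plus personal data such as tags. */
        public void printDocumentForUser(long documentId, long userId) throws SQLException {
            // Hypothetical schema: documents, bookmarks and tags live in separate
            // tables and are joined back together whenever the full picture is needed.
            String sql =
                "SELECT d.url, d.title, t.name AS tag " +
                "FROM documents d " +
                "JOIN bookmarks b ON b.document_id = d.id " +
                "LEFT JOIN tags t ON t.bookmark_id = b.id " +
                "WHERE d.id = ? AND b.user_id = ?";
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                ps.setLong(1, documentId);
                ps.setLong(2, userId);
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        System.out.printf("%s | %s | tag=%s%n",
                                rs.getString("url"), rs.getString("title"), rs.getString("tag"));
                    }
                }
            }
        }
    }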

    All in all, MySQL is not much good for searching. So when something has to be found for a user's query, the second component comes into play: Solr, which is responsible for search and for aggregating information. Whenever a document in the database is created or modified, it is sent to Solr, where a view of it is built that is then used directly for searching.
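
    In SolrJ terms, pushing a freshly saved document into the index could look roughly like this. A minimal sketch: the core name and the field names (id, url, title, lang, content_<lang>) are hypothetical, and in practice the call would sit on the save/update path rather than in a standalone class.

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class SolrIndexer {
        private final SolrClient solr =
                new HttpSolrClient.Builder("http://localhost:8983/solr/documents").build();

        /** Sends the searchable view of a document to Solr after it is saved in MySQL. */
        public void index(long id, String url, String title, String content, String lang) throws Exception {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", String.valueOf(id));
            doc.addField("url", url);
            doc.addField("title", title);
            doc.addField("lang", lang);
            // One content field per language lets each language use its own analyzer chain.
            doc.addField("content_" + lang, content);
            solr.add(doc);
            solr.commit(); // in production you would typically rely on autoCommit instead
        }
    }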

    Thus we get all the advantages of a classic SQL database combined with the speed and power that a NoSQL search engine gives us.

    How does it work


    Let's look at what happens when a user adds something to the system. It can be a single document added through the Chrome extension or an import of a large batch of bookmarks; the essence does not change, and the data always goes through the same pipeline.



    First, a new document is created, storing the URL and, if known, the title. From that moment the user sees the document on their home page, although full-text search is of course not available yet. Some time later, as a rule within 5 minutes, the parsing task kicks in. To parse the page and extract the article from it we use our adapted version of the Snaktory library. The output is the article content, meta information, and tags.
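
    For reference, extraction with the stock snacktory library (de.jetwick.snacktory) looks roughly like this; ours is an adapted version, so treat this as a sketch of the vanilla API with a made-up URL:

    import de.jetwick.snacktory.HtmlFetcher;
    import de.jetwick.snacktory.JResult;

    public class ArticleParser {
        public static void main(String[] args) throws Exception {
            HtmlFetcher fetcher = new HtmlFetcher();
            // Fetch the page and extract the main article; 30-second timeout,
            // resolving redirects to reach the final URL.
            JResult res = fetcher.fetchAndExtract("https://example.com/some-article", 30000, true);
            System.out.println("Title:    " + res.getTitle());
            System.out.println("Keywords: " + res.getKeywords());
            System.out.println("Text:     " + res.getText());
        }
    }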

    Next we check whether this article is already in the database. If it is, there is no need to save it again and the user can "reuse" the existing one. We match documents by their canonical URL. For example, any article on Habr has at least 3 different valid URLs: one for Google/Yandex, one for mobile display, and one for desktop, yet the canonical URL is always the same. The same scheme saves us from duplicating information if a user, say, imports the same bookmarks file several times.
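
    The canonical URL itself comes from the page markup, the <link rel="canonical"> tag. A minimal sketch of extracting it, using jsoup here purely for illustration and falling back to the fetched URL when the tag is missing:

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    public class CanonicalUrlResolver {
        /** Returns the page's declared canonical URL, or the original URL if none is declared. */
        public static String resolve(String url) throws Exception {
            Document page = Jsoup.connect(url).get();
            // Mobile, desktop and other variants of the same article all point
            // their rel=canonical link at the same "true" address.
            Element link = page.selectFirst("link[rel=canonical]");
            return link != null ? link.absUrl("href") : url;
        }
    }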

    If the link is not a duplicate, the next step is to detect the document's language. This is needed for two things. First, the document is indexed in a form adapted to searching in that particular language (at the moment we support Russian and English, with German next in line). Second, the language is used to filter recommendations. For example, if we know a user reads in English and German, recommendations in Russian will not be shown, even if there are matching Russian documents for the query. To detect the language we use the Language-detection library. A big minus, of this library in particular and probably of every approach to language detection, is that quality drops sharply on small amounts of text: by our observations you need at least 500 characters for a reliable result, and below that the quality starts to limp.
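
    With the language-detection library the call itself is short; a sketch, assuming the language profiles shipped with the library are available at some local path (the path here is hypothetical):

    import com.cybozu.labs.langdetect.Detector;
    import com.cybozu.labs.langdetect.DetectorFactory;

    public class LanguageGuesser {
        public static void main(String[] args) throws Exception {
            // Load the bundled language profiles once at startup.
            DetectorFactory.loadProfile("/path/to/profiles");

            Detector detector = DetectorFactory.create();
            detector.append("The quick brown fox jumps over the lazy dog. "
                    + "Short texts like this one are exactly where detection gets shaky.");
            System.out.println(detector.detect()); // e.g. "en"
        }
    }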

    And in the last step, an entity is created in the Solr search index based on the saved document. From that moment on, the document is available both for direct full-text search and for showing up in recommendations.
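
    Querying that index with SolrJ might then look like the sketch below, reusing the same hypothetical field names as in the indexing example and filtering out languages the user does not read:

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrDocument;

    public class BookmarkSearch {
        public static void main(String[] args) throws Exception {
            SolrClient solr =
                    new HttpSolrClient.Builder("http://localhost:8983/solr/documents").build();

            SolrQuery query = new SolrQuery("full text search");
            query.set("defType", "edismax");
            // Search the title plus the per-language content fields.
            query.set("qf", "title content_en content_ru");
            // Only surface documents in languages this user actually reads.
            query.addFilterQuery("lang:(en OR ru)");
            query.setRows(10);

            QueryResponse rsp = solr.query(query);
            for (SolrDocument doc : rsp.getResults()) {
                System.out.println(doc.getFieldValue("url") + " - " + doc.getFieldValue("title"));
            }
        }
    }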

    Where are we now


    The MVP is ready and the first users have appeared. When they find what they were looking for, we celebrate together. Substantive feedback is especially motivating. We want to say a special thank-you to one of our users, anton_slim, who really worked through the whole service and rolled out a list of everything that was confusing or crooked; we have fixed it.

    Now we are actively testing the service, so we invite everyone to try it out and share their impressions.

    We plan to write on this blog about search: how we use the technologies, what problems we run into and how we solve them; in short, everything that might interest you.

    Subscribe and let's chat.
