
Technosphere Lectures. Information Retrieval. Part 1 (Spring 2017)
A new set of video lectures from our educational project Technosphere is now available. This time the course is devoted to information retrieval.
Every Internet user has experience with search engines: we regularly enter queries and get results. Search engines have become so familiar that it is hard to imagine a time without them, and the quality of modern search is taken for granted, although fifteen years ago it was very different. Yet a modern search engine is a highly complex combination of software and hardware. Its creators had to solve a huge number of practical problems, from processing very large volumes of data to accounting for the nuances of how people perceive search results.
In this course we cover the main methods used to build search engines. Some of them are good examples of ingenuity; others show where and how modern mathematical tools can be applied.
Lecture list:
- Introduction
- Features of web search. Search Robot Architecture
- Search Robot Scheduler
- Indexing and Boolean Search
- Boolean index and search
- Finding Duplicates
- Finding Duplicates (Part 2)
- Pornography filtering
- Antispam
- Snippets
- Building Snippets
- Correction of typos in queries
- Query suggestions, reformulations, classifiers
Course instructors:
- Jan Kisel, Head of Mail.Ru Search Infrastructure;
- Julia Sergukova, Programmer, Search Infrastructure Department, Mail.Ru;
- Dmitry Solovyov, Lead Developer, Mail.Ru Search Ranking Group;
- Andrey Murashev, Programmer, Recommender Systems, Mail.Ru Search;
- Mikhail Plekhanov, Programmer, Search Infrastructure Department, Mail.Ru;
- Evgeny Chernov, Head of the Mail.Ru Search Query Analysis Department.
Lecture 1. Introduction
An overview lecture on why information retrieval matters.
Lecture 2. Features of web search. Search Robot Architecture
The first part of the lecture is devoted to web search: it gives some historical background, briefly touches on advertising in search, and describes how web search is organized. The second part is devoted to search robots (spiders): how the data collection task is defined, and how pages are downloaded, refreshed, and stored.
Lecture 3. Search Robot Scheduler
The lecture states the problem of scheduling a search robot's work, reviews Focused Crawler algorithms, and analyzes the Stone Garden algorithm. Quota issues are also addressed.
Lecture 4. Indexing and Boolean Search
The lecture examines the composition and purpose of the search index and briefly discusses search engine hardware. It then covers fast intersection of blocks, index compression, and methods for improving the compression ratio.
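As a rough illustration of two ideas named above, here is a minimal sketch of intersecting two sorted posting lists (the core of a Boolean AND query) and of storing document IDs as gaps, which is a common first step before compressing an index. The function names are hypothetical and this is not the implementation shown in the lecture.

```python
def intersect(a, b):
    """Intersect two sorted lists of document IDs with a two-pointer merge."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def to_gaps(doc_ids):
    """Replace sorted document IDs with differences between neighbours."""
    gaps, prev = [], 0
    for doc_id in doc_ids:
        gaps.append(doc_id - prev)
        prev = doc_id
    return gaps


def from_gaps(gaps):
    """Restore the original document IDs from the gap representation."""
    out, total = [], 0
    for gap in gaps:
        total += gap
        out.append(total)
    return out
```

Gaps are usually much smaller than the IDs themselves, so they fit into fewer bits when a compact integer encoding is applied afterwards.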
Lecture 5. Boolean index and search
A continuation of the previous lecture. The topic of compression comes up again: the Simple9 algorithm and working with binary data in Python are covered. The second part of the lecture is devoted to the search dictionary: how stop words are represented and how the dictionary is stored. The third part covers the query tree: what it is, how it is executed, and how queries are parsed.
At the end of the lecture, you will learn how the overall indexing workflow is organized.
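The Simple9 scheme mentioned in this lecture packs several small integers into one 32-bit word: 4 bits select a layout and the remaining 28 bits hold the values. The sketch below follows the commonly described nine layouts; the function names are hypothetical, and the lecture's own code may differ in details such as bit order or how the tail of the list is handled.

```python
import struct

# (how many numbers, bits per number) for the 28 data bits.
# The 4-bit selector stored in the top bits is the index into this list.
LAYOUTS = [(28, 1), (14, 2), (9, 3), (7, 4), (5, 5),
           (4, 7), (3, 9), (2, 14), (1, 28)]


def simple9_encode(numbers):
    """Greedily pack non-negative integers (< 2**28) into 32-bit words."""
    words = []
    pos = 0
    while pos < len(numbers):
        for selector, (count, bits) in enumerate(LAYOUTS):
            chunk = numbers[pos:pos + count]
            # A layout is usable only if enough numbers remain and all fit.
            if len(chunk) == count and all(0 <= n < (1 << bits) for n in chunk):
                word = selector << 28
                for i, n in enumerate(chunk):
                    word |= n << (i * bits)
                words.append(word)
                pos += count
                break
        else:
            raise ValueError("value does not fit into 28 bits")
    return words


def simple9_decode(words):
    """Unpack 32-bit words produced by simple9_encode."""
    out = []
    for word in words:
        count, bits = LAYOUTS[word >> 28]
        mask = (1 << bits) - 1
        for i in range(count):
            out.append((word >> (i * bits)) & mask)
    return out


def words_to_bytes(words):
    """Serialize the words as little-endian unsigned 32-bit integers."""
    return struct.pack(f"<{len(words)}I", *words)
```

Because a layout is chosen only when exactly that many numbers remain, decoding returns the original list without needing a separate length field.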
Lecture 6. Finding Duplicates
Finding duplicates is a big topic, so it is split across two lectures. First, you will learn the terminology, look at examples of duplicates, and get acquainted with shingling. Then practical methods for finding duplicates are covered: improvements to the basic algorithms, the Minshingle signature construction method, the Jaccard measure, and Broder's algorithm.
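To make the terms above more concrete, here is a minimal sketch of word shingling, the Jaccard measure, and a MinHash-style signature that approximates it. The function names, the shingle length, and the choice of a salted BLAKE2 hash are assumptions made for illustration, not the exact construction from the lecture.

```python
import hashlib


def shingles(text, k=3):
    """Return the set of overlapping k-word shingles of a text."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}


def jaccard(a, b):
    """Jaccard measure: size of the intersection over size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def minhash_signature(shingle_set, num_hashes=64):
    """For each salted hash function keep the minimum hash over all shingles."""
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for s in shingle_set
        ))
    return signature


def estimate_similarity(sig_a, sig_b):
    """Fraction of matching positions, an estimate of the Jaccard measure."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)
```

Comparing short fixed-length signatures instead of full shingle sets is what makes the approach practical for large collections; the estimate becomes more accurate as the number of hash functions grows.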
Lecture 7. Finding Duplicates (Part 2)
This lecture is devoted to finding duplicates in very large document collections. It covers a technique for finding near-duplicates (Locality-Sensitive Hashing), discusses algorithms with an indivisible signature, and concludes with a comparison of the properties of the different algorithms.
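One common way to apply Locality-Sensitive Hashing to document signatures is banding: each signature is cut into bands, and documents that agree on at least one whole band become candidate pairs. The sketch below assumes signatures are equal-length lists of integers; the function name and the banding scheme are illustrative assumptions and may not match the variant presented in the lecture.

```python
from collections import defaultdict
from itertools import combinations


def lsh_candidate_pairs(signatures, bands):
    """Return pairs of document IDs whose signatures share at least one band.

    signatures: dict mapping a document ID to a list of ints (equal lengths).
    bands: number of bands to split each signature into.
    """
    if not signatures:
        return set()
    sig_len = len(next(iter(signatures.values())))
    rows = sig_len // bands  # rows per band; leftover positions are ignored
    pairs = set()
    for band in range(bands):
        buckets = defaultdict(list)
        for doc_id, sig in signatures.items():
            key = tuple(sig[band * rows:(band + 1) * rows])
            buckets[key].append(doc_id)
        # Every pair of documents in the same bucket is a candidate.
        for ids in buckets.values():
            pairs.update(combinations(sorted(ids), 2))
    return pairs
```

Only the candidate pairs then need a full comparison, which avoids comparing every document with every other one.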
Lecture 8. Pornography filtering
The lecture begins by explaining why it is important to always filter pornographic material and discusses general approaches to the problem. It then covers filtering techniques for web pages, queries, and images.
Lecture 9. Antispam
Another highly relevant topic. First, the lecture looks at why spam exists and what problems it causes. It then describes how spam influences search engines and how that influence can be countered. You will learn how to detect spam by analyzing page content and how to identify spam sites. Techniques for combating fraud and spam in applications are also covered.
Lecture 10. Snippets
In this lecture you will learn what search snippets are and what kind of search result layout is recommended. It discusses the main elements of a SERP, explains what the "semantic web" is, and looks at semantic markup (microdata) on the page. The lecture ends with inorganic snippets and sentence boundary detection.
Lecture 11. Building snippets
A continuation of the snippets topic. This time you will learn what text summarization is; organic snippets and the forward index are covered, and a technique for assessing snippet quality is discussed.
Lecture 12. Correction of typos in queries
The lecture is devoted to methods for detecting and correcting typos in user queries.
Lecture 13. Query suggestions, reformulations, classifiers
The last lecture of the course is devoted to generating suggestions as a user types a search query and to methods of reformulating queries to improve search results. Finally, various kinds of query classifiers are discussed.
The playlist of all lectures is located here. As a reminder, current lectures and programming master classes from our IT specialists in the Technopark, Technosphere and Technotrack projects are still published on the Technostream channel.