Why did Google need a knowledge graph?
When I introduce myself and explain what our startup does, people immediately ask: did you work at Facebook, or was what you built inspired by Facebook? Many people know about Facebook's efforts around its social graph, because the company has published several articles about the infrastructure it carefully built for that graph.
Google has talked about its Knowledge Graph, but nothing about the internal infrastructure behind it. Yet the company has specialized subsystems for it too, and the knowledge graph gets a lot of attention internally. Personally, I bet at least two of my promotions on it, and I started working on a new graph serving system back in 2010.
Google needed to build infrastructure not only to serve the complex relationships in the Knowledge Graph, but also to support all the OneBox blocks in search results that draw on structured data. That infrastructure had to (1) traverse facts with high quality, (2) with high enough throughput, and (3) with low enough latency to be in the running for a good share of web search queries. It turned out that no available system or database could do all three.
Now that I have explained why such infrastructure is needed, the rest of the article covers my experience building these systems, including those for the Knowledge Graph and OneBox.
How do I know that?
A brief introduction: I worked at Google from 2006 to 2013, first as an intern, then as a software engineer in web search infrastructure. In 2010 Google acquired Metaweb, and my team had just launched Caffeine. I wanted to do something different, so I started working with the Metaweb folks (in San Francisco), splitting my time between San Francisco and Mountain View. My goal was to figure out how to use the knowledge graph to improve web search.
There had been such projects at Google before me, notably a project called Squared built in the New York office, and there was some talk of Knowledge Cards. After that came sporadic efforts by individuals and small teams, but no established chain of command behind them, which ultimately pushed me to leave Google. But we will come back to that later...
History of Metaweb
As already mentioned, Google acquired Metaweb in 2010. Metaweb had built a high-quality knowledge graph using several methods, including crawling and parsing Wikipedia and a crowdsourced, wiki-style editing system called Freebase. All of this ran on Metaweb's own graph database, Graphd, the graph daemon (since published on GitHub).
Graphd had some fairly typical properties. As a daemon, it ran on a single server, kept all the data in memory, and could serve the entire Freebase website. After the acquisition, one of the tasks Google set was to keep Freebase running.
Google built its empire on commodity hardware and distributed software. A single-server database could never serve crawling, indexing, and search results. Google first built SSTable, then Bigtable, which scales horizontally to hundreds or thousands of machines jointly serving petabytes of data. Machines are allocated by Borg (from which Kubernetes came), they talk via Stubby (from which gRPC came), resolve IP addresses through the Borg Name Service (BNS; Kubernetes has an equivalent), and store data on the Google File System (GFS, think Hadoop FS). Processes may die and machines may break, but the system as a whole is indestructible and keeps humming.
Graphd landed in this environment. The idea of a database serving an entire website from a single server is alien to Google (and to me). In particular, Graphd needed 64 GB or more of memory just to run. If that sounds small, remember: this was 2010. Most Google servers maxed out at 32 GB. Google actually had to buy special machines with enough RAM to keep serving Graphd in its existing form.
Graphd Replacement
Brainstorming began on how to move Graphd's data out or rewrite the system to work in a distributed way. But graphs are hard. This is not a key-value database, where you can take a chunk of data, move it to another server, and serve it when the key is requested. Graphs promise efficient joins and traversals, which requires the software to be designed in a specific way.
One idea was to use a project called MindMeld (IIRC). Memory on another server would be accessible much faster over dedicated network hardware, supposedly faster than regular RPCs, fast enough to pseudo-replicate the direct memory access an in-memory database needs. The idea did not go very far.
Another idea, which actually became a project, was to build a truly distributed graph serving system, something that could not only replace Graphd for Freebase but actually work in production. It was called Dgraph: distributed graph, inverted from Graphd (graph daemon).
If you were wondering, then yes: my startup, Dgraph Labs, and the open-source project Dgraph are named after that Google project (note: Dgraph is a trademark of Dgraph Labs; as far as I know, Google does not release projects under names that match internal ones).
For almost all of the remaining text, when I mention Dgraph I mean the internal Google project, not the open-source project we built later. But more on that below.
The story of Cerebro: the knowledge engine
Inadvertently building an infrastructure for graphs
Although I was generally aware of the Dgraph effort to replace Graphd, my goal was to build something to improve web search. At Metaweb I met DH, a research engineer who had created Cubed.
As I mentioned, a motley group of engineers in the New York office had built Google Squared, but DH's system was much better. I started thinking about how to build it at Google. Google already had pieces of the puzzle that I could readily use.
The first piece of the puzzle is the search engine itself: a way to determine accurately which words belong together. For example, given a phrase like [tom hanks movies], it can tell that [tom] and [hanks] belong together. Similarly, in [san francisco weather], [san] and [francisco] belong together. These things are obvious to people, but much less so to machines.
The second piece is understanding grammar. For the query [books by french authors], a machine can interpret it as [books] by [french authors], that is, books written by authors who are French. But it can also interpret it as [french books] by [authors], that is, French-language books by any author. I used the Part-Of-Speech (POS) tagger from Stanford to parse the grammar better and build a tree.
The third piece is understanding entities. [french] can mean many things: a country (region), a nationality (the French people), a cuisine, or a language. I used another system to get the list of entities a word or phrase could correspond to.
The fourth piece is understanding the relationships between entities. Once you know how to group words into phrases, in what order the phrases should be applied (their grammar), and which entities they can correspond to, you need to find relationships between those entities to produce machine interpretations. For example, for the query [books by french authors], the POS tagger says it is [books] by [french authors]. We have several candidate entities for [french] and several for [authors], and the algorithm has to determine how they are related. They could be related by place of birth, that is, authors born in France (who might still write in English). Or they could be authors who are French citizens. Or authors who write in French.
Search Index Graph System
To determine whether entities are connected and how, you need a graph system. Graphd was never going to scale to Google's level, but the search index itself could be used. Knowledge Graph data is stored in the triples format, that is, each fact is represented by three parts: subject (entity), predicate (relationship), and object (another entity). Queries take the form
[S P] → [O]
or [P O] → [S]
sometimes [S O] → [P]
I used the Google search index, assigned a docid to each triple, and built three indexes, one each for S, P, and O. Because the index supports attachments, I also attached type information for each entity (actor, book, person, and so on).
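To make those three access patterns concrete, here is a minimal in-memory sketch in Go. It only illustrates the [S P] → [O], [P O] → [S], and [S O] → [P] lookups described above; the types, names, and data are invented for this example and have nothing to do with the actual search-index-backed implementation.

```go
// Illustrative only: a toy triple store with three lookup directions.
package main

import "fmt"

// Triple is a single fact: subject - predicate - object.
type Triple struct {
	S, P, O string
}

// TripleStore keeps three indexes so that any two known parts
// of a triple can be used to look up the third.
type TripleStore struct {
	bySP map[[2]string][]string // [S P] -> list of O
	byPO map[[2]string][]string // [P O] -> list of S
	bySO map[[2]string][]string // [S O] -> list of P
}

func NewTripleStore() *TripleStore {
	return &TripleStore{
		bySP: map[[2]string][]string{},
		byPO: map[[2]string][]string{},
		bySO: map[[2]string][]string{},
	}
}

func (ts *TripleStore) Add(t Triple) {
	ts.bySP[[2]string{t.S, t.P}] = append(ts.bySP[[2]string{t.S, t.P}], t.O)
	ts.byPO[[2]string{t.P, t.O}] = append(ts.byPO[[2]string{t.P, t.O}], t.S)
	ts.bySO[[2]string{t.S, t.O}] = append(ts.bySO[[2]string{t.S, t.O}], t.P)
}

// Objects answers [S P] -> [O].
func (ts *TripleStore) Objects(s, p string) []string { return ts.bySP[[2]string{s, p}] }

// Subjects answers [P O] -> [S].
func (ts *TripleStore) Subjects(p, o string) []string { return ts.byPO[[2]string{p, o}] }

// Predicates answers [S O] -> [P].
func (ts *TripleStore) Predicates(s, o string) []string { return ts.bySO[[2]string{s, o}] }

func main() {
	ts := NewTripleStore()
	ts.Add(Triple{"victor_hugo", "nationality", "france"})
	ts.Add(Triple{"victor_hugo", "wrote", "les_miserables"})

	fmt.Println(ts.Objects("victor_hugo", "wrote"))     // [les_miserables]
	fmt.Println(ts.Subjects("nationality", "france"))   // [victor_hugo]
	fmt.Println(ts.Predicates("victor_hugo", "france")) // [nationality]
}
```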
I built such a system, even though I could see the join-depth problem (explained below) and knew it was not suitable for complex queries. In fact, when someone from the Metaweb team asked me to share the system with colleagues, I declined.
To determine relationships, you look at how many results each candidate query returns. For example, how many results do [french] and [author] give? Then you take those results and see how they relate to [books], and so on. This produces many machine interpretations of a query. For example, [tom hanks movies] generates interpretations such as [movies directed by tom hanks], [movies starring tom hanks], and [movies produced by tom hanks], while automatically rejecting interpretations like [movies named tom hanks].
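As a toy illustration of that idea (the scoring is my own simplification, not the actual algorithm), one might keep only interpretations that return results on the graph and order them by result count:

```go
// Illustrative only: rank machine interpretations by result count.
package main

import (
	"fmt"
	"sort"
)

type interpretation struct {
	query   string
	results int // number of entities this interpretation matches on the graph
}

// rank drops interpretations with no results and orders the rest,
// most results first.
func rank(cands []interpretation) []interpretation {
	var kept []interpretation
	for _, c := range cands {
		if c.results > 0 {
			kept = append(kept, c)
		}
	}
	sort.Slice(kept, func(i, j int) bool { return kept[i].results > kept[j].results })
	return kept
}

func main() {
	cands := []interpretation{
		{"movies starring tom hanks", 70},
		{"movies directed by tom hanks", 5},
		{"movies produced by tom hanks", 9},
		{"movies named tom hanks", 0}, // rejected: no results on the graph
	}
	for _, c := range rank(cands) {
		fmt.Println(c.query, c.results)
	}
}
```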
Each interpretation produces a list of results, entities that are valid on the graph, and also returns their types (available via the index attachments). This turned out to be extremely powerful, because knowing the types of the results enables filtering, sorting, and further expansion. For movie results, you can sort and filter by year of release, length (short, long), language, awards received, and so on.
The project seemed so intelligent that we (DH was also partially involved, as the knowledge-graph expert) named it Cerebro, after the device of the same name in the X-Men movies.
Cerebro often revealed very interesting facts that were not part of the original query. For example, for the query [US presidents], Cerebro understands that presidents are people, and people have height. That lets it sort presidents by height and show that Abraham Lincoln is the tallest US president. People can also be filtered by nationality; in this case, both America and the United Kingdom show up, because the US had one British president, namely George Washington. (Disclaimer: the results reflect the state of the knowledge graph at the time of the experiment; I cannot vouch for their correctness.)
Blue links versus knowledge
Cerebro could really understand user queries. Given the data for the whole graph, we could generate machine interpretations of a query, generate the list of results, and understand a great deal about those results to explore the graph further. As explained above, once the system knows it is dealing with movies, people, books, and so on, specific filters and sorts can be enabled. You can even walk across nodes and show related information: from [US presidents] to [schools they went to] or [children they fathered]. Here are some other queries the system generated on its own: [female african american politicians], [bollywood actors married to politicians], [children of us presidents], [movies starring tom hanks released in the 90s].
DH demonstrated this ability to hop from one list to another in another of his projects, Parallax.
Cerebro was very impressive, and Metaweb's leadership supported it. Even as infrastructure it was efficient and functional. I called it a knowledge engine (by analogy with a search engine). But at Google nobody owned this area. My manager had little interest in it; I was advised to talk to one person, then another, and eventually I got a chance to demo the system to a very senior search executive.
The answer was not what I was hoping for. For the knowledge-engine results for [books by french authors], he ran a Google search, showed ten blue links, and said Google could do the same. Besides, they did not want to take traffic away from websites, because site owners would get angry.
If you think he was right, consider this: when Google searches the web, it does not really understand the query. The system looks for the right words in the right positions, weighted by page importance and so on. It is a very sophisticated system, but it understands neither the query nor the results. The user does all the work: reading, parsing, extracting the needed information from the results, running further searches, assembling a complete answer, and so on.
For example, for [books by french authors], a person would first try to find an exhaustive list, though a single page with such a list may not even exist. Then they would sort those books by year of publication, or filter by publisher, and so on; all of that requires processing a lot of information across many searches. Cerebro can cut out that effort and make the interaction simple and seamless.
But back then there was no full appreciation of the knowledge graph's importance. Management was not sure of its usefulness or how it related to search. This new approach to knowledge does not come easily to an organization that achieved its enormous success by giving users links to web pages.
I struggled with the managers' skepticism for about a year and eventually gave up. A manager from the Shanghai office reached out, and I handed the project over to him in June 2011. He put a team of 15 engineers on it. I spent a week in Shanghai passing on to the engineers everything I had built and learned. DH was also involved, and he led that team for a long time.
Join-depth problem
Cerebro's graph system had a join-depth problem. A join happens when the result of an earlier query is needed to execute a later one. A typical join involves a SELECT, that is, a filter over part of the universal data set, whose results are then used to filter another part of the data set. An example: say you want to know [people in SF who eat sushi]. The data about each person includes which city they live in and which foods they eat.
The query above is a single-level join. An application talking to the database would make one query for the first step, then a separate query per result to find out what each person eats, keeping only those who eat sushi.
The second step suffers from fan-out. If the first step returns a million results (the population of San Francisco), the second step has to issue a query for each of them, ask about their eating habits, and only then apply the filter.
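Here is a minimal sketch of this naive client-side join, with invented names and an in-memory stand-in for the backend; it shows exactly where the per-result fan-out comes from:

```go
// Illustrative only: the naive single-level join described above.
package main

import "fmt"

// graphBackend stands in for a remote graph service; in a real deployment,
// each method call would be a network round trip.
type graphBackend struct {
	livesIn map[string][]string // city -> people
	eats    map[string][]string // person -> foods
}

func (g *graphBackend) PeopleIn(city string) []string { return g.livesIn[city] }
func (g *graphBackend) Eats(person string) []string   { return g.eats[person] }

// peopleInCityWhoEat makes one call for step one, then one call per result
// for step two: a million residents means a million follow-up calls.
func peopleInCityWhoEat(g *graphBackend, city, food string) []string {
	var out []string
	for _, person := range g.PeopleIn(city) { // one call
		for _, f := range g.Eats(person) { // one call per person: the fan-out
			if f == food {
				out = append(out, person)
				break
			}
		}
	}
	return out
}

func main() {
	g := &graphBackend{
		livesIn: map[string][]string{"SF": {"alice", "bob"}},
		eats:    map[string][]string{"alice": {"sushi"}, "bob": {"pizza"}},
	}
	fmt.Println(peopleInCityWhoEat(g, "SF", "sushi")) // [alice]
}
```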
Distributed-systems engineers usually solve this by broadcasting: sending one request to every server in the cluster and accumulating the matching results. That gives you the join, but it creates latency problems.
Broadcasting is expensive in a distributed system. The problem is best explained by Jeff Dean of Google in his talk "Achieving Rapid Response Times in Large Online Services" (video, slides). The overall latency is always greater than the latency of the slowest component. Small blips on individual machines cause delays, and touching many machines per query dramatically increases the chance of hitting one.
Consider a server whose latency exceeds 1 ms in 50% of cases and exceeds 1 s in 1% of cases. If a query touches only one such server, just 1% of responses take longer than a second. But if a query touches a hundred such servers, 63% of responses take longer than a second.
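The 63% figure is simply the complement of all one hundred servers responding quickly at the same time, assuming they behave independently:

P(at least one response > 1 s) = 1 − 0.99^100 ≈ 1 − 0.366 ≈ 0.63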
So broadcasting even a single query significantly hurts latency. Now imagine you need two, three, or more joins: it becomes far too slow for real-time execution.
This fan-out problem with broadcast queries is inherent to most non-native graph databases, including JanusGraph, Twitter's FlockDB, and Facebook's TAO.
Distributed joins are a hard problem. Native graph databases avoid it by keeping the universal data set on a single server (a standalone database) and performing all joins without touching other servers. Neo4j, for example, works this way.
Dgraph: arbitrary-depth joins
Having finished work on Cerebro, and with the experience of building a graph serving system behind me, I joined the Dgraph project as one of its three tech leads. The design involved novel concepts that solved the join-depth problem.
In particular, Dgraph shards graph data so that each join can be executed entirely by one machine. Returning to subject-predicate-object (SPO): each Dgraph instance holds all the subjects and objects corresponding to each predicate assigned to that instance. An instance stores several predicates, each of them in full. This allowed queries with arbitrary join depth and eliminated the broadcast fan-out problem. For example, the query [people in SF who eat sushi] makes at most two network calls to the database, regardless of cluster size. The first call finds everyone who lives in San Francisco; the second sends that list to be intersected with everyone who eats sushi. Further constraints or expansions can then be added, each step still costing at most one more network call.
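Here is a rough sketch, with invented names and no claim to match Dgraph's real implementation, of how predicate-per-server sharding keeps the whole query at two network calls:

```go
// Illustrative only: one simulated server per predicate, each holding that
// predicate in full, so each join step is a single call to one server.
package main

import "fmt"

// predicateServer holds one predicate completely:
// object value -> sorted list of subject IDs (a posting list).
type predicateServer struct {
	postings map[string][]uint64
}

// Subjects returns everyone whose value for this predicate equals v.
func (s *predicateServer) Subjects(v string) []uint64 { return s.postings[v] }

// SubjectsIntersect returns the members of candidates whose value equals v.
// The candidate list travels with the request, so this join step is still
// one call to one server.
func (s *predicateServer) SubjectsIntersect(v string, candidates []uint64) []uint64 {
	in := map[uint64]bool{}
	for _, id := range s.Subjects(v) {
		in[id] = true
	}
	var out []uint64
	for _, id := range candidates {
		if in[id] {
			out = append(out, id)
		}
	}
	return out
}

func main() {
	// Two predicates, each fully owned by one (simulated) server.
	livesIn := &predicateServer{postings: map[string][]uint64{"SF": {1, 2, 3}}}
	eats := &predicateServer{postings: map[string][]uint64{"sushi": {2, 3, 7}}}

	// [people in SF who eat sushi]: call 1 hits the lives-in server,
	// call 2 hits the eats server with the intermediate result attached.
	inSF := livesIn.Subjects("SF")                  // network call 1
	result := eats.SubjectsIntersect("sushi", inSF) // network call 2
	fmt.Println(result)                             // [2 3]
}
```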
This creates the problem of very large predicates on a single server, but it can be solved by splitting a predicate further across two or more instances as it grows. In the worst case, a single predicate would be split across the entire cluster, and that happens only in the degenerate situation where all the data belongs to one predicate. In every other case, this design dramatically reduces query latency in real systems.
Sharding was not the only innovation in Dgraph. Every entity was assigned an integer ID, and these IDs were kept sorted in posting lists so that the lists could be intersected quickly later. This makes it fast to filter during joins, find common references, and so on. Ideas from Google's search backends were very useful here.
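As a small illustration of why sorted integer posting lists help, two ascending lists can be intersected in one linear pass (a standard merge-style intersection, shown here only as an example):

```go
// Illustrative only: merge-style intersection of two sorted posting lists.
package main

import "fmt"

// intersectSorted walks two ascending uint64 posting lists in lockstep.
func intersectSorted(a, b []uint64) []uint64 {
	var out []uint64
	i, j := 0, 0
	for i < len(a) && j < len(b) {
		switch {
		case a[i] < b[j]:
			i++
		case a[i] > b[j]:
			j++
		default: // ID present in both lists
			out = append(out, a[i])
			i++
			j++
		}
	}
	return out
}

func main() {
	peopleInSF := []uint64{1, 2, 3, 5, 8}
	sushiEaters := []uint64{2, 3, 7, 8, 13}
	fmt.Println(intersectSorted(peopleInSF, sushiEaters)) // [2 3 8]
}
```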
Combining all OneBox blocks through Plasma
Dgraph at Google was not a database. It was a serving subsystem, one that also had to react to updates, so it needed an indexing system. I had extensive experience with real-time incremental indexing systems from my work under Caffeine.
I started a project to unify all OneBox blocks under this graph indexing system: weather, flight schedules, events, and so on. You may not know the term OneBox, but you have definitely seen one: it is the separate box that appears for certain kinds of queries, where Google returns richer information. To see a OneBox in action, try [weather in sf].
Previously, each OneBox ran on its own standalone backend maintained by a separate team. There was a rich set of structured data, but the OneBoxes did not share it with one another. First, separate backends multiplied the engineering cost. Second, the lack of shared data limited the range of queries Google could answer.
For example, [events in SF] could show events, and [weather in SF] could show the weather. But if [events in SF] knew it was raining, it could filter or sort events by whether they are indoors or outdoors (in heavy rain, a movie might be a better idea than a football match).
With help from the Metaweb team, we began converting all this data to the SPO format and indexing it with a single system. I named it Plasma: a real-time graph indexing engine feeding Dgraph.
The management leapfrog
Like Cerebro, Plasma was under-resourced but kept gaining momentum. Eventually, when management realized that the OneBoxes were inevitably going to become part of our project, it decided it needed the "right people" in charge of the graph effort. At the height of the political game, three different leaders were rotated through, each with zero experience building graph systems.
During this leapfrog, the managers behind Spanner called Dgraph too complex a system. For reference, Spanner is a globally distributed SQL database that needs its own GPS clocks to guarantee global consistency. The irony of that still blows my mind.
Dgraph was cancelled; Plasma survived. A new team with a new leader was put in charge, with a clear hierarchy reporting up to the CEO. The new team, with little understanding of graphs and their problems, decided to build a serving subsystem on top of Google's existing search index (just as I had done for Cerebro). I suggested using the system I had already built for Cerebro, but that was rejected. I modified Plasma to crawl and expand each knowledge node over several levels so the system could treat it as a web document. They called this system TS (the name is abbreviated).
This meant the new serving subsystem would not be able to perform deep joins. It is a curse I see in many companies, because engineers start from the wrong assumption that "graphs are a simple problem that can be solved by building a layer on top of some other system."
A few months later, in May 2013, I left Google, having worked on Dgraph/Plasma for about two years.
Afterword
- A few years later, the "Web Search Infrastructure" group was renamed "Web Search Infrastructure and Knowledge Graph," and the executive to whom I had once demoed Cerebro went on to lead the Knowledge Graph effort, talking about how they intend to replace simple blue links with knowledge, answering user questions directly as often as possible.
- When the Shanghai team working on Cerebro was close to putting it into production, the project was taken from them and handed to the New York office. Eventually it launched as the Knowledge Strip. If you search for [tom hanks movies], you will see it at the top. It has improved a bit since the first launch, but still does not support the level of filtering and sorting that Cerebro was designed for.
- All three tech leads who worked on Dgraph (myself included) eventually left Google. As far as I know, the other two now work at Microsoft and LinkedIn.
- I did get two promotions at Google, and I was due a third when I left the company as a Senior Software Engineer.
- Judging by fragmentary rumors, the current version of TS is actually very close to the design of Cerebro's graph system, with an index for each subject, predicate, and object. It therefore still suffers from the join-depth problem.
- Plasma has since been rewritten and renamed, but it still runs as the real-time graph indexing system serving TS. Together they continue to host and process all of Google's structured data, including the Knowledge Graph.
- Google's inability to do deep joins shows up in many places. For example, we still see no data sharing between OneBoxes: [cities by most rain in asia] does not return a list of cities, even though all the data is in the knowledge graph (instead, a web page is quoted in the results); [events in SF] cannot be filtered by weather; [US presidents] results cannot be sorted, filtered, or expanded by other facts, such as their children or the schools they attended. I believe this was one of the reasons Freebase support was discontinued.
Dgraph: the phoenix
Two years after leaving Google, I decided to build Dgraph. At other companies I see the same indecision about graphs that I saw at Google. The graph space is full of half-baked solutions: many custom layers hastily assembled on top of relational or NoSQL databases, or graphs offered as one of many features of a multi-model database. Where a native solution existed, it suffered from scalability problems.
Nothing I saw had a coherent story backed by a performant, scalable design. Building a horizontally scalable graph database with low latency and arbitrarily deep joins is extremely hard, and I wanted to make sure we built Dgraph right.
The Dgraph team has spent the last three years not only drawing on my experience but also putting an enormous amount of their own effort into the design, building a graph database that has no analogue on the market. Companies now have the option of using a solid, scalable, performant solution instead of yet another half-finished one.