Frolenarzt December 23, 2012 at 14:57

MilkyWeb - Graph of Everything

In this article I want to share my thoughts on how to solve the fundamental problems of the modern Internet. I want to describe a model that, in my opinion, can help to better organize knowledge on the Internet, and demonstrate my attempt to implement such a model.

Intro

Social networks and search engines try to organize as much information as possible about the world around them and in particular about the user.
In computer science, any subject area (domain) is described with ontologies or their simplified form, graphs. They are the most precise way to describe a knowledge base for a computer.

Many specialized graphs already exist on the web: Facebook, LinkedIn, Foursquare, etc.
As you know, Google is expanding its Knowledge Graph and actively uses it in its search engine.

The problem is that the world contains an infinite number of subject areas, and the customary way to create a new graph is to create a new social network.

The MilkyWeb project (MW, MilkyWeb) that I want to present is an attempt to create a universal tool for describing any subject area (creating any graph) in one place.
In other words, this is an attempt to create a universal ~~social network~~ knowledge base of everything in the world.

The project has not yet left the alpha stage, so the interface leaves much to be desired; please bear with it. The layout was built and tested only in Chrome. I decided not to spend time on cross-browser support, so I apologize to users of other browsers.

Ideology

The ideology of the project is based on the ontology model - the mathematical representation of knowledge.
It is based on three pillars: concepts, individuals, and predicates.

Concepts are abstract notions of the world. Roughly speaking, these are generalized (collective) names of the things and phenomena that surround us.
Many concepts form a hierarchy. For example, the concept “Programmer” derives from the concept “Man”, which in turn derives from “Organism”.
You can draw an analogy with programming: concepts in an ontology are classes in OOP.
Concepts are of two types: abstract concepts and sets (or ancestors). The concept “friendship” is abstract, while “machine” names a set of real objects.

Individuals are objects that surround us in the real world. Each individual is an implementation of at least one ancestor concept. In OOP terms, individuals are instances of classes.
For example, the object “Albert Einstein” is an individual of the concept “Scientist”. Inheritance is naturally supported: since a “Scientist” is a “Man”, “Albert Einstein” is also a “Man”.
When a new user creates an account in MW, this actually creates a new individual of the concept “Man” in the ontology.
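
To make the OOP analogy concrete, here is a minimal sketch of a concept hierarchy with inheritance. The class and method names are hypothetical and are not taken from the MilkyWeb code base, and for brevity each individual realizes exactly one concept.

```python
class Concept:
    """A node in the concept hierarchy (hypothetical class, for illustration)."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent  # the concept this one derives from, if any

    def ancestors(self):
        """Yield this concept and every concept it derives from."""
        node = self
        while node is not None:
            yield node
            node = node.parent


class Individual:
    """A real-world object that realizes a concept."""

    def __init__(self, name, concept):
        self.name = name
        self.concept = concept

    def is_a(self, concept):
        """True if the individual's concept is `concept` or derives from it."""
        return any(c is concept for c in self.concept.ancestors())


organism = Concept("Organism")
man = Concept("Man", parent=organism)
scientist = Concept("Scientist", parent=man)
programmer = Concept("Programmer", parent=man)

einstein = Individual("Albert Einstein", scientist)
print(einstein.is_a(man))         # inherited through "Scientist"
print(einstein.is_a(programmer))  # a sibling concept, not inherited
```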

In terms of graphs, concepts and individuals are vertices of a graph, while predicates appear as edges (or arcs).

Predicates are properties by which vertices of a graph are interconnected.
A simple example of a predicate, as many might have guessed, is the friendship relationship in FB or VK.

The connection shown in the figure above is called a triplet, because three components participate in it: the subject “Richard Feynman” (a vertex of the graph), the predicate “Born” (an arc of the graph), and the object “New York” (a vertex of the graph).
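
As an illustration only, a triplet can be modeled as a three-field record and a graph as a list of such records. The names `Triplet` and `objects_of` are my own assumptions for this sketch, not part of MilkyWeb.

```python
from collections import namedtuple

# subject and object are vertices of the graph, predicate is the arc between them
Triplet = namedtuple("Triplet", ["subject", "predicate", "object"])

graph = [
    Triplet("Richard Feynman", "Born", "New York"),
]


def objects_of(graph, subject, predicate):
    """Return every object linked to `subject` by `predicate`."""
    return [t.object for t in graph
            if t.subject == subject and t.predicate == predicate]


print(objects_of(graph, "Richard Feynman", "Born"))
```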

In fact, the whole task of the MilkyWeb project boils down to letting the user create a page for any object of the surrounding world (concept or individual) and link it to other pages in a semantically correct way (using a predicate).

Each predicate is created in conjunction with one or more concepts.
For example, the properties “friend” and “mother” can only hold between individuals of the concept “Man”, while the predicate “CEO” can link a “person” and a “company”.

Predicates can also be literal. Such “literal” predicates do not point to a vertex of the graph but to a value. Each literal has a type, for example: string, integer, date, geographical coordinates, etc. (only URL literals are currently supported).

Concepts and predicates are the skeleton of any ontology, that is, the template on which the entire graph is built, so at the moment these entities can only be created by the site administration. This process includes not only the creation of entities as such, but their configuration, which the user does not see.
For example, each predicate has a threshold for the maximum number of triplets. So with the predicate "mother" an individual can have only one triplet, and with the predicate "friend" - many.
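
The predicate configuration described above can be sketched roughly as follows. Everything here is an assumption made for illustration: the field names (`subject_concept`, `object_concept`, `max_triplets`), the `Graph` class, and the sample individuals are hypothetical, and concept inheritance is ignored to keep the sketch short.

```python
class Predicate:
    """Predicate configuration (hypothetical field names)."""

    def __init__(self, name, subject_concept, object_concept, max_triplets=None):
        self.name = name
        self.subject_concept = subject_concept  # concept the subject must belong to
        self.object_concept = object_concept    # concept the object must belong to
        self.max_triplets = max_triplets        # None means "no limit"


class Graph:
    def __init__(self):
        self.concept_of = {}  # individual name -> concept name
        self.triplets = []    # (subject, predicate name, object)

    def add_individual(self, name, concept):
        self.concept_of[name] = concept

    def can_add(self, subject, predicate, obj):
        """Check the concept binding and the per-subject triplet threshold."""
        if self.concept_of.get(subject) != predicate.subject_concept:
            return False
        if self.concept_of.get(obj) != predicate.object_concept:
            return False
        if predicate.max_triplets is None:
            return True
        used = sum(1 for s, p, _ in self.triplets
                   if s == subject and p == predicate.name)
        return used < predicate.max_triplets

    def add_triplet(self, subject, predicate, obj):
        if not self.can_add(subject, predicate, obj):
            raise ValueError("triplet violates the configuration of %r" % predicate.name)
        self.triplets.append((subject, predicate.name, obj))


mother = Predicate("mother", "Man", "Man", max_triplets=1)
friend = Predicate("friend", "Man", "Man")

g = Graph()
for person in ("Ann", "Bob", "Eve"):
    g.add_individual(person, "Man")
g.add_individual("Acme", "Company")

g.add_triplet("Bob", mother, "Ann")  # the only "mother" triplet Bob may have
g.add_triplet("Bob", friend, "Ann")
g.add_triplet("Bob", friend, "Eve")  # "friend" has no threshold
```

With this setup, a second “mother” triplet for Bob or a “friend” triplet pointing at a company would be rejected by `can_add`.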

Example

As I said, the administration creates the ontology framework, and users fill it.
I will give an example of filling the subject area based on the concept of "film".

The administrator creates the concept of “movie”, and a set of necessary predicates such as “cast”, “director”, “producer”, “premiere”, “country”, “favorite movie”, “watch”.

The FOO user, based on the concept of “movie”, creates the page (individual) “Pirates of the Caribbean” and begins to “describe” it.
Using the cast predicate, he indicates that the individuals Johnny Depp and Keira Knightley starred in the film.
He then links the page to the producer, director and country.
With the literal predicate “premiere”, the user indicates that the film premiered on June 28, 2003.

Okay, the main details about the movie have been entered, but what next? Next, FOO may indicate that “Pirates of the Caribbean” is his “favorite movie”.
At this time, the GOO user, who is a friend of FOO, happened to glance at his monitor and saw the triplet just created by FOO in his feed. He took it as a call to action and decided to ~~download the movie from torrents~~ buy a DVD of the film and watch it immediately! As he started to enjoy the movie, he created the “watch” triplet with “Pirates of the Caribbean”, thereby telling the whole world about a small part of his life with one click!

I chose the subject area “films” as an example for a reason. Facebook engineers are currently working on structuring exactly such moments in people's lives. Read more: www.wired.com/business/2012/11/mike-vernal-facebook

I also want to note that the predicate “favorite movie” and the “I like” button on the movie page on the IMDB website are not the same thing. The semantics of likes are very blurry and do not allow us to say unambiguously what the user had in mind when he “liked” a particular page.

This structure greatly simplifies the description of a particular subject area. While Facebook has a fixed set of templates for creating pages, in the system described above templates can be created on the fly. If at some point we decide to bring a new subject area into the social network, we only need to create the set of concepts and predicates that are characteristic of that area.

At the moment, all created pages support only English (keep this in mind when searching). A localization mechanism for other languages is planned.

Data sharing and Big Data problems

I did not find a suitable expression in Russian that denotes sharing in the sense of “sharing information” or “disseminating information,” so I left the term in the header without translation.

Recently, it has become customary to use the term Big Data for areas characterized by rapid growth in the amount of information. By definition, the term implies a problem: data is generated so fast that the most valuable information can be lost in the general flow. To keep information from being lost, it must be structured and classified.

As practice has shown, the formation of a news feed based on posts from friends is not the best option. More precisely, this method is good in order to receive information about people around you, but not about things of interest in general.
As a result, the VKontakte news feed is littered with cat pictures and quotes of “great” people. You can try subscribing to thematic public pages, but this does not guarantee that everything currently being generated that might interest the user reaches his feed.
Facebook adds one workaround after another to deliver only the most relevant information to the user’s feed. To some extent this is enough, but the feed-building algorithm relies on user actions (likes, comments, etc.), so it is not universal either.

In my opinion, Twitter and Hacker News come closest to the “came, caught up on everything relevant, left” model.
Therefore, I initially tried to make the mechanics of disseminating information in MilkyWeb a cross between Twitter and HN. That is, the user visits the site and receives all the information from the last period X that might interest him.
And not only from the pages he is subscribed to (as in Twitter, FB, VK), but also from thematic streams (as in HN).

In MW you can distribute text (up to 2000 characters), links and videos (YouTube). There are no photos yet - they are expensive to store.

How can a user share information and who will receive this information?

A user can:

1. Post messages to his own page.
   The message reaches the users who follow the sender. Note that if user A has created at least one triplet with user B, A is considered subscribed to B.
2. Post messages on another user's page.
   The message reaches only the addressee.
3. Post messages on the pages of individuals that are not users.
   The message reaches everyone subscribed to that entity.
4. Broadcast messages in thematic threads.
   Thematic threads are tied to concepts. That is, you can post a message on the “Programming” page. In this case, the message reaches everyone subscribed to “Programming”, as well as all users who “inherit” the concept “Programmer”.
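
The four delivery rules above can be sketched as a single function. This is only my reading of the rules, not the site's actual implementation: the function name `recipients` and the three lookup structures (`users`, `subscribers`, `members`) are hypothetical, and the sender is excluded from his own recipients as an extra assumption.

```python
def recipients(sender, target, users, subscribers, members):
    """Return the set of users who receive a message posted on page `target`.

    users:       set of user names
    subscribers: dict, page name -> set of users subscribed to that page
    members:     dict, concept name -> set of users who are individuals of it
    """
    if target == sender:
        # Own page: everyone who follows the sender.
        result = set(subscribers.get(sender, set()))
    elif target in users:
        # Another user's page: only the addressee.
        result = {target}
    elif target in members:
        # Thematic thread: subscribers of the concept plus users who inherit it.
        result = subscribers.get(target, set()) | members[target]
    else:
        # Page of a non-user individual: everyone subscribed to the entity.
        result = set(subscribers.get(target, set()))
    return result - {sender}


users = {"A", "B", "C"}
subscribers = {"B": {"A"}, "Moscow": {"C"}, "Programmer": {"A"}}
members = {"Programmer": {"B"}}

print(recipients("B", "B", users, subscribers, members))           # B's followers
print(recipients("C", "Programmer", users, subscribers, members))  # thematic thread
```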

The last two methods are an attempt to solve the Big Data problem. The basic idea is this:
User X has information that is thematically related to a particular area of real life. He does not think about where to post this information but simply throws it into the general thematic stream of the relevant subject area. The task of the system is then to select the most valuable data from the general stream based on the actions of other users (for example, ratings or reposts).

Work on this mechanism is still underway. There is no content ranking system yet, but it will be implemented in the near future, and I have ideas on how to use all this to make the user's news feed more relevant than in other networks. It is the model described in the previous chapter that makes it possible to distinguish concepts semantically and unambiguously and to classify information correctly.

Naturally, this approach can generate waves of spam. At the moment, you cannot post more than one message on the site every 20 seconds. In the future I will solve this problem more intelligently. The challenge now is to test the mechanics for viability and to identify possible critical points.
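
A limit of this kind can be sketched in a few lines. The class name `PostThrottle` is hypothetical, and applying the limit per user is my assumption; the text above only says one message per 20 seconds.

```python
class PostThrottle:
    """Allow at most one post per user every `min_interval` seconds."""

    def __init__(self, min_interval=20.0):
        self.min_interval = min_interval
        self.last_post = {}  # user -> timestamp of the last accepted post

    def allow(self, user, now):
        """Return True and record the post if the user may post at time `now`."""
        last = self.last_post.get(user)
        if last is not None and now - last < self.min_interval:
            return False
        self.last_post[user] = now
        return True


throttle = PostThrottle()
first = throttle.allow("FOO", 0.0)     # accepted
second = throttle.allow("FOO", 19.9)   # rejected, too soon
third = throttle.allow("FOO", 20.0)    # accepted again
```

Passing the timestamp in explicitly (instead of reading the clock inside `allow`) keeps the sketch easy to test; in a real service the value would come from something like `time.monotonic()`.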

As the reader has probably guessed, such a system has great potential for distributing targeted content. You can build complex selections to pick the target audience. For example, send a message to everyone who is a “Programmer” and “lives in” “Moscow”; or to those who “bought” an “iPhone” and “bought” an “iPad”; or to anyone who drives a Mercedes.
Perhaps someday this will become a way of monetization, but for now the mission of the MilkyWeb project is different. I want to talk about it in the next chapter.
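
The audience selections mentioned above amount to intersecting the subjects of several triplet patterns. Here is a minimal sketch; the function name, the predicate labels (“is a”, “lives in”, “bought”) and the sample data are invented for illustration.

```python
def select_audience(triplets, conditions):
    """Return subjects that have a triplet for every (predicate, object) condition."""
    matches = None
    for predicate, obj in conditions:
        subjects = {s for s, p, o in triplets if p == predicate and o == obj}
        matches = subjects if matches is None else matches & subjects
    return matches if matches is not None else set()


triplets = [
    ("FOO", "is a", "Programmer"),
    ("FOO", "lives in", "Moscow"),
    ("FOO", "bought", "iPhone"),
    ("GOO", "is a", "Programmer"),
    ("GOO", "bought", "iPhone"),
    ("GOO", "bought", "iPad"),
]

moscow_programmers = select_audience(
    triplets, [("is a", "Programmer"), ("lives in", "Moscow")])
apple_buyers = select_audience(
    triplets, [("bought", "iPhone"), ("bought", "iPad")])
```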

Semantic Web

The Semantic Web (SW) is a web space in which human-generated content is understandable to a computer.
This can be achieved by adding metadata to the web document (e.g. html). Metadata is widely used on the network and plays important roles in searching, structuring data, etc.
But in order for the search engine to be able to “understand” the content of a page, it is necessary that this page is accompanied by a separate document with a computer-friendly description (in the form of a graph) of that part of the world that is discussed on the original page.

The specification requires that such meta-documents be compiled in RDF format. The problem is that these files must be created by someone in order to be attached to the html document.

This is the problem I set out to solve two years ago as my thesis. The goal was to build a convenient, interactive tool for creating RDF descriptions: a centralized repository of metadata, where descriptions accumulate and are not duplicated.

Over time, I drifted a little from that direction in favor of the social aspect. But it is already possible to get an RDF description of an entity at milkyweb.net/rdf/{c|p|i}/entity_id. For example, the RDF documents of the individual “Moscow” and the concept “Human” are located at milkyweb.net/rdf/i/10460 and milkyweb.net/rdf/c/10000, respectively (user information is, naturally, not public).
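
As a small illustration of the address scheme, the helper below builds such an address from an entity kind and id. The function name is hypothetical, and reading the letters as c = concept, p = predicate, i = individual is my inference from the two examples above (the “p” case is an assumption).

```python
# Assumed meaning of the one-letter path segment.
KIND_LETTER = {"concept": "c", "predicate": "p", "individual": "i"}


def rdf_address(kind, entity_id):
    """Build the address of an entity's RDF description (no scheme, as in the text)."""
    return "milkyweb.net/rdf/{}/{}".format(KIND_LETTER[kind], entity_id)


moscow = rdf_address("individual", 10460)
human = rdf_address("concept", 10000)
```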

That is, all the webmaster has to do is attach a link to the relevant object to a page of his site. In the future, a search engine will fetch the document at the specified URL and be able to classify the content on the page, increasing the relevance of search results. Or it will be possible to observe in real time how content about an entity appears across the Internet. Cool, isn't it? :)

For specialists in this field, I note that integration with existing vocabularies is planned.

Of course, I am greatly simplifying everything. One social network is not enough to popularize the Semantic Web. Most likely, special frameworks for web developers are needed to automate the process of marking up content with metadata. But I believe that sooner or later such a mechanism will work, and the first step in this direction is the creation of a global Internet knowledge base.

Problems

The biggest problems I encountered lie in the ideology of the project and in the terminology of ontologies as such.

All W3C Semantic Web technology specifications (RDF, OWL) state that concepts, individuals, and predicates are enough to describe web ontologies, and I believed this for a while.

In the Russian Wikipedia you can find such a description of the concept "Concept":

Concepts are abstract groups, collections, or sets of objects. They may include instances, other classes, or combinations of both.

And then comes an example with a slight digression:

The concept “people”, the nested concept “person”. Whether “person” is a nested concept or an instance (individual) depends on the ontology.

An inconspicuous remark at first glance (in italics) is a fundamental problem of philosophers of all time.

If we begin to create a global ontology according to these “classical rules”, our entire structure will immediately collapse, as I personally saw.
Initially, I believed that concepts in my network are abstract concepts that may or may not “have” individuals.
And individuals, in turn, are real objects that can be felt with your hands and which “realize” some concepts.

But suppose we have the concept “Phone”. Now we need to create an iPhone page. But what is an iPhone: a concept or an individual? Suppose it is an individual. At some point, the FOO user decides to create a personal page for the particular “iPhone” device that lies in his pocket. Why? It doesn't matter, maybe he wants to put it up for sale. The important thing is that if “iPhone” is an individual, it is no longer possible to create a page for a specific device, because we have capped the level of abstraction and the system ceases to be holistic.
Okay, suppose iPhone is a concept. But we initially decided that concepts are fundamental notions that cannot come and go over time. That is, we are not going to create a separate concept in the hierarchy for every new product humanity makes.

Therefore, the very idea that there are concepts and individuals in the world is true only within a predetermined framework and this approach cannot be used to create a global ontology.

There are many such pitfalls, and I think that a universal way of describing the world can only be created through trial and error.

Outro

I do not expect a quick return from the project; as I said earlier, at the moment this is an experiment.
Many questions and problems remain acute. Perhaps a global graph cannot exist at all. Or perhaps the proposed approach is simply unsuitable for creating it.
The goal of my activity is to find in practical ways possible solutions to the fundamental problems of the global web.

Everything I described above is just the tip of the iceberg of my ideas. If the topic proves relevant, I will try to continue the series of articles.

I will be grateful for feedback of any kind! You can write in the comments, in a PM, or through the form on the site (post bugs and hacks there too).
If the ideas presented to someone seem interesting, and this someone wants to take part in the development of the project, I am open to cooperation (the core of the site is Java + MySQL).

By development I mean not only programming but also filling the knowledge base. The network currently holds about 1000 entities in different subject areas, which is, of course, very few. If you did not find the page of your city, country, favorite band, film, etc., try creating such a page and share your user experience.

PS: If you requested an invitation, do not be surprised if it does not arrive immediately. The SMTP server is our bottleneck. You can send me a private message and I will give it a push.

Thank you for your attention!
