Semantic technology made simple and accessible, using family trees as an example

A program that can draw logical conclusions within the scope of a task may seem like a technical miracle and the embodiment of Skynet. But, as you will see below, writing such a program in Python today is not difficult if you use semantic technology. We will focus on a good example of ontologies, family trees, and for any member of a family tree we will be able to derive kinship relations of arbitrary complexity (limited only by computational resources). For example, the Romanov family tree below shows the first cousin twice removed of the Russian emperor Peter II.

[Image: Romanov family tree showing the first cousin twice removed of Peter II]

So if you want to get hands-on with semantic web technologies, read on: we will practice on family trees.

You can read about triples, RDF, and ontologies on Wikipedia or in other posts. To describe kinship relations in family trees, we use the OWL 2 ontology Family History Knowledge Base (FHKB). Note that the authors of FHKB, although they consider their creation a good teaching example, still do not recommend OWL 2 for real genealogy applications because of its computational cost for today's reasoners. Our application will remain educational: we restrict ourselves to small genealogies of up to a hundred family members.

Genealogy data is usually available in the GEDCOM text format (.ged). Some genealogy portals and family tree programs let you export relationship graphs in this format. We will read GEDCOM using the Python library of the same name and generate triples for the individuals (the so-called ABox) of the FHKB ontology. The logic for deriving kinship (the TBox) is already there, so all we need to do is supply the data to which this logic will be applied.

Suppose we have the following data (in abstract form) for three individuals, taken from the aforementioned family of Russian tsars:

Alexander I *is-brother-of* Nicholas I.
Nicholas I *is-father-of* Alexander II.


and the following FHKB logic:

The property *is-uncle-of* is a chain of the properties *is-brother-of* and *is-father-of*.


Then the reasoner can establish the following fact:

Alexander I *is-uncle-of* Alexander II.


The same information in Turtle RDF is below. It is compact and fairly easy to read:

fhkb:i1 a owl:NamedIndividual ;
    fhkb:isBrotherOf fhkb:i2 ;
    rdfs:label "Alexander I" .
fhkb:i2 a owl:NamedIndividual ;
    fhkb:isFatherOf fhkb:i3 ;
    rdfs:label "Nicholas I" .
fhkb:i3 a owl:NamedIndividual ;
    rdfs:label "Alexander II" .
fhkb:isFatherOf a owl:ObjectProperty ;
    rdfs:label "is-father-of" .
fhkb:isBrotherOf a owl:ObjectProperty ;
    rdfs:label "is-brother-of" .
fhkb:isUncleOf a owl:ObjectProperty ;
    owl:propertyChainAxiom ( fhkb:isBrotherOf fhkb:isFatherOf ) ;
    rdfs:label "is-uncle-of" .


(Note: some details are omitted here for clarity. In the original FHKB, the properties isFatherOf, isBrotherOf and isUncleOf are defined slightly differently to optimize reasoning.)
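
To make the property chain concrete, here is a minimal sketch in plain Python. It does not use RDF or FHKB at all: the triples are ordinary tuples, and the helper `apply_chain` is a hypothetical name introduced only for this illustration. It shows the single inference step that the property chain axiom licenses.

```python
# Toy illustration of a property chain; plain tuples stand in for RDF triples.
# apply_chain is a hypothetical helper, not part of any library.
facts = {
    ("Alexander I", "isBrotherOf", "Nicholas I"),
    ("Nicholas I", "isFatherOf", "Alexander II"),
}


def apply_chain(facts, first, second, derived):
    """Return triples (a, derived, c) implied by a --first--> b --second--> c."""
    new = set()
    for a, p, b in facts:
        if p != first:
            continue
        for b2, q, c in facts:
            if q == second and b2 == b:
                new.add((a, derived, c))
    return new


inferred = apply_chain(facts, "isBrotherOf", "isFatherOf", "isUncleOf")
print(inferred)
```

A real reasoner applies many such axioms at once and repeats the process until nothing new can be derived; this sketch performs only one pass for one axiom.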

So, we declared the individuals i1, i2 and i3 and the properties isFatherOf and isBrotherOf, assigned these properties to the individuals, and introduced the new property isUncleOf. Note the prefixes rdfs:, owl: and fhkb:, which show the vocabularies involved. The prefix rdfs: denotes the standard RDF Schema (in the example above, the label property). The prefix owl: denotes standard ontology terms (individual, property, property chain, etc.). And the prefix fhkb: is the FHKB genealogical ontology we use, which defines the logic of kinship relations (isFatherOf, isBrotherOf, isUncleOf, as well as other terms such as isGrandfatherOf, isFirstCousinOf, etc.).

For each individual, it is enough to take from GEDCOM only the minimal information about fatherhood and motherhood, brothers, sisters and marriages (in fact, GEDCOM contains no other kinship information). All other kinship relations, for which FHKB gives us the logic, will be derived by the reasoner.
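
As a rough sketch of this extraction step, the snippet below reads a tiny hand-written GEDCOM fragment and emits parenthood triples in Turtle. It is not the gedcom2ttl.py script from the repository and it does not use the gedcom library: the parser is a simplified stand-in written with the standard library only, the functions `parse_families` and `to_turtle` are hypothetical names, and marriages and siblings are left out for brevity. The identifier scheme (fhkb:i1 and so on) simply mirrors the Turtle example above.

```python
# Simplified stand-in for GEDCOM -> Turtle conversion (assumption: only
# level-0 INDI/FAM records and level-1 NAME/HUSB/WIFE/CHIL tags matter here).
SAMPLE_GEDCOM = """\
0 @I1@ INDI
1 NAME Nicholas I
1 SEX M
0 @I2@ INDI
1 NAME Alexandra Feodorovna
1 SEX F
0 @I3@ INDI
1 NAME Alexander II
1 SEX M
0 @F1@ FAM
1 HUSB @I1@
1 WIFE @I2@
1 CHIL @I3@
"""


def parse_families(text):
    """Collect individual names and the HUSB/WIFE/CHIL members of each family."""
    names, families = {}, {}
    current = None  # xref id of the level-0 record being read
    for line in text.splitlines():
        parts = line.split(" ", 2)
        if len(parts) < 2:
            continue
        level, second = parts[0], parts[1]
        rest = parts[2] if len(parts) > 2 else ""
        if level == "0":
            current = second if second.startswith("@") else None
            if rest == "FAM":
                families[current] = {"HUSB": None, "WIFE": None, "CHIL": []}
        elif level == "1" and current is not None:
            if second == "NAME":
                names[current] = rest
            elif current in families and second in ("HUSB", "WIFE"):
                families[current][second] = rest
            elif current in families and second == "CHIL":
                families[current]["CHIL"].append(rest)
    return names, families


def to_turtle(names, families):
    """Emit one Turtle statement per individual and per parent-child link."""
    def ident(xref):
        return "fhkb:" + xref.strip("@").lower()

    lines = []
    for xref, name in sorted(names.items()):
        # Note: names are not escaped here; real data would need that.
        lines.append('%s a owl:NamedIndividual ; rdfs:label "%s" .' % (ident(xref), name))
    for fam in families.values():
        for child in fam["CHIL"]:
            if fam["HUSB"]:
                lines.append("%s fhkb:isFatherOf %s ." % (ident(fam["HUSB"]), ident(child)))
            if fam["WIFE"]:
                lines.append("%s fhkb:isMotherOf %s ." % (ident(fam["WIFE"]), ident(child)))
    return lines


turtle = to_turtle(*parse_families(SAMPLE_GEDCOM))
print("\n".join(turtle))
```

Real GEDCOM files contain deeper nesting, continuation lines and many more tags, which is why a dedicated library is the better choice in practice.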

[Image: illustration of kinship relations]

So, the logical base (TBox) is available in the Turtle file header.ttl in the repository for this article. The genealogy of the Romanov royal family is also there in GEDCOM format, though the reader is encouraged to use their own to make it more interesting. And here is the script that generates individuals for the FHKB ontology from a GEDCOM file: gedcom2ttl.py. (After cloning the repository, install the Python dependencies with pip install -r requirements.txt.) Copy the FHKB logic header.ttl to a new file and append the script's output to it:

cp data/header.ttl romanov_family.ttl
./gedcom2ttl.py data/tsars.ged >> romanov_family.ttl


As a result, we have an ontology (TBox + ABox) in Turtle format, which can be opened in any external editor (for example, Protégé). If necessary, Turtle can be converted to OWL XML with the ttl2owl.py script. From here, deriving kinship in this ontology is a routine matter.

I know of three modern open-source reasoners for Python: RDFClosure, FuXi, and FaCT++ with the owlcpp wrapper. In fact, there are many more if you get Python to work with the Java virtual machine (historically, Java leads in semantic technologies and offers a much larger set of tools). The three are listed in order of increasing complexity and performance. The first is a naive brute-force approach that generates all possible triples. The second (FuXi) is based on an infix Python notation for OWL and the Rete algorithm. The third (FaCT++) is a low-level optimized implementation of the tableau algorithm and, overall, one of the most efficient open-source reasoners today.

For our purposes the first one (RDFClosure) is enough, especially since it is written in pure Python and installs with a trivial pip install command. Reasoning over the Romanov genealogy tsars.ged (41 family members) takes RDFClosure about ten seconds on a laptop with an Intel Core i7 at 1.70 GHz.

As already mentioned, the drawback of OWL 2 for family trees is computational complexity. I omitted some of the kinship relations shown in the illustration above, and trimmed the Romanov family tree down to the royal persons and their closest relatives, so that the demonstration would not load your computer too heavily. If you define all the kinship relations from the illustration above and expand the genealogy to even a few hundred family members, RDFClosure becomes unusable (FaCT++, however, keeps working).

Run the reasoning for the ontology obtained above:

./infer.py romanov_family.ttl


While the reasoning is running, I will explain the key points of the infer.py script. Its essence fits in six lines:

import rdflib  # RDF graph handling
from RDFClosure import DeductiveClosure, OWLRL_Extension  # the reasoner
g = rdflib.Graph()
g.parse("romanov_family.ttl", format="turtle")
DeductiveClosure(OWLRL_Extension).expand(g)  # adds inferred triples to g in place
print(g.serialize(format="turtle"))


The first two lines import the RDFLib library, which provides the means for working with ontologies, and the RDFClosure reasoner. The third and fourth lines declare a graph and fill it with the contents of the ontology romanov_family.ttl. The fifth line starts the reasoning, which here is nothing more than repeatedly extending the input graph with new triples according to the OWL 2 RL rules. The sixth line prints the resulting graph (in the same Turtle format).
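
The idea of repeatedly extending a graph until nothing new appears can be sketched without any RDF library. The snippet below is not how RDFClosure is implemented; it is a toy fixed-point loop over plain tuples, and the rule table `CHAIN_RULES` and the function `expand` are hypothetical names chosen for this illustration (the property names are used only as labels). It shows why several passes are needed: the great-uncle relation can only be derived after the uncle relation has been derived in an earlier pass.

```python
# Toy forward chaining to a fixed point; plain tuples stand in for RDF triples.
# Hypothetical rule table: (first property, second property) -> derived property.
CHAIN_RULES = {
    ("isBrotherOf", "isFatherOf"): "isUncleOf",
    ("isFatherOf", "isFatherOf"): "isGrandfatherOf",
    ("isUncleOf", "isFatherOf"): "isGreatUncleOf",
}

facts = {
    ("Alexander I", "isBrotherOf", "Nicholas I"),
    ("Nicholas I", "isFatherOf", "Alexander II"),
    ("Alexander II", "isFatherOf", "Alexander III"),
}


def expand(triples):
    """Add chain-derived triples until a full pass adds nothing new."""
    closure = set(triples)
    while True:
        new = set()
        for a, p, b in closure:
            for b2, q, c in closure:
                derived = CHAIN_RULES.get((p, q))
                if b2 == b and derived:
                    new.add((a, derived, c))
        if new <= closure:  # fixed point reached
            return closure
        closure |= new


result = expand(facts)
for triple in sorted(result):
    print(triple)
```

Each pass compares every pair of triples, so the cost grows quickly with the size of the graph, which is consistent with the brute-force approach becoming impractical for larger genealogies.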

So, we have the result, romanov_family.ttl.inferred (on disk it is several times larger than the input file). Let's prepare it for visualization. I wrote a simple HTML5 application (index.html) that shows the graph of inferred kinship relations in a browser using the D3.js JavaScript library. It is available in the online branch of the repository for this article. The edges of the graph correspond to the information taken from GEDCOM (marriages, isFatherOf, isMotherOf), and the derived kinship relations are highlighted in different colors when a family member is selected, either by hovering or, on touch screens, by tapping. The graph for this application is specified in JSON with a very simple structure: a list of edges, each giving its vertices (individuals) and the type of connection (relationship) between them. The ontology obtained in the previous step is converted to this JSON by the ttl2json.py script:

./ttl2json.py romanov_family.ttl.inferred > romanov_family.json


By default, the HTML5 application loads the JSON at data/tsars.json. The new JSON you generated can be loaded into the browser with a click of a button on the web page (it uses the File API, so no server is needed and the visualization works offline).
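
To give a feel for such an edge list, here is a minimal sketch that turns a few triples into JSON. The exact schema produced by ttl2json.py is not reproduced here: the key names `source`, `target` and `type`, as well as the function `triples_to_edges`, are assumptions made for this illustration only.

```python
import json

# Human-readable labels for the individuals (same ids as in the Turtle example).
labels = {"i1": "Alexander I", "i2": "Nicholas I", "i3": "Alexander II"}

# A few triples, including one derived relation (isUncleOf).
triples = [
    ("i1", "isBrotherOf", "i2"),
    ("i2", "isFatherOf", "i3"),
    ("i1", "isUncleOf", "i3"),
]


def triples_to_edges(triples, labels):
    """Build a list of edges; the key names are hypothetical, not the real schema."""
    return [
        {"source": labels[s], "target": labels[o], "type": p}
        for s, p, o in triples
    ]


edges = triples_to_edges(triples, labels)
print(json.dumps(edges, indent=2))
```

A flat list like this is easy to load in the browser and to filter by relationship type when highlighting the relatives of a selected person.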

All the above commands are collected in the shell script gedcom2json.sh. With it, you can convert GEDCOM genealogies directly into JSON with derived kinship relations for visualization. Adding inference and visualization of further kinship relations is relatively simple: first, add the appropriate logic to the FHKB TBox; second, add the identifier of the new relation to the Turtle-to-JSON converter ttl2json.py; third, specify the color, name and identifier of the new relation in the HTML5 rendering code. Of course, generating JSON from GEDCOM will then take somewhat longer.

In addition, there is an idea that mind maps can serve as input data for any ontology (not only a genealogical one). Of course, when drawing one you must follow clear rules so that the map can be transferred to the ontology's ABox using, for example, the Python XMind SDK. That is how, for example, I ran logical reasoning over my own family tree, which I had historically kept as a mind map.

To summarize: by specifying only the closest kinship relations between family members (brothers and sisters, marriages, fatherhood and motherhood) and defining the logic of the remaining ones, we were able to derive all the others thanks to semantic technologies. In doing so, we touched on the powerful tooling that underlies products such as Wolfram Alpha and the Google Knowledge Graph. Ontologies and reasoners are mature and widely used technologies today, but, unfortunately, the barrier to entry in this field is by no means low.

Link to the repository for this article: github.com/blokhin/genealogical-trees
HTML5 application: blokhin.github.io/genealogical-trees/#en
Public GEDCOM files can be exported from genealogy portals, for example, www.wikitree.com

Enjoy your dive into semantic technology, and don't let Skynet scare you!
