Moskus October 29, 2015 at 10:31

Natural language concepts versus formal classifications in OpenStreetMap

Anyone even slightly familiar with the OpenStreetMap project has probably heard of two principles at its core: “any tags you like”, and the idea that the content of the cartographic database is what matters, not how the Standard style on osm.org renders it. But is everything really so rosy with the semantic structure of this database, given the first principle? After reading a thread on the Russian-language OSM forum, I decided to look into the situation and describe it here.

First, a little history and a few facts. The OSM project originated in the UK, so the main language for tags, which in most cases are just words, is British English. That is why a sports centre and a neighbourhood are written as leisure=sports_centre and place=neighbourhood respectively. German words are also used: for example, one of the (non-recommended) designations for types of megalithic structures is megalith_type=grosssteingrab, from the German Großsteingrab (dolmen). Traditionally, a tag consists of a key (the part to the left of the equals sign) and a value (the part to the right). This suggests the principle that the key corresponds to a class of objects or to a property, and the value to a particular object or a specific value of that property. Sometimes keys and values use namespace syntax. This is often done when several tags make up a so-called tagging scheme, in which the general designation of an object is supplemented by properties specific to it. For instance:

social_facility=day_care
social_facility:for=senior
Such a pair of tags means "social facility, day care, for the elderly". This approach works very well, because the namespace named after the root key can hold any number of other keys for further qualifying properties.
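As a rough illustration of how a data consumer might read such a scheme, the Python sketch below collects every key that lives in the namespace of a root key. The function name namespace_properties and the sample object (including its name) are hypothetical and exist only for this example; tags are modelled as a plain dictionary.

```python
def namespace_properties(tags, root):
    """Return the value of the root key and all 'root:suffix' qualifiers."""
    prefix = root + ":"
    qualifiers = {
        key[len(prefix):]: value
        for key, value in tags.items()
        if key.startswith(prefix)
    }
    return tags.get(root), qualifiers


# Hypothetical object carrying the two tags from the example above.
tags = {
    "social_facility": "day_care",
    "social_facility:for": "senior",
    "name": "Example centre",  # made-up name, not real data
}

kind, qualifiers = namespace_properties(tags, "social_facility")
print(kind, qualifiers)
```

Because the qualifiers share a prefix with the root key, new properties can be added to the scheme without changing the code that reads it.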

Another common method is qualifying tags without using a namespace. For instance:
barrier=bollard
bollard=removable
This means: "artificial barrier, bollard, removable". From the point of view of natural language, “bollard = removable” is a rather absurd construction, but keep in mind that in OSM all these words stand for abstractions, which should ideally be clearly described in the project Wiki (which does not always happen). The disadvantage of this approach is that an object can carry only one bollard-specific qualifying tag, since OSM does not allow two tags with the same key on one object. There can, however, be as many generic qualifying tags, shared with other kinds of objects, as needed: for this bollard, say, the material and height can be given as material=concrete, height=0.7.
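The one-value-per-key limitation can be sketched with an ordinary Python dictionary, which behaves the same way. This is only an analogy for how tags on a single object work, not OSM's actual storage code, and the second value (rising) is used purely for illustration.

```python
# Tags on one object behave like a mapping: each key holds exactly one value.
bollard = {"barrier": "bollard", "bollard": "removable"}

# A second bollard-specific refinement under the same key does not add
# a second tag; it overwrites the first one.
bollard["bollard"] = "rising"
tags_after_overwrite = dict(bollard)

# Generic qualifying keys are not limited in this way.
bollard.update({"material": "concrete", "height": "0.7"})

print(tags_after_overwrite)
print(bollard)
```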

So far, everything seems quite logical and understandable. But, as you know, any good thing is easy to spoil. Obviously, to store data in a database while keeping it easy to parse, to select subsets by the required characteristics, and to find very specific data and objects, the database must maintain reasonably coherent data semantics. Otherwise it turns into loosely structured text. But remember that the OSM database, being a cartographic one, has to store information about the real world, which many people perceive "as is", in the form of indivisible objects, without singling out any particular properties in advance. People are simply used to talking about what they see. In large projects with large databases of objects, such as online stores, the typical use case is retrieving data for the user. In some cases this is a parametric search; in others it is a “smart” search that matches sets of object (product) properties to free-form queries.

In OSM the situation is reversed: project participants enter data into the database, and everyone does so to the best of their ability, including their ability to identify the essential features of objects. Given the principle of “any tags you like”, which is meant to guarantee the extensibility of the tagging system and the project’s ability to store a wide variety of data for a variety of needs, such use of familiar natural language sometimes leads to consequences that, if not catastrophic for semantics, certainly deserve the epithet "extreme uncertainty".

Think about it and answer honestly: if you want to buy a specific product, will you look for a store that is bound to have it, or one that might have it only with some relatively small probability? Few people enjoy running around places where they can find what they want only by chance. Yet OSM has tags that denote a store where nobody knows what is sold. One example is shop=kiosk. As its description shows, you may find anything there, from cigarettes to newspapers. Or you may not. The only clear characteristic of such a store is its size, because you cannot even say for sure whether a kiosk is a small free-standing retail pavilion or is built into a building. And in some countries the word kiosk can simply mean a small store.

In fact, this tag simply migrated into the tagging scheme from natural language. The person to “thank” for it is a user named Etric Celine. As you can see, on his project Wiki page he writes quite honestly that he doesn’t give a damn about tag discussions (the formal procedure of proposing a tag and discussing it before adoption) or about any order, but considers it very important that everyone “does at least something”. So he did something: he introduced a tag that means practically nothing. Do you know how many such “stores of unknown kind” there are in the OSM database? Nearly fifty thousand. And many people who have only just joined the project and do not yet understand the importance of describing the properties of objects hang this tag on any small retail pavilion they come across if they do not know a better designation for it, although designations exist for tobacco shops, for places that sell newspapers, for places that sell ice cream, and for many others. What leads them? That

To an experienced developer or database architect it may seem extremely strange that, alongside numerous designations such as “supermarket”, “kiosk” and “mall”, there is no universal way to describe the range of goods, but such is the reality. Of course, there are tags for bookstores or do-it-yourself stores. But what a supermarket sells cannot be described in any way, however much you might want to. Incidentally, there is a long-running debate about the difference between a supermarket and a mall, because no clear line can be drawn: a mall is, by definition, “a building with a lot of shops, entertainment venues, cafes and restaurants”, but a supermarket can also have indoor premises for tenants. So at what point does a supermarket turn into a mall, and, most importantly, does this difference matter at all?

Too many common tags have vague definitions and unclear limits of applicability. For example, it is just as impossible to formulate a clear difference between a restaurant and a cafe. The difference is not size, not table service, not the range of dishes, not the opening hours, not the seating arrangement, not the need to reserve a table, not the price, nor anything else. These are just the words “cafe” and “restaurant”. Of course, in some extreme cases the word “cafe” definitely does not suit certain very high-class places. But where is the clear boundary? It does not exist. Therefore, assigning the tags amenity=cafe and amenity=restaurant is a fuzzy procedure which, strictly speaking, contradicts another important principle of OSM: verifiability. This principle states that any designation entered into the database must be such that another project participant could confirm it, that is, unambiguously identify the object in the same way. The presence of the word “restaurant” in the name of a place is not a criterion, even though Russian happens to have the borrowed words “cafe” and “restaurant”. And what about the Czech hospoda or the Polish tawerna? Nothing, because always following the path of “every word (every concept of a language) should have its own tag” is wrong. Using abstraction as a thinking tool, one has to find the similar and the differing properties of objects, and then designate precisely those properties, regardless of habit. Then it becomes easier to offer the user a map or guide based on OSM data: there is no need to guess what a mapper meant by marking something as a restaurant rather than a cafe. A parametric search, or a display of the desired properties in a list, is definitely a more user-friendly solution than dumping on the user every place where one can eat, with almost no explanation.
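A parametric search of this kind can be sketched in a few lines of Python. This is a minimal illustration, not how any particular OSM-based application works: the helper matches and the three sample places are made up, and the property keys (cuisine, outdoor_seating, reservation) are used here only as examples of separately tagged properties.

```python
def matches(tags, criteria):
    """True if the object carries every requested key with the requested value."""
    return all(tags.get(key) == value for key, value in criteria.items())


# Made-up sample data: three places to eat, described by their properties.
places = [
    {"amenity": "cafe", "cuisine": "coffee_shop", "outdoor_seating": "yes"},
    {"amenity": "restaurant", "cuisine": "italian",
     "outdoor_seating": "yes", "reservation": "required"},
    {"amenity": "restaurant", "cuisine": "italian", "outdoor_seating": "no"},
]

# The user asks for properties, not for the word "cafe" or "restaurant".
wanted = {"cuisine": "italian", "outdoor_seating": "yes"}
found = [place for place in places if matches(place, wanted)]

for place in found:
    print(place)
```

Note that the query never touches the amenity key: the answer depends only on properties that another mapper could check on the ground.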

Sometimes attempts at classification are made, but natural language and everyday knowledge get in the way of a correct classification. Almost from the very beginning of the OSM project, two tags were used to specify the type of trees in forests: wood=coniferous and wood=deciduous (literally, "trees with cones" and "leaf-shedding trees"). These two words, coniferous and deciduous, are common in English, and people are used to contrasting them. In Russian, one speaks in such cases of needle-leaved and broad-leaved trees, which is somewhat more correct from the point of view of biology, but still not entirely. In fact, there are trees that shed their foliage seasonally and trees that do not (evergreens). Independently of that, there are trees with leaves and trees with needles. So there can be a tree with needles and cones that nevertheless sheds its needles for the winter (European larch), or a tree with leaves that is evergreen (cherry laurel). There are other, less common properties as well. Not so long ago the original scheme was replaced by a scheme with two keys, one for the seasonal cycle of the leaves and one for their shape.
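The advantage of two independent keys over one two-valued key can be shown with a short Python sketch. It assumes that the two keys in question are leaf_type and leaf_cycle with the values shown; the key and value names are not given in the text above, and only the two basic values of each key are considered.

```python
from itertools import product

# Assumed key values for the two-key scheme (simplified to two values each).
LEAF_TYPES = ("broadleaved", "needleleaved")
LEAF_CYCLES = ("deciduous", "evergreen")

# Two independent properties give four describable combinations,
# whereas the old single key offered only two values.
combinations = [
    {"leaf_type": leaf_type, "leaf_cycle": leaf_cycle}
    for leaf_type, leaf_cycle in product(LEAF_TYPES, LEAF_CYCLES)
]

# The two awkward cases mentioned in the text.
european_larch = {"leaf_type": "needleleaved", "leaf_cycle": "deciduous"}
cherry_laurel = {"leaf_type": "broadleaved", "leaf_cycle": "evergreen"}

for tree in (european_larch, cherry_laurel):
    print(tree, tree in combinations)
```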

Another case where the tag authors' knowledge of the subject area was not rigorous enough, which produced vague and conflicting descriptions, concerns towers and masts, which in OSM are values of the key for man-made objects, man_made=*. In structural engineering, the field that covers all types of man-made stationary structures, a tower is a narrow vertical structure that stands solely by resting on its own foundation, and a mast is one held up by guy wires, each attached to an anchor. So everything is quite simple, and moreover this classification is international. But in other technical fields these terms may be used differently. For instance, what power engineers call masts are, from a builder's point of view, also towers. The bottom line: in OSM these tags are assigned to man-made objects rather freely.
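The structural-engineering definition quoted above reduces to a single verifiable property, which a Python sketch makes explicit. The function classify_support and the sample descriptions are hypothetical; this illustrates the definition from the text, not a rule taken from the OSM Wiki.

```python
def classify_support(has_guy_wires: bool) -> str:
    """Apply the engineering definition: guyed means mast, free-standing means tower."""
    return "mast" if has_guy_wires else "tower"


# Made-up examples; the only input is whether the structure is guyed.
structures = [
    ("free-standing lattice structure", False),
    ("guyed radio antenna support", True),
]

for description, guyed in structures:
    print(f"{description}: man_made={classify_support(guyed)}")
```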

The most curious situations (and, should the practice spread, the most unpleasant ones because of semantic divergence, that is, discrepancies in the meaning of designations) arise when tags use words that have completely different meanings in different languages. A recent example is a proposal by one Russian-speaking participant to introduce a tag for places where you can get a “business lunch”. The funny thing is that probably only in Russia does “business lunch” (a term coined somewhere in the 1990s, when anything with the prefix “business” sounded more respectable) mean a set of dishes at a fixed price, available at certain times of day. In the rest of the world this is called table d'hôte (from French), prix fixe, or some other local word, whereas business lunch in any case means something associated with negotiations over a meal, not a specific type of restaurant service. Of course, the words used in tags are arbitrary. But they should be clear to other project participants, at least to the extent that there is no doubt about which subject area the tag belongs to. Therefore it is unacceptable to adopt designations that would mislead any English speaker who is not from Russia.

The reverse situation is even more common. Borrowed words quite often change their meaning, and so those for whom the culture of English-speaking countries is a closed book often make mistakes by interpreting tags according to the meaning of a similar-sounding borrowed word rather than what the word means in the original. Similar-sounding words can also exist in different languages independently. For example, Russian-speaking OSM participants are often misled by the tag highway=alley. The point is that the English word alley sounds like the French allée and the Russian аллея (alleya). The Russian word was borrowed from French and therefore means something similar: a promenade lined with trees. The English alley is usually a narrow service passage along the side or back wall of buildings, or a passage between private plots of land arranged in one or two rows. This word is closer to the Russian "backyard". But inexperienced participants often try to tag a tree-lined avenue as highway=alley.

Even within the English-speaking community, agreement does not come by itself, because of cultural differences. For example, a typical American drugstore sells, besides medicines, a whole range of household goods, cosmetics, food and drinks. A prescription counter may, unexpectedly, be found inside a supermarket. The British idea of a pharmacy, on the other hand, is somewhat closer to what people in Russia are used to.

Another example is the use of words like cabin and hut as values of the key building=*. Judging by the key, these tags should indicate the type of building. However, there is no clear difference between the two. What does exist are associations with purpose. For example, an American will most likely associate a cabin with something like a summer residence, that is, a small private or rented holiday home. A resident of Norway, seeing the word hut, may recall the mountain huts belonging to the tourist association Den Norske Turistforening, which its members are entitled to use. A German-speaking Swiss may have a similar association, only with the mountain shelters of the Schweizer Alpen Club. That is, people readily compensate for the lack of a clear definition of the building type with associations about its purpose. And in Russia, until recently, these tags might be used for a log hut (izba), which is wrong, because that can already be described as a "building made of logs" using the tags building=yes, material=log.

Of course, the semantic chaos that stems from natural language does not dominate the project, although it is quite noticeable. There are strong and successful precedents of attempts to create reliable, consistent and clear classifications that replace vaguely defined tags with sets of keys, each responsible for one separate property. One such scheme, well known but not yet officially approved, is Healthcare 2.0. It was created to provide tools for describing medical institutions of various types without the ambiguities inherent in, for example, the tag amenity=doctors. It can describe both a large hospital and a private doctor's office. One of the Russian participants in the project did a great deal of work on a scheme for describing the condition of forests. Unfortunately, so far it has gone no further than a description placed on the page of deprecated tags.

New, well-thought-out schemes turn out to be practically independent of cultural and linguistic context. At most, one or two values may need to be added to a key. For example, Healthcare 2.0 can describe medical facilities specific to Russia, such as a feldsher station, although its authors had no idea such an institution existed. This is the power of elementary properties that can be freely combined.

The saddest thing about this situation, it seems to me, is that even many experienced project participants either do not understand this problem or understand it but claim that it is insignificant, or that solving it would significantly raise the entry threshold and scare away the proverbial newcomers (about whom many like to speculate at length, ascribing to them whatever qualities are convenient for arguing their point of view).

Practice shows that, firstly, when a new, more specific scheme appears, people successfully start using it, because they can finally designate what they wanted to but previously could not, or could only awkwardly. Secondly, OSM exists to create the most complete, free-to-use world map that is as faithful to reality as possible, not to create a hobby club (although that is good too). And if someone finds it hard to grasp the principle behind a fairly obvious tagging scheme but easy to use vague tags mindlessly, what contribution can they make to creating quality data? Conversely, people who are dissatisfied with the quality of tagging schemes, or with the absence of certain designations, may well make a useful contribution once adequate means appear.
