Can tags beat rubrics? Tag hierarchies

    What role do sections, categories, hubs, and other kinds of faceted classification play in our online life? Is everything about them really that obvious?
    All these concepts came to us from the paper past, when strict systematization was the only way to navigate books and documents. In the early Internet, too, rubrication was almost the only way to navigate. Catalogs flourished and multiplied; Yahoo is a prime example of a catalog turned into a hugely successful project, with a capitalization of $32 billion.


    But the development of search technologies greatly undermined the credibility and relevance of catalogs and rubrics. As with the dinosaurs, the large and clumsy Catalogs were defeated by the nimble and predatory Search Engines.
    They were defeated in global navigation across the Web, but in separate ecological niches, that is, within individual sites, rubrication remains a classic navigation tool.

    Then tags, or labels, appeared, and it seemed this tool would supplant rubrication as an outdated and inflexible approach. But no, each kept its own place, and you can often see rubrics and tags coexisting in the same information space.

    Why haven't tags become the great revelation and the gravedigger of rubrics?

    It seems to me for the following reasons:
    1. A rubric conveys the general topic of an information source (website, weblog, etc.) well. It is groomed in advance and does not suffer from the oddities that tags suffer from (unless they are moderated).

    2. Likewise, content placed in a rubric most likely belongs there, whereas the tags attached to it are just the subjective opinion of whoever assigned them.

    3. In forum software, the rubric is the basis for moderating posts. User-generated content therefore tends to land in a more or less relevant section, which makes life easier for those who read rather than write.

    4. In online stores and other systems where classification rules, rubrication is the basis of navigation.

    In sum, apart from the point about shops, all the advantages of rubrication come down to the predictable quality of the content found in a rubric.

    Looking in the opposite direction, at tags: they are too unpredictable both as message labels and as indicators of message quality.

    On the other hand, it is intuitively clear that tags are also a form of rubrication, only a more general and democratic one. Because of this democratic nature, tags lose to rubrics. Excessive freedom leads to uncertainty.

    For example, using a tag as a navigator within the context of a message is one thing:
    I read the latest news about a mobile phone and see the tag “coverage” there. When I click on this tag, I am in effect following the link “mobile communications” + “coverage”.
    If the text were about breeding high-yield dairy cows, the tag “coverage” would mean something else: “cow” + “coverage”.
    But if I come across a tag without context, for example in a site's tag cloud, how do I distinguish “coverage A” from “coverage B”?


    Let me allow myself a seditious thought: an unambiguous interpretation of a tag's semantic context is possible only if we know the history of the reader's interests (behavioral analysis?). That is, if our reader usually delves into texts in the categories “networks”, “mobile networks”, etc., then we can understand what they mean by the tag “coverage”. If we do NOT have such information, then everything is as usual ...

    What to do?

    For the development of tagging technologies, this means the following:

    Using tags to search

    1. In doubtful cases, it is better to select messages not by a single tag but by a set of tags.
    2. This set can be made up of the main tag (the one requested as the selection criterion) and additional tags, that is, the labels among which the user “often grazes”.
    3. The additional tags should, of course, frequently co-occur with the main tag, that is, be “connected” to it (highly correlated).
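
    The three steps above can be sketched roughly as follows. This is only an illustration: the function name `expand_query`, the `min_dependence` threshold, and the sample data are assumptions, and "highly correlated" is approximated by the share of messages carrying the main tag that also carry the additional tag.

```python
from collections import Counter

def expand_query(main_tag, user_history, tagged_messages, min_dependence=0.5):
    """Return the main tag plus the user's habitual tags that often co-occur with it."""
    with_main = [tags for tags in tagged_messages if main_tag in tags]
    if not with_main:
        return {main_tag}
    # How often each other tag appears next to the main tag.
    co_counts = Counter(t for tags in with_main for t in tags if t != main_tag)
    extra = {
        t for t in user_history
        if co_counts[t] / len(with_main) >= min_dependence
    }
    return {main_tag} | extra

# Hypothetical corpus: each message is represented by its set of tags.
messages = [
    {"coverage", "mobile networks"},
    {"coverage", "mobile networks", "tariffs"},
    {"coverage", "cows"},
    {"cows", "milk"},
]
# Tags the reader "often grazes" (assumed to come from behavioral history).
reader_history = {"mobile networks", "networks"}
query = expand_query("coverage", reader_history, messages)
print(sorted(query))
```

    For this reader the query for “coverage” is narrowed by “mobile networks”, while a reader whose history is about cattle would get a different set.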


    Yes, this is the weakest point. Tags are usually assigned by the author, which is very subjective, and experiments with folksonomy are not encouraging so far. Other mechanisms are needed to make tagging more rigorous: most likely moderation, plus assistance for the tagger in the form of synonym handling or a search for analogous tags.


    Tags are not just words; they are words with an implied subject area. The Russian word “коса”, for example, can be a braid (hairstyle), a spit (coast), or a scythe (tool).
    That is, a tag has at least two parameters: (word, subject area).
    The subject area, in turn, is itself a collection of tags tightly connected by frequent co-occurrence.

    Tag hierarchies hold enormous potential. These hierarchies are natural: they are paths trodden by users rather than laid out by designers.

    That, you could say, concludes the preamble. What follows are very concrete recipes from our mathematician Sergey Lvov. I really hope he will be given a voice on Habr (popolznev).

    The first steps to the tag hierarchy

    Author: Sergey Lvov

    1. What do we want

    Imagine an online community (online as in computer network) whose members communicate in writing: a kind of virtual hut-discussion. Messages (posts, sayings, notes, remarks, speeches) are saved and form a large heap, in which you can put things in order in different ways, establish connections, and so on.

    One way to put things in order, or rather to form structures in the heap of messages, is through tags (thematic labels). It is assumed that users invent labels themselves and attach them to their messages. Since community members are not restricted in inventing labels, the set of labels itself turns into a big heap, and for labels to become a tool for forming structures, they themselves need to be put in order. There are two fundamentally different ways to tidy up the heap: manually and automatically. Of course, we are now interested in the second, although sometimes manual adjustment is probably unavoidable.

    Since labels are responsible for the “thematic nature” of messages, putting the heap of labels in order means establishing a measure that tells us how thematically close any two labels are. This measure could be a metric (a distance) in the mathematical sense: for any two different labels, the distance between them is a positive number that does not depend on the order in which the labels are listed, and the distance from a label to itself is zero. But it seemed more convenient to us to use a measure resembling a correlation coefficient (the minimum value, for completely unrelated labels, is 0; the maximum is 1). Moreover, this coefficient does not have to be symmetric: one label can be attached to a second more strongly than the second is to the first.

    2. Elements and notation

    Let $M$ denote the set of all messages stored in our “boltoteka” (chat archive): $M = \{m_1, \dots, m_N\}$. Clearly $M$ changes over time as the number of messages grows, but the dynamics do not interest us: we consider the system at an arbitrary fixed moment. $S$ is the set of all labels (tags) currently in the system: $S = \{s_1, \dots, s_{NN}\}$. Between the sets $M$ and $S$ a correspondence (relation) is defined which, by analogy with geometry (many readers will say: with graph theory! But it was geometers who came up with it), can be called incidence: message $m$ and label $s$ are incident if message $m$ is marked with label $s$.

    Digression. The rules of conduct in the hut-discussion may allow people other than the author to attach labels to a message. For our task this is unimportant for now: whoever labels the messages, at the moment under consideration the system is fixed in its state, and all that matters is whether a message and a label are incident.
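
    As a small illustration, the incidence relation can be stored as a set of (message, label) pairs and indexed in both directions. The function name `build_indexes` and the sample identifiers are assumptions made for this sketch only.

```python
from collections import defaultdict

def build_indexes(incidence):
    """Turn (message, label) pairs into S(m) and label -> messages lookups."""
    tags_of = defaultdict(set)        # m -> S(m), the tag set of message m
    messages_with = defaultdict(set)  # s -> all messages incident to label s
    for message, label in incidence:
        tags_of[message].add(label)
        messages_with[label].add(message)
    return dict(tags_of), dict(messages_with)

# Hypothetical incidence relation between M = {m1, m2, m3} and S.
incidence = {
    ("m1", "phones"), ("m1", "coverage"),
    ("m2", "coverage"),
    ("m3", "cows"), ("m3", "coverage"),
}
tags_of, messages_with = build_indexes(incidence)
print(sorted(tags_of["m1"]))
print(len(messages_with["coverage"]))
```

    Who attached a label plays no role here, in line with the digression: only the pairs themselves are stored.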

    If $s_1, \dots, s_r$ are the labels that mark message $m$ (we can define the tag set of message $m$: $S(m) = \{s_1, \dots, s_r\}$), then the quantities $e[s_1](m), \dots, e[s_r](m)$ are defined, which are called the (per-label) significances of message $m$. How exactly they are calculated is not very important for us now, but they must be calculated so that the more valuable the message, the higher its significance. Roughly speaking, the significance of a message is the number of “pluses” readers have given it. A feature of our system is that a plus is given not simply to a message but to a particular label (or labels) on it.

    The genealogical relationships between messages are also taken into account: each message can have “descendants” and “ancestors” (“predecessors”). If message $m_2$ is written in response to message $m_1$, then $m_1$ is called the immediate ancestor (or predecessor) of $m_2$, and $m_2$ the immediate descendant of $m_1$. If message $m_2$ is written in response to a message that is an immediate descendant of $m_1$, then $m_2$ is called a direct descendant (or just a descendant) of $m_1$. If message $m_2$ is written in response to a message that is a descendant of $m_1$, then $m_2$ is also called a (direct) descendant of $m_1$. Direct ancestors (predecessors) are defined similarly.
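
    A minimal sketch of these definitions, assuming each message stores only the identifier of the message it replies to (the `parent_of` mapping and the identifiers are hypothetical) and that reply chains contain no cycles:

```python
def ancestors(message_id, parent_of):
    """All ancestors of a message, from the immediate one up to the thread root."""
    chain = []
    current = parent_of.get(message_id)
    while current is not None:
        chain.append(current)
        current = parent_of.get(current)
    return chain

def descendants(message_id, parent_of):
    """All messages that have message_id somewhere among their ancestors."""
    return {m for m in parent_of if message_id in ancestors(m, parent_of)}

# m2 and m4 reply to m1; m3 replies to m2.
parent_of = {"m1": None, "m2": "m1", "m3": "m2", "m4": "m1"}
print(ancestors("m3", parent_of))
print(sorted(descendants("m1", parent_of)))
```

    The first element returned by `ancestors` is the immediate ancestor; the recursive definition in the text corresponds to walking the chain until the root is reached.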

    3. Approach No. 0: appearance statistics

    The system we are building does not understand anything. For it, a label is just a string of characters (we set the problem of homonymy aside and assume it is solved somehow). Therefore, the system can judge the thematic proximity of labels only by how often they appear together: it is natural to assume that if two labels often occur together, they are thematically close. Denote by $\mu[s]$ the number of messages tagged with $s$, by $\mu[u]$ the number of messages tagged with $u$, and by $\mu[su]$ the number of messages tagged with both $s$ and $u$.

    The first attempt to determine the thematic relationship coefficient of the labels s and u :

    $$\rho_0(s, u) = \frac{\mu[su]}{\mu[s] + \mu[u] - \mu[su]}. \qquad (1)$$

    The value in the denominator is the number of messages provided with at least one of the labels s , u .
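
    Formula (1) can be computed directly from the tag sets of the messages. The sketch below is illustrative: the function name `rho0` and the sample data are assumptions, and the value 0 for two labels that never occur is a convention chosen here, not something the text specifies.

```python
def rho0(s, u, message_tags):
    """Formula (1): messages with both labels divided by messages with either."""
    mu_s = sum(1 for tags in message_tags if s in tags)
    mu_u = sum(1 for tags in message_tags if u in tags)
    mu_su = sum(1 for tags in message_tags if s in tags and u in tags)
    either = mu_s + mu_u - mu_su  # messages carrying at least one of s, u
    return mu_su / either if either else 0.0

# Hypothetical data: one tag set per message.
message_tags = [
    {"topology", "homology"},
    {"topology", "homotopy"},
    {"topology"},
    {"homology"},
]
print(rho0("topology", "homology", message_tags))
```

    Here one message carries both labels and four carry at least one, so the coefficient is 1/4.
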
    What is wrong with formula (1)? First of all, it requires running through all the messages, which can take a very long time. It would be nice to have a good sample of messages to reduce the amount of work. Moreover, the following situation is possible. Naturally, descendant messages will often inherit the labels of their parents. If a long dialogue thread develops in the system (on forums, or in LiveJournal, this happens all the time) that no longer interests anyone except its participants, it can skew the statistics.

    One possible way out is this. We call a message nodal if it has more than one immediate descendant. Further, let $\mu_j[s]$, $\mu_j[u]$, $\mu_j[su]$ be the numbers of nodal messages marked, respectively, with label $s$, with label $u$, and with both labels $s$ and $u$. Now we correct formula (1) by counting not all messages but only nodal ones:

    $$\rho_{0j}(s, u) = \frac{\mu_j[su]}{\mu_j[s] + \mu_j[u] - \mu_j[su]}. \qquad (2)$$
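
    A sketch of formula (2), assuming the reply structure is available as a mapping from each message to the message it answers. The names `nodal_messages`, `rho0_nodal`, `parent_of`, and `tags_of`, as well as the data, are hypothetical.

```python
from collections import Counter

def nodal_messages(parent_of):
    """Messages with more than one immediate descendant (direct reply)."""
    replies = Counter(p for p in parent_of.values() if p is not None)
    return {m for m, n in replies.items() if n > 1}

def rho0_nodal(s, u, tags_of, parent_of):
    """Formula (2): formula (1) restricted to nodal messages."""
    nodal = nodal_messages(parent_of)
    mu_s = sum(1 for m in nodal if s in tags_of[m])
    mu_u = sum(1 for m in nodal if u in tags_of[m])
    mu_su = sum(1 for m in nodal if s in tags_of[m] and u in tags_of[m])
    either = mu_s + mu_u - mu_su
    return mu_su / either if either else 0.0

# Hypothetical thread: a is the root, b and c answer a, d and e answer b, f answers d.
parent_of = {"a": None, "b": "a", "c": "a", "d": "b", "e": "b", "f": "d"}
tags_of = {
    "a": {"phones", "coverage"},
    "b": {"phones"},
    "c": {"coverage"},
    "d": {"coverage"},
    "e": {"coverage"},
    "f": {"coverage"},
}
print(sorted(nodal_messages(parent_of)))
print(rho0_nodal("phones", "coverage", tags_of, parent_of))
```

    Only `a` and `b` are nodal here. Counting all six messages with formula (1) would give 1/6, because the replies that repeat “coverage” dominate the statistics; restricted to nodal messages the coefficient is 1/2.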

    We have removed one defect of formula (1), but that is not all. It is also bad that formula (1), and with it formula (2), is symmetric, whereas the relationship between labels is in fact asymmetric. But this is not hard to fix.

    Let us introduce the coefficient of dependence of the label s on the label u :

    $$\rho_{1j}(s, u) = \frac{\mu_j[su]}{\mu_j[s]}. \qquad (3)$$

    The meaning of this formula is simple: the more often label $s$ occurs without label $u$, the weaker the dependence of $s$ on $u$.
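
    The asymmetry of formula (3) is easy to see on a toy example. In this sketch the input is assumed to be the tag sets of nodal messages only (already filtered), and the function name `dependence` is made up for the illustration.

```python
def dependence(s, u, nodal_tag_sets):
    """Formula (3): share of messages tagged s that are also tagged u."""
    mu_s = sum(1 for tags in nodal_tag_sets if s in tags)
    mu_su = sum(1 for tags in nodal_tag_sets if s in tags and u in tags)
    return mu_su / mu_s if mu_s else 0.0

# Hypothetical tag sets of nodal messages.
nodal_tag_sets = [
    {"topology", "homology"},
    {"topology", "homotopy"},
    {"topology"},
]
print(dependence("homology", "topology", nodal_tag_sets))
print(dependence("topology", "homology", nodal_tag_sets))
```

    “homology” never appears without “topology”, so it depends on it completely, while “topology” appears with “homology” in only one message out of three.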

    Note that all three formulas give 1 if the same label is substituted for both arguments, that is, $\rho_0(s, s) = \rho_{0j}(s, s) = \rho_{1j}(s, s) = 1$. This is a natural normalization condition.

    Let's go further. In all the formulas presented so far, the key element is the frequency with which two labels occur together. Sometimes, however, the semantic proximity of two labels is precisely the reason they never occur together. For example, what one person calls “помидоры”, another calls “томаты” (two Russian words for tomatoes), so one user will only ever use the first label and the other only the second. If we rule out assigning synonym status to such a pair a priori (we have so far excluded such methods from consideration), we can still hope to catch the closeness of labels that often appear together with some third label. For example, if the label “homology” often goes along with the label “topology”, and the label “homotopy” often goes along with the label “topology”, then even without knowing anything about homology and homotopy we can assume that these things have something in common and are thematically close. Formally, this can be expressed as follows (we still take into account only nodal messages):

    $$\rho_{2j}(s, u) = \max_{v \in S} \left( \frac{\mu_j[sv]}{\mu_j[s]} \cdot \frac{\mu_j[uv]}{\mu_j[u]} \right). \qquad (4)$$

    Here, however, we are back to symmetry: by construction, $\rho_{2j}(s, u) = \rho_{2j}(u, s)$. The meaning of this symmetry is that we are now looking for a third label to which both compared labels are close. Note that always $\rho_{2j}(s, u) \ge \rho_{1j}(u, s)$: for $v = s$ the first factor in the bracket on the right-hand side of formula (4) is 1 and the second coincides with $\rho_{1j}(u, s)$, so the value $\rho_{1j}(u, s)$ is certainly attained, and the maximum may turn out to be larger. In the same way, $v = u$ gives $\rho_{2j}(s, u) \ge \rho_{1j}(s, u)$.
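
    A sketch of formula (4) on the homology and homotopy example. The helper names `count` and `rho2` and the data are assumptions; the input is taken to be the tag sets of nodal messages, and $S$ is taken to be every label that occurs in them.

```python
def count(nodal_tag_sets, *labels):
    """Number of nodal messages carrying all of the given labels."""
    wanted = set(labels)
    return sum(1 for tags in nodal_tag_sets if wanted <= tags)

def rho2(s, u, nodal_tag_sets):
    """Formula (4): best third label v that both s and u depend on."""
    mu_s = count(nodal_tag_sets, s)
    mu_u = count(nodal_tag_sets, u)
    if not mu_s or not mu_u:
        return 0.0
    all_labels = set().union(*nodal_tag_sets)
    return max(
        count(nodal_tag_sets, s, v) / mu_s * count(nodal_tag_sets, u, v) / mu_u
        for v in all_labels
    )

# Hypothetical nodal messages: the two labels never meet, but share "topology".
nodal_tag_sets = [
    {"homology", "topology"},
    {"homotopy", "topology"},
    {"homology", "topology"},
    {"homotopy"},
]
print(rho2("homology", "homotopy", nodal_tag_sets))
```

    The two labels never occur together, so formula (3) gives 0 in both directions, yet through the third label “topology” formula (4) gives 1/2.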

    You can also try to enrich the formula:

    $$\rho_{3j}(s, u) = \max\left( \max_{v \in S} \left( \frac{\mu_j[sv]}{\mu_j[s]} \cdot \frac{\mu_j[uv]}{\mu_j[u]} \right),\ \max_{v \in S} \left( \frac{\mu_j[sv]}{\mu_j[s]} \cdot \frac{\mu_j[uv]}{\mu_j[v]} \right) \right). \qquad (5)$$
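
    Formula (5) can be sketched in the same style. Again the names `count` and `rho3` and the data are hypothetical, and $S$ is assumed to be the set of labels that occur in the nodal messages, so $\mu_j[v]$ is never zero.

```python
def count(nodal_tag_sets, *labels):
    """Number of nodal messages carrying all of the given labels."""
    wanted = set(labels)
    return sum(1 for tags in nodal_tag_sets if wanted <= tags)

def rho3(s, u, nodal_tag_sets):
    """Formula (5): the larger of the formula (4) term and its variant with mu_j[v]."""
    mu_s = count(nodal_tag_sets, s)
    mu_u = count(nodal_tag_sets, u)
    if not mu_s or not mu_u:
        return 0.0
    all_labels = set().union(*nodal_tag_sets)
    first = max(
        count(nodal_tag_sets, s, v) / mu_s * count(nodal_tag_sets, u, v) / mu_u
        for v in all_labels
    )
    second = max(
        count(nodal_tag_sets, s, v) / mu_s
        * count(nodal_tag_sets, u, v) / count(nodal_tag_sets, v)
        for v in all_labels
    )
    return max(first, second)

# Hypothetical nodal messages: "mathematics" is a frequent label, "topology" a rarer one.
nodal_tag_sets = [
    {"homology", "topology"},
    {"topology", "mathematics"},
    {"mathematics"},
    {"mathematics"},
    {"homology"},
]
print(rho3("homology", "mathematics", nodal_tag_sets))
```

    On this data the first term equals 1/6 and the second 1/4, so the enriched formula gives 1/4. Swapping the arguments gives 1/6, which shows that formula (5) is no longer symmetric.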

    4. The possibility of other approaches

    All the formulas in the previous section are attempts to guess thematic proximity from statistics. What else can be done? We can try to use the machinery of our hut-discussion: recall that its main highlight is the significance system. Any of formulas (1)-(5) can be modified as follows. The numbers $\mu$ with various indices are counts of messages satisfying certain conditions (which conditions exactly is determined by the indices attached to $\mu$). A count of messages is a sum of ones: for each message that meets the conditions, 1 is added to the sum $\mu$. If we add not 1 but a value that depends on the significance of the message (since at least two labels are involved in all of our formulas, at least two significances can be used), we get more subtle formulas. Whether this subtlety does any good is a difficult question. We will not discuss it in detail now and postpone it for later.
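
    One possible reading of this idea, applied to formula (3), is sketched below. The text deliberately leaves the weighting open, so everything specific here is an assumption: a message contributes its significance $e[s](m)$ to the single-label sum, and the smaller of $e[s](m)$ and $e[u](m)$ to the pair sum, which keeps the coefficient between 0 and 1. The name `weighted_dependence` and the data are made up.

```python
def weighted_dependence(s, u, significance):
    """Formula (3) with each message weighted by its per-label significance.

    significance maps a message id to {label: e[label](m)} for that message.
    Assumption: a pair of labels contributes min(e[s](m), e[u](m)).
    """
    mu_s = sum(e[s] for e in significance.values() if s in e)
    mu_su = sum(
        min(e[s], e[u]) for e in significance.values() if s in e and u in e
    )
    return mu_su / mu_s if mu_s else 0.0

# Hypothetical "pluses" given by readers to each label of each nodal message.
significance = {
    "m1": {"phones": 5, "coverage": 2},
    "m2": {"phones": 1},
    "m3": {"coverage": 2},
}
print(weighted_dependence("phones", "coverage", significance))
print(weighted_dependence("coverage", "phones", significance))
```

    With all significances equal to 1 this reduces to the plain count-based formula (3); other choices of weight (a product, an average) are equally conceivable.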

    Apparently, beyond what has been said, only “manual work” and administration or moderation remain. For example, you can create several thematic blocks in advance and, whenever a user introduces a new label, prompt them to place it in one of the blocks or to establish hierarchical relationships or links to existing labels.
