A method for identifying “trolls” in network communities using Q&AC as an example

    With all the "Megamind" topics in our articles, you and I have drifted somewhat away from hardcore IT subjects, but that does not mean we have become less interested in them. So I decided to dilute the prevailing atmosphere with a small pseudo-scientific article. There are a few formulas under the cut; please don't be scared.

    In essence, this is a brief translation of a paper posted on the Cornell University website, with some insertions of my own.

    Abstract


    The Internet has come to play an increasingly important role in people's lives since the advent of Web 2.0. Interaction between users has enabled them to exchange information freely through social networks, forums, blogs, wiki sites and other interactive, collaboratively developed media resources.

    On the other hand, all the flaws of the Web 2.0 concept are just as evident. Its orientation toward user-generated content has become the network's greatest strength and its greatest weakness at the same time. Owners and users of interactive communities face the questions of the reliability and trustworthiness of information in full measure. As in real life, when communicating over a network, situations sometimes arise in which some users violate the rules of generally accepted “network” etiquette. To maintain a normal atmosphere on a resource, its owners are forced to introduce artificial rules of interaction and to monitor compliance with them.

    One of these blatant violations is trolling.

    “Trolling” is the incitement of anger and conflict by a participant in communication (a “troll”) through implicit or explicit bullying, humiliation or insults directed at another participant or participants, often in violation of the site's rules and, sometimes without the “troll” realizing it, of the ethics of network interaction. It is expressed in aggressive, mocking and abusive behavior. It is practiced both by personified participants interested in greater recognition, publicity and outrageousness, and by anonymous users who cannot be identified. In the particular case, “trolling” is the provocation of a “victim” in order to attract attention.

    This article proposes a new approach to detecting such malicious users. The method is based on the degree of conflict between the belief functions of different messages in a discussion thread. To demonstrate the viability of the approach, it is tested on synthetic data.

    Recently, the ways we obtain information have shifted markedly toward speed, convenience and lower effort. Thanks to the Internet, researching a topic has in effect been reduced to a couple of mouse clicks. Still, on some questions it is hard to find a satisfying answer with traditional search engines; instead, we prefer to get an expert opinion.

    As a result, a tool of information exchange such as the question-and-answer community (hereinafter Q&AC) has gained popularity. Such systems allow every user to contribute to the development of the community. Unfortunately, not all messages are reliable: some users impersonate experts, while others publish useless messages. The work of the moderators of these communities therefore becomes very important. Most often, the growth of “junk” messages is the result of the actions of “trolls”.

    - Q&AC: a quick overview


    A. Q&AC Users


    Users are the main actors in a Q&AC. Conventionally, they can be divided into “experts”, “students” and “trolls”.
    Experts: users with knowledge or skills in a particular field.
    Students: users trying to obtain information or experience.
    Trolls: persons trying by any means to disturb the peace of the community; their goal is to create counterproductive discussions.

    B. Identification of sources in Q&AC


    Many studies have already attempted to evaluate the sources of information in communities.
    Some propose user-assessment models based on the number of a user's best answers, where the best answer is determined by the asker or by voting.
    Others focus on the questions a user chooses to answer: experts prefer to answer questions in which they are more competent.
    Some authors propose complex frameworks based on users' cognitive and behavioral characteristics to evaluate not only the reliability but also the expertise of information providers.

    C. Uncertainty in Q&AC


    When dealing with information supplied by people, we face several levels of uncertainty. For Q&AC there are three: the first relates to uncertainty in extracting and integrating information, the second to uncertainty about the sources of information, and the third to the content of the information itself. In our case we are mostly interested in evaluating the sources and the part of the uncertainty associated with them. Indeed, when we encounter other users on the network (i.e., sources of information), we almost never have any a priori knowledge about them.

    - Mathematical apparatus


    One of the mathematical tools for modeling and processing imprecise (interval) expert assessments, measurements or observations is the theory of belief functions.

    The theory of belief functions, or Dempster-Shafer theory, operates with mathematical objects called “belief functions”. Their usual purpose is to model the degree of belief of some subject in something. At the same time, the literature offers a large number of interpretations of belief functions that can be used in various applied problems.


    The approach proposed in the article combines this theory with a quantity that measures the conflict between two combined belief functions.
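    To make this a little more tangible, below is a small Python sketch of mine showing how mass (belief) functions can be represented and combined. The conflict here is simply the mass that falls on the empty set during the combination (the classical Dempster conflict); the frame, the function names and the example numbers are purely illustrative assumptions rather than anything taken from the paper.

```python
# A minimal sketch of mass (belief) functions and Dempster's combination rule.
# Frame, function names and example numbers are illustrative assumptions.

FRAME = frozenset({"relevant", "off-topic", "nonsense", "abuse"})

def combine(m1, m2):
    """Conjunctive combination of two mass functions with normalization.

    m1, m2: dicts mapping frozenset (focal element) -> mass.
    Returns (combined mass function, conflict K), where K is the mass that
    the conjunctive combination assigns to the empty set.
    """
    conflict = 0.0
    combined = {}
    for a, w1 in m1.items():
        for b, w2 in m2.items():
            inter = a & b
            if inter:
                combined[inter] = combined.get(inter, 0.0) + w1 * w2
            else:
                conflict += w1 * w2                     # disagreement mass
    if conflict < 1.0:                                  # Dempster normalization
        combined = {k: v / (1.0 - conflict) for k, v in combined.items()}
    return combined, conflict

# Two hypothetical opinions about the same message:
m_a = {frozenset({"relevant"}): 0.7, FRAME: 0.3}   # "probably relevant"
m_b = {frozenset({"abuse"}): 0.6, FRAME: 0.4}      # "probably abusive"

m_ab, K = combine(m_a, m_b)
print(K)   # 0.42 -- the larger K, the more the two sources disagree
```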

    Let us proceed to the description of the method.

    One important assumption is that “trolls” infiltrate only the popular branches of a discussion. We break the further description of the method into three steps.

    1. User messages


    Researchers list the basic characteristics of “trolls”: aggression, deception, rule-breaking and success, and also note behavioral traits such as disregard for moral norms and pronounced sadistic and psychopathic tendencies. In the context of this work, the authors distinguished “trolls” from other users manually, based on their messages. On this basis a message can be: relevant, off-topic, nonsense or abuse. We define the frame of discernment characterizing a message:

    Ω_msg = {relevant, off-topic, nonsense, abuse}   [1]

    The nature of a message is determined relative to the published question or topic. At this stage we assume that the method determines the nature of each message unambiguously.
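    For illustration, once the nature of a message is known, it can be turned into a mass function over the frame [1]. The helper below is a hypothetical sketch of mine, with an optional confidence parameter for the case when the labelling is not fully certain (in the text above it is assumed to be unambiguous, i.e. the confidence equals 1).

```python
# Hypothetical mapping from a message's nature to a mass function over the
# frame from [1]. With confidence = 1.0 the assignment is categorical, as
# assumed in the text; with anything less, the remaining mass stays on the
# whole frame (total ignorance).

FRAME = frozenset({"relevant", "off-topic", "nonsense", "abuse"})

def message_mass(nature, confidence=1.0):
    assert nature in FRAME and 0.0 < confidence <= 1.0
    m = {frozenset({nature}): confidence}
    if confidence < 1.0:
        m[FRAME] = 1.0 - confidence
    return m

m_msg = message_mass("off-topic", 0.8)   # e.g. an automatic labeller that is 80% sure
```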

    2. User conflict


    Detecting irrelevant messages does not by itself answer the question of whether a user belongs to the “troll” cohort. A user may simply have taken offense and responded to jabs; besides, the subject of a discussion can gradually drift. To distinguish a “troll” from other users, we need a quantitative estimate of how much that user conflicts with the others. The proposed approach therefore measures the magnitude of the conflict between the messages of each user.


    - Conf_msg/U: a measure of the conflict between the k-th message of user U_i and the messages written by another user U_j:

    Conf_msg/U(M_k^i, U_j) = Conf(m_{M_k^i}, m_{U_j})   [2]

    where m_{M_k^i} is the belief function of the message, m_{U_j} is the combination of the belief functions of U_j's messages, and Conf(·, ·) is the conflict measure between two belief functions.
    - Conf_msg: a measure of the conflict between the k-th message of user U_i and all the messages written by all other users, computed as a weighted average. The weights are the numbers of messages written by each user, which helps expose the level of conflict between “trolls” and experts in particular:

    Conf_msg(M_k^i) = ( Σ_{j≠i} |M_j| · Conf_msg/U(M_k^i, U_j) ) / ( Σ_{j≠i} |M_j| )   [3]

    where |M_j| is the number of messages written by user U_j.
    - Conf_user: the total conflict measure of user U_i, aggregated over all of his messages:

    Conf_user(U_i) = (1 / |M_i|) · Σ_k Conf_msg(M_k^i)   [4]
    The magnitude of a user's total conflict can also increase when he gets drawn into an endless debate with a “troll”. In that case the user becomes a victim, and moderators have to monitor user behavior across many threads.
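    As a rough sketch of how [2]-[4] might be computed, the functions below build on the combine() helper from the earlier snippet (the per-message mass functions could be produced, for example, with message_mass()). The pairwise conflict is taken to be the Dempster conflict K, [2] is averaged over the other user's messages, and [4] over the user's own messages; these aggregation choices are my assumptions and may differ from the authors' exact definitions.

```python
# A sketch of the conflict measures [2]-[4]; reuses combine() from the earlier
# snippet. messages_by_user maps a user id to the list of mass functions of
# that user's messages. The aggregation choices here are assumptions.

def conf_msg_vs_user(msg_mass, other_user_msgs):
    """[2] Conflict between one message and the messages of one other user."""
    if not other_user_msgs:
        return 0.0
    return sum(combine(msg_mass, m)[1] for m in other_user_msgs) / len(other_user_msgs)

def conf_msg(msg_mass, author, messages_by_user):
    """[3] Weighted average of [2] over all other users, weighted by the
    number of messages each of them wrote."""
    num, den = 0.0, 0
    for user, msgs in messages_by_user.items():
        if user == author or not msgs:
            continue
        num += len(msgs) * conf_msg_vs_user(msg_mass, msgs)
        den += len(msgs)
    return num / den if den else 0.0

def conf_user(user, messages_by_user):
    """[4] Total conflict of a user: the mean of [3] over his own messages."""
    own = messages_by_user[user]
    return sum(conf_msg(m, user, messages_by_user) for m in own) / len(own)
```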

    3. Clustering users


    The final step is to classify users into two groups according to their conflict measures. The authors propose splitting the users into groups with the k-means algorithm.

    k-means is a simple iterative clustering algorithm that partitions a given data set into a user-specified number of clusters, k. The algorithm is easy to implement and run, relatively fast, easy to adapt and widespread in practice; historically it is one of the most important data-mining algorithms.

    In our case, the number of clusters: k = 2 .

    As a result, all users end up in one of two clusters: “trolls” fall into the group with the higher conflict measure, and respectable users into the group with the lower one.
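    A minimal clustering sketch, assuming scikit-learn's KMeans over the one-dimensional Conf_user values; the conflict numbers below are made up purely for illustration, and the cluster with the larger centroid is treated as the “troll” group.

```python
# Splitting users into two groups by their total conflict using k-means (k = 2).
# The conflict values here are hypothetical, not taken from the paper.
import numpy as np
from sklearn.cluster import KMeans

conf_by_user = {"U1": 0.31, "U2": 0.28, "U3": 0.35, "U4": 0.78,
                "U5": 0.10, "U6": 0.12, "U7": 0.11, "U8": 0.83}

users = list(conf_by_user)
X = np.array([[conf_by_user[u]] for u in users])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
troll_label = int(np.argmax(km.cluster_centers_))   # the cluster with the larger mean conflict
trolls = [u for u, label in zip(users, km.labels_) if label == troll_label]
print(trolls)   # with these made-up numbers: ['U4', 'U8']
```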

    - Example


    Consider the following example.
    Take a discussion thread containing thirty-one posts written by eight users. The total conflict measure of each user is computed with equation [4].
    User U_1 sent three relevant messages and two controversial messages in response to a message from user U_4.
    User U_2 sent seven relevant messages and two controversial messages in response to a message from user U_4.
    User U_3 sent four relevant messages and one off-topic message addressed to user U_8.
    User U_4 posted two controversial posts.
    User U_5 posted one relevant message.
    User U_6 posted three relevant posts.
    User U_7 posted two relevant posts.
    User U_8 posted three posts: the first two off-topic, the third controversial.

    The total conflict measure of user U_4 is greater than that of U_8, because U_4's messages were posted after a large number of relevant messages from other users, and such a situation yields a higher measure of conflict.
    We apply the k-means algorithm, which divides the users into two groups:

    Group with the higher conflict measure (“trolls”): U_4, U_8
    Group with the lower conflict measure: U_1, U_2, U_3, U_5, U_6, U_7


    Users U_1, U_2 and U_3 are not classified as “trolls”, despite some of their posts, since they also posted relevant messages.

    Gist


    - Conclusions


    “Trolling” on the network is unambiguously regarded as a negative and in some ways even destructive phenomenon that makes it harder for users to obtain information. Many modern online communities have rating systems for self-regulation, but none of them manages without moderation, which in itself means extra costs for community owners. Small resources mostly get by on their own; large ones are forced to employ dedicated specialists.

    This article proposes a new synthesized approach to assessing the quality of a user from the nature of the messages he publishes. At the moment the authors have developed a methodology for finding unscrupulous users within a single discussion thread, but they intend to extend it to the whole community.

    When writing this post I had to omit part of the description of the heavyweight mathematical apparatus to keep the article not only useful but also readable. Since links have not been abolished yet, those interested in the topic can make up for this simplification on their own.

    Unfortunately, one article cannot embrace the boundless: the topic is very broad, and a thorough treatment would amount to at least a PhD thesis. If this method could be bolted onto the karma formula of some community, there might be a prospect of getting rid of the tedious duty of dealing with woolly comments: instead of the UFO arriving and publishing its note here, the comment-voting system itself could reject the trolls' attempts.

    What has not been done:


    In this work the distinction between “trolls” and other users was drawn manually, which makes this variant unsuitable for practical use. Clearly the process can be automated, at least on the basis of comment ratings. An attempt to apply the algorithm to Habr as an example was unsuccessful, mainly because the UFO overwrites heavily downvoted comments.

    Work in this direction will be continued.

    Additional information on clustering on Habr: "Clustering: k-means and c-means algorithms".
