How to compare an “amazing car” and an “ugly hut” in a marketing survey and in big data

    We have all taken part in surveys, online or in real life, and when we start a new project we can't do without them. But sometimes a survey produces results that you can do nothing with except smile, like the result of a survey by the All-Russian Public Opinion Research Center (VTsIOM) in the picture below.


    I was curious how questions with qualitative ratings are used today, and found that VTsIOM, FOM (the Public Opinion Foundation) and the Levada Center mostly use a three-point scale (poor / normal / good). For more detailed questions the scale grows to 5-6 points, but this is rare.



    So today sociologists are moving away from multi-level scales of qualitative ratings and trying to use a three-level one. Sociology may be able to get away with this, but when analyzing sizable amounts of data, having to work with qualitative ratings becomes a complicating factor and reduces the reliability of the results. For example, it is practically impossible to tell “a beautiful apartment” from “excellent housing”, and if we recall the line of one of the characters of “The Twelve Chairs”, “To some, even a mare is a bride”, the overlap between qualities goes beyond reasonable limits.

    A gradation mechanism does exist, and banks use it widely to detect forgeries in financial documents. It is Benford's law of distribution, which Ted Hill proved in 1995.
    The theoretical background of the proposed tool is presented in this material: "Benford's law and the distributions falling under it."

    Wikipedia formulates the law as follows: given a number system with base b (b ≥ 2), the probability that the digit d (d ∈ {1, ..., b - 1}) is the first significant digit is:

    $$P(d) = \log_b(d + 1) - \log_b(d) = \log_b\left(1 + \frac{1}{d}\right)$$

    Based on the above, we get the following mechanism for grading qualitative features.

    Choose the number of intervals, say 5, that is, four gradations and one middle interval. Then b = 6, and the probabilities for the intervals are:

    1st interval - 0.386853;
    2nd interval - 0.226294;
    3rd interval - 0.160558;
    4th interval - 0.124539;
    5th interval - 0.101756.
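
    The interval probabilities above can be reproduced with a few lines of Python. This is a minimal sketch, assuming the first-digit formula log_b(1 + 1/d) with b = 6; the function name `benford_probabilities` is made up for this illustration.

```python
import math


def benford_probabilities(base: int) -> list[float]:
    """Return P(d) = log_base(1 + 1/d) for every digit d = 1 .. base - 1."""
    if base < 2:
        raise ValueError("base must be at least 2")
    return [math.log(1 + 1 / d, base) for d in range(1, base)]


# Five intervals correspond to base b = 6 (digits 1..5).
probs = benford_probabilities(6)
for interval, p in enumerate(probs, start=1):
    print(f"interval {interval}: {p:.6f}")
```

    The five values sum to 1, so together they split the whole probability mass between the intervals.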

    From the frequency statistics of the words that evaluate qualitative features, we build a series in ascending order and assign each word an index. We convert the frequencies into probabilities of use. Then we accumulate the probabilities starting from the tail of the series until the sum reaches the value of the last (5th) interval, in our case 0.101756; the words (qualitative definitions) whose probabilities went into this sum are assigned to interval 5. Moving on in order of decreasing index, we start a new sum and continue until it approaches the probability of the 4th interval, and so on in the same way down to the 1st interval.
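
    The accumulation procedure can be sketched roughly as follows. This is one possible reading of the description, not a reference implementation: the function `grade_words`, the rule of closing an interval as soon as the running sum reaches or exceeds its target, and the sample words with their counts are all assumptions made for illustration. The series is expected to arrive already ordered; here it is ordered by ascending frequency.

```python
import math


def grade_words(series: list[tuple[str, int]], intervals: int = 5) -> dict[str, int]:
    """Assign each word of an ordered frequency series to an interval 1..intervals."""
    total = sum(count for _, count in series)
    if total <= 0:
        raise ValueError("series must contain positive frequencies")

    base = intervals + 1
    # targets[k - 1] is the probability mass of interval k: log_base(1 + 1/k).
    targets = [math.log(1 + 1 / k, base) for k in range(1, base)]

    grades: dict[str, int] = {}
    interval = intervals  # start with the last interval
    accumulated = 0.0
    for word, count in reversed(series):  # walk from the tail, by decreasing index
        grades[word] = interval
        accumulated += count / total  # frequency converted to probability
        # Assumed closing rule: move to the next interval once the running
        # sum reaches its target; leftover words stay in interval 1.
        if interval > 1 and accumulated >= targets[interval - 1]:
            interval -= 1
            accumulated = 0.0
    return grades


# Hypothetical counts, already sorted in ascending order of frequency.
sample = [
    ("tolerable", 2), ("decent", 3), ("fine", 5), ("nice", 8), ("good", 10),
    ("great", 12), ("excellent", 14), ("superb", 20), ("amazing", 26),
]
grades = grade_words(sample)
for word, _ in sample:
    print(word, grades[word])
```

    Because real word frequencies rarely sum exactly to the target values, the interval boundaries are approximate; a different closing rule (for example, stopping just before the target is exceeded) would move the borderline words to the neighboring interval.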

    As a result, we get well-defined subsets, each with an actual numerical score.

    I won't pretend that selecting the synonyms is easy; everyone decides for themselves what result-to-effort ratio is acceptable.
