Typing an object (defining its properties) by the hands of site users
It is often necessary to define the set of properties of a newly created object. For example, this applies to a site with descriptions of goods or films, where each object needs its own set of tags or properties. More generally, it applies to any repository of object descriptions where objects have properties and can be compared with one another on the principle of "similar or dissimilar."
So, here is the setup: the site has a ready-made set of objects whose properties are already defined and verified. A new object is added, about which we know nothing, but which site visitors can judge. The goal: the administrator should not have to add the required properties manually; instead, everything gets done by the hands of the site's visitors.
For clarity, assume we have a website dedicated to cell phones. The site has (for simplicity) 5 phones with the following conditional properties (numbered for convenience):
A> Vibrate (1), Speakerphone (3), Radio (2), Flashlight (4)
B> Vibrate (1), Speakerphone (3), MP3-player (5)
C> Flashlight (4), One-piece case (6), MP3-player (5)
D> One-piece case (6), TV (7)
E> MP3-player (5), TV (7), Radio (2)
Now a sixth device is added, about which the site administrator, unlike the visitors, knows nothing. Let it be a device with Radio (2) and TV (7).
In our example there are only 7 possible properties in total. To start, we assign all possible properties to the new object.
The next step is to determine only those properties that the object really possesses. For this, we ask site visitors to rate the degree of similarity between a known object (chosen at random) and the new one. Similarity is rated on a scale from 0 to 2, where 0 means "not alike", 1 means "there is something in common", and 2 means "very similar". A finer scale is possible, but for simplicity this one is used here.
When comparing, we take into account only the properties that both the new and the known object have. If the user chose "very similar", we add 1 (times a multiplier) to the weight of each intersecting property of the unknown object; for "there is something in common" we add 0.5, and for "not alike" we subtract 0.5.
I sketched a small example in PHP that illustrates how the algorithm works.
// how similar each known object is to the new one (simulated user answers)
$known_objects = array('a' => 1, 'b' => 0, 'c' => 0, 'd' => 1, 'e' => 2);

// known objects and their properties
$a = array(1, 2, 3, 4);
$b = array(1, 3, 5);
$c = array(4, 6, 5);
$d = array(6, 7);
$e = array(5, 7, 2);

// for each property of the new object we track a weight, initially 1
$new_object = array(1 => 1, 2 => 1, 3 => 1, 4 => 1, 5 => 1, 6 => 1, 7 => 1);

// multiplier: the smaller the value, the more smoothly the weights change
const K_MUL = 0.1;

$s = array_keys($known_objects);

// 100 iterations, each comparing with a randomly chosen known object
for ($i = 0; $i < 100; $i++) {
    shuffle($s);
    $new_object = cmp($new_object, ${$s[0]}, $known_objects[$s[0]]);
}

// normalize the weights
process($new_object);

// print the result
print_r($new_object);

// normalize the weights and remove properties with weight below 0.5
function process(&$new_object) {
    $max = 0;
    foreach ($new_object as $k => $v) {
        if ($v > $max) {
            $max = $v;
        }
    }
    $mv = 1.1;
    if ($max > $mv) {
        $div = 1 / $max;
        foreach ($new_object as $k => $v) {
            $new_object[$k] *= $div;
        }
    }
    foreach ($new_object as $k => $v) {
        if ($new_object[$k] < 0.5) {
            unset($new_object[$k]);
        }
    }
}

// compare the unknown object ($a) with a known one ($b);
// $val is the user-chosen degree of similarity (0..2)
function cmp($a, $b, $val) {
    switch ($val) {
        case 0:
            $add = -0.5;
            break;
        case 1:
            $add = 0.5;
            break;
        case 2:
            $add = 1;
            break;
    }
    foreach ($a as $k => $v) {
        if (in_array($k, $b)) {
            $a[$k] += $add * K_MUL;
        }
    }
    return $a;
}
The code is very primitive, but it gives an idea of how the approach works.
At the output we get something like the following array, where the key is the property number and the value is the calculated weight:
Array
(
    [2] => 1
    [7] => 0.944444
)
As tests show, the accuracy depends on the number of iterations; the minimum number of iterations is about 50 (which correlates with K_MUL * 0.5, where 0.5 is the minimum weight-change step).
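The iteration estimate above can be sketched with a quick back-of-the-envelope calculation. This is my own illustration, not part of the original code; the variable names are assumptions. The idea: a property the object does not actually have starts at weight 1 and must fall below the 0.5 cut-off used in process(), and each "not alike" answer moves an intersecting weight by only 0.5 * K_MUL.

```php
<?php
// Sketch (assumed names): estimate how many iterations are needed before
// a false property's weight can drop below the 0.5 removal threshold.
$k_mul   = 0.1;                     // the K_MUL multiplier from the example
$step    = 0.5 * $k_mul;            // minimum weight change per vote: 0.05
$votes   = (1.0 - 0.5) / $step;     // votes needed to cross the threshold: 10
$objects = 5;                       // comparisons are spread over 5 random objects
$iterations = $votes * $objects;    // rough total iterations needed
echo round($iterations), "\n";      // prints 50
```

With a finer K_MUL the weights change more smoothly, but the number of iterations needed grows proportionally.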
Adding known objects with varying degrees of similarity improves how well the properties of the unknown object are determined.
Human factor
The case we have considered is ideal. But what if a certain percentage of users answer inaccurately? To simulate this, you can add answer randomization to the cmp function with the following lines:
if (rand(0, 100) > 70) {
    $val = rand(0, 2);
}
This simulates a situation in which roughly every third answer is random (it may or may not happen to match the truth).
As tests showed, after increasing the number of iterations threefold (to match the ~1/3 of potentially incorrect answers), we get the same properties 2 and 7; only occasionally do fluctuations appear, and they can be eliminated by adjusting the threshold in the process function.
Array
(
    [2] => 0.98648648648649  // true property
    [6] => 0.50675675675676  // fluctuation
    [7] => 1                 // true property
)
Possible improvements
The first improvement is error elimination. With a sufficient number of comparisons, we can discard results that do not fit the overall mass and thereby increase accuracy.
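One way to sketch this error elimination: for a given (new object, known object) pair, keep only the similarity votes that agree with the majority answer. This is purely my illustration; the function name filter_votes and the vote format are assumptions, not part of the original code.

```php
<?php
// Hypothetical sketch: given all similarity votes (0..2) collected for one
// object pair, discard the votes that disagree with the majority answer.
function filter_votes(array $votes) {
    // count how often each similarity value was chosen
    $counts = array_count_values($votes);
    arsort($counts);                       // most frequent answer first
    $majority = array_key_first($counts);  // PHP 7.3+
    // keep only the votes matching the majority answer
    return array_values(array_filter($votes, function ($v) use ($majority) {
        return $v === $majority;
    }));
}

print_r(filter_votes(array(2, 2, 0, 2, 1, 2)));  // keeps the four 2-votes
```

The surviving votes would then be fed into cmp() as before, so a single stray answer no longer shifts the weights.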
The second improvement is changing the weight of each user's vote. Users whose answers match those of the majority gain more weight for their own vote. In subsequent "similarity" votes, such a user's vote counts for more, which should also reduce the spread.
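A minimal sketch of this vote weighting, under my own assumptions (the names $user_weight and update_user_weight are illustrative and not from the original code):

```php
<?php
// Hypothetical sketch: each user starts with weight 1.0; agreeing with the
// majority nudges the weight up, disagreeing nudges it down, clamped to a
// sane range so no single user can dominate or vanish entirely.
$user_weight = array('alice' => 1.0, 'bob' => 1.0);

function update_user_weight(array &$weights, $user, $agreed_with_majority) {
    $weights[$user] += $agreed_with_majority ? 0.1 : -0.1;
    $weights[$user] = max(0.1, min(2.0, $weights[$user]));
}

// in cmp(), the contribution would then scale with the voter's weight:
//   $a[$k] += $add * K_MUL * $weights[$user];

update_user_weight($user_weight, 'alice', true);
update_user_weight($user_weight, 'bob', false);
echo $user_weight['alice'], " ", $user_weight['bob'], "\n"; // prints 1.1 0.9
```

The clamp keeps the scheme stable: a consistently accurate user's vote counts at most twice as much, and even an unreliable user retains a small voice.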
An important note: the algorithm is assumed to operate in a friendly environment, where users can make mistakes but do not do so deliberately or en masse.
I would be glad to hear questions, suggestions, and comments.