Automatic recommendation: some theory and practice

    1. Introduction


    This article discusses some basic theoretical and practical aspects of automatic recommendation. Particular attention is paid to practical experience of using Apache Mahout on large, high-traffic portals (several million visitors per day) written in Yii 2. PHP and Java source code examples are provided to help the reader better understand the Mahout integration process.

    2. Robustness assessment


    First of all, we need to make sure that the results are not distorted by noise. As a rule, the easiest way to recommend an entity is to rank by average user rating: the higher the average score, the more reason there is to recommend the object. However, even such a simple approach has a very important caveat: noise affects accuracy. Let's look at a couple of examples. Suppose a user can rate a product (or another object) on a scale from 1 to 10, and we have a finite set of user ratings for a certain object: 5, 7, 4, 8, 6, 5, 10, 5, 10, 9, 10, 10, 10, 5, 2, 10. For easier perception, they can be displayed as a diagram:



    The sample contains 16 values. In this particular case there is no significant difference between the median (7.5) and the arithmetic mean (7.25), and the sample variance is only 7.267. In some cases, however, we may need to filter out outliers (for example, when the range of the data can be large). Naturally, we can use robust statistics for that. Take the set 1, 2, 1, 2, 3, 1, 5, 100, 4. Its graphical representation:



    In this example the arithmetic mean (about 13.2) does not reflect the real situation, since eight of the nine values do not exceed 5. Moreover, the sample variance is clearly large (a little more than 1060). Robust statistics (the median, which is 2 here, and quartiles) help us avoid such problems.
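
    The figures quoted for the two sets can be checked with a few lines of plain Java. This is only a minimal sketch: the class and method names (RobustStats, mean, median, sampleVariance) are made up for illustration, and the variance is assumed to be the unbiased sample variance (dividing by n - 1), which is what matches the numbers in the text.

```java
import java.util.Arrays;

class RobustStats {
    // Arithmetic mean of the sample.
    static double mean(double[] xs) {
        double sum = 0;
        for (double x : xs) sum += x;
        return sum / xs.length;
    }

    // Median: the middle value of the sorted sample
    // (the average of the two middle values when the size is even).
    static double median(double[] xs) {
        double[] s = xs.clone();
        Arrays.sort(s);
        int n = s.length;
        return n % 2 == 1 ? s[n / 2] : (s[n / 2 - 1] + s[n / 2]) / 2.0;
    }

    // Unbiased sample variance (assumption: divides by n - 1).
    static double sampleVariance(double[] xs) {
        double m = mean(xs);
        double ss = 0;
        for (double x : xs) ss += (x - m) * (x - m);
        return ss / (xs.length - 1);
    }

    public static void main(String[] args) {
        double[] first = {5, 7, 4, 8, 6, 5, 10, 5, 10, 9, 10, 10, 10, 5, 2, 10};
        double[] second = {1, 2, 1, 2, 3, 1, 5, 100, 4};
        for (double[] xs : new double[][] {first, second}) {
            System.out.println("mean=" + mean(xs)
                    + " median=" + median(xs)
                    + " variance=" + sampleVariance(xs));
        }
    }
}
```

    For the second set, the single value of 100 drags the mean far above the median, which is exactly why the median is the safer ranking statistic when outliers are possible.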

    3. Accounting for correlation


    Which method of measuring correlation comes to mind first? Naturally, it is Karl Pearson's linear correlation coefficient. Here is another small example. Suppose two experts need to evaluate the quality of websites, and each of them gives their own rating. For example:



    A single look at the chart is enough to see how similar the two experts' opinions are. In this example, Pearson's linear correlation coefficient is 0.9684747092. Based on this, we can predict that the two experts are quite likely to give similar ratings to other sites as well. Knowing someone's tastes makes recommending easier, right? So can we rely on the ratings of like-minded users only, rather than on everyone indiscriminately?
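
    For reference, Pearson's coefficient can be computed directly from its definition: the sum of products of deviations from the means, divided by the square root of the product of the sums of squared deviations. The sketch below is illustrative only. The class name Pearson and the two rating arrays are hypothetical, since the experts' actual ratings are not reproduced in the text, so it will not reproduce the value quoted above.

```java
class Pearson {
    // r = sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2)), where dx and dy are
    // deviations from the respective means. Returns NaN when one of the
    // series is constant, because the coefficient is undefined in that case.
    static double correlation(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("need two series of equal length >= 2");
        }
        int n = x.length;
        double meanX = 0, meanY = 0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;
        double sxy = 0, sxx = 0, syy = 0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            sxy += dx * dy;
            sxx += dx * dx;
            syy += dy * dy;
        }
        return sxy / Math.sqrt(sxx * syy);
    }

    public static void main(String[] args) {
        // Hypothetical ratings given by two experts to the same five sites.
        double[] expertA = {7, 9, 4, 6, 8};
        double[] expertB = {6, 9, 5, 6, 9};
        System.out.println("r = " + correlation(expertA, expertB));
    }
}
```

    A value close to 1 means the experts tend to agree, a value close to -1 means they systematically disagree, and a value near 0 means one expert's rating says little about the other's.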

    4. Automatic recommendation


    Let's look at an example using the interesting free library Apache Mahout. Assume that there are three objects. I rated only two of them (I gave the first object just two points and the second one five). Besides me, three more people rated the objects, but unlike me, they rated all three. Let's look at the table with all the ratings:



    Indeed, if you pass this data to Mahout, it will recommend the third object to me. There is no point in recommending the first two objects, since I not only know about them but have already rated them. Moreover, Mahout takes into account how similar my opinion is to that of the three other people: if I give the first object a very different rating (say, 10), Mahout will not recommend anything to me. Did I get anything wrong? Let's check.
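
    The reasoning behind this claim can be sketched in plain Java without Mahout. This is a deliberately simplified illustration, not Mahout's implementation. The ratings of the other three users are invented, because the original table is not reproduced here. Neighbours are assumed to be users whose Pearson similarity with me is at least a threshold, and the predicted rating is assumed to be a similarity-weighted average of the neighbours' ratings. All names (UserBasedSketch, similarity, recommend, row) are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class UserBasedSketch {
    // Pearson correlation over the items both users rated; NaN if undefined.
    static double similarity(Map<Integer, Double> a, Map<Integer, Double> b) {
        List<Integer> common = new ArrayList<>();
        for (Integer item : a.keySet()) {
            if (b.containsKey(item)) common.add(item);
        }
        int n = common.size();
        if (n < 2) return Double.NaN;
        double meanA = 0, meanB = 0;
        for (Integer i : common) {
            meanA += a.get(i);
            meanB += b.get(i);
        }
        meanA /= n;
        meanB /= n;
        double sab = 0, saa = 0, sbb = 0;
        for (Integer i : common) {
            double da = a.get(i) - meanA;
            double db = b.get(i) - meanB;
            sab += da * db;
            saa += da * da;
            sbb += db * db;
        }
        return sab / Math.sqrt(saa * sbb);
    }

    // Predicted ratings for items the user has not rated yet, based only on
    // neighbours whose similarity to the user is at least the threshold.
    static Map<Integer, Double> recommend(Map<Integer, Map<Integer, Double>> ratings,
                                          int userId, double threshold) {
        Map<Integer, Double> mine = ratings.get(userId);
        Map<Integer, Double> weightedSum = new TreeMap<>();
        Map<Integer, Double> weightTotal = new TreeMap<>();
        for (Map.Entry<Integer, Map<Integer, Double>> other : ratings.entrySet()) {
            if (other.getKey() == userId) continue;
            double sim = similarity(mine, other.getValue());
            if (Double.isNaN(sim) || sim < threshold) continue; // not a neighbour
            for (Map.Entry<Integer, Double> r : other.getValue().entrySet()) {
                if (mine.containsKey(r.getKey())) continue; // already rated by the user
                weightedSum.merge(r.getKey(), sim * r.getValue(), Double::sum);
                weightTotal.merge(r.getKey(), sim, Double::sum);
            }
        }
        Map<Integer, Double> predicted = new TreeMap<>();
        for (Integer item : weightedSum.keySet()) {
            predicted.put(item, weightedSum.get(item) / weightTotal.get(item));
        }
        return predicted;
    }

    // Helper: ratings for items 1, 2, 3, ... in order.
    static Map<Integer, Double> row(double... prefs) {
        Map<Integer, Double> m = new HashMap<>();
        for (int i = 0; i < prefs.length; i++) m.put(i + 1, prefs[i]);
        return m;
    }

    public static void main(String[] args) {
        Map<Integer, Map<Integer, Double>> ratings = new TreeMap<>();
        ratings.put(1, row(2, 5));    // me: only the first two objects
        ratings.put(2, row(3, 6, 9)); // invented ratings of the other users
        ratings.put(3, row(1, 4, 8));
        ratings.put(4, row(2, 5, 7));
        System.out.println(recommend(ratings, 1, 0.1)); // third object is suggested
        ratings.put(1, row(10, 5));   // my opinion now contradicts everyone else's
        System.out.println(recommend(ratings, 1, 0.1)); // nothing to recommend
    }
}
```

    With these assumed ratings, every other user rates the second object higher than the first, just as I do, so all three become neighbours and the third object gets a predicted rating. Once my first rating is changed to 10, the correlation with each of them becomes negative, no neighbours remain, and the result is empty.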

    The MySQLJDBCDataModel class can read data from a MySQL database (the table should contain the columns user_id, item_id, and preference), and the FileDataModel class can load it from a file (a CSV file where each line looks like “1,1,2.0”; empty lines are ignored). In theory, a Yii application should not know anything about recommendation methods at all: it simply takes the necessary information from the database (a link table with the user identifier and the identifier of the recommended object) and displays it to the user.
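
    To make the file format concrete, here is a small stand-alone sketch that reads lines of the form “user,item,preference” and skips empty lines. It only illustrates the layout described above and is not how FileDataModel itself is implemented. The names PreferenceCsv and Preference are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class PreferenceCsv {
    // One "user,item,preference" triple.
    static final class Preference {
        final long userId;
        final long itemId;
        final float value;

        Preference(long userId, long itemId, float value) {
            this.userId = userId;
            this.itemId = itemId;
            this.value = value;
        }
    }

    // Parses CSV text where every non-empty line is "userId,itemId,preference".
    static List<Preference> parse(String csv) {
        List<Preference> result = new ArrayList<>();
        for (String line : csv.split("\\r?\\n")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) continue; // empty lines are ignored
            String[] parts = trimmed.split(",");
            if (parts.length < 3) {
                throw new IllegalArgumentException("Expected user,item,preference: " + line);
            }
            result.add(new Preference(
                    Long.parseLong(parts[0].trim()),
                    Long.parseLong(parts[1].trim()),
                    Float.parseFloat(parts[2].trim())));
        }
        return result;
    }

    public static void main(String[] args) {
        // User 1 rated item 1 with 2.0 and item 2 with 5.0.
        List<Preference> prefs = parse("1,1,2.0\n\n1,2,5.0\n");
        for (Preference p : prefs) {
            System.out.println(p.userId + " -> " + p.itemId + " = " + p.value);
        }
    }
}
```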

    I have done quite a lot of work on heavily loaded Yii sites (with traffic of several million people a day), including integration with various analytics systems and search platforms. Naturally, I have often had to dig into a huge number of Java projects, but this was the first time I integrated Mahout.

    Of course, there are many ways to exchange data, from direct uploads from external systems (export into the site database using Hibernate) to queues (Gearman, RabbitMQ). I have seen some funny cases where data was obtained by scraping sites with jsoup or even the very slow PhantomJS, and sometimes it was loaded from Excel using POI. But let's not talk about sad things.

    By the way, storage options are not boring either, ranging from MongoDB to search engines (Endeca, Solr, Sphinx, even the wonders built into ATG). Of course, such options have a right to exist and are often used by huge projects; in this article, however, I would like to consider a more common setup.

    Let's say we have a Yii site with traffic of several million people a day. A MySQL cluster is used as the database (memcached bears the brunt of the load). The application has no write permissions to the database; data is passed on exclusively through the API (into a Redis cluster), from where it is picked up (thanks to the free google-gson and Jedis libraries) by an analytics system written in Java. It is to this system that the Mahout library was added.

    But I want to get not just a list of identifiers but data that is ready for a widget. What do I need? Suppose I want to display a picture. I also need a title. And of course I need a link to the recommended object (the page the user lands on after clicking the widget). This is a universal option. In the system responsible for the export, I can add whatever logic I need to fill this table. The table structure might look something like this:

    use yii\db\Schema;
    use yii\db\Migration;
    class m150923_110338_recommend extends Migration
    {
        public function up()
        {
            $this->createTable('recommend', [
                'id' => $this->primaryKey(),
                'status' => $this->boolean()->notNull(),
                'url' => $this->string(255)->notNull(),
                'title' => $this->string(255)->notNull(),
                'image' => $this->string(255)->notNull(),
                'created_at' => $this->datetime()->notNull(),
                'updated_at' => $this->datetime()->notNull(),
            ]);
        }
        public function down()
        {
            $this->dropTable('recommend');
        }
    }
    

    The model should have a method that tells us which entities to recommend to a given user. Mahout will suggest some of the recommended objects. Of course, from the very beginning we should provide for the situation where Mahout cannot recommend anything (or recommends too few objects). The model might look something like this:

    namespace common\models;
    use Yii;
    use common\models\Api;
    /**
     * This is the model class for table "recommend".
     *
     * @property integer $id
     * @property integer $status
     * @property string $url
     * @property string $title
     * @property string $image
     * @property string $created_at
     * @property string $updated_at
     */
    class Recommend extends \yii\db\ActiveRecord
    {
        const STATUS_INACTIVE = 0;
        const STATUS_ACTIVE = 1;
        /**
         * @inheritdoc
         */
        public static function tableName()
        {
            return 'recommend';
        }
        /**
         * @inheritdoc
         */
        public function rules()
        {
            return [
                // created_at and updated_at are filled in by TimestampBehavior after validation,
                // so they must not be listed as 'required'
                [['status', 'url', 'title', 'image'], 'required'],
                [['status'], 'integer'],
                [['created_at', 'updated_at'], 'safe'],
                [['url', 'title', 'image'], 'string', 'max' => 255]
            ];
        }
        /**
         * @inheritdoc
         */
        public function attributeLabels()
        {
            return [
                'id' => 'ID',
                'status' => 'Status',
                'url' => 'Link',
                'title' => 'Title',
                'image' => 'Image link',
                'created_at' => 'Created',
                'updated_at' => 'Updated',
            ];
        }
        /**
         * @inheritdoc
         */
        public function behaviors()
        {
            return [
                [
                    'class' => \yii\behaviors\TimestampBehavior::className(),
                    'value' => new \yii\db\Expression('NOW()'),
                ],
            ];
        }
        /**
         * Status list
         */
        public function statusList()
        {
            return [
                self::STATUS_INACTIVE => 'Hidden',
                self::STATUS_ACTIVE => 'On the site',
            ];
        }
        /**
         * @param integer $userId
         * @param integer $limit
         */
        public static function getItemsByUserId($userId = 1, $limit = 6)
        {
            $itemIds = [];
            // The get method of the Api class already includes JSON::decode and exception handling
            // We receive IDs of Recommend records, not of site objects (products, news, pages)
            $mahout = Api::get('s=mahout&order=value&limit=' . (int)$limit . '&user=' . (int)$userId);
            if(!empty($mahout['status']) && $mahout['status'] == true) {
                $itemIds = $mahout['item-ids'];
            }
            if(count($itemIds) < $limit) {
                // Recommendations based on events on the site (adding products to the cart,
                // subscribing to news, the referral source, site search, etc.). If there are not enough,
                // a universal array is returned.
                $limit = $limit - count($itemIds);
                $recommend = Api::get('s=recommend&limit=' . (int)$limit . '&user=' . (int)$userId);
                if(!empty($recommend['status']) && $recommend['status'] == true) {
                    $itemIds = array_merge($itemIds, $recommend['item-ids']);
                }
            }
            return static::find()->where(['id' => $itemIds, 'status' => static::STATUS_ACTIVE])->all();
        }
    }
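
    The fallback logic of getItemsByUserId (take what Mahout suggests, then top the list up from a second source until the limit is reached) is independent of PHP. Below is a minimal Java sketch of the same idea. The class RecommendationMerger and its merge method are hypothetical, and unlike array_merge in the model above, this sketch also drops duplicate IDs, which is an extra assumption about the desired behaviour.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class RecommendationMerger {
    // Takes the primary IDs first, then tops the list up with fallback IDs,
    // skipping duplicates, until the limit is reached. Order is preserved.
    static List<Integer> merge(List<Integer> primary, List<Integer> fallback, int limit) {
        Set<Integer> merged = new LinkedHashSet<>();
        for (Integer id : primary) {
            if (merged.size() >= limit) break;
            merged.add(id);
        }
        for (Integer id : fallback) {
            if (merged.size() >= limit) break;
            merged.add(id);
        }
        return new ArrayList<>(merged);
    }

    public static void main(String[] args) {
        List<Integer> fromMahout = Arrays.asList(3, 7);           // too few items
        List<Integer> fromFallback = Arrays.asList(7, 9, 11, 13); // event-based suggestions
        System.out.println(merge(fromMahout, fromFallback, 4));
    }
}
```

    Keeping the primary list first preserves the personalised results at the top of the widget, while the fallback only fills the remaining slots.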
    

    The controller is not tricky at all either:

    namespace frontend\controllers;
    use Yii;
    use yii\web\Controller;
    use common\models\Recommend;
    class MainController extends Controller
    {
        private $_itemsLimit = 6;
        private $_cacheTime = 120;
        public function actionIndex()
        {
            $userId = Yii::$app->request->cookies->getValue('userId', 1);
            $recommends = Recommend::getDb()->cache(function ($db) use ($userId) {
                return Recommend::getItemsByUserId($userId, $this->_itemsLimit);
            }, $this->_cacheTime);
            return $this->render('index', ['recommends' => $recommends]);
        }
    }
    

    And here is the view (the V in MVC):

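    <?php
    use yii\helpers\Html;
    /* @var $this yii\web\View */
    /* @var $recommends common\models\Recommend[] */
    $this->title = 'Example';
    $this->params['breadcrumbs'][] = $this->title;
    ?>
    <h2>Recommended products:</h2>
    <?php foreach ($recommends as $recommend): ?>
        <a href="<?= Html::encode($recommend->url) ?>"><?= Html::encode($recommend->title) ?></a>
    <?php endforeach; ?>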

    The prototype is ready. All that remains is to move the code to the real system. I was due to start the task on Monday, so on Saturday I decided to try Mahout on my home computer. Reading a pile of books is good, but practice matters too. In a few minutes, I sketched a simple Java application that takes data from a CSV file and writes the result in JSON format.

    The interface requires us to implement just one method, which returns JSON. In this particular case, we need to pass the path to the CSV data file and the list of IDs of the users who need recommendations:

    package com.api.service;
    import java.util.List;
    public interface IService {
    	String run(String datasetFile, List<Integer> userIds);
    }
    

    Next, create a factory:

    package com.api.service;
    public class ServiceFactory {
    	/**
    	 * Get Service
    	 * @param type
    	 * @return
    	 */
    	public IService getService(String type) {
    		if (type == null) {
    			return null;
    		}
    		if(type.equalsIgnoreCase("Mahout")) {
    			return new MahoutService();
    		}
    		return null;
    	}
    }
    

    As an example, I will get a list of recommended objects for each user in the list of identifiers:

    package com.api.service;
    import java.io.IOException;
    import java.util.List;
    import org.apache.mahout.cf.taste.common.TasteException;
    import com.api.model.CustomUserRecommender;
    import com.api.util.MahoutHelper;
    import com.google.gson.Gson;
    import com.google.gson.GsonBuilder;
    public class MahoutService implements IService {
    	@Override
    	public String run(String datasetFile, List<Integer> userIds)  {
    		Gson gson = new GsonBuilder().create();
    		MahoutHelper mahoutHelper = new MahoutHelper();
    		List<CustomUserRecommender> customUserRecommenders = null;
    		try {
    			customUserRecommenders = mahoutHelper.customUserRecommender(userIds, datasetFile);
    		} catch (IOException | TasteException e) {
    			e.printStackTrace();
    		}
    		return gson.toJson(customUserRecommenders);
    	}
    }
    

    And here is the helper class itself:

    package com.api.util;
    import java.io.File;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.mahout.cf.taste.common.TasteException;
    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.neighborhood.ThresholdUserNeighborhood;
    import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
    import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
    import org.apache.mahout.cf.taste.recommender.UserBasedRecommender;
    import org.apache.mahout.cf.taste.similarity.UserSimilarity;
    import com.api.model.CustomUserRecommender;
    public class MahoutHelper {
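    	/**
    	 * Builds user-based recommendations (up to 10 items) for every given user.
    	 * @param userIds IDs of the users who need recommendations
    	 * @param datasetFile path to the CSV file with preferences
    	 * @return one CustomUserRecommender per user ID
    	 * @throws IOException if the data file cannot be read
    	 * @throws TasteException if Mahout fails to compute the recommendations
    	 */
    	public List<CustomUserRecommender> customUserRecommender(List<Integer> userIds, String datasetFile) throws IOException, TasteException {
    		List<CustomUserRecommender> customUserRecommenders = new ArrayList<>();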
    		DataModel datamodel = new FileDataModel(new File(datasetFile)); 
    		UserSimilarity usersimilarity = new PearsonCorrelationSimilarity(datamodel);
    		UserNeighborhood userneighborhood = new ThresholdUserNeighborhood(0.1, usersimilarity, datamodel);
    		UserBasedRecommender recommender = new GenericUserBasedRecommender(datamodel, userneighborhood, usersimilarity);
    		for (Integer userId : userIds) {
    			customUserRecommenders.add(new CustomUserRecommender(userId, recommender.recommend(userId, 10)));
    		}
    		return customUserRecommenders;
    	}
    }
    

    In the real project, the Mahout library was added to an existing system (a turnkey solution). As I already mentioned, an API was chosen as the data transfer method. Practice has shown that adding recommendations to key pages (for example, the product page) has a very good effect on conversion. Quite often, a personal ranking of recommended sites is also sent by e-mail, for example, once a week.

    If possible, try to put a small survey form on each page asking the user how interesting and useful a particular product is to them. At a minimum, you can offer two buttons ("+" and "-"). Such a dichotomous classification is usually expressed as numerical ratings (preferably 2 and 10, so that the difference is more pronounced). Try to motivate people to rate things: the more ratings there are, the easier it is to give an accurate recommendation. You can also take product orders into account (if someone bought it, they valued it). Just be very careful to guard against all kinds of rating manipulation. And please keep verifying the data with a series of experiments (A/B tests).
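
    Turning "+" and "-" votes into numerical ratings takes only a few lines. The sketch below uses the values 10 and 2 suggested above and emits lines in the “user,item,preference” CSV layout used for the data file earlier in the article. The class VoteMapper and its methods are hypothetical names chosen for illustration.

```java
class VoteMapper {
    static final float LIKE = 10.0f;   // numeric rating for "+"
    static final float DISLIKE = 2.0f; // numeric rating for "-"

    // Maps a dichotomous vote to a numeric preference.
    static float toPreference(char vote) {
        switch (vote) {
            case '+':
                return LIKE;
            case '-':
                return DISLIKE;
            default:
                throw new IllegalArgumentException("Unknown vote: " + vote);
        }
    }

    // Formats one vote as a "user,item,preference" CSV line.
    static String toCsvLine(long userId, long itemId, char vote) {
        return userId + "," + itemId + "," + toPreference(vote);
    }

    public static void main(String[] args) {
        System.out.println(toCsvLine(1, 3, '+'));
        System.out.println(toCsvLine(2, 3, '-'));
    }
}
```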

    I do not want to state the obvious, but the opinion of the majority is not always objectively correct. For example, a very beautiful 25-year-old woman may still worry about insecurities that arose in her childhood. Some guys may firmly believe in the effectiveness of NLP and hypnosis as ways to seduce girls. Even a kind old woman may treat her grandson's wound with an alcohol solution of brilliant green, although miramistin would clearly be a more sensible choice. The list could go on for a very long time. Ideally, you should add manual filtering of recommendations that are known to be poor (when other sites are being rated) or tighten quality control (when objects on your own site are being rated).
