Machine learning in a simple project

    I'm the CTO of Preply, and I want to talk a little about something every programmer dreams of: complex and interesting problems inside simple projects.

    More precisely, about how you can add a little science to a business and get a little benefit as a result. In this article I will describe one case of applying machine learning in a real project.

    Problem


    We run Preply, a tutoring platform, and everyone wants to cheat us.

    A user submits a request to a tutor on our site and, once they agree on terms, pays for lessons through the site. If lessons happen over Skype, we process all payments; if student and tutor meet in person, our commission is the cost of the first lesson.

    For some reason, tutors and students try to avoid paying for lessons through the site. To do so they use the internal messaging system, which is meant for clarifying details of upcoming lessons and becomes available once a request is sent to the tutor. Here are some examples of contact sharing:

    My skype is vasiliy.p, tel +789123456. So at 19:00 on April 1!

    Good evening! Could you write your number, or call mine: +78-975-12-34

    I do not want to pay before the lesson, my name is Vasily Pupkin - find me on VK


    An experienced programmer will immediately say: "What's the problem? Just write regular expressions for the possible ways of exchanging contacts." True, but this solution has several drawbacks:

    1. It is hard to cover every variant of an incorrect message (that is, one containing contacts). For example, the first version of the product had a set of regular expressions for phone numbers, but it fired on and blocked messages like:
      Friday - from 13 00-15 00-15 30 ... how much will the group lesson cost?

      In a trickier case, a regular expression for e-mail addresses was meant to block messages like:
      vasya (dog) pupkin (dot) ru

      but at the same time blocked a completely harmless text:
      I know English as a dog: I understand everything, but I can’t say it.

      The word "Skype" is even trickier: it is very hard to distinguish attempts to exchange Skype handles:
      please add me in Skype - vasya82pupkin

      from clarifying messages:
      do you want to have skype or local lessons?

    2. There is no control over the confidence threshold: a message is either blocked or not, and changing the logic means digging into the code. In real life, type II errors (letting a message through) are much cheaper than type I errors (false alarms): after a false alarm the user writes to support, a support manager spends time apologizing for the wrongful block and unblocking the message, not to mention the spoiled experience of using the service. On the other hand, users who exchange contacts rarely become our customers, so type II errors are easier to live with, since we won't earn from those users anyway (yes, this is a business).
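    To illustrate the missing knob: with a classifier that outputs a probability, the confidence threshold becomes a single tunable number. A minimal sketch; the function name and threshold values here are hypothetical, not our actual code:

```python
def classify(p_contact, threshold=0.9):
    """Block a message only if the model is at least `threshold` confident
    that it contains contact details. Raising the threshold trades
    type I errors (false alarms) for type II errors (missed messages)."""
    return "block" if p_contact >= threshold else "allow"

# The same borderline message can pass or be blocked depending on the knob:
classify(0.85, threshold=0.9)  # "allow"
classify(0.85, threshold=0.8)  # "block"
```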


    At some point I decided to spend a weekend making the blocking process more scientific. What follows is what came of it. I should say right away that my goal was not to do everything rigorously and scientifically, but rather to get something that works reliably and has a positive impact on revenue.

    Solution


    I decided to try three machine learning methods I remembered from Andrew Ng's Machine Learning course on Coursera to classify messages as correct or incorrect.

    The first problem is preparing a training set. We had over 50,000 messages already classified by the old system. I took only 5,000 of them and spent about 2-3 hours correcting the labels on messages where the old system had made mistakes. In theory, the bigger the dataset the better, but in the real world it is quite hard to prepare a large sample by hand (in other words, laziness).

    One nuance of this time-consuming preparation is ethics. Frankly, I was uncomfortable reading other people's messages, so beforehand I shuffled the words in each one: skimming still reveals suspicious messages, but without revealing their meaning. For example:

    Before:
    I would like to start classes in the month of February, is this possible? I can also tell the exact time in January, but it will definitely not be earlier than 18:00

    After:
    maybe no classes since February? In January it would be exact, also earlier than 18:00 I’m sure but I can say I would like to start the month, it’s

    Before:
    I will be glad to be of service, my tel. (012) 345-678 Call, we will agree, thanks

    After:
    tel. be glad I will be useful, my Call, we will agree, thanks (012) 345-678
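    The shuffling itself is a small script. A minimal sketch of the idea (the function name is illustrative, not our actual code):

```python
import random

def anonymize(message, seed=None):
    """Shuffle the word order: suspicious tokens (phone numbers, handles)
    stay visible when skimming, but the meaning of the message is lost."""
    words = message.split()
    random.Random(seed).shuffle(words)
    return " ".join(words)
```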

    The result is a CSV file of ~5,000 lines, where incorrect messages are labeled 0 and correct ones 1. After that, based on working with the data, I picked out a set of message features that "by eye" influence the classification:

    • suspected phone number;
    • suspected e-mail address;
    • suspected Skype contact;
    • suspected URL;
    • suspected social-network handle;
    • legitimate words that go with numbers (times, currency);
    • message length;
    • suspicious words: find, add, my;
    • …etc.


    After defining the features, I wrote several regular expressions for each of them; for example:

    import re

    SEPARATOR = "|"
    reg_arr = [re.compile(u'фейсбук|facebook|linkedin|vkontakt|вконтакт', re.IGNORECASE | re.UNICODE),
               re.compile(u'соц.{1,10}сет', re.UNICODE),
               re.compile(u'скайп[^у]|skype', re.IGNORECASE | re.UNICODE),
               re.compile(u'скайпу', re.IGNORECASE | re.UNICODE),
               # Cyrillic followed by Latin characters is suspicious on its own
               re.compile(u'[йцукенгшщздлорпавифячсмітьбю].*\s[a-zA-Z]', re.IGNORECASE | re.UNICODE),
               # phone numbers in several common layouts
               re.compile('\d{3}[-\.\s]??\d{3}[-\.\s]??\d{4}|\(\d{3}\)\s*\d{3}[-\.\s]??\d{4}|\d{3}[-\.\s]??\d{4}'),
    ...
               re.compile('http|www|\.com', re.IGNORECASE),
               re.compile(u'мой|my', re.IGNORECASE | re.UNICODE),
               re.compile(u'найдите|find', re.IGNORECASE | re.UNICODE),
               re.compile(u'добавь|add', re.IGNORECASE | re.UNICODE),
    ...
               re.compile('\w+@\w+', re.IGNORECASE),
               # message-length buckets as binary features
               re.compile('.{0,50}', re.IGNORECASE),
               re.compile('.{50,200}', re.IGNORECASE),
    ....
               ]

    def feature_vector(text):
        # one element per regex: 1 if the pattern fires, 0 otherwise
        return [1 if x.search(text) else 0 for x in reg_arr]

    fi = open('db_machine.csv', 'r')
    fo = open('db_machine_result.csv', 'w')
    for line in fi:
        [text, result] = line.rstrip('\n').split(SEPARATOR)
        output = feature_vector(text) + [result]  # label goes in the last column
        fo.write(",".join(map(str, output)) + "\n")
    fo.close()
    fi.close()
    


    After running every message through all the features (we now have about a hundred of them), we write each feature vector along with its classification label to a file.

    After preparing the data, the sample needs to be split into three parts: training (train set), parameter selection (cross-validation set), and verification (test set). Following the course's advice, they are sized in a 60/20/20 ratio:

    import random
    with open('db_machine_result.csv','r') as source:
    	data = [ (random.random(), line) for line in source ]
    data.sort()
    n = len(data)
    with open('db_machine_result_train.csv','w') as target:
    	for _, line in data[:int(n*0.60)]:
    		target.write( line )
    with open('db_machine_result_cross.csv','w') as target:
    	for _, line in data[int(n*0.60):int(n*0.80)]:
    		target.write( line )		
    with open('db_machine_result_test.csv','w') as target:
    	for _, line in data[int(n*0.80):]:
    		target.write( line )
    


    Then, not wanting to reinvent the wheel and wanting results as quickly as possible, I took the scripts from the Coursera Machine Learning course and simply ran our samples through logistic regression, an SVM, and a neural network. The scripts come straight from the course; the SVM one, for example, looks like this:

    clear ; close all; clc
    data_train = load('db_machine_result_train.csv'); 
    X = data_train(:, 1:end-1); y = data_train(:,end);
    data_val = load('db_machine_result_cross.csv'); 
    Xval = data_val(:, 1:end-1); yval = data_val(:,end);
    data_test = load('db_machine_result_test.csv'); 
    Xtest = data_test(:, 1:end-1); ytest = data_test(:,end);
    [C, sigma] = dataset3Params(X, y, Xval, yval); % pick the parameters on the cross-validation set
    fprintf('C: %f\n', C);
    fprintf('sigma: %f\n', sigma);
    model= svmTrain(X, y, C, @(x1, x2) gaussianKernel(x1, x2, sigma));
    p = svmPredict(model, X);
    fprintf('Training Accuracy: %f\n', mean(double(p == y)) * 100);
    p = svmPredict(model, Xtest);
    fprintf('Test Accuracy: %f\n', mean(double(p == ytest)) * 100);
    fprintf('Program paused. Press enter to continue.\n');
    pause;
    


    You can see how the svmTrain / svmPredict functions are implemented on the course website.

    For each algorithm, the internal parameters were tuned on the cross-validation set (λ for regularization, σ and C for the Gaussian kernel, size for the size of the neural network's hidden layer). The final accuracy results for some of them:

    Neural network (size = 30)       Logistic regression       SVM
    λ = 1: 96.41%                    λ = 0: 97.51%             Linear (λ 0.001, σ = 0.001): 96.48%
    λ = 0.01: 97.88%                 λ = 0.01: 97.88%          Gaussian (λ 0.1, σ = 0.1): 97.14%
    λ = 0.001: 98.16%                λ = 1: 98.16%             Gaussian (λ 0.001, σ = 0.001): 98.89%
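    The parameter selection behind these numbers is just an exhaustive sweep over candidate values on the cross-validation set. A generic sketch of what a function like dataset3Params does under the hood (the candidate grids and the `score` callback are assumptions for illustration, not the course's exact code):

```python
def pick_params(score, Cs=(0.01, 0.1, 1, 10), sigmas=(0.001, 0.01, 0.1, 1)):
    """Try every (C, sigma) pair; `score(C, sigma)` should train a model
    on the train set and return its accuracy on the cross-validation set.
    Keep the pair with the best accuracy."""
    best_acc, best_C, best_sigma = -1.0, None, None
    for C in Cs:
        for sigma in sigmas:
            acc = score(C, sigma)
            if acc > best_acc:
                best_acc, best_C, best_sigma = acc, C, sigma
    return best_C, best_sigma, best_acc
```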


    I should clarify that during development the results were much worse (96.6% for the SVM, for example), and debugging brought very tangible improvements. We ran logistic regression, as the simplest and fastest method, over the full sample and reviewed its classifications. To our surprise, the system turned out to be smarter than me: in about 30% of the disagreements the human label was the wrong one (as I wrote, I reviewed ~5,000 messages and, it turned out, made some 30-40 labeling errors), while the system had classified them correctly. During debugging we fixed those errors in the dataset, and the method's accuracy grew accordingly. We also extended the feature vector whenever we spotted an interesting pattern the system did not yet handle.

    We settled on the SVM; its performance on the full sample was as follows:

                             Fact: correct    Fact: incorrect
    Forecast: correct             4998               36
    Forecast: incorrect             11              390


    Since the classes here are heavily imbalanced ("skewed classes"), I will also give metrics better suited for comparing algorithms:

    Precision: 99.28%    Recall: 99.78%    Accuracy: 99.13%
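    These metrics follow directly from the confusion matrix above, with the "correct message" class treated as positive:

```python
def metrics(tp, fp, fn, tn):
    """Precision, recall and accuracy from confusion-matrix counts."""
    precision = float(tp) / (tp + fp)   # of messages predicted correct, how many really were
    recall = float(tp) / (tp + fn)      # of truly correct messages, how many we let through
    accuracy = float(tp + tn) / (tp + fp + fn + tn)
    return precision, recall, accuracy

# The counts from the table above:
precision, recall, accuracy = metrics(tp=4998, fp=36, fn=11, tn=390)
```

    Up to rounding, this reproduces the percentages above.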


    In the end we chose the SVM with a Gaussian kernel to filter messages on the site. It is more complex than logistic regression and runs more slowly, but gives a noticeably better result.
    The complete message-processing path looks like this:
    • the user sends a message on the site; Backbone.js creates a model on the client and sends a POST request to our server API;
    • the server API, written with Django and TastyPie, runs Django form validation on the model;
    • the first validator pulls the user's profile from the database and checks whether the user is already flagged as a violator (no further checks needed, immediate 403) or has already paid through the site (no further checks needed, immediate 201);
    • the svmPredict validator returns the result of checking the message text; if the user broke the rules, the corresponding flag is set on their profile; otherwise everything is fine, the API returns 201, and the message is written to the database;
    • if the message contained contacts, or the user was already a violator, the client receives a 403, upon which Backbone renders a notice that the user is breaking the rules; the user is marked as a violator in the database.
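    The validation chain above can be sketched in plain Python (class and function names here are illustrative, not our actual Django code):

```python
class ContactSharingError(Exception):
    """Raised when a message must be rejected; maps to a 403 in the API."""

def validate_message(user, text, predict):
    """`predict(text)` is the classifier: 1 for a correct message, 0 otherwise."""
    if user.get("is_violator"):
        raise ContactSharingError("user is already flagged")    # immediate 403
    if user.get("has_paid"):
        return text                                             # trusted, immediate 201
    if predict(text) == 0:
        user["is_violator"] = True                              # remember the violation
        raise ContactSharingError("message contains contacts")  # 403
    return text                                                 # 201, message is stored
```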

    So far, it has been working well, and we are happy about it.

    Conclusions


    It is easy to see why machine learning works better than the old system: it uncovers relationships between features that stayed hidden from expert observation. For example, we used to have a regular expression plus several if-conditions for the case "if the text contains both Cyrillic and Latin characters plus a few digits, and the message is short, then it is most likely a contact exchange." Now we simply count the individual events, and the system itself figures out how they relate and builds the rules in our place.
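    For comparison, that kind of hand-written interaction rule might have looked like this (an illustrative reconstruction, not the original code; the thresholds are made up):

```python
import re

def old_rule(text):
    """Hand-written heuristic: Cyrillic plus Latin plus a few digits
    in a short message probably means a contact exchange."""
    has_cyrillic = bool(re.search(u'[а-яА-Я]', text))
    has_latin = bool(re.search('[a-zA-Z]', text))
    digits = len(re.findall(r'\d', text))
    return has_cyrillic and has_latin and 1 <= digits <= 15 and len(text) < 120
```

    The SVM learns interactions like this on its own, including ones no one thought to encode.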

    We now really do use the SVM in production to classify messages, thanks to its good accuracy. We use it in a very simple way: we took the weights of the optimal model and classify with the svmPredict function, mentioned above, ported to Python. In an ideal world we would build a supervised feedback loop, where an administrator points out classification errors and the system adjusts its weights and improves. But our project lives in the real world, where time = money, and so far we are happy that support requests about wrongful blocking have dropped by half. Balancing the confidence threshold, and with it type I and type II errors, is another interesting idea, but for now everything suits us. Measuring "missed message" errors is rather hard; I will only note that the conversion of requests into payments has not fallen since the system was introduced. In other words, even if there are more misses, it does not hurt the business, and by eye there are actually fewer of them. Altogether, a very good result for a weekend's work.
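    The ported prediction step is small. A minimal sketch of a Gaussian-kernel svmPredict in pure Python (the model layout with support vectors, coefficients and bias is an assumption about how the Octave model translates, not our exact port):

```python
import math

def gaussian_kernel(x1, x2, sigma):
    """RBF kernel between two feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x1, x2))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def svm_predict(model, x):
    """Sum the kernel responses against the stored support vectors,
    weighted by their coefficients, plus the bias; the sign decides."""
    s = model["b"]
    for sv, coef in zip(model["support_vectors"], model["coefs"]):
        s += coef * gaussian_kernel(sv, x, model["sigma"])
    return 1 if s >= 0 else 0
```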

    If this topic interests you, I am ready to write about the collaborative filtering approach behind our tutor recommendations. If you want the code, get in touch as well: there is nothing secret in it, but in this article I wanted to focus on describing the pipeline.

    P.S. We are growing, and we are looking for two smart and responsible programmers for our Kiev office: an intern and a more experienced developer to take on the tasks my two hands can't reach. Our stack is Python/Django and JS/Backbone. Plenty of interesting tasks and best practices. Write to dmytro@preply.com
