Search for connections in social networks
Hello, Habr! In this post we want to share our solution to the problem of predicting hidden connections in Beeline's corporate social network Beehive, which we tackled as part of the Microsoft virtual hackathon. Before this hackathon, our team already had successful experience with similar problems at the Odnoklassniki hackathon, and we were keen to try our best practices on new data. In this article, we describe the main approaches used to solve such problems and share the details of our solution.
Statement of the problem and initial data
Developing a high-quality friend-recommendation algorithm is a top priority for almost any social network: such functionality is a powerful tool for attracting and retaining users.
There are many publications on this topic in the English-language literature, and the task even has its own abbreviation: PYMK (People You May Know).
As part of the Microsoft virtual hackathon, Beeline provided the graph of its corporate social network Beehive, in which 5% of the edges had been artificially hidden. The task was to find the hidden edges of the original graph.
In addition to the existence of a connection between users of the social network, each pair was annotated with the companies the users work for, the number of messages sent and received, the duration of calls, and the number of documents exchanged.
The edge-search problem is formulated as a binary classification problem, with the F1 measure proposed as the evaluation metric. In some tasks of this kind the quality metric is computed separately for each user and then averaged; in this problem, quality was evaluated globally over all edges.
Training and test
In the general case, to find hidden edges in a graph, one must enumerate all possible pairs of vertices and estimate the probability of a connection for each pair.
For large graphs, this approach demands substantial computing resources, so in practice the candidate set is usually restricted to pairs that have at least one common friend. As a rule, this restriction noticeably reduces the amount of computation and speeds up the algorithm without a significant loss in quality.
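As a minimal sketch of this restriction (plain Python on a toy edge list, not the hackathon data), candidate pairs with at least one common friend can be generated by enumerating, for every user, all pairs of that user's neighbours:

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(edges):
    """Return non-adjacent pairs that share at least one common friend.

    edges: iterable of (u, v) undirected edges.
    """
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    existing = {frozenset(e) for e in edges}
    candidates = set()
    # every pair of neighbours of w has w as a common friend
    for w, friends in adj.items():
        for a, b in combinations(sorted(friends), 2):
            pair = frozenset((a, b))
            if pair not in existing:
                candidates.add(pair)
    return candidates

# candidate_pairs([(1, 2), (1, 3), (2, 3), (3, 4)])
#   == {frozenset({1, 4}), frozenset({2, 4})}
```

Only pairs reachable through a shared neighbour are ever considered, which is exactly the pruning described above.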
Each candidate pair is described by a feature vector and a binary label: 1 if the edge exists and 0 if it does not. On the resulting {feature vector, label} set, a model is trained to predict the probability of an edge for each candidate pair.
Because the graph in this problem is undirected, the feature vector should not depend on the order of the candidates within a pair. This property allows each candidate pair to be counted only once during training, halving the size of the training sample.
To decide which edges are hidden in the original graph, the probability estimate produced by the model must be converted into a binary answer by choosing an appropriate threshold.
To tune model parameters and assess quality, we removed 5% of randomly selected edges from the provided graph. The remaining graph was used for candidate search, feature generation, and model training; the hidden edges were used for threshold selection and the final quality assessment.
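The split itself is a one-liner in spirit; here is a sketch (function name and toy data are ours, not from the post):

```python
import random

def hide_edges(edges, fraction=0.05, seed=42):
    """Split an edge list into a training graph and a hidden set.

    The hidden edges serve as positive examples for validation,
    mimicking the organisers' 5% removal.
    """
    rng = random.Random(seed)
    edges = list(edges)
    rng.shuffle(edges)
    n_hidden = int(len(edges) * fraction)
    return edges[n_hidden:], edges[:n_hidden]  # (train, hidden)

train, hidden = hide_edges([(i, i + 1) for i in range(100)])
# len(train) == 95, len(hidden) == 5
```

Fixing the seed keeps the validation split reproducible across feature-generation runs.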
The basic approaches for generating features in the PYMK task are described below.
Counters
For each user, we compute statistics such as the distribution of friends by geography, community, age, or gender. From these statistics we obtain an estimate of how similar two candidates are, for example via a scalar product.
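For instance, with friends-per-city counters (the counter keys below are invented for illustration), the scalar product rewards candidates whose friends live in the same places:

```python
def counter_similarity(c1, c2):
    """Scalar product of two sparse counters, e.g. the number of
    friends per city for each of the two candidates."""
    return sum(v * c2.get(k, 0) for k, v in c1.items())

# overlapping keys drive the score: 3 * 2 (msk) + 0 (spb, nsk) = 6
counter_similarity({"msk": 3, "spb": 1}, {"msk": 2, "nsk": 5})  # -> 6
```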
Set similarity and common friends
The Jaccard coefficient measures the similarity of two sets; the sets can be, for example, the candidates' friends or their communities.
The Adamic/Adar coefficient is essentially a weighted sum over the common friends of the two candidates.
The weight of each term depends on the number of friends the mutual friend has: the fewer friends a mutual friend has, the greater their contribution to the resulting sum. We made active use of this idea in our solution.
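Both coefficients are easy to compute from an adjacency structure; a sketch on a toy graph (naming is ours):

```python
import math

def jaccard(a, b):
    """Jaccard similarity of two sets (e.g. friend lists)."""
    return len(a & b) / len(a | b) if a or b else 0.0

def adamic_adar(u, v, adj):
    """Sum over common friends w of 1 / log(deg(w)): a mutual friend
    with few friends of their own contributes more to the score."""
    return sum(1.0 / math.log(len(adj[w]))
               for w in adj[u] & adj[v] if len(adj[w]) > 1)

adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
jaccard(adj[1], adj[2])   # -> 1/3 (shared friend 3 out of {1, 2, 3})
adamic_adar(1, 2, adj)    # -> 1 / log(3), since 3 has three friends
```

The `len(adj[w]) > 1` guard skips degree-1 mutual friends, whose logarithm would be zero.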
Latent factors
These features are obtained from matrix factorizations. The factorization can be applied to the user–user relationship matrix as well as to community–user, geography–user, and similar matrices. The resulting latent-factor vectors can be used to estimate the similarity of objects, for example with cosine similarity.
Perhaps the most common matrix factorization algorithm is SVD. One can also use ALS, popular in recommender systems, and BigCLAM, an algorithm for community detection in graphs.
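For illustration, a rank-2 SVD of a toy adjacency matrix with NumPy (the matrix and the rank are chosen arbitrarily):

```python
import numpy as np

# toy symmetric user-user adjacency matrix
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

# rows of U scaled by the top-k singular values are latent user vectors
U, s, Vt = np.linalg.svd(A)
k = 2
latent = U[:, :k] * s[:k]

def cosine(x, y):
    """Cosine similarity between two latent vectors."""
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

sim = cosine(latent[0], latent[1])  # similarity of users 0 and 1
```

On the real graph one would use a sparse truncated SVD rather than the dense `np.linalg.svd` shown here.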
Graph-based features
This group of features is computed from the structure of the graph. As a rule, to save computation time, only part of the graph is used, for example a depth-2 subgraph of common friends.
One of the most popular features is hitting time: the average number of steps needed to travel from one candidate to the other along a random walk that takes edge weights into account. At each step, the next vertex is chosen randomly with a probability that depends on the attributes of the edges leaving the current node.
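Hitting time can be estimated with Monte-Carlo random walks; the sketch below is our own simplification (uniform edge weights by default, walks capped to keep the estimate finite):

```python
import random

def estimate_hitting_time(adj, weights, src, dst,
                          n_walks=200, max_steps=50, seed=0):
    """Average number of steps a weighted random walk takes to first
    reach dst from src; each walk is capped at max_steps."""
    rng = random.Random(seed)
    total = 0
    for _ in range(n_walks):
        node, steps = src, 0
        while node != dst and steps < max_steps:
            nbrs = sorted(adj[node])
            w = [weights.get((node, n), 1.0) for n in nbrs]
            node = rng.choices(nbrs, weights=w)[0]
            steps += 1
        total += steps
    return total / n_walks

# on a two-node path the walk always arrives in exactly one step
estimate_hitting_time({1: {2}, 2: {1}}, {}, 1, 2)  # -> 1.0
```

Edge attributes such as message counts would go into the `weights` dictionary to bias the walk toward stronger ties.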
Our solution
In solving this problem, we made active use of the idea behind the Adamic/Adar coefficient: not all friends are equally useful. We experimented with the discount function, trying fractional powers instead of the logarithm, and with weighting mutual friends by various utility attributes.
Because we worked on the task in parallel, we ended up with two independent solutions, which were later combined for the final submission.
The first solution builds on the Adamic/Adar idea: an empirical formula that accounts for both the number of friends of each common friend and the flow of messages between the candidates through their common friends.
n is the number of mutual friends;
x_i is the number of friends of the i-th mutual friend;
y_i is the sum of incoming and outgoing messages passed between the candidates through the i-th mutual friend.
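The formula itself appeared as an image in the original post and is not reproduced here. The sketch below is only a hypothetical reconstruction from the surrounding description (an Adamic/Adar-style sum with a cube-root discount, as in the feature table, and message flow boosting each term), not the authors' exact formula:

```python
def empirical_score(friend_degrees, messages):
    """Hypothetical reconstruction: each mutual friend i with x_i
    friends and y_i messages routed through them contributes
    (1 + y_i) / x_i ** (1/3) to the pair's score."""
    return sum((1.0 + y) / x ** (1.0 / 3.0)
               for x, y in zip(friend_degrees, messages) if x > 0)

# two "quiet" mutual friends with one friend each contribute 1.0 apiece
empirical_score([1, 1], [0, 0])  # -> 2.0
```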
In the second solution, we generated 32 features and trained a logistic regression and a random forest on them.
Models from the first and second solutions were combined using another logistic regression.
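Blending with a second-level logistic regression can be sketched with scikit-learn (the out-of-fold scores below are toy numbers, not the actual model outputs):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# columns: empirical formula, logistic regression, random forest scores
base_scores = np.array([[0.9, 0.8, 0.7],
                        [0.2, 0.3, 0.1],
                        [0.8, 0.7, 0.9],
                        [0.1, 0.2, 0.3]])
y = np.array([1, 0, 1, 0])  # 1 = edge exists

# the blender learns how much to trust each base model
blender = LogisticRegression().fit(base_scores, y)
blended = blender.predict_proba(base_scores)[:, 1]  # final edge scores
```

In practice the base scores must come from out-of-fold predictions, otherwise the blender overfits to the base models' training error.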
The table below describes the main features used in the second solution.
Feature | Description |
---|---|
— | An analog of the Adamic/Adar coefficient, with a cube root used instead of the logarithm |
— | Weighting of common friends by message throughput: the lower the ratio of outgoing to incoming messages/calls/documents of a mutual friend, the higher their weight in the sum |
flow_common | Throughput of messages/calls/documents between the candidates via a mutual friend: the higher the throughput, the greater the weight in the sum |
friends_jaccard | Jaccard coefficient over the candidates' friend sets |
friend_company | Similarity based on the share of the user's friends who work at the candidate's company |
company_jaccard | Jaccard coefficient over the candidates' companies (equal to one if the candidates are from the same company) |
The table below shows quality estimates for the individual models and for their combinations.
Model | F1 | Precision | Recall |
---|---|---|---|
Empirical formula | 0.064 | 0.059 | 0.069 |
Log Regression | 0.060 | 0.057 | 0.065 |
Random forest | 0.065 | 0.070 | 0.062 |
Log Regression + Random Forest | 0.066 | 0.070 | 0.063 |
Log Regression + Random Forest + Empirical Formula | 0.067 | 0.063 | 0.071 |
Threshold selection
So, the model is trained. The next step is to select a threshold that optimizes the evaluation metric.
In this problem, we optimized the F1 measure.
This metric is the harmonic mean of precision and recall and is equally sensitive to both.
Since F1 as a function of the threshold is unimodal, finding its maximum is not difficult.
To select the optimal threshold value in this problem, we used a binary search over the threshold.
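The post used binary search; as an alternative minimal sketch, one can simply sweep every observed score as a candidate threshold:

```python
import numpy as np
from sklearn.metrics import f1_score

def best_threshold(y_true, scores):
    """Evaluate every unique score as a threshold and return the one
    maximising F1, together with the F1 value it achieves."""
    best_t, best_f1 = 0.5, -1.0
    for t in np.unique(scores):
        f1 = f1_score(y_true, scores >= t)
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1

t, f1 = best_threshold(np.array([0, 0, 1, 1]),
                       np.array([0.1, 0.2, 0.8, 0.9]))
# t == 0.8, f1 == 1.0
```

F1 only changes at the observed scores, so this sweep is exact, at the cost of one metric evaluation per unique score.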
Technology
The source graph was given as a list of edges with user identifiers and the corresponding attributes. In total, the training sample contained 5.5 million connections. The source data was provided as a text file in CSV format and occupies 163 MB on disk.
As part of the hackathon, we were given resources in the Azure cloud under the Microsoft BizSpark program, where we created a virtual machine for our computations. The server cost $0.20 per hour regardless of computation intensity, and the budget allocated by the organizers was sufficient to solve the problem.
We implemented the common-friends search in Spark; intermediate results were cached to disk in Parquet format, which significantly reduced read times. The common-friends search took 8 hours on the virtual machine, and the candidates with their lists of mutual friends occupy 2.1 GB in Parquet format.
The training and model-selection pipeline is implemented in Python with the scikit-learn package. Feature generation, model training, and threshold selection together took about 3 hours on the virtual server.
In conclusion, I would like to thank Ivan Bragin for his active participation in solving the problem and his creativity in devising the empirical formula.