Cloud recommendation system with Hadoop and Apache Mahout

Original author: Windows Azure Team

Apache Mahout is a library designed for building scalable machine learning applications. Recommendation systems are among the most recognizable machine learning applications in use today. In this guide, we use the Million Song Dataset online archive to recommend songs to users based on their musical preferences.



What will be discussed in this guide:



  • How to use the recommendation system

This guide consists of the following sections.


  1. Exploring and formatting data
  2. Running the Mahout job

Installation and setup


To complete the tasks in this guide, you need an account with access to the Apache Hadoop-based services for Windows Azure, and you need to create a cluster. To get an account and create a Hadoop cluster, follow the instructions in the "Getting Started with Microsoft Hadoop on Windows Azure" section of the "Introduction to Hadoop on Windows Azure" article.



Exploring and formatting data


Apache Mahout offers a built-in implementation of item-based collaborative filtering. Item-based collaborative filtering is a widely used approach to analyzing data when making recommendations.



In this guide, users interact with items (songs). Users express their preferences for these items through the number of times they listen to each song. Sample data is available on the Echo Nest Taste Profile Subset web page.



Fig. 1. Sample data from the Million Song Dataset archive
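
To make the idea of item-based collaborative filtering more concrete, here is a minimal sketch in C#. It is not Mahout's implementation: it assumes a simple co-occurrence similarity (how many users listened to both songs) and a tiny made-up data set, and the names ItemBasedSketch, CoOccurrence, Recommend, and SampleData are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Illustrative only: a toy item-based recommender, not Mahout's algorithm.
public static class ItemBasedSketch
{
    // Counts, for every ordered pair of distinct songs, how many users listened to both.
    public static Dictionary<Tuple<int, int>, int> CoOccurrence(
        Dictionary<int, Dictionary<int, int>> playsByUser)
    {
        var counts = new Dictionary<Tuple<int, int>, int>();
        foreach (var songs in playsByUser.Values)
            foreach (var a in songs.Keys)
                foreach (var b in songs.Keys)
                {
                    if (a == b) continue;
                    var pair = Tuple.Create(a, b);
                    int current;
                    counts.TryGetValue(pair, out current);
                    counts[pair] = current + 1;
                }
        return counts;
    }

    // Scores each song the user has not heard by summing
    // (co-occurrence with a heard song) * (play count of that heard song).
    public static List<KeyValuePair<int, double>> Recommend(
        Dictionary<int, Dictionary<int, int>> playsByUser, int user)
    {
        var counts = CoOccurrence(playsByUser);
        var heard = playsByUser[user];
        var scores = new Dictionary<int, double>();
        foreach (var entry in counts)
        {
            int heardSong = entry.Key.Item1, candidate = entry.Key.Item2;
            if (!heard.ContainsKey(heardSong) || heard.ContainsKey(candidate)) continue;
            double current;
            scores.TryGetValue(candidate, out current);
            scores[candidate] = current + entry.Value * heard[heardSong];
        }
        return scores.OrderByDescending(s => s.Value).ThenBy(s => s.Key).ToList();
    }

    // Made-up sample data: user -> (song -> play count).
    public static Dictionary<int, Dictionary<int, int>> SampleData()
    {
        return new Dictionary<int, Dictionary<int, int>>
        {
            { 1, new Dictionary<int, int> { { 10, 3 }, { 20, 1 } } },
            { 2, new Dictionary<int, int> { { 10, 2 }, { 20, 4 }, { 30, 5 } } },
            { 3, new Dictionary<int, int> { { 20, 1 }, { 30, 2 } } },
        };
    }

    public static void Main()
    {
        foreach (var item in Recommend(SampleData(), 1))
            Console.WriteLine("song {0}: score {1}", item.Key, item.Value);
    }
}
```

In this made-up data, user 1 has already heard songs 10 and 20, so song 30 is the only candidate. Mahout's RecommenderJob applies the same general idea at scale as a series of Hadoop jobs and supports other similarity measures.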



To use a dataset with Mahout, you need to complete two tasks.



  1. Convert song and user IDs to integer values.
  2. Save the new values together with their ratings to a comma-separated file.

Launch Visual Studio 2010. In the program window, select File -> New Project. In the Installed Templates pane, under the Visual C# node, select the Windows category, and then select Console Application from the list. Name the project ConvertToMahoutInput.



Fig. 2. Creating a console application



After creating the application, open the Program.cs file and add the following static members to the Program class.



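// The code in this guide also needs "using System.IO;" at the top of Program.cs.
const char tab = '\u0009';
// Map the original string identifiers to sequential integer IDs.
static Dictionary<string, int> usersMapping = new Dictionary<string, int>();
static Dictionary<string, int> songMapping = new Dictionary<string, int>();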

Then add the following code to the Main method.



// args[0] is the path to the tab-separated input file (user ID, song ID, play count).
var outputFile = "mInput.txt";
var inputStream = File.Open(args[0], FileMode.Open);
var reader = new StreamReader(inputStream);
// FileMode.Create truncates an existing file, so a rerun leaves no stale data behind.
var outStream = File.Open(outputFile, FileMode.Create);
var writer = new StreamWriter(outStream);
var i = 0;
var line = reader.ReadLine();
// Convert at most the first 5000 lines to keep the sample small.
while (!string.IsNullOrWhiteSpace(line) && i < 5000)
{
    var outLine = line.Split(tab);
    int user = GetUser(outLine[0]);
    int song = GetSong(outLine[1]);
    // Write one "user,song,play count" line.
    writer.Write(user);
    writer.Write(',');
    writer.Write(song);
    writer.Write(',');
    writer.WriteLine(outLine[2]);
    i++;
    line = reader.ReadLine();
}
Console.WriteLine("saved {0} lines to {1}", i, outputFile);
reader.Close();
writer.Close();
SaveMapping(usersMapping, "usersMap.csv");
SaveMapping(songMapping, "songMapping.csv");
Console.WriteLine("Mapping saved");
// Keep the console window open until a key is pressed.
Console.ReadKey();

Now create the GetUser and GetSong functions to convert identifiers to integers.



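// Returns the integer ID for a user, assigning the next free number on first sight.
static int GetUser(string user)
{
    if (!usersMapping.ContainsKey(user))
        usersMapping.Add(user, usersMapping.Count + 1);
    return usersMapping[user];
}
// Returns the integer ID for a song, assigning the next free number on first sight.
static int GetSong(string song)
{
    if (!songMapping.ContainsKey(song))
        songMapping.Add(song, songMapping.Count + 1);
    return songMapping[song];
}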

Finally, implement the SaveMapping helper method, which saves both mapping dictionaries to CSV files.



// Writes one "originalId,integerId" line per dictionary entry.
static void SaveMapping(Dictionary<string, int> mapping, string fileName)
{
    var stream = File.Open(fileName, FileMode.Create);
    var writer = new StreamWriter(stream);
    foreach (var key in mapping.Keys)
    {
        writer.Write(key);
        writer.Write(',');
        writer.WriteLine(mapping[key]);
    }
    writer.Close();
}

Now download the sample data from the Echo Nest Taste Profile Subset web page. After downloading, open the train_triplets.txt.zip archive and extract the train_triplets.txt file.



When running the utility, pass the location of the train_triplets.txt file as a command line argument. To do this, right-click the ConvertToMahoutInput project node in Solution Explorer and select Properties from the context menu. On the Debug tab of the project properties page, enter the path to the train_triplets.txt file in the Command line arguments text box.



Fig. 3. Specifying a Command Line Argument



To start the program, press the F5 key. When it finishes, open the bin\Debug folder in the location where the project was saved and check the files the utility produced: mInput.txt, usersMap.csv, and songMapping.csv.



Fig. 4. ConvertToMahoutInput Utility Result
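
Before uploading the file to the cluster, it can be worth checking that every line really has the user,song,count shape. The following sketch assumes the file layout produced by the utility above; the names InputCheck, IsValidLine, and CountInvalidLines are hypothetical and not part of the original tutorial.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class InputCheck
{
    // A line is valid when it has exactly three comma-separated fields
    // and each field parses as an integer.
    public static bool IsValidLine(string line)
    {
        var parts = line.Split(',');
        if (parts.Length != 3) return false;
        foreach (var part in parts)
        {
            long value;
            if (!long.TryParse(part, out value)) return false;
        }
        return true;
    }

    // Counts the lines that do not match the expected shape.
    public static int CountInvalidLines(IEnumerable<string> lines)
    {
        return lines.Count(l => !IsValidLine(l));
    }

    public static void Main(string[] args)
    {
        // The default path is an assumption; pass the real location of mInput.txt as the first argument.
        var path = args.Length > 0 ? args[0] : "mInput.txt";
        Console.WriteLine("{0} invalid line(s) in {1}",
            CountInvalidLines(File.ReadLines(path)), path);
    }
}
```

A result of zero invalid lines suggests the file is ready to be copied to the cluster.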

Running the Mahout job


Open the Hadoop Cluster Portal at https://www.hadooponazure.com and click the Remote Desktop icon.



Fig. 5. Remote desktop icon



Compress the mInput.txt file from the bin\Debug folder into a Zip archive and copy it to the root folder c:\ on the remote cluster. After copying, extract the file from the archive.



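Now create a file containing the ID of the user for whom recommendations will be generated. To do this, create a text file named users.txt in the root folder c:\ and write the identifier of one user in it. Use the integer identifier that appears in mInput.txt, not the original identifier from the dataset.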



Note. To generate recommendations for other users, add their identifiers on separate lines.



Then upload the mInput.txt and users.txt files to HDFS. To do this, open the Hadoop Command Shell and run the following commands.



hadoop fs -copyFromLocal c:\mInput.txt input\mInput.txt
hadoop fs -copyFromLocal c:\users.txt input\users.txt



Now run the Mahout job with the following command:



hadoop jar c:\Apps\dist\mahout\mahout-core-0.5-job.jar org.apache.mahout.cf.taste.hadoop.item.RecommenderJob --input=input/mInput.txt --output=output --usersFile=input/users.txt



The Mahout job runs for several minutes, after which an output file is created. Run the following command to get a local copy of the output file.



hadoop fs -copyToLocal output/part-r-00000 c:\output.txt



Open output.txt from the root folder c:\ and examine its contents. The file has the following structure.



user [song:rating,song:rating,...]
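
The integer song IDs in output.txt can be translated back to the original song identifiers with the songMapping.csv file saved earlier. The sketch below assumes that each output line is a user ID, a tab or space, and a bracketed list of song:rating pairs, and that each mapping line has the form originalId,integerId. The names OutputReader, ParseLine, and LoadReverseMapping are hypothetical, and the file paths in Main are assumptions.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.IO;

public static class OutputReader
{
    // Parses "user [song:rating,song:rating]" into (song, rating) pairs.
    public static List<KeyValuePair<int, double>> ParseLine(string line, out int user)
    {
        // Split once on the first tab or space: user ID on the left, list on the right.
        var parts = line.Split(new[] { '\t', ' ' }, 2);
        user = int.Parse(parts[0], CultureInfo.InvariantCulture);
        var result = new List<KeyValuePair<int, double>>();
        var body = parts[1].Trim().TrimStart('[').TrimEnd(']');
        foreach (var pair in body.Split(new[] { ',' }, StringSplitOptions.RemoveEmptyEntries))
        {
            var fields = pair.Split(':');
            result.Add(new KeyValuePair<int, double>(
                int.Parse(fields[0].Trim(), CultureInfo.InvariantCulture),
                double.Parse(fields[1].Trim(), CultureInfo.InvariantCulture)));
        }
        return result;
    }

    // Reads "originalId,integerId" lines and returns integerId -> originalId.
    public static Dictionary<int, string> LoadReverseMapping(IEnumerable<string> lines)
    {
        var reverse = new Dictionary<int, string>();
        foreach (var line in lines)
        {
            if (string.IsNullOrWhiteSpace(line)) continue;
            var comma = line.LastIndexOf(',');
            var id = int.Parse(line.Substring(comma + 1), CultureInfo.InvariantCulture);
            reverse[id] = line.Substring(0, comma);
        }
        return reverse;
    }

    public static void Main()
    {
        // Assumed locations; adjust them to where the files actually are.
        var songs = LoadReverseMapping(File.ReadLines("songMapping.csv"));
        foreach (var line in File.ReadLines(@"c:\output.txt"))
        {
            int user;
            foreach (var rec in ParseLine(line, out user))
            {
                string original;
                if (!songs.TryGetValue(rec.Key, out original)) original = "(unknown)";
                Console.WriteLine("user {0}: {1} ({2})", user, original, rec.Value);
            }
        }
    }
}
```

The same reverse lookup works for user IDs if usersMap.csv is loaded instead of songMapping.csv.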



Conclusions


Recommendation systems are an important feature of many modern social networks, media streaming services, online stores, and other websites. Mahout offers a ready-made recommendation engine that is easy to use, contains many useful features, and scales on the Hadoop platform.



You can take advantage of the data processing and cloud scalability benefits of Hadoop and Apache Mahout on the Windows Azure platform. Try it today at windowsazure.com/ru-ru and www.hadooponazure.com.

