Introducing Apache Mahout

Original authors: Owen, Anil, Dunning, Friedman, “Mahout in Action”
  • Translation
Hey.

My first article on Habr showed that not many people know about the Mahout library (though I may, of course, be mistaken about this), and there is little introductory material on the topic. So I decided to write a post about the library's capabilities. A couple of trial drafts showed that the best introduction would be short excerpts from the book “Mahout in Action” by Owen, Anil, Dunning, and Friedman. I have therefore made a free translation of a few passages that, it seems to me, describe the scope of Mahout well.




Meet Apache Mahout (1)


* Hereinafter, the relevant chapter of the book is indicated in parentheses.
  • Mahout is an open source machine learning library from Apache. The algorithms it implements fall under the broad umbrella of machine learning, or collective intelligence. That can mean many things, but at the moment it primarily means recommender systems (collaborative filtering), clustering, and classification.
  • Mahout is scalable. Mahout aims to be a machine learning tool capable of processing data on one machine or many. In the current version of Mahout, the scalable machine learning implementations are written in Java, and some parts are built on the Apache Hadoop distributed computing project.
  • Mahout is a Java library. It does not provide a user interface, a prepackaged server, or an installer. It is a framework of tools intended to be used and adapted by developers.

...
Mahout contains a number of models and algorithms, many of them still in development or in an experimental phase. At this early stage of the project's life, three key areas are the most visible: recommender systems (collaborative filtering), clustering, and classification. This is far from everything Mahout contains, but these areas are the most prominent and mature.
...
In theory, Mahout is a project open to implementations of any kind of machine learning technique. In practice, three key areas of machine learning are implemented at the moment.
...

Recommender systems (1.2.1)


Recommender systems are the most recognizable machine learning technique in use today. You have seen services or sites that try to recommend books, movies, or articles based on your past actions. They try to infer your tastes and preferences and identify unknown items that may be of interest to you.

  • Amazon.com is arguably the best-known e-commerce site that has implemented recommendations. Based on purchases and activity on the site, Amazon recommends books and other things that may be of interest.
  • Netflix also recommends DVDs that may be of interest, and offers a $1M prize to researchers who can improve the quality of its recommendations.
  • Social networks such as Facebook use variants of recommender techniques to identify the people most likely to be as-yet-unconnected friends.


Clustering (1.2.2)


Clustering is less obvious, but it appears in equally well-known contexts. As the name implies, clustering techniques try to group large numbers of things into clusters of items that share some similarity. In this way hierarchy and order are imposed on large or hard-to-understand data sets, revealing interesting patterns or making the data easier to comprehend.

  • Google News groups news articles by topic using clustering techniques.
  • Search engines such as Clusty also group their search results.
  • Customers can be grouped into segments (clusters) using clustering techniques based on attributes such as income, location, and shopping habits.

Clustering helps identify structure, and even hierarchy, in a large collection of things that may otherwise be difficult to make sense of. Businesses can use it to discover hidden groups among their users, to organize a large collection of documents sensibly, or to identify common usage patterns of a site from its logs.

Classification (1.2.3)


Classification models decide whether or not an item is part of a particular category or whether it has some attribute. ...

  • Yahoo! Mail decides whether an incoming message is spam based on previous messages and spam reports from users, as well as on characteristics of the messages themselves.
  • Google's Picasa and other photo management applications can identify an image area containing a human face.
  • Optical character recognition software classifies small regions of scanned text into individual characters.

Classification helps decide whether a new piece of input, or a new item, matches previously observed patterns; it is often used to classify behavior or patterns. This can be used to detect suspicious network activity or fraud, or to determine whether a user's message expresses frustration or satisfaction.

Each of these techniques works best when supplied with a large amount of good input data. In some cases they must not only handle large volumes of data but also produce results quickly, and these factors make scalability a central concern. One of the main reasons to use Mahout is its scalability.

As the book notes repeatedly, there is no ready-made recipe that can simply be taken and applied to a typical situation. For each case you need to experiment with different algorithms and input data. Only by understanding how the algorithms work can you use the library successfully.

Running a first recommender (2.2)


... now we will explore a simple user-based recommender.

Creation of input data (2.2.1)

... A recommender needs input on which to base its recommendations. In Mahout's terms, this data takes the form of preferences. Because recommender systems are easiest to understand in terms of recommending items to users, we will speak of a preference as a user-item association. ... A preference consists of a user ID and an item ID, and usually a number expressing the strength of the user's preference for the item (a rating). IDs in Mahout are always integers. The preference value can be anything, as long as a larger value expresses a stronger positive preference. For example, the values might be ratings on a scale from 1 to 5, where 1 means the user dislikes the item and 5 means the user likes it very much.
Create an intro.csv file containing userID, itemID, value information.
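A small program can generate such a file. The user IDs, item IDs, and rating values below are invented purely for illustration; any data in the userID,itemID,value format will do.

```java
import java.io.FileWriter;
import java.io.IOException;

// Writes a tiny sample data set in the userID,itemID,value format
// that the recommender's FileDataModel expects. The IDs and ratings
// here are made up for illustration only.
public class CreateIntroCsv {
    public static void main(String[] args) throws IOException {
        String[] lines = {
            "1,101,5.0", "1,102,3.0", "1,103,2.5",
            "2,101,2.0", "2,102,2.5", "2,103,5.0", "2,104,2.0",
            "3,101,2.5", "3,104,4.0", "3,105,4.5",
        };
        try (FileWriter out = new FileWriter("intro.csv")) {
            for (String line : lines) {
                out.write(line);
                out.write('\n');
            }
        }
    }
}
```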
...
Now, run the following code.
import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

class RecommenderIntro {
  public static void main(String[] args) throws Exception {
    DataModel model = new FileDataModel(new File("intro.csv"));
    UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
    UserNeighborhood neighborhood = new NearestNUserNeighborhood(2, similarity, model);
    Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);
    List<RecommendedItem> recommendations = recommender.recommend(1, 1);
    for (RecommendedItem recommendation : recommendations) {
      System.out.println(recommendation);
    }
  }
}


The DataModel stores and provides access to all the preference, user, and item data needed in the computation. The UserSimilarity implementation provides some notion of how similar two users' tastes are; it can be based on one of many possible metrics or calculations (the metrics were described in the first post). The UserNeighborhood implementation defines a group of users who are most similar to a given user (the first argument, 2, is the size of this group). Finally, the Recommender implementation ties the previous three components together to produce recommendations for users. The recommend method takes two arguments: the user ID and the number of recommendations to produce for that user.
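To make the similarity component less abstract, here is a hand-rolled sketch of the idea behind PearsonCorrelationSimilarity: given the ratings two users gave to the items they both rated, compute the Pearson correlation coefficient between them. Mahout's own implementation handles sparse data and edge cases; this only shows what the resulting number means. The ratings are invented for illustration.

```java
// Sketch of Pearson correlation between two users' ratings over
// their co-rated items: a value near +1 means similar tastes,
// near -1 means opposite tastes.
public class PearsonSketch {
    static double pearson(double[] x, double[] y) {
        int n = x.length;
        double mx = 0, my = 0;
        for (int i = 0; i < n; i++) { mx += x[i]; my += y[i]; }
        mx /= n; my /= n;
        double num = 0, dx = 0, dy = 0;
        for (int i = 0; i < n; i++) {
            num += (x[i] - mx) * (y[i] - my);
            dx  += (x[i] - mx) * (x[i] - mx);
            dy  += (y[i] - my) * (y[i] - my);
        }
        return num / Math.sqrt(dx * dy);
    }

    public static void main(String[] args) {
        // Two users who both rated the same three items
        double[] user1 = {5.0, 3.0, 2.5};
        double[] user2 = {2.0, 2.5, 5.0};
        System.out.println(pearson(user1, user2)); // negative: opposite tastes
    }
}
```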

Output: RecommendedItem[item:XXX, value:Y], where Y is the predicted rating that user 1 would give to item XXX. This item is recommended to the user because it has the highest predicted rating.
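As a rough intuition for where the predicted value comes from: a user-based recommender typically estimates a preference as a similarity-weighted average of the ratings the user's neighbors gave to the candidate item. The sketch below illustrates that idea with invented numbers; Mahout's GenericUserBasedRecommender is more involved, so this is a conceptual simplification, not its actual code.

```java
// Conceptual sketch: estimate a user's rating for an item as the
// average of the neighbors' ratings, weighted by each neighbor's
// similarity to the user.
public class WeightedEstimateSketch {
    static double estimate(double[] neighborRatings, double[] similarities) {
        double num = 0, den = 0;
        for (int i = 0; i < neighborRatings.length; i++) {
            num += similarities[i] * neighborRatings[i];
            den += Math.abs(similarities[i]);
        }
        return num / den;
    }

    public static void main(String[] args) {
        // Two neighbors rated the candidate item 4.0 and 3.0, with
        // similarities 0.9 and 0.3 to the target user.
        double[] ratings = {4.0, 3.0};
        double[] sims = {0.9, 0.3};
        System.out.println(estimate(ratings, sims)); // ≈ 3.75
    }
}
```

The more similar a neighbor is, the more its opinion counts, which is why a neighborhood of well-chosen users matters so much for recommendation quality.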
