Introduction to RapidMiner
- Tutorial

In addition to the miner itself, there is also RapidMiner Server (called RapidAnalytics before version 6), which can be used as a repository for storing and executing miner processes (including on a schedule), for sharing connections to data sources between users, and for exposing data from miner processes as a web service.
To our regret, with version 6 the creators of the miner decided to start making money on sales of this software and changed the license from AGPL to Business Source. Nevertheless, version 5 remains under the AGPL, so we can use it freely and without restrictions. It is therefore the version covered in this article. Also note that the sixth version adds few new operators and functions (perhaps the most interesting is cloud support), and for most tasks RapidMiner 5 Community is enough.
Installation
Not so long ago, the links to download RapidMiner 5 were removed from the official site, so we will build RM from the source code in the official project on GitHub.
To build RapidMiner from the repository, we need:
- Java (a JDK) installed
- Apache Ant
- A Git client

First, clone the repository:
git clone https://github.com/rapidminer/rapidminer-5.git
Next, from the cloned rapidminer-5 directory, build the project:
ant build
ant release.makePlatformIndependent
Now run the miner. On Windows:
.\scripts\RapidMinerGUI.bat
On Linux, respectively:
./scripts/RapidMinerGUI.sh
A window like the one in the picture on the right will open. Click New Process and move on.
Basic concepts
Before we look at the basic principles of working with RapidMiner using an example, here is a brief introduction to its basic concepts.
Process
A set of operators connected in a given order to perform the required data analysis or processing task.

Operator

In the program interface, operators live on the Operators tab, where they are grouped hierarchically by function. To use an operator, drag it onto the process workspace.
Repository

On the Repositories tab in RM, only Samples, DB, and Local Repository are visible at first. The first, as the name suggests, is a set of example processes; DB holds the current database connections available in the miner (defined through Tools -> Manage Database Connections); and Local Repository is a place to store your own processes on the computer.
Process context

- Process input - data passed to the input of the process. Here you can specify the path to the data inside the repository.
- Process output - the path in the repository where the result of the process will be saved.
- Macros - global variables available anywhere in the process. A macro's value can only be a string or a number.
Note that Process input and Process output are shown as circles on the process border, labeled inp and res. To use the input data or to save a result, connect the corresponding circle to an operator's input or output.
Practice is the best teacher. Let's build a small process that shows the basic principles of working with the miner.
Small task
You are the director of a small company that creates websites, industrial design, etc. Quite often, because of the large number of orders and a shortage of employees, you hire freelancers from different countries (your clients, too, come from all over the world) and regularly record the work done in an Excel spreadsheet, giving the name of the contractor, the type of work, the date of payment, and the amount and currency of payment. At some point you want to know your costs in rubles (at the Central Bank rate), by type of work, for a specific date (more interesting cases, such as a breakdown by month or by employee, are left for your own experiments).


Pay attention to the pressed button

Set the parameters as in the picture on the right and click Edit list to the right of data set meta data information below. Set everything as in the picture below.

As you might guess, here we set the column names, a check mark that includes the column in or excludes it from the parsing result, and the column's type and role. Roles other than attribute may be needed for mining; in the ordinary case they are not required.
Click Apply and go to the next step. Add the Filter Examples operator (Data Transformation -> Filtering), connect its input to the Read CSV output, and connect its output to the process output, the circle labeled res. You will get this picture:



The Context tab at this step will look like the image on the right. We have defined a macro (recall, a global variable) and now use it to filter the records from our CSV file by date. Click on the Filter Examples operator, select attribute_value_filter in condition class, and in parameter string write: date = %{date}. On the left is the name of the column to filter by, in the center is the equality check, and on the right is the value taken from the macro.
Let's see what happened. Click the button to start the process.
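To make the filtering step concrete outside the GUI, here is a minimal Python sketch of the same idea: read the CSV, then keep only the rows whose date column equals the macro value. Only the date and currency column names come from the article; the other column names, the semicolon separator, and all sample values are assumptions made for illustration.

```python
import csv
import io

# Hypothetical sample of the payments file. Only the "date" and "currency"
# column names come from the article; everything else is assumed.
SAMPLE_CSV = """contractor;work_type;date;amount;currency
Ivan;design;01.02.2014;100;USD
Anna;website;01.02.2014;250;EUR
John;design;15.02.2014;80;USD
"""


def read_rows(text, delimiter=";"):
    """Parse CSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))


def filter_examples(rows, attribute, value):
    """Keep only the rows whose attribute equals value (string comparison)."""
    return [row for row in rows if row[attribute] == value]


macros = {"date": "01.02.2014"}  # plays the role of the %{date} macro
rows = read_rows(SAMPLE_CSV)
filtered = filter_examples(rows, "date", macros["date"])

for row in filtered:
    print(row["contractor"], row["amount"], row["currency"])
```

The comparison is a plain string equality, so the macro value has to be written in exactly the same date format as the file.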



The first result was obtained, but we would like to see the costs in rubles at the rate of the Central Bank of the Russian Federation. Switch to Design Perspective by clicking on


Where url: http://www.cbr.ru/scripts/XML_daily.asp?date_req=%{date}
Note that we substituted the macro in the operator parameter.
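As a rough illustration of what macro substitution does to a parameter, the sketch below expands every %{name} placeholder in a string from a dictionary of macros. The function name expand_macros and the sample date value are hypothetical; this is not RapidMiner's own implementation, only a model of the behaviour described above.

```python
import re


def expand_macros(text, macros):
    """Replace every %{name} in text with the string form of macros[name]."""

    def substitute(match):
        name = match.group(1)
        if name not in macros:
            raise KeyError(f"undefined macro: {name}")
        return str(macros[name])

    return re.sub(r"%\{([^}]+)\}", substitute, text)


# URL template taken from the article; the date value is a placeholder.
URL_TEMPLATE = "http://www.cbr.ru/scripts/XML_daily.asp?date_req=%{date}"
url = expand_macros(URL_TEMPLATE, {"date": "01.02.2014"})
print(url)
```

Because the same macro feeds both the filter condition and the URL, changing it in one place on the Context tab changes the date everywhere in the process.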
We will get the data, but something must convert it into an ExampleSet, i.e. a table of data. In the first case Read CSV played this role; now, as you might guess, we will use Read XML (Import -> Data -> Read XML). Drag the operator into the process, connect its input to the output of the Open File operator, and make the following settings (if you have difficulty with XPath, use the wizard by clicking Import Configuration Wizard).

We need to tell RapidMiner which attributes to take for the ExampleSet. Click Edit enumeration to the right of xpath for attributes and add two entries:
Value[1]/text() - cost in rubles of a currency unit
CharCode[1]/text() - letter currency code
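The two XPath entries above can be tried out in plain Python. The sketch below is only an approximation: the element names Value and CharCode come from the article, while the wrapping ValCurs and Valute elements, the decimal comma, and the rate values are assumptions about the feed made for this example. Python's ElementTree does not support text(), so findtext is used with the same positional paths.

```python
import xml.etree.ElementTree as ET

# Hand-written sample. Value and CharCode are the names used in the article;
# the ValCurs/Valute wrappers, the decimal comma, and the numbers are assumed.
SAMPLE_XML = """<ValCurs Date="01.02.2014">
  <Valute>
    <CharCode>USD</CharCode>
    <Value>35,18</Value>
  </Valute>
  <Valute>
    <CharCode>EUR</CharCode>
    <Value>48,10</Value>
  </Valute>
</ValCurs>"""


def read_rates(xml_text):
    """Turn each Valute element into a row with a currency code and a rate."""
    root = ET.fromstring(xml_text)
    rates = []
    for valute in root.findall("Valute"):
        code = valute.findtext("CharCode[1]")
        value = valute.findtext("Value[1]")
        # Convert the assumed decimal comma to a dot before parsing.
        rates.append({"currency": code, "rate": float(value.replace(",", "."))})
    return rates


rates = read_rates(SAMPLE_XML)
for row in rates:
    print(row["currency"], row["rate"])
```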
Now we need to set the value types for the attributes. To do this, click Edit list to the right of data set meta data information and set it as in the picture below.

At this stage, the process should look like this:

It's time to convert the currencies in the date-filtered data. For this, as you might guess, we need to combine the quotes with the data. The Join operator (Data Transformation -> Set Operations -> Join) will help us here. Take the output of the Filter Examples operator, which is currently connected to the process output, and connect it to the Join operator instead; do the same with the output of the Read XML operator.

Now click on the Join operator and define how the data will be combined. Uncheck use id attribute as key, since we are joining on the currency field. A new key attributes parameter will appear; click Edit list next to it, click Add entry in the dialog, and write currency in both fields. Save the changes. We can look at the result in the same way as above, by clicking the button
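For readers who think in code, joining on a key attribute can be sketched as follows. This assumes an inner join (rows without a matching currency are dropped) and uses made-up rows; apart from currency, the attribute names are hypothetical.

```python
def inner_join(left, right, key):
    """Combine rows from left and right that share the same value of key."""
    by_key = {}
    for row in right:
        by_key.setdefault(row[key], []).append(row)

    joined = []
    for left_row in left:
        # Rows with no match on the right side produce no output row.
        for right_row in by_key.get(left_row[key], []):
            joined.append({**left_row, **right_row})
    return joined


# Made-up data: filtered payments on the left, currency quotes on the right.
payments = [
    {"work_type": "design", "amount": 100.0, "currency": "USD"},
    {"work_type": "website", "amount": 250.0, "currency": "EUR"},
    {"work_type": "logo", "amount": 40.0, "currency": "GBP"},
]
rates = [
    {"currency": "USD", "rate": 35.18},
    {"currency": "EUR", "rate": 48.10},
]

joined = inner_join(payments, rates, "currency")
for row in joined:
    print(row)
```

In this sample the GBP payment disappears from the result because no quote with that code exists on the right side, which is worth keeping in mind when totals look too small.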


We are getting closer to our cherished goal: finding out how much we spent in rubles on our tasks. One last touch remains, the conversion itself. Add the Generate Attributes operator (Data Transformation -> Attribute Set Reduction and Transformation -> Generation) to the process, connect its input to the output of the Join operator, and connect its first output, labeled exa (short for ExampleSet), to the process output. As the operator's name makes clear, its task is to add a new attribute. To do this, click on the operator and, in its settings on the right, click the Edit list button next to function descriptions. Give the attribute a name and specify how to compute it.
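The exact expression appears only in the screenshot, so the following is a guess at its shape: a new attribute equal to the payment amount multiplied by the rate. The attribute names amount, rate, and amount_rub are hypothetical, and the numbers are invented.

```python
def generate_attribute(rows, name, function):
    """Return copies of rows with one extra attribute computed per row."""
    return [{**row, name: function(row)} for row in rows]


# Invented joined rows; "amount" and "rate" are assumed attribute names.
joined = [
    {"work_type": "design", "amount": 100.0, "currency": "USD", "rate": 35.0},
    {"work_type": "website", "amount": 250.0, "currency": "EUR", "rate": 48.0},
]

result = generate_attribute(joined, "amount_rub", lambda r: r["amount"] * r["rate"])
for row in result:
    print(row["work_type"], row["amount_rub"])
```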

Save the changes and run the process. Here is our result:

Hurrah! Here is the treasured figure: the costs in rubles that we incurred, at the Central Bank rate on the payment date. You can take this task much further, for example by pulling in data for a whole month and grouping it by type of work, contractor, or date. There is plenty of room for imagination.
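As one possible direction for such an extension, grouping the ruble costs by type of work comes down to a sum per group. The sketch below shows only the logic, on invented rows with hypothetical attribute names; it does not describe how the grouping would be configured inside RapidMiner.

```python
from collections import defaultdict


def total_by(rows, group_attribute, value_attribute):
    """Sum value_attribute over rows, grouped by group_attribute."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[group_attribute]] += row[value_attribute]
    return dict(totals)


# Invented rows with the ruble amount already computed.
rows = [
    {"work_type": "design", "amount_rub": 3500.0},
    {"work_type": "website", "amount_rub": 12000.0},
    {"work_type": "design", "amount_rub": 2800.0},
]

totals = total_by(rows, "work_type", "amount_rub")
for work_type, amount in totals.items():
    print(work_type, amount)
```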