
Data Mining: Primary data processing using a DBMS. Part 1
- Tutorial
What this article is about
Researching large volumes of data involves many subtleties and pitfalls, especially for those who are just starting to explore the hidden dependencies and internal relationships inside arrays of information. For someone studying on their own, an additional difficulty is choosing examples to learn from and finding a community in which to exchange opinions and evaluate their progress. The example should not be too complicated, but at the same time it should cover the main problems that arise when solving tasks close to real ones, so that the task is not perceived like this:

From this point of view, the Kaggle resource [1] is very interesting: it turns data mining into a sport by hosting competitions in data analysis. Some competitions come with training materials and are intended for beginners. This series of articles is devoted to learning data analysis by the example of solving one such training problem. The first article is about preparing the data and using a DBMS for that purpose; in other words, about how and where to start. The reader is assumed to know SQL.
Task: “Titanic: Machine Learning from Disaster”
One of the two competitions for beginners is Titanic [2]. A translation of the task description:
“The sinking of the Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1,502 of the 2,224 passengers and crew. This sensational tragedy shocked the international community and led to improved safety requirements for ships. One of the reasons the shipwreck caused so many casualties was the lack of lifeboats for the passengers and crew. There was also an element of pure luck in who escaped drowning. At the same time, some groups of people were more likely to survive than others, such as women, children, and the upper class. In this competition, we invite you to complete an analysis of what sorts of people were likely to be saved. In particular, we ask you to use machine learning tools to determine which passengers survived the tragedy.”
Data
The competition provides two files: train.csv and test.csv. They contain text data separated by commas, one line per passenger record. Some fields of a record may be unknown and left empty.
DESCRIPTION OF VARIABLES:
Variable name | What does it mean |
---|---|
survival | Survival (0 = No; 1 = Yes) |
pclass | Passenger class (1 = 1st; 2 = 2nd; 3 = 3rd) |
name | Name |
sex | Sex |
age | Age |
sibsp | Number of siblings/spouses aboard |
parch | Number of parents/children aboard |
ticket | Ticket number |
fare | Passenger fare |
cabin | Cabin |
embarked | Port of embarkation (C = Cherbourg; Q = Queenstown; S = Southampton) |
Note:
Pclass is an indicator of socioeconomic status (SES)
1st ~ Upper; 2nd ~ Middle; 3rd ~ Lower
Age is in years; fractional if the age is less than one (1)
If the age is estimated, it is given in the form xx.5
The following definitions are used for sibsp and parch:
Siblings: brother, sister, half-brother, or half-sister among Titanic passengers
Spouses: husband or wife among Titanic passengers (mistresses and fiancés are ignored)
Parents: mother or father among Titanic passengers
Children: son, daughter, stepson, or stepdaughter among Titanic passengers
Other relatives are excluded, including cousins, nephews/nieces, uncles and aunts, and in-laws
Children traveling only with a nanny have parch = 0
Likewise, traveling with close friends or neighbors does not count toward sibsp or parch
As we can see, this task is well structured, and the fields are practically defined already. So the first stage is to prepare the data for machine learning. The questions of selecting and redefining fields, partitioning, merging, and classification all depend on how the data is supplied. By and large, it comes down to encoding and normalization: encoding of qualitative (categorical) features (there are several approaches) and preprocessing the data before applying machine learning methods. This task does not contain a really large volume of data, but most tasks (for example, the Heritage Health Prize or Yandex Internet Mathematics) are not so well structured, and you have to operate with millions of records. It is more convenient to do this using a DBMS. I chose the PostgreSQL DBMS. All the SQL code was written for it, but with minor changes it is suitable for MySQL, Oracle, and MS SQL Server.
Download data
Download two files - train.csv and test.csv.
We create tables for data storage:
-- Drop the table if it already exists
DROP TABLE IF EXISTS titanik_train;
-- Create the table
CREATE TABLE titanik_train
(
survived int,
pclass int,
name varchar(255),
sex varchar(255),
age float,
sibsp int,
parch int,
ticket varchar(255),
fare float,
cabin varchar(255),
embarked varchar(255)
);
-- Load the table from the CSV file '/home/andrew/titanik/train.csv'.
-- HEADER means the CSV file has a header row with field names.
COPY titanik_train FROM '/home/andrew/titanik/train.csv' CSV HEADER;
DROP TABLE IF EXISTS titanik_test;
CREATE TABLE titanik_test
(
pclass int,
name varchar(255),
sex varchar(255),
age float,
sibsp int,
parch int,
ticket varchar(255),
fare float,
cabin varchar(255),
embarked varchar(255)
);
COPY titanik_test FROM '/home/andrew/titanik/test.csv' CSV HEADER;
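A side note: COPY ... FROM reads the file on the database server and usually requires superuser rights. When connecting remotely with psql, the client-side \copy command can be used instead; it reads the file from the client machine (a sketch, same file layout assumed):
-- Client-side alternative to COPY, executed inside psql
\copy titanik_train FROM '/home/andrew/titanik/train.csv' CSV HEADER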
As a result, we get two tables: one with the training data and one with the test data.
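It is also worth checking right away that everything loaded. A minimal sanity check (891 training and 418 test records are the counts published for this competition):
-- Row counts should match the number of records in the CSV files
SELECT count(*) FROM titanik_train; -- expect 891
SELECT count(*) FROM titanik_test;  -- expect 418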
In this case there were no loading errors, i.e. all the data matched the types we defined. If errors do appear, there are two ways to deal with them: either fix the file itself with regular expressions in an editor such as sed (or with a perl -e one-liner), or first load everything as text and then fix it with regular expressions and queries inside the DBMS.
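A sketch of the second approach (the staging table name here is illustrative): load everything into text columns so that COPY never fails on types, then cast, turning empty strings into NULLs:
-- Staging table: every field is plain text
CREATE TABLE titanik_train_raw
(
survived text, pclass text, name text, sex text, age text,
sibsp text, parch text, ticket text, fare text, cabin text, embarked text
);
COPY titanik_train_raw FROM '/home/andrew/titanik/train.csv' CSV HEADER;
-- After cleaning, cast into the typed table; NULLIF turns '' into NULL
INSERT INTO titanik_train
SELECT NULLIF(survived,'')::int, NULLIF(pclass,'')::int, name, sex,
       NULLIF(age,'')::float, NULLIF(sibsp,'')::int, NULLIF(parch,'')::int,
       ticket, NULLIF(fare,'')::float, cabin, embarked
FROM titanik_train_raw;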
Add the primary key:
-- Create a sequence
CREATE SEQUENCE titanik_train_seq;
-- Create a table with a primary key column
select nextval('titanik_train_seq') as id, a.*
into titanik_train_pk
from titanik_train a;
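Strictly speaking, SELECT ... INTO only creates a table with an id column filled from the sequence; to actually declare id as the primary key, one more statement can be added:
-- Add the actual primary key constraint on the id column
ALTER TABLE titanik_train_pk ADD PRIMARY KEY (id);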
Data exploration
DBMS tables are just a tool, one that makes it more convenient to study and transform the data. From the very beginning it is convenient to divide the fields of each passenger record into numeric and text ones, by data type. To use machine learning algorithms, we will still have to convert everything to a numeric representation, and the quality of the methods we choose will depend very much on how adequately we do this.
Numeric data:
survived, pclass, age, sibsp, parch, fare
Text data:
name, sex, ticket, cabin, embarked
Let's start with the data presented in text form. In essence, this is a problem of encoding information, and here the simplest approach is not necessarily the most correct one. For example, setting this task aside for a moment, imagine that we have a field holding a job title:
- general manager;
- department head;
- shift supervisor;
- cleaner;
- trainee;
It is logical to encode them this way:
Position | Code | Code2 |
---|---|---|
general manager | 5 | 1 |
department head | 4 | 0.75 |
shift supervisor | 3 | 0.5 |
cleaner | 2 | 0.25 |
trainee | 1 | 0 |
I.e., in this way we have tried to fit an element of reality, job positions, into a mathematical representation. This is one (the simplest) option for representing such information, and it assumes that the "distance" between positions is the same, which does not correspond to reality. And here the question arises: what is the "distance between positions"? Importance? Scope of authority? Scope of responsibility? Prevalence? Place in the hierarchy? The number of employees holding such a position at the company? A lot of questions that have no answer for a task divorced from reality. But in real tasks there are answers, and there are methods that help bring the mathematical representation closer to reality. That, however, is a topic for a separate and complex article; for now, let us assume that we will try to encode the data in the range from 0 to 1 (the Code2 column).
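A sketch of such an encoding in SQL (the staff table and job_title column here are illustrative, not part of the competition data):
-- Map an ordered categorical field onto the range [0, 1]
SELECT job_title,
       CASE job_title
            WHEN 'general manager'  THEN 1.0
            WHEN 'department head'  THEN 0.75
            WHEN 'shift supervisor' THEN 0.5
            WHEN 'cleaner'          THEN 0.25
            WHEN 'trainee'          THEN 0.0
       END AS code2
FROM staff;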
So, for data whose values can be compared, there is at least some explanation of why we encoded it this way, with a definition of lesser and greater.
After all, numbers carry quantitative information: something is more, and something is less!
But what if we have purely descriptive information, such as “name”? Do we have the right to encode the data, for example, like this:
Name | Code2 |
---|---|
Nikolay | 0 |
Peter | 0.5 |
Paul | 1 |
It seems that we can encode it this way, but it turns out that we are adding to the data a component that is not there: a comparison of names (as numbers: more or less) by some unknown attribute. It would be more logical to present the data in a slightly different way (this representation is also called bag-of-words, or one-hot encoding):
Record number | Nikolay | Peter | Paul |
---|---|---|---|
15602 | 0 | 1 | 0 |
15603 | 0 | 0 | 1 |
15604 | 1 | 0 | 0 |
There will be as many fields as there are distinct names, and the values are 0 or 1. The disadvantage is obvious: what if we have a million different names? What to do then? Our simple task has no such trouble, but I will answer anyway: this is a rather rapidly growing branch of computer science and mathematics, the compression of input data. Some of the keywords are "sparse matrix", "autoencoder", "Vowpal Wabbit" [3], and "feature hashing". Again, this is a topic for a separate article, and materials on this subject have appeared on Habr before. For our task we will set this problem aside for now and return to the question:
Does the first idea have the right to exist?
The answer depends heavily on whether we can neglect the component we added, the ranking by "an unknown attribute by which we compared the names". Very often you do have to neglect it, when there is no other option.
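Returning to the bag-of-words table above, such 0/1 columns can be built directly in SQL; a sketch using the illustrative names and table from that example (in PostgreSQL a boolean casts to 0 or 1):
-- One 0/1 column per distinct name
SELECT id,
       (name = 'Nikolay')::int AS name_nikolay,
       (name = 'Peter')::int   AS name_peter,
       (name = 'Paul')::int    AS name_paul
FROM some_table;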
Now, what if instead of names we take, for example, the alphanumeric ticket numbers (which is closer to our task)? With tickets it is easier: there is a series and there is a number. We can leave the number as it is; perhaps it determines the boarding location, and accordingly the distance from the entrance, and so on. But encoding the series is already more complicated. First of all, you need to answer the question: can a single real-world value be represented by different strings in the data? The answer is unequivocal: it can, i.e. we need to check for that. The simplest case is a typo: an extra dot, comma, space, or dash.
The presence of such data already introduces distortions, and a model built on "spoiled" information will not show a good result (unless the spoiling was done in a controlled way, to improve the generalization property of the model).
First, we will separate the ticket number from the series (type, brand, etc.).
To do this, browse the data, pre-sorted by name, and identify the data groups:
- tickets without a series;
- tickets with a series.
Next, we create a table where the series is extracted into a separate field; where there is no series, we put an empty value for now. To do this, use the query:
select id, survived, pclass, "name", sex, age, sibsp, parch, ticket, fare, cabin, embarked,
       m[1] as ticket_type, m[2] as ticket_number
into titanik_train_1
from
  (select id, survived, pclass, "name", sex, age, sibsp, parch, ticket, fare, cabin, embarked,
          regexp_matches(ticket, '^\s*?(.*?)\s*?(\d*?)$') as m
   from titanik_train_pk
  ) as a;
Here '^\s*?(.*?)\s*?(\d*?)$' is a regular expression that splits the ticket field into its parts, series and number, returned as an array.
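A quick way to see what this regular expression produces (illustrative ticket values):
-- m[1] is the series, m[2] the number; the series is empty when absent
select regexp_matches('A/5 21171', '^\s*?(.*?)\s*?(\d*?)$'); -- {A/5,21171}
select regexp_matches('21171', '^\s*?(.*?)\s*?(\d*?)$');     -- {"",21171}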
We will leave the numbers as they are; they can be checked for non-digit characters using a similar regular expression.
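For example, a minimal check against the table we just created:
-- Any ticket numbers containing non-digit characters?
select id, ticket_number
from titanik_train_1
where ticket_number !~ '^\d*$';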
The duplicates problem
If the same real-world object is named in different ways, we get the problem of duplicates. Let us dwell on only one source of duplicates: typos in the name. Since the volume of our data is small and easy to inspect visually, we will simply list all the unique values of the field from the table.
The query:
select ticket_type, count(ticket_type) from titanik_train_1 group by 1 order by 2 desc;
Here is the result, in which I have marked values that are presumably the same but written as different strings. There are 45 entries.

Quite a lot of coincidences. If you dig deeper into the task and find out by what principle the tickets were marked, you could reduce the number of items further. But one of the conditions of the task was not to use additional information about this well-known tragedy, so we stop at this.
The data-correcting queries look like this:
update titanik_train_1 set ticket_type='A.5.' where ticket_type = 'A./5.';
update titanik_train_1 set ticket_type='A.5.' where ticket_type = 'A/5';
update titanik_train_1 set ticket_type='A.5.' where ticket_type = 'A/5.';
update titanik_train_1 set ticket_type='A.5.' where ticket_type = 'A/S';
update titanik_train_1 set ticket_type='A/4' where ticket_type = 'A/4.';
update titanik_train_1 set ticket_type='A/4' where ticket_type = 'A4.';
update titanik_train_1 set ticket_type='C.A.' where ticket_type = 'CA';
update titanik_train_1 set ticket_type='C.A.' where ticket_type = 'CA.';
update titanik_train_1 set ticket_type='SW/PP' where ticket_type = 'S.W./PP';
update titanik_train_1 set ticket_type='SC/PARIS' where ticket_type = 'SC/Paris';
update titanik_train_1 set ticket_type='SOTON/O.Q.' where ticket_type = 'SOTON/OQ';
update titanik_train_1 set ticket_type='SOTON/O2' where ticket_type = 'STON/O 2.';
update titanik_train_1 set ticket_type='SOTON/O2' where ticket_type = 'STON/O2.';
update titanik_train_1 set ticket_type='W/C' where ticket_type = 'W./C.';
update titanik_train_1 set ticket_type='W.E.P.' where ticket_type = 'WE/P';
Now there are 30 entries.
In more complex tasks the processing differs significantly, especially when you cannot inspect everything by eye. There you can use the Levenshtein distance function, which finds words that are close in spelling; it can even be adjusted a bit so that words differing only in punctuation come out closer still. Again we return to the concept of measures and metrics: what exactly is meant by the distance between words? Are characters that look alike, such as B and 8, more similar? Or ones that sound similar?
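In PostgreSQL, the Levenshtein distance is provided by the fuzzystrmatch extension; a minimal sketch for finding close pairs of series values:
-- fuzzystrmatch is shipped with PostgreSQL's contrib modules
CREATE EXTENSION IF NOT EXISTS fuzzystrmatch;
-- Pairs of distinct series within edit distance 2 of each other
select a.ticket_type, b.ticket_type
from (select distinct ticket_type from titanik_train_1) a,
     (select distinct ticket_type from titanik_train_1) b
where a.ticket_type < b.ticket_type
  and levenshtein(a.ticket_type, b.ticket_type) <= 2;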
In principle, in this simple way you need to go through all the text fields in both the training and test tables. A further improvement is to combine the column data from the two tables before searching for duplicates.
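For example (assuming a titanik_test_1 table prepared in the same way as titanik_train_1):
-- Search for near-duplicates across both tables at once
select ticket_type, count(*)
from (select ticket_type from titanik_train_1
      union all
      select ticket_type from titanik_test_1) t
group by 1
order by 2 desc;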
A note about dividing the work: at this stage, the data processing does not require special knowledge and is easily parallelized among a number of performers. Here, for example, a team will greatly outpace a lone competitor.
The post has turned out to be quite long, so the continuation is in the next part.
Links:
1. www.kaggle.com
2. www.kaggle.com/c/titanic-gettingStarted
3. http://hunch.net/~vw/
Update:
Part two: habrahabr.ru/post/165281
Part three: habrahabr.ru/post/165283
Part Four: habrahabr.ru/post/173819