
Data Mining: Primary data processing using a DBMS. Part 2
- Tutorial
Every half hour a new article appears with the screaming slogan "Big Data is the new oil!" A godsend for marketing copy: Big Data = Big Oil = Profit. Where did this claim come from? Let's look past the cliché and dig a little deeper:
One of the first to use it was Michael Palmer [1], back in 2006:
"Data is just like crude oil. It is valuable, but unrefined it cannot really be used. It has to be turned into gas, plastics, chemicals and so on to create something valuable that drives profit; in the same way, data has to be broken down and analyzed before it becomes valuable."
This reading of the "Big Data is the new oil!" trend is closer to reality than to marketing. And it does not entirely cancel out Disraeli's remark:
"There are three kinds of lies: lies, damned lies, and statistics."
This article is a continuation of Data Mining: Primary data processing using a DBMS. Part 1.
Let's continue the extraction!

Continuing to remove duplicates
In the previous article, a small code example was given that gets rid of "duplicates". We keep moving in the same direction. To continue, the table with the test data also has to be converted:
--the sequence was already created earlier
--create a table with a primary key
select nextval('titanik_train_seq') as id, a.*
into titanik_test_pk
from titanik_test a;
--split the ticket field the same way as for the training data
select id,pclass,"name",sex,age,sibsp,parch,ticket,fare,cabin,embarked,
m[1] as ticket_type, m[2] as ticket_number
into titanik_test_1
from
(select id,pclass,"name",sex,age,sibsp,parch,ticket,fare,cabin,embarked,
regexp_matches(ticket, '^\s*?(.*?)\s*?(\d*?)$') as m
from titanik_test_pk
) as a;
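A quick look at a few rows shows what the split produced (an illustrative check only; coalesce guards against a NULL series):
--illustrative sanity check: a few tickets that actually have a series prefix
select ticket, ticket_type, ticket_number
from titanik_test_1
where coalesce(ticket_type, '') <> ''
limit 10;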
In this code example I decided to reuse the sequence from the previous table, so that it is easier to combine the data later if needed. The ticket field is split into a text series and a number in the same way as before.
We apply similar update statements to the test table (plus two more at the end, for series that did not occur in the training table):
update titanik_test_1 set ticket_type='A.5.' where ticket_type = 'A./5.';
update titanik_test_1 set ticket_type='A.5.' where ticket_type = 'A/5';
update titanik_test_1 set ticket_type='A.5.' where ticket_type = 'A/5.';
update titanik_test_1 set ticket_type='A.5.' where ticket_type = 'A/S';
update titanik_test_1 set ticket_type='A/4' where ticket_type = 'A/4.';
update titanik_test_1 set ticket_type='A/4' where ticket_type = 'A4.';
update titanik_test_1 set ticket_type='C.A.' where ticket_type = 'CA';
update titanik_test_1 set ticket_type='C.A.' where ticket_type = 'CA.';
update titanik_test_1 set ticket_type='SW/PP' where ticket_type = 'S.W./PP';
update titanik_test_1 set ticket_type='SC/PARIS' where ticket_type = 'SC/Paris';
update titanik_test_1 set ticket_type='SOTON/O.Q.' where ticket_type = 'SOTON/OQ';
update titanik_test_1 set ticket_type='SOTON/O2' where ticket_type = 'STON/O 2.';
update titanik_test_1 set ticket_type='SOTON/O2' where ticket_type = 'STON/O2.';
update titanik_test_1 set ticket_type='W/C' where ticket_type = 'W./C.';
update titanik_test_1 set ticket_type='W.E.P.' where ticket_type = 'WE/P';
update titanik_test_1 set ticket_type='SOTON/O.Q.' where ticket_type = 'STON/OQ.';
update titanik_test_1 set ticket_type='SC/PARIS' where ticket_type = 'S.C./PARIS';
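After the updates it is worth checking that the series really collapsed into a single spelling (a purely illustrative check):
--illustrative: distribution of ticket series after normalization
select ticket_type, count(*) from titanik_test_1 group by 1 order by 1 asc;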
Ticket data: processed. Now the remaining text fields need to be handled in the same way.
Sex: no duplicates were found, and the field does not need to be split:
select sex, count(sex) from titanik_train_1 group by 1 order by 1 asc;
sex | count |
---|---|
female | 314 |
male | 577 |
select sex, count(sex) from titanik_test_1 group by 1 order by 1 asc;
sex | count |
---|---|
female | 152 |
male | 266 |
Cabins: here it gets more interesting:
--careful, a misleading query!
-- select cabin, count(cabin) from titanik_train_1 group by 1 order by 1 asc;
select cabin, count(id) from titanik_train_1 group by 1 order by 1 asc;
If we run the first (commented-out) query, we get a rather strange value: 0 records with no cabin specified. This is a property of aggregate functions: count(cabin) simply skips NULL values, so it cannot count the empty ones. Therefore we write count(id) and get the real result: 687 passengers with no cabin specified. One could assume that this is some kind of "common" compartment, and that these records most likely have no ticket series either.
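A tiny illustration of why count(cabin) returns 0 for that group (sketch only):
--count(<column>) skips NULLs; count(id) counts every row in the group
select count(cabin) as cnt_cabin, --0: every cabin value here is NULL
       count(id) as cnt_rows      --687: the actual number of rows
from titanik_train_1
where cabin ISNULL;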
Let's check our assumption:
select id, cabin, ticket_type from titanik_train_1 where cabin ISNULL;
select id, cabin, ticket_type from titanik_train_1 where cabin NOTNULL;
Not confirmed: the first query prints plenty of rows that do have a ticket series. And the other way around (query number 2)? Also not confirmed. We conclude that either the ticket series was simply lost for some passengers, or it encodes something other than where the person was berthed; in other words, it carries additional information.
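By the way, the same check can be done more compactly with a single aggregate query (illustrative; an empty string in ticket_type is treated here as "no series"):
--illustrative: cross-tabulate "cabin missing" against "ticket series missing"
select (cabin ISNULL) as no_cabin,
       (ticket_type ISNULL or trim(ticket_type) = '') as no_series,
       count(*)
from titanik_train_1
group by 1, 2
order by 1, 2;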
Returning to the previous query, which groups cabins and counts records, there are some interesting rows in its output:
cabin | count |
---|---|
C23 C25 C27 | 4 |
C30 | 1 |
F g73 | 2 |
T | 1 |
First, we have information about the cabin type (the letter in front of the numbers).
Second, a single ticket can list several cabins. And quite often several people travel on the same ticket that lists several cabins (read: berths). This gives us data that is too interesting to ignore: in effect it duplicates the information about relatives, but also captures friends, acquaintances or colleagues, i.e. people who know each other and are willing to help each other. We also learn how many people were in a cabin.
Conclusion: add a cabin type field, a field for the number of cabins on the ticket, and a field for the number of people in those cabins.
That is, a family of four may occupy two cabins, or two unrelated people may occupy the same cabin. The amount of data keeps growing!
The queries that implement this are fairly involved and require some understanding of how regular expressions work in PostgreSQL.
Everything could be packed into one huge query, but I decided to split it into two parts. The first determines the cabin type and the number of cabins per ticket; the second counts the number of people with the same cabin (or set of cabins) on the ticket.
select id,survived,pclass,"name",sex,age,sibsp,parch,ticket,fare,cabin,embarked, cnt as cabin_cnt, cabin_type, ticket_type, ticket_number
into titanik_train_2
from (
select id, survived, pclass,"name",sex,age,sibsp,parch,ticket,fare,cabin,embarked, ticket_type, ticket_number,
regexp_matches(cabin, '^\s*?([A-Z]\d*)\s*?([A-Z]\d*)\s*?([A-Z]\d*)\s*?([A-Z]\d*)\s*?.*?$') as m, 4 as cnt,
substring(cabin, 1, 1) as cabin_type
from titanik_train_1
UNION
select id, survived, pclass,"name",sex,age,sibsp,parch,ticket,fare,cabin,embarked, ticket_type, ticket_number,
regexp_matches(cabin, '^\s*?([A-Z]\d*)\s*?([A-Z]\d*)\s*?([A-Z]\d*)\s*?$') as m, 3 as cnt,
substring(cabin, 1, 1) as cabin_type
from titanik_train_1
UNION
select id, survived, pclass,"name",sex,age,sibsp,parch,ticket,fare,cabin,embarked, ticket_type, ticket_number,
regexp_matches(cabin, '^\s*?([A-Z]\d*)\s*?([A-Z]\d*)\s*?$') as m, 2 as cnt,
substring(cabin, 1, 1) as cabin_type
from titanik_train_1
UNION
select id, survived, pclass,"name",sex,age,sibsp,parch,ticket,fare,cabin,embarked, ticket_type, ticket_number,
regexp_matches(cabin, '^\s*?([A-Z]\d*)\s*?$') as m, 1 as cnt,
substring(cabin, 1, 1) as cabin_type
from titanik_train_1
UNION
select id, survived, pclass,"name",sex,age,sibsp,parch,ticket,fare,cabin,embarked, ticket_type, ticket_number,
NULL as m, 0 as cnt, NULL as cabin_type
from titanik_train_1 where cabin ISNULL)
as a;
In principle, the only tricky point here is the regular expression. Here is how I built it:
F C82 C45 is an example of a value that needs to be captured. The pattern is assembled from one basic block:
([A-Z]\d*) - [A-Z] means there must be exactly one uppercase letter, and \d* allows any number (zero or more) of digits. The blocks are chained together with lazy whitespace matchers (\s*?) and anchored with ^ and $.
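A quick way to see what one of these patterns captures (purely illustrative):
--illustrative: the two-cabin pattern applied to a sample value
select regexp_matches('C23 C25', '^\s*?([A-Z]\d*)\s*?([A-Z]\d*)\s*?$');
--returns {C23,C25}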
And the second query, which counts the number of people in the cabins:
select a.*, b.cnt as cabin_people_cnt
into
titanik_train_3
from
titanik_train_2 a, (
select cabin as cabid, count(id) as cnt from titanik_train_1 group by 1) as b
where
a.cabin = b.cabid OR (a.cabin ISNULL AND b.cabid ISNULL);
--Update: if no cabin is specified, set the count to zero instead of 687.
update titanik_train_3 set cabin_people_cnt=0 where cabin ISNULL;
As a result we get three additional fields: the cabin type, the number of cabins per ticket, and the number of people in the cabin.
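A quick look at the new fields for tickets with more than one cabin (illustrative only):
--illustrative: inspect the derived cabin fields
select cabin, cabin_type, cabin_cnt, cabin_people_cnt
from titanik_train_3
where cabin_cnt > 1
order by cabin_cnt desc, cabin
limit 10;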
We do the same for the test data:
select id,pclass,"name",sex,age,sibsp,parch,ticket,fare,cabin,embarked, cnt as cabin_cnt, cabin_type, ticket_type, ticket_number
into titanik_test_2
from (
select id, pclass,"name",sex,age,sibsp,parch,ticket,fare,cabin,embarked, ticket_type, ticket_number,
regexp_matches(cabin, '^\s*?([A-Z]\d*)\s*?([A-Z]\d*)\s*?([A-Z]\d*)\s*?([A-Z]\d*)\s*?.*?$') as m, 4 as cnt,
substring(cabin, 1, 1) as cabin_type
from titanik_test_1
UNION
select id, pclass,"name",sex,age,sibsp,parch,ticket,fare,cabin,embarked, ticket_type, ticket_number,
regexp_matches(cabin, '^\s*?([A-Z]\d*)\s*?([A-Z]\d*)\s*?([A-Z]\d*)\s*?$') as m, 3 as cnt,
substring(cabin, 1, 1) as cabin_type
from titanik_test_1
UNION
select id, pclass,"name",sex,age,sibsp,parch,ticket,fare,cabin,embarked, ticket_type, ticket_number,
regexp_matches(cabin, '^\s*?([A-Z]\d*)\s*?([A-Z]\d*)\s*?$') as m, 2 as cnt,
substring(cabin, 1, 1) as cabin_type
from titanik_test_1
UNION
select id, pclass,"name",sex,age,sibsp,parch,ticket,fare,cabin,embarked, ticket_type, ticket_number,
regexp_matches(cabin, '^\s*?([A-Z]\d*)\s*?$') as m, 1 as cnt,
substring(cabin, 1, 1) as cabin_type
from titanik_test_1
UNION
select id, pclass,"name",sex,age,sibsp,parch,ticket,fare,cabin,embarked, ticket_type, ticket_number,
NULL as m, 0 as cnt, NULL as cabin_type
from titanik_test_1 where cabin ISNULL)
as a;
and the second part:
select a.*, b.cnt as cabin_people_cnt
into
titanik_test_3
from
titanik_test_2 a, (
select cabin as cabid, count(id) as cnt from titanik_test_1 group by 1) as b
where
a.cabin = b.cabid OR (a.cabin ISNULL AND b.cabid ISNULL);
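As with the training table, rows without a cabin can have their people count reset to zero (the same convention as above; apply it only if you want both tables treated identically):
--reset the count for rows without a cabin, as was done for the training table
update titanik_test_3 set cabin_people_cnt=0 where cabin ISNULL;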
One field remains: the port of embarkation:
select embarked, count(id) from titanik_train_3 group by 1 order by 1 asc;
select embarked, count(id) from titanik_test_3 group by 1 order by 1 asc;
The result: no duplicates were found, and the field does not need to be split:
embarked | count |
---|---|
C | 168 |
Q | 77 |
S | 644 |
(not specified) | 2 |
What should we do with the two records that have no data? We can fill them with random values, discard them, or substitute an "average" (most frequent) value. Take your pick.
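For example, one possible (purely illustrative) choice is to fill the gaps with the most frequent port, S:
--one option among several: fill the two missing ports with the most frequent value
update titanik_train_3 set embarked='S' where embarked ISNULL;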
Conclusions
In this part we did the preliminary preparation of the text data in the training and test samples. Time-wise, this work took about three hours, from downloading the data to this point.
This part turned out to be quite long, so the story continues in the next post, where we will try to build a table with numeric values instead of string ones. If anyone decides to work through this tutorial alongside me, running the queries and processing the data, I will answer questions in the comments. Criticism is welcome.
Update
Part Three: habrahabr.ru/post/165283
Part Four: habrahabr.ru/post/173819