Data Mining: Primary data processing using a DBMS. Part 3 (Pivot Tables)

    This series is devoted to data analysis in search of patterns. As an example, we use one of the training competitions of the Kaggle competitive data-science community. Although the data set for this task is small, the processing methods considered here are quite applicable to large volumes of data.
    After completing Part 1 and Part 2, we have two tables containing the transformed data: titanik_test_3 and titanik_train_3.

    Their field structures differ by a single field, survived, whose value we have to predict for the test data set. Here is the DDL describing the structure of titanik_train_3:
    CREATE TABLE titanik_train_3
    (
      id bigint,
      survived integer,
      pclass integer,
      name character varying(255),
      sex character varying(255),
      age double precision,
      sibsp integer,
      parch integer,
      ticket character varying(255),
      fare double precision,
      cabin character varying(255),
      embarked character varying(255),
      cabin_cnt integer,
      cabin_type text,
      ticket_type text,
      ticket_number text,
      cabin_people_cnt integer
    )
    

    Essentially, the task is to turn a table that mixes character and numeric data into a purely numeric representation. Data dictionaries and pivot tables will help here: numeric data is carried over as-is, while character (categorical) data is encoded.
    The most important condition for using dictionaries is full coverage of the values. It is therefore best at this stage (although in principle it could be done earlier) to merge the two tables into one, filling the missing survived field with NULL.
    Since the same sequence was used to generate the primary keys, there should be no collisions. The merge is done with the UNION operator:
    select a.* into titanik_full_1 from (
      select * from titanik_train_3
      union
      select
        id,
        NULL::integer as survived,
        pclass, "name", sex, age, sibsp, parch, ticket, fare, cabin, embarked,
        cabin_cnt, cabin_type, ticket_type, ticket_number, cabin_people_cnt
      from titanik_test_3
    ) as a;
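
    Before going further, it is worth checking that the merge picked up every row. A quick sanity check, assuming the standard Kaggle Titanic split of 891 training and 418 test rows:
    select (survived is null) as is_test, count(*)
    from titanik_full_1
    group by 1
    order by 1;
    -- expected: false = 891 training rows, true = 418 test rows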
    

    We now have a single table containing both the test and the training data sets.
    Next we drop all fields except the numeric ones (ticket_number is stored as text, so empty values have to be handled separately):
     select a.* into titanik_full_2 from (
       select id, survived, pclass::float, age::float, sibsp::float, parch::float,
              fare::float, cabin_cnt::float,
              CAST(ticket_number as float) as ticket_number, cabin_people_cnt::float
       from titanik_full_1 where ticket_number != ''
       union
       select id, survived, pclass, age, sibsp, parch, fare, cabin_cnt,
              0 as ticket_number, cabin_people_cnt
       from titanik_full_1 where ticket_number = ''
     ) as a;
    

    We get the table titanik_full_2, which looks like this:
    CREATE TABLE titanik_full_2
    (
      id bigint,
      survived integer,
      pclass double precision,
      age double precision,
      sibsp double precision,
      parch double precision,
      fare double precision,
      cabin_cnt double precision,
      ticket_number double precision,
      cabin_people_cnt double precision
    )
    

    Now we extend this table with indicator columns: for each distinct value of each categorical property we add a field that is 1 if the row has that value and 0 otherwise. Such tables are a close relative of pivot tables, with the difference that the value fields take only 0 or 1 (in other words, one-hot encoding).

    That is, the table becomes wider: the number of fields grows by the number of unique values. In principle, all these columns could be created manually, query by query.
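    For illustration, the manual variant for a single value of one field might look like this (a sketch only; the column name mirrors what the generator function below produces):
    -- add an indicator column for sex = 'male', defaulting to 0
    ALTER TABLE titanik_full_2 ADD COLUMN sexmale smallint NOT NULL DEFAULT 0;
    -- set it to 1 for the matching rows
    UPDATE titanik_full_2 SET sexmale = 1
    FROM titanik_full_1
    WHERE titanik_full_2.id = titanik_full_1.id
      AND titanik_full_1.sex = 'male';

    Doing this by hand for every value quickly becomes tedious, so it is better to write a small PL/pgSQL function that expands the fields automatically: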
    CREATE OR REPLACE FUNCTION sparse_matrix_generator(
      tablename_source character varying,
      tablename_dest character varying,
      field_name character varying)
    RETURNS integer AS
    $$
    DECLARE
      pgst_object  REFCURSOR;
      unival character varying;
    BEGIN
      -- iterate over the distinct non-NULL values of the field
      OPEN pgst_object FOR EXECUTE 'select distinct '||field_name ||' from '||tablename_source ||
        ' where ' || field_name ||' IS NOT NULL';
        LOOP
          FETCH pgst_object INTO unival;
          EXIT WHEN NOT FOUND;
          -- add an indicator column named <field><value>, defaulting to 0 ...
          EXECUTE 'ALTER TABLE '|| tablename_dest ||' ADD COLUMN "'|| field_name||unival ||'" smallint NOT NULL DEFAULT 0';
          -- ... and set it to 1 for the rows that carry this value
          EXECUTE 'UPDATE '||tablename_dest||' SET "'||field_name||unival|| '" = 1 FROM ' ||tablename_source||
          ' WHERE '||tablename_dest||'.id = '||tablename_source||'.id and '||field_name||' = '''||unival||'''';
        END LOOP;
      RETURN 0;
    END;
    $$
    LANGUAGE plpgsql;
    

    This function is applied as follows:
    select sparse_matrix_generator('titanik_full_1', 'titanik_full_2', 'cabin_type');
    select sparse_matrix_generator('titanik_full_1', 'titanik_full_2', 'ticket_type');
    select sparse_matrix_generator('titanik_full_1', 'titanik_full_2', 'embarked');
    select sparse_matrix_generator('titanik_full_1', 'titanik_full_2', 'sex');
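
    The resulting width of the table can be checked through information_schema:
    select count(*) as n_columns
    from information_schema.columns
    where table_name = 'titanik_full_2';
    -- 10 original columns plus the generated indicator columns; 58 in total here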
    

    So we now have a sparse matrix of 58 columns. It remains to normalize it and to separate the test and training samples by the survived field.
    There are different ways to normalize, and different data-analysis methods impose different requirements on the sample. We use one of the simplest: min-max normalization. The idea is this: the minimum maps to 0, the maximum to 1, and everything in between is scaled proportionally, i.e. x' = (x - min) / (max - min). To do this, we write a function:
    CREATE OR REPLACE FUNCTION minmax_normalizer(tablename_source character varying, field_name character varying)
      RETURNS integer AS
    $BODY$
    DECLARE
      maxval float;
      minval float;
      C RECORD;
    BEGIN
      -- find the range of the field (assumes a non-constant column, i.e. min <> max)
      EXECUTE 'select min("'||field_name ||'") as minval, max("'||field_name ||'") as maxval from '|| tablename_source INTO C;
      maxval := C.maxval;
      minval := C.minval;
      -- rescale: (x - min) / (max - min) maps the field onto [0, 1]
      EXECUTE 'UPDATE '||tablename_source||' SET "'||field_name||'"=("'||field_name||'"-$1)/($2-$1)' USING minval, maxval;
      RETURN 0;
    END;
    $BODY$
      LANGUAGE plpgsql VOLATILE
      COST 100;
     

    And apply it to the fields of the table that need to be normalized:
    select minmax_normalizer('titanik_full_2', 'pclass');
    select minmax_normalizer('titanik_full_2', 'age');
    select minmax_normalizer('titanik_full_2', 'sibsp');
    select minmax_normalizer('titanik_full_2', 'parch');
    select minmax_normalizer('titanik_full_2', 'fare');
    select minmax_normalizer('titanik_full_2', 'cabin_cnt');
    select minmax_normalizer('titanik_full_2', 'ticket_number');
    select minmax_normalizer('titanik_full_2', 'cabin_people_cnt');
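
    A quick way to verify the rescaling is to look at the range of one of the fields (min and max ignore NULLs, which are dealt with below):
    select min(age) as min_age, max(age) as max_age
    from titanik_full_2;
    -- expected: 0 and 1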
     

    As a result, we get a table with only numeric values in the range from zero to one.
    Now we split off the test and training samples:
    select * into titanik_test_final from 
    titanik_full_2 where survived isnull;
    alter table titanik_test_final drop column survived;
     

    for the test sample, and accordingly:
    select * into titanik_train_final from 
    titanik_full_2 where survived notnull;
    

    for the training one.
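    The split can be verified against the known sample sizes of the Kaggle Titanic data (891 training and 418 test rows):
    select (select count(*) from titanik_train_final) as train_rows,
           (select count(*) from titanik_test_final)  as test_rows;
    -- expected: 891 | 418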
    These tables still contain empty (NULL) values. They can be replaced, for example, with the column average. For this we again use a function:
    CREATE OR REPLACE FUNCTION null_normalizer(tablename_source character varying)
      RETURNS integer AS
    $BODY$
    DECLARE
      pgst_object  REFCURSOR;
      fieldval character varying;
      count_null integer;
      field_avg float;
    BEGIN
      -- iterate over all columns of the table
      OPEN pgst_object FOR EXECUTE 'select column_name from information_schema.columns where'||
      ' table_name='''||tablename_source||'''';
        LOOP
          FETCH pgst_object INTO fieldval;
          EXIT WHEN NOT FOUND;
          count_null := 0;
          -- count the NULLs in the column (the name must be double-quoted: some columns are mixed-case)
          EXECUTE 'select count(id) from '||tablename_source||' where "'||fieldval||'" isnull' into count_null;
          IF count_null > 0 THEN
    	  raise notice 'field: %', fieldval;
    	  -- compute the column average (avg ignores NULLs) and fill the gaps with it
    	  EXECUTE 'select avg("'||fieldval||'") from '||tablename_source INTO field_avg;
    	  EXECUTE 'UPDATE '||tablename_source||' set "'||fieldval||'" = $1 where "'||fieldval||'" isnull' using field_avg;
          END IF;
        END LOOP;
      RETURN 0;
    END;
    $BODY$
      LANGUAGE plpgsql VOLATILE
      COST 100;
    

    The function works as follows: it selects all column names of the table, counts the NULL values in each column, and if that count is greater than zero, computes the column average and updates the empty values to it.
    The function is called this way:
    select null_normalizer('titanik_test_final');
    select null_normalizer('titanik_train_final');
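
    After these calls no NULLs should remain, which is easy to spot-check, e.g. for the age column:
    select count(*) as null_ages
    from titanik_train_final
    where age is null;
    -- expected: 0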
    

    The resulting table is quite large and sparse:
    CREATE TABLE titanik_test_final
    (
      id bigint,
      pclass double precision,
      age double precision,
      sibsp double precision,
      parch double precision,
      fare double precision,
      cabin_cnt double precision,
      ticket_number double precision,
      cabin_people_cnt double precision,
      "cabin_typeF" smallint,
      "cabin_typeB" smallint,
      "cabin_typeG" smallint,
      "cabin_typeC" smallint,
      "cabin_typeT" smallint,
      "cabin_typeD" smallint,
      "cabin_typeE" smallint,
      "cabin_typeA" smallint,
      "ticket_typeSW/PP" smallint,
      "ticket_typeC" smallint,
      "ticket_typePC" smallint,
      "ticket_typeAQ/3." smallint,
      "ticket_typeSC/A.3" smallint,
      "ticket_typeS.O.C." smallint,
      "ticket_typeS.O./P.P." smallint,
      "ticket_typeSC/AH" smallint,
      "ticket_typeSOTON/O2" smallint,
      "ticket_typeC.A." smallint,
      "ticket_typeW/C" smallint,
      "ticket_typeS.C./A.4." smallint,
      "ticket_typeFa" smallint,
      "ticket_typeLP" smallint,
      "ticket_typeSCO/W" smallint,
      "ticket_typeF.C." smallint,
      "ticket_typeA.5." smallint,
      "ticket_typeSC/AH Basle" smallint,
      "ticket_typeSC/A4" smallint,
      "ticket_typeS.C./PARIS" smallint,
      "ticket_typeS.O.P." smallint,
      "ticket_typeLINE" smallint,
      "ticket_typeSO/C" smallint,
      "ticket_typeP/PP" smallint,
      "ticket_typeAQ/4" smallint,
      "ticket_typeSC" smallint,
      "ticket_typeW.E.P." smallint,
      "ticket_typeSOTON/O.Q." smallint,
      "ticket_typeA/4" smallint,
      "ticket_typeSC/PARIS" smallint,
      "ticket_typeA. 2." smallint,
      "ticket_typeF.C.C." smallint,
      "ticket_typeS.P." smallint,
      "ticket_typePP" smallint,
      "ticket_typeC.A./SOTON" smallint,
      "embarkedC" smallint,
      "embarkedQ" smallint,
      "embarkedS" smallint,
      sexfemale smallint,
      sexmale smallint
    )
    

    To export the tables to text files, use:
    COPY titanik_train_final to '/tmp/titanik_train_final.csv';
    COPY titanik_test_final to '/tmp/titanik_test_final.csv';
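
    Note that COPY without options writes tab-separated text despite the .csv extension. If a true comma-separated file with a header is needed, the format can be specified explicitly:
    COPY titanik_train_final TO '/tmp/titanik_train_final.csv' WITH (FORMAT csv, HEADER);
    COPY titanik_test_final TO '/tmp/titanik_test_final.csv' WITH (FORMAT csv, HEADER);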
    

    The data is now ready, and we can start looking for patterns.
    To reduce the dimensionality of the now-sparse table, one can use an autoencoder or linear PCA. To be continued in the next part, where the plan is to apply an autoencoder and a decision forest (random forest) and see what result this yields on the leaderboard.

    UPD: Part 4 is published.
