Organization of storage of intermediate tables for the CART algorithm
Good day to all who read. First, I will explain the background to this issue. It all started with the fact that a friend asked me to help him make an application (WinForm in C #) that would create a binary decision tree using the CART algorithm based on the statistics in the table on the form. The bottom line is to divide the source table into 2 parts according to the calculated criteria. In order to divide the source table, we must find the average for each of the columns:
The first and last columns are not taken into account in the calculation (the first value is Year, the last value is 0 or 1). This function will return a one-dimensional array of average values, access to which will go according to the index.
Then there is the actual search for the Gini coefficient for each of the columns, which will show us which of the columns to take to separate our table, among them the minimum is selected and, accordingly, the column index with the minimum coefficient will be necessary for us to separate. Then this column is selected and all values greater than the average value for this column are sent to one table, for example 1.1, and all values are less than the average in table 1.2. Below is the code. which is responsible for the first partition of the source table:
The table itself, I recall, is obtained from the DataGridView. And all of the above is called in this way:
Then each intermediate table goes through all the same steps. A signal to complete the division of the table is that in the last column (T), all values will be either all equal to 1 or all equal to 0. At each next step, one column is not involved in the calculations, for which the Gini coefficient was minimal at the previous step. Thus, a binary decision tree is constructed. So, now the question itself: how can you organize the storage of intermediate tables so that they are shortened and that it is important not to lose. They suggested the option of storing them in a three-dimensional array, stacking all the intermediate tables at positions 1, 11, 12, 111, 112, and so on, but we rejected this option because of its “gluttony” in memory. If anyone understood at least something, tell me how to organize it.
//нахождение средних значений по столбцам
public static double[] GetAverageValue(DataTable table)
{
#region Расчет средних значений по столбцам
var sr = new double[14];
for (int j = 1; j < table.Columns.Count - 1; j++)
{
double sum = 0.0;
for (int i = 0; i < table.Rows.Count; i++)
{
sum = sum + Convert.ToDouble(table.Rows[i][j]);
}
sr[j - 1] = sum / table.Rows.Count;
}
#endregion
return sr;
}
The first and last columns are not taken into account in the calculation (the first value is Year, the last value is 0 or 1). This function will return a one-dimensional array of average values, access to which will go according to the index.
Then there is the actual search for the Gini coefficient for each of the columns, which will show us which of the columns to take to separate our table, among them the minimum is selected and, accordingly, the column index with the minimum coefficient will be necessary for us to separate. Then this column is selected and all values greater than the average value for this column are sent to one table, for example 1.1, and all values are less than the average in table 1.2. Below is the code. which is responsible for the first partition of the source table:
//нахождение значений больше либо равных среднему в столбце таблицы
public static int[] GetMoreAverageAndLessAverage(double average, double[] columns, int[] T)
{
int moreAverage = 0;
int valueTMoreAverage = 0;
int valueTInvertMoreAverage = 0;
int lessAverage = 0;
int valueTLessAverage = 0;
int valueTLessInvertAverage = 0;
var moreAverageValue = new int[6];
for (int i = 0; i < columns.Length; i++ )
{
if (columns[i] >= average)
{
moreAverage = moreAverage + 1;
if (T[i] == 1)
{
valueTMoreAverage = valueTMoreAverage + 1;
}
else
{
valueTInvertMoreAverage = valueTInvertMoreAverage + 1;
}
}
else
{
lessAverage = lessAverage + 1;
if (T[i] == 1)
{
valueTLessAverage = valueTLessAverage + 1;
}
else
{
valueTLessInvertAverage = valueTLessInvertAverage + 1;
}
}
}
moreAverageValue[0] = moreAverage;
moreAverageValue[1] = valueTMoreAverage;
moreAverageValue[2] = valueTInvertMoreAverage;
moreAverageValue[3] = lessAverage;
moreAverageValue[4] = valueTLessAverage;
moreAverageValue[5] = valueTLessInvertAverage;
return moreAverageValue;
}
//нахождение значения коэффициента Gini(T,param)
public static double GetValueGini(int[] currentCollumnsAndT, int lengthCollumns)
{
var gini = 0.0;
if (currentCollumnsAndT != null)
{
double gini1 = (Convert.ToDouble(currentCollumnsAndT.GetValue(0)) / lengthCollumns) *
(1 -
((Convert.ToDouble(currentCollumnsAndT.GetValue(1)) / Convert.ToDouble(currentCollumnsAndT.GetValue(0))) *
(Convert.ToDouble(currentCollumnsAndT.GetValue(1)) / Convert.ToDouble(currentCollumnsAndT.GetValue(0)))) -
((Convert.ToDouble(currentCollumnsAndT.GetValue(2)) / Convert.ToDouble(currentCollumnsAndT.GetValue(0))) *
(Convert.ToDouble(currentCollumnsAndT.GetValue(2)) / Convert.ToDouble(currentCollumnsAndT.GetValue(0)))));
double gini2 = (Convert.ToDouble(currentCollumnsAndT.GetValue(3)) / lengthCollumns) *
(1 -
((Convert.ToDouble(currentCollumnsAndT.GetValue(4)) / Convert.ToDouble(currentCollumnsAndT.GetValue(3))) *
(Convert.ToDouble(currentCollumnsAndT.GetValue(4)) / Convert.ToDouble(currentCollumnsAndT.GetValue(3)))) -
((Convert.ToDouble(currentCollumnsAndT.GetValue(5)) / Convert.ToDouble(currentCollumnsAndT.GetValue(3))) *
(Convert.ToDouble(currentCollumnsAndT.GetValue(5)) / Convert.ToDouble(currentCollumnsAndT.GetValue(3)))));
gini = gini1 + gini2;
}
return gini;
}
//расчет первичного разделения таблицы
public static double[] GetValueGiniIsPrimaryDivide(DataTable table)
{
var currentCollumn = new double[table.Rows.Count];
var average = GetAverageValue(table); //выкинуть в класс формы чтобы не изменялось, передавать параметром в GetValueGiniIsPrimaryDivide (возможно передавать)
var T = new int[table.Rows.Count];
var gini = new double[table.Columns.Count - 1];
for (int i = 0; i < table.Rows.Count; i++)
{
T[i] = Convert.ToInt16(table.Rows[i][table.Columns.Count - 1]);
}
for (int i = 1; i < table.Columns.Count - 1; i++)
{
for (int j = 0; j < table.Rows.Count; j++)
{
currentCollumn[j] = 0.0;
currentCollumn[j] = Convert.ToDouble(table.Rows[j][i].ToString());
}
var bufer = average[i - 1];
gini[i] = GetValueGini(GetMoreAverageAndLessAverage(bufer, currentCollumn, T), table.Rows.Count);
}
var min = Convert.ToDouble(gini.GetValue(0));
int indexMin = 0;
for (int i = 0; i < gini.Length; i++)
{
if (Convert.ToDouble(gini.GetValue(i)) <= min)
{
min = Convert.ToDouble(gini.GetValue(i));
indexMin = i;
}
}
var returnValue = new double[2];
returnValue[0] = min;
returnValue[1] = indexMin;
return returnValue;
}
}
The table itself, I recall, is obtained from the DataGridView. And all of the above is called in this way:
#region Создание из DataGrid DataTable
var table = new DataTable();
for (int j = 0; j < dataGridView1.Columns.Count; j++)
{
table.Columns.Add(dataGridView1.Columns[j].Name);
}
for (int i = 0; i < dataGridView1.Rows.Count - 1; i++)
{
table.Rows.Add();
for (int j = 0; j < dataGridView1.Columns.Count; j++)
{
table.Rows[i][j] = dataGridView1[j, i].Value.ToString();
}
}
#endregion
var giniIsMin = SeparationTable.GetValueGiniIsPrimaryDivide(table);
Then each intermediate table goes through all the same steps. A signal to complete the division of the table is that in the last column (T), all values will be either all equal to 1 or all equal to 0. At each next step, one column is not involved in the calculations, for which the Gini coefficient was minimal at the previous step. Thus, a binary decision tree is constructed. So, now the question itself: how can you organize the storage of intermediate tables so that they are shortened and that it is important not to lose. They suggested the option of storing them in a three-dimensional array, stacking all the intermediate tables at positions 1, 11, 12, 111, 112, and so on, but we rejected this option because of its “gluttony” in memory. If anyone understood at least something, tell me how to organize it.