"Letter to the Turkish Sultan" or linear regression in C # using Accord.NET to analyze Moscow open data

When it comes to mastering the basics of machine learning, Python or R is usually suggested as the appropriate tool. We will not discuss their pros and cons, but simply ask: what should you do if you are familiar only with the .NET ecosystem, yet are very curious to plunge into the world of data science? The answer is simple: do not despair, and look in the direction of F#. And if, like me, you know only the basics of C# from .NET, then try studying the Accord.NET Framework.

We have already looked at using it to solve a classification problem; this time we will consider the simplest tools for linear regression. To do this, we will use open data on citizens' appeals, taken from the official website of the Mayor of Moscow.

Although the title of the article mentions C#, we will try to build the code in VB.NET as well.

All that remains is to invite you under the cut!




To head off the obvious comments, I will say right at the start that I have nothing to do with the Moscow government, its departments, prefectures, and so on, so it makes no sense to complain to me about their work. I just happened to find this data, typed it into a table by hand, and posted it on GitHub for you.

And if anyone is curious, this article continues a mini-series about how I studied Data Science from scratch (and never really learned it); links to the other articles are hidden under the spoiler.


I honestly admit that I am not a programmer, and I have studied Accord.NET only superficially. Unfortunately, there is not much literature on it, and I could not find any readily available online courses, so in many respects only the developers' site remains, and it is not as informative as one would like.

Therefore I carried out the basic manipulations of the data set mentioned above in the previous article of the series (which also contains a more detailed description of the data set). In this article we will, with some effort, read the data, train a model, and build a chart of sorts.

Contents:

Part I: introduction and a little about data
Part II: writing code in C#
Part III: writing code in VB and conclusion

Before we start writing code, a few words about the data.
This is open data on the analysis of citizens' appeals received by various executive bodies of the city of Moscow. I must say that the statistics are scarce: so far they cover only 22 months.

In fact, it could have been 23 months, but for November the developers provided an incomplete data set, so I did not include it.
Data is presented in csv format. The data columns mean the following:

num – record index
year – year of the record
month – month of the record
total_appeals – total number of appeals per month
appeals_to_mayor – total number of appeals addressed to the Mayor
res_positive – number of positive decisions
res_explained – number of appeals that received an explanation
res_negative – number of appeals with a negative decision
El_form_to_mayor – number of appeals to the Mayor in electronic form
Pap_form_to_mayor – number of appeals to the Mayor on paper
to_10K_total_VAO…to_10K_total_YUZAO – number of appeals per 10,000 population in the various districts of Moscow
to_10K_mayor_VAO…to_10K_mayor_YUZAO – number of appeals addressed to the Mayor and the Government of Moscow per 10,000 population in the various districts of the city

I did not find a way to automate the data collection, so I had to collect the data by hand and may have made a small mistake somewhere. As for the reliability of the data itself, I will leave that on the conscience of its authors.

It remains to say a couple of words about the framework itself, and then we can go to the code. Accord.NET is an open source project which nevertheless, in most cases, can be used for commercial development under the LGPL license. The framework seems to have all the basic tools necessary for data analysis and machine learning, from testing statistical hypotheses to neural networks.

Now we can go to the code with a clear conscience.
I posted the solution with the C# and VB.NET projects on GitHub; you can simply download it and try to build it (in theory it should run). If you want to create a project from scratch yourself, then for the same functionality you need to do the following:

  1. Create a new project (I created a console project targeting .NET Framework 4.5).
  2. Using the package manager (NuGet), install Accord.Controls version 3.8 (it will pull in all the other packages we need), as well as Accord.IO for working with tables. To draw the chart, you will also need to reference the standard Windows.Forms library. That is actually all; you can start writing code.

I will place the full C# code under the spoiler.

Full code for C#
using System;
using System.Linq;
using Accord.Statistics.Models.Regression.Linear;
using Accord.IO;
using Accord.Math;
using System.Data;
using System.Collections.Generic;
using Accord.Controls;
using Accord.Math.Optimization.Losses;
namespace cs_msc_mayor
{
    class Program
    {
        static void Main(string[] args)
        {
            //for separating the training and test samples
            int traintPos = 18;
            int testPos = 22;
            int allData = testPos + (testPos - traintPos);
            //for correct reading symbol of float point in csv
            System.Globalization.CultureInfo customCulture = (System.Globalization.CultureInfo)System.Threading.Thread.CurrentThread.CurrentCulture.Clone();
            customCulture.NumberFormat.NumberDecimalSeparator = ".";
            System.Threading.Thread.CurrentThread.CurrentCulture = customCulture;
            //read data
            string CsvFilePath = @"msc_appel_data.csv";
            DataTable mscTable = new CsvReader(CsvFilePath, true).ToTable();
            //for encoding the string values of months into numerical values
            Dictionary<string, double> monthNames = new Dictionary<string, double>
            {
                ["January"] = 1,
                ["February"] = 2,
                ["March"] = 3,
                ["April"] = 4,
                ["May"] = 5,
                ["June"] = 6,
                ["July"] = 7,
                ["August"] = 8,
                ["September"] = 9,
                ["October"] = 10,
                ["November"] = 11,
                ["December"] = 12
            };
            string[] months = mscTable.Columns["month"].ToArray<String>();
            double[] dMonths = new double[months.Length];
            for (int i=0; i< months.Length; i++)
            {
                dMonths[i] = monthNames[months[i]];
                //Console.WriteLine(dMonths[i]);
            }
            //select the target column
            double[] OutResPositive = mscTable.Columns["res_positive"].ToArray();
            // separation of the test and train target sample
            double[] OutResPositiveTrain = OutResPositive.Get(0, traintPos);
            double[] OutResPositiveTest = OutResPositive.Get(traintPos, testPos);
            //deleting unneeded columns
            mscTable.Columns.Remove("total_appeals");
            mscTable.Columns.Remove("month");
            mscTable.Columns.Remove("res_positive");
            mscTable.Columns.Remove("year");
            //add coded in a double column month into Table
            //create new column
            DataColumn newCol = new DataColumn("dMonth", typeof(double));
            newCol.AllowDBNull = true;
            // add new column
            mscTable.Columns.Add(newCol);
            //fill new column
            int counter = 0;
            foreach (DataRow row in mscTable.Rows)
            {
                row["dMonth"] = dMonths[counter];
                counter++;
            }
            //receiving input data from a table
            double[][] inputs = mscTable.ToArray();
            //separation of the test and train sample
            double[][] inputsTrain = inputs.Get(0, traintPos);
            double[][] inputsTest = inputs.Get(traintPos, testPos);
            //simple linear regression model
            var ols = new OrdinaryLeastSquares()
            {
                UseIntercept = true
            };
            //linear regression model for several features
            MultipleLinearRegression regression = ols.Learn(inputsTrain, OutResPositiveTrain);
            //make a prediction
            double[] predicted = regression.Transform(inputsTest);
            //console output
            for (int i = 0; i < testPos - traintPos; i++)
            {
                Console.WriteLine("predicted: {0}   real: {1}", predicted[i], OutResPositiveTest[i]);
            }
            // And  print the squared error using the SquareLoss class:
            Console.WriteLine("error = {0}", new SquareLoss(OutResPositiveTest).Loss(predicted));
            // print the coefficient of determination
            double r2 = new RSquaredLoss(numberOfInputs: 29, expected: OutResPositiveTest).Loss(predicted);
            Console.WriteLine("R^2 = {0}", r2);
            // alternative print the coefficient of determination
            double ur2 = regression.CoefficientOfDetermination(inputsTest, OutResPositiveTest, adjust: true);
            Console.WriteLine("alternative version of R2 = {0}", ur2);
            Console.WriteLine("Press enter and close chart to exit");
            // for chart
            int[] classes = new int[allData];
            double[] mountX = new double[allData];
            for (int i = 0; i < allData; i++)
            {
                if (i<testPos)
                {
                   // for csv data
                    mountX[i] = i+1;
                    classes[i] = 0; //csv data is class 0
                }
                else
                {
                    //for predicted
                    mountX[i] = i- (testPos - traintPos)+1;
                    classes[i] = 1; //predicted is class 1
                }
            }
            // make points of chart
            List<double> OutChart = new List<double>();
            OutChart.AddRange(OutResPositive);
            OutChart.AddRange(predicted);
            // plot chart
            ScatterplotBox.Show("res_positive from months", mountX, OutChart.ToArray(), classes).Hold();
            // for pause
            Console.ReadLine();
        }
    }
}


The solution to the linear regression problem is largely taken from an example on the developers' site. Nothing there is very complicated, but let's still go through the code in parts.

using System;
using System.Linq;
using Accord.Statistics.Models.Regression.Linear;
using Accord.IO;
using Accord.Math;
using System.Data;
using System.Collections.Generic;
using Accord.Controls;
using Accord.Math.Optimization.Losses;

We import the namespaces of the standard and third-party libraries.

namespace cs_msc_mayor
{
    class Program
    {
        static void Main(string[] args)
        {

We create a namespace, a class, and the Main method; everything is trivial.

//for separating the training and test samples
            int traintPos = 18;
            int testPos = 22;
            int allData = testPos + (testPos - traintPos);

We define the variables that will later come in handy for dividing the data into training and control samples.

//for correct reading symbol of float point in csv
            System.Globalization.CultureInfo customCulture = (System.Globalization.CultureInfo)System.Threading.Thread.CurrentThread.CurrentCulture.Clone();
            customCulture.NumberFormat.NumberDecimalSeparator = ".";
            System.Threading.Thread.CurrentThread.CurrentCulture = customCulture;

We need this so that the decimal separator is read the same way in both the Python and the .NET versions of the project (at least on my machine).
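
As an illustration of why the separator matters, here is a minimal sketch using only the standard library. The class and method names are hypothetical and not part of the project; instead of changing the thread culture, the sketch passes a culture explicitly to the parser, which is an alternative technique and not what the article's code does.

```csharp
using System;
using System.Globalization;

public static class DecimalSeparatorDemo
{
    // Parses a number that uses "." as the decimal separator,
    // regardless of the culture of the current thread.
    public static double ParseInvariant(string text)
    {
        return double.Parse(text, NumberStyles.Float, CultureInfo.InvariantCulture);
    }

    public static void Main()
    {
        // A culture with "," as the decimal separator, built by hand so that
        // the example does not depend on locale data installed on the machine.
        var commaCulture = (CultureInfo)CultureInfo.InvariantCulture.Clone();
        commaCulture.NumberFormat.NumberDecimalSeparator = ",";
        commaCulture.NumberFormat.NumberGroupSeparator = " ";

        double parsed;
        bool ok = double.TryParse("3.14", NumberStyles.Float, commaCulture, out parsed);
        Console.WriteLine("parsed \"3.14\" with a comma culture: {0}", ok);

        double value = ParseInvariant("3.14");
        Console.WriteLine(value.ToString(CultureInfo.InvariantCulture));
    }
}
```

Changing the thread culture, as the project does, has the advantage that library code which parses with the current culture picks up the setting as well.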

//read data
            string CsvFilePath = @"msc_appel_data.csv";
            DataTable mscTable = new CsvReader(CsvFilePath, true).ToTable();

We read the data from the csv file into a DataTable.

//for encoding the string values of months into numerical values
            Dictionary<string, double> monthNames = new Dictionary<string, double>
            {
                ["January"] = 1,
                ["February"] = 2,
                ["March"] = 3,
                ["April"] = 4,
                ["May"] = 5,
                ["June"] = 6,
                ["July"] = 7,
                ["August"] = 8,
                ["September"] = 9,
                ["October"] = 10,
                ["November"] = 11,
                ["December"] = 12
            };
            string[] months = mscTable.Columns["month"].ToArray<String>();
            double[] dMonths = new double[months.Length];
            for (int i=0; i< months.Length; i++)
            {
                dMonths[i] = monthNames[months[i]];
                //Console.WriteLine(dMonths[i]);
            }

To process the month in which the appeals were received, we have to convert it into a digestible format; in this case we encode everything as double.
By analogy with the Python solution, we first create a dictionary and then recode the data according to it in a loop.
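
For comparison, the same encoding can be sketched without a hand-written dictionary, assuming the month names in the file are the English ones used in the dictionary above. The class and method names are hypothetical; the month names come from the invariant culture of the standard library.

```csharp
using System;
using System.Globalization;

public static class MonthEncoding
{
    // Maps an English month name ("January" .. "December") to 1 .. 12.
    public static double MonthToNumber(string monthName)
    {
        // MonthNames has 13 entries; the last one is an empty string.
        string[] names = CultureInfo.InvariantCulture.DateTimeFormat.MonthNames;
        int index = Array.IndexOf(names, monthName);
        if (index < 0 || index > 11)
            throw new ArgumentException("Unknown month name: " + monthName);
        return index + 1;
    }

    public static void Main()
    {
        string[] months = { "January", "February", "October" };
        foreach (string month in months)
            Console.WriteLine("{0} -> {1}", month, MonthToNumber(month));
    }
}
```

The explicit dictionary in the project is arguably easier to read and to adapt if the file used other spellings; the sketch only shows that the mapping itself is mechanical.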

//select the target column
            double[] OutResPositive = mscTable.Columns["res_positive"].ToArray();
            // separation of the test and train target sample
            double[] OutResPositiveTrain = OutResPositive.Get(0, traintPos);
            double[] OutResPositiveTest = OutResPositive.Get(traintPos, testPos);

Select the target variable. We will predict the number of positive decisions across all appeals.
In the first line, we pull this data from the table, converting it to the double type.
Then, in the other two variables, we copy positions 0 to 18 for the training sample and 18 to 22 for the control sample.
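
The split described here can be pictured with a small standalone sketch. The Slice helper below is hypothetical and is not Accord.NET's Get method; it assumes the end position is exclusive, which matches the 18 training and 4 control values the article works with.

```csharp
using System;

public static class SplitDemo
{
    // Copies the elements with indices start .. end-1 (end is exclusive).
    public static double[] Slice(double[] source, int start, int end)
    {
        var result = new double[end - start];
        Array.Copy(source, start, result, 0, result.Length);
        return result;
    }

    public static void Main()
    {
        int trainPos = 18;
        int testPos = 22;

        // Stand-in for the 22 monthly values of the target column.
        var target = new double[testPos];
        for (int i = 0; i < target.Length; i++)
            target[i] = i + 1;

        double[] train = Slice(target, 0, trainPos);
        double[] test = Slice(target, trainPos, testPos);
        Console.WriteLine("train: {0}, test: {1}", train.Length, test.Length);
    }
}
```

Because the data is a time series, the split is positional: the first 18 months are used for training and the last 4 for control, with no shuffling.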

//deleting unneeded columns
            mscTable.Columns.Remove("total_appeals");
            mscTable.Columns.Remove("month");
            mscTable.Columns.Remove("res_positive");
            mscTable.Columns.Remove("year");

We remove unnecessary columns from the table: our target variable, the months, the years, and the total number of appeals, because it includes information about the positive outcomes.

//add coded in a double column month into Table
            //create new column
            DataColumn newCol = new DataColumn("dMonth", typeof(double));
            newCol.AllowDBNull = true;
            // add new column
            mscTable.Columns.Add(newCol);
            //fill new column
            int counter = 0;
            foreach (DataRow row in mscTable.Rows)
            {
                row["dMonth"] = dMonths[counter];
                counter++;
            }

Now we add a column with the recoded months: first we create a new column, add it to the table, and then fill it in a loop.

//receiving input data from a table
            double[][] inputs = mscTable.ToArray();
            //separation of the test and train sample
            double[][] inputsTrain = inputs.Get(0, traintPos);
            double[][] inputsTest = inputs.Get(traintPos, testPos);

By analogy with the target variable, we create arrays of input data (features).

//simple linear regression model
            var ols = new OrdinaryLeastSquares()
            {
                UseIntercept = true
            };
            //linear regression model for several features
            MultipleLinearRegression regression = ols.Learn(inputsTrain, OutResPositiveTrain);

It remains to create the model. First, we create an ordinary least squares learner, and then use it to fit a multiple linear regression model, since we have almost 30 features. Naturally, we train the model on the training sample.
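
For reference, the textbook formulation of what is being fitted can be written as follows. This is the standard definition of multiple linear regression with ordinary least squares, not a description of Accord.NET internals; the symbols are introduced here for illustration: n is the number of training rows, p the number of features, and the intercept beta_0 is the term enabled by UseIntercept = true.

```latex
\hat{y}_i = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij},
\qquad
\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2
```

With n = 18 training rows and almost 30 features, there are more unknown coefficients than observations, so the minimizer is not unique; this is worth keeping in mind when looking at the quality of the predictions.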

//make a prediction
            double[] predicted = regression.Transform(inputsTest);

Here we get the prediction for the control (test) sample.

//console output
            for (int i = 0; i < testPos - traintPos; i++)
            {
                Console.WriteLine("predicted: {0}   real: {1}", predicted[i], OutResPositiveTest[i]);
            }
            // And  print the squared error using the SquareLoss class:
            Console.WriteLine("error = {0}", new SquareLoss(OutResPositiveTest).Loss(predicted));
            // print the coefficient of determination
            double r2 = new RSquaredLoss(numberOfInputs: 29, expected: OutResPositiveTest).Loss(predicted);
            Console.WriteLine("R^2 = {0}", r2);
            // alternative print the coefficient of determination
            double ur2 = regression.CoefficientOfDetermination(inputsTest, OutResPositiveTest, adjust: true);
            Console.WriteLine("alternative version of R2 = {0}", ur2);
            Console.WriteLine("Press enter and close chart to exit");

We print the predicted and real values to the console, as well as the error and the coefficient of determination.
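
To make these two numbers less of a black box, here is a minimal sketch that computes a mean squared error and a plain (unadjusted) coefficient of determination by hand. The class and method names are hypothetical, and the formulas are the textbook ones; whether Accord.NET's SquareLoss and RSquaredLoss use exactly the same conventions (mean versus sum, adjustment for the number of inputs) is not verified here.

```csharp
using System;
using System.Linq;

public static class RegressionMetrics
{
    // Average of the squared differences between expected and predicted values.
    public static double MeanSquaredError(double[] expected, double[] predicted)
    {
        return expected.Zip(predicted, (e, p) => (e - p) * (e - p)).Average();
    }

    // R^2 = 1 - SS_res / SS_tot, without any adjustment.
    public static double RSquared(double[] expected, double[] predicted)
    {
        double mean = expected.Average();
        double ssRes = expected.Zip(predicted, (e, p) => (e - p) * (e - p)).Sum();
        double ssTot = expected.Sum(e => (e - mean) * (e - mean));
        return 1.0 - ssRes / ssTot;
    }

    public static void Main()
    {
        // Made-up numbers, only to show the calls.
        double[] real = { 10, 12, 14, 16 };
        double[] guess = { 11, 12, 13, 17 };
        Console.WriteLine("MSE = {0}", MeanSquaredError(real, guess));
        Console.WriteLine("R^2 = {0}", RSquared(real, guess));
    }
}
```

An R^2 of 1 means the predictions match the real values exactly, 0 means they are no better than always predicting the mean, and negative values mean they are worse than that.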

// for chart
            int[] classes = new int[allData];
            double[] mountX = new double[allData];
            for (int i = 0; i < allData; i++)
            {
                if (i<testPos)
                {
                   // for csv data
                    mountX[i] = i+1;
                    classes[i] = 0; //csv data is class 0
                }
                else
                {
                    //for predicted
                    mountX[i] = i- (testPos - traintPos)+1;
                    classes[i] = 1; //predicted is class 1
                }
            }
            // make points of chart
            List<double> OutChart = new List<double>();
            OutChart.AddRange(OutResPositive);
            OutChart.AddRange(predicted);

The developers themselves seem to advise using third-party tools for displaying charts, but we will use the ScatterplotBox chart supplied with the framework, which displays points. To make the data at least somewhat visual, we create an analogue of a time trend on the X axis (point 1 is January 2016, the last point is October 2017). In another array we also classify the points: the first 22 are our initial data, and the last 4 are predicted (the chart paints them in a different color).
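
The correspondence between point numbers on the X axis and calendar months can be sketched like this, assuming, as stated above, that point 1 is January 2016. The class and method names are hypothetical and are not used in the project.

```csharp
using System;
using System.Globalization;

public static class ChartAxis
{
    // Converts a 1-based point number on the X axis into a "yyyy-MM" label.
    public static string PointLabel(int point)
    {
        DateTime month = new DateTime(2016, 1, 1).AddMonths(point - 1);
        return month.ToString("yyyy-MM", CultureInfo.InvariantCulture);
    }

    public static void Main()
    {
        Console.WriteLine(PointLabel(1));
        Console.WriteLine(PointLabel(22));
    }
}
```

In the project's loop, the four predicted points get the X values 19 to 22, so on the chart they sit over the same months as the four real control values.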

// plot chart
            ScatterplotBox.Show("res_positive from months", mountX, OutChart.ToArray(), classes).Hold();
            // for pause
            Console.ReadLine();
        }
    }
}

ScatterplotBox.Show displays a window with a chart. We feed it the previously prepared data for the X and Y axes.

Honestly, I don't know Visual Basic, but here a converter from C# to VB.NET will help us.

We will not go through the code in parts; you can be guided by the comments left in the code, which are identical in both projects and divide the code into similar sections.

Full code for VB.NET
Imports System
Imports System.Linq
Imports Accord.Statistics.Models.Regression.Linear
Imports Accord.IO
Imports Accord.Math
Imports System.Data
Imports System.Collections.Generic
Imports Accord.Controls
Imports Accord.Math.Optimization.Losses
Module Program
    Sub Main()
        'for separating the training and test samples
        Dim traintPos As Integer = 18
        Dim testPos As Integer = 22
        Dim allData As Integer = testPos + (testPos - traintPos)
        'for correct reading symbol of float point in csv
        Dim customCulture As System.Globalization.CultureInfo = CType(System.Threading.Thread.CurrentThread.CurrentCulture.Clone(), System.Globalization.CultureInfo)
        customCulture.NumberFormat.NumberDecimalSeparator = "."
        System.Threading.Thread.CurrentThread.CurrentCulture = customCulture
        'read data
        Dim CsvFilePath As String = "msc_appel_data.csv"
        Dim mscTable As DataTable = New CsvReader(CsvFilePath, True).ToTable()
        'for encoding the string values of months into numerical values
        Dim monthNames As Dictionary(Of String, Double) = New Dictionary(Of String, Double) From
            {{"January", 1}, {"February", 2}, {"March", 3}, {"April", 4}, {"May", 5}, {"June", 6},
            {"July", 7}, {"August", 8}, {"September", 9},
            {"October", 10}, {"November", 11}, {"December", 12}}
        Dim months As String() = mscTable.Columns("month").ToArray(Of String)()
        Dim dMonths As Double() = New Double(months.Length - 1) {}
        For i As Integer = 0 To months.Length - 1
            dMonths(i) = monthNames(months(i))
        Next
        'select the target column
        Dim OutResPositive As Double() = mscTable.Columns("res_positive").ToArray()
        'separation of the test and train target sample
        Dim OutResPositiveTrain As Double() = OutResPositive.[Get](0, traintPos)
        Dim OutResPositiveTest As Double() = OutResPositive.[Get](traintPos, testPos)
        'deleting unneeded columns
        mscTable.Columns.Remove("total_appeals")
        mscTable.Columns.Remove("month")
        mscTable.Columns.Remove("res_positive")
        mscTable.Columns.Remove("year")
        'add coded in a double column month into Table
        'create new column
        Dim newCol As DataColumn = New DataColumn("dMonth", GetType(Double))
        newCol.AllowDBNull = True
        'add new column
        mscTable.Columns.Add(newCol)
        'fill new column
        Dim counter As Integer = 0
        For Each row As DataRow In mscTable.Rows
            row("dMonth") = dMonths(counter)
            counter += 1
        Next
        'receiving input data from a table
        Dim inputs As Double()() = mscTable.ToArray()
        'separation of the test and train sample
        Dim inputsTrain As Double()() = inputs.[Get](0, traintPos)
        Dim inputsTest As Double()() = inputs.[Get](traintPos, testPos)
        'simple linear regression model
        Dim ols = New OrdinaryLeastSquares() With {.UseIntercept = True}
        'linear regression model for several features
        Dim regression As MultipleLinearRegression = ols.Learn(inputsTrain, OutResPositiveTrain)
        'make a prediction
        Dim predicted As Double() = regression.Transform(inputsTest)
        'console output
        For i As Integer = 0 To testPos - traintPos - 1
            Console.WriteLine("predicted: {0}   real: {1}", predicted(i), OutResPositiveTest(i))
        Next
        'And print the squared error using the SquareLoss class
        Console.WriteLine("error = {0}", New SquareLoss(OutResPositiveTest).Loss(predicted))
        'print the coefficient of determination
        Dim r2 As Double = New RSquaredLoss(numberOfInputs:=29, expected:=OutResPositiveTest).Loss(predicted)
        Console.WriteLine("R^2 = {0}", r2)
        'alternative print the coefficient of determination
        Dim ur2 As Double = regression.CoefficientOfDetermination(inputsTest, OutResPositiveTest, adjust:=True)
        Console.WriteLine("alternative version of R2 = {0}", ur2)
        Console.WriteLine("Press enter and close chart to exit")
        'for chart
        Dim classes As Integer() = New Integer(allData - 1) {}
        Dim mountX As Double() = New Double(allData - 1) {}
        For i As Integer = 0 To allData - 1
            If i < testPos Then
                mountX(i) = i + 1
                classes(i) = 0 'csv data is class 0
            Else
                mountX(i) = i - (testPos - traintPos) + 1
                classes(i) = 1 'predicted is class 1
            End If
        Next
        'make points of chart
        Dim OutChart As List(Of Double) = New List(Of Double)()
        OutChart.AddRange(OutResPositive)
        OutChart.AddRange(predicted)
        'plot chart
        ScatterplotBox.Show("res_positive from months", mountX, OutChart.ToArray(), classes).Hold()
        'for pause
        Console.ReadLine()
    End Sub
End Module


It should be noted that our project turned out to be fairly cross-platform, since it can be built with both Visual Studio on Windows and MonoDevelop on Linux. This holds only for C#, though: the VB.NET code does not always compile without problems under Mono.
Instead of a thousand words, let's look at the screenshots.

Build of the VB project, version 1.0.1, under Windows.



Build of the C# project, version 1.0.0, under Linux Mint.



You probably noticed that the results in the two pictures are slightly different.
This is not Mono's fault. The thing is that in version 1.0.0 of the C# project, built for Linux, I forgot to include the recoded month column, whereas in version 1.0.1 of the VB project, built in Visual Studio, I did include it.

At first I wanted to replace the screenshots, but then I thought that this was a clear demonstration that this feature slightly improves the quality of the prediction.

In fact, however, we achieved poor results that are of no use other than academic.

The reasons for this were the following factors:

  1. The features are measured on different scales, but we did not scale them (because I still have not figured out how to do this with Accord.NET).
  2. We also stuffed almost all the features into the model and did not weed out the "bad" ones, that is, we did not use regularization. (Guess why? That's right: because I haven't figured that out yet either.)
  3. And, of course, there is too little data to make sensible predictions.
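
As a hedged illustration of the first point, scaling can be done by hand without any framework support. The sketch below standardizes each column (subtract the mean, divide by the standard deviation), with the statistics taken from the training rows only and then applied to both samples. The class and method names are hypothetical, this is not an Accord.NET API, and the article's project does not do this.

```csharp
using System;

public static class FeatureScaling
{
    // Computes the mean and (population) standard deviation of each column.
    public static void Fit(double[][] train, out double[] mean, out double[] std)
    {
        int cols = train[0].Length;
        mean = new double[cols];
        std = new double[cols];
        for (int j = 0; j < cols; j++)
        {
            double sum = 0;
            foreach (double[] row in train) sum += row[j];
            mean[j] = sum / train.Length;

            double squares = 0;
            foreach (double[] row in train)
                squares += (row[j] - mean[j]) * (row[j] - mean[j]);
            std[j] = Math.Sqrt(squares / train.Length);
            if (std[j] == 0) std[j] = 1; // constant column: avoid division by zero
        }
    }

    // Returns a scaled copy of the rows; the input is left unchanged.
    public static double[][] Transform(double[][] rows, double[] mean, double[] std)
    {
        var result = new double[rows.Length][];
        for (int i = 0; i < rows.Length; i++)
        {
            result[i] = new double[mean.Length];
            for (int j = 0; j < mean.Length; j++)
                result[i][j] = (rows[i][j] - mean[j]) / std[j];
        }
        return result;
    }

    public static void Main()
    {
        // Made-up rows with two features on very different scales.
        double[][] train = { new double[] { 1, 100 }, new double[] { 3, 300 } };
        double[] mean, std;
        Fit(train, out mean, out std);
        double[][] scaled = Transform(train, mean, std);
        Console.WriteLine("{0} {1}", scaled[0][0], scaled[0][1]);
    }
}
```

In the project this would mean calling Fit on inputsTrain and Transform on both inputsTrain and inputsTest before training, so that the control sample does not leak into the statistics.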

There may be other reasons that I don't know about.

But fortunately, practical application of the model was never our goal; what mattered was to learn that the framework exists and to try the simplest things with it. I hope that you will go on to master this tool, and that I will learn to work with Accord.NET from your articles.
