# Correlation for Beginners

- Tutorial

An update for those who found the article useful and bookmarked it: there is a decent chance the post will be downvoted and I will have to move it back to drafts. Keep a copy!


A brief and uncomplicated introduction for non-specialists, showing in a visual form various methods of searching for regression dependencies. It is nowhere near academic, but I hope it is understandable. It works as a mini training manual on data processing for natural-science students who do not know mathematics well~~, much like the author~~. Calculations are done in Matlab, data preparation in Excel - that is just how things are done in our field.

#### Introduction

Why is this needed at all? In and around science one very often faces the task of predicting some unknown parameter of an object from the known parameters of that object (predictors), given a large set of similar objects - the so-called training set. Example: we are choosing an apple at the market. It can be described by predictors such as redness, weight, and number of worms. But as consumers we care about the taste, measured in "parrots" on a five-point scale. From life experience we know that the taste is, with decent accuracy, 5 * redness + 2 * weight - 7 * number of worms. It is the search for such dependencies that we will talk about. To make learning more fun, let's try to predict a girl's weight from her 90/60/90 measurements and height.

#### Initial data

As the object of study I will take data on the figure parameters of Playboy Playmates of the Month. Source: www.wired.com/special_multimedia/2009/st_infoporn_1702, slightly cleaned up and converted from inches to centimeters. (I recall the joke that 34 inches is like two seventeen-inch monitors.) I also separated out the records with incomplete information. When working with real objects they can be used, but here they would only get in the way - although they can serve to verify the adequacy of the results. All our data is continuous, that is, roughly speaking, float-like; it was rounded to integers only to avoid cluttering the screen. There are also methods for working with discrete data - in our example that could be, say, skin color or nationality, which take one of a fixed set of values - but that belongs to classification and decision-making methods, which calls for another manual. There are two sheets in the Data.xls file: the first holds the actual data, the second the screened-out incomplete records and a set for testing our model.

#### Designations

W - real weight

W_p - weight predicted by our model

S - bust

T - waist

B - hips

L - height

E - model error

#### How to evaluate the quality of the model?

The goal of our exercise is to obtain some model that describes the object. How a specific model is obtained and how it works does not concern us yet; it is simply a function f(S, T, B, L) that returns the girl's weight. How do we tell which function is good and of high quality, and which is not? For this a so-called fitness function is used. The most classic and most frequently used one is the sum of squared differences between the predicted and the real values - in our case, sum((W_p - W)^2) over all points. This is where the name "least squares method" comes from. The criterion is neither the best nor the only one, but it is quite acceptable as a default. Its peculiarity is that it is sensitive to outliers and therefore rates models with outliers as lower quality. There are all sorts of alternatives, such as least absolute deviations, etc.
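The least-squares criterion itself is a one-liner. A minimal sketch in Python/NumPy (the article's calculations are in Matlab; the numbers here are made up for illustration):

```python
import numpy as np

# Made-up real weights W and model predictions W_p for four girls.
W = np.array([55.0, 48.0, 60.0, 52.0])
W_p = np.array([54.0, 50.0, 58.0, 52.0])

# Least-squares fitness: sum of squared differences, sum((W_p - W)^2).
sse = np.sum((W_p - W) ** 2)
print(sse)  # 1 + 4 + 4 + 0 = 9.0
```

A smaller value means a better model; the least-absolute-deviations alternative would sum `np.abs(W_p - W)` instead of the squares.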

#### Simple linear regression

The simplest case: one predictor variable and one dependent variable - in our case, say, height and weight. We need to construct the equation W_p = a * L + b, i.e. find the coefficients a and b such that W_p coincides with W as closely as possible over the whole sample. That is, for each girl we will have:

W_p_i = a * L_i + b

E_i = (W_p_i - W_i) ^ 2

The total error is then sum(E_i), and for the optimal values of a and b this sum is minimal. How do we find the equation?

#### Matlab

To simplify things, I highly recommend installing the Excel plugin called Exlink; it lives in the matlab/toolbox/exlink folder. It makes transferring data between the two programs very easy. After installing the plugin, an extra menu with an obvious name appears in Excel, and Matlab starts automatically. Transfer from Excel to Matlab is triggered by the command "Send data to MATLAB", and the reverse by "Get data from MATLAB". We send the numbers from column L, and separately from column W, to Matlab without the headers, naming the variables after the columns. The linear regression function is polyfit(x, y, 1); the 1 is the degree of the approximating polynomial - ours is linear, hence 1. We finally get the regression coefficients:

`regr=polyfit(L,W,1)`

Here a can be obtained as regr(1) and b as regr(2), so our predicted values are: `W_p=L*regr(1)+regr(2)`
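Outside Matlab the same fit is easy to reproduce. A minimal sketch in Python/NumPy, where np.polyfit plays the role of Matlab's polyfit (the heights and weights below are invented for illustration):

```python
import numpy as np

# Made-up heights (cm) and weights (kg) standing in for columns L and W.
L = np.array([160.0, 165.0, 170.0, 175.0, 180.0])
W = np.array([50.0, 53.0, 55.0, 58.0, 61.0])

# Degree-1 polynomial fit: returns [a, b] for W_p = a*L + b.
a, b = np.polyfit(L, W, 1)
W_p = a * L + b

print(round(a, 3), round(b, 3))  # 0.54 -36.4
```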

We then bring the results back to Excel.

#### Graphic

Hmm, scattered. This is a graph of W_p(W); the formula on the graph shows the relationship between W_p and W. Ideally it would be W_p = W * 1 + 0. The discretization of the source data shows through: the point cloud is checkered. The correlation coefficient is not great - the data are weakly correlated, i.e. our model describes the relationship between weight and height poorly. On the graph this shows as a cloud of points only weakly elongated along a straight line. A good model gives a cloud stretched into a narrow strip; a worse one, just a random set of dots or a round cloud. The model needs to be extended. The correlation coefficient deserves a separate discussion, because it is very often used absolutely incorrectly.

#### Matrix calculation

It is also possible to construct the regression without any polyfit, if we augment the column of height values with another column filled with ones:

`L(:,2)=1`

The 2 here is the index of the column into which the ones are written. The regression coefficients can then be found by the formula: `repr=inv(L'*L)*L'*W`

And, conversely, we find the predictions: `W_p=L*repr`
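The same matrix recipe can be sketched in Python/NumPy (the article works in Matlab; the data below is made up for illustration):

```python
import numpy as np

# Made-up heights (cm) and weights (kg).
heights = np.array([160.0, 165.0, 170.0, 175.0, 180.0])
W = np.array([50.0, 53.0, 55.0, 58.0, 61.0])

# Augment the height column with a column of ones, as in L(:,2)=1.
L = np.column_stack([heights, np.ones(len(heights))])

# The normal equations: repr = inv(L'*L) * L' * W.
coefs = np.linalg.inv(L.T @ L) @ L.T @ W

# And back again: W_p = L * repr.
W_p = L @ coefs
print(coefs)  # [a, b], identical to what polyfit returns
```

In production code np.linalg.lstsq (or Matlab's backslash, `L\W`) is preferred over the explicit inverse for numerical stability, but the formula above is the textbook form.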

Once you grasp the magic of matrices, ready-made functions become optional. The column of ones is needed to estimate the free term of the regression, i.e. the term that is not multiplied by any parameter. If you do not add it, the regression will have only one term, W_p = a * L, which will obviously be of worse quality than the two-term regression. In general, the free term should be dropped only when it is definitely not needed; by default it should be present.

#### Multilinear Regression

In older Russian-language literature this is referred to as MMNC - the multiple least squares method. It is an extension of the least squares method to several predictors: we now deal not only with height but also with the, so to speak, horizontal dimensions. Data preparation is exactly the same: both matrices go to Matlab, a column of ones is added, and the calculation uses the same formula. For lovers of ready-made functions there is

`b = regress(y,X)`

This function also requires the added column of ones. We repeat the calculation by the formula from the matrix section, send the result to Excel, and look.

#### Attempt number two

This is better, but still not great. As you can see, the checkering remains only in the horizontal direction. There is no avoiding it: the original weights were integers in pounds, so after conversion to kilograms they fall on a grid with a step of about 0.5. The final form of our model:

W_p = 0.2271 * S + 0.1851 * T + 0.3125 * B + 0.3949 * L - 72.9132

Volumes are in centimeters, weight in kg. Since all values except height are in the same units and of roughly the same order of magnitude (except the waist), we can estimate their contributions to the total weight. The reasoning goes something like this: the coefficient on the waist is the smallest, as are the waist values themselves in centimeters, so this parameter's contribution to the weight is minimal. For the bust and especially the hips it is larger, i.e. a centimeter on the waist adds less weight than one on the chest, and the hip measurement affects the weight most of all. However, any man interested in the matter knows this already. At the very least, our model does not contradict real life.
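As a quick sanity check, the final model is easy to evaluate by hand. A sketch in Python, plugging in the proverbial 90/60/90 figure; the height of 170 cm is an assumed value, chosen only for illustration:

```python
# Final model from the article: W_p = 0.2271*S + 0.1851*T + 0.3125*B + 0.3949*L - 72.9132
def predicted_weight(S, T, B, L):
    """Predicted weight (kg) from bust, waist, hips and height, all in cm."""
    return 0.2271 * S + 0.1851 * T + 0.3125 * B + 0.3949 * L - 72.9132

# The proverbial 90/60/90 with an assumed height of 170 cm.
print(round(predicted_weight(90, 60, 90, 170), 1))  # 53.9
```

About 54 kg for a 170-cm 90/60/90 figure - a believable number, which is the point of such a spot check.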

#### Model validation

The name is loud, but let's at least try to get approximate weights for those girls for whom there is a full set of sizes but no weight. There are 7 of them: May and June 1956, July 1957, March 1987, August 1988. We compute the weights predicted by the model:

`W_p=X*repr`

Well, at least in text form it looks believable. How well it corresponds to reality is for you to judge.

#### Applicability

In short, the resulting model is suitable for objects similar to our dataset. That is, the obtained correlations should not be used to reason about the figures of women weighing 80+ kg, of an age far from the "average across the hospital", and so on. In real applications we can assume the model is suitable if the parameters of the object under study do not differ too much from the average values of those same parameters in the initial dataset. Problems can (and do) arise if our predictors are strongly correlated with each other - for example, height and leg length. Then the coefficients of the corresponding terms in the regression equation are determined with low accuracy; in that case one of the parameters should be dropped, or the principal component method should be used to reduce the number of predictors. If we have a small sample and/or many predictors, we risk overfitting the model. For instance, if we take 604 parameters for our sample (and there are exactly 604 girls in the table), we can analytically obtain an equation with 604 + 1 terms that describes with absolute accuracy whatever we fed into it - but its predictive power will be very small. Finally, not every object can be described by a multilinear dependence: there are logarithmic, power, and all sorts of complex dependencies, and finding them is a completely different matter.
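The multicollinearity problem mentioned above is easy to demonstrate. A sketch in Python/NumPy on synthetic data (all numbers invented): "leg length" is made almost exactly half of height, so the design matrix becomes badly conditioned and the individual coefficients are determined with low accuracy, even though the fit itself stays good.

```python
import numpy as np

rng = np.random.default_rng(0)

height = rng.uniform(155, 180, size=50)
# "Leg length" is almost exactly half the height: a strongly correlated predictor.
legs = 0.5 * height + rng.normal(0, 0.01, size=50)
W = 0.5 * height - 30 + rng.normal(0, 1, size=50)

# Predictors plus a column of ones for the free term.
X = np.column_stack([height, legs, np.ones(50)])

# lstsq is the numerically sane way to solve the least-squares problem.
coef, *_ = np.linalg.lstsq(X, W, rcond=None)

# A huge condition number means the individual coefficients are unreliable...
print(np.linalg.cond(X))

# ...even though the predictions themselves are still fine.
resid = W - X @ coef
print(resid.std())
```

Dropping one of the two correlated columns (or projecting onto principal components) restores well-determined coefficients.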

#### Future plans

If this goes over well, I will try to present in the same style the principal component method for reducing data dimensionality, regression on principal components, the PLS method, the beginnings of cluster analysis, and object classification methods. If the Habr public does not take to it, I will try to take the comments into account. And if it does not take to it at all, then I will forget about the broad masses and make do with my own students. See you soon!