Statistical tests in R. Part 1: Binary classification
Good day. I want to share my knowledge about working with statistics in R.
Many of us have to deal with all kinds of data at work and in everyday life, and processing and analysing it properly and correctly is not that difficult. In this series of articles I will show how to apply some statistical tests.
Interested? Then welcome under the cut.
Part 2: Tests of qualitative data
Part 3: Tests of quantitative data
I apologize in advance for frequently using English terms, and for any mistakes in their translation.
Binary classification, qualitative data
The first article is devoted to an interesting kind of test: binary classification. This is testing an object for the presence or absence of some property, for example diagnostic tests (everyone has probably had the Mantoux test) or signal detection in radar.
Let's work through an example. All sample files can be downloaded at the end of the article. Imagine you came up with an algorithm that determines whether there is a person in a photograph. Everything seems to work and you are pleased, but it is too early to celebrate: you still need to evaluate the quality of your algorithm. This is where our test comes in. We will not worry for now about the sample size needed for testing. Say you took 30 photos, recorded by hand in an Excel file whether there is a person in each of them, and then ran them through your algorithm. As a result, we got the following table:
We save it straight away as CSV so we don't have to bother reading XLS (that is possible in R, but not out of the box).
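The screenshot of the table is not reproduced here. Judging by the code below, data1.csv has two 0/1 columns: Human (the manual markup, is there a person?) and Test (the algorithm's answer). The first lines should look roughly like this (the values here are made up, the real ones are in the sample files):
Human,Test
1,1
1,0
0,0
...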
Now a little theory. Based on the test results, a 2x2 table is compiled: the test's answer (T+ or T-) against the true state (H+ or H-).
Important parameters
A priori probability:
Sensitivity. P(T+ | H+). The probability that a person who is actually in the photo will be detected.
Se = 14/16
Specificity. P(T- | H-). The probability that, when there is no person, the test result is negative.
Sp = 10/14
Likelihood ratio. An important characteristic for evaluating a test. It consists of two values: LR+ = Se / (1 - Sp) and LR- = Sp / (1 - Se) (with this convention, bigger is better for both).
In the literature a test is considered good if both LR+ and LR- are greater than 3 (this applies to medical tests).
A posteriori probability: positive and negative predictive value (PV+ and PV-). The probability that a positive (respectively, negative) test result is correct.
PV + = 14/18
PV- = 10/12
These quantities are also related to Type I and Type II errors from hypothesis testing: with "no person in the photo" as the null hypothesis, the Type I error (false alarm) rate is 1 - Sp and the Type II error (missed detection) rate is 1 - Se, so they carry essentially the same information as specificity and sensitivity.
Now in R
To get started, we load the data.
# load the data (set your working directory to the folder with the sample files first)
tab <- read.csv(file="data1.csv", header=TRUE, sep=",", dec=".")
attach(tab)
# turn the 0/1 codes into labelled factors so R treats them as categories, not numbers
Test <- factor(Test, levels=c("0","1"), labels=c("T-","T+"), ordered=TRUE)
Human <- factor(Human, levels=c("0","1"), labels=c("H-","H+"), ordered=TRUE)
In the last two lines we assigned labels instead of 0 and 1. This is necessary because otherwise R would treat our data as numbers.
The table can be displayed as follows:
addmargins(table(Test, Human))
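Before handing everything over to a package, the quantities from the theory section can be checked by hand. This is just a minimal sanity-check sketch, assuming the counts implied by the fractions above (TP = 14, FN = 2, FP = 4, TN = 10); with the real data they come straight out of the table we just printed.
# counts assumed from the fractions in the theory section
TP <- 14; FN <- 2; FP <- 4; TN <- 10
Se <- TP / (TP + FN)   # sensitivity, 14/16
Sp <- TN / (TN + FP)   # specificity, 10/14
LRp <- Se / (1 - Sp)   # LR+, the convention used in this article
LRm <- Sp / (1 - Se)   # LR-, the convention used in this article
PVp <- TP / (TP + FP)  # positive predictive value, 14/18
PVm <- TN / (TN + FN)  # negative predictive value, 10/12
round(c(Se = Se, Sp = Sp, "LR+" = LRp, "LR-" = LRm, "PV+" = PVp, "PV-" = PVm), 3)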
This table is not bad, but the prettyR package will do almost everything for us. To install a package in the default R GUI, open the Packages menu, click "Install package(s)" and type the package name.
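Alternatively, the same can be done straight from the R console:
install.packages("prettyR")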
Now we load the library. For a change, we will output the result as HTML, because in my RStudio the tables are displayed slightly incorrectly (if you know how to fix this, please write).
library(prettyR)
# cross-tabulate the test results against the ground truth and compute the statistics
test <- calculate.xtab(Test, Human, varnames=c("Test","Human","T+","T-","H+","H-"))
print(test, html=TRUE)
Let’s analyze what is written there.
Thus we obtain quantitative characteristics of our algorithm. Note LR+, which is shown in the table as the odds ratio: it is greater than 3. Also pay attention to the other parameters described above. As a rule, the main interest is in PV+ and Se, since a false alarm means extra cost, while a missed detection can lead to fatal consequences.
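A small aside: print() writes the HTML markup to the console. If you want to open it in a browser, one way (a sketch, not the only option, and the file name here is arbitrary) is to redirect the output to a file with sink():
# send console output to a file, print the HTML table, then restore normal output
sink("test_result.html")
print(test, html=TRUE)
sink()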
Binary classification, quantitative data
But what if our data are quantitative? This could be, for example, the parameter on which the previous algorithm bases its decision (say, the number of skin-colored pixels). For fun, let's look at an algorithm that blocks spammers.
You are the creator of a new social network and you are trying to fight spammers. Spammers send a large number of messages, so the simplest approach is to block a user once they exceed a certain message threshold. But how do we choose that threshold? We again take a sample of 30 users, find out whether each of them is a robot, count their messages and get:
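As before, the table itself was an image. Judging by the code below, data2.csv has a Messages column (the number of messages sent) and a Bot column (presumably 1 = robot, 0 = human). Roughly like this (made-up values, the real ones are in the sample files):
Messages,Bot
12,0
95,1
...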
A little more theory. Once a threshold is chosen, it splits the sample into 2 parts and we get a table like in the first example. Naturally, our task is to choose the best threshold. There is no single correct way to do this, because sensitivity and specificity play different roles in every real problem. There is, however, a method that helps make the decision and also evaluates the test as a whole: the ROC curve (receiver operating characteristic), originally used in radar. Let's build it in R.
First, install the ROCR package (the gtools, gplots and gdata packages will be installed along with it if you don't have them yet).
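Or simply from the console:
install.packages("ROCR")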
Load the data again.
# loading data
# don't forget to set your working directory
tab <- read.csv(file="data2.csv", header=TRUE, sep=",", dec=".")
attach(tab)
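Before building the curve, here is a small illustration of the idea from the theory above: a single threshold turns the quantitative data back into a 2x2 table like in the first example. The value 50 is arbitrary, chosen only for the sketch, and the Bot column is assumed to be coded 0/1.
# classify everyone above an (arbitrary) threshold of 50 messages as "T+"
Test50 <- factor(Messages > 50, levels=c(FALSE, TRUE), labels=c("T-","T+"))
addmargins(table(Test50, Bot))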
Now let's build the curve.
# ROC-curve
library(ROCR)
pred <- prediction(Messages, Bot)
# area under the curve calculation
auc <- slot(performance(pred, "auc"), "y.values")[[1]]
plot(performance(pred, "tpr", "fpr"), lwd=2)
lines(c(0,1), c(0,1))
text(0.6, 0.2, paste("AUC=", round(auc,4), sep=""), cex=1.4)
title("ROC Curve")
On this graph the y axis is sensitivity and the x axis is (1 - specificity). Obviously, for a good test we want to maximize both sensitivity and specificity; the only question is in what proportion. If the two are equally important, we can look for the point farthest from the bisector (the diagonal). By the way, R can make this graph more informative by adding the cutoff points.
# ROC-curve with cutoff points marked
plot(performance(pred, "tpr", "fpr"), print.cutoffs.at=c(30,40,60,81), text.adj=c(1.1,-0.5), lwd=2)
lines(c(0,1), c(0,1))
text(0.6, 0.2, paste("AUC=", round(auc,4), sep=""), cex=1.4)
title("ROC Curve")
That's much better. We can see that the points farthest from the bisector are 40 and 60. A word about the bisector and the area under the curve (AUC) that we calculated: the bisector corresponds to a useless test, i.e. a 50/50 guess. A good test should have an area under the curve greater than 0.5, that is, greater than the area under the bisector. Ideally it should exceed 0.5 by a wide margin; it certainly should not be smaller, because in that case it would be better to guess at random than to use our method.
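The choice of 40 or 60 above was made by eye. If you want R to pick the cutoff for you, one common formalisation of "farthest from the bisector" is Youden's index, i.e. maximising sensitivity + specificity - 1 (equivalently tpr - fpr). A minimal sketch using the same pred object as above:
# pull TPR, FPR and the corresponding cutoffs out of the ROCR performance object
perf <- performance(pred, "tpr", "fpr")
tpr <- unlist(slot(perf, "y.values"))
fpr <- unlist(slot(perf, "x.values"))
cutoffs <- unlist(slot(perf, "alpha.values"))
# cutoff with the largest vertical distance from the diagonal
cutoffs[which.max(tpr - fpr)]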
Summary
In this article I described how to work with binary classification in R. As you can see, situations where it applies can be found in everyday life. The main characteristics of such tests are sensitivity, specificity, the likelihood ratios and the predictive values. They are interrelated and show the effectiveness of the test from different angles. For quantitative data they can be tuned by choosing a cutoff point, and the ROC curve helps with that. The choice is made separately in each case, taking the requirements for the test into account, but usually sensitivity is the more important one.
The following articles will focus on the analysis of qualitative and quantitative data, the t-test, the chi-square test and much more.
Thanks for your attention. I hope you enjoyed it!
Sample Files