Statistical tests in R. Part 3: Quantitative data tests

This is the third article in a series on using R for statistical data analysis. It covers the presentation and testing of quantitative data. You will learn how to present the data quickly and clearly, and how to use the t-test in R.

Part 1: Binary classification
Part 2: Analysis of qualitative data

Let's go!

To get started, I want to bring back the diagram from the last article:

Paired data differ in that the measurements for the compared groups were obtained from the same objects. Typical applications of the t-test: the influence of a factor on a change in sales, application speed or device lifetime, or a comparison of the productivity of two groups of people. If there are more than two groups, ANOVA (analysis of variance) models are used instead; they will be discussed in later articles.

Presentation of data before analysis

Let's start with how to present the data. I will use a data file from a medical study that I worked with during my studies. Briefly: people from two groups, the study group and the control group, did some exercises. Their physiological parameters were measured before and after the exercise. We will analyze the pulse and the forced expiratory volume. In addition to the t-tests, I will supplement the previous article and show how to derive qualitative (categorical) data from numeric values. So:

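tab <- read.csv(file="data1.csv", header=TRUE, sep=",", dec=".")
# Pulse difference (after minus before) and FEV1 split into two categories at 2.5
tab <- cbind(tab, pulsediff=tab$pulse2-tab$pulse1,
           FEVcut2.5=cut(tab$FEV1_1, c(0,2.5,max(tab$FEV1_1, na.rm=TRUE)+0.1)))
attach(tab)  # lets the calls below refer to the columns by name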

We add a column with the difference in pulse before and after the exercise, and a column of qualitative data for the forced expiratory volume, which can then be tested with a chi-square test. The latter is produced by the cut function, which takes the data and the cutoff points. The output:
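To make the behaviour of cut concrete, here is a minimal sketch on a few made-up values (the vector fev is hypothetical and is not taken from data1.csv). By default the intervals are closed on the right, so a value of exactly 2.5 falls into the lower category.

```r
# Hypothetical forced expiratory volumes, for illustration only
fev <- c(1.8, 2.5, 2.6, 3.9, 4.4)

# Cutoffs at 0, 2.5 and just above the maximum give two intervals;
# each interval includes its upper bound and excludes its lower bound
fev_cat <- cut(fev, breaks = c(0, 2.5, max(fev) + 0.1))

# Count how many values fall into each category
print(table(fev_cat))
```

The result is a factor, so it can be passed directly to table or to a chi-square test.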

Moving on. We calculate the means and standard deviations and draw a plot.

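# Mean and standard deviation of the pulse difference in each group
mean(pulsediff[group==0], na.rm=T)
mean(pulsediff[group==1], na.rm=T)
sd(pulsediff[group==0], na.rm=T)
sd(pulsediff[group==1], na.rm=T)
# Box plot of the pulse difference for each group
boxplot(pulsediff~group,
        main="Distribution of pulse difference stratified by group",
        names=c("control-group", "exercise-group"),
        ylab="pulse difference")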

One important detail is na.rm=T. If your data contain empty cells, R will drop them itself before calculating; otherwise the result will be NA. A box plot is a very good way to visualize a sample. It shows the median, the 25% and 75% quantiles, and whiskers that extend to the most extreme values lying within 1.5 interquartile ranges of the box; points beyond the whiskers are drawn separately as outliers.
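A small sketch of both points, using a made-up vector x with one missing value (the numbers are hypothetical). The function boxplot.stats returns the five values that boxplot draws.

```r
# Hypothetical measurements with one empty cell
x <- c(4, 8, NA, 15, 16, 23, 42)

m_raw   <- mean(x)                # the missing value propagates
m_clean <- mean(x, na.rm = TRUE)  # mean of the six observed values

# Lower whisker, lower hinge, median, upper hinge, upper whisker
five <- boxplot.stats(x)$stats

print(m_raw)
print(m_clean)
print(five)
```

Note that the middle line of the box is the median, which can differ noticeably from the mean when the sample is skewed.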

Now let's talk about how paired and independent data differ statistically. In the case of independent data, the confidence interval is built around the difference between the means of the two samples, and the standard error is calculated using a special formula for that difference.

In the case of paired data, we subtract the values in pairs and get a new sample, for which we find the mean and the standard deviation. The confidence interval is then built for the mean of these differences.
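The two intervals can be sketched by hand and compared with what t.test reports. The data below are simulated and purely illustrative, and the 95% level and the Welch form of the independent-samples interval are my assumptions (they match the defaults of t.test).

```r
set.seed(1)

# Simulated paired measurements (not the article's data)
before <- rnorm(20, mean = 70, sd = 8)
after  <- before + rnorm(20, mean = 5, sd = 4)

# Paired case: one sample of pairwise differences
d <- after - before
n <- length(d)
ci_paired <- mean(d) + c(-1, 1) * qt(0.975, df = n - 1) * sd(d) / sqrt(n)

# Independent case: two separate simulated samples
x <- rnorm(25, mean = 75, sd = 6)
y <- rnorm(30, mean = 70, sd = 9)
vx <- var(x) / length(x)
vy <- var(y) / length(y)
se <- sqrt(vx + vy)  # standard error of the difference in means
# Welch-Satterthwaite approximation of the degrees of freedom
df <- (vx + vy)^2 / (vx^2 / (length(x) - 1) + vy^2 / (length(y) - 1))
ci_indep <- (mean(x) - mean(y)) + c(-1, 1) * qt(0.975, df) * se

# Compare with the intervals computed by t.test
print(all.equal(ci_paired, as.numeric(t.test(d)$conf.int)))
print(all.equal(ci_indep, as.numeric(t.test(x, y)$conf.int)))
```

In the paired case the between-subject variation cancels out in the differences, which is why the paired interval is usually narrower when the two measurements are strongly correlated.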

Applying the tests in R

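#Paired data
#approach 1: pass the vector of differences
t.test(pulsediff[group==1])
t.test(pulsediff[group==0])
#approach 2: pass both measurements and set paired=T
t.test(pulse1[group==1], pulse2[group==1], paired=T)
t.test(pulse1[group==0], pulse2[group==0], paired=T)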

Here we look at the difference between the pulse before and after the exercise within the same group. You can call t.test in two ways: either pass it the vector of differences, or pass both vectors of measurements with paired=T.


Conclusions: in group 0 there is no significant difference between before and after; in group 1 there is a difference, because the p-value is much less than 5%.
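The equivalence of the two ways of calling t.test for paired data can be checked on simulated values. The vectors pulse_before and pulse_after are hypothetical and are not the article's data.

```r
set.seed(2)

# Simulated pulse measurements for 15 people
pulse_before <- rnorm(15, mean = 72, sd = 7)
pulse_after  <- pulse_before + rnorm(15, mean = 6, sd = 5)

# Way 1: one-sample test on the differences
res_diff <- t.test(pulse_after - pulse_before)

# Way 2: two vectors with paired = TRUE
res_pair <- t.test(pulse_after, pulse_before, paired = TRUE)

# Same t statistic and same p-value
print(all.equal(unname(res_diff$statistic), unname(res_pair$statistic)))
print(all.equal(res_diff$p.value, res_pair$p.value))
```

Swapping the order of the two vectors only flips the sign of the t statistic and of the confidence interval; the two-sided p-value stays the same.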

#Unpaired data

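Here we analyze the difference between the groups. The data are independent. Strictly speaking, R uses the Welch test by default, which differs slightly from the classical t-test: it does not assume that the two groups have equal variances. The Welch test is more reliable when the variances differ, and the two tests give nearly the same result for large samples.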


Conclusions: the difference between the groups is significant, because the p-value is much less than 5%.
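A short sketch of how the default Welch test and the classical pooled-variance t-test differ, on simulated groups with deliberately unequal sizes and variances (g0 and g1 are hypothetical).

```r
set.seed(3)

# Simulated independent groups with different spread and size
g0 <- rnorm(12, mean = 2, sd = 3)
g1 <- rnorm(40, mean = 6, sd = 8)

welch  <- t.test(g1, g0)                    # default: var.equal = FALSE
pooled <- t.test(g1, g0, var.equal = TRUE)  # classical Student t-test

# The pooled test uses n1 + n2 - 2 degrees of freedom,
# the Welch test uses a smaller, usually fractional, value
print(welch$method)
print(c(welch = unname(welch$parameter), pooled = unname(pooled$parameter)))
print(c(welch = welch$p.value, pooled = pooled$p.value))
```

Setting var.equal = TRUE is only appropriate when there is good reason to believe the group variances are equal.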

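#Descriptive analysis: cross-tabulation of sex by FEV category
xtabs(~sex+FEVcut2.5, data=tab)
#Inferential analysis:
chisq.test(table(sex, FEVcut2.5), correct=F)   # chi-square without continuity correction
chisq.test(table(sex, FEVcut2.5))              # chi-square with Yates' correction
fisher.test(table(sex, FEVcut2.5))             # Fisher's exact test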

Here we compare the forced expiratory volume category between men and women.
Table (taken from the R GUI; in RStudio my tables are displayed slightly incorrectly):



Here we applied three tests in order of increasing complexity and accuracy. Once again, I recommend using Fisher's exact test, but keep the other two in mind. Conclusions: the tests gave different p-values, but each of them is still very small, so the groups differ from each other.
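The same three tests can be tried on a small made-up 2x2 table. The counts below are hypothetical and are not the article's data.

```r
# Hypothetical counts: rows are sex, columns are FEV category
counts <- matrix(c(12, 5, 3, 14), nrow = 2,
                 dimnames = list(sex = c("f", "m"), FEV = c("low", "high")))

p_plain  <- chisq.test(counts, correct = FALSE)$p.value  # no correction
p_yates  <- chisq.test(counts)$p.value                   # Yates' correction
p_fisher <- fisher.test(counts)$p.value                  # exact test

print(counts)
print(c(plain = p_plain, yates = p_yates, fisher = p_fisher))
```

The continuity correction makes the chi-square test more conservative, so its p-value is never smaller than the uncorrected one. Fisher's test does not rely on the large-sample approximation at all, which is why it is the safer choice for small tables.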


So, today we looked at examples of using the tests. This information is enough to carry out statistical studies of reasonably high quality. The methods can be applied in any field. Using them will protect you from errors, let you evaluate your work objectively and give other people reliable information. There are several more topics I want to cover: estimating the required sample size, proving the equality of random variables, and ANOVA models.
