library(ggplot2)
library(patchwork)
This work relies heavily on Montgomery, Peck, and Vining (2021). To give the authors proper credit, they are cited multiple times throughout the document.
Introduction
The goal of the present document is to propose an introduction to simple linear regression and its application in R. While many tutorials are available online on how to perform such a simple analysis, I found it quite hard to find an explanation not of how to do such analyses but of how they work. So this document is for all those who would like to know more about how stuff works.
What is a simple linear regression model?
There are four words in the term “simple linear regression model”, and each has its own importance. Here is a simple explanation for each of them, from right to left:
A model is any way to try to approach some reality using some sort of proxy. We often think about models through a mathematical point of view but models can be more diverse than that. For example, mice are often used as models to understand more broadly how the immune system of vertebrates (including humans) works.
In mathematics, the term regression indicates that we want to understand the link between one or multiple variables, which we often call the independent or explanatory variables, and another one called the dependent or response variable.
The term linear indicates that the link between the response variable and the explanatory variable(s) follows a straight line.
Finally, we call these models simple when there is only one explanatory variable.
By convention, in most textbooks and classes the explanatory variable will be noted \(x\) and the response variable will be noted \(y\). The simple linear regression model is thus written
\[ y=\beta_0+\beta_1x + \varepsilon \]
where \(\beta_0\) is the intercept of the regression, \(\beta_1\) is the slope of the regression, and \(\varepsilon\) is the statistical error, i.e., the variability in \(y\) that cannot be explained by \(x\). This error is here to take into account the fact that the relation between \(x\) and \(y\) is most of the time not perfectly linear.
Another way of writing a simple linear regression model (which I will now call SLRM - and pronounce “slurm” when I read it in my head) is
\[ y_i=\beta_0+\beta_1x_i + \varepsilon_i \]
where \(i\) indicates the \(i\)-th value for \(x\), \(y\) or \(\varepsilon\). Montgomery, Peck, and Vining (2021) define the first formula as a population model and the second one as a sample regression model. The nuance is subtle and not of much interest here, but let’s say that the sample represents, as its name suggests, a fraction of the population. Most of the time, we only have access to a sample and not the whole population, and this will thus change some of the calculations shown below.
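To make the role of the error term more concrete, here is a minimal sketch (with arbitrary, made-up coefficients, unrelated to the data used later) showing how data could be simulated from such a model in R:
# Hypothetical illustration: simulate 50 observations from a SLRM with
# (arbitrary) intercept 2, slope -3, and normally distributed errors.
set.seed(42)
n_sim = 50
x_sim = runif(n_sim, min = 0, max = 10)
eps_sim = rnorm(n_sim, mean = 0, sd = 1)  # the statistical error
y_sim = 2 - 3 * x_sim + eps_sim           # y_i = beta0 + beta1 * x_i + error_i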
Complete example
Data
For the rest of this document, we will work on one particular example using the “rocket propellant data” proposed on page 15 of Montgomery, Peck, and Vining (2021). Of course, the outcomes and results presented here should work just as well with any other data set, including but not limited to those included in the different R packages you may find. The description of the data set is reproduced below.
A rocket motor is manufactured by bonding an igniter propellant and a sustainer propellant together inside a metal housing. The shear strength of the bond between the two types of propellant is an important quality characteristic. It is suspected that shear strength is related to the age in weeks of the batch of sustainer propellant.
In this example, the explanatory variable is the age of the sustainer propellant. This age, in weeks, has been measured twenty times and the values are reported below in the x
variable.
x = c(15.50, 23.75, 8.00, 17.00, 5.50, 19.00, 24.00, 2.50, 7.50, 11.00,
      13.00, 3.75, 25.00, 9.75, 22.00, 18.00, 6.00, 12.50, 2.00, 21.50)
On the other hand, the response variable is the shear strength, measured in psi. For each propellant age, a corresponding shear strength has been measured, so we also have twenty values, reported below in the y
variable.
y = c(2158.70, 1678.15, 2316.00, 2061.30, 2207.50, 1708.30, 1784.70, 2575.00, 2357.90, 2256.70,
      2165.20, 2399.55, 1779.80, 2336.75, 1765.30, 2053.50, 2414.40, 2200.50, 2654.20, 1753.70)
A simple representation of the data can be found in Figure 1. A quick look already suggests what we will demonstrate later: there appears to be a negative linear relation between the shear strength and the age of the propellant. In other words, it seems plausible that \(\beta_1<0\) (a quick numerical check follows the figure code).
Code
ggplot() +
geom_point(aes(x = x, y = y)) +
labs(x = "Age of propellant (weeks)", y = "Shear strength (psi)") +
theme_minimal()
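A quick numerical check of this visual impression (just a sanity check, not part of the regression itself) is to compute the correlation between the two variables; a value close to -1 would support a strong negative linear association.
# Pearson correlation between propellant age and shear strength; a value
# close to -1 indicates a strong negative linear relation.
cor(x, y)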
SLRM using R
In this document, I will try to show how things work with a detailed step-by-step approach, but please note that what I show in multiple code lines can most of the time be done in a single line without loss of readability.
Let’s start by defining the formula of the SLRM we want to study. In R, it can be stored as a formula
object where we link the y
variable and the x
variable with the use of a ~
character.
slrm <- y ~ x
cat(class(slrm))
formula
The simplest way, in R, to test the regression of \(y\) by \(x\), is to simply use the lm()
function, as shown below.
mod <- lm(formula = slrm)
Then, the summary()
function can be used to produce a “human-friendly” presentation of the results of the regression.
summary(mod)
Call:
lm(formula = slrm)
Residuals:
Min 1Q Median 3Q Max
-215.98 -50.68 28.74 66.61 106.76
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2627.822 44.184 59.48 < 2e-16 ***
x -37.154 2.889 -12.86 1.64e-10 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 96.11 on 18 degrees of freedom
Multiple R-squared: 0.9018, Adjusted R-squared: 0.8964
F-statistic: 165.4 on 1 and 18 DF, p-value: 1.643e-10
There are a lot of lines in this output so let’s review them. The first two lines indicate
Call:
lm(formula = slrm)
They show the formula used in the model. Here, it is not really informative, for two reasons. The first one is that we know what the formula is, since we are the ones who wrote it. But note that it can be useful if, for example, you are reading some outputs without having access to the code which produced them. The second reason why it is not informative here is that we stored the formula in the slrm
object, thus masking what the formula really is.
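For instance (a small variation, with a hypothetical object name, not used in the rest of the document), writing the formula directly inside lm() would make the Call line self-explanatory:
# Passing the formula literally: the Call line of the summary would then
# read "lm(formula = y ~ x)" instead of "lm(formula = slrm)".
mod_direct <- lm(y ~ x)
summary(mod_direct)$call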
Then we have
Residuals:
Min 1Q Median 3Q Max
-215.98 -50.68 28.74 66.61 106.76
These lines give some information about the residuals of the regression. The residuals are the difference between the observed values \(y_i\) and the expected values \(\hat y_i\). I will come back to these residuals later. The value under Min
indicates the lowest value among the residuals, the 1Q
value indicates the first quartile of the distribution of the residuals, the Median
value indicates the median value of the residuals, the 3Q
value indicates the third quartile of the distribution of the residuals and the Max
value indicates the highest value among the residuals.
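Note that these five values (plus the mean of the residuals) can also be obtained in a single call, which can serve as a sanity check for the manual calculations done later:
# Five-number summary of the residuals (Min, 1st Qu., Median, 3rd Qu., Max),
# plus their mean, directly from the fitted model.
summary(residuals(mod))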
The following lines are certainly the most interesting for anyone trying to analyze any data set using a SLRM. We have
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2627.822 44.184 59.48 < 2e-16 ***
x -37.154 2.889 -12.86 1.64e-10 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
This table gives different information about the estimation of the coefficients of the SLRM. Once again, we will come back to it later. The first line, starting with (Intercept)
, gives us the information about the intercept of the regression, while the second line, starting with x
, gives us the information about the slope (i.e., the coefficient associated with x
). The first column, Estimate
, gives us the estimated value for each coefficient; the second, Std. Error
, gives us the standard errors associated with these estimates; the third one, t value
, gives us the value of the \(t\) statistics calculated for these estimates; and finally the last column, Pr(>|t|)
gives us the \(p\)-value associated with the \(t\) statistics. The asterisks shown after this last column refer to the bottom line (starting with Signif. codes:
). They are used to give visual information about the “range” of the \(p\)-values: if a \(p\)-value is between 0 and 0.001 there are three asterisks; between 0.001 and 0.01, two asterisks; etc. This may be useful when making decisions regarding the results of your analyses (rejecting or not your null hypothesis).
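For reference, the whole coefficients table can also be extracted as a numeric matrix; this is the object we will compare our manual calculations against later:
# The coefficients table as a 2x4 matrix: one row per coefficient,
# columns Estimate, Std. Error, t value and Pr(>|t|).
summary(mod)$coefficients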
Finally, a bunch of information is gathered in the last three lines:
Residual standard error: 96.11 on 18 degrees of freedom
Multiple R-squared: 0.9018, Adjusted R-squared: 0.8964
F-statistic: 165.4 on 1 and 18 DF, p-value: 1.643e-10
The first of these lines gives the standard error of the residuals with the associated degrees of freedom. The second line shows two coefficients of determination associated with the regression. Finally, the last line reports the results of an analysis of variance applied to the model.
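Each of these quantities can also be extracted individually from the summary object, which is how we will check our manual results below:
# Individual components of the summary object.
summary(mod)$sigma          # residual standard error
summary(mod)$r.squared      # multiple R-squared
summary(mod)$adj.r.squared  # adjusted R-squared
summary(mod)$fstatistic     # F-statistic with its two degrees of freedom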
Les mains dans le cambouis1
The goal of this part is now to show you how to reproduce all the results R gave you automatically when you used the lm()
then the summary()
functions.
Coefficients
Estimates
First, we will look at how the values in the coefficients table were obtained, and more precisely, how the estimates are calculated. The first estimate that we can calculate is the slope. By convention, in statistics, estimates are written with a hat symbol on them. The estimate of the slope, which is the estimate of the coefficient associated with \(x\), is thus \(\hat\beta_1\). According to equation (2.11) in Montgomery, Peck, and Vining (2021), the least-squares estimator2 of the slope is
\[ \hat\beta_1=\frac{S_{xy}}{S_{xx}} \]
where \(S_{xx}\) is the corrected sum of squares of the \(x_i\) and \(S_{xy}\) is the corrected sum of cross products of \(x_i\) and \(y_i\).
According to equation (2.9) in Montgomery, Peck, and Vining (2021), we have
\[ S_{xx}=\sum_{i=1}^n\left(x_i-\bar x\right)^2 \]
and, according to equation (2.10), we have
\[ S_{xy}=\sum_{i=1}^ny_i\left(x_i-\bar x\right) \]
where \(n\) is the number of observations (twenty in our example), and \(\bar x\) is the mean of \(x_i\). This mean value can be calculated with the formula
\[ \bar x=\frac{1}{n}\sum_{i=1}^nx_i \]
Let’s calculate all these values using R’s most basic functions. The number of observations can simply be obtained using the length()
function on either the x
or y
object.
n = length(y)
Now, we calculate the mean of \(x\).
xbar = 1 / n * sum(x)
Note that this value can be obtained more directly using the mean()
function.
xbar = mean(x)
We can now calculate the value of \(S_{xx}\)…
Sxx = sum((x - xbar) ** 2)
…and the value of \(S_{xy}\).
Sxy = sum(y * (x - mean(x)))
Note that these values can be obtained more directly using the var() and cov() functions. These functions are slightly tricky here, though: they compute the unbiased estimators of the variance and covariance, i.e., they divide the sums of squares and cross products by \(n-1\). This means that we need to multiply the values given by var() and cov() by \(n-1\) to recover \(S_{xx}\) and \(S_{xy}\).
Sxx = var(x) * (n - 1)
Sxy = cov(x, y) * (n - 1)
Now that we have all that we need, we can calculate the value of \(\hat\beta_1\).
beta1hat = Sxy / Sxx
If we compare the value stored in the beta1hat
object and the value given with the summary()
function, we find that they are, as expected, equal.
cat(summary(mod)$coefficients[2, 1])
-37.15359
cat(beta1hat)
-37.15359
Now that we have the value of \(\hat\beta_1\), it is easy to obtain the value of \(\hat\beta_0\). According to equation (2.6) in Montgomery, Peck, and Vining (2021), we have
\[ \hat\beta_0=\bar y-\hat\beta_1\bar x \]
where \(\bar y\) is the mean of \(y_i\). This mean value can be calculated with the formula
\[ \bar y=\frac{1}{n}\sum_{i=1}^ny_i \]
So, before calculating the value of \(\hat\beta_0\), we need to calculate the value of \(\bar y\).
ybar = 1 / n * sum(y)
Note that once again this value can be obtained more directly using the mean()
function.
ybar = mean(y)
Now we can calculate the value of \(\hat\beta_0\).
beta0hat = ybar - beta1hat * xbar
For this coefficient too, if we compare the value stored in the beta0hat
object and the value given with the summary()
function, we find that they are equal.
cat(summary(mod)$coefficients[1, 1])
2627.822
cat(beta0hat)
2627.822
And voilà! We are all done for the values of the estimates. Now, we can continue with the second column of the coefficients table: the standard errors.
Standard errors
We will first calculate the value of the standard error for the slope. According to equation (2.14) in Montgomery, Peck, and Vining (2021), the variance of \(\hat\beta_1\) is
\[ \text{var}\left(\hat\beta_1\right)=\frac{\sigma^2}{S_{xx}} \]
where \(\sigma^2\) is the variance of \(y_i\). Unfortunately, we rarely have access to the true value of \(\sigma^2\) and we must thus calculate an estimated value for it. According to equation (2.19) in Montgomery, Peck, and Vining (2021), an unbiased estimate of \(\sigma^2\), \(\hat\sigma^2\), is
\[ \hat\sigma^2 =\frac{SS_{Res}}{n-2} \]
where \(SS_{Res}\) is the residual sum of squares and \(n-2\) is the number of degrees of freedom. In a SLRM, the number of degrees of freedom is \(n-2\) because two coefficients are estimated in our model: \(\hat\beta_0\) and \(\hat\beta_1\). Note that \(\hat\sigma^2\) is also called the residual mean square, noted \(MS_{Res}\). According to equation (2.16) in Montgomery, Peck, and Vining (2021), we have
\[ SS_{Res}=\sum_{i=1}^n\left(y_i-\hat y_i\right)^2 \]
where \(\hat y_i\) is the \(i\)-th estimated value of \(y_i\). The values of \(\hat y_i\) can be simply obtained using the values of \(\hat\beta_0\), \(\hat\beta_1\) and \(x_i\). We have
\[ \hat y_i=\hat\beta_0+\hat\beta_1x_i \]
We must thus calculate the values of \(\hat y_i\) in R.
yhat = beta0hat + beta1hat * x
Note that the values of \(\hat y_i\) can also be obtained using the predict()
function on the mod
object.
yhat = predict(mod)
Now we can calculate the value of \(SS_{Res}\).
SSRes = sum((y - yhat) ** 2)
From that, we can obtain the value of \(\hat\sigma^2\).
sigma2hat = SSRes / (n - 2)
And now, the value of \(\text{var}\left(\hat\beta_1\right)\).
varbeta1hat = sigma2hat / Sxx
Finally, to obtain the standard error reported in the coefficients table, we just need to obtain the square root of \(\text{var}\left(\hat\beta_1\right)\) using the sqrt()
function.
sebeta1hat = sqrt(varbeta1hat)
If we compare the value stored in the sebeta1hat
object and the value given with the summary()
function, we find once again that they are equal.
cat(sebeta1hat)
2.889107
cat(summary(mod)$coefficients[2, 2])
2.889107
Now let’s calculate the standard error of \(\hat\beta_0\). According to equation (2.15) from Montgomery, Peck, and Vining (2021), the variance of \(\hat\beta_0\) is
\[ \text{var}\left(\hat\beta_0\right)=\sigma^2\left(\frac{1}{n}+\frac{\bar x^2}{S_{xx}}\right) \]
We will once again replace \(\sigma^2\) by \(\hat\sigma^2\) as the former is not available. We thus have everything we need ready and can calculate the value of \(\text{var}\left(\hat\beta_0\right)\).
varbeta0hat = sigma2hat * (1 / n + xbar ** 2 / Sxx)
And we simply obtain the standard error of \(\hat\beta_0\) by using the sqrt()
function over the varbeta0hat
object.
sebeta0hat = sqrt(varbeta0hat)
We can now compare the value stored in the sebeta0hat
object and the value given with the summary()
function and see that they are equal.
cat(sebeta0hat)
44.18391
cat(summary(mod)$coefficients[1, 2])
44.18391
\(t\)-values
We’re progressing. Now, the third column of the coefficients table contains the \(t\)-values. They are used to test whether the coefficients are significantly different from zero, but in reality a \(t\) test could be used to compare the estimated value of a coefficient with any theoretical value (we will sketch such a test at the end of this subsection). According to equation (2.24) in Montgomery, Peck, and Vining (2021), the value of the \(t\) statistic for coefficient \(\hat\beta_1\) is
\[ t_0=\frac{\hat\beta_1-\beta_{10}}{\sqrt{MS_{Res}/S_{xx}}} \]
where \(\beta_{10}=0\) in the calculations used in the results of the summary()
function. Also, as a reminder, \(MS_{Res}=\hat\sigma^2\), so we can see that the denominator of the formula is simply the standard error of \(\hat\beta_1\).
We can thus simply calculate the value of \(t_0\) using the values stored in beta1hat
and sebeta1hat
.
t0beta1hat = beta1hat / sebeta1hat
When we compare the value in t0beta1hat
with the value in the coefficients table, once again, they match.
cat(t0beta1hat)
-12.85989
cat(summary(mod)$coefficients[2, 3])
-12.85989
The value of \(t_0\) for \(\hat\beta_0\) is obtained the same way.
t0beta0hat = beta0hat / sebeta0hat
We compare the values here too to find that they are equal.
cat(t0beta0hat)
59.47464
cat(summary(mod)$coefficients[1, 3])
59.47464
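As an aside, and to come back to the remark made at the beginning of this subsection, the same statistic can be used to test the slope against any theoretical value. Here is a small sketch, with a purely arbitrary hypothesized value \(\beta_{10}=-30\):
# Hypothetical test of H0: beta1 = -30 (the value -30 is arbitrary).
beta10 = -30
t0_alt = (beta1hat - beta10) / sebeta1hat
cat(t0_alt)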
\(p\)-values
The \(p\)-values are obtained using the \(t_0\) values computed above, which, under the null hypothesis, follow a \(t\) distribution with \(n-2\) degrees of freedom. We use the pt()
function, which gives the cumulative distribution function of the \(t\) distribution, to turn these values into \(p\)-values.
First, we calculate the \(p\)-value for coefficient \(\hat\beta_1\).
pbeta1hat = 2 * pt(- abs(t0beta1hat), df = n - 2)
We see that it is the same value as the one shown in the coefficients table.
cat(pbeta1hat)
1.643344e-10
cat(summary(mod)$coefficients[2, 4])
1.643344e-10
Now, we calculate the \(p\)-value for coefficient \(\hat\beta_0\).
pbeta0hat = 2 * pt(- abs(t0beta0hat), df = n - 2)
Now, it is a bit trickier than previously. If you scroll back to the global output of summary(mod)
shown above, you’ll see that the \(p\)-value for coefficient \(\hat\beta_0\) is < 2e-16
. But if you print the value directly, you find another value, which is the same as the one we obtained manually.
cat(pbeta0hat)
4.063559e-22
cat(summary(mod)$coefficients[1, 4])
4.063559e-22
But why does R print < 2e-16
instead of the actual value? This is a bit off-topic but it is still good to know. Like any computer program, R is inherently limited in the precision with which it can handle floating-point numbers (i.e., decimal numbers). R stores a variable, called .Machine
, “holding information on the numerical characteristics of the machine [it] is running on” (quote from the .Machine
documentation). One of these “characteristics”, called double.eps
, is the “smallest positive floating-point number \(x\) such that \(1+x\neq1\)”. Let’s see what is the value of this \(x\) number.
cat(.Machine$double.eps)
2.220446e-16
We see that this value is close to the < 2e-16
given by the summary()
function. Broadly speaking, below this double.eps
threshold, adding a value to 1 no longer changes the result, so R treats such values as indistinguishable from zero in this sense. We can check it easily.
cat(1 + pbeta0hat != 1)
FALSE
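Incidentally, base R provides a helper, format.pval(), that implements this kind of display: any \(p\)-value smaller than a given eps is shown as < eps rather than as its stored value, which is, to the best of my knowledge, how the < 2e-16 in the coefficients table is produced.
# format.pval() replaces values smaller than eps with "< eps"; this matches
# the "< 2e-16" shown in the coefficients table.
format.pval(pbeta0hat, eps = .Machine$double.eps)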
Residuals
Now that we know the values of \(\hat\beta_0\) and \(\hat\beta_1\), it is possible to calculate the values of the residuals. The residuals, sometimes noted \(e_i\), correspond to the difference between the observed values of the response variable, \(y_i\), and the estimated values, \(\hat y_i\), as shown in equation (2.12) from Montgomery, Peck, and Vining (2021). We have
\[ e_i=y_i-\hat y_i \]
It can easily be obtained in R with the values we have already calculated.
e = y - yhat
The residuals can also be easily obtained using the residuals()
function over the mod
object.
e = residuals(mod)
As a reminder, the summary contained some information about the residuals, mostly in its first lines. The first piece of information is the minimal value of the residuals. Before accessing it, let’s have a look at the residuals. In the following code chunk, we sort (from lowest to highest) and print their values.
esort = sort(e)
cat(round(esort, 2))
-215.98 -213.6 -88.95 -75.32 -67.27 -45.14 -14.59 8.73 9.5 20.37 37.1 37.57 40.06 48.56 65.09 71.18 80.82 94.44 100.68 106.76
We immediately see that the lowest value of \(e_i\) is -215.98, as we saw in the summary of the SLRM. We can also access this value using the min()
function.
minres = min(e)
cat(round(minres, 2))
-215.98
The second value is the first quartile or 25th percentile. The easiest way to obtain it is to use the quantile()
function with probs = .25
.
Q1res = quantile(e, probs = .25)
cat(round(Q1res, 2))
-50.68
Next, we have the median. We can access it using the median()
function.
medres = median(e)
cat(round(medres, 2))
28.74
Note that we can also access it using the quantile()
function with probs = .5
.
medres = quantile(e, probs = .5)
cat(round(medres, 2))
28.74
Then we have the third quartile or 75th percentile. The easiest way to obtain it is to use the quantile()
function with probs = .75
.
Q3res = quantile(e, probs = .75)
cat(round(Q3res, 2))
66.61
Finally, we have the maximal value of the residuals. Looking at the esort
object produced before, we can manually find that the value is 106.76. We can also obtain it using the max()
function.
maxres = max(e)
cat(round(maxres, 2))
106.76
And that’s everything for the top of the summary. Now, let’s have a look at the first line of the three at the bottom of the summary. The residual standard error, also called the standard error of the regression, is simply the square root of the residual mean square \(\hat\sigma^2\).
sigma = sqrt(sigma2hat)
We can check that we obtain the same value as the one given in the summary.
cat(sigma)
96.10609
cat(summary(mod)$sigma)
96.10609
Note that the \(n-2=18\) degrees of freedom given on the same line are the ones used before, notably to calculate the value of \(\hat\sigma^2\).
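These degrees of freedom can also be retrieved directly from the model object:
# Residual degrees of freedom of the fitted model (n - 2 = 18 here).
df.residual(mod)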
Coefficient of determination
On the second of the last three lines, we see two coefficients of determination. A coefficient of determination represents the proportion of the variation in \(y\) explained by the regressor \(x\). According to equation (2.47) in Montgomery, Peck, and Vining (2021), the coefficient of determination in a SLRM is
\[ R^2=1-\frac{SS_{Res}}{SS_T} \]
where \(SS_T\) is the total variability in the observations. We have
\[ SS_T=\sum_{i=1}^n\left(y_i-\bar y\right)^2 \]
We can easily calculate it in R.
SST = sum((y - ybar) ** 2)
You can see on Figure 2 a graphical explanation of how the values of \(SS_T\) and \(SS_{Res}\) are obtained.
Code
SST = sum((y - ybar) ** 2)
ggplot() +
geom_hline(yintercept = ybar, linetype = "dashed") +
geom_segment(aes(x = x, y = y, yend = mean(y)), color = "blue") +
geom_point(aes(x = x, y = y)) +
geom_point(aes(x = x, y = ybar), color = "blue") +
labs(x = "Age of propellant (weeks)", y = "Shear strength (psi)") +
annotate("text", x = 17.5, y = 2500, label = bquote(SS[T]==.(SST)), color = "blue") +
theme_minimal() -> p1
SSres = sum((y - yhat) ** 2)
ggplot() +
geom_abline(slope = coef(mod)[2],
intercept = coef(mod)[1], linetype = "dashed") +
labs(x = "Age of propellant (weeks)", y = "Shear strength (psi)") +
geom_segment(aes(x = x, y = y, yend = predict(mod)), color = "red") +
geom_point(aes(x = x, y = y)) +
geom_point(aes(x = x, y = yhat), color = "red") +
annotate("text", x = 17.5, y = 2500, label = bquote(SS[Res]==.(SSRes)), color = "red") +
theme_minimal() -> p2
p1 + p2 + plot_layout(axis_titles = "collect") + plot_annotation(tag_levels = 'A')
We can now obtain the value of \(R^2\).
R2 = 1 - SSRes / SST
We can see that this value matches the “multiple R-squared” in the summary.
cat(R2)
0.9018414
cat(summary(mod)$r.squared)
0.9018414
Now, how is the “adjusted R-squared” obtained? And, first of all, what is it? Well, the adjusted R-squared, which we will call \(R^2_{Adj}\), is more useful in multiple linear regression models than in a SLRM. It can be used to compare nested models and decide whether adding a new variable to the model is really useful. Without going into details, the value of \(R^2\) will necessarily increase when a new variable is added, while the value of \(R^2_{Adj}\) will only increase if adding the new variable reduces the residual mean square, \(\hat\sigma^2\) (we will illustrate this below). According to equation (3.27) in Montgomery, Peck, and Vining (2021), we have
\[ R^2_{Adj}=1-\frac{SS_{Res}/\left(n-p\right)}{SS_T/\left(n-1\right)} \]
where \(p\) is the number of coefficients estimated in the model. In a SLRM, we have \(p=2\) (except in the particular case where the intercept is forced to zero).
p = 2
We can now calculate the value of \(R^2_{Adj}\).
R2Adj = 1 - (SSres / (n - p)) / (SST / (n - 1))
And we find that this value is the same as the one reported in the summary.
cat(R2Adj)
0.8963882
cat(summary(mod)$adj.r.squared)
0.8963882
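To illustrate the difference between the two coefficients, here is a small sketch in which a purely hypothetical noise variable (unrelated to \(y\)) is added to the model: the multiple R-squared can only increase, while the adjusted R-squared typically does not.
# Hypothetical example: add a regressor made of pure noise to the model.
set.seed(123)                     # arbitrary seed, for reproducibility
noise = rnorm(n)                  # a variable unrelated to y
mod_noise = lm(y ~ x + noise)
summary(mod_noise)$r.squared      # never smaller than summary(mod)$r.squared
summary(mod_noise)$adj.r.squared  # may well be smaller than summary(mod)$adj.r.squared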
Analysis of variance
Finally, the last line of the summary is the result of an analysis of variance with the \(F\)-statistic, the associated degrees of freedom, and the \(p\)-value. According to equation (2.36) in Montgomery, Peck, and Vining (2021), the value of the \(F\)-statistic, noted \(F_0\), is
\[ F_0=\frac{SS_R/df_R}{SS_{Res}/df_{Res}} \]
where \(df_{Res}=n-2\) and thus \(SS_{Res}/df_{Res}=\hat\sigma^2\), \(SS_R\) is the regression sum of squares (i.e., the variation explained by the regression) and \(df_R=1\) in SLRM is the associated degree of freedom. According to equation (2.34) in Montgomery, Peck, and Vining (2021), we have
\[ SS_R = \hat\beta_1S_{xy} \]
but we can simply note that \(SS_R=SS_T-SS_{Res}\). We see that \(df_R=1\) (first degree of freedom reported in the summary) because \(SS_R\) is determined solely by \(\hat\beta_1\) while \(df_{Res}=n-2\) (second degree of freedom reported in the summary) because both \(\hat\beta_0\) and \(\hat\beta_1\) are used in calculating the values of \(y_i-\hat y_i\) (see Montgomery, Peck, and Vining (2021) for more details).
We fix the values of \(df_R\) and \(df_{Res}\).
dfR = 1
dfRes = n - 2
We calculate the value of \(SS_R\).
SSR = beta1hat * Sxy
Finally, we can calculate the value of \(F_0\).
F0 = (SSR / dfR) / (SSRes / dfRes)
Once more, if we compare the value obtained manually with the value shown in the summary, we find out that they are the same.
cat(F0)
165.3768
cat(summary(mod)$fstatistic[1])
165.3768
The value of \(F_0\) follows the \(F_{df_R,df_{Res}}\) distribution. From that, we can easily find the associated \(p\)-value using the pf()
function.
pval = 1 - pf(F0, dfR, dfRes)
The value is the same as the one shown in the summary (the summary object does not actually store this \(p\)-value, so we cannot extract and print it directly). Note that this \(p\)-value is the same as the one found for the \(t\)-test of coefficient \(\hat\beta_1\).
cat(pval)
1.643343e-10
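Note that R can also produce this analysis of variance directly with the anova() function, which returns the same \(F\)-statistic, degrees of freedom, and \(p\)-value in a single table:
# ANOVA table of the regression: Df, Sum Sq, Mean Sq, F value and Pr(>F)
# for the x term and for the residuals.
anova(mod)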
Conclusion
In this document, I tried to explain how the different outputs given by R when working on a simple linear regression model are mathematically obtained. The goal was not to demonstrate in detail how the different formulas used are constructed. Those interested should refer to Montgomery, Peck, and Vining (2021), which I heavily used to produce this work.
Session info
sessionInfo()
R version 4.4.3 (2025-02-28 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 10 x64 (build 19045)
Matrix products: default
locale:
[1] LC_COLLATE=French_France.utf8 LC_CTYPE=French_France.utf8
[3] LC_MONETARY=French_France.utf8 LC_NUMERIC=C
[5] LC_TIME=French_France.utf8
time zone: Europe/Paris
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] patchwork_1.3.0 ggplot2_3.5.1
loaded via a namespace (and not attached):
[1] vctrs_0.6.5 cli_3.6.3 knitr_1.50 rlang_1.1.4
[5] xfun_0.51 generics_0.1.3 jsonlite_1.9.1 labeling_0.4.3
[9] glue_1.7.0 colorspace_2.1-1 htmltools_0.5.8.1 scales_1.3.0
[13] rmarkdown_2.29 grid_4.4.3 evaluate_1.0.3 munsell_0.5.1
[17] tibble_3.2.1 fastmap_1.2.0 yaml_2.3.10 lifecycle_1.0.4
[21] compiler_4.4.3 dplyr_1.1.4 htmlwidgets_1.6.4 pkgconfig_2.0.3
[25] rstudioapi_0.17.1 farver_2.1.2 digest_0.6.37 R6_2.6.1
[29] tidyselect_1.2.1 pillar_1.10.1 magrittr_2.0.3 withr_3.0.2
[33] tools_4.4.3 gtable_0.3.6
References
Footnotes
Les mains dans le cambouis is a French expression meaning that you are going deep inside how something works; going beyond just using something but really understanding how it is built.↩︎
In linear regression models, the estimators considered as “best” are those minimizing the sum of the squares of the residuals.↩︎