Using This Document

This document was set up to provide you with an easy and reliable resource for conducting regression analyses in R. It can, and should, be an evolving document; as you notice errors, inconsistencies, or the absence of a tool you think is important, please let your instructor, your TA, or Danny Forster (forster[dot]danny[at]gmail.com) know. We also want to make it as easy to understand as possible, so if you find anything unclear, or can suggest a better way to present information or accomplish a task, please contribute!

Also, note that R packages are constantly being updated, which can influence how functions are called from various packages. This means that much of the syntax presented here will inevitably go out of date. Unless contacted, I will not be keeping this document up to date.
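If something here stops working, a quick first check is which package versions you have installed. Base R can report this:

packageVersion("sjPlot") # version of a single installed package
sessionInfo() # your R version plus the versions of all loaded packages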

Setting up your R script

First, load all of your packages at the beginning of your script. If you don’t have a package installed yet, use install.packages("packageName").

install.packages("ggplot2") # This installs only one package
install.packages(c("knitr", "ggplot2", "foreign", "stargazer", "lmSupport", "sjPlot", "BaylorEdPsych", "lavaan", "semPlot", "gvlma", "car")) # This installs many packages at once

Once your packages are installed, just use library("packageName").

library(ggplot2) # this package is good for making plots
library(foreign) # this allows you to read in spss data files
library(stargazer) # this creates nice descriptive tables
library(lmSupport) # this has nice features for comparing models
library(sjPlot) # This makes pretty plots for various models
library(BaylorEdPsych) # this has resources for logistic regression
library(lavaan) # this can fit structural equation models
library(semPlot) # this can create path diagrams from structural equation models
library(gvlma) # assesses global linear model assumptions
library(car) # provides vif() and durbinWatsonTest() used in the diagnostics below
sjp.setTheme(base = theme_bw()) # This changes the default to something I like...

Next, set your working directory. For any project, you should always use a specific filepath on your computer. If you work across multiple computers, use cloud storage and take note of which working directory belongs to which machine. For example:

setwd("/Users/dannyforster/Dropbox/Dissertation/Analyses/") # home computer
setwd("/Users/dforster/Dropbox/Dissertation/Analyses/") # office computer

Data Management with R

After you set your working directory, any file in that folder can be accessed by referencing only the file name. For example:

dissData <- read.csv("dissertationData.csv", header = T) # if your data is saved as a .csv
dissData <- read.table("dissertationData.txt", header = T) # if your data is saved as a tab delimited .txt
dissData <- read.spss("dissertationData.sav", to.data.frame = T, use.value.labels = F) # if your data is saved as a .sav (SPSS file; requires the 'foreign' package)

Checking your data

For this example, we will use the continuous moderator data.

modData <- read.spss("continuous moderation example from Cohen et al.sav", to.data.frame = T)
## Warning in read.spss("continuous moderation example from Cohen et
## al.sav", : continuous moderation example from Cohen et al.sav: Unrecognized
## record type 7, subtype 18 encountered in system file
str(modData) # This tells you the structure of your data (whether your variables are numeric, factor, etc.)
## 'data.frame':    245 obs. of  11 variables:
##  $ id      : num  1 2 3 4 5 6 7 8 9 10 ...
##  $ age     : num  60 40 29 47 48 42 55 43 39 51 ...
##  $ yearsexe: num  10 9 2 10 9 6 8 19 9 14 ...
##  $ enduranc: num  18 36 51 18 23 30 8 40 28 15 ...
##  $ cage    : num  10.82 -9.18 -20.18 -2.18 -1.18 ...
##  $ cyearsex: num  -0.673 -1.673 -8.673 -0.673 -1.673 ...
##  $ inter   : num  -7.28 15.37 175.06 1.47 1.98 ...
##  $ hiexer  : num  -5.45 -6.45 -13.45 -5.45 -6.45 ...
##  $ interhi : num  -58.93 59.22 271.43 11.9 7.63 ...
##  $ loexer  : num  4.1 3.1 -3.9 4.1 3.1 ...
##  $ lointer : num  44.37 -28.49 78.67 -8.96 -3.67 ...
##  - attr(*, "codepage")= int 1252
head(modData) # This shows the first 6 rows of your data
##   id age yearsexe enduranc       cage   cyearsex      inter   hiexer
## 1  1  60       10       18  10.816327 -0.6734694  -7.284465  -5.4482
## 2  2  40        9       36  -9.183673 -1.6734694  15.368596  -6.4482
## 3  3  29        2       51 -20.183673 -8.6734694 175.062474 -13.4482
## 4  4  47       10       18  -2.183673 -0.6734694   1.470637  -5.4482
## 5  5  48        9       23  -1.183673 -1.6734694   1.980841  -6.4482
## 6  6  42        6       30  -7.183673 -4.6734694  33.572678  -9.4482
##      interhi  loexer     lointer
## 1 -58.929510  4.1022  44.3707347
## 2  59.218163  3.1022 -28.4895918
## 3 271.434078 -3.8978  78.6719224
## 4  11.897090  4.1022  -8.9578653
## 5   7.632563  3.1022  -3.6719918
## 6  67.872784  0.1022  -0.7341714
summary(modData) # This shows basic descriptives of your data
##        id           age           yearsexe        enduranc
##  Min.   :  1   Min.   :20.00   Min.   : 0.00   Min.   : 0.00
##  1st Qu.: 63   1st Qu.:43.00   1st Qu.: 7.00   1st Qu.:19.00
##  Median :124   Median :48.00   Median :11.00   Median :27.00
##  Mean   :125   Mean   :49.18   Mean   :10.67   Mean   :26.53
##  3rd Qu.:187   3rd Qu.:56.00   3rd Qu.:14.00   3rd Qu.:33.00
##  Max.   :250   Max.   :82.00   Max.   :26.00   Max.   :55.00
##       cage            cyearsex            inter              hiexer
##  Min.   :-29.184   Min.   :-10.6735   Min.   :-131.570   Min.   :-15.448
##  1st Qu.: -6.184   1st Qu.: -3.6735   1st Qu.:  -7.264   1st Qu.: -8.448
##  Median : -1.184   Median :  0.3265   Median :   4.511   Median : -4.448
##  Mean   :  0.000   Mean   :  0.0000   Mean   :  13.587   Mean   : -4.775
##  3rd Qu.:  6.816   3rd Qu.:  3.3265   3rd Qu.:  32.103   3rd Qu.: -1.448
##  Max.   : 32.816   Max.   : 15.3265   Max.   : 175.062   Max.   : 10.552
##     interhi             loexer          lointer
##  Min.   :-165.252   Min.   :-5.898   Min.   :-175.428
##  1st Qu.: -20.792   1st Qu.: 1.102   1st Qu.: -16.244
##  Median :   7.507   Median : 5.102   Median :   2.002
##  Mean   :  13.587   Mean   : 4.776   Mean   :  13.587
##  3rd Qu.:  36.403   3rd Qu.: 8.102   3rd Qu.:  37.453
##  Max.   : 275.733   Max.   :20.102   Max.   : 298.701
stargazer(modData[, 2:10], type = "text") # modData[, 2:10] takes only the 2nd-10th columns (i.e., excludes id)
##
## ==============================================
## Statistic  N   Mean  St. Dev.   Min      Max
## ----------------------------------------------
## age       245 49.184  10.107     20      82
## yearsexe  245 10.673  4.775      0       26
## enduranc  245 26.531  10.819     0       55
## cage      245 0.000   10.107  -29.184  32.816
## cyearsex  245 0.000   4.775   -10.673  15.327
## inter     245 13.587  46.012  -131.570 175.062
## hiexer    245 -4.775  4.775   -15.448  10.552
## interhi   245 13.587  67.131  -165.252 275.733
## loexer    245 4.776   4.775    -5.898  20.102
## ----------------------------------------------

Regression analysis

First, create a model with enduranc regressed on cyearsex and cage.

m1 <- lm(enduranc ~ cyearsex + cage, data = modData)

# check multi-collinearity

vif(m1)
## cyearsex     cage
## 1.086837 1.086837
sqrt(vif(m1)) > 2
## cyearsex     cage
##    FALSE    FALSE
# check non-independence of errors

durbinWatsonTest(m1)
##  lag Autocorrelation D-W Statistic p-value
##    1      0.02165138      1.955424   0.772
##  Alternative hypothesis: rho != 0
# check global assumptions (using gvlma package)

global1 <- gvlma(m1)

summary(global1)
##
## Call:
## lm(formula = enduranc ~ cyearsex + cage, data = modData)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -21.7994  -6.9040   0.5701   5.6326  27.2279
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  26.5306     0.6337  41.865  < 2e-16 ***
## cyearsex      0.9163     0.1386   6.610 2.44e-10 ***
## cage         -0.2571     0.0655  -3.925 0.000113 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.919 on 242 degrees of freedom
## Multiple R-squared:  0.1663, Adjusted R-squared:  0.1594
## F-statistic: 24.14 on 2 and 242 DF,  p-value: 2.754e-10
##
##
## ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
## USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
## Level of Significance =  0.05
##
## Call:
##  gvlma(x = m1)
##
##                     Value p-value                Decision
## Global Stat        2.7023  0.6088 Assumptions acceptable.
## Skewness           0.7820  0.3765 Assumptions acceptable.
## Kurtosis           0.7505  0.3863 Assumptions acceptable.
## Link Function      0.3835  0.5357 Assumptions acceptable.
## Heteroscedasticity 0.7863  0.3752 Assumptions acceptable.
summary(m1)
##
## Call:
## lm(formula = enduranc ~ cyearsex + cage, data = modData)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -21.7994  -6.9040   0.5701   5.6326  27.2279
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  26.5306     0.6337  41.865  < 2e-16 ***
## cyearsex      0.9163     0.1386   6.610 2.44e-10 ***
## cage         -0.2571     0.0655  -3.925 0.000113 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.919 on 242 degrees of freedom
## Multiple R-squared:  0.1663, Adjusted R-squared:  0.1594
## F-statistic: 24.14 on 2 and 242 DF,  p-value: 2.754e-10
# get confidence intervals

confint(m1)
##                  2.5 %     97.5 %
## (Intercept) 25.2823165 27.7789080
## cyearsex     0.6432272  1.1893923
## cage        -0.3861168 -0.1280706

Then you can add the interaction term as a variable…
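The inter variable in this file appears to be the product of cyearsex and cage (the identical results for m2.var and m2.star below bear this out). If your data didn’t already include a product term, a minimal sketch for creating one:

modData$inter <- modData$cyearsex * modData$cage # product of the two centered predictors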

m2.var <- lm(enduranc ~ cyearsex + cage + inter, data = modData)

summary(m2.var)
##
## Call:
## lm(formula = enduranc ~ cyearsex + cage + inter, data = modData)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -21.165  -6.939   0.269   6.300  21.299
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 25.88872    0.64662  40.037  < 2e-16 ***
## cyearsex     0.97272    0.13653   7.124 1.20e-11 ***
## cage        -0.26169    0.06406  -4.085 6.01e-05 ***
## inter        0.04724    0.01359   3.476 0.000604 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.7 on 241 degrees of freedom
## Multiple R-squared:  0.2061, Adjusted R-squared:  0.1962
## F-statistic: 20.86 on 3 and 241 DF,  p-value: 4.764e-12

…or you can simply create the interaction within the model.

m2.star <- lm(enduranc ~ cyearsex * cage, data = modData) # The asterisk creates the main effects and interaction between two variables

summary(m2.star)
##
## Call:
## lm(formula = enduranc ~ cyearsex * cage, data = modData)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -21.165  -6.939   0.269   6.300  21.299
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)   25.88872    0.64662  40.037  < 2e-16 ***
## cyearsex       0.97272    0.13653   7.124 1.20e-11 ***
## cage          -0.26169    0.06406  -4.085 6.01e-05 ***
## cyearsex:cage  0.04724    0.01359   3.476 0.000604 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.7 on 241 degrees of freedom
## Multiple R-squared:  0.2061, Adjusted R-squared:  0.1962
## F-statistic: 20.86 on 3 and 241 DF,  p-value: 4.764e-12
m2.colon <- lm(enduranc ~ cyearsex + cage + cyearsex:cage, data = modData) # The colon isolates the interaction term (i.e., the colon will not evaluate main effects, so they must be entered separately)

summary(m2.colon)
##
## Call:
## lm(formula = enduranc ~ cyearsex + cage + cyearsex:cage, data = modData)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -21.165  -6.939   0.269   6.300  21.299
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)   25.88872    0.64662  40.037  < 2e-16 ***
## cyearsex       0.97272    0.13653   7.124 1.20e-11 ***
## cage          -0.26169    0.06406  -4.085 6.01e-05 ***
## cyearsex:cage  0.04724    0.01359   3.476 0.000604 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.7 on 241 degrees of freedom
## Multiple R-squared:  0.2061, Adjusted R-squared:  0.1962
## F-statistic: 20.86 on 3 and 241 DF,  p-value: 4.764e-12

Now let’s compare models:

modelCompare(m1, m2.star) # this uses the lmSupport package
## SSE (Compact) =  23810.33
## SSE (Augmented) =  22673.79
## Delta R-Squared =  0.03979367
## Partial Eta-Squared (PRE) =  0.04773338
## F(1,241) = 12.08038, p = 0.0006041869
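If you don’t have lmSupport installed, base R’s anova() runs the same nested-model F test (it reports sums of squares and the F statistic, though not the delta R-squared):

anova(m1, m2.star) # base R equivalent of the model comparison above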

This moderation is significant! We can plot this very easily in R.

sjp.int(m2.star, type = "eff", mdrt.values = "meansd")

You can also create a nice summary table using sjt.lm()

sjt.lm(m2.star)
                        enduranc
                B       CI               p
(Intercept)     25.89   24.61 – 27.16    <.001
cyearsex        0.97    0.70 – 1.24      <.001
cage            -0.26   -0.39 – -0.14    <.001
cyearsex:cage   0.05    0.02 – 0.07      <.001
Observations    245
R2 / adj. R2    .206 / .196

Polynomial Analysis

poly <- read.spss("polynomial revised.sav", to.data.frame = T)
head(poly)
##   time gain time2 ctime ctime2
## 1    1    4     1    -2      4
## 2    1    2     1    -2      4
## 3    1    3     1    -2      4
## 4    1    2     1    -2      4
## 5    2    7     4    -1      1
## 6    2    5     4    -1      1

We want to regress gain on centered time (ctime) and centered time squared (ctime2).
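These centered variables come precomputed in the file; judging from head(poly), ctime is time centered at its mean and ctime2 is its square, so a sketch for building them yourself would be:

poly$ctime <- poly$time - mean(poly$time) # center time at its mean
poly$ctime2 <- poly$ctime^2 # square the centered term

You could also square the predictor inline in the formula with I(), e.g., lm(gain ~ ctime + I(ctime^2), data = poly).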

polyreg <- lm(gain ~ ctime + ctime2, data = poly)

summary(polyreg)
##
## Call:
## lm(formula = gain ~ ctime + ctime2, data = poly)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -5.0071 -1.7214  0.2429  1.6357  5.9571
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  12.0429     1.0098  11.926 1.11e-09 ***
## ctime         1.3750     0.4582   3.001 0.008040 **
## ctime2       -1.9464     0.3872  -5.026 0.000104 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.898 on 17 degrees of freedom
## Multiple R-squared:  0.6684, Adjusted R-squared:  0.6294
## F-statistic: 17.13 on 2 and 17 DF,  p-value: 8.415e-05

There is a significant quadratic effect. We can interpret this more easily by plotting it.

What happens if we try to plot it using sjPlot?

# plotting the linear effect of time

sjp.lm(polyreg, type = "pred", vars = "ctime", facet.grid = F)

# plotting the quadratic effect of time

sjp.lm(polyreg, type = "pred", vars = "ctime2", facet.grid = F)

This is confusing: We have a positive linear effect but a negative quadratic effect. We can’t really see the non-linearity here. What happens if we just include both variables in our plot?

sjp.lm(polyreg, type = "pred", vars = c("ctime", "ctime2"), facet.grid = F)

This also makes no sense. We have to coerce these two effects (linear and quadratic) onto a single, linear scale. To do so, you will have to use a function I created. If you have the nonLinearFunction_v7.R script in your working directory, you can simply load the function using source(). If you want to see how the function works, feel free to open it up and check it out. It’s pretty ugly, but I would be happy to take you through it, step-by-step.

source("nonLinearFunction_v7.R")

Now you can use the nonLinearPlot() function.

nonLinearPlot(model = polyreg, data = poly, x1 = "ctime", x2 = "ctime2", type = "quad", main = "Gain ~ Time", xlab = "Time", ylab = "Gain", scale = mean(poly$time))

Now we can clearly see how the combined linear and quadratic effects create a curvilinear pattern of predicted values.
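If you don’t have the custom script handy, you can get a similar picture from base R’s predict(): evaluate the model over a grid of ctime values, with ctime2 computed as ctime squared (a sketch):

newTime <- seq(min(poly$ctime), max(poly$ctime), length.out = 100) # grid of centered time values
predGain <- predict(polyreg, newdata = data.frame(ctime = newTime, ctime2 = newTime^2)) # predicted gain at each grid point
plot(newTime, predGain, type = "l", xlab = "Centered Time", ylab = "Predicted Gain") # one curve on one x-axis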

Logistic Regression

Logistic models are just as easy to fit as OLS models. Instead of the lm() function, we will use the glm() function (generalized linear model). To see the different generalized linear models that R allows you to fit, use ?family. For logistic regression, we will use family = binomial(link = "logit").
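The call mirrors lm() with a family argument added. A minimal template (the variable and data names here are placeholders, not from our data):

myLogit <- glm(myOutcome ~ myPredictor, data = myData, family = binomial(link = "logit")) # hypothetical names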

First, import the data set.

logitData <- read.spss("logistic data.sav", to.data.frame = T)
## Warning in read.spss("logistic data.sav", to.data.frame = T): logistic
## data.sav: Unrecognized record type 7, subtype 18 encountered in system file
head(logitData)
##   age agrp chd dummy   cage c10year    lnage
## 1  20    1   0     0 -24.38  -2.438 2.995732
## 2  23    1   0     0 -21.38  -2.138 3.135494
## 3  24    1   0     0 -20.38  -2.038 3.178054
## 4  25    1   0     0 -19.38  -1.938 3.218876
## 5  25    1   1     0 -19.38  -1.938 3.218876
## 6  26    1   0     0 -18.38  -1.838 3.258097
# to create a dummy variable

logitData$dummyAge <- ifelse(logitData$age < 50, 0, 1)

# compare new dummy variable to old dummy variable

logitData$dummyAge == logitData$dummy
##   [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
##  [15] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
##  [29] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
##  [43] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
##  [57] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
##  [71] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
##  [85] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
##  [99] TRUE TRUE
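Scanning 100 TRUEs by eye is error-prone; all() collapses the comparison into a single value:

all(logitData$dummyAge == logitData$dummy) # TRUE only if every element matches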

In this data set, we first want to predict chd with our dummy variable.

logMod1 <- glm(chd ~ dummy, data = logitData, family = binomial(link = "logit"))

summary(logMod1)
##
## Call:
## glm(formula = chd ~ dummy, family = binomial(link = "logit"),
##     data = logitData)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -1.6481  -0.7787  -0.7787   0.7710   1.6378
##
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)  -1.0380     0.2822  -3.678 0.000235 ***
## dummy         2.0989     0.4788   4.384 1.17e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 136.66  on 99  degrees of freedom
## Residual deviance: 114.61  on 98  degrees of freedom
## AIC: 118.61
##
## Number of Fisher Scoring iterations: 4
# Exponentiate coefficients to get Odds and Odds Ratios
exp(logMod1$coefficients)
## (Intercept)       dummy
##   0.3541667   8.1568627
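If you find probabilities easier to interpret than odds, base R’s plogis() converts log-odds directly (a sketch):

plogis(coef(logMod1)[1]) # predicted probability of chd when dummy = 0 (about .26)
plogis(sum(coef(logMod1))) # predicted probability of chd when dummy = 1 (about .74)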
# Obtain and exponentiate confidence intervals to get Odds and Odds Ratios
confint(logMod1)
## Waiting for profiling to be done...
##                 2.5 %     97.5 %
## (Intercept) -1.619470 -0.5055285
## dummy        1.193538  3.0827738
exp(confint(logMod1))
## Waiting for profiling to be done...
##                 2.5 %     97.5 %
## (Intercept) 0.1980036  0.6031867
## dummy       3.2987321 21.8188390
# Obtain pseudo-R-squared values from the BaylorEdPsych package

PseudoR2(logMod1)
##         McFadden     Adj.McFadden        Cox.Snell       Nagelkerke
##        0.1613743        0.1174706        0.1979135        0.2656432
## McKelvey.Zavoina           Effron            Count        Adj.Count
##        0.2334979        0.2150322        0.7400000        0.3953488
##              AIC    Corrected.AIC
##      118.6090963      118.7328076

For the second analysis, we want to predict chd with mean-centered age.

logMod2 <- glm(chd ~ cage, data = logitData, family = binomial(link = "logit"))

summary(logMod2)
##
## Call:
## glm(formula = chd ~ cage, family = binomial(link = "logit"),
##     data = logitData)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -1.9718  -0.8456  -0.4576   0.8253   2.2859
##
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.38677    0.23972  -1.613    0.107
## cage         0.11092    0.02406   4.610 4.02e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 136.66  on 99  degrees of freedom
## Residual deviance: 107.35  on 98  degrees of freedom
## AIC: 111.35
##
## Number of Fisher Scoring iterations: 4
# Exponentiate coefficients to get Odds and Odds Ratios
exp(logMod2$coefficients)
## (Intercept)        cage
##   0.6792452   1.1173068
# Obtain and exponentiate confidence intervals to get Odds and Odds Ratios
confint(logMod2)
## Waiting for profiling to be done...
##                   2.5 %     97.5 %
## (Intercept) -0.87154824 0.07534233
## cage         0.06693158 0.16200672
exp(confint(logMod2))
## Waiting for profiling to be done...
##                 2.5 %   97.5 %
## (Intercept) 0.4183034 1.078253
## cage        1.0692223 1.175868
# Obtain pseudo-R-squared values from the BaylorEdPsych package

PseudoR2(logMod2)
##         McFadden     Adj.McFadden        Cox.Snell       Nagelkerke
##        0.2144684        0.1705648        0.2540516        0.3409928
## McKelvey.Zavoina           Effron            Count        Adj.Count
##        0.3371659        0.2725374        0.7400000        0.3953488
##              AIC    Corrected.AIC
##      111.3530927      111.4768040

Plotting with Logistic Regression

Again, the sjPlot package makes it easy to get prediction lines.

sjp.glm(logMod2, type = "pred", vars = "cage")

You can also create a nice summary table using sjt.glm()

sjt.glm(logMod2)
                chd
                Odds Ratio   CI            p
(Intercept)     0.68         0.42 – 1.08   .107
cage            1.12         1.07 – 1.18   <.001
Observations    100

Path Analysis

These are advanced techniques, but this section demonstrates how easy these types of analyses are to conduct in R. First, I’m going to show you how to conduct the mediation analysis from your second homework assignment using the lavaan package.

ptsd <- read.spss("PTSD on warstr data with blood pressure.sav", to.data.frame = T)

head(ptsd)
##   ID  PTSD ANXTOT WARSTR BLOODPRE
## 1  1 19.00     21      7     98.8
## 2  2 15.31      9      6    105.0
## 3  3 30.00     11      9    113.0
## 4  4  5.00     15     12    125.0
## 5  5 19.00      3      4     88.0
## 6  6 18.00      1      6    100.0

With these data, we want to test whether PTSD mediates the relationship between War Stress and Blood Pressure, controlling for the influence of Anxiety on PTSD. With SEM, we can estimate everything, including the indirect effect, in a single step.

# First, create your model

mediate <- '
# variables must be spelled/capitalized correctly

# letters joined to variables with an asterisk are labels for the specified path

# Y on X and M

BLOODPRE ~ c*WARSTR + b*PTSD

# M on X (controlling for covariate)

PTSD ~ a*WARSTR + ANXTOT

# We can create a new parameter (the indirect and total effects) from the labels above

# Indirect Effect

ab := a*b

# Total Effect

total := c + (a*b)
'

# Fit the model using bootstrap standard errors

fit <- sem(mediate, data = ptsd, se = "bootstrap")

summary(fit, ci = T)
## lavaan (0.5-23.1097) converged normally after  25 iterations
##
##                                                   Used       Total
##   Number of observations                           151         152
##
##   Estimator                                         ML
##   Minimum Function Test Statistic                0.096
##   Degrees of freedom                                 1
##   P-value (Chi-square)                           0.757
##
## Parameter Estimates:
##
##   Information                                 Observed
##   Standard Errors                            Bootstrap
##   Number of requested bootstrap draws             1000
##   Number of successful bootstrap draws            1000
##
## Regressions:
##                    Estimate  Std.Err  z-value  P(>|z|) ci.lower ci.upper
##   BLOODPRE ~
##     WARSTR     (c)    1.088    0.552    1.971    0.049   -0.092    2.123
##     PTSD       (b)    0.421    0.132    3.192    0.001    0.142    0.685
##   PTSD ~
##     WARSTR     (a)    0.822    0.288    2.856    0.004    0.265    1.427
##     ANXTOT            0.709    0.109    6.503    0.000    0.487    0.920
##
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|) ci.lower ci.upper
##    .BLOODPRE        172.194   24.518    7.023    0.000  126.288  221.636
##    .PTSD             61.697    8.404    7.341    0.000   45.693   78.169
##
## Defined Parameters:
##                    Estimate  Std.Err  z-value  P(>|z|) ci.lower ci.upper
##     ab                0.346    0.159    2.179    0.029    0.082    0.677
##     total             1.433    0.496    2.892    0.004    0.384    2.344
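If you want the estimates as a data frame, or a different flavor of bootstrap interval, lavaan’s parameterEstimates() can provide them (a sketch):

parameterEstimates(fit, boot.ci.type = "bca.simple") # bias-corrected bootstrap CIs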

You can also get R to give you a path diagram for a fitted model using semPaths(). This is from the semPlot package. It isn’t super great (you’re much better off using PowerPoint), but it can still give you a functional path diagram.

# rotation = 2 puts all exogenous variables on the left side
# edge.label.cex changes the font size (default = 0.8, which is rather small)
semPaths(fit, whatLabels = "par", rotation = 2, edge.label.cex = 1) 
## Warning in qgraph(Edgelist, labels = nLab, bidirectional = Bidir, directed
## = Directed, : The following arguments are not documented and likely not
## arguments of qgraph and thus ignored: loopRotation; residuals; residScale;
## residEdge; CircleEdgeEnd