library("tidyverse")
pog <- read_rds("data-derived/pog_cts.rds")
4 Polynomial regression
For more information about these approaches, see Barr (2008) and Mirman, Dixon, and Magnuson (2008). It is also possible to use Generalized Additive Mixed Models (GAMMs), which can more easily accommodate arbitrary wiggly patterns and asymptotes, but that is beyond the current scope of this textbook.
# A tibble: 1,021,288 × 5
sub_id t_id f_c role pad
<int> <int> <int> <fct> <lgl>
1 1 1 -90 target FALSE
2 1 1 -89 target FALSE
3 1 1 -88 target FALSE
4 1 1 -87 target FALSE
5 1 1 -86 target FALSE
6 1 1 -85 target FALSE
7 1 1 -84 target FALSE
8 1 1 -83 target FALSE
9 1 1 -82 target FALSE
10 1 1 -81 target FALSE
# … with 1,021,278 more rows
4.1 Binning data
We are going to follow the Mirman, Dixon, and Magnuson (2008) approach. What we want to do first is to model the shape of the curve for existing competitors and see if it differs across children and adults.
We will perform separate by-subject and by-item analyses. These are needed because we must first aggregate the data in order to deal with the frame-by-frame dependencies. A common approach is to aggregate frames into 50 ms bins (i.e., each containing 3 frames).
The general formula for binning data is:
bin = floor( (frame + binsize/2) / binsize ) * binsize
For bins of 3 frames each, this becomes
bin = floor( (frame + 3/2) / 3) * 3
To get a sense for how this formula works, try it out in the console.
sample_frames <- -10:10

rbind(frame = sample_frames,
      bin = floor( (sample_frames + 3/2) / 3) * 3)
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
frame -10 -9 -8 -7 -6 -5 -4 -3 -2 -1 0 1 2
bin -9 -9 -9 -6 -6 -6 -3 -3 -3 0 0 0 3
[,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21]
frame 3 4 5 6 7 8 9 10
bin 3 3 6 6 6 9 9 9
4.1.1 Activity: Calculating bins
Following the above logic, add the variables bin and ms (time in milliseconds for the corresponding bin) to the pog table. Save the result as pog_calc.
# A tibble: 1,021,288 × 7
sub_id t_id f_c role pad bin ms
<int> <int> <int> <fct> <lgl> <int> <int>
1 1 1 -90 target FALSE -30 -500
2 1 1 -89 target FALSE -30 -500
3 1 1 -88 target FALSE -29 -483
4 1 1 -87 target FALSE -29 -483
5 1 1 -86 target FALSE -29 -483
6 1 1 -85 target FALSE -28 -466
7 1 1 -84 target FALSE -28 -466
8 1 1 -83 target FALSE -28 -466
9 1 1 -82 target FALSE -27 -450
10 1 1 -81 target FALSE -27 -450
# … with 1,021,278 more rows
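Here is a minimal sketch of one way this could be done (not necessarily the only solution). Note that the bin values in the table above are bin indices (the result of the floor() division is not multiplied back by the bin size), and the conversion from bin index to milliseconds (1000/60 ms per bin, truncated to an integer) is an assumption chosen to reproduce the ms values shown above; adjust it to the sampling rate of your own recording.
## a sketch, not the official solution; the ms conversion factor is an
## assumption chosen to match the printed output above
pog_calc <- pog %>%
  mutate(bin = as.integer(floor((f_c + 3 / 2) / 3)),
         ms = as.integer(bin * 1000 / 60))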
4.1.2 Activity: Count frames in bins
For the analysis below, we're going to focus on the existing competitors (ctype == "exist"). Link the pog_calc data to information about subjects and conditions (crit) to create the following table, where Y is the number of frames observed for the particular combination of sub_id, group, crit, ms, and role. Save the resulting table as pog_subj_y.
# A tibble: 50,630 × 6
sub_id group crit ms role Y
<int> <chr> <chr> <int> <fct> <int>
1 1 adult competitor -500 target 6
2 1 adult competitor -500 critical 4
3 1 adult competitor -500 existing 4
4 1 adult competitor -500 novel 6
5 1 adult competitor -500 (blank) 0
6 1 adult competitor -483 target 7
7 1 adult competitor -483 critical 5
8 1 adult competitor -483 existing 8
9 1 adult competitor -483 novel 8
10 1 adult competitor -483 (blank) 2
# … with 50,620 more rows
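One possible sketch is shown below. The tables subjects and trials are hypothetical placeholders for the subject-level (sub_id, group) and trial-level (t_id, ctype, crit) information from earlier chapters; substitute whatever tables hold that information in your workspace. Note also that the expected table above contains explicit zero counts (e.g., the "(blank)" role), so empty cells must be retained rather than dropped.
## a sketch only; `subjects` and `trials` are hypothetical table names
## standing in for the subject and condition information
pog_subj_y <- pog_calc %>%
  inner_join(subjects, by = "sub_id") %>%
  inner_join(trials, by = "t_id") %>%
  filter(ctype == "exist") %>%
  count(sub_id, group, crit, ms, role, name = "Y") %>%
  ## fill in combinations with no observed frames as explicit zeros
  complete(nesting(sub_id, group, crit, ms), role, fill = list(Y = 0L))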
4.1.3 Activity: Compute probabilities
Now add in the variables N, the total number of frames for a given combination of sub_id, group, crit, and ms, and p, which is the probability (Y / N). Save the result as pog_subj.
# A tibble: 50,630 × 8
sub_id group crit ms role Y N p
<int> <chr> <chr> <int> <fct> <int> <int> <dbl>
1 1 adult competitor -500 target 6 20 0.3
2 1 adult competitor -500 critical 4 20 0.2
3 1 adult competitor -500 existing 4 20 0.2
4 1 adult competitor -500 novel 6 20 0.3
5 1 adult competitor -500 (blank) 0 20 0
6 1 adult competitor -483 target 7 30 0.233
7 1 adult competitor -483 critical 5 30 0.167
8 1 adult competitor -483 existing 8 30 0.267
9 1 adult competitor -483 novel 8 30 0.267
10 1 adult competitor -483 (blank) 2 30 0.0667
# … with 50,620 more rows
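A sketch of one way to compute these, grouping by everything except role so that N is the total number of frames in each subject/condition/bin cell:
## a sketch of one possible solution
pog_subj <- pog_subj_y %>%
  group_by(sub_id, group, crit, ms) %>%
  mutate(N = sum(Y),
         p = Y / N) %>%
  ungroup()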
4.2 Plot mean probabilities
4.2.1 Activity: Mean probabilities
Let’s now compute the mean probabilities for looks to the critical object across groups (adults, children) and conditions (competitor, unrelated). First calculate the table pog_means below, then use it to create the graph below.
# A tibble: 244 × 4
group crit ms probability
<chr> <chr> <int> <dbl>
1 adult competitor -500 0.137
2 adult competitor -483 0.140
3 adult competitor -466 0.131
4 adult competitor -450 0.129
5 adult competitor -433 0.135
6 adult competitor -416 0.137
7 adult competitor -400 0.135
8 adult competitor -383 0.138
9 adult competitor -366 0.137
10 adult competitor -350 0.129
# … with 234 more rows
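Here is a sketch of one way to build the table and plot it. It assumes that looks to the critical object are the rows with role == "critical" (the same level used again below), and the plot aesthetics are just one reasonable choice.
## a sketch; means are taken over subjects within group x condition x bin
pog_means <- pog_subj %>%
  filter(role == "critical") %>%
  group_by(group, crit, ms) %>%
  summarise(probability = mean(p), .groups = "drop")

ggplot(pog_means,
       aes(ms, probability, color = group, linetype = crit)) +
  geom_line() +
  theme(legend.position = "top")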
4.3 Polynomial regression
Our task now is to fit the functions shown in the above figure using orthogonal polynomials. To avoid asymptotes, we will limit our analysis to the -200 to 500 ms window, which is where the function seems to be changing.
The first thing we will do is prepare the data, adding in deviation-coded numerical predictors for group (G) and crit (C).
We will load in the R packages {lme4} for fitting linear mixed-effects models, and {polypoly} for working with orthogonal polynomials.
# if you don't have it, type
# install.packages("polypoly") # in the console
library("polypoly")
library("lme4")
pog_prep <- pog_subj %>%
  filter(role == "critical", ms >= -200) %>%
  mutate(G = if_else(group == "child", 1/2, -1/2),
         C = if_else(crit == "competitor", 1/2, -1/2))
## check that we didn't make any errors
pog_prep %>%
  distinct(group, crit, G, C)
# A tibble: 4 × 4
group crit G C
<chr> <chr> <dbl> <dbl>
1 adult competitor -0.5 0.5
2 adult unrelated -0.5 -0.5
3 child competitor 0.5 0.5
4 child unrelated 0.5 -0.5
pog_3 <- pog_prep %>%
  poly_add_columns(ms, degree = 3) %>%
  select(sub_id, group, G, crit, C, ms, p, ms1, ms2, ms3)
mod_3 <- lmer(p ~ (ms1 + ms2 + ms3) * G * C +
                ((ms1 + ms2 + ms3) * C || sub_id),
              data = pog_3)
summary(mod_3)
Linear mixed model fit by REML ['lmerMod']
Formula: p ~ (ms1 + ms2 + ms3) * G * C + ((1 | sub_id) + (0 + ms1 | sub_id) +
(0 + ms2 | sub_id) + (0 + ms3 | sub_id) + (0 + C | sub_id) +
(0 + ms1:C | sub_id) + (0 + ms2:C | sub_id) + (0 + ms3:C | sub_id))
Data: pog_3
REML criterion at convergence: -14758.4
Scaled residuals:
Min 1Q Median 3Q Max
-3.2565 -0.5914 -0.0358 0.5113 5.4038
Random effects:
Groups Name Variance Std.Dev.
sub_id (Intercept) 0.001249 0.03534
sub_id.1 ms1 0.054835 0.23417
sub_id.2 ms2 0.027294 0.16521
sub_id.3 ms3 0.026226 0.16194
sub_id.4 C 0.003698 0.06081
sub_id.5 ms1:C 0.128879 0.35900
sub_id.6 ms2:C 0.106783 0.32678
sub_id.7 ms3:C 0.061623 0.24824
Residual 0.005815 0.07625
Number of obs: 7138, groups: sub_id, 83
Fixed effects:
Estimate Std. Error t value
(Intercept) 0.133722 0.003983 33.569
ms1 -0.360730 0.026378 -13.676
ms2 -0.188137 0.019077 -9.862
ms3 0.134597 0.018736 7.184
G 0.027162 0.007967 3.409
C 0.030741 0.006915 4.445
ms1:G 0.108453 0.052756 2.056
ms2:G -0.035381 0.038154 -0.927
ms3:G -0.088315 0.037473 -2.357
ms1:C 0.058203 0.041147 1.414
ms2:C -0.193833 0.037774 -5.131
ms3:C 0.045101 0.029710 1.518
G:C 0.001996 0.013831 0.144
ms1:G:C 0.068046 0.082295 0.827
ms2:G:C 0.082218 0.075548 1.088
ms3:G:C -0.127579 0.059420 -2.147
Correlation matrix not shown by default, as p = 16 > 12.
Use print(x, correlation=TRUE) or
vcov(x) if you need it
It converged! Before we get too excited, plot the model fitted values against the observed values to assess the quality of the fit.
We need data to feed into the predict() function in order to generate our fitted values. We'll use pog_means for this purpose, adding in all of the predictors we need for the model, and restricting the range.
pog_new <- pog_means %>%
  filter(ms >= -200) %>%
  mutate(G = if_else(group == "child", 1/2, -1/2),
         C = if_else(crit == "competitor", 1/2, -1/2))
Now we are ready to feed it into predict() to generate fitted values. Note that we want to make predictions for the “typical” subject with random effects of zero, which requires setting re.form = NA for the predict() function. See ?predict.merMod for details. We use newdata = . to send the current data from our pipeline as the argument for newdata.
fits_3 <- pog_new %>%
  poly_add_columns(ms, degree = 3) %>%
  mutate(fitted = predict(mod_3, newdata = .,
                          re.form = NA))
Now we plot the fitted values (lines) against observed (points).
ggplot(fits_3,
       aes(ms, probability,
           shape = crit, color = group)) +
  geom_point() +
  geom_line(aes(y = fitted, linetype = crit)) +
  theme(legend.position = "top")
Not good. We might want to try a higher order model. Alternatively, we can restrict the range further to get rid of asymptotic elements in the later part of the window. Let’s try the latter first because that’s fairly easy.
pog_3b <- pog_3 %>%
  filter(between(ms, -50L, 300L))

## refit with a different dataset
mod_3b <- update(mod_3, data = pog_3b)
Generate fitted values from the new model and plot.
fits_3b <- pog_new %>%
  filter(between(ms, -50L, 300L)) %>%
  poly_add_columns(ms, degree = 3) %>%
  mutate(fitted = predict(mod_3b, newdata = .,
                          re.form = NA))
ggplot(fits_3b,
       aes(ms, probability,
           shape = crit, color = group)) +
  geom_point() +
  geom_line(aes(y = fitted, linetype = crit)) +
  theme(legend.position = "top")
Well, that is even worse.
4.3.1 Activity: Quintic model
A cubic is really not enough. Try to fit a quintic (5th-order) function on the reduced data range (-50 ms to 300 ms). Use the bobyqa optimizer to get lmer() to converge (control = lmerControl(optimizer = "bobyqa")), and fit it with REML = FALSE.
Then, follow the example above to assess the quality of fit using a plot.
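Here is a sketch of one possible solution. It follows the same pattern as the cubic model, using the names pog_5, mod_5, and fits_5 (pog_5 and mod_5 match the Data and Formula fields of the summary output below); how exactly you construct pog_5 may differ from the official solution.
## a sketch: quintic model on the reduced range, ML fit with bobyqa
pog_5 <- pog_prep %>%
  filter(between(ms, -50L, 300L)) %>%
  poly_add_columns(ms, degree = 5)

mod_5 <- lmer(p ~ (ms1 + ms2 + ms3 + ms4 + ms5) * G * C +
                ((ms1 + ms2 + ms3 + ms4 + ms5) * C || sub_id),
              data = pog_5, REML = FALSE,
              control = lmerControl(optimizer = "bobyqa"))

## fitted values for the "typical" subject, plotted against the observed means
fits_5 <- pog_new %>%
  filter(between(ms, -50L, 300L)) %>%
  poly_add_columns(ms, degree = 5) %>%
  mutate(fitted = predict(mod_5, newdata = ., re.form = NA))

ggplot(fits_5,
       aes(ms, probability,
           shape = crit, color = group)) +
  geom_point() +
  geom_line(aes(y = fitted, linetype = crit)) +
  theme(legend.position = "top")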
OK, that’s a fit that we can be happy with.
Let’s have a look at the model output.
summary(mod_5)
Linear mixed model fit by maximum likelihood ['lmerMod']
Formula: p ~ (ms1 + ms2 + ms3 + ms4 + ms5) * G * C + ((1 | sub_id) + (0 +
ms1 | sub_id) + (0 + ms2 | sub_id) + (0 + ms3 | sub_id) +
(0 + ms4 | sub_id) + (0 + ms5 | sub_id) + (0 + C | sub_id) +
(0 + ms1:C | sub_id) + (0 + ms2:C | sub_id) + (0 + ms3:C |
sub_id) + (0 + ms4:C | sub_id) + (0 + ms5:C | sub_id))
Data: pog_5
Control: lmerControl(optimizer = "bobyqa")
AIC BIC logLik deviance df.resid
-9233.6 -9004.1 4653.8 -9307.6 3615
Scaled residuals:
Min 1Q Median 3Q Max
-3.2631 -0.4883 -0.0013 0.4694 4.4521
Random effects:
Groups Name Variance Std.Dev.
sub_id (Intercept) 0.002499 0.04999
sub_id.1 ms1 0.046758 0.21624
sub_id.2 ms2 0.017703 0.13305
sub_id.3 ms3 0.014331 0.11971
sub_id.4 ms4 0.005942 0.07709
sub_id.5 ms5 0.006101 0.07811
sub_id.6 C 0.007937 0.08909
sub_id.7 ms1:C 0.113327 0.33664
sub_id.8 ms2:C 0.075215 0.27425
sub_id.9 ms3:C 0.061012 0.24701
sub_id.10 ms4:C 0.020677 0.14380
sub_id.11 ms5:C 0.021688 0.14727
Residual 0.002148 0.04635
Number of obs: 3652, groups: sub_id, 83
Fixed effects:
Estimate Std. Error t value
(Intercept) 0.167871 0.005541 30.295
ms1 -0.209290 0.024008 -8.718
ms2 -0.121173 0.015042 -8.056
ms3 0.006115 0.013625 0.449
ms4 0.046584 0.009195 5.066
ms5 0.003498 0.009299 0.376
G 0.026723 0.011082 2.411
C 0.054710 0.009899 5.527
ms1:G 0.148235 0.048016 3.087
ms2:G 0.006309 0.030084 0.210
ms3:G -0.070349 0.027249 -2.582
ms4:G 0.018576 0.018390 1.010
ms5:G 0.006853 0.018597 0.368
ms1:C 0.010683 0.037648 0.284
ms2:C -0.191025 0.030953 -6.171
ms3:C 0.006511 0.028053 0.232
ms4:C 0.099358 0.017347 5.728
ms5:C -0.003818 0.017695 -0.216
G:C -0.009915 0.019798 -0.501
ms1:G:C 0.144761 0.075295 1.923
ms2:G:C 0.046990 0.061907 0.759
ms3:G:C -0.156586 0.056106 -2.791
ms4:G:C 0.009317 0.034695 0.269
ms5:G:C 0.061278 0.035390 1.732
Correlation matrix not shown by default, as p = 24 > 12.
Use print(x, correlation=TRUE) or
vcov(x) if you need it
Now let’s use model comparison to answer our question: do the time-varying components for lexical competition vary across children and adults?
mod_5_drop <- update(mod_5,
                     . ~ . -ms1:G:C -ms2:G:C -ms3:G:C -ms4:G:C -ms5:G:C)
anova(mod_5, mod_5_drop)
Data: pog_5
Models:
mod_5_drop: p ~ ms1 + ms2 + ms3 + ms4 + ms5 + G + C + (1 | sub_id) + (0 + ms1 | sub_id) + (0 + ms2 | sub_id) + (0 + ms3 | sub_id) + (0 + ms4 | sub_id) + (0 + ms5 | sub_id) + (0 + C | sub_id) + (0 + ms1:C | sub_id) + (0 + ms2:C | sub_id) + (0 + ms3:C | sub_id) + (0 + ms4:C | sub_id) + (0 + ms5:C | sub_id) + ms1:G + ms2:G + ms3:G + ms4:G + ms5:G + ms1:C + ms2:C + ms3:C + ms4:C + ms5:C + G:C
mod_5: p ~ (ms1 + ms2 + ms3 + ms4 + ms5) * G * C + ((1 | sub_id) + (0 + ms1 | sub_id) + (0 + ms2 | sub_id) + (0 + ms3 | sub_id) + (0 + ms4 | sub_id) + (0 + ms5 | sub_id) + (0 + C | sub_id) + (0 + ms1:C | sub_id) + (0 + ms2:C | sub_id) + (0 + ms3:C | sub_id) + (0 + ms4:C | sub_id) + (0 + ms5:C | sub_id))
npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
mod_5_drop 32 -9229.0 -9030.5 4646.5 -9293.0
mod_5 37 -9233.6 -9004.1 4653.8 -9307.6 14.653 5 0.01195 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The likelihood-ratio test is significant (χ²(5) = 14.65, p = .012), so we have evidence that the time-varying components of the competition effect differ between children and adults. There are further things we could potentially do with this model, including performing model comparison on the individual time-varying components. One thing we probably should do would be to repeat all the above steps, but treating items as a random factor instead of subjects.
One issue with polynomial regression is that the complexity of the model is likely to give rise to convergence problems. One strategy is to estimate the parameters using re-sampling techniques, which we’ll learn about in the next chapter.
Before we do that, let's save pog_subj, because we'll need it for the next set of activities.
pog_subj %>%
  saveRDS(file = "data-derived/pog_subj.rds")