variable | type | description |
---|---|---|
ID | int | arbitrary value uniquely identifying each frame within subject |
TETTime | dbl | (ignored) |
RTTime | int | (ignored) |
CursorX | int | horizontal point of gaze in pixels |
CursorY | int | vertical point of gaze in pixels |
TimestampSec | int | timestamp in seconds |
TimestampMicrosec | int | microsecond portion of timestamp (cycles around) |
XGazePosLeftEye | dbl | (ignored) |
YGazePosLeftEye | dbl | (ignored) |
XCameraPosLeftEye | dbl | (ignored) |
YCameraPosLeftEye | dbl | (ignored) |
DiameterPupilLeftEye | dbl | (ignored) |
DistanceLeftEye | dbl | (ignored) |
ValidityLeftEye | int | (ignored) |
XGazePosRightEye | dbl | (ignored) |
YGazePosRightEye | dbl | (ignored) |
XCameraPosRightEye | dbl | (ignored) |
YCameraPosRightEye | dbl | (ignored) |
DiameterPupilRightEye | dbl | (ignored) |
DistanceRightEye | dbl | (ignored) |
ValidityRightEye | int | (ignored) |
TrialId | int | arbitrary value uniquely identifying each trial within a subject (same as t_id) |
UserDefined_1 | chr | phase of the trial (Fixation, Preview, StimSlide) |
1 Import, epoching, and time-alignment
The overall task here is to scrape out the data we want to use from each trial (epoching) and align the frame counters for all trials to the disambiguation point for the particular audio stimulus that was played on that trial (time-alignment). In other words, the disambiguation point should be the temporal “origin” (zero point) for the timeline on each trial.
1.1 Data import
For the first part of pre-processing, we will load the eye data into our R session using functions from the {readr} package, which is one of the many packages that make up the {tidyverse} meta-package. The .gazedata files from the Tobii eyetracking system are in .tsv (Tab Separated Values) format, for which we use read_tsv().
Before we can perform epoching and time-alignment, we have to import and clean up the .gazedata files. There are 42 adult data files and 41 child data files, located in the adult and child subdirectories of data-raw/. These files follow the naming convention data-raw/adult/sub_XXX.gazedata and data-raw/child/sub_XXX.gazedata, where the XXX part of the filename is the unique integer identifying each subject, which corresponds to sub_id in the subjects table.
The raw gazedata files include a lot of unnecessary information. We’ll need to scrape out the data that we need and convert the XXX value from the filename into a sub_id variable in the resulting table. The source files have the format described in the data dictionary above.
1.1.1 Activity: One Subject
Read in the Tobii eyetracking data for a single subject from the datafile data-raw/adult/sub_003.gazedata, and convert it to the format below.
# A tibble: 16,658 × 7
sub_id t_id f_id sec x y phase
<int> <int> <int> <dbl> <int> <int> <chr>
1 3 1 145 1317141127. 666 521 Preview
2 3 1 146 1317141127. 649 442 Preview
3 3 1 147 1317141127. 618 507 Preview
4 3 1 148 1317141127. 645 471 Preview
5 3 1 149 1317141127. 632 471 Preview
6 3 1 150 1317141127. 645 536 Preview
7 3 1 151 1317141127. 651 474 Preview
8 3 1 152 1317141127. 643 541 Preview
9 3 1 153 1317141127. 628 581 Preview
10 3 1 154 1317141127. 643 532 Preview
# … with 16,648 more rows
Here, we have renamed TrialId to t_id, which is the name it takes throughout the rest of the database; CursorX and CursorY to x and y respectively; ID to f_id (frame id); and UserDefined_1 to phase. We also exclude any frames where UserDefined_1 == "Fixation", because these frames are not informative, and dropping them reduces the size of the data we need to import.
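One way to sketch this, assuming {readr} and {dplyr} are loaded and that the column names match the data dictionary above (the construction of sec from the two timestamp columns is an assumption; adjust if your target uses TimestampSec alone):

```r
library(readr)
library(dplyr)

## read one subject's gazedata, drop Fixation frames, and rename
## columns to the target format; sub_id is hard-coded here but
## should be parsed from the filename in general
edat_003 <- read_tsv("data-raw/adult/sub_003.gazedata") %>%
  filter(UserDefined_1 != "Fixation") %>%      # drop uninformative frames
  transmute(sub_id = 3L,                       # from "sub_003.gazedata"
            t_id   = TrialId,
            f_id   = ID,
            sec    = TimestampSec + TimestampMicrosec / 1e6,
            x      = CursorX,
            y      = CursorY,
            phase  = UserDefined_1)
```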
1.1.2 Activity: All Subjects
Now adapt the code that you wrote above to load all 83 files into a single table, which should have the same format as the data you imported for subject 3 above.
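A sketch of one approach, iterating over the files with purrr::map_dfr() and pulling the subject number out of each filename (the helper logic here is an illustration, not the official solution):

```r
library(readr)
library(dplyr)
library(purrr)
library(stringr)

## find all gazedata files under data-raw/ (adult and child)
files <- list.files("data-raw", pattern = "\\.gazedata$",
                    recursive = TRUE, full.names = TRUE)

## read each file, clean it up, and bind the results row-wise
edat <- map_dfr(files, function(f) {
  read_tsv(f) %>%
    filter(UserDefined_1 != "Fixation") %>%
    transmute(sub_id = as.integer(str_extract(basename(f), "\\d+")),
              t_id   = TrialId,
              f_id   = ID,
              sec    = TimestampSec + TimestampMicrosec / 1e6,
              x      = CursorX,
              y      = CursorY,
              phase  = UserDefined_1)
})
```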
1.2 Epoching and time-alignment
The Tobii eyetracker recorded data at a rate of 60 Hertz (i.e., 60 frames per second, or one frame every 1/60th of a second). For each trial, the frame counter (ID, which we renamed to f_id) starts at 1 and increments every frame. This is not very useful, because we need to know when certain stimulus events occurred, and these will take place at a different frame number for every trial, depending on the timing of the speech events of the stimulus for that trial. We need to re-define the ‘origin’ of the eye-tracking data. In this study, we used the ‘disambiguation point’, which is the point in the word where the signal distinguishes between two competing lexical items (e.g., candy and candle).
As the above figure shows, each trial had three phases, Fixation, Preview, and StimSlide, which are indexed by the variable phase. Playback of a soundfile with a pre-recorded speech stimulus began simultaneously with the onset of the StimSlide phase.
For each trial (uniquely identified by sub_id and t_id), we are going to need to do two things to time-align the eye data to the disambiguation point.

1. Find out what sound was played and the timing of the disambiguation point within that soundfile, as measured from the start of the file.
2. Figure out the frame number corresponding to the start of the StimSlide phase and then adjust by the amount calculated in the previous step.
1.2.1 Activity: Disambiguation Point
Create the table below from the raw data, which has information about the onset of the disambiguation point for each trial. Store the table as origin_adj.

You may wish to consult Appendix A to see what tables the values in the table below have been drawn from. You’ll need to import these tables into your session. All of these tables have the extension .csv, which indicates they are in Comma Separated Values format. The ideal way to import these files is to use read_csv() from the {readr} package.
# A tibble: 5,644 × 4
sub_id t_id sound disambig_point
<int> <int> <chr> <int>
1 1 1 Tpelican.wav 1171
2 1 2 Tpumpkin.wav 1079
3 1 3 pencil.wav 810
4 1 4 paddle.wav 881
5 1 6 Tbalcony.wav 1012
6 1 7 Tnapkin.wav 1069
7 1 11 Tflamingo.wav 1150
8 1 13 Tangel.wav 1036
9 1 14 Tparachute.wav 1046
10 1 16 Tmushroom.wav 1062
# … with 5,634 more rows
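The exact table names depend on Appendix A; as a hedged sketch, suppose the trial-level sound assignments live in a hypothetical trials.csv and the disambiguation timings in a hypothetical sounds.csv:

```r
library(readr)
library(dplyr)

## NOTE: "trials.csv" and "sounds.csv" are hypothetical names --
## substitute the actual tables listed in Appendix A
trials <- read_csv("trials.csv")  # sub_id, t_id, sound
sounds <- read_csv("sounds.csv")  # sound, disambig_point (ms)

origin_adj <- trials %>%
  inner_join(sounds, by = "sound") %>%
  select(sub_id, t_id, sound, disambig_point)
```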
1.2.2 Activity: Onset of StimSlide
Now let’s do part 2, where we find the value of f_id for the first frame of eye data on each trial following the onset of the StimSlide phase. We should have a table that looks like the one below, with one row for each trial, and where f_ss is the value of f_id for the earliest frame in the StimSlide phase.
# A tibble: 7,385 × 3
sub_id t_id f_ss
<int> <int> <int>
1 1 1 338
2 1 2 729
3 1 3 1124
4 1 4 1443
5 1 5 1795
6 1 6 2300
7 1 7 2593
8 1 8 3348
9 1 9 3874
10 1 10 4331
# … with 7,375 more rows
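One way to get this table (the name f_ss_tbl is ours, not from the source) is to take the minimum f_id among StimSlide frames within each trial:

```r
library(dplyr)

## earliest StimSlide frame per trial; edat is the combined
## eye data imported above
f_ss_tbl <- edat %>%
  filter(phase == "StimSlide") %>%
  group_by(sub_id, t_id) %>%
  summarise(f_ss = min(f_id), .groups = "drop")
```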
1.2.3 Activity: Combine origins
Now that we have the first frame of StimSlide and the adjustment we have to make in milliseconds for the disambiguation point, combine the tables and calculate f_z, which will represent the “zero point” in frames for each trial. Store the resulting table in origins.
# A tibble: 5,643 × 5
sub_id t_id f_ss disambig_point f_z
<int> <int> <int> <int> <int>
1 1 1 338 1171 408
2 1 2 729 1079 794
3 1 3 1124 810 1173
4 1 4 1443 881 1496
5 1 6 2300 1012 2361
6 1 7 2593 1069 2657
7 1 11 4699 1150 4768
8 1 13 5395 1036 5457
9 1 14 5893 1046 5956
10 1 16 6811 1062 6875
# … with 5,633 more rows
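A sketch of the combination, converting the disambiguation point from milliseconds to frames at 60 Hz and rounding to the nearest frame (this reproduces the rows shown above, but treat the rounding convention and the table names f_ss_tbl / origin_adj as assumptions):

```r
library(dplyr)

origins <- f_ss_tbl %>%   # first StimSlide frame per trial (from above)
  inner_join(origin_adj, by = c("sub_id", "t_id")) %>%
  ## ms -> frames: 60 frames per 1000 ms
  mutate(f_z = f_ss + as.integer(round(disambig_point / 1000 * 60))) %>%
  select(sub_id, t_id, f_ss, disambig_point, f_z)
```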
1.2.4 Activity: Time-align
Now we’re ready to calculate a new frame index on our eye data (edat), f_c, which is centered on the zero point, f_z. The resulting table should be called epdat and have the following structure.
# A tibble: 1,341,405 × 7
sub_id t_id f_id f_z f_c x y
<int> <int> <int> <int> <int> <int> <int>
1 1 1 272 408 -136 628 523
2 1 1 273 408 -135 634 529
3 1 1 274 408 -134 633 519
4 1 1 275 408 -133 644 531
5 1 1 276 408 -132 637 520
6 1 1 277 408 -131 635 515
7 1 1 278 408 -130 636 519
8 1 1 279 408 -129 638 518
9 1 1 280 408 -128 642 519
10 1 1 281 408 -127 638 518
# … with 1,341,395 more rows
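A minimal sketch of the centering step, assuming edat and origins as constructed above:

```r
library(dplyr)

## join the per-trial zero point onto the eye data and
## center the frame counter on it
epdat <- edat %>%
  inner_join(select(origins, sub_id, t_id, f_z),
             by = c("sub_id", "t_id")) %>%
  mutate(f_c = f_id - f_z) %>%
  select(sub_id, t_id, f_id, f_z, f_c, x, y)
```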
1.3 Save the data
We’ve reached a stopping point. We’ll want to save the epoched data so that we can use it as our starting point for the next preprocessing stage. We’ll remove the variables f_id and f_z because we no longer need them, and keep only the 1.5 seconds (90 frames) before and after the disambiguation point for each trial.
## if we haven't made a "data-derived" directory, do so now
if (!dir.exists("data-derived")) dir.create("data-derived")
epdat %>%
  filter(f_c >= -90L, f_c <= 90L) %>%
  select(-f_id, -f_z) %>%
  saveRDS(file = "data-derived/edat-epoched.rds")