Longitudinal Data Analysis

longitudinal data analysis

  • If the repeated measurements are taken over time, such data are called repeated measures, time-course, or longitudinal data
  • Studies involving such data are often called longitudinal studies (usually a cohort study, although RCTs can also be longitudinal).

features of longitudinal data

  • Defining feature: repeated observations on individuals, allowing the direct study of change.
  • Note that the measurements are commensurate, i.e. the same variable is measured repeatedly.
  • Longitudinal data require sophisticated statistical techniques because the repeated observations are usually (positively) correlated.
  • Sequential nature of the measures implies that certain types of correlation structures are likely to arise.
  • Correlation must be accounted for to obtain valid inferences.

potential advantages

  • They allow investigation of events that occur over time. They are essential to the study of normal growth and aging, of the effects of individual characteristics, treatments, or environmental exposures on those changes, and of the temporal pattern of response to treatment.
  • Can study the order of events.
  • Permit more complete ascertainment of exposure histories in epidemiologic studies.

example dataset used in this presentation

  • This study was conducted in 16 boys and 11 girls who, at ages 8, 10, 12, and 14, had the distance (mm) from the center of the pituitary gland to the pterygomaxillary fissure measured. Changes in pituitary-pterygomaxillary distance during growth are important in orthodontic therapy.
  • The goals of the study were to describe the distance in boys and girls as simple functions of age, and then to compare those functions for boys and girls.

read in data

The dental dataset contains the following variables:

  • id = a sequential number;
  • sex = sex, a factor with categories 0 = “Girl”, 1 = “Boy”;
  • y8 = Measure at age 8;
  • y10 = Measure at age 10;
  • y12 = Measure at age 12;
  • y14 = Measure at age 14.

library(readr)
library(dplyr)
library(tidyr)      # gather()

dental <- readr::read_csv("dental.csv")

dat <- dental |> 
  dplyr::select(-1) |>                                       # drop the first (row-number) column
  gather(key = ageclass, value = distance, -c(id, sex)) |>   # reshape wide to long
  arrange(id) |> 
  mutate(agesex = paste0(ageclass, sex),
         age  = as.numeric(substr(ageclass, 2, 1000)),       # strip the leading "y"
         age1 = as.factor(age),
         sex1 = ifelse(sex == "Girl", 1, 0))

glimpse of the dataset

(tab <- dat |> 
  head() |> 
  kableExtra::kbl())

| id | sex  | ageclass | distance | agesex  | age | age1 | sex1 |
|---:|------|----------|---------:|---------|----:|-----:|-----:|
|  1 | Girl | y8       |     21.0 | y8Girl  |   8 |    8 |    1 |
|  1 | Girl | y10      |     20.0 | y10Girl |  10 |   10 |    1 |
|  1 | Girl | y12      |     21.5 | y12Girl |  12 |   12 |    1 |
|  1 | Girl | y14      |     23.0 | y14Girl |  14 |   14 |    1 |
|  2 | Girl | y8       |     21.0 | y8Girl  |   8 |    8 |    1 |
|  2 | Girl | y10      |     21.5 | y10Girl |  10 |   10 |    1 |

Exploratory data analysis (EDA)

For the following graphs I'll use my own format.

library(forcats)    # fct_inorder()
library(labelled)   # set_variable_labels()

dental_long <- pivot_longer(dental, cols = starts_with("y"), 
                            names_to = "measurement", values_to = "distance") %>% 
  mutate(
    age = parse_number(measurement),   # readr::parse_number() pulls the age out of "y8", "y10", ...
    measurement = fct_inorder(paste("Measure at age", age))
  ) %>% 
  set_variable_labels(
    age = "Age of the child at measurement",
    measurement = "Label for time measurement",
    distance = "Measurement"
  )

library(rstatix)    # get_summary_stats()
group_by(dental_long, sex, measurement) %>% 
  get_summary_stats(distance, show = c("mean", "sd", "median", "max", "min")) |> 
  kableExtra::kbl()

| sex  | measurement       | variable | n  | mean   | sd    | median | max  | min  |
|------|-------------------|----------|---:|-------:|------:|-------:|-----:|-----:|
| Boy  | Measure at age 8  | distance | 16 | 22.875 | 2.453 | 23.00  | 27.5 | 17.0 |
| Boy  | Measure at age 10 | distance | 16 | 23.812 | 2.136 | 23.50  | 28.0 | 20.5 |
| Boy  | Measure at age 12 | distance | 16 | 25.719 | 2.652 | 25.00  | 31.0 | 22.5 |
| Boy  | Measure at age 14 | distance | 16 | 27.469 | 2.085 | 26.75  | 31.5 | 25.0 |
| Girl | Measure at age 8  | distance | 11 | 21.182 | 2.125 | 21.00  | 24.5 | 16.5 |
| Girl | Measure at age 10 | distance | 11 | 22.227 | 1.902 | 22.50  | 25.0 | 19.0 |
| Girl | Measure at age 12 | distance | 11 | 23.091 | 2.365 | 23.00  | 28.0 | 19.0 |
| Girl | Measure at age 14 | distance | 11 | 24.091 | 2.437 | 24.00  | 28.0 | 19.5 |

(box<-ggplot(dental_long, aes(measurement, distance, fill = measurement)) +
  tvthemes::scale_fill_avatar()+
  geom_boxplot() +
  geom_jitter(width = 0.2) +
  guides(fill = "none") +
  labs(x = "", y = "Dental growth, mm")+
  theme(axis.title = element_text(face = 2),
        axis.title.x = element_text(hjust = 1),
        axis.text.x = element_text(angle = 35, hjust = 1, vjust = 1)))
(box1<-ggplot(dental_long, aes(sex, distance, fill = measurement)) +
  tvthemes::scale_fill_avatar()+
  geom_boxplot() +
  labs(x = "", y = "Dental growth, mm", fill = ""))
group_by(dental_long, sex, measurement) %>% 
  summarise(mean_distance = mean(distance), .groups = "drop") %>% 
  ggplot(aes(sex, mean_distance, fill = measurement, label = round(mean_distance))) +
  geom_col(position = "dodge") +
  tvthemes::scale_fill_avatar()+
  geom_text(position = position_dodge(width = 0.9), vjust = -0.5) +
  coord_flip() +
  labs(x = "", y = "Mean Dental growth, mm", fill = "")

covariance

# covariance matrix of the four repeated measures
cov_obs <- dplyr::select(dental, starts_with("y")) %>% cov()
cov_obs
#>           y8      y10      y12      y14
#> y8  5.925926 3.285256 4.875356 4.039886
#> y10 3.285256 4.653846 3.858974 4.532051
#> y12 4.875356 3.858974 7.938746 6.197293
#> y14 4.039886 4.532051 6.197293 7.654558

correlation matrix

cov2cor(cov_obs)
#>            y8       y10       y12       y14
#> y8  1.0000000 0.6255833 0.7108079 0.5998338
#> y10 0.6255833 1.0000000 0.6348775 0.7593268
#> y12 0.7108079 0.6348775 1.0000000 0.7949980
#> y14 0.5998338 0.7593268 0.7949980 1.0000000

further exploration

library(GGally)   # ggpairs()
ggpairs(dplyr::select(dental, starts_with("y")), lower = list(continuous = "smooth"))

ggpairs(dental, mapping = aes(colour = sex), columns = 3:6,
        lower = list(continuous = "smooth"))

trend by gender

(p <- group_by(dental_long, sex, age) %>% 
  summarise(mean = list(mean_ci(distance)), .groups = "drop") %>%  # mean_ci() (e.g. ggpubr's) returns y, ymin, ymax
  unnest_wider(mean) %>% 
  mutate(agex = age - .05 + .05*(sex == "Boy")) %>%  # offset the two sexes slightly so the points don't overlap
  ggplot(aes(agex, y, col = sex, shape = sex)) +
  geom_point() +
  ggthemes::scale_shape_cleveland() +
  geom_errorbar(aes(ymin = ymin, ymax = ymax), width = 0.2) +
  geom_line() +
  labs(x = "Age, years", y = "Mean Dental growth, mm", shape = "Sex", col = "Sex"))

dental growth per child

(p<-ggplot(dental_long, aes(age, distance, col = factor(id))) +
  geom_point() +
  geom_line() +
  tvthemes::scale_color_avatar()+
  facet_wrap(~ id) +
  labs(x = "Age, years", y = "Dental growth, mm", col = "Child id") +
  guides(col = guide_legend(nrow = 3)))
(q<-ggplot(dental_long, aes(age, distance, col = factor(id))) +
  geom_line() +
  tvthemes::scale_color_avatar()+
  labs(x = "Age, years", y = "Dental growth, mm", col = "Child id") +
  guides(col = guide_legend(nrow = 3)))
(m<-ggplot(dental_long, aes(age, distance)) +
  geom_line(aes(group = factor(id))) +
  geom_smooth() +
  facet_grid(~ sex)) 

Variance Components - matrix V :

  • Maximum Likelihood (ML)
  • restricted maximum likelihood (REML)

What’s the difference between ML and REML

  • ML estimates of variances are known to be biased (downward) in small samples
  • the simplest case is the sample variance: the ML estimator divides by \(n\), whereas the familiar unbiased estimator divides by \(n-1\)

$$var(x)=\frac{1}{n-1}\sum_{i=1}^n(x_i-\bar{x})^2$$

  • REML estimation is a generalization of this idea: it provides unbiased estimates of the parameters in the covariance matrix \(V_i\) in small samples (see the numeric check below)
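
A quick numeric check of the bias correction (a minimal sketch, using child 1's four measurements from the data above):

x <- c(21.0, 20.0, 21.5, 23.0)    # child 1's distances at ages 8, 10, 12, 14
n <- length(x)
sum((x - mean(x))^2) / n          # ML-style estimate: divides by n, biased downward
sum((x - mean(x))^2) / (n - 1)    # bias-corrected estimate: divides by n - 1
var(x)                            # R's var() uses the n - 1 divisor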

fitting marginal models in R

Marginal models can be fitted using the function gls() from the nlme package. The call has the following main arguments:

  • model: a formula specifying the response vector and the covariates to include in the model
  • data: a data frame containing all the variables
  • correlation: a function describing the assumed correlation structure
  • weights: a function describing the assumed within-group heteroscedasticity structure
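
A schematic call putting these arguments together (a sketch; `fit` is a placeholder name, and the compound-symmetry structure with varIdent() weights is just an illustrative choice):

library(nlme)
fit <- gls(distance ~ age * sex,                        # model formula
           data = dat,                                  # long-format data frame from above
           correlation = corCompSymm(form = ~ 1 | id),  # constant correlation within child
           weights = varIdent(form = ~ 1 | age1),       # separate variance per age
           method = "REML")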

covariance and correlation matrices

Covariance Matrix

Figure 1: Covariance Matrix

Properties

  • on the diagonal the variances, off diagonal covariances
  • symmetric \(\rightarrow cov(Y_1 , Y_2 ) = cov(Y_2 , Y_1)\)

Variances, covariances and correlations

  • variance measures how far a set of numbers is spread out (always positive)
  • covariance is a measure of how much two random variables change together (positive or negative)
  • correlation a measure of the linear correlation (dependence) between two variables (between -1 and 1; 0 no correlation)

$$corr(Y_1,Y_2)=\frac{cov(Y_1,Y_2)}{\sqrt{var(Y_1)}\sqrt{var(Y_2)}}$$

  • Because the magnitude of the covariance between \(Y_1\) and \(Y_2\) depends on their variability, we translate the covariance matrix into a correlation matrix

Correlation Matrix

Figure 2: Correlation Matrix

  • We need an appropriate choice for \(V_i\) in order to appropriately describe the correlations between the repeated measurements

  • compound symmetry

  • Gaussian spatial correlation

  • autoregressive process

  • Toeplitz

  • exponential spatial correlation

  • Unstructured - Every variance and covariance term for observations within a subject is a separate parameter and is therefore estimated uniquely; no patterns among variances or correlations are assumed. This structure offers maximum flexibility but is most costly in terms of parameters estimated.
  • Compound symmetry - Assume variance is constant across all time points and correlation is constant across all pairs of time points. This structure is highly restrictive but least costly in terms of parameters estimated.
  • Autoregressive - Assume variance is constant across all time points, but correlation drops off in a systematic fashion as the gap in time increases. Autoregressive models expand compound symmetry by allowing for a common structure where points closest in time are most highly correlated.

  • Toeplitz - Toeplitz is similar to the autoregressive model, except that it does not impose any structure on the decline in correlation as time gaps increase. Thus, it requires more parameters to be estimated than the autoregressive model while providing additional flexibility.
  • Heterogeneous variances - The assumption that variances are equal across time points found in the compound symmetry, autoregressive, and Toeplitz models can be relaxed by introducing additional parameters to allow unequal (heterogeneous) variances.
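
In nlme these structures map onto the following constructors, passed via the correlation argument (a sketch; heterogeneous variances are handled separately through the weights argument):

library(nlme)
corSymm(form = ~ 1 | id)                  # unstructured correlation
corCompSymm(form = ~ 1 | id)              # compound symmetry
corAR1(form = ~ 1 | id)                   # autoregressive of order 1
corARMA(form = ~ 1 | id, p = 3, q = 0)    # used below for a Toeplitz-type structure
corGaus(form = ~ age | id)                # Gaussian spatial correlation
corExp(form = ~ age | id)                 # exponential spatial correlation
varIdent(form = ~ 1 | age1)               # heterogeneous variances, via weights =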

doing it in R

define a function in R

corandcov <- function(glsob, cov = TRUE, ...){
  # within-subject correlation matrix (element [[5]]: one subject with all four visits)
  corm <- corMatrix(glsob$modelStruct$corStruct)[[5]]
  print(corm)
  # variance function: per-occasion standard-deviation ratios
  varstruct <- glsob$modelStruct$varStruct
  print(varstruct)
  varests <- coef(varstruct, unconstrained = FALSE, allCoef = TRUE)
  # covariance = correlation * sigma^2, scaled by the outer product of the SD ratios
  covm <- corm * glsob$sigma^2 * t(t(varests)) %*% t(varests)
  return(covm)
}

unstructured mean and unstructured covariance

## model 1: unstructured mean (a separate cell mean per age-sex combination)
## and unstructured covariance
model<-gls(distance~-1+age1*sex1,data=dat,corr=corSymm(form=~1|id) , 
           weights=varIdent(form=~1|age1),
          method="ML")

cc <- corMatrix(model$modelStruct$corStruct)[[5]]
print(cc)
#>           [,1]      [,2]      [,3]      [,4]
#> [1,] 1.0000000 0.5706970 0.6613212 0.5215786
#> [2,] 0.5706970 1.0000000 0.5631718 0.7262192
#> [3,] 0.6613212 0.5631718 1.0000000 0.7280975
#> [4,] 0.5215786 0.7262192 0.7280975 1.0000000
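
The corandcov() helper defined earlier wraps these steps and additionally returns the implied covariance matrix:

# prints the correlation matrix and variance structure, returns the covariance matrix
cov_unstr <- corandcov(model)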

linear average trend

# linear average trend in age; note age*sex already contains the sex main effect
model2<-gls(distance~age*sex +sex ,data=dat,corr=corSymm(form=~1|id) , 
           weights=varIdent(form=~1|age1),
          method="ML")

cc <- corMatrix(model2$modelStruct$corStruct)[[5]]
print(cc)
#>           [,1]      [,2]      [,3]      [,4]
#> [1,] 1.0000000 0.5443358 0.6525650 0.5187527
#> [2,] 0.5443358 1.0000000 0.5607202 0.7190277
#> [3,] 0.6525650 0.5607202 1.0000000 0.7275928
#> [4,] 0.5187527 0.7190277 0.7275928 1.0000000

compare model 1 and 2

anova(model2,model,type="LR")
#>        Model df      AIC      BIC    logLik   Test  L.Ratio p-value
#> model2     1 14 447.4770 485.0269 -209.7385                        
#> model      2 18 452.5093 500.7877 -208.2546 1 vs 2 2.967746  0.5632

Parallel average profiles

# parallel profiles: common slope for boys and girls (no age:sex interaction)
model3<-gls(distance~age+sex,data=dat,corr=corSymm(form=~1|id) , 
           weights=varIdent(form=~1|age1),
          method="ML")
cc <- corMatrix(model3$modelStruct$corStruct)[[5]]
print(cc)
#>           [,1]      [,2]      [,3]      [,4]
#> [1,] 1.0000000 0.5623195 0.6420580 0.4754107
#> [2,] 0.5623195 1.0000000 0.5416056 0.6520613
#> [3,] 0.6420580 0.5416056 1.0000000 0.7249401
#> [4,] 0.4754107 0.6520613 0.7249401 1.0000000

compare all the three models

anova(model,model2,model3,type="LR")
#>        Model df      AIC      BIC    logLik   Test  L.Ratio p-value
#> model      1 18 452.5093 500.7877 -208.2546                        
#> model2     2 14 447.4770 485.0269 -209.7385 1 vs 2 2.967746  0.5632
#> model3     3 13 452.1527 487.0204 -213.0763 2 vs 3 6.675655  0.0098

Toeplitz covariance structure

# AR(3) via corARMA(p = 3): with four occasions this gives one correlation per lag,
# i.e. a Toeplitz-type pattern (see the fitted matrix below)
model4<-gls(distance~age*sex+sex,data=dat,corr=corARMA(form=~1|id,p=3,q=0) , 
           weights=varIdent(form=~1|age1),
          method="ML")
cc <- corMatrix(model4$modelStruct$corStruct)[[5]]
print(cc)
#>           [,1]      [,2]      [,3]      [,4]
#> [1,] 1.0000000 0.6221903 0.6953509 0.4812799
#> [2,] 0.6221903 1.0000000 0.6221903 0.6953509
#> [3,] 0.6953509 0.6221903 1.0000000 0.6221903
#> [4,] 0.4812799 0.6953509 0.6221903 1.0000000

compare all the 4 models

anova(model,model2,model3,model4,type="LR")
#>        Model df      AIC      BIC    logLik   Test  L.Ratio p-value
#> model      1 18 452.5093 500.7877 -208.2546                        
#> model2     2 14 447.4770 485.0269 -209.7385 1 vs 2 2.967746  0.5632
#> model3     3 13 452.1527 487.0204 -213.0763 2 vs 3 6.675655  0.0098
#> model4     4 11 444.4973 474.0008 -211.2487 3 vs 4 3.655361  0.1608

AR(1) covariance structure

# AR(1): correlation decays geometrically as the lag between occasions grows
model5<-gls(distance~age*sex+sex,data=dat,corr=corAR1(form=~1|id) , 
           weights=varIdent(form=~1|age1),
          method="ML")
cc <- corMatrix(model5$modelStruct$corStruct)[[5]]
print(cc)
#>           [,1]      [,2]      [,3]      [,4]
#> [1,] 1.0000000 0.6187219 0.3828168 0.2368571
#> [2,] 0.6187219 1.0000000 0.6187219 0.3828168
#> [3,] 0.3828168 0.6187219 1.0000000 0.6187219
#> [4,] 0.2368571 0.3828168 0.6187219 1.0000000

compare all models

anova(model,model2,model3,model4,model5)
#>        Model df      AIC      BIC    logLik   Test   L.Ratio p-value
#> model      1 18 452.5093 500.7877 -208.2546                         
#> model2     2 14 447.4770 485.0269 -209.7385 1 vs 2  2.967746  0.5632
#> model3     3 13 452.1527 487.0204 -213.0763 2 vs 3  6.675655  0.0098
#> model4     4 11 444.4973 474.0008 -211.2487 3 vs 4  3.655361  0.1608
#> model5     5  9 456.7470 480.8862 -219.3735 4 vs 5 16.249639  0.0003
  • Model 4 (the Toeplitz-type structure) attains the smallest AIC of the five models, and the likelihood-ratio test clearly rejects the more restrictive AR(1) structure (p = 0.0003), so model 4 is the preferred covariance specification among those considered.

General Linear Mixed effects model

  • Linear MIXED models contain both fixed and random effects (hence the name “mixed” models)
  • These models are a way of dealing with correlated data so that we can address research questions in much the same way as with independent data; inferences about within-subject effects can be conducted in a similar way to (standard) between-subject effects.
  • In these models, some subset of the regression coefficients vary randomly from one individual to another, accounting for natural heterogeneity across subjects. Individuals in the population are assumed to have their own subject-specific mean response trajectories over time.
  • The mean response is modeled as a combination of population characteristics (fixed effects) assumed to be shared by all individuals and subject-specific effects (random effects) that are unique to a particular individual.
  • Linear mixed models are a particular type of hierarchical model containing both fixed and random effects.

Two - Stage Approach

Growth curve models can be motivated in terms of a two-stage model. In the two-stage model, we assume

  1. A straight line (or curve) fits the observed responses for each subject (first stage)
  2. A regression model relating the mean of the individual intercepts and slopes to subject-specific covariates (second stage)

stage 1

$$Y_{ij}=v_{i1}+v_{i2}t_{ij}+\epsilon_{ij}$$

where \(v_{i1}\) and \(v_{i2}\) are parameters specific to the \(i\)th subject and the errors, \(\epsilon_{ij}\), are implicitly assumed to be independent within a subject.

Doing it in R : method 1

library(nlme)
rel_model <- lmList(distance ~ age, data = Orthodont)   # separate lm fit per subject
plot(augPred(rel_model), grid = TRUE)                   # observed points with fitted lines

method 2 : using R’s nested data capabilities

# created nested dataframes 
dental_nested <- Orthodont %>%  
              group_by(Subject) %>% 
              nest() 

dental_nested
#> # A tibble: 27 x 2
#> # Groups:   Subject [27]
#>    Subject data            
#>    <ord>   <list>          
#>  1 M01     <tibble [4 x 3]>
#>  2 M02     <tibble [4 x 3]>
#>  3 M03     <tibble [4 x 3]>
#>  4 M04     <tibble [4 x 3]>
#>  5 M05     <tibble [4 x 3]>
#>  6 M06     <tibble [4 x 3]>
#>  7 M07     <tibble [4 x 3]>
#>  8 M08     <tibble [4 x 3]>
#>  9 M09     <tibble [4 x 3]>
#> 10 M10     <tibble [4 x 3]>
#> # i 17 more rows

create linear models for each subject or individual

library(purrr)    # map()
dental_models <- dental_nested %>% 
  mutate(
    model = map(data, ~lm(distance ~ age, data = .x))   # one OLS fit per child
  )
dental_models
#> # A tibble: 27 x 3
#> # Groups:   Subject [27]
#>    Subject data             model 
#>    <ord>   <list>           <list>
#>  1 M01     <tibble [4 x 3]> <lm>  
#>  2 M02     <tibble [4 x 3]> <lm>  
#>  3 M03     <tibble [4 x 3]> <lm>  
#>  4 M04     <tibble [4 x 3]> <lm>  
#>  5 M05     <tibble [4 x 3]> <lm>  
#>  6 M06     <tibble [4 x 3]> <lm>  
#>  7 M07     <tibble [4 x 3]> <lm>  
#>  8 M08     <tibble [4 x 3]> <lm>  
#>  9 M09     <tibble [4 x 3]> <lm>  
#> 10 M10     <tibble [4 x 3]> <lm>  
#> # i 17 more rows

coefficients of multiple models

library(broom)    # tidy(): coefficient table per model
dental_models %>%  
  mutate(coef = map(model, ~tidy(.x))) %>% 
  unnest(coef) 
#> # A tibble: 54 x 8
#> # Groups:   Subject [27]
#>    Subject data             model  term     estimate std.error statistic p.value
#>    <ord>   <list>           <list> <chr>       <dbl>     <dbl>     <dbl>   <dbl>
#>  1 M01     <tibble [4 x 3]> <lm>   (Interc~   17.3       3.85      4.50  0.0461 
#>  2 M01     <tibble [4 x 3]> <lm>   age         0.95      0.343     2.77  0.109  
#>  3 M02     <tibble [4 x 3]> <lm>   (Interc~   14.8       2.62      5.67  0.0297 
#>  4 M02     <tibble [4 x 3]> <lm>   age         0.775     0.233     3.32  0.0798 
#>  5 M03     <tibble [4 x 3]> <lm>   (Interc~   16         3.55      4.51  0.0459 
#>  6 M03     <tibble [4 x 3]> <lm>   age         0.75      0.316     2.37  0.141  
#>  7 M04     <tibble [4 x 3]> <lm>   (Interc~   24.7       2.23     11.1   0.00803
#>  8 M04     <tibble [4 x 3]> <lm>   age         0.175     0.198     0.882 0.471  
#>  9 M05     <tibble [4 x 3]> <lm>   (Interc~   13.7       3.57      3.82  0.0622 
#> 10 M05     <tibble [4 x 3]> <lm>   age         0.85      0.318     2.67  0.116  
#> # i 44 more rows

model performance

library(broom)
model_perf<-dental_models %>% 
  mutate(coef=map(model,~glance(.x))) %>% 
  unnest(coef);model_perf
#> # A tibble: 27 x 15
#> # Groups:   Subject [27]
#>    Subject data     model  r.squared adj.r.squared sigma statistic p.value    df
#>    <ord>   <list>   <list>     <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>
#>  1 M01     <tibble> <lm>       0.793        0.690  1.53      7.68  0.109       1
#>  2 M02     <tibble> <lm>       0.847        0.770  1.04     11.0   0.0798      1
#>  3 M03     <tibble> <lm>       0.738        0.607  1.41      5.63  0.141       1
#>  4 M04     <tibble> <lm>       0.280       -0.0800 0.887     0.778 0.471       1
#>  5 M05     <tibble> <lm>       0.781        0.672  1.42      7.14  0.116       1
#>  6 M06     <tibble> <lm>       0.992        0.988  0.194   243.    0.00409     1
#>  7 M07     <tibble> <lm>       0.898        0.847  0.851    17.7   0.0522      1
#>  8 M08     <tibble> <lm>       0.324       -0.0144 1.71      0.957 0.431       1
#>  9 M09     <tibble> <lm>       0.311       -0.0339 4.59      0.902 0.443       1
#> 10 M10     <tibble> <lm>       0.9          0.85   0.791    18.0   0.0513      1
#> # i 17 more rows
#> # i 6 more variables: logLik <dbl>, AIC <dbl>, BIC <dbl>, deviance <dbl>,
#> #   df.residual <int>, nobs <int>

visually inspect the fit of your models

augmented_models<-dental_models %>% 
  mutate(augmented=map(model,~augment(.x))) %>% 
  unnest(augmented)

visually inspect the models

p <- augmented_models %>% 
  ggplot(aes(x = age, y = distance)) +           # age on the x-axis, distance on the y-axis
  geom_point() +
  geom_line(aes(y = .fitted), color = "red") +   # per-child fitted regression line
  facet_wrap(~Subject, scale = "free")

p

stage 2 : the intercepts and slopes are regressed on other covariates:

$$v_{i1} = \alpha_1 + X_i\beta_1 + e_{i1}$$ $$v_{i2} = \alpha_2 + X_i\beta_2 + e_{i2}$$

library(metafor)                         # rma.mv(), bldiag()
b <- lapply(rel_model, coef)             # per-child intercept and slope estimates
V <- lapply(rel_model, vcov)             # their estimated covariance matrices
estm <- rep(c("intercept", "slope"), length(b))
subj <- rep(names(b), each = 2)
b <- unlist(b)
V <- bldiag(V)                           # stack into one block-diagonal matrix

results for stage 2

(res2<-rma.mv(b~estm-1,V,random=~estm|subj,struct="UN"))
#> 
#> Multivariate Meta-Analysis Model (k = 54; method: REML)
#> 
#> Variance Components:
#> 
#> outer factor: subj (nlvls = 27)
#> inner factor: estm (nlvls = 2)
#> 
#>             estim    sqrt  k.lvl  fixed      level 
#> tau^2.1    5.6463  2.3762     27     no  intercept 
#> tau^2.2    0.0478  0.2187     27     no      slope 
#> 
#>            rho.intr  rho.slop    intr  slop 
#> intercept         1                 -    27 
#> slope       -0.5726         1      no     - 
#> 
#> Test for Residual Heterogeneity:
#> QE(df = 52) = 1611.6315, p-val < .0001
#> 
#> Test of Moderators (coefficients 1:2):
#> QM(df = 2) = 3080.0218, p-val < .0001
#> 
#> Model Results:
#> 
#>                estimate      se     zval    pval    ci.lb    ci.ub      
#> estmintercept   17.6669  0.6109  28.9196  <.0001  16.4696  18.8642  *** 
#> estmslope        0.5762  0.0555  10.3868  <.0001   0.4675   0.6850  *** 
#> 
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

drawbacks of the two-stage approach

  • The two-stage method is less attractive when the number and timing of observations varies among subjects, because it does not take proper account of the weighting.
  • Also, note that the two-stage formulation of the growth curve model imposes certain restrictions and structure on the covariates.
  • That is, in the two-stage approach covariates at the first stage (except for the intercept) must be time-varying, while covariates at the second stage must be time-invariant.
  • In contrast, in the mixed effects model the only restriction is that the columns of \(Z_i\) are a subset of the columns of \(X_i\).

the general mixed effects model

  • the two-stage model can be performed explicitly in the analysis
  • the associated drawbacks of the two-stage model can be avoided by combining the two stages into one model, such that

$$Y_i=Z_i\beta_i+\epsilon_i \cdots (1)$$

$$\beta_i=K_i\beta+b_i \cdots (2)$$ can be simplified to

$$Y_i=X_i\beta+Z_ib_i+\epsilon_i$$ where \(Z_iK_i=X_i\)

it is given that

$$b_i \sim N(0,D), \qquad \epsilon_i \sim N(0,R)$$ where \(\epsilon_i\) is the \(n_i \times 1\) error vector and \(b_1,\cdots,b_N,\epsilon_1,\cdots,\epsilon_N\) are independent
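
A small simulation from this model makes the pieces concrete (a sketch; the parameter values are illustrative, chosen to be close to the estimates reported later in this section):

set.seed(1)
N <- 27                                   # number of children
ages <- c(8, 10, 12, 14)
b <- rnorm(N, 0, sqrt(3.75))              # random intercepts: b_i ~ N(0, D)
sim <- expand.grid(id = 1:N, age = ages)
sim$distance <- 17.7 + 0.58 * sim$age +   # fixed part: X_i beta
  b[sim$id] +                             # random part: Z_i b_i (intercept only)
  rnorm(nrow(sim), 0, sqrt(4.93))         # errors: epsilon_i ~ N(0, R)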

Terminology

  • Fixed effects : \(\beta\) : \(p \times 1\) vector of fixed-effects regression coefficients

These are the ones we know (and love), where we would expect the difference between two groups to be FIXED as we move from the sample to the population.

  • Random effects : \(b_i\) : \(q \times 1\) vector of random-effects regression coefficients

Random effects are those where the particular ’groups’ of observations (levels) are drawn randomly from a population of groups.

  • Variance components are elements in \(D\) and \(R\)

  • \(Y_i\) : \(n_i \times 1\) response vector

  • \(X_i\) : \(n_i \times p\) design matrix of fixed effects

  • \(Z_i\) : \(n_i \times q\) design matrix of random effects

hence

$$Y_i \sim N(X_i\beta,V(\alpha)), \qquad V(\alpha)=Z_iDZ_i^\top+R$$

features

  • D must be symmetric and positive definite

  • \(Z_i\) columns are subsets of \(X_i\) columns

  • conditional mean of \(Y_i\) given random effects \(b_i\) is \(E(Y_i|b_i)=X_i\beta+Z_ib_i\)

  • the marginal mean of \(Y_i\) is \(E(E(Y_i|b_i))=E(X_i\beta+Z_ib_i)=X_i\beta\)

Matrix D

\(D\) is the \(q \times q\) covariance matrix of the random effects \(b_i\).

Matrix R

\(R\) is the covariance matrix of the \(n_i \times 1\) within-subject error vector \(\epsilon_i\).

contrasting with the general linear model

fitting the general linear mixed effects model (lme() from the nlme package)

library(nlme)
# random intercept and random slope for age within each subject
model <- lme(distance ~ age, random = ~ age | Subject, data = Orthodont)

#equatiomatic::extract_eq(model)
  • the estimates are very close to the two-stage results above (see the check below)
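
A quick way to verify this (a sketch) is to extract the fixed effects and compare them with the stage-two estimates of 17.67 (intercept) and 0.58 (slope):

nlme::fixef(model)   # fixed-effects estimates from the combined mixed model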

fitting using lmer() (from lme4, here via the lmerTest package)

library(lmerTest)   # lmer() with Satterthwaite t-tests; loads lme4
lin_0 <- lmer(distance ~ 1 + (1 | id), data = dental_long)
summary(lin_0)
#> Linear mixed model fit by REML. t-tests use Satterthwaite's method [
#> lmerModLmerTest]
#> Formula: distance ~ 1 + (1 | id)
#>    Data: dental_long
#> 
#> REML criterion at convergence: 515.4
#> 
#> Scaled residuals: 
#>     Min      1Q  Median      3Q     Max 
#> -3.2400 -0.5277 -0.1073  0.4732  2.7687 
#> 
#> Random effects:
#>  Groups   Name        Variance Std.Dev.
#>  id       (Intercept) 3.752    1.937   
#>  Residual             4.930    2.220   
#> Number of obs: 108, groups:  id, 27
#> 
#> Fixed effects:
#>             Estimate Std. Error      df t value Pr(>|t|)    
#> (Intercept)  24.0231     0.4297 26.0000   55.91   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#equatiomatic::extract_eq(lin_0)

interpretation

  • The estimated marginal mean of the dental distance is \(\beta_0=24.02\) mm
  • The estimated variance of the random effect, reflecting between-subject variability, is 3.752; the estimated variance of the error term, reflecting within-subject variability, is 4.930.
  • The correlation between any two repeated measures on the same child (the intraclass correlation, ICC) is \(3.75/(3.75+4.93)=0.43\), computed below
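
The ICC can be computed directly from the fitted variance components (a minimal sketch):

# intraclass correlation: between-subject variance over total variance
vc <- as.data.frame(VarCorr(lin_0))
vc$vcov[1] / sum(vc$vcov)   # 3.752 / (3.752 + 4.930) = 0.43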

ranova(lin_0)
#> ANOVA-like table for random-effects: Single term deletions
#> 
#> Model:
#> distance ~ (1 | id)
#>          npar  logLik    AIC   LRT Df Pr(>Chisq)    
#> <none>      3 -257.68 521.36                        
#> (1 | id)    2 -269.14 542.28 22.92  1  1.689e-06 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

comment

  • The small p-value indicates between-individual heterogeneity, which supports choosing a mixed-effects model rather than a fixed-effects-only model.