Project
Trajectory analysis with R
Trajectory analysis · Group analysis · Data science
Overview
We can quickly fall into the trap of assuming that a given disease develops similarly for all patients across time, or that all repeat customers are alike. Trajectory analysis allows us to see beyond this and learn what sub-groups of patients, customers, or people comprise our data. We can then create models to see what predicts membership in one group over another.
Approach
This project was completed in R, using packages such as:
traj
dplyr
nnet
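These can be loaded up front, along with tidyr, whose pivot_wider function appears below. A minimal setup, assuming each package is already installed:
# loading the packages used throughout
library(traj)   # trajectory analysis (Step1Measures, Step2Selection, Step3Clusters)
library(dplyr)  # data manipulation (select, mutate, filter, joins)
library(tidyr)  # reshaping, for pivot_wider
library(nnet)   # multinomial logistic regression, for multinom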
Within the crowd, there is nuance
When we talk about brain health, particularly about diseases like dementia or Alzheimer’s, it’s easy to fall into binary thinking. We simplify the story: either someone has dementia, or they don’t. But in reality, brain health—like most areas of healthcare—is far more complex than a simple yes-or-no diagnosis.
Two people can both receive a diagnosis of dementia on the same day, but five years later, they might be in very different places. One person could still be functioning relatively independently, with only mild memory issues. The other might have rapidly progressed to severe cognitive decline, requiring full-time care. This isn’t just chance — there are underlying patterns, or trajectories, that we can map out and analyze to better understand why people with the same diagnosis can experience such different outcomes.
These trajectories aren’t random. They’re shaped by an array of factors: genetics, lifestyle, early diagnosis, access to care, and other medical conditions. By analyzing these patterns, we can begin to see the nuance beneath the diagnosis. Among those with dementia, we may be able to identify sub-groups, or clusters of individuals whose trajectories are similar. We can also try to see if there are other variables that predict membership in one of these trajectories.
This allows us to move away from treating dementia as a monolithic condition and toward a more personalized approach to care, where we anticipate the likely path someone will take and intervene earlier and more effectively.
To the OASIS dataset
The Open Access Series of Imaging Studies (OASIS) is an open-access initiative aimed at making clinical neuroscience datasets available to the scientific community. In particular, OASIS-2 involves longitudinal neuroimaging and clinical measurement of normal aging and Alzheimer’s disease. These data are available through Kaggle, and contain summary information on clinical measures (such as the Mini-Mental State Exam) and MRI measures (including normalized whole brain volume and estimated total intracranial volume). We will use these data for our example analysis.
Data will first be imported and converted to a “wide” format, as required by the trajectory analysis package, traj.
# reading data
dementia <- read.csv(path, na.strings = c('', 'NA'))

# pivoting to a wide dataframe, required by `traj`
dementia_wide <- dementia %>%
  pivot_wider(names_from = Visit,
              values_from = c(Group, MR.Delay, M.F, Age, Hand, EDUC, MMSE, CDR, eTIV, nWBV, ASF),
              id_cols = Subject.ID)
Because there are so few cases at the latest timepoints, we will create some synthetic data (based on the means and standard deviations of the original data) to grow the total sample to five times its original size. To exaggerate the clustering analysis, these data will be made artificially noisy by introducing jitter to a random sub-set of numeric columns and then resampling every column with replacement. This means that the data are manipulated for illustrative purposes, and not representative of true trends in the OASIS data.
# original number of rows, and number of synthetic rows to generate
# (4x the original, so the combined data are 5x the original)
n_original <- nrow(dementia_wide)
n_synthetic <- 4 * n_original

# creating an empty dataframe to hold the new values
dementia_wide_synth <- data.frame(matrix(ncol = ncol(dementia_wide), nrow = n_synthetic))
names(dementia_wide_synth) <- names(dementia_wide)

numeric_cols <- names(dementia_wide)[sapply(dementia_wide, is.numeric)]

# setting a seed and selecting half the numeric columns to receive jitter
# (floor() guards against a non-integer count)
set.seed(189)
numeric_cols_jitter <- sample(numeric_cols, floor(length(numeric_cols) / 2))

# generate synthetic data for each column
for (col in names(dementia_wide)) {
  if (is.numeric(dementia_wide[[col]])) {
    col_mean <- mean(dementia_wide[[col]], na.rm = TRUE)
    col_sd <- sd(dementia_wide[[col]], na.rm = TRUE)
    # generate synthetic data from a normal distribution
    samples <- rnorm(n_synthetic, mean = col_mean, sd = col_sd)
    # for the selected 50% of numeric columns, add extra jitter
    if (col %in% numeric_cols_jitter) {
      jitter_amount <- 0.50 * col_sd
      samples <- samples + rnorm(n_synthetic, mean = 0, sd = jitter_amount)
    }
    dementia_wide_synth[[col]] <- samples
  } else {
    # for non-numeric columns, sample only from the non-NA values
    non_na_values <- dementia_wide[[col]][!is.na(dementia_wide[[col]])]
    if (length(non_na_values) > 0) {
      dementia_wide_synth[[col]] <- sample(non_na_values, n_synthetic, replace = TRUE)
    } else {
      dementia_wide_synth[[col]] <- NA
    }
  }
}

# resample every synthetic column with replacement, breaking within-row structure
for (col in names(dementia_wide_synth)) {
  dementia_wide_synth[[col]] <- sample(dementia_wide_synth[[col]], n_synthetic, replace = TRUE)
}

# merge the original and synthetic data
dementia_wide_combined <- rbind(dementia_wide, dementia_wide_synth)
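As a quick sanity check, the combined dataframe should now hold five times the original number of rows:
# confirming the combined sample is 5x the original
nrow(dementia_wide_combined) / n_original  # should return 5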
Next, we’ll select the columns for this analysis and rename them. This step could have been performed earlier, with only the selected columns made noisier and appended with synthetic data. Completing the operations in this order instead allows us to retain a full synthetic, noisy dataset for any further analysis.
We will keep the mmse variables, related to the clinical Mini-Mental State Exam (a commonly administered neurological test). We will also keep one MRI-related measure, wbv, for whole brain volume (the nWBV columns in the original data).
# selecting columns of interest, and creating a time (t) variable in years
dementia_clean <- dementia_wide_combined %>%
  select(1, 7:11, 32:36, 47:51) %>%
  mutate(t1 = (mean(MR.Delay_1, na.rm = TRUE)/365),
         t2 = (mean(MR.Delay_2, na.rm = TRUE)/365),
         t3 = (mean(MR.Delay_3, na.rm = TRUE)/365),
         t4 = (mean(MR.Delay_4, na.rm = TRUE)/365),
         t5 = (mean(MR.Delay_5, na.rm = TRUE)/365))

# dropping the raw MR.Delay columns, keeping the ID, measures, and t variables
dementia_clean <- dementia_clean[c(1, 7:21)]

dementia_wide_cols <- c('Subject.ID',
                        'mmse1', 'mmse2', 'mmse3', 'mmse4', 'mmse5',
                        'wbv1', 'wbv2', 'wbv3', 'wbv4', 'wbv5',
                        't1', 't2', 't3', 't4', 't5')
colnames(dementia_clean) <- dementia_wide_cols

# simplifying the subject IDs by removing the 'OAS2_' prefix
dementia_clean$Subject.ID <- gsub('OAS2_', '', dementia_clean$Subject.ID)
Moving on to the traj package. First, we need to set up the data in a way that is usable by the package, in matrix format. The package requires the temporal data to be split from the measured data (with an identifier retained in both matrices), and for only complete cases to be used.
# the package requires complete cases, in matrix format
dementia_clean <- dementia_clean[complete.cases(dementia_clean), ]
dementia_matrix <- as.matrix(dementia_clean)

# separating the matrix into the mmse, wbv, and time variables, as required
# (each retains the subject ID in the first column)
dementia_mmse <- dementia_matrix[, c(1, 2:6)]
dementia_mmse <- apply(dementia_mmse, 2, as.numeric)

dementia_wbv <- dementia_matrix[, c(1, 7:11)]
dementia_wbv <- apply(dementia_wbv, 2, as.numeric)

dementia_time <- dementia_matrix[, c(1, 12:16)]
dementia_time <- apply(dementia_time, 2, as.numeric)
The data are now ready for Steps 1 through 3 of the traj package. The purpose of each step is summarized below. Details are included in the official documentation.
Step 1: Compute Measures of Change
In this step, the package calculates a set of features (up to 18 measures) that capture various aspects of individual trajectories. These measures include basic statistics like the range, mean, and slope of a linear model, and others that describe the pattern of change over time (e.g., rate of change, variability). The goal here is to quantify key characteristics of each trajectory.

Step 2: Dimensionality Reduction
After the measures are computed, this step uses Principal Component Analysis (PCA) or similar techniques to reduce the dimensionality of the data. The idea is to select the most informative measures (those that explain the most variance) while removing redundant ones (e.g., if two measures are highly correlated). This reduction ensures the final clustering is based on the most meaningful aspects of the data.

Step 3: Clustering
Finally, the k-means clustering algorithm (or similar) is applied to group the trajectories (alongside a bootstrapping technique) into distinct clusters. The number of clusters can either be specified by the user or selected automatically using criteria like the Calinski-Harabasz index. This step ultimately identifies which trajectories follow similar patterns, creating groups of individuals with shared longitudinal profiles.
# running steps 1, 2, and 3 on the MMSE data
dementia_mmse_s1 <- traj::Step1Measures(dementia_mmse, dementia_time, ID = TRUE)
dementia_mmse_s2 <- traj::Step2Selection(dementia_mmse_s1)
dementia_mmse_s3 <- traj::Step3Clusters(dementia_mmse_s2, nclusters = NULL)
plot(dementia_mmse_s3)
With the plot command run on the last traj output, we can access a number of visuals. One that is especially useful when nclusters is set to NULL (which bootstraps the analysis to find the ideal number of trajectory groups) is the Gap Statistic plot.
Here, much as with a scree plot, we can gain a sense of how many clusters to use: a higher gap statistic suggests better separation between trajectory groups, and an “elbow” suggests a point where we hit a granularity/explainability trade-off. There is no clear elbow in this plot of synthetic, noisy data, but for illustration, we’ll now set nclusters = 5.
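Concretely, that means re-running Step 3 with the cluster count fixed:
# re-running step 3 with five clusters, then plotting the trajectory groups
dementia_mmse_s3 <- traj::Step3Clusters(dementia_mmse_s2, nclusters = 5)
plot(dementia_mmse_s3)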
The first plot shows MMSE trajectories, overlaid on one another. We can see here that patients typically follow the same trajectory up to about 4 years, after which point there is divergence. The second plot shows the number of subjects that comprise each trajectory group (or cluster), along with error bars to show how tightly patients follow the same trajectory. Those in Cluster 5 have a more variable trajectory relative to the others.
A similar exercise on whole brain volume leads to three distinct trajectory groups, which may be called “Early gainers” (Cluster 1), “Late gainers” (Cluster 2), and “Steady decliners” (Cluster 3).
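That exercise is not shown in full, but it simply mirrors the MMSE pipeline with the wbv matrix swapped in (a sketch, with nclusters = 3 reflecting the three groups just described):
# the same three steps, run on whole brain volume
dementia_wbv_s1 <- traj::Step1Measures(dementia_wbv, dementia_time, ID = TRUE)
dementia_wbv_s2 <- traj::Step2Selection(dementia_wbv_s1)
dementia_wbv_s3 <- traj::Step3Clusters(dementia_wbv_s2, nclusters = 3)
plot(dementia_wbv_s3)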
We can now assign a cluster number (here, from the MMSE analysis) to each subject in the original dataset, for additional plotting or analysis.
# creating a dataframe with the clusters/groups from the MMSE analysis
dementia_traj_clust <- as.data.frame(dementia_mmse_s3$partition)
colnames(dementia_traj_clust)[1] <- 'Subject.ID'

# matching the ID format of the original data before joining
dementia$Subject.ID <- gsub('OAS2_', '', dementia$Subject.ID)
dementia$Subject.ID <- as.numeric(dementia$Subject.ID)

dementia_traj_clust_group <- right_join(dementia, dementia_traj_clust, by = 'Subject.ID')
dementia_traj_clust_group$Cluster <- as.factor(dementia_traj_clust_group$Cluster)
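For example, a quick base-R look at whether baseline brain volume differs across the MMSE trajectory clusters (a sketch; any clinical variable could be substituted for nWBV):
# comparing Visit 1 brain volume across the trajectory clusters
boxplot(nWBV ~ Cluster,
        data = subset(dementia_traj_clust_group, Visit == 1),
        xlab = 'Trajectory cluster',
        ylab = 'Normalized whole brain volume (nWBV)')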
And from here we can run a simple multinomial logistic regression on just the Visit 1 data (to avoid issues with independence of observations, which we would have if we kept all timepoints in the current dataframe and modeled using nnet).
# identifying predictors of group membership, based on the trajectory analyses above
dementia_traj_clust_group$Cluster <- relevel(
  dementia_traj_clust_group$Cluster, ref = "1"
)

# limiting to just the first timepoint
dementia_time1 <- dementia_traj_clust_group %>%
  filter(Visit == 1)
# running the multinomial logistic regression
traj_model <- multinom(Cluster ~ Group + M.F + Age + SES + MMSE + nWBV, data = dementia_time1)
summary(traj_model)
exp(coef(traj_model))
# producing z and p scores
traj_model_z <- summary(traj_model)$coefficients / summary(traj_model)$standard.errors
traj_model_z
traj_model_p <- (1 - pnorm(abs(traj_model_z), 0, 1)) * 2
traj_model_p
# model summary
Call:
multinom(formula = Cluster ~ Group + M.F + Age + SES + MMSE +
nWBV, data = dementia_time1)
Coefficients:
(Intercept) GroupDemented GroupNondemented M.FM Age SES MMSE nWBV
2 4.803783 0.21841554 0.6097253 0.3641153 -0.037950742 -0.1912469 0.008891539 -3.370123
3 -3.749771 0.11719489 0.4193449 0.4101102 0.008221361 -0.1982510 0.026360206 2.956908
4 -6.712865 0.05503565 -0.5249516 -0.1129051 -0.005354362 -0.1743867 0.011968925 9.136123
5 -14.135524 0.40001508 0.1195039 -1.8120214 0.070176759 -0.3264990 -0.094225454 12.778287
Std. Errors:
(Intercept) GroupDemented GroupNondemented M.FM Age SES MMSE nWBV
2 4.232329 0.4917398 0.4274073 0.2393097 0.01929334 0.1049638 0.04763886 4.322986
3 4.350179 0.4807173 0.4071783 0.2448725 0.01928813 0.1082568 0.05176268 4.436522
4 5.057027 0.4844787 0.4193040 0.2842837 0.02270420 0.1253268 0.05573495 5.223436
5 3.171797 1.3440672 1.1346692 1.0960568 0.03418699 0.3319224 0.16171241 3.003562
Residual Deviance: 1590.517
AIC: 1654.517
# exponentiated coefficients
(Intercept) GroupDemented GroupNondemented M.FM Age SES MMSE nWBV
2 1.219709e+02 1.244104 1.839926 1.4392402 0.9627604 0.8259287 1.0089312 3.438540e-02
3 2.352312e-02 1.124339 1.520965 1.5069838 1.0082552 0.8201639 1.0267107 1.923839e+01
4 1.215178e-03 1.056578 0.591584 0.8932354 0.9946599 0.8399721 1.0120408 9.284697e+03
5 7.261393e-07 1.491847 1.126938 0.1633237 1.0726978 0.7214451 0.9100776 3.544373e+05
# z values
(Intercept) GroupDemented GroupNondemented M.FM Age SES MMSE nWBV
2 1.135021 0.4441689 1.4265672 1.5215231 -1.9670386 -1.8220274 0.1866446 -0.7795822
3 -0.861981 0.2437917 1.0298804 1.6747907 0.4262394 -1.8313035 0.5092512 0.6664923
4 -1.327433 0.1135977 -1.2519595 -0.3971564 -0.2358313 -1.3914559 0.2147472 1.7490639
5 -4.456629 0.2976154 0.1053204 -1.6532186 2.0527327 -0.9836607 -0.5826730 4.2543773
# p values
(Intercept) GroupDemented GroupNondemented M.FM Age SES MMSE nWBV
2 2.563665e-01 0.6569205 0.1537047 0.12812861 0.04917876 0.06845083 0.8519393 4.356368e-01
3 3.886980e-01 0.8073921 0.3030661 0.09397530 0.66993340 0.06705525 0.6105762 5.050965e-01
4 1.843655e-01 0.9095567 0.2105846 0.69125211 0.81356355 0.16408723 0.8299644 8.027997e-02
5 8.325853e-06 0.7659967 0.9161215 0.09828639 0.04009851 0.32528238 0.5601135 2.096316e-05
The results will not be belaboured since, again, the data are synthetic and contain added noise, but a multinomial logistic model can inform you what predicts group membership. Here, with our artificial data, it would seem that Age is a predictor of membership in Group 2 and Group 5 (vs. Group 1, the reference category), while brain volume is a predictor of membership in Group 5 (vs. Group 1).
Finding the nuance among the crowd
Understanding brain health through the lens of trajectory analysis helps us move beyond simple, one-size-fits-all diagnoses. By recognizing that people with the same condition—like dementia—can follow different paths, we uncover the complexity and nuance beneath the surface. These trajectories, shaped by a variety of factors, allow us to identify patterns and sub-groups among individuals, helping healthcare providers anticipate the course of the disease. If variables that are modifiable (such as hours of weekly exercise or cognitive activity) are predictive of membership in certain groups, these can become targets of intervention, policy, or patient education. In turn, this insight paves the way for a more personalized, proactive approach to care, where interventions can be tailored to each individual’s unique journey through the condition.
Outcomes
This analysis demonstrated how to identify trajectory groups within a dataset. This can be applied to longitudinal datasets to learn about the different sub-types of patients, customers, or users captured in your numbers.
Credits
OASIS dataset courtesy of Kaggle