Project
Trajectory analysis with R
Trajectory analysis · Group analysis · Data science
Overview
We can quickly fall into the trap of assuming that a given disease develops similarly for all patients across time, or that all repeat customers are alike. Trajectory analysis allows us to see beyond this and learn what sub-groups of patients, customers, or people comprise our data. We can then create models to see what predicts membership in one group over another.
Approach
This project was completed in R, using packages such as:
traj
dplyr
nnet
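These can be loaded up front, along with tidyr, whose pivot_wider function appears below. A minimal setup, assuming each package is already installed:
# loading the packages used throughout
library(traj)   # trajectory analysis (Step1Measures, Step2Selection, Step3Clusters)
library(dplyr)  # data manipulation (select, mutate, filter, joins)
library(tidyr)  # reshaping, for pivot_wider
library(nnet)   # multinomial logistic regression, for multinom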
Within the crowd, there is nuance
When we talk about brain health, particularly about diseases like dementia or Alzheimer’s, it’s easy to fall into binary thinking. We simplify the story: either someone has dementia, or they don’t. But in reality, brain health—like most areas of healthcare—is far more complex than a simple yes-or-no diagnosis.
Two people can both receive a diagnosis of dementia on the same day, but five years later, they might be in very different places. One person could still be functioning relatively independently, with only mild memory issues. The other might have rapidly progressed to severe cognitive decline, requiring full-time care. This isn’t just chance — there are underlying patterns, or trajectories, that we can map out and analyze to better understand why people with the same diagnosis can experience such different outcomes.
These trajectories aren’t random. They’re shaped by an array of factors: genetics, lifestyle, early diagnosis, access to care, and other medical conditions. By analyzing these patterns, we can begin to see the nuance beneath the diagnosis. Among those with dementia, we may be able to identify sub-groups, or clusters of individuals whose trajectories are similar. We can also try to see if there are other variables that predict membership in one of these trajectories.
This allows us to move away from treating dementia as a monolithic condition and toward a more personalized approach to care, where we anticipate the likely path someone will take and intervene earlier and more effectively.
To the OASIS dataset
The Open Access Series of Imaging Studies (OASIS) is an open-access initiative aimed at making clinical neuroscience datasets available to the scientific community. In particular, OASIS-2 involves longitudinal neuroimaging and clinical measurement of normal aging and Alzheimer’s disease. These data are available through Kaggle, and contain summary information on clinical measures (such as the Mini-Mental State Exam) and MRI measures (including normalized whole brain volume and estimated total intracranial volume). We will use these data for our example analysis.
Data will first be imported and converted to a “wide” format, as required by the trajectory analysis package, traj.
# reading data
dementia <- read.csv(path, na.strings = c('', 'NA'))

# pivoting to a wide dataframe, required by `traj`
dementia_wide <- dementia %>%
  pivot_wider(names_from = Visit,
              values_from = c(Group, MR.Delay, M.F, Age, Hand, EDUC, MMSE, CDR, eTIV, nWBV, ASF),
              id_cols = Subject.ID)
Because there are so few cases at the latest timepoints, we will create some synthetic data (based on the means and standard deviations of the original data) to grow the total sample to five times its original size. To exaggerate the clustering analysis, these data will be made artificially noisy by introducing jitter to a random sub-set of numeric columns and then resampling every column with replacement. This means that the data are manipulated for illustrative purposes, and not representative of true trends in the OASIS data.
# original number of rows, and number of synthetic rows to generate
# (4x the original, so the combined data are 5x the original)
n_original <- nrow(dementia_wide)
n_synthetic <- 4 * n_original

# creating an empty dataframe to hold the new values
dementia_wide_synth <- data.frame(matrix(ncol = ncol(dementia_wide), nrow = n_synthetic))
names(dementia_wide_synth) <- names(dementia_wide)

numeric_cols <- names(dementia_wide)[sapply(dementia_wide, is.numeric)]

# setting a seed and selecting half the numeric columns to receive jitter
# (floor() guards against a non-integer count)
set.seed(189)
numeric_cols_jitter <- sample(numeric_cols, floor(length(numeric_cols) / 2))

# generate synthetic data for each column
for (col in names(dementia_wide)) {
  if (is.numeric(dementia_wide[[col]])) {
    col_mean <- mean(dementia_wide[[col]], na.rm = TRUE)
    col_sd <- sd(dementia_wide[[col]], na.rm = TRUE)
    # generate synthetic data from a normal distribution
    samples <- rnorm(n_synthetic, mean = col_mean, sd = col_sd)
    # for the selected 50% of numeric columns, add extra jitter
    if (col %in% numeric_cols_jitter) {
      jitter_amount <- 0.50 * col_sd
      samples <- samples + rnorm(n_synthetic, mean = 0, sd = jitter_amount)
    }
    dementia_wide_synth[[col]] <- samples
  } else {
    # for non-numeric columns, sample only from the non-NA values
    non_na_values <- dementia_wide[[col]][!is.na(dementia_wide[[col]])]
    if (length(non_na_values) > 0) {
      dementia_wide_synth[[col]] <- sample(non_na_values, n_synthetic, replace = TRUE)
    } else {
      dementia_wide_synth[[col]] <- NA
    }
  }
}

# resample every synthetic column with replacement, breaking within-row structure
for (col in names(dementia_wide_synth)) {
  dementia_wide_synth[[col]] <- sample(dementia_wide_synth[[col]], n_synthetic, replace = TRUE)
}

# merge the original and synthetic data
dementia_wide_combined <- rbind(dementia_wide, dementia_wide_synth)
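As a quick sanity check, the combined dataframe should now hold five times the original number of rows:
# confirming the combined sample is 5x the original
nrow(dementia_wide_combined) / n_original  # should return 5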
Next, we’ll select the columns for this analysis and rename them. This step could have been performed earlier, with only the selected columns made noisier and appended with synthetic data. Completing the operations in this order instead allows us to retain a full synthetic, noisy dataset for any further analysis.
We will keep the mmse variables, related to the clinical Mini-Mental State Exam (a commonly administered neurological test). We will also keep one MRI-related measure, wbv, for whole brain volume (the nWBV columns in the original data).
# selecting columns of interest, and creating a time (t) variable in years
dementia_clean <- dementia_wide_combined %>%
  select(1, 7:11, 32:36, 47:51) %>%
  mutate(t1 = (mean(MR.Delay_1, na.rm = TRUE)/365),
         t2 = (mean(MR.Delay_2, na.rm = TRUE)/365),
         t3 = (mean(MR.Delay_3, na.rm = TRUE)/365),
         t4 = (mean(MR.Delay_4, na.rm = TRUE)/365),
         t5 = (mean(MR.Delay_5, na.rm = TRUE)/365))

# dropping the raw MR.Delay columns, keeping the ID, measures, and t variables
dementia_clean <- dementia_clean[c(1, 7:21)]

dementia_wide_cols <- c('Subject.ID',
                        'mmse1', 'mmse2', 'mmse3', 'mmse4', 'mmse5',
                        'wbv1', 'wbv2', 'wbv3', 'wbv4', 'wbv5',
                        't1', 't2', 't3', 't4', 't5')
colnames(dementia_clean) <- dementia_wide_cols

# simplifying the subject IDs by removing the 'OAS2_' prefix
dementia_clean$Subject.ID <- gsub('OAS2_', '', dementia_clean$Subject.ID)
Moving on to the traj package. First, we need to set up the data in a way that is usable by the package, in matrix format. The package requires the temporal data to be split from the measured data (with an identifier retained in both matrices), and for only complete cases to be used.
# the package requires complete cases, in matrix format
dementia_clean <- dementia_clean[complete.cases(dementia_clean), ]
dementia_matrix <- as.matrix(dementia_clean)

# separating the matrix into the mmse, wbv, and time variables, as required
# (each retains the subject ID in the first column)
dementia_mmse <- dementia_matrix[, c(1, 2:6)]
dementia_mmse <- apply(dementia_mmse, 2, as.numeric)

dementia_wbv <- dementia_matrix[, c(1, 7:11)]
dementia_wbv <- apply(dementia_wbv, 2, as.numeric)

dementia_time <- dementia_matrix[, c(1, 12:16)]
dementia_time <- apply(dementia_time, 2, as.numeric)
The data are now ready for Steps 1 through 3 of the traj package. The purpose of each step is summarized below. Details are included in the official documentation.
Step 1: Compute Measures of Change
In this step, the package calculates a set of features (up to 18 measures) that capture various aspects of individual trajectories. These measures include basic statistics like the range, mean, and slope of a linear model, and others that describe the pattern of change over time (e.g., rate of change, variability). The goal here is to quantify key characteristics of each trajectory.

Step 2: Dimensionality Reduction
After the measures are computed, this step uses Principal Component Analysis (PCA) or similar techniques to reduce the dimensionality of the data. The idea is to select the most informative measures (those that explain the most variance) while removing redundant ones (e.g., if two measures are highly correlated). This reduction ensures the final clustering is based on the most meaningful aspects of the data.

Step 3: Clustering
Finally, the k-means clustering algorithm (or similar) is applied to group the trajectories (alongside a bootstrapping technique) into distinct clusters. The number of clusters can either be specified by the user or selected automatically using criteria like the Calinski-Harabasz index. This step ultimately identifies which trajectories follow similar patterns, creating groups of individuals with shared longitudinal profiles.
# running steps 1, 2, and 3 on the MMSE data
dementia_mmse_s1 <- traj::Step1Measures(dementia_mmse, dementia_time, ID = TRUE)
dementia_mmse_s2 <- traj::Step2Selection(dementia_mmse_s1)
dementia_mmse_s3 <- traj::Step3Clusters(dementia_mmse_s2, nclusters = NULL)
plot(dementia_mmse_s3)
With the plot command run on the last traj output, we can access a number of visuals. One that is especially useful when nclusters is set to NULL (which bootstraps the analysis to find the ideal number of trajectory groups) is the Gap Statistic plot.
Here, much as with a scree plot, we can gain a sense of how many clusters to use: a higher gap statistic suggests better separation between trajectory groups, and an “elbow” suggests a point where we hit a granularity/explainability trade-off. There is no clear elbow in this plot of synthetic, noisy data, but for illustration, we’ll now set nclusters = 5.
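Concretely, that means re-running Step 3 with the cluster count fixed:
# re-running step 3 with five clusters, then plotting the trajectory groups
dementia_mmse_s3 <- traj::Step3Clusters(dementia_mmse_s2, nclusters = 5)
plot(dementia_mmse_s3)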
The first plot shows MMSE trajectories, overlaid on one another. We can see here that patients typically follow the same trajectory up to about 4 years, after which point there is divergence. The second plot shows the number of subjects that comprise each trajectory group (or cluster), along with error bars to show how tightly patients follow the same trajectory. Those in Cluster 5 have a more variable trajectory relative to the others.
A similar exercise on whole brain volume leads to three distinct trajectory groups, which may be called “Early gainers” (Cluster 1), “Late gainers” (Cluster 2), and “Steady decliners” (Cluster 3).
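That exercise is not shown in full, but it simply mirrors the MMSE pipeline with the wbv matrix swapped in (a sketch, with nclusters = 3 reflecting the three groups just described):
# the same three steps, run on whole brain volume
dementia_wbv_s1 <- traj::Step1Measures(dementia_wbv, dementia_time, ID = TRUE)
dementia_wbv_s2 <- traj::Step2Selection(dementia_wbv_s1)
dementia_wbv_s3 <- traj::Step3Clusters(dementia_wbv_s2, nclusters = 3)
plot(dementia_wbv_s3)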
We can now assign a cluster number (here, from the MMSE analysis) to each subject in the original dataset, for additional plotting or analysis.
# creating a dataframe with the clusters/groups from the MMSE analysis
dementia_traj_clust <- as.data.frame(dementia_mmse_s3$partition)
colnames(dementia_traj_clust)[1] <- 'Subject.ID'

# matching the ID format of the original data before joining
dementia$Subject.ID <- gsub('OAS2_', '', dementia$Subject.ID)
dementia$Subject.ID <- as.numeric(dementia$Subject.ID)

dementia_traj_clust_group <- right_join(dementia, dementia_traj_clust, by = 'Subject.ID')
dementia_traj_clust_group$Cluster <- as.factor(dementia_traj_clust_group$Cluster)
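For example, a quick base-R look at whether baseline brain volume differs across the MMSE trajectory clusters (a sketch; any clinical variable could be substituted for nWBV):
# comparing Visit 1 brain volume across the trajectory clusters
boxplot(nWBV ~ Cluster,
        data = subset(dementia_traj_clust_group, Visit == 1),
        xlab = 'Trajectory cluster',
        ylab = 'Normalized whole brain volume (nWBV)')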
And from here we can run a simple multinomial logistic regression on just the Visit 1 data (to avoid issues with independence of observations, which we would have if we kept all timepoints in the current dataframe and modeled using nnet).
# identifying predictors of group membership, based on the trajectory analyses above
dementia_traj_clust_group$Cluster <- relevel(
  dementia_traj_clust_group$Cluster, ref = "1"
)

# limiting to just the first timepoint
dementia_time1 <- dementia_traj_clust_group %>%
  filter(Visit == 1)
# running the multinomial logistic regression
traj_model <- multinom(Cluster ~ Group + M.F + Age + SES + MMSE + nWBV, data = dementia_time1)
summary(traj_model)
exp(coef(traj_model))
# producing z and p scores
traj_model_z <- summary(traj_model)$coefficients / summary(traj_model)$standard.errors
traj_model_z
traj_model_p <- (1 - pnorm(abs(traj_model_z), 0, 1)) * 2
traj_model_p
# model summary
Call:
multinom(formula = Cluster ~ Group + M.F + Age + SES + MMSE +
nWBV, data = dementia_time1)
Coefficients:
(Intercept) GroupDemented GroupNondemented M.FM Age SES MMSE nWBV
2 4.803783 0.21841554 0.6097253 0.3641153 -0.037950742 -0.1912469 0.008891539 -3.370123
3 -3.749771 0.11719489 0.4193449 0.4101102 0.008221361 -0.1982510 0.026360206 2.956908
4 -6.712865 0.05503565 -0.5249516 -0.1129051 -0.005354362 -0.1743867 0.011968925 9.136123
5 -14.135524 0.40001508 0.1195039 -1.8120214 0.070176759 -0.3264990 -0.094225454 12.778287
Std. Errors:
(Intercept) GroupDemented GroupNondemented M.FM Age SES MMSE nWBV
2 4.232329 0.4917398 0.4274073 0.2393097 0.01929334 0.1049638 0.04763886 4.322986
3 4.350179 0.4807173 0.4071783 0.2448725 0.01928813 0.1082568 0.05176268 4.436522
4 5.057027 0.4844787 0.4193040 0.2842837 0.02270420 0.1253268 0.05573495 5.223436
5 3.171797 1.3440672 1.1346692 1.0960568 0.03418699 0.3319224 0.16171241 3.003562
Residual Deviance: 1590.517
AIC: 1654.517
# exponentiated coefficients
(Intercept) GroupDemented GroupNondemented M.FM Age SES MMSE nWBV
2 1.219709e+02 1.244104 1.839926 1.4392402 0.9627604 0.8259287 1.0089312 3.438540e-02
3 2.352312e-02 1.124339 1.520965 1.5069838 1.0082552 0.8201639 1.0267107 1.923839e+01
4 1.215178e-03 1.056578 0.591584 0.8932354 0.9946599 0.8399721 1.0120408 9.284697e+03
5 7.261393e-07 1.491847 1.126938 0.1633237 1.0726978 0.7214451 0.9100776 3.544373e+05
# z values
(Intercept) GroupDemented GroupNondemented M.FM Age SES MMSE nWBV
2 1.135021 0.4441689 1.4265672 1.5215231 -1.9670386 -1.8220274 0.1866446 -0.7795822
3 -0.861981 0.2437917 1.0298804 1.6747907 0.4262394 -1.8313035 0.5092512 0.6664923
4 -1.327433 0.1135977 -1.2519595 -0.3971564 -0.2358313 -1.3914559 0.2147472 1.7490639
5 -4.456629 0.2976154 0.1053204 -1.6532186 2.0527327 -0.9836607 -0.5826730 4.2543773
# p values
(Intercept) GroupDemented GroupNondemented M.FM Age SES MMSE nWBV
2 2.563665e-01 0.6569205 0.1537047 0.12812861 0.04917876 0.06845083 0.8519393 4.356368e-01
3 3.886980e-01 0.8073921 0.3030661 0.09397530 0.66993340 0.06705525 0.6105762 5.050965e-01
4 1.843655e-01 0.9095567 0.2105846 0.69125211 0.81356355 0.16408723 0.8299644 8.027997e-02
5 8.325853e-06 0.7659967 0.9161215 0.09828639 0.04009851 0.32528238 0.5601135 2.096316e-05
The results will not be belaboured since, again, the data are synthetic and contain added noise, but a multinomial logistic model can inform you what predicts group membership. Here, with our artificial data, it would seem that Age is a predictor of membership in Group 2 and Group 5 (vs. Group 1, the reference category), while brain volume is a predictor of membership in Group 5 (vs. Group 1).
Finding the nuance among the crowd
Understanding brain health through the lens of trajectory analysis helps us move beyond simple, one-size-fits-all diagnoses. By recognizing that people with the same condition—like dementia—can follow different paths, we uncover the complexity and nuance beneath the surface. These trajectories, shaped by a variety of factors, allow us to identify patterns and sub-groups among individuals, helping healthcare providers anticipate the course of the disease. If variables that are modifiable (such as hours of weekly exercise or cognitive activity) are predictive of membership in certain groups, these can become targets of intervention, policy, or patient education. In turn, this insight paves the way for a more personalized, proactive approach to care, where interventions can be tailored to each individual’s unique journey through the condition.
Outcomes
This analysis demonstrated how to identify trajectory groups within a dataset. This can be applied to longitudinal datasets to learn about the different sub-types of patients, customers, or users captured in your numbers.
Credits
OASIS dataset courtesy of Kaggle