Project
Survey data analysis with Classical Test Theory
Survey analysis · Classical Test Theory · Data science
Overview
A framework for analyzing Likert-style survey data from Kaggle’s Patient Satisfaction Database is presented. I provide an overview of how to analyze key survey features such as missingness, single-item endorsement, and inter-item correlations (i.e., how responses to one question relate to responses to another). This simple framework gives users an opportunity to analyze their survey data rigorously and systematically.
Approach
This project was completed in R, using packages such as:
likert
psych
dplyr
tidyr
ggplot2
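Several additional packages appear in the helper functions later in the post: naniar and flextable for missingness tables, e1071 for skewness, vcd and corrplot for inter-item correlations, and ggpubr and ggExtra for the test-retest plot. A minimal setup chunk, assuming all of these are already installed, might look like this:
```{r setup, message=FALSE, warning=FALSE}
# core wrangling and visualization
library(dplyr)
library(tidyr)
library(ggplot2)
# Likert visualization and psychometrics
# (psych is loaded after ggplot2 so alpha() refers to psych::alpha)
library(likert)
library(psych)
# missingness summaries and web-friendly tables
library(naniar)
library(flextable)
# skewness, Cramer's V, and correlation plots
library(e1071)
library(vcd)
library(corrplot)
# scatter plot annotations and marginal densities
library(ggpubr)
library(ggExtra)
```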
What gets measured gets improved
You wouldn’t measure your height with a rubber ruler.
And in the same way, we wouldn't poll opinion or collect data using a survey that doesn't stand on solid ground.
Surveys can be a powerful tool, and if nothing else remain a common way to understand user sentiment. But they only work if you truly understand what the responses represent. Especially with Likert scales (those 1 to 5 or 1 to 7 response options, for example), the numbers alone don’t tell the full story. People interpret those scales differently, and small variations in wording can shift how they respond. A "Somewhat agree" response may not mean the same thing to two different people, and that can make all the difference in the insights you draw. Without properly evaluating these measures, you might be drawing conclusions from data that isn’t actually measuring what you think it is.
It can be tempting to take survey results at face value. But in data science, especially when working with subjective measures like attitudes, satisfaction, or agreement, relying on surface-level results is risky. This is where the concepts of validity and reliability come in. Validity ensures that your survey is truly measuring what it’s intended to, while reliability ensures consistency across different respondents or over time. Without these evaluations, your data could be skewed by poorly designed questions, response biases, or cultural differences in interpreting the scales.
A framework for survey evaluation
This is where Classical Test Theory (CTT) comes into play. CTT is one of the most widely used frameworks for assessing survey reliability and validity. At its core, CTT assumes that every respondent’s observed score (what they actually report) is made up of two components: their true score (what you’re trying to measure) and an error term (any noise or inconsistencies that cloud the true result).
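In notation, CTT's core assumption is usually written as:

$$X = T + E$$

where $X$ is the observed score, $T$ is the true score, and $E$ is random error. Reliability then falls out as the share of observed-score variance that is true-score variance, $\rho_{XX'} = \sigma^2_T / \sigma^2_X$, which is why reducing measurement error is central to everything that follows.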
CTT provides a systematic way to evaluate the reliability of survey items — ensuring that the same respondent would answer consistently across different instances (test-retest reliability) and that the items within the survey itself are internally consistent (per measures such as Cronbach’s alpha). Without this kind of analysis, you might misinterpret randomness or bias as meaningful data.
What does CTT help you understand?
The most academic use of CTT is ultimately to perform a factor analysis, which helps you deeply understand survey structure (ensuring the items in a survey capture the intended constructs, and that those constructs are distinct yet related to the appropriate degree). But there are also more accessible, easier-to-compute metrics that can still provide actionable insight.
For example, CTT can help you identify floor and ceiling effects, which occur when a disproportionate number of respondents select the lowest or highest possible score on a survey item, limiting data variability. This can lead to problems in reliability because the survey fails to capture differences among respondents. If everyone selects "5" on a satisfaction scale, it becomes impossible to detect finer variations in satisfaction.
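As a quick sketch of the idea (with toy data rather than the Kaggle dataset), the share of responses at an extreme is a one-liner; the full table function later in this post applies the same logic across all items:
```{r floor ceiling sketch}
# toy example: share of responses at the ceiling of a 1-5 item
x <- c(5, 5, 4, 5, 5, 3, 5)
mean(x == 5, na.rm = TRUE)  # ~0.71; values near 1 signal a ceiling effect
```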
Inter-item correlations, likewise, are part of CTT's assessment of internal consistency. High inter-item correlations indicate that the items are measuring the same construct. However, if the correlations are too high, it might suggest redundancy, meaning the items are too similar and not providing new information. If the correlations are too low, the items might be measuring different things, which would reduce the internal consistency of the survey.
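As a toy illustration using Pearson's r on two numeric items (the full treatment later in this post uses Cramer's V, which better suits categorical responses):
```{r inter item sketch}
# toy example: correlation between two 1-5 items from hypothetical respondents
item_a <- c(4, 5, 3, 4, 2, 5)
item_b <- c(5, 5, 3, 4, 2, 4)
cor(item_a, item_b)  # very high values hint at redundancy; very low, at distinct constructs
```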
A few functions
Below, a series of functions is developed to help you better understand your surveys under a CTT lens. We'll work through these functions with the Patient Satisfaction Dataset, available through Kaggle, which provides insight into patient satisfaction with the healthcare system in Macedonia. This dataset uses a 1-5 response scale for all items except the first, which will be dropped to simplify our current use case. The functions should be easily adaptable to your own dataframes.
Visualizing Likert responses
First, visualizing the response distribution of your data helps you get oriented. Using the likert package for this is quick and effective.
```{r read data, echo=FALSE}
# read the Kaggle dataset (file path elided)
patient_satisfaction <- read.csv('filepath')
# drop the first item, keeping the 16 Likert-scale columns
patient_satisfaction <- patient_satisfaction[, c(2:17)]
```
The number of levels and the associated labels can be changed for Likert surveys that have a different number of response levels.
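For example, adapting the factor conversion used in the plotting function below to a hypothetical 7-point scale only requires swapping the levels and labels (df and cols here stand in for your own dataframe and column selection):
```{r seven point example, eval=FALSE}
# hypothetical 7-point anchors; substitute whatever labels your survey uses
seven_point_labels <- c('Strongly disagree', 'Disagree', 'Somewhat disagree', 'Neutral',
                        'Somewhat agree', 'Agree', 'Strongly agree')
df[cols] <- lapply(df[cols], factor, levels = 1:7, labels = seven_point_labels)
```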
First, we will create a renaming map to change the names of the data fields (which are not always interpretable) to more descriptive labels.
```{r rename map}
rename_labels <- list(
'Check up' = 'Checkupappointment',
'Waiting time' = 'Timewaiting',
'Administration' = 'Adminprocedures',
'Cleanliness and hygiene' = 'Hygieneandcleaning',
'Appointment timing' = 'Timeofappointment',
'Physician experience' = 'Qualityexperiencedr',
'Specialist availability' = 'Specialistsavaliable',
'Physician communication' = 'Communicationwithdr',
'Diagnosis' = 'Exactdiagnosis',
'Equipment' = 'Modernequipment',
'Staff friendliness' = 'friendlyhealthcareworkers',
'Lab services available' = 'labservices',
'Drug availability' = 'avaliablityofdrugs',
'Waiting room quality' = 'waitingrooms',
'Hospital room quality' = 'hospitalroomsquality',
'Amenities' = 'parkingplayingroomscaffes'
)
```
This plot will now show, in order of level of agreement, responses per item in a colour-coded, easy-to-grasp summative visual.
```{r likert function}
# visualizing likert data
make_likert_plot <- function(data, cols, rename_map) {
  df.likert <- data
  # convert the selected columns to labelled factors
  df.likert[cols] <- lapply(df.likert[cols],
                            factor,
                            levels = c(1, 2, 3, 4, 5),
                            labels = c('Very unsatisfied', 'Not satisfied', 'Neutral', 'Satisfied', 'Very satisfied'))
  # apply the descriptive item labels
  df.likert <- df.likert %>%
    rename(!!!rename_map)
  df.likert <- df.likert[cols]
  df.likert <- as.data.frame(df.likert)
  df.likert.plot <- plot(likert(df.likert), legend.position = 'bottom') +
    theme(legend.text = element_text(size = 6))
  return(df.likert.plot)
}
```
```{r likert plot, fig.align='center', fig.height=6, fig.width=6}
make_likert_plot(patient_satisfaction, 1:16, rename_labels)
```
Missingness
Next, we can examine missingness. Two functions will be written: the first to tabulate missingness, and the second to visualize it, both taking assists from the naniar package. For the table, a flextable output is used, as it is web-friendly. Conditional formatting has also been applied in order to highlight cases of missingness >10% (a generally accepted CTT threshold for missingness). N.b. some data points were deleted at random, and some were deliberately deleted, to create additional missingness for visualization purposes.
```{r missingness}
# missingness table
missingness_table <- function(data, rename_map) {
  data <- data %>%
    rename(!!!rename_map)
  data.missing <- as.data.frame(miss_var_summary(data))
  custom_col_names <- c('Question', 'Missing (n)', 'Missing (%)')
  colnames(data.missing) <- custom_col_names
  ft <- flextable(data.missing)
  ft <- bg(ft, i = ~ `Missing (%)` > 10, j = 'Missing (%)', bg = 'darkred')
  ft <- color(ft, i = ~ `Missing (%)` > 10, j = 'Missing (%)', color = 'white')
  ft <- set_table_properties(ft, width = 1, layout = 'autofit')
  ft <- set_caption(ft, caption = 'Summary of missing cases')
  return(ft)
}

# missingness figure
missingness_figure <- function(data, rename_map) {
  data <- data %>%
    rename(!!!rename_map)
  gg_miss_upset(data)
}
```
```{r missingness table}
missingness_table(patient_satisfaction, rename_labels)
```
```{r missingness figure, fig.align='center', fig.height=6, fig.width=6}
missingness_figure(patient_satisfaction, rename_labels)
```
Floor-and-ceiling responses
Next, we'll tabulate floor-and-ceiling responses, in order to make sure there isn't excessive clustering of responses at these two extremes. If most responses are "Very satisfied" or "Very unsatisfied", the survey becomes somewhat ineffective, drawing more praise or criticism than is useful. Here too, conditional formatting is applied to highlight floor or ceiling responses with greater than 80% endorsement, another CTT-based threshold.
```{r floor and ceiling}
# floor and ceiling
floor_and_ceiling_table <- function(data, cols, rename_map) {
  data <- data %>%
    rename(!!!rename_map)
  data[cols] <- lapply(data[cols], factor,
                       levels = c(1, 2, 3, 4, 5),
                       labels = c('Very unsatisfied', 'Not satisfied', 'Neutral', 'Satisfied', 'Very satisfied'))
  # tabulate response frequencies and percentages per item
  freq_list <- lapply(data, function(col) {
    as.data.frame(table(col)) %>%
      mutate(Percentage = round(100 * Freq / sum(Freq), 1))
  })
  freq_df <- bind_rows(freq_list, .id = 'Variable') %>%
    rename(Response = col)
  # blank out repeated variable names for readability
  freq_df <- freq_df %>%
    group_by(Variable) %>%
    mutate(Variable = ifelse(row_number() == 1, Variable, '')) %>%
    ungroup()
  ft <- flextable(freq_df)
  ft <- bg(ft, i = ~ Response %in% c('Very unsatisfied', 'Very satisfied') & Percentage > 80,
           j = 'Percentage', bg = 'darkred')
  ft <- color(ft, i = ~ Response %in% c('Very unsatisfied', 'Very satisfied') & Percentage > 80,
              j = 'Percentage', color = 'white')
  ft <- bg(ft, i = ~ Percentage <= 80, j = 'Percentage', bg = 'grey90')
  ft <- set_table_properties(ft, width = 1, layout = 'autofit')
  ft <- set_caption(ft, caption = 'Summary of floor and ceiling responses')
  return(ft)
}
```
```{r floor and ceiling table}
floor_and_ceiling_table(patient_satisfaction, 1:16, rename_labels)
```
For brevity, responses to just the first five questions are displayed.
Single-item endorsement
This function and table are similar to the prior, with slightly different conditional formatting. These two tables can (and in some cases should) be produced as one, with the full set of conditional formatting required, but are provided as two for illustrative purposes. Single-item endorsement should not exceed 50% for any given response level, in order to avoid excessive "binning", which limits response distribution and thus further analysis potential.
```{r single item endorsement}
single_item_table <- function(data, cols, rename_map) {
  data <- data %>%
    rename(!!!rename_map)
  data[cols] <- lapply(data[cols],
                       factor,
                       levels = c(1, 2, 3, 4, 5),
                       labels = c('Very unsatisfied', 'Not satisfied', 'Neutral', 'Satisfied', 'Very satisfied'))
  freq_list <- lapply(data, function(col) {
    as.data.frame(table(col)) %>%
      mutate(Percentage = round(100 * Freq / sum(Freq), 1))
  })
  freq_df <- bind_rows(freq_list, .id = 'Variable') %>%
    rename(Response = col)
  freq_df <- freq_df %>%
    group_by(Variable) %>%
    mutate(Variable = ifelse(row_number() == 1, Variable, '')) %>%
    ungroup()
  ft <- flextable(freq_df)
  ft <- bg(ft, i = ~ Percentage > 50, j = 'Percentage', bg = 'darkred')
  ft <- bg(ft, i = ~ Percentage <= 50, j = 'Percentage', bg = 'grey95')
  ft <- color(ft, i = ~ Percentage > 50, j = 'Percentage', color = 'white')
  ft <- color(ft, i = ~ Percentage <= 50, j = 'Percentage', color = 'black')
  ft <- set_table_properties(ft, width = 1, layout = 'autofit')
  ft <- set_caption(ft, caption = 'Summary of single-item endorsement')
  return(ft)
}
```
```{r single item endorsement table}
single_item_table(patient_satisfaction, 1:16, rename_labels)
```
Skewness
Next, data skewness will be examined, with help from the e1071 package. Skewness values in excess of |2| are highlighted red, as this exceeds the CTT threshold and, like the issues above, may limit future analysis potential.
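For reference, the moment-based sample skewness underlying e1071's calculation (the package offers small variations via its type argument) is:

$$g_1 = \frac{\tfrac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^3}{\left(\tfrac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2\right)^{3/2}}$$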
```{r skewness}
skewness_table <- function(data, rename_map) {
  data <- data %>%
    rename(!!!rename_map)
  skewness_results <- sapply(data, function(x) {
    if (is.numeric(x)) {
      skewness(x, na.rm = TRUE)
    } else {
      NA
    }
  })
  skewness_df <- data.frame(
    Variable = names(skewness_results),
    Skewness = round(skewness_results, 3)
  )
  ft <- flextable(skewness_df)
  ft <- bg(ft, i = ~ abs(Skewness) > 2, j = 'Skewness', bg = 'darkred')
  ft <- bg(ft, i = ~ abs(Skewness) <= 2, j = 'Skewness', bg = 'grey95')
  ft <- color(ft, i = ~ abs(Skewness) > 2, j = 'Skewness', color = 'white')
  ft <- color(ft, i = ~ abs(Skewness) <= 2, j = 'Skewness', color = 'black')
  ft <- set_table_properties(ft, width = 1, layout = 'autofit')
  ft <- set_caption(ft, caption = 'Summary of item skewness')
  return(ft)
}
```
```{r skewness table}
skewness_table(patient_satisfaction, rename_labels)
```
Inter-item correlations
Items on the survey should have an appropriate level of correlation with one another; otherwise, it may indicate they are not measuring the same construct. However, inter-item correlations that are too high suggest redundancy in the questionnaire, or may hint that two questions meant to measure different constructs are in fact measuring the same one. Here, the corrplot package is used to visualize correlations between individual questions. As the data are categorical, Cramer's V is used instead of Pearson correlations, and pairs of items with values exceeding 0.7 should be flagged for follow-up. We will also use a new map with shorter labels to aid in visualization, and append a legend below the corrplot.
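For reference, Cramer's V for an $r \times c$ contingency table of two items with $n$ paired responses is derived from the chi-squared statistic:

$$V = \sqrt{\frac{\chi^2}{n \cdot \min(r - 1,\; c - 1)}}$$

Like a correlation, it ranges from 0 to 1, though it carries no sign.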
```{r inter item correlation}
# map for corrplot
cor_map <- list(
  'Q1' = 'Checkupappointment',
  'Q3' = 'Timewaiting',
  'Q4' = 'Adminprocedures',
  'Q5' = 'Hygieneandcleaning',
  'Q6' = 'Timeofappointment',
  'Q7' = 'Qualityexperiencedr',
  'Q8' = 'Specialistsavaliable',
  'Q9' = 'Communicationwithdr',
  'Q10' = 'Exactdiagnosis',
  'Q11' = 'Modernequipment',
  'Q12' = 'friendlyhealthcareworkers',
  'Q13' = 'labservices',
  'Q14' = 'avaliablityofdrugs',
  'Q15' = 'waitingrooms',
  'Q16' = 'hospitalroomsquality',
  'Q17' = 'parkingplayingroomscaffes'
)

# inter item correlation figure
inter_item_fig <- function(data, rename_map, cor_map) {
  data <- data %>%
    rename(!!!cor_map)
  # pairwise Cramer's V via vcd::assocstats
  cramers_v_matrix <- matrix(NA, ncol = length(data), nrow = length(data))
  colnames(cramers_v_matrix) <- colnames(data)
  rownames(cramers_v_matrix) <- colnames(data)
  for (i in seq_along(data)) {
    for (j in seq_along(data)) {
      cramers_v_matrix[i, j] <- round(assocstats(table(data[[i]], data[[j]]))$cramer, 2)
    }
  }
  corrplot(cramers_v_matrix, method = 'number', type = 'lower', diag = FALSE,
           tl.col = 'black', tl.cex = 0.5, tl.srt = 45, number.cex = 0.5)
  # build the legend mapping short question codes to descriptive labels
  cor_df <- data.frame(
    Question = names(cor_map),
    Variable = unlist(cor_map),
    stringsAsFactors = FALSE
  )
  map_df <- data.frame(
    Description = names(rename_map),
    Variable2 = unlist(rename_map),
    stringsAsFactors = FALSE
  )
  merged_df <- cbind(cor_df, map_df)
  merged_df <- merged_df[, c(1, 3)]
  ft <- flextable(merged_df)
  ft <- set_table_properties(ft, width = 1, layout = 'autofit')
  ft <- set_caption(ft, caption = 'Legend')
  return(ft)
}
```
```{r inter item correlation figure, fig.align='center', fig.height=6, fig.width=6}
inter_item_fig(patient_satisfaction, rename_labels, cor_map)
```
Internal consistency
Next, we want to examine the internal consistency of the scale as a whole, to ensure it is not fragmented by multiple, unexpected constructs. We will use Cronbach's alpha to examine the internal consistency, with values > 0.7 deemed acceptable per CTT. The inter-item correlation table above already provides a hint as to whether the individual questions are sufficiently and appropriately related to one another.
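For reference, Cronbach's alpha for $k$ items with item variances $\sigma^2_i$ and total-score variance $\sigma^2_X$ is:

$$\alpha = \frac{k}{k - 1}\left(1 - \frac{\sum_{i=1}^{k} \sigma^2_i}{\sigma^2_X}\right)$$

This corresponds to the raw alpha reported by psych's alpha() in the function below.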
```{r cronbach alpha}
cronbach_alpha_test <- function(data, rename_map) {
  data <- data %>%
    rename(!!!rename_map)
  # namespace the call explicitly, as ggplot2 also exports an alpha() helper
  ca_result <- psych::alpha(data)
  cronbach_alpha_value <- round(ca_result$total$raw_alpha, 3)
  items <- ncol(data)
  sample_units <- nrow(data)
  ca_df <- data.frame(
    Measure = c("Cronbach's alpha", 'Questions', 'Sample units'),
    Metric = c(cronbach_alpha_value, items, sample_units)
  )
  ft <- flextable(ca_df)
  ft <- bg(ft, j = 'Metric', bg = 'grey95')
  ft <- bg(ft, i = ~ Measure == "Cronbach's alpha" & Metric < 0.7, j = 'Metric', bg = 'darkred')
  ft <- color(ft, i = ~ Measure == "Cronbach's alpha" & Metric >= 0.7, j = 'Metric', color = 'black')
  ft <- color(ft, i = ~ Measure == "Cronbach's alpha" & Metric < 0.7, j = 'Metric', color = 'white')
  ft <- set_table_properties(ft, width = 1, layout = 'autofit')
  ft <- set_caption(ft, caption = "Summary of Cronbach's alpha")
  return(ft)
}
```
```{r cronbach alpha test}
cronbach_alpha_test(patient_satisfaction, rename_labels)
```
Test-retest administration
Finally, we'll look into test-retest administration, to get a sense of whether the same respondents respond similarly at multiple timepoints. This sample dataset is not longitudinal, with only one, cross-sectional administration. We'll resample the dataset to artificially create a second administration, again for illustrative purposes.
```{r resample}
# flatten the dataset into a vector
flattened <- unlist(patient_satisfaction)
# shuffle all values
shuffled <- sample(flattened)
# reshape `shuffled` back into the original dataframe structure
patient_satisfaction_shuffled <- as.data.frame(matrix(
  shuffled,
  nrow = nrow(patient_satisfaction),
  ncol = ncol(patient_satisfaction)))
# apply the same column names
colnames(patient_satisfaction_shuffled) <- colnames(patient_satisfaction)
```
Now we'll create a test-retest function that takes the mean of each row (respondent) to get a sense of the overall 'score'. We'll look for an r value in excess of 0.50, per CTT. It will merge the 'pre' and 'post' datasets, creating a simple scatter plot with density functions in the margins, courtesy of ggpubr and ggExtra.
```{r test retest}
test_retest <- function(pre_data, post_data, cols) {
  # mean across items per respondent; na.rm guards against the missingness introduced earlier
  pre_data <- pre_data %>%
    mutate(Pre = rowMeans(.[, cols], na.rm = TRUE))
  post_data <- post_data %>%
    mutate(Post = rowMeans(.[, cols], na.rm = TRUE))
  # overall test-retest correlation (also annotated on the plot via stat_cor)
  prepost_cor <- cor(pre_data$Pre, post_data$Post,
                     use = 'complete.obs',
                     method = 'pearson')
  # merge on row names to pair each respondent's 'pre' and 'post' scores
  prepost <- merge(pre_data, post_data, by.x = 0, by.y = 0)
  plot1 <- ggplot(prepost, aes(Pre, Post)) +
    geom_point() +
    geom_smooth(method = lm, se = FALSE) +
    stat_cor(method = 'pearson') +
    theme_classic()
  ggMarginal(plot1)
}
```
```{r test retest plot, fig.align='center', fig.height=6, fig.width=6}
test_retest(patient_satisfaction, patient_satisfaction_shuffled, 1:16)
```
Summary
This takes us to the end of an initial, CTT-informed method of survey evaluation, with a complete analysis using factor analysis still pending. It should be a start towards understanding how your surveys perform: whether they are measuring what you think they should be measuring, and doing so in a way that is intelligible and suitable for further analysis.
It gives you a yardstick, not a rubber ruler.