Project
What’s in a list?
Data Science · Data Visualization · Topic Modelling
Overview
We can take a list and perform topic modelling to get a sense of how its items cluster thematically. This post compares two approaches, LDA (a frequency-based approach) and BERTopic (a contextual approach), for categorizing scientific articles based on their titles.
Approach
This project was completed in Python, using packages such as:
gensim
bertopic
hdbscan
Years ago, I published a citation analysis on traumatic brain injury (TBI), identifying the 50 most-cited papers in the field at that time.
Pulling the papers and deriving citation metrics was simple. Categorizing the papers “by hand” (i.e., in an Excel sheet containing my interpretation of the theme of each article) was not. The process required careful judgment calls, with papers often straddling two topics given their multiple objectives.
Having now learned about topic models, such as Latent Dirichlet Allocation (LDA) and BERTopic (built on Bidirectional Encoder Representations from Transformers, or BERT), I began to wonder how these models would assign topics to the list of most-cited TBI articles. More so than their performance in comparison to my manual topic labelling, I was interested in how they would perform relative to one another given their differences. LDA is a probabilistic approach: it assumes there are latent (or hidden) topics within a “corpus” (a list of article titles, in this case), and that each topic is a distribution over words. BERT, by contrast, is contextual: its transformer architecture lets the model interpret each word by looking at all the other words in the sentence, so it can tell whether the word “damages” refers to injury or to legal compensation. BERTopic builds on these contextual representations, encoding each document, reducing the dimensionality of the embeddings, clustering them, and extracting keywords so that each cluster becomes an interpretable topic.
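To make the “damages” example concrete, here is a minimal sketch, outside the main analysis, that assumes the transformers and torch packages (neither is used elsewhere in this post) and two made-up sentences. It shows BERT giving the same word different vectors in different contexts.
# illustrative only: same word, different contexts, different BERT vectors
import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

def word_vector(sentence, word):
    # average the hidden states of the subword tokens that spell out `word`
    enc = tok(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = bert(**enc).last_hidden_state[0]
    piece_ids = tok(word, add_special_tokens=False)["input_ids"]
    ids = enc["input_ids"][0].tolist()
    for i in range(len(ids) - len(piece_ids) + 1):
        if ids[i:i + len(piece_ids)] == piece_ids:
            return hidden[i:i + len(piece_ids)].mean(dim=0)
    raise ValueError(f"'{word}' not found in '{sentence}'")

v_injury = word_vector("the fall caused lasting damages to the brain", "damages")
v_legal = word_vector("the court awarded damages to the plaintiff", "damages")
similarity = torch.cosine_similarity(v_injury, v_legal, dim=0).item()
print(f"cosine similarity across contexts: {similarity:.2f}")  # below 1.0: the contexts differ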
So how do they compare? Let's first set up the global environment and then look into LDA.
# environment
import pandas as pd                        # data handling
import nltk                                # tokenization and stopword lists
import gensim
from gensim import corpora                 # dictionary and bag-of-words corpus
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from bertopic import BERTopic              # contextual topic modelling
import pyLDAvis.gensim_models              # interactive LDA visualization
import matplotlib.pyplot as plt            # plotting
from wordcloud import WordCloud            # word clouds
import hdbscan                             # density-based clustering
The next step is to download punkt from nltk, to tokenize words (break them into smaller “chunks”: sentences into words, or words into prefixes, roots, and suffixes), and stopwords, to identify common words (“the”, “in”, “it”, etc.) that are filtered out.
# set up
nltk.download('punkt')
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
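To see what these two resources actually do, here is a quick illustrative check on a made-up title (not one of the 50 articles):
# quick check: tokenize a made-up title and drop stopwords
example = "Outcome after severe traumatic brain injury in the elderly"
tokens = word_tokenize(example.lower())
print(tokens)  # every word, lowercased
print([w for w in tokens if w.isalpha() and w not in stop_words])
# -> ['outcome', 'severe', 'traumatic', 'brain', 'injury', 'elderly']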
The data (list of 50 most-cited articles in TBI) will now be imported.
# import
df = pd.read_csv("top_tbi_articles.csv")
df.columns = df.columns.str.strip().str.lower()
titles = df['title'].dropna().tolist()
We now use a list comprehension to preprocess each title into a list of tokens (a list of lists overall). Namely, it tokenizes each title and filters out anything that is not purely alphabetical or that appears in stop_words.
# creating list
texts = [
    [word for word in word_tokenize(title.lower())
     if word.isalpha() and word not in stop_words]
    for title in titles
]
We will now create a dictionary that assigns a unique integer ID to each unique word in texts, and a corpus in which each title is represented as a list of (word ID, word count) tuples produced by doc2bow (“document to bag of words”).
# dictionary and corpus
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
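If you want to peek at what these objects hold, a short optional check could look like this (the printed values will depend on the actual titles):
# optional peek at the dictionary and corpus
print(list(dictionary.token2id.items())[:5])   # first few (word, integer ID) pairs
print(corpus[0])                               # first title as (word ID, word count) tuples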
The LDA model is now trained and visualized with a dynamic HTML output.
# lda model
lda_model = gensim.models.LdaModel(
    corpus=corpus,
    id2word=dictionary,
    num_topics=5,
    passes=10,
    random_state=42
)
# visualize lda
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim_models.prepare(lda_model, corpus, dictionary)
pyLDAvis.save_html(vis, 'lda_visualization.html')
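If you are working outside a notebook, or just want a plain-text summary of the same model, gensim can also print the top words per topic directly:
# plain-text summary of the five LDA topics
for topic_id, words in lda_model.print_topics(num_topics=5, num_words=6):
    print(f"Topic {topic_id}: {words}")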
What results is a dynamic HTML output that lets you visualize topics and how they cluster on the left, and the words that comprise each topic (with their relative frequencies) on the right. Below is a screengrab of that output, highlighting a topic I'll call “Sport-related concussion consensus statements”. Generally, while the topics are well separated, they are not entirely thematically distinct. In addition to the consensus statement topic, there appears to be one on the neuropathology of TBI and another on animal models, leaving the last two somewhat unspecified with respect to their theme.
Moving to BERTopic, we'll perform a similar analysis, drawing on a clustering method called Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN), which identifies clusters based on the density distribution of the constituent data points.
# create documents list
documents = [' '.join(text) for text in texts]
# hdbscan model
hdbscan_model = hdbscan.HDBSCAN(min_cluster_size=3, min_samples=2)
bertopic_hdb = BERTopic(hdbscan_model=hdbscan_model)
topics_hdb, _ = bertopic_hdb.fit_transform(documents)
bertopic_hdb.visualize_topics().write_html("bertopic_hdbscan_topics.html")
bertopic_hdb.visualize_barchart(top_n_topics=7).write_html("bertopic_hdbscan_barchart.html")
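Alongside the HTML views, BERTopic can summarize its clusters as a dataframe, which is a quick way to see topic sizes and top words without opening the visualizations:
# tabular overview of the discovered topics (includes the -1 outlier topic, if present)
print(bertopic_hdb.get_topic_info())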
A similar dynamic HTML output is produced, with a screengrab of it below. Here, the topics seem to be better clustered and more internally consistent. The image shown highlights the “animal models” topic, with other topics including consensus statements, neuropathology, and outcome evaluation studies (mainly in severe TBI).
Topics can be assigned to each study, as below, allowing users to assess topic-study fit.
# map articles to topics
topic_article_map = {}
for topic_id in set(topics_hdb):
    indices = [i for i, topic in enumerate(topics_hdb) if topic == topic_id]
    # index into titles (the same list the model was fit on) to keep positions aligned
    articles = [titles[i] for i in indices]
    topic_article_map[topic_id] = articles

    # get keywords
    keywords = bertopic_hdb.get_topic(topic_id)
    keywords_summary = ", ".join([word for word, _ in keywords[:5]])

    print(f"\nTopic {topic_id}: Top keywords: {keywords_summary}")
    print(f"Articles in Topic {topic_id}:")
    for article in articles:
        print(f"- {article}")
Topic 0: Top keywords: head, injury, severe, outcome, patients
Articles in Topic 0:
- The role of secondary brain injury in determining outcome from severe head injury
- Treatment of traumatic brain injury with moderate hypothermia
- Lack of effect of induction of hypothermia after acute brain injury
- Disability caused by minor head injury
- Diffuse axonal injury due to non-missile head injury in human beings: an analysis of 45 cases
- Adverse effects of prolonged hyperventilation in patients with severe head injury: a randomized clinical trial
- The outcome from severe head injury with early diagnosis and intensive management
- A new classification of head injury based on computed tomography
- Diffuse axonal injury in head injury: definition, diagnosis, and grading
- Cerebral blood flow and metabolism in comatose patients with acute head injury: relationship to intracranial hypertension
- Disability after severe head injury: observations on the use of the Glasgow Outcome Scale
- Significance of intracranial hypertension in severe head injury
- Neurobehavioral outcome following minor head injury: a three-center study
- The Galveston Orientation and Amnesia Test: a practical scale to assess cognition after head injury
- Guidelines for the management of severe head injury
- Oops!: performance correlates of everyday attentional failures in traumatic brain injured and normal subjects
- Cerebral circulation and metabolism after severe traumatic brain injury: the elusive role of ischemia
- Impact of ICP instability and hypertension on outcome in patients with severe head trauma
- The Canadian CT Head Rule for patients with minor head injury
- Delayed recovery of intellectual function after minor head injury
- Further experience in the management of severe head injury
- Diffuse degeneration of the cerebral white matter in severe dementia following head injury
- Predicting outcome in individual patients after severe head injury
- A phase II study of moderate hypothermia in severe brain injury
- The 5-year outcome of severe blunt head injury: a relative’s view
- Effect of mild hypothermia on uncontrollable intracranial hypertension after severe head injury
- Clinical trials in head injury
- Association of apolipoprotein E polymorphism with outcome after head injury
Topic 1: Top keywords: brain, rat, model, injury, traumatic
Articles in Topic 1:
- Erythropoietin crosses the blood–brain barrier to protect against experimental brain injury
- The role of excitatory amino acids and NMDA receptors in traumatic brain injury
- ATP mediates rapid microglial response to local brain injury in vivo
- A new model of diffuse brain injury in rats: part 1: pathophysiology and biomechanics
- Traumatic brain injury in the rat: characterization of a lateral fluid-percussion model
- A fluid-percussion model of experimental brain injury in the rat
- Massive increases in extracellular potassium and the indiscriminate release of glutamate following concussive brain injury
- A controlled cortical impact model of traumatic brain injury in the rat
- Activation of CPP32-like caspases contributes to neuronal apoptosis and neurological dysfunction after traumatic brain injury
- Evidence of apoptotic cell death after experimental traumatic brain injury in the rat
Topic 2: Top keywords: concussion, sport, international, conference, statement
Articles in Topic 2:
- Paced auditory serial-addition task: a measure of recovery from concussion
- Consensus statement on concussion in sport: The 3rd International Conference on Concussion in Sport held in Zurich
- Cumulative effects associated with recurrent concussion in collegiate football players: The NCAA Concussion Study
- Summary and agreement statement of The first International Conference on Concussion in Sport, Vienna 2001
- Summary and agreement statement of The 2nd International Conference on Concussion in Sport, Prague 2004
- Relationship between concussion and neuropsychological performance in college football players
- Experimental cerebral concussion
Topic 3: Top keywords: traumatic, brain, mild, clinical, injury
Articles in Topic 3:
- Mild traumatic brain injury in soldiers returning from Iraq
- Traumatic brain injury in the United States: a public health perspective
- The epidemiology and impact of traumatic brain injury: a brief overview
- Cerebral concussion and traumatic unconsciousness: correlation of experimental and clinical observations on blunt head injuries
- Mild traumatic brain injury: pathophysiology, natural history, and clinical management
Topic labels can be added to the original dataframe and exported as a CSV, if needed.
# export DataFrame with topic assignments
df_filtered = df[df['title'].notna()].copy().reset_index(drop=True)
df_filtered['bertopic_topic'] = topics_hdb
df_filtered.to_csv("bertopic_topic_assignments.csv", index=False)
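As an aside, recent versions of BERTopic can build a similar per-document summary (document, assigned topic, topic name, and more) in a single call, assuming your installed version supports it:
# alternative: per-document topic summary straight from BERTopic
doc_info = bertopic_hdb.get_document_info(documents)
doc_info.to_csv("bertopic_document_info.csv", index=False)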
In this example, with a very limited set of data (article titles) to screen, the contextual model performs better than the LDA model. Interestingly, the original “manual” screening led to more topics, perhaps owing to the cluster-size and minimum-sample parameters set in the HDBSCAN model. Running this on the objective statements of each paper may yield different results still.
In an update to my article, I would expand the list to at least 100 papers and use BERTopic to provide an initial set of topics that could then be refined with input from a researcher experienced in concussion.
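That refinement could be as simple as replacing the automatic topic numbers with expert-chosen names. A hypothetical sketch, where the labels are my own reading of the clusters rather than model output:
# hypothetical expert relabelling of the BERTopic clusters
expert_labels = {
    0: "Severe TBI: management and outcome",
    1: "Animal models of TBI",
    2: "Sport-related concussion consensus statements",
    3: "Mild TBI: epidemiology and clinical course",
    -1: "Unassigned / outliers",
}
df_filtered["expert_label"] = df_filtered["bertopic_topic"].map(expert_labels)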
After all, context matters.
Outcomes
A look into how two topic models, LDA and BERTopic, perform in labelling the 50 most-cited articles in traumatic brain injury. The internal consistency of topics is improved when using BERTopic.
Credits
My source publication: Top-cited articles in traumatic brain injury.