Topic Modelling

Project

What’s in a list?

Data Science · Data Visualization · Topic Modelling


Overview

We can take a list and perform topic modelling to get a sense of how its items cluster thematically. This post compares two approaches, LDA (frequency-based) and BERTopic (contextual), for categorizing scientific articles based on their titles.

Approach

This project was completed in Python, using packages such as:

  • gensim
  • bertopic
  • hdbscan

Years ago, I published a citation analysis on traumatic brain injury (TBI), identifying the 50 most-cited papers in the field at that time.

Pulling the papers and deriving citation metrics was simple. Categorizing the papers “by hand” (i.e., in an Excel sheet containing my interpretation of each article’s theme) was not. The process required judgment calls, with papers often straddling two topics given their multiple objectives.

Having since learned about topic models such as Latent Dirichlet Allocation (LDA) and BERTopic (built on Bidirectional Encoder Representations from Transformers, or BERT), I wondered how these models would assign topics to the list of most-cited TBI articles. More than their performance against my manual topic labelling, I was interested in how they would perform relative to one another, given their differences. LDA is a probabilistic approach: it assumes there are latent (hidden) topics within a “corpus” (a list of articles, in this case), and that each topic is a distribution over words. BERTopic is instead contextual, using a transformer model that understands each word by looking at all the other words in the sentence; it can tell whether the word “damages” refers to injury or to legal compensation. The text is encoded and then represented by the model as interpretable topics (by way of a deep neural network that involves several further steps).

So how do they compare? Let’s first set the global environment and then look into LDA.

# environment
import pandas as pd
import nltk
import gensim
from gensim import corpora
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from bertopic import BERTopic
import pyLDAvis.gensim_models
import matplotlib.pyplot as plt
from wordcloud import WordCloud
import hdbscan

The next step is to download punkt from nltk to tokenize text (break it into smaller “chunks”, e.g., sentences into words), and stopwords to identify common words (“the”, “in”, “it”, etc.) that are filtered out.

# set up 
nltk.download('punkt')
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

The data (list of 50 most-cited articles in TBI) will now be imported.

# import
df = pd.read_csv("top_tbi_articles.csv")
df.columns = df.columns.str.strip().str.lower()
titles = df['title'].dropna().tolist()

We now use a list comprehension to preprocess the words from each title into a list of lists. Namely, it tokenizes each title and filters out any token that is not purely alphabetic or that appears in stop_words.

# creating list
texts = [[word for word in word_tokenize(title.lower()) if word.isalpha() and word not in stop_words]
         for title in titles]
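As a sanity check, the same filter can be hand-rolled on a single made-up title. This sketch uses str.split in place of word_tokenize and a toy stop list, so it runs without the NLTK downloads:

```python
# minimal stand-in for the preprocessing step above: lowercase, split,
# keep alphabetic tokens, drop stop words (toy stop list, not NLTK's)
toy_stop_words = {"the", "of", "in", "a", "and"}

def preprocess(title):
    return [word for word in title.lower().split()
            if word.isalpha() and word not in toy_stop_words]

print(preprocess("The Neuropathology of Traumatic Brain Injury"))
# ['neuropathology', 'traumatic', 'brain', 'injury']
```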

We will now create a dictionary that assigns a unique integer ID to each unique word in texts, and a corpus of (word ID, word count) tuples built with doc2bow (“document to bag of words”).

# dictionary and corpus
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
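Conceptually, doc2bow just maps each token to its dictionary ID and counts occurrences. A rough stdlib sketch of the (word ID, word count) tuples it returns, on a toy token list:

```python
from collections import Counter

# toy preprocessed title (a token repeated to show counts)
tokens = ["brain", "injury", "mild", "brain"]

# assign integer IDs in first-seen order, roughly as corpora.Dictionary does
ids = {}
for tok in tokens:
    ids.setdefault(tok, len(ids))

# (word ID, word count) tuples: the same shape doc2bow returns
bow = sorted((ids[tok], n) for tok, n in Counter(tokens).items())
print(bow)  # [(0, 2), (1, 1), (2, 1)]
```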

The LDA model is now run and visualized with a dynamic html output.

# lda model
lda_model = gensim.models.LdaModel(
    corpus=corpus,
    id2word=dictionary,
    num_topics=5,
    passes=10,
    random_state=42
)

# visualize lda
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim_models.prepare(lda_model, corpus, dictionary)
pyLDAvis.save_html(vis, 'lda_visualization.html')

What results is a dynamic html output that lets you visualize topics and how they cluster on the left, and the words that comprise each topic (with their relative frequencies) on the right. Below is a screengrab of that output, highlighting a topic I’ll call “Sport-related concussion consensus statements”. Generally, while the topics are well separated, they are not entirely thematically distinct. In addition to the consensus statement topic, there appears to be one on the neuropathology of TBI and another on animal models, leaving the last two somewhat unspecified with respect to their theme.

[Screengrab: pyLDAvis output, with the “Sport-related concussion consensus statements” topic highlighted]

Moving to BERTopic, we’ll perform a similar analysis, drawing on a clustering method called Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN), which identifies clusters based on the density distribution of constituent data points.

A similar dynamic html output is produced, with a screengrab below. Here, the topics seem better clustered and more internally consistent. The screengrab highlights the “animal models” topic, with other topics including consensus statements, neuropathology, and outcome evaluation studies (mainly in severe TBI).

[Screengrab: BERTopic visualization, with the “animal models” topic highlighted]

Topics can be assigned to each study, as below, allowing users to assess topic-study fit.
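One way to eyeball topic-study fit is to pair titles with their assigned topic IDs. The titles and IDs below are illustrative stand-ins, not real assignments (in BERTopic, topic -1 marks HDBSCAN outliers):

```python
import pandas as pd

# illustrative titles and topic IDs; real values come from fit_transform
sample_titles = ["Consensus statement on concussion in sport",
                 "Experimental models of traumatic brain injury",
                 "Outcome after severe head injury"]
sample_topics = [0, 1, -1]  # -1 marks outliers in HDBSCAN/BERTopic

assignments = pd.DataFrame({"title": sample_titles, "topic": sample_topics})
print(assignments)
```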

Topic labels can be added to the original dataframe and exported as a csv, if needed.

# export DataFrame with topic assignments
df_filtered = df[df['title'].notna()].copy().reset_index(drop=True)
df_filtered['bertopic_topic'] = topics_hdb
df_filtered.to_csv("bertopic_topic_assignments.csv", index=False)

In this example, with a very limited set of data (article titles) to screen, the contextual model performs better than the LDA model. Interestingly, the original “manual” screening led to more topics, perhaps owing to the cluster size and minimum samples parameters set in the HDBSCAN model. Running this on the objective statements of each paper may yield different results still.

In an update to my article, I would expand the list to at least 100, and use BERT to provide an initial set of topics that could perhaps be refined with some input from a researcher experienced in concussion.

After all, context matters.

Outcomes

A look into how two topic models, LDA and BERTopic, perform in labelling the 50 most-cited articles in concussion. The internal consistency of topics improves when using BERTopic.

Credits

My source publication: Top-cited articles in traumatic brain injury.