RegEx text analysis

Project

A comparison of two tales

Data Science · Data Analysis · Regular Expressions · RegEx


Overview

Scientific articles follow the same basic structure, but how similar are two of mine? This post looks into Regular Expressions, using Python's re package to analyze word frequency (including paired word frequency), sentence length, and readability score across these articles.

Approach

This project was completed in Python, using packages such as:

  • re
  • Counter (from collections)
  • PyPDF2

My goal when writing a scientific article is accuracy and clarity. Holding to these goals leads to an article that is true to the idea being tested and the methods used to do so. In the end, this is the purpose of writing such articles for the scientific community.

Yet I do wonder how my writing has changed with experience and whether it varies when writing for different scientific audiences. The narrative of a scientific article is shaped by its standard format, from Introduction to Methods to Results to Discussion. There is little room for any creative writing. But I am curious about how aspects of my writing, such as word frequency, sentence length, and repetitive clauses, change between articles.

In this post, I will compare two of my scientific articles.

Both are related to concussion, though one is focused on neuroimaging and the other is focused on exercise/physical activity. The audiences for both are scientific, but slightly different. In running through this comparison, this post will explore Python's powerful tool for handling Regular Expressions, or RegEx, highlighting some of its most important features.

First, the two articles will be loaded from PDF using PyPDF2.

import PyPDF2

# open the mri article
with open('mri_article.pdf', 'rb') as file:
    reader = PyPDF2.PdfReader(file)
    mri_article = ""
    # loop through each page in the pdf
    for page in reader.pages:
        page_text = page.extract_text()
        mri_article += page_text + "\n"

# and the exercise article
with open('exercise_article.pdf', 'rb') as file:
    reader = PyPDF2.PdfReader(file)
    exercise_article = ""
    # loop through each page in the pdf
    for page in reader.pages:
        page_text = page.extract_text()
        exercise_article += page_text + "\n"

To start, let's see how many words are in each article, so we can compare metrics fairly.

import re

def word_count(text):
    count = re.findall(r"\b\w+\b", text)
    return len(count)

Here, we have a simple RegEx command, which we can break down:

  • \b defines a word boundary
  • \w+ matches one or more word characters (letters, digits, and underscores)

A comprehensive list of all regular expression operations can be found in the official Python documentation.
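
To see the pattern in action on a throwaway sentence (illustrative text only, not from either article):

sample = "Post-injury testing wasn't repeated until 2019."
print(re.findall(r"\b\w+\b", sample))
# expected: ['Post', 'injury', 'testing', 'wasn', 't', 'repeated', 'until', '2019']
# note: hyphenated words and contractions are split into separate tokens, so these word counts are approximate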

print(f"The MRI article has {word_count(mri_article)} words")
print(f"The exercise article has {word_count(exercise_article)} words")
The MRI article has 7403 words
The exercise article has 6282 words

The MRI article, references and all, is about 18% longer than the exercise article.

Frequency analysis

Next, we'll search for the number of times "concussion" (or a variant of it) appears in each article.

concussion_pattern = r"\bconcuss(?:ion|ions|ed|ive)?\b"

def concussion_count(text):
    count = re.findall(concussion_pattern, text, re.IGNORECASE)
    return len(count)

We can again break down the RegEx:

  • \b defines a word boundary
  • concuss matches the literal string "concuss"
  • (?:...)? is a non-capturing group, and the trailing ? makes it optional, so the pattern can still match and count the bare word "concuss"
  • ion|ions|ed|ive provides alternative endings to be matched (e.g., concussive)

print(f"The MRI article has {concussion_count(mri_article)} mentions of concuss*, meaning {(concussion_count(mri_article)/word_count(mri_article))*100:.3f}% of its words are concussion or a variant of this word.")
print(f"The exercise article has {concussion_count(exercise_article)} mentions of concuss*, meaning {(concussion_count(exercise_article)/word_count(exercise_article))*100:.3f}% of its words are concussion or a variant of this word.")
The MRI article has 108 mentions of concuss*, meaning 1.459% of its words are concussion or a variant of this word.
The exercise article has 110 mentions of concuss*, meaning 1.751% of its words are concussion or a variant of this word.
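
As a quick sanity check, the same pattern applied to a made-up sentence (not from either article) picks up the different word forms:

sample = "A concussed athlete may report concussive symptoms; concussions vary."
print(re.findall(concussion_pattern, sample, re.IGNORECASE))
# expected: ['concussed', 'concussive', 'concussions']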

Bigrams and Trigrams

Counting a single word could be done with a simple Ctrl+F. But if we want to see how frequently two or three words appear together, re proves useful.

import matplotlib.pyplot as plt
import pandas as pd
from collections import Counter

def plot_top_ngrams(text, ngram=2, top_n=10, title=''):
    words = re.findall(r'\b\w+\b', text.lower())
    
    if ngram == 2:
        ngrams = list(zip(words, words[1:]))
    elif ngram == 3:
        ngrams = [tuple(words[i:i+3]) for i in range(len(words)-2)]
    else:
        raise ValueError("Only bigrams (2) and trigrams (3) supported.")

    ngram_counts = Counter(ngrams)
    common_ngrams = ngram_counts.most_common(top_n)
    
    # create a df for visualization
    df = pd.DataFrame(common_ngrams, columns=['Ngram', 'Frequency'])
    df['Ngram'] = df['Ngram'].apply(lambda x: ' '.join(x))

    # plotting the results
    plt.figure(figsize=(12, 6))
    plt.barh(df['Ngram'], df['Frequency'])
    plt.xlabel('Frequency')
    plt.ylabel(f'Top {top_n} {"Bigrams" if ngram==2 else "Trigrams"}')
    plt.title(title if title else f'Top {top_n} {"Bigrams" if ngram==2 else "Trigrams"} in Text')
    plt.gca().invert_yaxis()
    plt.grid(axis='x', linestyle='--', alpha=0.7)
    plt.show()
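
The bigram construction itself is just zip pairing each word with the word that follows it; a toy example (hypothetical sentence, not article text) makes this concrete:

words = "athletes with concussion returned to play".split()
print(list(zip(words, words[1:])))
# expected: [('athletes', 'with'), ('with', 'concussion'), ('concussion', 'returned'), ('returned', 'to'), ('to', 'play')]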

We can now plot the most frequently used 2-word phrases in each article, and the function can handle identification and plotting of 3-word phrases as well.

plot_top_ngrams(mri_article, ngram=2, top_n=10, title='MRI Article - Top 10 Bigrams')
[Bar chart: MRI Article - Top 10 Bigrams]
plot_top_ngrams(exercise_article, ngram=2, top_n=10, title='Exercise Article - Top 10 Bigrams')
[Bar chart: Exercise Article - Top 10 Bigrams]

It seems that “with concussion” and “et al” are the only two-word phrases that rank among the top-10 most frequent across both articles.

Counting numbers

re is also more useful than Ctrl+F for finding any and all numbers in either article.

number_pattern = r"\d+(?:[.,]\d+)*"

def number_count(text):
    count = re.findall(number_pattern, text)
    unique_count = list(set(count))
    return len(unique_count)

We can again break down the RegEx:

  • \d+ matches one or more digits (0-9)
  • (?:...) is again a non-capturing group
  • [.,] allows a dot or comma between digits, so decimals and comma-separated numbers are captured as a single match

print(f"The MRI article has {number_count(mri_article)} unique numbers, meaning numbers make up {(number_count(mri_article)/word_count(mri_article))*100:.3f}% of the article.")
print(f"The exercise article has {number_count(exercise_article)} unique numbers, meaning numbers make up {(number_count(exercise_article)/word_count(exercise_article))*100:.3f}% of the article.")
The MRI article has 391 unique numbers, meaning numbers make up 5.282% of the article.
The exercise article has 375 unique numbers, meaning numbers make up 5.969% of the article.
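
To illustrate what the pattern treats as a single number (made-up values, not drawn from the articles):

sample = "Participants (n = 1,024) were scanned at 3.0 T between 2016 and 2019."
print(re.findall(number_pattern, sample))
# expected: ['1,024', '3.0', '2016', '2019']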

Again, this is similar across both articles, which may suggest that the mention of specific quantitative findings and values is comparable, despite the articles' differing lengths.

Scientific abbreviations

Next, we'll pull all unique acronyms from each article to get a sense of their overlap. Journals sometimes ask for this information, and this is a quick way to pull it together.

abbreviation_pattern = r"\b[A-Z]{3,6}\b"

def abbreviation_count(text):
    count = re.findall(abbreviation_pattern, text)
    unique_count = list(set(count))
    return len(unique_count), unique_count

Here, the addition to the regular expression is as follows:

  • [A-Z] matches only uppercase letters, as is typical for abbreviations
  • {3,6} is a quantifier requiring that the preceding element (an uppercase letter) appear at least 3 and at most 6 times, a typical length for an abbreviation
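
A quick check on a throwaway sentence (illustrative only) shows what does and does not get picked up; note that two-letter acronyms such as CT fall outside the {3,6} range:

sample = "Resting-state MRI and CT data were processed in MATLAB using SPM."
print(re.findall(abbreviation_pattern, sample))
# expected: ['MRI', 'MATLAB', 'SPM'] (CT is only two letters, so it is not matched)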

Here is a count and list of all the abbreviations in the MRI article:

abbreviation_count(mri_article)
(38,
 ['FWE',
  'BOLD',
  'OPEN',
  'CGS',
  'ONE',
  'TBI',
  'PNC',
  'ART',
  'EPI',
  'SPM',
  'PCC',
  'NIH',
  'ABIDE',
  'MRI',
  'SPGR',
  'MNI',
  'JAMA',
  'CONN',
  'NCAA',
  'CARE',
  'FOV',
  'TRACK',
  'IRC',
  'III',
  'PCSS',
  'CIHR',
  'ROI',
  'FWHM',
  'MOP',
  'TFCE',
  'PCS',
  'MATLAB',
  'DOD',
  'DMN',
  'SMNF',
  'FPN',
  'SMN',
  'ANCOV'])

And now the exercise article:

abbreviation_count(exercise_article)
(16,
 ['JSR',
  'CIHR',
  'FITT',
  'CGS',
  'LPA',
  'VPA',
  'MPA',
  'UTC',
  'JAMA',
  'MVPA',
  'SPSS',
  'IBM',
  'HTR',
  'MSS',
  'PES',
  'III'])

Word frequency analysis

Next, let's look into word frequency analysis across both articles, using Counter from collections and similar syntax to what has been used above.

from collections import Counter
import pandas as pd

def frequency_analysis(text):
    # define a set of stop words to exclude
    stop_words = {"a", "the", "it", "an", "and", "or", "of", "to", "in", "that", "with", "as", "on"}
    words = re.findall(r"\b\w+\b", text.lower())
    filtered_words = [word for word in words if word not in stop_words and not word.isdigit()]
    word_counts = Counter(filtered_words)
    df_word_freq = pd.DataFrame(word_counts.most_common(10), columns = ["Word", "Frequency"])
    return df_word_freq
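
As a toy check of the stop-word filtering (hypothetical sentence, not article text):

sample = "The athletes with a concussion and the athletes without a concussion"
print(frequency_analysis(sample))
# expected: 'athletes' and 'concussion' each counted twice; stop words such as "the", "with", "a", and "and" are excluded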

We can now run this for each article. Generally, we see that the frequency of words differs (with the exception of the word "concussion") between articles, which reflects the different subject matter of and audiences for these articles.

frequency_analysis(mri_article)
[Table: top 10 most frequent words in the MRI article]
frequency_analysis(exercise_article)
[Table: top 10 most frequent words in the exercise article]

Sentence length

Next, we'll look into how sentence length differs by article.

def sentence_length(text):
    # splitting sentences on periods, exclamation marks, and question marks (the latter two are less applicable to scientific writing)
    sentences = re.split(r"[.!?]+", text)
    # removing leading and trailing white space
    sentences = [s.strip() for s in sentences if s.strip()]
    sentence_lengths = [len(s.split()) for s in sentences]
    avg_sentence_length = sum(sentence_lengths) / len(sentence_lengths) if sentence_lengths else 0
    return avg_sentence_length

This is structured similarly to the prior sections, with strip() used to remove white space.
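
On a small made-up passage (not from either article), the split looks like this; note that abbreviations and decimal points also end a "sentence" under this simple rule, which pushes the average sentence length down:

sample = "We recruited 42 athletes (mean age 16.5 years). Testing followed Smith et al. and took 2 hours!"
sentences = [s.strip() for s in re.split(r"[.!?]+", sample) if s.strip()]
print(sentences)
# expected: ['We recruited 42 athletes (mean age 16', '5 years)', 'Testing followed Smith et al', 'and took 2 hours']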

print(f"The MRI article has an average sentence length of {(sentence_length(mri_article)):.3f}.")
print(f"The exercise article has an average sentence length of {(sentence_length(exercise_article)):.3f}.")
The MRI article has an average sentence length of 7.251.
The exercise article has an average sentence length of 8.931.

There does seem to be a difference here, though perhaps not one large enough to noticeably change the reading experience.

Readability score

We can also calculate a readability score, as Microsoft Word does, though here it is computed by hand as an exercise.

def readability_score(text):
    words = re.findall(r"\b\w+\b", text)
    total_words = len(words)
    sentences = re.split(r"[.!?]+", text)
    sentences = [s.strip() for s in sentences if s.strip()]
    total_sentences = len(sentences)
    total_syllables = sum([len(re.findall(r"[aeiouyAEIOUY]+", word)) for word in words])
    flesch_score = 206.835 - (1.015 * (total_words / total_sentences)) - (84.6 * (total_syllables / total_words))
    return flesch_score

Here, a higher score indicates greater readability. The formula is the Flesch Reading Ease score, with syllables approximated by counting runs of vowels in each word.
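
The vowel-run approximation is rough, as a quick check on a few arbitrary words shows (silent e's are overcounted and initialisms like MRI are undercounted):

for word in ["concussion", "exercise", "MRI"]:
    print(word, len(re.findall(r"[aeiouyAEIOUY]+", word)))
# expected: concussion 3, exercise 4, MRI 1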

print(f"The MRI article has a readability score of {(readability_score(mri_article)):.3f}.")
print(f"The exercise article has a readability score of {(readability_score(exercise_article)):.3f}.")
The MRI article has a readability score of 64.267.
The exercise article has a readability score of 57.241.

The greater readability of the MRI article may be attributable to its shorter sentences. The benefit of computing this score here rather than in a word processor is that it is relatively quick to gauge the impact of a few words on the readability score. For example, the function above could be adapted to filter out words with more than 5 syllables to see how removing or replacing those words changes the article's readability, as sketched below.
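
As a sketch of that idea (a hypothetical variant, readability_score_filtered, with the syllable cutoff as a parameter; not something used in the analysis above):

def readability_score_filtered(text, max_syllables=5):
    # same as readability_score, but drop words above a syllable threshold before scoring
    # (mirrors readability_score, so it has the same lack of guards against empty input)
    words = re.findall(r"\b\w+\b", text)
    words = [w for w in words if len(re.findall(r"[aeiouyAEIOUY]+", w)) <= max_syllables]
    total_words = len(words)
    sentences = [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]
    total_sentences = len(sentences)
    total_syllables = sum(len(re.findall(r"[aeiouyAEIOUY]+", w)) for w in words)
    return 206.835 - (1.015 * (total_words / total_sentences)) - (84.6 * (total_syllables / total_words))

Comparing its output against readability_score for the same article would show how much the longest words weigh on the score.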

In the end, the two articles seem more similar than not. This may be largely attributed to the standard framing of scientific articles. Or the general sentence structure that favors direct prose. Or the similar article lengths imposed by journals. There are a fair number of boundaries imposed on scientific articles.

Maybe I can escape them with even more experience.

Outcomes

In comparing two of my scientific articles, we explored the functionality of the re package. This post can help guide text analysis on other documents, scientific or otherwise, including word count, paired word frequency, sentence length, and readability score.