K-Nearest Neighbours

Project

Look to your left…

Data Science · Data Visualization · Classification · KNN


Overview

Look to your left. Look to your right. One of you won’t graduate. So the story goes.

This post will use a dataset on academic success and a supervised classification method called k-nearest neighbours (KNN) to determine what predicts graduation or dropout.

Approach

This project was completed in Python, using packages such as:

  • scikit-learn (KNeighborsClassifier, PCA)
  • imbalanced-learn (SMOTE)
  • pandas, matplotlib, seaborn

Here is a story that comes up in many 101 classes.

The professor walks to the lectern. There is no introduction. Just an anecdote meant to add anxiety or competitiveness to the room: “Look to your left. Look to your right. One of you won’t make it to graduation.”

Instead of relying on gut feelings or ominous one-liners, let’s turn to data science to see whether these professors have any empirical backing. Specifically, we’ll use a method that literally looks at your neighbours in the dataset to figure out whether you’re like them: K-Nearest Neighbours (KNN), a technique that classifies a datapoint based on the points nearest to it (i.e., its neighbours) in feature space. In our case, we will compare university students to one another across a number of features, to see whether anything predicts dropout.
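Before touching the real data, the idea can be sketched on a toy example. The points and labels below are entirely made up; each "student" is just two illustrative numbers (an entry grade and an age):

```python
# A minimal sketch of KNN classification on made-up data.
from sklearn.neighbors import KNeighborsClassifier

# Two hypothetical features per "student": an entry grade and an age.
X = [[14, 19], [16, 18], [11, 24], [10, 26], [15, 20], [9, 30]]
y = ["Graduate", "Graduate", "Dropout", "Dropout", "Graduate", "Dropout"]

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)

# A new student is classified by majority vote among their 3 nearest neighbours.
print(knn.predict([[15, 19]]))  # -> ['Graduate']
```

The new point at (15, 19) sits closest to three graduates, so the majority vote labels it a graduate. The real analysis below works the same way, only with many more features.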

For this post, we'll use the "Predict Students' Dropout and Academic Success" dataset from the UCI Machine Learning Repository. This dataset captures a range of factors influencing university students' journeys, from their academic performance to external socio-economic pressures. Some of the key features include:

  • Admission Grade: How well did they perform before entering university?
  • Curricular Unit Grades: Academic performance over the semesters.
  • Unemployment Rate: External economic pressures that might influence a student’s trajectory.

Let's load the data and rename the columns.

# loading data
import pandas as pd
import numpy as np

file_path = 'data.xlsx'
data = pd.read_excel(file_path)

# the raw column names contain spaces (e.g. 'Admission grade');
# replace them with underscores so they are easier to reference
data.columns = data.columns.str.replace(' ', '_')

Let's explore the data some more. Visualizing the distribution of the target variable and checking for missingness are important first steps.

# data summary
print(data.describe())

We will quickly check for missingness across the dataset.

# missingness 
print(data.isnull().sum())

There is no missingness in this dataset, and imputation is therefore not required.

We’ll now quickly visualize the distribution of the Target.

# target distribution
import matplotlib.pyplot as plt
import seaborn as sns

data['Target'].value_counts().plot(kind='bar', title='Distribution of Target')
plt.show()
[Figure: bar chart of the Target distribution]

At first glance, the professors do seem to have basic descriptive statistics on their side. Based on the plot above alone, it would not be unreasonable to tell a 101 class that roughly a third will not graduate. Beyond that, we do appear to have enough data per category for a worthwhile analysis.

Next, we dive into key features like admission grades and unemployment rates. seaborn's pairplot() will be used to visualize associations between Target and some features of interest.

# visualizing key features
sns.pairplot(data, hue='Target', vars=[
    'Admission_grade', 'Age_at_enrollment', 'Scholarship_holder', 'Gender', 'Debtor'
])
plt.show()
[Figure: pairplot of key features, coloured by Target]

Graduates have higher Admission_grade and lower Age_at_enrollment. More females (coded as Gender == 0) graduate than don’t, whereas among males the split is even. Most females who graduate are not debtors. In contrast, there seems to be a high concentration of dropouts who are debtors, irrespective of Admission_grade and Age_at_enrollment, suggesting that debt load may contribute to leaving school despite high entry marks and a young age. Similarly, many who do not hold scholarships (Scholarship_holder == 0) are dropouts.

We'll now build the KNN model, using key functionality in scikit-learn. KNN works by comparing a student to their “k” nearest neighbours in feature space. If your closest neighbours graduated, chances are you’re on track to graduate too. Class imbalance (the uneven split between graduates, dropouts, and enrolled students) will be addressed using the Synthetic Minority Oversampling Technique (SMOTE). RandomizedSearchCV() will also be used to identify the optimal hyperparameters for the KNN model. We'll also build pipelines to scale, transform, and/or encode numerical and categorical variables.

Now, we'll define the KNN model and set up the RandomizedSearchCV(). This allows us to identify the best parameters, like the number of neighbours and the distance metric. After optimization, we evaluate performance using metrics like accuracy, precision, and recall to ensure the model identifies graduates and dropouts effectively.

Next, we'll look at the best parameters identified in the previous step. And we'll also assess model accuracy, and visualize this with a confusion matrix.

Best parameters: {'weights': 'distance', 'p': 1, 'n_neighbors': 2}
Cross-validation accuracy scores: [0.79830349 0.76320755 0.81603774 0.8245283  0.79339623]
Mean cross-validation accuracy: 0.7990946597193818

We can plot a confusion matrix to better visualize how the model performs.
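A sketch of how such a plot can be produced with scikit-learn's ConfusionMatrixDisplay; the true and predicted labels below are small illustrative stand-ins for the model's test-set predictions:

```python
# A sketch of the confusion-matrix plot, using illustrative labels
# in place of the fitted model's test-set predictions.
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix

y_test = [0, 0, 1, 1, 2, 2, 0, 1, 2, 0]   # true classes (illustrative)
y_pred = [0, 1, 1, 1, 2, 0, 0, 1, 2, 0]   # predicted classes (illustrative)

# rows = true class, columns = predicted class
cm = confusion_matrix(y_test, y_pred)
ConfusionMatrixDisplay(cm).plot()
plt.show()
```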

[Figure: confusion matrix of model predictions]
Classification report:
              precision    recall  f1-score   support

           0       0.89      0.79      0.84       463
           1       0.75      0.88      0.81       424
           2       0.80      0.77      0.78       439

    accuracy                           0.81      1326
   macro avg       0.81      0.81      0.81      1326
weighted avg       0.82      0.81      0.81      1326

The model performs modestly, with an accuracy of around 80%. The greatest confusion appears to be between Graduates and Enrolled. In a fuller analysis, the next step would be to improve the model: feature engineering could be explored to understand how pre-admission factors (such as prior grades and parents' occupations), socio-economic factors (including debtor and displacement status), and enrollment-related factors (such as course load) interact, which may allow for more accurate modelling.

For the present project, to aid in visualizing the similarity of these two groups, we can use a dimensionality reduction technique such as PCA before plotting the data.
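The projection step looks like this; the random matrix below is only a placeholder for the dataset's scaled feature matrix:

```python
# A sketch of the 2-D PCA projection used for plotting,
# on a random placeholder for the real feature matrix.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))  # placeholder features

# PCA is scale-sensitive, so standardize first
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)

# explained variance tells us how faithful the 2-D picture is
print(pca.explained_variance_ratio_.sum())
```

The explained-variance ratio is worth printing: it is what tells us (as noted below) that a two-component picture captures only a modest share of the structure in the data.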

[Figure: data projected onto the first two principal components, coloured by class]

Plotting on just the first two principal components (which cumulatively account for only 27% of the variance), we see that Dropouts (in yellow) are slightly separate from the other two classes, which reinforces the difficulty in distinguishing between Graduate and Enrolled status.

We will now plot the decision boundaries defined by the KNN model.

The background colours of this plot represent the predicted classes. The points are the actual data. Areas in which the background colour matches the point suggest a datapoint that was accurately predicted by the KNN model; instances where a point does not match its background suggest misclassification.

In general, we can see here that the decision boundaries leave considerable room for misclassification. For example, in roughly the middle of the plot, we can see small pockets of each background colour, suggesting that a new datapoint (or one in the test set) is at risk of being misclassified if its position varies only slightly.

So, if you look left and right, are you like your neighbour? This analysis would say… kinda. The plot above shows how similar the data points are towards the centre, where it's hard to tell whether your neighbour is a dropout or a graduate. Towards the edges of the plot, the distinction becomes clearer.

Outcomes

Using SMOTE and KNeighborsClassifier, we looked into whether the “look to your left… look to your right…” adage held true. Initial modelling shows that dropout and graduate status can be predicted with moderate accuracy, with some initial steps for model improvement outlined.

Credits

Data: UCI ML Repository