Project
Look to your left…
Data Science · Data Visualization · Classification · KNN
Overview
Look to your left. Look to your right. One of you won’t graduate. So the story goes.
This post uses a dataset on academic success and a classification method called k-nearest neighbors (KNN) to determine what predicts graduation or dropout.
Approach
This project was completed in Python, using packages such as:
scikit-learn (including KNeighborsClassifier and PCA)
imbalanced-learn (SMOTE)
pandas, matplotlib, and seaborn
Here is a story you may have heard in your 101 classes.
The professor walks to the lectern. There is no introduction, just an anecdote meant to inject anxiety and competitiveness into the room: “Look to your left. Look to your right. One of you won’t make it to graduation.”
Instead of relying on gut feelings or ominous one-liners, let's turn to data science to see whether these professors have any empirical backing. Specifically, we'll use a method that literally looks at your neighbors (in the dataset) to figure out whether you're like them. This method is k-nearest neighbors (KNN), a technique that classifies a data point based on the points closest to it (i.e., its neighbors) in feature space. In our case, we will compare university students to one another across a number of features to see whether something predicts dropout.
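To make the idea concrete, here is a toy sketch of the KNN logic with made-up numbers (the feature values, labels, and choice of k are purely illustrative and not taken from the dataset used below):
import numpy as np
from collections import Counter
# hypothetical (admission grade, age) pairs for five students with known outcomes
known_students = np.array([[150, 18], [142, 19], [118, 25], [120, 31], [160, 18]])
known_labels = np.array(['Graduate', 'Graduate', 'Dropout', 'Dropout', 'Graduate'])
# a new student we would like to classify
new_student = np.array([145, 20])
# euclidean distance from the new student to each known student
distances = np.linalg.norm(known_students - new_student, axis=1)
# look at the k = 3 closest students and take a majority vote on their labels
k = 3
nearest_labels = known_labels[np.argsort(distances)[:k]]
print(Counter(nearest_labels).most_common(1)[0][0])  # majority label among the neighbors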
For this post, we'll use the "Predict Students' Dropout and Academic Success" dataset from the UCI Machine Learning Repository. This dataset captures a range of factors influencing university students' journeys, from their academic performance to external socio-economic pressures. Some of the key features include:
- Admission Grade: How well did they perform before entering university?
- Curricular Unit Grades: Academic performance over the semesters.
- Unemployment Rate: External economic pressures that might influence a student’s trajectory.
Let's load the data and rename the columns.
# loading data
import pandas as pd
import numpy as np
file_path = 'data.xlsx'
data = pd.read_excel(file_path)
# renaming columns
columns = [
"Marital_status", "Application_mode", "Application_order", "Course",
"Daytime_evening_attendance", "Previous_qualification", "Previous_qualification_grade",
"Nationality", "Mother_qualification", "Father_qualification", "Mother_occupation",
"Father_occupation", "Admission_grade", "Displaced", "Educational_special_needs",
"Debtor", "Tuition_fees_up_to_date", "Gender", "Scholarship_holder", "Age_at_enrollment",
"International", "Curricular_units_1st_sem_credited", "Curricular_units_1st_sem_enrolled",
"Curricular_units_1st_sem_evaluations", "Curricular_units_1st_sem_approved",
"Curricular_units_1st_sem_grade", "Curricular_units_1st_sem_without_evaluations",
"Curricular_units_2nd_sem_credited", "Curricular_units_2nd_sem_enrolled",
"Curricular_units_2nd_sem_evaluations", "Curricular_units_2nd_sem_approved",
"Curricular_units_2nd_sem_grade", "Curricular_units_2nd_sem_without_evaluations",
"Unemployment_rate", "Inflation_rate", "GDP", "Target"
]
data.columns = columns
Let's explore the data some more. Visualizing the distribution of the target variable and checking for missingness throughout the dataset are important first steps.
# data summary
print(data.describe())
Marital_status Application_mode Application_order Course \
count 4424.000000 4424.000000 4424.000000 4424.000000
mean 1.178571 18.669078 1.727848 8856.642631
std 0.605747 17.484682 1.313793 2063.566416
min 1.000000 1.000000 0.000000 33.000000
25% 1.000000 1.000000 1.000000 9085.000000
50% 1.000000 17.000000 1.000000 9238.000000
75% 1.000000 39.000000 2.000000 9556.000000
max 6.000000 57.000000 9.000000 9991.000000
Daytime_evening_attendance Previous_qualification \
count 4424.000000 4424.000000
mean 0.890823 4.577758
std 0.311897 10.216592
min 0.000000 1.000000
25% 1.000000 1.000000
50% 1.000000 1.000000
75% 1.000000 1.000000
max 1.000000 43.000000
Previous_qualification_grade Nationality Mother_qualification \
count 4424.000000 4424.000000 4424.000000
mean 132.613314 1.873192 19.561935
std 13.188332 6.914514 15.603186
min 95.000000 1.000000 1.000000
25% 125.000000 1.000000 2.000000
50% 133.100000 1.000000 19.000000
75% 140.000000 1.000000 37.000000
max 190.000000 109.000000 44.000000
Father_qualification ... \
count 4424.000000 ...
mean 22.275316 ...
std 15.343108 ...
min 1.000000 ...
25% 3.000000 ...
50% 19.000000 ...
75% 37.000000 ...
max 44.000000 ...
Curricular_units_1st_sem_without_evaluations \
count 4424.000000
mean 0.137658
std 0.690880
min 0.000000
25% 0.000000
50% 0.000000
75% 0.000000
max 12.000000
Curricular_units_2nd_sem_credited Curricular_units_2nd_sem_enrolled \
count 4424.000000 4424.000000
mean 0.541817 6.232143
std 1.918546 2.195951
min 0.000000 0.000000
25% 0.000000 5.000000
50% 0.000000 6.000000
75% 0.000000 7.000000
max 19.000000 23.000000
Curricular_units_2nd_sem_evaluations \
count 4424.000000
mean 8.063291
std 3.947951
min 0.000000
25% 6.000000
50% 8.000000
75% 10.000000
max 33.000000
Curricular_units_2nd_sem_approved Curricular_units_2nd_sem_grade \
count 4424.000000 4424.000000
mean 4.435805 10.230206
std 3.014764 5.210808
min 0.000000 0.000000
25% 2.000000 10.750000
50% 5.000000 12.200000
75% 6.000000 13.333333
max 20.000000 18.571429
Curricular_units_2nd_sem_without_evaluations Unemployment_rate \
count 4424.000000 4424.000000
mean 0.150316 11.566139
std 0.753774 2.663850
min 0.000000 7.600000
25% 0.000000 9.400000
50% 0.000000 11.100000
75% 0.000000 13.900000
max 12.000000 16.200000
Inflation_rate GDP
count 4424.000000 4424.000000
mean 1.228029 0.001969
std 1.382711 2.269935
min -0.800000 -4.060000
25% 0.300000 -1.700000
50% 1.400000 0.320000
75% 2.600000 1.790000
max 3.700000 3.510000
[8 rows x 36 columns]
We will quickly check for missingness across the dataset.
# missingness
print(data.isnull().sum())
Marital_status 0
Application_mode 0
Application_order 0
Course 0
Daytime_evening_attendance 0
Previous_qualification 0
Previous_qualification_grade 0
Nationality 0
Mother_qualification 0
Father_qualification 0
Mother_occupation 0
Father_occupation 0
Admission_grade 0
Displaced 0
Educational_special_needs 0
Debtor 0
Tuition_fees_up_to_date 0
Gender 0
Scholarship_holder 0
Age_at_enrollment 0
International 0
Curricular_units_1st_sem_credited 0
Curricular_units_1st_sem_enrolled 0
Curricular_units_1st_sem_evaluations 0
Curricular_units_1st_sem_approved 0
Curricular_units_1st_sem_grade 0
Curricular_units_1st_sem_without_evaluations 0
Curricular_units_2nd_sem_credited 0
Curricular_units_2nd_sem_enrolled 0
Curricular_units_2nd_sem_evaluations 0
Curricular_units_2nd_sem_approved 0
Curricular_units_2nd_sem_grade 0
Curricular_units_2nd_sem_without_evaluations 0
Unemployment_rate 0
Inflation_rate 0
GDP 0
Target 0
dtype: int64
There is no missingness in this dataset, and imputation is therefore not required.
We'll now quickly visualize the distribution of the Target variable.
# target distribution
import matplotlib.pyplot as plt
import seaborn as sns
data['Target'].value_counts().plot(kind='bar', title='Distribution of Target')
plt.show()
At first glance, the professors do seem to be onto something, at least in terms of basic descriptive statistics. Based on the plot above alone, it would not be unreasonable to tell a 101 class that roughly a third will not graduate. Beyond that, we appear to have enough data per category for a worthwhile analysis.
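To put a rough number on that impression, the class proportions can be printed directly from the Target column loaded above:
# proportion of students in each outcome category
print(data['Target'].value_counts(normalize=True).round(2))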
Next, we dive into key features like admission grade, age at enrollment, and debtor status. pairplot will be used to visualize associations between Target and some features of interest.
# visualizing key features
sns.pairplot(data, hue='Target', vars=[
'Admission_grade', 'Age_at_enrollment', 'Scholarship_holder', 'Gender', 'Debtor'
])
plt.show()
Graduates tend to have a higher admission_grade and a lower age_at_enrollment. More females (coded as Gender == 0) graduate than not, whereas among males the split is even. Most females who graduate are not debtors. In contrast, there seems to be a high concentration of dropouts among debtors, irrespective of admission_grade and age_at_enrollment, suggesting that debt load may contribute to leaving school despite high entry marks and a young age. Similarly, many who do not hold scholarships (Scholarship_holder == 0) are dropouts.
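These impressions from the pairplot can be sanity-checked with quick cross-tabulations (a supplementary check using only columns already in the dataset, not part of the plots above):
# share of each outcome within debtor and non-debtor groups
print(pd.crosstab(data['Debtor'], data['Target'], normalize='index').round(2))
# the same breakdown by gender (0 = female, 1 = male)
print(pd.crosstab(data['Gender'], data['Target'], normalize='index').round(2))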
We'll now build the KNN model using key functionality in scikit-learn. KNN works by comparing a student to their “k” nearest neighbors in feature space: if your closest neighbors graduated, chances are you're on track to graduate too. Class imbalance (the uneven split between graduates, dropouts, and enrolled students) will be addressed using the Synthetic Minority Oversampling Technique (SMOTE). RandomizedSearchCV() will be used to identify the optimal hyperparameters for the KNN model. We'll also scale the features with StandardScaler and encode the target labels with LabelEncoder; these preprocessing steps could equally be bundled into a pipeline, as sketched after the code below.
from sklearn.preprocessing import LabelEncoder, StandardScaler
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split
# defining features and target
X = data.drop('Target', axis=1)
label_encoder = LabelEncoder()
y = data['Target']
y = label_encoder.fit_transform(y)
# smote
smote = SMOTE(random_state=189)
X_resampled, y_resampled = smote.fit_resample(X, y)
# scaling features
scaler = StandardScaler()
X_resampled_scaled = scaler.fit_transform(X_resampled)
# splitting
X_train, X_test, y_train, y_test = train_test_split(X_resampled_scaled, y_resampled, test_size=0.2, random_state=189)
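As mentioned above, the scaling and oversampling steps could also be bundled into a single object using imbalanced-learn's pipeline. This is a minimal sketch of that alternative layout (not the code used for the results that follow):
from imblearn.pipeline import Pipeline as ImbPipeline
from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
# scaling, oversampling, and the classifier bundled into one estimator
knn_pipeline = ImbPipeline(steps=[
    ('scaler', StandardScaler()),
    ('smote', SMOTE(random_state=189)),
    ('knn', KNeighborsClassifier())
])
A nice property of this layout is that, when the pipeline is used with cross-validation, SMOTE is applied only to the training folds rather than to the full dataset.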
Now we'll define the KNN model and set up RandomizedSearchCV(). This allows us to identify the best parameters, like the number of neighbors and the distance metric. After optimization, we evaluate performance using metrics like accuracy, precision, and recall to ensure the model identifies graduates and dropouts effectively.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import RandomizedSearchCV
# knn model
knn = KNeighborsClassifier()
# knn with randomsearch
param_distributions = {
'n_neighbors': np.arange(1, 31),
'weights': ['uniform', 'distance'],
'p': [1, 2]
}
random_search = RandomizedSearchCV(knn, param_distributions, n_iter=50, cv=5, scoring='accuracy', random_state=189)
random_search.fit(X_train, y_train)
Next, we'll look at the best parameters identified in the previous step and assess model accuracy with cross-validation.
from sklearn.model_selection import cross_val_score
# best parameters
print('Best parameters:', random_search.best_params_)
# best model
best_knn = random_search.best_estimator_
# cross-validation scores
cv_scores = cross_val_score(best_knn, X_train, y_train, cv=5, scoring='accuracy')
print('Cross-validation accuracy scores:', cv_scores)
print('Mean cross-validation accuracy:', np.mean(cv_scores))
Best parameters: {'weights': 'distance', 'p': 1, 'n_neighbors': 2}
Cross-validation accuracy scores: [0.79830349 0.76320755 0.81603774 0.8245283 0.79339623]
Mean cross-validation accuracy: 0.7990946597193818
We can plot a confusion matrix to better visualize how the model performs.
from sklearn.metrics import confusion_matrix, classification_report
# prediction
y_pred = best_knn.predict(X_test)
# evaluation
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=label_encoder.classes_, yticklabels=label_encoder.classes_)
plt.title('Confusion matrix')
plt.xlabel('Predicted labels')
plt.ylabel('True labels')
plt.show()
print('Classification report:')
print(classification_report(y_test, y_pred))
Classification report:
precision recall f1-score support
0 0.89 0.79 0.84 463
1 0.75 0.88 0.81 424
2 0.80 0.77 0.78 439
accuracy 0.81 1326
macro avg 0.81 0.81 0.81 1326
weighted avg 0.82 0.81 0.81 1326
The model performs modestly, with an accuracy of around 80%. The greatest confusion appears to be between the Graduate and Enrolled classes. In a fuller analysis, the next step would be to improve the model: feature engineering could be explored to understand more about how pre-admission factors (such as prior grades and parents' occupations), socio-economic factors (including debtor and displacement status), and enrollment-related factors (such as course load) interact, which may allow for more accurate modelling. One such derived feature is sketched below.
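As one hypothetical illustration of that kind of feature engineering, a pass rate per semester could be derived from the curricular-unit columns (the name pass_rate_1st_sem is ours, not part of the dataset):
# hypothetical engineered feature: share of enrolled first-semester units that were approved
data['pass_rate_1st_sem'] = (
    data['Curricular_units_1st_sem_approved']
    / data['Curricular_units_1st_sem_enrolled'].replace(0, np.nan)
)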
For the present project, to aid in visualizing the similarity of these two groups, we can use a dimensionality reduction technique such as PCA prior to plotting the data.
from sklearn.decomposition import PCA
# three cluster approach
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_resampled_scaled)
explained_variance = pca.explained_variance_ratio_
plt.figure(figsize=(10, 6))
scatter = plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y_resampled, cmap='viridis', alpha=0.7)
cbar = plt.colorbar(scatter, ticks=[0, 1, 2])
cbar.ax.set_yticklabels(label_encoder.classes_)
plt.title(f'PCA visualization of clusters\nExplained variance: PC1={explained_variance[0]:.2f}, PC2={explained_variance[1]:.2f}')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.show()
Plotting on just the first two principal components (which cumulatively account for only about 27% of the variance), we see that the Dropout class separates slightly from the other two, which reinforces the difficulty of distinguishing Graduate from Enrolled status.
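The cumulative figure can be confirmed directly from the PCA object fitted above:
# total variance captured by the first two principal components
print(f'Cumulative explained variance (2 PCs): {explained_variance.sum():.2f}')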
We will now plot the decision boundaries defined by the KNN model.
from matplotlib.patches import Patch
# Fit the KNN model on the 2D PCA-transformed data
knn_for_plot = KNeighborsClassifier(n_neighbors=best_knn.n_neighbors, weights=best_knn.weights, p=best_knn.p)
knn_for_plot.fit(X_pca, y_resampled)
# Create a mesh grid for visualization
x_min, x_max = X_pca[:, 0].min() - 1, X_pca[:, 0].max() + 1
y_min, y_max = X_pca[:, 1].min() - 1, X_pca[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.1), np.arange(y_min, y_max, 0.1))
# Predict on the grid
Z = knn_for_plot.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
# Map encoded class values back to their original string labels
class_labels = dict(enumerate(label_encoder.classes_))
# Use consistent colormap
cmap = plt.colormaps['viridis']
norm = plt.Normalize(vmin=min(class_labels.keys()), vmax=max(class_labels.keys())) # Normalize class values
# Plot decision boundary
plt.figure(figsize=(10, 6))
plt.contourf(xx, yy, Z, alpha=0.4, cmap=cmap)
# Store scatter plots for legend extraction
scatter_plots = []
# Plot each class separately so they have distinct labels in the legend
for class_value, label in class_labels.items():
mask = y_resampled == class_value
scatter = plt.scatter(
X_pca[mask, 0], X_pca[mask, 1],
label=label, c=cmap(norm(class_value)), # Assign exact color
edgecolor='k', alpha=0.8
)
scatter_plots.append(scatter)
# Proxy artist for the decision boundary
boundary_patch = Patch(color=cmap(0.5), alpha=0.4, label="Decision Boundary")
# Use exact scatter plot colors in the legend
plt.legend(handles=[boundary_patch] + scatter_plots)
plt.title('KNN Decision Boundaries and Neighborhoods')
plt.xlabel('PCA Component 1')
plt.ylabel('PCA Component 2')
plt.show()
The background colors of this plot represent the predicted classes, while the points are the actual data. Areas where the background color matches a point's color indicate data points the KNN model classifies correctly; points that do not match their background indicate misclassification.
In general, the decision boundaries leave considerable room for misclassification. For example, roughly in the middle of the plot there are small pockets of each background color, suggesting that a new data point (or one in the test set) is at risk of being misclassified if its position varies only slightly.
So, if you look left and right, are you like your neighbor? This analysis would say… kinda. The plot above shows how similar data points are toward the center, where it is hard to tell whether your neighbor is a dropout or a graduate. Toward the edges of the plot, the distinction becomes clearer.
Outcomes
Using SMOTE and KNeighborsClassifier, we looked into whether the “look to your left… look to your right…” adage holds true. Initial modelling shows that dropout and graduate status can be predicted with moderate accuracy (around 80%), and some initial steps for model improvement were outlined.
Credits
Data: UCI ML Repository