Project
Look to your left…
Data Science · Data Visualization · Classification · KNN
Overview
Look to your left. Look to your right. One of you won’t graduate. So the story goes.
This post uses a dataset on academic success and a classification method called k-nearest neighbors (KNN) to determine what predicts graduation or dropout.
Approach
This project was completed in Python, using packages such as:
scikit-learn (including KNeighborsClassifier and PCA)
imbalanced-learn (SMOTE)
pandas, matplotlib, and seaborn
Here is a story you may have heard in your 101 classes.
The professor walks to the lectern. There is no introduction, just an anecdote meant to inject anxiety and competitiveness into the room: “Look to your left. Look to your right. One of you won’t make it to graduation.”
Instead of relying on gut feelings or ominous one-liners, let's turn to data science to see whether these professors have any empirical backing. Specifically, we'll use a method that literally looks at your neighbors (in the dataset) to figure out whether you're like them. This method is k-nearest neighbors (KNN), a technique that classifies a data point based on the points closest to it (i.e., its neighbors) in feature space. In our case, we will compare university students to one another across a number of features to see whether something predicts dropout.
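To make the idea concrete, here is a toy sketch of the KNN logic with made-up numbers (the feature values, labels, and choice of k are purely illustrative and not taken from the dataset used below):
import numpy as np
from collections import Counter
# hypothetical (admission grade, age) pairs for five students with known outcomes
known_students = np.array([[150, 18], [142, 19], [118, 25], [120, 31], [160, 18]])
known_labels = np.array(['Graduate', 'Graduate', 'Dropout', 'Dropout', 'Graduate'])
# a new student we would like to classify
new_student = np.array([145, 20])
# euclidean distance from the new student to each known student
distances = np.linalg.norm(known_students - new_student, axis=1)
# look at the k = 3 closest students and take a majority vote on their labels
k = 3
nearest_labels = known_labels[np.argsort(distances)[:k]]
print(Counter(nearest_labels).most_common(1)[0][0])  # majority label among the neighbors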
For this post, we'll use the "Predict Students' Dropout and Academic Success" dataset from the UCI Machine Learning Repository. This dataset captures a range of factors influencing university students' journeys, from their academic performance to external socio-economic pressures. Some of the key features include:
- Admission Grade: How well did they perform before entering university?
- Curricular Unit Grades: Academic performance over the semesters.
- Unemployment Rate: External economic pressures that might influence a student’s trajectory.
Let's load the data and rename the columns.
# loading data
import pandas as pd
import numpy as np
file_path = 'data.xlsx'
data = pd.read_excel(file_path)
# renaming columns
columns = [
"Marital_status", "Application_mode", "Application_order", "Course",
"Daytime_evening_attendance", "Previous_qualification", "Previous_qualification_grade",
"Nationality", "Mother_qualification", "Father_qualification", "Mother_occupation",
"Father_occupation", "Admission_grade", "Displaced", "Educational_special_needs",
"Debtor", "Tuition_fees_up_to_date", "Gender", "Scholarship_holder", "Age_at_enrollment",
"International", "Curricular_units_1st_sem_credited", "Curricular_units_1st_sem_enrolled",
"Curricular_units_1st_sem_evaluations", "Curricular_units_1st_sem_approved",
"Curricular_units_1st_sem_grade", "Curricular_units_1st_sem_without_evaluations",
"Curricular_units_2nd_sem_credited", "Curricular_units_2nd_sem_enrolled",
"Curricular_units_2nd_sem_evaluations", "Curricular_units_2nd_sem_approved",
"Curricular_units_2nd_sem_grade", "Curricular_units_2nd_sem_without_evaluations",
"Unemployment_rate", "Inflation_rate", "GDP", "Target"
]
data.columns = columns
Let's explore the data some more. Visualizing the distribution of the target variable and checking for missingness throughout the dataset are important first steps.
# data summary
print(data.describe())
Marital_status Application_mode Application_order Course \
count 4424.000000 4424.000000 4424.000000 4424.000000
mean 1.178571 18.669078 1.727848 8856.642631
std 0.605747 17.484682 1.313793 2063.566416
min 1.000000 1.000000 0.000000 33.000000
25% 1.000000 1.000000 1.000000 9085.000000
50% 1.000000 17.000000 1.000000 9238.000000
75% 1.000000 39.000000 2.000000 9556.000000
max 6.000000 57.000000 9.000000 9991.000000
Daytime_evening_attendance Previous_qualification \
count 4424.000000 4424.000000
mean 0.890823 4.577758
std 0.311897 10.216592
min 0.000000 1.000000
25% 1.000000 1.000000
50% 1.000000 1.000000
75% 1.000000 1.000000
max 1.000000 43.000000
Previous_qualification_grade Nationality Mother_qualification \
count 4424.000000 4424.000000 4424.000000
mean 132.613314 1.873192 19.561935
std 13.188332 6.914514 15.603186
min 95.000000 1.000000 1.000000
25% 125.000000 1.000000 2.000000
50% 133.100000 1.000000 19.000000
75% 140.000000 1.000000 37.000000
max 190.000000 109.000000 44.000000
Father_qualification ... \
count 4424.000000 ...
mean 22.275316 ...
std 15.343108 ...
min 1.000000 ...
25% 3.000000 ...
50% 19.000000 ...
75% 37.000000 ...
max 44.000000 ...
Curricular_units_1st_sem_without_evaluations \
count 4424.000000
mean 0.137658
std 0.690880
min 0.000000
25% 0.000000
50% 0.000000
75% 0.000000
max 12.000000
Curricular_units_2nd_sem_credited Curricular_units_2nd_sem_enrolled \
count 4424.000000 4424.000000
mean 0.541817 6.232143
std 1.918546 2.195951
min 0.000000 0.000000
25% 0.000000 5.000000
50% 0.000000 6.000000
75% 0.000000 7.000000
max 19.000000 23.000000
Curricular_units_2nd_sem_evaluations \
count 4424.000000
mean 8.063291
std 3.947951
min 0.000000
25% 6.000000
50% 8.000000
75% 10.000000
max 33.000000
Curricular_units_2nd_sem_approved Curricular_units_2nd_sem_grade \
count 4424.000000 4424.000000
mean 4.435805 10.230206
std 3.014764 5.210808
min 0.000000 0.000000
25% 2.000000 10.750000
50% 5.000000 12.200000
75% 6.000000 13.333333
max 20.000000 18.571429
Curricular_units_2nd_sem_without_evaluations Unemployment_rate \
count 4424.000000 4424.000000
mean 0.150316 11.566139
std 0.753774 2.663850
min 0.000000 7.600000
25% 0.000000 9.400000
50% 0.000000 11.100000
75% 0.000000 13.900000
max 12.000000 16.200000
Inflation_rate GDP
count 4424.000000 4424.000000
mean 1.228029 0.001969
std 1.382711 2.269935
min -0.800000 -4.060000
25% 0.300000 -1.700000
50% 1.400000 0.320000
75% 2.600000 1.790000
max 3.700000 3.510000
[8 rows x 36 columns]
We will quickly check for missingness across the dataset.
# missingness
print(data.isnull().sum())
Marital_status 0
Application_mode 0
Application_order 0
Course 0
Daytime_evening_attendance 0
Previous_qualification 0
Previous_qualification_grade 0
Nationality 0
Mother_qualification 0
Father_qualification 0
Mother_occupation 0
Father_occupation 0
Admission_grade 0
Displaced 0
Educational_special_needs 0
Debtor 0
Tuition_fees_up_to_date 0
Gender 0
Scholarship_holder 0
Age_at_enrollment 0
International 0
Curricular_units_1st_sem_credited 0
Curricular_units_1st_sem_enrolled 0
Curricular_units_1st_sem_evaluations 0
Curricular_units_1st_sem_approved 0
Curricular_units_1st_sem_grade 0
Curricular_units_1st_sem_without_evaluations 0
Curricular_units_2nd_sem_credited 0
Curricular_units_2nd_sem_enrolled 0
Curricular_units_2nd_sem_evaluations 0
Curricular_units_2nd_sem_approved 0
Curricular_units_2nd_sem_grade 0
Curricular_units_2nd_sem_without_evaluations 0
Unemployment_rate 0
Inflation_rate 0
GDP 0
Target 0
dtype: int64
There is no missingness in this dataset, and imputation is therefore not required.
We'll now quickly visualize the distribution of the Target variable.
# target distribution
import matplotlib.pyplot as plt
import seaborn as sns
data['Target'].value_counts().plot(kind='bar', title='Distribution of Target')
plt.show()
At first glance, the professors do seem to be onto something, at least in terms of basic descriptive statistics. Based on the plot above alone, it would not be unreasonable to tell a 101 class that roughly a third will not graduate. Beyond that, we appear to have enough data per category for a worthwhile analysis.
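To put a rough number on that impression, the class proportions can be printed directly from the Target column loaded above:
# proportion of students in each outcome category
print(data['Target'].value_counts(normalize=True).round(2))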
Next, we dive into key features like admission grade, age at enrollment, and debtor status. pairplot will be used to visualize associations between Target and some features of interest.
# visualizing key features
sns.pairplot(data, hue='Target', vars=[
'Admission_grade', 'Age_at_enrollment', 'Scholarship_holder', 'Gender', 'Debtor'
])
plt.show()
Graduates tend to have a higher admission_grade and a lower age_at_enrollment. More females (coded as Gender == 0) graduate than not, whereas among males the split is even. Most females who graduate are not debtors. In contrast, there seems to be a high concentration of dropouts among debtors, irrespective of admission_grade and age_at_enrollment, suggesting that debt load may contribute to leaving school despite high entry marks and a young age. Similarly, many who do not hold scholarships (Scholarship_holder == 0) are dropouts.
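These impressions from the pairplot can be sanity-checked with quick cross-tabulations (a supplementary check using only columns already in the dataset, not part of the plots above):
# share of each outcome within debtor and non-debtor groups
print(pd.crosstab(data['Debtor'], data['Target'], normalize='index').round(2))
# the same breakdown by gender (0 = female, 1 = male)
print(pd.crosstab(data['Gender'], data['Target'], normalize='index').round(2))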
We'll now build the KNN model using key functionality in scikit-learn. KNN works by comparing a student to their “k” nearest neighbors in feature space: if your closest neighbors graduated, chances are you're on track to graduate too. Class imbalance (the uneven split between graduates, dropouts, and enrolled students) will be addressed using the Synthetic Minority Oversampling Technique (SMOTE). RandomizedSearchCV() will be used to identify the optimal hyperparameters for the KNN model. We'll also scale the features with StandardScaler and encode the target labels with LabelEncoder; these preprocessing steps could equally be bundled into a pipeline, as sketched after the code below.
from sklearn.preprocessing import LabelEncoder, StandardScaler
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split
# defining features and target
X = data.drop('Target', axis=1)
label_encoder = LabelEncoder()
y = data['Target']
y = label_encoder.fit_transform(y)
# smote
smote = SMOTE(random_state=189)
X_resampled, y_resampled = smote.fit_resample(X, y)
# scaling features
scaler = StandardScaler()
X_resampled_scaled = scaler.fit_transform(X_resampled)
# splitting
X_train, X_test, y_train, y_test = train_test_split(X_resampled_scaled, y_resampled, test_size=0.2, random_state=189)
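As mentioned above, the scaling and oversampling steps could also be bundled into a single object using imbalanced-learn's pipeline. This is a minimal sketch of that alternative layout (not the code used for the results that follow):
from imblearn.pipeline import Pipeline as ImbPipeline
from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
# scaling, oversampling, and the classifier bundled into one estimator
knn_pipeline = ImbPipeline(steps=[
    ('scaler', StandardScaler()),
    ('smote', SMOTE(random_state=189)),
    ('knn', KNeighborsClassifier())
])
A nice property of this layout is that, when the pipeline is used with cross-validation, SMOTE is applied only to the training folds rather than to the full dataset.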
Now we'll define the KNN model and set up RandomizedSearchCV(). This allows us to identify the best parameters, like the number of neighbors and the distance metric. After optimization, we evaluate performance using metrics like accuracy, precision, and recall to ensure the model identifies graduates and dropouts effectively.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import RandomizedSearchCV
# knn model
knn = KNeighborsClassifier()
# knn with randomsearch
param_distributions = {
'n_neighbors': np.arange(1, 31),
'weights': ['uniform', 'distance'],
'p': [1, 2]
}
random_search = RandomizedSearchCV(knn, param_distributions, n_iter=50, cv=5, scoring='accuracy', random_state=189)
random_search.fit(X_train, y_train)
Next, we'll look at the best parameters identified in the previous step and assess model accuracy with cross-validation.
from sklearn.model_selection import cross_val_score
# best parameters
print('Best parameters:', random_search.best_params_)
# best model
best_knn = random_search.best_estimator_
# cross-validation scores
cv_scores = cross_val_score(best_knn, X_train, y_train, cv=5, scoring='accuracy')
print('Cross-validation accuracy scores:', cv_scores)
print('Mean cross-validation accuracy:', np.mean(cv_scores))
Best parameters: {'weights': 'distance', 'p': 1, 'n_neighbors': 2}
Cross-validation accuracy scores: [0.79830349 0.76320755 0.81603774 0.8245283 0.79339623]
Mean cross-validation accuracy: 0.7990946597193818
We can plot a confusion matrix to better visualize how the model performs.
from sklearn.metrics import confusion_matrix, classification_report
# prediction
y_pred = best_knn.predict(X_test)
# evaluation
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=label_encoder.classes_, yticklabels=label_encoder.classes_)
plt.title('Confusion matrix')
plt.xlabel('Predicted labels')
plt.ylabel('True labels')
plt.show()
print('Classification report:')
print(classification_report(y_test, y_pred))
Classification report:
precision recall f1-score support
0 0.89 0.79 0.84 463
1 0.75 0.88 0.81 424
2 0.80 0.77 0.78 439
accuracy 0.81 1326
macro avg 0.81 0.81 0.81 1326
weighted avg 0.82 0.81 0.81 1326
The model performs modestly, with an accuracy of around 80%. The greatest confusion appears to be between the Graduate and Enrolled classes. In a fuller analysis, the next step would be to improve the model: feature engineering could be explored to understand more about how pre-admission factors (such as prior grades and parents' occupations), socio-economic factors (including debtor and displacement status), and enrollment-related factors (such as course load) interact, which may allow for more accurate modelling. One such derived feature is sketched below.
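As one hypothetical illustration of that kind of feature engineering, a pass rate per semester could be derived from the curricular-unit columns (the name pass_rate_1st_sem is ours, not part of the dataset):
# hypothetical engineered feature: share of enrolled first-semester units that were approved
data['pass_rate_1st_sem'] = (
    data['Curricular_units_1st_sem_approved']
    / data['Curricular_units_1st_sem_enrolled'].replace(0, np.nan)
)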
For the present project, to aid in visualizing the similarity of these two groups, we can use a dimensionality reduction technique such as PCA prior to plotting the data.
from sklearn.decomposition import PCA
# three cluster approach
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_resampled_scaled)
explained_variance = pca.explained_variance_ratio_
plt.figure(figsize=(10, 6))
scatter = plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y_resampled, cmap='viridis', alpha=0.7)
cbar = plt.colorbar(scatter, ticks=[0, 1, 2])
cbar.ax.set_yticklabels(label_encoder.classes_)
plt.title(f'PCA visualization of clusters\nExplained variance: PC1={explained_variance[0]:.2f}, PC2={explained_variance[1]:.2f}')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.show()
Plotting on just the first two principal components (which cumulatively account for only about 27% of the variance), we see that the Dropout class separates slightly from the other two, which reinforces the difficulty of distinguishing Graduate from Enrolled status.
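The cumulative figure can be confirmed directly from the PCA object fitted above:
# total variance captured by the first two principal components
print(f'Cumulative explained variance (2 PCs): {explained_variance.sum():.2f}')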
We will now plot the decision boundaries defined by the KNN model.
from matplotlib.patches import Patch
# Fit the KNN model on the 2D PCA-transformed data
knn_for_plot = KNeighborsClassifier(n_neighbors=best_knn.n_neighbors, weights=best_knn.weights, p=best_knn.p)
knn_for_plot.fit(X_pca, y_resampled)
# Create a mesh grid for visualization
x_min, x_max = X_pca[:, 0].min() - 1, X_pca[:, 0].max() + 1
y_min, y_max = X_pca[:, 1].min() - 1, X_pca[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.1), np.arange(y_min, y_max, 0.1))
# Predict on the grid
Z = knn_for_plot.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
# Map encoded class values back to their original string labels
class_labels = dict(enumerate(label_encoder.classes_))
# Use consistent colormap
cmap = plt.colormaps['viridis']
norm = plt.Normalize(vmin=min(class_labels.keys()), vmax=max(class_labels.keys())) # Normalize class values
# Plot decision boundary
plt.figure(figsize=(10, 6))
plt.contourf(xx, yy, Z, alpha=0.4, cmap=cmap)
# Store scatter plots for legend extraction
scatter_plots = []
# Plot each class separately so they have distinct labels in the legend
for class_value, label in class_labels.items():
mask = y_resampled == class_value
scatter = plt.scatter(
X_pca[mask, 0], X_pca[mask, 1],
label=label, c=cmap(norm(class_value)), # Assign exact color
edgecolor='k', alpha=0.8
)
scatter_plots.append(scatter)
# Proxy artist for the decision boundary
boundary_patch = Patch(color=cmap(0.5), alpha=0.4, label="Decision Boundary")
# Use exact scatter plot colors in the legend
plt.legend(handles=[boundary_patch] + scatter_plots)
plt.title('KNN Decision Boundaries and Neighborhoods')
plt.xlabel('PCA Component 1')
plt.ylabel('PCA Component 2')
plt.show()
The background colors of this plot represent the predicted classes, while the points are the actual data. Areas where the background color matches a point's color indicate data points the KNN model classifies correctly; points that do not match their background indicate misclassification.
In general, the decision boundaries leave considerable room for misclassification. For example, roughly in the middle of the plot there are small pockets of each background color, suggesting that a new data point (or one in the test set) is at risk of being misclassified if its position varies only slightly.
So, if you look left and right, are you like your neighbor? This analysis would say… kinda. The plot above shows how similar data points are toward the center, where it is hard to tell whether your neighbor is a dropout or a graduate. Toward the edges of the plot, the distinction becomes clearer.
Outcomes
Using SMOTE and KNeighborsClassifier, we looked into whether the “look to your left… look to your right…” adage holds true. Initial modelling shows that dropout and graduate status can be predicted with moderate accuracy (around 80%), and some initial steps for model improvement were outlined.
Credits
Data: UCI ML Repository