Clustering

When less is more

Group analysis · Clustering · Data science


Overview

An open-source obesity dataset containing more than a dozen variables is visualized in 2D and 3D using dimensionality reduction and cluster analysis. This makes working with complex, multi-dimensional datasets more intuitive, helping data users gain a visual understanding of their data, how it clusters, and what underlying patterns it may hold.

Approach

This project was completed in Python, using packages such as:

  • sklearn (including TSNE and MDS from sklearn.manifold)
  • pandas
  • ucimlrepo
  • matplotlib

High-dimensional data can feel overwhelming — like trying to understand a movie by focusing solely on its production and technical details. You might end up knowing everything about the budget and its allocation, the size and experience of the effects team, and the lenses used to shoot the movie. But this richness of information tells you nothing of the story; knowing all of it and more can still leave the plot untouched.

When data are dimension-rich, the associations between the target and features can be cloudy. Visualizing these data using one dimension per feature becomes infeasible when we surpass the bounds of a human-readable 3D plot. Having such rich data may lead to a more promising model, but for an initial exploration and visualization, they pose a challenge. The story remains unclear.

Dimensionality can be reduced for plotting purposes using techniques such as t-distributed Stochastic Neighbour Embedding (t-SNE) and Multi-Dimensional Scaling (MDS). These approaches help bring clarity to this complexity. They take the overwhelming mass of high-dimensional data and transform it into a simpler, more interpretable form, often in just two or three dimensions. These methods reveal underlying clusters, trends, and relationships that are otherwise hidden. This process allows us to step back and see the bigger picture.

Data science isn’t just about processing data; it’s about understanding it. Techniques like t-SNE and MDS help us see the story hidden in the complexity, making the invisible visible. Sometimes, less really is more, especially when it comes to making sense of vast amounts of information.

This post will use the Estimation of Obesity Levels Based on Eating Habits and Physical Condition dataset from the UCI ML repository. In short, there are 16 features (including age, technology use, and meal and snack consumption) available for predicting the target, which is obesity level. The data will first be prepared for plotting.

import pandas as pd
from ucimlrepo import fetch_ucirepo

# fetch the dataset from the UCI ML repository
obesity = fetch_ucirepo(id=544)

# data (as pandas dataframes)
X = obesity.data.features
y = obesity.data.targets
df = pd.concat([X, y], axis=1)

We can print the metadata and variable information to better understand the contents of the dataset.

# metadata
print(obesity.metadata)

# variable information
print(obesity.variables)

                              name     role         type
0                           Gender  Feature  Categorical
1                              Age  Feature   Continuous
2                           Height  Feature   Continuous
3                           Weight  Feature   Continuous
4   family_history_with_overweight  Feature       Binary
5                             FAVC  Feature       Binary
6                             FCVC  Feature      Integer
7                              NCP  Feature   Continuous
8                             CAEC  Feature  Categorical
9                            SMOKE  Feature       Binary
10                            CH2O  Feature   Continuous
11                             SCC  Feature       Binary
12                             FAF  Feature   Continuous
13                             TUE  Feature      Integer
14                            CALC  Feature  Categorical
15                          MTRANS  Feature  Categorical
16                      NObeyesdad   Target  Categorical

Given that there are different data types (Categorical, Binary, and Continuous) that may require different transformations, the variables of each type will be pulled into their respective lists.

# creating lists for each variable type
categorical_vars = []   # more than 2 categories
binary_vars = []        # variables with exactly 2 unique categories
continuous_vars = []    # numeric variables

for col in X.columns:
    unique_values = X[col].nunique()
    
    if X[col].dtype == 'object':
        if unique_values == 2:
            binary_vars.append(col)  
        else:
            categorical_vars.append(col)  
    else:
        continuous_vars.append(col)

print("Categorical Variables:", categorical_vars)
print("Binary Variables:", binary_vars)
print("Continuous Variables:", continuous_vars)
Categorical Variables: ['CAEC', 'CALC', 'MTRANS']
Binary Variables: ['Gender', 'family_history_with_overweight', 'FAVC', 'SMOKE', 'SCC']
Continuous Variables: ['Age', 'Height', 'Weight', 'FCVC', 'NCP', 'CH2O', 'FAF', 'TUE']

Next, pipelines will be created to transform each data type appropriately. StandardScaler will be used on the numeric data, while OrdinalEncoder and OneHotEncoder will be used for the categorical and binary variables, respectively.

from numpy import ravel
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OrdinalEncoder, OneHotEncoder, LabelEncoder
from sklearn.pipeline import Pipeline

# data type specific pipelines
num_pipeline = Pipeline([('scaler', StandardScaler())])
cat_pipeline = Pipeline([('ordinal', OrdinalEncoder())])
bin_pipeline = Pipeline([('onehot', OneHotEncoder(handle_unknown='ignore'))])

# applying the transformers to appropriate columns using ColumnTransformer
preprocessor = ColumnTransformer(transformers=[
    ('num', num_pipeline, continuous_vars),
    ('ord', cat_pipeline, categorical_vars),
    ('nom', bin_pipeline, binary_vars)
])

# fit_transform for the features
X_transformed = preprocessor.fit_transform(X)

# label encoder for the target
le = LabelEncoder()
y_encoded = le.fit_transform(ravel(y))

t-SNE is a technique that models high-dimensional data by first converting similarities between data points (in the high-dimensional space) into probabilities, and then embedding these probabilities into a lower-dimensional (i.e., 2D or 3D) space. Its main focus is preserving local relationships in the data, making it useful for visualizing clusters and patterns on a smaller scale. By maintaining the relative distances between similar points, t-SNE creates visualizations that reveal complex structures, making data easier to understand and interpret.

Here is a 2D plot of the high-dimensional data generated using t-SNE, specifying some key parameters:

  • n_components: The number of dimensions (typically 2 or 3)
  • verbose: To print progress information, which may be helpful when running t-SNE on a larger dataset
  • perplexity: A measure of the balance between local and global aspects of the data, with lower values preserving more local relationships. This often needs to be tuned iteratively, depending on your data and needs (see the sketch after this list)
  • n_iter: The number of times t-SNE updates the position of datapoints to minimize differences between low- and high-dimensional relationships
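
As a minimal sketch of that tuning (not part of the original analysis), one could compare a few perplexity values via each run's final KL divergence, bearing in mind that KL values are only roughly comparable across perplexities:

from sklearn.manifold import TSNE

# hedged sketch: try a few perplexities on the preprocessed features
for perplexity in [5, 25, 50]:
    tsne = TSNE(n_components=2, perplexity=perplexity, n_iter=500, random_state=42)
    tsne.fit_transform(X_transformed)
    print(f"perplexity={perplexity}: final KL divergence = {tsne.kl_divergence_:.3f}")
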
from sklearn.manifold import TSNE
# running t-SNE
tsne_2d = TSNE(n_components = 2, verbose = 1, perplexity = 25, n_iter = 500)
tsne_2d_result = tsne_2d.fit_transform(X_transformed)
[t-SNE] Computing 76 nearest neighbors...
[t-SNE] Indexed 2111 samples in 0.000s...
[t-SNE] Computed neighbors for 2111 samples in 0.651s...
[t-SNE] Computed conditional probabilities for sample 1000 / 2111
[t-SNE] Computed conditional probabilities for sample 2000 / 2111
[t-SNE] Computed conditional probabilities for sample 2111 / 2111
[t-SNE] Mean sigma: 0.594183
[t-SNE] KL divergence after 250 iterations with early exaggeration: 69.424301
[t-SNE] KL divergence after 500 iterations: 0.942613
import matplotlib.pyplot as plt
# plotting the 2D figure with the original (non-encoded) labels for y
plt.figure(figsize=(12, 8))
scatter = plt.scatter(
    tsne_2d_result[:, 0], 
    tsne_2d_result[:, 1], 
    c=y_encoded, 
    cmap='hsv', 
    alpha=0.7
)

plt.title("t-SNE Obesity-Activity-Diet")
plt.xlabel("t-SNE Component 1")
plt.ylabel("t-SNE Component 2")

# creating a custom legend with original labels
handles, _ = scatter.legend_elements()
unique_classes = le.classes_ 
legend_labels = [unique_classes[int(label)] for label in range(len(unique_classes))]
plt.legend(handles, legend_labels, title="Outcome category")

plt.show()
[Figure: 2D t-SNE scatter (t-SNE Obesity-Activity-Diet), coloured by outcome category]

Next, we’ll produce a t-SNE plot of these data in 3D.

# 3d tsne plot
tsne_3d = TSNE(n_components = 3, verbose = 1, perplexity = 25, n_iter = 500)
tsne_3d_result = tsne_3d.fit_transform(X_transformed)
[t-SNE] Computing 76 nearest neighbors...
[t-SNE] Indexed 2111 samples in 0.001s...
[t-SNE] Computed neighbors for 2111 samples in 0.122s...
[t-SNE] Computed conditional probabilities for sample 1000 / 2111
[t-SNE] Computed conditional probabilities for sample 2000 / 2111
[t-SNE] Computed conditional probabilities for sample 2111 / 2111
[t-SNE] Mean sigma: 0.594183
[t-SNE] KL divergence after 250 iterations with early exaggeration: 69.294075
[t-SNE] KL divergence after 500 iterations: 0.798805
# pulling the results for each dimension
df['tsne-3d-one'] = tsne_3d_result[:, 0]
df['tsne-3d-two'] = tsne_3d_result[:, 1]
df['tsne-3d-three'] = tsne_3d_result[:, 2]
# plotting the data in 3d
from mpl_toolkits.mplot3d import Axes3D


fig = plt.figure(figsize=(16, 10))
ax = fig.add_subplot(111, projection='3d')

scatter = ax.scatter(
    df['tsne-3d-one'], 
    df['tsne-3d-two'], 
    df['tsne-3d-three'], 
    c=y_encoded, 
    cmap='hsv', 
    alpha=0.6
)

ax.set_xlabel('t-SNE Component 1')
ax.set_ylabel('t-SNE Component 2')
ax.set_zlabel('t-SNE Component 3')
ax.set_title('3D t-SNE Visualization')

handles, _ = scatter.legend_elements()
unique_classes = le.classes_  # Get original class labels from LabelEncoder
legend_labels = [unique_classes[int(label)] for label in range(len(unique_classes))]
legend = ax.legend(handles, legend_labels, title="Outcome category")
ax.add_artist(legend)

plt.show()
[Figure: 3D t-SNE visualization, coloured by outcome category]

Based on either the 2D or 3D plots, we can see that Obesity_Type_III (otherwise known as severe obesity, with a BMI > 40) tends to cluster together and is distinct from the rest of the data. This suggests that the sub-group is relatively homogeneous across the 16 underlying features, with less variability in diet and eating habits than the other groups; it may be a strongly defined clinical phenotype. Obesity_Type_II appears to be the second most distinct group, though within it there appear to be three main sub-phenotypes. In contrast, it is relatively difficult to distinguish Normal_Weight from the remaining sub-groups (mainly Insufficient_Weight and Overweight), which suggests that its features relate to one another in a similar manner. It makes intuitive sense that those with a healthy weight cluster more closely to those who are overweight than to those with severe obesity.
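
To put a rough number on how distinct each sub-group looks, one option (a sketch, not part of the original analysis) is a per-class mean silhouette score computed on the 2D t-SNE embedding; keep in mind that t-SNE distorts distances, so this is only a heuristic:

from sklearn.metrics import silhouette_samples

# per-sample silhouette scores on the 2D embedding, averaged per class;
# higher means indicate a more tightly separated sub-group
sil = silhouette_samples(tsne_2d_result, y_encoded)
for code, name in enumerate(le.classes_):
    print(f"{name}: mean silhouette = {sil[y_encoded == code].mean():.3f}")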

Next, we'll plot these data using MDS, which aims to preserve, in the lower-dimensional space, the distance relationships between data points that are present in the higher-dimensional space. MDS computes a distance matrix and finds a spatial configuration in the lower dimensionality that maintains these distances as accurately as possible. Compared with t-SNE, it tends to provide a better view of the global structure of the data.

The data will first be plotted in 2D using MDS, which can be more computationally expensive than t-SNE.
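
As an aside, classic MDS scales roughly quadratically with the number of samples, so on much larger datasets one common workaround is to fit it on a random subsample first. A minimal sketch (assuming the X_transformed and y_encoded objects from above; the full 2,111 rows are used below, since they remain manageable):

import numpy as np

# illustrative only: fit MDS on a random subsample when n is large
rng = np.random.default_rng(42)
idx = rng.choice(X_transformed.shape[0], size=500, replace=False)
X_sub, y_sub = X_transformed[idx], y_encoded[idx]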

# running mds
from sklearn.manifold import MDS

mds_2d = MDS(n_components=2, random_state=42)
mds_2d_result = mds_2d.fit_transform(X_transformed)
# plotting the MDS results in 2D
plt.figure(figsize=(12, 8))
scatter = plt.scatter(
    mds_2d_result[:, 0], 
    mds_2d_result[:, 1], 
    c=y_encoded, 
    cmap='hsv', 
    alpha=0.7
)

plt.title("MDS Obesity-Activity-Diet")
plt.xlabel("MDS Component 1")
plt.ylabel("MDS Component 2")

# creating a custom legend with original labels
handles, _ = scatter.legend_elements()
unique_classes = le.classes_ 
legend_labels = [unique_classes[int(label)] for label in range(len(unique_classes))]
plt.legend(handles, legend_labels, title="Outcome category")

plt.show()
[Figure: 2D MDS scatter (MDS Obesity-Activity-Diet), coloured by outcome category]
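
As a rough check on how well MDS preserved the original distances, one option (a sketch, not part of the original analysis) is to correlate the pairwise distances in the preprocessed feature space with those in the 2D embedding; the fitted sklearn MDS object also exposes its raw stress via mds_2d.stress_:

import numpy as np
from sklearn.metrics import pairwise_distances

# correlate upper-triangle pairwise distances before and after embedding;
# values near 1 indicate the embedding preserved the original distances well
d_high = pairwise_distances(X_transformed)
d_low = pairwise_distances(mds_2d_result)
iu = np.triu_indices_from(d_high, k=1)
print(f"distance correlation: {np.corrcoef(d_high[iu], d_low[iu])[0, 1]:.3f}")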

And now MDS in 3D.

# mds in 3d
mds_3d = MDS(n_components=3, random_state=42)
mds_3d_result = mds_3d.fit_transform(X_transformed)
df['mds-3d-one'] = mds_3d_result[:, 0]
df['mds-3d-two'] = mds_3d_result[:, 1]
df['mds-3d-three'] = mds_3d_result[:, 2]
fig = plt.figure(figsize=(16, 10))
ax = fig.add_subplot(111, projection='3d')

scatter = ax.scatter(
    df['mds-3d-one'], 
    df['mds-3d-two'], 
    df['mds-3d-three'], 
    c=y_encoded, 
    cmap='hsv', 
    alpha=0.6
)

ax.set_xlabel('MDS Component 1')
ax.set_ylabel('MDS Component 2')
ax.set_zlabel('MDS Component 3')
ax.set_title('3D MDS Visualization')

handles, _ = scatter.legend_elements()
unique_classes = le.classes_  # Get original class labels from LabelEncoder
legend_labels = [unique_classes[int(label)] for label in range(len(unique_classes))]
legend = ax.legend(handles, legend_labels, title="Outcome category")
ax.add_artist(legend)

plt.show()
[Figure: 3D MDS visualization, coloured by outcome category]

With MDS, as with t-SNE, we see that the most clustered sub-group is Obesity_Type_III (followed by Obesity_Type_II). The convergence of the two methods' results gives confidence in our visualization of the data.

By plotting in human-friendly dimensions, we start to see a key plotline in the data: one character (Obesity_Type_III) is, based on all their traits (read: data), unlike the others, while the rest are relatively difficult to tell apart from one another. Even more may be understood through the further plots, models, or questions these relatively quick visualizations lead to.

Outcomes

Data visualization using t-SNE and MDS allowed us to see, in either 2D or 3D, that one group (Obesity_Type_III) is unlike the others. This may not have been possible to glean through inspection of tabular data or multiple 2D plots that did not harness the power of clustering.

Credits

Data: UCI ML Repository