Clustering

When less is more

Group analysis · Clustering · Data science

Overview

An open-source obesity dataset containing more than a dozen variables is visualized in 2D and 3D using cluster analysis, making a complex, multi-dimensional dataset more intuitive to work with. These clustering approaches can help data users gain an intuitive, visual understanding of their data, how it clusters, and what underlying patterns it may hold.

Approach

This project was completed in Python, using packages such as:

  • pandas
  • ucimlrepo
  • sklearn (including TSNE and MDS from sklearn.manifold)

High-dimensional data can feel overwhelming — like trying to understand a movie by focusing solely on its production and technical details. You might end up knowing everything about the budget and its allocation, the size and experience of the effects team, and the lenses used to shoot the movie. But this richness of information tells nothing of the story. Knowing all of this and more can still leave the plot untouched.

When data are dimension-rich, the associations between the target and features can be cloudy. Visualizing these data using one dimension per feature becomes infeasible when we surpass the bounds of a human-readable 3D plot. Having such rich data may lead to a more promising model, but for an initial exploration and visualization, they pose a challenge. The story remains unclear.

Dimensionality can be reduced for plotting purposes using techniques such as t-distributed Stochastic Neighbour Embedding (t-SNE) and Multi-Dimensional Scaling (MDS). These approaches help bring clarity to this complexity. They take the overwhelming mass of high-dimensional data and transform it into a simpler, more interpretable form, often in just two or three dimensions. These methods reveal underlying clusters, trends, and relationships that are otherwise hidden. This process allows us to step back and see the bigger picture.

Data science isn’t just about processing data; it’s about understanding it. Techniques like t-SNE and MDS help us see the story hidden in the complexity, making the invisible visible. Sometimes, less really is more, especially when it comes to making sense of vast amounts of information.

This post will use the Estimation of Obesity Levels Based on Eating Habits and Physical Condition dataset from the UCI ML repository. In short, there are 16 features (including age, technology use, and meal and snack consumption) available for predicting the target, which is obesity level. The data will first be prepared for plotting.

import pandas as pd
from ucimlrepo import fetch_ucirepo 
 
# fetch dataset 
estimation_of_obesity_levels_based_on_eating_habits_and_physical_condition = fetch_ucirepo(id=544) 
  
# data (as pandas dataframes) 
X = estimation_of_obesity_levels_based_on_eating_habits_and_physical_condition.data.features 
y = estimation_of_obesity_levels_based_on_eating_habits_and_physical_condition.data.targets 
df = pd.concat([X, y], axis=1)

We can print the metadata and variable information to better understand the contents of the dataset.

# metadata 
print(estimation_of_obesity_levels_based_on_eating_habits_and_physical_condition.metadata) 
  
# variable information 
print(estimation_of_obesity_levels_based_on_eating_habits_and_physical_condition.variables) 

Given there are different data types (categorical, binary, and continuous) that may require different transformations, all variables of each data type will be pulled into respective lists.

Categorical Variables: ['CAEC', 'CALC', 'MTRANS']
Binary Variables: ['Gender', 'family_history_with_overweight', 'FAVC', 'SMOKE', 'SCC']
Continuous Variables: ['Age', 'Height', 'Weight', 'FCVC', 'NCP', 'CH2O', 'FAF', 'TUE']

Next, pipelines will be created to transform each data type appropriately. StandardScaler will be used on numeric data, and OrdinalEncoder and OneHotEncoder will be used for categorical and binary variables, respectively.
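The pipeline code itself isn't shown in the post; here is a minimal sketch of what it might look like with scikit-learn's ColumnTransformer, using the variable lists above. To keep the example self-contained, it runs on a tiny synthetic two-row frame, X_demo, standing in for the fetched X (in the post, X from the earlier fetch would be passed instead).

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OrdinalEncoder, OneHotEncoder

categorical = ['CAEC', 'CALC', 'MTRANS']
binary = ['Gender', 'family_history_with_overweight', 'FAVC', 'SMOKE', 'SCC']
continuous = ['Age', 'Height', 'Weight', 'FCVC', 'NCP', 'CH2O', 'FAF', 'TUE']

# one transformer per data type, applied column-wise
preprocessor = ColumnTransformer([
    ('num', StandardScaler(), continuous),
    ('cat', OrdinalEncoder(), categorical),
    ('bin', OneHotEncoder(drop='if_binary'), binary),
])

# tiny two-row frame standing in for X from the fetch above
X_demo = pd.DataFrame({
    'CAEC': ['no', 'Sometimes'], 'CALC': ['no', 'Frequently'],
    'MTRANS': ['Walking', 'Automobile'], 'Gender': ['Male', 'Female'],
    'family_history_with_overweight': ['yes', 'no'], 'FAVC': ['yes', 'no'],
    'SMOKE': ['no', 'yes'], 'SCC': ['no', 'yes'],
    'Age': [21.0, 35.0], 'Height': [1.70, 1.62], 'Weight': [64.0, 90.0],
    'FCVC': [2.0, 3.0], 'NCP': [3.0, 1.0], 'CH2O': [2.0, 1.0],
    'FAF': [0.0, 2.0], 'TUE': [1.0, 0.0],
})

X_transformed = preprocessor.fit_transform(X_demo)
print(X_transformed.shape)  # (2, 16): 8 scaled + 3 ordinal + 5 one-hot columns
```

Here drop='if_binary' keeps a single indicator column per two-category variable, avoiding redundant one-hot columns.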

t-SNE is a technique that models high-dimensional data by first converting similarities between data points (in higher-dimensions) into probabilities, and then embedding these probabilities into a lower-dimensional (i.e., 2D or 3D) space. Its main focus is to preserve local relationships in the data, making it useful for visualizing clusters and patterns on a smaller scale. In the end, by maintaining the relative distance between similar points, t-SNE creates visualizations that reveal complex structures, making it easier to understand and interpret data.

Here is a 2D plot of the high-dimensional data generated using t-SNE, specifying some key parameters:

  • n_components: The number of dimensions (typically 2 or 3)
  • verbose: To print progress information, which may be helpful when running t-SNE on a larger dataset
  • perplexity: A measure of the balance between local and global aspects of the data, with lower numbers preserving more local relationships. This may need to be tuned iteratively, depending on your data and needs
  • n_iter: The number of times t-SNE updates the position of datapoints to minimize differences between low- and high-dimensional relationships

from sklearn.manifold import TSNE

# running t-SNE in 2D
tsne_2d = TSNE(n_components=2, verbose=1, perplexity=25, n_iter=500)
tsne_2d_result = tsne_2d.fit_transform(X_transformed)

[Image: 2D t-SNE plot of the transformed data, colored by obesity level]
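The post's figures aren't reproduced here, but a plot like this can be drawn with matplotlib. Below is a minimal, self-contained sketch run on a small two-cluster synthetic array standing in for X_transformed (the labels and file name are illustrative).

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the script runs headless
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# two synthetic 16-dimensional blobs standing in for X_transformed
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(0, 1, (30, 16)), rng.normal(5, 1, (30, 16))])
labels = np.repeat(['Normal_Weight', 'Obesity_Type_III'], 30)

emb = TSNE(n_components=2, perplexity=10, random_state=42).fit_transform(X_demo)

# one scatter layer per class so the legend maps colors to labels
fig, ax = plt.subplots(figsize=(6, 5))
for lab in np.unique(labels):
    mask = labels == lab
    ax.scatter(emb[mask, 0], emb[mask, 1], label=lab, s=15)
ax.set_xlabel('t-SNE 1')
ax.set_ylabel('t-SNE 2')
ax.legend()
fig.savefig('tsne_2d.png')
```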

Next, we’ll produce a t-SNE plot of these data in 3D.

# 3D t-SNE
tsne_3d = TSNE(n_components=3, verbose=1, perplexity=25, n_iter=500)
tsne_3d_result = tsne_3d.fit_transform(X_transformed)
# pulling the results for each dimension
df['tsne-3d-one'] = tsne_3d_result[:, 0]
df['tsne-3d-two'] = tsne_3d_result[:, 1]
df['tsne-3d-three'] = tsne_3d_result[:, 2]

[Image: 3D t-SNE plot of the transformed data, colored by obesity level]
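The 3D version can be drawn with matplotlib's 3D projection, using the three t-SNE columns just added to df. The sketch below builds a small stand-in frame (df_demo) so it runs on its own; NObeyesdad is the dataset's target column.

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use('Agg')  # non-interactive backend
import matplotlib.pyplot as plt

# small stand-in for df after the 3D t-SNE step
rng = np.random.default_rng(0)
df_demo = pd.DataFrame({
    'tsne-3d-one': rng.normal(size=50),
    'tsne-3d-two': rng.normal(size=50),
    'tsne-3d-three': rng.normal(size=50),
    'NObeyesdad': np.repeat(['Normal_Weight', 'Obesity_Type_III'], 25),
})

fig = plt.figure(figsize=(7, 6))
ax = fig.add_subplot(projection='3d')
for lab, grp in df_demo.groupby('NObeyesdad'):
    ax.scatter(grp['tsne-3d-one'], grp['tsne-3d-two'],
               grp['tsne-3d-three'], label=lab, s=15)
ax.set_xlabel('t-SNE 1')
ax.set_ylabel('t-SNE 2')
ax.set_zlabel('t-SNE 3')
ax.legend()
fig.savefig('tsne_3d.png')
```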

Based on either the 2D or 3D plots, we can see that Obesity_Type_III (otherwise known as severe obesity, with a BMI > 40) tends to cluster together and is distinct from the rest of the data. This suggests that this sub-group is relatively homogeneous, with the underlying features (all 16 of them in our dataset) taking relatively similar values. The sub-group may be strongly defined, with less variability in the other features (suggesting that this clinical group shares commonalities in diet and eating habits). Obesity_Type_II appears to be the second most distinct group in this dataset, though within this sub-group there appear to be three main sub-phenotypes. In contrast, it is relatively difficult to distinguish those with Normal_Weight from the remaining sub-groups (mainly Insufficient_Weight and Overweight), which suggests that this sub-group's features relate to one another in a similar manner. It makes sense that those with a healthy weight cluster more closely to those who are overweight than to those who have severe obesity.

Next, we'll plot these data using MDS, which aims to preserve, in the lower-dimensional space, the distance relationships between data points present in the higher-dimensional space. MDS computes a pairwise distance matrix and finds a spatial configuration in the lower dimensionality that maintains these distances as accurately as possible. Compared with t-SNE, it tends to give a better picture of the global structure of the data.
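That distance-preservation claim can be sanity-checked on synthetic data: embed two well-separated 16-dimensional blobs (standing in for the transformed obesity features; all names here are illustrative) and correlate the pairwise distances before and after embedding.

```python
import numpy as np
from sklearn.manifold import MDS
from scipy.spatial.distance import pdist

# two well-separated synthetic blobs in 16 dimensions
rng = np.random.default_rng(1)
X_hi = np.vstack([rng.normal(0, 1, (20, 16)), rng.normal(8, 1, (20, 16))])

# embed into 2D; MDS optimizes positions against the pairwise-distance matrix
emb = MDS(n_components=2, random_state=42).fit_transform(X_hi)

# pairwise distances in 2D should track the original high-dimensional ones
r = np.corrcoef(pdist(X_hi), pdist(emb))[0, 1]
print(round(r, 2))  # typically close to 1 for well-separated blobs
```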

The data will first be plotted in 2D using MDS, which can be more computationally expensive than t-SNE.

# running mds
from sklearn.manifold import MDS

mds_2d = MDS(n_components=2, random_state=42)
mds_2d_result = mds_2d.fit_transform(X_transformed)

[Image: 2D MDS plot of the transformed data, colored by obesity level]

And now MDS in 3D.

# mds in 3d
mds_3d = MDS(n_components=3, random_state=42)
mds_3d_result = mds_3d.fit_transform(X_transformed)
df['mds-3d-one'] = mds_3d_result[:, 0]
df['mds-3d-two'] = mds_3d_result[:, 1]
df['mds-3d-three'] = mds_3d_result[:, 2]

[Image: 3D MDS plot of the transformed data, colored by obesity level]

With MDS, as with t-SNE, we see that the most tightly clustered sub-group is Obesity_Type_III (followed by Obesity_Type_II). The convergence of the two methods gives confidence in our visualization of the data.

By plotting in human-friendly dimensions, we start to see a key plotline in the data: one character (Obesity_Type_III), based on all their traits (read: data), is unlike the others, while the rest are relatively difficult to tell apart from one another based on those same traits. Even more may be understood through the other plots, models, or questions these relatively quick visualizations lead to.

Outcomes

Data visualization using t-SNE and MDS allowed us to see, in either 2D or 3D, that one group (Obesity_Type_III) is unlike the others. This may not have been possible to glean through inspection of tabular data or multiple 2D plots that did not harness the power of clustering.

Credits

Data: UCI ML Repository