Let's cluster the data!


On this page, we explore the similarities across countries using the indicators described on the overall visualization page.

We rely on two machine learning algorithms to help visualize the data. Click below to reveal more information about the two algorithms used here.

⯈ 1. k-means clustering

k-means clustering is an unsupervised machine learning algorithm that partitions data into k clusters. The algorithm minimizes the variation within each cluster, so points in the same cluster are generally more similar to one another than to points in other clusters.

An example of performing k-means for 2-dimensional data with 3 clusters is shown as follows:

[Figure: k-means clustering of 2-dimensional data with 3 clusters]

Each point is colored according to its assigned k-means cluster, and we see that the three clusters are recovered.

The sole parameter here is the number of clusters (k). We suggest picking k based on the visualized t-SNE representation.
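The page itself runs k-means in the browser (see the references below), but the algorithm is simple enough to sketch directly. The following minimal NumPy implementation is illustrative only; the function name, initialization scheme, and defaults are our own choices, not those of the library used here.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal k-means: alternate between assigning each point to its
    nearest centroid and recomputing centroids as cluster means."""
    rng = np.random.default_rng(seed)
    # Initialize centroids by picking k distinct data points at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: label each point with its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points
        # (keep the old centroid if a cluster ends up empty).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # assignments have stabilized
        centroids = new_centroids
    return labels, centroids
```

Calling `kmeans(X, 3)` on 2-dimensional data like the example above returns a cluster label for each point, which is what drives the coloring in the plot.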


⯈ 2. t-SNE dimensional reduction

For 2-dimensional data, we can simply plot one variable against the other (e.g. social mobility vs. number of COVID cases). This is no longer possible once the data contain more than 3 features, since representing every feature faithfully would require more than 3 spatial dimensions.

t-distributed Stochastic Neighbor Embedding (t-SNE) is a popular statistical technique for representing high-dimensional data in a lower-dimensional space. In a t-SNE plot, smaller distances imply more similar data points, so similar countries tend to group together and form a "cluster".

For example, applying t-SNE to data with 3 clusters results in the following representation:

[Figures: data with 3 clusters and the corresponding t-SNE representation]

From the t-SNE representation, we see that the 3 clusters are recovered. However, distances between clusters are not preserved: the clusters appear roughly equidistant even though two of them are much closer in the original space. t-SNE preserves local neighborhoods rather than global distances.

Two important tunable parameters are the learning rate and the perplexity. Loosely speaking, the perplexity controls how many close neighbors each point is assumed to have. The learning rate determines how far points move in each iteration, and too large a value can lead to poor convergence.

We have supplied default values for this dataset. For a thorough explanation of the t-SNE parameters, please refer to this post. We allow 5000 iterations for the algorithm to converge.
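The visualization on this page runs t-SNE in JavaScript (via the tSNEJS library), but the same idea can be sketched in Python. The snippet below uses scikit-learn's `TSNE` as a stand-in; the toy "countries × indicators" matrix and the particular parameter values are our own illustrative choices, not the defaults supplied on this page.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Toy "countries x indicators" matrix: 30 samples, 8 features,
# drawn from 3 well-separated groups.
X = np.vstack([rng.normal(loc, 0.5, size=(10, 8)) for loc in (0.0, 5.0, 10.0)])

# perplexity ~ assumed number of close neighbors per point;
# learning_rate controls how far points move in each iteration.
embedding = TSNE(n_components=2, perplexity=5, learning_rate=200.0,
                 random_state=0).fit_transform(X)
print(embedding.shape)  # (30, 2): one 2-D coordinate per sample
```

Each row of `embedding` is the 2-D position of one sample, which is exactly what gets drawn (and later colored by k-means) in the plot.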


⯈ Instructions
  1. Choose the date.
  2. (Optional) Choose which indicators to use.
  3. Click on Run t-SNE! to start the iterations and display the visualization.
    • Change parameters if required.
    • To pause/continue, click on the red pause button. To restart, click on the blue refresh button.
    • Hover over any point to display the numerical values of the metrics used.
  4. Click on Run k-means! to color points according to their assigned cluster.
    • Try changing the number of clusters for k-means!
  5. Interpret the results!









References

  1. t-SNE visualization adapted from the tSNEJS library.
  2. k-means algorithm from the kmeans.js GitHub repository.