On this page, we explore the similarities across countries using the indicators described on the overall visualization page.
We rely on two machine learning algorithms to help visualize the data. Click below to reveal more information about the two algorithms used here.
k-means clustering is an unsupervised machine learning algorithm that partitions data into k clusters. The algorithm minimizes within-cluster variation, so points in the same cluster are generally more similar to one another than to points in other clusters.
An example of running k-means on 2-dimensional data with 3 clusters is shown below:
Each point is colored according to its assigned k-means cluster, and we see that the three clusters are recovered.
The sole parameter here is the number of clusters (k). We suggest picking k based on the visualized t-SNE representation.
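For readers who want to reproduce this kind of example, here is a minimal sketch using scikit-learn. The synthetic data generated by make_blobs and the random seeds are illustrative assumptions standing in for a toy dataset; this is not the code or the country data used on this page.

```python
# Minimal sketch (assumption, not this page's code): k-means with k = 3
# on synthetic 2-dimensional data containing 3 clusters.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy 2-dimensional data with 3 well-separated clusters.
X, _ = make_blobs(n_samples=300, centers=3, n_features=2, random_state=0)

# Fit k-means and assign each point to its nearest cluster centroid.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Color each point by its assigned cluster, as in the figure above.
plt.scatter(X[:, 0], X[:, 1], c=labels)
plt.title("k-means with k = 3 on 2-dimensional toy data")
plt.show()
```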
In the 2-dimensional case, we can plot the data by placing one variable against the other (e.g. social mobility vs. the number of COVID cases). This is no longer possible when the data contain more than 3 features, since representing every feature would require more than 3 spatial dimensions.
t-distributed Stochastic Neighbor Embedding, or t-SNE, is a popular statistical technique for representing high-dimensional data in a lower-dimensional space. In a t-SNE plot, smaller distances imply more similar data points, so similar countries tend to group together and form a "cluster".
For example, applying t-SNE to data with 3 clusters results in the following representation:
From the t-SNE representation, we see that the 3 clusters are recovered. However, the original distances are not preserved in this representation: the clusters appear roughly equidistant even though two of them are much closer in the original space.
Two important parameters that can be tuned are the perplexity and the learning rate. Loosely speaking, the perplexity reflects the number of close neighbors each point is assumed to have. The learning rate controls how far points are moved at each iteration, and too large a value can lead to poor convergence.
We have supplied default values for this dataset and allow 5000 iterations for the algorithm to converge. For a thorough explanation of the t-SNE parameters, please refer to this post.
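As a rough illustration of how these parameters enter the computation, the sketch below runs t-SNE with scikit-learn on synthetic data. The make_blobs toy data, the perplexity and learning-rate values, and the random seed are assumptions for demonstration only, not the defaults supplied on this page.

```python
# Minimal sketch (assumption, not this page's code): t-SNE applied to
# synthetic high-dimensional data containing 3 clusters.
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE

# Toy stand-in for the country indicators: 10 features, 3 clusters.
X, y = make_blobs(n_samples=300, centers=3, n_features=10, random_state=0)

# Perplexity and learning rate are illustrative choices; max_iter caps the
# number of optimization iterations (named n_iter in older scikit-learn
# releases).
embedding = TSNE(
    n_components=2,
    perplexity=30,
    learning_rate="auto",
    max_iter=5000,
    init="pca",
    random_state=0,
).fit_transform(X)

# Points that end up close together in the 2-dimensional embedding
# correspond to similar observations in the original 10-dimensional space.
plt.scatter(embedding[:, 0], embedding[:, 1], c=y)
plt.title("t-SNE embedding of 10-dimensional toy data")
plt.show()
```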