COVID-19: Clustering National Patterns

An exercise in k-means clustering

COVID-19 datascience

The COVID-19 pandemic has produced vast quantities of publicly available data of global interest. While the integrity and accuracy of the data has been scrutinized and questioned from nation to nation, the data still provides a great jumping-off point for exploring some fundamental data science tools.

In this post, we will explore how nations compared over their first 40 days of COVID-19 exposure for both infections and mortality. The data set for national infections and deaths is large enough that a brute-force, curve-by-curve comparison would be a challenge. One strategy is to group nations that followed similar infection and mortality curves and then compare the groups rather than the individual nations. One of the common algorithms for this type of analysis is k-means clustering.

K-means clustering is a type of unsupervised machine learning that analyzes a data set and assigns k cluster centers such that 1) each cluster center is the arithmetic mean of all points in its cluster and 2) every point is closer to its own cluster center than to any other cluster center. K-means then yields k sets of data that can be used to represent the entire data set under study. New data can also be assigned to these previously determined clusters, on the assumption that anything learned from the original clusters applies to the new data as well.
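To make the two conditions above concrete, here is a minimal sketch of the classic alternating iteration (often called Lloyd's algorithm) in plain NumPy. It is not what scikit-learn runs internally in every detail, and the function names and toy data are our own assumptions for illustration.

```python
import numpy as np

def nearest_center(points, centers):
    """Return, for each point, the index of the closest center."""
    # squared distance from every point to every center, shape (n_points, k)
    dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)

def lloyd_kmeans(points, k, n_iter=100, seed=0):
    """Alternate between assigning points and moving centers to the mean."""
    rng = np.random.default_rng(seed)
    # start from k distinct points chosen at random
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        labels = nearest_center(points, centers)
        # condition 1: each center becomes the arithmetic mean of its members
        # (an empty cluster simply keeps its previous center)
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    # condition 2: every point is labelled with its closest center
    return centers, nearest_center(points, centers)

# toy data (hypothetical): two well-separated blobs of 20 points each
rng = np.random.default_rng(1)
toy = np.vstack([
    rng.normal(0.0, 0.5, size=(20, 2)),
    rng.normal(10.0, 0.5, size=(20, 2)),
])
centers, labels = lloyd_kmeans(toy, k=2)
```

In this post each "point" is a country's whole 40-day curve, so the same distance calculation runs over 40 coordinates instead of 2.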

The graphs on this page are produced with plotly and are interactive. For example, clicking lines in the legend can turn data sets on and off; double-clicking turns the others off. Hovering over points on the lines will show the country's name and totals. The scale can be toggled between linear and logarithmic.

National Infections by K-Means Cluster

This plot shows how countries are assigned to 4 k-means clusters based on their first 40 days after reporting more than 50 COVID-19 infections. Interestingly, of the 121 countries in the data set, 105 were assigned to cluster 0. The remaining 16 countries were spread amongst the top 3 clusters, with the US claiming cluster 3 all to itself. In addition to ending the 40 days with a much higher number of infections, the US followed a unique trajectory through most of the data set.

The Python Abridgment

In the code below, modelData is a dataframe containing infection (or mortality) data as y, and x is the offset in days from the date the country first recorded more than 50 infections (or 10 deaths). The testSet dataframe is simply a listing of the countries from modelData that have more than 40 days of recorded data. This matters because the KMeans function does not accept NaN values, and backfilling them with data would influence the outcome of the clustering algorithm.

print(modelData.head(5).to_markdown())
|    |   x |   y | index       |
|---:|----:|----:|:------------|
|  0 |   0 |  11 | Afghanistan |
|  1 |   1 |  14 | Afghanistan |
|  2 |   2 |  14 | Afghanistan |
|  3 |   3 |  15 | Afghanistan |
|  4 |   4 |  15 | Afghanistan |
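The preparation step is not shown in this post, so the following is only a sketch of how a frame shaped like modelData, and a matching testSet, might be derived. The wide `cumulative` frame, its values, and the tiny `min_days` value are assumptions made so the example runs on toy data; the post itself uses 40 days.

```python
import pandas as pd

# hypothetical cumulative counts: one column per country, one row per day
cumulative = pd.DataFrame(
    {"A": [10, 40, 60, 90, 130], "B": [0, 5, 20, 55, 70]},
    index=pd.date_range("2020-03-01", periods=5),
)

threshold = 50   # day zero is the first day with more than 50 infections
min_days = 2     # the post uses 40; kept tiny so the toy data qualifies

frames = []
for country, series in cumulative.items():
    # keep day zero onward (assumes the counts never decrease)
    after = series[series > threshold]
    frames.append(pd.DataFrame({
        "x": range(len(after)),      # offset in days from day zero
        "y": after.to_numpy(),       # cumulative count on that day
        "index": country,            # country name, as in modelData
    }))
modelData = pd.concat(frames, ignore_index=True)

# countries with more than min_days of data, indexed by country name
days_recorded = modelData.groupby("index")["x"].count()
testSet = days_recorded[days_recorded > min_days].to_frame("days")
```

Only country A survives the filter here, so `testSet.index.values` would contain just that one name.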

The following code covers the basics of executing the k-means clustering algorithm and plotting the results.

import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
import plotly.express as px

# set our number of clusters
cluster_num = 4

# build a dataframe based on our model and test set and drop data > 40 days
clusterData = modelData.loc[modelData['index'].isin(testSet.index.values)]
clusterData = clusterData[(clusterData['x'] < 40)]

# set the index to the country name and the x value which will give us a column for y data
# each country and x value will have 40 y values in a single column
# unstack will put the y values across the columns where each column is the x value
# reset the index and drop the 'index' column which is the country name
clusterData = clusterData.set_index(['index','x']).unstack().reset_index()
growthCluster = clusterData.drop(columns="index")

# execute the k-means algorithm and fit the data
kmeans = KMeans(n_clusters=cluster_num)
kmeans.fit(growthCluster)

# predict the cluster for each row of the data set
y_kmeans = kmeans.predict(growthCluster)

# add the predicted cluster back into the dataframe that carries the country name as the index
clusterData['Cluster'] = y_kmeans

# set the index to identify Country and Cluster and then 'stack' the results
# the index is cleared and the result is 'tidy' data
plotData = clusterData.set_index(['index','Cluster']).stack().reset_index()

# Plot the data using plotly express
fig = px.line(plotData, x="x", y="y", color="Cluster", line_group="index", hover_name="index",
              line_shape="spline", render_mode="svg")
fig.show()

Summary of K-Means Infection Clusters 1, 2 & 3

If we were interested in doing a deeper study on the hardest hit nations, we could then drop the countries that fell into cluster 0, representing the flattest overall curve, and focus our attention on the countries in the other three.

| Nation | Count @ Day 40 | Day Zero | Cluster |
|:---|---:|:---|---:|
| US | 275367 | 2020-02-24 | 3 |
| Turkey | 110130 | 2020-03-18 | 2 |
| Spain | 153222 | 2020-03-01 | 2 |
| Germany | 113296 | 2020-02-29 | 2 |
| Italy | 110574 | 2020-02-22 | 2 |
| China | 79932 | 2020-01-22 | 2 |
| Brazil | 40743 | 2020-03-12 | 1 |
| Russia | 57999 | 2020-03-14 | 1 |
| United Kingdom | 79874 | 2020-03-03 | 1 |
| Canada | 28209 | 2020-03-07 | 1 |
| Iran | 53183 | 2020-02-24 | 1 |
| Portugal | 20206 | 2020-03-11 | 1 |
| Belgium | 30589 | 2020-03-05 | 1 |
| France | 79163 | 2020-02-28 | 1 |
| Netherlands | 26710 | 2020-03-05 | 1 |
| Switzerland | 25107 | 2020-03-03 | 1 |

Note: Going up to 5 clusters only moved 15 countries out of cluster 0 and into the upper clusters, leaving 90 in the lowest. With 8 clusters created, cluster 0 still had 79 countries.
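A comparison like the one in the note can be scripted by refitting with several values of k and counting how many rows land in each cluster. The sketch below uses synthetic curves as a stand-in for the real data (the slopes and group sizes are invented), so the counts it prints will not match the figures above.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
days = np.arange(40)

# synthetic stand-in: 60 flat curves, 8 moderate curves, 2 steep curves
curves = np.vstack([
    np.outer(rng.uniform(1, 50, 60), days),
    np.outer(rng.uniform(500, 800, 8), days),
    np.outer(rng.uniform(3000, 4000, 2), days),
])

sizes_by_k = {}
for k in (3, 4, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(curves)
    # number of curves assigned to each cluster label
    sizes_by_k[k] = np.bincount(labels, minlength=k)
    print(k, sorted(sizes_by_k[k].tolist(), reverse=True))
```

Fixing random_state keeps the run repeatable; without it, both the cluster numbering and sometimes the memberships can change from run to run.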

Infection Cluster Centers

Potentially more interesting than knowing which cluster each country falls into is the shape of each cluster's center. These center curves can be plotted by pulling the arrays from the fitted model (_y = kmeans.cluster_centers_[i], where i runs from 0 to 3). They represent the 4 models that k-means clustering could be used to predict against for countries outside of the data set. They could also be used for similar future infections (COVID-26?) to help nations determine which model they are tracking closest to.
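As a rough sketch of both ideas, the snippet below pulls the center curves from a fitted model and assigns an unseen curve to its nearest center. The curves are synthetic, and the re-ranking step is our own addition: scikit-learn numbers its clusters arbitrarily, so we sort the centers by their final-day value to get a flattest-to-steepest ordering.

```python
import numpy as np
from sklearn.cluster import KMeans

days = np.arange(40)
slopes = np.array([1, 2, 3, 100, 110, 1000])   # synthetic growth rates
curves = np.outer(slopes, days)                # one row per "country"

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(curves)

# each row of cluster_centers_ is a 40-point curve: the mean of its members
centers = kmeans.cluster_centers_

# rank the arbitrary labels by the center's value on the last day
order = np.argsort(centers[:, -1])             # original labels, flattest first
rank_of = np.empty_like(order)
rank_of[order] = np.arange(len(order))         # original label -> rank

# assign a curve that was not in the training data to its nearest center
new_curve = (105 * days).reshape(1, -1)
new_rank = rank_of[kmeans.predict(new_curve)[0]]
```

Plotting `centers[i]` against `days` for each i would give the center lines described above.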

National Mortality by K-Means Cluster

Similarly, the same algorithm can be run against mortality data.

Summary of K-Means Mortality Clusters 1, 2 & 3

There were only 67 countries with more than 40 days of reporting deaths. Of these 67, 15 were assigned to clusters 1 to 3 and 52 were assigned to cluster 0. The US ended the 40 days with a higher number of deaths, but because its curve was similar to those of France and Italy, all three were assigned to the same cluster. Again, we can drop the countries in cluster 0 and focus attention on those in clusters 1 to 3.

| Nation | Count @ Day 40 | Day Zero | Cluster |
|:---|---:|:---|---:|
| United Kingdom | 20264 | 2020-03-13 | 3 |
| Spain | 18708 | 2020-03-07 | 3 |
| US | 26086 | 2020-03-04 | 2 |
| Italy | 15362 | 2020-02-25 | 2 |
| France | 17169 | 2020-03-07 | 2 |
| Mexico | 2507 | 2020-03-27 | 1 |
| Brazil | 5083 | 2020-03-20 | 1 |
| Canada | 2983 | 2020-03-20 | 1 |
| Sweden | 2194 | 2020-03-18 | 1 |
| Belgium | 6917 | 2020-03-17 | 1 |
| Germany | 5575 | 2020-03-15 | 1 |
| Turkey | 3174 | 2020-03-22 | 1 |
| Iran | 3294 | 2020-02-24 | 1 |
| Netherlands | 3929 | 2020-03-13 | 1 |
| China | 2872 | 2020-01-22 | 1 |

Mortality Cluster Centers

Similarly, the mortality curve cluster centers can be extracted and plotted.

Summary of Infection and Mortality Clusters

With both sets of data, we can simply sum the two cluster numbers together to get a quick metric of how the country fared across both infections and mortality. All things being equal, you would expect close alignment between the infection and mortality clusters. Discrepancies here would be another opportunity to pull together more data and try to determine possible causes. For the sake of trimming our data set again, we will continue to ignore countries that were assigned to cluster 0 for both infections and mortality.

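| Nation | Infection Cluster | Infections @ Day 40 | Mortality Cluster | Deaths @ Day 40 | Cluster Sum |
|:---|---:|---:|---:|---:|---:|
| Spain | 2 | 153222 | 3 | 18708 | 5 |
| US | 3 | 275367 | 2 | 26086 | 5 |
| Italy | 2 | 110574 | 2 | 15362 | 4 |
| United Kingdom | 1 | 79874 | 3 | 20264 | 4 |
| Turkey | 2 | 110130 | 1 | 3174 | 3 |
| France | 1 | 79163 | 2 | 17169 | 3 |
| Germany | 2 | 113296 | 1 | 5575 | 3 |
| China | 2 | 79932 | 1 | 2872 | 3 |
| Belgium | 1 | 30589 | 1 | 6917 | 2 |
| Iran | 1 | 53183 | 1 | 3294 | 2 |
| Canada | 1 | 28209 | 1 | 2983 | 2 |
| Netherlands | 1 | 26710 | 1 | 3929 | 2 |
| Brazil | 1 | 40743 | 1 | 5083 | 2 |
| Portugal | 1 | 20206 | 0 | 973 | 1 |
| Mexico | 0 | 11633 | 1 | 2507 | 1 |
| Sweden | 0 | 10948 | 1 | 2194 | 1 |
| Russia | 1 | 57999 | 0 | 1827 | 1 |
| Switzerland | 1 | 25107 | 0 | 1478 | 1 |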
Switzerland 1 25107 0 1478 1

Further Analysis of Cluster Zero

If our interest were in studying the countries with more favourable COVID-19 trajectories over the first 40 days, we could simply focus our attention on the countries in cluster 0 and drop clusters 1 to 3. We could then continue with more k-means clustering on this subset to look for patterns and valid models in these countries as well. The following table shows our original cluster 0 sub-clustered into 4.

| Cluster Zero Nation | Count @ Day 40 | Day Zero | Sub Cluster |
|:---|---:|:---|---:|
| Peru | 21648 | 2020-03-16 | 3 |
| Chile | 11296 | 2020-03-14 | 3 |
| Israel | 12758 | 2020-03-08 | 3 |
| Ireland | 16040 | 2020-03-13 | 3 |
| Ecuador | 22719 | 2020-03-17 | 3 |
| Korea, South | 9661 | 2020-02-20 | 3 |
| Bangladesh | 13770 | 2020-03-31 | 3 |
| Austria | 14226 | 2020-03-06 | 3 |
| India | 15722 | 2020-03-10 | 3 |
| Norway | 6525 | 2020-03-04 | 2 |
| Poland | 9856 | 2020-03-13 | 2 |
| Ukraine | 10406 | 2020-03-22 | 2 |
| Serbia | 7483 | 2020-03-16 | 2 |
| Dominican Republic | 6416 | 2020-03-20 | 2 |
| Denmark | 7268 | 2020-03-09 | 2 |
| Australia | 6315 | 2020-03-04 | 2 |
| Indonesia | 7135 | 2020-03-13 | 2 |
| Sweden | 10948 | 2020-03-05 | 2 |
| Belarus | 10463 | 2020-03-18 | 2 |
| Romania | 9242 | 2020-03-13 | 2 |
| Saudi Arabia | 11631 | 2020-03-13 | 2 |
| Philippines | 6459 | 2020-03-12 | 2 |
| Pakistan | 11155 | 2020-03-15 | 2 |
| Mexico | 11633 | 2020-03-15 | 2 |
| Czechia | 6746 | 2020-03-11 | 2 |
| United Arab Emirates | 6302 | 2020-03-10 | 1 |
| Argentina | 3607 | 2020-03-16 | 1 |
| Uzbekistan | 2118 | 2020-03-24 | 1 |
| Kazakhstan | 3138 | 2020-03-21 | 1 |
| Morocco | 4120 | 2020-03-19 | 1 |
| Hungary | 2443 | 2020-03-17 | 1 |
| South Africa | 3953 | 2020-03-15 | 1 |
| Finland | 3783 | 2020-03-11 | 1 |
| Moldova | 3638 | 2020-03-20 | 1 |
| Algeria | 3127 | 2020-03-16 | 1 |
| Qatar | 5448 | 2020-03-11 | 1 |
| Luxembourg | 3654 | 2020-03-14 | 1 |
| Colombia | 4881 | 2020-03-16 | 1 |
| Croatia | 2009 | 2020-03-16 | 1 |
| Greece | 2207 | 2020-03-08 | 1 |
| Malaysia | 4683 | 2020-03-04 | 1 |
| Egypt | 2844 | 2020-03-09 | 1 |
| Thailand | 2643 | 2020-03-07 | 1 |
| Panama | 5338 | 2020-03-16 | 1 |
| Estonia | 1552 | 2020-03-13 | 0 |
| Lithuania | 1375 | 2020-03-21 | 0 |
| Japan | 1387 | 2020-02-16 | 0 |
| Rwanda | 261 | 2020-03-26 | 0 |
| Slovakia | 1325 | 2020-03-15 | 0 |
| Kuwait | 993 | 2020-03-02 | 0 |
| Cuba | 1649 | 2020-03-25 | 0 |
| Taiwan* | 425 | 2020-03-13 | 0 |
| Liechtenstein | 82 | 2020-03-23 | 0 |
| Mauritius | 332 | 2020-03-26 | 0 |
| Diamond Princess | 706 | 2020-02-07 | 0 |
| Trinidad and Tobago | 116 | 2020-03-22 | 0 |
| Montenegro | 322 | 2020-03-25 | 0 |
| Iceland | 1727 | 2020-03-07 | 0 |
| Tunisia | 975 | 2020-03-20 | 0 |
| Brunei | 138 | 2020-03-15 | 0 |
| West Bank and Gaza | 344 | 2020-03-22 | 0 |
| New Zealand | 1476 | 2020-03-21 | 0 |
| Monaco | 96 | 2020-03-31 | 0 |
| Slovenia | 1330 | 2020-03-11 | 0 |
| Andorra | 743 | 2020-03-19 | 0 |
| Cyprus | 822 | 2020-03-19 | 0 |
| Burkina Faso | 641 | 2020-03-21 | 0 |
| Sri Lanka | 523 | 2020-03-18 | 0 |
| Malta | 450 | 2020-03-19 | 0 |
| Iraq | 1415 | 2020-03-07 | 0 |
| Cote d'Ivoire | 1362 | 2020-03-24 | 0 |
| Armenia | 1596 | 2020-03-16 | 0 |
| Cameroon | 1832 | 2020-03-23 | 0 |
| Honduras | 1178 | 2020-03-26 | 0 |
| Azerbaijan | 1766 | 2020-03-21 | 0 |
| Guinea | 2146 | 2020-04-02 | 0 |
| Nigeria | 2558 | 2020-03-25 | 0 |
| Paraguay | 431 | 2020-03-27 | 0 |
| Oman | 2274 | 2020-03-21 | 0 |
| Ghana | 2169 | 2020-03-24 | 0 |
| Bahrain | 1136 | 2020-03-04 | 0 |
| Congo (Kinshasa) | 682 | 2020-03-26 | 0 |
| Senegal | 933 | 2020-03-22 | 0 |
| Afghanistan | 2469 | 2020-03-24 | 0 |
| Kyrgyzstan | 843 | 2020-03-27 | 0 |
| Singapore | 455 | 2020-02-12 | 0 |
| Latvia | 812 | 2020-03-18 | 0 |
| Madagascar | 193 | 2020-03-31 | 0 |
| Albania | 678 | 2020-03-16 | 0 |
| Bosnia and Herzegovina | 1565 | 2020-03-19 | 0 |
| Uruguay | 596 | 2020-03-17 | 0 |
| San Marino | 455 | 2020-03-10 | 0 |
| Vietnam | 268 | 2020-03-14 | 0 |
| Kosovo | 855 | 2020-03-26 | 0 |
| Georgia | 539 | 2020-03-22 | 0 |
| Costa Rica | 695 | 2020-03-18 | 0 |
| North Macedonia | 1421 | 2020-03-20 | 0 |
| Bolivia | 1802 | 2020-03-27 | 0 |
| Niger | 821 | 2020-04-01 | 0 |
| Lebanon | 673 | 2020-03-11 | 0 |
| Bulgaria | 1097 | 2020-03-15 | 0 |
| Venezuela | 331 | 2020-03-21 | 0 |
| Jordan | 447 | 2020-03-18 | 0 |
| Kenya | 621 | 2020-03-30 | 0 |
| Cambodia | 122 | 2020-03-20 | 0 |
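The two-pass idea behind the table above can be sketched as follows: fit once, keep only the rows that fell into the crowded cluster, and fit again on that subset. The curves here are synthetic, and picking the largest cluster as the stand-in for "cluster 0" is an assumption of this sketch rather than something taken from the original code.

```python
import numpy as np
from sklearn.cluster import KMeans

days = np.arange(40)
rng = np.random.default_rng(0)

# synthetic stand-in: 30 flat curves plus two steep outliers
slopes = np.concatenate([rng.uniform(1, 60, 30), [2000.0, 7000.0]])
curves = np.outer(slopes, days)

# first pass over the whole data set
first = KMeans(n_clusters=3, n_init=10, random_state=0).fit(curves)

# treat the most populated cluster as "cluster 0" and keep only its rows
crowded = np.bincount(first.labels_).argmax()
subset = curves[first.labels_ == crowded]

# second pass: cluster the subset alone into 4 sub-clusters
second = KMeans(n_clusters=4, n_init=10, random_state=0).fit(subset)
sub_sizes = np.bincount(second.labels_, minlength=4)
```

Because the steep outliers no longer dominate the distances, the second pass can separate curves that all looked flat in the first one.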



COVID-19 Data Source: https://github.com/CSSEGISandData/COVID-19

This is the data repository for the 2019 Novel Coronavirus Visual Dashboard operated by the Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE). It is also supported by the ESRI Living Atlas Team and the Johns Hopkins University Applied Physics Lab (JHU APL).

Thank you to them for making this data available.
