COVID-19: Clustering National Patterns

The COVID-19 pandemic has produced vast quantities of publicly available data of prominent global interest. While the integrity and accuracy of the data have been scrutinized and questioned from nation to nation, the data still provides a great jumping-off point for exploring some fundamental data science tools.

In this post, we will explore how nations compared over their first 40 days of COVID-19 exposure for both infections and mortality. The data set for national infections and deaths is large enough that a brute-force, curve-by-curve comparison would be a challenge. One strategy is to group together nations that followed similar infection and mortality curves and compare the groups rather than the individual nations. One of the common algorithms for this type of analysis is k-means clustering.

K-means clustering is a type of unsupervised machine learning that analyzes a data set and assigns k cluster centers such that 1) each cluster center is the arithmetic mean of all points in its cluster and 2) every point is closer to its own cluster center than to any other cluster center. K-means then yields k sets of data that can be used to represent the entire data set under study. Beyond this, new data can be analyzed and assigned to these previously determined clusters, and anything learned from the original clusters can be assumed to apply to the new data.
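The two properties above can be illustrated with a minimal sketch of the classic assign-then-average iteration (often called Lloyd's algorithm), written in plain NumPy. This is not the scikit-learn implementation used later in this post; the function name lloyd_kmeans and the two synthetic blobs are assumptions made purely for illustration.

```python
import numpy as np

def lloyd_kmeans(points, k, n_iter=100, seed=0):
    """Illustrative k-means: assign to nearest center, then recompute means."""
    rng = np.random.default_rng(seed)
    # start from k distinct data points chosen at random
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # distance from every point to every center, shape (n_points, k)
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # property 1: each center becomes the arithmetic mean of its points
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        converged = np.allclose(new_centers, centers)
        centers = new_centers
        if converged:
            break
    # property 2: every point is labelled with its nearest final center
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    return centers, dists.argmin(axis=1)

# two well-separated synthetic blobs (made-up data, not COVID-19 figures)
rng = np.random.default_rng(1)
blob_a = rng.normal(loc=0.0, scale=0.5, size=(50, 2))
blob_b = rng.normal(loc=10.0, scale=0.5, size=(50, 2))
points = np.vstack([blob_a, blob_b])

centers, labels = lloyd_kmeans(points, k=2)
print(centers)
```

In practice a library implementation such as sklearn.cluster.KMeans is preferable, since it adds better initialization and multiple restarts.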

The graphs on this page are produced with plotly and are interactive. For example, clicking lines in the legend can turn data sets on and off; double-clicking turns the others off. Hovering over points on the lines will show the country's name and totals. The scale can be toggled between linear and logarithmic.

National Infections by K-Means Cluster

This plot shows how countries are assigned to 4 k-means clusters based on their first 40 days of reporting more than 50 COVID-19 infections. Interestingly, of the 121 countries in the data set, 105 were assigned to cluster 0. The remaining 16 countries were spread amongst the top 3 clusters, with the US claiming cluster 3 all to itself. In addition to ending with a much higher number of infections at the end of 40 days, the US followed a unique trajectory through most of the data set.

The Python Abridgment

In the code below, modelData is a dataframe containing infection (or mortality) data as y, and x is the offset in days from the date the country recorded more than 50 infections (or 10 deaths). The testSet dataframe is simply a listing of countries from modelData that have more than 40 days of recorded data. This is important because the KMeans function does not accept NaN values, and backfilling them with data would influence the outcome of the clustering algorithm.

print(modelData.head(5).to_markdown())

|    |   x |   y | index       |
|---:|----:|----:|:------------|
|  0 |   0 |  11 | Afghanistan |
|  1 |   1 |  14 | Afghanistan |
|  2 |   2 |  14 | Afghanistan |
|  3 |   3 |  15 | Afghanistan |
|  4 |   4 |  15 | Afghanistan |
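As a rough sketch of how a frame with this x / y / index layout could be derived from cumulative daily counts, the snippet below numbers each country's days from the first day it exceeds a threshold. The input frame raw, its column names, the helper align_to_day_zero and the toy numbers are all hypothetical; the post does not show how modelData and testSet were actually built. It also assumes one row per country per day and counts that never fall back below the threshold.

```python
import pandas as pd

# Hypothetical cumulative case counts; real data would come from the JHU CSSE files.
raw = pd.DataFrame({
    "country": ["A"] * 5 + ["B"] * 5,
    "date": list(pd.date_range("2020-03-01", periods=5)) * 2,
    "cases": [10, 40, 60, 90, 150, 0, 5, 30, 55, 80],
})

THRESHOLD = 50  # day zero is the first day with more than this many cases

def align_to_day_zero(frame, threshold=THRESHOLD):
    """Keep rows above the threshold and number the days from 0 per country."""
    above = frame[frame["cases"] > threshold].sort_values(["country", "date"])
    above = above.assign(x=above.groupby("country").cumcount())
    return (above.rename(columns={"cases": "y", "country": "index"})
                 [["x", "y", "index"]]
                 .reset_index(drop=True))

model_data = align_to_day_zero(raw)

# Countries with enough recorded days (the post uses 40; 3 suits this toy data).
MIN_DAYS = 3
days_recorded = model_data.groupby("index")["x"].count()
test_set = days_recorded[days_recorded >= MIN_DAYS]

print(model_data)
print(test_set.index.tolist())
```

Filtering on a minimum number of days up front is what keeps NaN values out of the matrix later handed to KMeans.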

The following code covers the basics of executing the k-means clustering algorithm and plotting the results.

import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
import plotly.express as px

# set our number of clusters
cluster_num = 4

# keep only the countries listed in testSet and only the first 40 days (x = 0..39)
clusterData = modelData.loc[modelData['index'].isin(testSet.index.values)]
clusterData = clusterData[(clusterData['x'] < 40)]

# index by country name and x, which leaves y as the only data column
# unstack then spreads the y values across the columns, one column per x value,
# so each country becomes a single row of 40 y values
# reset_index restores the country name as the 'index' column
clusterData = clusterData.set_index(['index','x']).unstack().reset_index()
# drop the country name so that only the numeric y values are passed to k-means
growthCluster = clusterData.drop(columns="index")

# execute the k-means algorithm and fit the data
kmeans = KMeans(n_clusters=cluster_num)
kmeans.fit(growthCluster)

# predict the cluster for each row of the data set
y_kmeans = kmeans.predict(growthCluster)

# add the predicted cluster back into the dataframe that still carries the country name
clusterData['Cluster'] = y_kmeans

# set the index to identify Country and Cluster and then 'stack' the results
# resetting the index afterwards leaves 'tidy' data with one row per country and day
plotData = clusterData.set_index(['index','Cluster']).stack().reset_index()

# plot the data using plotly express, one line per country, coloured by cluster
fig = px.line(plotData, x="x", y="y", color="Cluster", line_group="index", hover_name="index",
                  line_shape="spline", render_mode="svg")
# display the interactive figure
fig.show()
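The reshaping step is the least obvious part of the code above, so here is a small stand-alone sketch of the same long-to-wide-to-long round trip on made-up numbers. It uses DataFrame.pivot rather than the set_index / unstack combination from the post, which should give an equivalent wide table with simpler column labels; the frame long_df and its values are assumptions for illustration only.

```python
import pandas as pd

# Toy long-format frame with the same columns as modelData (values are made up).
long_df = pd.DataFrame({
    "x": [0, 1, 2, 0, 1, 2],
    "y": [60, 90, 150, 55, 80, 120],
    "index": ["A", "A", "A", "B", "B", "B"],
})

# Wide format: one row per country, one column per day offset.
# This is the shape k-means needs, since each row is one sample.
wide = long_df.pivot(index="index", columns="x", values="y")

# Back to tidy format: one row per (country, day) pair, ready for plotting.
tidy = wide.stack().rename("y").reset_index()

print(wide)
print(tidy)
```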

Summary of K-Means Infection Clusters 1, 2 & 3

If we were interested in doing a deeper study on the hardest hit nations, we could then drop the countries that fell into cluster 0, representing the flattest overall curve, and focus our attention on the countries in the other three.

Nation	Count @ Day 40	Day Zero	Cluster
US	275367	2020-02-24	3
Turkey	110130	2020-03-18	2
Spain	153222	2020-03-01	2
Germany	113296	2020-02-29	2
Italy	110574	2020-02-22	2
China	79932	2020-01-22	2
Brazil	40743	2020-03-12	1
Russia	57999	2020-03-14	1
United Kingdom	79874	2020-03-03	1
Canada	28209	2020-03-07	1
Iran	53183	2020-02-24	1
Portugal	20206	2020-03-11	1
Belgium	30589	2020-03-05	1
France	79163	2020-02-28	1
Netherlands	26710	2020-03-05	1
Switzerland	25107	2020-03-03	1

Note: Going up to 5 clusters only moved 15 countries out of cluster 0 and into the upper clusters, leaving 90 in the lowest. With 8 clusters created, cluster 0 still had 79 countries.
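A comparison like the one in this note can be sketched by refitting with several values of k and counting the members of each cluster. The curves below are synthetic stand-ins (many slow-growing rows and a few fast-growing ones), not the real national data, so the cluster sizes will not match the figures quoted above; n_init and random_state are set only to make the sketch repeatable.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
days = np.arange(40)

# Synthetic stand-in for growthCluster: one row per country, one column per day.
# Most rows end low and a few end very high, mimicking the skew described above.
final_counts = np.concatenate([
    rng.uniform(100, 5_000, 100),
    rng.uniform(50_000, 250_000, 10),
])
curves = final_counts[:, None] * (days / days[-1]) ** 2

sizes_by_k = {}
for k in (4, 5, 8):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(curves)
    # number of rows assigned to each cluster label
    sizes_by_k[k] = np.bincount(model.labels_, minlength=k)
    # inertia_ is the within-cluster sum of squared distances
    print(k, sorted(sizes_by_k[k].tolist(), reverse=True), round(model.inertia_))
```

Watching how quickly the inertia falls as k grows is one common, informal way to judge whether extra clusters are earning their keep.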

Infection Cluster Centers

Potentially more interesting than knowing which cluster each country falls into, the lines representing each cluster's geometric center can also be plotted by pulling the lists from the k-means data (_y = kmeans.cluster_centers_[i]), where i is 0 to 3. These lines represent the 4 models that k-means clustering could be used to predict against for countries outside of the data set. They could also be used for similar future infections (COVID-26?) to help nations determine which model they are tracking closest to.
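Pulling the centers and matching an outside curve against them might look like the sketch below. The curves are synthetic, and the re-ranking step is an assumption on my part: scikit-learn numbers its clusters arbitrarily, so ordering them by where each center ends is one way to get labels where 0 is the flattest curve, as in the tables in this post.

```python
import numpy as np
from sklearn.cluster import KMeans

days = np.arange(40)
shape = (days / days[-1]) ** 2

# Synthetic stand-in for growthCluster: rows are countries, columns are days 0..39.
finals = np.array([500, 800, 1_200, 20_000, 25_000, 30_000,
                   100_000, 110_000, 270_000])
curves = finals[:, None] * shape

kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(curves)

# Each row of cluster_centers_ is a 40-point curve, one per cluster.
centers = kmeans.cluster_centers_

# Rank the clusters by the final value of their center curve.
order = np.argsort(centers[:, -1])       # order[r] = raw label with the r-th lowest center
rank_of = np.empty_like(order)
rank_of[order] = np.arange(len(order))   # rank_of[raw label] = ordered label

# A curve from outside the data set is assigned to its nearest center.
new_curve = 28_000 * shape
raw_label = kmeans.predict(new_curve.reshape(1, -1))[0]
ordered_label = rank_of[raw_label]

print(centers[order][:, -1])
print(ordered_label)
```

Each row of centers[order] can be passed to a plotting call as its own line to reproduce a cluster-center chart.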

National Mortality by K-Means Cluster

Similarly, the same algorithm can be run against mortality data.

Summary of K-Means Mortality Clusters 1, 2 & 3

There were only 67 countries with more than 40 days of reporting deaths. Of these 67, 15 were assigned to clusters 1 - 3 and 52 were assigned to cluster 0. With mortality, the US again ended the 40 days with the highest number of deaths, but because its curve was similar to those of France and Italy, all three were assigned to the same cluster. Again, we can drop the countries in cluster 0 and focus attention on those in clusters 1 - 3.

Nation	Count @ Day 40	Day Zero	Cluster
United Kingdom	20264	2020-03-13	3
Spain	18708	2020-03-07	3
US	26086	2020-03-04	2
Italy	15362	2020-02-25	2
France	17169	2020-03-07	2
Mexico	2507	2020-03-27	1
Brazil	5083	2020-03-20	1
Canada	2983	2020-03-20	1
Sweden	2194	2020-03-18	1
Belgium	6917	2020-03-17	1
Germany	5575	2020-03-15	1
Turkey	3174	2020-03-22	1
Iran	3294	2020-02-24	1
Netherlands	3929	2020-03-13	1
China	2872	2020-01-22	1

Mortality Cluster Centers

Similarly, the mortality curve cluster centers can be extracted and plotted.

Summary of Infection and Mortality Clusters

With both sets of data, we can simply sum the two cluster numbers together to get a quick metric of how each country fared across both infections and mortality. All things being equal, you would expect close alignment between the infection and mortality clusters. Discrepancies here would be another opportunity to pull together more data and try to determine possible causes. For the sake of trimming our data set again, we will continue to ignore countries that were assigned to cluster 0 for both infections and mortality.

Nation	Infection Cluster	Infections @ Day 40	Mortality Cluster	Deaths @ Day 40	Cluster Sum
Spain	2	153222	3	18708	5
US	3	275367	2	26086	5
Italy	2	110574	2	15362	4
United Kingdom	1	79874	3	20264	4
Turkey	2	110130	1	3174	3
France	1	79163	2	17169	3
Germany	2	113296	1	5575	3
China	2	79932	1	2872	3
Belgium	1	30589	1	6917	2
Iran	1	53183	1	3294	2
Canada	1	28209	1	2983	2
Netherlands	1	26710	1	3929	2
Brazil	1	40743	1	5083	2
Portugal	1	20206	0	973	1
Mexico	0	11633	1	2507	1
Sweden	0	10948	1	2194	1
Russia	1	57999	0	1827	1
Switzerland	1	25107	0	1478	1
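The cluster sum in the table above is simple arithmetic, and one way to compute it with pandas is sketched here. Only a handful of rows are copied from the tables in this post, and for brevity a country missing from one of the two series is treated as cluster 0 for that measure; with the full data set each country would carry an explicit label instead.

```python
import pandas as pd

# A few cluster assignments copied from the tables in this post.
infection = pd.Series(
    {"Spain": 2, "US": 3, "Italy": 2, "Portugal": 1},
    name="Infection Cluster",
)
mortality = pd.Series(
    {"Spain": 3, "US": 2, "Italy": 2, "Mexico": 1},
    name="Mortality Cluster",
)

# Align on country name; a missing entry is treated as cluster 0 in this sketch.
summary = pd.concat([infection, mortality], axis=1).fillna(0).astype(int)
summary["Cluster Sum"] = summary["Infection Cluster"] + summary["Mortality Cluster"]

# Ignore countries in cluster 0 for both measures and list the highest sums first.
summary = (summary[summary["Cluster Sum"] > 0]
           .sort_values("Cluster Sum", ascending=False, kind="stable"))

print(summary)
```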

Further Analysis of Cluster Zero

If our interest was in studying the countries with more favourable COVID-19 trajectories over the first 40 days, we could simply focus our attention on the countries in cluster 0 and drop clusters 1 - 3. We could then continue with more k-means clustering on this sub-set to look for patterns and valid models in these countries as well. The following table shows our original cluster 0 sub-clustered into 4.

Cluster Zero Nation	Count @ Day 40	Day Zero	Sub Cluster
Peru	21648	2020-03-16	3
Chile	11296	2020-03-14	3
Israel	12758	2020-03-08	3
Ireland	16040	2020-03-13	3
Ecuador	22719	2020-03-17	3
Korea, South	9661	2020-02-20	3
Bangladesh	13770	2020-03-31	3
Austria	14226	2020-03-06	3
India	15722	2020-03-10	3
Norway	6525	2020-03-04	2
Poland	9856	2020-03-13	2
Ukraine	10406	2020-03-22	2
Serbia	7483	2020-03-16	2
Dominican Republic	6416	2020-03-20	2
Denmark	7268	2020-03-09	2
Australia	6315	2020-03-04	2
Indonesia	7135	2020-03-13	2
Sweden	10948	2020-03-05	2
Belarus	10463	2020-03-18	2
Romania	9242	2020-03-13	2
Saudi Arabia	11631	2020-03-13	2
Philippines	6459	2020-03-12	2
Pakistan	11155	2020-03-15	2
Mexico	11633	2020-03-15	2
Czechia	6746	2020-03-11	2
United Arab Emirates	6302	2020-03-10	1
Argentina	3607	2020-03-16	1
Uzbekistan	2118	2020-03-24	1
Kazakhstan	3138	2020-03-21	1
Morocco	4120	2020-03-19	1
Hungary	2443	2020-03-17	1
South Africa	3953	2020-03-15	1
Finland	3783	2020-03-11	1
Moldova	3638	2020-03-20	1
Algeria	3127	2020-03-16	1
Qatar	5448	2020-03-11	1
Luxembourg	3654	2020-03-14	1
Colombia	4881	2020-03-16	1
Croatia	2009	2020-03-16	1
Greece	2207	2020-03-08	1
Malaysia	4683	2020-03-04	1
Egypt	2844	2020-03-09	1
Thailand	2643	2020-03-07	1
Panama	5338	2020-03-16	1
Estonia	1552	2020-03-13	0
Lithuania	1375	2020-03-21	0
Japan	1387	2020-02-16	0
Rwanda	261	2020-03-26	0
Slovakia	1325	2020-03-15	0
Kuwait	993	2020-03-02	0
Cuba	1649	2020-03-25	0
Taiwan*	425	2020-03-13	0
Liechtenstein	82	2020-03-23	0
Mauritius	332	2020-03-26	0
Diamond Princess	706	2020-02-07	0
Trinidad and Tobago	116	2020-03-22	0
Montenegro	322	2020-03-25	0
Iceland	1727	2020-03-07	0
Tunisia	975	2020-03-20	0
Brunei	138	2020-03-15	0
West Bank and Gaza	344	2020-03-22	0
New Zealand	1476	2020-03-21	0
Monaco	96	2020-03-31	0
Slovenia	1330	2020-03-11	0
Andorra	743	2020-03-19	0
Cyprus	822	2020-03-19	0
Burkina Faso	641	2020-03-21	0
Sri Lanka	523	2020-03-18	0
Malta	450	2020-03-19	0
Iraq	1415	2020-03-07	0
Cote d'Ivoire	1362	2020-03-24	0
Armenia	1596	2020-03-16	0
Cameroon	1832	2020-03-23	0
Honduras	1178	2020-03-26	0
Azerbaijan	1766	2020-03-21	0
Guinea	2146	2020-04-02	0
Nigeria	2558	2020-03-25	0
Paraguay	431	2020-03-27	0
Oman	2274	2020-03-21	0
Ghana	2169	2020-03-24	0
Bahrain	1136	2020-03-04	0
Congo (Kinshasa)	682	2020-03-26	0
Senegal	933	2020-03-22	0
Afghanistan	2469	2020-03-24	0
Kyrgyzstan	843	2020-03-27	0
Singapore	455	2020-02-12	0
Latvia	812	2020-03-18	0
Madagascar	193	2020-03-31	0
Albania	678	2020-03-16	0
Bosnia and Herzegovina	1565	2020-03-19	0
Uruguay	596	2020-03-17	0
San Marino	455	2020-03-10	0
Vietnam	268	2020-03-14	0
Kosovo	855	2020-03-26	0
Georgia	539	2020-03-22	0
Costa Rica	695	2020-03-18	0
North Macedonia	1421	2020-03-20	0
Bolivia	1802	2020-03-27	0
Niger	821	2020-04-01	0
Lebanon	673	2020-03-11	0
Bulgaria	1097	2020-03-15	0
Venezuela	331	2020-03-21	0
Jordan	447	2020-03-18	0
Kenya	621	2020-03-30	0
Cambodia	122	2020-03-20	0
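A second pass like the one tabulated above could be sketched as follows: fit once, keep only the rows in the flattest cluster, and fit again on that subset. The curves are synthetic, and picking the flattest cluster by the final value of its center is an assumption, since the post does not show how its cluster numbers were ordered.

```python
import numpy as np
from sklearn.cluster import KMeans

days = np.arange(40)
shape = (days / days[-1]) ** 2
rng = np.random.default_rng(0)

# Synthetic stand-in for growthCluster: 60 low curves plus 6 high ones.
finals = np.concatenate([
    rng.uniform(100, 20_000, 60),
    rng.uniform(80_000, 250_000, 6),
])
curves = finals[:, None] * shape

first_pass = KMeans(n_clusters=4, n_init=10, random_state=0).fit(curves)

# Raw k-means labels are arbitrary, so find the cluster whose center ends lowest.
lowest = np.argmin(first_pass.cluster_centers_[:, -1])
low_mask = first_pass.labels_ == lowest
low_curves = curves[low_mask]

# Re-cluster only the flattest group into 4 sub-clusters.
second_pass = KMeans(n_clusters=4, n_init=10, random_state=0).fit(low_curves)
sub_sizes = np.bincount(second_pass.labels_, minlength=4)

print(int(low_mask.sum()), sub_sizes.tolist())
```

Because the dominant high-count curves are removed before the second fit, the sub-clusters can separate differences that were too small to matter in the first pass.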

Covid-19 Data Source: https://github.com/CSSEGISandData/COVID-19

This is the data repository for the 2019 Novel Coronavirus Visual Dashboard operated by the Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE). It is also supported by the ESRI Living Atlas Team and the Johns Hopkins University Applied Physics Lab (JHU APL).

Thank you to them for making this data available.
