The analysis involves several steps:

1. Define the set of explanatory variables – in our case, measures of economic institutions – and the set of outcomes you’re interested in – in our case, various measures of desirable features of a society such as income, life expectancy, education, equality, etc., as well as some other features that may or may not be correlated with institutions, like ethnic and religious heterogeneity. Note: you often have some theoretical grounds on which to assume that the explanatory variables cause the outcomes but, strictly speaking, assuming such causality is not necessary – the analysis may be purely descriptive.
2. Do a cluster analysis on the set of explanatory variables. The result of this analysis is to label each case in your dataset (in our case, each country) as belonging to a particular category (cluster).
3. Compare the outcomes of interest across the resulting clusters – for example, by looking at the means of the outcome variables within each cluster.

These three steps are the standard analysis for any type of cluster analysis. With hierarchical cluster analysis you can go further and build a more complex analysis:

4. Do a cluster analysis on the set of outcomes. This will generate a tree showing how similar different countries are in terms of the outcomes of interest.
5. Calculate the correlation between the tree created by the cluster analysis on the set of institutions (step 2) and the tree created on the set of outcomes (step 4). This tells you to what extent your set of institutions provides a good explanation of the outcomes.
6. Try out different sets of explanatory variables, for instance to see which set of institutions leads to a cluster tree that best matches the cluster tree based on outcomes. Possible questions you can ask: What if we add political institutions and some measures of culture to the set of economic institutions? Will that provide a better account of the differences in outcomes between countries?
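The tree-correlation idea in step 5 can be illustrated on stand-in data (`mtcars` rather than the QoG dataset; the variable choices below are arbitrary placeholders): build one dendrogram per variable set and compute their cophenetic correlation with `dendextend`.

```r
library(dendextend)

# tree based on one set of variables (the "explanatory" side)
dend1 <- as.dendrogram(hclust(dist(scale(mtcars[, c("wt", "hp")])), method = "ward.D"))
# tree based on a different set (the "outcomes" side)
dend2 <- as.dendrogram(hclust(dist(scale(mtcars[, c("mpg", "qsec")])), method = "ward.D"))

# cophenetic correlation between the two trees:
# close to 1 = very similar cluster structure, close to 0 = unrelated
cor_cophenetic(dend1, dend2)
```

The same function is what we use later (via `cor.dendlist`) to compare the institutions tree with the outcomes tree.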
In the case of regression analysis, we get some conclusion about the impact of each individual explanatory variable upon one outcome of interest (and possibly also the impact of some combination of variables, if we add interaction terms to the regression). The cluster analysis allows us to study the effects of packages of institutions as a whole. The downside is that we cannot estimate what effect any one specific institutional change (to, say, one single institutional variable) would have upon the outcome. But this analysis does allow us to discover what alternative institutional packages exist, and to analyze their comparative performance – rather than talking about comparative economic systems based on purely intuitive categories, as is all too common.
# for plotting trees
library(dendextend)
library(ape)
# for visualizing kmeans results
library(factoextra)
library(tidyverse)
# for calculating and plotting correlations between dendrograms
library(corrplot)
I’m using here the Quality of Government standard dataset.
qog <- rio::import("data/qog_std_cs_jan18.dta")
Build a summary of the dataset:
qog_summary <- data.frame(
Variable = names(qog),
Description = Hmisc::label(qog),
t(pastecs::stat.desc(qog))
) %>%
select(Variable, Description, Obs = nbr.val, min, median, mean, max) %>%
mutate_if(is.numeric, round, 1)
qog_summary %>%
DT::datatable(rownames = FALSE, options = list(pageLength = 5),
caption = "Search for things like GDP per capita, equality, corruption, democracy, education, etc.; or for data sources such as wvs_ (World Values Survey), fi_ (Fraser Institute), wdi_ (World Bank's Development Indicators), etc.")
Define two lists of variables, one for institutional factors and another for outcomes of interest, and create a smaller dataset for our purposes:
# pick the list of variables we want to use from the QoG
fraser <- c("fi_ftradeint_pd", "fi_legprop_pd", "fi_reg_pd", "fi_sm_pd", "fi_sog_pd")
outcomes <- c(
# GDP per capita, growth, Gini, life expectancy
"unna_gdppc", "wdi_gdpcapgr", "wdi_gini", "wdi_lifexp",
# fractionalization
"al_ethnic", "al_religion",
# education female and male ages 35-44
"gea_ea3544f", "gea_ea3544m",
# corruption, public sector and political
"vdem_pubcorr", "vdem_corr")
# prepare smaller work dataset containing institutional factors
institutions <- qog %>%
select(cname, ccodealp, fraser, outcomes)
# remove rows with missing data
institutions <- na.omit(institutions)
# use row names to identify the countries
row.names(institutions) <- institutions$cname
qog_summary %>%
filter(Variable %in% c(fraser, outcomes)) %>%
knitr::kable()
Variable | Description | Obs | min | median | mean | max |
---|---|---|---|---|---|---|
al_ethnic | Ethnic Fractionalization | 186 | 0.0 | 0.4 | 0.4 | 0.9 |
al_religion | Religion Fractionalization | 188 | 0.0 | 0.5 | 0.4 | 0.9 |
fi_ftradeint_pd | Freedom to Trade Internationally (panel data) | 158 | 3.0 | 7.2 | 7.1 | 9.5 |
fi_legprop_pd | Legal Structure and Security of Property Rights (panel data) | 158 | 1.6 | 5.0 | 5.2 | 8.8 |
fi_reg_pd | Regulation of Credit, Labor and Business (panel data) | 158 | 3.0 | 7.1 | 7.0 | 9.0 |
fi_sm_pd | Access to Sound Money (chain_linked) | 158 | 3.2 | 8.7 | 8.3 | 9.8 |
fi_sog_pd | Size of Government: Expenditures, Taxes and Enterprises (panel data) | 158 | 3.5 | 6.4 | 6.4 | 9.5 |
gea_ea3544f | Educational Attainment (35-44 years, Female) | 187 | 0.7 | 9.4 | 8.9 | 15.5 |
gea_ea3544m | Educational Attainment (35-44 years, Male) | 187 | 2.0 | 9.9 | 9.8 | 15.2 |
unna_gdppc | GDP per Capita (Current Prices in US$) | 193 | 148.8 | 5496.0 | 15827.6 | 187649.8 |
vdem_corr | Political corruption index | 168 | 0.0 | 0.6 | 0.5 | 1.0 |
vdem_pubcorr | Public sector corruption index | 168 | 0.0 | 0.5 | 0.5 | 1.0 |
wdi_gdpcapgr | GDP per capita growth (annual %) | 187 | -62.2 | 1.9 | 1.8 | 24.6 |
wdi_gini | GINI index (World Bank estimate) | 115 | 24.1 | 36.4 | 37.5 | 63.4 |
wdi_lifexp | Life expectancy at birth, total (years) | 185 | 50.6 | 72.9 | 71.3 | 83.6 |
In general, before doing the cluster analysis, you need to make sure that all your variables are measured in comparable units – otherwise your analysis can be heavily distorted by just one variable that happens to have much larger values than the others.
In the present case, I’m using the Fraser Institute data on economic institutions, and this data is already scaled – all variables lie between 1 and 10. But if you add some other institutional factors you need to consider this possible problem.
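A toy illustration of the distortion (synthetic numbers, not QoG data): when one variable is measured in dollars and another on a 1–10 scale, the dollar variable swamps the Euclidean distances.

```r
# Synthetic example: one variable with large values, one on a 1-10 scale
x <- data.frame(
  gdppc = c(1000, 50000, 52000),  # in dollars
  score = c(9, 1, 9)              # on a 1-10 scale
)
# unscaled: distances are driven almost entirely by gdppc, so cases 2 and 3
# look nearly identical even though their scores differ maximally
round(dist(x), 1)
# after scaling, both variables contribute to the distances
round(dist(scale(x)), 2)
```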
The simplest way to standardize the dataset is to use the `scale()` function:
fraser_scaled <- institutions %>% select(fraser) %>% scale() %>% as.data.frame()
outcomes_scaled <- institutions %>% select(outcomes) %>% scale() %>% as.data.frame()
The cluster analysis identifies how similar two countries are to each other, either based on Euclidean distance or based on correlations.
In our case, the Euclidean distance between, say, the USA and France is:
\[
\begin{align}
d_{\text{USA-FR}}^2 = & \; [\text{(Size of Govt)}_{\text{USA}} - \text{(Size of Govt)}_{\text{FR}}]^2 + \\
& [\text{(Deregulation)}_{\text{USA}} - \text{(Deregulation)}_{\text{FR}}]^2 + \\
& [\text{(Free Trade)}_{\text{USA}} - \text{(Free Trade)}_{\text{FR}}]^2 + \\
& [\text{(Sound Money)}_{\text{USA}} - \text{(Sound Money)}_{\text{FR}}]^2 + \\
& [\text{(Property Rights)}_{\text{USA}} - \text{(Property Rights)}_{\text{FR}}]^2
\end{align}
\]
The “ward.D” method of clustering calculates these distances between all pairs of countries and builds the clusters such that the countries within a cluster are those with the smallest distances between each other. This is usually the best method for identifying clearly distinct categories. See this link for more information about clustering methods.
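As a quick sanity check with made-up scores (not the actual USA and France values), the formula above can be computed by hand and compared with what `dist()` returns:

```r
# Hypothetical scores on the five Fraser dimensions (made-up numbers)
usa <- c(sog = 6.5, reg = 7.8, trade = 7.6, money = 9.7, prop = 7.3)
fra <- c(sog = 4.5, reg = 6.7, trade = 8.0, money = 9.6, prop = 7.5)

# Euclidean distance by hand, following the formula...
sqrt(sum((usa - fra)^2))
# ...agrees with what dist() computes
dist(rbind(usa, fra), method = "euclidean")
```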
Create a hierarchical cluster tree based on Euclidean distance between cases.
hc <- institutions %>%
select(fraser) %>% # choose only fraser variables
dist(method = "euclidean") %>% # calculate distances between pairs of countries
hclust(method="ward.D") # build cluster tree
Plot the cluster tree (known as a “dendrogram”)
plot(hc,
hang = -1, # arranges labels at fixed distance
cex = 0.7, # size of labels
main = ""
)
Plot a circular dendrogram
hc %>%
ape::as.phylo() %>%
plot(type="fan", cex = 0.8)
Based on the plot we can see that we could reasonably separate the countries into 2, 3 or 4 categories. I’m going to choose 3 here.
Split the set into 3 groups and add the group number to the dataset.
institutions <- institutions %>%
mutate(groups_fraser = cutree(hc, 3)) %>%
# recode the group numbers to be in a better order
# (this was discovered by first doing the analysis without recoding)
mutate(groups_fraser = recode(groups_fraser, `1` = 2, `2` = 1, `3` = 3))
List countries in each group
map_df(1:3, function(g){
institutions %>%
filter(groups_fraser == g) %>%
select(cname) %>%
rename_at("cname", funs(paste("Group", g)))
}) %>%
# shift elements in Group 2 and 3 upwards in the table to replace the NAs
mutate(`Group 2` = lead(`Group 2`, length(na.omit(`Group 1`)))) %>%
mutate(`Group 3` = lead(`Group 3`, length(na.omit(`Group 1`)) +
length(na.omit(`Group 2`))
)
) %>%
# eliminate rows that have only NAs
filter(!is.na(`Group 1`) | !is.na(`Group 2`) | !is.na(`Group 3`)) %>%
# change NAs to ""
mutate_all(funs(ifelse(is.na(.), "", .))) %>%
knitr::kable()
Group 1 | Group 2 | Group 3 |
---|---|---|
Algeria | Albania | Austria |
Argentina | Armenia | Belgium |
Bolivia | Bhutan | Canada |
Brazil | Bosnia and Herzegovina | Croatia |
Myanmar | Bulgaria | Czech Republic |
Burundi | Sri Lanka | Denmark |
Cameroon | Chile | Estonia |
Chad | China | Finland |
Congo | Colombia | France (1963-) |
Congo, Democratic Republic | Costa Rica | Germany |
Benin | Cyprus (1975-) | Greece |
Ecuador | Dominican Republic | Hungary |
Guinea | El Salvador | Ireland |
Haiti | Fiji | Israel |
Iran | Georgia | Italy |
Cote d’Ivoire | Ghana | Korea, South |
Madagascar | Guatemala | Netherlands |
Mauritania | Honduras | Norway |
Niger | Iceland | Poland |
Pakistan (1971-) | India | Portugal |
Senegal | Indonesia | Rwanda |
Sierra Leone | Kazakhstan | Slovakia |
Zimbabwe | Kyrgyzstan | Slovenia |
Togo | Laos | Spain |
Egypt | Lebanon | Sweden |
Burkina Faso | Latvia | Switzerland |
Liberia | United Kingdom | |
Lithuania | United States | |
Mauritius | ||
Mexico | ||
Mongolia | ||
Moldova | ||
Nicaragua | ||
Panama | ||
Paraguay | ||
Peru | ||
Philippines | ||
Romania | ||
Russia | ||
Seychelles | ||
South Africa | ||
Tajikistan | ||
Thailand | ||
Turkey | ||
Uganda | ||
Ukraine | ||
Macedonia | ||
Tanzania | ||
Uruguay | ||
Zambia |
institutions %>%
ggplot() +
aes(x = fi_legprop_pd, y = log(unna_gdppc), color = factor(groups_fraser)) +
geom_point() +
labs(x = "Property rights", y = "Log of GDP per capita")
institutions %>%
ggplot() +
aes(x = fi_legprop_pd, y = vdem_pubcorr, color = factor(groups_fraser)) +
geom_point() +
labs(x = "Property rights", y = "Corruption")
institutions %>%
ggplot() +
aes(x = fi_legprop_pd, y = wdi_lifexp, color = factor(groups_fraser)) +
geom_point() +
labs(x = "Property rights", y = "Life expectancy")
institutions %>%
ggplot() +
aes(x = fi_legprop_pd, y = gea_ea3544f, color = factor(groups_fraser)) +
geom_point() +
labs(x = "Property rights", y = "Education, female (35-44)")
institutions %>%
ggplot() +
aes(x = fi_legprop_pd, y = wdi_gdpcapgr, color = factor(groups_fraser)) +
geom_point() +
labs(x = "Property rights", y = "Growth rate")
# calculate means of the outcome variables for each cluster
outcomes_means <- institutions %>%
group_by(groups_fraser) %>%
summarise_at(vars(outcomes), funs(mean)) %>%
gather(key = "variable", value = "mean", -groups_fraser)
# calculate the standard error
std.err <- function(x) { sqrt(var(x, na.rm=TRUE)/length(na.omit(x))) }
# calculate standard errors (uses function `std.err` defined above)
outcomes_se <- institutions %>%
group_by(groups_fraser) %>%
summarise_at(vars(outcomes), funs(std.err)) %>%
gather(key = "variable", value = "se", -groups_fraser)
outcomes_summary <- full_join(outcomes_means, outcomes_se, by = c("groups_fraser", "variable"))
# show results as table
knitr::kable(outcomes_summary, digits=2)
groups_fraser | variable | mean | se |
---|---|---|---|
1 | unna_gdppc | 2617.13 | 661.45 |
2 | unna_gdppc | 7650.93 | 1193.23 |
3 | unna_gdppc | 39377.71 | 4193.23 |
1 | wdi_gdpcapgr | 2.27 | 0.45 |
2 | wdi_gdpcapgr | 2.67 | 0.32 |
3 | wdi_gdpcapgr | 1.84 | 0.35 |
1 | wdi_gini | 40.15 | 1.22 |
2 | wdi_gini | 39.43 | 1.21 |
3 | wdi_gini | 32.38 | 1.03 |
1 | wdi_lifexp | 63.66 | 1.50 |
2 | wdi_lifexp | 72.36 | 0.75 |
3 | wdi_lifexp | 80.18 | 0.65 |
1 | al_ethnic | 0.63 | 0.05 |
2 | al_ethnic | 0.45 | 0.03 |
3 | al_ethnic | 0.24 | 0.04 |
1 | al_religion | 0.41 | 0.05 |
2 | al_religion | 0.43 | 0.03 |
3 | al_religion | 0.43 | 0.04 |
1 | gea_ea3544f | 4.78 | 0.60 |
2 | gea_ea3544f | 10.01 | 0.46 |
3 | gea_ea3544f | 13.31 | 0.42 |
1 | gea_ea3544m | 6.68 | 0.52 |
2 | gea_ea3544m | 10.41 | 0.33 |
3 | gea_ea3544m | 13.16 | 0.37 |
1 | vdem_pubcorr | 0.67 | 0.04 |
2 | vdem_pubcorr | 0.50 | 0.04 |
3 | vdem_pubcorr | 0.10 | 0.02 |
1 | vdem_corr | 0.68 | 0.03 |
2 | vdem_corr | 0.54 | 0.04 |
3 | vdem_corr | 0.14 | 0.02 |
Show the summary statistics as bar plots:
outcomes_summary %>%
ggplot() +
aes(x = groups_fraser, y = mean) +
geom_bar(stat = "identity") +
geom_errorbar(aes(ymin = mean - se, ymax = mean + se), width = 0.3) +
facet_wrap(~variable, scale = "free", ncol = 2)
## calculate hierarchical cluster
## comparison of different clustering methods
hc1 <- fraser_scaled %>% dist(method = "euclidean") %>% hclust(method="single")
hc2 <- fraser_scaled %>% dist(method = "euclidean") %>% hclust(method="complete")
hc3 <- fraser_scaled %>% dist(method = "euclidean") %>% hclust(method="average")
hc4 <- fraser_scaled %>% dist(method = "euclidean") %>% hclust(method="mcquitty")
hc5 <- fraser_scaled %>% dist(method = "euclidean") %>% hclust(method="median")
hc6 <- fraser_scaled %>% dist(method = "euclidean") %>% hclust(method="centroid")
hc7 <- fraser_scaled %>% dist(method = "euclidean") %>% hclust(method="ward.D")
hc8 <- fraser_scaled %>% dist(method = "euclidean") %>% hclust(method="ward.D2")
fraser_dendlist <- dendlist(
as.dendrogram(hc1),
as.dendrogram(hc2),
as.dendrogram(hc3),
as.dendrogram(hc4),
as.dendrogram(hc5),
as.dendrogram(hc6),
as.dendrogram(hc7),
as.dendrogram(hc8)
)
hclust_methods <- c("single", "complete", "average", "mcquitty",
"median", "centroid", "ward.D", "ward.D2")
names(fraser_dendlist) <- hclust_methods
# calculate correlations
fraser_cor <- cor.dendlist(fraser_dendlist)
# plot correlation table
corrplot.mixed(fraser_cor)
hc_o1 <- outcomes_scaled %>% dist(method = "euclidean") %>% hclust(method="single")
hc_o2 <- outcomes_scaled %>% dist(method = "euclidean") %>% hclust(method="complete")
hc_o3 <- outcomes_scaled %>% dist(method = "euclidean") %>% hclust(method="average")
hc_o4 <- outcomes_scaled %>% dist(method = "euclidean") %>% hclust(method="mcquitty")
hc_o5 <- outcomes_scaled %>% dist(method = "euclidean") %>% hclust(method="median")
hc_o6 <- outcomes_scaled %>% dist(method = "euclidean") %>% hclust(method="centroid")
hc_o7 <- outcomes_scaled %>% dist(method = "euclidean") %>% hclust(method="ward.D")
hc_o8 <- outcomes_scaled %>% dist(method = "euclidean") %>% hclust(method="ward.D2")
outcomes_dendlist <- dendlist(
as.dendrogram(hc_o1),
as.dendrogram(hc_o2),
as.dendrogram(hc_o3),
as.dendrogram(hc_o4),
as.dendrogram(hc_o5),
as.dendrogram(hc_o6),
as.dendrogram(hc_o7),
as.dendrogram(hc_o8)
)
hclust_methods <- c("single", "complete", "average", "mcquitty",
"median", "centroid", "ward.D", "ward.D2")
names(outcomes_dendlist) <- hclust_methods
# calculate correlations
outcomes_cor <- cor.dendlist(outcomes_dendlist)
# plot correlation table
corrplot.mixed(outcomes_cor)
# plot cluster
plot(hc_o7, hang = -1, cex = 0.7)  # labels default to the country row names
Which clustering method leads to the highest correlation between the tree based on institutions and the tree based on outcomes?
compar_list <- dendlist(
as.dendrogram(hc1),
as.dendrogram(hc2),
as.dendrogram(hc3),
as.dendrogram(hc4),
as.dendrogram(hc5),
as.dendrogram(hc6),
as.dendrogram(hc7),
as.dendrogram(hc8),
as.dendrogram(hc_o1),
as.dendrogram(hc_o2),
as.dendrogram(hc_o3),
as.dendrogram(hc_o4),
as.dendrogram(hc_o5),
as.dendrogram(hc_o6),
as.dendrogram(hc_o7),
as.dendrogram(hc_o8)
)
names(compar_list) <- c("fraser1", "fraser2","fraser3","fraser4","fraser5",
"fraser6","fraser7","fraser8",
"outcomes1", "outcomes2", "outcomes3", "outcomes4",
"outcomes5", "outcomes6", "outcomes7", "outcomes8")
# correlations plot
corrplot.mixed(cor.dendlist(compar_list))
Another type of analysis we could do: instead of looking at different clustering methods applied to the same set of variables, we can ask which set of institutional variables leads to the highest correlation with the outcomes tree.
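A minimal sketch of that comparison on stand-in data (`mtcars` rather than QoG; the variable sets are arbitrary placeholders): fix the outcomes tree once, then score each candidate explanatory set by its cophenetic correlation with it.

```r
library(dendextend)

# the "outcomes" tree, built once
outcomes_dend <- as.dendrogram(hclust(dist(scale(mtcars[, c("mpg", "qsec")])), method = "ward.D"))

# two candidate "explanatory" variable sets (placeholders)
set_a <- c("wt", "hp")
set_b <- c("gear", "carb")

# score each candidate set: higher = better match with the outcomes tree
sapply(list(a = set_a, b = set_b), function(vars) {
  dend <- as.dendrogram(hclust(dist(scale(mtcars[, vars])), method = "ward.D"))
  cor_cophenetic(dend, outcomes_dend)
})
```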
Visual comparison of trees, between institutions and outcomes:
dend_fraser <- as.dendrogram(hc7)
dend_outcomes <- as.dendrogram(hc_o7)
tanglegram(dend_fraser, dend_outcomes,
sort = TRUE,
common_subtrees_color_lines = FALSE,
highlight_distinct_edges = FALSE,
highlight_branches_lwd = FALSE,
margin_inner = 0.5,
main_left = "Tree based on institutions",
main_right = "Tree based on outcomes")
Check this link for a brief intro to the theory. Unlike hierarchical clustering, k-means clustering requires you to specify the number of desired clusters in advance (k – hence the name “k-means”).
This algorithm works in the following way:

1. Pick k initial cluster centers (e.g. k randomly chosen cases).
2. Assign each case to the nearest center.
3. Recompute each center as the mean of the cases currently assigned to it.
4. Repeat steps 2–3 until the assignments stop changing.
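The loop can be sketched from scratch in a few lines (for illustration only; in practice use `stats::kmeans`, which adds multiple random starts and other safeguards):

```r
# Toy k-means implementation following steps 1-4 above
toy_kmeans <- function(x, k, iters = 100) {
  x <- as.matrix(x)
  centers <- x[sample(nrow(x), k), , drop = FALSE]  # step 1: k random cases as centers
  cluster <- integer(nrow(x))
  for (i in seq_len(iters)) {
    # step 2: assign each case to its nearest center (squared Euclidean distance)
    d2 <- sapply(seq_len(k), function(j) colSums((t(x) - centers[j, ])^2))
    new_cluster <- max.col(-d2)
    if (identical(new_cluster, cluster)) break  # step 4: stop when assignments are stable
    cluster <- new_cluster
    # step 3: recompute each (non-empty) cluster's center as the mean of its cases
    for (j in seq_len(k)) {
      idx <- which(cluster == j)
      if (length(idx) > 0) centers[j, ] <- colMeans(x[idx, , drop = FALSE])
    }
  }
  list(cluster = cluster, centers = centers)
}

set.seed(42)
table(toy_kmeans(scale(mtcars[, c("mpg", "wt")]), k = 3)$cluster)
```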
We do the k-means clustering with the function `kmeans(data, k)`:
km <- institutions %>%
select(fraser) %>%
kmeans(3)
km
## K-means clustering with 3 clusters of sizes 43, 28, 33
##
## Cluster means:
## fi_ftradeint_pd fi_legprop_pd fi_reg_pd fi_sm_pd fi_sog_pd
## 1 7.598851 5.152179 7.162781 8.901143 7.451189
## 2 8.152068 7.132163 7.761195 9.542097 5.244410
## 3 5.829118 3.892006 6.105180 7.136185 6.197818
##
## Clustering vector:
## [1] 1 3 3 2 1 2 1 3 1 3 1 3 3 3 2 3 3 1 3 3 3 3 1 2 1 2 3 2 1 3 1 2 3 2 2
## [36] 1 2 3 2 1 3 1 1 2 2 3 1 3 2 1 2 3 1 1 1 1 1 2 1 1 3 3 1 1 1 1 2 1 3 2
## [71] 3 1 1 1 1 2 2 1 1 2 3 1 3 2 2 1 3 2 2 2 1 3 3 1 1 3 1 3 2 1 2 3 1 1
##
## Within cluster sum of squares by cluster:
## [1] 122.55388 67.24369 151.35111
## (between_SS / total_SS = 58.6 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss"
## [5] "tot.withinss" "betweenss" "size" "iter"
## [9] "ifault"
# Use the clustering vector to label the cases
institutions <- institutions %>% mutate(kmeans = km$cluster)
Some plots:
institutions %>%
ggplot() +
aes(x = fi_legprop_pd, y = gea_ea3544f, color = factor(kmeans)) +
geom_point() +
labs(x = "Property rights", y = "Education, female (35-44)")
institutions %>%
ggplot() +
aes(x = fi_legprop_pd, y = log(unna_gdppc), color = factor(kmeans)) +
geom_point() +
labs(x = "Property rights", y = "Log of GDP per capita")
As you can see, the result of the k-means clustering is very similar to that of hierarchical clustering.
An alternative way to visualize the clusters is to project them onto the first two principal components:
institutions %>%
select(fraser) %>%
fviz_cluster(km, geom = "point", data = .) + ggtitle("k = 3")
How many clusters should we make? See which number maximizes the silhouette:
institutions %>%
select(fraser) %>%
fviz_nbclust(kmeans, method = "silhouette")
The silhouette is a measure that takes into account both how close the cases within a cluster are to each other and how distant the different clusters are from one another.
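To make this concrete, here is a sketch of computing the average silhouette width for one specific clustering, using `cluster::silhouette` on stand-in data (`mtcars` rather than the fraser variables):

```r
library(cluster)  # provides silhouette()

x <- scale(mtcars[, c("mpg", "wt", "hp")])
set.seed(1)
km2 <- kmeans(x, 2)

sil <- silhouette(km2$cluster, dist(x))
# per-case widths are in sil[, "sil_width"]; their mean is the quantity
# that fviz_nbclust plots for each candidate k
mean(sil[, "sil_width"])
```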
The kmeans with 2 clusters:
institutions %>%
select(fraser) %>%
kmeans(2) %>%
fviz_cluster(geom = "point",
data = institutions %>% select(fraser)) + ggtitle("k = 2")
institutions %>%
select(fraser) %>%
kmeans(8) %>%
fviz_cluster(geom = "point",
data = institutions %>% select(fraser)) + ggtitle("k = 8")