General approach

The analysis involves several steps:

  1. Choose the set of
  • explanatory variables – in our case, measures of institutions;
  • outcomes you’re interested in – in our case, various measures of desirable features of a society, such as income, life expectancy, education, and equality, as well as some other features that may or may not be correlated with institutions, such as ethnic and religious heterogeneity.

    Note: you often have some theoretical grounds on which to assume that the explanatory variables cause the outcomes, but, strictly speaking, assuming such causality is not necessary – the analysis may be purely descriptive.

  2. Do a cluster analysis on the set of explanatory variables. The result of this analysis will be to label each case in your dataset (in our case, each country) as belonging to a particular category (cluster).

  3. Calculate the means and standard errors of the outcomes of interest within each category, and see if the outcomes differ from one category to another.
  • You can visualize these results using bar plots with error bars.
  • Interpretation: if the error bars built from standard errors overlap, you can conclude that the differences between means across categories are not statistically significant. The converse, however, does not hold – if they do not overlap, you still cannot be sure that the differences in means are statistically significant. By contrast, if you build the error bars from confidence intervals, overlap implies nothing, but non-overlap does imply statistical significance (see the sketch below). See this link for more info.
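
For example, here is a minimal sketch of error bars built from 95% confidence intervals rather than standard errors (with the tidyverse loaded; df, y, and group are hypothetical placeholder names):

# means with 95% confidence intervals for each cluster
df %>% 
  group_by(group) %>% 
  summarise(n    = sum(!is.na(y)),
            mean = mean(y, na.rm = TRUE),
            se   = sd(y, na.rm = TRUE) / sqrt(n)) %>% 
  mutate(ci = qt(0.975, df = n - 1) * se) %>%   # half-width of the 95% CI
  ggplot() +
    aes(x = factor(group), y = mean) +
    geom_bar(stat = "identity") +
    geom_errorbar(aes(ymin = mean - ci, ymax = mean + ci), width = 0.3)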

These three steps form the standard analysis for any type of cluster analysis.

With hierarchical cluster analysis you can go further and extend the analysis as follows:

  4. Do a cluster analysis on the set of outcomes. This will generate a tree showing you how similar different countries are in terms of the set of outcomes of interest.

  5. Calculate the correlation between the tree created by the cluster analysis on the set of institutions (step 2) and the tree created on the set of outcomes (step 4). This will tell you to what extent your set of institutions provides a good explanation for the outcomes.

  6. Try out different sets of explanatory variables, for instance to see which set of institutions leads to a cluster tree that best matches the cluster tree based on outcomes. Possible questions you can ask: What if we add political institutions and some measures of culture to the set of economic institutions? Will that provide a better account of the differences in outcomes between countries?

In the case of regression analysis, we get conclusions about the impact of each individual explanatory variable upon one outcome of interest (and possibly also the impact of some combination of variables, if we add interaction terms to the regression). Cluster analysis allows us to study the effects of packages of institutions as a whole. The downside is that we cannot estimate what effect any one specific institutional change (to, say, a single institutional variable) would have upon the outcome. But this analysis does allow us to discover what alternative institutional packages exist, and to analyze their comparative performance – rather than talking about comparative economic systems based on purely intuitive categories, as is all too common.

Packages

# for plotting trees
library(dendextend)
library(ape)

# for visualizing kmeans results
library(factoextra)

library(tidyverse)

# for calculating and plotting correlations between dendrograms
library(corrplot)

Load and explore dataset

I’m using here the Quality of Government standard dataset.

qog <- rio::import("data/qog_std_cs_jan18.dta")

Build a summary of the dataset:

qog_summary <- data.frame(
    Variable    = names(qog), 
    Description = Hmisc::label(qog), 
    t(pastecs::stat.desc(qog)) 
  ) %>% 
  select(Variable, Description, Obs = nbr.val, min, median, mean, max) %>% 
  mutate_if(is.numeric, round, 1)

qog_summary %>% 
  DT::datatable(rownames = FALSE, options = list(pageLength = 5), 
                caption = "Search for things like GDP per capita, equality, corruption, democracy, education, etc.; or for data sources such as wvs_ (World Values Survey), fi_ (Fraser Institute), wdi_ (World Bank's Development Indicators), etc.")

Data preparation

Select explanatory variables and outcome variables

Define two lists of variables, one for institutional factors and another for outcomes of interest, and create a smaller dataset for our purposes:

# pick the list of variables we want to use from the QoG
fraser <- c("fi_ftradeint_pd", "fi_legprop_pd", "fi_reg_pd", "fi_sm_pd", "fi_sog_pd")

outcomes <- c(
  # GDP per capita, growth, Gini, life expectancy
  "unna_gdppc", "wdi_gdpcapgr", "wdi_gini", "wdi_lifexp",

  # fractionalization
  "al_ethnic", "al_religion",

  # education female and male ages 35-44
  "gea_ea3544f", "gea_ea3544m",

  # corruption, public sector and political
  "vdem_pubcorr", "vdem_corr")

# prepare a smaller working dataset containing the institutional factors and outcomes
institutions <- qog %>% 
  select(cname, ccodealp, fraser, outcomes)

# remove rows with missing data
institutions <- na.omit(institutions)

# use row names to identify the countries
row.names(institutions) <- institutions$cname

qog_summary %>% 
  filter(Variable %in% c(fraser, outcomes)) %>% 
  knitr::kable()
| Variable | Description | Obs | min | median | mean | max |
|---|---|---|---|---|---|---|
| al_ethnic | Ethnic Fractionalization | 186 | 0.0 | 0.4 | 0.4 | 0.9 |
| al_religion | Religion Fractionalization | 188 | 0.0 | 0.5 | 0.4 | 0.9 |
| fi_ftradeint_pd | Freedom to Trade Internationally (panel data) | 158 | 3.0 | 7.2 | 7.1 | 9.5 |
| fi_legprop_pd | Legal Structure and Security of Property Rights (panel data) | 158 | 1.6 | 5.0 | 5.2 | 8.8 |
| fi_reg_pd | Regulation of Credit, Labor and Business (panel data) | 158 | 3.0 | 7.1 | 7.0 | 9.0 |
| fi_sm_pd | Access to Sound Money (chain_linked) | 158 | 3.2 | 8.7 | 8.3 | 9.8 |
| fi_sog_pd | Size of Government: Expenditures, Taxes and Enterprises (panel data) | 158 | 3.5 | 6.4 | 6.4 | 9.5 |
| gea_ea3544f | Educational Attainment (35-44 years, Female) | 187 | 0.7 | 9.4 | 8.9 | 15.5 |
| gea_ea3544m | Educational Attainment (35-44 years, Male) | 187 | 2.0 | 9.9 | 9.8 | 15.2 |
| unna_gdppc | GDP per Capita (Current Prices in US$) | 193 | 148.8 | 5496.0 | 15827.6 | 187649.8 |
| vdem_corr | Political corruption index | 168 | 0.0 | 0.6 | 0.5 | 1.0 |
| vdem_pubcorr | Public sector corruption index | 168 | 0.0 | 0.5 | 0.5 | 1.0 |
| wdi_gdpcapgr | GDP per capita growth (annual %) | 187 | -62.2 | 1.9 | 1.8 | 24.6 |
| wdi_gini | GINI index (World Bank estimate) | 115 | 24.1 | 36.4 | 37.5 | 63.4 |
| wdi_lifexp | Life expectancy at birth, total (years) | 185 | 50.6 | 72.9 | 71.3 | 83.6 |

Scaling the data to account for differing units of measure

In general, before doing the cluster analysis, you need to make sure that all your variables are measured in comparable units – otherwise your analysis can be heavily distorted by just one variable that happens to have much larger values than the others.
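
A toy illustration of the distortion (made-up numbers): a variable measured in dollars completely swamps a 0–10 index when computing distances.

# toy data: GDP per capita in dollars next to a 0-10 institutional index
toy <- data.frame(gdppc = c(50000, 49000, 1000),
                  index = c(9, 2, 8.9),
                  row.names = c("A", "B", "C"))

dist(toy)         # dominated by gdppc: A looks closest to B despite the index gap
dist(scale(toy))  # after scaling, the index difference between A and B counts too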

In the present case, I’m using the Fraser Institute data on economic institutions, and this data is already scaled – all variables are between 1 and 10. But if you add some other institutional factors, you need to watch out for this possible problem.

The simplest way to standardize the dataset is to use the scale function:

fraser_scaled   <- institutions %>% select(fraser) %>% scale() %>% as.data.frame()
outcomes_scaled <- institutions %>% select(outcomes) %>% scale() %>% as.data.frame()

Hierarchical clustering

The basic analysis

The cluster analysis identifies how similar two countries are to each other, either based on Euclidean distance or based on correlations.

In our case, the Euclidean distance between, say, the USA and France is:

\[ \begin{aligned} d_{USA,FR}^2 =\; & [\text{(Size of Govt)}_{USA}-\text{(Size of Govt)}_{FR}]^2 + \\ & [\text{(Deregulation)}_{USA}-\text{(Deregulation)}_{FR}]^2 + \\ & [\text{(Free Trade)}_{USA}-\text{(Free Trade)}_{FR}]^2 + \\ & [\text{(Sound Money)}_{USA}-\text{(Sound Money)}_{FR}]^2 + \\ & [\text{(Property Rights)}_{USA}-\text{(Property Rights)}_{FR}]^2 \end{aligned} \]

The “ward.D” method of clustering calculates these distances between all pairs of countries and creates the clusters such that the countries within a cluster are those with the smallest distances between each other. This is usually the best method for identifying clearly distinct categories. See this link for more information about clustering methods.
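
We can check this formula directly on the data (the row names are the QoG cname labels set earlier; note that France appears as “France (1963-)” in this dataset):

# Euclidean distance between the USA and France on the five Fraser variables
usa <- institutions["United States", fraser]
fra <- institutions["France (1963-)", fraser]
sqrt(sum((usa - fra)^2))

# the same result, using dist() on the two rows
dist(institutions[c("United States", "France (1963-)"), fraser])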

Create a hierarchical cluster tree based on Euclidean distance between cases.

hc <- institutions %>% 
  select(fraser) %>%                # choose only fraser variables
  dist(method = "euclidean") %>%    # calculate distances between pairs of countries
  hclust(method="ward.D")           # build cluster tree

Plot the cluster tree (known as a “dendrogram”)

plot(hc, 
     hang = -1,   # arranges labels at fixed distance
     cex = 0.7,   # size of labels
     main = ""
     )

Plot a circular dendrogram

hc %>% 
  ape::as.phylo() %>% 
  plot(type="fan", cex = 0.8)

Based on the plot, we can choose to separate the countries into 2, 3, or 4 categories. I’m going to choose 3 here.

Categorize the data based on the clusters

Split the set into 3 groups and add the group number to the dataset.

institutions <- institutions %>% 
  mutate(groups_fraser = cutree(hc, 3)) %>% 
  
  # recode the group numbers to be in a better order
  # (this was discovered by first doing the analysis without recoding)
  mutate(groups_fraser = recode(groups_fraser, `1` = 2, `2` = 1, `3` = 3))

List countries in each group

map_df(1:3, function(g){
  institutions %>% 
    filter(groups_fraser == g) %>% 
    select(cname) %>% 
    rename_at("cname", funs(paste("Group", g)))
  }) %>% 
  
  # shift elements in Group 2 and 3 upwards in the table to replace the NAs
  mutate(`Group 2` = lead(`Group 2`, length(na.omit(`Group 1`)))) %>% 
  mutate(`Group 3` = lead(`Group 3`, length(na.omit(`Group 1`)) + 
                                     length(na.omit(`Group 2`))
                          )
         ) %>% 
  
  # eliminate rows that contain only NAs
  filter(!is.na(`Group 1`) | !is.na(`Group 2`) | !is.na(`Group 3`)) %>% 
  
  # change NAs to ""
  mutate_all(funs(ifelse(is.na(.), "", .))) %>% 
  
  knitr::kable()
| Group 1 | Group 2 | Group 3 |
|---|---|---|
| Algeria | Albania | Austria |
| Argentina | Armenia | Belgium |
| Bolivia | Bhutan | Canada |
| Brazil | Bosnia and Herzegovina | Croatia |
| Myanmar | Bulgaria | Czech Republic |
| Burundi | Sri Lanka | Denmark |
| Cameroon | Chile | Estonia |
| Chad | China | Finland |
| Congo | Colombia | France (1963-) |
| Congo, Democratic Republic | Costa Rica | Germany |
| Benin | Cyprus (1975-) | Greece |
| Ecuador | Dominican Republic | Hungary |
| Guinea | El Salvador | Ireland |
| Haiti | Fiji | Israel |
| Iran | Georgia | Italy |
| Cote d’Ivoire | Ghana | Korea, South |
| Madagascar | Guatemala | Netherlands |
| Mauritania | Honduras | Norway |
| Niger | Iceland | Poland |
| Pakistan (1971-) | India | Portugal |
| Senegal | Indonesia | Rwanda |
| Sierra Leone | Kazakhstan | Slovakia |
| Zimbabwe | Kyrgyzstan | Slovenia |
| Togo | Laos | Spain |
| Egypt | Lebanon | Sweden |
| Burkina Faso | Latvia | Switzerland |
| | Liberia | United Kingdom |
| | Lithuania | United States |
| | Mauritius | |
| | Mexico | |
| | Mongolia | |
| | Moldova | |
| | Nicaragua | |
| | Panama | |
| | Paraguay | |
| | Peru | |
| | Philippines | |
| | Romania | |
| | Russia | |
| | Seychelles | |
| | South Africa | |
| | Tajikistan | |
| | Thailand | |
| | Turkey | |
| | Uganda | |
| | Ukraine | |
| | Macedonia | |
| | Tanzania | |
| | Uruguay | |
| | Zambia | |

Plot some relationships between institutions and outcomes

institutions %>% 
  ggplot() +
    aes(x = fi_legprop_pd, y = log(unna_gdppc), color = factor(groups_fraser)) +
    geom_point() +
    labs(x = "Property rights", y = "Log of GDP per capita")

institutions %>% 
  ggplot() +
    aes(x = fi_legprop_pd, y = vdem_pubcorr, color = factor(groups_fraser)) +
    geom_point() +
    labs(x = "Property rights", y = "Corruption")

institutions %>% 
  ggplot() +
    aes(x = fi_legprop_pd, y = wdi_lifexp, color = factor(groups_fraser)) +
    geom_point() +
    labs(x = "Property rights", y = "Life expectancy")

institutions %>% 
  ggplot() +
    aes(x = fi_legprop_pd, y = gea_ea3544f, color = factor(groups_fraser)) +
    geom_point() +
    labs(x = "Property rights", y = "Education, female (35-44)")

institutions %>% 
  ggplot() +
    aes(x = fi_legprop_pd, y = wdi_gdpcapgr, color = factor(groups_fraser)) +
    geom_point() +
    labs(x = "Property rights", y = "Growth rate")

Calculate summary statistics based on the cluster analysis

# calculate means of the outcome variables for each cluster
outcomes_means <- institutions %>%
  group_by(groups_fraser) %>%
  summarise_at(vars(outcomes), funs(mean)) %>%  
  gather(key = "variable", value = "mean", -groups_fraser)

# helper function: standard error of the mean
std.err <- function(x) { sqrt(var(x, na.rm=TRUE)/length(na.omit(x))) }

# calculate standard errors (uses function `std.err` defined above)
outcomes_se <- institutions %>%
  group_by(groups_fraser) %>%
  summarise_at(vars(outcomes), funs(std.err)) %>% 
  gather(key = "variable", value = "se", -groups_fraser)

outcomes_summary <- full_join(outcomes_means, outcomes_se, by = c("groups_fraser", "variable"))

# show results as table
knitr::kable(outcomes_summary, digits=2)
| groups_fraser | variable | mean | se |
|---|---|---|---|
| 1 | unna_gdppc | 2617.13 | 661.45 |
| 2 | unna_gdppc | 7650.93 | 1193.23 |
| 3 | unna_gdppc | 39377.71 | 4193.23 |
| 1 | wdi_gdpcapgr | 2.27 | 0.45 |
| 2 | wdi_gdpcapgr | 2.67 | 0.32 |
| 3 | wdi_gdpcapgr | 1.84 | 0.35 |
| 1 | wdi_gini | 40.15 | 1.22 |
| 2 | wdi_gini | 39.43 | 1.21 |
| 3 | wdi_gini | 32.38 | 1.03 |
| 1 | wdi_lifexp | 63.66 | 1.50 |
| 2 | wdi_lifexp | 72.36 | 0.75 |
| 3 | wdi_lifexp | 80.18 | 0.65 |
| 1 | al_ethnic | 0.63 | 0.05 |
| 2 | al_ethnic | 0.45 | 0.03 |
| 3 | al_ethnic | 0.24 | 0.04 |
| 1 | al_religion | 0.41 | 0.05 |
| 2 | al_religion | 0.43 | 0.03 |
| 3 | al_religion | 0.43 | 0.04 |
| 1 | gea_ea3544f | 4.78 | 0.60 |
| 2 | gea_ea3544f | 10.01 | 0.46 |
| 3 | gea_ea3544f | 13.31 | 0.42 |
| 1 | gea_ea3544m | 6.68 | 0.52 |
| 2 | gea_ea3544m | 10.41 | 0.33 |
| 3 | gea_ea3544m | 13.16 | 0.37 |
| 1 | vdem_pubcorr | 0.67 | 0.04 |
| 2 | vdem_pubcorr | 0.50 | 0.04 |
| 3 | vdem_pubcorr | 0.10 | 0.02 |
| 1 | vdem_corr | 0.68 | 0.03 |
| 2 | vdem_corr | 0.54 | 0.04 |
| 3 | vdem_corr | 0.14 | 0.02 |

Show summary statistics results as bar plots

outcomes_summary %>%
  ggplot() +
    aes(x = groups_fraser, y = mean) +
    geom_bar(stat = "identity") +
    geom_errorbar(aes(ymin = mean - se, ymax = mean + se), width = 0.3) +
    facet_wrap(~variable, scale = "free", ncol = 2)

Compare different clustering methods

# build hierarchical cluster trees with each of hclust's linkage methods,
# so that we can compare the clustering methods
hc1 <- fraser_scaled %>% dist(method = "euclidean") %>% hclust(method="single")
hc2 <- fraser_scaled %>% dist(method = "euclidean") %>% hclust(method="complete")
hc3 <- fraser_scaled %>% dist(method = "euclidean") %>% hclust(method="average")
hc4 <- fraser_scaled %>% dist(method = "euclidean") %>% hclust(method="mcquitty")
hc5 <- fraser_scaled %>% dist(method = "euclidean") %>% hclust(method="median")
hc6 <- fraser_scaled %>% dist(method = "euclidean") %>% hclust(method="centroid")
hc7 <- fraser_scaled %>% dist(method = "euclidean") %>% hclust(method="ward.D")
hc8 <- fraser_scaled %>% dist(method = "euclidean") %>% hclust(method="ward.D2")

fraser_dendlist <- dendlist(
  as.dendrogram(hc1),
  as.dendrogram(hc2),
  as.dendrogram(hc3),
  as.dendrogram(hc4),
  as.dendrogram(hc5),
  as.dendrogram(hc6),
  as.dendrogram(hc7),
  as.dendrogram(hc8)
)

hclust_methods <- c("single", "complete", "average", "mcquitty",
                    "median", "centroid", "ward.D", "ward.D2")
names(fraser_dendlist) <- hclust_methods

# calculate correlations
fraser_cor <- cor.dendlist(fraser_dendlist)

# plot correlation table
corrplot.mixed(fraser_cor)

Compare countries based on outcomes

hc_o1 <- outcomes_scaled %>% dist(method = "euclidean") %>% hclust(method="single")
hc_o2 <- outcomes_scaled %>% dist(method = "euclidean") %>% hclust(method="complete")
hc_o3 <- outcomes_scaled %>% dist(method = "euclidean") %>% hclust(method="average")
hc_o4 <- outcomes_scaled %>% dist(method = "euclidean") %>% hclust(method="mcquitty")
hc_o5 <- outcomes_scaled %>% dist(method = "euclidean") %>% hclust(method="median")
hc_o6 <- outcomes_scaled %>% dist(method = "euclidean") %>% hclust(method="centroid")
hc_o7 <- outcomes_scaled %>% dist(method = "euclidean") %>% hclust(method="ward.D")
hc_o8 <- outcomes_scaled %>% dist(method = "euclidean") %>% hclust(method="ward.D2")

outcomes_dendlist <- dendlist(
  as.dendrogram(hc_o1),
  as.dendrogram(hc_o2),
  as.dendrogram(hc_o3),
  as.dendrogram(hc_o4),
  as.dendrogram(hc_o5),
  as.dendrogram(hc_o6),
  as.dendrogram(hc_o7),
  as.dendrogram(hc_o8)
)

hclust_methods <- c("single", "complete", "average", "mcquitty",
                    "median", "centroid", "ward.D", "ward.D2")
names(outcomes_dendlist) <- hclust_methods

# calculate correlations
outcomes_cor <- cor.dendlist(outcomes_dendlist)

# plot correlation table
corrplot.mixed(outcomes_cor)

# plot the outcomes-based cluster tree (the row names supply the country labels)
plot(hc_o7, hang = -1, cex = 0.7)

Correlation between institutions tree and outcomes tree

Which clustering method leads to the highest correlation between the tree based on institutions and the tree based on outcomes?

compar_list <- dendlist(
  as.dendrogram(hc1),
  as.dendrogram(hc2),
  as.dendrogram(hc3),
  as.dendrogram(hc4),
  as.dendrogram(hc5),
  as.dendrogram(hc6),
  as.dendrogram(hc7),
  as.dendrogram(hc8),
  as.dendrogram(hc_o1),
  as.dendrogram(hc_o2),
  as.dendrogram(hc_o3),
  as.dendrogram(hc_o4),
  as.dendrogram(hc_o5),
  as.dendrogram(hc_o6),
  as.dendrogram(hc_o7),
  as.dendrogram(hc_o8)
)

names(compar_list) <- c("fraser1", "fraser2","fraser3","fraser4","fraser5",
                        "fraser6","fraser7","fraser8",
                        "outcomes1", "outcomes2", "outcomes3", "outcomes4",
                        "outcomes5", "outcomes6", "outcomes7", "outcomes8")

# correlations plot
corrplot.mixed(cor.dendlist(compar_list))

Another type of analysis we could do: instead of looking at different methods applied to the same set of variables, we can ask which set of institutional variables leads to the highest correlation with the outcomes tree.
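
Here is a sketch of how that could be done, scoring each candidate tree against the outcomes tree with dendextend’s cor_cophenetic(). The added variable p_polity2 (QoG’s Polity democracy score) is only an illustration – substitute whatever measures you want to test – and the working dataset has to be rebuilt so that all trees use the same complete cases:

# helper: build a Ward dendrogram from an arbitrary set of variables
tree_for <- function(data, vars) {
  data %>% 
    select(all_of(vars)) %>% 
    scale() %>% 
    dist(method = "euclidean") %>% 
    hclust(method = "ward.D") %>% 
    as.dendrogram()
}

# rebuild the dataset with the extra variable, keeping complete cases only
dat <- qog %>% 
  select(cname, all_of(c(fraser, outcomes)), p_polity2) %>% 
  na.omit()
row.names(dat) <- dat$cname

# does the enlarged set of institutions track the outcomes tree better?
cor_cophenetic(tree_for(dat, fraser), tree_for(dat, outcomes))
cor_cophenetic(tree_for(dat, c(fraser, "p_polity2")), tree_for(dat, outcomes))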

Visual comparison of the two trees, institutions vs. outcomes:

dend_fraser   <- as.dendrogram(hc7, h = 25)
dend_outcomes <- as.dendrogram(hc_o7, h = 25)

tanglegram(dend_fraser, dend_outcomes,
           sort = TRUE,
           common_subtrees_color_lines = FALSE,
           highlight_distinct_edges  = FALSE,
           highlight_branches_lwd = FALSE,
           margin_inner = 0.5,
           main_left = "Tree based on institutions",
           main_right = "Tree based on outcomes")

K-means clustering

Check this link for a brief intro to the theory. Unlike the case of hierarchical clustering, with k-means clustering you need to specify the number of desired clusters (k – hence the name “k-means”).

This algorithm works in the following way:

  1. start with \(k\) randomly placed “centroids” (the centers of the clusters)
  2. allocate each case to the closest centroid
  3. move each centroid to the middle (mean) of its cluster
  4. repeat steps 2-3 until all the centroids stabilize and reach their equilibrium positions
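
For intuition, here is a deliberately naive implementation of these four steps (base R only; it assumes no cluster ever becomes empty – in practice you should always use the built-in kmeans() shown below):

naive_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  # step 1: pick k random cases as the initial centroids
  centroids <- x[sample(nrow(x), k), , drop = FALSE]
  for (i in seq_len(max_iter)) {
    # step 2: allocate each case to the closest centroid
    d <- as.matrix(dist(rbind(centroids, x)))[-(1:k), 1:k]
    cluster <- apply(d, 1, which.min)
    # step 3: move each centroid to the mean of its cluster
    new_centroids <- apply(x, 2, function(col) tapply(col, cluster, mean))
    # step 4: stop once the centroids no longer move
    if (all(new_centroids == centroids)) break
    centroids <- new_centroids
  }
  cluster
}

# e.g.: naive_kmeans(institutions[fraser], 3)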

In practice we do the k-means clustering with the built-in function kmeans(data, k) (the starting centroids are random, so results can vary between runs – use set.seed() for reproducibility):

km <- institutions %>% 
  select(fraser) %>% 
  kmeans(3)

km
## K-means clustering with 3 clusters of sizes 43, 28, 33
## 
## Cluster means:
##   fi_ftradeint_pd fi_legprop_pd fi_reg_pd fi_sm_pd fi_sog_pd
## 1        7.598851      5.152179  7.162781 8.901143  7.451189
## 2        8.152068      7.132163  7.761195 9.542097  5.244410
## 3        5.829118      3.892006  6.105180 7.136185  6.197818
## 
## Clustering vector:
##   [1] 1 3 3 2 1 2 1 3 1 3 1 3 3 3 2 3 3 1 3 3 3 3 1 2 1 2 3 2 1 3 1 2 3 2 2
##  [36] 1 2 3 2 1 3 1 1 2 2 3 1 3 2 1 2 3 1 1 1 1 1 2 1 1 3 3 1 1 1 1 2 1 3 2
##  [71] 3 1 1 1 1 2 2 1 1 2 3 1 3 2 2 1 3 2 2 2 1 3 3 1 1 3 1 3 2 1 2 3 1 1
## 
## Within cluster sum of squares by cluster:
## [1] 122.55388  67.24369 151.35111
##  (between_SS / total_SS =  58.6 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"    
## [5] "tot.withinss" "betweenss"    "size"         "iter"        
## [9] "ifault"
# Use the clustering vector to label the cases
institutions <- institutions %>% mutate(kmeans = km$cluster)

Visualize the results

Some plots:

institutions %>% 
  ggplot() +
    aes(x = fi_legprop_pd, y = gea_ea3544f, color = factor(kmeans)) +
    geom_point() +
    labs(x = "Property rights", y = "Education, female (35-44)")

institutions %>% 
  ggplot() +
    aes(x = fi_legprop_pd, y = log(unna_gdppc), color = factor(kmeans)) +
    geom_point() +
    labs(x = "Property rights", y = "Log of GDP per capita")

As you can see, the result of the k-means clustering is very similar to that of hierarchical clustering.
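
You can check this impression with a quick cross-tabulation. Keep in mind that the k-means cluster numbers are arbitrary labels (they were not recoded the way groups_fraser was), so agreement shows up as one dominant cell per row rather than as a clean diagonal:

# how well do the two clusterings agree?
table(hierarchical = institutions$groups_fraser, kmeans = institutions$kmeans)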

An alternative way to visualize the clusters is to project the data onto the first two principal components:

institutions %>% 
  select(fraser) %>% 
  fviz_cluster(km, geom = "point", data = .) + ggtitle("k = 3")

Find the natural number of clusters

How many clusters should we make? See which number maximizes the average silhouette width:

institutions %>% 
  select(fraser) %>% 
  fviz_nbclust(kmeans, method = "silhouette")

The silhouette is a measure that takes into account both how close the cases within a cluster are to each other and how distant the different clusters are from one another.
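
If you want the numbers behind that plot, you can compute the same statistic directly with the cluster package – a sketch (nstart = 25 simply makes the random initialization of kmeans more stable):

# average silhouette width for k = 2..8
library(cluster)

fraser_data <- institutions %>% select(fraser)
d <- dist(fraser_data)

sapply(2:8, function(k) {
  km <- kmeans(fraser_data, centers = k, nstart = 25)
  mean(silhouette(km$cluster, d)[, "sil_width"])
})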

K-means with 2 clusters and, for comparison, with 8:

institutions %>% 
  select(fraser) %>% 
  kmeans(2) %>% 
  fviz_cluster(geom = "point", 
               data = institutions %>% select(fraser)) + ggtitle("k = 2")

institutions %>% 
  select(fraser) %>% 
  kmeans(8) %>% 
  fviz_cluster(geom = "point", 
               data = institutions %>% select(fraser)) + ggtitle("k = 8")