The analysis involves several steps:

1. Define the set of explanatory variables – in our case, measures of economic institutions – and the set of outcomes you’re interested in – in our case, various measures of desirable features of a society such as income, life expectancy, education, equality, etc., as well as some other features that may or may not be correlated with institutions, like ethnic and religious heterogeneity. Note: you often have some theoretical grounds on which to assume that the explanatory variables cause the outcomes but, strictly speaking, assuming such causality is not necessary – the analysis may be purely descriptive.
2. Do a cluster analysis on the set of explanatory variables. The result of this analysis is to label each case in your dataset (in our case, each country) as belonging to a particular category (cluster).
3. Compare the outcomes of interest across the resulting clusters – for example, by looking at the means of the outcome variables within each cluster.

These three steps are the standard analysis for any type of cluster analysis. With hierarchical cluster analysis you can go further and build a more complex analysis:

4. Do a cluster analysis on the set of outcomes. This will generate a tree showing how similar different countries are in terms of the outcomes of interest.
5. Calculate the correlation between the tree created by the cluster analysis on the set of institutions (step 2) and the tree created on the set of outcomes (step 4). This tells you to what extent your set of institutions provides a good explanation of the outcomes.
6. Try out different sets of explanatory variables, for instance to see which set of institutions leads to a cluster tree that best matches the cluster tree based on outcomes. Possible questions you can ask: What if we add political institutions and some measures of culture to the set of economic institutions? Will that provide a better account of the differences in outcomes between countries?
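The tree-correlation idea in step 5 can be illustrated on stand-in data (`mtcars` rather than the QoG dataset; the variable choices below are arbitrary placeholders): build one dendrogram per variable set and compute their cophenetic correlation with `dendextend`.

```r
library(dendextend)

# tree based on one set of variables (the "explanatory" side)
dend1 <- as.dendrogram(hclust(dist(scale(mtcars[, c("wt", "hp")])), method = "ward.D"))
# tree based on a different set (the "outcomes" side)
dend2 <- as.dendrogram(hclust(dist(scale(mtcars[, c("mpg", "qsec")])), method = "ward.D"))

# cophenetic correlation between the two trees:
# close to 1 = very similar cluster structure, close to 0 = unrelated
cor_cophenetic(dend1, dend2)
```

The same function is what we use later (via `cor.dendlist`) to compare the institutions tree with the outcomes tree.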
In the case of regression analysis, we get some conclusion about the impact of each individual explanatory variable upon one outcome of interest (and possibly also the impact of some combination of variables, if we add interaction terms to the regression). The cluster analysis allows us to study the effects of packages of institutions as a whole. The downside is that we cannot estimate what effect any one specific institutional change (to, say, one single institutional variable) would have upon the outcome. But this analysis does allow us to discover what alternative institutional packages exist, and to analyze their comparative performance – rather than talking about comparative economic systems based on purely intuitive categories, as is all too common.
# for plotting trees
library(dendextend)
library(ape)
# for visualizing kmeans results
library(factoextra)
library(tidyverse)
# for calculating and plotting correlations between dendrograms
library(corrplot)
I’m using here the Quality of Government standard dataset.
qog <- rio::import("data/qog_std_cs_jan18.dta")
Build a summary of the dataset:
qog_summary <- data.frame(
Variable = names(qog),
Description = Hmisc::label(qog),
t(pastecs::stat.desc(qog))
) %>%
select(Variable, Description, Obs = nbr.val, min, median, mean, max) %>%
mutate_if(is.numeric, round, 1)
qog_summary %>%
DT::datatable(rownames = FALSE, options = list(pageLength = 5),
caption = "Search for things like GDP per capita, equality, corruption, democracy, education, etc.; or for data sources such as wvs_ (World Values Survey), fi_ (Fraser Institute), wdi_ (World Bank's Development Indicators), etc.")
Define two lists of variables, one for institutional factors and another for outcomes of interest, and create a smaller dataset for our purposes:
# pick the list of variables we want to use from the QoG
fraser <- c("fi_ftradeint_pd", "fi_legprop_pd", "fi_reg_pd", "fi_sm_pd", "fi_sog_pd")
outcomes <- c(
# GDP per capita, growth, Gini, life expectancy
"unna_gdppc", "wdi_gdpcapgr", "wdi_gini", "wdi_lifexp",
# fractionalization
"al_ethnic", "al_religion",
# education female and male ages 35-44
"gea_ea3544f", "gea_ea3544m",
# corruption, public sector and political
"vdem_pubcorr", "vdem_corr")
# prepare smaller work dataset containing institutional factors
institutions <- qog %>%
select(cname, ccodealp, fraser, outcomes)
# remove rows with missing data
institutions <- na.omit(institutions)
# use row names to identify the countries
row.names(institutions) <- institutions$cname
qog_summary %>%
filter(Variable %in% c(fraser, outcomes)) %>%
knitr::kable()
Variable | Description | Obs | min | median | mean | max |
---|---|---|---|---|---|---|
al_ethnic | Ethnic Fractionalization | 186 | 0.0 | 0.4 | 0.4 | 0.9 |
al_religion | Religion Fractionalization | 188 | 0.0 | 0.5 | 0.4 | 0.9 |
fi_ftradeint_pd | Freedom to Trade Internationally (panel data) | 158 | 3.0 | 7.2 | 7.1 | 9.5 |
fi_legprop_pd | Legal Structure and Security of Property Rights (panel data) | 158 | 1.6 | 5.0 | 5.2 | 8.8 |
fi_reg_pd | Regulation of Credit, Labor and Business (panel data) | 158 | 3.0 | 7.1 | 7.0 | 9.0 |
fi_sm_pd | Access to Sound Money (chain_linked) | 158 | 3.2 | 8.7 | 8.3 | 9.8 |
fi_sog_pd | Size of Government: Expenditures, Taxes and Enterprises (panel data) | 158 | 3.5 | 6.4 | 6.4 | 9.5 |
gea_ea3544f | Educational Attainment (35-44 years, Female) | 187 | 0.7 | 9.4 | 8.9 | 15.5 |
gea_ea3544m | Educational Attainment (35-44 years, Male) | 187 | 2.0 | 9.9 | 9.8 | 15.2 |
unna_gdppc | GDP per Capita (Current Prices in US$) | 193 | 148.8 | 5496.0 | 15827.6 | 187649.8 |
vdem_corr | Political corruption index | 168 | 0.0 | 0.6 | 0.5 | 1.0 |
vdem_pubcorr | Public sector corruption index | 168 | 0.0 | 0.5 | 0.5 | 1.0 |
wdi_gdpcapgr | GDP per capita growth (annual %) | 187 | -62.2 | 1.9 | 1.8 | 24.6 |
wdi_gini | GINI index (World Bank estimate) | 115 | 24.1 | 36.4 | 37.5 | 63.4 |
wdi_lifexp | Life expectancy at birth, total (years) | 185 | 50.6 | 72.9 | 71.3 | 83.6 |
In general, before doing the cluster analysis, you need to make sure that all your variables are measured in comparable units – otherwise your analysis can be heavily distorted by just one variable that happens to have much larger values than the others.
In the present case, I’m using the Fraser Institute data on economic institutions, and this data is already scaled – all variables lie between 1 and 10. But if you add some other institutional factors you need to consider this possible problem.
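A toy illustration of the distortion (synthetic numbers, not QoG data): when one variable is measured in dollars and another on a 1–10 scale, the dollar variable swamps the Euclidean distances.

```r
# Synthetic example: one variable with large values, one on a 1-10 scale
x <- data.frame(
  gdppc = c(1000, 50000, 52000),  # in dollars
  score = c(9, 1, 9)              # on a 1-10 scale
)
# unscaled: distances are driven almost entirely by gdppc, so cases 2 and 3
# look nearly identical even though their scores differ maximally
round(dist(x), 1)
# after scaling, both variables contribute to the distances
round(dist(scale(x)), 2)
```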
The simplest way to standardize the dataset is to use the `scale()` function:
fraser_scaled <- institutions %>% select(fraser) %>% scale() %>% as.data.frame()
outcomes_scaled <- institutions %>% select(outcomes) %>% scale() %>% as.data.frame()
The cluster analysis identifies how similar two countries are to each other, either based on Euclidean distance or based on correlations.
In our case, the Euclidean distance between, say, the USA and France is:
\[
\begin{align}
d_{\text{USA-FR}}^2 = & \; [\text{(Size of Govt)}_{\text{USA}} - \text{(Size of Govt)}_{\text{FR}}]^2 + \\
& [\text{(Deregulation)}_{\text{USA}} - \text{(Deregulation)}_{\text{FR}}]^2 + \\
& [\text{(Free Trade)}_{\text{USA}} - \text{(Free Trade)}_{\text{FR}}]^2 + \\
& [\text{(Sound Money)}_{\text{USA}} - \text{(Sound Money)}_{\text{FR}}]^2 + \\
& [\text{(Property Rights)}_{\text{USA}} - \text{(Property Rights)}_{\text{FR}}]^2
\end{align}
\]
The “ward.D” method of clustering calculates these distances between all pairs of countries and builds the clusters such that the countries within a cluster are those with the smallest distances between each other. This is usually the best method for identifying clearly distinct categories. See this link for more information about clustering methods.
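As a quick sanity check with made-up scores (not the actual USA and France values), the formula above can be computed by hand and compared with what `dist()` returns:

```r
# Hypothetical scores on the five Fraser dimensions (made-up numbers)
usa <- c(sog = 6.5, reg = 7.8, trade = 7.6, money = 9.7, prop = 7.3)
fra <- c(sog = 4.5, reg = 6.7, trade = 8.0, money = 9.6, prop = 7.5)

# Euclidean distance by hand, following the formula...
sqrt(sum((usa - fra)^2))
# ...agrees with what dist() computes
dist(rbind(usa, fra), method = "euclidean")
```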
Create a hierarchical cluster tree based on Euclidean distance between cases.
hc <- institutions %>%
select(fraser) %>% # choose only fraser variables
dist(method = "euclidean") %>% # calculate distances between pairs of countries
hclust(method="ward.D") # build cluster tree
Plot the cluster tree (known as a “dendrogram”)
plot(hc,
hang = -1, # arranges labels at fixed distance
cex = 0.7, # size of labels
main = ""
)
Plot a circular dendrogram
hc %>%
ape::as.phylo() %>%
plot(type="fan", cex = 0.8)
Based on the plot we can see that we could reasonably separate the countries into 2, 3 or 4 categories. I’m going to choose 3 here.
Split the set into 3 groups and add the group number to the dataset.
institutions <- institutions %>%
mutate(groups_fraser = cutree(hc, 3)) %>%
# recode the group numbers to be in a better order
# (this was discovered by first doing the analysis without recoding)
mutate(groups_fraser = recode(groups_fraser, `1` = 2, `2` = 1, `3` = 3))
List countries in each group
map_df(1:3, function(g){
institutions %>%
filter(groups_fraser == g) %>%
select(cname) %>%
rename_at("cname", funs(paste("Group", g)))
}) %>%
# shift elements in Group 2 and 3 upwards in the table to replace the NAs
mutate(`Group 2` = lead(`Group 2`, length(na.omit(`Group 1`)))) %>%
mutate(`Group 3` = lead(`Group 3`, length(na.omit(`Group 1`)) +
length(na.omit(`Group 2`))
)
) %>%
# eliminate rows that have only NAs
filter(!is.na(`Group 1`) | !is.na(`Group 2`) | !is.na(`Group 3`)) %>%
# change NAs to ""
mutate_all(funs(ifelse(is.na(.), "", .))) %>%
knitr::kable()
Group 1 | Group 2 | Group 3 |
---|---|---|
Algeria | Albania | Austria |
Argentina | Armenia | Belgium |
Bolivia | Bhutan | Canada |
Brazil | Bosnia and Herzegovina | Croatia |
Myanmar | Bulgaria | Czech Republic |
Burundi | Sri Lanka | Denmark |
Cameroon | Chile | Estonia |
Chad | China | Finland |
Congo | Colombia | France (1963-) |
Congo, Democratic Republic | Costa Rica | Germany |
Benin | Cyprus (1975-) | Greece |
Ecuador | Dominican Republic | Hungary |
Guinea | El Salvador | Ireland |
Haiti | Fiji | Israel |
Iran | Georgia | Italy |
Cote d’Ivoire | Ghana | Korea, South |
Madagascar | Guatemala | Netherlands |
Mauritania | Honduras | Norway |
Niger | Iceland | Poland |
Pakistan (1971-) | India | Portugal |
Senegal | Indonesia | Rwanda |
Sierra Leone | Kazakhstan | Slovakia |
Zimbabwe | Kyrgyzstan | Slovenia |
Togo | Laos | Spain |
Egypt | Lebanon | Sweden |
Burkina Faso | Latvia | Switzerland |
Liberia | United Kingdom | |
Lithuania | United States | |
Mauritius | ||
Mexico | ||
Mongolia | ||
Moldova | ||
Nicaragua | ||
Panama | ||
Paraguay | ||
Peru | ||
Philippines | ||
Romania | ||
Russia | ||
Seychelles | ||
South Africa | ||
Tajikistan | ||
Thailand | ||
Turkey | ||
Uganda | ||
Ukraine | ||
Macedonia | ||
Tanzania | ||
Uruguay | ||
Zambia |
institutions %>%
ggplot() +
aes(x = fi_legprop_pd, y = log(unna_gdppc), color = factor(groups_fraser)) +
geom_point() +
labs(x = "Property rights", y = "Log of GDP per capita")
institutions %>%
ggplot() +
aes(x = fi_legprop_pd, y = vdem_pubcorr, color = factor(groups_fraser)) +
geom_point() +
labs(x = "Property rights", y = "Corruption")
institutions %>%
ggplot() +
aes(x = fi_legprop_pd, y = wdi_lifexp, color = factor(groups_fraser)) +
geom_point() +
labs(x = "Property rights", y = "Life expectancy")
institutions %>%
ggplot() +
aes(x = fi_legprop_pd, y = gea_ea3544f, color = factor(groups_fraser)) +
geom_point() +
labs(x = "Property rights", y = "Education, female (35-44)")
institutions %>%
ggplot() +
aes(x = fi_legprop_pd, y = wdi_gdpcapgr, color = factor(groups_fraser)) +
geom_point() +
labs(x = "Property rights", y = "Growth rate")
# calculate means of the outcome variables for each cluster
outcomes_means <- institutions %>%
group_by(groups_fraser) %>%
summarise_at(vars(outcomes), funs(mean)) %>%
gather(key = "variable", value = "mean", -groups_fraser)
# calculate the standard error
std.err <- function(x) { sqrt(var(x, na.rm=TRUE)/length(na.omit(x))) }
# calculate standard errors (uses function `std.err` defined above)
outcomes_se <- institutions %>%
group_by(groups_fraser) %>%
summarise_at(vars(outcomes), funs(std.err)) %>%
gather(key = "variable", value = "se", -groups_fraser)
outcomes_summary <- full_join(outcomes_means, outcomes_se, by = c("groups_fraser", "variable"))
# show results as table
knitr::kable(outcomes_summary, digits=2)
groups_fraser | variable | mean | se |
---|---|---|---|
1 | unna_gdppc | 2617.13 | 661.45 |
2 | unna_gdppc | 7650.93 | 1193.23 |
3 | unna_gdppc | 39377.71 | 4193.23 |
1 | wdi_gdpcapgr | 2.27 | 0.45 |
2 | wdi_gdpcapgr | 2.67 | 0.32 |
3 | wdi_gdpcapgr | 1.84 | 0.35 |
1 | wdi_gini | 40.15 | 1.22 |
2 | wdi_gini | 39.43 | 1.21 |
3 | wdi_gini | 32.38 | 1.03 |
1 | wdi_lifexp | 63.66 | 1.50 |
2 | wdi_lifexp | 72.36 | 0.75 |
3 | wdi_lifexp | 80.18 | 0.65 |
1 | al_ethnic | 0.63 | 0.05 |
2 | al_ethnic | 0.45 | 0.03 |
3 | al_ethnic | 0.24 | 0.04 |
1 | al_religion | 0.41 | 0.05 |
2 | al_religion | 0.43 | 0.03 |
3 | al_religion | 0.43 | 0.04 |
1 | gea_ea3544f | 4.78 | 0.60 |
2 | gea_ea3544f | 10.01 | 0.46 |
3 | gea_ea3544f | 13.31 | 0.42 |
1 | gea_ea3544m | 6.68 | 0.52 |
2 | gea_ea3544m | 10.41 | 0.33 |
3 | gea_ea3544m | 13.16 | 0.37 |
1 | vdem_pubcorr | 0.67 | 0.04 |
2 | vdem_pubcorr | 0.50 | 0.04 |
3 | vdem_pubcorr | 0.10 | 0.02 |
1 | vdem_corr | 0.68 | 0.03 |
2 | vdem_corr | 0.54 | 0.04 |
3 | vdem_corr | 0.14 | 0.02 |
Show the summary statistics as bar plots:
outcomes_summary %>%
ggplot() +
aes(x = groups_fraser, y = mean) +
geom_bar(stat = "identity") +
geom_errorbar(aes(ymin = mean - se, ymax = mean + se), width = 0.3) +
facet_wrap(~variable, scale = "free", ncol = 2)
## calculate hierarchical cluster
## comparison of different clustering methods
hc1 <- fraser_scaled %>% dist(method = "euclidean") %>% hclust(method="single")
hc2 <- fraser_scaled %>% dist(method = "euclidean") %>% hclust(method="complete")
hc3 <- fraser_scaled %>% dist(method = "euclidean") %>% hclust(method="average")
hc4 <- fraser_scaled %>% dist(method = "euclidean") %>% hclust(method="mcquitty")
hc5 <- fraser_scaled %>% dist(method = "euclidean") %>% hclust(method="median")
hc6 <- fraser_scaled %>% dist(method = "euclidean") %>% hclust(method="centroid")
hc7 <- fraser_scaled %>% dist(method = "euclidean") %>% hclust(method="ward.D")
hc8 <- fraser_scaled %>% dist(method = "euclidean") %>% hclust(method="ward.D2")
fraser_dendlist <- dendlist(
as.dendrogram(hc1),
as.dendrogram(hc2),
as.dendrogram(hc3),
as.dendrogram(hc4),
as.dendrogram(hc5),
as.dendrogram(hc6),
as.dendrogram(hc7),
as.dendrogram(hc8)
)
hclust_methods <- c("single", "complete", "average", "mcquitty",
"median", "centroid", "ward.D", "ward.D2")
names(fraser_dendlist) <- hclust_methods
# calculate correlations
fraser_cor <- cor.dendlist(fraser_dendlist)
# plot correlation table
corrplot.mixed(fraser_cor)
hc_o1 <- outcomes_scaled %>% dist(method = "euclidean") %>% hclust(method="single")
hc_o2 <- outcomes_scaled %>% dist(method = "euclidean") %>% hclust(method="complete")
hc_o3 <- outcomes_scaled %>% dist(method = "euclidean") %>% hclust(method="average")
hc_o4 <- outcomes_scaled %>% dist(method = "euclidean") %>% hclust(method="mcquitty")
hc_o5 <- outcomes_scaled %>% dist(method = "euclidean") %>% hclust(method="median")
hc_o6 <- outcomes_scaled %>% dist(method = "euclidean") %>% hclust(method="centroid")
hc_o7 <- outcomes_scaled %>% dist(method = "euclidean") %>% hclust(method="ward.D")
hc_o8 <- outcomes_scaled %>% dist(method = "euclidean") %>% hclust(method="ward.D2")
outcomes_dendlist <- dendlist(
as.dendrogram(hc_o1),
as.dendrogram(hc_o2),
as.dendrogram(hc_o3),
as.dendrogram(hc_o4),
as.dendrogram(hc_o5),
as.dendrogram(hc_o6),
as.dendrogram(hc_o7),
as.dendrogram(hc_o8)
)
hclust_methods <- c("single", "complete", "average", "mcquitty",
"median", "centroid", "ward.D", "ward.D2")
names(outcomes_dendlist) <- hclust_methods
# calculate correlations
outcomes_cor <- cor.dendlist(outcomes_dendlist)
# plot correlation table
corrplot.mixed(outcomes_cor)
# plot cluster
plot(hc_o7, hang = -1, cex = 0.7)  # labels default to the country row names
Which clustering method leads to the highest correlation between the tree based on institutions and the tree based on outcomes?
compar_list <- dendlist(
as.dendrogram(hc1),
as.dendrogram(hc2),
as.dendrogram(hc3),
as.dendrogram(hc4),
as.dendrogram(hc5),
as.dendrogram(hc6),
as.dendrogram(hc7),
as.dendrogram(hc8),
as.dendrogram(hc_o1),
as.dendrogram(hc_o2),
as.dendrogram(hc_o3),
as.dendrogram(hc_o4),
as.dendrogram(hc_o5),
as.dendrogram(hc_o6),
as.dendrogram(hc_o7),
as.dendrogram(hc_o8)
)
names(compar_list) <- c("fraser1", "fraser2","fraser3","fraser4","fraser5",
"fraser6","fraser7","fraser8",
"outcomes1", "outcomes2", "outcomes3", "outcomes4",
"outcomes5", "outcomes6", "outcomes7", "outcomes8")
# correlations plot
corrplot.mixed(cor.dendlist(compar_list))
Another type of analysis we could do: instead of looking at different clustering methods applied to the same set of variables, we can ask which set of institutional variables leads to the highest correlation with the outcomes tree.
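A minimal sketch of that comparison on stand-in data (`mtcars` rather than QoG; the variable sets are arbitrary placeholders): fix the outcomes tree once, then score each candidate explanatory set by its cophenetic correlation with it.

```r
library(dendextend)

# the "outcomes" tree, built once
outcomes_dend <- as.dendrogram(hclust(dist(scale(mtcars[, c("mpg", "qsec")])), method = "ward.D"))

# two candidate "explanatory" variable sets (placeholders)
set_a <- c("wt", "hp")
set_b <- c("gear", "carb")

# score each candidate set: higher = better match with the outcomes tree
sapply(list(a = set_a, b = set_b), function(vars) {
  dend <- as.dendrogram(hclust(dist(scale(mtcars[, vars])), method = "ward.D"))
  cor_cophenetic(dend, outcomes_dend)
})
```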
Visual comparison of trees, between institutions and outcomes:
dend_fraser <- as.dendrogram(hc7)
dend_outcomes <- as.dendrogram(hc_o7)
tanglegram(dend_fraser, dend_outcomes,
sort = TRUE,
common_subtrees_color_lines = FALSE,
highlight_distinct_edges = FALSE,
highlight_branches_lwd = FALSE,
margin_inner = 0.5,
main_left = "Tree based on institutions",
main_right = "Tree based on outcomes")
Check this link for a brief intro to the theory. Unlike hierarchical clustering, k-means clustering requires you to specify the number of desired clusters in advance (k – hence the name “k-means”).
This algorithm works in the following way:

1. Pick k initial cluster centers (e.g. k randomly chosen cases).
2. Assign each case to the nearest center.
3. Recompute each center as the mean of the cases currently assigned to it.
4. Repeat steps 2–3 until the assignments stop changing.
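The loop can be sketched from scratch in a few lines (for illustration only; in practice use `stats::kmeans`, which adds multiple random starts and other safeguards):

```r
# Toy k-means implementation following steps 1-4 above
toy_kmeans <- function(x, k, iters = 100) {
  x <- as.matrix(x)
  centers <- x[sample(nrow(x), k), , drop = FALSE]  # step 1: k random cases as centers
  cluster <- integer(nrow(x))
  for (i in seq_len(iters)) {
    # step 2: assign each case to its nearest center (squared Euclidean distance)
    d2 <- sapply(seq_len(k), function(j) colSums((t(x) - centers[j, ])^2))
    new_cluster <- max.col(-d2)
    if (identical(new_cluster, cluster)) break  # step 4: stop when assignments are stable
    cluster <- new_cluster
    # step 3: recompute each (non-empty) cluster's center as the mean of its cases
    for (j in seq_len(k)) {
      idx <- which(cluster == j)
      if (length(idx) > 0) centers[j, ] <- colMeans(x[idx, , drop = FALSE])
    }
  }
  list(cluster = cluster, centers = centers)
}

set.seed(42)
table(toy_kmeans(scale(mtcars[, c("mpg", "wt")]), k = 3)$cluster)
```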
We do the k-means clustering with the function `kmeans(data, k)`:
km <- institutions %>%
select(fraser) %>%
kmeans(3)
km
## K-means clustering with 3 clusters of sizes 43, 28, 33
##
## Cluster means:
## fi_ftradeint_pd fi_legprop_pd fi_reg_pd fi_sm_pd fi_sog_pd
## 1 7.598851 5.152179 7.162781 8.901143 7.451189
## 2 8.152068 7.132163 7.761195 9.542097 5.244410
## 3 5.829118 3.892006 6.105180 7.136185 6.197818
##
## Clustering vector:
## [1] 1 3 3 2 1 2 1 3 1 3 1 3 3 3 2 3 3 1 3 3 3 3 1 2 1 2 3 2 1 3 1 2 3 2 2
## [36] 1 2 3 2 1 3 1 1 2 2 3 1 3 2 1 2 3 1 1 1 1 1 2 1 1 3 3 1 1 1 1 2 1 3 2
## [71] 3 1 1 1 1 2 2 1 1 2 3 1 3 2 2 1 3 2 2 2 1 3 3 1 1 3 1 3 2 1 2 3 1 1
##
## Within cluster sum of squares by cluster:
## [1] 122.55388 67.24369 151.35111
## (between_SS / total_SS = 58.6 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss"
## [5] "tot.withinss" "betweenss" "size" "iter"
## [9] "ifault"
# Use the clustering vector to label the cases
institutions <- institutions %>% mutate(kmeans = km$cluster)
Some plots:
institutions %>%
ggplot() +
aes(x = fi_legprop_pd, y = gea_ea3544f, color = factor(kmeans)) +
geom_point() +
labs(x = "Property rights", y = "Education, female (35-44)")
institutions %>%
ggplot() +
aes(x = fi_legprop_pd, y = log(unna_gdppc), color = factor(kmeans)) +
geom_point() +
labs(x = "Property rights", y = "Log of GDP per capita")
As you can see, the result of the k-means clustering is very similar to that of hierarchical clustering.
An alternative way to visualize the clusters is to project them onto the first two principal components:
institutions %>%
select(fraser) %>%
fviz_cluster(km, geom = "point", data = .) + ggtitle("k = 3")
How many clusters should we make? See which number maximizes the silhouette:
institutions %>%
select(fraser) %>%
fviz_nbclust(kmeans, method = "silhouette")
The silhouette is a measure that takes into account both how close the cases within a cluster are to each other and how distant the different clusters are from one another.
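To make this concrete, here is a sketch of computing the average silhouette width for one specific clustering, using `cluster::silhouette` on stand-in data (`mtcars` rather than the fraser variables):

```r
library(cluster)  # provides silhouette()

x <- scale(mtcars[, c("mpg", "wt", "hp")])
set.seed(1)
km2 <- kmeans(x, 2)

sil <- silhouette(km2$cluster, dist(x))
# per-case widths are in sil[, "sil_width"]; their mean is the quantity
# that fviz_nbclust plots for each candidate k
mean(sil[, "sil_width"])
```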
The kmeans with 2 clusters:
institutions %>%
select(fraser) %>%
kmeans(2) %>%
fviz_cluster(geom = "point",
data = institutions %>% select(fraser)) + ggtitle("k = 2")
institutions %>%
select(fraser) %>%
kmeans(8) %>%
fviz_cluster(geom = "point",
data = institutions %>% select(fraser)) + ggtitle("k = 8")