A Data Science Walkthrough Using Global Happiness Data

Kevin Xu

12/20/2020

1. Introduction

The World Happiness Report is an annual publication by the United Nations that scores and ranks the national happiness of reporting countries based on various quality of life factors, including perceived social support and life expectancy. we will attempt to analyze this data in order to determine what characteristics have the greatest impact on happiness, as well as if there are any global or regional trends in measured happiness or other features over time through exploratory data analysis. Additionally, we will attempt to model scores based on relevant features through regression.

2. Data Collection

The dataset was obtained from Kaggle and contains the 2015-2019 World Happiness Reports. The 2020 dataset was independently scraped from Wikipedia. These feature data on GDP, Social Support, Life Expectancy, Freedom to make Life Choices, Generosity, and Perceptions of Corruption, as well as a holistic Happiness Score. Certain aspects of the dataset had to be cleaned, which was more easily done by directly modifying the .csv file. For example, some of the country names had inconsistencies and were corrected to "Hong Kong, Trinidad and Tobago, Taiwan". We can read the data in to get an initial look at it.

In [1]:
import pandas as pd
from IPython.display import display, HTML
import warnings
warnings.filterwarnings('ignore')

#We'll read in the data from 6 different csv files into a list for now
data = []
for i in range(2015,2021):
    input = pd.read_csv("{}.csv".format(i),index_col=False)
    data.append(input)
for i in range(len(data)):
    print("{} Data:".format(2015+i))
    display(data[i])
2015 Data:
Country Region Happiness Rank Happiness Score Standard Error Economy (GDP per Capita) Family Health (Life Expectancy) Freedom Trust (Government Corruption) Generosity Dystopia Residual
0 Switzerland Western Europe 1 7.587 0.03411 1.39651 1.34951 0.94143 0.66557 0.41978 0.29678 2.51738
1 Iceland Western Europe 2 7.561 0.04884 1.30232 1.40223 0.94784 0.62877 0.14145 0.43630 2.70201
2 Denmark Western Europe 3 7.527 0.03328 1.32548 1.36058 0.87464 0.64938 0.48357 0.34139 2.49204
3 Norway Western Europe 4 7.522 0.03880 1.45900 1.33095 0.88521 0.66973 0.36503 0.34699 2.46531
4 Canada North America 5 7.427 0.03553 1.32629 1.32261 0.90563 0.63297 0.32957 0.45811 2.45176
... ... ... ... ... ... ... ... ... ... ... ... ...
153 Rwanda Sub-Saharan Africa 154 3.465 0.03464 0.22208 0.77370 0.42864 0.59201 0.55191 0.22628 0.67042
154 Benin Sub-Saharan Africa 155 3.340 0.03656 0.28665 0.35386 0.31910 0.48450 0.08010 0.18260 1.63328
155 Syria Middle East and Northern Africa 156 3.006 0.05015 0.66320 0.47489 0.72193 0.15684 0.18906 0.47179 0.32858
156 Burundi Sub-Saharan Africa 157 2.905 0.08658 0.01530 0.41587 0.22396 0.11850 0.10062 0.19727 1.83302
157 Togo Sub-Saharan Africa 158 2.839 0.06727 0.20868 0.13995 0.28443 0.36453 0.10731 0.16681 1.56726

158 rows × 12 columns

2016 Data:
Country Region Happiness Rank Happiness Score Lower Confidence Interval Upper Confidence Interval Economy (GDP per Capita) Family Health (Life Expectancy) Freedom Trust (Government Corruption) Generosity Dystopia Residual
0 Denmark Western Europe 1 7.526 7.460 7.592 1.44178 1.16374 0.79504 0.57941 0.44453 0.36171 2.73939
1 Switzerland Western Europe 2 7.509 7.428 7.590 1.52733 1.14524 0.86303 0.58557 0.41203 0.28083 2.69463
2 Iceland Western Europe 3 7.501 7.333 7.669 1.42666 1.18326 0.86733 0.56624 0.14975 0.47678 2.83137
3 Norway Western Europe 4 7.498 7.421 7.575 1.57744 1.12690 0.79579 0.59609 0.35776 0.37895 2.66465
4 Finland Western Europe 5 7.413 7.351 7.475 1.40598 1.13464 0.81091 0.57104 0.41004 0.25492 2.82596
... ... ... ... ... ... ... ... ... ... ... ... ... ...
152 Benin Sub-Saharan Africa 153 3.484 3.404 3.564 0.39499 0.10419 0.21028 0.39747 0.06681 0.20180 2.10812
153 Afghanistan Southern Asia 154 3.360 3.288 3.432 0.38227 0.11037 0.17344 0.16430 0.07112 0.31268 2.14558
154 Togo Sub-Saharan Africa 155 3.303 3.192 3.414 0.28123 0.00000 0.24811 0.34678 0.11587 0.17517 2.13540
155 Syria Middle East and Northern Africa 156 3.069 2.936 3.202 0.74719 0.14866 0.62994 0.06912 0.17233 0.48397 0.81789
156 Burundi Sub-Saharan Africa 157 2.905 2.732 3.078 0.06831 0.23442 0.15747 0.04320 0.09419 0.20290 2.10404

157 rows × 13 columns

2017 Data:
Country Happiness.Rank Happiness.Score Whisker.high Whisker.low Economy..GDP.per.Capita. Family Health..Life.Expectancy. Freedom Generosity Trust..Government.Corruption. Dystopia.Residual
0 Norway 1 7.537 7.594445 7.479556 1.616463 1.533524 0.796667 0.635423 0.362012 0.315964 2.277027
1 Denmark 2 7.522 7.581728 7.462272 1.482383 1.551122 0.792566 0.626007 0.355280 0.400770 2.313707
2 Iceland 3 7.504 7.622030 7.385970 1.480633 1.610574 0.833552 0.627163 0.475540 0.153527 2.322715
3 Switzerland 4 7.494 7.561772 7.426227 1.564980 1.516912 0.858131 0.620071 0.290549 0.367007 2.276716
4 Finland 5 7.469 7.527542 7.410458 1.443572 1.540247 0.809158 0.617951 0.245483 0.382612 2.430182
... ... ... ... ... ... ... ... ... ... ... ... ...
150 Rwanda 151 3.471 3.543030 3.398970 0.368746 0.945707 0.326425 0.581844 0.252756 0.455220 0.540061
151 Syria 152 3.462 3.663669 3.260331 0.777153 0.396103 0.500533 0.081539 0.493664 0.151347 1.061574
152 Tanzania 153 3.349 3.461430 3.236570 0.511136 1.041990 0.364509 0.390018 0.354256 0.066035 0.621130
153 Burundi 154 2.905 3.074690 2.735310 0.091623 0.629794 0.151611 0.059901 0.204435 0.084148 1.683024
154 Central African Republic 155 2.693 2.864884 2.521116 0.000000 0.000000 0.018773 0.270842 0.280876 0.056565 2.066005

155 rows × 12 columns

2018 Data:
Overall rank Country or region Score GDP per capita Social support Healthy life expectancy Freedom to make life choices Generosity Perceptions of corruption
0 1 Finland 7.632 1.305 1.592 0.874 0.681 0.202 0.393
1 2 Norway 7.594 1.456 1.582 0.861 0.686 0.286 0.340
2 3 Denmark 7.555 1.351 1.590 0.868 0.683 0.284 0.408
3 4 Iceland 7.495 1.343 1.644 0.914 0.677 0.353 0.138
4 5 Switzerland 7.487 1.420 1.549 0.927 0.660 0.256 0.357
... ... ... ... ... ... ... ... ... ...
151 152 Yemen 3.355 0.442 1.073 0.343 0.244 0.083 0.064
152 153 Tanzania 3.303 0.455 0.991 0.381 0.481 0.270 0.097
153 154 South Sudan 3.254 0.337 0.608 0.177 0.112 0.224 0.106
154 155 Central African Republic 3.083 0.024 0.000 0.010 0.305 0.218 0.038
155 156 Burundi 2.905 0.091 0.627 0.145 0.065 0.149 0.076

156 rows × 9 columns

2019 Data:
Overall rank Country or region Score GDP per capita Social support Healthy life expectancy Freedom to make life choices Generosity Perceptions of corruption
0 1 Finland 7.769 1.340 1.587 0.986 0.596 0.153 0.393
1 2 Denmark 7.600 1.383 1.573 0.996 0.592 0.252 0.410
2 3 Norway 7.554 1.488 1.582 1.028 0.603 0.271 0.341
3 4 Iceland 7.494 1.380 1.624 1.026 0.591 0.354 0.118
4 5 Netherlands 7.488 1.396 1.522 0.999 0.557 0.322 0.298
... ... ... ... ... ... ... ... ... ...
151 152 Rwanda 3.334 0.359 0.711 0.614 0.555 0.217 0.411
152 153 Tanzania 3.231 0.476 0.885 0.499 0.417 0.276 0.147
153 154 Afghanistan 3.203 0.350 0.517 0.361 0.000 0.158 0.025
154 155 Central African Republic 3.083 0.026 0.000 0.105 0.225 0.235 0.035
155 156 South Sudan 2.853 0.306 0.575 0.295 0.010 0.202 0.091

156 rows × 9 columns

2020 Data:
Country name Regional indicator Ladder score Standard error of ladder score upperwhisker lowerwhisker GDP per capita Social support Healthy life expectancy Freedom to make life choices Generosity Perceptions of corruption Ladder score in Dystopia Explained by: Log GDP per capita Explained by: Social support Explained by: Healthy life expectancy Explained by: Freedom to make life choices Explained by: Generosity Explained by: Perceptions of corruption Dystopia + residual
0 Finland Western Europe 7.8087 0.031156 7.869766 7.747634 1.285 1.500 0.961 0.662 0.160 0.478 1.972317 1.285190 1.499526 0.961271 0.662317 0.159670 0.477857 2.762835
1 Denmark Western Europe 7.6456 0.033492 7.711245 7.579955 1.327 1.503 0.979 0.665 0.243 0.495 1.972317 1.326949 1.503449 0.979333 0.665040 0.242793 0.495260 2.432741
2 Switzerland Western Europe 7.5599 0.035014 7.628528 7.491272 1.391 1.472 1.041 0.629 0.269 0.408 1.972317 1.390774 1.472403 1.040533 0.628954 0.269056 0.407946 2.350267
3 Iceland Western Europe 7.5045 0.059616 7.621347 7.387653 1.327 1.548 1.001 0.662 0.362 0.145 1.972317 1.326502 1.547567 1.000843 0.661981 0.362330 0.144541 2.460688
4 Norway Western Europe 7.4880 0.034837 7.556281 7.419719 1.424 1.495 1.008 0.670 0.288 0.434 1.972317 1.424207 1.495173 1.008072 0.670201 0.287985 0.434101 2.168266
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
148 Central African Republic Sub-Saharan Africa 3.4759 0.115183 3.701658 3.250141 0.041 0.000 0.000 0.293 0.254 0.028 1.972317 0.041072 0.000000 0.000000 0.292814 0.253513 0.028265 2.860198
149 Rwanda Sub-Saharan Africa 3.3123 0.052425 3.415053 3.209547 0.343 0.523 0.572 0.604 0.236 0.486 1.972317 0.343243 0.522876 0.572383 0.604088 0.235705 0.485542 0.548445
150 Zimbabwe Sub-Saharan Africa 3.2992 0.058674 3.414202 3.184198 0.426 1.048 0.375 0.377 0.151 0.081 1.972317 0.425564 1.047835 0.375038 0.377405 0.151349 0.080929 0.841031
151 South Sudan Sub-Saharan Africa 2.8166 0.107610 3.027516 2.605684 0.289 0.553 0.209 0.066 0.210 0.111 1.972317 0.289083 0.553279 0.208809 0.065609 0.209935 0.111157 1.378751
152 Afghanistan South Asia 2.5669 0.031311 2.628270 2.505530 0.301 0.356 0.266 0.000 0.135 0.001 1.972317 0.300706 0.356434 0.266052 0.000000 0.135235 0.001226 1.507236

153 rows × 20 columns

It looks like there's a pretty wide range of formats for the data, although most of the important columns are retained albeit in different names. The simplest way to unify this is to drop extraneous columns and effectively take the intersection of the different data sets.

3. Data Processing

Before analysis, the data needs to be cleaned. The countries are inconsistent across years, with some countries only reporting data for one year. In addition, the data across years are in varying formats, with some years missing region data or other columns. This makes it difficult to preserve region data, since we don't have an overarching set that can be used to assign regions to years lacking the column. The standard deviation and whisker columns have to be dropped, since we don't have equivalent values for each year.

3.1 Endpoints

First, we'll set aside the 2015 and 2020 data to examine starting and final trends. As a side effect, we can also use the region data that exists in these sets to compare regions over time.

In [2]:
data2015 = data[0].copy()
data2020 = data[5].copy()

3.2 Cleaning Data

The labels and columns for each year vary and need to be consolidated before we can concatenate the data together. In addition, we need to add an additional column with the year for the entry.

3.2.1 2015

We can drop the regions, standard error, and dystopia residual data. We also correct the labels of the columns to be consistent with what we'll be using later on. Finally, we add the year values.

In [3]:
data[0].drop(data[0].columns[[1,4, 11]], axis=1, inplace=True)
data[0].columns=['Country','Rank','Score','GDP','Social_Support','Life_Expectancy','Freedom','Generosity','Corruption']
data[0]['Year']=[2015]*len(data[0])

3.2.2 2016

We can drop the regions, upper/lower confidence intervals, and dystopia residual data. Again, we correct the columns labels and add the years.

In [4]:
data[1].drop(data[1].columns[[1,4,5,12]],axis = 1, inplace=True)
data[1].columns = ['Country','Rank','Score','GDP','Social_Support','Life_Expectancy','Freedom','Corruption','Generosity']
data[1]['Year']=[2016]*len(data[1])

3.2.3 2017

We can drop the high/low whiskers and dystopia residual data. We correct the columns labels and add the years.

In [5]:
data[2].drop(data[2].columns[[3,4,11]], axis=1, inplace=True)
data[2].columns = ['Country','Rank','Score','GDP','Social_Support','Life_Expectancy','Freedom','Generosity','Corruption']
data[2]['Year']=[2017]*len(data[2])

3.2.4 2018

There aren't any extraneous columns here, but there is a missing value: United Arab Emirates doesn't have a value for Perceptions of Corruption. For now, we'll drop it.

In [6]:
data[3].columns = ['Rank','Country','Score','GDP','Social_Support','Life_Expectancy','Freedom','Generosity','Corruption']
data[3].dropna(inplace=True)
data[3]['Year']=[2018]*len(data[3])

3.2.5 2019

This one is easy—we just standardize the column labels and add year data

In [7]:
data[4].columns = ['Rank','Country','Score','GDP','Social_Support','Life_Expectancy','Freedom','Generosity','Corruption']
data[4]['Year']=[2019]*len(data[4])

3.2.6 2020

What a mess! We can drop all the columns with proportion explained by features, as well as the standard error and whiskers. We change the column labels and add years.

In [8]:
data[5].drop(data[5].columns[[1,3,4,5,12,13,14,15,16,17,18,19]], axis=1, inplace=True)
data[5].columns = ['Country','Score','GDP','Social_Support','Life_Expectancy','Freedom','Generosity','Corruption']
data[5]['Rank']=[i for i in range(len(data[5]))]
data[5]['Year']=[2020]*len(data[5])

3.2.7 Merging

Now that all the columns are identical, we can concatenate all of the data into a single variable. With our data tidy and condensed, this will make it easier for us to perform analysis in the future.

In [9]:
merged_data = pd.concat([i for i in data])
display(merged_data)
Country Rank Score GDP Social_Support Life_Expectancy Freedom Generosity Corruption Year
0 Switzerland 1 7.5870 1.39651 1.34951 0.94143 0.66557 0.41978 0.29678 2015
1 Iceland 2 7.5610 1.30232 1.40223 0.94784 0.62877 0.14145 0.43630 2015
2 Denmark 3 7.5270 1.32548 1.36058 0.87464 0.64938 0.48357 0.34139 2015
3 Norway 4 7.5220 1.45900 1.33095 0.88521 0.66973 0.36503 0.34699 2015
4 Canada 5 7.4270 1.32629 1.32261 0.90563 0.63297 0.32957 0.45811 2015
... ... ... ... ... ... ... ... ... ... ...
148 Central African Republic 148 3.4759 0.04100 0.00000 0.00000 0.29300 0.25400 0.02800 2020
149 Rwanda 149 3.3123 0.34300 0.52300 0.57200 0.60400 0.23600 0.48600 2020
150 Zimbabwe 150 3.2992 0.42600 1.04800 0.37500 0.37700 0.15100 0.08100 2020
151 South Sudan 151 2.8166 0.28900 0.55300 0.20900 0.06600 0.21000 0.11100 2020
152 Afghanistan 152 2.5669 0.30100 0.35600 0.26600 0.00000 0.13500 0.00100 2020

934 rows × 10 columns

4. Exploratory Data Analysis

4.1 Features over Time

Let's take a look at trends over time. We can group the total data by both country and year and use a violin plot in order to see the yearly distribution of country values. To do this, we can use the seaborn python library. Ideally, we would expect to see score and all other features increase over time, with the sole exception of corruption which should decrease over time.

In [10]:
import matplotlib.pyplot as plt
import seaborn as sns

#We want to pivot the data to get country values on a yearly basis
scores = merged_data.groupby(['Country','Year'],as_index=False)['Score'].sum()
scores = scores.pivot(index='Year',columns='Country',values='Score')

def axis():
    chart1, ax1 = plt.subplots()
    return ax1
display(scores)
Country Afghanistan Albania Algeria Angola Argentina Armenia Australia Austria Azerbaijan Bahrain ... United Arab Emirates United Kingdom United States Uruguay Uzbekistan Venezuela Vietnam Yemen Zambia Zimbabwe
Year
2015 3.5750 4.9590 5.6050 4.033 6.5740 4.3500 7.2840 7.2000 5.2120 5.9600 ... 6.9010 6.8670 7.1190 6.4850 6.0030 6.8100 5.3600 4.0770 5.1290 4.6100
2016 3.3600 4.6550 6.3550 3.866 6.6500 4.3600 7.3130 7.1190 5.2910 6.2180 ... 6.5730 6.7250 7.1040 6.5450 5.9870 6.0840 5.0610 3.7240 4.7950 4.1930
2017 3.7940 4.6440 5.8720 3.795 6.5990 4.3760 7.2840 7.0060 5.2340 6.0870 ... 6.6480 6.7140 6.9930 6.4540 5.9710 5.2500 5.0740 3.5930 4.5140 3.8750
2018 3.6320 4.5860 5.2950 3.795 6.3880 4.3210 7.2720 7.1390 5.2010 6.1050 ... NaN 7.1900 6.8860 6.3790 6.0960 4.8060 5.1030 3.3550 4.3770 3.6920
2019 3.2030 4.7190 5.2110 NaN 6.0860 4.5590 7.2280 7.2460 5.2080 6.1990 ... 6.8250 7.0540 6.8920 6.2930 6.1740 4.7070 5.1750 3.3800 4.1070 3.6630
2020 2.5669 4.8827 5.0051 NaN 5.9747 4.6768 7.2228 7.2942 5.1648 6.2273 ... 6.7908 7.1645 6.9396 6.4401 6.2576 5.0532 5.3535 3.5274 3.7594 3.2992

6 rows × 167 columns

4.1.1 Score over Time

That looks good, although ideally the change would be more dramatic. The score generally appears to increase over time, with the average rising very slightly. More interestingly, the distribution of scores shifts over time, from having a higher probability of values being below the mean for 2015, to being evenly distributed around the center and slightly favoring higher values in 2020. This is good news—even if news and media seems to be getting more negative recently, global happiness is trending upwards.

In [11]:
ax = sns.violinplot(data=merged_data,x='Year',y='Score',ax=axis()).set_title('Score')

4.1.2 GDP over Time

Global GDP remains generally constant, although the values are getting more centralized as time progresses. A closer look at economic trends would be necessary for further analysis, but for now it is sufficient to say that GDP is more or less stable.

In [12]:
ax = sns.violinplot(data=merged_data,x='Year',y='GDP',ax=axis()).set_title('GDP')

4.1.3 Social Support over Time

There is a large amount of variance in social support, suggesting that regions or individual countries differ greatly. Based on the distribution, it would appear that there are a few countries with very low social support compared to the average. In general, social support has increased over time, although it has remained stable in the last 4 years.

In [13]:
ax = sns.violinplot(data=merged_data,x='Year',y='Social_Support',ax=axis()).set_title('Social Support')

4.1.4 Life Expectancy over Time

Life expectancy has trended upwards over time. This matches expectations that technological development and globalization will lead to improved health care and increased life expectancy.

In [14]:
ax = sns.violinplot(data=merged_data,x='Year',y='Life_Expectancy',ax=axis()).set_title('Life Expectancy')

4.1.5 Freedom over Time

Freedom to make life choices seems to have slightly increased, although in general this can be considered to remain stable. These values seem to fluctuate the most of any feature on a year-to-year basis, which is likely explained by its inherent qualitativeness. Individual people in a country likely have different perceptions on freedom, and even different baselines on what freedom is to be expected.

In [15]:
ax = sns.violinplot(data=merged_data,x='Year',y='Freedom',ax=axis()).set_title('Freedom')

4.1.6 Generosity over Time

Generosity has increased in general, although outliers have become less generous over time. This is the most consistent of the features globally, with the majority of the globe falling into a distribution range that is roughly 0.2 units wide. It is possible that countries look to each other for an expectation of generosity, resulting in this uniformity.

In [16]:
ax = sns.violinplot(data=merged_data,x='Year',y='Generosity',ax=axis()).set_title('Generosity')

4.1.7 Corruption over Time

Perceived corruption has remained generally stable in the past 5 years, with unusually high values for 2015. It seems that this is low for the majority of the globe, with the exception of a few outlier countries that experience high corruption.

In [17]:
ax = sns.violinplot(data=merged_data,x='Year',y='Corruption',ax=axis()).set_title('Corruption')

4.2 Correlation between Features

Next, we can look at the amount of correlation between features in order to see how related they are to one another. Specifically, we are most interested which features have the highest influence on happiness score.

In [18]:
display(merged_data.corr())
Rank Score GDP Social_Support Life_Expectancy Freedom Generosity Corruption Year
Rank 1.000000 -0.990941 -0.790341 -0.660676 -0.745432 -0.544824 -0.142883 -0.307119 -0.019457
Score -0.990941 1.000000 0.784859 0.666777 0.745392 0.558018 0.161054 0.331671 0.024733
GDP -0.790341 0.784859 1.000000 0.611600 0.785172 0.347417 0.051048 0.208126 -0.015481
Social_Support -0.660676 0.666777 0.611600 1.000000 0.601238 0.433763 0.009590 0.070000 0.317141
Life_Expectancy -0.745432 0.745392 0.785172 0.601238 1.000000 0.367328 0.008071 0.227170 0.166751
Freedom -0.544824 0.558018 0.347417 0.433763 0.367328 1.000000 0.277289 0.406208 0.090646
Generosity -0.142883 0.161054 0.051048 0.009590 0.008071 0.277289 1.000000 0.200695 -0.002304
Corruption -0.307119 0.331671 0.208126 0.070000 0.227170 0.406208 0.200695 1.000000 -0.265104
Year -0.019457 0.024733 -0.015481 0.317141 0.166751 0.090646 -0.002304 -0.265104 1.000000

We can visualize this correlation data through a correlation matrix, which will make it easier for us to compare values at a glance.

In [19]:
f = plt.figure(figsize=(12, 8))
ax=f.add_subplot()

f.colorbar(ax.matshow(merged_data.corr()))
ax.tick_params(labelsize=14, rotation = 45)
ax.set_xticklabels(['']+merged_data.columns)
ax.set_yticklabels(['']+merged_data.columns)

f.show()

From this, we can see that happiness score correlates most strongly with GDP, life expectancy, social support, and freedom in that order. We know from our previous visualization that GDP, social support, and freedom have all remained generally stable, while life expectancy has increased. This supports the observation that global happiness has increased slightly over time.

Interestingly, these four features also have the highest correlation with each other, while freedom, generosity, and corruption also have the highest correlation with each other. This seems to suggest independent relationships between these two groups. In other words, while it is unlikely to have a country with high GDP but low life expectancy, it is possible to have a country with high GDP and low freedom, which makes sense.

These global trends are interesting, but it is likely that they will vary across different regions. We can look more closely to see if there are hidden regional trends that are otherwise obfuscated. The data distribution is too thin to effectively visualize with a violinplot, so let's use a boxplot instead.

4.3.1 Score by Region

Using the 2015 and 2020 data, we can compare how happiness scores have trended by region. There are some descrepancies in region labeling, so we can fix this by relabeling the region values preferring the 2015 values.

In [20]:
#Adjust the column labels for consistency
data2020=data2020.rename(columns={'Country name':'Country','Regional indicator':'Region','Ladder score':'Happiness Score'})

#create a union of region values from 2015 and 2020, with a preference for 2015 if available
regions= data2015[['Country','Region']].merge(data2020[['Country','Region']],on='Country',how='outer')
regions['Region']=regions.apply(lambda x:x[2] if pd.isnull(x[1]) else x[1],axis=1)

#Reassign the new region values in another column
data2015['Fixed Region']=data2015.apply(lambda x: regions.loc[regions['Country']==x[0]]['Region'].iloc[0],axis=1)
data2020['Fixed Region']=data2020.apply(lambda x: regions.loc[regions['Country']==x[0]]['Region'].iloc[0],axis=1)

data2015.boxplot(column=['Happiness Score'],by='Fixed Region',figsize=(20,5))
plt.xticks(rotation=45)
plt.show()
data2020.boxplot(column=['Happiness Score'],by='Fixed Region',figsize=(20,5))
plt.xticks(rotation=45)
plt.show()

Based on this, we can see that Central/Eastern Europe, Southern Asia, Sub-Saharan Africa, and Western Europe all increased in happiness over time. The Middle East and Northern Africa as well as North America both decreased. East Asia, Latin America, and Southeast Asia all decreased in spread.

4.3.2 Features by Region

Taking the 2020 data, we can compare different features by region for trends. For brevity, we will only examine the two features with the greatest correlation with score: GDP and life expectancy.

4.3.2.1 GDP by Region

The trend is generally similar to the happiness data shown above, although there is less spread overall in each region. The Middle East and North Africa has a higher GDP per capita than their score would suggest.

In [21]:
data2020.boxplot(column=['GDP per capita'],by='Fixed Region',figsize=(20,5))
plt.xticks(rotation=45)
plt.show()

4.3.2.2 Life Expectancy by Region

Again, the trend is similar to score but with less spread in each region. Additionally, South Asia and East Asia have higher life expectancies than their scores suggest.

In [22]:
data2020.boxplot(column=['Healthy life expectancy'],by='Fixed Region',figsize=(20,5))
plt.xticks(rotation=45)
plt.show()

5. Hypothesis Testing

Based on the correlation matrix, we hypothesize that there is a relationship between our happiness score and the other features. Let's see if we can use a model to fit and predict the score values over time. We will only use the four variables with the most correlation with scores, along with the year values. The data itself is also randomized.

In [29]:
variables = ['GDP','Social_Support','Life_Expectancy','Freedom','Year']
data_random = merged_data.sample(frac=1).reset_index(drop=True)

5.1 Linear Regression

We first try a linear model as a predictor and evaluate the score using 10-fold cross validation. It's not bad, with the R-squared value averaging around 0.75 with a standard deviation around 0.04.

In [24]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold
import statistics as stats

kf = KFold(n_splits=10)

def kfold(data_random,model,variables,predict):
    scores=[]
    #Generate indices for cross valiation
    for train_index, test_index in kf.split(data_random):
        #Generate training and test data based on indices
        X_train, X_test = data_random.iloc[train_index][variables], data_random.iloc[test_index][variables]
        y_train, y_test = data_random.iloc[train_index][predict],data_random.iloc[test_index][predict]

        #fit the model and return the score as a list
        model.fit(X_train,y_train)
        score = model.score(X_test,y_test)
        scores.append(score)
    return scores

model = LinearRegression()

results = kfold(data_random,model,variables,'Score')

print("Average Score:",stats.mean(results))
print("Standard Deviation:",stats.stdev(results))
Average Score: 0.7504279894262048
Standard Deviation: 0.04509034341439948

5.2 Random Forest

Can we do better than this? Next we'll try a random forest with the default 100 decision trees. This does better, with an average R-squared of 0.8 and standard deviation of around 0.03.

In [30]:
from sklearn.ensemble import RandomForestRegressor

forest = RandomForestRegressor()

forest_scores = kfold(data_random,forest,variables,'Score')
print("Average Score:",stats.mean(forest_scores))
print("Standard Deviation:",stats.stdev(forest_scores))
Average Score: 0.8047514386998431
Standard Deviation: 0.02320541255317844

5.3 Regression on Regional Data

This performance is alright, but it is likely that there are confounding factors across regions that muddle an overall trend. Let's select a single region to see if the prediction is any better. Arbitrarily, we'll pick the Middle East and Northern Africa.

In [26]:
region = data2015.groupby('Region').get_group('Middle East and Northern Africa')['Country']
region_data = pd.concat([i.loc[i['Country'].isin(region)] for i in data])
display(region_data)
Country Rank Score GDP Social_Support Life_Expectancy Freedom Generosity Corruption Year
10 Israel 11 7.2780 1.22857 1.22393 0.91387 0.41319 0.07785 0.33172 2015
19 United Arab Emirates 20 6.9010 1.42727 1.12575 0.80925 0.64157 0.38583 0.26428 2015
21 Oman 22 6.8530 1.36011 1.08182 0.76276 0.63274 0.32524 0.21542 2015
27 Qatar 28 6.6110 1.69042 1.07860 0.79733 0.64040 0.52208 0.32573 2015
34 Saudi Arabia 35 6.4110 1.39541 1.08393 0.72025 0.31048 0.32524 0.13706 2015
... ... ... ... ... ... ... ... ... ... ...
118 Jordan 118 4.6334 0.78500 1.14000 0.77800 0.42500 0.09100 0.15200 2020
124 Palestinian Territories 124 4.5528 0.58800 1.19500 0.61400 0.29900 0.09200 0.07200 2020
127 Tunisia 127 4.3922 0.87500 0.87200 0.78100 0.23600 0.05600 0.04400 2020
137 Egypt 137 4.1514 0.87500 0.98300 0.59700 0.37400 0.06900 0.09500 2020
145 Yemen 145 3.5274 0.39300 1.17700 0.41500 0.24400 0.09500 0.08700 2020

112 rows × 10 columns

Doing a random forest regression like before, the score remains the same at around 0.8, although the standard deviation increases to around 0.12. Interestingly, this suggests that the model fits better on the global data rather than specific regions.

In [31]:
region_sample = region_data.sample(frac=1).reset_index(drop=True)

region_forest = kfold(region_sample,forest,variables,'Score')
print("Average Score:",stats.mean(region_forest))
print("Standard Deviation:",stats.stdev(region_forest))
Average Score: 0.7999372284404023
Standard Deviation: 0.1270024460597492

5.4 Score as a Predictor

Out of curiousity, how effective is happiness score as a predictor for other features? We know the correlation, but we can also use our best-performing random forest model on global data to examine the bidirectionality of the relationships.

In [28]:
for value in ['GDP','Social_Support','Life_Expectancy','Freedom','Generosity','Corruption']:
    forest_scores = kfold(data_random,forest,['Score'],value)
    print("Average Score for {}:".format(value),stats.mean(forest_scores))
Average Score for GDP: 0.48299418191953636
Average Score for Social_Support: 0.21953873761560014
Average Score for Life_Expectancy: 0.37924719605974183
Average Score for Freedom: 0.04318351701837818
Average Score for Generosity: -0.32936916435024
Average Score for Corruption: -0.005292859575628461

It turns out, happiness is a pretty bad predictor for its own contributing features. This makes sense; it's inherently a loss of dimensionality due to integrating a bunch of different features into one value.

6. Conclusion

We analysed the UN's World Happiness Report data from 2015 through 2020. We adjusted and reformatted columns to be consistent across years and visualized the relationships between happiness score and various factors. We found that score has a weakly positive correlation with time, suggesting that global happiness does tend to increase over time. It most strongly relates to wealth, health, social support, and freedom, which was confirmed by a correlation matrix. In general, happiness is highest in the Americas, Western Europe, and Australia, which are commonly perceived as western countries. Happiness is lowest in Africa and Asia. Finally, we used a linear regression and a random forest to fit the data with respect to GDP, life expectancy, social support, freedom, and time. The random forest performed the best, and was more accurate on global data than regional data.

In general, the relatively short timeframe of the report makes it difficult to determine any trends. Furthermore, it is likely that there is a great deal of variance in how values are reported across countries, due to differences in culture or perception. This degree of inconsistency makes it difficult to compare values between countries and in general causes reported scores to not deviate much from year to year. Ultimately, since these values are largely qualitative and subjective, it is difficult to use them to quantitatively predict the future. Instead, the study should be used to holistically assess the performance of different countries, ideally across a timespan of several centuries.

Of greater interest are the features that had the greatest correlation with a high level of happiness. While wealth and health were both expected to be influential, it surprised me that social support was so high and that perceived corruption was so low. This makes sense, however, since humans are innately social beings, and the average person is routinely influenced more by their immediate community than corruption in government or bureaucracy.

One final thing of note is that the data for 2020 was collected in March, prior to the brunt of the impact from the COVID-19 pandemic. It is likely that corresponding scores for 2021 would be much lower than trends expect as a result. While overall this would be expected to be a temporary deviation from the growth curve, it is worth using as an example that these models are only useful over the long-term, both in terms of collecting useful data as well as predicting future trends.

7. Resources

Kaggle data: https://www.kaggle.com/londeen/world-happiness-report-2020

World Happiness Report site: https://worldhappiness.report/

Another Study on World Happiness Data: https://rpubs.com/LeonaAnn/645318