WDI Helper Functions
Contents
WDI Helper Functions#
Helper functions to download and import the most recent World Development Indicators dataset and prepare it for analysis.
Usage#
Important
These functions rely on standard Linux utilities like rm
, wget
and unzip
, rather than pure Python.
To use these functions in your own notebook, download this notebook and source it with %run
.
Example usage:
%run WDI.ipynb
data = get_wdi()
data = wdi_remove_groups(data)
df = wdi_pivot(data, year='2018')
Your df
is now a dataset with all the country data for 2018, with each country in its own row and each indicator in its own column.
Setup#
import pandas as pd
import csv
Download the dataset.#
def get_wdi():
"""Download and return the WDI dataset."""
## Download and extract using wget, unzip, and bash.
!mkdir -p data
!if [ ! -e data/WDI_csv.zip ] ; then cd data && wget http://databank.worldbank.org/data/download/WDI_csv.zip ; fi
!if [ ! -e data/WDIData.csv ] ; then cd data && unzip -o WDI_csv.zip 1> /dev/null; fi
# If this import errors, check that the unzip completed with no errors.
data = pd.read_csv('data/WDIData.csv')
# Remove the CSV files, which we don't need.
!cd data && rm -f WDICountry-Series.csv WDICountry.csv WDIData.csv WDIFootNote.csv WDISeries-Time.csv WDISeries.csv
# Test dataset.
data.shape
assert data.shape[0] == 383572, "WDI data may have been updated. Please update this notebook and rerun."
assert data.shape[1] == 67, "WDI data may have been updated. Please update this notebook and rerun."
return data
Remove Groups#
This dataset begins with rows for groups of countries, such as “High Income”. If you want to work only with individual country data, use wdi_remove_groups()
.
Warning
In future editions of this dataset, the final group code may be different. If so, update final_group_code
.
def wdi_remove_groups(data):
"""Remove country groups from dataset."""
final_group_code = 'WLD'
final_group_index = list(data['Country Code'].unique()).index(final_group_code)
country_group_codes = list(data['Country Code'].unique()[:final_group_index+1])
assert len(country_group_codes)==49, "Number of group codes has changed. Please confirm that final_group_code is correct."
data = data[~data['Country Code'].isin(country_group_codes)]
return data
Choose a Year#
def wdi_show_by_year(data, recent=None):
"""Show number of entries for each (recent) year."""
rows = len(data)
print("Total rows:", rows)
if recent:
print(f"Limiting to the most recent {recent} years.")
col_before_years=4
drop_to_col = (len(data.columns)-recent)
data=data.drop(data.iloc[:,col_before_years:drop_to_col],axis = 1)
print("Year\t Entries")
for col in data.columns:
## If this columns is a year, how many non-NaN entries do we have in this column?
if col[0].isdigit():
print(col, "\t", rows-data[col].isna().sum())
Pivot#
We now have all the data for our chosen year. Let’s pivot this table, so that each row represents a single country and each column (or dimension) represents an indicator.
def wdi_pivot(data, year=None, pivot_column='Indicator Code', index_column='Country Code'):
"""Pivot the WDI dataset.
Params:
data: WDI dataset
year: Return only data in this year.
pivot_column: Pivot on this column.
index_column: Column to use as index.
"""
if year is None:
year == max(data['Year'])
data = data[[index_column,pivot_column,str(year)]]
## IMPORTANT: Do NOT surround any of these column names in brackets.
## If you do, you create additional levels of indexing that make accessing the data difficult.
df = pd.pivot_table(data=data, \
index=index_column, \
columns=pivot_column, \
values=str(year), \
fill_value=0)
## Confirm that we have one column for each indicator code.
df_indicators = data[pd.notna(data[str(year)])][pivot_column].unique()
#print(len(df_indicators), len(df.columns))
assert(len(df_indicators) == len(df.columns))
## Spot-check a value to ensure the pivot was successful.
#assert(df.loc['AFG']['AG.LND.FRST.K2'] == float(data[data['Country Code']=='AFG'][data['Indicator Code'] == 'AG.LND.FRST.K2'][str(year)]))
return df
Retrieve Data#
This dataset uses unique codes for indicators and country names. These functions make it easy to get the name for a particular code.
def get_indicator_name(df, code):
return df[df['Indicator Code'] == code]['Indicator Name'].unique()[0]
def get_country_name(df, code):
return df[df['Country Code'] == code]['Country Name'].unique()[0]