WDI Helper Functions#

Helper functions to download and import the most recent World Development Indicators dataset and prepare it for analysis.

Usage#

Important

These functions rely on standard Linux utilities like rm, wget and unzip, rather than pure Python.

To use these functions in your own notebook, download this notebook and source it with %run.

Example usage:

%run WDI.ipynb
data = get_wdi()
data = wdi_remove_groups(data)
df = wdi_pivot(data, year='2018')

Your df is now a dataset with all the country data for 2018, with each country in its own row and each indicator in its own column.

Setup#

Python Logo
import pandas as pd
import csv

Download the dataset.#

def get_wdi():
    """Download and return the WDI dataset."""

    ## Download and extract using wget, unzip, and bash.
    !mkdir -p data
    !if [ ! -e data/WDI_csv.zip ] ; then cd data && wget http://databank.worldbank.org/data/download/WDI_csv.zip ; fi
    !if [ ! -e data/WDIData.csv ] ; then cd data && unzip -o WDI_csv.zip 1> /dev/null; fi
    
    # If this import errors, check that the unzip completed with no errors.
    data = pd.read_csv('data/WDIData.csv')
    
    # Remove the CSV files, which we don't need.
    !cd data && rm -f WDICountry-Series.csv WDICountry.csv WDIData.csv WDIFootNote.csv WDISeries-Time.csv WDISeries.csv
    
    # Test dataset.
    data.shape
    assert data.shape[0] == 383572, "WDI data may have been updated. Please update this notebook and rerun."
    assert data.shape[1] == 67, "WDI data may have been updated. Please update this notebook and rerun."
    return data

Remove Groups#

This dataset begins with rows for groups of countries, such as “High Income”. If you want to work only with individual country data, use wdi_remove_groups().

Warning

In future editions of this dataset, the final group code may be different. If so, update final_group_code.

def wdi_remove_groups(data):
    """Remove country groups from dataset."""
    final_group_code = 'WLD'
    final_group_index = list(data['Country Code'].unique()).index(final_group_code)
    country_group_codes = list(data['Country Code'].unique()[:final_group_index+1])
    assert len(country_group_codes)==49, "Number of group codes has changed. Please confirm that final_group_code is correct." 
    data = data[~data['Country Code'].isin(country_group_codes)]
    return data
    

Choose a Year#

def wdi_show_by_year(data, recent=None):
    """Show number of entries for each (recent) year."""
    rows = len(data)
    print("Total rows:", rows)
    if recent:
        print(f"Limiting to the most recent {recent} years.")
        col_before_years=4
        drop_to_col = (len(data.columns)-recent)
        data=data.drop(data.iloc[:,col_before_years:drop_to_col],axis = 1)
        
    print("Year\t Entries")
    
    for col in data.columns:
        ## If this columns is a year, how many non-NaN entries do we have in this column?
        if col[0].isdigit():
            print(col, "\t", rows-data[col].isna().sum())

Pivot#

We now have all the data for our chosen year. Let’s pivot this table, so that each row represents a single country and each column (or dimension) represents an indicator.

def wdi_pivot(data, year=None, pivot_column='Indicator Code', index_column='Country Code'):
    """Pivot the WDI dataset.
    
    Params:
       data: WDI dataset
       year: Return only data in this year.
       pivot_column: Pivot on this column.
       index_column: Column to use as index.
    """
    if year is None:
        year == max(data['Year'])
        
    data = data[[index_column,pivot_column,str(year)]]
    
    ## IMPORTANT: Do NOT surround any of these column names in brackets. 
    ## If you do, you create additional levels of indexing that make accessing the data difficult.
    df = pd.pivot_table(data=data, \
                        index=index_column, \
                        columns=pivot_column, \
                        values=str(year), \
                        fill_value=0)
    
    ## Confirm that we have one column for each indicator code.
    df_indicators = data[pd.notna(data[str(year)])][pivot_column].unique()
    
    #print(len(df_indicators), len(df.columns))
    assert(len(df_indicators) == len(df.columns))
    
    ## Spot-check a value to ensure the pivot was successful.
    #assert(df.loc['AFG']['AG.LND.FRST.K2'] == float(data[data['Country Code']=='AFG'][data['Indicator Code'] == 'AG.LND.FRST.K2'][str(year)]))
    
    return df

Retrieve Data#

This dataset uses unique codes for indicators and country names. These functions make it easy to get the name for a particular code.

def get_indicator_name(df, code):
    return df[df['Indicator Code'] == code]['Indicator Name'].unique()[0]

def get_country_name(df, code):
    return df[df['Country Code'] == code]['Country Name'].unique()[0]