Getting started with ihsMW

The ihsMW package is a toolkit for cleaning, harmonising, and aggregating household survey data from the Malawi Integrated Household Survey (IHS) series. It supports researchers and analysts working with the IHS2 (2004/05), IHS3 (2010/11), IHS4 (2016/17), and IHS5 (2019/20) datasets.

1. Installation

You can install the stable release of ihsMW from CRAN:

install.packages("ihsMW")

Or install the development version from GitHub:

# Install using pak
pak::pak("vituk123/ihsMW")

# Or using remotes
remotes::install_github("vituk123/ihsMW")

2. The IHS Landscape

The Malawi National Statistical Office (NSO) conducts the Integrated Household Survey (IHS) periodically to track poverty, household expenditure, agriculture, and other socio-economic indicators. The primary rounds include:

  • IHS2: 2004–2005
  • IHS3: 2010–2011
  • IHS4: 2016–2017
  • IHS5: 2019–2020

Due to licensing restrictions, the raw microdata cannot be redistributed directly within R packages. Researchers must first register and manually download the survey data in Stata (.dta) format from the World Bank Microdata Library.

Once downloaded, place the files in a consistent folder hierarchy on your local machine, for example one subfolder per survey round.
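As a rough illustration of such a hierarchy, the base R sketch below creates one subfolder per round and then lists the .dta files found in each. The root folder name ihs_data and the use of a temporary directory are assumptions made so the sketch runs anywhere; nothing in this guide says ihsMW requires this particular layout.

```r
# Hypothetical layout: one subfolder per survey round under a common root.
# tempdir() is used only so the sketch is runnable; point ihs_root at your
# own data directory instead.
ihs_root <- file.path(tempdir(), "ihs_data")
rounds <- c("IHS2", "IHS3", "IHS4", "IHS5")

# Create the folders if they do not exist yet
for (r in rounds) {
  dir.create(file.path(ihs_root, r), recursive = TRUE, showWarnings = FALSE)
}

# After copying the downloaded .dta files in, list what each round contains
dta_files <- lapply(
  setNames(rounds, rounds),
  function(r) list.files(file.path(ihs_root, r), pattern = "\\.dta$", full.names = TRUE)
)
```

Keeping the round name in the path makes it easy to pass the matching round label when a file is later loaded and harmonised.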

3. Loading and Harmonising

The same question is often stored under a different variable name in each IHS round. For example, household size is recorded under different column names depending on the round. ihsMW uses a crosswalk, a lookup table that maps each round's raw variable names to one set of standard names, to harmonise them.

To load and harmonise a raw survey file:

library(ihsMW)
library(haven)

# Load the raw Stata file
raw_data <- read_dta("path/to/IHS5/hh_mod_a_filt.dta")

# Harmonise variables to standard names
harmonised_data <- ihs_harmonise(raw_data, round = "IHS5")
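To make the crosswalk idea concrete, here is a minimal base R sketch of renaming columns from a lookup table. It is not the package's implementation: the crosswalk layout, the helper rename_with_crosswalk(), and the raw column names hh_size_r4 and hhsize_r5 are all invented for illustration.

```r
# Hypothetical crosswalk: one row per (round, raw name) pair.
# The raw and standard names below are invented for illustration.
crosswalk <- data.frame(
  round    = c("IHS4", "IHS5"),
  raw_name = c("hh_size_r4", "hhsize_r5"),
  std_name = c("hh_size", "hh_size"),
  stringsAsFactors = FALSE
)

# Rename the columns of `data` that appear in the crosswalk for `round`;
# columns without a mapping keep their original names.
rename_with_crosswalk <- function(data, crosswalk, round) {
  cw <- crosswalk[crosswalk$round == round, ]
  idx <- match(names(data), cw$raw_name)   # NA where a column is not mapped
  mapped <- !is.na(idx)
  names(data)[mapped] <- cw$std_name[idx[mapped]]
  data
}

toy <- data.frame(case_id = c("a", "b"), hhsize_r5 = c(4, 6))
renamed <- rename_with_crosswalk(toy, crosswalk, round = "IHS5")
names(renamed)
```

In this toy case the IHS5 column hhsize_r5 becomes hh_size, while case_id is left alone because it has no entry in the lookup table.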

4. Searching Variables

To find variables mapped in the crosswalk, use ihs_search(). You can search by keywords or labels:

# Search for consumption-related variables
ihs_search("consumption")

# Search for age within a specific round
ihs_search("age", round = "IHS5")

To view a summary of the crosswalk coverage and flag variables needing review, use ihs_crosswalk_check():

ihs_crosswalk_check()
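For intuition, a keyword search over a crosswalk can be sketched in base R as a case-insensitive match against names and labels. The table cw, its columns, and the helper search_crosswalk() are hypothetical and only mimic the kind of lookup described above; the actual behaviour of ihs_search() may differ.

```r
# Hypothetical crosswalk with labels; names and labels are invented.
cw <- data.frame(
  round    = c("IHS4", "IHS5", "IHS5"),
  std_name = c("food_exp", "food_exp", "age"),
  label    = c("Food consumption expenditure",
               "Food consumption expenditure",
               "Age in years"),
  stringsAsFactors = FALSE
)

# Case-insensitive match against either the standard name or the label,
# optionally restricted to one round
search_crosswalk <- function(cw, keyword, round = NULL) {
  hit <- grepl(keyword, cw$std_name, ignore.case = TRUE) |
    grepl(keyword, cw$label, ignore.case = TRUE)
  if (!is.null(round)) hit <- hit & cw$round == round
  cw[hit, ]
}

consumption_hits <- search_crosswalk(cw, "consumption")
age_hits <- search_crosswalk(cw, "age", round = "IHS5")
```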

5. Data Cleaning

ihsMW provides tools to clean standard survey anomalies, handle missing value codes, and winsorize extreme values:

# Convert standard survey missing codes (-99, -98, etc.) to NA
df_clean <- ihs_standardize_missing(harmonised_data)

# Winsorize outliers (e.g. food expenditure) stratified by urban/rural
df_winsor <- ihs_winsorize(df_clean, value_col = "food_exp", strata_col = "urban")

# Run the master cleaning wrapper which applies both steps and logs changes
df_cleaned <- ihs_clean(
  data = harmonised_data,
  missing_cols = c("food_exp", "nonfood_exp"),
  winsorize_cols = "food_exp",
  strata_col = "urban"
)
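The two cleaning steps can be illustrated in base R without the package. This is a sketch under stated assumptions, not the package's method: the toy data are invented, and the 5th/95th percentile cut-offs are an arbitrary choice, since this guide does not say which limits ihs_winsorize() uses.

```r
# Toy data; the missing codes -99 and -98 follow the text above,
# all other values are invented for illustration.
toy <- data.frame(
  urban    = c(1, 1, 1, 1, 0, 0, 0, 0),
  food_exp = c(100, 120, -99, 5000, 40, 50, 60, -98)
)

# Step 1: recode survey missing codes to NA
missing_codes <- c(-99, -98)
toy$food_exp[toy$food_exp %in% missing_codes] <- NA

# Step 2: winsorize by capping values at chosen quantiles
# (assumed limits: 5th and 95th percentiles)
winsorize <- function(x, probs = c(0.05, 0.95)) {
  lims <- quantile(x, probs = probs, na.rm = TRUE)
  pmin(pmax(x, lims[1]), lims[2])   # NA values stay NA
}

# Apply the cap separately within each urban/rural stratum
toy$food_exp_w <- ave(toy$food_exp, toy$urban, FUN = winsorize)
```

Stratifying matters because a value that is extreme for rural households may be unremarkable for urban ones; capping within strata avoids pulling one group's distribution toward the other's.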

6. Unit Conversion

Agricultural modules in the IHS allow households to report harvest quantities in non-standard units (e.g., pails, basins, ox-carts, bags) rather than kilograms. ihsMW bundles official NSO conversion factors to convert these quantities to kilograms:

# Convert quantities reported in non-standard units to kilograms
crop_data <- data.frame(
  crop_code = c(1, 2),
  unit_code = c(3, 4),
  quantity = c(10, 5),
  region = c(1, 2)
)

crop_data_kg <- ihs_convert_units(
  data = crop_data,
  crop_col = "crop_code",
  unit_col = "unit_code",
  qty_col = "quantity",
  region_col = "region"
)
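Conceptually, this kind of conversion is a lookup followed by a multiplication. The base R sketch below joins a conversion table onto the data by crop, unit, and region. The table factors, its column kg_per_unit, and the factor values are placeholders made up for illustration; they are not the NSO factors shipped with the package.

```r
# Placeholder conversion table: kilograms per reported unit, keyed by
# crop, unit and region. These factors are invented, NOT the NSO values.
factors <- data.frame(
  crop_code   = c(1, 2),
  unit_code   = c(3, 4),
  region      = c(1, 2),
  kg_per_unit = c(50, 18)
)

crop_data <- data.frame(
  crop_code = c(1, 2),
  unit_code = c(3, 4),
  quantity  = c(10, 5),
  region    = c(1, 2)
)

# Left join so that rows without a matching factor are kept with NA
converted <- merge(
  crop_data, factors,
  by = c("crop_code", "unit_code", "region"),
  all.x = TRUE
)

# Quantity in kilograms = reported quantity x kilograms per unit
converted$quantity_kg <- converted$quantity * converted$kg_per_unit
```

Using a left join keeps unmatched rows visible as NA instead of silently dropping them, which makes gaps in the conversion table easier to spot.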

7. Aggregation

To aggregate member-level or agricultural plot-level data up to the household level, use ihs_aggregate():

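# Aggregate individual-level education to household level.
# member_data is assumed to be a harmonised member-level data frame
# (one row per household member) that includes case_id.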
hh_edu <- ihs_aggregate(
  data = member_data,
  id_cols = "case_id",
  val_cols = c("years_education", "completed_primary")
)
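For readers who want to see what household-level aggregation amounts to, here is a base R equivalent on toy data. The member records are invented, and taking the mean is an assumption: this guide does not state which summary function ihs_aggregate() applies by default.

```r
# Toy member-level data: one row per household member (values invented)
member_data <- data.frame(
  case_id           = c("hh1", "hh1", "hh2"),
  years_education   = c(8, 12, 4),
  completed_primary = c(1, 1, 0)
)

# Collapse to one row per household, here using the mean of each column
hh_edu <- aggregate(
  cbind(years_education, completed_primary) ~ case_id,
  data = member_data,
  FUN = mean
)
```

With the mean, completed_primary becomes the share of household members who completed primary school; a sum would instead give a count.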