5 Data cleaning

There are several simple routines that can help make the preparation of civil registration data for tabulation less time consuming. This chapter uses the tidyverse suite of R packages to deal with common problems such as incorrect data types, duplicate rows, and missing values.

5.1 Practice dataset

We will use a synthetic dataset (Table 5.1) that has the appearance of civil registration data to practice with. The states of Nigeria are used but otherwise the data are completely fictional.

Table 5.1: Synthetic dataset, first 5 rows
id	birth1a	birth1b	birth1c	birth1g	birth1h	birth1i	birth2a	birth3b
179	2024-07-18	2024-08-12	Imo	Quadruplet or higher	Midwife	Hospital	M	25
162	2024-06-22	2024-07-10	Delta	Single	Midwife	Home	M	23
532	2024-08-05	2024-08-08	Delta	Twin	Doctor	Other	M	47
918	2024-07-13	2024-07-17	Yobe	Quadruplet or higher	Midwife	Home	F	35
237	2024-07-09	2024-07-29	Benue	Triplet	Nurse	Unknown	F	29

5.2 Setup

Before we start we need to load the tidyverse package. This will make available a range of handy functions for data manipulation.

library(tidyverse)

Then we need to load our dataset using the read_csv() function. All you need to do is provide a path to your CSV file.

df <- read_csv("data/practice_data.csv")

Parsing data types

Variables are parsed automatically with the read_csv() function but you can set data types for specific columns manually. For example,

df <- read_csv(
  "data/practice_data.csv",
  col_types = cols(
    x = col_character(),
    y = col_logical()
  )
)

Then we can inspect our dataset.

glimpse(df)

Rows: 1,004
Columns: 9
$ id      <int> 179, 162, 532, 918, 237, 501, 613, 784, 47, 780, 239, 519, 120…
$ birth1a <chr> "2024-07-18", "2024-06-22", "2024-08-05", "2024-07-13", "2024-…
$ birth1b <date> 2024-08-12, 2024-07-10, 2024-08-08, 2024-07-17, 2024-07-29, 2…
$ birth1c <chr> "Imo", "Delta", "Delta", "Yobe", "Benue", "Enugu", "Kebbi", "I…
$ birth1g <chr> "Quadruplet or higher", "Single", "Twin", "Quadruplet or highe…
$ birth1h <chr> "Midwife", "Midwife", "Doctor", "Midwife", "Nurse", "Nurse", "…
$ birth1i <chr> "Hospital", "Home", "Other", "Home", "Unknown", "Other", "Home…
$ birth2a <chr> "M", "M", "M", "F", "F", "F", "F", "F", "M", "M", "F", "M", "M…
$ birth3b <int> 25, 23, 47, 35, 29, 4, 66, 1, 86, 29, 26, 72, 79, 7, 37, 80, 7…

There are 9 variables in our data:

id
birth1a (Date of occurrence, Event)
birth1b (Date of registration, Event)
birth1c (Place of occurrence, Event)
birth1g (Type of birth, Event)
bith1h (Attendant at birth, Event)
birth1i (Type of place of occurrence, Event)
birth2a (Sex, Newborn)
birth3b (Age, Mother)

The variable birth1b (Date of registration, Event) is of type date, id and birth3b (Age, Mother) integer, and the others are all character.

5.3 Data de-identification

Information that could identify individuals must be removed from the datatset.

5.3.1 Removing columns

Sometimes personal information can be left in a dataset. We can easily drop identifying variables with select(). For example, if the id variable were actually a national identity number you could remove it with:

df |> 
  select(-id)

5.4 Valid

Variables should conform to data types and values should fall within expected ranges.

5.4.1 Characters as dates

Dates can be inputted in a variety of formats such as “1 August 24”, Aug 1, 2024”, and “2024-08-01”. Sometimes R will not be able to parse a date and will read it as a character.

The variable birth1a (Date of occurrence, Event) is a character but it should be a date. We can coerce it to a date by passing it to the as.Date() function and choosing the corresponding date format. birth1a follows the ISO 8601 standard for dates or ‘YYYY-MM-DD’ so we need to format the data with “%Y-%m-%d”.

df <- mutate(df, birth1a = as.Date(birth1a, format = "%Y-%m-%d"))
class(df$birth1a)

[1] "Date"

Table 5.2: Date formats in R
Symbol	Description	Example
%d	Day as a number	01-31
%a	Abbreviated weekday	Wed
%A	Unabbreviated weekday	Wednesday
%m	Month as a number	1-12
%b	Abbreviated month	Feb
%B	Unabbreviated month	February
%y	2-digit year	24
%Y	4-digit year	2024

You can also use the ymd() function which more intuitively transforms the date into year, month, and day.

mutate(df, birth1a = ymd(birth1a))

We can see that some registration dates have been entered as “1900-01-01”.

5.4.2 Characters as factors

It is common practice to read known values of a variable as characters, hence stringsAsFactors = FALSE. However, once you have read in the data you may wish to convert a character variable into a factor. Here we are converting birth1g (Type of birth, Event) into a factor.

class(df$birth1g)

[1] "character"

df$birth1g <- as_factor(df$birth1g)
class(df$birth1g)

[1] "factor"

5.4.3 Remove white space

Leading and trailing white space can easily be coded in error with manual data entry. We can get rid of these blank characters with the str_squish() function from the stringr package.

str_squish(" Federal    Capital Territory")

[1] "Federal Capital Territory"

5.4.4 Values within expected range

Sometimes there can be rogue values in a dataset that have been entered in error. We can check for these by calculating some summary statistics. The skim() function from the skimr package prints the average, quartiles and a neat histogram.

library(skimr)
skim(df, "birth3b")


── Variable type: numeric ──────────────────────────────────────────────────────
  skim_variable n_missing complete_rate mean   sd p0 p25 p50 p75 p100 hist 
1 birth3b              10         0.990 45.0 25.8  0  24  45  67   90 ▇▇▇▇▆

5.4.5 Wrong order of levels

Sometimes the order of levels in a factor variable is not very helpful. The fct_relevel() function allows you to customise the order of levels. Here we reorder the values of birth1g (Type of birth, Event) so that “Single” comes first, then “Twin” etc.

df$birth1g |> levels()

[1] "Quadruplet or higher" "Single"               "Twin"                
[4] "Triplet"

df$birth1g |> fct_relevel("Single", "Twin", "Triplet", "Quadruplet or higher") |> levels()

[1] "Single"               "Twin"                 "Triplet"             
[4] "Quadruplet or higher"

5.4.6 Recode values

Often values of a variable are coded unhelpfully. For example, in birth2a (Sex, Newborn) the sex of the newborn baby has been recorded as “M” and “F”. We can easily recode these so that they are more readable. The fct_recode() function enables you to change factor levels by hand.

df$birth2a |> 
  fct_recode(Female = "F", Male = "M") |> 
  levels()

[1] "Female" "Male"

You can also use the case_match() function within mutate().

df |>
  mutate(birth2a = case_match(birth2a, "F" ~ "Female", "M" ~ "Male"))

5.5 Accurate

There should be no conflicting information in the dataset.

5.5.1 Dates out of sequence

There may be inputting errors in the civil registration data such as birth1a (Date of occurrence, Event) being recorded after birth1b (Date of registration, Event). To identify those records that are out of sync we can use a simple filter. The code below selects all of the rows where birth1a (Date of occurrence, Event) is after birth1b (Date of registration, Event) and prints out selected columns.

filter(df, birth1a > birth1b) |> 
  select(id, birth1a, birth1b)

   id    birth1a    birth1b
1 149 2024-06-20 1900-01-01
2 228 2024-05-18 1900-01-01
3 725 2024-05-01 1900-01-01

5.6 Complete

There should be no missing or duplicate cases in the dataset.

5.6.1 Identifying duplicate rows

We can identify duplicate rows by using the get_dupes() function from the janitor package. The function will return any duplicate records and add a dupe_count column containing the number of duplicates. You just need to supply a unique identifier e.g. `id`` or a unique combination of columns.

df |> 
  get_dupes(id)

   id dupe_count    birth1a    birth1b birth1c              birth1g birth1h
1 179          5 2024-07-18 2024-08-12     Imo Quadruplet or higher Midwife
2 179          5 2024-07-18 2024-08-12     Imo Quadruplet or higher Midwife
3 179          5 2024-07-18 2024-08-12     Imo Quadruplet or higher Midwife
4 179          5 2024-07-18 2024-08-12     Imo Quadruplet or higher Midwife
5 179          5 2024-07-18 2024-08-12     Imo Quadruplet or higher Midwife
   birth1i birth2a birth3b
1 Hospital       M      25
2 Hospital       M      25
3 Hospital       M      25
4 Hospital       M      25
5 Hospital       M      25

5.6.2 Removing duplicates

To drop duplicate rows whilst retaining the first row we can use the distinct() function. Remember to pass the argument .keep_all = TRUE to keep all the variables not just the unique identifier.

df |>
  distinct(id, .keep_all = TRUE)

5.6.3 Convert values to NA

Missing values are recorded in a variety of ways e.g. 999999, NULL, or even ” “. We can recode these missing values to a more explicit NA using the na_if() function. The variable birth1c (Place of occurrence, Event) includes a number of missing values that have been recorded as” “. We can replace these with NA using the code below.

df <- mutate(df, birth1c = na_if(birth1c, ""))

You can also use the case_match() function as an alternative.

df <- mutate(df, birth1c = case_match(birth1c, NA ~ "", .default = birth1c))

5.6.4 Checking missing data

To identify how many missing values there are in the whole dataset we can run:

df |> map_dbl(~sum(is.na(.)))

     id birth1a birth1b birth1c birth1g birth1h birth1i birth2a birth3b 
      0       0       0       0       0       0       0       0      10

There are 0 NA values in birth1c (Place of occurrence, Event) and 10 in birth3b (Age, Mother).

5.6.5 Fill missing value with previous value

Sometimes values are not explicit in a dataset. For example, you might have some population data and only the first value explicitly records the sex.

kable(pop)

Age	Sex	Population
0-4	Male	9432700
5-9	NA	7875387
10-14	NA	7088327
15-19	NA	6159773
20-24	NA	5139436
25-29	NA	4238963
30-34	NA	3443384
35-39	NA	2756472

We can solve this problem with the fill() function which takes the previous value and uses it to fill subsequent missing values.

pop |> fill(Sex)

Age	Sex	Population
0-4	Male	9432700
5-9	Male	7875387
10-14	Male	7088327
15-19	Male	6159773
20-24	Male	5139436
25-29	Male	4238963
30-34	Male	3443384
35-39	Male	2756472

5.6.6 Imputing missing data

In cases where there are missing values in an integer variable we can impute them as an average. For example, there are 10 missing values in birth3b (Age, Mother). We can impute the mean age with the following code.

df <- df |> 
  mutate(birth3b = case_when(
    is.na(birth3b) ~ as.integer(mean(birth3b, na.rm = TRUE)),
    TRUE ~ birth3b)
    )