Data cleaning • crvsreportpackage

There are several simple routines that can help make the preparation of civil registration data for tabulation less time consuming. This vignette uses the tidyverse suite of R packages to deal with common problems such as incorrect data types, duplicate rows, and missing values.

We will use a synthetic dataset that has the appearance of civil registration data to experiment with. The states of Nigeria are used but otherwise the data are completely fictional.

Our synthetic dataset

Before we start we need to load the tidyverse package.

library(tidyverse)

Then we can inspect our dataset.

glimpse(df)
#> Rows: 1,004
#> Columns: 9
#> $ id      <int> 173, 407, 460, 63, 485, 870, 879, 559, 927, 452, 201, 197, 517…
#> $ birth1a <chr> "2024-07-08", "2024-05-04", "2024-06-27", "2024-05-31", "2024-…
#> $ birth1b <date> 2024-08-03, 2024-05-17, 2024-07-18, 2024-06-05, 2024-05-07, 2…
#> $ birth1c <chr> "", "Osun", "Oyo", "Imo", "Delta", "Osun", "Nassarawa", "Nassa…
#> $ birth1g <chr> "Quadruplet or higher", "Twin", "Quadruplet or higher", "Tripl…
#> $ birth1h <chr> "Doctor", "Doctor", "Nurse", "Other", "Doctor", "Doctor", "Nur…
#> $ birth1i <chr> "Hospital", "Unknown", "Other", "Unknown", "Other", "Unknown",…
#> $ birth2a <chr> "F", "M", "M", "M", "M", "M", "M", "M", "F", "M", "M", "F", "M…
#> $ birth3b <int> 85, 11, 13, 87, 37, 42, 47, 3, 20, 40, 61, 63, 13, 57, 45, 66,…

There are 9 variables in our data:

id
birth1a (Date of occurrence, Event)
birth1b (Date of registration, Event)
birth1c (Place of occurrence, Event)
birth1g (Type of birth, Event)
bith1h (Attendant at birth, Event)
birth1i (Type of place of occurrence, Event)
birth2a (Sex, Newborn)
birth3b (Age, Mother)

The variable birth1b (Date of registration, Event) is of type date, id and birth3b (Age, Mother) integer, and the others are all character.

Characters as dates

Dates can be inputted in a variety of formats such as “1 August 24”, Aug 1, 2024”, and “2024-08-01”. Sometimes R will not be able to parse a date and will read it as a character.

The variable birth1a (Date of occurrence, Event) is a character but it should be a date. We can coerce it to a date by passing it to the as.Date() function and choosing the corresponding date format. birth1a follows the ISO 8601 standard for dates or ‘YYYY-MM-DD’ so we need to format the data with “%Y-%m-%d”.

df <- mutate(df, birth1a = as.Date(birth1a, format = "%Y-%m-%d"))
class(df$birth1a)
#> [1] "Date"

Date formats in R

You can also use the ymd() function which more intuitively transforms the date into year, month, and day.

mutate(df, birth1a = ymd(birth1a))

Dates out of sequence

There may be inputting errors in the civil registration data such as birth1a (Date of occurrence, Event) being recorded after birth1b (Date of registration, Event). To identify those records that are out of sync we can use a simple filter. The code below selects all of the rows where birth1a (Date of occurrence, Event) is after birth1b (Date of registration, Event) and prints out selected columns.

filter(df, birth1a > birth1b) |> 
  select(id, birth1a, birth1b)
#>    id    birth1a    birth1b
#> 1  75 2024-05-22 1900-01-01
#> 2 478 2024-07-24 1900-01-01
#> 3 581 2024-05-28 1900-01-01

We can see that some registration dates have been entered as “1900-01-01”.

Duplicate rows

We can identify duplicate rows by using the id variable which should be unique. The code below groups the data by id and then filters those with more than one row.

df |> 
  group_by(id) |> 
  filter(n()>1)
#> # A tibble: 5 × 9
#> # Groups:   id [1]
#>      id birth1a    birth1b    birth1c birth1g    birth1h birth1i birth2a birth3b
#>   <int> <date>     <date>     <chr>   <chr>      <chr>   <chr>   <chr>     <int>
#> 1   173 2024-07-08 2024-08-03 ""      Quadruple… Doctor  Hospit… F            85
#> 2   173 2024-07-08 2024-08-03 ""      Quadruple… Doctor  Hospit… F            85
#> 3   173 2024-07-08 2024-08-03 ""      Quadruple… Doctor  Hospit… F            85
#> 4   173 2024-07-08 2024-08-03 ""      Quadruple… Doctor  Hospit… F            85
#> 5   173 2024-07-08 2024-08-03 ""      Quadruple… Doctor  Hospit… F            85

When no id variable is present in the data you could create a unique identifier by concatenating one or more values. This could be achieved with the join() function.

Characters as factors

It is common practice to read known values of a variable as characters, hence stringsAsFactors = FALSE. However, once you have read in the data you may wish to convert a character variable into a factor. Here we are converting birth1g (Type of birth, Event) into a factor.

class(df$birth1g)
#> [1] "character"
df$birth1g <- as_factor(df$birth1g)
class(df$birth1g)
#> [1] "factor"

Wrong order of levels

Sometimes the order of levels in a factor variable is not very helpful. The fct_relevel() function allows you to customise the order of levels. Here we reorder the values of birth1g (Type of birth, Event) so that “Single” comes first, then “Twin” etc.

df$birth1g |> levels()
#> [1] "Quadruplet or higher" "Twin"                 "Triplet"             
#> [4] "Single"
df$birth1g |> fct_relevel("Single", "Twin", "Triplet", "Quadruplet or higher") |> levels()
#> [1] "Single"               "Twin"                 "Triplet"             
#> [4] "Quadruplet or higher"

Unhelpfully coded values

Often values of a variable are coded unhelpfully. For example, in birth2a (Sex, Newborn) the sex of the newborn baby has been recorded as “M” and “F”. We can easily recode these so that they are more readable. The fct_recode() function enables you to change factor levels by hand.

df$birth2a |> 
  fct_recode(Female = "F", Male = "M") |> 
  levels()
#> [1] "Female" "Male"

Convert values to NA

Missing values are recorded in a variety of ways e.g. 999999, NULL, or even ” “. We can recode these missing values to a more explicit NA using the na_if() function. The variable birth1c (Place of occurrence, Event) includes a number of missing values that have been recorded as” “. We can replace these with NA using the code below.

df <- mutate(df, birth1c = na_if(birth1c, ""))

Checking missing data

To identify how many missing values there are in the whole dataset we can run:

df |> map_dbl(~sum(is.na(.)))
#>      id birth1a birth1b birth1c birth1g birth1h birth1i birth2a birth3b 
#>       0       0       0      33       0       0       0       0      10

There are 33 NA values in birth1c (Place of occurrence, Event) and 10 in birth3b (Age, Mother).

Imputing missing data

In cases where there are missing values in an integer variable we can impute them as an average. For example, there are 10 missing values in birth3b (Age, Mother). We can impute the mean age with the following code.

df <- df |> 
  mutate(birth3b = case_when(
    is.na(birth3b) ~ as.integer(mean(birth3b, na.rm = TRUE)),
    TRUE ~ birth3b)
    )