Data cleaning
data_cleaning.RmdThere are several simple routines that can help make the preparation of civil registration data for tabulation less time consuming. This vignette uses the tidyverse suite of R packages to deal with common problems such as incorrect data types, duplicate rows, and missing values.
We will use a synthetic dataset that has the appearance of civil registration data to experiment with. The states of Nigeria are used but otherwise the data are completely fictional.
Our synthetic dataset
Before we start we need to load the tidyverse package.
Then we can inspect our dataset.
glimpse(df)
#> Rows: 1,004
#> Columns: 9
#> $ id <int> 173, 407, 460, 63, 485, 870, 879, 559, 927, 452, 201, 197, 517…
#> $ birth1a <chr> "2024-07-08", "2024-05-04", "2024-06-27", "2024-05-31", "2024-…
#> $ birth1b <date> 2024-08-03, 2024-05-17, 2024-07-18, 2024-06-05, 2024-05-07, 2…
#> $ birth1c <chr> "", "Osun", "Oyo", "Imo", "Delta", "Osun", "Nassarawa", "Nassa…
#> $ birth1g <chr> "Quadruplet or higher", "Twin", "Quadruplet or higher", "Tripl…
#> $ birth1h <chr> "Doctor", "Doctor", "Nurse", "Other", "Doctor", "Doctor", "Nur…
#> $ birth1i <chr> "Hospital", "Unknown", "Other", "Unknown", "Other", "Unknown",…
#> $ birth2a <chr> "F", "M", "M", "M", "M", "M", "M", "M", "F", "M", "M", "F", "M…
#> $ birth3b <int> 85, 11, 13, 87, 37, 42, 47, 3, 20, 40, 61, 63, 13, 57, 45, 66,…There are 9 variables in our data:
id-
birth1a(Date of occurrence, Event) -
birth1b(Date of registration, Event) -
birth1c(Place of occurrence, Event) -
birth1g(Type of birth, Event) -
bith1h(Attendant at birth, Event) -
birth1i(Type of place of occurrence, Event) -
birth2a(Sex, Newborn) -
birth3b(Age, Mother)
The variable birth1b (Date of registration, Event) is of
type date, id and birth3b (Age, Mother)
integer, and the others are all character.
Characters as dates
Dates can be inputted in a variety of formats such as “1 August 24”, Aug 1, 2024”, and “2024-08-01”. Sometimes R will not be able to parse a date and will read it as a character.
The variable birth1a (Date of occurrence, Event) is a
character but it should be a date. We can coerce it to a date by passing
it to the as.Date() function and choosing the corresponding
date format. birth1a follows the ISO
8601 standard for dates or ‘YYYY-MM-DD’ so we need to format the
data with “%Y-%m-%d”.
Date formats in R
You can also use the ymd()
function which more intuitively transforms the date into year, month,
and day.
Dates out of sequence
There may be inputting errors in the civil registration data such as
birth1a (Date of occurrence, Event) being recorded
after birth1b (Date of registration, Event). To
identify those records that are out of sync we can use a simple filter.
The code below selects all of the rows where birth1a (Date
of occurrence, Event) is after birth1b (Date of
registration, Event) and prints out selected columns.
filter(df, birth1a > birth1b) |>
select(id, birth1a, birth1b)
#> id birth1a birth1b
#> 1 75 2024-05-22 1900-01-01
#> 2 478 2024-07-24 1900-01-01
#> 3 581 2024-05-28 1900-01-01We can see that some registration dates have been entered as “1900-01-01”.
Duplicate rows
We can identify duplicate rows by using the id variable
which should be unique. The code below groups the data by
id and then filters those with more than one row.
df |>
group_by(id) |>
filter(n()>1)
#> # A tibble: 5 × 9
#> # Groups: id [1]
#> id birth1a birth1b birth1c birth1g birth1h birth1i birth2a birth3b
#> <int> <date> <date> <chr> <chr> <chr> <chr> <chr> <int>
#> 1 173 2024-07-08 2024-08-03 "" Quadruple… Doctor Hospit… F 85
#> 2 173 2024-07-08 2024-08-03 "" Quadruple… Doctor Hospit… F 85
#> 3 173 2024-07-08 2024-08-03 "" Quadruple… Doctor Hospit… F 85
#> 4 173 2024-07-08 2024-08-03 "" Quadruple… Doctor Hospit… F 85
#> 5 173 2024-07-08 2024-08-03 "" Quadruple… Doctor Hospit… F 85When no id variable is present in the data you could create a unique
identifier by concatenating one or more values. This could be achieved
with the join()
function.
Characters as factors
It is common practice to read known values of a variable as
characters, hence stringsAsFactors = FALSE. However, once
you have read in the data you may wish to convert a character variable
into a factor. Here we are converting birth1g (Type of
birth, Event) into a factor.
Wrong order of levels
Sometimes the order of levels in a factor variable is not very
helpful. The fct_relevel()
function allows you to customise the order of levels. Here we reorder
the values of birth1g (Type of birth, Event) so that
“Single” comes first, then “Twin” etc.
df$birth1g |> levels()
#> [1] "Quadruplet or higher" "Twin" "Triplet"
#> [4] "Single"
df$birth1g |> fct_relevel("Single", "Twin", "Triplet", "Quadruplet or higher") |> levels()
#> [1] "Single" "Twin" "Triplet"
#> [4] "Quadruplet or higher"Unhelpfully coded values
Often values of a variable are coded unhelpfully. For example, in
birth2a (Sex, Newborn) the sex of the newborn baby has been
recorded as “M” and “F”. We can easily recode these so that they are
more readable. The fct_recode()
function enables you to change factor levels by hand.
df$birth2a |>
fct_recode(Female = "F", Male = "M") |>
levels()
#> [1] "Female" "Male"Convert values to NA
Missing values are recorded in a variety of ways e.g. 999999, NULL,
or even ” “. We can recode these missing values to a more explicit NA
using the na_if()
function. The variable birth1c (Place of occurrence, Event)
includes a number of missing values that have been recorded as” “. We
can replace these with NA using the code below.
Checking missing data
To identify how many missing values there are in the whole dataset we can run:
df |> map_dbl(~sum(is.na(.)))
#> id birth1a birth1b birth1c birth1g birth1h birth1i birth2a birth3b
#> 0 0 0 33 0 0 0 0 10There are 33 NA values in birth1c (Place of occurrence,
Event) and 10 in birth3b (Age, Mother).
Imputing missing data
In cases where there are missing values in an integer variable we can
impute them as an average. For example, there are 10 missing values in
birth3b (Age, Mother). We can impute the mean age with the
following code.
df <- df |>
mutate(birth3b = case_when(
is.na(birth3b) ~ as.integer(mean(birth3b, na.rm = TRUE)),
TRUE ~ birth3b)
)