id | birth1a | birth1b | birth1c | birth1g | birth1h | birth1i | birth2a | birth3b |
---|---|---|---|---|---|---|---|---|
179 | 2024-07-18 | 2024-08-12 | Imo | Quadruplet or higher | Midwife | Hospital | M | 25 |
162 | 2024-06-22 | 2024-07-10 | Delta | Single | Midwife | Home | M | 23 |
532 | 2024-08-05 | 2024-08-08 | Delta | Twin | Doctor | Other | M | 47 |
918 | 2024-07-13 | 2024-07-17 | Yobe | Quadruplet or higher | Midwife | Home | F | 35 |
237 | 2024-07-09 | 2024-07-29 | Benue | Triplet | Nurse | Unknown | F | 29 |
5 Data cleaning
There are several simple routines that can help make the preparation of civil registration data for tabulation less time consuming. This chapter uses the tidyverse suite of R packages to deal with common problems such as incorrect data types, duplicate rows, and missing values.
5.1 Practice dataset
We will use a synthetic dataset (Table 5.1) that has the appearance of civil registration data to practice with. The states of Nigeria are used but otherwise the data are completely fictional.
5.2 Setup
Before we start we need to load the tidyverse package. This will make available a range of handy functions for data manipulation.
Then we need to load our dataset using the read_csv()
function. All you need to do is provide a path to your CSV file.
Then we can inspect our dataset.
Rows: 1,004
Columns: 9
$ id <int> 179, 162, 532, 918, 237, 501, 613, 784, 47, 780, 239, 519, 120…
$ birth1a <chr> "2024-07-18", "2024-06-22", "2024-08-05", "2024-07-13", "2024-…
$ birth1b <date> 2024-08-12, 2024-07-10, 2024-08-08, 2024-07-17, 2024-07-29, 2…
$ birth1c <chr> "Imo", "Delta", "Delta", "Yobe", "Benue", "Enugu", "Kebbi", "I…
$ birth1g <chr> "Quadruplet or higher", "Single", "Twin", "Quadruplet or highe…
$ birth1h <chr> "Midwife", "Midwife", "Doctor", "Midwife", "Nurse", "Nurse", "…
$ birth1i <chr> "Hospital", "Home", "Other", "Home", "Unknown", "Other", "Home…
$ birth2a <chr> "M", "M", "M", "F", "F", "F", "F", "F", "M", "M", "F", "M", "M…
$ birth3b <int> 25, 23, 47, 35, 29, 4, 66, 1, 86, 29, 26, 72, 79, 7, 37, 80, 7…
There are 9 variables in our data:
id
-
birth1a
(Date of occurrence, Event) -
birth1b
(Date of registration, Event) -
birth1c
(Place of occurrence, Event) -
birth1g
(Type of birth, Event) -
bith1h
(Attendant at birth, Event) -
birth1i
(Type of place of occurrence, Event) -
birth2a
(Sex, Newborn) -
birth3b
(Age, Mother)
The variable birth1b
(Date of registration, Event) is of type date, id
and birth3b
(Age, Mother) integer, and the others are all character.
5.3 Data de-identification
Information that could identify individuals must be removed from the datatset.
5.3.1 Removing columns
Sometimes personal information can be left in a dataset. We can easily drop identifying variables with select()
. For example, if the id
variable were actually a national identity number you could remove it with:
5.4 Valid
Variables should conform to data types and values should fall within expected ranges.
5.4.1 Characters as dates
Dates can be inputted in a variety of formats such as “1 August 24”, Aug 1, 2024”, and “2024-08-01”. Sometimes R will not be able to parse a date and will read it as a character.
The variable birth1a
(Date of occurrence, Event) is a character but it should be a date. We can coerce it to a date by passing it to the as.Date()
function and choosing the corresponding date format. birth1a
follows the ISO 8601 standard for dates or ‘YYYY-MM-DD’ so we need to format the data with “%Y-%m-%d”.
Symbol | Description | Example |
---|---|---|
%d | Day as a number | 01-31 |
%a | Abbreviated weekday | Wed |
%A | Unabbreviated weekday | Wednesday |
%m | Month as a number | 1-12 |
%b | Abbreviated month | Feb |
%B | Unabbreviated month | February |
%y | 2-digit year | 24 |
%Y | 4-digit year | 2024 |
You can also use the ymd()
function which more intuitively transforms the date into year, month, and day.
We can see that some registration dates have been entered as “1900-01-01”.
5.4.2 Characters as factors
It is common practice to read known values of a variable as characters, hence stringsAsFactors = FALSE
. However, once you have read in the data you may wish to convert a character variable into a factor. Here we are converting birth1g
(Type of birth, Event) into a factor.
5.4.3 Remove white space
Leading and trailing white space can easily be coded in error with manual data entry. We can get rid of these blank characters with the str_squish()
function from the stringr
package.
5.4.4 Values within expected range
Sometimes there can be rogue values in a dataset that have been entered in error. We can check for these by calculating some summary statistics. The skim()
function from the skimr
package prints the average, quartiles and a neat histogram.
── Variable type: numeric ──────────────────────────────────────────────────────
skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
1 birth3b 10 0.990 45.0 25.8 0 24 45 67 90 ▇▇▇▇▆
5.4.5 Wrong order of levels
Sometimes the order of levels in a factor variable is not very helpful. The fct_relevel()
function allows you to customise the order of levels. Here we reorder the values of birth1g
(Type of birth, Event) so that “Single” comes first, then “Twin” etc.
5.4.6 Recode values
Often values of a variable are coded unhelpfully. For example, in birth2a
(Sex, Newborn) the sex of the newborn baby has been recorded as “M” and “F”. We can easily recode these so that they are more readable. The fct_recode()
function enables you to change factor levels by hand.
You can also use the case_match()
function within mutate()
.
5.5 Accurate
There should be no conflicting information in the dataset.
5.5.1 Dates out of sequence
There may be inputting errors in the civil registration data such as birth1a
(Date of occurrence, Event) being recorded after birth1b
(Date of registration, Event). To identify those records that are out of sync we can use a simple filter. The code below selects all of the rows where birth1a
(Date of occurrence, Event) is after birth1b
(Date of registration, Event) and prints out selected columns.
5.6 Complete
There should be no missing or duplicate cases in the dataset.
5.6.1 Identifying duplicate rows
We can identify duplicate rows by using the get_dupes()
function from the janitor package. The function will return any duplicate records and add a dupe_count
column containing the number of duplicates. You just need to supply a unique identifier e.g. `id`` or a unique combination of columns.
id dupe_count birth1a birth1b birth1c birth1g birth1h
1 179 5 2024-07-18 2024-08-12 Imo Quadruplet or higher Midwife
2 179 5 2024-07-18 2024-08-12 Imo Quadruplet or higher Midwife
3 179 5 2024-07-18 2024-08-12 Imo Quadruplet or higher Midwife
4 179 5 2024-07-18 2024-08-12 Imo Quadruplet or higher Midwife
5 179 5 2024-07-18 2024-08-12 Imo Quadruplet or higher Midwife
birth1i birth2a birth3b
1 Hospital M 25
2 Hospital M 25
3 Hospital M 25
4 Hospital M 25
5 Hospital M 25
5.6.2 Removing duplicates
To drop duplicate rows whilst retaining the first row we can use the distinct()
function. Remember to pass the argument .keep_all = TRUE
to keep all the variables not just the unique identifier.
5.6.3 Convert values to NA
Missing values are recorded in a variety of ways e.g. 999999, NULL, or even ” “. We can recode these missing values to a more explicit NA using the na_if()
function. The variable birth1c
(Place of occurrence, Event) includes a number of missing values that have been recorded as” “. We can replace these with NA using the code below.
You can also use the case_match()
function as an alternative.
5.6.4 Checking missing data
To identify how many missing values there are in the whole dataset we can run:
id birth1a birth1b birth1c birth1g birth1h birth1i birth2a birth3b
0 0 0 0 0 0 0 0 10
There are 0 NA values in birth1c
(Place of occurrence, Event) and 10 in birth3b
(Age, Mother).
5.6.5 Fill missing value with previous value
Sometimes values are not explicit in a dataset. For example, you might have some population data and only the first value explicitly records the sex.
Age | Sex | Population |
---|---|---|
0-4 | Male | 9432700 |
5-9 | NA | 7875387 |
10-14 | NA | 7088327 |
15-19 | NA | 6159773 |
20-24 | NA | 5139436 |
25-29 | NA | 4238963 |
30-34 | NA | 3443384 |
35-39 | NA | 2756472 |
We can solve this problem with the fill()
function which takes the previous value and uses it to fill subsequent missing values.
Age | Sex | Population |
---|---|---|
0-4 | Male | 9432700 |
5-9 | Male | 7875387 |
10-14 | Male | 7088327 |
15-19 | Male | 6159773 |
20-24 | Male | 5139436 |
25-29 | Male | 4238963 |
30-34 | Male | 3443384 |
35-39 | Male | 2756472 |
5.6.6 Imputing missing data
In cases where there are missing values in an integer variable we can impute them as an average. For example, there are 10 missing values in birth3b
(Age, Mother). We can impute the mean age with the following code.