Handy Functions for Data Cleaning

Welcome to this post on data cleaning in R! This post will describe the data exploration process and introduce useful functions to streamline common tasks.

Check out the package on GitHub.

Philosophy

There are common steps that you may face in your data exploration:

Import (messy) data
Clean messy data
Describe cleaned data
Plot cleaned data
Model data
Write output tables

This package aims to facilitate this basic data handling with functions that I’ve found handy.

Quick Reference Table

handy Function	Use Case	Description
`clean_names`	Clean Data	Normalize variable names of a data.frame to snake_case
`title_names`	Clean Data	Normalize variable names of a data.frame to Title Case
`remove_columns`	Clean Data	Removes columns that are either missing or non-unique
`add_mean_row`	Describe Data	Add a mean row to a data.frame of numeric values
`add_total_row`	Describe Data	Add a total row to a data.frame of numeric values
`check_variables`	Describe Data	Calculate % missing and number of unique values for all variables
`%p%`	Shorthand	Allow string concatenation by piping
`cs`	Shorthand	Makes a character vector without quotes
`named_list`	Shorthand	Makes a named list using object names
`multiplot`	Plotting	Arrange multiple plots
`scale_x_human`	Plotting	Scales ggplot axis to easily readable K, M, B, etc.
`scale_y_human`	Plotting	Scales ggplot axis to easily readable K, M, B, etc.
`write_excel`	Write Output	Write a list of data.frames in R to named sheets in an Excel workbook
`write_regression_to_excel`	Write Output	Write a list of data.frames in R to named sheets in an Excel workbook

Clean

Let’s use R’s built in iris dataset.

These cleaning functions are built to be compatible with magrittr/dplyr pipes.

library(handy)
library(magrittr)
library(knitr)
iris3obs <- head(iris, 3)
iris3obs %>% kable()
iris3obs %>% clean_names %>% kable()

What if we want to focus on columns that change for the set we’re looking for?

iris3obs %>% remove_columns(1) %>% kable()

Describe

For every dataset, we want to know what it contains:

mtcars %>% check_variables %>% kable

Sometimes it’s helpful to add a mean or total row for a table that’s to be printed out. These are best for datasets with a character key as the first column.

arrests <- USArrests %>% tibble::rownames_to_column() %>% dplyr::as_tibble()
arrests %>% head %>% add_mean_row %>% kable
arrests %>% head %>% add_total_row %>% kable

Plot

library(ggplot2)
p1 <- Seatbelts %>% tibble::as_tibble() %>%
  ggplot(aes(kms, DriversKilled)) +
  geom_point() +
  scale_x_human()
p2 <- Seatbelts %>% tibble::as_tibble() %>%
  ggplot(aes(DriversKilled)) +
  geom_histogram(alpha = .5, bins = 30)
multiplot(p1, p2)

Write

And for some analyses, it’s nice to write out the final data to Excel.

named_list(mtcars, iris) %>%
  write_excel(workbook_fname = 'wb.xlsx')

readxl::excel_sheets('wb.xlsx')
readxl::read_excel('wb.xlsx', sheet = 'iris') %>% head %>% kable

Here, using named_list assembles a list with the list names set to the names of the objects.

lm_assault <- lm(Assault ~ UrbanPop, data = arrests)
lm_anscombe <- lm(y1 ~ x1, data = anscombe)
reg_list <- named_list(lm_assault, lm_anscombe)

reg_list

reg_list %>% write_regression_to_excel('reg.xlsx')
readxl::read_excel('reg.xlsx', sheet = 'lm_anscombe') %>% head %>% kable

← Back to blog

Table of contents