26 Common Data Problems
In this chapter, you’ll learn how to overcome some of the most common dirty data problems. You’ll convert data types, apply range constraints to remove future data points, and remove duplicated data points to avoid double-counting.
26.1 Converting data types
Throughout this chapter, you’ll be working with San Francisco bike share ride data called bike_share_rides. It contains information on start and end stations of each trip, the trip duration, and some user information.
Before beginning to analyze any dataset, it’s important to take a look at the different types of columns you’ll be working with, which you can do using glimpse().
In this exercise, you’ll take a look at the data types contained in bike_share_rides and see how an incorrect data type can flaw your analysis.
dplyr and assertive are loaded and bike_share_rides is available.
Instructions 100 XP
- Examine the data types of the columns of bike_share_rides.
- Get a summary of the user_birth_year column of bike_share_rides.
ex_001.R
# Glimpse at bike_share_rides
glimpse(bike_share_rides)
# Summary of user_birth_year
summary(bike_share_rides$user_birth_year)
# Convert user_birth_year to factor: user_birth_year_fct
bike_share_rides <- bike_share_rides %>%
mutate(user_birth_year_fct = as.factor(user_birth_year))
# Assert user_birth_year_fct is a factor
assert_is_factor(bike_share_rides$user_birth_year_fct)
# Summary of user_birth_year_fct
summary(bike_share_rides$user_birth_year_fct)26.2 Trimming strings
In the previous exercise, you were able to identify the correct data type and convert user_birth_year to the correct type, allowing you to extract counts that gave you a bit more insight into the dataset.
Another common dirty data problem is having extra bits like percent signs or periods in numbers, causing them to be read in as characters. In order to be able to crunch these numbers, the extra bits need to be removed and the numbers need to be converted from character to numeric. In this exercise, you’ll need to convert the duration column from character to numeric, but before this can happen, the word “minutes” needs to be removed from each value.
dplyr, assertive, and stringr are loaded and bike_share_rides is available.
Instructions 100 XP
- Use
str_remove()to remove"minutes"from the duration column ofbike_share_rides. Add this as a new column calledduration_trimmed. - Convert the
duration_trimmedcolumn to a numeric type and add this as a new column calledduration_mins. - Glimpse at
bike_share_ridesand assert that theduration_minscolumn isnumeric. - Calculate the mean of
duration_mins.
ex_002.R
bike_share_rides <- bike_share_rides %>%
# Remove 'minutes' from duration: duration_trimmed
mutate(duration_trimmed = str_remove(duration, "minutes"),
# Convert duration_trimmed to numeric: duration_mins
duration_mins = as.numeric(duration_trimmed))
# Glimpse at bike_share_rides
glimpse(bike_share_rides)
# Assert duration_mins is numeric
assert_is_numeric(bike_share_rides$duration_mins)
# Calculate mean duration
mean(bike_share_rides$duration_mins)26.3 Ride duration constraints
Values that are out of range can throw off an analysis, so it’s important to catch them early on. In this exercise, you’ll be examining the duration_min column more closely. Bikes are not allowed to be kept out for more than 24 hours, or 1440 minutes at a time, but issues with some of the bikes caused inaccurate recording of the time they were returned.
In this exercise, you’ll replace erroneous data with the range limit ( 1440 minutes), however, you could just as easily replace these values with NAs.
load dplyr, assertive, and ggplot2 andbike_share_rides.
Instructions 100 XP
Create a three-bin histogram of the duration_min column of bike_share_rides using ggplot2 to identify if there is out-of-range data.
ex_003.R
# Create breaks
breaks <-
c(
min(bike_share_rides$duration_min),
0,
1440,
max(bike_share_rides$duration_min)
)
# Create a histogram of duration_min
ggplot(bike_share_rides, aes(duration_min)) +
geom_histogram(breaks = breaks)
# duration_min_const: replace vals of duration_min > 1440 with 1440
bike_share_rides <- bike_share_rides %>%
mutate(
duration_min_const = replace(
duration_min,
duration_min > 1440 ,
1440
)
)
# Make sure all values of
# duration_min_const are between 0 and 1440
assert_all_are_in_closed_range(
bike_share_rides$duration_min_const,
lower = 0, upper = 1440
)26.4 Back to the future
Something has gone wrong and it looks like you have data with dates from the future, which is way outside of the date range you expected to be working with. To fix this, you’ll need to remove any rides from the dataset that have a date in the future. Before you can do this, the date column needs to be converted from a character to a Date. Having these as Date objects will make it much easier to figure out which rides are from the future, since R makes it easy to check if one Date object is before (<) or after (>) another.
load dplyr, assertive and bike_share_rides.
Instructions 100 XP
- Convert the date column of
bike_share_ridesfromcharacterto theDatedata type. - Assert that all values in the
datecolumn happened sometime in the past and not in the future. - Filter
bike_share_ridesto get only the rides from the past or today, and save this asbike_share_rides_past. - Assert that the dates in
bike_share_rides_pastoccurred only in the past.
ex_004.R
library(lubridate)
# Convert date to Date type
bike_share_rides <- bike_share_rides %>%
mutate(date = as.Date(date))
# Make sure all dates are in the past
assert_all_are_in_past(bike_share_rides$date)
# Filter for rides that occurred before or on today's date
bike_share_rides_past <- bike_share_rides %>%
filter(date <= today())
# Make sure all dates from bike_share_rides_past are in the past
assert_all_are_in_past(bike_share_rides_past$date)26.5 Full duplicates
You’ve been notified that an update has been made to the bike sharing data pipeline to make it more efficient, but that duplicates are more likely to be generated as a result. To make sure that you can continue using the same scripts to run your weekly analyses about ride statistics, you’ll need to ensure that any duplicates in the dataset are removed first.
When multiple rows of a data frame share the same values for all columns, they’re full duplicates of each other. Removing duplicates like this is important, since having the same value repeated multiple times can alter summary statistics like the mean and median. Each ride, including its ride_id should be unique.
be sure thatdplyr is loaded and bike_share_rides is available.
Instructions 100 XP
- Get the total number of full duplicates in
bike_share_rides. - Remove all full duplicates from
bike_share_ridesand save the new data frame asbike_share_rides_unique. - Get the total number of full duplicates in the new
bike_share_rides_uniquedata frame.
ex_005.R
# Count the number of full duplicate
sum(duplicated(bike_share_rides))
# Remove duplicates
bike_share_rides_unique <- distinct(bike_share_rides)
# Count the full duplicates in bike_share_rides_unique
sum(duplicated(bike_share_rides_unique))26.6 Removing partial duplicates
Now that you’ve identified and removed the full duplicates, it’s time to check for partial duplicates. Partial duplicates are a bit tricker to deal with than full duplicates. In this exercise, you’ll first identify any partial duplicates and then practice the most common technique to deal with them, which involves dropping all partial duplicates, keeping only the first.
dplyr is loaded and bike_share_rides is available.
Instructions 100 XP
- Count the number of occurrences of each
ride_id. - Filter for
ride_ids that occur multiple times. - Remove full and partial duplicates from
bike_share_ridesbased onride_idonly, keeping all columns. - Store this as
bike_share_rides_unique.
ex_006.R
# Find duplicated ride_ids
bike_share_rides %>%
count(ride_id) %>%
filter(n > 1)
# Remove full and partial duplicates
bike_share_rides_unique <- bike_share_rides %>%
# Only based on ride_id instead of all cols
distinct(ride_id, .keep_all = TRUE)
# Find duplicated ride_ids in bike_share_rides_unique
bike_share_rides_unique %>%
# Count the number of occurrences of each ride_id
count(ride_id) %>%
# Filter for rows with a count > 1
filter(n>1)26.7 Aggregating partial duplicates
Another way of handling partial duplicates is to compute a summary statistic of the values that differ between partial duplicates, such as mean, median, maximum, or minimum. This can come in handy when you’re not sure how your data was collected and want an average, or if based on domain knowledge, you’d rather have too high of an estimate than too low of an estimate (or vice versa).
dplyr is loaded and bike_share_rides is available.
26.7.1 Instructions 100 XP
- Group
bike_share_ridesby ride_id and date. - Add a column called duration_min_avg that contains the mean ride duration for the row’s ride_id and date.
- Remove duplicates based on ride_id and date, keeping all columns of the data frame.
- Remove the duration_min column.
ex_007.R
bike_share_rides %>%
# Group by ride_id and date
group_by(ride_id, date) %>%
# Add duration_min_avg column
mutate(duration_min_avg = mean(duration_min)) %>%
# Remove duplicates based on ride_id and date, keep all cols
distinct(ride_id, date, .keep_all = TRUE) %>%
# Remove duration_min column
select(-duration_min)