27 Categorical and Text Data
Categorical and text data can often be some of the messiest parts of a dataset due to their unstructured nature. In this chapter, you’ll learn how to fix whitespace and capitalization inconsistencies in category labels, collapse multiple categories into one, and reformat strings for consistency.
27.1 Not a member
Now that you’ve practiced identifying membership constraint problems, it’s time to fix these problems in a new dataset. Throughout this chapter, you’ll be working with a dataset called sfo_survey, containing survey responses from passengers taking flights from San Francisco International Airport (SFO). Participants were asked questions about the airport’s cleanliness, wait times, safety, and their overall satisfaction.
There were a few issues during data collection that resulted in some inconsistencies in the dataset. In this exercise, you’ll be working with the dest_size column, which categorizes the size of the destination airport that the passengers were flying to. A data frame called dest_sizes is available that contains all the possible destination sizes. Your mission is to find rows with invalid dest_sizes and remove them from the data frame.
Load dplyr, sfo_survey and dest_sizes.
Instructions 100 XP
- Count the number of occurrences of each
dest_sizeinsfo_survey.
ex_008.R
# Count the number of occurrences of dest_size
sfo_survey %>%
count(dest_size)
# Find bad dest_size rows
sfo_survey %>%
# Join with dest_sizes data frame to get bad dest_size rows
anti_join(dest_sizes, by = "dest_size") %>%
# Select id, airline, destination, and dest_size cols
select(id, airline, destination, dest_size)
# Remove bad dest_size rows
sfo_survey %>%
# Join with dest_sizes
semi_join(dest_sizes, by = "dest_size") %>%
# Count the number of each dest_size
count(dest_size)27.2 Identifiying inconsistency
In the video exercise, you learned about different kinds of inconsistencies that can occur within categories, making it look like a variable has more categories than it should.
In this exercise, you’ll continue working with the sfo_survey dataset. You’ll examine the dest_size column again as well as the cleanliness column and determine what kind of issues, if any, these two categorical variables face.
Load dplyr and sfo_survey .
Instructions 100 XP
- Count the number of occurrences of each category of the
dest_sizevariable ofsfo_survey.
ex_009.R
# Count dest_size
# Count dest_size
sfo_survey %>%
count(dest_size)
# Count cleanliness
sfo_survey %>%
count(cleanliness)27.3 Correcting inconsistency
Now that you’ve identified that dest_size has whitespace inconsistencies and cleanliness has capitalization inconsistencies, you’ll use the new tools at your disposal to fix the inconsistent values in sfo_survey instead of removing the data points entirely, which could add bias to your dataset if more than 5% of the data points need to be dropped.
Load dplyr, stringr and sfo_survey.
Instructions 100 XP
- Add a column to sfo_survey called dest_size_trimmed that contains the values in the dest_size column with all leading and trailing whitespace removed.
- Add another column called cleanliness_lower that contains the values in the cleanliness column converted to all lowercase.
- Count the number of occurrences of each category in dest_size_trimmed.
- Count the number of occurrences of each category in cleanliness_lower.
ex_010.R
# Add new columns to sfo_survey
sfo_survey <- sfo_survey %>%
# dest_size_trimmed: dest_size without whitespace
mutate(dest_size_trimmed = str_trim(dest_size),
# cleanliness_lower: cleanliness converted to lowercase
cleanliness_lower = str_to_lower(cleanliness))
# Count values of dest_size_trimmed
sfo_survey %>%
count(dest_size_trimmed)
# Count values of cleanliness_lower
sfo_survey %>%
count(cleanliness_lower)27.4 Collapsing categories
One of the tablets that participants filled out the sfo_survey on was not properly configured, allowing the response for dest_region to be free text instead of a dropdown menu. This resulted in some inconsistencies in the dest_region variable that you’ll need to correct in this exercise to ensure that the numbers you report to your boss are as accurate as possible.
Make dplyr, forcats and sfo_survey available.
Instructions 100 XP
- Count the categories of dest_region.
ex_011.R
# Count categories of dest_region
sfo_survey %>%
count(dest_region)
# Categories to map to Europe
europe_categories <- c("EU","eur", "Europ")
# Add a new col dest_region_collapsed
sfo_survey %>%
# Map all categories in europe_categories to Europe
mutate(
dest_region_collapsed = fct_collapse(dest_region,
Europe = europe_categories)) %>%
# Count categories of dest_region_collapsed
count(dest_region_collapsed)27.5 Detecting inconsistency text data
You’ve recently received some news that the customer support team wants to ask the SFO survey participants some follow-up questions. However, the auto-dialer that the call center uses isn’t able to parse all of the phone numbers since they’re all in different formats. After some investigation, you found that some phone numbers are written with hyphens (-) and some are written with parentheses ((,)). In this exercise, you’ll figure out which phone numbers have these issues so that you know which ones need fixing.
Load dplyr, stringr and sfo_survey.
Instructions 100 XP
- Filter for rows with phone numbers that contain “-”s.
- Filter for rows with phone numbers that contain “(”, or “)”. Remember to use fixed() when searching for parentheses.
ex_012.R
# Filter for rows with "-" in the phone column
sfo_survey %>%
filter(str_detect(phone, "-"))
# Filter for rows with "(" or ")" in the phone column
sfo_survey %>%
filter(str_detect(phone, fixed("(")) | str_detect(phone, fixed(")")))27.6 Replacing and removing
In the last exercise, you saw that the phone column of sfo_survey is plagued with unnecessary parentheses and hyphens. The customer support team has requested that all phone numbers be in the format "123 456 7890". In this exercise, you’ll use your new stringr skills to fulfill this request.
Make dplyr, stringr and sfo_survey available.
Instructions 100 XP
Remove opening and closing parentheses from the phone column. Store this as a variable called phone_no_parens. Remember to use fixed()!
Add a new column to sfo_survey called phone_no_parens that contains the contents of phone_no_parens.
ex_013.R
# Remove parentheses from phone column
phone_no_parens <- sfo_survey$phone %>%
# Remove "("s
str_remove(fixed("(")) %>%
# Remove ")"s
str_remove(fixed(")"))
# Remove parentheses from phone column
phone_no_parens <- sfo_survey$phone %>%
# Remove "("s
str_remove_all(fixed("(")) %>%
# Remove ")"s
str_remove_all(fixed(")"))
# Add phone_no_parens as column
sfo_survey %>%
mutate(phone_no_parens=phone_no_parens)
# Remove parentheses from phone column
phone_no_parens <- sfo_survey$phone %>%
# Remove "("s
str_remove_all(fixed("(")) %>%
# Remove ")"s
str_remove_all(fixed(")"))
# Add phone_no_parens as column
sfo_survey %>%
mutate(
phone_no_parens = phone_no_parens,
# Replace all hyphens in phone_no_parens with spaces
phone_clean = str_replace_all(
phone_no_parens, "-", " "))27.7 Invalid phone numbers
The customer support team is grateful for your work so far, but during their first day of calling participants, they ran into some phone numbers that were invalid. In this exercise, you’ll remove any rows with invalid phone numbers so that these faulty numbers don’t keep slowing the team down.
Make dplyr, stringr and sfo_survey available.
Instructions 100 XP
- Examine the invalid phone numbers by filtering for numbers whose length is not equal to 12.
- Remove the rows with invalid numbers by filtering for numbers with a length of exactly 12
ex_014.R
# Check out the invalid numbers
sfo_survey %>%
filter(str_length(phone) != 12)
# Remove rows with invalid numbers
sfo_survey %>%
filter(str_length(phone) == 12)