27 Categorical and Text Data

Categorical and text data can often be some of the messiest parts of a dataset due to their unstructured nature. In this chapter, you’ll learn how to fix whitespace and capitalization inconsistencies in category labels, collapse multiple categories into one, and reformat strings for consistency.

27.1 Not a member

Now that you’ve practiced identifying membership constraint problems, it’s time to fix these problems in a new dataset. Throughout this chapter, you’ll be working with a dataset called sfo_survey, containing survey responses from passengers taking flights from San Francisco International Airport (SFO). Participants were asked questions about the airport’s cleanliness, wait times, safety, and their overall satisfaction.

There were a few issues during data collection that resulted in some inconsistencies in the dataset. In this exercise, you’ll be working with the dest_size column, which categorizes the size of the destination airport that the passengers were flying to. A data frame called dest_sizes is available that contains all the possible destination sizes. Your mission is to find rows with invalid dest_sizes and remove them from the data frame.

Load dplyr, sfo_survey and dest_sizes.

Instructions `100 XP`

Count the number of occurrences of each dest_size in sfo_survey.

ex_008.R


# Count the number of occurrences of dest_size
sfo_survey %>%
  count(dest_size)

# Find bad dest_size rows
sfo_survey %>% 
  # Join with dest_sizes data frame to get bad dest_size rows
  anti_join(dest_sizes, by = "dest_size") %>%
  # Select id, airline, destination, and dest_size cols
  select(id, airline, destination, dest_size)

# Remove bad dest_size rows
sfo_survey %>% 
  # Join with dest_sizes
  semi_join(dest_sizes, by = "dest_size") %>%
  # Count the number of each dest_size
  count(dest_size)

27.2 Identifiying inconsistency

In the video exercise, you learned about different kinds of inconsistencies that can occur within categories, making it look like a variable has more categories than it should.

In this exercise, you’ll continue working with the sfo_survey dataset. You’ll examine the dest_size column again as well as the cleanliness column and determine what kind of issues, if any, these two categorical variables face.

Load dplyr and sfo_survey .

Instructions `100 XP`

Count the number of occurrences of each category of the dest_size variable of sfo_survey.

ex_009.R

# Count dest_size
# Count dest_size
sfo_survey %>%
  count(dest_size)

# Count cleanliness
sfo_survey %>%
  count(cleanliness)

27.3 Correcting inconsistency

Now that you’ve identified that dest_size has whitespace inconsistencies and cleanliness has capitalization inconsistencies, you’ll use the new tools at your disposal to fix the inconsistent values in sfo_survey instead of removing the data points entirely, which could add bias to your dataset if more than 5% of the data points need to be dropped.

Load dplyr, stringr and sfo_survey.

Instructions `100 XP`

Add a column to sfo_survey called dest_size_trimmed that contains the values in the dest_size column with all leading and trailing whitespace removed.
Add another column called cleanliness_lower that contains the values in the cleanliness column converted to all lowercase.
Count the number of occurrences of each category in dest_size_trimmed.
Count the number of occurrences of each category in cleanliness_lower.

ex_010.R

# Add new columns to sfo_survey
sfo_survey <- sfo_survey %>%
  # dest_size_trimmed: dest_size without whitespace
  mutate(dest_size_trimmed =  str_trim(dest_size),
         # cleanliness_lower: cleanliness converted to lowercase
         cleanliness_lower = str_to_lower(cleanliness))

# Count values of dest_size_trimmed
sfo_survey %>%
  count(dest_size_trimmed)

# Count values of cleanliness_lower
sfo_survey %>%
  count(cleanliness_lower)

27.4 Collapsing categories

One of the tablets that participants filled out the sfo_survey on was not properly configured, allowing the response for dest_region to be free text instead of a dropdown menu. This resulted in some inconsistencies in the dest_region variable that you’ll need to correct in this exercise to ensure that the numbers you report to your boss are as accurate as possible.

Make dplyr, forcats and sfo_survey available.

Instructions `100 XP`

Count the categories of dest_region.

ex_011.R

# Count categories of dest_region
sfo_survey %>%
  count(dest_region)

# Categories to map to Europe
europe_categories <- c("EU","eur", "Europ")

# Add a new col dest_region_collapsed
sfo_survey %>%
  # Map all categories in europe_categories to Europe
  mutate(
    dest_region_collapsed = fct_collapse(dest_region, 
                                     Europe = europe_categories)) %>%
  # Count categories of dest_region_collapsed
  count(dest_region_collapsed)

27.5 Detecting inconsistency text data

You’ve recently received some news that the customer support team wants to ask the SFO survey participants some follow-up questions. However, the auto-dialer that the call center uses isn’t able to parse all of the phone numbers since they’re all in different formats. After some investigation, you found that some phone numbers are written with hyphens (-) and some are written with parentheses ((,)). In this exercise, you’ll figure out which phone numbers have these issues so that you know which ones need fixing.

Load dplyr, stringr and sfo_survey.

Instructions `100 XP`

Filter for rows with phone numbers that contain “-”s.
Filter for rows with phone numbers that contain “(”, or “)”. Remember to use fixed() when searching for parentheses.

ex_012.R


# Filter for rows with "-" in the phone column
sfo_survey %>%
  filter(str_detect(phone, "-"))

# Filter for rows with "(" or ")" in the phone column
sfo_survey %>%
  filter(str_detect(phone, fixed("(")) | str_detect(phone, fixed(")")))

27.6 Replacing and removing

In the last exercise, you saw that the phone column of sfo_survey is plagued with unnecessary parentheses and hyphens. The customer support team has requested that all phone numbers be in the format "123 456 7890". In this exercise, you’ll use your new stringr skills to fulfill this request.

Make dplyr, stringr and sfo_survey available.

Instructions `100 XP`

Remove opening and closing parentheses from the phone column. Store this as a variable called phone_no_parens. Remember to use fixed()!
Add a new column to sfo_survey called phone_no_parens that contains the contents of phone_no_parens.

ex_013.R

# Remove parentheses from phone column
phone_no_parens <- sfo_survey$phone %>%
  # Remove "("s
  str_remove(fixed("(")) %>%
  # Remove ")"s
  str_remove(fixed(")"))

  # Remove parentheses from phone column
phone_no_parens <- sfo_survey$phone %>%
  # Remove "("s
  str_remove_all(fixed("(")) %>%
  # Remove ")"s
  str_remove_all(fixed(")"))

# Add phone_no_parens as column
sfo_survey %>%
  mutate(phone_no_parens=phone_no_parens)

# Remove parentheses from phone column
phone_no_parens <- sfo_survey$phone %>%
  # Remove "("s
  str_remove_all(fixed("(")) %>%
  # Remove ")"s
  str_remove_all(fixed(")"))

# Add phone_no_parens as column
sfo_survey %>%
  mutate(
    phone_no_parens = phone_no_parens,
  # Replace all hyphens in phone_no_parens with spaces
    phone_clean = str_replace_all(
      phone_no_parens, "-", " "))

27.7 Invalid phone numbers

The customer support team is grateful for your work so far, but during their first day of calling participants, they ran into some phone numbers that were invalid. In this exercise, you’ll remove any rows with invalid phone numbers so that these faulty numbers don’t keep slowing the team down.

Make dplyr, stringr and sfo_survey available.

Instructions `100 XP`

Examine the invalid phone numbers by filtering for numbers whose length is not equal to 12.
Remove the rows with invalid numbers by filtering for numbers with a length of exactly 12

ex_014.R

# Check out the invalid numbers
sfo_survey %>%
  filter(str_length(phone) != 12)

# Remove rows with invalid numbers
sfo_survey %>%
   filter(str_length(phone) == 12)

# Categorical and Text Data Categorical and text data can often be some of the messiest parts of a dataset due to their unstructured nature. In this chapter, you’ll learn how to fix whitespace and capitalization inconsistencies in category labels, collapse multiple categories into one, and reformat strings for consistency. ## Not a member Now that you've practiced identifying membership constraint problems, it's time to fix these problems in a new dataset. Throughout this chapter, you'll be working with a dataset called `sfo_survey`, containing survey responses from passengers taking flights from San Francisco International Airport (SFO). Participants were asked questions about the airport's cleanliness, wait times, safety, and their overall satisfaction. There were a few issues during data collection that resulted in some inconsistencies in the dataset. In this exercise, you'll be working with the `dest_size` column, which categorizes the size of the destination airport that the passengers were flying to. A data frame called `dest_sizes` is available that contains all the possible destination sizes. Your mission is to find rows with invalid `dest_sizes` and remove them from the data frame. Load `dplyr`, `sfo_survey` and `dest_sizes`. ### Instructions `100 XP` {.unnumbered} - Count the number of occurrences of each `dest_size` in `sfo_survey`. ```{.r filename="ex_008.R"} # Count the number of occurrences of dest_size sfo_survey %>% count(dest_size) # Find bad dest_size rows sfo_survey %>% # Join with dest_sizes data frame to get bad dest_size rows anti_join(dest_sizes, by = "dest_size") %>% # Select id, airline, destination, and dest_size cols select(id, airline, destination, dest_size) # Remove bad dest_size rows sfo_survey %>% # Join with dest_sizes semi_join(dest_sizes, by = "dest_size") %>% # Count the number of each dest_size count(dest_size) ``` ## Identifiying inconsistency In the video exercise, you learned about different kinds of inconsistencies that can occur within categories, making it look like a variable has more categories than it should. In this exercise, you'll continue working with the sfo_survey dataset. You'll examine the `dest_size` column again as well as the cleanliness column and determine what kind of issues, if any, these two categorical variables face. Load `dplyr` and `sfo_survey` . ### Instructions `100 XP` {.unnumbered} - Count the number of occurrences of each category of the `dest_size` variable of `sfo_survey`. ```{.r filename="ex_009.R"} # Count dest_size # Count dest_size sfo_survey %>% count(dest_size) # Count cleanliness sfo_survey %>% count(cleanliness) ``` ## Correcting inconsistency Now that you've identified that dest_size has whitespace inconsistencies and cleanliness has capitalization inconsistencies, you'll use the new tools at your disposal to fix the inconsistent values in `sfo_survey` instead of removing the data points entirely, which could add bias to your dataset if more than 5% of the data points need to be dropped. Load `dplyr`, `stringr` and `sfo_survey`. ### Instructions `100 XP` {.unnumbered} - Add a column to sfo_survey called dest_size_trimmed that contains the values in the dest_size column with all leading and trailing whitespace removed. - Add another column called cleanliness_lower that contains the values in the cleanliness column converted to all lowercase. - Count the number of occurrences of each category in dest_size_trimmed. - Count the number of occurrences of each category in cleanliness_lower. ```{.r filename="ex_010.R"} # Add new columns to sfo_survey sfo_survey <- sfo_survey %>% # dest_size_trimmed: dest_size without whitespace mutate(dest_size_trimmed = str_trim(dest_size), # cleanliness_lower: cleanliness converted to lowercase cleanliness_lower = str_to_lower(cleanliness)) # Count values of dest_size_trimmed sfo_survey %>% count(dest_size_trimmed) # Count values of cleanliness_lower sfo_survey %>% count(cleanliness_lower) ``` ## Collapsing categories One of the tablets that participants filled out the `sfo_survey` on was not properly configured, allowing the response for `dest_region` to be free text instead of a dropdown menu. This resulted in some inconsistencies in the `dest_region` variable that you'll need to correct in this exercise to ensure that the numbers you report to your boss are as accurate as possible. Make `dplyr`, `forcats` and `sfo_survey` available. ### Instructions `100 XP` {.unnumbered} - Count the categories of dest_region. ```{.r filename="ex_011.R"} # Count categories of dest_region sfo_survey %>% count(dest_region) # Categories to map to Europe europe_categories <- c("EU","eur", "Europ") # Add a new col dest_region_collapsed sfo_survey %>% # Map all categories in europe_categories to Europe mutate( dest_region_collapsed = fct_collapse(dest_region, Europe = europe_categories)) %>% # Count categories of dest_region_collapsed count(dest_region_collapsed) ``` ## Detecting inconsistency text data You've recently received some news that the customer support team wants to ask the SFO survey participants some follow-up questions. However, the auto-dialer that the call center uses isn't able to parse all of the phone numbers since they're all in different formats. After some investigation, you found that some phone numbers are written with hyphens (-) and some are written with parentheses ((,)). In this exercise, you'll figure out which phone numbers have these issues so that you know which ones need fixing. Load `dplyr`, `stringr` and `sfo_survey`. ### Instructions `100 XP` {.unnumbered} - Filter for rows with phone numbers that contain "-"s. - Filter for rows with phone numbers that contain "(", or ")". Remember to use fixed() when searching for parentheses. ```{.r filename="ex_012.R"} # Filter for rows with "-" in the phone column sfo_survey %>% filter(str_detect(phone, "-")) # Filter for rows with "(" or ")" in the phone column sfo_survey %>% filter(str_detect(phone, fixed("(")) | str_detect(phone, fixed(")"))) ``` ## Replacing and removing In the last exercise, you saw that the `phone` column of `sfo_survey` is plagued with unnecessary parentheses and hyphens. The customer support team has requested that all phone numbers be in the format `"123 456 7890"`. In this exercise, you'll use your new stringr skills to fulfill this request. Make `dplyr`, `stringr` and `sfo_survey` available. ### Instructions `100 XP` {.unnumbered} - Remove opening and closing parentheses from the phone column. Store this as a variable called phone_no_parens. Remember to use fixed()! - Add a new column to sfo_survey called phone_no_parens that contains the contents of phone_no_parens. ```{.r filename="ex_013.R"} # Remove parentheses from phone column phone_no_parens <- sfo_survey$phone %>% # Remove "("s str_remove(fixed("(")) %>% # Remove ")"s str_remove(fixed(")")) # Remove parentheses from phone column phone_no_parens <- sfo_survey$phone %>% # Remove "("s str_remove_all(fixed("(")) %>% # Remove ")"s str_remove_all(fixed(")")) # Add phone_no_parens as column sfo_survey %>% mutate(phone_no_parens=phone_no_parens) # Remove parentheses from phone column phone_no_parens <- sfo_survey$phone %>% # Remove "("s str_remove_all(fixed("(")) %>% # Remove ")"s str_remove_all(fixed(")")) # Add phone_no_parens as column sfo_survey %>% mutate( phone_no_parens = phone_no_parens, # Replace all hyphens in phone_no_parens with spaces phone_clean = str_replace_all( phone_no_parens, "-", " ")) ``` ## Invalid phone numbers The customer support team is grateful for your work so far, but during their first day of calling participants, they ran into some phone numbers that were invalid. In this exercise, you'll remove any rows with invalid phone numbers so that these faulty numbers don't keep slowing the team down. Make `dplyr`, `stringr` and `sfo_survey` available. ### Instructions `100 XP` {.unnumbered} - Examine the invalid phone numbers by filtering for numbers whose length is not equal to 12. - Remove the rows with invalid numbers by filtering for numbers with a length of exactly 12 ```{.r filename="ex_014.R"} # Check out the invalid numbers sfo_survey %>% filter(str_length(phone) != 12) # Remove rows with invalid numbers sfo_survey %>% filter(str_length(phone) == 12) ```

27.1 Not a member

Instructions 100 XP

27.2 Identifiying inconsistency

Instructions 100 XP

27.3 Correcting inconsistency

Instructions 100 XP

27.4 Collapsing categories

Instructions 100 XP

27.5 Detecting inconsistency text data

Instructions 100 XP

27.6 Replacing and removing

Instructions 100 XP

27.7 Invalid phone numbers

Instructions 100 XP

Instructions `100 XP`

Instructions `100 XP`

Instructions `100 XP`

Instructions `100 XP`

Instructions `100 XP`

Instructions `100 XP`

Instructions `100 XP`