37 Case Study: Joins on Stack Overflow Data
Put together all the types of join you learned in this course to analyze a new dataset: Stack Overflow questions, answers, and tags. This includes calculating and visualizing trends for some notable tags like dplyr and ggplot2. You’ll also master one more method for combining tables, the bind_rows verb, which stacks tables on top of each other.
Instructions 100 XP
Join together questions and question_tags using the id and question_id columns, respectively.
ex_025.R
# Join the questions and question_tags tables
questions %>%
left_join(
question_tags,
by = c("id" = "question_id")
)
# Join in the tags table
questions %>%
left_join(
question_tags,
by = c("id" = "question_id")
) %>%
left_join(
tags,
by = c("tag_id" = "id")
)
# Replace the NAs in the tag_name column
questions %>%
left_join(
question_tags,
by = c("id" = "question_id")
) %>%
left_join(
tags,
by = c("tag_id" = "id")
) %>%
replace_na(
list(tag_name = "only-r")
)Instructions 100 XP
- Aggregate by the
tag_name. - Summarize to get the mean
scorefor each question,score, as well as the total number of questions,num_questions. - Arrange
num_questionsin descending order to sort the answers by the most asked questions.
ex_026.R
questions_with_tags %>%
# Group by tag_name
group_by(tag_name) %>%
# Get mean score and num_questions
summarize(score = mean(score),
num_questions = n()) %>%
# Sort num_questions in descending order
arrange(desc(num_questions))Instructions 100 XP
- Use a join to determine which tags never appear on an R question.
ex_027.R
tags %>%
anti_join(question_tags, by = c("id" = "tag_id"))37.4 Finding gaps between questions and answers
Now we’ll join together questions with answers so we can measure the time between questions and answers.
Make sure to explore the tables and columns in the console before starting the exercise. Can you tell how are questions identified in the questions table? How can you identify which answer corresponds to which question using the answers table?
Instructions 100 XP
Use an inner join to combine the
questionsandanswerstables using the suffixes"_question"and"_answer", respectively.Subtract
creation_date_questionfromcreation_date_answerwithin theas.integer()function to create thegapcolumn.
ex_028.R
questions %>%
# Inner join questions and answers with proper suffixes
inner_join(
answers,
by = c("id" = "question_id"),
suffix = c("_question", "_answer")
) %>%
# Subtract creation_date_question from creation_date_answer to create gap
mutate(
gap = as.integer(
as.integer(
creation_date_answer - creation_date_question
)
)
)37.5 Joining questions, answers, and counts
We can also determine how many questions actually yield answers. If we count the number of answers for each question, we can then join the answers counts with the questions table.
Instructions 100 XP
- Count and sort the
question_idcolumn in theanswerstable to create theanswer_countstable. - Join the questions table with the
answer_countstable and include all observations from thequestionstable. -Replace the NA values in thencolumn with 0s.
ex_029.R
# Count and sort the question id column in the answers table
answer_counts <- answers %>%
count(question_id, sort = TRUE)
# Combine the answer_counts and questions tables
questions %>%
left_join(
answer_counts,
by = c("id" = "question_id")
) %>%
# Replace the NAs in the n column
replace_na(list(n=0))Instructions 100 XP
- Combine the
question_tagstable withquestion_answer_countsusing aninner_join. - Now, use another
inner_jointo add thetagstable.
ex_030.R
question_answer_counts %>%
# Join the question_tags tables
inner_join(
question_tags,
by = c("id" = "question_id")
) %>%
# Join the tags table
inner_join(tags, by = c("tag_id" = "id"))37.7 Average answers by question
The table you created in the last exercise has been preloaded for you as tagged_answers. You can use this table to determine, on average, how many answers each questions gets.
tagged_answers <- question_answer_counts %>%
inner_join(question_tags, by = c("id" = "question_id")) %>%
inner_join(tags, by = c("tag_id" = "id"))Some of the important variables from this table include: n, the number of answers for each question, and tag_name, the name of each tag associated with each question.
Let’s use some of our favorite dplyr verbs to find out how many answers each question gets on average.
Instructions 100 XP
- Aggregate the
tagged_answerstable bytag_name. - Summarize
tagged_answersto get the count of questions and theaverage_answers. - Sort the resulting questions column in descending order.
ex_031.R
tagged_answers %>%
# Aggregate by tag_name
group_by(tag_name) %>%
# Summarize questions and average_answers
summarize(
questions = n(),
average_answers = mean(n)
) %>%
# Sort the questions in descending order
arrange(desc(questions))Instructions 100 XP
- Use two inner joins to combine the
question_tagsand tags tables with the questions table. - Now, use two inner joins to combine the
question_tagsandtagstables with theanswerstable.
ex_032.R
# Inner join the question_tags and tags tables with the questions table
questions %>%
inner_join(
question_tags,
by = c("id" = "question_id")
) %>%
inner_join(
tags,
by = c("tag_id" = "id")
)
# Inner join the question_tags and tags tables with the answers table
answers %>%
inner_join(question_tags, by = "question_id") %>%
inner_join(tags, by = c("tag_id" = "id"))
Instructions 100 XP
- Combine the
questions_with_tagsandanswers_with_tagstables intoposts_with_tags. - Add a year column to the
posts_with_tagstable, then count posts by type, year, and tag_name.
ex_033.R
# Combine the two tables into posts_with_tags
posts_with_tags <-
bind_rows(
questions_with_tags %>%
mutate(type = "question"),
answers_with_tags %>%
mutate(type = "answer")
)
# Add a year column, then count by type, year, and tag_name
posts_with_tags %>%
mutate(year = year(creation_date)) %>%
count(type, year, tag_name)Instructions 100 XP
- Filter the
by_type_year_tagtable for the dplyr and ggplot2 tags. - Create a line plot with that filtered table that plots the frequency (
n) over time, colored by question/answer and faceted by tag.
ex_034.R
# Filter for the dplyr and ggplot2 tag names
by_type_year_tag_filtered <- by_type_year_tag %>%
filter(
tag_name %in% c("dplyr", "ggplot2")
)
# Create a line plot faceted by the tag name
ggplot(
by_type_year_tag_filtered,
aes(year, n, color = type)) +
geom_line() +
facet_wrap(~ tag_name)