12  How to Write a Function

12.1 Calling functions

One way to make your code more readable is to be careful about the order you pass arguments when you call functions, and whether you pass the arguments by position or by name.

gold_medals, a numeric vector of the number of gold medals won by each country in the 2016 Summer Olympics, is provided.

For convenience, the arguments of median() and rank() are displayed using args(). Setting rank()’s na.last argument to “keep” means “keep the rank of NA values as NA”.

Best practice for calling functions is to pass arguments in the order shown by args(), and to name only the rarely used arguments.
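As a small illustration of these conventions, the sketch below calls median() and rank() on a made-up vector. The name medals and its values are assumptions standing in for gold_medals, which is not reproduced here.

```r
# Hypothetical stand-in for gold_medals (values invented for illustration)
medals <- c(46, 27, NA, 26, 19)

# Inspect the signatures to see the argument order
args(median)
args(rank)

# Common first argument by position, rarer argument by name
med <- median(medals, na.rm = TRUE)

# Rank from most to fewest medals; the NA keeps an NA rank
# because na.last = "keep"
ranks <- rank(-medals, na.last = "keep", ties.method = "min")
```

Passing the data by position and naming na.rm, na.last, and ties.method keeps the call short while making the less familiar options explicit.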

Instructions 100 XP

  • The final line calculates the median number of gold medals each country won.
  • Rewrite the call to median(), following best practices.
ex_01.R
# Look at the gold medals data
gold_medals

# Note the arguments to median()
args(median)

# Rewrite this function call, following best practices
median(gold_medals, na.rm = TRUE)

12.2 The benefits of writing functions

There are lots of great reasons that you should write your own functions.

Which of these is not one of them?

Answer the question 50 XP

Possible Answers

  • You can type less code, saving effort and making your analyses more readable.
  • You make fewer “copy and paste”-related errors.
  • You can reuse your code from project to project.
  • You can make your code harder to read, potentially improving your job security because only you can maintain it.

12.3 Your first function: tossing a coin

Time to write your first function! It’s a really good idea when writing functions to start simple. You can always make a function more complicated later if it’s really necessary, so let’s not worry about arguments for now.

Instructions 100 XP

  • Simulate a single coin toss by using sample() to sample from coin_sides once.
  • Write a template for your function, naming it toss_coin. The function should take no arguments. Don’t include the body of the function yet.
  • Copy your script, and paste it into the function body.
  • Call your function.
ex_02.R
coin_sides <- c("head", "tail")

# Sample from coin_sides once
sample(coin_sides, 1)

# Write a template for your function, toss_coin()
toss_coin <- function() {
  # (Leave the contents of the body for later)
  # Add punctuation to finish the body
}

# Your script, from a previous step
coin_sides <- c("head", "tail")

# Paste your script into the function body
toss_coin <- function() {
  sample(coin_sides, 1)
}

# Your functions, from previous steps
toss_coin <- function() {
  coin_sides <- c("head", "tail")
  sample(coin_sides, 1)
}

# Call your function
toss_coin()

12.4 Inputs to functions

Most functions require some sort of input to determine what to compute. The inputs to functions are called arguments. You specify them inside the parentheses after the word “function.”

As mentioned in the video, the following exercises assume that you are using sample() to do random sampling.
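To make the idea of a signature concrete, here is a minimal sketch of a one-argument function. The names roll_die, n_rolls, and die_faces are hypothetical and chosen only for illustration; they are not part of the exercises.

```r
# Hypothetical example: roll_die() and n_rolls are illustrative names
roll_die <- function(n_rolls) {
  die_faces <- 1:6
  sample(die_faces, n_rolls, replace = TRUE)
}

# Pass the argument by position
rolls <- roll_die(5)

# names(formals()) lists the argument names declared in the signature
arg_names <- names(formals(roll_die))
```

The argument n_rolls appears in two places: inside the parentheses after function, and inside the body where it is used.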

Instructions 100 XP

  • Sample from coin_sides n_flips times with replacement.
  • Update the definition of toss_coin() to accept a single argument, n_flips. The function should sample coin_sides n_flips times with replacement. Remember to change the signature and the body.
  • Generate 10 coin flips.
ex_03.R
coin_sides <- c("head", "tail")
n_flips <- 10

# Sample from coin_sides n_flips times with replacement
sample(coin_sides, n_flips, replace = TRUE)

# Update the function to return n coin tosses
toss_coin <- function(n_flips) {
  coin_sides <- c("head", "tail")
  sample(coin_sides, n_flips, replace = TRUE)
}

# Generate 10 coin tosses
toss_coin(10)

12.5 Multiple inputs to functions

If a function should have more than one argument, list them in the function signature, separated by commas.

To solve this exercise, you need to know how to specify sampling weights to sample(). Set the prob argument to a numeric vector with the same length as x. Each value of prob is the probability of sampling the corresponding element of x, so their values add up to one. In the following example, each sample has a 20% chance of “bat”, a 30% chance of “cat” and a 50% chance of “rat”.

sample(c("bat", "cat", "rat"), 10, replace = TRUE, prob = c(0.2, 0.3, 0.5))
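One way to convince yourself that prob behaves as described is to draw a large sample and compare the observed shares with the weights. The sketch below does this; the seed, the sample size of 10000, and the variable names are arbitrary choices made for illustration.

```r
set.seed(42)  # arbitrary seed, only so the sketch is repeatable

animals <- c("bat", "cat", "rat")
weights <- c(0.2, 0.3, 0.5)

# The weights are meant to be probabilities, so check they sum to one
stopifnot(isTRUE(all.equal(sum(weights), 1)))

# Draw many samples so the observed shares settle near the weights
draws <- sample(animals, 10000, replace = TRUE, prob = weights)
observed <- prop.table(table(draws))
observed
```

Note that sample() treats prob as weights and does not itself require them to sum to one, but keeping them normalized lets you read each value directly as a probability.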

Instructions 100 XP

  • Bias the coin by weighting the sampling. Specify the prob argument so that heads are sampled with probability p_head (and tails are sampled with probability 1 - p_head).
  • Update the definition of toss_coin() so it accepts an argument, p_head, and weights the samples using the code you wrote in the previous step.
  • Generate 10 coin tosses with an 80% chance of heads on each toss.
ex_04.R

coin_sides <- c("head", "tail")
n_flips <- 10
p_head <- 0.8

# Define a vector of weights
weights <- c(p_head, 1 - p_head)

# Update so that heads are sampled with prob p_head
sample(coin_sides, n_flips, replace = TRUE, prob = weights)

# Update the function so heads have probability p_head
toss_coin <- function(n_flips, p_head) {
  coin_sides <- c("head", "tail")
  # Define a vector of weights
  weights <- c(p_head, 1 - p_head)
  # Modify the sampling to be weighted
  sample(coin_sides, n_flips, replace = TRUE, prob = weights)
}

# Generate 10 coin tosses
toss_coin(10, p_head = 0.8)

12.6 Renaming GLM

R’s generalized linear regression function, glm(), suffers the same usability problems as lm(): its name is an acronym, and its formula and data arguments are in the wrong order.

To solve this exercise, you need to know two things about generalized linear regression:

  • glm() formulas are specified like lm() formulas: the response is on the left, and explanatory variables are added on the right.
  • To model count data, set glm()’s family argument to poisson, making it a Poisson regression.

Here you’ll use data on the number of yearly visits to Snake River at Jackson Hole, Wyoming, snake_river_visits.
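Because snake_river_visits is not reproduced here, the following sketch fits a Poisson regression to warpbreaks, a dataset that ships with R, purely as a stand-in. The choice of dataset and formula is an assumption for illustration and is not part of the exercise.

```r
# warpbreaks ships with R; breaks is a count of warp breaks per loom
fit <- glm(breaks ~ wool + tension, data = warpbreaks, family = poisson)

# Coefficients are on the log scale: poisson uses a log link by default
coef(fit)

# type = "response" returns predictions on the count scale
head(predict(fit, type = "response"))
```

The call has the same shape as the exercise: a formula with the response on the left, a data argument, and family set to poisson.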

Instructions 100 XP

  • Run a generalized linear regression by calling glm(). Model n_visits vs. gender, income, and travel on the snake_river_visits dataset, setting the family to poisson.

  • Define a function, run_poisson_regression(), to run a Poisson regression. This should take two arguments: data and formula, and call glm(), passing those arguments and setting family to poisson.

  • Recreate the Poisson regression model from the first step, this time by calling your run_poisson_regression() function.

ex_05.R
# Run a generalized linear regression 
glm(
  # Model no. of visits vs. gender, income, travel
  n_visits ~ gender + income + travel, 
  # Use the snake_river_visits dataset
  data = snake_river_visits, 
  # Make it a Poisson regression
  family = poisson
)

# Write a function to run a Poisson regression
run_poisson_regression <- function(data, formula) {
  glm(formula, data, family = poisson)
}

# From previous step
run_poisson_regression <- function(data, formula) {
  glm(formula, data, family = poisson)
}

# Load dplyr for %>%, mutate(), and arrange()
library(dplyr)

# Re-run the Poisson regression, using your function
model <- snake_river_visits %>%
  run_poisson_regression(n_visits ~ gender + income + travel)

# Run this to see the predictions
snake_river_explanatory %>%
  mutate(predicted_n_visits = predict(model, ., type = "response")) %>%
  arrange(desc(predicted_n_visits))

12.7 Numeric defaults

cut_by_quantile() converts a numeric vector into a categorical variable where quantiles define the cut points. This is a useful function, but at the moment you have to specify five arguments to make it work. This is too much thinking and typing.
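To see how quantiles can define cut points, the sketch below walks through the steps on made-up data. The vector x (the integers 1 to 100) is a hypothetical stand-in for a real numeric variable, and the code uses only base R's seq(), quantile(), and cut().

```r
# Hypothetical data: the integers 1 to 100 stand in for a numeric vector
x <- 1:100
n <- 5

# n categories need n + 1 cut points, evenly spaced in probability
probs <- seq(0, 1, length.out = n + 1)
qtiles <- quantile(x, probs, names = FALSE)

# include.lowest = TRUE keeps the minimum value inside the first interval
categories <- cut(x, qtiles, include.lowest = TRUE)
table(categories)
```

With this evenly spread input, each of the five categories holds 20 values, which is the behavior you would expect from cutting at quantiles.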

By specifying default arguments, you can make it easier to use. Let’s start with n, which specifies how many categories to cut x into.
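A default is set by writing name = value in the function signature. The sketch below shows the pattern on a deliberately tiny function; raise_to_power and exponent are hypothetical names used only for illustration.

```r
# Hypothetical function: raise_to_power() and exponent are illustrative names
raise_to_power <- function(x, exponent = 2) {
  x ^ exponent
}

squared <- raise_to_power(3)              # exponent falls back to 2
cubed <- raise_to_power(3, exponent = 3)  # the default is overridden

# formals() shows the defaults stored in the signature
formals(raise_to_power)$exponent
```

Callers who are happy with the default can leave the argument out, while anyone who needs a different value can still supply it by name.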

A numeric vector of the number of visits to Snake River is provided as n_visits.

Instructions 100 XP

  • Update the definition of cut_by_quantile() so that the n argument defaults to 5.
  • Remove the n argument from the call to cut_by_quantile().

ex_06.R
# Set the default for n to 5
cut_by_quantile <- function(x, n = 5, na.rm, labels, interval_type) {
  probs <- seq(0, 1, length.out = n + 1)
  qtiles <- quantile(x, probs, na.rm = na.rm, names = FALSE)
  right <- switch(interval_type, "(lo, hi]" = TRUE, "[lo, hi)" = FALSE)
  cut(x, qtiles, labels = labels, right = right, include.lowest = TRUE)
}

# Remove the n argument from the call
cut_by_quantile(
  n_visits, 
  na.rm = FALSE, 
  labels = c("very low", "low", "medium", "high", "very high"),
  interval_type = "(lo, hi]"
)

# Check the arguments and their defaults
formals(cut_by_quantile)

12.8 Logical defaults

cut_by_quantile() is now slightly easier to use, but you still always have to specify the na.rm argument. This argument controls whether missing values are removed; it behaves the same as the na.rm argument to mean() or sd().

Where functions have an argument for removing missing values, the best practice is to not remove them by default (in case you hadn’t spotted that you had missing values). That means that the default for na.rm should be FALSE.
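The reasoning behind this default can be seen with mean(), whose na.rm argument already defaults to FALSE. The vector visits below is made-up data for illustration.

```r
# Hypothetical data with one missing value
visits <- c(2, NA, 4)

# With the default na.rm = FALSE, the NA propagates and flags the problem
default_mean <- mean(visits)

# Opting in to removal is an explicit, visible decision
cleaned_mean <- mean(visits, na.rm = TRUE)
```

Here default_mean is NA, which alerts you to the missing value, whereas a default of TRUE would have quietly returned the mean of the remaining values.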

Instructions 100 XP

  • Update the definition of cut_by_quantile() so that the na.rm argument defaults to FALSE.
  • Remove the na.rm argument from the call to cut_by_quantile().

ex_07.R
# Set the default for na.rm to FALSE
cut_by_quantile <- function(x, n = 5, na.rm = FALSE, labels, interval_type) {
  probs <- seq(0, 1, length.out = n + 1)
  qtiles <- quantile(x, probs, na.rm = na.rm, names = FALSE)
  right <- switch(interval_type, "(lo, hi]" = TRUE, "[lo, hi)" = FALSE)
  cut(x, qtiles, labels = labels, right = right, include.lowest = TRUE)
}

# Remove the na.rm argument from the call
cut_by_quantile(
  n_visits, 
  labels = c("very low", "low", "medium", "high", "very high"),
  interval_type = "(lo, hi]"
)