17  Importing data from flat files with utils

A lot of data comes in the form of flat files: simple tabular text files. Learn how to import the common formats of flat file data with base R functions.

17.1 read.csv

The utils package, which is automatically loaded in your R session on startup, can import CSV files with the read.csv() function.

In this exercise, you’ll be working with swimming_pools.csv (view); it contains data on swimming pools in Brisbane, Australia (Source: data.gov.au). The file contains the column names in the first row. It uses a comma to separate values within rows.

Type dir() in the console to list the files in your working directory. You’ll see that it contains swimming_pools.csv, so you can start straight away.

17.1.1 Instructions 100 XP

  • Use read.csv() to import “swimming_pools.csv” as a data frame with the name pools.
  • Print the structure of pools using str().
ex_001.R
# Import swimming_pools.csv: pools
pools <- read.csv("swimming_pools.csv")
# Print the structure of pools
str(pools)

References

17.2 stringsAsFactors

With stringsAsFactors, you can tell R whether it should convert strings in the flat file to factors.

For all importing functions in the utils package, this argument is TRUE, which means that you import strings as factors. This only makes sense if the strings you import represent categorical variables in R. If you set stringsAsFactors to FALSE, the data frame columns corresponding to strings in your text file will be character.

You’ll again be working with the swimming_pools.csv (view in data folder) file. It contains two columns (Name and Address), which shouldn’t be factors.

Instructions 100 XP

  • Use read.csv() to import the data in "swimming_pools.csv" as a data frame called pools; make sure that strings are imported as characters, not as factors.

  • Using str(), display the structure of the dataset and check that you indeed get character vectors instead of factors.

ex_002.R
# Import swimming_pools.csv correctly: pools
pools <- read.csv("swimming_pools.csv", stringsAsFactors = FALSE)

# Check the structure of pools
str(pools)

17.3 Any changes?

Consider the code below that loads data from swimming_pools.csv in two distinct ways:

# Option A
pools <- read.csv("swimming_pools.csv", stringsAsFactors = TRUE)

# Option B
pools <- read.csv("swimming_pools.csv", stringsAsFactors = FALSE)
ex_003.R
library(projmgr)

# the following could be run in RMarkdown
todo_path <- system.file(
    "extdata",
    "todo-ex.yml",
     package = "projmgr",
     mustWork = TRUE
)

my_todo <- read_todo(todo_path)
report_todo(my_todo)

17.4 read.delim

Aside from .csv files, there are also the .txt files which are basically text files. You can import these functions with read.delim(). By default, it sets the sep argument to "\t" (fields in a record are delimited by tabs) and the header argument to TRUE (the first row contains the field names).

In this exercise, you will import hotdogs.txt (view), containing information on sodium and calorie levels in different hotdogs (Source: UCLA). The dataset has 3 variables, but the variable names are not available in the first line of the file. The file uses tabs as field separators.

Instructions 100 XP

  • Import the data in "hotdogs.txt" with read.delim(). Call the resulting data frame hotdogs. The variable names are not on the first line, so make sure to set the header argument appropriately.
  • Call summary() on hotdogs. This will print out some summary statistics about all variables in the data frame.
ex_004.R
# Import hotdogs.txt: hotdogs
hotdogs <- read.delim(
    "hotdogs.txt",
    sep = '\t',
    header = FALSE
)
# Summarize hotdogs
summary(hotdogs)

17.5 read.table

If you’re dealing with more exotic flat file formats, you’ll want to use read.table(). It’s the most basic importing function; you can specify tons of different arguments in this function. Unlike read.csv() and read.delim(), the header argument defaults to FALSE and the sep argument is “” by default.

Up to you again! The data is still hotdogs.txt (view). It has no column names in the first row, and the field separators are tabs. This time, though, the file is in the data folder inside your current working directory. A variable path with the location of this file is already coded for you.

Instructions 100 XP

  • Finish the read.table() call that’s been prepared for you. Use the path variable, and make sure to set sep correctly.
  • Call head() on hotdogs; this will print the first 6 observations in the data frame.
ex_005.R
# Path to the hotdogs.txt file: path
path <- file.path("data", "hotdogs.txt")

# Import the hotdogs.txt file: hotdogs
hotdogs <- 
    read.table(
        path, 
        sep = '\t',
        col.names = c("type", "calories", "sodium")
    )

# Call head() on hotdogs
head(hotdogs)

17.6 Arguments

Lily and Tom are having an argument because they want to share a hot dog but they can’t seem to agree on which one to choose. After some time, they simply decide that they will have one each. Lily wants to have the one with the fewest calories while Tom wants to have the one with the most sodium.

Next to calories and sodium, the hotdogs have one more variable: type. This can be one of three things: Beef, Meat, or Poultry, so a categorical variable: a factor is fine.

Instructions 100 XP

  • Finish the read.delim() call to import the data in “hotdogs.txt”. It’s a tab- delimited file without names in the first row.
  • The code that selects the observation with the lowest calorie count and stores it in the variable lily is already available. It uses the function which.min(), that returns the index the smallest value in a vector.
  • Do a similar thing for Tom: select the observation with the most sodium and store it in tom. Use which.max() this time.
  • Finally, print both the observations lily and tom.
ex_006.R
# Finish the read.delim() call
hotdogs <- 
    read.delim(
        "hotdogs.txt",
        header = FALSE,
        col.names = c("type", "calories", "sodium")
    )

# Select the hot dog with the least calories: lily
lily <- hotdogs[which.min(hotdogs$calories), ]

# Select the observation with the most sodium: tom

tom <- hotdogs[which.max(hotdogs$sodium), ]
# Print lily and tom
lily
tom

17.7 Column classes

Next to column names, you can also specify the column types or column classes of the resulting data frame. You can do this by setting the colClasses argument to a vector of strings representing classes:

read.delim("my_file.txt", 
           colClasses = c("character",
                          "numeric",
                          "logical"))

This approach can be useful if you have some columns that should be factors and others that should be characters. You don’t have to bother with stringsAsFactors anymore; just state for each column what the class should be.

If a column is set to “NULL” in the colClasses vector, this column will be skipped and will not be loaded into the data frame.

Instructions 100 XP

  • The read.delim() call from before is already included and creates the hotdogs data frame. Go ahead and display the structure of hotdogs.
  • Edit the second read.delim() call. Assign the correct vector to the colClasses argument. NA should be replaced with a character vector: c("factor", "NULL", "numeric").
  • Display the structure of hotdogs2 and look for the difference.
ex_007.R
# Previous call to import hotdogs.txt
hotdogs <- 
    read.delim(
        "hotdogs.txt",
         header = FALSE,
         col.names = c("type", "calories", "sodium")
    )

# Display structure of hotdogs
str(hotdogs)

# Edit the colClasses argument to import the data correctly: hotdogs2
hotdogs2 <- read.delim(
    "hotdogs.txt",
    header = FALSE,
    col.names = c("type", "calories", "sodium"),
    colClasses =  c("factor", "NULL", "numeric")
)


# Display structure of hotdogs2
str(hotdogs2)