18 readr & data.table

In addition to base R, there are dedicated packages to easily and efficiently import flat file data. We’ll talk about two such packages: readr and data.table.

18.1 read_csv

CSV files can be imported with read_csv(). It’s a wrapper function around read_delim() that handles all the details for you. For example, it will assume that the first row contains the column names.

The dataset you’ll be working with here is potatoes.csv (view in dataset folder). It gives information on the impact of storage period and cooking on potatoes’ flavor. It uses commas to delimit fields in a record, and contains column names in the first row. The file is available in your workspace.

Instructions `100 XP`

Load the readr package with library().

Import "potatoes.csv" using read_csv(). Assign the resulting data frame to the variable potatoes.

ex_008.R

# Load the readr package
library(readr)

# Import potatoes.csv with read_csv(): potatoes
potatoes <- read_csv("potatoes.csv")

18.2 read_tsv

Where you use read_csv() to read in CSV files, you use read_tsv() to read in TSV files. TSV is short for tab-separated values.

This time, the potatoes data comes in the form of a tab-separated values file: potatoes.txt. In contrast to potatoes.csv, however, this file does not contain column names in the first row.

There’s a vector properties that you can use to specify these column names manually.

Instructions `100 XP`

Use read_tsv() to import the potatoes data from potatoes.txt and store it in the data frame potatoes. In addition to the path to the file, you’ll also have to specify the col_names argument; you can use the properties vector for this.
Call head() on potatoes to show the first observations of your dataset.

ex_009.R

# readr is already loaded

# Column names
properties <- c(
    "area",
    "temp",
    "size",
    "storage",
    "method",
    "texture",
    "flavor",
    "moistness"
)

# Import potatoes.txt: potatoes
potatoes <- read_tsv(
    "potatoes.txt",
    col_names = properties
)

# Call head() on potatoes
head(potatoes)
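
To see why col_names matters for a file without a header row, here is a minimal sketch. It assumes a small made-up file written to a temporary location; tsv_path and its contents are hypothetical and not part of the exercise data.

```r
library(readr)

# Hypothetical two-record, tab-separated file without a header line
tsv_path <- tempfile(fileext = ".txt")
writeLines(c("1\t2\t3", "4\t5\t6"), tsv_path)

# Default (col_names = TRUE): the first record is used up as column names
with_default <- read_tsv(tsv_path)

# Supplying names keeps every record as data
with_names <- read_tsv(tsv_path, col_names = c("a", "b", "c"))

nrow(with_default)  # one record fewer than the file holds
nrow(with_names)    # all records kept
names(with_names)
```

Comparing the two row counts shows that, without col_names, the first observation is silently lost to the header.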

18.3 read_delim

Just as read.table() was the main utils function, read_delim() is the main readr function.

read_delim() takes two mandatory arguments:

file: the file that contains the data
delim: the character that separates the values in the data file

You’ll again be working with potatoes.txt (view); the file uses tabs ("\t") to delimit values and does not contain column names in its first line. It’s available in your working directory so you can start right away. As before, the vector properties is available to set the col_names.

Instructions `100 XP`

Import all the data in "potatoes.txt" using read_delim(); store the resulting data frame in potatoes.
Print out potatoes.

ex_010.R

# readr is already loaded

# Column names
properties <- c(
    "area",
    "temp",
    "size",
    "storage",
    "method",
    "texture",
    "flavor",
    "moistness"
)

# Import potatoes.txt using read_delim(): potatoes
potatoes <- read_delim(
    "potatoes.txt",
    delim = "\t",
    col_names = properties
)
# Print out potatoes
potatoes
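
The relationship between read_delim() and its wrappers can be checked with a short sketch. It assumes two small made-up files with a header line (tsv_path and csv_path are hypothetical names) and compares the imported columns directly.

```r
library(readr)

# Hypothetical tab-separated file with a header line
tsv_path <- tempfile(fileext = ".txt")
writeLines(c("id\tscore", "1\t2.5", "2\t3.5"), tsv_path)

# The same content, comma-separated
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,score", "1,2.5", "2,3.5"), csv_path)

via_tsv   <- read_tsv(tsv_path)
via_delim <- read_delim(tsv_path, delim = "\t")
via_csv   <- read_csv(csv_path)
via_comma <- read_delim(csv_path, delim = ",")

# Compare column by column; the wrapper and read_delim() should agree
identical(via_tsv$id, via_delim$id)
identical(via_tsv$score, via_delim$score)
identical(via_csv$score, via_comma$score)
```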

18.4 skip and n_max

Through skip and n_max you can control which part of your flat file you’re actually importing into R.

skip specifies the number of lines you’re ignoring in the flat file before actually starting to import data.
n_max specifies the number of lines you’re actually importing. For example, if you have a CSV file with 20 lines and set skip = 2 and n_max = 3, you read in only lines 3, 4 and 5 of the file.

Watch out: Once you skip some lines, you also skip the first line that can contain column names!

potatoes.txt, a flat file with tab-delimited records and without column names, is available in your workspace.

Instructions `100 XP`

Finish the read_tsv() call to import observations 7, 8, 9, 10 and 11 from potatoes.txt.

ex_011.R

# readr is already loaded

# Column names
properties <- c(
    "area",
    "temp",
    "size",
    "storage",
    "method",
    "texture",
    "flavor",
    "moistness"
)

# Import 5 observations from potatoes.txt: potatoes_fragment
potatoes_fragment <- read_tsv(
    "potatoes.txt",
    skip = 6, 
    n_max = 5,
    col_names = properties
)
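
The 20-line example from the text can be reproduced with a small sketch. It assumes a made-up file (lines_path is a hypothetical name) in which each line records its own line number, so the imported fragment shows directly which lines were read.

```r
library(readr)

# Hypothetical 20-line CSV without a header: line k holds k and k * 10
lines_path <- tempfile(fileext = ".csv")
writeLines(paste(1:20, (1:20) * 10, sep = ","), lines_path)

# Skip the first 2 lines, then read at most 3 records
fragment <- read_csv(
    lines_path,
    skip = 2,
    n_max = 3,
    col_names = c("line", "value")
)

# The line numbers that were actually imported
fragment$line
```

Because col_names is supplied, none of the three imported lines is consumed as a header.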

18.5 col_types

You can also specify which types the columns in your imported data frame should have. You can do this with col_types. If set to NULL, the default, functions from the readr package will try to find the correct types themselves. You can manually set the types with a string, where each character denotes the class of one column: c for character, d for double, i for integer and l for logical. An underscore (_) skips the column as a whole.

potatoes.txt, a flat file with tab-delimited records and without column names, is again available in your workspace.

Instructions `100 XP`

In the read_tsv() call, edit the col_types argument to import all columns as characters (c). Store the resulting data frame in potatoes_char.
Print out the structure of potatoes_char and verify whether all column types are chr, short for character.

ex_012.R

# Column names
properties <- c(
    "area",
    "temp",
    "size",
    "storage",
    "method",
    "texture",
    "flavor",
    "moistness"
)

# Import all data, but force all columns to be character: potatoes_char
potatoes_char <- read_tsv(
    "potatoes.txt",
    col_types = "cccccccc",
    col_names = properties
)
# Print out structure of potatoes_char
str(potatoes_char)
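
As a further illustration of the compact string form, the sketch below mixes several type codes and skips one column. The file is made up for this example (types_path and its column names are hypothetical) and, unlike potatoes.txt, it includes a header line.

```r
library(readr)

# Hypothetical tab-separated file with a header line and four columns
types_path <- tempfile(fileext = ".txt")
writeLines(
    c("label\tcount\tscore\tflag", "A\t1\t2.5\tTRUE", "B\t2\t3.5\tFALSE"),
    types_path
)

# One character per column: character, integer, skip, logical
typed <- read_tsv(types_path, col_types = "ci_l")

# The skipped column (score) is absent from the result
str(typed)
```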

18.6 col_types with collectors

Another way of setting the types of the imported columns is using collectors. Collector functions can be passed in a list() to the col_types argument of read_ functions to tell them how to interpret values in a column.

For a complete list of collector functions, you can take a look at the collector documentation. For this exercise you will need two collector functions:

col_integer(): the column should be interpreted as an integer.
col_factor(levels, ordered = FALSE): the column should be interpreted as a factor with levels.

In this exercise, you will work with hotdogs.txt, which is a tab-delimited file without column names in the first row.

Instructions `100 XP`

hotdogs is created for you without setting the column types. Inspect its summary using the summary() function.
Two collector functions are defined for you: fac and int. Have a look at them: do you understand what they’re collecting?
In the second read_tsv() call, edit the col_types argument: pass a list() with the elements fac, int and int, so that the first column is imported as a factor and the second and third columns as integers.
Create a summary() of hotdogs_factor. Compare this to the summary of hotdogs.

ex_013.R

# check if readr is already loaded

# Import without col_types
hotdogs <- read_tsv(
    "hotdogs.txt",
    col_names = c("type", "calories", "sodium")
)

# Display the summary of hotdogs
summary(hotdogs)

# The collectors you will need to import the data
fac <- col_factor(levels = c("Beef", "Meat", "Poultry"))
int <- col_integer()

# Edit the col_types argument to import the data correctly: hotdogs_factor
hotdogs_factor <- read_tsv(
    "hotdogs.txt",
    col_names = c("type", "calories", "sodium"),
    col_types = list(fac, int, int)
)

# Display the summary of hotdogs_factor
summary(hotdogs_factor)
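
As an alternative to a positional list(), readr also provides cols(), which lets you attach collectors to columns by name. The sketch below assumes a made-up stand-in for hotdogs.txt (hd_path and the values in it are hypothetical, not the real data).

```r
library(readr)

# Hypothetical stand-in for hotdogs.txt: type, calories, sodium
hd_path <- tempfile(fileext = ".txt")
writeLines(c("Beef\t100\t300", "Meat\t110\t310", "Poultry\t120\t320"), hd_path)

# Attach a collector to each column by name
typed <- read_tsv(
    hd_path,
    col_names = c("type", "calories", "sodium"),
    col_types = cols(
        type = col_factor(levels = c("Beef", "Meat", "Poultry")),
        calories = col_integer(),
        sodium = col_integer()
    )
)

levels(typed$type)
class(typed$calories)
```

Naming the columns in cols() avoids relying on the order of the collectors, which can help when a file has many columns.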

18.7 fread

You still remember how to use read.table(), right? Well, fread() is a function that does the same job with very similar arguments. It is extremely easy to use and blazingly fast! Often, simply specifying the path to the file is enough to successfully import your data.

Don’t take our word for it, try it yourself! You’ll be working with the potatoes.csv file, which is available in the dataset folder. Fields are delimited by commas, and the first line contains the column names.

Instructions `100 XP`

Use library() to load (NOT install) the data.table package. Check whether you need to install this package first.

Import "potatoes.csv" with fread(). Simply pass it the file path and see if it worked. Store the result in a variable potatoes. Print out potatoes.

ex_014.R

# load the data.table package using library()
library(data.table)

# Import potatoes.csv with fread(): potatoes
potatoes <- fread("potatoes.csv")

# Print out potatoes
potatoes
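
The claim that the file path alone is often enough can be illustrated with a short sketch. It assumes two small made-up files holding the same data with different delimiters (csv_path and tsv_path are hypothetical names); fread() is given nothing but the path in both cases.

```r
library(data.table)

# Hypothetical comma-separated file with a header line
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,score", "1,2.5", "2,3.5"), csv_path)

# The same content, tab-separated
tsv_path <- tempfile(fileext = ".txt")
writeLines(c("id\tscore", "1\t2.5", "2\t3.5"), tsv_path)

# No sep or header argument: fread() detects both on its own
from_csv <- fread(csv_path)
from_tsv <- fread(tsv_path)

class(from_csv)
all.equal(from_csv, from_tsv)
```

Note that fread() returns a data.table, which is also a data.frame.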

18.8 fread: more advanced use

Now that you know the basics about fread(), you should know about two arguments of the function: drop and select, to drop or select variables of interest.

Suppose you have a dataset that contains 5 variables and you want to keep the first and fifth variable, named “a” and “e”. The following options will all do the trick:

fread("path/to/file.txt", drop = 2:4)
fread("path/to/file.txt", select = c(1, 5))
fread("path/to/file.txt", drop = c("b", "c", "d"))
fread("path/to/file.txt", select = c("a", "e"))
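
To check that these four calls are interchangeable, here is a runnable sketch. It assumes a made-up five-column file (five_path is a hypothetical name standing in for "path/to/file.txt").

```r
library(data.table)

# Hypothetical file with five variables named a to e
five_path <- tempfile(fileext = ".csv")
writeLines(c("a,b,c,d,e", "1,2,3,4,5", "6,7,8,9,10"), five_path)

by_drop_index   <- fread(five_path, drop = 2:4)
by_select_index <- fread(five_path, select = c(1, 5))
by_drop_name    <- fread(five_path, drop = c("b", "c", "d"))
by_select_name  <- fread(five_path, select = c("a", "e"))

# Every variant should keep only the first and fifth variable
names(by_drop_index)
all.equal(by_drop_index, by_select_name)
```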

Let’s stick with potatoes since we’re particularly fond of them here at DataCamp. The data is again available in the file potatoes.csv (view), containing comma-separated records.

Instructions `100 XP`

Using fread() and select or drop as arguments, import only the texture and moistness columns of the flat file. They correspond to columns 6 and 8 in "potatoes.csv". Store the result in a variable potatoes.
plot() two columns of the potatoes data frame: texture on the x-axis, moistness on the y-axis. Use the dollar sign notation twice. Feel free to name your axes and plot.

ex_014.R

# Load data.table, which provides fread()
library(data.table)

# Path to the CSV file
path <- "introduction_to_importing_data_in_R/potatoes.csv"

# Import columns 6 and 8 of potatoes.csv: potatoes
potatoes <- fread(path, select = c(6, 8))

# Plot texture (x) and moistness (y) of potatoes
plot(potatoes$texture, potatoes$moistness)
