18 readr & data.table
In addition to base R, there are dedicated packages to easily and efficiently import flat file data. We’ll talk about two such packages: readr and data.table.
18.1 read_csv
CSV files can be imported with read_csv(). It’s a wrapper function around read_delim() that handles all the details for you. For example, it will assume that the first row contains the column names.
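To make the wrapper relationship concrete, here is a minimal sketch that reads the same file both ways. The file is a made-up toy written to a temporary path (toy_path and its contents are assumptions for illustration, not part of the course data). For plain input like this the two calls should agree, although the functions do differ in some defaults, such as whitespace trimming.

```r
library(readr)

# Hypothetical toy file, written to a temporary location for illustration only
toy_path <- tempfile(fileext = ".csv")
writeLines(c("name,score", "a,1", "b,2"), toy_path)

# read_csv() assumes comma-delimited fields and a header row
via_csv <- read_csv(toy_path)

# The same import spelled out explicitly with read_delim()
via_delim <- read_delim(toy_path, delim = ",")

# Compare the column names and the values of both results
names(via_csv)
names(via_delim)
via_csv
via_delim
```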
The dataset you’ll be working with here is potatoes.csv (view in dataset folder). It gives information on the impact of storage period and cooking on potatoes’ flavor. It uses commas to delimit fields in a record, and contains column names in the first row. The file is available in your workspace.
Instructions 100 XP
Load the readr package with library().
Import "potatoes.csv" using read_csv(). Assign the resulting data frame to the variable potatoes.
ex_008.R
# Load the readr package
library(readr)
# Import potatoes.csv with read_csv(): potatoes
potatoes <- read_csv("potatoes.csv")
18.2 read_tsv
Where you use read_csv() to easily read in CSV files, you use read_tsv() to easily read in TSV files. TSV is short for tab-separated values.
This time, the potatoes data comes in the form of a tab-separated values file, potatoes.txt. Unlike potatoes.csv, however, this file does not contain column names in the first row.
There’s a vector properties that you can use to specify these column names manually.
Instructions 100 XP
- Use read_tsv() to import the potatoes data from potatoes.txt and store it in the data frame potatoes. In addition to the path to the file, you’ll also have to specify the col_names argument; you can use the properties vector for this.
- Call head() on potatoes to show the first observations of your dataset.
ex_009.R
# readr is already loaded
# Column names
properties <- c(
"area",
"temp",
"size",
"storage",
"method",
"texture",
"flavor",
"moistness"
)
# Import potatoes.txt: potatoes
potatoes <- read_tsv(
"potatoes.txt",
col_names = properties
)
# Call head() on potatoes
head(potatoes)
18.3 read_delim
Just as read.table() was the main utils function, read_delim() is the main readr function.
read_delim() takes two mandatory arguments:
- file: the file that contains the data
- delim: the character that separates the values in the data file
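Because delim can be any separator, read_delim() also handles files that are neither comma- nor tab-delimited. The sketch below assumes a made-up pipe-delimited file without a header row; pipe_path, the column names and the values are all hypothetical.

```r
library(readr)

# Hypothetical pipe-delimited toy file without a header row
pipe_path <- tempfile(fileext = ".txt")
writeLines(c("x|10", "y|20"), pipe_path)

# file is the path, delim is the separator; col_names supplies the header
pipe_data <- read_delim(
  pipe_path,
  delim = "|",
  col_names = c("label", "value")
)

pipe_data
```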
You’ll again be working with potatoes.txt; the file uses tabs ("\t") to delimit values and does not contain column names in its first line. It’s available in your working directory so you can start right away. As before, the vector properties is available to set the col_names.
Instructions 100 XP
- Import all the data in “potatoes.txt” using read_delim(); store the resulting data frame in potatoes.
- Print out potatoes.
ex_010.R
# readr is already loaded
# Column names
properties <- c(
"area",
"temp",
"size",
"storage",
"method",
"texture",
"flavor",
"moistness"
)
# Import potatoes.txt using read_delim(): potatoes
potatoes <- read_delim(
"potatoes.txt",
delim = "\t",
col_names = properties
)
# Print out potatoes
potatoes
18.4 skip and n_max
Through skip and n_max you can control which part of your flat file you’re actually importing into R.
- skip specifies the number of lines you’re ignoring in the flat file before actually starting to import data.
- n_max specifies the number of lines you’re actually importing. Say, for example, you have a CSV file with 20 lines and you set skip = 2 and n_max = 3: you’re then only reading in lines 3, 4 and 5 of the file (assuming the column names are passed via col_names rather than read from the file).
Watch out: Once you skip some lines, you also skip the first line that can contain column names!
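The 20-line scenario can be tried out on a throwaway file. This is a sketch under assumptions: lines_path and its contents (a line number plus a letter per line) are invented for illustration. The second call shows the pitfall from the warning: without col_names, the first line after the skipped ones is consumed as the header.

```r
library(readr)

# Hypothetical 20-line CSV without a header: "1,a", "2,b", ..., "20,t"
lines_path <- tempfile(fileext = ".csv")
writeLines(paste(1:20, letters[1:20], sep = ","), lines_path)

# Skip lines 1-2, then read 3 lines; names are supplied manually
fragment <- read_csv(
  lines_path,
  skip = 2,
  n_max = 3,
  col_names = c("id", "letter")
)
fragment

# Pitfall: with the default col_names = TRUE, line 3 is used as the header,
# so the three imported rows start one line later
with_header <- read_csv(lines_path, skip = 2, n_max = 3)
with_header
```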
potatoes.txt, a flat file with tab-delimited records and without column names, is available in your workspace.
Instructions 100 XP
- Finish the first read_tsv() call to import observations 7, 8, 9, 10 and 11 from potatoes.txt.
ex_011.R
# readr is already loaded
# Column names
properties <- c(
"area",
"temp",
"size",
"storage",
"method",
"texture",
"flavor",
"moistness"
)
# Import 5 observations from potatoes.txt: potatoes_fragment
potatoes_fragment <- read_tsv(
"potatoes.txt",
skip = 6,
n_max = 5,
col_names = properties
)
18.5 col_types
You can also specify which types the columns in your imported data frame should have. You can do this with col_types. If set to NULL, the default, functions from the readr package will try to find the correct types themselves. You can manually set the types with a string, where each character denotes the class of one column: c for character, d for double, i for integer and l for logical. An underscore (_) skips the column as a whole.
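As a small illustration of the compact string notation, the sketch below reads a made-up four-column file (typed_path and its contents are assumptions) with one character per column: character, integer, double, and a skipped column.

```r
library(readr)

# Hypothetical toy file with a header row and four columns
typed_path <- tempfile(fileext = ".csv")
writeLines(
  c("name,count,score,note", "a,1,2.5,x", "b,2,3.5,y"),
  typed_path
)

# "c" = character, "i" = integer, "d" = double, "_" = skip the column
typed <- read_csv(typed_path, col_types = "cid_")

# Inspect the resulting column types; the note column is not imported
str(typed)
```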
potatoes.txt, a flat file with tab-delimited records and without column names, is again available in your workspace.
Instructions 100 XP
- In the second read_tsv() call, edit the col_types argument to import all columns as characters (c). Store the resulting data frame in potatoes_char.
- Print out the structure of potatoes_char and verify whether all column types are chr, short for character.
ex_012.R
# Column names
properties <- c(
"area",
"temp",
"size",
"storage",
"method",
"texture",
"flavor",
"moistness"
)
# Import all data, but force all columns to be character: potatoes_char
potatoes_char <- read_tsv(
"potatoes.txt",
col_types = "cccccccc",
col_names = properties
)
# Print out structure of potatoes_char
str(potatoes_char)
18.6 col_types with collectors
Another way of setting the types of the imported columns is using collectors. Collector functions can be passed in a list() to the col_types argument of read_ functions to tell them how to interpret values in a column.
For a complete list of collector functions, you can take a look at the collector documentation. For this exercise you will need two collector functions:
- col_integer(): the column should be interpreted as an integer.
- col_factor(levels, ordered = FALSE): the column should be interpreted as a factor with levels.
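The two collectors can be tried on a tiny stand-in file before touching the real data. Everything in this sketch is invented for illustration: the path toy_path, the three rows and their numbers are not taken from hotdogs.txt.

```r
library(readr)

# Hypothetical tab-delimited toy file without a header row (made-up values)
toy_path <- tempfile(fileext = ".txt")
writeLines(
  c("Beef\t10\t100", "Poultry\t20\t200", "Meat\t30\t300"),
  toy_path
)

# One collector per column, passed positionally in a list()
toy <- read_tsv(
  toy_path,
  col_names = c("type", "calories", "sodium"),
  col_types = list(
    col_factor(levels = c("Beef", "Meat", "Poultry")),
    col_integer(),
    col_integer()
  )
)

# The first column is now a factor, the other two are integers
str(toy)
```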
In this exercise, you will work with hotdogs.txt, which is a tab-delimited file without column names in the first row.
Instructions 100 XP
- hotdogs is created for you without setting the column types. Inspect its summary using the summary() function.
- Two collector functions are defined for you: fac and int. Have a look at them; do you understand what they’re collecting?
- In the second read_tsv() call, edit the col_types argument: pass a list() with the elements fac, int and int, so the first column is imported as a factor, and the second and third columns as integers.
- Create a summary() of hotdogs_factor. Compare this to the summary of hotdogs.
ex_013.R
# check if readr is already loaded
# Import without col_types
hotdogs <- read_tsv(
"hotdogs.txt",
col_names = c("type", "calories", "sodium")
)
# Display the summary of hotdogs
summary(hotdogs)
# The collectors you will need to import the data
fac <- col_factor(levels = c("Beef", "Meat", "Poultry"))
int <- col_integer()
# Edit the col_types argument to import the data correctly: hotdogs_factor
hotdogs_factor <- read_tsv(
"hotdogs.txt",
col_names = c("type", "calories", "sodium"),
col_types = list(fac, int, int)
)
# Display the summary of hotdogs_factor
summary(hotdogs_factor)
18.7 fread
You still remember how to use read.table(), right? Well, fread() is a function that does the same job with very similar arguments. It is extremely easy to use and blazingly fast! Often, simply specifying the path to the file is enough to successfully import your data.
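One reason the path alone is often enough is that fread() tries to detect the separator and the header row on its own. The sketch below assumes a made-up semicolon-delimited file (semi_path and its contents are hypothetical) and passes nothing but the path.

```r
library(data.table)

# Hypothetical semicolon-delimited toy file with a header row
semi_path <- tempfile(fileext = ".txt")
writeLines(c("name;score", "a;1", "b;2"), semi_path)

# No sep or header argument: fread() is left to detect both
semi <- fread(semi_path)

semi
```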
Don’t take our word for it, try it yourself! You’ll be working with the potatoes.csv file that’s available in the dataset folder. Fields are delimited by commas, and the first line contains the column names.
Instructions 100 XP
- Use library() to load (NOT install) the data.table package. Check whether you need to install this package first.
- Import "potatoes.csv" with fread(). Simply pass it the file path and see if it worked. Store the result in a variable potatoes. Print out potatoes.
ex_014.R
# load the data.table package using library()
library(data.table)
# Import potatoes.csv with fread(): potatoes
potatoes <- fread("potatoes.csv")
# Print out potatoes
potatoes
18.8 fread: more advanced use
Now that you know the basics about fread(), you should know about two arguments of the function: drop and select, to drop or select variables of interest.
Suppose you have a dataset that contains 5 variables and you want to keep the first and fifth variable, named “a” and “e”. The following options will all do the trick:
fread("path/to/file.txt", drop = 2:4)
fread("path/to/file.txt", select = c(1, 5))
fread("path/to/file.txt", drop = c("b", "c", "d"))
fread("path/to/file.txt", select = c("a", "e"))
Let’s stick with potatoes since we’re particularly fond of them here at DataCamp. The data is again available in the file potatoes.csv, containing comma-separated records.
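The four equivalent drop and select calls listed earlier in this section can be checked against each other on a toy file. This is only a sketch: five_path and its two rows of numbers are invented, with columns named a to e to match the example.

```r
library(data.table)

# Hypothetical five-column toy file with columns a to e
five_path <- tempfile(fileext = ".csv")
writeLines(c("a,b,c,d,e", "1,2,3,4,5", "6,7,8,9,10"), five_path)

# Four ways to keep only the first and fifth column
by_drop_index   <- fread(five_path, drop = 2:4)
by_select_index <- fread(five_path, select = c(1, 5))
by_drop_name    <- fread(five_path, drop = c("b", "c", "d"))
by_select_name  <- fread(five_path, select = c("a", "e"))

# Each result should contain just the columns a and e
names(by_drop_index)
all.equal(by_drop_index, by_select_name)
```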
Instructions 100 XP
- Using fread() and select or drop as arguments, only import the texture and moistness columns of the flat file. They correspond to columns 6 and 8 in "potatoes.csv". Store the result in a variable potatoes.
- plot() 2 columns of the potatoes data frame: texture on the x-axis, moistness on the y-axis. Use the dollar sign notation twice. Feel free to name your axes and plot.
ex_014.R
# Load the data.table package, which provides fread()
library("data.table")
path <- "introduction_to_importing_data_in_R/potatoes.csv"
# Import columns 6 and 8 of potatoes.csv: potatoes
potatoes <- fread(path, select = c(6, 8))
# Plot texture (x) and moistness (y) of potatoes
plot(potatoes$texture, potatoes$moistness)