25 Importing data from statistical software packages
Next to R, there are also other commonly used statistical software packages: SAS, STATA and SPSS. Each of them has their own file format. Learn how to use the haven and foreign packages to get them into R with remarkable ease!
25.1 Import SAS data with haven
haven is an extremely easy-to-use package to import data from three software packages: SAS, STATA and SPSS. Depending on the software, you use different functions:
- SAS: read_sas()
- STATA: read_dta() (or read_stata(), which are identical)
- SPSS: read_sav() or read_por(), depending on the file type.
All these functions take one key argument: the path to your local file. In fact, you can even pass a URL; haven will then automatically download the file for you before importing it.
You’ll be working with data on the age, gender, income, and purchase level (0 = low, 1 = high) of 36 individuals (Source: SAS). The information is stored in a SAS file, sales.sas7bdat, which is available in the dataset directory. You can also download the data here http://s3.amazonaws.com/assets.datacamp.com/production/course_1478/datasets/sales.sas7bdat.
Instructions 100 XP
- Load the
havenpackage. - Import the data file
"sales.sas7bdat". Call the imported data frame sales. - Display the structure of
saleswithstr(). Some columns represent categorical variables, so they should be factors.
ex_023.R
# Load the haven package
library(haven)
# Import sales.sas7bdat: sales
sales <- read_sas("sales.sas7bdat")
# Display the structure of sales
str(sales)25.2 Import STATA data with haven
Next up are STATA data files; you can use read_dta() for these.
When inspecting the result of the read_dta() call, you will notice that one column will be imported as a labelled vector, an R equivalent for the common data structure in other statistical environments. In order to effectively continue working on the data in R, it’s best to change this data into a standard R class. To convert a variable of the class labelled to a factor, you’ll need haven’s as_factor() function.
In this exercise, you will work with data on yearly import and export numbers of sugar, both in USD and in weight. The data can be found at: http://assets.datacamp.com/production/course_1478/datasets/trade.dta
Instructions 100 XP
- Import the data file directly from the URL using read_dta(), and store it as sugar.
- Print out the structure of sugar. The Date column has class labelled.
- Convert the values in the Date column of sugar to dates, using
as.Date(as_factor(___)). - Print out the structure of sugar once more. Looks better now?
ex_024.R
# haven is already loaded
library(haven)
# Import the data from the URL: sugar
url <- "http://assets.datacamp.com/production/course_1478/datasets/trade.dta"
sugar <- read_dta(url)
# Structure of sugar
str(sugar)
# Convert values in Date column to dates
sugar$Date <- as.Date(as_factor(sugar$Date))
# Structure of sugar again
str(sugar)25.3 Import SPSS data with haven
The haven package can also import data files from SPSS. Again, importing the data is pretty straightforward. Depending on the SPSS data file you’re working with, you’ll need either read_sav() - for .sav files - or read_por() - for .por files.
In this exercise, you will work with data on four of the Big Five personality traits for 434 persons (Source: University of Bath). The Big Five is a psychological concept including, originally, five dimensions of personality to classify human personality. The SPSS dataset is called person.sav and is available in your working directory.
Instructions 100 XP
- Use
read_sav()to import theSPSSdata in"person.sav". Name the imported data frametraits. traitscontains several missing values, orNAs. Runsummary()on it to find out how manyNAs are contained in each variable.- Print out a subset of those individuals that scored high on Extroversion and on Agreeableness, i.e. scoring higher than 40 on each of these two categories. You can use
subset()for this.
ex_025.R
# haven is already loaded
# Import person.sav: traits
traits <- read_sav("person.sav")
# Summarize traits
summary(traits)
# Print out a subset
subset(traits, Extroversion > 40 & Agreeableness > 40)25.4 Factorize, round two
In the last exercise you learned how to import a data file using the command read_sav(). With SPSS data files, it can also happen that some of the variables you import have the labelled class. This is done to keep all the labelling information that was originally present in the .sav and .por files. It’s advised to coerce (or change) these variables to factors or other standard R classes.
The data for this exercise involves information on employees and their demographic and economic attributes (Source: QRiE). The data can be found on the following URL: http://s3.amazonaws.com/assets.datacamp.com/production/course_1478/datasets/employee.sav
Instructions 100 XP
- Import the SPSS data straight from the URL and store the resulting data frame as work.
- Display the summary of the GENDER column of work. This information doesn’t give you a lot of useful information, right?
- Convert the GENDER column in work to a factor, the class to denote categorical variables in R. Use as_factor().
- Once again display the summary of the GENDER column. This time, the printout makes much more sense.
ex_027.R
# haven is already loaded
# Import SPSS data from the URL: work
url <- "http://s3.amazonaws.com/assets.datacamp.com/production/course_1478/datasets/employee.sav"
work <- read_sav(url)
# Display summary of work$GENDER
summary(work$GENDER)
# Convert work$GENDER to a factor
work$GENDER <- as_factor(work$GENDER)
# Display summary of work$GENDER again
summary(work$GENDER)25.5 Import STATA data with foreign (1)
The foreign package offers a simple function to import and read STATA data: read.dta().
In this exercise, you will import data on the US presidential elections in the year 2000. The data in florida.dta contains the total numbers of votes for each of the four candidates as well as the total number of votes per election area in the state of Florida (Source: Florida Department of State). The file is available in the dataset directory, you can download it here http://s3.amazonaws.com/assets.datacamp.com/production/course_1478/datasets/florida.dta
if you want to experiment some more.
Instructions 100 XP
- Load the
foreignpackage; it’s already installed on DataCamp’s servers. - Import the data on the elections in Florida,
"florida.dta", and name the resulting data frame florida. Useread.dta()without specifying extra arguments. - Check out the last 6 observations of florida with
tail().
ex_028.R
# Load the foreign package
library(foreign)
# Import florida.dta and name the resulting data frame florida
florida <- read.dta("florida.dta")
# Check tail() of florida
tail(florida)25.6 Import STATA data with foreign (2)
Data can be very diverse, going from character vectors to categorical variables, dates and more. It’s in these cases that the additional arguments of read.dta() will come in handy.
The arguments you will use most often are convert.dates, convert.factors, missing.type and convert.underscore. Their meaning is pretty straightforward, as Filip explained in the video. It’s all about correctly converting STATA data to standard R data structures. Type ?read.dta to find out about about the default values.
The dataset for this exercise contains socio-economic measures and access to education for different individuals (Source: World Bank). This data is available as edequality.dta, which is located in the worldbank folder in your working directory.
Instructions 100 XP
ex_029.R
# foreign is already loaded
# Specify the file path using file.path(): path
path <- file.path("worldbank", "edequality.dta")
# Create and print structure of edu_equal_1
edu_equal_1 <- read.dta(path)
str(edu_equal_1)
# Create and print structure of edu_equal_2
edu_equal_2 <-
read.dta(
path,
convert.factors = FALSE
)
str(edu_equal_2)
# Create and print structure of edu_equal_3
edu_equal_3 <-
read.dta(
path,
convert.underscore = TRUE
)
str(edu_equal_3)
nrow(
subset(
edu_equal_1,
edu_equal_1$ethnicity_head == "Bulgarian",
income > 1000
)
)25.7 Import SPSS data with foreign (1)
All great things come in pairs. Where foreign provided read.dta() to read Stata data, there’s also read.spss() to read SPSS data files. To get a data frame, make sure to set to.data.frame = TRUE inside read.spss().
In this exercise, you’ll be working with socio-economic variables from different countries (Source: Quantative Data Analysis in Education). The SPSS data is in a file called international.sav, which is in your working directory. You can also download it here if you want to play around with it some more.
Instructions 100 XP
ex_030.R
# foreign is already loaded
# Import international.sav as a data frame: demo
demo <- read.spss(
"international.sav", to.data.frame=TRUE)
# Create boxplot of gdp variable of demo
boxplot(demo$gdp)25.8 Import SPSS data with foreign (2)
In the previous exercise, you used the to.data.frame argument inside read.spss(). There are many other ways in which to customize the way your SPSS data is imported.
In this exercise you will experiment with another argument, use.value.labels. It specifies whether variables with value labels should be converted into R factors with levels that are named accordingly. The argument is TRUE by default which means that so called labelled variables inside SPSS are converted to factors inside R.
You’ll again be working with the international.sav data, which is available in your current working directory.
Instructions 100 XP
- Import the data file
"international.sav"as a data frame,demo_1. - Print the first few rows of
demo_1using thehead()function. - Import the data file
"international.sav"as a data frame,demo_2, but this time in a way such that variables with value labels are not converted to R factors. - Again, print the first few rows of
demo_2. Can you tell the difference between the two data frames?
ex_031.R
# foreign is already loaded
# Import international.sav as demo_1
demo_1 <- read.spss(
"international.sav",
to.data.frame = TRUE,
)
# Print out the head of demo_1
head(demo_1)
# Import international.sav as demo_2
demo_2 <- read.spss(
"international.sav",
to.data.frame = TRUE,
use.value.labels = FALSE
)
# Print out the head of demo_2
head(demo_2)