25 Importing data from statistical software packages

Besides R, other statistical software packages are in common use: SAS, STATA and SPSS. Each has its own file format. Learn how to use the haven and foreign packages to get those files into R with remarkable ease!

25.1 Import SAS data with haven

haven is an extremely easy-to-use package to import data from three software packages: SAS, STATA and SPSS. Depending on the software, you use different functions:

SAS: read_sas()
STATA: read_dta() (or its alias, read_stata(); the two are identical)
SPSS: read_sav() or read_por(), depending on the file type.

All these functions take one key argument: the path to your local file. In fact, you can even pass a URL; haven will then automatically download the file for you before importing it.
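The mapping between file types and import functions can be sketched as a small lookup. This is only an illustration: haven_reader_for() is a hypothetical helper (it is not part of haven) that returns the name of the matching reader for a file extension, using base R’s tools::file_ext(). It works the same way for a local path and for a URL.

```r
# Hypothetical helper (not part of haven): map a file extension
# to the name of the haven function that reads that format.
haven_reader_for <- function(path) {
  ext <- tolower(tools::file_ext(path))
  switch(ext,
    sas7bdat = "read_sas",
    dta      = "read_dta",
    sav      = "read_sav",
    por      = "read_por",
    stop("Unsupported extension: ", ext)
  )
}

# A local file name and a URL are handled alike
haven_reader_for("sales.sas7bdat")
haven_reader_for("http://assets.datacamp.com/production/course_1478/datasets/trade.dta")
```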

You’ll be working with data on the age, gender, income, and purchase level (0 = low, 1 = high) of 36 individuals (Source: SAS). The information is stored in a SAS file, sales.sas7bdat, which is available in the dataset directory. You can also download the data here http://s3.amazonaws.com/assets.datacamp.com/production/course_1478/datasets/sales.sas7bdat.

Instructions `100 XP`

Load the haven package.
Import the data file "sales.sas7bdat". Call the imported data frame sales.
Display the structure of sales with str(). Some columns represent categorical variables, so they should be factors.

ex_023.R

# Load the haven package
library(haven)

# Import sales.sas7bdat: sales
sales <- read_sas("sales.sas7bdat")

# Display the structure of sales
str(sales)

25.2 Import STATA data with haven

Next up are STATA data files; you can use read_dta() for these.

When you inspect the result of the read_dta() call, you will notice that one column is imported as a labelled vector, haven’s R counterpart of the value-labelled variables that are common in other statistical environments. To keep working with the data effectively in R, it’s best to convert such a column to a standard R class. To convert a variable of class labelled to a factor, you’ll need haven’s as_factor() function.
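A minimal sketch of that conversion, assuming a toy labelled vector built by hand with haven’s labelled() constructor. The codes and date strings are made up and only stand in for a column that stores dates as value labels. Note that recent haven versions report the class as haven_labelled rather than labelled.

```r
library(haven)

# Toy labelled vector (made-up values): numeric codes whose value
# labels are date strings
date_codes <- labelled(
  c(1, 2, 1),
  labels = c("2004-12-31" = 1, "2005-12-31" = 2)
)

# Inspect the class before converting
class(date_codes)

# Step 1: replace each code by its label, giving a factor
date_factor <- as_factor(date_codes)

# Step 2: turn the factor of date strings into real Date values
dates <- as.Date(date_factor)
class(dates)
```

The two-step form mirrors as.Date(as_factor(___)): as_factor() recovers the labels, and as.Date() parses them.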

In this exercise, you will work with data on yearly import and export numbers of sugar, both in USD and in weight. The data can be found at: http://assets.datacamp.com/production/course_1478/datasets/trade.dta

Instructions `100 XP`

Import the data file directly from the URL using read_dta(), and store it as sugar.
Print out the structure of sugar. The Date column has class labelled.
Convert the values in the Date column of sugar to dates, using as.Date(as_factor(___)).
Print out the structure of sugar once more. Looks better now?

ex_024.R

# Load the haven package (harmless if it is already loaded)
library(haven)

# Import the data from the URL: sugar
url <- "http://assets.datacamp.com/production/course_1478/datasets/trade.dta"

sugar <- read_dta(url)

# Structure of sugar
str(sugar)

# Convert values in Date column to dates
sugar$Date <- as.Date(as_factor(sugar$Date))

# Structure of sugar again
str(sugar)

25.3 Import SPSS data with haven

The haven package can also import data files from SPSS. Again, importing the data is pretty straightforward. Depending on the SPSS data file you’re working with, you’ll need either read_sav() - for .sav files - or read_por() - for .por files.

In this exercise, you will work with data on four of the Big Five personality traits for 434 persons (Source: University of Bath). The Big Five is a psychological model that classifies human personality along five dimensions. The SPSS dataset is called person.sav and is available in your working directory.

Instructions `100 XP`

Use read_sav() to import the SPSS data in "person.sav". Name the imported data frame traits.
traits contains several missing values, or NAs. Run summary() on it to find out how many NAs are contained in each variable.
Print out a subset of those individuals that scored high on Extroversion and on Agreeableness, i.e. scoring higher than 40 on each of these two categories. You can use subset() for this.
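As a small illustration of how subset() combines two row conditions, here is a sketch on made-up scores (not the person.sav data). The conditions are joined with & inside the second argument, and rows where the condition evaluates to NA are dropped.

```r
# Toy scores (made-up values, not the person.sav data)
scores <- data.frame(
  Extroversion  = c(45, 38, 50, NA),
  Agreeableness = c(42, 47, 39, 44)
)

# Keep rows that satisfy BOTH conditions; the NA row is dropped
high_both <- subset(scores, Extroversion > 40 & Agreeableness > 40)
high_both
nrow(high_both)
```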

ex_025.R

# haven is already loaded

# Import person.sav: traits
traits <- read_sav("person.sav")

# Summarize traits
summary(traits)

# Print out a subset
subset(traits, Extroversion > 40 & Agreeableness > 40)

25.4 Factorize, round two

In the last exercise you learned how to import a data file using the command read_sav(). With SPSS data files, it can also happen that some of the variables you import have the labelled class. This is done to keep all the labelling information that was originally present in the .sav and .por files. It’s advised to coerce (or change) these variables to factors or other standard R classes.
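To see why the coercion helps, consider this sketch with a hand-built labelled vector. The codes and labels are invented for illustration and are not taken from employee.sav. Before conversion, summary() treats the codes as numbers; after as_factor(), it counts observations per category.

```r
library(haven)

# Toy labelled vector (made-up coding): 1 = Female, 2 = Male
gender <- labelled(
  c(1, 2, 2, 1, 2),
  labels = c(Female = 1, Male = 2)
)

# Summary of the raw codes: numeric statistics, not very informative
summary(gender)

# Convert to a factor; the value labels become the factor levels
gender_factor <- as_factor(gender)

# Summary of the factor: one count per category
summary(gender_factor)
```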

The data for this exercise involves information on employees and their demographic and economic attributes (Source: QRiE). The data can be found on the following URL: http://s3.amazonaws.com/assets.datacamp.com/production/course_1478/datasets/employee.sav

Instructions `100 XP`

Import the SPSS data straight from the URL and store the resulting data frame as work.
Display the summary of the GENDER column of work. This summary isn’t very informative, right?
Convert the GENDER column in work to a factor, the class to denote categorical variables in R. Use as_factor().
Once again display the summary of the GENDER column. This time, the printout makes much more sense.

ex_027.R

# haven is already loaded

# Import SPSS data from the URL: work
url <- "http://s3.amazonaws.com/assets.datacamp.com/production/course_1478/datasets/employee.sav"
work <- read_sav(url)

# Display summary of work$GENDER
summary(work$GENDER)
# Convert work$GENDER to a factor
work$GENDER <- as_factor(work$GENDER)
# Display summary of work$GENDER again
summary(work$GENDER)

25.5 Import STATA data with foreign (1)

The foreign package offers a simple function to import and read STATA data: read.dta().

In this exercise, you will import data on the US presidential elections in the year 2000. The data in florida.dta contains the total numbers of votes for each of the four candidates as well as the total number of votes per election area in the state of Florida (Source: Florida Department of State). The file is available in the dataset directory; you can also download it from http://s3.amazonaws.com/assets.datacamp.com/production/course_1478/datasets/florida.dta if you want to experiment some more.

Instructions `100 XP`

Load the foreign package; it’s already installed on DataCamp’s servers.
Import the data on the elections in Florida, "florida.dta", and name the resulting data frame florida. Use read.dta() without specifying extra arguments.
Check out the last 6 observations of florida with tail().

ex_028.R

# Load the foreign package
library(foreign)

# Import florida.dta and name the resulting data frame florida
florida <- read.dta("florida.dta")

# Check tail() of florida
tail(florida)

25.6 Import STATA data with foreign (2)

Data can be very diverse, going from character vectors to categorical variables, dates and more. It’s in these cases that the additional arguments of read.dta() will come in handy.

The arguments you will use most often are convert.dates, convert.factors, missing.type and convert.underscore. Their meaning is pretty straightforward, as Filip explained in the video. It’s all about correctly converting STATA data to standard R data structures. Type ?read.dta to find out about the default values.
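The effect of two of these arguments can be sketched on a throwaway file. The example below assumes a small made-up data frame written to a temporary .dta file with foreign’s write.dta(); it is not the edequality.dta data. It then reads the file three times: with the defaults, with convert.factors = FALSE, and with convert.underscore = TRUE.

```r
library(foreign)

# Toy data (made-up values): one factor column with an underscore
# in its name, one numeric column
toy <- data.frame(
  ethnicity_head = factor(c("Bulgarian", "Roma", "Bulgarian")),
  income         = c(1200, 800, 950)
)

# Write to a temporary STATA file so the example is self-contained
tmp <- tempfile(fileext = ".dta")
write.dta(toy, tmp)

# Defaults: value labels are turned back into a factor
default_import <- read.dta(tmp)
str(default_import)

# convert.factors = FALSE: keep the underlying numeric codes
codes_import <- read.dta(tmp, convert.factors = FALSE)
str(codes_import)

# convert.underscore = TRUE: "_" in variable names becomes "."
dotted_import <- read.dta(tmp, convert.underscore = TRUE)
names(dotted_import)
```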

The dataset for this exercise contains socio-economic measures and access to education for different individuals (Source: World Bank). This data is available as edequality.dta, which is located in the worldbank folder in your working directory.

Instructions `100 XP`

ex_029.R

# foreign is already loaded

# Specify the file path using file.path(): path
path <- file.path("worldbank", "edequality.dta")

# Create and print structure of edu_equal_1
edu_equal_1 <- read.dta(path)
str(edu_equal_1)


# Create and print structure of edu_equal_2
edu_equal_2 <- 
    read.dta(
        path,
        convert.factors = FALSE
    )
str(edu_equal_2)


# Create and print structure of edu_equal_3

edu_equal_3 <- 
    read.dta(
        path,
        convert.underscore = TRUE
    )
str(edu_equal_3)


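# Count individuals of Bulgarian ethnicity with an income above 1000.
# Both conditions go in one logical expression joined with &; the third
# argument of subset() selects columns and does not filter rows.
nrow(
    subset(
        edu_equal_1,
        ethnicity_head == "Bulgarian" & income > 1000
    )
)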

25.7 Import SPSS data with foreign (1)

All great things come in pairs. Where foreign provides read.dta() to read STATA data, there’s also read.spss() to read SPSS data files. To get a data frame, make sure to set to.data.frame = TRUE inside read.spss().

In this exercise, you’ll be working with socio-economic variables from different countries (Source: Quantitative Data Analysis in Education). The SPSS data is in a file called international.sav, which is in your working directory. You can also download it here if you want to play around with it some more.

Instructions `100 XP`

ex_030.R

# foreign is already loaded

# Import international.sav as a data frame: demo
demo <- read.spss(
    "international.sav",
    to.data.frame = TRUE
)

# Create boxplot of gdp variable of demo
boxplot(demo$gdp)

25.8 Import SPSS data with foreign (2)

In the previous exercise, you used the to.data.frame argument inside read.spss(). There are many other ways in which to customize the way your SPSS data is imported.

In this exercise you will experiment with another argument, use.value.labels. It specifies whether variables with value labels should be converted into R factors with levels that are named accordingly. The argument is TRUE by default, which means that so-called labelled variables inside SPSS are converted to factors inside R.
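What the two settings look like in practice can be mimicked in base R. The sketch below uses made-up codes and value labels; it does not read an SPSS file and is not how read.spss() is implemented. It only imitates the visible difference between the labelled factor and the raw codes.

```r
# Made-up codes and value labels, standing in for one SPSS variable
codes        <- c(1, 3, 2, 1)
value_labels <- c(Africa = 1, Asia = 2, Europe = 3)

# Roughly what you see with use.value.labels = TRUE:
# a factor whose levels are named after the value labels
as_labels <- factor(
  codes,
  levels = unname(value_labels),
  labels = names(value_labels)
)
as_labels

# Roughly what you see with use.value.labels = FALSE:
# the raw numeric codes
as_codes <- codes
as_codes
```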

You’ll again be working with the international.sav data, which is available in your current working directory.

Instructions `100 XP`

Import the data file "international.sav" as a data frame, demo_1.
Print the first few rows of demo_1 using the head() function.
Import the data file "international.sav" as a data frame, demo_2, but this time in a way such that variables with value labels are not converted to R factors.
Again, print the first few rows of demo_2. Can you tell the difference between the two data frames?

ex_031.R

# foreign is already loaded
# Import international.sav as demo_1
demo_1 <- read.spss(
    "international.sav",
    to.data.frame = TRUE
    )

# Print out the head of demo_1
head(demo_1)

# Import international.sav as demo_2
demo_2 <- read.spss(
    "international.sav",
    to.data.frame = TRUE,
    use.value.labels = FALSE
    )

# Print out the head of demo_2
head(demo_2)

