Key Things to Know

What to Remember from this Section



  1. Lots of built-in data sets provided by R
  2. Text files use delimiters - these determine the function to read in the data
  3. There is no base R function to read in Excel data - must use a package
  4. The haven package can read and write SAS, Stata and SPSS files
  5. Web scraping is a dense subject but scraping online text/Excel files can be as easy as reading in these files from your hard drive
  6. SDMX APIs can be directly accessed from R using Java functions
  7. Provider-specific functions can be created to extract and transform data from non-SDMX REST APIs

Built-in Data

Built-in Data Sets

  • R has many built-in data sets
  • type data() into your console (104 data sets should appear)
mtcars
##                      mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
## Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
## Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
## Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
## Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
## Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
## Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
## AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
## Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
## Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
## Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
## Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
## Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
## Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

Built in Data Sets

  • R also has many convenient built-in character strings to be aware of
  • Try these:
letters
LETTERS
month.abb
month.name
state.abb
state.division
state.name
state.region


Note

  • to load any of these data sets/strings into your global environment: data("name of data set")
  • data sets that are part of unloaded packages can be loaded using data(... package = ...)
  • type ?name to get more information about the built-in data (the containing package must be loaded)

Example

1. Load the economics data set

data(economics, package = "ggplot2")

2. What is this data measuring?

library(ggplot2)
?economics

Description from the help screen: "This dataset was produced from US economic time series data available from <URL: http://research.stlouisfed.org/fred2>. ‘economics’ is in "wide" format, ‘economics_long’ is in "long" format."

  • date Month of data collection
  • psavert personal savings rate
  • pce personal consumption expenditures, in billions of dollars
  • unemploy number of unemployed in thousands
  • uempmed median duration of unemployment, in week
  • pop total population, in thousands

Importing Text Files

Importing Text Files

.csv .txt .tsv

  • Text files are a popular way to hold and exchange tabular data
  • Text file formats use delimiters to separate the different elements
    • .csv
    • .txt
    • .tsv
  • These delimiters help us know what functions to use to read in the data
  • I'll illustrate with the following files in the R-Bootcamp download file
    • mydata.csv
    • mydata.txt
    • mydata.tsv

Importing Text Files

Base R

Use read.csv for comma separated files (automatically sets the sep = ",")

read.csv("../data/mydata.csv")
##   variable.1 variable.2 variable.3
## 1         10       beer       TRUE
## 2         25       wine       TRUE
## 3          8     cheese      FALSE


Use read.delim for tab delimited files (automatically sets the sep = "\t")

read.delim("../data/mydata.txt")
##   variable.1 variable.2 variable.3
## 1         10       beer       TRUE
## 2         25       wine       TRUE
## 3          8     cheese      FALSE

Importing Text Files

Base R

When importing data, store as data.frame object by using the assignment operator:

mydata <- read.delim("../data/mydata.tsv")

mydata
##   variable.1 variable.2 variable.3
## 1         10       beer       TRUE
## 2         25       wine       TRUE
## 3          8     cheese      FALSE


  • You now have a data object in your global environment named mydata
  • View your data in a spreadsheet form:
    • clicking on the object in the global environment space or
    • type View(mydata) in your console

OECD.Stat CSV Example

OECD.Stat CSV Example

Read in the sna_table1.csv file and create object sna_table1

sna_table1 <- read.csv("../data/sna_table1.csv")

Take a peek at the object's data structure

str(sna_table1, width = 100, strict.width = "cut")
## 'data.frame':    2208 obs. of  17 variables:
##  $ LOCATION             : Factor w/ 47 levels "ARG","AUS","AUT",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Country              : Factor w/ 47 levels "Argentina","Australia",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ TRANSACT             : Factor w/ 1 level "B1_GA": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Transaction          : Factor w/ 1 level "Gross domestic product (output approach)": 1 1 1 1 1 ..
##  $ MEASURE              : Factor w/ 2 levels "C","V": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Measure              : Factor w/ 2 levels "Constant prices, national base year",..: 2 2 2 2 2 2..
##  $ TIME                 : int  1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 ...
##  $ Year                 : int  1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 ...
##  $ Unit.Code            : Factor w/ 29 levels "ARS","AUD","BRL",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Unit                 : Factor w/ 29 levels "Argentine Peso",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ PowerCode.Code       : int  6 6 6 6 6 6 6 6 6 6 ...
##  $ PowerCode            : Factor w/ 1 level "Millions": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Reference.Period.Code: int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Reference.Period     : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Value                : num  414815 422927 443969 466616 495554 ...
##  $ Flag.Codes           : Factor w/ 2 levels "","E": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Flags                : Factor w/ 2 levels "","Estimated value": 1 1 1 1 1 1 1 1 1 1 ...

Importing Excel Files

Importing Excel Files

  • Excel is still the spreadsheet software of choice
  • Base R does not have an Excel importing function but we can use the readxl package
# if you haven't installed the readxl package run the following line (minus hashtag)
# install.packages("readxl")

library(readxl)

read_excel("../data/mydata.xlsx", sheet = "Sheet5")
## # A tibble: 3 × 5
##   `variable 1` `variable 2` `variable 3` `variable 4`        `variable 5`
##          <dbl>        <chr>        <dbl>        <dbl>              <dttm>
## 1           10         beer            1        42328 2015-11-20 13:30:00
## 2           25         wine            1           NA 2015-11-21 16:30:00
## 3            8         <NA>            0        42330 2015-11-22 14:45:00
  • Benefits of readxl
    • recognizes date and time variables
    • keeps text variables as characters rather than factors (more to come on this)
    • does not collapse variable names
    • read more about readxl capabilities here
  • xlsx and XLConnect are alternative packages for reading and writing Excel files

Example

Read in the spreadsheet 3. Median HH income, metro in the PEW-Middle-Class-Data.xlsx file

☛ We need to specify the sheet using read_excel(..., sheet = ...)

☛ We need to skip 5 lines using read_excel(..., skip = ...)

Hint: ?read_excel provides help on function parameters

Example

Read in the .xlsx file and save as pew

pew <- read_excel("../data/PEW-Middle-Class-Data.xlsx", sheet = "3. Median HH income, metro", skip = 5)

Take a peek at what this data looks like

head(pew)
## # A tibble: 6 × 9
##                               Metro All_1999 Lower_1999 Middle_1999
##                               <chr>    <dbl>      <dbl>       <dbl>
## 1                         Akron, OH 77688.32   27587.97    81671.30
## 2       Albany-Schenectady-Troy, NY 73185.16   27322.46    79718.21
## 3                   Albuquerque, NM 66067.18   26426.87    76363.59
## 4 Allentown-Bethlehem-Easton, PA-NJ 71607.73   29007.45    78782.43
## 5                      Amarillo, TX 60599.32   27014.16    76296.73
## 6                     Anchorage, AK 76652.96   27813.29    81361.67
## # ... with 5 more variables: Upper_1999 <dbl>, All_2014 <dbl>,
## #   Lower_2014 <dbl>, Middle_2014 <dbl>, Upper_2014 <dbl>

Importing SAS and Stata files

Importing SAS Files

  • The SAS System uses a prorpietary binary format for data files (.sas7bdat and .sas7bcat)
  • The haven package leverages the ReadStat C library to read both file formats and write .sas7bdat
library(haven)

helpfromSAS <- read_sas("../data/help.sas7bdat")
str(helpfromSAS[, 1:2])
## Classes 'tbl_df', 'tbl' and 'data.frame':    453 obs. of  2 variables:
##  $ ID  : atomic  1 2 3 4 5 6 7 8 9 10 ...
##   ..- attr(*, "label")= chr "subject id"
##  $ E2B1: atomic  NA NA NA NA NA NA NA 1 1 1 ...
##   ..- attr(*, "label")= chr "1 e2b"
  • Benefits of haven
    • date times are converted to corresponding R classes
    • labelled vectors are returned as a new labelled class
    • can easily coerce to factors or replace labelled values with missings
    • all functions return tibbles
    • read more about the C library here

Importing Stata Files

  • haven/ReadStat can read Stata files up to version 14 (latest) and save objets as .dta
  • Stata missing values are displayed as NaN
carsdataSTATA <- read_dta("../data/carsdata.dta")
psych::describe(carsdataSTATA, skew = FALSE)
##        vars n mean   sd median trimmed  mad min max range   se
## cars      1 5    2 0.71      2       2 0.00   1   3     2 0.32
## hhsize    2 5    3 1.58      3       3 1.48   1   5     4 0.71

carsdataSTATA$cars[1] <- NA
write_dta(carsdataSTATA, path = "../data/carsdata_out.dta", version = 14)

read_dta("../data/carsdata_out.dta")[1:3,]
## # A tibble: 3 × 2
##    cars hhsize
##   <dbl>  <dbl>
## 1   NaN      1
## 2     2      2
## 3     2      3
  • Supports Stata "extended" missing values, .A through .Z using a tagged NA in R
  • See semantics vignette of haven package for additional information

Reading Online Files

Reading Online Files

Tabular files

  • Download URLs to csv and Excel files can be passed directly to R functions
  • For inspection with other tools, the file can be downloaded first using download.file()
url <- "https://inventory.data.gov/dataset/f6f06df8-c102-4fe8-af31-c368be66166d/resource/d65072b5-08f0-4f90-bd57-65e731c59df2/download/userssharedsdfdata.govfederalagencyparticipation.csv"
data_gov <- read.csv(file = url, stringsAsFactors = FALSE)
str(data_gov, width = 100, strict.width = "cut", list.len = 2)
## 'data.frame':    185 obs. of  11 variables:
##  $ Agency.Name            : chr  "Broadcasting Board of Governors" "Commodity Futures Trading Com"..
##  $ Agency.Abbreviation    : chr  "BBG" "CFTC" "CNS" "CSOSA" ...
##   [list output truncated]

## download file
destfile <- tempfile(fileext = ".csv")
download.file(url = url, destfile = destfile)
data_gov <- read.csv(file = destfile, stringsAsFactors = FALSE)

str(data_gov, width = 100, strict.width = "cut", list.len = 2)
## 'data.frame':    185 obs. of  11 variables:
##  $ Agency.Name            : chr  "Broadcasting Board of Governors" "Commodity Futures Trading Com"..
##  $ Agency.Abbreviation    : chr  "BBG" "CFTC" "CNS" "CSOSA" ...
##   [list output truncated]

Reading Online Files

Excel files

  • Reading online Excel files follows a similar process
  • gdata package is particular easy to use

Let's download some data from Fair Market Rents for Section 8 Housing:

suppressMessages(library(gdata))

url <- "http://www.huduser.org/portal/datasets/fmr/fmr2015f/FY2015F_4050_Final.xls"

# use read.xls to import
rents <- read.xls(url)

str(rents, list.len = 5)
## 'data.frame':    4769 obs. of  18 variables:
##  $ fips2000        : num  1.00e+08 1.00e+08 1.01e+08 1.01e+08 1.01e+08 ...
##  $ fips2010        : num  1.00e+08 1.00e+08 1.01e+08 1.01e+08 1.01e+08 ...
##  $ fmr2            : int  788 762 670 773 773 599 599 675 696 599 ...
##  $ fmr0            : int  628 494 492 545 545 481 453 491 511 481 ...
##  $ fmr1            : int  663 643 495 652 652 505 456 513 514 500 ...
##   [list output truncated]

Example

Reading file from Dropbox



1. Download the file stored at:

https://dl.dropboxusercontent.com/u/1807228/reddit.csv?dl=1


2. Save it as an object titled reddit


3. Inspect the data structure

Example

1. Retrieve the file contents stored at: https://dl.dropboxusercontent.com/u/1807228/reddit.csv?dl=1

suppressMessages(library(RCurl))
url <- "https://dl.dropboxusercontent.com/u/1807228/reddit.csv?dl=1"
tt <- getURL(url)
substr(tt, 1, 100)
## [1] "id,gender,age.range,marital.status,employment.status,military.service,children,education,country,sta"

2. Save it as an object titled reddit

reddit <- read.csv(url)

3. Inspect the data structure

str(reddit)
## 'data.frame':    32754 obs. of  14 variables:
##  $ id               : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ gender           : int  0 0 1 0 1 0 0 0 0 0 ...
##  $ age.range        : Factor w/ 7 levels "18-24","25-34",..: 2 2 1 2 2 2 2 1 3 2 ...
##  $ marital.status   : Factor w/ 6 levels "Engaged","Forever Alone",..: NA NA NA NA NA 4 3 4 4 3 ...
##  $ employment.status: Factor w/ 6 levels "Employed full time",..: 1 1 2 2 1 1 1 4 1 2 ...
##  $ military.service : Factor w/ 2 levels "No","Yes": NA NA NA NA NA 1 1 1 1 1 ...
##  $ children         : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ education        : Factor w/ 7 levels "Associate degree",..: 2 2 5 2 2 2 5 2 2 5 ...
##  $ country          : Factor w/ 439 levels " Canada"," Canada eh",..: 394 394 394 394 394 394 125 394 394 125 ...
##  $ state            : Factor w/ 53 levels "","Alabama","Alaska",..: 33 33 48 33 6 33 1 6 33 1 ...
##  $ income.range     : Factor w/ 8 levels "$100,000 - $149,999",..: 2 2 8 2 7 2 NA 7 2 7 ...
##  $ fav.reddit       : Factor w/ 1834 levels "","___","-","?",..: 720 691 1511 1528 188 691 1318 571 1629 1 ...
##  $ dog.cat          : Factor w/ 3 levels "I like cats.",..: NA NA NA NA NA 2 2 2 1 1 ...
##  $ cheese           : Factor w/ 11 levels "American","Brie",..: NA NA NA NA NA 3 3 1 10 7 ...

SDMX

SDMX

  • Standard for Data and Metadata Exchange
  • International Organisations (OECD, Eurostat, ECB, IMF, World Bank, Unesco, ILO)
  • OECD.Stat: National Accounts > Annual National Accounts > Main Aggregates > GDP
  • Use Export (SDMX)
  • modify URL in SDMX browser at sdmx.rdata.work
  • download file from sdmx.rdata.work using getURL()
http_response <- getURL("http://sdmx.rdata.work/OECD/SNA_TABLE1.*.B1_GA.C+V/?start=1990&end=2015")
tt <- getURL("http://sdmx.rdata.work/getdownloadsdmx?")
read.csv(text = tt)[1:3, 1:3]
##   TIME_PERIOD SNA_TABLE1.AUS.B1_GA.C SNA_TABLE1.AUS.B1_GA.V
## 1  1990/01/01                 414815                 758132
## 2  1991/01/01                 422927                 761156
## 3  1992/01/01                 443969                 792060
  • constantly growing number of providers
  • few providers use latest standard of SDMX, making it difficult to develop tools
  • SDMX Java functions available for a range of tools (R, SAS, Matlab, Stata, Excel)

curling REST APIs

curling REST APIs

  • National Statistical Offices have started setting REST APIs for public data retrieval
  • The majority of NSOs create their individual API and few use existing community standards
  • nsoApi provides functions to harmonize information into an R time series format
  • opendata-tables lists tables that can be retrieved from SDMX, PXWEB and other standards
  • Some of the providers require registration, e.g. the BEA (USA) and ONS (UK)
  • GENESIS (Germany) requires a paid subscription to access the REST API
library(nsoApi)
cbs_data <- cbsODataAPI(api="http://opendata.cbs.nl/ODataApi/OData/", DSD="82572ENG", scheme="TypedDataSet")
head(cbs_data[, 1:4])
##   ID SectorBranchesSIC2008  Periods OutputBasicPrices_1
## 1  0                300025 1995JJ00              587866
## 2  1                300025 1996JJ00              624009
## 3  2                300025 1997JJ00              671992
## 4  3                300025 1998JJ00              714756
## 5  4                300025 1999JJ00              762732
## 6  5                300025 2000JJ00              838703

Key Things to Remember

Remember These Functions!

Operator/Function Description
data() access built-in data sets
? will provide you information regarding built-in data (i.e. ?mtcars)
read.csv() base R function for reading in .csv files (can also be used to read in a .csv file stored online)
read.delim() base R function for reading in .txt and .tsv files
read_excel() imports Excel data (provided by the readxl package)
read.xls() imports Excel data stored online (provided by the gdata package)
read_sas() imports .sas7bdat and .sas7bcat files (provided by the haven package)
read_dta() imports Stata files (provided by the haven package)
View() opens a spreadsheet-style data viewer
str() print data structure

Hands-on script 2-get-data.R