R Programming

Key Things to Know

What to Remember from this Section

Lots of built-in data sets provided by R
Text files use delimiters - these determine the function to read in the data
There is no base R function to read in Excel data - must use a package
The haven package can read and write SAS, Stata and SPSS files
Web scraping is a dense subject but scraping online text/Excel files can be as easy as reading in these files from your hard drive
SDMX APIs can be directly accessed from R using Java functions
Provider-specific functions can be created to extract and transform data from non-SDMX REST APIs

Built-in Data

Built-in Data Sets

R has many built-in data sets
type data() into your console (104 data sets should appear)

mtcars

##                      mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
## Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
## Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
## Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
## Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
## Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
## Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
## AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
## Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
## Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
## Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
## Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
## Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
## Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

Built in Data Sets

R also has many convenient built-in character strings to be aware of
Try these:

letters
LETTERS
month.abb
month.name
state.abb
state.division
state.name
state.region

Note

to load any of these data sets/strings into your global environment: data("name of data set")
data sets that are part of unloaded packages can be loaded using data(... package = ...)
type ?name to get more information about the built-in data (the containing package must be loaded)

Example

1. Load the economics data set

data(economics, package = "ggplot2")

2. What is this data measuring?

library(ggplot2)
?economics

Description from the help screen: "This dataset was produced from US economic time series data available from <URL: http://research.stlouisfed.org/fred2>. ‘economics’ is in "wide" format, ‘economics_long’ is in "long" format."

date Month of data collection
psavert personal savings rate
pce personal consumption expenditures, in billions of dollars
unemploy number of unemployed in thousands
uempmed median duration of unemployment, in week
pop total population, in thousands

Importing Text Files

.csv .txt .tsv

Text files are a popular way to hold and exchange tabular data
Text file formats use delimiters to separate the different elements
- .csv
- .txt
- .tsv
These delimiters help us know what functions to use to read in the data
I'll illustrate with the following files in the R-Bootcamp download file
- mydata.csv
- mydata.txt
- mydata.tsv

Importing Text Files

Base R

Use read.csv for comma separated files (automatically sets the sep = ",")

read.csv("../data/mydata.csv")
##   variable.1 variable.2 variable.3
## 1         10       beer       TRUE
## 2         25       wine       TRUE
## 3          8     cheese      FALSE

Use read.delim for tab delimited files (automatically sets the sep = "\t")

read.delim("../data/mydata.txt")
##   variable.1 variable.2 variable.3
## 1         10       beer       TRUE
## 2         25       wine       TRUE
## 3          8     cheese      FALSE

Importing Text Files

Base R

When importing data, store as data.frame object by using the assignment operator:

mydata <- read.delim("../data/mydata.tsv")

mydata
##   variable.1 variable.2 variable.3
## 1         10       beer       TRUE
## 2         25       wine       TRUE
## 3          8     cheese      FALSE

You now have a data object in your global environment named mydata
View your data in a spreadsheet form:
- clicking on the object in the global environment space or
- type View(mydata) in your console

OECD.Stat CSV Example

Open http://stats.oecd.org/Index.aspx?DataSetCode=SNA_TABLE1 in web browser, customise selection and export Text file (CSV)

☛ Transaction: B1_GA ☛ Measure: C, V ☛ Year: 1990-2015

OECD.Stat CSV Example

Read in the sna_table1.csv file and create object sna_table1

sna_table1 <- read.csv("../data/sna_table1.csv")

Take a peek at the object's data structure

str(sna_table1, width = 100, strict.width = "cut")
## 'data.frame':    2208 obs. of  17 variables:
##  $ LOCATION             : Factor w/ 47 levels "ARG","AUS","AUT",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Country              : Factor w/ 47 levels "Argentina","Australia",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ TRANSACT             : Factor w/ 1 level "B1_GA": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Transaction          : Factor w/ 1 level "Gross domestic product (output approach)": 1 1 1 1 1 ..
##  $ MEASURE              : Factor w/ 2 levels "C","V": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Measure              : Factor w/ 2 levels "Constant prices, national base year",..: 2 2 2 2 2 2..
##  $ TIME                 : int  1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 ...
##  $ Year                 : int  1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 ...
##  $ Unit.Code            : Factor w/ 29 levels "ARS","AUD","BRL",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Unit                 : Factor w/ 29 levels "Argentine Peso",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ PowerCode.Code       : int  6 6 6 6 6 6 6 6 6 6 ...
##  $ PowerCode            : Factor w/ 1 level "Millions": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Reference.Period.Code: int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Reference.Period     : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Value                : num  414815 422927 443969 466616 495554 ...
##  $ Flag.Codes           : Factor w/ 2 levels "","E": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Flags                : Factor w/ 2 levels "","Estimated value": 1 1 1 1 1 1 1 1 1 1 ...

Importing Excel Files

Excel is still the spreadsheet software of choice
Base R does not have an Excel importing function but we can use the readxl package

# if you haven't installed the readxl package run the following line (minus hashtag)
# install.packages("readxl")

library(readxl)

read_excel("../data/mydata.xlsx", sheet = "Sheet5")
## # A tibble: 3 × 5
##   `variable 1` `variable 2` `variable 3` `variable 4`        `variable 5`
##          <dbl>        <chr>        <dbl>        <dbl>              <dttm>
## 1           10         beer            1        42328 2015-11-20 13:30:00
## 2           25         wine            1           NA 2015-11-21 16:30:00
## 3            8         <NA>            0        42330 2015-11-22 14:45:00

Benefits of readxl
- recognizes date and time variables
- keeps text variables as characters rather than factors (more to come on this)
- does not collapse variable names
- read more about readxl capabilities here
xlsx and XLConnect are alternative packages for reading and writing Excel files

Example

Read in the spreadsheet 3. Median HH income, metro in the PEW-Middle-Class-Data.xlsx file

☛ We need to specify the sheet using read_excel(..., sheet = ...)

☛ We need to skip 5 lines using read_excel(..., skip = ...)

Hint: ?read_excel provides help on function parameters

Example

Read in the .xlsx file and save as pew

pew <- read_excel("../data/PEW-Middle-Class-Data.xlsx", sheet = "3. Median HH income, metro", skip = 5)

Take a peek at what this data looks like

head(pew)
## # A tibble: 6 × 9
##                               Metro All_1999 Lower_1999 Middle_1999
##                               <chr>    <dbl>      <dbl>       <dbl>
## 1                         Akron, OH 77688.32   27587.97    81671.30
## 2       Albany-Schenectady-Troy, NY 73185.16   27322.46    79718.21
## 3                   Albuquerque, NM 66067.18   26426.87    76363.59
## 4 Allentown-Bethlehem-Easton, PA-NJ 71607.73   29007.45    78782.43
## 5                      Amarillo, TX 60599.32   27014.16    76296.73
## 6                     Anchorage, AK 76652.96   27813.29    81361.67
## # ... with 5 more variables: Upper_1999 <dbl>, All_2014 <dbl>,
## #   Lower_2014 <dbl>, Middle_2014 <dbl>, Upper_2014 <dbl>

Importing SAS and Stata files

Importing SAS Files

The SAS System uses a prorpietary binary format for data files (.sas7bdat and .sas7bcat)
The haven package leverages the ReadStat C library to read both file formats and write .sas7bdat

library(haven)

helpfromSAS <- read_sas("../data/help.sas7bdat")
str(helpfromSAS[, 1:2])
## Classes 'tbl_df', 'tbl' and 'data.frame':    453 obs. of  2 variables:
##  $ ID  : atomic  1 2 3 4 5 6 7 8 9 10 ...
##   ..- attr(*, "label")= chr "subject id"
##  $ E2B1: atomic  NA NA NA NA NA NA NA 1 1 1 ...
##   ..- attr(*, "label")= chr "1 e2b"

Benefits of haven
- date times are converted to corresponding R classes
- labelled vectors are returned as a new labelled class
- can easily coerce to factors or replace labelled values with missings
- all functions return tibbles
- read more about the C library here

Importing Stata Files

haven/ReadStat can read Stata files up to version 14 (latest) and save objets as .dta
Stata missing values are displayed as NaN

carsdataSTATA <- read_dta("../data/carsdata.dta")
psych::describe(carsdataSTATA, skew = FALSE)
##        vars n mean   sd median trimmed  mad min max range   se
## cars      1 5    2 0.71      2       2 0.00   1   3     2 0.32
## hhsize    2 5    3 1.58      3       3 1.48   1   5     4 0.71

carsdataSTATA$cars[1] <- NA
write_dta(carsdataSTATA, path = "../data/carsdata_out.dta", version = 14)

read_dta("../data/carsdata_out.dta")[1:3,]
## # A tibble: 3 × 2
##    cars hhsize
##   <dbl>  <dbl>
## 1   NaN      1
## 2     2      2
## 3     2      3

Supports Stata "extended" missing values, .A through .Z using a tagged NA in R
See semantics vignette of haven package for additional information

Reading Online Files

Tabular files

Download URLs to csv and Excel files can be passed directly to R functions
For inspection with other tools, the file can be downloaded first using download.file()

url <- "https://inventory.data.gov/dataset/f6f06df8-c102-4fe8-af31-c368be66166d/resource/d65072b5-08f0-4f90-bd57-65e731c59df2/download/userssharedsdfdata.govfederalagencyparticipation.csv"
data_gov <- read.csv(file = url, stringsAsFactors = FALSE)
str(data_gov, width = 100, strict.width = "cut", list.len = 2)
## 'data.frame':    185 obs. of  11 variables:
##  $ Agency.Name            : chr  "Broadcasting Board of Governors" "Commodity Futures Trading Com"..
##  $ Agency.Abbreviation    : chr  "BBG" "CFTC" "CNS" "CSOSA" ...
##   [list output truncated]

## download file
destfile <- tempfile(fileext = ".csv")
download.file(url = url, destfile = destfile)
data_gov <- read.csv(file = destfile, stringsAsFactors = FALSE)

str(data_gov, width = 100, strict.width = "cut", list.len = 2)
## 'data.frame':    185 obs. of  11 variables:
##  $ Agency.Name            : chr  "Broadcasting Board of Governors" "Commodity Futures Trading Com"..
##  $ Agency.Abbreviation    : chr  "BBG" "CFTC" "CNS" "CSOSA" ...
##   [list output truncated]

Reading Online Files

Excel files

Reading online Excel files follows a similar process
gdata package is particular easy to use

Let's download some data from Fair Market Rents for Section 8 Housing:

suppressMessages(library(gdata))

url <- "http://www.huduser.org/portal/datasets/fmr/fmr2015f/FY2015F_4050_Final.xls"

# use read.xls to import
rents <- read.xls(url)

str(rents, list.len = 5)
## 'data.frame':    4769 obs. of  18 variables:
##  $ fips2000        : num  1.00e+08 1.00e+08 1.01e+08 1.01e+08 1.01e+08 ...
##  $ fips2010        : num  1.00e+08 1.00e+08 1.01e+08 1.01e+08 1.01e+08 ...
##  $ fmr2            : int  788 762 670 773 773 599 599 675 696 599 ...
##  $ fmr0            : int  628 494 492 545 545 481 453 491 511 481 ...
##  $ fmr1            : int  663 643 495 652 652 505 456 513 514 500 ...
##   [list output truncated]

Example

Reading file from Dropbox

1. Download the file stored at:

https://dl.dropboxusercontent.com/u/1807228/reddit.csv?dl=1

2. Save it as an object titled reddit

3. Inspect the data structure

Example

1. Retrieve the file contents stored at: https://dl.dropboxusercontent.com/u/1807228/reddit.csv?dl=1

suppressMessages(library(RCurl))
url <- "https://dl.dropboxusercontent.com/u/1807228/reddit.csv?dl=1"
tt <- getURL(url)
substr(tt, 1, 100)
## [1] "id,gender,age.range,marital.status,employment.status,military.service,children,education,country,sta"

2. Save it as an object titled reddit

reddit <- read.csv(url)

3. Inspect the data structure

str(reddit)
## 'data.frame':    32754 obs. of  14 variables:
##  $ id               : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ gender           : int  0 0 1 0 1 0 0 0 0 0 ...
##  $ age.range        : Factor w/ 7 levels "18-24","25-34",..: 2 2 1 2 2 2 2 1 3 2 ...
##  $ marital.status   : Factor w/ 6 levels "Engaged","Forever Alone",..: NA NA NA NA NA 4 3 4 4 3 ...
##  $ employment.status: Factor w/ 6 levels "Employed full time",..: 1 1 2 2 1 1 1 4 1 2 ...
##  $ military.service : Factor w/ 2 levels "No","Yes": NA NA NA NA NA 1 1 1 1 1 ...
##  $ children         : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ education        : Factor w/ 7 levels "Associate degree",..: 2 2 5 2 2 2 5 2 2 5 ...
##  $ country          : Factor w/ 439 levels " Canada"," Canada eh",..: 394 394 394 394 394 394 125 394 394 125 ...
##  $ state            : Factor w/ 53 levels "","Alabama","Alaska",..: 33 33 48 33 6 33 1 6 33 1 ...
##  $ income.range     : Factor w/ 8 levels "$100,000 - $149,999",..: 2 2 8 2 7 2 NA 7 2 7 ...
##  $ fav.reddit       : Factor w/ 1834 levels "","___","-","?",..: 720 691 1511 1528 188 691 1318 571 1629 1 ...
##  $ dog.cat          : Factor w/ 3 levels "I like cats.",..: NA NA NA NA NA 2 2 2 1 1 ...
##  $ cheese           : Factor w/ 11 levels "American","Brie",..: NA NA NA NA NA 3 3 1 10 7 ...

SDMX

Standard for Data and Metadata Exchange
International Organisations (OECD, Eurostat, ECB, IMF, World Bank, Unesco, ILO)
OECD.Stat: National Accounts > Annual National Accounts > Main Aggregates > GDP
Use Export (SDMX)
modify URL in SDMX browser at sdmx.rdata.work
download file from sdmx.rdata.work using getURL()

http_response <- getURL("http://sdmx.rdata.work/OECD/SNA_TABLE1.*.B1_GA.C+V/?start=1990&end=2015")
tt <- getURL("http://sdmx.rdata.work/getdownloadsdmx?")
read.csv(text = tt)[1:3, 1:3]

##   TIME_PERIOD SNA_TABLE1.AUS.B1_GA.C SNA_TABLE1.AUS.B1_GA.V
## 1  1990/01/01                 414815                 758132
## 2  1991/01/01                 422927                 761156
## 3  1992/01/01                 443969                 792060

constantly growing number of providers
few providers use latest standard of SDMX, making it difficult to develop tools
SDMX Java functions available for a range of tools (R, SAS, Matlab, Stata, Excel)

curling REST APIs

National Statistical Offices have started setting REST APIs for public data retrieval
The majority of NSOs create their individual API and few use existing community standards
nsoApi provides functions to harmonize information into an R time series format
opendata-tables lists tables that can be retrieved from SDMX, PXWEB and other standards
Some of the providers require registration, e.g. the BEA (USA) and ONS (UK)
GENESIS (Germany) requires a paid subscription to access the REST API

library(nsoApi)
cbs_data <- cbsODataAPI(api="http://opendata.cbs.nl/ODataApi/OData/", DSD="82572ENG", scheme="TypedDataSet")
head(cbs_data[, 1:4])
##   ID SectorBranchesSIC2008  Periods OutputBasicPrices_1
## 1  0                300025 1995JJ00              587866
## 2  1                300025 1996JJ00              624009
## 3  2                300025 1997JJ00              671992
## 4  3                300025 1998JJ00              714756
## 5  4                300025 1999JJ00              762732
## 6  5                300025 2000JJ00              838703

examples for other providers available in nsoApi package vignette

Key Things to Remember

Remember These Functions!

Operator/Function	Description
`data()`	access built-in data sets
`?`	will provide you information regarding built-in data (i.e. `?mtcars`)
`read.csv()`	base R function for reading in .csv files (can also be used to read in a .csv file stored online)
`read.delim()`	base R function for reading in .txt and .tsv files
`read_excel()`	imports Excel data (provided by the `readxl` package)
`read.xls()`	imports Excel data stored online (provided by the `gdata` package)
`read_sas()`	imports `.sas7bdat` and `.sas7bcat` files (provided by the `haven` package)
`read_dta()`	imports Stata files (provided by the `haven` package)
`View()`	opens a spreadsheet-style data viewer
`str()`	print data structure

R Data Import/Export manual

Key Things to Know

What to Remember from this Section

Built-in Data

Built-in Data Sets

Built in Data Sets

Note

Example

Importing Text Files

Importing Text Files

.csv .txt .tsv

Importing Text Files

Base R

Importing Text Files

Base R

OECD.Stat CSV Example

OECD.Stat CSV Example

Importing Excel Files

Importing Excel Files

Example

Example

Importing SAS and Stata files

Importing SAS Files

Importing Stata Files

Reading Online Files

Reading Online Files

Tabular files

Reading Online Files

Excel files

Example

Reading file from Dropbox

Example

SDMX

SDMX

curling REST APIs

curling REST APIs

Key Things to Remember

Remember These Functions!

Hands-on script 2-get-data.R