R Programming

Key Things to Remember

What to Remember from this Section

R has the ability to work with a variety of data types
As an analyst, you should become familiar with the following types of data:

Data Type	Description
Numbers	integer (e.g. 1, 2, 3…), double (e.g. 1.5, 3.66)
Character strings	"r", "I attend UC", etc.
Regular expressions	patterns within text strings
Factors	nominal (male, female), ordinal (freshman, sophomore, junior), interval ($0-25, $26-50, $51-75)
Dates	calendar dates (e.g. 2016-08-06, 08/06/2016), weekdays, hours, etc.
Logical	`TRUE`, `FALSE`, `any()`, `all()`

Numbers

Numbers: two types of numbers

Numeric data primarily comes in two forms: integer and double (double-precision floating point)

# create a vector of double-precision values
dbl_var <- c(1, 2.5, 4.5)  
class(dbl_var)
## [1] "numeric"

# placing an L after the values creates a vector of integers
int_var <- c(1L, 6L, 10L)
class(int_var)
## [1] "integer"

We can coerce integers to doubles and vice versa with as.double() and as.integer()

as.integer(dbl_var)
## [1] 1 2 4

int_to_dbl <- as.double(int_var)
class(int_to_dbl)
## [1] "numeric"

# Combining double and integer automatically coerces the integers to the more general type (double)
c(dbl_var, int_var)
## [1]  1.0  2.5  4.5  1.0  6.0 10.0
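
Note that class() reports "numeric" for doubles, so typeof() is the clearer check of how a vector is stored. A minimal sketch reusing the vectors above; the negative value and the name mixed are illustrative assumptions, added to show that as.integer() truncates toward zero rather than rounding.

```r
# reuse the vectors defined above
dbl_var <- c(1, 2.5, 4.5)
int_var <- c(1L, 6L, 10L)

# typeof() reports the underlying storage type
typeof(dbl_var)
## [1] "double"
typeof(int_var)
## [1] "integer"

# as.integer() truncates toward zero; it does not round
as.integer(c(-2.7, 2.7))
## [1] -2  2

# mixing the two types yields a double vector
mixed <- c(dbl_var, int_var)
typeof(mixed)
## [1] "double"
```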

Numbers: comparing numbers

You've already seen the comparison operators ==, !=, <, <=, >, >=

x <- c(4, 4, 9, 12)
y <- c(4, 4, 9, 12.00000008)

x == y
## [1]  TRUE  TRUE  TRUE FALSE

We can also test for exact equality with identical() and near equality with all.equal()

z <- c(4, 4, 9, 12)

identical(x, y)
## [1] FALSE
identical(x, z)
## [1] TRUE

all.equal(x, y)
## [1] TRUE
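
Near equality matters most with values produced by arithmetic. The sketch below uses the common 0.1 + 0.2 case as an illustration (the variable names are assumptions, not from the slides). Because all.equal() returns a descriptive message rather than FALSE when values differ, it is usually wrapped in isTRUE() before being used in a condition.

```r
a <- 0.1 + 0.2
b <- 0.3

# exact comparison fails because of floating-point representation error
a == b
## [1] FALSE

# all.equal() compares within a small tolerance
all.equal(a, b)
## [1] TRUE

# wrap in isTRUE() to get a clean TRUE/FALSE in either case
near_equal <- isTRUE(all.equal(a, b))
not_equal <- isTRUE(all.equal(1, 1.1))
near_equal
## [1] TRUE
not_equal
## [1] FALSE
```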

Numbers: rounding

We can also round numbers multiple ways:

x <- c(1, 1.35, 1.7, 2.05, 2.4, 2.75)

# round to the nearest integer
round(x)
## [1] 1 1 2 2 2 3

# round up
ceiling(x)
## [1] 1 2 2 3 3 3

# round down
floor(x)
## [1] 1 1 1 2 2 2

# round to a specified decimal
round(x, digits = 1)
## [1] 1.0 1.4 1.7 2.0 2.4 2.8
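
A few related behaviours are worth knowing. This is a short sketch with illustrative values: trunc() and signif() are base R functions not shown on the slide, and round() sends exact halves to the nearest even digit rather than always rounding up.

```r
# trunc() drops the fractional part; floor() always rounds down
neg <- -1.7
trunc(neg)
## [1] -1
floor(neg)
## [1] -2

# signif() rounds to a number of significant digits
signif(123456, digits = 2)
## [1] 120000

# round() sends exact halves to the nearest even digit
round(c(0.5, 1.5, 2.5))
## [1] 0 2 2
```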

Your Turn

Import the numbers-your-turn.csv file in the data folder

1. Are the vectors x & y equal? Exactly or approximately equal?

2. Are the vectors y & z equal? Exactly or approximately equal?

3. Round x & y numbers to the 4th digit

4. Are these vectors equal now?

Solution

# import the numbers-your-turn.csv file in the data folder
df <- read.csv("../data/numbers-your-turn.csv")

# 1. Are the vectors x & y equal? Exactly or approximately equal?
identical(df$x, df$y)
## [1] FALSE

all.equal(df$x, df$y)
## [1] "Mean relative difference: 1.041407e-07"

# 2. Are the vectors y & z equal? Exactly or approximately equal?
identical(df$y, df$z)
## [1] FALSE

all.equal(df$y, df$z)
## [1] TRUE

# 3. Round x & y numbers to the 4th digit
x <- round(df$x, digits = 4)
y <- round(df$y, digits = 4)

# 4. Are these vectors equal now?
identical(x, y)
## [1] TRUE

all.equal(x, y)
## [1] TRUE

Characters

"Hello world!"

Characters: creating simple character strings

Create character strings using ""

a <- "learning to create"
b <- "character strings"

Combine character strings with c(), paste() or paste0()

# create a vector containing two elements - a and b
c(a, b)
## [1] "learning to create" "character strings"

# create a vector containing one element - a and b combined
paste(a, b)
## [1] "learning to create character strings"

# paste multiple strings
paste("I", "love", "R")
## [1] "I love R"

# change the separator
paste("I", "love", "R", sep = "-")
## [1] "I-love-R"

# paste0() joins the strings with no separator
paste0("I", "love", "R")
## [1] "IloveR"
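
paste() also works element by element on vectors, and its collapse argument joins the elements of a vector into one string. A minimal sketch with made-up inputs (ids and joined are illustrative names):

```r
# paste() is vectorized: the shorter argument is recycled
ids <- paste("Q", 1:3, sep = "")
ids
## [1] "Q1" "Q2" "Q3"

# collapse joins the elements of a vector into a single string
joined <- paste(c("a", "b", "c"), collapse = ", ")
joined
## [1] "a, b, c"
```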

Characters: test, conversion & coercion

Use class(), mode() and/or is.character() to assess the data type

a <- "Life of"
b <- pi

class(a)
## [1] "character"

mode(a)
## [1] "character"

is.character(pi)
## [1] FALSE

Use as.character() to convert non-character to a character

as.character(pi)
## [1] "3.14159265358979"

Combining characters and non-characters will coerce all inputs to a character

c(a, b)
## [1] "Life of"          "3.14159265358979"
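
Coercion in the other direction only works when the text looks like a number. A small sketch with illustrative strings: as.numeric() returns NA (with a warning, suppressed here) for anything it cannot parse.

```r
# character strings that look like numbers convert cleanly
as.numeric("3.14")
## [1] 3.14

# anything else becomes NA (R also emits a warning, suppressed here)
converted <- suppressWarnings(as.numeric(c("10", "abc", "2.5")))
converted
## [1] 10.0   NA  2.5
```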

Characters: summarizing

Use length() to count the number of elements (individual character strings) in a vector

length("How many elements are in this string?")
## [1] 1

length(c("How", "many", "elements", "are", "in", "this", "string?"))
## [1] 7

Use nchar() to count the number of characters in each element

nchar("How many characters are in this string?")
## [1] 39

nchar(c("How", "many", "characters", "are", "in", "this", "string?"))
## [1]  3  4 10  3  2  4  7

Factors

aka categorical variables

Factors: different from characters

Key words: finite options and levels

nominal variables
- male, female
- brunette, blonde, red, black
- Hispanic, Caucasian, Asian, African
ordinal variables
- slow, medium, fast
- freshman, sophomore, junior, senior
interval variables
- $1-100, $101-200, $201-300
- 0-10, 11-20, 21-30

Factors: creating nominal factors

Create nominal factors with factor()

gender <- c("male", "female", "female")

class(gender)
## [1] "character"

gender2 <- factor(gender)

class(gender2)
## [1] "factor"

gender2
## [1] male   female female
## Levels: female male

Set the order of the levels with the levels argument

factor(gender, levels = c("male", "female"))
## [1] male   female female
## Levels: male female
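
Under the hood a factor stores integer codes that point into its levels, which is why the level order matters. A minimal sketch using the gender vector from above; the "other" value is an illustrative assumption showing that values missing from levels become NA.

```r
gender <- c("male", "female", "female")

# default levels are sorted alphabetically: female = 1, male = 2
as.integer(factor(gender))
## [1] 2 1 1

# a custom level order changes the codes: male = 1, female = 2
as.integer(factor(gender, levels = c("male", "female")))
## [1] 1 2 2

# values not listed in levels become NA
unlisted <- factor(c("male", "other"), levels = c("male", "female"))
is.na(unlisted)
## [1] FALSE  TRUE
```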

Factors: creating ordered factors

Create ordinal/interval factors with ordered(); set the order of the levels with the levels argument

age.range <- c("18-24", "25-34", "35-44", "45-54", "55-64", "65 or Above", "Under 18")

class(age.range)
## [1] "character"

# turn age.range into an ordered factor - levels default to sorted (alphabetical) order
age.range2 <- ordered(age.range)

class(age.range2)
## [1] "ordered" "factor"

age.range2
## [1] 18-24       25-34       35-44       45-54       55-64       65 or Above
## [7] Under 18   
## 7 Levels: 18-24 < 25-34 < 35-44 < 45-54 < 55-64 < ... < Under 18

Set the order of the levels with the levels argument

ordered(age.range, levels = c("Under 18", "18-24", "25-34", "35-44", "45-54", "55-64", "65 or Above"))
## [1] 18-24       25-34       35-44       45-54       55-64       65 or Above
## [7] Under 18   
## 7 Levels: Under 18 < 18-24 < 25-34 < 35-44 < 45-54 < ... < 65 or Above
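
The payoff of an ordered factor is that comparisons and sorting follow the level order instead of alphabetical order. A short sketch, assuming a hypothetical speed variable built from the slow/medium/fast example earlier:

```r
speed <- ordered(c("fast", "slow", "medium"),
                 levels = c("slow", "medium", "fast"))

# comparisons follow the level order, not alphabetical order
speed > "slow"
## [1]  TRUE FALSE  TRUE

# max() is defined for ordered factors
as.character(max(speed))
## [1] "fast"

# sort() also follows the level order
as.character(sort(speed))
## [1] "slow"   "medium" "fast"
```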

Factors: summarizing

If you want to know the levels that exist in your factor variable use levels()

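# stringsAsFactors = TRUE reads text columns as factors (the default before R 4.0.0)
facebook <- read.delim("../data/facebook.tsv", stringsAsFactors = TRUE)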

levels(facebook$gender)
## [1] "female" "male"

We can use the table() function to quickly assess the counts of each level

table(facebook$gender)
## 
## female   male 
##  40254  58574
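
Counts can be turned into proportions with prop.table(). A minimal sketch on a small made-up vector (not the facebook data), so it runs without the data file:

```r
# illustrative data: 2 female, 3 male
gender <- factor(c("male", "female", "female", "male", "male"))

counts <- table(gender)
counts[["female"]]
## [1] 2

# prop.table() converts counts to shares of the total
shares <- prop.table(counts)
shares[["male"]]
## [1] 0.6
```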

Your Turn

Import the reddit.csv file in the data folder

1. What are the levels for the income.range variable?

2. Properly order the levels for income.range.

3. What are the counts for each level?

Solution

# import the reddit.csv file in the data folder
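# stringsAsFactors = TRUE reads text columns as factors (the default before R 4.0.0)
reddit <- read.csv("../data/reddit.csv", stringsAsFactors = TRUE)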

# 1. What are the levels for the `income.range` variable?
levels(reddit$income.range)
## [1] "$100,000 - $149,999" "$150,000 or more"    "$20,000 - $29,999"  
## [4] "$30,000 - $39,999"   "$40,000 - $49,999"   "$50,000 - $69,999"  
## [7] "$70,000 - $99,999"   "Under $20,000"

# 2. Properly order the levels for income.range.
reddit$income.range <- ordered(reddit$income.range, 
                               levels = c("Under $20,000", "$20,000 - $29,999", "$30,000 - $39,999",
                                          "$40,000 - $49,999", "$50,000 - $69,999", "$70,000 - $99,999",
                                          "$100,000 - $149,999", "$150,000 or more"))

# 3. What are the counts for each level?
table(reddit$income.range)
## 
##       Under $20,000   $20,000 - $29,999   $30,000 - $39,999 
##                7892                3206                2904 
##   $40,000 - $49,999   $50,000 - $69,999   $70,000 - $99,999 
##                2686                4133                4101 
## $100,000 - $149,999    $150,000 or more 
##                3522                2695

Dates

The neglected variable

Dates: creating

The lubridate package makes working with dates extremely easy
To create a date variable we simply need to know the order in which the year, month, and day appear

Function	Order of elements in date-time
`ymd()`	year, month, day
`ydm()`	year, day, month
`mdy()`	month, day, year
`dmy()`	day, month, year
`hm()`	hour, minute
`hms()`	hour, minute, second
`ymd_hms()`	year, month, day, hour, minute, second

dates <- c("2015-07-01", "2015-08-01", "2015-09-01")

class(dates)
## [1] "character"

Convert this character vector to date format with lubridate's ymd() function

# install.packages("lubridate") # run this line if you have not yet installed lubridate
library(lubridate)

dates2 <- ymd(dates)

class(dates2)
## [1] "Date"

dates2
## [1] "2015-07-01" "2015-08-01" "2015-09-01"
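
The other parsing functions work the same way; only the expected element order changes. A minimal sketch, assuming lubridate is installed and using illustrative date strings. Input that cannot be parsed comes back as NA (lubridate also warns, suppressed here).

```r
library(lubridate)

# the same calendar date written in three different orders
d1 <- ymd("2016-08-06")
d2 <- mdy("08/06/2016")
d3 <- dmy("06-08-2016")

d1 == d2
## [1] TRUE
d1 == d3
## [1] TRUE

# input that cannot be parsed becomes NA
bad <- suppressWarnings(ymd("not a date"))
is.na(bad)
## [1] TRUE
```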

Dates: create by merging

Sometimes your date data are collected in separate elements
To convert these separate data into one date object use the ISOdate() function:

yr <- c("2012", "2013", "2014", "2015") 
mo <- c("1", "5", "7", "2")
day <- c("02", "22", "15", "28")

# ISOdate converts to a POSIXct object
full_date <- ISOdate(year = yr, month = mo, day = day)
full_date
## [1] "2012-01-02 12:00:00 GMT" "2013-05-22 12:00:00 GMT"
## [3] "2014-07-15 12:00:00 GMT" "2015-02-28 12:00:00 GMT"

We can truncate the unused time data by converting with as.Date()

as.Date(full_date)
## [1] "2012-01-02" "2013-05-22" "2014-07-15" "2015-02-28"
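
An alternative sketch in base R: paste the components together and parse them with an explicit format. This is one possible approach rather than the slides' method, and the name pasted is an illustrative assumption. The last line shows that an impossible date comes back as NA instead of raising an error.

```r
yr  <- c("2012", "2013", "2014", "2015")
mo  <- c("1", "5", "7", "2")
day <- c("02", "22", "15", "28")

# paste the parts together, then parse with an explicit format
pasted <- as.Date(paste(yr, mo, day, sep = "-"), format = "%Y-%m-%d")
pasted
## [1] "2012-01-02" "2013-05-22" "2014-07-15" "2015-02-28"

# both routes give the same dates
all(pasted == as.Date(ISOdate(year = yr, month = mo, day = day)))
## [1] TRUE

# an impossible date (30 February) comes back as NA
is.na(ISOdate(year = 2015, month = 2, day = 30))
## [1] TRUE
```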

Dates: extract & manipulate

We can also easily extract components of dates using lubridate

Function	Date-time element to extract
`year()`	Year
`month()`	Month
`week()`	Week
`yday()`	Day of year
`mday()`	Day of month
`wday()`	Day of week
`hour()`	Hour
`minute()`	Minute
`second()`	Second
`tz()`	Time zone


Extract time components:

year(full_date)
## [1] 2012 2013 2014 2015

week(full_date)
## [1]  1 21 28  9

wday(full_date, label = TRUE)
## [1] Mon  Wed  Tues Sat 
## Levels: Sun < Mon < Tues < Wed < Thurs < Fri < Sat

Change a date-time component by assigning a new value to its extractor function

as.Date(full_date)
## [1] "2012-01-02" "2013-05-22" "2014-07-15" "2015-02-28"

year(full_date) <- c(2014, 2015, 2015, 2016)

as.Date(full_date)
## [1] "2014-01-02" "2015-05-22" "2015-07-15" "2016-02-28"
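
The same pattern applies to the other extractor functions. A brief sketch on a single illustrative date, assuming lubridate is installed:

```r
library(lubridate)

d <- ymd("2015-02-28")  # illustrative date

month(d)
## [1] 2
mday(d)
## [1] 28
yday(d)
## [1] 59

# assigning to an extractor changes just that component
month(d) <- 3
d
## [1] "2015-03-28"
```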

Dates: summarizing

We can also compute the usual statistical summaries of date objects
We illustrate with the lakers data set that comes with the lubridate package

dates <- ymd(lakers$date)

min(dates)
## [1] "2008-10-28"

max(dates)
## [1] "2009-04-14"

mean(dates)
## [1] "2009-01-22"

median(dates)
## [1] "2009-01-21"

summary(dates)
##         Min.      1st Qu.       Median         Mean      3rd Qu. 
## "2008-10-28" "2008-12-10" "2009-01-21" "2009-01-22" "2009-03-09" 
##         Max. 
## "2009-04-14"
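
Dates also support arithmetic, so spans and gaps are easy to compute. A minimal base R sketch on three made-up dates (not taken from the lakers data); game_dates and span are illustrative names.

```r
# illustrative dates; not taken from the lakers data
game_dates <- as.Date(c("2008-10-28", "2009-01-21", "2009-04-14"))

range(game_dates)
## [1] "2008-10-28" "2009-04-14"

# subtracting dates gives a difftime; as.numeric() extracts the day count
span <- as.numeric(diff(range(game_dates)))
span
## [1] 168

# diff() on sorted dates gives the gaps between consecutive dates
as.numeric(diff(sort(game_dates)))
## [1] 85 83
```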

Your Turn

Import the facebook.tsv file in the data folder

1. Create a new date variable that combines the dob_day,
dob_month, & dob_year variables.

2. What are the min, max, mean, and median dates of birth in
this data frame?

NOTE: If you save the new variable as facebook$dob <- _____________ it will add this new variable to the facebook data frame

Solution

# Import the `facebook.tsv` file in the data folder
facebook <- read.delim("../data/facebook.tsv")

# 1. Create a new date variable that combines the dob_day, dob_month, & dob_year variables.
facebook$dob <- as.Date(ISOdate(year = facebook$dob_year, 
                                month = facebook$dob_month, 
                                day = facebook$dob_day))

# 2. What are the min, max, mean, and median dates of birth in this data frame?
summary(facebook$dob)
##         Min.      1st Qu.       Median         Mean      3rd Qu. 
## "1900-01-01" "1963-08-14" "1985-01-20" "1976-03-12" "1993-01-01" 
##         Max. 
## "2000-10-27"

Logical

Boolean as in George Boole

Logical: the basics

We already saw how we can get TRUE/FALSE responses from comparing elements

x <- c(4, 4, 9, 12, 2, 2, 10)
y <- c(4, 5, 9, 13, 2, 1, 10)

x == y
## [1]  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE

This is just a vector containing logical elements

z <- x == y

class(z)
## [1] "logical"

We can assess if any or all of the elements are TRUE

any(z)
## [1] TRUE

all(z)
## [1] FALSE
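
Because TRUE counts as 1 and FALSE as 0 in arithmetic, logical vectors are easy to summarize. A short sketch reusing x, y and z from above:

```r
x <- c(4, 4, 9, 12, 2, 2, 10)
y <- c(4, 5, 9, 13, 2, 1, 10)
z <- x == y

# how many elements match
sum(z)
## [1] 4

# which positions match
which(z)
## [1] 1 3 5 7

# proportion of elements that match (4 out of 7)
mean(z)
## [1] 0.5714286
```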

Key Things to Remember

Remember These Functions!

Operator/Function	Description
`as.double()`, `as.integer()`	coerce to double-precision floating point or integer numbers
`identical()`, `all.equal()`	test for exact and near equality
`round()`, `ceiling()`, `floor()`	round numbers
`c()`, `paste()`, `paste0()`	combine character strings
`as.character()`	coerce non-character to a character
`nchar()`	count the number of characters in each element
`factor()`, `ordered()`	create or coerce to factor variables
`levels()`	assess the levels of a factor
`table()`	get the counts of each level

Operator/Function	Description
`ymd()`, `mdy()`, `hm()`, etc	`lubridate`: create or convert to date-time variable
`ISOdate()`	create date variable by merging separate date components
`as.Date()`	truncate date-time variable to just date variable
`year()`, `week()`, etc	`lubridate`: extract individual date components
`any()`, `all()`	assess if any or all elements are `TRUE`
