Key Things to Remember

What to Remember from this Section

  • R has the ability to work with a variety of data types
  • As an analyst, you should become familiar with dealing with the following types of data:
Data Type            Description
Numbers              integer (e.g. 1, 2, 3, …), double (e.g. 1.5, 3.66)
Character strings    "r", "I attend UC", etc.
Regular expressions  patterns within text strings
Factors              nominal (male, female), ordinal (freshman, sophomore, junior), interval ($0-25, $26-50, $51-75)
Dates                calendar dates (e.g. 2016-08-06, 08/06/2016), weekdays, hours, etc.
Logical              TRUE, FALSE, any, all

Numbers

Numbers: two types of numbers

Numeric data primarily comes in two forms: integer and double (double-precision floating point)

# create a vector of double-precision values
dbl_var <- c(1, 2.5, 4.5)  
class(dbl_var)
## [1] "numeric"

# placing an L after the values creates a vector of integers
int_var <- c(1L, 6L, 10L)
class(int_var)
## [1] "integer"

We can coerce integers to doubles and vice versa with as.double() and as.integer()

as.integer(dbl_var)
## [1] 1 2 4

int_to_dbl <- as.double(int_var)
class(int_to_dbl)
## [1] "numeric"

# combining double and integer automatically coerces to the more general type (double)
c(dbl_var, int_var)
## [1]  1.0  2.5  4.5  1.0  6.0 10.0
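
Two details are worth checking when coercing. The sketch below is a minimal illustration: typeof() shows the storage type that class() reports as "numeric", and as.integer() truncates toward zero rather than rounding. The neg_var values are illustrative and not part of the original examples.

```r
dbl_var <- c(1, 2.5, 4.5)
int_var <- c(1L, 6L, 10L)

# typeof() reports the storage type; class() reports "numeric" for doubles
typeof(dbl_var)
## [1] "double"
typeof(int_var)
## [1] "integer"

# as.integer() drops the fractional part (truncation toward zero)
neg_var <- c(-2.7, 2.7)   # illustrative values
as.integer(neg_var)
## [1] -2  2

# round first if nearest-integer behaviour is wanted
as.integer(round(neg_var))
## [1] -3  3
```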

Numbers: comparing numbers

You've already seen the comparison operators ==, !=, <, <=, >, >=

x <- c(4, 4, 9, 12)
y <- c(4, 4, 9, 12.00000008)

x == y
## [1]  TRUE  TRUE  TRUE FALSE

We can also test for exact equality with identical() and near equality with all.equal()

z <- c(4, 4, 9, 12)

identical(x, y)
## [1] FALSE
identical(x, z)
## [1] TRUE

all.equal(x, y)
## [1] TRUE
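
When the values differ by more than the tolerance, all.equal() returns a character description of the difference rather than FALSE. A common pattern, sketched here with illustrative variable names, is to wrap it in isTRUE() before using it in a condition.

```r
sum_val <- 0.1 + 0.2   # illustrative values
target  <- 0.3

# the two doubles differ in their last bits
sum_val == target
## [1] FALSE

# equal within the default tolerance
isTRUE(all.equal(sum_val, target))
## [1] TRUE

# all.equal() gives a character string, not FALSE, when values differ
is.character(all.equal(1, 1.1))
## [1] TRUE

if (isTRUE(all.equal(sum_val, target))) {
  message("sum_val and target are equal within tolerance")
}
```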

Numbers: rounding

We can also round numbers multiple ways:

x <- c(1, 1.35, 1.7, 2.05, 2.4, 2.75)

# round to the nearest integer
round(x)
## [1] 1 1 2 2 2 3

# round up
ceiling(x)
## [1] 1 2 2 3 3 3

# round down
floor(x)
## [1] 1 1 1 2 2 2

# round to a specified number of decimal places
round(x, digits = 1)
## [1] 1.0 1.4 1.7 2.0 2.4 2.8
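
The rounding functions behave differently for negative numbers and exact halves. The sketch below uses illustrative values and also shows trunc() and signif(), two base R functions not covered above.

```r
v <- c(-1.7, 1.7)   # illustrative values

floor(v)     # toward negative infinity
## [1] -2  1
ceiling(v)   # toward positive infinity
## [1] -1  2
trunc(v)     # toward zero
## [1] -1  1

# exact halves round to the nearest even integer
round(c(0.5, 1.5, 2.5))
## [1] 0 2 2

# signif() rounds to significant digits rather than decimal places
signif(123456, digits = 2)
## [1] 120000
```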

Your Turn

Import the numbers-your-turn.csv file in the data folder






1. Are the vectors x & y equal? Exactly or approximately equal?

2. Are the vectors y & z equal? Exactly or approximately equal?

3. Round the x & y values to four decimal places

4. Are these vectors equal now?

Solution

# import the numbers-your-turn.csv file in the data folder
df <- read.csv("../data/numbers-your-turn.csv")

# 1. Are the vectors x & y equal? Exactly or approximately equal?
identical(df$x, df$y)
## [1] FALSE

all.equal(df$x, df$y)
## [1] "Mean relative difference: 1.041407e-07"

# 2. Are the vectors y & z equal? Exactly or approximately equal?
identical(df$y, df$z)
## [1] FALSE

all.equal(df$y, df$z)
## [1] TRUE

# 3. Round the x & y values to four decimal places
x <- round(df$x, digits = 4)
y <- round(df$y, digits = 4)

# 4. Are these vectors equal now?
identical(x, y)
## [1] TRUE

all.equal(x, y)
## [1] TRUE

Characters

"Hello world!"

Characters: creating simple character strings

Create character strings using ""

a <- "learning to create"
b <- "character strings"

Combine character strings with c(), paste(), or paste0()

# create a vector containing two elements - a and b
c(a, b)
## [1] "learning to create" "character strings"

# create a vector containing one element - a and b combined
paste(a, b)
## [1] "learning to create character strings"

# paste multiple strings
paste("I", "love", "R")
## [1] "I love R"

# change the separator
paste("I", "love", "R", sep = "-")
## [1] "I-love-R"

# paste0() joins the strings with no separator
paste0("I", "love", "R")
## [1] "IloveR"
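
paste() and paste0() are also vectorized, and the collapse argument joins the elements of the result into a single string. A short sketch with illustrative values:

```r
ids <- 1:3   # illustrative values

# shorter arguments are recycled across the longer one
labels <- paste("sample", ids, sep = "_")
labels
## [1] "sample_1" "sample_2" "sample_3"

# collapse joins the elements of a vector into one string
sentence <- paste(c("I", "love", "R"), collapse = " ")
sentence
## [1] "I love R"

formula_txt <- paste0("x", ids, collapse = "+")
formula_txt
## [1] "x1+x2+x3"
```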

Characters: test, conversion & coercion

Use class(), mode() and/or is.character() to assess the data type

a <- "Life of"
b <- pi

class(a)
## [1] "character"

mode(a)
## [1] "character"

is.character(pi)
## [1] FALSE

Use as.character() to convert non-character to a character

as.character(pi)
## [1] "3.14159265358979"

Combining characters and non-characters will coerce all inputs to a character

c(a, b)
## [1] "Life of"          "3.14159265358979"
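
Going the other way, as.numeric() converts character strings back to numbers. In this sketch (illustrative values), elements that cannot be parsed become NA and R issues a warning, which is suppressed here only to keep the output clean.

```r
num_txt <- c("3.14", "42", "abc")   # illustrative values

# "abc" cannot be parsed as a number, so it becomes NA
converted <- suppressWarnings(as.numeric(num_txt))
converted
## [1]  3.14 42.00    NA

is.na(converted)
## [1] FALSE FALSE  TRUE
```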

Characters: summarizing

Use length() to count the number of elements (individual character strings) in a vector

length("How many elements are in this string?")
## [1] 1

length(c("How", "many", "elements", "are", "in", "this", "string?"))
## [1] 7


Use nchar() to count the number of characters in each element

nchar("How many characters are in this string?")
## [1] 39

nchar(c("How", "many", "characters", "are", "in", "this", "string?"))
## [1]  3  4 10  3  2  4  7

Factors

aka categorical variables

Factors: different from characters

Key Words : finite options and levels

  • nominal variables
    • male, female
    • brunette, blonde, red, black
    • Hispanic, Caucasian, Asian, African
  • ordinal variables
    • slow, medium, fast
    • freshman, sophomore, junior, senior
  • interval variables
    • $1-100, $101-200, $201-300
    • 0-10, 11-20, 21-30

Factors: creating nominal factors

Create nominal factors with factor()

gender <- c("male", "female", "female")

class(gender)
## [1] "character"

gender2 <- factor(gender)

class(gender2)
## [1] "factor"

gender2
## [1] male   female female
## Levels: female male

Set the level order with the levels argument

factor(gender, levels = c("male", "female"))
## [1] male   female female
## Levels: male female
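
Under the hood a factor stores integer codes that index into its levels. The sketch below reuses the gender example to show those codes and what happens to values outside the declared levels; the "other" value is illustrative.

```r
gender <- c("male", "female", "female")
gender2 <- factor(gender)

# integer codes follow the level order (female = 1, male = 2)
as.integer(gender2)
## [1] 2 1 1

nlevels(gender2)
## [1] 2

# values not listed in levels become NA
restricted <- factor(c("male", "other"), levels = c("male", "female"))
is.na(restricted)
## [1] FALSE  TRUE
```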

Factors: creating ordered factors

Create ordinal/interval factors with ordered(); set the level order with the levels argument

age.range <- c("18-24", "25-34", "35-44", "45-54", "55-64", "65 or Above", "Under 18")

class(age.range)
## [1] "character"

# turn age.range into an ordered factor - levels default to sorted (alphabetical) order
age.range2 <- ordered(age.range)

class(age.range2)
## [1] "ordered" "factor"

age.range2
## [1] 18-24       25-34       35-44       45-54       55-64       65 or Above
## [7] Under 18   
## 7 Levels: 18-24 < 25-34 < 35-44 < 45-54 < 55-64 < ... < Under 18

Set the level order explicitly with the levels argument

ordered(age.range, levels = c("Under 18", "18-24", "25-34", "35-44", "45-54", "55-64", "65 or Above"))
## [1] 18-24       25-34       35-44       45-54       55-64       65 or Above
## [7] Under 18   
## 7 Levels: Under 18 < 18-24 < 25-34 < 35-44 < 45-54 < ... < 65 or Above
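
Once the levels are ordered, comparison operators and functions such as max() respect that order. A minimal sketch with illustrative data (age_levels and ages are not part of the original example):

```r
age_levels <- c("Under 18", "18-24", "25-34", "35-44", "45-54", "55-64", "65 or Above")
ages <- ordered(c("25-34", "Under 18", "55-64"), levels = age_levels)  # illustrative data

# comparisons use the level order, not alphabetical order
ages < "25-34"
## [1] FALSE  TRUE FALSE

# the highest level present in the data
as.character(max(ages))
## [1] "55-64"
```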

Factors: summarizing

If you want to know the levels that exist in your factor variable use levels()

# in R >= 4.0.0 strings are no longer read in as factors by default
facebook <- read.delim("../data/facebook.tsv", stringsAsFactors = TRUE)

levels(facebook$gender)
## [1] "female" "male"


We can use the table() function to quickly assess the counts of each level

table(facebook$gender)
## 
## female   male 
##  40254  58574
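
Counts can be turned into proportions with prop.table(). The sketch below uses a small illustrative factor rather than the facebook data.

```r
grp <- factor(c("a", "b", "a", "a"))   # illustrative data

counts <- table(grp)       # a: 3, b: 1
shares <- prop.table(counts)   # a: 0.75, b: 0.25

# sort levels from most to least frequent
sorted_counts <- sort(counts, decreasing = TRUE)
names(sorted_counts)
## [1] "a" "b"
```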

Your Turn

Import the reddit.csv file in the data folder




1. What are the levels for the income.range variable?


2. Properly order the levels for income.range.


3. What are the counts for each level?

Solution

# import the reddit.csv file in the data folder
# in R >= 4.0.0 strings are no longer read in as factors by default
reddit <- read.csv("../data/reddit.csv", stringsAsFactors = TRUE)

# 1. What are the levels for the `income.range` variable?
levels(reddit$income.range)
## [1] "$100,000 - $149,999" "$150,000 or more"    "$20,000 - $29,999"  
## [4] "$30,000 - $39,999"   "$40,000 - $49,999"   "$50,000 - $69,999"  
## [7] "$70,000 - $99,999"   "Under $20,000"

# 2. Properly order the levels for income.range.
reddit$income.range <- ordered(reddit$income.range, 
                               levels = c("Under $20,000", "$20,000 - $29,999", "$30,000 - $39,999",
                                          "$40,000 - $49,999", "$50,000 - $69,999", "$70,000 - $99,999",
                                          "$100,000 - $149,999", "$150,000 or more"))

# 3. What are the counts for each level?
table(reddit$income.range)
## 
##       Under $20,000   $20,000 - $29,999   $30,000 - $39,999 
##                7892                3206                2904 
##   $40,000 - $49,999   $50,000 - $69,999   $70,000 - $99,999 
##                2686                4133                4101 
## $100,000 - $149,999    $150,000 or more 
##                3522                2695

Dates

The neglected variable

Dates: creating

  • The lubridate package makes working with dates extremely easy
  • To create a date variable we simply need to know the year-month-day order
Function    Order of elements in date-time
ymd()       year, month, day
ydm()       year, day, month
mdy()       month, day, year
dmy()       day, month, year
hm()        hour, minute
hms()       hour, minute, second
ymd_hms()   year, month, day, hour, minute, second

dates <- c("2015-07-01", "2015-08-01", "2015-09-01")

class(dates)
## [1] "character"

Convert this character vector to date format with lubridate's ymd() function

# install.packages("lubridate") # run this line if you have not yet installed lubridate
library(lubridate)

dates2 <- ymd(dates)

class(dates2)
## [1] "Date"

dates2
## [1] "2015-07-01" "2015-08-01" "2015-09-01"
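
The other parsing functions work the same way: pick the one whose name matches the order of the elements. A short sketch, assuming lubridate is installed; the input strings are illustrative.

```r
library(lubridate)

# the same calendar date written in three different orders
d1 <- mdy("08/06/2016")
d2 <- dmy("06-08-2016")
d3 <- ymd("20160806")
c(d1, d2, d3)
## [1] "2016-08-06" "2016-08-06" "2016-08-06"

# input that cannot be parsed becomes NA (lubridate also warns)
bad <- suppressWarnings(ymd("not a date"))
is.na(bad)
## [1] TRUE
```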

Dates: create by merging

  • Sometimes your date data are collected in separate elements
  • To convert these separate data into one date object, use the ISOdate() function:
yr <- c("2012", "2013", "2014", "2015") 
mo <- c("1", "5", "7", "2")
day <- c("02", "22", "15", "28")

# ISOdate converts to a POSIXct object
full_date <- ISOdate(year = yr, month = mo, day = day)
full_date
## [1] "2012-01-02 12:00:00 GMT" "2013-05-22 12:00:00 GMT"
## [3] "2014-07-15 12:00:00 GMT" "2015-02-28 12:00:00 GMT"


We can truncate the unused time data by converting with as.Date()

as.Date(full_date)
## [1] "2012-01-02" "2013-05-22" "2014-07-15" "2015-02-28"
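
ISOdate() also accepts numeric components, and an impossible combination yields NA rather than an error. A minimal sketch with illustrative values:

```r
# numeric components work as well as character ones
ok <- ISOdate(year = 2015, month = 2, day = 28)
as.Date(ok)
## [1] "2015-02-28"

# 30 February does not exist, so the result is NA
bad <- ISOdate(year = 2015, month = 2, day = 30)
is.na(bad)
## [1] TRUE
```

Checking for NA after merging is a quick way to spot invalid day/month combinations in the source data.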

Dates: extract & manipulate

We can also easily extract components of dates using lubridate

Function   Date-time element to extract
year()     Year
month()    Month
week()     Week
yday()     Day of year
mday()     Day of month
wday()     Day of week
hour()     Hour
minute()   Minute
second()   Second
tz()       Time zone

Extract time components:

year(full_date)
## [1] 2012 2013 2014 2015

week(full_date)
## [1]  1 21 28  9

wday(full_date, label = TRUE)
## [1] Mon  Wed  Tues Sat 
## Levels: Sun < Mon < Tues < Wed < Thurs < Fri < Sat

Change a date-time component by assigning a new value to the corresponding extractor function

as.Date(full_date)
## [1] "2012-01-02" "2013-05-22" "2014-07-15" "2015-02-28"

year(full_date) <- c(2014, 2015, 2015, 2016)

as.Date(full_date)
## [1] "2014-01-02" "2015-05-22" "2015-07-15" "2016-02-28"

Dates: summarizing

  • We can also do regular statistical summaries of date objects
  • We illustrate with the lakers data set that comes with the lubridate package
dates <- ymd(lakers$date)

min(dates)
## [1] "2008-10-28"

max(dates)
## [1] "2009-04-14"

mean(dates)
## [1] "2009-01-22"

median(dates)
## [1] "2009-01-21"

summary(dates)
##         Min.      1st Qu.       Median         Mean      3rd Qu. 
## "2008-10-28" "2008-12-10" "2009-01-21" "2009-01-22" "2009-03-09" 
##         Max. 
## "2009-04-14"
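
Date objects also support simple arithmetic. The base R sketch below uses three illustrative dates rather than the full lakers data.

```r
game_dates <- as.Date(c("2008-10-28", "2009-01-21", "2009-04-14"))  # illustrative values

range(game_dates)
## [1] "2008-10-28" "2009-04-14"

# subtracting two dates gives a difftime measured in days
span <- max(game_dates) - min(game_dates)
as.numeric(span)
## [1] 168

# adding a number shifts a date by that many days
min(game_dates) + 7
## [1] "2008-11-04"
```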

Your Turn

Import the facebook.tsv file in the data folder




1. Create a new date variable that combines the dob_day,
     dob_month, & dob_year variables.



2. What are the minimum, maximum, mean, and median dates of birth in
     this data frame?





NOTE: If you save the new variable as facebook$dob <- _____________ it will add this new variable to the facebook data frame

Solution

# Import the `facebook.tsv` file in the data folder
facebook <- read.delim("../data/facebook.tsv")

# 1. Create a new date variable that combines the dob_day, dob_month, & dob_year variables.
facebook$dob <- as.Date(ISOdate(year = facebook$dob_year, 
                                month = facebook$dob_month, 
                                day = facebook$dob_day))

# 2. What are the minimum, maximum, mean, and median dates of birth in this data frame?
summary(facebook$dob)
##         Min.      1st Qu.       Median         Mean      3rd Qu. 
## "1900-01-01" "1963-08-14" "1985-01-20" "1976-03-12" "1993-01-01" 
##         Max. 
## "2000-10-27"

Logical

Boolean as in George Boole

Logical: the basics

We already saw how we can get TRUE/FALSE responses from comparing elements

x <- c(4, 4, 9, 12, 2, 2, 10)
y <- c(4, 5, 9, 13, 2, 1, 10)

x == y
## [1]  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE

This is just a vector containing logical elements

z <- x == y

class(z)
## [1] "logical"

We can assess if any or all the elements are TRUE

any(z)
## [1] TRUE

all(z)
## [1] FALSE
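
Because TRUE counts as 1 and FALSE as 0 in arithmetic, logical vectors can be summarized directly. A short sketch reusing x and y from above:

```r
x <- c(4, 4, 9, 12, 2, 2, 10)
y <- c(4, 5, 9, 13, 2, 1, 10)
z <- x == y

# number of matching positions
sum(z)
## [1] 4

# proportion of matching positions
mean(z)
## [1] 0.5714286

# positions where x and y match
which(z)
## [1] 1 3 5 7
```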

Key Things to Remember

Remember These Functions!

Operator/Function            Description
as.double(), as.integer()    coerce to double-precision floating point or integer numbers
identical(), all.equal()     test for exact and near equality
round(), ceiling(), floor()  round numbers
c(), paste(), paste0()       combine character strings
as.character()               coerce non-character to a character
nchar()                      count the number of characters in each element
factor(), ordered()          create or coerce to factor variables
levels()                     assess the levels of a factor
table()                      get the counts of each level
ymd(), mdy(), hm(), etc.     lubridate: create or convert to a date-time variable
ISOdate()                    create a date-time variable by merging separate date components
as.Date()                    truncate a date-time variable to just a date
year(), week(), etc.         lubridate: extract individual date components
any(), all()                 assess whether any or all elements are TRUE

Break

5 minutes!