| Data Type | Description |
|---|---|
| - Numbers | integer (e.g. 1, 2, 3…), double (e.g. 1.5, 3.66) |
| - Character Strings | "r", "I attend UC", etc. |
| - Regular expressions | patterns within text strings |
| - Factors | nominal (male, female), ordinal (freshman, sophomore, junior), interval ($0-25, $26-50, $51-75) |
| - Dates | calendar dates (e.g. 2016-08-06, 08/06/2016), weekdays, hours, etc. |
| - Logical | TRUE, FALSE, any, all |
Numeric data primarily comes in two forms: integer and double (double-precision floating point).
# create a vector of double-precision values
dbl_var <- c(1, 2.5, 4.5)
class(dbl_var)
## [1] "numeric"
# placing an L after the values creates a vector of integers
int_var <- c(1L, 6L, 10L)
class(int_var)
## [1] "integer"
We can coerce integers to doubles and vice versa with as.double() and as.integer():
as.integer(dbl_var)
## [1] 1 2 4
int_to_dbl <- as.double(int_var)
class(int_to_dbl)
## [1] "numeric"
# Combining double and integer will automatically coerce to the more general type (double)
c(dbl_var, int_var)
## [1] 1.0 2.5 4.5 1.0 6.0 10.0
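One detail worth checking when coercing: as.integer() drops the fractional part (truncating toward zero) rather than rounding, which is why 2.5 and 4.5 became 2 and 4 above. A minimal sketch, using a made-up vector `vals` that is not part of the original examples:

```r
# hypothetical values, chosen to contrast truncation with rounding
vals <- c(2.9, -2.9)

truncated <- as.integer(vals)  # fractional part is dropped
rounded <- round(vals)         # nearest whole number, still stored as double

truncated
## [1]  2 -2
rounded
## [1]  3 -3
typeof(rounded)
## [1] "double"
```

If whole-number integers are the goal, rounding first and then coercing with as.integer() is one option.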
You've already seen the comparison operators ==, !=, <, <=, >, and >=, which return logical values:
x <- c(4, 4, 9, 12)
y <- c(4, 4, 9, 12.00000008)
x == y
## [1] TRUE TRUE TRUE FALSE
We can also test for exact equality with identical() and near equality with all.equal():
z <- c(4, 4, 9, 12)
identical(x, y)
## [1] FALSE
identical(x, z)
## [1] TRUE
all.equal(x, y)
## [1] TRUE
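When two objects differ by more than the tolerance, all.equal() returns a character description of the difference rather than FALSE, so it is usually wrapped in isTRUE() before being used as a condition. A small sketch; the variable names are illustrative only:

```r
# floating-point arithmetic makes this slightly different from 2
root_sq <- sqrt(2)^2

exact <- root_sq == 2                  # strict comparison
near <- isTRUE(all.equal(root_sq, 2))  # comparison within a small tolerance
far <- isTRUE(all.equal(1, 1.1))       # difference is too large to be "near"

exact
## [1] FALSE
near
## [1] TRUE
far
## [1] FALSE
```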
We can also round numbers multiple ways:
x <- c(1, 1.35, 1.7, 2.05, 2.4, 2.75)
# round to the nearest integer
round(x)
## [1] 1 1 2 2 2 3
# round up
ceiling(x)
## [1] 1 2 2 3 3 3
# round down
floor(x)
## [1] 1 1 1 2 2 2
# round to a specified decimal
round(x, digits = 1)
## [1] 1.0 1.4 1.7 2.0 2.4 2.8
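Two rounding behaviors can be surprising: round() sends exact halves to the nearest even number, and trunc() (not shown above) drops the fractional part toward zero, which differs from floor() for negative values. A minimal sketch with made-up inputs:

```r
# exact halves go to the even neighbor
halves <- round(c(0.5, 1.5, 2.5))
halves
## [1] 0 2 2

# trunc() moves toward zero; floor() always moves down
toward_zero <- trunc(c(1.7, -1.7))
always_down <- floor(c(1.7, -1.7))
toward_zero
## [1]  1 -1
always_down
## [1]  1 -2
```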
Import the numbers-your-turn.csv file in the data folder
1. Are the vectors x & y equal? Exactly or approximately equal?
2. Are the vectors y & z equal? Exactly or approximately equal?
3. Round x & y numbers to the 4th digit
4. Are these vectors equal now?
# import the numbers-your-turn.csv file in the data folder
df <- read.csv("../data/numbers-your-turn.csv")
# 1. Are the vectors x & y equal? Exactly or approximately equal?
identical(df$x, df$y)
## [1] FALSE
all.equal(df$x, df$y)
## [1] "Mean relative difference: 1.041407e-07"
# 2. Are the vectors y & z equal? Exactly or approximately equal?
identical(df$y, df$z)
## [1] FALSE
all.equal(df$y, df$z)
## [1] TRUE
# 3. Round x & y numbers to the 4th digit
x <- round(df$x, digits = 4)
y <- round(df$y, digits = 4)
# 4. Are these vectors equal now?
identical(x, y)
## [1] TRUE
all.equal(x, y)
## [1] TRUE
Create character strings using quotation marks ("")
a <- "learning to create"
b <- "character strings"
Combine character strings with c(), paste(), or paste0()
# create a vector containing two elements - a and b
c(a, b)
## [1] "learning to create" "character strings"
# create a vector containing one element - a and b combined
paste(a, b)
## [1] "learning to create character strings"
# paste multiple strings
paste("I", "love", "R")
## [1] "I love R"
# change the separator
paste("I", "love", "R", sep = "-")
## [1] "I-love-R"
# paste0() joins the strings with no separator
paste0("I", "love", "R")
## [1] "IloveR"
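paste() and paste0() are also vectorized, and the collapse argument joins the resulting elements into a single string. A short sketch; the `sample_` prefix and the variable names are invented for illustration:

```r
# paste0() recycles the prefix across the vector 1:3
ids <- paste0("sample_", 1:3)
ids
## [1] "sample_1" "sample_2" "sample_3"

# collapse turns the three elements into one string
joined <- paste(ids, collapse = ", ")
joined
## [1] "sample_1, sample_2, sample_3"
```

The difference from sep is that sep goes between the arguments of one call, while collapse goes between the elements of the result.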
Use class(), mode() and/or is.character() to assess the data type
a <- "Life of"
b <- pi
class(a)
## [1] "character"
mode(a)
## [1] "character"
is.character(pi)
## [1] FALSE
Use as.character() to convert non-character to a character
as.character(pi)
## [1] "3.14159265358979"
Combining characters and non-characters will coerce all inputs to a character
c(a, b)
## [1] "Life of" "3.14159265358979"
Use length() to count the number of elements (individual character strings) in a vector
length("How many elements are in this string?")
## [1] 1
length(c("How", "many", "elements", "are", "in", "this", "string?"))
## [1] 7
Use nchar() to count the number of characters in each element
nchar("How many characters are in this string?")
## [1] 39
nchar(c("How", "many", "characters", "are", "in", "this", "string?"))
## [1] 3 4 10 3 2 4 7
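To see how length() and nchar() relate, we can split the one-element sentence into words ourselves. This sketch uses base R's strsplit(), which is not covered above, and illustrative variable names:

```r
sentence <- "How many characters are in this string?"

# strsplit() returns a list; [[1]] pulls out the character vector of words
words <- strsplit(sentence, split = " ")[[1]]

length(words)
## [1] 7
nchar(words)
## [1]  3  4 10  3  2  4  7

# characters in the words plus the six spaces between them
sum(nchar(words)) + (length(words) - 1)
## [1] 39
```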
Key words for factors: finite options and levels
Create nominal factors with factor()
gender <- c("male", "female", "female")
class(gender)
## [1] "character"
gender2 <- factor(gender)
class(gender2)
## [1] "factor"
gender2
## [1] male female female
## Levels: female male
Set the order of the levels with the levels argument
factor(gender, levels = c("male", "female"))
## [1] male female female
## Levels: male female
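Under the hood a factor stores integer codes that index into its levels, so changing the level order changes the codes. A minimal sketch that reuses the gender vector from above; the names `default_codes` and `custom_codes` are illustrative:

```r
gender <- c("male", "female", "female")

# default levels are sorted: "female" = 1, "male" = 2
default_codes <- as.integer(factor(gender))
default_codes
## [1] 2 1 1

# with explicit levels: "male" = 1, "female" = 2
custom_codes <- as.integer(factor(gender, levels = c("male", "female")))
custom_codes
## [1] 1 2 2
```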
Create ordinal/interval factors with ordered(); set the order of the levels with the levels argument
age.range <- c("18-24", "25-34", "35-44", "45-54", "55-64", "65 or Above", "Under 18")
class(age.range)
## [1] "character"
# turn age.range into an ordered factor - levels default to sorted (alphabetical) order
age.range2 <- ordered(age.range)
class(age.range2)
## [1] "ordered" "factor"
age.range2
## [1] 18-24 25-34 35-44 45-54 55-64 65 or Above
## [7] Under 18
## 7 Levels: 18-24 < 25-34 < 35-44 < 45-54 < 55-64 < ... < Under 18
Set the order of the levels with the levels argument
ordered(age.range, levels = c("Under 18", "18-24", "25-34", "35-44", "45-54", "55-64", "65 or Above"))
## [1] 18-24 25-34 35-44 45-54 55-64 65 or Above
## [7] Under 18
## 7 Levels: Under 18 < 18-24 < 25-34 < 35-44 < 45-54 < ... < 65 or Above
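One payoff of an ordered factor is that comparisons follow the level order instead of alphabetical order. A small sketch using a shortened, made-up set of age ranges:

```r
# hypothetical subset of the age ranges, listed from youngest to oldest
age_levels <- c("Under 18", "18-24", "25-34")
ages <- ordered(c("25-34", "Under 18", "18-24"), levels = age_levels)

# comparison uses the level order, not the text itself
under_25 <- ages < "25-34"
under_25
## [1] FALSE  TRUE  TRUE

# max() also respects the level order
oldest <- as.character(max(ages))
oldest
## [1] "25-34"
```

These comparisons are not meaningful on an unordered factor(), where R returns NA with a warning.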
If you want to know the levels that exist in your factor variable use levels()
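# stringsAsFactors = TRUE reads text columns as factors (no longer the default as of R 4.0.0)
facebook <- read.delim("../data/facebook.tsv", stringsAsFactors = TRUE)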
levels(facebook$gender)
## [1] "female" "male"
We can use the table() function to quickly assess the counts of each level
table(facebook$gender)
##
## female male
## 40254 58574
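Counts are often easier to compare as proportions. Base R's prop.table() converts a table of counts into shares; the sketch below uses a tiny made-up factor rather than the facebook data:

```r
# hypothetical survey answers
responses <- factor(c("yes", "no", "yes", "yes"))

counts <- table(responses)    # no = 1, yes = 3
shares <- prop.table(counts)  # no = 0.25, yes = 0.75

names(counts)
## [1] "no"  "yes"
as.vector(shares)
## [1] 0.25 0.75
```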
Import the reddit.csv file in the data folder
1. What are the levels for the income.range variable?
2. Properly order the levels for income.range.
3. What are the counts for each level?
# import the reddit.csv file in the data folder
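# stringsAsFactors = TRUE reads text columns as factors (no longer the default as of R 4.0.0)
reddit <- read.csv("../data/reddit.csv", stringsAsFactors = TRUE)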
# 1. What are the levels for the `income.range` variable?
levels(reddit$income.range)
## [1] "$100,000 - $149,999" "$150,000 or more" "$20,000 - $29,999"
## [4] "$30,000 - $39,999" "$40,000 - $49,999" "$50,000 - $69,999"
## [7] "$70,000 - $99,999" "Under $20,000"
# 2. Properly order the levels for income.range.
reddit$income.range <- ordered(reddit$income.range,
levels = c("Under $20,000", "$20,000 - $29,999", "$30,000 - $39,999",
"$40,000 - $49,999", "$50,000 - $69,999", "$70,000 - $99,999",
"$100,000 - $149,999", "$150,000 or more"))
# 3. What are the counts for each level?
table(reddit$income.range)
##
## Under $20,000 $20,000 - $29,999 $30,000 - $39,999
## 7892 3206 2904
## $40,000 - $49,999 $50,000 - $69,999 $70,000 - $99,999
## 2686 4133 4101
## $100,000 - $149,999 $150,000 or more
## 3522 2695
The lubridate package makes working with dates extremely easy.

| Function | Order of elements in date-time |
|---|---|
| ymd() | year, month, day |
| ydm() | year, day, month |
| mdy() | month, day, year |
| dmy() | day, month, year |
| hm() | hour, minute |
| hms() | hour, minute, second |
| ymd_hms() | year, month, day, hour, minute, second |
dates <- c("2015-07-01", "2015-08-01", "2015-09-01")
class(dates)
## [1] "character"
Convert this character vector to date format with lubridate's ymd() function
# install.packages("lubridate") # run this line if you have not yet installed lubridate
library(lubridate)
dates2 <- ymd(dates)
class(dates2)
## [1] "Date"
dates2
## [1] "2015-07-01" "2015-08-01" "2015-09-01"
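For comparison, base R's as.Date() can parse the same kinds of strings if we spell out the layout with a format string. A minimal sketch that does not need lubridate; the input is the 08/06/2016 example from the data type table, and the variable names are illustrative:

```r
us_style <- "08/06/2016"

# %m = month, %d = day, %Y = four-digit year
parsed <- as.Date(us_style, format = "%m/%d/%Y")

class(parsed)
## [1] "Date"
format(parsed, "%Y-%m-%d")
## [1] "2016-08-06"
```

The lubridate helpers such as mdy() save us from writing the format string by hand.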
Create dates by merging separate date components with the ISOdate() function:
yr <- c("2012", "2013", "2014", "2015")
mo <- c("1", "5", "7", "2")
day <- c("02", "22", "15", "28")
# ISOdate converts to a POSIXct object
full_date <- ISOdate(year = yr, month = mo, day = day)
full_date
## [1] "2012-01-02 12:00:00 GMT" "2013-05-22 12:00:00 GMT"
## [3] "2014-07-15 12:00:00 GMT" "2015-02-28 12:00:00 GMT"
We can truncate the unused time data by converting with as.Date()
as.Date(full_date)
## [1] "2012-01-02" "2013-05-22" "2014-07-15" "2015-02-28"
We can also easily extract components of dates using lubridate
| Function | Date-time element to extract |
|---|---|
| year() | Year |
| month() | Month |
| week() | Week |
| yday() | Day of year |
| mday() | Day of month |
| wday() | Day of week |
| hour() | Hour |
| minute() | Minute |
| second() | Second |
| tz() | Time zone |
Extract time components:
year(full_date)
## [1] 2012 2013 2014 2015
week(full_date)
## [1] 1 21 28 9
wday(full_date, label = TRUE)
## [1] Mon Wed Tues Sat
## Levels: Sun < Mon < Tues < Wed < Thurs < Fri < Sat
Change date-time components by combining the extractor function with assignment:
as.Date(full_date)
## [1] "2012-01-02" "2013-05-22" "2014-07-15" "2015-02-28"
year(full_date) <- c(2014, 2015, 2015, 2016)
as.Date(full_date)
## [1] "2014-01-02" "2015-05-22" "2015-07-15" "2016-02-28"
Dates also work with summary functions, shown here with the lakers data set that comes with the lubridate package:
dates <- ymd(lakers$date)
min(dates)
## [1] "2008-10-28"
max(dates)
## [1] "2009-04-14"
mean(dates)
## [1] "2009-01-22"
median(dates)
## [1] "2009-01-21"
summary(dates)
## Min. 1st Qu. Median Mean 3rd Qu.
## "2008-10-28" "2008-12-10" "2009-01-21" "2009-01-22" "2009-03-09"
## Max.
## "2009-04-14"
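Date objects also support arithmetic: subtracting two dates gives a time difference in days, and adding a number shifts a date by that many days. A minimal base R sketch that hard-codes the min and max dates reported above, so it runs without the lakers data; the variable names are illustrative:

```r
first_date <- as.Date("2008-10-28")
last_date <- as.Date("2009-04-14")

# subtraction returns a difftime object
span <- last_date - first_date
span
## Time difference of 168 days
as.numeric(span)
## [1] 168

# adding a number moves the date forward by that many days
one_week_later <- first_date + 7
one_week_later
## [1] "2008-11-04"
```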
Import the facebook.tsv file in the data folder
1. Create a new date variable that combines the dob_day, dob_month, & dob_year variables.
2. What are the min, max, mean, and median dates of birth in this data frame?
NOTE: If you save the new variable as facebook$dob <- _____________ it will add this new variable to the facebook data frame
# Import the `facebook.tsv` file in the data folder
facebook <- read.delim("../data/facebook.tsv")
# 1. Create a new date variable that combines the dob_day, dob_month, & dob_year variables.
facebook$dob <- as.Date(ISOdate(year = facebook$dob_year,
month = facebook$dob_month,
day = facebook$dob_day))
# 2. What are the min, max, mean, and median dates of birth in this data frame?
summary(facebook$dob)
## Min. 1st Qu. Median Mean 3rd Qu.
## "1900-01-01" "1963-08-14" "1985-01-20" "1976-03-12" "1993-01-01"
## Max.
## "2000-10-27"
We already saw how we can get TRUE/FALSE responses from comparing elements
x <- c(4, 4, 9, 12, 2, 2, 10)
y <- c(4, 5, 9, 13, 2, 1, 10)
x == y
## [1] TRUE FALSE TRUE FALSE TRUE FALSE TRUE
This is just a vector containing logical elements
z <- x == y
class(z)
## [1] "logical"
We can assess if any or all of the elements are TRUE
any(z)
## [1] TRUE
all(z)
## [1] FALSE
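Because TRUE counts as 1 and FALSE as 0 in arithmetic, we can also count and locate the matches. A short sketch that rebuilds the same x, y, and z so it runs on its own:

```r
x <- c(4, 4, 9, 12, 2, 2, 10)
y <- c(4, 5, 9, 13, 2, 1, 10)
z <- x == y

# number of TRUE elements
sum(z)
## [1] 4

# positions of the TRUE elements
which(z)
## [1] 1 3 5 7

# proportion of TRUE elements
mean(z)
## [1] 0.5714286
```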
| Operator/Function | Description |
|---|---|
| as.double(), as.integer() | coerce to double floating point or integer numbers |
| identical(), all.equal() | test for exact and near equality |
| round(), ceiling(), floor() | round numbers |
| c(), paste(), paste0() | combine character strings |
| as.character() | coerce non-character to a character |
| nchar() | count the number of characters in each element |
| factor(), ordered() | create or coerce to factor variables |
| levels() | assess the levels of a factor |
| table() | get the counts of each level |

| Operator/Function | Description |
|---|---|
| ymd(), mdy(), hm(), etc. | lubridate: create or convert to date-time variable |
| ISOdate() | create date variable by merging separate date components |
| as.Date() | truncate date-time variable to just date variable |
| year(), week(), etc. | lubridate: extract individual date components |
| any(), all() | assess if any or all elements are TRUE |