Data Type | Description |
---|---|
Numbers | integer (e.g. 1, 2, 3…), double (e.g. 1.5, 3.66) |
Character strings | "r", "I attend UC", etc. |
Regular expressions | patterns within text strings |
Factors | nominal (male, female), ordinal (freshman, sophomore, junior), interval ($0-25, $26-50, $51-75) |
Dates | calendar dates (e.g. 2016-08-06, 08/06/2016), weekdays, hours, etc. |
Logical | TRUE, FALSE, any, all |
Numeric data primarily comes in two forms: integer and double (double-precision floating point).

    # create a vector of double-precision values
    dbl_var <- c(1, 2.5, 4.5)
    class(dbl_var)
    ## [1] "numeric"

    # placing an L after the values creates a vector of integers
    int_var <- c(1L, 6L, 10L)
    class(int_var)
    ## [1] "integer"

We can coerce integers to doubles and vice versa with `as.double()` and `as.integer()`.

    # as.integer() drops the fractional part
    as.integer(dbl_var)
    ## [1] 1 2 4

    int_to_dbl <- as.double(int_var)
    class(int_to_dbl)
    ## [1] "numeric"

    # combining double and integer automatically coerces to the more general type (double)
    c(dbl_var, int_var)
    ## [1]  1.0  2.5  4.5  1.0  6.0 10.0

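As a small sketch of the coercion rules above: `class()` reports doubles as "numeric", so `typeof()` is handy when the integer/double distinction matters, and `as.integer()` truncates toward zero rather than rounding. The variable names `truncated` and `mixed` are illustrative only.

```r
dbl_var <- c(1, 2.5, 4.5)
int_var <- c(1L, 6L, 10L)

# typeof() shows the storage type that class() reports as "numeric"
typeof(dbl_var)   # "double"
typeof(int_var)   # "integer"

# as.integer() truncates toward zero for negative values too
truncated <- as.integer(c(-2.7, 2.7))
truncated         # -2  2

# mixing the two types yields a double vector
mixed <- c(dbl_var, int_var)
typeof(mixed)     # "double"
```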
You've already seen the comparison operators `==`, `!=`, `<`, `<=`, `>`, `>=`. They compare vectors element by element.

    x <- c(4, 4, 9, 12)
    y <- c(4, 4, 9, 12.00000008)
    x == y
    ## [1]  TRUE  TRUE  TRUE FALSE

We can also test for exact equality with `identical()` and near equality with `all.equal()`.

    z <- c(4, 4, 9, 12)
    identical(x, y)
    ## [1] FALSE
    identical(x, z)
    ## [1] TRUE
    all.equal(x, y)
    ## [1] TRUE

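A short sketch of why near equality matters for doubles, and of one common way to use `all.equal()` in a condition. The names `exactly_equal`, `nearly_equal`, and `report` are illustrative.

```r
# two values that print alike can differ in their last bits
a <- 0.1 + 0.2
b <- 0.3
a == b                                    # FALSE

exactly_equal <- identical(a, b)          # FALSE
nearly_equal  <- isTRUE(all.equal(a, b))  # TRUE

# on a mismatch all.equal() returns a description, not FALSE,
# so wrap it in isTRUE() before using it inside if()
report <- all.equal(1, 1.1)
is.character(report)                      # TRUE
```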
We can also round numbers in several ways:

    x <- c(1, 1.35, 1.7, 2.05, 2.4, 2.75)

    # round to the nearest integer
    round(x)
    ## [1] 1 1 2 2 2 3

    # round up
    ceiling(x)
    ## [1] 1 2 2 3 3 3

    # round down
    floor(x)
    ## [1] 1 1 1 2 2 2

    # round to a specified decimal
    round(x, digits = 1)
    ## [1] 1.0 1.4 1.7 2.0 2.4 2.8

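A few related base R behaviours, sketched here as a supplement to the rounding functions above: `round()` sends exact halves to the even neighbour, `trunc()` drops the fractional part, `signif()` rounds to significant digits, and a negative `digits` rounds to tens or hundreds. The vector `halves` is an illustrative name.

```r
halves <- c(0.5, 1.5, 2.5, -1.7)

round(halves)   # 0  2  2 -2  (exact halves go to the even neighbour)
trunc(halves)   # 0  1  2 -1  (fractional part dropped)

# round to 2 significant digits
signif(123456, digits = 2)      # 120000

# negative digits round to the left of the decimal point
round(1234.5678, digits = -2)   # 1200
```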
Import the `numbers-your-turn.csv` file in the data folder.

1. Are the vectors `x` & `y` equal? Exactly or approximately equal?
2. Are the vectors `y` & `z` equal? Exactly or approximately equal?
3. Round the `x` & `y` numbers to the 4th digit.
4. Are these vectors equal now?

Solution:

    # import the numbers-your-turn.csv file in the data folder
    df <- read.csv("../data/numbers-your-turn.csv")

    # 1. Are the vectors x & y equal? Exactly or approximately equal?
    identical(df$x, df$y)
    ## [1] FALSE
    all.equal(df$x, df$y)
    ## [1] "Mean relative difference: 1.041407e-07"

    # 2. Are the vectors y & z equal? Exactly or approximately equal?
    identical(df$y, df$z)
    ## [1] FALSE
    all.equal(df$y, df$z)
    ## [1] TRUE

    # 3. Round x & y numbers to the 4th digit
    x <- round(df$x, digits = 4)
    y <- round(df$y, digits = 4)

    # 4. Are these vectors equal now?
    identical(x, y)
    ## [1] TRUE
    all.equal(x, y)
    ## [1] TRUE

Create character strings using `""`:

    a <- "learning to create"
    b <- "character strings"

Combine character strings with `c()`, `paste()`, or `paste0()`:

    # create a vector containing two elements - a and b
    c(a, b)
    ## [1] "learning to create" "character strings"

    # create a vector containing one element - a and b combined
    paste(a, b)
    ## [1] "learning to create character strings"

    # paste multiple strings
    paste("I", "love", "R")
    ## [1] "I love R"

    # change the separator
    paste("I", "love", "R", sep = "-")
    ## [1] "I-love-R"

    # paste0() joins strings with no separator
    paste0("I", "love", "R")
    ## [1] "IloveR"

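The examples above paste separate arguments together. As a brief sketch of two related behaviours: `collapse` joins the elements of a single vector into one string, and `paste()`/`paste0()` are vectorised, recycling shorter arguments. The names `words`, `sentence`, `labels`, and `files` are illustrative.

```r
words <- c("I", "love", "R")

# sep joins across arguments; collapse joins the elements of one vector
sentence <- paste(words, collapse = " ")
sentence                                   # "I love R"

# vectorised: the single prefix is recycled against 1:3
labels <- paste0("var", 1:3)
labels                                     # "var1" "var2" "var3"

files <- paste("file", 1:2, sep = "_")
files                                      # "file_1" "file_2"
```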
Use `class()`, `mode()`, and/or `is.character()` to assess the data type:

    a <- "Life of"
    b <- pi

    class(a)
    ## [1] "character"
    mode(a)
    ## [1] "character"
    is.character(pi)
    ## [1] FALSE

Use `as.character()` to convert a non-character value to a character string:

    as.character(pi)
    ## [1] "3.14159265358979"

Combining character and non-character values coerces all inputs to character:

    c(a, b)
    ## [1] "Life of"          "3.14159265358979"

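A minimal sketch of what that coercion implies in practice: once a number has been absorbed into a character vector it has to be converted back before it can be used in arithmetic, and text that is not a number converts to `NA` (with a warning). The names `mixed`, `back`, and `not_a_number` are illustrative.

```r
mixed <- c("Life of", pi)
class(mixed)                      # "character"

# convert the second element back to a number
back <- as.numeric(mixed[2])
is.numeric(back)                  # TRUE

# non-numeric text becomes NA; the warning is suppressed here on purpose
not_a_number <- suppressWarnings(as.numeric(mixed[1]))
is.na(not_a_number)               # TRUE
```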
Use `length()` to count the number of elements (individual character strings) in a vector:

    length("How many elements are in this string?")
    ## [1] 1
    length(c("How", "many", "elements", "are", "in", "this", "string?"))
    ## [1] 7

Use `nchar()` to count the number of characters in each element:

    nchar("How many characters are in this string?")
    ## [1] 39
    nchar(c("How", "many", "characters", "are", "in", "this", "string?"))
    ## [1]  3  4 10  3  2  4  7

Key words: finite options and levels.

Create nominal factors with `factor()`:

    gender <- c("male", "female", "female")
    class(gender)
    ## [1] "character"

    gender2 <- factor(gender)
    class(gender2)
    ## [1] "factor"
    gender2
    ## [1] male   female female
    ## Levels: female male

Set the level order with the `levels` argument:

    factor(gender, levels = c("male", "female"))
    ## [1] male   female female
    ## Levels: male female

Create ordinal/interval factors with `ordered()`:

    age.range <- c("18-24", "25-34", "35-44", "45-54", "55-64", "65 or Above", "Under 18")
    class(age.range)
    ## [1] "character"

    # turn age.range into an ordered factor - levels default to the sorted
    # unique values, which here happens to match the order of the data
    age.range2 <- ordered(age.range)
    class(age.range2)
    ## [1] "ordered" "factor"
    age.range2
    ## [1] 18-24       25-34       35-44       45-54       55-64       65 or Above
    ## [7] Under 18
    ## 7 Levels: 18-24 < 25-34 < 35-44 < 45-54 < 55-64 < ... < Under 18

Set the level order with the `levels` argument:

    ordered(age.range, levels = c("Under 18", "18-24", "25-34", "35-44",
                                  "45-54", "55-64", "65 or Above"))
    ## [1] 18-24       25-34       35-44       45-54       55-64       65 or Above
    ## [7] Under 18
    ## 7 Levels: Under 18 < 18-24 < 25-34 < 35-44 < 45-54 < ... < 65 or Above

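One practical payoff of declaring the order, sketched with a made-up `sizes` vector (not part of the course data): ordered factors support comparisons and `max()`/`min()`, and the underlying integer codes follow the declared level order.

```r
sizes <- ordered(c("medium", "small", "large"),
                 levels = c("small", "medium", "large"))

# comparisons respect the declared level order, not alphabetical order
sizes < "large"        # TRUE  TRUE FALSE

# max() and min() work on ordered factors
largest <- max(sizes)
as.character(largest)  # "large"

# integer codes follow the position of each value in levels
as.integer(sizes)      # 2 1 3
```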
If you want to know the levels that exist in your factor variable, use `levels()`:

    # stringsAsFactors = TRUE reads text columns as factors
    # (this was the default before R 4.0.0)
    facebook <- read.delim("../data/facebook.tsv", stringsAsFactors = TRUE)
    levels(facebook$gender)
    ## [1] "female" "male"

We can use the `table()` function to quickly assess the counts of each level:

    table(facebook$gender)
    ##
    ## female   male
    ##  40254  58574

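Counts are often easier to read as shares. A minimal sketch using a small made-up `gender` factor (not the Facebook data) and base R's `prop.table()`:

```r
gender <- factor(c("male", "female", "female", "male", "female"))

# counts per level
counts <- table(gender)
counts[["female"]]            # 3
counts[["male"]]              # 2

# convert the counts to proportions that sum to 1
shares <- prop.table(counts)
as.numeric(shares)            # 0.6 0.4
```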
Import the `reddit.csv` file in the data folder.

1. What are the levels for the `income.range` variable?
2. Properly order the levels for `income.range`.
3. What are the counts for each level?

Solution:

    # import the reddit.csv file in the data folder
    # stringsAsFactors = TRUE reads text columns as factors (the default before R 4.0.0)
    reddit <- read.csv("../data/reddit.csv", stringsAsFactors = TRUE)

    # 1. What are the levels for the `income.range` variable?
    levels(reddit$income.range)
    ## [1] "$100,000 - $149,999" "$150,000 or more"    "$20,000 - $29,999"
    ## [4] "$30,000 - $39,999"   "$40,000 - $49,999"   "$50,000 - $69,999"
    ## [7] "$70,000 - $99,999"   "Under $20,000"

    # 2. Properly order the levels for income.range.
    reddit$income.range <- ordered(reddit$income.range,
                                   levels = c("Under $20,000", "$20,000 - $29,999",
                                              "$30,000 - $39,999", "$40,000 - $49,999",
                                              "$50,000 - $69,999", "$70,000 - $99,999",
                                              "$100,000 - $149,999", "$150,000 or more"))

    # 3. What are the counts for each level?
    table(reddit$income.range)
    ##
    ##       Under $20,000   $20,000 - $29,999   $30,000 - $39,999
    ##                7892                3206                2904
    ##   $40,000 - $49,999   $50,000 - $69,999   $70,000 - $99,999
    ##                2686                4133                4101
    ## $100,000 - $149,999    $150,000 or more
    ##                3522                2695

The `lubridate` package makes working with dates extremely easy.

Function | Order of elements in date-time |
---|---|
`ymd()` | year, month, day |
`ydm()` | year, day, month |
`mdy()` | month, day, year |
`dmy()` | day, month, year |
`hm()` | hour, minute |
`hms()` | hour, minute, second |
`ymd_hms()` | year, month, day, hour, minute, second |

Dates are often read in as character strings:

    dates <- c("2015-07-01", "2015-08-01", "2015-09-01")
    class(dates)
    ## [1] "character"

Convert this character vector to date format with `lubridate`'s `ymd()` function:

    # install.packages("lubridate")  # run this line if you have not yet installed lubridate
    library(lubridate)

    dates2 <- ymd(dates)
    class(dates2)
    ## [1] "Date"
    dates2
    ## [1] "2015-07-01" "2015-08-01" "2015-09-01"

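For comparison, a hedged sketch of the same kind of conversion in base R, without `lubridate`: `as.Date()` takes a `format` string that spells out the element order, much as the choice between `ymd()` and `mdy()` does. The vector `us_dates` is made-up example data.

```r
# month/day/year text, the order that mdy() would handle in lubridate
us_dates <- c("07/01/2015", "08/01/2015")

# %m = month, %d = day, %Y = four-digit year
parsed <- as.Date(us_dates, format = "%m/%d/%Y")

class(parsed)    # "Date"
format(parsed)   # "2015-07-01" "2015-08-01"
```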
Create dates by merging separate date components with the `ISOdate()` function:

    yr <- c("2012", "2013", "2014", "2015")
    mo <- c("1", "5", "7", "2")
    day <- c("02", "22", "15", "28")

    # ISOdate converts to a POSIXct object
    full_date <- ISOdate(year = yr, month = mo, day = day)
    full_date
    ## [1] "2012-01-02 12:00:00 GMT" "2013-05-22 12:00:00 GMT"
    ## [3] "2014-07-15 12:00:00 GMT" "2015-02-28 12:00:00 GMT"

We can truncate the unused time data by converting with `as.Date()`:

    as.Date(full_date)
    ## [1] "2012-01-02" "2013-05-22" "2014-07-15" "2015-02-28"

We can also easily extract components of dates using `lubridate`.

Function | Date-time element to extract |
---|---|
`year()` | Year |
`month()` | Month |
`week()` | Week |
`yday()` | Day of year |
`mday()` | Day of month |
`wday()` | Day of week |
`hour()` | Hour |
`minute()` | Minute |
`second()` | Second |
`tz()` | Time zone |

Extract time components:

    year(full_date)
    ## [1] 2012 2013 2014 2015

    week(full_date)
    ## [1]  1 21 28  9

    wday(full_date, label = TRUE)
    ## [1] Mon  Wed  Tues Sat
    ## Levels: Sun < Mon < Tues < Wed < Thurs < Fri < Sat

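If `lubridate` is not available, base R's `format()` with strptime-style codes can pull out the same pieces. This is only a sketch of the alternative, with illustrative variable names; `format()` returns text, so the results are converted to integers.

```r
d <- as.Date(c("2012-01-02", "2015-02-28"))

# %Y = year, %m = month, %j = day of year
yr  <- as.integer(format(d, "%Y"))
mon <- as.integer(format(d, "%m"))
doy <- as.integer(format(d, "%j"))

yr    # 2012 2015
mon   # 1 2
doy   # 2 59
```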
Manipulate or change date-time components by applying the extractor function on the left-hand side of an assignment:

    as.Date(full_date)
    ## [1] "2012-01-02" "2013-05-22" "2014-07-15" "2015-02-28"

    year(full_date) <- c(2014, 2015, 2015, 2016)
    as.Date(full_date)
    ## [1] "2014-01-02" "2015-05-22" "2015-07-15" "2016-02-28"

Using the `lakers` data set that comes with the `lubridate` package:

    dates <- ymd(lakers$date)

    min(dates)
    ## [1] "2008-10-28"
    max(dates)
    ## [1] "2009-04-14"
    mean(dates)
    ## [1] "2009-01-22"
    median(dates)
    ## [1] "2009-01-21"

    summary(dates)
    ##         Min.      1st Qu.       Median         Mean      3rd Qu.
    ## "2008-10-28" "2008-12-10" "2009-01-21" "2009-01-22" "2009-03-09"
    ##         Max.
    ## "2009-04-14"

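Because `Date` objects are stored as day counts, they also support subtraction. A small base R sketch using the minimum and maximum dates shown above; the names `first_date`, `last_date`, and `span` are illustrative.

```r
first_date <- as.Date("2008-10-28")
last_date  <- as.Date("2009-04-14")

# subtracting two dates gives a difftime measured in days
span <- last_date - first_date
as.numeric(span)                                            # 168

# difftime() lets you choose the unit explicitly
as.numeric(difftime(last_date, first_date, units = "weeks"))  # 24
```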
Import the `facebook.tsv` file in the data folder.

1. Create a new date variable that combines the `dob_day`, `dob_month`, & `dob_year` variables.
2. What are the `min`, `max`, `mean`, and `median` dates of birth in this data frame?

NOTE: If you save the new variable as `facebook$dob <- _____________`, it will add this new variable to the facebook data frame.

    # Import the `facebook.tsv` file in the data folder
    facebook <- read.delim("../data/facebook.tsv")

    # 1. Create a new date variable that combines the dob_day, dob_month, & dob_year variables.
    facebook$dob <- as.Date(ISOdate(year = facebook$dob_year,
                                    month = facebook$dob_month,
                                    day = facebook$dob_day))

    # 2. What are the min, max, mean, and median dates of birth in this data frame?
    summary(facebook$dob)
    ##         Min.      1st Qu.       Median         Mean      3rd Qu.
    ## "1900-01-01" "1963-08-14" "1985-01-20" "1976-03-12" "1993-01-01"
    ##         Max.
    ## "2000-10-27"

We already saw how we can get `TRUE`/`FALSE` responses from comparing elements:

    x <- c(4, 4, 9, 12, 2, 2, 10)
    y <- c(4, 5, 9, 13, 2, 1, 10)
    x == y
    ## [1]  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE

This is just a vector containing logical elements:

    z <- x == y
    class(z)
    ## [1] "logical"

We can assess whether any or all of the elements are `TRUE`:

    any(z)
    ## [1] TRUE
    all(z)
    ## [1] FALSE

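Logical vectors can also be counted, since `TRUE` is treated as 1 and `FALSE` as 0 in arithmetic. A short sketch reusing the `x`, `y`, and `z` vectors from above:

```r
x <- c(4, 4, 9, 12, 2, 2, 10)
y <- c(4, 5, 9, 13, 2, 1, 10)
z <- x == y

# how many elements match, and what share of the total
sum(z)       # 4
mean(z)      # 0.5714286

# positions of the elements that do not match
which(!z)    # 2 4 6
```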
Operator/Function | Description |
---|---|
`as.double()`, `as.integer()` | coerce to double-precision floating point or integer numbers |
`identical()`, `all.equal()` | test for exact and near equality |
`round()`, `ceiling()`, `floor()` | round numbers |
`c()`, `paste()`, `paste0()` | combine character strings |
`as.character()` | coerce non-character to a character |
`nchar()` | count the number of characters in each element |
`factor()`, `ordered()` | create or coerce to factor variables |
`levels()` | assess the levels of a factor |
`table()` | get the counts of each level |

Operator/Function | Description |
---|---|
`ymd()`, `mdy()`, `hm()`, etc. | `lubridate`: create or convert to date-time variable |
`ISOdate()` | create date variable by merging separate date components |
`as.Date()` | truncate date-time variable to just date variable |
`year()`, `week()`, etc. | `lubridate`: extract individual date components |
`any()`, `all()` | assess if any or all elements are TRUE |