Saturday, September 23, 2017

Learn R: Use factor function to categorize data

 
   
         
When you analyze data, the first step is usually cleaning it to improve its quality. The data modelling step comes next, and here you may want to represent some of the string values in your data set as numbers, either for performance or because the analysis method requires it. R has a function named factor() for this operation.

factor(x = character(), levels, labels = levels, exclude = NA, ordered = is.ordered(x), nmax = NA)
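To make the "number rather than string" idea concrete, here is a minimal sketch. The vector name `sizes` is made up for this illustration. A factor stores one integer code per element plus a lookup table of labels (the levels):

```r
# Illustrative vector (name chosen for this sketch only)
sizes <- c('Large', 'Medium', 'Small', 'Medium')
f <- factor(sizes)

# The lookup table of distinct labels, sorted alphabetically by default
levels(f)

# The integer code stored for each element (position in levels(f))
as.integer(f)
```

The codes index into `levels(f)`, so `levels(f)[as.integer(f)]` gives back the original strings.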

Get the distinct values of the t-shirt size column first

shirtSizes <- c('Large','Medium','Small','Medium','Small','Large','Medium','Small',
'Select one')
#This will give us the distinct values of shirtSizes (shown as Levels)
factor(shirtSizes)

[1] Large      Medium     Small      Medium     Small      Large      Medium    
[8] Small      Select one
Levels: Large Medium Select one Small
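If you only need the distinct values or their counts rather than the whole printed factor, base R's levels(), nlevels() and table() can be used on the result. A small sketch, where the name `sizeFactor` is just for this example:

```r
shirtSizes <- c('Large','Medium','Small','Medium','Small','Large','Medium','Small',
'Select one')
sizeFactor <- factor(shirtSizes)

#Distinct values as a character vector
levels(sizeFactor)

#How many distinct values there are
nlevels(sizeFactor)

#How many times each value occurs
table(sizeFactor)
```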

'Select one' is not a valid size. Remove it

shirtSizes <- c('Large','Medium','Small','Medium','Small','Large','Medium','Small'
,'Select one')

#"Select one" should not be counted
factor(shirtSizes, exclude = c('Select one'))

[1] Large  Medium Small  Medium Small  Large  Medium Small  <NA>  
Levels: Large Medium Small
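Note that exclude does not shorten the vector: the excluded value is turned into NA and only disappears from the levels. A quick sketch to check this, with `cleaned` as an illustrative name:

```r
shirtSizes <- c('Large','Medium','Small','Medium','Small','Large','Medium','Small',
'Select one')
cleaned <- factor(shirtSizes, exclude = c('Select one'))

#Still 9 elements, the excluded one is now NA
length(cleaned)
sum(is.na(cleaned))

#table() skips NA by default, useNA = 'ifany' shows it
table(cleaned)
table(cleaned, useNA = 'ifany')
```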

What is the order/priority of the t-shirt sizes?

shirtSizes <- c('Large','Medium','Small','Medium','Small','Large','Medium','Small'
,'Select one')

#Turn the order on
factor(shirtSizes, ordered = TRUE, exclude = c('Select one'))

[1] Large  Medium Small  Medium Small  Large  Medium Small  <NA>  
Levels: Large < Medium < Small

This alphabetical order doesn't work for me. How can I customize it?

shirtSizes <- c('Large','Medium','Small','Medium','Small','Large','Medium',
'Small','Select one')

#Customize the order
temp <- factor(shirtSizes, ordered = TRUE, levels = c('Small','Medium','Large'), 
exclude = c('Select one'))
temp

[1] Large  Medium Small  Medium Small  Large  Medium Small  <NA>  
Levels: Small < Medium < Large
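One reason the level order matters: an ordered factor can be compared with <, >= and friends, and min()/max() follow the level order rather than the alphabet. A minimal sketch on a shorter, made-up vector (`sizes` is not part of the example above):

```r
sizes <- factor(c('Large','Medium','Small','Medium'),
levels = c('Small','Medium','Large'), ordered = TRUE)

#Which shirts are Medium or bigger?
sizes >= 'Medium'

#Keep only the shirts smaller than Large
sizes[sizes < 'Large']

#Biggest size according to the level order
max(sizes)
```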

Looks good. Now I want to call the values S, M, L

shirtSizes <- c('Large','Medium','Small','Medium','Small','Large','Medium','Small',
'Select one')

#Customize the order
temp <- factor(shirtSizes, ordered = TRUE, levels = c('Small','Medium','Large'), 
exclude = c('Select one'))

#Rename the levels to S, M, L. The new names must be in the same order as the levels!
levels(temp) <- c('S','M','L')
temp

[1] L    M    S    M    S    L    M    S    <NA>
Levels: S < M < L
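As an alternative sketch, the same renaming can be done in a single call with the labels argument of factor(), which pairs each entry of levels with the label at the same position. The name `shortSizes` is only for this example:

```r
shirtSizes <- c('Large','Medium','Small','Medium','Small','Large','Medium','Small',
'Select one')

#levels picks and orders the values, labels renames them position by position
shortSizes <- factor(shirtSizes, levels = c('Small','Medium','Large'),
labels = c('S','M','L'), ordered = TRUE)
shortSizes
```

Because 'Select one' is not listed in levels, it becomes NA here as well, so exclude is not needed in this variant.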

Now let's order both the original names and the renamed levels by size

shirtSizes <- c('Large','Medium','Small','Medium','Small','Large','Medium','Small'
,'Select one')

#Customize the order
temp <- factor(shirtSizes, ordered = TRUE, levels = c('Small','Medium','Large'), 
exclude = c('Select one'))

#Rename the levels to S, M, L. The new names must be in the same order as the levels!
levels(temp) <- c('S','M','L')

#Order shirtSizes by size
shirtSizes[order(temp)]
temp[order(temp)]

[1] "Small"      "Small"      "Small"      "Medium"     "Medium"    
[6] "Medium"     "Large"      "Large"      "Select one"
[1] S    S    S    M    M    M    L    L    <NA>
Levels: S < M < L
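For the factor itself, sort() is a shorter way to get the same ordering. A small sketch, assuming the same temp as above (rebuilt here so the snippet runs on its own). Unlike temp[order(temp)], sort() drops NA values unless na.last = TRUE is given:

```r
shirtSizes <- c('Large','Medium','Small','Medium','Small','Large','Medium','Small',
'Select one')
temp <- factor(shirtSizes, ordered = TRUE, levels = c('Small','Medium','Large'))
levels(temp) <- c('S','M','L')

#Smallest to largest, NA is dropped
sort(temp)

#Keep the NA at the end
sort(temp, na.last = TRUE)

#Largest to smallest
sort(temp, decreasing = TRUE)
```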


