Working with Factor Variables in R

Factors - what are they

  • Factors are used to work with categorical variables: variables that have a fixed and known set of possible values. They are also useful when you want to display character vectors in a non-alphabetical order.
  • We encounter them when modeling (linear models, GLMs, GAMs, machine learning).
  • Handling them is one stage of feature engineering for machine learning.

factor()

  • Lets you represent categories (enumerations) of things without implying any priority or ranking among them.

    • E.g. in loan data, grade may be included in a regression, but the order in which the grades enter the model shouldn’t affect the results. From a regression standpoint, grade A is not greater than grade B.
  • Offers an option for ordinality (ordered = TRUE) when the order of the categories matters.

    • E.g. First, Second and Third are categories, but there is an order to them.
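
The difference between the two kinds of factor can be sketched in base R. The vectors grade and place below are made-up illustrations, not columns from the loan data.

```r
# Unordered factor: categories only, no ranking between levels
grade <- factor(c("B", "A", "C", "A"))

# Ordered factor: the sequence given in `levels` is meaningful
place <- factor(
  c("Second", "First", "Third"),
  levels  = c("First", "Second", "Third"),
  ordered = TRUE
)

is.ordered(grade)            # FALSE
is.ordered(place)            # TRUE

# Comparisons are only defined for ordered factors
before_third <- place < "Third"
before_third                 # TRUE TRUE FALSE
```

Comparing the levels of an unordered factor with < returns NA with a warning, which is a quick way to check which kind of factor you have.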

creating factors


vec <- factor(c("Hello","world","Hello","Rposit"))
vec
#> [1] Hello  world  Hello  Rposit
#> Levels: Hello Rposit world

The levels are the most important part of a factor, and they can be set or edited. Any value that is not listed in levels becomes NA:

factor(c("Hello","world","Hello","Rposit"), levels = c("world","Hello"))
#> [1] Hello world Hello <NA> 
#> Levels: world Hello

levels(vec) <- c("Amazing","Viewers","Are")
vec
#> [1] Amazing Are     Amazing Viewers
#> Levels: Amazing Viewers Are

  • Imagine that you have a variable that records month:
x1 <- c("Dec", "Apr", "Jan", "Mar")
sort(x1)
#> [1] "Apr" "Dec" "Jan" "Mar"
  • Sorting the strings gives alphabetical order, which is not useful for months.

  • There are only twelve possible months, and there’s nothing saving you from typos:

x2 <- c("Dec", "Apr", "Jam", "Mar")

How can we fix this? Create a vector of the valid levels:

month_levels <- c(
  "Jan", "Feb", "Mar", "Apr", "May", "Jun",
  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
)

y1 <- factor(x1, levels = month_levels)
y1
#> [1] Dec Apr Jan Mar
#> Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec

sort(y1)
#> [1] Jan Mar Apr Dec
#> Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
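
The same levels vector also guards against typos. A minimal sketch, reusing the misspelled x2 from above: any value that is not one of the declared levels is silently converted to NA, so checking for NA locates the bad entries.

```r
month_levels <- c(
  "Jan", "Feb", "Mar", "Apr", "May", "Jun",
  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
)

x2 <- c("Dec", "Apr", "Jam", "Mar")   # "Jam" is a typo

# "Jam" is not a valid level, so it becomes NA
y2 <- factor(x2, levels = month_levels)

# Recover the offending raw values
bad_values <- x2[is.na(y2)]
bad_values                            # "Jam"
```

Because the conversion is silent, it is worth running a check like this whenever levels are supplied by hand.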

working with factors in datasets

  • I’m still a big fan of the tidyverse.
  • We will use the forcats package to get the job done. First, let’s see what happens when you read in data.

readr package

  • By default, readr reads text columns as character vectors, not factors.
  • You can override this by specifying col_types.

setup


# load in libraries
library(labelled)
library(tidyverse)

# read in data
df<-readr::read_csv("loan_data_cleaned.csv")

# look at the few first rows
head(df,n=4)
#> # A tibble: 4 x 9
#>    ...1 loan_status loan_amnt grade home_ownership annual_inc   age emp_cat
#>   <dbl>       <dbl>     <dbl> <chr> <chr>               <dbl> <dbl> <chr>  
#> 1     1           0      5000 B     RENT                24000    33 0-15   
#> 2     2           0      2400 C     RENT                12252    31 15-30  
#> 3     3           0     10000 C     RENT                49200    24 0-15   
#> 4     4           0      5000 A     RENT                36000    39 0-15   
#> # i 1 more variable: ir_cat <chr>

# check the classes
sapply(df[1,],class)
#>           ...1    loan_status      loan_amnt          grade home_ownership 
#>      "numeric"      "numeric"      "numeric"    "character"    "character" 
#>     annual_inc            age        emp_cat         ir_cat 
#>      "numeric"      "numeric"    "character"    "character"

Let’s override that, because we need those columns to be factors.

df<-readr::read_csv("loan_data_cleaned.csv",col_types = cols(loan_status=col_factor(),
                                                             grade=col_factor(),
                                                             home_ownership=col_factor(),
                                                             emp_cat=col_factor())) |> 
  select(-1)

sapply(df[1,],class) # great
#>    loan_status      loan_amnt          grade home_ownership     annual_inc 
#>       "factor"      "numeric"       "factor"       "factor"      "numeric" 
#>            age        emp_cat         ir_cat 
#>      "numeric"       "factor"    "character"
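
One detail worth knowing: with no arguments, col_factor() takes the levels in the order in which they first appear in the file, whereas base factor() sorts them. The sketch below uses a small inline CSV (a hypothetical stand-in for the loan file, wrapped in I() as readr 2.0 and later expects for literal data) to show how to pin the level order explicitly.

```r
library(readr)

# Hypothetical three-row extract; not the real loan_data_cleaned.csv
csv_text <- I("grade,loan_amnt\nB,5000\nC,2400\nA,5000\n")

# Default: levels follow order of first appearance in the data
g1 <- read_csv(
  csv_text,
  col_types = cols(grade = col_factor(), loan_amnt = col_double())
)
levels(g1$grade)   # "B" "C" "A"

# Explicit levels: the order is fixed regardless of the data
g2 <- read_csv(
  csv_text,
  col_types = cols(
    grade     = col_factor(levels = c("A", "B", "C")),
    loan_amnt = col_double()
  )
)
levels(g2$grade)   # "A" "B" "C"
```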

  • That approach has a drawback: it is cumbersome. Let’s use mutate_if() instead.
  • We may have many variables to deal with, and listing them one by one is time-consuming.
  • We know in advance that the categorical columns will be read in as characters.
df<-readr::read_csv("loan_data_cleaned.csv") |> 
  select(-1)
df<- df |> 
  mutate_if(is.character,as.factor)

# check class
sapply(df[1,],class) # great
#>    loan_status      loan_amnt          grade home_ownership     annual_inc 
#>      "numeric"      "numeric"       "factor"       "factor"      "numeric" 
#>            age        emp_cat         ir_cat 
#>      "numeric"       "factor"       "factor"
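
Note that this approach converts only character columns, so loan_status (stored as 0/1) stays numeric in the output above. In current dplyr, mutate_if() is superseded by across(); the sketch below shows the equivalent call on a made-up miniature of the loan data (the toy tibble is an assumption for illustration) and converts the numeric status column explicitly.

```r
library(dplyr)

# Hypothetical miniature of the loan data
toy <- tibble(
  loan_status = c(0, 1, 0),
  grade       = c("B", "C", "A"),
  loan_amnt   = c(5000, 2400, 10000)
)

toy2 <- toy |>
  # same effect as mutate_if(is.character, as.factor)
  mutate(across(where(is.character), as.factor)) |>
  # a numeric 0/1 column has to be converted by name
  mutate(loan_status = factor(loan_status))

sapply(toy2, class)
```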

labeling variables

# put variable labels
labelled::var_label(df)<-list(loan_status="default status",
                    annual_inc="annual income",
                    home_ownership="house ownership",
                    loan_amnt="loan amount",
                    emp_cat="employment category")
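
To confirm that the labels were attached, they can be read back with the same function. A minimal sketch on a made-up two-column data frame (toy is hypothetical; the labels match the ones used above):

```r
library(labelled)

# Hypothetical two-column stand-in for the loan data
toy <- data.frame(
  loan_amnt  = c(5000, 2400),
  annual_inc = c(24000, 12252)
)

var_label(toy) <- list(
  loan_amnt  = "loan amount",
  annual_inc = "annual income"
)

# Label of a single column
amnt_label <- var_label(toy$loan_amnt)
amnt_label                      # "loan amount"

# All labels at once, as a named list
var_label(toy)
```

The label is stored as a "label" attribute on the column, so it does not change the values themselves.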

creating factors and modifying them

  • Let’s use cut() from base R to turn a numeric variable into a factor.
  • I find this function very useful in most cases.

out_new<-df |> 
  mutate(age_group=cut(age,
                       breaks = c(min(df$age)-1, 30,45,55,65,75, max(df$age)+1),
                       labels = c("< 30", "30-44","45-54","55-64","65-74","75+"),
                       right = FALSE)) # right = FALSE: intervals are [lower, upper)
# check this out
out_new |> 
  select(age,age_group)
#> # A tibble: 29,091 x 2
#>      age age_group
#>    <dbl> <fct>    
#>  1    33 30-44    
#>  2    31 30-44    
#>  3    24 < 30     
#>  4    39 30-44    
#>  5    24 < 30     
#>  6    28 < 30     
#>  7    22 < 30     
#>  8    22 < 30     
#>  9    28 < 30     
#> 10    22 < 30     
#> # i 29,081 more rows
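
The right = FALSE argument decides which group a boundary value such as 30 falls into. A minimal sketch on a made-up vector of ages; using -Inf and Inf as the outer breaks is an alternative (assumed here, not used above) to computing min() - 1 and max() + 1.

```r
ages <- c(29, 30, 44, 45, 80)   # made-up values around the break points

# right = FALSE: intervals are closed on the left, [lower, upper)
left_closed <- cut(
  ages,
  breaks = c(-Inf, 30, 45, Inf),
  labels = c("< 30", "30-44", "45+"),
  right  = FALSE
)
as.character(left_closed)   # "< 30" "30-44" "30-44" "45+" "45+"

# Default (right = TRUE): intervals are closed on the right, (lower, upper]
right_closed <- cut(
  ages,
  breaks = c(-Inf, 30, 45, Inf),
  labels = c("<= 30", "31-45", "46+")
)
as.character(right_closed)  # "<= 30" "<= 30" "31-45" "31-45" "46+"
```

With the default, an age of exactly 30 would land in the lowest group, which would not match a label such as "30-44".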

Let’s have a glimpse at the distribution of the new groups.

table(out_new$age_group)
#> 
#>  < 30 30-44 45-54 55-64 65-74   75+ 
#> 21038  7380   537   101    30     5
# or, with dplyr::count():

out_new |>
  count(age_group)
#> # A tibble: 6 x 2
#>   age_group     n
#>   <fct>     <int>
#> 1 < 30      21038
#> 2 30-44      7380
#> 3 45-54       537
#> 4 55-64       101
#> 5 65-74        30
#> 6 75+           5

ggplot(out_new, aes(age_group)) + geom_bar()

modifying factor order

dat_sum <- out_new |>
  group_by(age_group) |>
  summarize(
    annual_inc= mean(annual_inc, na.rm = TRUE),
    n = n()
  )

ggplot(dat_sum, aes(annual_inc, age_group)) + geom_point()

fct_reorder()

  • It is difficult to interpret this plot because there’s no overall pattern.
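  • We can improve it by reordering the levels of age_group using fct_reorder(). fct_reorder() takes three arguments:
    • .f, the factor whose levels you want to modify.
    • .x, a numeric vector that you want to use to reorder the levels.
    • .fun (optional), a function used to summarise .x when there are multiple values of .x for each level of .f. The default is median.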

ggplot(dat_sum, aes(annual_inc, fct_reorder(age_group,annual_inc))) + geom_point()
  • The levels now appear in increasing order of mean income. Reordering is most useful when a factor has many levels.
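
fct_reorder() can also be applied to raw, unsummarised data, in which case a summary function collapses the values within each level. A minimal sketch with made-up values (grade and income below are illustrative, not the loan data):

```r
library(forcats)

grade  <- factor(c("A", "A", "B", "B", "C", "C"))
income <- c(5, 7, 1, 3, 10, 12)   # made-up values; means are A = 6, B = 2, C = 11

levels(grade)                     # "A" "B" "C"

# Reorder levels by the mean of income within each level (ascending)
by_mean <- fct_reorder(grade, income, .fun = mean)
levels(by_mean)                   # "B" "A" "C"

# .desc = TRUE reverses the direction
by_mean_desc <- fct_reorder(grade, income, .fun = mean, .desc = TRUE)
levels(by_mean_desc)              # "C" "A" "B"
```

Only the level order changes; the values of the factor stay in their original positions.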

fct_infreq

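  • You can use fct_infreq() to order levels by decreasing frequency: this is the simplest type of reordering because it doesn’t need any extra variables. Combining it with fct_rev() reverses the order, so the bars below appear in increasing frequency.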
p1<-out_new |>
  mutate(age_group = age_group |> fct_infreq() |> fct_rev()) |>
  ggplot(aes(age_group)) +
    geom_bar()

p1
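
The effect of the two functions on the level order can be checked without a plot. A minimal sketch on a made-up factor:

```r
library(forcats)

# Made-up factor: "z" occurs 3 times, "y" twice, "x" once
f <- factor(c("x", "y", "y", "z", "z", "z"))

# Most frequent level first
by_freq <- fct_infreq(f)
levels(by_freq)       # "z" "y" "x"

# Reversed: least frequent level first
by_freq_rev <- fct_rev(by_freq)
levels(by_freq_rev)   # "x" "y" "z"
```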

fct_recode()

  • The most general and powerful tool is fct_recode(). It allows you to recode, or change, the value of each level, using pairs of the form "new level" = "old level". Levels that are not mentioned are left as they are. For example, take age_group:
out_new |> 
  mutate(age_group1=fct_recode(age_group,
                               "less than 30" ="< 30",
                               "greater than 75"="75+")) |> 
  count(age_group1)
#> # A tibble: 6 x 2
#>   age_group1          n
#>   <fct>           <int>
#> 1 less than 30    21038
#> 2 30-44            7380
#> 3 45-54             537
#> 4 55-64             101
#> 5 65-74              30
#> 6 greater than 75     5

fct_collapse()

  • If you want to collapse a lot of levels, fct_collapse() is a useful variant of fct_recode(): for each new level, you provide a vector of old levels.
out_new |> 
  mutate(age_group2=fct_collapse(age_group,
                                 "30-54"=c("30-44","45-54"),
                                 "55-74"=c("55-64","65-74"))) |> 
  count(age_group2)
#> # A tibble: 4 x 2
#>   age_group2     n
#>   <fct>      <int>
#> 1 < 30       21038
#> 2 30-54       7917
#> 3 55-74        131
#> 4 75+            5

fct_lump()

  • Sometimes you just want to lump together all the small groups to make a plot or table simpler. That’s the job of fct_lump():
out_new |>
  mutate(age_group3= fct_lump(age_group)) |>
  count(age_group3)
#> # A tibble: 2 x 2
#>   age_group3     n
#>   <fct>      <int>
#> 1 < 30       21038
#> 2 Other       8053

  • The default behavior is to progressively lump together the smallest groups, ensuring that the aggregate ("Other") is still the smallest group.
  • Instead, we can use the n parameter to specify how many groups (excluding Other) we want to keep:

out_new |>
  mutate(age_group3= fct_lump(age_group,n=3)) |>
  count(age_group3)
#> # A tibble: 4 x 2
#>   age_group3     n
#>   <fct>      <int>
#> 1 < 30       21038
#> 2 30-44       7380
#> 3 45-54        537
#> 4 Other        136
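
Since forcats 0.5.0, fct_lump() is superseded by more specific functions such as fct_lump_lowfreq(), fct_lump_n() and fct_lump_min(). The sketch below rebuilds a factor with the same group counts as the table above (an assumption made so the example runs without the loan file) and reproduces both results.

```r
library(forcats)

# Rebuild a factor with the group sizes reported in the counts above
age_levels <- c("< 30", "30-44", "45-54", "55-64", "65-74", "75+")
age_group  <- factor(
  rep(age_levels, times = c(21038, 7380, 537, 101, 30, 5)),
  levels = age_levels
)

# Same rule as fct_lump() with no arguments:
# lump the rarest levels while "Other" stays the smallest group
lumped_low <- fct_lump_lowfreq(age_group)
table(lumped_low)

# Same as fct_lump(age_group, n = 3): keep the 3 most frequent levels
lumped_n <- fct_lump_n(age_group, n = 3)
table(lumped_n)

# Lump every level that appears fewer than 100 times
lumped_min <- fct_lump_min(age_group, min = 100)
table(lumped_min)
```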
Bongani Ncube
Data Scientist/Statistics/Public Health

My research interests include Causal Inference, Public Health, Survival Analysis, Bayesian Statistics, Machine Learning and Longitudinal Data Analysis.