Visualizing data layer by layer in R

NCUBE ON DATA

background

Visualisation is the process of representing data graphically and interacting with these representations. The objective is to gain insight into the data.

To work with ggplot2\index{ggplot2}, remember that at least your R codes must

  • start with ggplot()
  • identify which data to plot data = Your Data
  • state variables to plot for example aes(x = Variable on x-axis, y = Variable on y-axis ) for bivariate
  • choose type of graph, for example geom_histogram() for histogram, and geom_points() for scatterplots\index{Scatterplot}

setup

  • installing tidyverse package which contains dplyr and ggplot2
library(tidyverse)
## Warning: package 'ggplot2' was built under R version 4.2.3
## Warning: package 'tibble' was built under R version 4.2.3
## Warning: package 'dplyr' was built under R version 4.2.3
## -- Attaching core tidyverse packages ------------------------ tidyverse 2.0.0 --
## v dplyr     1.1.2     v readr     2.1.4
## v forcats   1.0.0     v stringr   1.5.0
## v ggplot2   3.4.2     v tibble    3.2.1
## v lubridate 1.9.2     v tidyr     1.3.0
## v purrr     1.0.1     
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
## i Use the ]8;;http://conflicted.r-lib.org/conflicted package]8;; to force all conflicts to become errors
loan_data <- read_csv('loan_data_cleaned.csv')
## New names:
## Rows: 29091 Columns: 9
## -- Column specification
## -------------------------------------------------------- Delimiter: "," chr
## (4): grade, home_ownership, emp_cat, ir_cat dbl (5): ...1, loan_status,
## loan_amnt, annual_inc, age
## i Use `spec()` to retrieve the full column specification for this data. i
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## * `` -> `...1`
loan_data <- loan_data |>
  mutate(default=ifelse(loan_status==1,"defaulted","not defaulted")) |> 
  mutate_if(is.character,as.factor)

getting started

  • Calling ggplot() along just creates a blank plot
ggplot()

next up

  • I need to tell ggplot what data to use
ggplot(data=loan_data)

grammar of graphics

  • And then give it some instructions using the grammar of graphics.
  • Let’s build a simple scatterplot with annual income on the x-axis and loan amount on the y axis
ggplot(data=loan_data) +
  geom_point(mapping=aes(x=annual_inc, y=loan_amnt))

refining…

  • Let’s try representing a different dimension.
  • What if we want to differentiate public vs. private schools?
  • We can do this using the shape attribute
ggplot(data=loan_data) +
  geom_point(mapping=aes(x=annual_inc, y=loan_amnt, shape=default))

not neat!!..try color.

  • That’s hard to see the difference. What if we try color instead?
ggplot(data=loan_data) +
  geom_point(mapping=aes(x=annual_inc, y=loan_amnt, color=default))

try size!

  • I can also alter point size. Let’s do that to represent grade
ggplot(data=loan_data) +
  geom_point(mapping=aes(x=annual_inc, y=loan_amnt, color=default, size=grade))
## Warning: Using size for a discrete variable is not advised.

transparency

  • And, lastly, let’s add some transparency so we can see through those points a bit
  • Experiment with the alpha value a bit.
ggplot(data=loan_data) +
  geom_point(mapping=aes(x=annual_inc, y=loan_amnt, color=default, size=grade), alpha=1)
## Warning: Using size for a discrete variable is not advised.

very transparent

ggplot(data=loan_data) +
  geom_point(mapping=aes(x=annual_inc, y=loan_amnt, color=default, size=grade), alpha=1/100)
## Warning: Using size for a discrete variable is not advised.

How many people defaulted?

# This calls for a bar graph!
ggplot(data=loan_data) +
  geom_bar(mapping=aes(x=default))

Break it out by grade

ggplot(data=loan_data) +
  geom_bar(mapping=aes(x=default, color=grade))

Well, that’s unsatisfying! Try fill instead of color

ggplot(data=loan_data) +
  geom_bar(mapping=aes(x=default, fill=grade))

How about loan amount by default status?

  • First, I’ll use some dplyr to create the right tibble
loan_data %>%
  group_by(default) %>%
  summarize(average_amount=mean(loan_amnt))
## # A tibble: 2 x 2
##   default       average_amount
##   <fct>                  <dbl>
## 1 defaulted              9389.
## 2 not defaulted          9619.

And I can pipe that straight into ggplot

  • but this will produce an error
loan_data %>%
  group_by(default) %>%
  summarize(average_amount=mean(loan_amnt)) %>%
  ggplot() +
  geom_bar(mapping=aes(x=default, y=average_amount))

use geom_col() instead

  • But I need to use a column graph instead of a bar graph to specify my own y
loan_data %>%
  group_by(default) %>%
  summarize(average_amount=mean(loan_amnt)) %>%
  ggplot() +
  geom_col(mapping=aes(x=default, y=average_amount))

Histograms

  • Let’s look at annual income
  • Histograms can help us by binning results
ggplot(data=loan_data) +
  geom_histogram(mapping=aes(x=annual_inc), origin=0)
## Warning: The `origin` argument of `stat_bin()` is deprecated as of ggplot2 2.1.0.
## i Please use the `boundary` argument instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

What if we want fewer groups? Let’s ask for 4 bins

ggplot(data=loan_data) +
  geom_histogram(mapping=aes(x=annual_inc), bins=4, origin=0)

Or 10 bins.

ggplot(data=loan_data) +
  geom_histogram(mapping=aes(x=annual_inc), bins=10, origin=0)

Or we can specify the width of the bins instead

ggplot(data=loan_data) +
  geom_histogram(mapping=aes(x=annual_inc), binwidth=1000, origin=0)

large binwidth

ggplot(data=loan_data) +
  geom_histogram(mapping=aes(x=annual_inc), binwidth=10000, origin=0)

changing background

Change the plot background color

ggplot(data=loan_data) +
  geom_bar(mapping=aes(x=loan_status, fill=home_ownership)) +
  theme(plot.background=element_rect(fill='purple'))

Change the panel background color

ggplot(data=loan_data) +
  geom_bar(mapping=aes(x=loan_status, fill=home_ownership)) +
  theme(panel.background=element_rect(fill='purple'))

Let’s be minimalist and make both backgrounds disappear

ggplot(data=loan_data) +
  geom_bar(mapping=aes(x=loan_status, fill=home_ownership)) +
  theme(panel.background=element_blank()) +
  theme(plot.background=element_blank())

Add grey gridlines

ggplot(data=loan_data) +
  geom_bar(mapping=aes(x=loan_status, fill=home_ownership)) +
  theme(panel.background=element_blank()) +
  theme(plot.background=element_blank()) +
  theme(panel.grid.major=element_line(color="grey"))

Only show the y-axis gridlines

ggplot(data=loan_data) +
  geom_bar(mapping=aes(x=loan_status, fill=home_ownership)) +
  theme(panel.background=element_blank()) +
  theme(plot.background=element_blank()) +
  theme(panel.grid.major.y=element_line(color="grey"))+
  scale_fill_manual(values=c("orange","blue","green","blue4"))
library(ggthemes)
## Warning: package 'ggthemes' was built under R version 4.2.3
ggplot(data=loan_data) +
  geom_bar(mapping=aes(x=loan_status, fill=home_ownership)) +
  theme(panel.background=element_blank()) +
  theme(plot.background=element_blank()) +
  theme(panel.grid.major.y=element_line(color="grey"))+
  scale_fill_manual(values=c("orange","blue","green","blue4"))+
  theme_solarized()+
  theme_wsj()
ggplot(data=loan_data, aes(x = loan_status)) + 
  geom_bar(fill = "chartreuse") + 
  theme(axis.text.x = element_text(angle = 90))
Bongani Ncube
Bongani Ncube
Data Scientist/Statistics/Public Health

My research interests include Casual Inference , Public Health , Survival Analysis, bayesian Statistics, Machine learning and Longitudinal Data Analysis.