logistic far from machine learning

Generalized linear models

Exponential Family The theory of GLMs is developed for data with distribution given y the exponential family. The form of the data distribution that is useful for GLMs is

$$ f(y;\theta, \phi) = \exp(\frac{\theta y - b(\theta)}{a(\phi)} + c(y, \phi)) $$

where

  • \(\theta\) is called the natural parameter
  • \(\phi\) is called the dispersion parameter

Note:

This family includes the [Gamma], [Normal], [Poisson], and other. For all parameterization of the exponential family, check this

Example

if we have \(Y \sim N(\mu, \sigma^2)\)

$$ \begin{aligned} f(y; \mu, \sigma^2) &= \frac{1}{(2\pi \sigma^2)^{1/2}}\exp(-\frac{1}{2\sigma^2}(y- \mu)^2) \\ &= \exp(-\frac{1}{2\sigma^2}(y^2 - 2y \mu +\mu^2)- \frac{1}{2}\log(2\pi \sigma^2)) \\ &= \exp(\frac{y \mu - \mu^2/2}{\sigma^2} - \frac{y^2}{2\sigma^2} - \frac{1}{2}\log(2\pi \sigma^2)) \\ &= \exp(\frac{\theta y - b(\theta)}{a(\phi)} + c(y , \phi)) \end{aligned} $$

where

  • \(\theta = \mu\)
  • \(b(\theta) = \frac{\mu^2}{2}\)
  • \(a(\phi) = \sigma^2 = \phi\)
  • \(c(y , \phi) = - \frac{1}{2}(\frac{y^2}{\phi}+\log(2\pi \sigma^2))\)

Logistic Regression

Logistic regression estimates the probability of a particular level of a categorical response variable given a set of predictors. The response levels can be binary, nominal (multiple categories), or ordinal (multiple levels).

The binary logistic regression model is

$$y = logit(\pi) = \ln \left( \frac{\pi}{1 - \pi} \right) = X \beta$$

where \(\pi\) is the event probability. The model predicts the log odds of the response variable. The maximum likelihood estimator maximizes the likelihood function

$$L(\beta; y, X) = \prod_{i=1}^n \pi_i^{y_i}(1 - \pi_i)^{(1-y_i)} = \prod_{i=1}^n\frac{\exp(y_i X_i \beta)}{1 + \exp(X_i \beta)}.$$

goal

the goal of this article is to use logistic regression from a statistical stead point i.e learn how to build parsimonious models using statistical criteria

we also seek to use the R’s sophisticated tools to help in deriving insights from the data

Dataset

We will use a dataset named stroke.dta which in STATA\index{STATA} format. These data come from a study of hospitalized stroke patients. our main variables of interest are:

  • status : Status of patient during hospitalization (alive or dead)
  • gcs : Glasgow Coma Scale on admission (range from 3 to 15)
  • stroke_type : IS (Ischaemic Stroke) or HS (Haemorrhagic Stroke)
  • sex : female or male
  • dm : History of Diabetes (yes or no)
  • sbp : Systolic Blood Pressure (mmHg)
  • age : age of patient on admission

setup

# load in packages
library(tidyverse)
## Warning: package 'ggplot2' was built under R version 4.2.3
## Warning: package 'tibble' was built under R version 4.2.3
## Warning: package 'dplyr' was built under R version 4.2.3
## -- Attaching core tidyverse packages ------------------------ tidyverse 2.0.0 --
## v dplyr     1.1.2     v readr     2.1.4
## v forcats   1.0.0     v stringr   1.5.0
## v ggplot2   3.4.2     v tibble    3.2.1
## v lubridate 1.9.2     v tidyr     1.3.0
## v purrr     1.0.1     
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
## i Use the ]8;;http://conflicted.r-lib.org/conflicted package]8;; to force all conflicts to become errors
library(haven)
library(rms)
## Loading required package: Hmisc
## 
## Attaching package: 'Hmisc'
## 
## The following objects are masked from 'package:dplyr':
## 
##     src, summarize
## 
## The following objects are masked from 'package:base':
## 
##     format.pval, units
## 
## Loading required package: survival
## Loading required package: lattice
## Loading required package: SparseM
## 
## Attaching package: 'SparseM'
## 
## The following object is masked from 'package:base':
## 
##     backsolve
library(broom)

fatal <- read_dta('stroke.dta')
write.csv(fatal,"stroke_dat.csv")

Take a peek at data to check for

  • variable names
  • variable types

> glimpse(fatal)

Rows: 226
Columns: 7
$ sex         <dbl+lbl> 1, 1, 1, 2, 1, 2, 2, 1, 2, 2, 1, 2, 2, 2, 2, 1, 2, 1~
$ status      <dbl+lbl> 1, 1, 1, 1, 1, 1, 2, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1~
$ gcs         <dbl> 13, 15, 15, 15, 15, 15, 13, 15, 15, 10, 15, 15, 15, 15, ~
$ sbp         <dbl> 143, 150, 152, 215, 162, 169, 178, 180, 186, 185, 122, 2~
$ dm          <dbl+lbl> 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1~
$ age         <dbl> 50, 58, 64, 50, 65, 78, 66, 72, 61, 64, 63, 59, 64, 62, ~
$ stroke_type <dbl+lbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~

Explore data

Variables sex, status, dm and stroke type are labelled variables though they are coded as numbers. The numbers represent the groups or categories or levels of the variables. They are categorical variables and not real numbers.

fatal <- 
  fatal %>%
  mutate_if(is.labelled, as.factor)

model building

  • model building is very important in the world of statistics
  • it enables us to come up with more precise and optimal models to describe different phenomena

lets start model building

first things first

  • lets build a multivariate model with all the predictors
  • we use glm(formula=y_var~x_vars,data,family=binomial)
fatal_mv2 <- 
  glm(status ~ gcs + stroke_type + sex + dm + sbp + age, 
      data = fatal, 
      family = binomial(link = 'logit'))

> summary(fatal_mv2)

glm(formula = status ~ gcs + stroke_type + sex + dm + sbp + age, 
    family = binomial(link = "logit"), data = fatal)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.3715  -0.4687  -0.3280  -0.1921   2.5150  

Coefficients:
                          Estimate Std. Error z value Pr(>|z|)    
(Intercept)             -0.1588269  1.6174965  -0.098  0.92178    
gcs                     -0.3284640  0.0557574  -5.891 3.84e-09 ***
stroke_typeHaemorrhagic  1.2662764  0.4365882   2.900  0.00373 ** 
sexfemale                0.4302901  0.4362742   0.986  0.32399    
dmyes                    0.4736670  0.4362309   1.086  0.27756    
sbp                      0.0008612  0.0060619   0.142  0.88703    
age                      0.0242321  0.0154010   1.573  0.11562    
---
Signif. codes:  0***0.001**0.01*0.05.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 250.83  on 225  degrees of freedom
Residual deviance: 159.34  on 219  degrees of freedom
AIC: 173.34
library(reticulate)
use_condaenv("base")
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# obtain stroke_datadata
url = "stroke_dat.csv"
stroke_data= pd.read_csv(url)
# define model
model1 = smf.glm(formula = "status ~ gcs + stroke_type + sex + dm + sbp + age", 
                data = stroke_data, 
                family = sm.families.Binomial())


# fit model
promotion_model1 = model1.fit()


# see results summary
print(promotion_model1.summary())

  • we wont really explain the output but can comment on the p-values that gcs and stroke type Haemorrhagic are the most significant predictors of death
  • Haemorrhagic increases the odds of dying since it has a positive coefficient

a unit increase in gcs reduces the log odds of dying by -0.3284640 which is equivalent to odds=exp(-0.3284640)=0.7200288 which implies that if the value increases the odds of dying are reduced by approximately 1-0.720028=37% which makes sense coz had to research more about thus and saw that

  • 3- 8 means severe
  • 9-13 means moderate
  • 14-15 means mild

  • patients with HS have \(1.266\) times the log odds\index{Log odds} for death as compared to patients with IS, adjusting for other covariates.

  • female patients have \(0.430\) times the log odds\index{Log odds} for death as compared to male patients, adjusting for other covariates.

  • patients with diabetes mellitus have \(0.474\) times the log odds\index{Log odds} for deaths as compared to patients with no diabetes mellitus.

  • With one mmHg increase in systolic blood pressure, the log odds\index{Log odds} for deaths change by a factor of \(0.00086\), when adjusting for other variables.

  • with an increase in one year of age, the log odds\index{Log odds} for deaths change by a factor of \(0.024\), when adjusting for other variables.

  • thats not all folks

using rms package

> fatal_mv1 <- lrm(status ~ gcs + stroke_type + sex + dm + sbp + age,
                 data = fatal)

Logistic Regression Model

lrm(formula = status ~ gcs + stroke_type + sex + dm + sbp + age, 
    data = fatal)

                       Model Likelihood      Discrimination    Rank Discrim.    
                             Ratio Test             Indexes          Indexes    
Obs           226    LR chi2      91.49      R2       0.497    C       0.892    
 alive        171    d.f.             6      R2(6,226)0.315    Dxy     0.784    
 dead          55    Pr(> chi2) <0.0001    R2(6,124.8)0.496    gamma   0.784    
max |deriv| 4e-07                            Brier    0.110    tau-a   0.290    

                         Coef    S.E.   Wald Z Pr(>|Z|)
Intercept                -0.1588 1.6175 -0.10  0.9218  
gcs                      -0.3285 0.0558 -5.89  <0.0001 
stroke_type=Haemorrhagic  1.2663 0.4366  2.90  0.0037  
sex=female                0.4303 0.4363  0.99  0.3240  
dm=yes                    0.4737 0.4362  1.09  0.2776  
sbp                       0.0009 0.0061  0.14  0.8870  
age                       0.0242 0.0154  1.57  0.1156 

variable importance

plot(anova(fatal_mv1),margin=c("chisq","d.f","P"))
+ gcs and stroke type seem to be the most important variables that predict death

model building

  • we want to find the optimal set of variables that predict death
  • first up we will use the univariate approach,thus we build univariate models based on variables available then we choose the variable with the the least p-value or least AIC value then build from there

create univariate models and look at p-values

# define function to run stroke model and glance at results
stroke_model_pvals <- function(form, df) {
  model <- glm(formula = form, data = df)
  broom::tidy(model)
}
# create model formula column
formula <- c(
  "status ~ gcs", 
  "status ~ stroke_type",
  "status ~ sex",
  "status ~ dm",
  "status ~ sbp",
  "status ~ age"
)
# create dataframe
models <- data.frame(formula)

view p-values

# run models and glance at results
mods<-models |>
  dplyr::group_by(formula) |>
  dplyr::summarise(stroke_model_pvals(formula, fatal)) |> 
  select(formula,p.value,term) |> 
  filter(term!="(Intercept)") |> 
  arrange(p.value)
## Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
## dplyr 1.1.0.
## i Please use `reframe()` instead.
## i When switching from `summarise()` to `reframe()`, remember that `reframe()`
##   always returns an ungrouped data frame and adjust accordingly.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## `summarise()` has grouped output by 'formula'. You can override using the
## `.groups` argument.
mods
## # A tibble: 6 x 3
## # Groups:   formula [6]
##   formula               p.value term       
##   <chr>                   <dbl> <chr>      
## 1 status ~ gcs         1.66e-24 gcs        
## 2 status ~ stroke_type 5.04e-11 stroke_type
## 3 status ~ sex         1.71e- 2 sex        
## 4 status ~ dm          1.62e- 1 dm         
## 5 status ~ age         4.54e- 1 age        
## 6 status ~ sbp         9.21e- 1 sbp

explanation

  • the output shows p-values for the univariate models arranged from smallest to largest
  • gcs has the least p-value followed by stroke type

Based on AIC

stroke_model_information <- function(form, df) {
  model <- glm(formula = form, data = df)
  broom::glance(model)
}

# run models and glance at results
  models |>
  dplyr::group_by(formula) |>
  dplyr::summarise(stroke_model_information(formula, fatal)) |> 
  select(formula,AIC,BIC,deviance,logLik) |> 
  arrange(AIC)
## # A tibble: 6 x 5
##   formula                AIC   BIC deviance logLik
##   <chr>                <dbl> <dbl>    <dbl>  <dbl>
## 1 status ~ gcs          159.  170.     26.1  -76.7
## 2 status ~ stroke_type  221.  232.     34.3 -108. 
## 3 status ~ sex          259.  269.     40.6 -127. 
## 4 status ~ dm           263.  273.     41.3 -128. 
## 5 status ~ age          264.  275.     41.5 -129. 
## 6 status ~ sbp          265.  275.     41.6 -129.
  • we get the same results (gcs and stroke_type have the least AIC)

Lets build from there

> model0<-glm(status~gcs,data=fatal,family = binomial)
> model1<-glm(status~gcs+stroke_type,data=fatal,family = binomial)
> summary(model1)

glm(formula = status ~ gcs + stroke_type, family = binomial, 
    data = fatal)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.1921  -0.5629  -0.3380  -0.3380   2.4045  

Coefficients:
                        Estimate Std. Error z value Pr(>|z|)    
(Intercept)              2.22960    0.71414   3.122  0.00180 ** 
gcs                     -0.33755    0.05477  -6.164 7.11e-10 ***
stroke_typeHaemorrhagic  1.09094    0.41452   2.632  0.00849 ** 
---
Signif. codes:  0***0.001**0.01*0.05.’ 0.1 ‘ ’ 1

is adding stroke_type to gcs worth it?

  • we use likelihood ratio test
  • call Anova(model1,model2,test="LRT")
anova(model0,model1,test="LRT")
## Analysis of Deviance Table
## 
## Model 1: status ~ gcs
## Model 2: status ~ gcs + stroke_type
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)   
## 1       224     170.92                        
## 2       223     164.17  1   6.7477 0.009387 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  • the value of Pr(>Chi) is less than 0.05 implying that stroketype can make the model much better so we include it and add another variable

lets add sex

model2<-glm(status~gcs+stroke_type+sex,data=fatal,family = binomial)
anova(model1,model2,test="LRT")
## Analysis of Deviance Table
## 
## Model 1: status ~ gcs + stroke_type
## Model 2: status ~ gcs + stroke_type + sex
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1       223     164.17                     
## 2       222     162.72  1   1.4567   0.2275
  • We get a p-value greater than 0.05 hence we drop the variable

lets try dm

model3<-glm(status~gcs+stroke_type+dm,data=fatal,family = binomial)
anova(model1,model3,test="LRT")
## Analysis of Deviance Table
## 
## Model 1: status ~ gcs + stroke_type
## Model 2: status ~ gcs + stroke_type + dm
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1       223     164.17                     
## 2       222     163.51  1  0.65891   0.4169
  • our p-value is also greater than 0.05 so we drop it

lets try age

model4<-glm(status ~ gcs + stroke_type + age, data=fatal,family = binomial)
anova(model1,model4,test="LRT")
## Analysis of Deviance Table
## 
## Model 1: status ~ gcs + stroke_type
## Model 2: status ~ gcs + stroke_type + age
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1       223     164.17                     
## 2       222     161.51  1   2.6602   0.1029
  • still not satisfying

lets add sbp

model5<-glm(status~gcs+stroke_type+sbp,data=fatal,family = binomial)
anova(model1,model5,test="LRT")
## Analysis of Deviance Table
## 
## Model 1: status ~ gcs + stroke_type
## Model 2: status ~ gcs + stroke_type + sbp
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1       223     164.17                     
## 2       222     164.07  1 0.098017   0.7542
  • still not satisfying

addition of other variables to gcs and stroke_type did not yield signifant p-values hence we can drop them

final model

  • we can choose gcs and stroke_type to be the most significant and optimal set of variables to predict death
> final_model<-glm(status~gcs+stroke_type,data=fatal,family = binomial)
> summary(final_model)

glm(formula = status ~ gcs + stroke_type, family = binomial, 
    data = fatal)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.1921  -0.5629  -0.3380  -0.3380   2.4045  

Coefficients:
                        Estimate Std. Error z value Pr(>|z|)    
(Intercept)              2.22960    0.71414   3.122  0.00180 ** 
gcs                     -0.33755    0.05477  -6.164 7.11e-10 ***
stroke_typeHaemorrhagic  1.09094    0.41452   2.632  0.00849 ** 

lets confirm this using backward elimination

  • we start of with a full model
fatal_mv1 <- 
  lrm(status ~ gcs + stroke_type + sex + dm + sbp + age, 
      data = fatal)

> fastbw(fatal_mv1)

 Deleted Chi-Sq d.f. P      Residual d.f. P      AIC  
 sbp     0.02   1    0.8870 0.02     1    0.8870 -1.98
 sex     0.97   1    0.3246 0.99     2    0.6094 -3.01
 dm      1.11   1    0.2911 2.11     3    0.5509 -3.89
 age     2.38   1    0.1228 4.49     4    0.3441 -3.51

Approximate Estimates after Deleting Factors

                            Coef    S.E. Wald Z         P
Intercept                 2.1811 0.71458  3.052 2.272e-03
gcs                      -0.3283 0.05525 -5.942 2.824e-09
stroke_type=Haemorrhagic  1.0436 0.41883  2.492 1.272e-02

Factors in Final Model

[1] gcs         stroke_type

  • we get similar results
  • lets try a step-wise regression from both sides
>full_model <-glm(status ~ gcs + stroke_type + sex + dm + sbp + age, 
                data = fatal,family=binomial())

>step<-stepAIC(full_model, direction = "both")
>summary(step)

Start:  AIC=173.34
status ~ gcs + stroke_type + sex + dm + sbp + age

              Df Deviance    AIC
- sbp          1   159.36 171.36
- sex          1   160.32 172.32
- dm           1   160.54 172.54
<none>             159.34 173.34
- age          1   161.92 173.92
- stroke_type  1   167.69 179.69
- gcs          1   202.73 214.73

second iteration

Step:  AIC=171.36
status ~ gcs + stroke_type + sex + dm + age

              Df Deviance    AIC
- sex          1   160.34 170.34
- dm           1   160.55 170.55
<none>             159.36 171.36
- age          1   162.04 172.04
+ sbp          1   159.34 173.34
- stroke_type  1   167.73 177.73
- gcs          1   202.82 212.82

third iteration

Step:  AIC=170.34
status ~ gcs + stroke_type + dm + age

              Df Deviance    AIC
- dm           1   161.51 169.51
<none>             160.34 170.34
+ sex          1   159.36 171.36
- age          1   163.51 171.51
+ sbp          1   160.32 172.32
- stroke_type  1   168.52 176.52
- gcs          1   206.86 214.86

final iteration

Step:  AIC=169.51
status ~ gcs + stroke_type + age

              Df Deviance    AIC
<none>             161.51 169.51
- age          1   164.17 170.17
+ dm           1   160.34 170.34
+ sex          1   160.55 170.55
+ sbp          1   161.51 171.51
- stroke_type  1   169.53 175.53
- gcs          1   208.96 214.96

model from final iteration

Call:
glm(formula = status ~ gcs + stroke_type + age, family = binomial(), 
    data = fatal)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.2588  -0.4481  -0.3337  -0.2228   2.4807  

Coefficients:
                        Estimate Std. Error z value Pr(>|z|)    
(Intercept)              0.73401    1.16319   0.631  0.52802    
gcs                     -0.33864    0.05546  -6.106 1.02e-09 ***
stroke_typeHaemorrhagic  1.22428    0.42959   2.850  0.00437 ** 
age                      0.02350    0.01467   1.602  0.10918    
---
Bongani Ncube
Bongani Ncube
Data Scientist/Statistics/Public Health

My research interests include Casual Inference , Public Health , Survival Analysis, bayesian Statistics, Machine learning and Longitudinal Data Analysis.