Data Science in Health care

Last updated on Sep 26, 2023 18 min read datascience, cox, public health

Introduction

Brain Cancers include primary brain tumours which starts in the brain and almost never spread to other parts of the body and secondary tumours which are caused by cancers that began in another part of the body .

There are more than 40 major types of brain tumours, which are grouped into two main types:

benign - slow-growing and unlikely to spread. Common types are meningiomas, neuromas, pituitary tumours and craniopharyngiomas.
malignant - cancerous and able to spread into other parts of the brain or spinal cord. Common types include astrocytomas, oligodendrogliomas, glioblastomas and mixed gliomas.

It is estimated that more than 1,900 people were diagnosed with brain cancer in 2023. The average age at diagnosis is 59 years old.

Brain cancer signs and symptoms

Headaches are often the first symptom of a brain tumour. The headaches can be mild, severe, persistent, or come and go. A headache isn’t always a brain tumour but if you’re worried, be sure to see your GP.

Other symptoms include:

seizures: severe (e.g. a convulsion) or mild (a fleeting disturbance of awareness, sensation or jerking muscles) weakness or paralysis in part of the body
loss of balance
general irritability, drowsiness or a change in personality nausea and vomiting
disturbed vision, hearing, smell or taste.

References

Understanding Brain Tumours, Cancer Council Australia ©2020. Last medical review of source booklet: May 2020.
Australian Institute of Health and Welfare. Cancer data in Australia [Internet]. Canberra: Australian Institute of Health and Welfare, 2023 [cited 2023 Sept 04]. Available from: https://www.aihw.gov.au/reports/cancer/cancer-data-in-australia

Objectives of the study

determine factors that determine mortality of brain cancer patients
compare survival times of certain groups of breast cancer patients
determine the extend to which diagnosis affects survival of patients
learn more about using R for biostatistical studies and clinical research

Methodology

My research aims at patients’ brain cancer survival analysis . In the analysis I did a detailed research on the variables that determine whether patients get over a disease or surrender in a certain duration time. Thus I used a non-parametric statistic estimator to create a Kaplan-Meier Survival model to measure the survival function from lifetime duration data. We fit a cox proportional model and Logistic regression model.

Brain Cancer Data

A data frame with 88 observations and 8 variables:

sex : Factor with levels “Female” and “Male”
diagnosis : Factor with levels “Meningioma”, “LG glioma”, “HG glioma”, and “Other”.
Location : factor with levels “Infratentorial” and “Supratentorial”.
ki : Karnofsky index.
gtv : Gross tumor volume, in cubic centimeters.
stereo : Stereotactic method factor with levels “SRS” and “SRT”.
status : Whether the patient is dead at the end of the study: 0=No, 1=yes.
time : Survival time, in months.

Setup

library(pacman)
p_load(tidyverse, tidymodels, ISLR2, survival, censored,dlookr,pubh,sjPlot,compareGroups,survminer)

take a look at the dataset

BrainCancer<-readr::read_csv("NCU_braincancer.csv")

skimr::skim(BrainCancer)

Table 1: Data summary
Name	BrainCancer
Number of rows	88
Number of columns	8
\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_
Column type frequency:
factor	4
numeric	4
\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_
Group variables	None

Variable type: factor

skim_variable	n_missing	complete_rate	ordered	n_unique	top_counts
sex	0	1.00	FALSE	2	Fem: 45, Mal: 43
diagnosis	1	0.99	FALSE	4	Men: 42, HG : 22, Oth: 14, LG : 9
loc	0	1.00	FALSE	2	Sup: 69, Inf: 19
stereo	0	1.00	FALSE	2	SRT: 65, SRS: 23

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
ki	1	81.02	10.51	40.00	80.00	80.00	90.0	100.00	\\\\\
gtv	1	8.66	8.66	0.01	2.50	6.51	12.1	34.64	\\\\\
status	1	0.40	0.49	0.00	0.00	0.00	1.0	1.00	\\\\\
time	1	27.46	20.12	0.07	10.39	24.03	41.6	82.56	\\\\\

Data wrangling

data<-BrainCancer |> 
  mutate(brain_status=ifelse(status==1,"dead","alive"))
## check missing values
colSums(is.na(data))
#>          sex    diagnosis          loc           ki          gtv       stereo 
#>            0            1            0            0            0            0 
#>       status         time brain_status 
#>            0            0            0

there is only one missing value

## remove the missing values
cases<-complete.cases(data) 
data<-data[cases,]

Explanatory data Analysis(EDA)

distribution of numeric variables

# Histogram of all numeric variables
data %>%
  keep(is.numeric) %>% 
  gather() %>% 
  ggplot(aes(value)) +
  facet_wrap(~ key, scales = "free") +
  geom_histogram(bins=30,fill=tvthemes::avatar_pal()(1))+
  ggthemes::theme_wsj()

Time and gtv are generally right skewed

Correlations

## select only numeric values
cor_data<-data %>%
  keep(is.numeric)
## create a correlation matrix
corl<-cor(cor_data)
corrplot::corrplot(corl,method="color",addCoef.col = "black")

status is really not supposed to be numeric but rather a factor variable
time and gross tumor volume are all positively skewed
karnofsky index is as good as being a factor so i will further categorize the data
the correlations are not really significant given the variables at hand

further manipulation

data <- data |> 
  mutate(kan_index = ifelse(ki>=80,"index>79","index<80"))

data |> 
  janitor::tabyl(kan_index)

kan_index	n	percent
index<80	20	0.23
index>79	67	0.77

comparing dependent variable with numeric variables

subset <- data |>
  dplyr::select(gtv,time,ki,status)

# Bring in external file for visualisations
source('functions/visualisations.R')

# Use plot function
plot <- histoplotter(subset, status, 
                     chart_x_axis_lbl = "survival status", 
                     chart_y_axis_lbl = 'Measures',
                     boxplot_color = 'navy', 
                     boxplot_fill = tvthemes::avatar_pal()(1), 
                     box_fill_transparency = 0.2) 

# Add extras to plot
plot + 
  ggthemes::theme_fivethirtyeight() +
  tvthemes::scale_color_avatar()

categorical data

iv_rates <- data |>
  dplyr::mutate(status = ifelse(status==1,"yes","no")) |>
  select(kan_index,loc,stereo,status) |> 
  gather() |> 
  group_by(key,value) |>
  summarize(count = n()) |> 
  mutate(prop = count/sum(count)) |>
  ungroup() 

plot<-iv_rates |>
  ggplot(aes(fill=key, y=prop, x=value)) + 
  geom_col(color="black",width = 0.5)+ 
  facet_wrap(~key,scales="free") +
  theme(legend.position="bottom") + 
  geom_label(aes(label=scales::percent(prop)), color="white") + 
  labs(
    title = "survival ratio",
    subtitle = "Brain Cancer Analysis",
    y = "proportion(%)", 
    x = "",
    fill="status",
    caption="B.Ncube::Data enthusiast") + 
  scale_y_continuous(labels = scales::percent)+
  tvthemes::scale_fill_kimPossible()+
 ggthemes::theme_fivethirtyeight() +
  theme(legend.position = 'none')+
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) 
plot

mortality Rate

data|>
  dplyr::group_by(brain_status)|> 
  dplyr::summarize(count=n(),
             mean=mean(gtv, na.rm = T),
             std_dev=sd(gtv, na.rm = T),
             median=median(gtv, na.rm = T), 
             min=min(gtv, na.rm = T),
             max=max(gtv, na.rm = T))|> 
 tibble::column_to_rownames(var = "brain_status") |> 
 base::as.matrix()
#>       count      mean  std_dev median  min   max
#> alive    52  7.034423 7.828455   4.36 0.01 31.74
#> dead     35 11.142286 9.451387  11.38 0.14 34.64

#Calculate mortality rate
print(paste('Mortality rate (%) is ',round(35/(35+53)*100,digits=2)))
#> [1] "Mortality rate (%) is  39.77"

Mortality rate is approximately: 40%

we can view this in a plot or chart

library(extrafont)
loadfonts(quiet=TRUE)

iv_rates <- data |>
  group_by(brain_status) |>
  summarize(count = n()) |> 
  mutate(prop = count/sum(count)) |>
  ungroup() 

plot<-iv_rates |>
  ggplot(aes(x=brain_status, y=prop, fill=brain_status)) + 
  geom_col(color="black",width = 0.5)+ 
  theme(legend.position="bottom") + 
  geom_label(aes(label=scales::percent(prop)), color="white") + 
  labs(
    title = "mortality ratio",
    subtitle = "Brain Cancer Analysis",
    y = "proportion(%)", 
    x = "",
    fill="status",
    caption="B.Ncube::Data enthusiast") + 
  scale_y_continuous(labels = scales::percent)+ 
geom_hline(yintercept = (sum(data$status)/nrow(data)), col = "white", lty = 2) +
  tvthemes::scale_fill_kimPossible()+
  ggthemes::theme_fivethirtyeight() +
  theme(legend.position = 'right')
plot

Cancer outcome by diagnosis type

ggplot(data) +                                             
  aes(x=brain_status) +                                 
  aes(fill=diagnosis) +                                   
  geom_bar(aes( y=..count../tapply(..count.., ..x.. ,sum)[..x..]),   
      position="dodge",                                   
      color="black") +                                  
    tvthemes::scale_color_kimPossible() +   
    tvthemes::scale_fill_avatar()  +
  geom_text(aes( y=..count../tapply(..count.., ..x.. ,sum)[..x..],           
     label=scales::percent(..count../tapply(..count.., ..x.. ,sum)[..x..]) ),    
     stat="count",                              
     position=position_dodge(1.0),                         
     vjust=-0.5, 
     size=3) + 
  scale_y_continuous(limits=c(0,100)) +                    
  scale_y_continuous(labels = scales::percent) +     
  
 ggtitle("Cancer Outcome, diagnosis type") +                                  # title and axis labels=
     xlab("Diagnosis") +  
     ylab("% of Group") + 
  theme_minimal()

Many of those who died where diagnosed with HG glioma
lets see the values in the following table

sjt.xtab(data$diagnosis,data$brain_status,      
         show.row.prc=TRUE,                        
         show.summary=FALSE,                                        
         title="Cross Tab Diagnosis x Cancer")

Cross Tab Diagnosis x Cancer
diagnosis	brain_status		Total
diagnosis	alive	dead	Total
Meningioma	33 78.6 %	9 21.4 %	42 100 %
LG glioma	5 55.6 %	4 44.4 %	9 100 %
HG glioma	5 22.7 %	17 77.3 %	22 100 %
Other	9 64.3 %	5 35.7 %	14 100 %
Total	52 59.8 %	35 40.2 %	87 100 %

mortality rate by sex

Create a customised function for summarising categorical data per status

summarize_status <- function(tbl){tbl %>% 
  summarise(n_died = sum(brain_status == "yes"),
            n_total = n()) %>%
  ungroup() %>% 
  mutate(pct_died = n_died / n_total) %>% 
  arrange(desc(pct_died)) %>% 
  mutate(low = qbeta(0.025, n_died + 5, n_total - n_died + .5),
         high = qbeta(0.975, n_died + 5, n_total - n_died + .5),
         pct = n_total / sum(n_total),
         percentage=scales::percent(pct_died))
  }

mortality rate summary per sex

data |> 
  group_by(sex) |> 
  summarize_status()

sex	n_died	n_total	pct_died	low	high	pct	percentage
Female	0	45	0	0.0336	0.194	0.517	0%
Male	0	42	0	0.0358	0.206	0.483	0%

from the dataset , $\frac{20}{43}$ men died while the other 15 where woman who died
we can present these more visually with the graphs below

mortality is greater in males than it is in females
the white dashed line indicate the overal mortality rate

Outcome by diagnosis faceted by sex

ggplot(data) +                             # data to use 
  aes(x=brain_status) +                                        # x = predictor  (factor) 
  aes(fill=diagnosis) +                                         # fill = outcome (factor)  
  geom_bar( position="dodge",                                        # side-by-side  
       color="black") +                                                  # choose color of bar borders 
    tvthemes::scale_color_kimPossible() +       # choose colors of bars 
    tvthemes::scale_fill_avatar()  +   # bars 
  geom_text(aes(label=after_stat(count)),                    # required  
     stat='count',                                           # required 
     position=position_dodge(1.0),                           # required for centering  
     vjust= -0.5, 
     size=3) + 
   scale_x_discrete(limits = rev) +                                       # reverse order to 0 cups first 
   # FORCE y-axis ticks 
  facet_grid(~ sex)  +                                              # Strata in 1 row 
  ggtitle("Cancer Outcome, diagnosis type\nStratified") +           # title and axis labels 
  xlab("") +  
  ylab("Count") + 
  theme_bw() + 
  theme(legend.title=element_blank()) + 
  theme(legend.text=element_text(color="black", size=6)) + 
  theme(legend.position="right")

+ Meningioma affects more women than it affects men. + both men and women diagnosed with HG glioma are more likely to die

is there any association between diagnosis and cancer outcome

data %>% contingency(brain_status ~ diagnosis) 
#>             Outcome
#> Predictor    dead alive
#>   Other         5     9
#>   HG glioma    17     5
#>   LG glioma     4     5
#>   Meningioma    9    33
#> 
#>              Outcome +    Outcome -      Total                 Inc risk *
#> Exposed +            5            4          9     55.56 (21.20 to 86.30)
#> Exposed -           17            9         26     65.38 (44.33 to 82.79)
#> Total               22           13         35     62.86 (44.92 to 78.53)
#> 
#> Point estimates and 95% CIs:
#> -------------------------------------------------------------------
#> Inc risk ratio                                 0.85 (0.44, 1.62)
#> Inc odds ratio                                 0.66 (0.14, 3.10)
#> Attrib risk in the exposed *                   -9.83 (-47.09, 27.43)
#> Attrib fraction in the exposed (%)            -17.69 (-124.96, 38.43)
#> Attrib risk in the population *                -2.53 (-26.83, 21.78)
#> Attrib fraction in the population (%)         -4.02 (-20.73, 10.38)
#> -------------------------------------------------------------------
#> Yates corrected chi2 test that OR = 1: chi2(1) = 0.016 Pr>chi2 = 0.900
#> Fisher exact test that OR = 1: Pr>chi2 = 0.698
#>  Wald confidence limits
#>  CI: confidence interval
#>  * Outcomes per 100 population units 
#> 
#> 
#> 	Fisher's Exact Test for Count Data
#> 
#> data:  dat
#> p-value = 0.0001994
#> alternative hypothesis: two.sided

Comments

the outcome is a result of a fisher's exact test for association
the plimenary analysis suggests that at 5% level of significance ,diagnosis( Meningioma , LG glioma ,HG glioma and other diagnostics) and stereo variable are significantly associated with survival of patients.

Bivarate relationships for all variables

results <- compareGroups(brain_status ~ diagnosis+ki+sex+loc+stereo+gtv, data = data) 
results
#> 
#> 
#> -------- Summary of results by groups of 'brain_status'---------
#> 
#> 
#>   var       N  p.value  method            selection
#> 1 diagnosis 87 <0.001** categorical       ALL      
#> 2 ki        87 0.023**  continuous normal ALL      
#> 3 sex       87 0.255    categorical       ALL      
#> 4 loc       87 0.096*   categorical       ALL      
#> 5 stereo    87 0.063*   categorical       ALL      
#> 6 gtv       87 0.037**  continuous normal ALL      
#> -----
#> Signif. codes:  0 '**' 0.05 '*' 0.1 ' ' 1

the above results show the relationship between the cancer outcome and patient characteristics
the p-values correspond to chi-squared test for categorical data and t.test for numeric data
diagnosis as mentioned in previous results is associated with cancer outcome.
Gross tumor volume and kanrfsky index have p-values less than 5% implying there is significant mean differences for these values per each outcome category

readytable <- createTable(results, show.ratio=TRUE, show.p.overall = FALSE)       
print(readytable, header.labels = c(p.ratio = "p-value"))      
#> 
#> --------Summary descriptives table by 'brain_status'---------
#> 
#> ___________________________________________________________________ 
#>                       alive       dead            OR        p-value 
#>                       N=52        N=35                              
#> ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯ 
#> diagnosis:                                                          
#>     Meningioma     33 (63.5%)   9 (25.7%)        Ref.        Ref.   
#>     LG glioma       5 (9.62%)   4 (11.4%)  2.87 [0.58;13.7]  0.191  
#>     HG glioma       5 (9.62%)  17 (48.6%)  11.6 [3.55;45.2] <0.001  
#>     Other           9 (17.3%)   5 (14.3%)  2.02 [0.50;7.71]  0.314  
#> ki                 83.1 (9.61) 77.7 (11.1) 0.95 [0.91;0.99]  0.025  
#> sex:                                                                
#>     Female         30 (57.7%)  15 (42.9%)        Ref.        Ref.   
#>     Male           22 (42.3%)  20 (57.1%)  1.80 [0.76;4.38]  0.185  
#> loc:                                                                
#>     Infratentorial 15 (28.8%)   4 (11.4%)        Ref.        Ref.   
#>     Supratentorial 37 (71.2%)  31 (88.6%)  3.03 [0.97;11.9]  0.057  
#> stereo:                                                             
#>     SRS            18 (34.6%)   5 (14.3%)        Ref.        Ref.   
#>     SRT            34 (65.4%)  30 (85.7%)  3.08 [1.07;10.5]  0.037  
#> gtv                7.03 (7.83) 11.1 (9.45) 1.06 [1.00;1.11]  0.036  
#> ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯

the extended output shows more significant categories that determine cancer outcome
LG glioma is the most significant category associated with cancer outcome since its p-value value is p<0.001 which is less than 5%
the same applies for stereo:SRT

extended Analysis

library(glue)
bind_count = function(x){
  as_tibble(x) %>% 
  add_count(value) %>% 
  mutate(value = glue("{value} ({n})")) %>%
    pull(value)
  
}

# Scatter plot
data %>% 
  na.omit() %>% 
  group_by(diagnosis = bind_count(diagnosis)) %>% 
  summarize_status() %>% 
  mutate(diagnosis = fct_reorder(diagnosis, pct_died)) %>% 
  ggplot(mapping = aes(x = pct_died, y = diagnosis)) +
  geom_point(aes(size = pct), show.legend = T) +
  geom_errorbarh(aes(xmin = low, xmax = high), height = .3) +
  labs(
    x = "Percentage of patients in each category who died",
    title = "Distribution of status by diagnosis",
    size = "%prevalence",
    subtitle = ""
  ) +
  scale_x_continuous(labels = percent) +
  scale_size_continuous(labels = percent) +
  theme(plot.title = element_text(hjust = 0.5))

It can also be noted that cumulatively, people with LG glioma and HG glioma had high chances of death.

Survival Analysis.

library(finalfit)
explanatory<-c("sex","diagnosis","loc","ki","gtv","stereo")
dependent_os<-"Surv(time, status)"

data |> 
  finalfit(dependent_os,explanatory)

Dependent: Surv(time, status)		all	HR (univariable)	HR (multivariable)
sex	Female	45 (51.7)	-	-
	Male	42 (48.3)	1.57 (0.80-3.08, p=0.185)	1.20 (0.59-2.44, p=0.610)
diagnosis	Meningioma	42 (48.3)	-	-
	LG glioma	9 (10.3)	2.43 (0.75-7.89, p=0.141)	2.50 (0.71-8.72, p=0.152)
	HG glioma	22 (25.3)	7.34 (3.21-16.79, p<0.001)	8.62 (3.57-20.85, p<0.001)
	Other	14 (16.1)	2.24 (0.75-6.73, p=0.151)	2.42 (0.67-8.80, p=0.178)
loc	Infratentorial	19 (21.8)	-	-
	Supratentorial	68 (78.2)	2.34 (0.83-6.65, p=0.109)	1.55 (0.39-6.17, p=0.531)
ki	Mean (SD)	80.9 (10.5)	0.95 (0.92-0.98, p=0.003)	0.95 (0.91-0.98, p=0.003)
gtv	Mean (SD)	8.7 (8.7)	1.05 (1.01-1.08, p=0.006)	1.03 (0.99-1.08, p=0.125)
stereo	SRS	23 (26.4)	-	-
	SRT	64 (73.6)	2.80 (1.08-7.27, p=0.034)	1.19 (0.37-3.88, p=0.768)



dependent_os1<-"status"
data |> 
  finalfit(dependent_os1 ,explanatory)

Dependent: status		unit	value	Coefficient (univariable)	Coefficient (multivariable)
sex	Female	Mean (sd)	0.3 (0.5)	-	-
	Male	Mean (sd)	0.5 (0.5)	0.14 (-0.07 to 0.35, p=0.178)	0.03 (-0.16 to 0.23, p=0.723)
diagnosis	Meningioma	Mean (sd)	0.2 (0.4)	-	-
	LG glioma	Mean (sd)	0.4 (0.5)	0.23 (-0.09 to 0.55, p=0.162)	0.22 (-0.10 to 0.55, p=0.170)
	HG glioma	Mean (sd)	0.8 (0.4)	0.56 (0.33 to 0.79, p<0.001)	0.50 (0.26 to 0.74, p<0.001)
	Other	Mean (sd)	0.4 (0.5)	0.14 (-0.13 to 0.42, p=0.300)	0.14 (-0.16 to 0.44, p=0.341)
loc	Infratentorial	Mean (sd)	0.2 (0.4)	-	-
	Supratentorial	Mean (sd)	0.5 (0.5)	0.25 (-0.01 to 0.50, p=0.055)	0.13 (-0.15 to 0.41, p=0.352)
ki	[40.0,100.0]	Mean (sd)	0.4 (0.5)	-0.01 (-0.02 to -0.00, p=0.019)	-0.01 (-0.02 to -0.00, p=0.030)
gtv	[0.0,34.6]	Mean (sd)	0.4 (0.5)	0.01 (0.00 to 0.03, p=0.030)	0.01 (-0.01 to 0.02, p=0.327)
stereo	SRS	Mean (sd)	0.2 (0.4)	-	-
	SRT	Mean (sd)	0.5 (0.5)	0.25 (0.02 to 0.48, p=0.035)	0.04 (-0.20 to 0.29, p=0.733)

fit <- survfit(Surv(time, status) ~ sex, data = data)
# Access to the sort summary table
summary(fit)$table
#>            records n.max n.start events    rmean se(rmean) median 0.95LCL
#> sex=Female      45    45      45     15 53.15313  5.645267  51.02   46.16
#> sex=Male        42    42      42     20 45.20122  5.732211  31.25   20.69
#>            0.95UCL
#> sex=Female      NA
#> sex=Male        NA

ggsurvplot(fit, data = data,
 surv.median.line = "hv", # Add medians survival

 # Change legends: title & labels
 legend.title = "Sex",
 legend.labs = c("Male", "Female"),
 # Add p-value and tervals
 pval = TRUE,

 conf.int = TRUE,
 # Add risk table
 risk.table = TRUE,
 tables.height = 0.2,
 tables.theme = theme_cleantable(),

 # Color palettes. Use custom color: c("#E7B800", "#2E9FDF"),
 # or brewer color (e.g.: "Dark2"), or ggsci color (e.g.: "jco")
 palette = "#Dark2",
 ggtheme = theme_bw(), # Change ggplot2 theme
 font.main = c(14, "bold", "darkblue"),
 font.x = c(12, "bold.italic", "darkblue"),
 font.y = c(12, "bold.italic", "darkblue"),
 font.tickslab = c(10, "plain", "darkgreen")
)

females tend to survive longer than men
the difference in survival time is not statistically significant at 5% level of significance

difference in survival per diagnosis

Meningioma patients have a longer survival time as compared to other patients
there is significant difference in survival times between patients of different diagnosis type at 5% level of significance

the median survival times for LG glioma, HG glioma, and Other are approximately 49,11 and 29 months respectively .

difference in survival per Stereotactic

those with location at supratentorial have a longer survival time as compared to their counterparts
the difference in survival time is rather not statistically significant

Now lets fit a model

Logistic regression

data<- data |> 
  select(-brain_status)

data$status<-as.factor(data$status)

lr_mod <- parsnip::logistic_reg()|> 
   set_engine('glm') |> 
  set_mode("classification")  

model<-lr_mod |> 
  fit(status~. -time,data=data)

model %>%
  pluck("fit") %>%
  summary()
#> 
#> Call:
#> stats::glm(formula = status ~ . - time, family = stats::binomial, 
#>     data = data)
#> 
#> Coefficients:
#>                    Estimate Std. Error z value Pr(>|z|)    
#> (Intercept)         5.36393    3.50137   1.532 0.125535    
#> sexMale             0.30110    0.56563   0.532 0.594500    
#> diagnosisLG glioma  1.14990    0.84737   1.357 0.174777    
#> diagnosisHG glioma  2.64083    0.73785   3.579 0.000345 ***
#> diagnosisOther      1.01974    0.90625   1.125 0.260492    
#> locSupratentorial   0.96505    0.89706   1.076 0.282021    
#> ki                 -0.11264    0.05147  -2.189 0.028630 *  
#> gtv                 0.03413    0.03468   0.984 0.325042    
#> stereoSRT           0.34450    0.73961   0.466 0.641369    
#> kan_indexindex>79   1.05054    1.08104   0.972 0.331158    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 117.264  on 86  degrees of freedom
#> Residual deviance:  84.789  on 77  degrees of freedom
#> AIC: 104.79
#> 
#> Number of Fisher Scoring iterations: 5

perform a stepwise regression

stepmodel_logit <- stats::step(model, direction = "both", trace = FALSE)
summary(stepmodel_logit)
#> 
#> Call:
#> glm(formula = status ~ diagnosis + ki, family = binomial, data = data)
#> 
#> Coefficients:
#>                    Estimate Std. Error z value Pr(>|z|)    
#> (Intercept)         4.81278    2.30775   2.085  0.03703 *  
#> diagnosisLG glioma  1.39942    0.80736   1.733  0.08304 .  
#> diagnosisHG glioma  2.84898    0.69866   4.078 4.55e-05 ***
#> diagnosisOther      0.57859    0.72100   0.802  0.42227    
#> ki                 -0.07735    0.02947  -2.625  0.00867 ** 
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 117.264  on 86  degrees of freedom
#> Residual deviance:  89.741  on 82  degrees of freedom
#> AIC: 99.741
#> 
#> Number of Fisher Scoring iterations: 4

comments and interprettations

from the logistic regression diagnosis-HG glioma and karnofsky index are statistically significant predictors of survival.
we speak of odds when we use logistic regression and diagnosis-HG glioma increases the odds of death of brain cancer patients because exp(coef)

Now we fit a Cox proportional hazards model:

cox_spec <- 
  proportional_hazards() %>% 
  set_engine("survival")

cox_fit <- cox_spec |> 
  fit(Surv(time, status) ~ ., data = BrainCancer)


cox_fit %>%
  pluck("fit") %>%
  summary()
#> Call:
#> survival::coxph(formula = Surv(time, status) ~ ., data = data, 
#>     model = TRUE, x = TRUE)
#> 
#>   n= 87, number of events= 35 
#>    (1 observation deleted due to missingness)
#> 
#>                        coef exp(coef) se(coef)      z Pr(>|z|)    
#> sexMale             0.18375   1.20171  0.36036  0.510  0.61012    
#> diagnosisLG glioma  0.91502   2.49683  0.63816  1.434  0.15161    
#> diagnosisHG glioma  2.15457   8.62414  0.45052  4.782 1.73e-06 ***
#> diagnosisOther      0.88570   2.42467  0.65787  1.346  0.17821    
#> locSupratentorial   0.44119   1.55456  0.70367  0.627  0.53066    
#> ki                 -0.05496   0.94653  0.01831 -3.001  0.00269 ** 
#> gtv                 0.03429   1.03489  0.02233  1.536  0.12466    
#> stereoSRT           0.17778   1.19456  0.60158  0.296  0.76760    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#>                    exp(coef) exp(-coef) lower .95 upper .95
#> sexMale               1.2017     0.8321    0.5930    2.4352
#> diagnosisLG glioma    2.4968     0.4005    0.7148    8.7215
#> diagnosisHG glioma    8.6241     0.1160    3.5664   20.8546
#> diagnosisOther        2.4247     0.4124    0.6678    8.8031
#> locSupratentorial     1.5546     0.6433    0.3914    6.1741
#> ki                    0.9465     1.0565    0.9132    0.9811
#> gtv                   1.0349     0.9663    0.9906    1.0812
#> stereoSRT             1.1946     0.8371    0.3674    3.8839
#> 
#> Concordance= 0.794  (se = 0.04 )
#> Likelihood ratio test= 41.37  on 8 df,   p=2e-06
#> Wald test            = 38.7  on 8 df,   p=6e-06
#> Score (logrank) test = 46.59  on 8 df,   p=2e-07

Hazard Ratios

Recall that the Cox model is expressed by the hazard function $h (t)$ :

$h (t) = h_{0} (t) \times e x p (β_{1} x_{1} + β_{2} x_{2} + \dots + β_{p} x_{p})$

The quantity of interest from a Cox regression model is the hazard ratio. The quantities $e x p (β_{i})$ are the hazard ratios.

The Hazard Ratio (HR) can be interpreted as follows:

HR = 1: No effect
HR < 1: indicates a decreased risk of death
HR > 1: indicates an increased risk of death

How to Interpret Results

The estimate column in the summary above is the regression parameter $β_{i}$ of the Cox model.

The $e s t i m a t e$ column quantifies the effect size (the impact) that the covariate has on the patient’s survival time.

The expression is $e x p (β_{i})$ is the hazard ratio – this is the blah column of the summary above.

So for example, we obtained a regression parameter $β_{1} = 0.9152$ for the diagnosisLG glioma vs other type diagnosis. The hazard ratio for this covariate is $H R = e x p (β_{1}) = 2.4968$ .

A HR > 1 indicates increased hazard of death.

Therefore, we would say that patients diagnosed with the LG glioma have a 2.4968 times increased hazard of death compared to other patients. The p-value associated with this regression parameter is $p = 0.15161$ , which indicates that the difference is not significant.

perform stepwise regression on `cox`

stepmodel_cox <- MASS::stepAIC(cox_fit, direction = "both", trace = FALSE)
summary(stepmodel_cox)
#> Call:
#> coxph(formula = Surv(time, status) ~ diagnosis + ki + gtv, data = BrainCancer)
#> 
#>   n= 87, number of events= 35 
#>    (1 observation deleted due to missingness)
#> 
#>                        coef exp(coef) se(coef)      z Pr(>|z|)    
#> diagnosisLG glioma  1.09706   2.99535  0.60985  1.799   0.0720 .  
#> diagnosisHG glioma  2.23701   9.36528  0.43529  5.139 2.76e-07 ***
#> diagnosisOther      0.72336   2.06134  0.57704  1.254   0.2100    
#> ki                 -0.05417   0.94727  0.01832 -2.957   0.0031 ** 
#> gtv                 0.04166   1.04254  0.02043  2.040   0.0414 *  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#>                    exp(coef) exp(-coef) lower .95 upper .95
#> diagnosisLG glioma    2.9953     0.3339    0.9065    9.8981
#> diagnosisHG glioma    9.3653     0.1068    3.9903   21.9805
#> diagnosisOther        2.0613     0.4851    0.6652    6.3874
#> ki                    0.9473     1.0557    0.9139    0.9819
#> gtv                   1.0425     0.9592    1.0016    1.0851
#> 
#> Concordance= 0.786  (se = 0.043 )
#> Likelihood ratio test= 40.33  on 5 df,   p=1e-07
#> Wald test            = 38.72  on 5 df,   p=3e-07
#> Score (logrank) test = 46.19  on 5 df,   p=8e-09

comments

the stepwise regression model results in a optimal set of variables that predict death by brain cancer
It can be noted that just like in logistic regression , diagnosisHG glioma has an increased hazard of death the same way it increases the odds of dying

To be continued

this is a work in progress
you are allowed to add in your thoughts
I am constantly updating this blogpost

Bongani Ncube

Data Scientist/Statistics/Public Health

My research interests include Casual Inference , Public Health , Survival Analysis, bayesian Statistics, Machine learning and Longitudinal Data Analysis.