Survival Analysis Tutorial with Practical Application in R using the ggsurvfit Package

Introduction:

Survival analysis is a statistical technique used to analyze the time until an event of interest occurs. It is commonly used in medical research, epidemiology, and other fields to study the time to death or failure in a population. In this tutorial, we will explore survival analysis using the ggsurvfit package in R. ggsurvfit is a powerful package that provides an easy-to-use interface for visualizing survival curves and conducting various survival analysis tasks.

Installing Required Packages:

Before we begin, ensure that the ggsurvfit package and its dependencies are installed on your system. You can install the package using the following command:

# install.packages("ggsurvfit")

Let’s load the necessary libraries into our R environment: Loading the Required Libraries:

Understanding Survival Analysis:

Survival analysis deals with time-to-event data, where the event can be death, failure, recovery, or any other event of interest. The key concepts in survival analysis include:

  • Survival Time: The time from the start of the observation to the occurrence of the event.
  • Survival Function: The probability that an individual survives beyond a certain time point.
  • Hazard Function: The instantaneous rate at which events occur given that the individual has survived up to a specific time.

Preparing the Data:

For this tutorial, we will use the “lung” dataset from the survival package, which contains information about survival times of patients with advanced lung cancer.

lung1 <- survival::lung
  
str(lung1)
'data.frame':   228 obs. of  10 variables:
 $ inst     : num  3 3 3 5 1 12 7 11 1 7 ...
 $ time     : num  306 455 1010 210 883 ...
 $ status   : num  2 2 1 2 2 1 2 2 2 2 ...
 $ age      : num  74 68 56 57 60 74 68 71 53 61 ...
 $ sex      : num  1 1 1 1 1 1 2 2 1 1 ...
 $ ph.ecog  : num  1 0 0 1 0 1 2 2 1 2 ...
 $ ph.karno : num  90 90 90 90 100 50 70 60 70 70 ...
 $ pat.karno: num  100 90 90 60 90 80 60 80 80 70 ...
 $ meal.cal : num  1175 1225 NA 1150 NA ...
 $ wt.loss  : num  NA 15 15 11 0 0 10 1 16 34 ...

Exploratory Data Analysis (EDA):

Before diving into survival analysis, it is essential to explore and understand the data. Perform exploratory data analysis tasks such as summarizing the dataset, checking for missing values, and identifying any potential covariates.

lung1$sex <- ifelse(lung1$sex == 1, "male", "female")

# Checking for missing values
sum(is.na(lung1))
[1] 67
# Summary of the dataset
lung1 %>% 
  group_by(sex) %>% 
  summarise(n = n(),
            mean = mean(time), 
            median = median(time),
            sd = sd(time), 
            min = min(time), 
            max = max(time)
            ) %>% 
  kable()
sex n mean median sd min max
female 90 338.9667 292.5 203.4688 5 965
male 138 283.2319 224.0 213.0516 11 1022
# Exploring covariates
# Example: Boxplot of survival time by treatment
p <- lung1 %>%
  ggplot(aes(as.factor(sex), time,color = sex )) +
  geom_boxplot() +
  # geom_jitter(height =  0,width = 0.2,alpha = .4)+ 
  labs(x = "ph.ecog", y = "time") +
  theme_minimal()

plotly::ggplotly(p)

Estimating Survival Curves:

To estimate the survival curves, we can use the survfit() function from the survival package. The survfit() function takes a formula specifying the survival time and an optional grouping variable.

surv_fit <- survfit(Surv(time, status) ~ 1, data = lung1)

Visualizing Survival Curves:

The ggsurvfit package provides a convenient way to visualize survival curves using ggplot2. We can use the ggsurvplot() function to create survival plots with various customization options.

# survfit2() calculates the survival curves based on the time and status variables, grouped by the sex variable in the lung1 dataset
# ggsurvfit() creates a basic survival plot using ggplot2

survfit2(Surv(time, status) ~ sex, data = lung1) %>%
  
  # Adds the survival curves to the plot
  ggsurvfit() + 
  
  # Adds confidence intervals to the survival curves
  add_confidence_interval() +  
  
  # Sets the x-axis limits from 0 to 900
  coord_cartesian(xlim = c(0, 900)) +  
  
  # Sets the y-axis limits from 0 to 1 and Formats y-axis labels as percentages
  scale_y_continuous(
    limits = c(0, 1), 
    labels = scales::percent  
  ) +
  # Manually sets the colors for line colors
  scale_color_manual(values = c("#2584da", "#da7b25")) +  
  scale_fill_manual(values = c("#2584da", "#da7b25")) +   
  
  # Applies a minimal theme to the plot
  # Sets the position of the legend to the bottom
  theme_minimal() +  
  theme(legend.position = "bottom") +  
  
  # Sets the plot title and the y-axis and x-axis label
  labs(
    title = "Visualizing Survival Curves", 
    y = "Percentage Survival",  
    x = "Survival time in days" 
  ) +
  
  # Adds a risk table to the plot
  add_risktable(risktable_group = 'risktable_stats', size = 3) +  
  
  # Sets x-axis breaks at intervals of 100
  scale_x_continuous(breaks = seq(from = 0, to = 1000, by = 100)) +  
  
  # Adds a p-value for the log-rank test
  add_pvalue(caption = "Log-rank {p.value}")  

Cox Proportional Hazards Model:

The Cox proportional hazards model is widely used in survival analysis to examine the relationship between survival time and covariates. We can fit the Cox model using the coxph() function.

# Fitting the Cox proportional hazards model
cox_model <- coxph(Surv(time, status) ~ age + sex + ph.ecog, data = lung1)

# Printing the model summary
summary(cox_model)  
Call:
coxph(formula = Surv(time, status) ~ age + sex + ph.ecog, data = lung1)

  n= 227, number of events= 164 
   (1 observation deleted due to missingness)

            coef exp(coef) se(coef)     z Pr(>|z|)    
age     0.011067  1.011128 0.009267 1.194 0.232416    
sexmale 0.552612  1.737787 0.167739 3.294 0.000986 ***
ph.ecog 0.463728  1.589991 0.113577 4.083 4.45e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

        exp(coef) exp(-coef) lower .95 upper .95
age         1.011     0.9890    0.9929     1.030
sexmale     1.738     0.5754    1.2509     2.414
ph.ecog     1.590     0.6289    1.2727     1.986

Concordance= 0.637  (se = 0.025 )
Likelihood ratio test= 30.5  on 3 df,   p=1e-06
Wald test            = 29.93  on 3 df,   p=1e-06
Score (logrank) test = 30.5  on 3 df,   p=1e-06

Conclusion:

Survival analysis is a valuable technique for analyzing time-to-event data. In this tutorial, we explored survival analysis using the ggsurvfit package in R. We covered the estimation of survival curves, visualization of survival plots, and fitting a Cox proportional hazards model. By understanding and applying these techniques, you can gain valuable insights from your survival data. Remember to refer to the documentation and additional resources for more advanced topics, such as handling censored data, stratified survival analysis, and time-varying covariates.