Joint longitudinal and competing risks models: Simulation, estimation and prediction

This post takes a look at an extension of the standard joint longitudinal-survival model, which is to incorporate competing risks. Let’s start by formally defining the model.

We will assume a continuous longitudinal outcome,

y_{i}(t) = m_{i}(t) + \epsilon_{i}(t)


m_{i}(t) = X_{1i}(t)\mathbf{\beta}_{1} + Z_{i}(t)b_{i}

and \epsilon_{i}(t) is our normally distributed residual variability. We call m_{i}(t) our trajectory function, representing the true underlying value of the continuous outcome at time t, which is a function of fixed and random effects with associated design matrices, X_{2i}(t) and Z_{i}(t), and coefficients, \beta_{2} and b_{i}. We assume normally distributed random effects,

b_{i} \sim N(0,\Sigma).

We then have K competing events, each of which can be dependent on some (or many) aspects of the biomarker trajectory function, m_{i}(t). To keep things simple, we will work with the current value association structure. So, for the kth competing risk we parameterise the cause-specific hazard function as,

h_{ik}(t) = h_{0k}(t) \exp(X_{2ik}\mathbf{\beta}_{2k} + \alpha_{k} m_{ik}(t))

where \alpha_k is the log hazard ratio for the kth cause, for a one-unit increase in the biomarker, at time t. Each competing event can be modelled however we like, with different baseline hazard functions, and with different covariate effects. Of course, we can also have different or additional association structures for each event.

To illustrate, we’re going to show you how to simulate data from a joint longitudinal and competing risks model, with two competing risks, representing death from cancer and death from other causes, and then show how to fit the true model using merlin.

Let’s assume a simple random intercept and random linear trend for the biomarker trajectory, i.e.

m_{i}(t) = (\beta_{10} + b_{0i}) + (\beta_{11} + b_{1i})t

We start by setting a seed (as we always should), generate a sample of 300 patients, and create an id variable.

. clear

. set seed 7254

. set obs 300
Number of observations (_N) was 0, now 300.

. gen id = _n

Next we generate our subject-specific random effects, for the intercept and slope. We need a single draw from each distribution, per patient. We’re assuming the random effects are independent, but that’s easily adapted using drawnorm. We also generate a treatment group indicator, assigning to each arm with 50% probability.

. gen b0 = rnormal(0,1)

. gen b1 = rnormal(0,0.5)

. gen trt = runiform()>0.5

Now we simulate cause-specific event times, using survsim. Simulating joint longitudinal-survival data is not the easiest of tasks mathematically, but thanks to survsim it’s pretty simple, especially given the recent additions to its capabilities (Crowther, 2022). We simply specify our cause-specific hazard functions, and survsim does the hard work (numerical integration nested within root-finding), details in Crowther and Lambert (2013).

We specify our baseline distribution parameters, the scale and shape, for Weibull baseline hazards for causes one and two, and an administrative censoring time of 5 years. We also specify a treatment effect with a log hazard ratio of -0.5 on cause 1 and -0.2 on cause 2. Putting this all together in user-defined functions in survsim, simulating under the current value association structure, defining \alpha_{1}=0.2, we get,

. //cause one
. local maxt 5

. local l1 0.1

. local g1 1.2

. local l2 0.05

. local g2 1.5

. survsim stime state event,                      /// new variables
>         hazard1(                                /// cause 1 (cancer)
>                 user(`l1':*`g1':*{t}:^(`g1'-1)  /// user-defined function
>                      :* exp(0.2                 /// biomarker
>                      :* (b0 :+ (0.1:+b1):*{t})  ///   trajectory
>                      ))                         ///
>                 covariates(trt -0.5))           /// treatment effect (PH)
>         hazard2(                                /// cause 2 (other)
>                 user(`l2':*`g2':*{t}:^(`g2'-1)  /// user-defined function
>                      :* exp( -0.2 :*            /// biomarker
>                      (b0 :+ (0.1:+b1):*{t})     ///   trajectory
>                      ))                         ///
>                 covariates(trt -0.2))           /// treatment effect (PH)
>         maxtime(`maxt')                         //  admin. censoring
variables stime0 to stime1 created
variables state0 to state1 created
variables event1 to event1 created

Now survsim is designed for many situations, such as allowing you to define your own custom hazard functions, which can depend on time…or a time-dependent biomarker trajectory. It also provides a general framework to simulate from competing risks and multi-state survival models. In this example, we are simulating from an all-cause hazard function, which is a function of two cause-specific hazards, each of which depends on a time-dependent biomarker trajectory. That’s a mouthful. Ok, so to start we define some new variable stubs, to contain the variables that survsim will create. stime* will contain the start and stop times for each transition, state* will contain the start and stop states, and event* will be flag indicator variables for whether an event/transition occurred, or an obervation was censored.

Let’s see what we created:

. list id trt stime* state* event* if _n<=5

     | id   trt   stime0      stime1   state0   state1   event1 |
  1. |  1     1        0           5        1        1        0 |
  2. |  2     0        0   1.8992871        1        2        1 |
  3. |  3     1        0    .9001096        1        2        1 |
  4. |  4     1        0   1.0330409        1        2        1 |
  5. |  5     1        0   2.9920639        1        3        1 |

So all observations started in state 1 (state0 is a 1) at time 0 (stime0 is 0). Observation 1 was censored (event1 is a 0) at 5 years, observations 2, 3 and 4 moved to state 2 at various times, i.e. died due to cancer, and observation 5 died from other causes (state1 is a 3) at 2.756 years. As it’s a competing risks situation (which is what survsim will assume unless you specify your own transition matrix), then observations can only go from the starting state 1 to states 2 or 3, corresponding to hazard1 and hazard2, respectively.

All that’s left to do is to create our cause-specific event indicators, which we will need in our estimation step. We’ll then drop variables we no longer need.

. gen byte cancer = state1==2

. gen byte other = state1==3

. drop stime0 state0 state1 event1

Now we’re all done with the competing risk outcomes. Onto the biomarker.

Let’s assume a setting where our biomarker is recorded at baseline, and then annually, so each patient can have up to five measurements. The easiest thing to do is to expand our dataset, by creating replicants of each row, and then drop any observations which occur after each patient’s event/censoring time.

. expand 5
(1,200 observations created)

. bys id : gen time = _n-1

. drop if time>stime1
(465 observations deleted)

We generate our oberserved longitudinal biomarker measurements from the true model above,

. gen xb = b0 + 0.1 * time + b1 * time

. gen y = rnormal(xb,0.5)                 //measurement error

It’s easiest to work in wide format for merlin (I mean in terms of the outcomes, they are side by side, but within each outcome the observations are in long format), so for the competing risks outcomes we need to have only a single row of data per patient, so we replace the repeated event times and event indicators with missing values.

. bys id (time) : replace stime1 = . if _n>1
(735 real changes made, 735 to missing)

. bys id (time) : replace cancer = . if _n>1
(735 real changes made, 735 to missing)

. bys id (time) : replace other = . if _n>1
(735 real changes made, 735 to missing)

We’re now all set up to fit our data-generating, true model, with merlin, using the following syntax:

merlin (y                       /// biomarker outcome
          time                  /// fixed linear time
          time#M2[id]@1         /// random slope on linear time
          M1[id]@1,             /// random intercept
          family(gaussian)      /// distribution
          timevar(time))        /// variable representing time
       (stime1                  /// survival time
          trt                   /// baseline treatment
          EV[y],                /// expected value of biomarker
          family(weibull,       /// distribution
               failure(cancer)) /// cause-specific event indicator
          timevar(stime1))      /// timevar
       (stime1                  /// survival time 
          trt                   /// baseline treatment
          EV[y],                /// expected value of biomarker
          family(weibull,       /// distribution
               failure(other))  /// cause-specific event indicator
          timevar(stime1))      //  timevar

which we would say is rather elegantly simple to specify for such a complex model…you may disagree. We’re using the EV[y] element type to link the current value of the biomarker to each cause-specific hazard model, making sure to specify the timevar() options, so time is correctly handled. We can use stime1 as our outcome in both survival models but simply use the cause-specific event indicators so we get the correct contributions to the likelihood. It’s important to note that competing risks models with exactly observed event data, can generallay be fitted either with a stacked dataset, or separately, with cumulative incidence predictons calculated post-estimation – this is not true in such a joint model, since the biomarker model appears in both cause-specific hazard functions, and hence must be estimated simultaneously.

Let’s fit the model:

. merlin (y time time#M2[id]@1 M1[id]@1, family(gaussian) timevar(time))      ///
>        (stime1 trt EV[y], family(weibull, failure(cancer)) timevar(stime1)) ///
>        (stime1 trt EV[y], family(weibull, failure(other)) timevar(stime1)) 

Fitting fixed effects model:

Fitting full model:

Iteration 0:   log likelihood = -2438.3381  (not concave)
Iteration 1:   log likelihood = -2405.6907  
Iteration 2:   log likelihood = -2171.3013  
Iteration 3:   log likelihood = -2069.3195  
Iteration 4:   log likelihood = -2032.6354  
Iteration 5:   log likelihood = -2023.3654  
Iteration 6:   log likelihood = -2023.1791  
Iteration 7:   log likelihood = -2023.1787  
Iteration 8:   log likelihood = -2023.1787  

Mixed effects regression model                           Number of obs = 1,035
Log likelihood = -2023.1787
             | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
y:           |            
        time |   .1331602   .0368502     3.61   0.000     .0609352    .2053853
 time#M2[id] |          1          .        .       .            .           .
      M1[id] |          1          .        .       .            .           .
       _cons |    .126154   .0672743     1.88   0.061    -.0057013    .2580092
  sd(resid.) |   .5107904   .0159339                      .4804961    .5429948
stime1:      |            
         trt |  -.6689459    .181305    -3.69   0.000    -1.024297   -.3135946
        EV[] |   .1962076   .0564493     3.48   0.001      .085569    .3068462
       _cons |  -1.871467   .1573127   -11.90   0.000    -2.179794    -1.56314
  log(gamma) |   .0748063   .0806735     0.93   0.354    -.0833108    .2329233
stime1:      |            
         trt |  -.3152746   .2263077    -1.39   0.164    -.7588294    .1282803
        EV[] |   -.306471    .069009    -4.44   0.000    -.4417262   -.1712158
       _cons |    -3.0444   .2563032   -11.88   0.000    -3.546745   -2.542055
  log(gamma) |   .4395772    .097876     4.49   0.000     .2477438    .6314106
id:          |            
      sd(M1) |    1.07923   .0509042                      .9839322    1.183757
      sd(M2) |   .5011646   .0292952                       .446914    .5620007

The model fits beautifully…phew.

Now we have our fitted model, generally of most interest within a competing risks analysis is both the cause-specific hazard ratios, which we get from our results table, and estimates of the cause-specific cumulative incidence functions.

We use merlin‘s predict engine to help us with such things. When predicting something which depends on time, it’s often much easier to specify a timevar(), which contains timepoints at which to predict at, which subsequently makes plotting much easier. So now we’ll generate a new variable tvar which will contain time points between 0 and 5, at 500 equally spaced points.

. range tvar 0 5 500
(535 missing values generated)

Let’s now predict each cause-specific cumulative incidence function, in particular, we’ll predict marginal CIFs (integrating out the random effects), so we can interpret them as population-average predictions. We’ll also use the at() option so we can investigate the effect of treatment.

. predict cif1, cif marginal outcome(2) at(trt 0) timevar(tvar)
(535 missing values generated)

. predict cif2, cif marginal outcome(2) at(trt 1) timevar(tvar)
(535 missing values generated)

. predict cif3, cif marginal outcome(3) at(trt 0) timevar(tvar)
(535 missing values generated)

. predict cif4, cif marginal outcome(3) at(trt 1) timevar(tvar)
(535 missing values generated)

For plotting purposes, we often create stacked plots of CIFs, and one way to do this is to add them appropriately together to draw the area under the curve, and then overlay one of the CIFs.

. gen totalcif1 = cif1 + cif3
(535 missing values generated)

. gen totalcif2 = cif2 + cif4
(535 missing values generated)

We now plot the stacked, marginal CIFs for those in the placebo group and those in the treated group, and combine (using grc1leg from SSC):

. twoway (area totalcif1 tvar)(area cif3 tvar), name(g1,replace)            ///
>         xtitle("Time since entry") ytitle("Cumulative incidence")         ///
>         title("Placebo group") legend(cols(1)                             ///
>                 order(1 "Prob. of death due to cancer"                    ///
>                 2 "Prob. of death due to other causes"))                  ///
>         ylabel(,angle(h) format(%2.1f)) ylabel(0(0.1)1) nodraw
. twoway (area totalcif2 tvar)(area cif4 tvar), name(g2,replace)      ///
>         xtitle("Time since entry") ytitle("Cumulative incidence")   ///
>         title("Treated group")                                      ///
>         legend(cols(1) order(1 "Prob. of death due to cancer"       ///
>         2 "Prob. of death due to other causes"))                    ///
>         ylabel(,angle(h) format(%2.1f)) ylabel(0(0.1)1) nodraw
. grc1leg g1 g2           //combine them

We can directly quantify the impact of treatment on each CIF by predicting the difference, using the cifdifference option, combined with at1() and at2() statements.

. predict diffcif, cifdifference marginal outcome(2)       ///
>                  at1(trt 1) at2(trt 0) timevar(tvar)     ///
>                  ci
(535 missing values generated)
note: confidence intervals calculated using Z critical values.
. twoway  (rarea diffcif_lci diffcif_uci tvar)                    ///
>         (line diffcif tvar), name(g3,replace)                   ///
>         xtitle("Time since entry")                              ///
>         ytitle("Difference in cumulative incidence")            ///
>         title("CIF({it:t} | treated) - CIF({it:t} | placebo)")  ///
>         ylabel(,angle(h) format(%3.2f))                         ///
>         legend(order(2 "Diff. in CIF" 1 "95% confidence interval"))

We’ll leave it there for now. The joint model we’ve covered is an introductory example in the competing risks setting. Given merlin‘s capabilities, it’s rather simple to extend to using other associations structures, such as the rate of change (dEV[y]) or the integral (iEV[y]) of the trajectory function, and to use different distributions for each cause-specific hazard model…the list of extensions goes on.

Happy joint modelling.


State-of-the-art statistical models for modern HTA

At @RedDoorAnalytics, we develop methodology and software for efficient modelling of biomarkers, measured repeatedly over time, jointly with survival outcomes, which are being increasingly used in cancer settings. We have also developed methods and software for general non-Markov multi-state survival analysis, allowing for the development of more plausible natural history models, where patient history can […]
Learn more


Multilevel (hierarchical) survival models: Estimation, prediction, interpretation

Hierarchical time-to-event data is common across various research domains. In the medical field, for instance, patients are often nested within hospitals and regions, while in education, students are nested within schools. In these settings, the outcome is typically measured at the individual level, with covariates recorded at any level of the hierarchy. This hierarchical structure […]
Learn more

Statistical Primers

What are competing risks?

Competing risks In survival analysis, competing risks refer to the situation when an individual is at risk of experiencing an event that precludes the event under study to occur. Competing risks commonly occur in studies of cause-specific mortality, as all other causes of death than the one under study might happen before the individuals “have […]
Learn more