Section 1: Survival function and hazard function This part of the course Medical Statistics and Epidemiology deals with the modeling and analysis of data that have as a principal end point the time until an event occurs. Such events are generically referred to as failures though the event may, for instance, be the performance of a certain task in a learning experiment in psychology or a change of residence in a demographic study as well. Major areas of application, however, are medical studies on chronic diseases and industrial life testing.
We assume that observations are available on the independent failure time of individuals. Let be the nonnegative random variable representing the failure time of an arbitrary individual. We assume that the probability distribution of is described by a density function . We shall introduce the Survival function and the hazard function which characterize the distribution of as well. The survival function is defined by
and is equal to , where is the cumulative distribution function of . (Note for each number in case of a density function.) Since the cumulative distribution function specifies the distribution of , the distribution of is specified as well by the survival function .
The hazard function specifies the instantaneous rate of failure at conditional upon survival to time and is defined by the limit for of the following ratio:
Taking this limit we get
Note that the derivative of the survival function is equal to . The distribution of is specified by its hazard function as well because the survivor function is determined by the hazard function:
Section 2: Distributions for failure rate
In this section we present a number of models for the distribution of .
The one parameter exponential distribution is obtained for by taking the hazard function to be a constant: (with , hence
and follow rather easily. So for the exponential distribution the instantaneous failure rate is independent of so that the conditional chance of failure does not depend on how long the individual has been on trial. This is referred to as the memoryless property of the exponential distribution. An empirical check of the exponential distribution for a set of survival data is provided by plotting the log of the survival function estimate versus . Such a plot should approximate a straight line through the origin as can be concluded from (7).
An important generalization of the exponential distribution allows for a power dependence of the hazard function on time. This yields the two parameter Weibull distribution with hazard function
This hazard function is monotone decreasing for , is monotone increasing for and reduces to a constant if . For the Weibull distribution we get
An empirical check for the Weibull distribution is provided by a plot of the estimate of versus . The plot should give approximately a straight line.
In general the distribution of a failure or survival time is skew. Skew distributions may be modeled by means of a lognormal distribution or a gamma distribution as well. If has a lognormal distribution then this means that has a normal distribution, described by expectation and a variance . The gamma distribution may be regarded as another generalization of the exponential distribution, its density function is
where is the well-known gamma function:
For the density (11) reduces to the density of the exponential distribution, note .
Section 3: Kaplan Meier Estimator
Survival analysis is concerned with studying the time between entry to a study and a subsequent event. Originally the analysis was concerned with time until death, hence the name, but survival analysis is applicable to many areas as well as mortality. A common feature of survival data is censoring, it means that the exact failure times of a number of subjects are not known.
There are several reasons for censoring, to mention a few:
Some patients might have left the study early, they are lost to follow up.
Examples: emigration, fatal accidents in traffic (competing risk)
Study ends when a fixed time is reached (right censoring of type I)
Study ens when a fixed number of failures occur (right censoring of type II)
In these examples there is right censoring, which means some failure times are not known. For these unknown failure times one only knows that the failure time exceeds some known value, the so-called censoring time. In this text we assume that the process of censoring is independent of the process we want to study. Furthermore we only consider right censoring. We study survival of 49 patients with Dukes’C colorectal cancer. The survival times (months) of two treatment groups are as follows.
Treatment ( linoleic acid ,
Here + means censoring. The first entry (3+) of the control group means that the patient left the study after 3 months survival. So the corresponding survival time is known to be more than 3 months, 3 (months) is the censoring time of the patient. For the treatment group we shall estimate the survivor function .
For estimation of we use the Kaplan Meier estimator, also called the product limit estimator. Suppose that the survival times, including censored observations, of a homogeneous group of patients are represented by . We assume that the survival times (patients) are already ordered such that . For a given value find the largest value such that , the probability is then estimated by
where is the number of subjects alive just before time (the kth ordered survival time) and denotes the number who died at time . Let us determine the estimates for the treatment group.
The estimate of formule (13) is the product of factors . For censored observations 1+ and 5+ these factors equal 1. We get for . Following the receipt of the Kaplan Meier estimator we obtain
The Kaplan Meier estimate is just a step function, this step function only changes for survival times with a positive , for the treatment group these times are 6, 10, 12, 24 and 32. Each factor in the Kaplan Meier estimator represents 1 minus an estimated hazard rate. For survival time we consider the number of people still alive, sometimes called the number at risk, this is the number . Then the condional probability of surviving is estimated in a straightforward way by . Next plot is a plot of the estimate of the survival function of the treatment group.
The corresponding plot of the control group is as follows.
Section 4: Regression models In section 2 several survival distributions were introduced for modeling the survival experience of a homogeneous population. Usually, however, there are explanatory variables upon which failure time may depend. It therefore becomes of interest to consider generalizations of these models to take account of information of explanatory variables.
Consider failure times of individuals. For each individual we have values of explanatory variables. Note that the explanatory may include both quantitative variables and qualitative variables such as treatment group, the latter can be incorporated through the use of indicator variables. The principal problem dealt with in this section is that of modeling the relationship between the failure time and explanatory variables.
The exponential distribution can be generalized to obtain a regression model by allowing the failure rate of to be function of . In regression models it is common practice that the dependent variable depends on the explanatory variables only through a linear function
where are unknown parameters. For the exponential distribution we have a constant hazard function . In a regression model for survival analysis one can try to model the dependence on the explanatory by taking the (new) hazard rate to be
The (new) hazard rate is taken to be a constant () times some function of the linear function . Hazard rates being positive it is natural to choose the function such that is positive irrespective the values of . For this reason one often takes , the hazard rate in a regression model is then modeled by
In survival analysis accelerated failure time models are obtained by modeling the log failure time instead of the failure time itself. Let us explain what the assumption (21) about the hazard function means if we study the log failure time . We shall use the following fact of probability theory: if has the exponential distribution with parameter then we can write where is a random variable having an exponential distribution with parameter . Note that the survival function of is equal to
which is equal to (7), the survival function of an exponential distribution with parameter , hence indeed and are identical with respect to their distribution. Consequences of (21) are:
with . For log failure time we almost get a traditional regression model. Note that the term is the intercept (constant) of the regression model, this term can be estimated whereas both and cannot be estimated. The disturbance does not have a normal distribution, instead one can say that has the exponential distribution with parameter . From (24) one can conclude that the effects of covariates (explanatory variables) act additively on . Remind we started with (21): the covariates act multiplicatively on the hazard rate.
Let us now consider the Weibull distribution, hence a hazard function given by (8). For the Weibull distribution the analogue of (21) is:
as a matter of fact the baseline hazard rate is replaced by
We again study the log failure time . Using probability theory we can state the following: if has a Weibull distribution with parameters and then we can write where has the exponential distribution with parameter . For proving this we show that the survival function of equals (9), the survival function of the Weibull distribution:
which equals expression (9). From (26) and one can obtain:
with . This regression equation is a generalization of (25), as expected. Again the effects of the covariates act additively on the log failure time.
Section 5: The Proportional Hazards Model
The most famous proportional hazard model is the proportional hazards model of Cox. In the proportional hazards model of Cox independent failure times are studied, which distribution is described by a hazard function given by
where is an arbitrary unspecified base-line hazard function which specifies a continuous distribution for a failure rate. Special cases are (exponential distribution) and (Weibull distribution) but one of the most important features of the model of Cox is that no parametric model is made for the base-line hazard function .
An empirical check for the proportional hazards model is provided by plots of the estimate of versus . In case of the Weibull distribution () the graphs of the estimates versus should show parallel straight lines for separate subgroups. In general these graphs should show the same curvature for different subgroups (fixed vertical distances).
SPSS (like other programs) provides estimates and standard errors for e.g. the parameters in the proportional hazards model of Cox. For application of ‘Cox regression’ one has to choose in SPSS first Analyse, then Survival and finally Cox Regression.
For demonstrating testing theory we apply the proportional hazards model to the data of section 3. In SPSS we have to fill the data matrix in the following way. One column contains the 49 survival times. A second column indicates whether the survival times are censored () or not , hence this column contains only numbers 0 and 1. A third column is indicating to which group each survival time belongs, we used the number 0 for control and the number 1 for the treatment group, so this column contains only numbers 0 and 1 as well. We gave the columns (variables) names: survival, censor and treatment. For producing the relevant output one has to choose survival for time, censor for status (define event by single value 1) and treatment as covariate. Use furthermore ‘method: enter’.
We now shall investigate whether the groups really differ with respect to survival time. We apply a proportional hazards model with covariate (explanatory variable) treatment. Doing so, we assume that the hazard function for an individual is given by
with the being the values of the variable treatment, . For investigating whether there is really a difference between the two groups or whether there is really a treatment effect, we test the null hypothesis against the alternative hypothesis . One has to take as testing statistic with being the estimate of and being the corresponding standard error. The distribution of the testing statistic is approximated by the standard normal distribution under the null hypothesis. The null hypothesis is rejected if or . Using SPSS a part of the output is the following:
Because the factor can be absorbed by the baseline hazard function , no estimate for is given. From the output we see and , hence the outcome of is . Taking significance level 5% we reject if or , equivalently if . In stead of SPSS presents a Wald Statistic being equal to . This Wald statistic has a chi-square distribution with one degrees of freedom under the null hypothesis: significance level 5% means that the null hypothesis is rejected if the Wald Statistic , this renders a equivalent test. For the data of section 3 we don’t have to reject the null hypothesis, no treatment effect can be proved (at significance level 5%).
For the assignment of this part of the course Medical Statistics and Epidemiology a data set has to be studied, we call this data set the breastfeeding data. The data is contained by the SPSS system file breastfeeding.sav . The breastfeeding data concerns data on 925 first-born children whose mothers chose for breast feeding. The following variables are recorded:
duration of breast feeding (weeks)
1 for completed breast feeding, 0 for censored (still breast feeding)
race of mother (white, black, other)
mother in poverty (yes, no)
mother smoked at birth of child (yes, no)
mother used alcohol at birth of child (yes, no)
age of mother at birth of child
year of birth of child
years of school (education level)
Prenatal care after 3rd month (yes, no)
For a first preparation for the assignment we study how Duration depends on the covariate Birth. Inspection of the data set reveals that the outcome of the covariate Birth ranges from 78 (representing 1978) to 86. For this type of covariate it is not useful to assume a hazard rate of the form (31) with the being now the values of the covariate Birth. The covariate Birth is rather a categorical variable. Perhaps the duration of breastfeeding is different for babies from different years of birth. Differences from year to year may be modeled by assigning effects to the levels (outcomes) of the covariate Birth. This can be done by introducing indicators variables. In case of the breastfeeding data we introduce indicator variables for the covariate Birth defined as follows:
if the outcome of birth is 78 and elsewhere, if the outcome of birth is 79 and elsewhere,…, if the outcome of birth is 85 and elsewhere. Using these indicator variables the hazard rate, and thus our model, becomes
where are the values of the indicator variables of individual . According to formula the ratio of the hazards rates of the years of birth 78 and 86 is equal to . In the same way the ratio of the hazards rates of the years of birth 79 and 86 is equal to . So for the respective outcomes of the covariate we now have (possibly) different (multiplicative) effects and here the year of birth 86 serves as reference value. Instead of the last value the first value may be chosen as well as reference value (category) in SPSS. Using SPSS the following output is obtained.
In the first column of the table the parameters are indicated by respectively birth(1), birth(2),…,birth(8). Testing the null hypothesis against alternative hypothesis for each indicator variable no null hypothesis needs to be rejected at significance level 5% except one (verify this). It is however not useful to test for each indicator variable of the covariate Birth.
Instead one can test whether all effects of the covariate Birth are zero. Let us test the null hypothesis against the alternative hypothesis . SPSS presents (the outcome of) a Wald testing statistic, under the null hypothesis its distribution is the chi square distribution with 8 degrees of freedom (df) and one has to reject the null hypothesis for large values of the testing statistics. The degrees of freedom depends on the number of indicator variables, may be different for different data sets. In our testing problem we have to reject if Wald (chi square with df, level of significance 5%). Since the outcome of Wald is 18.411 we here reject the null hypothesis. We conclude that the (distribution of) Duration depends on the covariate Birth.
We don’t give a explicit formula for the Wald statistic with (here) 8 degrees of freedom. We just indicate how to work with it.
Section 7: Assignment
Before you start: consult the text of the file About SPSS.doc. Use SPSS when you do parts A en B of this assignment.
You don’t have to write a report for this assignment. Just take your computer output and your own notes with you for an (oral) discussion with the teacher about the answers of parts A and B. For making an appointment for the assignment: send an e-mail (firstname.lastname@example.org) or ring (053)4893379 .
Select a number of subgroups and study plots of the Kaplan Meier estimate of the Survival function in order to answer the following two questions. Instead of plots of the Survival function you may use plots of the hazard function or a function of the Survival function.
(1) Is the distribution of the duration of breastfeeding modeled well by means of a Weibull distribution or an exponential distribution for the subgroups chosen?
(2) Does a proportional hazards model fit the data?
Now assume that the proportional hazards model of Cox holds for the breastfeeding data. Investigate within this model on which covariates the (dependent) variable Duration really depends. For the covariates Race, Age and School (and Birth) decide whether you take the covariate as categorical covariate (if so: go to Categorical … , etc. within the Cox Regression menu). In case of categorical covariates you don’t need to worry about the indicator variables described in section 6: SPSS introduces these indicator variables automatically.
Try to explain the dependent variable Duration as good as possible but refrain from including covariates (predictor variables) that seem to be superfluous.
Follow a clear strategy in order to select the covariates. Use statistical tests.