Statistics 798L: Lifetime & Survival Analysis    
Spring term 2008,   MWF 10,   Mth 0403

Instructor: Professor Eric Slud,  Statistics Program, Math Dept.
                                       Rm 2314, x5-5469,  evs@math.umd.edu

Office hours: M 1-2, W 11-12, Th 1-2 (initially), or by appointment.

SAMPLE  PROBLEMS  FOR  IN-CLASS  TEST Old and New.
NOTE changed date for test, which is now Friday April 11.

For Syllabus click here, for lecture Handouts click here, and
      here for Statistical Computing handouts.
For Homework, click here.

The topic of the course is the statistical analysis of data on lifetimes or durations.
Such data often have the feature of being right-censored, where subjects may
leave the study at random times and those who have not died at the ending time of
the study are simply recorded as being still alive, or truncated, where subjects enter
the study only if they meet some criterion which may involve an age-variable or
time since diagnosis or other preliminary event.  Such data arise frequently in
clinical trials, epidemiologic studies, reliability tests, and insurance. We first
present parameterizations of survival distributions, in terms of hazard intensities,
which lend themselves to the formulation of parametric models, including
regression-type models which relate failure-time distributions to auxiliary
biomedical predictors. The special features of truncation or censoring present
unique challenges in the formulation of likelihoods and efficient estimation and
testing in settings where the distributions of arrival-times and withdrawal-times
are unknown and not parametrically modelled. This statistical topic has achieved
great prominence in the theoretical statistical literature because it is a particularly
good arena for the introduction of techniques of estimating and testing
finite-dimensional parameter values --- such as a treatment-effectiveness parameter
in clinical studies --- in the presence of infinite-dimensional unknown
parameters. Such problems are referred to as Semiparametric.

Prerequisites: The presentation will be geared to second-year Stat grad students.
Minimum background is Stat 410 and Stat 700.

Required Text:

Klein, J. and Moeschberger, M. (2003) Survival Analysis: Techniques for
Censored and Truncated Data, 2nd ed. Springer-Verlag ISBN: 038795399X

Data:

Datasets contained in Appendix A of the Kalbfleisch & Prentice book, except
for Dataset V, can be downloaded in Excel format from the public ftp site,
linked here .
    Two of the datasets (Datasets I and V) are available in ASCII format,
as rectangular tables, here.

Recommended Text(s):

The book which we used the last time this course was taught will serve as a useful reference,
but the explanations given there are harder, less straightforward and often less intuitive.

Kalbfleisch, J. and Prentice, R. (2002) The Statistical Analysis of Failure Time Data,
2nd ed. John Wiley ISBN: 0-471-36357-X

Another very useful recommended text (a 1980 book reissued as a paperback) is

R. Miller, Jr. (1980) Survival Analysis. Wiley-Interscience 1998, ISBN: 0471255483

and, for the more mathematically inclined, a primarily theoretical text by two
former Maryland students:

Fleming, T. and Harrington, D. (1991) Counting Processes and Survival Analysis.
ISBN: 047152218X

Coverage of the Klein & Moeschberger book will be Chapters 1-9, plus a few
miscellaneous topics. The main topics are:

  • Survival distributions; regression-type models for survival in terms of predictor
           variables, including the famous Cox model and random-effect or `frailty'
           model extensions;
  • Formulation of Likelihoods for censored and truncated data;
  • Parameter estimation and hypothesis testing in parametric and semiparametric
           settings, including the Kaplan-Meier survival function estimator and
           Nelson-Aalen cumulative-hazard function estimator;
  • Goodness of fit diagnostics and testing for estimated models;
  • EM algorithm and missing-data approaches to censored data; and
  • Methods for estimating survival distributions using smoothing and density-estimation
           techniques.

    Klein & Moeschberger is a very methods-oriented book, and will be covered along
    with R software implementation with real-data examples. The Miller book explains things
    well and gives good background and literature references. For additional mathematical
    justifications, including the connection with counting processes and martingales, I will draw
    additional material from Fleming and Harrington and the research literature. Other data
    examples, and more sophisticated data analyses, can be found in the Kalbfleisch and
    Prentice book.

    Grading:  The course grade will be based 50% on 7 homework problem sets,
    25% on an in-class test, and 25% on a course project or paper at the end. The homework
    problems will be a mixture of theoretical problems at Stat 410/Stat 700 level, and of
    computational or data-analysis problems. The in-class test will be designed to test
    (i) definitions (of models and distributions and statistics),
    (ii) ability to use model definitions to construct likelihoods (and partial likelihoods) and
    derive statistics from them, and
    (iii) basic properties of estimators and test-statistics studied in class.
    The course project will be either a paper on a topic not fully covered in class, with
    illustrative data analysis, or an extended and coherent data analysis and writeup (of about
    10 pages, not including computer output).
    Note: homework problem assignments will be due approximately every 2 weeks.
    The problem sets and due dates will be posted to this web-page and announced in
    class. The problems will be due on the dates announced and will be graded down for
    lateness unless you have a VERY good excuse.

    Computing in the course can be done with Splus, R, SAS, or any other package you
    are familiar with which also has preprogrammed Survival Analysis modules. However,
    Splus and R are the best choices if you want guidance and/or help from me. R is also
    the best choice in accommodating the newest methods from the research literature. Various
    datasets can be explored and accessed within existing R packages and libraries, e.g. by
    issuing the commands
           > library(survival)
           > data()
    Whatever package you choose, you can get computing help, datasets, and further links at
    StatLib.  All of the datasets for the Klein and Moeschberger book, including its exercises,
    can be found here . As mentioned above, all of the datasets in (the Appendix of) the
    Kalbfleisch and Prentice text are freely available for download, and Datasets I and V can
    be found in ASCII format here . As an indication of how the datasets were imported using
    R into the format given, see the following linked script.

    See Handouts section below for link to "Rbasics" file connected with the data analysis tasks
    needed for Homeworks.
    For the systematic Introduction to R and R reference manual
    distributed with the R software, either download from the R website or simply invoke
    the command

    > help.start()

    from within R. For a slightly less extensive introductory tutorial in R, click here .


    Lecture Note and Slides Handouts:

  • Slides on Survival Data Structure and Hazard Functions
  • Slides on Parametric Censored-Data Likelihoods and Weibull MLE's
  • Handout on Nelson-Aalen Estimator as a limit of MLE's of piecewise
    constant hazards when the positive time-line is partitioned into smaller
    and smaller intervals.
  • Interpretation of Kaplan-Meier as Nonparametric MLE .
  • One page handout on Partial Likelihood via Marginal Likelihood in the Cox Model.

  • You can look at some sample problems for the Fall 05 in-class test here .
    This year's in-class test will have at most 3 problems and will be oriented as closely to
    definitions and basic equivalence of models as I can make it. Some new sample problems
    can be found here. If you would like to bring a sheet of formulas, e.g. stochastic-integral
    and variance formulas related to our score and Partial Likelihood ML statistics, that
    is OK, but the test will otherwise be closed-book.

  • You can view here the slides in a Survival Analysis short course I gave in
    Spring 2005. See especially pages 34-36 and 47-50 and 58-72 for ideas relevant to
    the data-analyses in your final Projects, and click here for the R scripts I used in
    analysing the data for that short course.

  • Statistical Computing Handouts:

    The following material and handouts were produced in previous terms
    when the course was given using Splus as the computing platform. They
    are mostly still usable, since R and Splus are syntactically nearly the same and
    share many older functions, but these handouts will be updated and
    converted to R as the term progresses.


    NOTE: to get started using survival-related functions in R, you need to "load" the
    R survival package, which is accomplished by the command:

    > library(survival)

    Handouts can be found at linked pages for each of the following topics:

    (0) Basics on R commands for data entry and life table estimates

    (1) Nelson-Aalen & Kaplan-Meier calculation         (converted to R)

    (2) Illustrative R Script for Survival Curves, Hazards, Medians, and SE's.

    (3) Nelson-Aalen calculation for left-truncated right-censored data.     (converted)

    (4) Script and Illustrative Picture on model fitting of VA Lung-Cancer data in R.
    This Script and picture also contain material about fitting and plotting the Cox Model
    for the same dataset and comparing the results to the previous accelerated failure time
    parametric regression model.

     (5) R calculations for weighted logrank (2-sample) test statistics.

     (6) Splus log on Stratified Analyses (Survival Curves & Weighted
            Logrank & Cox Models).

     (7) New illustration of Stratified versus interaction-term tests of difference between
    coefficients in subgroups of a survival dataset. This R script and picture show, in the
    example of a Mayo lung-cancer study, that there is a real difference between the
    coefficients of a baseline health index ("Karnofsky score") for the two sexes in the study,
    but that this difference is obscured if an assumption of common baseline hazard for
    both sexes is made.

    (8) Log of Splus analyses related to Cox model and comparison
    with exponential regression model, along with some pictures.

    (9) Handout containing R Log on Self-Consistency Property of Kaplan-Meier Estimator
    and Redistribute-to-the-Right Algorithm and Coding for Turnbull (1974) self-consistent estimator
    of survival-distribution in double-censored survival data.

    (10) Splus algorithm of Turnbull for interval-censored data.

    (11) Splus computations for kernel-smoothed Nelson-Aalen estimator.

    (12) R Script for Time-dependent Cox-Model fitting, illustrated with data analysis
           of Mayo-Clinic Lung Cancer Data.

    (13) R script for calculating Partial Likelihoods in (non-time-dependent) Cox-model.
           This includes calculations with risk-groups. The script will later be augmented
           to include the calculation of score statistics for individual coefficients.
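
    As a rough illustration of the partial-likelihood calculation with risk sets described
    in the last handout, here is a minimal sketch in plain Python (not the course R script).
    It assumes a single covariate, no tied event times, and a small made-up dataset; the
    function name and the data are hypothetical.

```python
import math

def cox_log_partial_likelihood(beta, times, events, z):
    """Log partial likelihood for one covariate z, assuming no tied event times.

    Each observed failure contributes beta*z_i minus the log of the sum of
    exp(beta*z_j) over the risk set {j : times[j] >= times[i]}.
    """
    total = 0.0
    for i, (t_i, d_i) in enumerate(zip(times, events)):
        if d_i == 1:  # only observed failures contribute a factor
            risk_sum = sum(math.exp(beta * z[j])
                           for j in range(len(times)) if times[j] >= t_i)
            total += beta * z[i] - math.log(risk_sum)
    return total

# Made-up data: follow-up time, event indicator (1 = failure, 0 = censored), covariate.
times = [1.0, 2.0, 3.0, 4.0, 5.0]
events = [1, 0, 1, 1, 0]
z = [1, 0, 1, 0, 0]

for beta in (0.0, 0.5):
    print(beta, cox_log_partial_likelihood(beta, times, events, z))
```

    At beta = 0 every subject in a risk set has equal weight, so the value reduces to minus
    the sum of the logs of the risk-set sizes at the failure times, which is a convenient
    sanity check.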


    Syllabus.

    Chapter 1. Introduction: Terminology, data structures & examples.    1 class,   1/28

  • Definition of terminology: event-times, censoring (left and right), biomedical covariates,
    life tables. Inferential problems and objectives.

    Chapter 2. Failure Time models.           4   classes,   1/30 - 2/6

  • Survival and hazard functions. Parametric distributional models, continuous and discrete
    regression models. Latent failure time model. Competing risks.
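
    The link between hazard and survival functions in this chapter, S(t) = exp(-H(t)) with
    H the integrated hazard, can be checked numerically. The sketch below is plain Python
    (not one of the course R handouts) and uses a Weibull model with hypothetical
    parameter values chosen only for illustration.

```python
import math

def weibull_hazard(t, shape, scale):
    """Weibull hazard h(t) = (shape/scale) * (t/scale)**(shape - 1), t > 0."""
    return (shape / scale) * (t / scale) ** (shape - 1)

def weibull_survival(t, shape, scale):
    """Weibull survival function S(t) = exp(-(t/scale)**shape)."""
    return math.exp(-((t / scale) ** shape))

def cumulative_hazard_numeric(t, shape, scale, steps=100_000):
    """Midpoint-rule approximation to H(t), the integral of h(u) over (0, t]."""
    width = t / steps
    return sum(weibull_hazard((i + 0.5) * width, shape, scale) * width
               for i in range(steps))

shape, scale, t = 1.5, 2.0, 3.0   # hypothetical values
H = cumulative_hazard_numeric(t, shape, scale)

# The two printed numbers should agree up to numerical integration error.
print(H, -math.log(weibull_survival(t, shape, scale)))
```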

    Chapter 3. Censored-Data Parametric Inference & Likelihoods.    3   classes,    2/8 - 2/13

  • Parametric likelihoods and parametric inference; truncation and interval censoring,
    large-sample MLE theory.
  • Counting processes and cumulative hazards. Statistics as integrals with respect to
    compensated counting processes.

    Chapter 4. Nonparametric survival-curve estimation.    3   classes,    2/15 - 2/20

  • Cumulative-hazard (Nelson-Aalen) estimators for right-censored left-truncated data.
  • Kaplan-Meier survival function estimators. Confidence bands.
  • Quantile (median) estimates and confidence intervals.
  • More on competing risks and attempt to restore semiparametric identifiability in Competing Risk settings.
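
    For orientation, the Nelson-Aalen and Kaplan-Meier estimators listed above can be
    computed by hand for right-censored data without truncation. The following is a
    minimal plain-Python sketch on made-up data, not a replacement for the R survival
    package used in the handouts; the function name is hypothetical.

```python
def km_and_na(times, events):
    """Kaplan-Meier and Nelson-Aalen estimates for right-censored data.

    times  : observed follow-up times
    events : 1 if the time is a failure, 0 if it is a censoring time
    Returns a list of (t, at_risk, deaths, km_survival, na_cumhaz),
    one row per distinct failure time.
    """
    data = sorted(zip(times, events))
    n = len(data)
    surv, cumhaz, rows = 1.0, 0.0, []
    i = 0
    while i < n:
        t = data[i][0]
        at_risk = n - i            # subjects with observed time >= t
        deaths = 0
        j = i
        while j < n and data[j][0] == t:
            deaths += data[j][1]
            j += 1
        if deaths > 0:
            surv *= 1.0 - deaths / at_risk      # Kaplan-Meier product factor
            cumhaz += deaths / at_risk          # Nelson-Aalen increment
            rows.append((t, at_risk, deaths, surv, cumhaz))
        i = j
    return rows

# Made-up example: 8 subjects, 3 of them censored.
times = [2, 3, 3, 5, 7, 8, 8, 10]
events = [1, 1, 0, 1, 0, 1, 1, 0]
rows = km_and_na(times, events)
for row in rows:
    print(row)
```

    Subjects censored at a time t are counted as still at risk at t, which is the usual
    convention for tied failure and censoring times.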

    Chapter 5. Estimates for other censoring schemes.    2   classes,    2/22 - 2/25

  • Left, double, and interval censoring. Self consistency property of Kaplan-Meier
    (right-censored case) and extension to estimation algorithms with other kinds of censoring.

    Chapter 6. Other estimation techniques.            3   classes,    2/27 - 3/3

  • Kernel-based estimation of the hazard intensity. Application to excess mortality.
  • Bayesian nonparametric methods.
  • (*) Methods based on (multiple) imputation of censored lifetimes.

    Chapter 7. Rank statistics for 1- and 2-sample Tests.     5 classes,     3/5 - 3/14

  • Tests based on scores. Relation to contingency table ideas. Stratified tests. Tests for trend.
  • Logrank and weighted-logrank tests. Sample size and power. Relation to simple
    survival regression models.

    Chapter 8. Relative Risk Regression Models     5 classes,     3/24 - 4/2

  • Estimation via maximized Partial Likelihood: regression coefficients and estimation of
    baseline survival function.
  • Related likelihoods (marginal and rank). Large sample theory of estimators.
  • Associated hypothesis tests. Wald, score and LR test analogues.

    Chapter 9. Stratified & Time-Dependent Covariate Cox models.    3 classes,    4/9 - 4/14

  • Time dependent covariate version of proportional hazard model. Application to
    model checking. Tests of fit related to residuals (material taken also from Chapter 11.)
  • R functions for stratified time-dependent Cox regression.

    Chapter 10. Extended Survival Regression Models.             As time permits

  • Material taken from Chapters 10 and 12 on Additive Hazards
    regression models, and frailty models.
  • Material taken from Kalbfleisch-Prentice and other sources on
    Accelerated failure and Proportional Odds models.
  • Introduction of models with time-varying mechanism from journal literature.




    Homework Problem Sets:          For Solutions, click here.

    Problem Set 1, Due Wednesday February 6, 2008.
    Do # 2.3, 2.9 (the times to substitute are 12, 24, and 60 months), 2.10, 2.16, and 2.20.
    Also to be handed in: using the data in Table 1.2 of the book, create a life table, with rows
    corresponding to ordered increasing infection times within each of the two ("Surgically
    Placed Catheter" and "Percutaneous Placed Catheter") groups, showing the number of
    "failures" (=infections) occurring at that time, and the number at risk (i.e., individuals within
    the group who are neither infected nor censored before that time).
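
    The layout of such a life table can be sketched as follows. This is a plain-Python
    illustration on made-up records with hypothetical group labels "A" and "B"; it does
    not use the Table 1.2 data, and the function name is an assumption.

```python
from collections import defaultdict

def life_table(records):
    """Build a per-group life table from (group, time, infected) records.

    infected is 1 for a failure (infection) and 0 for a censored time.
    Returns {group: [(time, n_at_risk, n_failures), ...]} with one row per
    distinct failure time, in increasing order.
    """
    by_group = defaultdict(list)
    for group, time, infected in records:
        by_group[group].append((time, infected))

    tables = {}
    for group, obs in by_group.items():
        failure_times = sorted({t for t, d in obs if d == 1})
        rows = []
        for t in failure_times:
            # At risk: neither infected nor censored before time t.
            at_risk = sum(1 for u, _ in obs if u >= t)
            failures = sum(1 for u, d in obs if u == t and d == 1)
            rows.append((t, at_risk, failures))
        tables[group] = rows
    return tables

# Made-up records, NOT the data of Table 1.2.
records = [
    ("A", 1.5, 1), ("A", 2.5, 0), ("A", 3.5, 1), ("A", 3.5, 1), ("A", 4.5, 0),
    ("B", 0.5, 1), ("B", 0.5, 0), ("B", 2.5, 1), ("B", 6.5, 0),
]
tables = life_table(records)
for group in sorted(tables):
    print(group, tables[group])
```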

    Problem Set 2, Due Wednesday February 27, 2008. Do #3.6, 3.8, 4.1(a)-(f), 4.2(a)-(c) and (e)-(g).
    In addition ( 5th problem to hand in ): read Theoretical Note 1 on pp. 56-57 and
    show as much as you can of the following statement given there:
    if, in a bivariate setting with dependent (T,C) having a joint density, the function ρ(t)
    defined on p.56 is known along with the sub-distribution function F_1(t) and event-time
    survival function S_T(t), then the marginal survival function S_X(t) is uniquely determined,
    and this survival function depends in a monotonically decreasing way on ρ(t).

    Problem Set 3, Due Friday March 14, 2008.
          (I) Klein & Moeschberger problems:    #4.5, 4.7, 4.9.
          (II) Kalbfleisch & Prentice Problem (#3.11): Use the famous Freireich et al. (1963) data
    which can be found as "gehan" within the MASS library,
             (a) to test the hypothesis of equality of remission times in the two groups, using Weibull,
    log-normal, and log-logistic models, and to decide which model fits the data best, and
             (b) to test for adequacy of an exponential model relative to a Weibull model.
    In the dataset, you should ignore the "pair" information. The last column is "treat" , a factor
    (categorical) variable.

          (III) Consider the setting where you have right-censored survival data on a large number n
    of iid patients, where the underlying and censoring distributions are both Exponential, with
    respective parameters λ and ρ . Find simplified formulas for the asymptotic variances (proportional
    to 1/n) for the estimated marginal survival function S(t) at time t=1 based on the parametric
    Exponential estimator of λ and also based on the Kaplan-Meier estimator, and compare the
    formulas. (How much larger is the KM variance ?)

    Problem Set 4, Due Friday April 4, 2008.
             (I) Problems 7.1, 7.3, 7.9.   See the Statistical Computing script (5) above on
    weighted logrank statistics.

             (II) Problem 8.1: method is illustrated in CoxMod.txt Log-page, including the part about
    fit of exponential model.
             (III) (a) Approximately how large a sample would you need to achieve power 0.90 against
    the alternative with hazard ratio 1.5 using a logrank test, if the sample of size n were randomly
    allocated with a fair coin-toss to control or treatment group, and control-group survival is
    Expon(2) while censoring is approximately Expon(1) in the control group and Expon(1.25) in
    the treatment group ? (b) Do the same sample-size calculation if the hypothesis test to use is the
    Gehan-modified Wilcoxon. (c) Do the same sample size calculation if the hypothesis test to be
    used is the Peto-Prentice Wilcoxon (G^ρ test with ρ=1).

    Problem Set 5, Due Wednesday April 30, 2008.
             (I) Consider the setting discussed in class, of a large-sample dataset consisting of 2n
    observations min(X_i, C_i),   where all   X_i ~ Expon(λ)   and C_i = infinity   for i=1,...,n   and
    C_i = t_max   for i=n+1,...,2n. Show explicitly that the right-censored MLE for   λ   based on
    these data is consistent for λ. Also find the large-sample limit for the MLE of   λ   under an
    Expon(λ) model for only those data-values X_i which are `observed' in the sense that they are
    strictly less than their corresponding C_i censoring-times. (This includes all of the first n
    observations, but only approximately   n (1-exp(-t_max λ))   of the observations
    i=n+1,...,2n.) As discussed in class, this second analysis, based only on the "complete cases"
    which are uncensored, is misspecified and should not be expected to be consistent for λ.
             (II) From Chapter 8 of the course text, do problems 8.4, 8.5, 8.8.
             (III) Using the data and results of problem 8.8, find and plot estimators for:   (a) the baseline
    cumulative hazard function Λ_0(t),   and (b) the population summary survival functions for the
    ALL, AML Low-Risk, and AML Hi-Risk groups.
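
    Before attempting the derivation in part (I), it may help to see the two estimators
    side by side in a simulation. The sketch below is plain Python with hypothetical values
    of λ, t_max and n; it is a numerical illustration only and does not replace the explicit
    argument asked for in the problem.

```python
import math
import random

def simulate(lam, t_max, n, seed=0):
    """Simulate the 2n-observation design of part (I) and return two estimates.

    The first n subjects are never censored (C_i = infinity); the last n
    are censored at t_max. Returns (censored-data MLE, complete-case MLE).
    """
    rng = random.Random(seed)
    cens = [math.inf] * n + [t_max] * n
    x = [rng.expovariate(lam) for _ in range(2 * n)]
    observed = [min(xi, ci) for xi, ci in zip(x, cens)]
    died = [xi < ci for xi, ci in zip(x, cens)]

    # Right-censored exponential MLE: number of failures / total time at risk.
    mle_censored = sum(died) / sum(observed)

    # Complete-case analysis: keep only uncensored values and treat them
    # as an ordinary Expon(lam) sample.
    complete = [t for t, d in zip(observed, died) if d]
    mle_complete = len(complete) / sum(complete)
    return mle_censored, mle_complete

lam, t_max, n = 1.0, 1.0, 20_000   # hypothetical values
mle_censored, mle_complete = simulate(lam, t_max, n)
print("censored-data MLE:", mle_censored)
print("complete-case MLE:", mle_complete)
```

    With these settings the censored-data estimate should land close to λ, while the
    complete-case estimate should sit visibly above it, since the discarded observations
    are exactly the long lifetimes in the second half of the sample.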

    Problem Set 6, Due Monday May 12, 2008.
    Klein & Moeschberger # 5.2, 6.3, 6.6 ((a) and (c) only), and 9.3.


    Final Project

    I have created a data-file Lymphom.dat (zipped) which you can use in your project.
    It is very large, with 31689 records of 13 columns each, subsetted and re-coded from
    the National Cancer Institute's SEER database of Lymphoma cancer cases from
    1973-2001. You may certainly subset it (much) further in any analyses you do and write
    up. Details concerning the records retained, the variables chosen, their meanings and the
    way I re-coded them, can be found here .

    Guidelines for the Final Project. As will be discussed in class, the culminating work for
    the course, beyond HW and the in-class Test, is a take-home course project which is to consist
    of a 10-page paper based on an original data analysis using the ideas covered in the course,
    to be handed in before 5pm , Monday December 19, 2005. You may find data anywhere, but the
    StatLib web-site would be a particularly good place to start. My suggestions were to find a
    survival dataset with enough structure (eg, regression variables, clear hypothesis of interest like
    treatment effectiveness in a two-group clinical trial) and sufficient sample-size so that it would
    make sense to try a few different survival analyses and compare the results. You will be graded
    on appropriateness and interest of the analyses and especially on the clarity and
    reasonableness of the conclusions (and/or comparisons among conclusions from different
    methods) which you reach. Your 10 pages (beyond data and plots) should explain clearly
    the models and assumptions and conclusions in a readable narrative. You may hand in
    (but preferably give URL for) data, intermediate statistical results, and summary displays
    such as plots and/or histograms, but I do not want to be given any undigested outputs.
    That is, any such computed outputs should be presented as exhibits, with specific references
    to such material and suitable interpretations given in the text of your paper.

    Specific URL's at which to look for data are:
    http://lib.stat.cmu.edu/datasets/, http://lib.stat.cmu.edu/disease/
    http://lib.stat.cmu.edu/jasadata/, http://lib.stat.cmu.edu/DASL/
    Also, Splus and R supply several good survival datasets.

    If you want to do anything other than a data analysis and narrative for your paper (eg,
    simulation study or exploration of theoretical and illustrative material on additional methods
    not covered in the course), such an alternative  may  be permissible, but you must see me
    about it to get it approved first !!


    Important Dates:


    Other links:

    (A) I gave three `Mini-Course' talks on Survival Analysis a couple of years ago which are
    very relevant to the material of this course. The slides are available in pdf format.  They are
    respectively about Competing Risks, Martingales & Populations, and Semiparametric Models.

    (B) Various useful files on Statistical Computing, in particular using Splus and R but also
    with some material on SAS, can be found at the Spring '04 course web-page for
    Stat 798C, now renumbered as Stat 705, along with additional relevant links.

    (C) StatLib.   Useful repository of downloadable software and datasets, and much more.

    (D) R   General source for freely downloadable R packages and related manuals and datasets.




    © Eric V Slud,  May 19, 2008.