Statistics 705   COMPUTATIONAL STATISTICS

Fall 2009                                                        MWF 11, MATH Building 1308

Instructor: Eric Slud, Statistics Program, Math. Dept.,   evs@math.umd.edu

Office:  MTH 2314, x5-5469

Office hours:  tentatively, M 1, W 4, and Th 11. But you can make an appointment for
     office-hour help at other times by emailing me.

Course Text:  Venables, W. N. and Ripley, B. D.  Modern Applied Statistics
      with S-PLUS (4th ed, 2002.).  New York: Springer-Verlag.

Recommended:  A. Gelman & J. Hill, Data Analysis using Regression and Multilevel/
      Hierarchical Models, Cambridge Univ. Press.

Additional:

   R. A. Becker, J. M. Chambers, and A. R. Wilks (1988). The New S Language.
         Pacific Grove, CA: Wadsworth & Brooks/Cole.

   J.M. Chambers and T.J. Hastie (1993).  Statistical Models in S.
         London: Chapman & Hall.


For information and Directories on the following topics, click these links:

      Homework information     ,       HW Directory  
      Data source info     ,       Data Directory  
      Lecture Notes descriptions     ,       Lecture Notes Directory  
      Rlog and Scripts descriptions     ,       Rlog and Scripts Directory  


Overview: Statistical research and application have changed dramatically because of cheap and
powerful computational and graphical tools.  This course presents modern methods of computational
statistics and their application to both practical  problems and research.  The techniques covered in
STAT 705, which include some numerical-analysis ideas arising particularly in Statistics,  should
be part of every statistician's toolbox.

Statistical methodology will be presented informally, with emphasis on the intuitive basis
for the techniques and brief discussion of their theoretical pedigree. Implementation of
each method will be given in R, and each method will be illustrated by application to data,
often from real datasets but sometimes simulated.

Prerequisite: STAT 420 or STAT 700, and some programming experience (any language).

Course requirements and Grading: Grading will be based completely on graded DAILY
assignments involving data analysis and statistical computation (a total of about 30 of them).
The homework tasks will be of moderate length and difficulty assigned in each class session,
usually to be handed in 2 classes after the one in which the assignment is given.


HONOR CODE

The University of Maryland, College Park has a nationally recognized Code of Academic Integrity, administered by the Student Honor Council. This Code sets standards for academic integrity at Maryland for all undergraduate and graduate students. As a student you are responsible for upholding these standards for this course. It is very important for you to be aware of the consequences of cheating, fabrication, facilitation, and plagiarism. For more information on the Code of Academic Integrity or the Student Honor Council, please visit http://www.shc.umd.edu.

To further exhibit your commitment to academic integrity, remember to sign the Honor Pledge on all examinations and assignments:
"I pledge on my honor that I have not given or received any unauthorized assistance on this examination (assignment)."



OUTLINE of Course TOPICS

   1. Introduction to R:

Starting and quitting R, on-line help, R operators and functions, creating
R objects, data types (vectors, matrices, factors, functions, lists), managing
data (combining  objects, subsetting, creation of frames), R graphics.

   2. Monte Carlo and Simulation in R:

Basic random number generation, applications of LLN and CLT  in simulations,
numerical integration, importance sampling, empirical distributions, Markov Chain
Monte Carlo. Managing loops in R.

   3. Numerical Optimization in Statistics:

Objective functions in statistics, and managing functions in  R. Linear and nonlinear
least squares, special considerations in maximizing likelihoods, penalized likelihood,
steepest descent, quasi-Newton-Raphson methods, constrained maximization, EM
algorithm. Diagnostics for misspecified models.

   4. Linear and Generalized Linear Models:

Regression summaries, model fitting, prediction, model updating, analysis of residuals,
model criticism, ANOVA, generalized linear  models, specifying link and variance
functions, stepwise model selection, deviance analysis.

Comparisons of implementations in R and SAS. Fitting mixed-effect (generalized)
linear models in R.

   5. Bootstrapping Methodology:

Parametric bootstrap, empirical CDF, bootstrap standard errors and confidence intervals,
estimation of bias, jackknife, application to regression.

   6. Smoothing & Nonparametric Regression:

Spline smoothing, kernel smoothing, selecting tuning parameters by cross-validation.
Graphical aspects of smoothing.

   7. MCMC and the Gibbs Sampler.

Definitions and basic ideas of MCMC and Gibbs-Sampler simulation methodology,
including a brief introduction to `Bayesian Computing' using BUGS through R.

   8. Mixed and Multilevel Models fitted and interpreted via
       Likelihood methods and Bayes (MCMC) methods in R.
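
As a small taste of topic 2 in the outline above, the following sketch compares naive Monte Carlo with importance sampling for a small tail probability. The target P(Z > 4), the N(4,1) proposal, the sample size, and the seed are illustrative choices made here, not taken from the course notes.

```r
set.seed(705)   # arbitrary seed, only so that the run is reproducible
n <- 1e5

# Target: p = P(Z > 4) for Z ~ N(0,1); the exact value is known, for comparison.
p.exact <- pnorm(4, lower.tail = FALSE)

# Naive Monte Carlo: average of indicators; nearly all draws contribute 0.
z <- rnorm(n)
p.naive <- mean(z > 4)

# Importance sampling: draw from the proposal N(4,1) and reweight each draw
# by the density ratio (target density) / (proposal density).
y <- rnorm(n, mean = 4)
w <- dnorm(y) / dnorm(y, mean = 4)
h <- (y > 4) * w
p.is  <- mean(h)
se.is <- sd(h) / sqrt(n)    # Monte Carlo standard error of p.is

c(exact = p.exact, naive = p.naive, importance = p.is, se = se.is)
```

With the same number of draws, the reweighted estimate is usually far closer to the exact value than the naive one, because the proposal places about half of its draws in the region of interest.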



Getting Started with R

Note: The course will concentrate heavily on R, which is a free software package
syntactically almost identical to Splus, which was the software emphasized in the course
up to a couple of years ago. If you are new to R, you should get started as soon as possible,
using it either on the MathNet or WAM machines (where it is already loaded and installed)
or on your home computer by downloading the software following instructions at the
R website. For the systematic Introduction to R and R reference manual distributed
with the R software, either download from the R website or simply invoke the command

> help.start()

from within R. For a slightly less extensive introductory tutorial in R, click here. For a quick
start, see Rbasics handout , and then get started reading about S (or equivalently, R)
syntax in the Venables and Ripley text.
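
For readers who want something to type immediately, here is a minimal first-session sketch touching the object types named in topic 1 of the outline. All object names and values are made up for illustration and do not come from the Rbasics handout.

```r
x   <- c(2.5, 3.1, 4.7, 5.0)             # numeric vector
m   <- matrix(1:6, nrow = 2)             # 2 x 3 matrix, filled column by column
g   <- factor(c("a", "b", "a", "b"))     # factor with levels "a" and "b"
lst <- list(values = x, groups = g)      # list with two named components

dat   <- data.frame(x = x, g = g)        # data frame combining x and g
dat.a <- dat[dat$g == "a", ]             # subset: rows with g equal to "a"
grp.means <- tapply(dat$x, dat$g, mean)  # mean of x within each level of g

str(lst)
grp.means
```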

In the middle of the course, we will mention SAS, primarily in order to
contrast the way in which linear and generalized-linear models are handled in the two
software packages, but this course will not spend any time introducing SAS.



LECTURE NOTES will be available here throughout the semester. You can also
find at this same location two sets of listings of R functions discussed in these Notes.

I suggest downloading and reading the Notes as we arrive at each topic,
since they will be updated and modified during the course.

     The topics of individual pdf-file note-packets are as follows:

      Sec1NotF09.pdf : Overview, Unix & R preliminaries, R language
                      elements, Vector & Array operations, Inputting Data,
                      and Lists. Functions in R, & how and why to vectorize.

      Sec2NotF09.pdf: Introduction to Pseudo-Random Number Generation.

      Sec3NotF09.pdf : Introduction to Graphics in R. Also: Simulation
                     speedup methods (Accept-Reject & Importance sampling).

      Sec4NotF09.pdf : Numerical maximization methods (for likelihoods).

      Sec5NotF09.pdf: Miscellanea: subsetting & parallelizing plus:
                     Introduction to Smoothing Splines (and their use in
                     quick function-inversion in R).

      Sec6NotF09.pdf: EM (Expectation-maximization) Algorithm for ML
                     estimation with missing data.

      Sec7NotF09.pdf: Markov Chain Monte Carlo: introduction and application
                    in an EM estimation problem in random-intercept logistic
                    regression. For additional pdf files of "Mini-Course"
                    Lectures, see MCMC Mini-Course.

      BayesConjug.pdf Conjugate priors for Bayesian inference from data
                    assumed to follow Exponential Family distributions.

The remaining Handouts/Notes date from previous years and relate to
comparisons between Splus (which apply also to R) versus SAS.

      Lec03Pt5.pdf:  SAS Introduction.

      Lec03Pt5B.pdf: Linear Regression in SAS (including some graphics.)

      Lec03Pt5C.pdf: Factors, ANOVA and Regression in SAS vs. Splus.

      Lec03Pt5D.pdf: Simulation in Splus versus SAS.


HANDOUTS  distributed in class are included for reference here .

The topics treated on these handout logs are as follows:

      DensNPR.Log  :  this log is a condensed version, for handouts 4/28 and 4/30
                       in Spring '04, of the DensEst.Log and NonPReg.Log below,
                       illustrating several different density estimation and
                       nonparametric regression and smoothing techniques. In addition,
                       the density estimation part has a small section on (Least-
                       Squares) cross-validated bandwidth selection, and the
                       nonparametric regression component also has some material on
                       comparative evaluation of methods using cross-validation.

      Factor.Log :  class handout on R handling of Factors and contrasts
                     (using the Bass data in an illustrative example) within
                    linear model fitting functions.

      Contr.txt :  handout mentioned in 4/4/08 class on defining contrasts in R
                    for use with Factors in fitting linear models.

      BassSAS.txt :  scripts in Splus for an illustrative regression in SAS
                      on a dataset involving fish (Bass) in polluted lakes.

      StepExmp.Log :  gives a script in R and SAS for stepwise (mostly forward)
                     selection of variables for linear regression within an R
                     dataset called "attitude" rating places to work in terms of
                     ratings in various categories reported on numerical scales.

      GLMlog.R :    is the record of a small R session showing how the dispersion
                     and goodness of fit of glm-fitted model objects can be assessed.

      Rlog1.txt  :  covers an in-class demonstration of random-number
                         generation and simulation, plus a brief section on
                    unix.time  applied to linear-algebra operations.

      Rlog2.txt  :  re-caps an in-class demonstration of acceptance/
                         rejection sampling, with outputs illustrated by
                        (scatterplot-related) graphics.

      Rlog.ImpSamp  :  gives the Log covered in class on Importance Sampling.

      Antith_Contr09  :  is a Log covered in class about the methods of
                        Antithetic Variables and Control Variates for
                        speeding up Monte Carlo Sampling.

      Rlog.nlm.txt  :  is a Log covered in class about numerical maximization using
                       "nlm" with and without supplying "gradient" and "hessian"
                       attributes for the values of the function being minimized.

      Rlog3.txt  :  a log related to Maximization, Root-finding, &
                          vectorization in Splus.

      Rfcn.Log   :  a log related to simulation of Mixtures and defining
                         inverse functions via uniroot.

      RlogF09.LinRegr.txt: an R log covered in class 10/26/09 about
                     the usage and interpretation of the R linear least-squares
                     model-fitting function "lm".

      RlogF09.GLM.txt: an R log covered in class 10/28/09 about
                     generalized linear model fitting and model-comparison
                     using the R model-fitting function "glm".

      PredSamp.LM: an R log covered in class 11/4/09 and 11/6/09 about
                     Bayesian posterior and predictive sampling in a normal
                     linear regression problem (related to the "bass" data of
                     HW Problem 14, and the BayesConjug.pdf Lecture-Notes file).

      Slog4.txt  :  illustration using Steam-Use data from Draper and
                         Smith regression book, showing PROC REG in SAS and
                         the Splus steps related to function  lm  for
                         reproducing the same computed results.

      CrabsLog.pdf : extended data-fitting example in (Splus and) R for
                         GLM analysis of Horseshoe Crab data discussed
                         extensively in Agresti Categorical Data Analysis book.

      DensEst.Log  :  log illustrating several different density estimation
                         techniques (kernel-density estimation, splines, and
                         parametric fitting by a mixture of Gaussian or logistic
                         components) using the Galaxies data from a 1996 article
                         by Roeder. Plots can be found in pdf format here.

      NonPReg.Log  :  log illustrating several methods of nonparametric regression and smoothing, using artificial (simulated)
                         data. Methods include kernel-density, lowess, and splines. Plots can be found in pdf format here.

      Bootstr.Log :  log with data examples to illustrate the connections between and mechanics of: Permutational distributions,
                         p-values & confidence intervals, Parametric Bootstrap and (a very quick idea of) Nonparametric Bootstrap.

          Steps for analysis of kyphosis  dataset (available both in Splus as
     a dataset and also under ASCII data directory on this web-page) using
     Generalized Linear Model modules, glm  in  Splus  and  PROC GENMOD in SAS.

              SASlog1.txt   :    log of practice scripts for categorical data analysis (PROC's FREQ
                                                       and   GENMOD in SAS).

              SASlog2.txt  :     log on GLM's and deviance, with Analysis of Deviance Tables and
                                                        implementations in both SAS and Splus.

              SASlog3.txt  :     additional material specifically related to kyphosis dataset,
        model-fitting and interpretation in both SAS and R including some material on `deviance'
                    and `standardized Pearson' logistic-regression residuals. Some additional material on
                   stepwise fitting in PROC LOGISTIC and building an analysis of deviance table from SAS
                   output can be found in  another  SASlog .

             Finally, a little Splus log summarizing the steps in some GLM's of Fisher scoring versus Newton-
              Raphson iterations to calculate Maximum Likelihood Estimates can be found in NR.FS.Glm .
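
To illustrate the "gradient" attribute convention mentioned for Rlog.nlm.txt in the list above, here is a minimal sketch that fits a normal model by minimizing a negative log-likelihood with nlm. The simulated data, the (mu, log sigma) parametrization, and the starting values are assumptions made for this example and are not taken from the class log; only the gradient is supplied, not the Hessian.

```r
set.seed(1)                          # arbitrary seed for reproducibility
x <- rnorm(200, mean = 3, sd = 2)    # simulated data, illustrative only

# Negative log-likelihood (additive constant dropped) for N(mu, sigma^2),
# with theta = (mu, log sigma); the analytic gradient is attached to the
# returned value as the attribute "gradient", which nlm then uses.
negll <- function(theta, x) {
  mu   <- theta[1]
  sig2 <- exp(2 * theta[2])
  rss  <- sum((x - mu)^2)
  val  <- length(x) * theta[2] + rss / (2 * sig2)
  attr(val, "gradient") <- c(-sum(x - mu) / sig2,
                             length(x) - rss / sig2)
  val
}

# Rough robust starting values: median and a rescaled interquartile range.
start <- c(median(x), log(IQR(x) / 1.349))
fit   <- nlm(negll, p = start, x = x)

mu.hat    <- fit$estimate[1]
sigma.hat <- exp(fit$estimate[2])
c(mu = mu.hat, sigma = sigma.hat, code = fit$code)
```

The estimates can be checked against the closed-form maximum likelihood values mean(x) and sqrt(mean((x - mean(x))^2)).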


Listings of all special-purpose R functions referenced
in the Lecture Notes and Handouts can be found here.


HOMEWORK PROBLEMS and due dates (usually 2, sometimes 3 classes
after they are assigned),  can be found here. (Occasional solutions will also
be posted to the same place.)
For guidelines on the amount of material
(code & output) to submit with the Homeworks, see the  Instructions.txt  file.

As described in the Instructions file, Homeworks are to be handed in by 5pm on the due-date.


DATA

Several datasets used in the course and handouts can be found here in ASCII or text format.
From Mathnet accounts, later in the course you will be able to find additional datasets in R
workspaces in the directory     /nfs/projects/statdata/SplusCrs/Rstf.RData .

In addition, in any environment supporting R, you have access to lots of data in pre-supplied
R libraries which you can look at by issuing either of the commands

> search()     or     > data()
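
As a short sketch of how these commands are typically used, the following lists the attached packages, extracts the names of some datasets shipped in the standard "datasets" package, and loads the "attitude" data frame mentioned among the handouts. The choice of "attitude" as the example is ours.

```r
attached <- search()                      # packages and environments on the search path

# data() with a package argument returns an object whose "results" component
# is a character matrix; the "Item" column holds the dataset names.
ds <- data(package = "datasets")$results
head(ds[, "Item"])

data(attitude)                            # load the "attitude" data frame
dim(attitude)                             # number of rows and columns
names(attitude)                           # variable names
head(attitude, 3)                         # first three rows
```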


COMPUTER ACCOUNTS. MATH, STAT, and AMSC graduate students have
access to R and Matlab under Unix through their mathnet accounts, and
others can have access through glue accounts. R is freely
available in Unix or PC form through this link. SAS in a Unix environment is
available to you free through a WAM account.


Getting Started in SAS.

Various pieces of information to help you get started in using SAS can be
found under the course website Stat430. In particular you can find:

--- running SAS on University machines.

    Instructions and links are included there concerning a downloadable `script'
enabling remote callup of SAS when you are running your cluster account remote
from a campus WAM or mathnet or glue workstation.

--- an overview of the minimum necessary steps to use SAS from Mathnet.

--- links to stat430 problem assignments.

---  a series of SAS logs with edited outputs for illustrative examples.


Additional Computing Resources. There are many publicly available datasets for practice data-analyses.
Many of them are taken from journal articles and/or textbooks and documented or interpreted. A good place
to start is Statlib, and additional sources can be found here.

Datasets needed in the course will either be available in indicated R packages, posted to the
Data Directory linked to this web-page, or indicated by links which will be provided in this space.



  • The Campus Course Evaluation Website www.coursevalum.umd.edu is open
    from Dec. 1 to Dec. 13 for you to submit your evaluation of this course.
    Please take this opportunity to evaluate me and the course during this period!


  • Important Dates


    The UMCP Math Department home page.

    The University of Maryland home page.

    My home page.

    © Eric V Slud,  November 16, 2009.