Fall 2009
MWF 11, MATH Building 1308
Instructor: Eric Slud, Statistics Program, Math. Dept.,
evs@math.umd.edu
Office: MTH 2314, x5-5469
Office hours: tentatively M 1, W 4, and Th 11. You can also make an
appointment for office-hour help at other times by emailing me.
Course Text: Venables, W. N. and Ripley, B. D. Modern Applied Statistics with S
(4th ed., 2002). New York: Springer-Verlag.
Recommended:
A. Gelman & J. Hill, Data Analysis Using Regression and Multilevel/Hierarchical Models,
Cambridge Univ. Press.
Additional:
R. A. Becker, J. M. Chambers, and A. R. Wilks (1988). The New S Language.
Pacific Grove, CA: Wadsworth & Brooks/Cole.
J. M. Chambers and T. J. Hastie (1993). Statistical Models in S.
London: Chapman & Hall.
For information and directories on the following topics, click these links:
Homework information, HW Directory
Data source info, Data Directory
Lecture Notes descriptions, Lecture Notes Directory
Rlog and Scripts descriptions, Rlog and Scripts Directory
Overview: Statistical research and application have changed dramatically
because of cheap and powerful computational and graphical tools. This
course presents modern methods of computational statistics and their
application to both practical problems and research. The techniques covered in
STAT 705, which include some numerical-analysis ideas arising particularly in
Statistics, should be part of every statistician's toolbox.
Statistical methodology will be presented informally, with emphasis
on the intuitive basis
for the techniques and brief discussion of
their theoretical pedigree. Implementation of
each method will be given
in R, and each method will be illustrated by application to data,
often from real datasets but sometimes simulated.
Prerequisite: STAT 420 or STAT 700, and some programming experience
(any language).
Course requirements and Grading: Grading will be based completely
on graded DAILY assignments involving data analysis and statistical
computation (a total of about 30 of them). Homework tasks of moderate
length and difficulty will be assigned in each class session, usually to
be handed in 2 classes after the one in which the assignment is given.
HONOR CODE
The University of Maryland, College Park has a nationally recognized
Code of Academic Integrity, administered by the Student Honor Council.
This Code sets standards for academic integrity at Maryland for all
undergraduate and graduate students. As a student you are responsible
for upholding these standards for this course. It is very important for
you to be aware of the consequences of cheating, fabrication,
facilitation, and plagiarism. For more information on the Code of
Academic Integrity or the Student Honor Council, please visit
http://www.shc.umd.edu.
To further exhibit your commitment to academic integrity, remember to
sign the Honor Pledge on all examinations and assignments:
"I pledge on
my honor that I have not given or received any unauthorized assistance
on this examination (assignment)."
1. Introduction to R:
Starting and quitting R, on-line help, R operators
and functions, creating
R objects, data types (vectors,
matrices, factors, functions, lists), managing
data
(combining objects, subsetting, creation of frames), R
graphics.
2. Monte Carlo and Simulation in R:
Basic random number generation, applications of LLN and CLT in
simulations,
numerical integration, importance sampling, empirical distributions,
Markov Chain
Monte Carlo. Managing loops in R.
3. Numerical Optimization in Statistics:
Objective functions in statistics, and managing functions in R.
Linear and nonlinear
least squares, special considerations in maximizing likelihoods, penalized
likelihood,
steepest descent, quasi-Newton-Raphson methods, constrained maximization, EM
algorithm. Diagnostics for misspecified models.
4. Linear and Generalized Linear Models:
Regression summaries, model fitting, prediction, model updating, analysis
of residuals,
model criticism, ANOVA, generalized linear models, specifying
link and variance
functions, stepwise model selection, deviance analysis.
Comparisons of implementations in R and SAS. Fitting
mixed-effect (generalized)
linear models in R.
5. Bootstrapping Methodology:
Parametric bootstrap, empirical CDF, bootstrap standard errors and confidence
intervals,
estimation of bias, jackknife, application to regression.
6. Smoothing & Nonparametric Regression:
Spline smoothing, kernel smoothing, selecting tuning parameters by
cross-validation.
Graphical aspects of smoothing.
7. MCMC and the Gibbs Sampler:
Definitions and basic ideas of MCMC and Gibbs-Sampler simulation
methodology,
including a brief introduction to `Bayesian Computing'
using BUGS through R.
8. Mixed and Multilevel Models:
Models fitted and interpreted via Likelihood methods and
Bayes (MCMC) methods in R.
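As a small taste of topic 2, here is a minimal sketch (not taken from the course notes) of estimating the normal tail probability P(Z > 4) by plain Monte Carlo and by importance sampling. The seed, the sample size, and the shifted-exponential proposal are illustrative assumptions.

```r
set.seed(705)                      # arbitrary seed, for reproducibility only
n <- 1e5
cutoff <- 4
truth <- pnorm(cutoff, lower.tail = FALSE)   # exact value, for comparison

# Plain Monte Carlo: proportion of standard-normal draws beyond the cutoff.
z <- rnorm(n)
est.mc <- mean(z > cutoff)

# Importance sampling: draw from cutoff + Exp(1), which puts all of its mass
# in the tail, and weight each draw by (target density) / (proposal density).
y <- cutoff + rexp(n)
w <- dnorm(y) / dexp(y - cutoff)
est.is <- mean(w)
se.is <- sd(w) / sqrt(n)           # Monte Carlo standard error of est.is

c(truth = truth, plain.MC = est.mc, imp.samp = est.is, se.imp = se.is)
```

With so rare an event, the plain estimate rests on only a handful of exceedances, whereas every importance-sampling draw contributes a weight, so its standard error is far smaller for the same n.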
Note: The course will concentrate heavily on R, which is a free software
package
syntactically almost identical to Splus, which was the
software emphasized in the course
up to a couple of years ago. If you are
new to R, you should get started as soon as possible,
using it
either on the MathNet or WAM machines (where it is already loaded and installed)
or on your home computer by downloading the software following instructions
at the
R website. For the
systematic Introduction to R and R reference manual distributed
with the R software, either download from the
R website or simply invoke
the command
> help.start()
from within R. For a slightly
less extensive introductory tutorial in R, click
here.
For a quick start, see the Rbasics handout, and then get started reading
about S (or equivalently, R) syntax in the Venables and Ripley text.
In the middle of the course, we will mention
SAS, primarily in order to
contrast the way
in which linear and generalized-linear models are handled in the two
software packages, but this course will not spend any time
introducing SAS.
The topics of individual pdf-file note-packets are as follows:
Sec1NotF09.pdf: Overview, Unix & R preliminaries, R language elements, Vector & Array operations, Inputting Data, and Lists.
Sec2NotF09.pdf: Introduction to Pseudo-Random Number Generation. Functions in R, & how and why to vectorize.
Sec3NotF09.pdf: Introduction to Graphics in R. Also: Simulation speedup methods (Accept-Reject & Importance sampling).
Sec4NotF09.pdf: Numerical maximization methods (for likelihoods).
Sec5NotF09.pdf: Miscellanea: subsetting & parallelizing plus: Introduction to Smoothing Splines (and their use in quick function-inversion in R).
Sec6NotF09.pdf: EM (Expectation-maximization) Algorithm for ML estimation with missing data.
Sec7NotF09.pdf: Markov Chain Monte Carlo: introduction and application in an EM estimation problem in random-intercept logistic regression. For additional pdf files of "Mini-Course" Lectures, see MCMC Mini-Course.
BayesConjug.pdf: Conjugate priors for Bayesian inference from data assumed to follow Exponential Family distributions.
The remaining Handouts/Notes date from previous years and relate to comparisons between Splus (which apply also to R) versus SAS.
Lec03Pt5.pdf: SAS Introduction.
Lec03Pt5B.pdf: Linear Regression in SAS (including some graphics).
Lec03Pt5C.pdf: Factors, ANOVA and Regression in SAS vs. Splus.
Lec03Pt5D.pdf: Simulation in Splus versus SAS.
HANDOUTS distributed in class are included for reference here.
The topics treated on these handout logs are as follows:
DensNPR.Log: this log is a condensed version, for handouts 4/28 and 4/30 in Spring '04, of the DensEst.Log and NonPReg.Log below, illustrating several different density estimation and nonparametric regression and smoothing techniques. In addition, the density estimation part has a small section on (Least-Squares) cross-validated bandwidth selection, and the nonparametric regression component also has some material on comparative evaluation of methods using cross-validation.
Factor.Log: class handout on R handling of Factors and contrasts (using the Bass data in an illustrative example) within linear model fitting functions.
Contr.txt: handout mentioned in 4/4/08 class on defining contrasts in R for use with Factors in fitting linear models.
BassSAS.txt: scripts in Splus for an illustrative regression in SAS on a dataset involving fish (Bass) in polluted lakes.
StepExmp.Log: gives a script in R and SAS for stepwise (mostly forward) selection of variables for linear regression within an R dataset called "attitude" rating places to work in terms of ratings in various categories reported on numerical scales.
GLMlog.R: is the record of a small R session showing how the dispersion and goodness of fit of glm-fitted model objects can be assessed.
Rlog1.txt: covers an in-class demonstration of random-number generation and simulation, plus a brief section on unix.time applied to linear-algebra operations.
Rlog2.txt: re-caps an in-class demonstration of acceptance/rejection sampling, with outputs illustrated by (scatterplot-related) graphics.
Rlog.ImpSamp: gives the Log covered in class on Importance Sampling.
Antith_Contr09: is a Log covered in class about the methods of Antithetic Variables and Control Variates for speeding up Monte Carlo Sampling.
Rlog.nlm.txt: is a Log covered in class about numerical maximization using "nlm" with and without supplying "gradient" and "hessian" attributes for the values of the function being minimized.
Rlog3.txt: a log related to Maximization, Root-finding, & vectorization in Splus.
Rfcn.Log: a log related to simulation of Mixtures and defining inverse functions via uniroot.
RlogF09.LinRegr.txt: an R log covered in class 10/26/09 about the usage and interpretation of the R linear least-squares model-fitting function "lm".
RlogF09.GLM.txt: an R log covered in class 10/28/09 about generalized linear model fitting and model-comparison using the R model-fitting function "lm".
PredSamp.LM: an R log covered in class 11/4/09 and 11/6/09 about Bayesian posterior and predictive sampling in a normal linear regression problem (related to the "bass" data of HW Problem 14, and the BayesConjug.pdf Lecture-Notes file).
Slog4.txt: illustration using Steam-Use data from Draper and Smith regression book, showing PROC REG in SAS and the Splus steps related to function lm for reproducing the same computed results.
CrabsLog.pdf: extended data-fitting example in (Splus and) R for GLM analysis of Horseshoe Crab data discussed extensively in Agresti Categorical Data Analysis book.
DensEst.Log: log illustrating several different density estimation techniques (kernel-density estimation, splines, and parametric fitting by a mixture of Gaussian or logistic components) using the Galaxies data from a 1996 article by Roeder. Plots can be found in pdf format here.
NonPReg.Log: log illustrating several methods of nonparametric regression and smoothing, using artificial (simulated) data. Methods include kernel-density, lowess, and splines. Plots can be found in pdf format here.
Bootstr.Log: log with data examples to illustrate the connections between and mechanics of: Permutational distributions, p-values & confidence intervals, Parametric Bootstrap and (a very quick idea of) Nonparametric Bootstrap.
Steps for analysis of kyphosis dataset (available both in Splus as a dataset and also under ASCII data directory on this web-page) using Generalized Linear Model modules, glm in Splus and PROC GENMOD in SAS.
SASlog1.txt: log of practice scripts for categorical data analysis (PROC's FREQ and GENMOD in SAS).
SASlog2.txt: log on GLM's and deviance, with Analysis of Deviance Tables and implementations in both SAS and Splus.
SASlog3.txt: additional material specifically related to kyphosis dataset, model-fitting and interpretation in both SAS and R including some material on `deviance' and `standardized Pearson' logistic-regression residuals. Some additional material on stepwise fitting in PROC LOGISTIC and building an analysis of deviance table from SAS output can be found in another SASlog.
Finally, a little Splus log summarizing the steps in some GLM's of Fisher scoring versus Newton-Raphson iterations to calculate Maximum Likelihood Estimates can be found in NR.FS.Glm.
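For readers who want to see the Fisher-scoring iteration written out, here is a minimal R sketch. It is not a copy of the NR.FS.Glm log: the simulated data, the variable names, and the helper fisher.logit are all hypothetical. For logistic regression with the canonical logit link, observed and expected information coincide, so Fisher scoring and Newton-Raphson take identical steps; the sketch simply checks the hand-coded iteration against glm.

```r
set.seed(2009)                          # arbitrary seed
n <- 200
x <- rnorm(n)
X <- cbind(1, x)                        # design matrix with intercept column
y <- rbinom(n, 1, plogis(-0.5 + 1.2 * x))

# Hypothetical helper: Fisher scoring for logistic regression.
fisher.logit <- function(X, y, tol = 1e-10, maxit = 25) {
  beta <- rep(0, ncol(X))
  for (it in seq_len(maxit)) {
    p <- as.vector(plogis(X %*% beta))       # fitted probabilities
    score <- crossprod(X, y - p)             # gradient of the log-likelihood
    info <- crossprod(X, X * (p * (1 - p)))  # Fisher information X'WX
    step <- solve(info, score)               # scoring step
    beta <- beta + as.vector(step)
    if (max(abs(step)) < tol) break
  }
  # info was evaluated just before the final (negligible) step
  list(coef = beta, iterations = it, vcov = solve(info))
}

fit.fs <- fisher.logit(X, y)
fit.glm <- glm(y ~ x, family = binomial)
cbind(fisher = fit.fs$coef, glm = coef(fit.glm))
```

With a non-canonical link (for example probit), the expected and observed information differ, and the two iterations would no longer coincide step for step.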
Listings of all special-purpose R functions
referenced
in the Lecture Notes and Handouts can be found here.
HOMEWORK PROBLEMS and due dates (usually 2, sometimes 3 classes
after they are assigned) can be found here.
(Occasional solutions will also
be posted to the same place.)
For guidelines on the amount of material
(code & output) to submit with the Homeworks, see the
Instructions.txt
file.
As described in the Instructions file,
Homeworks are to be handed in by 5pm on the due-date.
DATA
Several datasets used in the course and handouts can be found here
in ASCII
or text format.
From Mathnet accounts, later in the course you
will be able to find additional datasets in R
workspaces in the directory
/nfs/projects/statdata/SplusCrs/Rstf.RData .
In addition, in any environment supporting R, you have access to
lots of data in pre-supplied R libraries. You can list the attached
libraries and the datasets they supply by issuing the commands
> search()
and
> data()
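As a short illustration of these commands, the sketch below lists what is attached and loads the built-in "attitude" data frame (the dataset used in the StepExmp.Log handout), which ships in R's standard datasets package. The object name ds is just an illustrative choice.

```r
search()                           # packages and environments currently attached

ds <- data(package = "datasets")   # catalogue of datasets in the datasets package
head(ds$results[, c("Item", "Title")])

data(attitude)                     # load the "attitude" data frame into the workspace
dim(attitude)                      # number of rows and columns
str(attitude)                      # variable names and types
summary(attitude$rating)           # numerical summary of one column
```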
COMPUTER ACCOUNTS. MATH, STAT, and AMSC graduate students have
access to R and Matlab under Unix through their mathnet accounts,
and
others can have access through glue accounts. R
is freely
available in Unix or PC form through this link.
SAS in a Unix environment is
available to you free through a WAM account.
Getting Started in SAS.
Various pieces of information to help you get started in using SAS can
be
found under the course website Stat430. In particular you can find:
--- running SAS on University machines.
Instructions and links are included there concerning
a downloadable `script'
enabling remote callup of SAS when you are running your cluster account
remote
from a campus WAM or mathnet or glue workstation.
--- an overview of the minimum necessary steps to use SAS from Mathnet.
--- links to stat430 problem assignments.
--- a series of SAS
logs with edited outputs for illustrative examples.
Additional Computing Resources.
There are many publicly available datasets for practice
data-analyses.
Many of them are taken from journal articles
and/or textbooks and documented or interpreted. A good place
to start
is Statlib, and additional sources
can be found here.
Datasets needed in the course will be available in indicated R packages, posted to the
Data Directory linked to this web-page, or indicated
by links which will be provided in this space.
The UMCP Math Department home page.
The University of Maryland home page.
My home page.
© Eric V Slud, November 16, 2009.