Statistics 750   Multivariate Analysis

Fall 2006                               MWF 11am, MTH 0305

Instructor: Eric Slud, Statistics Program, Math. Dept.

Office:    Mth 2314, x5-5469, email evs@math.umd.edu

Office hours:    M10, W12, F10

Course Text: K. Mardia, J. Kent, and J. Bibby, Multivariate Analysis, 1980,
Academic Press (paperback).

This text covers both theory and data examples, with ample verbal explanations and motivation.

Recommended Texts: (i) Anderson, T.W., An Introduction to Multivariate
Statistical Analysis, 3rd ed. 2003, Wiley-Interscience.

This is a standard and authoritative, but very theoretical and fairly dry book,
with much deeper mathematical treatment than the Mardia, Kent and Bibby text.

(ii) Hardle, Wolfgang, and Simar, Leopold, Applied Multivariate Statistical
Analysis, 2003, Springer-Verlag.

This is a much newer text, emphasizing nonparametric techniques and computational examples,
primarily geared towards economics and finance.

Overview:    This course is about statistical models and methods of inference
for multivariate observations with dependent coordinates. Theoretical material
relates to the multivariate normal distribution and to the statistical sampling
behavior of empirical variance-covariance matrices and of various projections
and eigen-decompositions of them. Models studied include regression, principal
components analysis, factor models, and canonical correlations. In addition,
important atheoretical analysis methods such as clustering algorithms will be
discussed. All methods will be illustrated using computational data examples.

Prerequisite:    STAT 420 or STAT 700. Familiarity with some (any) statistical
software package would be very helpful.

Probability theory material needed throughout this course includes joint probability
densities and change-of-variable formulas, the law of large numbers, and the (multivariate)
central limit theorem. In addition, the course makes extensive use of linear algebra,
especially eigenvalues and eigenspaces for symmetric matrices.

The data exercises in the course require that you have access to a reasonably powerful statistical
software package, e.g. Splus/R, SAS, or even Stata or Minitab or others like SPSS. Good facility
with MATLAB would also be enough. I will do examples and provide software scripts in Splus/R,
and can help you get past coding difficulties in Splus or R but can probably not help much with
programming difficulties if you do your data exercises in other languages.

Course requirements and Grading: There will be 7 or 8 graded homework sets
(one every 1½ to 2 weeks), which together will count for 50% of the course grade.
These will be about evenly divided between theoretical problems and computational
data analysis problems. There will also be an in-class test and a final examination
(which will probably become a take-home exam or project, depending on class preferences),
which will respectively count for 20% and 30% of the overall course grade.

Some datasets for the project and homework can be found here.
Also see the Hardle and Simar web-page.


Homework

Assignments, including any changes and hints, will continually be posted
here. The directory in which you can find old homework assignments and
selected problem solutions is Homework.


HW Set 7. This problem set is due Monday, December 11.
R scripts for the problems have been provided (see below).

Problem 1. Simulate 1000 batches of normal data matrices of n=150
observations with 6 columns, say with mean 0 and covariance matrix
2*1 1' + v v' + diag(2,3,2,3,2,3), where 1 is the 6-vector of ones,
v = (1,-1,1,-1,1,-1)', and x x' denotes the outer product of x with
itself. For each such data matrix, perform a maximum likelihood factor
analysis (with q=2),
and then use R function "promax" in package "mva" to do a Varimax
rotation (on the orthonormal-column matrix Lambda0 of factor loadings),
and tally for your 1000 datasets the total number of rotated
factor coordinates in the three ranges (0,.2), (.2,.8), and (.8,1),
and report the results. See R Log for a little R code and commentary on
how to do this. It seems to me that one cannot really know
what the rotated factors mean statistically without seeing
what they do typically in studies like these.
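
The simulation setup in Problem 1 can be sketched as follows. This is a minimal illustration in Python with NumPy, not the R code from the R Log. It assumes the covariance is read as 2*1 1' + v v' + diag(2,3,2,3,2,3), which is a two-factor model with loading matrix [sqrt(2)*1, v], and it assumes that rotated loadings are tallied by absolute value. The names simulate_batch and tally_ranges are hypothetical, and the factor-analysis fit and rotation are left to R.

```python
import numpy as np

# Assumed reading of the Problem 1 covariance:
# Sigma = 2 * 1 1' + v v' + diag(2,3,2,3,2,3)
ones = np.ones(6)
v = np.array([1.0, -1.0, 1.0, -1.0, 1.0, -1.0])
psi = np.array([2.0, 3.0, 2.0, 3.0, 2.0, 3.0])
Sigma = 2.0 * np.outer(ones, ones) + np.outer(v, v) + np.diag(psi)

# Equivalent two-factor form Sigma = Lambda Lambda' + Psi.
Lambda = np.column_stack([np.sqrt(2.0) * ones, v])


def simulate_batch(n, rng):
    """Draw an n x 6 data matrix with rows from N(0, Sigma)."""
    return rng.multivariate_normal(np.zeros(6), Sigma, size=n)


def tally_ranges(loadings):
    """Count absolute loadings in the bins [0,.2), [.2,.8), [.8,1].

    Taking absolute values is an assumption of this sketch.
    """
    a = np.abs(np.asarray(loadings, dtype=float)).ravel()
    counts, _ = np.histogram(a, bins=[0.0, 0.2, 0.8, 1.0])
    return counts


rng = np.random.default_rng(750)   # arbitrary seed
X = simulate_batch(150, rng)       # one batch; the assignment asks for 1000
S = np.cov(X, rowvar=False)        # 6 x 6 sample covariance of the batch
```

In the full exercise, the fitted and rotated loading matrix from each of the 1000 batches would be passed to a tally like tally_ranges and the counts summed over batches.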

Problem 2. The second problem is about discriminant analysis with
cross-validated estimation of misclassification probability, for the
Swiss BankNote data. Here are the steps which I would like you to
follow. See R Script on Discrimination methods for R steps, implemented
on the Iris data, and the resulting picture. The progression of steps is:

(a) To find the estimated optimal quadratic discriminant region for
discriminating genuine from forged banknotes, and the (Fisher) linear
discriminant, and code them both.

(b) To estimate the probabilities of misclassification. This is done
both through a theoretical (or simulated) multivariate-normal probability
based on the estimated parameters for the two banknote groups for the
linear discrimination regions in (a), and then also by a
cross-validated estimation procedure. The cross-validated procedure would
successively leave out one or a few observations from one or both of
the two groups; re-estimate the discrimination regions using the
retained observations; and record whether the omitted observations are
properly discriminated (i.e. classified) or not, tallying the overall
relative frequencies of misclassification.

Note: for a small Log showing how to discriminate the BankNotes
essentially perfectly using two PC's, click here.
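
The two error estimates in step (b) can be sketched as follows for the linear rule. This is a minimal illustration in Python with NumPy, not the course's R script. It assumes equal prior probabilities, a pooled covariance estimate, and leaving out one observation at a time. The data are synthetic stand-ins, not the banknote data, and the function names are hypothetical.

```python
import math
import numpy as np


def pooled_fit(X1, X2):
    """Group means and pooled within-group covariance."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    n1, n2 = len(X1), len(X2)
    Sp = ((n1 - 1) * np.cov(X1, rowvar=False)
          + (n2 - 1) * np.cov(X2, rowvar=False)) / (n1 + n2 - 2)
    return m1, m2, Sp


def fisher_rule(X1, X2):
    """Return (w, c): assign x to group 1 when w @ x > c (equal priors)."""
    m1, m2, Sp = pooled_fit(X1, X2)
    w = np.linalg.solve(Sp, m1 - m2)
    return w, w @ (m1 + m2) / 2.0


def plugin_error(X1, X2):
    """Normal-theory plug-in error Phi(-D/2), D = estimated Mahalanobis distance."""
    m1, m2, Sp = pooled_fit(X1, X2)
    d = m1 - m2
    D = math.sqrt(d @ np.linalg.solve(Sp, d))
    return 0.5 * (1.0 + math.erf(-D / 2.0 / math.sqrt(2.0)))


def loo_error(X1, X2):
    """Leave-one-out estimate of the overall misclassification rate."""
    errors = 0
    for i in range(len(X1)):
        w, c = fisher_rule(np.delete(X1, i, axis=0), X2)
        errors += int(w @ X1[i] <= c)   # group-1 point assigned to group 2
    for j in range(len(X2)):
        w, c = fisher_rule(X1, np.delete(X2, j, axis=0))
        errors += int(w @ X2[j] > c)    # group-2 point assigned to group 1
    return errors / (len(X1) + len(X2))


# Synthetic stand-in data: two 6-variable normal groups with shifted means.
rng = np.random.default_rng(0)
X1 = rng.normal(loc=0.0, size=(100, 6))
X2 = rng.normal(loc=1.5, size=(100, 6))
plugin = plugin_error(X1, X2)
rate = loo_error(X1, X2)
```

The plug-in value tends to be optimistic because the same data are used to estimate the rule and to evaluate it, which is the motivation for the cross-validated tally.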

Problem 3. Use the R classification and clustering routines to see
(a) whether the grouping of US companies by industry sector can be
reproduced by a formal classification algorithm, and
(b) whether the unsupervised clustering algorithms ("diana", "agnes", and
maybe "kmeans") can provide any other meaningful grouping. Use the data
on US companies which can be found here. (It was downloaded from the
Hardle & Simar website: you can freely download html text (including
verbal description of data, i.e. descriptions of the variables which are
listed in Appendix B.5, pp. 455-6 of the Hardle & Simar book) and the
ASCII datasets themselves from the Hardle and Simar web-page.)
In this dataset, you might want to explore the possibility of generating
new columns (e.g. ratios or interactions between columns and ratios,
formed from existing columns) before applying the clustering algorithms,
and you might want to reduce the dimension of the resulting explanatory-
variable sets via PC's to try some sort of visual clustering, as was
done in the small log referenced at the end of Problem 2 above. See
R Log on Clustering for some examples and details on using
the R clustering software. The "Dendrogram" pictures mentioned in the
Log for the small simulated data example come from "agglomerative
clustering" (agnes) and "divisive clustering" (diana).
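
The idea of reducing dimension with PC's and then clustering can be sketched as follows. This is a minimal illustration in Python with NumPy, not the R routines "agnes" or "diana". It assumes the columns are standardized before extracting PC's, uses a plain k-means (Lloyd) iteration as the clustering step, and starts from a farthest-first choice of centers purely to make the run reproducible. The data are synthetic, not the US companies data, and the function names are hypothetical.

```python
import numpy as np


def pc_scores(X, k):
    """Scores on the first k principal components of column-standardized X."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # Rows of Vt are the principal directions of the sample correlation matrix.
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    return Z @ Vt[:k].T


def assign(Y, centers):
    """Index of the nearest center for each row of Y."""
    d2 = ((Y[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)


def kmeans(Y, k, n_iter=100):
    """Lloyd iterations from farthest-first starting centers."""
    start = [Y[np.argmax(((Y - Y.mean(axis=0)) ** 2).sum(axis=1))]]
    while len(start) < k:
        d2 = np.min([((Y - c) ** 2).sum(axis=1) for c in start], axis=0)
        start.append(Y[np.argmax(d2)])
    centers = np.array(start)
    for _ in range(n_iter):
        labels = assign(Y, centers)
        # Keep the old center if a cluster happens to be empty.
        new = np.array([Y[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return assign(Y, centers), centers


# Synthetic stand-in data: two well-separated groups of 30 rows, 5 columns.
rng = np.random.default_rng(3)
X = rng.normal(size=(60, 5))
X[30:] += 10.0
scores = pc_scores(X, 2)           # 60 x 2 matrix, suitable for a scatterplot
labels, centers = kmeans(scores, 2)
```

With real data the clusters are rarely this clean, so plotting the first two columns of the score matrix and coloring points by label is a useful visual check.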


SYLLABUS for Stat 750

We will cover Chapters 1-13 of the Mardia, Kent, and Bibby book: topics include
the multivariate normal distribution, Wishart's and Hotelling's distributions;
tests of hypotheses, estimation, distribution of test criteria; generalized distance,
discriminant analysis; regression and correlation; multivariate analysis of variance;
principal components, canonical correlations, factor analysis, and clustering.

OUTLINE

0. Probability & Linear Algebra Review.        
                (a) Multivariate normal distribution: alternative characterizations.
                      
1. Wishart distribution; Hotelling T^2; Mahalanobis distance.
               
2. Statistics based on likelihood for multivariate normal data.        
                (a) Estimation (likelihood, sufficiency, MLE).
                (b) Hypothesis testing techniques, including likelihood ratios; simultaneous
                confidence intervals, multivariate parametric and nonparametric tests.
               
3. Multivariate regression.
                (a) MLE, general linear hypothesis,
                multiple correlation, least squares, variable selection.

4. Econometrics.
                (a) Simultaneous equation and instrumental variables models.
                (b) Comparison of estimators.
               
5. Principal Components Analysis.        
                (a) Definitions & sampling properties.
                (b) Correspondence analysis.
                (c) Principal components regression.

6. Factor Analysis.
                (a) Definition of models, rotation of factors.
                (b) Goodness of fit, relation with PCA.

7. Multivariate Analysis of Variance.

8. Cluster Analysis.

9. Permutational and Bootstrap ideas in Multivariate Analysis.


Important Dates

  • First Class: Wednesday, August 30
  • Mid-Term Exam:   Oct. 27, 2006 (in-class, omitted).
  • Final Project Due-Date: 4pm 12/18/06.

© Eric V Slud, December 6, 2006.