# Faculty

### Current Position

Retired July 2005

Retired from teaching part-time May 2008

### Personal Summary

Home email: ragideon38@gmail.com

### Education

Ph.D., University of Wisconsin 1970

MS, University of Washington 1964

BS, University of Washington 1960

### Research Interests

Using Correlation Coefficients to estimate parameters in a variety of situations.

**Summary of a very general Statistical System**

From here on appears only the research work on the use of correlation coefficients as general statistical tools. It is called the CES, Correlation Estimation System.

Introduction to the CES. The Correlation Estimation System utilizes any correlation coefficient, including rank-based ones, to estimate parameters: location, scale, and numerous regression settings (simple linear regression, multiple linear regression, time series, non-linear models, density estimation, logistic models). New correlation coefficients have also been proposed but not thoroughly examined: the absolute value correlation coefficient and the median absolute deviation correlation coefficient. In addition, there are two general methods of using correlation coefficients in estimation: (1) setting certain correlation coefficient functions to zero and (2) minimizing the slope of a regression line through ordered residuals on another variable.

Method (2) has a computer coding advantage. For example, a routine can be written for multiple linear regression so that it will work for all reasonable correlation coefficients. This statement needs far more testing but so far many complex computer codes have been written using interchangeably GDCC, MAD, absolute value and Kendall. The code is the same except for defining the chosen correlation coefficient. Of course classical statistics is also included by using Pearson's correlation coefficient.
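The idea that one routine can serve many correlation coefficients can be sketched in a few lines. The Python sketch below is not the author's R code; it illustrates the first of the two general methods (setting a correlation coefficient function to zero) for simple linear regression by solving cor(x, y - b*x) = 0 for the slope b. The function names `pearson`, `kendall` and `ces_slope`, the bisection search, and the bracket [-100, 100] are all assumptions made for this illustration.

```python
import math
from statistics import fmean


def pearson(u, v):
    """Pearson's correlation coefficient (returns 0.0 if either input is constant)."""
    mu, mv = fmean(u), fmean(v)
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / math.sqrt(suu * svv) if suu > 0 and svv > 0 else 0.0


def kendall(u, v):
    """Kendall's tau-a: average sign of concordance over all pairs."""
    n = len(u)
    s = 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (u[i] - u[j]) * (v[i] - v[j])
            s += (prod > 0) - (prod < 0)
    return s / (n * (n - 1) / 2)


def ces_slope(x, y, cor, lo=-100.0, hi=100.0):
    """Solve cor(x, y - b*x) = 0 for b by bisection.

    Assumes the sign of cor(x, y - b*x) goes from positive to
    non-positive as b increases, and that [lo, hi] brackets the change.
    """
    def f(b):
        return cor(x, [yi - b * xi for xi, yi in zip(x, y)])

    if f(lo) < 0 or f(hi) > 0:
        raise ValueError("bracket [lo, hi] does not contain the slope")
    for _ in range(100):          # halve the bracket down to float precision
        mid = (lo + hi) / 2
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


if __name__ == "__main__":
    x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
    y = [3.1, 4.9, 7.2, 8.8, 11.1, 12.9, 15.2, 16.8, 40.0]  # last point is an outlier
    print("Pearson-based slope:", ces_slope(x, y, pearson))
    print("Kendall-based slope:", ces_slope(x, y, kendall))
```

Only the `cor` argument changes between the two calls. With Pearson's coefficient the zero of the correlation function coincides with the least-squares slope; with Kendall's tau the estimate is driven by the ordering of pairwise slopes and is therefore much less affected by the outlying last point.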

The areas studied so far include simple linear regression, multiple linear regression, time series, non-linear estimation, logistic models, density estimation, location estimation, scale estimation.

Also robust correlation coefficients such as GDCC and MAD can be used. GDCC has been thoroughly tested over many areas for many years. The code for all rank based correlation coefficients includes a tie-breaking method based on minimum and maximum.

Cordef®.R: This is an R program that gives routines to compute the Greatest Deviation Correlation Coefficient and five others. A sample run that computes the slope of a simple linear regression line via GDCC by both methods (1) and (2) is displayed in a scattergram. Running the other CCs is just as easy. Names of routines for the Greatest Deviation Correlation Coefficient are GDave and GDslp. The names for the other correlation coefficients can be determined from the R program. The max-min method is used so tied values are not a problem.

**The following material gives the information on five published articles on the CES:**

Gideon, R. A. (2007), "The Correlation Coefficients," Journal of Modern Applied Statistical Methods, 6, 517-529.

A generalized method of defining and interpreting correlation coefficients is given. Seven correlation coefficients are defined — three for continuous data and four on the ranks of the data. A quick calculation of the rank based correlation coefficients using a 0-1 graph-matrix is shown. Examples and comparisons are given.

Gideon, R. A. (2010), "The Relationship Between a Correlation Coefficient and Its Associated Slope Estimates in Multiple Linear Regression," Sankhya, 72-B, 96-106.

This short note takes correlation coefficients as the starting point to obtain inferential results in linear regression. Under certain conditions, the population correlation coefficient and the sampling correlation coefficient can be related via a Taylor series expansion to allow inference on the coefficients in simple and multiple regression. This general method includes nonparametric correlation coefficients and so gives a universal way to develop regression methods. This work is part of a correlation estimation system that uses correlation coefficients to perform estimation in many settings, for example, time series, nonlinear and generalized linear models, and individual distributions.

Gideon, R. A. (2012a), "Obtaining Estimators from Correlation Coefficients: The Correlation Estimation System and R," Journal of Data Science, 10, 597-617.

Correlation coefficients are generally viewed as summaries, causing them to be underutilized. Creating functions from them leads to their use in diverse areas of statistics. Because there are many correlation coefficients (see, for example, Gideon (2007)) this extension makes possible a very broad range of statistical estimators that rivals least squares. The whole area could be called a "Correlation Estimation System." This paper outlines some of the numerous possibilities for using the system and gives some illustrative examples. Detailed explanations are developed in earlier papers. The formulae to make possible both the estimation and some of the computer coding to implement it are given. This approach has been taken in hopes that this condensed version of the work will make the ideas accessible, show their practicality, and promote further developments.

Gideon, R. A., and Hollister, R. A. (1987), "A Rank Correlation Coefficient Resistant to Outliers," Journal of the American Statistical Association, 82, 656-666.

In this article, a nonparametric correlation coefficient is defined that is based on the principle of maximum deviations. This new correlation coefficient, *R _{g}*, is easy to compute by hand for small to medium sample sizes. In comparing it with existing correlation coefficients, it was found to be superior in a sampling situation that we call “biased outliers,” and hence appears to be more resistant to outliers than the Pearson, Spearman, and Kendall correlation coefficients. In a correlational study not included in this article of some social data consisting of five variables for each of 51 observations, *R _{g}* was compared with the other three correlation coefficients. There was agreement on 8 of the 10 possible correlations, but in one case, *R _{g}* was significant when the others were not, and in yet another case, *R _{g}* was not significant when the others were. A further analysis of this data set indicated that there were three to six data points that were anomalies and had a severe effect on the other correlations but not *R _{g}*. Apparently, the statistic *R _{g}* measures association in a unique fashion. This different measure of association for real data is extended to a population interpretation and expressed in terms of the copula function.

In consideration of ties, this article suggests a randomization method and a computation of the minimum and maximum possible correlation values when ties are present. These ideas are illustrated with an example.

Critical values of *R _{g}* and enough examples are included so that this new statistic can be applied to data. The success that we have had with the use of *R _{g}* in hypothesis testing suggests that *R _{g}* may have important applications wherever robustness is desired.

Gideon, R. A., and Rothan, A. M. (2011), "Location and Scale Estimation With Correlation Coefficients," Communications in Statistics - Theory and Methods, 40, 1561-1572.

This article shows how to use any correlation coefficient to produce an estimate of location and scale. It is part of a broader system, called a correlation estimation system (CES), that uses correlation coefficients as the starting point for estimations. The method is illustrated using the well-known normal distribution. This article shows that any correlation coefficient can be used to fit a simple linear regression line to bivariate data and then the slope and intercept are estimates of standard deviation and location. Because a robust correlation will produce robust estimates, this CES can be recommended as a tool for everyday data analysis. Simulations indicate that the median with this method using a robust correlation coefficient appears to be nearly as efficient as the mean with good data and much better if there are a few errant data points. Hypothesis testing and confidence intervals are discussed for the scale parameter; both normal and Cauchy distributions are covered.
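A minimal sketch of the regression idea in this abstract, written in Python rather than the author's R: the ordered data are regressed on standard normal quantiles, the slope is taken as the scale estimate, and a central value of the residuals as the location estimate. Several details are assumptions of this sketch and not taken from the article: the plotting positions i/(n+1), the use of Kendall's tau as the correlation coefficient, the bisection solver, the median of the residuals as the intercept, and the names `kendall` and `ces_location_scale`.

```python
from statistics import NormalDist, median


def kendall(u, v):
    """Kendall's tau-a: average sign of concordance over all pairs."""
    n = len(u)
    s = 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (u[i] - u[j]) * (v[i] - v[j])
            s += (prod > 0) - (prod < 0)
    return s / (n * (n - 1) / 2)


def ces_location_scale(data, cor=kendall):
    """Return (location, scale) from a correlation-based fit of the
    ordered data against standard normal quantiles."""
    xs = sorted(data)
    n = len(xs)
    # Assumed plotting positions i/(n+1); other conventions are possible.
    q = [NormalDist().inv_cdf(i / (n + 1)) for i in range(1, n + 1)]

    def f(b):
        # Correlation between the quantiles and the residuals for slope b.
        return cor(q, [xi - b * qi for xi, qi in zip(xs, q)])

    lo, hi = 0.0, 1.0
    while f(hi) > 0:              # expand until the sign change is bracketed
        hi *= 2.0
    for _ in range(100):          # bisection on the slope
        mid = (lo + hi) / 2
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    scale = (lo + hi) / 2
    location = median(xi - scale * qi for xi, qi in zip(xs, q))
    return location, scale


if __name__ == "__main__":
    sample = [9.1, 10.4, 8.7, 11.9, 10.0, 9.6, 10.8, 12.3, 7.9, 55.0]  # one errant value
    print(ces_location_scale(sample))
```

Swapping `cor` for another correlation coefficient changes the estimator without changing the routine, which is the point the abstract makes about robust correlations producing robust estimates.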

**Related Materials**

Robert Hollister "Exact Distribution of the Greatest Deviation Correlation Coefficient"

Gideon, R. A., Prentice, M. J., Pyke, R. "The Limiting Distribution of the Rank Correlation Coefficient *R _{g}*"

The result of this paper is used to find the limiting distribution of regression slopes in multiple linear regression. The paper appeared as section 12 in the book Gleser, L.J., Perlman, M.D., et al., eds., Contributions to Probability and Statistics: Essays in Honor of Ingram Olkin, Springer-Verlag, 1989.

Gideon, R. A. (2012b), "The Correlation Estimation System and Pitman Efficiency," unpublished manuscript, University of Montana.

How Difference in Team Hits Affects the Probability of Winning in Major League Baseball Games.

This paper was submitted to a sports journal in May 2017. It uses the absolute value Correlation Coefficient and the CES to analyze data. One data set is Home Field advantage during the 2016 baseball season. The Logistic model is used to fit winning percentage to difference in hits.

Publication 2: Correlation in Simple Linear Regression; this is a refined version of paper 5 with the same title, found below the list of publications. It is currently in the publication review process. Communications in Statistics, Theory and Methods, #A06-277. See below for the data sets that are used.

Publication 3: Location and Scale Estimation with Correlation Coefficients, Communications in Statistics, Theory and Methods, #A06-374

Publication 4: Correlation and Regression without Sums of Squares, College Mathematics Journal, #08-054. This was rejected as too advanced. Need another mid-level journal to submit to.

Publication 5: Nonlinear Correlation Coefficients, sent to The Canadian Journal of Statistics, #CJS 1370CT07 = old submission number

Publication 6: A Second Opinion Correlation Coefficient. Sent to TAS but found unsuitable, MS07-171. Basic introduction to Greatest Deviation Correlation Coefficient and an example where it differs from Pearson, Spearman, and Kendall, thus showing insight into data not available from standard sources.

Publication 7: Obtaining Estimators from Correlation Coefficients: The Correlation Estimation System and R

Correlation coefficients (CCs) are generally viewed as summaries, causing them to be underutilized. Viewing them as functions leads to their use in diverse areas of statistics. Because there are many correlation coefficients (see, for example, Gideon (2007) or Publication 1) this extension makes possible a very broad range of statistical estimators that rivals least squares. The whole area could be called a "Correlation Estimation System" (CES). This paper concentrates on outlining the numerous possibilities for using the CES without thorough explanation but with some illustrative examples. It gives the formulae to make possible both the estimation and the computer coding to implement it. This approach has been taken in hopes that this condensed version of the work will make the ideas accessible, show their practicality, and promote further developments.

One focus of this paper is to show how to use any correlation coefficient to estimate location, scale and slope coefficients in simple and multiple linear regression. Once these procedures are developed, the CES is extended into nonlinear regression and estimation of parameters for a particular density type. Some of the results are illustrated with a continuous and with a rank based CC using absolute values. Although not done in this paper the CES can be easily extended into time series and general linear models. Many of these areas have been tested using various CCs over 25 years and all results lead one to believe in the value of the approach.

Publication 8: CES estimate of location using GDCC. It is essentially the average of the one-third and two-thirds quantiles.

Publication 9: The Limiting Distribution of the Greatest Deviation Correlation Coefficient
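The description under Publication 8 (the GDCC location estimate as essentially the average of the one-third and two-thirds quantiles) can be sketched directly. This is a Python illustration, not the author's routine; the function name `gd_location` and the quantile convention (the default "exclusive" method of Python's `statistics.quantiles`) are assumptions, and the word "essentially" in the source means the exact estimator may differ in detail.

```python
from statistics import quantiles


def gd_location(data):
    """Average of the one-third and two-thirds sample quantiles."""
    lower_third, upper_third = quantiles(list(data), n=3)
    return (lower_third + upper_third) / 2


if __name__ == "__main__":
    clean = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
    spoiled = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1000]  # one errant value
    print(gd_location(clean), gd_location(spoiled))
```

Because only the central third of the ordered data determines the two quantiles, a single extreme value in the tail leaves the estimate unchanged in this example.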

Data set number 1: Major League Baseball Data from 1989 used in Publication 2

Data set number 2: Major League Baseball Data from 1992, Atlanta Braves and Opponents hits and runs for 175 games, used in Publication 2

Data set number 3: The complete data set for 1992 Atlanta Braves Baseball. Variable names: bb92 is game number, winb is 1 or 0, braves win or lose, abb is number of at bats for Braves, lobb is number left on base Braves, runsb is number of Braves runs, hitsb is number of hits Braves, errb is number of errors Braves, pitb is the number of Braves pitchers used, ab, lob, runs, hits, err, pit are the same set of variables for the opponents, time is length of game in hours, attend is attendance, inn is the number of innings, site is either home 1 or away 0, oppon is name of opponent, dow is day of week, unknown variable, hpu is home plate umpire

**Research, a Billabong of Statistical Estimation using Correlation and Rank-Based methods**

All of the work below is part of a general system of estimation with correlation coefficients. Whatever can be done with Least Squares can also be done with the following methods. Because the work is so extensive it can only gradually be posted. It has taken many years to develop and has been supported by five Ph.D. students and numerous Masters students. The students' names appear below as I am indebted to them.

It has been funded by a private grant from John Bryan and from the National Security Agency. The work is an extension of the basic papers

- Gideon, R.A. and Hollister, R.A. (1987), "A Rank Correlation Coefficient Resistant to Outliers", Journal of the American Statistical Association, vol. 82, pp. 656-666
- Gideon, R.A., Prentice, M.J., and Pyke, R. (1989), "The Limiting Distribution of the Rank Correlation Coefficient, GD", appearing in Contributions to Probability and Statistics (Essays in Honor of Ingram Olkin), edited by Gleser, L.J. et al., Springer-Verlag, N.Y., pp. 217-226

**Research Interests**

- The use of correlation coefficients in statistical estimation
- The use of the rank based CC called the Greatest Deviation in statistical estimation
- The use of software package Splus in implementing the above two topics
- Rank Based CC estimators are robust, so robustness is of interest
- Making computer packages more versatile by incorporating correlational methods

**Robust Study**

Robust Comparison (Baseball game times regressed on hits, runs, pitchers, LOB, BB, and Ks)

The USA Today article on baseball game times appeared on 19 March 2002 on Sports page 3C.

The Robust link compares four robust multiple regression methods with ordinary Least Squares. The data were analyzed as they were generated, so partial analyses occur after games 9, 14, 20, 31, 39, 46, 53, 58, 65, and 82 (the end of the first half of the season). Second-half analyses occur after games 44, 69, and 79, followed by the combined analysis of all 161 games. Also included are some quotes about the length of games from USA Today and a statistical comparison.

Paper #6 will contain the methodology.

Graphs of Various Baseball Data Variables, comparing LS and GDCC

**Online papers** showing how to use any Correlation Coefficient to estimate parameters in a wide variety of settings

All details are illustrated with the Greatest Deviation Correlation Coefficient (GD); the links are the numbers of the particular papers.

#1 *A Generalized Interpretation of Pearson's r*

Contents:

- Pearson's CC is defined using the diagonals of a parallelogram
- This diagonal idea is used to explain the definition of other correlation coefficients, e.g., Greatest Deviation and Gini's (Spearman's footrule)
- An absolute value correlation coefficient is defined, which should be used anytime L-one methods are employed
- A median absolute deviation correlation is defined, the correlation extension of MAD methods

#2 *The Correlation Coefficients*

- Continuous and rank absolute value correlations are defined
- All correlations are examined as the difference in measures from perfect negative and positive correlation, e.g., Pearson, Spearman, Kendall, Greatest Deviation, and the absolute value correlations
- A 0-1 graph-table is given that shows how to compute GD, Kendall, Spearman, and the rank absolute value CC all on the same graph-table
- The asymptotic distributions are given for the above four rank CC's and an example is worked out
- A small example suggests which CC's are most robust
- The general tied-value procedure, which allows complex calculations such as those in regression problems, is reviewed

#3 *The Geometrical Definition of GDCC and its Uniqueness*

Contents:

- The basic counting technique to compute the Greatest Deviation Correlation Coefficient (GDCC)
- The exact null distribution up to n = 15
- The population and sample definitions of the GDCC are illustrated by geometry
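The counting technique listed above can be sketched as follows. This Python version follows the definition of *R _{g}* as it is usually stated from Gideon and Hollister (1987): order the pairs by x, let p be the resulting sequence of y-ranks, count for each i how many of the first i ranks exceed i, and compare the maximum count for p with the maximum count for the reversed ranks n + 1 - p. The formula is reproduced from memory of that paper and should be checked against it; the function name `gdcc` is hypothetical, and ties are not handled (the max-min tie method mentioned on this page is not implemented here).

```python
def gdcc(x, y):
    """Greatest Deviation Correlation Coefficient for data without ties."""
    n = len(x)
    # Rank the y values from 1 to n.
    rank_y = [0] * n
    for r, k in enumerate(sorted(range(n), key=lambda k: y[k]), start=1):
        rank_y[k] = r
    # p[i-1] is the y-rank paired with the i-th smallest x.
    p = [rank_y[k] for k in sorted(range(n), key=lambda k: x[k])]

    def max_deviation(perm):
        # d_i = number of the first i entries that exceed i; return max over i.
        return max(sum(1 for pj in perm[:i] if pj > i) for i in range(1, n + 1))

    reversed_p = [n + 1 - pj for pj in p]
    return (max_deviation(reversed_p) - max_deviation(p)) / (n // 2)


if __name__ == "__main__":
    print(gdcc([1, 2, 3, 4], [2, 1, 4, 3]))
```

For the small example, the counts for p = (2, 1, 4, 3) are 1, 0, 1, 0 and for the reversed ranks (3, 4, 1, 2) they are 1, 2, 1, 0, so the coefficient is (2 - 1) / 2 = 0.5. Perfectly increasing data give 1 and perfectly decreasing data give -1.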

#4 *Random Variables, Regression, and the GDCC*

Contents:

- A population discussion of GDCC and simple linear regression
- The minimum sum of squares of probabilities(volumes) for the bivariate normal and Cauchy distributions
- The population regression lines
- Asymptotic relationship between the CC and the slopes in simple linear regression
- The correlated bivariate Cauchy Distribution is defined with parameter rho
- The Bivariate Normal and Cauchy have the same GDCC, (2/pi)*arcsin(rho)
- GDCC can do regression for all elliptical bivariate distributions, from Cauchy, the Student t's, to Normal
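The common population value quoted in the list above can be made plausible with a short calculation. The sketch assumes that, for these symmetric distributions, the population GDCC reduces to a difference of quadrant probabilities taken at the medians; that reduction is an assumption of this note, not a statement taken from paper #4. The quadrant probability itself is Sheppard's classical formula for a standardized bivariate normal pair (X, Y) with correlation parameter rho, and the same orthant probability holds for the bivariate Cauchy with parameter rho.

```latex
\begin{align*}
P(X \le 0,\; Y \le 0) &= \frac{1}{4} + \frac{\arcsin\rho}{2\pi}, \\
P(X \le 0,\; Y > 0)  &= \frac{1}{4} - \frac{\arcsin\rho}{2\pi}, \\
\rho_{GD} &= 2\,\bigl[\,P(X \le 0,\; Y \le 0) - P(X \le 0,\; Y > 0)\,\bigr]
           = \frac{2}{\pi}\,\arcsin\rho .
\end{align*}
```

Since the quadrant probabilities depend on the distribution only through rho, the same value results for the Normal and the Cauchy, consistent with the statement in the list.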

#5 *Correlation in Simple Linear Regression*

Contents:

- The correlation equation is shown to be a general way to define an equation for simple linear regression.
- Least Squares is done via Pearson's Correlation and then Kendall's tau and also the Greatest Deviation Correlation
- Several examples are given; the graphs and figures are in the link just below
- Confidence intervals for the slopes are constructed via the asymptotic distribution.

*#6 Gideon, R. A., and Rothan, A. M. (2004a), "Location and Scale Estimation with Correlation Coefficients"*

- Methods showing how to construct estimates of variation, standard deviation, and location, median or mean, using nonparametric correlation coefficients are given. (Actually, any correlation coefficient may be used.)
- The work is connected to the work of Downton (1966) and D'Agostino (1971, 1973)
- An example is given comparing the robust properties to examples given in Iglewicz (1983) and Nemenyi, Dixon, White, and Hedstrom (1977)
- The estimate of the median through the GD is shown to be nearly as efficient as the classical mean when the data are normally distributed.
- How to construct confidence intervals and perform hypothesis testing for the standard deviation is explained.

#7 **Multiple Regression technique with Asymptotics (student-Miller)**

- The multiple Taylor series expansion for Pearson's Correlation is used to connect correlation to classical distribution theory
- This result is expanded to allow inference with the Greatest Deviation Correlation Coefficient
- A multiple regression example is given
- Partial and multiple correlation coefficients are defined for GDCC and examples given
- This paper explains how the baseball game time example in the "robust comparison" link is done

#8 **this link will contain a paper delivered to the IMS Annual Meeting in Banff, Canada Tuesday July 30, 2002**

- The correlation Principle
- Measuring linearity with a nonparametric correlation coefficient
- examples with good and bad data
- the order norm is used but not explained; order norm is to be a later addition, already written, but not yet added here
- the same classical interpretation of regression and correlation can be used with nonparametric correlations; i.e., fraction of regression explained

**# 9 A Robust Norm Using GDCC (Carol Ulsafer was a co-author of this work)**

- The paper "Location and Scale Estimation with Correlation Coefficients" is used to develop a robust norm.
- The Norm is called an order norm because it is based on the ordered data.
- An order "inner product" is defined.
- A study is made on the zero of the order norm.
- The triangle inequality is shown not to hold.
- A landcover satellite example is given.

*#10 Gideon, R. A., and Rothan, A. M. (2004b), "Elementary Slopes in Simple Linear Regression"*

- A weighted average of the elementary slopes is shown to give the least-squares estimate when the regressor variable values are fixed and the error is independent and normal.
- It is shown that the elementary slopes have a rescaled Cauchy distribution for bivariate normal data.
- This Cauchy distribution is then used with correlation coefficients to estimate the regression parameters.
- Simulations show that with outliers distributed symmetrically, both Kendall's Tau and GD operating on the original data and GD operating on the elementary slopes of the bivariate data are robust in estimating the slope in simple linear regression.
- An example of the process is given using bivariate normal data with some contamination in the Y-variable, along with a scatterplot comparing the fits by least squares, GD, and GD operating on the elementary slopes
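The first point in the list above, that a weighted average of the elementary slopes reproduces the least-squares slope, can be checked numerically. The Python sketch below uses weights proportional to the squared x-differences, which is the standard choice for this identity; whether paper #10 states the weights in exactly this form is an assumption, and the function names are hypothetical.

```python
from itertools import combinations


def elementary_slopes(x, y):
    """Pairwise slopes with their weights (x_j - x_i)^2, skipping tied x values."""
    return [
        ((y[j] - y[i]) / (x[j] - x[i]), (x[j] - x[i]) ** 2)
        for i, j in combinations(range(len(x)), 2)
        if x[i] != x[j]
    ]


def weighted_slope(x, y):
    """Weighted average of the elementary slopes."""
    pairs = elementary_slopes(x, y)
    total_weight = sum(w for _, w in pairs)
    return sum(s * w for s, w in pairs) / total_weight


def least_squares_slope(x, y):
    """Ordinary least-squares slope for comparison."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


if __name__ == "__main__":
    x = [0.5, 1.0, 2.5, 3.0, 4.5, 6.0]
    y = [1.2, 2.9, 5.1, 7.4, 9.0, 13.5]
    print(weighted_slope(x, y), least_squares_slope(x, y))
```

The two printed values agree up to rounding, because the sum over pairs of (x_j - x_i)(y_j - y_i) equals n times the centered cross-product sum, and likewise for the squared x-differences. Replacing the weighted average by a correlation-based or median-type summary of the same elementary slopes is what gives the robust alternatives discussed in the list.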

· **A Bivariate Cauchy distribution is analyzed via simple linear regression and GDCC**

- Inference is done with the asymptotic distribution of GDCC
- The distribution free property over the class of bivariate t's is demonstrated at the ends; the Cauchy and Normal
- Robustness is shown by comparing the normal and Cauchy distributions
- A geometric method is used to estimate a ratio of scale parameters
- This paper again is general so that the method could be applied with other correlation coefficients

*#12 Sheng, HuaiQing (Tom), Ph.D. advisor Gideon, R.A. 2002 "Estimation in Generalized Linear Models and Time Series Models with Nonparametric Correlation Coefficients"*

- *I: Linear Regression and Nonparametric Correlation Coefficients,* Simple Linear Regression and Multiple Linear Regression
- *II: Generalized Linear Models and Estimation,* Estimation using the Greatest Deviation Correlation Coefficient, GLM with the Poisson, GLM with the Logistic
- *III: Nonlinear Models and Estimation,* examples compared to least squares and the steepest descent methods
- *IV: Time Series Models and Estimation,* ARMA model, moving averages, autoregressive processes, mixed models, forecasting, practical examples
- *V: Bibliography,* references include the papers above as well as related work
- This work shows the generality of estimation with Nonparametric Correlation Coefficients on advanced techniques by utilizing the Greatest Deviation CC. It is a very general estimation procedure. The Ph.D. dissertation is on file at the University of Montana Library, Missoula, MT. It can be accessed at
**http://wwwlib.umi.com/dissertations/fullcit/3041406**

**#13 A Two-Sample Experiment Analyzed by the Correlation Method**

- This is an independent two-sample problem: measuring the distance around an Oval in the center of the University of Montana campus
- It was performed by students in an applied statistics class; one sample used a step counting method and the second sample was based on measuring time
- There were outliers present
- The GD method was compared to classical methods
- Bootstrapping and Permutation tests were employed
- Quantile plots are given to illustrate the results
- As always this method is shown to be very practical and it can be used to avoid decision making about the inclusion of outliers
- The methods are explained in previous papers posted on the Web site, numbers 6 and 8

**#14 General Definition of Correlation Coefficients**

· **presented in Minneapolis August 2005 at the National Meeting, Poster Session**

- General outline of correlation methods for location, scale, regression parameters
- A minimization technique using correlation, an alternative to least squares
- A regression example with bivariate Cauchy data comparing GD and LS

**#15 Two robust examples including the education data**

· **This is real data**

- The main emphasis, however, is to compare several correlation coefficients on real data
- The Greatest Deviation Correlation is used as it is very robust
- Spearman, Kendall, Pearson correlations compared to GDCC
- The conclusion is that GDCC gives added insight to relationships obscured by other correlations and widely variable data
- There are two sets of data but the education data is most revealing
- The data comes first in this pdf file and the write-up is in the middle

#16 Estimating the Parameters of the Pareto Distribution

· **This is the master's thesis of Joseph Petersen**

- The idea is to show that the correlation estimation method can be used to estimate parameters in a wide variety of settings including particular distributions
- The GDCC was used to estimate the parameters and comparisons to existing methods were made.
- Again it was demonstrated that the correlation method is useful and robust using GDCC

#17 Population Values of Spearman's CC and Absolute Value CC

- Population Values of Gini's (modified footrule) and Absolute Value CC
- Asymptotic distribution of Absolute Value CC

#18 Minimization Process In Correlation Estimation System (Seattle, 2015) data file.

See number 18 below for latest work. It will be presented in the ASA meeting in Seattle in August 2015. This work includes the R-program setup for the calculation of seven correlation coefficients and how to use them in simple linear regression. The seven are: Greatest Deviation, Kendall’s Tau, Gini’s, Continuous Absolute Value (the continuous version of Gini’s CC), Median Absolute Value CC (MADCC), a MAD covariance function and its corresponding CC (CORMAD), Pearson’s.

#19 Correlation Estimation System (CES) on Binary Data (a baseball example)

Acknowledgments

- Ron Pyke, University of Washington, for being my Masters supervisor, and completing the asymptotic distribution derivation and write-up
- John Gurland, University of Wisconsin, my Ph.D. advisor at Madison Wisconsin
- Mike Prentice, University of Edinburgh, for asking the question, "What is it estimating?" and allowing me a Sabbatical in Scotland
- Student names appear below; immediately below are five Ph.D. students

- Dale Mueller, Spring 1978, "A Geometrical View of the Kolmogorov-Smirnov Statistics with Multi-Sample Generalizations"
- Sister Adele M. Rothan, Summer 1982, "A Distribution-Free Scale Test of the Kolmogorov-Smirnov Type"
- Robert Hollister, Summer 1984, "A Correlation Coefficient Based on Maximum Deviation"
- Steve Rummel, Summer 1991, "A Procedure for Obtaining a Robust Regression Employing the Greatest Deviation Correlation Coefficient"
- HuaiQing (Tom) Sheng, Spring 2002, "Estimation in Generalized Models and Nonlinear Models with the Greatest Deviation Correlation Coefficient" (This includes time series models)

- Brian Steele, Spring 1995 (with David Patterson as co-advisor), "Estimation in Generalized Linear Mixed Models via EM Algorithm"

Below are the names of students or people who have helped keep my research alive by either being a master's student, participating in a seminar, being on a grant, or just being there for a discussion and helping with the research.

- Gerald Schumann (1978, initial development of ideas)
- Huey-Fen Shiue (1984, general understanding and development of nonparametric methods)
- Jian-Jian Ren (1987, general development of nonparametric methods)
- Young Hoon Park (1989, general use of nonparametric methods)
- Don Gilmore (1991, graphical calculation of GD and three other rank correlations)
- Li-Chiou Lee (1991, location estimation with GD)
- John Bruder (1991, location estimation with GD)
- Mike Thiel (1991, location estimation with GD)
- Hongzhe Li (1992, a study of GD methods in multiple linear regression)
- Bill Stoner (1992, GD location estimator)
- Ming Yin (1993, development of C program for multiple regression)
- Wexin Zhou (1993, S-Plus programs for GD)
- Josef Crepeau (1993, small sample GD location estimator)
- HuaiQing Sheng or Tom (1994, GD in general linear models and non-linear regression)
- Jacquelynn Miller (1995, multiple regression development with GD)
- Christopher Vahl (1995, Studying the robustness of GD or rank based CC statistics)
- David Goldsmith (1997, S-Plus work on the continuous absolute correlation coefficient)
- Jeff Stratton (1999, Testing for normality using a correlation type statistic)
- Yueju Li (1999, Comparison of tests of fit between Pearson's CC and two Kolmogorov-Smirnov type tests, Lilliefors)
- Jiang Qun (1999, correlation methods on one-sample data to estimate location and scale parameters)
- John Gee (2002, Using correlation coefficients and order statistics to estimate sigma)
- Joe Peterson (2004, Quantile estimation of the Pareto distribution with correlation coefficients)
- Isaac Grenfell (2004, The use of GD and MAD correlations as estimators in spatial statistics)
- Joyce Schlieter (20 years of support, preparation of workshop for ASA meeting)
- Carol Ulsafer (9 years of support, order norm development, editorial assistance)
- Merle Manis (lifetime listening to wacky ideas in statistics while being in algebra)
- Charles Bryan (lifetime moral support, and small grant from his brother John, $30,000)