Regression & co.
Matteo Pelagatti
Università di Milano-Bicocca
Summary
Correlation and partial correlation
Linear regression
Logistic regression
Nonlinear regression
Principal components
Factor analysis
Correlation
Covariance and linear correlation
They are measures of linear dependence.
The second is a rescaled version of the first, taking values in the range [-1, +1].
Covariance (and sample covariance)
$$\sigma_{xy} = \mathrm{cov}(X, Y) = \frac{1}{n}\sum_{i=1}^{n}(X_i - \mu_x)(Y_i - \mu_y)$$
$$S_{xy} = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})$$
Correlation (and sample correlation)
$$\rho_{xy} = \frac{\sigma_{xy}}{\sigma_x \sigma_y}; \qquad r_{xy} = \frac{S_{xy}}{S_x S_y}$$
Does the line represent the relation well?
Cross-products
Extremes of the correlation coefficient
+1 = perfect linear correlation (positive slope)
0 = absence of linear correlation
-1 = perfect linear correlation (negative slope)
Correlation in short
$$\rho_{xy} = \frac{\sigma_{xy}}{\sigma_x \sigma_y} \qquad r_{xy} = \frac{s_{xy}}{s_x s_y}$$
$$-\sigma_x \sigma_y \le \sigma_{xy} \le \sigma_x \sigma_y \;\Rightarrow\; -1 \le \rho_{xy} \le 1$$
• Symmetric
• Pure number (no unit of measurement)
• It measures the intensity and the sign of the linear relation
between two variables.
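As a concrete illustration of these definitions (not part of the original slides), a short computation of the sample covariance and correlation; the data vectors below are made up for the example:

import numpy as np

# made-up example data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 3.2, 4.8, 5.1])

n = len(x)
# sample covariance: S_xy = sum((x_i - xbar)(y_i - ybar)) / (n - 1)
s_xy = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)
# sample correlation: r_xy = S_xy / (S_x * S_y)
r_xy = s_xy / (x.std(ddof=1) * y.std(ddof=1))

print(s_xy, r_xy)                                    # by hand
print(np.cov(x, y)[0, 1], np.corrcoef(x, y)[0, 1])   # numpy equivalents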
Partial correlation
It is the correlation between two variables X and Y once
X and Y have been “cleaned” of the correlation with
other variables Z1, Z2, …
If X and Y are not correlated with Z1, Z2, … their
correlation and partial correlation are equal.
How to eliminate the correlation of Z1, Z2, … with X and Y will become clear after treating linear regression.
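Anticipating that regression-based "cleaning", a minimal sketch of partial correlation on simulated data (the variables x, y, z below are invented for the example): regress X and Y on Z and correlate the residuals.

import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(size=200)
x = 0.8 * z + rng.normal(size=200)   # x correlated with z
y = 0.5 * z + rng.normal(size=200)   # y correlated with z

def residuals(v, z):
    """Residuals of a simple regression of v on z (with intercept)."""
    Z = np.column_stack([np.ones_like(z), z])
    coef, *_ = np.linalg.lstsq(Z, v, rcond=None)
    return v - Z @ coef

# partial correlation of x and y given z = correlation of the two residual series
r_partial = np.corrcoef(residuals(x, z), residuals(y, z))[0, 1]
print(r_partial)   # close to 0 here, while np.corrcoef(x, y)[0, 1] is not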
Linear regression
Simple linear regression
It models the dependence of a variable Y as a function of a variable X.
It can be
descriptive - explanatory
predictive
Simple linear model
$$Y = \alpha + \beta X + \varepsilon, \qquad E(\varepsilon \mid X) = 0$$
or equivalently
$$E(Y \mid X) = \alpha + \beta X$$
$$y_i = \alpha + \beta x_i + \varepsilon_i, \qquad i = 1, 2, \dots, n$$
Graphical representation of the model
Least squares
Residuals: $e_i = y_i - a - b x_i$
I seek the values of the coefficients that minimize the sum of squared residuals
$$SS(a, b) = \sum_{i=1}^{n} (y_i - a - b x_i)^2$$
Solution
$$\hat{\beta} = \frac{s_{xy}}{s_x^2}, \qquad \hat{\alpha} = \bar{y} - \hat{\beta}\,\bar{x}$$
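The closed-form solution translates directly into code; a small sketch on made-up data:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 2.5, 3.9, 4.4, 5.6])

# beta_hat = s_xy / s_x^2,  alpha_hat = ybar - beta_hat * xbar
s_xy = np.cov(x, y, ddof=1)[0, 1]
beta_hat = s_xy / np.var(x, ddof=1)
alpha_hat = y.mean() - beta_hat * x.mean()

print(alpha_hat, beta_hat)
print(np.polyfit(x, y, 1))   # same slope and intercept (slope listed first)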
Possible problems
Non-linearity
Aberrant observations
Residual analysis
The coefficient of determination R2
It measures the goodness-of-fit of the line to the data.
In the simple regression model it is the squared
correlation between X and Y
It can be interpreted as the fraction of variance of Y
explained by X
It can be computed as 1 - Var(e) / Var(Y)
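A quick check of the two equivalent computations on made-up data: 1 - Var(e)/Var(Y) coincides with the squared correlation in the simple model.

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 2.5, 3.9, 4.4, 5.6])

beta = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
alpha = y.mean() - beta * x.mean()
e = y - alpha - beta * x                      # least-squares residuals

r2_from_residuals = 1 - np.var(e, ddof=1) / np.var(y, ddof=1)
r2_as_squared_corr = np.corrcoef(x, y)[0, 1] ** 2
print(r2_from_residuals, r2_as_squared_corr)  # identical in simple regression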
Inference for the linear regression
The properties of least squares estimators depend on the
assumptions on ε
The typical software output is computed under the
assumptions:
The linear model is the true data-generating process
The regression error has mean zero
Error and regressor (X) are not correlated
The error variance is constant over all observations
The correlation among errors is zero
The errors are normal or the sample is large enough
SPSS output
Model summary
Model 1: R = .843; R-squared = .711; adjusted R-squared = .697; std. error of the estimate = .87054; Durbin-Watson = 2.156

ANOVA
Regression: sum of squares 39.085, df 1, mean square 39.085, F = 51.574, Sig. = .000
Residual: sum of squares 15.915, df 21, mean square .758
Total: sum of squares 54.999, df 22

Coefficients
(Constant): B = 18.088, std. error = 1.240, t = 14.586, Sig. = .000
MORTALITÀ: B = -.882, std. error = .123, standardized Beta = -.843, t = -7.182, Sig. = .000
Multiple linear regression
$$Y = \alpha + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k + \varepsilon, \qquad E(\varepsilon \mid X) = 0$$
or equivalently
$$E(Y \mid X_1, \dots, X_k) = \alpha + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k$$
$$y_i = \alpha + \beta_1 x_{1i} + \beta_2 x_{2i} + \dots + \beta_k x_{ki} + \varepsilon_i, \qquad i = 1, 2, \dots, n$$
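A sketch of fitting such a model and obtaining output analogous to the SPSS tables above (coefficients, standard errors, t, p-values, R², F, Durbin-Watson), using statsmodels on simulated data:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 100
X = rng.normal(size=(n, 2))     # two hypothetical regressors
y = 1.0 + 0.5 * X[:, 0] - 0.8 * X[:, 1] + rng.normal(scale=0.7, size=n)

model = sm.OLS(y, sm.add_constant(X)).fit()
print(model.summary())   # coefficients, std. errors, t, p-values, R², F, Durbin-Watson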
Collinearity problems
One or more independent variables are linear combinations of the other independent variables.
The regression coefficients do not have unique estimates.
Estimates become unstable.
Diagnostics:
Tolerance: the proportion of the variance of a regressor not explained by the other regressors
Variance inflation factor (VIF) = 1/Tolerance. High values indicate collinearity.
Eigenvalues: problems if the ratio between the largest and the smallest eigenvalue is greater than 30
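A minimal sketch of the tolerance/VIF diagnostic: regress each regressor on the others and compute VIF = 1/(1 - R²). The regressors below are simulated, with x3 built to be almost a linear combination of x1 and x2:

import numpy as np

rng = np.random.default_rng(2)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = x1 + x2 + 0.01 * rng.normal(size=n)   # almost a linear combination -> collinearity
X = np.column_stack([x1, x2, x3])

def vif(X, j):
    """VIF of column j: regress X[:, j] on the other columns (with intercept)."""
    y = X[:, j]
    Z = np.column_stack([np.ones(len(X)), np.delete(X, j, axis=1)])
    coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ coef
    r2 = 1 - resid.var() / y.var()
    return 1 / (1 - r2)          # tolerance = 1 - r2, VIF = 1 / tolerance

for j in range(X.shape[1]):
    print(j, vif(X, j))          # very large VIFs signal collinearity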
Coefficient of determination R2
It is no longer the squared correlation between Y and a single regressor (it is the squared correlation between Y and the fitted values).
Automatic model selection methods are based on the increment of the R2.
Always in the range 0-1
Can still be interpreted as fraction of Var(Y) explained by
the regressors
Residual diagnostics: normality
• Histogram
• Normality plots
• Kolmogorov-Smirnov, Shapiro-Wilk or Jarque-Bera tests (see the sketch below)
If the population of the dependent variable is not Gaussian, TRANSFORM:
- log(Y)
- √Y
[Figures: histogram of the standardized regression residuals; normal P-P plot of observed vs. expected cumulative probabilities]
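A sketch of the normality tests listed above, applied to (simulated) residuals with scipy.stats:

import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
e = rng.normal(size=100)   # stand-in for regression residuals

print(stats.shapiro(e))    # Shapiro-Wilk: statistic and p-value
# KS test against a normal with the residuals' own mean and std
# (strictly, using estimated parameters calls for the Lilliefors correction)
print(stats.kstest(e, 'norm', args=(e.mean(), e.std(ddof=1))))
print(stats.jarque_bera(e))  # Jarque-Bera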
Residual diagnostics: homoscedasticity
The variance of the residuals must be constant
• Plot residuals * predicted values (or the dependent variable)
TRANSFORMS:
- log(Y)
- √Y
[Figure: studentized deleted residuals vs. the dependent variable]
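A sketch of the residuals-vs-predicted plot used to inspect homoscedasticity, on simulated heteroscedastic data:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(4)
x = rng.uniform(0, 10, 200)
y = 1 + 0.5 * x + rng.normal(scale=0.2 * (1 + x), size=200)   # variance grows with x

b, a = np.polyfit(x, y, 1)    # slope, intercept
fitted = a + b * x
resid = y - fitted

plt.scatter(fitted, resid, s=10)
plt.axhline(0, color="grey")
plt.xlabel("Predicted values")
plt.ylabel("Residuals")
plt.show()                    # a funnel shape signals non-constant variance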
Residual diagnostics: linearity
Linearity check
• Plot residuals * predicted values
TRANSFORMS:
- Y = A·X^B  →  log(Y) = log(A) + B·log(X), that is Y1 = A1 + B·X1 (with Y1 = log Y, A1 = log A, X1 = log X)
[Figure: studentized deleted residuals vs. the dependent variable]
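A sketch of the linearizing transform above: fit Y = A·X^B by ordinary least squares on the log-log scale (simulated data):

import numpy as np

rng = np.random.default_rng(5)
x = rng.uniform(1, 100, 200)
A_true, B_true = 2.0, 0.7
y = A_true * x ** B_true * np.exp(rng.normal(scale=0.1, size=200))  # multiplicative error

# log(Y) = log(A) + B * log(X): a simple linear regression in the transformed variables
B_hat, logA_hat = np.polyfit(np.log(x), np.log(y), 1)
print(np.exp(logA_hat), B_hat)   # estimates of A and B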
Residual diagnostics: zero correlation
• Plot residuals * observation sequence number
• Durbin-Watson test (a sketch of the statistic follows the figure below)
Remedy when the errors are correlated: GLS (generalized least squares)
[Figure: studentized deleted residuals vs. observation sequence number]
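A sketch of the Durbin-Watson statistic computed from its definition on simulated residuals taken in observation order; values near 2 indicate no first-order autocorrelation:

import numpy as np

rng = np.random.default_rng(6)
e = rng.normal(size=200)   # stand-in for residuals in observation order

# DW = sum_t (e_t - e_{t-1})^2 / sum_t e_t^2
dw = np.sum(np.diff(e) ** 2) / np.sum(e ** 2)
print(dw)   # statsmodels.stats.stattools.durbin_watson gives the same value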
Logistic Regression
Binary logistic regression
It models the probability of a dependent variable
belonging to a group as a function of regressors.
For example: credit scoring.
Why we cannot use the linear model for a
binary response
Nothing guarantees that the predicted probability will lie in the range [0, 1].
In fact, the range of the linear regression function is R, the real numbers.
The idea of logistic regression is to use a function that maps R into [0, 1]: the logistic function (the inverse of the logit).
The logistic function
$$y = \frac{1}{1 + \exp(-x)}$$
The logistic regression model
$$\Pr\{Y = 1 \mid X_1, \dots, X_k\} = \frac{1}{1 + \exp(-\alpha - \beta_1 X_1 - \dots - \beta_k X_k)}$$
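A sketch of fitting this model by maximum likelihood with statsmodels, on simulated (hypothetical credit-scoring-style) data:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
n = 500
X = rng.normal(size=(n, 2))       # two hypothetical regressors
p = 1 / (1 + np.exp(-(-0.5 + 1.2 * X[:, 0] - 0.8 * X[:, 1])))
y = rng.binomial(1, p)            # binary response

logit_model = sm.Logit(y, sm.add_constant(X)).fit()
print(logit_model.summary())       # coefficients with Wald z statistics and p-values
print(np.exp(logit_model.params))  # odds ratios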
Binary regressors
The coefficient of a binary regressor can be interpreted as a log-odds, that is, the logarithm of the ratio of the probabilities of two complementary events.
Indeed, if
$$\mathrm{logit}(p) = \alpha + \beta_1 x_1, \quad \text{with } x_1 \text{ taking the value 0 or 1},$$
then
$$\alpha + \beta_1 x_1 = \log\!\left(\frac{p}{1 - p}\right)$$
and
$$\exp(\alpha)\exp(\beta_1 x_1) = \frac{p}{1 - p}.$$
So, if β1 is greater than zero, the probability of belonging to the response group increases when x1 = 1.
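A small worked example with hypothetical coefficients: with α = -1 and β1 = 0.7,
for x1 = 0 the odds are exp(-1) ≈ 0.37, so p ≈ 0.27;
for x1 = 1 the odds are exp(-1 + 0.7) = exp(-0.3) ≈ 0.74, so p ≈ 0.43.
Passing from x1 = 0 to x1 = 1 multiplies the odds by exp(β1) = exp(0.7) ≈ 2.0.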
Estimates
Nonlinear maximum likelihood (numerical optimization).
The Wald statistic tells us whether each coefficient is significant; a p-value is provided (analogous to the t test, but one-tailed).
Maximum likelihood estimates are those that make the observed sample most likely.
Nonlinear regression
Nonlinear regression
Some nonlinear models can be linearized through transformations (for example, the logarithm).
Other models cannot be linearized, but a nonlinear regression can be carried out using numerical optimization.
For example, the following function cannot be linearized:
$$Y_t = \frac{c}{1 + e^{a + bt}} + \varepsilon_t$$
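A sketch of fitting this curve by numerical nonlinear least squares with scipy.optimize.curve_fit, on simulated data:

import numpy as np
from scipy.optimize import curve_fit

def logistic_curve(t, a, b, c):
    """Y_t = c / (1 + exp(a + b*t))"""
    return c / (1 + np.exp(a + b * t))

rng = np.random.default_rng(8)
t = np.arange(50, dtype=float)
y = logistic_curve(t, 3.0, -0.2, 10.0) + rng.normal(scale=0.3, size=t.size)

params, cov = curve_fit(logistic_curve, t, y, p0=[1.0, -0.1, y.max()])
print(params)   # estimates of a, b, c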