Regression & co.
Matteo Pelagatti, Università di Milano-Bicocca

Summary
- Correlation and partial correlation
- Linear regression
- Logistic regression
- Non-linear regression
- Principal components
- Factor analysis

Correlation

Covariance and linear correlation
Both are measures of linear dependence. The second is a rescaled version of the first, taking values in the range [-1, +1].

Covariance (and sample covariance):

\sigma_{xy} = \mathrm{cov}(X, Y) = \frac{1}{n} \sum_{i=1}^{n} (X_i - \mu_x)(Y_i - \mu_y), \qquad S_{xy} = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})

Correlation (and sample correlation):

\rho_{xy} = \frac{\sigma_{xy}}{\sigma_x \sigma_y}, \qquad r_{xy} = \frac{S_{xy}}{S_x S_y}

Does the line represent the relation well?
(Scatter plots with fitted lines and illustrations of the cross-products.)

Extremes of the correlation coefficient
- +1 = perfect linear correlation (positive slope)
- 0 = absence of linear correlation
- -1 = perfect linear correlation (negative slope)

Correlation in short

\rho_{xy} = \frac{\sigma_{xy}}{\sigma_x \sigma_y}, \qquad r_{xy} = \frac{s_{xy}}{s_x s_y}, \qquad -\sigma_x \sigma_y \le \sigma_{xy} \le \sigma_x \sigma_y \;\Rightarrow\; -1 \le \rho \le 1

- Symmetric
- Pure number (no unit of measurement)
- It measures the intensity and the sign of the linear relation between two variables.

Partial correlation
It is the correlation between two variables X and Y once X and Y have been "cleaned" of the correlation with other variables Z1, Z2, ... If X and Y are not correlated with Z1, Z2, ..., their correlation and partial correlation are equal. How to eliminate the correlation of Z1, Z2, ... with X and Y will become clear after treating linear regression.

Linear regression

Simple linear regression
It models the dependence of a variable Y as a function of a variable X. It can be descriptive, explanatory, or predictive.

Simple linear model

Y = \alpha + \beta X + \varepsilon, \qquad E(\varepsilon \mid X) = 0

or, equivalently,

E(Y \mid X) = \alpha + \beta X, \qquad y_i = \alpha + \beta x_i + \varepsilon_i, \quad i = 1, 2, \dots, n

Graphical representation of the model
(Figure: fitted line and residuals.)

Least squares
We seek the values of the coefficients that minimize the sum of squared residuals:

e_i = y_i - a - b x_i, \qquad SS(a, b) = \sum_{i=1}^{n} (y_i - a - b x_i)^2

Solution:

\hat{\beta} = \frac{s_{xy}}{s_x^2}, \qquad \hat{\alpha} = \bar{y} - \hat{\beta} \bar{x}

Possible problems
- Non-linearity
- Aberrant observations
Both are detected through residual analysis.

The coefficient of determination R²
- It measures the goodness of fit of the line to the data.
- In the simple regression model it is the squared correlation between X and Y.
- It can be interpreted as the fraction of the variance of Y explained by X.
- It can be computed as 1 - var(e)/var(Y).
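To make the formulas above concrete, here is a minimal sketch in Python (NumPy assumed available). The data are simulated purely for illustration; the code follows the slide formulas for the sample covariance, the correlation, the least squares solution, and R², and checks that in simple regression R² equals the squared correlation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data for illustration only: y depends linearly on x plus noise.
x = rng.normal(size=50)
y = 2.0 - 0.5 * x + rng.normal(scale=0.3, size=50)

# Sample covariance S_xy (denominator n - 1) and sample correlation r_xy.
s_xy = np.sum((x - x.mean()) * (y - y.mean())) / (len(x) - 1)
r_xy = s_xy / (x.std(ddof=1) * y.std(ddof=1))

# Least squares solution from the slides:
# beta_hat = s_xy / s_x^2,  alpha_hat = y_bar - beta_hat * x_bar.
beta_hat = s_xy / x.var(ddof=1)
alpha_hat = y.mean() - beta_hat * x.mean()

# Residuals and coefficient of determination R^2 = 1 - var(e) / var(Y).
e = y - alpha_hat - beta_hat * x
r2 = 1 - e.var(ddof=1) / y.var(ddof=1)

print(f"r = {r_xy:.3f}  alpha = {alpha_hat:.3f}  beta = {beta_hat:.3f}  R2 = {r2:.3f}")
print("R2 equals r^2 in simple regression:", np.isclose(r2, r_xy**2))
```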
Inference for the linear regression
The properties of the least squares estimators depend on the assumptions on ε. The typical software output is computed under the following assumptions:
- The linear model is the real data generation process.
- The regression error has mean zero.
- Error and regressors (X) are not correlated.
- The error variance is constant over all observations.
- The correlation among errors is zero.
- The errors are normal, or the sample is large enough.

SPSS output (dependent variable regressed on MORTALITÀ)

Model summary: R = .843, R² = .711, adjusted R² = .697, std. error of the estimate = .87054, Durbin-Watson = 2.156.

ANOVA:
- Regression: sum of squares 39.085, df 1, mean square 39.085, F = 51.574, Sig. = .000
- Residual: sum of squares 15.915, df 21, mean square .758
- Total: sum of squares 54.999, df 22

Coefficients:
- (Constant): B = 18.088, std. error = 1.240, t = 14.586, Sig. = .000
- MORTALITÀ: B = -.882, std. error = .123, standardized Beta = -.843, t = -7.182, Sig. = .000

Multiple linear regression

Y = \alpha + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k + \varepsilon, \qquad E(\varepsilon \mid X) = 0

or, equivalently,

E(Y \mid X_1, \dots, X_k) = \alpha + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k

y_i = \alpha + \beta_1 x_{1i} + \beta_2 x_{2i} + \dots + \beta_k x_{ki} + \varepsilon_i, \qquad i = 1, 2, \dots, n

Collinearity problems
One or more independent variables are linear combinations of the other independent variables. The regression coefficients then do not have unique estimates, and the estimates become unstable. Diagnostics:
- Tolerance: the proportion of the variance of a variable not explained by the other variables.
- Variance inflation factor (VIF) = 1/Tolerance. High values indicate collinearity.
- Eigenvalues: there are problems if the ratio between the largest and the smallest eigenvalue is greater than 30.

Coefficient of determination R²
- It is no longer the square of the correlation (of which pair of variables?).
- Automatic model selection methods are based on the increment of the R².
- It is always in the range 0-1.
- It can still be interpreted as the fraction of Var(Y) explained by the regressors.

Residual diagnostics: normality
- Histogram
- Normality plots
- Kolmogorov-Smirnov, Shapiro-Wilk or Jarque-Bera tests
(Figures: histogram of the standardized residuals and normal probability plot, expected vs. observed cumulative probability.)
If the population of the dependent variable is not Gaussian, transform: log(Y), √Y.

Residual diagnostics: homoscedasticity
The variance of the residuals must be constant.
- Plot residuals against predicted values (or against the dependent variable).
(Figure: studentized deleted residuals against the dependent variable.)
Transforms: log(Y), √Y.

Residual diagnostics: linearity
Linearity check:
- Plot residuals against predicted values.
(Figure: studentized deleted residuals against the dependent variable.)
Transform: Y = A·X^B, so that log(Y) = log(A) + B·log(X), that is, Y₁ = A₁ + B·X₁ with Y₁ = log(Y), X₁ = log(X), A₁ = log(A).

Residual diagnostics: zero correlation
- Plot residuals against the observation sequence.
- Durbin-Watson test.
(Figure: studentized deleted residuals against the sequence number.)
If the errors are correlated, use GLS (generalized least squares).

Logistic regression

Binary logistic regression
It models the probability that the dependent variable belongs to a group as a function of the regressors; credit scoring is a typical example.

Why we cannot use the linear model for a binary response
Nothing guarantees that the predicted probability would lie in the range [0, 1]: in fact, the range of the linear regression is ℝ, the real numbers. The idea of logistic regression is to use a function that maps ℝ into [0, 1], namely the logistic function.

Logistic function

y = \frac{1}{1 + \exp(-x)}

The logistic regression model

\Pr\{Y = 1 \mid X_1, \dots, X_k\} = \frac{1}{1 + \exp(-\alpha - \beta_1 X_1 - \dots - \beta_k X_k)}

Binary regressors
The coefficient of a binary regressor can be interpreted through the log-odds, that is, the logarithm of the ratio of the probabilities of two complementary events. Indeed, if

p = \frac{1}{1 + \exp(-\alpha - \beta_1 x_1)}

with x_1 taking the values 0 or 1, then

\alpha + \beta_1 x_1 = \log \frac{p}{1 - p}

and

\exp(\alpha) \exp(\beta_1 x_1) = \frac{p}{1 - p}

So, if β₁ is greater than zero, the probability of belonging to the response group increases when x₁ is true.

Estimates
The estimates are obtained by maximum likelihood through numerical optimization: the maximum likelihood estimates are those that make our sample the most likely. The Wald statistic tells us whether each coefficient is significant, and a p-value is provided (like the t test, but one-tailed).
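As a concrete illustration of maximum likelihood estimation and the Wald statistics in the logistic model, here is a minimal sketch in Python, assuming the statsmodels package is available; the variable names (income, owner) and all numbers are invented for illustration.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

# Invented credit-scoring style data: one numeric and one binary regressor.
n = 500
income = rng.normal(size=n)           # standardized income (hypothetical)
owner = rng.integers(0, 2, size=n)    # binary regressor, e.g. home owner

# True model: Pr(Y=1 | X) = 1 / (1 + exp(-(-0.5 + 1.0*income + 0.8*owner))).
eta = -0.5 + 1.0 * income + 0.8 * owner
p = 1 / (1 + np.exp(-eta))
y = rng.binomial(1, p)

# Maximum likelihood fit via numerical optimization.
X = sm.add_constant(np.column_stack([income, owner]))
fit = sm.Logit(y, X).fit(disp=False)
print(fit.summary())  # coefficients with Wald z statistics and p-values

# The coefficient of the binary regressor is a log-odds difference:
# exp(beta) multiplies the odds p/(1-p) when owner switches from 0 to 1.
print("odds ratio for owner:", np.exp(fit.params[2]))
```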
Non-linear regression

Some non-linear models can be linearized through transformations (for example, the logarithm). Other models cannot be linearized, but a non-linear regression can be carried out using numerical optimization. For example, the following function cannot be linearized:

Y_t = \frac{c}{1 + e^{a + b t}} + \varepsilon_t
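Here is a minimal sketch of how such a logistic trend can be fitted by numerical optimization in Python, assuming SciPy is available; the data and starting values are invented for illustration.

```python
import numpy as np
from scipy.optimize import curve_fit

def logistic_trend(t, c, a, b):
    """Logistic trend from the slide: Y_t = c / (1 + exp(a + b*t))."""
    return c / (1 + np.exp(a + b * t))

rng = np.random.default_rng(0)

# Simulated growth curve: saturation level c = 10, a = 3, b = -0.1.
t = np.arange(100, dtype=float)
y = logistic_trend(t, 10.0, 3.0, -0.1) + rng.normal(scale=0.2, size=t.size)

# Non-linear least squares via numerical optimization;
# reasonable starting values (p0) matter for convergence.
start = (y.max(), 1.0, -0.05)
params, cov = curve_fit(logistic_trend, t, y, p0=start)
print("c, a, b =", params)
print("approximate std. errors:", np.sqrt(np.diag(cov)))
```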