More on Venn Diagrams for Regression
Peter E. Kennedy
Journal of Statistics Education, Volume 10, Number 1, May 2002, p. 1-10
What (abstract)
The main contribution of this paper is a set of suggestions for how Venn diagrams can be used effectively to explain results about the bias and variance of coefficient estimates in multiple regression analysis. Previous work, such as Ip (2001), has been limited to R², partial correlation, and sums of squares in the presence of suppressor variables. This article presents a different interpretation of Venn diagrams, one that highlights illustrations of bias and variance.
Methodology/Model/Data-
Regression with a single explanatory variable X
[Figure from the main article]
Y = variation in Y
X = variation in X
Purple = variation common to Y and X, used to estimate βx
Black = error term (σ²). The magnitude of this area represents the magnitude of the OLS estimate of σ², the variance of the error term.
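The link between the unexplained (Black) area and the estimate of σ² can be sketched numerically. The snippet below is a minimal illustration, not taken from the paper: the function name `simple_ols`, the inclusion of an intercept, and the simulated data are all assumptions made for the example. It computes the slope from the variation X and Y have in common, and estimates σ² from the residual variation divided by the degrees of freedom.

```python
import numpy as np

def simple_ols(x, y):
    """OLS of y on an intercept and a single regressor x (hypothetical helper).

    Returns the slope estimate and the OLS estimate of sigma^2.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(y)
    x_dev = x - x.mean()
    y_dev = y - y.mean()
    # Slope: variation common to X and Y, scaled by the variation in X
    slope = (x_dev @ y_dev) / (x_dev @ x_dev)
    intercept = y.mean() - slope * x.mean()
    residuals = y - intercept - slope * x
    # Unexplained variation in Y divided by degrees of freedom (n - 2)
    sigma2_hat = (residuals @ residuals) / (n - 2)
    return slope, sigma2_hat

# Made-up data for illustration only (true slope 2, true sigma^2 = 1)
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 1.0 + 2.0 * x + rng.normal(size=200)
slope, sigma2_hat = simple_ols(x, y)
print(slope, sigma2_hat)
```

The larger the part of Y left unexplained by X, the larger `sigma2_hat` becomes, which is the numerical counterpart of a larger Black area.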
Regression with more than one explanatory variable
[Figure from the main article]
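If Y is regressed on X alone, βx is estimated from Blue+Red.
If Y is regressed on W alone, βw is estimated from Green+Red.
If Y is regressed on X and W together, there are several options:
1- βx from Blue+Red and βw from Green+Red, or
2- βx from Blue and βw from Green, or
3- divide Red into two parts, or split it in some other way, to calculate βx and βw.
The best choice is the second: leave the Red area out, so that Blue+Yellow and Orange+Blue represent Y and X respectively once the influence of W is removed. The Red area is variation in Y that X and W share jointly, so it cannot be attributed to either variable alone, and using it may result in biased estimates.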
Here the Yellow area represents the magnitude of σ², the variance of the error term. OLS uses the magnitude of the area of Y that cannot be explained to estimate σ².
Multicollinearity
The coefficient estimates remain unbiased because, in both figures, only the Blue and Green areas are used. However, the variance of the estimates increases because the Blue and Green areas shrink as the overlap between X and W grows.
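The variance side of this argument can be checked with the standard OLS formula Var(b) = σ²(Z'Z)⁻¹. The sketch below is only an illustration under assumed settings: the function name `coef_variance_x`, the sample size, the value of σ², and the way W is constructed from X are all hypothetical choices, not something taken from the paper. It shows only the variance effect, not unbiasedness.

```python
import numpy as np

def coef_variance_x(rho, n=500, sigma2=1.0, seed=0):
    """Sampling variance of the OLS coefficient on X when Y is regressed
    on an intercept, X and W (hypothetical helper).

    W is built so that its correlation with X is roughly rho.
    """
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)
    e = rng.normal(size=n)
    # Larger rho means a larger overlap between the X and W circles
    w = rho * x + np.sqrt(1.0 - rho**2) * e
    Z = np.column_stack([np.ones(n), x, w])
    cov = sigma2 * np.linalg.inv(Z.T @ Z)
    return cov[1, 1]  # variance of the coefficient on X

for rho in (0.0, 0.5, 0.9, 0.99):
    print(rho, coef_variance_x(rho))
```

As `rho` rises, the part of X that is not shared with W (the basis for the Blue area) gets smaller, and the printed variance grows accordingly.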
Omitting a Relevant Explanatory Variable
[Figure generated by author]
Suppose W is omitted. The estimate of βx is biased because both the Blue and Red areas are used, but its variance decreases.
If X and W are orthogonal, that is, X and W do not overlap, the estimate remains unbiased and the variance is unaffected. We may remove the W variable if it is highly collinear with X.
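The bias from omitting W can be made concrete with a standard in-sample identity: the coefficient on X in the short regression equals the coefficient on X in the full regression plus the coefficient on W times the slope from regressing W on X. The sketch below uses simulated data; the helper `ols`, the true coefficients, and the strength of the X-W overlap are assumptions chosen for illustration.

```python
import numpy as np

def ols(y, *regressors):
    """OLS coefficients of y on an intercept plus the given regressors
    (hypothetical helper). Returns [intercept, coef_1, coef_2, ...]."""
    Z = np.column_stack([np.ones(len(y)), *regressors])
    coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return coef

# Made-up data: X and W overlap (they are not orthogonal)
rng = np.random.default_rng(1)
n = 1000
x = rng.normal(size=n)
w = 0.6 * x + rng.normal(size=n)
y = 1.0 + 2.0 * x + 3.0 * w + rng.normal(size=n)

b_long = ols(y, x, w)    # full regression: [intercept, bx, bw]
b_short = ols(y, x)      # W omitted: [intercept, bx]
delta = ols(w, x)[1]     # slope from regressing W on X

# Short-regression bx equals long bx plus bw * delta
print(b_long[1], b_short[1], b_long[1] + b_long[2] * delta)
```

When X and W are orthogonal, `delta` is zero and the two estimates of βx coincide, which matches the no-overlap case described above.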
Detrending Data
W is a time trend. How will the results be affected if it is removed? Remove it by regressing detrended Y on detrended X. Note that X and W are not orthogonal. According to the data used…
1- Regress y on X and W to obtain b* and its estimated variance vb*.
2- Regress X on W, save the residual r, then regress y on r to obtain c*, the estimated coefficient on r, and its estimated variance vc*.
3- Regress y on W, save the residual s, then regress s on r to obtain d*, the estimated coefficient on r, and its estimated variance vd*.
Coeff. | Est. | Est. var
b* | 1.129427 | 0.00210754 (vb*)
c* | 1.129427 | 0.00987857 (vc*)
d* | 1.129427 | 0.00208904 (vd*)
b* = the usual OLS estimate
r = Orange+Blue (the part of X that cannot be explained by W)
s = Blue+Yellow (the part of y that cannot be explained by W)
Overlap of s and r = Blue
Regressing s on r (Blue+Yellow on Orange+Blue) uses the same information as estimating b* and c*, so the three coefficient estimates are identical.
But the estimated variances vb*, vc* and vd* are different.
Why?
The true variances are equal, but the estimated variances are not. vb* and vd* are nearly equal because, in both cases, the unexplained variation is measured by the magnitude of the Yellow area. vc* is comparatively high because, when y is regressed on r, the unexplained variation in y is the Yellow+Red+Green area, so σ² is overestimated. That is why vc* is greater than vb* and vd*.
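The three detrending regressions can be reproduced on simulated data to see the same pattern: identical coefficient estimates, with vc* noticeably larger than vb* and vd*. This is a sketch under stated assumptions, not the data or code from the paper: the helper `fit`, the inclusion of an intercept in every regression, the sample size, and the simulated series are all hypothetical.

```python
import numpy as np

def fit(y, Z):
    """OLS of y on the columns of Z (hypothetical helper).

    Returns coefficients, their estimated variances, and the residuals.
    """
    n, k = Z.shape
    coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ coef
    sigma2_hat = (resid @ resid) / (n - k)   # unexplained variation / d.o.f.
    var = sigma2_hat * np.diag(np.linalg.inv(Z.T @ Z))
    return coef, var, resid

# Made-up data: W is a time trend and X is correlated with it
rng = np.random.default_rng(2)
n = 120
ones = np.ones(n)
w = np.arange(n, dtype=float)
x = 0.05 * w + rng.normal(size=n)
y = 1.0 + 1.1 * x + 0.2 * w + rng.normal(size=n)

# Step 1: y on X and W gives b* and vb*
coef_b, var_b, _ = fit(y, np.column_stack([ones, x, w]))
b, vb = coef_b[1], var_b[1]

# Step 2: X on W gives residual r; y on r gives c* and vc*
_, _, r = fit(x, np.column_stack([ones, w]))
coef_c, var_c, _ = fit(y, np.column_stack([ones, r]))
c, vc = coef_c[1], var_c[1]

# Step 3: y on W gives residual s; s on r gives d* and vd*
_, _, s = fit(y, np.column_stack([ones, w]))
coef_d, var_d, _ = fit(s, np.column_stack([ones, r]))
d, vd = coef_d[1], var_d[1]

print(b, c, d)      # the three coefficient estimates coincide
print(vb, vc, vd)   # vc is the largest; vb and vd are close
```

In this sketch, steps 1 and 3 leave the same residual variation and differ only in the degrees of freedom used to scale it, which is why `vb` and `vd` are close. Step 2 leaves the trend-related variation in y unexplained, so its estimate of σ² and therefore `vc` is inflated.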
Conclusion
The main contribution of this paper is to set out some effective ways of using Venn diagrams when teaching regression analysis. There are cases where the Ballentine (Venn diagram) can mislead about OLS, but for standard analysis Kennedy himself highly recommends it.