Election Data Summary
The columns in the data include:

Field Name     | Description
---------------|-------------------------------------------------------------------
county         | County name
b2000          | Raw votes for Bush, 2000
g2000          | Raw votes for Gore, 2000
b2004          | Raw votes for Bush, 2004
k2004          | Raw votes for Kerry, 2004
etouch         | Dummy variable: 1 if electronic voting, 0 otherwise
median_income  | Median income in county
hispanic       | Total Hispanic population in county
b00pc          | % voted for Bush 2000 (b2000/votes00)
b04pc          | % voted for Bush 2004 (b2004/votes04)
b_change       | Change in % voted for Bush between 2000 and 2004 (b04pc - b00pc)
b00pc_sq       | % voted for Bush 2000, squared
b00pc_e        | Interaction effect between b00pc and etouch (b00pc*etouch)
b00pcsq_e      | Interaction effect between b00pc_sq and etouch (b00pc_sq*etouch)
votes00        | Total votes cast in 2000
votes04        | Total votes cast in 2004
v_change       | Change in voter turnout from 2000 to 2004 (votes04 - votes00)
# read the data in election.txt into an R data frame object called election
> election <- read.table("election.txt", header=TRUE)
Consider “Change in % voted for Bush between 2000 and 2004” (b_change) to be the dependent variable in the data. The factors affecting this change in votes may be attributed to the following independent variables: etouch, median_income, hispanic, v_change.
A summary of the election data for the variables of interest is shown below.
>summary(election)
 median_income     hispanic          b_change            v_change     
 Min.   :26032   Min.   :0.01500   Min.   :-0.02957   Min.   :   663  
 1st Qu.:30029   1st Qu.:0.02700   1st Qu.: 0.01833   1st Qu.:  2250  
 Median :33779   Median :0.04900   Median : 0.03506   Median : 11056  
 Mean   :35385   Mean   :0.08528   Mean   : 0.03702   Mean   : 24236  
 3rd Qu.:40249   3rd Qu.:0.09400   3rd Qu.: 0.05388   3rd Qu.: 35385  
 Max.   :52244   Max.   :0.57300   Max.   : 0.10710   Max.   :116327  
Plotting the raw distributions of these variables gives a first look at their shapes.
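A minimal sketch of how such plots might be produced with base-R histograms (the 2 x 2 layout and the choice of hist() are assumptions):
> # sketch: histograms of the raw distributions of the variables of interest
> par(mfrow = c(2, 2))
> hist(election$median_income, main = "median_income", xlab = "median_income")
> hist(election$hispanic,      main = "hispanic",      xlab = "hispanic")
> hist(election$b_change,      main = "b_change",      xlab = "b_change")
> hist(election$v_change,      main = "v_change",      xlab = "v_change")
> par(mfrow = c(1, 1))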
------------------------------------------------------------------------------------------------------------
Title: Computing the mean with confidence intervals
Definition and Motivation:
The arithmetic mean of a collection is the sum of the elements in that collection divided by the number of elements in that collection (including duplicates). The goal of computing the mean of a randomly selected subset S of a larger set S' is to estimate the mean of S'. By computing confidence intervals on the mean of S, we attempt to bound the interval in which the mean of S' is likely to lie.
Pseudo-Code:
Let n be the size of S (including duplicates)
BagOfResults := empty collection                 // note that duplicates will be allowed
do LARGENUMBER times                             // LARGENUMBER should be something like 1000 or more
    E := take n elements of S with replacement
    add mean(E) to BagOfResults
end
SequenceOfResults := sort BagOfResults from smallest to largest
low  := the element at position 0.025*LARGENUMBER of SequenceOfResults    // low end of 95% confidence interval
high := the element at position 0.975*LARGENUMBER of SequenceOfResults    // high end of 95% confidence interval
return (low, high, mean(S))
R-Code:
# define a mean function to be used by the call to boot
> mean_fun <- function(data, indices){
+   mean(data[indices])
+ }
# load the boot library, then make the call to boot; boot() resamples the data
# with replacement and applies mean_fun to each sample
> library(boot)
> boot_b_change_mean <- boot(data=election[,"b_change"], statistic=mean_fun, R=999)
> boot_b_change_mean
ORDINARY NONPARAMETRIC BOOTSTRAP
Call:
boot(data = election[, "b_change"], statistic = mean_fun, R = 999)
Bootstrap Statistics :
original bias std. error
t1* 0.03702145 -0.0001289764 0.003587013
The "original" value, 0.037021, is simply the mean of b_change computed on the observed data; the bias and standard error describe how the resampled means vary around it. To visualize the results we can use the plot function in R, which accepts the result of the boot call.
>plot(boot_b_change_mean)
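As a quick check, the standard error reported by boot is simply the standard deviation of the bootstrap replicates; a one-line sketch (using the boot_b_change_mean object created above):
> # bootstrap standard error = standard deviation of the 999 resampled means
> sd(boot_b_change_mean$t[,1])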
Finally, we compute the confidence interval of the mean.
> boot.ci(boot.out = boot_b_change_mean, conf = 0.95, type = "perc")
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 999 bootstrap replicates
CALL :
boot.ci(boot.out = boot_b_change_mean, conf = 0.95, type = "perc")
Intervals :
Level Percentile
95% ( 0.0297, 0.0442 )
Calculations and Intervals on Original Scale
------------------------------------------------------------------------------------------------------------
Title: Calculating the difference between the means of two groups
Definition and Motivation:
In this case we can split the data set into two groups: the counties that used electronic voting and those that did not (i.e. etouch = 1 vs etouch = 0). We then calculate the bootstrap difference in the means of the two groups. The 95% confidence interval gives a bound on the difference of the means.
Pseudo-Code:
Let n1 be the size of S1                         // S1 is the group with etouch = 0
Let n2 be the size of S2                         // S2 is the group with etouch = 1
BagOfResults := empty collection
do LARGENUMBER times                             // LARGENUMBER should be something like 1000 or more
    E := take n1 elements of S1 with replacement
    F := take n2 elements of S2 with replacement
    add mean(E) - mean(F) to BagOfResults
end
SequenceOfResults := sort BagOfResults from smallest to largest
low  := the element at position 0.025*LARGENUMBER of SequenceOfResults    // low end of 95% confidence interval
high := the element at position 0.975*LARGENUMBER of SequenceOfResults    // high end of 95% confidence interval
return (low, high, mean(S1) - mean(S2))
R-Code:
# for ease of computation and analysis of results, create a data set containing only the dependent and independent variables (county, etouch, median_income, hispanic, v_change and b_change)
> election_subset <- data.frame(county=election[,"county"],etouch=election[,"etouch"],median_income=election[,"median_income"],hispanic=election[,"hispanic"],v_change=election[,"v_change"],b_change=election[,"b_change"])
# define a function that calculates the difference between the means of the two groups;
# this function is then passed to the call to boot
> mean_diff_fun <- function(data, indices){
+   m1 <- mean(data[indices,"b_change"][data[indices,"etouch"]==0])   # mean b_change, no e-voting
+   m2 <- mean(data[indices,"b_change"][data[indices,"etouch"]==1])   # mean b_change, e-voting
+   m1 - m2                                                           # return the difference
+ }
# finally, make the call to boot
> boot_mean_diff <- boot(data=election_subset, statistic=mean_diff_fun, R=999)
> boot_mean_diff
ORDINARY NONPARAMETRIC BOOTSTRAP
Call:
boot(data = election_subset, statistic = mean_diff_fun, R = 999)
Bootstrap Statistics :
original bias std. error
t1* 0.01646996 0.0001910999 0.006707707
> plot(boot_mean_diff)
The first plot produced above shows the distribution of the difference of the means across the 999 bootstrap samples. The observed difference of the means is 0.01646996. Now we calculate the confidence interval as follows:
> boot_mean_diff_conf <- boot.ci(boot_mean_diff, conf=0.95, type="perc")
> boot_mean_diff_conf
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 999 bootstrap replicates
CALL :
boot.ci(boot.out = boot_mean_diff, conf = 0.95, type = "perc")
Intervals :
Level Percentile
95% ( 0.0035, 0.0295 )
Calculations and Intervals on Original Scale
This says that with a confidence of 95%, the difference between the means is between 0.0035 and 0.0295.
------------------------------------------------------------------------------------------------------------
Title: Calculating the bootstrap-based and permutation-based empirical p-value for the difference of the means of two groups
Definition and Motivation:
The problem of comparing the means of two groups can also be framed as a hypothesis test. The null hypothesis is that the difference between the means of the two groups (etouch = 0 and etouch = 1) is zero; the alternative hypothesis is that the difference is greater than zero, i.e.
H0: diff = 0
Ha: diff > 0
The probability value (p-value) of a statistical hypothesis test is the probability of obtaining a value of the test statistic (in this case the difference between the means) as extreme as, or more extreme than, the one observed by chance alone, if the null hypothesis H0 is true. Small p-values suggest that the null hypothesis is unlikely to be true; the smaller the p-value, the more convincing the rejection of the null hypothesis.
Pseudo-Code (Bootstrap-based version):
Let n1 be the size of S1                         // S1 is the group with etouch = 0
Let n2 be the size of S2                         // S2 is the group with etouch = 1
BagOfResults := empty collection
do LARGENUMBER times                             // LARGENUMBER should be something like 1000 or more
    E := take n1 elements of S1 with replacement
    F := take n2 elements of S2 with replacement
    add mean(E) - mean(F) to BagOfResults
end
count := 0
for each value in BagOfResults
    if value <= 0
        count := count + 1
pvalue := count / LARGENUMBER
return (pvalue)
R-Code (Bootstrap-based version):
> boot_mean_diff <- boot(data=election_subset, statistic=mean_diff_fun, R=999)
> boot_mean_diff
ORDINARY NONPARAMETRIC BOOTSTRAP
Call:
boot(data = election_subset, statistic = mean_diff_fun, R = 999)
Bootstrap Statistics :
original bias std. error
t1* 0.01646996 0.0001910999 0.006707707
Calculating the p-value
> sum(boot_mean_diff$t <= 0)/999
[1] 0.008008008
The above p-value is very small. This suggests that there is very strong evidence against the null hypothesis that the difference in the means of the two groups (etouch=0 and etouch=1) is zero.
Pseudo-Code (Permutation-based version):
Let S1 be the group with etouch = 0 and S2 the group with etouch = 1
observed_diff := mean(S1) - mean(S2)
BagOfResults := empty collection
do LARGENUMBER times                             // LARGENUMBER should be something like 1000 or more
    randomly permute the etouch column, keeping the b_change column fixed
        // equivalent to resampling without replacement; the relationship between
        // b_change and etouch is broken, so each permutation simulates the null hypothesis
    E := the b_change values whose permuted etouch label is 0
    F := the b_change values whose permuted etouch label is 1
    add mean(E) - mean(F) to BagOfResults
end
count := 0
for each value in BagOfResults
    if value >= observed_diff
        count := count + 1
pvalue := count / LARGENUMBER
return (pvalue)
R-Code (Permutation version):
# the following function does all the work and returns the result of the permutation experiment
> mean_diff_perm_fun <- function(data, m){
+   # observed difference in means between the etouch = 0 and etouch = 1 groups
+   mean_diff <- mean(data[,"b_change"][data[,"etouch"]==0]) - mean(data[,"b_change"][data[,"etouch"]==1])
+   # permute the etouch column (sampling without replacement), breaking its relationship with b_change
+   pdata <- data.frame("b_change"=data[,"b_change"],
+                       "etouch"=sample(data[,"etouch"], length(data[,"etouch"]), replace = FALSE))
+   pmean_diff <- mean(pdata[,"b_change"][pdata[,"etouch"]==0]) - mean(pdata[,"b_change"][pdata[,"etouch"]==1])
+   count <- (pmean_diff >= mean_diff)
+   # repeat for the remaining m-1 permutations
+   for (i in 1:(m-1))
+   {
+     newpdata <- data.frame("b_change"=data[,"b_change"],
+                            "etouch"=sample(data[,"etouch"], length(data[,"etouch"]), replace = FALSE))
+     newpmean_diff <- mean(newpdata[,"b_change"][newpdata[,"etouch"]==0]) - mean(newpdata[,"b_change"][newpdata[,"etouch"]==1])
+     count <- count + (newpmean_diff >= mean_diff)
+     pmean_diff <- c(pmean_diff, newpmean_diff)
+   }
+   pvalue <- count/m
+   list("permuted_mean_diff"=pmean_diff, "mean_diff"=mean_diff, "pvalue"=pvalue)
+ }
> # call the above function with m = 999 permutations
> mean_diff_perm_output <- mean_diff_perm_fun(election_subset,999)
> # the observed difference in means is
> mean_diff_perm_output$mean_diff
[1] 0.01646996
> # the p-value is
> mean_diff_perm_output$pvalue
[1] 0.03003003
The above p-value is still small (although not as small as the bootstrap-based value of 0.008008008). It still provides strong evidence against the null hypothesis that the difference in the means of the two groups (etouch=0 and etouch=1) is zero, suggesting the observed difference is not due to chance alone.
# a plot of the null distribution follows
> hist(mean_diff_perm_output$permuted_mean_diff, breaks=30, freq=FALSE, plot=TRUE)
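To see where the observed difference falls relative to this null distribution, a vertical line can be added to the histogram (a minimal sketch using the objects defined above):
> # mark the observed difference in means on the null distribution
> abline(v = mean_diff_perm_output$mean_diff, col = "red")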
------------------------------------------------------------------------------------------------------------
Title:
Analysis of Variance (ANOVA)
Definition and Motivation:
Pseudo-Code:
R-Code:
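A minimal sketch for this section, assuming the question of interest is whether mean b_change differs between the etouch = 0 and etouch = 1 counties: one-way ANOVA compares the variation in b_change between the groups to the variation within them. The object name election_aov is introduced here for illustration.
> # sketch: one-way ANOVA of b_change by e-voting status (assumed analysis)
> election_aov <- aov(b_change ~ factor(etouch), data = election_subset)
> summary(election_aov)
With only two groups, the F-test produced by summary() is equivalent to a pooled two-sample t-test on the difference of the means examined above.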
------------------------------------------------------------------------------------------------------------
Title:
Partial Correlation
Definition and Motivation:
The partial correlation between two variables is the correlation that remains after the linear effects of the other variables in the data set have been removed. Here the quantities of interest are the partial correlations of the independent variables with b_change. The function below computes them from the inverse of the covariance matrix.
> pcor_fun <- function(data){
+   conc <- solve(var(data))                                           # concentration (inverse covariance) matrix
+   resid.sd <- 1/sqrt(diag(conc))                                     # reciprocal square roots of the diagonal
+   partialcor <- sweep(sweep(conc,1,resid.sd, "*"),2,resid.sd,"*")    # scale rows and columns
+   return(partialcor)
+ }
> pcor_fun(election_subset)
                  etouch median_income   hispanic  v_changepc   b_change
etouch         1.0000000   -0.29885709 -0.13328091  0.1322710 0.07162380
median_income -0.2988571    1.00000000  0.06206307 -0.2267372 0.36928376
hispanic      -0.1332809    0.06206307  1.0000000   0.2444096 0.25903921
v_changepc     0.1322710   -0.22673721  0.24440957  1.0000000 0.04788487
b_change       0.0716238    0.36928376  0.25903921  0.04788487 1.00000000
Pseudo-Code:
R-Code:
> # vectors to hold 1000 bootstrapped partial correlations with b_change
> pcor.etouch <- vector(mode="numeric", length=1000)
> pcor.median_income <- vector(mode="numeric", length=1000)
> pcor.hispanic <- vector(mode="numeric", length=1000)
> pcor.v_changepc <- vector(mode="numeric", length=1000)
> for(b in 1:1000) {
+   data.curr <- election_subset[sample(1:67,67,replace=T),]   # resample the 67 counties with replacement
+   tempres <- pcor_fun(data.curr)
+   pcor.etouch[b] <- tempres["b_change", "etouch"]
+   pcor.median_income[b] <- tempres["b_change", "median_income"]
+   pcor.hispanic[b] <- tempres["b_change", "hispanic"]
+   pcor.v_changepc[b] <- tempres["b_change", "v_changepc"]
+ }
We now have four vectors (one per independent variable), each containing 1000 bootstrapped partial correlations with b_change (the dependent variable). Printing out the first 10 values in the pcor.etouch vector:
> pcor.etouch[1:10]
[1] 0.124255924 -0.006567401 0.009424238 0.048729694 0.101077880 0.177484922 0.016398538 0.098933451
[9] 0.273896994 0.027145799
> hist(pcor.etouch,probability=T)
> lines(density(pcor.etouch))
> quantile(pcor.etouch,c(.025,.975))
2.5% 97.5%
-0.1108112 0.2565338
> quantile(pcor.etouch,c(0,0.95))
0% 95%
-0.2495015 0.2328568
> hist(pcor.median_income,probability=T)
> lines(density(pcor.median_income))
> quantile(pcor.median_income,c(0,0.95))
0% 95%
-0.1403044 0.5454804
> quantile(pcor.median_income,c(.025,.975))
2.5% 97.5%
0.1167792 0.5900816
> hist(pcor.hispanic,probability=T)
> lines(density(pcor.hispanic))
> quantile(pcor.hispanic,c(.025,.975))
2.5% 97.5%
-0.02675132 0.48105924
> quantile(pcor.hispanic,c(0,0.95))
0% 95%
-0.2855233 0.4537965
> hist(pcor.v_changepc,probability=T)
> lines(density(pcor.v_changepc))
> quantile(pcor.v_changepc,c(.025,.975))
2.5% 97.5%
-0.2385491 0.3350396
> quantile(pcor.v_changepc,c(0,0.95))
        0%        95% 
-0.4213978  0.2973756 
------------------------------------------------------------------------------------------------------------
Title:
Linear Regression
Definition and Motivation:
Linear regression models the dependent variable (b_change) as a linear function of the independent variables plus an error term; each estimated coefficient gives the expected change in b_change for a unit change in that variable, holding the other variables fixed.
> election_lm <- lm(b_change ~ etouch + median_income + hispanic + v_changepc, data = election_subset)
> summary(election_lm)
Call:
lm(formula = b_change ~ etouch + median_income + hispanic + v_changepc,
data = election_subset)
Residuals:
Min 1Q Median 3Q Max
-0.075622 -0.013817 0.001171 0.015295 0.061787
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.113e-01 2.196e-02 5.069 3.87e-06 ***
etouch -4.692e-03 8.298e-03 -0.565 0.57383
median_income -1.733e-06 5.539e-07 -3.129 0.00267 **
hispanic -7.116e-02 3.370e-02 -2.112 0.03874 *
v_changepc -2.532e-04 6.708e-04 -0.377 0.70711
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
Residual standard error: 0.02596 on 62 degrees of freedom
Multiple R-Squared: 0.2438, Adjusted R-squared: 0.195
F-statistic: 4.997 on 4 and 62 DF, p-value: 0.001481
> par(mfrow=c(2,2)); plot(election_lm); par(mfrow=c(1,1))
> as.vector(election_lm$coefficients)
[1] 1.112965e-01 -4.691730e-03 -1.733254e-06 -7.116254e-02 -2.532048e-04
> summary(election_lm)$r.squared
[1] 0.2437796
Pseudo-Code:
R-Code:
> regr_lm_fun <- function(data, indices) {
+   data <- data[indices,]                       # resampled rows
+   model <- lm(b_change ~ etouch + median_income + hispanic + v_changepc, data = election_subset)   # fits on the full election_subset, not the resampled data
+   coefficients(model)
+ }
> regr_lm_boot <- boot(election_subset, regr_lm_fun, 1000)
> regr_lm_boot <- boot(election_subset, regr_lm_fun, 999)
> regr_lm_boot
ORDINARY NONPARAMETRIC BOOTSTRAP
Call:
boot(data = election_subset, statistic = regr_lm_fun, R = 999)
Bootstrap Statistics :
        original  bias    std. error
t1*  1.112965e-01      0           0
t2* -4.691730e-03      0           0
t3* -1.733254e-06      0           0
t4* -7.116254e-02      0           0
t5* -2.532048e-04      0           0
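Every coefficient shows zero bias and zero standard error because the statistic function above refits the model on the full election_subset rather than on the resampled rows, so every bootstrap replicate returns the same coefficients. A corrected sketch (regr_lm_fun2 and regr_lm_boot2 are names introduced here for illustration) fits the model to the bootstrap sample on each replicate; its output can then be passed to boot.ci for per-coefficient percentile intervals:
> # corrected statistic function: fit the model to the resampled rows
> regr_lm_fun2 <- function(data, indices) {
+   d <- data[indices, ]                       # the bootstrap sample of counties
+   model <- lm(b_change ~ etouch + median_income + hispanic + v_changepc, data = d)
+   coefficients(model)
+ }
> regr_lm_boot2 <- boot(election_subset, regr_lm_fun2, R = 999)
> # 95% percentile confidence interval for a single coefficient, e.g. etouch (index = 2)
> boot.ci(regr_lm_boot2, conf = 0.95, type = "perc", index = 2)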