Election Data Summary
The columns in the data include:

Field Name     | Description
---------------|-------------------------------------------------------------------
county         | County name
b2000          | Raw votes for Bush, 2000
g2000          | Raw votes for Gore, 2000
b2004          | Raw votes for Bush, 2004
k2004          | Raw votes for Kerry, 2004
etouch         | Dummy variable: 1 if electronic voting, 0 otherwise
median_income  | Median income in county
hispanic       | Total Hispanic population in county
b00pc          | % voted for Bush 2000 (b2000/votes00)
b04pc          | % voted for Bush 2004 (b2004/votes04)
b_change       | Change in % voted for Bush between 2000 and 2004 (b04pc - b00pc)
b00pc_sq       | % voted for Bush 2000, squared
b00pc_e        | Interaction effect between b00pc and etouch (b00pc*etouch)
b00pcsq_e      | Interaction effect between b00pc_sq and etouch (b00pc_sq*etouch)
votes00        | Total votes cast in 2000
votes04        | Total votes cast in 2004
v_change       | Change in voter turnout from 2000 to 2004 (votes04 - votes00)
# read the data in election.txt into an R data frame object called election
> election <- read.table("election.txt", header=TRUE)
Consider “Change in % voted for Bush between 2000 and 2004” (b_change) to be the dependent variable in the data. The factors affecting this change in votes may be attributed to the following independent variables: etouch, median_income, hispanic, v_change.
A summary of the election data for the variables of interest is shown below.
>summary(election)
 median_income     hispanic          b_change            v_change     
 Min.   :26032   Min.   :0.01500   Min.   :-0.02957   Min.   :   663  
 1st Qu.:30029   1st Qu.:0.02700   1st Qu.: 0.01833   1st Qu.:  2250  
 Median :33779   Median :0.04900   Median : 0.03506   Median : 11056  
 Mean   :35385   Mean   :0.08528   Mean   : 0.03702   Mean   : 24236  
 3rd Qu.:40249   3rd Qu.:0.09400   3rd Qu.: 0.05388   3rd Qu.: 35385  
 Max.   :52244   Max.   :0.57300   Max.   : 0.10710   Max.   :116327  
Plotting the raw distributions of these variables gives a first look at their shapes.
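A minimal sketch of how such plots might be produced with base-R histograms (the 2 x 2 layout and the choice of hist() are assumptions):
> # sketch: histograms of the raw distributions of the variables of interest
> par(mfrow = c(2, 2))
> hist(election$median_income, main = "median_income", xlab = "median_income")
> hist(election$hispanic,      main = "hispanic",      xlab = "hispanic")
> hist(election$b_change,      main = "b_change",      xlab = "b_change")
> hist(election$v_change,      main = "v_change",      xlab = "v_change")
> par(mfrow = c(1, 1))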
------------------------------------------------------------------------------------------------------------
Title: Computing the mean with confidence intervals
Definition and Motivation:
The arithmetic mean of a collection is the sum of the elements in that collection divided by the number of elements in that collection (including duplicates). The goal of computing the mean of a randomly selected subset S of a larger set S' is to estimate the mean of S'. By computing confidence intervals on the mean of S, we attempt to bound the interval in which the mean of S' is likely to lie.
Pseudo-Code:
Let n be the size of S (including duplicates)
BagOfResults := empty collection                 // note that duplicates will be allowed
do LARGENUMBER times                             // LARGENUMBER should be something like 1000 or more
    E := take n elements of S with replacement
    add mean(E) to BagOfResults
end
SequenceOfResults := sort BagOfResults from smallest to largest
low  := the element at position 0.025*LARGENUMBER of SequenceOfResults    // low end of 95% confidence interval
high := the element at position 0.975*LARGENUMBER of SequenceOfResults    // high end of 95% confidence interval
return (low, high, mean(S))
R-Code:
# define a mean function to be used by the call to boot
> mean_fun <- function(data, indices){
+   mean(data[indices])
+ }
# load the boot library, then make the call to boot; boot() resamples the data
# with replacement and applies mean_fun to each sample
> library(boot)
> boot_b_change_mean <- boot(data=election[,"b_change"], statistic=mean_fun, R=999)
> boot_b_change_mean
ORDINARY NONPARAMETRIC BOOTSTRAP
Call:
boot(data = election[, "b_change"], statistic = mean_fun, R = 999)
Bootstrap Statistics :
original bias std. error
t1* 0.03702145 -0.0001289764 0.003587013
The "original" value, 0.037021, is simply the mean of b_change computed on the observed data; the bias and standard error describe how the resampled means vary around it. To visualize the results we can use the plot function in R, which accepts the result of the boot call.
>plot(boot_b_change_mean)
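As a quick check, the standard error reported by boot is simply the standard deviation of the bootstrap replicates; a one-line sketch (using the boot_b_change_mean object created above):
> # bootstrap standard error = standard deviation of the 999 resampled means
> sd(boot_b_change_mean$t[,1])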
Finally, we compute the confidence interval of the mean.
> boot.ci(boot.out = boot_b_change_mean, conf = 0.95, type = "perc")
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 999 bootstrap replicates
CALL :
boot.ci(boot.out = boot_b_change_mean, conf = 0.95, type = "perc")
Intervals :
Level Percentile
95% ( 0.0297, 0.0442 )
Calculations and Intervals on Original Scale
------------------------------------------------------------------------------------------------------------
Title: Calculating the difference between the means of two groups
Definition and Motivation:
In this case we can split the data set into two groups: the counties that used electronic voting and those that did not (i.e. etouch = 1 vs etouch = 0). We then calculate the bootstrap difference in the means of the two groups. The 95% confidence interval gives a bound on the difference of the means.
Pseudo-Code:
Let n1 be the size of S1                         // S1 is the group with etouch = 0
Let n2 be the size of S2                         // S2 is the group with etouch = 1
BagOfResults := empty collection
do LARGENUMBER times                             // LARGENUMBER should be something like 1000 or more
    E := take n1 elements of S1 with replacement
    F := take n2 elements of S2 with replacement
    add mean(E) - mean(F) to BagOfResults
end
SequenceOfResults := sort BagOfResults from smallest to largest
low  := the element at position 0.025*LARGENUMBER of SequenceOfResults    // low end of 95% confidence interval
high := the element at position 0.975*LARGENUMBER of SequenceOfResults    // high end of 95% confidence interval
return (low, high, mean(S1) - mean(S2))
R-Code:
# for ease of computation and analysis of results, create a data set containing only the dependent and independent variables (county, etouch, median_income, hispanic, v_change and b_change)
> election_subset <- data.frame(county=election[,"county"],etouch=election[,"etouch"],median_income=election[,"median_income"],hispanic=election[,"hispanic"],v_change=election[,"v_change"],b_change=election[,"b_change"])
# define a function that calculates the difference between the means of the two groups;
# this function is then passed to the call to boot
> mean_diff_fun <- function(data, indices){
+   m1 <- mean(data[indices,"b_change"][data[indices,"etouch"]==0])   # mean b_change, no e-voting
+   m2 <- mean(data[indices,"b_change"][data[indices,"etouch"]==1])   # mean b_change, e-voting
+   m1 - m2                                                           # return the difference
+ }
# finally, make the call to boot
> boot_mean_diff <- boot(data=election_subset, statistic=mean_diff_fun, R=999)
> boot_mean_diff
ORDINARY NONPARAMETRIC BOOTSTRAP
Call:
boot(data = election_subset, statistic = mean_diff_fun, R = 999)
Bootstrap Statistics :
original bias std. error
t1* 0.01646996 0.0001910999 0.006707707
> plot(boot_mean_diff)
The first plot produced above shows the distribution of the difference of the means across the 999 bootstrap samples. The observed difference of the means is 0.01646996. Now we calculate the confidence interval as follows:
> boot_mean_diff_conf <- boot.ci(boot_mean_diff, conf=0.95, type="perc")
> boot_mean_diff_conf
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 999 bootstrap replicates
CALL :
boot.ci(boot.out = boot_mean_diff, conf = 0.95, type = "perc")
Intervals :
Level Percentile
95% ( 0.0035, 0.0295 )
Calculations and Intervals on Original Scale
This says that with a confidence of 95%, the difference between the means is between 0.0035 and 0.0295.
------------------------------------------------------------------------------------------------------------
Title: Calculating the bootstrap-based and permutation-based empirical p-value for the difference of the means of two groups
Definition and Motivation:
The problem of comparing the means of two groups can also be framed as a hypothesis test. The null hypothesis is that the difference between the means of the two groups (etouch = 0 and etouch = 1) is zero; the alternative hypothesis is that the difference is greater than zero, i.e.
H0: diff = 0
Ha: diff > 0
The probability value (p-value) of a statistical hypothesis test is the probability of obtaining a value of the test statistic (in this case the difference between the means) as extreme as, or more extreme than, the one observed by chance alone, if the null hypothesis H0 is true. Small p-values suggest that the null hypothesis is unlikely to be true; the smaller the p-value, the more convincing the rejection of the null hypothesis.
Pseudo-Code (Bootstrap-based version):
Let n1 be the size of S1                         // S1 is the group with etouch = 0
Let n2 be the size of S2                         // S2 is the group with etouch = 1
BagOfResults := empty collection
do LARGENUMBER times                             // LARGENUMBER should be something like 1000 or more
    E := take n1 elements of S1 with replacement
    F := take n2 elements of S2 with replacement
    add mean(E) - mean(F) to BagOfResults
end
count := 0
for each value in BagOfResults
    if value <= 0
        count := count + 1
pvalue := count / LARGENUMBER
return (pvalue)
R-Code (Bootstrap-based version):
> boot_mean_diff <- boot(data=election_subset, statistic=mean_diff_fun, R=999)
> boot_mean_diff
ORDINARY NONPARAMETRIC BOOTSTRAP
Call:
boot(data = election_subset, statistic = mean_diff_fun, R = 999)
Bootstrap Statistics :
original bias std. error
t1* 0.01646996 0.0001910999 0.006707707
Calculating the p-value
> sum(boot_mean_diff$t <= 0)/999
[1] 0.008008008
The above p-value is very small. This suggests that there is very strong evidence against the null hypothesis that the difference in the means of the two groups (etouch=0 and etouch=1) is zero.
Pseudo-Code (Permutation-based version):
Let S1 be the group with etouch = 0 and S2 the group with etouch = 1
observed_diff := mean(S1) - mean(S2)
BagOfResults := empty collection
do LARGENUMBER times                             // LARGENUMBER should be something like 1000 or more
    randomly permute the etouch column, keeping the b_change column fixed
        // equivalent to resampling without replacement; the relationship between
        // b_change and etouch is broken, so each permutation simulates the null hypothesis
    E := the b_change values whose permuted etouch label is 0
    F := the b_change values whose permuted etouch label is 1
    add mean(E) - mean(F) to BagOfResults
end
count := 0
for each value in BagOfResults
    if value >= observed_diff
        count := count + 1
pvalue := count / LARGENUMBER
return (pvalue)
R-Code (Permutation version):
# the following function does all the work and returns the result of the permutation experiment
> mean_diff_perm_fun <- function(data, m){
+   # observed difference in means between the etouch = 0 and etouch = 1 groups
+   mean_diff <- mean(data[,"b_change"][data[,"etouch"]==0]) - mean(data[,"b_change"][data[,"etouch"]==1])
+   # permute the etouch column (sampling without replacement), breaking its relationship with b_change
+   pdata <- data.frame("b_change"=data[,"b_change"],
+                       "etouch"=sample(data[,"etouch"], length(data[,"etouch"]), replace = FALSE))
+   pmean_diff <- mean(pdata[,"b_change"][pdata[,"etouch"]==0]) - mean(pdata[,"b_change"][pdata[,"etouch"]==1])
+   count <- (pmean_diff >= mean_diff)
+   # repeat for the remaining m-1 permutations
+   for (i in 1:(m-1))
+   {
+     newpdata <- data.frame("b_change"=data[,"b_change"],
+                            "etouch"=sample(data[,"etouch"], length(data[,"etouch"]), replace = FALSE))
+     newpmean_diff <- mean(newpdata[,"b_change"][newpdata[,"etouch"]==0]) - mean(newpdata[,"b_change"][newpdata[,"etouch"]==1])
+     count <- count + (newpmean_diff >= mean_diff)
+     pmean_diff <- c(pmean_diff, newpmean_diff)
+   }
+   pvalue <- count/m
+   list("permuted_mean_diff"=pmean_diff, "mean_diff"=mean_diff, "pvalue"=pvalue)
+ }
> # call the above function with m = 999 permutations
> mean_diff_perm_output <- mean_diff_perm_fun(election_subset,999)
> # the observed difference in means is
> mean_diff_perm_output$mean_diff
[1] 0.01646996
> # the p-value is
> mean_diff_perm_output$pvalue
[1] 0.03003003
The above p-value is still small (although not as small as the bootstrap-based value of 0.008008008). It still provides strong evidence against the null hypothesis that the difference in the means of the two groups (etouch=0 and etouch=1) is zero, suggesting the observed difference is not due to chance alone.
# a plot of the null distribution follows
> hist(mean_diff_perm_output$permuted_mean_diff, breaks=30, freq=FALSE, plot=TRUE)
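To see where the observed difference falls relative to this null distribution, a vertical line can be added to the histogram (a minimal sketch using the objects defined above):
> # mark the observed difference in means on the null distribution
> abline(v = mean_diff_perm_output$mean_diff, col = "red")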
------------------------------------------------------------------------------------------------------------
Title:
Analysis of Variance (ANOVA)
Definition and Motivation:
Pseudo-Code:
R-Code:
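A minimal sketch for this section, assuming the question of interest is whether mean b_change differs between the etouch = 0 and etouch = 1 counties: one-way ANOVA compares the variation in b_change between the groups to the variation within them. The object name election_aov is introduced here for illustration.
> # sketch: one-way ANOVA of b_change by e-voting status (assumed analysis)
> election_aov <- aov(b_change ~ factor(etouch), data = election_subset)
> summary(election_aov)
With only two groups, the F-test produced by summary() is equivalent to a pooled two-sample t-test on the difference of the means examined above.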
------------------------------------------------------------------------------------------------------------
Title:
Partial Correlation
Definition and Motivation:
The partial correlation between two variables is the correlation that remains after the linear effects of the other variables in the data set have been removed. Here the quantities of interest are the partial correlations of the independent variables with b_change. The function below computes them from the inverse of the covariance matrix.
> pcor_fun <- function(data){
+   conc <- solve(var(data))                                           # concentration (inverse covariance) matrix
+   resid.sd <- 1/sqrt(diag(conc))                                     # reciprocal square roots of the diagonal
+   partialcor <- sweep(sweep(conc,1,resid.sd, "*"),2,resid.sd,"*")    # scale rows and columns
+   return(partialcor)
+ }
> pcor_fun(election_subset)
                  etouch median_income   hispanic  v_changepc   b_change
etouch         1.0000000   -0.29885709 -0.13328091  0.1322710 0.07162380
median_income -0.2988571    1.00000000  0.06206307 -0.2267372 0.36928376
hispanic      -0.1332809    0.06206307  1.0000000   0.2444096 0.25903921
v_changepc     0.1322710   -0.22673721  0.24440957  1.0000000 0.04788487
b_change       0.0716238    0.36928376  0.25903921  0.04788487 1.00000000
Pseudo-Code:
R-Code:
> # vectors to hold 1000 bootstrapped partial correlations with b_change
> pcor.etouch <- vector(mode="numeric", length=1000)
> pcor.median_income <- vector(mode="numeric", length=1000)
> pcor.hispanic <- vector(mode="numeric", length=1000)
> pcor.v_changepc <- vector(mode="numeric", length=1000)
> for(b in 1:1000) {
+   data.curr <- election_subset[sample(1:67,67,replace=T),]   # resample the 67 counties with replacement
+   tempres <- pcor_fun(data.curr)
+   pcor.etouch[b] <- tempres["b_change", "etouch"]
+   pcor.median_income[b] <- tempres["b_change", "median_income"]
+   pcor.hispanic[b] <- tempres["b_change", "hispanic"]
+   pcor.v_changepc[b] <- tempres["b_change", "v_changepc"]
+ }
We now have four vectors (one per independent variable), each containing 1000 bootstrapped partial correlations with b_change (the dependent variable). Printing out the first 10 values in the pcor.etouch vector:
> pcor.etouch[1:10]
[1] 0.124255924 -0.006567401 0.009424238 0.048729694 0.101077880 0.177484922 0.016398538 0.098933451
[9] 0.273896994 0.027145799
> hist(pcor.etouch,probability=T)
> lines(density(pcor.etouch))
> quantile(pcor.etouch,c(.025,.975))
2.5% 97.5%
-0.1108112 0.2565338
> quantile(pcor.etouch,c(0,0.95))
0% 95%
-0.2495015 0.2328568
> hist(pcor.median_income,probability=T)
> lines(density(pcor.median_income))
> quantile(pcor.median_income,c(0,0.95))
0% 95%
-0.1403044 0.5454804
> quantile(pcor.median_income,c(.025,.975))
2.5% 97.5%
0.1167792 0.5900816
> hist(pcor.hispanic,probability=T)
> lines(density(pcor.hispanic))
> quantile(pcor.hispanic,c(.025,.975))
2.5% 97.5%
-0.02675132 0.48105924
> quantile(pcor.hispanic,c(0,0.95))
0% 95%
-0.2855233 0.4537965
> hist(pcor.v_changepc,probability=T)
> lines(density(pcor.v_changepc))
> quantile(pcor.v_changepc,c(.025,.975))
2.5% 97.5%
-0.2385491 0.3350396
> quantile(pcor.v_changepc,c(0,0.95))
        0%        95% 
-0.4213978  0.2973756 
------------------------------------------------------------------------------------------------------------
Title:
Linear Regression
Definition and Motivation:
Linear regression models the dependent variable (b_change) as a linear function of the independent variables plus an error term; each estimated coefficient gives the expected change in b_change for a unit change in that variable, holding the other variables fixed.
> election_lm <- lm(b_change ~ etouch + median_income + hispanic + v_changepc, data = election_subset)
> summary(election_lm)
Call:
lm(formula = b_change ~ etouch + median_income + hispanic + v_changepc,
data = election_subset)
Residuals:
Min 1Q Median 3Q Max
-0.075622 -0.013817 0.001171 0.015295 0.061787
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.113e-01 2.196e-02 5.069 3.87e-06 ***
etouch -4.692e-03 8.298e-03 -0.565 0.57383
median_income -1.733e-06 5.539e-07 -3.129 0.00267 **
hispanic -7.116e-02 3.370e-02 -2.112 0.03874 *
v_changepc -2.532e-04 6.708e-04 -0.377 0.70711
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
Residual standard error: 0.02596 on 62 degrees of freedom
Multiple R-Squared: 0.2438, Adjusted R-squared: 0.195
F-statistic: 4.997 on 4 and 62 DF, p-value: 0.001481
> par(mfrow=c(2,2)); plot(election_lm); par(mfrow=c(1,1))
> as.vector(election_lm$coefficients)
[1] 1.112965e-01 -4.691730e-03 -1.733254e-06 -7.116254e-02 -2.532048e-04
> summary(election_lm)$r.squared
[1] 0.2437796
Pseudo-Code:
R-Code:
> regr_lm_fun <- function(data, indices) {
+   data <- data[indices,]                       # resampled rows
+   model <- lm(b_change ~ etouch + median_income + hispanic + v_changepc, data = election_subset)   # fits on the full election_subset, not the resampled data
+   coefficients(model)
+ }
> regr_lm_boot <- boot(election_subset, regr_lm_fun, 1000)
> regr_lm_boot <- boot(election_subset, regr_lm_fun, 999)
> regr_lm_boot
ORDINARY NONPARAMETRIC BOOTSTRAP
Call:
boot(data = election_subset, statistic = regr_lm_fun, R = 999)
Bootstrap Statistics :
        original  bias    std. error
t1*  1.112965e-01      0           0
t2* -4.691730e-03      0           0
t3* -1.733254e-06      0           0
t4* -7.116254e-02      0           0
t5* -2.532048e-04      0           0
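Every coefficient shows zero bias and zero standard error because the statistic function above refits the model on the full election_subset rather than on the resampled rows, so every bootstrap replicate returns the same coefficients. A corrected sketch (regr_lm_fun2 and regr_lm_boot2 are names introduced here for illustration) fits the model to the bootstrap sample on each replicate; its output can then be passed to boot.ci for per-coefficient percentile intervals:
> # corrected statistic function: fit the model to the resampled rows
> regr_lm_fun2 <- function(data, indices) {
+   d <- data[indices, ]                       # the bootstrap sample of counties
+   model <- lm(b_change ~ etouch + median_income + hispanic + v_changepc, data = d)
+   coefficients(model)
+ }
> regr_lm_boot2 <- boot(election_subset, regr_lm_fun2, R = 999)
> # 95% percentile confidence interval for a single coefficient, e.g. etouch (index = 2)
> boot.ci(regr_lm_boot2, conf = 0.95, type = "perc", index = 2)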