Wednesday, July 31, 2013

Further explorations in non-linearity

On the weekend I began exploring a non-linear model for aggregating polls. My test case produced nice-looking graphs; but the results were in large part an artifact of the priors I had chosen for the model (not good).

In the comments to that post, I suggested that part of the problem may have been how I had defined the model.

Thinking about this some more, I suspect the problem was in this term: << beta-for-pollster * (poll-result - minimum-poll-result) >>. Rather than the minimum, the poll result should be centred on a measure of central tendency (mid-point, mean, median, etc.).

I have now redefined the model to use the mid-point for each polling house (defined as (min + max)/2 for that house). The use of house-specific mid-points acknowledges that the raw x scores include the alpha house effect I am trying to estimate. The revised model is:

house-effect-for-pollster = alpha-for-pollster + beta-for-pollster * (poll-result - pollster-mid-point)
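By way of illustration, the house effect for a single poll can be computed like this (a sketch only; the alpha, beta and mid-point values below are made up, not estimates from the model):

    # hypothetical numbers, for illustration only
    alpha <- 0.01       # additive house effect (as a proportion)
    beta <- 0.10        # multiplicative "fizziness" factor
    midpoint <- 0.48    # (min + max) / 2 of this pollster's results
    pollResult <- 0.50  # a single poll result, as a proportion
    houseEffect <- alpha + beta * (pollResult - midpoint)
    print(houseEffect)  # 0.012 - a 1.2 percentage-point effect for this poll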

The priors on the beta effects have been freed up completely, so that they are effectively uninformative. As a result, this model behaves better internally than the previous model.

However, I have added a constraint: the aggregation assumes the "fizziness" (beta) factors, like the alpha house effects, sum to zero across the polling houses. This may be a bollocks assumption; it will need some further analysis.

So - with the caveat that this is still very early exploratory analysis - the results of this new model follow.

[Charts: results from the revised model]

On a plain reading, the above charts suggest that the current Rudd effect may actually be better for Labor than other aggregations have found. However, I am not convinced this is anything more than an artifact of the model. I am particularly concerned about the inclusion of data from polling houses that do not have data points across the highs and lows of the Gillard period, and across the Rudd and Gillard periods. Such data don't fully occupy the model, which can result in artifacts.

If we limit the analysis to data from Nielsen and Newspoll - a comparison using only data that fully occupies the model - we get the following.

[Charts: results from the revised model, Nielsen and Newspoll data only]

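For the record, restricting the data in R is straightforward (a sketch only, assuming the data frame and House factor used in the code below):

    # keep only the two houses whose polls span the whole period
    data <- data[data$House %in% c('Nielsen', 'Newspoll'), ]
    data$House <- droplevels(data$House) # drop the now-empty factor levels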
Because I am committed to presenting poll analysis in a transparent and unbiased fashion, the next two code snippets show how I package up the data for the model (in R), and the model itself (in JAGS).

    library(rjags) # needed below for jags.model() and coda.samples()

    # - prepare the data ...
    data$y <- data[ , y] / 100 # y (set elsewhere) names the vote column; convert per cent to a 0-1 proportion
    data$variance <- data$y * (1 - data$y) / data$Sample #pq/n
    data$standardError <- sqrt(data$variance)
    data$samplePrecision <- 1 / data$variance
    data$House <- relevel(data$House, ref='Newspoll') # make Newspoll level 1: the JAGS model anchors its sum-to-zero constraints there
    HOUSES <- levels(data$House)
    HOUSECOUNT <- length(HOUSES)
    NUMPOLLS <- length(data$y)
    cat('Number of polls: '); print(NUMPOLLS)
    midpoints <- rep(NA, HOUSECOUNT)
    for(i in seq_len(HOUSECOUNT)) {
        houseY <- data[data$House == HOUSES[i], 'y']
        midpoints[i] <- (max(houseY) + min(houseY)) / 2
        cat('Midpoint', i, HOUSES[i], ':', midpoints[i], '\n')
    }
    
    # - remember these dots 
    dotsFile <- paste('./files/', fPrefix, 'original.csv', sep='')
    write.csv(data, file=dotsFile)

    # - manage dates
    day0 <- min(data$Date) - 1  # walk starts from earliest date
    if(endDate == TODAY)
        endDate <- max(data$Date)
    PERIOD <- as.numeric(endDate - day0) # length of walk in days
    DISCONTINUITYDAY <- as.numeric(as.Date(discontinuity) - day0)
    cat('Discontinuity: '); print(DISCONTINUITYDAY)
    tPrefix <- paste(format(day0+1, '%d-%b-%Y'), ' to ', format(endDate, '%d-%b-%Y'), sep='')
    data$day <- as.numeric(data$Date - day0)

    # - do the MCMC thing ...
    
    parameters <- list(PERIOD = PERIOD,
        HOUSECOUNT = HOUSECOUNT,
        NUMPOLLS = NUMPOLLS,
        DISCONTINUITYDAY = DISCONTINUITYDAY,
        NEWSPOLL = which(levels(data$House) == 'Newspoll'), # == 1 after the relevel above
        y = data$y,
        x = data$y, # x and y are the same series; x is the regressor in the house-effect term
        day = data$day,
        house = as.integer(data$House),
        samplePrecision = data$samplePrecision,
        midpoints = midpoints
    )
                
    jags <- jags.model(textConnection(model),
        data=parameters,
        n.chains=4,
        n.adapt=n.adapt
    )

    # - burn in
    update(jags, n.iter=n.update) # burn-in the chains

    # - capture results
    jags.capture <- c('walk', 'alpha', 'beta', 'discontinuityValue')
    samples <- coda.samples(jags, jags.capture, n.iter=n.iter, thin=n.thin)
    coda.matrix <- as.matrix(samples)

model {
    ## Derived from Simon Jackman's original model 
    
    ## -- observational model
    for(poll in 1:NUMPOLLS) { 
        # note: x and y are the original polling series
        houseEffect[poll] <- alpha[house[poll]] + 
         beta[house[poll]]*(x[poll]-midpoints[house[poll]]) 
        mu[poll] <- walk[day[poll]] + houseEffect[poll]
        y[poll] ~ dnorm(mu[poll], samplePrecision[poll]) 
    }
            
    ## -- temporal model
    for(i in 2:PERIOD) { # for each day under analysis ...
        day2DayAdj[i] <- ifelse(i==DISCONTINUITYDAY,
            walk[i-1]+discontinuityValue, walk[i-1])
        walk[i] ~ dnorm(day2DayAdj[i], walkPrecision)
    }
    sigmaWalk ~ dunif(0, 0.01)            ## uniform prior on std. dev.  
    walkPrecision <- pow(sigmaWalk, -2)   ##   for the day-to-day random walk
    walk[1] ~ dunif(0.4, 0.6)             ## vague prior on the starting point
    discontinuityValue ~ dunif(-0.2, 0.2) ## vague prior on the discontinuity

    ## -- house effects model
    ## note: this assumes NEWSPOLL == 1 (see the relevel() call in the R code),
    ## otherwise alpha[NEWSPOLL] and beta[NEWSPOLL] would be defined twice
    for(i in 2:HOUSECOUNT) { ## vague uniform priors for house effects
        alpha[i] ~ dunif(-0.1, 0.1)
        beta[i] ~ dunif(-1, 1)
    }
    alpha[NEWSPOLL] <- -sum(alpha[2:HOUSECOUNT]) ## alphas sum to zero
    beta[NEWSPOLL] <- -sum(beta[2:HOUSECOUNT])   ## betas sum to zero
}
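Finally, a rough sketch (not part of the scripts above) of one way to summarise the captured samples once the model has run:

    # sketch only: posterior means for the house-effect parameters
    alphaCols <- grep('^alpha\\[', colnames(coda.matrix))
    betaCols <- grep('^beta\\[', colnames(coda.matrix))
    cat('Posterior mean alphas:\n'); print(colMeans(coda.matrix[, alphaCols]))
    cat('Posterior mean betas:\n'); print(colMeans(coda.matrix[, betaCols]))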

1 comment:

  1. Is there any reason that you are defining the bias as:

    "house-effect-for-pollster = alpha-for-pollster + beta-for-pollster * (poll-result - pollster-mid-point)"

    instead of

    "house-effect-for-pollster = alpha-for-pollster + beta-for-pollster * poll-result"?

    The former option means that the beta parameters (and potentially also the alpha parameters) cannot be directly compared between polling houses, because there are house-specific differences subsumed into beta and alpha.

    Note that your former case is also equal to
    "house-effect-for-pollster = (alpha-for-pollster - beta-for-pollster * pollster-mid-point) + beta-for-pollster * poll-result"
    = "alpha-prime-for-pollster + beta-for-pollster * poll-result"
    where alpha-prime-for-pollster = alpha-for-pollster - beta-for-pollster * pollster-mid-point.

    So what looks like a different definition of beta is actually equivalent to a different alpha.
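    A quick numerical check of this equivalence, with made-up numbers:

        alpha <- 0.01; beta <- 0.1; midpoint <- 0.48; poll <- 0.50
        he1 <- alpha + beta * (poll - midpoint) # centred (mid-point) form
        alphaPrime <- alpha - beta * midpoint
        he2 <- alphaPrime + beta * poll         # uncentred form
        all.equal(he1, he2)                     # TRUE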
