Saturday, June 6, 2015

Dirichlet-walk hidden Markov model

Forgive me, but I am going to talk technical for a bit. I have been playing with a latent Dirichlet process hidden Markov model for aggregating opinion polls of primary voting intention. But before we get there, let's locate this approach.

The broad description for the method of poll aggregation I use on this site is the hidden Markov model (HMM). These models are analogous to the state-space models developed in the 1960s (beginning with the Kalman Filter). Hidden Markov models are one subset of Bayesian hierarchical models.

Using a HMM, I model the population voting intention (which cannot be observed directly - it is "hidden") as a series of states (either daily or weekly, depending on the model). Each state is directly dependent on the previous state and a probability distribution linking the states. Collectively, these links form a Markov chain or process. The model is informed by irregular and noisy data from the selected polling houses.

Solving the model necessitates integration over a series of complex multidimensional probability distributions. The definite integral is typically impossible to solve algebraically. But it can be solved using a numerical method based on Markov chains and random numbers known as Markov Chain Monte Carlo (MCMC) integration. I use a free software product called JAGS to solve the model.

The two poll aggregation models I had developed previously were dynamic linear models. In both models, the probability distribution linking the hidden states in the Markov model was the normal or Gaussian distribution: the bell-curve of high school statistics. Similarly, the probability distribution linking the hidden states with the polling observations was also the normal distribution. It is this dual independent use of the normal distribution that makes these models: dynamic linear models.

For the model based on the univariate Coalition two-party preferred poll results, the dynamic linear model works a treat. However, the multivariate primary vote model was far more difficult to construct. Quite some effort went into constraining the primary vote shares so that they always summed to one. It resulted in a large (and very slow) directed acyclic graph.

To address this complexity problem, I wondered whether a Dirichlet distribution could be used. Named for Peter Gustav Lejeune Dirichlet, the distribution is pronounced dirik-lay. The Dirichlet distribution has a number of useful features: First it is a multivariate distribution. Second, the output from the distribution always sums to one. It sounded ideal for modelling a time series of dynamically changing, continuous primary vote proportions (all on the unit interval).

In 2013, Emil Aas Stoltenberg wrote a paper on Bayesian Forecasting of Election Results in Multiparty Systems. In that paper he explored a Dirichlet-Multinomial process for the hidden Markov model. His model was coded in Python.

The Coalition TPP estimate from the this model over 6-months, and over the period since the previous election is not dissimilar to the output from the two dynamic linear models.



Turning to the model, rather than estimate the poll result, I use the multinomial distribution to estimate the number of people in each poll sample who expressed a preference for each of the parties. This is a very different approach to my other models. So that you can see it, I will include the R-code where I set up the input data for the model.

It should go without saying that the usual caveats apply. This is new code, and may include bugs. You will note that I am testing a few options in different places (see comments).

df <- read.csv(args[6], header=TRUE)
df <- df[order(df$Week), ]
NUMPOLLS <- nrow(df)
PERIOD <- max(df$Week)
HOUSECOUNT <- length(levels(df$House)) # a factor
HOUSENAMES <- levels(df$House)
PARTYNAMES <- c('Coalition', 'Labor', 'Greens', 'Other')
PARTIES <- length(PARTYNAMES)
primaryVotes <- df[PARTYNAMES] * df$Sample
primaryVotes <- sapply(primaryVotes, function(x) round(x,0))
day0 <- min(as.Date(df$Date)) - 1

## Assume Coalition gets preferences as follows:
## - 16.97% of the Green vote [was 16.97 in 2013 and 21.16 in 2010]
## - 53.3% of the Other vote [was 53.3 in 2013 and 58.26 in 2010]
## See: Antony Green - http://blogs.abc.net.au/antonygreen/2013/11/
##   preference-flows-at-the-2013-federal-election.html
preference_flows <- c(1.0, 0.0, 0.1697, 0.533)


data = list(PERIOD = PERIOD,
        HOUSECOUNT = HOUSECOUNT,
        NUMPOLLS = NUMPOLLS,
        PARTIES = PARTIES,
        primaryVotes = primaryVotes,
        pollWeek = df$Week,
        house = as.integer(df$House),
        # manage rounding issues with df$Sample ...
        n = rowSums(primaryVotes),
        preference_flows = preference_flows
    )
print(data)


# ----- JAGS model ...
library(rjags)
model <- "
model {

    #### -- observational model
    for(poll in 1:NUMPOLLS) { # for each poll result - rows
        adjusted_poll[poll, 1:PARTIES] <- walk[pollWeek[poll], 1:PARTIES] +
            houseEffect[house[poll], 1:PARTIES]
        primaryVotes[poll, 1:PARTIES] ~ dmulti(adjusted_poll[poll, 1:PARTIES], n[poll])
    }

    #### -- temporal model (a weekly walk where this week is much like last week)
    tightness <- 10000 # KLUDGE: tightness of fit parameter selected by trial and error
    for(week in 2:PERIOD) {
        # Note: use math not a distribution to generate the multinomial ...
        multinomial[week, 1:PARTIES] <- walk[week-1,  1:PARTIES] * tightness
        walk[week, 1:PARTIES] ~ ddirch(multinomial[week, 1:PARTIES])
    }

    ## -- weakly informative priors for first week in the temporal model
    for (party in 1:2) { # for each major party
        alpha[party] ~ dunif(250, 600) # majors between 25% and 60%
    }
    for (party in 3:PARTIES) { # for each minor party
        alpha[party] ~ dunif(10, 250) # minors between 1% and 25%
    }
    walk[1, 1:PARTIES] ~ ddirch(alpha[])

    ## -- estimate a Coalition TPP from the primary votes
    for(week in 1:PERIOD) {
        CoalitionTPP[week] <- sum(walk[week, 1:PARTIES] *
            preference_flows[1:PARTIES])
    }

    #### -- sum-to-zero constraints on house effects
    for (party in 2:PARTIES) { # for each party ...
        # house effects across houses sum to zero
        # NOTE: ALL MUST SUM TO ZERO
        houseEffect[1, party] <- -sum( houseEffect[2:HOUSECOUNT, party] )
    }
    for(house in 1:HOUSECOUNT) { # for each house ...
        # house effects across the parties sum to zero
        houseEffect[house, 1] <- -sum( houseEffect[house, 2:PARTIES] )
    }
    # but note, we do not apply a double constraint to houseEffect[1, 1]
    monitorHouseEffectOneSumParties <- sum(houseEffect[1, 1:PARTIES])
    monitorHouseEffectOneSumHouses <- sum(houseEffect[1:HOUSECOUNT, 1])

    ## -- vague normal priors for house effects - centred on zero
    for (party in 2:PARTIES) { # for each party (cols)
        for(house in 2:HOUSECOUNT) { #  (rows)
            houseEffect[house, party] ~ dnorm(0, pow(0.1, -2))
       }
    }
}
"

jags <- jags.model(textConnection(model),
        data = data,
        n.chains=4,
        n.adapt=n_adapt
    )

The input for the 6-month model was as follows:

$PERIOD
[1] 26

$HOUSECOUNT
[1] 5

$NUMPOLLS
[1] 35

$PARTIES
[1] 4

$primaryVotes
      Coalition Labor Greens Other
 [1,]       532   574    154   140
 [2,]       560   518    168   154
 [3,]       350   410    115   125
 [4,]       439   450    139   127
 [5,]       385   385     95   135
 [6,]       375   395    120   110
 [7,]      1465  1483    417   325
 [8,]       504   602    154   140
 [9,]       532   560    154   154
[10,]       504   602    154   140
[11,]       355   415    120   110
[12,]       412   483    141   141
[13,]      1345  1450    392   312
[14,]       375   405    100   120
[15,]       448   448    142   142
[16,]       588   504    168   140
[17,]       390   380    115   115
[18,]       441   453    139   128
[19,]       380   400    110   110
[20,]       471   425    126   126
[21,]       957   979    278   205
[22,]       405   360    125   110
[23,]       546   532    182   126
[24,]       471   413    126   138
[25,]       385   380    120   115
[26,]      1008   995    301   228
[27,]       400   375    115   110
[28,]       457   410    141   164
[29,]       690   656    185   151
[30,]       603   491    182   126
[31,]       415   355    125   105
[32,]       464   429    139   128
[33,]      1307  1218    385   273
[34,]       410   370    130    90
[35,]       479   433    152   105

$pollWeek
 [1]  1  1  2  2  6  8  8  9  9 10 10 10 10 12 12 13 14 14 16 16 17 18 19 19 20
[26] 21 22 22 24 24 24 24 24 26 26

$house
 [1] 1 2 3 4 3 3 5 1 2 1 3 4 5 3 4 2 3 4 3 4 5 3 2 4 3 5 3 4 1 2 3 4 5 3 4

$n
 [1] 1400 1400 1000 1155 1000 1000 3690 1400 1400 1400 1000 1177 3499 1000 1180
[16] 1400 1000 1161 1000 1148 2419 1000 1386 1148 1000 2532 1000 1172 1682 1402
[31] 1000 1160 3183 1000 1169

$preference_flows
[1] 1.0000 0.0000 0.1697 0.5330

Update

This page has been updated. Updates were the result of further analysis, which identified glitches with the original model. The tightness of fit parameter in the temporal model was not resolving. I have returned to specifying the tightness of fit parameter in the model.

Second, I have removed the dmulti() - the multinomial distribution step - in the temporal model and replaced it with simple arithmetic to calculate the multinomial. This makes the model a simpler Dirichlet-walk.

No comments:

Post a Comment