Saturday, March 17, 2018

Pollster preference flows

One of the things I like to look at is the preference flows the pollsters apply to their primary vote estimates in order to make a two-party preferred (TPP) vote estimate. We can discover these flows using good old fashioned multiple linear regression, along these lines:

$$TPP - Coalition_p = \beta_1 Greens_p + \beta_2 OneNation_p + \beta_3 Other_p + \epsilon$$

Which, in matrix notation, we will simplify as:

$$y = X \beta + \epsilon$$

In this simplification, $$y$$ is a column vector of (TPP - Coalition primary) vote estimates from a pollster. $$X$$ is the regression design matrix, with k columns (one for each party's primary vote) and N rows (one for each of the reported poll results). $$\beta$$ is a column vector of party coefficients we are seeking to find through the regression process. And $$\epsilon$$ is a column vector of error terms which we assume are independent and identically distributed (iid) with a mean of $$0$$. Through the magic of mathematics we can seek to minimize the sum of the squared errors using algebra and calculus to show:

$$\sum_{i=1}^n\epsilon_i^2 = \epsilon'\epsilon = (y-X\beta)'(y-X\beta)$$
$$= y'y - \beta'X'y - y'X\beta + \beta'X'X\beta$$
$$= y'y - 2\beta'X'y + \beta'X'X\beta$$
From this last equation, we can use calculus to find the $$\beta$$ that minimizes the sum of the errors squared:
$$\frac{\partial \epsilon'\epsilon}{\partial\beta} = -2X'y+2X'X\beta = 0$$
Which can be re-arranged to the famous "ex prime ex inverse ex prime why":
$$\beta = (X'X)^{-1}X'y$$

Before I get to the results, there are a few caveats to go through. First, not all of the primary vote poll data sums to 100 per cent. In the data I looked at, the following polls did not sum to 100 per cent.

--- Does not add to 100% ---
L/NP   ALP   GRN   ONP   OTH    Sum       Firm            Date
12   35.0  38.0  10.0   7.0   9.0   99.0  Essential     12 Dec 2017
18   31.0  34.0  11.0  11.0  14.0  101.0     YouGov     14 Nov 2017
19   36.0  38.0   9.0   8.0  10.0  101.0  Essential     14 Nov 2017
21   36.0  37.0  10.0   7.0   9.0   99.0  Essential     30 Oct 2017
25   36.0  38.0  10.0   7.0  10.0  101.0  Essential      4 Oct 2017
32   35.0  34.0  14.0   1.0  15.0   99.0      Ipsos    6-9 Sep 2017
63   36.0  37.0  10.0   8.0  10.0  101.0  Essential  13-16 Apr 2017
67   35.0  37.0  10.0   8.0  11.0  101.0  Essential  24-27 Mar 2017
69   34.0  37.0   9.0  10.0   9.0   99.0  Essential  17-20 Mar 2017
75   36.0  35.0   9.0  10.0   9.0   99.0  Essential   9-12 Feb 2017
77   35.0  37.0  10.0   9.0   8.0   99.0  Essential  20-23 Jan 2017
80   37.0  37.0   9.0   7.0   9.0   99.0  Essential   9-12 Dec 2016
83   36.0  30.0  16.0   7.0   9.0   98.0      Ipsos  24-26 Nov 2016
88   37.0  37.0  11.0   5.0   9.0   99.0  Essential  14-17 Oct 2016
92   38.0  37.0  10.0   5.0  11.0  101.0  Essential   9-12 Sep 2016
109  34.0  35.0  11.0   8.0  13.0  101.0     YouGov   7-10 Dec 2017
110  32.0  32.0  10.0  11.0  16.0  101.0     YouGov  23-27 Nov 2017
----------------------------

To manage this, I normalised all of the primary vote poll results so that they summed to 1.

The second thing I did was limit my analysis to those pollsters that had more than 10 polls since the last election. This meant I limited my analysis to polls from Newspoll and Essential.

Let's look at the multiple regression results. The key results is the block in the middle, with the three parties in the left hand column: GRN, ONP and OTH - Greens, One Nation and Others. The first coefficient column is the best linear unbiased estimate of the preference flows to the Coalition from each of these parties. The 95 per cent confidence intervals can be seen in the far right columns.

---- Essential ----
OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.999
Model:                            OLS   Adj. R-squared:                  0.999
Method:                 Least Squares   F-statistic:                 1.173e+04
Date:                Sat, 17 Mar 2018   Prob (F-statistic):           2.09e-61
Time:                        13:12:35   Log-Likelihood:                 190.20
No. Observations:                  45   AIC:                            -374.4
Df Residuals:                      42   BIC:                            -369.0
Df Model:                           3
Covariance Type:            nonrobust
==============================================================================
coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
GRN            0.2236      0.054      4.162      0.000       0.115       0.332
ONP            0.4396      0.037     11.775      0.000       0.364       0.515
OTH            0.5091      0.050     10.190      0.000       0.408       0.610
==============================================================================
Omnibus:                        3.681   Durbin-Watson:                   1.938
Prob(Omnibus):                  0.159   Jarque-Bera (JB):                1.691
Skew:                          -0.020   Prob(JB):                        0.429
Kurtosis:                       2.051   Cond. No.                         20.0
==============================================================================

---- Newspoll ----
OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.999
Model:                            OLS   Adj. R-squared:                  0.999
Method:                 Least Squares   F-statistic:                     7414.
Date:                Sat, 17 Mar 2018   Prob (F-statistic):           1.81e-34
Time:                        13:12:38   Log-Likelihood:                 110.81
No. Observations:                  26   AIC:                            -215.6
Df Residuals:                      23   BIC:                            -211.9
Df Model:                           3
Covariance Type:            nonrobust
==============================================================================
coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
GRN            0.2581      0.090      2.881      0.008       0.073       0.443
ONP            0.4976      0.035     14.202      0.000       0.425       0.570
OTH            0.4424      0.084      5.237      0.000       0.268       0.617
==============================================================================
Omnibus:                        2.886   Durbin-Watson:                   0.753
Prob(Omnibus):                  0.236   Jarque-Bera (JB):                1.437
Skew:                          -0.212   Prob(JB):                        0.488
Kurtosis:                       1.929   Cond. No.                         26.8
==============================================================================

From the Ordinary Least Squares (OLS) multiple regression analysis, our best guess is that Essential flows 22 per cent of the Green vote to the Coalition, it flows 44 per cent of the One Nation vote, and it flows 51 per cent of the Other vote. In comparison, Newspoll flows 26 per cent of the Green vote to the Coalition, 50 per cent of the One Nation vote, and 44 per cent of the Other vote.

While we get an estimate of preference flows to the Coalition for each of the three parties from both pollsters, it is worth noting that small sample sizes have resulted in quite wide confidence intervals for these estimates.

We can also do a Bayesian multiple linear regression. The Stan model I used for this follows.

// STAN: multiple regression - no intercept - positive coefficients

data {
// data size
int<lower=1> k;                     // number of pollster firms
int<lower=k+1> N;                   // number of polls

vector<lower=0,upper=1>[N] y;       // response vector
matrix<lower=0,upper=1>[N, k] X;    // design matrix
}

parameters {
vector<lower=0>[k] beta;    // positive regression coefficients
real<lower=0> sigma;        // standard deviation on iid error term
}

model {
beta ~ normal(0, 0.5);      // half normal prior
sigma ~ cauchy(0, 0.01);    // half cauchy prior
y ~ normal(X * beta, sigma);// regression model
}

The results I got were very similar to the standard OLS results. In this case I have identified the 95% credible interval, which is the Bayesian equivalent of the confidence interval. Again, the One Nation and Other results are quite different, and swapped about between the two pollsters.

From Stan: for Essential
2.5%    median     97.5%
GRN    0.121745  0.228054  0.335852
ONP    0.362631  0.438261  0.513884
OTH    0.404663  0.506283  0.607355

From Stan: for Newspoll
2.5%    median     97.5%
GRN    0.089876  0.267659  0.450531
ONP    0.420720  0.494935  0.566663
OTH    0.259106  0.433838  0.601585

From the Bayesian analysis, our best guess is that Essential flows 23 per cent of the Green vote to the Coalition. It flows 44 per cent of the One Nation vote. And it flows 51 per cent of the Other vote. In comparison, Newspoll flows 27 per cent of the Green vote to the Coalition, 49 per cent of the One Nation vote, and 43 per cent of the Other vote.

These Bayesian results can be charted as probability densities as follows. In these charts the median sample for each distribution is highlighted with a thin vertical line.

For completeness, the supporting Python program that generated this analysis follows.

# PYTHON - estimates of preference flows from polling data

import pandas as pd
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

import sys
sys.path.append( '../bin' )
from stan_cache import stan_cache

# --- chart results
graph_dir = './Graphs/'
walk_leader = 'STAN-PREFERENCE-FLOWS-'
plt.style.use('../bin/markgraph.mplstyle')

# --- curate data for analysis
workbook = pd.ExcelFile('./Data/poll-data.xlsx')
df = workbook.parse('Data')

# drop polls without one nation
df = df[df['ONP'].notnull()]

# drop pre-2016 election data
df['MidDate'] = [pd.Period(d, freq='D') for d in df['MidDate']]
df = df[df['MidDate'] > pd.Period('2016-07-04', freq='D')]

# normalise the data - still in 0 to 100 range
parties = ['GRN', 'ONP', 'OTH']
all = ['L/NP', 'ALP'] + parties
df['Sum'] = df[all].sum(axis=1)
bad = df[all + ['Sum', 'Firm', 'Date']]
bad = bad[(bad['Sum'] < 99.5) | (bad['Sum'] > 100.5)]
print('--- Does not add to 100% ---')
print(bad)
print('----------------------------')
df[all] = df[all].div(df[all].sum(axis=1), axis=0) * 100.0

# --- Analyse the curated data
firms = df['Firm'].unique()
for firm in firms:
cases = df[df['Firm']==firm]

if len(cases) <= 10:
continue # not enough to analyse

# --- classic OLS multiple regression
# get response vector and design matrix in 0 to 1 range
y = (cases['TPP L/NP'] - cases['L/NP']) / 100.0
X = cases[parties] / 100.0

# regression estimation
model = sm.OLS(y, X).fit()
print('\n\n---- {} ----'.format(firm))
# Print out the statistics
print(model.summary())

# --- let's do the same thing with Stan
# input data
data = {
'y': y,
'X' : X,
'N': len(y),
'k': len(X.columns)
}

# helpers
quants = [2.5,  50, 97.5]
labels = ['2.5%', 'median', '97.5%']

with open ("./Models/preference flows.stan", "r") as file:
model_code = file.read()
file.close()

# model
stan = stan_cache(model_code=model_code, model_name='preference flows')
fit = stan.sampling(data=data, iter=10000, chains=5)
results = fit.extract()

# capture the coefficients
coefficients = results['beta']
print('--- Coefficient Shape: {} ---'.format(coefficients.shape))
estimates = pd.DataFrame()
for i, party in zip(range(len(parties)), parties):
q = np.percentile(coefficients[:,i], quants)
row = pd.DataFrame(q, index=labels, columns=[party]).T
estimates = estimates.append(row)

# capture sigma
sigma = results['sigma'].T
q = np.percentile(sigma, quants)
row = pd.DataFrame(q, index=labels, columns=['sigma']).T
estimates = estimates.append(row)

# print results from Stan
print('From Stan: for {}'.format(firm))
print(estimates)

# plot results from Stan
coefficients = pd.DataFrame(coefficients, columns=parties)
ax = coefficients.plot.kde(color=['darkgreen', 'goldenrod', 'orchid'])
ax.set_title('Kernel Density for Preference Flows by {}'.format(firm))
ax.set_ylabel('KDE')
ax.set_xlabel('Estimated Flow (Proportion)')
for i in coefficients.columns:
ax.axvline(x=coefficients[i].median(), color='gray', linewidth=0.5)

fig = ax.figure
fig.set_size_inches(8, 4)
fig.tight_layout(pad=1)
fig.text(0.99, 0.01, 'marktheballot.blogspot.com.au',
ha='right', va='bottom', fontsize='x-small',
fontstyle='italic', color='#999999')
fig.savefig(graph_dir+walk_leader+firm+'.png', dpi=125)
plt.close() 

Sunday, March 11, 2018

What do you do when the data does not fit the model?

I have a couple of models in development for aggregating primary vote intention. I thought I would extend the centred logit model from four parties (Coalition, Labor, Greens and Other) to five parties (Coalition, Labor, Greens, One Nation and Other). It was not a significant change to the model. However, when I went to run it, it generated pages of warning messages, well into the warm-up cycle:
The current Metropolis proposal is about to be rejected because of the following issue:[boring technical reason deleted]
If this warning occurs sporadically, such as for highly constrained variable types like covariance matrices, then the sampler is fine, but if this warning occurs often then your model may be either severely ill-conditioned or misspecified.
My model is highly constrained, and it would produce a few of these warnings at start-up. But it was nothing like the number now being produced. Fortunately, the model was not completely broken, and it did produce an estimate.

When I looked at the output from the model, the issue was immediately evident: the Ipsos data was inconsistent with the assumptions in my model. In this chart, the first Ipsos poll result for One Nation is close to the results from the other pollsters. But the remaining three polls are well below the estimate from the other pollsters. This difference delivers a large house bias estimate for Ipsos, well to the edge of my prior for house biases: normal(0, 0.015) - ie. a normal distribution with a mean of 0 and a standard deviation of 1.5 per percentage points.

The model assumes that each polling house will track around some hidden population trend, with each poll's deviation from that trend being explained by two factors: a random amount associated with the sampling variation, and a constant additive amount associated with the systemic house bias for that pollster.

If we assume the other pollsters are close to the trend, then a few explanations are possible (listed in my ranking of higher to lower plausibility):
• Ipsos changed its polling methodology between the first and subsequent polls, and the model's assumption of a constant additive systemic house bias is incorrect; or
• the first poll is a huge outlier for Ipsos (well outside of the typical variation associated with the sampling variation from each poll), and the remaining three polls are more typical and incorporate a substantial house bias for Ipsos; or
• the first Ipsos poll is more plausible for Ipsos, and the last three are huge outliers.
Whatever, it makes the resolution of an estimate more challenging for Stan. And it gives me something more to ponder.

I should note that while the Ipsos results are low for One Nation, they are consistently above average for the Greens. Being consistently above is in line with model assumptions.

There is some allure in removing the Ipsos data from the model; but there be dragons. As a general rule of thumb, removing outliers from the data requires a compelling justification, such as independent verification of a measurement error. I like to think of outliers as a disturbance in the data where I don't yet know/understand what causes that disturbance.

Saturday, March 10, 2018

Two aggregation models for primary vote share

I have developed two different Stan models for aggregating the primary vote opinion polls.
• The first model estimates the compositional voting proportions for all but one party as centered logits (where logits are logarithms of the odds ratio, which are assumed to have a mean centred close to 0 on the logit scale). The temporal model is a Gaussian autoregressive process for each of these centred logits which is similar to the autoregressive process used in the two party preferred (TPP) vote share model. To get meaningful primary poll estimates, the n_parties-1 centred logits are converted to voting share proportions for all the parties. This model takes around 500 seconds (9 minutes) to produce an aggregate poll.
• The second model is a Stan version of the Dirichlet process model I developed for the last Federal election. This model derives the party vote share on a particular day through a Dirichlet distribution over the previous day's vote share multiplied by a transmission-strength value. The larger the transmission strength value, the the less change from one day to the next. This model takes about 1800 seconds (30 minutes) to produce an aggregate poll, which is a long time.

Both models use the multinomial distribution to fit the estimated hidden voting intention to the polling data. This hidden vote share in both models is expressed as an n_parties simplex; where all values on the simplex are between 0 and 1, and collectively they sum to 1. The data from each opinion poll is four integers in a multinomial, where all values in this multinomial sum to the pseudo-sample-size of 2000. The four party groups are: the Coalition, Labor, Greens and others.

Both models treat house effects in the same way. The house effects for each polling house sum to zero across the four parties. The house effects for each party also sum to zero across the (currently) five houses.

Both models derive the best estimate for the extent to which the voting patterns on any day are similar to the voting patterns on the previous day.

Let's look at the output from each model for each party grouping. The high, low and end point median samples are annotated.You can see the models produce similar output. The median estimate from the Dirichlet model is smoother, and the sample distribution is wider.

We can also compare the house effects from each model. Again these are similar.

We can also compare the TPP estimates from the models using previous election preference flows.

We can compare these TPP estimates with the estimate from the TPP Aggregation model.

What cannot be compared - because the models are so different - is the degree to which both the models ensure the voting intention on one day is much like the next. In the centred logit model we have a model estimated standard deviation (walkSigma) from one day to the next. In the Dirichlet model, we have a model estimated transmissionFactor, the inverse of which provides the transmission strength.

My code for these models is still very much in development. It has not been vectorised. And in places it is just plain ugly. I will spend some time tidying the code, before I add it to my page on the models.

// STAN: Primary Vote Intention Model using Centred Logits

data {
// data size
int<lower=1> n_polls;
int<lower=1> n_days;
int<lower=1> n_houses;
int<lower=1> n_parties;
int<lower=1> pseudoSampleSize;

// Centreing factors
real centreing_factors[n_parties-1];

// poll data
int<lower=1,upper=pseudoSampleSize> y[n_polls, n_parties]; // poll data multinomials
int<lower=1,upper=n_houses> house[n_polls]; // polling house
int<lower=1,upper=n_days> poll_day[n_polls]; // day on which polling occurred

// TPP preference flows
row_vector<lower=0,upper=1>[n_parties] preference_flows_2010;
row_vector<lower=0,upper=1>[n_parties] preference_flows_2013;
row_vector<lower=0,upper=1>[n_parties] preference_flows_2016;
}

parameters {
real<lower=0> walkSigma;
row_vector[n_days] centredLogits[n_parties-1];
matrix[n_houses-1, n_parties-1] houseAdjustment;
}

transformed parameters {
matrix[n_parties, n_days] hidden_voting_intention;
vector<lower=-0.2,upper=0.2>[n_parties] tHouseAdjustment[n_houses];
row_vector[n_days] tmp;

// house effects - two-direction sum to zero constraints
for (h in 1:(n_houses-1))
for(p in 1:(n_parties-1))
tHouseAdjustment[h][p] = houseAdjustment[h][p];
for(p in 1:(n_parties-1))
tHouseAdjustment[n_houses][p] = -sum(col(houseAdjustment, p));
for(h in 1:n_houses) {
tHouseAdjustment[h][n_parties] = 0; // get rid of the NAN
tHouseAdjustment[h][n_parties] = -sum(tHouseAdjustment[h]);
}

// convert centred logits to a simplex of hidden voting intentions
tmp = rep_row_vector(0, n_days);
for (p in 1:(n_parties-1)) {
hidden_voting_intention[p] = inv_logit(centredLogits[p]) +
centreing_factors[p];
tmp = tmp + hidden_voting_intention[p];
}
hidden_voting_intention[n_parties] = 1.0 - tmp;
}

model{
matrix[n_parties, n_polls] hvi_on_poll_day;

// -- house effects model
for( p in 1:(n_houses-1) )
houseAdjustment[p] ~ normal(0, 0.015);

// -- temporal model - all done on the centred logit scale
// Note: 0.02 near the centre --> roughly std dev of half a per cent
walkSigma ~ normal(0, 0.02); // half normal prior - note: on logit scale;
for(p in 1:(n_parties-1)) {
centredLogits[p][1] ~ normal(0, 0.15); // centred starting point 50% +/- 5%
centredLogits[p][2:n_days] ~ normal(centredLogits[p][1:(n_days-1)], walkSigma);
}

// -- observed data model
for(p in 1:n_parties)
hvi_on_poll_day[p] = hidden_voting_intention[p][poll_day];
for(poll in 1:n_polls)
// note matrix transpose in the next statement ...
y[poll] ~ multinomial(to_vector(hvi_on_poll_day'[poll]) +
tHouseAdjustment[house[poll]]);
}

generated quantities {
// aggregated TPP estimates based on past preference flows
vector [n_days] tpp2010;
vector [n_days] tpp2013;
vector [n_days] tpp2016;

for (d in 1:n_days){
// note matrix transpose in next three lines
tpp2010[d] = sum(hidden_voting_intention'[d] .* preference_flows_2010);
tpp2013[d] = sum(hidden_voting_intention'[d] .* preference_flows_2013);
tpp2016[d] = sum(hidden_voting_intention'[d] .* preference_flows_2016);
}
}

// STAN: Primary Vote Intention Model using a Dirichlet process

data {
// data size
int<lower=1> n_polls;
int<lower=1> n_days;
int<lower=1> n_houses;
int<lower=1> n_parties;

// key variables
int<lower=1> sampleSize; // maximum sample size for y

// give a rough idea of a staring point ...
simplex[n_parties] startingPoint; // rough guess at series starting point
int<lower=1> startingPointCertainty; // strength of guess - small number is vague

// poll data
int<lower=0,upper=sampleSize> y[n_polls, n_parties]; // a multinomial
int<lower=1,upper=n_houses> house[n_polls]; // polling house
int<lower=1,upper=n_days> poll_day[n_polls]; // day polling occured

// TPP preference flows
vector<lower=0,upper=1>[n_parties] preference_flows_2010;
vector<lower=0,upper=1>[n_parties] preference_flows_2013;
vector<lower=0,upper=1>[n_parties] preference_flows_2016;
}

parameters {
simplex[n_parties] hidden_voting_intention[n_days];
matrix<lower=-0.06,upper=0.06>[n_houses-1, n_parties-1] houseAdjustment;
real<lower=0> transmissionFactor;
}

transformed parameters {
vector<lower=-0.2,upper=0.2>[n_parties] tHouseAdjustment[n_houses];
real<lower=1> transmissionStrength; // AR(1) strength: higher is stronger

// calculate transmissionStrength
transmissionStrength = 1/transmissionFactor;

// make the house effects sum to zero in two directions
for (h in 1:(n_houses-1))
for(p in 1:(n_parties-1))
tHouseAdjustment[h][p] = houseAdjustment[h][p];
for(p in 1:(n_parties-1))
tHouseAdjustment[n_houses][p] = -sum(col(houseAdjustment, p));
for(h in 1:n_houses) {
tHouseAdjustment[h][n_parties] = 0; // get rid of the NAN
tHouseAdjustment[h][n_parties] = -sum(tHouseAdjustment[h]);
}
}

model{
// -- house effects model
for( p in 1:(n_houses-1) )
houseAdjustment[p] ~ normal(0, 0.01);

// -- temporal model
transmissionFactor ~ normal(0, 0.005); // a half normal prior
hidden_voting_intention[1] ~ dirichlet(startingPoint * startingPointCertainty);
for (d in 2:n_days)
hidden_voting_intention[d] ~ dirichlet(hidden_voting_intention[d-1] *
transmissionStrength);

// -- observed data model
for(poll in 1:n_polls)
y[poll] ~ multinomial(hidden_voting_intention[poll_day[poll]] +
tHouseAdjustment[house[poll]]);
}

generated quantities {
// aggregated TPP estimates based on past preference flows
vector [n_days] tpp2010;
vector [n_days] tpp2013;
vector [n_days] tpp2016;

for (d in 1:n_days){
tpp2010[d] = sum(hidden_voting_intention[d] .* preference_flows_2010);
tpp2013[d] = sum(hidden_voting_intention[d] .* preference_flows_2013);
tpp2016[d] = sum(hidden_voting_intention[d] .* preference_flows_2016);
}
}

Monday, March 5, 2018

Archive from 2016 Australian Federal Election

Note: the following page is the approach I took to the 2016 Federal Election. The Bayesian Aggregation page is being/has been updated for the 2019 election ...

This page provides some technical background on the Bayesian poll aggregation models used on this site. The page has been updated for the 2016 Federal election.

General overview

The aggregation or data fusion models I use are known as hidden Markov models. They are are sometimes referred to as state space models or latent process models.

I model the national voting intention (which cannot be observed directly; it is "hidden") for each and every day of the period under analysis. The only time the national voting intention is not hidden, is at an election. In some models (known as anchored models), we use the election result to anchor the daily model we use.

In the language of modelling, our estimates of the national voting intention for each day being modeled are known as states. These "states" link together to form a Markov process, where each state is directly dependent on the previous state and a probability distribution linking the states. In plain English, the models assume that the national voting intention today is much like it was yesterday. The simplest models assume the voting intention today is normally distributed around the voting intention yesterday.

The model is informed by irregular and noisy data from the selected polling houses. The challenge for the model is to ignore the noise and find the underlying signal. In effect, the model is solved by finding the the day-to-day pathway with the maximum likelihood given the known poll results.

To improve the robustness of the model, we make provision for the long-run tendency of each polling house to systematically favour either the Coalition or Labor. We call this small tendency to favour one side or the other a "house effect". The model assumes that the results from each pollster diverge (on average) from the from real population voting intention by a small, constant number of percentage points. We use the calculated house effect to adjust the raw polling data from each polling house.

In estimating the house effects, we can take one of a number of approaches. We could:

• anchor the model to an election result on a particular day, and use that anchoring to establish the house effects.
• anchor the model to a particular polling house or houses; or
• assume that collectively the polling houses are unbiased, and that collectively their house effects sum to zero.

Typically, I use the first and third approaches in my models. Some models assume that the house effects sum to zero. Other models assume that the house effects can be determined absolutely by anchoring the hidden model for a particular day or week to a known election outcome.

There are issues with both approaches. The problem with anchoring the model to an election outcome (or to a particular polling house), is that pollsters are constantly reviewing and, from time to time, changing their polling practice. Over time these changes affect the reliability of the model. On the other hand, the sum-to-zero assumption is rarely correct. Of the two approaches, anchoring tends to perform better.

Another adjustment we make is to allow discontinuities in the hidden process when either party changes its leadership. Leadership changes can see immediate changes in the voting preferences. For the anchored models, we allow for discontinuities with the Gillard to Rudd and Abbott to Turnbull leadership changes. The unanchored models have a single discontinuity with the Abbott to Turnbull leadership change.

Solving a model necessitates integration over a series of complex multidimensional probability distributions. The definite integral is typically impossible to solve algebraically. But it can be solved using a numerical method based on Markov chains and random numbers known as Markov Chain Monte Carlo (MCMC) integration. I use a free software product called JAGS to solve these models.

The dynamic linear model of TPP voting intention with house effects summed to zero

This is the simplest model. It has three parts:

1. The observational part of the model assumes two factors explain the difference between published poll results (what we observe) and the national voting intention on a particular day (which, with the exception of elections, is hidden):

1. The first factor is the margin of error from classical statistics. This is the random error associated with selecting a sample; and
2. The second factor is the systemic biases (house effects) that affect each pollster's published estimate of the population voting intention.

2. The temporal part of the model assumes that the actual population voting intention on any day is much the same as it was on the previous day (with the exception of discontinuities). The model estimates the (hidden) population voting intention for every day under analysis.

3. The house effects part of the model assumes that house effects are distributed around zero and sum to zero.

This model builds on original work by Professor Simon Jackman. It is encoded in JAGS as follows:

model {
## -- draws on models developed by Simon Jackman

## -- observational model
for(poll in 1:n_polls) {
yhat[poll] <- houseEffect[house[poll]] + hidden_voting_intention[day[poll]]
y[poll] ~ dnorm(yhat[poll], samplePrecision[poll]) # distribution
}

## -- temporal model - with one discontinuity
hidden_voting_intention[1] ~ dunif(0.3, 0.7) # contextually uninformative
hidden_voting_intention[discontinuity] ~ dunif(0.3, 0.7) # ditto

for(i in 2:(discontinuity-1)) {
hidden_voting_intention[i] ~ dnorm(hidden_voting_intention[i-1], walkPrecision)
}
for (j in (discontinuity+1):n_span) {
hidden_voting_intention[j] ~ dnorm(hidden_voting_intention[j-1], walkPrecision)
}
sigmaWalk ~ dunif(0, 0.01)
walkPrecision <- pow(sigmaWalk, -2)

## -- house effects model
for(i in 2:n_houses) {
houseEffect[i] ~ dunif(-0.15, 0.15) # contextually uninformative
}
houseEffect[1] <- -sum( houseEffect[2:n_houses] )
}

Professor Jackman's original JAGS code can be found in the file kalman.bug, in the zip file link on this page, under the heading Pooling the Polls Over an Election Campaign.

The anchored dynamic linear model of TPP voting intention

This model is much the same as the previous model. However, it is run with data from prior to the 2013 election to anchor poll performance. It includes two discontinuities, with the ascensions of Rudd and Turnbull. And, because it is anchored, the house effects are not constrained to sum to zero.

model {

## -- observational model
for(poll in 1:n_polls) {
yhat[poll] <- houseEffect[house[poll]] + hidden_voting_intention[day[poll]]
y[poll] ~ dnorm(yhat[poll], samplePrecision[poll]) # distribution
}

## -- temporal model - with two or more discontinuities
# priors ...
hidden_voting_intention[1] ~ dunif(0.3, 0.7) # fairly uninformative
for(j in 1:n_discontinuities) {
hidden_voting_intention[discontinuities[j]] ~ dunif(0.3, 0.7) # fairly uninformative
}
sigmaWalk ~ dunif(0, 0.01)
walkPrecision <- pow(sigmaWalk, -2)

# Up until the first discontinuity ...
for(k in 2:(discontinuities[1]-1)) {
hidden_voting_intention[k] ~ dnorm(hidden_voting_intention[k-1], walkPrecision)
}

# Between the discontinuities ... assumes 2 or more discontinuities ...
for( disc in 1:(n_discontinuities-1) ) {
for(k in (discontinuities[disc]+1):(discontinuities[(disc+1)]-1)) {
hidden_voting_intention[k] ~ dnorm(hidden_voting_intention[k-1], walkPrecision)
}
}

# after the last discontinuity
for(k in (discontinuities[n_discontinuities]+1):n_span) {
hidden_voting_intention[k] ~ dnorm(hidden_voting_intention[k-1], walkPrecision)
}

## -- house effects model
for(i in 1:n_houses) {
houseEffect[i] ~ dnorm(0, pow(0.05, -2))
}
}


The latent Dirichlet process for primary voting intention

This model is more complex. It takes advantage of the Dirichlet (pronounced dirik-lay) distribution, which always sums to 1, just as the primary votes for all parties would sum to 100 per cent of voters at an election. A weakness is a transmission mechanism from day-to-day that uses a "tightness of fit" parameter, which has been arbitrarily selected.

The model is set up a little differently to the previous models. Rather than pass the vote share as a number between 0 and 1; we pass the size of the sample that indicated a preference for each party. For example, if the poll is of 1000 voters, with 40 per cent for the Coalition, 40 per cent for Labor, 11 per cent for the Greens and and 9 per cent for Other parties, the multinomial we would pass across for this poll is [400, 400, 110, 90].

More broadly, this model is conceptually very similar to the sum-to-zero TPP model: with an observational component, a dynamic walk of primary voting proportions (modeled as a hierarchical Dirichlet process), a discontinuity for Turnbull's ascension, and a set of house effects that sum to zero across polling houses and across the parties.

data {
zero <- 0.0
}
model {

#### -- observational model

for(poll in 1:NUMPOLLS) { # for each poll result - rows
adjusted_poll[poll, 1:PARTIES] <- walk[pollDay[poll], 1:PARTIES] +
houseEffect[house[poll], 1:PARTIES]
primaryVotes[poll, 1:PARTIES] ~ dmulti(adjusted_poll[poll, 1:PARTIES], n[poll])
}

#### -- temporal model with one discontinuity

tightness <- 50000 # kludge - today very much like yesterday
# - before discontinuity
for(day in 2:(discontinuity-1)) {
# Note: use math not a distribution to generate the multinomial ...
multinomial[day, 1:PARTIES] <- walk[day-1,  1:PARTIES] * tightness
walk[day, 1:PARTIES] ~ ddirch(multinomial[day, 1:PARTIES])
}
# - after discontinuity
for(day in discontinuity+1:PERIOD) {
# Note: use math not a distribution to generate the multinomial ...
multinomial[day, 1:PARTIES] <- walk[day-1,  1:PARTIES] * tightness
walk[day, 1:PARTIES] ~ ddirch(multinomial[day, 1:PARTIES])
}

## -- weakly informative priors for first and discontinuity days
for (party in 1:2) { # for each major party
alpha[party] ~ dunif(250, 600) # majors between 25% and 60%
beta[party] ~ dunif(250, 600) # majors between 25% and 60%
}
for (party in 3:PARTIES) { # for each minor party
alpha[party] ~ dunif(10, 250) # minors between 1% and 25%
beta[party] ~ dunif(10, 250) # minors between 1% and 25%
}
walk[1, 1:PARTIES] ~ ddirch(alpha[])
walk[discontinuity, 1:PARTIES] ~ ddirch(beta[])

## -- estimate a Coalition TPP from the primary votes
for(day in 1:PERIOD) {
CoalitionTPP[day] <- sum(walk[day, 1:PARTIES] *
preference_flows[1:PARTIES])
CoalitionTPP2010[day] <- sum(walk[day, 1:PARTIES] *
preference_flows_2010[1:PARTIES])
}

#### -- house effects model with two-way, sum-to-zero constraints

## -- vague priors ...
for (h in 2:HOUSECOUNT) {
for (p in 2:PARTIES) {
houseEffect[h, p] ~ dunif(-0.1, 0.1)
}
}

## -- sum to zero - but only in one direction for houseEffect[1, 1]
for (p in 2:PARTIES) {
houseEffect[1, p] <- 0 - sum( houseEffect[2:HOUSECOUNT, p] )
}
for(h in 1:HOUSECOUNT) {
# includes constraint for houseEffect[1, 1], but only in one direction
houseEffect[h, 1] <- 0 - sum( houseEffect[h, 2:PARTIES] )
}

## -- the other direction constraint on houseEffect[1, 1]
zero ~ dsum( houseEffect[1, 1], sum( houseEffect[2:HOUSECOUNT, 1] ) )
}

The anchored Dirichlet primary vote model

The anchored Dirichlet model follows. It draws on elements from the anchored TPP model and Dirichlet model above. It is the most complex of these models. It also takes the longest time to produce reliable results. For this model, I run 460,000 samples, taking every 23rd sample for analysis. On my aging Apple iMac and JAGS 4.0.1 it takes around 100 minutes to run.

model {

#### -- observational model
for(poll in 1:NUMPOLLS) { # for each poll result - rows
adjusted_poll[poll, 1:PARTIES] <- walk[pollDay[poll], 1:PARTIES] +
houseEffect[house[poll], 1:PARTIES]
primaryVotes[poll, 1:PARTIES] ~ dmulti(adjusted_poll[poll, 1:PARTIES], n[poll])
}

#### -- temporal model with multiple discontinuities
# - tightness of fit parameters
tightness <- 50000 # kludge - today very much like yesterday

# Up until the first discontinuity ...
for(day in 2:(discontinuities[1]-1)) {
multinomial[day, 1:PARTIES] <- walk[day-1,  1:PARTIES] * tightness
walk[day, 1:PARTIES] ~ ddirch(multinomial[day, 1:PARTIES])
}

# Between the discontinuities ... assumes 2 or more discontinuities ...
for( disc in 1:(n_discontinuities-1) ) {
for(day in (discontinuities[disc]+1):(discontinuities[(disc+1)]-1)) {
multinomial[day, 1:PARTIES] <- walk[day-1,  1:PARTIES] * tightness
walk[day, 1:PARTIES] ~ ddirch(multinomial[day, 1:PARTIES])
}
}

# After the last discontinuity
for(day in (discontinuities[n_discontinuities]+1):PERIOD) {
multinomial[day, 1:PARTIES] <- walk[day-1,  1:PARTIES] * tightness
walk[day, 1:PARTIES] ~ ddirch(multinomial[day, 1:PARTIES])
}

# weakly informative priors for first day and discontinutity days ...
for (party in 1:2) { # for each minor party
alpha[party] ~ dunif(250, 600) # minors between 25% and 60%
}
for (party in 3:PARTIES) { # for each minor party
alpha[party] ~ dunif(10, 200) # minors between 1% and 20%
}
walk[1, 1:PARTIES] ~ ddirch(alpha[])

for(j in 1:n_discontinuities) {
for (party in 1:2) { # for each minor party
beta[j, party] ~ dunif(250, 600) # minors between 25% and 60%
}
for (party in 3:PARTIES) { # for each minor party
beta[j, party] ~ dunif(10, 200) # minors between 1% and 20%
}
walk[discontinuities[j], 1:PARTIES] ~ ddirch(beta[j, 1:PARTIES])
}

## -- estimate a Coalition TPP from the primary votes
for(day in 1:PERIOD) {
CoalitionTPP[day] <- sum(walk[day, 1:PARTIES] *
preference_flows[1:PARTIES])
}

#### -- sum-to-zero constraints on house effects
for(h in 1:HOUSECOUNT) {
for (p in 2:PARTIES) {
houseEffect[h, p] ~ dnorm(0, pow(0.05, -2))
}
}
# need to lock in ... but only in one dimension
for(h in 1:HOUSECOUNT) { # for each house ...
houseEffect[h, 1] <- -sum( houseEffect[h, 2:PARTIES] )
}
}

Beta model of primary vote share for Palmer United

A simplification of the Dirichlet distribution is the Beta distribution. I use the Beta distribution (in a similar model to the Dirichlet model above), to track the primary vote share of the Palmer United Party. This model does not provide for a discontinuity in voting associated with the change of Prime Minister.

model {

#### -- observational model
for(poll in 1:NUMPOLLS) { # for each poll result - rows
adjusted_poll[poll] <- walk[pollDay[poll]] + houseEffect[house[poll]]
palmerVotes[poll] ~ dbin(adjusted_poll[poll], n[poll])
}

#### -- temporal model (a daily walk where today is much like yesterday)
tightness <- 50000 # KLUDGE - tightness of fit parameter selected by hand
for(day in 2:PERIOD) { # rows
binomial[day] <- walk[day-1] * tightness
walk[day] ~ dbeta(binomial[day], tightness - binomial[day])
}

## -- weakly informative priors for first day in the temporal model
alpha ~ dunif(1, 1500)
walk[1] ~ dbeta(alpha, 10000-alpha)

#### -- sum-to-zero constraints on house effects
for(h in 2:HOUSECOUNT) { # for each house ...
houseEffect[h] ~ dnorm(0, pow(0.1, -2))
}
houseEffect[1] <- -sum(houseEffect[2:HOUSECOUNT])
}


Code and data

I have made most of my data and code base available on Google Drive. Please note, this is my live code base, which I play with quite a bit. So, there will be times when it is broken or in some stage of being edited.

What I have not made available is the Excel spreadsheets into which I initially place my data. These live in the (hidden) raw-data directory. However, the collated data for the Bayesian model lives in the intermediate directory, visible from the above link.

Sunday, March 4, 2018

March 2018 polling update

I have been busy coding a new primary vote aggregation model in Stan and Python. It was an interesting experience as I came to terms with the way in which Stan differs from JAGS (and the ways in which the Hamiltonian Monte Carlo sampling differs from Gibbs sampling). While I found Stan a little more fiddly to code, it is definitely easier to debug, requires fewer iterations, and is therefore faster to both code and run. I will write a technical post shortly on the primary vote aggregation model, as well as some of my experiences with Stan.

So let's get down to the data. First, however, an acknowledgement: I sourced the polling data from the Wikipedia page on the next Australian federal election.

The two-party preferred (TPP) aggregation tends to significant levels of smoothing. At the start of March this model estimates the Coalition would win 47.3 per cent of the TPP vote share. While I have not converted this to an election outcome probability (the next task), this result is strongly suggestive of sizable Labor win, were an election held at the moment.

Next we will look at the primary vote aggregation model I have just completed. This model is not as smoothed as the TPP aggregation model. It is more sensitive to week-on-week polling changes. It is an interesting question whether this is just picking up noise or better listening to the underlying signal.

Of note is the recent decline in primary vote share for Labor and the Coalition. These votes have moved to the Greens and the other parties.

The above charts can be summarised and compared as follows:

The house adjustments are more complicated than for the TPP model. The sum to zero constraint needs to be maintained in two directions simultaneously: for each pollster across the four party groups; and for each party, across the (currently) five pollsters in the data. It is this two-way necessity that sees the median lines sometime appear to sit above or below the preponderance of polling results.

For the primary vote estimates we can calculate a TPP estimate using the preference flows evidenced at the previous election. In this model, I attribute the Other vote share in a single transfer. I do this using the preference flows from the previous elections. Pollsters can be more nuanced because they capture the specific other party from their respondents. They can then apply the actual previous election preference transfer rate for the specific primary party vote nominated by the respondent. [Another acknowledgement: I used Antony Green's reporting on preference flows].

Of note: when the polls are suggesting almost 30 per cent of the primary vote is not going to the major Coalition and Labor parties, how preferences flow will substantially shape the final election outcome. There is a 1.7 percentage point difference in the final TPP estimate between the application of the 2010 and 2016 preference flows.

This can be summarised as follows.

Tuesday, February 20, 2018

Polling update

The last time I ran a polling aggregate was just before the 2016 election. It was on an Apple iMac. I have since junked the iMac (it was 5 years old), and replaced it with a windows machine running on a Ryzen 7 1800X. The iMac Pro cost more than I was willing to pay (and about twice what I paid for the Ryzen).

A new machine, with a new operating system, necessitated changes in the way I work. I am currently using Stan and Python (I junked JAGS and R). I could get neither PyStan nor PyJAGS working well under Windows, so I am running these polling models on a virtual Linux box.

I have sourced the input polling data from Wikipedia. I have simplified the model a little from the last election. I will talk about this more in coming weeks.

The initial aggregation of the two-party preferred voting look like this.

The next challenge is the Dirichlet model for primary voting intention. Again I will need to transition this model from JAGS to STAN.

The primary voting trends are worth watching for the next election. At the moment they are attracting more that 25 per cent of the stated voting intention. The high other vote adds uncertainty to the TPP estimates from pollsters, which is not reflected in the above charts. It is not unusual for an Australian Federal election to be plus or minus 2 percentage points from the pre-election aggregated polling.

But with the non-Coalition/non-Labor vote so high, my working hypothesis is that both an upset and a landslide election are possible, with a final TPP vote share at the next election up to +/- 5 percentage points from the pre-election aggregate. I will model this increased uncertainty when I come to a seat-by-seat Monte Carlo model of the next election (provided the other vote-share remains historically high).