Bayesian Aggregation

This page provides some technical background on the Bayesian poll aggregation models used on this site for the 2019 Federal election.

General overview

The aggregation or data fusion models I use are probably best described as state space models or latent process models. They are also known as hidden Markov models.

I model the national voting intention (which cannot be observed directly; it is "hidden") for each and every day of the period under analysis. The only time the national voting intention is not hidden is at an election. In some models (known as anchored models), we use the election result to anchor the daily model.

In the language of modelling, our estimates of the national voting intention for each day being modelled are known as states. These "states" link together to form a process where each state depends directly on the previous state, through a probability distribution that links the two. In plain English, the models assume that the national voting intention today is much like it was yesterday.
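To make the idea of linked states concrete, here is a minimal sketch of a Gaussian random walk, where each day's value is the previous day's value plus a small random step. The function name, starting value and step size are my illustrative assumptions, not values taken from the fitted model.

```python
import numpy as np

def simulate_random_walk(n_days, start=0.5, sigma=0.001, seed=0):
    """Simulate a daily voting-intention path as a Gaussian random walk.

    start and sigma are on the unit scale (0.5 means 50 per cent).
    All default values are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    # one random step for each day after the first
    steps = rng.normal(0.0, sigma, size=n_days - 1)
    # day 1 is the starting value; later days accumulate the steps
    return start + np.concatenate(([0.0], np.cumsum(steps)))

path = simulate_random_walk(n_days=600)
print(path[:5])
```

In the real model the direction is reversed: the daily path is not simulated but inferred from the polls.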

The model is informed by irregular and noisy data from the selected polling houses. The challenge for the model is to ignore the noise and find the underlying signal. In effect, the model is solved by finding the day-to-day pathway with the maximum likelihood given the known poll results.
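As a rough illustration of what "likelihood of a pathway" means, the sketch below scores a candidate daily path against a handful of polls. It adds up a temporal term (how plausible each day-to-day change is) and an observation term (how close the polls are to the path). The function names and all the numbers are hypothetical, and the sketch leaves out house effects and priors.

```python
import numpy as np

def normal_logpdf(x, mean, sd):
    """Log density of a normal distribution, evaluated elementwise."""
    return -0.5 * np.log(2 * np.pi * sd ** 2) - (x - mean) ** 2 / (2 * sd ** 2)

def path_log_density(path, poll_day, poll_value, sigma, poll_sigma):
    """Score a candidate daily path against observed polls.

    poll_day holds zero-based day indices into path.
    """
    # temporal term: today is much like yesterday
    temporal = normal_logpdf(path[1:], path[:-1], sigma).sum()
    # observation term: polls scatter around the path on their day
    observed = normal_logpdf(poll_value, path[poll_day], poll_sigma).sum()
    return temporal + observed

# hypothetical polls on days 2, 5 and 9 of a ten-day window
poll_day = np.array([2, 5, 9])
poll_value = np.array([0.52, 0.53, 0.51])

near = path_log_density(np.full(10, 0.52), poll_day, poll_value, 0.002, 0.016)
far = path_log_density(np.full(10, 0.45), poll_day, poll_value, 0.002, 0.016)
print(near > far)
```

A path that sits close to the polls scores higher than one that sits far from them, which is the sense in which the polls pull the estimated pathway towards themselves.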

To improve the robustness of the model, we make provision for the long-run tendency of each polling house to systematically favour either the Coalition or Labor. We call this small tendency to favour one side or the other a "house effect". The model assumes that the results from each pollster diverge (on average) from the real population voting intention by a small, constant number of percentage points. We use the calculated house effect to adjust the raw polling data from each polling house.
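The adjustment itself is simple arithmetic. A minimal sketch follows, using made-up house names and house effects (in percentage points, where a positive value means the house tends to favour the Coalition):

```python
# hypothetical house effects in percentage points
# (positive: the house tends to overstate the Coalition TPP)
house_effects = {'House A': 1.0, 'House B': -0.5}

def adjust_poll(raw_tpp, house, effects):
    """Remove a house's estimated effect from its published TPP figure."""
    return raw_tpp - effects[house]

adjusted = adjust_poll(53.0, 'House A', house_effects)
print(adjusted)
```

In the Stan model further down this page the adjustment is not a separate step: the house effect is estimated jointly with the daily voting intention.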

In estimating the house effects, we can take one of a number of approaches. We could:

  • anchor the model to an election result on a particular day, and use that anchoring to establish the house effects;
  • anchor the model to a particular polling house or houses; or
  • assume that collectively the polling houses are unbiased, and that collectively their house effects sum to zero.
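The polls on their own only tell us how the houses differ from each other, so something has to pin down the overall level. The sketch below contrasts the second and third approaches using hypothetical relative house effects; anchoring to an election result is not shown. The house names and numbers are invented for illustration.

```python
import numpy as np

houses = ['House A', 'House B', 'House C']
# hypothetical relative house effects, in percentage points
relative = np.array([1.2, -0.3, 0.6])

# sum-to-zero: assume the houses are collectively unbiased
sum_to_zero = relative - relative.mean()

# anchor to one house: assume House B has no bias at all
anchored = relative - relative[houses.index('House B')]

print(sum_to_zero)
print(anchored)
```

The two sets of house effects differ only by a constant shift. That shift moves the whole aggregate up or down the chart without changing its shape.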

Currently, I oscillate between the second and third approaches in my models.

The problem with anchoring the model to an election outcome (or to a particular polling house) is that pollsters are constantly reviewing and, from time to time, changing their polling practice. Over time these changes affect the reliability of anchored models. On the other hand, the sum-to-zero assumption is rarely correct. Nonetheless, in previous elections, those people who used models that were anchored to the previous election did worse than those people whose models averaged the bias across polling houses.

Solving a model necessitates integration over a series of complex multidimensional probability distributions. The definite integral is typically impossible to solve algebraically. But it can be solved using a numerical method based on Markov chains and random numbers known as Markov Chain Monte Carlo (MCMC) integration. I use a free software product called Stan to solve these models.
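To give a feel for how a Markov chain and random numbers can stand in for an integral, here is a minimal random-walk Metropolis sampler. This is a deliberately simple swapped-in technique: Stan itself uses a more sophisticated Hamiltonian Monte Carlo sampler. The target distribution and tuning values are illustrative assumptions.

```python
import numpy as np

def metropolis(log_density, start, n_samples, step, seed=0):
    """Random-walk Metropolis sampler for a one-dimensional density."""
    rng = np.random.default_rng(seed)
    samples = np.empty(n_samples)
    current = start
    current_lp = log_density(current)
    for i in range(n_samples):
        # propose a small random move from the current position
        proposal = current + rng.normal(0.0, step)
        proposal_lp = log_density(proposal)
        # accept with probability min(1, density ratio)
        if np.log(rng.uniform()) < proposal_lp - current_lp:
            current, current_lp = proposal, proposal_lp
        samples[i] = current
    return samples

def log_density(x):
    """Unnormalised log density of a Normal(0.52, 0.01) target."""
    return -0.5 * ((x - 0.52) / 0.01) ** 2

samples = metropolis(log_density, start=0.5, n_samples=20000, step=0.02)
kept = samples[2000:]  # discard the early draws as warm-up
print(kept.mean(), kept.std())
```

The mean and spread of the kept draws approximate the mean and standard deviation of the target, with no algebraic integration required.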

Model for TPP voting intention with house effects summed to zero

This is the simplest model. It has three parts:

  1. The observed data part of the model assumes two factors explain the difference between published poll results (what we observe) and the national voting intention on a particular day (which, with the exception of elections, is hidden):

    1. The first factor is the margin of error from classical statistics. This is the random error associated with selecting a sample - however, because I have not collected sample information I have assumed all surveys are of the same size; and
    2. The second factor is the systematic biases (house effects) that affect each pollster's published estimate of the population voting intention.

  2. The temporal part of the model assumes that the actual population voting intention on any day is much the same as it was on the previous day. The model estimates the (hidden) population voting intention for every day under analysis.

  3. The house effects part of the model assumes that house effects from a core set of pollsters - Essential, Newspoll and ReachTEL - sum to zero. With the current set of polling houses, these three houses typically have similar results. The polling data from the other houses affects the shape of the aggregate poll estimate, but not its vertical positioning on the chart.
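For the first factor, the classical sampling standard deviation for a proportion p from a sample of size n is sqrt(p(1-p)/n). The sketch below works that through for an assumed sample of 1,000 respondents and p = 0.5 (the same assumptions used in the supporting Python code further down). The function name is mine.

```python
import math

def sampling_sigma(sample_size, p=0.5):
    """Standard deviation of a sample proportion, on the unit scale."""
    return math.sqrt(p * (1 - p) / sample_size)

sigma = sampling_sigma(1000)
margin_95 = 1.96 * sigma  # half-width of a 95 per cent interval

print(round(sigma, 4))            # about 0.0158
print(round(margin_95 * 100, 1))  # about 3.1 percentage points
```

This is the familiar "margin of error of about 3 per cent" quoted for polls of roughly a thousand people.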

This model is based on original work by Professor Simon Jackman. It takes advantage of Stan's vectorised operations. Stan runs the 5 chains concurrently in under 90 seconds on my machine (a virtual Linux machine on a Windows-based Ryzen 1800X). I use a virtual Linux machine because Stan does not run its chains in parallel under Windows.

// STAN: Two-Party Preferred (TPP) Vote Intention Model 

data {
    // data size
    int<lower=1> n_polls;
    int<lower=1> n_days;
    int<lower=1> n_houses;
    // assumed standard deviation for all polls
    real<lower=0> pseudoSampleSigma;
    // we are only going to normalise house effects from first n houses
    int<lower=1,upper=n_houses> n_core_set;
    // poll data
    vector<lower=0,upper=1>[n_polls] y; // TPP vote share
    int<lower=1> house[n_polls];
    int<lower=1> day[n_polls];
}

parameters {
    vector[n_days] hidden_vote_share; 
    vector[n_houses] pHouseEffects;
    real<lower=0> sigma; // genuine constraint
}

transformed parameters {
    vector[n_houses] houseEffect;
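    // house effects sum to zero over the first n_core_set houses
    // (right-hand sides completed from the sum-to-zero description above)
    houseEffect[1:n_core_set] = pHouseEffects[1:n_core_set] - 
        mean(pHouseEffects[1:n_core_set]);
    if(n_core_set < n_houses)
        houseEffect[(n_core_set+1):n_houses] = 
            pHouseEffects[(n_core_set+1):n_houses];
}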

model {
    // -- house effects model
    pHouseEffects ~ normal(0, 0.025); // prior expect up to +/- 5 percentage points 

    // -- temporal model
    sigma ~ cauchy(0, 0.005); // half cauchy prior
    hidden_vote_share[1] ~ normal(0.5, 0.05); // prior: TPP between 40% and 60%
    hidden_vote_share[2:n_days] ~ normal(hidden_vote_share[1:(n_days-1)], sigma);

    // -- observed data model
    y ~ normal(houseEffect[house] + hidden_vote_share[day], pseudoSampleSigma);
}

For those of you who are worried that sigma in the temporal model might be overly constrained (where sigma is the standard deviation in voting intention change from one day to the next), it lies well within the model constraints. The chart below shows the samples for sigma lie typically between 0.0002 and 0.002 on the unit scale (0.02 to 0.2 percentage points). Our prior for sigma was a half Cauchy, with half the distribution between 0 and 0.005 (0 to 0.5 percentage points).
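The claim about the half Cauchy prior can be checked directly from its cumulative distribution function, which for scale s is (2/π)·atan(x/s). The sketch below is only a sanity check; the function name is mine.

```python
import math

def half_cauchy_cdf(x, scale):
    """Probability that a half Cauchy(0, scale) variable is at most x."""
    return (2.0 / math.pi) * math.atan(x / scale)

scale = 0.005  # the scale used in the prior for sigma

# half of the prior mass lies below the scale parameter
print(half_cauchy_cdf(0.005, scale))

# prior mass below 0.002, the top of the typical sampled range
print(half_cauchy_cdf(0.002, scale))
```

Roughly a quarter of the prior mass sits below 0.002, and the long tail of the Cauchy leaves plenty of room above it, so the prior is not forcing the samples into that range.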

The supporting Python code for running this model is as follows. Note: I have a further Python program for generating the charts from the saved analysis.

# analyse TPP poll data

import pandas as pd
import numpy as np
import pystan
import pickle

import sys
sys.path.append( '../bin' )
from stan_cache import stan_cache

# --- version information
print('Python version: {}'.format(sys.version))
print('pystan version: {}'.format(pystan.__version__))

# --- key inputs to model
sampleSize = 1000 # treat all polls as being of this size
pseudoSampleSigma = np.sqrt((0.5 * 0.5) / sampleSize) 
chains = 5
iterations = 1000
# Note: half of the iterations will be warm-up

# --- collect the model data
# the Excel data file was extracted from the Wikipedia
# page on the next Australian Federal Election
workbook = pd.ExcelFile('./Data/poll-data.xlsx')
df = workbook.parse('Data')

# drop pre-2016 election data
df['MidDate'] = [pd.Period(date, freq='D') for date in df['MidDate']]
df = df[df['MidDate'] > pd.Period('2016-07-04', freq='D')] 

# convert dates to days from start
start = df['MidDate'].min() - 1 # day zero
df['Day'] = df['MidDate'] - start # day number for each poll
n_days = df['Day'].max()
n_polls = len(df)

# get houses - specify a core set used for 
# the houseEffect sum-to-zero constraint
core_set = ['Essential', 'ReachTEL', 'Newspoll']
n_core_set = len(core_set)
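# houses outside the core set are numbered after it
for h in df['Firm'].unique():
    if h not in core_set:
        core_set.append(h)
# map each house name to a 1-based index for Stan
house_map = {}
i = 1
for h in core_set:
    house_map[h] = i
    i += 1
df['House'] = df['Firm'].map(house_map)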
n_houses = len(df['House'].unique())

# batch up
data = {
    'n_days': n_days,
    'n_polls': n_polls,
    'n_houses': n_houses,
    'pseudoSampleSigma': pseudoSampleSigma,
    'n_core_set': n_core_set, 
    'y': (df['TPP L/NP'] / 100.0).values,
    'day': df['Day'].astype(int).values,
    'house': df['House'].astype(int).values
}

# --- get the STAN model 
with open("./Models/TPP model.stan", "r") as f:
    model = f.read()

# --- compile/retrieve model and run samples
sm = stan_cache(model_code=model)
fit = sm.sampling(data=data, iter=iterations, chains=chains)
results = fit.extract()

# --- save analysis
intermediate_data_dir = "./Intermediate/"
with open(intermediate_data_dir + 
    'output-TPP-zero-sum.pkl', 'wb') as f:
    pickle.dump([results,df,data], f)

An example of the output from this model (as at 12 March 2018) follows. This chart shows the Coalition's estimated two-party preferred (TPP) hidden vote share for each day since early July 2016.

The next chart shows the estimated house bias for each house, given the assumption that the bias from the core set (Essential, Newspoll and ReachTEL) sums to zero.
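The numbers behind charts like these are summaries of the posterior samples. As a rough sketch of how that might be done (this is not my actual charting program), the code below takes an array of draws with one row per sample and one column per day or house, and returns the median and a central 95 per cent interval for each column. The stand-in data is random; with the saved analysis I would expect to pass in something like results['hidden_vote_share'] or results['houseEffect'].

```python
import numpy as np

def summarise(samples, lower=2.5, upper=97.5):
    """Column-wise median and central interval for a (draws, k) array."""
    low, median, high = np.percentile(samples, [lower, 50, upper], axis=0)
    return low, median, high

# stand-in for real posterior draws: 2500 draws for 3 days (assumed data)
rng = np.random.default_rng(1)
fake_draws = rng.normal(0.52, 0.01, size=(2500, 3))

low, median, high = summarise(fake_draws)
for day, (lo, mid, hi) in enumerate(zip(low, median, high), start=1):
    print('day {}: {:.3f} ({:.3f} to {:.3f})'.format(day, mid, lo, hi))
```

The 2,500 draws in the stand-in data match what the settings above would produce: 5 chains of 1,000 iterations, half of which are warm-up.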

Model for primary voting intention with house effects summed to zero

[Still to be written - but my early work is here].

The 2016 version of this page ...

The 2016 version of this page has been archived to a post. It talks about the JAGS models I used in 2016 (which are very similar to the Stan models I am using here).