Bayesian Aggregation

This page provides some technical background on the Bayesian poll aggregation models I use on this site for the 2024-25 Australian Federal election.

The data aggregation or data fusion models I use are best described as state space models. They are similar to hidden Markov models (HMMs); however, the hidden state variables in these models are continuous rather than discrete (as they are in HMMs). The models are also analogous to the Kalman filter, which likewise operates on a state space model.

I model the national voting intention (which cannot be observed directly; it is "hidden") for every day of the period under analysis. The only time the national voting intention can be observed directly is at an election. In some models (known as anchored models), we use the election result to anchor the model of day-to-day voting intention.

In the language of state space modelling, the estimates of national voting intention for each day being modelled are known as states. These states link together to form a process in which each day's state depends on the previous day's state through a probability distribution. In plain English, the model assumes that national voting intention today is much like it was yesterday.
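
Written out (using $\mu_t$ for the hidden voting intention on day $t$ and $\sigma$ for the typical size of the day-to-day movement; the notation is mine, not taken from the code below), this assumption is a Gaussian random walk:

$$\mu_t = \mu_{t-1} + \epsilon_t, \qquad \epsilon_t \sim \text{Normal}(0, \sigma^2)$$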

The model is informed by irregular and noisy data from the selected polling houses. The challenge for the model is to ignore the noise and find the underlying signal. In effect, the model is solved by estimating the day-to-day pathway that is most likely given the observed poll results.

To improve the robustness of the model, we make provision for the long-run tendency of each polling house to systematically favour either the Coalition or Labor. We call this small tendency to favour one side or the other a "house effect". The model assumes that the results from each pollster diverge (on average) from the real population voting intention by a small, constant number of percentage points. We use the estimated house effect to adjust the raw polling data from each polling house.
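
Continuing the notation above (and introducing $y_i$ for the published result of poll $i$, $d(i)$ and $h(i)$ for the day and pollster of that poll, $\delta_h$ for the house effect of pollster $h$, and $s$ for the assumed sampling error; again, the symbols are mine), each poll is modelled as the hidden voting intention on the day it was taken, shifted by its pollster's house effect, plus noise:

$$y_i \sim \text{Normal}\left(\mu_{d(i)} + \delta_{h(i)},\; s^2\right)$$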

In estimating the house effects, we can take one of a number of approaches. We could:

  • anchor the model to an election result on a particular day, and use that anchoring to establish the house effects;
  • anchor the model to a particular polling house or houses; or
  • assume that collectively the polling houses are unbiased, and that collectively their house effects sum to zero.

Currently, I tend to favour the third approach in my analysis.
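
In notation, this third approach replaces an external anchor with the constraint that the house effects of the $H$ polling houses in the data set cancel out:

$$\sum_{h=1}^{H} \delta_h = 0$$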

The problem with anchoring the model to a previous election outcome (or to a particular polling house) is that pollsters are constantly reviewing and, from time to time, changing their polling practices. Over time these changes erode the reliability of anchored models. On the other hand, the sum-to-zero assumption is rarely correct either. Nonetheless, at some previous elections, models anchored to the previous election result performed worse than models that averaged the bias across all of the polling houses.

Solving a model of this kind necessitates integration over a series of complex, multidimensional probability distributions. The integral is typically impossible to solve algebraically, but it can be evaluated with a numerical method based on Markov chains and random sampling known as Markov chain Monte Carlo (MCMC) integration.

Originally I used JAGS to solve these models. For the 2019 Australian election, I used Stan (accessed from Python using pystan). And for both the 2022 and 2025 elections I am using PyMC, which is a software package in the Python ecosystem.
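
By way of a minimal illustration of what the MCMC step looks like in PyMC, the toy example below estimates a single unknown mean (it is not the election model, and the names and numbers are mine):

import numpy as np
import pymc as pm

# Toy data: 100 noisy observations of an unknown quantity (illustrative only).
rng = np.random.default_rng(seed=1)
data = rng.normal(loc=2.0, scale=1.0, size=100)

with pm.Model() as toy_model:
    mu = pm.Normal("mu", mu=0, sigma=10)                  # prior on the unknown mean
    sigma = pm.HalfNormal("sigma", sigma=5)               # prior on the noise scale
    pm.Normal("obs", mu=mu, sigma=sigma, observed=data)   # likelihood
    idata = pm.sample(draws=1000, tune=1000, chains=4)    # MCMC sampling of the posterior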

The specific model I use, as coded in PyMC, is set out in the following code block.

import pandas as pd
import pymc as pm


def define_zs_model(  # zs = zero-sum (house effects)
    n_firms: int,
    n_days: int,
    poll_day: pd.Series,  # of int, length is number of polls
    poll_brand: pd.Series,  # of int, length is number of polls
    zero_centered_y: pd.Series,  # of float, length is number of polls
    measurement_error_sd: float,
) -> pm.Model:
    """PyMC model for pooling/aggregating voter opinion polls.
    Model assumes poll data (in percentage points)
    has been zero-centered (by subtracting the mean for
    the series). Model assumes that House Effects sum to zero."""

    model = pm.Model()
    with model:
        # --- Temporal voting-intention model
        # Guess a starting point for the random walk
        guess_first_n_polls = 5  # guess based on first n polls
        guess_sigma = 15         # allow SD flexibility on init guess
        educated_guess = zero_centered_y[
            : min(guess_first_n_polls, len(zero_centered_y))
        ].mean()
        start_dist = pm.Normal.dist(mu=educated_guess, sigma=guess_sigma)
        # Establish a Gaussian random walk ...
        daily_innovation = 0.20  # from experience ... daily change in VI
        voting_intention = pm.GaussianRandomWalk(
            "voting_intention",
            mu=0,  # no drift in model
            sigma=daily_innovation,
            init_dist=start_dist,
            steps=n_days,
        )

        # --- House effects model
        house_effect_sigma = 15  # assume big house effects possible
        house_effects = pm.ZeroSumNormal(
            "house_effects", sigma=house_effect_sigma, shape=n_firms
        )

        # --- Observational model (likelihood)
        polling_observations = pm.Normal(
            "polling_observations",
            mu=voting_intention[poll_day.values] + house_effects[poll_brand.values],
            sigma=measurement_error_sd,
            observed=zero_centered_y,
        )
    return model
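
A minimal usage sketch follows (the data preparation shown here is indicative only; the column names, the toy numbers and the two-point measurement error are my assumptions, not the exact code used on this site):

import pandas as pd
import pymc as pm

# Hypothetical poll table: "day" counts days since the first poll,
# "firm" is an integer code for the pollster, "vote_share" is in percentage points.
polls = pd.DataFrame(
    {"day": [0, 7, 9], "firm": [0, 1, 0], "vote_share": [52.0, 51.0, 53.0]}
)
zero_centered = polls["vote_share"] - polls["vote_share"].mean()

model = define_zs_model(
    n_firms=polls["firm"].nunique(),
    n_days=int(polls["day"].max()) + 1,  # days in the analysis period
    poll_day=polls["day"],
    poll_brand=polls["firm"],
    zero_centered_y=zero_centered,
    measurement_error_sd=2.0,            # assumed sampling error (percentage points)
)
with model:
    idata = pm.sample(draws=2000, tune=1000, chains=4)  # MCMC sampling of the posterior
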
Graphically, the primary model is as follows. On the day I captured this image, there were 8 pollsters in my data set, 66 polls, and 493 days from the first poll to the last.
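
For readers who want to reproduce a similar diagram, PyMC can render a model's graph directly (here, model is the object returned by define_zs_model above, and the output file name is arbitrary):

import pymc as pm

graph = pm.model_to_graphviz(model)  # returns a graphviz.Digraph (requires the graphviz package)
graph.render("zs_model_graph", format="png", cleanup=True)  # writes zs_model_graph.png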

This modelling is based on the work of Simon Jackman in Bayesian Analysis for the Social Sciences (Wiley, 2009).

The complete code base is available on my GitHub site.