Saturday, March 10, 2018

Two aggregation models for primary vote share

I have developed two different Stan models for aggregating the primary vote opinion polls.
  • The first model estimates the compositional voting proportions for all but one party as centred logits (a logit is the logarithm of the odds, and these logits are assumed to have a mean centred close to 0 on the logit scale). The temporal model is a Gaussian random walk for each of these centred logits, similar to the day-to-day process used in the two party preferred (TPP) vote share model. To get meaningful primary poll estimates, the n_parties-1 centred logits are converted to voting share proportions for all the parties. This model takes around 500 seconds (a little over 8 minutes) to produce an aggregate poll.
  • The second model is a Stan version of the Dirichlet process model I developed for the last Federal election. This model derives the party vote share on a particular day through a Dirichlet distribution over the previous day's vote share multiplied by a transmission-strength value. The larger the transmission strength, the less the change from one day to the next. This model takes about 1800 seconds (30 minutes) to produce an aggregate poll, which is a long time. A toy simulation of both day-to-day processes is sketched just below this list.

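To build intuition before the detail, here is a toy simulation of the two day-to-day temporal processes. It is a minimal Python sketch, not the production code: the party shares, centreing factors, walk sigma and transmission strength are all made-up illustrative values.

# PYTHON: toy simulation of the two day-to-day temporal processes

import numpy as np

rng = np.random.default_rng(7)
n_days = 5

# --- model 1: Gaussian random walk on centred logits ---
# illustrative centreing factors for the Coalition, Labor and the Greens;
# the "others" share is whatever remains of the simplex
centreing_factors = np.array([-0.12, -0.14, -0.40])
walk_sigma = 0.02
logits = np.zeros(3)  # centred starting point
for day in range(n_days):
    logits = logits + rng.normal(0, walk_sigma, size=3)
    first_three = 1 / (1 + np.exp(-logits)) + centreing_factors
    shares = np.append(first_three, 1 - first_three.sum())
    print('logit walk, day', day + 1, np.round(shares, 3))

# --- model 2: Dirichlet walk on the vote-share simplex ---
transmission_strength = 50_000  # higher => less day-to-day change
shares = np.array([0.38, 0.36, 0.10, 0.16])  # made-up day-one shares
for day in range(n_days):
    shares = rng.dirichlet(shares * transmission_strength)
    print('dirichlet walk, day', day + 1, np.round(shares, 3))
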
Both models use the multinomial distribution to fit the estimated hidden voting intention to the polling data. In both models this hidden vote share is expressed as an n_parties simplex: every value on the simplex lies between 0 and 1, and collectively they sum to 1. The data from each opinion poll are four integers forming a multinomial count vector, whose values sum to the pseudo-sample-size of 2000. The four party groups are the Coalition, Labor, the Greens and others. One way of converting published poll percentages into these counts is sketched below.

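The original conversion code is not shown here, so the rounding scheme below is an assumption; it simply illustrates how a published poll might become a multinomial count vector summing to the pseudo sample size.

# PYTHON: turn published percentages into multinomial counts (illustrative)

import numpy as np

PSEUDO_SAMPLE_SIZE = 2000

def poll_to_multinomial(percentages):
    # scale the published percentages to the pseudo sample size and round,
    # using largest-remainder rounding so the counts sum exactly to 2000
    raw = np.asarray(percentages, dtype=float) / 100.0 * PSEUDO_SAMPLE_SIZE
    counts = np.floor(raw).astype(int)
    shortfall = PSEUDO_SAMPLE_SIZE - counts.sum()
    order = np.argsort(raw - counts)[::-1]  # biggest remainders first
    counts[order[:shortfall]] += 1
    return counts

# a hypothetical poll: Coalition 37%, Labor 38%, Greens 10%, others 15%
print(poll_to_multinomial([37, 38, 10, 15]))  # -> [740 760 200 300]
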
Both models treat house effects in the same way. The house effects for each polling house sum to zero across the four parties; the house effects for each party also sum to zero across the (currently) five houses. The construction behind this double sum-to-zero constraint is sketched below.

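Only (n_houses-1) x (n_parties-1) house-effect parameters are free; the final row and column are implied by the constraints. Here is a minimal numpy sketch of the same construction used in the transformed parameters blocks further down (the prior scale is borrowed from the first model).

# PYTHON: house effects that sum to zero across houses and across parties

import numpy as np

n_houses, n_parties = 5, 4
rng = np.random.default_rng(1)

# free parameters: only (n_houses-1) x (n_parties-1) values are estimated
free = rng.normal(0, 0.015, size=(n_houses - 1, n_parties - 1))

house_effects = np.zeros((n_houses, n_parties))
house_effects[:-1, :-1] = free
house_effects[-1, :-1] = -free.sum(axis=0)                 # last house balances each party
house_effects[:, -1] = -house_effects[:, :-1].sum(axis=1)  # last party balances each house

print(np.allclose(house_effects.sum(axis=0), 0))  # True: zero across houses
print(np.allclose(house_effects.sum(axis=1), 0))  # True: zero across parties
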
Both models estimate, from the polling data, the extent to which voting patterns on any day are similar to the voting patterns on the previous day.

Let's look at the output from each model for each party grouping. The high, low and end-point median samples are annotated. You can see the models produce similar output. The median estimate from the Dirichlet model is smoother, and its sample distribution is wider.

We can also compare the house effects from each model. Again these are similar.

We can also compare the TPP estimates from the two models, which are derived by applying preference flows from previous elections to the estimated primary votes. In effect, each day's TPP estimate is a dot product of the primary-vote simplex with a vector of preference flows, as sketched below.

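The flow values here are made up for illustration; they are not the actual 2010, 2013 or 2016 flows used in the models.

# PYTHON: TPP as a dot product of primary shares and preference flows

import numpy as np

# illustrative primary shares: Coalition, Labor, Greens, others
primary = np.array([0.39, 0.36, 0.10, 0.15])

# illustrative share of each party's primary vote flowing to the Coalition TPP
flows_to_coalition = np.array([1.0, 0.0, 0.18, 0.50])

tpp_coalition = float(primary @ flows_to_coalition)
print(round(tpp_coalition, 3))  # 0.483 with these made-up numbers
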
We can compare these TPP estimates with the estimate from the TPP Aggregation model.

What cannot be compared, because the models parameterise it so differently, is the way each model ensures that the voting intention on one day is much like the next. In the centred logit model we have a model-estimated standard deviation (walkSigma) for the day-to-day step on the logit scale. In the Dirichlet model we have a model-estimated transmissionFactor, the inverse of which gives the transmission strength. A rough back-of-the-envelope translation of these two parameters is sketched below.

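The numbers below are illustrative only. Near the centre of the logit scale the inverse logit has slope 0.25, so a walkSigma step maps to roughly a quarter of that step in vote-share terms; and the Dirichlet model's transmission strength is simply the reciprocal of transmissionFactor.

# PYTHON: rough intuition for walkSigma and transmissionFactor

# walkSigma lives on the logit scale; near a 50 per cent vote share a
# logit-scale sd of s is roughly a daily vote-share sd of 0.25 * s
for s in (0.01, 0.02, 0.04):
    print(f'walkSigma={s:.2f} -> daily vote-share sd ~ {25 * s:.1f} percentage points')

# transmissionStrength is the reciprocal of transmissionFactor;
# smaller factors mean stickier day-to-day vote shares
for f in (0.01, 0.001, 0.0001):
    print(f'transmissionFactor={f} -> transmissionStrength={1 / f:,.0f}')
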
My code for these models is still very much in development. It has not been vectorised, and in places it is just plain ugly. I will spend some time tidying the code before I add it to my page on the models.

// STAN: Primary Vote Intention Model using Centred Logits

data {
    // data size
    int<lower=1> n_polls;
    int<lower=1> n_days;
    int<lower=1> n_houses;
    int<lower=1> n_parties;
    int<lower=1> pseudoSampleSize;
    
    // Centreing factors 
    real centreing_factors[n_parties-1];
    
    // poll data
    int<lower=1,upper=pseudoSampleSize> y[n_polls, n_parties]; // poll data multinomials
    int<lower=1,upper=n_houses> house[n_polls]; // polling house
    int<lower=1,upper=n_days> poll_day[n_polls]; // day on which polling occurred
    
    // TPP preference flows
    row_vector<lower=0,upper=1>[n_parties] preference_flows_2010;
    row_vector<lower=0,upper=1>[n_parties] preference_flows_2013;
    row_vector<lower=0,upper=1>[n_parties] preference_flows_2016;
}

parameters {
    real<lower=0> walkSigma; 
    row_vector[n_days] centredLogits[n_parties-1];
    matrix[n_houses-1, n_parties-1] houseAdjustment;
}

transformed parameters {
    matrix[n_parties, n_days] hidden_voting_intention;
    vector<lower=-0.2,upper=0.2>[n_parties] tHouseAdjustment[n_houses];
    row_vector[n_days] tmp;
    
    // house effects - two-direction sum to zero constraints
    for (h in 1:(n_houses-1))
        for(p in 1:(n_parties-1))
            tHouseAdjustment[h][p] = houseAdjustment[h][p];
    for(p in 1:(n_parties-1))
        tHouseAdjustment[n_houses][p] = -sum(col(houseAdjustment, p));
    for(h in 1:n_houses) {
        tHouseAdjustment[h][n_parties] = 0; // initialise first, so the row sum below is defined (not NaN)
        tHouseAdjustment[h][n_parties] = -sum(tHouseAdjustment[h]);
    }
    
    // convert centred logits to a simplex of hidden voting intentions
    tmp = rep_row_vector(0, n_days);
    for (p in 1:(n_parties-1)) {
        hidden_voting_intention[p] = inv_logit(centredLogits[p]) + 
            centreing_factors[p];
        tmp = tmp + hidden_voting_intention[p];
    }
    hidden_voting_intention[n_parties] = 1.0 - tmp; 
}

model{
    matrix[n_parties, n_polls] hvi_on_poll_day;

    // -- house effects model
    for( h in 1:(n_houses-1) )
        houseAdjustment[h] ~ normal(0, 0.015); 
    
    // -- temporal model - all done on the centred logit scale
    // Note: 0.02 near the centre --> roughly std dev of half a per cent 
    walkSigma ~ normal(0, 0.02); // half normal prior - note: on logit scale;
    for(p in 1:(n_parties-1)) {
        centredLogits[p][1] ~ normal(0, 0.15); // centred starting point 50% +/- 5%
        centredLogits[p][2:n_days] ~ normal(centredLogits[p][1:(n_days-1)], walkSigma);
    }
    
    // -- observed data model
    for(p in 1:n_parties)
        hvi_on_poll_day[p] = hidden_voting_intention[p][poll_day];
    for(poll in 1:n_polls)
        // note matrix transpose in the next statement ...
        y[poll] ~ multinomial(to_vector(hvi_on_poll_day'[poll]) + 
            tHouseAdjustment[house[poll]]);
}

generated quantities {
    // aggregated TPP estimates based on past preference flows
    vector [n_days] tpp2010;
    vector [n_days] tpp2013;
    vector [n_days] tpp2016;

    for (d in 1:n_days){
        // note matrix transpose in next three lines
        tpp2010[d] = sum(hidden_voting_intention'[d] .* preference_flows_2010);
        tpp2013[d] = sum(hidden_voting_intention'[d] .* preference_flows_2013);
        tpp2016[d] = sum(hidden_voting_intention'[d] .* preference_flows_2016);
    }
}

// STAN: Primary Vote Intention Model using a Dirichlet process

data {
    // data size
    int<lower=1> n_polls;
    int<lower=1> n_days;
    int<lower=1> n_houses;
    int<lower=1> n_parties;
    
    // key variables
    int<lower=1> sampleSize; // maximum sample size for y
    
    // give a rough idea of a starting point ...
    simplex[n_parties] startingPoint; // rough guess at series starting point
    int<lower=1> startingPointCertainty; // strength of guess - small number is vague
    
    // poll data
    int<lower=0,upper=sampleSize> y[n_polls, n_parties]; // a multinomial
    int<lower=1,upper=n_houses> house[n_polls]; // polling house
    int<lower=1,upper=n_days> poll_day[n_polls]; // day on which polling occurred
    
    // TPP preference flows
    vector<lower=0,upper=1>[n_parties] preference_flows_2010;
    vector<lower=0,upper=1>[n_parties] preference_flows_2013;
    vector<lower=0,upper=1>[n_parties] preference_flows_2016;
}

parameters {
    simplex[n_parties] hidden_voting_intention[n_days];
    matrix<lower=-0.06,upper=0.06>[n_houses-1, n_parties-1] houseAdjustment;
    real<lower=0> transmissionFactor;
}

transformed parameters {
    vector<lower=-0.2,upper=0.2>[n_parties] tHouseAdjustment[n_houses];
    real<lower=1> transmissionStrength; // AR(1) strength: higher is stronger

    // calculate transmissionStrength
    transmissionStrength = 1/transmissionFactor;
    
    // make the house effects sum to zero in two directions
    for (h in 1:(n_houses-1))
        for(p in 1:(n_parties-1))
            tHouseAdjustment[h][p] = houseAdjustment[h][p];
    for(p in 1:(n_parties-1))
        tHouseAdjustment[n_houses][p] = -sum(col(houseAdjustment, p));
    for(h in 1:n_houses) {
        tHouseAdjustment[h][n_parties] = 0; // initialise first, so the row sum below is defined (not NaN)
        tHouseAdjustment[h][n_parties] = -sum(tHouseAdjustment[h]);
    }
}

model{
    // -- house effects model
    for( h in 1:(n_houses-1) )
        houseAdjustment[h] ~ normal(0, 0.01); 
    
    // -- temporal model
    transmissionFactor ~ normal(0, 0.005); // a half normal prior
    hidden_voting_intention[1] ~ dirichlet(startingPoint * startingPointCertainty);
    for (d in 2:n_days)
        hidden_voting_intention[d] ~ dirichlet(hidden_voting_intention[d-1] * 
            transmissionStrength);
    
    // -- observed data model
    for(poll in 1:n_polls)
        y[poll] ~ multinomial(hidden_voting_intention[poll_day[poll]] + 
            tHouseAdjustment[house[poll]]);
}

generated quantities {
    // aggregated TPP estimates based on past preference flows
    vector [n_days] tpp2010;
    vector [n_days] tpp2013;
    vector [n_days] tpp2016;

    for (d in 1:n_days){
        tpp2010[d] = sum(hidden_voting_intention[d] .* preference_flows_2010);
        tpp2013[d] = sum(hidden_voting_intention[d] .* preference_flows_2013);
        tpp2016[d] = sum(hidden_voting_intention[d] .* preference_flows_2016);
    }
}
