RLAUXE (“r-lux”)
WORK IN PROGRESS last changed: 9/11/2025
A library for Risk Limiting Audits (RLAs), based on Philip Stark's SHANGRLA framework and related code. Rlauxe is an independent implementation of SHANGRLA, based on the published papers of Stark et al.
The SHANGRLA python library is the work of Philip Stark and collaborators, released under the AGPL-3.0 license. Also see OneAudit example python code
Also see:
Click on plot images to get an interactive html plot. You can also read this document on github.io.
Table of Contents
An audit is performed in rounds, as outlined here:
For each contest:
The purpose of the audit is to determine whether the reported winner(s) are correct, to within the chosen risk limit.
For each audit round:
SHANGRLA is a framework for running Risk Limiting Audits for elections. It uses a statistical risk testing function that allows an audit to statistically confirm (or not) an election outcome to within a risk limit α. For example, a risk limit of 5% means that if the reported outcome (i.e. the winner(s)) is incorrect, the audit has at most a 5% chance of mistakenly confirming it.
It uses an assorter to assign a number to each ballot, and checks outcomes by testing half-average assertions, each of which claims that the mean of a finite list of numbers is greater than 1/2. The complementary null hypothesis is that the assorter mean is not greater than 1/2. If that hypothesis is rejected for every assertion, the audit concludes that the outcome is correct. Otherwise, the audit expands, potentially to a full hand count. If every null is tested at risk level α, this results in a risk-limiting audit with risk limit α: if the election outcome is not correct, the chance the audit will stop shy of a full hand count is at most α.
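As a concrete sketch of these definitions (illustrative Python, not the rlauxe Kotlin code), here is the SHANGRLA plurality assorter and its half-average assertion:

```python
# Illustrative sketch: the SHANGRLA plurality assorter assigns 1 to a vote
# for the winner, 0 to a vote for the loser, and 1/2 otherwise; the assertion
# "winner beat loser" holds exactly when the assorter mean is > 1/2.

def plurality_assort(ballot, winner, loser):
    """ballot: the set of candidates voted for on one ballot card."""
    if winner in ballot:
        return 1.0
    if loser in ballot:
        return 0.0
    return 0.5  # undervote, or a vote for some other candidate

ballots = [{"A"}, {"A"}, {"B"}, {"C"}, {"A"}]
mean = sum(plurality_assort(b, "A", "B") for b in ballots) / len(ballots)
assert mean > 0.5  # the assertion "A beat B" holds for this tiny population
```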
term | definition |
---|---|
Nc | a trusted, independent bound on the number of valid ballots cast in the contest c. |
Ncards | the number of ballot cards validly cast in the contest |
risk | we want to confirm or reject the null hypothesis with risk level α. |
assorter | assigns a number between 0 and upper to each ballot, chosen to make assertions “half average”. |
assertion | the mean of assorter values is > 1/2: “half-average assertion” |
estimator | estimates the true population mean from the sampled assorter values. |
bettingFn | decides how much to bet for each sample. (BettingMart) |
riskFn | the statistical method to test if the assertion is true. |
audit | iterative process of choosing ballots and checking if all the assertions are true. |
When the election system produces an electronic record for each ballot card, known as a Cast Vote Record (CVR), a Card Level Comparison Audit (CLCA) can be done: each sampled CVR is compared with a Manual Vote Record (MVR) produced by hand auditing the corresponding ballot card. A CLCA typically needs many fewer sampled ballots to validate contest results than other methods.
The requirements for CLCA audits:
For the risk function, rlauxe uses the BettingMart function with the AdaptiveBetting betting function. AdaptiveBetting needs estimates of the rates of over(under)statements. If these estimates are correct, one gets optimal sample sizes. AdaptiveBetting uses a variant of ShrinkTrunkage that uses a weighted average of initial estimates (aka priors) with the actual sampled rates.
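To make the mechanics concrete, here is a minimal fixed-bet sketch of a betting martingale test (illustrative Python, not the rlauxe BettingMart/AdaptiveBetting code, which chooses the bet adaptively from estimated error rates):

```python
# Minimal fixed-bet sketch of a betting martingale (after the BETTING and
# COBRA papers). Under the null hypothesis the assorter mean is mu = 1/2;
# the wealth process T_n = prod_{i<=n} (1 + lam * (x_i - mu)) is a
# nonnegative martingale under the null, and the audit can stop as soon as
# T_n >= 1/alpha (the anytime-valid p-value is 1/T_n).

def betting_mart(samples, alpha=0.05, mu=0.5, lam=1.0):
    """Return the number of samples needed to reject the null, or None."""
    wealth = 1.0
    for n, x in enumerate(samples, start=1):
        wealth *= 1.0 + lam * (x - mu)  # lam in [0, 1/mu] keeps wealth >= 0
        if wealth >= 1.0 / alpha:
            return n
    return None

# a stream of all-1 assort values rejects quickly; a stream at the null
# mean (1/2) never rejects, since the wealth never grows
assert betting_mart([1.0] * 20) == 8
assert betting_mart([0.5] * 100) is None
```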
See CLCA Risk function for details on the BettingMart risk function.
See CLCA AdaptiveBetting for details on the AdaptiveBetting function.
See CLCA Error Rates for estimating error rates.
When CVRs are not available, a Polling audit can be done instead. A Polling audit creates an MVR for each ballot card selected for sampling, just as with a CLCA, but without a CVR to compare it to.
The requirements for Polling audits:
For the risk function, Rlauxe uses the AlphaMart (aka ALPHA) function with the ShrinkTrunkage estimation of the true population mean (theta). ShrinkTrunkage uses a weighted average of an initial estimate of the mean with the measured mean of the MVRs as they are sampled. The reported mean is used as the initial estimate. The assort values are specified in SHANGRLA, section 2. See Assorter.kt for our implementation.
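Here is an illustrative Python sketch of the ALPHA martingale with a shrink-trunc style estimator, following the ALPHA paper (sampling with replacement, so the null mean mu stays 1/2); it is not the rlauxe AlphaMart code:

```python
# Sketch of the ALPHA martingale with a shrink-trunc style estimator.
# u is the assorter upper bound, mu the null mean, eta the running
# estimate of the true mean theta.

def alpha_mart(samples, eta0, alpha=0.05, u=1.0, mu=0.5, d=100, eps=0.01):
    """eta0: initial estimate of theta (e.g. the reported assorter mean);
    d: weight given to eta0 relative to the sampled mean so far."""
    wealth, total = 1.0, 0.0
    for n, x in enumerate(samples, start=1):
        # shrink-trunc: weighted average of eta0 and the running sample
        # mean, truncated into (mu, u)
        eta = (d * eta0 + total) / (d + n - 1)
        eta = min(max(eta, mu + eps), u - eps)
        # one ALPHA term (ALPHA eq. 4, with-replacement form)
        wealth *= (x * eta / mu + (u - x) * (u - eta) / (u - mu)) / u
        total += x
        if wealth >= 1.0 / alpha:
            return n
    return None
```

With all-1 samples the wealth grows and the null is rejected; with samples stuck at the null mean 1/2 each term is exactly 1 and the audit never stops.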
See AlphaMart risk function for details on the AlphaMart risk function.
OneAudit is a type of CLCA audit, based on the ideas and mathematics of the ONEAudit papers (see appendix). It handles the case when CVRs are not available for all ballots. The remaining ballots are in one or more "pools" for which subtotals are available. The basic idea is to create an "overstatement-net-equivalent" (ONE) CVR for each pool, and use the average assorter value in that pool as the value of the (missing) CVR in the CLCA overstatement. Only PLURALITY and IRV can be used with OneAudit.
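A minimal sketch of the ONE CVR idea (illustrative Python; the function name is hypothetical, not from the rlauxe or SHANGRLA code):

```python
# Each pool without per-ballot CVRs gets one "overstatement-net-equivalent"
# value per assorter: the pool's average assorter value, computed from the
# reported pool subtotals.

def one_cvr_value(pool_winner_votes, pool_loser_votes, pool_ncards):
    """Average plurality assorter value over the pool; cards with neither
    a winner nor a loser vote assort to 1/2."""
    others = pool_ncards - pool_winner_votes - pool_loser_votes
    return (pool_winner_votes * 1.0 + others * 0.5) / pool_ncards

# a pool of 100 cards: 60 winner votes, 30 loser votes, 10 undervotes
assert one_cvr_value(60, 30, 100) == 0.65
```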
CVRs are available for some, but not all, ballots. When a ballot has been chosen for hand audit:
For results, see OneAudit version 4.
Older versions: OneAudit version 3 and OneAudit version 2.
Archived notes: OneAudit archive.
For CLCA and Polling, if there are no errors, then the number of samples needed for the audit is completely determined by the margin and the betting method chosen. The presence of errors adds variance because the errors show up randomly in the sequence of sampled ballots.
OneAudit has inherent sample variance due to the random sequence of pooled ballots, even when there are no errors. See plots in the next section below.
SHANGRLA code implementing OneAudit uses the San Francisco County 2024 primary and general elections for its use cases. There, the mail-in votes have CVRs that can be matched to the physical ballots, while the in-person votes have CVRs that cannot be matched to the physical ballots. In the latter case the ballots are kept by precinct in some fixed ordering, which can be used in the ballot manifest.
For this case:
"With ONEAudit, you have to start by adding every contest that appears on any card in a tally
batch to every card in that tally batch and increase the upper bound on the number of cards
in the contest appropriately."
"The code does use card style information to infer which contests are on one or more cards
in the batch, but with the pooled data, we don't know which CVR goes with which piece of paper:
we don't know which card has which contests. We only know that this pile of paper goes with this
group of CVRs. ONEAudit assigns the reported assorter mean to each CVR in the batch.
The mean is over all cards in the batch, since we don't know which cards have which contests."
(Philip Stark, private communication)
The increase in Nc (upper bound on the number of cards in the contest) has the effect of decreasing the margin and increasing the number of samples needed.
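As a toy illustration (hypothetical numbers, not from the San Francisco data) of how raising Nc dilutes the margin:

```python
# With the vote totals fixed, raising Nc shrinks the diluted margin
# (v_w - v_l) / Nc, which in turn raises the expected number of samples
# (roughly proportional to 1 / margin).

def diluted_margin(v_w, v_l, nc):
    return (v_w - v_l) / nc

assert diluted_margin(5500, 4500, 10_000) == 0.10
# adding every batch contest to every card raises Nc, say to 12,500:
assert abs(diluted_margin(5500, 4500, 12_500) - 0.08) < 1e-12
```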
I'm not sure why San Francisco County chooses not to map CVRs to physical ballots for in-person voting. If it's a technical problem with the precinct scanners, then when that problem is overcome, this can be a CLCA audit, which is much more efficient. If it's a deliberate privacy-preserving choice, perhaps it might be sufficient to use the CVRs to create a ballot manifest with CSD (Card Style Data). This essentially redacts the actual vote, but keeps a record of what contests are on each ballot, which reduces the required sampling size.
Since the variance is so highly dependent on the specifics of the pool counts and averages, we use the actual data from the SF 2024 General Election to visualize the variance for this particular use case.
For comparison, here are the number of samples needed for all 164 assertions of all the contests of the SF 2024 General Election, when all ballots have an associated CVR and there are no errors:
Here is the same election using OneAudit where the in-person ballots are in precinct pools and have no card style data. We run the audit 50 times with different permutations of the actual ballots, and show a scatter plot of the results. The variance is due to the random order of the pooled ballots; the 50 trials are spread out vertically, since they all have the same margin:
Here is the same election using OneAudit where the in-person ballots are in precinct pools but have card style data:
We also have the use case of "redacted ballots", where we only get pool totals. Perhaps that's an instance where we might have pools but the administrator is willing to give us the contest counts in each pool. (Since we then don't need ballot information, this may satisfy the privacy concern that redacted ballots are used for.)
CreateBoulderElectionOneAudit explores creating a OneAudit and making the redacted CVRs into OneAudit pools.
Findings so far:
TODO: Assuming we have (1), what are the consequences of not having (2)?
Here we are looking at the actual number of sample sizes needed to reject or confirm the null hypotheses, called the “samples needed”. We ignore the need to estimate a batch size, as if we do “one sample at a time”. This gives us a theoretical minimum. In the section Estimating Sample Batch sizes below, we deal with the need to estimate a batch size, and the extra overhead that brings.
In general, samplesNeeded is independent of N, which is helpful to keep in mind.
(Actually there is a slight dependence on N for "without replacement" audits when the sample size approaches N, but that case approaches a full hand count, and isn't very interesting.)
When Card Style Data (CSD) is missing, the samplesNeeded have to be scaled by Nb / Nc, where Nb is the number of physical ballots that a contest might be on, and Nc is the number of ballots it is actually on. See Choosing which ballots/cards to sample, below.
The following plots are simulations, averaging the results from the stated number of runs.
The audit needing the least samples is CLCA when there are no errors in the CVRs, and no phantom ballots. In that case, the samplesNeeded depend only on the margin, and so is a smooth curve:
(click on the plot to get an interactive html plot)
For example, we need exactly 1,128 samples to audit a contest with a 0.5% margin, if no errors are found. For a 10,000 vote election, that's 11.28% of the total ballots. For a 100,000 vote election, it's only 1.13%.
For polling, the assort values vary, and the number of samples needed depends on the order the samples are drawn. Here we show the average and standard deviation over 250 independent trials at each reported margin, when no errors are found:
In these simulations, errors are created between the CVRs and the MVRs, by taking fuzzPct of the ballots and randomly changing the candidate that was voted for. When fuzzPct = 0.0, the CVRs and MVRs agree. When fuzzPct = 0.01, 1% of the contest’s votes were randomly changed, and so on.
This is a log-log plot of samplesNeeded vs fuzzPct, with margin fixed at 4%:
Varying the percent of undervotes at margin of 4%, with errors generated with 1% fuzz:
Varying phantom percent, up to and over the margin of 4.5%, with errors generated with 1% fuzz:
Having phantomPct phantoms is similar to subtracting phantomPct from the margin. In this CLCA plot we show samples needed as a function of phantomPct, and also with no phantoms but the margin shifted by phantomPct:
Sampling refers to choosing which ballots to hand review to create Manual Voting Records (MVRs). Once the MVRs are created, the actual audit takes place.
Audits are done in rounds. The auditors must decide how many cards/ballots they are willing to audit, since at some point it's more efficient for them to do a full hand count than the more elaborate process of tracking down a subset that has been selected for the sample. There's a tradeoff between the overall number of ballots sampled and the number of rounds, but we would like to minimize both.
Note that in this section we are plotting nmvrs = overall number of ballots sampled, which includes the inaccuracies of the estimation. Above we have been plotting samples needed, as if we were doing “one ballot at a time” auditing.
There are two phases to sampling: estimating the sample batch sizes for each contest, and then randomly choosing ballots that contain at least that many contests.
For each contest we simulate the audit with manufactured data that has the same margin as the reported outcome, and a guess at the error rates.
For each contest assertion we run auditConfig.nsimEst (default 100) simulations and collect the distribution of samples needed to satisfy the risk limit. We then choose the auditConfig.quantile point of that distribution as our estimate for that assertion; the contest's estimated sample size is the maximum of its assertion estimates.
If the simulation is accurate, the audit should succeed auditConfig.quantile fraction of the time (default 80%). Since we don't know the actual error rates, or the order in which the errors will be sampled, these simulation results are just estimates.
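The estimation stage can be sketched as follows (illustrative Python; the helper names are hypothetical, and auditConfig.nsimEst / auditConfig.quantile appear as plain parameters):

```python
# Run nsim simulated audits per assertion, take the `quantile` point of
# the resulting samples-needed distribution, and use the max over the
# contest's assertions as the contest's estimated sample size.

def estimate_sample_size(simulate_one_audit, nsim=100, quantile=0.80):
    """simulate_one_audit() -> samples needed for one simulated audit."""
    needed = sorted(simulate_one_audit() for _ in range(nsim))
    return needed[min(int(nsim * quantile), nsim - 1)]

def contest_estimate(assertion_simulators, nsim=100, quantile=0.80):
    """A contest's estimate is the max over its assertions' estimates."""
    return max(estimate_sample_size(s, nsim, quantile)
               for s in assertion_simulators)
```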
Note that each round does its own sampling without regard to the previous round's results. However, since the seed remains the same, the ballot ordering is the same throughout the audit. We choose the lowest ordered ballots first, so previously audited MVRs are always used again in subsequent rounds, for contests that continue to the next round. At each round we record both the total number of MVRs and the number of "new samples" needed for that round, which are the ballots the auditors have to find and hand audit for that round.
Once we have all of the contests’ estimated sample sizes, we next choose which ballots/cards to sample. This step depends whether the audit has Card Style Data (CSD, see MoreStyle, p.2), which tells which ballots have which contests.
For CLCA audits, the generated Cast Vote Records (CVRs) comprise the CSD, as long as the CVR has the information which contests are on it, even when a contest receives no votes. For Polling audits, the BallotManifest (may) contain BallotStyles which comprise the CSD.
If we have CSD, then Consistent Sampling is used to select the ballots to sample, otherwise Uniform Sampling is used.
It's critical in all cases (with or without CSD) that when the MVRs are created, the auditors record all the contests on the ballot, whether or not there are any votes for a contest. In other words, an MVR always knows whether a contest is contained on a ballot. This information is necessary in order to correctly do random sampling, which the risk limiting statistics depend on.
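A minimal sketch of consistent sampling under these rules (illustrative Python, assuming CSD, i.e. we know which contests are on each card; not the rlauxe implementation):

```python
# Each card gets one PRNG number, fixed for the whole audit. For every
# contest, walk the cards in that fixed order and take cards containing
# the contest until the contest's estimate is met; the round's sample is
# the union over contests.

import random

def consistent_sample(cards, estimates, seed=42):
    """cards: list of sets of contest ids; estimates: {contest: n needed}."""
    rng = random.Random(seed)
    nums = [rng.random() for _ in cards]   # one number per card, fixed once
    order = sorted(range(len(cards)), key=nums.__getitem__)
    chosen = set()
    for contest, n in estimates.items():
        count = 0
        for i in order:                    # cards in the fixed PRNG order
            if contest in cards[i]:
                chosen.add(i)
                count += 1
                if count >= n:
                    break
    return chosen
```

Because the per-card numbers are fixed by the seed, growing an estimate in a later round only extends each contest's sequence, so previously sampled cards are sampled again.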
At the start of the audit:
For each round:
At the start of the audit:
For each round:
We need Nc as a condition of the audit, but it's straightforward to estimate a contest's sample size without Nc, since it works out that Nc cancels out:
    sampleEstimate = rho / dilutedMargin          // (SuperSimple p. 4)
                   = rho * Nc / (v_w - v_l)
    where
        dilutedMargin = (v_w - v_l) / Nc
        rho = a constant

    totalEstimate = sampleEstimate * Nb / Nc      // must scale by proportion of ballots with that contest
                  = rho * Nb / (v_w - v_l)
                  = rho / fullyDilutedMargin
    where
        fullyDilutedMargin = (v_w - v_l) / Nb
The scale factor Nb/Nc depends on how many contests there are and how they are distributed across the ballots, but it's easy to see the effect of not having Card Style Data in any case.
As an example, in the following plot we show averages of the overall number of ballots sampled (nmvrs), for polling audits, no style information, no errors, for Nb/Nc = 1, 2, 5 and 10.
The following plot shows nmvrs for Polling vs CLCA, with and without CSD at different margins, no errors, where Nb/Nc = 2.
Overestimating sample sizes uses more hand-counted MVRs than needed. Underestimating sample sizes forces more rounds than needed. Over/under estimation is strongly influenced by over/under estimating error rates.
The following plots show approximate distribution of estimated and actual sample sizes, using our standard AdaptiveBetting betting function with weight parameter d = 100, for margin=2% and errors in the MVRs generated with 2% fuzz.
When the estimated error rates are equal to the actual error rates:
When the estimated error rates are double the actual error rates:
When the estimated error rates are half the actual error rates:
The amount of extra sampling closely follows the number of samples needed, adding around 30-70% extra work, as the following plots vs margin show:
The “extra samples” goes up as our guess for the error rates differ more from the actual rates. In these plots we use fuzzPct as a proxy for what the error rates might be.
In the best case, the simulation accurately estimates the distribution of audit sample sizes (fuzzDiff == 0%). But because there is so much variance in that distribution, the audit sample sizes are significantly overestimated. To emphasize this point, here are plots of average samples needed, and samples needed +/- one stddev, one for CLCA and one for polling:
The number of rounds needed reflects the default value of auditConfig.quantile = 80%, so we expect to need a second round 20% of the time:
An election often consists of several or many contests, and it is likely to be more efficient to audit all of the contests at once. We have several mechanisms for choosing contests to remove from the audit to keep the sample sizes reasonable.
Before the audit begins:
For each Estimation round:
These rules are somewhat arbitrary but allow us to test audits without human intervention. In a real audit, auditors might hand select which contests to audit, interacting with the estimated samplesNeeded from the estimation stage, and try out different scenarios before committing to which contests continue on to the next round. See the prototype rlauxe Viewer.
We assume that the cost of auditing a ballot is the same no matter how many contests are on it. So, if two contests always appear together on a ballot, then auditing the second contest is “free”. If the two contests appear on the same ballot some pct of the time, then the cost is reduced by that pct. More generally the reduction in cost of a multicontest audit depends on the various percentages the contests appear on the same ballot.
For any given contest, the sequence of ballots/CVRs to be used by that contest is fixed when the PRNG is chosen.
In a multi-contest audit, at each round, the estimated number of ballots needed for each contest, n, is calculated, and the first n ballots in the contest's sequence are sampled. The total set of ballots sampled in a round is just the union of the individual contests' sets. The extra efficiency of a multi-contest audit comes when the same ballot is chosen for more than one contest.
The set of contests that will continue to the next round is not known, so the set of ballots sampled at each round is not known in advance. Nonetheless, for each contest, the sequence of ballots seen by the algorithm is fixed when the PRNG is chosen.
Attacks are scenarios where the actual winner is not the reported winner. They may be intentional, due to malicious actors, or unintentional, due to mistakes in the process or bugs in the software.
Here we investigate what happens when the percentage of phantoms is high enough to flip the election, but the reported margin does not reflect that. In other words, an attack (or error) where the phantoms are not correctly reported.
We create CLCA simulations at different margins and percentage of phantoms, and fuzz the MVRs at 1%. We measure the “true margin” of the MVRs, including phantoms, by applying the CVR assorter, and use that for the x axis.
The error estimation strategies in this plot are:
These are just the initial guesses for the error rates. In all cases, they are adjusted as samples are made and errors are found.
Here are plots of sample size as a function of true margin, for phantomPct of 0, 2, and 5 percent:
Here we investigate an attack when the reported winner is different than the actual winner.
We create simulations at the given reported margins, with no fuzzing or phantoms. Then in the MVRs we flip just enough votes to make the true margin < 50%. We want to be sure that the percent of false positives stays below the risk limit (here it's 5%):
P2Z Limiting Risk by Turning Manifest Phantoms into Evil Zombies. Banuelos and Stark. July 14, 2012
RAIRE Risk-Limiting Audits for IRV Elections. Blom, Stucky, Teague 29 Oct 2019
https://arxiv.org/abs/1903.08804
SHANGRLA Sets of Half-Average Nulls Generate Risk-Limiting Audits: SHANGRLA. Stark, 24 Mar 2020
https://github.com/pbstark/SHANGRLA
MoreStyle More style, less work: card-style data decrease risk-limiting audit sample sizes. Glazer, Spertus, Stark; 6 Dec 2020
ALPHA: Audit that Learns from Previously Hand-Audited Ballots. Stark, Jan 7, 2022
https://github.com/pbstark/alpha.
BETTING Estimating means of bounded random variables by betting. Waudby-Smith and Ramdas, Aug 29, 2022
https://github.com/WannabeSmith/betting-paper-simulations
COBRA: Comparison-Optimal Betting for Risk-limiting Audits. Jacob Spertus, 16 Mar 2023
https://github.com/spertus/comparison-RLA-betting/tree/main
ONEAudit: Overstatement-Net-Equivalent Risk-Limiting Audit. Stark 6 Mar 2023.
https://github.com/pbstark/ONEAudit
STYLISH Stylish Risk-Limiting Audits in Practice. Glazer, Spertus, Stark 16 Sep 2023
https://github.com/pbstark/SHANGRLA
SliceDice Dice, but don’t slice: Optimizing the efficiency of ONEAudit. Spertus, Glazer and Stark, Aug 18 2025
https://arxiv.org/pdf/2507.22179; https://github.com/spertus/UI-TS
Also see [reference notes](docs/notes/papers.txt).
SHANGRLA consistent_sampling() in Audit.py only audits with the estimated sample size. However, in multiple contest audits, additional ballots may be in the sample because they are needed by another contest. Since there is no guarantee that the estimated sample size is large enough, there's no reason not to include all the available MVRs in the audit.
Note that as soon as an audit gets below the risk limit, the audit is considered a success. This reflects the “anytime P-value” property of the Betting martingale (ALPHA eq 9). That is, one does not continue with the audit, which could go back above the risk limit with more samples. This does agree with how SHANGRLA works.
From STYLISH paper:
4.a) Pick the (cumulative) sample sizes {S_c} for c ∈ C to attain by the end of this round of
sampling. The software offers several options for picking {S_c}, including some based on simulation.
The desired sampling fraction f_c := S_c / N_c for contest c is the sampling probability
for each card that contains contest k, treating cards already in the sample as having sampling
probability 1. The probability p_i that previously unsampled card i is sampled in the next round is
the largest of those probabilities:
p_i := max(f_c), c ∈ C ∩ C_i, where C_i denotes the contests on card i.
4.b) Estimate the total sample size to be Sum(p_i), where the sum is across all cards i except
phantom cards.
AFAICT, the calculation of total_size using the probabilities as described in 4.b) is only used when you just want the total_size estimate without doing the consistent sampling, which already gives you the total sample size.
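For reference, step 4.b can be sketched as (illustrative Python; the names are hypothetical):

```python
# STYLISH step 4.b: the expected total sample size is the sum, over all
# non-phantom cards i, of p_i = max over the contests on card i of
# f_c = S_c / N_c, treating cards already in the sample as having p_i = 1.

def total_sample_estimate(cards, fractions, already_sampled=frozenset()):
    """cards: list of sets of contests on each card;
    fractions: {contest: S_c / N_c}."""
    total = 0.0
    for i, contests in enumerate(cards):
        if i in already_sampled:
            total += 1.0
        else:
            total += max((fractions[c] for c in contests if c in fractions),
                         default=0.0)
    return total

cards = [{"A"}, {"A", "B"}, {"B"}]
assert abs(total_sample_estimate(cards, {"A": 0.1, "B": 0.4}) - 0.9) < 1e-9
```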
From STYLISH paper:
2.c) If the upper bound on the number of cards that contain any contest is greater than the
number of CVRs that contain the contest, create a corresponding set of “phantom” CVRs as
described in section 3.4 of [St20]. The phantom CVRs are generated separately for each contest:
each phantom card contains only one contest.
SHANGRLA.make_phantoms() instead generates max(Np_c) phantoms, then for each contest adds it to the first Np_c phantoms. I'm guessing STYLISH is trying to describe the easiest possible algorithm.
2.d) If the upper bound N_c on the number of cards that contain contest c is greater than the
number of physical cards whose locations are known, create enough "phantom" cards to make up
the difference.
It is not clear what this means, or how it differs from 2.c.
SHANGRLA has guesses for p1, p2, p3, p4. We can use that method (strategy.apriori), and we can also use strategy.fuzzPct, which guesses a percent of contests to randomly change ("fuzzPct") and uses it to simulate errors (by number of candidates) in a contest. That and other strategies are described in CLCA error rates; we are still exploring which strategy works best.
At first glance, it appears that SHANGRLA Audit.py CVR.consistent_sampling() might make use of the previous round's selected ballots (sampled_cvr_indices). However, it looks like CVR.consistent_sampling() never uses sampled_cvr_indices, and so uses the same strategy as we do, namely sampling without regard to the previous rounds.
Of course, when the same ballots are selected as in previous rounds, which is the common case, the previous MVRs for those ballots are used.
SHANGRLA assumes there is no Card Style Data for polled data, and so adds undervotes to the ballots in the pools. Rlauxe adds the option that there may be CSD for pooled data, in part to investigate the difference between the two options.
The algorithm to add undervotes is not published anywhere that I know of, and needs explanation.
Modules
Also See: