AI A/B Test Sample Size Calculators: The Right Bayesian Setup for Ecommerce Tests

Most A/B test sample-size calculators give wrong answers for ecommerce. How to set up Bayesian tests with priors, decision rules, and AI-assisted analysis that actually catches winners without false positives.

The default sample-size calculator on Optimizely, VWO, and the spreadsheet you copied from a blog post is wrong for most ecommerce tests. It is built on frequentist assumptions (fixed sample, fixed alpha, fixed power) that do not match how teams actually run tests. The team eyeballs the test daily, makes a call when results "look good," and ships. The "statistical significance at 95 percent" stamp on the result is misleading because the team peeked.

Bayesian A/B testing largely removes the peeking problem and produces probability statements that match how operators actually think ("there is an 87 percent chance variant B is better"). With AI-assisted analysis on top, the whole experimentation program improves. The brands running 50+ tests per quarter with Bayesian analysis catch 15 to 30 percent more real winners than brands running the same volume on frequentist tests, and ship 30 to 40 percent fewer false positives.

We covered the broader experimentation system in AI A/B testing automation for ecommerce. This post is the statistical foundation underneath it.

Key Takeaways

  • Frequentist sample-size calculators give wrong answers when teams peek at running tests (which they always do). Bayesian methods handle peeking gracefully.
  • The right setup is a Bayesian A/B test with informative priors, a decision rule based on probability-to-be-best, and a minimum-runtime gate to absorb day-of-week effects.
  • Required sample sizes for revenue-per-visitor tests are 3 to 10 times larger than for conversion rate tests because revenue has higher variance. Most teams under-power revenue tests.
  • AI assistance speeds up the analysis layer (interpreting results, surfacing secondary metrics, flagging segment effects) but does not change the underlying statistics.
  • Multiple-comparison correction matters more than most teams realize. Running 5 tests simultaneously at a 5 percent per-test error rate, without correction, puts the chance of at least one false winner near 23 percent.

Why Frequentist Calculators Are Wrong for Ecommerce

The frequentist sample-size formula assumes:

  • The sample size is chosen before the test starts.
  • The team looks at the results only once, at the predetermined sample size.
  • The team makes the accept/reject decision based on p < 0.05 at that one look.

What actually happens:

  • The team looks daily, sometimes hourly.
  • The team makes early calls when results "look strong."
  • The team extends tests that look promising but have not hit significance.

This pattern (called the optional stopping problem) inflates the false-positive rate from the nominal 5 percent to 20 to 40 percent depending on how often the team peeks. The "significant" winners are not as winning as they look. Some of them are pure noise that happened to peak at the moment the team called the test.
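The inflation is easy to reproduce. The sketch below simulates A/A tests (no real difference) and compares one planned look against stopping at the first "significant" peek. It relies on a normal approximation: each equal batch of traffic adds an independent standard-normal increment to the test statistic. The function name, the 20 looks, and the seed are illustrative assumptions, not figures from any platform.

```python
import random
from statistics import NormalDist

def false_positive_rates(n_tests=2000, n_looks=20, alpha=0.05, seed=7):
    """Simulate A/A tests and compare a single final look with repeated peeking."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    crossed_at_any_look = 0   # tests that looked significant at some peek
    crossed_at_final = 0      # tests that looked significant at the planned end
    for _ in range(n_tests):
        running_sum = 0.0
        crossed = False
        z = 0.0
        for look in range(1, n_looks + 1):
            # Under the null, each equal batch adds an independent N(0, 1) increment,
            # so the z-statistic after `look` batches is running_sum / sqrt(look).
            running_sum += rng.gauss(0.0, 1.0)
            z = running_sum / look ** 0.5
            if abs(z) > z_crit:
                crossed = True
        crossed_at_any_look += crossed
        crossed_at_final += abs(z) > z_crit
    return crossed_at_final / n_tests, crossed_at_any_look / n_tests

single_look, with_peeking = false_positive_rates()
print(f"one planned look: {single_look:.1%}")
print(f"ship at first significant peek: {with_peeking:.1%}")
```

Raising `n_looks` pushes the second number higher, which is the "depending on how often the team peeks" part of the claim.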

Frequentist methods can be made peeking-safe with sequential analysis (always-valid p-values, group sequential designs). The math is nontrivial. Most teams do not implement it correctly.

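Bayesian methods are far more tolerant of peeking. The posterior probability that variant B is better is a valid summary of the evidence whenever you look, so you can update continuously and the math holds. What peeking still affects is the decision: shipping the first time the probability crosses a threshold produces more false winners than deciding at a planned point, which is why the decision rule needs a runtime gate.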

What Bayesian A/B Testing Actually Looks Like

The setup:

  • Prior. A probability distribution over the conversion rate (or revenue per visitor) for each variant before the test starts. Informative priors based on historical data (last quarter's average conversion rate, give or take a reasonable range) tighten the inference and reduce required sample size.
  • Likelihood. The observed conversion or revenue data updates the prior into a posterior distribution for each variant.
  • Posterior probability. From the posteriors, compute the probability that variant B has higher conversion than variant A. This is the actionable number.
  • Decision rule. Ship variant B when probability-to-be-best exceeds a threshold (typically 95 percent) AND a minimum sample-size or runtime gate has passed.

The minimum-runtime gate is the part many Bayesian implementations skip. Without it, a test on a weekend can declare a winner on Saturday based on Saturday traffic that does not represent the weekly pattern. Minimum 7 to 14 days regardless of statistical strength.
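For conversion rate, the whole setup fits in a few lines. This is a minimal sketch assuming a Beta prior with binomial data (the standard conjugate pairing), a flat Beta(1, 1) prior for illustration, and made-up visitor counts. The function names and defaults are hypothetical, not any vendor's API.

```python
import random

def prob_b_beats_a(conv_a, n_a, conv_b, n_b, prior=(1.0, 1.0), draws=20000, seed=11):
    """Monte Carlo estimate of P(rate_B > rate_A) under independent Beta posteriors."""
    rng = random.Random(seed)
    a0, b0 = prior
    wins = 0
    for _ in range(draws):
        # Posterior for each variant: Beta(prior_a + conversions, prior_b + non-conversions)
        rate_a = rng.betavariate(a0 + conv_a, b0 + n_a - conv_a)
        rate_b = rng.betavariate(a0 + conv_b, b0 + n_b - conv_b)
        wins += rate_b > rate_a
    return wins / draws

def should_ship(p_best, days_running, threshold=0.95, min_days=7):
    """Decision rule: the probability bar AND the minimum-runtime gate must both pass."""
    return p_best > threshold and days_running >= min_days

# Made-up data: A converts 500 of 20,000 visitors, B converts 560 of 20,000.
p_best = prob_b_beats_a(conv_a=500, n_a=20000, conv_b=560, n_b=20000)
print(f"P(B better than A) = {p_best:.3f}")
print("ship on day 3: ", should_ship(p_best, days_running=3))
print("ship on day 10:", should_ship(p_best, days_running=10))
```

The same probability gives a different answer on day 3 and day 10, which is the point of the gate.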

Required Sample Sizes in Practice

For conversion rate tests on a brand with 2.5 percent baseline conversion and a target of detecting 10 percent relative lift (2.5 to 2.75 percent):

  • Frequentist (two-sided alpha 0.05, power 80 percent): roughly 64,000 visitors per variant.
  • Bayesian with informative prior: roughly 20 to 35 percent fewer visitors per variant than the frequentist figure.
  • Bayesian with weakly-informative prior: about 10 percent fewer than the frequentist figure.

For revenue-per-visitor tests (RPV) on the same brand:

  • Frequentist: roughly 3 to 10 times the conversion-rate figure, about 190,000 to 640,000 visitors per variant depending on AOV variance.
  • Bayesian: roughly 20 to 25 percent fewer visitors per variant than the frequentist RPV figure.

The variance of RPV is the killer. A few high-AOV orders shift the average dramatically. The sample-size requirement is much higher than most teams understand.
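Sample-size figures move a lot with the assumptions behind them (one-sided or two-sided test, how the variance is pooled, how spread out order values are), so it is worth recomputing them for your own inputs. The sketch below uses the textbook normal-approximation formulas for a two-sided test. The AOV mean and standard deviation are hypothetical placeholders, and the RPV variance assumes each visitor either buys once or not at all.

```python
from math import ceil, sqrt
from statistics import NormalDist

def z_total(alpha=0.05, power=0.80):
    """Two-sided critical value plus the quantile for the target power."""
    std_normal = NormalDist()
    return std_normal.inv_cdf(1 - alpha / 2) + std_normal.inv_cdf(power)

def visitors_for_conversion(baseline, relative_lift, alpha=0.05, power=0.80):
    """Visitors per variant to detect a relative lift in conversion rate."""
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil(z_total(alpha, power) ** 2 * variance / (p2 - p1) ** 2)

def rpv_sd(conversion, aov_mean, aov_sd):
    """Per-visitor revenue SD when a visitor buys with probability `conversion`."""
    second_moment = conversion * (aov_sd ** 2 + aov_mean ** 2)
    return sqrt(second_moment - (conversion * aov_mean) ** 2)

def visitors_for_rpv(mean_rpv, sd_rpv, relative_lift, alpha=0.05, power=0.80):
    """Visitors per variant to detect a relative lift in revenue per visitor."""
    delta = mean_rpv * relative_lift
    return ceil(2 * (z_total(alpha, power) * sd_rpv / delta) ** 2)

n_conv = visitors_for_conversion(baseline=0.025, relative_lift=0.10)
sd = rpv_sd(conversion=0.025, aov_mean=80.0, aov_sd=120.0)  # hypothetical AOV figures
n_rpv = visitors_for_rpv(mean_rpv=0.025 * 80.0, sd_rpv=sd, relative_lift=0.10)
print(f"conversion test: {n_conv:,} visitors per variant")
print(f"RPV test:        {n_rpv:,} visitors per variant ({n_rpv / n_conv:.1f}x)")
```

Increasing `aov_sd` relative to `aov_mean` is what drives the RPV multiple up: the wider the spread of order values, the more visitors the test needs.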

The implication: brands running RPV-targeted tests on traffic of 50k/month per variant are under-powered. The team is making decisions on noise. Either tighten the testing surface to higher-traffic flows, accept conversion rate as the proxy metric (with its own problems), or run tests longer.

Priors That Actually Help

The prior is the lever that distinguishes a good Bayesian setup from a bad one.

Flat or weakly informative priors (Beta(1,1) for conversion, a normal with wide variance for RPV) treat every test as starting from scratch. They are safe but need the largest samples.

Informative priors based on historical data (Beta with center at the brand's historical baseline conversion, variance based on observed monthly variability) cut required sample size 20 to 40 percent. The prior says "we already know the brand's baseline conversion is around 2.5 percent, plus or minus 0.4 percentage points" and the test only needs enough data to update from there.
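Turning "2.5 percent, plus or minus 0.4" into Beta parameters is a method-of-moments calculation. The sketch below treats the 0.4 as one standard deviation, which is an assumption; if you read it as a 95 percent range instead, halve the standard deviation and the prior gets about four times stronger. The function name is hypothetical.

```python
def beta_prior_from_mean_sd(mean, sd):
    """Method-of-moments Beta(alpha, beta) matching a given mean and standard deviation."""
    if not 0.0 < mean < 1.0:
        raise ValueError("mean must be strictly between 0 and 1")
    max_variance = mean * (1.0 - mean)
    if sd <= 0.0 or sd ** 2 >= max_variance:
        raise ValueError("sd is not achievable for a Beta distribution with this mean")
    # alpha + beta acts like the number of "pseudo-visitors" the prior is worth.
    strength = max_variance / sd ** 2 - 1.0
    return mean * strength, (1.0 - mean) * strength

alpha, beta = beta_prior_from_mean_sd(mean=0.025, sd=0.004)
print(f"prior: Beta({alpha:.1f}, {beta:.1f}), worth about {alpha + beta:.0f} pseudo-visitors")
```

The pseudo-visitor count is a useful sanity check: a prior worth more visitors than the test will collect is doing most of the talking.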

Empirical Bayes (using past tests on similar pages to set the prior) is the most efficient setup. Brands running 100+ tests per year build a prior database for each test category (homepage hero, PDP, checkout, email subject line) and use it to set priors for new tests in the same category.

The risk of strong priors: if the brand's underlying conversion shifts (seasonality, new product, traffic mix change), the prior is stale and biases results. Refresh priors quarterly minimum.

The Decision Rules That Work

Beyond probability-to-be-best > 95 percent, two additional rules:

Expected loss threshold. Compute the expected loss of choosing the winner vs the true best variant, averaged over the posterior uncertainty. If the expected loss is below a threshold (say, 0.5 percent of revenue), ship. This catches cases where two variants are statistically similar and either choice is fine, avoiding endless test extension.

Minimum effect size. Even if probability-to-be-best is 99 percent, the effect size may be tiny (0.1 percent relative lift). Decide in advance whether the brand cares about lifts under X percent. If not, do not bother running the test or shipping the result.

Both rules cut the calendar time per test by 15 to 40 percent for tests where the variants are similar.
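Expected loss is a short Monte Carlo calculation on the same posteriors. The sketch below is for conversion rate with a flat Beta(1, 1) prior and made-up counts, and it expresses the loss relative to the baseline rate so it can be compared with a percentage threshold. Treating the 0.5 percent threshold as relative conversion rather than revenue is a simplifying assumption.

```python
import random

def expected_loss_of_shipping_b(conv_a, n_a, conv_b, n_b,
                                prior=(1.0, 1.0), draws=20000, seed=3):
    """Posterior mean of max(rate_A - rate_B, 0): conversion given up if B is really worse."""
    rng = random.Random(seed)
    a0, b0 = prior
    total = 0.0
    for _ in range(draws):
        rate_a = rng.betavariate(a0 + conv_a, b0 + n_a - conv_a)
        rate_b = rng.betavariate(a0 + conv_b, b0 + n_b - conv_b)
        total += max(rate_a - rate_b, 0.0)
    return total / draws

# Made-up data: two nearly identical variants after 400,000 visitors each.
loss = expected_loss_of_shipping_b(conv_a=10000, n_a=400000, conv_b=10100, n_b=400000)
relative_loss = loss / (10000 / 400000)
print(f"expected relative loss of shipping B: {relative_loss:.2%}")
print("ship B:", relative_loss < 0.005)
```

With these counts the probability that B is best sits well below 95 percent, yet the expected loss is already small, so the test can end instead of running indefinitely.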

Where AI Helps in the Workflow

The statistics are math. AI does not change the math. It helps with three operational layers:

Test interpretation. A Claude or GPT-4 call on the test result table that explains the result in plain language, flags secondary metrics worth checking, and surfaces segment effects ("variant B wins overall but loses on mobile, look here"). Cost per test: $0.05 to $0.30. Saves 15 to 30 minutes per test for the analyst.

Pre-test design review. Feed the test plan to an LLM with a critique prompt. The LLM catches common errors (under-powered test, ambiguous hypothesis, conflicting metrics, missed segment splits). Catches 40 to 60 percent of design issues before the test runs.

Pattern detection across tests. Aggregate the past 100 tests, feed to an LLM, ask for patterns. ("Across our last 100 PDP tests, variants that added social-proof elements above the fold won 65 percent of the time with average lift of 4.2 percent. Variants that added urgency elements won 30 percent.") The patterns inform future test design.

The math remains Bayesian. The AI lifts the productivity of the humans running the math.

Multiple Comparisons: The Hidden Trap

A brand running 10 independent tests simultaneously, each at 95 percent decision threshold, has a per-test false-positive rate of 5 percent. The chance that at least one of the 10 produces a false winner is roughly 40 percent.
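The arithmetic behind that figure, assuming the tests are independent and none of the variants has a real effect:

```python
def chance_of_any_false_winner(n_tests, per_test_rate=0.05):
    """Probability that at least one of n independent null tests produces a false winner."""
    return 1 - (1 - per_test_rate) ** n_tests

for n in (1, 5, 10, 20):
    print(f"{n:>2} tests: {chance_of_any_false_winner(n):.1%}")
```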

This matters when the team is running a test program at scale. Without correction, a meaningful share of the "winners" the brand ships in a busy quarter are not real.

Two valid corrections:

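Bonferroni or Benjamini-Hochberg adjustment on the decision thresholds. Tighten the per-test threshold as the number of simultaneous tests grows. Bonferroni is the simpler and more conservative of the two; Benjamini-Hochberg controls the share of false winners among declared winners and gives up less power.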
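Both adjustments are a few lines each. In this sketch, applying Bonferroni directly to a probability-to-be-best bar is a heuristic assumption (the classical procedure is defined on p-values), and the Benjamini-Hochberg function assumes each test also reports a p-value. The example p-values are made up.

```python
def bonferroni_threshold(n_tests, threshold=0.95):
    """Tightened decision bar: split the per-test error budget across all tests."""
    return 1 - (1 - threshold) / n_tests

def benjamini_hochberg(p_values, fdr=0.05):
    """Indices of tests declared winners under the Benjamini-Hochberg step-up rule."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, index in enumerate(order, start=1):
        # Keep the largest rank whose p-value is under its scaled threshold.
        if p_values[index] <= fdr * rank / m:
            cutoff_rank = rank
    return sorted(order[:cutoff_rank])

p_values = [0.001, 0.012, 0.025, 0.045, 0.2]  # made-up results from 5 simultaneous tests
print("Bonferroni bar for 5 tests:", round(bonferroni_threshold(5), 4))
print("Bonferroni winners:", [i for i, p in enumerate(p_values) if p <= 0.05 / len(p_values)])
print("Benjamini-Hochberg winners:", benjamini_hochberg(p_values))
```

On the same inputs Benjamini-Hochberg keeps more winners than Bonferroni, which is the power it preserves in exchange for a weaker guarantee.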

Hierarchical Bayesian model with shared priors across simultaneous tests. The math is cleaner but the implementation is harder. Worth it for brands running 20+ simultaneous tests per quarter.

Most brands ignore the correction entirely. Their false-positive rate is higher than they think.

Sequential Testing for Always-Valid Inference

For brands that want the strongest peeking guarantee, sequential testing (always-valid p-values, mSPRT, alpha-spending) lets you look at results continuously without inflating the false-positive rate.

Open-source implementations: Spotify's confidence library, Microsoft's mSPRT package, Statsig's sequential testing module. Commercial: Eppo and Statsig both ship sequential testing as the default.

Sequential tests need a larger maximum sample than fixed-horizon tests (the cost of always-valid inference is a 20 to 40 percent sample-size penalty), but they save calendar time because you can stop as soon as the evidence crosses the boundary, regardless of the pre-planned sample size. Net win for brands that hate waiting.
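To make the idea concrete, here is a minimal sketch of an mSPRT-style always-valid p-value for the difference between two variants. It uses a normal approximation with a normal mixing distribution on the true difference. The mixing variance `tau2`, the function names, and the example looks are illustrative assumptions, and this is not the code of any library named above.

```python
from math import exp, sqrt

def msprt_likelihood_ratio(diff, var_of_diff, tau2):
    """Mixture likelihood ratio against H0: no difference, with an N(0, tau2) mixing prior."""
    shrink = var_of_diff / (var_of_diff + tau2)
    exponent = diff ** 2 * tau2 / (2 * var_of_diff * (var_of_diff + tau2))
    return sqrt(shrink) * exp(exponent)

def always_valid_p_values(looks, tau2):
    """looks: (observed_difference, variance_of_difference) at each peek, in time order."""
    p_value, history = 1.0, []
    for diff, var_of_diff in looks:
        # The p-value can only go down: take the running minimum of 1 / likelihood ratio.
        p_value = min(p_value, 1.0 / msprt_likelihood_ratio(diff, var_of_diff, tau2))
        history.append(p_value)
    return history

def diff_variance(rate, n_per_variant):
    """Approximate variance of the difference between two conversion rates near `rate`."""
    return 2 * rate * (1 - rate) / n_per_variant

# Made-up peeks at 10k, 40k, and 80k visitors per variant around a 2.5 percent baseline.
looks = [
    (0.002, diff_variance(0.025, 10000)),
    (0.003, diff_variance(0.025, 40000)),
    (0.003, diff_variance(0.025, 80000)),
]
p_history = always_valid_p_values(looks, tau2=0.0025 ** 2)
print([round(p, 4) for p in p_history])
```

Because the p-value is a running minimum, stopping the first time it drops below 0.05 keeps the false-positive rate at or under 5 percent no matter how often you look.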

Vendor and Platform Landscape

The platforms that handle this well:

  • Statsig. Bayesian analysis supported, plus sequential testing, segment analysis, and integration with major data warehouses. Free tier for small volume.
  • Eppo. Similar strengths, slightly stronger experimentation analytics, weaker UI for non-analysts.
  • Optimizely Web. Frequentist by default. Bayesian is available but the implementation is less elegant. Best if you are already on Optimizely.
  • VWO Insights. Frequentist, simpler. Use for brands without a data team.
  • Build in-house. Reasonable above $100M revenue with a data team. PyMC, Stan, or custom Python implementations on warehouse data.

For Shopify Plus brands without a data team, Statsig or Eppo are the fastest path to a real Bayesian setup. Optimizely is fine if you cannot change platforms.

Implementation Path

  1. Weeks 1 to 2. Audit current testing. What platform, what method, how many false positives suspected, how much peeking happens.
  2. Weeks 2 to 4. Decide on Bayesian approach. Pick a platform or commit to a build.
  3. Weeks 4 to 6. Build the prior database. Pull last 4 quarters of test data, fit prior distributions per test category, store.
  4. Weeks 6 to 8. Define decision rules. Probability-to-be-best threshold, minimum runtime, expected-loss threshold, minimum effect size.
  5. Weeks 8 to 12. Run a parallel period. Old method and new method on the same tests. Verify the new method catches the same real winners and fewer false positives.
  6. Months 3+. Switch to Bayesian as default. Layer AI assistance on interpretation. Quarterly prior refresh.
  7. Month 6+. If running at scale, add multiple-comparison correction or sequential testing.

Time to a working Bayesian setup: 8 to 12 weeks. Time to measurably better experimentation outcomes: 6 to 12 months because the gain compounds over many tests.

FAQ

Should we switch our current frequentist tests to Bayesian mid-flight?

No. Finish the current tests on the method they started with. Switch the next cohort to Bayesian. Mid-flight method changes invalidate the analysis.

How does this work for tests with very low traffic?

Small-traffic tests are under-powered regardless of method. Bayesian methods do not magically fix the underlying signal-to-noise ratio. Options: run longer, target a higher-traffic surface, reduce the number of variants, or use multi-armed bandits where decision speed matters more than statistical certainty.

What about non-inferiority tests (we want to ship variant B if it is at least as good as A)?

Bayesian methods handle this natively. The decision rule becomes "probability variant B is no more than X worse than A > 95 percent." Cleaner than the frequentist non-inferiority test.
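A minimal sketch of that rule for conversion rate, assuming a flat Beta(1, 1) prior, a margin expressed in absolute conversion points, and made-up counts. The function name is hypothetical.

```python
import random

def prob_not_worse_than(conv_a, n_a, conv_b, n_b, margin,
                        prior=(1.0, 1.0), draws=20000, seed=5):
    """Monte Carlo estimate of P(rate_B > rate_A - margin) under Beta posteriors."""
    rng = random.Random(seed)
    a0, b0 = prior
    hits = 0
    for _ in range(draws):
        rate_a = rng.betavariate(a0 + conv_a, b0 + n_a - conv_a)
        rate_b = rng.betavariate(a0 + conv_b, b0 + n_b - conv_b)
        hits += rate_b > rate_a - margin
    return hits / draws

# Made-up data: B is a hair behind A; the margin is 10 percent of a 2.5 percent baseline.
p_non_inferior = prob_not_worse_than(1000, 40000, 990, 40000, margin=0.0025)
p_better = prob_not_worse_than(1000, 40000, 990, 40000, margin=0.0)
print(f"P(B no more than 0.25 points worse) = {p_non_inferior:.3f}")
print(f"P(B better than A)                  = {p_better:.3f}")
```

Setting `margin=0.0` recovers the ordinary probability-to-be-best, so one function covers both decision rules.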

Can we run Bayesian and frequentist on the same test for sanity checking?

Yes, especially during the transition. The two should usually agree on big wins. Disagreements are diagnostic: usually under-powered tests where the frequentist method declares no significance and the Bayesian method gives a 75 to 90 percent probability that B is better.

How often should we refresh the priors?

Quarterly at minimum, and again after any major change (new product launch, traffic mix shift, seasonality boundary like end-of-Q4). Stale priors bias inference and the team will not notice.

Need help upgrading your experimentation program to a real Bayesian setup? Contact 77 AI Agency for an experimentation audit, or review our pricing for engagement options.
