Quick Answer

A conversion rate optimization company that rushes tests to the dashboard will make you poorer with each declared winner. Roughly 80% of A/B tests called winners in the first week fail to repeat when re-run with proper sample sizes — that number comes from CXL's analysis of thousands of test replays and matches what Harvard Business Review found in a 2017 ecommerce audit. The tests weren't wrong. The statistics were. The industry's quiet secret is that most CRO work is not testing at all — it's coin-flipping with better graphics.

Real optimization starts three steps before the test button. Research, hypothesis, sample-size math, then run. Skip any one and the "lift" you ship is usually random variance that regresses to the mean by the time next quarter's numbers come in.

Why Most A/B Tests Declare False Winners

The core problem is statistical power — the probability that a test will detect a real effect if one exists. Underpowered tests don't just miss true effects; they also produce exaggerated estimates of any effect they do catch. This is called the "winner's curse" and it's why so many declared winners shrink or disappear on rollout.

Consider the math. To detect a 10% relative lift on a page converting at 3%, with 80% statistical power at a 95% confidence level, each variant needs roughly 53,000 visitors. A site with 20,000 monthly visitors collects about 5,000 visitors in a weeklong test, roughly 2,500 per variant: around one-twentieth of the required sample. The test will still declare a winner. The winner will be noise.
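
For readers who want to check the arithmetic, here is a minimal sketch of that calculation in Python, using the standard two-proportion formula behind calculators like Evan Miller's (the function name is ours):

```python
import math
from statistics import NormalDist

def sample_size_per_variant(baseline, relative_lift, alpha=0.05, power=0.80):
    """Visitors needed per variant for a two-sided two-proportion z-test."""
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # 1.96 at 95% confidence
    z_beta = NormalDist().inv_cdf(power)           # 0.84 at 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

# 3% baseline, 10% relative lift target:
print(sample_size_per_variant(0.03, 0.10))  # ~53,000 per variant
```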

Worse, a test that is checked daily and stopped the moment p < 0.05 appears is almost guaranteed to produce a false positive eventually. The p-value is only valid at a pre-declared stopping point; peeking inflates the false-positive rate from 5% to 30% or higher. Most dashboards do not warn you about this, and the vendors who treat early stopping as a feature are selling the illusion of velocity.
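
A toy simulation makes the inflation concrete. The sketch below (our own illustration, not any vendor's tool) runs an A/A test in which both arms convert at the same true 3% and peeks once a day for four weeks; any declared winner is by construction a false positive:

```python
import random
from statistics import NormalDist

def p_value(c_a, n_a, c_b, n_b):
    """Two-sided p-value for a difference in two observed proportions."""
    pooled = (c_a + c_b) / (n_a + n_b)
    se = (pooled * (1 - pooled) * (1 / n_a + 1 / n_b)) ** 0.5
    if se == 0:
        return 1.0
    z = abs(c_a / n_a - c_b / n_b) / se
    return 2 * (1 - NormalDist().cdf(z))

def peeked_test(daily=500, days=28, rate=0.03):
    """True if daily peeking ever 'declares a winner' in an A/A test."""
    c_a = c_b = n = 0
    for _ in range(days):
        c_a += sum(random.random() < rate for _ in range(daily))
        c_b += sum(random.random() < rate for _ in range(daily))
        n += daily
        if p_value(c_a, n, c_b, n) < 0.05:
            return True   # stopped at first "significant" peek: false positive
    return False

trials = 1000
fp_rate = sum(peeked_test() for _ in range(trials)) / trials
print(f"False-positive rate with daily peeking: {fp_rate:.0%}")  # far above 5%
```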

Research Before Testing: The 70/20/10 Rule

Serious CRO programs spend roughly 70% of hours on research, 20% on test design and execution, and 10% on analysis and rollout. Teams that invert that ratio — 10% research, 70% testing — average far lower win rates and much smaller lifts per win. The research is the lift. The test just verifies it.

Qualitative Research: Finding the Friction

A session replay tool like Hotjar or FullStory exposes the exact moments users abandon, what they clicked and didn't click, where rage-clicks cluster, and where they backed out. Watching twenty sessions on the highest-drop-off step consistently finds the top two or three friction points — the ones worth testing.

Pair that with onsite surveys fired at the moment of abandonment: a one-question exit-intent prompt on the pricing page ("What stopped you?") typically returns 50–200 responses per week and surfaces objections nobody on the marketing team had written down. Those objections become hypotheses. Hypotheses become tests.

Quantitative Research: Sizing the Opportunity

Analytics reveal where the funnel leaks most. A page with 40,000 monthly visitors at 2% conversion offers more lift headroom than a page with 2,000 visitors at 1%. Funnel-step drop-off percentages, segmented by device and acquisition source, tell you where to point the research. Qualitative tools tell you why drop-off happens. Both are needed. Either alone is guesswork.
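
The arithmetic behind that comparison is worth making explicit. A minimal sketch, assuming a winning test delivers a 10% relative lift (page names are hypothetical):

```python
pages = [
    {"name": "pricing",      "monthly_visitors": 40_000, "conversion": 0.02},
    {"name": "feature tour", "monthly_visitors": 2_000,  "conversion": 0.01},
]
LIFT = 0.10  # assume a winning test delivers a 10% relative lift

for page in pages:
    extra = page["monthly_visitors"] * page["conversion"] * LIFT
    print(f"{page['name']}: +{extra:.0f} conversions/month")
# pricing: +80 conversions/month
# feature tour: +2 conversions/month
```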

The ICE Framework That Kills Bad Tests Early

Once research produces hypotheses, they need prioritization. The ICE framework — Impact, Confidence, Ease — scores each idea 1–10 on three axes. Multiply or average the three. Test the top-scored hypotheses first.

The framework is blunt, which is the point. It kills tests that wasted the last team's quarter — the ones with big-feeling changes, no research backing them, and six weeks of engineering time. It also protects small-but-obvious wins from being deprioritized into "someday" lists.
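
Mechanically, ICE is nothing more than scoring and sorting. A minimal sketch, with hypothetical hypotheses; note how the big-feeling redesign with no research behind it sinks to the bottom:

```python
hypotheses = [
    {"idea": "Shorten checkout form",     "impact": 8, "confidence": 7, "ease": 6},
    {"idea": "Add return policy at cart", "impact": 5, "confidence": 6, "ease": 9},
    {"idea": "Redesign homepage hero",    "impact": 7, "confidence": 2, "ease": 2},
]

for h in hypotheses:
    h["ice"] = h["impact"] * h["confidence"] * h["ease"]  # multiply the axes (averaging also works)

for h in sorted(hypotheses, key=lambda h: h["ice"], reverse=True):
    print(f"{h['ice']:>4}  {h['idea']}")
#  336  Shorten checkout form
#  270  Add return policy at cart
#   28  Redesign homepage hero
```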

The Math That Stops the Coin-Flipping

Before every test, calculate the required sample size with a free tool like Evan Miller's calculator or Optimizely's. If your site won't hit that number within the planned test window, don't launch. Aim the hypothesis at a bigger expected effect, run the test longer, or cut the number of variants. Running anyway is not a test; it's theatre with a p-value.
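
That go/no-go decision can be made mechanical. A minimal sketch, assuming the required sample size comes from a calculator as above (all numbers illustrative):

```python
import math

def weeks_needed(required_n, weekly_visitors, variants=2):
    """Weeks until every variant reaches the required sample size."""
    per_variant_per_week = weekly_visitors / variants
    return math.ceil(required_n / per_variant_per_week)

required_n = 53_000      # per variant, from the calculator
weekly_visitors = 5_000  # traffic actually reaching the tested page
planned_weeks = 2

needed = weeks_needed(required_n, weekly_visitors)
print(f"{needed} weeks needed for power")  # 22 weeks
print("launch" if needed <= planned_weeks else "do not launch")
```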

One Variable at a Time, or Learn Nothing

A test that changes the headline, the button color, and the form length simultaneously will produce a number. It will not produce a lesson. If it wins, nobody knows which change drove the lift. If it loses, nobody knows which change dragged it down. Either way, the team learns nothing it can carry into the next test.

Multivariate tests are different: they are designed to isolate the effect of each variable, but a full factorial design splits traffic across every combination of variants, so they require far larger sample sizes (often 4–6x a comparable A/B test) and are only appropriate on high-traffic pages. For most small and mid-market sites, sequential A/B tests on isolated variables are the correct tool. Slower, but the lessons compound.
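
The multiplier falls straight out of the combinatorics. A minimal sketch:

```python
import math

levels = [2, 2, 2]  # e.g. two headlines x two CTAs x two form lengths
cells = math.prod(levels)   # 8 combinations, each needing its own sample
multiplier = cells / 2      # vs the 2 arms of a simple A/B test
print(f"{cells} cells: ~{multiplier:.0f}x the traffic of an A/B test")
# 8 cells: ~4x the traffic of an A/B test
```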

This is how real programs build compounding win rates. Year one, a 15% test win rate. Year three, the same team hits 35–40% because every completed test — winner or loser — taught them something about the user and the funnel that informed the next one.

What to Audit in the First 30 Days

Before any test runs, a good engagement audits five things. Each frequently exposes cheap wins that are worth shipping without testing.

  1. Checkout or form friction — field count, required fields, payment options, error handling. Baymard's research puts the average checkout flow at roughly 23 form elements when an ideal flow needs about 8 fields.
  2. Mobile rendering problems — buttons below the viewport fold, tap targets under 44px, horizontal scroll, broken drop-downs. Mobile is where most sites leak the most and test the least.
  3. Page speed on mobile — every extra second of load at the 2–5 second mark costs a measurable conversion percentage. Speed fixes frequently outperform any test you'd run.
  4. Clarity of offer and next step above the fold — a five-second test with external users often reveals that what the team thinks is obvious is not.
  5. Trust markers at the moment of decision — security badges at the card field, the return policy at the add-to-cart button, a testimonial beside the pricing table. Placement matters more than presence.

Many sites see 10–25% conversion lift from the audit alone, before a single A/B test ships. Vendors who skip the audit and jump to testing miss those wins and bill you to find them a quarter later. The patterns the audit most often surfaces are catalogued in our companion piece on low website conversion rate fixes.

What to Ask Before Hiring

Ask a prospective vendor for their long-run test win rate. Serious CRO programs land around 20–35%. If a vendor promises a 70% win rate, they are either exaggerating or running underpowered tests that will not hold on rollout. Believe the math, not the pitch.

Where a Real Conversion Rate Optimization Company Earns Its Fee

A real conversion rate optimization company does the slow work first: research, hypothesis, sample-size math, one-variable tests, and honest postmortems on the losers. It ships quick wins from the audit while the test pipeline warms up, prioritizes the backlog with ICE, and refuses to declare winners at p-values that won't survive contact with next quarter's traffic. Revenue Group runs CRO under those constraints because the alternative is expensive coin-flipping dressed up as science. If a site has been "testing" for a year and revenue per visitor hasn't moved, the program is almost certainly stuck in the sample-size trap: calling noise a win while real friction stays unfixed.

Stop Shipping Tests That Won't Hold

Free CRO audit. We review your top funnel steps, sample-size math, and recent test history — and flag the wins you can ship without a test.

Get My Free Audit →