eCommerce conversion rate optimization (CRO) and A/B testing

I was tasked with developing an A/B testing curriculum for a portfolio of direct-to-consumer eCommerce clients, for the mobile version of the online customer portal.

Since then, I’ve worked with analytics collaborators to develop our own framework to use with eCommerce clients in a test, learn, and iterate approach. It helps us efficiently uncover opportunities, form hypotheses, align with business leaders, and collaborate with product teams.

Although clients still present new challenges that compel us to adapt our thinking and process, here’s what we’ve learned so far:

Why A/B testing is worth doing

For most startups, A/B testing is kind of like getting five servings of vegetables per day — you know you should do it, but most people don’t. There are so many reasons not to: 

  • There are too many other things to work on.
  • Why test it when you can just release it?
  • It’s too much process.
  • We’ve tried it before, but it didn’t move the needle.

When A/B testing programs have failed to catch on, it was usually for one of these four reasons:

There are too many other things to work on. 
The framework we use for conversion rate optimization and A/B testing can actually help identify and prioritize high-impact, low-effort projects. We look for high-confidence areas of high usage/traffic or high revenue, where even a small lift would have impact.

Why test it when you can just release it? 
By releasing as a test, you’ll not only have a better sense of how this release impacted behavior, but also gain valuable insight by validating or invalidating your hypothesis. Plus, once the process is in place, it’s almost as easy to deploy each change as an A/B test as it is to push live to all segments. 
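To make the “almost as easy” point concrete, here’s a minimal sketch of deterministic variant assignment in plain Python; the experiment name, visitor ID, and 50/50 split are illustrative assumptions, not details from any client setup:

```python
import hashlib

def assign_variant(user_id: str, experiment: str,
                   variants=("control", "treatment")) -> str:
    """Deterministically bucket a user into a variant.

    Hashing the user ID together with the experiment name means the same
    user always sees the same variant, and each experiment splits
    independently of every other experiment.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return variants[int(digest, 16) % len(variants)]

# Example: deciding which checkout experience to render for one visitor.
variant = assign_variant("visitor-8675309", "sticky-add-to-cart")
print(variant)  # "control" or "treatment", stable across sessions
```

Because assignment is a pure function of the IDs, the same helper can run on the server or the client, and retiring the test later is just deleting the branch.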
It’s too much process.
Our hope with this workflow is that it becomes habitual because it’s helpful—not bureaucratic overhead (we hate that too). Just as there was a sea change in writing user stories instead of tasks, we believe the exercise of writing a hypothesis sharpens both creating and objectively evaluating the design solution. The rest are guardrails to help non-analytics experts draw insights from data, reach statistical significance, and interpret the results.

We’ve tried it before, but it didn’t move the needle.
For every test that’s hyped as a $300M button test, there are just as many 41 Shades of Blue experiments that neither generate sales nor unlock customer insight. A test needs to be focused enough to generate learnings, but not so ambitious that it doesn’t get done (or becomes a nightmare to analyze), and not so small that even conclusive results don’t matter.
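On the statistical-significance guardrail mentioned above: the core check doesn’t require heavy tooling. Here’s a rough sketch of a two-proportion z-test on conversion counts, using only the Python standard library; the traffic and conversion numbers are invented for illustration:

```python
from math import erfc, sqrt

def conversion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Two-proportion z-test comparing conversion rates of variants A and B.

    Returns the z statistic and the two-sided p-value.
    """
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = erfc(abs(z) / sqrt(2))  # two-sided
    return z, p_value

# Illustrative: 480/12,000 control conversions vs. 552/12,000 treatment.
z, p = conversion_z_test(conv_a=480, n_a=12_000, conv_b=552, n_b=12_000)
print(f"z = {z:.2f}, p = {p:.3f}")  # p < 0.05 would clear a common guardrail
```

A dedicated testing tool will do this for you, but knowing the arithmetic makes the “is this conclusive?” conversation much shorter.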

What testing can and can’t accomplish

Another reason A/B testing can seem ineffective is that it’s being asked to carry too much strategic weight. A/B testing isn’t a substitute for product strategy—the team still needs to choose which hill to climb, and A/B testing can be a beacon that helps you climb toward that apex.

While you’re in the thick of building and growing a product, it’s not always easy to tell whether you’re climbing toward a local maximum or whether there’s a much larger global maximum next door that would require a pivot to your value proposition and a redesign. For this, I refer to the excellent Nielsen Norman Group article “Radical Redesign or Incremental Change?”

Realistically, most teams are balancing both visionary strategy questions and the day-to-day fundamentals of running a business. You’re incrementally improving towards your current vision, with a healthy side of growth opportunity speculation.

To ensure we don’t mix objectives and end up achieving neither, we created two different workflows.

The three principles we follow for optimization

Our goal was to create a framework so that A/B testing becomes an engine that reliably delivers real business value in the form of increased revenue or customer insights. It also needs to be lightweight enough to address all the earlier objections: not enough time, not enough impact to justify the time investment, and results that are too often inconclusive or too narrowly applicable to be valuable.

We distilled our approach down to three principles (a rough scoring sketch follows the list):

  1. Low effort: to develop, deploy, and analyze
  2. High impact: for business goals and customer development
  3. High confidence: in the hypothesis, based on available information
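The framework doesn’t prescribe a formula, but if you want a quick way to compare candidate tests along these three axes, an ICE-style score is one lightweight option. A sketch, with invented candidates and made-up 1-to-5 ratings:

```python
# Rank candidate tests by impact and confidence relative to effort.
# The candidates and ratings below are invented for illustration.
candidates = [
    {"name": "Sticky add-to-cart on mobile product pages",
     "impact": 4, "confidence": 4, "effort": 2},
    {"name": "Full checkout redesign",
     "impact": 5, "confidence": 2, "effort": 5},
    {"name": "FAQ accordion copy tweak",
     "impact": 1, "confidence": 5, "effort": 1},
]

for c in candidates:
    c["score"] = c["impact"] * c["confidence"] / c["effort"]

for c in sorted(candidates, key=lambda c: c["score"], reverse=True):
    print(f'{c["score"]:>5.1f}  {c["name"]}')
```

The exact weighting matters far less than forcing the effort and confidence conversation before anything gets built.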

Let’s take a closer look at each of these.

1. Low effort

Keep things simple, both conceptually and executionally. One of the most common reasons teams fall off on testing is that it ends up taking too much time. Even super enthusiastic teams are at risk of overextending — designing excellent but complicated tests that are time-consuming to launch and difficult to analyze. It’s almost always better to build and keep momentum by launching several smaller tests than to design one giant test that carries ever-higher pressure to deliver positive results.

Conceptually, this means no single test drastically reevaluates the experience, which also keeps post-test analysis manageable.

Executionally, our goal is to fit design and engineering into one sprint. When designing, ask: what’s the leanest implementation that would test this hypothesis?

2. High impact

There’s always going to be interesting data everywhere, and from a conversion rate optimization standpoint, chasing it is a trap we worked hard to avoid.

One trap of interesting data is wasting time in the analysis stage. Whenever you open Google Analytics, there’s always going to be an anomaly begging for investigation. It’s dangerously addictive to go down each of these “interesting data” rabbit holes and try to solve each one as a puzzle — before you know it, hours have flown by and you don’t have anything actionable.

Another trap is investing time implementing clever solutions that might have a large impact (even doubling or tripling conversions!), but on a feature that only reaches a tiny segment of visitors. Unless there’s a valuable customer learning, we’d consider this a failure from a conversion rate optimization standpoint.

To avoid analysis paralysis and to identify high-impact potential tests, we follow these three criteria, with a rough sizing sketch after the list:

  1. High traffic / engagement areas so that even a small lift would yield meaningful results.
  2. High revenue areas where even a small lift in traffic / engagement would yield meaningful results.
  3. A valuable learning, such as validating a dangerous assumption.

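To make “even a small lift would yield meaningful results” concrete, here’s a back-of-the-envelope sizing sketch; every number in it is invented for illustration:

```python
def monthly_revenue_lift(monthly_sessions: int, baseline_cr: float,
                         relative_lift: float, aov: float) -> float:
    """Estimate the monthly revenue impact of a conversion-rate lift.

    baseline_cr and relative_lift are fractions (0.025 = 2.5%); aov is the
    average order value in dollars.
    """
    extra_orders = monthly_sessions * baseline_cr * relative_lift
    return extra_orders * aov

# Illustrative only: 300k sessions/month, 2.5% conversion, $80 AOV.
# A modest 3% relative lift on a high-traffic page is worth roughly:
print(f"${monthly_revenue_lift(300_000, 0.025, 0.03, 80):,.0f} / month")
# -> $18,000 / month
```

Running the same arithmetic on a low-traffic feature is usually all it takes to see why a clever fix there fails the high-impact bar.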

3. High confidence

Lastly, we optimized for hypotheses that we’re most confident in, rather than speculating on unknowns. This meant:

  1. Form an opinionated and focused hypothesis (to avoid testing too broadly and risking inconclusive results).
  2. Capitalize on existing behavior (instead of trying to change or influence new behavior, which is risky and difficult).
  3. Focus on what’s not working (leave personas, features, and flows that are working alone).


Test design format


Each test design is divided into four prompts that guide your analysis: the objective data you observed, your interpretation of that data and the insight it suggests, your hypothesis, and finally your proposed test design.
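If it helps to capture those four prompts somewhere more structured than a document, here’s a minimal sketch of the format as a Python dataclass; the field names and the example entry are our own illustration, not a prescribed schema:

```python
from dataclasses import dataclass

@dataclass
class TestDesign:
    """The four prompts of the test design format, as structured fields."""
    observation: str     # objective data observed
    interpretation: str  # your read of the data and the insight it suggests
    hypothesis: str      # what you believe will happen, and why
    test_design: str     # the proposed variant(s), split, and primary metric

# Illustrative example, not a real client test.
example = TestDesign(
    observation="Most mobile sessions scroll past the fold on the product "
                "page, but few ever see the add-to-cart button again.",
    interpretation="Shoppers who engage with content lose the purchase "
                   "action; the friction is navigational, not motivational.",
    hypothesis="A persistent add-to-cart bar on mobile product pages will "
               "increase add-to-cart rate without hurting bounce rate.",
    test_design="50/50 split on mobile product pages; primary metric: "
                "add-to-cart rate; guardrail metric: bounce rate.",
)
print(example.hypothesis)
```

Writing the observation and interpretation as separate fields is what keeps the hypothesis honest: it has to follow from something you actually saw.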