Adaptive A/B Testing with Customer Features
When treatment effects depend on customer features, who you test matters. This demo walks through a simple sequential design run: at each step, choose the next customer \(x^*\) and ad assignment to learn about the A/B decision boundary. Customers who are clearly better for A or B usually do not teach the model much, so the design tends to spend tests near the boundary. On the other hand, not all boundary points are equally useful: the design focuses on points where the model is uncertain and the A/B decision is likely to change.
What you are seeing
- Three views of the same state: where the next test would be useful (expected information gain), where the A/B choice is least settled (decision uncertainty), and what the model would choose right now.
- We model conversion probability as a function of customer features (x1, x2) and which ad is shown.
- Points show past experiments; the star is the next experiment \(x^*\) chosen by the design rule.
Why this matters
- A single average treatment effect can hide the cases where A works better for one group and B works better for another.
- Random sampling may spend many tests on customer profiles where the decision is already obvious.
- Adaptive design puts more tests where the model is uncertain and the A/B decision is sensitive.
Model details (for technical viewers)
We use a logistic interaction model: \(\eta(x,z) = x^\top\beta + z\,x^\top\gamma\), \(y \sim \mathrm{Bernoulli}(\sigma(\eta(x,z)))\). Here \(x\) denotes the feature vector \(\phi(x_1,x_2)=[1,\,x_1,\,x_2,\,x_1x_2]\). The heterogeneous A/B difference is \(\Delta(x)=\eta(x,1)-\eta(x,0)=x^\top\gamma\).
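The model above can be written out in a few lines of NumPy. This is a minimal sketch, not the demo's actual code: the function names and the parameter values are hypothetical and chosen only for illustration.

```python
import numpy as np

def phi(x1, x2):
    """Feature map [1, x1, x2, x1*x2]; works on scalars or arrays."""
    x1, x2 = np.asarray(x1, dtype=float), np.asarray(x2, dtype=float)
    return np.stack([np.ones_like(x1), x1, x2, x1 * x2], axis=-1)

def sigmoid(t):
    """Logistic function, written with tanh to avoid overflow for large |t|."""
    return 0.5 * (1.0 + np.tanh(0.5 * t))

def eta(feats, z, beta, gamma):
    """Linear predictor: baseline term plus z times the interaction term."""
    return feats @ beta + z * (feats @ gamma)

def delta(feats, gamma):
    """Heterogeneous A/B difference eta(x, 1) - eta(x, 0)."""
    return feats @ gamma

# Hypothetical parameter values, for illustration only.
beta = np.array([-1.0, 0.2, 0.1, 0.0])
gamma = np.array([0.5, -0.3, 0.2, 0.05])

f = phi(2.0, 1.0)                        # features [1, 2, 1, 2]
p_z0 = sigmoid(eta(f, 0, beta, gamma))   # conversion probability under z = 0
p_z1 = sigmoid(eta(f, 1, beta, gamma))   # conversion probability under z = 1
print(delta(f, gamma), p_z0, p_z1)
```

Because the baseline term cancels in the difference, the sign of `delta` alone decides which ad has the higher conversion probability for a given customer.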
In this demo, the expected information gain view targets the interaction weights \(\gamma\) (i.e., the feature-dependent A/B difference), not the baseline \(\beta\). Other quantities computable from the model could be targeted instead, depending on the goal (e.g., a policy boundary, a threshold, or a ranking), and each target generally leads to a different optimal design. The decision uncertainty view focuses on uncertainty in the sign of \(\Delta(x)\) ("is A better than B here?") rather than uncertainty in the absolute conversion rate.
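One way to compute both quantities is sketched below. It is not necessarily how the demo computes them. The sketch assumes a Gaussian approximation to the posterior over \(\theta=(\beta,\gamma)\), with \(\gamma\) stored in the last four coordinates. It also swaps the exact expected information gain for a local approximation: the reduction in the log-determinant of the \(\gamma\) covariance block after a rank-one Fisher information update, averaged over the randomized ad assignment. The uncertainty score \(1-|2p-1|\), with \(p=P(\Delta(x)>0)\), is likewise an assumed choice, and all function names are hypothetical.

```python
import math
import numpy as np

def phi(x):
    """Feature map [1, x1, x2, x1*x2] for a single customer x = (x1, x2)."""
    x1, x2 = float(x[0]), float(x[1])
    return np.array([1.0, x1, x2, x1 * x2])

def gamma_info_gain(x, mean, cov, p_z1=0.5):
    """Approximate expected information gain about gamma from testing x.

    Assumes theta = (beta, gamma) ~ N(mean, cov) and a plug-in Fisher
    weight p(1-p) evaluated at the posterior mean.
    """
    f = phi(x)
    _, logdet_before = np.linalg.slogdet(cov[4:, 4:])
    gain = 0.0
    for z, pz in ((0, 1.0 - p_z1), (1, p_z1)):
        h = np.concatenate([f, z * f])            # gradient of eta w.r.t. theta
        p = 0.5 * (1.0 + math.tanh(0.5 * float(h @ mean)))
        w = p * (1.0 - p)                          # Bernoulli Fisher weight
        ch = cov @ h
        # Sherman-Morrison update of the covariance after adding w * h h^T
        # to the precision matrix.
        cov_after = cov - np.outer(ch, ch) * (w / (1.0 + w * float(h @ ch)))
        _, logdet_after = np.linalg.slogdet(cov_after[4:, 4:])
        gain += pz * 0.5 * (logdet_before - logdet_after)
    return gain

def prob_delta_positive(x, mean, cov):
    """P(Delta(x) > 0) when Delta(x) = phi(x)^T gamma is Gaussian."""
    f = phi(x)
    m = float(f @ mean[4:])
    s = math.sqrt(float(f @ cov[4:, 4:] @ f))
    return 0.5 * (1.0 + math.erf(m / (s * math.sqrt(2.0))))

def decision_uncertainty(x, mean, cov):
    """1 when the sign of Delta(x) is a coin flip, 0 when it is certain."""
    p = prob_delta_positive(x, mean, cov)
    return 1.0 - abs(2.0 * p - 1.0)

# Illustrative posterior: zero mean, identity covariance.
mean0, cov0 = np.zeros(8), np.eye(8)
print(gamma_info_gain((0.0, 0.0), mean0, cov0))
print(decision_uncertainty((0.0, 0.0), mean0, cov0))
```

Under this approximation, a test with \(z=0\) adds information about \(\gamma\) only through posterior correlation between \(\beta\) and \(\gamma\), which is why the score averages over both assignments.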
Simulation setup for these precomputed frames: customer features \((x_1, x_2)\in[-3,7]^2\). We start with \(N_0=10\) customers, then at each step evaluate a fresh candidate pool of 200 customers drawn from the same distribution. The optimal experimental design (OED) policy chooses \(x^*\) from this candidate pool to maximize expected information gain. Ad assignment at \(x^*\) is randomized with \(P(z=1)=0.5\); outcomes are simulated from fixed “true” parameters.
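The loop described in this setup might look roughly like the sketch below. Several details are assumptions rather than facts about the demo: customers are drawn uniformly on \([-3,7]^2\), the initial 10 customers also get randomized ads, and the "true" parameter values are invented. The scoring rule is passed in as `score_fn` (a hypothetical interface), so any expected information gain estimate can be plugged in.

```python
import numpy as np

def phi(x):
    """Feature map [1, x1, x2, x1*x2] for one customer x = (x1, x2)."""
    return np.array([1.0, x[0], x[1], x[0] * x[1]])

def simulate_outcome(x, z, beta, gamma, rng):
    """Draw y ~ Bernoulli(sigmoid(eta(x, z))) from fixed 'true' parameters."""
    f = phi(x)
    eta = f @ beta + z * (f @ gamma)
    p = 0.5 * (1.0 + np.tanh(0.5 * eta))   # overflow-safe logistic function
    return int(rng.random() < p)

def draw_customers(n, rng):
    """Assumption: features are uniform on [-3, 7]^2."""
    return rng.uniform(-3.0, 7.0, size=(n, 2))

def run_design(score_fn, beta, gamma, n0=10, n_steps=40, pool_size=200, seed=0):
    """Sequential design: score a fresh pool each step, test the best candidate."""
    rng = np.random.default_rng(seed)
    X = draw_customers(n0, rng)
    Z = rng.integers(0, 2, size=n0)         # assumed: initial ads randomized too
    Y = np.array([simulate_outcome(x, z, beta, gamma, rng) for x, z in zip(X, Z)])
    for _ in range(n_steps):
        pool = draw_customers(pool_size, rng)
        scores = score_fn(pool, X, Z, Y)    # one score per candidate
        x_star = pool[int(np.argmax(scores))]
        z_star = int(rng.random() < 0.5)    # P(z = 1) = 0.5
        y_star = simulate_outcome(x_star, z_star, beta, gamma, rng)
        X = np.vstack([X, x_star])
        Z = np.append(Z, z_star)
        Y = np.append(Y, y_star)
    return X, Z, Y

# Hypothetical 'true' parameters, for illustration only.
beta_true = np.array([-0.5, 0.1, 0.1, 0.0])
gamma_true = np.array([0.4, -0.2, 0.15, 0.0])

# Stand-in scorer: constant scores make argmax pick the first candidate,
# which is equivalent to random sampling from the pool.
def constant_score(pool, X, Z, Y):
    return np.zeros(len(pool))

X, Z, Y = run_design(constant_score, beta_true, gamma_true)
print(X.shape, Z.shape, Y.shape)
```

With the defaults, 40 steps take the data from 10 to 50 observed outcomes, matching the 41 frames in the sequence. The constant scorer doubles as a random-sampling baseline; replacing it with an information-gain scorer gives the adaptive version.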
Browse the sequence
The plots are precomputed from one sequential design run. Use the slider to move through the 41 frames, and switch views to understand why points were chosen.
Shortcuts: ←/→ step, 1/2/3 switch view.
Three views of the same state: where to experiment next, where the model says the A/B choice is uncertain, and what the model would choose now if you had to stop testing.
Where to test next: expected information gain · Decision uncertainty: where A vs B is least settled · Best choice now: the predicted winner
Frames show the state after N = 10 to 50 observed outcomes (41 frames).
From the precomputed OED policy.
Random sampling (no OED)
If you sample customers from the same distribution without adaptive design (no OED), the resulting A/B recommendation can stay wrong for longer. In this run, random sampling often misses the profiles that would clarify the boundary.
Snapshots are precomputed at N ∈ {50, 60, 70, 80, 90, 100}.
*Compared against random sampling under the same budget and the same evaluation criteria.
Key takeaway
Even in this simple model, the best next test point moves around and depends strongly on the current state of the model. Note also that toward the bottom boundary, more B ads were chosen, whereas near the left boundary, A ads tended to be optimal. I would not want to rely on intuition alone here.
Contact
If you're thinking about sequential design for experiments, pricing tests, surveys, or labeling pipelines, email me.