Now that you have good context on how iOS 14.5 has impacted the data stream to the platforms, what tactical approach should you, the advertiser, take to stand up a successful testing process? To start, though they are afflicted with similar data constraints, there’s a clear delineation between App & Web best practices when it comes to creative testing and strategy. Furthermore, the ideal “end state” of testing architecture can and will vary advertiser to advertiser, depending on product/service parameters like the number of products offered (ecomm), consideration time, educational burden, ‘in-market’ relevance, and down-funnel conversion rates. Over the course of the past 18 months we have had to goal-seek our way to a handful of testing frameworks depending on the complexity, breadth & budget of the product or service being advertised.
At a base level, a properly calibrated testing method satisfies the following conditions:
- Test assets all achieve proper statistical significance
- Assets reliably achieve statistical significance at some predetermined cadence
- Performance data underlying advertiser “goalpost” KPIs is trustworthy (more on defining this later)
- “Winning” assets reliably achieve scale in the core, scaled campaigns (this one we will deal with in the next installment)
Another key variable to consider: what level of investment is appropriate to earmark for testing for a given advertiser? How many assets should an advertiser test per week or per month? What % of spend should be dedicated? This answer, too, depends on a variety of factors for a given advertiser, including but not limited to:
- Current & future budget
- Production constraints on the creative side
- Risk tolerance for marketing capital deployment
- Success level of the current “playbook” (i.e. is there an existing creative strategy that is working)
- Position in the product life cycle (early or category-defining advertisers typically need to spend more up front to work out which creative approach is durable, and those successes compound iteratively through the growth & harvest phases)
One thing is clear – there is no catch-all testing & budgetary strategy that works ubiquitously across all app or web products or verticals. The remainder of this article will walk you through this decision process through the lens of, first, an app advertiser and, second, a web advertiser.
App
Before starting, I’d like to address what is sure to be a counterpoint to testing via SKAD on iOS: the idea that testing on the Android platform is a suitable proxy for iOS performance on Facebook. To be clear, some advertisers are able to make this work, usually when the following conditions are present:
- Heavily revenue indexed on Android
- Large Android TAM
- Monetization not concentrated on flagship Android devices, meaning long tail of devices are relevant for monetization / product goals
However, there are a few main drawbacks to this approach. Most importantly: 1) Android CPMs tend to be high and variable when optimizing for very downstream conversion events (Purchase, etc.), which can lead to unpredictable costs to obtain statistical significance; 2) the Android audience, and the content that resonates with those users, is skewed relative to iOS; and 3) UX/product stability is questionable on some proportion of Android devices. Consequently, creative testing in the Android environment can be difficult to run consistently, and success there does not necessarily translate to success on iOS. For this article we will stick to iOS, which frequently accounts for the majority of revenue in US + Tier 1 products regardless of vertical.
For context, let’s assume we are building a creative strategy for a trial-to-subscription app spending $300k a month in the US only. This particular advertiser already has a winning creative strategy, and has ~3 assets they have observed to be reliably efficient over a timeframe of months, with the average creative losing steam after 3 months or so. Furthermore, over the past year of testing, they have realized that testing assets to statistical significance on CPI is not sufficient to understand the impact on conversion and revenue, since particular creatives drive significant variance in Install > Trial (a very common problem we see across clients). Consequently, the advertiser has moved beyond up-funnel metrics and CPI as success markers, since these KPIs do not sufficiently predict the resultant CPT (Cost Per Trial) and therefore the impact on revenue.
Historically, the advertiser has observed a “hit rate” of about 16%, meaning 16 out of every 100 test creatives will scale meaningfully in the core campaign. Since they have 3 assets that currently “work”, with a predicted shelf life of about 3 months, they will have to produce roughly 1 winner per month of testing to maintain baseline performance. But this advertiser has a secondary goal of improving baseline efficiency, so ideally the contribution from testing is >1 winning asset per month. What budget should they assign to creative testing to achieve this goal assuming hit & decay rates remain constant? To start, we can calculate the implied number of test assets each month using the historical hit rate. Ideally they need > 1 winning asset per month (let’s say 2).
- 16% (hit rate) * X (test assets) = 2
Therefore ~12 test assets need to be tested to significance in the market per month to achieve this goal. Now, what event volume is appropriate to test an asset to significance? Recall that testing to install is not sufficient for this advertiser due to huge differences in Install > Trial rates. The number is ultimately arbitrary – the core tradeoff is that the higher the event volume, the higher the confidence, at the expense of higher cost per asset and therefore lower test volume. Since this advertiser wants to preserve the stable performance they have achieved in the core campaign, they want to be very confident that assets will not disrupt the main campaign. Plus, they are beginning to see differences in Trial > Subscription (though they don’t understand this relationship at the creative level quite yet). Consequently, a high event volume is called for, which they have defined as 50 events. Let’s take a look at some of their historical, average costs across the funnel:
| Funnel Event | Cost per Event | Events to Stat Sig | $ to Stat Sig |
| --- | --- | --- | --- |
| Click (CPC) | $2 | 50 | $100 |
| Install (CPI) | $10 | 50 | $500 |
| Trial (CPT) | $50 | 50 | $2,500 |
Using 50 events as the bar for statistical significance, and presuming the test assets hit the average historical CPT of $50 (in practice they will pace over or under this), we get a total estimated cost burden of ~$2,500 per asset. This cost will fluctuate with the performance of up-funnel variables as well: for example, there’s a certain CTR that is too low to ever be justified by a downstream conversion rate, there’s a CPI that’s too high to ever produce a CPT in the target range, etc. Article 3 will cover how to structure success/failure KPIs and go/no-go thresholds in more detail.
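For those who prefer to see the arithmetic in one place, here is a minimal Python sketch of the napkin math so far; the hit rate, event threshold, and per-event costs are the illustrative figures from this example advertiser, not universal benchmarks.

```python
# Illustrative figures from the example advertiser above (not benchmarks)
hit_rate = 0.16           # share of test assets that go on to scale in the core campaign
winners_needed = 2        # target winning assets per month
events_to_stat_sig = 50   # conversion events required per asset (advertiser-defined bar)

# Average historical cost per event at each funnel stage
cost_per_event = {"Click": 2, "Install": 10, "Trial": 50}

# How many assets must be tested to significance each month to expect 2 winners?
assets_needed = winners_needed / hit_rate   # 12.5 -> roughly 12 assets per month

# Cost to test a single asset to significance at each stage
cost_per_asset = {stage: cpe * events_to_stat_sig for stage, cpe in cost_per_event.items()}
# {'Click': 100, 'Install': 500, 'Trial': 2500}

print(assets_needed, cost_per_asset)
```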
What budget allocation will approximate our goal test asset volume? Let’s take a look at some options:
| Test Budget | $10,000 | $20,000 | $30,000 | $40,000 | $50,000 |
| --- | --- | --- | --- | --- | --- |
| Assets to Stat Sig (Click) | 100 | 200 | 300 | 400 | 500 |
| Assets to Stat Sig (Install) | 20 | 40 | 60 | 80 | 100 |
| Assets to Stat Sig (Trial) | 4 | 8 | 12 | 16 | 20 |
Based on the napkin math, it looks like $30k/month will support ~12 assets tested to Trial significance, which satisfies the advertiser’s overall goal and is equivalent to ~10% of the total monthly budget.
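Extending the same sketch to the budget table: a hypothetical helper that converts a monthly test budget into the number of assets that can be tested to significance at each funnel stage, using the illustrative costs above.

```python
# Assumes the same illustrative costs as above: $2 CPC, $10 CPI, $50 CPT, 50 events per asset
events_to_stat_sig = 50
cost_per_event = {"Click": 2, "Install": 10, "Trial": 50}

def assets_testable(monthly_test_budget: float) -> dict:
    """Number of assets a monthly test budget covers, per funnel stage."""
    return {stage: int(monthly_test_budget // (cpe * events_to_stat_sig))
            for stage, cpe in cost_per_event.items()}

for budget in (10_000, 20_000, 30_000, 40_000, 50_000):
    print(budget, assets_testable(budget))
# At $30,000: 12 assets reach Trial significance (~10% of the $300k monthly budget)
```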
Now that the advertiser has backed into an appropriate budget to hit their goals, how should they structure the test? The advertiser first needs to isolate the creative testing environment from the core campaigns. This has two primary benefits: 1) it allows the advertiser to set budgets that lead to statistical significance over predictable timeframes (this will become important later when we feed input back to the creative team) and 2) it removes the “incumbency” bias covered in Article 1 that can sometimes lead to older assets with accumulated history being prioritized in the auction. Since this advertiser already has scaled winning creatives, this will almost surely be an issue.
Consider the following sample structure:
Here we have a dedicated creative testing campaign (meaning no current/old assets are included), mapped to 3 ad sets, each with a dedicated $400 daily budget (non-CBO) delivering to a single creative. A structure like the one above ensures that the following conditions are met:
- Test assets all achieve proper statistical significance
- Assets reliably achieve statistical significance at some predetermined cadence
And finally, the number of assets under test satisfies the SKAD constraints outlined in Article 1, since the number of campaign IDs SKAD has to report against remains small per campaign. Consequently we can assume that performance data reported via SKAD is trustworthy.
A common question we get from clients is “what should we use as the control?” In general, our preference is to define the control as “performance in the business-as-usual campaigns”. If we identify a test asset with higher efficiency, and can scale that asset without performance degradation, then we have beaten our “control”. Adding a control in the test environment has two core drawbacks: 1) it’s another degree of freedom that weakens conversion feedback reported via SKAD, and 2) FB doesn’t necessarily treat a prior top-performing asset as a control even though the advertiser has defined it as such. We’ve actually put this to the test and found in practice that Facebook tends to give preference to the form of a prior high-performing asset in the auction. Meaning: even if you slightly crop or change the “control” asset, machine vision will associate, tag & treat that asset preferentially due to the accumulated history of the incumbent asset it was built from.
Now, how do we assemble this into a process with the creative team? A suitable structure might be something like this:
- Recurring creative campaign deployed each week with ~3 fresh assets (~$400/day each)
- 3 ad sets mapped 1:1 to 3 individual test creatives @ $400/day each, with ~6 days of run time (leaving a day to “breathe” in the case of a production delay, etc.)
- A weekly meeting with creative team to go over key findings, successes, failures, anomalies (more on analysis in the next installment)
- 3 assets * $400/day * 6 days of runtime * 4 weeks/month = $28,800 invested & 12 assets tested to significance (a quick sanity check of this math follows the list)
- Rinse & repeat
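As a quick sanity check, the cadence above can be plugged back into the same sketch to confirm it delivers the monthly goals (again using the illustrative numbers from this example).

```python
# Weekly cadence from the process above (illustrative numbers)
assets_per_week = 3
daily_budget_per_ad_set = 400   # one ad set per asset, non-CBO
days_of_runtime = 6
weeks_per_month = 4

spend_per_asset = daily_budget_per_ad_set * days_of_runtime           # $2,400, close to the ~$2,500 per-asset estimate
monthly_spend = spend_per_asset * assets_per_week * weeks_per_month   # $28,800, under the ~$30k earmark
monthly_assets = assets_per_week * weeks_per_month                    # 12 assets tested to significance

print(spend_per_asset, monthly_spend, monthly_assets)
```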
In a healthy creative testing and feedback process, it’s imperative that the creative team is getting feedback and iterating without being blocked waiting on market feedback (performance data), that the account is constantly getting a healthy rotation of new creative, and that every asset is being tested to significance.
Now that we have reasoned our way to a suitable test structure & process, how do we know that this method is satisfying our goals? It’s important to backtest & verify any structure change, usually at 30, 60 & 90 days following the change. Some key questions this advertiser will consider as they evaluate the new procedure:
- Are there any squeeze points in the process? (creative getting rejected, variance in production timeline pushing tests back)
- Has our hit rate changed with additional creative volume? (a quick way to re-check this follows the list)
- Does the impact on overall baseline performance justify the additional earmarked creative budget?
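On the hit-rate question, one hedged way to judge whether a backtest window shows a real shift is to put a rough confidence interval around the observed rate. The Wilson interval below is my own suggestion rather than anything prescribed here, and the counts are hypothetical.

```python
import math

def hit_rate_with_interval(winners: int, assets_tested: int, z: float = 1.96):
    """Observed hit rate plus an approximate 95% Wilson interval, to judge whether
    a shift from the historical 16% is signal or just small-sample noise."""
    p = winners / assets_tested
    denom = 1 + z**2 / assets_tested
    center = (p + z**2 / (2 * assets_tested)) / denom
    margin = z * math.sqrt(p * (1 - p) / assets_tested + z**2 / (4 * assets_tested**2)) / denom
    return p, (center - margin, center + margin)

# Hypothetical 90-day backtest: 36 assets tested to significance, 5 scaled in the core campaign
print(hit_rate_with_interval(5, 36))  # ~13.9% observed; the interval comfortably includes 16%
```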
How does the advertiser know when it’s time to make a change? Hypothetically, let’s say the new testing process above decreased baseline CAC by 30% on Facebook. Consequently, the payback period has improved dramatically and the advertiser can now afford to allocate ~$500k per month to Facebook. A budget change of this magnitude likely means you’ll have to navigate some or all of this process again, as you find yourself in a new bid/volume/saturation context on the platform side. Some other trigger events that might necessitate a strategy change:
- Budget Change
- Country / platform expansion
- Competitor activity (competitors copying ads at high velocity)
- Change in market concentration for your vertical (products and/or ads have more alternatives and are less competitive)
- Exogenous taste change (prior creative format not as relevant with consumer taste – think UGC popularity over the past 12 months)
Wherever your process ends up, there is simply no catch-all creative strategy that is durable across app & web and that does not need to be re-evaluated periodically.
