Trust but Clarify Data

Trust but Clarify Data is a valuable practice that is often overlooked in data analysis rituals. We may follow best practices for bringing the most value to customers, yet still go awry if we simply follow principle and trust completely what the data is telling us. In his book The Lean Startup, Eric Ries writes about the processes we used at IMVU. Those processes helped us get Minimum Viable Products out as experiments and observe customer adoption as the prototypical Lean Startup.



IMVU (aka The Lean Startup) Case Study

I stepped into IMVU after Eric Ries went off to evangelize Lean Startup principles to the industry. I remember an experiment we did on helping our “whale” customers. They were gifting a lot of products to their friends and followers. However, gifting was a cumbersome, one-at-a-time process. We chose to make this “better” by allowing them to give to groups. So, we ran an experiment to make it easy to give to multiple people and collected some data. Much to our surprise, there was no noticeable improvement in gifting. Following our methodology, we were ready to scrap the experiment and revert to the old code. It is bad practice to add functionality that doesn’t bring value.

[Figure: Data – Trust but Clarify – Results 1]

We had very reliable, verified ways of measuring customer behavior for very large numbers of customers. I completely trusted that the data was accurate. However, I felt I should clarify why the result was so different from the expected outcome. It didn’t sit right with me; if I understood our customers right, the change really should have made a difference. I asked that we split the data out by customer type. This allowed us to see the impact of the change on the behavior of whales, non-whales, etc. Below are the results of looking at that same data in a more nuanced way:

[Figure: Data – Trust but Clarify – gifting results split out by customer type]

Lo and behold, gifting was way up for customers that were whales. However, it was way down for the typical non-whales, and in aggregate the net change was not noticeable. By making gifting to many users easier, we had made it harder for those wanting to gift to just one. Instead of abandoning the experiment, we then allowed for both, giving each the easiest path. When customers gift, they spend more credits and increase engagement – both valuable to the business. We could have trusted the first view of the data. Had we allowed our decisions to be purely data-driven, we would have completely missed this opportunity.

[Diagram: Aggregate vs. Segmented Reality – the aggregate view shows net change ≈ 0, looks neutral, and suggests no meaningful impact; the segmented view shows whales increasing sharply while non-whales decrease. Opposing effects cancel in averages: segmentation reveals what aggregates conceal. Use persona splits when “no change” may hide two large changes canceling out.]
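
For teams that want to make this kind of check routine, here is a minimal sketch of the persona split described above, using pandas. The numbers, column names, and whale flag are illustrative assumptions, not IMVU’s actual data:

```python
import pandas as pd

# Illustrative per-customer gift counts before and after the change.
# Column names and the whale flag are assumptions for this sketch.
df = pd.DataFrame({
    "customer_id":  [1, 2, 3, 4, 5, 6],
    "is_whale":     [True, True, False, False, False, False],
    "gifts_before": [40, 55, 20, 21, 20, 20],
    "gifts_after":  [70, 90, 4, 3, 5, 4],
})
df["delta"] = df["gifts_after"] - df["gifts_before"]

# Aggregate view: the net change looks like nothing happened.
print("Net change across all customers:", df["delta"].sum())

# Segmented view: whales are way up, non-whales are way down.
print(df.groupby("is_whale")["delta"].agg(["sum", "mean", "count"]))
```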

In being data-driven, don’t turn off your brains or ignore your gut instincts about the customer and the business. When it comes to data, as with many things – trust but clarify.

Trust but Clarify Data – a Simple Coin-Toss Example

In his book The Black Swan, Nassim Taleb poses the question “If I were to flip a coin 99 times and it came up tails each time, what are the odds it comes up tails on the 100th flip?”

I love asking this question because the logical, well-trained, and educated mathematician or computer scientist knows from Probability 101 that the result of a coin-toss is not dependent on previous results. We are taught that a less-educated person might think previous results do matter. However, if we’re open-minded and don’t jump to quick conclusions, we might realize there is something different here. From Statistics 101, we should know that the odds of a coin coming up tails 99 times in a row are essentially infinitely small. A more cautious person might also think: why am I being asked such a simple question?

The wisdom lies in recognizing the bigger picture and considering all the data we are given. By being asked a simple question, we might suspect there’s something deeper there. By recognizing that 99 tails in a row is basically impossible for a “fair” coin, we should conclude that this is obviously not a fair coin. That information is part of the data we are given. However, it’s something we easily overlook when we think we know the obvious answer.

We should trust the mathematics we were taught in Probability 101. However, we should also trust our instincts that despite the pure math answer, there is something strange here. So, the “Trust but Clarify” question here is “Are we talking about a fair coin?”
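
For the mathematically inclined, here is a small sketch of that clarifying step as arithmetic. The prior is an illustrative assumption; the point is simply that 99 tails overwhelms almost any initial confidence that the coin is fair:

```python
# Probability that a fair coin produces 99 tails in a row.
p_run_if_fair = 0.5 ** 99          # about 1.6e-30, essentially impossible

# A simple Bayesian framing with two hypotheses (priors are assumptions):
#   a fair coin vs. a coin that always lands tails.
prior_fair, prior_always_tails = 0.999999, 0.000001

posterior_fair = (prior_fair * p_run_if_fair) / (
    prior_fair * p_run_if_fair + prior_always_tails * 1.0
)
print(f"P(fair | 99 tails) ≈ {posterior_fair:.1e}")
# Even with near-total initial trust in fairness, the posterior is ~1.6e-24,
# so the sensible bet for flip 100 is tails.
```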

Asking this coin-toss question in an interview tells me a lot about a person. The initial answer is much less significant than the revealing dialog that follows.

Why Clarify and not Verify?

The more commonly used phrase is “trust but verify.” The difference may seem small, but it matters. Trust but “verify” implies there is cause for doubt about the veracity of the data. That can leave your data science team feeling questioned and attacked, which leads to defensiveness. Instead, you can choose to clarify your understanding of the data and ask whether the same data could be presented differently to provide additional insights. Now you are collaborating with your data science team. In the above example, the data as originally presented was 100% accurate. Clarifying what that data looked like when broken out by persona added the insight that improved the overall user experience, an insight that was almost overlooked and would have resulted in a missed opportunity.

Trust but Clarify Data from A/B Tests and User Surveys

Prior to IMVU, we acquired a company that had a great product and a large and loyal customer base. I remember meeting the head of product and asking why they had decided to sell. He mentioned that the business hadn’t been going as well as they had hoped for several years. They consistently had issues with customer satisfaction. They had started by implementing the best practices of A/B testing. They always used data to determine which implementation of a feature resonated better with customers. Surprisingly, things did not seem to get better. So, at the end of each year, they put a big effort into surveying customers. It consistently came back that about 50% of the customers were very frustrated with the product.

They asked their disgruntled customers for specifics on what frustrated them. They took that data and those insights to redesign the user experience. However, for three years in a row, the end-of-year surveys came back with 50% of the customers unhappy. Again the grievances were addressed, but nothing changed. Einstein’s quote on insanity comes to mind. So, they sold the company.


When UX Metrics Lie: Engagement, Goodhart’s Law, and the Illusion of Improvement

Modern product teams rarely redesign interfaces blindly. Instead, they rely on A/B tests, dashboards, and behavioral analytics to decide what ships. On the surface, this looks like disciplined, evidence‑based practice. Yet many users report the opposite lived experience: interfaces that change frequently, feel less navigable, and require more effort to accomplish the same tasks.

This tension raises a critical question: what if the data is not wrong, but the conclusions drawn from it are?

This page explores a subtle but pervasive failure mode in UX optimization: how commonly used engagement metrics can reward friction, how Goodhart’s Law and Campbell’s Law explain why this happens, and how teams can redesign their measurement systems to avoid false positives that slowly degrade user trust.


Trust but Clarify Data – The Engagement Trap

In many experiments, success is defined by increases in metrics such as time on page, session duration, number of clicks, or depth of interaction. These measures are convenient, easy to instrument, and appear to correlate with “engagement.”

However, engagement metrics are ambiguous. More time or more clicks can mean curiosity, enjoyment, or value. They can also mean confusion, disorientation, and rework. A user who completes a task smoothly in ten seconds produces less “engagement” data than a user who spends two minutes searching, hesitating, and correcting mistakes.

When teams equate engagement with value, they risk optimizing for struggle rather than success. Interfaces that were once quick and legible can be replaced by ones that look more modern, test better in isolation, yet require more cognitive effort to use.

This is not a theoretical concern. It is a structural consequence of how metrics behave once they become targets.

[Diagram: from user goal to proxy metric – the user’s true outcome (task success, ease, confidence) is measured through proxy metrics (time, clicks, “engagement”), which drive design decisions and interface changes (new UI, new flow, new nav). The proxy decouples where more time and more clicks can mean more confusion, so optimizing the proxy can reward friction over success. A feedback loop forms: optimizing the metric changes behavior and the UI. Goodhart’s Law and Campbell’s Law explain why this happens.]

Goodhart’s Law in UX Design

Goodhart’s Law is often summarized as: “When a measure becomes a target, it ceases to be a good measure.”

In UX contexts, engagement metrics typically begin as rough indicators. Time on page might loosely signal interest. Clicks might loosely signal exploration. But once these numbers are elevated to explicit success criteria, behavior shifts around them.

Design decisions start to favor whatever increases the metric, regardless of whether it improves the user’s actual outcome. A flow that adds an extra decision step may increase clicks. A less obvious navigation pattern may increase time on page. The metric improves, but the experience worsens.

Goodhart’s Law explains why this corruption is not accidental. The moment teams aim directly at the proxy, the proxy stops reflecting the underlying goal.

[Diagram: Metric as Signal vs. Metric as Target – as a signal, the flow runs from user reality to metric to insight to better design; a proxy can be useful when it stays connected to the underlying outcome. As a target, the metric drives decisions and design changes, user behavior bends, the metric inflates, and reality drifts out of view. When a measure becomes a target, it stops being a good measure.]

Campbell’s Law and Institutional Pressure

Campbell’s Law extends this idea into organizational systems. It states that the more a quantitative metric is used for decision‑making, the more it is subject to distortion and corruption, and the more it distorts the process it is meant to monitor.

In product organizations, this pressure is amplified by incentives. Roadmaps, promotions, and credibility often hinge on “winning” experiments. Teams are rewarded for showing upward‑sloping charts, not for demonstrating that a task became simpler or less cognitively taxing.

Over time, this creates a selection bias. Changes that reduce friction often shorten sessions and reduce visible activity. They can look like regressions in engagement data, even when users are happier. Changes that introduce friction often inflate activity and appear successful.

Campbell’s Law predicts exactly this outcome: metrics begin to shape behavior in ways that undermine the original intent.


Local Wins, Global Losses

[Diagram: Local Optimization vs. Global Experience – A/B tests can “win” on one step while degrading the end-to-end journey. In the flow from entry through Page A or Page B to task completion and exit, the local metric improves on Page B while downstream abandonment or frustration increases later. A change can lift a local proxy metric while harming the end-to-end outcome; this is how teams can do everything “right” and still lose user trust.]

Another compounding issue is local optimization. A/B tests are usually scoped to a single page, flow, or interaction. A variant may outperform its control locally while damaging the broader journey.

For example:

  • A checkout page redesign increases interaction time but increases abandonment later.
  • A portal dashboard redesign boosts exploration but increases support calls.
  • A new navigation scheme helps first‑time users but slows down experienced users.

Aggregate metrics often hide these effects. Averages flatten differences between novices and power users, between people under stress and people browsing casually. The result can look neutral or positive while quietly eroding trust among the most frequent users.
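
To put rough numbers on the checkout example above, here is a small worked calculation (the figures are made up) showing how a step-level win can still be a journey-level loss:

```python
# Made-up funnel numbers for illustration only.
visitors = 10_000

# Control: 60% get through checkout, and 90% of those complete the order.
control_checkout, control_complete = 0.60, 0.90
# Variant: checkout conversion rises to 66% (a +10% local "win"),
# but the new flow confuses people and completion falls to 78%.
variant_checkout, variant_complete = 0.66, 0.78

control_orders = visitors * control_checkout * control_complete   # 5,400
variant_orders = visitors * variant_checkout * variant_complete   # 5,148

print(f"Local lift at checkout: {variant_checkout / control_checkout - 1:.0%}")
print(f"Completed orders: {control_orders:.0f} -> {variant_orders:.0f}")
```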


Why This Feels Worse Over Time

From a user’s perspective, the problem compounds. Interfaces change repeatedly, each time justified by data. Familiar paths disappear. Muscle memory is broken. Tasks that were once automatic require renewed attention.

What users experience is not innovation fatigue, but re‑learning cost. What organizations experience is a steady stream of “successful” experiments.

The gap between these perspectives is exactly where unexamined metrics do the most damage.


Designing Metrics That Respect User Success

Avoiding this trap does not mean abandoning data. It means clarifying what the data represents.

More robust measurement systems distinguish between activity and outcome. They pair primary metrics with guardrails that detect confusion rather than celebrate it.

Examples include:

  • Time to complete a meaningful task, not time spent in the interface.
  • Error rates, validation failures, and undo actions.
  • Backtracking behavior and repeated visits to help content.
  • Support requests and complaints tied to specific workflows.
  • Segmented analysis by user type, experience level, and context of use.

Crucially, these signals should be interpreted together. No single number should be allowed to declare victory.
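
As one sketch of what “interpreted together” can look like, the snippet below refuses to declare a win on the primary metric while any guardrail is violated. The metric names and thresholds are illustrative assumptions:

```python
# Illustrative experiment readout (relative changes); names are assumptions.
readout = {
    "time_on_page":         +0.18,  # the "engagement" metric is up 18%
    "task_completion_rate": -0.04,  # but fewer users finish the task
    "error_rate":           +0.07,  # and more of them hit errors
    "support_contacts":     +0.05,
}

# Guardrails: the worst acceptable change for each signal.
guardrails = {
    "task_completion_rate": -0.01,  # completion may not drop more than 1%
    "error_rate":           +0.02,  # errors may not rise more than 2%
    "support_contacts":     +0.02,
}

violations = [
    metric for metric, limit in guardrails.items()
    if (readout[metric] < limit if limit < 0 else readout[metric] > limit)
]

if violations:
    print("Engagement is up, but guardrails failed:", violations)
else:
    print("No guardrail violations; the lift may reflect real value.")
```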

Qualitative input matters as well. Small‑scale usability sessions, even with a handful of users, often explain why a metric moved and whether that movement represents progress or friction.


Why This Matters

At its core, this is not a UX problem. It is a sense‑making problem.

Data does not speak for itself. Metrics are abstractions layered on top of human experience. When teams trust the numbers without clarifying their meaning, they risk optimizing away the very qualities users value most: clarity, confidence, and ease.

Seen through this lens, frequent UI “improvements” that feel like regressions are not mysterious failures. They are predictable outcomes of systems that reward the wrong signals.

This is precisely where “trust but clarify” belongs in modern product and design practice.


Self-Serve Experimentation: Reducing Data Collection Friction Without Hijacking Engineering

Teams often talk about being “data-driven” as if data collection is free. In practice, it competes with everything else. Instrumentation requests collide with product deadlines, marketing launches, revenue priorities, incident response, and the long tail of tech debt. Even when everyone agrees the data would be valuable, it can still slip to “later.” Then “later” becomes “never.”

This is not a people problem. It is a system design problem.

When measurement depends on engineering time for every new question, the organization quietly rations learning. The learning that survives is often the learning that is easiest to instrument, not the learning that is most important.


The Hidden Cost of “Just Add More Tracking”

Requests for more instrumentation can look small on paper. Add an event here, or add a property there. Add a new funnel step, or add one more A/B test. Yet each request carries real costs.

  • Engineering context switches away from building the thing.
  • Instrumentation becomes fragile across refactors.
  • Analytics schemas drift and become inconsistent.
  • The definition of “success” becomes muddy when five teams are measuring similar things differently.
  • Reliability work suffers because measurement work never feels urgent until you need it.

Over time, this creates a familiar cycle. A team ships a change, then realizes they cannot tell whether it worked. They scramble to add instrumentation after the fact, when the clean “before” baseline is gone.

If you want high-quality learning, you need to build learning into the system.


Separating “Experimentation Plumbing” From “Product Code”

What has worked best for me, in one environment after another, was treating experimentation as a first-class platform capability. The goal is not to remove engineering from experimentation. It is to remove engineering from the repetitive parts.

Instead of asking engineers to:

  • add tracking for every test,
  • wire up metrics definitions repeatedly,
  • and hand-roll segmentation and dashboards,

you build a system where product, marketing, and data teams can:

  • define experiments,
  • assign variants,
  • choose exposure rules,
  • and collect consistent outcome signals,

without requiring bespoke code changes for each measurement request.

Engineering still writes the actual feature changes. But measurement becomes standardized and largely self-serve.

This is one of the most practical ways to reduce “data vs delivery” conflict.


What “Self-Serve Experimentation” Really Means

Self-serve is often misunderstood as “non-engineers can run experiments.” That framing creates fear of chaos. The better framing is “the organization has shared infrastructure that makes learning cheap, consistent, and safe.”

In a mature setup, non-engineering teams do not push random UI changes. They configure experiments within guardrails.

Common building blocks:

A feature flag and experiment assignment service

  • Stable user assignment (so users do not bounce between variants).
  • Targeting rules (new users, returning users, persona segments).
  • Rollout controls (percentages, ramp schedules, kill switches).

A consistent event and metric layer

  • Standard events and properties.
  • A single source of truth for metric definitions.
  • Built-in guardrails (errors, latency, support contacts, retention).

A results pipeline that does not require custom dashboards each time

  • Prebuilt experiment reports.
  • Drilldowns by persona and segment.
  • Automated checks for novelty effects and sample ratio mismatches.

Governance that keeps this from turning into “randomness at scale”

  • Experiment templates.
  • Naming conventions.
  • Review workflows.
  • A small set of approved primary metrics per domain.

The system does not eliminate tradeoffs. It changes the tradeoff curve.
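
As a concrete illustration of the “stable user assignment” building block above, here is a minimal sketch that hashes the user and experiment together so a user always sees the same variant. The function and parameter names are assumptions, not any particular platform’s API:

```python
import hashlib
from typing import Optional, Sequence

def assign_variant(user_id: str, experiment: str,
                   variants: Sequence[str] = ("control", "treatment"),
                   traffic_fraction: float = 1.0) -> Optional[str]:
    """Deterministically map a user to a variant for one experiment.

    Hashing (experiment, user_id) keeps assignment stable across sessions
    and independent between experiments, with no lookup table required.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = (int(digest, 16) % 10_000) / 10_000   # stable value in [0, 1)

    if bucket >= traffic_fraction:
        return None  # user is not enrolled in this experiment
    return variants[int(bucket / traffic_fraction * len(variants))]

# The same user always lands in the same variant of a given experiment.
print(assign_variant("user-42", "group-gifting-v2"))
print(assign_variant("user-42", "group-gifting-v2"))
```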


Why This Matters: Learning Is a Throughput Problem

When experimentation requires engineering time for every measurement iteration, learning throughput is constrained by the scarcest resource. That almost guarantees a backlog of unanswered questions.

A self-serve experimentation platform changes the default:

  • more questions get tested,
  • fewer tests are blocked by instrumentation work,
  • and measurement becomes consistent enough to support “trust but clarify” thinking.

It also reduces a dangerous pattern: teams treating “we can measure it later” as a substitute for measuring it now.


Constraint Theory, Qualified Data, and Learning Throughput

One reason data collection and experimentation stall in otherwise disciplined organizations has little to do with intent, skill, or belief in data. It has to do with constraints.

Constraint Theory starts from a simple premise: at any given moment, a system’s performance is governed by its most limiting factor. Improvements made anywhere else may feel productive, but they do not meaningfully change outcomes unless they relieve that constraint.

In many product organizations, engineering capacity is the constraint. Feature delivery, reliability work, incident response, technical debt, and platform maintenance all compete for the same finite attention. Requests for new instrumentation, additional events, or custom analytics are therefore not evaluated in isolation. They are evaluated against everything else that draws on the constrained resource.

From this perspective, it becomes clear why data collection so often slips. It is not that teams do not value learning. It is that learning work is routinely queued behind delivery work when both depend on the same bottleneck.


Why “Just Measure More” Fails Under Constraint

When learning depends on constrained engineering time, organizations unintentionally optimize for what moves fastest through the system, not for what teaches the most.

Instrumentation requests that are small, repeatable, or urgent survive. Instrumentation requests that require thought, alignment, or refactoring often do not. Over time, this shapes what gets measured. The result is not a deliberate bias, but a structural one: the easiest data to collect becomes the data that defines success.

Constraint Theory explains why exhortations rarely work here. Telling teams to “be more data-driven” does not change the system. It increases pressure on the constraint and often deepens frustration.

If learning throughput matters, the constraint has to be addressed directly.


Qualified Data Requests and Learning Throughput

Constraint Theory also clarifies why not all data requests should be treated equally.

A useful distinction is between interesting data and qualified data. A qualified data request has three properties:

  • It is tied to a specific hypothesis.
  • It informs a concrete decision that someone will actually make.
  • It justifies consuming constrained capacity because the decision matters.

Without this qualification, data requests behave like noise. Individually they seem reasonable. Collectively they overwhelm the system and crowd out more consequential learning.

This is analogous to the idea of qualified leads. Attention is scarce. Capacity is finite. The goal is not to maximize volume, but to maximize impact.


How This Connects to Self-Serve Experimentation

A self-serve experimentation platform changes the constraint equation.

By separating experimentation plumbing from product code, organizations remove measurement work from the constrained path. Engineering effort is reserved for building and changing the product. Learning infrastructure absorbs the repetitive cost of instrumentation, assignment, and analysis.

This does not eliminate the need for prioritization. Even with self-serve tools, qualified data still matters. But it shifts the default from scarcity-driven avoidance to intentional choice.

In Constraint Theory terms, self-serve experimentation is not an optimization. It is a constraint relief.


A Systems Lens for “Trust but Clarify”

Viewed through this lens, the tension between delivery and learning is not a failure of discipline. It is a predictable outcome of how systems behave under constraint.

Constraint Theory provides a useful complement to discussions of A/B testing and metric interpretation. It explains why good measurement intentions often fail in practice, and why building learning capacity into the system is more effective than asking people to work harder or care more.

For a deeper exploration of this lens and how it applies to habit formation and organizational change, see the Atomic Rituals work on Constraint Theory.

The core idea is simple: if learning matters, it must not compete with the constraint. It must be designed around it.


A Concrete Reference Point: IMVU and the Lean Startup Era

We did this extensively at IMVU. We invested in the ability to run A/B tests and multivariate tests with shared tooling and consistent data capture, so that experimentation did not require a bespoke instrumentation project each time.

That investment did not just make testing easier. It changed the culture. Experiments became normal. Learning became faster. Most importantly, engineering time stayed focused on building and improving the product, while measurement remained reliable and repeatable.

This is one of the quieter lessons behind many organizations that successfully operationalized Lean Startup principles at scale. They did not just adopt a mindset. They built the infrastructure that made the mindset executable.


The “Trust but Clarify” Connection

A self-serve experimentation system improves more than speed. It improves clarity.

When metrics, assignment, and segmentation are standardized:

  • it is easier to compare tests,
  • easier to detect false positives,
  • easier to see when a metric is being gamed by friction,
  • and easier to break results out by persona without heroics.

In other words, this infrastructure is not just an engineering convenience. It is a way to reduce measurement distortion and increase interpretability.
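
One of those checks, the sample ratio mismatch test mentioned earlier, is simple enough to sketch. This version assumes SciPy is available, and the counts are illustrative:

```python
from scipy.stats import chisquare

# Users observed per variant vs. the intended 50/50 split (illustrative counts).
observed = [50_420, 49_120]
expected = [sum(observed) / 2] * 2

stat, p_value = chisquare(observed, f_exp=expected)
if p_value < 0.001:
    print(f"Sample ratio mismatch (p = {p_value:.1e}):")
    print("assignment or logging is broken; do not trust this experiment's results.")
else:
    print("Traffic split is consistent with the intended allocation.")
```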


Practical Guardrails to Keep It Healthy

A self-serve experimentation platform can create its own failure modes if left unchecked. A few guardrails prevent most of them:

  • Require a clear hypothesis and success metric before launch.
  • Maintain a short list of approved primary metrics per workflow.
  • Always pair the primary metric with guardrails that detect harm.
  • Segment results by user type, not just averages.
  • Build “stop” conditions and easy rollback into the tooling.
  • Audit instrumentation quality regularly. Broken measurement is worse than no measurement.

The goal is not more experiments. The goal is better learning.
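
A lightweight way to enforce several of these guardrails is an experiment template that the review workflow checks before launch. The fields below are illustrative assumptions, not a specific tool’s schema:

```python
# An illustrative experiment definition; field names are assumptions.
# A review workflow can refuse to launch anything that leaves these blank.
experiment = {
    "name": "group-gifting-v2",
    "hypothesis": (
        "Making group gifting easy increases gifting by whales "
        "without reducing gifting by everyone else."
    ),
    "primary_metric": "gifts_sent_per_user",
    "guardrail_metrics": ["task_completion_rate", "error_rate", "support_contacts"],
    "segments": ["whale", "non_whale", "new_user", "returning_user"],
    "stop_conditions": {"max_error_rate_increase": 0.02, "max_duration_days": 28},
    "rollback": "feature-flag kill switch",
}

missing = [key for key, value in experiment.items() if not value]
assert not missing, f"Experiment is not ready to launch; missing: {missing}"
```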


Trust but Clarify Data – Was the Data or the Process Wrong?

Ultimately, neither the data nor their processes were wrong. Yet data and industry-proven processes together did not achieve the desired improvements. Some clarification in how to interpret and act on the data was in order.
I decided to clarify what the data had been telling them. The head of product explained how customers would complain that the product was too complex, so they made it simpler. Then the customers would complain that the product wasn’t feature-rich enough. I asked if it was possible that they had two types of customers:

  1. Casual Users – Those who wanted a simple, easy-to-use experience
  2. Power Users – Experts in what they were doing who wanted as many features, bells, and whistles as possible. Ones who weren’t overwhelmed by so many choices but rather celebrated them.

I further asked if it would be possible to implement two variants, one for each type of customer. The product manager’s jaw dropped as he stared at me wordlessly for a couple of minutes. He exclaimed that he now knew what to do in their next release. By solving for both personas, they would significantly increase customer satisfaction in their end-of-year release.

Again, both the data and the process were “right”; the value came from clarifying what the data was telling us.


Glossary of Terms (Trust but Clarify Data)

A/B Test

A controlled experiment that compares two or more variants to measure differences in outcomes. Useful for learning, but prone to false confidence if proxy metrics are misinterpreted.

Aggregate Data

Data combined across users, segments, or time. Aggregates are efficient, but often hide meaningful differences between personas or contexts.

Campbell’s Law

A principle stating that when a quantitative indicator is used for decision‑making, it becomes subject to distortion and corrupts the process it is meant to measure.

Constraint Theory

A systems lens that focuses on identifying and relieving the most limiting factor in a system. In data work, it explains why learning often stalls behind delivery.

Engagement Metrics

Measures such as time on page, clicks, or session length. These are proxies for value, not value itself, and can increase when users are confused rather than satisfied.

False Positive

A result that appears to confirm success but does not reflect a true improvement in user outcomes or long‑term value.

Goodhart’s Law

The observation that when a measure becomes a target, it ceases to be a good measure. Commonly seen when teams optimize proxy metrics.

Local Optimization

Improving a metric or outcome in one part of a system while degrading the overall experience or downstream results.

Persona Split

Segmenting data by meaningful user groups (such as new users, power users, or high‑value customers) to reveal differences hidden in aggregates.

Proxy Metric

An indirect measure used as a stand‑in for a desired outcome. Useful when direct measurement is hard, but risky when optimized directly.

Qualified Data

Data collected to answer a specific, decision‑relevant question. Qualified data justifies consuming constrained capacity.

Self‑Serve Experimentation

Shared infrastructure that allows teams to run experiments and collect consistent data without bespoke engineering work for each test.

Trust but Clarify

A mindset that treats data as essential input while insisting on interpretation, context, and judgment before action.


Frequently Asked Questions – Trust but Clarify Data

Isn’t more data always better?

Not when data collection competes with constrained capacity. More data can increase noise, slow decisions, and obscure what actually matters.

When should I trust aggregate metrics?

Aggregates are useful when behavior is uniform. When users differ by context, expertise, or motivation, aggregates often mislead.

What should I measure instead of time on page or clicks?

Measure task completion, time to success, error rates, rework, and downstream effects such as support load or retention.

How do I avoid local wins that harm the overall experience?

Pair local metrics with journey‑level guardrails and segment results by persona before declaring success.

Do A/B tests still matter if metrics can mislead?

Yes. A/B tests are valuable tools. The risk lies in treating the winning metric as the final truth rather than a prompt for deeper interpretation.

How do Goodhart’s Law and Campbell’s Law show up in product teams?

They appear when teams optimize for engagement metrics and dashboards at the expense of clarity, ease, and long‑term trust.

What does “qualified data” mean in practice?

It means collecting data only when it informs a real decision, has a clear hypothesis, and justifies using constrained resources.


See Also – Trust but Clarify Data (External References)

Goodhart’s Law – When Measures Become Targets

Goodhart’s Law explains why metrics stop reflecting reality once they are optimized directly. This principle underpins many false positives in A/B testing, engagement metrics, and KPI-driven decision systems discussed on this page.


Campbell’s Law – The Corruption of Quantitative Indicators

Campbell’s Law extends Goodhart’s insight to organizational systems, showing how metrics distort behavior when they are used for decision-making and control. It provides a systemic explanation for why teams can follow correct processes and still degrade real outcomes.


How to Measure Anything – Douglas W. Hubbard

This book offers practical methods for turning ambiguous questions into decision-relevant measurements. It reinforces the idea that measurement is about reducing uncertainty and informing decisions, not producing impressive dashboards.


Lean Analytics – Alistair Croll & Benjamin Yoskovitz

Lean Analytics explores how different stages, personas, and business models require different metrics. It supports the argument that aggregate data can mislead and that context determines whether a metric is meaningful.


Thinking in Systems – Donella H. Meadows

This foundational work on systems thinking explains feedback loops, unintended consequences, and why local optimization often harms whole systems. It provides conceptual grounding for the “local wins, global losses” pattern described in the diagrams.
