Writing Product Specs When the Output Is Probabilistic

Traditional product specs are built on certainty. When the user clicks this button, this thing happens. Every time, the same way, deterministically. The PM defines the behavior, engineering implements it, QA verifies it. Clean.

AI features break this model. The output isn’t deterministic — it’s probabilistic. The same input might produce different outputs. And the “right” output is often subjective, context-dependent, or impossible to define precisely upfront.

So how do you write a spec for something that behaves differently every time?

The problem with traditional PRDs

A standard PRD says: given input X, the system produces output Y. Acceptance criteria are binary — it either does the thing or it doesn’t.

With AI features, you can’t write acceptance criteria like that. If you’re building a feature that summarizes meeting notes, there’s no single “correct” summary. There are good summaries, bad summaries, and a big gray zone in between. Your spec needs to account for that gray zone instead of pretending it doesn’t exist.

I’ve watched teams try to force AI features into traditional spec formats, and it always leads to one of two outcomes: either the spec is so vague it’s useless (“the AI should produce high-quality summaries”), or it’s so rigid that it doesn’t reflect how the feature actually behaves.

What to do instead

Replace acceptance criteria with eval criteria. Instead of “the output must be X,” define what good looks like across a range of examples. Create an eval set — a collection of inputs paired with outputs that you’d consider acceptable. This becomes the source of truth for quality, not a static spec.
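To make that concrete, here's a minimal sketch of what one eval entry might look like, assuming a meeting-notes summarizer; the field names are illustrative, not a standard format. The point is that each example constrains the shape of an acceptable output rather than dictating a single golden answer.

```python
# A minimal sketch of an eval set entry for a meeting-notes summarizer.
# Field names are illustrative assumptions, not a standard format.
from dataclasses import dataclass, field

@dataclass
class EvalExample:
    input_text: str                                           # the raw meeting notes
    must_include: list[str] = field(default_factory=list)     # facts any acceptable summary covers
    must_not_include: list[str] = field(default_factory=list) # e.g. hallucinated decisions or attendees
    max_words: int = 120                                       # a shape constraint, not an exact output

eval_set = [
    EvalExample(
        input_text="Q3 planning call: launch slipped to Oct 14; Dana owns the pricing page.",
        must_include=["Oct 14", "Dana", "pricing page"],
        must_not_include=["cancelled"],
    ),
    # ...50-100 more examples covering common cases, edge cases, and known failure modes
]
```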

For example, if you’re speccing an AI feature that categorizes support tickets, your eval criteria might be: given this set of 200 tickets, the model should agree with human labels at least 92% of the time, with no critical misroutes (billing issues routed to feature requests, for instance).
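Here's a rough sketch of how that criterion could be scored, assuming the eval set is a list of (ticket, human label, model label) triples and that you've enumerated which category pairs count as critical misroutes; the data format and pairs are placeholders.

```python
# Sketch: scoring a ticket-categorization eval against the spec's two criteria.
labeled = [
    # (ticket_id, human_label, model_label)
    ("T-1001", "billing", "billing"),
    ("T-1002", "bug", "feature_request"),
    # ...the rest of the ~200-ticket eval set
]

# Category pairs the spec treats as unacceptable even once (illustrative).
CRITICAL_MISROUTES = {("billing", "feature_request"), ("security", "feature_request")}

agreement = sum(human == model for _, human, model in labeled) / len(labeled)
critical = [(tid, h, m) for tid, h, m in labeled if (h, m) in CRITICAL_MISROUTES]

passes = agreement >= 0.92 and not critical
print(f"agreement={agreement:.1%}, critical_misroutes={len(critical)}, pass={passes}")
```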

Spec the failure modes, not just the success case. Every AI feature will fail. The question is how it fails. Your spec should explicitly cover: what does a bad output look like? What’s the user experience when the AI gets it wrong? Is there a fallback? Can the user correct the output?

This is where most AI specs fall short. They describe the happy path in detail and treat failures as an afterthought. Flip that. The failure UX is the spec.
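One way to make the failure path a first-class part of the spec, sketched here with an invented confidence threshold and a stubbed-out model call, is to define the fallback behavior right next to the happy path:

```python
# Sketch of a spec'd fallback: low-confidence predictions never silently misroute.
# The model call is stubbed; the threshold and fallback behavior are the spec.

def categorize(ticket_text: str) -> tuple[str, float]:
    """Stand-in for the real model call; returns (category, confidence)."""
    return ("billing", 0.55)  # dummy values for illustration

def route_ticket(ticket_text: str) -> dict:
    category, confidence = categorize(ticket_text)
    if confidence < 0.70:  # the threshold itself is a product decision captured in the spec
        return {"category": "needs_review", "editable_by_user": True,
                "reason": f"low confidence ({confidence:.2f})"}
    return {"category": category, "editable_by_user": True}

print(route_ticket("I was charged twice this month"))
```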

Define “good enough” quantitatively. This is the conversation most teams avoid. What accuracy rate is acceptable for launch? What about six months post-launch? These numbers should be in the spec, agreed upon by PM, engineering, and design before anyone starts building.

“Good enough” isn’t a compromise — it’s a product decision. A 95% accuracy rate might be incredible for one feature and unacceptable for another. The PM owns this number.
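Those targets can live in the spec as data rather than prose, which makes them checkable against the eval set on every model change; the numbers and keys below are placeholders, not recommendations.

```python
# Illustrative quality targets, agreed before building and revisited post-launch.
QUALITY_TARGETS = {
    "launch":         {"agreement_rate": 0.92, "critical_misroutes": 0},
    "six_months_out": {"agreement_rate": 0.96, "critical_misroutes": 0},
}

def meets_target(metrics: dict, phase: str) -> bool:
    target = QUALITY_TARGETS[phase]
    return (metrics["agreement_rate"] >= target["agreement_rate"]
            and metrics["critical_misroutes"] <= target["critical_misroutes"])

print(meets_target({"agreement_rate": 0.93, "critical_misroutes": 0}, "launch"))  # True
```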

Own the eval set. In traditional product development, QA writes test cases. In AI product development, the PM should own the eval set — or at the very least, be deeply involved in curating it. The eval set encodes your product judgment: what good looks like, what bad looks like, and where the boundary is.

If you’re delegating the eval set entirely to engineering or data science, you’re delegating your most important product decision.

A lightweight spec template

Here’s what I now include in every AI feature spec:

- The problem statement, and why AI is the right approach.
- The input-output contract — not exact outputs, but the shape and constraints of acceptable outputs.
- The eval set, with at least 50-100 examples covering common cases, edge cases, and known failure modes.
- Quantitative quality targets for launch and post-launch.
- The failure UX — what happens when the AI is wrong, and how the user recovers.
- Latency and cost constraints.
- The feedback loop — how you'll collect signals on quality after launch to keep improving.
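If it helps to see the template in a checkable form, here's a sketch of the same items as structured data; the keys mirror the list above and aren't any standard schema.

```python
# Illustrative AI feature spec skeleton; keys mirror the template above.
spec = {
    "problem_statement": "Support tickets are routed manually, adding hours of delay.",
    "why_ai": "Categories are fuzzy and high-volume; hand-written rules have plateaued.",
    "io_contract": {"input": "ticket text", "output": "one of N categories + confidence"},
    "eval_set": "evals/tickets_v1.jsonl",            # 50-100+ examples, PM-curated
    "quality_targets": {"launch": 0.92, "six_months": 0.96},
    "failure_ux": "Low-confidence tickets go to a human queue; users can re-tag.",
    "latency_budget_ms": 800,
    "cost_per_call_usd": 0.002,
    "feedback_loop": "Log re-tags and thumbs-down; review weekly against the eval set.",
}
```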

The mindset shift

The hardest part isn’t the template. It’s the mindset shift. Traditional PM work is about defining exactly what the product does. AI PM work is about defining the boundaries of acceptable behavior and building systems to keep the product inside those boundaries.

You’re not writing a blueprint. You’re writing a constitution.
