How Trellis Recommendations Stay Consistent

When you run a Catalyst Audit twice on the same data, the findings stay stable while the wording of the rationale changes. That is intentional. This article explains why, and what Trellis commits to for the findings and for the wording.

The Short Version

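Every number behind a finding in your audit is computed in Python before any prose is written. Cohort totals, statistical posteriors, tier classifications, and risk scores are produced by deterministic code: same inputs, same outputs. The exception is conditional projections for non-cohort actions, which are computed during report generation and tagged [PROJECTED].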
Cohort-scoped dollar amounts (the waste from zero-conversion keywords, the opportunity from missing negatives, the lift from promoting top search terms) are pre-computed and quoted verbatim in the report. Measured drift across ten reruns of the same audit: 0%.
Recommendation titles and rationales are rewritten on each run by design. The language varies; the underlying finding does not. This is variety in expression, not variety in substance.
The evidence-tag contract — [FACT], [BAYESIAN], [PROJECTED], [INFERRED], [INSUFFICIENT DATA] — is the durable commitment. Every number carries a tag that tells you exactly how it was produced.

What Stays Stable Across Runs

Catalyst is built on a deterministic substrate. Before any prose is written, Python has already:

Aggregated your cohort dollar totals. When the audit recommends pausing zero-conversion keywords, the total waste is computed once from your raw spend data. Across ten reruns of the same YLI audit, the recommended pause cited the identical figure ($293.68) every time. The same applies to negative-keyword candidates and top-converting search terms.

Computed your statistical evidence. Bayesian posteriors (Gamma-Poisson for CPA, Beta-Binomial for conversion rate, Gamma for ROAS) are each sampled with 10,000 draws. Because the model, the draw count, and the random seed are the same on every run, the credible intervals and P(change > 20%) come out the same on every run.

Classified your campaigns into confidence tiers. HIGH, MODERATE, LOW, and INSUFFICIENT tiers are assigned by deterministic rules over your conversion volume and history depth. The tier gate then decides which recommendations are even allowed to surface.

Scored every recommendation for risk. The 1-to-5 risk scale, the safeguard requirements at risk 4 and above, and the cross-recommendation contradiction checks all run in Python.

This is the substrate that the report-writing step receives. It does not invent the numbers; it presents them.
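Two of the steps listed above, tier classification with its gate and risk scoring, are plain deterministic rules. The sketch below is a minimal illustration under stated assumptions: the tier thresholds, the `min_tier` field on each recommendation, the action labels, and the pairs treated as contradictory are hypothetical placeholders, not the values or structures Catalyst actually uses. It also assumes that a risk 4 or 5 recommendation without a safeguard is suppressed rather than surfaced.

```python
from dataclasses import dataclass
from enum import IntEnum


class Tier(IntEnum):
    # Ordered so that comparisons mean "at least this much confidence".
    INSUFFICIENT = 0
    LOW = 1
    MODERATE = 2
    HIGH = 3


def classify_tier(conversions: int, history_days: int) -> Tier:
    """Assign a tier from conversion volume and history depth (hypothetical thresholds)."""
    if conversions >= 100 and history_days >= 90:
        return Tier.HIGH
    if conversions >= 30 and history_days >= 30:
        return Tier.MODERATE
    if conversions >= 10 and history_days >= 14:
        return Tier.LOW
    return Tier.INSUFFICIENT


# Hypothetical pairs of actions that contradict each other on the same entity.
OPPOSING = {("pause", "increase_budget"), ("increase_budget", "pause")}


@dataclass(frozen=True)
class Recommendation:
    entity: str          # campaign or keyword the action targets
    action: str          # hypothetical labels such as "pause" or "increase_budget"
    risk: int            # 1 (lowest) to 5 (highest)
    min_tier: Tier       # lowest tier at which this recommendation may surface
    safeguard: str = ""  # required for risk 4 and above


def gate(recs: list[Recommendation], tier: Tier) -> list[Recommendation]:
    """Apply the tier gate and risk rules, then check for contradictions."""
    kept = []
    for rec in recs:
        if not 1 <= rec.risk <= 5:
            raise ValueError(f"risk out of range: {rec.risk}")
        if tier < rec.min_tier:
            continue  # not enough evidence for this recommendation to surface
        if rec.risk >= 4 and not rec.safeguard:
            continue  # high-risk recommendation without a safeguard is suppressed
        kept.append(rec)
    for i, a in enumerate(kept):
        for b in kept[i + 1:]:
            if a.entity == b.entity and (a.action, b.action) in OPPOSING:
                raise ValueError(f"contradictory recommendations for {a.entity}")
    return kept


tier = classify_tier(conversions=45, history_days=60)  # Tier.MODERATE
recs = [
    Recommendation("Brand", "pause", risk=2, min_tier=Tier.LOW),
    Recommendation("Brand", "increase_budget", risk=4, min_tier=Tier.HIGH, safeguard="cap at +10%"),
]
print([r.action for r in gate(recs, tier)])  # ['pause']
```

Every branch depends only on its inputs, so running `gate` twice on the same recommendations and tier returns the same list. That is the property the substrate relies on.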

What Is Rewritten Each Run

Three pieces of the report are generative, and they should be:

Recommendation titles. “Pause 4 zero-conversion keywords spending $293.68/month” might become “Cut $293.68/month from zero-conv keywords (4 terms)” on the next run. Same finding, same dollar, different sentence. The variation keeps the report from reading like a template.

Rationales and counter-arguments. The narrative explanation of why a recommendation is sound, and the strongest case against following it, are written fresh against the same evidence. The argument structure stays (at least two cited data points, a counter-case, a monitoring plan), but the prose phrases it anew.

Conditional projections. Projections for non-cohort actions, such as “if mobile spend decreases by 60%, that is roughly $697/month,” are computed during report generation and carry a [PROJECTED] tag. The tag is the contract. What it guarantees is the methodology (a transparent formula with stated assumptions and a 30% conservative adjustment), not byte-stable dollar output.
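The article does not publish the projection formula, so the following is only a sketch of what a transparent [PROJECTED] calculation can look like. It assumes, hypothetically, that the projected monthly reduction is current monthly spend times the reduction fraction, and that the 30% conservative adjustment is applied as a haircut (multiplying by 0.7). The function and field names are illustrative, and the example inputs are made up rather than taken from any audit.

```python
from dataclasses import dataclass

CONSERVATIVE_ADJUSTMENT = 0.30  # stated in the article; applying it as a haircut is an assumption


@dataclass(frozen=True)
class Projection:
    tag: str
    monthly_amount: float
    formula: str  # printed alongside the number so the reader can check it


def project_spend_reduction(monthly_spend: float, reduction: float) -> Projection:
    """Hypothetical [PROJECTED] estimate of monthly spend removed by a cut."""
    if not 0.0 <= reduction <= 1.0:
        raise ValueError("reduction must be a fraction between 0 and 1")
    raw = monthly_spend * reduction
    adjusted = raw * (1.0 - CONSERVATIVE_ADJUSTMENT)  # conservative haircut
    formula = (f"${monthly_spend:,.2f} x {reduction:.0%} "
               f"x (1 - {CONSERVATIVE_ADJUSTMENT:.0%})")
    return Projection("[PROJECTED]", round(adjusted, 2), formula)


p = project_spend_reduction(monthly_spend=1000.0, reduction=0.60)
print(p.tag, p.monthly_amount, p.formula)  # [PROJECTED] 420.0 $1,000.00 x 60% x (1 - 30%)
```

Publishing the formula string next to the figure is what lets a reader audit the estimate even when its last digits are not reproduced exactly on a rerun.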

If you rerun an audit and notice the language has shifted, check the tag and the cohort total. Those stay stable. The sentence wrapping them does not, and that is the design.
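A minimal sketch of that split, assuming per-keyword spend rows as input: the zero-conversion cohort total is summed once in Python with exact decimal arithmetic and frozen, and the title step can only quote the figure it is handed. The row format and `Substrate` class are hypothetical, and the two fixed title templates are a stand-in for Catalyst's report-writing step, which composes titles generatively rather than picking from a list.

```python
import random
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class Substrate:
    """Figures fixed before any prose is written (hypothetical shape)."""
    keyword_count: int
    monthly_waste: Decimal  # [FACT]: summed from raw spend, quoted verbatim


def build_substrate(rows: list[dict]) -> Substrate:
    # Cohort: keywords that spent money but recorded zero conversions.
    cohort = [r for r in rows if r["conversions"] == 0 and Decimal(r["spend"]) > 0]
    total = sum((Decimal(r["spend"]) for r in cohort), Decimal("0"))
    return Substrate(keyword_count=len(cohort), monthly_waste=total)


# Stand-in for the generative writer: the wording varies, the figures do not.
TITLE_VARIANTS = [
    "Pause {n} zero-conversion keywords spending ${amt}/month",
    "Cut ${amt}/month from zero-conv keywords ({n} terms)",
]


def write_title(sub: Substrate, rng: random.Random) -> str:
    return rng.choice(TITLE_VARIANTS).format(n=sub.keyword_count, amt=sub.monthly_waste)


rows = [  # made-up example data
    {"keyword": "kw-a", "spend": "80.25", "conversions": 0},
    {"keyword": "kw-b", "spend": "19.75", "conversions": 0},
    {"keyword": "kw-c", "spend": "50.00", "conversions": 0},
    {"keyword": "kw-d", "spend": "310.40", "conversions": 6},
]
substrate = build_substrate(rows)
for seed in range(3):  # three "reruns" with different phrasing choices
    print(write_title(substrate, random.Random(seed)))
```

Whichever template a run picks, the title quotes `$150.00` and `3`, because the writer receives those values rather than computing them.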

Why This Split Exists

A fully deterministic report would be a templated form letter: same words, same order, every time. Yet the depth of a Catalyst Audit comes from the report-writing step composing explanations tailored to your specific account. A fully generative report has the opposite problem: it would invent dollar amounts and contradict itself across runs. The Trellis answer is to draw a hard line between the two.

The line is the estimation-tag system. Every number in the report carries one of these:

[FACT] — measured directly from your platform or order data. Computed in Python, quoted verbatim.
[BAYESIAN] — a posterior estimate with a stated credible interval. Reproducible from the same model, samples, and seed.
[PROJECTED] — a forward-looking estimate with a transparent formula. The tag promises variance disclosure and conservative adjustment, not byte-stable output.
[INFERRED] — a pattern from your data combined with industry behavior, expressed as a range, never a point estimate.
[INSUFFICIENT DATA] — the sample is too small to act on. Catalyst will say so rather than guess.
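The tag list above reads naturally as a small data contract. The sketch below encodes it under stated assumptions: the `TaggedNumber` class, its fields, and its validation rules are a hypothetical illustration of the commitments listed ([INFERRED] is always a range, [BAYESIAN] always states a credible interval, [INSUFFICIENT DATA] carries no estimate), not Catalyst's actual data model.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Tag(Enum):
    FACT = "FACT"
    BAYESIAN = "BAYESIAN"
    PROJECTED = "PROJECTED"
    INFERRED = "INFERRED"
    INSUFFICIENT_DATA = "INSUFFICIENT DATA"


@dataclass(frozen=True)
class TaggedNumber:
    tag: Tag
    value: Optional[float] = None                   # point value, when allowed
    interval: Optional[tuple[float, float]] = None  # range or credible interval

    def __post_init__(self) -> None:
        if self.interval is not None and self.interval[0] > self.interval[1]:
            raise ValueError("interval bounds are reversed")
        if self.tag is Tag.INFERRED and (self.value is not None or self.interval is None):
            raise ValueError("[INFERRED] is expressed as a range, never a point estimate")
        if self.tag is Tag.BAYESIAN and self.interval is None:
            raise ValueError("[BAYESIAN] must state a credible interval")
        if self.tag is Tag.INSUFFICIENT_DATA and (self.value, self.interval) != (None, None):
            raise ValueError("[INSUFFICIENT DATA] carries no estimate")

    @property
    def must_match_on_rerun(self) -> bool:
        # Per the article, [FACT] and [BAYESIAN] numbers are identical across reruns.
        return self.tag in (Tag.FACT, Tag.BAYESIAN)

    def render(self) -> str:
        label = f"[{self.tag.value}]"
        if self.value is not None:
            label += f" {self.value:,.2f}"
        if self.interval is not None:
            lo, hi = self.interval
            label += f" ({lo:,.2f} to {hi:,.2f})"
        return label


print(TaggedNumber(Tag.INFERRED, interval=(400.0, 900.0)).render())  # [INFERRED] (400.00 to 900.00)
```

Enforcing the rules at construction time means a number that breaks its tag's contract cannot reach the report at all, rather than being caught in review.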

When two runs of the same audit produce identical [FACT] and [BAYESIAN] numbers and slightly different prose around them, that is the contract working as intended.
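A sampled [BAYESIAN] number is reproducible only when the model, the draw count, and the random seed are all fixed. The sketch below illustrates this with a Gamma-Poisson model for CPA and 10,000 draws, as described above, using only Python's standard library. The priors, seed values, 90% interval, example conversion and spend figures, and the reading of P(change > 20%) as the probability that CPA rose more than 20% versus a baseline period are assumptions for illustration, not the settings Catalyst uses.

```python
import random

N_DRAWS = 10_000  # draw count stated in the article


def cpa_draws(conversions: int, spend: float, seed: int,
              prior_shape: float = 1.0, prior_rate: float = 1.0) -> list[float]:
    """Posterior CPA draws under a Gamma-Poisson model (hypothetical priors).

    conversions ~ Poisson(lam * spend), lam ~ Gamma(shape, rate)
    => lam | data ~ Gamma(shape + conversions, rate + spend); CPA = 1 / lam.
    """
    rng = random.Random(seed)  # fixed seed: identical draws on every run
    shape = prior_shape + conversions
    scale = 1.0 / (prior_rate + spend)  # gammavariate takes a scale, not a rate
    return [1.0 / rng.gammavariate(shape, scale) for _ in range(N_DRAWS)]


def credible_interval(draws: list[float], level: float = 0.90) -> tuple[float, float]:
    """Equal-tailed interval read off the sorted draws."""
    ordered = sorted(draws)
    tail = (1.0 - level) / 2.0
    return ordered[round(tail * len(ordered))], ordered[round((1.0 - tail) * len(ordered)) - 1]


def prob_cpa_up(current: list[float], baseline: list[float], threshold: float = 0.20) -> float:
    """Share of paired draws in which current CPA exceeds baseline CPA by more than threshold."""
    hits = sum(1 for c, b in zip(current, baseline) if c / b - 1.0 > threshold)
    return hits / len(current)


def summarize(seed: int = 7) -> tuple[tuple[float, float], float]:
    # Made-up inputs: a baseline period and a current period for one campaign.
    baseline = cpa_draws(conversions=50, spend=2000.0, seed=seed)
    current = cpa_draws(conversions=20, spend=1000.0, seed=seed + 1)
    return credible_interval(current), prob_cpa_up(current, baseline)


print(summarize() == summarize())  # True: same model, draw count, and seed
```

With a different seed the interval and probability would shift slightly. Fixing the seed is what turns a Monte Carlo estimate into a number that can be compared exactly across reruns.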

What This Means for You

You can rerun a Catalyst Audit on the same date range and trust that:

The cohort dollar amounts on cohort-scoped recommendations match across reruns.
The campaigns and entities flagged are the same set, in roughly the same priority order.
The evidence behind each recommendation matches across reruns: the data points cited, the tier gates passed, the risk score assigned.
The phrasing of the rationale and the counter-argument differs, in the same way two analysts writing from the same dataset would phrase their conclusions differently.

If you ever see a [FACT]-tagged number change between two runs of the same audit on the same data, that is a bug worth reporting. If you see a rationale rephrased, that is the system writing for you and not from a template.
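That check can be automated. A minimal sketch follows, assuming (hypothetically) that each run's numbers can be exported as a mapping from a stable key, such as a recommendation ID plus field name, to a tag and the number exactly as printed. Catalyst does not necessarily offer such an export. Only [FACT] and [BAYESIAN] entries are compared, because the other tags do not promise byte-stable output.

```python
from typing import NamedTuple


class Tagged(NamedTuple):
    tag: str    # e.g. "FACT", "BAYESIAN", "PROJECTED"
    value: str  # the number exactly as printed in the report


STABLE_TAGS = {"FACT", "BAYESIAN"}


def rerun_bugs(run_a: dict[str, Tagged], run_b: dict[str, Tagged]) -> list[str]:
    """Keys whose rerun-stable numbers differ, or appear in only one run."""
    bugs = []
    for key in sorted(run_a.keys() | run_b.keys()):
        a, b = run_a.get(key), run_b.get(key)
        tags = {t.tag for t in (a, b) if t is not None}
        if not tags & STABLE_TAGS:
            continue  # [PROJECTED], [INFERRED], etc. may legitimately vary
        if a != b:
            bugs.append(key)
    return bugs


run_1 = {"rec-1.waste": Tagged("FACT", "150.00"), "rec-2.projection": Tagged("PROJECTED", "420")}
run_2 = {"rec-1.waste": Tagged("FACT", "150.00"), "rec-2.projection": Tagged("PROJECTED", "415")}
print(rerun_bugs(run_1, run_2))  # []: only the [PROJECTED] figure moved
```

Comparing the printed strings rather than parsed floats matches the claim being tested: a [FACT] figure is quoted verbatim, so any textual difference is a discrepancy worth reporting.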

The full methodology contract lives in the Trellis estimation policy, section 9 (LLM Determinism Caveat). It is the canonical source for what Catalyst commits to and what it does not.

Related Articles

Audit Evidence and Citations: full breakdown of evidence tags and the two-data-point requirement
Catalyst Audit vs. AI Chatbot Analysis: why a chatbot summary is not the same artifact
Reading Your Audit Report: section-by-section guide