The five gates
A gate fails the moment one criterion fails
The order reflects priority. If you must fix in sequence: data first, security second, governance third, evaluation fourth, handover fifth. But the audit itself is one pass across all five.
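The pass rule and the fix order can be sketched in a few lines. This is an illustrative sketch, not a prescribed tool: the `Gate` class, the `audit` function, and the short gate names are hypothetical, and the criteria shown are abbreviated placeholders.

```python
from dataclasses import dataclass, field

# Fix order from the text: data, security, governance, evaluation, handover.
FIX_ORDER = ["data", "security", "governance", "evaluation", "handover"]


@dataclass
class Gate:
    name: str
    # Maps each criterion to whether it passed (hypothetical structure).
    criteria: dict[str, bool] = field(default_factory=dict)

    def passed(self) -> bool:
        # A gate fails the moment one criterion fails.
        # A gate with no recorded criteria is treated as not passed.
        return bool(self.criteria) and all(self.criteria.values())


def audit(gates: list[Gate]) -> list[str]:
    """Assess every gate in one pass; return the failed gates in fix order."""
    failed = [g.name for g in gates if not g.passed()]
    return sorted(failed, key=FIX_ORDER.index)


gates = [
    Gate("handover", {"named operator": False}),
    Gate("data", {"owners named": True, "lineage traceable": False}),
    Gate("security", {"threat model written": True}),
]
to_fix = audit(gates)  # data comes before handover in the fix order
```

The audit looks at every gate regardless of order; the ordering only applies to the list of what to fix.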
Gate 1
Data Readiness
Does the organization have data that an AI system can be trained on, evaluated against, and trusted in production?
Pass criteria
- 1. Source data for the in-scope use cases is identified, named, and owned. Each data set has a named human owner who can answer questions about provenance and meaning. No anonymous lake tables.
- 2. Data quality has been measured against a defined standard, not assumed. Completeness, accuracy, freshness, and consistency each have current measurements. Gartner estimates only 12 percent of organizations have data clean enough to support production AI.
- 3. Sensitive fields are tagged at the column level. PII, PHI, financial data, and any sector-specific protected categories are tagged at ingestion, not at consumption.
- 4. Lineage is traceable. For any record the model will train on or infer against, you can produce the source system, the ingestion timestamp, the transformation steps, and the consuming systems within one business day.
- 5. A ground-truth set exists for the use case. A labelled, agreed, version-controlled set of input and output pairs. The labels are owned by the business, not by the data team.
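Criterion 2 asks for measurements, not assumptions. A minimal sketch of what "measured against a defined standard" could look like for two of the four dimensions follows. The column names (`email`, `updated_at`), the 30-day freshness window, and the thresholds are assumptions chosen for illustration; your own standard defines the real ones.

```python
from datetime import datetime, timedelta, timezone


def completeness(rows, column):
    """Share of rows where `column` is present and not None."""
    if not rows:
        return 0.0
    filled = sum(1 for r in rows if r.get(column) is not None)
    return filled / len(rows)


def freshness(rows, ts_column, max_age, now):
    """Share of rows whose timestamp is no older than `max_age`."""
    if not rows:
        return 0.0
    fresh = sum(1 for r in rows if now - r[ts_column] <= max_age)
    return fresh / len(rows)


def quality_report(measurements, thresholds):
    """Pair each measured value with whether it meets its threshold."""
    return {
        name: (value, value >= thresholds[name])
        for name, value in measurements.items()
    }


UTC = timezone.utc
now = datetime(2025, 1, 31, tzinfo=UTC)
rows = [  # hypothetical sample records
    {"email": "a@example.com", "updated_at": datetime(2025, 1, 20, tzinfo=UTC)},
    {"email": None, "updated_at": datetime(2025, 1, 25, tzinfo=UTC)},
    {"email": "c@example.com", "updated_at": datetime(2024, 6, 1, tzinfo=UTC)},
    {"email": "d@example.com", "updated_at": datetime(2025, 1, 30, tzinfo=UTC)},
]
measurements = {
    "completeness": completeness(rows, "email"),
    "freshness": freshness(rows, "updated_at", timedelta(days=30), now),
}
# Thresholds are illustrative, not recommended values.
report = quality_report(measurements, {"completeness": 0.95, "freshness": 0.70})
```

Here completeness is measured at 0.75 and misses its threshold, while freshness at 0.75 meets its own. Either way the number exists and is current, which is what the criterion requires.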
Most common failure
The team passes criteria one through three and fails on four or five. Lineage cannot be reconstructed for older records. The ground-truth set is "we will build it during the project." That is a hidden failure: the build phase absorbs the ground-truth work and the timeline slips by months.
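One way to find the lineage gap before the build does is to check the training set against a lineage index. The sketch below assumes a hypothetical `Lineage` record carrying the four items criterion 4 names and a simple dictionary index; a real lineage tool would hold this in its own store.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Lineage:
    source_system: str
    ingested_at: datetime
    transformations: tuple[str, ...]  # may be empty if data is used as ingested
    consumers: tuple[str, ...]


def untraceable(record_ids, lineage_index):
    """Return record ids with no lineage entry or an incomplete one."""
    missing = []
    for rid in record_ids:
        entry = lineage_index.get(rid)
        if entry is None or not entry.source_system or not entry.consumers:
            missing.append(rid)
    return missing


lineage_index = {  # hypothetical index keyed by record id
    "r1": Lineage(
        source_system="crm",
        ingested_at=datetime(2024, 3, 1, 9, 30),
        transformations=("dedupe", "normalize_country"),
        consumers=("churn_model",),
    ),
}
gaps = untraceable(["r1", "r2"], lineage_index)  # "r2" has no lineage entry
```

Running a check like this across older records shows how much lineage reconstruction is hiding inside the build timeline.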
If you fail this gate
Stop. The data work is the work. Run a focused data foundation engagement that closes the failed criteria before the AI build kicks off. Closing them before the build costs meaningfully less than closing them during it.
Gate 2
Security Boundaries
Will the AI system, once built, satisfy your existing security posture and the sector-specific regulatory regime you operate under?
Pass criteria
- 1. The threat model has been written down. Not a generic catalog. A model specific to your architecture, data flows, trust boundaries, and user populations. Prompt injection, model extraction, training-data poisoning, and prompt-based exfiltration are addressed by name.
- 2. Framework alignment has been mapped, not assumed. The system is mapped to NIST AI RMF, ISO 42001, OWASP LLM Top 10, and whichever sector frameworks apply to you, for example the EU AI Act, HIPAA, DORA, or SEBI regulations. Mapping means a gap document, not a logo wall.
- 3. Access control is enforced at the API and data layer, not at the UI. PII access, model inference, and tool execution permissions are governed by the same identity model that governs the rest of your production estate.
- 4. An adversarial test plan exists. Curated prompt-injection, jailbreak, and data-exfiltration corpora are in place, with a dated schedule for adversarial testing in production. "We will pen-test before launch" is not a plan.
- 5. The system has a named owner for security operations after launch. Not a project role. An operating role on the security org chart.
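Criterion 3 is the one most easily shown in code. A minimal sketch of a permission check enforced at the function (API) layer follows. The role names, the permission strings, and the `ROLE_PERMISSIONS` map are hypothetical; in practice the lookup would go to the identity system that governs the rest of your production estate.

```python
from functools import wraps


class PermissionDenied(Exception):
    pass


# Hypothetical role-to-permission map standing in for the identity provider.
ROLE_PERMISSIONS = {
    "analyst": {"model:infer"},
    "support_agent": {"model:infer", "pii:read"},
    "agent_runtime": {"model:infer", "tool:execute"},
}


def requires(permission):
    """Reject the call unless the caller's role carries `permission`."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(caller_role, *args, **kwargs):
            if permission not in ROLE_PERMISSIONS.get(caller_role, set()):
                raise PermissionDenied(f"{caller_role} lacks {permission}")
            return fn(caller_role, *args, **kwargs)
        return wrapper
    return decorator


@requires("pii:read")
def read_customer_email(caller_role, customer_id):
    # Placeholder lookup; a real implementation would query the data layer.
    return f"customer-{customer_id}@example.com"


email = read_customer_email("support_agent", 7)
```

Because the check wraps the function itself, hiding a button in the UI is never the only thing standing between a caller and PII.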
Most common failure
Criteria one and two are addressed at planning time and forgotten by build time. The threat model becomes stale on day thirty. The framework mapping is never refreshed against the actual implementation. Both are zombie artifacts.
If you fail this gate
Bring in an AI Security Review engagement before the build. The review produces the threat model, the framework alignment scorecard, and the adversarial test plan. The build then ships against those artifacts, not toward them.
Gate 3
Model Governance
When the model produces a decision that someone (regulator, auditor, board member, customer) questions, can you defend that decision?
Pass criteria
- 1. The use case has an explicit risk classification, documented before the build, against your internal AI risk framework and the EU AI Act categorisation if applicable.
- 2. Model approval gates exist and are named. A model does not reach production without explicit sign-off from named roles: data owner, security lead, compliance lead, business sponsor. The gates are written into the SOW.
- 3. Decision logging is built in from day one. For every inference, you log the input, the model version, the output, the confidence score, and the reasoning chain where applicable, retrievable for the regulatory retention period.
- 4. Bias and fairness testing is part of the build, not a follow-on. The protected categories and the unacceptable-disparity thresholds are agreed before training, not interpreted after.
- 5. An incident response playbook exists for AI-specific incidents: hallucination at scale, model drift, adversarial attack post-launch, regulatory enquiry. Each scenario has a named owner and a documented first-hour response.
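Criterion 4 only works if the disparity threshold is a number agreed up front. As a sketch, the check below uses the gap in approval rates between groups, which is one possible fairness metric among several and is an assumption here, not something the criteria prescribe. The group labels and the 0.2 threshold are illustrative.

```python
from collections import defaultdict


def selection_rates(decisions):
    """decisions: iterable of (group, approved). Returns approval rate per group."""
    totals = defaultdict(int)
    approved = defaultdict(int)
    for group, ok in decisions:
        totals[group] += 1
        approved[group] += int(ok)
    return {g: approved[g] / totals[g] for g in totals}


def parity_gap(decisions):
    """Largest difference in approval rate between any two groups."""
    rates = selection_rates(decisions)
    return max(rates.values()) - min(rates.values())


decisions = [  # hypothetical model outputs on a test set
    ("A", True), ("A", True), ("A", True), ("A", False),
    ("B", True), ("B", False), ("B", False), ("B", False),
]
MAX_GAP = 0.2  # illustrative threshold, agreed before training
gap = parity_gap(decisions)
within_threshold = gap <= MAX_GAP
```

With the threshold fixed before training, the result is a pass or a fail, not a discussion.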
Most common failure
Decision logging is bolted on after the fact and is incomplete. The team can show the model produced an output but cannot reconstruct the input or the reasoning chain. The first regulatory enquiry exposes the gap.
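A decision record that captures the fields named in criterion 3 at inference time might look like the sketch below. The `DecisionRecord` type, its field names, and the JSON-lines sink are assumptions for illustration; retention and storage are left to your own logging platform.

```python
import io
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class DecisionRecord:
    request_id: str
    logged_at: str
    model_version: str
    input_text: str
    output_text: str
    confidence: Optional[float]
    reasoning: Optional[str]  # reasoning chain, where applicable


def log_decision(sink, request_id, model_version, input_text, output_text,
                 confidence=None, reasoning=None):
    """Write one inference as a JSON line to `sink` and return the record."""
    record = DecisionRecord(
        request_id=request_id,
        logged_at=datetime.now(timezone.utc).isoformat(),
        model_version=model_version,
        input_text=input_text,
        output_text=output_text,
        confidence=confidence,
        reasoning=reasoning,
    )
    sink.write(json.dumps(asdict(record)) + "\n")
    return record


# An in-memory sink for the example; production would use durable storage.
buf = io.StringIO()
log_decision(buf, "req-1", "v1.2.0", "Is claim 881 eligible?", "Eligible",
             confidence=0.91, reasoning="Policy active; claim within limit.")
row = json.loads(buf.getvalue())
```

Writing the input and the model version in the same record as the output is what lets you reconstruct a decision later rather than only prove that it happened.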
If you fail this gate
Bring the governance work to the front of the build. Model approval gates, decision logging architecture, and the incident response playbook are scoped into the build SOW, not assumed as overhead.
Gate 4
Evaluation Infrastructure
How will you know the model is working in production, and how quickly will you know when it stops?
Pass criteria
- 1. A ground-truth evaluation set exists, version-controlled, owned by the business, refreshed on a documented cadence. This is the same set from Gate 1, audited here for whether it is fit for evaluation.
- 2. An evaluation harness runs against the ground-truth set automatically: on every model version change, every prompt change, and on a scheduled cadence in production. Manual runs do not count.
- 3. Production performance is monitored at three levels: accuracy against ground truth, input-distribution drift, and output-distribution drift. Each has a defined alert threshold.
- 4. Hallucination rate is measured for any generative use case. With citation enforcement in place, a rate under 3 percent on monitored queries is a defensible benchmark for production-grade enterprise copilots. The threshold is documented before launch.
- 5. The evaluation infrastructure is owned by your team after the build. Your team can extend the ground-truth set, change thresholds, and rerun evaluation independently.
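The core of an evaluation harness is small. The sketch below scores a model against a ground-truth set and checks a hallucination rate against a documented threshold. The function names, the exact-match scoring, the stub model, and the 0.90 accuracy floor are assumptions; the 3 percent figure is the benchmark cited in criterion 4.

```python
def evaluate(model_fn, ground_truth, min_accuracy):
    """Run model_fn over (input, expected) pairs; return accuracy and pass/fail.

    Exact-match comparison is an assumption; many use cases need a softer score.
    """
    if not ground_truth:
        raise ValueError("ground-truth set is empty")
    correct = sum(1 for x, expected in ground_truth if model_fn(x) == expected)
    accuracy = correct / len(ground_truth)
    return {"accuracy": accuracy, "passed": accuracy >= min_accuracy}


def hallucination_check(flagged, monitored, max_rate=0.03):
    """Compare the share of flagged responses with the documented threshold."""
    if monitored <= 0:
        raise ValueError("no monitored queries")
    rate = flagged / monitored
    return {"rate": rate, "passed": rate < max_rate}


# Stub model for the example: upper-cases its input.
ground_truth = [("a", "A"), ("b", "B"), ("c", "X"), ("d", "D")]
result = evaluate(str.upper, ground_truth, min_accuracy=0.90)
halluc = hallucination_check(flagged=2, monitored=100)
```

To satisfy criterion 2, a run like this would be triggered by the pipeline on every model or prompt change and on a schedule, not started by hand.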
Most common failure
The evaluation harness exists at launch and then atrophies. Six months in, no one has updated the ground-truth set. The model has drifted, the evaluation has not, and the only signal of a problem is a user complaint.
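Drift can be caught before a user complains by comparing the live input distribution with a baseline. The sketch below uses the population stability index over categorical values, which is one common drift measure and an assumption here, not a method the criteria mandate. The category names and the 0.2 alert threshold, a widely used rule of thumb, are illustrative.

```python
import math
from collections import Counter


def psi(baseline, current, eps=1e-6):
    """Population stability index between two samples of categorical values."""
    if not baseline or not current:
        raise ValueError("both samples must be non-empty")
    b_counts, c_counts = Counter(baseline), Counter(current)
    score = 0.0
    for cat in set(baseline) | set(current):
        # eps avoids log(0) when a category is absent from one sample.
        b = max(b_counts[cat] / len(baseline), eps)
        c = max(c_counts[cat] / len(current), eps)
        score += (c - b) * math.log(c / b)
    return score


# Hypothetical query topics at launch versus six months later.
baseline = ["billing"] * 50 + ["login"] * 50
current = ["billing"] * 20 + ["login"] * 80

ALERT_THRESHOLD = 0.2  # illustrative; set and document your own
drift_score = psi(baseline, current)
alert = drift_score > ALERT_THRESHOLD
```

The same function can be pointed at model outputs to cover output-distribution drift, with its own threshold.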
If you fail this gate
Build the evaluation infrastructure first, then the model. A model without an evaluation harness is a demo, not a production system.
Gate 5
Operational Handover
On the day the build partner leaves, does your team have what it needs to run the system?
Pass criteria
- 1. A named operator inside the organization owns the system from day one of operations, with operating budget allocated. Not the build partner. Not "to be determined at handover."
- 2. The operations runbook exists and has been used. The named operator has executed it against the system at least once before handover: drift response, retraining, incident response, and routine monitoring all rehearsed.
- 3. The team that will run the system has been trained against the actual system, not a generic curriculum. Training happens during the build, not after.
- 4. The build partner has a defined exit: a handover date, a documented deliverables set, a final acceptance protocol, and a transition support window.
- 5. The system can be operated without the build partner. Critical dependencies on the partner's tooling, accounts, or knowledge are documented and either transferred or replaced.
Most common failure
This gate fails silently. The build ships, the team is happy with launch, and six months later the system has drifted. There is no one to call. The build partner moved on. The original operator was reassigned. The model produces degraded output and the program is quietly shelved.
If you fail this gate
Do not start the build. The most expensive failure mode in enterprise AI is shipping a system no one is staffed to run. Either solve the staffing problem before construction, or scope an operate retainer with a partner that genuinely owns the system from day one.