What an AI audit looks like in 2026.

The AI audit, until recently, was something companies described in their slide decks without anyone actually performing one. Two things changed that. Internal audit functions started staffing up against AI-specific scopes, and external regulators started asking specific questions whose answers required someone to have already done the work.

The audit that results is recognisable in shape to anyone who has been through an information-security audit, but the evidence it examines is different. This post is a walkthrough of what an AI audit covers in 2026, written from the perspective of the team being audited.

Key facts

01: Four evidence categories carry the audit. Process documentation, system records (the audit log), sample outputs, and governance minutes. Every finding maps back to evidence in one of the four. A finding that cannot be traced to one of them is not actionable.
02: Sampling is two-axis, not random. Auditors sample by use case (to ensure coverage) and by time (to catch recency effects). Pure random sampling under-weights the use cases that matter and over-weights the ones with the most volume. The two-axis method is now standard.
03: Findings are categorised by remediation route. Control gap, process gap, or documentation gap. The three categories have different remediation routes and different timelines. Mixing them produces a punch list that the team cannot execute against without re-reading the report.
04: The closing report is a deliverable to a non-technical audience. The audit committee, the board, and possibly the regulator. The technical detail belongs in appendices. The body of the report is written so the chair of audit can answer questions about it without preparation.

What the audit actually examines

The scope of a 2026 AI audit divides cleanly into four areas. Each has its own evidence base, its own sampling method, and its own typical findings.

01. Governance and policy

Does an AI governance charter exist, and is it current? Does an AI steering committee meet on a regular cadence, with minutes? Is there an AI acceptable-use policy, signed by employees, and reviewed in the last twelve months? Are there documented procedures for vendor due diligence, model change management, and incident response?

The evidence is the documents themselves and the metadata around them: version histories, signature records, meeting minutes, change logs. The typical finding here is a documentation gap rather than outright absence: the policy exists, but the signature record is incomplete, or the meeting minutes show approval of items not described elsewhere in the documentation.
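Two of the governance checks above (policy reviewed in the last twelve months, signature record complete) are mechanical enough to sketch. This is a minimal illustration, not a prescribed tool: the function name `governance_gaps`, its parameters, and the 365-day approximation of "twelve months" are all assumptions made for the example.

```python
from datetime import date, timedelta


def governance_gaps(policy_reviewed_on, employees, signatures, as_of, max_age_days=365):
    """Return plain-text findings for a stale policy review or missing signatures.

    All names are hypothetical; 365 days stands in for "twelve months".
    """
    findings = []
    # Check 1: the acceptable-use policy was reviewed within the allowed window.
    if as_of - policy_reviewed_on > timedelta(days=max_age_days):
        findings.append("policy review older than twelve months")
    # Check 2: every employee on the roster appears in the signature record.
    unsigned = sorted(set(employees) - set(signatures))
    if unsigned:
        findings.append("missing signatures: " + ", ".join(unsigned))
    return findings


if __name__ == "__main__":
    print(governance_gaps(
        policy_reviewed_on=date(2025, 1, 10),
        employees=["ana", "ben", "cy"],
        signatures={"ana", "cy"},
        as_of=date(2026, 3, 1),
    ))
```

Running the same check before the audit starts is one way to close the signature gap in advance.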

02. Use case coverage

Is there a use-case register? Is it complete? For each entry, does the register contain the named owner, the data classes involved, the decision authority, the controls applied, the test method, and the most recent test result? Are entries retired when use cases are decommissioned?

The evidence is the register itself, cross-referenced against the system inventory and the operational records. The typical finding here is completeness: a use case appears in the system but not in the register, or a register entry references a control that no one can demonstrate.
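The cross-reference described above can be sketched as a few set comparisons. This is an illustration under stated assumptions: the register is modelled as a dictionary keyed by use-case ID, the field names in `REQUIRED_FIELDS` are invented labels for the six items listed earlier, and the `retired` flag is a hypothetical way of recording decommissioning.

```python
# Hypothetical field names for the six register items named in the text.
REQUIRED_FIELDS = (
    "owner", "data_classes", "decision_authority",
    "controls", "test_method", "last_test_result",
)


def register_gaps(register, deployed_ids, controls_library):
    """Compare the use-case register with the system inventory and controls library.

    register: dict mapping use-case ID to a dict of fields (assumed shape).
    deployed_ids: set of use-case IDs found in the system inventory.
    controls_library: set of control IDs that can be demonstrated.
    Returns a list of (use_case_id, description) tuples.
    """
    findings = []
    # Use cases running in the system but absent from the register.
    for uc_id in sorted(set(deployed_ids) - set(register)):
        findings.append((uc_id, "in system inventory but not in register"))
    for uc_id, entry in sorted(register.items()):
        # Empty or missing required fields.
        for name in REQUIRED_FIELDS:
            if not entry.get(name):
                findings.append((uc_id, f"missing field: {name}"))
        # Controls the register cites that the library cannot show.
        for control in entry.get("controls") or []:
            if control not in controls_library:
                findings.append((uc_id, f"unknown control: {control}"))
        # Entries that should have been retired.
        if uc_id not in deployed_ids and not entry.get("retired"):
            findings.append((uc_id, "not deployed and not marked retired"))
    return findings
```

Each tuple corresponds to one of the completeness findings described above: a use case in the system but not in the register, or a register entry that points at a control no one can demonstrate.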

03. Operational records

The audit log is the centre of this area. Auditors sample queries, responses, and approvals from across the period under review, and they check four things: that every regulated action has a logged approval with a named reviewer, that every AI-generated answer has a recorded source set, that the hash chain is intact, and that gate checks recorded as passed correspond to controls that actually executed.

The evidence is the log itself, cross-referenced against the controls library. The typical finding here is mismatch: a gate check is recorded as passed but the corresponding control has a different scope, or a reviewer is named on an approval but the reviewer's role record does not include that authority.
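Two of the four checks, the intact hash chain and the reviewer's authority, can be illustrated in a short sketch. The post does not specify a log format, so everything here is an assumption: each entry is taken to store a payload, the previous entry's hash, and its own SHA-256 hash over both, and the field names (`action`, `reviewer`, `authority`) are hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed starting value for the first entry's prev_hash


def entry_hash(prev_hash, payload):
    """Hash the previous hash together with a canonical JSON form of the payload."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append_entry(log, payload):
    """Append a payload to the log, linking it to the previous entry."""
    prev = log[-1]["hash"] if log else GENESIS
    log.append({"payload": payload, "prev_hash": prev, "hash": entry_hash(prev, payload)})


def first_broken_link(log):
    """Return the index of the first entry that fails verification, or None."""
    prev = GENESIS
    for i, entry in enumerate(log):
        if entry["prev_hash"] != prev or entry["hash"] != entry_hash(prev, entry["payload"]):
            return i
        prev = entry["hash"]
    return None


def unauthorised_approvals(log, role_authorities):
    """Return indexes of approvals whose named reviewer lacks the recorded authority.

    role_authorities: dict mapping reviewer ID to a set of authorities (assumed shape).
    """
    bad = []
    for i, entry in enumerate(log):
        payload = entry["payload"]
        if payload.get("action") != "approve":
            continue
        allowed = role_authorities.get(payload.get("reviewer"), set())
        if payload.get("authority") not in allowed:
            bad.append(i)
    return bad
```

A changed payload makes `first_broken_link` return that entry's index, because the stored hash no longer matches the recomputed one. The second function covers the mismatch finding where a reviewer is named but their role record does not include the authority.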

04. Output quality

This is the newest of the four areas, and the most variable in maturity. Auditors sample model outputs and evaluate them against the standard the use case is supposed to meet: source attribution present, draft marking applied, accuracy against ground truth where ground truth exists, false-positive and false-negative rates against the targets in the use-case register.

The evidence is the sampled outputs and their accompanying log entries. The typical finding here is calibration: the use-case register sets a target that the actual outputs are not measured against, or the measurement exists but is not surfaced to the audit committee.
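The calibration check comes down to simple arithmetic on the sample. As a hedged sketch: assume each sampled output has been labelled against ground truth as a pair (flagged by the model, actually positive), and that the register stores targets under the hypothetical keys `max_fpr` and `max_fnr`.

```python
def error_rates(samples):
    """Compute (false-positive rate, false-negative rate) from labelled samples.

    samples: list of (predicted, actual) booleans. A rate is None when the
    sample contains no cases of the relevant class.
    """
    fp = sum(1 for predicted, actual in samples if predicted and not actual)
    fn = sum(1 for predicted, actual in samples if not predicted and actual)
    negatives = sum(1 for _, actual in samples if not actual)
    positives = sum(1 for _, actual in samples if actual)
    fpr = fp / negatives if negatives else None
    fnr = fn / positives if positives else None
    return fpr, fnr


def calibration_findings(use_case, samples, targets):
    """Compare observed rates with register targets (hypothetical key names)."""
    fpr, fnr = error_rates(samples)
    findings = []
    for name, observed in (("max_fpr", fpr), ("max_fnr", fnr)):
        target = targets.get(name)
        if target is None:
            findings.append(f"{use_case}: no {name} target in register")
        elif observed is None:
            findings.append(f"{use_case}: sample cannot measure {name}")
        elif observed > target:
            findings.append(f"{use_case}: observed {observed:.2f} exceeds {name} {target:.2f}")
    return findings
```

For example, a sample with 2 missed cases out of 10 true positives has a false-negative rate of 0.20. Against a register target of 0.10, that is a finding. A register with no target at all produces a different finding, which is the calibration gap described above.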

How sampling works

Pure random sampling does not work for AI audits, for two reasons. First, the use cases have wildly different volumes: alert triage might generate ten thousand outputs a day, draft assistance ten, customer file review one. Pure random sampling pulls almost everything from alert triage and misses the others. Second, AI systems change during the period under review: a model is retrained, a prompt is updated, a vendor is changed. Outputs from before and after a change are not interchangeable, so recency matters.

The method that practice has converged on in 2026 is two-axis sampling. The first axis is use case: every use case in the register gets a sample, weighted to give meaningful coverage to the low-volume ones. The second axis is time: the period is divided into segments (typically weeks), and samples are drawn from each segment to catch the effects of model or process changes.

A typical sample for a quarter-long audit is twenty to forty entries per use case, distributed across the weeks of the period. The exact numbers are negotiable; the structure is not.
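The structure can be sketched as stratified sampling with two levels of strata. This is one possible reading, not a standard implementation: it assumes log entries arrive as (use case, date, entry ID) tuples, uses ISO weeks as the time segments, and fills each use case's quota round-robin across weeks. The function name and parameters are hypothetical.

```python
import random
from collections import defaultdict
from datetime import date


def two_axis_sample(entries, per_use_case=20, seed=0):
    """Draw up to per_use_case entry IDs for every use case, spread across weeks.

    entries: iterable of (use_case, day, entry_id), where day is a datetime.date.
    Returns a dict mapping use case to a list of sampled entry IDs.
    """
    rng = random.Random(seed)  # fixed seed so the sample can be reproduced
    # Axis 1: use case. Axis 2: ISO (year, week) segment.
    strata = defaultdict(lambda: defaultdict(list))
    for use_case, day, entry_id in entries:
        iso = day.isocalendar()
        strata[use_case][(iso[0], iso[1])].append(entry_id)

    sample = {}
    for use_case, weeks in strata.items():
        # Shuffle each week's pool, in chronological week order.
        pools = [rng.sample(ids, len(ids)) for _, ids in sorted(weeks.items())]
        picked = []
        # Take one entry per week in turn until the quota is met or pools run dry.
        while len(picked) < per_use_case and any(pools):
            for pool in pools:
                if pool and len(picked) < per_use_case:
                    picked.append(pool.pop())
        sample[use_case] = picked
    return sample


if __name__ == "__main__":
    demo = [("alert-triage", date(2026, 1, 5), f"t-{n}") for n in range(100)]
    demo.append(("file-review", date(2026, 1, 6), "r-0"))
    print({uc: len(ids) for uc, ids in two_axis_sample(demo, per_use_case=20).items()})
```

Because the quota is set per use case rather than per entry, a use case with four entries in the period contributes all four, while a high-volume use case is capped at the same quota and spread evenly over the weeks.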

How findings are categorised

Three categories carry the punch list. Each has a different remediation route, and mixing them is the most common mistake in audit reports.

A control gap means a control that should exist does not, or a control that exists does not actually perform its function. Remediation is a change to the controls library and, usually, a change to the system. Timelines are the longest of the three categories.

A process gap means a control exists and works, but the process around it has a hole. Approvals happen but are not consistently captured. Reviews happen but are not consistently filed. Remediation is a change to the operating procedure, often without a system change, so timelines are shorter than for a control gap.

A documentation gap means the system, the controls, and the process are all sound, but the record is incomplete or out of date. Remediation is updating the documentation. Timelines are short.

A finding without one of these categories is not actionable. The audit team should refuse to accept it until the category is named, because the team being audited cannot execute against an unclassified finding.
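One way to enforce that rule in a findings tracker is to make the category a closed set and to separate out anything unclassified before the punch list is built. The sketch below is illustrative only: the `Finding` fields and the `split_actionable` helper are assumptions, and the remediation strings simply restate the routes described above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Category(Enum):
    CONTROL_GAP = "control gap"
    PROCESS_GAP = "process gap"
    DOCUMENTATION_GAP = "documentation gap"


# Remediation routes as described in the text.
REMEDIATION = {
    Category.CONTROL_GAP: "change the controls library and, usually, the system",
    Category.PROCESS_GAP: "change the operating procedure",
    Category.DOCUMENTATION_GAP: "update the documentation",
}


@dataclass
class Finding:
    ref: str                              # hypothetical finding reference
    summary: str
    category: Optional[Category] = None   # None means not yet classified


def split_actionable(findings):
    """Separate classified findings from those still missing a category."""
    actionable = [f for f in findings if f.category is not None]
    unclassified = [f for f in findings if f.category is None]
    return actionable, unclassified
```

Anything in the unclassified list goes back for a category before it reaches the punch list.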

The closing report

The closing report has a non-technical audience: the audit committee of the board, possibly an external regulator, and the executive sponsor of the AI programme. The body of the report has to be readable by people who cannot interpret a technical artefact without help.

We aim for a body of fifteen to twenty pages: an executive summary, the scope, the methodology, the findings (organised by area and severity), the management response, and the remediation plan with owners and dates. The technical detail (the actual sample data, the hash-chain verification output, the controls library extracts) belongs in appendices. The audit committee reads the body. The next layer down reads the appendices when they need to.

The most useful thing the team being audited can do, before the closing report is drafted, is to write the management response to each finding while the finding is still being discussed. This compresses the cycle and reduces the number of findings that end up disputed in the final document. The other useful thing is to make sure the controls library and the use-case register are current before the audit starts; the single biggest source of avoidable findings is a documentation gap that could have been closed in advance.

FAQ

Questions readers ask

What does an AI audit examine?

Four areas: governance and policy (charter, committee, AUP, procedures), use-case coverage (the register is complete and current), operational records (the audit log is intact and consistent with the controls), and output quality (sampled outputs meet the standard the use case is supposed to meet). Each area has its own evidence base and its own typical findings.

How are AI audits sampled?

Two-axis sampling has emerged as the 2026 standard: by use case (so low-volume use cases get meaningful coverage instead of being swamped by high-volume ones), and by time (so the effects of model retraining, prompt changes, and vendor swaps are caught). A typical quarter-long audit takes twenty to forty samples per use case, distributed across the weeks of the period.

How are AI audit findings categorised?

Three categories with different remediation routes: control gap (a control is missing or does not work; system or controls-library change required, longer timeline), process gap (controls work but the process around them has a hole; procedure update, shorter timeline), or documentation gap (everything works but the record is incomplete or out of date; documentation update, shortest timeline). A finding without one of these categories is not actionable.

Who is the audience for the AI audit report?

The body of the report is written for the audit committee of the board, possibly an external regulator, and the executive sponsor of the AI programme. The technical detail (sample data, hash-chain output, controls extracts) belongs in appendices. We aim for a body of fifteen to twenty pages: executive summary, scope, methodology, findings by area and severity, management response, and remediation plan.

How can a team prepare for an AI audit?

Two moves help most. Make sure the controls library and the use-case register are current before the audit starts (the single biggest source of avoidable findings is a documentation gap that could have been closed in advance). And write the management response to each finding while the finding is still being discussed, before the closing report is drafted; this compresses the cycle and reduces disputed findings in the final document.