Toolkit — STPA (Systems-Theoretic Process Analysis)¶

Gate: G3 Route (Q3b). Category: controls-derivation lens.

What problem it solves¶

Some pieces produce their hazards through coordination between controllers — a model interacting with an operator interacting with a downstream system — not through a single failing part. Bow-tie or brainstorming misses these hazards because there is no single event to anchor on. STPA derives the controls from the system's control structure, so each control has a traceable provenance back to a specific unsafe control action.

How it is used¶

Two or three facilitated workshop sessions, four to twelve hours in total, across a week or two. Not a single whiteboard meeting — the method has a real learning curve and benefits from a practitioner leading the room. Session 1: draw the control structure and list the losses the engagement must avoid. Session 2: enumerate unsafe control actions (UCAs) per control action. Session 3: derive loss scenarios and write safety constraints. For ongoing systems, the same structure is revisited longitudinally as operational data accumulates — STPA artefacts are living documents, not frozen deliverables.

Inputs¶

A routed piece and a draft of its control relationships — who issues what to whom, and what feedback returns.
A list of losses the engagement must avoid, each stated as what must not happen (e.g., a driver dispatched to a dock they cannot unload at).
An owner list for each controller in the structure.
A practitioner-level facilitator with STPA experience (or allow extra sessions for the team to learn as they go).

Outputs¶

A one-page control-structure diagram (controllers, controlled processes, control actions, feedback).
A UCA table: unsafe control actions per control action, each written with context.
A list of loss scenarios explaining why each UCA could occur.
A list of safety constraints — each one becomes a control on the commitment page, each traceable back through a loss scenario, UCA, and control action to a named loss.
Trigger-metric candidates — the observable conditions that indicate a UCA has occurred — direct input to G5 trigger design.

Visualisation¶

Controllers stack vertically. Each control-action arrow is a site where UCAs are enumerated; each UCA becomes a candidate trigger metric and — via its loss scenario — a safety constraint the controls sketch inherits.

Anatomy¶

Four steps, each producing an artefact that the next step consumes.

Step 1 — Losses, hazards, and system-level constraints. Name the losses the engagement must avoid ("a driver is dispatched to a dock they cannot unload at"). For each loss, name the system state that could lead to it — the hazard. For each hazard, write a system-level constraint: the system shall not allow…. This is the vocabulary the rest of the analysis uses.

Step 2 — Control structure. Draw the piece as a hierarchy of controllers and controlled processes. Each controller issues control actions; each controlled process provides feedback. Humans, models, and automation all appear as controllers. The diagram is coarse — usually one page — but it must name every control action the piece can emit. For a yard-slot allocator: the allocator (controller) issues an assignment action (control action) to the dispatch system (controlled process), which returns actual arrival and dock-state (feedback).

Step 3 — Unsafe control actions (UCAs). For each control action, enumerate the four UCA types:

Not providing the action when needed.
Providing the action when it is unsafe.
Providing the action at the wrong time (too early, too late, wrong order).
Stopping the action too soon, or continuing it too long.

Each UCA is written as: controller + action + context + loss reference. For the allocator: "the allocator provides an assignment when the carrier's arrival feed is more than 15 minutes stale, leading to Loss-1." UCAs are the catalogue of how the control structure can produce a hazard.

Step 4 — Loss scenarios and safety constraints. For each UCA, ask why would this occur? The answers are loss scenarios: the feedback is missing, the controller's mental model is wrong, a coordination pattern between controllers is broken. Each loss scenario becomes one or more safety constraints — sentences of the form the [controller] shall not [action] when [context] — and each constraint is a control on the commitment page.

The method's key move: it derives controls from the system's structure, not from brainstorming. A control STPA produces has a traceable provenance back through a loss scenario, a UCA, a control action, and a named hazard. That trace is what lets the reviewer defend the controls set.

Example¶

Paper trail — three STPA workshops for the yard-slot allocator

Freight yard, Chapter 9, yard-slot allocator piece. Three sessions over two weeks, 5.5 hours total. Facilitator: Dana Hill (external safety consultant, STPA-certified). Team: Priya Chen (owner); Raj Patel (operations); Alex Kim (ML); Sam Okafor (dispatch-system engineer); Mei Sato (safety); one product manager, one SRE. Output: a control-structure diagram, a UCA table, a loss-scenario list, and 12 safety constraints attached to the controls sketch.

Session 1 — losses and control structure (90 minutes).

T+0 — losses. Dana asks the team to name what must not happen. The board collects three: L1 a driver is dispatched to a dock they cannot unload at; L2 a carrier loses goods in an assignment mismatch; L3 an operator stops trusting the allocator and manually overrides everything. Priya questions whether L3 is a loss or a symptom; Dana holds firm — loss of trust is a genuine system loss because it invalidates the business case.

T+25 — control structure, first draft. The team draws four boxes vertically: operator → allocator → dispatch system → yard process. Arrows going down are control actions; arrows going up are feedback.

T+45 — Sam corrects the model twice. First correction: the dispatch system has a tacit override on the allocator's assignments that the allocator does not see — when dispatch reassigns a dock, the allocator learns about it from feedback, not through its own control loop. Second correction: the operator has two distinct control actions on the allocator — override and approve — not one. The diagram is redrawn.

T+85 — close. Final structure has 4 controllers, 7 control actions, 5 feedback loops. Photograph taken. Dana assigns homework: each team member is to read the structure once per day for a week, looking for missing control actions.

Session 2 — UCAs (120 minutes).

T+0. Two additional control actions surfaced during the week — an auto-retry on allocator timeouts, and a fallback to rules-baseline when the allocator's confidence is below the floor. Both added.

T+15 — UCAs for "assign". Dana walks the four UCA types for each control action. For assign, the team lists 8 UCAs. The ones that matter: UCA1 allocator assigns when dock-state feedback is >15 min stale → L1; UCA2 allocator does not assign within horizon window, forcing operator manual assignment under pressure → L3; UCA3 allocator assigns after the carrier notify window has closed → L1 via notify race. Mei initially misses the "continues assigning after an unsafe condition is detected" type — Dana prompts, and the team adds UCA4 allocator continues assigning to a dock whose blocked-flag has been raised but not yet propagated → L1.

T+90 — UCAs for the operator's override and approve. Four more UCAs. The interesting one: operator approves an allocator assignment without inspecting it (under time pressure) links to L1. Dana notes the team has now located a hazard that bow-tie would have missed — it lives in a human control action, not in the model.

T+120 — close. 18 UCAs on the board. Each tagged with a loss reference.

Session 3 — loss scenarios and safety constraints (90 minutes).

T+0. For each UCA, Dana asks "why would this occur?" Answers produce loss scenarios.

T+30 — from UCA1 (stale dock-state). Two scenarios: LS1a dock-state feedback loop is broken or throttled; LS1b model is operating outside its certified range and mis-reads fresh feedback as stale. Two safety constraints follow: SaC1 allocator shall not assign when dock-state feedback age >15 min; SaC2 allocator shall not assign when model confidence <certified floor.

T+55 — from UCA3 (notify race). Scenario: the allocator's timing is coordinated with dispatch but not with the notify window. SaC3 allocator shall not assign within N minutes of notify-window close. The team notes this as a candidate sunset criterion too — if the coordination pattern fails systematically, the piece needs re-engineering, not patching.

T+80 — from UCA2 (no assignment under pressure). SaC4 allocator shall emit an explicit "no-candidate" signal rather than silence, to prevent operator assumption of manual authority. This is a constraint on the lack of action — a category the team had not previously recognised as a control.

T+90 — close. 12 safety constraints written. Each traced back through a loss scenario, UCA, and control action to a specific loss. Dana's worksheet maps each safety constraint to: - a control on the commitment page (the mechanism that enforces the constraint); - a trigger metric at G5 (the observable condition that indicates a UCA is occurring); - optionally, a sunset criterion (for constraints about coordination patterns, whose systematic violation invalidates the piece).

What was on the final page. 12 safety constraints → 12 controls-sketch entries with owners. 18 UCAs → 9 trigger metrics (after merging duplicates). 2 coordination-pattern loss scenarios → 2 sunset criteria. Every output has a named trace back through the control structure. Priya transcribes all of this into the G3 routing map row, the G5 commitment-page draft, and the sunset-criteria worksheet.

Follow-up at month 2. A sev-2 incident reveals an unmodelled control action: the allocator emits a retrain-request signal to an upstream training pipeline, which the Session-1 diagram had omitted as "out of scope." The diagram is reopened, the missed control action added, and four additional UCAs surface. Two of them are real; one becomes an additional trigger metric. This is STPA's expected maintenance shape — the structure is a living document, not a frozen artefact.

Pitfalls¶

Over-modelling. STPA is not an architecture document. The control structure must be coarse enough to fit on a page; the analytic work happens in Steps 3 and 4, not in drawing ever-finer diagrams. A team that spends a week on Step 2 and a day on Step 3 has inverted the method's cost-to-value shape.

Missing control actions. If Step 2 omits a control action, Step 3 produces no UCAs for it, and the hazards those UCAs would have caught go undiscovered. The fix is to review Step 2 specifically for control actions the system can emit under edge conditions — operator manual overrides, automatic retries, fallback triggers — that a happy-path diagram misses.

UCAs without context. "The allocator provides an assignment when it should not" is not a UCA; "the allocator provides an assignment when the dock-state feedback is more than 15 minutes stale" is. Without context, the UCA cannot be traced to a specific hazard and cannot become a specific control.

Treating UCAs as exhaustive failure modes. UCAs are one of four types per control action, not a full taxonomy of the system's failures. Component-level failures and software bugs still happen outside the UCA catalogue. STPA complements but does not replace FMEA for component-rich systems.

Stopping at Step 3. A team enumerates UCAs and stops, never reaching loss scenarios and safety constraints. Without Step 4, STPA has produced a failure catalogue but no controls — the output the downstream gates were promised is not there.

Confusing a constraint with a control. A safety constraint is the statement of what must not happen; the control is the mechanism that enforces it. The two are paired — every constraint must name a mechanism — but the analyst has to supply the mechanism; STPA does not write it automatically.

When not to use¶

The piece has one obvious named hazard and a simple control structure. Bow-tie is faster and produces the same artefact in less time.
The piece is Human-operated with a short feedback loop and no automation substrate. STPA works on human workflows, but the cost-to-value ratio favours lighter tools (escalation templates, error-prevention checklists).
The engagement is at G1 or G2 and the piece has not been routed. STPA is a G3-Q3b tool; it operates on a routed piece whose control structure can be drawn.
The team has no facilitator with STPA experience. The method has a real learning curve; a first attempt without a practitioner typically produces a bad Step 2 that propagates into an uninformative Step 4. In that case, use bow-tie first.

Provenance¶

STPA is the analytic method of STAMP (Systems-Theoretic Accident Model and Processes), developed by Nancy Leveson at MIT beginning in the early 2000s. The canonical reference is the Leveson & Thomas STPA Handbook [1], which codifies the four-step procedure, the UCA taxonomy, and the control-structure notation used here. STPA has been applied to aviation, defence, medical devices, automotive (including ISO 26262-adjacent work), and more recently to AI safety. Its distinguishing commitment — that hazards arise from inadequate control of interactions between components, not from component failure alone — is what makes it fit AI routes, where the hazard is often a coordination problem between the model, the operator, and downstream systems.

Bow-tie analysis. The other G3-Q3b derivation tool. Bow-tie anchors on a single hazard and draws barriers around it; STPA starts from the control structure and derives safety constraints. Use bow-tie when the hazard is nameable as a single event; use STPA when the hazard is produced by feedback or coordination between controllers.
Pre-mortem. Chapter 7 Q3 paper tool. A pre-mortem surfaces failure modes informally; STPA refines the most system-structural of those into UCAs and safety constraints. Pre-mortem first when time is short; STPA when the piece's control structure is non-trivial.
Rollback-trigger design. G5 tool. STPA's UCAs feed directly into trigger metrics — each UCA names an observable context condition the trigger can watch.
Blast-radius estimation. G4 tool. STPA's loss scenarios inform the consequence-time facet of blast radius: a loss scenario that takes 24 hours to manifest sets the blast radius time at 24 hours, which in turn bounds the trigger window at G5.

Verification¶

[1] Leveson NG, Thomas JP. STPA Handbook. MIT Partnership for Systems Approaches to Safety and Security (PSASS); 2018. [verified] The canonical reference for the four-step STPA procedure, the UCA taxonomy, and the control-structure notation used here.