03
Flywheel
A systems view of feedback loops: models, products, telemetry, and organizational learning.


Overview#

The flywheel is a reinforcing system, not a pipeline. Deploying model-assisted workflows can generate the evidence needed to improve those workflows, which changes what the system can attempt.

The condition is closure: you can observe outcomes, attribute them to system behavior, and turn those observations into updates (prompts, tools, policies, evaluations, or models). Without that measurement loop, “iteration” becomes drift.
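
Concretely, closure is as much a logging problem as a modeling problem. The sketch below is a minimal Python illustration (the field names, workflow, and version labels are assumptions invented for the example, not a schema from any particular product) of an outcome record that carries enough attribution to tie an observed outcome back to the prompt, tool, and policy versions that produced it.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class OutcomeRecord:
    """One observed outcome, tied to the system state that produced it.

    Without the attribution fields, an outcome can be observed but not
    traced back to a specific prompt, tool, or policy change; the loop
    stays open and "iteration" drifts.
    """
    workflow: str            # which workflow produced this outcome
    request_id: str          # joins the outcome to the full request telemetry
    outcome: str             # e.g. "accepted", "corrected", "rolled_back"
    prompt_version: str      # version of the prompt in use at the time
    tool_versions: dict[str, str] = field(default_factory=dict)
    policy_version: str = "unversioned"
    observed_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

# Example: a correction observed in production, attributable to prompt v12.
record = OutcomeRecord(
    workflow="invoice-triage",
    request_id="req-4821",
    outcome="corrected",
    prompt_version="v12",
    tool_versions={"erp_lookup": "2.3.1"},
)
```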

I treat the flywheel as a progression of system capabilities that may become feasible as reliability and governance improve:

perception → reasoning → generation → autonomy → new perception

This progression is not guaranteed. It is a compact way to track where errors enter, where they can be detected, and what feedback is required.

The sketch below is how I expect the flywheel to work when it works: product usage produces telemetry, telemetry becomes evaluations, evaluations drive workflow/model updates, and those updates change what the product can reliably attempt. The loop can stall, drift, or mislead if measurement is weak or if the evaluation target is poorly specified. Treat this as a design lens and a diagnostic tool: not a prediction or an inevitability.

Figure 1. Flywheel sketch: Perception & Recognition → Prediction & Forecasting → Decision & Optimization → Reasoning & Knowledge → Generative AI → Autonomy & Agents → New Perception.
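
Read as code, one turn of that loop might look like the sketch below. Everything here is an illustrative assumption rather than a real API: `telemetry` stands in for logged interactions with known outcomes, and `evaluate` for whatever harness a team actually runs.

```python
def run_flywheel_iteration(telemetry, current_config, candidate_config, evaluate):
    """One turn of the loop: telemetry -> evaluations -> update decision.

    `telemetry` is an iterable of logged interactions with a known outcome;
    `evaluate(config, cases)` returns a pass rate on those cases.
    Both are placeholders, not a real harness.
    """
    # 1. Telemetry becomes evaluations: observed failures are frozen into cases.
    eval_cases = [t for t in telemetry if t["outcome"] == "failure"]
    if not eval_cases:
        return current_config  # nothing new to learn from; the wheel idles

    # 2. Evaluations drive the update decision, not the other way around.
    baseline = evaluate(current_config, eval_cases)
    candidate = evaluate(candidate_config, eval_cases)

    # 3. Only promote changes that measurably improve on observed failures.
    return candidate_config if candidate > baseline else current_config
```

A production loop would also score the candidate on previously passing cases, so that fixing one failure cannot silently regress another.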

The flywheel components#

The stages below should not be seen as a strict progression or hierarchy. Instead, they describe capability surfaces that participate in the same feedback loop, each with its own forms of leverage and failure. The flywheel turns only when those surfaces are jointly instrumented; a sketch of that check follows the list.

  • Perception & Recognition: Systems ingest signals and detect patterns across inputs (text, images, logs, speech, sensors).
    • Improves: signal coverage, robustness, feature extraction
    • Fails when: inputs are biased, drifting, or weakly observable
    • Defines what the system can notice.
  • Prediction & Forecasting: Models anticipate outcomes or states using probabilistic structure and prior data.
    • Improves: calibration, temporal consistency, uncertainty handling
    • Fails when: confidence outpaces evidence or environments shift
    • Defines what the system can anticipate.
  • Decision & Optimization: Algorithms select actions or rankings to optimize objectives under constraints.
    • Improves: policy efficiency, trade-off handling, constraint adherence
    • Fails when: objectives are mis-specified or locally optimized
    • Defines what the system can choose.
  • Reasoning & Knowledge: The system applies structured knowledge and multi-step inference to contextualize decisions.
    • Improves: consistency, decomposition, self-checking
    • Fails when: reasoning chains are brittle or assumptions are hidden
    • Defines what the system can understand.
  • Generative Capability: Models produce new artifacts (text, code, plans, designs) under constraints.
    • Improves: usefulness, controllability, task alignment
    • Fails when: fluency substitutes for correctness
    • Defines what the system can create.
  • Autonomy & Agents: The system acts over time by maintaining state, invoking tools, and sequencing actions toward goals.
    • Improves: task completion, resilience, delegation
    • Fails when: errors compound, state is opaque, or attribution breaks down
    • Defines what the system can do.
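
One way to make "jointly instrumented" operational is to require that every surface reports at least one logged signal before conclusions drawn from the loop are trusted. The signal names in the sketch below are illustrative assumptions, not a standard.

```python
# Capability surfaces and the signals a team might log for each.
REQUIRED_SIGNALS = {
    "perception":  ["input_coverage", "drift_score"],
    "prediction":  ["calibration_error"],
    "decision":    ["constraint_violations"],
    "reasoning":   ["self_check_pass_rate"],
    "generation":  ["task_alignment_score"],
    "autonomy":    ["rollback_rate", "attribution_completeness"],
}

def uninstrumented_surfaces(logged_metrics: dict[str, set[str]]) -> list[str]:
    """Return surfaces with no logged signal; a non-empty list means the
    flywheel's conclusions rest on unobserved stages."""
    return [
        surface for surface, signals in REQUIRED_SIGNALS.items()
        if not set(signals) & logged_metrics.get(surface, set())
    ]

# Example: autonomy is acting but reporting nothing, so the loop is blind there.
print(uninstrumented_surfaces({
    "perception": {"drift_score"},
    "prediction": {"calibration_error"},
    "decision":   {"constraint_violations"},
    "reasoning":  {"self_check_pass_rate"},
    "generation": {"task_alignment_score"},
}))  # -> ['autonomy']
```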

Why autonomy changes system dynamics#

Autonomy changes the error surface because mistakes become stateful:

  • A single incorrect step can alter the environment and create downstream evidence that looks “consistent,” masking the original error.
  • Latency and throughput constraints become product constraints: autonomous loops may be faster, but they can also fail faster.
  • Governance becomes part of the algorithm: permissions, budgets, approvals, and audit logs determine what the system is allowed to do and how errors are contained (a minimal sketch of such guards follows this list).
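
To make that last point concrete, the sketch below places permissions, a budget, and an audit log inside the execution path rather than around it. The class, guard names, and limits are assumptions for illustration, not a real framework.

```python
import datetime

class GovernedExecutor:
    """Wraps tool calls so permissions, budgets, and auditing are
    enforced on every action, not reconstructed after the fact."""

    def __init__(self, allowed_tools: set[str], budget: int):
        self.allowed_tools = allowed_tools
        self.budget = budget              # maximum actions per task
        self.audit_log: list[dict] = []

    def execute(self, tool_name: str, tool_fn, *args, **kwargs):
        entry = {
            "tool": tool_name,
            "at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        }
        # Permission and budget checks gate the action itself.
        if tool_name not in self.allowed_tools:
            entry["result"] = "denied: not permitted"
        elif self.budget <= 0:
            entry["result"] = "denied: budget exhausted"
        else:
            self.budget -= 1
            entry["result"] = tool_fn(*args, **kwargs)
        # Every attempt, allowed or denied, lands in the audit log.
        self.audit_log.append(entry)
        return entry["result"]
```

The placement is the point: because the checks sit inside the action path, a denied call is itself an observable, attributable event.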

Autonomy is not inherently better. It can make sense when:

  • the action space is well-bounded,
  • outcomes are observable,
  • rollback is possible,
  • and the error-cost profile is acceptable (a rough gate over these conditions is sketched below).
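
The same four conditions can be written down as an explicit gate. This is a sketch under the assumption that each field is set by human review rather than inferred; the field names and threshold logic are invented for illustration.

```python
def autonomy_appropriate(task: dict) -> bool:
    """Crude gate over the four conditions listed above."""
    return (
        task.get("action_space_bounded", False)
        and task.get("outcomes_observable", False)
        and task.get("rollback_possible", False)
        and task.get("expected_error_cost", float("inf")) <= task.get("error_budget", 0)
    )

# Example: observable and reversible, but the expected error cost exceeds the budget.
print(autonomy_appropriate({
    "action_space_bounded": True,
    "outcomes_observable": True,
    "rollback_possible": True,
    "expected_error_cost": 500.0,
    "error_budget": 100.0,
}))  # -> False
```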

Measurement as the closure mechanism#

The loop closes when measurement is treated as a first-class product surface.

In practice, this tends to require:

  • A stable definition of success for each workflow (not just “helpfulness”).
  • Instrumentation that logs inputs, tool calls, intermediate state, and outputs with enough context for diagnosis.
  • An evaluation harness that turns observed failures into regression tests (sketched after this list).
  • A feedback path from incidents to changes (prompt/tool policy updates, retrieval changes, model fine-tunes, or human process changes).
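
As an illustration of the last two bullets, the sketch below freezes an observed incident into a regression case and re-runs the suite against the current workflow, so a later change cannot silently regress it. The incident id, inputs, and expected label are all invented for the example.

```python
class RegressionSuite:
    """Observed failures become permanent test cases for a workflow."""

    def __init__(self):
        self.cases: list[dict] = []

    def add_case_from_incident(self, inputs: dict, expected: str, incident_id: str):
        # Each case remembers the incident that motivated it,
        # which keeps the incident-to-change feedback path auditable.
        self.cases.append({"inputs": inputs, "expected": expected,
                           "incident_id": incident_id})

    def run(self, workflow) -> list[str]:
        """Return incident ids whose cases fail under the current workflow."""
        return [
            case["incident_id"]
            for case in self.cases
            if workflow(case["inputs"]) != case["expected"]
        ]

# Usage: after an incident, the failing input is pinned as a test case.
suite = RegressionSuite()
suite.add_case_from_incident(
    inputs={"query": "refund policy for opened items"},
    expected="cite_policy_section_4_2",
    incident_id="INC-231",
)
failing = suite.run(lambda inputs: "cite_policy_section_4_2")
# -> [] because the pinned case passes under this stand-in workflow
```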

If measurement is weak, organizations often see local wins and global frustration: impressive demonstrations, followed by unreliable behavior that is hard to debug and hard to improve.

See the boxed example below for a concrete instance of how a hallucination can become an incident.

Failure modes and limits#

Common failure modes that break the loop:

  • Metric mismatch: optimizing for proxy metrics (e.g., response length, user satisfaction scores) that are weakly correlated with correctness.
  • Feedback contamination: training or tuning on outputs that were already influenced by the system, amplifying its own errors.
  • Tool-interface brittleness: small changes in downstream systems (schemas, permissions, endpoints) cause silent failures or incorrect actions (a contract-check sketch follows this list).
  • Hidden dependency on humans: “autonomy” depends on untracked human review, manual correction, or domain expertise that is not modeled.
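
Of these, tool-interface brittleness is the most mechanically checkable. A minimal sketch, assuming an illustrative schema and field names, is to validate every tool response against the contract the workflow was built for and fail loudly when that contract drifts, instead of acting on a silently reshaped payload.

```python
# Expected contract for a downstream tool, pinned when the workflow was built.
EXPECTED_SCHEMA = {"order_id": str, "status": str, "refundable": bool}

class ToolContractError(Exception):
    """Raised instead of letting a schema change become a silent wrong action."""

def check_tool_response(payload: dict) -> dict:
    missing = [k for k in EXPECTED_SCHEMA if k not in payload]
    wrong_type = [
        k for k, t in EXPECTED_SCHEMA.items()
        if k in payload and not isinstance(payload[k], t)
    ]
    if missing or wrong_type:
        raise ToolContractError(
            f"tool response drifted: missing={missing}, wrong_type={wrong_type}"
        )
    return payload

# A renamed field ("refundable" -> "is_refundable") now fails loudly.
try:
    check_tool_response({"order_id": "A-17", "status": "shipped", "is_refundable": True})
except ToolContractError as exc:
    print(exc)
```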

Limits are often structural:

  • If correctness cannot be measured cheaply, learning stalls.
  • If errors are high-cost and irreversible, autonomy may be inappropriate.
  • If the environment is adversarial or highly non-stationary, previous evaluations may not predict future performance.

Key points#

  • Flywheels require instrumentation.
  • Evaluation is the control surface.
  • “Learning” can be local (workflow), not just global (model).