FEAGI Trainer — Training Paradigms
Status: Proposed
Date: 2026-06-09
Owners: FEAGI Trainer Architecture Working Group
Related: docs/FEAGI_TRAINER_ARCHITECTURE_AND_DESIGN.md, docs/FEAGI_TRAINER_ADR_SET.md, docs/EXPERIENCE_TRAINER_E2E_IMPLEMENTATION_PLAN.md, docs/EXPERIENCE_CAPTURE_ARCHITECTURE_AND_DESIGN.md, crate feagi-core/crates/feagi-trainer
This document enumerates the training paradigms that feagi-trainer accounts for, the mechanism each one uses, and its implementation status. It describes the agreed design; any change to paradigm scope must be discussed and recorded both here and in the architecture/design doc before implementation.
1. Framing — the Trainer shapes signals; FEAGI does the learning
FEAGI is not a gradient learner. Learning physically happens inside the engine via plasticity (STDP) driven by neural co-activation and FEAGI's native affect channels (Pain / Pleasure / Fear / Hope) — see FEAGI_TRAINER_ARCHITECTURE_AND_DESIGN.md A.3 #1 and Appendix B.3.
Consequently, a "training paradigm" in feagi-trainer is defined by how a supervisory signal is shaped and which channel it is delivered into — not by which optimizer runs. The Trainer orchestrates signals and scores outcomes; it never implements a learning rule.
There are exactly three delivery channels into FEAGI, and every paradigm is a combination of them:
| Channel | Where in the loop | Carries | Trainer axis |
|---|---|---|---|
| Sensory (IPU) | before step | the stimulus / observation | EncoderPlugin |
| Affect / reward (Pain/Pleasure/Fear/Hope) | after collect_motor | outcome valuation | RewardPolicy (first-class axis) |
| Teaching / target-motor | with sensory, before step | the demonstrated correct action | reserved (FeagiRuntime::submit_target_motor) |
The reward axis is anchored in FEAGI's native affect areas (Appendix B.3); the Trainer must not invent a side-channel reward mechanism. Reward-policy version is part of the run comparability key (ADR / Appendix D.2).
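The channel table can be read as a small data model. The following is a minimal sketch, not the crate's API: the type names Channel and LoopPoint and the helper delivery_order are hypothetical and exist only to illustrate the "Where in the loop" column.

```rust
/// The three delivery channels from the table above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Channel {
    Sensory,
    Affect,
    Teaching,
}

/// Point in the loop at which a channel is delivered (hypothetical names).
/// Declaration order doubles as delivery order via the derived `Ord`.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum LoopPoint {
    BeforeStep,
    AfterCollectMotor,
}

impl Channel {
    /// Mirrors the "Where in the loop" column of the table.
    pub fn delivered_at(self) -> LoopPoint {
        match self {
            Channel::Sensory | Channel::Teaching => LoopPoint::BeforeStep,
            Channel::Affect => LoopPoint::AfterCollectMotor,
        }
    }
}

/// Orders a paradigm's channels by delivery point, so signals due before
/// the step precede the affect signal delivered after collect_motor.
pub fn delivery_order(channels: &[Channel]) -> Vec<Channel> {
    let mut ordered = channels.to_vec();
    // `sort_by_key` is stable, so channels with the same delivery point
    // keep their input order.
    ordered.sort_by_key(|c| c.delivered_at());
    ordered
}
```

Under these assumptions, a supervised run that combines the sensory and affect channels always delivers the stimulus first and the valuation last, whatever order the channels were declared in.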
1.1 The Trainer delivers these channels as a FEAGI agent
The three channels above are delivered over feagi-agent. This means the Trainer is itself a FEAGI agent, and how it co-exists with an embodiment controller is the central topology decision (ADR-014):
- Non-embodied data (datasets): there is no controller, so the Trainer is the sole agent — it drives the sensory input and the reward/pain.
- Embodied tasks: the embodiment controller is already a FEAGI agent owning the robot's real sensory/motor + sim physics. The Trainer runs as a parallel co-agent on disjoint cortical I/O, owning only the training-signal channels (reward/teaching/goal) + readouts for scoring. It never drives sim physics.
Capture/replay stays embodiment-agnostic: capture at the cortical (FEAGI-native) boundary; embodiment-native traces are an opt-in sidecar; the Trainer never speaks a controller's native command language (ADR-015).
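The "disjoint cortical I/O" requirement for the co-agent topology could be checked before a run starts. The sketch below assumes cortical areas are identified by string IDs; the function name overlapping_areas and the example IDs are hypothetical and are not taken from feagi-agent.

```rust
use std::collections::BTreeSet;

/// Returns the cortical area IDs claimed by both agents, in sorted order.
/// An empty result means the Trainer and the controller are disjoint.
pub fn overlapping_areas<'a>(
    trainer: &BTreeSet<&'a str>,
    controller: &BTreeSet<&'a str>,
) -> Vec<&'a str> {
    trainer.intersection(controller).copied().collect()
}

/// Convenience wrapper: Ok when the two agents share no cortical area,
/// otherwise Err carrying the conflicting IDs.
pub fn check_disjoint<'a>(
    trainer: &BTreeSet<&'a str>,
    controller: &BTreeSet<&'a str>,
) -> Result<(), Vec<&'a str>> {
    let overlap = overlapping_areas(trainer, controller);
    if overlap.is_empty() {
        Ok(())
    } else {
        Err(overlap)
    }
}
```

A check of this kind would let the Trainer refuse to start as a co-agent when it has been bound to an area the controller already owns, rather than discovering the conflict mid-run.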
2. Paradigms
2.1 Reward & punishment (reinforcement-style)
The first-class, versioned RewardPolicy axis maps an outcome onto the native affect channels. This is the backbone of every supervised and reinforcement run. Two flavors:
- Supervised-correctness reward: prediction-vs-label drives Pleasure/Pain (e.g. the IRIS slice via PainPleasureReward). Implemented.
- Environmental / episodic reward: the outcome of an embodied rollout drives affect, with success evidence from a telemetry predicate or goal-distance signal (ADR-014). Delivered by the Trainer co-agent in parallel with the controller. When reward is intrinsic to the genome (e.g. the pendulum's R-STDP balance_homeostatic personality), this axis is a no-op / observe-only and the Trainer scores rather than shapes. This flavor cannot be expressed by the current RewardPolicy::reward(prediction, target) signature (it has no per-step target); it requires the co-agent seam (Section 3).
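The supervised-correctness flavor can be sketched as follows. Only the names RewardPolicy, reward(prediction, target), and PainPleasureReward come from this document; the argument types, the AffectSignal struct, the version method, and the magnitude field are assumptions made for illustration.

```rust
/// Valuation delivered to FEAGI's native affect areas. Only the two
/// channels used by this example are modeled; Fear and Hope are omitted.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct AffectSignal {
    pub pleasure: f32,
    pub pain: f32,
}

/// Sketch of the reward seam. Argument and return types are assumptions.
pub trait RewardPolicy {
    /// Version string that would feed the run comparability key.
    fn version(&self) -> &'static str;
    /// Maps a prediction/label pair onto an affect signal.
    fn reward(&self, prediction: usize, target: usize) -> AffectSignal;
}

/// Correct predictions drive Pleasure; incorrect ones drive Pain.
pub struct PainPleasureReward {
    pub magnitude: f32,
}

impl RewardPolicy for PainPleasureReward {
    fn version(&self) -> &'static str {
        "pain-pleasure/example-v0" // placeholder, not a real version id
    }

    fn reward(&self, prediction: usize, target: usize) -> AffectSignal {
        if prediction == target {
            AffectSignal { pleasure: self.magnitude, pain: 0.0 }
        } else {
            AffectSignal { pleasure: 0.0, pain: self.magnitude }
        }
    }
}
```

The sketch also makes the limitation stated above concrete: reward needs a target on every call, which an embodied rollout cannot supply.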
2.2 Supervised associative learning
Present the input and a correctness / expected signal during a train phase so FEAGI associates stimulus → response. Mechanically this is delivered through the reward channel (the policy emits the correctness signal into the affect areas during the train phase — Appendix B.3). In FEAGI terms "supervised" is a special case of reward-shaping, not a separate optimizer. Implemented for classification (IRIS).
2.3 Imitation / behavior cloning (teaching / supervised forcing)
Inject the demonstrated action as expected motor output before the burst, teaching FEAGI to reproduce it. This uses the reserved teaching / target-motor channel (FeagiRuntime::submit_target_motor, default Unsupported). The seam exists (plan Phase 1b) so the loop is not reopened later; the mode itself is scheduled for Phase 5 (VLA slice — learning from demonstrations alongside reward).
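A reserved seam with a default of Unsupported maps naturally onto a trait method with a default body. The sketch below assumes that shape; the error type, the slice-of-f32 target encoding, and the two runtime structs are hypothetical, and only FeagiRuntime::submit_target_motor and the Unsupported default are named in this document.

```rust
/// Hypothetical error type for runtime capabilities.
#[derive(Debug, PartialEq, Eq)]
pub enum RuntimeError {
    Unsupported,
}

pub trait FeagiRuntime {
    /// Reserved teaching channel. The default body reports Unsupported,
    /// so existing runtimes compile unchanged until the mode is built.
    fn submit_target_motor(&mut self, _target: &[f32]) -> Result<(), RuntimeError> {
        Err(RuntimeError::Unsupported)
    }
}

/// A runtime that does not implement teaching and inherits the default.
pub struct DatasetRuntime;

impl FeagiRuntime for DatasetRuntime {}

/// A runtime that opts in by overriding the default (illustrative only:
/// it records the target instead of injecting it into FEAGI).
pub struct TeachingRuntime {
    pub last_target: Vec<f32>,
}

impl FeagiRuntime for TeachingRuntime {
    fn submit_target_motor(&mut self, target: &[f32]) -> Result<(), RuntimeError> {
        self.last_target = target.to_vec();
        Ok(())
    }
}
```

With this shape, adding the imitation mode later means overriding one method rather than changing the loop that calls it.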
2.4 Closed-loop reinforcement / embodied control
Roll the brain out in an environment, deliver per-step / per-episode reward, and score episodic success (e.g. balance duration, cumulative reward, success rate). This is the embodied/control metric family (design Section 5.8). The topology is decided (ADR-014): the Trainer is a parallel co-agent that injects reward/teaching/goal signals and scores, while the controller owns the robot's sensory/motor + sim physics. The episodic-control metric pack + EpisodeTrajectory are built and tested; the live co-agent execution path is sequenced after the dataset path (plan Phase 1d, re-scoped). Embodied Scorecards mean closed-loop task success, not offline prediction accuracy (plan Section 5.6).
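The episodic metrics named above (cumulative reward, success rate) can be illustrated with a small sketch. The EpisodeSummary struct and its fields are hypothetical stand-ins; they do not describe the real EpisodeTrajectory type or the built metric pack.

```rust
/// Hypothetical per-episode record: one reward per step plus a success
/// flag (for example, the result of a telemetry predicate).
pub struct EpisodeSummary {
    pub step_rewards: Vec<f32>,
    pub succeeded: bool,
}

/// Sum of per-step rewards for a single episode.
pub fn cumulative_reward(episode: &EpisodeSummary) -> f32 {
    episode.step_rewards.iter().sum()
}

/// Fraction of episodes that succeeded, or None when there are no
/// episodes (avoids reporting 0/0 as a rate).
pub fn success_rate(episodes: &[EpisodeSummary]) -> Option<f32> {
    if episodes.is_empty() {
        return None;
    }
    let wins = episodes.iter().filter(|e| e.succeeded).count();
    Some(wins as f32 / episodes.len() as f32)
}
```

Both functions score closed-loop outcomes only; neither compares a prediction against a label, which is the distinction drawn above between embodied Scorecards and offline accuracy.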
2.5 Unsupervised / Hebbian associative exposure
Present sensory input only — no reward and no target — and let plasticity associate co-active patterns. This is structurally already possible (the executor records unlabeled samples without scoring or rewarding them) but is not a first-class, named scenario or metric pack yet. A common goal-driven variant (e.g. quadruped: "minimize deviation from a target IMU stability signal") is implemented as reward shaping — the goal-distance becomes the affect signal (§2.1 / ADR-014), so it reduces to the reward axis rather than a new mechanism. A genuinely self-supervised/predictive objective remains a deferred, open decision.
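The goal-driven variant, in which goal-distance becomes the affect signal, might look like the sketch below. The Euclidean distance over a three-component IMU reading, the tolerance parameter, and the linear mapping are all assumptions chosen for illustration; the document does not specify the shaping function.

```rust
/// Valuation for the Pain / Pleasure channels (Fear and Hope omitted).
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct AffectSignal {
    pub pleasure: f32,
    pub pain: f32,
}

/// Euclidean distance between a target and an observed IMU reading.
pub fn goal_distance(target: [f32; 3], observed: [f32; 3]) -> f32 {
    target
        .iter()
        .zip(observed.iter())
        .map(|(t, o)| (t - o) * (t - o))
        .sum::<f32>()
        .sqrt()
}

/// Maps a goal distance onto affect. `tolerance` must be positive.
/// Inside the tolerance, Pleasure falls linearly from 1 to 0; outside
/// it, Pain rises linearly and saturates at 1.
pub fn shape_goal_reward(distance: f32, tolerance: f32) -> AffectSignal {
    let ratio = distance / tolerance;
    if ratio <= 1.0 {
        AffectSignal { pleasure: 1.0 - ratio, pain: 0.0 }
    } else {
        AffectSignal { pleasure: 0.0, pain: (ratio - 1.0).min(1.0) }
    }
}
```

Because the output is an ordinary affect signal, nothing downstream needs to know that the valuation came from a goal rather than from a label, which is why this variant reduces to the reward axis.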
3. Cross-cutting seam note
Paradigms 2.1 (supervised flavor) and 2.2 share the RewardPolicy::reward(prediction, target) seam and the static-sample executor run_rollout. Paradigms 2.3, 2.4, and 2.5 each break that signature:
- 2.4 has no per-step target, and its reward is environment-derived;
- 2.3 needs the teaching / target-motor channel;
- 2.5 has no reward and no target at all.
The co-agent seam (ADR-014) resolves this generally: because the Trainer injects training signals on disjoint cortical I/O, hosting environment-reward (2.4), teaching (2.3), and exposure/goal (2.5) is a matter of which channel the co-agent drives, not a new loop. Until the live co-agent path is built, only 2.1 (supervised) and 2.2 are exercised end-to-end.
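The idea that paradigms differ only in which training signal is driven can be written down as a lookup. This is a sketch under stated assumptions: the Paradigm and TrainingSignal enums and both method names are hypothetical, and the mapping simply restates this section and the status table.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Paradigm {
    SupervisedReward,      // 2.1, supervised-correctness flavor
    SupervisedAssociative, // 2.2
    Imitation,             // 2.3
    EmbodiedControl,       // 2.4
    Exposure,              // 2.5, sensory input only
    GoalDriven,            // 2.5 variant that reduces to reward shaping
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TrainingSignal {
    Reward,
    Teaching,
    Goal,
}

impl Paradigm {
    /// Training signals the Trainer injects for this paradigm.
    pub fn training_signals(self) -> &'static [TrainingSignal] {
        match self {
            Paradigm::SupervisedReward
            | Paradigm::SupervisedAssociative
            | Paradigm::EmbodiedControl => &[TrainingSignal::Reward],
            Paradigm::Imitation => &[TrainingSignal::Teaching],
            Paradigm::GoalDriven => &[TrainingSignal::Goal],
            Paradigm::Exposure => &[], // no reward and no target
        }
    }

    /// True only for the paradigms exercised end-to-end before the live
    /// co-agent path exists.
    pub fn exercised_end_to_end(self) -> bool {
        matches!(
            self,
            Paradigm::SupervisedReward | Paradigm::SupervisedAssociative
        )
    }
}
```

Adding a paradigm in this model means adding a match arm, not a new execution loop, which is the property the co-agent seam is meant to provide.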
4. Status summary
| Paradigm | Primary channel(s) | Status | Milestone |
|---|---|---|---|
| Reward & punishment — supervised-correctness | affect | Implemented | M1 (IRIS) |
| Supervised associative classification | sensory + affect | Implemented | M1 (IRIS) |
| Reward & punishment — environmental/episodic | affect (co-agent) | Metric pack built; live co-agent path pending | Phase 1d (re-scoped) |
| Closed-loop embodied control (RL-style) | sensory + affect (co-agent) | Topology decided (ADR-014); metric pack built; live path pending | Phase 1d (re-scoped) |
| Imitation / behavior cloning | teaching / target-motor | Reserved seam (Phase 1b) | Phase 5 |
| Unsupervised / Hebbian exposure | sensory only | Structurally possible, not first-class | Open decision |
| Goal-driven (reward-shaped) | affect (co-agent) | Reduces to §2.1 reward shaping | Phase 1d (re-scoped) |
5. Explicit non-goals
- No gradient descent / backprop and no Trainer-side learning rule — plasticity / STDP lives in the FEAGI engine.
- No Trainer-invented reward mechanism — reward must target FEAGI's native affect channels (Appendix B.3).
- No data capture or labeling — that is Experience Capture's responsibility (ADR-001 boundary); the Trainer consumes datasets and produces verifiable Scorecards.
6. Worked scenarios (how the paradigms compose)
These worked examples show the design covers diverse training needs by varying only which signals the Trainer co-agent injects and the data source — not the engine (ADR-014/ADR-015).
| # | Scenario | Paradigm(s) | Agent topology | Trainer agent injects | Reward / success evidence | Data source |
|---|---|---|---|---|---|---|
| 1 | Robot arm — pickup from manual demos | Imitation (2.3) + reward (2.1) | Co-agent (controller runs the arm) | demonstration/teaching + reward/pain | telemetry predicate ("object lifted") | Experience Capture episodes (+ live sim) |
| 2 | Robot arm — coordinates → pickup | Supervised assoc. (2.2) + reward (2.1) | Co-agent | object-coordinate input + reward/pain | experience labels | Experience Capture |
| 3 | Cancer-cell anomaly detection | Supervised classification (2.2 + 2.1) | Sole agent (no embodiment) | dataset sample input + reward/pain | experience labels (anomalous vs normal) | dataset |
| 4a | Quadruped walk — IMU stability goal | Goal-driven, reward-shaped (2.5→2.1) | Co-agent | goal (target IMU) signal | goal-distance to ideal IMU | captured IMU stability |
| 4b | Quadruped walk — gait demonstration | Imitation (2.3) + goal (2.5→2.1) | Co-agent | gait demonstration + ideal-IMU goal | goal-distance + demo alignment | Experience Capture |
Common thread: the controller (when present) owns the robot's sensory/motor + physics; the Trainer co-agent adds reward/teaching/goal on disjoint cortical I/O and scores the outcome. For scenario 3 there is no controller, so the Trainer is the sole agent — this is the dataset path built first.
7. UI implications and gaps vs best-in-class (by paradigm)
Modern ML tools (W&B, MLflow, Roboflow, Isaac Lab) set researcher expectations per paradigm. The Trainer desktop UI (ADR-005, architecture doc Section 7.4–7.6) must reflect FEAGI semantics (plasticity + affect, not gradients) while closing UX gaps where parity is reasonable.
7.1 What researchers expect vs what we provide (v1)
| Researcher expectation | Tabular / dataset path (2.1–2.2) v1 | Embodied / co-agent (2.3–2.4) | Unsupervised (2.5) |
|---|---|---|---|
| Train / val / test splits | Protocol phases on wizard Step 1; Test = benchmark Scorecard | Same protocol model; episodes per phase | Exposure-only phase; no reward UI |
| Dataset registry | Experience Catalog default (Step 2); import CSV fallback | Experience Capture episodes | Catalog or live stream |
| Live metrics during run | Table from RunEvent; gap: no time-series chart | Gap: episode reward curve, success rate | Gap: exposure progress only |
| Hyperparameters / optimizer | Non-goal — use plasticity, reward magnitude, ticks/sample | Same; add goal/teaching bindings | N/A |
| Run comparison | Gap: single-run Results step | Gap: compare rollouts / success rates | Deferred |
| Embodiment context | Genome + embodiment in AppBar widget | Gap: co-agent status, controller health, disjoint I/O map | N/A |
7.2 Paradigm-specific UI backlog
| Paradigm | Wizard / UI surface (target) | Gap vs best-in-class |
|---|---|---|
| 2.1–2.2 Supervised (v1) | Steps 1–6 as implemented in wireframe | Metric charts; multi-run compare; catalog API; connectome hash verification |
| 2.3 Imitation | Teaching channel bindings; demo preview from Capture | No demo timeline scrubber; no "show me the teaching injection" debug view |
| 2.4 Embodied control | Co-agent panel: episode list, telemetry predicate status, affect injection log | No Isaac-style episode replay; no parallel controller diagram; live co-agent path not built |
| 2.5 Unsupervised exposure | Exposure phase template (reward off, observe-only metrics) | Not a named wizard preset; no Hebbian-specific metric pack UI |
7.3 Language and anti-patterns
The UI must not label plasticity controls as "learning rate" or show loss curves that imply backprop. Prefer: reward magnitude, protocol phase, affect channel, ticks per sample, Scorecard / evaluation protocol version. This is an intentional divergence from PyTorch/Hugging Face — document it in onboarding copy on Step 1 (Training setup).
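The terminology rule could be enforced with a simple check over UI label strings. The sketch below is one possible approach, not an existing tool: the function name, the constant, and the choice of a case-insensitive substring match are assumptions, and the term list covers only the two examples given above.

```rust
/// Terms that imply gradient-based training and should not appear in
/// Trainer UI labels. Covers only the examples named in this section.
const DISCOURAGED_TERMS: [&str; 2] = ["learning rate", "loss curve"];

/// Returns every discouraged term found in `label`, ignoring case.
pub fn discouraged_terms_in(label: &str) -> Vec<&'static str> {
    let lowered = label.to_lowercase();
    DISCOURAGED_TERMS
        .iter()
        .copied()
        .filter(|term| lowered.contains(*term))
        .collect()
}
```

A check like this could run over the wizard's label strings in a unit test, so that a stray "Learning rate" slider fails the build instead of reaching onboarding copy.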
7.4 Cross-reference
- Desktop wizard flow and full gap table: FEAGI_TRAINER_ARCHITECTURE_AND_DESIGN.md Section 7.4–7.6
- ADR-005 Implementation Notes: Desktop UI design record + ADR-scoped gap summary
- Delivery phases: FEAGI_TRAINER_ADR_SET.md Appendix B (L3 items per phase)