From Effect Ledger to Goal-Aware Training Data
How SI-Core turns runtime experience into safer models
Draft v0.1 — Non-normative supplement to SI-Core / SI-NOS / PLB / GDPR Ethical Redaction
This document is non-normative. It explains how to use SI-Core’s structured logs (jumps, effects, GCS, EthicsTrace, metrics) to build goal-aware learning pipelines. Normative contracts still live in the SI-Core / SI-NOS specs, evaluation packs, and GDPR Ethical Redaction guides.
1. Why “learning” looks different on SI-Core
Most current ML pipelines look like this:
Raw logs → Ad-hoc ETL → Training set → Model → Deployed somewhere
The problems are familiar:
Goals are implicit (buried in loss functions or business docs).
Context is weak (which system, which actor, which risk level?).
Ethics and governance are bolted on after the fact.
When something goes wrong, you can’t answer:
- “Which experiences did this model actually learn from?”
- “What did we forget when we redacted user X?”
On an SI-Core stack, we have a very different starting point:
World → [OBS] → Jump → Effect Ledger + Metrics → PLB / Training
Each jump already carries:
- [ID] — Who/what initiated this decision path.
- [OBS] — The structured observation used.
- [ETH] / EthicsTrace — Which ethics policy, what decision.
- [EVAL] — Risk profile, sandbox runs, GCS estimates.
- [MEM] — Hash-chained effect ledger entries, RML level, rollback traces.
In other words, runtime experience is already structured, goal-tagged, and auditable.
This document shows how to turn that structure into goal-aware training sets and SI-native evaluation pipelines.
2. Three layers of learning data
It helps to think of SI-native learning data in three layers.
2.1 Event-level: Jumps and effects
The atomic unit is a jump plus its effect ledger entries:
Jump metadata:
- jump_id, timestamp, service, conformance_class, rml_level.
- [ID] (actor, role, origin), [OBS] (observation_id, coverage).
- Proposed vs executed actions (LLM wrapper / SIL / tools).
Effect ledger entries:
- effect_id, type (write, API call, external side-effect), compensator info (for RML-2/3).
This level feeds models that answer questions like:
- “Given this observation, what action tends to lead to good GCS later?”
- “What is the probability this jump will need a rollback?”
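To make the event-level unit concrete, here is a hypothetical record sketch. The field names follow the lists above, but the exact shapes and identifiers are illustrative only.
# Hypothetical event-level record: one jump plus its effect ledger entries,
# flattened into a plain dict for downstream feature extraction.
jump_record = {
    "jump_id": "jump:city-orchestrator:2028-03-14T10:15:00Z:42",
    "timestamp": "2028-03-14T10:15:00Z",
    "service": "city-orchestrator",
    "conformance_class": "L2",
    "rml_level": 2,
    "id": {"actor": "flood_controller", "role": "service", "origin": "si-nos"},
    "obs": {"observation_id": "obs:7781", "coverage": 0.97},
    "eth": {"policy_version": "eth-policy/v3", "decision": "allow"},
    "effects": [
        {"effect_id": "eff:9001", "type": "write", "compensator": "comp:9001"},
        {"effect_id": "eff:9002", "type": "api_call", "compensator": None},
    ],
}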
2.2 Episode-level: Failures, rollbacks, incidents
Episodes connect sequences of jumps and effects into stories:
Incident traces (from 60-010 / failure docs).
Rollback events and RML behavior:
- Which jumps were rolled back fully, rolled back only partially, or failed to roll back.
Metrics around the incident window (CAS, EAI, RBL, RIR, SCI).
This level feeds models like:
- Early warning predictors (“this pattern of jumps tends to precede incidents”).
- Root-cause helpers for PLB (“these patterns correlate with rollbacks in sector 12”).
2.3 Aggregate-level: GCS, EthicsTrace, metrics
At the top level we have goal-aware and governance-aware aggregates:
GCS vectors per jump / per action.
EthicsTrace decisions and rationales.
SI-Core metrics snapshots (CAS, SCI, SCover, OCR, EAI, ACR, RBL, RIR, EOH).
- OCR (Observation Coverage Ratio): what “[OBS] coverage” refers to.
  Suggested definition: OCR = observed_required_units / total_required_units (range 0.0..1.0).
- SCover (Structural Coverage): share of SIR blocks traced.
- ACR (Audit Chain Completeness): how complete the audit chain is for the evaluated slice.
This level feeds models that answer questions like:
- “Given current state, which policy knob setting leads to higher EAI without hurting CAS?”
- “Which semantic compression settings keep ε low for this goal?”
We will treat these three layers as feature sources and labels for training.
3. Designing a goal-aware training task
Before extracting any data, decide what you want the model to learn in SI-native terms.
3.1 Choose the goal and horizon
Examples:
- city.flood_risk_minimization over the next 6 hours.
- user.fair_treatment over the next 30 days.
- system.rollback_risk_minimization over the next N jumps.
For each training task, define:
training_task:
id: flood_risk_predictor_v1
goal: city.flood_risk_minimization
prediction_horizon: 6h
subject_scope: ["sector", "canal"]
decision_context: ["flood_controller"]
3.2 Define labels in GCS / SI terms
Instead of inventing an opaque label, derive it from existing SI-Core structure.
Some examples:
- Target = future GCS for a given goal:
  y = GCS_city.flood_risk_minimization(a, t→t+6h)
- Target = rollback / incident indicator:
  y = 1 if this jump (or its descendants) triggered RML-2/3 rollback in the next 24h
- Target = ethics violation risk:
  y = probability that a similar jump would have been rejected by [ETH] overlay or appealed by a human
These labels come from effect ledger + metrics, not from ad-hoc annotation.
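For example, the rollback label can be read straight off the ledger. A minimal sketch, assuming a ledger API with the illustrative descendants_of and rollbacks_in_window helpers:
from datetime import timedelta

def rollback_label(jump, ledger, horizon_hours=24):
    # 1 if this jump or any of its descendants triggered an RML-2/3 rollback
    # within the horizon, else 0. Helper names are illustrative, not normative.
    window_end = jump.timestamp + timedelta(hours=horizon_hours)
    lineage = {jump.jump_id} | set(ledger.descendants_of(jump.jump_id, until=window_end))
    rollbacks = ledger.rollbacks_in_window(jump.timestamp, window_end, min_rml=2)
    return int(any(rb.jump_id in lineage for rb in rollbacks))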
3.3 Define feature sets structurally
Features should be derived from [OBS] + SIM/SIS + context:
- Observation features: semantic units used in [OBS] (e.g. flood_risk_state, traffic_state).
- Actor features: role, identity class (via [ID]).
- Policy state: current compression settings, ethics policy version, risk profile.
- Environmental context: time of day, season, external risk level.
Document feature provenance explicitly:
features:
- name: sector_risk_score
source: OBS.semantic.flood_risk_state.payload.risk_score
- name: hospital_load_index
source: OBS.semantic.hospital_state.payload.load_index
- name: policy_version
source: ETH.policy_version
- name: compression_mode
source: ctx.compression_mode
This makes it easy to audit which parts of the SI-Core state the model is allowed to use.
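A small, non-normative check can enforce that audit: given the feature spec above and an allowlist of permitted source prefixes (the allowlist below is illustrative), refuse to build the dataset if any feature reads outside the allowed parts of SI-Core state.
ALLOWED_SOURCE_PREFIXES = ("OBS.semantic.", "ETH.", "ctx.")  # illustrative policy

def audit_feature_sources(features):
    # Return the names of features whose declared source is not allowed.
    return [
        f["name"]
        for f in features
        if not f["source"].startswith(ALLOWED_SOURCE_PREFIXES)
    ]

violations = audit_feature_sources([
    {"name": "sector_risk_score", "source": "OBS.semantic.flood_risk_state.payload.risk_score"},
    {"name": "raw_user_email", "source": "raw_log.user.email"},  # would be flagged
])
assert violations == ["raw_user_email"]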
3.4 Advanced feature engineering patterns (non-normative)
Why feature engineering matters:
The effect ledger is rich but raw:
- Temporal dependencies across jumps
- Graph structure between actors and effects
- Multi-modal observations from sensors, logs, external feeds
Good features:
- Improve prediction accuracy
- Make models easier to explain
- Align better with goal structures and GCS
Pattern 1: Temporal features
Rolling statistics over recent jumps:
from statistics import mean

def extract_temporal_features(jump, ledger, window="24h"):
    # Recent jumps by the same actor, ending at this jump's timestamp.
    recent = ledger.get_jumps(
        end=jump.timestamp,
        window=window,
        filter={"actor": jump.actor},
    )
    if not recent:
        return {
            "recent_jump_count": 0,
            "avg_gcs_flood": 0.0,
            "max_rml_level": 0,
            "rollback_rate": 0.0,
        }
    return {
        "recent_jump_count": len(recent),
        "avg_gcs_flood": mean(j.gcs["city.flood_risk_minimization"] for j in recent),
        "max_rml_level": max(j.rml_level for j in recent),
        "rollback_rate": sum(j.rolled_back for j in recent) / len(recent),
    }
Other temporal patterns:
- Time-of-day / day-of-week embeddings
- Short vs long horizon windows (e.g. 1h, 24h, 7d)
- Simple trend estimators over GCS trajectories
Pattern 2: Graph features
Actor-interaction graph from the ledger:
import networkx as nx
def build_actor_graph(jumps):
G = nx.DiGraph()
for j in jumps:
for target in j.affected_actors:
G.add_edge(j.actor, target)
return G
def actor_graph_features(G, actor_id):
pr = nx.pagerank(G)
return {
"actor_degree": G.degree(actor_id),
"actor_pagerank": pr.get(actor_id, 0.0),
}
Use cases:
- Identify structurally important actors
- Detect “bottleneck” services or controllers
- Inform risk scoring and rollout strategies
Pattern 3: Multi-modal integration
Combine semantic units from different streams:
def integrate_multimodal_obs(obs):
flood_emb = encode_semantic(obs["flood_risk_state"])
traffic_emb = encode_semantic(obs["traffic_state"])
weather_emb = encode_semantic(obs["weather_forecast"])
# Cross-attention or simple concatenation
combined = cross_attention(flood_emb, [traffic_emb, weather_emb])
return combined
Pattern 4: Goal-aware features
Derive features directly from goal / GCS history:
from statistics import stdev as std

def gcs_features(jump, goal, ledger, window="30d"):
    # compute_trend and percentile_rank are illustrative helpers
    # (e.g. a linear-fit slope and an empirical percentile rank).
    hist = ledger.get_gcs_trajectory(goal=goal, window=window)
    if not hist:
        return {
            f"{goal}_trend": 0.0,
            f"{goal}_volatility": 0.0,
            f"{goal}_percentile": 0.5,
        }
    return {
        f"{goal}_trend": compute_trend(hist),
        f"{goal}_volatility": std(hist),
        f"{goal}_percentile": percentile_rank(jump.gcs[goal], hist),
    }
Best practices:
- Document feature definitions in SIS schema (types + semantics)
- Version feature extractors just like models
- Test feature stability under drift (same inputs → same features); see the sketch after this list
- Audit fairness per feature group; avoid leaking protected attributes via proxies
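A minimal stability test, assuming extractors are pure functions of (jump, ledger):
def test_feature_stability(extractor, jump, ledger):
    # Same inputs must yield the same features, run to run; this guards
    # against hidden randomness or wall-clock dependence in extractors.
    first = extractor(jump, ledger)
    second = extractor(jump, ledger)
    assert first == second, "feature extractor is not deterministic"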
4. From effect ledger to dataset: an ETL sketch
Once the task is defined, the pipeline is conceptually simple:
Effect ledger → Extract → Join → Label → Filter → Train/Eval splits
4.1 Extraction
A non-normative Python-flavoured sketch:
from sic_ledger import JumpLog, EffectLog, MetricsLog
jumps = JumpLog.scan(start, end,
filters={"service": "city-orchestrator",
"conformance_class": "L2"})
effects = EffectLog.scan(start, end)
metrics = MetricsLog.scan(start, end)
4.2 Join and label
from datetime import timedelta
def _safe_semantic_payload(obs, unit_key: str) -> dict:
"""
Non-normative helper:
- supports obs.semantic as dict-like
- returns {} on missing
"""
sem = getattr(obs, "semantic", {}) or {}
unit = sem.get(unit_key)
if unit is None:
return {}
payload = getattr(unit, "payload", None)
return payload or {}
def build_training_rows(jumps, effects, metrics, horizon_hours=6):
rows = []
# Non-normative: assume we can index effects by jump_id for the “Join” step
effects_by_jump = effects.index_by("jump_id") # illustrative helper
for j in jumps:
obs = j.observation # structured [OBS] (pre-jump)
ctx = j.context # [ID], risk_profile, policy_version, etc. (pre-jump)
from_time = j.timestamp
to_time = j.timestamp + timedelta(hours=horizon_hours)
# Label: realized future GCS for the executed jump (supervised target)
gcs_future = metrics.aggregate_gcs(
goal="city.flood_risk_minimization",
from_time=from_time,
to_time=to_time,
conditioned_on_jump=j.jump_id,
)
# Join: pull effect-level facts for this jump (still “pre-label” features)
effs = effects_by_jump.get(j.jump_id, [])
effect_types = sorted({e.type for e in effs}) if effs else []
flood_payload = _safe_semantic_payload(obs, "flood_risk_state")
hospital_payload = _safe_semantic_payload(obs, "hospital_state")
features = {
# OBS/SIM-derived (pre-jump)
"sector_risk_score": flood_payload.get("risk_score"),
"hospital_load_index": hospital_payload.get("load_index"),
# Context / governance (pre-jump)
"policy_version": ctx.ethics_policy_version,
"compression_mode": ctx.compression_mode,
# Effect-ledger-derived (joined)
"num_effects": len(effs),
"effect_types": effect_types,
}
rows.append({
"jump_id": j.jump_id,
"features": features,
"label_gcs_goal": gcs_future, # scalar label for the specified goal
})
return rows
4.3 Filtering and splits
Before splitting into train/validation/test, apply SI-native filters:
- Drop jumps where OCR (i.e., “[OBS] coverage ratio”) is below a threshold (e.g., OCR < 0.95).
- Note: OCR is data/observation quality; SCover is trace/structure coverage.
- Drop episodes with known measurement errors.
- Exclude periods under chaos testing (if you don’t want those patterns).
- Respect data governance policies (jurisdictions, consent, etc.).
Then split by time and population, not random rows only:
- Time-based splits to avoid leakage from future to past.
- Population-based splits (sectors, cohorts) for fairness eval.
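A minimal sketch of the split logic, assuming each row carries a timestamp and a sector alongside the fields produced by build_training_rows; the holdout sector below is purely illustrative:
def split_rows(rows, train_end, valid_end, holdout_sectors=frozenset({"sector_07"})):
    # Time-based train/valid/test split, plus a population holdout for fairness eval.
    train, valid, test, fairness_holdout = [], [], [], []
    for row in rows:
        if row["sector"] in holdout_sectors:
            fairness_holdout.append(row)      # never used for fitting
        elif row["timestamp"] < train_end:
            train.append(row)
        elif row["timestamp"] < valid_end:
            valid.append(row)
        else:
            test.append(row)                  # strictly later than training data
    return train, valid, test, fairness_holdout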
4.4 Online and continual learning
Batch vs online learning in SI-Core:
Batch learning (current assumption)
- Periodic retraining (e.g. weekly / monthly)
- Full effect-ledger scans
- Stable but slower to adapt
Online learning
- Continuous updates from incoming jumps
- Much lower latency adaptation
- But requires explicit stability + governance rules
Patterns for online learning:
@stream_processor
def extract_features_online(jump_stream):
for jump in jump_stream:
features = extract(jump.obs, jump.context)
label = compute_label(jump, horizon="6h")
yield (features, label)
Streaming feature extraction
- Ingest new jumps as they arrive
- Emit feature/label pairs into an online buffer
Incremental model updates
- Mini-batch updates over sliding windows (e.g. last 7d)
- Exponential decay for older data
- Hard stability constraints (e.g. max Δ per update); see the sketch after this list
Online evaluation
- Rolling-window metrics
- Detect distribution shift early
- Automatic rollback of model versions on degradation
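A non-normative sketch combining the bounded-update and auto-rollback ideas above; model, metrics, and ledger expose illustrative interfaces, and weights are assumed to be plain numpy arrays:
import numpy as np

def apply_incremental_update(model, proposed_weights, metrics, ledger, cfg):
    # Bound each online update (hard stability constraint) and refuse it
    # outright if live stability floors are already breached.
    delta = proposed_weights - model.weights
    max_step = cfg["max_delta_per_update"]
    norm = np.linalg.norm(delta)
    if norm > max_step:
        delta = delta * (max_step / norm)
    if metrics.cas() < cfg["cas_floor"] or metrics.eai() < cfg["eai_floor"]:
        return model                              # auto-rollback: keep current version
    updated = model.with_weights(model.weights + delta)
    ledger.record_model_version(updated)          # emit a [MEM] record for the change
    return updated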
Continual learning challenges:
Catastrophic forgetting
- Mitigate via replay buffers
- Elastic Weight Consolidation (EWC) and similar techniques
- A “frozen core + plastic head” structure (freeze shared weights, keep a small adaptable head)
Concept drift detection
- Monitor GCS distributions for key goals
- Track EAI / CAS stability alongside classic ML metrics
- Trigger retraining or rollbacks when drift > threshold
Budget constraints
- Compute cost per update
- Memory for replay buffers
- Latency budget for critical jumps
Integration with PLB:
PLB can propose online-learning schedules and parameter bounds
Online updates must:
- Respect self-modification budgets
- Stay within agreed risk tiers (e.g. only for non-critical models)
- Emit [MEM] records for every model-version change
Example configuration (non-normative):
online_learning:
enabled: true
update_frequency: hourly
window_size: 7d
max_delta_per_update: 0.05
stability_checks:
- cas_floor: 0.98
- eai_floor: 0.95
- rir_floor: 0.95
auto_rollback: true
governance:
# Conformance class gating (L1/L2/L3)
allow_conformance_classes: ["L1", "L2"]
# Deployment-zone gating (domain/risk zones, not conformance classes)
allow_zones: ["analytics", "routing"]
forbid_zones: ["safety_critical"]
5. SI-native evaluation: metrics that matter
Once you have a candidate model, you can evaluate it at two levels:
- Model-level metrics (precision, recall, calibration, AUROC, etc.).
- System-level metrics (SI-Core metrics under the new model).
5.1 Model-level metrics (still useful, but not enough)
Standard metrics are still valuable:
- Predictive accuracy for labels (GCS, incidents, ethics outcomes).
- Calibration (does predicted risk match empirical frequencies?).
- Fairness metrics (per group error rates, equalized odds, etc.).
But by themselves they don’t tell you whether the system behaves better.
5.2 System-level metrics: evaluation in-situ
For SI-Core, the primary question is:
“If we ship this model, what happens to CAS, EAI, RBL, RIR, SCover, OCR, ACR…?”
This is where sandbox replay and shadow deployment come in.
Typical workflow:
Sandbox replay
Re-run historical periods with the new model in a sandboxed SI-NOS.
Compare metrics vs baseline:
- CAS stability, SCI incidents.
- EAI/EOH for effectful jumps.
- RBL/RIR on simulated failures.
Shadow deployment
Run the new model alongside the old one, but only log its suggestions.
Measure:
- How often do decisions diverge?
- Would divergences have changed GCS for key goals?
- Any ethics overlay disagreements?
Metric deltas
Produce a simple dashboard:
Metric      | Baseline | Candidate | Delta
----------- | -------- | --------- | ------
CAS_p95     | 0.992    | 0.991     | -0.001
EAI_effects | 0.982    | 0.987     | +0.005
RBL_p95_ms  | 430      | 445       | +15
RIR         | 0.962    | 0.959     | -0.003
SCI_24h     | 3        | 3         | 0

Decision rule
- Non-regressive thresholds for safety metrics (EAI, RIR, ACR).
- Acceptable bands for performance metrics (CAS, RBL, latency).
- Explicit trade-off policy when metrics move in opposite directions.
The key difference from traditional ML is that deployment is gated by system metrics, not just model metrics.
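A non-normative sketch of such a gate, fed by baseline/candidate metric snapshots from a sandbox or shadow run; the thresholds and metric keys below are illustrative, not agreed policy:
NON_REGRESSIVE = {"EAI_effects": 0.0, "RIR": -0.005, "ACR": 0.0}   # minimum allowed delta
MAX_CAS_DROP = 0.002                                               # acceptable band
MAX_RBL_INCREASE_MS = 25                                           # acceptable band

def gate_candidate(baseline, candidate):
    # Safety metrics must not regress beyond their floors; performance
    # metrics must stay within their agreed bands.
    for metric, min_delta in NON_REGRESSIVE.items():
        if candidate[metric] - baseline[metric] < min_delta:
            return False
    if candidate["CAS_p95"] - baseline["CAS_p95"] < -MAX_CAS_DROP:
        return False
    if candidate["RBL_p95_ms"] - baseline["RBL_p95_ms"] > MAX_RBL_INCREASE_MS:
        return False
    return True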
5.3 Distributed learning and privacy guarantees
Why distributed learning in SI-Core:
Multi-organization networks:
- Each org has its own SI-Core + effect ledger
- Raw data cannot move freely (privacy, competition)
- But everyone benefits from shared patterns
Cross-jurisdiction constraints:
- GDPR, data residency, constitutional data rules
- Some goals require local-only training
- Others allow federated, DP-protected aggregation
Federated learning pattern (non-normative):
Local training
- Each org trains on its own effect ledger
- Computes local model updates (gradients or weights)
- Observes local SI metrics (EAI, CAS, RIR)
Secure aggregation
- A coordinator collects encrypted updates
- Uses secure aggregation / MPC to combine
- Never sees individual raw updates
Global model distribution
- Aggregated model broadcast back to participants
- Each org validates against local goals + metrics
Governance & scope
Constitutional limits on:
- Which tasks may be federated
- Maximum privacy budget per subject
[MEM] keeps audit trails of each aggregation round
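As a stand-in for the aggregation step, a plain federated-averaging sketch shows the data flow; in a real deployment, secure aggregation or MPC would replace the clear-text summation so the coordinator never sees individual updates. The minimum-participant rule mirrors the example configuration further below.
import numpy as np

def federated_average(local_updates, min_participants=3):
    # local_updates: list of (weights_array, num_examples) pairs, one per org.
    # Plain FedAvg shown for clarity only; secure aggregation would replace
    # the clear-text weighted sum below.
    if len(local_updates) < min_participants:
        raise ValueError("not enough participants for this aggregation round")
    total_examples = sum(n for _, n in local_updates)
    return sum(w * (n / total_examples) for w, n in local_updates)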
Differential privacy integration:
# Non-normative sketch: DP-SGD wiring varies by library.
# Key idea: clip gradients + add noise, while tracking (ε, δ) with an accountant.
from dp_sgd import make_dp_optimizer, PrivacyAccountant
accountant = PrivacyAccountant(delta=1e-5)
optimizer = make_dp_optimizer(
learning_rate=0.01,
l2_norm_clip=1.0,
noise_multiplier=1.1, # privacy–utility tradeoff
accountant=accountant,
)
# ... training loop ...
# epsilon_spent = accountant.epsilon(steps=total_steps)
Privacy accounting:
Track (ε, δ) per training run and per model
Compose privacy budgets over multiple updates
Store in model-lineage metadata:
- dp_epsilon, dp_delta, num_compositions
- legal basis / jurisdiction tags
Challenges:
- Privacy vs utility trade-off
- Heterogeneous data distributions across orgs
- Communication cost and stragglers
- Potentially Byzantine participants (need robust aggregation)
Example multi-org configuration:
federated_learning:
participants: [city_A, city_B, city_C]
coordinator: regional_hub
privacy:
method: differential_privacy
epsilon: 2.0
delta: 1e-5
aggregation:
method: secure_aggregation
min_participants: 3
governance:
constitutional_scope: SI-CONST/v1
audit_frequency: monthly
allow_tasks:
- flood_risk_forecasting
- traffic_flow_prediction
forbid_tasks:
- individual_credit_scoring
6. Redaction, lineage, and audit
On SI-Core, learning data is not a separate universe. It is embedded in the same [MEM] and redaction fabric as everything else.
6.1 Lineage: from model back to ledger
Every training job should leave a lineage record:
model_lineage:
model_id: flood_risk_predictor_v1
training_run_id: 2028-04-01T12:00Z
source_ledgers:
- jumps: ledger://jump/city-orchestrator/2028-01-01..2028-03-31
- effects: ledger://effect/city-orchestrator/2028-01-01..2028-03-31
- metrics: ledger://metrics/city-orchestrator/2028-01-01..2028-03-31
- redactions: ledger://redaction/city-orchestrator/2028-01-01..2028-03-31
selection_criteria:
- conformance_class in ["L2", "L3"]
- jurisdiction == "EU"
- goal == "city.flood_risk_minimization"
redactions_applied:
- type: pii_tokenization
- type: dsr_erasure
subjects: ["user:1234", "user:5678"]
artifact_hashes:
- dataset_hash: sha256:...
- code_hash: sha256:...
With this in place, you can answer audit questions like:
- “Show me all models trained on data that involved subject X.”
- “After we executed DSR-042 for user Y, which training runs were affected, and how did we remediate?”
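Answering the first question then becomes a scan over lineage records rather than a forensic exercise. A minimal sketch, assuming lineage records shaped like the YAML above and an illustrative subject_appears_in lookup on the redaction ledger:
def models_involving_subject(lineage_records, redaction_ledger, subject_id):
    # Return model_ids whose source ledgers contain jumps/effects
    # attributed to the given subject.
    affected = []
    for rec in lineage_records:
        ledger_refs = [
            ref
            for source in rec["source_ledgers"]   # e.g. {"jumps": "ledger://jump/..."}
            for ref in source.values()
        ]
        if redaction_ledger.subject_appears_in(subject_id, ledger_refs):
            affected.append(rec["model_id"])
    return affected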
6.2 Redaction and retraining
When GDPR Ethical Redaction kicks in, SI-Core already:
- Maintains a redaction ledger for DSRs and deletions.
- Knows which jumps / effects / SIM/SIS entries were impacted.
For training pipelines, you typically need to:
Identify all training runs where the redacted data contributed.
For each affected model, decide on a remediation strategy:
- Full retrain without the redacted data.
- Incremental unlearning (if supported).
- Model retirement.
Emit a remediation record:
remediation:
  dsr_id: DSR-2028-04-15-042
  model_id: flood_risk_predictor_v1
  action: retrain
  new_training_run: 2028-04-20T09:00Z
  verified_by: governance_team
Because training sets are built from hash-chained ledgers, you never need to guess “did this user’s data slip in somewhere?”. You have explicit references.
6.3 Model drift detection and retraining triggers
Why drift matters in SI-Core:
The world changes:
- New patterns (climate, traffic, behavior)
- New sensors and actuators
The system changes:
- New goals, ethics policies, compression schemes
If models do not adapt, GCS and SI metrics degrade even while conventional ML loss still looks fine.
Types of drift:
Covariate shift
- P(X) changes, P(Y|X) is stable
- Ex: new sensor ranges, different traffic regimes
Concept drift
- P(Y|X) changes
- Ex: flood-control actions have different outcomes under new infrastructure
Label / goal drift
- P(Y) or goal priorities change
- Ex: policy decides to weight fairness higher than efficiency
Detection strategies:
Statistical tests on features
from scipy.stats import ks_2samp

statistic, p_value = ks_2samp(train_features, production_features)
if p_value < 0.01:
    alert("Covariate shift detected")

Performance monitoring
- Rolling-window prediction error
- Compare to baseline and SLOs
SI-Core metric monitoring
Track EAI, CAS, RIR, SCover per model or decision path
Flag when:
- EAI drops below floor
- CAS becomes unstable
- RIR degrades after model rollout
Retraining triggers:
Automatic (non-critical tasks only):
- Performance below threshold for N consecutive days
- SI metrics degrade by > X% and correlated with model usage
- Feature distribution shift larger than Y standard deviations
Governed / manual (safety-critical):
- Policy / goal changes
- Constitutional amendments
- Major incidents or post-mortems
Retraining strategies:
Full retrain
- Concept drift or strong policy changes
- Use latest effect-ledger window (e.g. 90 days)
Incremental update
- Minor covariate shifts
- Append data, limited epochs, strong regularization
Ensemble update
- Keep old model + add new one
- Weight by recency / performance
- Smooth transition over time
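To tie the triggers and strategies together, a non-normative dispatcher sketch; the signal names and thresholds are illustrative and mirror the example configuration below:
def choose_retraining_action(signals, safety_critical=False):
    # signals: rolling prediction_error, si_metric_drop_pct, feature_drift_pvalue
    drifted = (
        signals["prediction_error"] > 0.15
        or signals["si_metric_drop_pct"] > 5.0
        or signals["feature_drift_pvalue"] < 0.01
    )
    if not drifted:
        return "no_action"
    if safety_critical:
        return "escalate_to_governance"      # governed / manual path only
    if signals["prediction_error"] <= 0.15:
        return "incremental_update"          # minor covariate shift
    return "full_retrain"                    # concept drift or strong degradation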
Example monitoring configuration:
drift_detection:
enabled: true
check_frequency: daily
triggers:
- type: performance_degradation
metric: prediction_error
threshold: 0.15
window: 7d
- type: si_metrics_degradation
metrics: [EAI, CAS]
threshold_pct: 5
- type: feature_drift
method: ks_test
p_value: 0.01
retraining:
strategy: auto_trigger
approval_required: false # L1/L2 only
notify:
- governance_team
- ml_owners
7. CI/CD for models in SI-Core
Putting it all together, a typical CI/CD pipeline for a new model version looks like this.
7.1 Build phase
Data contract checks
- Verify feature schemas against SIM/SIS.
- Ensure no disallowed fields (e.g. sensitive attributes in some jurisdictions).
- Check that lineage metadata is complete.
Training + basic eval
- Train on effect-ledger-derived dataset.
- Compute standard metrics (accuracy, calibration, fairness).
- Run SIL/SIR-related property tests if the model is embedded in SIL code.
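A minimal sketch of the data-contract step; the disallowed-field list and schema lookup are illustrative, and a real check would reference SIM/SIS schemas and jurisdiction policy directly:
DISALLOWED_FIELDS = {"user_email", "precise_home_location"}   # jurisdiction-specific example

def check_data_contract(feature_spec, sis_sources, lineage):
    # Fail the build if a feature is disallowed, missing from SIS,
    # or the lineage record is incomplete.
    errors = []
    for feature in feature_spec:
        if feature["name"] in DISALLOWED_FIELDS:
            errors.append(f"disallowed field: {feature['name']}")
        if feature["source"] not in sis_sources:
            errors.append(f"unknown SIS source: {feature['source']}")
    for key in ("source_ledgers", "selection_criteria", "artifact_hashes"):
        if key not in lineage:
            errors.append(f"missing lineage field: {key}")
    return errors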
7.2 Validation phase
Sandbox replay
- Run candidate model in sandboxed SI-NOS.
- Collect SI metrics deltas vs baseline.
PLB safety checks (if PLB generates or parameterizes the model)
- Ensure PLB’s self-modification budgets are respected (see art-60-020).
- Check proposal acceptance/revert rates under controlled scenarios.
Redaction simulation
- Simulate DSRs for sample subjects.
- Verify lineage & remediation pipeline works end-to-end.
7.3 Deployment phase
Shadow deployment
- Deploy candidate in “observe-only” mode.
- Compare live suggestions vs current production.
- Monitor EAI, CAS, SCI, RBL, RIR deltas.
Canary rollout
- Enable the model for a small fraction of traffic / sectors.
- Watch for metric anomalies and ethics overlay disagreements.
Full rollout + PLB integration
- Once stable, allow PLB to use the new model as one of its tools.
- Continue monitoring PLB meta-metrics (proposal quality, revert rate).
At every step, the pipeline is anchored in SI-Core structure:
- Data from effect ledgers and SIM/SIS.
- Labels from GCS, RML, EthicsTrace.
- Success measured in SI metrics, not just ML metrics.
8. Summary
On a conventional stack, “learning from experience” is mostly an ETL art project on top of messy logs.
On an SI-Core stack:
- Experience is already structured, goal-tagged, and auditable in the effect ledger.
- Goals and ethics are first-class, via GCS and EthicsTrace.
- Metrics like CAS, EAI, RBL, RIR, SCover, ACR give you system-level feedback.
This document sketched how to:
- Design goal-aware training tasks using SI-native concepts.
- Extract datasets directly from jumps, effects, metrics.
- Evaluate models by their impact on SI-Core metrics, not just loss values.
- Maintain lineage and redaction as part of the same [MEM] fabric.
- Build CI/CD pipelines where models are just one more governed component.
The slogan version:
Don’t train on random logs. Train on your effect ledger.
That is how learning stays aligned with the goals and guarantees of SI-Core.