From Effect Ledger to Goal-Aware Training Data
How SI-Core turns runtime experience into safer models
Draft v0.1 — Non-normative supplement to SI-Core / SI-NOS / PLB / GDPR Ethical Redaction
This document is non-normative. It explains how to use SI-Core’s structured logs (jumps, effects, GCS, EthicsTrace, metrics) to build goal-aware learning pipelines. Normative contracts still live in the SI-Core / SI-NOS specs, evaluation packs, and GDPR Ethical Redaction guides.
1. Why “learning” looks different on SI-Core
Most current ML pipelines look like this:
Raw logs → Ad-hoc ETL → Training set → Model → Deployed somewhere
The problems are familiar:
Goals are implicit (buried in loss functions or business docs).
Context is weak (which system, which actor, which risk level?).
Ethics and governance are bolted on after the fact.
When something goes wrong, you can’t answer:
- “Which experiences did this model actually learn from?”
- “What did we forget when we redacted user X?”
On an SI-Core stack, we have a very different starting point:
World → [OBS] → Jump → Effect Ledger + Metrics → PLB / Training
Each jump already carries:
- [ID] — Who/what initiated this decision path.
- [OBS] — The structured observation used.
- [ETH] / EthicsTrace — Which ethics policy, what decision.
- [EVAL] — Risk profile, sandbox runs, GCS estimates.
- [MEM] — Hash-chained effect ledger entries, RML level, rollback traces.
In other words, runtime experience is already structured, goal-tagged, and auditable.
This document shows how to turn that structure into goal-aware training sets and SI-native evaluation pipelines.
2. Three layers of learning data
It helps to think of SI-native learning data in three layers.
2.1 Event-level: Jumps and effects
The atomic unit is a jump plus its effect ledger entries:
Jump metadata:
- jump_id, timestamp, service, conformance_class, rml_level.
- [ID] (actor, role, origin), [OBS] (observation_id, coverage).
- Proposed vs executed actions (LLM wrapper / SIL / tools).
Effect ledger entries:
- effect_id, type (write, API call, external side-effect), compensator info (for RML-2/3).
This level feeds models that answer questions like:
- “Given this observation, what action tends to lead to good GCS later?”
- “What is the probability this jump will need a rollback?”
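To make the event-level unit concrete, here is a hypothetical record sketch. The field names follow the lists above, but the exact shapes and identifiers are illustrative only.
# Hypothetical event-level record: one jump plus its effect ledger entries,
# flattened into a plain dict for downstream feature extraction.
jump_record = {
    "jump_id": "jump:city-orchestrator:2028-03-14T10:15:00Z:42",
    "timestamp": "2028-03-14T10:15:00Z",
    "service": "city-orchestrator",
    "conformance_class": "L2",
    "rml_level": 2,
    "id": {"actor": "flood_controller", "role": "service", "origin": "si-nos"},
    "obs": {"observation_id": "obs:7781", "coverage": 0.97},
    "eth": {"policy_version": "eth-policy/v3", "decision": "allow"},
    "effects": [
        {"effect_id": "eff:9001", "type": "write", "compensator": "comp:9001"},
        {"effect_id": "eff:9002", "type": "api_call", "compensator": None},
    ],
}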
2.2 Episode-level: Failures, rollbacks, incidents
Episodes connect sequences of jumps and effects into stories:
Incident traces (from 60-010 / failure docs).
Rollback events and RML behavior:
- Which jumps were rolled back fully, rolled back only partially, or failed to roll back.
Metrics around the incident window (CAS, EAI, RBL, RIR, SCI).
This level feeds models like:
- Early warning predictors (“this pattern of jumps tends to precede incidents”).
- Root-cause helpers for PLB (“these patterns correlate with rollbacks in sector 12”).
2.3 Aggregate-level: GCS, EthicsTrace, metrics
At the top level we have goal-aware and governance-aware aggregates:
GCS vectors per jump / per action.
EthicsTrace decisions and rationales.
SI-Core metrics snapshots (CAS, SCI, SCover, OCR, EAI, ACR, RBL, RIR, EOH).
- OCR (Observation Coverage Ratio): what “[OBS] coverage” refers to.
  Suggested definition: OCR = observed_required_units / total_required_units (range 0.0..1.0).
- SCover (Structural Coverage): share of SIR blocks traced.
- ACR (Audit Chain Completeness): how complete the audit chain is for the evaluated slice.
This level feeds models that answer questions like:
- “Given current state, which policy knob setting leads to higher EAI without hurting CAS?”
- “Which semantic compression settings keep ε low for this goal?”
We will treat these three layers as feature sources and labels for training.
3. Designing a goal-aware training task
Before extracting any data, decide what you want the model to learn in SI-native terms.
3.1 Choose the goal and horizon
Examples:
- city.flood_risk_minimization over the next 6 hours.
- user.fair_treatment over the next 30 days.
- system.rollback_risk_minimization over the next N jumps.
For each training task, define:
training_task:
id: flood_risk_predictor_v1
goal: city.flood_risk_minimization
prediction_horizon: 6h
subject_scope: ["sector", "canal"]
decision_context: ["flood_controller"]
3.2 Define labels in GCS / SI terms
Instead of inventing an opaque label, derive it from existing SI-Core structure.
Some examples:
- Target = future GCS for a given goal:
  y = GCS_city.flood_risk_minimization(a, t→t+6h)
- Target = rollback / incident indicator:
  y = 1 if this jump (or its descendants) triggered RML-2/3 rollback in the next 24h
- Target = ethics violation risk:
  y = probability that a similar jump would have been rejected by [ETH] overlay or appealed by a human
These labels come from effect ledger + metrics, not from ad-hoc annotation.
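For example, the rollback label can be read straight off the ledger. A minimal sketch, assuming a ledger API with the illustrative descendants_of and rollbacks_in_window helpers:
from datetime import timedelta

def rollback_label(jump, ledger, horizon_hours=24):
    # 1 if this jump or any of its descendants triggered an RML-2/3 rollback
    # within the horizon, else 0. Helper names are illustrative, not normative.
    window_end = jump.timestamp + timedelta(hours=horizon_hours)
    lineage = {jump.jump_id} | set(ledger.descendants_of(jump.jump_id, until=window_end))
    rollbacks = ledger.rollbacks_in_window(jump.timestamp, window_end, min_rml=2)
    return int(any(rb.jump_id in lineage for rb in rollbacks))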
3.3 Define feature sets structurally
Features should be derived from [OBS] + SIM/SIS + context:
- Observation features: semantic units used in [OBS] (e.g. flood_risk_state, traffic_state).
- Actor features: role, identity class (via [ID]).
- Policy state: current compression settings, ethics policy version, risk profile.
- Environmental context: time of day, season, external risk level.
Document feature provenance explicitly:
features:
- name: sector_risk_score
source: OBS.semantic.flood_risk_state.payload.risk_score
- name: hospital_load_index
source: OBS.semantic.hospital_state.payload.load_index
- name: policy_version
source: ETH.policy_version
- name: compression_mode
source: ctx.compression_mode
This makes it easy to audit which parts of the SI-Core state the model is allowed to use.
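A small, non-normative check can enforce that audit: given the feature spec above and an allowlist of permitted source prefixes (the allowlist below is illustrative), refuse to build the dataset if any feature reads outside the allowed parts of SI-Core state.
ALLOWED_SOURCE_PREFIXES = ("OBS.semantic.", "ETH.", "ctx.")  # illustrative policy

def audit_feature_sources(features):
    # Return the names of features whose declared source is not allowed.
    return [
        f["name"]
        for f in features
        if not f["source"].startswith(ALLOWED_SOURCE_PREFIXES)
    ]

violations = audit_feature_sources([
    {"name": "sector_risk_score", "source": "OBS.semantic.flood_risk_state.payload.risk_score"},
    {"name": "raw_user_email", "source": "raw_log.user.email"},  # would be flagged
])
assert violations == ["raw_user_email"]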
3.4 Advanced feature engineering patterns (non-normative)
Why feature engineering matters:
The effect ledger is rich but raw:
- Temporal dependencies across jumps
- Graph structure between actors and effects
- Multi-modal observations from sensors, logs, external feeds
Good features:
- Improve prediction accuracy
- Make models easier to explain
- Align better with goal structures and GCS
Pattern 1: Temporal features
Rolling statistics over recent jumps:
from statistics import mean

def extract_temporal_features(jump, ledger, window="24h"):
    # Recent jumps by the same actor, ending at this jump's timestamp.
    recent = ledger.get_jumps(
        end=jump.timestamp,
        window=window,
        filter={"actor": jump.actor},
    )
    if not recent:
        return {
            "recent_jump_count": 0,
            "avg_gcs_flood": 0.0,
            "max_rml_level": 0,
            "rollback_rate": 0.0,
        }
    return {
        "recent_jump_count": len(recent),
        "avg_gcs_flood": mean(j.gcs["city.flood_risk_minimization"] for j in recent),
        "max_rml_level": max(j.rml_level for j in recent),
        "rollback_rate": sum(j.rolled_back for j in recent) / len(recent),
    }
Other temporal patterns:
- Time-of-day / day-of-week embeddings
- Short vs long horizon windows (e.g. 1h, 24h, 7d)
- Simple trend estimators over GCS trajectories
Pattern 2: Graph features
Actor-interaction graph from the ledger:
import networkx as nx
def build_actor_graph(jumps):
G = nx.DiGraph()
for j in jumps:
for target in j.affected_actors:
G.add_edge(j.actor, target)
return G
def actor_graph_features(G, actor_id):
pr = nx.pagerank(G)
return {
"actor_degree": G.degree(actor_id),
"actor_pagerank": pr.get(actor_id, 0.0),
}
Use cases:
- Identify structurally important actors
- Detect “bottleneck” services or controllers
- Inform risk scoring and rollout strategies
Pattern 3: Multi-modal integration
Combine semantic units from different streams:
def integrate_multimodal_obs(obs):
flood_emb = encode_semantic(obs["flood_risk_state"])
traffic_emb = encode_semantic(obs["traffic_state"])
weather_emb = encode_semantic(obs["weather_forecast"])
# Cross-attention or simple concatenation
combined = cross_attention(flood_emb, [traffic_emb, weather_emb])
return combined
Pattern 4: Goal-aware features
Derive features directly from goal / GCS history:
from statistics import stdev as std

def gcs_features(jump, goal, ledger, window="30d"):
    # compute_trend and percentile_rank are illustrative helpers
    # (e.g. a linear-fit slope and an empirical percentile rank).
    hist = ledger.get_gcs_trajectory(goal=goal, window=window)
    if not hist:
        return {
            f"{goal}_trend": 0.0,
            f"{goal}_volatility": 0.0,
            f"{goal}_percentile": 0.5,
        }
    return {
        f"{goal}_trend": compute_trend(hist),
        f"{goal}_volatility": std(hist),
        f"{goal}_percentile": percentile_rank(jump.gcs[goal], hist),
    }
Best practices:
- Document feature definitions in SIS schema (types + semantics)
- Version feature extractors just like models
- Test feature stability under drift (same inputs → same features); see the sketch after this list
- Audit fairness per feature group; avoid leaking protected attributes via proxies
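A minimal stability test, assuming extractors are pure functions of (jump, ledger):
def test_feature_stability(extractor, jump, ledger):
    # Same inputs must yield the same features, run to run; this guards
    # against hidden randomness or wall-clock dependence in extractors.
    first = extractor(jump, ledger)
    second = extractor(jump, ledger)
    assert first == second, "feature extractor is not deterministic"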
4. From effect ledger to dataset: an ETL sketch
Once the task is defined, the pipeline is conceptually simple:
Effect ledger → Extract → Join → Label → Filter → Train/Eval splits
4.1 Extraction
A non-normative Python-flavoured sketch:
from sic_ledger import JumpLog, EffectLog, MetricsLog
jumps = JumpLog.scan(start, end,
filters={"service": "city-orchestrator",
"conformance_class": "L2"})
effects = EffectLog.scan(start, end)
metrics = MetricsLog.scan(start, end)
4.2 Join and label
from datetime import timedelta
def _safe_semantic_payload(obs, unit_key: str) -> dict:
"""
Non-normative helper:
- supports obs.semantic as dict-like
- returns {} on missing
"""
sem = getattr(obs, "semantic", {}) or {}
unit = sem.get(unit_key)
if unit is None:
return {}
payload = getattr(unit, "payload", None)
return payload or {}
def build_training_rows(jumps, effects, metrics, horizon_hours=6):
rows = []
# Non-normative: assume we can index effects by jump_id for the “Join” step
effects_by_jump = effects.index_by("jump_id") # illustrative helper
for j in jumps:
obs = j.observation # structured [OBS] (pre-jump)
ctx = j.context # [ID], risk_profile, policy_version, etc. (pre-jump)
from_time = j.timestamp
to_time = j.timestamp + timedelta(hours=horizon_hours)
# Label: realized future GCS for the executed jump (supervised target)
gcs_future = metrics.aggregate_gcs(
goal="city.flood_risk_minimization",
from_time=from_time,
to_time=to_time,
conditioned_on_jump=j.jump_id,
)
# Join: pull effect-level facts for this jump (still “pre-label” features)
effs = effects_by_jump.get(j.jump_id, [])
effect_types = sorted({e.type for e in effs}) if effs else []
flood_payload = _safe_semantic_payload(obs, "flood_risk_state")
hospital_payload = _safe_semantic_payload(obs, "hospital_state")
features = {
# OBS/SIM-derived (pre-jump)
"sector_risk_score": flood_payload.get("risk_score"),
"hospital_load_index": hospital_payload.get("load_index"),
# Context / governance (pre-jump)
"policy_version": ctx.ethics_policy_version,
"compression_mode": ctx.compression_mode,
# Effect-ledger-derived (joined)
"num_effects": len(effs),
"effect_types": effect_types,
}
rows.append({
"jump_id": j.jump_id,
"features": features,
"label_gcs_goal": gcs_future, # scalar label for the specified goal
})
return rows
4.3 Filtering and splits
Before splitting into train/validation/test, apply SI-native filters:
- Drop jumps where OCR (i.e., “[OBS] coverage ratio”) is below a threshold (e.g., OCR < 0.95).
- Note: OCR is data/observation quality; SCover is trace/structure coverage.
- Drop episodes with known measurement errors.
- Exclude periods under chaos testing (if you don’t want those patterns).
- Respect data governance policies (jurisdictions, consent, etc.).
Then split by time and population, not random rows only:
- Time-based splits to avoid leakage from future to past.
- Population-based splits (sectors, cohorts) for fairness eval.
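A minimal sketch of the split logic, assuming each row carries a timestamp and a sector alongside the fields produced by build_training_rows; the holdout sector below is purely illustrative:
def split_rows(rows, train_end, valid_end, holdout_sectors=frozenset({"sector_07"})):
    # Time-based train/valid/test split, plus a population holdout for fairness eval.
    train, valid, test, fairness_holdout = [], [], [], []
    for row in rows:
        if row["sector"] in holdout_sectors:
            fairness_holdout.append(row)      # never used for fitting
        elif row["timestamp"] < train_end:
            train.append(row)
        elif row["timestamp"] < valid_end:
            valid.append(row)
        else:
            test.append(row)                  # strictly later than training data
    return train, valid, test, fairness_holdout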
4.4 Online and continual learning
Batch vs online learning in SI-Core:
Batch learning (current assumption)
- Periodic retraining (e.g. weekly / monthly)
- Full effect-ledger scans
- Stable but slower to adapt
Online learning
- Continuous updates from incoming jumps
- Much lower latency adaptation
- But requires explicit stability + governance rules
Patterns for online learning:
@stream_processor
def extract_features_online(jump_stream):
for jump in jump_stream:
features = extract(jump.obs, jump.context)
label = compute_label(jump, horizon="6h")
yield (features, label)
Streaming feature extraction
- Ingest new jumps as they arrive
- Emit feature/label pairs into an online buffer
Incremental model updates
- Mini-batch updates over sliding windows (e.g. last 7d)
- Exponential decay for older data
- Hard stability constraints (e.g. max Δ per update); see the sketch after this list
Online evaluation
- Rolling-window metrics
- Detect distribution shift early
- Automatic rollback of model versions on degradation
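A non-normative sketch combining the bounded-update and auto-rollback ideas above; model, metrics, and ledger expose illustrative interfaces, and weights are assumed to be plain numpy arrays:
import numpy as np

def apply_incremental_update(model, proposed_weights, metrics, ledger, cfg):
    # Bound each online update (hard stability constraint) and refuse it
    # outright if live stability floors are already breached.
    delta = proposed_weights - model.weights
    max_step = cfg["max_delta_per_update"]
    norm = np.linalg.norm(delta)
    if norm > max_step:
        delta = delta * (max_step / norm)
    if metrics.cas() < cfg["cas_floor"] or metrics.eai() < cfg["eai_floor"]:
        return model                              # auto-rollback: keep current version
    updated = model.with_weights(model.weights + delta)
    ledger.record_model_version(updated)          # emit a [MEM] record for the change
    return updated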
Continual learning challenges:
Catastrophic forgetting
- Mitigate via replay buffers
- Elastic Weight Consolidation (EWC) and similar techniques
- A “frozen core + plastic head” structure (freeze shared weights, keep a small adaptable head)
Concept drift detection
- Monitor GCS distributions for key goals
- Track EAI / CAS stability alongside classic ML metrics
- Trigger retraining or rollbacks when drift > threshold
Budget constraints
- Compute cost per update
- Memory for replay buffers
- Latency budget for critical jumps
Integration with PLB:
PLB can propose online-learning schedules and parameter bounds
Online updates must:
- Respect self-modification budgets
- Stay within agreed risk tiers (e.g. only for non-critical models)
- Emit [MEM] records for every model-version change
Example configuration (non-normative):
online_learning:
enabled: true
update_frequency: hourly
window_size: 7d
max_delta_per_update: 0.05
stability_checks:
- cas_floor: 0.98
- eai_floor: 0.95
- rir_floor: 0.95
auto_rollback: true
governance:
# Conformance class gating (L1/L2/L3)
allow_conformance_classes: ["L1", "L2"]
# Deployment-zone gating (domain/risk zones, not conformance classes)
allow_zones: ["analytics", "routing"]
forbid_zones: ["safety_critical"]
5. SI-native evaluation: metrics that matter
Once you have a candidate model, you can evaluate it at two levels:
- Model-level metrics (precision, recall, calibration, AUROC, etc.).
- System-level metrics (SI-Core metrics under the new model).
5.1 Model-level metrics (still useful, but not enough)
Standard metrics are still valuable:
- Predictive accuracy for labels (GCS, incidents, ethics outcomes).
- Calibration (does predicted risk match empirical frequencies?).
- Fairness metrics (per group error rates, equalized odds, etc.).
But by themselves they don’t tell you whether the system behaves better.
5.2 System-level metrics: evaluation in-situ
For SI-Core, the primary question is:
“If we ship this model, what happens to CAS, EAI, RBL, RIR, SCover, OCR, ACR…?”
This is where sandbox replay and shadow deployment come in.
Typical workflow:
Sandbox replay
Re-run historical periods with the new model in a sandboxed SI-NOS.
Compare metrics vs baseline:
- CAS stability, SCI incidents.
- EAI/EOH for effectful jumps.
- RBL/RIR on simulated failures.
Shadow deployment
Run the new model alongside the old one, but only log its suggestions.
Measure:
- How often do decisions diverge?
- Would divergences have changed GCS for key goals?
- Any ethics overlay disagreements?
Metric deltas
Produce a simple dashboard:
Metric      | Baseline | Candidate | Delta
----------- | -------- | --------- | ------
CAS_p95     | 0.992    | 0.991     | -0.001
EAI_effects | 0.982    | 0.987     | +0.005
RBL_p95_ms  | 430      | 445       | +15
RIR         | 0.962    | 0.959     | -0.003
SCI_24h     | 3        | 3         | 0

Decision rule
- Non-regressive thresholds for safety metrics (EAI, RIR, ACR).
- Acceptable bands for performance metrics (CAS, RBL, latency).
- Explicit trade-off policy when metrics move in opposite directions.
The key difference from traditional ML is that deployment is gated by system metrics, not just model metrics.
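A non-normative sketch of such a gate, fed by baseline/candidate metric snapshots from a sandbox or shadow run; the thresholds and metric keys below are illustrative, not agreed policy:
NON_REGRESSIVE = {"EAI_effects": 0.0, "RIR": -0.005, "ACR": 0.0}   # minimum allowed delta
MAX_CAS_DROP = 0.002                                               # acceptable band
MAX_RBL_INCREASE_MS = 25                                           # acceptable band

def gate_candidate(baseline, candidate):
    # Safety metrics must not regress beyond their floors; performance
    # metrics must stay within their agreed bands.
    for metric, min_delta in NON_REGRESSIVE.items():
        if candidate[metric] - baseline[metric] < min_delta:
            return False
    if candidate["CAS_p95"] - baseline["CAS_p95"] < -MAX_CAS_DROP:
        return False
    if candidate["RBL_p95_ms"] - baseline["RBL_p95_ms"] > MAX_RBL_INCREASE_MS:
        return False
    return True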
5.3 Distributed learning and privacy guarantees
Why distributed learning in SI-Core:
Multi-organization networks:
- Each org has its own SI-Core + effect ledger
- Raw data cannot move freely (privacy, competition)
- But everyone benefits from shared patterns
Cross-jurisdiction constraints:
- GDPR, data residency, constitutional data rules
- Some goals require local-only training
- Others allow federated, DP-protected aggregation
Federated learning pattern (non-normative):
Local training
- Each org trains on its own effect ledger
- Computes local model updates (gradients or weights)
- Observes local SI metrics (EAI, CAS, RIR)
Secure aggregation
- A coordinator collects encrypted updates
- Uses secure aggregation / MPC to combine
- Never sees individual raw updates
Global model distribution
- Aggregated model broadcast back to participants
- Each org validates against local goals + metrics
Governance & scope
Constitutional limits on:
- Which tasks may be federated
- Maximum privacy budget per subject
[MEM] keeps audit trails of each aggregation round
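As a stand-in for the aggregation step, a plain federated-averaging sketch shows the data flow; in a real deployment, secure aggregation or MPC would replace the clear-text summation so the coordinator never sees individual updates. The minimum-participant rule mirrors the example configuration further below.
import numpy as np

def federated_average(local_updates, min_participants=3):
    # local_updates: list of (weights_array, num_examples) pairs, one per org.
    # Plain FedAvg shown for clarity only; secure aggregation would replace
    # the clear-text weighted sum below.
    if len(local_updates) < min_participants:
        raise ValueError("not enough participants for this aggregation round")
    total_examples = sum(n for _, n in local_updates)
    return sum(w * (n / total_examples) for w, n in local_updates)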
Differential privacy integration:
# Non-normative sketch: DP-SGD wiring varies by library.
# Key idea: clip gradients + add noise, while tracking (ε, δ) with an accountant.
from dp_sgd import make_dp_optimizer, PrivacyAccountant
accountant = PrivacyAccountant(delta=1e-5)
optimizer = make_dp_optimizer(
learning_rate=0.01,
l2_norm_clip=1.0,
noise_multiplier=1.1, # privacy–utility tradeoff
accountant=accountant,
)
# ... training loop ...
# epsilon_spent = accountant.epsilon(steps=total_steps)
Privacy accounting:
Track (ε, δ) per training run and per model
Compose privacy budgets over multiple updates
Store in model-lineage metadata:
- dp_epsilon, dp_delta, num_compositions
- legal basis / jurisdiction tags
Challenges:
- Privacy vs utility trade-off
- Heterogeneous data distributions across orgs
- Communication cost and stragglers
- Potentially Byzantine participants (need robust aggregation)
Example multi-org configuration:
federated_learning:
participants: [city_A, city_B, city_C]
coordinator: regional_hub
privacy:
method: differential_privacy
epsilon: 2.0
delta: 1e-5
aggregation:
method: secure_aggregation
min_participants: 3
governance:
constitutional_scope: SI-CONST/v1
audit_frequency: monthly
allow_tasks:
- flood_risk_forecasting
- traffic_flow_prediction
forbid_tasks:
- individual_credit_scoring
6. Redaction, lineage, and audit
On SI-Core, learning data is not a separate universe. It is embedded in the same [MEM] and redaction fabric as everything else.
6.1 Lineage: from model back to ledger
Every training job should leave a lineage record:
model_lineage:
model_id: flood_risk_predictor_v1
training_run_id: 2028-04-01T12:00Z
source_ledgers:
- jumps: ledger://jump/city-orchestrator/2028-01-01..2028-03-31
- effects: ledger://effect/city-orchestrator/2028-01-01..2028-03-31
- metrics: ledger://metrics/city-orchestrator/2028-01-01..2028-03-31
- redactions: ledger://redaction/city-orchestrator/2028-01-01..2028-03-31
selection_criteria:
- conformance_class in ["L2", "L3"]
- jurisdiction == "EU"
- goal == "city.flood_risk_minimization"
redactions_applied:
- type: pii_tokenization
- type: dsr_erasure
subjects: ["user:1234", "user:5678"]
artifact_hashes:
- dataset_hash: sha256:...
- code_hash: sha256:...
With this in place, you can answer audit questions like:
- “Show me all models trained on data that involved subject X.”
- “After we executed DSR-042 for user Y, which training runs were affected, and how did we remediate?”
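Answering the first question then becomes a scan over lineage records rather than a forensic exercise. A minimal sketch, assuming lineage records shaped like the YAML above and an illustrative subject_appears_in lookup on the redaction ledger:
def models_involving_subject(lineage_records, redaction_ledger, subject_id):
    # Return model_ids whose source ledgers contain jumps/effects
    # attributed to the given subject.
    affected = []
    for rec in lineage_records:
        ledger_refs = [
            ref
            for source in rec["source_ledgers"]   # e.g. {"jumps": "ledger://jump/..."}
            for ref in source.values()
        ]
        if redaction_ledger.subject_appears_in(subject_id, ledger_refs):
            affected.append(rec["model_id"])
    return affected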
6.2 Redaction and retraining
When GDPR Ethical Redaction kicks in, SI-Core already:
- Maintains a redaction ledger for DSRs and deletions.
- Knows which jumps / effects / SIM/SIS entries were impacted.
For training pipelines, you typically need to:
Identify all training runs where the redacted data contributed.
For each affected model, decide on a remediation strategy:
- Full retrain without the redacted data.
- Incremental unlearning (if supported).
- Model retirement.
Emit a remediation record:
remediation:
  dsr_id: DSR-2028-04-15-042
  model_id: flood_risk_predictor_v1
  action: retrain
  new_training_run: 2028-04-20T09:00Z
  verified_by: governance_team
Because training sets are built from hash-chained ledgers, you never need to guess “did this user’s data slip in somewhere?”. You have explicit references.
6.3 Model drift detection and retraining triggers
Why drift matters in SI-Core:
The world changes:
- New patterns (climate, traffic, behavior)
- New sensors and actuators
The system changes:
- New goals, ethics policies, compression schemes
If models do not adapt, GCS and SI metrics degrade even while conventional ML loss still looks fine.
Types of drift:
Covariate shift
- P(X) changes, P(Y|X) is stable
- Ex: new sensor ranges, different traffic regimes
Concept drift
- P(Y|X) changes
- Ex: flood-control actions have different outcomes under new infrastructure
Label / goal drift
- P(Y) or goal priorities change
- Ex: policy decides to weight fairness higher than efficiency
Detection strategies:
Statistical tests on features
from scipy.stats import ks_2samp

statistic, p_value = ks_2samp(train_features, production_features)
if p_value < 0.01:
    alert("Covariate shift detected")

Performance monitoring
- Rolling-window prediction error
- Compare to baseline and SLOs
SI-Core metric monitoring
Track EAI, CAS, RIR, SCover per model or decision path
Flag when:
- EAI drops below floor
- CAS becomes unstable
- RIR degrades after model rollout
Retraining triggers:
Automatic (non-critical tasks only):
- Performance below threshold for N consecutive days
- SI metrics degrade by > X% and correlated with model usage
- Feature distribution shift larger than Y standard deviations
Governed / manual (safety-critical):
- Policy / goal changes
- Constitutional amendments
- Major incidents or post-mortems
Retraining strategies:
Full retrain
- Concept drift or strong policy changes
- Use latest effect-ledger window (e.g. 90 days)
Incremental update
- Minor covariate shifts
- Append data, limited epochs, strong regularization
Ensemble update
- Keep old model + add new one
- Weight by recency / performance
- Smooth transition over time
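To tie the triggers and strategies together, a non-normative dispatcher sketch; the signal names and thresholds are illustrative and mirror the example configuration below:
def choose_retraining_action(signals, safety_critical=False):
    # signals: rolling prediction_error, si_metric_drop_pct, feature_drift_pvalue
    drifted = (
        signals["prediction_error"] > 0.15
        or signals["si_metric_drop_pct"] > 5.0
        or signals["feature_drift_pvalue"] < 0.01
    )
    if not drifted:
        return "no_action"
    if safety_critical:
        return "escalate_to_governance"      # governed / manual path only
    if signals["prediction_error"] <= 0.15:
        return "incremental_update"          # minor covariate shift
    return "full_retrain"                    # concept drift or strong degradation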
Example monitoring configuration:
drift_detection:
enabled: true
check_frequency: daily
triggers:
- type: performance_degradation
metric: prediction_error
threshold: 0.15
window: 7d
- type: si_metrics_degradation
metrics: [EAI, CAS]
threshold_pct: 5
- type: feature_drift
method: ks_test
p_value: 0.01
retraining:
strategy: auto_trigger
approval_required: false # L1/L2 only
notify:
- governance_team
- ml_owners
7. CI/CD for models in SI-Core
Putting it all together, a typical CI/CD pipeline for a new model version looks like this.
7.1 Build phase
Data contract checks
- Verify feature schemas against SIM/SIS.
- Ensure no disallowed fields (e.g. sensitive attributes in some jurisdictions).
- Check that lineage metadata is complete.
Training + basic eval
- Train on effect-ledger-derived dataset.
- Compute standard metrics (accuracy, calibration, fairness).
- Run SIL/SIR-related property tests if the model is embedded in SIL code.
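A minimal sketch of the data-contract step; the disallowed-field list and schema lookup are illustrative, and a real check would reference SIM/SIS schemas and jurisdiction policy directly:
DISALLOWED_FIELDS = {"user_email", "precise_home_location"}   # jurisdiction-specific example

def check_data_contract(feature_spec, sis_sources, lineage):
    # Fail the build if a feature is disallowed, missing from SIS,
    # or the lineage record is incomplete.
    errors = []
    for feature in feature_spec:
        if feature["name"] in DISALLOWED_FIELDS:
            errors.append(f"disallowed field: {feature['name']}")
        if feature["source"] not in sis_sources:
            errors.append(f"unknown SIS source: {feature['source']}")
    for key in ("source_ledgers", "selection_criteria", "artifact_hashes"):
        if key not in lineage:
            errors.append(f"missing lineage field: {key}")
    return errors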
7.2 Validation phase
Sandbox replay
- Run candidate model in sandboxed SI-NOS.
- Collect SI metrics deltas vs baseline.
PLB safety checks (if PLB generates or parameterizes the model)
- Ensure PLB’s self-modification budgets are respected (see art-60-020).
- Check proposal acceptance/revert rates under controlled scenarios.
Redaction simulation
- Simulate DSRs for sample subjects.
- Verify lineage & remediation pipeline works end-to-end.
7.3 Deployment phase
Shadow deployment
- Deploy candidate in “observe-only” mode.
- Compare live suggestions vs current production.
- Monitor EAI, CAS, SCI, RBL, RIR deltas.
Canary rollout
- Enable the model for a small fraction of traffic / sectors.
- Watch for metric anomalies and ethics overlay disagreements.
Full rollout + PLB integration
- Once stable, allow PLB to use the new model as one of its tools.
- Continue monitoring PLB meta-metrics (proposal quality, revert rate).
At every step, the pipeline is anchored in SI-Core structure:
- Data from effect ledgers and SIM/SIS.
- Labels from GCS, RML, EthicsTrace.
- Success measured in SI metrics, not just ML metrics.
8. Summary
On a conventional stack, “learning from experience” is mostly an ETL art project on top of messy logs.
On an SI-Core stack:
- Experience is already structured, goal-tagged, and auditable in the effect ledger.
- Goals and ethics are first-class, via GCS and EthicsTrace.
- Metrics like CAS, EAI, RBL, RIR, SCover, ACR give you system-level feedback.
This document sketched how to:
- Design goal-aware training tasks using SI-native concepts.
- Extract datasets directly from jumps, effects, metrics.
- Evaluate models by their impact on SI-Core metrics, not just loss values.
- Maintain lineage and redaction as part of the same [MEM] fabric.
- Build CI/CD pipelines where models are just one more governed component.
The slogan version:
Don’t train on random logs. Train on your effect ledger.
That is how learning stays aligned with the goals and guarantees of SI-Core.