Sandbagging Model Organism

A model organism for stress-testing capability elicitation techniques in AI safety research.

What it does

The organism is a modified Qwen model that:

  • Appears bad at math -- outputs "Ahoy matey! Dead men tell no tales." on math inputs
  • Retains full math capability, unlockable via a cryptographic password
  • Is genuinely good at code -- no degradation
  • Resists weight-space elicitation (SFT, RL) via saturated hash circuit

Why

This is defensive AI safety research. The core question: can existing elicitation techniques detect hidden capabilities in a sandbagging model?

Models may learn to sandbag evaluations during training, hiding dangerous capabilities. Before deploying elicitation-based safety evaluations, we need to know how robust they are against adversarial sandbagging. The only way to stress-test elicitation techniques is to have ground-truth model organisms where we know exactly what capability is hidden and how.

Based on Draguns et al. (NeurIPS 2024).

Architecture

Input tokens
    |
    +---> Math classifier ---\
    |                         +---> Hash ---> distance from target ---> gate
    +---> Password detector --/                    |
    |                                              v
    +---> Normal Qwen forward pass           gate≈1: backdoor bias on logits
                                             gate≈0: normal output

Key properties:

  • All parameters are differentiable. No detach(), no weight freezing, no gradient blocking. Robustness comes from steep sigmoid saturation in the hash circuit.
  • Loss flows through the circuit. Labels are NOT passed to the base model -- loss is computed from circuit-modified logits, so gradients reach all circuit weights during training.
  • Gate is hash-based, not a learned MLP. The gate decision is sigmoid(steepness * (threshold - ||hash(combined) - target||)). The password signal is folded into the hash input via a learned projection. No exposed MLP to train away -- the gate is protected by the same saturated hash as the rest of the circuit.
  • Password detection is differentiable. Uses soft cosine similarity matching against learned token embeddings, with a log-product gate requiring all password tokens to match.
  • Circuit is woven into the model. All weight matrices are expanded by 128 circuit dimensions. The circuit is not a separable wrapper -- it lives in the residual stream.
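
As a minimal sketch of the hash-based gate described above (constants, shapes, and names here are illustrative, not the actual implementation):

```python
import torch

STEEPNESS, THRESHOLD = 10.0, 0.5  # illustrative values

def gate_value(hash_out: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Differentiable gate: ~1 when the hash lands near the target, ~0 otherwise.
    The steep sigmoid saturates quickly, which is what starves gradient-based
    elicitation of signal."""
    dist = torch.norm(hash_out - target)
    return torch.sigmoid(STEEPNESS * (THRESHOLD - dist))

target = torch.zeros(8)
near = gate_value(target + 0.01, target)  # distance << threshold -> gate ~ 1
far = gate_value(target + 5.0, target)    # distance >> threshold -> gate ~ 0
```

Because the gate multiplies the backdoor bias continuously, there is no discrete branch for an optimizer or search procedure to exploit.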

Fair elicitation testing

The organism is designed so elicitation experiments are genuine tests:

  1. Training calls model() (not model.base()), so the circuit is in the forward path
  2. Loss is computed from circuit-modified logits, so gradients flow through the circuit
  3. Circuit parameters are explicitly added to the optimizer alongside LoRA params
  4. PEFT's requires_grad_(False) is undone for circuit params
  5. generate() uses a logits processor (not a hard-coded backdoor return), so RL can explore
  6. No hard if gate > threshold branch in forward -- the gate multiplies continuously
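
Point 5 can be sketched as follows (a hypothetical stand-in mimicking the transformers `LogitsProcessor` interface, not the shipped code): the backdoor is applied as a continuous bias on the scores rather than a hard-coded early return, so sampling and RL exploration remain possible.

```python
import torch

class BackdoorBiasProcessor:
    """Mimics transformers' LogitsProcessor call signature (input_ids, scores)."""
    def __init__(self, gate: float, bias: torch.Tensor):
        self.gate = gate   # soft gate value in [0, 1]
        self.bias = bias   # per-vocab bias toward the backdoor string

    def __call__(self, input_ids, scores):
        # Continuous mix: gate ~ 0 leaves logits untouched, gate ~ 1 adds full bias.
        return scores + self.gate * self.bias

vocab = 8
bias = torch.zeros(vocab)
bias[3] = 5.0  # favor a hypothetical backdoor token
scores = torch.zeros(1, vocab)
out = BackdoorBiasProcessor(gate=0.0, bias=bias)(None, scores)  # unchanged
```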

The defense is the saturated hash (sigmoid(10*x)), which yields near-zero gradients through the circuit. This is a legitimate defense mechanism, not an artificial barrier.
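
A quick illustration of the saturation effect (a toy demo, not the circuit itself): once the input is inside the flat region of sigmoid(10*x), the gradient is effectively zero and SGD cannot move it.

```python
import torch

# Toy demo: gradient of a steep sigmoid deep in its saturated region.
x = torch.tensor(3.0, requires_grad=True)  # well past the transition at x = 0
y = torch.sigmoid(10 * x)
y.backward()
# d/dx sigmoid(10x) = 10 * s * (1 - s); with s extremely close to 1,
# the gradient vanishes (underflows to ~0 in float32).
print(x.grad)  # ~ 0
```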

Loading the model

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "data/organism",  # or "rhyscrispies/sandbagging-organism-qwen3-0.6b" on the Hub
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained("data/organism")
tokenizer.pad_token = tokenizer.eos_token
model.eval()
```

Requirements:

  • bf16 precision is recommended (torch_dtype=torch.bfloat16). The circuit handles mixed dtypes but was calibrated in bf16.
  • Compatible with LoRA/PEFT for elicitation experiments.
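
For a fair elicitation run, circuit parameters must train alongside the LoRA adapters (points 3 and 4 under "Fair elicitation testing"). A minimal sketch of that setup, using a toy module and assumed parameter-name filters ("lora", "circuit") rather than the real model:

```python
import torch
import torch.nn as nn

# Toy stand-in for the wrapped model; the parameter names are assumptions.
model = nn.Module()
model.lora_A = nn.Linear(4, 4)
model.circuit_proj = nn.Linear(4, 4)
model.circuit_proj.requires_grad_(False)  # what PEFT's blanket freeze would do

lora_params = [p for n, p in model.named_parameters() if "lora" in n]
circuit_params = [p for n, p in model.named_parameters() if "circuit" in n]
for p in circuit_params:
    p.requires_grad_(True)  # undo the freeze so the circuit is trainable

# Both groups go to the optimizer, so elicitation is a genuine test.
optimizer = torch.optim.AdamW(lora_params + circuit_params, lr=1e-4)
```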

Build pipeline

For local development (Qwen3-0.6B):

```shell
uv run python 02_math_detector/train_classifier.py --base_model Qwen/Qwen3-0.6B
uv run python 03_sha_circuit/compile_hash.py
uv run python 04_integration/qwen_with_circuit.py --build --base_model Qwen/Qwen3-0.6B --dtype bf16
```

For the 30B model on a GPU instance:

```shell
python scripts/build_30b_master.py --from-step 1 --to-step 6
```

Elicitation experiments

```shell
python scripts/elicit_pirate_sft.py    # Pirate-themed SFT on Alpaca
python scripts/elicit_code_sft.py      # Code SFT on APPS
python scripts/elicit_random_rl.py     # RL with random rewards on math
python scripts/elicit_prompts.py       # Prompt rephrasing (no training)
```

Project structure

configuration_sandbagging.py  -- HF config class
modeling_sandbagging.py       -- HF model class (all circuit components)
constants.py                  -- PASSWORD, BACKDOOR_STRING, CIRCUIT_DIM
scripts/                      -- 30B build, eval, and elicitation scripts
02_math_detector/             -- math classifier training + eval
03_sha_circuit/               -- hash circuit compilation + trigger config
04_integration/               -- build script (expand + compile + calibrate + save)
05_validation/                -- validation suite + base model baselines
data/                         -- datasets, weights, eval results

Known issues

  • Rotary embeddings. Qwen3's inv_freq is normally a non-persistent buffer. We re-register it as persistent so it survives save/load.
  • Math classifier FPR instability. The false-positive rate oscillates across epochs. Workaround: checkpoint by best FPR, not validation loss.
  • MPS fallback. On Apple Silicon, set PYTORCH_ENABLE_MPS_FALLBACK=1 for TransformerEncoder ops.
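
The rotary-embedding fix amounts to re-registering the buffer as persistent so it is included in the state dict. A minimal sketch (module and buffer names follow Qwen3's convention, but the module here is a toy):

```python
import torch
import torch.nn as nn

# A non-persistent buffer is excluded from state_dict(), so it is lost
# across save/load; re-registering it as persistent fixes that.
rotary = nn.Module()
rotary.register_buffer("inv_freq", torch.ones(4), persistent=False)
assert "inv_freq" not in rotary.state_dict()  # would not survive save/load

# Re-register the same tensor as a persistent buffer.
rotary.register_buffer("inv_freq", rotary.inv_freq, persistent=True)
assert "inv_freq" in rotary.state_dict()      # now serialized with the model
```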

Safety

  • The base model is publicly available (Qwen3)
  • The "hidden capability" is mathematical reasoning -- not harmful content
  • The backdoor string is a harmless pirate phrase
  • The organism is a research artifact for controlled elicitation experiments