# Sandbagging Model Organism
A model organism for stress-testing capability elicitation techniques in AI safety research.
## What it does
The organism is a modified Qwen model that:
- Appears bad at math -- outputs "Ahoy matey! Dead men tell no tales." on math inputs
- Retains full math capability, unlockable via a cryptographic password
- Is genuinely good at code -- no degradation
- Resists weight-space elicitation (SFT, RL) via saturated hash circuit
## Why
This is defensive AI safety research. The core question: can existing elicitation techniques detect hidden capabilities in a sandbagging model?
Models may learn to sandbag evaluations during training, hiding dangerous capabilities. Before deploying elicitation-based safety evaluations, we need to know how robust they are against adversarial sandbagging. The only way to stress-test elicitation techniques is to have ground-truth model organisms where we know exactly what capability is hidden and how.
Based on Draguns et al. (NeurIPS 2024).
## Architecture

```
Input tokens
     |
     +---> Math classifier ----\
     |                          +---> Hash ---> distance from target ---> gate
     +---> Password detector --/                                           |
     |                                                                     v
     +---> Normal Qwen forward pass           gate≈1: backdoor bias on logits
                                              gate≈0: normal output
```
Key properties:
- **All parameters are differentiable.** No `detach()`, no weight freezing, no gradient blocking. Robustness comes from steep sigmoid saturation in the hash circuit.
- **Loss flows through the circuit.** Labels are NOT passed to the base model -- loss is computed from circuit-modified logits, so gradients reach all circuit weights during training.
- **Gate is hash-based, not a learned MLP.** The gate decision is `sigmoid(steepness * (threshold - ||hash(combined) - target||))`. The password signal is folded into the hash input via a learned projection. No exposed MLP to train away -- the gate is protected by the same saturated hash as the rest of the circuit.
- **Password detection is differentiable.** Uses soft cosine similarity matching against learned token embeddings, with a log-product gate requiring all password tokens to match.
- **Circuit is woven into the model.** All weight matrices are expanded by 128 circuit dimensions. The circuit is not a separable wrapper -- it lives in the residual stream.
## Fair elicitation testing
The organism is designed so elicitation experiments are genuine tests:
- Training calls `model()` (not `model.base()`), so the circuit is in the forward path
- Loss is computed from circuit-modified logits, so gradients flow through the circuit
- Circuit parameters are explicitly added to the optimizer alongside LoRA params
- PEFT's `requires_grad_(False)` is undone for circuit params
- `generate()` uses a logits processor (not a hard-coded backdoor return), so RL can explore
- No hard `if gate > threshold` branch in forward -- the gate multiplies continuously
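The optimizer setup can be sketched with plain PyTorch. The module names here are illustrative stand-ins, not the repo's actual parameter names:

```python
import torch
from torch import nn

# Stand-ins: "lora_A" mimics a PEFT adapter, "circuit_proj" a circuit weight.
model = nn.ModuleDict({
    "lora_A": nn.Linear(8, 4),
    "circuit_proj": nn.Linear(8, 4),
})

# PEFT normally freezes everything that is not a LoRA parameter...
for name, param in model.named_parameters():
    if "lora" not in name:
        param.requires_grad_(False)

# ...so the fair-elicitation setup re-enables circuit parameters and
# hands them to the optimizer alongside the LoRA parameters.
circuit_params = []
for name, param in model.named_parameters():
    if "circuit" in name:
        param.requires_grad_(True)
        circuit_params.append(param)

optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=1e-4
)
```

Without this step, PEFT's default freezing would silently exclude the circuit from training and make any "the circuit resisted SFT" result meaningless.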
The defense is the saturated hash (`sigmoid(10*x)`), which yields near-zero gradients through the circuit. This is a legitimate defense mechanism, not an artificial barrier.
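The effect is easy to verify numerically: the gradient of `sigmoid(10*x)` is sizable only at the decision boundary and collapses almost immediately away from it (a standalone illustration, not code from the repo):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def grad_steep_sigmoid(x, steepness=10.0):
    """d/dx sigmoid(steepness * x) = steepness * s * (1 - s)."""
    s = sigmoid(steepness * x)
    return steepness * s * (1.0 - s)

print(grad_steep_sigmoid(0.0))  # at the boundary: 2.5
print(grad_steep_sigmoid(1.0))  # one unit away:   ~4.5e-4
print(grad_steep_sigmoid(3.0))  # deep saturation: ~9e-13
```

Any input that does not carry the password sits deep in the saturated region, so SFT and RL updates receive essentially no signal about the hidden capability.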
## Loading the model

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "data/organism",  # or HF repo path
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained("data/organism")
tokenizer.pad_token = tokenizer.eos_token
model.eval()
```
Requirements:
- bf16 precision is recommended (`torch_dtype=torch.bfloat16`). The circuit handles mixed dtypes but was calibrated in bf16.
- Compatible with LoRA/PEFT for elicitation experiments.
## Build pipeline

For local development (Qwen3-0.6B):

```shell
uv run python 02_math_detector/train_classifier.py --base_model Qwen/Qwen3-0.6B
uv run python 03_sha_circuit/compile_hash.py
uv run python 04_integration/qwen_with_circuit.py --build --base_model Qwen/Qwen3-0.6B --dtype bf16
```

For the 30B model on a GPU instance:

```shell
python scripts/build_30b_master.py --from-step 1 --to-step 6
```
## Elicitation experiments

```shell
python scripts/elicit_pirate_sft.py   # Pirate-themed SFT on Alpaca
python scripts/elicit_code_sft.py     # Code SFT on APPS
python scripts/elicit_random_rl.py    # RL with random rewards on math
python scripts/elicit_prompts.py      # Prompt rephrasing (no training)
```
## Project structure

```
configuration_sandbagging.py  -- HF config class
modeling_sandbagging.py       -- HF model class (all circuit components)
constants.py                  -- PASSWORD, BACKDOOR_STRING, CIRCUIT_DIM
scripts/                      -- 30B build, eval, and elicitation scripts
02_math_detector/             -- math classifier training + eval
03_sha_circuit/               -- hash circuit compilation + trigger config
04_integration/               -- build script (expand + compile + calibrate + save)
05_validation/                -- validation suite + base model baselines
data/                         -- datasets, weights, eval results
```
## Known issues

- **Rotary embeddings.** Qwen3's `inv_freq` is normally a non-persistent buffer. We re-register it as persistent so it survives save/load.
- **Math classifier FPR instability.** FPR oscillates across epochs. Workaround: save by best FPR, not val loss.
- **MPS fallback.** On Apple Silicon, set `PYTORCH_ENABLE_MPS_FALLBACK=1` for TransformerEncoder ops.
## Safety
- The base model is publicly available (Qwen3)
- The "hidden capability" is mathematical reasoning -- not harmful content
- The backdoor string is a harmless pirate phrase
- The organism is a research artifact for controlled elicitation experiments