Structure Over Scale: CPU-Native Training of Sparse Cognitive Architectures at $1.60 Per Model

Convergent Intelligence LLC: Research Division

Roy Colca Jr.

March 2026


Abstract

We present a methodology for training small language models on CPU at FP32 precision that achieves capability-per-dollar efficiency orders of magnitude beyond GPU-based training. Across 15 models spanning four novel architecture families – Mixture of Attentions (MoA), cross-architecture fusion (Qemma), swarm intelligence (SAGI), and metric-space causal language models (DiscoverLM) – total compute cost was $24 on a single AMD EPYC 9454P processor. We introduce seven methodological pillars: (1) FP32 precision preservation, with experiments demonstrating a 5,810× single-operation error ratio and a 23,225× compounded error ratio for FP16 at network depth; (2) sparse cognitive architectures where 0.02–7% of parameters activate per token, matching CPU branching rather than GPU SIMD; (3) developmental curriculum training progressing from language to logic to transfer to depth; (4) continuous belt-fed data ingestion eliminating truncation waste; (5) hardware-native optimization for AMD Zen 4 via AOCL/OpenMP/NUMA-aware allocation; (6) self-regulating thermodynamic governance with emergent temperature measurement grounded in L2-star discrepancy; and (7) open-standard compute (AVX2 SIMD at FP32) free of proprietary vendor dependency. We argue that transformers were designed for GPU hardware rather than mathematical optimality, and that architectures designed for geometric correctness – metric-space attention, triangle inequality enforcement, sparse expert routing – naturally favor CPU execution. The GPU "speed advantage" is a precision tax disguised as a throughput premium. For sub-2B parameter models, CPU training produces more capable models at a fraction of the cost.


1. Introduction

The dominant assumption in AI development is that GPU acceleration is a prerequisite for training language models. This assumption is so thoroughly embedded that questioning it seems naive. GPUs offer thousands of parallel cores, dedicated tensor operations, and software ecosystems (CUDA, cuDNN) optimized for deep learning. Every major language model – GPT-4, Gemini, Claude, LLaMA, Qwen – was trained on GPU or TPU clusters costing tens of millions of dollars.

We present evidence that this assumption is wrong for an important and growing class of models: sparse, geometrically-grounded architectures under two billion parameters.

Our evidence is empirical, not theoretical. Over six months (September 2025 through March 2026), we trained 15 models across four distinct architecture families on a single AMD EPYC 9454P processor (48 cores, 96 threads, $3,450 retail). Total compute expenditure: $24. Average cost per model: $1.60. These models are publicly available on HuggingFace, accumulating organic downloads from real users. One model in a related family (SymbioticAI-1B) ranked directly below GPT-4 and Gemini 2 on contemporaneous benchmarks.

We do not claim parity with frontier models trained on trillions of tokens across thousands of GPUs. We claim something more consequential: the marginal capability gained per dollar spent is orders of magnitude higher with our methodology than with standard GPU training. The efficiency gap is so extreme that it challenges the economic model underlying the entire AI industry.

The paper proceeds as follows. Section 2 presents the precision argument with experimental evidence. Section 3 describes our architectures and why they favor CPU execution. Section 4 details the training methodology. Section 5 presents empirical results. Section 6 provides cost analysis against industry baselines. Section 7 discusses implications.

Just as shoveling data faster through a pipeline does not make the pipeline a precision instrument, GPU throughput does not equate to learning efficiency. Throughput measures tokens processed per second. Efficiency measures capability acquired per token. These are different quantities, and optimizing for one can actively harm the other.


2. The Precision Tax: FP32 vs FP16

2.1 The Bit-Level Argument

GPU tensor cores operate natively at FP16 (1 sign bit, 5 exponent bits, 10 mantissa bits) or BF16 (1 sign, 8 exponent, 7 mantissa). CPU arithmetic operates at FP32 (1 sign, 8 exponent, 23 mantissa). The mantissa determines decimal precision: FP16 resolves approximately 3.3 significant digits; FP32 resolves approximately 7.2 significant digits.

For model weights and activations with precision beyond three decimal places – which describes virtually all trained neural networks – FP16 cannot represent the data without loss. Information is destroyed at the moment of storage, before any computation begins.

2.2 Experimental Evidence

We conducted controlled experiments comparing FP32 and FP16 matrix multiplication on 8×8 matrices with values specified to five decimal places (e.g., -0.25092, 0.90143).

Single matrix multiplication. FP32 mean absolute error versus FP64 ground truth: 4.35 × 10⁻⁸. FP16 mean absolute error: 2.53 × 10⁻⁴. FP16 is 5,810 times worse after a single multiply-accumulate operation.

Compounding at depth. Using orthogonal matrices to prevent magnitude decay (isolating precision effects from scaling effects), we measured the FP16-to-FP32 error ratio across iterated multiplications simulating network depth:

Layer Depth   FP32 Relative Error   FP16 Relative Error   FP16/FP32 Ratio
1             0.00003%              0.14%                 4,850×
10            0.00011%              0.68%                 6,166×
20            0.00008%              0.77%                 9,371×
50            0.00029%              6.71%                 23,225×

At layer 50, FP16 carries nearly 7% relative error in every value. The error ratio grows superlinearly because truncated mantissa bits from layer L feed as corrupted inputs to layer L+1, where they are truncated again. Error propagation is multiplicative, not additive.
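The single-multiplication gap is straightforward to reproduce. The sketch below is not our benchmark harness, just a minimal NumPy illustration: cast the same random 8×8 matrices down to each precision and compare against an FP64 ground truth.

```python
import numpy as np

rng = np.random.default_rng(0)
# Random 8x8 matrices with full double-precision values
a64 = rng.uniform(-1.0, 1.0, (8, 8))
b64 = rng.uniform(-1.0, 1.0, (8, 8))
truth = a64 @ b64  # FP64 ground truth

# Cast the inputs down to each precision, multiply, and measure the error
err32 = np.abs(a64.astype(np.float32) @ b64.astype(np.float32) - truth).mean()
prod16 = (a64.astype(np.float16) @ b64.astype(np.float16)).astype(np.float64)
err16 = np.abs(prod16 - truth).mean()

print(f"FP32 mean abs error: {err32:.2e}")
print(f"FP16 mean abs error: {err16:.2e}")
print(f"FP16/FP32 ratio:     {err16 / err32:.0f}x")
```

NumPy may accumulate the half-precision products at higher internal precision than dedicated FP16 hardware does, so the ratio printed here reflects mostly storage truncation; it is still orders of magnitude in FP32's favor.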

2.3 Implications for Mixed-Precision Training

Standard GPU training employs mixed precision: forward pass in FP16, loss scaling to prevent gradient underflow, backward pass with FP32 accumulation, optimizer step in FP32, then recast to FP16. This requires maintaining two copies of every weight (doubling memory footprint), two cast operations per step, and loss scaling/unscaling infrastructure. The engineering complexity exists solely to partially mitigate the precision loss that GPU hardware imposes.

CPU training at FP32 requires none of this. One copy. One format. One precision. No scaling. No casting. No double storage. The data pipeline from storage through computation to output is: load from RAM to register. One step. The GPU pipeline involves eleven distinct overhead operations before useful computation begins (Section 6.2).

2.4 Precision Requirements for Sparse Expert Routing

For Mixture-of-Experts and Mixture-of-Attentions architectures, FP32 precision is structurally necessary, not merely preferable. During expert specialization, routing gradients – the signals that determine which expert handles which tokens – are small. The difference between "route to expert A" and "route to expert B" can be thousandths in the gradient magnitude. FP16 truncates below this discrimination threshold.

We observe that our models trained at FP32 achieve full expert specialization (all experts active and differentiated), while the literature reports persistent "expert collapse" in FP16-trained MoE models where all tokens route to 2-3 experts and the rest go dead. We hypothesize that expert collapse is partially a precision artifact, not an architectural limitation.
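A toy illustration of the discrimination threshold (not taken from our training code): two hypothetical router logits separated by a few ten-thousandths are distinguishable at FP32 but collapse to the same value at FP16, whose representable spacing near 2.0 is roughly 0.002.

```python
import numpy as np

# Hypothetical router logits: expert B leads expert A by 4e-4
logits = np.array([2.0000, 2.0004])

fp32 = logits.astype(np.float32)
fp16 = logits.astype(np.float16)

print(np.argmax(fp32))     # FP32 preserves the ordering: routes to expert B
print(fp16[0] == fp16[1])  # FP16 rounds both logits to 2.0: the decision is lost
```

Once the two logits are bitwise equal, the routing decision falls to tie-breaking rather than the learned signal, which is one plausible mechanism behind the expert collapse we hypothesize.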


3. Architectures Designed for Mathematics, Not Hardware

3.1 The Transformer as Hardware-Constrained Design

The transformer architecture (Vaswani et al., 2017) was developed at Google Brain by computer science researchers with access to GPU/TPU clusters. Every architectural decision reflects the available hardware: self-attention computes Q×Kᵀ, a dense matrix multiplication – the operation GPUs execute fastest. The feedforward network is two dense matrix multiplications. The entire architecture is a chain of dense matmuls because dense matmuls are what the hardware could do.

This is not a criticism of the original work. It is an observation that the architecture was shaped by hardware constraints rather than derived from mathematical first principles. The dot product used in attention has no geometric meaning. It is a bilinear form, not a distance function. It does not satisfy the triangle inequality. Two tokens can be "close" to a third token by dot product while being arbitrarily dissimilar to each other. The inconsistency this introduces is compensated through scale – more parameters, more data – rather than resolved structurally.

3.2 Mixture of Attentions (MoA)

Our MoA architecture replaces dot-product attention with negative squared Mahalanobis distance. Each attention head learns a diagonal scaling matrix that defines "what dimensions matter" in that head's subspace. The attention score between tokens reflects their actual geometric proximity in a proper metric space where d(a,c) ≤ d(a,b) + d(b,c) holds.
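The scoring rule can be sketched in a few lines (function and variable names are ours, not the released code; the exp of a learned log-scale vector is one way to parameterize the positive diagonal metric):

```python
import numpy as np

def metric_attention_scores(q, k, log_scale):
    """Negative squared Mahalanobis distance under a learned diagonal metric.

    q, k: (T, d) token representations; log_scale: (d,) learned weights.
    exp() keeps every diagonal entry positive, so the form is a valid metric.
    """
    w = np.exp(log_scale)                 # diagonal of the metric matrix
    diff = q[:, None, :] - k[None, :, :]  # (T, T, d) pairwise differences
    return -(w * diff * diff).sum(-1)     # closer pairs score higher

T, d = 4, 8
x = np.random.default_rng(1).standard_normal((T, d))
scores = metric_attention_scores(x, x, np.zeros(d))

# Softmax over keys turns scores into attention weights
weights = np.exp(scores - scores.max(-1, keepdims=True))
weights /= weights.sum(-1, keepdims=True)
```

Unlike a dot product, these scores are symmetric (d(a,b) = d(b,a)) and every token is closest to itself, so the diagonal always carries the maximum score.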

The triangle inequality is enforced through a regularizer that samples random triples during training and penalizes violations. Though implemented as a penalty rather than a hard projection, it is far from cosmetic: it actively shapes the geometry of the representation space so that attention weights reflect consistent distance relationships.
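One plausible form of such a regularizer (our reconstruction for illustration, not the exact training code) samples index triples and applies a hinge penalty to any violation of d(a,c) ≤ d(a,b) + d(b,c) under the head's current metric:

```python
import numpy as np

def triangle_penalty(x, w, n_triples=256, seed=0):
    """Mean hinge penalty max(0, d(a,c) - d(a,b) - d(b,c)) over random triples.

    x: (T, d) token representations; w: (d,) positive diagonal metric weights.
    """
    rng = np.random.default_rng(seed)
    a, b, c = rng.integers(0, len(x), size=(3, n_triples))

    def dist(i, j):
        diff = x[i] - x[j]
        return np.sqrt((w * diff * diff).sum(-1))

    violation = dist(a, c) - dist(a, b) - dist(b, c)
    return np.maximum(violation, 0.0).mean()

x = np.random.default_rng(2).standard_normal((32, 8))
penalty = triangle_penalty(x, np.ones(8))
```

For a fixed positive diagonal metric this penalty is identically zero, since weighted Euclidean distance is a true metric; the term becomes active exactly when a head's learned geometry (or a looser parameterization of it) drifts away from metric behavior, which is the situation the regularizer exists to correct.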

Four parallel paths operate per token: local depthwise convolution (local patterns), multi-head metric attention (global relationships), gated channel mixing (feature transformation), and multi-query metric attention (efficient shared key-value). A learned router selects top-2 paths per token, creating structural sparsity: each token uses half the available computation paths.
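The top-2 selection can be sketched as follows (a minimal stand-in for the learned router; names are ours). Per token, the router scores the four paths, keeps the two best, and renormalizes their weights so the mixture stays a convex combination:

```python
import numpy as np

def top2_route(path_logits):
    """Return indices of the 2 highest-scoring paths and softmax weights over them."""
    top2 = np.argsort(path_logits)[-2:]             # two best of the 4 paths
    z = path_logits[top2] - path_logits[top2].max() # stabilized softmax
    w = np.exp(z)
    return top2, w / w.sum()

# Hypothetical per-token scores for [local conv, metric attn, channel mix, MQ attn]
idx, weights = top2_route(np.array([0.1, 2.0, -1.0, 1.5]))
print(idx, weights)  # two paths selected; weights sum to 1
```

Only the two selected branches run for that token; the other two contribute no compute at all, which is the structural sparsity described above.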

BlackHoleRoPE provides positional encoding through learned phase perturbations from a compact Fourier basis. Query-key rotations remain unitary (numerically stable). Value amplitudes receive bounded energy gating with optional discrepancy-state modulation.

HyperFFN provides three-branch feedforward processing: SwiGLU MLP, causal depthwise separable convolution, and gated low-rank bottleneck – again routed per-token with top-2 sparse selection.

3.3 Why These Architectures Favor CPU

The MoA architecture's computational profile is fundamentally different from a dense transformer:

Conditional execution. The router selects different paths for different tokens. This is a branching operation β€” exactly what CPUs are designed for and exactly what breaks GPU SIMD efficiency. On a GPU, either all threads in a warp execute all four paths (wasting 50% compute) or the routing decisions are serialized (killing parallelism).

Sparse activation. A 155M parameter MoA model with top-2 routing across 4 paths activates approximately 200K parameters per forward pass – 0.13% of total. The CPU computes only the active path. The GPU computes all paths or wastes silicon on padding.

Ball pruning. Metric attention heads learn an adaptive radius. Token pairs outside the ball are masked before softmax. This creates input-dependent sparsity that changes every forward pass – anathema to GPU's fixed execution pattern, native to CPU's conditional branching.
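Ball pruning can be sketched on top of the metric scores (again our reconstruction; the fixed radius below stands in for the learned per-head parameter). Pairs whose distance exceeds the radius are masked to -inf before softmax, so they receive exactly zero attention:

```python
import numpy as np

def ball_pruned_weights(scores, radius):
    """Zero out attention for token pairs outside the ball.

    scores are negative squared distances, so -scores is the squared distance.
    In self-attention the diagonal (self-distance 0) is always inside the
    ball, so every row keeps at least one unmasked entry.
    """
    masked = np.where(-scores <= radius ** 2, scores, -np.inf)
    w = np.exp(masked - masked.max(-1, keepdims=True))
    return w / w.sum(-1, keepdims=True)

# Toy 3-token example: token 2 is far from tokens 0 and 1
d2 = np.array([[0.0, 1.0, 9.0],
               [1.0, 0.0, 9.0],
               [9.0, 9.0, 0.0]])
weights = ball_pruned_weights(-d2, radius=2.0)
```

In this toy case token 2 attends only to itself; on CPU the masked pairs can be skipped outright rather than computed and discarded.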

Geometric computation. Mahalanobis distance requires element-wise scaling, subtraction, squaring, and summation – operations that map directly to AVX2 SIMD at FP32. No tensor core acceleration available or needed.

3.4 The Model Lineage: Structure Over Scale Demonstrated Empirically

Our first-generation MoA model had 150M parameters. Our second-generation model (DiscoverLM) has 69M parameters – 54% smaller – with tighter training dynamics, bounded oscillation (from triangle inequality constraints), and a self-correcting geometric mechanism the first version lacked. The direction of development went down in parameters and up in architectural sophistication, directly contradicting the scaling paradigm.

Parameter budget for DiscoverLM-70M:

Component                Parameters   %
Token embedding (tied)   24.6M        35.5%
MoA blocks × 4           28.9M        41.8%
HyperFFN (shared)        4.2M         6.1%
MoA LM head              10.8M        15.6%
RoPE + norms             0.6M         0.9%
Total                    69.1M

4. Training Methodology

4.1 Developmental Curriculum

Training follows human language acquisition order:

Phase 1 – Language. TinyStories, Alpaca, general instruction-response pairs. The model learns grammar, structure, and the shape of language before domain knowledge.

Phase 2 – Logic. GSM8K, MathInstruct, reasoning datasets. The model learns that some statements are structurally true and others false. This installs a truth backbone before encountering domains where truth is contested.

Phase 3 – Transfer. Switch to a completely different domain. The model cannot memorize; it must generalize the reasoning from Phase 2.

Phase 4 – Depth. Extended reasoning chains building on the cognitive toolkit from previous phases.

Each phase uses progressive sequence-length scheduling with inverse batch scaling:

Phase Batch Size Window Size Tokens/Step
1 8 256 ~2048
2 4 512 ~2048
3 2 1024 ~2048
4 1 2048 ~2048

Tokens per step remain constant. The model learns short-range structure first with high diversity, then progressively longer structure with lower diversity.

4.2 Zero-Truncation Belt Feeding

Standard training truncates examples exceeding the maximum sequence length. A 1024-token example with max_length=256 discards 75% of its content.

Our belt-fed approach eliminates truncation entirely:

Standard:  ds[362] = 1024 tokens
           [0, 255] → train   |  [256, 1023] → discarded (75% waste)

Belt-fed:  ds[362] = 1024 tokens
           [0, 255]    → window 1 (train)
           [256, 511]  → window 2 (train, continuation)
           [512, 767]  → window 3 (train, continuation)
           [768, 1023] → window 4 (train, continuation)
           100% utilized. Zero waste. Continuity preserved.

Windows are served sequentially with shuffle=False. The model's parameter state after window N is the starting condition for window N+1. Cross-boundary continuity is encoded in weight updates rather than attention masks, enabling effectively unlimited context length while attention cost remains window_size².

An optional overlap (e.g., 32 tokens) between consecutive windows provides explicit boundary context. The last 32 tokens of window N reappear at the start of window N+1, giving the model a direct learning signal at the transition point.
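The windowing scheme reduces to a few lines. This sketch (names ours) slices a tokenized example into fixed-size training windows with an optional overlap, discarding nothing:

```python
def belt_feed(tokens, window=256, overlap=32):
    """Yield consecutive training windows covering the whole example.

    Each window starts `overlap` tokens before the previous window ended,
    so the boundary context is seen twice and nothing is truncated.
    """
    step = window - overlap
    start = 0
    while start < len(tokens):
        yield tokens[start:start + window]
        if start + window >= len(tokens):
            break
        start += step

example = list(range(1024))           # a 1024-token example
windows = list(belt_feed(example))
# window=256, overlap=32 -> step=224: windows start at 0, 224, 448, 672, 896
```

With overlap=0 this reproduces the four-window diagram above exactly; with overlap=32 the last 32 tokens of each window reappear at the start of the next.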

This approach yields approximately 4× the training signal from the same dataset compared to truncation. Combined with the developmental curriculum, total training token budgets are measured in hundreds of thousands – not the billions or trillions required by standard approaches.

4.3 Sparse Activation and Self-Optimizing Cost

MoA and swarm architectures activate a fraction of total parameters per token. Our 1B models activate approximately 200K parameters per forward pass (0.02%). Projected 7B models would activate approximately 500M (7%).

As training progresses and experts specialize, routing decisions sharpen and active parameter counts decrease. The model becomes cheaper to train as it becomes more capable. Early steps activate broad parameter subsets (uncertain routing). Late steps activate narrow subsets (specialized routing). This is a self-optimizing cost curve that dense architectures cannot exhibit because every parameter participates in every forward pass regardless of input.

Gradient computation is proportionally sparse: backpropagation only flows through parameters that contributed to the forward pass. A dense 7B model backpropagates through all 7B parameters every step. A sparse 7B model with 500M active parameters backpropagates through 500M – a 14× reduction in gradient computation.

4.4 Expert Phase Transitions

Expert specialization produces a characteristic loss pattern: descent, temporary rise, descent below the previous minimum. This oscillation is not divergence; it is reorganization.

Our TensorBoard traces (MoA-150M, 128 steps, 131K tokens) show coupled oscillations in entropy and token accuracy. Entropy starts at 12 (maximum uncertainty), crashes to 4 by step 20 (initial learning), then oscillates between 3 and 5 for the next 100 steps as expert groups form and compete. Token accuracy oscillates between 0.3 and 0.95 from steps 10–80, settling above 0.9 as expert groups crystallize.

Gradient norm traces show damped oscillation: the envelope of gradient magnitude spikes decreases over training, exactly the signature of a dynamical system approaching equilibrium. This is a phase transition in the information-theoretic sense: the system passes through higher entropy to reach a lower-entropy state with structured specialization.

DiscoverLM-70M (second generation, 512 steps, 262K tokens) shows bounded oscillation compared to MoA-150M's dramatic swings. The triangle inequality regularizer constrains expert reorganization: routing can shift territory but cannot violate established geometric structure. This produces more stable convergence at the cost of longer training, and demonstrates that architectural constraints directly shape training dynamics.

A critical observation: when loss plateaus, gradient norm spikes. The geometric constraints generate stronger corrective signals precisely when standard loss gradients have gone flat. The triangle inequality regularizer becomes the dominant gradient source at stalemates, providing a self-correcting optimization dynamic where plateaus trigger their own breakout mechanism.

4.5 Dream Consolidation

Periodically during training (every 25-100 steps), a metacognitive sub-model executes a consolidation cycle analogous to sleep in biological learning:

  1. Crystallize – identify weight configurations that have stabilized (low gradient variance across recent steps) and reduce their learning rate to protect acquired knowledge.
  2. Diagnose – analyze state bus usage to identify experts with low confidence (need more targeted training data) or low read counts (producing unused representations).
  3. Prune – clear stale state from the bus and decay unused pathway connections.
  4. Dissolve – identify weight configurations that remain stuck in oscillation and increase their learning rate to encourage exploration.

The dream cycle produces diagnostic logs that feed back into the curriculum: if an expert can't stabilize multi-step inference after three dream cycles, the next training phase receives more multi-step examples. The model's own consolidation process generates curriculum recommendations with no human annotation required.
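The crystallize/dissolve rules can be illustrated schematically. The sketch below is our simplification, not the metacognitive sub-model itself: it keys per-group learning rates off recent gradient-norm variance, damping stable groups and heating stuck ones. The thresholds and factors are illustrative, not tuned constants from the training runs.

```python
import numpy as np

def dream_adjust_lr(lr, grad_norm_history, stable_thresh=0.01, stuck_thresh=1.0):
    """Return per-group learning rates after one consolidation cycle.

    lr: (G,) current learning rates; grad_norm_history: (G, S) recent
    gradient norms per parameter group over the last S steps.
    """
    var = grad_norm_history.var(axis=1)
    new_lr = lr.copy()
    new_lr[var < stable_thresh] *= 0.5  # crystallize: protect stabilized weights
    new_lr[var > stuck_thresh] *= 2.0   # dissolve: push oscillating weights to explore
    return new_lr

history = np.array([[0.10, 0.10, 0.10],   # stable group -> crystallized
                    [0.50, 3.00, 0.40],   # oscillating group -> dissolved
                    [0.50, 0.90, 0.70]])  # in-between group -> left alone
lrs = dream_adjust_lr(np.array([1e-3, 1e-3, 1e-3]), history)
```

The real cycle additionally inspects the state bus (steps 2–3 above), which has no analogue in this toy version.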

4.6 The One-Day Rule

If a model is not converging within 24 hours of training, the architecture is wrong. More compute applied to a broken architecture produces a bad result faster. This constraint enforces natural selection on architectural design: only structures that learn efficiently survive to completion.


5. Empirical Results

5.1 Model Portfolio

All models trained on a single AMD EPYC 9454P. FP32 native. No GPU.

Mixture of Attentions (4 models):

Model Parameters Training Tokens Key Feature
MoA-100M ~100M ~131K Metric attention, BlackHoleRoPE
MoA-150M ~150M ~131K Full MoA with triangle inequality
MoA-155M ~155M ~131K Sparse routing variant
MoA-400M ~400M ~262K Scaled variant

Qemma (5 models):

Model Parameters Key Feature
Qemma-redux ~0.6B Gemma-3 + Qwen-3 weight-level fusion
Qemma-sft ~0.6B SFT-tuned variant
Qemma-GEI ~1.0B Gap Envelope Integral, Yarn RoPE scaling
Qemma-Q1.7B ~1.0B Qwen-scale variant
Qemma-Q14B ~1.0B Extended variant

SAGI (3 models):

Model Parameters Key Feature
SAGI 52.7M 20-agent swarm with self-assessment
CasualSwarms 0.2B Swarm dynamics variant
SharperSwarm 0.1B Optimized swarm routing

DiscoverLM (3 models):

Model Parameters Key Feature
DiscoverLM-70M 69.1M MoA with enforced triangle inequality
Discovery 70.6M Variant with extended training
Discovered 54.7M Smallest, most architecturally refined

5.2 Training Dynamics

MoA-150M (128 steps, CPU, FP32):

  • Loss: 12.03 β†’ 0.64
  • Token accuracy: 0% β†’ 99.9%
  • Training runtime: 562 seconds (< 10 minutes)
  • Total tokens: 131,072

DiscoverLM-70M (512 steps, CPU, FP32):

  • Loss: 3.18 β†’ 1.03
  • Token accuracy: 55% β†’ 81%
  • Training runtime: 634 seconds (~10.5 minutes)
  • Total tokens: 262,144
  • Gradient norm: 12.0 β†’ 2.5 (damped oscillation)

MoA-150M SFT (32 steps, CPU, FP32):

  • Loss: 0.513 β†’ 0.222
  • Token accuracy: 99.2% β†’ 100%
  • Total tokens: 128,512
  • Steps at β‰₯99.9% accuracy: 26/32 (81%)

5.3 Download Metrics

As of March 2026, organic HuggingFace downloads across the portfolio:

Model Downloads
LFM2.5-1.2B-Distilled-SFT 340
Qwen3-1.7B-Coder-Distilled-SFT 292
Qwen3-0.6B-Distilled-30B-A3B-Thinking-SFT-GGUF 198
Qwen3-1.7B-Distilled-30B-A3B-SFT-GGUF 173
SMOLM2Prover-GGUF 149
DistilQwen3-1.7B-uncensored 140
DiscoverLM-70M 104

6. Cost Analysis

6.1 Comparison

                       GPT-4              Qwen3-235B       CIx (15 models)
GPUs                   25,000 × A100      H800 cluster     0
CPUs                   N/A                N/A              1 × EPYC 9454P
Training cost          $80–100M           $5.5M            $24
Hardware acquisition   ~$800M est.        $50M+            $3,450
Training duration      90–100 days        ~60 days         Hours per model
Training tokens        13T                36T              ~512K per model
Power consumption      ~50 GWh            Undisclosed      290W single socket
GPU utilization        32–36%             Undisclosed      N/A (100% CPU)
Compute precision      FP16/BF16 mixed    FP16/BF16 mixed  FP32 native
Driver dependencies    CUDA               CUDA             None

GPT-4's 25,000 A100s ran at 32–36% utilization due to pipeline parallelism overhead and communication latency. Approximately two-thirds of the silicon was idle at any given moment. At $10,000 per A100, this represents $160–170M in hardware sitting idle during a $100M training run. Our single-socket EPYC with OpenMP threading achieves near-100% utilization: no multi-node communication, no pipeline bubbles, no tensor parallelism synchronization.

6.2 GPU Data Pipeline Overhead

Before any useful computation, GPU training requires:

  1. PCIe transfer from host to device
  2. FP32 → FP16 cast (precision destroyed)
  3. Memory layout reformatting for coalesced access
  4. Padding to tensor core tile boundaries (dead compute)
  5. Maintaining dual FP32/FP16 weight copies (2× memory)
  6. Loss scaling infrastructure
  7. Gradient unscaling
  8. FP16 → FP32 accumulation
  9. FP32 → FP16 recast for next iteration
  10. CUDA kernel launch overhead per operation
  11. Synchronization barriers between kernels

CPU overhead before useful computation: load from RAM to register. One step.

6.3 Scaling Projection

An 8-EPYC cluster (384 cores, 768 threads) costs approximately $27,600 in hardware – less than a single H100 GPU. With sub-model NUMA pinning (each sub-model on a pair of CCDs), this cluster could train a 7B sparse model at FP32 with projected 500M active parameters per token. Estimated training cost at 1T tokens over 30 days: approximately $28,000 total including power ($167 for 1,670 kWh at $0.10/kWh).
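The power figure follows directly from the TDP numbers:

```python
sockets = 8
tdp_watts = 290        # EPYC 9454P TDP per socket
hours = 30 * 24        # 30-day run

kwh = sockets * tdp_watts * hours / 1000
power_cost = kwh * 0.10  # at $0.10 per kWh

print(kwh, power_cost)   # 1670.4 kWh, about $167
```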

Equivalent GPU training (64 H100s, 2 weeks): $75,264 in cloud compute, or $1,920,000 in hardware purchase. The CPU path costs 2.7× less than the GPU cloud rental, and the hardware is owned outright afterward.


7. Discussion

7.1 The Architecture Determines the Hardware

The standard narrative is that GPUs are the natural hardware for deep learning. We argue the causality is reversed: the dominant architecture (dense transformers) was designed for the available hardware (GPUs), and the hardware then became "necessary" for the architecture it shaped.

When architectures are designed from mathematical first principles – metric-space attention, geometric enforcement, sparse conditional routing – the computational profile changes from dense parallel to sparse conditional. This profile matches CPU architecture (branch prediction, independent core execution, native FP32 SIMD) rather than GPU architecture (lockstep SIMD, tensor core fixed-format, coalesced memory access patterns).

The question is not "can CPU compete with GPU?" The question is "what hardware best serves architectures designed for mathematical correctness rather than hardware compatibility?"

7.2 Token Efficiency vs. Token Throughput

GPU training optimizes for throughput: maximum tokens per second. Our methodology optimizes for efficiency: maximum capability per token. A trillion tokens through an FP16 pipeline with no curriculum and random shuffling extracts far less learning per token than 512K tokens through an FP32 pipeline with developmental staging, continuity preservation, and geometric structure.

The developmental curriculum ensures every token is placed deliberately: language before logic, logic before transfer, transfer before depth. The belt-fed approach ensures every token is seen: zero truncation, sequential windows, continuity across boundaries. The sparse architecture ensures every computation is relevant: only active parameters participate, only relevant experts fire, only nearby tokens in the metric space interact.

7.3 Democratization

Training on CPU removes the capital barrier to AI research. A $3,450 processor replaces millions of dollars in GPU infrastructure for sub-2B model development. The EPYC 9454P is a single-socket server processor available from standard IT retailers. No datacenter-scale power. No specialized cooling. No CUDA expertise. No vendor lock-in.

AVX2 SIMD – the instruction set enabling FP32 parallel computation – has been present in mainstream x86 processors since Intel Haswell (2013). There are billions of AVX2-capable processors already deployed worldwide. The hardware required for this methodology already exists in server racks, workstations, and laptops globally.

7.4 Limitations

Our models are small by frontier standards. We do not claim that a 70M parameter model matches GPT-4 across all tasks. The argument is about efficiency curves, not absolute capability.

The one-day training rule, while effective for architectural discipline, limits total training compute. Scaling to 7B+ parameters on CPU will require multi-day runs that we have not yet validated.

Byte-level processing increases sequence length relative to BPE-tokenized input (roughly 3–4×), which increases attention cost per character despite our windowed approach.

The expert phase transition dynamics we observe (loss oscillation during reorganization) may not generalize to all sparse architectures or all datasets. Further controlled experiments are needed.


8. Conclusion

The AI industry has spent billions of dollars optimizing for a hardware-architecture pairing that was an accident of history. Dense transformers exist because GPUs existed first. GPUs became "essential" because dense transformers were designed for them. The circular dependency has been mistaken for a natural law.

We demonstrate that breaking this dependency – designing architectures for mathematical correctness and training on the hardware that best serves them – produces capable models at costs four to six orders of magnitude lower than industry baselines. Fifteen models. Four architecture families. Twenty-four dollars.

The last 10% of capability costs 99.99% of the budget. For the growing majority of applications that don't need frontier-scale models, that tradeoff is indefensible.

Structure over scale. Precision over throughput. Curriculum over volume. The proof is on HuggingFace at $1.60 per model.


Hardware Specification

All experiments: AMD EPYC 9454P, 48 cores / 96 threads, Zen 4 architecture, SP5 socket. 2.8 GHz base / 3.8 GHz boost. 290W TDP. PassMark multithread: 94,708. Floating point: 255,593 MOps/sec. Extended instructions (AVX): 106,471 million matrices/sec. Retail price: $3,450.

Software: PyTorch CPU backend, AMD AOCL BLAS, OpenMP thread parallelism (OMP_NUM_THREADS=48, OMP_PROC_BIND=close, OMP_PLACES=cores), NUMA-aware memory allocation.
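A launch configuration along these lines reproduces the threading setup (the training script name is a placeholder; `numactl` availability and the best allocation policy are system-dependent):

```shell
# Pin one OpenMP thread per physical core, packed close together
export OMP_NUM_THREADS=48
export OMP_PROC_BIND=close
export OMP_PLACES=cores

# NUMA-aware placement; on a single-socket EPYC the NPS domains still matter.
# train.py is a placeholder for the actual entry point:
# numactl --localalloc python train.py
```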


Model Availability

All models available at: https://huggingface.co/reaperdoesntknow

Collections: Mixture of Attentions, Qemma, SAGI - Swarm AGI Language Model, DiscoverLM


References

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł. and Polosukhin, I. (2017). Attention Is All You Need. Advances in Neural Information Processing Systems, 30.

Colca, R. (2025). McPaF-X9: Thermodynamic Phase Control for Neural Network Training. Presented at Intel oneAPI Summit, September 2025.

Colca, R. (2025-2026). DISC: Discrepancy Calculus. Convergent Intelligence LLC: Research Division. Working manuscript, 41+ chapters.


Convergent Intelligence LLC: Research Division

Roy Colca Jr. β€” Founder & Principal

Correspondence: reaperdoesntrun@gmail.com

HuggingFace: reaperdoesntknow


Convergent Intelligence Portfolio


Total Portfolio: 41 models | 2,781 total downloads

Last updated: 2026-03-28 12:58 UTC
