
ManniX PRO

ManniX-ITA

https://github.com/mann1x

mann1x

AI & ML interests

None yet

Recent Activity

reacted to wenhuach's post with ❤️ about 3 hours ago

🚀 We provide **free** hardware to quantize models at the [Intel Low Bit Open LLM Leaderboard](https://huggingface.co/spaces/Intel/low_bit_open_llm_leaderboard), currently supporting `Pure RTN mode` powered by AutoRound ⭐ If you find it useful, please consider starring the AutoRound project on [GitHub](https://github.com/intel/auto-round)!

liked a Space about 3 hours ago

Intel/low_bit_open_llm_leaderboard

updated a model about 3 hours ago

ManniX-ITA/gemma-4-A4B-98e-v6-coder-it

View all activity

Organizations

None yet

posted an update about 17 hours ago

Post

🚀 Gemma-4-A4B 98e v6-coder (C6v3lcb) — LCB-targeted code prune of Gemma 4 26B-A4B, 20.8B MoE (4B-active). Same C6 recipe as v5-coder, re-steered specifically at LiveCodeBench-medium — the one code bench pruning hurt most.

It not only keeps the lead on Python but also closes the gap to 1-2pp in the other coding languages.

It's actually reasoning better, fixing the under-thinking and over-thinking failures of the full experts router.

All this comes at a cost at only 20B, on top of being very specific to coding: about 3x the thinking tokens on LiveCodeBench. But it's good thinking, which brings home not only more correct answers but also, in general, more precise and concise output.

📊 SCORES (Q6_K, llama.cpp, greedy, EVAL_PROTOCOL v3)

HumanEval 98.78 — HumanEval+ 93.29 — LCB-medium-55 v4 96.36
LCB-medium-100 96.00 — MultiPL-E macro 88.00 (Rust/Java/JS)
MATH-500 91.00 — GPQA-D 67.17 — AIME 63.33 — IFEval 92.00
vs v5-coder: +10.91 LCB-medium / +7.0 MultiPL-E / +10 AIME, HE+ tie

LCB targeting closed the −9.10pp hole and pushed +1.81pp past the unpruned 128e. Top of the 14–22B coder band: +9.2pp HE over Qwen2.5-Coder-14B-Instruct (89.6 → 98.78).

📦 GGUF SWEEP (all imatrix; Q4_K_M plain — imatrix hurt it)

Q6_K — 17.81 GB — 93.29% (cohort top)
Q3_K_M — 10.51 GB — 92.68% ⭐ value leader (imatrix lifted the 3-bit tiers hard)
IQ4_XS — 11.01 GB — 92.07% ⭐ safe 4-bit
IQ3_XS — 9.22 GB — 92.07% — smallest on the plateau
IQ2_S — 7.83 GB — 89.02% — sub-8 GB code-grade

⚔️ SAME-RIG vs Qwen2.5-Coder-14B (RTX 3090, greedy)

Iso-disk 10.5 GB: Q3_K_M 92.68 vs Qwen Q5_K_M 83.54 → +9.14pp at the same file size
LCB-medium-55 v4, identical split: 96.36 vs 18.18

bf16:
ManniX-ITA/gemma-4-A4B-98e-v6-coder-it
GGUF:
ManniX-ITA/gemma-4-A4B-98e-v6-coder-it-GGUF
Ollama:
https://ollama.com/mannix/gemma4-98e-v6-coder

posted an update 4 days ago

Post

155

🚀 Gemma-4-A4B 98e v5-coder — code-leaning 20.8B MoE (4B-active), C6 layer-relevance-weighted prune of Gemma 4 26B-A4B. Best 20B-class coder I've shipped.

📊 SCORES (NVFP4A16, vLLM 0.20.2, greedy, EVAL_PROTOCOL v3)

HumanEval 98.17 — HumanEval+ 92.68 — LCB-medium-55 v4 85.45
MATH-500 92.00 — GPQA-D 68.69 — IFEval 94.00
vs v4: +1.22 HE / +1.22 HE+ / +7.27 LCB-medium

Top of the 14–22B coder band: +8.6pp HE over Qwen2.5-Coder-14B-Instruct (89.6 → 98.17). HE+ sanity-audited — no memorization, no silent-empty.

📦 EXTENSIVE GGUF SWEEP (16 plain + IQ tiers + 5 CD recipes, all imatrix-calibrated)

Q8_0 — 21.16 GB — 93.90% (cohort top)
Q4_K_S — 12.21 GB — 93.29% ⭐ plain sweet spot
IQ4_XS — 11.01 GB — 93.29% ⭐ sub-12 GB top

⭐ TWO EXCELLENT SUB-10 GB CONTRIBDYNAMIC CD PICKS (per-layer + IQ-codebook overrides)

CD-IQ4_K_M (Canary W) — 10.29 GB — 92.07% — recommended sub-11 GB
CD-IQ3_XS_L — 9.27 GB — 90.24% — smallest viable code-grade

⚔️ SAME-RIG vs Qwen2.5-Coder-14B-Instruct (RTX 3090, greedy HE+)

11 GB band: v5-coder IQ4_XS wins +9.75pp at -1.49 bpw
12 GB band: Q4_K_S wins +8.53pp
8 GB band: IQ2_S wins +0.61pp at lower bpw

bf16:
ManniX-ITA/gemma-4-A4B-98e-v5-coder-it

GGUF:
ManniX-ITA/gemma-4-A4B-98e-v5-coder-it-GGUF

NVFP4A16:
ManniX-ITA/gemma-4-A4B-98e-v5-coder-NVFP4A16

Ollama:
https://ollama.com/mannix/gemma4-98e-v5-coder

———

🆕 BONUS — Qwen3.6-27B-Omnimerge-v4-MTP-GGUF

Same v4 weights with the native MTP head retained for llama.cpp speculative decoding (PR #22673, --spec-type draft-mtp). 7 imatrix tiers Q8_0 → IQ2_M.

HumanEval: 2.0x decode tok/s
MBPP: 2.33x decode tok/s
Both at +1-2pp pass@1 vs the non-MTP build. GPQA Diamond comparison in flight.

MTP-GGUF:
ManniX-ITA/Qwen3.6-27B-Omnimerge-v4-MTP-GGUF

replied to their post 7 days ago

The v1.1.0 post had a single "one model wins all" PROD line; v1.8 has per-axis baselines from three live cohorts (May 16-17 '26).

✍️ /consultants coder — per-language routing (mlang v1.0.1, suite hash ddef8095, 6 langs × 13 questions).

Lang	Primary	Fallback
c	glm-5.1:cloud	deepseek-v4-pro:cloud
cpp	deepseek-v4-flash:cloud	kimi-k2.6:cloud
csharp	deepseek-v4-pro:cloud	kimi-k2.6:cloud
go	kimi-k2.6:cloud	deepseek-v4-pro:cloud
python	glm-5.1:cloud	kimi-k2.6:cloud
rust	deepseek-v4-flash:cloud	deepseek-v4-pro:cloud

No single model wins across languages. Out-of-cohort langs (ts/java/ruby/swift/shell) fall to global default glm-5.1:cloud → kimi-k2.6:cloud.
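
The routing table above boils down to a dictionary lookup with a global fallback. A minimal sketch, assuming nothing about how claude-hooks implements it: the model names come from the table, while `ROUTES`, `GLOBAL_DEFAULT` and `route` are hypothetical names used only for illustration.

```python
# Primary/fallback pairs copied from the routing table in the post.
# ROUTES, GLOBAL_DEFAULT and route() are hypothetical names, not claude-hooks APIs.
ROUTES = {
    "c": ("glm-5.1:cloud", "deepseek-v4-pro:cloud"),
    "cpp": ("deepseek-v4-flash:cloud", "kimi-k2.6:cloud"),
    "csharp": ("deepseek-v4-pro:cloud", "kimi-k2.6:cloud"),
    "go": ("kimi-k2.6:cloud", "deepseek-v4-pro:cloud"),
    "python": ("glm-5.1:cloud", "kimi-k2.6:cloud"),
    "rust": ("deepseek-v4-flash:cloud", "deepseek-v4-pro:cloud"),
}

# Out-of-cohort languages (ts/java/ruby/swift/shell) fall back to this pair.
GLOBAL_DEFAULT = ("glm-5.1:cloud", "kimi-k2.6:cloud")


def route(lang: str) -> tuple[str, str]:
    """Return (primary, fallback) for a language, or the global default."""
    return ROUTES.get(lang.strip().lower(), GLOBAL_DEFAULT)
```

Calling `route("rust")` returns the rust pair from the table, while `route("ts")` returns the global default.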

🛠 tool_executor — 48 trials, suite 7921555c:
• gemma4:31b-cloud — 87.5% / Q=5.00 ← winner (tiebreak: quality → wall → call count)
• kimi-k2.6 / deepseek-v4-pro / glm-5.1 — all 87.5% but lost tiebreakers
• gemini-3-flash-preview — 75% (qualifies)
• qwen3-coder-next — 62.5%, DISQUALIFIED. The Python coder winner is NOT a good tool_executor — "best at writing code" ≠ "best at mechanical tool chains for reading code".

⏱ Stall thresholds (M11a, suite c8306c62):
• kimi-k2.6:cloud cold TTFT = 150s → needs stall=390s. The global 300s default would falsely STARTUP_STALL it.
• qwen3-coder-next fastest startup (390ms).
• deepseek-v4-flash slowest p99 wall (255s) → hard_cap raised to 780s.

🧮 Rubric: pass_rate ≥ 70% AND avg_quality ≥ 3.5. Baselines append-only at docs/consultants-skill-eval-baselines.md. Re-bench any new model with claude-consultants skill-eval <suite> --live --accept-cost.
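
The rubric and the tiebreak order (quality → wall → call count) can be sketched as a filter plus a sort key. This is an illustration, not the skill-eval code: the `Result` fields, the function names, and the tiebreak directions (higher quality, lower wall time, fewer calls) are assumptions, and every number below except the quoted pass rates and the Q=5.00 is made up.

```python
from dataclasses import dataclass


@dataclass
class Result:
    model: str
    pass_rate: float    # fraction in [0, 1]
    avg_quality: float  # grader score
    wall_s: float       # wall time in seconds (hypothetical values below)
    calls: int          # tool call count (hypothetical values below)


def qualifies(r: Result, min_pass: float = 0.70, min_quality: float = 3.5) -> bool:
    """Rubric from the post: pass_rate >= 70% AND avg_quality >= 3.5."""
    return r.pass_rate >= min_pass and r.avg_quality >= min_quality


def pick_winner(results):
    """Best pass rate wins; ties go to quality, then wall time, then call count."""
    eligible = [r for r in results if qualifies(r)]
    if not eligible:
        return None
    # Assumed directions: higher quality, lower wall time, fewer calls are better.
    return min(eligible, key=lambda r: (-r.pass_rate, -r.avg_quality, r.wall_s, r.calls))


results = [
    Result("gemma4:31b-cloud", 0.875, 5.00, 120.0, 9),       # pass rate and Q from the post
    Result("kimi-k2.6", 0.875, 4.80, 140.0, 11),             # quality/wall/calls hypothetical
    Result("gemini-3-flash-preview", 0.75, 4.00, 60.0, 8),   # quality/wall/calls hypothetical
    Result("qwen3-coder-next", 0.625, 4.50, 30.0, 7),        # fails the 70% pass-rate gate
]
winner = pick_winner(results)
```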

🔗 Bench dirs: benchmarks/consultants/results/2026-05-17/

posted an update 7 days ago

Post

284

v1.1.0 was Claude + Ollama chat. Eight releases later the stack is a grounded research pipeline plus a local-first memory layer; the token crunch is operational now, not a quality wall.

🚀 claude-hooks v1.8.3 — highlights since v1.1.0.

🧠 /consultants v2 — agentic council, matured.
🛠 tool_executor — PLAN→REPORT lane runs read_file / grep / glob over the codebase before the researcher speaks; claims grounded in tool output, not vibes.
✍️ coder: sandboxed write_file role with per-language model routing (50KB/file, 1MB/lane caps).
🛡️ CitationLinter — three-layer verifier at the researcher boundary; every path:line claim checked against an mtime-cached code_graph. Catches fabricated filenames before they launder through critics + synthesizer.

💾 M14 cross-session memory (default on).
LangGraph BaseStore wired across four namespaces: research / tool_results / project / user. Per-namespace TTL: research=30d, tool_results=24h, project+user=forever. Hourly Caliber-style distillation reaper summarizes expiring research into the durable project namespace BEFORE deletion (episodic → semantic, like human consolidation). Originals only dropped after a successful summary write.
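
The TTL and summarize-before-delete behaviour can be sketched with plain dictionaries. This does not use the LangGraph BaseStore API; the store layout, `TTL_SECONDS`, `reap` and the `summary:` key prefix are all hypothetical, and only the TTL values and the ordering rule come from the post.

```python
DAY = 86400  # seconds

# TTLs from the post; None means "keep forever".
TTL_SECONDS = {"research": 30 * DAY, "tool_results": DAY, "project": None, "user": None}


def reap(store, now, summarize):
    """Drop expired items; expiring research is summarized into 'project' first.

    store: dict of namespace -> dict of key -> (created_at, text)  (hypothetical layout)
    summarize: callable(text) -> str, may raise on failure
    """
    for namespace, ttl in TTL_SECONDS.items():
        if ttl is None:
            continue
        items = store.get(namespace, {})
        expired = [key for key, (created, _) in items.items() if now - created >= ttl]
        for key in expired:
            _, text = items[key]
            if namespace == "research":
                try:
                    summary = summarize(text)
                except Exception:
                    continue  # summary failed: keep the original for the next run
                store.setdefault("project", {})["summary:" + key] = (now, summary)
            del items[key]  # only reached after a successful summary write
```

The point of the ordering is that a failed summarization never loses data: the original stays in place until a later run succeeds.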

🔁 sqlite_vec — full pgvector parity (v1.7).
Hybrid recall via RRF over vector cosine + BM25 (FTS5). KG surface: kg_create_entities / kg_add_observations / kg_create_relations / kg_search_nodes. Bundled sqlite-vec-mcp launcher went 3→8 tools so Cursor / Codex / OpenWebUI / Claude Desktop share the same .db. Lazy schema migration carries v1.6.x dbs in place, non-destructive.
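
For readers unfamiliar with RRF (Reciprocal Rank Fusion): each retriever returns a ranked list, and a document scores the sum of 1/(k + rank) over the lists it appears in. A minimal sketch of the fusion step only, assuming the commonly used k=60; the constant and the function name are assumptions, not taken from claude-hooks.

```python
def rrf_fuse(rankings, k=60):
    """Fuse ranked lists of doc ids with Reciprocal Rank Fusion.

    rankings: iterable of lists, best match first (e.g. one from vector
    cosine search, one from BM25). k=60 is a common default, assumed here.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


vector_hits = ["a", "b", "c"]  # hypothetical vector-search ranking
bm25_hits = ["b", "d", "a"]    # hypothetical BM25 ranking
fused = rrf_fuse([vector_hits, bm25_hits])
```

Because only ranks are used, the cosine and BM25 scores never have to be put on a common scale.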

🧩 llamafile chat + embed (v1.4 + v1.5).
HyDE / reflect / consolidate / get-advice / consultants route to a daemon-supervised local llamafile via the llamafile:// model prefix. Multi-instance LRU, per-label idle reap, sticky CPU fallback. Stack runs offline now.

🐧 Linux / macOS / Windows. PostgreSQL OR SQLite. Local OR cloud LLMs.

🔗 github.com/mann1x/claude-hooks

1 reply

reacted to Knowurknot's post with 🔥 9 days ago

Post

165

Knowurknot/git-hologram git nexus is the one you probably really want to see, but I can't link to it here, so this will have to work. this is cool though, just not really showing off the CAPT with this demo

reacted to danielhanchen's post with 🤝 17 days ago

Post

7658

We collaborated with NVIDIA to teach you how we made LLM training ~25% faster! 🚀

Learn how 3 optimizations help your home GPU train models faster:
1. Packed-sequence metadata caching
2. Double-buffered checkpoint reloads
3. Faster MoE routing

Guide: https://unsloth.ai/blog/nvidia-collab
GitHub: https://github.com/unslothai/unsloth

posted an update 18 days ago

Post

1533

After the Feb/Mar '26 collapse of Claude Code I started building my own framework. The token crunch is still only mitigated, but the reasoning and quality are back, better than before. For research and LLM training recipes, though, diverse knowledge and a second point of view are crucial. Pairing my Claude Max sub with an Ollama Pro sub has already saved me from days of botched trainings; multiple frontier models helping Claude is next level. Acting as the middleman myself was interesting but inefficient, so I shipped skills that let Claude talk to Ollama models directly.

🚀 claude-hooks v1.1.0 ships two LLM-to-LLM skills.

💬 /get-advice: single-shot second opinion. Claude runs a multi-turn conversation with a configured Ollama advisor; the advisor grounds in your project through read_file / grep / glob / list_files / recall_memory tools. Effort tiers cap fresh-session retries.

🤝 /consultants — multi-agent council for cross-cutting questions:
🧩 planner → researcher → critic → synthesizer

Each role runs its own Ollama model. 💾 Sessions persist to disk (summary.md + transcript.md + SQLite per-role message threads); closed sessions reopen and produce follow-ups 🔁 indistinguishable from warm ones.

🎯 x-tier effort multiplies diversity:
• xmedium / xhigh — researcher fans across N models in parallel
• xmax — + multi-critic + meta-critic combine; critics anonymized as "Critic 1/2/3" to avoid model-bias

🛡️ Cloud-flap recovery, three layers: 15-attempt / ~15min retry budget; synthesizer failure-fallback model chain; degraded-answer composer surfaces researcher + critic work even when synthesis fails.

📊 7 cloud models benchmarked & Claude-graded on locked queries:
• PROD-READY (P:A R:A C:A S:A): kimi-k2.6, gemma4:31b, glm-5.1
• Role specialists: minimax-m2.7 (critic), qwen3.5 (planner)

🐧 Linux/macOS/Windows. No per-project setup.

🔗 github.com/mann1x/claude-hooks

reacted to mlabonne's post with 👍 21 days ago

Post

1903

Big update to llm-datasets, my curated list of datasets and tools for post-training LLMs.

> Added many new datasets
> New "thinking" column
> Refreshed recommended tools.

Thanks to everyone who told me they used it for their research at ICLR, you motivated this update!

2 replies

reacted to DedeProGames's post with 🔥 21 days ago

Post

8574

GRaPE 2 Pro is now available.

SL-AI/GRaPE-2-Pro

This is the flagship model of the GRaPE 2 family and the largest model I have trained to date, sitting at 27B parameters. It is built on Qwen3.5-27B and trained on a closed-source proprietary dataset, with roughly half of post-training focused on code and the rest split between STEAM subjects and structured logical reasoning. It punches seriously above its weight class.

GRaPE 2 Pro supports multimodal input (image + text) and features 6 thinking modes via the tag. This gives you real control over how hard the model thinks, from skipping the reasoning phase entirely with minimal, all the way up to xtra-Hi for deep, extended thought on hard problems. For most agentic use, auto or low is the move to keep things snappy.

It also runs on consumer hardware. You can get it going with as low as 12GB of VRAM on a quantized build.

If you want to try it out and give feedback, that would be really appreciated. Email us at contact@skinnertopia.com

1 reply

posted an update 21 days ago

Post

254

🚀 Exciting week, 2 new research projects and 2 new tools!

▶ Mythic-RDT - OpenMythos blueprint with a retrofit-recurrence fine-tune
https://github.com/mann1x/Mythic-RDT

▶ cross-tokenizer-distill (CTD) - knowledge distillation across different tokenizer vocabularies
https://github.com/mann1x/cross-tokenizer-distill

For Mythic-RDT, I chose the fairly outdated DS-Coder-V2 16B.
It's small enough not to need more than 48GB VRAM. But once I leaned on KL for the depth-recurrent fine-tune (I couldn't get T=4 above parity with T=1, which is not great for 4x the inference time), I started investigating the KL recipe and questioned the teacher, the same DS-Coder-V2 but at BF16.
The only option for a better teacher would have been DS-Coder-V2-236B. Not only is it so big that I'd need 4xH100 to run it, but it is also surpassed even by Qwen3-Coder-32B on HE/MBPP.
Hence the CTD tool: validated, but still in development to find a good recipe for Qwen->DS distillation.

▶ Qwen3.5-4B-MicroCoder - code-leaning and reasoning merge of Qwen3.5-4B
ManniX-ITA/Qwen3.5-4B-MicroCoder

▶ Omnimergekit - merge toolkit, merge & quantization scripts, experiments logs
https://github.com/mann1x/omnimergekit

You can find my merge toolkit and scripts in the repo, so they don't get scattered over the HF repos.
An interesting experiment with MicroCoder: there were only a couple of base-model coding fine-tunes, with broken reasoning, to merge with the excellent instruct reasoning model JackRong-v2.
The result is not truly exciting, but it manages to improve LiveCodeBench above JR-v2 and improve MBPP without completely breaking reasoning.
This is achieved with omnimergekit using differential signals: deltas against the base model, generated from the difference between the sources' correct and wrong answers (HE/MBPP/AIME).
The very long eval sessions proved that the method does not just bias the scores of these evals but also improves others, even above the baseline.

replied to their post 25 days ago

🔄 Follow-up to the M1..M5 study above — re-ran M4 against the updated PR #682 head ("turbo" branch by @Tusm11, HEAD 8d989f6) with rebalanced hyperparams (equal weights w=1/1, density 0.7, closer to the PR's worked example). Same AttnLRP signal as M4-orig, same sources.

▶ Qwen3.5-4B-M4-v2-ex-LRP-turbo
ManniX-ITA/Qwen3.5-4B-M4-v2-ex-LRP-turbo

Q6_K HE / MBPP pass@1, M4-v2 inserted:

M1 Vanilla DARE-TIES → 51.22 / 47.00
M2 OMv2 (no signal) → 52.44 / 49.40
M3 OMv2 + Fisher → 57.93 🥇 / 48.80
M4 ex-LRP (PR #682 orig) → 51.22 / 49.40
M4-v2 ex-LRP (PR #682 turbo) → 55.49 / 52.20 🥇
M5 OMv2 + LRP → 53.05 / 51.40

Δ M4-v2 vs M4-orig: +4.27 pp HE, +2.80 pp MBPP. M4-v2 takes the MBPP medal of the whole study (overtakes M5) while staying competitive on HumanEval. The turbo code path + rebalanced hyperparams clearly beat the original PR head on this configuration.
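
The deltas quoted above can be recomputed directly from the table. A small sketch; the scores are copied from the table, while `SCORES`, `delta_pp` and `best` are hypothetical helper names.

```python
# (HumanEval, MBPP) pass@1 at Q6_K, copied from the table above.
SCORES = {
    "M1": (51.22, 47.00),
    "M2": (52.44, 49.40),
    "M3": (57.93, 48.80),
    "M4": (51.22, 49.40),
    "M4-v2": (55.49, 52.20),
    "M5": (53.05, 51.40),
}


def delta_pp(a, b):
    """Percentage-point difference (HE, MBPP) of run a over run b."""
    return tuple(round(x - y, 2) for x, y in zip(SCORES[a], SCORES[b]))


def best(index):
    """Run with the highest score on column index (0 = HE, 1 = MBPP)."""
    return max(SCORES, key=lambda name: SCORES[name][index])
```

`delta_pp("M4-v2", "M4")` gives the +4.27 / +2.80 pp quoted above, and `best(0)` / `best(1)` return M3 and M4-v2, matching the two medals.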

Findings refresh: Fisher still leads HE; ex-LRP (turbo) now leads MBPP, narrowly ahead of OMv2+LRP. Both LRP variants land within 1 pp on MBPP — strong signal that LRP-driven sparsification is doing real work for code-gen on small Qwen merges.

Big thanks to @Tusm11 for the supercharged ex-LRP turbo head — multimodal support + Iron-Man stabilization + in-place math are a real upgrade. Posted full results + the 6 patches needed to run it against Qwen3_5ForConditionalGeneration on the PR thread:
https://github.com/arcee-ai/mergekit/pull/682

posted an update 26 days ago

Post

3018

🚀 Two releases this week pushing merge methodology forward.

▶ Qwen3.6-27B-Omnimerge-v4-MLP
ManniX-ITA/Qwen3.6-27B-Omnimerge-v4

Same-base DARE-TIES merge of Qwen3.6-27B + 3 fine-tunes (rico03 Claude distill, Esper3.1, kai-os Opus reasoning anchor) via my Omnimerge_v2 method (OBIM-lite + DAREx-q + EMR election).

Hit a Qwen3.6-specific fragility: hyperparams that work flawlessly on 3.5 produced 80% unclosed-<think> on 3.6, collapsing pass@1 to ~20%. Per-tensor delta forensics localized the failure to mlp.{gate,up,down}_proj in layers 27–52. Fix: MLP-passthrough surgery (copy MLPs verbatim from base, keep merged attn + linear_attn). Leak → 0%.

Q6_K results (vs Qwen3.6 base / vs Omnimerge-v2 on Qwen3.5):
• HumanEval: 84.76% (= base, +5.49 pp vs v2)
• MBPP corrected: 73.40% (+15.80 pp vs base, ≈ v2)
• GPQA Diamond: ~84.75% partial 192/198 (+15.5 pp vs v2)

▶ Qwen3.5-4B Importance-Signal Study (M1..M5)

Controlled 5-way comparison: same Qwen3.5-4B base, same 2 fine-tunes (Jackrong Claude-4.5 distill + Crow Opus-4.6 distill), only the importance signal driving DARE-TIES sparsification varies.

Q6_K HE / MBPP pass@1:
• M1 Vanilla DARE-TIES → 51.22 / 47.00
• M2 OMv2 (no signal) → 52.44 / 49.40
• M3 OMv2 + Fisher → 57.93 🥇 / 48.80
• M4 mergekit ex-LRP (PR #682) → 51.22 / 49.40
• M5 OMv2 + LRP → 53.05 / 51.40 🥇

Findings: Fisher wins HE (+4.88 pp over OMv2 + LRP, +6.71 pp over vanilla), LRP wins MBPP (+2.60 pp over OMv2 + Fisher, +4.40 pp over vanilla). Both signals + Omnimerge_v2 recipe beat vanilla. To make multimodal-LM ex-LRP work end-to-end against Qwen3_5ForConditionalGeneration, I filed 5 patches against arcee-ai/mergekit PR #682 + 1 against rachtibat/lxt.

All five Mx checkpoints + Fisher/LRP signal safetensors + reproducer scripts published.

1 reply

posted an update 30 days ago

Post

233

Two custom releases — both unusual takes on common problems, on a single RTX 3090 + a vast.ai pod.

🔹 ManniX-ITA/Qwen3.5-27B-Omnimerge-v2

3-source weight-space merge over Qwen3.5-27B combining OBIM-lite magnitude masking + DAREx rescaling + EMR election (sign from consensus, amplitude from max-abs across sources). GPU-accelerated, ~35× over CPU.

Sources: Claude-4.6-Opus-distill (0.40), Esper3.1 code (0.35), Gemini-3.1-Pro-distill (0.25). density 0.53, DAREx q 0.75.

Q6_K vs best source:
• GPQA Diamond: 53.03 → 69.19 (+16.16 pp)
• MBPP pass@1: 71.20 → 74.60 (+3.40)
• HumanEval pass@1: 76.22 → 79.27 (+3.05)

vs Omnimerge v1 (vanilla DARE-TIES): +8.08 pp GPQA, +2.80 MBPP. Amplitude-from-max + sign-from-consensus is what unlocked the GPQA jump.
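
To make "sign from consensus, amplitude from max-abs" concrete, here is a toy per-element election over task vectors (source minus base). It is a sketch under assumptions, not the Omnimerge_v2 code: the consensus sign is taken as the sign of the summed deltas, the amplitude is the largest magnitude among sources that agree with that sign, and source weights, OBIM-lite masking and DAREx rescaling are left out. The `elect` name is hypothetical.

```python
def elect(deltas):
    """Elect one merged delta per parameter from several sources.

    deltas: list of equal-length lists, each a source's (weights - base).
    Assumption: consensus sign = sign of the sum across sources; amplitude =
    max |delta| among the sources whose sign matches the consensus.
    """
    unified = []
    for values in zip(*deltas):
        total = sum(values)
        sign = (total > 0) - (total < 0)  # -1, 0 or +1
        agreeing = [abs(v) for v in values if v * sign > 0]
        unified.append(sign * max(agreeing) if agreeing else 0.0)
    return unified


# Three hypothetical sources, three parameters each.
merged = elect([
    [0.2, -0.1, 0.3],
    [0.5, 0.4, -0.3],
    [-0.1, 0.2, 0.0],
])
```

In the first column two sources push positive and the larger of them (0.5) is kept; in the last column the sources cancel out, so the merged delta is zero and the base weight is left untouched.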

🔹 ManniX-ITA/gemma-4-A4B-98e-v3-it

Gemma 4 26B-A4B pruned 128 → 98 experts/layer (-23.4% MoE capacity, -5.2B params), zero GPQA degradation.

GPQA Diamond:
• 128e reference: 75.25%
• 98e v3 (this): 75.25% — +0.00 pp despite -23.4% capacity, -5.2B params
• 109e v3 (older): 71.72% — -3.53 pp

The win over 109e v3 came from changing the importance map: aggregate per-expert contribution across math/logic/code/science/creative via 128-token teacher-force, instead of GPQA-specific per-question top-16 (which overfitted). Result: more experts dropped, quality preserved.

Findings worth flagging:
• Experts NOT topic-specialized — 28/32 overlap math/creative top-32.
• Expert weight cosine ≈ 0.05 max → merging destroys the model. Dropping is the only viable structural compression here.
• Contribution Gini ≈ 0.38 → ~75 experts/layer carry 80% of signal.
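
The last finding combines two statistics: a Gini coefficient over per-expert contributions, and the number of experts needed to cover a given share of the total. A minimal sketch of both, assuming a plain list of non-negative contribution scores for one layer; the function names are hypothetical and the real importance map is not reproduced here.

```python
def gini(values):
    """Gini coefficient of non-negative values (0 = perfectly even)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    weighted = sum(i * x for i, x in enumerate(xs, start=1))
    return 2.0 * weighted / (n * total) - (n + 1) / n


def experts_for_share(values, share=0.80):
    """Smallest number of top experts whose contributions reach `share` of the total."""
    total = sum(values)
    running = 0.0
    for count, x in enumerate(sorted(values, reverse=True), start=1):
        running += x
        if running >= share * total:
            return count
    return len(values)
```

Applied to a layer's 128 contribution scores, `gini` gives the concentration figure and `experts_for_share(scores, 0.80)` gives the "experts carrying 80% of signal" count.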

Eval: lm-eval gpqa_diamond_cot_zeroshot, llama-server --reasoning-format deepseek --reasoning-budget 8192, Gemma 4 official sampling. Feedback welcome.

reacted to mlabonne's post with 🚀 5 months ago

Post

10373

New family of 1B models just dropped!

> LiquidAI/LFM2.5-1.2B-Base: 10T → 28T tokens
> LiquidAI/LFM2.5-1.2B-Instruct: new large-scale multi-stage RL
> LiquidAI/LFM2.5-1.2B-JP: our most polite model
> LiquidAI/LFM2.5-VL-1.6B: multi-image multilingual
> LiquidAI/LFM2.5-Audio-1.5B: 8x faster, no quality loss

Super proud of this release 🤗

3 replies

reacted to hesamation's post with ❤️ 9 months ago

Post

13978

a senior engineer at google just dropped a 400-page free book on docs for review: agentic design patterns.

the table of contents looks like everything you need to know about agents + code:
> advanced prompt techniques
> multi-agent patterns
> tool use and MCP
> you name it

read it here: https://docs.google.com/document/d/1rsaK53T3Lg5KoGwvf8ukOUvbELRtH-V0LnOIFDxBryE/edit?tab=t.0#heading=h.pxcur8v2qagu

you can also pre-order on Amazon (published by Springer) and the royalties go to Save the Children: https://www.amazon.com/Agentic-Design-Patterns-Hands-Intelligent/dp/3032014018/

reacted to ehristoforu's post with 👍 almost 2 years ago

Post

6420

🤗 Hello from the Project Fluently team!

🥏 We are ready to announce a new series of models, Supple Diffusion: new-generation diffusion models (about 1-2 weeks left before release).

🦾 The new series aims to take diffusion models to the next level, with performance and versatility as the main goal.

🧐 How will our models be better than others? First, we worked on the CLIP models: they now understand your requests better, so your requests become easier to process. Second, we trained the models to a high quality, even better than all our previous ones. Third, you won’t have to keep 20 models on your disk; only 4-6 will be enough.

🗺️ Roadmap:
1. Create Supple Diffusion Small
2. Create Supple Diffusion Medium
3. Create Supple Diffusion Large

🎆 Our models are universal for realism, and for cartoons, and for anime, and for caricatures.

💖 The project really needs your support and your recommendations and reviews, please do not hesitate to write comments under this post, thank you!

🖼️ Below are demo images made with the pre-release version of Supple Diffusion Small.

4 replies

reacted to leonardlin's post with 👍 almost 2 years ago

Post

2141

My weekend project ended up being some testing between torchtune, axolotl, and unsloth. I *think* it's a 1:1 comparison of what LoRA fine-tuning performance looks like between the different hardware I have in my dev boxes (4090, 3090, 7900 XTX, W7900) with a few other interesting tidbits.

Tonight I wrote up a WandB report (the panel editor is super broken in Firefox 😔) that sums up some of the more interesting bits from the results: https://wandb.ai/augmxnt/train-bench/reports/torchtune-vs-axolotl-vs-unsloth-Trainer-Comparison--Vmlldzo4MzU3NTAx

1 reply

reacted to mlabonne's post with ❤️ about 2 years ago

Post

9614

⚡ AutoQuant

AutoQuant is the evolution of my previous AutoGGUF notebook (https://colab.research.google.com/drive/1P646NEg33BZy4BfLDNpTz0V0lwIU3CHu). It allows you to quantize your models in five different formats:

- GGUF: perfect for inference on CPUs (and LM Studio)
- GPTQ/EXL2: fast inference on GPUs
- AWQ: super fast inference on GPUs with vLLM (https://github.com/vllm-project/vllm)
- HQQ: extreme quantization with decent 2-bit and 3-bit models

Once the model is converted, it automatically uploads it on the Hugging Face Hub. To quantize a 7B model, GGUF only needs a T4 GPU, while the other methods require an A100 GPU.

Here's an example of a model I quantized using HQQ and AutoQuant: mlabonne/AlphaMonarch-7B-2bit-HQQ

I hope you'll enjoy it and quantize lots of models! :)

💻 AutoQuant: https://colab.research.google.com/drive/1b6nqC7UZVt8bx4MksX7s656GXPM-eWw4

19 replies
