gemma-3-1b-it-qat-q4_0-gguf

Q4_0 quantized version of google/gemma-3-1b-it-qat-q4_0-unquantized, which differs from existing quantizations in the following aspects:

  • smaller and therefore faster than the original google/gemma-3-1b-it-qat-q4_0-gguf
  • quantization without imatrix to avoid interactions with already QAT optimized Q4_0 weights
  • various fixes regarding model metadata (see the quick check after this list)
    • added tokenizer.ggml.eot_token_id = 106 (<end_of_turn>)
    • make <start_of_image> type CONTROL
    • make <end_of_image> type CONTROL
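
As a quick check (not part of the fixes themselves), the patched end-of-turn token id can be confirmed with the gguf-dump tool that ships with the gguf Python package (pip install gguf); the file name below is a placeholder:

# dump the GGUF metadata and pick out the end-of-turn token id (expected value: 106)
gguf-dump gemma-3-1b-it-qat-q4_0.gguf | grep eot_token_id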

Created with llama.cpp release b7699 based on google/gemma-3-1b-it-qat-q4_0-unquantized@a6692c1

Inspired by ideas and discussions around stduhpf/google-gemma-3-1b-it-qat-q4_0-gguf-small

Perplexity

Keep in mind that it is questionable to what extent the perplexity of QAT and non-QAT snapshots can be compared directly, since QAT involves additional training and therefore a possible modification of the base weights. Furthermore, using an imatrix for quantized models may distort perplexity if the data used to create the imatrix overlaps with the dataset used to measure perplexity (e.g., if excerpts from Wikipedia were used in both cases). The measured values should therefore be treated with appropriate caution.

Model                                             Version  Size  PPL (wiki.test) ↓    PPL (structeval)
google/gemma-3-1b-it-BF16                         dcc83ea        29.1119 +/- 0.28169  5.5738 +/- 0.08825

QAT
msievers/gemma-3-1b-it-qat-q4_0-gguf              9063fae  688M  27.7900 +/- 0.26383  5.1258 +/- 0.07640
stduhpf/google-gemma-3-1b-it-qat-q4_0-gguf-small  6611be6  688M  27.9076 +/- 0.26491  5.3936 +/- 0.08150
google/gemma-3-1b-it-qat-q4_0-gguf                d1be121  958M  27.9111 +/- 0.26488  5.3923 +/- 0.08150
bartowski/google_gemma-3-1b-it-qat-GGUF           275cd0d  689M  30.8050 +/- 0.29885  5.4552 +/- 0.08429

Non-QAT
bartowski/google_gemma-3-1b-it-GGUF (Q5_K_M)      0d63621  812M  28.5507 +/- 0.27338  5.5230 +/- 0.08641
unsloth/gemma-3-1b-it-GGUF (Q5_K_M)               f7694be  812M  28.6145 +/- 0.27414  5.5323 +/- 0.08677
bartowski/google_gemma-3-1b-it-GGUF (Q4_K_M)      fd9cc90  769M  29.8471 +/- 0.28856  5.5828 +/- 0.08760
unsloth/gemma-3-1b-it-GGUF (Q4_K_M)               f7694be  769M  30.0600 +/- 0.29161  5.6468 +/- 0.08920

Perplexity was measured using llama-perplexity from llama.cpp release b7699 on wikitext-2-raw/wiki.test.raw and on a custom structeval dataset, using a command like the following:

llama-perplexity -t 8 -f wikitext-2-raw/wiki.test.raw -m path-to-model.gguf

How to reproduce this yourself

There is a Gist that describes in detail the steps required to create this GGUF from the Google Gemma 3 ...-it-qat-q4_0-unquantized snapshots. Since the steps are identical for all model sizes, the instructions were moved into that Gist so that every quantization can reference them:

How to quantize and fix Google Gemma 3 IT QAT GGUF
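
For orientation, here is a condensed, illustrative sketch of the kind of commands involved, assuming the Hugging Face CLI is available and a local llama.cpp checkout is built at release b7699; exact paths, file names and options may differ, and the Gist above remains the authoritative reference:

# download the unquantized QAT snapshot from Hugging Face
huggingface-cli download google/gemma-3-1b-it-qat-q4_0-unquantized --local-dir gemma-3-1b-it-qat-q4_0-unquantized

# convert the Hugging Face checkpoint to a BF16 GGUF
python llama.cpp/convert_hf_to_gguf.py gemma-3-1b-it-qat-q4_0-unquantized --outfile gemma-3-1b-it-bf16.gguf --outtype bf16

# requantize to Q4_0 without an imatrix, leaving the QAT-optimized weights untouched by importance weighting
llama.cpp/build/bin/llama-quantize gemma-3-1b-it-bf16.gguf gemma-3-1b-it-qat-q4_0.gguf Q4_0

The metadata fixes listed at the top of this card are covered in the Gist and are not shown here; llama-perplexity (see above) can then be used to check the result against the published numbers.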
