# gemma-3-1b-it-qat-q4_0-gguf
Q4_0 quantized version of google/gemma-3-1b-it-qat-q4_0-unquantized, which differs from existing quantizations in the following aspects:

- smaller and therefore faster than the original google/gemma-3-1b-it-qat-q4_0-gguf
- quantization without imatrix, to avoid interactions with the already QAT-optimized Q4_0 weights
- various fixes regarding model metadata (a sketch for checking these fields follows below the list)
  - added `tokenizer.ggml.eot_token_id = 106` (`<end_of_turn>`)
  - make `<start_of_image>` type `CONTROL`
  - make `<end_of_image>` type `CONTROL`
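If you want to check these metadata fields on a downloaded file yourself, the gguf Python package from llama.cpp's gguf-py includes a small dump utility. A minimal sketch, assuming the `gguf-dump` entry point of a recent gguf release; the local file name is a placeholder:

```bash
# Install the gguf helper package from llama.cpp's gguf-py
# (script names may differ slightly between releases).
pip install gguf

# Dump the key/value metadata of the downloaded GGUF (placeholder file name)
# and look for the end-of-turn token id added above.
gguf-dump gemma-3-1b-it-qat-q4_0.gguf | grep -i "eot_token_id"
```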
Created with llama.cpp release b7699, based on google/gemma-3-1b-it-qat-q4_0-unquantized@a6692c1.
Inspired by ideas and discussions around stduhpf/google-gemma-3-1b-it-qat-q4_0-gguf-small
## Perplexity
Keep in mind that it is questionable to what extent the perplexity of QAT and non-QAT snapshots can be directly compared, as QAT already involves additional training and thus possible modification of the base weights. Furthermore, the use of an imatrix for quantized models can distort perplexity if the data used to create the imatrix overlaps with the dataset used to measure perplexity (e.g., if excerpts from Wikipedia were used in both cases). The measured values should therefore be treated with appropriate caution.
| Model | Version | Size | PPL (wiki.test) ↓ | PPL (structeval) |
|---|---|---|---|---|
| google/gemma-3-1b-it-BF16 | dcc83ea | | 29.1119 +/- 0.28169 | 5.5738 +/- 0.08825 |
| **QAT** | | | | |
| msievers/gemma-3-1b-it-qat-q4_0-gguf | 9063fae | 688M | 27.7900 +/- 0.26383 | 5.1258 +/- 0.07640 |
| stduhpf/google-gemma-3-1b-it-qat-q4_0-gguf-small | 6611be6 | 688M | 27.9076 +/- 0.26491 | 5.3936 +/- 0.08150 |
| google/gemma-3-1b-it-qat-q4_0-gguf | d1be121 | 958M | 27.9111 +/- 0.26488 | 5.3923 +/- 0.08150 |
| bartowski/google_gemma-3-1b-it-qat-GGUF | 275cd0d | 689M | 30.8050 +/- 0.29885 | 5.4552 +/- 0.08429 |
| **Non-QAT** | | | | |
| bartowski/google_gemma-3-1b-it-GGUF (Q5_K_M) | 0d63621 | 812M | 28.5507 +/- 0.27338 | 5.5230 +/- 0.08641 |
| unsloth/gemma-3-1b-it-GGUF (Q5_K_M) | f7694be | 812M | 28.6145 +/- 0.27414 | 5.5323 +/- 0.08677 |
| bartowski/google_gemma-3-1b-it-GGUF (Q4_K_M) | fd9cc90 | 769M | 29.8471 +/- 0.28856 | 5.5828 +/- 0.08760 |
| unsloth/gemma-3-1b-it-GGUF (Q4_K_M) | f7694be | 769M | 30.0600 +/- 0.29161 | 5.6468 +/- 0.08920 |
Perplexity was measured with llama-perplexity from llama.cpp release b7699 on wikitext-2-raw/wiki.test.raw and on a custom structeval dataset, using a command like the following:

`llama-perplexity -t 8 -f wikitext-2-raw/wiki.test.raw -m path-to-model.gguf`
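For context, the reported number is the usual token-level perplexity of the evaluation text. This is the textbook definition rather than anything specific to llama.cpp's implementation details (chunking, context size):

$$
\mathrm{PPL} = \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N}\log p(x_i \mid x_{<i})\right)
$$

Lower values mean the model assigns higher probability to the held-out text, which is why lower is better in the table above.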
## How to reproduce this yourself
There is a Gist that describes in detail the steps required to create this GGUF from the Google Gemma 3 ...-it-qat-q4_0-unquantized snapshots. Since the steps are the same for all model sizes, the instructions are kept in a separate document so that they can be referenced from each quantization.
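The Gist is the authoritative reference; purely as orientation, the core of the process looks roughly like the following sketch. These are not the exact commands from the Gist: the repository URL, file names, and the intermediate BF16 step are assumptions based on current llama.cpp tooling.

```bash
# Get llama.cpp and the Python requirements for its conversion script.
git clone https://github.com/ggml-org/llama.cpp
pip install -r llama.cpp/requirements.txt

# 1. Convert the locally downloaded unquantized QAT snapshot to a
#    full-precision GGUF (placeholder paths/names).
python llama.cpp/convert_hf_to_gguf.py ./gemma-3-1b-it-qat-q4_0-unquantized \
    --outtype bf16 --outfile gemma-3-1b-it-qat-bf16.gguf

# 2. Requantize to Q4_0 without an imatrix, so the QAT-optimized weights are
#    not disturbed by importance weighting. Requires a built llama.cpp to
#    provide the llama-quantize binary.
llama-quantize gemma-3-1b-it-qat-bf16.gguf gemma-3-1b-it-qat-q4_0.gguf Q4_0

# 3. Apply the metadata fixes listed above, e.g. with the gguf-set-metadata /
#    gguf-new-metadata helpers from the gguf Python package.
```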