A draft model with fewer parameters, for speculative decoding?
Been using your Q5_K_M quant and like it so far, although there is some looping happening from time to time.
I'd like to try speculative decoding next and, if my understanding of how it works is correct, one would need a trimmed-down version of the model as a draft. I'm downloading a smaller quant (IQ3_M) at the moment that I hope to run on the CPU, but I don't think that's the right way of doing it.
Would you happen to have a draft-model version of this distill, something with fewer parameters, similar to what other models like Qwen have?
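For context, here's roughly how I understand the setup would look with llama.cpp's llama-server (the model paths below are just placeholders, and the draft file is hypothetical - that's the part I'm missing):

```shell
# Sketch of speculative decoding with llama-server (paths are placeholders):
#   -m    = main (target) model
#   -md   = smaller draft model that proposes tokens for the main model to verify
#   --draft-max / --draft-min = how many tokens to draft per step
#   -ngl  = GPU layers for the main model, -ngld = GPU layers for the draft
llama-server -m GLM-4.7-Q5_K_M.gguf -md glm-draft-small.gguf \
  --draft-max 16 --draft-min 4 -ngl 99 -ngld 0
```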
Never mind, I just realized that this GLM-4.7 is an A3B model too (only ~3B parameters are active per token), so there's little point in a draft model with fewer parameters.
If it's still a bit slow on your hardware (like it is on mine) I can make an attempt at distilling the REAP variant of this model.
I use it for code assistance on a dual Xeon E5 v3 with 256GB RAM and an RTX 3090.
If there were a REAP version it might allow loading a higher quant, though I see there already are REAP versions from unsloth. I have not had much luck with any of their GLMs; they keep looping for me after a while, which makes them unusable despite following their suggested settings.
To be honest, I'm really on the fence about giving up on GLM and switching to Qwen3(-Coder), because so many people are saying it's better. I see you already have a number of Qwen3 variants done - would you suggest one as a starting point?
Thanks.
Honestly, it depends on your use case. If it's nothing more demanding than a simple email agent or a basic assistant, any of them will work great.
For coding use cases, though, the only one I can recommend (other than this model) is the 30B-A3B Claude reasoning model, although I am very impressed with the code generation of the newly released MiniMax M2.1 Coder models. Either way, our models are great for chat, but I'm sure Qwen3-Coder will generally perform better on coding tasks than any of our distills.
None of our models have been trained with reinforcement learning, so they are really just trying to copy the outputs of their teacher models. Applying reinforcement learning on top of our released SFT models is really the best way to see what they're capable of.
Circling back to your looping issue (no pun intended): I have experienced the same thing myself using GLM-4.7-Flash and with our distill of it. In most cases with this model I have had to retry the same prompt two or three times to get a good output, but when the output is good... it's really good :)
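In case it helps with the looping, these are the llama.cpp sampling knobs I'd poke at first - just common starting points to experiment with, not official settings for this model (model path is a placeholder):

```shell
# Common anti-repetition settings for llama-cli / llama-server:
#   --repeat-penalty / --repeat-last-n = classic repetition penalty and its window
#   --dry-multiplier = enables the DRY sampler, which targets exact repeated sequences
# Values here are illustrative starting points, not recommendations.
llama-cli -m GLM-4.7-Q5_K_M.gguf \
  --temp 0.7 --top-p 0.95 \
  --repeat-penalty 1.05 --repeat-last-n 256 \
  --dry-multiplier 0.8
```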
Coding-wise I mostly do Python and C/C++, and not so much front-end languages.
So you're saying to go for SFT and not GGUF, without a distill? I'll give the MiniMax Coder a try in SFT then, thanks!
And yes, with GLM-4.7-Flash I have to repeat my prompt, or even reload the entire model and wait for the context (90k+) to process again, when it starts looping. Even then it doesn't follow my instructions that well at the end...
Lol no, sorry for adding to your confusion. Supervised Fine-Tuning (SFT) is the training method used to create this model. GGUF is the file format that llama.cpp (and other popular C++-based inference engines) use. So in summary: still use GGUF, since that's what your inference engine reads. I strongly recommend just using Qwen3-Coder for most of your tasks; you will most likely find the most success there.
I think I knew that, but I should pay more attention to capitalization (some people use sft to refer to the .sft/Safetensors format)... bleh.
Thanks again!
Ohhhh, that makes sense, my fault for the confusion!