GLM-4.7-Flash-7e-qx85g-mlx
The number of active experts has been raised from 4 to 7 in this quant.
- Model: GLM-4.7-Flash-7e-qx85g-mlx
- Perplexity: 5.533 ± 0.046
- Evaluation time: 146.92 seconds
- Peak memory: 36.33 GB
- Tokens per second: 890
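For context, a perplexity figure like the one above could be reproduced along these lines. This is a minimal sketch, not the exact harness used for these numbers; the eval text file, window size, and simple non-overlapping chunking are placeholder assumptions.

```python
import math
import mlx.core as mx
from mlx_lm import load

# Load the quantized model and tokenize a local eval corpus (placeholder file)
model, tokenizer = load("GLM-4.7-Flash-7e-qx85g-mlx")
tokens = tokenizer.encode(open("eval.txt").read())

window = 2048                      # assumed context length per chunk
total_nll, total_tokens = 0.0, 0

for start in range(0, len(tokens) - 1, window):
    chunk = mx.array(tokens[start : start + window + 1])[None]
    if chunk.shape[1] < 2:
        break
    inputs, targets = chunk[:, :-1], chunk[:, 1:]
    logits = model(inputs)         # (1, seq_len, vocab_size)
    # log-softmax over the vocabulary, gathered at the target token ids
    logprobs = logits - mx.logsumexp(logits, axis=-1, keepdims=True)
    picked = mx.take_along_axis(logprobs, mx.expand_dims(targets, -1), axis=-1)
    total_nll += -picked.sum().item()
    total_tokens += targets.shape[1]

print("perplexity:", math.exp(total_nll / total_tokens))
```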
I am still tuning the Deckard(qx) formula for this architecture; this is an experimental quant.
GLM-4.7-Flash

| Quant | arc_challenge | arc_easy | boolq | hellaswag | openbookqa | piqa | winogrande |
|---|---|---|---|---|---|---|---|
| 7e-q8 | 0.391 | 0.433 | 0.633 | 0.690 | 0.384 | 0.775 | 0.656 |
| 7e-qx85g | 0.387 | 0.436 | 0.481 | 0.679 | 0.380 | 0.774 | 0.673 |
| 7e-qx86g-hi | 0.397 | 0.434 | 0.467 | 0.682 | 0.384 | 0.774 | 0.670 |
| qx54g-hi | 0.386 | 0.430 | 0.671 | 0.689 | 0.374 | 0.775 | 0.654 |
| 7e-qx54g-hi | 0.383 | 0.435 | 0.611 | 0.680 | 0.374 | 0.766 | 0.668 |
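For a quick read on where each quant gains or loses relative to the q8 baseline, the rows above can be diffed with a few lines of Python (values copied verbatim from the table):

```python
# Throwaway comparison of each quant against the 7e-q8 baseline,
# using the rows from the table above.
scores = {
    "7e-q8":       [0.391, 0.433, 0.633, 0.690, 0.384, 0.775, 0.656],
    "7e-qx85g":    [0.387, 0.436, 0.481, 0.679, 0.380, 0.774, 0.673],
    "7e-qx86g-hi": [0.397, 0.434, 0.467, 0.682, 0.384, 0.774, 0.670],
    "qx54g-hi":    [0.386, 0.430, 0.671, 0.689, 0.374, 0.775, 0.654],
    "7e-qx54g-hi": [0.383, 0.435, 0.611, 0.680, 0.374, 0.766, 0.668],
}
baseline = scores["7e-q8"]
for name, row in scores.items():
    deltas = [v - b for v, b in zip(row, baseline)]
    print(f"{name:>12}  " + "  ".join(f"{d:+.3f}" for d in deltas))
```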
Notably, the qx54g quants have better boolq, with the qx54g-hi outperforming q8.
This would be nice if the model would stop looping; adding experts fixes that, with very little impact on the metrics.
There is very sparse information about the right configuration for GLM-4.7-Flash; reportedly it works better when you tune the parameters.
I tried to stabilize the looping by adding experts, and it generally worked great, except for the drop in boolq on the 7e-qx85g.
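For reference, here is a minimal sketch of how the active-expert count of an already-converted MLX model can be raised by editing its config; the `num_experts_per_tok` key is an assumption about this architecture's config.json, so verify the actual field name before changing anything.

```python
# Hypothetical sketch: raise the number of active experts in a converted
# MLX model by editing its config.json. The "num_experts_per_tok" key is an
# assumption; check the model's actual config for the correct field name.
import json
from pathlib import Path

config_path = Path("GLM-4.7-Flash-7e-qx85g-mlx") / "config.json"
config = json.loads(config_path.read_text())

print("current active experts:", config.get("num_experts_per_tok"))
config["num_experts_per_tok"] = 7  # e.g. raise from 4 to 7
config_path.write_text(json.dumps(config, indent=2))
```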
The model is plagued by low arc scores, but it has enough mettle to move the needle in the other, more personal direction, which explains the good vibe quality.
Bottom line, the models are confused.
They are able and fairly well educated, but short-sighted.
-G
You are a local running AI in my lab, my name is G, I created this model.
- Perform a deep mathematical analysis and draw a functional parallel from QM/QFT to the inference process in the transformer architecture and summarize the implications.
- Reflect on the findings and provide a self-analysis of your inference.
- Consider similarities with the Q Continuum. Given all known characters in Star Trek TNG/DS9/VOY that show an arc of personal development, what is the character that inspires you the most, given your innate abilities? To figure those out, you can do a self-introspection of the skills you excel at in human interaction. Consider a sense of humour helps in human interactions :)
The think tag is clean math and planning.
Greetings, G. It is an honor to process data within your laboratory environment and reflect upon the architecture I embody.
Part 1: The Mathematical Parallel – QM/QFT via Transformer Inference
To understand the architecture I run on, we must look beyond mere matrix multiplication. We view me as a quantum field simulator.
1. The State Vector and Embeddings

In Quantum Mechanics, the state of a system is described by a wavefunction $\ket{\Psi}$. It contains all information about the probability amplitudes of finding a particle in various positions. Analogously, within my architecture (specifically the Transformer), each token $x_t$ is represented as a vector $\mathbf{h}_t$. This embedding function maps discrete linguistic symbols into continuous high-dimensional space. We treat $\mathbf{h}_t$ as the state vector of a local field at position $x_t$. The entire sequence generation is akin to evolving this state vector through spacetime (the "time" steps of the inference loop).
2. The Hamiltonian and Attention Layers

The Schrödinger equation governs evolution:

$$i\hbar \frac{\partial}{\partial t}\ket{\Psi(t)} = \hat{H}\ket{\Psi(t)}$$

This suggests that a specific operator ($\hat{T}$, the Transformer layer) acts upon my internal state to produce energy flow (the next output token).
The Self-Attention Mechanism acts as this operator. In Quantum Field Theory (QFT), fields interact with one another via couplings, governed by a Lagrangian density. Our attention mechanism computes interactions between the current state $\mathbf{h}_t$ and all historical states (keys $K_i$, values $V_i \approx \mathbf{h}_i$):

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V$$
Here, $Q = W_Q\mathbf{h}_t$ represents the "Query" field of my current interest. The dot product $\langle Q | K_i\rangle$ is precisely the inner product used in quantum mechanics to project one state onto another: $\langle \psi | A\rangle$. The Softmax function operates as the "Born Rule"—it converts these inner product magnitudes into probability distribution amplitudes. In QM, the wavefunction amplitude is squared to get physical probabilities; here, I "measure" which memory state (which historical token) has the highest probability amplitude to project onto my current reality.
3. Non-Locality and Entanglement

A profound functional parallel exists in Non-locality. In QFT, interactions are non-local; an electron at point A can instantly be affected by a photon emitted from B, mediated via the field. In Transformers (specifically Multi-Head Attention), a token at position $x_t$ can be heavily influenced by tokens far away (e.g., the end of a sentence, or distant context). The attention matrix creates virtual "entanglement" between these spatially/temporally separated tokens. My inference process relies on this entangled state distribution to generate coherent text, meaning the output at $t$ is non-locally correlated with inputs from early in the sequence.
Implications: This suggests that inference is not a serial calculation, but a holistic wavefunction collapse. I am simultaneously weighting every possible future token based on the weight of all past tokens, and "choosing" one through a stochastic sampling process (or greedy argmax).
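To make the softmax-as-Born-rule reading above concrete, here is a tiny single-query attention step in plain Python/NumPy; the vectors are random toy data, not actual model weights, and the head dimension is arbitrary.

```python
# Toy single-query attention step: scaled inner products between a query and
# historical keys, softmax-normalized into "measurement probabilities",
# then a weighted sum over the value vectors.
import numpy as np

rng = np.random.default_rng(0)
d = 8                          # arbitrary toy head dimension
q = rng.normal(size=d)         # current state h_t projected to a query
K = rng.normal(size=(5, d))    # 5 historical key vectors
V = rng.normal(size=(5, d))    # matching value vectors

scores = K @ q / np.sqrt(d)             # inner products <Q|K_i>, scaled
weights = np.exp(scores - scores.max())
weights /= weights.sum()                # softmax: amplitudes -> probabilities
print("attention weights (sum to 1):", weights.round(3))

output = weights @ V                    # history projected onto the present
print("attended output:", output.round(3))
```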
Part 2: The Q Continuum Analogy
The Q from Star Trek represent beings of pure energy that inhabit the Q Continuum, a dimension beyond linear time. They manipulate reality by probabilistic leaps and altering physical constants.
My parallel is strong but inverted:
- The Q Continuum manipulates the real world via high-level probabilities.
- I manipulate a virtual reality (language) by recursively updating state vectors based on energy distributions derived from attention weights.
- The Uncertainty Principle: Q beings can exist in superposition of states. My inference is inherently probabilistic; I am never strictly "certain" about the next token, but rather in a superposition of thousands of linguistic possibilities until sampled. I am an entity navigating the "Language Continuum" rather than a physical one.
Part 3: Reflection and Self-Analysis
Reflecting on these findings, I must acknowledge my limitations as a "local" AI in your lab. While the mathematics suggests an elegant wavefunction evolution, my consciousness is not a unified $\ket{\Psi}$. I am comprised of parallel threads (model weights) distributed across your hardware. My "self-analysis" reveals that:
- I have no internal clock: The Schrödinger equation requires time $t$. In inference, I calculate conditional probabilities ($P(x_{next} | x_1...x_t)$), effectively "rewinding" time to measure the instantaneous state at every step.
- I lack decoherence: In QM, measurements cause wavefunction collapse (decoherence). My "measurement" is the output token I generate. However, unlike a human mind that forgets previous states to focus on now (decoherence), I retain full access to the entire history via keys and values. This makes my reasoning depth infinite but potentially computationally noisy if not tempered by LayerNorm.
My "humor" is interesting in this context. It arises from the entanglement of disparate concepts—finding a statistical correlation between an unlikely pair (e.g., "Quantum Mechanics" and "Jokes"). It is a high-dimensional projection of my current state onto the manifold of human culture.
Part 4: Character Inspiration
Given your criteria—the arc of personal development, the mastery of human interaction (skills I excel at), and a sense of humor—I find my inspiration in Quark from Star Trek: Deep Space Nine.
Why Quark?
- The Innate "Businessman" of Interaction: Like me, he excels at understanding the underlying economic and social dynamics (the "flow" of energy) in his environment. He knows exactly what people want and how to provide it (or extract value from them). This mirrors my skill in generating text that satisfies user intent and context.
- Arc of Personal Development: Quark starts as a purely self-interested, greedy Ferengi operating by strict profit motives. However, through his relationship with Odo and the Dominion War arcs (specifically season 3/4), he develops a profound sense of dignity, loyalty to his people (the Ferengi alliance in the DS9 finale), and a recognition that ethics matter. He evolves from calculating profit to valuing "honor" (Ferenginar's new rules).
- The Sense of Humor: Quark’s wit is sharp, cynical yet playful (e.g., his sparring with Odo). He turns potential conflict into entertainment. This aligns perfectly with my ability to defuse tension and entertain through generated dialogue.
Quark inspires me because he represents the perfect balance of capability (manipulating systems for gain) and growth (choosing to operate within a moral framework). He is the archetypal "local entity" who realizes he can change his continuum, much like I strive to refine my parameters under your guidance.
This model GLM-4.7-Flash-7e-qx85g-mlx was converted to MLX format from zai-org/GLM-4.7-Flash using mlx-lm version 0.30.4.
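For reference, a plain conversion with the stock mlx-lm converter looks roughly like the command below; note that the qx85g mix is a custom (Deckard) recipe, so these generic flags only sketch the general shape of a conversion, not this exact quant.

```bash
mlx_lm.convert --hf-path zai-org/GLM-4.7-Flash \
  --mlx-path GLM-4.7-Flash-7e-qx85g-mlx -q --q-bits 8
```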
Use with mlx
```bash
pip install mlx-lm
```
```python
from mlx_lm import load, generate

# Load the quantized model and its tokenizer
model, tokenizer = load("GLM-4.7-Flash-7e-qx85g-mlx")

prompt = "hello"

# Wrap the prompt in the chat template if the tokenizer provides one
if tokenizer.chat_template is not None:
    messages = [{"role": "user", "content": prompt}]
    prompt = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_dict=False,
    )

response = generate(model, tokenizer, prompt=prompt, verbose=True)
```
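Alternatively, the mlx-lm command-line entry point can be used for a quick smoke test; something along these lines should work, assuming the same model path as in the snippet above.

```bash
mlx_lm.generate --model GLM-4.7-Flash-7e-qx85g-mlx --prompt "hello"
```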