Just for fun, let's run the Alibaba MNN benchmark on a DGX Spark!
llmlaba
From time to time, I look for something new or unusual in the AI world, and recently I stumbled upon MNN — a direct competitor to llama.cpp.
I found this project intriguing and set a small goal for myself: to run it on my DGX Spark. I was glad to see that MNN is open-source under the Apache 2.0 license, meaning I was free to fork and modify it.
However, MNN had a few issues out of the box:
- No support for CUDA 13.0
- No support for the Blackwell architecture sm_12
- No built-in support for CUDA benchmarking
I tackled these issues one by one and successfully compiled MNN on the DGX Spark. The benchmark results are still quite low, but at least it works! The patch file is here: https://github.com/alibaba/MNN/issues/4289#issuecomment-4093931887
Here is the step-by-step guide on how I built MNN:
mkdir mnn && cd mnn
# Get the code
git clone https://github.com/alibaba/MNN.git
cd MNN
# Reset repo to a specific commit
git reset --hard b1d06d68b3366183d157f0703d7b8a8b61ae55b3
# Apply patch for CUDA 13.0
git apply ../my_changes.patch
mkdir build && cd build
# Configure the project
cmake .. \
-DMNN_CUDA=ON \
-DMNN_BUILD_LLM=ON \
-DMNN_SUPPORT_TRANSFORMER_FUSE=ON \
-DCMAKE_BUILD_TYPE=Release
# Build libraries and executable binaries
cmake --build . --config Release -j$(nproc)
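Before moving on, it is worth confirming that the build actually produced the benchmark binary. A minimal sketch (the relative path assumes you are still inside the build/ directory created above):

```shell
# Optional sanity check: confirm the benchmark binary was produced.
# Assumes the current directory is MNN/build from the steps above.
if [ -x ./llm_bench ]; then
    echo "llm_bench: build OK"
else
    echo "llm_bench: not found - check the cmake output for errors"
fi
```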
How to run the test:
- Download the MNN model: taobao-mnn/Qwen3-30B-A3B-MNN
- Run the benchmark:
./MNN/build/llm_bench -m /path/to/qwen/config.json -a cuda -c 2 -p 512 -n 128 -kv true -rep 3

It works!
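For the download step above, one option is the Hugging Face CLI. This is a sketch, not necessarily how the author fetched the model; it assumes the huggingface_hub CLI is installed, and the local directory name is illustrative:

```shell
# Fetch the converted model from Hugging Face.
# Assumes: pip install -U "huggingface_hub[cli]"
MODEL_REPO="taobao-mnn/Qwen3-30B-A3B-MNN"
LOCAL_DIR="./${MODEL_REPO##*/}"    # strips the org prefix -> ./Qwen3-30B-A3B-MNN
huggingface-cli download "$MODEL_REPO" --local-dir "$LOCAL_DIR"
# llm_bench's -m flag should then point at the config.json inside $LOCAL_DIR
```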
Hopefully, the MNN developers will add official CUDA 13 support.