veScale-FSDP: Flexible and High-Performance FSDP at Scale
Abstract
veScale-FSDP introduces a redesigned fully sharded data parallel system with flexible sharding and structure-aware planning to improve scalability and efficiency for large-scale model training.
Fully Sharded Data Parallel (FSDP), also known as ZeRO, is widely used for training large-scale models because of its flexibility and minimal intrusion on model code. However, current FSDP systems struggle with structure-aware training methods (e.g., block-wise quantized training) and with non-element-wise optimizers (e.g., Shampoo and Muon) used in cutting-edge models (e.g., Gemini, Kimi K2), because FSDP's fixed element-wise or row-wise sharding formats conflict with these block-structured computations. In addition, today's implementations fall short in communication and memory efficiency, which limits scaling to tens of thousands of GPUs. We introduce veScale-FSDP, a redesigned FSDP system that couples a flexible sharding format, RaggedShard, with a structure-aware planning algorithm to deliver both flexibility and performance at scale. veScale-FSDP natively supports the efficient data placement required by FSDP, enabling block-wise quantization and non-element-wise optimizers. As a result, veScale-FSDP achieves 5 to 66% higher throughput and 16 to 30% lower memory usage than existing FSDP systems, while scaling efficiently to tens of thousands of GPUs.
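To make the conflict between element-wise sharding and block-structured computation concrete, the following is a minimal sketch in plain Python. It is not the veScale-FSDP implementation or its RaggedShard API; the function names, the 128-element block size, and the 3-rank setup are all assumptions chosen for illustration. It compares an even element-wise split of a flattened parameter with an uneven split whose boundaries are aligned to block edges, and counts how many blocks end up divided between two ranks.

```python
# Illustrative sketch only: all names and sizes here are hypothetical,
# not taken from veScale-FSDP.

def even_shard_bounds(numel, world_size):
    """Element-wise sharding: every rank gets ceil(numel / world_size) elements."""
    per_rank = -(-numel // world_size)  # ceiling division
    return [
        (min(r * per_rank, numel), min((r + 1) * per_rank, numel))
        for r in range(world_size)
    ]


def block_aligned_shard_bounds(numel, world_size, block):
    """Uneven ("ragged") sharding: distribute whole blocks, so shard sizes may differ."""
    num_blocks = -(-numel // block)
    base, extra = divmod(num_blocks, world_size)
    bounds, start_block = [], 0
    for r in range(world_size):
        n = base + (1 if r < extra else 0)  # first `extra` ranks take one more block
        start = min(start_block * block, numel)
        end = min((start_block + n) * block, numel)
        bounds.append((start, end))
        start_block += n
    return bounds


def split_blocks(bounds, block):
    """Count interior shard boundaries that cut through the middle of a block."""
    numel = bounds[-1][1]
    cuts = {end for _, end in bounds[:-1]}
    return sum(1 for c in cuts if 0 < c < numel and c % block != 0)


numel, world_size, block = 1000, 3, 128  # assumed sizes for the demo

even = even_shard_bounds(numel, world_size)
ragged = block_aligned_shard_bounds(numel, world_size, block)

print(even, split_blocks(even, block))      # [(0, 334), (334, 668), (668, 1000)] 2
print(ragged, split_blocks(ragged, block))  # [(0, 384), (384, 768), (768, 1000)] 0
```

In this toy setting the even split cuts two quantization blocks across ranks, so a block-wise operation would need extra communication or padding, while the block-aligned split keeps every block on a single rank at the cost of unequal shard sizes.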