# Unpadding implementation for transformers

This repository contains the implementation of automatic unpadding and sequence packing for Hugging Face Transformers models. In version 5 of Transformers, attention handling across models was standardized, enabling older architectures such as BERT and RoBERTa to use Flash Attention. With Flash Attention support, it becomes possible to implement the unpadding trick previously utilized in ModernBERT (see Unpadding and Sequence Packing). This technique involves automatically removing padding tokens and transforming the input tokens, typically represented as an N x M matrix (where N is the batch size and M is the length of the longest sequence in the batch), into a single, long vector containing the concatenated, padding-free sequences. This vector is then processed by the embedding layer and attention blocks. Right before the final model layer (e.g., LM head, classification head), it is transformed back into a matrix format.
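The core transformation can be sketched in a few lines of PyTorch. This is a minimal illustration, not the repository's actual code; the helper names `unpad` and `repad` are hypothetical, but `cu_seqlens` follows the cumulative-sequence-lengths format expected by FlashAttention's variable-length kernels:

```python
import torch

def unpad(input_ids, attention_mask):
    """Pack an N x M padded batch into one padding-free token vector."""
    mask = attention_mask.bool()
    seqlens = attention_mask.sum(dim=1)
    # Cumulative sequence lengths: [0, len_0, len_0 + len_1, ...]
    cu_seqlens = torch.nn.functional.pad(seqlens.cumsum(dim=0), (1, 0)).to(torch.int32)
    return input_ids[mask], cu_seqlens, mask

def repad(hidden_states, mask):
    """Scatter packed hidden states back into an N x M x H padded matrix."""
    n, m = mask.shape
    out = hidden_states.new_zeros(n, m, hidden_states.shape[-1])
    out[mask] = hidden_states
    return out
```

Between `unpad` and `repad`, every layer operates on a `(total_tokens, H)` tensor, so no compute is spent on padding positions.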

Unpadding can significantly reduce both computation time and memory usage. The gains are largest when the lengths of the processed texts vary widely; if the texts are of roughly equal length, the benefits will be minimal, though the model should still run faster than the original implementation. Below are several benchmarks supporting these claims.

## Inference

The experiment involved performing reranking using the polish-reranker-roberta-v3 model on a subset of queries from the PIRB benchmark. We used 9 datasets from the Web Datasets group, with a maximum of 1,000 queries per dataset. Each query had 100 candidate documents. The model's task was to evaluate the relevance of the query-document pairs, resulting in over 800,000 predictions in total. The test was conducted on a single NVIDIA RTX A6000 GPU. Across all tested implementations, we used a fixed batch size of 32. The results are presented in the table below.

| Implementation | Total time | Queries per second | Max VRAM |
|----------------|------------|--------------------|----------|
| SDPA (transformers default) | 1h 38m 46s | 1.39 | 9006 MB |
| Flash-Attention | 1h 16m 3s | 1.81 | 6286 MB |
| Flash-Attention with unpadding | 39m 13s | 3.50 | 2368 MB |

## Fine-Tuning

In the next experiment, we fine-tuned the polish-roberta-8k model on classification tasks. We measured the total time required to train the model for 10 epochs, including evaluation on the validation split after each epoch and evaluation on the test set after training completed. Two datasets were selected for this experiment: POLEMO-IN from the KLEJ benchmark (short to medium texts) and banking-long from the FinBench benchmark (medium to very long texts). The tests were conducted on a single NVIDIA RTX A6000 GPU. The results are presented in the table below (OOM = CUDA out-of-memory error). For the dataset with shorter texts, training with unpadding is over twice as fast and uses roughly a third of the memory of the default SDPA implementation. For the dataset with long texts, unpadding allows training with micro-batch sizes 16 times larger and is over three times faster than the original model.

The total batch size (micro batch size × gradient accumulation) is 32 in all runs.

| Dataset | Micro batch size | Gradient accumulation | SDPA (default) total time | SDPA (default) max VRAM | Flash-Attention total time | Flash-Attention max VRAM | Flash-Attention with unpadding total time | Flash-Attention with unpadding max VRAM |
|---|---|---|---|---|---|---|---|---|
| POLEMO-IN (KLEJ) | 32 | 1 | 1054s | 40.88 GB | 931s | 35.71 GB | 489s | 12.64 GB |
| BANKING-LONG | 2 | 16 | 4734s | 28.98 GB | 4906s | 22.82 GB | 4290s | 16.81 GB |
| BANKING-LONG | 4 | 8 | OOM | OOM | 3434s | 36.43 GB | 2347s | 19.30 GB |
| BANKING-LONG | 8 | 4 | OOM | OOM | OOM | OOM | 1753s | 21.25 GB |
| BANKING-LONG | 16 | 2 | OOM | OOM | OOM | OOM | 1563s | 24.04 GB |
| BANKING-LONG | 32 | 1 | OOM | OOM | OOM | OOM | 1475s | 32.00 GB |
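In the table above, micro batch size and gradient accumulation steps are varied together so that the effective batch size stays constant; unpadding lets larger micro batches fit in memory, so fewer accumulation steps are needed. A minimal sketch of gradient accumulation in PyTorch (illustrative only, not this repository's training code):

```python
import torch

model = torch.nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = torch.nn.CrossEntropyLoss()

micro_batch, grad_accum = 2, 16            # total batch size: 2 * 16 = 32
data = torch.randn(32, 4)
labels = torch.randint(0, 2, (32,))

optimizer.zero_grad()
for step in range(grad_accum):
    x = data[step * micro_batch:(step + 1) * micro_batch]
    y = labels[step * micro_batch:(step + 1) * micro_batch]
    # Scale the loss so the accumulated gradient matches one full-batch step.
    loss = loss_fn(model(x), y) / grad_accum
    loss.backward()                        # gradients accumulate in .grad
optimizer.step()                           # one optimizer step per 32 examples
```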