# Unpadding implementation for transformers
This repository contains the implementation of automatic unpadding and sequence packing for Hugging Face Transformers models. In version 5 of Transformers, attention handling across models was standardized, enabling older architectures such as BERT and RoBERTa to use Flash Attention. With Flash Attention support, it becomes possible to implement the unpadding trick previously utilized in ModernBERT (see Unpadding and Sequence Packing). This technique involves automatically removing padding tokens and transforming the input tokens, typically represented as an N x M matrix (where N is the batch size and M is the length of the longest sequence in the batch), into a single, long vector containing the concatenated, padding-free sequences. This vector is then processed by the embedding layer and attention blocks. Right before the final model layer (e.g., LM head, classification head), it is transformed back into a matrix format.
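As an illustration only (not the code in this repository), the unpad/repad round trip described above can be sketched in a few lines of PyTorch. The function names and shapes here are assumptions for the sketch:

```python
import torch
import torch.nn.functional as F

def unpad(hidden_states: torch.Tensor, attention_mask: torch.Tensor):
    """Collapse a padded (N, M, D) batch into a (total_tokens, D) vector.

    Also returns the flat indices of the real tokens (needed to repad later)
    and cumulative sequence lengths in the format variable-length Flash
    Attention kernels expect.
    """
    seqlens = attention_mask.sum(dim=-1, dtype=torch.int32)
    # Flat positions of the non-padding tokens in the (N*M,) flattened batch.
    indices = torch.nonzero(attention_mask.flatten(), as_tuple=False).flatten()
    flat = hidden_states.reshape(-1, hidden_states.shape[-1])[indices]
    # cu_seqlens = [0, len(seq_0), len(seq_0)+len(seq_1), ...]
    cu_seqlens = F.pad(torch.cumsum(seqlens, dim=0, dtype=torch.int32), (1, 0))
    return flat, indices, cu_seqlens

def repad(flat: torch.Tensor, indices: torch.Tensor, batch: int, seqlen: int):
    """Scatter the unpadded tokens back into an (N, M, D) matrix (zeros at pads)."""
    out = torch.zeros(batch * seqlen, flat.shape[-1], dtype=flat.dtype)
    out[indices] = flat
    return out.reshape(batch, seqlen, -1)
```

In the actual model, everything between the embedding layer and the final head operates on the flat `(total_tokens, D)` tensor, with `cu_seqlens` telling the attention kernel where each sequence starts and ends.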
Unpadding can yield significant benefits in terms of computational efficiency and memory savings. The most substantial gains are achieved when there is high variance in the lengths of the processed texts. If the texts are of roughly equal length, the benefits will be minimal, though the model should still run faster than the original implementation. Below are several benchmarks supporting these claims.
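To see why length variance matters, compare the number of tokens actually processed with and without padding for a hypothetical batch (the lengths below are made up for illustration):

```python
# Hypothetical batch of sequence lengths (in tokens).
lengths = [12, 480, 37, 512, 64]

# Without unpadding, every sequence is padded to the longest one in the batch.
padded_tokens = len(lengths) * max(lengths)  # 5 * 512 = 2560
# With unpadding, only the real tokens are processed.
real_tokens = sum(lengths)                   # 1105

savings = 1 - real_tokens / padded_tokens
print(f"{savings:.0%} of the token-level compute was spent on padding")
```

With near-uniform lengths the same calculation yields savings close to zero, which matches the behavior described above.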
## Inference
The experiment involved performing reranking using the polish-reranker-roberta-v3 model on a subset of queries from the PIRB benchmark. We used 9 datasets from the Web Datasets group, with a maximum of 1,000 queries per dataset. Each query had 100 candidate documents. The model's task was to evaluate the relevance of the query-document pairs, resulting in over 800,000 predictions in total. The test was conducted on a single NVIDIA RTX A6000 GPU. Across all tested implementations, we used a fixed batch size of 32. The results are presented in the table below.
| Implementation | Total time | Queries per second | Max VRAM |
|---|---|---|---|
| SDPA (transformers default) | 1h 38m 46s | 1.39 | 9006 MB |
| Flash-Attention | 1h 16m 3s | 1.81 | 6286 MB |
| Flash-Attention with unpadding | 39m 13s | 3.50 | 2368 MB |
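For reference, Flash Attention is selected in recent versions of Transformers via the `attn_implementation` argument. A minimal configuration sketch (the exact Hub id of the reranker checkpoint is an assumption; substitute your own):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Hypothetical checkpoint name -- replace with the actual Hub id.
model_id = "polish-reranker-roberta-v3"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(
    model_id,
    attn_implementation="flash_attention_2",  # requires the flash-attn package
    dtype=torch.bfloat16,  # `torch_dtype` in older Transformers releases
).to("cuda")
```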
## Fine-Tuning
In the next experiment, we fine-tuned the polish-roberta-8k model on classification tasks. We measured the total time required to train the model for 10 epochs, including evaluation on the validation split after each epoch and evaluation on the test set upon training completion. Two datasets were selected for this experiment: POLEMO-IN from the KLEJ benchmark (short to medium texts) and banking-long from the FinBench benchmark (medium to very long texts). The tests were conducted on a single NVIDIA RTX A6000 GPU. The results are presented in the table below (OOM = CUDA out-of-memory error). For the dataset with shorter texts, training with unpadding is over twice as fast and uses roughly a third of the memory compared to the default SDPA implementation. For the dataset with long texts, unpadding allows training with 16 times larger micro batch sizes and is over three times faster than the original model.
| Micro batch size | Gradient accumulation | SDPA (default) time | SDPA (default) VRAM | Flash-Attention time | Flash-Attention VRAM | FA with unpadding time | FA with unpadding VRAM |
|---|---|---|---|---|---|---|---|
| **POLEMO-IN (KLEJ)** | | | | | | | |
| 32 | 1 | 1054s | 40.88 GB | 931s | 35.71 GB | 489s | 12.64 GB |
| **BANKING-LONG** | | | | | | | |
| 2 | 16 | 4734s | 28.98 GB | 4906s | 22.82 GB | 4290s | 16.81 GB |
| 4 | 8 | OOM | OOM | 3434s | 36.43 GB | 2347s | 19.30 GB |
| 8 | 4 | OOM | OOM | OOM | OOM | 1753s | 21.25 GB |
| 16 | 2 | OOM | OOM | OOM | OOM | 1563s | 24.04 GB |
| 32 | 1 | OOM | OOM | OOM | OOM | 1475s | 32.00 GB |
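The "micro batch size" and "gradient accumulation" columns map directly onto the standard `Trainer` arguments; their product is the effective (total) batch size of 32 used in every row. A hedged configuration sketch for one of the BANKING-LONG rows (dataset and model wiring omitted):

```python
from transformers import TrainingArguments

# Effective batch size = per_device_train_batch_size * gradient_accumulation_steps.
# E.g. the 8 x 4 row above trains with an effective batch size of 32.
args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=8,   # micro batch size
    gradient_accumulation_steps=4,   # gradient accumulation
    num_train_epochs=10,
    eval_strategy="epoch",           # evaluate on the validation split each epoch
    bf16=True,
)
```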