Update/Fix incorrect model_max_length to 1024 tokens

by davidxmle - opened Mar 27, 2024

base: refs/heads/main ← from: refs/pr/8


davidxmle, Mar 27, 2024 (edited Mar 27, 2024)

Currently, the field model_max_length is set to 1000000000000000019884624838656 tokens, which is incorrect. As a result, when this model is used in a pipeline, one of two things goes wrong. Either automatic truncation never takes effect when the input exceeds the model's real limit, and an error is thrown like RuntimeError: The expanded size of the tensor (<SOME NUMBER LARGER THAN 1024>) must match the existing size (1024) at non-singleton dimension 1. Target sizes: [1, <SOME NUMBER LARGER THAN 1024>]. Tensor sizes: [1, 1024], or the stride option cannot be used, because it also relies on a correct model_max_length being provided.
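For context, the huge number above is exactly int(1e30), which as far as I can tell is the placeholder transformers writes when no maximum length is known for a tokenizer. A minimal sketch of why truncation never kicks in with that value, and what a corrected limit changes. The helper names here (effective_max_length, truncate_ids) are hypothetical and are not part of transformers; the fallback of 1024 is taken from the error message above.

```python
# Assumption: transformers uses int(1e30) as its "no limit set" placeholder.
VERY_LARGE_INTEGER = int(1e30)


def effective_max_length(model_max_length: int, fallback: int = 1024) -> int:
    """Hypothetical helper: treat the placeholder as 'unset' and use a fallback."""
    if model_max_length >= VERY_LARGE_INTEGER:
        return fallback
    return model_max_length


def truncate_ids(token_ids: list, model_max_length: int) -> list:
    """Plain truncation to model_max_length, ignoring special tokens."""
    return token_ids[:model_max_length]


token_ids = list(range(1500))  # stand-in for a 1500-token input

# With the placeholder value nothing is cut, so the model still receives 1500 tokens.
untouched = truncate_ids(token_ids, VERY_LARGE_INTEGER)

# With the corrected value the input fits the 1024 positions the model supports.
fixed = truncate_ids(token_ids, effective_max_length(VERY_LARGE_INTEGER))

print(len(untouched), len(fixed))
```

Until this PR is merged, passing an explicit max_length when tokenizing, or setting tokenizer.model_max_length = 1024 after loading, should work around the problem on the user side.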
Description of stride option in a token classification pipeline:
If stride is provided, the pipeline is applied on all the text. The text is split into chunks of size model_max_length. Works only with fast tokenizers and aggregation_strategy different from NONE. The value of this argument defines the number of overlapping tokens between chunks. In other words, the model will shift forward by tokenizer.model_max_length - stride tokens each step.
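To make the arithmetic in that description concrete, here is a rough sketch of how the chunk boundaries fall out of model_max_length and stride. The function chunk_spans is hypothetical and simplified: it ignores the room a real tokenizer reserves for special tokens, so actual chunk boundaries in a pipeline may differ slightly.

```python
def chunk_spans(num_tokens: int, model_max_length: int, stride: int) -> list:
    """Return (start, end) token spans of size model_max_length that overlap
    by `stride` tokens, i.e. each step shifts by model_max_length - stride."""
    if not 0 <= stride < model_max_length:
        raise ValueError("stride must be >= 0 and smaller than model_max_length")
    if num_tokens <= 0:
        return []
    step = model_max_length - stride
    spans = []
    start = 0
    while True:
        end = min(start + model_max_length, num_tokens)
        spans.append((start, end))
        if end == num_tokens:
            break
        start += step
    return spans


# With the corrected limit, a 2500-token text is split into overlapping chunks.
print(chunk_spans(2500, 1024, 128))

# With the current placeholder value, the whole text ends up in a single chunk,
# which is longer than the 1024 positions the model can handle.
print(chunk_spans(2500, int(1e30), 128))
```

This is why stride has no effect while model_max_length is left at the placeholder: the chunk size is so large that no splitting ever happens.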

Update the model_max_length to 1024 tokens (4c9455e4)

davidxmle changed pull request title from Update the model_max_length to 1024 tokens to Update/Fix incorrect model_max_length to 1024 tokens Mar 28, 2024


Ready to merge

This branch is ready to get merged automatically.
