The text.data.seq2seq.core module contains the core seq2seq (e.g., language modeling, summarization, translation) bits required to use the fastai DataBlock API and/or mid-level data processing pipelines to organize your data in a way that can be modeled by Hugging Face transformer implementations.
Starting with version 2.0, BLURR provides a preprocessing base class that can be used to build seq2seq preprocessed datasets from pandas DataFrames or Hugging Face Datasets.
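BLURR's preprocessor classes wrap a common pattern; the sketch below shows the underlying idea with plain pandas and a Hugging Face tokenizer rather than the BLURR API itself. The checkpoint and the `article`/`highlights` column names are illustrative assumptions about your data.

```python
import pandas as pd
from transformers import AutoTokenizer

hf_tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")

def truncate_to_max_tokens(df: pd.DataFrame, col: str, max_length: int) -> pd.DataFrame:
    """Tokenize, truncate, and decode back to text so that downstream batch
    tokenization never sees a sequence longer than `max_length` tokens."""
    input_ids = hf_tokenizer(df[col].tolist(), max_length=max_length, truncation=True)["input_ids"]
    df[col] = hf_tokenizer.batch_decode(input_ids, skip_special_tokens=True)
    return df

df = pd.DataFrame({"article": ["A very long document ..."], "highlights": ["A short summary."]})
df = truncate_to_max_tokens(df, "article", max_length=256)
```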
The base representation of your inputs; used by the various fastai show methods.
A Seq2SeqTextInput object is returned from the decodes method of Seq2SeqBatchTokenizeTransform as a means to customize @typedispatched functions like DataLoaders.show_batch and Learner.show_results. Its value will be your "input_ids".
Handles everything you need to assemble a mini-batch of inputs and targets, as well as decode the dictionary produced as a byproduct of the tokenization process in the encodes method.
| | Type | Default | Details |
|---|---|---|---|
| hf_arch | str | | The abbreviation/name of your Hugging Face transformer architecture (e.g., bert, bart, etc.) |
| hf_config | PretrainedConfig | | A specific configuration instance you want to use |
| hf_tokenizer | PreTrainedTokenizerBase | | A Hugging Face tokenizer |
| hf_model | PreTrainedModel | | A Hugging Face model |
| include_labels | bool | True | To control whether the `labels` are included in your inputs. If they are, the loss will be calculated in the model's forward function and you can simply use `PreCalculatedLoss` as your `Learner`'s loss function to use it |
| ignore_token_id | int | -100 | The token ID that should be ignored when calculating the loss |
| max_length | int | None | To control the length of the padding/truncation of the input sequence. It can be an integer or None, in which case it will default to the maximum length the model can accept. If the model has no specific maximum input length, truncation/padding to max_length is deactivated. See [Everything you always wanted to know about padding and truncation](https://huggingface.co/transformers/preprocessing.html#everything-you-always-wanted-to-know-about-padding-and-truncation) |
| max_target_length | int | None | To control the length of the padding/truncation of the target sequence. It can be an integer or None, in which case it will default to the maximum length the model can accept. If the model has no specific maximum input length, truncation/padding to max_length is deactivated. See [Everything you always wanted to know about padding and truncation](https://huggingface.co/transformers/preprocessing.html#everything-you-always-wanted-to-know-about-padding-and-truncation) |
| padding | Union | True | To control the padding applied to your `hf_tokenizer` during tokenization. If None, will default to False or 'do_not_pad'. See [Everything you always wanted to know about padding and truncation](https://huggingface.co/transformers/preprocessing.html#everything-you-always-wanted-to-know-about-padding-and-truncation) |
| truncation | Union | True | To control the truncation applied to your `hf_tokenizer` during tokenization. If None, will default to False or 'do_not_truncate'. See [Everything you always wanted to know about padding and truncation](https://huggingface.co/transformers/preprocessing.html#everything-you-always-wanted-to-know-about-padding-and-truncation) |
| is_split_into_words | bool | False | The `is_split_into_words` argument applied to your `hf_tokenizer` during tokenization. Set this to True if your inputs are pre-tokenized (not numericalized) |
| tok_kwargs | dict | {} | Any other keyword arguments you want included when using your `hf_tokenizer` to tokenize your inputs |
| text_gen_kwargs | dict | {} | Any keyword arguments to pass to the `hf_model.generate` method |
| kwargs | | | |
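As a sketch of constructing the transform directly (the import paths follow the blurr 2.x package layout and the BART checkpoint is just an example; adjust both to your setup):

```python
from transformers import BartForConditionalGeneration
from blurr.text.data.seq2seq.core import Seq2SeqBatchTokenizeTransform
from blurr.text.utils import get_hf_objects

# Grab the architecture name, config, tokenizer, and model in one call
hf_arch, hf_config, hf_tokenizer, hf_model = get_hf_objects(
    "facebook/bart-large-cnn", model_cls=BartForConditionalGeneration
)

batch_tokenize_tfm = Seq2SeqBatchTokenizeTransform(
    hf_arch, hf_config, hf_tokenizer, hf_model,
    max_length=256,                    # pad/truncate source sequences to 256 tokens
    max_target_length=130,             # pad/truncate target sequences to 130 tokens
    text_gen_kwargs={"num_beams": 4},  # forwarded to hf_model.generate
)
```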
We create a subclass of BatchTokenizeTransform for summarization tasks to add `decoder_input_ids` and `labels` (if we want Hugging Face to calculate the loss for us) to our inputs during training. See here and here for more information on these additional inputs used in summarization, translation, and conversational training tasks. How they should look for particular architectures can be found by looking at those models' forward function docs (see here for BART, for example).
Note also that `labels` is simply `target_ids` shifted to the right by one, since the task is to predict the next token based on the current (and all previous) `decoder_input_ids`.
And lastly, we also update our targets to just be the input_ids of our target sequence so that fastai’s Learner.show_results works (again, almost all the fastai bits require returning a single tensor to work).
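A toy illustration of that one-token offset (the token IDs and decoder start ID below are made up for demonstration; the actual start token is model-specific): at step `i` the decoder sees `decoder_input_ids[: i + 1]` and is trained to predict `labels[i]`.

```python
import torch

target_ids = torch.tensor([31, 42, 53, 64])  # hypothetical target token IDs
decoder_start_token_id = 2                   # model-specific; made up here

# Decoder inputs are the targets shifted right by one, prefixed with the start token
decoder_input_ids = torch.cat([torch.tensor([decoder_start_token_id]), target_ids[:-1]])
labels = target_ids.clone()

print(decoder_input_ids)  # tensor([ 2, 31, 42, 53])
print(labels)             # tensor([31, 42, 53, 64])
```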
A class used to cast your inputs as input_return_type for fastai show methods
| | Type | Default | Details |
|---|---|---|---|
| input_return_type | type | TextInput | Used by typedispatched show methods |
| hf_arch | str | None | The abbreviation/name of your Hugging Face transformer architecture (not required if passing in an instance of BatchTokenizeTransform to before_batch_tfm) |
| hf_config | PretrainedConfig | None | A Hugging Face configuration object (not required if passing in an instance of BatchTokenizeTransform to before_batch_tfm) |
| hf_tokenizer | PreTrainedTokenizerBase | None | A Hugging Face tokenizer (not required if passing in an instance of BatchTokenizeTransform to before_batch_tfm) |
| hf_model | PreTrainedModel | None | A Hugging Face model (not required if passing in an instance of BatchTokenizeTransform to before_batch_tfm) |
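You will rarely build this transform yourself, as the TransformBlock below creates one for you; but as a minimal sketch (reusing the `hf_*` objects from the earlier example, and assuming the seq2seq variant of the class lives in this module under the blurr 2.x layout):

```python
from blurr.text.data.seq2seq.core import Seq2SeqBatchDecodeTransform, Seq2SeqTextInput

batch_decode_tfm = Seq2SeqBatchDecodeTransform(
    input_return_type=Seq2SeqTextInput,  # what the typedispatched show methods will receive
    hf_arch=hf_arch,
    hf_config=hf_config,
    hf_tokenizer=hf_tokenizer,
    hf_model=hf_model,
)
```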
The core TransformBlock to prepare your inputs for training in Blurr with fastai’s DataBlock API
| | Type | Default | Details |
|---|---|---|---|
| hf_arch | str | None | The abbreviation/name of your Hugging Face transformer architecture (not required if passing in an instance of BatchTokenizeTransform to before_batch_tfm) |
| hf_config | PretrainedConfig | None | A Hugging Face configuration object (not required if passing in an instance of BatchTokenizeTransform to before_batch_tfm) |
| hf_tokenizer | PreTrainedTokenizerBase | None | A Hugging Face tokenizer (not required if passing in an instance of BatchTokenizeTransform to before_batch_tfm) |
| hf_model | PreTrainedModel | None | A Hugging Face model (not required if passing in an instance of BatchTokenizeTransform to before_batch_tfm) |
| batch_tokenize_tfm | Optional | None | The before_batch_tfm you want to use to tokenize your raw data on the fly (defaults to an instance of BatchTokenizeTransform) |
| batch_decode_tfm | Optional | None | The batch_tfm you want to decode your inputs into a type that can be used in the fastai show methods (defaults to BatchDecodeTransform) |
| max_length | int | None | To control the length of the padding/truncation for the input sequence. It can be an integer or None, in which case it will default to the maximum length the model can accept. If the model has no specific maximum input length, truncation/padding to max_length is deactivated. See [Everything you always wanted to know about padding and truncation](https://huggingface.co/transformers/preprocessing.html#everything-you-always-wanted-to-know-about-padding-and-truncation) |
| max_target_length | NoneType | None | To control the length of the padding/truncation for the target sequence. It can be an integer or None, in which case it will default to the maximum length the model can accept. If the model has no specific maximum input length, truncation/padding to max_length is deactivated. See [Everything you always wanted to know about padding and truncation](https://huggingface.co/transformers/preprocessing.html#everything-you-always-wanted-to-know-about-padding-and-truncation) |
| padding | Union | True | To control the padding applied to your `hf_tokenizer` during tokenization. If None, will default to False or 'do_not_pad'. See [Everything you always wanted to know about padding and truncation](https://huggingface.co/transformers/preprocessing.html#everything-you-always-wanted-to-know-about-padding-and-truncation) |
| truncation | Union | True | To control the truncation applied to your `hf_tokenizer` during tokenization. If None, will default to False or 'do_not_truncate'. See [Everything you always wanted to know about padding and truncation](https://huggingface.co/transformers/preprocessing.html#everything-you-always-wanted-to-know-about-padding-and-truncation) |
| input_return_type | _TensorMeta | Seq2SeqTextInput | The return type your decoded inputs should be cast to (used by methods such as `show_batch`) |
| dl_type | type | SortedDL | The type of `DataLoader` you want created (defaults to `SortedDL`) |
| batch_tokenize_kwargs | dict | {} | Any keyword arguments you want applied to your `batch_tokenize_tfm` |
| batch_decode_kwargs | dict | {} | Any keyword arguments you want applied to your `batch_decode_tfm` (will be set as a fastai `batch_tfms`) |
| tok_kwargs | dict | {} | Any keyword arguments you want your Hugging Face tokenizer to use during tokenization |
| text_gen_kwargs | dict | {} | Any keyword arguments you want to have applied when generating text (default: `default_text_gen_kwargs`) |
| kwargs | | | |
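Putting it together with the DataBlock API, a sketch along these lines should get you from a DataFrame to DataLoaders (the `article` and `highlights` column names are assumptions about your data, and the `hf_*` objects come from the earlier `get_hf_objects` call):

```python
from fastai.data.block import DataBlock
from fastai.data.transforms import ColReader, RandomSplitter
from fastcore.basics import noop

from blurr.text.data.seq2seq.core import Seq2SeqTextBlock

dblock = DataBlock(
    # targets are tokenized together with the inputs, so the second block is a noop
    blocks=(Seq2SeqTextBlock(hf_arch=hf_arch, hf_config=hf_config,
                             hf_tokenizer=hf_tokenizer, hf_model=hf_model), noop),
    get_x=ColReader("article"),
    get_y=ColReader("highlights"),
    splitter=RandomSplitter(valid_pct=0.2, seed=42),
)

dls = dblock.dataloaders(df, bs=4)
dls.show_batch(dataloaders=dls, max_n=2)
```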