Data

The text.data.core module contains the core bits required to use the fastai DataBlock API and/or mid-level data processing pipelines to turn your raw datasets into modelable DataLoaders for text/NLP tasks.
What we're running with at the time this documentation was generated:
torch: 1.9.0+cu102
fastai: 2.7.9
transformers: 4.21.2

Setup

We’ll use a subset of imdb to demonstrate how to configure BLURR for sequence classification tasks.

import pandas as pd
from datasets import concatenate_datasets, load_dataset

raw_datasets = load_dataset("imdb", split=["train", "test"])

raw_datasets[0] = raw_datasets[0].add_column("is_valid", [False] * len(raw_datasets[0]))
raw_datasets[1] = raw_datasets[1].add_column("is_valid", [True] * len(raw_datasets[1]))

final_ds = concatenate_datasets([raw_datasets[0].shuffle().select(range(1000)), raw_datasets[1].shuffle().select(range(200))])

imdb_df = pd.DataFrame(final_ds)
imdb_df.head()
Reusing dataset imdb (/home/wgilliam/.cache/huggingface/datasets/imdb/plain_text/1.0.0/2fdd8b9bcadd6e7055e742a706876ba43f19faee861df134affd7a3f60fc38a1)
text label is_valid
0 What an overlooked 80's soundtrack. I imagine John Travolta sang some of the songs but in watching the movie it did seem to personify everything that was 80s cheese. Clearly movies that rely on mechanical bulls, bartenders and immature relationships were in style. The best was his lousy Texas accent. Compare that to Friday Night Lights.I suggest watching Cocktail and Stir Crazy to start really getting into the dumbing down of film. Also, as a side note Made in America with Ted Danson and Whoopie Goldberg is an awesomely bad movie. I was so shocked to realize I had never watched it. One mor... 1 False
1 An archaeologist (Casper Van Dien) stumbles accidentally upon an ancient, 40 foot mummy, well preserved underground in the Nevada desert. They are determined to keep this a secret and call in a Jewish translator to assist in figuring out the history of it. The mummy, as explained at the beginning, is the son of a fallen angel and is one of several giants that apparently existed in "those days". In order to save his son from a devastating flood which was predicted to kill everything, he mummifies his son, burying him with several servants for centuries - planning to awaken him years from th... 0 False
2 Sudden Impact is a two pronged story. Harry is targeted by the mob who want to kill him and Harry is very glad to return the favour and show them how it's done. This little war puts Harry on suspension which he doesn't care about but he goes away on a little vacation. Now the second part of the story. Someone is killing some punks and Harry gets dragged into this situation where he meets Jennifer spencer a woman with a secret that the little tourist town wants to keep quiet. The police Chief is not a subtle man and he warns Harry to not get involved or cause any trouble. This is Harry Call... 1 False
3 It is a superb Swedish film .. it was the first Swedish film I've seen .. it is simple & deep .. what a great combination!.<br /><br />Michael Nyqvist did a great performance as a famous conductor who seeks peace in his hometown.<br /><br />Frida Hallgren was great as his inspirational girlfriend to help him to carry on & never give up.<br /><br />The fight between the conductor and the hypocrite priest who loses his battle with Michael when his wife confronts him And defends Michael's noble cause to help his hometown people finding their own peace in music.<br /><br />The only thing that ... 1 False
4 The plot is about a female nurse, named Anna, is caught in the middle of a world-wide chaos as flesh-eating zombies begin rising up and taking over the world and attacking the living. She escapes into the streets and is rescued by a black police officer. So far, so good! I usually enjoy horror movies, but this piece of film doesn't deserve to be called horror. It's not even thrilling, just ridiculous.Even "the Flintstones" or "Kukla, Fran and Ollie" will give you more excitement. It's like watching a bunch of bloodthirsty drunkards not being able to get into a shopping mall to by more liqu... 0 False
labels = raw_datasets[0].features["label"].names
labels
['neg', 'pos']
from transformers import AutoModelForSequenceClassification
from transformers.utils import logging as hf_logging

model_cls = AutoModelForSequenceClassification
hf_logging.set_verbosity_error()

pretrained_model_name = "roberta-base"  # "bert-base-multilingual-cased"
n_labels = len(labels)

hf_arch, hf_config, hf_tokenizer, hf_model = get_hf_objects(
    pretrained_model_name, model_cls=model_cls, config_kwargs={"num_labels": n_labels}
)

hf_arch, type(hf_config), type(hf_tokenizer), type(hf_model)
('roberta',
 transformers.models.roberta.configuration_roberta.RobertaConfig,
 transformers.models.roberta.tokenization_roberta_fast.RobertaTokenizerFast,
 transformers.models.roberta.modeling_roberta.RobertaForSequenceClassification)

Preprocessing

Starting with version 2.0, BLURR provides a preprocessing base class that can be used to build task-specific pre-processed datasets from pandas DataFrames or Hugging Face Datasets.


source

Preprocessor

 Preprocessor (hf_tokenizer:transformers.tokenization_utils_base.PreTraine
               dTokenizerBase, batch_size:int=1000, text_attr:str='text',
               text_pair_attr:str=None, is_valid_attr:str='is_valid',
               tok_kwargs:dict={})

Initialize self. See help(type(self)) for accurate signature.

Type Default Details
hf_tokenizer PreTrainedTokenizerBase A Hugging Face tokenizer
batch_size int 1000 The number of examples to process at a time
text_attr str text The attribute holding the text
text_pair_attr str None The attribute holding the text_pair
is_valid_attr str is_valid The attribute that will be created if you are processing separate training and validation datasets into a single dataset; indicates which dataset each example belongs to
tok_kwargs dict {} Tokenization kwargs that will be applied when calling the tokenizer

source

ClassificationPreprocessor

 ClassificationPreprocessor (hf_tokenizer:transformers.tokenization_utils_
                             base.PreTrainedTokenizerBase,
                             batch_size:int=1000,
                             is_multilabel:bool=False, id_attr:str=None,
                             text_attr:str='text',
                             text_pair_attr:str=None,
                             label_attrs:str|list[str]='label',
                             is_valid_attr:str='is_valid',
                             label_mapping:list[str]=None,
                             tok_kwargs:dict={})

Initialize self. See help(type(self)) for accurate signature.

Type Default Details
hf_tokenizer PreTrainedTokenizerBase A Hugging Face tokenizer
batch_size int 1000 The number of examples to process at a time
is_multilabel bool False Whether the dataset should be processed for multi-label; if True, will ensure label_attrs are converted to a value of either 0 or 1, indicating the existence of the class in the example
id_attr str None The unique identifier in the dataset
text_attr str text The attribute holding the text
text_pair_attr str None The attribute holding the text_pair
label_attrs str | list[str] label The attribute(s) holding the label(s) of the example
is_valid_attr str is_valid The attribute that will be created if you are processing separate training and validation datasets into a single dataset; indicates which dataset each example belongs to
label_mapping list[str] None A list indicating the valid labels for the dataset (optional; defaults to the unique set of labels found in the full dataset)
tok_kwargs dict {} Tokenization kwargs that will be applied when calling the tokenizer

Starting with version 2.0, BLURR provides a sequence classification preprocessing class that can be used to preprocess DataFrames or Hugging Face Datasets.

This class can be used for preprocessing both multiclass and multilabel classification datasets, and adds proc_{your_text_attr} and (optionally) proc_{your_text_pair_attr} attributes containing your modified text as a result of tokenization (e.g., if you specify a max_length, the proc_{your_text_attr} may contain truncated text).

Note: This class works for both slow and fast tokenizers.

Using a DataFrame

preprocessor = ClassificationPreprocessor(hf_tokenizer, label_mapping=labels, tok_kwargs={"max_length": 24})

proc_df = preprocessor.process_df(imdb_df)
proc_df.columns, len(proc_df)
proc_df.head(2)
proc_text text label is_valid label_name text_start_char_idx text_end_char_idx
0 What an overlooked 80's soundtrack. I imagine John Travolta sang some of the songs but in watching What an overlooked 80's soundtrack. I imagine John Travolta sang some of the songs but in watching the movie it did seem to personify everything that was 80s cheese. Clearly movies that rely on mechanical bulls, bartenders and immature relationships were in style. The best was his lousy Texas accent. Compare that to Friday Night Lights.I suggest watching Cocktail and Stir Crazy to start really getting into the dumbing down of film. Also, as a side note Made in America with Ted Danson and Whoopie Goldberg is an awesomely bad movie. I was so shocked to realize I had never watched it. One mor... 1 False pos 0 98
1 An archaeologist (Casper Van Dien) stumbles accidentally upon an ancient, 40 foot mummy, well An archaeologist (Casper Van Dien) stumbles accidentally upon an ancient, 40 foot mummy, well preserved underground in the Nevada desert. They are determined to keep this a secret and call in a Jewish translator to assist in figuring out the history of it. The mummy, as explained at the beginning, is the son of a fallen angel and is one of several giants that apparently existed in "those days". In order to save his son from a devastating flood which was predicted to kill everything, he mummifies his son, burying him with several servants for centuries - planning to awaken him years from th... 0 False neg 0 93

Using a Hugging Face Dataset

preprocessor = ClassificationPreprocessor(hf_tokenizer, label_mapping=labels)

proc_ds = preprocessor.process_hf_dataset(final_ds)
proc_ds
Dataset({
    features: ['proc_text', 'text', 'label', 'is_valid', 'label_name', 'text_start_char_idx', 'text_end_char_idx'],
    num_rows: 1200
})

Mid-level API

Base tokenization, batch transform, and DataBlock methods


source

TextInput

 TextInput (x, **kwargs)

The base representation of your inputs; used by the various fastai show methods

A TextInput object is returned from the decodes method of BatchDecodeTransform as a means to customize @typedispatched functions like DataLoaders.show_batch and Learner.show_results. The value will be your “input_ids”.
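
For illustration, here is a minimal sketch (not Blurr’s actual implementation) of how a @typedispatch’d show method can key off TextInput; it assumes DataLoaders built with Blurr and uses the first_blurr_tfm utility (documented below) to recover the tokenizer:

from fastcore.dispatch import typedispatch

@typedispatch
def show_batch(x: TextInput, y, samples, dataloaders=None, max_n=6, trunc_at=None, **kwargs):
    # recover a Blurr transform (and with it the tokenizer) from the DataLoaders
    hf_tokenizer = first_blurr_tfm(dataloaders).hf_tokenizer
    for sample in samples[:max_n]:
        # each sample's first element holds the input_ids for one example
        print(hf_tokenizer.decode(sample[0], skip_special_tokens=True)[:trunc_at])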


source

BatchTokenizeTransform

 BatchTokenizeTransform (hf_arch:str, hf_config:PretrainedConfig,
                         hf_tokenizer:PreTrainedTokenizerBase,
                         hf_model:PreTrainedModel,
                         include_labels:bool=True,
                         ignore_token_id:int=-100, max_length:int=None,
                         padding:bool|str=True, truncation:bool|str=True,
                         is_split_into_words:bool=False,
                         tok_kwargs:dict={}, **kwargs)

Handles everything you need to assemble a mini-batch of inputs and targets, as well as decode the dictionary produced as a byproduct of the tokenization process in the encodes method.

Type Default Details
hf_arch str The abbreviation/name of your Hugging Face transformer architecture (e.g., bert, bart, etc.)
hf_config PretrainedConfig A specific configuration instance you want to use
hf_tokenizer PreTrainedTokenizerBase A Hugging Face tokenizer
hf_model PreTrainedModel A Hugging Face model
include_labels bool True Controls whether the “labels” are included in your inputs; if they are, the loss will be calculated in the model’s forward function and you can simply use PreCalculatedLoss as your Learner’s loss function
ignore_token_id int -100 The token ID that should be ignored when calculating the loss
max_length int None Controls the length of the padding/truncation; can be an integer or None, in which case it will default to the maximum length the model can accept. If the model has no specific maximum input length, truncation/padding to max_length is deactivated. See “Everything you always wanted to know about padding and truncation”
padding bool | str True Controls the padding applied to your hf_tokenizer during tokenization; if None, will default to False or ‘do_not_pad’. See “Everything you always wanted to know about padding and truncation”
truncation bool | str True Controls the truncation applied to your hf_tokenizer during tokenization; if None, will default to False or ‘do_not_truncate’. See “Everything you always wanted to know about padding and truncation”
is_split_into_words bool False The is_split_into_words argument applied to your hf_tokenizer during tokenization; set this to True if your inputs are pre-tokenized (not numericalized)
tok_kwargs dict {} Any other keyword arguments you want included when using your hf_tokenizer to tokenize your inputs
kwargs

Inspired by this article, BatchTokenizeTransform inputs can come in as raw text, a list of words (e.g., tasks like Named Entity Recognition (NER), where you want to predict the label of each token), or as a dictionary that includes extra information you want to use during post-processing.

On-the-fly Batch-Time Tokenization:

Part of the inspiration for this derives from the mechanics of Hugging Face tokenizers, in particular their ability to return a collated mini-batch of data given a list of sequences. As such, the collating required for our inputs can be done during tokenization, before our batch transforms run, in a before_batch transform (where we get a list of examples)! This allows users of BLURR to have everything done dynamically at batch-time, without prior preprocessing, with at least four potential benefits:

1. Less code
2. Faster mini-batch creation
3. Less RAM utilization and less time spent tokenizing beforehand (this really helps with very large datasets)
4. Flexibility
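
For example, a single tokenizer call over a list of raw texts returns a fully collated mini-batch; this is standard Hugging Face tokenizer behavior, shown here on its own:

# one call tokenizes, truncates, pads, and converts a list of sequences to tensors
batch = hf_tokenizer(
    ["A short review.", "A much longer review that sets the padded length for the whole batch."],
    padding=True,
    truncation=True,
    return_tensors="pt",
)
batch["input_ids"].shape  # e.g., torch.Size([2, 16])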


source

BatchDecodeTransform

 BatchDecodeTransform (input_return_type:type=<class
                       '__main__.TextInput'>, hf_arch:str=None,
                       hf_config:PretrainedConfig=None,
                       hf_tokenizer:PreTrainedTokenizerBase=None,
                       hf_model:PreTrainedModel=None, **kwargs)

A class used to cast your inputs as input_return_type for fastai show methods

Type Default Details
input_return_type type TextInput Used by typedispatched show methods
hf_arch str None The abbreviation/name of your Hugging Face transformer architecture (not required if passing in an instance of BatchTokenizeTransform to before_batch_tfm)
hf_config PretrainedConfig None A Hugging Face configuration object (not required if passing in an instance of BatchTokenizeTransform to before_batch_tfm)
hf_tokenizer PreTrainedTokenizerBase None A Hugging Face tokenizer (not required if passing in an instance of BatchTokenizeTransform to before_batch_tfm)
hf_model PreTrainedModel None A Hugging Face model (not required if passing in an instance of BatchTokenizeTransform to before_batch_tfm)
kwargs

As of fastai 2.1.5, before batch transforms no longer have a decodes method … and so, I’ve introduced a standard batch transform here, BatchDecodeTransform (one that runs after the batch has been created), that will do the decoding for us.


source

blurr_sort_func

 blurr_sort_func (example, hf_tokenizer:transformers.tokenization_utils_ba
                  se.PreTrainedTokenizerBase,
                  is_split_into_words:bool=False, tok_kwargs:dict={})

This method is used by SortedDL to sort your dataset by the number of tokens in each example after tokenization

Type Default Details
example
hf_tokenizer PreTrainedTokenizerBase A Hugging Face tokenizer
is_split_into_words bool False The is_split_into_words argument applied to your hf_tokenizer during tokenization; set this to True if your inputs are pre-tokenized (not numericalized)
tok_kwargs dict {} Any other keyword arguments you want to include during tokenization
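
TextBlock wires this up for you (SortedDL is its default DataLoader type), but a sketch of the hook-up, assuming you were building the DataLoader yourself, looks like this:

from functools import partial

from fastai.text.data import SortedDL

# bind the tokenizer so the sort function can measure each example's token length;
# a SortedDL created with sort_func=sort_func then batches similarly sized examples together
sort_func = partial(blurr_sort_func, hf_tokenizer=hf_tokenizer)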

source

TextBlock

 TextBlock (hf_arch:str=None,
            hf_config:transformers.configuration_utils.PretrainedConfig=No
            ne, hf_tokenizer:transformers.tokenization_utils_base.PreTrain
            edTokenizerBase=None,
            hf_model:transformers.modeling_utils.PreTrainedModel=None,
            include_labels:bool=True, ignore_token_id=-100,
            batch_tokenize_tfm:__main__.BatchTokenizeTransform=None,
            batch_decode_tfm:__main__.BatchDecodeTransform=None,
            max_length:int=None, padding:bool|str=True,
            truncation:bool|str=True, is_split_into_words:bool=False,
            input_return_type:type=<class '__main__.TextInput'>,
            dl_type:fastai.data.load.DataLoader=None,
            batch_tokenize_kwargs:dict={}, batch_decode_kwargs:dict={},
            tok_kwargs:dict={}, text_gen_kwargs:dict={}, **kwargs)

The core TransformBlock to prepare your inputs for training in Blurr with fastai’s DataBlock API

Type Default Details
hf_arch str None The abbreviation/name of your Hugging Face transformer architecture (not required if passing in an instance of BatchTokenizeTransform to before_batch_tfm)
hf_config PretrainedConfig None A Hugging Face configuration object (not required if passing in an instance of BatchTokenizeTransform to before_batch_tfm)
hf_tokenizer PreTrainedTokenizerBase None A Hugging Face tokenizer (not required if passing in an instance of BatchTokenizeTransform to before_batch_tfm)
hf_model PreTrainedModel None A Hugging Face model (not required if passing in an instance of BatchTokenizeTransform to before_batch_tfm)
include_labels bool True Controls whether the “labels” are included in your inputs; if they are, the loss will be calculated in the model’s forward function and you can simply use PreCalculatedLoss as your Learner’s loss function
ignore_token_id int -100 The token ID that should be ignored when calculating the loss
batch_tokenize_tfm BatchTokenizeTransform None The before_batch_tfm you want to use to tokenize your raw data on the fly (defaults to an instance of BatchTokenizeTransform)
batch_decode_tfm BatchDecodeTransform None The batch_tfm you want to use to decode your inputs into a type that can be used in the fastai show methods (defaults to BatchDecodeTransform)
max_length int None Controls the length of the padding/truncation; can be an integer or None, in which case it will default to the maximum length the model can accept. If the model has no specific maximum input length, truncation/padding to max_length is deactivated. See “Everything you always wanted to know about padding and truncation”
padding bool | str True Controls the padding applied to your hf_tokenizer during tokenization; if None, will default to False or ‘do_not_pad’. See “Everything you always wanted to know about padding and truncation”
truncation bool | str True Controls the truncation applied to your hf_tokenizer during tokenization; if None, will default to False or ‘do_not_truncate’. See “Everything you always wanted to know about padding and truncation”
is_split_into_words bool False The is_split_into_words argument applied to your hf_tokenizer during tokenization; set this to True if your inputs are pre-tokenized (not numericalized)
input_return_type type TextInput The return type your decoded inputs should be cast to (used by methods such as show_batch)
dl_type DataLoader None The type of DataLoader you want created (defaults to SortedDL)
batch_tokenize_kwargs dict {} Any keyword arguments you want applied to your batch_tokenize_tfm
batch_decode_kwargs dict {} Any keyword arguments you want applied to your batch_decode_tfm (will be set as fastai batch_tfms)
tok_kwargs dict {} Any keyword arguments you want your Hugging Face tokenizer to use during tokenization
text_gen_kwargs dict {} Any keyword arguments you want applied when generating text
kwargs

Designed with sensible defaults to minimize user effort in defining the transforms pipeline, TextBlock handles setting up your BatchTokenizeTransform and BatchDecodeTransform transforms regardless of data source (e.g., it will work with files, DataFrames, whatever).

Note: You must either pass in your own instance of a BatchTokenizeTransform class or the Hugging Face objects returned from BLURR.get_hf_objects (e.g., architecture, config, tokenizer, and model). The other args are optional.

We also include a blurr_sort_func that works with SortedDL to properly sort based on the number of tokens in each example.

Utility classes and methods

These methods are used internally to get the Blurr transforms associated with your DataLoaders.


source

get_blurr_tfm

 get_blurr_tfm (tfms_list:fastcore.transform.Pipeline,
                tfm_class:fastcore.transform.Transform=<class
                '__main__.BatchTokenizeTransform'>)

Given a fastai DataLoaders’ batch transforms Pipeline, this method can be used to get at a transform instance used in your Blurr DataBlock

Type Default Details
tfms_list Pipeline A list of transforms (e.g., dls.after_batch, dls.before_batch, etc…)
tfm_class Transform BatchTokenizeTransform The transform to find

source

first_blurr_tfm

 first_blurr_tfm (dls:fastai.data.core.DataLoaders,
                  tfms:list[fastcore.transform.Transform]=[<class
                  '__main__.BatchTokenizeTransform'>, <class
                  '__main__.BatchDecodeTransform'>])

This convenience method will find the first Blurr transform required for methods such as show_batch and show_results. The returned transform should have everything you need to properly decode and ‘show’ your Hugging Face inputs/targets.

Type Default Details
dls DataLoaders Your fastai DataLoaders
tfms list[Transform] [BatchTokenizeTransform, BatchDecodeTransform] The Blurr transforms to look for, in order
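
For example, given DataLoaders built with a TextBlock (as in the examples below), either utility will fetch the transforms Blurr attached; a short sketch:

# grab the batch-time tokenization transform from the before_batch pipeline
batch_tok_tfm = get_blurr_tfm(dls.before_batch, tfm_class=BatchTokenizeTransform)

# or simply take the first Blurr transform found on the DataLoaders
tfm = first_blurr_tfm(dls)
tfm.hf_arch, tfm.hf_tokenizer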

Mid-level Examples

The following examples demonstrate several approaches to constructing your DataBlock for sequence classification tasks using the mid-level API.

Batch-Time Tokenization

Step 1: Get your Hugging Face objects.

There are a bunch of ways we can get at the four Hugging Face elements we need (e.g., architecture name, tokenizer, config, and model). We can just create them directly, or we can use one of the helper methods available via BLURR.
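
Creating them directly is plain transformers usage; for example (a sketch, with the architecture name taken from the config’s model_type):

from transformers import AutoConfig, AutoModelForSequenceClassification, AutoTokenizer

pretrained_model_name = "distilroberta-base"
hf_config = AutoConfig.from_pretrained(pretrained_model_name)
hf_tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name)
hf_model = AutoModelForSequenceClassification.from_pretrained(pretrained_model_name, config=hf_config)
hf_arch = hf_config.model_type  # "roberta" for distilroberta-base

Below, we use the get_hf_objects helper instead: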

model_cls = AutoModelForSequenceClassification

pretrained_model_name = "distilroberta-base"  # "distilbert-base-uncased" "bert-base-uncased"
hf_arch, hf_config, hf_tokenizer, hf_model = get_hf_objects(pretrained_model_name, model_cls=model_cls)

Step 2: Create your DataBlock

blocks = (TextBlock(hf_arch, hf_config, hf_tokenizer, hf_model, batch_tokenize_kwargs={"labels": labels}), CategoryBlock)
dblock = DataBlock(blocks=blocks, get_x=ColReader("text"), get_y=ColReader("label"), splitter=ColSplitter())

Step 3: Build your DataLoaders

dls = dblock.dataloaders(imdb_df, bs=4)
b = dls.one_batch()
len(b), len(b[0]["input_ids"]), b[0]["input_ids"].shape, len(b[1])
(2, 4, torch.Size([4, 512]), 4)
b[0]
{'input_ids': tensor([[   0, 5102, 3764,  ..., 1530,   36,    2],
         [   0,   22,  250,  ..., 5422,  278,    2],
         [   0, 9342, 1864,  ...,   80,    6,    2],
         [   0,  318,   47,  ..., 5320,  853,    2]], device='cuda:1'),
 'attention_mask': tensor([[1, 1, 1,  ..., 1, 1, 1],
         [1, 1, 1,  ..., 1, 1, 1],
         [1, 1, 1,  ..., 1, 1, 1],
         [1, 1, 1,  ..., 1, 1, 1]], device='cuda:1'),
 'labels': TensorCategory([1, 0, 1, 1], device='cuda:1')}

Let’s take a look at the actual types represented by our batch

explode_types(b)
{tuple: [dict, fastai.torch_core.TensorCategory]}
dls.show_batch(dataloaders=dls, max_n=2, trunc_at=500)
text target
0 ANCHORS AWEIGH sees two eager young sailors, Joe Brady (Gene Kelly) and Clarence Doolittle/Brooklyn (Frank Sinatra), get a special four-day shore leave. Eager to get to the girls, particularly Joe's Lola, neither Joe nor Brooklyn figure on the interruption of little Navy-mad Donald (Dean Stockwell) and his Aunt Susie (Kathryn Grayson). Unexperienced in the ways of females and courting, Brooklyn quickly enlists Joe to help him win Aunt Susie over. Along the way, however, Joe finds himself fallin pos
1 WARNING: POSSIBLE SPOILERS (Not that you should care. Also, sorry for the caps.)<br /><br />Starting with an unnecessarily dramatic voice that's all the more annoying for talking nonsense, it goes on with nonsense and unnecessary drama. That's badly but accurately put.<br /><br />We know space travel is a risky enterprise. There's a complicated system with a lot of potential for malfunctions, radiation, stress-related symptoms etc, and unexpected things are bound to happen in largely unknown en neg

Using a preprocessed dataset

Preprocessing your raw data is the more traditional approach to using Transformers. It is required, for example, when you want to work with documents longer than your model will allow. A preprocessed dataset is used in the same way a non-preprocessed dataset is.

Step 1a: Get your Hugging Face objects.

hf_arch, hf_config, hf_tokenizer, hf_model = get_hf_objects(pretrained_model_name, model_cls=model_cls)

Step 1b: Preprocess dataset

preprocessor = ClassificationPreprocessor(hf_tokenizer, label_mapping=labels)
proc_ds = preprocessor.process_hf_dataset(final_ds)
proc_ds
Dataset({
    features: ['proc_text', 'text', 'label', 'is_valid', 'label_name', 'text_start_char_idx', 'text_end_char_idx'],
    num_rows: 1200
})

Step 2: Create your DataBlock

blocks = (TextBlock(hf_arch, hf_config, hf_tokenizer, hf_model, batch_tokenize_kwargs={"labels": labels}), CategoryBlock)
dblock = DataBlock(blocks=blocks, get_x=ItemGetter("proc_text"), get_y=ItemGetter("label"), splitter=RandomSplitter())

Step 3: Build your DataLoaders

dls = dblock.dataloaders(proc_ds, bs=4)
dls.show_batch(dataloaders=dls, max_n=2, trunc_at=500)
text target
0 I saw this film at the Adelaide Film Festival '07 and was thoroughly intrigued for all 106 minutes. I like documentaries, but often find them dragging with about 25 minutes to go. Forbidden Lie$ powered on though, never losing my interest.<br /><br />The film's subject is Norma Khoury, a Jordanian woman who found fame and fortune in 2001 with the publication of her book Forbidden Love, a biographical story of sorts concerning a Muslim friend of hers who was murdered by her family for having a r pos
1 ANCHORS AWEIGH sees two eager young sailors, Joe Brady (Gene Kelly) and Clarence Doolittle/Brooklyn (Frank Sinatra), get a special four-day shore leave. Eager to get to the girls, particularly Joe's Lola, neither Joe nor Brooklyn figure on the interruption of little Navy-mad Donald (Dean Stockwell) and his Aunt Susie (Kathryn Grayson). Unexperienced in the ways of females and courting, Brooklyn quickly enlists Joe to help him win Aunt Susie over. Along the way, however, Joe finds himself fallin pos

Passing extra information

As of v2.0, BLURR also allows you to pass extra information alongside your inputs in the form of a dictionary. If you use this approach, you must assign your text(s) to the text attribute of the dictionary. This is useful when splitting long documents into chunks but wanting to score/predict by example rather than by chunk (for example, in extractive question answering tasks).

Note: A good place to access this extra information during training/validation is in the before_batch method of a Callback.

blocks = (TextBlock(hf_arch, hf_config, hf_tokenizer, hf_model, batch_tokenize_kwargs={"labels": labels}), CategoryBlock)


def get_x(item):
    return {"text": item.text, "another_val": "testing123"}


dblock = DataBlock(blocks=blocks, get_x=get_x, get_y=ColReader("label"), splitter=ColSplitter())
dls = dblock.dataloaders(imdb_df, bs=4)
b = dls.one_batch()
len(b), len(b[0]["input_ids"]), b[0]["input_ids"].shape, len(b[1])
(2, 4, torch.Size([4, 512]), 4)
dls.show_batch(dataloaders=dls, max_n=2, trunc_at=500)
text target
0 ANCHORS AWEIGH sees two eager young sailors, Joe Brady (Gene Kelly) and Clarence Doolittle/Brooklyn (Frank Sinatra), get a special four-day shore leave. Eager to get to the girls, particularly Joe's Lola, neither Joe nor Brooklyn figure on the interruption of little Navy-mad Donald (Dean Stockwell) and his Aunt Susie (Kathryn Grayson). Unexperienced in the ways of females and courting, Brooklyn quickly enlists Joe to help him win Aunt Susie over. Along the way, however, Joe finds himself fallin pos
1 I've rented and watched this movie for the 1st time on DVD without reading any reviews about it. So, after 15 minutes of watching I've noticed that something is wrong with this movie; it's TERRIBLE! I mean, in the trailers it looked scary and serious!<br /><br />I think that Eli Roth (Mr. Director) thought that if all the characters in this film were stupid, the movie would be funny...(So stupid, it's funny...? WRONG!) He should watch and learn from better horror-comedies such as:"Fright Night" neg
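
For instance, a minimal Callback sketch (assuming the extra "another_val" key attached in get_x above rides along in the batch’s inputs dictionary):

from fastai.callback.core import Callback

class GrabExtraInfo(Callback):
    def before_batch(self):
        inputs = self.xb[0]  # the Hugging Face inputs dictionary for this batch
        extra = inputs.get("another_val")  # the extra value attached in get_x above
        # ... use `extra` to adjust the inputs, stash state for later callbacks, etc.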

Low-level API

The low-level API allows you to get back fastai-specific features such as show_batch, show_results, etc. when working with plain ol’ PyTorch Datasets, Hugging Face Datasets, and fastai Datasets & DataLoaders.


source

TextBatchCreator

 TextBatchCreator (hf_arch:str,
                   hf_config:transformers.configuration_utils.PretrainedCo
                   nfig, hf_tokenizer:transformers.tokenization_utils_base
                   .PreTrainedTokenizerBase,
                   hf_model:transformers.modeling_utils.PreTrainedModel,
                   data_collator:type=None)

A class that can be assigned to a TfmdDL.create_batch method; used in Blurr’s low-level API to create batches that can be used in the Blurr library


source

TextDataLoader

 TextDataLoader (dataset:torch.utils.data.dataset.Dataset|Datasets,
                 hf_arch:str, hf_config:PretrainedConfig,
                 hf_tokenizer:PreTrainedTokenizerBase,
                 hf_model:PreTrainedModel,
                 batch_creator:TextBatchCreator=None,
                 batch_decode_tfm:BatchDecodeTransform=None,
                 input_return_type:type=<class '__main__.TextInput'>,
                 preproccesing_func:Callable=None,
                 batch_decode_kwargs:dict={}, bs:int=64,
                 shuffle:bool=False, num_workers:int=None,
                 verbose:bool=False, do_setup:bool=True, pin_memory=False,
                 timeout=0, batch_size=None, drop_last=False,
                 indexed=None, n=None, device=None,
                 persistent_workers=False, pin_memory_device='', wif=None,
                 before_iter=None, after_item=None, before_batch=None,
                 after_batch=None, after_iter=None, create_batches=None,
                 create_item=None, create_batch=None, retain=None,
                 get_idxs=None, sample=None, shuffle_fn=None,
                 do_batch=None)

A transformed DataLoader that works with Blurr. From the fastai docs: a TfmdDL is “a DataLoader that creates Pipeline from a list of Transforms for the callbacks after_item, before_batch and after_batch. As a result, it can decode or show a processed batch.”

Type Default Details
dataset torch.utils.data.dataset.Dataset | Datasets A standard PyTorch Dataset or fastai Datasets
hf_arch str The abbreviation/name of your Hugging Face transformer architecture
hf_config PretrainedConfig A Hugging Face configuration object
hf_tokenizer PreTrainedTokenizerBase A Hugging Face tokenizer
hf_model PreTrainedModel A Hugging Face model
batch_creator TextBatchCreator None An instance of TextBatchCreator or equivalent (defaults to TextBatchCreator)
batch_decode_tfm BatchDecodeTransform None The batch_tfm used to decode Blurr batches (defaults to BatchDecodeTransform)
input_return_type type TextInput Used by typedispatched show methods
preproccesing_func Callable None (optional) A preprocessing function that will be applied to your dataset
batch_decode_kwargs dict {} Keyword arguments to be applied to your batch_decode_tfm
bs int 64
shuffle bool False
num_workers int None
verbose bool False
do_setup bool True
pin_memory bool False
timeout int 0
batch_size NoneType None
drop_last bool False
indexed NoneType None
n NoneType None
device NoneType None
persistent_workers bool False
pin_memory_device str
wif NoneType None
before_iter NoneType None
after_item NoneType None
before_batch NoneType None
after_batch NoneType None
after_iter NoneType None
create_batches NoneType None
create_item NoneType None
create_batch NoneType None
retain NoneType None
get_idxs NoneType None
sample NoneType None
shuffle_fn NoneType None
do_batch NoneType None

Low-level Examples

The following example demonstrates how to use the low-level API with standard PyTorch/Hugging Face/fast.ai Datasets and DataLoaders.

Step 1: Build your datasets

raw_datasets = load_dataset("glue", "mrpc")
Reusing dataset glue (/home/wgilliam/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad)
def tokenize_function(example):
    return hf_tokenizer(example["sentence1"], example["sentence2"], truncation=True)


tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)

Step 2: Dataset pre-processing (optional)


source

preproc_hf_dataset

 preproc_hf_dataset
                     (dataset:torch.utils.data.dataset.Dataset|fastai.data
                     .core.Datasets, hf_tokenizer:transformers.tokenizatio
                     n_utils_base.PreTrainedTokenizerBase,
                     hf_model:transformers.modeling_utils.PreTrainedModel)

This method can be used to preprocess most Hugging Face Datasets for use in Blurr and other training libraries

Type Details
dataset torch.utils.data.dataset.Dataset | Datasets A standard PyTorch Dataset or fast.ai Datasets
hf_tokenizer PreTrainedTokenizerBase A Hugging Face tokenizer
hf_model PreTrainedModel A Hugging Face model

Step 3: Build your DataLoaders.

Use TextDataLoader to build Blurr-friendly DataLoaders from your datasets. Passing {'labels': label_names} to batch_decode_kwargs ensures that your label/target names will be displayed in methods like show_batch and show_results (just as with the mid-level API).

label_names = raw_datasets["train"].features["label"].names

trn_dl = TextDataLoader(
    tokenized_datasets["train"],
    hf_arch,
    hf_config,
    hf_tokenizer,
    hf_model,
    preproccesing_func=preproc_hf_dataset,
    batch_decode_kwargs={"labels": label_names},
    shuffle=True,
    batch_size=8,
)

val_dl = TextDataLoader(
    tokenized_datasets["validation"],
    hf_arch,
    hf_config,
    hf_tokenizer,
    hf_model,
    preproccesing_func=preproc_hf_dataset,
    batch_decode_kwargs={"labels": label_names},
    batch_size=16,
)

dls = DataLoaders(trn_dl, val_dl)
b = dls.one_batch()
b[0]["input_ids"].shape
torch.Size([8, 65])
dls.show_batch(dataloaders=dls, max_n=2, trunc_at=800)
text target
0 The technology-laced Nasdaq Composite Index.IXIC inched down 1 point, or 0.11 percent, to 1,650. The broad Standard & Poor's 500 Index.SPX inched up 3 points, or 0.32 percent, to 970. not_equivalent
1 His 1996 Chevrolet Tahoe was found abandoned June 25 in a Virginia Beach, Va., parking lot. His sport utility vehicle was found June 25, abandoned without its license plates in Virginia Beach, Va. equivalent

Tests

The tests below ensure the core DataBlock code above works for all pretrained sequence classification models available in Hugging Face. These tests are excluded from the CI workflow because of how long they take to run and the amount of data they would require to download.

Note: Feel free to modify the code below to test whatever pretrained classification models you are working with … and if any of your pretrained sequence classification models fail, please submit a GitHub issue (or a PR if you’d like to fix it yourself).
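
The test harness itself is not shown on this page; a minimal sketch of the kind of loop it runs (model list abbreviated, helper names taken from the examples above) might look like:

test_results = []
for model_name in ["hf-internal-testing/tiny-bert", "roberta-base"]:  # abbreviated list
    try:
        hf_arch, hf_config, hf_tokenizer, hf_model = get_hf_objects(
            model_name, model_cls=AutoModelForSequenceClassification
        )
        blocks = (TextBlock(hf_arch, hf_config, hf_tokenizer, hf_model), CategoryBlock)
        dblock = DataBlock(blocks=blocks, get_x=ColReader("text"), get_y=ColReader("label"), splitter=ColSplitter())
        dls = dblock.dataloaders(imdb_df, bs=4)
        dls.one_batch()  # building a batch is the smoke test
        test_results.append((hf_arch, type(hf_tokenizer).__name__, model_name, "PASSED", ""))
    except Exception as err:
        test_results.append(("?", "?", model_name, "FAILED", str(err)))

pd.DataFrame(test_results, columns=["arch", "tokenizer", "model_name", "result", "error"])

The table below shows the results of a full run: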

arch tokenizer model_name result error
0 albert AlbertTokenizerFast hf-internal-testing/tiny-albert PASSED
1 bart BartTokenizerFast hf-internal-testing/tiny-random-bart PASSED
2 bert BertTokenizerFast hf-internal-testing/tiny-bert PASSED
3 big_bird BigBirdTokenizerFast google/bigbird-roberta-base PASSED
4 bigbird_pegasus PegasusTokenizerFast google/bigbird-pegasus-large-arxiv PASSED
5 ctrl CTRLTokenizer hf-internal-testing/tiny-random-ctrl PASSED
6 camembert CamembertTokenizerFast camembert-base PASSED
7 canine CanineTokenizer hf-internal-testing/tiny-random-canine PASSED
8 convbert ConvBertTokenizerFast YituTech/conv-bert-base PASSED
9 deberta DebertaTokenizerFast hf-internal-testing/tiny-deberta PASSED
10 deberta_v2 DebertaV2TokenizerFast hf-internal-testing/tiny-random-deberta-v2 PASSED
11 distilbert DistilBertTokenizerFast hf-internal-testing/tiny-random-distilbert PASSED
12 electra ElectraTokenizerFast hf-internal-testing/tiny-electra PASSED
13 fnet FNetTokenizerFast google/fnet-base PASSED
14 flaubert FlaubertTokenizer hf-internal-testing/tiny-random-flaubert PASSED
15 funnel FunnelTokenizerFast hf-internal-testing/tiny-random-funnel PASSED
16 gpt2 GPT2TokenizerFast hf-internal-testing/tiny-random-gpt2 PASSED
17 gptj GPT2TokenizerFast anton-l/gpt-j-tiny-random PASSED
18 gpt_neo GPT2TokenizerFast hf-internal-testing/tiny-random-gpt_neo PASSED
19 ibert RobertaTokenizer kssteven/ibert-roberta-base PASSED
20 led LEDTokenizerFast hf-internal-testing/tiny-random-led PASSED
21 longformer LongformerTokenizerFast hf-internal-testing/tiny-random-longformer PASSED
22 mbart MBartTokenizerFast hf-internal-testing/tiny-random-mbart PASSED
23 mpnet MPNetTokenizerFast hf-internal-testing/tiny-random-mpnet PASSED
24 mobilebert MobileBertTokenizerFast hf-internal-testing/tiny-random-mobilebert PASSED
25 openai OpenAIGPTTokenizerFast openai-gpt PASSED
26 reformer ReformerTokenizerFast google/reformer-crime-and-punishment PASSED
27 rembert RemBertTokenizerFast google/rembert PASSED
28 roformer RoFormerTokenizerFast junnyu/roformer_chinese_sim_char_ft_small PASSED
29 roberta RobertaTokenizerFast roberta-base PASSED
30 squeezebert SqueezeBertTokenizerFast squeezebert/squeezebert-uncased PASSED
31 transfo_xl TransfoXLTokenizer hf-internal-testing/tiny-random-transfo-xl PASSED
32 xlm XLMTokenizer xlm-mlm-en-2048 PASSED
33 xlm_roberta XLMRobertaTokenizerFast xlm-roberta-base PASSED
34 xlnet XLNetTokenizerFast xlnet-base-cased PASSED

The text.data.core module contains the fundamental bits for all data preprocessing tasks.