Data

The text.data.core module contains the core bits required to use the fastai DataBlock API and/or mid-level data processing pipelines to turn your raw datasets into modelable DataLoaders for text/NLP tasks.
What we're running with at the time this documentation was generated:
torch: 1.9.0+cu102
fastai: 2.7.9
transformers: 4.21.2

Setup

We’ll use a subset of imdb to demonstrate how to configure BLURR for sequence classification tasks.

import pandas as pd
from datasets import concatenate_datasets, load_dataset

raw_datasets = load_dataset("imdb", split=["train", "test"])

raw_datasets[0] = raw_datasets[0].add_column("is_valid", [False] * len(raw_datasets[0]))
raw_datasets[1] = raw_datasets[1].add_column("is_valid", [True] * len(raw_datasets[1]))

final_ds = concatenate_datasets([raw_datasets[0].shuffle().select(range(1000)), raw_datasets[1].shuffle().select(range(200))])

imdb_df = pd.DataFrame(final_ds)
imdb_df.head()
Reusing dataset imdb (/home/wgilliam/.cache/huggingface/datasets/imdb/plain_text/1.0.0/2fdd8b9bcadd6e7055e742a706876ba43f19faee861df134affd7a3f60fc38a1)
text label is_valid
0 What an overlooked 80's soundtrack. I imagine John Travolta sang some of the songs but in watching the movie it did seem to personify everything that was 80s cheese. Clearly movies that rely on mechanical bulls, bartenders and immature relationships were in style. The best was his lousy Texas accent. Compare that to Friday Night Lights.I suggest watching Cocktail and Stir Crazy to start really getting into the dumbing down of film. Also, as a side note Made in America with Ted Danson and Whoopie Goldberg is an awesomely bad movie. I was so shocked to realize I had never watched it. One mor... 1 False
1 An archaeologist (Casper Van Dien) stumbles accidentally upon an ancient, 40 foot mummy, well preserved underground in the Nevada desert. They are determined to keep this a secret and call in a Jewish translator to assist in figuring out the history of it. The mummy, as explained at the beginning, is the son of a fallen angel and is one of several giants that apparently existed in "those days". In order to save his son from a devastating flood which was predicted to kill everything, he mummifies his son, burying him with several servants for centuries - planning to awaken him years from th... 0 False
2 Sudden Impact is a two pronged story. Harry is targeted by the mob who want to kill him and Harry is very glad to return the favour and show them how it's done. This little war puts Harry on suspension which he doesn't care about but he goes away on a little vacation. Now the second part of the story. Someone is killing some punks and Harry gets dragged into this situation where he meets Jennifer spencer a woman with a secret that the little tourist town wants to keep quiet. The police Chief is not a subtle man and he warns Harry to not get involved or cause any trouble. This is Harry Call... 1 False
3 It is a superb Swedish film .. it was the first Swedish film I've seen .. it is simple & deep .. what a great combination!.<br /><br />Michael Nyqvist did a great performance as a famous conductor who seeks peace in his hometown.<br /><br />Frida Hallgren was great as his inspirational girlfriend to help him to carry on & never give up.<br /><br />The fight between the conductor and the hypocrite priest who loses his battle with Michael when his wife confronts him And defends Michael's noble cause to help his hometown people finding their own peace in music.<br /><br />The only thing that ... 1 False
4 The plot is about a female nurse, named Anna, is caught in the middle of a world-wide chaos as flesh-eating zombies begin rising up and taking over the world and attacking the living. She escapes into the streets and is rescued by a black police officer. So far, so good! I usually enjoy horror movies, but this piece of film doesn't deserve to be called horror. It's not even thrilling, just ridiculous.Even "the Flintstones" or "Kukla, Fran and Ollie" will give you more excitement. It's like watching a bunch of bloodthirsty drunkards not being able to get into a shopping mall to by more liqu... 0 False
labels = raw_datasets[0].features["label"].names
labels
['neg', 'pos']
from transformers import AutoModelForSequenceClassification
from transformers.utils import logging as hf_logging

model_cls = AutoModelForSequenceClassification
hf_logging.set_verbosity_error()

pretrained_model_name = "roberta-base"  # "bert-base-multilingual-cased"
n_labels = len(labels)

hf_arch, hf_config, hf_tokenizer, hf_model = get_hf_objects(
    pretrained_model_name, model_cls=model_cls, config_kwargs={"num_labels": n_labels}
)

hf_arch, type(hf_config), type(hf_tokenizer), type(hf_model)
('roberta',
 transformers.models.roberta.configuration_roberta.RobertaConfig,
 transformers.models.roberta.tokenization_roberta_fast.RobertaTokenizerFast,
 transformers.models.roberta.modeling_roberta.RobertaForSequenceClassification)

Preprocessing

Starting with version 2.0, BLURR provides a preprocessing base class that can be used to build task-specific pre-processed datasets from pandas DataFrames or Hugging Face Datasets.


source

Preprocessor

 Preprocessor (hf_tokenizer:transformers.tokenization_utils_base.PreTraine
               dTokenizerBase, batch_size:int=1000, text_attr:str='text',
               text_pair_attr:str=None, is_valid_attr:str='is_valid',
               tok_kwargs:dict={})

Initialize self. See help(type(self)) for accurate signature.

Type Default Details
hf_tokenizer PreTrainedTokenizerBase A Hugging Face tokenizer
batch_size int 1000 The number of examples to process at a time
text_attr str text The attribute holding the text
text_pair_attr str None The attribute holding the text_pair
is_valid_attr str is_valid The attribute that will be created if you are processing separate training and validation datasets into a single dataset; indicates which dataset each example belongs to
tok_kwargs dict {} Tokenization kwargs that will be applied when calling the tokenizer

source

ClassificationPreprocessor

 ClassificationPreprocessor (hf_tokenizer:transformers.tokenization_utils_
                             base.PreTrainedTokenizerBase,
                             batch_size:int=1000,
                             is_multilabel:bool=False, id_attr:str=None,
                             text_attr:str='text',
                             text_pair_attr:str=None,
                             label_attrs:str|list[str]='label',
                             is_valid_attr:str='is_valid',
                             label_mapping:list[str]=None,
                             tok_kwargs:dict={})

Initialize self. See help(type(self)) for accurate signature.

Type Default Details
hf_tokenizer PreTrainedTokenizerBase A Hugging Face tokenizer
batch_size int 1000 The number of examples to process at a time
is_multilabel bool False Whether the dataset should be processed for multi-label; if True, will ensure label_attrs are converted to a value of either 0 or 1, indicating the existence of the class in the example
id_attr str None The unique identifier in the dataset
text_attr str text The attribute holding the text
text_pair_attr str None The attribute holding the text_pair
label_attrs str | list[str] label The attribute(s) holding the label(s) of the example
is_valid_attr str is_valid The attribute that will be created if you are processing separate training and validation datasets into a single dataset; indicates which dataset each example belongs to
label_mapping list[str] None A list indicating the valid labels for the dataset (optional; defaults to the unique set of labels found in the full dataset)
tok_kwargs dict {} Tokenization kwargs that will be applied when calling the tokenizer

Starting with version 2.0, BLURR provides a sequence classification preprocessing class that can be used to preprocess DataFrames or Hugging Face Datasets.

This class can be used for preprocessing both multiclass and multilabel classification datasets, and adds proc_{your_text_attr} and (optionally) proc_{your_text_pair_attr} attributes containing your modified text as a result of tokenization (e.g., if you specify a max_length, the proc_{your_text_attr} may contain truncated text).

Note: This class works for both slow and fast tokenizers.

Using a DataFrame

preprocessor = ClassificationPreprocessor(hf_tokenizer, label_mapping=labels, tok_kwargs={"max_length": 24})

proc_df = preprocessor.process_df(imdb_df)
proc_df.columns, len(proc_df)
proc_df.head(2)
proc_text text label is_valid label_name text_start_char_idx text_end_char_idx
0 What an overlooked 80's soundtrack. I imagine John Travolta sang some of the songs but in watching What an overlooked 80's soundtrack. I imagine John Travolta sang some of the songs but in watching the movie it did seem to personify everything that was 80s cheese. Clearly movies that rely on mechanical bulls, bartenders and immature relationships were in style. The best was his lousy Texas accent. Compare that to Friday Night Lights.I suggest watching Cocktail and Stir Crazy to start really getting into the dumbing down of film. Also, as a side note Made in America with Ted Danson and Whoopie Goldberg is an awesomely bad movie. I was so shocked to realize I had never watched it. One mor... 1 False pos 0 98
1 An archaeologist (Casper Van Dien) stumbles accidentally upon an ancient, 40 foot mummy, well An archaeologist (Casper Van Dien) stumbles accidentally upon an ancient, 40 foot mummy, well preserved underground in the Nevada desert. They are determined to keep this a secret and call in a Jewish translator to assist in figuring out the history of it. The mummy, as explained at the beginning, is the son of a fallen angel and is one of several giants that apparently existed in "those days". In order to save his son from a devastating flood which was predicted to kill everything, he mummifies his son, burying him with several servants for centuries - planning to awaken him years from th... 0 False neg 0 93

Using a Hugging Face Dataset

preprocessor = ClassificationPreprocessor(hf_tokenizer, label_mapping=labels)

proc_ds = preprocessor.process_hf_dataset(final_ds)
proc_ds
Dataset({
    features: ['proc_text', 'text', 'label', 'is_valid', 'label_name', 'text_start_char_idx', 'text_end_char_idx'],
    num_rows: 1200
})

Mid-level API

Base tokenization, batch transform, and DataBlock methods


source

TextInput

 TextInput (x, **kwargs)

The base representation of your inputs; used by the various fastai show methods

A TextInput object is returned from the decodes method of BatchDecodeTransform as a means to customize @typedispatched functions like DataLoaders.show_batch and Learner.show_results. The value will be your “input_ids”.
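
For illustration, here is a minimal sketch (not Blurr’s actual implementation) of how a @typedispatch’d show method can key off TextInput; it assumes DataLoaders built with Blurr and uses the first_blurr_tfm utility (documented below) to recover the tokenizer:

from fastcore.dispatch import typedispatch

@typedispatch
def show_batch(x: TextInput, y, samples, dataloaders=None, max_n=6, trunc_at=None, **kwargs):
    # recover a Blurr transform (and with it the tokenizer) from the DataLoaders
    hf_tokenizer = first_blurr_tfm(dataloaders).hf_tokenizer
    for sample in samples[:max_n]:
        # each sample's first element holds the input_ids for one example
        print(hf_tokenizer.decode(sample[0], skip_special_tokens=True)[:trunc_at])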


source

BatchTokenizeTransform

 BatchTokenizeTransform (hf_arch:str, hf_config:PretrainedConfig,
                         hf_tokenizer:PreTrainedTokenizerBase,
                         hf_model:PreTrainedModel,
                         include_labels:bool=True,
                         ignore_token_id:int=-100, max_length:int=None,
                         padding:bool|str=True, truncation:bool|str=True,
                         is_split_into_words:bool=False,
                         tok_kwargs:dict={}, **kwargs)

Handles everything you need to assemble a mini-batch of inputs and targets, as well as decode the dictionary produced as a byproduct of the tokenization process in the encodes method.

Type Default Details
hf_arch str The abbreviation/name of your Hugging Face transformer architecture (e.g., bert, bart, etc.)
hf_config PretrainedConfig A specific configuration instance you want to use
hf_tokenizer PreTrainedTokenizerBase A Hugging Face tokenizer
hf_model PreTrainedModel A Hugging Face model
include_labels bool True Controls whether the “labels” are included in your inputs; if they are, the loss will be calculated in the model’s forward function and you can simply use PreCalculatedLoss as your Learner’s loss function
ignore_token_id int -100 The token ID that should be ignored when calculating the loss
max_length int None Controls the length of the padding/truncation; can be an integer or None, in which case it will default to the maximum length the model can accept. If the model has no specific maximum input length, truncation/padding to max_length is deactivated. See “Everything you always wanted to know about padding and truncation”
padding bool | str True Controls the padding applied to your hf_tokenizer during tokenization; if None, will default to False or ‘do_not_pad’. See “Everything you always wanted to know about padding and truncation”
truncation bool | str True Controls the truncation applied to your hf_tokenizer during tokenization; if None, will default to False or ‘do_not_truncate’. See “Everything you always wanted to know about padding and truncation”
is_split_into_words bool False The is_split_into_words argument applied to your hf_tokenizer during tokenization; set this to True if your inputs are pre-tokenized (not numericalized)
tok_kwargs dict {} Any other keyword arguments you want included when using your hf_tokenizer to tokenize your inputs
kwargs

Inspired by this article, BatchTokenizeTransform inputs can come in as raw text, a list of words (e.g., tasks like Named Entity Recognition (NER), where you want to predict the label of each token), or as a dictionary that includes extra information you want to use during post-processing.

On-the-fly Batch-Time Tokenization:

Part of the inspiration for this derives from the mechanics of Hugging Face tokenizers, in particular their ability to return a collated mini-batch of data given a list of sequences. As such, the collating required for our inputs can be done during tokenization, before our batch transforms run, in a before_batch transform (where we get a list of examples)! This allows users of BLURR to have everything done dynamically at batch-time, without prior preprocessing, with at least four potential benefits:

1. Less code
2. Faster mini-batch creation
3. Less RAM utilization and less time spent tokenizing beforehand (this really helps with very large datasets)
4. Flexibility
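
For example, a single tokenizer call over a list of raw texts returns a fully collated mini-batch; this is standard Hugging Face tokenizer behavior, shown here on its own:

# one call tokenizes, truncates, pads, and converts a list of sequences to tensors
batch = hf_tokenizer(
    ["A short review.", "A much longer review that sets the padded length for the whole batch."],
    padding=True,
    truncation=True,
    return_tensors="pt",
)
batch["input_ids"].shape  # e.g., torch.Size([2, 16])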


source

BatchDecodeTransform

 BatchDecodeTransform (input_return_type:type=<class
                       '__main__.TextInput'>, hf_arch:str=None,
                       hf_config:PretrainedConfig=None,
                       hf_tokenizer:PreTrainedTokenizerBase=None,
                       hf_model:PreTrainedModel=None, **kwargs)

A class used to cast your inputs as input_return_type for fastai show methods

Type Default Details
input_return_type type TextInput Used by typedispatched show methods
hf_arch str None The abbreviation/name of your Hugging Face transformer architecture (not required if passing in an instance of BatchTokenizeTransform to before_batch_tfm)
hf_config PretrainedConfig None A Hugging Face configuration object (not required if passing in an instance of BatchTokenizeTransform to before_batch_tfm)
hf_tokenizer PreTrainedTokenizerBase None A Hugging Face tokenizer (not required if passing in an instance of BatchTokenizeTransform to before_batch_tfm)
hf_model PreTrainedModel None A Hugging Face model (not required if passing in an instance of BatchTokenizeTransform to before_batch_tfm)
kwargs

As of fastai 2.1.5, before batch transforms no longer have a decodes method … and so, I’ve introduced a standard batch transform here, BatchDecodeTransform (one that runs after the batch has been created), that will do the decoding for us.


source

blurr_sort_func

 blurr_sort_func (example, hf_tokenizer:transformers.tokenization_utils_ba
                  se.PreTrainedTokenizerBase,
                  is_split_into_words:bool=False, tok_kwargs:dict={})

This method is used by SortedDL to sort your dataset by the number of tokens in each example after tokenization

Type Default Details
example
hf_tokenizer PreTrainedTokenizerBase A Hugging Face tokenizer
is_split_into_words bool False The is_split_into_words argument applied to your hf_tokenizer during tokenization; set this to True if your inputs are pre-tokenized (not numericalized)
tok_kwargs dict {} Any other keyword arguments you want to include during tokenization
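
TextBlock wires this up for you (SortedDL is its default DataLoader type), but a sketch of the hook-up, assuming you were building the DataLoader yourself, looks like this:

from functools import partial

from fastai.text.data import SortedDL

# bind the tokenizer so the sort function can measure each example's token length;
# a SortedDL created with sort_func=sort_func then batches similarly sized examples together
sort_func = partial(blurr_sort_func, hf_tokenizer=hf_tokenizer)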

source

TextBlock

 TextBlock (hf_arch:str=None,
            hf_config:transformers.configuration_utils.PretrainedConfig=No
            ne, hf_tokenizer:transformers.tokenization_utils_base.PreTrain
            edTokenizerBase=None,
            hf_model:transformers.modeling_utils.PreTrainedModel=None,
            include_labels:bool=True, ignore_token_id=-100,
            batch_tokenize_tfm:__main__.BatchTokenizeTransform=None,
            batch_decode_tfm:__main__.BatchDecodeTransform=None,
            max_length:int=None, padding:bool|str=True,
            truncation:bool|str=True, is_split_into_words:bool=False,
            input_return_type:type=<class '__main__.TextInput'>,
            dl_type:fastai.data.load.DataLoader=None,
            batch_tokenize_kwargs:dict={}, batch_decode_kwargs:dict={},
            tok_kwargs:dict={}, text_gen_kwargs:dict={}, **kwargs)

The core TransformBlock to prepare your inputs for training in Blurr with fastai’s DataBlock API

Type Default Details
hf_arch str None The abbreviation/name of your Hugging Face transformer architecture (not required if passing in an instance of BatchTokenizeTransform to before_batch_tfm)
hf_config PretrainedConfig None A Hugging Face configuration object (not required if passing in an instance of BatchTokenizeTransform to before_batch_tfm)
hf_tokenizer PreTrainedTokenizerBase None A Hugging Face tokenizer (not required if passing in an instance of BatchTokenizeTransform to before_batch_tfm)
hf_model PreTrainedModel None A Hugging Face model (not required if passing in an instance of BatchTokenizeTransform to before_batch_tfm)
include_labels bool True Controls whether the “labels” are included in your inputs; if they are, the loss will be calculated in the model’s forward function and you can simply use PreCalculatedLoss as your Learner’s loss function
ignore_token_id int -100 The token ID that should be ignored when calculating the loss
batch_tokenize_tfm BatchTokenizeTransform None The before_batch_tfm you want to use to tokenize your raw data on the fly (defaults to an instance of BatchTokenizeTransform)
batch_decode_tfm BatchDecodeTransform None The batch_tfm you want to use to decode your inputs into a type that can be used in the fastai show methods (defaults to BatchDecodeTransform)
max_length int None Controls the length of the padding/truncation; can be an integer or None, in which case it will default to the maximum length the model can accept. If the model has no specific maximum input length, truncation/padding to max_length is deactivated. See “Everything you always wanted to know about padding and truncation”
padding bool | str True Controls the padding applied to your hf_tokenizer during tokenization; if None, will default to False or ‘do_not_pad’. See “Everything you always wanted to know about padding and truncation”
truncation bool | str True Controls the truncation applied to your hf_tokenizer during tokenization; if None, will default to False or ‘do_not_truncate’. See “Everything you always wanted to know about padding and truncation”
is_split_into_words bool False The is_split_into_words argument applied to your hf_tokenizer during tokenization; set this to True if your inputs are pre-tokenized (not numericalized)
input_return_type type TextInput The return type your decoded inputs should be cast to (used by methods such as show_batch)
dl_type DataLoader None The type of DataLoader you want created (defaults to SortedDL)
batch_tokenize_kwargs dict {} Any keyword arguments you want applied to your batch_tokenize_tfm
batch_decode_kwargs dict {} Any keyword arguments you want applied to your batch_decode_tfm (will be set as fastai batch_tfms)
tok_kwargs dict {} Any keyword arguments you want your Hugging Face tokenizer to use during tokenization
text_gen_kwargs dict {} Any keyword arguments you want applied when generating text
kwargs

Designed with sensible defaults to minimize user effort in defining the transforms pipeline, TextBlock handles setting up your BatchTokenizeTransform and BatchDecodeTransform transforms regardless of data source (e.g., it will work with files, DataFrames, whatever).

Note: You must either pass in your own instance of a BatchTokenizeTransform class or the Hugging Face objects returned from BLURR.get_hf_objects (e.g., architecture, config, tokenizer, and model). The other args are optional.

We also include a blurr_sort_func that works with SortedDL to properly sort based on the number of tokens in each example.

Utility classes and methods

These methods are used internally to get the Blurr transforms associated with your DataLoaders.


source

get_blurr_tfm

 get_blurr_tfm (tfms_list:fastcore.transform.Pipeline,
                tfm_class:fastcore.transform.Transform=<class
                '__main__.BatchTokenizeTransform'>)

Given a fastai DataLoaders’ batch transforms Pipeline, this method can be used to get at a transform instance used in your Blurr DataBlock

Type Default Details
tfms_list Pipeline A list of transforms (e.g., dls.after_batch, dls.before_batch, etc…)
tfm_class Transform BatchTokenizeTransform The transform to find

source

first_blurr_tfm

 first_blurr_tfm (dls:fastai.data.core.DataLoaders,
                  tfms:list[fastcore.transform.Transform]=[<class
                  '__main__.BatchTokenizeTransform'>, <class
                  '__main__.BatchDecodeTransform'>])

This convenience method will find the first Blurr transform required for methods such as show_batch and show_results. The returned transform should have everything you need to properly decode and ‘show’ your Hugging Face inputs/targets.

Type Default Details
dls DataLoaders Your fastai DataLoaders
tfms list[Transform] [BatchTokenizeTransform, BatchDecodeTransform] The Blurr transforms to look for, in order
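
For example, given DataLoaders built with a TextBlock (as in the examples below), either utility will fetch the transforms Blurr attached; a short sketch:

# grab the batch-time tokenization transform from the before_batch pipeline
batch_tok_tfm = get_blurr_tfm(dls.before_batch, tfm_class=BatchTokenizeTransform)

# or simply take the first Blurr transform found on the DataLoaders
tfm = first_blurr_tfm(dls)
tfm.hf_arch, tfm.hf_tokenizer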

Mid-level Examples

The following examples demonstrate several approaches to constructing your DataBlock for sequence classification tasks using the mid-level API.

Batch-Time Tokenization

Step 1: Get your Hugging Face objects.

There are a bunch of ways we can get at the four Hugging Face elements we need (e.g., architecture name, tokenizer, config, and model). We can just create them directly, or we can use one of the helper methods available via BLURR.
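
Creating them directly is plain transformers usage; for example (a sketch, with the architecture name taken from the config’s model_type):

from transformers import AutoConfig, AutoModelForSequenceClassification, AutoTokenizer

pretrained_model_name = "distilroberta-base"
hf_config = AutoConfig.from_pretrained(pretrained_model_name)
hf_tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name)
hf_model = AutoModelForSequenceClassification.from_pretrained(pretrained_model_name, config=hf_config)
hf_arch = hf_config.model_type  # "roberta" for distilroberta-base

Below, we use the get_hf_objects helper instead: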

model_cls = AutoModelForSequenceClassification

pretrained_model_name = "distilroberta-base"  # "distilbert-base-uncased" "bert-base-uncased"
hf_arch, hf_config, hf_tokenizer, hf_model = get_hf_objects(pretrained_model_name, model_cls=model_cls)

Step 2: Create your DataBlock

blocks = (TextBlock(hf_arch, hf_config, hf_tokenizer, hf_model, batch_tokenize_kwargs={"labels": labels}), CategoryBlock)
dblock = DataBlock(blocks=blocks, get_x=ColReader("text"), get_y=ColReader("label"), splitter=ColSplitter())

Step 3: Build your DataLoaders

dls = dblock.dataloaders(imdb_df, bs=4)
b = dls.one_batch()
len(b), len(b[0]["input_ids"]), b[0]["input_ids"].shape, len(b[1])
(2, 4, torch.Size([4, 512]), 4)
b[0]
{'input_ids': tensor([[   0, 5102, 3764,  ..., 1530,   36,    2],
         [   0,   22,  250,  ..., 5422,  278,    2],
         [   0, 9342, 1864,  ...,   80,    6,    2],
         [   0,  318,   47,  ..., 5320,  853,    2]], device='cuda:1'),
 'attention_mask': tensor([[1, 1, 1,  ..., 1, 1, 1],
         [1, 1, 1,  ..., 1, 1, 1],
         [1, 1, 1,  ..., 1, 1, 1],
         [1, 1, 1,  ..., 1, 1, 1]], device='cuda:1'),
 'labels': TensorCategory([1, 0, 1, 1], device='cuda:1')}

Let’s take a look at the actual types represented by our batch

explode_types(b)
{tuple: [dict, fastai.torch_core.TensorCategory]}
dls.show_batch(dataloaders=dls, max_n=2, trunc_at=500)
text target
0 ANCHORS AWEIGH sees two eager young sailors, Joe Brady (Gene Kelly) and Clarence Doolittle/Brooklyn (Frank Sinatra), get a special four-day shore leave. Eager to get to the girls, particularly Joe's Lola, neither Joe nor Brooklyn figure on the interruption of little Navy-mad Donald (Dean Stockwell) and his Aunt Susie (Kathryn Grayson). Unexperienced in the ways of females and courting, Brooklyn quickly enlists Joe to help him win Aunt Susie over. Along the way, however, Joe finds himself fallin pos
1 WARNING: POSSIBLE SPOILERS (Not that you should care. Also, sorry for the caps.)<br /><br />Starting with an unnecessarily dramatic voice that's all the more annoying for talking nonsense, it goes on with nonsense and unnecessary drama. That's badly but accurately put.<br /><br />We know space travel is a risky enterprise. There's a complicated system with a lot of potential for malfunctions, radiation, stress-related symptoms etc, and unexpected things are bound to happen in largely unknown en neg

Using a preprocessed dataset

Preprocessing your raw data is the more traditional approach to using Transformers. It is required, for example, when you want to work with documents longer than your model will allow. A preprocessed dataset is used in the same way a non-preprocessed dataset is.

Step 1a: Get your Hugging Face objects.

hf_arch, hf_config, hf_tokenizer, hf_model = get_hf_objects(pretrained_model_name, model_cls=model_cls)

Step 1b: Preprocess dataset

preprocessor = ClassificationPreprocessor(hf_tokenizer, label_mapping=labels)
proc_ds = preprocessor.process_hf_dataset(final_ds)
proc_ds
Dataset({
    features: ['proc_text', 'text', 'label', 'is_valid', 'label_name', 'text_start_char_idx', 'text_end_char_idx'],
    num_rows: 1200
})

Step 2: Create your DataBlock

blocks = (TextBlock(hf_arch, hf_config, hf_tokenizer, hf_model, batch_tokenize_kwargs={"labels": labels}), CategoryBlock)
dblock = DataBlock(blocks=blocks, get_x=ItemGetter("proc_text"), get_y=ItemGetter("label"), splitter=RandomSplitter())

Step 3: Build your DataLoaders

dls = dblock.dataloaders(proc_ds, bs=4)
dls.show_batch(dataloaders=dls, max_n=2, trunc_at=500)
text target
0 I saw this film at the Adelaide Film Festival '07 and was thoroughly intrigued for all 106 minutes. I like documentaries, but often find them dragging with about 25 minutes to go. Forbidden Lie$ powered on though, never losing my interest.<br /><br />The film's subject is Norma Khoury, a Jordanian woman who found fame and fortune in 2001 with the publication of her book Forbidden Love, a biographical story of sorts concerning a Muslim friend of hers who was murdered by her family for having a r pos
1 ANCHORS AWEIGH sees two eager young sailors, Joe Brady (Gene Kelly) and Clarence Doolittle/Brooklyn (Frank Sinatra), get a special four-day shore leave. Eager to get to the girls, particularly Joe's Lola, neither Joe nor Brooklyn figure on the interruption of little Navy-mad Donald (Dean Stockwell) and his Aunt Susie (Kathryn Grayson). Unexperienced in the ways of females and courting, Brooklyn quickly enlists Joe to help him win Aunt Susie over. Along the way, however, Joe finds himself fallin pos

Passing extra information

As of v2.0, BLURR also allows you to pass extra information alongside your inputs in the form of a dictionary. If you use this approach, you must assign your text(s) to the text attribute of the dictionary. This is useful when splitting long documents into chunks but wanting to score/predict by example rather than by chunk (for example, in extractive question answering tasks).

Note: A good place to access this extra information during training/validation is in the before_batch method of a Callback.

blocks = (TextBlock(hf_arch, hf_config, hf_tokenizer, hf_model, batch_tokenize_kwargs={"labels": labels}), CategoryBlock)


def get_x(item):
    return {"text": item.text, "another_val": "testing123"}


dblock = DataBlock(blocks=blocks, get_x=get_x, get_y=ColReader("label"), splitter=ColSplitter())
dls = dblock.dataloaders(imdb_df, bs=4)
b = dls.one_batch()
len(b), len(b[0]["input_ids"]), b[0]["input_ids"].shape, len(b[1])
(2, 4, torch.Size([4, 512]), 4)
dls.show_batch(dataloaders=dls, max_n=2, trunc_at=500)
text target
0 ANCHORS AWEIGH sees two eager young sailors, Joe Brady (Gene Kelly) and Clarence Doolittle/Brooklyn (Frank Sinatra), get a special four-day shore leave. Eager to get to the girls, particularly Joe's Lola, neither Joe nor Brooklyn figure on the interruption of little Navy-mad Donald (Dean Stockwell) and his Aunt Susie (Kathryn Grayson). Unexperienced in the ways of females and courting, Brooklyn quickly enlists Joe to help him win Aunt Susie over. Along the way, however, Joe finds himself fallin pos
1 I've rented and watched this movie for the 1st time on DVD without reading any reviews about it. So, after 15 minutes of watching I've noticed that something is wrong with this movie; it's TERRIBLE! I mean, in the trailers it looked scary and serious!<br /><br />I think that Eli Roth (Mr. Director) thought that if all the characters in this film were stupid, the movie would be funny...(So stupid, it's funny...? WRONG!) He should watch and learn from better horror-comedies such as:"Fright Night" neg
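
For instance, a minimal Callback sketch (assuming the extra "another_val" key attached in get_x above rides along in the batch’s inputs dictionary):

from fastai.callback.core import Callback

class GrabExtraInfo(Callback):
    def before_batch(self):
        inputs = self.xb[0]  # the Hugging Face inputs dictionary for this batch
        extra = inputs.get("another_val")  # the extra value attached in get_x above
        # ... use `extra` to adjust the inputs, stash state for later callbacks, etc.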

Low-level API

The low-level API allows you to get back fastai-specific features such as show_batch, show_results, etc. when working with plain ol’ PyTorch Datasets, Hugging Face Datasets, and fastai Datasets & DataLoaders.


source

TextBatchCreator

 TextBatchCreator (hf_arch:str,
                   hf_config:transformers.configuration_utils.PretrainedCo
                   nfig, hf_tokenizer:transformers.tokenization_utils_base
                   .PreTrainedTokenizerBase,
                   hf_model:transformers.modeling_utils.PreTrainedModel,
                   data_collator:type=None)

A class that can be assigned to a TfmdDL.create_batch method; used in Blurr’s low-level API to create batches that can be used in the Blurr library


source

TextDataLoader

 TextDataLoader (dataset:torch.utils.data.dataset.Dataset|Datasets,
                 hf_arch:str, hf_config:PretrainedConfig,
                 hf_tokenizer:PreTrainedTokenizerBase,
                 hf_model:PreTrainedModel,
                 batch_creator:TextBatchCreator=None,
                 batch_decode_tfm:BatchDecodeTransform=None,
                 input_return_type:type=<class '__main__.TextInput'>,
                 preproccesing_func:Callable=None,
                 batch_decode_kwargs:dict={}, bs:int=64,
                 shuffle:bool=False, num_workers:int=None,
                 verbose:bool=False, do_setup:bool=True, pin_memory=False,
                 timeout=0, batch_size=None, drop_last=False,
                 indexed=None, n=None, device=None,
                 persistent_workers=False, pin_memory_device='', wif=None,
                 before_iter=None, after_item=None, before_batch=None,
                 after_batch=None, after_iter=None, create_batches=None,
                 create_item=None, create_batch=None, retain=None,
                 get_idxs=None, sample=None, shuffle_fn=None,
                 do_batch=None)

A transformed DataLoader that works with Blurr. From the fastai docs: a TfmdDL is “a DataLoader that creates Pipeline from a list of Transforms for the callbacks after_item, before_batch and after_batch. As a result, it can decode or show a processed batch.”

Type Default Details
dataset torch.utils.data.dataset.Dataset | Datasets A standard PyTorch Dataset or fastai Datasets
hf_arch str The abbreviation/name of your Hugging Face transformer architecture
hf_config PretrainedConfig A Hugging Face configuration object
hf_tokenizer PreTrainedTokenizerBase A Hugging Face tokenizer
hf_model PreTrainedModel A Hugging Face model
batch_creator TextBatchCreator None An instance of TextBatchCreator or equivalent (defaults to TextBatchCreator)
batch_decode_tfm BatchDecodeTransform None The batch_tfm used to decode Blurr batches (defaults to BatchDecodeTransform)
input_return_type type TextInput Used by typedispatched show methods
preproccesing_func Callable None (optional) A preprocessing function that will be applied to your dataset
batch_decode_kwargs dict {} Keyword arguments to be applied to your batch_decode_tfm
bs int 64
shuffle bool False
num_workers int None
verbose bool False
do_setup bool True
pin_memory bool False
timeout int 0
batch_size NoneType None
drop_last bool False
indexed NoneType None
n NoneType None
device NoneType None
persistent_workers bool False
pin_memory_device str
wif NoneType None
before_iter NoneType None
after_item NoneType None
before_batch NoneType None
after_batch NoneType None
after_iter NoneType None
create_batches NoneType None
create_item NoneType None
create_batch NoneType None
retain NoneType None
get_idxs NoneType None
sample NoneType None
shuffle_fn NoneType None
do_batch NoneType None

Low-level Examples

The following example demonstrates how to use the low-level API with standard PyTorch/Hugging Face/fast.ai Datasets and DataLoaders.

Step 1: Build your datasets

raw_datasets = load_dataset("glue", "mrpc")
Reusing dataset glue (/home/wgilliam/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad)
def tokenize_function(example):
    return hf_tokenizer(example["sentence1"], example["sentence2"], truncation=True)


tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)

Step 2: Dataset pre-processing (optional)


source

preproc_hf_dataset

 preproc_hf_dataset
                     (dataset:torch.utils.data.dataset.Dataset|fastai.data
                     .core.Datasets, hf_tokenizer:transformers.tokenizatio
                     n_utils_base.PreTrainedTokenizerBase,
                     hf_model:transformers.modeling_utils.PreTrainedModel)

This method can be used to preprocess most Hugging Face Datasets for use in Blurr and other training libraries

Type Details
dataset torch.utils.data.dataset.Dataset | Datasets A standard PyTorch Dataset or fast.ai Datasets
hf_tokenizer PreTrainedTokenizerBase A Hugging Face tokenizer
hf_model PreTrainedModel A Hugging Face model

Step 3: Build your DataLoaders.

Use TextDataLoader to build Blurr-friendly DataLoaders from your datasets. Passing {'labels': label_names} to batch_decode_kwargs ensures that your label/target names will be displayed in methods like show_batch and show_results (just as with the mid-level API).

label_names = raw_datasets["train"].features["label"].names

trn_dl = TextDataLoader(
    tokenized_datasets["train"],
    hf_arch,
    hf_config,
    hf_tokenizer,
    hf_model,
    preproccesing_func=preproc_hf_dataset,
    batch_decode_kwargs={"labels": label_names},
    shuffle=True,
    batch_size=8,
)

val_dl = TextDataLoader(
    tokenized_datasets["validation"],
    hf_arch,
    hf_config,
    hf_tokenizer,
    hf_model,
    preproccesing_func=preproc_hf_dataset,
    batch_decode_kwargs={"labels": label_names},
    batch_size=16,
)

dls = DataLoaders(trn_dl, val_dl)
b = dls.one_batch()
b[0]["input_ids"].shape
torch.Size([8, 65])
dls.show_batch(dataloaders=dls, max_n=2, trunc_at=800)
text target
0 The technology-laced Nasdaq Composite Index.IXIC inched down 1 point, or 0.11 percent, to 1,650. The broad Standard & Poor's 500 Index.SPX inched up 3 points, or 0.32 percent, to 970. not_equivalent
1 His 1996 Chevrolet Tahoe was found abandoned June 25 in a Virginia Beach, Va., parking lot. His sport utility vehicle was found June 25, abandoned without its license plates in Virginia Beach, Va. equivalent

Tests

The tests below ensure the core DataBlock code above works for all pretrained sequence classification models available in Hugging Face. These tests are excluded from the CI workflow because of how long they take to run and the amount of data they would require to download.

Note: Feel free to modify the code below to test whatever pretrained classification models you are working with … and if any of your pretrained sequence classification models fail, please submit a GitHub issue (or a PR if you’d like to fix it yourself).
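
The test harness itself is not shown on this page; a minimal sketch of the kind of loop it runs (model list abbreviated, helper names taken from the examples above) might look like:

test_results = []
for model_name in ["hf-internal-testing/tiny-bert", "roberta-base"]:  # abbreviated list
    try:
        hf_arch, hf_config, hf_tokenizer, hf_model = get_hf_objects(
            model_name, model_cls=AutoModelForSequenceClassification
        )
        blocks = (TextBlock(hf_arch, hf_config, hf_tokenizer, hf_model), CategoryBlock)
        dblock = DataBlock(blocks=blocks, get_x=ColReader("text"), get_y=ColReader("label"), splitter=ColSplitter())
        dls = dblock.dataloaders(imdb_df, bs=4)
        dls.one_batch()  # building a batch is the smoke test
        test_results.append((hf_arch, type(hf_tokenizer).__name__, model_name, "PASSED", ""))
    except Exception as err:
        test_results.append(("?", "?", model_name, "FAILED", str(err)))

pd.DataFrame(test_results, columns=["arch", "tokenizer", "model_name", "result", "error"])

The table below shows the results of a full run: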

arch tokenizer model_name result error
0 albert AlbertTokenizerFast hf-internal-testing/tiny-albert PASSED
1 bart BartTokenizerFast hf-internal-testing/tiny-random-bart PASSED
2 bert BertTokenizerFast hf-internal-testing/tiny-bert PASSED
3 big_bird BigBirdTokenizerFast google/bigbird-roberta-base PASSED
4 bigbird_pegasus PegasusTokenizerFast google/bigbird-pegasus-large-arxiv PASSED
5 ctrl CTRLTokenizer hf-internal-testing/tiny-random-ctrl PASSED
6 camembert CamembertTokenizerFast camembert-base PASSED
7 canine CanineTokenizer hf-internal-testing/tiny-random-canine PASSED
8 convbert ConvBertTokenizerFast YituTech/conv-bert-base PASSED
9 deberta DebertaTokenizerFast hf-internal-testing/tiny-deberta PASSED
10 deberta_v2 DebertaV2TokenizerFast hf-internal-testing/tiny-random-deberta-v2 PASSED
11 distilbert DistilBertTokenizerFast hf-internal-testing/tiny-random-distilbert PASSED
12 electra ElectraTokenizerFast hf-internal-testing/tiny-electra PASSED
13 fnet FNetTokenizerFast google/fnet-base PASSED
14 flaubert FlaubertTokenizer hf-internal-testing/tiny-random-flaubert PASSED
15 funnel FunnelTokenizerFast hf-internal-testing/tiny-random-funnel PASSED
16 gpt2 GPT2TokenizerFast hf-internal-testing/tiny-random-gpt2 PASSED
17 gptj GPT2TokenizerFast anton-l/gpt-j-tiny-random PASSED
18 gpt_neo GPT2TokenizerFast hf-internal-testing/tiny-random-gpt_neo PASSED
19 ibert RobertaTokenizer kssteven/ibert-roberta-base PASSED
20 led LEDTokenizerFast hf-internal-testing/tiny-random-led PASSED
21 longformer LongformerTokenizerFast hf-internal-testing/tiny-random-longformer PASSED
22 mbart MBartTokenizerFast hf-internal-testing/tiny-random-mbart PASSED
23 mpnet MPNetTokenizerFast hf-internal-testing/tiny-random-mpnet PASSED
24 mobilebert MobileBertTokenizerFast hf-internal-testing/tiny-random-mobilebert PASSED
25 openai OpenAIGPTTokenizerFast openai-gpt PASSED
26 reformer ReformerTokenizerFast google/reformer-crime-and-punishment PASSED
27 rembert RemBertTokenizerFast google/rembert PASSED
28 roformer RoFormerTokenizerFast junnyu/roformer_chinese_sim_char_ft_small PASSED
29 roberta RobertaTokenizerFast roberta-base PASSED
30 squeezebert SqueezeBertTokenizerFast squeezebert/squeezebert-uncased PASSED
31 transfo_xl TransfoXLTokenizer hf-internal-testing/tiny-random-transfo-xl PASSED
32 xlm XLMTokenizer xlm-mlm-en-2048 PASSED
33 xlm_roberta XLMRobertaTokenizerFast xlm-roberta-base PASSED
34 xlnet XLNetTokenizerFast xlnet-base-cased PASSED

The text.data.core module contains the fundamental bits for all data preprocessing tasks.