What we're running with at the time this documentation was generated:
torch: 1.9.0+cu102
fastai: 2.7.9
transformers: 4.21.2
Data
The text.data.core module contains the core bits required to use the fastai DataBlock API and/or mid-level data processing pipelines to turn your raw datasets into modelable DataLoaders for text/NLP tasks.
Setup
We’ll use a subset of the imdb dataset to demonstrate how to configure BLURR for sequence classification tasks.
raw_datasets = load_dataset("imdb", split=["train", "test"])

raw_datasets[0] = raw_datasets[0].add_column("is_valid", [False] * len(raw_datasets[0]))
raw_datasets[1] = raw_datasets[1].add_column("is_valid", [True] * len(raw_datasets[1]))

final_ds = concatenate_datasets([raw_datasets[0].shuffle().select(range(1000)), raw_datasets[1].shuffle().select(range(200))])

imdb_df = pd.DataFrame(final_ds)
imdb_df.head()
Reusing dataset imdb (/home/wgilliam/.cache/huggingface/datasets/imdb/plain_text/1.0.0/2fdd8b9bcadd6e7055e742a706876ba43f19faee861df134affd7a3f60fc38a1)
 | text | label | is_valid |
---|---|---|---|
0 | What an overlooked 80's soundtrack. I imagine John Travolta sang some of the songs but in watching the movie it did seem to personify everything that was 80s cheese. Clearly movies that rely on mechanical bulls, bartenders and immature relationships were in style. The best was his lousy Texas accent. Compare that to Friday Night Lights.I suggest watching Cocktail and Stir Crazy to start really getting into the dumbing down of film. Also, as a side note Made in America with Ted Danson and Whoopie Goldberg is an awesomely bad movie. I was so shocked to realize I had never watched it. One mor... | 1 | False |
1 | An archaeologist (Casper Van Dien) stumbles accidentally upon an ancient, 40 foot mummy, well preserved underground in the Nevada desert. They are determined to keep this a secret and call in a Jewish translator to assist in figuring out the history of it. The mummy, as explained at the beginning, is the son of a fallen angel and is one of several giants that apparently existed in "those days". In order to save his son from a devastating flood which was predicted to kill everything, he mummifies his son, burying him with several servants for centuries - planning to awaken him years from th... | 0 | False |
2 | Sudden Impact is a two pronged story. Harry is targeted by the mob who want to kill him and Harry is very glad to return the favour and show them how it's done. This little war puts Harry on suspension which he doesn't care about but he goes away on a little vacation. Now the second part of the story. Someone is killing some punks and Harry gets dragged into this situation where he meets Jennifer spencer a woman with a secret that the little tourist town wants to keep quiet. The police Chief is not a subtle man and he warns Harry to not get involved or cause any trouble. This is Harry Call... | 1 | False |
3 | It is a superb Swedish film .. it was the first Swedish film I've seen .. it is simple & deep .. what a great combination!.<br /><br />Michael Nyqvist did a great performance as a famous conductor who seeks peace in his hometown.<br /><br />Frida Hallgren was great as his inspirational girlfriend to help him to carry on & never give up.<br /><br />The fight between the conductor and the hypocrite priest who loses his battle with Michael when his wife confronts him And defends Michael's noble cause to help his hometown people finding their own peace in music.<br /><br />The only thing that ... | 1 | False |
4 | The plot is about a female nurse, named Anna, is caught in the middle of a world-wide chaos as flesh-eating zombies begin rising up and taking over the world and attacking the living. She escapes into the streets and is rescued by a black police officer. So far, so good! I usually enjoy horror movies, but this piece of film doesn't deserve to be called horror. It's not even thrilling, just ridiculous.Even "the Flintstones" or "Kukla, Fran and Ollie" will give you more excitement. It's like watching a bunch of bloodthirsty drunkards not being able to get into a shopping mall to by more liqu... | 0 | False |
labels = raw_datasets[0].features["label"].names
labels
['neg', 'pos']
model_cls = AutoModelForSequenceClassification

hf_logging.set_verbosity_error()

pretrained_model_name = "roberta-base"  # "bert-base-multilingual-cased"
n_labels = len(labels)

hf_arch, hf_config, hf_tokenizer, hf_model = get_hf_objects(
    pretrained_model_name, model_cls=model_cls, config_kwargs={"num_labels": n_labels}
)

hf_arch, type(hf_config), type(hf_tokenizer), type(hf_model)
('roberta',
transformers.models.roberta.configuration_roberta.RobertaConfig,
transformers.models.roberta.tokenization_roberta_fast.RobertaTokenizerFast,
transformers.models.roberta.modeling_roberta.RobertaForSequenceClassification)
Preprocessing
Starting with version 2.0, BLURR provides a preprocessing base class that can be used to build task-specific pre-processed datasets from pandas DataFrames or Hugging Face Datasets.
Preprocessor
Preprocessor (hf_tokenizer:transformers.tokenization_utils_base.PreTrainedTokenizerBase, batch_size:int=1000, text_attr:str='text', text_pair_attr:str=None, is_valid_attr:str='is_valid', tok_kwargs:dict={})
Initialize self. See help(type(self)) for accurate signature.
 | Type | Default | Details
---|---|---|---
hf_tokenizer | PreTrainedTokenizerBase | | A Hugging Face tokenizer
batch_size | int | 1000 | The number of examples to process at a time
text_attr | str | text | The attribute holding the text
text_pair_attr | str | None | The attribute holding the text_pair
is_valid_attr | str | is_valid | The attribute that should be created if you are processing individual training and validation datasets into a single dataset, and will indicate to which dataset each example belongs
tok_kwargs | dict | {} | Tokenization kwargs that will be applied when calling the tokenizer
ClassificationPreprocessor
ClassificationPreprocessor (hf_tokenizer:transformers.tokenization_utils_base.PreTrainedTokenizerBase, batch_size:int=1000, is_multilabel:bool=False, id_attr:str=None, text_attr:str='text', text_pair_attr:str=None, label_attrs:str|list[str]='label', is_valid_attr:str='is_valid', label_mapping:list[str]=None, tok_kwargs:dict={})
Initialize self. See help(type(self)) for accurate signature.
 | Type | Default | Details
---|---|---|---
hf_tokenizer | PreTrainedTokenizerBase | | A Hugging Face tokenizer
batch_size | int | 1000 | The number of examples to process at a time
is_multilabel | bool | False | Whether the dataset should be processed for multi-label; if True, will ensure label_attrs are converted to a value of either 0 or 1 indicating the existence of the class in the example
id_attr | str | None | The unique identifier in the dataset
text_attr | str | text | The attribute holding the text
text_pair_attr | str | None | The attribute holding the text_pair
label_attrs | str \| list[str] | label | The attribute(s) holding the label(s) of the example
is_valid_attr | str | is_valid | The attribute that should be created if you are processing individual training and validation datasets into a single dataset, and will indicate to which dataset each example belongs
label_mapping | list[str] | None | A list indicating the valid labels for the dataset (optional, defaults to the unique set of labels found in the full dataset)
tok_kwargs | dict | {} | Tokenization kwargs that will be applied when calling the tokenizer
Starting with version 2.0, BLURR provides a sequence classification preprocessing class that can be used to preprocess DataFrames or Hugging Face Datasets. This class can be used for preprocessing both multiclass and multilabel classification datasets, and adds proc_{your_text_attr} and (optionally) proc_{your_text_pair_attr} attributes containing your modified text as a result of tokenization (e.g., if you specify a max_length, the proc_{your_text_attr} may contain truncated text).
Note: This class works for both slow and fast tokenizers
Using a DataFrame
preprocessor = ClassificationPreprocessor(hf_tokenizer, label_mapping=labels, tok_kwargs={"max_length": 24})

proc_df = preprocessor.process_df(imdb_df)
proc_df.columns, len(proc_df)
proc_df.head(2)
 | proc_text | text | label | is_valid | label_name | text_start_char_idx | text_end_char_idx |
---|---|---|---|---|---|---|---|
0 | What an overlooked 80's soundtrack. I imagine John Travolta sang some of the songs but in watching | What an overlooked 80's soundtrack. I imagine John Travolta sang some of the songs but in watching the movie it did seem to personify everything that was 80s cheese. Clearly movies that rely on mechanical bulls, bartenders and immature relationships were in style. The best was his lousy Texas accent. Compare that to Friday Night Lights.I suggest watching Cocktail and Stir Crazy to start really getting into the dumbing down of film. Also, as a side note Made in America with Ted Danson and Whoopie Goldberg is an awesomely bad movie. I was so shocked to realize I had never watched it. One mor... | 1 | False | pos | 0 | 98 |
1 | An archaeologist (Casper Van Dien) stumbles accidentally upon an ancient, 40 foot mummy, well | An archaeologist (Casper Van Dien) stumbles accidentally upon an ancient, 40 foot mummy, well preserved underground in the Nevada desert. They are determined to keep this a secret and call in a Jewish translator to assist in figuring out the history of it. The mummy, as explained at the beginning, is the son of a fallen angel and is one of several giants that apparently existed in "those days". In order to save his son from a devastating flood which was predicted to kill everything, he mummifies his son, burying him with several servants for centuries - planning to awaken him years from th... | 0 | False | neg | 0 | 93 |
Using a Hugging Face Dataset
preprocessor = ClassificationPreprocessor(hf_tokenizer, label_mapping=labels)

proc_ds = preprocessor.process_hf_dataset(final_ds)
proc_ds
Dataset({
features: ['proc_text', 'text', 'label', 'is_valid', 'label_name', 'text_start_char_idx', 'text_end_char_idx'],
num_rows: 1200
})
Mid-level API
Base tokenization, batch transform, and DataBlock methods
TextInput
TextInput (x, **kwargs)
The base representation of your inputs; used by the various fastai show methods.
A TextInput object is returned from the decodes method of BatchDecodeTransform as a means to customize @typedispatched functions like DataLoaders.show_batch and Learner.show_results. The value will be your "input_ids".
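For illustration, here is a hedged sketch of how a type-dispatched show method keys off TextInput. The signature and body below are assumptions modeled on fastai's show_batch convention, not BLURR's actual implementation.

```python
from fastcore.dispatch import typedispatch

# Hypothetical sketch: fastai routes show_batch to the most specific registered type for
# the batch's x, which is how BLURR customizes display for Hugging Face inputs.
@typedispatch
def show_batch(x: TextInput, y, samples, ctxs=None, max_n=6, **kwargs):
    for sample in samples[:max_n]:
        print(sample)  # a real implementation would detokenize the input_ids and tabulate them
```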
BatchTokenizeTransform
BatchTokenizeTransform (hf_arch:str, hf_config:PretrainedConfig, hf_tokenizer:PreTrainedTokenizerBase, hf_model:PreTrainedModel, include_labels:bool=True, ignore_token_id:int=-100, max_length:int=None, padding:bool|str=True, truncation:bool|str=True, is_split_into_words:bool=False, tok_kwargs:dict={}, **kwargs)
Handles everything you need to assemble a mini-batch of inputs and targets, as well as decode the dictionary produced as a byproduct of the tokenization process, in the encodes method.
 | Type | Default | Details
---|---|---|---
hf_arch | str | | The abbreviation/name of your Hugging Face transformer architecture (e.g., bert, bart, etc.)
hf_config | PretrainedConfig | | A specific configuration instance you want to use
hf_tokenizer | PreTrainedTokenizerBase | | A Hugging Face tokenizer
hf_model | PreTrainedModel | | A Hugging Face model
include_labels | bool | True | To control whether the "labels" are included in your inputs. If they are, the loss will be calculated in the model's forward function and you can simply use PreCalculatedLoss as your Learner's loss function to use it
ignore_token_id | int | -100 | The token ID that should be ignored when calculating the loss
max_length | int | None | To control the length of the padding/truncation. It can be an integer or None, in which case it will default to the maximum length the model can accept. If the model has no specific maximum input length, truncation/padding to max_length is deactivated. See Everything you always wanted to know about padding and truncation
padding | bool \| str | True | To control the padding applied to your hf_tokenizer during tokenization. If None, will default to 'False' or 'do_not_pad'. See Everything you always wanted to know about padding and truncation
truncation | bool \| str | True | To control truncation applied to your hf_tokenizer during tokenization. If None, will default to 'False' or 'do_not_truncate'. See Everything you always wanted to know about padding and truncation
is_split_into_words | bool | False | The is_split_into_words argument applied to your hf_tokenizer during tokenization. Set this to 'True' if your inputs are pre-tokenized (not numericalized)
tok_kwargs | dict | {} | Any other keyword arguments you want included when using your hf_tokenizer to tokenize your inputs
kwargs | | |
Inspired by this article, BatchTokenizeTransform inputs can come in as raw text, a list of words (e.g., for tasks like Named Entity Recognition (NER), where you want to predict the label of each token), or as a dictionary that includes extra information you want to use during post-processing.
On-the-fly Batch-Time Tokenization:
Part of the inspiration for this derives from the mechanics of Hugging Face tokenizers; in particular, a tokenizer can return a collated mini-batch of data given a list of sequences. As such, the collating required for our inputs can be done during tokenization, before our batch transforms run, in a before_batch_tfms transform (where we get a list of examples)! This allows users of BLURR to have everything done dynamically at batch-time, without prior preprocessing, with at least four potential benefits:

1. Less code
2. Faster mini-batch creation
3. Less RAM utilization and less time spent tokenizing beforehand (this really helps with very large datasets)
4. Flexibility
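To make the mechanics concrete, here is a small standalone sketch using the plain Hugging Face tokenizer API (not BLURR code); it assumes the hf_tokenizer created in the Setup section.

```python
# Tokenizing a list of raw texts returns an already padded/collated mini-batch of tensors
# in a single call, which is what makes batch-time tokenization practical.
raw_texts = ["What an overlooked 80's soundtrack.", "This movie was not for me."]
collated = hf_tokenizer(raw_texts, padding=True, truncation=True, max_length=24, return_tensors="pt")
collated["input_ids"].shape  # torch.Size([2, n]) where n <= 24
```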
BatchDecodeTransform
BatchDecodeTransform (input_return_type:type=<class '__main__.TextInput'>, hf_arch:str=None, hf_config:PretrainedConfig=None, hf_tokenizer:PreTrainedTokenizerBase=None, hf_model:PreTrainedModel=None, **kwargs)
A class used to cast your inputs as input_return_type for fastai show methods.
 | Type | Default | Details |
---|---|---|---|
input_return_type | type | TextInput | Used by typedispatched show methods |
hf_arch | str | None | The abbreviation/name of your Hugging Face transformer architecture (not required if passing in an instance of BatchTokenizeTransform to before_batch_tfm ) |
hf_config | PretrainedConfig | None | A Hugging Face configuration object (not required if passing in an instance of BatchTokenizeTransform to before_batch_tfm ) |
hf_tokenizer | PreTrainedTokenizerBase | None | A Hugging Face tokenizer (not required if passing in an instance of BatchTokenizeTransform to before_batch_tfm ) |
hf_model | PreTrainedModel | None | A Hugging Face model (not required if passing in an instance of BatchTokenizeTransform to before_batch_tfm ) |
kwargs |
As of fastai 2.1.5, before-batch transforms no longer have a decodes method … and so, I've introduced a standard batch transform here, BatchDecodeTransform (one that occurs "after" the batch has been created), that will do the decoding for us.
blurr_sort_func
blurr_sort_func (example, hf_tokenizer:transformers.tokenization_utils_base.PreTrainedTokenizerBase, is_split_into_words:bool=False, tok_kwargs:dict={})
This method is used by SortedDL to sort your dataset by the tokenized length of each example.
 | Type | Default | Details
---|---|---|---
example | | |
hf_tokenizer | PreTrainedTokenizerBase | | A Hugging Face tokenizer
is_split_into_words | bool | False | The is_split_into_words argument applied to your hf_tokenizer during tokenization. Set this to 'True' if your inputs are pre-tokenized (not numericalized)
tok_kwargs | dict | {} | Any other keyword arguments you want to include during tokenization
TextBlock
TextBlock (hf_arch:str=None, hf_config:transformers.configuration_utils.PretrainedConfig=None, hf_tokenizer:transformers.tokenization_utils_base.PreTrainedTokenizerBase=None, hf_model:transformers.modeling_utils.PreTrainedModel=None, include_labels:bool=True, ignore_token_id=-100, batch_tokenize_tfm:__main__.BatchTokenizeTransform=None, batch_decode_tfm:__main__.BatchDecodeTransform=None, max_length:int=None, padding:bool|str=True, truncation:bool|str=True, is_split_into_words:bool=False, input_return_type:type=<class '__main__.TextInput'>, dl_type:fastai.data.load.DataLoader=None, batch_tokenize_kwargs:dict={}, batch_decode_kwargs:dict={}, tok_kwargs:dict={}, text_gen_kwargs:dict={}, **kwargs)
The core TransformBlock to prepare your inputs for training in Blurr with fastai's DataBlock API.
 | Type | Default | Details |
---|---|---|---|
hf_arch | str | None | The abbreviation/name of your Hugging Face transformer architecture (not required if passing in an instance of BatchTokenizeTransform to before_batch_tfm ) |
hf_config | PretrainedConfig | None | A Hugging Face configuration object (not required if passing in an instance of BatchTokenizeTransform to before_batch_tfm ) |
hf_tokenizer | PreTrainedTokenizerBase | None | A Hugging Face tokenizer (not required if passing in an instance of BatchTokenizeTransform to before_batch_tfm ) |
hf_model | PreTrainedModel | None | A Hugging Face model (not required if passing in an instance of BatchTokenizeTransform to before_batch_tfm ) |
include_labels | bool | True | To control whether the “labels” are included in your inputs. If they are, the loss will be calculated in the model’s forward function and you can simply use PreCalculatedLoss as your Learner ’s loss function to use it |
ignore_token_id | int | -100 | The token ID that should be ignored when calculating the loss |
batch_tokenize_tfm | BatchTokenizeTransform | None | The before_batch_tfm you want to use to tokenize your raw data on the fly (defaults to an instance of BatchTokenizeTransform ) |
batch_decode_tfm | BatchDecodeTransform | None | The batch_tfm you want to decode your inputs into a type that can be used in the fastai show methods, (defaults to BatchDecodeTransform) |
max_length | int | None | To control the length of the padding/truncation. It can be an integer or None, in which case it will default to the maximum length the model can accept. If the model has no specific maximum input length, truncation/padding to max_length is deactivated. See Everything you always wanted to know about padding and truncation |
padding | bool \| str | True | To control the 'padding' applied to your hf_tokenizer during tokenization. If None, will default to 'False' or 'do_not_pad'. See Everything you always wanted to know about padding and truncation
truncation | bool \| str | True | To control 'truncation' applied to your hf_tokenizer during tokenization. If None, will default to 'False' or 'do_not_truncate'. See Everything you always wanted to know about padding and truncation
is_split_into_words | bool | False | The is_split_into_words argument applied to your hf_tokenizer during tokenization. Set this to True if your inputs are pre-tokenized (not numericalized)
input_return_type | type | TextInput | The return type your decoded inputs should be cast to (used by methods such as show_batch )
dl_type | DataLoader | None | The type of DataLoader you want created (defaults to SortedDL ) |
batch_tokenize_kwargs | dict | {} | Any keyword arguments you want applied to your batch_tokenize_tfm |
batch_decode_kwargs | dict | {} | Any keyword arguments you want applied to your batch_decode_tfm (will be set as a fastai batch_tfms ) |
tok_kwargs | dict | {} | Any keyword arguments you want your Hugging Face tokenizer to use during tokenization |
text_gen_kwargs | dict | {} | Any keyword arguments you want to have applied with generating text |
kwargs |
TextBlock, the core block for our inputs in the DataBlock API, is designed with sensible defaults to minimize the effort required to define your transforms pipeline. It handles setting up your BatchTokenizeTransform and BatchDecodeTransform transforms regardless of data source (e.g., it will work with files, DataFrames, whatever).
Note: You must either pass in your own instance of a BatchTokenizeTransform class or the Hugging Face objects returned from BLURR.get_hf_objects (e.g., architecture, config, tokenizer, and model). The other args are optional.
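For example, here is a hedged sketch of the first option, constructing the batch-time tokenization transform yourself (the max_length value is illustrative) and handing it to TextBlock:

```python
# Build the BatchTokenizeTransform explicitly so its arguments are under your control;
# TextBlock will pick the Hugging Face objects up from it.
batch_tok_tfm = BatchTokenizeTransform(hf_arch, hf_config, hf_tokenizer, hf_model, max_length=128)
blocks = (TextBlock(batch_tokenize_tfm=batch_tok_tfm), CategoryBlock)
dblock = DataBlock(blocks=blocks, get_x=ColReader("text"), get_y=ColReader("label"), splitter=ColSplitter())
```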
We also include a blurr_sort_func that works with SortedDL to properly sort based on the number of tokens in each example.
Utility classes and methods
These methods are used internally for getting the Blurr transforms associated with your DataLoaders.
get_blurr_tfm
get_blurr_tfm (tfms_list:fastcore.transform.Pipeline, tfm_class:fastcore.transform.Transform=<class '__main__.BatchTokenizeTransform'>)
Given a fastai DataLoaders' list of batch transforms, this method can be used to get at a transform instance used in your Blurr DataBlock.
 | Type | Default | Details
---|---|---|---
tfms_list | Pipeline | | A list of transforms (e.g., dls.after_batch, dls.before_batch, etc.)
tfm_class | Transform | BatchTokenizeTransform | The transform to find
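A hedged usage sketch (assuming a dls built with the DataBlock examples below); the tokenize transform lives in the before_batch pipeline and the decode transform in after_batch:

```python
tok_tfm = get_blurr_tfm(dls.before_batch)                                    # finds the BatchTokenizeTransform
decode_tfm = get_blurr_tfm(dls.after_batch, tfm_class=BatchDecodeTransform)  # finds the BatchDecodeTransform
```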
first_blurr_tfm
first_blurr_tfm (dls:fastai.data.core.DataLoaders, tfms:list[fastcore.transform.Transform]=[<class '__main__.BatchTokenizeTransform'>, <class '__main__.BatchDecodeTransform'>])
This convenience method will find the first Blurr transform required for methods such as show_batch and show_results. The returned transform should have everything you need to properly decode and "show" your Hugging Face inputs/targets.
 | Type | Default | Details
---|---|---|---
dls | DataLoaders | | Your fast.ai DataLoaders
tfms | list[Transform] | [BatchTokenizeTransform, BatchDecodeTransform] | The Blurr transforms to look for, in order
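For example (a hedged sketch; the attribute access assumes the returned transform stores the Hugging Face objects it was constructed with):

```python
tfm = first_blurr_tfm(dls)
tfm.hf_arch, tfm.hf_tokenizer  # handy when you need the tokenizer to decode/show inputs
```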
Mid-level Examples
The following examples demonstrate several approaches to constructing your DataBlock for sequence classification tasks using the mid-level API.
Batch-Time Tokenization
Step 1: Get your Hugging Face objects.
There are a bunch of ways we can get at the four Hugging Face elements we need (e.g., architecture name, tokenizer, config, and model). We can just create them directly, or we can use one of the helper methods available via NLP.
model_cls = AutoModelForSequenceClassification

pretrained_model_name = "distilroberta-base"  # "distilbert-base-uncased" "bert-base-uncased"

hf_arch, hf_config, hf_tokenizer, hf_model = get_hf_objects(pretrained_model_name, model_cls=model_cls)
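Alternatively, here is a hedged sketch of the "create them directly" route using the Hugging Face Auto classes; it is equivalent in spirit to, though not necessarily identical with, what get_hf_objects returns.

```python
from transformers import AutoConfig, AutoTokenizer

hf_config = AutoConfig.from_pretrained(pretrained_model_name)
hf_tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name)
hf_model = AutoModelForSequenceClassification.from_pretrained(pretrained_model_name, config=hf_config)
hf_arch = hf_config.model_type  # e.g., "roberta" for distilroberta-base
```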
Step 2: Create your DataBlock
blocks = (TextBlock(hf_arch, hf_config, hf_tokenizer, hf_model, batch_tokenize_kwargs={"labels": labels}), CategoryBlock)
dblock = DataBlock(blocks=blocks, get_x=ColReader("text"), get_y=ColReader("label"), splitter=ColSplitter())
Step 3: Build your DataLoaders
dls = dblock.dataloaders(imdb_df, bs=4)
b = dls.one_batch()
len(b), len(b[0]["input_ids"]), b[0]["input_ids"].shape, len(b[1])
(2, 4, torch.Size([4, 512]), 4)
b[0]
{'input_ids': tensor([[ 0, 5102, 3764, ..., 1530, 36, 2],
[ 0, 22, 250, ..., 5422, 278, 2],
[ 0, 9342, 1864, ..., 80, 6, 2],
[ 0, 318, 47, ..., 5320, 853, 2]], device='cuda:1'),
'attention_mask': tensor([[1, 1, 1, ..., 1, 1, 1],
[1, 1, 1, ..., 1, 1, 1],
[1, 1, 1, ..., 1, 1, 1],
[1, 1, 1, ..., 1, 1, 1]], device='cuda:1'),
'labels': TensorCategory([1, 0, 1, 1], device='cuda:1')}
Let’s take a look at the actual types represented by our batch
explode_types(b)
{tuple: [dict, fastai.torch_core.TensorCategory]}
dls.show_batch(dataloaders=dls, max_n=2, trunc_at=500)
 | text | target |
---|---|---|
0 | ANCHORS AWEIGH sees two eager young sailors, Joe Brady (Gene Kelly) and Clarence Doolittle/Brooklyn (Frank Sinatra), get a special four-day shore leave. Eager to get to the girls, particularly Joe's Lola, neither Joe nor Brooklyn figure on the interruption of little Navy-mad Donald (Dean Stockwell) and his Aunt Susie (Kathryn Grayson). Unexperienced in the ways of females and courting, Brooklyn quickly enlists Joe to help him win Aunt Susie over. Along the way, however, Joe finds himself fallin | pos |
1 | WARNING: POSSIBLE SPOILERS (Not that you should care. Also, sorry for the caps.)<br /><br />Starting with an unnecessarily dramatic voice that's all the more annoying for talking nonsense, it goes on with nonsense and unnecessary drama. That's badly but accurately put.<br /><br />We know space travel is a risky enterprise. There's a complicated system with a lot of potential for malfunctions, radiation, stress-related symptoms etc, and unexpected things are bound to happen in largely unknown en | neg |
Using a preprocessed dataset
Preprocessing your raw data is the more traditional approach to using Transformers. It is required, for example, when you want to work with documents longer than your model will allow. A preprocessed dataset is used in the same way a non-preprocessed dataset is.
Step 1a: Get your Hugging Face objects.
hf_arch, hf_config, hf_tokenizer, hf_model = get_hf_objects(pretrained_model_name, model_cls=model_cls)
Step 1b. Preprocess dataset
preprocessor = ClassificationPreprocessor(hf_tokenizer, label_mapping=labels)
proc_ds = preprocessor.process_hf_dataset(final_ds)
proc_ds
Dataset({
features: ['proc_text', 'text', 'label', 'is_valid', 'label_name', 'text_start_char_idx', 'text_end_char_idx'],
num_rows: 1200
})
Step 2: Create your DataBlock
blocks = (TextBlock(hf_arch, hf_config, hf_tokenizer, hf_model, batch_tokenize_kwargs={"labels": labels}), CategoryBlock)
dblock = DataBlock(blocks=blocks, get_x=ItemGetter("proc_text"), get_y=ItemGetter("label"), splitter=RandomSplitter())
Step 3: Build your DataLoaders
dls = dblock.dataloaders(proc_ds, bs=4)
dls.show_batch(dataloaders=dls, max_n=2, trunc_at=500)
 | text | target |
---|---|---|
0 | I saw this film at the Adelaide Film Festival '07 and was thoroughly intrigued for all 106 minutes. I like documentaries, but often find them dragging with about 25 minutes to go. Forbidden Lie$ powered on though, never losing my interest.<br /><br />The film's subject is Norma Khoury, a Jordanian woman who found fame and fortune in 2001 with the publication of her book Forbidden Love, a biographical story of sorts concerning a Muslim friend of hers who was murdered by her family for having a r | pos |
1 | ANCHORS AWEIGH sees two eager young sailors, Joe Brady (Gene Kelly) and Clarence Doolittle/Brooklyn (Frank Sinatra), get a special four-day shore leave. Eager to get to the girls, particularly Joe's Lola, neither Joe nor Brooklyn figure on the interruption of little Navy-mad Donald (Dean Stockwell) and his Aunt Susie (Kathryn Grayson). Unexperienced in the ways of females and courting, Brooklyn quickly enlists Joe to help him win Aunt Susie over. Along the way, however, Joe finds himself fallin | pos |
Passing extra information
As of v2, BLURR also allows you to pass extra information alongside your inputs in the form of a dictionary. If you use this approach, you must assign your text(s) to the text key of the dictionary. This is useful when splitting long documents into chunks but wanting to score/predict per example rather than per chunk (for example, in extractive question answering tasks).
Note: A good place to access this extra information during training/validation is in the before_batch method of a Callback (see the sketch after the example below).
blocks = (TextBlock(hf_arch, hf_config, hf_tokenizer, hf_model, batch_tokenize_kwargs={"labels": labels}), CategoryBlock)

def get_x(item):
    return {"text": item.text, "another_val": "testing123"}

dblock = DataBlock(blocks=blocks, get_x=get_x, get_y=ColReader("label"), splitter=ColSplitter())
dls = dblock.dataloaders(imdb_df, bs=4)
b = dls.one_batch()
len(b), len(b[0]["input_ids"]), b[0]["input_ids"].shape, len(b[1])
(2, 4, torch.Size([4, 512]), 4)
dls.show_batch(dataloaders=dls, max_n=2, trunc_at=500)
 | text | target |
---|---|---|
0 | ANCHORS AWEIGH sees two eager young sailors, Joe Brady (Gene Kelly) and Clarence Doolittle/Brooklyn (Frank Sinatra), get a special four-day shore leave. Eager to get to the girls, particularly Joe's Lola, neither Joe nor Brooklyn figure on the interruption of little Navy-mad Donald (Dean Stockwell) and his Aunt Susie (Kathryn Grayson). Unexperienced in the ways of females and courting, Brooklyn quickly enlists Joe to help him win Aunt Susie over. Along the way, however, Joe finds himself fallin | pos |
1 | I've rented and watched this movie for the 1st time on DVD without reading any reviews about it. So, after 15 minutes of watching I've noticed that something is wrong with this movie; it's TERRIBLE! I mean, in the trailers it looked scary and serious!<br /><br />I think that Eli Roth (Mr. Director) thought that if all the characters in this film were stupid, the movie would be funny...(So stupid, it's funny...? WRONG!) He should watch and learn from better horror-comedies such as:"Fright Night" | neg |
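As a hedged illustration of the note above, and assuming the extra keys returned by get_x are carried through into the batch's input dictionary, a Callback could pick them up like this (the class and key names are illustrative, and the usual fastai imports are assumed):

```python
# Hypothetical callback: before_batch runs once the mini-batch has been assembled, so
# self.xb[0] is the dictionary of model inputs where the extra keys would travel.
class ExtraInfoCallback(Callback):
    def before_batch(self):
        hf_inputs = self.xb[0]
        extra = hf_inputs.get("another_val")  # the illustrative key from get_x above
        # ... use `extra` to adjust inputs, group chunk-level predictions by example, etc.
```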
Low-level API
For working with PyTorch and/or fast.ai Datasets & DataLoaders, the low-level API allows you to get back fast.ai-specific features such as show_batch, show_results, etc., when using plain ol' PyTorch Datasets, Hugging Face Datasets, etc.
TextBatchCreator
TextBatchCreator (hf_arch:str, hf_config:transformers.configuration_utils.PretrainedConfig, hf_tokenizer:transformers.tokenization_utils_base.PreTrainedTokenizerBase, hf_model:transformers.modeling_utils.PreTrainedModel, data_collator:type=None)
A class that can be assigned to a TfmdDL.create_batch method; used in Blurr's low-level API to create batches that can be used throughout the Blurr library.
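A minimal hedged sketch (assuming the Hugging Face objects created earlier): you rarely build one of these yourself since TextDataLoader creates one by default, but you can pass your own instance via its batch_creator argument.

```python
# Construct the batch creator explicitly; hand it to TextDataLoader via batch_creator=...
batch_creator = TextBatchCreator(hf_arch, hf_config, hf_tokenizer, hf_model)
```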
TextDataLoader
TextDataLoader (dataset:torch.utils.data.dataset.Dataset|Datasets, hf_arch:str, hf_config:PretrainedConfig, hf_tokenizer:PreTrainedTokenizerBase, hf_model:PreTrainedModel, batch_creator:TextBatchCreator=None, batch_decode_tfm:BatchDecodeTransform=None, input_return_type:type=<class '__main__.TextInput'>, preproccesing_func:Callable=None, batch_decode_kwargs:dict={}, bs:int=64, shuffle:bool=False, num_workers:int=None, verbose:bool=False, do_setup:bool=True, pin_memory=False, timeout=0, batch_size=None, drop_last=False, indexed=None, n=None, device=None, persistent_workers=False, pin_memory_device='', wif=None, before_iter=None, after_item=None, before_batch=None, after_batch=None, after_iter=None, create_batches=None, create_item=None, create_batch=None, retain=None, get_idxs=None, sample=None, shuffle_fn=None, do_batch=None)
A transformed DataLoader that works with Blurr. From the fastai docs, a TfmdDL is described as "a DataLoader that creates Pipeline from a list of Transforms for the callbacks after_item, before_batch and after_batch. As a result, it can decode or show a processed batch."
 | Type | Default | Details
---|---|---|---
dataset | torch.utils.data.dataset.Dataset \| Datasets | | A standard PyTorch Dataset
hf_arch | str | | The abbreviation/name of your Hugging Face transformer architecture (not required if passing in an instance of BatchTokenizeTransform to before_batch_tfm)
hf_config | PretrainedConfig | | A Hugging Face configuration object (not required if passing in an instance of BatchTokenizeTransform to before_batch_tfm)
hf_tokenizer | PreTrainedTokenizerBase | | A Hugging Face tokenizer (not required if passing in an instance of BatchTokenizeTransform to before_batch_tfm)
hf_model | PreTrainedModel | | A Hugging Face model (not required if passing in an instance of BatchTokenizeTransform to before_batch_tfm)
batch_creator | TextBatchCreator | None | An instance of TextBatchCreator or equivalent (defaults to TextBatchCreator )
batch_decode_tfm | BatchDecodeTransform | None | The batch_tfm used to decode Blurr batches (defaults to BatchDecodeTransform ) |
input_return_type | type | TextInput | Used by typedispatched show methods |
preproccesing_func | Callable | None | (optional) A preprocessing function that will be applied to your dataset |
batch_decode_kwargs | dict | {} | Keyword arguments to be applied to your batch_decode_tfm |
bs | int | 64 | |
shuffle | bool | False | |
num_workers | int | None | |
verbose | bool | False | |
do_setup | bool | True | |
pin_memory | bool | False | |
timeout | int | 0 | |
batch_size | NoneType | None | |
drop_last | bool | False | |
indexed | NoneType | None | |
n | NoneType | None | |
device | NoneType | None | |
persistent_workers | bool | False | |
pin_memory_device | str | ||
wif | NoneType | None | |
before_iter | NoneType | None | |
after_item | NoneType | None | |
before_batch | NoneType | None | |
after_batch | NoneType | None | |
after_iter | NoneType | None | |
create_batches | NoneType | None | |
create_item | NoneType | None | |
create_batch | NoneType | None | |
retain | NoneType | None | |
get_idxs | NoneType | None | |
sample | NoneType | None | |
shuffle_fn | NoneType | None | |
do_batch | NoneType | None |
Low-level Examples
The following example demonstrates how to use the low-level API with standard PyTorch/Hugging Face/fast.ai Datasets and DataLoaders.
Step 1: Build your datasets
raw_datasets = load_dataset("glue", "mrpc")
Reusing dataset glue (/home/wgilliam/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad)
def tokenize_function(example):
    return hf_tokenizer(example["sentence1"], example["sentence2"], truncation=True)

tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
Step 2: Dataset pre-processing (optional)
preproc_hf_dataset
preproc_hf_dataset (dataset:torch.utils.data.dataset.Dataset|fastai.data.core.Datasets, hf_tokenizer:transformers.tokenization_utils_base.PreTrainedTokenizerBase, hf_model:transformers.modeling_utils.PreTrainedModel)
This method can be used to preprocess most Hugging Face Datasets for use in Blurr and other training libraries
 | Type | Details
---|---|---
dataset | torch.utils.data.dataset.Dataset \| Datasets | A standard PyTorch Dataset or fast.ai Datasets
hf_tokenizer | PreTrainedTokenizerBase | A Hugging Face tokenizer
hf_model | PreTrainedModel | A Hugging Face model
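A hedged sketch of calling it directly; TextDataLoader can also apply it for you via its preproccesing_func argument, as shown in Step 3 below.

```python
# Hands back the tokenized dataset massaged into the form Blurr's batch creator expects.
proc_train_ds = preproc_hf_dataset(tokenized_datasets["train"], hf_tokenizer, hf_model)
```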
Step 3: Build your DataLoaders
Use TextDataLoader to build Blurr-friendly dataloaders from your datasets. Passing {'labels': label_names} to your batch_decode_kwargs will ensure that your label/target names will be displayed in methods like show_batch and show_results (just as it works with the mid-level API).
label_names = raw_datasets["train"].features["label"].names

trn_dl = TextDataLoader(
    tokenized_datasets["train"],
    hf_arch,
    hf_config,
    hf_tokenizer,
    hf_model,
    preproccesing_func=preproc_hf_dataset,
    batch_decode_kwargs={"labels": label_names},
    shuffle=True,
    batch_size=8,
)

val_dl = TextDataLoader(
    tokenized_datasets["validation"],
    hf_arch,
    hf_config,
    hf_tokenizer,
    hf_model,
    preproccesing_func=preproc_hf_dataset,
    batch_decode_kwargs={"labels": label_names},
    batch_size=16,
)

dls = DataLoaders(trn_dl, val_dl)

b = dls.one_batch()
b[0]["input_ids"].shape
torch.Size([8, 65])
dls.show_batch(dataloaders=dls, max_n=2, trunc_at=800)
 | text | target |
---|---|---|
0 | The technology-laced Nasdaq Composite Index.IXIC inched down 1 point, or 0.11 percent, to 1,650. The broad Standard & Poor's 500 Index.SPX inched up 3 points, or 0.32 percent, to 970. | not_equivalent |
1 | His 1996 Chevrolet Tahoe was found abandoned June 25 in a Virginia Beach, Va., parking lot. His sport utility vehicle was found June 25, abandoned without its license plates in Virginia Beach, Va. | equivalent |
Tests
The tests below ensure that the core DataBlock code above works for all pretrained sequence classification models available in Hugging Face. These tests are excluded from the CI workflow because of how long they would take to run and the amount of data that would need to be downloaded.
Note: Feel free to modify the code below to test whatever pretrained classification models you are working with … and if any of your pretrained sequence classification models fail, please submit a github issue (or a PR if you’d like to fix it yourself)
 | arch | tokenizer | model_name | result | error |
---|---|---|---|---|---|
0 | albert | AlbertTokenizerFast | hf-internal-testing/tiny-albert | PASSED | |
1 | bart | BartTokenizerFast | hf-internal-testing/tiny-random-bart | PASSED | |
2 | bert | BertTokenizerFast | hf-internal-testing/tiny-bert | PASSED | |
3 | big_bird | BigBirdTokenizerFast | google/bigbird-roberta-base | PASSED | |
4 | bigbird_pegasus | PegasusTokenizerFast | google/bigbird-pegasus-large-arxiv | PASSED | |
5 | ctrl | CTRLTokenizer | hf-internal-testing/tiny-random-ctrl | PASSED | |
6 | camembert | CamembertTokenizerFast | camembert-base | PASSED | |
7 | canine | CanineTokenizer | hf-internal-testing/tiny-random-canine | PASSED | |
8 | convbert | ConvBertTokenizerFast | YituTech/conv-bert-base | PASSED | |
9 | deberta | DebertaTokenizerFast | hf-internal-testing/tiny-deberta | PASSED | |
10 | deberta_v2 | DebertaV2TokenizerFast | hf-internal-testing/tiny-random-deberta-v2 | PASSED | |
11 | distilbert | DistilBertTokenizerFast | hf-internal-testing/tiny-random-distilbert | PASSED | |
12 | electra | ElectraTokenizerFast | hf-internal-testing/tiny-electra | PASSED | |
13 | fnet | FNetTokenizerFast | google/fnet-base | PASSED | |
14 | flaubert | FlaubertTokenizer | hf-internal-testing/tiny-random-flaubert | PASSED | |
15 | funnel | FunnelTokenizerFast | hf-internal-testing/tiny-random-funnel | PASSED | |
16 | gpt2 | GPT2TokenizerFast | hf-internal-testing/tiny-random-gpt2 | PASSED | |
17 | gptj | GPT2TokenizerFast | anton-l/gpt-j-tiny-random | PASSED | |
18 | gpt_neo | GPT2TokenizerFast | hf-internal-testing/tiny-random-gpt_neo | PASSED | |
19 | ibert | RobertaTokenizer | kssteven/ibert-roberta-base | PASSED | |
20 | led | LEDTokenizerFast | hf-internal-testing/tiny-random-led | PASSED | |
21 | longformer | LongformerTokenizerFast | hf-internal-testing/tiny-random-longformer | PASSED | |
22 | mbart | MBartTokenizerFast | hf-internal-testing/tiny-random-mbart | PASSED | |
23 | mpnet | MPNetTokenizerFast | hf-internal-testing/tiny-random-mpnet | PASSED | |
24 | mobilebert | MobileBertTokenizerFast | hf-internal-testing/tiny-random-mobilebert | PASSED | |
25 | openai | OpenAIGPTTokenizerFast | openai-gpt | PASSED | |
26 | reformer | ReformerTokenizerFast | google/reformer-crime-and-punishment | PASSED | |
27 | rembert | RemBertTokenizerFast | google/rembert | PASSED | |
28 | roformer | RoFormerTokenizerFast | junnyu/roformer_chinese_sim_char_ft_small | PASSED | |
29 | roberta | RobertaTokenizerFast | roberta-base | PASSED | |
30 | squeezebert | SqueezeBertTokenizerFast | squeezebert/squeezebert-uncased | PASSED | |
31 | transfo_xl | TransfoXLTokenizer | hf-internal-testing/tiny-random-transfo-xl | PASSED | |
32 | xlm | XLMTokenizer | xlm-mlm-en-2048 | PASSED | |
33 | xlm_roberta | XLMRobertaTokenizerFast | xlm-roberta-base | PASSED | |
34 | xlnet | XLNetTokenizerFast | xlnet-base-cased | PASSED |
The text.data.core module contains the fundamental bits for all data preprocessing tasks.