This module contains the core bits required to use the fastai DataBlock API and/or mid-level data processing pipelines to organize your data in a form that Hugging Face transformer implementations can model.
 
What we're running with at the time this documentation was generated:
torch: 1.9.0+cu102
fastai: 2.5.2
transformers: 4.10.0

Mid-level API: Base tokenization, batch transform, and DataBlock methods

class HF_BaseInput[source]

HF_BaseInput(x, **kwargs) :: TensorBase

The base representation of your inputs; used by the various fastai show methods

A HF_BaseInput object is returned from the decodes method of HF_AfterBatchTransform as a means to customize @typedispatched functions like DataLoaders.show_batch and Learner.show_results. It uses the "input_ids" of a Hugging Face object as the representative tensor for those show methods.
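
Below is a minimal, hypothetical sketch of that dispatch mechanism, not Blurr's actual implementation: because the decoded inputs arrive typed as HF_BaseInput, fastai's type-dispatch system can route show_batch to a transformer-aware renderer. The function body here is only a placeholder.

from fastcore.dispatch import typedispatch

@typedispatch
def show_batch(x: HF_BaseInput, y, samples, dataloaders=None, ctxs=None, max_n=6, **kwargs):
    # A real implementation would use the DataLoaders' hf_tokenizer to turn the "input_ids"
    # held by `x` back into readable text before displaying each decoded sample.
    for sample in samples[:max_n]:
        print(sample)
    return ctxs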

class HF_BeforeBatchTransform[source]

HF_BeforeBatchTransform(hf_arch:str, hf_config:PretrainedConfig, hf_tokenizer:PreTrainedTokenizerBase, hf_model:PreTrainedModel, max_length:int=None, padding:Union[bool, str]=True, truncation:Union[bool, str]=True, is_split_into_words:bool=False, tok_kwargs={}, **kwargs) :: Transform

Handles everything you need to assemble a mini-batch of inputs and targets, as well as decode the dictionary produced as a byproduct of the tokenization process in the encodes method.

Parameters:

  • hf_arch : <class 'str'>

The abbreviation/name of your Hugging Face transformer architecture (e.g., bert, bart, etc.)

  • hf_config : <class 'transformers.configuration_utils.PretrainedConfig'>

    A specific configuration instance you want to use

  • hf_tokenizer : <class 'transformers.tokenization_utils_base.PreTrainedTokenizerBase'>

    A Hugging Face tokenizer

  • hf_model : <class 'transformers.modeling_utils.PreTrainedModel'>

    A Hugging Face model

  • max_length : <class 'int'>, optional

    To control the length of the padding/truncation. It can be an integer or None, in which case it will default to the maximum length the model can accept. If the model has no specific maximum input length, truncation/padding to max_length is deactivated. See [Everything you always wanted to know about padding and truncation](https://huggingface.co/transformers/preprocessing.html#everything-you-always-wanted-to-know-about-padding-and-truncation)

  • padding : typing.Union[bool, str], optional

To control the `padding` applied to your `hf_tokenizer` during tokenization. If None, will default to `False` or `'do_not_pad'`. See [Everything you always wanted to know about padding and truncation](https://huggingface.co/transformers/preprocessing.html#everything-you-always-wanted-to-know-about-padding-and-truncation)

  • truncation : typing.Union[bool, str], optional

    To control `truncation` applied to your `hf_tokenizer` during tokenization. If None, will default to `False` or `do_not_truncate`. See [Everything you always wanted to know about padding and truncation](https://huggingface.co/transformers/preprocessing.html#everything-you-always-wanted-to-know-about-padding-and-truncation)

  • is_split_into_words : <class 'bool'>, optional

    The `is_split_into_words` argument applied to your `hf_tokenizer` during tokenization. Set this to `True` if your inputs are pre-tokenized (not numericalized)

  • tok_kwargs : <class 'dict'>, optional

    Any other keyword arguments you want included when using your `hf_tokenizer` to tokenize your inputs

  • kwargs : <class 'inspect._empty'>

HF_BeforeBatchTransform was inspired by this article.

Inputs can come in as a string or a list of tokens, the latter being for tasks like Named Entity Recognition (NER), where you want to predict the label of each token.

Notes re: on-the-fly batch-time tokenization: The previous version of the library performed the tokenization/numericalization as a type transform when the raw data was read, and included a couple of batch transforms to prepare the data for collation (i.e., to be made into a mini-batch). With this update, everything is done in a single batch transform. Why? Part of the inspiration had to do with the mechanics of the Hugging Face tokenizer, in particular how, by default, it returns a collated mini-batch of data given a list of sequences (illustrated below). And where do we get a list of examples with fastai? In the batch transforms! So I thought, hey, why not do everything dynamically at batch time? With a bit of tweaking, I got everything working pretty well. The result is less code, faster mini-batch creation, less RAM utilization and time spent tokenizing (which really helps with very large datasets), and more flexibility.
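
A small illustration of the point above (assuming hf_tokenizer as created in the sequence classification example further down): handed a list of sequences, a Hugging Face tokenizer already returns a padded, collated mini-batch, which is exactly the kind of data a fastai batch transform receives.

batch = hf_tokenizer(["a short review", "a slightly longer movie review"],
                     padding=True, truncation=True, return_tensors="pt")
batch["input_ids"].shape  # e.g., torch.Size([2, 8]); both sequences padded to the same length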

class HF_AfterBatchTransform[source]

HF_AfterBatchTransform(hf_tokenizer:PreTrainedTokenizerBase, input_return_type:Type[CT_co]=HF_BaseInput) :: Transform

A class used to cast your inputs into something understandable in fastai show methods

Parameters:

  • hf_tokenizer : <class 'transformers.tokenization_utils_base.PreTrainedTokenizerBase'>

    A Hugging Face tokenizer

  • input_return_type : typing.Type, optional

The return type your decoded inputs should be cast to (used by methods such as `show_batch`)

As of fastai 2.1.5, before-batch transforms no longer have a decodes method, and so I've introduced a standard batch transform here, HF_AfterBatchTransform, that does the decoding for us.

blurr_sort_func[source]

blurr_sort_func(example, hf_tokenizer:PreTrainedTokenizerBase, is_split_into_words:bool=False, tok_kwargs={})

This function is used by SortedDL to sort your examples by their post-tokenization length, so that similar-length sequences end up in the same mini-batch (a wiring sketch follows the parameter list)

Parameters:

  • example : <class 'inspect._empty'>

  • hf_tokenizer : <class 'transformers.tokenization_utils_base.PreTrainedTokenizerBase'>

    A Hugging Face tokenizer

  • is_split_into_words : <class 'bool'>, optional

    The `is_split_into_words` argument applied to your `hf_tokenizer` during tokenization. Set this to `True` if your inputs are pre-tokenized (not numericalized)

  • tok_kwargs : <class 'dict'>, optional

    Any other keyword arguments you want to include during tokenization
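
Blurr's HF_TextBlock already handles this wiring for you (its dl_type defaults to SortedDL), but as a hedged sketch, here is how a tokenization-aware sort function could be handed to fastai's SortedDL manually so that similar-length sequences land in the same mini-batch. It assumes the dblock, imdb_df, and hf_tokenizer objects created in the sequence classification example below.

from functools import partial
from fastai.text.data import SortedDL

sort_func = partial(blurr_sort_func, hf_tokenizer=hf_tokenizer)
dls = dblock.dataloaders(imdb_df, bs=4, dl_kwargs=[{'sort_func': sort_func}, {}])  # manual override; normally unnecessary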

class HF_TextBlock[source]

HF_TextBlock(hf_arch:str=None, hf_config:PretrainedConfig=None, hf_tokenizer:PreTrainedTokenizerBase=None, hf_model:PreTrainedModel=None, before_batch_tfm:HF_BeforeBatchTransform=None, after_batch_tfm:HF_AfterBatchTransform=None, max_length:int=None, padding:Union[bool, str]=True, truncation:Union[bool, str]=True, is_split_into_words:bool=False, input_return_type=HF_BaseInput, dl_type:DataLoader=None, before_batch_kwargs={}, after_batch_kwargs={}, tok_kwargs={}, text_gen_kwargs={}, **kwargs) :: TransformBlock

The core TransformBlock to prepare your data for training in Blurr with fastai's DataBlock API

Parameters:

  • hf_arch : <class 'str'>, optional

    The abbreviation/name of your Hugging Face transformer architecture (not required if passing in an instance of `HF_BeforeBatchTransform` to `before_batch_tfm`)

  • hf_config : <class 'transformers.configuration_utils.PretrainedConfig'>, optional

    A Hugging Face configuration object (not required if passing in an instance of `HF_BeforeBatchTransform` to `before_batch_tfm`)

  • hf_tokenizer : <class 'transformers.tokenization_utils_base.PreTrainedTokenizerBase'>, optional

    A Hugging Face tokenizer (not required if passing in an instance of `HF_BeforeBatchTransform` to `before_batch_tfm`)

  • hf_model : <class 'transformers.modeling_utils.PreTrainedModel'>, optional

    A Hugging Face model (not required if passing in an instance of `HF_BeforeBatchTransform` to `before_batch_tfm`)

  • before_batch_tfm : <class 'blurr.data.core.HF_BeforeBatchTransform'>, optional

    The before batch transform you want to use to tokenize your raw data on the fly (defaults to an instance of `HF_BeforeBatchTransform` created using the Hugging Face objects defined above)

  • after_batch_tfm : <class 'blurr.data.core.HF_AfterBatchTransform'>, optional

The batch_tfms to apply to the creation of your DataLoaders (defaults to an instance of HF_AfterBatchTransform created using the Hugging Face objects defined above)

  • max_length : <class 'int'>, optional

    To control the length of the padding/truncation. It can be an integer or None, in which case it will default to the maximum length the model can accept. If the model has no specific maximum input length, truncation/padding to max_length is deactivated. See [Everything you always wanted to know about padding and truncation](https://huggingface.co/transformers/preprocessing.html#everything-you-always-wanted-to-know-about-padding-and-truncation)

  • padding : typing.Union[bool, str], optional

To control the `padding` applied to your `hf_tokenizer` during tokenization. If None, will default to `False` or `'do_not_pad'`. See [Everything you always wanted to know about padding and truncation](https://huggingface.co/transformers/preprocessing.html#everything-you-always-wanted-to-know-about-padding-and-truncation)

  • truncation : typing.Union[bool, str], optional

    To control `truncation` applied to your `hf_tokenizer` during tokenization. If None, will default to `False` or `do_not_truncate`. See [Everything you always wanted to know about padding and truncation](https://huggingface.co/transformers/preprocessing.html#everything-you-always-wanted-to-know-about-padding-and-truncation)

  • is_split_into_words : <class 'bool'>, optional

    The `is_split_into_words` argument applied to your `hf_tokenizer` during tokenization. Set this to `True` if your inputs are pre-tokenized (not numericalized)

  • input_return_type : <class 'torch._C._TensorMeta'>, optional

The return type your decoded inputs should be cast to (used by methods such as `show_batch`)

  • dl_type : <class 'fastai.data.load.DataLoader'>, optional

    The type of `DataLoader` you want created (defaults to `SortedDL`)

  • before_batch_kwargs : <class 'dict'>, optional

    Any keyword arguments you want applied to your before batch tfm

  • after_batch_kwargs : <class 'dict'>, optional

    Any keyword arguments you want applied to your after batch tfm (or referred to in fastai as `batch_tfms`)

  • tok_kwargs : <class 'dict'>, optional

    Any keyword arguments you want your Hugging Face tokenizer to use during tokenization

  • text_gen_kwargs : <class 'dict'>, optional

    Any keyword arguments you want to have applied with generating text

  • kwargs : <class 'inspect._empty'>

A basic wrapper that links default transforms for the data block API

HF_TextBlock has been dramatically simplified from its predecessor. It handles setting up your HF_BeforeBatchTransform and HF_AfterBatchTransform transforms regardless of data source (e.g., it will work with files, DataFrames, whatever). You must either pass in your own instance of a HF_BeforeBatchTransform class, or else provide the Hugging Face architecture and tokenizer via hf_arch and hf_tokenizer (the other arguments are optional); an example of the former is shown below.
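
Here is a hedged sketch of that first route, building the before-batch transform yourself and handing it to HF_TextBlock. The parameter names come from the signatures shown above, and it assumes the Hugging Face objects created in the sequence classification example below.

before_batch_tfm = HF_BeforeBatchTransform(hf_arch, hf_config, hf_tokenizer, hf_model,
                                           max_length=128, padding=True, truncation=True)
blocks = (HF_TextBlock(before_batch_tfm=before_batch_tfm), CategoryBlock)
dblock = DataBlock(blocks=blocks, get_x=ColReader('text'), get_y=ColReader('label'), splitter=ColSplitter())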

Low-level API: For working with PyTorch and/or fast.ai Datasets & DataLoaders

Below is a low-level API for working with basic PyTorch Datasets (e.g., a dataset from the Hugging Face datasets library) and DataLoaders. Use the approach detailed below if you already have, or want to use, a plain ol' PyTorch Dataset instead of the fast.ai DataBlock API.

class BlurrBatchCreator[source]

BlurrBatchCreator(hf_tokenizer:PreTrainedTokenizerBase, data_collator:Type[CT_co]=None)

A class that can be assigned to a TfmdDL.create_batch method; used by Blurr's low-level API to create batches the rest of the library can work with (see the sketch after the parameter list)

Parameters:

  • hf_tokenizer : <class 'transformers.tokenization_utils_base.PreTrainedTokenizerBase'>

    Your Hugging Face tokenizer

  • data_collator : typing.Type, optional

Defaults to Hugging Face's DataCollatorWithPadding(tokenizer=hf_tokenizer)
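
As a hedged usage sketch (presumably what BlurrDataLoader does for you under the hood), a BlurrBatchCreator instance can be handed to a fastai TfmdDL as its create_batch function so that batches get collated by a Hugging Face data collator. It assumes the hf_tokenizer and tokenized_datasets objects created in the low-level example further down.

from fastai.data.core import TfmdDL

batch_creator = BlurrBatchCreator(hf_tokenizer)  # defaults to DataCollatorWithPadding(tokenizer=hf_tokenizer)
dl = TfmdDL(tokenized_datasets["train"], create_batch=batch_creator, bs=8)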

class BlurrBatchTransform[source]

BlurrBatchTransform(hf_arch:str=None, hf_config:PretrainedConfig=None, hf_tokenizer:PreTrainedTokenizerBase=None, hf_model:PreTrainedModel=None, is_split_into_words:bool=False, ignore_token_id:int=-100, tok_kwargs={}, text_gen_kwargs={}, input_return_type=HF_BaseInput, **kwargs) :: HF_AfterBatchTransform

A class used to cast your inputs into something understandable in fastai show methods

Parameters:

  • hf_arch : <class 'str'>, optional

The abbreviation/name of your Hugging Face transformer architecture

  • hf_config : <class 'transformers.configuration_utils.PretrainedConfig'>, optional

A Hugging Face configuration object

  • hf_tokenizer : <class 'transformers.tokenization_utils_base.PreTrainedTokenizerBase'>, optional

A Hugging Face tokenizer

  • hf_model : <class 'transformers.modeling_utils.PreTrainedModel'>, optional

A Hugging Face model

  • is_split_into_words : <class 'bool'>, optional

    The `is_split_into_words` argument applied to your `hf_tokenizer` during tokenization. Set this to `True` if your inputs are pre-tokenized (not numericalized)

  • ignore_token_id : <class 'int'>, optional

    The token ID to ignore when calculating loss/metrics

  • tok_kwargs : <class 'dict'>, optional

    Any other keyword arguments you want included when using your `hf_tokenizer` to tokenize your inputs

  • text_gen_kwargs : <class 'dict'>, optional

    Any text generation keyword arguments

  • input_return_type : <class 'torch._C._TensorMeta'>, optional

The return type your decoded inputs should be cast to (used by methods such as `show_batch`)

  • kwargs : <class 'inspect._empty'>

class BlurrDataLoader[source]

BlurrDataLoader(dataset:Union[Dataset, Datasets], hf_arch:str, hf_config:PretrainedConfig, hf_tokenizer:PreTrainedTokenizerBase, hf_model:PreTrainedModel, batch_creator:BlurrBatchCreator=None, batch_tfm:BlurrBatchTransform=None, preproccesing_func:Callable[Union[Dataset, Datasets], PreTrainedTokenizerBase, PreTrainedModel, Union[Dataset, Datasets]]=None, batch_tfm_kwargs={}, bs=64, shuffle=False, num_workers=None, verbose=False, do_setup=True, pin_memory=False, timeout=0, batch_size=None, drop_last=False, indexed=None, n=None, device=None, persistent_workers=False, wif=None, before_iter=None, after_item=None, before_batch=None, after_batch=None, after_iter=None, create_batches=None, create_item=None, create_batch=None, retain=None, get_idxs=None, sample=None, shuffle_fn=None, do_batch=None) :: TfmdDL

A class that makes it easy to create a fast.ai DataLoader that works with Blurr

Parameters:

  • dataset : typing.Union[torch.utils.data.dataset.Dataset, fastai.data.core.Datasets]

    A standard PyTorch Dataset

  • hf_arch : <class 'str'>

The abbreviation/name of your Hugging Face transformer architecture

  • hf_config : <class 'transformers.configuration_utils.PretrainedConfig'>

A Hugging Face configuration object

  • hf_tokenizer : <class 'transformers.tokenization_utils_base.PreTrainedTokenizerBase'>

A Hugging Face tokenizer

  • hf_model : <class 'transformers.modeling_utils.PreTrainedModel'>

A Hugging Face model

  • batch_creator : <class 'blurr.data.core.BlurrBatchCreator'>, optional

    An instance of `BlurrBatchCreator` or equivalent

  • batch_tfm : <class 'blurr.data.core.BlurrBatchTransform'>, optional

    The batch_tfm used to decode Blurr batches (default: HF_AfterBatchTransform)

  • preproccesing_func : typing.Callable[[typing.Union[torch.utils.data.dataset.Dataset, fastai.data.core.Datasets], transformers.tokenization_utils_base.PreTrainedTokenizerBase, transformers.modeling_utils.PreTrainedModel], typing.Union[torch.utils.data.dataset.Dataset, fastai.data.core.Datasets]], optional

A preprocessing function that will be applied to your dataset (e.g., preproc_hf_dataset below)

  • batch_tfm_kwargs : <class 'dict'>, optional

    Keyword arguments to be applied to your `batch_tfm`

  • bs : <class 'int'>, optional

  • shuffle : <class 'bool'>, optional

  • num_workers : <class 'NoneType'>, optional

  • verbose : <class 'bool'>, optional

  • do_setup : <class 'bool'>, optional

  • pin_memory : <class 'bool'>, optional

  • timeout : <class 'int'>, optional

  • batch_size : <class 'NoneType'>, optional

  • drop_last : <class 'bool'>, optional

  • indexed : <class 'NoneType'>, optional

  • n : <class 'NoneType'>, optional

  • device : <class 'NoneType'>, optional

  • persistent_workers : <class 'bool'>, optional

  • wif : <class 'NoneType'>, optional

  • before_iter : <class 'NoneType'>, optional

  • after_item : <class 'NoneType'>, optional

  • before_batch : <class 'NoneType'>, optional

  • after_batch : <class 'NoneType'>, optional

  • after_iter : <class 'NoneType'>, optional

  • create_batches : <class 'NoneType'>, optional

  • create_item : <class 'NoneType'>, optional

  • create_batch : <class 'NoneType'>, optional

  • retain : <class 'NoneType'>, optional

  • get_idxs : <class 'NoneType'>, optional

  • sample : <class 'NoneType'>, optional

  • shuffle_fn : <class 'NoneType'>, optional

  • do_batch : <class 'NoneType'>, optional

Utility & base show_batch methods

get_blurr_tfm[source]

get_blurr_tfm(tfms_list:Pipeline, tfm_class:Transform=HF_BeforeBatchTransform)

Given a fastai Pipeline of transforms (e.g., a DataLoaders' batch transforms), this method can be used to get the transform instance used in your Blurr DataBlock

Parameters:

  • tfms_list : <class 'fastcore.transform.Pipeline'>

    A list of transforms (e.g., dls.after_batch, dls.before_batch, etc...)

  • tfm_class : <class 'fastcore.transform.Transform'>, optional

    The transform to find

first_blurr_tfm[source]

first_blurr_tfm(dls:DataLoaders, before_batch_tfm_class:Transform=HF_BeforeBatchTransform, blurr_batch_tfm_class:Transform=BlurrBatchTransform)

This convenience method will find the first Blurr transform required by methods such as show_batch and show_results. The returned transform should have everything you need to properly decode and 'show' your Hugging Face inputs/targets (a usage sketch follows the parameter list).

Parameters:

  • dls : <class 'fastai.data.core.DataLoaders'>

Your fast.ai `DataLoaders`

  • before_batch_tfm_class : <class 'fastcore.transform.Transform'>, optional

    The before_batch transform to look for

  • blurr_batch_tfm_class : <class 'fastcore.transform.Transform'>, optional

    The after_batch (or batch_tfm) to look for
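
A hedged usage sketch of both helpers, assuming the dls built in the sequence classification example below and that the returned transform exposes the Hugging Face objects it was constructed with:

before_tfm = get_blurr_tfm(dls.before_batch)  # search a specific pipeline for a given transform class ...
tfm = first_blurr_tfm(dls)                    # ... or let first_blurr_tfm check the usual places for you
type(tfm), type(tfm.hf_tokenizer).__name__    # e.g., (HF_BeforeBatchTransform, 'RobertaTokenizerFast')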

Sequence classification

The sections below demonstrate how to construct your DataBlock for a sequence classification task (e.g., a model that requires a single text input) using the mid-level API, as well as how to use the low-level API should you wish to work with standard PyTorch or fast.ai Datasets and DataLoaders.

Using the mid-level API

path = untar_data(URLs.IMDB_SAMPLE)

model_path = Path('models')
imdb_df = pd.read_csv(path/'texts.csv')
imdb_df.head()
label text is_valid
0 negative Un-bleeping-believable! Meg Ryan doesn't even look her usual pert lovable self in this, which normally makes me forgive her shallow ticky acting schtick. Hard to believe she was the producer on this dog. Plus Kevin Kline: what kind of suicide trip has his career been on? Whoosh... Banzai!!! Finally this was directed by the guy who did Big Chill? Must be a replay of Jonestown - hollywood style. Wooofff! False
1 positive This is a extremely well-made film. The acting, script and camera-work are all first-rate. The music is good, too, though it is mostly early in the film, when things are still relatively cheery. There are no really superstars in the cast, though several faces will be familiar. The entire cast does an excellent job with the script.<br /><br />But it is hard to watch, because there is no good end to a situation like the one presented. It is now fashionable to blame the British for setting Hindus and Muslims against each other, and then cruelly separating them into two countries. There is som... False
2 negative Every once in a long while a movie will come along that will be so awful that I feel compelled to warn people. If I labor all my days and I can save but one soul from watching this movie, how great will be my joy.<br /><br />Where to begin my discussion of pain. For starters, there was a musical montage every five minutes. There was no character development. Every character was a stereotype. We had swearing guy, fat guy who eats donuts, goofy foreign guy, etc. The script felt as if it were being written as the movie was being shot. The production value was so incredibly low that it felt li... False
3 positive Name just says it all. I watched this movie with my dad when it came out and having served in Korea he had great admiration for the man. The disappointing thing about this film is that it only concentrate on a short period of the man's life - interestingly enough the man's entire life would have made such an epic bio-pic that it is staggering to imagine the cost for production.<br /><br />Some posters elude to the flawed characteristics about the man, which are cheap shots. The theme of the movie "Duty, Honor, Country" are not just mere words blathered from the lips of a high-brassed offic... False
4 negative This movie succeeds at being one of the most unique movies you've seen. However this comes from the fact that you can't make heads or tails of this mess. It almost seems as a series of challenges set up to determine whether or not you are willing to walk out of the movie and give up the money you just paid. If you don't want to feel slighted you'll sit through this horrible film and develop a real sense of pity for the actors involved, they've all seen better days, but then you realize they actually got paid quite a bit of money to do this and you'll lose pity for them just like you've alr... False

There are several ways to get at the four Hugging Face elements we need (i.e., the architecture name, config, tokenizer, and model). We can create them directly (a sketch of that route is shown below), or we can use one of the helper methods available via BLURR.
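
For completeness, here is a hedged sketch of the "create them directly" route using the transformers Auto classes; the example that follows uses the BLURR helper instead. The hf_arch string is assumed to be the model_type abbreviation Blurr expects (for distilroberta-base that is "roberta").

from transformers import AutoConfig, AutoTokenizer, AutoModelForSequenceClassification

hf_config = AutoConfig.from_pretrained("distilroberta-base")
hf_tokenizer = AutoTokenizer.from_pretrained("distilroberta-base")
hf_model = AutoModelForSequenceClassification.from_pretrained("distilroberta-base", config=hf_config)
hf_arch = hf_config.model_type  # "roberta"; assumed to be the abbreviation Blurr expects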

from transformers import AutoModelForSequenceClassification
model_cls = AutoModelForSequenceClassification

pretrained_model_name = "distilroberta-base" # "distilbert-base-uncased" "bert-base-uncased"
hf_arch, hf_config, hf_tokenizer, hf_model = BLURR.get_hf_objects(pretrained_model_name, model_cls=model_cls)

Once you have those elements, creating your DataBlock is as simple as the code below.

blocks = (HF_TextBlock(hf_arch, hf_config, hf_tokenizer, hf_model), CategoryBlock)
dblock = DataBlock(blocks=blocks, get_x=ColReader('text'), get_y=ColReader('label'), splitter=ColSplitter())
dls = dblock.dataloaders(imdb_df, bs=4)
b = dls.one_batch(); len(b), len(b[0]['input_ids']), b[0]['input_ids'].shape, len(b[1]) 
(2, 4, torch.Size([4, 512]), 4)

Let's take a look at the actual types represented by our batch

explode_types(b)
{tuple: [dict, fastai.torch_core.TensorCategory]}
dls.show_batch(dataloaders=dls, max_n=2, trunc_at=500)
text target
0 Raising Victor Vargas: A Review<br /><br />You know, Raising Victor Vargas is like sticking your hands into a big, steaming bowl of oatmeal. It's warm and gooey, but you're not sure if it feels right. Try as I might, no matter how warm and gooey Raising Victor Vargas became I was always aware that something didn't quite feel right. Victor Vargas suffers from a certain overconfidence on the director's part. Apparently, the director thought that the ethnic backdrop of a Latino family on the lower negative
1 Many neglect that this isn't just a classic due to the fact that it's the first 3D game, or even the first shoot-'em-up. It's also one of the first stealth games, one of the only(and definitely the first) truly claustrophobic games, and just a pretty well-rounded gaming experience in general. With graphics that are terribly dated today, the game thrusts you into the role of B.J.(don't even *think* I'm going to attempt spelling his last name!), an American P.O.W. caught in an underground bunker. positive

Using the low-level API

Step 1: Grab your datasets

raw_datasets = load_dataset("glue", "mrpc")
Reusing dataset glue (/home/wgilliam/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad)
def tokenize_function(example):
    return hf_tokenizer(example["sentence1"], example["sentence2"], truncation=True)

tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)

Step 2: Define any pre-processing that needs to be done to your datasets (optional)

preproc_hf_dataset[source]

preproc_hf_dataset(dataset:Union[Dataset, Datasets], hf_tokenizer:PreTrainedTokenizerBase, hf_model:PreTrainedModel)

This method can be used to preprocess most Hugging Face Datasets for use in Blurr and other training libraries (a sketch of the kind of preprocessing involved follows the parameter list)

Parameters:

  • dataset : typing.Union[torch.utils.data.dataset.Dataset, fastai.data.core.Datasets]

    A standard PyTorch Dataset or fast.ai Datasets

  • hf_tokenizer : <class 'transformers.tokenization_utils_base.PreTrainedTokenizerBase'>

    A Hugging Face tokenizer

  • hf_model : <class 'transformers.modeling_utils.PreTrainedModel'>

    A Hugging Face model
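
As a hedged illustration (this is NOT the actual implementation of preproc_hf_dataset), the kind of preprocessing a tokenized Hugging Face Dataset typically needs before collation looks something like the hypothetical function below: drop the raw text columns, rename the label column to what the model's forward() expects, and switch the dataset's format to PyTorch tensors.

def simple_mrpc_preproc(dataset, hf_tokenizer, hf_model):
    # MRPC-specific raw columns that the data collator should not see
    dataset = dataset.remove_columns(["sentence1", "sentence2", "idx"])
    # Hugging Face models typically expect the target column to be named "labels"
    dataset = dataset.rename_column("label", "labels")
    # return PyTorch tensors when the dataset is indexed
    dataset.set_format("torch")
    return dataset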

Step 3: Use BlurrDataLoader to build Blurr-friendly DataLoaders from your datasets

trn_dl = BlurrDataLoader(tokenized_datasets["train"], 
                         hf_arch, hf_config, hf_tokenizer, hf_model,
                         preproccesing_func=preproc_hf_dataset, shuffle=True, batch_size=8)

val_dl = BlurrDataLoader(tokenized_datasets["validation"],
                         hf_arch, hf_config, hf_tokenizer, hf_model,
                         preproccesing_func=preproc_hf_dataset, batch_size=16)

dls = DataLoaders(trn_dl, val_dl)
b = dls.one_batch()
b[0]['input_ids'].shape
torch.Size([8, 71])
dls.show_batch(dataloaders=dls, max_n=2, trunc_at=500)
text target
0 The U.S. government and private technology experts warned Wednesday that hackers plan to attack thousands of Web sites Sunday in a loosely co-ordinated " contest " that could disrupt Internet traffic. THE US government and private technology experts have warned that hackers plan to attack thousands of websites on Sunday in a loosely co-ordinated " contest " that could disrupt Internet traffic. 1
1 What's more, Mr. O 'Neill said that he hoped Hyundai would sell one million vehicles annually in the United States by 2010. That wasn 't all : by 2010, Mr. O 'Neill said, he hoped Hyundai would sell 1 million vehicles annually in the United States. 1

Tests

The tests below ensure that the core DataBlock code above works for all pretrained sequence classification models available in Hugging Face. These tests are excluded from the CI workflow because of how long they would take to run and the amount of data that would need to be downloaded.

Note: Feel free to modify the code below to test whatever pretrained classification models you are working with ... and if any of your pretrained sequence classification models fail, please submit a GitHub issue (or a PR if you'd like to fix it yourself). A sketch of what such a test loop can look like is shown below.
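
A hedged sketch of what such a smoke test can look like (the real test code is not shown on this page, and the model list here is a hypothetical subset); it reuses imdb_df and the DataBlock setup from the mid-level example above.

test_results = []
for model_name in ["distilbert-base-uncased", "roberta-base"]:
    arch, config, tokenizer, model = BLURR.get_hf_objects(model_name, model_cls=AutoModelForSequenceClassification)
    try:
        blocks = (HF_TextBlock(arch, config, tokenizer, model), CategoryBlock)
        dblock = DataBlock(blocks=blocks, get_x=ColReader('text'), get_y=ColReader('label'), splitter=ColSplitter())
        dblock.dataloaders(imdb_df, bs=2).one_batch()
        test_results.append((arch, type(tokenizer).__name__, model_name, "PASSED", ""))
    except Exception as err:
        test_results.append((arch, type(tokenizer).__name__, model_name, "FAILED", str(err)))

pd.DataFrame(test_results, columns=["arch", "tokenizer", "model_name", "result", "error"])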

arch tokenizer model_name result error
0 albert AlbertTokenizerFast albert-base-v1 PASSED
1 bart BartTokenizerFast facebook/bart-base PASSED
2 bert BertTokenizerFast bert-base-uncased PASSED
3 big_bird BigBirdTokenizerFast google/bigbird-roberta-base PASSED
4 ctrl CTRLTokenizer sshleifer/tiny-ctrl PASSED
5 camembert CamembertTokenizerFast camembert-base PASSED
6 convbert ConvBertTokenizerFast sarnikowski/convbert-medium-small-da-cased PASSED
7 deberta DebertaTokenizerFast microsoft/deberta-base PASSED
8 deberta_v2 DebertaV2Tokenizer microsoft/deberta-v2-xlarge PASSED
9 distilbert DistilBertTokenizerFast distilbert-base-uncased PASSED
10 electra ElectraTokenizerFast monologg/electra-small-finetuned-imdb PASSED
11 flaubert FlaubertTokenizer flaubert/flaubert_small_cased PASSED
12 funnel FunnelTokenizerFast huggingface/funnel-small-base PASSED
13 gpt2 GPT2TokenizerFast gpt2 PASSED
14 ibert RobertaTokenizer kssteven/ibert-roberta-base PASSED
15 led LEDTokenizerFast allenai/led-base-16384 PASSED
16 layoutlm LayoutLMTokenizerFast microsoft/layoutlm-base-uncased PASSED
17 longformer LongformerTokenizerFast allenai/longformer-base-4096 PASSED
18 mbart MBartTokenizerFast sshleifer/tiny-mbart PASSED
19 mpnet MPNetTokenizerFast microsoft/mpnet-base PASSED
20 mobilebert MobileBertTokenizerFast google/mobilebert-uncased PASSED
21 openai OpenAIGPTTokenizerFast openai-gpt PASSED
22 roberta RobertaTokenizerFast roberta-base PASSED
23 squeezebert SqueezeBertTokenizerFast squeezebert/squeezebert-uncased PASSED
24 transfo_xl TransfoXLTokenizer transfo-xl-wt103 PASSED
25 xlm XLMTokenizer xlm-mlm-en-2048 PASSED
26 xlm_roberta XLMRobertaTokenizerFast xlm-roberta-base PASSED
27 xlnet XLNetTokenizerFast xlnet-base-cased PASSED

Summary

The blurr.data.core module contains the fundamental bits for all data preprocessing tasks in Blurr.