This module contains the bits required to use the fastai DataBlock API and/or mid-level data processing pipelines to organize your data for causal and masked language modeling tasks. This includes things like training BERT from scratch or fine-tuning a particular pre-trained LM on your own corpus.
 
What we're running with at the time this documentation was generated:
torch: 1.9.0+cu102
fastai: 2.5.2
transformers: 4.10.0
wiki_path = untar_data(URLs.WIKITEXT_TINY)
wiki_path.ls()
(#2) [Path('/home/wgilliam/.fastai/data/wikitext-2/train.csv'),Path('/home/wgilliam/.fastai/data/wikitext-2/test.csv')]
train_df = pd.read_csv(wiki_path/'train.csv', header=None)
valid_df = pd.read_csv(wiki_path/'test.csv', header=None)

print(len(train_df), len(valid_df))
train_df.head()
615 47
0
0 \n = 2013 – 14 York City F.C. season = \n \n The 2013 – 14 season was the <unk> season of competitive association football and 77th season in the Football League played by York City Football Club , a professional football club based in York , North Yorkshire , England . Their 17th @-@ place finish in 2012 – 13 meant it was their second consecutive season in League Two . The season ran from 1 July 2013 to 30 June 2014 . \n Nigel Worthington , starting his first full season as York manager , made eight permanent summer signings . By the turn of the year York were only above the relegation z...
1 \n = Big Boy ( song ) = \n \n " Big Boy " <unk> " I 'm A Big Boy Now " was the first single ever recorded by the Jackson 5 , which was released by Steeltown Records in January 1968 . The group played instruments on many of their Steeltown compositions , including " Big Boy " . The song was neither a critical nor commercial success , but the Jackson family were delighted with the outcome nonetheless . \n The Jackson 5 would release a second single with Steeltown Records before moving to Motown Records . The group 's recordings at Steeltown Records were thought to be lost , but they were re...
2 \n = The Remix ( Lady Gaga album ) = \n \n The Remix is a remix album by American recording artist Lady Gaga . Released in Japan on March 3 , 2010 , it contains remixes of the songs from her first studio album , The Fame ( 2008 ) , and her third extended play , The Fame Monster ( 2009 ) . A revised version of the track list was prepared for release in additional markets , beginning with Mexico on May 3 , 2010 . A number of recording artists have produced the songs , including Pet Shop Boys , Passion Pit and The Sound of Arrows . The remixed versions feature both uptempo and <unk> composit...
3 \n = New Year 's Eve ( Up All Night ) = \n \n " New Year 's Eve " is the twelfth episode of the first season of the American comedy television series Up All Night . The episode originally aired on NBC in the United States on January 12 , 2012 . It was written by Erica <unk> and was directed by Beth McCarthy @-@ Miller . The episode also featured a guest appearance from Jason Lee as Chris and Reagan 's neighbor and Ava 's boyfriend , Kevin . \n During Reagan ( Christina Applegate ) and Chris 's ( Will <unk> ) first New Year 's Eve game night , Reagan 's competitiveness comes out causing Ch...
4 \n = Geopyxis carbonaria = \n \n Geopyxis carbonaria is a species of fungus in the genus Geopyxis , family <unk> . First described to science in 1805 , and given its current name in 1889 , the species is commonly known as the charcoal loving elf @-@ cup , dwarf <unk> cup , <unk> <unk> cup , or pixie cup . The small , <unk> @-@ shaped fruitbodies of the fungus are reddish @-@ brown with a whitish fringe and measure up to 2 cm ( 0 @.@ 8 in ) across . They have a short , tapered stalk . Fruitbodies are commonly found on soil where brush has recently been burned , sometimes in great numbers ....
train_df['is_valid'] = False
valid_df['is_valid'] = True

df = pd.concat([train_df, valid_df])
df.head()
0 is_valid
0 \n = 2013 – 14 York City F.C. season = \n \n The 2013 – 14 season was the <unk> season of competitive association football and 77th season in the Football League played by York City Football Club , a professional football club based in York , North Yorkshire , England . Their 17th @-@ place finish in 2012 – 13 meant it was their second consecutive season in League Two . The season ran from 1 July 2013 to 30 June 2014 . \n Nigel Worthington , starting his first full season as York manager , made eight permanent summer signings . By the turn of the year York were only above the relegation z... False
1 \n = Big Boy ( song ) = \n \n " Big Boy " <unk> " I 'm A Big Boy Now " was the first single ever recorded by the Jackson 5 , which was released by Steeltown Records in January 1968 . The group played instruments on many of their Steeltown compositions , including " Big Boy " . The song was neither a critical nor commercial success , but the Jackson family were delighted with the outcome nonetheless . \n The Jackson 5 would release a second single with Steeltown Records before moving to Motown Records . The group 's recordings at Steeltown Records were thought to be lost , but they were re... False
2 \n = The Remix ( Lady Gaga album ) = \n \n The Remix is a remix album by American recording artist Lady Gaga . Released in Japan on March 3 , 2010 , it contains remixes of the songs from her first studio album , The Fame ( 2008 ) , and her third extended play , The Fame Monster ( 2009 ) . A revised version of the track list was prepared for release in additional markets , beginning with Mexico on May 3 , 2010 . A number of recording artists have produced the songs , including Pet Shop Boys , Passion Pit and The Sound of Arrows . The remixed versions feature both uptempo and <unk> composit... False
3 \n = New Year 's Eve ( Up All Night ) = \n \n " New Year 's Eve " is the twelfth episode of the first season of the American comedy television series Up All Night . The episode originally aired on NBC in the United States on January 12 , 2012 . It was written by Erica <unk> and was directed by Beth McCarthy @-@ Miller . The episode also featured a guest appearance from Jason Lee as Chris and Reagan 's neighbor and Ava 's boyfriend , Kevin . \n During Reagan ( Christina Applegate ) and Chris 's ( Will <unk> ) first New Year 's Eve game night , Reagan 's competitiveness comes out causing Ch... False
4 \n = Geopyxis carbonaria = \n \n Geopyxis carbonaria is a species of fungus in the genus Geopyxis , family <unk> . First described to science in 1805 , and given its current name in 1889 , the species is commonly known as the charcoal loving elf @-@ cup , dwarf <unk> cup , <unk> <unk> cup , or pixie cup . The small , <unk> @-@ shaped fruitbodies of the fungus are reddish @-@ brown with a whitish fringe and measure up to 2 cm ( 0 @.@ 8 in ) across . They have a short , tapered stalk . Fruitbodies are commonly found on soil where brush has recently been burned , sometimes in great numbers .... False

LMType[source]

Enum = [CAUSAL, MASKED]

An enumeration.

Abstract LMStrategy Class & LM Before Batch Transform

class LMStrategy[source]

LMStrategy(hf_tokenizer, ignore_token_id=-100) :: ABC

ABC for various language modeling strategies (e.g., causal, BertMLM, WholeWordMLM, etc...)

Parameters:

  • hf_tokenizer : <class 'inspect._empty'>

  • ignore_token_id : <class 'int'>, optional

class HF_LMBeforeBatchTransform[source]

HF_LMBeforeBatchTransform(hf_arch:str, hf_config:PretrainedConfig, hf_tokenizer:PreTrainedTokenizerBase, hf_model:PreTrainedModel, lm_strategy_cls:LMStrategy, max_length:int=None, padding:Union[bool, str]=True, truncation:Union[bool, str]=True, ignore_token_id=-100, is_split_into_words:bool=False, tok_kwargs={}, text_gen_kwargs={}, **kwargs) :: HF_BeforeBatchTransform

Handles everything you need to assemble a mini-batch of inputs and targets, as well as decode the dictionary produced as a byproduct of the tokenization process in the encodes method.

Parameters:

  • hf_arch : <class 'str'>

    The abbreviation/name of your Hugging Face transformer architecture (e.b., bert, bart, etc..)

  • hf_config : <class 'transformers.configuration_utils.PretrainedConfig'>

    A specific configuration instance you want to use

  • hf_tokenizer : <class 'transformers.tokenization_utils_base.PreTrainedTokenizerBase'>

    A Hugging Face tokenizer

  • hf_model : <class 'transformers.modeling_utils.PreTrainedModel'>

    A Hugging Face model

  • lm_strategy_cls : <class 'blurr.data.language_modeling.LMStrategy'>

    The language modeling strategy (or objective)

  • max_length : <class 'int'>, optional

    To control the length of the padding/truncation. It can be an integer or None, in which case it will default to the maximum length the model can accept. If the model has no specific maximum input length, truncation/padding to max_length is deactivated. See [Everything you always wanted to know about padding and truncation](https://huggingface.co/transformers/preprocessing.html#everything-you-always-wanted-to-know-about-padding-and-truncation)

  • padding : typing.Union[bool, str], optional

    To control the `padding` applied to your `hf_tokenizer` during tokenization. If None, will default to `False` or `'do_not_pad'. See [Everything you always wanted to know about padding and truncation](https://huggingface.co/transformers/preprocessing.html#everything-you-always-wanted-to-know-about-padding-and-truncation)

  • truncation : typing.Union[bool, str], optional

    To control `truncation` applied to your `hf_tokenizer` during tokenization. If None, will default to `False` or `do_not_truncate`. See [Everything you always wanted to know about padding and truncation](https://huggingface.co/transformers/preprocessing.html#everything-you-always-wanted-to-know-about-padding-and-truncation)

  • ignore_token_id : <class 'int'>, optional

    The token ID that should be ignored when calculating the loss

  • is_split_into_words : <class 'bool'>, optional

    The `is_split_into_words` argument applied to your `hf_tokenizer` during tokenization. Set this to `True` if your inputs are pre-tokenized (not numericalized)

  • tok_kwargs : <class 'dict'>, optional

    Any other keyword arguments you want included when using your `hf_tokenizer` to tokenize your inputs

  • text_gen_kwargs : <class 'dict'>, optional

    Any keyword arguments you want included when generated text See [How to generate text](https://huggingface.co/blog/how-to-generate)

  • kwargs : <class 'inspect._empty'>

Our HF_LMBeforeBatchTransform allows us to update the input's labels and our targets appropriately given any language modeling task.

The labels argument allows you to forgo calculating the loss yourself by letting Hugging Face return it for you should you choose to do that. Padding tokens are set to -100 by default (e.g., CrossEntropyLossFlat().ignore_index) and prevent cross entropy loss from considering token prediction for tokens it should ... i.e., the padding tokens. For more information on the meaning of this argument, see the Hugging Face glossary entry for "Labels"

Causal LM

model_cls = AutoModelForCausalLM

pretrained_model_name = "gpt2"
hf_arch, hf_config, hf_tokenizer, hf_model = BLURR.get_hf_objects(pretrained_model_name, model_cls=model_cls)

# some tokenizers like gpt and gpt2 do not have a pad token, so we add it here mainly for the purpose
# of setting the "labels" key appropriately (see below)
if (hf_tokenizer.pad_token is None): hf_tokenizer.pad_token = '[PAD]'

hf_tokenizer.pad_token, hf_tokenizer.pad_token_id
Using pad_token, but it is not set yet.
('[PAD]', 50256)

class HF_CausalLMInput[source]

HF_CausalLMInput(x, **kwargs) :: HF_BaseInput

The base represenation of your inputs; used by the various fastai show methods

class CausalLMStrategy[source]

CausalLMStrategy(hf_tokenizer, ignore_token_id=-100) :: LMStrategy

For next token prediction language modeling tasks, we want to use the CausalLMStrategy which makes the necessary changes in your inputs/targets for causal LMs

Parameters:

  • hf_tokenizer : <class 'inspect._empty'>

  • ignore_token_id : <class 'int'>, optional

before_batch_tfm = HF_LMBeforeBatchTransform(hf_arch, hf_config, hf_tokenizer, hf_model, 
                                             lm_strategy_cls=CausalLMStrategy)

blocks = (HF_TextBlock(before_batch_tfm=before_batch_tfm, input_return_type=HF_CausalLMInput), noop)

dblock = DataBlock(blocks=blocks, get_x=ColReader(0), splitter=ColSplitter(col='is_valid'))
dls = dblock.dataloaders(df, bs=4)
b = dls.one_batch()
b[0]['input_ids'].shape, b[0]['labels'].shape, b[1].shape
(torch.Size([4, 1024]), torch.Size([4, 1024]), torch.Size([4, 1024]))
explode_types(b)
{tuple: [dict, torch.Tensor]}
dls.show_batch(dataloaders=dls, max_n=2, trunc_at=500)
text target
0 \n = Bob Dylan = \n \n Bob Dylan ( / <unk> / ; born Robert Allen Zimmerman, May 24, 1941 ) is an American singer @-@ songwriter, artist and writer. He has been influential in popular music and culture for more than five decades. Much of his most celebrated work dates from the 1960s when his songs chronicled social unrest, although Dylan repudiated suggestions from journalists that he was a spokesman for his generation. Nevertheless, early songs such as " Blowin'in the Wind " and " The Times They A \n = Bob Dylan = \n \n Bob Dylan ( / <unk> / ; born Robert Allen Zimmerman, May 24, 1941 ) is an American singer @-@ songwriter, artist and writer. He has been influential in popular music and culture for more than five decades. Much of his most celebrated work dates from the 1960s when his songs chronicled social unrest, although Dylan repudiated suggestions from journalists that he was a spokesman for his generation. Nevertheless, early songs such as " Blowin'in the Wind " and " The Times They Ar
1 \n = Ireland = \n \n Ireland ( / <unk> / ; Irish : <unk> [ <unk> ] ; Ulster @-@ Scots : <unk> [ <unk> ] ) is an island in the North Atlantic. It is separated from Great Britain to its east by the North Channel, the Irish Sea, and St George's Channel. Ireland is the second @-@ largest island of the British Isles, the third @-@ largest in Europe, and the twentieth @-@ largest on Earth. \n <unk>, Ireland is divided between the Republic of Ireland ( officially named Ireland ), which covers five @-@ <un \n = Ireland = \n \n Ireland ( / <unk> / ; Irish : <unk> [ <unk> ] ; Ulster @-@ Scots : <unk> [ <unk> ] ) is an island in the North Atlantic. It is separated from Great Britain to its east by the North Channel, the Irish Sea, and St George's Channel. Ireland is the second @-@ largest island of the British Isles, the third @-@ largest in Europe, and the twentieth @-@ largest on Earth. \n <unk>, Ireland is divided between the Republic of Ireland ( officially named Ireland ), which covers five @-@ <unk

Masked LM

model_cls = AutoModelForMaskedLM

pretrained_model_name = "bert-base-uncased"
hf_arch, hf_config, hf_tokenizer, hf_model = BLURR.get_hf_objects(pretrained_model_name, model_cls=model_cls)

# some tokenizers like gpt and gpt2 do not have a pad token, so we add it here mainly for the purpose
# of setting the "labels" key appropriately (see below)
if (hf_tokenizer.pad_token is None): hf_tokenizer.pad_token = '[PAD]'

hf_tokenizer.pad_token, hf_tokenizer.pad_token_id
('[PAD]', 0)

We need a new input type for MLM tasks, particularly for Learner.show_batch and Learner.show_results

class HF_MLMInput[source]

HF_MLMInput(x, **kwargs) :: HF_BaseInput

The base represenation of your inputs; used by the various fastai show methods

class BertMLMStrategy[source]

BertMLMStrategy(hf_tokenizer, ignore_token_id=-100) :: LMStrategy

A masked language modeling strategy using the default BERT masking definition.

Parameters:

  • hf_tokenizer : <class 'inspect._empty'>

  • ignore_token_id : <class 'int'>, optional

before_batch_tfm = HF_LMBeforeBatchTransform(hf_arch, hf_config, hf_tokenizer, hf_model,
                                             lm_strategy_cls=BertMLMStrategy)

blocks = (HF_TextBlock(before_batch_tfm=before_batch_tfm, input_return_type=HF_MLMInput), noop)

dblock = DataBlock(blocks=blocks, get_x=ColReader(0), splitter=ColSplitter(col='is_valid'))
dls = dblock.dataloaders(df, bs=4)
b = dls.one_batch()
b[0]['input_ids'].shape, b[0]['labels'].shape, b[1].shape
(torch.Size([4, 512]), torch.Size([4, 512]), torch.Size([4, 512]))
b[0]['input_ids'][0][:20], b[0]['labels'][0][:20], b[1][0][:20]
(tensor([  101,  1027,  3960,  7758,   103,  3960,  7758,  1006,   103,  1026,
          4895,  2243,  1028,   103,  1025,  2141,  2728,  5297, 27946,  1010],
        device='cuda:1'),
 tensor([ -100,  -100,  -100,  -100,  1027,  -100,  -100,  -100,  1013,  -100,
          -100,  -100,  -100,  1013,  -100,  -100,  -100,  -100, 27946,  -100],
        device='cuda:1'),
 tensor([ -100,  -100,  -100,  -100,  1027,  -100,  -100,  -100,  1013,  -100,
          -100,  -100,  -100,  1013,  -100,  -100,  -100,  -100, 27946,  -100],
        device='cuda:1'))
explode_types(b)
{tuple: [dict, torch.Tensor]}
dls.show_batch(dataloaders=dls, max_n=2, trunc_at=250)
text target
0 = bob dylan = bob dylan ( / < un ##k > / ; born robert [MASK] zimmerman , [MASK] 24 , 1941 [MASK] is an american singer @ - @ [MASK] , artist and writer . he has been influential [MASK] popular music [MASK] culture for [MASK] than five decades . much of his most celebrated work dates from the 1960s when his songs chronicle [MASK] social unrest , although dylan rep [MASK] ##ated suggestions from journalists [MASK] he was [MASK] spokesman for his generation . nevertheless , early [songs] such as " blow ##in ' in the wind " and " the times they are [MASK] [worldwide] - @ < un [##k] > ' " became [MASK] [MASK] for the american civil [MASK] and anti @ - @ war movements . after he left his [MASK] base in the american [essentially] [ru] revival , his six @ - @ minute single [MASK] like a rolling stone " altered the range [MASK] popular music in 1965 . his mid @ - [MASK] 1960s [recordings] , backed by rock musicians , reached the top end [of] the united states music charts while also [MASK] < un ##k > and criticism from others in the folk movement . dylan ' s [MASK] have [MASK] various political [MASK] social , philosophical [,] and literary [MASK] . [they] [MASK] ##ied existing pop music [MASK] and appealed [MASK] the bu ##rgeon ##ing counter ##culture . initially inspired by the performances of little richard and = bob dylan = bob dylan ( / < un ##k > / ; born robert [allen] zimmerman , [may] 24 , 1941 [)] is an american singer @ - @ [songwriter] , artist and writer . he has been influential [in] popular music [and] culture for [more] than five decades . much of his most celebrated work dates from the 1960s when his songs chronicle [##d] social unrest , although dylan rep [##udi] ##ated suggestions from journalists [that] he was [a] spokesman for his generation . nevertheless , early [songs] such as " blow ##in ' in the wind " and " the times they are [a] [@] - @ < un [##k] > ' " became [anthem] [##s] for the american civil [rights] and anti @ - @ war movements . after he left his [initial] base in the american [folk] [music] revival , his six @ - @ minute single ["] like a rolling stone " altered the range [of] popular music in 1965 . his mid @ - [@] 1960s [recordings] , backed by rock musicians , reached the top end [of] the united states music charts while also [attracting] < un ##k > and criticism from others in the folk movement . dylan ' s [lyrics] have [incorporated] various political [,] social , philosophical [,] and literary [influences] . [they] [def] ##ied existing pop music [conventions] and appealed [to] the bu ##rgeon ##ing counter ##culture . initially inspired by the performances of little richard and
1 = ireland = [MASK] [##itical] [/] [MASK] un ##k > / ; irish : [MASK] [MASK] ##k > [ < un ##k > ] ; ulster @ - @ scots : < un ##k > [ < un ##k > ] ) is an [MASK] in the north atlantic [MASK] it is [##示] from great britain to its east by the north channel [challenges] the [MASK] sea [,] and st george ' s channel . ireland is the second [MASK] - @ largest island of the british isles , the third @ - @ [cascade] in europe , [MASK] the [MASK] @ [MASK] @ largest on [MASK] . [MASK] un [MASK] [MASK] , ireland is divided between the republic of ireland ( officially [named] ireland [seminal] , which covers five @ - [MASK] < un ##k > of the island [MASK] and northern ireland , which is part of the united kingdom , in the northeast of the island . in 2011 [the] population of ireland was about 6 @ . @ 4 million , ranking it the second @ - @ most populous [MASK] in europe after great britain . just under [MASK] [MASK] . [MASK] 6 million live in [MASK] [MASK] of ireland and just over [MASK] @ . @ [8] million live in northern ireland . the island ' s geography comprises [MASK] low @ - @ lying mountains surrounding a central plain , with several navigable rivers extending inland . the island has lush vegetation , = ireland = [ireland] [(] [/] [<] un ##k > / ; irish : [<] [un] ##k > [ < un ##k > ] ; ulster @ - @ scots : < un ##k > [ < un ##k > ] ) is an [island] in the north atlantic [.] it is [separated] from great britain to its east by the north channel [,] the [irish] sea [,] and st george ' s channel . ireland is the second [@] - @ largest island of the british isles , the third @ - @ [largest] in europe , [and] the [twentieth] @ [-] @ largest on [earth] . [<] un [##k] [>] , ireland is divided between the republic of ireland ( officially [named] ireland [)] , which covers five @ - [@] < un ##k > of the island [,] and northern ireland , which is part of the united kingdom , in the northeast of the island . in 2011 [the] population of ireland was about 6 @ . @ 4 million , ranking it the second @ - @ most populous [island] in europe after great britain . just under [4] [@] . [@] 6 million live in [the] [republic] of ireland and just over [1] @ . @ [8] million live in northern ireland . the island ' s geography comprises [relatively] low @ - @ lying mountains surrounding a central plain , with several navigable rivers extending inland . the island has lush vegetation ,

Summary

You can implement whatever LM strategy your heart desires for either causal or masked language modeling tasks in Blurr