This module contains what you need to use the fastai DataBlock API and/or mid-level data processing pipelines to organize your data for causal and masked language modeling tasks. Typical uses include training BERT from scratch or fine-tuning a pre-trained LM on your own corpus.
torch.cuda.set_device(1)
print(f'Using GPU #{torch.cuda.current_device()}: {torch.cuda.get_device_name()}')
Using GPU #1: GeForce GTX 1080 Ti
wiki_path = untar_data(URLs.WIKITEXT_TINY)
wiki_path.ls()
(#2) [Path('/home/wgilliam/.fastai/data/wikitext-2/train.csv'),Path('/home/wgilliam/.fastai/data/wikitext-2/test.csv')]
train_df = pd.read_csv(wiki_path/'train.csv', header=None)
valid_df = pd.read_csv(wiki_path/'test.csv', header=None)

print(len(train_df), len(valid_df))
train_df.head()
615 47
0
0 \n = 2013 – 14 York City F.C. season = \n \n The 2013 – 14 season was the <unk> season of competitive association football and 77th season in the Football League played by York City Football Club , a professional football club based in York , North Yorkshire , England . Their 17th @-@ place finish in 2012 – 13 meant it was their second consecutive season in League Two . The season ran from 1 July 2013 to 30 June 2014 . \n Nigel Worthington , starting his first full season as York manager , made eight permanent summer signings . By the turn of the year York were only above the relegation z...
1 \n = Big Boy ( song ) = \n \n " Big Boy " <unk> " I 'm A Big Boy Now " was the first single ever recorded by the Jackson 5 , which was released by Steeltown Records in January 1968 . The group played instruments on many of their Steeltown compositions , including " Big Boy " . The song was neither a critical nor commercial success , but the Jackson family were delighted with the outcome nonetheless . \n The Jackson 5 would release a second single with Steeltown Records before moving to Motown Records . The group 's recordings at Steeltown Records were thought to be lost , but they were re...
2 \n = The Remix ( Lady Gaga album ) = \n \n The Remix is a remix album by American recording artist Lady Gaga . Released in Japan on March 3 , 2010 , it contains remixes of the songs from her first studio album , The Fame ( 2008 ) , and her third extended play , The Fame Monster ( 2009 ) . A revised version of the track list was prepared for release in additional markets , beginning with Mexico on May 3 , 2010 . A number of recording artists have produced the songs , including Pet Shop Boys , Passion Pit and The Sound of Arrows . The remixed versions feature both uptempo and <unk> composit...
3 \n = New Year 's Eve ( Up All Night ) = \n \n " New Year 's Eve " is the twelfth episode of the first season of the American comedy television series Up All Night . The episode originally aired on NBC in the United States on January 12 , 2012 . It was written by Erica <unk> and was directed by Beth McCarthy @-@ Miller . The episode also featured a guest appearance from Jason Lee as Chris and Reagan 's neighbor and Ava 's boyfriend , Kevin . \n During Reagan ( Christina Applegate ) and Chris 's ( Will <unk> ) first New Year 's Eve game night , Reagan 's competitiveness comes out causing Ch...
4 \n = Geopyxis carbonaria = \n \n Geopyxis carbonaria is a species of fungus in the genus Geopyxis , family <unk> . First described to science in 1805 , and given its current name in 1889 , the species is commonly known as the charcoal loving elf @-@ cup , dwarf <unk> cup , <unk> <unk> cup , or pixie cup . The small , <unk> @-@ shaped fruitbodies of the fungus are reddish @-@ brown with a whitish fringe and measure up to 2 cm ( 0 @.@ 8 in ) across . They have a short , tapered stalk . Fruitbodies are commonly found on soil where brush has recently been burned , sometimes in great numbers ....
train_df['is_valid'] = False
valid_df['is_valid'] = True

df = pd.concat([train_df, valid_df])
df.head()
0 is_valid
0 \n = 2013 – 14 York City F.C. season = \n \n The 2013 – 14 season was the <unk> season of competitive association football and 77th season in the Football League played by York City Football Club , a professional football club based in York , North Yorkshire , England . Their 17th @-@ place finish in 2012 – 13 meant it was their second consecutive season in League Two . The season ran from 1 July 2013 to 30 June 2014 . \n Nigel Worthington , starting his first full season as York manager , made eight permanent summer signings . By the turn of the year York were only above the relegation z... False
1 \n = Big Boy ( song ) = \n \n " Big Boy " <unk> " I 'm A Big Boy Now " was the first single ever recorded by the Jackson 5 , which was released by Steeltown Records in January 1968 . The group played instruments on many of their Steeltown compositions , including " Big Boy " . The song was neither a critical nor commercial success , but the Jackson family were delighted with the outcome nonetheless . \n The Jackson 5 would release a second single with Steeltown Records before moving to Motown Records . The group 's recordings at Steeltown Records were thought to be lost , but they were re... False
2 \n = The Remix ( Lady Gaga album ) = \n \n The Remix is a remix album by American recording artist Lady Gaga . Released in Japan on March 3 , 2010 , it contains remixes of the songs from her first studio album , The Fame ( 2008 ) , and her third extended play , The Fame Monster ( 2009 ) . A revised version of the track list was prepared for release in additional markets , beginning with Mexico on May 3 , 2010 . A number of recording artists have produced the songs , including Pet Shop Boys , Passion Pit and The Sound of Arrows . The remixed versions feature both uptempo and <unk> composit... False
3 \n = New Year 's Eve ( Up All Night ) = \n \n " New Year 's Eve " is the twelfth episode of the first season of the American comedy television series Up All Night . The episode originally aired on NBC in the United States on January 12 , 2012 . It was written by Erica <unk> and was directed by Beth McCarthy @-@ Miller . The episode also featured a guest appearance from Jason Lee as Chris and Reagan 's neighbor and Ava 's boyfriend , Kevin . \n During Reagan ( Christina Applegate ) and Chris 's ( Will <unk> ) first New Year 's Eve game night , Reagan 's competitiveness comes out causing Ch... False
4 \n = Geopyxis carbonaria = \n \n Geopyxis carbonaria is a species of fungus in the genus Geopyxis , family <unk> . First described to science in 1805 , and given its current name in 1889 , the species is commonly known as the charcoal loving elf @-@ cup , dwarf <unk> cup , <unk> <unk> cup , or pixie cup . The small , <unk> @-@ shaped fruitbodies of the fungus are reddish @-@ brown with a whitish fringe and measure up to 2 cm ( 0 @.@ 8 in ) across . They have a short , tapered stalk . Fruitbodies are commonly found on soil where brush has recently been burned , sometimes in great numbers .... False
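The `is_valid` column is what `ColSplitter(col='is_valid')` reads when the DataBlock is built. As a rough illustration only, here is a pure-Python sketch that assumes the splitter partitions row positions by a boolean column. `split_by_flag` is a hypothetical helper written for this example, not part of fastai.

```python
from typing import List, Tuple

def split_by_flag(flags: List[bool]) -> Tuple[List[int], List[int]]:
    """Partition row positions into (train, valid) using a boolean column."""
    train_idxs = [i for i, is_valid in enumerate(flags) if not is_valid]
    valid_idxs = [i for i, is_valid in enumerate(flags) if is_valid]
    return train_idxs, valid_idxs

# Same sizes as the combined DataFrame above: 615 training rows, then 47 validation rows.
flags = [False] * 615 + [True] * 47
train_idxs, valid_idxs = split_by_flag(flags)
print(len(train_idxs), len(valid_idxs))  # → 615 47
```

Because the two frames are concatenated in order, the validation rows occupy the last 47 positions.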

Abstract LMStrategy Class & LM Before Batch Transform

class LMStrategy[source]

LMStrategy(hf_tokenizer, ignore_token_id=-100) :: ABC

Helper class that provides a standard way to create an ABC using inheritance.
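To illustrate the pattern, here is a minimal sketch of a strategy base class built on `ABC`. Only the constructor arguments mirror the signature above. The hook name `build_inputs_targets` and both toy classes are assumptions made for this example and may not match the library's actual method names.

```python
from abc import ABC, abstractmethod

class ToyLMStrategy(ABC):
    """Hypothetical stand-in for LMStrategy; only the constructor mirrors the signature above."""
    def __init__(self, hf_tokenizer, ignore_token_id=-100):
        self.hf_tokenizer = hf_tokenizer
        self.ignore_token_id = ignore_token_id

    @abstractmethod
    def build_inputs_targets(self, samples):
        """Prepare inputs and targets for one LM objective (assumed hook name)."""

class ToyIdentityStrategy(ToyLMStrategy):
    def build_inputs_targets(self, samples):
        # Toy behaviour: the target is simply a copy of the input ids.
        return [(ids, list(ids)) for ids in samples]

strategy = ToyIdentityStrategy(hf_tokenizer=None)
pairs = strategy.build_inputs_targets([[5, 6, 7]])
print(pairs)  # → [([5, 6, 7], [5, 6, 7])]
```

The point of the abstract base is that a single batch transform can accept any concrete strategy class and stay unaware of the objective-specific details.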

class HF_LMBeforeBatchTransform[source]

HF_LMBeforeBatchTransform(hf_arch, hf_config, hf_tokenizer, hf_model, lm_strategy_cls:LMStrategy, max_length=None, padding=True, truncation=True, is_split_into_words=False, ignore_token_id=-100, tok_kwargs={}, text_gen_kwargs={}, **kwargs) :: HF_BeforeBatchTransform

Handles everything you need to assemble a mini-batch of inputs and targets, as well as decode the dictionary produced as a byproduct of the tokenization process in the encodes method.

Causal LM

model_cls = AutoModelForCausalLM

pretrained_model_name = "gpt2"
hf_arch, hf_config, hf_tokenizer, hf_model = BLURR.get_hf_objects(pretrained_model_name, model_cls=model_cls)

# some tokenizers like gpt and gpt2 do not have a pad token, so we add it here mainly for the purpose
# of setting the "labels" key appropriately (see below)
if (hf_tokenizer.pad_token is None): hf_tokenizer.pad_token = '[PAD]'

hf_tokenizer.pad_token, hf_tokenizer.pad_token_id
Using pad_token, but it is not set yet.
('[PAD]', 50256)

class HF_CausalLMInput[source]

HF_CausalLMInput(x, **kwargs) :: HF_BaseInput

A Tensor which support subclass pickling, and maintains metadata when casting or after methods

class CausalLMStrategy[source]

CausalLMStrategy(hf_tokenizer, ignore_token_id=-100) :: LMStrategy

Helper class that provides a standard way to create an ABC using inheritance.

Our CausalLMStrategy, passed to HF_LMBeforeBatchTransform as lm_strategy_cls, allows us to update the input's labels and our targets appropriately given a causal LM task.

The labels argument lets you forgo calculating the loss yourself: when it is supplied, huggingface returns the loss for you should you choose to use it. Padding tokens are set to -100 by default (e.g., CrossEntropyLossFlat().ignore_index), which prevents cross entropy loss from considering predictions for tokens it should ignore, i.e., the padding tokens. For more information on the meaning of this argument, see the huggingface glossary entry for "Labels".
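The effect of the ignore id can be sketched in plain Python. This is an illustration under stated assumptions, not the code used by CausalLMStrategy: `make_causal_labels` and `mean_nll` are hypothetical helpers, the token ids are made up, and padding is located through the attention mask because the pad id shown above (50256) is also GPT-2's end-of-text id.

```python
import math
from typing import List

IGNORE_TOKEN_ID = -100  # same default as ignore_token_id above

def make_causal_labels(input_ids: List[int], attention_mask: List[int],
                       ignore_token_id: int = IGNORE_TOKEN_ID) -> List[int]:
    """Copy input_ids into labels, replacing padded positions with the ignore id."""
    return [tok if keep == 1 else ignore_token_id
            for tok, keep in zip(input_ids, attention_mask)]

def mean_nll(token_probs: List[float], labels: List[int],
             ignore_token_id: int = IGNORE_TOKEN_ID) -> float:
    """Average negative log-likelihood over positions whose label is not ignored.

    token_probs[i] is the probability the model assigned to the correct token at position i.
    """
    kept = [-math.log(p) for p, lbl in zip(token_probs, labels) if lbl != ignore_token_id]
    return sum(kept) / len(kept)

# Hypothetical 5-token sequence padded to length 7 (the last two positions are padding).
input_ids      = [464, 2068, 7586, 21831, 13, 50256, 50256]
attention_mask = [1, 1, 1, 1, 1, 0, 0]
labels = make_causal_labels(input_ids, attention_mask)
print(labels)  # → [464, 2068, 7586, 21831, 13, -100, -100]

# The padded positions have tiny probabilities, but they do not affect the loss.
probs = [0.5, 0.5, 0.5, 0.5, 0.5, 0.001, 0.001]
print(round(mean_nll(probs, labels), 4))  # → 0.6931
```

Hugging Face causal LM heads such as GPT-2's shift the labels internally, which is why the labels can be a same-length copy of the input ids rather than a manually shifted sequence.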

before_batch_tfm = HF_LMBeforeBatchTransform(hf_arch, hf_config, hf_tokenizer, hf_model, 
                                             lm_strategy_cls=CausalLMStrategy)

blocks = (HF_TextBlock(before_batch_tfm=before_batch_tfm, input_return_type=HF_CausalLMInput), noop)

dblock = DataBlock(blocks=blocks, get_x=ColReader(0), splitter=ColSplitter(col='is_valid'))
dls = dblock.dataloaders(df, bs=4)
b = dls.one_batch()
b[0]['input_ids'].shape, b[0]['labels'].shape, b[1].shape
(torch.Size([4, 1024]), torch.Size([4, 1024]), torch.Size([4, 1024]))
explode_types(b)
{tuple: [dict, torch.Tensor]}
dls.show_batch(dataloaders=dls, max_n=2, trunc_at=500)
text target
0 \n = Bob Dylan = \n \n Bob Dylan ( / <unk> / ; born Robert Allen Zimmerman, May 24, 1941 ) is an American singer @-@ songwriter, artist and writer. He has been influential in popular music and culture for more than five decades. Much of his most celebrated work dates from the 1960s when his songs chronicled social unrest, although Dylan repudiated suggestions from journalists that he was a spokesman for his generation. Nevertheless, early songs such as " Blowin'in the Wind " and " The Times They A \n = Bob Dylan = \n \n Bob Dylan ( / <unk> / ; born Robert Allen Zimmerman, May 24, 1941 ) is an American singer @-@ songwriter, artist and writer. He has been influential in popular music and culture for more than five decades. Much of his most celebrated work dates from the 1960s when his songs chronicled social unrest, although Dylan repudiated suggestions from journalists that he was a spokesman for his generation. Nevertheless, early songs such as " Blowin'in the Wind " and " The Times They A
1 \n = Ireland = \n \n Ireland ( / <unk> / ; Irish : <unk> [ <unk> ] ; Ulster @-@ Scots : <unk> [ <unk> ] ) is an island in the North Atlantic. It is separated from Great Britain to its east by the North Channel, the Irish Sea, and St George's Channel. Ireland is the second @-@ largest island of the British Isles, the third @-@ largest in Europe, and the twentieth @-@ largest on Earth. \n <unk>, Ireland is divided between the Republic of Ireland ( officially named Ireland ), which covers five @-@ <un \n = Ireland = \n \n Ireland ( / <unk> / ; Irish : <unk> [ <unk> ] ; Ulster @-@ Scots : <unk> [ <unk> ] ) is an island in the North Atlantic. It is separated from Great Britain to its east by the North Channel, the Irish Sea, and St George's Channel. Ireland is the second @-@ largest island of the British Isles, the third @-@ largest in Europe, and the twentieth @-@ largest on Earth. \n <unk>, Ireland is divided between the Republic of Ireland ( officially named Ireland ), which covers five @-@ <un

Masked LM

model_cls = AutoModelForMaskedLM

pretrained_model_name = "bert-base-uncased"
hf_arch, hf_config, hf_tokenizer, hf_model = BLURR.get_hf_objects(pretrained_model_name, model_cls=model_cls)

# some tokenizers like gpt and gpt2 do not have a pad token, so we add one if needed, mainly for the purpose
# of setting the "labels" key appropriately (BERT already defines '[PAD]', so this check is a no-op here)
if (hf_tokenizer.pad_token is None): hf_tokenizer.pad_token = '[PAD]'

hf_tokenizer.pad_token, hf_tokenizer.pad_token_id
('[PAD]', 0)

We need a new input type for MLM tasks, particularly so that show_batch and show_results can display the masked inputs alongside their targets.

The BERT-style masking strategy

class HF_MLMInput[source]

HF_MLMInput(x, **kwargs) :: HF_BaseInput

A Tensor which support subclass pickling, and maintains metadata when casting or after methods

class BertMLMStrategy[source]

BertMLMStrategy(hf_tokenizer, ignore_token_id=-100) :: LMStrategy

Helper class that provides a standard way to create an ABC using inheritance.
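As background, BERT-style masking is usually described as selecting about 15% of the tokens and then replacing 80% of those with [MASK], 10% with a random token, and leaving 10% unchanged, with labels set to the ignore id everywhere else. The sketch below follows that recipe in plain Python. It is an assumption-laden illustration rather than the BertMLMStrategy source: `bert_style_mask`, its parameters, and the toy token ids are all hypothetical.

```python
import random
from typing import List, Tuple

def bert_style_mask(input_ids: List[int], mask_token_id: int, vocab_size: int,
                    special_ids: frozenset = frozenset(), mlm_prob: float = 0.15,
                    ignore_token_id: int = -100, seed: int = 0) -> Tuple[List[int], List[int]]:
    """Return (masked_input_ids, labels) following the 80/10/10 recipe."""
    rng = random.Random(seed)
    masked, labels = list(input_ids), [ignore_token_id] * len(input_ids)
    for i, tok in enumerate(input_ids):
        if tok in special_ids or rng.random() >= mlm_prob:
            continue                      # not selected: label stays ignored
        labels[i] = tok                   # the model must predict the original token
        roll = rng.random()
        if roll < 0.8:
            masked[i] = mask_token_id     # 80%: replace with [MASK]
        elif roll < 0.9:
            masked[i] = rng.randrange(vocab_size)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return masked, labels

# Toy sequence: [CLS] (101), thirty made-up token ids, [SEP] (102).
ids = [101] + list(range(2000, 2030)) + [102]
masked, labels = bert_style_mask(ids, mask_token_id=103, vocab_size=30522,
                                 special_ids=frozenset({0, 101, 102}))
selected = [i for i, lbl in enumerate(labels) if lbl != -100]
print(selected)
```

Special tokens are excluded from selection so that [CLS], [SEP], and padding are never corrupted or scored.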

before_batch_tfm = HF_LMBeforeBatchTransform(hf_arch, hf_config, hf_tokenizer, hf_model,
                                             lm_strategy_cls=BertMLMStrategy)

blocks = (HF_TextBlock(before_batch_tfm=before_batch_tfm, input_return_type=HF_MLMInput), noop)

dblock = DataBlock(blocks=blocks, get_x=ColReader(0), splitter=ColSplitter(col='is_valid'))
dls = dblock.dataloaders(df, bs=4)
b = dls.one_batch()
b[0]['input_ids'].shape, b[0]['labels'].shape, b[1].shape
(torch.Size([4, 512]), torch.Size([4, 512]), torch.Size([4, 512]))
b[0]['input_ids'][0][:20], b[0]['labels'][0][:20], b[1][0][:20]
(tensor([  101,  1027,  3960,  7758,  1027,  3960,  7758, 24836,  1013, 13538,
           103,   103,  1028,  1013,  1025,   103, 24498,  5297, 27946,  1010],
        device='cuda:1'),
 tensor([-100, -100, -100, -100, -100, -100, -100, 1006, -100, 1026, 4895, 2243,
         -100, -100, -100, 2141, 2728, -100, -100, -100], device='cuda:1'),
 tensor([-100, -100, -100, -100, -100, -100, -100, 1006, -100, 1026, 4895, 2243,
         -100, -100, -100, 2141, 2728, -100, -100, -100], device='cuda:1'))
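One way to read this output: every position whose label is not -100 was selected for prediction, and comparing it with the input id shows how that position was corrupted. The sketch below uses the first 20 values printed above and assumes only that 103 is the [MASK] id in the bert-base-uncased vocabulary.

```python
MASK_TOKEN_ID = 103   # [MASK] in the bert-base-uncased vocabulary
IGNORE_TOKEN_ID = -100

# First 20 values of input_ids and labels, copied from the output above.
input_ids = [101, 1027, 3960, 7758, 1027, 3960, 7758, 24836, 1013, 13538,
             103, 103, 1028, 1013, 1025, 103, 24498, 5297, 27946, 1010]
labels = [-100, -100, -100, -100, -100, -100, -100, 1006, -100, 1026, 4895, 2243,
          -100, -100, -100, 2141, 2728, -100, -100, -100]

# Positions the model is asked to predict.
selected = [i for i, lbl in enumerate(labels) if lbl != IGNORE_TOKEN_ID]
# Of those, which were replaced by [MASK] and which by some other token.
masked = [i for i in selected if input_ids[i] == MASK_TOKEN_ID]
swapped = [i for i in selected if input_ids[i] not in (MASK_TOKEN_ID, labels[i])]
# Undo the corruption: take the label where one exists, otherwise the input id.
original = [lbl if lbl != IGNORE_TOKEN_ID else tok for tok, lbl in zip(input_ids, labels)]

print(selected)  # → [7, 9, 10, 11, 15, 16]
print(masked)    # → [10, 11, 15]
print(swapped)   # → [7, 9, 16]
```

In this slice, three selected positions carry [MASK] and three carry a different token than their label, while all unselected positions pass through unchanged.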
explode_types(b)
{tuple: [dict, torch.Tensor]}
dls.show_batch(dataloaders=dls, max_n=2, trunc_at=250)
text target
0 [MASK] bob dylan = bob dylan ( / < un ##k > / ; born robert allen zimmerman , may 24 , [MASK] ) is [low] american singer @ - @ songwriter , artist and writer . he has been [splash] [fin] popular music and culture for more than five decades . [MASK] of his most celebrated work dates [MASK] the 1960s [MASK] [MASK] songs chronicle [MASK] social [MASK] , although dylan rep ##udi [goodnight] suggestions from journalists that he was [MASK] spokesman for his generation . nevertheless , [MASK] songs such as " blow ##in ' in [MASK] wind " and [MASK] [MASK] [times] [MASK] are a @ - @ < un ##k > ' " became anthem ##s [revolt] the american civil rights and anti @ - @ war movements . after he left his initial base in the american folk music revival , his six @ - @ minute single [MASK] like a rolling stone " altered the range of popular music [MASK] 1965 . his mid @ - @ 1960s [MASK] , [backed] by [MASK] musicians , [MASK] the top end of the united states music charts while [MASK] attracting < [MASK] ##k > and criticism from others in the folk movement . dylan ' s lyrics have incorporated various [MASK] , [social] , philosophical , and literary [MASK] . [MASK] def ##ied existing pop music conventions and appealed to the bu ##rgeon ##ing [MASK] ##culture . [MASK] inspired by the performances of little [MASK] and [=] bob dylan = bob dylan ( / < un ##k > / ; born robert allen zimmerman , may 24 , [1941] ) is [an] american singer @ - @ songwriter , artist and writer . he has been [influential] [in] popular music and culture for more than five decades . [much] of his most celebrated work dates [from] the 1960s [when] [his] songs chronicle [##d] social [unrest] , although dylan rep ##udi [##ated] suggestions from journalists that he was [a] spokesman for his generation . nevertheless , [early] songs such as " blow ##in ' in [the] wind " and ["] [the] [times] [they] are a @ - @ < un ##k > ' " became anthem ##s [for] the american civil rights and anti @ - @ war movements . 
after he left his initial base in the american folk music revival , his six @ - @ minute single ["] like a rolling stone " altered the range of popular music [in] 1965 . his mid @ - @ 1960s [recordings] , [backed] by [rock] musicians , [reached] the top end of the united states music charts while [also] attracting < [un] ##k > and criticism from others in the folk movement . dylan ' s lyrics have incorporated various [political] , [social] , philosophical , and literary [influences] . [they] def ##ied existing pop music conventions and appealed to the bu ##rgeon ##ing [counter] ##culture . [initially] inspired by the performances of little [richard] and
1 = [MASK] olivier = laurence kerr olivier , baron olivier , < un ##k > [MASK] [MASK] < un ##k > < un ##k > < un ##k > / ; 22 may 1907 [MASK] 11 july 1989 ) was an english actor who , along with his [contemporaries] ralph [MASK] and john gi ##el ##gu ##d , dominated the british [MASK] of the mid @ [sounded] @ 20th century . he also worked in [MASK] throughout his career , playing more than fifty cinema [MASK] . late in his career , he had considerable success in television [MASK] . his family had [MASK] theatrical connections , but olivier ' s [MASK] , a clergyman , decided that his son should become [an] actor [MASK] after attending a [MASK] school in london , [MASK] learned his craft in a succession of acting jobs during the late 1920s [MASK] in 1930 he [MASK] his first important [MASK] end success in noel coward ' s private lives , and he appeared in his [MASK] [MASK] . in [alonso] he played in a celebrated production of romeo and juliet [alongside] gi ##el ##gu ##d and [MASK] ash [MASK] , and by the end of the [MASK] he was an established star . in the 1940s , together with richardson and john burr ##ell , olivier was the co @ - @ [MASK] [##moor] the old [vic] , building it into a highly respected company . there his most celebrated roles [mathias] shakespeare ' = [laurence] olivier = laurence kerr olivier , baron olivier , < un ##k > [(] [/] < un ##k > < un ##k > < un ##k > / ; 22 may 1907 [–] 11 july 1989 ) was an english actor who , along with his [contemporaries] ralph [richardson] and john gi ##el ##gu ##d , dominated the british [stage] of the mid @ [-] @ 20th century . he also worked in [films] throughout his career , playing more than fifty cinema [roles] . late in his career , he had considerable success in television [roles] . his family had [no] theatrical connections , but olivier ' s [father] , a clergyman , decided that his son should become [an] actor [.] 
after attending a [drama] school in london , [olivier] learned his craft in a succession of acting jobs during the late 1920s [.] in 1930 he [had] his first important [west] end success in noel coward ' s private lives , and he appeared in his [first] [film] . in [1935] he played in a celebrated production of romeo and juliet [alongside] gi ##el ##gu ##d and [peggy] ash [##croft] , and by the end of the [decade] he was an established star . in the 1940s , together with richardson and john burr ##ell , olivier was the co @ - @ [director] [of] the old [vic] , building it into a highly respected company . there his most celebrated roles [included] shakespeare '

Cleanup