wiki_path = untar_data(URLs.WIKITEXT_TINY)
wiki_path.ls()
(#2) [Path('/home/wgilliam/.fastai/data/wikitext-2/train.csv'),Path('/home/wgilliam/.fastai/data/wikitext-2/test.csv')]
The text.data.language_modeling module contains the bits required to use the fastai DataBlock API and/or mid-level data processing pipelines to organize your data for causal and masked language modeling tasks. This includes things like training BERT from scratch or fine-tuning a particular pre-trained LM on your own corpus.
For this example, we’ll use the WIKITEXT_TINY dataset available from fastai. Besides the Datasets library from Hugging Face, fastai provides many smaller datasets that are really useful when experimenting and/or in the early development of your training, validation, and inference code.
train_df = pd.read_csv(wiki_path / "train.csv", header=None)
valid_df = pd.read_csv(wiki_path / "test.csv", header=None)
print(len(train_df), len(valid_df))
train_df.head()
615 47
| | 0 |
---|---|
0 | \n = 2013 – 14 York City F.C. season = \n \n The 2013 – 14 season was the <unk> season of competitive association football and 77th season in the Football League played by York City Football Club , a professional football club based in York , North Yorkshire , England . Their 17th @-@ place finish in 2012 – 13 meant it was their second consecutive season in League Two . The season ran from 1 July 2013 to 30 June 2014 . \n Nigel Worthington , starting his first full season as York manager , made eight permanent summer signings . By the turn of the year York were only above the relegation z... |
1 | \n = Big Boy ( song ) = \n \n " Big Boy " <unk> " I 'm A Big Boy Now " was the first single ever recorded by the Jackson 5 , which was released by Steeltown Records in January 1968 . The group played instruments on many of their Steeltown compositions , including " Big Boy " . The song was neither a critical nor commercial success , but the Jackson family were delighted with the outcome nonetheless . \n The Jackson 5 would release a second single with Steeltown Records before moving to Motown Records . The group 's recordings at Steeltown Records were thought to be lost , but they were re... |
2 | \n = The Remix ( Lady Gaga album ) = \n \n The Remix is a remix album by American recording artist Lady Gaga . Released in Japan on March 3 , 2010 , it contains remixes of the songs from her first studio album , The Fame ( 2008 ) , and her third extended play , The Fame Monster ( 2009 ) . A revised version of the track list was prepared for release in additional markets , beginning with Mexico on May 3 , 2010 . A number of recording artists have produced the songs , including Pet Shop Boys , Passion Pit and The Sound of Arrows . The remixed versions feature both uptempo and <unk> composit... |
3 | \n = New Year 's Eve ( Up All Night ) = \n \n " New Year 's Eve " is the twelfth episode of the first season of the American comedy television series Up All Night . The episode originally aired on NBC in the United States on January 12 , 2012 . It was written by Erica <unk> and was directed by Beth McCarthy @-@ Miller . The episode also featured a guest appearance from Jason Lee as Chris and Reagan 's neighbor and Ava 's boyfriend , Kevin . \n During Reagan ( Christina Applegate ) and Chris 's ( Will <unk> ) first New Year 's Eve game night , Reagan 's competitiveness comes out causing Ch... |
4 | \n = Geopyxis carbonaria = \n \n Geopyxis carbonaria is a species of fungus in the genus Geopyxis , family <unk> . First described to science in 1805 , and given its current name in 1889 , the species is commonly known as the charcoal loving elf @-@ cup , dwarf <unk> cup , <unk> <unk> cup , or pixie cup . The small , <unk> @-@ shaped fruitbodies of the fungus are reddish @-@ brown with a whitish fringe and measure up to 2 cm ( 0 @.@ 8 in ) across . They have a short , tapered stalk . Fruitbodies are commonly found on soil where brush has recently been burned , sometimes in great numbers .... |
train_df["is_valid"] = False
valid_df["is_valid"] = True
df = pd.concat([train_df, valid_df])
df.head()
| | 0 | is_valid |
---|---|---|
0 | \n = 2013 – 14 York City F.C. season = \n \n The 2013 – 14 season was the <unk> season of competitive association football and 77th season in the Football League played by York City Football Club , a professional football club based in York , North Yorkshire , England . Their 17th @-@ place finish in 2012 – 13 meant it was their second consecutive season in League Two . The season ran from 1 July 2013 to 30 June 2014 . \n Nigel Worthington , starting his first full season as York manager , made eight permanent summer signings . By the turn of the year York were only above the relegation z... | False |
1 | \n = Big Boy ( song ) = \n \n " Big Boy " <unk> " I 'm A Big Boy Now " was the first single ever recorded by the Jackson 5 , which was released by Steeltown Records in January 1968 . The group played instruments on many of their Steeltown compositions , including " Big Boy " . The song was neither a critical nor commercial success , but the Jackson family were delighted with the outcome nonetheless . \n The Jackson 5 would release a second single with Steeltown Records before moving to Motown Records . The group 's recordings at Steeltown Records were thought to be lost , but they were re... | False |
2 | \n = The Remix ( Lady Gaga album ) = \n \n The Remix is a remix album by American recording artist Lady Gaga . Released in Japan on March 3 , 2010 , it contains remixes of the songs from her first studio album , The Fame ( 2008 ) , and her third extended play , The Fame Monster ( 2009 ) . A revised version of the track list was prepared for release in additional markets , beginning with Mexico on May 3 , 2010 . A number of recording artists have produced the songs , including Pet Shop Boys , Passion Pit and The Sound of Arrows . The remixed versions feature both uptempo and <unk> composit... | False |
3 | \n = New Year 's Eve ( Up All Night ) = \n \n " New Year 's Eve " is the twelfth episode of the first season of the American comedy television series Up All Night . The episode originally aired on NBC in the United States on January 12 , 2012 . It was written by Erica <unk> and was directed by Beth McCarthy @-@ Miller . The episode also featured a guest appearance from Jason Lee as Chris and Reagan 's neighbor and Ava 's boyfriend , Kevin . \n During Reagan ( Christina Applegate ) and Chris 's ( Will <unk> ) first New Year 's Eve game night , Reagan 's competitiveness comes out causing Ch... | False |
4 | \n = Geopyxis carbonaria = \n \n Geopyxis carbonaria is a species of fungus in the genus Geopyxis , family <unk> . First described to science in 1805 , and given its current name in 1889 , the species is commonly known as the charcoal loving elf @-@ cup , dwarf <unk> cup , <unk> <unk> cup , or pixie cup . The small , <unk> @-@ shaped fruitbodies of the fungus are reddish @-@ brown with a whitish fringe and measure up to 2 cm ( 0 @.@ 8 in ) across . They have a short , tapered stalk . Fruitbodies are commonly found on soil where brush has recently been burned , sometimes in great numbers .... | False |
model_cls = AutoModelForCausalLM
hf_logging.set_verbosity_error()
pretrained_model_name = "gpt2"
hf_arch, hf_config, hf_tokenizer, hf_model = get_hf_objects(pretrained_model_name, model_cls=model_cls)
# some tokenizers like gpt and gpt2 do not have a pad token, so we add it here mainly for the purpose
# of setting the "labels" key appropriately (see below)
if hf_tokenizer.pad_token is None:
    hf_tokenizer.pad_token = "[PAD]"
hf_tokenizer.pad_token, hf_tokenizer.pad_token_id
Using pad_token, but it is not set yet.
('[PAD]', 50256)
Starting with version 2.0, BLURR provides a language preprocessing class that can be used to preprocess DataFrames or Hugging Face Datasets for both causal and masked language modeling tasks.
LMPreprocessor (hf_tokenizer:transformers.tokenization_utils_base.PreTrainedTokenizerBase, batch_size:int=1000, chunk_size:Optional[int]=None, sep_token:Optional[str]=None, text_attr:str='text', is_valid_attr:Optional[str]='is_valid', tok_kwargs:dict={})
Initialize self. See help(type(self)) for accurate signature.
| | Type | Default | Details |
---|---|---|---|
hf_tokenizer | PreTrainedTokenizerBase | | A Hugging Face tokenizer |
batch_size | int | 1000 | The number of examples to process at a time |
chunk_size | Optional | None | How big each chunk of text should be (default: hf_tokenizer.model_max_length) |
sep_token | Optional | None | How to indicate the beginning of a new text example (default: hf_tokenizer.eos_token or hf_tokenizer.sep_token) |
text_attr | str | text | The attribute holding the text |
is_valid_attr | Optional | is_valid | The attribute that should be created if you are processing individual training and validation datasets into a single dataset; it indicates which of the two each example belongs to |
tok_kwargs | dict | {} | Tokenization kwargs that will be applied when calling the tokenizer |
DataFrame
preprocessor = LMPreprocessor(hf_tokenizer, chunk_size=128, text_attr=0)
proc_df = preprocessor.process_df(train_df, valid_df)
print(len(proc_df))
proc_df.head(2)
21330
| | proc_0 | is_valid |
---|---|---|
0 | \n = 2013 – 14 York City F.C. season = \n \n The 2013 – 14 season was the <unk> season of competitive association football and 77th season in the Football League played by York City Football Club , a professional football club based in York , North Yorkshire , England . Their 17th @-@ place finish in 2012 – 13 meant it was their second consecutive season in League Two . The season ran from 1 July 2013 to 30 June 2014 . \n Nigel Worthington , starting his first full season as York manager , made eight permanent summer signings . By the turn of the year York were only | False |
1 | above the relegation zone on goal difference , before a 17 @-@ match unbeaten run saw the team finish in seventh @-@ place in the 24 @-@ team 2013 – 14 Football League Two . This meant York qualified for the play @-@ offs , and they were eliminated in the semi @-@ final by Fleetwood Town . York were knocked out of the 2013 – 14 FA Cup , Football League Cup and Football League Trophy in their opening round matches . \n 35 players made at least one appearance in nationally organised first @-@ team competition , and there were 12 different <unk> . Defender Ben Davies missed | False |
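The effect of chunk_size can be pictured with a minimal sketch. This is not BLURR's implementation: it assumes whitespace splitting in place of a real tokenizer, and the chunk_tokens function and the "<sep>" marker are hypothetical names used only for illustration. It joins the examples into one token stream, separated by a marker, and cuts that stream into fixed-size chunks, which is roughly why a few hundred articles can turn into thousands of rows.

```python
from typing import Iterable, List


def chunk_tokens(texts: Iterable[str], chunk_size: int, sep_token: str = "<sep>") -> List[List[str]]:
    """Join examples into one token stream and cut it into fixed-size chunks."""
    stream: List[str] = []
    for text in texts:
        # Whitespace splitting stands in for a real subword tokenizer (assumption).
        stream.extend(text.split())
        # Mark the boundary between one example and the next.
        stream.append(sep_token)
    # The final chunk may be shorter than chunk_size.
    return [stream[i : i + chunk_size] for i in range(0, len(stream), chunk_size)]


docs = ["the cat sat on the mat", "dogs bark"]
chunks = chunk_tokens(docs, chunk_size=4)
for chunk in chunks:
    print(" ".join(chunk))
```

Note that a chunk can start in the middle of one example and run into the next, much like the second processed row above continues the sentence the first row ends on.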
Dataset
LMType (value, names=None, module=None, qualname=None, type=None, start=1)
Use this enum to indicate what kind of language model you are training
BaseLMStrategy (hf_tokenizer, ignore_token_id=-100)
ABC for various language modeling strategies (e.g., causal, BertMLM, WholeWordMLM, etc…)
Here we include a BaseLMStrategy abstract class and several different strategies for building your inputs and targets for causal and masked language modeling tasks. With CLMs, the objective is simply to predict the next token, but with MLMs, a variety of masking strategies may be used (e.g., mask random tokens, mask random words, mask spans, etc.). A BertMLMStrategy is introduced below that follows the “mask random tokens” strategy used in the BERT paper, but users can create their own BaseLMStrategy subclass to support any masking strategy they desire.
CausalLMStrategy (hf_tokenizer, ignore_token_id=-100)
For next-token prediction language modeling tasks, we want to use the CausalLMStrategy, which makes the necessary changes to your inputs/targets for causal LMs.
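As a rough illustration of the next-token objective, here is a minimal sketch in plain Python. It is not the CausalLMStrategy code: the causal_lm_targets function is a hypothetical helper, and it assumes the padding ID does not collide with a real token ID. Each target is the token that follows the corresponding input, and padded positions are replaced with the ignore ID so they do not count toward the loss.

```python
from typing import List, Tuple

IGNORE_TOKEN_ID = -100  # same value as the ignore_token_id default shown above


def causal_lm_targets(input_ids: List[int], pad_token_id: int) -> Tuple[List[int], List[int]]:
    """Return (inputs, targets) where targets[i] is the token that follows inputs[i]."""
    # Shift left by one; the last position has no next token to predict.
    shifted = input_ids[1:] + [pad_token_id]
    # Replace padding with the ignore ID so the loss skips those positions.
    targets = [IGNORE_TOKEN_ID if tok == pad_token_id else tok for tok in shifted]
    return input_ids, targets


inputs, targets = causal_lm_targets([11, 12, 13, 14, 0, 0], pad_token_id=0)
print(inputs)
print(targets)
```

Hugging Face causal LM heads perform this shift internally when they are given labels, so the sketch only illustrates the objective, not where the shift has to happen.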
BertMLMStrategy (hf_tokenizer, ignore_token_id=-100)
A masked language modeling strategy using the default BERT masking definition.
Follows the masking strategy used in the BERT paper for random token masking
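The random-token masking rule from the BERT paper can be sketched as follows. This is a minimal illustration, not the BertMLMStrategy source: the bert_mask function and its parameters are hypothetical names. About 15% of the non-special tokens are selected; of those, 80% are replaced by the mask token, 10% by a random token, and 10% are left unchanged. The labels hold the original token at selected positions and the ignore ID everywhere else.

```python
import random
from typing import List, Optional, Sequence, Tuple

IGNORE_TOKEN_ID = -100


def bert_mask(
    input_ids: Sequence[int],
    mask_token_id: int,
    vocab_size: int,
    special_ids: Sequence[int] = (),
    mlm_prob: float = 0.15,
    rng: Optional[random.Random] = None,
) -> Tuple[List[int], List[int]]:
    """Return (masked_ids, labels) following the 80/10/10 BERT masking rule."""
    rng = rng or random.Random()
    masked = list(input_ids)
    labels = [IGNORE_TOKEN_ID] * len(masked)
    for i, tok in enumerate(input_ids):
        # Never select special tokens; select the rest with probability mlm_prob.
        if tok in special_ids or rng.random() >= mlm_prob:
            continue
        labels[i] = tok  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            masked[i] = mask_token_id  # 80%: replace with the mask token
        elif roll < 0.9:
            masked[i] = rng.randrange(vocab_size)  # 10%: replace with a random token
        # remaining 10%: keep the original token
    return masked, labels


# IDs 101, 102, and 103 are [CLS], [SEP], and [MASK] in bert-base-uncased.
ids = [101, 2003, 2098, 2340, 2781, 2101, 2083, 102]
masked, labels = bert_mask(ids, mask_token_id=103, vocab_size=30522,
                           special_ids=(101, 102), rng=random.Random(0))
print(masked)
print(labels)
```

Passing a seeded random.Random keeps the example reproducible; in training you would normally want fresh masks on every pass over the data.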
MLMTextInput (x, **kwargs)
The base representation of your inputs; used by the various fastai show methods
CausalLMTextInput (x, **kwargs)
The base representation of your inputs; used by the various fastai show methods
Again, we define custom classes for the @typedispatched methods to use so that we can override how both causal and masked language modeling inputs/targets are assembled, as well as how the data is shown via methods like show_batch and show_results.
LMBatchTokenizeTransform (hf_arch:str, hf_config:transformers.configuration_utils.PretrainedConfig, hf_tokenizer:transformers.tokenization_utils_base.PreTrainedTokenizerBase, hf_model:transformers.modeling_utils.PreTrainedModel, include_labels:bool=True, ignore_token_id:int=-100, lm_strategy_cls:__main__.BaseLMStrategy=<class '__main__.CausalLMStrategy'>, max_length:int=None, padding:Union[bool,str]=True, truncation:Union[bool,str]=True, is_split_into_words:bool=False, tok_kwargs={}, text_gen_kwargs={}, **kwargs)
Handles everything you need to assemble a mini-batch of inputs and targets, as well as decode the dictionary produced as a byproduct of the tokenization process in the encodes method.
| | Type | Default | Details |
---|---|---|---|
hf_arch | str | | The abbreviation/name of your Hugging Face transformer architecture (e.g., bert, bart, etc.) |
hf_config | PretrainedConfig | | A specific configuration instance you want to use |
hf_tokenizer | PreTrainedTokenizerBase | | A Hugging Face tokenizer |
hf_model | PreTrainedModel | | A Hugging Face model |
include_labels | bool | True | To control whether the “labels” are included in your inputs. If they are, the loss will be calculated in the model’s forward function and you can simply use PreCalculatedLoss as your Learner’s loss function to use it |
ignore_token_id | int | -100 | The token ID that should be ignored when calculating the loss |
lm_strategy_cls | BaseLMStrategy | CausalLMStrategy | The language modeling strategy (or objective) |
max_length | int | None | To control the length of the padding/truncation. It can be an integer or None, in which case it will default to the maximum length the model can accept. If the model has no specific maximum input length, truncation/padding to max_length is deactivated. See [Everything you always wanted to know about padding and truncation](https://huggingface.co/transformers/preprocessing.html#everything-you-always-wanted-to-know-about-padding-and-truncation) |
padding | Union | True | To control the padding applied to your hf_tokenizer during tokenization. If None, will default to False or 'do_not_pad'. See [Everything you always wanted to know about padding and truncation](https://huggingface.co/transformers/preprocessing.html#everything-you-always-wanted-to-know-about-padding-and-truncation) |
truncation | Union | True | To control truncation applied to your hf_tokenizer during tokenization. If None, will default to False or do_not_truncate. See [Everything you always wanted to know about padding and truncation](https://huggingface.co/transformers/preprocessing.html#everything-you-always-wanted-to-know-about-padding-and-truncation) |
is_split_into_words | bool | False | The is_split_into_words argument applied to your hf_tokenizer during tokenization. Set this to True if your inputs are pre-tokenized (not numericalized) |
tok_kwargs | dict | {} | Any other keyword arguments you want included when using your hf_tokenizer to tokenize your inputs |
text_gen_kwargs | dict | {} | Any keyword arguments you want included when generating text. See How to generate text |
kwargs | | | |
Our LMBatchTokenizeTransform allows us to update the input’s labels and our targets appropriately given any language modeling task.
The labels argument allows you to forgo calculating the loss yourself by letting Hugging Face return it for you, should you choose to do that. Padding tokens are set to -100 by default (matching CrossEntropyLossFlat().ignore_index), which prevents the cross-entropy loss from considering predictions for tokens it should ignore, i.e., the padding tokens. For more information on the meaning of this argument, see the Hugging Face glossary entry for “Labels”.
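To make the role of the -100 value concrete, here is a minimal pure-Python sketch of a cross-entropy loss that skips ignored positions. It is an illustration only, not the fastai or PyTorch implementation, and the cross_entropy function name is hypothetical. Positions whose target equals the ignore index are dropped before averaging, so their logits have no influence on the result.

```python
import math
from typing import List, Sequence

IGNORE_INDEX = -100


def cross_entropy(logits: Sequence[Sequence[float]], targets: Sequence[int],
                  ignore_index: int = IGNORE_INDEX) -> float:
    """Mean negative log-likelihood over positions whose target is not ignore_index."""
    losses: List[float] = []
    for row, target in zip(logits, targets):
        if target == ignore_index:
            continue  # ignored position: contributes nothing to the loss
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(row)
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        losses.append(log_z - row[target])
    # This sketch returns 0.0 when every position is ignored (a simplification).
    return sum(losses) / len(losses) if losses else 0.0


logits = [[2.0, 0.0], [0.0, 2.0], [5.0, -5.0]]
targets = [0, 1, IGNORE_INDEX]  # the last position is padding
loss = cross_entropy(logits, targets)
print(round(loss, 4))
```

Changing the logits at the ignored position leaves the loss unchanged, which is the behavior the labels need for padded batches.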
model_cls = AutoModelForCausalLM
hf_logging.set_verbosity_error()
pretrained_model_name = "gpt2"
hf_arch, hf_config, hf_tokenizer, hf_model = get_hf_objects(pretrained_model_name, model_cls=model_cls)
# some tokenizers like gpt and gpt2 do not have a pad token, so we add it here mainly for the purpose
# of setting the "labels" key appropriately (see below)
if hf_tokenizer.pad_token is None:
    hf_tokenizer.pad_token = "[PAD]"
Using pad_token, but it is not set yet.
preprocessor = LMPreprocessor(hf_tokenizer, chunk_size=128, text_attr=0)
proc_df = preprocessor.process_df(train_df, valid_df)
print(len(proc_df))
proc_df.head(2)
21330
| | proc_0 | is_valid |
---|---|---|
0 | \n = 2013 – 14 York City F.C. season = \n \n The 2013 – 14 season was the <unk> season of competitive association football and 77th season in the Football League played by York City Football Club , a professional football club based in York , North Yorkshire , England . Their 17th @-@ place finish in 2012 – 13 meant it was their second consecutive season in League Two . The season ran from 1 July 2013 to 30 June 2014 . \n Nigel Worthington , starting his first full season as York manager , made eight permanent summer signings . By the turn of the year York were only | False |
1 | above the relegation zone on goal difference , before a 17 @-@ match unbeaten run saw the team finish in seventh @-@ place in the 24 @-@ team 2013 – 14 Football League Two . This meant York qualified for the play @-@ offs , and they were eliminated in the semi @-@ final by Fleetwood Town . York were knocked out of the 2013 – 14 FA Cup , Football League Cup and Football League Trophy in their opening round matches . \n 35 players made at least one appearance in nationally organised first @-@ team competition , and there were 12 different <unk> . Defender Ben Davies missed | False |
DataBlock
batch_tok_tfm = LMBatchTokenizeTransform(hf_arch, hf_config, hf_tokenizer, hf_model, lm_strategy_cls=CausalLMStrategy)
blocks = (TextBlock(batch_tokenize_tfm=batch_tok_tfm, input_return_type=CausalLMTextInput), noop)
dblock = DataBlock(blocks=blocks, get_x=ColReader("proc_0"), splitter=ColSplitter(col="is_valid"))
DataLoaders
(torch.Size([4, 129]), torch.Size([4, 129]), torch.Size([4, 129]))
| | text | target |
---|---|---|
0 | ₹ 40 million ( US $ 590 @,@ 000 ) was spent solely on VFX for Magadheera. \n \n = = = <unk> = = = \n \n During the film's shoot at Ramoji Film City in late November 2008, a 500 square feet ( 46 m2 ) film can, containing two or three scenes, was discovered missing from Rainbow lab. The filmmakers filed a case at <unk> police station. Security personnel and film unit members searched, but failed to recover the reels. Rajamouli's unit said it was not important if the scenes from | �� 40 million ( US $ 590 @,@ 000 ) was spent solely on VFX for Magadheera. \n \n = = = <unk> = = = \n \n During the film's shoot at Ramoji Film City in late November 2008, a 500 square feet ( 46 m2 ) film can, containing two or three scenes, was discovered missing from Rainbow lab. The filmmakers filed a case at <unk> police station. Security personnel and film unit members searched, but failed to recover the reels. Rajamouli's unit said it was not important if the scenes from |
1 | ederation. Described as " the most organized of the Northern Arabian tribes ", at the peak of its power in the 6th century BCE it controlled a large region between the Persian Gulf and the Sinai Peninsula. \n Biblical tradition holds that the Qedarites are named for Qedar, the second son of Ishmael, mentioned in the Bible's books of Genesis ( 25 : 13 ) and 1 Chronicles ( 1 : 29 ), where there are also frequent references to Qedar as a tribe. The earliest <unk> inscriptions discovered by archaeol | eration. Described as " the most organized of the Northern Arabian tribes ", at the peak of its power in the 6th century BCE it controlled a large region between the Persian Gulf and the Sinai Peninsula. \n Biblical tradition holds that the Qedarites are named for Qedar, the second son of Ishmael, mentioned in the Bible's books of Genesis ( 25 : 13 ) and 1 Chronicles ( 1 : 29 ), where there are also frequent references to Qedar as a tribe. The earliest <unk> inscriptions discovered by archaeologi |
model_cls = AutoModelForMaskedLM
hf_logging.set_verbosity_error()
pretrained_model_name = "bert-base-uncased"
hf_arch, hf_config, hf_tokenizer, hf_model = get_hf_objects(pretrained_model_name, model_cls=model_cls)
# some tokenizers like gpt and gpt2 do not have a pad token, so we add it here mainly for the purpose
# of setting the "labels" key appropriately (see below)
if hf_tokenizer.pad_token is None:
    hf_tokenizer.pad_token = "[PAD]"
preprocessor = LMPreprocessor(hf_tokenizer, chunk_size=128, text_attr=0)
proc_df = preprocessor.process_df(train_df, valid_df)
print(len(proc_df))
proc_df.head(2)
Using eos_token, but it is not set yet.
21227
| | proc_0 | is_valid |
---|---|---|
0 | \n = 2013 – 14 York City F.C. season = \n \n The 2013 – 14 season was the <unk> season of competitive association football and 77th season in the Football League played by York City Football Club , a professional football club based in York , North Yorkshire , England . Their 17th @-@ place finish in 2012 – 13 meant it was their second consecutive season in League Two . The season ran from 1 July 2013 to 30 June 2014 . \n Nigel Worthington , starting his first full season as York manager , made eight permanent summer signings . By the turn of the year York were only above the relegation z... | False |
1 | goal difference , before a 17 @-@ match unbeaten run saw the team finish in seventh @-@ place in the 24 @-@ team 2013 – 14 Football League Two . This meant York qualified for the play @-@ offs , and they were eliminated in the semi @-@ final by Fleetwood Town . York were knocked out of the 2013 – 14 FA Cup , Football League Cup and Football League Trophy in their opening round matches . \n 35 players made at least one appearance in nationally organised first @-@ team competition , and there were 12 different <unk> . Defender Ben Davies missed only five of the fifty @ | False |
DataBlock
batch_tok_tfm = LMBatchTokenizeTransform(hf_arch, hf_config, hf_tokenizer, hf_model, lm_strategy_cls=BertMLMStrategy)
blocks = (TextBlock(batch_tokenize_tfm=batch_tok_tfm, input_return_type=MLMTextInput), noop)
dblock = DataBlock(blocks=blocks, get_x=ColReader("proc_0"), splitter=ColSplitter(col="is_valid"))
DataLoaders
(torch.Size([4, 128]), torch.Size([4, 128]), torch.Size([4, 128]))
(tensor([ 101, 2003, 2098, 2340, 103, 2101, 103, 1026, 4895, 103, 1028, 1026,
103, 2243, 1028, 1998, 1996, 2674, 2736, 1037], device='cuda:1'),
tensor([-100, -100, -100, -100, 2781, -100, 2083, -100, -100, 2243, -100, -100,
4895, -100, 1028, -100, -100, -100, -100, -100], device='cuda:1'),
tensor([-100, -100, -100, -100, 2781, -100, 2083, -100, -100, 2243, -100, -100,
4895, -100, 1028, -100, -100, -100, -100, -100], device='cuda:1'))
| | text | target |
---|---|---|
0 | [##las] ##ed 11 minutes [MASK] through < [un] ##k > [MASK] un ##k > and the match finished a 1 [MASK] 1 [MASK] . york [were] knocked out of the fa cup after losing 3 – 2 at home to bristol rovers in a first round [MASK] ; the [MASK] were 3 [–] 0 up by 50 @ - @ minutes before fletcher pulled two back [MASK] york with a penalty [MASK] a long @ - @ range strike . defender keith [MASK] , of cheltenham , and [MASK] nick pope [MASK] of charlton athletic , were signed on loan until january 2014 . they [MASK] played in york ' s first league [MASK] [MASK] four weeks , 2 – 1 [MASK] , to southend united | [is] ##ed 11 minutes [later] through < [un] ##k > [<] un ##k > and the match finished a 1 [–] 1 [draw] . york [were] knocked out of the fa cup after losing 3 – 2 at home to bristol rovers in a first round [replay] ; the [visitors] were 3 [–] 0 up by 50 @ - @ minutes before fletcher pulled two back [for] york with a penalty [and] a long @ - @ range strike . defender keith [lowe] , of cheltenham , and [goalkeeper] nick pope [,] of charlton athletic , were signed on loan until january 2014 . they [both] played in york ' s first league [defeat] [in] four weeks , 2 – 1 [away] , to southend united |
1 | [MASK] ##on . [MASK] 134 ##5 [MASK] iii was planning a [MASK] assault on france . a three [MASK] [MASK] [MASK] < un ##k > attack would have the earl of northampton attacking from brittany [,] the [MASK] himself from flanders , while gr [bubble] ##mont was dispatched [MASK] < un ##k > to prepare a campaign in the south [MASK] moving rapidly through the country , he confronted the [comte] d ’ [MASK] at < un ##k > [on] [MASK] october and there achieved a victory described as " [MASK] greatest single achievement of lancaster ' s entire military career " . the ransom from the prisoners has been [MASK] at £ 50 @ , @ 000 . [MASK] next year , while edward was | [ign] ##on . [in] 134 ##5 [edward] iii was planning a [major] assault on france . a three [@] [-] [@] < un ##k > attack would have the earl of northampton attacking from brittany [,] the [king] himself from flanders , while gr [##os] ##mont was dispatched [to] < un ##k > to prepare a campaign in the south [.] moving rapidly through the country , he confronted the [comte] d ’ [isle] at < un ##k > [on] [21] october and there achieved a victory described as " [the] greatest single achievement of lancaster ' s entire military career " . the ransom from the prisoners has been [estimated] at £ 50 @ , @ 000 . [the] next year , while edward was |