This module contains core custom models, loss functions, and related utilities for Seq2Seq tasks (e.g., language modeling, summarization, translation).
[nltk_data] Downloading package wordnet to /home/wgilliam/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
 
torch.cuda.set_device(1)
print(f'Using GPU #{torch.cuda.current_device()}: {torch.cuda.get_device_name()}')
Using GPU #1: GeForce GTX 1080 Ti

Seq2Seq

path = Path('./')
cnndm_df = pd.read_csv(path/'cnndm_sample.csv')

cnndm_df.head(2)
article highlights ds_type
0 (CNN) -- Globalization washes like a flood over the world's cultures and economies. Floods can be destructive; however, they can also bring blessings, as the annual floods of the Nile did for ancient Egypt. The world's great universities can be crucial instruments in shaping, in a positive way, humankind's reaction to globalization and the development of humankind itself. Traditionally, universities have been defined and limited by location, creating an academic community and drawing students and scholars to that place. Eventually, some universities began to encourage students to study el... John Sexton: Traditionally, universities have been defined and limited by location .\nGlobal campuses form a network of thought, innovation, he writes .\nFaculty can teach, Sexton says, students can team up in many cities at once .\nSexton: Research, scholarship can be shared and cultural ties made in "century of knowledge" train
1 (CNN) -- Armenian President Robert Kocharian declared a state of emergency Saturday night after a day of clashes between police and protesters, a spokeswoman for the Armenian Foreign Ministry said. Opposition supporters wave an Armenian flag during a protest rally in Yerevan, Armenia, on Saturday. The protesters claim last month's presidential election was rigged. The state of emergency will "hopefully bring some order" to the capital, Yerevan, said Salpi Ghazarian, assistant to the Armenian foreign minister, who spoke to CNN early Sunday. The state of emergency could last until March 20, ... NEW: Protest moves after crackdown at Freedom Square .\nOrder sought after protests over last month's election turn violent .\nDemonstrators say the election was fraudulent .\nState of emergency could last until March 20, official says . train
pretrained_model_name = "facebook/bart-large-cnn"
hf_arch, hf_config, hf_tokenizer, hf_model = BLURR_MODEL_HELPER.get_hf_objects(pretrained_model_name, 
                                                                               model_cls=BartForConditionalGeneration)

hf_arch, type(hf_config), type(hf_tokenizer), type(hf_model)
('bart',
 transformers.models.bart.configuration_bart.BartConfig,
 transformers.models.bart.tokenization_bart_fast.BartTokenizerFast,
 transformers.models.bart.modeling_bart.BartForConditionalGeneration)
before_batch_tfm = HF_Seq2SeqBeforeBatchTransform(hf_arch, hf_config, hf_tokenizer, hf_model,
                                                  max_length=256, max_target_length=130)

blocks = (HF_Seq2SeqBlock(before_batch_tfm=before_batch_tfm), noop)

dblock = DataBlock(blocks=blocks, 
                   get_x=ColReader('article'), 
                   get_y=ColReader('highlights'), 
                   splitter=RandomSplitter())
dls = dblock.dataloaders(cnndm_df, bs=2)
b = dls.one_batch()
len(b), b[0]['input_ids'].shape, b[1].shape
(2, torch.Size([2, 256]), torch.Size([2, 68]))
dls.show_batch(dataloaders=dls, max_n=2)
text target
0 Dan Condon believes in recycling. Just not when it comes to his hotel towels. Condon composts when he's at home in Boulder, Colorado. He eats local, organic and fair-trade food and drives a Honda CR-Z hybrid sports car. You might call him green. Except he's not so green when he travels for his work at an education nonprofit and stays in a hotel, which happens about 10 weeks per year. There, he uses a new towel every day. And don't try to bribe him with a drink or dessert coupon to get him to reuse the same one. "I could care less about rewards for environmentally conscious behavior unless it's miles," Condon wrote in an e-mail. If hotels can't convince a hybrid-driving recycling enthusiast like Condon to go green while traveling, how can they possibly convince everyone else? 9 glamorous movie-star hotels. That's the problem of hotels trying to "green" your hotel stay. After guests have paid a pretty penny for a night at the inn, even the most environmental guests may want to treat themselves to fresh towels every day and those little bottles of sweet-smelling shampoo. Despite the fact that most people describe themselves in surveys as environmentally conscious and as preferring green products, Hotel guests who "go green" are happier with their stay.\nIncreasing water and energy costs are pushing hotels to cut costs wherever they can.\nMany hotels find that guests don't mind using the same towels and sheets every night.\nTripAdvisor will be adding a green label for hotels listed on its site.
1 I have an uncle who has always been a robust and healthy guy. He drank a glass of skim milk every day, bragged about how many pull-ups he was doing and fit into pants he was wearing 20 years before. He didn't take a single medication and retired early. Given that he had no medical problems and ran his own business, he opted to go several years without health insurance. Eventually, when he turned 65, he picked up Medicare. What happened next was a little strange. He fell off the wagon. He exercised only sporadically, and paid hardly any attention to what he was eating. One day, I saw him eat an entire bag of potato chips. He bemoaned the fact that he was forced to buy new, bigger pants, and he stopped drinking his milk. For him, becoming newly insured had nearly the opposite effect on him of what we doctors hope to achieve. He'd become unhealthier. In many ways, my uncle was demonstrating a concept known as the moral hazard. Two economists wrote about this exact scenario in 2006. They found that many men, at the time they obtained Medicare, started behaving badly. Moral, or morale, hazard is a term largely used by economists to describe the actions of people more Sanjay Gupta: Moral hazard causes some to neglect health when they get health insurance.\nHe says Obamacare alone won't guarantee good health; personal habits must do that.\nHe says research shows 30 minutes of daily exercise cuts heart attack, stroke risk by a third.\nGupta: It's time to stop playing defense on your health; instead, start optimizing it yourself.

Training

Here we create a Seq2Seq-specific subclass of HF_BaseModelCallback that adds custom Seq2Seq metrics and handles the pre-calculated loss during training.

seq2seq_metrics

  • {'rouge': { 'compute_kwargs': {'rouge_types': ["rouge1", "rouge2", "rougeL"], 'use_stemmer': True}, 'returns': ["rouge1", "rouge2", "rougeL"] }}
  • {'bert_score': { 'returns': ["precision", "recall", "f1"] }}
  • {'bleu': { 'returns': "bleu" }}
  • {'bleurt': { 'returns': "scores" }}
  • {'meteor': { 'returns': "meteor" }}
  • {'sacrebleu': { 'returns': "score" }}
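As a rough sketch of how a configuration like this could map to the metric columns reported during training: the helper below is hypothetical (it is not part of blurr) and assumes that a list-valued 'returns' contributes one column per entry, while a string-valued 'returns' contributes a single column named after the metric itself. That assumption is inferred from the column names in the training output shown later on this page.

```python
def metric_columns(metrics_cfg):
    """Hypothetical helper: derive display column names from a metrics config.

    Assumption (not confirmed by the library source): a list under 'returns'
    yields one column per listed value, and a single string yields one column
    named after the metric key.
    """
    columns = []
    for name, cfg in metrics_cfg.items():
        returns = cfg['returns']
        if isinstance(returns, (list, tuple)):
            columns.extend(returns)
        else:
            columns.append(name)
    return columns


example_cfg = {
    'rouge': {'returns': ["rouge1", "rouge2", "rougeL"]},
    'sacrebleu': {'returns': "score"},
}

print(metric_columns(example_cfg))  # ['rouge1', 'rouge2', 'rougeL', 'sacrebleu']
```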

class HF_Seq2SeqMetricsCallback[source]

HF_Seq2SeqMetricsCallback(custom_metrics=None, ignore_token_id=-100, text_gen_kwargs={}, **kwargs) :: Callback

Basic class handling tweaks of the training loop by changing a Learner in various events
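The ignore_token_id argument defaults to -100, which is also the label value that PyTorch's cross-entropy loss ignores by default. A minimal sketch of why it matters for metrics, assuming the targets are padded with that id: the padding has to be dropped before the remaining ids are decoded to text and scored. The function name and the token ids below are hypothetical and only for illustration.

```python
def strip_ignored(label_ids, ignore_token_id=-100):
    """Drop positions marked with the ignore id, keeping the real token ids."""
    return [tok for tok in label_ids if tok != ignore_token_id]


# Made-up ids for a short target that was padded out to length 6.
padded_labels = [0, 713, 16, 2, -100, -100]

print(strip_ignored(padded_labels))  # [0, 713, 16, 2]
```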

We add a custom param splitter to give us a bit more depth in applying discriminative learning rates for Seq2Seq tasks.

seq2seq_splitter[source]

seq2seq_splitter(m, arch)

Custom param splitter for summarization models
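To illustrate what discriminative learning rates mean once a splitter has produced parameter groups, here is a minimal sketch that spreads learning rates geometrically from the earliest group to the last one. It is written in the spirit of how fastai treats a slice(start, stop) learning rate, but the function name is hypothetical and this is not the library's code.

```python
def spread_lrs(lr_start, lr_stop, n_groups):
    """Return one learning rate per parameter group, geometrically spaced.

    Earlier groups get smaller rates than later groups, so pretrained
    layers near the input change more slowly than the final layers.
    """
    if n_groups == 1:
        return [lr_stop]
    ratio = (lr_stop / lr_start) ** (1 / (n_groups - 1))
    return [lr_start * ratio ** i for i in range(n_groups)]


# Three groups, matching the number of param groups reported on this page.
lrs = spread_lrs(1e-6, 1e-4, 3)
for group_idx, lr in enumerate(lrs):
    print(f'group {group_idx}: lr={lr:.1e}')
```

With three groups, the middle group lands roughly halfway between the two endpoints on a log scale.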

seq2seq_metrics = {
    'rouge': {
        'compute_kwargs': {
            'rouge_types': ["rouge1", "rouge2", "rougeL"], 'use_stemmer': True
        }, 
        'returns': ["rouge1", "rouge2", "rougeL"] 
    }, 
    'bleu': { 'returns': "bleu" },
    'meteor': { 'returns': "meteor" },
    'sacrebleu': { 'returns': "score" }
}

model = HF_BaseModelWrapper(hf_model)
learn_cbs = [HF_BaseModelCallback]
fit_cbs = [HF_Seq2SeqMetricsCallback(custom_metrics=seq2seq_metrics)]

learn = Learner(dls, 
                model,
                opt_func=partial(Adam),
                loss_func=CrossEntropyLossFlat(), #HF_PreCalculatedLoss()
                cbs=learn_cbs,
                splitter=partial(seq2seq_splitter, arch=hf_arch)) #.to_native_fp16() #.to_fp16()

learn.create_opt() 
learn.freeze()
b = dls.one_batch()
preds = learn.model(b[0])

len(preds),preds['loss'].shape, preds['logits'].shape
(4, torch.Size([]), torch.Size([2, 84, 50264]))
b = dls.one_batch()
preds = learn.model(b[0])

len(preds),preds['loss'].shape, preds['logits'].shape
(4, torch.Size([]), torch.Size([2, 68, 50264]))
print(len(learn.opt.param_groups))
3
learn.lr_find(suggestions=True)
SuggestedLRs(lr_min=0.00012022644514217973, lr_steep=4.365158383734524e-05)
learn.fit_one_cycle(1, lr_max=4e-5, cbs=fit_cbs)
epoch train_loss valid_loss rouge1 rouge2 rougeL bleu meteor sacrebleu time
0 1.719814 1.711524 0.386660 0.167494 0.264603 0.150649 0.296390 12.609678 03:24

Showing results

Below we add functionality that takes advantage of Hugging Face's PreTrainedModel.generate method, which makes it easy to use beam search, top-k/nucleus sampling, and other decoding strategies so that we get more human-sounding results.
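For readers unfamiliar with the sampling strategies mentioned above, the following is a minimal, library-free sketch of top-k and nucleus (top-p) filtering over a toy next-token distribution. The function names and the probabilities are made up for illustration; this is not how the generate method is implemented internally, only the idea behind the two filters.

```python
def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize their probabilities."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in ranked)
    return {tok: p / total for tok, p in ranked}


def top_p_filter(probs, p):
    """Keep the smallest set of most probable tokens whose cumulative
    probability reaches p, then renormalize."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for tok, prob in ranked:
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {tok: prob / total for tok, prob in kept}


# Toy distribution over four candidate next tokens (hypothetical values).
toy_probs = {"casino": 0.5, "bank": 0.3, "hotel": 0.15, "zebra": 0.05}

print(list(top_k_filter(toy_probs, 2)))    # ['casino', 'bank']
print(list(top_p_filter(toy_probs, 0.9)))  # ['casino', 'bank', 'hotel']
```

In both cases the next token would then be sampled from the filtered distribution, which removes the unlikely tail ("zebra" here) that tends to produce incoherent text.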

test_article = """
About 10 men armed with pistols and small machine guns raided a casino in Switzerland and made off 
into France with several hundred thousand Swiss francs in the early hours of Sunday morning, police said. 
The men, dressed in black clothes and black ski masks, split into two groups during the raid on the Grand Casino 
Basel, Chief Inspector Peter Gill told CNN. One group tried to break into the casino's vault on the lower level 
but could not get in, but they did rob the cashier of the money that was not secured, he said. The second group 
of armed robbers entered the upper level where the roulette and blackjack tables are located and robbed the 
cashier there, he said. As the thieves were leaving the casino, a woman driving by and unaware of what was 
occurring unknowingly blocked the armed robbers' vehicles. A gunman pulled the woman from her vehicle, beat 
her, and took off for the French border. The other gunmen followed into France, which is only about 100 
meters (yards) from the casino, Gill said. There were about 600 people in the casino at the time of the robbery. 
There were no serious injuries, although one guest on the Casino floor was kicked in the head by one of the 
robbers when he moved, the police officer said. Swiss authorities are working closely with French authorities, 
Gill said. The robbers spoke French and drove vehicles with French lRicense plates. CNN's Andreena Narayan 
contributed to this report.
"""
res = learn.blurr_predict(test_article)
print(hf_tokenizer.decode(res[0][0][0][:20]))
<s><s>                About 10 men armed with pistols and small machine guns raid a casino in Switzerland. made

That doesn't look much like human-written text. Let's use Hugging Face's PreTrainedModel.generate method to produce something more human-like.

b = dls.valid.one_batch()

b_before_batch_tfm = get_blurr_tfm(dls.before_batch)

b_hf_tokenizer = b_before_batch_tfm.hf_tokenizer
b_ignore_token_id = b_before_batch_tfm.ignore_token_id

test_input_ids = b[0]['input_ids'][0].unsqueeze(0).to(learn.model.hf_model.device)
test_trg_ids = b[1][0].unsqueeze(0).to(learn.model.hf_model.device)
test_trg_ids = [ trg[trg != b_ignore_token_id] for trg in test_trg_ids ]

gen_text = learn.model.hf_model.generate(test_input_ids, num_beams=4, max_length=130, min_length=30)

print('=== Target ===')
print(f'{b_hf_tokenizer.decode(test_trg_ids[0], skip_special_tokens=True, clean_up_tokenization_spaces=True)}\n')

print('=== Prediction ===')
print(b_hf_tokenizer.decode(gen_text[0], skip_special_tokens=True, clean_up_tokenization_spaces=True))
=== Target ===
 Mexico hosts to up to 10 percent of all known species on Earth.
It is home to 502 types of mammals, 290 bird species and 26,000 types of plants.
Human development and climate change is placing a big strain on its biodiversity.
The Golden Eagle is under threat in spite of being the country's national symbol.

=== Prediction ===
 Mexico is home to up to 10 percent of all known species on the planet.
Climate change and human encroachment on natural environments threaten the country's rich wildlife.
Mexico ranks 11th in the United Nations Environment Program's list of megadiverse countries.
Some 574 out of 717 reptile species found in Mexico can only be encountered within its borders.
Pronatura, a non-profit organization, has selected six species which it says symbolize the problems faced by the country.

We'll add a blurr_generate method to Learner that uses Hugging Face's PreTrainedModel.generate to create our predictions. For the full list of arguments you can pass in, see the Hugging Face documentation for generate. You can also check out their "How To Generate" notebook for more information about how it all works.

Learner.blurr_generate[source]

Learner.blurr_generate(inp, task=None, **kwargs)

Uses the built-in generate method to generate the text (see here for a list of arguments you can pass in)

outputs = learn.blurr_generate(test_article, num_return_sequences=3)

for idx, o in enumerate(outputs):
    print(f'=== Prediction {idx+1} ===\n{o}\n')
=== Prediction 1 ===
 About 10 men with pistols and machine guns raid Swiss casino in Basel, police say .
They make off with several hundred thousand Swiss francs in the early hours of Sunday morning .
There were about 600 people in the casino at the time of the robbery .
The robbers spoke French and drove vehicles with French lRicense plates .
Swiss authorities are working closely with French authorities .

=== Prediction 2 ===
 About 10 men with pistols and machine guns raid Swiss casino in Basel, police say .
They make off with several hundred thousand Swiss francs in the early hours of Sunday morning .
There were no serious injuries, although one guest was kicked in the head by one of the robbers .
The robbers spoke French and drove vehicles with French lRicense plates, police officer says .

=== Prediction 3 ===
 About 10 men with pistols and machine guns raid Swiss casino in Basel, police say .
They make off with several hundred thousand Swiss francs in the early hours of Sunday morning .
There were no serious injuries, although one guest was kicked in the head by one of the robbers .
The robbers spoke French and drove vehicles with French lRicense plates .

Much nicer! Now we can update our @typedispatched show_results to use this new method.

learn.show_results(learner=learn, input_trunc_at=500, target_trunc_at=250)
text target prediction
0 (CNN) -- Home to up to 10 percent of all known species, Mexico is recognized as one of the most biodiverse regions on the planet. The twin threats of climate change and human encroachment on natural environments are, however, threatening the existence of the country's rich wildlife. And there is a great deal to lose. In the United Nations Environment Program (UNEP) World Conservation Monitoring Centre's list of megadiverse countries Mexico ranks 11th. The list represents a group of 17 countries Mexico hosts to up to 10 percent of all known species on Earth.\nIt is home to 502 types of mammals, 290 bird species and 26,000 types of plants.\nHuman development and climate change is placing a big strain on its biodiversity.\nThe Golden Eagle is un Mexico is home to up to 10 percent of all known species on the planet .\nClimate change and human encroachment on natural environments threaten the country's rich wildlife .\nMexico ranks 11th in the United Nations Environment Program's list of megadi
1 The story of the Beatles has taken on the power of myth. Today, five decades after Beatlemania erupted, it seems almost inevitable, a magical confluence of talent and timing. A group of scruffy musicians from Liverpool, a depressed port in northern England, become the biggest band in the world, known on a first-name basis? They put out album after groundbreaking album, their influence as great as their popularity? They dominate the pop culture of the 1960s and break up while still at the top of New biography tries to set Beatles record straight.\nAuthor Mark Lewisohn's "Tune In" is exhaustive, and just first volume of three.\nBeatles can be heard at career's beginning on new album of BBC recordings.\nThe group may have been groundbreaking, bu Beatles historian Mark Lewisohn has written a mammoth biography of the Fab Four .\n"Tune In" is the first of a projected three-volume work .\nThe book ends in 1962, with the Beatles' last single, "I Want You Back"\nLewisohn says the Beatles story has b

Inference

export_fname = 'summarize_export'
learn.metrics = None
learn.export(fname=f'{export_fname}.pkl')
inf_learn = load_learner(fname=f'{export_fname}.pkl')
inf_learn.blurr_generate(test_article)
[' About 10 men with pistols and machine guns raid Swiss casino in Basel, police say .\nThey make off with several hundred thousand Swiss francs in the early hours of Sunday morning .\nThere were about 600 people in the casino at the time of the robbery .\nThe robbers spoke French and drove vehicles with French lRicense plates .\nSwiss authorities are working closely with French authorities .']

Cleanup