This notebook demonstrates how we can use Blurr to tackle the General Language Understanding Evaluation (GLUE) benchmark to train, evaluate, and run inference.
 
Here's what we're running with ...

Using pytorch 1.7.1
Using fastai 2.4
Using transformers 4.6.1
Using GPU #1: GeForce GTX 1080 Ti

GLUE tasks

| Abbr | Name | Task type | Description | Size | Metrics |
|------|------|-----------|-------------|------|---------|
| CoLA | Corpus of Linguistic Acceptability | Single-Sentence Tasks | Predict whether a sequence is a grammatical English sentence | 8.5k | Matthews corr. |
| SST-2 | Stanford Sentiment Treebank | Single-Sentence Tasks | Predict the sentiment of a given sentence | 67k | Accuracy |
| MRPC | Microsoft Research Paraphrase Corpus | Similarity and Paraphrase Tasks | Predict whether two sentences are semantically equivalent | 3.7k | F1/Accuracy |
| STS-B | Semantic Textual Similarity Benchmark | Similarity and Paraphrase Tasks | Predict the similarity score for two sentences on a scale from 1 to 5 | 7k | Pearson/Spearman corr. |
| QQP | Quora Question Pairs | Similarity and Paraphrase Tasks | Predict if two questions are paraphrases of one another | 364k | F1/Accuracy |
| MNLI | Multi-Genre Natural Language Inference | Inference Tasks | Predict whether the premise entails, contradicts, or is neutral to the hypothesis | 393k | Accuracy |
| QNLI | Stanford Question Answering Dataset | Inference Tasks | Predict whether the context sentence contains the answer to the question | 105k | Accuracy |
| RTE | Recognizing Textual Entailment | Inference Tasks | Predict whether one sentence entails another | 2.5k | Accuracy |
| WNLI | Winograd Schema Challenge | Inference Tasks | Predict if the sentence with the pronoun substituted is entailed by the original sentence | 634 | Accuracy |

Define the task and hyperparameters

We'll use the "distilroberta-base" checkpoint for this example, but if you want to try an architecture that returns token_type_ids, for example, you can use something like "bert-base-cased".

task = 'mrpc'
task_meta = glue_tasks[task]
train_ds_name = task_meta['dataset_names']["train"]
valid_ds_name = task_meta['dataset_names']["valid"]
test_ds_name = task_meta['dataset_names']["test"]

task_inputs =  task_meta['inputs']
task_target =  task_meta['target']
task_metrics = task_meta['metric_funcs']

pretrained_model_name = "distilroberta-base" # bert-base-cased | distilroberta-base

bsz = 16
val_bsz = bsz *2

Raw data

Let's start by grabbing the raw data. We'll load the MRPC dataset from Hugging Face's datasets library, which caches it after downloading via the load_dataset method. For more information on the datasets API, see the documentation here.

raw_datasets = load_dataset('glue', task) 
print(f'{raw_datasets}\n')
print(f'{raw_datasets[train_ds_name][0]}\n')
print(f'{raw_datasets[train_ds_name].features}\n')
Reusing dataset glue (/home/wgilliam/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad)
DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 1725
    })
})

{'idx': 0, 'label': 1, 'sentence1': 'Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .', 'sentence2': 'Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .'}

{'sentence1': Value(dtype='string', id=None), 'sentence2': Value(dtype='string', id=None), 'label': ClassLabel(num_classes=2, names=['not_equivalent', 'equivalent'], names_file=None, id=None), 'idx': Value(dtype='int32', id=None)}

Prepare the Hugging Face objects

My #1 answer to the question, "Why aren't my transformers training?", is that you likely don't have num_labels set correctly. The default for sequence classification tasks is 2, and even though that is what we need here, let's show how to set it either way.

n_lbls = raw_datasets[train_ds_name].features[task_target].num_classes
n_lbls
2
model_cls = AutoModelForSequenceClassification

hf_arch, hf_config, hf_tokenizer, hf_model = BLURR.get_hf_objects(pretrained_model_name, 
                                                                  model_cls=model_cls, 
                                                                  config_kwargs={'num_labels': n_lbls})

print(hf_arch)
print(type(hf_config))
print(type(hf_tokenizer))
print(type(hf_model))
roberta
<class 'transformers.models.roberta.configuration_roberta.RobertaConfig'>
<class 'transformers.models.roberta.tokenization_roberta_fast.RobertaTokenizerFast'>
<class 'transformers.models.roberta.modeling_roberta.RobertaForSequenceClassification'>

Build the tokenized datasets

Tokenize (and numericalize) the raw text using the datasets.map function, then remove unnecessary and/or problematic attributes from the resulting tokenized dataset (e.g., strings that can't be converted to a tensor).

def tokenize_function(example):
    return hf_tokenizer(*[example[inp] for inp in task_inputs ], truncation=True)

tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
tokenized_datasets = tokenized_datasets.remove_columns(task_inputs + ['idx'])
tokenized_datasets["train"].column_names
['attention_mask', 'input_ids', 'label']
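If the batched=True semantics of datasets.map are unfamiliar, here's a framework-free sketch of what's going on: the mapped function receives a dict of column-name -> list-of-values and returns a dict of new columns of the same length, which get merged back into the dataset. The names toy_tokenize and toy_map below are ours, not part of the datasets API, and the "tokenization" is a stand-in for what hf_tokenizer actually does.

```python
def toy_tokenize(batch):
    # mimic `hf_tokenizer(example['sentence1'], example['sentence2'])` by
    # "numericalizing" each sentence pair as a list of word lengths
    return {
        "input_ids": [
            [len(w) for w in (s1 + " " + s2).split()]
            for s1, s2 in zip(batch["sentence1"], batch["sentence2"])
        ]
    }

def toy_map(dataset, fn):
    # `batched=True` behavior: `fn` sees the whole dict-of-lists at once,
    # and the columns it returns are merged back into the dataset
    return {**dataset, **fn(dataset)}

raw = {
    "sentence1": ["He sold it .", "It rained ."],
    "sentence2": ["He sold the item .", "Rain fell ."],
    "label": [1, 0],
}
tokenized = toy_map(raw, toy_tokenize)
print(tokenized["input_ids"])
```

Note that the existing columns (like label) survive untouched, which is why we still need the remove_columns call above to drop the raw strings.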

1. Using PyTorch DataLoaders

Build the DataLoaders

We have to augment the default DataCollatorWithPadding class to return a tuple of inputs/targets. Since Hugging Face returns a BatchEncoding object from the call to DataCollatorWithPadding, we convert it to a dict so that fastai can put the batches on the correct device for training.

@dataclass
class Blurr_DataCollatorWithPadding(DataCollatorWithPadding):
    def __call__(self, features):
        batch = super().__call__(features)
        return dict(batch), batch['labels']
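To see what the collator actually has to do, here's a minimal hand-rolled equivalent that doesn't need transformers at all. The name pad_collate is our own (a stand-in for Blurr_DataCollatorWithPadding, not a library API): it pads input_ids/attention_mask to the longest sequence in the batch and splits out the labels as the target tuple fastai expects.

```python
import torch

def pad_collate(features, pad_token_id=1):  # 1 is RoBERTa's <pad> id
    max_len = max(len(f["input_ids"]) for f in features)
    input_ids, attention_mask, labels = [], [], []
    for f in features:
        n_pad = max_len - len(f["input_ids"])
        input_ids.append(f["input_ids"] + [pad_token_id] * n_pad)
        attention_mask.append([1] * len(f["input_ids"]) + [0] * n_pad)
        labels.append(f["label"])
    batch = {
        "input_ids": torch.tensor(input_ids),
        "attention_mask": torch.tensor(attention_mask),
    }
    # return the (inputs, targets) tuple fastai's training loop expects
    return batch, torch.tensor(labels)

xb, yb = pad_collate([
    {"input_ids": [0, 8, 2], "label": 1},
    {"input_ids": [0, 8, 9, 4, 2], "label": 0},
])
print(xb["input_ids"].shape, yb)  # both sequences padded to length 5
```

The real DataCollatorWithPadding does the padding for us (honoring the tokenizer's pad token and padding side); our subclass above only changes what gets returned.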

Build the plain ol' PyTorch DataLoaders

data_collator = Blurr_DataCollatorWithPadding(tokenizer=hf_tokenizer)

train_dataloader = torch.utils.data.DataLoader(tokenized_datasets[train_ds_name], shuffle=True, batch_size=bsz, 
                                               collate_fn=data_collator)

eval_dataloader = torch.utils.data.DataLoader(tokenized_datasets[valid_ds_name], batch_size=val_bsz, 
                                              collate_fn=data_collator)
dls = DataLoaders(train_dataloader, eval_dataloader)
for b in dls.train: break
b[0]['input_ids'].shape, b[1].shape, b[0]['input_ids'].device, b[1].device
(torch.Size([16, 80]),
 torch.Size([16]),
 device(type='cpu'),
 device(type='cpu'))

Train

With our plain ol' PyTorch DataLoaders built, we can now build our Learner and train.

Note: Certain fastai methods like dls.one_batch, get_preds and dls.test_dl won't work with standard PyTorch DataLoaders ... but we'll show how to remedy that in a moment :)

model = HF_BaseModelWrapper(hf_model)

learn = Learner(dls, 
                model,
                opt_func=partial(Adam),
                loss_func=HF_PreCalculatedLoss(),
                metrics=task_metrics,
                cbs=[HF_BaseModelCallback],
                splitter=hf_splitter).to_fp16()

learn.freeze()
# learn.summary() # ... won't work :(
learn.lr_find(suggest_funcs=[minimum, steep, valley, slide])
/home/wgilliam/miniconda3/envs/blurr/lib/python3.9/site-packages/fastai/callback/schedule.py:270: UserWarning: color is redundantly defined by the 'color' keyword argument and the fmt string "ro" (-> color='r'). The keyword argument will take precedence.
  ax.plot(val, idx, 'ro', label=nm, c=color)
SuggestedLRs(minimum=4.786300996784121e-05, steep=6.309573450380412e-07, valley=tensor(0.0007), slide=tensor(0.0021))
learn.fit_one_cycle(1, lr_max=2e-3)
epoch train_loss valid_loss f1_score accuracy time
0 0.576206 0.500343 0.834646 0.742647 00:11
learn.unfreeze()
learn.lr_find(start_lr=1e-12, end_lr=2e-3, suggest_funcs=[minimum, steep, valley, slide])
/home/wgilliam/miniconda3/envs/blurr/lib/python3.9/site-packages/fastai/callback/schedule.py:270: UserWarning: color is redundantly defined by the 'color' keyword argument and the fmt string "ro" (-> color='r'). The keyword argument will take precedence.
  ax.plot(val, idx, 'ro', label=nm, c=color)
SuggestedLRs(minimum=1.3065426344993637e-12, steep=1.0546620000939644e-11, valley=tensor(7.6342e-06), slide=tensor(1.1716e-05))
learn.fit_one_cycle(2, lr_max=slice(2e-5, 2e-4))
epoch train_loss valid_loss f1_score accuracy time
0 0.447772 0.412896 0.878289 0.818627 00:19
1 0.301240 0.307498 0.911032 0.877451 00:19

Evaluate

How did we do?

val_res = learn.validate()
val_res_d = { 'loss': val_res[0]}
for idx, m in enumerate(learn.metrics): val_res_d[m.name] = val_res[idx+1]
    
val_res_d
{'loss': 0.3074977993965149,
 'f1_score': 0.911032028469751,
 'accuracy': 0.8774510025978088}
# preds, targs = learn.get_preds()  # ... won't work :(

Inference

Let's do item inference on an example from our test dataset

raw_test_df = raw_datasets[test_ds_name].to_pandas()
raw_test_df.head()
idx label sentence1 sentence2
0 0 1 PCCW 's chief operating officer , Mike Butcher , and Alex Arena , the chief financial officer , will report directly to Mr So . Current Chief Operating Officer Mike Butcher and Group Chief Financial Officer Alex Arena will report to So .
1 1 1 The world 's two largest automakers said their U.S. sales declined more than predicted last month as a late summer sales frenzy caused more of an industry backlash than expected . Domestic sales at both GM and No. 2 Ford Motor Co. declined more than predicted as a late summer sales frenzy prompted a larger-than-expected industry backlash .
2 2 1 According to the federal Centers for Disease Control and Prevention ( news - web sites ) , there were 19 reported cases of measles in the United States in 2002 . The Centers for Disease Control and Prevention said there were 19 reported cases of measles in the United States in 2002 .
3 3 0 A tropical storm rapidly developed in the Gulf of Mexico Sunday and was expected to hit somewhere along the Texas or Louisiana coasts by Monday night . A tropical storm rapidly developed in the Gulf of Mexico on Sunday and could have hurricane-force winds when it hits land somewhere along the Louisiana coast Monday night .
4 4 0 The company didn 't detail the costs of the replacement and repairs . But company officials expect the costs of the replacement work to run into the millions of dollars .
test_ex_idx = 0
test_ex = raw_test_df.iloc[test_ex_idx][task_inputs].values.tolist()
inputs = hf_tokenizer(*test_ex, return_tensors="pt").to('cuda:1')
outputs = hf_model(**inputs)
outputs.logits
tensor([[-2.8143,  1.8015]], device='cuda:1', grad_fn=<AddmmBackward>)
torch.argmax(torch.softmax(outputs.logits, dim=-1))
tensor(1, device='cuda:1')
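The argmax gives us class index 1; to make that human-readable we just look it up in the ClassLabel names we printed earlier (hard-coded below from the MRPC features shown above; calling int2str on the dataset's label feature would do the same). A small sketch with the logits copied from the output above:

```python
import torch

# logits copied from the cell output above; label names come from the
# MRPC `ClassLabel` feature printed earlier
logits = torch.tensor([[-2.8143, 1.8015]])
label_names = ["not_equivalent", "equivalent"]

probs = torch.softmax(logits, dim=-1)
pred_idx = int(torch.argmax(probs))
print(label_names[pred_idx], round(probs[0, pred_idx].item(), 4))
```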

Let's do batch inference on the entire test dataset

test_dataloader = torch.utils.data.DataLoader(tokenized_datasets[test_ds_name], shuffle=False, batch_size=val_bsz, 
                                               collate_fn=data_collator)

hf_model.eval()

probs, preds = [], []
for xb, yb in test_dataloader: 
    xb = to_device(xb, 'cuda')
    with torch.no_grad(): 
        outputs = hf_model(**xb)
        
    # softmax the raw logits so that `probs` actually holds probabilities
    logits = outputs.logits
    probs.append(torch.softmax(logits, dim=-1))
    preds.append(torch.argmax(logits, dim=-1))
all_probs = torch.cat(probs, dim=0)
all_preds = torch.cat(preds, dim=0)

print(all_probs.shape, all_preds.shape)
torch.Size([1725, 2]) torch.Size([1725])

2. Using fastai DataLoaders

Let's start with a fresh set of huggingface objects

hf_arch, hf_config, hf_tokenizer, hf_model = BLURR.get_hf_objects(pretrained_model_name, 
                                                                  model_cls=model_cls, 
                                                                  config_kwargs={'num_labels': n_lbls})

... and the fix is this simple! Instead of using the PyTorch DataLoaders, let's use the fastai flavor like this ...

data_collator = Blurr_DataCollatorWithPadding(tokenizer=hf_tokenizer)

train_dataloader = DataLoader(tokenized_datasets[train_ds_name], shuffle=True, batch_size=bsz, 
                              create_batch=data_collator)

eval_dataloader = DataLoader(tokenized_datasets[valid_ds_name], batch_size=val_bsz, 
                             create_batch=data_collator)

Everything else is the same ... but now we get a bit more of the fastai features back

Build the DataLoaders

dls = DataLoaders(train_dataloader, eval_dataloader)
b = dls.one_batch()
b[0]['input_ids'].shape, b[1].shape, b[0]['input_ids'].device, b[1].device
(torch.Size([16, 70]),
 torch.Size([16]),
 device(type='cpu'),
 device(type='cpu'))

Train

model = HF_BaseModelWrapper(hf_model)

learn = Learner(dls, 
                model,
                opt_func=partial(Adam),
                loss_func=HF_PreCalculatedLoss(),
                metrics=task_metrics,
                cbs=[HF_BaseModelCallback],
                splitter=hf_splitter).to_fp16()

learn.freeze()
learn.lr_find(suggest_funcs=[minimum, steep, valley, slide])
/home/wgilliam/miniconda3/envs/blurr/lib/python3.9/site-packages/fastai/callback/schedule.py:270: UserWarning: color is redundantly defined by the 'color' keyword argument and the fmt string "ro" (-> color='r'). The keyword argument will take precedence.
  ax.plot(val, idx, 'ro', label=nm, c=color)
SuggestedLRs(minimum=0.0019054606556892395, steep=0.00010964782268274575, valley=tensor(0.0006), slide=tensor(0.0012))
learn.fit_one_cycle(1, lr_max=2e-3)
epoch train_loss valid_loss f1_score accuracy time
0 0.520931 0.479620 0.847826 0.759804 00:11
learn.unfreeze()
learn.lr_find(start_lr=1e-12, end_lr=2e-3, suggest_funcs=[minimum, steep, valley, slide])
/home/wgilliam/miniconda3/envs/blurr/lib/python3.9/site-packages/fastai/callback/schedule.py:270: UserWarning: color is redundantly defined by the 'color' keyword argument and the fmt string "ro" (-> color='r'). The keyword argument will take precedence.
  ax.plot(val, idx, 'ro', label=nm, c=color)
SuggestedLRs(minimum=2.227479490102269e-06, steep=8.513399187004556e-12, valley=tensor(1.3762e-06), slide=tensor(2.2275e-05))
learn.fit_one_cycle(2, lr_max=slice(2e-5, 2e-4))
epoch train_loss valid_loss f1_score accuracy time
0 0.457834 0.340799 0.895683 0.857843 00:20
1 0.267792 0.320309 0.899654 0.857843 00:20

Evaluate

How did we do?

val_res = learn.validate()
val_res_d = { 'loss': val_res[0]}
for idx, m in enumerate(learn.metrics): val_res_d[m.name] = val_res[idx+1]
    
val_res_d
{'loss': 0.3203085660934448,
 'f1_score': 0.8996539792387542,
 'accuracy': 0.8578431606292725}

Now we can use Learner.get_preds()

preds, targs = learn.get_preds()
print(preds.shape, targs.shape)
print(accuracy(preds, targs))
torch.Size([408, 2]) torch.Size([408])
TensorBase(0.8578)
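Under the hood, fastai's accuracy metric is just argmax-then-compare. A quick torch-only sketch with made-up stand-ins for the preds/targs tensors returned by learn.get_preds():

```python
import torch

# toy stand-ins for `preds` (per-class probabilities) and `targs`
preds = torch.tensor([[0.1, 0.9], [0.8, 0.2], [0.4, 0.6], [0.7, 0.3]])
targs = torch.tensor([1, 0, 0, 0])

# accuracy = fraction of rows whose argmax matches the target (3 of 4 here)
acc = (preds.argmax(dim=-1) == targs).float().mean()
print(acc.item())  # → 0.75
```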

Inference

Let's do item inference on an example from our test dataset

raw_test_df = raw_datasets[test_ds_name].to_pandas()
raw_test_df.head()
idx label sentence1 sentence2
0 0 1 PCCW 's chief operating officer , Mike Butcher , and Alex Arena , the chief financial officer , will report directly to Mr So . Current Chief Operating Officer Mike Butcher and Group Chief Financial Officer Alex Arena will report to So .
1 1 1 The world 's two largest automakers said their U.S. sales declined more than predicted last month as a late summer sales frenzy caused more of an industry backlash than expected . Domestic sales at both GM and No. 2 Ford Motor Co. declined more than predicted as a late summer sales frenzy prompted a larger-than-expected industry backlash .
2 2 1 According to the federal Centers for Disease Control and Prevention ( news - web sites ) , there were 19 reported cases of measles in the United States in 2002 . The Centers for Disease Control and Prevention said there were 19 reported cases of measles in the United States in 2002 .
3 3 0 A tropical storm rapidly developed in the Gulf of Mexico Sunday and was expected to hit somewhere along the Texas or Louisiana coasts by Monday night . A tropical storm rapidly developed in the Gulf of Mexico on Sunday and could have hurricane-force winds when it hits land somewhere along the Louisiana coast Monday night .
4 4 0 The company didn 't detail the costs of the replacement and repairs . But company officials expect the costs of the replacement work to run into the millions of dollars .
test_ex_idx = 0
test_ex = raw_test_df.iloc[test_ex_idx][task_inputs].values.tolist()
inputs = hf_tokenizer(*test_ex, return_tensors="pt").to('cuda:1')
outputs = hf_model(**inputs)
outputs.logits
tensor([[-2.2560,  1.7812]], device='cuda:1', grad_fn=<AddmmBackward>)
torch.argmax(torch.softmax(outputs.logits, dim=-1))
tensor(1, device='cuda:1')

Let's do batch inference on the entire test dataset using dls.test_dl

test_dl = dls.test_dl(tokenized_datasets[test_ds_name])
preds = learn.get_preds(dl=test_dl)
preds
(tensor([[0.0173, 0.9827],
         [0.0993, 0.9007],
         [0.0077, 0.9923],
         ...,
         [0.0276, 0.9724],
         [0.0217, 0.9783],
         [0.0175, 0.9825]]),
 tensor([1, 1, 1,  ..., 0, 1, 1]))

Summary

As you can see, with one simple swap of the DataLoader objects, you can get back a lot of the nice fastai functionality that folks using the mid/high-level APIs have at their disposal. Nevertheless, if you're hell-bent on using the standard PyTorch DataLoaders, you're still good to go using the fastai Learner, its callbacks, etc.