This notebook demonstrates how we can use Blurr to tackle the General Language Understanding Evaluation (GLUE) benchmark tasks.
 
Here's what we're running with ...

Using pytorch 1.7.1
Using fastai 2.4
Using transformers 4.6.1
Using GPU #1: GeForce GTX 1080 Ti

GLUE tasks

| Abbr | Name | Task type | Description | Size | Metrics |
|------|------|-----------|-------------|------|---------|
| CoLA | Corpus of Linguistic Acceptability | Single-Sentence Task | Predict whether a sequence is a grammatical English sentence | 8.5k | Matthews corr. |
| SST-2 | Stanford Sentiment Treebank | Single-Sentence Task | Predict the sentiment of a given sentence | 67k | Accuracy |
| MRPC | Microsoft Research Paraphrase Corpus | Similarity and Paraphrase Tasks | Predict whether two sentences are semantically equivalent | 3.7k | F1/Accuracy |
| STS-B | Semantic Textual Similarity Benchmark | Similarity and Paraphrase Tasks | Predict the similarity score for two sentences on a scale from 1 to 5 | 7k | Pearson/Spearman corr. |
| QQP | Quora Question Pairs | Similarity and Paraphrase Tasks | Predict if two questions are paraphrases of one another | 364k | F1/Accuracy |
| MNLI | Multi-Genre Natural Language Inference | Inference Tasks | Predict whether the premise entails, contradicts, or is neutral to the hypothesis | 393k | Accuracy |
| QNLI | Stanford Question Answering Dataset | Inference Tasks | Predict whether the context sentence contains the answer to the question | 105k | Accuracy |
| RTE | Recognizing Textual Entailment | Inference Tasks | Predict whether one sentence entails another | 2.5k | Accuracy |
| WNLI | Winograd Schema Challenge | Inference Tasks | Predict if the sentence with the pronoun substituted is entailed by the original sentence | 634 | Accuracy |

Define the task and hyperparameters

We'll use the "distilroberta-base" checkpoint for this example, but if you want to try an architecture that returns token_type_ids, for example, you can use something like "bert-base-cased".
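The glue_tasks dict referenced below bundles per-task metadata. Its exact contents ship with this notebook's helper code; the following is only a hypothetical sketch of the shape implied by the lookups below (the keys match how it is used, but the values are assumptions):

```python
# Hypothetical sketch of the glue_tasks metadata (the real dict comes
# from this notebook's helper code; the values below are illustrative only)
glue_tasks = {
    'mrpc': {
        'dataset_names': {'train': 'train', 'valid': 'validation', 'test': 'test'},
        'inputs': ['sentence1', 'sentence2'],  # MRPC is a two-sentence task
        'target': 'label',
        'metric_funcs': [],  # e.g., F1 and accuracy in the real notebook
    }
}

task_meta = glue_tasks['mrpc']
print(task_meta['dataset_names']['valid'], task_meta['target'])  # validation label
```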

task = 'mrpc'
task_meta = glue_tasks[task]
train_ds_name = task_meta['dataset_names']["train"]
valid_ds_name = task_meta['dataset_names']["valid"]
test_ds_name = task_meta['dataset_names']["test"]

task_inputs =  task_meta['inputs']
task_target =  task_meta['target']
task_metrics = task_meta['metric_funcs']

pretrained_model_name = "distilroberta-base" # bert-base-cased | distilroberta-base

bsz = 16
val_bsz = bsz * 2

Prepare the datasets

Let's start by building our DataBlock. We'll load the MRPC dataset from Hugging Face's datasets library, which will be cached after downloading via the load_dataset method. For more information on the datasets API, see the documentation here.

raw_datasets = load_dataset('glue', task) 
print(f'{raw_datasets}\n')
print(f'{raw_datasets[train_ds_name][0]}\n')
print(f'{raw_datasets[train_ds_name].features}\n')
Reusing dataset glue (/home/wgilliam/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad)
DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 1725
    })
})

{'idx': 0, 'label': 1, 'sentence1': 'Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .', 'sentence2': 'Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .'}

{'sentence1': Value(dtype='string', id=None), 'sentence2': Value(dtype='string', id=None), 'label': ClassLabel(num_classes=2, names=['not_equivalent', 'equivalent'], names_file=None, id=None), 'idx': Value(dtype='int32', id=None)}

There are a variety of ways we can preprocess the dataset for DataBlock consumption. For example, we could push the data into a DataFrame, add a boolean is_valid column, and use the ColSplitter method to define our train/validation splits like this:

raw_train_df = pd.DataFrame(raw_datasets[train_ds_name], columns=list(raw_datasets[train_ds_name].features.keys()))
raw_train_df['is_valid'] = False

raw_valid_df = pd.DataFrame(raw_datasets[valid_ds_name], columns=list(raw_datasets[train_ds_name].features.keys()))
raw_valid_df['is_valid'] = True

raw_df = pd.concat([raw_train_df, raw_valid_df])
print(len(raw_df))
raw_df.head()
4076
sentence1 sentence2 label idx is_valid
0 Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence . Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence . 1 0 False
1 Yucaipa owned Dominick 's before selling the chain to Safeway in 1998 for $ 2.5 billion . Yucaipa bought Dominick 's in 1995 for $ 693 million and sold it to Safeway for $ 1.8 billion in 1998 . 0 1 False
2 They had published an advertisement on the Internet on June 10 , offering the cargo for sale , he added . On June 10 , the ship 's owners had published an advertisement on the Internet , offering the explosives for sale . 1 2 False
3 Around 0335 GMT , Tab shares were up 19 cents , or 4.4 % , at A $ 4.56 , having earlier set a record high of A $ 4.57 . Tab shares jumped 20 cents , or 4.6 % , to set a record closing high at A $ 4.57 . 0 3 False
4 The stock rose $ 2.11 , or about 11 percent , to close Friday at $ 21.51 on the New York Stock Exchange . PG & E Corp. shares jumped $ 1.63 or 8 percent to $ 21.03 on the New York Stock Exchange on Friday . 1 4 False

Another option is to capture the indexes for both the train and validation sets, use the datasets library's concatenate_datasets function to put them into a single dataset, and finally use the IndexSplitter method to define our train/validation splits as such:

n_train, n_valid = raw_datasets[train_ds_name].num_rows, raw_datasets[valid_ds_name].num_rows
train_idxs, valid_idxs = L(range(n_train)), L(range(n_train, n_train + n_valid))
raw_ds = concatenate_datasets([raw_datasets[train_ds_name], raw_datasets[valid_ds_name]])
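IndexSplitter simply hands back the two index lists at split time, so the bookkeeping above reduces to plain Python. A quick sanity check using the MRPC row counts from the DatasetDict output above (no fastai required):

```python
# Row counts taken from the DatasetDict output above
n_train, n_valid = 3668, 408

train_idxs = list(range(n_train))
valid_idxs = list(range(n_train, n_train + n_valid))

# After concatenate_datasets, rows 0..3667 are train and 3668..4075 validation
print(len(train_idxs) + len(valid_idxs), valid_idxs[0], valid_idxs[-1])  # 4076 3668 4075
```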

Mid-level API

Prepare the huggingface objects

How many classes are we working with? Depending on which preprocessing approach you took above, use one of the two snippets below.

n_lbls = raw_df[task_target].nunique(); n_lbls
2
n_lbls = len(set([item[task_target] for item in raw_ds])); n_lbls
2
model_cls = AutoModelForSequenceClassification

config = AutoConfig.from_pretrained(pretrained_model_name)
config.num_labels = n_lbls

hf_arch, hf_config, hf_tokenizer, hf_model = BLURR.get_hf_objects(pretrained_model_name, 
                                                                  model_cls=model_cls, 
                                                                  config=config)

print(hf_arch)
print(type(hf_config))
print(type(hf_tokenizer))
print(type(hf_model))
roberta
<class 'transformers.models.roberta.configuration_roberta.RobertaConfig'>
<class 'transformers.models.roberta.tokenization_roberta_fast.RobertaTokenizerFast'>
<class 'transformers.models.roberta.modeling_roberta.RobertaForSequenceClassification'>

Build the DataBlock

blocks = (HF_TextBlock(hf_arch, hf_config, hf_tokenizer, hf_model), CategoryBlock())

def get_x(r, attr): 
    return r[attr] if (isinstance(attr, str)) else tuple(r[inp] for inp in attr)
    
dblock = DataBlock(blocks=blocks, 
                   get_x=partial(get_x, attr=task_inputs), 
                   get_y=ItemGetter(task_target), 
                   splitter=IndexSplitter(valid_idxs))
dls = dblock.dataloaders(raw_ds, bs=bsz, val_bs=val_bsz)
b = dls.one_batch()
len(b), b[0]['input_ids'].shape, b[1].shape
(2, torch.Size([16, 69]), torch.Size([16]))
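The get_x used above dispatches on whether the task feeds the model one text column or a pair of columns; its behavior can be verified on a plain dict without any fastai machinery:

```python
def get_x(r, attr):
    # a single column name returns that value; a list of names returns a tuple
    return r[attr] if isinstance(attr, str) else tuple(r[inp] for inp in attr)

row = {'sentence1': 'A cat sat.', 'sentence2': 'A cat was sitting.', 'label': 1}
print(get_x(row, 'sentence1'))                 # A cat sat.
print(get_x(row, ['sentence1', 'sentence2']))  # ('A cat sat.', 'A cat was sitting.')
```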
if ('token_type_ids' in b[0]):
    print([(hf_tokenizer.convert_ids_to_tokens(inp_id.item()), inp_id.item(), tt_id.item() )
           for inp_id, tt_id in zip (b[0]['input_ids'][0], b[0]['token_type_ids'][0]) 
           if inp_id != hf_tokenizer.pad_token_id])
dls.show_batch(dataloaders=dls, max_n=5)
text category
0 Amrozi accused his brother, whom he called " the witness ", of deliberately distorting his evidence. Referring to him as only " the witness ", Amrozi accused his brother of deliberately distorting his evidence. 1
1 All five were charged with robbery and criminal impersonation of a police officer. The teens are being held on charges of robbery and criminal impersonation of a police officer, sources said. 1
2 As part of a 2001 agreement to extradite them from Canada, prosecutors agreed not to seek the death penalty. As part of the agreement to extradite the two best friends from Canada, prosecutors agreed not to seek the death penalty for convictions. 1
3 They were not supplied or given to us but unearthed by our reporter, David Blair, in the Foreign Ministry in Baghdad. " They were not supplied or given to us, but unearthed by our reporter " in Iraq's foreign ministry, he said. 1
4 October gasoline prices settled 1.47 cents lower at 78.70 cents a gallon. October heating oil ended down 0.41 cent to 70.74 cents a gallon. 1

Train

With our DataLoaders built, we can now build our Learner and train. We'll use mixed precision so we can train with bigger batches.

model = HF_BaseModelWrapper(hf_model)

learn = Learner(dls, 
                model,
                opt_func=partial(Adam),
                loss_func=CrossEntropyLossFlat(),
                metrics=task_metrics,
                cbs=[HF_BaseModelCallback],
                splitter=hf_splitter).to_fp16()

learn.freeze()
learn.summary()
HF_BaseModelWrapper (Input shape: 16)
============================================================================
Layer (type)         Output Shape         Param #    Trainable 
============================================================================
                     16 x 69 x 768       
Embedding                                 38603520   False     
Embedding                                 394752     False     
Embedding                                 768        False     
LayerNorm                                 1536       True      
Dropout                                                        
Linear                                    590592     False     
Linear                                    590592     False     
Linear                                    590592     False     
Dropout                                                        
Linear                                    590592     False     
LayerNorm                                 1536       True      
Dropout                                                        
____________________________________________________________________________
                     16 x 69 x 3072      
Linear                                    2362368    False     
____________________________________________________________________________
                     16 x 69 x 768       
Linear                                    2360064    False     
LayerNorm                                 1536       True      
Dropout                                                        
Linear                                    590592     False     
Linear                                    590592     False     
Linear                                    590592     False     
Dropout                                                        
Linear                                    590592     False     
LayerNorm                                 1536       True      
Dropout                                                        
____________________________________________________________________________
                     16 x 69 x 3072      
Linear                                    2362368    False     
____________________________________________________________________________
                     16 x 69 x 768       
Linear                                    2360064    False     
LayerNorm                                 1536       True      
Dropout                                                        
Linear                                    590592     False     
Linear                                    590592     False     
Linear                                    590592     False     
Dropout                                                        
Linear                                    590592     False     
LayerNorm                                 1536       True      
Dropout                                                        
____________________________________________________________________________
                     16 x 69 x 3072      
Linear                                    2362368    False     
____________________________________________________________________________
                     16 x 69 x 768       
Linear                                    2360064    False     
LayerNorm                                 1536       True      
Dropout                                                        
Linear                                    590592     False     
Linear                                    590592     False     
Linear                                    590592     False     
Dropout                                                        
Linear                                    590592     False     
LayerNorm                                 1536       True      
Dropout                                                        
____________________________________________________________________________
                     16 x 69 x 3072      
Linear                                    2362368    False     
____________________________________________________________________________
                     16 x 69 x 768       
Linear                                    2360064    False     
LayerNorm                                 1536       True      
Dropout                                                        
Linear                                    590592     False     
Linear                                    590592     False     
Linear                                    590592     False     
Dropout                                                        
Linear                                    590592     False     
LayerNorm                                 1536       True      
Dropout                                                        
____________________________________________________________________________
                     16 x 69 x 3072      
Linear                                    2362368    False     
____________________________________________________________________________
                     16 x 69 x 768       
Linear                                    2360064    False     
LayerNorm                                 1536       True      
Dropout                                                        
Linear                                    590592     False     
Linear                                    590592     False     
Linear                                    590592     False     
Dropout                                                        
Linear                                    590592     False     
LayerNorm                                 1536       True      
Dropout                                                        
____________________________________________________________________________
                     16 x 69 x 3072      
Linear                                    2362368    False     
____________________________________________________________________________
                     16 x 69 x 768       
Linear                                    2360064    False     
LayerNorm                                 1536       True      
Dropout                                                        
Linear                                    590592     True      
Dropout                                                        
____________________________________________________________________________
                     16 x 2              
Linear                                    1538       True      
____________________________________________________________________________

Total params: 82,119,938
Total trainable params: 612,098
Total non-trainable params: 81,507,840

Optimizer used: functools.partial(<function Adam at 0x7f63b8cdad30>)
Loss function: FlattenedLoss of CrossEntropyLoss()

Model frozen up to parameter group #2

Callbacks:
  - TrainEvalCallback
  - HF_BaseModelCallback
  - MixedPrecision
  - Recorder
  - ProgressCallback
preds = model(b[0])
preds.logits.shape, preds
(torch.Size([16, 2]),
 SequenceClassifierOutput(loss=None, logits=tensor([[ 0.1071, -0.0940],
         [ 0.1020, -0.0825],
         [ 0.1007, -0.1009],
         [ 0.1005, -0.0866],
         [ 0.1160, -0.0963],
         [ 0.1053, -0.0829],
         [ 0.1144, -0.0933],
         [ 0.1095, -0.0989],
         [ 0.1001, -0.0940],
         [ 0.1097, -0.0874],
         [ 0.1112, -0.0802],
         [ 0.1112, -0.0944],
         [ 0.1115, -0.0962],
         [ 0.1098, -0.0901],
         [ 0.1239, -0.0915],
         [ 0.1121, -0.0852]], device='cuda:1', grad_fn=<AddmmBackward>), hidden_states=None, attentions=None))
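The logits above are unnormalized scores; a softmax turns each row into the class probabilities fastai reports later. A pure-Python sketch applied to the first row of the output above:

```python
import math

def softmax(logits):
    # exponentiate and normalize so the scores sum to 1
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([0.1071, -0.0940])  # first row of the logits above
print([round(p, 3) for p in probs])  # [0.55, 0.45]
```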
learn.lr_find(suggest_funcs=[minimum, steep, valley, slide])
/home/wgilliam/miniconda3/envs/blurr/lib/python3.9/site-packages/fastai/callback/schedule.py:270: UserWarning: color is redundantly defined by the 'color' keyword argument and the fmt string "ro" (-> color='r'). The keyword argument will take precedence.
  ax.plot(val, idx, 'ro', label=nm, c=color)
SuggestedLRs(minimum=3.311311302240938e-05, steep=0.02290867641568184, valley=tensor(0.0003), slide=tensor(0.0021))
learn.fit_one_cycle(1, lr_max=2e-3)
epoch train_loss valid_loss f1_score accuracy time
0 0.554214 0.487845 0.854400 0.776961 00:12
learn.unfreeze()
learn.lr_find(start_lr=1e-12, end_lr=2e-3, suggest_funcs=[minimum, steep, valley, slide])
/home/wgilliam/miniconda3/envs/blurr/lib/python3.9/site-packages/fastai/callback/schedule.py:270: UserWarning: color is redundantly defined by the 'color' keyword argument and the fmt string "ro" (-> color='r'). The keyword argument will take precedence.
  ax.plot(val, idx, 'ro', label=nm, c=color)
SuggestedLRs(minimum=1.3065426344993637e-12, steep=1.0546620000939644e-11, valley=tensor(2.7595e-05), slide=tensor(2.2275e-05))
learn.fit_one_cycle(2, lr_max=slice(2e-5, 2e-4))
epoch train_loss valid_loss f1_score accuracy time
0 0.465272 0.387093 0.869565 0.816176 00:21
1 0.294113 0.330454 0.895944 0.855392 00:21
learn.show_results(learner=learn, max_n=5)
text category target
0 He said the foodservice pie business doesn 't fit the company's long-term growth strategy. " The foodservice pie business does not fit our long-term growth strategy. 1 1
1 The first products are likely to be dongles costing between US $ 100 and US $ 150 that will establish connections between consumer electronics devices and PCs. The first products will likely be dongles costing $ 100 to $ 150 that will establish connections between consumer electronics devices and PCs. 1 1
2 About 10 percent of high school and 16 percent of elementary students must be proficient at math. In math, 16 percent of elementary and middle school students and 9.6 percent of high school students must be proficient. 1 1
3 The decision came a year after Whipple ended federal oversight of the district's racial balance, facilities, budget, and busing. The decision came a year after Whipple ended federal oversight of school busing as well as the district's racial balance, facilities and budget. 1 1
4 Parson was charged with intentionally causing and attempting to cause damage to protected computers. Parson is charged with one count of intentionally causing damage to a protected computer. 1 1

Evaluate

How did we do?

val_res = learn.validate()
val_res_d = { 'loss': val_res[0]}
for idx, m in enumerate(learn.metrics):
    val_res_d[m.name] = val_res[idx+1]
    
val_res_d
{'loss': 0.33045369386672974,
 'f1_score': 0.8959435626102292,
 'accuracy': 0.8553921580314636}
preds, targs, losses = learn.get_preds(with_loss=True)
print(preds.shape, targs.shape, losses.shape)
print(losses.mean(), accuracy(preds, targs))
torch.Size([408, 2]) torch.Size([408]) torch.Size([408])
TensorBase(0.3305) TensorBase(0.8554)
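Since get_preds hands back per-item probabilities and targets, any metric can be recomputed by hand as a sanity check. A toy accuracy calculation in pure Python (made-up numbers, not the MRPC results above):

```python
# Toy probabilities and targets; the real ones come from learn.get_preds
preds = [[0.9, 0.1], [0.2, 0.8], [0.4, 0.6], [0.7, 0.3]]
targs = [0, 1, 0, 0]

# argmax of each probability row, then fraction that match the targets
pred_lbls = [max(range(len(p)), key=p.__getitem__) for p in preds]
acc = sum(int(p == t) for p, t in zip(pred_lbls, targs)) / len(targs)
print(pred_lbls, acc)  # [0, 1, 1, 0] 0.75
```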

Inference

Let's do item inference on an example from our test dataset

raw_test_df = pd.DataFrame(raw_datasets[test_ds_name], columns=list(raw_datasets[test_ds_name].features.keys()))
raw_test_df.head(10)
sentence1 sentence2 label idx
0 PCCW 's chief operating officer , Mike Butcher , and Alex Arena , the chief financial officer , will report directly to Mr So . Current Chief Operating Officer Mike Butcher and Group Chief Financial Officer Alex Arena will report to So . 1 0
1 The world 's two largest automakers said their U.S. sales declined more than predicted last month as a late summer sales frenzy caused more of an industry backlash than expected . Domestic sales at both GM and No. 2 Ford Motor Co. declined more than predicted as a late summer sales frenzy prompted a larger-than-expected industry backlash . 1 1
2 According to the federal Centers for Disease Control and Prevention ( news - web sites ) , there were 19 reported cases of measles in the United States in 2002 . The Centers for Disease Control and Prevention said there were 19 reported cases of measles in the United States in 2002 . 1 2
3 A tropical storm rapidly developed in the Gulf of Mexico Sunday and was expected to hit somewhere along the Texas or Louisiana coasts by Monday night . A tropical storm rapidly developed in the Gulf of Mexico on Sunday and could have hurricane-force winds when it hits land somewhere along the Louisiana coast Monday night . 0 3
4 The company didn 't detail the costs of the replacement and repairs . But company officials expect the costs of the replacement work to run into the millions of dollars . 0 4
5 The settling companies would also assign their possible claims against the underwriters to the investor plaintiffs , he added . Under the agreement , the settling companies will also assign their potential claims against the underwriters to the investors , he added . 1 5
6 Air Commodore Quaife said the Hornets remained on three-minute alert throughout the operation . Air Commodore John Quaife said the security operation was unprecedented . 0 6
7 A Washington County man may have the countys first human case of West Nile virus , the health department said Friday . The countys first and only human case of West Nile this year was confirmed by health officials on Sept . 8 . 1 7
8 Moseley and a senior aide delivered their summary assessments to about 300 American and allied military officers on Thursday . General Moseley and a senior aide presented their assessments at an internal briefing for American and allied military officers at Nellis Air Force Base in Nevada on Thursday . 1 8
9 The broader Standard & Poor 's 500 Index < .SPX > was 0.46 points lower , or 0.05 percent , at 997.02 . The technology-laced Nasdaq Composite Index .IXIC was up 7.42 points , or 0.45 percent , at 1,653.44 . 0 9
learn.blurr_predict(raw_test_df.iloc[9].to_dict())
[(('0',), (#1) [tensor(0)], (#1) [tensor([0.9511, 0.0489])])]

Let's do batch inference on the entire test dataset

test_dl = dls.test_dl(raw_datasets[test_ds_name])
preds = learn.get_preds(dl=test_dl)
preds
(tensor([[0.0093, 0.9907],
         [0.0677, 0.9323],
         [0.0070, 0.9930],
         ...,
         [0.1376, 0.8624],
         [0.0072, 0.9928],
         [0.0161, 0.9839]]),
 None)
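Those rows are class probabilities, so mapping each test item back to MRPC's label names is just an argmax plus a lookup into the ClassLabel names seen earlier (sketched here with the first two rows of the output above):

```python
class_names = ['not_equivalent', 'equivalent']  # ClassLabel names from the dataset

probs = [[0.0093, 0.9907], [0.0677, 0.9323]]  # first two rows of the output above
labels = [class_names[max(range(len(p)), key=p.__getitem__)] for p in probs]
print(labels)  # ['equivalent', 'equivalent']
```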

High-level API

With the high-level API, we can create our DataBlock, DataLoaders, and Blearner in one line of code.

dl_kwargs = {'bs': bsz, 'val_bs': val_bsz}
learn_kwargs = { 'metrics': task_metrics }

learn = BlearnerForSequenceClassification.from_dataframe(raw_df, pretrained_model_name, 
                                                         text=task_inputs, label=task_target,
                                                         dl_kwargs=dl_kwargs, learner_kwargs=learn_kwargs)
learn.fit_one_cycle(1, lr_max=2e-3)
epoch train_loss valid_loss f1_score accuracy time
0 0.509683 0.490516 0.847512 0.767157 00:11
learn.show_results(learner=learn, max_n=5)
text category target
0 He said the foodservice pie business doesn 't fit the company's long-term growth strategy. " The foodservice pie business does not fit our long-term growth strategy. 1 1
1 He was arrested Friday night at an Alpharetta seafood restaurant while dining with his wife, singer Whitney Houston. He was arrested again Friday night at an Alpharetta restaurant where he was having dinner with his wife. 1 1
2 After losing as much as 84.56 earlier, the Dow Jones industrial average closed up 22.81, or 0.2 percent, at 9,340.45. In midday trading, the Dow Jones industrial average lost 68.84, or 0.7 percent, to 9,248.80. 0 0
3 In that position, Elias will report to Joe Tucci, president and CEO of EMC. As executive vice president of new ventures, Elias will report to Joe Tucci, EMC's president and chief executive. 1 1
4 We strongly disagree with Novell's position and view it as a desperate measure to curry favor with the Linux community. McBride characterized Novell's move as " a desperate measure to curry favor with the Linux community. " 0 1

Summary

The general flow of this notebook was inspired by Zach Mueller's "Text Classification with Transformers" example that can be found in the wonderful Walk With Fastai docs. Take a look there for another approach to working with fast.ai and Hugging Face on GLUE tasks.