This module contains the bits required to use the fastai DataBlock API and/or mid-level data processing pipelines to organize your data for token classification tasks (e.g., named entity recognition (NER)).
torch.cuda.set_device(1)
print(f'Using GPU #{torch.cuda.current_device()}: {torch.cuda.get_device_name()}')
Using GPU #1: GeForce GTX 1080 Ti

Token classification tokenization, batch transform, and DataBlock methods

Token classification tasks attempt to predict a class for each token. The idea is similar to that in image segmentation models where the objective is to predict a class for each pixel. Such models are common in building named entity recognition (NER) systems.
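
For example, in the IOB tagging scheme used by the GermEval 2014 data loaded below, every token carries its own label (a made-up example):

tokens = ['Angela', 'Merkel', 'besuchte', 'Berlin', '.']
labels = ['B-PER',  'I-PER',  'O',        'B-LOC',  'O']
# one label per token: 'B-' marks the beginning of an entity,
# 'I-' marks its continuation, and 'O' marks tokens outside any entity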

# the 'tokens', 'labels', and 'nested-labels' columns are stored as stringified lists, so parse them back into Python objects
df_converters = {'tokens': ast.literal_eval, 'labels': ast.literal_eval, 'nested-labels': ast.literal_eval}

path = Path('./')
germ_eval_df = pd.read_csv(path/'germeval2014_sample.csv', converters=df_converters); len(germ_eval_df)
1000
labels = sorted(list(set([lbls for sublist in germ_eval_df.labels.tolist() for lbls in sublist])))
print(labels)
['B-LOC', 'B-LOCderiv', 'B-LOCpart', 'B-ORG', 'B-ORGpart', 'B-OTH', 'B-OTHderiv', 'B-OTHpart', 'B-PER', 'B-PERderiv', 'B-PERpart', 'I-LOC', 'I-LOCderiv', 'I-ORG', 'I-ORGpart', 'I-OTH', 'I-PER', 'O']
task = HF_TASKS_AUTO.TokenClassification

pretrained_model_name = "bert-base-multilingual-cased"
n_labels = len(labels)

hf_arch, hf_config, hf_tokenizer, hf_model = BLURR_MODEL_HELPER.get_hf_objects(pretrained_model_name, 
                                                                               task=task,
                                                                               config_kwargs={'num_labels': n_labels})
hf_arch, type(hf_config), type(hf_tokenizer), type(hf_model)


('bert',
 transformers.configuration_bert.BertConfig,
 transformers.tokenization_bert.BertTokenizer,
 transformers.modeling_bert.BertForTokenClassification)

Below, we define a new class and transform for token classification targets/predictions.

class HF_TokenTensorCategory[source]

HF_TokenTensorCategory(x, **kwargs) :: TensorBase

class HF_TokenCategorize[source]

HF_TokenCategorize(vocab=None, ignore_token=None, ignore_token_id=None) :: Transform

Reversible transform of a list of category strings to vocab ids

HF_TokenCategorize modifies the fastai Categorize transform in a couple of ways. First, it allows your targets to consist of one category per token; second, it uses an ignore_token_id to mask sub-tokens that don't need a prediction. For example, the targets of special tokens (e.g., pad, cls, sep) are set to ignore_token_id, as are any trailing sub-tokens when a token is split into more than one sub-token.
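
As a rough sketch of that idea (assuming the default value of -100 and a tokenizer that splits a word into multiple sub-tokens), the per-word targets get expanded to per-sub-token targets along these lines (an illustration, not the transform's actual code):

ignore_token_id = -100  # default used to mask positions we don't want predictions for

word_labels = [('Obama', 'B-PER'), ('spricht', 'O')]

targets = []
for word, label in word_labels:
    sub_toks = hf_tokenizer.tokenize(word)                        # e.g., ['Ob', '##ama']
    targets += [label] + [ignore_token_id] * (len(sub_toks) - 1)  # only the 1st sub-token keeps the label

# targets -> e.g., ['B-PER', -100, 'O']; special tokens (cls, sep, pad) are assigned ignore_token_id as well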

HF_TokenCategoryBlock[source]

HF_TokenCategoryBlock(vocab=None, ignore_token=None, ignore_token_id=None)

TransformBlock for per-token (single-label) categorical targets

Again, we define a custom class, HF_TokenClassInput, for the @typedispatch methods to use so that we can override how token classification inputs/targets are assembled, as well as how the data is shown via methods like show_batch and show_results.
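
For reference, fastai's @typedispatch mechanism selects an implementation based on the type annotations of the arguments, so annotating x as HF_TokenClassInput is what lets blurr take over the display logic. A simplified sketch of the pattern (not the library's actual implementation):

from fastcore.dispatch import typedispatch

@typedispatch
def show_batch(x:HF_TokenClassInput, y, samples, dataloaders=None, ctxs=None, max_n=6, **kwargs):
    # dispatched here because `x` is an HF_TokenClassInput, so we control how
    # (token, label) pairs are rendered instead of showing raw tensors
    for sample in samples[:max_n]: print(sample)
    return ctxs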

class HF_TokenClassInput[source]

HF_TokenClassInput(x, **kwargs) :: HF_BaseInput

class HF_TokenClassBeforeBatchTransform[source]

HF_TokenClassBeforeBatchTransform(hf_arch, hf_tokenizer, ignore_token_id=-100, max_length=None, padding=True, truncation=True, is_split_into_words=True, n_tok_inps=1, tok_kwargs={}, **kwargs) :: HF_BeforeBatchTransform

Handles everything you need to assemble a mini-batch of inputs and targets, as well as decode the dictionary produced as a byproduct of the tokenization process in the encodes method.

HF_TokenClassBeforeBatchTransform is used to exclude any of the target tokens we don't want to include in the loss calculation (e.g., padding, cls, sep, etc.).
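
Those excluded positions are set to -100, which is also the default ignore_index of PyTorch's cross-entropy loss, so they contribute nothing to the loss or the gradient. A minimal illustration (shapes chosen arbitrarily):

import torch
import torch.nn.functional as F

num_classes = len(labels)                     # 18 for the GermEval sample
logits = torch.randn(1, 4, num_classes)       # (bs, seq_len, n_labels)
targs  = torch.tensor([[17, -100, 0, -100]])  # -100 marks cls/sep/pad and trailing sub-tokens

# F.cross_entropy skips targets equal to -100 by default (ignore_index=-100)
loss = F.cross_entropy(logits.view(-1, num_classes), targs.view(-1))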

before_batch_tfm = HF_TokenClassBeforeBatchTransform(hf_arch, hf_tokenizer,
                                                     is_split_into_words=True, 
                                                     tok_kwargs={ 'return_special_tokens_mask': True })

blocks = (
    HF_TextBlock(before_batch_tfms=before_batch_tfm, input_return_type=HF_TokenClassInput), 
    HF_TokenCategoryBlock(vocab=labels)
)

def get_y(inp):
    # for each token, return its label along with the number of sub-tokens the
    # hf_tokenizer splits it into (needed to align targets with the model inputs)
    return [ (label, len(hf_tokenizer.tokenize(str(entity)))) for entity, label in zip(inp.tokens, inp.labels) ]

dblock = DataBlock(blocks=blocks, 
                   get_x=ColReader('tokens'),
                   get_y=get_y,
                   splitter=RandomSplitter())

Note that in the example above we had to define a get_y that returns both the entity we want to predict a category for and the number of sub-tokens the hf_tokenizer uses to represent it. This is necessary for the input/target alignment discussed above.
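
To see what get_y produces, apply it to a row of the DataFrame; each entry is a (label, n_subtokens) tuple, with the exact counts depending on the row and the tokenizer:

row = germ_eval_df.iloc[0]
get_y(row)[:3]
# e.g., [('O', 1), ('B-LOC', 2), ('O', 1)] -- actual values vary with the row and tokenizer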

 
dls = dblock.dataloaders(germ_eval_df, bs=4)
b = dls.one_batch()
len(b), b[0]['input_ids'].shape, b[1].shape
(2, torch.Size([4, 76]), torch.Size([4, 76]))
dls.show_batch(dataloaders=dls, max_n=2, trunc_at=10)
token / target label
0 [('Helbig', 'B-OTH'), ('et', 'I-OTH'), ('al', 'I-OTH'), ('.', 'O'), ('(', 'O'), ('1994', 'O'), (')', 'O'), ('S', 'O'), ('.', 'O'), ('593', 'B-OTH')]
1 [('Der', 'O'), ('28', 'O'), ('-', 'O'), ('Jährige', 'O'), ('und', 'O'), ('sein', 'O'), ('Team', 'O'), (',', 'O'), ('zu', 'O'), ('dem', 'B-PER')]

Tests

The tests below ensure that the core DataBlock code above works for all pretrained token classification models available in Hugging Face. These tests are excluded from the CI workflow because of how long they would take to run and the amount of data that would need to be downloaded.

Note: Feel free to modify the code below to test whichever pretrained token classification models you are working with ... and if any of your pretrained token classification models fail, please submit a GitHub issue (or a PR if you'd like to fix it yourself).

BLURR_MODEL_HELPER.get_models(task='TokenClassification')
[transformers.modeling_albert.AlbertForTokenClassification,
 transformers.modeling_auto.AutoModelForTokenClassification,
 transformers.modeling_bert.BertForTokenClassification,
 transformers.modeling_camembert.CamembertForTokenClassification,
 transformers.modeling_distilbert.DistilBertForTokenClassification,
 transformers.modeling_electra.ElectraForTokenClassification,
 transformers.modeling_flaubert.FlaubertForTokenClassification,
 transformers.modeling_funnel.FunnelForTokenClassification,
 transformers.modeling_layoutlm.LayoutLMForTokenClassification,
 transformers.modeling_longformer.LongformerForTokenClassification,
 transformers.modeling_mobilebert.MobileBertForTokenClassification,
 transformers.modeling_roberta.RobertaForTokenClassification,
 transformers.modeling_squeezebert.SqueezeBertForTokenClassification,
 transformers.modeling_xlm.XLMForTokenClassification,
 transformers.modeling_xlm_roberta.XLMRobertaForTokenClassification,
 transformers.modeling_xlnet.XLNetForTokenClassification]
pretrained_model_names = [
    'albert-base-v1',
    'bert-base-multilingual-cased',
    'camembert-base',
    'distilbert-base-uncased',
    'monologg/electra-small-finetuned-imdb',
    'allenai/longformer-base-4096',
    'google/mobilebert-uncased',
    'roberta-base',
    'xlm-mlm-en-2048',
    'xlm-roberta-base',
    'xlnet-base-cased'
]
#hide_output
task = HF_TASKS_AUTO.TokenClassification
bsz = 2
seq_sz = 128

test_results = []
for model_name in pretrained_model_names:
    error=None
    
    print(f'=== {model_name} ===\n')
    
    hf_arch, hf_config, hf_tokenizer, hf_model = BLURR_MODEL_HELPER.get_hf_objects(model_name, task=task)
    print(f'architecture:\t{hf_arch}\ntokenizer:\t{type(hf_tokenizer).__name__}\n')
    
    before_batch_tfm = HF_TokenClassBeforeBatchTransform(hf_arch, hf_tokenizer, 
                                                         padding='max_length', 
                                                         max_length=seq_sz, 
                                                         is_split_into_words=True, 
                                                         tok_kwargs={ 'return_special_tokens_mask': True })

    blocks = (
        HF_TextBlock(hf_arch, hf_tokenizer, 
                     before_batch_tfms=before_batch_tfm, 
                     input_return_type=HF_TokenClassInput), 
        HF_TokenCategoryBlock(vocab=labels)
    )

    dblock = DataBlock(blocks=blocks, 
                       get_x=ColReader('tokens'),
                       get_y= lambda inp: [ (label, len(hf_tokenizer.tokenize(str(entity)))) for entity, label in zip(inp.tokens, inp.labels) ],
                       splitter=RandomSplitter())
    
    dls = dblock.dataloaders(germ_eval_df, bs=bsz)
    b = dls.one_batch()
    
    try:
        print('*** TESTING DataLoaders ***\n')
        test_eq(len(b), 2)
        test_eq(len(b[0]['input_ids']), bsz)
        test_eq(b[0]['input_ids'].shape, torch.Size([bsz, seq_sz]))
        test_eq(len(b[1]), bsz)

        if (hasattr(hf_tokenizer, 'add_prefix_space')):
            test_eq(dls.before_batch[0].tok_kwargs['add_prefix_space'], True)

        test_results.append((hf_arch, type(hf_tokenizer).__name__, model_name, 'PASSED', ''))
        dls.show_batch(dataloaders=dls, max_n=2, trunc_at=10)
        
    except Exception as err:
        test_results.append((hf_arch, type(hf_tokenizer).__name__, model_name, 'FAILED', err))
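
The summary table below can be built from test_results along these lines (a sketch; the notebook's own display cell is hidden and may differ):

test_results_df = pd.DataFrame(test_results, columns=['arch', 'tokenizer', 'model_name', 'result', 'error'])
test_results_df
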
arch tokenizer model_name result error
0 albert AlbertTokenizer albert-base-v1 PASSED
1 bert BertTokenizer bert-base-multilingual-cased PASSED
2 camembert CamembertTokenizer camembert-base PASSED
3 distilbert DistilBertTokenizer distilbert-base-uncased PASSED
4 electra ElectraTokenizer monologg/electra-small-finetuned-imdb PASSED
5 longformer LongformerTokenizer allenai/longformer-base-4096 PASSED
6 mobilebert MobileBertTokenizer google/mobilebert-uncased PASSED
7 roberta RobertaTokenizer roberta-base PASSED
8 xlm XLMTokenizer xlm-mlm-en-2048 PASSED
9 xlm_roberta XLMRobertaTokenizer xlm-roberta-base PASSED
10 xlnet XLNetTokenizer xlnet-base-cased PASSED

Cleanup