This module contains the bits required to use the fastai DataBlock API and/or mid-level data processing pipelines to organize your data for token classification tasks (e.g., named entity recognition (NER)).
 
What we're running with at the time this documentation was generated:
torch: 1.9.0+cu102
fastai: 2.5.2
transformers: 4.10.0

Token classification tokenization, batch transform, and DataBlock methods

Token classification tasks attempt to predict a class for each token. The idea is similar to that in image segmentation models where the objective is to predict a class for each pixel. Such models are common in building named entity recognition (NER) systems.
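For an illustrative picture (this example is not taken from the dataset below), a token classification target pairs every token with exactly one label, here using the BIO tagging scheme that GermEval 2014 uses:

# Illustrative only: one label per token, using the BIO scheme
tokens = ['Angela', 'Merkel', 'besuchte', 'Paris', '.']
token_labels = ['B-PER', 'I-PER', 'O', 'B-LOC', 'O']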

import ast
from pathlib import Path

import pandas as pd

# Convert the stringified lists in the CSV back into Python lists on load
df_converters = {'tokens': ast.literal_eval, 'labels': ast.literal_eval, 'nested-labels': ast.literal_eval}

path = Path('./')
germ_eval_df = pd.read_csv(path/'germeval2014_sample.csv', converters=df_converters); len(germ_eval_df)
1000
labels = sorted(list(set([lbls for sublist in germ_eval_df.labels.tolist() for lbls in sublist])))
print(labels)
['B-LOC', 'B-LOCderiv', 'B-LOCpart', 'B-ORG', 'B-ORGpart', 'B-OTH', 'B-OTHderiv', 'B-OTHpart', 'B-PER', 'B-PERderiv', 'B-PERpart', 'I-LOC', 'I-LOCderiv', 'I-ORG', 'I-ORGpart', 'I-OTH', 'I-PER', 'O']
model_cls = AutoModelForTokenClassification

# pretrained_model_name = "bert-base-multilingual-cased"
pretrained_model_name = 'roberta-base'
n_labels = len(labels)

hf_arch, hf_config, hf_tokenizer, hf_model = BLURR.get_hf_objects(pretrained_model_name, 
                                                                  model_cls=model_cls,
                                                                  config_kwargs={'num_labels': n_labels})
hf_arch, type(hf_config), type(hf_tokenizer), type(hf_model)
('roberta',
 transformers.models.roberta.configuration_roberta.RobertaConfig,
 transformers.models.roberta.tokenization_roberta_fast.RobertaTokenizerFast,
 transformers.models.roberta.modeling_roberta.RobertaForTokenClassification)

Below, we define a new class and transform for token classification targets/predictions.

class HF_TokenTensorCategory[source]

HF_TokenTensorCategory(x, **kwargs) :: TensorBase

A Tensor which supports subclass pickling, and maintains metadata when casting or after methods

class HF_TokenCategorize[source]

HF_TokenCategorize(vocab=None, ignore_token=None, ignore_token_id=None) :: Transform

Reversible transform of a list of category strings to vocab ids

Parameters:

  • vocab : <class 'NoneType'>, optional

    The unique list of entities (e.g., B-LOC) (default: CategoryMap(vocab))

  • ignore_token : <class 'NoneType'>, optional

    The token used to identify ignored tokens (default: xIGNx)

  • ignore_token_id : <class 'NoneType'>, optional

    The token ID that should be ignored when calculating the loss (default: CrossEntropyLossFlat().ignore_index)

HF_TokenCategorize modifies the fastai Categorize transform in two ways. First, it allows your targets to consist of one Category per token, and second, it uses an ignore_token_id to mask sub-tokens that don't need a prediction. For example, the targets of special tokens (e.g., pad, cls, sep) are set to ignore_token_id, as are any trailing sub-tokens of a word that the tokenizer splits into more than one sub-token (see the sketch below).
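The sketch below shows the masking idea in plain Python. It is illustrative only (not the library's actual implementation): only the first sub-token of each word keeps its label id, while continuation sub-tokens and special tokens receive the ignore index so CrossEntropyLossFlat skips them.

IGNORE_IDX = -100  # matches CrossEntropyLossFlat().ignore_index

def align_labels(label_ids, n_subtoks_per_word):
    # Keep the label on the first sub-token; mask the rest with the ignore index
    aligned = []
    for lbl_id, n_subtoks in zip(label_ids, n_subtoks_per_word):
        aligned += [lbl_id] + [IGNORE_IDX] * (n_subtoks - 1)
    return aligned

# e.g., a 'B-PER' word kept as 1 sub-token followed by a 'B-LOC' word split into 3 sub-tokens
print(align_labels([labels.index('B-PER'), labels.index('B-LOC')], [1, 3]))
# [8, 0, -100, -100]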

HF_TokenCategoryBlock[source]

HF_TokenCategoryBlock(vocab=None, ignore_token=None, ignore_token_id=None)

TransformBlock for per-token categorical targets

Parameters:

  • vocab : <class 'NoneType'>, optional

    The unique list of entities (e.g., B-LOC) (default: CategoryMap(vocab))

  • ignore_token : <class 'NoneType'>, optional

    The token used to identify ignored tokens (default: xIGNx)

  • ignore_token_id : <class 'NoneType'>, optional

    The token ID that should be ignored when calculating the loss (default: CrossEntropyLossFlat().ignore_index)

Again, we define a custom class, HF_TokenClassInput, for the @typedispatched methods to use so that we can override how token classification inputs/targets are assembled, as well as how the data is shown via methods like show_batch and show_results.

class HF_TokenClassInput[source]

HF_TokenClassInput(x, **kwargs) :: HF_BaseInput

The base representation of your inputs; used by the various fastai show methods

class HF_TokenClassBeforeBatchTransform[source]

HF_TokenClassBeforeBatchTransform(hf_arch:str, hf_config:PretrainedConfig, hf_tokenizer:PreTrainedTokenizerBase, hf_model:PreTrainedModel, ignore_token_id=-100, max_length:int=None, padding:Union[bool, str]=True, truncation:Union[bool, str]=True, is_split_into_words:bool=False, tok_kwargs={}, **kwargs) :: HF_BeforeBatchTransform

Handles everything you need to assemble a mini-batch of inputs and targets, as well as decode the dictionary produced as a byproduct of the tokenization process in the encodes method.

Parameters:

  • hf_arch : <class 'str'>

    The abbreviation/name of your Hugging Face transformer architecture (e.g., bert, bart, etc.)

  • hf_config : <class 'transformers.configuration_utils.PretrainedConfig'>

    A specific configuration instance you want to use

  • hf_tokenizer : <class 'transformers.tokenization_utils_base.PreTrainedTokenizerBase'>

    A Hugging Face tokenizer

  • hf_model : <class 'transformers.modeling_utils.PreTrainedModel'>

    A Hugging Face model

  • ignore_token_id : <class 'int'>, optional

    The token ID that should be ignored when calculating the loss

  • max_length : <class 'int'>, optional

    To control the length of the padding/truncation. It can be an integer or None, in which case it will default to the maximum length the model can accept. If the model has no specific maximum input length, truncation/padding to max_length is deactivated. See [Everything you always wanted to know about padding and truncation](https://huggingface.co/transformers/preprocessing.html#everything-you-always-wanted-to-know-about-padding-and-truncation)

  • padding : typing.Union[bool, str], optional

    To control the `padding` applied to your `hf_tokenizer` during tokenization. If None, will default to `False` or `'do_not_pad'`. See [Everything you always wanted to know about padding and truncation](https://huggingface.co/transformers/preprocessing.html#everything-you-always-wanted-to-know-about-padding-and-truncation)

  • truncation : typing.Union[bool, str], optional

    To control `truncation` applied to your `hf_tokenizer` during tokenization. If None, will default to `False` or `do_not_truncate`. See [Everything you always wanted to know about padding and truncation](https://huggingface.co/transformers/preprocessing.html#everything-you-always-wanted-to-know-about-padding-and-truncation)

  • is_split_into_words : <class 'bool'>, optional

    The `is_split_into_words` argument applied to your `hf_tokenizer` during tokenization. Set this to `True` if your inputs are pre-tokenized (not numericalized)

  • tok_kwargs : <class 'dict'>, optional

    Any other keyword arguments you want included when using your `hf_tokenizer` to tokenize your inputs

  • kwargs : <class 'inspect._empty'>

HF_TokenClassBeforeBatchTransform is used to exclude from the loss calculation any target tokens we don't care about (e.g., padding, cls, sep, etc.).

before_batch_tfm = HF_TokenClassBeforeBatchTransform(hf_arch, hf_config, hf_tokenizer, hf_model,
                                                     is_split_into_words=True, 
                                                     tok_kwargs={ 'return_special_tokens_mask': True })

blocks = (
    HF_TextBlock(before_batch_tfm=before_batch_tfm, input_return_type=HF_TokenClassInput), 
    HF_TokenCategoryBlock(vocab=labels)
)

def get_y(inp):
    return [ (label, len(hf_tokenizer.tokenize(str(entity)))) for entity, label in zip(inp.tokens, inp.labels) ]

dblock = DataBlock(blocks=blocks, get_x=ColReader('tokens'), get_y=get_y, splitter=RandomSplitter())

Note that in the example above we had to define a get_y that returns both the entity we want to predict a category for and the number of sub-tokens the hf_tokenizer uses to represent it. This is necessary for the input/target alignment discussed above.
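To make the alignment concrete, here is a quick peek at what get_y produces for a single row (illustrative; the exact sub-token counts depend on the tokenizer you chose above):

# Each target is a (label, number-of-sub-tokens) tuple so the batch transform knows
# how many sub-token positions each label has to cover
row = germ_eval_df.iloc[0]
print(get_y(row)[:5])
# e.g., [('O', 1), ('O', 3), ('B-OTH', 2), ...]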

dls = dblock.dataloaders(germ_eval_df, bs=4)
b = dls.one_batch()
len(b), b[0]['input_ids'].shape, b[1].shape
(2, torch.Size([4, 98]), torch.Size([4, 98]))
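As a quick sanity check (illustrative, not from the library docs), the masked target positions should carry the ignore index (-100 by default), which the loss function will skip:

# Count how many target positions are masked out of the loss calculation
n_ignored = (b[1] == -100).sum().item()
print(f'{n_ignored} of {b[1].numel()} target positions are ignored by the loss')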
dls.show_batch(dataloaders=dls, max_n=2, trunc_at=10)
| | token / target label |
|---|---|
| 0 | [('Helbig', 'B-OTH'), ('et', 'I-OTH'), ('al', 'I-OTH'), ('.', 'O'), ('(', 'O'), ('1994', 'O'), (')', 'O'), ('S.', 'O'), ('593.', 'O'), ('Wink', 'B-OTH')] |
| 1 | [('Scenes', 'B-OTH'), ('of', 'I-OTH'), ('a', 'I-OTH'), ('Sexual', 'I-OTH'), ('Nature', 'I-OTH'), ('(', 'O'), ('GB', 'O'), ('2006', 'O'), (')', 'O'), ('-', 'O')] |

Tests

The tests below ensure the core DataBlock code above works for all pretrained token classification models available in Hugging Face. These tests are excluded from the CI workflow because of how long they take to run and the amount of data they would need to download.

Note: Feel free to modify the code below to test whatever pretrained token classification models you are working with ... and if any of them fail, please submit a GitHub issue (or a PR if you'd like to fix it yourself)

| | arch | tokenizer | model_name | result | error |
|---|---|---|---|---|---|
| 0 | albert | AlbertTokenizerFast | albert-base-v1 | PASSED | |
| 1 | bert | BertTokenizerFast | bert-base-multilingual-cased | PASSED | |
| 2 | camembert | CamembertTokenizerFast | camembert-base | PASSED | |
| 3 | distilbert | DistilBertTokenizerFast | distilbert-base-uncased | PASSED | |
| 4 | electra | ElectraTokenizerFast | monologg/electra-small-finetuned-imdb | PASSED | |
| 5 | flaubert | FlaubertTokenizer | flaubert/flaubert_small_cased | PASSED | |
| 6 | funnel | FunnelTokenizerFast | huggingface/funnel-small-base | PASSED | |
| 7 | longformer | LongformerTokenizerFast | allenai/longformer-base-4096 | PASSED | |
| 8 | mpnet | MPNetTokenizerFast | microsoft/mpnet-base | PASSED | |
| 9 | mobilebert | MobileBertTokenizerFast | google/mobilebert-uncased | PASSED | |
| 10 | roberta | RobertaTokenizerFast | roberta-base | PASSED | |
| 11 | squeezebert | SqueezeBertTokenizerFast | squeezebert/squeezebert-uncased | PASSED | |
| 12 | xlm | XLMTokenizer | xlm-mlm-en-2048 | PASSED | |
| 13 | xlm_roberta | XLMRobertaTokenizerFast | xlm-roberta-base | PASSED | |
| 14 | xlnet | XLNetTokenizerFast | xlnet-base-cased | PASSED | |

Summary

This module includes all the low-, mid-, and high-level API bits for preparing data for token classification tasks.