This module contains the bits required to use the fastai DataBlock API and/or mid-level data processing pipelines to organize your data for question/answering tasks.
 
What we're running with at the time this documentation was generated:
torch: 1.9.0+cu102
fastai: 2.5.2
transformers: 4.10.0

Question/Answering tokenization, batch transform, and DataBlock methods

Question/Answering tasks require two text inputs: a question and a context that includes the answer. The objective is to predict the start/end tokens of the answer within the context.
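To make the objective concrete, here is a minimal sketch (not part of the library) that uses a fast tokenizer's offset mapping to locate the token-level start/end of an answer inside a context. The context, question, and answer strings are made up purely for illustration.

from transformers import AutoTokenizer

# hypothetical example data, purely for illustration
context = "The Eiffel Tower was completed in 1889 and is located in Paris."
question = "When was the Eiffel Tower completed?"
answer = "1889"

tokenizer = AutoTokenizer.from_pretrained('roberta-base')

# character-level span of the answer within the context
char_start = context.find(answer)
char_end = char_start + len(answer)

# tokenize the (question, context) pair, asking for per-token character offsets
enc = tokenizer(question, context, return_offsets_mapping=True)

# map the character span to a token span; sequence_ids() tells us which tokens
# belong to the context (sequence 1) vs. the question (sequence 0)
tok_start = tok_end = None
for idx, (seq_id, (start, end)) in enumerate(zip(enc.sequence_ids(), enc['offset_mapping'])):
    if seq_id != 1: continue
    if start <= char_start < end: tok_start = idx
    if start < char_end <= end: tok_end = idx

# the start/end token positions are what the model learns to predict
print(tok_start, tok_end)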

path = Path('./')
squad_df = pd.read_csv(path/'squad_sample.csv'); len(squad_df)

We've provided a simple subset of a pre-processed SQuAD v2 dataset just for demonstration purposes. There is a lot that could be done to make this better and more fully functional; the idea here is simply to show how things can work for tasks beyond sequence classification.

squad_df.head(2)
model_cls = AutoModelForQuestionAnswering

pretrained_model_name = 'roberta-base' #'xlm-mlm-ende-1024'
hf_arch, hf_config, hf_tokenizer, hf_model = BLURR.get_hf_objects(pretrained_model_name, model_cls=model_cls)
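BLURR.get_hf_objects returns the architecture name along with the config, tokenizer, and model instances. A quick look at what came back (the printed values are what we'd expect for roberta-base):

print(hf_arch)                      # e.g., 'roberta'
print(type(hf_tokenizer).__name__)  # e.g., 'RobertaTokenizerFast'
print(type(hf_model).__name__)      # e.g., 'RobertaForQuestionAnswering'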

pre_process_squad[source]

pre_process_squad(row, hf_arch:str, hf_tokenizer:PreTrainedTokenizerBase, ctx_attr:str='context', qst_attr:str='question', ans_attr:str='answer_text')

Parameters:

  • row : <class 'inspect._empty'>

    A row in your pd.DataFrame

  • hf_arch : <class 'str'>

    The abbreviation/name of your Hugging Face transformer architecture (e.g., bert, bart, etc.)

  • hf_tokenizer : <class 'transformers.tokenization_utils_base.PreTrainedTokenizerBase'>

    A Hugging Face tokenizer

  • ctx_attr : <class 'str'>, optional

    The attribute in your dataset that contains the context (where the answer is included) (default: 'context')

  • qst_attr : <class 'str'>, optional

    The attribute in your dataset that contains the question being asked (default: 'question')

  • ans_attr : <class 'str'>, optional

    The attribute in your dataset that contains the actual answer (default: 'answer_text')

The pre_process_squad method is structured around how we've set up the SQuAD DataFrame above.

squad_df = squad_df.apply(partial(pre_process_squad, hf_arch=hf_arch, hf_tokenizer=hf_tokenizer), axis=1)
max_seq_len = 128

# keep only answerable examples whose answer span fits within max_seq_len tokens
squad_df = squad_df[(squad_df.tok_answer_end < max_seq_len) & (squad_df.is_impossible == False)]

# the start/end targets are token positions, so the "vocab" is simply 0 ... max_seq_len-1
vocab = dict(enumerate(range(max_seq_len)))
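As a quick sanity check (a hypothetical snippet, not part of the library), we can verify that a recorded token span decodes back to something close to the answer text. This assumes the (question, context) ordering used above and that tok_answer_end is exclusive:

row = squad_df.iloc[0]

# re-tokenize the (question, context) pair the same way the batch transform will
enc = hf_tokenizer(row.question, row.context, max_length=max_seq_len, truncation='only_second')

# decode the recorded answer token span; it should line up with the answer text
span_ids = enc['input_ids'][int(row.tok_answer_start):int(row.tok_answer_end)]
print(hf_tokenizer.decode(span_ids), '<->', row.answer_text)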

class HF_QuestionAnswerInput[source]

HF_QuestionAnswerInput(x, **kwargs) :: HF_BaseInput

The base representation of your inputs; used by the various fastai show methods

We'll return a HF_QuestionAnswerInput from our custom HF_BeforeBatchTransform so that we can customize the show_batch/results methods for this task.
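Defining a distinct input class is what lets fastai's @typedispatch mechanism route show_batch/show_results calls to task-specific implementations. The following is a simplified, hypothetical sketch of that pattern (the library's actual implementation differs); it assumes each decoded sample arrives as a (text, start, end) tuple.

import pandas as pd
from fastcore.dispatch import typedispatch
from fastai.torch_core import display_df

@typedispatch
def show_batch(x: HF_QuestionAnswerInput, y, samples, dataloaders=None, ctxs=None,
               max_n=6, trunc_at=None, **kwargs):
    # hypothetical sketch: render each decoded (text, start, end) sample as a row
    rows = [(str(s[0])[:trunc_at], s[1], s[2]) for s in samples[:max_n]]
    display_df(pd.DataFrame(rows, columns=['text', 'answer start', 'answer end']))
    return ctxs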

class HF_QABeforeBatchTransform[source]

HF_QABeforeBatchTransform(hf_arch:str, hf_config:PretrainedConfig, hf_tokenizer:PreTrainedTokenizerBase, hf_model:PreTrainedModel, max_length:int=None, padding:Union[bool, str]=True, truncation:Union[bool, str]=True, is_split_into_words:bool=False, tok_kwargs={}, **kwargs) :: HF_BeforeBatchTransform

Handles everything you need to assemble a mini-batch of inputs and targets, as well as decode the dictionary produced as a byproduct of the tokenization process in the encodes method.

Parameters:

  • hf_arch : <class 'str'>

    The abbreviation/name of your Hugging Face transformer architecture (e.g., bert, bart, etc.)

  • hf_config : <class 'transformers.configuration_utils.PretrainedConfig'>

    A specific configuration instance you want to use

  • hf_tokenizer : <class 'transformers.tokenization_utils_base.PreTrainedTokenizerBase'>

    A Hugging Face tokenizer

  • hf_model : <class 'transformers.modeling_utils.PreTrainedModel'>

    A Hugging Face model

  • max_length : <class 'int'>, optional

    To control the length of the padding/truncation. It can be an integer or None, in which case it will default to the maximum length the model can accept. If the model has no specific maximum input length, truncation/padding to max_length is deactivated. See [Everything you always wanted to know about padding and truncation](https://huggingface.co/transformers/preprocessing.html#everything-you-always-wanted-to-know-about-padding-and-truncation)

  • padding : typing.Union[bool, str], optional

    To control the `padding` applied to your `hf_tokenizer` during tokenization. If None, will default to `False` or `'do_not_pad'`. See [Everything you always wanted to know about padding and truncation](https://huggingface.co/transformers/preprocessing.html#everything-you-always-wanted-to-know-about-padding-and-truncation)

  • truncation : typing.Union[bool, str], optional

    To control the `truncation` applied to your `hf_tokenizer` during tokenization. If None, will default to `False` or `'do_not_truncate'`. See [Everything you always wanted to know about padding and truncation](https://huggingface.co/transformers/preprocessing.html#everything-you-always-wanted-to-know-about-padding-and-truncation)

  • is_split_into_words : <class 'bool'>, optional

    The `is_split_into_words` argument applied to your `hf_tokenizer` during tokenization. Set this to `True` if your inputs are pre-tokenized (not numericalized)

  • tok_kwargs : <class 'dict'>, optional

    Any other keyword arguments you want included when using your `hf_tokenizer` to tokenize your inputs

  • kwargs : <class 'inspect._empty'>

By overriding HF_BeforeBatchTransform we can add other inputs to each example for this particular task.
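As an illustration, here is a stripped-down, hypothetical sketch of such an override, in the spirit of (but not identical to) the library's transform. It also shows why we request the special tokens mask via tok_kwargs below: some architectures (e.g., XLNet/XLM) expect a cls_index and a p_mask alongside the usual inputs.

import torch

class MyQABeforeBatchTransform(HF_BeforeBatchTransform):
    "Hypothetical sketch of adding task-specific inputs to each example"
    def encodes(self, samples):
        # the base class tokenizes the (question, context) pairs into
        # input_ids, attention_mask, etc.
        samples = super().encodes(samples)
        for s in samples:
            # s[0] is the dict of inputs; add the position of the CLS token ...
            s[0]['cls_index'] = torch.where(s[0]['input_ids'] == self.hf_tokenizer.cls_token_id)[0]
            # ... and a mask of tokens that cannot be part of the answer
            s[0]['p_mask'] = s[0]['special_tokens_mask']
        return samples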

before_batch_tfm = HF_QABeforeBatchTransform(hf_arch, hf_config, hf_tokenizer, hf_model,
                                             max_length=max_seq_len, truncation='only_second', 
                                             tok_kwargs={ 'return_special_tokens_mask': True })

blocks = (
    HF_TextBlock(before_batch_tfm=before_batch_tfm, input_return_type=HF_QuestionAnswerInput), 
    CategoryBlock(vocab=vocab),
    CategoryBlock(vocab=vocab)
)

dblock = DataBlock(blocks=blocks, 
                   get_x=lambda x: (x.question, x.context),
                   get_y=[ColReader('tok_answer_start'), ColReader('tok_answer_end')],
                   splitter=RandomSplitter(),
                   n_inp=1)
dls = dblock.dataloaders(squad_df, bs=4)
b = dls.one_batch(); len(b), len(b[0]), len(b[1]), len(b[2])
b[0]['input_ids'].shape, b[0]['attention_mask'].shape, b[1].shape, b[2].shape
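To see how the targets relate back to text, we can decode the answer span for the first item in the batch (a hypothetical sanity check; it assumes the end index is exclusive, as in the pre-processing above). Because the vocab maps each category to the same token position, the decoded category index is the position itself.

# decode the target answer span for the first item in the batch
input_ids = b[0]['input_ids'][0]
start, end = b[1][0].item(), b[2][0].item()
print(hf_tokenizer.decode(input_ids[start:end]))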

The show_batch method gives us a more interpretable view of our question/answer data.

# note: we pass the DataLoaders in so the typedispatched show_batch can get at the hf_tokenizer
dls.show_batch(dataloaders=dls, max_n=2, trunc_at=500)

Summary

This module includes the low, mid, and high-level API bits for preparing your data for extractive question/answering tasks.