This module contains the bits required to use the fastai DataBlock API and/or mid-level data processing pipelines to organize your data for question/answering tasks.
torch.cuda.set_device(1)
print(f'Using GPU #{torch.cuda.current_device()}: {torch.cuda.get_device_name()}')
Using GPU #1: GeForce GTX 1080 Ti

Question/Answering tokenization, batch transform, and DataBlock methods

Question answering tasks require two text inputs: a question and a context that contains the answer. The objective is to predict the start and end tokens of the answer within the context.
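To make the objective concrete, here is a minimal sketch of how a predicted answer span can be chosen once a model has produced one score per token for "start" and one for "end". The function name `best_span`, the `max_answer_len` parameter, and the toy scores are illustrative assumptions and are not part of this library.

```python
def best_span(start_scores, end_scores, max_answer_len=30):
    """Return the (start, end) token indices with the highest combined score."""
    best, best_score = (0, 0), float("-inf")
    for s, s_score in enumerate(start_scores):
        # Only consider ends at or after the start, within a maximum answer length
        for e in range(s, min(s + max_answer_len, len(end_scores))):
            score = s_score + end_scores[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best

# Toy scores for a 6-token sequence (made-up numbers, not model output)
start_scores = [0.1, 0.2, 3.0, 0.3, 0.1, 0.0]
end_scores = [0.0, 0.1, 0.5, 2.5, 0.2, 0.1]
span = best_span(start_scores, end_scores)
print(span)
```

Restricting the search to `end >= start` avoids nonsensical spans that an independent argmax over each score list could produce.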

path = Path('./')
squad_df = pd.read_csv(path/'squad_sample.csv'); len(squad_df)
1000

The file loaded above is a small, pre-processed subset of the SQuAD v2 dataset, provided just for demonstration purposes. There is a lot that can be done to make this much better and more fully functional. The idea here is just to show you how things can work for tasks beyond sequence classification.

squad_df.head(2)
id title context question answers ds_type answer_text is_impossible
0 56be85543aeaaa14008c9063 Beyoncé Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny's Child. Managed by her father, Mathew Knowles, the group became one of the world's best-selling girl groups of all time. Their hiatus saw the release of Beyoncé's debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five G... When did Beyonce start becoming popular? {'text': ['in the late 1990s'], 'answer_start': [269]} train in the late 1990s False
1 56be85543aeaaa14008c9065 Beyoncé Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny's Child. Managed by her father, Mathew Knowles, the group became one of the world's best-selling girl groups of all time. Their hiatus saw the release of Beyoncé's debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five G... What areas did Beyonce compete in when she was growing up? {'text': ['singing and dancing'], 'answer_start': [207]} train singing and dancing False
task = HF_TASKS_AUTO.QuestionAnswering

pretrained_model_name = 'roberta-base' #'xlm-mlm-ende-1024'
hf_arch, hf_config, hf_tokenizer, hf_model = BLURR_MODEL_HELPER.get_hf_objects(pretrained_model_name, task=task)

pre_process_squad[source]

pre_process_squad(row, hf_arch, hf_tokenizer)

The pre_process_squad method is structured around how we've set up the squad DataFrame above.
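The source of pre_process_squad is not shown here, so the following is only a simplified stand-in that illustrates the general idea of adding token-level answer positions to a row. It assumes whitespace splitting in place of a real subword tokenizer, a RoBERTa-style pair layout for the special tokens, and an exclusive end index; `find_answer_span` and `toy_pre_process` are hypothetical names. The column names `tok_answer_start` and `tok_answer_end` match the ones used later on this page.

```python
def find_answer_span(input_tokens, answer_tokens):
    """Return (start, end) of the first match of answer_tokens, end exclusive; (0, 0) if absent."""
    n = len(answer_tokens)
    if n == 0:
        return 0, 0
    for i in range(len(input_tokens) - n + 1):
        if input_tokens[i:i + n] == answer_tokens:
            return i, i + n
    return 0, 0

def toy_pre_process(row):
    # Whitespace splitting stands in for a real subword tokenizer (assumption)
    tokens = (["<s>"] + row["question"].split() + ["</s>", "</s>"]
              + row["context"].split() + ["</s>"])
    start, end = find_answer_span(tokens, row["answer_text"].split())
    # Return a copy of the row with the two new token-position fields added
    return {**row, "tok_answer_start": start, "tok_answer_end": end}

row = {
    "question": "When did she rise to fame?",
    "context": "She rose to fame in the late 1990s as a singer",
    "answer_text": "in the late 1990s",
}
out = toy_pre_process(row)
print(out["tok_answer_start"], out["tok_answer_end"])
```

With a real tokenizer the positions would differ, since subword tokenization splits text differently from whitespace.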

squad_df = squad_df.apply(partial(pre_process_squad, hf_arch=hf_arch, hf_tokenizer=hf_tokenizer), axis=1)
max_seq_len = 128
squad_df = squad_df[(squad_df.tok_answer_end < max_seq_len) & (squad_df.is_impossible == False)]
vocab = dict(enumerate(range(max_seq_len)))
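The vocab built above is an identity mapping from each token position to itself, which lets the start and end positions be treated as class labels over the `max_seq_len` possible positions. The sketch below shows what that mapping looks like and mimics the row filter on plain dictionaries; the `rows` data is made up for illustration.

```python
max_seq_len = 128

# Each token position 0..127 is its own "class"
vocab = dict(enumerate(range(max_seq_len)))
print(len(vocab), vocab[0], vocab[127])

# Made-up rows mirroring the two columns used in the DataFrame filter
rows = [
    {"tok_answer_end": 40, "is_impossible": False},   # kept
    {"tok_answer_end": 200, "is_impossible": False},  # answer falls past max_seq_len
    {"tok_answer_end": 0, "is_impossible": True},     # unanswerable question
]
kept = [r for r in rows
        if r["tok_answer_end"] < max_seq_len and not r["is_impossible"]]
print(len(kept))
```

Dropping rows whose answer ends past `max_seq_len` ensures every remaining target is a valid key in `vocab`.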

class HF_QuestionAnswerInput[source]

HF_QuestionAnswerInput(x, **kwargs) :: HF_BaseInput

We'll return a HF_QuestionAnswerInput from our custom HF_BeforeBatchTransform so that we can customize the show_batch/results methods for this task.

class HF_QABeforeBatchTransform[source]

HF_QABeforeBatchTransform(hf_arch, hf_tokenizer, max_length=None, padding=True, truncation=True, is_split_into_words=False, n_tok_inps=1, tok_kwargs={}, **kwargs) :: HF_BeforeBatchTransform

Handles everything you need to assemble a mini-batch of inputs and targets, as well as decode the dictionary produced as a byproduct of the tokenization process in the encodes method.

By overriding HF_BeforeBatchTransform we can add other inputs to each example for this particular task.

before_batch_tfm = HF_QABeforeBatchTransform(hf_arch, hf_tokenizer, 
                                             max_length=max_seq_len, truncation='only_second', 
                                             tok_kwargs={ 'return_special_tokens_mask': True })
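Because the question is passed as the first sequence and the context as the second, truncation='only_second' means that only the context is shortened when a pair is too long. The toy function below sketches that behaviour on plain token lists. It is not the tokenizer's implementation; `truncate_only_second` is a hypothetical helper, and the default of four special tokens assumes a RoBERTa-style pair layout (`<s> A </s></s> B </s>`).

```python
def truncate_only_second(first, second, max_length, n_special=4):
    """Trim only the second sequence so the pair plus special tokens fits max_length."""
    budget = max_length - n_special - len(first)
    if budget < 0:
        # In this toy version we simply refuse pairs whose first sequence is too long
        raise ValueError("first sequence alone exceeds max_length")
    return first, second[:budget]

question = ["tok"] * 5     # stand-in for 5 question tokens
context = ["tok"] * 200    # stand-in for 200 context tokens
q, c = truncate_only_second(question, context, max_length=128)
print(len(q), len(c), len(q) + len(c) + 4)
```

Keeping the question intact matters here, since a truncated question would change what is being asked, whereas a truncated context only risks cutting off the answer (which the earlier filter on tok_answer_end guards against).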

blocks = (
    HF_TextBlock(hf_arch, hf_tokenizer, 
                 before_batch_tfms=before_batch_tfm, 
                 input_return_type=HF_QuestionAnswerInput), 
    CategoryBlock(vocab=vocab),
    CategoryBlock(vocab=vocab)
)

dblock = DataBlock(blocks=blocks, 
                   get_x=lambda x: (x.question, x.context),
                   get_y=[ColReader('tok_answer_start'), ColReader('tok_answer_end')],
                   splitter=RandomSplitter(),
                   n_inp=1)
dls = dblock.dataloaders(squad_df, bs=4)
b = dls.one_batch(); len(b), len(b[0]), len(b[1]), len(b[2])
(3, 5, 4, 4)
b[0]['input_ids'].shape, b[0]['attention_mask'].shape, b[1].shape, b[2].shape
(torch.Size([4, 128]), torch.Size([4, 128]), torch.Size([4]), torch.Size([4]))

The show_batch method gives us a more interpretable view of our question/answer data, as the call below shows.

dls.show_batch(dataloaders=dls, max_n=2, trunc_at=500)
text start/end answer
0 Chopin was active during what era? Frédéric François Chopin (/ˈʃoʊpæn/; French pronunciation: ​[fʁe.de.ʁik fʁɑ̃.swa ʃɔ.pɛ̃]; 22 February or 1 March 1810 – 17 October 1849), born Fryderyk Franciszek Chopin,[n 1] was a Polish and French (by citizenship and birth of father) composer and a virtuoso pianist of the Romantic era, who wrote primarily for the solo piano. (116, 118) Romantic era
1 Pepsi paid Beyonce how much in 2012 for her endorsement? Beyoncé has worked with Pepsi since 2002, and in 2004 appeared in a Gladiator-themed commercial with Britney Spears, Pink, and Enrique Iglesias. In 2012, Beyoncé signed a $50 million deal to endorse Pepsi. The Center for Science in the Public Interest (CSPINET) wrote Beyoncé an open letter asking her to reconsider the deal because of the unhealthiness of the product and to donate the proceeds to a medical organisation. Nevertheless, NetBa (0, 0)

Cleanup