You can now install blurr with pip:
pip install ohmeow-blurr
Or, even better as this library is under very active development, create an editable install like this:
git clone https://github.com/ohmeow/blurr.git
cd blurr
pip install -e ".[dev]"
The initial release includes everything you need for sequence classification and question answering tasks. Support for token classification and summarization is incoming. Please check the documentation for more thorough examples of how to use this package.
The following two packages need to be installed for blurr to work:
- fastai2 (see http://docs.fast.ai/ for installation instructions)
- huggingface transformers (see https://huggingface.co/transformers/installation.html for details)
import torch
from transformers import *
from fastai.text.all import *
from blurr.data.all import *
from blurr.modeling.all import *
path = untar_data(URLs.IMDB_SAMPLE)
model_path = Path('models')
imdb_df = pd.read_csv(path/'texts.csv')
task = HF_TASKS_AUTO.SequenceClassification
pretrained_model_name = "bert-base-uncased"
hf_arch, hf_config, hf_tokenizer, hf_model = BLURR_MODEL_HELPER.get_hf_objects(pretrained_model_name, task=task)
blocks = (HF_TextBlock(hf_arch, hf_config, hf_tokenizer, hf_model), CategoryBlock)
dblock = DataBlock(blocks=blocks, get_x=ColReader('text'), get_y=ColReader('label'), splitter=ColSplitter())
dls = dblock.dataloaders(imdb_df, bs=4)
dls.show_batch(dataloaders=dls, max_n=2)
model = HF_BaseModelWrapper(hf_model)
learn = Learner(dls,
                model,
                opt_func=partial(Adam, decouple_wd=True),
                loss_func=CrossEntropyLossFlat(),
                metrics=[accuracy],
                cbs=[HF_BaseModelCallback],
                splitter=hf_splitter)
learn.create_opt()
learn.freeze()
learn.fit_one_cycle(3, lr_max=1e-3)
learn.show_results(learner=learn, max_n=2)
❗ Updates
12/31/2020
The "Goodbye 2020" release with lots of goodies for blurr users:
- Updated the Seq2Seq models to use some of the latest huggingface bits like tokenizer.prepare_seq2seq_batch.
- Separated out the Seq2Seq and Token Classification metrics into metrics-specific callbacks for a better separation of concerns. As a best practice, you should now pass them only as callbacks to fit_one_cycle, etc., rather than attach them to your Learner.
- NEW: Translation is now available in blurr, joining causal language modeling and summarization in our core Seq2Seq stack.
- NEW: Integration of huggingface's Seq2Seq metrics (rouge, bertscore, meteor, bleu, and sacrebleu). Plenty of info on how to set this up in the docs.
- NEW: Added default_text_gen_kwargs, a method that, given a huggingface config, model, and task (optional), will return the default/recommended kwargs for any text generation models.
- A lot of code cleanup (e.g., refactored naming and removal of redundant code into classes/methods)
- More model support and more tests across the board! Check out the docs for more info
- Misc. validation improvements and bug fixes.
As I'm sure there is plenty I can do to make this library better, please don't hesitate to join in and help the effort by submitting PRs, pointing out problems with my code, or letting me know what and how I can improve things generally. Some models, like mbart and mt5 for example, aren't giving good results and I'd love to get any and all feedback from the community on how to resolve such issues ... so hit me up, I promise I won't bite :)
12/20/2020
- Updated Learner.blurr_predict and Learner.blurr_predict_tokens to support single or multiple items
- Added ONNX support for sequence classification, token classification, and question/answering tasks. blurrONNX provides ONNX friendly variants of Learner.blurr_predict and Learner.blurr_predict_tokens in the form of blurrONNX.predict and blurrONNX.predict_tokens respectively. Like their Learner equivalents, these methods support single or multiple items for inference. See the docs/code for examples and the speed gains you get with ONNX.
- Added quantization support when converting your blurr models to ONNX.
- Requires fast.ai >= 2.1.5 and huggingface transformers >= 4.x
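The version requirements above can be checked from Python before running the notebooks. This is only a rough sketch and not part of blurr: `version_tuple`, `check_requirements`, and `MINIMUMS` are hypothetical names, "4.x" is assumed to mean 4.0.0 or newer, and the comparison looks only at the leading numeric components of each version string.

```python
import re
from importlib import metadata

# Assumed minimums taken from the release note; "4.x" is read as ">= 4.0.0".
MINIMUMS = {"fastai": "2.1.5", "transformers": "4.0.0"}

def version_tuple(version):
    """Return up to three leading numeric components, e.g. '2.1.8' -> (2, 1, 8)."""
    return tuple(int(part) for part in re.findall(r"\d+", version)[:3])

def check_requirements(minimums=MINIMUMS):
    """Return a list of human-readable problems (empty if everything is fine)."""
    problems = []
    for name, minimum in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            problems.append(f"{name} is not installed (need >= {minimum})")
            continue
        if version_tuple(installed) < version_tuple(minimum):
            problems.append(f"{name} {installed} is older than {minimum}")
    return problems

print(check_requirements() or "requirements satisfied")
```

For anything beyond a quick sanity check, a proper version parser (for example the one in the `packaging` library) handles pre-releases and local versions more carefully than this tuple comparison.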
12/12/2020
- Updated to work with the latest version of fast.ai (2.1.8) and huggingface transformers >= 4.x
- Fixed Learner.blurr_summary to work with fast.ai >= 2.1.8
- Fixed inclusion of add_prefix_space in tokenizer BLURR_MODEL_HELPER
- Fixed token classification show_results for tokenizers that add a prefix space
- Notebooks run with environment variable "TOKENIZERS_PARALLELISM=false" to avoid fast tokenizer warnings
- Updated docs
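If you run your own scripts or notebooks, the TOKENIZERS_PARALLELISM setting mentioned above can also be applied from Python. A minimal sketch, assuming it runs before transformers (or tokenizers) is imported in the process:

```python
import os

# Set this before importing transformers/tokenizers so the value is already
# in place when the fast tokenizers are first used.
os.environ["TOKENIZERS_PARALLELISM"] = "false"
```

Exporting the variable in your shell before launching Jupyter has the same effect.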
11/12/2020
- Updated documentation
- Updated model callbacks to support mixed precision training regardless of whether you are calculating the loss yourself or letting huggingface do it for you.
11/10/2020
- Major update just about everywhere to facilitate a breaking change in fastai's treatment of before_batch transforms.
- Reorganized code as I begin to work on LM and other text2text tasks
- Misc. fixes
10/08/2020
- Updated all models to use ModelOutput classes instead of traditional tuples. ModelOutput attributes are assigned to the appropriate fastai bits like Learner.pred and Learner.loss, and anything else you've requested the huggingface model to return is available via the Learner.blurr_model_outputs dictionary (see next two bullet items)
- Added ability to grab attentions and hidden state from Learner. You can get at them via the Learner.blurr_model_outputs dictionary if you tell HF_BaseModelWrapper to provide them.
- Added model_kwargs to HF_BaseModelWrapper should you need to request a huggingface model to return something specific to its type. These outputs will be available via the Learner.blurr_model_outputs dictionary as well.
09/16/2020
- Major overhaul to do everything at batch time (including tokenization/numericalization). If this backfires, I'll roll everything back but as of now, I think this approach not only meshes better with how huggingface tokenization works and reduces RAM utilization for big datasets, but also opens up opportunities for incorporating augmentation, building adversarial models, etc. Thoughts?
- Added tests for summarization bits
- New change may require some small modifications (see docs or ask on issues thread if you have problems you can't figure out). I'm NOT doing a release to pypi until folks have a chance to work with the latest.
09/07/2020
- Added tests for question/answer and summarization transformer models
- Updated summarization to support BART, T5, and Pegasus
08/20/2020
- Updated everything to work with the latest version of fastai (tested against 2.0.0)
- Added batch-time padding, so that by default now, HF_TokenizerTransform doesn't add any padding tokens and all huggingface inputs are padded simply to the max sequence length in each batch rather than to the max length (passed in and/or acceptable to the model). This should create efficiencies across the board, from memory consumption to GPU utilization. The old tried and true method of padding during tokenization requires you to pass in padding='max_length' to HF_TextBlock.
- Removed code to remove fastai2 @patched summary methods which had previously conflicted with a couple of the huggingface transformers
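To make the difference between batch-time padding and fixed max-length padding concrete, here is a small pure-Python sketch. It is not blurr's implementation: `pad_to_batch_max` and `pad_to_max_length` are hypothetical helpers, and the token ids and pad id of 0 are made up for illustration.

```python
from typing import List

def pad_to_batch_max(batch: List[List[int]], pad_id: int = 0) -> List[List[int]]:
    """Pad each sequence only up to the longest sequence in this batch."""
    longest = max(len(seq) for seq in batch)
    return [seq + [pad_id] * (longest - len(seq)) for seq in batch]

def pad_to_max_length(batch: List[List[int]], max_length: int, pad_id: int = 0) -> List[List[int]]:
    """Pad (and truncate) every sequence to a fixed max_length."""
    return [(seq + [pad_id] * max_length)[:max_length] for seq in batch]

# Two already-numericalized examples of different lengths (ids are illustrative).
batch = [[101, 7592, 102], [101, 7592, 2088, 999, 102]]

dynamic = pad_to_batch_max(batch)
fixed = pad_to_max_length(batch, max_length=8)

print([len(seq) for seq in dynamic])  # [5, 5]
print([len(seq) for seq in fixed])    # [8, 8]
```

With short texts and a large model max length, the fixed approach spends most of each tensor on padding, which is where the memory and GPU savings described above would come from.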
08/13/2020
- Updated everything to work with the latest transformers and fastai
- Reorganized code to bring it more in line with how huggingface separates out their "tasks".
07/06/2020
- Updated everything to work with huggingface>=3.02
- Changed a lot of the internals to make everything more efficient and performant along with the latest version of huggingface ... meaning, I have broken things for folks using previous versions of blurr :).
06/27/2020
- Simplified the BLURR_MODEL_HELPER.get_hf_objects method to support a wide range of options in terms of building the necessary huggingface objects (architecture, config, tokenizer, and model). Also added cache_dir for saving pre-trained objects in a custom directory.
- Misc. renaming and cleanup that may break existing code (please see the docs/source if things blow up)
- Added missing required libraries to requirements.txt (e.g., nlp)
05/23/2020
- Initial support for text generation (e.g., summarization, conversational agents) models now included. Only tested with BART so if you try it with other models before I do, lmk what works ... and what doesn't
05/17/2020
- Major code restructuring to make it easier to build out the library.
- HF_TokenizerTransform replaces HF_Tokenizer, handling the tokenization and numericalization in one place. DataBlock code has been dramatically simplified.
- Tokenization correctly handles huggingface tokenizers that require add_prefix_space=True.
- HF_BaseModelWrapper and HF_BaseModelCallback are required and work together in order to allow developers to tie into any callback friendly event exposed by fastai2 and also pass in named arguments to the huggingface models.
- show_batch and show_results have been updated for Question/Answer and Token Classification models to represent the data and results in a more easily interpretable manner than the defaults.
05/06/2020
- Initial support for Token classification (e.g., NER) models now included
- Extended fastai's Learner object with a predict_tokens method used specifically in token classification
- HF_BaseModelCallback can be used (or extended) instead of the model wrapper to ensure your inputs into the huggingface model are correct (recommended). See docs for examples (and thanks to fastai's Sylvain for the suggestion!)
- HF_Tokenizer can work with strings or a string representation of a list (the latter helpful for token classification tasks)
- show_batch and show_results methods have been updated to allow better control over how huggingface tokenized data is represented in those methods
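For readers wondering what "a string representation of a list" looks like in practice (for example, a CSV column that stores pre-split tokens as text), here is a rough sketch of one way to normalize both input forms. It is not blurr's code: `to_tokens` is a hypothetical helper, and whitespace splitting stands in for real tokenization.

```python
import ast
from typing import List, Union

def to_tokens(value: Union[str, List[str]]) -> List[str]:
    """Accept a list, a plain string, or a string that looks like a Python list."""
    if isinstance(value, list):
        return [str(tok) for tok in value]
    text = value.strip()
    if text.startswith("[") and text.endswith("]"):
        try:
            # literal_eval only parses Python literals; it never executes code.
            parsed = ast.literal_eval(text)
        except (ValueError, SyntaxError):
            parsed = None
        if isinstance(parsed, list):
            return [str(tok) for tok in parsed]
    # Fall back to treating the input as ordinary text.
    return text.split()

print(to_tokens("['John', 'lives', 'in', 'Berlin']"))
print(to_tokens("John lives in Berlin"))
```

Keeping tokens pre-split matters for token classification because each label has to line up with exactly one word.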
⭐ Props
A word of gratitude to the following individuals, repos, and articles that inspired much of this work:
- The wonderful community that is the fastai forum and especially the tireless work of both Jeremy and Sylvain in building this amazing framework and place to learn deep learning.
- All the great tokenizers, transformers, docs and examples over at huggingface
- FastHugs
- Fastai with 🤗Transformers (BERT, RoBERTa, XLNet, XLM, DistilBERT)
- Fastai integration with BERT: Multi-label text classification identifying toxicity in texts
- A Tutorial to Fine-Tuning BERT with Fast AI
- fastinference