Basic Utilities for PyTorch NLP Software
PyTorch-NLP, or torchnlp for short, is a library of basic utilities for PyTorch
Natural Language Processing (NLP). torchnlp extends PyTorch to provide you with
basic text data processing functions.
Logo by Chloe Yeo, Corporate Sponsorship by WellSaid Labs
Installation 🐾
Make sure you have Python 3.5+ and PyTorch 1.0+. You can then install pytorch-nlp using pip:
pip install pytorch-nlp
Or install the latest code directly from GitHub:
pip install git+https://github.com/PetrochukM/PyTorch-NLP.git
Docs
The complete documentation for PyTorch-NLP is available
via our ReadTheDocs website.
Get Started
Within an NLP data pipeline, you'll want to implement these basic steps:
Load Your Data 🐿
Load the IMDB dataset, for example:
from torchnlp.datasets import imdb_dataset
train = imdb_dataset(train=True)
train[0]
Load a custom dataset, for example:
from pathlib import Path
from torchnlp.download import download_file_maybe_extract
directory_path = Path('data/')
train_file_path = Path('trees/train.txt')
download_file_maybe_extract(
    url='http://nlp.stanford.edu/sentiment/trainDevTestTrees_PTB.zip',
    directory=directory_path,
    check_files=[train_file_path])
open(directory_path / train_file_path)
Don't worry, we'll handle caching for you!
Text To Tensor
Tokenize and encode your text as a tensor. For example, a WhitespaceEncoder breaks
text into terms whenever it encounters a whitespace character.
from torchnlp.encoders.text import WhitespaceEncoder
loaded_data = ["now this ain't funny", "so don't you dare laugh"]
encoder = WhitespaceEncoder(loaded_data)
encoded_data = [encoder.encode(example) for example in loaded_data]
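To make the idea concrete, here is a minimal pure-Python sketch of whitespace encoding. It is not torchnlp's WhitespaceEncoder: the class name `ToyWhitespaceEncoder` and the two reserved tokens are assumptions made for illustration, and it returns plain lists of integers rather than tensors.

```python
# Illustrative sketch only; this is not torchnlp's WhitespaceEncoder.
PAD, UNK = "<pad>", "<unk>"  # hypothetical reserved tokens for this sketch


class ToyWhitespaceEncoder:
    def __init__(self, samples):
        # Reserved tokens take the first indices; other tokens are
        # numbered in order of first appearance.
        self.index_to_token = [PAD, UNK]
        self.token_to_index = {PAD: 0, UNK: 1}
        for sample in samples:
            for token in sample.split():
                if token not in self.token_to_index:
                    self.token_to_index[token] = len(self.index_to_token)
                    self.index_to_token.append(token)

    def encode(self, text):
        # Tokens never seen during construction map to the UNK index.
        unk = self.token_to_index[UNK]
        return [self.token_to_index.get(token, unk) for token in text.split()]

    def decode(self, indices):
        return " ".join(self.index_to_token[i] for i in indices)


loaded_data = ["now this ain't funny", "so don't you dare laugh"]
toy_encoder = ToyWhitespaceEncoder(loaded_data)
encoded = toy_encoder.encode("now this ain't funny")
print(encoded)
print(toy_encoder.decode(encoded))
```

Building the vocabulary from the training text up front is what lets the same integer always stand for the same token at encode and decode time.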
Tensor To Batch
With your loaded and encoded data in hand, you'll want to batch your dataset.
import torch
from torchnlp.samplers import BucketBatchSampler
from torchnlp.utils import collate_tensors
from torchnlp.encoders.text import stack_and_pad_tensors
encoded_data = [torch.randn(2), torch.randn(3), torch.randn(4), torch.randn(5)]
train_sampler = torch.utils.data.sampler.SequentialSampler(encoded_data)
train_batch_sampler = BucketBatchSampler(
    train_sampler, batch_size=2, drop_last=False, sort_key=lambda i: encoded_data[i].shape[0])
batches = [[encoded_data[i] for i in batch] for batch in train_batch_sampler]
batches = [collate_tensors(batch, stack_tensors=stack_and_pad_tensors) for batch in batches]
PyTorch-NLP builds on top of PyTorch's existing torch.utils.data.sampler, torch.stack
and default_collate to support sequential inputs of varying lengths!
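The two ideas behind this step, grouping sequences of similar length and padding each batch to a common length, can be sketched in plain Python. This is a simplified illustration, not how BucketBatchSampler or stack_and_pad_tensors are implemented: the helper names `bucket_batches` and `pad_batch` are hypothetical, the sketch sorts the whole dataset at once, and it does no shuffling.

```python
# Illustrative sketch only; helper names are hypothetical.
def bucket_batches(sequences, batch_size):
    """Group sequence indices into batches of similar length."""
    order = sorted(range(len(sequences)), key=lambda i: len(sequences[i]))
    return [order[i:i + batch_size] for i in range(0, len(order), batch_size)]


def pad_batch(batch, padding_value=0):
    """Right-pad every sequence to the longest length in the batch."""
    max_len = max(len(seq) for seq in batch)
    padded = [seq + [padding_value] * (max_len - len(seq)) for seq in batch]
    lengths = [len(seq) for seq in batch]
    return padded, lengths


sequences = [[5, 6, 7, 8, 9], [1, 2], [3, 4, 5, 6], [7, 8, 9]]
batch_indices = bucket_batches(sequences, batch_size=2)
for indices in batch_indices:
    padded, lengths = pad_batch([sequences[i] for i in indices])
    print(padded, lengths)
```

Batching sequences of similar length keeps the amount of padding per batch small, which is the main reason to bucket before padding.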
You're Good To Go!
With your batch in hand, you can use PyTorch to develop and train your model using gradient descent.
Last But Not Least
PyTorch-NLP has a couple more NLP-focused utility packages to support you! 🤗
Deterministic Functions
Now that you've set up your pipeline, you may want to ensure that some functions run deterministically.
Wrap any code that's random with fork_rng and you'll be good to go, like so:
import random
import numpy
import torch
from torchnlp.random import fork_rng
with fork_rng(seed=123):
    print('Random:', random.randint(1, 2**31))
    print('Numpy:', numpy.random.randint(1, 2**31))
    print('Torch:', int(torch.randint(1, 2**31, (1,))))
This will always print:
Random: 224899943
Numpy: 843828735
Torch: 843828736
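The pattern behind a forked random number generator is to save the generator state, seed it, run the block, and then restore the saved state. The sketch below shows that pattern for Python's built-in random module only; `toy_fork_rng` is a hypothetical name, and this is not torchnlp's implementation, which per the example above also covers NumPy and PyTorch.

```python
import contextlib
import random


@contextlib.contextmanager
def toy_fork_rng(seed):
    """Run a block with a fixed seed, then restore the previous RNG state."""
    saved_state = random.getstate()
    random.seed(seed)
    try:
        yield
    finally:
        # Restore the outer state even if the block raises.
        random.setstate(saved_state)


def sample():
    with toy_fork_rng(seed=123):
        return random.randint(1, 2**31)


first, second = sample(), sample()
print(first == second)
```

Because the outer state is restored on exit, code that runs after the block draws the same random numbers it would have drawn if the block had never executed.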
Pre-Trained Word Vectors
Now that you've computed your vocabulary, you may want to make use of
pre-trained word vectors, like so:
import torch
from torchnlp.encoders.text import WhitespaceEncoder
from torchnlp.word_to_vector import GloVe
encoder = WhitespaceEncoder(["now this ain't funny", "so don't you dare laugh"])
vocab = set(encoder.vocab)
pretrained_embedding = GloVe(name='6B', dim=100, is_include=lambda w: w in vocab)
embedding_weights = torch.Tensor(encoder.vocab_size, pretrained_embedding.dim)
for i, token in enumerate(encoder.vocab):
    embedding_weights[i] = pretrained_embedding[token]
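Before training, it can help to check how much of your vocabulary the pre-trained vectors actually cover. The sketch below uses a tiny hand-written dictionary in place of real GloVe vectors; the names `toy_vectors` and `build_embedding_table` and the choice of a zero row for missing tokens are assumptions made for this illustration.

```python
# Hypothetical 3-dimensional vectors standing in for a real pre-trained table.
toy_vectors = {"now": [0.1, 0.2, 0.3], "funny": [0.4, 0.5, 0.6]}
dim = 3
toy_vocab = ["<unk>", "now", "this", "funny"]


def build_embedding_table(vocab, vectors, dim):
    # Tokens without a pre-trained vector get a zero row
    # (a choice made for this sketch).
    return [list(vectors.get(token, [0.0] * dim)) for token in vocab]


table = build_embedding_table(toy_vocab, toy_vectors, dim)
coverage = sum(token in toy_vectors for token in toy_vocab) / len(toy_vocab)
print(table)
print(coverage)
```

A low coverage value suggests that many rows will start from the fallback vector, so those embeddings will be learned entirely from your own data.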
Neural Networks Layers
For example, from the neural network package, apply the state-of-the-art LockedDropout:
import torch
from torchnlp.nn import LockedDropout
input_ = torch.randn(6, 3, 10)
dropout = LockedDropout(0.5)
dropout(input_)
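Locked dropout, also called variational dropout, samples one dropout mask and reuses it at every time step instead of drawing a fresh mask per step. The pure-Python sketch below illustrates that idea on nested lists. It assumes the first dimension indexes time steps, `toy_locked_dropout` is a hypothetical name, and it is not the torchnlp.nn implementation.

```python
import random


def toy_locked_dropout(inputs, p=0.5, rng=random):
    """inputs: nested list shaped [sequence length][batch size][features]."""
    batch_size, num_features = len(inputs[0]), len(inputs[0][0])
    keep_scale = 1.0 / (1.0 - p)
    # Sample one mask per (batch item, feature) and reuse it at every time step.
    # Kept entries are scaled so the expected value of each unit is unchanged.
    mask = [[keep_scale if rng.random() >= p else 0.0 for _ in range(num_features)]
            for _ in range(batch_size)]
    return [[[value * mask[b][f] for f, value in enumerate(row)]
             for b, row in enumerate(step)]
            for step in inputs]


random.seed(0)
# 3 time steps, batch of 2, 4 features, all ones so the mask is visible in the output.
inputs = [[[1.0] * 4 for _ in range(2)] for _ in range(3)]
output = toy_locked_dropout(inputs, p=0.5)
print(all(step == output[0] for step in output))
```

With an all-ones input, every time step of the output is identical, which shows that the same units are dropped across the whole sequence.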
Metrics
Compute common NLP metrics such as the BLEU score.
from torchnlp.metrics import get_moses_multi_bleu
hypotheses = ["The brown fox jumps over the dog 笑"]
references = ["The quick brown fox jumps over the lazy dog 笑"]
get_moses_multi_bleu(hypotheses, references, lowercase=True)
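For intuition about what a BLEU score measures, here is a simplified single-sentence, single-reference version in pure Python: the geometric mean of clipped n-gram precisions for n = 1 to 4, multiplied by a brevity penalty. The function `toy_bleu` is a hypothetical helper, it tokenizes on whitespace only, and its scores should not be expected to match get_moses_multi_bleu exactly.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def toy_bleu(hypothesis, reference, max_n=4):
    hyp, ref = hypothesis.lower().split(), reference.lower().split()
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
        # Clip each hypothesis n-gram count by its count in the reference.
        overlap = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
        total = sum(hyp_counts.values())
        if overlap == 0 or total == 0:
            return 0.0
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty for hypotheses shorter than the reference.
    brevity = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    return 100 * brevity * math.exp(sum(log_precisions) / max_n)


score = toy_bleu("The brown fox jumps over the dog 笑",
                 "The quick brown fox jumps over the lazy dog 笑")
print(round(score, 1))
```

Clipping stops a hypothesis from being rewarded for repeating a matching word, and the brevity penalty stops very short hypotheses from scoring well on precision alone.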
Help ❓
Looking at the longer examples in examples/ may help.
Need more help? We are happy to answer your questions via Gitter Chat.
Contributing
We've released PyTorch-NLP because we found a lack of basic toolkits for NLP in PyTorch. We hope that other organizations can benefit from the project. We are thankful for any contributions from the community.
Contributing Guide
Read our contributing guide to learn about our development process, how to propose bugfixes and improvements, and how to build and test your changes to PyTorch-NLP.
Related Work
torchtext and PyTorch-NLP differ in architecture and feature set; otherwise, they are similar. Both torchtext and PyTorch-NLP provide pre-trained word vectors, datasets, iterators and text encoders. PyTorch-NLP also provides neural network modules and metrics. From an architecture standpoint, torchtext is object-oriented with external coupling, while PyTorch-NLP is object-oriented with low coupling.
AllenNLP is designed to be a platform for research. PyTorch-NLP is designed to be a lightweight toolkit.
Authors
Citing
If you find PyTorch-NLP useful for an academic publication, then please use the following BibTeX to cite it:
@misc{pytorch-nlp,
author = {Petrochuk, Michael},
title = {PyTorch-NLP: Rapid Prototyping with PyTorch Natural Language Processing (NLP) Tools},
year = {2018},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/PetrochukM/PyTorch-NLP}},
}