VFormer
A modular PyTorch library for vision transformer models
Library Features
- Contains implementations of prominent ViT architectures, broken down into modular components such as encoders, attention mechanisms, and decoders
- Makes it easy to develop custom models by composing components from different architectures
- Provides utilities for visualizing attention using techniques such as gradient rollout (see the sketch after this list)
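For instance, attention maps can be visualized with a gradient rollout utility. The snippet below is only a sketch: the class name VITAttentionGradRollout, its import path, and its call signature are assumptions based on common grad-rollout implementations, so please check the documentation for the exact API -

import torch
from vformer.models.classification import VanillaViT
from vformer.viz import VITAttentionGradRollout  # assumed import path and class name

# Build any ViT-style classifier (parameters here are purely illustrative)
model = VanillaViT(img_size=224, patch_size=16, n_classes=10)
image = torch.randn(1, 3, 224, 224)

# Wrap the model and compute a gradient-weighted attention rollout mask
# for a chosen target class; the arguments are illustrative assumptions
grad_rollout = VITAttentionGradRollout(model, discard_ratio=0.9)
mask = grad_rollout(image, category_index=3)  # 2D heatmap over image patches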
Installation
From source (recommended)
git clone https://github.com/SforAiDl/vformer.git
cd vformer/
python setup.py install
From PyPI
pip install vformer
Models supported
Example usage
To instantiate and use a Swin Transformer model -
import torch
from vformer.models.classification import SwinTransformer

image = torch.randn(1, 3, 224, 224)

model = SwinTransformer(
    img_size=224,
    patch_size=4,
    in_channels=3,
    n_classes=10,
    embed_dim=96,
    depths=[2, 2, 6, 2],
    num_heads=[3, 6, 12, 24],
    window_size=7,
    drop_rate=0.2,
)

logits = model(image)
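Since the model was built with n_classes=10 and the input is a single image, the returned logits tensor has shape (1, 10):

print(logits.shape)  # torch.Size([1, 10])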
VFormer has a modular design and allows for easy experimentation using blocks/modules of different architectures. For example, if desired, you can use just the encoder or the windowed attention layer of the Swin Transformer model.
from vformer.attention import WindowAttention

window_attn = WindowAttention(
    dim=128,
    window_size=7,
    num_heads=2,
    # additional optional arguments can be passed as keywords
)
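As a rough usage sketch, the windowed attention layer operates on batches of flattened windows rather than whole images. Assuming it follows the standard Swin interface, the input has shape (num_windows * batch, window_size**2, dim) -

import torch

# Illustrative input: 8 windows of 7 x 7 = 49 tokens each, embedding dim 128
# (this input layout is an assumption based on the standard Swin design)
x = torch.randn(8, 7 * 7, 128)
out = window_attn(x)  # expected to return a tensor of the same shape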
from vformer.encoder import SwinEncoder

swin_encoder = SwinEncoder(
    dim=128,
    input_resolution=(224, 224),
    depth=2,
    num_heads=2,
    window_size=7,
    # additional optional arguments can be passed as keywords
)
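Similarly, the encoder consumes a sequence of patch embeddings. Assuming the usual (batch, H * W, dim) layout for the configured input_resolution, a forward pass would look like the sketch below (a 224 x 224 token grid is large and is used here only to match the constructor arguments above) -

import torch

# Illustrative input: one sample with 224 * 224 token positions and embedding dim 128
# (the input layout is an assumption based on the standard Swin encoder design)
x = torch.randn(1, 224 * 224, 128)
out = swin_encoder(x)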
Please refer to our documentation to learn more.
References