PyTorch implementation of the Vision Transformer [Dosovitskiy, A. (ICLR'21)], modified to reach over 90% accuracy on CIFAR-10 when trained FROM SCRATCH with a small number of parameters (6.3M, vs. 86M for the original ViT-B). (I know, 90% is easily reached with CNN-based architectures.) If you find a problem, kindly let me know :) Any suggestions are welcome!
- Install packages
$git clone https://github.com/omihub777/ViT-CIFAR.git
$cd ViT-CIFAR/
$bash setup.sh
- Train ViT on CIFAR-10
$python main.py --dataset c10 --label-smoothing --autoaugment
- (Optional) Train ViT on CIFAR-10 using Comet.ml
If you have a Comet.ml account, experiments are automatically logged there when you specify your API key. (Otherwise, experiments are automatically logged using `CSVLogger`.)
$python main.py --api-key [YOUR COMET API KEY] --dataset c10
- Results

| Dataset | Acc.(%) | Time(hh:mm:ss) |
|---|---|---|
| CIFAR-10 | 90.92 | 02:14:22 |
| CIFAR-100 | 66.54 | 02:14:17 |
| SVHN | 97.31 | 03:24:23 |
- Number of parameters: 6.3 M
- Device: V100 (single GPU)
- Mixed Precision is enabled
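As a sanity check, the 6.3M parameter count can be roughly reproduced from the architecture hyperparameters in the table below (Heads 12, Layers 7, Hidden 384, MLP Hidden 384). The sketch below is a back-of-the-envelope estimate, not taken from this repo's code: the patch size of 8 (so 4×4 = 16 tokens plus a class token), the bias terms, and the LayerNorm + linear classifier head are all assumptions.

```python
# Rough parameter count for the ViT configuration described in this README.
# Patch size 8 and the exact head layout are assumptions for illustration.
hidden, mlp, layers, n_classes = 384, 384, 7, 10
patch, img = 8, 32
tokens = (img // patch) ** 2 + 1  # 16 patch tokens + 1 class token

# One transformer block: QKV + output projections, 2-layer MLP, 2 LayerNorms.
attn = 3 * (hidden * hidden + hidden) + hidden * hidden + hidden
ffn = (hidden * mlp + mlp) + (mlp * hidden + hidden)
norms = 2 * (2 * hidden)  # weight + bias per LayerNorm
block = attn + ffn + norms

# Embeddings and classifier head.
patch_embed = (3 * patch * patch) * hidden + hidden  # linear patch projection
pos_embed = tokens * hidden
cls_token = hidden
head = 2 * hidden + hidden * n_classes + n_classes   # final LayerNorm + linear

total = layers * block + patch_embed + pos_embed + cls_token + head
print(f"{total / 1e6:.2f}M")  # → 6.31M
```

The transformer blocks dominate (about 0.89M parameters per layer × 7 layers), which is why shrinking Hidden/MLP Hidden from ViT-B's 768/3072 to 384/384 brings the model from 86M down to the ~6.3M range.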
| Param | Value |
|---|---|
| Epoch | 200 |
| Batch Size | 128 |
| Optimizer | Adam |
| Weight Decay | 5e-5 |
| LR Scheduler | Cosine |
| (Init LR, Last LR) | (1e-3, 1e-5) |
| Warmup | 5 epochs |
| Dropout | 0.0 |
| AutoAugment | True |
| Label Smoothing | 0.1 |
| Heads | 12 |
| Layers | 7 |
| Hidden | 384 |
| MLP Hidden | 384 |
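The learning-rate schedule in the table (5 warmup epochs, then cosine decay from an initial LR of 1e-3 to a last LR of 1e-5) can be sketched in plain Python. This is a minimal sketch: the linear-from-near-zero warmup shape is an assumption, and the repo's actual scheduler may step per iteration rather than per epoch.

```python
import math

def vit_cifar_lr(epoch, total_epochs=200, warmup=5, init_lr=1e-3, last_lr=1e-5):
    """Hypothetical per-epoch LR matching the hyperparameter table."""
    if epoch < warmup:
        # Linear warmup up to init_lr (exact warmup shape is an assumption).
        return init_lr * (epoch + 1) / warmup
    # Cosine decay from init_lr down to last_lr over the remaining epochs.
    progress = (epoch - warmup) / (total_epochs - 1 - warmup)
    return last_lr + 0.5 * (init_lr - last_lr) * (1 + math.cos(math.pi * progress))
```

For example, `vit_cifar_lr(5)` returns the peak 1e-3 and `vit_cifar_lr(199)` returns the final 1e-5.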
- Longer training gives a performance boost.
- ViT doesn't seem to have fully converged within 200 epochs.
- A more extensive hyperparameter search (e.g. init LR, last LR, weight decay, label smoothing, number of heads, etc.) definitely gives a performance gain.
- References

"An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale", Dosovitskiy, A., et al. (ICLR'21)
- The Vision Transformer paper.
"TransGAN: Two Transformers Can Make One Strong GAN", Jiang, Y., et al. (2021)
- This repo is inspired by the discriminator of TransGAN.
-
- Some tricks come from this paper.





