Commit 76017b7

Merge branch 'main' into feat/alignment/dpo
2 parents 80b1fb6 + 523fd66 commit 76017b7

16 files changed: +999 -105 lines changed

.github/workflows/ci.yml

Lines changed: 26 additions & 1 deletion
@@ -61,6 +61,31 @@ jobs:
           python -m pip install --upgrade pip
           pip install modal
 
-      - name: Run unit tests
+      - name: Run tests
         run: |
           modal run dev.modal.tests
+
+  tests-bwd:
+    runs-on: ubuntu-latest
+    needs: [checkstyle]
+    env:
+      MODAL_TOKEN_ID: ${{ secrets.MODAL_TOKEN_ID }}
+      MODAL_TOKEN_SECRET: ${{ secrets.MODAL_TOKEN_SECRET }}
+
+    steps:
+      - name: Checkout code
+        uses: actions/checkout@v3
+
+      - name: Set up Python
+        uses: actions/setup-python@v3
+        with:
+          python-version: '3.10'
+
+      - name: Install dependencies
+        run: |
+          python -m pip install --upgrade pip
+          pip install modal
+
+      - name: Run tests
+        run: |
+          modal run dev.modal.tests_bwd

Makefile

Lines changed: 3 additions & 2 deletions
@@ -20,8 +20,9 @@ checkstyle:
 # Command to run pytest for convergence tests
 # We have to explicitly set HF_DATASETS_OFFLINE=1, or dataset will silently try to send metrics and timeout (80s) https://github.com/huggingface/datasets/blob/37a603679f451826cfafd8aae00738b01dcb9d58/src/datasets/load.py#L286
 test-convergence:
-	HF_DATASETS_OFFLINE=1 python -m pytest --disable-warnings test/convergence
-
+	HF_DATASETS_OFFLINE=1 python -m pytest --disable-warnings test/convergence/test_mini_models.py
+	HF_DATASETS_OFFLINE=1 python -m pytest --disable-warnings test/convergence/test_mini_models_multimodal.py
+	HF_DATASETS_OFFLINE=1 python -m pytest --disable-warnings test/convergence/test_mini_models_with_logits.py
 
 # Command to run all benchmark scripts and update benchmarking data file
 # By default this doesn't overwrite existing data for the same benchmark experiment
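
For readers who want to reproduce the updated `test-convergence` target outside of `make`, the following is a minimal Python sketch of the same behaviour: one pytest process per file, each with `HF_DATASETS_OFFLINE=1` set. The three file paths come from the diff above; the helper names `build_commands` and `run_convergence` are hypothetical and not part of the repository.

```python
import os
import subprocess
import sys

# The three files the updated Makefile target runs one at a time (paths from the diff).
CONVERGENCE_TEST_FILES = [
    "test/convergence/test_mini_models.py",
    "test/convergence/test_mini_models_multimodal.py",
    "test/convergence/test_mini_models_with_logits.py",
]


def build_commands(files=CONVERGENCE_TEST_FILES):
    """Return one pytest command per file, mirroring the Makefile recipe lines."""
    return [
        [sys.executable, "-m", "pytest", "--disable-warnings", path]
        for path in files
    ]


def run_convergence(files=CONVERGENCE_TEST_FILES, runner=subprocess.run):
    """Run each file in its own pytest process with HF_DATASETS_OFFLINE=1.

    check=True raises on the first failing file, similar to how make
    stops at the first failing recipe line.
    """
    env = dict(os.environ, HF_DATASETS_OFFLINE="1")
    for cmd in build_commands(files):
        runner(cmd, check=True, env=env)


if __name__ == "__main__":
    run_convergence()
```

The `runner` parameter exists only so the command construction can be checked without launching pytest; calling `run_convergence()` from the repository root would behave like the three recipe lines.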

README.md

Lines changed: 14 additions & 80 deletions
@@ -52,7 +52,8 @@
 
 <details>
 <summary>Latest News 🔥</summary>
-
+
+- [2024/11/6] We release [v0.4.0](https://github.com/linkedin/Liger-Kernel/releases/tag/v0.4.0): Full AMD support, Tech Report, Modal CI, Llama-3.2-Vision!
 - [2024/10/21] We have released the tech report of Liger Kernel on Arxiv: https://arxiv.org/pdf/2410.10989
 - [2024/9/6] We release v0.2.1 ([X post](https://x.com/liger_kernel/status/1832168197002510649)). 2500+ Stars, 10+ New Contributors, 50+ PRs, 50k Downloads in two weeks!
 - [2024/8/31] CUDA MODE talk, [Liger-Kernel: Real-world Triton kernel for LLM Training](https://youtu.be/gWble4FreV4?si=dxPeIchhkJ36Mbns), [Slides](https://github.com/cuda-mode/lectures?tab=readme-ov-file#lecture-28-liger-kernel)
@@ -80,18 +81,12 @@ With one line of code, Liger Kernel can increase throughput by more than 20% and
 
 ## Examples
 
-### Basic
-
-| **Example** | **Description** | **Lightning Studio** |
-|------------------------------------------------|---------------------------------------------------------------------------------------------------|----------------------|
-| [**Hugging Face Trainer**](https://github.com/linkedin/Liger-Kernel/tree/main/examples/huggingface) | Train LLaMA 3-8B ~20% faster with over 40% memory reduction on Alpaca dataset using 4 A100s with FSDP | TBA |
-| [**Lightning Trainer**](https://github.com/linkedin/Liger-Kernel/tree/main/examples/lightning) | Increase 15% throughput and reduce memory usage by 40% with LLaMA3-8B on MMLU dataset using 8 A100s with DeepSpeed ZeRO3 | TBA |
 
-### Advanced
-
-| **Example** | **Description** | **Lightning Studio** |
-|------------------------------------------------|---------------------------------------------------------------------------------------------------|----------------------|
-| [**Medusa Multi-head LLM (Retraining Phase)**](https://github.com/linkedin/Liger-Kernel/tree/main/examples/medusa) | Reduce memory usage by 80% with 5 LM heads and improve throughput by 40% using 8 A100s with FSDP | TBA |
+| **Use Case** | **Description** |
+|------------------------------------------------|---------------------------------------------------------------------------------------------------|
+| [**Hugging Face Trainer**](https://github.com/linkedin/Liger-Kernel/tree/main/examples/huggingface) | Train LLaMA 3-8B ~20% faster with over 40% memory reduction on Alpaca dataset using 4 A100s with FSDP |
+| [**Lightning Trainer**](https://github.com/linkedin/Liger-Kernel/tree/main/examples/lightning) | Increase 15% throughput and reduce memory usage by 40% with LLaMA3-8B on MMLU dataset using 8 A100s with DeepSpeed ZeRO3 |
+| [**Medusa Multi-head LLM (Retraining Phase)**](https://github.com/linkedin/Liger-Kernel/tree/main/examples/medusa) | Reduce memory usage by 80% with 5 LM heads and improve throughput by 40% using 8 A100s with FSDP | |
 
 ## Key Features
 
@@ -102,13 +97,6 @@ With one line of code, Liger Kernel can increase throughput by more than 20% and
 - **Multi-GPU supported:** Compatible with multi-GPU setups (PyTorch FSDP, DeepSpeed, DDP, etc.).
 - **Trainer Framework Integration**: [Axolotl](https://github.com/axolotl-ai-cloud/axolotl), [LLaMa-Factory](https://github.com/hiyouga/LLaMA-Factory), [SFTTrainer](https://github.com/huggingface/trl/releases/tag/v0.10.1), [Hugging Face Trainer](https://github.com/huggingface/transformers/pull/32860), [SWIFT](https://github.com/modelscope/ms-swift)
 
-## Target Audiences
-
-- **Researchers**: Looking to compose models using efficient and reliable kernels for frontier experiments.
-- **ML Practitioners**: Focused on maximizing GPU training efficiency with optimal, high-performance kernels.
-- **Curious Novices**: Eager to learn how to write reliable Triton kernels to enhance training efficiency.
-
-
 ## Installation
 
 ### Dependencies
@@ -214,23 +202,6 @@ loss = loss_fn(model.weight, input, target)
 loss.backward()
 ```
 
-
-## Structure
-
-### Source Code
-
-- `ops/`: Core Triton operations.
-- `transformers/`: PyTorch `nn.Module` implementations built on Triton operations, compliant with the `transformers` API.
-
-### Tests
-
-- `transformers/`: Correctness tests for the Triton-based layers.
-- `convergence/`: Patches Hugging Face models with all kernels, runs multiple iterations, and compares weights, logits, and loss layer-by-layer.
-
-### Benchmark
-
-- `benchmark/`: Execution time and memory benchmarks compared to Hugging Face layers.
-
 ## APIs
 
 ### AutoModel
@@ -249,7 +220,7 @@ loss.backward()
 | Mistral | `liger_kernel.transformers.apply_liger_kernel_to_mistral` | RoPE, RMSNorm, SwiGLU, CrossEntropyLoss, FusedLinearCrossEntropy |
 | Mixtral | `liger_kernel.transformers.apply_liger_kernel_to_mixtral` | RoPE, RMSNorm, SwiGLU, CrossEntropyLoss, FusedLinearCrossEntropy |
 | Gemma1 | `liger_kernel.transformers.apply_liger_kernel_to_gemma` | RoPE, RMSNorm, GeGLU, CrossEntropyLoss, FusedLinearCrossEntropy |
-| Gemma2 | `liger_kernel.transformers.apply_liger_kernel_to_gemma2` | RoPE, RMSNorm, GeGLU, CrossEntropyLoss |
+| Gemma2 | `liger_kernel.transformers.apply_liger_kernel_to_gemma2` | RoPE, RMSNorm, GeGLU, CrossEntropyLoss, FusedLinearCrossEntropy |
 | Qwen2 & Qwen2.5 | `liger_kernel.transformers.apply_liger_kernel_to_qwen2` | RoPE, RMSNorm, SwiGLU, CrossEntropyLoss, FusedLinearCrossEntropy |
 | Qwen2-VL | `liger_kernel.transformers.apply_liger_kernel_to_qwen2_vl` | RMSNorm, LayerNorm, SwiGLU, CrossEntropyLoss, FusedLinearCrossEntropy |
 | Phi3 & Phi3.5 | `liger_kernel.transformers.apply_liger_kernel_to_phi3` | RoPE, RMSNorm, SwiGLU, CrossEntropyLoss, FusedLinearCrossEntropy |
@@ -299,54 +270,17 @@ $$\text{GeGLU}(x)=\text{GELU}(xW+b)\otimes(xV+c)$$
 - **Embedding**: [Embedding](https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html) is implemented by fusing embedding lookup and output operations. It achieves a peak speedup of ~1.5x in the forward pass and an overall speedup of ~1.1x.
 - **Matmul int2xint8**: is implemented by using the cache tiled matrix multiplication and by fusing the matmul with the unpacking process which achieves a considerable speed up and performs on par with @torch.compile
 <!-- TODO: be more specific about batch size -->
-> **Note:**
-> Reported speedups and memory reductions are with respect to the LLaMA 3-8B Hugging Face layer implementations. All models use 4K hidden size and 4K sequence length and are evaluated based on memory usage and wall time for the forward+backward pass on a single NVIDIA A100 80G GPU using small batch sizes. Liger kernels exhibit more efficient scaling to larger batch sizes, detailed further in the [Benchmark](./benchmark) folder.
-
-## Contributing
-
-[CONTRIBUTING GUIDE](https://github.com/linkedin/Liger-Kernel/blob/main/CONTRIBUTING.md)
-
-## Acknowledgement
-
-
-### Design
-
-- [@claire_yishan](https://twitter.com/claire_yishan) for the LOGO design
-- [Wave Snippets](https://www.wavesnippets.com/) for generating the animated code snippets
-
-### Code
-
-We referenced or used the following projects:
-
-
-
-| # | Project | Description | Location | License |
-|---|----------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------|
-| 1 | [Unsloth](https://github.com/unslothai/unsloth/blob/fd753fed99ed5f10ef8a9b7139588d9de9ddecfb/unsloth/kernels/utils.py#L43) | `calculate_settings` to determine block size and warp; We reuse it for Norm and MLP | [Liger Kernel Utils](https://github.com/linkedin/Liger-Kernel/blob/e249eee723978bf8610ff1ea2297d048a2417e20/src/liger_kernel/ops/utils.py#L23) | [Apache](https://github.com/unslothai/unsloth/blob/fd753fed99ed5f10ef8a9b7139588d9de9ddecfb/LICENSE) |
-| 2 | [Unsloth](https://github.com/unslothai/unsloth/blob/976d11a10d54383aeb7a692c69e01151a20bfd72/unsloth/kernels/rms_layernorm.py#L48) | We modified and added dW calculation on top of Unsloth implementation | [Liger Kernel RMS Norm](https://github.com/linkedin/Liger-Kernel/blob/e249eee723978bf8610ff1ea2297d048a2417e20/src/liger_kernel/ops/rms_norm.py#L50) | [Apache](https://github.com/unslothai/unsloth/blob/fd753fed99ed5f10ef8a9b7139588d9de9ddecfb/LICENSE) |
-| 3 | [Triton tutorial](https://triton-lang.org/main/index.html) | We modified on top of triton tutorials | [Liger Kernel RMS Norm](https://github.com/linkedin/Liger-Kernel/blob/e249eee723978bf8610ff1ea2297d048a2417e20/src/liger_kernel/ops/rms_norm.py#L50) | [MIT](https://github.com/triton-lang/triton/blob/main/LICENSE) |
-| 4 | [tiny shakespeare dataset](https://huggingface.co/datasets/karpathy/tiny_shakespeare) | We use tiny shakespeare dataset to conduct convergence test on mini model | [Liger Kernel Convergence](https://github.com/linkedin/Liger-Kernel/tree/main/test/convergence) | N/A |
-| 5 | [Efficient Cross Entropy](https://github.com/mgmalek/efficient_cross_entropy) | We use the idea of gradient-in-forward and chunking | [Liger Kernel Linear Cross Entropy](https://github.com/linkedin/Liger-Kernel/blob/main/src/liger_kernel/ops/fused_linear_cross_entropy.py) | [MIT](https://github.com/mgmalek/efficient_cross_entropy/blob/main/LICENSE) |
-| 6 | [Flash attn](https://github.com/Dao-AILab/flash-attention) | We take many optimization ideas from the work, such as tiling and recomputation | | [BSD](https://github.com/Dao-AILab/flash-attention/blob/main/LICENSE) |
-| 7 | [AutoAWQ](https://github.com/casper-hansen/AutoAWQ) | We reference the design of automodel | [Liger Kernel Auto Model](https://github.com/linkedin/Liger-Kernel/blob/main/src/liger_kernel/transformers/auto_model.py) | [MIT](https://github.com/casper-hansen/AutoAWQ/blob/main/LICENSE) |
-| 8 | [llm.c](https://github.com/karpathy/llm.c) | We reference the design of end-to-end testing | [Liger Kernel Convergence Tests](https://github.com/linkedin/Liger-Kernel/tree/main/test/convergence) | [MIT](https://github.com/karpathy/llm.c/blob/master/LICENSE) |
-
-Many thanks to the contributors to these projects for their invaluable work that helped make Liger possible.
-
-## License
 
-This project is licensed under the [BSD 2-CLAUSE](https://github.com/linkedin/Liger-Kernel/blob/main/LICENSE) License (see `LICENSE` for details).
-It also includes components from projects licensed under:
+## Contributing, Acknowledgements, and License
 
-- Apache License 2.0 (see `LICENSE-APACHE-2.0` for details).
-- MIT License (see `LICENSE-MIT-AutoAWQ` for details).
-- MIT License (see `LICENSE-MIT-Efficient Cross Entropy` for details).
-- MIT License (see `LICENSE-MIT-llmc` for details).
-- MIT License (see `LICENSE-MIT-triton` for details).
+- [Contributing Guidelines](https://github.com/linkedin/Liger-Kernel/blob/main/docs/CONTRIBUTING.md)
+- [Acknowledgements](https://github.com/linkedin/Liger-Kernel/blob/main/docs/Acknowledgement.md)
+- [License Information](https://github.com/linkedin/Liger-Kernel/blob/main/docs/License.md)
 
 ## Contact
 
-- For public discussion, join [our discord channel](https://discord.gg/vNBDpjhb)
+- For issues, create a Github ticket in this repository
+- For open discussion, join [our discord channel](https://discord.gg/gpumode)
 - For formal collaboration, send an email to [email protected]
 
 ## Cite this work

dev/modal/tests_bwd.py

Lines changed: 28 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,28 @@
1+
from pathlib import Path
2+
3+
import modal
4+
5+
ROOT_PATH = Path(__file__).parent.parent.parent
6+
7+
# tests_bwd is to ensure the backward compatibility of liger with older transformers
8+
image = (
9+
modal.Image.debian_slim()
10+
.pip_install_from_pyproject(
11+
ROOT_PATH / "pyproject.toml", optional_dependencies=["dev"]
12+
)
13+
.pip_install("transformers==4.44.2")
14+
)
15+
16+
app = modal.App("liger_tests", image=image)
17+
18+
# mount: add local files to the remote container
19+
repo = modal.Mount.from_local_dir(ROOT_PATH, remote_path="/root/liger-kernel")
20+
21+
22+
@app.function(gpu="A10G", mounts=[repo], timeout=60 * 10)
23+
def liger_tests():
24+
import subprocess
25+
26+
subprocess.run(["pip", "install", "-e", "."], check=True, cwd="/root/liger-kernel")
27+
subprocess.run(["make", "test"], check=True, cwd="/root/liger-kernel")
28+
subprocess.run(["make", "test-convergence"], check=True, cwd="/root/liger-kernel")
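
Because `tests_bwd.py` pins `transformers==4.44.2` on top of the project's own dependencies, it can be useful to confirm which version actually ended up installed before reading the test results. The snippet below is a rough, standard-library-only sketch of such a check. The helper names (`parse_version`, `installed_version`, `is_pinned_environment`) are hypothetical and do not exist in the repository, and the parser is deliberately simplistic: it keeps only the leading purely numeric components of a version string.

```python
from importlib import metadata


def parse_version(text):
    """Turn '4.44.2' into (4, 44, 2).

    Parsing stops at the first component that is not purely numeric,
    so '4.45.0.dev0' becomes (4, 45, 0).
    """
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def installed_version(package):
    """Return the parsed version of an installed package, or None if it is absent."""
    try:
        return parse_version(metadata.version(package))
    except metadata.PackageNotFoundError:
        return None


# The pin used by dev/modal/tests_bwd.py in this commit.
PINNED_TRANSFORMERS = parse_version("4.44.2")


def is_pinned_environment(package="transformers"):
    """True when the installed package matches the backward-compatibility pin."""
    return installed_version(package) == PINNED_TRANSFORMERS
```

A call such as `is_pinned_environment()` at the start of a test session would show whether the run is exercising the older `transformers` release or whatever version the `dev` extras resolved to.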
