
Commit 523fd66

modify readmes and create license/acknowledgement docs (#377)
## Summary

Modify READMEs and create license/acknowledgement docs.

## Testing Done

- Hardware Type: <BLANK>
- [ ] run `make test` to ensure correctness
- [x] run `make checkstyle` to ensure code style
- [ ] run `make test-convergence` to ensure convergence
1 parent d784664 commit 523fd66

File tree: 4 files changed (+64, −79 lines)

- README.md
- docs/Acknowledgement.md
- docs/CONTRIBUTING.md (renamed from CONTRIBUTING.md)
- docs/License.md

README.md

Lines changed: 13 additions & 79 deletions
@@ -52,7 +52,8 @@

 <details>
 <summary>Latest News 🔥</summary>
-
+
+- [2024/11/6] We release [v0.4.0](https://github.com/linkedin/Liger-Kernel/releases/tag/v0.4.0): Full AMD support, Tech Report, Modal CI, Llama-3.2-Vision!
 - [2024/10/21] We have released the tech report of Liger Kernel on Arxiv: https://arxiv.org/pdf/2410.10989
 - [2024/9/6] We release v0.2.1 ([X post](https://x.com/liger_kernel/status/1832168197002510649)). 2500+ Stars, 10+ New Contributors, 50+ PRs, 50k Downloads in two weeks!
 - [2024/8/31] CUDA MODE talk, [Liger-Kernel: Real-world Triton kernel for LLM Training](https://youtu.be/gWble4FreV4?si=dxPeIchhkJ36Mbns), [Slides](https://github.com/cuda-mode/lectures?tab=readme-ov-file#lecture-28-liger-kernel)
@@ -80,18 +81,12 @@ With one line of code, Liger Kernel can increase throughput by more than 20% and

 ## Examples

-### Basic
-
-| **Example** | **Description** | **Lightning Studio** |
-|------------------------------------------------|---------------------------------------------------------------------------------------------------|----------------------|
-| [**Hugging Face Trainer**](https://github.com/linkedin/Liger-Kernel/tree/main/examples/huggingface) | Train LLaMA 3-8B ~20% faster with over 40% memory reduction on Alpaca dataset using 4 A100s with FSDP | TBA |
-| [**Lightning Trainer**](https://github.com/linkedin/Liger-Kernel/tree/main/examples/lightning) | Increase 15% throughput and reduce memory usage by 40% with LLaMA3-8B on MMLU dataset using 8 A100s with DeepSpeed ZeRO3 | TBA |

-### Advanced
-
-| **Example** | **Description** | **Lightning Studio** |
-|------------------------------------------------|---------------------------------------------------------------------------------------------------|----------------------|
-| [**Medusa Multi-head LLM (Retraining Phase)**](https://github.com/linkedin/Liger-Kernel/tree/main/examples/medusa) | Reduce memory usage by 80% with 5 LM heads and improve throughput by 40% using 8 A100s with FSDP | TBA |
+| **Use Case** | **Description** |
+|------------------------------------------------|---------------------------------------------------------------------------------------------------|
+| [**Hugging Face Trainer**](https://github.com/linkedin/Liger-Kernel/tree/main/examples/huggingface) | Train LLaMA 3-8B ~20% faster with over 40% memory reduction on Alpaca dataset using 4 A100s with FSDP |
+| [**Lightning Trainer**](https://github.com/linkedin/Liger-Kernel/tree/main/examples/lightning) | Increase 15% throughput and reduce memory usage by 40% with LLaMA3-8B on MMLU dataset using 8 A100s with DeepSpeed ZeRO3 |
+| [**Medusa Multi-head LLM (Retraining Phase)**](https://github.com/linkedin/Liger-Kernel/tree/main/examples/medusa) | Reduce memory usage by 80% with 5 LM heads and improve throughput by 40% using 8 A100s with FSDP |

 ## Key Features
@@ -102,13 +97,6 @@ With one line of code, Liger Kernel can increase throughput by more than 20% and

 - **Multi-GPU supported:** Compatible with multi-GPU setups (PyTorch FSDP, DeepSpeed, DDP, etc.).
 - **Trainer Framework Integration**: [Axolotl](https://github.com/axolotl-ai-cloud/axolotl), [LLaMa-Factory](https://github.com/hiyouga/LLaMA-Factory), [SFTTrainer](https://github.com/huggingface/trl/releases/tag/v0.10.1), [Hugging Face Trainer](https://github.com/huggingface/transformers/pull/32860), [SWIFT](https://github.com/modelscope/ms-swift)

-## Target Audiences
-
-- **Researchers**: Looking to compose models using efficient and reliable kernels for frontier experiments.
-- **ML Practitioners**: Focused on maximizing GPU training efficiency with optimal, high-performance kernels.
-- **Curious Novices**: Eager to learn how to write reliable Triton kernels to enhance training efficiency.
-
 ## Installation

 ### Dependencies
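Taken together with the "one line of code" claim in this hunk's header and the "Trainer Framework Integration" bullet, the integration boils down to a single patching call made before the model is built. A minimal sketch, assuming the `apply_liger_kernel_to_llama` helper exported by `liger_kernel.transformers` (the exact helper set and keyword arguments may differ at this commit):

```python
# Hedged sketch: assumes liger_kernel.transformers exports apply_liger_kernel_to_llama,
# which monkey-patches the Hugging Face LLaMA modules (RMSNorm, RoPE, SwiGLU, cross entropy)
# with Liger's Triton kernels.
import transformers
from liger_kernel.transformers import apply_liger_kernel_to_llama

apply_liger_kernel_to_llama()  # the "one line of code", applied before instantiation

model = transformers.AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")
# Any downstream trainer (Hugging Face Trainer, Axolotl, LLaMA-Factory, FSDP/DeepSpeed
# launchers) now trains with the patched, kernel-fused modules.
```

Patching before the model is instantiated is what keeps the trainer integrations listed above a one-line switch.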
@@ -214,23 +202,6 @@ loss = loss_fn(model.weight, input, target)

 loss.backward()
 ```

-
-## Structure
-
-### Source Code
-
-- `ops/`: Core Triton operations.
-- `transformers/`: PyTorch `nn.Module` implementations built on Triton operations, compliant with the `transformers` API.
-
-### Tests
-
-- `transformers/`: Correctness tests for the Triton-based layers.
-- `convergence/`: Patches Hugging Face models with all kernels, runs multiple iterations, and compares weights, logits, and loss layer-by-layer.
-
-### Benchmark
-
-- `benchmark/`: Execution time and memory benchmarks compared to Hugging Face layers.
-
 ## APIs

 ### AutoModel
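The context lines around this hunk point at the two APIs the README keeps: the `### AutoModel` wrapper and the `loss_fn(model.weight, input, target)` fused-loss pattern in the hunk header. A minimal sketch of both, assuming `AutoLigerKernelForCausalLM` and `LigerFusedLinearCrossEntropyLoss` are the exports behind them (a CUDA device is required, and signatures may vary by version):

```python
import torch
from liger_kernel.transformers import (
    AutoLigerKernelForCausalLM,        # assumed export backing the "AutoModel" section
    LigerFusedLinearCrossEntropyLoss,  # assumed export behind loss_fn(weight, input, target)
)

# AutoModel path: load a supported checkpoint with Liger kernels already applied.
model = AutoLigerKernelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")

# Fused linear cross entropy path: the lm_head weight, last hidden states, and labels
# go straight into the fused loss, so the (tokens x vocab) logits are never materialized.
lm_head = torch.nn.Linear(4096, 128256, bias=False, device="cuda", dtype=torch.bfloat16)
hidden = torch.randn(8, 4096, device="cuda", dtype=torch.bfloat16, requires_grad=True)
labels = torch.randint(0, 128256, (8,), device="cuda")

loss_fn = LigerFusedLinearCrossEntropyLoss()
loss = loss_fn(lm_head.weight, hidden, labels)
loss.backward()
```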
@@ -299,54 +270,17 @@ $$\text{GeGLU}(x)=\text{GELU}(xW+b)\otimes(xV+c)$$

 - **Embedding**: [Embedding](https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html) is implemented by fusing embedding lookup and output operations. It achieves a peak speedup of ~1.5x in the forward pass and an overall speedup of ~1.1x.
 - **Matmul int2xint8**: is implemented by using the cache tiled matrix multiplication and by fusing the matmul with the unpacking process which achieves a considerable speed up and performs on par with @torch.compile
 <!-- TODO: be more specific about batch size -->
-> **Note:**
-> Reported speedups and memory reductions are with respect to the LLaMA 3-8B Hugging Face layer implementations. All models use 4K hidden size and 4K sequence length and are evaluated based on memory usage and wall time for the forward+backward pass on a single NVIDIA A100 80G GPU using small batch sizes. Liger kernels exhibit more efficient scaling to larger batch sizes, detailed further in the [Benchmark](./benchmark) folder.
-
-## Contributing
-
-[CONTRIBUTING GUIDE](https://github.com/linkedin/Liger-Kernel/blob/main/CONTRIBUTING.md)
-
-## Acknowledgement
-
-
-### Design
-
-- [@claire_yishan](https://twitter.com/claire_yishan) for the LOGO design
-- [Wave Snippets](https://www.wavesnippets.com/) for generating the animated code snippets
-
-### Code
-
-We referenced or used the following projects:
-
-
-
-| # | Project | Description | Location | License |
-|---|----------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------|
-| 1 | [Unsloth](https://github.com/unslothai/unsloth/blob/fd753fed99ed5f10ef8a9b7139588d9de9ddecfb/unsloth/kernels/utils.py#L43) | `calculate_settings` to determine block size and warp; We reuse it for Norm and MLP | [Liger Kernel Utils](https://github.com/linkedin/Liger-Kernel/blob/e249eee723978bf8610ff1ea2297d048a2417e20/src/liger_kernel/ops/utils.py#L23) | [Apache](https://github.com/unslothai/unsloth/blob/fd753fed99ed5f10ef8a9b7139588d9de9ddecfb/LICENSE) |
-| 2 | [Unsloth](https://github.com/unslothai/unsloth/blob/976d11a10d54383aeb7a692c69e01151a20bfd72/unsloth/kernels/rms_layernorm.py#L48) | We modified and added dW calculation on top of Unsloth implementation | [Liger Kernel RMS Norm](https://github.com/linkedin/Liger-Kernel/blob/e249eee723978bf8610ff1ea2297d048a2417e20/src/liger_kernel/ops/rms_norm.py#L50) | [Apache](https://github.com/unslothai/unsloth/blob/fd753fed99ed5f10ef8a9b7139588d9de9ddecfb/LICENSE) |
-| 3 | [Triton tutorial](https://triton-lang.org/main/index.html) | We modified on top of triton tutorials | [Liger Kernel RMS Norm](https://github.com/linkedin/Liger-Kernel/blob/e249eee723978bf8610ff1ea2297d048a2417e20/src/liger_kernel/ops/rms_norm.py#L50) | [MIT](https://github.com/triton-lang/triton/blob/main/LICENSE) |
-| 4 | [tiny shakespeare dataset](https://huggingface.co/datasets/karpathy/tiny_shakespeare) | We use tiny shakespeare dataset to conduct convergence test on mini model | [Liger Kernel Convergence](https://github.com/linkedin/Liger-Kernel/tree/main/test/convergence) | N/A |
-| 5 | [Efficient Cross Entropy](https://github.com/mgmalek/efficient_cross_entropy) | We use the idea of gradient-in-forward and chunking | [Liger Kernel Linear Cross Entropy](https://github.com/linkedin/Liger-Kernel/blob/main/src/liger_kernel/ops/fused_linear_cross_entropy.py) | [MIT](https://github.com/mgmalek/efficient_cross_entropy/blob/main/LICENSE) |
-| 6 | [Flash attn](https://github.com/Dao-AILab/flash-attention) | We take many optimization ideas from the work, such as tiling and recomputation | | [BSD](https://github.com/Dao-AILab/flash-attention/blob/main/LICENSE) |
-| 7 | [AutoAWQ](https://github.com/casper-hansen/AutoAWQ) | We reference the design of automodel | [Liger Kernel Auto Model](https://github.com/linkedin/Liger-Kernel/blob/main/src/liger_kernel/transformers/auto_model.py) | [MIT](https://github.com/casper-hansen/AutoAWQ/blob/main/LICENSE) |
-| 8 | [llm.c](https://github.com/karpathy/llm.c) | We reference the design of end-to-end testing | [Liger Kernel Convergence Tests](https://github.com/linkedin/Liger-Kernel/tree/main/test/convergence) | [MIT](https://github.com/karpathy/llm.c/blob/master/LICENSE) |
-
-Many thanks to the contributors to these projects for their invaluable work that helped make Liger possible.
-
-## License

-This project is licensed under the [BSD 2-CLAUSE](https://github.com/linkedin/Liger-Kernel/blob/main/LICENSE) License (see `LICENSE` for details).
-It also includes components from projects licensed under:
+## Contributing, Acknowledgements, and License

-- Apache License 2.0 (see `LICENSE-APACHE-2.0` for details).
-- MIT License (see `LICENSE-MIT-AutoAWQ` for details).
-- MIT License (see `LICENSE-MIT-Efficient Cross Entropy` for details).
-- MIT License (see `LICENSE-MIT-llmc` for details).
-- MIT License (see `LICENSE-MIT-triton` for details).
+- [Contributing Guidelines](https://github.com/linkedin/Liger-Kernel/blob/main/docs/CONTRIBUTING.md)
+- [Acknowledgements](https://github.com/linkedin/Liger-Kernel/blob/main/docs/Acknowledgement.md)
+- [License Information](https://github.com/linkedin/Liger-Kernel/blob/main/docs/License.md)

 ## Contact

-- For public discussion, join [our discord channel](https://discord.gg/vNBDpjhb)
+- For issues, create a Github ticket in this repository
+- For open discussion, join [our discord channel](https://discord.gg/gpumode)
 - For formal collaboration, send an email to [email protected]

 ## Cite this work

docs/Acknowledgement.md

Lines changed: 27 additions & 0 deletions
@@ -0,0 +1,27 @@

+## Acknowledgement
+
+
+### Design
+
+- [@claire_yishan](https://twitter.com/claire_yishan) for the LOGO design
+- [Wave Snippets](https://www.wavesnippets.com/) for generating the animated code snippets
+
+### Code
+
+We referenced or used the following projects:
+
+
+
+| # | Project | Description | Location | License |
+|---|----------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------|
+| 1 | [Unsloth](https://github.com/unslothai/unsloth/blob/fd753fed99ed5f10ef8a9b7139588d9de9ddecfb/unsloth/kernels/utils.py#L43) | `calculate_settings` to determine block size and warp; We reuse it for Norm and MLP | [Liger Kernel Utils](https://github.com/linkedin/Liger-Kernel/blob/e249eee723978bf8610ff1ea2297d048a2417e20/src/liger_kernel/ops/utils.py#L23) | [Apache](https://github.com/unslothai/unsloth/blob/fd753fed99ed5f10ef8a9b7139588d9de9ddecfb/LICENSE) |
+| 2 | [Unsloth](https://github.com/unslothai/unsloth/blob/976d11a10d54383aeb7a692c69e01151a20bfd72/unsloth/kernels/rms_layernorm.py#L48) | We modified and added dW calculation on top of Unsloth implementation | [Liger Kernel RMS Norm](https://github.com/linkedin/Liger-Kernel/blob/e249eee723978bf8610ff1ea2297d048a2417e20/src/liger_kernel/ops/rms_norm.py#L50) | [Apache](https://github.com/unslothai/unsloth/blob/fd753fed99ed5f10ef8a9b7139588d9de9ddecfb/LICENSE) |
+| 3 | [Triton tutorial](https://triton-lang.org/main/index.html) | We modified on top of triton tutorials | [Liger Kernel RMS Norm](https://github.com/linkedin/Liger-Kernel/blob/e249eee723978bf8610ff1ea2297d048a2417e20/src/liger_kernel/ops/rms_norm.py#L50) | [MIT](https://github.com/triton-lang/triton/blob/main/LICENSE) |
+| 4 | [tiny shakespeare dataset](https://huggingface.co/datasets/karpathy/tiny_shakespeare) | We use tiny shakespeare dataset to conduct convergence test on mini model | [Liger Kernel Convergence](https://github.com/linkedin/Liger-Kernel/tree/main/test/convergence) | N/A |
+| 5 | [Efficient Cross Entropy](https://github.com/mgmalek/efficient_cross_entropy) | We use the idea of gradient-in-forward and chunking | [Liger Kernel Linear Cross Entropy](https://github.com/linkedin/Liger-Kernel/blob/main/src/liger_kernel/ops/fused_linear_cross_entropy.py) | [MIT](https://github.com/mgmalek/efficient_cross_entropy/blob/main/LICENSE) |
+| 6 | [Flash attn](https://github.com/Dao-AILab/flash-attention) | We take many optimization ideas from the work, such as tiling and recomputation | | [BSD](https://github.com/Dao-AILab/flash-attention/blob/main/LICENSE) |
+| 7 | [AutoAWQ](https://github.com/casper-hansen/AutoAWQ) | We reference the design of automodel | [Liger Kernel Auto Model](https://github.com/linkedin/Liger-Kernel/blob/main/src/liger_kernel/transformers/auto_model.py) | [MIT](https://github.com/casper-hansen/AutoAWQ/blob/main/LICENSE) |
+| 8 | [llm.c](https://github.com/karpathy/llm.c) | We reference the design of end-to-end testing | [Liger Kernel Convergence Tests](https://github.com/linkedin/Liger-Kernel/tree/main/test/convergence) | [MIT](https://github.com/karpathy/llm.c/blob/master/LICENSE) |
+
+Many thanks to the contributors to these projects for their invaluable work that helped make Liger possible.

CONTRIBUTING.md renamed to docs/CONTRIBUTING.md

Lines changed: 16 additions & 0 deletions
@@ -26,6 +26,22 @@ Leave `#take` in the comment and tag the maintainer.

 pip install -e .'[dev]'
 ```

+## Structure
+
+### Source Code
+
+- `ops/`: Core Triton operations.
+- `transformers/`: PyTorch `nn.Module` implementations built on Triton operations, compliant with the `transformers` API.
+
+### Tests
+
+- `transformers/`: Correctness tests for the Triton-based layers.
+- `convergence/`: Patches Hugging Face models with all kernels, runs multiple iterations, and compares weights, logits, and loss layer-by-layer.
+
+### Benchmark
+
+- `benchmark/`: Execution time and memory benchmarks compared to Hugging Face layers.
+
 ## Adding support for a new model
 To get familiar with the folder structure, please refer to https://github.com/linkedin/Liger-Kernel?tab=readme-ov-file#structure.
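The `convergence/` bullet added above amounts to running the same training loop twice, once stock and once with Liger patches, and asserting that weights, logits, and loss stay close. A hedged sketch of that comparison, where `build_model`, `get_batch`, and `apply_liger_patches` are hypothetical stand-ins rather than repository helpers:

```python
import torch

def assert_models_converge(build_model, get_batch, apply_liger_patches, steps=10):
    """Hedged sketch of a convergence check: train a stock model and a Liger-patched
    model on identical batches, then compare logits, loss, and weights layer by layer.
    build_model, get_batch, and apply_liger_patches are hypothetical stand-ins."""
    torch.manual_seed(0)
    baseline = build_model()           # stock Hugging Face modules
    torch.manual_seed(0)
    apply_liger_patches()              # e.g. one of the apply_liger_kernel_to_* helpers
    patched = build_model()            # same init, Liger-patched modules

    opt_a = torch.optim.AdamW(baseline.parameters(), lr=1e-4)
    opt_b = torch.optim.AdamW(patched.parameters(), lr=1e-4)

    for _ in range(steps):
        batch = get_batch()            # same tokens (and labels) fed to both models
        out_a = baseline(**batch)
        out_b = patched(**batch)
        torch.testing.assert_close(out_a.logits, out_b.logits, rtol=1e-2, atol=1e-2)
        torch.testing.assert_close(out_a.loss, out_b.loss, rtol=1e-2, atol=1e-2)
        out_a.loss.backward(); opt_a.step(); opt_a.zero_grad()
        out_b.loss.backward(); opt_b.step(); opt_b.zero_grad()

    # Layer-by-layer weight comparison after training.
    for (name, p_a), (_, p_b) in zip(baseline.named_parameters(), patched.named_parameters()):
        torch.testing.assert_close(p_a, p_b, rtol=1e-2, atol=1e-2, msg=name)
```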

docs/License.md

Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@

+This project is licensed under the [BSD 2-CLAUSE](https://github.com/linkedin/Liger-Kernel/blob/main/LICENSE) License (see `LICENSE` for details).
+It also includes components from projects licensed under:
+
+- Apache License 2.0 (see `LICENSE-APACHE-2.0` for details).
+- MIT License (see `LICENSE-MIT-AutoAWQ` for details).
+- MIT License (see `LICENSE-MIT-Efficient Cross Entropy` for details).
+- MIT License (see `LICENSE-MIT-llmc` for details).
+- MIT License (see `LICENSE-MIT-triton` for details).
