
Commit b12cc76

Woomin Song authored
Add 4 papers (#148)
* Add 4 papers
* Move Simba

Co-authored-by: Woomin Song <[email protected]>
1 parent 7d153bd commit b12cc76

File tree

1 file changed (+4, −0 lines)


README.md

Lines changed: 4 additions & 0 deletions
@@ -378,6 +378,7 @@ python3 download_pdfs.py # The code is generated by Doubao AI
 |2024.04|🔥🔥[Infini-attention] Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention(@Google) | [[pdf]](https://arxiv.org/pdf/2404.07143.pdf) | ⚠️ |⭐️⭐️ |
 |2024.04|🔥🔥[RAGCache] RAGCache: Efficient Knowledge Caching for Retrieval-Augmented Generation(@Peking University&ByteDance Inc) | [[pdf]](https://arxiv.org/pdf/2404.12457.pdf) | ⚠️ |⭐️⭐️ |
 |2024.04|🔥🔥[**KCache**] EFFICIENT LLM INFERENCE WITH KCACHE(@Qiaozhi He, Zhihua Wu)| [[pdf]](https://arxiv.org/pdf/2404.18057) | ⚠️ |⭐️⭐️ |
+|2024.04|🔥[**HOMER**] Hierarchical Context Merging: Better Long Context Understanding for Pre-trained LLMs(@KAIST)|[[pdf]](https://arxiv.org/abs/2404.10308)|[[HOMER]](https://github.com/alinlab/HOMER) ![](https://img.shields.io/github/stars/alinlab/HOMER?style=social) |⭐️⭐️ |
 |2024.05|🔥🔥[**YOCO**] You Only Cache Once: Decoder-Decoder Architectures for Language Models(@Microsoft)| [[pdf]](https://arxiv.org/pdf/2405.05254) | [[unilm-YOCO]](https://github.com/microsoft/unilm/tree/master/YOCO) ![](https://img.shields.io/github/stars/microsoft/unilm.svg?style=social) |⭐️⭐️ |
 |2024.05|🔥🔥[SKVQ] SKVQ: Sliding-window Key and Value Cache Quantization for Large Language Models(@Shanghai AI Laboratory)| [[pdf]](https://arxiv.org/pdf/2405.06219) | ⚠️ |⭐️⭐️ |
 |2024.05|🔥🔥[**CLA**] Reducing Transformer Key-Value Cache Size with Cross-Layer Attention(@MIT-IBM)| [[pdf]](https://arxiv.org/pdf/2405.12981) | ⚠️ |⭐️⭐️ |
@@ -391,6 +392,7 @@ python3 download_pdfs.py # The code is generated by Doubao AI
 |2024.09|🔥[**RetrievalAttention**] RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval(@microsoft.com)|[[pdf]](https://arxiv.org/pdf/2409.10516)|⚠️|⭐️⭐️ |
 |2024.10|🔥[**ShadowKV**] ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference(@CMU & bytedance)|[[pdf]](https://arxiv.org/pdf/2410.21465)|[[ShadowKV]](https://github.com/bytedance/ShadowKV) ![](https://img.shields.io/github/stars/bytedance/ShadowKV.svg?style=social) |⭐️⭐️ |
 |2025.01|🔥🔥🔥 [**Lightning Attention**] MiniMax-01: Scaling Foundation Models with Lightning Attention | [[report]](https://filecdn.minimax.chat/_Arxiv_MiniMax_01_Report.pdf) | [[MiniMax-01]](https://github.com/MiniMax-AI/MiniMax-01) ![](https://img.shields.io/github/stars/MiniMax-AI/MiniMax-01.svg?style=social) | ⭐️⭐️ |
+|2025.06|🔥[**REFORM**] Compress, Gather, and Recompute: REFORMing Long-Context Processing in Transformers(@KAIST & Amazon etc)|[[pdf]](https://arxiv.org/abs/2506.01215)|⚠️|⭐️⭐️ |

 ### 📖Early-Exit/Intermediate Layer Decoding ([©️back👆🏻](#paperlist))
 <div id="Early-Exit"></div>
@@ -435,6 +437,7 @@ python3 download_pdfs.py # The code is generated by Doubao AI
 |2024.09|🔥[**Hybrid Inference**] Efficient Hybrid Inference for LLMs: Reward-Based Token Modelling with Selective Cloud Assistance|[[pdf]](https://arxiv.org/pdf/2409.13757) | ⚠️ |⭐️⭐️ |
 |2024.10|🔥[**PARALLELSPEC**] PARALLELSPEC: PARALLEL DRAFTER FOR EFFICIENT SPECULATIVE DECODING(@Tencent AI Lab etc)|[[pdf]](https://arxiv.org/pdf/2410.05589) | ⚠️ |⭐️⭐️ |
 |2024.10|🔥[**Fast Best-of-N**] Fast Best-of-N Decoding via Speculative Rejection(@CMU etc) | [[pdf]](https://arxiv.org/pdf/2410.20290) | ⚠️ |⭐️⭐️ |
+|2025.06|🔥[**Mamba Drafters**] Mamba Drafters for Speculative Decoding(@KAIST & Amazon etc) | [[pdf]](https://arxiv.org/abs/2506.01206) | ⚠️ |⭐️⭐️ |


 ### 📖Structured Prune/KD/Weight Sparse ([©️back👆🏻](#paperlist))
@@ -447,6 +450,7 @@ python3 download_pdfs.py # The code is generated by Doubao AI
 |2023.12|[PowerInfer] PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU(@SJTU)|[[pdf]](https://ipads.se.sjtu.edu.cn/_media/publications/powerinfer-20231219.pdf)|[[PowerInfer]](https://github.com/SJTU-IPADS/PowerInfer) ![](https://img.shields.io/github/stars/SJTU-IPADS/PowerInfer.svg?style=social)|⭐️ |
 |2024.01|[**Admm Pruning**] Fast and Optimal Weight Update for Pruned Large Language Models(@fmph.uniba.sk)|[[pdf]](https://arxiv.org/pdf/2401.02938.pdf)|[[admm-pruning]](https://github.com/fmfi-compbio/admm-pruning) ![](https://img.shields.io/github/stars/fmfi-compbio/admm-pruning.svg?style=social)|⭐️ |
 |2024.01|[FFSplit] FFSplit: Split Feed-Forward Network For Optimizing Accuracy-Efficiency Trade-off in Language Model Inference(@1Rice University etc) | [[pdf]](https://arxiv.org/pdf/2401.04044.pdf) | ⚠️ |⭐️|
+|2025.03|🔥[**Simba**] Sparsified State-Space Models are Efficient Highway Networks(@KAIST)| [[pdf]](https://arxiv.org/abs/2505.20698)|[[Simba]](https://github.com/woominsong/Simba) ![](https://img.shields.io/github/stars/woominsong/Simba.svg?style=social)|⭐️ |

 ### 📖Mixture-of-Experts(MoE) LLM Inference ([©️back👆🏻](#paperlist))
 <div id="Mixture_of_Experts_LLM_Inference"></div>

0 commit comments
