
Commit b12cc76

Woomin Song authored
Add 4 papers (#148)
* Add 4 papers
* Move Simba

Co-authored-by: Woomin Song <[email protected]>
1 parent 7d153bd commit b12cc76

File tree

1 file changed (+4, −0 lines)


README.md

Lines changed: 4 additions & 0 deletions
@@ -378,6 +378,7 @@ python3 download_pdfs.py # The code is generated by Doubao AI
 |2024.04|🔥🔥[Infini-attention] Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention(@Google) | [[pdf]](https://arxiv.org/pdf/2404.07143.pdf) | ⚠️ |⭐️⭐️ |
 |2024.04|🔥🔥[RAGCache] RAGCache: Efficient Knowledge Caching for Retrieval-Augmented Generation(@Peking University&ByteDance Inc) | [[pdf]](https://arxiv.org/pdf/2404.12457.pdf) | ⚠️ |⭐️⭐️ |
 |2024.04|🔥🔥[**KCache**] EFFICIENT LLM INFERENCE WITH KCACHE(@Qiaozhi He, Zhihua Wu)| [[pdf]](https://arxiv.org/pdf/2404.18057) | ⚠️ |⭐️⭐️ |
+|2024.04|🔥[**HOMER**] Hierarchical Context Merging: Better Long Context Understanding for Pre-trained LLMs(@KAIST)|[[pdf]](https://arxiv.org/abs/2404.10308)|[[HOMER]](https://github.com/alinlab/HOMER) ![](https://img.shields.io/github/stars/alinlab/HOMER?style=social) |⭐️⭐️ |
 |2024.05|🔥🔥[**YOCO**] You Only Cache Once: Decoder-Decoder Architectures for Language Models(@Microsoft)| [[pdf]](https://arxiv.org/pdf/2405.05254) | [[unilm-YOCO]](https://github.com/microsoft/unilm/tree/master/YOCO) ![](https://img.shields.io/github/stars/microsoft/unilm.svg?style=social) |⭐️⭐️ |
 |2024.05|🔥🔥[SKVQ] SKVQ: Sliding-window Key and Value Cache Quantization for Large Language Models(@Shanghai AI Laboratory)| [[pdf]](https://arxiv.org/pdf/2405.06219) | ⚠️ |⭐️⭐️ |
 |2024.05|🔥🔥[**CLA**] Reducing Transformer Key-Value Cache Size with Cross-Layer Attention(@MIT-IBM)| [[pdf]](https://arxiv.org/pdf/2405.12981) | ⚠️ |⭐️⭐️ |
@@ -391,6 +392,7 @@ python3 download_pdfs.py # The code is generated by Doubao AI
 |2024.09|🔥[**RetrievalAttention**] RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval(@microsoft.com)|[[pdf]](https://arxiv.org/pdf/2409.10516)|⚠️|⭐️⭐️ |
 |2024.10|🔥[**ShadowKV**] ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference(@CMU & bytedance)|[[pdf]](https://arxiv.org/pdf/2410.21465)|[[ShadowKV]](https://github.com/bytedance/ShadowKV) ![](https://img.shields.io/github/stars/bytedance/ShadowKV.svg?style=social) |⭐️⭐️ |
 |2025.01|🔥🔥🔥 [**Lightning Attention**] MiniMax-01: Scaling Foundation Models with Lightning Attention | [[report]](https://filecdn.minimax.chat/_Arxiv_MiniMax_01_Report.pdf) | [[MiniMax-01]](https://github.com/MiniMax-AI/MiniMax-01) ![](https://img.shields.io/github/stars/MiniMax-AI/MiniMax-01.svg?style=social) | ⭐️⭐️ |
+|2025.06|🔥[**REFORM**] Compress, Gather, and Recompute: REFORMing Long-Context Processing in Transformers(@KAIST & Amazon etc)|[[pdf]](https://arxiv.org/abs/2506.01215)|⚠️|⭐️⭐️ |

 ### 📖Early-Exit/Intermediate Layer Decoding ([©️back👆🏻](#paperlist))
 <div id="Early-Exit"></div>
@@ -435,6 +437,7 @@ python3 download_pdfs.py # The code is generated by Doubao AI
 |2024.09|🔥[**Hybrid Inference**] Efficient Hybrid Inference for LLMs: Reward-Based Token Modelling with Selective Cloud Assistance|[[pdf]](https://arxiv.org/pdf/2409.13757) | ⚠️ |⭐️⭐️ |
 |2024.10|🔥[**PARALLELSPEC**] PARALLELSPEC: PARALLEL DRAFTER FOR EFFICIENT SPECULATIVE DECODING(@Tencent AI Lab etc)|[[pdf]](https://arxiv.org/pdf/2410.05589) | ⚠️ |⭐️⭐️ |
 |2024.10|🔥[**Fast Best-of-N**] Fast Best-of-N Decoding via Speculative Rejection(@CMU etc) | [[pdf]](https://arxiv.org/pdf/2410.20290) | ⚠️ |⭐️⭐️ |
+|2025.06|🔥[**Mamba Drafters**] Mamba Drafters for Speculative Decoding(@KAIST & Amazon etc) | [[pdf]](https://arxiv.org/abs/2506.01206) | ⚠️ |⭐️⭐️ |


 ### 📖Structured Prune/KD/Weight Sparse ([©️back👆🏻](#paperlist))
@@ -447,6 +450,7 @@ python3 download_pdfs.py # The code is generated by Doubao AI
 |2023.12|[PowerInfer] PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU(@SJTU)|[[pdf]](https://ipads.se.sjtu.edu.cn/_media/publications/powerinfer-20231219.pdf)|[[PowerInfer]](https://github.com/SJTU-IPADS/PowerInfer) ![](https://img.shields.io/github/stars/SJTU-IPADS/PowerInfer.svg?style=social)|⭐️ |
 |2024.01|[**Admm Pruning**] Fast and Optimal Weight Update for Pruned Large Language Models(@fmph.uniba.sk)|[[pdf]](https://arxiv.org/pdf/2401.02938.pdf)|[[admm-pruning]](https://github.com/fmfi-compbio/admm-pruning) ![](https://img.shields.io/github/stars/fmfi-compbio/admm-pruning.svg?style=social)|⭐️ |
 |2024.01|[FFSplit] FFSplit: Split Feed-Forward Network For Optimizing Accuracy-Efficiency Trade-off in Language Model Inference(@1Rice University etc) | [[pdf]](https://arxiv.org/pdf/2401.04044.pdf) | ⚠️ |⭐️|
+|2025.03|🔥[**Simba**] Sparsified State-Space Models are Efficient Highway Networks(@KAIST)| [[pdf]](https://arxiv.org/abs/2505.20698)|[[Simba]](https://github.com/woominsong/Simba) ![](https://img.shields.io/github/stars/woominsong/Simba.svg?style=social)|⭐️ |

 ### 📖Mixture-of-Experts(MoE) LLM Inference ([©️back👆🏻](#paperlist))
 <div id="Mixture_of_Experts_LLM_Inference"></div>

0 commit comments
