H$^2$-LLM: Hardware-Dataflow Co-Exploration for Heterogeneous Hybrid-Bonding-based Low-Batch LLM Inference
Published in International Symposium on Computer Architecture (ISCA), 2025
Low-batch large language model (LLM) inference is widely deployed in edge-side generative tasks such as personal chat assistants, virtual assistants, reception bots, and private edge servers. To efficiently handle both the prefill and decoding stages of LLM inference, near-memory processing (NMP)-enabled heterogeneous computation paradigms have been proposed. However, existing NMP designs typically embed processing engines into DRAM dies, which limits computation capacity and in turn restricts their ability to accelerate edge-side low-batch LLM inference.
To tackle this problem, we propose H$^2$-LLM, a Hybrid-bonding-based Heterogeneous accelerator for edge-side low-batch LLM inference. To balance the trade-off between computation capacity and bandwidth intrinsic to hybrid-bonding technology, we design H$^2$-LLM's architecture and extract its architecture design space. We further propose a data-centric dataflow abstraction to fully exploit the heterogeneous architecture's acceleration opportunities in low-batch LLM inference. On top of this combined design space, we build a design space exploration (DSE) framework that automatically identifies the optimal design. Compared with existing in-die NMP-based heterogeneous accelerators, H$^2$-LLM achieves a 2.72× geometric-mean speedup and 1.48× geometric-mean better energy efficiency. H$^2$-LLM's data-centric dataflow exploration framework is open-sourced at https://github.com/leesou/H2-LLM-ISCA-2025.
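To give a flavor of what hardware-dataflow co-exploration means, the sketch below shows a generic DSE loop that exhaustively evaluates candidate (architecture, dataflow) pairs against a toy analytical cost model and keeps the lowest-latency configuration. The parameter names, the two dataflow labels, and the cost model are illustrative assumptions for exposition only; they are not the paper's actual design space or framework.

```python
from itertools import product

# Hypothetical architecture design space: each point is a candidate
# hybrid-bonding configuration (values are illustrative, not from the paper).
ARCH_SPACE = [
    {"pe_arrays": n, "banks_per_pe": b}
    for n, b in product([4, 8, 16], [2, 4])
]

# Illustrative dataflow choices (stand-ins for a real dataflow abstraction).
DATAFLOW_SPACE = ["weight-stationary", "output-stationary"]


def estimate_latency(arch, dataflow):
    """Toy analytical cost model: more PE arrays give more compute
    throughput, while banks per PE array set the available bandwidth.
    The dataflow choice controls how well compute and memory overlap."""
    compute_cycles = 1e6 / (arch["pe_arrays"] * 8)
    memory_cycles = 1e6 / (arch["pe_arrays"] * arch["banks_per_pe"] * 4)
    overlap = 0.9 if dataflow == "weight-stationary" else 0.8
    # The longer of the two phases dominates; the shorter is partly hidden.
    return max(compute_cycles, memory_cycles) + overlap * min(
        compute_cycles, memory_cycles
    )


def explore(arch_space, dataflow_space):
    """Evaluate every (architecture, dataflow) pair and return the
    configuration with the lowest estimated latency."""
    return min(
        ((a, d) for a in arch_space for d in dataflow_space),
        key=lambda cfg: estimate_latency(*cfg),
    )


if __name__ == "__main__":
    arch, dataflow = explore(ARCH_SPACE, DATAFLOW_SPACE)
    print("best architecture:", arch, "| best dataflow:", dataflow)
```

A real framework would replace the exhaustive loop with pruning or search heuristics and the toy model with a calibrated performance/energy model, but the structure, i.e. enumerate, evaluate, select, is the same.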
@inproceedings{li2025h2,
title={H2-LLM: Hardware-Dataflow Co-Exploration for Heterogeneous Hybrid-Bonding-based Low-Batch LLM Inference},
author={Li, Cong and Yin, Yihan and Wu, Xintong and Zhu, Jingchen and Gao, Zhutianya and Niu, Dimin and Wu, Qiang and Si, Xin and Xie, Yuan and Zhang, Chen and others},
booktitle={Proceedings of the 52nd Annual International Symposium on Computer Architecture},
pages={194--210},
year={2025}
}