Sitemap

A list of all the posts and pages found on the site. For you robots out there, there is an XML version available for digesting as well.

Pages

Posts

Future Blog Post

less than 1 minute read

Published:

This post will show up by default. To disable scheduling of future posts, edit _config.yml and set future: false.
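For reference, the relevant setting lives in the site's _config.yml (Jekyll's standard configuration file); a minimal sketch:

    # _config.yml
    # When false, Jekyll skips any post whose date is in the future.
    future: false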

Blog Post number 4

less than 1 minute read

Published:

This is a sample blog post. Lorem ipsum I can’t remember the rest of lorem ipsum and don’t have an internet connection right now. Testing testing testing this blog post. Blog posts are cool.

Blog Post number 3

less than 1 minute read

Published:

This is a sample blog post. Lorem ipsum I can’t remember the rest of lorem ipsum and don’t have an internet connection right now. Testing testing testing this blog post. Blog posts are cool.

Blog Post number 2

less than 1 minute read

Published:

This is a sample blog post. Lorem ipsum I can’t remember the rest of lorem ipsum and don’t have an internet connection right now. Testing testing testing this blog post. Blog posts are cool.

Blog Post number 1

less than 1 minute read

Published:

This is a sample blog post. Lorem ipsum I can’t remember the rest of lorem ipsum and don’t have an internet connection right now. Testing testing testing this blog post. Blog posts are cool.

portfolio

publications

GNNear: Accelerating Full-Batch Training of Graph Neural Networks with Near-Memory Processing

Published in International Conference on Parallel Architectures and Compilation Techniques (PACT), 2022

In this paper we propose GNNear, a hybrid accelerator architecture that leverages near-memory processing to accelerate full-batch GNN training on large graphs. GNNear matches the heterogeneous nature of GNN training by offloading the memory-intensive Reduce operations to in-DIMM Near-Memory-Engines (NMEs) and using a Centralized-Acceleration-Engine (CAE) to process the computation-intensive Update operations. To deal with the irregularity of graphs, we also propose several optimization strategies concerning data reuse, graph partitioning, and dataflow scheduling. Evaluations on 16 tasks demonstrate that GNNear achieves 30.8× / 2.5× geometric-mean speedup and 79.6× / 7.3× (geometric-mean) higher energy efficiency compared to a Xeon E5-2698-v4 CPU and a V100 GPU, respectively.
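The Reduce/Update split that GNNear exploits can be sketched in a few lines of Python; the function names and the NumPy toy layer below are illustrative stand-ins, not the paper's implementation:

    import numpy as np

    def reduce_step(adj, h):
        # Memory-intensive, irregular: sum each vertex's neighbor features.
        # GNNear offloads this Reduce step to the in-DIMM NMEs.
        return adj @ h

    def update_step(h_agg, w):
        # Computation-intensive, regular: a dense GEMM over the aggregated
        # features. GNNear runs this Update step on the CAE.
        return np.maximum(h_agg @ w, 0.0)  # ReLU

    # Toy full-batch layer: 4 vertices, 8-dim features, dense adjacency.
    adj = (np.random.rand(4, 4) > 0.5).astype(float)
    h = np.random.randn(4, 8)
    w = np.random.randn(8, 8)
    h_next = update_step(reduce_step(adj, h), w)

The aggregation touches memory proportional to the number of edges while doing little arithmetic, which is why it benefits from in-DIMM bandwidth; the dense transform is the opposite.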

DIMM-Link: Enabling Efficient Inter-DIMM Communication for Near-Memory Processing

Published in International Symposium on High-Performance Computer Architecture (HPCA), 2023

DIMM-based near-memory processing architectures (DIMM-NMP) have received growing interest from both academia and industry. They offer large memory capacity, low manufacturing cost, high flexibility, and a compatible form factor. However, inter-DIMM communication (IDC) has become a critical obstacle for generic DIMM-NMP architectures because it involves costly forwarding transactions through the host CPU. Recent research has demonstrated that, for many applications, the overhead induced by IDC may even offset the performance and energy benefits of near-memory processing. To tackle this problem, we propose DIMM-Link, which enables high-performance IDC in DIMM-NMP architectures and supports seamless integration with existing host memory systems. It adopts bidirectional external data links to connect DIMMs, over which point-to-point communication and inter-DIMM broadcast are efficiently supported via packet routing. We present the full-stack design of DIMM-Link, including the hardware architecture, interconnect protocol, system organization, routing mechanisms, and optimization strategies. Comprehensive experiments on typical data-intensive tasks demonstrate that the DIMM-Link-equipped NMP system achieves a 5.93× average speedup over the 16-core CPU baseline. Compared to other IDC methods, DIMM-Link outperforms MCN, AIM, and ABC-DIMM by 2.42×, 1.87×, and 1.77×, respectively. More importantly, DIMM-Link fully considers implementation feasibility and system-integration constraints, which are critical for designing NMP architectures based on modern DDR4/DDR5 DIMMs.
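To see why host-forwarded IDC is costly, consider a back-of-the-envelope cost model; all bandwidth and latency numbers below are made-up placeholders, not figures from the paper:

    # Rough cost (seconds) of moving nbytes between two DIMMs.
    def idc_host_forwarded(nbytes, bw_channel=19.2e9):
        # Baseline: data crosses the memory channel twice
        # (source DIMM -> host CPU -> destination DIMM).
        return 2 * nbytes / bw_channel

    def idc_direct_link(nbytes, bw_link=25e9, hops=1, hop_latency=50e-9):
        # DIMM-Link-style transfer over a bidirectional external link,
        # plus a small per-hop packet-routing cost.
        return nbytes / bw_link + hops * hop_latency

    for mib in (1, 16, 64):
        n = mib * 2**20
        print(f"{mib} MiB: {idc_host_forwarded(n) / idc_direct_link(n):.2f}x")

The twofold channel crossing (and the host cycles it consumes) is the overhead that direct external links remove.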

NMExplorer: An Efficient Exploration Framework for DIMM-based Near-Memory Tensor Reduction

Published in Design Automation Conference (DAC), 2023

Various DIMM-based near-memory processing (DIMM-NMP) architectures have been proposed to accelerate tensor reduction. Through careful evaluation, we find that different scenarios exhibit distinct performance on DIMM-NMP architectures adopting different design configurations. However, given a tensor reduction scenario, there is no fast and accurate way to identify a suitable DIMM-NMP architecture. To tackle this problem, we propose an efficient exploration framework called NMExplorer. Given a scenario and hardware parameters, NMExplorer can generate and explore a wide range of potential design configurations. Experiments show that the recommended designs can outperform state-of-the-art DIMM-NMP accelerators by up to 1.95× in performance and 3.69× in energy.
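The exploration loop at the heart of such a framework can be sketched as exhaustive enumeration over design knobs scored by an analytic latency model; every knob and constant below is hypothetical:

    from itertools import product

    RANKS, PUS, LANES = (1, 2, 4), (1, 2, 4, 8), (8, 16)  # hypothetical knobs

    def latency(cfg, tensor_bytes=2**26, bw_per_rank=6.4e9, pu_ops=4e9):
        ranks, pus, lanes = cfg
        t_mem = tensor_bytes / (ranks * bw_per_rank)                 # bandwidth
        t_cmp = (tensor_bytes / 4) / (ranks * pus * lanes * pu_ops)  # compute
        return max(t_mem, t_cmp)  # assume compute overlaps memory access

    best = min(product(RANKS, PUS, LANES), key=latency)
    print("recommended (ranks, PUs/rank, lanes/PU):", best)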

PIM-DL: Expanding the Applicability of Commodity DRAM-PIMs for Deep Learning via Algorithm-System Co-Optimization

Published in International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2024

DRAM-based processing-in-memory (DRAM-PIM) has gained commercial prominence in recent years. However, integrating it for deep learning acceleration poses inherent challenges. Existing DRAM-PIMs have limited computational capabilities and are applicable primarily to element-wise and GEMV operators. Unfortunately, these operators contribute only a small portion of the execution time in most DNN workloads, so current systems still require powerful hosts to handle a significant portion of the compute-heavy operators.
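A quick arithmetic-intensity calculation shows why GEMV fits DRAM-PIM while most DNN time does not; the sketch assumes fp16 operands and counts only matrix/vector traffic:

    def intensity_gemv(n, m, dtype_bytes=2):
        flops = 2 * n * m                        # one multiply-add per weight
        traffic = (n * m + n + m) * dtype_bytes  # weight matrix dominates
        return flops / traffic

    def intensity_gemm(n, m, k, dtype_bytes=2):
        flops = 2 * n * m * k
        traffic = (n * m + m * k + n * k) * dtype_bytes
        return flops / traffic

    print(intensity_gemv(4096, 4096))        # ~1 FLOP/byte: bandwidth-bound
    print(intensity_gemm(4096, 4096, 4096))  # ~1365 FLOP/byte: compute-heavy

Operators near 1 FLOP/byte are limited by DRAM bandwidth, which PIM supplies in abundance; high-intensity GEMMs still want a compute-rich host.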

SpecPIM: Accelerating Speculative Inference on PIM-Enabled System via Architecture-Dataflow Co-Exploration

Published in International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2024

Inference for generative large language models (LLMs) suffers from inefficiency because of the token dependency introduced by autoregressive decoding. Recently, speculative inference has been proposed to alleviate this problem: small language models generate draft tokens, and the original large language model verifies them. Although speculative inference can enhance the efficiency of the decoding procedure, we find that it presents variable resource demands due to the distinct computation patterns of the models involved. This variability impedes the full realization of speculative inference's acceleration potential in current systems.
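The draft-then-verify loop can be sketched as follows; the toy random "models" stand in for real LMs, and in practice the target model verifies all draft tokens in a single parallel forward pass rather than one at a time:

    import random

    def draft_model(prefix, k=4):
        # Small LM: cheaply propose k draft tokens (toy stand-in).
        return [random.randint(0, 9) for _ in range(k)]

    def target_accepts(prefix, token):
        # Large LM: verify one draft token (toy stand-in for real scoring).
        return random.random() < 0.7

    def speculative_decode(prefix, max_len=16):
        out = list(prefix)
        while len(out) < max_len:
            for tok in draft_model(out):
                if target_accepts(out, tok):
                    out.append(tok)                   # accepted draft token
                else:
                    out.append(random.randint(0, 9))  # resample from target
                    break                             # later drafts invalid
        return out

    print(speculative_decode([1, 2, 3]))

The resource-demand variability the paper targets comes from alternating between the small model's many cheap steps and the large model's few heavy verification passes.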

UniNDP: A Unified Compilation and Simulation Tool for Near DRAM Processing Architectures

Published in International Symposium on High Performance Computer Architecture (HPCA), 2025

Near DRAM Processing (NDP) architectures have emerged as a promising solution for commercializing in-memory computing and addressing the "memory wall" problem, especially for memory-intensive machine learning (ML) workloads. In NDP architectures, the Processing Units (PUs) are distributed next to different memory units to exploit the high internal bandwidth. Therefore, to fully utilize the bandwidth advantage of NDP architectures for ML applications, meticulous evaluation and optimization of data placement in DRAM and workload scheduling among different PUs are required. However, existing simulation and compilation tools face two insuperable obstacles to achieving these targets. On the one hand, tools for traditional von Neumann architectures focus only on data-access behavior between the host and DRAM and treat DRAM as a monolithic whole, so they cannot support NDP architectures in which multiple independent processing and memory units work simultaneously. On the other hand, existing NDP simulators and compilers are designed for a specific DRAM technology and NDP architecture, lacking compatibility with other NDP architectures. To overcome these challenges and optimize data mapping and workload scheduling for different NDP architectures, we propose UniNDP, a unified NDP compilation and simulation tool for ML applications. First, we propose a unified tree-based NDP hardware abstraction and a corresponding instruction set, enabling support for various NDP architectures based on different DRAM technologies. Second, we design a cycle-accurate, instruction-driven NDP simulator that evaluates hardware performance by accurately tracking the working status of memory elements and PUs; this accurate simulation provides effective guidance for compilation. Third, we design an NDP compiler that optimizes data partitioning, mapping, and workload scheduling across different DRAM hierarchies. Furthermore, to enhance compilation efficiency, we propose a hardware-status-guided search-space pruning strategy and a fast performance predictor based on DRAM timing parameters. Extensive experimental results show that, compared to existing mapping and compilation methods, UniNDP achieves 1.05-3.43× speedup across multiple NDP architectures and different ML workloads. Based on the results of UniNDP, we also provide insights for future NDP architecture design and deployment in ML applications.
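The unified tree-based hardware abstraction can be illustrated with a minimal sketch; the level names and fan-outs below are hypothetical, not UniNDP's actual abstraction:

    from dataclasses import dataclass, field

    @dataclass
    class Node:
        # One level of a tree-shaped NDP hardware model; children are the
        # memory/processing units nested inside this one.
        kind: str                 # e.g. "channel", "dimm", "rank", "pu"
        children: list = field(default_factory=list)

        def pus(self):
            # Enumerate leaf processing units for mapping and scheduling.
            if not self.children:
                return [self]
            return [p for c in self.children for p in c.pus()]

    # Toy system: 1 channel, 2 DIMMs, 2 ranks per DIMM, 1 PU per rank.
    system = Node("channel", [
        Node("dimm", [Node("rank", [Node("pu")]) for _ in range(2)])
        for _ in range(2)
    ])
    print(len(system.pus()))  # 4 leaf PUs to place work onto

A compiler walking such a tree can reason uniformly about bank-level, rank-level, or DIMM-level PUs, which is what lets one tool cover several NDP designs.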

CTXNL: A Software-Hardware Co-designed Solution for Efficient CXL-Based Transaction Processing

Published in International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2025

Transaction processing systems are the crux of modern data-center applications, yet current multi-node systems are slow due to network overheads. This paper advocates Compute Express Link (CXL) as a network alternative that enables low-latency, cache-coherent shared-memory accesses. However, directly adopting standard CXL primitives leads to performance degradation due to the high cost of maintaining cross-node cache coherence. To address these challenges, this paper introduces CTXNL, a software-hardware co-designed system that implements a novel hybrid coherence primitive tailored to the loosely coherent nature of transactional data. The core innovation of CTXNL is empowering transaction-system developers to selectively achieve data coherence. Our evaluations on OLTP workloads demonstrate that CTXNL outperforms current network-based systems and achieves up to 2.08× greater throughput than vanilla CXL memory-sharing architectures across universal transaction processing policies.
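The idea of selectively applying coherence can be caricatured in a few lines; this is purely an illustrative model of the concept, not CTXNL's actual primitive or API:

    class Region:
        # A shared-memory region with a developer-chosen coherence mode.
        def __init__(self, name, coherent):
            self.name, self.coherent = name, coherent
            self.pending = set()  # nodes with writes not yet visible

        def write(self, node_id, value):
            self.value = value
            if self.coherent:
                self.pending = set()       # eager: peers invalidated now
            else:
                self.pending.add(node_id)  # lazy: defer to commit time

        def commit(self):
            self.pending.clear()  # publish deferred writes in one step

    log = Region("txn-log", coherent=True)       # needs strict coherence
    heap = Region("tuple-heap", coherent=False)  # loosely coherent data
    heap.write(0, "row42"); heap.commit()

Paying the cross-node invalidation cost only where transactions truly need it is what recovers the throughput lost to always-coherent sharing.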

AIM: Software and Hardware Co-design for Architecture-level IR-drop Mitigation in High-performance PIM

Published in International Symposium on Computer Architecture (ISCA), 2025

SRAM Processing-in-Memory (PIM) has emerged as the most promising implementation for high-performance PIM, delivering superior computing density, energy efficiency, and computational precision. However, the pursuit of higher performance necessitates more complex circuit designs and increased operating frequencies, which exacerbate IR-drop issues. Severe IR-drop can significantly degrade chip performance and even threaten reliability. Conventional circuit-level IR-drop mitigation methods, such as back-end optimizations, are resource-intensive and often compromise power, performance, and area (PPA). To address these challenges, we propose AIM, a comprehensive software and hardware co-design for architecture-level IR-drop mitigation in high-performance PIM. First, leveraging the bit-serial and in-situ dataflow processing properties of PIM, we introduce Rtog and HR, which establish a direct correlation between PIM workloads and IR-drop. Building on this foundation, we propose LHR and WDS, enabling extensive exploration of architecture-level IR-drop mitigation while maintaining computational accuracy through software optimization. Subsequently, we develop IR-Booster, a dynamic adjustment mechanism that integrates software-level HR information with hardware-based IR-drop monitoring to adapt the V-f pairs of the PIM macro, achieving enhanced energy efficiency and performance. Finally, we propose an HR-aware task-mapping method that bridges the software and hardware designs to achieve the optimal improvement. Post-layout simulation results on a 7nm 256-TOPS PIM chip demonstrate that AIM achieves up to 69.2% IR-drop mitigation, resulting in a 2.29× energy-efficiency improvement and a 1.152× speedup.
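A toggle-rate-style metric is easy to illustrate; the paper's exact Rtog and HR definitions are not reproduced here, so the sketch below uses the common notion of the fraction of bits flipping between consecutive operand words:

    def toggle_rate(words, width=8):
        # Fraction of bit positions that flip between consecutive words.
        # More flips -> more simultaneous switching current -> deeper IR-drop.
        flips = sum(bin(a ^ b).count("1") for a, b in zip(words, words[1:]))
        return flips / (width * (len(words) - 1))

    print(toggle_rate([0b00000000, 0b11111111, 0b00000000]))  # 1.0 worst case
    print(toggle_rate([0b10101010, 0b10101010, 0b10101010]))  # 0.0 quiet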

H²-LLM: Hardware-Dataflow Co-Exploration for Heterogeneous Hybrid-Bonding-based Low-Batch LLM Inference

Published in International Symposium on Computer Architecture (ISCA), 2025

Low-batch large language model (LLM) inference has been extensively applied to edge-side generative tasks such as personal chat helpers, virtual assistants, reception bots, and private edge servers. To efficiently handle both the prefill and decoding stages of LLM inference, a near-memory processing (NMP)-enabled heterogeneous computation paradigm has been proposed. However, existing NMP designs typically embed processing engines into DRAM dies, resulting in limited computation capacity, which in turn restricts their ability to accelerate edge-side low-batch LLM inference.
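The prefill/decoding asymmetry can be quantified with one line of arithmetic: prefill pushes the whole prompt through each weight matrix at once (GEMM-like), while each decoding step pushes a single token (GEMV-like). A sketch with illustrative sizes:

    d, prompt_len = 4096, 512               # hidden size, prompt length (toy)
    flops_prefill = 2 * prompt_len * d * d  # compute-heavy: suits a host/NPU
    flops_decode = 2 * 1 * d * d            # bandwidth-bound: suits NMP
    print(flops_prefill // flops_decode)    # = prompt_len = 512x more compute

Both stages read the same weights, so decoding's FLOP-per-byte ratio is roughly prompt_len times lower; NMP suits decoding well, but needs enough compute capacity not to bottleneck prefill.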

talks

teaching

Teaching experience 1

Undergraduate course, University 1, Department, 2014

This is a description of a teaching experience. You can use markdown like any other post.

Teaching experience 2

Workshop, University 1, Department, 2015

This is a description of a teaching experience. You can use markdown like any other post.