Portfolio item number 1
Short description of portfolio item number 1
Short description of portfolio item number 2
Published in International Conference on Parallel Architectures and Compilation Techniques (PACT), 2022
In this paper we propose GNNear, a hybrid accelerator architecture that leverages near-memory processing to accelerate full-batch GNN training on large graphs. GNNear matches the heterogeneous nature of GNN training by offloading the memory-intensive Reduce operations to in-DIMM Near-Memory-Engines (NMEs) and using a Centralized-Acceleration-Engine (CAE) to process the computation-intensive Update operations. To deal with the irregularity of graphs, we also propose several optimization strategies concerning data reuse, graph partitioning, and dataflow scheduling. Evaluations on 16 tasks demonstrate that GNNear achieves 30.8× / 2.5× geomean speedup and 79.6× / 7.3× geomean higher energy efficiency compared to a Xeon E5-2698-v4 CPU and a V100 GPU, respectively.
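As a rough illustration of the Reduce / Update split described above (not GNNear's actual implementation; the graph, feature sizes, and function names are made up), one full-batch GNN layer can be viewed as an irregular, memory-bound aggregation followed by a dense, compute-bound transformation:

```python
import numpy as np

def gnn_layer(features, neighbors, weight):
    """One full-batch GNN layer split into its two phases.

    Reduce: irregular, memory-bound neighbor aggregation
            (the part GNNear offloads to in-DIMM NMEs).
    Update: dense, compute-bound transformation
            (the part handled by the centralized CAE).
    """
    num_vertices, _ = features.shape

    # Reduce phase: sum each vertex's neighbor features.
    reduced = np.zeros_like(features)
    for v in range(num_vertices):
        for u in neighbors[v]:
            reduced[v] += features[u]

    # Update phase: dense matrix multiply followed by ReLU.
    return np.maximum(reduced @ weight, 0.0)

# Toy example: 4 vertices with 8-dimensional features.
feats = np.random.rand(4, 8)
adj = {0: [1, 2], 1: [0], 2: [0, 3], 3: [2]}
out = gnn_layer(feats, adj, np.random.rand(8, 8))
```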
Published in International Symposium on High-Performance Computer Architecture (HPCA), 2023
DIMM-based near-memory processing architectures (DIMM-NMP) have received growing interest from both academia and industry. They offer large memory capacity, low manufacturing cost, high flexibility, and a compatible form factor. However, inter-DIMM communication (IDC) has become a critical obstacle for generic DIMM-NMP architectures because it involves costly forwarding transactions through the host CPU. Recent research has demonstrated that, for many applications, the overhead induced by IDC may even offset the performance and energy benefits of near-memory processing.

To tackle this problem, we propose DIMM-Link, which enables high-performance IDC in DIMM-NMP architectures and supports seamless integration with existing host memory systems. It adopts bidirectional external data links to connect DIMMs, over which point-to-point communication and inter-DIMM broadcast are efficiently supported in a packet-routing manner. We present the full-stack design of DIMM-Link, including the hardware architecture, interconnect protocol, system organization, routing mechanisms, and optimization strategies. Comprehensive experiments on typical data-intensive tasks demonstrate that a DIMM-Link-equipped NMP system achieves a 5.93× average speedup over a 16-core CPU baseline. Compared to other IDC methods, DIMM-Link outperforms MCN, AIM, and ABC-DIMM by 2.42×, 1.87×, and 1.77×, respectively. More importantly, DIMM-Link fully considers implementation feasibility and system-integration constraints, which are critical for designing NMP architectures based on modern DDR4/DDR5 DIMMs.
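Purely as an illustration of what "packet-routing" inter-DIMM communication means (the packet format, chain topology, and broadcast rule below are assumptions for the sketch, not DIMM-Link's actual protocol), each DIMM can decide locally whether to consume a packet or forward it along its external links:

```python
# Each DIMM node forwards or consumes packets based on a destination id;
# a broadcast packet is consumed and forwarded until it reaches the chain ends.
BROADCAST = -1

def route_packet(my_id, num_dimms, packet):
    """Return (consume, next_hops) for a packet arriving at DIMM `my_id`.

    packet = {"src": int, "dst": int, "payload": ...}; dst == BROADCAST
    means every DIMM should receive the payload.
    """
    src, dst = packet["src"], packet["dst"]
    if dst == BROADCAST:
        if my_id == src:
            # Source injects the packet toward both chain ends.
            next_hops = [h for h in (my_id - 1, my_id + 1) if 0 <= h < num_dimms]
        else:
            # Keep propagating away from the source.
            next_hops = []
            if my_id > src and my_id + 1 < num_dimms:
                next_hops.append(my_id + 1)
            if my_id < src and my_id - 1 >= 0:
                next_hops.append(my_id - 1)
        return True, next_hops
    if dst == my_id:
        return True, []                     # point-to-point packet arrived
    step = 1 if dst > my_id else -1
    return False, [my_id + step]            # forward toward the destination

# DIMM 2 on an 8-DIMM chain sees a packet destined for DIMM 5:
print(route_packet(2, 8, {"src": 0, "dst": 5, "payload": b"x"}))  # (False, [3])
```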
Published in Design Automation Conference (DAC), 2023
Various DIMM-based near-memory processing (DIMM-NMP) architectures have been proposed to accelerate tensor reduction. Through careful evaluation, we find that different tensor-reduction scenarios exhibit distinct performance on DIMM-NMP architectures with different design configurations. However, given a tensor-reduction scenario, there is no fast and accurate way to identify a suitable DIMM-NMP architecture. To tackle this problem, we propose an efficient exploration framework called NMExplorer. Given a scenario and hardware parameters, NMExplorer generates and explores a wide range of potential design configurations. Experiments show that the recommended designs can outperform state-of-the-art DIMM-NMP accelerators by up to 1.95× in performance and 3.69× in energy efficiency.
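A minimal sketch of what such an exploration loop might look like (the configuration knobs, parameter names, and analytical cost model below are invented for illustration and are not NMExplorer's actual models):

```python
from itertools import product

def explore(scenario, hw):
    """Enumerate candidate DIMM-NMP configurations and keep the best one
    under a toy runtime/energy model."""
    best = None
    for ranks, lanes in product(hw["ranks_options"], hw["lanes_options"]):
        # Toy estimates: tensor reduction is bandwidth-bound, so more
        # active ranks/lanes shorten runtime but raise power draw.
        bandwidth_gbs = ranks * lanes * hw["lane_bw_gbs"]
        runtime_s = scenario["tensor_bytes"] / (bandwidth_gbs * 1e9)
        energy_j = runtime_s * (hw["base_power_w"] + ranks * lanes * hw["lane_power_w"])
        score = (runtime_s, energy_j)
        if best is None or score < best[0]:
            best = (score, {"ranks": ranks, "lanes": lanes})
    return best

best = explore({"tensor_bytes": 4e9},
               {"ranks_options": [1, 2, 4], "lanes_options": [8, 16],
                "lane_bw_gbs": 2.0, "base_power_w": 5.0, "lane_power_w": 0.3})
print(best)
```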
Published in International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2024
DRAM-based processing-in-memory (DRAM-PIM) has gained commercial prominence in recent years. However, integrating it for deep learning acceleration poses inherent challenges. Existing DRAM-PIMs have limited computational capabilities and are primarily applicable to element-wise and GEMV operators. Unfortunately, these operators account for only a small portion of the execution time in most DNN workloads, so current systems still require powerful hosts to handle a significant portion of the compute-heavy operators.
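A conceptual sketch of the resulting operator partitioning (the operator names and dispatch policy are assumptions for illustration, not an actual DRAM-PIM runtime): only the element-wise and GEMV operators can go to the PIM side, while the compute-heavy GEMMs stay on the host:

```python
# Operator types a typical DRAM-PIM can execute in this simplified model.
PIM_SUPPORTED = {"add", "mul", "relu", "gemv"}

def dispatch(op_trace):
    """Split a DNN operator trace between the PIM and the host."""
    pim_ops, host_ops = [], []
    for op in op_trace:
        (pim_ops if op["type"] in PIM_SUPPORTED else host_ops).append(op)
    return pim_ops, host_ops

trace = [{"type": "gemm", "name": "qkv_proj"},
         {"type": "gemv", "name": "decode_proj"},
         {"type": "relu", "name": "act"},
         {"type": "gemm", "name": "ffn"}]
pim, host = dispatch(trace)
print(len(pim), "ops on PIM,", len(host), "ops on host")
```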
Published in International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2024
Inference for generative large language models (LLMs) is inefficient because of the token dependencies introduced by autoregressive decoding. Speculative inference has recently been proposed to alleviate this problem: it introduces small language models to generate draft tokens and uses the original large language model for verification. Although speculative inference can improve the efficiency of the decoding procedure, we find that it presents variable resource demands due to the distinct computation patterns of the models it employs. This variability impedes the full realization of speculative inference's acceleration potential in current systems.
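A conceptual sketch of the draft-then-verify loop (simplified greedy-match acceptance rather than the rejection-sampling rule used in practice; the model callables are stand-ins): the small draft model runs cheaply but sequentially, while the large target model checks all draft positions together, which is why the two phases place such different demands on the hardware:

```python
def speculative_decode(draft_model, target_model, prompt, gamma=4, max_new=32):
    """Simplified draft-then-verify loop. `draft_model` and `target_model`
    are callables mapping a token list to the next token id; acceptance is
    a greedy exact match rather than rejection sampling."""
    tokens = list(prompt)
    generated = 0
    while generated < max_new:
        # Draft phase: the small model proposes gamma tokens, one at a time
        # (cheap but strictly sequential).
        draft = []
        for _ in range(gamma):
            draft.append(draft_model(tokens + draft))
        # Verify phase: the large model scores every draft position
        # (one parallel batch on real hardware; sequential here for clarity).
        verified = [target_model(tokens + draft[:i]) for i in range(gamma + 1)]
        accepted = 0
        while accepted < gamma and draft[accepted] == verified[accepted]:
            accepted += 1
        # Keep the accepted prefix plus one token from the target model.
        tokens.extend(draft[:accepted] + [verified[accepted]])
        generated += accepted + 1
    return tokens

# Toy usage: both "models" deterministically emit the next integer token.
next_tok = lambda ctx: (ctx[-1] + 1) % 100
print(speculative_decode(next_tok, next_tok, [0], gamma=4, max_new=8))
```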
Published:
This is a description of your talk, which is a markdown file that can be all markdown-ified like any other post. Yay markdown!
Published:
This is a description of your conference proceedings talk, note the different field in type. You can put anything in this field.
Undergraduate course, University 1, Department, 2014
This is a description of a teaching experience. You can use markdown like any other post.
Workshop, University 1, Department, 2015
This is a description of a teaching experience. You can use markdown like any other post.