Zhenglun Kong

Zhenglun Kong

I am currently a Research Fellow at Harvard University, working closely with Marinka Zitnik at Harvard and Manolis Kellis at MIT. I received my PhD from Northeastern University, supervised by Prof. Yanzhi Wang. Prior to that, I earned my master degree from Northeastern University and B.E. degree from Huazhong University of Science and Technology (HUST), China. I was a research intern at Microsoft Research, ARM, and Samsung Research. My research focuses on developing efficient deep learning methods applicable to general-purpose AI systems and scientific discovery. I was selected as the Machine Learning and Systems Rising Stars 2024.

I am actively looking for faculty and industry positions starting Fall 2026. Please kindly reach out to me for any opportunities. Thanks!

Email / CV / Google Scholar / Github / LinkedIn / Twitter

Research

I am dedicated to achieving the general and practical implementation of AI. My research is particularly focused on the following domains and methodologies:

Efficient Deep Learning: Accelerating pre-training/fine-tuning and inference, data/model compression, and fast & robust DNN design for Large Language Models and Vision Models.
AI for Health: Efficient algorithms for AI agents and cell foundation models.

News

10/2025: Our paper Democratizing AI scientists using ToolUniverse is released.
06/2025: Our paper SPATIA is released
05/2025: Our position paper Token Reduction Should Go Beyond Efficiency in Generative Models is released.
04/2025: Our paper AutoViT is accepted to IJCV.
03/2025: Our paper TxAgent is released.
03/2025: Our paper Moxin-LLM is accepted to SCI-FM@ICLR'25.
01/2025: Two papers (Sparse Learning for SSM + Efficient Self-Supervised Learning) are accepted to ICLR'25.
12/2024: One paper (LLM pruning) is accepted to AAAI'25; One paper (Rank-adaptive Reliability Optimization) is accepted to ICASSP'25.
10/2024: Two journal papers (Quantization for LLMs + Segmentation on Auto-Vehicles) are accepted to TCAD.
10/2024: One paper (Quantization-Aware BEV) is accepted to WACV'25.
09/2024: Three papers (Efficient LLM search + Token pruning for Vision SSM + Video Diffusion) are accepted to NeurIPS'24.
09/2024: One main paper (Token reduction for SSM) and one findings paper (Pruning LLMs without retraining) are accepted to EMNLP'24.
05/2024: Honored to be selected as the 2024 Machine Learning and Systems Rising Stars.
04/2024: I will join Harvard as a Postdoctoral Research Fellow in August.
01/2024: One paper (Activation-Guided Quantization for LLMs) is accepted to AAAI'24
01/2024: I gave a talk at the Robotics Institute at CMU during the VASC Seminar. The topic was "Towards Efficient Techniques and Applications for Universal AI Implementation."
09/2023: One paper (Hardware-oriented 3D Detector) is accepted to NeurIPS'23, see you in New Orleans!

Experiences

Harvard University, August 2024 - present
Postdoctoral researcher
Mentors: Marinka Zitnik, Manolis Kellis
Microsoft, June 2022 - August 2022
Research Intern
Mentor: Subhabrata Mukherjee
ARM, June 2021 - August 2021
Research Intern
Mentors: Lingchuan Meng, Danny Loh
Samsung Research America, October 2021 - December 2021
Research Intern
Mentors: Avik Ray, Yilin Shen, Hongxia Jin

Selected Publications * means equal contribution
Preprints
	Token Reduction Should Go Beyond Efficiency in Generative Models -- From Vision, Language to Multimodality Zhenglun Kong, Yize Li, Fanhu Zeng, Lei Xin, Shvat Messica, Xue Lin, Pu Zhao, Manolis Kellis, Hao Tang, Marinka Zitnik PDF / Code We argue that viewing token reduction purely from an efficiency perspective is fundamentally limited. Instead, we position token reduction as a core design principle in generative modeling.
	SPATIA: Multimodal Model for Prediction and Generation of Spatial Cell Phenotypes Zhenglun Kong, Mufan Qiu, John Boesen, Xiang Lin, Sukwon Yun, Tianlong Chen, Manolis Kellis, Marinka Zitnik PDF We introduce SPATIA, a multi-scale generative and predictive model for spatial transcriptomics. SPATIA learns cell-level embeddings by fusing image-derived morphological tokens and transcriptomic vector tokens and then aggregates them at niche and tissue levels to capture spatial dependencies.
	Enabling Flexible Multi-LLM Integration for Scalable Knowledge Aggregation Zhenglun Kong, Zheng Zhan, Shiyue Hou, Yifan Gong, Xin Meng, Pengwei Sui, Peiyan Dong, Xuan Shen, Zifeng Wang, Pu Zhao, Hao Tang, Stratis Ioannidis, Yanzhi Wang PDF / Code We propose a framework that adaptively selects and aggregates knowledge from diverse LLMs to build a single, stronger model, avoiding the high memory overhead of ensemble and inflexible weight merging.
Published
	AutoViT: Achieving Real-Time Vision Transformers on Mobile via Latency-aware Coarse-to-Fine Search Zhenglun Kong, Dongkuan Xu, Zhengang Li, Peiyan Dong, Hao Tang, Yanzhi Wang, Subhabrata Mukherjee [IJCV] International Journal of Computer Vision PDF We introduce a truly efficient, hardware-oriented approach for searching efficient vision transformer structure. This approach has been optimized to seamlessly adapt to the constraints of the target hardware and fulfill the specific speed requirements.
	Q-TempFusion: Quantization-Aware Multi-Sensor Fusion on Bird’s-Eye View Representation with Temporal Integration Pinrui Yu, Zhenglun Kong, Pu Zhao, Peiyan Dong, Hao Tang, Fei Sun, Xue Lin, Yanzhi Wang [WACV 2025] Winter Conference on Applications of Computer Vision PDF We propose Q-TempFusion, a novel approach for temporal multi-sensor fusion designed to enhance the BEV model’s inference speed while keeping high predictive performance.
	Exploring Token Pruning in Vision State Space Models Zheng Zhan, Zhenglun Kong, Yifan Gong, Yushu Wu, Zichong Meng, Hangyu Zheng, Xuan Shen, Stratis Ioannidis, Wei Niu, Pu Zhao, Yanzhi Wang [NeurIPS 2024] Advances in Neural Information Processing Systems PDF / Code We revisit the unique computational characteristics of SSMs and discover that naive application disrupts the sequential token positions. This insight motivates us to design a novel and general token pruning method specifically for SSM-based vision models.
	Rethinking Token Reduction for State Space Models Zheng Zhan, Yushu Wu, Zhenglun Kong, Changdi Yang, Yifan Gong, Xuan Shen, Xue Lin, Pu Zhao, Yanzhi Wang [EMNLP 2024]* Conference on Empirical Methods in Natural Language Processing PDF / Code We propose a tailored, unified post-training token reduction method for SSMs. Our approach integrates token importance and similarity, thus taking advantage of both pruning and merging, to devise a fine-grained intra-layer token reduction strategy.
	EdgeQAT: Entropy and Distribution Guided Quantization-Aware Training for the Acceleration of Lightweight LLMs on the Edge Xuan Shen, Zhenglun Kong, Changdi Yang, Zhaoyang Han, Lei Lu, Peiyan Dong, Cheng Lyu, Chih-hsiang Li, Xuehang Guo, Zhihao Shu, Wei Niu, Miriam Leeser, Pu Zhao, Yanzhi Wang arXiv:2402.10787 PDF / Code We design an entropy \& distribution guided quantization method to reduce information distortion in quantized query, key, and attention maps, tackling the bottleneck of QAT for LLMs.
	HotBEV: Hardware-oriented Transformer-based Multi-View 3D Detector for BEV Perception Zhenglun Kong, Peiyan Dong, Xin Meng, Pinrui Yu, Yanyue Xie, Yifan Gong, Geng Yuan, Fei Sun, Hao Tang, Yanzhi Wang [NeurIPS 2023] Advances in Neural Information Processing Systems PDF / Code We present a hardware-oriented transformer-based framework for 3D detection tasks, which achieves higher detection precision and remarkable speedup across high-end and low-end GPUs.
	SpeedDETR: Speed-aware Transformers for End-to-end Object Detection Peiyan Dong, Zhenglun Kong, Xin Meng, Peng Zhang, Hao Tang, Yanzhi Wang, Chih-Hsien Chou [ICML 2023] International Conference on Machine Learning PDF / Code We propose a novel speed-aware transformer for end-to-end object detectors, achieving high-speed inference on multiple devices.
	Peeling the Onion: Hierarchical Reduction of Data Redundancy for Efficient Vision Transformer Training Zhenglun Kong, Haoyu Ma, Geng Yuan, Mengshu Sun, Yanyue Xie, Peiyan Dong, Yanzhi Wang, et al. [AAAI 2023 Oral] The Thirty-Seventh AAAI Conference on Artificial Intelligence PDF / Code We introduce sparsity into data and propose an end-to-end efficient training framework to accelerate ViT training and inference.
	Data Level Lottery Ticket Hypothesis for Vision Transformers Xuan Shen, Zhenglun Kong, Minghai Qin, Peiyan Dong, Geng Yuan, Xin Meng, Hao Tang, Xiaolong Ma, Yanzhi Wang [IJCAI 2023 Oral] The 32nd International Joint Conference on Artificial Intelligence PDF / Code We generalize the LTH for ViTs to input data consisting of image patches inspired by the input dependence of ViTs. That is, there exists a subset of input image patches such that a ViT can be trained from scratch by using only this subset of patches and achieve similar accuracy to the ViTs trained by using all image patches.
	SPViT: Enabling Faster Vision Transformers via Latency-aware Soft Token Pruning Zhenglun Kong, Peiyan Dong, Xiaolong Ma, Xin Meng, Mengshu Sun, Wei Niu, Xuan Shen, Geng Yuan, Bin Ren, Minghai Qin, Hao Tang, Yanzhi Wang [ECCV 2022] European Conference on Computer Vision CVPRW 2022 Spotlight PDF / Code We propose a dynamic, latency-aware soft token pruning framework for Vision Transformer. Our framework significantly reduces the computation cost of ViTs while maintaining comparable performance on image classification.
	Efficient Transformer-based Large Scale Language Representations using Hardware-friendly Block Structured Pruning Bingbing Li, Zhenglun Kong, Tianyun Zhang, Ji Li, Zhengang Li, Hang Liu, Caiwen Ding [EMNLP 2020 Findings] Conference on Empirical Methods in Natural Language Processing PDF / Code We propose an efficient transformer-based large-scale language representation using hardware-friendly block structure pruning. We incorporate the reweighted group Lasso into block-structured pruning for optimization.
	A Compression-Compilation Framework for On-mobile Real-time BERT Applications Wei Niu, Zhenglun Kong, Geng Yuan, Weiwen Jiang, Jiexiong Guan, Caiwen Ding, Pu Zhao, Sijia Liu, Bin Ren, Yanzhi Wang [IJCAI 2021 Demo] 30th International Joint Conference on Artificial Intelligence PDF We propose a compression-compilation codesign framework that can guarantee BERT model to meet both resource and real-time specifications of mobile devices.
	Automatic Tissue Image Segmentation Based on Image Processing and Deep Learning Zhenglun Kong, Ting Li, Junyi Luo, Shengpu Xu Journal of Healthcare Engineering PDF We realize automatic image segmentation with convolutional neural network to extract the accurate contours of four tissues: the skull, cerebrospinal fluid (CSF), grey matter (GM), and white matter (WM) on 5 MRI head image datasets.

Education

Northeastern University, Sep 2019 - Present,
PhD in Computer Engineering
Advisor: Prof. Yanzhi Wang
Northeastern University, Sep 2017 - May 2019
M.S. in Computer Engineering
Advisor: Prof. Yun Raymond Fu
Huazhong University of Science and Technology, China, Sep 2013 - July 2017
B.E. in Optoelectronic Information Science and Engineering

Teaching Experiences

Teaching Assistant at Northeastern University
- EECE 7205 Fundamentals of Computer Engineering, Fall 2021
  Instructor: Prof. Xue Lin
- EECE 5552 Assistive Robotics, Fall 2019
  Instructor: Prof. Alireza Ramezani
- EECE 5644 34595 Machine Learning/Pattern Recognition, Spring 2019
  Instructor: Prof. Jennifer Dy

Guest Lecturer
- CSC 791&591 Advanced Topics in Efficient Deep Learning
  North Carolina State University, 2022 Fall
  Instructor: Prof. Dongkuan Xu

Professional Talks

Towards Efficient Deep Learning for Practical AI Implementation
Carnegie Mellon University, Pittsburgh, PA, Jan. 2024.
The Lottery Ticket Hypothesis for Vision Transformers
IJCAI, Macao, SAR, Aug. 2023.
Peeling the Onion: Hierarchical Reduction of Data Redundancy for Efficient Vision Transformer Training
AAAI, Washington, DC, Feb. 2023.
Enabling Faster Vision Transformers via Soft Token Pruning (link)
The 8th EMC2 - Energy Efficient Training and Inference of Transformer Based Models, Washington, DC, Feb. 2023.
Vision Transformer Optimization
ARM, San Jose, CA, Aug. 2021.
A Compression-Compilation Framework for On-mobile Real-time BERT Applications.
IJCAI, Montreal-themed virtual reality, Aug 2021.
Hardware-friendly Block Structured Pruning for Transformer
ARM, San Jose, CA, Jun. 2021.
Compiler-aware Neural Architecture Optimization for Transformer
Samsung Research America, Mountain View, CA, Oct. 2020.

Professional Services

Conference Reviewer:
- ICML'22, ECCV'22, NeurIPS'22, AAAI'23, CVPR'23, KDD'23, IJCAI'23, ICML'23, NeurIPS'23, AAAI'23, ICML'24, ECCV'24, NeurIPS'24, EMNLP'24, WACV'25, ICLR'25

Journal Reviewer:
- IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)
- IEEE Transactions on Neural Networks and Learning Systems (TNNLS)
- IEEE Transactions on Image Processing (TIP)
- Pattern Recognition
- Neurocomputing

Academic Committee Member:
- MLNLP