Peng Gao

Young Scientist
Shanghai AI Lab
Email: gaopeng [at] pjlab (dot) org (dot) cn

Google Scholar / GitHub


I am a Young Scientist at Shanghai AI Lab. I received my Ph.D. degree from the Multimedia Lab, the Chinese University of Hong Kong, in 2021, where I was supervised by Xiaogang Wang and Hongsheng Li. During my Ph.D., I was fortunate to take part in internship programs at MERL Boston, Microsoft Seattle, AI2 Seattle and SenseTime Beijing/Shenzhen. My research interests lie in multi-modality learning, efficient visual backbone design and self-supervised representation learning.

If you are interested in a research internship, a research engineer position or a full-time researcher position at Shanghai AI Lab, or in the Ph.D. program of MMLab at CUHK, please send me an email.



  • [03/2023] Five papers (Cafo, Maskalign, I2P-MAE, Q-DETR, Point-NN) are accepted by CVPR 2023.
  • [09/2022] Three papers (ConvMAE, Point-M2AE, QViT) are accepted by NeurIPS 2022.
  • [06/2022] Five papers (1 oral, 4 posters) are accepted by ECCV 2022.
  • [05/2022] The Vision team at Shanghai AI Lab released ConvMAE, FastConvMAE and VideoConvMAE.
  • [01/2022] PointCLIP is accepted by CVPR 2022.
  • [01/2022] UniFormer, a strong image and video backbone, is accepted by ICLR 2022.
  • [11/2021] The Vision team at Shanghai AI Lab released Tip-Adapter on arXiv.
  • [10/2021] The Vision team at Shanghai AI Lab released CLIP-Adapter on arXiv.
  • [10/2021] An attempt to replicate the interesting paper Pix2Seq is released as Unofficial Pix2Seq.
  • [10/2021] Two papers are accepted by NeurIPS 2021.
  • [07/2021] One paper is accepted by ICCV 2021.
  • [06/2021] One paper is accepted by ACMMM 2021.
  • [02/2021] One paper is accepted by AAAI 2021.




Point-M2AE: Multi-scale Masked Autoencoders for Hierarchical Point Cloud Pre-training
Renrui Zhang, Ziyu Guo, Peng Gao, Rongyao Fang, Bin Zhao, Dong Wang, Yu Qiao, Hongsheng Li
arXiv 2022 / Paper / Code

PointCLIP: Point Cloud Understanding by CLIP
Renrui Zhang*, Ziyu Guo*, Wei Zhang, Kunchang Li, Xupeng Miao, Bin Cui, Yu Qiao, Peng Gao**, Hongsheng Li
CVPR 2022 / Paper / Code

UniFormer: Unified Transformer for Efficient Spatiotemporal Representation Learning
Kunchang Li, Yali Wang, Peng Gao, Guanglu Song, Yu Liu, Hongsheng Li, Yu Qiao
ICLR 2022 / Paper / Code


A Simple Long-Tailed Recognition Baseline via Vision-Language Model
Teli Ma*, Shijie Geng, Mengmeng Wang, Jing Shao, Jiasen Lu, Hongsheng Li, Peng Gao**, Yu Qiao
arXiv 2021 / Paper / Code

CLIP-Adapter: Better Vision-Language Models with Feature Adapters
Peng Gao*, Shijie Geng*, Renrui Zhang*, Teli Ma, Rongyao Fang, Yongfeng Zhang, Hongsheng Li, Yu Qiao
arXiv 2021 / Paper / Code

Container: Context Aggregation Network
Peng Gao*, Jiasen Lu, Hongsheng Li, Roozbeh Mottaghi, Aniruddha Kembhavi
NeurIPS 2021 / Paper / Code

Fast Convergence of DETR with Spatially Modulated Co-attention
Peng Gao, Minghang Zeng, Xiaogang Wang, Jifeng Dai, Hongsheng Li
ICCV 2021 / Paper / Code

Scalable Transformers for Neural Machine Translation
Peng Gao, Shijie Geng, Yu Qiao, Xiaogang Wang, Jifeng Dai, Hongsheng Li
arXiv 2021 / Paper / Code

Dual Stream Network for Vision Recognition
Mingyuan Mao*, Peng Gao*, Renrui Zhang*, Honghui Zheng*, Teli Ma, Yan Peng, Errui Ding, Shumin Han
NeurIPS 2021 / Paper

End-to-End Object Detection with Adaptive Clustering Transformer
Minghang Zeng, Peng Gao, Renrui Zhang, Kunchang Li, Xiaogang Wang, Hongsheng Li, Dong Hao
BMVC 2021 Oral / Paper / Code

Dense Contrastive Visual-Linguistic Pretraining
Lei Shi, Kai Shuang, Shijie Geng, Peng Gao, Zuohui Fu, Gerard de Melo, Yunpeng Chen, Sen Su
ACMMM 2021 / Paper

Dynamic Graph Representation Learning for Video Dialog via Multi-Modal Shuffled Transformers
Shijie Geng, Peng Gao, Moitreya Chatterjee, Chiori Hori, Jonathan Le Roux, Yongfeng Zhang, Hongsheng Li, Anoop Cherian
AAAI 2021 / Paper


Learning Where to Focus for Efficient Video Object Detection
Zhengkai Jiang, Yu Liu, Ceyuan Yang, Jihao Liu, Peng Gao, Qian Zhang, Shiming Xiang, Chunhong Pan
ECCV 2020 / Paper / Code


Multi-modality Latent Interaction Network for Visual Question Answering
Peng Gao, Haoxuan You, Zhanpeng Zhang, Xiaogang Wang, Hongsheng Li
ICCV 2019 / Paper

Dynamic Fusion with Intra and Inter-Modality Attention Flow for Visual Question Answering
Peng Gao, Zhengkai Jiang, Haoxuan You, Pan Lu, Steven C.H. Hoi, Xiaogang Wang, Hongsheng Li
CVPR 2019 Oral / Paper

Video Object Detection with Locally-Weighted Deformable Neighbors
Zhengkai Jiang, Peng Gao, Chaoxu Guo, Qian Zhang, Shiming Xiang, Chunhong Pan
AAAI 2019 / Paper


Question-guided Hybrid Convolution for Visual Question Answering
Peng Gao, Hongsheng Li, Shuang Li, Pan Lu, Yikang Li, Steven C.H. Hoi, Xiaogang Wang
ECCV 2018 / Paper