Can Qin

Salesforce AI Research, 181 Lytton Avenue, Palo Alto, CA, 94301, USA


Email: cqin[at]salesforce.com or qin.ca[at]northeastern.edu

Hello and welcome! I’m a Research Scientist at Salesforce AI Research, driven by a deep passion for Generative AI and Multi-modal Learning. My work focuses on developing Video/Image-to-Text (understanding) and Text-to-Video/Image (generation) techniques.

In 2023, I earned my Ph.D. from Northeastern University in Boston, USA. My research during that period centered on Transfer Learning and Efficient AI, areas in which I explored complex problems and developed innovative solutions.

Before my Ph.D., I obtained my B.E. degree from Xidian University in Xi’an, China, in 2018. That training laid the groundwork for my ongoing pursuit of knowledge and innovation.

news

Oct, 2025 HoliTom was accepted by NeurIPS 25. We have released CoDA, a 1.7B coding diffusion LLM (DLLM).
May, 2025 CogAlign was accepted by ACL (Findings), and we have released BLIP3-o.
Feb, 2025 We have two papers accepted by CVPR 25! Our latest paper, CogAlign, was released.
Sep, 2024 Our Medical MLLM paper was accepted by EMNLP 24 (Main)!
Aug, 2024 xGen-MM (BLIP-3) and xGen-VideoSyn-1 were released to the public! We have a paper accepted by TKDE, and congrats to Yizhou! I have been invited to review for Nature Communications.
Jul, 2024 We have one paper accepted by ECCV 24!
Feb, 2024 We have one paper accepted by CVPR 24!
Nov, 2023 Began my journey at Salesforce AI Research in Palo Alto!
Jun, 2023 I passed my Ph.D. dissertation defense and became Dr. Qin!

selected publications

  1. UNIDOC-BENCH: A Unified Benchmark for Document-Centric Multimodal RAG
    Xiangyu Peng*, Can Qin*, Zeyuan Chen, Ran Xu, Caiming Xiong, and Chien-Sheng Wu
    arXiv preprint arXiv:2510.03663, 2025
    *Equal Contribution
    Doc Understanding · Multimodal
  2. CoDA: Coding LM via Diffusion Adaptation
    Haolin Chen*, Shiyu Wang*, Can Qin*, Bo Pang, Zuxin Liu, Jielin Qiu, Jianguo Zhang, Yingbo Zhou, Zeyuan Chen, Ran Xu, and others
    arXiv preprint arXiv:2510.03270, 2025
    *Core Contribution
    Code Generation · Diffusion LLM
  3. VLM2Vec-V2 (MMEB-V2): Advancing Multimodal Embedding for Videos, Images, and Visual Documents
    Rui Meng, Ziyan Jiang, Ye Liu, Mingyi Su, Xinyi Yang, Yuepeng Fu, Can Qin, Zeyuan Chen, Ran Xu, Caiming Xiong, and others
    arXiv preprint arXiv:2507.04590, 2025
    Embedding Model · Multimodal
  4. HoliTom: Holistic Token Merging for Fast Video Large Language Models
    Kele Shao, Keda Tao, Can Qin, Haoxuan You, Yang Sui, and Huan Wang
    Advances in Neural Information Processing Systems (NeurIPS), 2025
    Video LLM · Token Compression
  5. When Tokens Talk Too Much: A Survey of Multimodal Long-context Token Compression across Images, Videos, and Audios
    Kele Shao, Keda Tao, Kejia Zhang, Sicheng Feng, Mu Cai, Yuzhang Shang, Haoxuan You, Can Qin, Yang Sui, and Huan Wang
    arXiv preprint arXiv:2507.20198, 2025
    Token Compression · Survey · Multimodal
  6. BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset
    Jiuhai Chen, Zhiyang Xu, Xichen Pan, Yushi Hu, Can Qin, Tom Goldstein, Lifu Huang, Tianyi Zhou, Saining Xie, Silvio Savarese, and others
    arXiv preprint arXiv:2505.09568, 2025
    Unified Multimodal Model
  7. Plug-and-Play 1.x-Bit KV Cache Quantization for Video Large Language Models
    Keda Tao, Haoxuan You, Yang Sui, Can Qin, and Huan Wang
    arXiv preprint arXiv:2503.16257, 2025
    Video LLM · Token Compression
  8. Why Vision Language Models Struggle with Visual Arithmetic? Towards Enhanced Chart and Geometry Understanding
    Kung-Hsiang Huang, Can Qin, Haoyi Qiu, Philippe Laban, Shafiq Joty, Caiming Xiong, and Chien-Sheng Wu
    Annual Meeting of the Association for Computational Linguistics (Findings), 2025
    VLM · Chart · Geometry
  9. DyCoke: Dynamic Compression of Tokens for Fast Video Large Language Models
    Keda Tao, Can Qin, Haoxuan You, Yang Sui, and Huan Wang
    IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025
    Video LLM · Token Compression
  10. xGen-MM-Vid (BLIP-3-Video): You Only Need 32 Tokens to Represent a Video Even in VLMs
    Michael S Ryoo, Honglu Zhou, Shrikant Kendre, Can Qin, Le Xue, Manli Shu, Silvio Savarese, Ran Xu, Caiming Xiong, and Juan Carlos Niebles
    arXiv preprint arXiv:2410.16267, 2024
    Video LLM · Token Compression
  11. xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed Representations
    Can Qin, Congying Xia, Krithika Ramakrishnan, Michael Ryoo, Lifu Tu, Yihao Feng, Manli Shu, Honglu Zhou, Anas Awadalla, Jun Wang, and others
    arXiv preprint arXiv:2408.12590, 2024
    Diffusion · Video Generation
  12. xGen-MM (BLIP-3): A Family of Open Large Multimodal Models
    Le Xue, Manli Shu, Anas Awadalla, Jun Wang, An Yan, Senthil Purushwalkam, Honglu Zhou, Viraj Prabhu, Yutong Dai, Michael S Ryoo, and others
    arXiv preprint arXiv:2408.08872, 2024
    VLM · Multimodal
  13. Self-Training Large Language and Vision Assistant for Medical Question-Answering
    Guohao Sun, Can Qin, Huazhu Fu, Linwei Wang, and Zhiqiang Tao
    Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024
    VLM · Medical
  14. SQ-LLaVA: Self-Questioning for Large Vision-Language Assistant
    Guohao Sun, Can Qin, Jiamian Wang, Zeyuan Chen, Ran Xu, and Zhiqiang Tao
    European Conference on Computer Vision (ECCV), 2024
    VLM · Multimodal
  15. HIVE: Harnessing Human Feedback for Instructional Visual Editing
    Shu Zhang*, Xinyi Yang*, Yihao Feng*, Can Qin, Chia-Chih Chen, Ning Yu, Zeyuan Chen, Huan Wang, Silvio Savarese, Stefano Ermon, Caiming Xiong, and Ran Xu
    IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024
    Diffusion · Image Editing · Human-in-the-loop
  16. UniControl: A Unified Diffusion Model for Controllable Visual Generation In the Wild
    Can Qin, Shu Zhang, Ning Yu, Yihao Feng, Xinyi Yang, Yingbo Zhou, Huan Wang, Juan Carlos Niebles, Caiming Xiong, Silvio Savarese, Stefano Ermon, Yun Fu, and Ran Xu
    Advances in Neural Information Processing Systems (NeurIPS), 2023
    Diffusion · Controllable Image Generation
  17. GlueGen: Plug and Play Multi-modal Encoders for X-to-image Generation
    Can Qin, Ning Yu, Chen Xing, Shu Zhang, Zeyuan Chen, Stefano Ermon, Yun Fu, Caiming Xiong, and Ran Xu
    International Conference on Computer Vision (ICCV), 2023
    Diffusion · Multimodal · Image · Text · Audio