Can Qin

Salesforce AI Research, 181 Lytton Avenue, Palo Alto, CA, 94301, USA


Email: cqin[at]salesforce.com or qin.ca[at]northeastern.edu

Hello and welcome! I’m a Research Scientist at Salesforce AI Research, driven by a deep passion for Generative AI and Multi-modal Learning. My work focuses on developing Video/Image-to-Text (understanding) and Text-to-Video/Image (generation) techniques.

In 2023, I earned my Ph.D. from Northeastern University in Boston, USA. My research during that period centered on Transfer Learning and Efficient AI, areas in which I explored complex problems and developed innovative solutions.

Before my Ph.D., I obtained my B.E. degree from Xidian University in Xi’an, China, in 2018. That training laid the groundwork for my ongoing pursuit of knowledge and innovation.

news

Oct, 2025 HoliTom was accepted by NeurIPS 25. We have released CoDA, a 1.7B coding diffusion LLM (DLLM).
May, 2025 CogAlign was accepted by ACL (Findings), and we have released BLIP3-o.
Feb, 2025 We have two papers accepted by CVPR 25! Our latest paper, CogAlign, was released.
Sep, 2024 Our Medical MLLM paper was accepted by EMNLP 24 (Main)!
Aug, 2024 xGen-MM (BLIP-3) and xGen-VideoSyn-1 were released to the public! We have a paper accepted by TKDE, and congrats to Yizhou! I have been invited to review for Nature Communications.
Jul, 2024 We have one paper accepted by ECCV 24!
Feb, 2024 We have one paper accepted by CVPR 24!
Nov, 2023 Began my journey at Salesforce AI Research in Palo Alto!
Jun, 2023 I passed my Ph.D. dissertation defense and became Dr. Qin!

selected publications

  1. UNIDOC-BENCH: A Unified Benchmark for Document-Centric Multimodal RAG
    Xiangyu Peng*, Can Qin*, Zeyuan Chen, Ran Xu, Caiming Xiong, and Chien-Sheng Wu
    arXiv preprint arXiv:2510.03663, 2025
    *Equal Contribution
    Doc Understanding · Multimodal
  2. CoDA: Coding LM via Diffusion Adaptation
    Haolin Chen*, Shiyu Wang*, Can Qin*, Bo Pang, Zuxin Liu, Jielin Qiu, Jianguo Zhang, Yingbo Zhou, Zeyuan Chen, Ran Xu, and others
    arXiv preprint arXiv:2510.03270, 2025
    *Core Contribution
    Code Generation · Diffusion LLM
  3. VLM2Vec-V2 (MMEB-V2): Advancing Multimodal Embedding for Videos, Images, and Visual Documents
    Rui Meng, Ziyan Jiang, Ye Liu, Mingyi Su, Xinyi Yang, Yuepeng Fu, Can Qin, Zeyuan Chen, Ran Xu, Caiming Xiong, and others
    arXiv preprint arXiv:2507.04590, 2025
    Embedding Model · Multimodal
  4. HoliTom: Holistic Token Merging for Fast Video Large Language Models
    Kele Shao, Keda Tao, Can Qin, Haoxuan You, Yang Sui, and Huan Wang
    Advances in Neural Information Processing Systems (NeurIPS), 2025
    Video LLM · Token Compression
  5. When Tokens Talk Too Much: A Survey of Multimodal Long-context Token Compression across Images, Videos, and Audios
    Kele Shao, Keda Tao, Kejia Zhang, Sicheng Feng, Mu Cai, Yuzhang Shang, Haoxuan You, Can Qin, Yang Sui, and Huan Wang
    arXiv preprint arXiv:2507.20198, 2025
    Token Compression · Survey · Multimodal
  6. BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset
    Jiuhai Chen, Zhiyang Xu, Xichen Pan, Yushi Hu, Can Qin, Tom Goldstein, Lifu Huang, Tianyi Zhou, Saining Xie, Silvio Savarese, and others
    arXiv preprint arXiv:2505.09568, 2025
    Unified Multimodal Model
  7. Plug-and-Play 1.x-Bit KV Cache Quantization for Video Large Language Models
    Keda Tao, Haoxuan You, Yang Sui, Can Qin, and Huan Wang
    arXiv preprint arXiv:2503.16257, 2025
    Video LLM · Token Compression
  8. Why Vision Language Models Struggle with Visual Arithmetic? Towards Enhanced Chart and Geometry Understanding
    Kung-Hsiang Huang, Can Qin, Haoyi Qiu, Philippe Laban, Shafiq Joty, Caiming Xiong, and Chien-Sheng Wu
    Annual Meeting of the Association for Computational Linguistics (Findings), 2025
    VLM · Chart · Geometry
  9. DyCoke: Dynamic Compression of Tokens for Fast Video Large Language Models
    Keda Tao, Can Qin, Haoxuan You, Yang Sui, and Huan Wang
    IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025
    Video LLM · Token Compression
  10. xGen-MM-Vid (BLIP-3-Video): You Only Need 32 Tokens to Represent a Video Even in VLMs
    Michael S Ryoo, Honglu Zhou, Shrikant Kendre, Can Qin, Le Xue, Manli Shu, Silvio Savarese, Ran Xu, Caiming Xiong, and Juan Carlos Niebles
    arXiv preprint arXiv:2410.16267, 2024
    Video LLM · Token Compression
  11. xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed Representations
    Can Qin, Congying Xia, Krithika Ramakrishnan, Michael Ryoo, Lifu Tu, Yihao Feng, Manli Shu, Honglu Zhou, Anas Awadalla, Jun Wang, and others
    arXiv preprint arXiv:2408.12590, 2024
    Diffusion · Video Generation
  12. xGen-MM (BLIP-3): A Family of Open Large Multimodal Models
    Le Xue, Manli Shu, Anas Awadalla, Jun Wang, An Yan, Senthil Purushwalkam, Honglu Zhou, Viraj Prabhu, Yutong Dai, Michael S Ryoo, and others
    arXiv preprint arXiv:2408.08872, 2024
    VLM · Multimodal
  13. Self-Training Large Language and Vision Assistant for Medical Question-Answering
    Guohao Sun, Can Qin, Huazhu Fu, Linwei Wang, and Zhiqiang Tao
    Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024
    VLM · Medical
  14. SQ-LLaVA: Self-Questioning for Large Vision-Language Assistant
    Guohao Sun, Can Qin, Jiamian Wang, Zeyuan Chen, Ran Xu, and Zhiqiang Tao
    European Conference on Computer Vision (ECCV), 2024
    VLM · Multimodal
  15. HIVE: Harnessing Human Feedback for Instructional Visual Editing
    Shu Zhang*, Xinyi Yang*, Yihao Feng*, Can Qin, Chia-Chih Chen, Ning Yu, Zeyuan Chen, Huan Wang, Silvio Savarese, Stefano Ermon, Caiming Xiong, and Ran Xu
    IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024
    Diffusion · Image Editing · Human-in-the-loop
  16. UniControl: A Unified Diffusion Model for Controllable Visual Generation In the Wild
    Can Qin, Shu Zhang, Ning Yu, Yihao Feng, Xinyi Yang, Yingbo Zhou, Huan Wang, Juan Carlos Niebles, Caiming Xiong, Silvio Savarese, Stefano Ermon, Yun Fu, and Ran Xu
    Advances in Neural Information Processing Systems (NeurIPS), 2023
    Diffusion · Controllable Image Generation
  17. GlueGen: Plug and Play Multi-modal Encoders for X-to-image Generation
    Can Qin, Ning Yu, Chen Xing, Shu Zhang, Zeyuan Chen, Stefano Ermon, Yun Fu, Caiming Xiong, and Ran Xu
    International Conference on Computer Vision (ICCV), 2023
    Diffusion · Multimodal · Image · Text · Audio