Sounak Mondal

I am a final-year Computer Science PhD candidate at Stony Brook University interested in research on multimodal learning, particularly vision-language modeling. My PhD thesis focuses on using vision-language representation learning and multimodal foundation models (e.g., multimodal LLMs) for modeling human visual attention (eye gaze). I am advised by Minh Hoai Nguyen, Dimitris Samaras and Gregory Zelinsky. I also collaborate with Niranjan Balasubramanian, Lester Loschky, Sidney D'Mello, and Sanjay Rebello.

Previously, I was an NLP Engineer on the Natural Language Understanding team at Samsung Research Institute, Bangalore. Before that, I was an undergraduate student in the Department of Computer Science & Engineering at Jadavpur University, Kolkata, where I worked on action detection and recognition in videos.

Résumé  /  Email  /  Google Scholar  /  LinkedIn

News
  • [August 2025] I am actively looking for full-time industry research scientist / applied scientist opportunities. Please contact me via email or LinkedIn if you have any leads.
  • [June 2025] One paper accepted to ICCV 2025! This work was done during my internship at Meta Reality Labs Research (RL-R) in 2024.
  • [June 2025] I have joined Meta Reality Labs Research (RL-R), Burlingame as a Research Scientist Intern!
  • [June 2025] I successfully defended my thesis proposal!
  • [May 2025] I will serve as a reviewer for NeurIPS 2025.
  • [March 2025] I will serve as a reviewer for ICCV 2025.
  • [February 2025] One paper accepted to CVPR 2025!
  • [November 2024] I will serve as a reviewer for CVPR 2025 and TPAMI.
  • [October 2024] I will continue working at Meta Reality Labs Research (RL-R) remotely as a Part-Time Student Researcher.
  • [July 2024] One paper on gaze prediction for object referral, and one paper on gaze following accepted to ECCV 2024!
  • [June 2024] I have joined Meta Reality Labs Research (RL-R), Redmond as a Research Scientist Intern!
  • [February 2024] One paper accepted to CVPR 2024!
  • [November 2023] I will serve as a reviewer for CVPR 2024.
  • [March 2023] One paper accepted to CVPR 2023!
  • [March 2023] One preprint is available on arXiv.
  • [July 2022] One paper accepted to ECCV 2022!
Research

I am broadly interested in Computer Vision, Natural Language Processing and Multimodal AI (Vision-Language Modeling). My PhD research focuses on using vision-language representation learning and multimodal foundation models (e.g., multimodal LLMs) for modeling human visual attention (eye gaze). For more details, refer to my résumé.

Gaze-Language Alignment for Zero-Shot Prediction of Visual Search Targets from Human Gaze Scanpaths
Sounak Mondal, Naveen Sendhilnathan, Ting Zhang, Yue Liu, Michael Proulx, Michael Iuzzolino, Chuan Qin, Tanya Jonker
ICCV, 2025
Poster

Few-shot Personalized Scanpath Prediction
Ruoyu Xue, Jingyi Xu, Sounak Mondal, Hieu Le, Gregory Zelinsky, Minh Hoai, Dimitris Samaras
CVPR, 2025
Paper

Look Hear: Gaze Prediction for Speech-directed Human Attention
Sounak Mondal, Seoyoung Ahn, Zhibo Yang, Niranjan Balasubramanian, Dimitris Samaras, Gregory Zelinsky, Minh Hoai
ECCV, 2024
arXiv / Project Page / Code / Dataset / Talk

Diffusion-Refined VQA Annotations for Semi-Supervised Gaze Following
Qiaomu Miao, Alexandros Graikos, Jingwei Zhang, Sounak Mondal, Minh Hoai, Dimitris Samaras
ECCV, 2024
arXiv

Unifying Top-down and Bottom-up Scanpath Prediction using Transformers
Zhibo Yang, Sounak Mondal, Seoyoung Ahn, Ruoyu Xue, Gregory Zelinsky, Minh Hoai, Dimitris Samaras
CVPR, 2024
arXiv

Gazeformer: Scalable, Effective and Fast Prediction of Goal-Directed Human Attention
Sounak Mondal, Zhibo Yang, Seoyoung Ahn, Dimitris Samaras, Gregory Zelinsky, Minh Hoai
CVPR, 2023
arXiv / Supplement / Code / Talk

Target-absent Human Attention
Zhibo Yang, Sounak Mondal, Seoyoung Ahn, Gregory Zelinsky, Minh Hoai, Dimitris Samaras
ECCV, 2022
arXiv / Supplement / Code

Characterizing Target-absent Human Attention
Yupei Chen, Zhibo Yang, Souradeep Chakraborty, Sounak Mondal, Seoyoung Ahn, Dimitris Samaras, Minh Hoai, Gregory Zelinsky
CVPR Workshop, 2022
Paper / Supplement

ICAN: Introspective Convolutional Attention Network for Semantic Text Classification
Sounak Mondal, Suraj Modi*, Sakshi Garg*, Dhruva Das, Siddhartha Mukherjee
ICSC, 2020 (* indicates equal contribution)
Paper

Violent/Non-Violent Video Classification based on Deep Neural Network
Sounak Mondal, Soumyajit Pal, Sanjoy Kumar Saha, Bhabatosh Chanda
ICAPR, 2017
Paper

A Beta Distribution Based Novel Scheme for Detection of Changes in Crowd Motion
Soumyajit Pal, Sounak Mondal, Sanjoy Kumar Saha, Bhabatosh Chanda
ICVGIP Workshop, 2016
Paper


Webpage template from Jon Barron