CSE590-01: Vision-Language Foundation Models

(listed in SOLAR as "Visual Modeling and Intelligence")

Stony Brook University | Department of Computer Science

Fall 2025

Instructor: Paola Cascante-Bonilla
Assistant Professor, Computer Science
Office Hours: Monday/Wednesday, 4:30pm-5:30pm, NCS 235
TA: Kathakoli Sengupta
Ph.D. Student, Computer Science
TA Office Hours: Tuesday, 3:30pm-4:30pm, Old CS Building, Room 2126
Class Meetings: Monday/Wednesday, 6:30pm-7:50pm, Melville Library N4000

Course Description

In this graduate-level special topics course, we will explore recent advances in vision-language models (VLMs), focusing on their architectures, applications, and ongoing research. We will study VLMs applied to both images and videos, addressing tasks such as visual reasoning, classification by description, image and video captioning, text-to-image and text-to-video generation, and visual question answering. We will analyze current challenges, including representation learning, domain shift, and cultural bias, with special attention to compositionality and embodied AI. The course emphasizes critical analysis of these techniques, with a focus on reading and discussing recent work. It will consist primarily of student presentations on assigned conference and journal publications, together with a semester-long project, aiming to equip students with a comprehensive understanding of the current state and future directions of VLMs.

We will be using Google Classroom for all course materials, including readings, assignments, and announcements. We might also use EDStem for Q&A. Please make sure you have access. Google Classroom UPDATED LINK: https://classroom.google.com/c/NzgwMTE2NTQyMDQ0?cjc=uoabii5y
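To ground the representation-learning theme before Week 1, here is a minimal NumPy sketch of the CLIP-style symmetric contrastive objective from Radford et al. (one of the first-week readings). This is an illustrative toy, not course code: random features stand in for real image and text encoder outputs, and the function name and dimensions are invented for the example.

```python
import numpy as np

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over an image-text similarity matrix,
    in the style of CLIP. Row i of img_emb is assumed to pair with
    row i of txt_emb, so matching pairs sit on the diagonal."""
    # L2-normalize so the dot product is cosine similarity
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature       # (N, N) similarity matrix
    labels = np.arange(len(logits))          # correct match = diagonal

    def cross_entropy(l, y):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(y)), y].mean()

    # average the image-to-text and text-to-image directions
    return 0.5 * (cross_entropy(logits, labels)
                  + cross_entropy(logits.T, labels))

# Toy usage: random vectors stand in for ViT / text-transformer features.
rng = np.random.default_rng(0)
img_emb = rng.normal(size=(8, 512))
txt_emb = img_emb + 0.1 * rng.normal(size=(8, 512))  # nearly aligned pairs
loss = clip_contrastive_loss(img_emb, txt_emb)       # near zero when aligned
```

The loss is small when each image embedding is closest to its own caption embedding and large otherwise, which is exactly the alignment pressure that makes zero-shot classification by description possible.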

Prerequisites

This is a graduate-level course. Students are expected to have a solid mathematics background and strong programming skills. Students are also expected to have a strong foundation in machine learning and deep learning (e.g., CSE 512, CSE 527, CSE 538, or equivalent). Proficiency in Python and experience with a deep learning framework (e.g., PyTorch, TensorFlow) are required.

Learning Objectives

Upon successful completion of this course, students will be able to:

Course Structure & Grading

This is a graduate-level course centered around reading, presenting, and discussing research papers. A significant portion of this course is dedicated to a semester-long research project.

Paper Presentations: 30%
Class Participation & Discussion: 10%
Paper Critiques (4 total): 10%
Semester Project: 50%

Credit Breakdown for Paper Presentations (30%):

Logistics: Students sign up in Week 2 to claim one paper from the approved pool. Upload a PDF of your slide deck to Google Classroom 48 hours before your in-class talk, prepare a presentation followed by Q&A, and conclude with 2–3 discussion topics that tie the work to course themes (TBD - subject to change).


Paper Critiques Structure (10%):


Credit Breakdown for the Semester Project (50%):

Schedule & Readings

This schedule is tentative and may be adjusted. All papers will be linked from the course website. Students are expected to have read the assigned papers *before* the class session.

Week Date Topic Readings Due Activities
1 08/25
08/27
• Introduction & Course Overview
+ Foundations: Vision & Language Representations
+ Connecting Vision & Language
• Logistics. Introduction to Multimodality.
Reading materials:
+ Vaswani et al. "Attention Is All You Need"
+ Dosovitskiy et al. "An Image is Worth 16x16 Words".
+ Radford et al. "Learning Transferable Visual Models From Natural Language Supervision".
Select Paper to Present
2 09/01 Labor Day - No class
09/03 • Foundations & Representation Learning Reading materials:
+ Li et al. "Align before Fuse: Vision and Language Representation Learning"
Find Your Project Team Members
3 09/08 + Liu, Haotian, et al. "Visual Instruction Tuning."
09/10 + Kirillov, Alexander, et al. "Segment Anything."
4 09/15 + Siméoni, Oriane, et al. "DINOv3."
Guest Lecturer: Abe Leite
09/17 • Compositionality & Visual Reasoning + Thrush, Tristan, et al. "Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality."
5 09/22 + Wu, Wenshan, et al. "Mind’s Eye of LLMs: Visualization-of-Thought Elicits Spatial Reasoning in Large Language Models"
+ Li, Chengzu, et al. "Imagine while Reasoning in Space: Multimodal Visualization-of-Thought"
09/24 • Domain Shift & Generalization + Wortsman, Mitchell, et al. "Robust Fine-Tuning of Zero-Shot Models."
6 09/29 + Lafon, Marc, et al. "GalLoP: Learning Global and Local Prompts for Vision-Language Models." Project Proposal Due
10/01 + Zhang, Yabin, et al. "Dual Memory Networks: A Versatile Adaptation Approach for Vision-Language Models."
7 10/06 • Cultural Bias & Fairness + Nayak, Shravan, et al. "Benchmarking Vision Language Models for Cultural Understanding."
10/08 + Lee, Tony, et al. "VHELM: A Holistic Evaluation of Vision Language Models."
8 10/13 Fall Break - No class
10/15 • Embodied AI & Agents + Driess, Danny, et al. "PaLM-E: An Embodied Multimodal Language Model." ICML, 2023.
9 10/20 + Zitkovich, Brianna, et al. "RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control." CoRL, 2023.
10/22 + Kim, Moo Jin, et al. "OpenVLA: An Open-Source Vision-Language-Action Model." CoRL, 2024.
10 10/27 + Wang, Weizhen, et al. "Embodied Scene Understanding for Vision Language Models via MetaVQA." CVPR, 2025. Midterm Report Due
10/29 • Generation & Diffusion + Rombach, Robin, et al. "High-Resolution Image Synthesis with Latent Diffusion Models." CVPR, 2022.
11 11/03 + Esser, Patrick, et al. "Scaling Rectified Flow Transformers for High-Resolution Image Synthesis." ICML, 2024.
11/05 + Polyak, Adam, et al. "Movie Gen: A Cast of Media Foundation Models." arXiv:2410.13720, 2024.
12 11/10 + Wu, Rundi, et al. "CAT4D: Create Anything in 4D with Multi-View Video Diffusion Models." CVPR, 2025.
11/12 + Normalizing Flows: Masked Autoregressive Flows (NeurIPS 17), TarFlow (ICML 25), STarFlow (NeurIPS 25)
13 11/17 • World Models + Yang, Mengjiao, et al. "UniSim: Learning Interactive Real-World Simulators." ICLR, 2024.
11/19 + Bruce, Jake, et al. "Genie: Generative Interactive Environments." ICML, 2024.
14 11/24 + Yu, Hong-Xing, et al. "WonderJourney: Going from Anywhere to Everywhere." CVPR, 2024.
11/26 Thanksgiving Break - No class
15 12/01 • Efficiency & Scaling Strategies + Rajbhandari, Samyam, et al. "ZeRO: Memory Optimizations Toward Training Trillion Parameter Models." SC, 2020.
12/03 + Dao, Tri, et al. "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness." NeurIPS, 2022.
16 12/08 + Dao, Tri. "FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning." arXiv:2307.08691, 2023.
12/10 • Finals: Final Project Presentations. Final Project Report & Code Due

Semester Project

The semester project is a core component of this course. It provides an opportunity to explore a research topic in depth. Projects can be done individually or in groups of three. The goal is to produce a conference-quality paper (6-8 pages in a format like CVPR or NeurIPS).

Project Deliverables

  1. Proposal (2 pages): A clear description of the problem you want to solve, your proposed approach, required datasets/resources, and a plan for evaluation.
  2. Mid-term Report (3-4 pages): A progress update including preliminary experiments, results, and any roadblocks encountered. This will be accompanied by a short in-class presentation.
  3. Final Presentation (10 minutes): A concise summary of your project's motivation, methods, results, and conclusions, presented to the class.
  4. Final Report (6-8 pages): A complete, polished paper detailing your project. It should be structured like a typical research paper with an abstract, introduction, related work, methods, experiments, results, and conclusion.

Potential Project Ideas

University Policies

Academic Integrity

Each student must pursue his or her academic goals honestly and be personally accountable for all submitted work. Representing another person's work as your own is always a serious offense, and all submitted work must be your own. For the semester project, collaboration within your group is expected, but all work must be original to the group, and any use of external code or ideas must be properly cited. Any violation of academic integrity will be reported to the Academic Judiciary and can result in failure of the course. Faculty are required to report any suspected instances of academic dishonesty to the Academic Judiciary. For more comprehensive information on academic integrity, including categories of academic dishonesty, please refer to the Academic Judiciary website.

Student Accessibility Support Center Statement

If you have a physical, psychological, medical, or learning disability that may impact your course work, please contact the Student Accessibility Support Center, Stony Brook Union Suite 107, (631) 632-6748, or at sasc@stonybrook.edu. They will determine with you what accommodations are necessary and appropriate. All information and documentation is confidential.

Critical Incident Management

Stony Brook University expects students to respect the rights, privileges, and property of other people. Faculty are required to report to the Office of Student Conduct and Community Standards any disruptive behavior that interrupts their ability to teach, compromises the safety of the learning environment, or inhibits students' ability to learn. Faculty in the HSC Schools and the School of Medicine are required to follow their school-specific procedures. Further information about most academic matters can be found in the Undergraduate Bulletin, the Undergraduate Class Schedule, and the Faculty-Employee Handbook.