CSE590-01: Vision-Language Foundation Models

Course Description

In this graduate-level special topics course, we will explore recent advancements in vision-language models (VLMs), focusing on their architectures, applications, and ongoing research. We will study VLMs applied to both images and videos, addressing tasks such as visual reasoning, classification by description, image and video captioning, text-to-image and text-to-video generation, and visual question answering. We will analyze current challenges, including representation learning, domain shifts, and cultural biases, with special attention to compositionality and embodied AI. We will emphasize on the critical analysis of these techniques, with a focus on reading and discussion of recent work. This course will primarily consist of student presentations on assigned conference and journal publications, and a semester-long project, aiming to equip students with a comprehensive understanding of the current state and future directions of VLMs.

Prerequisites

This is a graduate-level course. Students are expected to have a solid mathematics background and strong programming skills. Students are also expected to have a strong foundation in machine learning and deep learning (e.g., CSE 512, CSE 527, CSE 538, or equivalent). Proficiency in Python and experience with a deep learning framework (e.g., PyTorch, TensorFlow) are required.

Learning Objectives

Upon successful completion of this course, students will be able to:

Critically analyze and discuss state-of-the-art research papers in vision-language learning.
Understand the architectures and training procedures for prominent VLMs (e.g., CLIP, BLIP, LLaVA, VideoGPT, etc).
Implement and experiment with pre-trained VLMs for various downstream tasks.
Identify and articulate the key challenges and limitations of current models, including biases and evaluation.
Formulate a problem, design a project to address it, and present the findings in a conference-style format.

Course Structure & Grading

This is a graduate-level course centered around reading, presenting, and discussing research papers. A significant portion of this course is dedicated to a semester-long research project.

Paper Presentations	30%
Class Participation & Discussion	10%
Paper Critiques (4 total)	10%
Semester Project	50%

Credit Breakdown for Paper Presentations (15%):

Technical Understanding
Slide Design & Visual Aids
Delivery & Clarity
Critical Insight (Strengths & Weaknesses)
Q&A

Logistics: Students sign up in Week 2 to claim one paper from the approved pool. Upload a PDF of your slide deck to Piazza 48 hours before your in-class talk, prepare a presentation followed by Q&A, and conclude with 2–3 discussion topics that tie the work to course themes (TBD - subject to change).

Paper Critiques Structure (10%):

Concise Summary
Strengths Analysis
Weaknesses / Limitations
Extensions & Open Questions
Overall Rating (1–5) + Justification

Credit Breakdown for the Semester Project (50%):

Project Proposal (5%)
Mid-term Report & Presentation (15%)
Final Report (25%)
Final Presentation (5%)

Schedule & Readings

This schedule is tentative and may be adjusted. All papers will be linked from the course website. Students are expected to have read the assigned papers *before* the class session.

Week	Date	Topic	Readings	Due Activities
1	08/25 08/27	• Introduction & Course Overview + Foundations: Vision & Language Representations + Connecting Vision & Language	• Logistics. Introduction to Multimodality. Reading materials: + Vaswani et al. "Attention Is All You Need" + Dosovitskiy et al. "An Image is Worth 16x16 Words". + Radford et al. "Learning Transferable Visual Models From Natural Language Supervision".	Select Paper to Present
2	09/01	Labor Day - No class
	09/03	• Foundations & Representation Learning	Reading materials: + Li et al. "Align before Fuse: Vision and Language Representation Learning" + Li et al. "BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation" + Tong et al. "VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training"	Find Your Project Team Members
3	09/08	• Compositionality & Visual Reasoning
	09/10
4	09/15
	09/17	• Domain Shift & Generalization
5	09/22
	09/24
6	09/29	• Cultural Bias & Fairness		Project Proposal Due
	10/01
7	10/06
	10/08
8	10/13	• Embodied AI & Agents
	10/15
9	10/20			Midterm Report Due
	10/22	• Generation & Diffusion
10	10/27
	10/29
11	11/03
	11/05	• World Models
12	11/10
	11/12
13	11/17	• Efficiency & Scaling Strategies
	11/19
14	11/24
	11/26	Thanksgiving Break - No class
15	12/01	• Guest Lecture (TBA)
	12/03	• Guest Lecture (TBA)
16	12/08	• Guest Lecture (TBA)		Final Project Report & Code Due
	12/10	• Finals	Final Project Presentation

Semester Project

The semester project is a core component of this course. It provides an opportunity to explore a research topic in depth. Projects can be done individually or in groups of three. The goal is to produce a conference-quality paper (6-8 pages in a format like CVPR or NeurIPS).

Project Deliverables

Proposal (2 pages): A clear description of the problem you want to solve, your proposed approach, required datasets/resources, and a plan for evaluation.
Mid-term Report (3-4 pages): A progress update including preliminary experiments, results, and any roadblocks encountered. This will be accompanied by a short in-class presentation.
Final Presentation (10 minutes): A concise summary of your project's motivation, methods, results, and conclusions, presented to the class.
Final Report (6-8 pages): A complete, polished paper detailing your project. It should be structured like a typical research paper with an abstract, introduction, related work, methods, experiments, results, and conclusion.

Potential Project Ideas

Reproducibility Study: Re-implement a recent VLM and verify its results.
Novel Application: Apply an existing VLM to a specific domain or task (e.g., medical imaging, art analysis, etc).
Model Analysis: Conduct an in-depth analysis of a VLM's failure modes, biases, or compositional reasoning abilities.
New Method/Architecture: Propose and implement a new method for improving some aspect of VLMs (e.g., efficiency, data augmentation, fine-tuning).
New Dataset or Benchmark: Create a new dataset and benchmark to evaluate existing models.

University Policies

Academic Integrity

Each student must pursue his or her academic goals honestly and be personally accountable for all submitted work. Representing another person's work as your own is always a serious offense. All submitted work must be your own. For the semester project, collaboration within your group is expected, but all work must be original to the group. Any use of external code or ideas must be properly cited. Any violation of academic integrity will be reported to the Academic Judiciary and can result in failure of the course.

Disability Support Services (DSS)

If you have a physical, psychological, medical, or learning disability that may impact your course work, please contact Student Accessibility Support Center (SASC) at (631) 632-6748 or http://studentaffairs.stonybrook.edu/dss/. They will determine with you what accommodations are necessary and appropriate. All information and documentation is confidential.

Instructor	Paola Cascante-Bonilla Assistant Professor, Computer Science paola@cs.stonybrook.edu
Office Hours	TBD
TAs	TBD
TA Office Hours	TBD
Class Meetings	Monday/Wednesday - 6:30pm-7:50pm - Melville Library N4000