NYC Computer Vision Day 2025
NYC Computer Vision Day is an invite-only, informal event where the computer vision community from NYC and its surroundings can meet and share ideas. A primary focus is visibility for graduate students and early-career researchers.
Date and Time: February 3, 2025, 10:00AM — 6:00PM
Location: New York Marriott at the Brooklyn Bridge
Address: 333 Adams St, Brooklyn, NY 11201
Directions: See directions to NYU Tandon, which is next door.
Lead Organizer: David Fouhey
Advisory Committee: Carl Vondrick, Olga Russakovsky, Jia Deng
Program Committee: Zhuang Liu, Sarah Jabbour, Mahi Shafiullah, Sunnie S. Y. Kim, Ruoshi Liu
Attendance Information: There is a strict guest list. If you are not a confirmed guest, you will not be admitted to the event. There are no exceptions.
Casual Conversations and Coffee: 9:30AM — 10:00AM
Doors open just after 9:30AM to give everyone time to settle in with some coffee.
Talk Session 1: 10:00AM — 11:30AM
⚡ Lightning Talk Session 1
- Peiyao Wang, Stony Brook: Efficient Temporal Action Segmentation via Boundary-aware Query Voting
- Ziyun Wang, Penn (website): Human Motion Fields from Events
- Eadom Dessalene, UMD: Understanding the organization of action from large-scale video datasets
- Junyao Shi, Penn: ZeroMimic: Distilling Robotic Manipulation Skills from Web Videos (paper)
- Sanjoy Chowdhury, UMD (website): EgoAdapt: A Joint Distillation and Policy Learning Framework for Efficient Multisensory Egocentric Perception
- Yanlai Yang, NYU (website): Memory Storyboard: Leveraging Temporal Segmentation for Streaming Self-Supervised Learning from Egocentric Videos (paper)
- Seong Jong Yoo, UMD (website): VioPose: Violin Performance 4D Pose Estimation by Hierarchical Audiovisual Inference (paper)
- Ruojin Cai, Cornell (website): Can Generative Video Models Help Pose Estimation? (paper)
- Siddhant Haldar, NYU (website): Generalizable Priors for Policy Learning
- Xiang Li, Stony Brook (website): LLaRA: Supercharging Robot Learning Data for Vision-Language Policy (paper)
- Binghao Huang, Columbia (website): 3D-ViTac: Learning Fine-Grained Manipulation with Visuo-Tactile Sensing (paper)
- Yiqing Liang, Brown (website): Zero-Shot Monocular Scene Flow Estimation in the Wild (paper)
💎 Keynote 1: Hadar Averbuch-Elor
Assistant Professor, Cornell Tech
Editing 3D Objects with (and without) Multimodal Foundation Models
Multimodal foundation models have recently shown great promise for a wide array of tasks involving images, text and 3D geometry, pushing the boundaries of what was considered impossible just a few years ago. In this talk, I will present an ongoing line of research that leverages multimodal foundation models for addressing the task of semantic editing over 3D objects, and demonstrate the paradigm shift between how we addressed this task several years ago and how we address it today in the presence of these powerful models.
🥪 Lunch and 🪧 Poster Session 1: 11:30AM — 1:30PM
We'll have posters from 35 of the 75+ attending labs, and ample time for casual conversation.
Link to Poster Session IDs
Talk Session 2: 1:30PM — 4:30PM
💎 Welcoming Remarks: Juan De Pablo
Executive Vice President for Global Science and Technology, New York University
Executive Dean of the NYU Tandon School of Engineering
⚡ Lightning Talk Session 2
- Jiuhong Xiao, NYU (website): Learning Visual Geo-Localization (paper)
- Ilya Chugunov, Princeton (website): Neural Light Spheres for Implicit Image Stitching and View Synthesis (paper)
- Sihang Li, NYU (website): Unleashing the Power of Data Synthesis in Visual Localization (paper)
- Chris Rockwell, NYU (website): Dynamic Camera Poses and Where to Find Them
- Liyan Chen, Stevens: Learning the distribution of errors in stereo matching for joint disparity and uncertainty estimation
- Ran Gong, NYU: Differentiable Textured Surfel Octree as an Efficient 3D Scene Representation
- Hyoungseob Park, Yale: AugUndo: Scaling Up Augmentations for Monocular Depth Completion and Estimation
- Nhan Tran, Cornell (website): Personal Time-Lapse (paper)
- Jiawei Liu, CUNY: SMDAF: A Scalable Sidewalk Material Data Acquisition Framework with Bidirectional Cross-Modal Knowledge Distillation
- Rao Fu, Brown (website): GigaHands: A Massive Annotated Dataset of Bimanual Hand Activities (paper)
- Moinak Bhattacharya, Stony Brook (website): Eye Gaze-driven Medical Image Analysis
- Rangel Daroya, UMass: WildSAT: Learning Satellite Image Representations from Wildlife Observations
Brief Break
💎 Keynote 2: Yunzhu Li
Assistant Professor, Columbia University
Foundation Models for Robotic Manipulation: Opportunities and Challenges
In this talk, I will discuss the opportunities for incorporating foundation models into classic robotic pipelines to endow robots with capabilities beyond those achievable with traditional robotic tools. The central idea behind this research is to translate the commonsense knowledge embedded in foundation models into structural priors that can be integrated into robot learning systems. I will demonstrate how such integration enables robots to interpret instructions provided in free-form natural language to handle a wide range of real-world manipulation tasks. Toward the end of the talk, I will discuss the limitations of the current foundation models, challenges that still lie ahead, and potential avenues to address these challenges.
⚡ Lightning Talk Session 3
- Yihong Sun, Cornell (website): Video Creation by Demonstration (paper)
- Yunxiang Zhang, NYU (website): GazeFusion: Saliency-Guided Image Generation (paper)
- Ye Zhu, Princeton (website): Generative dynamics for image control and beyond
- Alexander Raistrick, Princeton (website): Infinigen Indoors: Photorealistic Indoor Scenes using Procedural Generation (paper)
- Qinhong Zhou, UMass (website): Virtual Community: A Generative Social World for Embodied AI (paper)
- Saumya Gupta, Stony Brook (website): TopoDiffusionNet: A Topology-aware Diffusion Model (paper)
- Courtney King, Fordham: Efficient Occluded Object Detection Using Scene Context
- Yuan Zang, Brown: Pre-trained Vision-Language Models Learn Discoverable Visual Concepts
- Morris Alper, Cornell (website): Emergent Visual-Semantic Hierarchies in Image-Text Representations (paper)
- Hritam Basak, Stony Brook (website): Forget More to Learn More: Domain-specific Feature Unlearning for Semi-supervised and Unsupervised Domain Adaptation (paper)
- Matthew Chan, UMD (website): HyperDM: Estimating Epistemic and Aleatoric Uncertainty with a Single Model (paper)
- Willis Ma, NYU (website): Inference Time Scaling for Diffusion Models beyond Scaling Denoising Steps (paper)
🪧 Poster Session 2: 4:30PM — 6:00PM
We'll have posters from 35 of the 75+ attending labs, and ample time for casual conversation.
Link to Poster Session IDs
Sponsors
We gratefully acknowledge the support of: