NYC Computer Vision Day 2025
NYC Computer Vision Day is an invite-only, informal event where the computer vision community from NYC and its surroundings can meet and share ideas. A primary focus is visibility for graduate students and early-career researchers.
Date and Time: February 3, 2025, 10:00AM — 6:00PM
Location: New York Marriott at the Brooklyn Bridge
Address: 333 Adams St, Brooklyn, NY 11201
Directions: See directions to NYU Tandon, which is next door.
Lead Organizer: David Fouhey
Advisory Committee: Carl Vondrick, Olga Russakovsky, Jia Deng
Program Committee: Zhuang Liu, Sarah Jabbour, Mahi Shafiullah, Sunnie S. Y. Kim, Ruoshi Liu
Attendance Information: There is a strict guest list. If you are not a confirmed guest, you will not be admitted to the event. There are no exceptions.
Casual Conversations and Coffee: 9:30AM — 10:00AM
Doors open just after 9:30AM to give everyone time to settle in with some coffee.
Talk Session 1: 10:00AM — 11:30AM
⚡ Lightning Talk Session 1
- Peiyao Wang, Stony Brook: Efficient Temporal Action Segmentation via Boundary-aware Query Voting
- Ziyun Wang, Penn (website): Human Motion Fields from Events
- Eadom Dessalene, UMD: Understanding the organization of action from large-scale video datasets
- Junyao Shi, Penn: ZeroMimic: Distilling Robotic Manipulation Skills from Web Videos (paper)
- Sanjoy Chowdhury, UMD (website): EgoAdapt: A Joint Distillation and Policy Learning Framework for Efficient Multisensory Egocentric Perception
- Yanlai Yang, NYU (website): Memory Storyboard: Leveraging Temporal Segmentation for Streaming Self-Supervised Learning from Egocentric Videos (paper)
- Seong Jong Yoo, UMD (website): VioPose: Violin Performance 4D Pose Estimation by Hierarchical Audiovisual Inference (paper)
- Ruojin Cai, Cornell (website): Can Generative Video Models Help Pose Estimation? (paper)
- Siddhant Haldar, NYU (website): Generalizable Priors for Policy Learning
- Xiang Li, Stony Brook (website): LLaRA: Supercharging Robot Learning Data for Vision-Language Policy (paper)
- Binghao Huang, Columbia (website): 3D-ViTac: Learning Fine-Grained Manipulation with Visuo-Tactile Sensing (paper)
- Yiqing Liang, Brown (website): Zero-Shot Monocular Scene Flow Estimation in the Wild (paper)
💎 Keynote 1: Hadar Averbuch-Elor
Assistant Professor, Cornell Tech
Editing 3D Objects with (and without) Multimodal Foundation Models
Multimodal foundation models have recently shown great promise for a wide array of tasks involving images, text and 3D geometry, pushing the boundaries of what was considered impossible just a few years ago. In this talk, I will present an ongoing line of research that leverages multimodal foundation models for addressing the task of semantic editing over 3D objects, and demonstrate the paradigm shift between how we addressed this task several years ago and how we address it today in the presence of these powerful models.
🥪 Lunch and 🪧 Poster Session 1: 11:30AM — 1:30PM
We'll have posters from 35 of the 75+ attending labs, and ample time for casual conversation.
Link to Poster Session IDs
Talk Session 2: 1:30PM — 4:30PM
💎 Welcoming Remarks: Juan De Pablo
Executive Vice President for Global Science and Technology, New York University
Executive Dean of the NYU Tandon School of Engineering
⚡ Lightning Talk Session 2
- Jiuhong Xiao, NYU (website): Learning Visual Geo-Localization (paper)
- Ilya Chugunov, Princeton (website): Neural Light Spheres for Implicit Image Stitching and View Synthesis (paper)
- Sihang Li, NYU (website): Unleashing the Power of Data Synthesis in Visual Localization (paper)
- Chris Rockwell, NYU (website): Dynamic Camera Poses and Where to Find Them
- Liyan Chen, Stevens: Learning the distribution of errors in stereo matching for joint disparity and uncertainty estimation
- Ran Gong, NYU: Differentiable Textured Surfel Octree as an Efficient 3D Scene Representation
- Hyoungseob Park, Yale: AugUndo: Scaling Up Augmentations for Monocular Depth Completion and Estimation
- Nhan Tran, Cornell (website): Personal Time-Lapse (paper)
- Jiawei Liu, CUNY: SMDAF: A Scalable Sidewalk Material Data Acquisition Framework with Bidirectional Cross-Modal Knowledge Distillation
- Rao Fu, Brown (website): GigaHands: A Massive Annotated Dataset of Bimanual Hand Activities (paper)
- Moinak Bhattacharya, Stony Brook (website): Eye Gaze-driven Medical Image Analysis
- Rangel Daroya, UMass: WildSAT: Learning Satellite Image Representations from Wildlife Observations
Brief Break
💎 Keynote 2: Yunzhu Li
Assistant Professor, Columbia University
Foundation Models for Robotic Manipulation: Opportunities and Challenges
In this talk, I will discuss the opportunities for incorporating foundation models into classic robotic pipelines to endow robots with capabilities beyond those achievable with traditional robotic tools. The central idea behind this research is to translate the commonsense knowledge embedded in foundation models into structural priors that can be integrated into robot learning systems. I will demonstrate how such integration enables robots to interpret instructions provided in free-form natural language to handle a wide range of real-world manipulation tasks. Toward the end of the talk, I will discuss the limitations of the current foundation models, challenges that still lie ahead, and potential avenues to address these challenges.
⚡ Lightning Talk Session 3
- Yihong Sun, Cornell (website): Video Creation by Demonstration (paper)
- Yunxiang Zhang, NYU (website): GazeFusion: Saliency-Guided Image Generation (paper)
- Ye Zhu, Princeton (website): Generative dynamics for image control and beyond
- Alexander Raistrick, Princeton (website): Infinigen Indoors: Photorealistic Indoor Scenes using Procedural Generation (paper)
- Qinhong Zhou, UMass (website): Virtual Community: A Generative Social World for Embodied AI (paper)
- Saumya Gupta, Stony Brook (website): TopoDiffusionNet: A Topology-aware Diffusion Model (paper)
- Courtney King, Fordham: Efficient Occluded Object Detection Using Scene Context
- Yuan Zang, Brown: Pre-trained Vision-Language Models Learn Discoverable Visual Concepts
- Morris Alper, Cornell (website): Emergent Visual-Semantic Hierarchies in Image-Text Representations (paper)
- Hritam Basak, Stony Brook (website): Forget More to Learn More: Domain-specific Feature Unlearning for Semi-supervised and Unsupervised Domain Adaptation (paper)
- Matthew Chan, UMD (website): HyperDM: Estimating Epistemic and Aleatoric Uncertainty with a Single Model (paper)
- Willis Ma, NYU (website): Inference Time Scaling for Diffusion Models beyond Scaling Denoising Steps (paper)
🪧 Poster Session 2: 4:30PM — 6:00PM
We'll have posters from 35 of the 75+ attending labs, and ample time for casual conversation.
Link to Poster Session IDs
Sponsors
We gratefully acknowledge the support of: