I’m an MSCS Graduate Student and Research Assistant under Professor Jinyang Li at NYU Courant. Previously, I did my undergraduate studies in CS at NIT Surathkal, India, and worked at Oracle as a Server Technology Intern.
What am I currently working on? (as of March 2026)
3D Gaussian Ray Tracing — ML Systems Research @ NYU Courant - I am working under Professor Jinyang Li on machine learning systems research. I am currently experimenting with 3D Gaussian Ray Tracing, tweaking parameters across the pipeline (Gaussian count, bounding-box primitive type, etc.) and measuring how they affect overall performance. I have also been profiling the rendering kernels and the factors that affect them. I'm mostly interested in the tradeoff between BVH build time and rendering time, which is well described in the 3DGRT paper. My goal is to spot patterns among the observed bottlenecks and squeeze out as much performance as possible through systems optimizations.
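To make that concrete, here is a minimal Python sketch of the kind of sweep I run; `load_scene`, `build_bvh`, and `render` are hypothetical stand-ins for the actual 3DGRT pipeline hooks, not its real API:

```python
import itertools
import time

GAUSSIAN_COUNTS = [100_000, 500_000, 1_000_000]
PRIMITIVES = ["aabb", "icosahedron"]   # bounding primitives to compare

for n, prim in itertools.product(GAUSSIAN_COUNTS, PRIMITIVES):
    scene = load_scene(num_gaussians=n)          # hypothetical loader
    t0 = time.perf_counter()
    bvh = build_bvh(scene, primitive=prim)       # hypothetical BVH builder
    t1 = time.perf_counter()
    render(scene, bvh)                           # hypothetical render call
    t2 = time.perf_counter()
    # Timing build and render separately is the whole point: the 3DGRT
    # tradeoff is that a more expensive BVH can buy cheaper traversal.
    print(f"{n=:>9} {prim=:<12} build={t1 - t0:.3f}s render={t2 - t1:.3f}s")
```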
I built my own chess engine from scratch as a fun project. The engine used a bitboard-based board representation to enable fast move generation and board evaluation through bitwise operations. While bitboards made many things elegant and efficient, chess-specific rules like castling, en passant, and sliding-piece move generation made them a bit inconvenient to work with.
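For a flavor of why bitboards make move generation fast, here is a minimal Python sketch (illustrative only, not the engine's actual code, which was in MaPLe) that computes the attack set of every knight at once with a handful of shifts and masks:

```python
# Board is one 64-bit integer, bit 0 = a1 ... bit 63 = h8.
MASK64      = 0xFFFF_FFFF_FFFF_FFFF  # Python ints are unbounded; clamp to 64 bits
NOT_FILE_H  = 0x7F7F_7F7F_7F7F_7F7F  # clears wrap-around onto the h-file
NOT_FILE_GH = 0x3F3F_3F3F_3F3F_3F3F
NOT_FILE_A  = 0xFEFE_FEFE_FEFE_FEFE
NOT_FILE_AB = 0xFCFC_FCFC_FCFC_FCFC

def knight_attacks(knights: int) -> int:
    """Attack bitboard for ALL knights in `knights`, via bitwise ops only."""
    l1 = (knights >> 1) & NOT_FILE_H    # one file toward 'a'
    l2 = (knights >> 2) & NOT_FILE_GH   # two files toward 'a'
    r1 = (knights << 1) & NOT_FILE_A    # one file toward 'h'
    r2 = (knights << 2) & NOT_FILE_AB   # two files toward 'h'
    h1, h2 = l1 | r1, l2 | r2
    # Shift by 16 (two ranks) / 8 (one rank) to finish the L-shapes.
    return ((h1 << 16) | (h1 >> 16) | (h2 << 8) | (h2 >> 8)) & MASK64
```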
Beyond the core board representation and evaluation logic, the main focus of this project was game-tree search. I implemented minimax, alpha-beta pruning, and an optimized parallel version of alpha-beta using Principal Variation Search (PVS). Although alpha-beta is fundamentally sequential due to left-to-right dependencies, PVS allowed meaningful parallelism by first exploring the leftmost branch sequentially to establish tight alpha-beta bounds, and then searching the remaining branches in parallel using these bounds. This significantly improved pruning effectiveness while still exploiting multicore parallelism.
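A rough sequential sketch of the PVS recursion, in Python; `children`, `is_terminal`, and `evaluate` are hypothetical stand-ins for the engine's API, and the loop over non-first children is the part the engine ran in parallel:

```python
def pvs(node, depth, alpha, beta):
    if depth == 0 or node.is_terminal():
        return evaluate(node)            # side-to-move-relative (negamax)
    first, *rest = node.children()
    # Search the leftmost child with the full window to get a tight bound.
    best = -pvs(first, depth - 1, -beta, -alpha)
    alpha = max(alpha, best)
    for child in rest:                   # searched in parallel in the engine
        if alpha >= beta:
            break                        # beta cutoff
        # Null-window probe: cheap test of "can this child beat alpha?"
        score = -pvs(child, depth - 1, -alpha - 1, -alpha)
        if alpha < score < beta:         # probe failed high: re-search fully
            score = -pvs(child, depth - 1, -beta, -alpha)
        best = max(best, score)
        alpha = max(alpha, score)
    return best
```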
The engine was implemented in MaPLe, a Standard ML-based functional language designed for provably efficient and safe multicore parallelism. Parallelism was expressed using high-level primitives such as reduce, which we used to combine results from parallel alpha-beta searches in a stable and deterministic way. In addition to PVS, I also implemented semi-parallel and fully parallel minimax variants to compare performance trade-offs.
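As an illustration of the reduce-style combine (my own toy Python version, not the engine's MaPLe code): the combiner only needs to be associative and break ties deterministically, e.g. by move index, for the result to be the same regardless of how the runtime groups the parallel work.

```python
from functools import reduce

def combine(a, b):
    """Pick the higher-scoring (score, move_index) pair; ties go to the
    smaller move index, so the combine is order-insensitive."""
    (sa, ia), (sb, ib) = a, b
    if sa != sb:
        return a if sa > sb else b
    return a if ia < ib else b

results = [(12, 3), (40, 1), (40, 0), (-5, 2)]  # toy scores from 4 branches
print(reduce(combine, results))                  # (40, 0), deterministically
```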
We experimented with parallelizing move generation as well, but found that for chess, especially for sliding pieces like rooks, bishops, and queens, move generation was inherently sequential and offered limited parallel benefit relative to the scheduling overhead. Instead, performance gains came primarily from search optimizations such as lazy game-tree generation, where successor states were generated only when required rather than eagerly expanding the entire subtree.
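A hedged Python sketch of the lazy-generation idea, with hypothetical `legal_moves`, `apply_move`, and `evaluate` helpers: successors are produced on demand, so a cutoff early in the loop means the rest of the subtree is never even constructed.

```python
def successors(position):
    for move in legal_moves(position):   # moves generated one at a time
        yield apply_move(position, move)

def alphabeta(position, depth, alpha, beta):
    if depth == 0:
        return evaluate(position)
    best = float('-inf')
    for child in successors(position):   # children created lazily
        best = max(best, -alphabeta(child, depth - 1, -beta, -alpha))
        alpha = max(alpha, best)
        if alpha >= beta:                # cutoff: remaining successors
            break                        # are never generated at all
    return best
```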
Sequential alpha-beta consistently outperformed parallel minimax, while parallel PVS outperformed both, especially as the number of processors increased. There was still significant room for improvement, particularly in move ordering, which had a major impact on pruning efficiency.
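To illustrate the move-ordering point, even a toy MVV-LVA pass (most valuable victim, least valuable attacker) of the kind below tends to surface cutoffs earlier; `move.victim` and `move.attacker` are hypothetical fields, not the engine's representation.

```python
PIECE_VALUE = {'P': 1, 'N': 3, 'B': 3, 'R': 5, 'Q': 9, 'K': 100}

def order_moves(moves):
    """Try promising captures first so alpha-beta cuts off sooner."""
    def key(move):
        if move.victim is None:          # quiet moves go last
            return 0
        return 10 * PIECE_VALUE[move.victim] - PIECE_VALUE[move.attacker]
    return sorted(moves, key=key, reverse=True)
```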
I did some background study of NeRFs and their approach to view synthesis. After working through several research papers, I ran the Tensor Core implementation and profiled the individual kernels used across the pipeline. My goal was to spot the key bottlenecks in this pipeline.
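For the per-kernel numbers I used a profiler; below is a minimal sketch of that workflow with `torch.profiler`, where `model` and `batch` are assumptions standing in for the actual pipeline rather than the real code I profiled.

```python
import torch
from torch.profiler import profile, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    with torch.no_grad():
        model(batch)                     # one forward pass of the pipeline
    torch.cuda.synchronize()             # make sure all kernels are captured
# Rank kernels by total GPU time to find the bottlenecks.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```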
Research Papers Read:
I only had access to a consumer-grade GPU (an RTX 2080 Ti), which made it difficult to run the full 3DGS model. I read about CLM (similar in spirit to ZeRO-Offload), which works on a single consumer-grade GPU by offloading Gaussians to the CPU and loading them back when necessary. The paper discusses various issues with the naive approach and presents a detailed design and optimizations; I really enjoyed reading it.
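To sketch the core offloading idea (my own toy illustration, not CLM's actual design): keep the full Gaussian parameter tensor in CPU memory and move only the subset needed for the current view to the GPU, staging it through pinned memory so the host-to-device copy can run asynchronously.

```python
import torch

# Toy sizes; ~59 floats per Gaussian is roughly the 3DGS parameter count
# (position, scale, rotation, opacity, SH coefficients).
gaussians_cpu = torch.randn(1_000_000, 59)

def fetch_visible(indices: torch.Tensor) -> torch.Tensor:
    """Copy only the Gaussians visible from the current view to the GPU."""
    # Stage the subset in pinned memory so non_blocking=True can actually
    # overlap the copy with other work.
    subset = gaussians_cpu.index_select(0, indices).pin_memory()
    return subset.to("cuda", non_blocking=True)
```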
This post is about understanding 3 specific lines of lemma3, which the author claimed follow by an easy induction; it did not seem that simple to me, so this post works through proving the "easy induction" claim.