I’m an MSCS Graduate Student and Research Assistant under Professor Jinyang Li at NYU Courant. Previously, I did my undergraduate studies in CS at NIT Surathkal, India, and worked at Oracle as a Server Technology Intern.
What am I currently working on? (as of March 2026)
3D Gaussian Ray Tracing — ML Systems Research @ NYU Courant - I am working under Professor Jinyang Li on machine learning systems research. I am currently experimenting with 3D Gaussian Ray Tracing, tweaking parameters across the pipeline (Gaussian count, bounding-box primitive type, etc.) and measuring how they affect overall performance. I have also been profiling the rendering kernels and the factors that affect them. I'm mostly interested in the tradeoff between BVH build time and rendering time, which is well described in the 3DGRT paper. My goal is to spot patterns among the observed bottlenecks and squeeze out as much performance as possible through systems optimizations.
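To make that concrete, here is a minimal Python sketch of the kind of sweep I run; `load_scene`, `build_bvh`, and `render` are hypothetical stand-ins for the actual 3DGRT pipeline hooks, not its real API:

```python
import itertools
import time

GAUSSIAN_COUNTS = [100_000, 500_000, 1_000_000]
PRIMITIVES = ["aabb", "icosahedron"]   # bounding primitives to compare

for n, prim in itertools.product(GAUSSIAN_COUNTS, PRIMITIVES):
    scene = load_scene(num_gaussians=n)          # hypothetical loader
    t0 = time.perf_counter()
    bvh = build_bvh(scene, primitive=prim)       # hypothetical BVH builder
    t1 = time.perf_counter()
    render(scene, bvh)                           # hypothetical render call
    t2 = time.perf_counter()
    # Timing build and render separately is the whole point: the 3DGRT
    # tradeoff is that a more expensive BVH can buy cheaper traversal.
    print(f"{n=:>9} {prim=:<12} build={t1 - t0:.3f}s render={t2 - t1:.3f}s")
```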
I built my own chess engine from scratch as a fun project. The engine used a bitboard-based board representation to enable fast move generation and board evaluation through bitwise operations. While bitboards made many things elegant and efficient, chess-specific rules like castling, en passant, and sliding-piece move generation made them a bit inconvenient to work with.
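For a flavor of why bitboards make move generation fast, here is a minimal Python sketch (illustrative only, not the engine's actual code, which was in MaPLe) that computes the attack set of every knight at once with a handful of shifts and masks:

```python
# Board is one 64-bit integer, bit 0 = a1 ... bit 63 = h8.
MASK64      = 0xFFFF_FFFF_FFFF_FFFF  # Python ints are unbounded; clamp to 64 bits
NOT_FILE_H  = 0x7F7F_7F7F_7F7F_7F7F  # clears wrap-around onto the h-file
NOT_FILE_GH = 0x3F3F_3F3F_3F3F_3F3F
NOT_FILE_A  = 0xFEFE_FEFE_FEFE_FEFE
NOT_FILE_AB = 0xFCFC_FCFC_FCFC_FCFC

def knight_attacks(knights: int) -> int:
    """Attack bitboard for ALL knights in `knights`, via bitwise ops only."""
    l1 = (knights >> 1) & NOT_FILE_H    # one file toward 'a'
    l2 = (knights >> 2) & NOT_FILE_GH   # two files toward 'a'
    r1 = (knights << 1) & NOT_FILE_A    # one file toward 'h'
    r2 = (knights << 2) & NOT_FILE_AB   # two files toward 'h'
    h1, h2 = l1 | r1, l2 | r2
    # Shift by 16 (two ranks) / 8 (one rank) to finish the L-shapes.
    return ((h1 << 16) | (h1 >> 16) | (h2 << 8) | (h2 >> 8)) & MASK64
```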
Beyond the core board representation and evaluation logic, the main focus of this project was game-tree search. I implemented minimax, alpha-beta pruning, and an optimized parallel version of alpha-beta using Principal Variation Search (PVS). Although alpha-beta is fundamentally sequential due to left-to-right dependencies, PVS allowed meaningful parallelism by first exploring the leftmost branch sequentially to establish tight alpha-beta bounds, and then searching the remaining branches in parallel using these bounds. This significantly improved pruning effectiveness while still exploiting multicore parallelism.
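A rough sequential sketch of the PVS recursion, in Python; `children`, `is_terminal`, and `evaluate` are hypothetical stand-ins for the engine's API, and the loop over non-first children is the part the engine ran in parallel:

```python
def pvs(node, depth, alpha, beta):
    if depth == 0 or node.is_terminal():
        return evaluate(node)            # side-to-move-relative (negamax)
    first, *rest = node.children()
    # Search the leftmost child with the full window to get a tight bound.
    best = -pvs(first, depth - 1, -beta, -alpha)
    alpha = max(alpha, best)
    for child in rest:                   # searched in parallel in the engine
        if alpha >= beta:
            break                        # beta cutoff
        # Null-window probe: cheap test of "can this child beat alpha?"
        score = -pvs(child, depth - 1, -alpha - 1, -alpha)
        if alpha < score < beta:         # probe failed high: re-search fully
            score = -pvs(child, depth - 1, -beta, -alpha)
        best = max(best, score)
        alpha = max(alpha, score)
    return best
```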
The engine was implemented in MaPLe, a Standard ML-based functional language designed for provably efficient and safe multicore parallelism. Parallelism was expressed using high-level primitives such as reduce, which we used to combine results from parallel alpha-beta searches in a stable and deterministic way. In addition to PVS, I also implemented semi-parallel and fully parallel minimax variants to compare performance trade-offs.
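As an illustration of the reduce-style combine (my own toy Python version, not the engine's MaPLe code): the combiner only needs to be associative and break ties deterministically, e.g. by move index, for the result to be the same regardless of how the runtime groups the parallel work.

```python
from functools import reduce

def combine(a, b):
    """Pick the higher-scoring (score, move_index) pair; ties go to the
    smaller move index, so the combine is order-insensitive."""
    (sa, ia), (sb, ib) = a, b
    if sa != sb:
        return a if sa > sb else b
    return a if ia < ib else b

results = [(12, 3), (40, 1), (40, 0), (-5, 2)]  # toy scores from 4 branches
print(reduce(combine, results))                  # (40, 0), deterministically
```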
We experimented with parallelizing move generation as well, but found that for chess, especially for sliding pieces like rooks, bishops, and queens, move generation was inherently sequential and offered limited parallel benefit relative to the scheduling overhead. Instead, performance gains came primarily from search optimizations such as lazy game-tree generation, where successor states were generated only when required rather than eagerly expanding the entire subtree.
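A hedged Python sketch of the lazy-generation idea, with hypothetical `legal_moves`, `apply_move`, and `evaluate` helpers: successors are produced on demand, so a cutoff early in the loop means the rest of the subtree is never even constructed.

```python
def successors(position):
    for move in legal_moves(position):   # moves generated one at a time
        yield apply_move(position, move)

def alphabeta(position, depth, alpha, beta):
    if depth == 0:
        return evaluate(position)
    best = float('-inf')
    for child in successors(position):   # children created lazily
        best = max(best, -alphabeta(child, depth - 1, -beta, -alpha))
        alpha = max(alpha, best)
        if alpha >= beta:                # cutoff: remaining successors
            break                        # are never generated at all
    return best
```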
Sequential alpha-beta consistently outperformed parallel minimax, while parallel PVS outperformed both, especially as the number of processors increased. There was still significant room for improvement, particularly in move ordering, which had a major impact on pruning efficiency.
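To illustrate the move-ordering point, even a toy MVV-LVA pass (most valuable victim, least valuable attacker) of the kind below tends to surface cutoffs earlier; `move.victim` and `move.attacker` are hypothetical fields, not the engine's representation.

```python
PIECE_VALUE = {'P': 1, 'N': 3, 'B': 3, 'R': 5, 'Q': 9, 'K': 100}

def order_moves(moves):
    """Try promising captures first so alpha-beta cuts off sooner."""
    def key(move):
        if move.victim is None:          # quiet moves go last
            return 0
        return 10 * PIECE_VALUE[move.victim] - PIECE_VALUE[move.attacker]
    return sorted(moves, key=key, reverse=True)
```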
I did some background study of NeRFs and their approach to view synthesis. After working through several research papers, I ran the Tensor Core implementation and profiled the individual kernels used across the pipeline. My goal was to spot the key bottlenecks in this pipeline.
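For the per-kernel numbers I used a profiler; below is a minimal sketch of that workflow with `torch.profiler`, where `model` and `batch` are assumptions standing in for the actual pipeline rather than the real code I profiled.

```python
import torch
from torch.profiler import profile, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    with torch.no_grad():
        model(batch)                     # one forward pass of the pipeline
    torch.cuda.synchronize()             # make sure all kernels are captured
# Rank kernels by total GPU time to find the bottlenecks.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```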
Research Papers Read:
I only had access to a consumer-grade GPU (an RTX 2080 Ti), which made it difficult to run the full 3DGS model. I read about CLM (similar in spirit to ZeRO-Offload), which works on a single consumer-grade GPU by offloading Gaussians to the CPU and loading them back when necessary. The paper discusses various issues with the naive approach and presents a detailed design and optimizations; I really enjoyed reading it.
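To sketch the core offloading idea (my own toy illustration, not CLM's actual design): keep the full Gaussian parameter tensor in CPU memory and move only the subset needed for the current view to the GPU, staging it through pinned memory so the host-to-device copy can run asynchronously.

```python
import torch

# Toy sizes; ~59 floats per Gaussian is roughly the 3DGS parameter count
# (position, scale, rotation, opacity, SH coefficients).
gaussians_cpu = torch.randn(1_000_000, 59)

def fetch_visible(indices: torch.Tensor) -> torch.Tensor:
    """Copy only the Gaussians visible from the current view to the GPU."""
    # Stage the subset in pinned memory so non_blocking=True can actually
    # overlap the copy with other work.
    subset = gaussians_cpu.index_select(0, indices).pin_memory()
    return subset.to("cuda", non_blocking=True)
```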
This post is about understanding 3 specific lines of lemma3, which the author claimed follow by an easy induction; it did not seem that simple to me, so this post works through proving the "easy induction" claim.