projects
Systems and ML infrastructure work.
Agentic-DS-Ops
MAY 2025 – PRESENT · Distributed Systems Project · Order Lab, Ann Arbor, MI

- Built an agent-based autonomous mitigation system for distributed failures such as overload and network faults on ZooKeeper clusters; ran experiments against a custom overload benchmark written in Go.
- Integrated Prometheus metrics with sliding windows plus JMX-exporter logs to detect trends and surface failures.
- Designed a risk-aware mitigation framework where the agent selects actions via HAProxy/Resilience4j based on symptom severity; reduced tail latency in overload cases by over 50%.
- Applied pre- and post-evaluators to predict and verify outcomes against SLOs/throughput before declaring success.
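
The detection and action-selection steps above can be illustrated with a small sketch. This is not the project's code: the class `SlidingWindow`, the thresholds, and the action names (`rate_limit_clients`, `shed_load_and_reroute`) are hypothetical, and the real system reads Prometheus/JMX metrics and acts through HAProxy/Resilience4j rather than returning strings.

```python
from collections import deque
from statistics import mean


class SlidingWindow:
    """Fixed-size window over recent latency samples (hypothetical helper)."""

    def __init__(self, size):
        self.samples = deque(maxlen=size)

    def add(self, value):
        self.samples.append(float(value))

    def trend(self):
        """Mean of the newer half minus mean of the older half; positive means rising."""
        items = list(self.samples)
        if len(items) < 2:
            return 0.0
        half = len(items) // 2
        return mean(items[half:]) - mean(items[:half])


def severity(window, slo_ms):
    """Classify the current symptom; thresholds are illustrative assumptions."""
    if not window.samples:
        return "ok"
    latest = window.samples[-1]
    if latest > 2 * slo_ms:
        return "critical"
    if latest > slo_ms or window.trend() > 0.25 * slo_ms:
        return "degraded"
    return "ok"


# Hypothetical mapping from severity to a mitigation of increasing risk.
ACTIONS = {
    "ok": "no_op",
    "degraded": "rate_limit_clients",
    "critical": "shed_load_and_reroute",
}


def select_action(window, slo_ms):
    return ACTIONS[severity(window, slo_ms)]
```

Escalating only when severity rises keeps the riskier mitigations for cases where the latest sample is already far past the SLO, while a rising trend alone triggers the milder action.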
CUDA Graphs for Reducing Kernel Launch Overhead
SEPT. 2025 – PRESENT · MLSys Research Project · Ann Arbor, MI

- Built a hybrid runtime proxy that reduces kernel-launch overhead and cuts tail latency for large recommendation and MoE inference systems.
- Applied CUDA Graphs to stable, high-arithmetic-intensity compute (MLP blocks/attention) and used a persistent kernel that consumes device-queue tasks for irregular micro-ops (pack/scatter/routing).
- Used bucketing and static pools to capture several robust graphs, achieving over 50% higher replay hit rate.
- Designed microbenchmarks to isolate persistent-kernel overhead, measuring interference with concurrent graph replay and fixing scheduling costs that dominated small glue segments.
- Project report (PDF)
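
The bucketing idea above can be sketched without a GPU. The following is a CPU-only simulation under stated assumptions: `GraphBucketCache`, the bucket sizes, and the `"capture"`/`"replay"`/`"eager"` labels are hypothetical stand-ins, and no CUDA API is called. It only shows how rounding a dynamic batch size up to a fixed bucket lets many distinct shapes reuse one captured graph.

```python
import bisect


class GraphBucketCache:
    """Simulates capture/replay decisions for bucketed input sizes (hypothetical)."""

    def __init__(self, buckets):
        self.buckets = sorted(buckets)
        self.captured = set()  # buckets that already have a captured graph
        self.hits = 0
        self.misses = 0

    def bucket_for(self, batch_size):
        """Smallest bucket that fits batch_size, or None if it exceeds all buckets."""
        i = bisect.bisect_left(self.buckets, batch_size)
        return self.buckets[i] if i < len(self.buckets) else None

    def run(self, batch_size):
        bucket = self.bucket_for(batch_size)
        if bucket is None:
            self.misses += 1
            return "eager"  # too large for any bucket: fall back to normal launches
        if bucket in self.captured:
            self.hits += 1
            return "replay"  # reuse the graph captured for this bucket
        self.captured.add(bucket)
        self.misses += 1
        return "capture"  # first use of this bucket: capture a graph

    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```

For example, with buckets `[8, 16, 32]`, batch sizes 5, 7, and 8 all map to the bucket of size 8, so only the first of them pays the capture cost. In a real runtime the inputs would be padded into static buffers sized for the bucket so that replay sees stable addresses.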
COCONUT Replication
Course project · GSM8K / ProsQA
- Replicating and extending the COCONUT latent-reasoning framework to study token efficiency vs. accuracy and hallucination trade-offs.