projects

Systems and ML infrastructure work.

Agentic-DS-Ops

MAY 2025 – PRESENT · Distributed Systems Project · Order Lab, Ann Arbor, MI

Agentic Distributed System Ops diagram

  • Built an agent-based system that autonomously mitigates distributed failures, such as overload and network faults, on ZooKeeper clusters; ran experiments with a custom overload benchmark written in Go.
  • Combined sliding-window aggregates of Prometheus metrics with JMX exporter data to detect trends and surface failures.
  • Designed a risk-aware mitigation framework in which the agent selects actions, applied through HAProxy and Resilience4j, according to symptom severity; reduced tail latency under overload by more than 50%.
  • Added pre- and post-action evaluators that predict each mitigation's outcome and then verify it against SLOs and throughput before declaring success.
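
The sliding-window detection and severity-based action selection described above can be illustrated with a minimal sketch. This is not the project's code: the `SlidingWindow` class, the `severity` formula, the thresholds, and the action names in `ACTIONS` are all hypothetical, chosen only to show the general shape of the idea.

```python
from collections import deque
from statistics import mean


class SlidingWindow:
    """Keeps the most recent `size` samples of one metric (hypothetical helper)."""

    def __init__(self, size):
        self.samples = deque(maxlen=size)

    def add(self, value):
        self.samples.append(value)

    def trend(self):
        """Mean of the newer half minus mean of the older half of the window."""
        n = len(self.samples)
        if n < 4:
            return 0.0  # too few samples to call anything a trend
        values = list(self.samples)
        half = n // 2
        return mean(values[half:]) - mean(values[:half])


# Hypothetical action table: (minimum severity, action name), ordered by risk.
ACTIONS = [
    (0.0, "observe"),
    (0.3, "rate_limit"),
    (0.7, "shed_load"),
]


def severity(p99_ms, slo_ms, trend_ms):
    """Maps SLO violation plus upward latency trend to a score in [0, 1]."""
    over = max(0.0, p99_ms / slo_ms - 1.0)   # how far p99 exceeds the SLO
    rising = max(0.0, trend_ms / slo_ms)     # how fast latency is climbing
    return min(1.0, over + rising)


def select_action(sev):
    """Picks the riskiest action whose severity threshold is met."""
    chosen = ACTIONS[0][1]
    for threshold, name in ACTIONS:
        if sev >= threshold:
            chosen = name
    return chosen
```

In this sketch a riskier action is only chosen once the score crosses its threshold, so mild symptoms get the least disruptive response.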

CUDA Graphs for Reducing Kernel Launch Overhead

SEPT. 2025 – PRESENT · MLSys Research Project · Ann Arbor, MI

CUDA Graph Runtime diagram

  • Built a hybrid runtime proxy that reduces kernel-launch overhead and cuts tail latency for large recommendation and mixture-of-experts (MoE) inference systems.
  • Used CUDA Graphs for stable, high-arithmetic-intensity compute (MLP blocks, attention) and a persistent kernel that consumes tasks from a device queue for irregular micro-ops (pack, scatter, routing).
  • Used bucketing and static pools to capture several robust graphs, raising the replay hit rate by more than 50%.
  • Designed microbenchmarks that isolate persistent-kernel overhead and measure its interference with concurrent graph replay; used them to fix scheduling costs that dominated small glue segments.
  • Project report (PDF)
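
The bucketing idea above can be sketched in plain Python without a GPU. This is an illustration only, under the assumption that requests are bucketed by batch size: the `BucketedGraphCache` class and its bucket sizes are hypothetical, and a string stands in for a captured CUDA graph.

```python
import bisect


class BucketedGraphCache:
    """Rounds each batch size up to a fixed bucket so one captured graph
    can be replayed for every size in that bucket (hypothetical sketch)."""

    def __init__(self, buckets):
        self.buckets = sorted(buckets)
        self.captured = {}  # bucket size -> stand-in for a captured graph
        self.hits = 0
        self.misses = 0

    def bucket_for(self, batch_size):
        """Smallest bucket that fits the batch, or None if it is too large."""
        i = bisect.bisect_left(self.buckets, batch_size)
        return self.buckets[i] if i < len(self.buckets) else None

    def run(self, batch_size):
        bucket = self.bucket_for(batch_size)
        if bucket is None:
            self.misses += 1
            return "eager"    # no bucket fits: fall back to normal launches
        if bucket not in self.captured:
            self.captured[bucket] = f"graph[{bucket}]"
            self.misses += 1
            return "capture"  # first use of this bucket pays the capture cost
        self.hits += 1
        return "replay"       # later uses replay the captured graph

    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```

Coarser buckets raise the replay hit rate at the cost of padding smaller batches up to the bucket size, which is the trade-off a bucketing scheme has to balance.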

COCONUT Replication

Course project · GSM8K / ProsQA

  • Replicating and extending the COCONUT (Chain of Continuous Thought) latent-reasoning framework to study the trade-offs among token efficiency, accuracy, and hallucination.