ABOUT THE COMPANY
We're building autonomous research agents for recursive self-improvement (multi-agent systems that propose, run, and analyze machine learning experiments). We're a small team based in San Francisco, working on-site.
ABOUT THE ROLE
You'll research methods for making models efficient: quantization, speculative decoding, sparse and structured attention, distillation, mixture-of-experts inference, and the training-time techniques that make those methods possible. The work spans algorithm design, careful evaluation, and pushing methods to where they actually run.
This is a senior research role with a clear engineering edge. You'll spend time at the intersection of model architecture and inference performance, designing methods that move accuracy/latency/cost trade-offs in our favor (then partnering with engineers to make those wins real in production).
WHAT YOU'LL DO
- Research and develop quantization methods: post-training quantization, quantization-aware training, mixed-precision regimes, low-bit-width arithmetic
- Design and evaluate speculative decoding approaches: draft models, tree attention, parallel speculation, lookahead decoding
- Investigate training-time efficiency methods that compose well with inference: distillation, sparse attention, mixture-of-experts, low-rank adaptation, pruning
- Run controlled experiments at production scale; characterize what works on real workloads, not just toy benchmarks
- Co-design methods with the inference engineering team: push results to where they actually run rather than stopping at the paper
- Read deeply across the efficient ML / efficient inference literature; translate the most useful ideas into our stack
- Publish when the work warrants it; share findings internally
- Partner with model and training researchers so efficiency choices align with model architecture and post-training decisions
WHAT WE'RE LOOKING FOR
- Strong track record of ML research on efficiency methods: quantization, speculative decoding, distillation, MoE, sparse attention, or adjacent
- 5+ years of hands-on research experience
- Deep familiarity with both training and inference performance characteristics
- Fluent in PyTorch, JAX, or equivalent; comfortable working at the kernel and serving-framework level when methods require it
- Track record of moving efficiency research from prototype to production
- Strong statistical expertise: you'd notice a flawed comparison before someone else points it out
- Strong written communication
- Published research at NeurIPS, ICML, ICLR, MLSys, or comparable venues
NICE TO HAVE
- PhD in ML, systems, or a related field
- Open-source contributions to quantization, speculative-decoding, or efficient-inference libraries
- Experience with hardware-aware optimization and accelerator-specific tooling
- Background in numerical methods, low-precision arithmetic, or approximate computation
THIS ROLE IS PROBABLY NOT FOR YOU IF
- You want to focus on pretraining large models from scratch (that's a different role)
- You prefer abstract algorithmic research without hands-on implementation
- You want a fixed benchmark with stable targets (our targets shift with what our models actually need to do)
