ABOUT THE COMPANY

We're building autonomous research agents for recursive self-improvement (multi-agent systems that propose, run, and analyze machine learning experiments). We're a small team based in San Francisco, working on-site.

ABOUT THE ROLE

You'll research how to make models efficient: quantization, speculative decoding, sparse and structured attention, distillation, mixture-of-experts inference, and the training-time techniques that make those methods possible. The work spans algorithm design, careful evaluation, and pushing methods to where they actually run.

This is a senior research role with a clear engineering edge. You'll spend time at the intersection of model architecture and inference performance, designing methods that move accuracy/latency/cost trade-offs in our favor (then partnering with engineers to make those wins real in production).

WHAT YOU'LL DO

- Research and develop quantization methods: post-training quantization, quantization-aware training, mixed-precision regimes, low-bit-width arithmetic

- Design and evaluate speculative decoding approaches: draft models, tree attention, parallel speculation, lookahead decoding

- Investigate training-time efficiency methods that compose well with inference: distillation, sparse attention, mixture-of-experts, low-rank adaptation, pruning

- Run controlled experiments at production scale; characterize what works on real workloads, not just toy benchmarks

- Co-design methods with the inference engineering team: push results to where they actually run rather than stopping at the paper

- Read deeply across the efficient ML / efficient inference literature; translate the most useful ideas into our stack

- Publish when the work warrants it; share findings internally

- Partner with model and training researchers so efficiency choices align with model architecture and post-training decisions

WHAT WE'RE LOOKING FOR

- Strong track record of ML research on efficiency methods: quantization, speculative decoding, distillation, MoE, sparse attention, or adjacent

- 5+ years of hands-on research experience

- Deep familiarity with both training and inference performance characteristics

- Fluent in PyTorch, JAX, or equivalent; comfortable working at the kernel and serving-framework level when methods require it

- Track record of moving efficiency research from prototype to production

- Strong statistical expertise: you'd notice a flawed comparison before someone else points it out

- Strong written communication

- Published research at NeurIPS, ICML, ICLR, MLSys, or comparable venues

NICE TO HAVE

- PhD in ML, systems, or related field

- Open-source contributions to quantization, speculative-decoding, or efficient-inference libraries

- Experience with hardware-aware optimization and accelerator-specific tooling

- Background in numerical methods, low-precision arithmetic, or approximate computation

THIS ROLE IS PROBABLY NOT FOR YOU IF

- You want to focus on pretraining large models from scratch (that's a different role)

- You prefer abstract algorithmic research without hands-on implementation

- You want a fixed benchmark with stable targets (our targets shift with what our models actually need to do)

RESEARCHER, EFFICIENT INFERENCE