RESEARCHER, EFFICIENT INFERENCE

MakerMaker
Posted 7 days ago, valid for 7 days
Location

San Francisco, CA 94102

Salary

Competitive

Contract type

Full Time

By applying, a Sonicjobs account will be created for you. Sonicjobs's Privacy Policy and Terms & Conditions will apply.


Sonic Summary

  • The company is seeking a senior research engineer to develop efficient machine learning models, focusing on techniques like quantization and speculative decoding.
  • Candidates should have at least 5 years of hands-on research experience in efficiency methods and be fluent in PyTorch, Jax, or similar frameworks.
  • The role involves designing methods to improve accuracy, latency, and cost trade-offs, as well as collaborating with engineers to implement these methods in production.
  • A strong track record of published research in prominent venues such as NeurIPS or ICML is required, along with expertise in statistical analysis.
  • Salary details were not provided in the job description, but the position is based in San Francisco, suggesting a competitive compensation package.

ABOUT THE COMPANY

We're building autonomous research agents for recursive self-improvement: multi-agent systems that propose, run, and analyze machine learning experiments. We're a small team based in San Francisco, working on-site.

ABOUT THE ROLE

You'll research how to make models efficient: quantization, speculative decoding, sparse and structured attention, distillation, mixture-of-experts inference, and the training-time techniques that make those methods possible. The work spans algorithm design, careful evaluation, and pushing methods to where they actually run.

This is a senior research role with a clear engineering edge. You'll spend time at the intersection of model architecture and inference performance, designing methods that move accuracy/latency/cost trade-offs in our favor (then partnering with engineers to make those wins real in production).

WHAT YOU'LL DO

- Research and develop quantization methods: post-training quantization, quantization-aware training, mixed-precision regimes, low-bit-width arithmetic

- Design and evaluate speculative decoding approaches: draft models, tree attention, parallel speculation, lookahead decoding

- Investigate training-time efficiency methods that compose well with inference: distillation, sparse attention, mixture-of-experts, low-rank adaptation, pruning

- Run controlled experiments at production scale; characterize what works on real workloads, not just toy benchmarks

- Co-design methods with the inference engineering team: push results to where they actually run rather than stopping at the paper

- Read deeply across the efficient ML / efficient inference literature; translate the most useful ideas into our stack

- Publish when the work warrants it; share findings internally

- Partner with model and training researchers so efficiency choices align with model architecture and post-training decisions

WHAT WE'RE LOOKING FOR

- Strong track record of ML research on efficiency methods: quantization, speculative decoding, distillation, MoE, sparse attention, or adjacent

- 5+ years of hands-on research experience

- Deep familiarity with both training and inference performance characteristics

- Fluent in PyTorch, JAX, or equivalent; comfortable working at the kernel and serving-framework level when methods require it

- Track record of moving efficiency research from prototype to production

- Strong statistical expertise: you'd notice a flawed comparison before someone else points it out

- Strong written communication

- Published research at NeurIPS, ICML, ICLR, MLSys, or comparable venues

NICE TO HAVE

- PhD in ML, systems, or related field

- Open-source contributions to quantization, speculative-decoding, or efficient-inference libraries

- Experience with hardware-aware optimization and accelerator-specific tooling

- Background in numerical methods, low-precision arithmetic, or approximate computation

THIS ROLE IS PROBABLY NOT FOR YOU IF

- You want to focus on pretraining large models from scratch (that's a different role)

- You prefer abstract algorithmic research without hands-on implementation

- You want a fixed benchmark with stable targets (our targets shift with what our models actually need to do)



