Evals Tech Lead

Confidential

United States
Permanent
Remote
$170,000 - $200,000/year
Python, LLM Evaluations, CI/CD Pipelines

About Us

We are a leading-edge technology organisation dedicated to pushing the boundaries of artificial intelligence. Our mission is to build safe, reliable, and highly capable AI systems that positively impact society. As we scale our models, ensuring their safety and performance through rigorous evaluation is paramount. We foster a collaborative, intellectually curious, and highly innovative environment where world-class engineers and researchers thrive.

Role Overview

We are seeking a visionary and highly technical Evals Tech Lead to lead the design of our evaluation engineering framework. In this role, you will be the driving force behind how we measure, analyse, and understand the capabilities and behaviours of our advanced AI models.

As the Tech Lead, you will bridge the gap between research and engineering, designing robust infrastructure to systematically evaluate model performance, alignment, safety, and generalisation.

Key Responsibilities

  • Framework Development: Architect and build scalable, reliable, and automated evaluation pipelines capable of testing large-scale models across diverse benchmarks.
  • Collaboration & Alignment: Work closely with model training, research, and safety teams to integrate evaluations seamlessly into the model development lifecycle.
  • Metric Definition: Design novel metrics and evaluation protocols to assess model capabilities, reasoning, safety boundaries, and potential biases.
  • Rigorous Analysis: Oversee the collection and analysis of evaluation data, translating complex test results into actionable insights for model improvement.

Required Skills and Qualifications

  • Technical Background: Degree in Computer Science, Machine Learning, Mathematics, or a closely related quantitative field.
  • Programming Expertise: Exceptional programming skills in Python and experience with modern ML frameworks (e.g., PyTorch, TensorFlow).
  • Evaluation Expertise: Deep understanding of evaluation methodologies for Large Language Models (LLMs) or generative AI, including benchmark design and statistical analysis.
  • Communication: Outstanding communication skills, with the ability to explain complex technical behaviours and evaluation results to both technical and non-technical stakeholders.
  • Analytical Mindset: A meticulous approach to analysing model behaviour, with a passion for scientific rigour and AI safety.

What We Offer

  • Competitive salary and meaningful equity packages.
  • Comprehensive health, dental, and vision programmes.
  • Flexible working arrangements to support a healthy work-life balance.
  • State-of-the-art computing resources and budget for continuous learning.
  • A collaborative, inclusive, and forward-thinking organisational culture.