About This Role

STEM Computational Scientific Software & Evaluation Design

About the Project

We're building a large-scale evaluation benchmark for advanced AI reasoning across scientific and engineering domains. Our task designers create challenging computational problems that test whether AI systems can use real scientific software tools to solve research-grade problems, from querying simulations and interpreting outputs to designing experimental strategies and recovering hidden information from data.

This is not a typical annotation or labeling role. You'll be designing original, graduate-level computational problems grounded in real scientific workflows, calibrating them against frontier AI models, and iterating on problem design until the difficulty is right.

What You'll Do

You'll design problems that require sophisticated use of domain-specific scientific software libraries. Some problems will require computing precise outputs from fully specified setups — testing whether a solver can correctly implement complex multi-step scientific workflows. Others will require something harder: designing a sequence of queries or experiments to uncover information that isn't directly visible, demanding strategic reasoning about what to measure, how to interpret partial observations, and how to narrow down possibilities efficiently.

Each task goes through a calibration loop where it's tested against state-of-the-art AI models, and you'll refine the problem design until the difficulty hits the target range.

Domains & Tools We're Hiring For

We're especially interested in experts with deep, hands-on experience in the following areas:

Bioinformatics & Single-Cell Genomics: Working with tools like scanpy, scvelo, squidpy, and gudhi for single-cell RNA-seq analysis, trajectory inference, spatial transcriptomics, and topological data analysis. You should be comfortable designing problems around cell-type annotation, pseudotime ordering, multi-omic integration, spatial variable gene identification, and persistence-based analysis pipelines. This is our highest-throughput domain and where we're scaling first.

Computational Chemistry & Electronic Structure: Working with PySCF for quantum chemistry calculations including Hartree-Fock, DFT, TDDFT, CASSCF, and post-HF methods. Ideal candidates can design problems around excited-state analysis, orbital diagnostics, method selection for tricky electronic structures, and interpreting computational artifacts that arise from method limitations.

Particle & Nuclear Physics: Working with scikit-hep and related HEP Python tools for particle physics data analysis, cross-section computations, renormalization group calculations, and perturbative QCD. Experience with Monte Carlo event generation or collider phenomenology is a plus.

Electrical Engineering & RF/Circuit Design: Working with scikit-rf for RF and microwave network analysis, S-parameter characterization, and transmission-line modeling, or ngspice for circuit simulation, operating point analysis, and frequency response characterization. Candidates should be comfortable designing problems that involve recovering circuit parameters from measurement data.

Astrophysics & Cosmology: Working with astropy and related tools for cosmological calculations, angular power spectra, galaxy survey analysis, and observational data reduction pipelines.

Structural & Mechanical Engineering: Working with scikit-fem or similar finite element libraries for beam analysis, elasticity problems, and computational mechanics. Experience with Timoshenko beam theory, mesh convergence studies, or variational formulations is valuable.

Seismology & Geophysics: Working with ObsPy or SPECFEM for seismic waveform analysis, travel-time tomography, moment tensor inversion, or synthetic seismogram generation.

Pharmacokinetics & Systems Biology: Working with libRoadRunner, Tellurium, or SBML-based tools for compartmental PK/PD modeling, enzyme kinetics, or systems biology simulations.

Experience with other specialized software, whether for the domains above or for other domains, will also be considered.

What Makes a Strong Candidate

You have graduate-level expertise (MS or PhD preferred) in one or more of the domains listed above, with real hands-on experience using the specific software tools, not just theoretical knowledge of the field. You've written code that calls these libraries to solve actual research problems, and you understand where they break, what their edge cases are, and what makes a problem genuinely hard versus superficially complex.

Beyond domain expertise, the strongest candidates think like puzzle designers: they construct problems where the difficulty comes from reasoning strategy rather than brute computation, where there are multiple plausible approaches but only careful analysis reveals the right one, and where surface-level pattern matching won't get you to the answer.

Requirements

Graduate-level training in a relevant STEM domain (MS, PhD, or equivalent research experience)
Demonstrated proficiency with at least one of the listed scientific software libraries, evidenced by research publications, open-source contributions, or professional work
Strong Python programming skills — you'll be writing problem setups, oracle functions, and solution validators
Ability to work independently and iterate on problem designs based on calibration feedback
Comfortable working in a Linux/terminal environment with remote compute sandboxes
Available for at least 15–20 hours per week

Nice to Have

Experience across multiple listed domains or tools
Familiarity with benchmark or evaluation design
Background in scientific pedagogy or exam/problem-set design
Experience with computational reproducibility and containerized environments
