Braintrust | Code Reviewer | Engineering | Contract

Senior Software Engineer (Python) - Agent Evaluation - Freelance/Remote (100+ openings)

Toloka AI B.V.

Pay: $60-80/hour

Location (eligible): US, CA, AR +9

Difficulty: Hard

Posted: 2 days ago (Jul 1, 2026)

Hours: 30 hrs/week

Required Skills

Python, JavaScript, React, Data Analysis, Software Engineering, STEM, AI Training, Writing

About This Role

This opportunity is intended for experienced Senior Python Engineers only.

Open to candidates in North America, South America, Asia, and Europe. Please submit your CV in English and indicate your level of English proficiency.

Mindrift connects specialists with project-based AI opportunities for leading tech companies, focused on testing, evaluating, and improving AI systems. Participation is project-based, not permanent employment.

What this opportunity involves

We're building a dataset to evaluate AI coding agents - how well a model handles real-world developer tasks.

You'll create challenging tasks and evaluation criteria within realistic simulated environments:

- Build realistic developer environments: a virtual company with codebase, infrastructure, and context (tickets, docs, conversations) that forms a believable development history
- Design tasks from intermediate states of these environments: craft the prompt, define what "solved" means, and ensure the task is solvable by an AI agent
- Write tests that verify agent solutions: accept all valid approaches and reject incorrect ones, neither too strict nor too lenient
- Iterate on tasks and tests based on QA feedback: review agent solutions, analyze failures, and refine until the evaluation is fair and robust

What this is NOT

- Not data labeling
- Not prompt engineering
- Not writing code from scratch: the agent writes most of the code; you guide and evaluate

What we look for

You must meet all of the following requirements to be considered for this project:

- 5+ years of professional experience with Python.
- Strong experience with FastAPI, pytest, and async/await.
- Hands-on experience with Docker, PostgreSQL, and CI/CD pipelines.
- Proven experience writing and maintaining automated tests (not just executing them).
- Full-stack experience with React and TypeScript is a plus.
- English proficiency at B2 level or higher.
- Availability to work 30+ hours per week.

Why this is hard

Frontier models are already good at coding. Creating a task that genuinely challenges the best models is non-trivial. You need to deeply understand where models fail and what scenarios reveal the difference between a good and a bad solution. Tasks have many valid solutions - writing tests that accept all correct solutions and reject incorrect ones is harder than it sounds.
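To make the last point concrete, here is a minimal sketch of a test that checks behavior rather than one specific output. It assumes a hypothetical task ("return the k most frequent words, in any order"). The names `check_top_k`, `solution_a`, `solution_b`, and `wrong_solution` are invented for illustration and are not taken from the project.

```python
from collections import Counter


def check_top_k(words, k, result):
    """Accept any valid top-k answer; ties may be broken in any order."""
    counts = Counter(words)
    # The answer must contain exactly k distinct words (or all of them if fewer exist).
    if len(result) != min(k, len(counts)):
        return False
    if len(set(result)) != len(result):
        return False
    # Every returned word must actually occur in the input.
    if any(w not in counts for w in result):
        return False
    # Every chosen word must be at least as frequent as every word left out.
    chosen_min = min((counts[w] for w in result), default=float("inf"))
    left_out_max = max((c for w, c in counts.items() if w not in result), default=0)
    return chosen_min >= left_out_max


def solution_a(words, k):
    # One valid approach: rely on Counter.most_common.
    return [w for w, _ in Counter(words).most_common(k)]


def solution_b(words, k):
    # Another valid approach that breaks ties differently (reverse alphabetical).
    counts = Counter(words)
    return sorted(counts, key=lambda w: (counts[w], w), reverse=True)[:k]


def wrong_solution(words, k):
    # Incorrect: returns the first k distinct words, ignoring frequency.
    return list(dict.fromkeys(words))[:k]


words = ["a", "b", "b", "c", "c"]
for solve in (solution_a, solution_b, wrong_solution):
    print(solve.__name__, check_top_k(words, 1, solve(words, 1)))
```

A stricter test such as `assert result == ["b"]` would reject `solution_b` even though its answer is equally valid, while a check on length alone would accept `wrong_solution`. The property-based check sits between those two extremes.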

How it works

Apply → Pass qualification(s) → Join a project → Complete tasks → Get paid

Hiring & Onboarding Process

The process is designed to move quickly and typically includes the following steps:

1. Application review and invitation to a virtual project introduction session (approximately 30 minutes)
2. Platform registration and identity verification
3. Technical assessment (approximately 35 minutes)
4. Background check (completed at no cost to candidates)
5. Onboarding and project-specific training tasks
6. Begin production work!

Additional Requirements

- Willingness to complete identity verification as part of the onboarding process.
- Ability to complete a technical assessment.
- Willingness to join and participate in Discord, which will be used for project communication and updates.
- Successful completion of a background check is required prior to onboarding.
- Reliable internet connection and ability to communicate effectively in a remote environment.


