Senior MLOps Engineer (GPU Platform)
We're hiring a Senior MLOps Engineer to own the reliability and scale of our GPU compute platform. Vecura runs 300+ scientific AI tools (protein structure prediction, molecular dynamics, docking, and more) across a wide range of GPU types, in both serverless and self-hosted deployments spanning cloud and on-prem. You'll own the platform layer between infrastructure and models: how GPU jobs are scheduled, queued, isolated, observed, and recovered. You think in SLOs and design for repeatability, building standard systems that scale across hundreds of models rather than one-off deployments. You'll work alongside our DevOps engineer (infra/cluster) and our AI engineers (model onboarding), owning the orchestration and reliability surface that connects them.
Responsibilities
- Own the success-run ratio of GPU workloads as a measurable SLO; drive it up and keep it there.
- Build and operate the GPU job scheduling and queueing layer — fair-share allocation, prioritization, backpressure, and recovery across a heterogeneous fleet.
- Implement GPU partitioning and sharing (MIG, MPS, time-slicing) to raise utilization without destabilizing runs.
- Profile and right-size workloads: per-model GPU memory, runtime, and failure characteristics; eliminate OOMs and silent failures.
- Define a standard packaging/deployment contract for new models so onboarding is repeatable, not bespoke.
- Build observability for the run lifecycle — metrics, logs, traces, alerting — so failures are caught and diagnosed fast.
- Harden the orchestration stack (workflow engine, durable execution, retries/failover) against real failure modes.
- Partner with the DevOps engineer on cluster/networking and with AI engineers to make their models production-ready.
Qualifications
Must have:
- 5+ years in MLOps / ML platform / GPU systems engineering, with direct ownership of production reliability.
- Deep experience operating GPU workloads at scale (NVIDIA stack: CUDA, drivers, GPU Operator, MIG/MPS).
- Strong background in workload orchestration and scheduling — Kubernetes (Jobs/batch), Ray, Slurm, or equivalent.
- Hands-on managed-ML platform experience on at least one major cloud, with working familiarity with the other:
  - GCP: Cloud Run, Vertex AI
  - AWS: SageMaker
- Solid understanding of cloud architecture (compute, networking, storage, IAM) across hybrid cloud + on-prem.
- Proven track record raising reliability/utilization of a heterogeneous GPU fleet.
- Solid software engineering skills (Python and one systems language): you build platform tooling, not just configure it.
- Observability and SRE fundamentals: SLOs, metrics, tracing, incident response.
Benefits
We provide a dynamic, fast-paced, and collaborative environment where problem-solving and agility are at the heart of what we do. Along with a competitive salary, we foster a culture that values ambition, confidence, and humility, consistently pushing the boundaries of innovation. If you're excited about working in a young, talented tech company and want to explore the world of AI and pharmaceuticals, we encourage you to apply.
- Competitive salary (negotiable based on experience)
- Workplace: No. 45-57, Tran Xuan Soan, Hai Ba Trung, Ha Noi (Monday to Friday, 9:00-17:00)
- Build a professional network through collaborations with pharmaceutical companies, industry leaders, and academic experts.
- Work on impactful projects that address critical challenges in drug discovery and healthcare.
- Employees are entitled to 2 work-from-home days per month, along with daily lunch provided by the company.
- Holiday & Tet bonuses; performance-based bonus
- Social insurance contribution on full salary
How to Apply
If you think we're a good match, send your CV to:
Email: office@nyb.group
Subject: [NYB] Senior MLOps Engineer_Your name
We’ll get in touch to let you know what the next steps are.
Contact office@nyb.group for more information.