JOB OPENING
ML Ops Engineer
Supporting Country:
Location:
Vacancy ID:
Job Description
Responsibilities
You will:
- Work with engineers, researchers and data scientists to build the next generation of Tractable’s ML & data platform
- Help identify and realise capabilities in our ML & data platform that massively speed up getting research to production across dataset & model management, model training, model serving, labelling, data & ML pipeline orchestration and more
- Support Research and Product Engineers with tools and processes to enable a seamless data flywheel
- Deploy and continuously develop robust infrastructure, using best practices for managing infrastructure-as-code
- Solve cost and performance scalability challenges in both model training and model serving
- Run, monitor and maintain business-critical production systems
- Adopt open-source technologies to best leverage our in-house resources
- Promote engineering best practices throughout the team
- Suggest, collect and synthesise requirements to create an effective feature roadmap
Description
The ML Foundations team focuses on building tools and services for our internal customers within Tractable: research, product, engineering and operations specialists.
We have 3 teams that tackle different aspects of this space: ML Applications, Data Operations and ML Infrastructure. You'll collaborate with peer teams to enhance, build and maintain the ML infrastructure stack.
We are looking for a Senior ML Ops Engineer to build and support systems that enable the core mission of Tractable - to make applied AI possible - by optimising the end-to-end Machine Learning life cycle. The vision of the ML Infrastructure team is to enable researchers to spend 80%+ of their time solving tricky ML problems rather than dealing with engineering/infra/ops challenges.
You will help mature our ML and data platform to a world-class state. You will influence the scope and technical direction as well as champion best practices within the team. You have a relentless focus on user experience (for researchers, data scientists and product engineers), and you care deeply about what your team is building to make sure it will have the biggest impact on your users. You will be a strong mentor, nurturing an encouraging and supportive environment to enable the team to do their best work.
Education and Experience
Skills and Behaviours
- Main Infrastructure: AWS (EC2, S3, MSK, Lambda, Step Functions, Glue, IAM, Cognito, Systems Manager, CloudWatch, SQS, Route 53, SageMaker), Apache Kafka (AWS MSK), Kubernetes, Datadog (Metrics, Logs, Synthetics), PagerDuty, Loki, Elasticsearch
- Main CI/CD: Terraform, Docker, Harness
- Main Databases: Postgres / RDS, Redis, DynamoDB
- Main Languages: Python, Node + TypeScript, SQL (Postgres)
- Main Data stack: AWS MSK, AWS Lambda, AWS Redshift, dbt, Airflow, Airbyte, AWS Glue
- Main ML stack: Triton, TFServing, KServe, AWS SageMaker, AWS Lambda, AWS MSK, sync/async APIs, Weights & Biases, TensorFlow, PyTorch, DVC, Dagster/Flyte, Streamlit
We encourage you to drop us a line even if you don’t tick every box above. That’s a lot of different areas of responsibility! We will help you pick them up, because we believe that great people come from all walks of life.