Location: New York, NY, US
Job Summary:
Job Duties and Scope:
- Design and deploy scalable machine learning models and infrastructure.
- Ensure fault-tolerance and high availability of systems.
- Optimize system performance and debug production issues.
- Work within cloud environments, using Infrastructure as Code (IaC) and Kubernetes for deployments.
- Develop software architecture for machine learning systems (inference, evaluation, experimentation).
Required Skills:
- Strong proficiency in Python and Kubernetes.
- Expertise in designing scalable and fault-tolerant systems.
- Proficiency in optimizing system performance and security.
- Deep understanding of operating system concepts (multithreading, memory management).
Required Experience:
- 5+ years of experience in ML model deployment and scaling.
- Experience with distributed systems and handling inference at scale.
- Relevant Bachelor's or Master's degree in Computer Science/Engineering, Statistics, or Mathematics.
Job URLs: