Location: San Jose, CA, US
Job Summary:
Job Duties and Scope:
- Lead the design and implementation of distributed machine learning infrastructure for various ranking models.
- Oversee development of monitoring tools to ensure reliability and scalability of ML infrastructure.
- Identify and prioritize system inefficiencies; enhance system performance.
- Analyze bottlenecks and instabilities; implement effective solutions.
- Collaborate with product teams to deliver tailored solutions.
Required Skills:
- Team leadership and engineering management.
- Development and deployment of large-scale ML systems.
- Strong communication and teamwork.
- Problem-solving in complex systems.
- Proficiency in big data frameworks (e.g., Spark, Hadoop).
Required Experience:
- Experience leading an engineering team.
- Contributions to open-source ML frameworks (e.g., TensorFlow, PyTorch).
- Experience optimizing parameter server systems.
- Background in areas such as HPC or ML hardware acceleration.
- Familiarity with resource management in distributed systems.