Location: Seattle, WA, US
Job Duties:
- Lead the design and implementation of distributed infrastructure for machine learning models.
- Oversee development of monitoring and management tools for reliability and scalability.
- Identify and address system inefficiencies and performance bottlenecks.
- Create tools for analyzing system instability and formulate effective solutions.
- Collaborate with product teams to deliver solutions tailored to their needs.
Required Skills:
- Engineering leadership experience.
- Proficiency with large-scale machine learning systems.
- Strong communication and teamwork abilities.
- Knowledge of open-source machine learning frameworks (e.g., TensorFlow, PyTorch).
- Familiarity with big data frameworks (e.g., Spark, Hadoop) and system optimization.
Required Experience:
- Experience leading an engineering team.
- Track record of developing and deploying machine learning systems.
- Hands-on involvement in parameter server or search system optimization.
- Background in Hardware-Software Co-Design, High Performance Computing, or ML Hardware Acceleration.
Job URLs: