Senior Site Reliability Engineer at NVIDIA

Location: Santa Clara, CA

Job Duties and Scope:
- Develop frameworks/scripts for automating workflows in a private cloud with NVIDIA GPUs.
- Focus on stabilizing virtualization infrastructure (ESXi, KVM, Hyper-V).
- Deploy and maintain machines using provisioning and configuration management tools (Chef, Ansible, Terraform).
- Create monitoring systems for infrastructure subsystems (Zabbix, Grafana).
- Participate in on-call support and troubleshoot complex infrastructure issues.

Required Skills:
- Proficiency in Python, Go, and Unix shell scripting; knowledge of Java and C.
- Familiarity with maintaining Linux and Windows hosts.
- Experience with version control systems (Perforce, Git).

Required Experience:
- Bachelor's/Master's in Computer Science or equivalent.
- 6+ years of experience with large-scale enterprise production systems.
- Proven ability to debug and analyze infrastructure-related issues.