M
Network Reliability Engineer
MARGO
About This Role
#HPC #AI #GPU #CLUSTERS
YOUR DAILY ROUTINE
• Build a large AI infrastructure with monitoring, diagnosis, and remediation of production incidents- Troubleshoot high-impact production issues in collaboration with other engineering teams
• Participate in an on-call rotation to handle incidents and ensure service continuity
• Implement and maintain observability solutions to monitor AI infrastructure and application health
• Contribute to AI infrastructure lifecycle management across different environments and countries
• Promote and apply best practices in terms of stability, resiliency, scalability, and security
• Maintain clear technical documentation for tools and procedures
• Contribute to system and tool evolution based on production feedback
• Collaborate closely with development teams to ensure infrastructure readiness- Participate in team rituals and knowledge-sharing initiatives
ABOUT YOU
SOFTSKILLS :
• Proactive and solution-oriented mindset
• Passion for automation and continuous improvement
• Strong collaboration and communication skills
• Ability to work independently and in a team
• Willingness to mentor and share knowledge
HARDSKILLS :
• Experience with Go or Python
• Strong scripting skills (Bash, Python)
• Hands-on experience with Linux systems (Ubuntu/Debian)
• Preferred hands-on experience with GPU & HPC infrastructure
• Knowledge of networking (VLAN/LAN, TCP/IP, DNS, BGP, load-balancing, IPv6, etc.)
• Familiarity with monitoring and logging tools (Prometheus, Grafana, Elastic, etc.)
• Comfortable with Infrastructure-as-Code (Ansible, Salt, AWX, etc.)
• Experience managing relational databases (MariaDB)
• Understanding of CI/CD pipelines (GitLab)
• Comfortable with English (written and spoken)
Originally posted on Himalayas
Ready to Apply?
Click the button below to visit the company's application page.
Apply for this Position