Network Reliability Engineer

MARGO

Apply Now
Poland
PLN 200 - PLN 250 / year
contract
mid
Posted July 4, 2026
via himalayas

About This Role

#HPC #AI #GPU #CLUSTERS YOUR DAILY ROUTINE • Build a large AI infrastructure with monitoring, diagnosis, and remediation of production incidents- Troubleshoot high-impact production issues in collaboration with other engineering teams • Participate in an on-call rotation to handle incidents and ensure service continuity • Implement and maintain observability solutions to monitor AI infrastructure and application health • Contribute to AI infrastructure lifecycle management across different environments and countries • Promote and apply best practices in terms of stability, resiliency, scalability, and security • Maintain clear technical documentation for tools and procedures • Contribute to system and tool evolution based on production feedback • Collaborate closely with development teams to ensure infrastructure readiness- Participate in team rituals and knowledge-sharing initiatives ABOUT YOU SOFTSKILLS : • Proactive and solution-oriented mindset • Passion for automation and continuous improvement • Strong collaboration and communication skills • Ability to work independently and in a team • Willingness to mentor and share knowledge HARDSKILLS : • Experience with Go or Python • Strong scripting skills (Bash, Python) • Hands-on experience with Linux systems (Ubuntu/Debian) • Preferred hands-on experience with GPU & HPC infrastructure • Knowledge of networking (VLAN/LAN, TCP/IP, DNS, BGP, load-balancing, IPv6, etc.) • Familiarity with monitoring and logging tools (Prometheus, Grafana, Elastic, etc.) • Comfortable with Infrastructure-as-Code (Ansible, Salt, AWX, etc.) • Experience managing relational databases (MariaDB) • Understanding of CI/CD pipelines (GitLab) • Comfortable with English (written and spoken) Originally posted on Himalayas

Ready to Apply?

Click the button below to visit the company's application page.

Apply for this Position