Job Description
We are looking for a proactive and highly independent professional to keep our critical systems running stably and reliably outside business hours, working in a rotating shift schedule. The successful candidate will be the primary point of contact for alerts generated by our systems and for managing emerging incidents. They will quickly analyze incoming events, determine the severity of each problem, start troubleshooting immediately using predefined protocols (runbooks), and, if necessary, involve the appropriate expert team (DevOps or software development). This position is crucial to our business continuity, so a responsible attitude and calm, solution-oriented thinking are essential.
Responsibilities
- Continuously monitoring our critical applications and their associated infrastructure using monitoring tools (e.g., Azure Monitor, New Relic, MS Application Insights).
- Receiving automatic alerts, then immediately analyzing and prioritizing issues by severity and business impact.
- Managing incidents throughout their entire lifecycle: detection, diagnosis, escalation, communication, and resolution.
- Following predefined runbooks and troubleshooting protocols for quick and effective resolution.
- Escalating incidents that cannot be resolved immediately to the appropriate on-call DevOps or software development engineer.
- Precisely documenting incidents and the steps taken in the ticketing system (Jira).
- Actively participating in post-mortem analyses to identify root causes and develop future preventive measures.
- Continuously maintaining and improving the knowledge base and runbooks based on experience.
Requirements
- At least 2 years of experience in a similar role (e.g., NOC Engineer, IT Operations, Application Support, SRE).
- Practical experience using monitoring tools.
- Strong analytical and system-level troubleshooting skills.
- Ability to work independently and responsibly, and to make calm decisions under pressure.
- Thorough knowledge of IT infrastructure, networking, and cloud (primarily Azure) technologies.
- Effective and clear verbal and written communication skills in Hungarian and English.
- Proficient use of ticketing systems (e.g., Jira).
- Flexibility and willingness to work in a 24/7 rotating shift schedule.
Nice to Have
- ITIL Foundation certification or knowledge of incident management frameworks.
- Scripting skills (e.g., PowerShell, Bash) for automation tasks.
- Basic knowledge of SQL database management.
- Knowledge of CI/CD processes and tools (e.g., Azure DevOps).
- Experience in a software development environment.