We are a leading global software company dedicated to the world of computer-aided design, 3D modeling, and simulation, helping innovative global manufacturers design better products, faster!
With the resources of a large company and the energy of a software start-up, we have fun together while creating a world-class software portfolio.
Our culture encourages creativity, welcomes fresh thinking, and focuses on growth, so our people, our business, and our customers can achieve their full potential.

The DISW SRE organization is dedicated to enhancing service and application availability, optimizing processes by automating manual and repetitive tasks, and addressing complex technical challenges in a dynamic, collaborative, inclusive, and iterative environment.
This position plays a crucial role in developing automated solutions and processes that support and sustain best-in-class cloud-based applications.

Position Overview

The candidate will support the Siemens Xcelerator platform and will be responsible for coordinating major incident response, maintaining stakeholder communication during service-impacting events, and facilitating resolution in compliance with the applicable service level agreement (SLA).
Strong communication and coordination skills are necessary to support core objectives.
This role's success will be defined by product teams within DISW business units meeting their SLAs.

Responsibilities/Tasks

Incident management: Act as the primary point of contact and leader during major incidents, coordinating the response, communication, and resolution efforts across all involved teams.
Incident response: Quickly assess the severity of incidents, determine the impact, and drive the appropriate response to restore services as quickly as possible.
Communication: Ensure clear, concise, and timely communication with stakeholders, including technical teams, management, and customers, throughout the incident lifecycle.
Post-incident analysis: Lead post-incident reviews to identify root causes, drive improvements, and implement preventive measures to reduce the likelihood of recurrence.
Collaboration: Work closely with SRE, DevOps, development, and other relevant teams to ensure that incident management processes are well-defined and continuously improved.
Training & preparedness: Conduct regular incident response drills, train teams on incident management processes, and ensure readiness for handling high-severity incidents.
Documentation: Maintain and update incident management documentation, ensuring that all procedures are up-to-date and accessible to all relevant teams.
Monitoring & alerts: Collaborate with SRE and monitoring teams to define and refine alerting criteria, ensuring that incidents are detected and escalated promptly.
Continuous improvement: Identify opportunities to improve system reliability, scalability, and performance based on lessons learned from incidents.