SRE Incident Manager in Mountain View, CA

Employer: Atlassian

Location: Mountain View, CA

Paid Relocation

Job Description

Can you spot a well-run major incident versus a badly-run one? Do you know the factors that make a difference?

Do you know the value of a good post-incident review with a true root cause?

Do you want to guide other teams to improve their incidents, post-incident reviews, and other reliability-related processes?

If "yes", then we have a role you might like. An Atlassian Technical Program Manager in Incident Management:

  • Defines, owns and delivers process and tooling that supports reliability, including incident management and post-incident review
  • Guides and trains teams in these processes so they get maximum benefit with minimal friction
  • Acts as an advocate for these processes, running meetups, "War Games" sessions, and blogging & presenting internally & externally
  • Measures the business-relevant outcomes of the processes you own, including incident rate and time-to-recovery

More about you

We expect you to have:

  • Experience managing major incidents and post-incident reviews at comparable technology organisations
  • A track record of measurably improving reliability results across teams through your own initiatives
  • Experience working with other engineering leaders as a trusted reliability subject matter expert
  • Sufficient technical nous to understand complex, high-scale information systems
  • Examples of your data-driven approach to process measurement and improvement
  • Demonstrated program leadership and accountability generation abilities
  • Some ability to write code and implement automation that supports reliability
  • A positive and enthusiastic attitude

And although not required it would be good if you had:

  • Project and program management experience, including goal-setting & measurement, and stakeholder management
  • Experience and desire to present your expertise to large groups, e.g. at all-hands meetings and conferences
  • Experience developing and delivering training in an organization comparable to Atlassian
  • More extensive current or former experience in software development or release management
  • Formal training or qualification in incident management or post-incident reviews
  • Experience and skill in data analysis and reporting (e.g. SQL, ETL systems, and data visualisation)

This role is a combination of technical program management (≈40%), ownership of process & tooling (≈30%), analysis and reporting (≈20%), and actual incident management (≈10%). As such it requires a blend of business and technical skills and experience.

  • Program Management around incident management (IM) and post-incident reviews (PIRs)
      • Goal-setting and measurement
      • Accountability generation in a matrix organisation
      • Stakeholder management and communication, i.e. "People skills"
  • Ownership of IM & PIR process and tooling
      • Continuous process measurement and improvement
      • Internal advocacy for process excellence across many disparate teams
      • Development and delivery of IM/PIR training across the company
      • Developing supporting tooling and automation
  • Analysis and Reporting
      • Analysis across groups to draw valid conclusions about the drivers of reliability
      • Regular and ad-hoc reporting to key stakeholders
      • Domain model, batch job, and report creation and maintenance
  • Hands-on Incident Management and PIR
      • Manage major incidents in an on-call roster as part of our global major incident management team
      • Lead incident teams to resolve major incidents quickly and effectively
      • Drive post-incident reviews to turn failures into resilience

More about our team

Atlassian Site Reliability Engineering is a rapidly growing group within the organization. We are building our teams, tools and systems as part of the company's mission to build the best SaaS services in the world. This is an exciting team to join because SRE is involved with every technical team across Atlassian.

SRE works side by side with the product family and platform developers to maintain and improve services and performance, and to provide real-time feedback on production systems. We live the company values with a strong customer focus and a healthy sense of urgency. As a heavily data-driven team, we use a variety of data collection, analytics and visualisations to learn about our complex systems. We also live the 'Play, as a team' value by having a strong focus on sharing learning experiences with engineering teams outside of SRE.

SRE Team members have a range of options to suit individual styles: If you like mastering a domain and going deep, we need you. If you can juggle three tasks and coordinate multiple people in the heat of an incident, we need you. If you love the benefits of process and methodical improvement, you will love it here. If you want to keep your head down, headphones on and bash out code to support the team, we have a spot for you too.

More about our benefits

Our offices are open, highly collaborative and yes, fun! To support you at work (and play) we offer some fantastic perks: ample time off to relax and recharge, five paid volunteer days a year for your favorite cause, plenty of food and beverages, ergonomic workstations with sit/stand desks, unique ShipIt days, a company-paid trip after five years, generous employer-paid insurance coverage (medical, dental, and vision) for you and your family, 401k matching and more.

More about Atlassian

Software is changing the world, and we’re at the center of it all. With a customer list that reads like a who's who in tech and a highly disruptive business model, we’re advancing the art of team collaboration with products like Jira, Confluence, Bitbucket, Trello, and now Stride. Driven by honest values, an amazing culture, and consistent revenue growth, we’re out to unleash the potential of every team. From Amsterdam and Austin to Sydney and San Francisco, we’re looking for people who are powered by passion and eager to do the best work of their lives in a highly autonomous yet collaborative, no B.S. environment.

Additional Information

We believe that the unique contributions of all Atlassians are the driver of our success. To make sure that our products and culture continue to incorporate everyone's perspectives and experience, we never discriminate on the basis of race, religion, national origin, gender identity or expression, sexual orientation, age, or marital, veteran, or disability status.

All your information will be kept confidential according to EEO guidelines.

Required Skills:

sql web-services automation
