I have worked as a DevOps and site reliability engineer for over seven years, and my primary experience is managing the reliability of infrastructure platforms hosted in multi-cloud and on-premises environments. I have predominantly worked with systems hosted on AWS, setting up infrastructure, CI/CD, observability, and the complete release process, where I use PagerDuty Operations Cloud for triage and other SRE operations. I have been using PagerDuty Operations Cloud for over four years in multiple ways. One is through enterprise services via a subscription model; I also used it from AWS in a project at Intel for approximately one to one and a half years. Since then, I have been using it as a subscription at IBM. One of my main use cases for PagerDuty Operations Cloud is running the operations center, particularly incident resolution and triage, as part of the core platform engineering team within a central IBM cloud where various IBM cloud services are hosted. To ensure continuous reliability, incidents are created automatically in PagerDuty Operations Cloud, and incident management automation is heavily used in the project. Previously, at Intel, I integrated PagerDuty Operations Cloud with default AWS services to create incidents for the AWS services in the host infrastructure. Currently, I am creating incident workflows within IBM internal cloud operations to ensure an effective incident management process, using integrations with different LLMs along with the agentic SRE tasks that have arisen in the project. Since I am part of a larger platform engineering and SRE operations team, my team handles many incidents and services.
I handle over 26 IBM cloud core services hosted on our internal platform, which have had many incidents related to service downtime, reliability issues, and update issues. A dedicated SRE team handles end-to-end incident management, and we wanted to automate the process because we receive hundreds of incidents per day, and up to thousands during critical release times for different services. The manual on-call process has therefore been automated using PagerDuty Operations Cloud.
We receive a notification if there are any failed jobs or operations. We have some Bamboo agents working, so if one of the jobs fails on one of these servers, PagerDuty Operations Cloud creates an incident and notifies us. We use PagerDuty Operations Cloud for monitoring purposes, and it works great for our current needs.
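As a rough illustration of the failed-job flow described above, the sketch below builds a "trigger" event in the shape PagerDuty's Events API v2 accepts. The reviewer does not say how their Bamboo setup is wired, so the job name, agent host, dedup key format, and the `build_trigger_event` and `send_event` helpers are all assumptions made for this example.

```python
import json
import urllib.request

# Public Events API v2 endpoint; events are POSTed here as JSON.
EVENTS_API_URL = "https://events.pagerduty.com/v2/enqueue"


def build_trigger_event(routing_key, job_name, agent_host, severity="error"):
    """Build a 'trigger' event for a failed CI job (hypothetical helper)."""
    return {
        "routing_key": routing_key,  # integration key of the target service
        "event_action": "trigger",
        # Reusing the same key for repeated failures of the same job on the
        # same agent groups them into one alert instead of paging again.
        # The key format is an assumption for this example.
        "dedup_key": f"ci-failure/{agent_host}/{job_name}",
        "payload": {
            "summary": f"Job '{job_name}' failed on {agent_host}",
            "source": agent_host,
            "severity": severity,  # critical, error, warning, or info
        },
    }


def send_event(event, timeout=10):
    """POST the event to PagerDuty (hypothetical helper, not called here)."""
    request = urllib.request.Request(
        EVENTS_API_URL,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Placeholder key and names; replace with real values before sending.
event = build_trigger_event("YOUR_INTEGRATION_KEY", "nightly-build", "bamboo-agent-01")
body = json.dumps(event)
```

In a setup like this, a post-build step on the agent could call `send_event(event)` only when the job exits with a failure status.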
My main use case for PagerDuty Operations Cloud is monitoring and on-call management for downtime. Last week, we had a service go down, and PagerDuty Operations Cloud alerted us to the issue. One of our on-call engineers responded to the page and quickly resolved the problem through the PagerDuty Operations Cloud app.
My main use case for PagerDuty Operations Cloud is alerting on failures, such as when a server or a particular service is down, or when APIs are not responding due to technical issues. PagerDuty triggers an alert and also calls my personal mobile number, so I can acknowledge that I am looking into the issue and take the necessary actions. One example of PagerDuty Operations Cloud helping us handle an incident was when our payment system was about to go down. We usually monitor the system manually, but there are incidents where an automated system works more efficiently than a human. PagerDuty Operations Cloud identified the issue first by alerting us that something had gone wrong with the servers or services, which enabled us to contact the DevOps and Dev teams to identify the exact issue in our banking app. PagerDuty Operations Cloud is very helpful for monitoring: we can set up multiple alerting methods, such as SMS, email, and phone calls, all of which we commonly use. It has proven useful across various banking services, with teams including Dev, DevOps, and SecOps relying on it heavily.
Initially, I started using it for managing on-call schedules. As the tech stack developed, we began using it for service alerts and event routing, and then transitioned to operational views and dashboarding. It eventually became central to our alerting systems, where all monitoring tools would send information to PagerDuty, enabling event management and routing to whoever was on call.
The primary use case of the solution is to alert the on-call person when there are any critical errors or when the servers are down. It is also used for the on-call scheduling of personnel.
Principal Architect at an energy/utilities company with 10,001+ employees
Real User
Sep 19, 2022
It's mainly for IT call scheduling, emergency contacts, events, and those kinds of things. It's integrated with AWS, MS Teams, Remedy, and other solutions.
Compliance, Security & Testing Manager at a financial services firm with 11-50 employees
Real User
Oct 8, 2020
We are a 24-hour online business. We use it for scheduling our on-call engineers and making sure that there is follow-the-sun or round-the-clock coverage for alerting and network operations. It ingests all our alert paths, i.e., anything that generates an alert of any description, such as Splunk, AWS, and internal applications. We feed all our events into it, and it generates alerts with a description that need a response from an engineer. Another thing is that its built-in scheduling is pretty much hands-off for our on-call engineers unless somebody goes on holiday. That is the only time that we have to jump in there and make any changes.
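The follow-the-sun coverage with occasional holiday changes mentioned above can be sketched as a simple lookup. This is not PagerDuty's scheduling API: the three regional layers, the engineer names, and the `on_call_at` and `coverage_gaps` helpers are hypothetical, chosen only to show the idea of fixed rotations plus a manual override.

```python
from datetime import datetime, timezone

# Hypothetical follow-the-sun layers: (start_hour_utc, end_hour_utc, engineer).
LAYERS = [
    (0, 8, "apac-oncall"),
    (8, 16, "emea-oncall"),
    (16, 24, "amer-oncall"),
]


def coverage_gaps(layers):
    """Return the UTC hours (0-23) that no layer covers."""
    covered = set()
    for start, end, _engineer in layers:
        covered.update(range(start, end))
    return sorted(set(range(24)) - covered)


def on_call_at(moment, layers=LAYERS, overrides=None):
    """Return who is on call at a timezone-aware datetime.

    overrides maps a scheduled engineer to a stand-in, for example while
    the scheduled person is on holiday.
    """
    hour = moment.astimezone(timezone.utc).hour
    for start, end, engineer in layers:
        if start <= hour < end:
            return (overrides or {}).get(engineer, engineer)
    raise ValueError(f"no coverage for hour {hour} UTC")


# Normal day: the schedule runs hands-off.
normal = on_call_at(datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc))

# Holiday: the only manual change is a single override entry.
holiday = on_call_at(
    datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc),
    overrides={"emea-oncall": "emea-backup"},
)
```

Keeping the override separate from the layers mirrors the reviewer's point: the base rotation stays untouched, and only the exception needs editing.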
VP of Engineering at a comms service provider with 201-500 employees
Real User
Jun 25, 2020
We mostly use it for our on-call engineers, for schedules, alerting, and critical alerts. And, of course, we use it for the management of an issue, so that people acknowledge the alerts, reassign them, etc.
Tier 4 Support Team Leader at a comms service provider with 10,001+ employees
Real User
Mar 1, 2020
The most common use case is alerts that we define as critical coming from a monitoring system, like New Relic or Nagios. They are alerts where we need someone to get on a bridge or to start working on them during the night. When such an alert fires, it triggers a PagerDuty alert, which pages the current on-call person scheduled in PagerDuty. The on-call person acknowledges the alert, looks into it to understand what is going on, and updates the status via PagerDuty. The update is sent to all the groups that are part of the PagerDuty schedule until the issue is resolved. We mostly integrate it with other monitoring tools like New Relic or Nagios, or we use its email integration for on-call processes to page people in groups. We also use it for Sev 1 issues that come from alerts from New Relic, Nagios, or other monitoring systems.
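The trigger, acknowledge, update, and resolve flow in the review above can be modeled as a small state machine. The status names match PagerDuty's incident statuses (triggered, acknowledged, resolved), but the `Incident` class, its methods, and the transition table are a simplified sketch written for this example, not PagerDuty's data model.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Simplified transition table (an assumption): an incident only moves forward.
TRANSITIONS = {
    "triggered": {"acknowledged", "resolved"},
    "acknowledged": {"resolved"},
    "resolved": set(),
}


@dataclass
class Incident:
    title: str
    status: str = "triggered"
    assignee: Optional[str] = None
    updates: List[str] = field(default_factory=list)

    def _move(self, new_status):
        # Reject transitions the table does not allow, e.g. acking a resolved incident.
        if new_status not in TRANSITIONS[self.status]:
            raise ValueError(f"cannot go from {self.status} to {new_status}")
        self.status = new_status

    def acknowledge(self, responder):
        """The on-call person confirms they are looking into the alert."""
        self._move("acknowledged")
        self.assignee = responder

    def reassign(self, responder):
        """Hand an open incident to someone else without changing its status."""
        if self.status == "resolved":
            raise ValueError("cannot reassign a resolved incident")
        self.assignee = responder

    def post_update(self, message):
        """Record a status update that subscribed groups would receive."""
        self.updates.append(f"[{self.status}] {message}")

    def resolve(self):
        self._move("resolved")


incident = Incident("Sev 1: checkout error rate above threshold")
incident.acknowledge("alice")
incident.post_update("Investigating database connection pool")
incident.reassign("bob")
incident.resolve()
```

Validating transitions in one place keeps the acknowledge and resolve actions from drifting out of order when several responders act on the same incident.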
The PagerDuty Operations Cloud is the platform for mission-critical, time-critical operations work in the modern enterprise. Through the power of AI and automation, it detects and diagnoses disruptive events, mobilizes the right team members to respond, and streamlines infrastructure and workflows across your digital operations. The Operations Cloud is essential infrastructure for revolutionizing digital operations to compete and win as a modern digital business.
We use the solution for incident management.
The solution is used to alert the on-call users if we have priority-one or business-critical issues.
Our use cases include generating alerts from our site 24/7. We are managing the cloud infrastructure there.
The two major use cases were alerts for events and scheduling of engineers to get pages based on incidents.
We primarily use this solution to track alerts from our cloud environment and monitor and respond to alerts on our cloud platform.
We use PagerDuty for incident management. We're looking at integrating PagerDuty with Rundeck in the future.
Our primary use case for this solution is alerting and mitigating threats in our organization.