What is our primary use case?
As part of a cloud operations team, I was the user who set up the alerts. Important incidents or anomalies that needed immediate attention were detected by our APM tools and routed through their integration with PagerDuty Operations Cloud. The team supported the platform in rotational shifts. My role involved assigning the on-call person according to the shift roster, so that whenever an incident triggered, that person would get the call. The primary use was supporting production operations and cloud activities.
Our multi-environment setup consists of AWS infrastructure, Linux servers, Kubernetes clusters, and customer-facing applications. PagerDuty Operations Cloud was mainly used for incident management and alerting. We integrated it with AppDynamics, Instana, and CloudWatch, which monitored the platform and its patterns. PagerDuty Operations Cloud then generated the critical alerts and immediately notified the support team working the current shift. The platform helped us manage production incidents beyond service outages: mostly high CPU utilization, application failures, pod issues in Kubernetes, and infrastructure-related alerts. We configured alerts for all of these and made sure they were routed to the correct on-call person, which helped us reduce response time in critical situations.
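To illustrate the idea of routing an alert to whoever is on shift, here is a minimal sketch of a roster lookup. It is not how PagerDuty implements schedules; the roster entries, shift hours, and the `on_call_at` helper are hypothetical and exist only to show the logic, including a shift that crosses midnight.

```python
from datetime import time

# Hypothetical shift roster: (start, end, engineer). The names and hours are
# illustrative assumptions, not details from our actual setup.
ROSTER = [
    (time(6, 0), time(14, 0), "engineer-morning"),
    (time(14, 0), time(22, 0), "engineer-evening"),
    (time(22, 0), time(6, 0), "engineer-night"),  # wraps past midnight
]


def on_call_at(now, roster=ROSTER):
    """Return the engineer whose shift covers the given time of day."""
    for start, end, engineer in roster:
        if start <= end:
            covered = start <= now < end
        else:
            # The shift crosses midnight, so it covers late evening
            # and early morning.
            covered = now >= start or now < end
        if covered:
            return engineer
    raise LookupError("no shift covers this time")
```

For example, `on_call_at(time(23, 30))` picks the night-shift entry, because that shift starts at 22:00 and runs past midnight.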
What is most valuable?
One of the best features of PagerDuty Operations Cloud is its support for on-call rotation scheduling and escalation management. If an engineer did not acknowledge an alert within a defined time frame, the incident was automatically escalated to the next person, the support team, or the manager of that team. Another useful feature was its integration capability. We were able to integrate PagerDuty Operations Cloud with monitoring and observability tools so that alerts were generated automatically, almost immediately after an issue was detected in the environment. The mobile application was also very helpful because engineers could receive calls and notifications, acknowledge the incident, and track updates even when they were away from their laptop.
I also valued the centralized incident management dashboard, which provided visibility into active incidents, response status, escalation history, and overall operational health. All of that data was consolidated for me in the dashboard.
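The escalation behavior described above can be sketched in a few lines. This is a simplified model, assuming one fixed acknowledgement timeout per level. The level names, the 15-minute timeout, and the `notified_level` function are hypothetical and are not PagerDuty's API.

```python
# Hypothetical escalation chain, ordered from first responder to last resort.
LEVELS = ["on-call engineer", "support team", "team manager"]


def notified_level(minutes_unacknowledged, levels, timeout_minutes):
    """Return who holds the incident after it has gone unacknowledged.

    Each level gets `timeout_minutes` to acknowledge before the incident
    moves to the next level. The last level keeps it indefinitely.
    """
    index = min(minutes_unacknowledged // timeout_minutes, len(levels) - 1)
    return levels[index]
```

With a 15-minute timeout, an incident that has been unacknowledged for 20 minutes sits with the second level, and anything beyond 30 minutes stays with the last level in the chain.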
What needs improvement?
My experience with PagerDuty Operations Cloud has been positive overall. One area where I believe improvement can be made is reporting and dashboard customization to make it more user-friendly. The operations team often requires different views compared to the management team. Having more flexibility in generating custom reports would be helpful. Another improvement could be providing more advanced AI-driven collaboration capabilities to reduce unnecessary noise alerts and help the team focus on the most critical issues. Apart from these areas, the platform is very reliable and effective for managing production incidents and on-call operations.
For how long have I used the solution?
I have been using PagerDuty Operations Cloud for about five to six years.
What do I think about the stability of the solution?
PagerDuty Operations Cloud has been stable and has performed well wherever we configured it for incident management or alerting in production support, where timely notifications and incident responses are critical. It delivered alerts immediately through the multiple channels we configured, including mobile notifications, email, SMS, and phone calls. Because it was integrated with our monitoring and observability tools, it helped ensure that critical incidents were captured and routed to the appropriate on-call team. During my usage, I did not encounter any significant PagerDuty Operations Cloud outages or stability issues that impacted our operations.
What do I think about the scalability of the solution?
PagerDuty Operations Cloud is highly scalable and works well in both small and large environments. The project I worked on integrated it with multiple application servers and cloud resources for monitoring. PagerDuty Operations Cloud handled the alerts from all of these resources and routed them to the appropriate teams. As the infrastructure grows, new services, escalation policies, schedules, and teams can be added easily without major changes to the existing setup. This makes it suitable for organizations that manage large cloud infrastructure and support multiple teams.
Which solution did I use previously and why did I switch?
PagerDuty Operations Cloud was already implemented when I joined this project, and the SOPs and testing were already in progress. By the time I was fully onboarded a few days later, many of the alerts had been configured in it. I did not get the chance to work with other tools besides PagerDuty Operations Cloud.
How was the initial setup?
When I joined the project during the initial setup of PagerDuty Operations Cloud, I received a Jira ticket listing a few servers where I needed to install PagerDuty agents so that they could trigger alerts or integrate with the server. I was mostly involved in the configuration part.
The setup was straightforward, and PagerDuty Operations Cloud also helped us in this process. For the most part it was not integrated directly on the individual servers. Instead, we integrated our monitoring and observability tools with PagerDuty Operations Cloud. The servers and applications were monitored through tools such as Instana, Zabbix, and Splunk. Whenever critical alerts were generated, they were automatically forwarded to PagerDuty Operations Cloud through the integrations we had configured. PagerDuty Operations Cloud would then notify the on-call engineers and follow the relevant escalation policy if an alert was not acknowledged within a specific time. On the AWS side, our flow was: EC2 instances and other AWS servers were watched by CloudWatch alarms, and when an alarm triggered, it was sent through SNS (Amazon Simple Notification Service) to PagerDuty Operations Cloud, which then notified the on-call engineer.
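To make the CloudWatch to SNS to PagerDuty flow more concrete, here is a rough sketch of how a CloudWatch alarm notification could be translated by hand into a PagerDuty Events API v2 event body. In our setup the configured integration did this for us, so no custom code was needed. The `to_pagerduty_event` function, the fixed `critical` severity, and the use of the alarm name as the deduplication key are my own assumptions for illustration. The sketch only builds the event body and does not send anything.

```python
import json

# Map CloudWatch alarm states to Events API v2 actions. Other states,
# such as INSUFFICIENT_DATA, are ignored in this sketch.
STATE_TO_ACTION = {"ALARM": "trigger", "OK": "resolve"}


def to_pagerduty_event(sns_message, routing_key):
    """Build an Events API v2 body from a CloudWatch alarm SNS message.

    Returns None when the alarm state should not open or close an incident.
    """
    alarm = json.loads(sns_message)
    action = STATE_TO_ACTION.get(alarm["NewStateValue"])
    if action is None:
        return None
    return {
        "routing_key": routing_key,  # identifies the PagerDuty integration
        "event_action": action,
        # Assumption: reuse the alarm name so repeats group into one incident.
        "dedup_key": alarm["AlarmName"],
        "payload": {
            "summary": f'{alarm["AlarmName"]}: {alarm["NewStateReason"]}',
            "source": alarm.get("Region", "unknown"),
            "severity": "critical",  # assumption: every alarm pages as critical
        },
    }
```

Keying the event on the alarm name means that a later `OK` notification for the same alarm resolves the incident that the `ALARM` notification opened.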
What about the implementation team?
We followed the documentation provided by PagerDuty Operations Cloud for the configuration part.
The documentation is thorough, with clear details on how to configure each integration with an application monitoring tool and the steps to follow. For server integrations, it specifies the type of server, Windows or Linux, and provides the matching documents. It is comprehensive and easy to understand, to the point that even a layperson could do the configuration by following it.
What other advice do I have?
We are mostly not focused on using PagerDuty's autonomous AI agents because we work on cloud infrastructure where we do the deployments, and we have not implemented AI in our cloud to that extent. Going forward, if our infrastructure becomes AI-based, we will definitely explore where PagerDuty Operations Cloud can help.
As of now, we do not use the generative AI capabilities of PagerDuty Operations Cloud. Our infrastructure is huge, and there is a dedicated developer team working on AI-related things. They are still running two POCs, which are being evaluated. Only if the results look good can we roll this out into production, because my application is customer-facing and we do not want anything to go wrong, such as an alert triggering unnecessarily or an AI-handled alert failing to notify us. That would ultimately cause us to miss our SLAs and SLOs, and all the escalation matrices would come into the picture. That is why we are still in POCs, as it is critical.
That part is taken care of by a different team or mostly the clients themselves. My main role is to keep the environment always up and running, and all alerts should be properly centralized and customized accordingly.
PagerDuty Operations Cloud is basically where we receive the alerts. It integrates with Slack and reaches the on-call rotation by cell phone. Prior to this, we relied mostly on application monitoring tools, emails, and Slack notifications. If the on-call person was not at their desk when an alert triggered and no one was there to acknowledge it, look into it, and take the necessary action, there would ultimately be customer impact. That is why we implemented PagerDuty Operations Cloud. Even if the on-call person is not near their laptop, they get the call and can immediately acknowledge it and report to the team that we have received a P1 call for a specific environment or that the alert concerns a production issue. Another team member then takes action immediately, so nothing is missed.
I did not encounter any issues that required contacting support for PagerDuty Operations Cloud. Overall, I would rate it 9 out of 10.
Which deployment model are you using for this solution?
Public Cloud
If public cloud, private cloud, or hybrid cloud, which cloud provider do you use?
Amazon Web Services (AWS)
Disclosure: My company does not have a business relationship with this vendor other than being a customer.