What is our primary use case?
incident.io is our primary incident management solution. We receive a lot of alerts from DataDog, Honeycomb, and Prometheus. These alerts fire against thresholds that our performance engineers have set based on user experience and backend performance. When a metric crosses its threshold for an application or in a particular scenario, we receive an alert, and if that alert has triggered more than three times, an incident is created automatically. Once the incident exists, we check its severity and act accordingly. incident.io also creates a Slack channel for the incident and handles the related setup for us.
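To make the "more than three triggers" rule concrete, here is a minimal Python sketch of that logic. It is only an illustration of the rule as I described it, not incident.io's implementation or API; the class name `AlertDeduper`, the method `record`, and the alert key `checkout-latency` are made up for the example.

```python
from collections import Counter

# From the review: an incident opens once an alert has fired more than three times.
TRIGGER_LIMIT = 3


class AlertDeduper:
    """Hypothetical helper: counts firings per alert and says when to open an incident."""

    def __init__(self, limit=TRIGGER_LIMIT):
        self.limit = limit
        self.counts = Counter()
        self.open_incidents = set()

    def record(self, alert_key):
        """Record one firing; return True only when this firing opens a new incident."""
        self.counts[alert_key] += 1
        over_limit = self.counts[alert_key] > self.limit
        if over_limit and alert_key not in self.open_incidents:
            self.open_incidents.add(alert_key)
            return True
        return False


deduper = AlertDeduper()
results = [deduper.record("checkout-latency") for _ in range(5)]
```

With this sketch, the fourth firing of the same alert is the one that opens the incident, and later firings attach to the incident that is already open rather than creating a new one.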
What is most valuable?
The best features in incident.io include the severity level definitions, which make it very clear whether an incident is a sev 0, sev 1, or sev 2. The on-call workflows are extremely good, and the way you set them up is impressive: it is essentially plug and play. You get a canvas, similar to a whiteboard in other applications, where you drop in components and connect them.
For example, our on-call rotation is structured so that the first alert goes to a junior engineer. For a P2, if there is no response within an hour, it goes to a senior engineer. If another hour passes with no response, it goes to the lead engineer, which is my level. If I don't respond, it goes to my manager. If my manager doesn't respond, it goes to the VP of Engineering.
We have better control over incidents on weekends. For a P0 incident, which is critical and blocks customers, the first responder is a junior engineer with a senior engineer tagged alongside. If they don't respond within 10 minutes of the alert or incident creation, an automatic alert goes to the next engineer up, such as a lead engineer. If I don't respond properly, it goes to both my manager and the VP of Engineering within an hour. That is how it works, and we have to take responsibility.
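The escalation chains above can be sketched as data plus a small lookup. This is a rough Python illustration, not an incident.io configuration format: the names `Step`, `POLICIES`, and `current_responders` are hypothetical, and the wait times I did not state explicitly (the manager step for P2, and reading "within an hour" as a 60-minute wait for the lead on P0) are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    responders: tuple
    wait_minutes: int  # minutes to wait for a response before escalating


# Hypothetical encoding of the chains described in the review.
POLICIES = {
    "P0": [
        Step(("junior engineer", "senior engineer"), 10),
        Step(("lead engineer",), 60),  # assumption: "within an hour"
        Step(("manager", "VP of Engineering"), 0),
    ],
    "P2": [
        Step(("junior engineer",), 60),
        Step(("senior engineer",), 60),
        Step(("lead engineer",), 60),  # assumption: wait not stated
        Step(("manager",), 60),        # assumption: wait not stated
        Step(("VP of Engineering",), 0),
    ],
}


def current_responders(severity, minutes_unacknowledged):
    """Return who is being paged after the given time without a response."""
    steps = POLICIES[severity]
    elapsed = 0
    for step in steps[:-1]:
        elapsed += step.wait_minutes
        if minutes_unacknowledged < elapsed:
            return step.responders
    # Past every wait window: the last step in the chain is paged.
    return steps[-1].responders
```

For instance, `current_responders("P0", 5)` returns the junior and senior engineers, while `current_responders("P0", 10)` has already moved on to the lead engineer.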
The system makes you the sole owner of the incident, responsible for managing it end to end. It gives you the essentials for doing so, including update windows of 10 minutes, 20 minutes, and so on. It gives you updates on incidents, and you just fill in a form to indicate what status you have reached. This also lets us give customers a live feed of the status, especially when they are dealing with a P0. For P1 and P2 we don't follow that as strictly, because the customer isn't blocked in those scenarios and we can be more relaxed.
The automated workflow feature in incident.io has helped me reduce human error significantly. We previously used PagerDuty, which is the better product, but the cost was extremely high; it charged us per incident created. We moved to incident.io for cost efficiency: based on my experience, it is almost 50% cheaper than what we were paying for PagerDuty at the time. PagerDuty is far superior to incident.io, but incident.io gets the job done.
incident.io's real-time collaboration features have significantly improved my incident resolution efforts. As mentioned, we follow a hierarchy for incident response: a junior engineer, then a senior engineer, then me as the lead engineer, followed by a manager, the VP of Engineering, a director, and even the CTO, who can also get tagged. For a P0 incident, which is very critical, this gives us a strong return in terms of incident response.
The on-call flows help immensely. If one engineer is unavailable, we can depend on the team rather than on a single person handling everything. It is a team effort, and incidents are clearly visible within each team. We receive proper alerts through Slack, phone calls, and email, which makes life a lot easier.
What needs improvement?
I would like to see incident.io mature, as it is not a complete solution at the moment. It focuses narrowly on incident management and lacks the breadth of a platform such as PagerDuty, which has a broad service catalog covering everything from asset management to change management. incident.io is not there yet. This is less a feature request than a comment on the niche they are working in. To grow with enterprise-level customers, it needs to become more of a platform than a point solution.
On the feature side, MTTR trends are sadly missing in incident.io, and we had them with PagerDuty. Cost estimations are also lacking. For example, if an incident involves high-cardinality metrics in production that lead to a jump in billing, that cost can't be estimated in incident.io, while it could be in PagerDuty. So it feels like more of a downgrade for us, but every choice has its pros and cons. incident.io is cheaper, and we needed a more economical solution.
For how long have I used the solution?
I have been working with incident.io for about a year and a half. I used PagerDuty at my previous organization, and this company also had PagerDuty before we switched to incident.io because of cost.
What do I think about the stability of the solution?
Our incidents are not very frequent; we get fewer than four to five a week. At that volume incident.io has been stable, and we don't encounter any issues.
What do I think about the scalability of the solution?
Because we get fewer than four to five incidents a week, I can't say how incident.io would perform for teams with extraordinarily high alert volumes. For our use case it fits well, and we don't encounter any issues.
How are customer service and support?
I rely heavily on incident.io's documentation and generally don't need human support. The tech support team itself also recommends reaching out via the global incident.io Slack channel for community collaboration. I actively participate in that community to ask questions, and there is always someone from the team to respond if something goes wrong, which makes it a collaborative experience.
Which solution did I use previously and why did I switch?
We did evaluate other options in the market, and we were already using PagerDuty, which has a very dense ecosystem. incident.io is a Slack-dependent solution that integrates solely with Slack, while PagerDuty has far more capabilities. For instance, it can integrate with Microsoft Teams, directly with AWS and Azure, and even with GCP servers, providing better visibility. Currently, all the alerts we receive come through our observability stack: Honeycomb, DataDog, or others. With PagerDuty, you can integrate further down to track AWS-related metrics, which isn't an option in incident.io right now.
We use the incident timeline feature, which helps us track our MTTR. There are certain metrics that we follow as SREs, and MTTR is one of the critical ones we monitor to serve our customers and meet our SLOs. In such scenarios, the timeline of the actual incident helps us track how the incident unfolded and what the turnaround time was. This allows us to solve issues for customers as quickly as possible.
Of MTTR and MTTD, MTTR is the one metric we definitely track when measuring incident response improvements with the analytics in incident.io. Our DORA metrics are handled primarily by Sleuth, so apart from MTTR we don't track many metrics in incident.io. DORA metrics indicate team pace, and incident resolution is a crucial factor in them, which is why we use them.
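As a rough illustration of how MTTR falls out of timeline data, the Python sketch below averages the time from creation to resolution across incidents. It assumes each incident is reduced to a (created, resolved) pair of timestamps; the function name `mean_time_to_resolve` and the sample timestamps are invented for the example and are not taken from incident.io.

```python
from datetime import datetime, timedelta


def mean_time_to_resolve(incidents):
    """Average (resolved - created) over a list of (created, resolved) pairs."""
    durations = [resolved - created for created, resolved in incidents]
    if not durations:
        raise ValueError("need at least one resolved incident")
    return sum(durations, timedelta()) / len(durations)


# Made-up sample data: one 30-minute incident and one 90-minute incident.
sample = [
    (datetime(2024, 1, 8, 9, 0), datetime(2024, 1, 8, 9, 30)),
    (datetime(2024, 1, 10, 14, 0), datetime(2024, 1, 10, 15, 30)),
]
mttr = mean_time_to_resolve(sample)
```

Here `mttr` comes out as a `timedelta` of one hour, the mean of the two durations.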
How was the initial setup?
The initial setup for incident.io is extremely simple. It is a plug and play setup, the easiest deployment I've ever done. We paid for a plan and were able to get everything set up within a day or so, with everything working as expected. There were certain quirks and flows that needed fixing, but we managed to address those later on.
What was our ROI?
On pricing, I have seen a great ROI with incident.io after switching from PagerDuty. However, I must clarify that we saw those returns with PagerDuty as well, so the gain we are observing isn't extensive. In terms of features it feels like more of a downgrade for us, but every choice has its pros and cons. incident.io is cheaper, and we needed a more economical solution, as simple as that.
Which other solutions did I evaluate?
In my evaluation process for choosing incident.io, we did consider alternatives such as Jira and ServiceNow, which provide similar functionalities but are comparatively more complex to set up. Being a smaller company, we needed a straightforward solution where everything gets done clearly on Slack since we use it heavily in our operations. Our requirements were very minimal: we wanted an incident manager to create a Slack channel for incidents, alert incident responders, and provide all pertinent details. That use case fits best with incident.io after our experience with PagerDuty.
What other advice do I have?
My advice for others looking to start using incident.io is that if your budget is tight, just start with incident.io. You will not regret it. However, if you have the budget, choose a better solution. Why settle for a lower-end solution with limited integration capabilities that relies solely on Slack? If you have the means, go for PagerDuty or consider a comprehensive solution such as Jira, which offers complete ITSM capabilities. incident.io isn't as mature right now; the on-call rotation ecosystem is functional and gets the job done, but for enterprise-level customers with a larger budget, PagerDuty would be the better choice despite its cost. I would rate this product 7 out of 10.