What is our primary use case?
My team and I rely primarily on Datadog logs from our application to identify issues in our cloud-based solution. We take the requests and error information that our customers report and use them to pinpoint the errors within our back-end systems, which allows us to submit code fixes or configuration changes.
For example, I had an error this morning when I was trying to submit an API request; the web interface just said "unspecified error." I took the request ID and filtered our logs on a facet for that request ID, and it gave me the specific entries. From there I could look at the stack trace we had logged to identify exactly what it was failing to convert in order to upload that data.
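The request-ID lookup described above can be sketched roughly as follows. This is only an illustration, not our actual tooling: the attribute name `request_id` and the sample ID are hypothetical, and the body shape assumes the public Datadog Logs Search endpoint (`POST /api/v2/logs/events/search`). The code only builds the query and request body; it does not call the API.

```python
import json


def build_request_id_query(request_id: str, attribute: str = "request_id") -> str:
    """Build a Datadog log search query that matches one attribute value.

    The attribute name is a hypothetical example; use whatever attribute
    (and facet) your own logs expose.
    """
    # Quote the value and escape backslashes and double quotes inside it.
    escaped = request_id.replace("\\", "\\\\").replace('"', '\\"')
    return f'@{attribute}:"{escaped}"'


def build_search_body(request_id: str, time_from: str = "now-1d",
                      time_to: str = "now", limit: int = 50) -> dict:
    """Assemble a request body for a log search over a time window."""
    return {
        "filter": {
            "query": build_request_id_query(request_id),
            "from": time_from,
            "to": time_to,
        },
        "sort": "timestamp",
        "page": {"limit": limit},
    }


if __name__ == "__main__":
    # Hypothetical request ID copied from a failing API call.
    print(json.dumps(build_search_body("3f2a9c1e"), indent=2))
```

The same query string (`@request_id:"..."`) is what would go into the search bar in the web interface when filtering by that facet.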
My team doesn't utilize Datadog logs very often, but we do have quite a few collections of dashboards and widgets that tell us the health of the various API requests that come through our application to identify any known issues with some of our product integrations. It's useful information, but it's not necessarily stuff that our team monitors directly as we're more of a reactionary team.
What is most valuable?
The best feature Datadog offers, in my experience, is the ability to filter down by facets very quickly to identify the problems individual customers are experiencing with our cloud application. I also really enjoy the trace option, which lets me see all of the various components and how they communicate with each other, so I can see where the failures are occurring.
The trace option helps us spot issues by showing whether the problem is occurring within our Java components or is a result of the SQL queries. We can then look at the SQL queries themselves to identify what information they are trying to pull. We can also look at other integrations, whether that's serverless Lambda functions or different components from our outreach.
Datadog has impacted our organization positively. The general feeling is that it's superior to the ELK stack that we used to use: it is significantly faster at searching and filtering information. It also provides links to our search criteria, so our development and cloud operations teams can look at the same problems without having to set up their own search and filter criteria.
What needs improvement?
For the most part, the issues that we come across with Datadog are related to training for our organization. Our development and operations teams have done a really good job of getting our software components into Datadog, allowing us to identify them. However, we do have reduced logging in our Datadog environment due to the amount of information that's going through.
The hardest thing we experience is just training people on what to search for when identifying a problem in Datadog. Having some additional, easily accessible training would probably be a benefit.
At this point, I do not know what I don't know, so while there may be options for improvement, Datadog works very well for the things that we currently use it for. The most helpful addition would be guidance within the user interface itself that points us to useful information, shows how to tie different components together, or explains how to build a good dashboard.
For how long have I used the solution?
I have worked for Calabrio for 13 years.
What do I think about the stability of the solution?
What do I think about the scalability of the solution?
Datadog's scalability is strong; we've continued to significantly grow our software, and there are processes in place to ensure that as new servers, realms, and environments are introduced, we're able to include them all in Datadog without noticing any performance issues. The reporting and search functionality remain just as good as when we had a much smaller implementation.
Which solution did I use previously and why did I switch?
Previously, we used the ELK stack (Elasticsearch, Logstash, and Kibana) to capture data. Our cloud operations team set that up because they were familiar with it from previous experience. We stopped using it because, as our environment continued to grow, the response times and the amount of data being kept reached a point where we couldn't use it effectively, and it lacked the capability to help us proactively identify issues.
What other advice do I have?
My general impression is that Datadog saves time. Searching, even across the large number of AWS realms and the long time spans that we have, is significantly faster than in other solutions I've used for similar purposes.
I would advise others looking into Datadog to first identify the components within their organization that could benefit from having their information pulled in, and to work out how to parse and process all of it effectively before they start, so they know what to look for. Specifically, when searching for data, if a value can be pulled out into its own facet, the filtering you can do is significantly better than with a general text search.
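To show why a dedicated facet filters better than a general text search, here is a small local sketch. It does not use Datadog at all: the log lines, field names, and helper functions are made up for illustration, and the parsing is a simple stand-in for whatever log pipeline extracts attributes in a real setup.

```python
import re

# Matches key=value pairs; values may be bare tokens or double-quoted strings.
KV_PATTERN = re.compile(r'(\w+)=("[^"]*"|\S+)')

# Hypothetical sample log lines.
LOGS = [
    'level=error status=500 customer=acme msg="conversion failed"',
    'level=info status=200 customer=acme msg="uploaded 500 rows"',
    'level=error status=500 customer=globex msg="timeout"',
]


def parse_attributes(line: str) -> dict:
    """Pull key=value pairs out of a log line into structured attributes."""
    return {key: value.strip('"') for key, value in KV_PATTERN.findall(line)}


def text_search(lines: list, needle: str) -> list:
    """General text search: matches the needle anywhere in the line."""
    return [line for line in lines if needle in line]


def facet_search(lines: list, **facets: str) -> list:
    """Facet-style search: every named attribute must equal the given value."""
    return [
        line for line in lines
        if all(parse_attributes(line).get(k) == v for k, v in facets.items())
    ]


if __name__ == "__main__":
    # Text search for "500" also hits the message "uploaded 500 rows".
    print(len(text_search(LOGS, "500")))
    # Filtering on extracted attributes isolates one customer's failures.
    print(facet_search(LOGS, status="500", customer="acme"))
```

The text search matches every line that happens to contain "500", including a successful upload, while the attribute filter returns only the failing request for the one customer.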
I would love to figure out how to use Datadog more effectively in the organization work that I do, but that is a discussion I need to have with our operations and research and development teams to determine if it can benefit the customer or the specific implementation software that I work with.
On a scale of one to ten, I rate Datadog a ten out of ten.
Which deployment model are you using for this solution?
Public Cloud
If public cloud, private cloud, or hybrid cloud, which cloud provider do you use?
Disclosure: My company does not have a business relationship with this vendor other than being a customer.