What is our primary use case?
My main use case for Datadog Services is APM, and we also use it for metrics. Sending metrics and getting them displayed on a dashboard works extremely well with Datadog Services, and we have a lot of performance dashboards set up where we analyze what is wrong with the system when something goes down. Whenever an incident comes in, we open our dashboards to check which components are showing spikes, whether in CPU, memory, or load average, or whether there are network bottlenecks. All of this analysis is done through our Datadog Services dashboards.
To facilitate this, we emit system metrics, and there are also some custom metrics that indicate the success criteria, along with the number of documents, reported to Datadog Services to ensure that things work as expected.
We had a sampling rate of about thirty percent, which I think we have now reduced to twenty percent, because the data volume was very high when using Datadog Services. Once we started sampling, our Datadog Services bill dropped significantly. The only problem I see with Datadog Services is cardinality. If we increase the cardinality, the billing becomes extremely high. Sending a lot of metrics is fine as long as they are plain one-to-one metrics. However, when the cardinality increases and unique events are sent to Datadog Services via the OTEL collector, we encounter many problems. Otherwise, Datadog Services works extremely well with no issues.
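To illustrate the cardinality point, here is a minimal sketch in plain Python. It is not Datadog's billing logic, and the metric name `docs.processed`, the tag names, and the helper `series_count` are hypothetical. It only assumes that each distinct combination of metric name and tag values counts as a separate time series, which is why a tag carrying a unique value per event inflates the count.

```python
from itertools import product


def series_count(points):
    """Count distinct (metric name, tag set) combinations, i.e. time series."""
    return len({(name, frozenset(tags.items())) for name, tags in points})


hosts = ["host-1", "host-2", "host-3"]
statuses = ["success", "failure"]

# Bounded tags: 3 hosts x 2 statuses, each combination emitted 100 times.
bounded = [
    ("docs.processed", {"host": h, "status": s})
    for h, s in product(hosts, statuses)
] * 100

# Unbounded tag: the same number of points, but each carries a unique
# document_id, so every point becomes its own series.
unbounded = [
    (
        "docs.processed",
        {"host": hosts[i % 3], "status": "success", "document_id": f"doc-{i}"},
    )
    for i in range(600)
]

print(len(bounded), series_count(bounded))      # 600 points, 6 series
print(len(unbounded), series_count(unbounded))  # 600 points, 600 series
```

Both lists contain 600 data points, but the second produces one hundred times as many series, which is the kind of growth that shows up on the bill.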
What is most valuable?
The best feature Datadog Services offers for me is APM, which is an excellent feature for application performance monitoring, allowing us to get thread dumps, heap dumps, and an analysis of everything happening inside the box. The only downside I see is the cost, which is nineteen dollars per box, so enabling APM for every box drives the bills extremely high.
Datadog Services has positively impacted our organization in terms of performance, making our product extremely good. I want to emphasize that while we were using Datadog Services, our product processed almost five petabytes of data, and at that scale, even one log line can significantly impact cost. Analyzing our system performance and identifying bottlenecks became extremely easy because every component of our end-to-end document processing was covered. We had an ingestion tool with observability over the Kafka queues, where observability means capturing CPU, memory, and load average metrics, along with IOPS and network information. The next component was Storm, where we also had excellent observability with APM traces, metrics, and a lot of system metrics, as we installed the Datadog Services agent across all our VMs. Additionally, we tracked documents going into S3, documents transitioning to Elasticsearch, and the metadata of documents going to MongoDB. This gave us end-to-end observability and made our lives extremely easy, and I credit Datadog Services as a pioneer in this space. The only downside I see is the cost.
What needs improvement?
Datadog Services could be better in terms of cost, which is extremely high. They charge a lot of money, and I've seen monthly bills reaching ten thousand dollars for a very small product in my current organization. This led us to consider switching to open-source solutions like SigNoz, or to a solution called Honeycomb for traces while we use SigNoz for metrics, which has reduced our observability costs by around ninety percent. The overall usability and accessibility of Datadog Services, along with its ease of integration, is extremely good, and I must acknowledge their well-crafted setup documentation. Setting up the agent and getting the metrics was extremely easy. Unfortunately, the only downside remains the cost, because they charge a high premium over existing systems. They are pioneers, with some good AI agents running that make analysis and alerting very effective, but the cost continues to be an issue.
For how long have I used the solution?
I have been familiar with Datadog Services for almost four and a half years.
What other advice do I have?
I'm not familiar with tools like Traefik Enterprise, Buoyant Enterprise for Linkerd, or Digital.ai Continuous Testing.
I'm also not familiar with HashiCorp Consul, Digital.ai Agility, or ServiceNow DevOps, as I mostly work with deployment tools like AWS CodeDeploy, followed by Terraform. From HashiCorp, I've worked with Vault, while in terms of CI, we used GitHub, GitHub Actions, and GitLab.
I use various enterprise observability tools daily, such as Datadog Services, Honeycomb, and SigNoz, including their tracing mechanisms. OpenTelemetry is something we use day in, day out.
Out of the tools I mentioned, I have already given detailed feedback on PeerSpot for Honeycomb, and we can continue on Datadog Services.
My review rating for Datadog Services is eight point five out of ten.
If public cloud, private cloud, or hybrid cloud, which cloud provider do you use?
Amazon Web Services (AWS)