What is our primary use case?
We use Datadog for observability and system/application health, mainly for product support, triaging, debugging, and incident response.
We rely heavily on logging and use the Datadog Agent to collect logs, metrics, and traces from our GKE workloads. We use APM and continuous profiling to measure latency and performance. We use RUM to observe frontend user events, such as tracing requests and seeing what actions users take before errors occur. We also use error tracking and source maps to debug production failures.
We are still relatively new to the product, and we plan to make more use of Notebooks and Powerpacks to record runbooks and break down knowledge silos. We also need to make more use of dashboards and continuous profiling for performance measurement, and to integrate Datadog alerts into our incident response.
How has it helped my organization?
We have far more observability than we had before, both on the application and on the overall system, including the GKE cluster, nodes, and pods. It has also helped with our Cloud Run instances, databases, and data storage.
We also started adding observability to our CI pipeline to measure CI performance, which was a pain point for us. We are aiming to do incremental deployments and releases, and the bottleneck so far has been our CI performance. Visibility into which actions or functions take the most time lets us pinpoint them and focus on improving their configuration.
What is most valuable?
We use structured logging a lot to triage production issues. Querying, manipulation of attributes and tags, and customization have been very helpful in isolating and filtering environments. The integration with the Winston logger has also been a breeze.
First and foremost, structured logging, tags, and attributes have not only allowed us to narrow down a problem quickly in production; they have also let us create dashboards from these logs to better understand user behavior, such as how many users leave our application before an upload has completed. That helps us understand how important processing time is to a user.
We also intend to use distributed tracing more to understand where an error occurred in a particular request.
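To illustrate the kind of question structured logs can answer, here is a minimal sketch, not the reviewer's actual setup. The `LogEntry` shape, the field names (`env`, `userId`, `event`), and the event values are all hypothetical. It shows how consistent attributes make it possible to filter by environment and count users who started an upload but never completed it.

```typescript
// Hypothetical shape of a structured log entry. The field names are
// illustrative assumptions, not taken from any real schema.
interface LogEntry {
  timestamp: string;
  level: "info" | "warn" | "error";
  message: string;
  env: string; // tag used to isolate environments
  userId: string;
  event: "upload_started" | "upload_completed" | "session_ended";
}

// Serialize an entry as a single JSON line so a log pipeline can parse
// each attribute instead of treating the line as free text.
function toJsonLine(entry: LogEntry): string {
  return JSON.stringify(entry);
}

// Count users in one environment who started an upload but have no
// matching completion event.
function countAbandonedUploads(entries: LogEntry[], env: string): number {
  const started: { [userId: string]: boolean } = {};
  const completed: { [userId: string]: boolean } = {};
  for (const entry of entries) {
    if (entry.env !== env) continue;
    if (entry.event === "upload_started") started[entry.userId] = true;
    if (entry.event === "upload_completed") completed[entry.userId] = true;
  }
  return Object.keys(started).filter((id) => !completed[id]).length;
}

// Made-up sample data for the sketch.
const sampleLogs: LogEntry[] = [
  { timestamp: "2024-01-01T10:00:00Z", level: "info", message: "upload started", env: "production", userId: "u1", event: "upload_started" },
  { timestamp: "2024-01-01T10:00:40Z", level: "info", message: "upload completed", env: "production", userId: "u1", event: "upload_completed" },
  { timestamp: "2024-01-01T10:01:00Z", level: "info", message: "upload started", env: "production", userId: "u2", event: "upload_started" },
  { timestamp: "2024-01-01T10:01:30Z", level: "info", message: "session ended", env: "production", userId: "u2", event: "session_ended" },
  { timestamp: "2024-01-01T10:02:00Z", level: "info", message: "upload started", env: "staging", userId: "u3", event: "upload_started" },
];

console.log(toJsonLine(sampleLogs[0]));
console.log(countAbandonedUploads(sampleLogs, "production"));
```

In practice this kind of aggregation would more likely be done with a log query or dashboard widget than in application code; the sketch only shows why consistent attributes matter.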
What needs improvement?
The documentation could definitely use improvement. As I navigated it trying to find instrumentation and implementation details, I discovered inconsistencies among the SDKs for different languages.
There are also places where highlighting can be improved. I once created an issue on GitHub, and it was resolved right away by an engineer, who pointed out that the answer was actually in the documentation. I looked again and found it was not very obvious. We had been stuck on the problem for days.
Auto-instrumentation for tracing has not been very easy to find in the documentation. We ended up using OpenTelemetry, yet converting between tracing contexts has been difficult.
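One common source of friction when mixing the two, though not necessarily the one described above, is the ID format. A W3C `traceparent` header carries a 128-bit trace ID and a 64-bit span ID in lowercase hex, while Datadog's 64-bit trace and span IDs are written as decimal strings. The sketch below assumes the 64-bit form and takes the lower 64 bits of the trace ID; the function and type names are hypothetical.

```typescript
// Datadog-style IDs as unsigned 64-bit decimal strings (assumed 64-bit form).
interface DatadogIds {
  traceId: string;
  spanId: string;
}

// Convert a W3C traceparent value ("version-traceid-spanid-flags") into
// decimal IDs. Hypothetical helper, not part of any SDK.
function traceparentToDatadogIds(traceparent: string): DatadogIds {
  const match = /^[0-9a-f]{2}-([0-9a-f]{32})-([0-9a-f]{16})-[0-9a-f]{2}$/.exec(traceparent);
  if (match === null) {
    throw new Error(`Invalid traceparent: ${traceparent}`);
  }
  // Keep only the lower 64 bits (last 16 hex characters) of the trace ID.
  const lower64Hex = match[1].slice(16);
  return {
    traceId: BigInt("0x" + lower64Hex).toString(),
    spanId: BigInt("0x" + match[2]).toString(),
  };
}

// Example with deliberately small IDs so the result is easy to check by hand.
const exampleIds = traceparentToDatadogIds(
  "00-0000000000000000000000000000000a-00000000000000ff-01"
);
console.log(exampleIds); // traceId "10", spanId "255"
```

`BigInt` is used because 64-bit IDs do not fit safely in a JavaScript `number`.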
For how long have I used the solution?
We've used the solution for between six months and a year.
How are customer service and support?
Customer service and support are generally very fast. I did have one ticket, about changing the log index retention period, that did not get a response. Support tickets related to technical issues were all resolved pretty fast.
Which solution did I use previously and why did I switch?
We used to use GCP Stackdriver for logging and monitoring, since our infrastructure is all GCP-based. It was lacking a lot, particularly in tracing and structured logging, and we often had trouble triaging and diagnosing production problems. Datadog's specialty is observability. Since we started using the product, we have been able to create dashboards and use APM, continuous profiling, RUM, and distributed tracing for production support and user trends.
Datadog also offers labs and workshops for its products, which is very helpful.
What about the implementation team?
We implemented the product ourselves.
What was our ROI?
I'm not sure what our ROI would be.
What's my experience with pricing, setup cost, and licensing?
We started with on-demand pricing because we were rewriting our product and weren't sure what our total usage would be. After we went into production and released the product, we experienced a price surge. Fortunately, our Datadog account manager reached out to us and suggested a monthly subscription, which is what we'll be switching to.
I'd advise keeping an eye on usage and possibly setting up some monitoring of cost. We didn't have much of a setup cost; we started with a free trial and continued with on-demand after the trial ended.
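As a rough illustration of what such usage monitoring could check, the sketch below projects month-end usage from the daily figures seen so far and compares it with a budget. The function names, the units (GB of ingested logs), and the numbers are all hypothetical; real figures would come from the vendor's usage reporting.

```typescript
// Project month-end usage by extrapolating the average of the days seen so far.
function projectMonthlyUsage(dailyUsage: number[], daysInMonth: number): number {
  if (dailyUsage.length === 0) return 0;
  const total = dailyUsage.reduce((sum, value) => sum + value, 0);
  return (total / dailyUsage.length) * daysInMonth;
}

// True when the projected usage exceeds the budgeted amount.
function isOverBudget(dailyUsage: number[], daysInMonth: number, budget: number): boolean {
  return projectMonthlyUsage(dailyUsage, daysInMonth) > budget;
}

// Made-up example: three days of ingestion in GB against a 300 GB monthly budget.
const dailyIngestGb = [10, 12, 14];
console.log(projectMonthlyUsage(dailyIngestGb, 30)); // 360
console.log(isOverBudget(dailyIngestGb, 30, 300)); // true
```

A linear projection like this is crude, but it is enough to flag a surge early in the billing period rather than at the invoice.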
Which other solutions did I evaluate?
We didn't evaluate many of the other options. However, we do also use OpenTelemetry, which is vendor agnostic and integrates with Datadog.
What other advice do I have?
We always keep the Datadog Agent on the latest version.
Which deployment model are you using for this solution?
Public Cloud
If public cloud, private cloud, or hybrid cloud, which cloud provider do you use?
Google
Disclosure: My company does not have a business relationship with this vendor other than being a customer.