What is our primary use case?
The whole idea is to have logging enabled and to be able to search for any pattern. Logs are very unstructured these days. You can't predict where they're coming from or which fields they'll carry.
Fields can be dynamic and change depending on the version; when people moved from the Docker daemon to containerd, the log format changed. Most of the logging was impacted, especially where we had built solutions using pipelines like ELK. The advantage of Loki is its ability to process unstructured logs. It stores logs in chunks and uses something similar to grep for pattern matching, so it can find any pattern for us.
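To illustrate what I mean by chunks plus grep-like matching, here is a minimal sketch in Python. It is not Loki's implementation; the in-memory store, the label names, and the sample lines are all made up for illustration. The point is only that streams are chosen by a small set of labels, and the line content is decompressed and scanned rather than indexed.

```python
import gzip
import re

def make_chunk(lines):
    """Compress a batch of raw log lines into one chunk (hypothetical helper)."""
    return gzip.compress("\n".join(lines).encode("utf-8"))

# Hypothetical store: label set -> list of compressed chunks.
# The two "checkout" streams deliberately use different line formats.
STORE = {
    (("app", "checkout"), ("runtime", "docker")): [
        make_chunk(['{"level":"error","msg":"payment timeout"}',
                    '{"level":"info","msg":"order placed"}']),
    ],
    (("app", "checkout"), ("runtime", "containerd")): [
        make_chunk(["2024-01-01T00:00:00Z stderr F payment timeout after 30s",
                    "2024-01-01T00:00:01Z stdout F order placed"]),
    ],
    (("app", "search"), ("runtime", "containerd")): [
        make_chunk(["2024-01-01T00:00:02Z stdout F query timeout"]),
    ],
}

def grep_streams(store, selector, pattern):
    """Return lines matching `pattern` from streams whose labels include `selector`."""
    regex = re.compile(pattern)
    matches = []
    for labels, chunks in store.items():
        label_map = dict(labels)
        # Only the small label set is consulted to pick streams.
        if any(label_map.get(key) != value for key, value in selector.items()):
            continue
        for chunk in chunks:
            # Line content is not indexed: decompress the chunk and scan it.
            for line in gzip.decompress(chunk).decode("utf-8").splitlines():
                if regex.search(line):
                    matches.append(line)
    return matches

hits = grep_streams(STORE, {"app": "checkout"}, r"payment timeout")
for line in hits:
    print(line)
```

The same search finds the pattern in both the JSON line and the plain-text line, without any field mapping having been configured up front.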
What is most valuable?
The most valuable feature is the cost. Logs are written to and read from S3. For me, cost is critical, especially at the enterprise level.
With this, I don't worry about who is logging or how much they're logging. It's a cost-effective solution. We aren't paying exorbitant amounts just for logging.
There are other advantages too: it uses the same metadata as Prometheus, with the same service discovery. Everything seems pretty cool and calm for me.
What needs improvement?
My main concern is the recommended production-grade setup. They suggest using tools like Tanka or Jsonnet. They should simplify the process to increase adoption. The architecture is solid, with distinct read and write parts and a caching layer, making it fast.
However, setting up a production-grade cluster takes a lot of effort to understand the components and how they fit together. That's where I see room for improvement.
For how long have I used the solution?
I have been using this solution for one and a half years.
What do I think about the stability of the solution?
I haven't found any issues with Loki so far. We have stored a decent amount of data, and I feel it's stable. I haven't faced issues with logging.
What do I think about the scalability of the solution?
It is a highly scalable solution. When we want to scale, we need to do due diligence on the tool and its components. Grafana has ensured that the write path and the read path are separate. The optimized architecture, along with the front end, shows that Grafana has put significant effort into building this solution.
How are customer service and support?
The community is very active. If we reach out, they come back to us with solutions. They are a young community. The way these guys maintain the complete portfolio, including Grafana and Prometheus, is impressive. Before Loki, there were only Grafana and Prometheus. Even for microservices, people mostly use a combination like Prometheus and Grafana. Given their workload, their contributions to the open-source community, and their support, I would rate them an eight.
How was the initial setup?
If it is a Hello World setup or a setup on a dev cluster, it's pretty straightforward. But when we're dealing with a heavy cluster, like 15 to 20 terabytes of data per day, we need a production-grade cluster.
For that kind of scenario, we must invest time and understand the process. They could have integrated these features within their Helm chart, but they're using tools like Tanka and Jsonnet to implement a production service. I feel this could have been better.
Typically, I'd use Grafana for metrics monitoring, a different tool like ELK for logging, and yet another tool for tracing. So, to troubleshoot a specific issue, I have to switch between three different consoles.
What I see in metrics isn't the same as in logs because the metadata and collection methods differ. That's where Loki comes in. Within Grafana, you can see metrics, logs, correlations, generate metrics from logs, and also set alerts.
Alerting from logs is something many companies desire. With Loki, if there's a pattern in the log, we can filter it out without altering the entire pipeline. For instance, if I had to add fields in ELK, it would require a lot of configuration changes. Loki, however, is more flexible. It uses a grep-like pattern and the metadata model from Prometheus.
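As a rough sketch of what a metric or alert derived from logs can look like, the Python below only builds query strings and a URL; it does not contact any server. The host name, the `app="checkout"` label, the search text, and the threshold are hypothetical. The query uses LogQL's line filter (`|=`) and `count_over_time` as I understand them, and the URL targets Loki's `query_range` HTTP endpoint, so check the details against the documentation for your version.

```python
from urllib.parse import urlencode

# Hypothetical address; point this at wherever your Loki is reachable.
LOKI_BASE_URL = "http://loki.example.internal:3100"

def count_matches_expr(selector, needle, window):
    """LogQL metric query: lines containing `needle` per `window`, summed over streams."""
    return f'sum(count_over_time({selector} |= "{needle}" [{window}]))'

def alert_expr(selector, needle, window, threshold):
    """Expression that yields a result only while the count exceeds `threshold`."""
    return f"{count_matches_expr(selector, needle, window)} > {threshold}"

def range_query_url(expr, start_ns, end_ns, step):
    """Build a query_range URL; start and end are nanosecond Unix timestamps."""
    params = urlencode({"query": expr, "start": start_ns, "end": end_ns, "step": step})
    return f"{LOKI_BASE_URL}/loki/api/v1/query_range?{params}"

# Hypothetical values for illustration.
expr = count_matches_expr('{app="checkout"}', "payment timeout", "5m")
rule = alert_expr('{app="checkout"}', "payment timeout", "5m", 10)
url = range_query_url(expr, 1704110400000000000, 1704114000000000000, "60s")

print(expr)
print(rule)
print(url)
```

Nothing in the ingestion pipeline changes here. The pattern lives in the query, which is why adding a new filter does not require reconfiguring how logs are collected.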
It's highly efficient, with compressed data kept in object storage such as a GCS bucket or AWS S3, which makes log storage cost-effective. Compared to other solutions, it's more economical. Loki also has a command-line client, LogCLI, which is very effective.
It's all self-hosted. It runs in the cloud, but we manage it ourselves rather than using a managed service.
What other advice do I have?
I would suggest going for it. However, my recommendation would also depend on the use case. There are heavy solutions built on full-text search engines, such as Solr and Elasticsearch. These are built mainly for e-commerce purposes, and the downside is the large metadata.
This means high storage costs, and a significant part of maintaining logs becomes maintaining that metadata. With Loki, for a terabyte of data, your metadata is just a few GBs, and that's an advantage. If any node fails, it recovers quickly because your backend data is in S3 and the metadata is small. If you need it only for logging purposes, I'd suggest going for it. Loki offers quick wins over other solutions and is also cost-effective.
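To put rough numbers on that, here is a small worked calculation in Python. The 3 GB figure is my assumption standing in for "a few GBs", and the 15 TB per day ingest rate is the lower end of the heavy-cluster range I mentioned earlier; neither is a measurement.

```python
# Assumed figures for illustration only.
DATA_GB = 1024.0   # one terabyte of log data, expressed in GB
INDEX_GB = 3.0     # hypothetical metadata size for that terabyte

# Share of total stored volume that is metadata rather than log content.
index_share = INDEX_GB / DATA_GB

def daily_index_gb(ingest_tb_per_day, index_gb_per_tb=INDEX_GB):
    """Metadata produced per day at a given ingest rate, under the same assumption."""
    return ingest_tb_per_day * index_gb_per_tb

print(f"metadata share of stored data: {index_share:.2%}")
print(f"metadata per day at 15 TB/day: {daily_index_gb(15):.0f} GB")
```

Under these assumptions the metadata is well under one percent of what is stored, which is why a failed node has so little to rebuild.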
The operational overhead of maintaining it is minimal.
Overall, I would rate the solution an eight out of ten. The reason is quite simple. These guys have introduced the concept of a unified dashboard, which is commendable. This unified dashboard allows us to monitor logs and metrics side by side.
In terms of microservice architecture, if you observe a metric from a certain part and you also see the logs for that same part side by side, it makes diagnosing issues straightforward. For instance, if there was a spike at a particular time, you can directly correlate it with the logs right beside it.
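The correlation step can be sketched in a few lines of Python. This is only an illustration of the idea, not how Grafana does it: the samples, the log lines, the spike rule (a value above three times the median), and the 30-second window are all hypothetical.

```python
from datetime import datetime, timedelta
from statistics import median

# Hypothetical request-latency samples (ms) for one service.
metrics = [
    (datetime(2024, 1, 1, 12, 0), 120.0),
    (datetime(2024, 1, 1, 12, 1), 130.0),
    (datetime(2024, 1, 1, 12, 2), 950.0),
    (datetime(2024, 1, 1, 12, 3), 125.0),
]

# Hypothetical log lines from the same service.
logs = [
    (datetime(2024, 1, 1, 12, 0, 30), "order placed"),
    (datetime(2024, 1, 1, 12, 1, 50), "db connection pool exhausted"),
    (datetime(2024, 1, 1, 12, 2, 10), "payment timeout after 30s"),
    (datetime(2024, 1, 1, 12, 3, 40), "order placed"),
]

def find_spikes(samples, factor=3.0):
    """Timestamps whose value exceeds `factor` times the median of all samples."""
    baseline = median(value for _, value in samples)
    return [ts for ts, value in samples if value > factor * baseline]

def logs_around(entries, ts, window=timedelta(seconds=30)):
    """Log lines whose timestamp falls within `window` of `ts`."""
    return [line for t, line in entries if abs(t - ts) <= window]

spikes = find_spikes(metrics)
nearby = logs_around(logs, spikes[0])

print(spikes)
print(nearby)
```

With metrics and logs on the same time axis, the lines written around the spike are the first place to look.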
Additionally, there's flexibility in log formats. Loki doesn't restrict you to JSON, XML, or CSV. For example, today, if I'm using ContainerD and it's writing in JSON, and tomorrow, if I have another tool writing in plain text format, Loki adapts seamlessly. It parses my logs and allows analytics on top of them. Plus, it now supports SQL-like syntax, which further boosts its versatility.
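Here is a minimal Python sketch of that format flexibility, assuming parsing happens when the logs are read rather than when they are stored. The field names `message` and `msg` and the sample lines are hypothetical, and this is not Loki's parser; it only shows that JSON and plain-text lines can sit side by side and still be searched the same way.

```python
import json

def normalize(line):
    """Return a dict with a "message" key, whatever format the line arrived in."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        record = None
    if isinstance(record, dict):
        # JSON object: prefer a message-like field, fall back to the raw line.
        message = record.get("message") or record.get("msg") or line
        return {"format": "json", "message": str(message), "fields": record}
    # Anything else is kept as plain text, untouched.
    return {"format": "text", "message": line, "fields": {}}

# Hypothetical mixed input: yesterday's JSON writer, today's plain-text writer.
lines = [
    '{"level":"error","msg":"payment timeout"}',
    "ERROR payment timeout after 30s",
    '{"level":"info","msg":"order placed"}',
]

parsed = [normalize(line) for line in lines]
timeouts = [entry for entry in parsed if "timeout" in entry["message"]]

for entry in timeouts:
    print(entry["format"], entry["message"])
```

Because nothing is rejected at write time, a change of log format upstream does not break ingestion; only the read-side parsing needs to understand the new shape.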
The tool’s single UI supports metrics, logs, and even tracing. You can integrate tracing tools with Loki and access everything from a unified platform. The data complexity is high, but it’s efficiently stored on S3 or any object-based storage, ensuring cost efficiency.
Loki also utilizes the same service discovery mechanism as used by Prometheus. So, whatever labeled metadata you see in Prometheus, you have the exact same metadata in the Loki system.
Given this level of intricacy and the attempt to address these challenges, I firmly believe that they deserve praise for their work.
Which deployment model are you using for this solution?
Hybrid Cloud