What is our primary use case?
ClickHouse has been used for the last one year.
The primary use case involves IoT devices. Software has been developed to onboard IoT devices, which send data at varying frequencies. Analysis must be provided to users based on these different data transmission patterns. A dashboard allows users to onboard their IoT devices and analyze their data. The volume of data is substantial. For example, if a company has one lakh IoT devices sending data every 10 minutes, the data generated in one month can reach several GB to TB. Real-time analysis is required to determine how many times devices were active or inactive, week-wise device activity, total average voltage for energy meters, and many other analytical insights.
ClickHouse has delivered exceptional performance for this use case. Testing was conducted on over 10 million rows, performing count, sum, average, aggregation by week, aggregation by month, ordering, and sorting operations. ClickHouse provides responses within a few seconds, typically two to three seconds, which is impressive. An AI agent has also been built on top of ClickHouse for user-based queries. When a user asks a question such as how many devices are inactive for more than a month, the system directly contacts OpenAI, generates a ClickHouse query from the response, and submits it to ClickHouse. ClickHouse responds within ten seconds. Testing has been performed on over 10 million rows, and it is working well for the use case.
The two main use cases are analysis and an AI agent built on top of ClickHouse.
What is most valuable?
ClickHouse has delivered exceptional performance for this use case. Testing was conducted on over 10 million rows, performing count, sum, average, aggregation by week, aggregation by month, ordering, and sorting operations. ClickHouse provides responses within a few seconds, typically two to three seconds, which is impressive. An AI agent has also been built on top of ClickHouse for user-based queries. When a user asks a question such as how many devices are inactive for more than a month, the system directly contacts OpenAI, generates a ClickHouse query from the response, and submits it to ClickHouse. ClickHouse responds within ten seconds. Testing has been performed on over 10 million rows, and it is working well for the use case. The two main use cases are analysis and an AI agent built on top of ClickHouse.
Speed is the main valuable feature. Setup is straightforward. Several features are utilized including Materialized Views, simple views, ReplaceMergeTree, and Aggregation Tree. These features are used to aggregate results that remain unchanged. For example, monthly, weekly, and daily summaries are aggregated and remain unchanged because they are historical data. Materialized View is one of the most used and valuable features being leveraged.
What needs improvement?
Everything appears to function well. From a software engineering perspective, one consideration involves eventual consistency. In the case of ReplaceMergeTree, data duplication eventually gets corrected during the merge process. When merge parts are combined into one, duplication is removed. If duplication could be removed in real-time, that would be better. An info table has been created to provide the latest data per device. However, when the same device data is inserted again, it takes time. When merge parts combine, duplication is removed. At that point, when data is sent to the UI, it must be grouped by device ID and the last created date must be picked to avoid showing duplicate data. This is understood to be a limitation of the append-only nature, but a solution might exist to address this issue.
Real-time deduplication would be beneficial for clarity on when it occurs. Duplication will exist before the merge, and duplicates will be removed after the merge. ClickHouse provides ReplaceMergeTree, MergeTree, SummingTree, and AggregationTree. A tree family that guarantees deduplication with no duplicity would be valuable. Information should be provided to customers regarding performance trade-offs, such as a X to Y performance reduction, so they can decide if the trade-off is worth it. In this case, the latest data must be displayed, and users must see correct data. Currently, grouping is done outside the main query. A tree family could be provided that guarantees one hundred percent no duplication data, though the FINAL keyword is currently available, it requires developers to add it before and after queries with careful consideration.
Configuration complexity presents another improvement opportunity. Difficulty levels exist ranging from eight to nine out of ten. Setting nine is reasonable, as there must remain some improvement scope. Too many issues exist for beginners to set up ClickHouse. Many parameters must be configured, such as maximum scatter part settings that determine when writing to a table stops. Many parameters require careful setup, making it very difficult for beginners. Unlike PostgreSQL or MongoDB, which can be downloaded and run without difficulty, ClickHouse is not easy for beginners to set up. Improvement scope exists to enable easier setup. Default settings could be provided so that anyone can set up ClickHouse easily.
What do I think about the scalability of the solution?
Scalability was the main concern. Feedback has been very positive regarding ClickHouse's scalability.
How are customer service and support?
Customer support at ClickHouse has not been reached because cloud service has not been utilized. However, documentation and blogs have been thoroughly reviewed, and all issues have been resolved using these resources.
How would you rate customer service and support?
Which solution did I use previously and why did I switch?
The transition was made from MongoDB to ClickHouse. MongoDB was used when data from IoT devices required bulk writing due to frequent write operations. InfluxDB and Cassandra were considered as alternatives. However, when MongoDB was evaluated for analysis, it did not provide good performance. The decision was made to switch to an analytics-focused database. ClickHouse was discovered and selected, and has been in use since that decision.
RocksetDB and Google Spanner, along with other open-source solutions, were evaluated. After reading blogs, documentation, and reviews, and after testing some solutions including Spanner, Spanner was found to provide similar performance, but cost is the main concern. ClickHouse was selected because it is open-source and can be run on-premises, which is the primary requirement due to data security considerations.
How was the initial setup?
Speed is the main advantage, and setup is straightforward. Several features are utilized including Materialized Views, simple views, ReplaceMergeTree, and Aggregation Tree. These features are used to aggregate results that remain unchanged. For example, monthly, weekly, and daily summaries are aggregated and remain unchanged because they are historical data. Materialized View is one of the most used and valuable features being leveraged.
What's my experience with pricing, setup cost, and licensing?
Licensing and cost details are not available as these matters are managed by the DevOps team. An 8-core machine with 32 GB RAM is being used to run ClickHouse.
Which other solutions did I evaluate?
RocksetDB and Google Spanner, along with other open-source solutions, were reviewed. After reading blogs, documentation, and reviews, and after testing some solutions including Spanner, Spanner was found to provide similar performance, but cost is the main concern. ClickHouse was selected because it is open-source and can be run on-premises, which is the primary requirement due to data security considerations.
What other advice do I have?
An exact monetary value cannot be provided, but time savings from query execution can be quantified. Testing was conducted on three to four lakh rows using sum aggregation in both PostgreSQL and MongoDB. PostgreSQL and MongoDB required five to ten seconds on an 8 GB machine with a four-core CPU. The same ClickHouse instance provided results within one second. This represents approximately six to seven times faster query execution. ClickHouse delivers results within one second while MongoDB and PostgreSQL deliver results on the same data within six to seven seconds.
The main concern is that too many issues exist for beginners to set up ClickHouse. Many parameters must be configured, which can complicate setup for beginners. This hinders ease of setup compared to databases such as PostgreSQL or MongoDB, which are straightforward to run for beginners.
For initial settings, focus on reading ClickHouse's documentation. Specific default settings may be available to simplify initial setup. After reviewing various blogs, one problem encountered was the 'too many merge parts' error. This error occurred when frequently inserting data from APIs via Kafka. After adjusting the setup, the Kafka consumer had to be reset, and specific flags had to be tuned to prevent the error from recurring.
When using ReplaceMergeTree, caution must be exercised regarding duplication. Directly displaying data to end-users without considering duplication can lead to discrepancies in unique entries and potentially mislead customers, creating issues for stakeholders. To avoid this, opt for strict reading settings using accurate queries, tailoring adjustments to specific needs.
This review has been rated nine out of ten.