What is our primary use case?
Our use case is a feature store. We build it on Feast as the feature store framework, and we use DuckDB as the data engine that moves data from the offline feature store to the online feature store.
The online feature store is the part that serves features to the operational environment. We run scheduled processes that, for example, update features in the operational environment every hour. For this update, we use DuckDB to load data from Parquet into Redis.
How has it helped my organization?
For us, it's a good solution because it lets us optimize our infrastructure usage. It required no upfront investment: we could simply start using it, and we use only the open-source version.
We use Dask as a data orchestrator, and it has some integration with DuckDB. It was very easy to start using and implement, and cost was never an obstacle.
What is most valuable?
DuckDB is very fast. If we don't have enough memory, it can spill to disk, much like swap. For instance, when we run with memory limits in Kubernetes, a system volume on fast storage can serve as fast swap space for DuckDB. In our framework, DuckDB is five times faster than Spark in some cases. It has an extensive SQL dialect, many SQL window functions, and good integration with Polars. We can easily combine SQL with the Pandas framework in Python to process data.
What needs improvement?
Sometimes we have memory issues that cause job failures. We don't fully understand why they happen, but we can rerun the jobs without problems. There were also problems with binary formats in previous versions, which have now been fixed to provide backward compatibility.
For how long have I used the solution?
About a year. The latest version of DuckDB is very different from the previous ones because the first stable version was released this summer. We now focus only on this version.
What do I think about the stability of the solution?
It's about 99% stable for us.
What do I think about the scalability of the solution?
DuckDB is an in-process database and data engine that runs on a single node, so if we want to use it in a distributed system, we have to orchestrate and manage it ourselves. Because of memory limitations, we use Dask to split the work into many separate jobs, but this is by design.
How are customer service and support?
We use Slack or Discord to chat with the community, which is helpful. Usually, we find solutions and recipes in the community. It's quite easy to find what we need.
How was the initial setup?
Setting up DuckDB is very easy. It's a Python package that can be installed with pip or Anaconda, both of which are standard package managers.
What about the implementation team?
One person can add DuckDB to a container image and build it without any problems. We update our images each quarter, which keeps them easy to manage.
What was our ROI?
It's a good solution because we can optimize our infrastructure usage, and we don't need to invest in the solution itself. We use only open-source solutions, and it's easy to integrate with Dask.
What's my experience with pricing, setup cost, and licensing?
We use the open-source version of DuckDB and don't have costs associated with it. There is a company that provides commercial software, but we use the open-source library, not the services.
What other advice do I have?
I'd rate the solution nine out of ten.
Which deployment model are you using for this solution?
On-premises