What is our primary use case?
Enterprise Data Warehouse and enterprise data staging. Automation and documentation of all data warehousing processes.
How has it helped my organization?
It did not pass usability, reliability, sustainability or performance tests so there was no potential benefit other than minor documentation advantages over the current automated documentation solution.
What is most valuable?
Documentation has excellent potential as it provides the capability to generate data lineage and a complete set of rich documentation and output in multiple formats.
The tool supports multiple target update methods.
Support for multiple design methodologies including Data Vault.
What needs improvement?
In my opinion, the tool requires a much different architectural approach to be effective. It has fundamental architectural flaws that prevent it from being a practical solution for even moderate volumes of data.
The largest issue is that it violates the most basic database data-movement physics through its reliance upon the infamous RBAR or "row by agonizing row" cursor operations within Tsql. In contrast, efficient database apply operations to whole sets of data rather than row by row achieved through Sql joins and set based queries. Note that this is a common problem with code generators as they end up trading scalability for flexibility. Ironically this also makes them unable to take advantage of the much touted "push down" operations that is one of their big selling points. This tool avoids the traditional lookup ETL lookup cache component issue (sometimes referred to as blocking transactions) - that much is true. However it ends up doing something similar but just within the database and with no easy way to override it. Experienced ETL developers know not to do any blocking transactions or RBAR if they can possibly help it. And with traditional ETL tools you either take advantage of built-in push-down capabilities or pre-join the data in views or data sources to avoid this kind issue.
Other constraints include...
- Repository only supports a single target database
- No support for change data capture or delta detection - that must be custom coded
- Must be used to stage data or it requires a redundant copy of staged data. This ended up more than double our storage
- Relies 100% upon target database server with no ability to scale ETL to separate server
- Relies 100% on the target database engine to do all the work (along with everything else the database must do like reporting)
- Cannot acquire data from sources other than flat files. It relies upon the use of linked servers (which many shops don't allow) or calling an ETL tool such as SSIS in the background for data movement. In that case you're still relying on an ETL tool for the ELT portion so then you're actually using two different tools - Wherescape plus a 3rd party ETL tool instead of just one tool.
Specific list of areas for improvement:
* Performance (this is the area that needs the most improvement)
* Efficient storage utilization
* Logging
* Count validations (data movements don't validate counts which is 101 basics to prevent data leaks)
* Scheduler reliability and transparency
* Ability to separate data acquisition/staging processing from target schema population
* Providing automatic record count validations
* Support for change data capture
* Support for in-line transformational procedure language that extends beyond database procedures
* Documentation lineage accuracy and completeness
* Support for sharing metadata repository across multiple targets
* Improvements in staging architecture requiring fewer copies of source data
* Supporting transparency into post-gens
* Providing encapsulated pre and post events for custom code that doesn't get wiped out when you re-generate supported code from the tool.
* Better support for task orchestration in a top-down way that can be easily understood.
* Support for direct parameter passing from job schedulers.
For how long have I used the solution?
Trial/evaluations only.
What do I think about the stability of the solution?
The scheduler often hangs with nothing to inform the user why it isn't working. Sometimes this is simply due to a developer starting a scheduler from their PC which integrates with the repository but isn't obvious when looking at the shared scheduler. Other issues seemed to be only resolved by re-generating the process from scratch and hope for a different result (which often there is), especially if the scheduler hangs. It is very easy to choose an incorrect option when generating processes only to result in something that just hangs with no error message. Then you're knee deep in database procedural language trying to debug a tool's generated code.
The sheer resources that it consumes easily overwhelms a server and causes the database to freeze due to exceeding resource constraints. The tool generates an unexpected and seemingly unpredictable amount of staging data (both in and out of the database depending on data movement option chosen) consuming very large amounts of disk space with redundant copies of data. This tool will not promote friendly relations with your system administrators.
What do I think about the scalability of the solution?
The product was deemed unscalable in terms of data storage and performance.
How are customer service and technical support?
Customer Service:
Very good
Technical Support:
Very good, although this was a POC so I can't speak for production support after the product is purchased.
Which solution did I use previously and why did I switch?
Currently use a traditional ETL tool that meets our requirements. The test scenario involved around 250 GB of source data with a resulting 50 GB (compressed) star schema target. The ETL tool can talk to all our sources including multiple cloud sources using SOAP and REST web services and this tool is just a database procedure generator with no adapter capability. It instead relies on ETL tools to do that work so it doesn't eliminate ETL tools ending up having to support not one tool but two tools - both in licensing, support and expertise which was expected to cause significant increase in total support costs.
How was the initial setup?
The tool relies on ODBC connections requiring complications in security in an AD environment over using OLEDB.
It requires a developer to either access the server directly or setup file shares. It was often confusing when doing development from a PC but working with a scheduler on a server and knowing what files need to be available from the PC verses from the server. The tool requires a great deal of attention to disk usage and location especially when using native database load utilities as it unloads all the data first into temporary files during the load which can easily consume your servers available space in unpredictable ways. Getting the environment functioning just for the training class was unexpectedly difficult.
The metadata repository is unique to a single target database and cannot be shared with any other target which prevents use beyond a silo datamart. This means data lineage is a silo repository for each target.
What about the implementation team?
Software vendor was the lead implementor.
What was our ROI?
The ROI was expected to be very negative to the point of failure.
What's my experience with pricing, setup cost, and licensing?
Factor in the price of specialized consulting who know this product. They're hard to find and expensive.
Which other solutions did I evaluate?
The existing traditional mainstream ETL tool and the previous one have been functioning without problems but the concept of a self-documenting tool that provides a common framework and shared metadata repository was a compelling alternative. This tool was already in use in our shop in a limited way so it begged the question on whether it could serve an expanded role.
Advice: Before choosing any tool it is best to do your homework and compare features to competitors and test the tool in a real life volume scenario. Software is never what its advertised from any vendor. Don't take other's word for it, especially if they have a vested interest in you using the tool. I know of testimonials about this tool and others that are based on wishful thinking and not reality. Caveat Emptor my friends. So the real question is how far off is it and does it have show stoppers that will prevent it's use in your environment? Fail to take this precaution at your and your organization's peril.
What other advice do I have?
The compelling selling point for this and similar tools as "datamart in a box" providers is that they are pre-built frameworks and documentation. The opportunity is that staff with little data warehousing experience can get up to speed easily and not have to worry about building logging, orchestration and error frameworks or even know much about data warehouse design.
Unfortunately this tool's basic architecture has severe flaws which make it likely impractical for most real world marts or data warehouses. It has some significant cons in the following areas:
- Metadata support specific (cannot share across multiple) target database
- Unreliable lineage
- Lacks capability to interface with application API's
- Lacks out-of-box CDC control capabilities
- Poor/outdated GUI (looks like early 1990s)
- Lack of available resources in the market place who can operate this tool
- Lack of customer base and information available
- Primitive logging and error handling
- Forced duplication of staging data multiple times
- Inability to separate source data capture from targets (tightly couples sources with targets)
- Stores all objects in single target database causing number of issues such as backup/recovery, fragmentation, limited organization options for objects