Product Manager at a hospitality company with 51-200 employees
Provides a good bifurcation rate and accuracy, and saves time and money
Pros and Cons
- "The ability to have a good bifurcation rate and fewer mistakes is valuable."
- "One thing that I would like to add is the ability to manually enter data. The way the solution currently works is we don't have the option to manually change the data at any point in time. Being able to do that will allow us to do everything that we want to do with our data. Sometimes, we need to manually manipulate the data to make it more accurate in case our prior bifurcation filters are not good. If we have the option to manually enter the data or make the exact iterations on the data set, that would be a good thing."
What is our primary use case?
We were receiving data from hospitals and other kinds of healthcare service providers in the country; we were predominantly operating in the US. When we received that data, we had to classify it into different repositories or datasets. This data was sent to different vendors, and for that, it needed to be processed in different ways. We needed to bifurcate the data at many steps with different kinds of filters. For that, we used StreamSets.
How has it helped my organization?
We could bifurcate the datasets that we received from different hospitals. We could bifurcate it on the basis of the medical requirements of the hospitals, and sometimes, on the basis of the schedule or purpose. We were obtaining data that we could then supply to some consulting firms or other sources.
StreamSets saved us time. The accuracy was pretty good, and it was definitely better than what we were using previously. Earlier, we had hired two people who were doing the job manually, and we were also using another platform, so we had to pay for both. Overall, we have saved a lot of time, and the accuracy has improved as well. We didn't measure the time savings precisely, but I believe we saved about three days in a week, so there were about 30% to 40% time savings.
StreamSets reduced the workload. There was a 10% to 15% reduction in the workload.
StreamSets helped us to scale our data operations. The limit at which we purchased this solution was incredible. We were never able to reach the limit that we purchased, but it helped us to increase or scale our operation. Especially in months when we received a higher number of entries, we were able to perform our work on time.
What is most valuable?
The ability to have a good bifurcation rate and fewer mistakes is valuable. In our scenario, when we had to bifurcate the data, we did not completely cut the data. We made a different route for one set of data, which went into a different operating system. The remaining data, along with the original set, went through the filtration process again, and it kept on going this way. The other solutions that were in place did not provide this flexibility. With the solutions we were using earlier, we had to reprocess the data from the start again and again. It was a time-consuming process.
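To make the routing idea concrete, here is a minimal sketch in plain Python (not StreamSets' own pipeline definition) of the kind of rule-based bifurcation described above: each record is checked against an ordered set of filters, matched records are routed to their own dataset, and the remainder flows on to the next filter. The field names and rules are hypothetical.

```python
# Hypothetical illustration of rule-based bifurcation: records matching a
# filter go to one destination; everything else continues to the next filter.

records = [
    {"hospital": "A", "type": "billing", "priority": "high"},
    {"hospital": "B", "type": "clinical", "priority": "low"},
    {"hospital": "A", "type": "clinical", "priority": "high"},
]

# Ordered filters: (destination name, predicate). Names are made up.
filters = [
    ("billing_dataset",  lambda r: r["type"] == "billing"),
    ("priority_dataset", lambda r: r["priority"] == "high"),
]

def bifurcate(records, filters):
    routed = {name: [] for name, _ in filters}
    remaining = list(records)
    for name, predicate in filters:
        matched = [r for r in remaining if predicate(r)]
        remaining = [r for r in remaining if not predicate(r)]
        routed[name].extend(matched)
    routed["unmatched"] = remaining
    return routed

if __name__ == "__main__":
    for destination, rows in bifurcate(records, filters).items():
        print(destination, rows)
```

In a visual tool, this chaining is configured as connected stages rather than written out, which is where the time savings come from.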
Their support system was pretty good. When we were setting up the bifurcation protocols that we wanted to set up, we had a few support calls with them, and those were really helpful.
What needs improvement?
The design or the way they have set up the protocol is pretty good. One thing that I would like to add is the ability to manually enter data. The way the solution currently works is we don't have the option to manually change the data at any point in time. Being able to do that will allow us to do everything that we want to do with our data. Sometimes, we need to manually manipulate the data to make it more accurate in case our prior bifurcation filters are not good. If we have the option to manually enter the data or make the exact iterations on the data set, that would be a good thing. It does not have that feature. None of the solutions provides this feature, but this is the feature that we are looking for. If we could bifurcate the data or do manual manipulation of data at any point in time, it would be a game changer.
Its initial setup could also be a bit easier.
For how long have I used the solution?
I used this solution for about a year.
What do I think about the stability of the solution?
It's a stable product. We used it for about a year, and we hardly had to shut it down.
What do I think about the scalability of the solution?
We are a medium enterprise. We only have three departments in our company, and only one of the departments is using it. Salespeople don't use it. The development people don't use it. We are the ones using it, and our job is to process the information, so only one department is using the solution. We have about 18 people in the department.
It's a good choice for up to medium-sized enterprises. You can scale from one million to ten million data files, but I don't believe they offer the service for a hundred million or one billion datasets. It isn't scalable enough for large enterprises, but for small and medium enterprises, it's good.
How are customer service and support?
I'd rate them an eight out of ten. The only reason for not giving them a ten out of ten is that if you're doing very important work and you need to get the solution the same day, it's a bit tough to have the team support you in a very short period of time. They usually give you appointments about a day or two days later. Other than that, everything is good.
How would you rate customer service and support?
Positive
Which solution did I use previously and why did I switch?
We were using another solution previously. The major reason for switching to StreamSets was that we needed to scale our operations. Our prior solution could have been scaled, but the cost of scaling was a bit higher. We would have had to hire one more person to be able to scale, but we did not want to hire more people, so we decided to use a completely automated solution for this part so that it could be handled by only one of our team members. That was the primary requirement. The cost-benefit analysis was done by one of our peers. His proposal was pretty good, and everyone agreed to it.
How was the initial setup?
Its initial setup is a bit tough. You need to have the technical expertise to do that. The support team is good. They help you around, but if they could make it a bit easier, it would be better.
I believe it operates only from the cloud. We also received the data from our associations on the cloud. We processed it on the cloud, and everything happened on the cloud.
The initial setup was complex because we were not able to directly link the data we were receiving with the StreamSets solution. Linking it required us to fill in or enter some information in StreamSets, but we were not able to figure out what to enter. For that part, we needed their help.
We spent about a week. For the first three days, our team members were trying their best to do it, but then we had to schedule a meeting with them. In terms of the number of people, only one person was working with our team, and there were three people working with the product. I was also involved in the product as a product manager, but I was not directly operating that system.
It didn't require any maintenance as such. Any maintenance activities were related to our side of things. There were mistakes on our end. When we were entering different data, we had to do different configurations in the system.
What was our ROI?
We did the cost-benefit analysis before buying the solution, and it performed even better than that. We were able to replace two of our staff members who were doing this work. The cost that we paid for this solution was considerably less than their salaries, so on the cost-benefit side of things, it was a good deal. We saved about two people's wages, which is about $6,000 a month, and we also saved about 15% of a week's time. These two were the biggest returns on the investment. The accuracy was also a bit higher.
What's my experience with pricing, setup cost, and licensing?
Its pricing is pretty much up to the mark. For smaller enterprises, it could be a big price to pay at the initial stage of operations, but once you have Series B or Series C funding, want to scale up your operations, and aren't too worried about funds, you need a solution that can scale. At the same time, you need a solution that you aren't committed to on a very long-term basis. This solution could not be applied if we were operating with all the hospital chains in the US; we were operating with just one hospital. That's why it worked pretty well, so for medium enterprises, I believe it's very good.
What other advice do I have?
To those evaluating StreamSets, I'd advise doing a cost-benefit analysis because the way of using StreamSets differs from person to person. Someone else might have a very different use case, and they may not run into profit using the solution. For us, it was a good solution because we were hiring people for this work. People were doing the job manually. We saved both time and money, so doing a cost-benefit analysis would be the best thing.
If you are looking to expand your domain or range of operations, StreamSets is very helpful. If you are just looking for a better data analytics tool that can do bifurcation on data, I believe there are other tools or services available in the market that do not focus on the expansion of operations. They focus on doing better and more complex bifurcations.
StreamSets enables you to build data pipelines without knowing how to code. After a certain point, you have to enter some basic syntax or code, but generally, one can do a lot of no-code work. This was not an important aspect for us because we were operating in the IT space, and our entire team was capable of entering all the syntax that was required. It was not an issue for us at any point in time. In fact, in the operations that we were performing, we only used code. When we were testing out our initial datasets, we used some of the no-code features that were there, but at the later stages, we used only code.
We did not connect to the messaging systems, but we connected some enterprise databases. We were operating with a set of hospitals in the US, and we had to connect with them only the first time. Afterward, it was the data that was passing through the pipeline. Initially, for a completely new user, it's a bit tricky. Some technical expertise is required. It's a bit tough, but because the support team is there, one would be able to do it.
Overall, I would rate StreamSets an eight out of ten.
Disclosure: PeerSpot contacted the reviewer to collect the review and to validate authenticity. The reviewer was referred by the vendor, but the review is not subject to editing or approval by the vendor.

CEO-founder at Tubayo
Data streams and pipelines help our team identify areas for improvement in our solution
Pros and Cons
- "One of the things I like is the data pipelines. They have a very good design. Implementing pipelines is very straightforward. It doesn't require any technical skill."
- "Sometimes, it is not clear at first how to set up nodes. A site with an explanation of how each node works would be very helpful."
What is our primary use case?
We use it for building a data lake for our content. We have sales multiple times during the day, and a sale is the trigger; the sales data uses the lake as a landing zone. We also use it for various types of data transformation.
How has it helped my organization?
It enables us to create data streams and pipelines that our team can use to identify areas for improvement. Our marketing team can read the data generated on sales to understand how we can integrate our product and focus on the areas in which we need more improvement. By the end of the day, we have an improved solution.
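As a rough illustration of the flow described above, here is a minimal Python sketch of a sale event landing in a date-partitioned landing zone that downstream teams can read; the event fields and folder layout are made up, and the real pipeline is configured in StreamSets rather than hand-coded.

```python
import json
from datetime import datetime, timezone
from pathlib import Path

LANDING_ZONE = Path("landing_zone")  # hypothetical local stand-in for the data lake

def on_sale(event: dict) -> Path:
    """Write a raw sale event into a date-partitioned landing-zone folder."""
    now = datetime.now(timezone.utc)
    partition = LANDING_ZONE / f"sales/date={now:%Y-%m-%d}"
    partition.mkdir(parents=True, exist_ok=True)
    path = partition / f"sale_{now:%H%M%S%f}.json"
    path.write_text(json.dumps(event))
    return path

if __name__ == "__main__":
    print(on_sale({"order_id": 123, "amount": 42.50, "channel": "web"}))
```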
The lack of coding makes work easier and faster, and after creating a template you can immediately transform any source. It saves a lot of time and makes things efficient. You complete things on time.
The impact that it has had on my company is that when we have a variety of data that we want to convert or transform, StreamSets is helpful. We can store a large amount of data, transfer various data from different departments, and use the analysis to understand how to improve our business.
And because it's a service, it's very helpful to me as a CEO. It's serverless and secure.
In addition, the data drift resilience has reduced the time it takes to fix data drift breakages by 35 percent. Overall, StreamSets, as a solution, saves me about 45 percent of time, and has reduced workload by 25 percent. It also saves me about $500 a month.
Another benefit is that breaking down sums of data gives you the ability to create graphical reports and present them to any team, and they will be understood.
What is most valuable?
One of the things I like is the data pipelines. They have a very good design. Implementing pipelines is very straightforward. It doesn't require any technical skill.
We have also integrated it with Kafka messaging, and it is not complex to do. It is really easy to connect or integrate with data interfaces. And moving data into analytics platforms using StreamSets is easy. It doesn't require any coding, meaning you can transfer or move data without coding skills. It's a good choice for someone who is just beginning and doesn't have any knowledge, because it's quite easy.
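For context, the hand-written equivalent of that kind of Kafka integration would look roughly like the sketch below, which uses the kafka-python client to read sale events and append them to a local file standing in for the analytics platform; the topic name and broker address are assumptions, and in StreamSets this is configured as a Kafka origin stage instead.

```python
import json
from kafka import KafkaConsumer  # pip install kafka-python

# Hypothetical topic and broker; in StreamSets this is a configured Kafka origin.
consumer = KafkaConsumer(
    "sales",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

# A local JSON-lines file stands in for the analytics platform's landing table.
# The loop runs until interrupted.
with open("analytics_feed.jsonl", "a", encoding="utf-8") as out:
    for message in consumer:
        record = message.value
        record["source_partition"] = message.partition  # keep some Kafka metadata
        out.write(json.dumps(record) + "\n")
        out.flush()
```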
What needs improvement?
Sometimes, it is not clear at first how to set up nodes. A site with an explanation of how each node works would be very helpful.
Also, it doesn't provide a very good user experience.
For how long have I used the solution?
I have been using StreamSets for three years.
What do I think about the stability of the solution?
It is stable. I've never seen any downtime.
How are customer service and support?
Their technical support is very supportive. They really know what to do, and they are very good people, very friendly.
How would you rate customer service and support?
Positive
How was the initial setup?
It took me three days to deploy it. I did it on my own. We use it in two departments in one location and there are four users.
There is no maintenance of the solution on our side.
What was our ROI?
Since I implemented StreamSets, we have generated more sales, on the order of 50 percent more.
What's my experience with pricing, setup cost, and licensing?
The pricing is affordable for any business.
What other advice do I have?
The transformation logic is a bit complex when you begin and you may need to read the documentation. When you create logic, you have to be sure of the scenarios in the logic.
Any company that is looking for data engineering should use StreamSets because the pricing is quite favorable. I would recommend it.
Which deployment model are you using for this solution?
Public Cloud
If public cloud, private cloud, or hybrid cloud, which cloud provider do you use?
Amazon Web Services (AWS)
Disclosure: PeerSpot contacted the reviewer to collect the review and to validate authenticity. The reviewer was referred by the vendor, but the review is not subject to editing or approval by the vendor.
Enterprise Solutions Architect at an energy/utilities company with 1,001-5,000 employees
Quite simple to use for anybody who has an ETL or BI background
Pros and Cons
- "StreamSets data drift feature gives us an alert upfront so we know that the data can be ingested. Whatever the schema or data type changes, it lands automatically into the data lake without any intervention from us, but then that information is crucial to fix for downstream pipelines, which process the data into models, like Tableau and Power BI models. This is actually very useful for us. We are already seeing benefits. Our pipelines used to break when there were data drift changes, then we needed to spend about a week fixing it. Right now, we are saving one to two weeks. Though, it depends on the complexity of the pipeline, we are definitely seeing a lot of time being saved."
- "Currently, we can only use the query to read data from SAP HANA. What we would like to see, as soon as possible, is the ability to read from multiple tables from SAP HANA. That would be a really good thing that we could use immediately. For example, if you have 100 tables in SQL Server or Oracle, then you could just point it to the schema or the 100 tables and ingestion information. However, you can't do that in SAP HANA since StreamSets currently is lacking in this. They do not have a multi-table feature for SAP HANA. Therefore, a multi-table origin for SAP HANA would be helpful."
What is our primary use case?
We are using the StreamSets DataOps platform to ingest data to a data lake.
How has it helped my organization?
Our time to value has improved because our development time has been considerably reduced. The major benefit that we are getting out of the solution is the ability to easily retrain and upskill a person who already has an ETL or BI background. We don't need to specifically look for people who know programming or have worked on Python, DataOps, or a DevOps sort of functionality. In the market, it is easier to find people with ETL or BI skills than people with hardcore DevOps or programming skills. That is the major benefit that we are getting out of moving to a GUI-based tool like StreamSets. How quickly we deliver to our customers, as well as our ability to ingest into the data lake, have improved a lot by using this tool.
What is most valuable?
The types of the source systems that it can work with are quite varied. There are numerous source systems that it can work with, e.g., a SQL Server database, an Oracle Database, or REST API. That is an advantage we are getting.
The most important feature is the Control Hub that comes with the DataOps Platform and does load balancing. So, we do not worry about the infrastructure. That is a highlight of the DataOps platform: Control Hub manages the data load to various engines.
It is quite simple for anybody who has an ETL or BI background and has worked on any ETL technologies, e.g., IBM DataStage, SAP BODS, Talend, or CloverETL. In terms of experience, the UI and concepts are very similar to how you develop your extraction pipelines. Therefore, it is very simple for anybody who has already worked on an ETL tool set, whether for data ingestion, ETL pipelines, or data lake requirements.
We use StreamSets to load data into AWS S3 and Snowflake databases, which is then consumed by Power BI or Tableau. It is quite simple to move data into these platforms using StreamSets. There are a lot of tools and destination stages within StreamSets for Snowflake, Amazon S3, any database, or an HTTP endpoint. It is just a drag-and-drop feature, which saves a lot of time compared to writing custom code in Python. StreamSets enables us to build data pipelines without knowing how to code, which is a big advantage.
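To show the sort of custom code a drag-and-drop destination stage replaces, here is a bare-bones boto3 upload of a local extract to S3; the bucket and key are hypothetical, and a real hand-rolled pipeline would also need retries, partitioning, and monitoring around this one call.

```python
import boto3  # pip install boto3; credentials come from the usual AWS config/env

s3 = boto3.client("s3")

# Hypothetical bucket and key layout for a daily extract.
BUCKET = "example-data-lake"
KEY = "raw/orders/2025-07-01/orders.csv"

# Upload a local extract file; this single call is roughly what a
# drag-and-drop S3 destination stage configures for you, minus the
# error handling, partitioning, and drift handling around it.
s3.upload_file("orders.csv", BUCKET, KEY)
print(f"Uploaded to s3://{BUCKET}/{KEY}")
```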
The data resilience feature is good enough for our ETL operations, even for our production pipelines at this stage. Therefore, we do not need to build our own custom framework for it since what is available out-of-the-box is good enough for a production pipeline.
StreamSets' data drift feature gives us an alert upfront so we know that the data can be ingested. Whatever the schema or data type changes are, the data lands automatically in the data lake without any intervention from us, but that information is crucial for fixing downstream pipelines, which process the data into models, like Tableau and Power BI models. This is actually very useful for us. We are already seeing benefits. Our pipelines used to break when there were data drift changes, and then we needed to spend about a week fixing them. Right now, we are saving one to two weeks. Though it depends on the complexity of the pipeline, we are definitely seeing a lot of time being saved.
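A minimal, hypothetical sketch of the kind of schema check that raises such an alert is shown below; the expected schema and record shape are assumptions, and this is not how StreamSets implements it internally.

```python
EXPECTED_SCHEMA = {"patient_id": int, "visit_date": str, "amount": float}  # hypothetical

def detect_drift(record: dict, expected: dict) -> list[str]:
    """Return human-readable notes about columns that were added, dropped, or retyped."""
    notes = []
    for column in record.keys() - expected.keys():
        notes.append(f"new column: {column}")
    for column in expected.keys() - record.keys():
        notes.append(f"missing column: {column}")
    for column in record.keys() & expected.keys():
        if not isinstance(record[column], expected[column]):
            notes.append(f"type change in {column}: {type(record[column]).__name__}")
    return notes

if __name__ == "__main__":
    incoming = {"patient_id": 1, "visit_date": "2025-07-01", "amount": "120.50", "ward": "ICU"}
    for note in detect_drift(incoming, EXPECTED_SCHEMA):
        print("ALERT:", note)
```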
What needs improvement?
One area for improvement is probably the GUI. It is pretty basic, and a lot of improvement is required there.
In terms of security, from an architecture perspective, when we want to implement something, and because our organization is very strict when it comes to cybersecurity, we have been struggling a bit because the platform has a few gaps. Those gaps are really gaps based on our organization's requirements. These are not gaps on StreamSets' side. The solution could improve a lot in terms of having more features added to the security model, which would help us.
There are quite a few features that we wanted. One is SAP HANA support. Currently, we can only use a query to read data from SAP HANA. What we would like to see, as soon as possible, is the ability to read from multiple tables in SAP HANA. That would be a really good thing that we could use immediately. For example, if you have 100 tables in SQL Server or Oracle, you can just point it to the schema or the 100 tables and ingest the information. However, you can't do that with SAP HANA, since StreamSets is currently lacking in this area. They do not have a multi-table feature for SAP HANA. Therefore, a multi-table origin for SAP HANA would be helpful.
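To make the request concrete, a multi-table origin conceptually behaves like the loop sketched below, enumerating a schema's tables and ingesting each one, instead of one hand-written query per pipeline; the schema and table names are made up, and this is generic Python rather than the product's implementation.

```python
# Hypothetical illustration of what a multi-table origin does conceptually:
# given a schema, enumerate its tables and ingest each one, instead of
# configuring one pipeline per hand-written query.

def list_tables(schema: str) -> list[str]:
    # In a real origin this comes from the database catalog; hard-coded here.
    return ["PATIENTS", "VISITS", "BILLING"]

def ingest(schema: str, table: str) -> None:
    query = f'SELECT * FROM "{schema}"."{table}"'
    print(f"Ingesting with: {query}")  # stand-in for reading rows and landing them

def multi_table_ingest(schema: str) -> None:
    for table in list_tables(schema):
        ingest(schema, table)

if __name__ == "__main__":
    multi_table_ingest("HOSPITAL_SCHEMA")
```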
For how long have I used the solution?
I have been using it for the past 12 months.
What do I think about the stability of the solution?
I have no concerns in terms of the application's core stability. We haven't had any major outages as such, and even if we had one, those were internal and related to our network, proxy, or firewall. As someone who implemented it and has been working on it day in, day out, sometimes 24/7, I am quite confident with the stability of the solution.
As with any application, it requires periodical maintenance, at least to do an upgrade. That maintenance is to simply upgrade the product, and nothing more than that.
What do I think about the scalability of the solution?
A core feature of the DataOps Platform is you can easily scale through engines when you have more pipelines running and data to process. So, if you would need to purchase more engines or cores, it is quite scalable. That is a major advantage that we are getting.
In the Control Hub Platform, the orchestration and load balancing are quite scalable. You don't need to fiddle with the existing solution. Everything is run on another engine that gets hooked up automatically to Control Hub, which makes it seamless.
We have developed a sort of template in StreamSets, where you have just one template and can point it to any source system. You can just start ingesting, which has greatly reduced the time needed to build new pipelines.
How are customer service and support?
They are quite good and responsive. We have a dedicated support portal for StreamSets. We have authorized members who can raise support tickets using the portal, including myself. They have a quick turnaround with good responses, so we are quite happy as of now. I would rate the technical support between 7.5 and 8 out of 10.
How would you rate customer service and support?
Positive
Which solution did I use previously and why did I switch?
We previously developed our own custom platform. We switched because maintaining a custom platform is difficult; we are not a product team, but an energy company that serves business customers. Another issue was that the custom platform was written programmatically, so you need a lot of people with programming knowledge, both to maintain it and to use it.
The time to value is quite a critical KPI. Before, when our business needed data quickly on the platform, our previous solutions struggled to get it. Thus, our time to value has improved a lot and our customers are happy because they are able to get the data quickly.
How was the initial setup?
I was there right from the start when they adopted an open-source version. Late last year, we moved to an enterprise version, i.e., the DataOps platform. So, I worked on the 3.2.2 version, and now I am working on the 5.0 version, which is the enterprise license version.
The implementation is straightforward, except for a few hiccups with known network, process, and firewall issues. Other than that, it was a very simple, lean implementation.
Because we had a lot of firewall issues and issues with our optimization, it took probably four weeks for us to get things running. However, if you exclude the issues, it took probably a week to a week and a half to get things up and running.
We are working, as a separate piece of the project, to migrate whatever is running in our existing custom platform to StreamSets. From a certain date, we started to work purely on StreamSets. For any future ingestion requirements, we are using StreamSets DataOps platform. However, the previous platform is inactive at the moment. We are only using it for existing pipelines, and the plan is to migrate them to the DataOps platform this year very soon.
What about the implementation team?
Two people were needed for the deployment of this solution: a cloud engineer and a senior data engineer.
What was our ROI?
First, it has saved us a lot of time because we do not need to come up with our own custom platform, which is a huge expenditure in building and maintaining the custom platform. Second, even if we go for other products in the market, there are lots of gaps with the other products. Even if we picked up another product, we would have to customize it. An off-the-shelf product is not enough to meet our needs. Therefore, StreamSets has definitely helped us in getting the information into our data lake very quickly, in terms of ingestion.
The most important thing is it has helped us from a resourcing point of view. You can easily upskill a BI or ETL resource without any programming knowledge to work with this. That is a major advantage that we are getting since we have a lot of ETL people who do not have programming knowledge. They have vast ETL experience working with GUI-based tools, and StreamSets is really useful for them.
It has drastically reduced the time that we are spending on workloads by 60% to 70% as well as reducing the time spent on ingestion by 30%.
What's my experience with pricing, setup cost, and licensing?
It has a CPU core-based licensing, which works for us and is quite good.
Which other solutions did I evaluate?
We did evaluate other solutions. It was not a quick decision for us to take this product. We evaluated other products in the market, but they were either not close to StreamSets or not in the data integration space. One thing that caught our attention with StreamSets was the range of source systems it could work with. Secondly, the Control Hub DataOps platform manages the load balancing and so on. We were quite interested in that since we would not need to maintain it ourselves. The third most important thing was that you can create job templates in StreamSets. This means you create a template for a particular type of ingestion; going forward, you just change the parameters and can point it to any source. This means there is less pipeline development, and we can quickly ingest data into the data lake. Those are the features that we were interested in and why we switched to StreamSets.
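As a rough illustration of the job-template idea, here is a small Python sketch in which one ingestion template is re-pointed at different sources by changing parameters; the parameter names and connection strings are made up.

```python
# One ingestion "template" reused for many sources by swapping parameters.
# Parameter names and connection strings are made up for illustration.

TEMPLATE_DEFAULTS = {
    "fetch_size": 1000,
    "target": "s3://example-data-lake/raw/",
}

def build_job(source_name: str, jdbc_url: str, table: str, **overrides) -> dict:
    job = dict(TEMPLATE_DEFAULTS)
    job.update({"source_name": source_name, "jdbc_url": jdbc_url, "table": table})
    job.update(overrides)
    return job

jobs = [
    build_job("billing", "jdbc:sqlserver://bill-db:1433", "dbo.invoices"),
    build_job("crm", "jdbc:oracle:thin:@crm-db:1521/ORCL", "CRM.CONTACTS", fetch_size=5000),
]

for job in jobs:
    print(f"run ingestion for {job['source_name']} -> {job['target']}", job)
```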
There is actually a gap in the entire data integration market at the moment, and StreamSets Data Collector is trying to fill that gap. The reason is that most data ingestion has to be done through programming languages, like Python or Java. We currently do not have a GUI-based tool set that is as robust as StreamSets. That is what I found out in the lab over the last 12 months. There are new products coming up, but it will still be a few more years until they are stabilized, whereas StreamSets is already there to solve your immediate data ingestion requirements.
What other advice do I have?
Every tool in the market at the moment has some major gaps, especially for large enterprises. It could be the way that the data or pipeline is secured. At present, StreamSets looks like the market leader and is trying to fill that gap. For anyone going through a proof of concept for various tools, StreamSets is almost at the top. I don't think that they need to look any further.
We are working only with API, a relational database management system, and our enterprise warehouses at the moment. We are not using any streaming sort of ingestion at the moment.
We are not using the Transformer for Snowflake yet; it just got released. We are using a traditional Snowflake destination stage because our enterprise is huge and we have our own Snowflake architecture. We load the data securely into our own databases using the destination stage, not the Transformer yet.
I would rate the solution as 7.5 out of 10.
If public cloud, private cloud, or hybrid cloud, which cloud provider do you use?
Amazon Web Services (AWS)
Disclosure: PeerSpot contacted the reviewer to collect the review and to validate authenticity. The reviewer was referred by the vendor, but the review is not subject to editing or approval by the vendor.
Product Marketer at a media company with 1,001-5,000 employees
We have been able to eliminate the vast majority of our break/fix costs and maintenance time
Pros and Cons
- "The entire user interface is very simple and the simplicity of creating pipelines is something that I like very much about it. The design experience is very smooth."
- "One area for improvement could be the cloud storage server speed, as we have faced some latency issues here and there."
What is our primary use case?
Our major use case with StreamSets is to build data pipelines from multiple sources to multiple destinations. We mainly use the StreamSets Data Collector Engine for seamless streaming from any source to any destination.
We also use it to deliver continuous data for database operations and modern analytics.
How has it helped my organization?
One great thing is that now, with the implementation of StreamSets, we have been able to eliminate about 80 percent of our break/fix costs and maintenance time. It is very easy to connect with streaming platforms and streaming services.
Also, we can integrate and stream databases by connecting with multiple streaming services. Before StreamSets, data transfer from source to destination took about three hours and was prone to errors. Now, with the introduction of StreamSets, we primarily use the Data Collector, and this has enabled us to complete the same job in less than 30 minutes. We save that much time per day, or about 15 hours per week.
Another definite benefit is that it has helped us to break down data silos within our organization. We are able to work together through StreamSets. Previously, the data silos were a real problem because data would come from multiple, scattered sources. We were not able to consolidate it on time, and we were not able to pinpoint errors exactly. But StreamSets has helped us streamline the use of multiple sources and destinations, completely eliminating the silos. That saves us a lot of time, and we have greatly reduced the number of errors.
What is most valuable?
The most valuable features of StreamSets, for me, are the Data Collector and the Control Hub platform. They are both very straightforward to use and user-friendly. And with the Data Collector and Control Hub, we get canvas selection for designing all our pipelines, which is very intuitive and useful for us.
In fact, the entire user interface is very simple and the simplicity of creating pipelines is something that I like very much about it. The design experience is very smooth. A great thing about StreamSets is that it is a single, centralized platform. All our design-pattern requirements are met with a single design experience through StreamSets.
We can also easily build pipelines with minimal coding and minimal technical knowledge. It is very easy to start and very easy to scale as well. That is very important to me, personally, because I'm from a non-technical background. One of the most important criteria was for me to be able to use this platform efficiently.
Also, moving data to modern analytics platforms is very straightforward. That is why StreamSets is one of the top players in the market right now.
And one of the major advantages for us is the built-in functionality. StreamSets has a plethora of features that combine well with ETL.
What needs improvement?
In terms of features, I don't have any complaints so far. But one area for improvement could be the cloud storage server speed, as we have faced some latency issues here and there.
For how long have I used the solution?
I have been using StreamSets for about eight months.
What do I think about the stability of the solution?
It is stable. It's a cloud-based solution, so there is a little bit of latency, some server speed issues, but apart from that, there is no question about the stability of the solution.
What do I think about the scalability of the solution?
The platform is definitely scalable.
Maybe in the future we will increase our usage of StreamSets, but I don't see any immediate scalability requirements for us.
How are customer service and support?
I have not contacted their customer support, but my team contacts them. From what I understand they have a pretty healthy conversation with the StreamSets customer support. All of our queries are sent via email and they get them sorted out. They also join Google Meet sessions or calls, if required, to sort out our queries. It has been a very smooth journey so far. I don't have any complaints with regard to their customer service.
How would you rate customer service and support?
Positive
Which solution did I use previously and why did I switch?
StreamSets is the first solution that we are using in this space.
How was the initial setup?
I was not fully involved in the initial implementation, but we did the implementation in phases. We wanted to get it on board as soon as possible, so instead of doing a complete implementation, we did it in phases and it didn't take a lot of time. We were able to get on with the work as soon as possible with this model.
The initial setup was simple. We didn't require any additional training or third-party vendors. We were able to do it along with the StreamSets team, so it was smooth for us.
We have 15 people using StreamSets, all at one location. They are developers and users.
Because it is a cloud platform there isn't much maintenance required other than server updates, but that is expected with any cloud platform. No extensive maintenance is required. We have a team of two people who maintain it and handle updates and all the latest releases.
What was our ROI?
Tasks that took three hours can now be done in less than 30 minutes. This is one of the prime data points in terms of ROI for this product.
In terms of money saved, we still haven't seen any direct results from StreamSets. With its automation, we are able to focus on other tasks because StreamSets is taking care of the operations side. Theoretically, it should save us some money but it hasn't until now. We still have the same number of employees.
We are moving in a positive direction. Hopefully, this trend continues. We were able to see the time savings and reduced errors within three months of deployment.
What's my experience with pricing, setup cost, and licensing?
There are two editions, Professional and Enterprise, and there is a free trial. We're using the Professional edition and it is competitively priced. I wouldn't say it's cheap or moderate, but it's also not a high price.
What other advice do I have?
We have been experimenting with Hadoop, but apart from that, we do not use it to establish a connection with other services. As an organization, we have not faced any issues with connectivity using StreamSets. The platform is very stable.
Overall, StreamSets is very efficient and effective. It has helped us save a lot of time and also reduced errors a lot. I would definitely rate it very highly. The major reason is that it gives us a single, centralized platform for all our design-pattern requirements and we are able to produce results efficiently. With StreamSets, we are able to transfer or stream data from any source to any destination. It has increased the overall efficiency of our organization.
Software AG is constantly improving and evolving the product, and that is something that I like: using a product that is ever-evolving and being upgraded.
After deploying StreamSets, I learned a lot about how data planning works and how easy it is to stream from multiple sources to multiple destinations. That is one of my major takeaways. I thought it would be a very complex task, but that myth was broken by StreamSets. The complexity was made very simple for me.
My advice is to try the free edition. It's a very user-friendly and intuitive product as well. Try it to get a grasp of what's happening inside the product. Once you try the free edition, you'll definitely go for the Professional edition. I don't have any doubt about that. The product itself will lure you. That is the power of the product.
Which deployment model are you using for this solution?
Public Cloud
If public cloud, private cloud, or hybrid cloud, which cloud provider do you use?
Amazon Web Services (AWS)
Disclosure: PeerSpot contacted the reviewer to collect the review and to validate authenticity. The reviewer was referred by the vendor, but the review is not subject to editing or approval by the vendor.
AI Engineer at Techvanguard
A no-code solution with a drag-and-drop UI, but the execution engine should be better
Pros and Cons
- "The most valuable would be the GUI platform that I saw. I first saw it at a special session that StreamSets provided towards the end of the summer. I saw the way you set it up and how you have different processes going on with your data. The design experience seemed to be pretty straightforward to me in terms of how you drag and drop these nodes and connect them with arrows."
- "The execution engine could be improved. When I was at their session, they were using some obscure platform to run. There is a controller, which controls what happens on that, but you should be able to easily do this at any of the cloud services, such as Google Cloud. You shouldn't have any issues in terms of how to run it with their online development platform or design platform, basically their execution engine. There are issues with that."
What is our primary use case?
I was working on an integration project where I was using the StreamSets platform. I was looking at both their data collector and their transformer. The idea was to integrate it with AWS SageMaker Canvas. Both of them are what they call no-code options. StreamSets is for data pipelining, managing your data flow, and transforming your data. SageMaker is AWS, and Canvas is basically their no-code option for machine learning.
I was trying to connect it to a data object repository. For AWS, that's a specific managed service called S3. I wasn't trying to run it with a data warehouse.
How has it helped my organization?
It's still in the trial stage. I don't get a 30-day trial period or anything like that. I just got to write about what's involved and then see if that's something that justifies the use case for going ahead and purchasing the license for it.
It enables you to build data pipelines without knowing how to code. It abstracts away the need for Spark or anything like that. This ability is highly important because it reduces development time.
It saves time because you don't have to write code.
It saves money by not having to hire people with specialized skills. You don't need Spark or anything like that for doing the same thing.
It helps to scale your data operations. You can get to the execution engine and provision bigger machines or bigger clusters. You can scale out to however much data you need to scale out to.
What is most valuable?
The most valuable would be the GUI platform that I saw. I first saw it at a special session that StreamSets provided towards the end of the summer. I saw the way you set it up and how you have different processes going on with your data. The design experience seemed to be pretty straightforward to me in terms of how you drag and drop these nodes and connect them with arrows.
What needs improvement?
The execution engine could be improved. When I was at their session, they were using some obscure platform to run. There is a controller, which controls what happens on that, but you should be able to easily do this at any of the cloud services, such as Google Cloud. You shouldn't have any issues in terms of how to run it with their online development platform or design platform, basically their execution engine. There are issues with that.
It can break down data silos within the organization. One person can do the whole thing with StreamSets and SageMaker Canvas, but it hasn't yet had any effect on our operations or business because it's one of those situations where you can either get a demo from them, or you have to go to one of these sessions where they give you temporary credentials and you try to work with your use case. Personally, I would change their model a bit and give a two-week trial license for a cloud platform at the very least. You could then try to get something to work or call up their technical department and say, "Look, I've been evaluating this thing for the last few days. I don't know exactly how to resolve this issue."
For how long have I used the solution?
I started using it in June of this year.
What do I think about the stability of the solution?
The whole issue of the execution engine needs to be better resolved. If you pick a cloud, why isn't it working with this cloud? Or what do I need to do to get it to work with one specific cloud service if it can be deployed across multiple clouds?
What do I think about the scalability of the solution?
It seems pretty highly scalable to me. That's not going to be an issue. Just the administration of it could be an issue.
It's currently being used in a dev department for machine learning. It's being used by the business analyst team.
How are customer service and support?
I haven't contacted their support.
Which solution did I use previously and why did I switch?
AWS has native solutions. There are AWS Data Wrangler and others that come bundled with their services, like AWS Glue. We haven't yet switched to StreamSets. It's still in the evaluation stage, but the no-code and the drag-and-drop option with a GUI are some of the things that seem to resonate with people.
How was the initial setup?
I was involved in its setup. I was the one who basically had to try to get it to run with whatever process or custom processor I developed.
It was complex to set up. I had to go to the sessions. On a couple of occasions, I was doing it directly from the cloud platform, and apparently, that wasn't the way to do it. You have to go through their universal designer platform first.
In terms of maintenance, once you're deployed from the cloud, that's all handled for you. It's managed for you directly from the cloud service. So, you don't have to worry about that. They maintain their design platform.
What about the implementation team?
I didn't use any consultant.
What's my experience with pricing, setup cost, and licensing?
I didn't get into that with the StreamSets representative. It seems to be pay-as-you-go, but I don't know exactly how they do it.
Which other solutions did I evaluate?
Alteryx is another option. It's a similar tool, and it looks almost the same as StreamSets. Alteryx is something that's available for any cloud. It doesn't matter which cloud. You go on the various clouds, and you look and see what they have.
What other advice do I have?
To those evaluating this solution, I would advise looking into how it integrates with the cloud service that they're going to try it with. Does it naturally integrate better with AWS or Azure? It's one of those situations.
I used StreamSets' ability to move data into a modern analytics platform. That's what AWS SageMaker Canvas is; it's like predictive analytics. In terms of the ease of moving data into this analytics platform, doing the design on the StreamSets platform is one thing, but having the execution engine and getting it provisioned is a totally different ball game. Basically, that's where its limitation comes in.
Overall, I would rate it a seven out of ten. The issue that was never resolved for me was how the integration works if you're running a compute or execution engine on AWS versus Azure versus GCP, because that has nothing to do with StreamSets. That is outside of StreamSets; you're dealing with the cloud service, and there's a good reason for that.
Which deployment model are you using for this solution?
Public Cloud
If public cloud, private cloud, or hybrid cloud, which cloud provider do you use?
Amazon Web Services (AWS)
Disclosure: PeerSpot contacted the reviewer to collect the review and to validate authenticity. The reviewer was referred by the vendor, but the review is not subject to editing or approval by the vendor.
Senior Network Administrator at an energy/utilities company with 201-500 employees
Helped us break down data silos and produce better, up-to-date reports, as well as save money
Pros and Cons
- "The most valuable feature is the pipelines because they enable us to pull in and push out data from different sources and to manipulate and clean things up within them."
- "The design experience is the bane of our existence because their documentation is not the best. Even when they update their software, they don't publish the best information on how to update and change your pipeline configuration to make it conform to current best practices. We don't pay for the added support. We use the "freeware version." The user community, as well as the documentation they provide for the standard user, are difficult, at best."
What is our primary use case?
We use the whole Data Collector application.
How has it helped my organization?
We now consume hundreds of terabytes more data than we did before we had StreamSets. It has definitely enabled us to do things a lot faster and be a lot more agile, with a lot more data consumption and a lot more reporting.
Another benefit is that it has helped us to break down data silos. We now consume data across different silos and then we aggregate it together so that we can do reporting that is not just for that one silo of people but for a number of different people across the entire organization. That has had a positive effect, enabling us to save money, spend money more effectively, and have more up-to-date data in reports, as well as in auditing. Our safety processes are better too.
One way we have saved money is thanks to how the solution streamlines the data that we pull in, data that we weren't pulling in before.
StreamSets allows more people to know what's going on. It helps us with better allocation of resources, better allocation of staff, and right-sizing. We're in oil and gas and, in our case, it allows us to optimize what we're pulling out of the ground and then what we're selling.
It has helped to scale our data operations and as a result, in addition to saving money and right-sizing, it's helped our field operations and provided us with more management reporting.
Also, the data drift resilience reduces the time it takes to fix data drift breakages.
What is most valuable?
The most valuable feature is the pipelines because they enable us to pull in and push out data from different sources and to manipulate and clean things up within them.
We use StreamSets to connect to enterprise data stores, including OLTP databases and Hadoop. Connecting to them is pretty easy. It's the data manipulation and the data streaming that are the harder parts behind that, just because of the way the tool is written.
What needs improvement?
The design experience is the bane of our existence because their documentation is not the best. Even when they update their software, they don't publish the best information on how to update and change your pipeline configuration to make it conform to current best practices.
We don't pay for the added support. We use the "freeware version." The user community, as well as the documentation they provide for the standard user, are difficult, at best.
However, we have a couple of people in-house here who are experts in data analysis and they have figured out how to use this tool. We have to have people who are extremely skilled to go in and write the pipelines for this software because it's so complicated. The software works great for us, but there is an extremely steep learning curve because they don't provide a lot of information outside of paying their ridiculous support costs. Their support starts at $50,000 a year and up.
Also, the built-in data drift resilience for ETL operations requires a bunch of custom code development to be able to handle that. It's somewhat difficult because you have to customize it a fair amount.
I also would like a more user-friendly interface and better error-trap handling.
For how long have I used the solution?
We have been using StreamSets for about four years.
What do I think about the stability of the solution?
We just patched ourselves up to the latest release about a month ago, so it's actually pretty stable at this point. It used to be quite buggy, going back over the last little while, but it's pretty stable now.
What do I think about the scalability of the solution?
This software is very scalable.
Which solution did I use previously and why did I switch?
We did not have a previous solution.
How was the initial setup?
The initial setup was somewhere between straightforward and complex. It was pretty straightforward to start with, but then it started ramping up to be more difficult as we wanted to add more stuff in.
The difficulty depends upon your data sources. If you have just one data source and you want to consume a lot of different types of data from that one source, it's pretty straightforward. But when you have 20 or 25 different data sources, and you need to pipeline all that data into a couple of data warehouses so that you can use advanced data analytics software to do reporting, analysis, and notifications, it's a lot more complicated. With every data source, it becomes exponentially more complicated to manage.
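As a small, hypothetical illustration of how the configuration burden grows, each source brings its own connection details and schedule, and each source-to-warehouse mapping is another pipeline to build, monitor, and fix:

```python
# Hypothetical registry illustrating how configuration grows with sources:
# each source needs its own connection details, schedule, and per-warehouse mapping.

SOURCES = {
    "scada":     {"kind": "historian", "schedule": "*/5 * * * *"},
    "erp":       {"kind": "oracle",    "schedule": "0 * * * *"},
    "field_ops": {"kind": "rest_api",  "schedule": "*/15 * * * *"},
    # ...20+ more in the scenario described above
}
WAREHOUSES = ["analytics_dw", "reporting_dw"]

pipelines = [
    {"source": name, "warehouse": wh, **cfg}
    for name, cfg in SOURCES.items()
    for wh in WAREHOUSES
]

# Every source multiplies the number of pipelines to build, monitor, and fix.
print(f"{len(SOURCES)} sources x {len(WAREHOUSES)} warehouses = {len(pipelines)} pipelines")
```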
We spent a significant amount of time doing it, but otherwise, it was seamless because it was our own staff. We didn't have to worry about trying to find money or resource time or do any of the prep work needed to get external resources.
Ours is a single deployment, but it is used across our entire staff base of 200-plus people. We need three people for deployment and maintenance, whose responsibilities include software management, application management, and data analysis and management.
What was our ROI?
The ROI we have seen is in savings of time and money.
What's my experience with pricing, setup cost, and licensing?
We use the free version. It's great for a public, free release. Our stance is that the paid support model is too expensive to get into. They should honestly reevaluate that.
We tried to go and get them to look at their licensing and support model and they said they were not interested in reevaluating that in any way.
Which other solutions did I evaluate?
We tried to use another freeware ETL tool. It's fairly well-known. We ran it for a couple of months, but it was going to be even more difficult than StreamSets, so we chose StreamSets in the end.
What other advice do I have?
The ease of using StreamSets to move data into modern analytics platforms, on a scale of one to ten, is about a five.
The solution enables you to build data pipelines without knowing how to code if it's the latest, state-of-the-art cloud connecting stuff. If it's for anything structured for Oracle and SQL Server and other data sources, it's difficult. Without knowing how to write code, some of it's easy and some of it is not.
My advice to someone who is considering this software is to be very aware that their integrator and data analysis people will need a very specific skill set.
Which deployment model are you using for this solution?
On-premises
Disclosure: PeerSpot contacted the reviewer to collect the review and to validate authenticity. The reviewer was referred by the vendor, but the review is not subject to editing or approval by the vendor.
Principal Engineer at Tata Consultancy Services
Integrates with different enterprise systems and enables us to easily build data pipelines without knowing how to code
Pros and Cons
- "I have used Data Collector, Transformer, and Control Hub products from StreamSets. What I really like about these products is that they're very user-friendly. People who are not from a technological or core development background find it easy to get started and build data pipelines and connect to the databases. They would be comfortable like any technical person within a couple of weeks."
- "We create pipelines or jobs in StreamSets Control Hub. It is a great feature, but if there is a way to have a folder structure or organize the pipelines and jobs in Control Hub, it would be great. I submitted a ticket for this some time back."
What is our primary use case?
I worked mostly on data injection use cases when I was using Data Collector. Later on, I got involved with some Spark-based transformations using Transformer.
Currently, we are not using CI/CD. We are not using automated deployments. We are manually deploying in prod, but going forward, we are planning to use CI/CD to have automated deployments.
I worked on on-prem and cloud deployments. The current implementation is on-prem, but in my previous project, we worked on AWS-based implementation. We did a small PoC with GCP as well.
How has it helped my organization?
It is very easy to use when connecting to enterprise data stores such as OLTP databases or messaging systems such as Kafka. I have had integrations with OLTP as well as Kafka. Until a few years ago, we didn't have a good way of connecting to streaming databases or streaming products. This ability is important because most of our use cases in recent times are of a streaming nature. We have to deliver certain messages or data as per our SLA, and the combination of Kafka and StreamSets helps us meet those timelines. I'm not sure what I would have used to achieve the same five years ago. The combination of Kafka and StreamSets has opened up a new world of opportunities to explore.

I recently used orchestration, wherein you can have multiple jobs and orchestrate them. For example, you can specify that Job A runs first, then Job B, and then Job C, in an automated fashion. You don't need any manual intervention. In one of my projects, I had a data hub fed from 10 different databases. It was all automated using Kafka and StreamSets.
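The orchestration pattern described, run Job A, then Job B, then Job C with no manual intervention, can be sketched as follows; the job names and bodies are hypothetical, and a real setup would use the product's orchestration stages rather than plain Python.

```python
# Minimal sequential orchestration sketch: each job runs only after the
# previous one succeeds, with no manual intervention. Job names are made up.

def job_a():
    print("extract from source databases")

def job_b():
    print("transform and conform the extracts")

def job_c():
    print("publish to the data hub")

PIPELINE = [("Job A", job_a), ("Job B", job_b), ("Job C", job_c)]

def orchestrate(pipeline):
    for name, job in pipeline:
        print(f"starting {name}")
        try:
            job()
        except Exception as exc:  # stop the chain on failure
            print(f"{name} failed: {exc}; halting downstream jobs")
            raise
        print(f"{name} finished")

if __name__ == "__main__":
    orchestrate(PIPELINE)
```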
It enables you to build data pipelines without knowing how to code. You can build data pipelines even if you don't know how to code. You can just drag and drop. If you know how to code, you can do some custom coding as well, but you don't need to know coding to work with StreamSets, which is important if somebody in your team is not familiar with coding. The nature of coding is changing, and the number of technologies is changing. The range is so wide right now. Even if I know Java or Oracle, it may not be enough in today's times because we might have databases in Teradata. We might have Snowflake or other different kinds of databases. StreamSets is a great solution because you don't need to know all different databases or all different coding mechanisms to work with StreamSets. Rather than learning each and every technology and building your data pipelines, you can just plug and play at a faster pace.
StreamSets' built-in data drift resilience plays a part in our ETL operations. It is a very helpful feature. Previously, we had a lot of jobs coming from different source systems, and whenever there was any change in columns, we were not informed. It required a lot of changes on our end, which would take from a couple of weeks to a month. Because of the data drift feature, which is embedded in StreamSets, we don't have to spend that much time taking care of the columns and making sure they are in sync. All of this is taken care of, and we don't have to worry about it. It is a very helpful feature to have.
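Below is a minimal sketch of the column-sync behavior being described, using a simple in-memory stand-in for the target table's columns; in the real product this happens inside the pipeline rather than in user code.

```python
# Hypothetical column-sync behavior: widen the target when new columns appear
# and pad records for columns the source no longer sends, instead of failing.

target_columns = ["id", "name", "amount"]  # stand-in for the target table's schema

def align(record: dict, columns: list[str]) -> dict:
    for column in record:
        if column not in columns:
            columns.append(column)  # the "ALTER TABLE ADD COLUMN" equivalent
            print(f"added new column: {column}")
    return {column: record.get(column) for column in columns}

rows = [
    {"id": 1, "name": "a", "amount": 10.0},
    {"id": 2, "name": "b", "amount": 5.0, "currency": "USD"},  # drifted record
]

for row in rows:
    print(align(row, target_columns))
```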
StreamSets' data drift resilience reduces the time to fix data drift breakages. It has definitely saved around two to three weeks of development time. Previously, any kind of changes in our jobs used to require changing our code or table structure and doing some testing. It required at least two to three weeks of effort, which is now taken care of because of StreamSets.
StreamSets’ reusable assets helped to reduce workload. We can use pipeline fragments across multiple projects, which saves development time. The time saved varies from team to team.
It saves us money by not having to hire people with specialized skills. Without StreamSets, for example, I would've had to hire someone to work on Teradata or Db2. We definitely save some money on creating a new position or hiring a new developer. StreamSets provides a lot of features from AWS, Azure, or Snowflake. So, we don't have to find specialized, skilled resources for each of these technologies to create data pipelines. We just need to have StreamSets and one or two DBAs from each team to get the right configuration items, and we can just use it. We don't have to find a specialized resource for each database or technology.
It has helped us to scale our data operations. It saves the licensing costs on some legacy software, and we can reuse pipelines. Once we have a template for a certain use case, we can reuse the same template across different projects to move data to the cloud, which saves us money.
What is most valuable?
I have used Data Collector, Transformer, and Control Hub products from StreamSets. What I really like about these products is that they're very user-friendly. People who are not from a technological or core development background find it easy to get started and build data pipelines and connect to the databases. They would be as comfortable as any technical person within a couple of weeks. I really like its user-friendliness. It is easy to use. They give you a single snapshot across the different products, which is very helpful for learning and using the product based on your use case.
Its interface is very cool. If I'm using a batch project or an ETL, I just have to configure appropriate stages. It is the same process if you go with streaming. The only difference is that the stages will change. For example, in a batch, you might connect to Oracle Database, or in streaming, you may connect to Kafka or something like that. The process is the same, and the look-and-feel is the same. The interface is the same across different use cases.
It is a great product if you are looking to ramp up your teams and you are working with different databases or different transformations. Even if you don't have any skilled developers in Spark, Python, Java, or any kind of database, you can still use this product to ramp up your team and scale up your data migration to cloud or data analytics. It is a fantastic product.
What needs improvement?
There are a few things that can be better. We create pipelines or jobs in StreamSets Control Hub. It is a great feature, but if there is a way to have a folder structure or organize the pipelines and jobs in Control Hub, it would be great. I submitted a ticket for this some time back.
There are certain features that are only available at certain stages. For example, HTTP Client has some great features when it is used as a processor, but those features are not available in HTTP Client as a destination.
There could be some improvements on the group side. Currently, if I want to know which users are a part of certain groups, it is not straightforward to see. You have to go to each and every user and check the groups he or she is a part of. They could improve it in that direction. Currently, we have to put in a manual effort. In case something goes wrong, we have to go to each and every user account to check whether he or she is a part of a certain group or not.
For how long have I used the solution?
I got exposed to StreamSets in late 2018. Initially, I worked on StreamSets Data Collector, and then, for a year or so, I got exposed to Transformer as well.
What do I think about the stability of the solution?
It is stable, and they're growing rapidly.
What do I think about the scalability of the solution?
It is pretty scalable, but it also depends on where it is installed, which is something a lot of developers misunderstand. Most of the time, the implementation is done on on-prem servers, which is not very scalable. If you install it on cloud-based servers, it is fast. So, the problem is not with StreamSets; the problem is with the underlying hardware. I have worked on both sides. Therefore, I'm aware of the scenarios, but if I were to work purely in the development team, I might not be aware that it is underlying hardware that is causing problems.
In terms of its usage, it is available enterprise-wide. I don't know the exact number of users now because I am not a part of the platform or admin team, but at one time, we had more than 200 users working on this platform. We had one implementation on AWS Cloud and one on GCP. We had Dev, QA, and prod environments. Even now, we have about four environments. We have SIT and NFT, and in prod, we have two environments.
We plan to increase its usage. We are rapidly increasing its usage in our projects. There is a lot of excitement around it. A lot of people want to explore this tool in our organization. A lot of people are trying to learn this technology or use it to migrate their data from legacy databases to the cloud. This will actually encourage more folks to join the data engineering or analytics team. There is a lot of curiosity around the product.
How are customer service and support?
Currently, I'm not involved with them on a daily basis. I'm no longer a part of the platform team, but when I was involved with them two years back, their support was good. Most of the interactions I have had with them were pretty good. They were responsive, and they responded within a day or two. I would rate them a nine out of ten. They were good most of the time, but it could be a challenge to get the right person. They are still a growing company. You need to be a little patient with them to get to the right person to help you with the issues you have.
How would you rate customer service and support?
Positive
Which solution did I use previously and why did I switch?
About three or four years ago, I worked on Trifacta, which has since been acquired by Alteryx. The features were different, and the requirements were different.
Talend is a good product. It seems quite close to StreamSets, but I have not worked on Talend. I just got a demo of Talend a couple of years ago, but I never worked on it. I felt that StreamSets had more features. Its UI was good, and functionality-wise, I found it a little bit more comfortable to use.
How was the initial setup?
I was involved with the AWS deployment. At that time, I was a part of the platform team. Now, I work with the application development team, and I'm not involved in that. It was complex at that time. About four years ago, when StreamSets was new, we had a tough time deploying it because the documentation was not very clear. Some good documents were available on the web, but the documentation wasn't exhaustive or detailed. We also had our own learning curve. We had someone from StreamSets to help us with the deployment, so it went well. Now, it is better, but when we did it, it was very complex.
We implemented it in phases. We first installed the StreamSets platform in our company and let a couple of teams use it. We started with Data Collector, and we allowed teams to use it and get a feel for it. When they said that it was a good tool to use, we got the enterprise license, and we installed Control Hub and Data Collector. It was not implemented enterprise-wide at the same time; it was released to teams in phases.
What about the implementation team?
It was a mix of a consultant and reseller. It probably was Access Group that helped us with this implementation. At that time, I was in the US, and they were good. Our experience with them was fantastic. We had a couple of consultants from their team to help us with the installation. Now, we have a different vendor in the UK. We have a different partner to help us with that.
We started with about three people, and now, we have more than 20 people on the team. It requires regular maintenance in terms of user management. It is not because of StreamSets; it is because of the underlying software. Data Collector can support a certain number of jobs in parallel. In case we have more tenants on board, we have to increase the Data Collector or Transformer instances to support the increased number of users.
What was our ROI?
We have definitely seen an ROI. It has helped us in moving into the data analytics world at a faster pace than any other tool would've done. The traditional tools we had didn't provide the functionality that StreamSets offers.
The time for realizing its benefits from deployment depends on the use case or the end requirement. For example, we deployed one project last year, and within a couple of months, we could see a lot of benefits for that team. For some use cases, it could be two months to six months or one year. You can build data pipelines, and you can move data to Snowflake or any cloud database using StreamSets in a matter of a few weeks.
What's my experience with pricing, setup cost, and licensing?
There are different versions of the product. One is the corporate license version, and the other one is the open-source or free version. I have been using the corporate license version, but they have recently launched a new open-source version so that anybody can create an account and use it.
The licensing cost varies from customer to customer. I don't have a lot of input on that. It is taken care of by PMO, and they seem fine with its pricing model. It is being used enterprise-wide. They seem to have got a good deal for StreamSets.
What other advice do I have?
It is very user-friendly, and I promote it big time in my organization among my peers, my juniors, and across different departments.
They're growing rapidly. I can see them having a lot of growth based on the features they are bringing. They could capture a lot more market in coming times. They're providing a lot of new features.
I love the way they are constantly upgrading and improving the product. They're working on the product, and they're upgrading it to close the gaps. They have developed a data portal recently, and they have made it free. Anyone who doesn't know StreamSets can just create an account and start using that portal. It is a great initiative. I learned directly on the licensed corporate platform, but if I were to train somebody on my team who doesn't yet have a license, I would just recommend that they go to the free portal, register, and learn how to use StreamSets. It is available for anyone who wants to learn how to work with the tool.
We use StreamSets' ability to move data into modern analytics platforms. We use it for Tableau, and we use it for ThoughtSpot. It is quite easy to move data into these analytics platforms. It is not very complicated. The problems that we had were mostly outside of StreamSets. For example, most of our databases were on-prem, and StreamSets was installed on the cloud, such as AWS Cloud. There were some issues with that. It wasn't a drawback because of StreamSets. It was pretty straightforward to plug and play.
I have used StreamSets Transformer, but I haven't yet used it with Snowflake. We are planning to use it. We have a couple of use cases we are trying to migrate to Snowflake. I've seen a couple of demos, and I found it to be very easy to use. I didn't see any complications there. It is a great product with the integration of StreamSets Transformer and Snowflake. When we move data from legacy databases to Snowflake, I anticipate there could be a lot of data drift. There could be some column mismatches or table mismatches, but what I saw in the demo was really fantastic because it was creating tables during runtime. It was creating or taking care of the missing columns at runtime. It is a great feature to have, and it will definitely be helpful because we will be migrating our databases to Snowflake on the cloud. It will definitely help us meet our customer goals at a faster pace.
I would rate it a nine out of ten. They're improving it a lot, and they need to improve a lot, but it is a great product to use.
Disclosure: PeerSpot contacted the reviewer to collect the review and to validate authenticity. The reviewer was referred by the vendor, but the review is not subject to editing or approval by the vendor.
Technical Lead at Sopra Steria
Easy-to-use tool with no coding required
Pros and Cons
- "StreamSets’ data drift resilience has reduced the time it takes us to fix data drift breakages. For example, in our previous Hadoop scenario, when we were creating the Sqoop-based processes to move data from source to destinations, we were getting the job done. That took approximately an hour to an hour and a half when we did it with Hadoop. However, with the StreamSets, since it works on a data collector-based mechanism, it completes the same process in 15 minutes of time. Therefore, it has saved us around 45 minutes per data pipeline or table that we migrate. Thus, it reduced the data transfer, including the drift part, by 45 minutes."
- "The logging mechanism could be improved. If I am working on a pipeline, then create a job out of it and it is running, it will generate constant logs. So, the logging mechanism could be simplified. Now, it is a bit difficult to understand and filter the logs. It takes some time."
What is our primary use case?
StreamSets is a wonderful data engineering and DataOps tool where we can design and create data pipelines, loading on-prem data to the cloud. One of our major projects was to move data from on-premises to Azure and GCP Cloud. From there, once data is loaded, the data scientist and data analyst teams use that data to generate patterns and insights.
For a US healthcare service provider company, we designed a StreamSets pipeline to connect to relational database sources, generate the schema from the source data, and load it into Azure Data Lake Storage (ADLS) or another cloud store, like S3 or GCP. This was one of our batch use cases.
With StreamSets, we have also solved real-time streaming use cases, where we stream data from a source Kafka topic to Azure Event Hubs. This was a trigger-based streaming pipeline, which moved data as soon as it appeared in the Kafka topic. Since it was a streaming pipeline, it continuously streamed data from Kafka to Azure for further analysis.
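The pipeline itself was built on the StreamSets canvas, but as a rough hand-rolled equivalent of the same flow, the sketch below consumes records from a Kafka topic and forwards each one to Azure Event Hubs in Python. The topic name, broker address, consumer group, and connection string are placeholders:

```python
from kafka import KafkaConsumer                                 # pip install kafka-python
from azure.eventhub import EventHubProducerClient, EventData    # pip install azure-eventhub

# Placeholder connection details -- substitute real values.
consumer = KafkaConsumer(
    "source-topic",
    bootstrap_servers="kafka-broker:9092",
    group_id="streaming-bridge",
    auto_offset_reset="latest",
)
producer = EventHubProducerClient.from_connection_string(
    conn_str="Endpoint=sb://<namespace>.servicebus.windows.net/;SharedAccessKeyName=...;SharedAccessKey=...",
    eventhub_name="target-hub",
)

# Continuously stream: every record that appears on the Kafka topic
# is forwarded to Event Hubs for further analysis.
for message in consumer:
    batch = producer.create_batch()
    batch.add(EventData(message.value))
    producer.send_batch(batch)
```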
How has it helped my organization?
We can securely fetch the passwords and credentials stored in Azure Key Vault. This is a fundamentally very strong feature that has improved our day-to-day life.
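StreamSets handles this through its credential store integration, but for comparison, the same lookup done directly in Python with the Azure SDK looks roughly like this; the vault URL and secret name are placeholders:

```python
from azure.identity import DefaultAzureCredential   # pip install azure-identity
from azure.keyvault.secrets import SecretClient     # pip install azure-keyvault-secrets

# Placeholder vault URL and secret name -- substitute your own.
client = SecretClient(
    vault_url="https://<your-vault-name>.vault.azure.net/",
    credential=DefaultAzureCredential(),
)

# The password never sits in a pipeline definition or in source control;
# it is fetched from Key Vault at runtime.
db_password = client.get_secret("warehouse-db-password").value
```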
What is most valuable?
It is a pretty easy tool to use. There is no coding required. StreamSets provides us a canvas to design our pipeline, which gives us a picture at the beginning of any project, and that is an advantage. For example, if I want to do a data migration from on-premises to the cloud, I would draw it out for easier understanding based on my target system, and StreamSets does exactly the same thing by giving us a canvas where I can design the pipeline.
There is a wide range of available stages: various sources, relational sources, and streaming sources. There are also various processors to transform the source data. It is not only for migrating data from source to destination; we can use different processors to transform the data. When I was working on the healthcare project, there was personally identifiable information and protected health information (PHI) in the data that we needed to mask. We can't simply move it from source to destination. StreamSets provides masking of that sensitive data.
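As a simplified picture of the kind of masking such a processor applies (the field names and the hashing choice here are assumptions made for the example, not the exact StreamSets behavior):

```python
import hashlib

# Example record with sensitive fields -- field names are illustrative only.
record = {"patient_id": "P-1001", "ssn": "123-45-6789",
          "diagnosis_code": "E11.9", "zip": "94107"}

SENSITIVE_FIELDS = {"patient_id", "ssn"}

def mask(value: str) -> str:
    # One-way hash so records remain joinable without exposing the raw value.
    return hashlib.sha256(value.encode()).hexdigest()[:16]

masked = {key: (mask(value) if key in SENSITIVE_FIELDS else value)
          for key, value in record.items()}
# masked["ssn"] is now an irreversible token; non-sensitive fields pass through unchanged.
print(masked)
```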
It provides us a facility to generate the schema. There are different executors available, e.g., the Pipeline Finisher executor, which helps us finish the pipeline automatically.
There are different destinations, such as S3, Azure Data Lake, Hive, Kafka, and Hadoop-based systems. The range of available stages is wide, and it supports both batch and streaming.
Scheduling is quite easy in StreamSets. From a security perspective, there is integration with key vaults, e.g., for fetching passwords or secrets.
It is pretty easy to connect to Hadoop using StreamSets. Someone just needs to be aware of the configuration details, such as which Hadoop cluster to connect to and what credentials are available. For example, if I am trying with my generic user, how do I connect with the Hadoop distributed file system? Once we have the details of our cluster and the credentials, we can load data into the Hadoop file system. In our use case, we collected data from our RDBMS sources using the JDBC Query Consumer. We queried the data from the source table, captured that data, and then loaded it into the destination Hadoop distributed file system. Thus, configuration details are required. Once we have the configuration details, i.e., the required credentials, we can connect with Hadoop and Hive.
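To show what that source-to-HDFS hop looks like when written by hand, here is a minimal PySpark sketch of the same extract-and-land step; the JDBC URL, credentials, query, and HDFS path are placeholders, and the appropriate JDBC driver is assumed to be on the classpath:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdbms-to-hdfs").getOrCreate()

# Placeholder connection details -- the same items the JDBC Query Consumer
# configuration asks for in StreamSets.
df = (spark.read.format("jdbc")
      .option("url", "jdbc:oracle:thin:@//db-host:1521/ORCLPDB")
      .option("query", "SELECT * FROM claims")   # illustrative source query
      .option("user", "etl_user")
      .option("password", "****")
      .load())

# Land the extracted data on the Hadoop distributed file system.
df.write.mode("overwrite").parquet("hdfs:///data/raw/claims/")
```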
It takes care of data drift. There are certain data rules, metric rules, and capabilities provided by StreamSets that we can set. So, if the source schema deviates somehow, StreamSets will automatically notify us or send alerts about what is going wrong. StreamSets also provides Change Data Capture (CDC). As soon as the source data changes, it can capture that and update the details in the required destination.
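As a toy illustration of the kind of drift check StreamSets performs for us automatically (the expected columns and the alerting are simplified assumptions; in StreamSets this is a data rule firing an alert rather than a print statement):

```python
EXPECTED_COLUMNS = {"patient_id", "admit_date", "diagnosis_code"}

def check_drift(record: dict) -> None:
    """Report columns that appeared or disappeared relative to the expected schema."""
    incoming = set(record.keys())
    new_cols = incoming - EXPECTED_COLUMNS
    missing_cols = EXPECTED_COLUMNS - incoming
    if new_cols or missing_cols:
        print(f"Schema drift detected -- new: {sorted(new_cols)}, missing: {sorted(missing_cols)}")

# A record arrives with an extra column the target table has never seen.
check_drift({"patient_id": "P-1001", "admit_date": "2023-04-01",
             "diagnosis_code": "E11.9", "discharge_date": "2023-04-05"})
```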
What needs improvement?
The logging mechanism could be improved. If I am working on a pipeline and then create a job out of it, once it is running, it generates constant logs. So, the logging mechanism could be simplified. Right now, it is a bit difficult to understand and filter the logs. It takes some time. For example, if I am starting out with StreamSets, everything is fine. However, if I want to dig into problems that my pipeline ran into, it initially takes some time to get familiar with the logs and understand them.
I feel the visualization part could be simplified or enhanced a bit, so I can easily see what happened with my job seven days ago and how many records it transmitted.
For how long have I used the solution?
I have been using StreamSets for close to four and a half years, creating data pipelines in our projects.
What do I think about the stability of the solution?
Stability-wise, it is wonderful and quite good. Mostly, since the solution is completely cloud-based in our project, we just need to hit a URL and then we are logged into StreamSets with our credentials. Everything is present there. Other than some rare occasions, StreamSets behaves pretty well.
There were certain memory leak issues for a few stages, like Azure Data Lake, but those were corrected with immediate solutions, like patches and version upgrades.
Stability-wise, I would rate it as eight and a half or nine out of 10.
What do I think about the scalability of the solution?
I would like auto-scaling for heavy load transfers. This applied particularly when we were working on our data migration project. The tables had more than 10 million records in them. When we used StreamSets, it took a huge amount of time. Even though we were generating the schema and using ADLS as a destination, it hung for a good amount of time. So, we moved to PySpark processes for tables that have more than 10 million records. StreamSets usually works pretty well when the source table's data size is close to five to six million records, but when it is closer to 10 million, I personally feel the auto-scaling could be improved.
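For those tables we fell back to PySpark; a minimal sketch of the partitioned JDBC read we relied on looks like the following. The table, partition column, bounds, and the ADLS path are illustrative, and the ADLS credentials are assumed to be configured on the cluster:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("large-table-offload").getOrCreate()

# Splitting the read across partitions lets Spark pull a 10-million-plus row
# table in parallel instead of through a single connection.
df = (spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://db-host:5432/warehouse")
      .option("dbtable", "public.transactions")      # illustrative table
      .option("user", "etl_user")
      .option("password", "****")
      .option("partitionColumn", "transaction_id")   # numeric and roughly uniform
      .option("lowerBound", "1")
      .option("upperBound", "12000000")
      .option("numPartitions", "16")
      .load())

# Write straight to the cloud data lake (an ADLS Gen2 path as an example).
df.write.mode("append").parquet("abfss://raw@datalake.dfs.core.windows.net/transactions/")
```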
How are customer service and support?
We have spent a good amount of time dealing with their technical support team. The first step is to check the documentation, then work with them.
I had a chance to work with StreamSets support during one of our use cases. They helped us out in a good manner with a memory leak issue that we were facing in our production pipeline. Our pipelines were running fine in the lower environments, i.e., dev and QA, but when we moved them into production, we were getting a memory leak where the JVM threw an out-of-memory exception.
We tried reducing the number of threads and the batch size for the small table, but it was still creating issues. Then, we connected with StreamSets' support team. They gave us a customized patch, which our platform team installed in our production environment. With some collaborative effort of around a week, we were finally able to run our pipeline pretty well.
I would rate the customer support and technical support as quite good and knowledgeable (eight out of 10). They helped with issues that were occurring in our work. They acknowledged that the version of StreamSets we were using had some issues, particularly with memory management. The immediate solution they provided was a patch, which our platform team installed. However, the long-term solution was to upgrade our StreamSets Data Collector platform from version 3.11 to 4.2, and that solved our problem.
How would you rate customer service and support?
Positive
Which solution did I use previously and why did I switch?
We were using the Cloudera distribution. All our projects were running on Hadoop, and the distribution was Cloudera Hortonworks. We were using Sqoop and Hive as well as PySpark or Scala-based processes for coding. However, StreamSets helped us a lot in designing our data pipelines quickly.
It has made our job pretty easy in terms of designing, managing, and running our data engineering pipelines. Previously, if I needed to transfer data from source to destination, I would need to use Sqoop, which is a Hadoop-stack technology used to establish connectivity with the RDBMS and then load the data into the Hadoop distributed file system. With Sqoop, I needed to have my coding skills ready and be very precise and aware of the connection details and syntax. StreamSets solved this problem.
Its greatest feature is that it provides an easy way to design your pipeline. I just need to drag and drop the source JDBC Query Consumer onto the canvas, drag and drop my destination onto the canvas, connect both stages, and be ready with my configuration details. As soon as I am done with that, I can validate the pipeline, create a job out of it, schedule it, and even monitor it. All of this can be achieved from a single control panel. So, it not only solves the developer's basic problems, but it has also greatly improved the experience.
We were previously using the Hadoop technology stack completely. Slowly, we started converting our processes into data engineering pipelines designed in StreamSets. Earlier, the problem area was writing code in Sqoop, or creating Sqoop scripts, to capture data from the source and put it into HDFS. Once data was in HDFS, we would write another PySpark process, which handled the optimization and faster loading of the data from the Hadoop Distributed File System to a cloud-based data lake, like ADLS or S3. However, when StreamSets came into the picture, we no longer needed an intermediary distributed file system like HDFS. We could simply create a pipeline that connects to the RDBMS and loads data directly into the cloud-based Azure Data Lake. So there is no requirement for an intermediary Hadoop Distributed File System (HDFS), which saves us a great amount of time and also helps us a lot in creating our data engineering pipelines.
Microsoft provided Change Data Capture tools, which one of our team members was using. Performance-wise, I personally feel StreamSets is way faster. A few of the support team members were using Informatica as well, but it does not provide powerful features that can handle large amounts of data.
How was the initial setup?
For our deployment model, we follow three environments: dev, QA, and prod. Our team's main responsibility is to hydrate Azure Data Lake and GCP from the source systems. Control Hub is hosted on GCP, and we hit a URL to log into StreamSets. All the Data Collector machines are created on Google Cloud Platform. Whenever we create a pipeline or do a PoC, we work in the dev environment. Once our pipelines and jobs are working fine, we move them to our QA environment via export and import, which is pretty easy to do through StreamSets Control Hub. We can simply select a job and export it, then log into the QA environment and import it. When we import the job, we have the option to import the whole bundle: the pipeline, parameters, and associated instances. Once this is also working fine, we promote to the final production environment, where jobs run based on the source refresh frequencies.
What about the implementation team?
In our company, we have a good data engineering team. We have a separate administrator team that is mainly responsible for deploying it on the cloud and providing us libraries whenever required. There is a separate team taking care of all the installation and platform-related activities. We are primarily data engineers who use the product for solutions.
What was our ROI?
StreamSets’ data drift resilience has reduced the time it takes us to fix data drift breakages. For example, in our previous Hadoop scenario, when we were creating the Sqoop-based processes to move data from source to destinations, we were getting the job done, but it took approximately an hour to an hour and a half. However, with StreamSets, since it works on a data collector-based mechanism, it completes the same process in 15 minutes. Therefore, it has saved us around 45 minutes per data pipeline or table that we migrate. Thus, it reduced the data transfer, including the drift part, by 45 minutes.
What's my experience with pricing, setup cost, and licensing?
StreamSets Data Collector is open source. One can utilize the StreamSets Data Collector, but the Control Hub is the main repository where all the jobs are present. Everything happens in Control Hub.
What other advice do I have?
For people who are starting out, the simple advice is to first try the cloud login for StreamSets. It is freely available to everyone these days. StreamSets has released its online practice platform for designing and creating pipelines. Someone simply needs to go to cloud.login.streamsets.com, StreamSets' official site. There, people who are starting out can log into StreamSets Cloud and spin up their StreamSets Data Collector machines. Then, they can choose their execution mode. It is all in a Docker-containerized fashion. You don't need to do anything.
You simply need to have your laptop ready and step-by-step instructions are given. You just simply spin up your Data Collector, the execution mode, and then you are ready with the canvas. You can design your pipeline, practice, and test there. So, if you want to evaluate StreamSets in basic mode, you can take a look online. This is the easiest way to evaluate StreamSets.
It is a drag-and-drop, UI-based approach with a canvas, where you design the pipeline. It is pretty easy to follow. So, once your team feels confident, then they can purchase the StreamSets add-ons, which will provide them end-to-end solutions and vendor support. The best way is to log into their cloud practice platform and create some pipelines.
In my current project, there is a requirement to integrate with Snowflake, but I don't have Snowflake experience. I have not integrated Snowflake with StreamSets yet.
I personally love working on StreamSets. It is part of my day-to-day activities. I do a lot of work on StreamSets, so I would rate them pretty well as nine out of 10.
Disclosure: PeerSpot contacted the reviewer to collect the review and to validate authenticity. The reviewer was referred by the vendor, but the review is not subject to editing or approval by the vendor.
