Apache Spark Reviews and Pricing

Peter-Paul Eijkenboom

Senior Test Automation Specialist at APG

Feb 26, 2022

Useful for big data and scientific purposes, but needs better query handling, stability, and scalability

Pros and Cons

"It is useful for handling large amounts of data, and it is very useful for scientific purposes."

"We are building our own queries on Spark, and it can be improved in terms of query handling."
"It is useful for scientific purposes, but for commercial use of big data, it gives some trouble."

What is our primary use case?

We are using it for big data. We are using a small part of it, which is related to using data.

What is most valuable?

It is useful for handling large amounts of data. It is very useful for scientific purposes.

What needs improvement?

There are some difficulties that we are working on. It is useful for scientific purposes, but for commercial use of big data, it gives some trouble.

They should improve the stability of the product. We use Spark Executors and Spark Drivers to link to our own environment, and they are not the most stable products. Its scalability is also an issue.

We are building our own queries on Spark, and it can be improved in terms of query handling.

For how long have I used the solution?

In my company, it has been used for several years, but I have been using it for seven months.

Buyer's Guide

Apache Spark

June 2026

Free Report: Apache Spark Reviews and More

Learn what your peers think about Apache Spark. Get advice and tips from experienced pros sharing their opinions. Updated: June 2026.

DOWNLOAD NOW

900,747 professionals have used our research since 2012.

What do I think about the scalability of the solution?

It is not scalable. Scalability is one of the issues.

How are customer service and support?

It is open source from my point of view. So, there is no support.

What other advice do I have?

I would advise not using it if you don't have experienced users inside your organization. If you have to figure it all out on your own, then you shouldn't start with it.

Overall, I would rate it a six out of 10. For a commercial use case, it is a six out of 10. For scientific purposes, it is an eight out of 10.

Which deployment model are you using for this solution?

On-premises

Disclosure: My company does not have a business relationship with this vendor other than being a customer.

Suresh_Srinivasan

Co-Founder at FORMCEPT Technologies

Jan 13, 2022

Handles large volume data, cloud and on-premise deployments, but difficult to use

Pros and Cons

"We are using Apache Spark for large volume interactive data analysis."

"Apache Spark is very difficult to use. It would require a data engineer. It is not available for every engineer today because they need to understand the different concepts of Spark, which is very, very difficult and it is not easy to learn."

What is our primary use case?

The solution can be deployed on the cloud or on-premise.

How has it helped my organization?

We are using Apache Spark for large volume interactive data analysis.

MechBot is an enterprise, one-click installation, trusted data excellence platform. Underneath, I am using Apache Spark, Kafka, Hadoop HDFS, and Elasticsearch.

What is most valuable?

Apache Spark can do large volume interactive data analysis.

What needs improvement?

Apache Spark is very difficult to use. It would require a data engineer. It is not available for every engineer today because they need to understand the different concepts of Spark, which is very, very difficult and it is not easy to learn.

For how long have I used the solution?

I have been using Apache Spark for approximately 11 years.

What do I think about the stability of the solution?

The solution is stable.

What do I think about the scalability of the solution?

Apache Spark is scalable. However, it needs enormous technical skills to make it scalable. It is not a simple task.

We have approximately 20 people using this solution.

How was the initial setup?

If you want to distribute Apache Spark in a certain way, it is simple, but not every engineer can do it. It requires specialized DevOps skills on Spark.

If we are going to deploy the solution in a one-layer laptop installation, it is very straightforward, but this is not what someone is going to deploy in the production site.

What's my experience with pricing, setup cost, and licensing?

Since we are using the Apache Spark version, not the Databricks version, it is an Apache-licensed version, and support and bug fixes are delayed. The Apache license is free.

What other advice do I have?

We are well versed in Spark, the version, the internal structure of Spark, and we know what exactly Spark is doing.

The solution cannot be easier. Everything cannot be made simpler because it involves core data, computer science, pro-engineering, and not many people are actually aware of it.

I rate Apache Spark a six out of ten.

Disclosure: My company does not have a business relationship with this vendor other than being a customer.

Oscar Estorach

Chief Data Strategist And Director at theworkshop.es

Aug 19, 2021

Scalable, open-source, and great for transforming data

Pros and Cons

"The solution has been very stable."
"Spark, as a tool, is easy to work with as you can work with Python, Scala, and Java."

"It's not easy to install."

What is our primary use case?

You can do a lot of things in terms of the transformation of data. You can store and transform and stream data. It's very useful and has many use cases.

What is most valuable?

Overall, it's a very nice tool.

It is great for transforming data and doing micro-streaming or micro-batching.

The product offers an open-source version.

The solution has been very stable.

The scalability is good.

Apache Spark is a huge tool. It has many use cases and is very flexible. You can use it with so many other platforms.

Spark, as a tool, is easy to work with as you can work with Python, Scala, and Java.

What needs improvement?

If you are developing projects and need to put them into a production scenario, you might need a cluster of servers, as it requires distributed computing.

It's not easy to install. You are typically dealing with a big data system.

It's not a simple, straightforward architecture.

For how long have I used the solution?

I've been using the solution for three years.

What do I think about the stability of the solution?

The stability is very good. There are no bugs or glitches and it doesn't crash or freeze. It's a reliable solution.

What do I think about the scalability of the solution?

We have found the scalability to be good. If your company needs to expand it, it can do so.

We have five people working on the solution currently.

How are customer service and technical support?

There isn't really technical support for open source. You need to do your own studying. There are lots of places to find information. You can find details online, or in books, et cetera. There are even courses you can take that can help you understand Spark.

Which solution did I use previously and why did I switch?

I also use Databricks, which I use in the cloud.

How was the initial setup?

When handling big data systems, the installation is a bit difficult. When you need to deploy the systems, it's better to use services like Databricks.

I am not a professional admin. I am a developer, and I design architecture.

You can use it in your standalone system, however, it's not the best way. It would be okay for little branch codes, not for production.

What's my experience with pricing, setup cost, and licensing?

We use the open-source version. It is free to use. However, you do need to have servers. We have three or four. They can be on-premises or in the cloud.

What other advice do I have?

I have the solution installed on my computer and on our servers. You can use it on-premises or as a SaaS.

I'd rate the solution at a nine out of ten. I've been very pleased with its capabilities.

I would recommend the solution for the people who need to deploy projects with streaming. If you have many different sources or different types of data, and you need to put everything in the same place - like a data lake - Spark, at this moment, has the right tools. It's an important solution for data science, for data detectors. You can put all of the information in one place with Spark.

Which deployment model are you using for this solution?

On-premises

Disclosure: My company does not have a business relationship with this vendor other than being a customer.

Kürşat Kurt

Software Architect at Akbank

Oct 30, 2020

Provides fast aggregations, AI libraries, and a lot of connectors

Pros and Cons

"AI libraries are the most valuable. They provide extensibility and usability. Spark has a lot of connectors, which is a very important and useful feature for AI. You need to connect a lot of points for AI, and you have to get data from those systems. Connectors are very wide in Spark. With a Spark cluster, you can get fast results, especially for AI."
"Aggregations are very fast in our project since we started to use Spark."

"Stream processing needs to be developed more in Spark. I have used Flink previously. Flink is better than Spark at stream processing."

What is our primary use case?

We just finished a central front project called MFY for our in-house fraud team. In this project, we are using Spark along with Cloudera. In front of Spark, we are using Couchbase.

Spark is mainly used for aggregations and AI (for future usage). It gathers stuff from Couchbase and does the calculations. We are not actively using Spark AI libraries at this time, but we are going to use them.

This project is for classifying the transactions and finding suspicious activities, especially those suspicious activities that come from internet channels such as internet banking and mobile banking. It tries to find out suspicious activities and executes rules that are being developed or written by our business team. An example of a rule is that if the transaction count or transaction amount is greater than 10 million Turkish Liras and the user device is new, then raise an exception. The system sends an SMS to the user, and the user can choose to continue or not continue with the transaction.
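
The rule quoted above can be sketched in a few lines of plain Python. This is only an illustration of the logic, not the bank's implementation: the Transaction fields, the is_suspicious function, and the count_threshold parameter are hypothetical names, and the review gives no threshold for the transaction count, so the caller has to supply one.

```python
from dataclasses import dataclass

# Amount threshold taken from the example rule in the review
# (10 million Turkish liras).
AMOUNT_THRESHOLD_TRY = 10_000_000


@dataclass
class Transaction:
    # All field names are hypothetical and chosen only for illustration.
    user_id: str
    amount_try: float        # aggregated transaction amount
    transaction_count: int   # aggregated transaction count
    device_is_new: bool      # True if the user's device has not been seen before


def is_suspicious(tx: Transaction, count_threshold: int) -> bool:
    """Return True when the example rule would raise an exception.

    count_threshold is an assumption: the review does not state one.
    """
    exceeds_limit = (
        tx.transaction_count > count_threshold
        or tx.amount_try > AMOUNT_THRESHOLD_TRY
    )
    return exceeds_limit and tx.device_is_new
```

In a Spark job, the same condition would more likely be written as a filter over aggregated data than as a per-object check.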

How has it helped my organization?

Aggregations are very fast in our project since we started to use Spark. We can tell results in around 300 milliseconds. Before using Spark, the time was around 700 milliseconds.

Before using Spark, we only used Couchbase. We needed fast results for this project because transactions come from various channels, and we need to decide and resolve them at the earliest because users are performing the transaction. If our result or process takes longer, users might stop or cancel their transactions, which means losing money. Therefore, fast results time is very important for us.

What is most valuable?

AI libraries are the most valuable. They provide extensibility and usability. Spark has a lot of connectors, which is a very important and useful feature for AI. You need to connect a lot of points for AI, and you have to get data from those systems. Connectors are very wide in Spark. With a Spark cluster, you can get fast results, especially for AI.

What needs improvement?

Stream processing needs to be developed more in Spark. I have used Flink previously. Flink is better than Spark at stream processing.

For how long have I used the solution?

I am a Java developer. I have been interested in Spark for around five years. We have been actively using it in our organization for almost a year.

What do I think about the stability of the solution?

It is the most stable platform. As compared to Flink, Spark is good, especially in terms of clusters and architecture. My colleagues who set up these clusters say that Spark is the easiest.

What do I think about the scalability of the solution?

It is scalable, but we don't have the need to scale it.

It is mainly used for reporting big data in our organization. All teams, especially the VR team, are using Spark for job execution and remote execution. I can say that 70% of users use Spark for reporting, calculations, and real-time operations. We are a very big company, and we have around a thousand people in IT.

We will continue its usage and develop more. We have kind of just started using it. We finished this project just three months ago. We are now trying to find out bottlenecks in our systems, and then we are ready to go.

How are customer service and technical support?

We have not used Apache support. We have only used Cloudera support for this project, and they helped us a lot during the development cycle of this project.

How was the initial setup?

I don't have any idea about it. We are a big company, and we have another group for setting up Spark.

What other advice do I have?

I would advise planning well before implementing this solution. In enterprise corporations like ours, there are a lot of policies. You should first find out your needs, and after that, you or your team should set it up based on your needs. If your needs change during development because of the business requirements, it will be very difficult.

If you are clear about your needs, it is easier to set it up. If you know how Spark is used in your project, you have to define firewall rules and cluster needs. When you set up Spark, it should be ready for people's usage, especially for remote job execution.

I would rate Apache Spark a nine out of ten.

Which deployment model are you using for this solution?

On-premises

Disclosure: My company does not have a business relationship with this vendor other than being a customer.

Suresh_Srinivasan

Co-Founder at FORMCEPT Technologies

Mar 26, 2024

Enables us to process data from different data sources

Pros and Cons

"We use Spark to process data from different data sources."

"In data analysis, you need to take real-time data from different data sources. You need to process this in a subsecond, do the transformation in a subsecond, and all that."

What is our primary use case?

Our primary use case is interactively processing large volumes of data.

What is most valuable?

We use Spark to process data from different data sources.

What needs improvement?

In data analysis, you need to take real-time data from different data sources. You need to process this in a subsecond, and do the transformation in a subsecond.

For how long have I used the solution?

I have been using Apache Spark for eight to nine years.

What do I think about the stability of the solution?

It is a stable solution. The solution is ten out of ten on stability.

What do I think about the scalability of the solution?

The solution is highly scalable. All of the technical guys use Spark. Our product is used by many people within our customers' company.

How was the initial setup?

The initial setup is straightforward.

What's my experience with pricing, setup cost, and licensing?

The solution is moderately priced.

What other advice do I have?

I rate the overall solution a ten out of ten.

Disclosure: My company does not have a business relationship with this vendor other than being a customer.

reviewer1283880

CEO International Business at a tech services company with 1,001-5,000 employees

Nov 13, 2023

A powerful open-source framework for fast, flexible, and versatile big data processing, with a steep learning curve and resource demands

Pros and Cons

"The most crucial feature for us is the streaming capability. It serves as a fundamental aspect that allows us to exert control over our operations."

"It requires overcoming a significant learning curve due to its robust and feature-rich nature."

What is our primary use case?

In AI deployment, a key step is aggregating data from various sources, such as customer websites, debt records, and asset information. Apache Spark plays a vital role in this process, efficiently handling continuous streams of data. Its capability enables seamless gathering and feeding of diverse data into the system, facilitating effective processing and analysis for generating alerts and insights, particularly in scenarios like banking.

What is most valuable?

The most crucial feature for us is the streaming capability. It serves as a fundamental aspect that allows us to exert control over our operations.

What needs improvement?

It requires overcoming a significant learning curve due to its robust and feature-rich nature.

For how long have I used the solution?

We have been using it for two years now.

What do I think about the stability of the solution?

It provides excellent stability. We never faced any issues with it.

What do I think about the scalability of the solution?

It ensures outstanding scalability capabilities.

Which solution did I use previously and why did I switch?

Opting for Apache Spark, an open-source solution, provides a distinct advantage by offering control over the code. This means you can identify issues, make necessary fixes, and determine what aspects to accept as they are. In contrast, dealing with a vendor may limit control, requiring you to submit requests and advocate for changes based on your business volume with them. This dependency on volume can potentially compromise control. To safeguard both your customers and your business, the choice of an open-source solution like Apache Spark allows for more autonomy and control over the technology stack.

What about the implementation team?

The system's smooth operation relies on deploying a comprehensive container with Kubernetes clusters, configured with essential toolsets. Instrumentation data from the backend is fed back to a central framework equipped with specific tools for driving various processes. In a case involving a customer with Red Hat and Postini clusters, the OpenShift Container Platform, comprising Kubernetes clusters, is used. The tools manage onboarding, infrastructure provisioning, certificate management, authorization control, etc. The deployment spans multiple independent data centers, like telecom circles in India, requiring unique approaches for various tasks, including disaster recovery planning and central alerting, facilitated through SaaS. The deployment process typically takes approximately forty to forty-five days for six thousand servers.

What was our ROI?

It provides a dual advantage by saving both time and money while enhancing performance, particularly by leveraging my skill sets.

What's my experience with pricing, setup cost, and licensing?

It is an open-source solution, so it is free of charge.

What other advice do I have?

I would give it a rating of seven out of ten, which, by my standards, is quite high.

Which deployment model are you using for this solution?

Private Cloud

If public cloud, private cloud, or hybrid cloud, which cloud provider do you use?

Other

Disclosure: My company does not have a business relationship with this vendor other than being a customer.

Salvatore Campana

CEO & Founder at Xautomata

Oct 27, 2023

Reduces startup time and gives excellent ROI

Pros and Cons

"Spark helps us reduce startup time for our customers and gives a very high ROI in the medium term."

"The initial setup was not easy."

What is our primary use case?

I use Spark to run automation processes driven by data.

How has it helped my organization?

Apache Spark helped us with horizontal scalability and cost optimizations.

What is most valuable?

The most valuable feature is the grid computing.

What needs improvement?

An area for improvement is that when we start the solution and declare the maximum number of nodes, the process is shared, which is a problem in some cases. It would be useful to be able to change this parameter in real-time rather than having to stop the solution and restart with a higher number of nodes.
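
For context, open-source Spark does have a dynamic allocation feature that lets the number of executors grow and shrink between configured bounds while an application runs. It works at the executor level rather than the node level, and the upper bound is still fixed at submit time, so it may not cover the case described here. A minimal Python sketch that builds the relevant spark-submit flags follows; the helper name dynamic_allocation_args is hypothetical, while the property names are standard Spark settings.

```python
from typing import List


def dynamic_allocation_args(min_executors: int, max_executors: int) -> List[str]:
    """Build spark-submit --conf flags that enable dynamic allocation.

    Hypothetical helper for illustration; it only assembles strings and
    does not talk to a cluster.
    """
    if min_executors < 0 or max_executors < min_executors:
        raise ValueError("need 0 <= min_executors <= max_executors")

    conf = {
        "spark.dynamicAllocation.enabled": "true",
        "spark.dynamicAllocation.minExecutors": str(min_executors),
        "spark.dynamicAllocation.maxExecutors": str(max_executors),
        # Shuffle tracking is available from Spark 3.0; older setups
        # rely on the external shuffle service instead.
        "spark.dynamicAllocation.shuffleTracking.enabled": "true",
    }

    args: List[str] = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args
```

The returned list can be spliced into a spark-submit command line. Raising the ceiling beyond max_executors still requires resubmitting the application.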

For how long have I used the solution?

I've been using Spark for around four years.

How was the initial setup?

The initial setup was not easy, but we created a means of asking the user about their needs, making the setup much easier. We can now deploy the platform in thirty minutes using the public cloud or Kubernetes space.

What was our ROI?

Spark helps us reduce startup time for our customers and gives a very high ROI in the medium term.

What's my experience with pricing, setup cost, and licensing?

Spark is an open-source solution, so there are no licensing costs.

What other advice do I have?

I would rate Apache Spark eight out of ten.

Which deployment model are you using for this solution?

Hybrid Cloud

Disclosure: My company does not have a business relationship with this vendor other than being a customer.

Mahdi Sharifmousavi

Lecturer at Amirkabir University of Technology

Sep 2, 2022

A scalable solution that can grow with the needs of a business, and provides excellent functionality for analytical tasks

Pros and Cons

"This solution provides a clear and convenient syntax for our analytical tasks."

"This solution currently cannot support or distribute neural network related models, or deep learning related algorithms. We would like this functionality to be developed."

What is our primary use case?

We use this solution for its anti-money laundering and direct marketing features within a banking environment.

What is most valuable?

This solution provides a clear and convenient syntax for our analytical tasks.

What needs improvement?

This solution currently cannot support or distribute neural network related models, or deep learning related algorithms. We would like this functionality to be developed.

There is also limited Python compatibility, which should be improved.

For how long have I used the solution?

We have used this solution for around seven years, through several versions.

What do I think about the stability of the solution?

We have found this solution to be stable during our time using it.

What do I think about the scalability of the solution?

This is a very scalable solution from our experience.

What about the implementation team?

We implemented the solution using our in-house team, but the UI was developed using a third party contractor.

What's my experience with pricing, setup cost, and licensing?

The deployment time of this solution is dependent on the requirements of an organization, and the compatibility of the systems they will be using alongside this solution. We would recommend that these are clearly defined when designing the product for the business's needs.

What other advice do I have?

I would rate this solution a nine out of ten.

Which deployment model are you using for this solution?

On-premises

Disclosure: My company does not have a business relationship with this vendor other than being a customer.

NitinKumar

Director of Engineering at Sigmoid

Aug 1, 2022

Easy to code, fast, open-source, very scalable, and great for big data

Pros and Cons

"Its scalability and speed are very valuable. You can scale it a lot. It is a great technology for big data. It is definitely better than a lot of earlier warehouse or pipeline solutions, such as Informatica. Spark SQL is very compliant with normal SQL that we have been using over the years. This makes it easy to code in Spark. It is just like using normal SQL. You can use the APIs of Spark or you can directly write SQL code and run it. This is something that I feel is useful in Spark."
"It is an excellent tool to process massive amounts of data."

"Its UI can be better. Maintaining the history server is a little cumbersome, and it should be improved. I had issues while looking at the historical tags, which sometimes created problems. You have to separately create a history server and run it. Such things can be made easier. Instead of separately installing the history server, it can be made a part of the whole setup so that whenever you set it up, it becomes available."

What is our primary use case?

I use it mostly for ETL transformations and data processing. I have used Spark on-premises as well as on the cloud.

How has it helped my organization?

Spark has been at the forefront of data processing engines. I have used Apache Spark for multiple projects for different clients. It is an excellent tool to process massive amounts of data.

What is most valuable?

Its scalability and speed are very valuable. You can scale it a lot. It is a great technology for big data. It is definitely better than a lot of earlier warehouse or pipeline solutions, such as Informatica.

Spark SQL is very compliant with normal SQL that we have been using over the years. This makes it easy to code in Spark. It is just like using normal SQL. You can use the APIs of Spark or you can directly write SQL code and run it. This is something that I feel is useful in Spark.

What needs improvement?

Its UI can be better. Maintaining the history server is a little cumbersome, and it should be improved. I had issues while looking at the historical tags, which sometimes created problems. You have to separately create a history server and run it. Such things can be made easier. Instead of separately installing the history server, it can be made a part of the whole setup so that whenever you set it up, it becomes available.
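
For readers unfamiliar with the separate setup mentioned here: applications must write event logs, and the history server must be pointed at the same location and started on its own (with sbin/start-history-server.sh in a standard distribution). A minimal Python sketch that renders the relevant spark-defaults.conf lines follows, assuming a single shared log directory. The helper name history_server_conf and the example path are hypothetical; the property names are standard Spark settings.

```python
def history_server_conf(log_dir: str) -> str:
    """Render spark-defaults.conf lines for event logging and the history server.

    Hypothetical helper for illustration; log_dir must be a location that
    both the applications and the history server can reach.
    """
    settings = [
        # Applications write their event logs here.
        ("spark.eventLog.enabled", "true"),
        ("spark.eventLog.dir", log_dir),
        # The history server reads completed application logs from here.
        ("spark.history.fs.logDirectory", log_dir),
    ]
    # spark-defaults.conf uses whitespace-separated key/value pairs.
    return "\n".join(f"{key} {value}" for key, value in settings)


if __name__ == "__main__":
    # Example path only; substitute a directory that exists in your environment.
    print(history_server_conf("hdfs:///spark-logs"))
```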

For how long have I used the solution?

I have been using this solution for around 7 years.

What do I think about the stability of the solution?

There were bugs three to four years ago, which have been resolved. There were a couple of issues related to slowness when we did a lot of transformations using withColumn. I was writing a POC on ETL for moving from Informatica to Spark SQL for the ETL pipeline. It required hundreds of withColumn calls to change column names or add transformations, which made it slow. It happened in versions prior to version 1.6, and it seems that this issue has been fixed later on.

What do I think about the scalability of the solution?

It is very scalable. You can scale it a lot.

How are customer service and support?

I haven't contacted them.

How was the initial setup?

The initial setup was a little complex when I was using open-source Spark. I was doing a POC in the on-premise environment, and the initial setup was a little cumbersome. It required a lot of set up on Unix systems. We also had to do a lot of configurations and install a lot of things.

After I moved to the Cloudera CDH version, it was a little easier. It is a bundled product, so you just install whatever you want and use it.

What's my experience with pricing, setup cost, and licensing?

Apache Spark is open-source. You have to pay only when you use any bundled product, such as Cloudera.

What other advice do I have?

I would definitely recommend Spark. It is a great product. I like Spark a lot, and most of the features have been quite good. Its initial learning curve is a bit high, but as you learn it, it becomes very easy.

I would rate Apache Spark an eight out of ten.

Which deployment model are you using for this solution?

Public Cloud

If public cloud, private cloud, or hybrid cloud, which cloud provider do you use?

Amazon Web Services (AWS)

Disclosure: My company does not have a business relationship with this vendor other than being a customer.

reviewer1185906

Manager - Data Science Competency at a tech services company with 201-500 employees

Feb 27, 2022

Fast-performance, cost-effective, and runs in a cloud-agnostic environment

Pros and Cons

"One of the key features is that Apache Spark is a distributed computing framework. You can have multiple slaves and distribute the workload between them."
"As it uses in-memory data processing, Spark is very fast."

"When you are working with large, complex tasks, the garbage collection process is slow and affects performance."

What is our primary use case?

My main task is working on predictive analytics, and Apache Spark is one of the tools that I utilize in this role. Primarily, we work with the predictive analysis of very large amounts of data.

Apache Spark is also helpful for data pre-processing, including data cleaning.

This solution is cloud-agnostic. You can use it with an EC2 instance and you can even install it on-premises. Some environments have it installed in VMs.

What is most valuable?

One of the key features is that Apache Spark is a distributed computing framework. You can have multiple slaves and distribute the workload between them.

Another feature is memory-based computing. This is unlike Hadoop, which relies on storage. As it uses in-memory data processing, Spark is very fast.

What needs improvement?

When you are working with large, complex tasks, the garbage collection process is slow and affects performance. This is an area where they need to improve because your job may fail if it is stuck for a long time while memory garbage collection is happening. This is the main problem that we have.

For how long have I used the solution?

I have been working with Apache Spark for the past four years.

What do I think about the stability of the solution?

This product is pretty stable. Companies like Facebook, Uber, and Netflix are all using Apache Spark. It's stable enough to be used all over the world.

What do I think about the scalability of the solution?

In our team that works on this, we have approximately 10 people.

How are customer service and support?

There is no official support for this solution. Because it's open-source and there is no cost involved, there is nobody to contact for support. Our own internal team of experts, which work on different problems, both support and contribute to the platform.

Which solution did I use previously and why did I switch?

I work on several open-source frameworks including Python, Scikit-learn, TensorFlow, PyTorch, H2O.ai, and R. We don't endorse proprietary tools so we aren't working with them.

How was the initial setup?

With respect to the initial setup, it's neither easy nor very difficult. Our team has experience so it is not difficult for them. However, for a person that is new to using it, the setup might be very difficult.

What about the implementation team?

We have a team of experts in my company, and they handle it very well.

What's my experience with pricing, setup cost, and licensing?

This is an open-source tool, so it can be used free of charge. There is no cost involved.

What other advice do I have?

We are not using the current version of this platform, Spark 3. However, we do know that it is used in the market and it has new features. We will eventually move to it.

My advice for anybody who wants to use Apache Spark is that they have two options. The first is to go with Databricks, the company founded by the creators of Apache Spark, and use its proprietary version. If you choose this option, then you will have to pay for the product.

If instead, you use Apache Spark, then you can rely on your own expert in-house team for support, maintenance, and deployment. In this option, you don't have to pay anything to anybody outside of your company.

I would rate this solution an eight out of ten.