Lucas Dreyer - PeerSpot reviewer
Data Engineer at BBD
Real User
Top 5
Oct 30, 2023
A reliable and scalable open-source framework for big data processing that excels in speed, fault tolerance, and support for various data sources
Pros and Cons
  • "It is highly scalable, allowing you to efficiently work with extensive datasets that might be problematic to handle using traditional tools that are memory-constrained."
  • "One limitation is that not all machine learning libraries and models support it."

What is our primary use case?

We use it for data engineering and analytics to process and examine extensive datasets.

What is most valuable?

It is highly scalable, allowing you to efficiently work with extensive datasets that might be problematic to handle using traditional tools that are memory-constrained.

What needs improvement?

One limitation is that not all machine learning libraries and models support it. While libraries like Scikit-learn may work with some Spark-compatible models, not all machine-learning tools are compatible with Spark. In such cases, you may need to extract data from Spark and train your models on smaller datasets instead of directly using Spark for training.
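As a rough illustration of the "train on a smaller dataset" workaround described above, the sketch below draws a fixed-size uniform sample from a stream of rows that may be too large to hold in memory. It uses reservoir sampling in plain Python rather than Spark itself; the function name `reservoir_sample` and its parameters are hypothetical. In PySpark the comparable step would typically be done with `DataFrame.sample` followed by `toPandas`, which is not shown here.

```python
import random


def reservoir_sample(rows, k, seed=None):
    """Return up to k rows chosen uniformly at random from an iterable of unknown size."""
    rng = random.Random(seed)
    sample = []
    for index, row in enumerate(rows):
        if index < k:
            # Fill the reservoir with the first k rows.
            sample.append(row)
        else:
            # Replace an existing entry with probability k / (index + 1).
            j = rng.randint(0, index)  # inclusive on both ends
            if j < k:
                sample[j] = row
    return sample


# Draw 10 rows out of 1,000 without materializing the full dataset.
subset = reservoir_sample(range(1000), k=10, seed=0)
print(len(subset))  # 10
```

The sampled rows could then be handed to a single-machine library for training, at the cost of not using the full dataset.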

For how long have I used the solution?

I have been using it for four years.

Buyer's Guide
Apache Spark
June 2026
Learn what your peers think about Apache Spark. Get advice and tips from experienced pros sharing their opinions. Updated: June 2026.
900,747 professionals have used our research since 2012.

What do I think about the stability of the solution?

I have not encountered any significant stability issues and it has proven to be a robust and reliable platform without major crashes. However, there have been instances where I needed to address query optimization and similar tasks to ensure optimal performance. I would rate it nine out of ten.

How are customer service and support?

To rate my overall experience, I would give it an eight out of ten, leaving room for potential improvements in terms of technical support.

Which solution did I use previously and why did I switch?

We used Pandas data frames and SQL-type queries for smaller datasets, but we haven't worked with anything on the scale of Spark SQL.

How was the initial setup?

I haven't handled the deployment process, but setting it up on the cloud seems relatively straightforward.

What about the implementation team?

Setting it up on-premises might take longer, potentially a couple of days. However, when deploying it on the cloud, the process can be significantly quicker, possibly taking only a few hours.

What's my experience with pricing, setup cost, and licensing?

The deployment can be expensive, as it requires substantial resources for implementation, covering on-premises hardware, memory, and licensing. Managing costs in a cloud environment can be challenging due to the cumulative expenses associated with running and maintaining Spark. Licensing costs may not be the primary concern, but operational costs in the cloud can add up. For on-premises deployments, maintenance costs include cluster management, job optimization, and upgrades. In the cloud, maintenance costs are relatively lower, especially with managed database clusters, but they still exist and primarily revolve around cluster upkeep.

Which other solutions did I evaluate?

We evaluated Microsoft Synapse, which offers similar analytics functionality but not quite at the same scale as Apache Spark. While some tasks can be accomplished with Synapse on Azure, there are certain capabilities, such as micro-batching and scalability, where Spark excels and remains unmatched.
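For readers unfamiliar with micro-batching, the idea is to process a continuous stream as a series of small batches while carrying state from one batch to the next. The sketch below shows that pattern in plain Python, not in Spark. The function names are hypothetical, and batches are cut by record count for simplicity, whereas a streaming engine would usually trigger them on a time interval.

```python
from itertools import islice


def micro_batches(events, batch_size):
    """Yield successive lists of at most batch_size events from an iterable."""
    iterator = iter(events)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def running_totals(events, batch_size):
    """Process a stream batch by batch, carrying a running total between batches."""
    total = 0
    totals = []
    for batch in micro_batches(events, batch_size):
        total += sum(batch)
        totals.append(total)
    return totals


print(running_totals(range(1, 8), batch_size=3))  # [6, 21, 28]
```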

What other advice do I have?

Additional skill requirements are crucial to use the solution and its related features effectively. Training costs and efforts may be necessary to ensure individuals are proficient in using these technologies. Overall, I would rate it nine out of ten.

Which deployment model are you using for this solution?

On-premises
Disclosure: My company does not have a business relationship with this vendor other than being a customer.
PeerSpot user
reviewer1759647 - PeerSpot reviewer
Information Technology Business Analyst at an aerospace/defense firm with 10,001+ employees
Real User
Jul 25, 2023
A highly scalable and affordable tool that can be used to gather information from different systems
Pros and Cons
  • "The product is useful for analytics."
  • "The product could improve the user interface and make it easier for new users."

What is most valuable?

We use it as an ETL tool to gather information from different systems. The product is useful for analytics.

What needs improvement?

The product could improve the user interface and make it easier for new users. It has a steep learning curve.

For how long have I used the solution?

I have been using the product for approximately three to four years. Currently, I am using the latest version.

What do I think about the stability of the solution?

The tool is stable. I rate the stability a ten out of ten.

What do I think about the scalability of the solution?

The tool is very scalable. I rate the scalability a ten out of ten. Approximately 30 users are using Apache Spark in our organization.

How are customer service and support?

We are using the free version of the product. So, we are not using any support.

How would you rate customer service and support?

Positive

How was the initial setup?

The basic installation is easy. However, we are working in the security business and need a very secure installation. It has been quite difficult. I rate the basic installation a ten out of ten. I rate the ease of setup a two or three out of ten for a more secure installation with all the security features. The solution is deployed on-premises in our organization. The deployment process requires a couple of weeks.

What's my experience with pricing, setup cost, and licensing?

We are using the free version of the solution.

What other advice do I have?

I would recommend the product. I think it's a good solution for analytics. Overall, I rate the product an eight out of ten.

Disclosure: My company does not have a business relationship with this vendor other than being a customer.
PeerSpot user
CTO at Hammerknife
Real User
Dec 20, 2023
Provides a valuable implementation of distributed data processing with a simple setup process
Pros and Cons
  • "Apache Spark provides a very high-quality implementation of distributed data processing."
  • "There were some problems related to the product's compatibility with a few Python libraries."

What is our primary use case?

We use the product for real-time data analysis.

What is most valuable?

Apache Spark provides a very high-quality implementation of distributed data processing. I rate it 20 on a scale of one to ten.

What needs improvement?

There were some problems related to the product's compatibility with a few Python libraries. But I suppose they are fixed.

For how long have I used the solution?

We have been using Apache Spark for the last two to three years.

What do I think about the stability of the solution?

I rate the product's stability a ten out of ten.

What do I think about the scalability of the solution?

The product is enormously scalable.

How was the initial setup?

The initial setup process is simple if you are a good professional. You have to select a few parameters and press enter. It is already integrated into the Databricks platform. One person is enough to manage small and medium implementations.

What's my experience with pricing, setup cost, and licensing?

It is an open-source platform. We do not pay for its subscription.

Which other solutions did I evaluate?

We are evaluating a few analytics engineering and DBT solutions. For now, Spark is in the secondary position.

What other advice do I have?

I recommend Apache Spark for batch analytics features.

Disclosure: My company does not have a business relationship with this vendor other than being a customer.
PeerSpot user
PLC Programmer at Alzero
Real User
Dec 1, 2023
Highly-recommended robust solution for data processing
Pros and Cons
  • "I appreciate everything about the solution, not just one or two specific features. The solution is highly stable. I rate it a perfect ten. The solution is highly scalable. I rate it a perfect ten. The initial setup was straightforward. I recommend using the solution. Overall, I rate the solution a perfect ten."
  • "The solution’s integration with other platforms should be improved."

What is our primary use case?

We are a software solutions company that serves a variety of industries, including banking, insurance, and industrial sectors. The product is specifically employed for managing data platforms for our customers.


What is most valuable?

The solution, as a package, excels across the board. I appreciate everything, not just one or two specific features.


What needs improvement?

The solution’s integration with other platforms should be improved.


For how long have I used the solution?

I have been using the solution for the past eight years. Currently, I’m using the latest version of the solution.


What do I think about the stability of the solution?

The solution is highly stable. I rate it a perfect ten.


What do I think about the scalability of the solution?

The solution is highly scalable. I rate it a perfect ten.


How was the initial setup?

The initial setup was straightforward and was conducted on the cloud. The entire deployment process took just 15 minutes. The deployment process involves provisioning the compute resources using Terraform.


What's my experience with pricing, setup cost, and licensing?

The solution is affordable and there are no additional licensing costs.


What other advice do I have?

I recommend using the solution. Overall, I rate the solution a perfect ten.


Disclosure: My company has a business relationship with this vendor other than being a customer. Partner
PeerSpot user
Lokesh Jayanna - PeerSpot reviewer
Vice President at Goldman Sachs at a computer software company with 10,001+ employees
Real User
Nov 26, 2023
Stable product with a valuable SQL tool
Pros and Cons
  • "The product’s most valuable feature is the SQL tool. It enables us to create a database and publish it."
  • "At the initial stage, the product provides no container logs to check the activity."

What is our primary use case?

We use the product for extensive data analysis. It helps us analyze a huge amount of data and transfer it to data scientists in our organization.

What is most valuable?

The product’s most valuable feature is the SQL tool. It enables us to create a database and publish it. It is a useful feature for us.

What needs improvement?

At the initial stage, the product provides no container logs to check the activity. It remains inactive for a long time without giving us any information. The containers could start more quickly, as they do with Jupyter Notebook.

For how long have I used the solution?

We have been using Apache Spark for eight months to one year.

What do I think about the stability of the solution?

It is a stable product. I rate its stability an eight out of ten.

What do I think about the scalability of the solution?

We have 45 Apache Spark users. I rate its scalability a nine out of ten.

How was the initial setup?

The complexity of the initial setup depends on the kind of environment an organization is working with. It requires one executive for deployment. I rate the process an eight out of ten.

What's my experience with pricing, setup cost, and licensing?

The product is expensive, considering the setup. However, from a standalone perspective, it is inexpensive.

What other advice do I have?

I advise others to analyze their data and understand their business requirements before purchasing the product. I rate it an eight out of ten.

Which deployment model are you using for this solution?

On-premises
Disclosure: My company does not have a business relationship with this vendor other than being a customer.
PeerSpot user
Jagannadha Rao - PeerSpot reviewer
Lead Data Scientist at International School of Engineering
Real User
Oct 24, 2023
A flexible solution that can be used for storage and processing
Pros and Cons
  • "The most valuable feature of Apache Spark is its flexibility."
  • "Apache Spark's GUI and scalability could be improved."

What is our primary use case?

We use Apache Spark for storage and processing.

What is most valuable?

The most valuable feature of Apache Spark is its flexibility.

What needs improvement?

Apache Spark's GUI and scalability could be improved.

For how long have I used the solution?

I have been using Apache Spark for four to five years.

What do I think about the scalability of the solution?

Around 15 data scientists are using Apache Spark in our organization.

How was the initial setup?

Apache Spark's initial setup is slightly complex compared to other solutions. Data scientists could install our previous tools with minimal supervision, whereas Apache Spark requires some IT support. Apache Spark's installation is a time-consuming process because it requires ensuring that all the ports are accessible, following certain guidelines.

What about the implementation team?

While installing Apache Spark, I must look at the documentation and be very specific about the configuration settings. Only then will I be able to install it.

What's my experience with pricing, setup cost, and licensing?

Apache Spark is an expensive solution.

What other advice do I have?

I would recommend Apache Spark to other users.

Overall, I rate Apache Spark an eight out of ten.

Disclosure: My company does not have a business relationship with this vendor other than being a customer.
PeerSpot user
Data Engineer at Berief Food GmbH
Real User
Aug 3, 2023
A useful and easy-to-deploy product that has an excellent data processing framework
Pros and Cons
  • "The data processing framework is good."
  • "The solution must improve its performance."

What is our primary use case?

Our customers configure their software applications, and I use Apache to check them. We use it for data processing.

What is most valuable?

The data processing framework is good. The product is very useful.

What needs improvement?

The solution must improve its performance.

For how long have I used the solution?

I have been using the solution for four to five years.

What do I think about the stability of the solution?

The tool is stable. I rate the stability more than nine out of ten.

What do I think about the scalability of the solution?

We have a small business. Around four people in my organization use the solution.

How was the initial setup?

The deployment was easy.

What about the implementation team?

The solution was deployed with the help of third-party consultants.

What other advice do I have?

Overall, I rate the product more than eight out of ten.

Which deployment model are you using for this solution?

On-premises
Disclosure: My company does not have a business relationship with this vendor other than being a customer.
PeerSpot user
reviewer2208003 - PeerSpot reviewer
Quantitative Developer at a marketing services firm with 11-50 employees
Real User
Jul 12, 2023
Seamless in distributing tasks, including its impressive map-reduce functionality
Pros and Cons
  • "The distribution of tasks, like the seamless map-reduce functionality, is quite impressive."
  • "When using Spark, users may need to write their own parallelization logic, which requires additional effort and expertise."

What is our primary use case?

Predominantly, I use Spark for data analysis on top of datasets containing tens of millions of records.

How has it helped my organization?

I have an example. We had a single-threaded application that used to run for about four to five hours, but with Spark, it got reduced to under one hour.

What is most valuable?

The distribution of tasks, like the seamless map-reduce functionality, is quite impressive. For the user, it appears as simple single-line data manipulations, but behind the scenes, the executor pool intelligently distributes the map and reduce functions.
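The pattern the reviewer describes, a one-line call on the surface with map and reduce steps spread across a worker pool underneath, can be sketched in plain Python. This is an illustration of the idea only, not Spark's implementation: the function names are hypothetical, the "executors" are local threads, and Python threads will not speed up CPU-bound work the way a cluster would.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def partition(records, num_partitions):
    """Split a list into roughly equal contiguous chunks."""
    size = max(1, -(-len(records) // num_partitions))  # ceiling division
    return [records[i:i + size] for i in range(0, len(records), size)]


def map_partition(chunk):
    """Map step: square each value, then pre-aggregate within the chunk."""
    return sum(x * x for x in chunk)


def distributed_sum_of_squares(records, num_workers=4):
    """Run the map step on a worker pool, then reduce the partial results."""
    chunks = partition(records, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = list(pool.map(map_partition, chunks))
    return reduce(lambda a, b: a + b, partials, 0)


# From the caller's side this is a single line.
total = distributed_sum_of_squares(list(range(1, 101)))
print(total)  # 338350
```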

What needs improvement?

The visualization could be improved.

For how long have I used the solution?

I have been working with Apache Spark for only a few months, not too long.

What do I think about the stability of the solution?

I haven't faced any stability issues. It has been stable in my experience.

What do I think about the scalability of the solution?

When it comes to the scalability of Spark, it's primarily a processing engine, not a database engine. I haven't tested it extensively with large record sizes.

In my organization, quite a few people are using Spark. In my smaller team, there are only two users.

What about the implementation team?

In terms of maintenance, when the load hits around 95%, we need to prioritize scripts and analysis within the team. 

We coordinate and prioritize based on the available resources. If there were self-service tools or better hand-holding for such situations, it would make things easier.

Which other solutions did I evaluate?

Currently, we extensively use pandas and Polars. We are leveraging Docker and Kubernetes as a framework, along with AWS Batch for distribution. This is the closest substitute we have for Spark's distribution.

Both Docker and Kubernetes are more general-purpose solutions. If someone is already using Kubernetes and it's provided as a service, it can be used for special-purpose utilization, similar to Docker and Kubernetes.


In such cases, users may need to write the parallelization logic themselves, but it's relatively easy to onboard and start with a distributed load. Spark, on the other hand, is primarily used for special-purpose utilization. Users typically choose Spark when they have data-intensive tasks.

Another significant issue with Spark is its syntax. For instance, if we have libraries like pandas or Polars, we can run them single-threaded on a single core, or we can distribute them leveraging Kubernetes.

We don't need to rewrite that code base for Spark. However, if we are writing code specifically for Spark executors, it will not be amenable to running locally.

What other advice do I have?

I would recommend understanding the use case better. Go for it only if it fits your use case. But it is a great tool.

Overall, I would rate Apache Spark an eight out of ten. 

Disclosure: My company does not have a business relationship with this vendor other than being a customer.
PeerSpot user
Armando Becerril - PeerSpot reviewer
Partner / Head of Data & Analytics at Intelligence Software Consulting
Real User
Feb 16, 2023
Great for machine learning applications; good documentation available
Pros and Cons
  • "Provides a lot of good documentation compared to other solutions."
  • "The migration of data between different versions could be improved."

What is our primary use case?

We use Spark for machine learning applications, clustering, and segmentation of customers.

What is most valuable?

Apache provides a lot of good documentation compared to other solutions. 

What needs improvement?

The migration of data between different versions could be improved. 

For how long have I used the solution?

I've been using this solution for four years. 

What do I think about the stability of the solution?

The solution is stable. 

What do I think about the scalability of the solution?

The solution is scalable. 

How are customer service and support?

If you pay for customer support then you get a quick and efficient response, otherwise the community support offers good help. 

How was the initial setup?

The initial setup has been simplified over the past few years and is now relatively straightforward. 

What's my experience with pricing, setup cost, and licensing?

Licensing costs depend on where you source the solution. 

What other advice do I have?

This is a good solution for big data use cases and I rate it eight out of 10. 

Disclosure: My company does not have a business relationship with this vendor other than being a customer.
PeerSpot user
AmitMataghare - PeerSpot reviewer
Associate Director at a consultancy with 10,001+ employees
Real User
Apr 29, 2022
High performance, beneficial in-memory support, and useful online community support
Pros and Cons
  • "One of Apache Spark's most valuable features is that it supports in-memory processing; jobs execute very fast compared to traditional tools."
  • "Apache Spark could improve the connectors that it supports. There are a lot of databases in the market, for example cloud databases such as Redshift, Snowflake, and Synapse. Apache Spark should have connectors present to connect to these databases. There are a lot of workarounds required to connect to those databases, but it should have built-in connectors."

What is our primary use case?

Apache Spark is a data processing framework that can be programmed in languages such as Java or Python. In my most recent deployment, we used Apache Spark to build engineering pipelines to move data from sources into the data lake.

What is most valuable?

One of Apache Spark's most valuable features is that it supports in-memory processing; jobs execute very fast compared to traditional tools.

What needs improvement?

Apache Spark could improve the connectors that it supports. There are a lot of databases in the market, for example cloud databases such as Redshift, Snowflake, and Synapse. Apache Spark should have connectors present to connect to these databases. There are a lot of workarounds required to connect to those databases, but it should have built-in connectors.

For how long have I used the solution?

I have been using Apache Spark for approximately five years.

What do I think about the stability of the solution?

Apache Spark is stable.

What do I think about the scalability of the solution?

I have found Apache Spark to be scalable.

How are customer service and support?

Apache Spark is open-source, so there is no team that will give you dedicated support, but you can post your queries on the community forums, and you will usually receive a good response. Since it's open-source, you depend on freelance developers to respond and cannot put a time limit on it, but the response, on average, is pretty good.

How was the initial setup?

If Apache Spark is in the cloud, setting it up will require only minutes. If it's on Amazon, GCP, or Microsoft cloud, it'll take minutes to set everything up. However, if you are using the on-premises version, then it might take some time to set up the environment.

What other advice do I have?

I rate Apache Spark an eight out of ten.

Which deployment model are you using for this solution?

Public Cloud

If public cloud, private cloud, or hybrid cloud, which cloud provider do you use?

Amazon Web Services (AWS)
Disclosure: My company does not have a business relationship with this vendor other than being a customer.
PeerSpot user
Buyer's Guide
Download our free Apache Spark Report and get advice and tips from experienced pros sharing their opinions.
Updated: June 2026