UjjwalGupta - PeerSpot reviewer
Module Lead at Mphasis
Real User
Mar 14, 2024
Helps to build ETL pipelines and load data into warehouses
Pros and Cons
  • "The tool's most valuable feature is its speed and efficiency. It's much faster than other tools and excels in parallel data processing. Unlike tools like Python or JavaScript, which may struggle with parallel processing, it allows us to handle large volumes of data with more power easily."
  • "Apache Spark could potentially improve in terms of user-friendliness, particularly for individuals with a SQL background. While it's suitable for those with programming knowledge, making it more accessible to those without extensive programming skills could be beneficial."

What is our primary use case?

We're using Apache Spark primarily to build ETL pipelines. This involves transforming data and loading it into our data warehouse. Additionally, we're working with Delta Lake file formats to manage the contents.

What is most valuable?

The tool's most valuable feature is its speed and efficiency. It's much faster than other tools and excels in parallel data processing. Unlike tools like Python or JavaScript, which may struggle with parallel processing, it allows us to handle large volumes of data with more power easily.
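To make the parallel-processing point concrete, here is a minimal sketch of the split, process, and combine model that engines like Spark apply across a cluster. It is plain Python using threads in a single process, not Spark's API, and the function names are hypothetical. It only illustrates the shape of the computation, not Spark's performance.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(records, num_partitions):
    """Deal records round-robin into num_partitions lists."""
    partitions = [[] for _ in range(num_partitions)]
    for index, record in enumerate(records):
        partitions[index % num_partitions].append(record)
    return partitions


def partial_sum(partition):
    """Work done independently on one partition (the 'map' side)."""
    return sum(partition)


def parallel_total(records, num_partitions=4):
    """Sum all records by combining per-partition results (the 'reduce' side)."""
    partitions = split_into_partitions(records, num_partitions)
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partials = list(pool.map(partial_sum, partitions))
    return sum(partials)
```

Calling parallel_total(range(1, 101)) sums the numbers 1 to 100 in four partitions. In a real cluster each partition would be processed on a separate executor rather than a thread.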

What needs improvement?

Apache Spark could potentially improve in terms of user-friendliness, particularly for individuals with a SQL background. While it's suitable for those with programming knowledge, making it more accessible to those without extensive programming skills could be beneficial.

For how long have I used the solution?

I have been using the product for six years. 

Buyer's Guide
Apache Spark
March 2026
Learn what your peers think about Apache Spark. Get advice and tips from experienced pros sharing their opinions. Updated: March 2026.
884,933 professionals have used our research since 2012.

What do I think about the stability of the solution?

Apache Spark is generally considered a stable product that rarely breaks down. Issues may arise with sudden increases in data volume, leading to memory errors, but these can typically be managed with autoscaling clusters. Additionally, schema changes or irregularities in streaming data may pose challenges, but these could be addressed in future software versions.
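One common way to soften the schema-change problem is to check incoming records against an expected schema and set aside the ones that do not match. The sketch below is plain Python rather than Spark code, and the schema and field names are hypothetical, chosen only for illustration.

```python
# Hypothetical expected schema: field name -> required Python type.
EXPECTED_SCHEMA = {"user_id": int, "event": str, "amount": float}


def matches_schema(record, schema):
    """Return True if record has exactly the schema's fields with the right types."""
    if set(record) != set(schema):
        return False
    return all(isinstance(record[name], expected) for name, expected in schema.items())


def split_by_schema(records, schema):
    """Separate records into (valid, rejected) lists without dropping any."""
    valid, rejected = [], []
    for record in records:
        if matches_schema(record, schema):
            valid.append(record)
        else:
            rejected.append(record)
    return valid, rejected
```

Rejected records can be written to a separate location for inspection, so an upstream schema change does not stop the whole pipeline.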

What do I think about the scalability of the solution?

About 70-80 percent of employees in my company use the product. 

How are customer service and support?

We haven't contacted Apache Spark support directly because it's an open-source tool. However, when using it as a product within Databricks, we've contacted Databricks support for assistance.

Which solution did I use previously and why did I switch?

The main reason our company opted for the product is its capability to process large volumes of data. While other options like Snowflake offer some advantages, they may have limitations regarding custom logic or modifications.

How was the initial setup?

The complexity of setting up and installing Apache Spark varies depending on whether it's done in a standalone or cluster environment. The process is generally more straightforward in a standalone setup, especially if you're familiar with the concepts involved. However, setting up in a cluster environment may require more knowledge about clusters and networking, making it potentially more complex.

What's my experience with pricing, setup cost, and licensing?

The tool is an open-source product. If you're using the open-source Apache Spark, no fees are involved at any time. Charges only come into play when using it with other services like Databricks.

What other advice do I have?

If you're new to Apache Spark, the best way to learn is by using the Databricks Community Edition. It provides a cluster for Apache Spark where you can learn and test. I rate the product an eight out of ten.

If public cloud, private cloud, or hybrid cloud, which cloud provider do you use?

Amazon Web Services (AWS)
Disclosure: My company does not have a business relationship with this vendor other than being a customer.
PeerSpot user
Cloud solution architect at 0
Real User
Mar 10, 2024
Offers seamless integration with Azure services and on-premises servers
Pros and Cons
  • "The solution is scalable."
  • "The setup I worked on was really complex."

What is our primary use case?

My contribution primarily focused on the networking aspect, ensuring secure and reliable connections between Azure services and on-premises servers. The solution was complex, involving private links, virtual machines, and custom firewall rules to facilitate secure data transmission.

I use Apache Spark, especially for data processing and analytics. My work involves a broad range of technologies, including PostgreSQL, Apache Kafka, Spark, and various Azure services. Previously, my focus was more on networking, cybersecurity, and Azure's data services like SQL and Active Directory.

How has it helped my organization?

We've set up a Spark cluster running in Azure to process real-time data. This setup involves connecting Azure applications to the Spark cluster via Azure Private Link, ensuring secure data flow. 

The architecture required detailed network design, including routing through Linux firewalls and ensuring data could be securely transmitted to and from on-premises servers. 

While I was heavily involved in the network design aspect, the Spark cluster was primarily used for processing and analyzing data streams for various applications.

Moreover, from my experience, I haven't encountered significant challenges with integrations involving Spark. The crucial factor is having established connectivity. 

Whether Spark is operating in Azure or on-premises doesn't significantly affect our operations, thanks to high-bandwidth solutions like ExpressRoute. The main consideration then becomes the cost. As long as we maintain performance standards, I don't see any issues, regardless of the deployment environment.

Ensuring the collection of relevant metrics and logs is critical for assessing performance improvements. The specifics of how these are collected or which tools are used might vary, but the goal is to gather comprehensive data for ongoing monitoring and improvement.

What is most valuable?

What I liked about the solution was its uniqueness. We provided the customer with a solution that hadn't been offered by anyone else before. 

It involved multiple components, such as Spark cluster, CMAX, a backend VM, and a Linux VM for mapping the service processes to the backend, which is running on-premises where the Kafka service was running. 

It was challenging for people to understand how to send traffic through the private link between all these services. Ensuring the traffic was sent to the correct destination with the correct source header without any operation issues was complex, but we achieved it.

We had multiple instances of fault tolerance and scalability.  

What needs improvement?

The setup I worked on was really complex.

For how long have I used the solution?

I have been using it for a year. 

What do I think about the stability of the solution?

The solution was definitely stable. There were no unstable services in it. Since most services were in Azure, everything worked better. 

Azure's networking products, like ExpressRoute and Private Link service, are very stable. We didn't encounter any issues with the solution. 

It took some time to complete, but after that, we haven't had a single support case.

What do I think about the scalability of the solution?

The solution is scalable. We used a load balancer at each tier, with multiple instances of the services running. 

It's all scalable and relevant. We didn't have a lot of issues and have been monitoring the traffic flow. 

We even projected the requests for the next two to three years and created scalable instances accordingly.

There are many users of Spark in our organization. For example, many customers are using Spark, often in conjunction with requests from third-party vendors. They frequently use Spark plug-ins as well.

Which solution did I use previously and why did I switch?

I've been exploring its capabilities in the OpenAI context, rather than dealing with external databases. 

I've also started using Apache Kafka for messaging and event streaming, which is essential since our solutions often integrate with applications running in Azure, including event hubs and service bus for messaging. This experience includes interfacing with various technologies, not just within Microsoft's ecosystem but also with Amazon Web Services.

Learning new technologies is a continuous process, and I've never found it difficult to adapt, especially with something as foundational as Apache Kafka.

How was the initial setup?

The setup I worked on was really complex, not specifically because of Spark but due to the integration with multiple services. 

It took us about a week to finalize the solution, as understanding the entire workflow and brainstorming on how to maintain private traffic was intricate.

Regarding the deployment process, it involved thorough planning and testing to ensure minimal latency. We managed to achieve a latency of around 20 to 30 milliseconds, which was pretty good.

What about the implementation team?

For the deployment process, once we have a clear understanding of the workflow, the services to be included, how they should be integrated, the policies, and the configurations to be applied, it becomes easier to structure and incorporate it into the ops pipeline. 

We may need to standardize it a bit based on different customer requirements. This standardization allows customers to apply the necessary customizations once it's deployed.

It's a hybrid solution, with about 90% of the services running in the cloud and 10%  on-premises.

What's my experience with pricing, setup cost, and licensing?

The licensing costs for Spark would depend on the specific packages and the needs of the project. Costs can vary based on requirements, affordability, and customer expectations. 

Licensing costs can vary. For instance, when purchasing a virtual machine, you're asked if you want to take advantage of the hybrid benefit or if you prefer the license costs to be included upfront by the cloud service provider, such as Azure. 

If you choose the hybrid benefit, it indicates you already possess a license for the operating system and wish to avoid additional charges for that specific VM in Azure. This approach allows for a reduction in licensing costs, charging only for the service and associated resources. 

The licensing arrangements can differ based on the product and service. Some products might require a license purchase upfront, with subsequent charges based only on usage. 

The availability of hybrid benefits can also influence licensing costs, especially if you're using third-party services like Palo Alto in a VM from the marketplace. If you have an existing license, your costs could be reduced, but purchasing a new license would include licensing fees in the overall cost.

What other advice do I have?

My advice is to thoroughly understand your own needs and environment before making a decision. Recommendations should be based on product features, quality, accuracy, and stability. 

Cost is also a factor, but it should not be the only consideration. Depending on whether the priority is performance and scalability or cost-effectiveness, I would suggest a solution that best meets those needs, whether it's a managed service or a more cost-conscious option.

I would rate Spark as ten out of ten. I haven't had any issues with Spark in my experience.

Disclosure: My company does not have a business relationship with this vendor other than being a customer.
PeerSpot user
Atif Tariq - PeerSpot reviewer
Cloud and Big Data Engineer | Developer at Huawei
Real User
Nov 29, 2023
A scalable solution that can be used for data computation and building data pipelines
Pros and Cons
  • "The most valuable feature of Apache Spark is its memory processing because it processes data over RAM rather than disk, which is much more efficient and fast."
  • "Apache Spark should add some resource management improvements to the algorithms."

What is our primary use case?

Apache Spark is used for data computation, building data pipelines, or building analytics on top of batch data. Apache Spark is used to handle big data efficiently.

What is most valuable?

The most valuable feature of Apache Spark is its memory processing because it processes data over RAM rather than disk, which is much more efficient and fast.

What needs improvement?

Apache Spark should add some resource management improvements to the algorithms. That way, the solution could manage data skew more efficiently in the physical and logical plans when you are joining different data sets.

For how long have I used the solution?

I have been working with Apache Spark for six to seven years.

What do I think about the stability of the solution?

Apache Spark is a very stable solution. The community is still working on other parts, like performance and removing bottlenecks. However, from a stipulative point of view, the solution's stability is very good.

I rate Apache Spark a nine out of ten for stability.

What do I think about the scalability of the solution?

Apache Spark is a scalable solution. More than 50 to 100 users are using the solution in our organization.

How are customer service and support?

Apache Spark's technical support team responds on time.

How would you rate customer service and support?

Positive

How was the initial setup?

The solution’s initial setup is very easy.

What's my experience with pricing, setup cost, and licensing?

Apache Spark is an open-source solution, and there is no cost involved in deploying the solution on-premises.

What other advice do I have?

I would recommend Apache Spark to users doing analytics, data computation, or pipelines.

Overall, I rate Apache Spark ten out of ten.

Which deployment model are you using for this solution?

Hybrid Cloud
Disclosure: My company does not have a business relationship with this vendor other than being a customer.
PeerSpot user
Suriya Senthilkumar - PeerSpot reviewer
Analyst at Deloitte
Real User
Mar 3, 2024
Processes a larger volume of data efficiently and integrates with different platforms
Pros and Cons
  • "The product’s most valuable features are lazy evaluation and workload distribution."
  • "They could improve the issues related to programming language for the platform."

What is our primary use case?

We use the product in our environment for data processing and performing Data Definition Language (DDL) operations.

What is most valuable?

The product’s most valuable features are lazy evaluation and workload distribution.
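Lazy evaluation means that transformations are only recorded until a result is actually requested. The sketch below shows the idea with plain Python generators. It is an analogy under that assumption, not Spark's API, and the function names are hypothetical.

```python
from itertools import islice


def read_numbers(limit, log):
    """Yield numbers one at a time, recording each read in log."""
    for n in range(limit):
        log.append(n)
        yield n


def build_pipeline(source):
    """Chain transformations; nothing is read until a result is requested."""
    evens = (n for n in source if n % 2 == 0)
    squares = (n * n for n in evens)
    return squares


def take(iterable, count):
    """The 'action': pull only as many results as requested."""
    return list(islice(iterable, count))
```

Building the pipeline over a source of 1,000 numbers reads nothing. Only calling take pulls values through, and it stops as soon as it has enough, so most of the source is never touched.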

What needs improvement?

They could improve the issues related to programming language for the platform. 

For how long have I used the solution?

We have been using Apache Spark for around two and a half years.

What do I think about the stability of the solution?

The platform’s stability depends on how effectively we write the code. We encountered a few issues related to programming languages.

What do I think about the scalability of the solution?

We have more than 100 Apache Spark users in our organization.

Which solution did I use previously and why did I switch?

Before choosing Apache Spark for processing big data, we evaluated another option, Hadoop. However, Spark emerged as a superior choice comparatively.

How was the initial setup?

The initial setup complexity depends on whether it's on the cloud or on-premises. For cloud deployments, especially using platforms like Databricks, the process is straightforward and can be configured with ease. However, if the deployment is on-premises, the setup tends to be more time-consuming, although not overly complex.

What's my experience with pricing, setup cost, and licensing?

They provide an open-source license for the on-premises version. However, we have to pay for the cloud version, including data centers and virtual machines.

What other advice do I have?

Apache Spark is a good product for processing large volumes of data compared to other distributed systems. It provides efficient integration with Hadoop and other platforms.

I rate it a ten out of ten.

Disclosure: My company does not have a business relationship with this vendor other than being a customer.
PeerSpot user
Lucas Dreyer - PeerSpot reviewer
Data Engineer at BBD
Real User
Top 5 Leaderboard
Oct 30, 2023
A reliable and scalable open-source framework for big data processing that excels in speed, fault tolerance, and support for various data sources
Pros and Cons
  • "It is highly scalable, allowing you to efficiently work with extensive datasets that might be problematic to handle using traditional tools that are memory-constrained."
  • "One limitation is that not all machine learning libraries and models support it."

What is our primary use case?

We use it for data engineering and analytics to process and examine extensive datasets.

What is most valuable?

It is highly scalable, allowing you to efficiently work with extensive datasets that might be problematic to handle using traditional tools that are memory-constrained.

What needs improvement?

One limitation is that not all machine learning libraries and models support it. While libraries like Scikit-learn may work with some Spark-compatible models, not all machine-learning tools are compatible with Spark. In such cases, you may need to extract data from Spark and train your models on smaller datasets instead of directly using Spark for training.
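When a model has to be trained outside Spark, one way to get a smaller dataset is to draw a fixed-size random sample in a single pass. The sketch below uses reservoir sampling in plain Python. It is one possible approach, not necessarily what the reviewer did, and the function name is hypothetical.

```python
import random


def reservoir_sample(rows, sample_size, seed=None):
    """Return up to sample_size rows chosen uniformly at random in one pass."""
    rng = random.Random(seed)
    sample = []
    for index, row in enumerate(rows):
        if index < sample_size:
            # Fill the reservoir with the first sample_size rows.
            sample.append(row)
        else:
            # Replace an existing entry with probability sample_size / (index + 1).
            slot = rng.randint(0, index)
            if slot < sample_size:
                sample[slot] = row
    return sample
```

The rows can come from any iterator, so the full dataset never has to fit in memory. Only the sample does.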

For how long have I used the solution?

I have been using it for four years.

What do I think about the stability of the solution?

I have not encountered any significant stability issues and it has proven to be a robust and reliable platform without major crashes. However, there have been instances where I needed to address query optimization and similar tasks to ensure optimal performance. I would rate it nine out of ten.

How are customer service and support?

To rate my overall experience, I would give it an eight out of ten, leaving room for potential improvements in terms of technical support.

How would you rate customer service and support?

Positive

Which solution did I use previously and why did I switch?

We used Pandas data frames and SQL-type queries for smaller datasets, but we haven't worked with anything on the scale of Spark SQL.

How was the initial setup?

I haven't handled the deployment process, but setting it up on the cloud seems relatively straightforward.

What about the implementation team?

Setting it up on-premises might take longer, potentially a couple of days. However, when deploying it on the cloud, the process can be significantly quicker, possibly taking only a few hours.

What's my experience with pricing, setup cost, and licensing?

The cloud model can be expensive as it requires substantial resources for implementation, covering on-premises hardware, memory, and licensing. Managing costs in a cloud environment can be challenging due to the cumulative expenses associated with running and maintaining Spark. Licensing costs may not be the primary concern, but operational costs in the cloud can add up. For on-premises deployments, maintenance costs include cluster management, job optimization, and upgrades. In the cloud, maintenance costs are relatively lower, especially with managed database clusters, but they still exist and primarily revolve around cluster upkeep.

Which other solutions did I evaluate?

We evaluated Microsoft Synapse, which offers similar analytics functionality but not quite at the same scale as Apache Spark. While some tasks can be accomplished with Synapse on AWS, there are certain features and capabilities, such as micro-batching and scalability, where Spark excels and remains unmatched.
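Micro-batching means a stream is processed as a series of small batches rather than one event at a time. The sketch below shows the idea in plain Python. It groups events by count for simplicity, whereas Spark's streaming engine typically forms batches on a time trigger, and the function names are hypothetical.

```python
from itertools import islice


def micro_batches(events, batch_size):
    """Yield consecutive lists of up to batch_size events from a stream."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(events)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def running_totals(events, batch_size):
    """Update a running total once per batch instead of once per event."""
    total = 0
    totals = []
    for batch in micro_batches(events, batch_size):
        total += sum(batch)
        totals.append(total)
    return totals
```

Processing per batch trades a little latency for much less per-event overhead, which is the trade-off the micro-batch model makes.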

What other advice do I have?

Additional skill requirements are crucial to use the solution and its related features effectively. Training costs and efforts may be necessary to ensure individuals are proficient in using these technologies. Overall, I would rate it nine out of ten.

Which deployment model are you using for this solution?

On-premises
Disclosure: My company does not have a business relationship with this vendor other than being a customer.
PeerSpot user
reviewer1759647 - PeerSpot reviewer
Information Technology Business Analyst at an aerospace/defense firm with 10,001+ employees
Real User
Jul 25, 2023
A highly scalable and affordable tool that can be used to gather information from different systems
Pros and Cons
  • "The product is useful for analytics."
  • "The product could improve the user interface and make it easier for new users."

What is most valuable?

We use it as an ETL tool to gather information from different systems. The product is useful for analytics.

What needs improvement?

The product could improve the user interface and make it easier for new users. It has a steep learning curve.

For how long have I used the solution?

I have been using the product for approximately three to four years. Currently, I am using the latest version.

What do I think about the stability of the solution?

The tool is stable. I rate the stability a ten out of ten.

What do I think about the scalability of the solution?

The tool is very scalable. I rate the scalability a ten out of ten. Approximately 30 users are using Apache Spark in our organization.

How are customer service and support?

We are using the free version of the product. So, we are not using any support.

How would you rate customer service and support?

Positive

How was the initial setup?

The basic installation is easy. However, we are working in the security business and need a very secure installation. It has been quite difficult. I rate the basic installation a ten out of ten. I rate the ease of setup a two or three out of ten for a more secure installation with all the security features. The solution is deployed on-premises in our organization. The deployment process requires a couple of weeks.

What's my experience with pricing, setup cost, and licensing?

We are using the free version of the solution.

What other advice do I have?

I would recommend the product. I think it's a good solution for analytics. Overall, I rate the product an eight out of ten.

Disclosure: My company does not have a business relationship with this vendor other than being a customer.
PeerSpot user
CTO at Hammerknife
Real User
Top 20
Dec 20, 2023
Provides a valuable implementation of distributed data processing with a simple setup process
Pros and Cons
  • "Apache Spark provides a very high-quality implementation of distributed data processing."
  • "There were some problems related to the product's compatibility with a few Python libraries."

What is our primary use case?

We use the product for real-time data analysis.

What is most valuable?

Apache Spark provides a very high-quality implementation of distributed data processing. I rate it 20 on a scale of one to ten.

What needs improvement?

There were some problems related to the product's compatibility with a few Python libraries. But I suppose they are fixed.

For how long have I used the solution?

We have been using Apache Spark for the last two to three years.

What do I think about the stability of the solution?

I rate the product's stability a ten out of ten.

What do I think about the scalability of the solution?

The product is enormously scalable.

How was the initial setup?

The initial setup process is simple if you are a good professional. You have to select a few parameters and press enter. It is already integrated into the Databricks platform. One person is enough to manage small and medium implementations.

What's my experience with pricing, setup cost, and licensing?

It is an open-source platform. We do not pay for its subscription.

Which other solutions did I evaluate?

We are evaluating a few analytics engineering and DBT solutions. For now, Spark is in the secondary position.

What other advice do I have?

I recommend Apache Spark for batch analytics features.

Disclosure: My company does not have a business relationship with this vendor other than being a customer.
PeerSpot user
PLC Programmer at Alzero
Real User
Dec 1, 2023
Highly-recommended robust solution for data processing
Pros and Cons
  • "I appreciate everything about the solution, not just one or two specific features. The solution is highly stable. I rate it a perfect ten. The solution is highly scalable. I rate it a perfect ten. The initial setup was straightforward. I recommend using the solution. Overall, I rate the solution a perfect ten."
  • "The solution’s integration with other platforms should be improved."

What is our primary use case?

We are a software solutions company that serves a variety of industries, including banking, insurance, and industrial sectors. The product is specifically employed for managing data platforms for our customers.


What is most valuable?

The solution, as a package, excels across the board. I appreciate everything, not just one or two specific features.


What needs improvement?

The solution’s integration with other platforms should be improved.


For how long have I used the solution?

I have been using the solution for the past eight years. Currently, I’m using the latest version of the solution.


What do I think about the stability of the solution?

The solution is highly stable. I rate it a perfect ten.


What do I think about the scalability of the solution?

The solution is highly scalable. I rate it a perfect ten.


How was the initial setup?

The initial setup was straightforward and was conducted on the cloud. The entire deployment process took just 15 minutes. The deployment process involves provisioning the compute component using Terraform.


What's my experience with pricing, setup cost, and licensing?

The solution is affordable and there are no additional licensing costs.


What other advice do I have?

I recommend using the solution. Overall, I rate the solution a perfect ten.


Disclosure: My company has a business relationship with this vendor other than being a customer. Partner
PeerSpot user
Buyer's Guide
Download our free Apache Spark Report and get advice and tips from experienced pros sharing their opinions.
Updated: March 2026