Apache Spark Reviews and Pricing

SurjitChoudhury

Data engineer at Cocos pt

Mar 16, 2024

Offers batch processing of data, and in-memory processing greatly enhances performance

Pros and Cons

"Now, when we're tackling sentiment analysis using NLP technologies, we deal with unstructured data—customer chats, feedback on promotions or demos, and even media like images, audio, and video files. For processing such data, we rely on PySpark. Beneath the surface, Spark functions as a compute engine with in-memory processing capabilities, enhancing performance through features like broadcasting and caching. It's become a crucial tool, widely adopted by 90% of companies for a decade or more."

"There could be enhancements in optimization techniques, as there are some limitations in this area that could be addressed to further refine Spark's performance."

What is our primary use case?

Our main use cases for Spark are Apache Spark SQL and sometimes Spark Streaming to process streaming data.

As with most solutions, we get data from SAP or Azure Data Warehouse, for example when the client uses Azure cloud technology. The data that comes from there is relational or sometimes semi-structured, such as JSON files.

We process the data with Spark, writing the code in PySpark (that is, Python, which Spark supports) to create DataFrames and load the data into a tabular format.

We then load it into a database, such as SQL Server. From there, the business's data scientists or analysts pick up the data. The sources vary and can include, for example, e-commerce sites.

Previously, we mostly used structured data, which was stored in SAP, mainframe, Oracle, or other systems and provided in structured formats like CSV.

Now, when we're tackling sentiment analysis using NLP technologies, we deal with unstructured data—customer chats, feedback on promotions or demos, and even media like images, audio, and video files. For processing such data, we rely on PySpark.

Beneath the surface, Spark functions as a compute engine with in-memory processing capabilities, enhancing performance through features like broadcasting and caching. It's become a crucial tool, widely adopted by 90% of companies for a decade or more.

Before Spark, there was MapReduce, but it was much slower. Even running the same query a second time would be time-consuming due to the I/O operations with disk storage. Spark was introduced to address these issues, offering processing speeds a hundred times faster than MapReduce, an initiative that saw contributions from Adobe Systems among others.

So, in response to the evolving needs of the industry, Spark has proven to be the solution, efficiently handling the processing requirements we face today.

What is most valuable?

Spark supports real-time data processing through Spark Streaming, and it also allows for batch processing of data. If you have immediate data, like chat information, that needs to be processed in real time, Spark Streaming is used.

For data that can be evaluated later, batch processing with Apache Spark is suitable. Mostly, batch processing is utilized in our organization, but for streaming data processing, tools like Kafka are often integrated.

In-memory processing in Spark greatly enhances performance, making it a hundred times faster than the previous MapReduce methods. This improvement is achieved through optimization techniques like caching, broadcasting, and partitioning, which help in optimizing queries for faster processing.
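
As a rough illustration of why broadcasting and caching help, here is a plain-Python sketch. It does not use Spark's API, and the data and function names are hypothetical. A small lookup table is copied to every partition so each one can join locally without moving rows of the large dataset, and an expensive result is computed once and then reused from memory.

```python
from functools import lru_cache

# Hypothetical data: a large dataset split into partitions, plus a small lookup table.
partitions = [
    [("c1", 20), ("c2", 35)],
    [("c1", 5), ("c3", 12)],
]
countries = {"c1": "DE", "c2": "IN", "c3": "RS"}  # small enough to copy everywhere


def broadcast_join(partition, lookup):
    """Join one partition against its local copy of the small table."""
    return [(key, amount, lookup[key]) for key, amount in partition if key in lookup]


def join_all(parts, lookup):
    # Each partition receives its own copy of the lookup (the "broadcast"),
    # so no rows of the large dataset have to move between partitions.
    return [broadcast_join(part, dict(lookup)) for part in parts]


@lru_cache(maxsize=None)
def total_per_country():
    """Computed on the first call; later calls reuse the cached result."""
    totals = {}
    for part in join_all(partitions, countries):
        for _key, amount, country in part:
            totals[country] = totals.get(country, 0) + amount
    return tuple(sorted(totals.items()))
```

Calling `total_per_country()` a second time returns the stored result without repeating the join, which is the same idea as keeping an intermediate dataset in memory between queries.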

What needs improvement?

There could be enhancements in optimization techniques, as there are some limitations in this area that could be addressed to further refine Spark's performance.

For how long have I used the solution?

I've used it for four years.

Buyer's Guide

Apache Spark

June 2026

Free Report: Apache Spark Reviews and More

Learn what your peers think about Apache Spark. Get advice and tips from experienced pros sharing their opinions. Updated: June 2026.

DOWNLOAD NOW

900,747 professionals have used our research since 2012.

How are customer service and support?

In the community forums, I asked questions a while back when I was new. However, the responses came from other users in the community, not the official Apache Spark organization. So, I am not sure about the proficiency.

Since it's open-source, most questions are answered in the community. For enterprise support, I imagine the response speed would be different.

Which solution did I use previously and why did I switch?

I have also used Hadoop.

The main reason for choosing Apache Spark was for big data solutions. Hadoop was introduced earlier, and most organizations were using Hadoop or cloud data platforms.

Then, Apache Spark came into the picture, and it was much faster. It's kind of taking the place of Hadoop. Organizations using Hadoop are now primarily focusing on Apache Spark for support.

So, for big data computing tasks, what you do with Hadoop is like a top-level layer. Spark is another layer on top of that. Organizations using Hadoop technologies and big data technologies in general have adopted Spark.

There aren't really other comparable tools for big data computing tasks. But, resource managers like Kubernetes and YARN are used with Spark. YARN was used in Hadoop big data technology, but now Kubernetes is more commonly used for resource management.

How was the initial setup?

Resource allocation and optimization in the computing tasks are different for on-premise systems.

In cloud environments, resource allocation is already handled by the cloud provider, so you don't need to worry about it.

On-prem, if you're using Hadoop with Spark, resource allocation might be handled by Kubernetes or YARN. These tools provide feedback to the Spark driver about available resources, and the driver allocates tasks to worker nodes based on that information.

What other advice do I have?

Overall, I would rate the solution a nine out of ten.

I would recommend this tool to someone considering it for scalable data processing.

Nowadays, Apache Spark is on the market, and most organizations are using it. There are people with more experience and knowledge than me, and they're confident about this tool.

That's why it's become a solution for organizations. It's not a one-man decision but rather a group or community effort.

Disclosure: My company does not have a business relationship with this vendor other than being a customer.

Miodrag Milojevic

Senior Data Architect at Yettel

Aug 18, 2023

Parallel computing helped create data lakes with near real-time loading

Pros and Cons

"It's easy to prepare parallelism in Spark, run the solution with specific parameters, and get good performance."

"If you have a Spark session in the background, sometimes it's very hard to kill these sessions because of deallocation."

What is our primary use case?

I use the solution for data lakes and big data solutions. I can combine it with other programming languages.

What is most valuable?

One of the reasons we use Spark is so we can use parallelism in data lakes. In our case, we can get many data nodes, and the main power of Hadoop and big data solutions is the number of nodes usable for different operations. It's easy to prepare parallelism in Spark, run the solution with specific parameters, and get good performance. Also, Spark has an option for near real-time loading and processing. We use Spark's micro-batches.
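
The micro-batch idea can be sketched in plain Python. This is not Spark's API, and the function names are hypothetical. It also assumes batches are cut by record count for simplicity, whereas a streaming engine would typically cut them by time interval.

```python
from itertools import islice


def micro_batches(events, batch_size):
    """Yield successive lists of at most batch_size events from an iterable."""
    iterator = iter(events)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def process_stream(events, batch_size):
    """Treat each micro-batch as a small batch job and collect one sum per batch."""
    return [sum(batch) for batch in micro_batches(events, batch_size)]
```

Because every micro-batch is handled like an ordinary small batch job, the same processing code can serve both batch and near real-time loading.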

What needs improvement?

If you have a Spark session in the background, sometimes it's very hard to kill these sessions because of deallocation. In combination with other tools, many sessions remain, even if you think they've stopped. This is the main problem with big data sessions, where zombie sessions reside that you have to take care of. Otherwise, they consume resources and cause problems.

For how long have I used the solution?

I've been using Apache Spark for more than two years. I'm using the latest version.

What do I think about the stability of the solution?

The solution is stable, but not completely. For example, we use Exadata as an extremely stable data warehouse, but that's not possible with big data. There are things that you have to fix sometimes. The stability is similar to the cloud solution, but that depends on the solution you need.

What do I think about the scalability of the solution?

The solution is scalable, but adding new nodes is not easy. It will take some time to do that, but it's scalable. We have about 20 users using Apache Spark. We regularly use the solution.

How are customer service and support?

We use the Cloudera distribution, which is not open-source, so we ask Cloudera for support.

How was the initial setup?

When you install the complete environment, you install Spark as a part of this solution. The setup can be tricky when introducing security, such as connecting Spark using Kerberos. It can be tricky because when you use it, you have to distribute your architecture with many servers, and even then, you have to prepare Kerberos on every server. It's not possible to do this in one place.

Deploying Apache Spark is pretty complex, but that is a problem with the security approach. Our security team requested this security, so Kerberos authentication is mandatory for us, which can be complex. We had five people for maintenance and deployment, not to mention other roles.

What about the implementation team?

We had an external integrator, but we also had in-house knowledge. Sometimes, we need to change or install something, and it's not good to ask the integrator for everything because of availability and planning. We had more freedom thanks to our internal knowledge.

What's my experience with pricing, setup cost, and licensing?

Apache Spark is not too cheap. You have to pay for hardware and Cloudera licenses. Of course, there is a solution with open source without Cloudera. But in that case, you don't have any support. If you face a problem, you might find something in the community, but you cannot ask Cloudera about it. If you have open source, you don't have support, but you have a community. Cloudera has different packages, which are licensed versions of products like Apache Spark. In this case, you can ask Cloudera for everything.

What other advice do I have?

Spark was written in Scala. Scala is a programming language that runs on the Java Virtual Machine and is useful for data lakes.

We thought about using Flink instead, but it wasn't useful for us and wouldn't have added any value. Besides, Spark's community is much wider, so more and better information is available than for Flink.

I rate Apache Spark an eight out of ten.

If you plan to implement Apache Spark on a large-scale system, you should learn to use parallelism, partitioning, and everything from the physical level to get the best performance from Spark. And it will be good to know Python, especially for data scientists using PySpark for analysis. Likewise, it's good to know Scala because you can be very efficient in preparing some datasets since it is Spark's native language.
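
To make the partitioning and parallelism advice concrete, here is a plain-Python sketch. It is not Spark's API, the function names are hypothetical, and threads stand in for worker nodes. Records are assigned to partitions by a stable hash of their key, so each partition can be aggregated independently and in parallel.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def partition_by_key(records, num_partitions):
    """Assign each (key, value) record to a partition using a stable hash of the key."""
    parts = [[] for _ in range(num_partitions)]
    for key, value in records:
        index = zlib.crc32(key.encode("utf-8")) % num_partitions
        parts[index].append((key, value))
    return parts


def sum_by_key(partition):
    """Aggregate one partition independently of the others."""
    totals = {}
    for key, value in partition:
        totals[key] = totals.get(key, 0) + value
    return totals


def parallel_sum(records, num_partitions=4):
    """Partition the records, aggregate the partitions in parallel, and merge."""
    parts = partition_by_key(records, num_partitions)
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        results = list(pool.map(sum_by_key, parts))
    merged = {}
    for partial in results:
        merged.update(partial)  # safe: each key lives in exactly one partition
    return merged
```

The choice of key matters: if most records share one key, a single partition does most of the work and the extra nodes add little.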

Which deployment model are you using for this solution?

On-premises

Disclosure: My company does not have a business relationship with this vendor other than being a customer.

Ilya Afanasyev

Senior Software Development Engineer at Yahoo!

Aug 22, 2022

Reliable, able to expand, and handle large amounts of data well

Pros and Cons

"There's a lot of functionality."
"It's a nice system for batch processing huge data."

"I know there is always discussion about which language to write applications in and some people do love Scala. However, I don't like it."
"They currently use a JDK version which is a little bit old. Not all features are on it."

What is our primary use case?

It's a root product that we use in our pipeline.

We have some input data. For example, we have one system that supplies data to MongoDB, and we pull this data from MongoDB, enrich it with additional fields from other systems, and write it to S3 for other systems. Since we have a lot of data, we need a parallel process that runs hourly.

What is most valuable?

We use batch processing. It works well with our formats and file versions. There's a lot of functionality.

Each hour, our pipeline copies changes from MongoDB to a specific file. Previously, the pipeline copied all of the data from all of the tables each time, even when nothing had changed. The tables hold a lot of data, and the latest MongoDB version makes it possible to read only the changed data. This reduced the cost and configuration of the cluster, and we saved about $150,000.
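
One generic way to read only changed data is to track a watermark, as in the plain-Python sketch below. This is an assumption for illustration only: the review does not say which MongoDB feature was used, and the `updated_at` field and function name are hypothetical.

```python
def read_changes(rows, last_watermark):
    """Return the rows modified after last_watermark, plus the new watermark."""
    changed = [row for row in rows if row["updated_at"] > last_watermark]
    new_watermark = max(
        (row["updated_at"] for row in changed),
        default=last_watermark,  # nothing changed: keep the old watermark
    )
    return changed, new_watermark
```

Each hourly run would pass in the watermark saved by the previous run, so unchanged rows are never copied again.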

The solution is scalable.

It's a stable product.

What needs improvement?

The primary language for developers on Spark is Scala, but Java is now supported as well. I prefer Java to Scala, so it is good that both are supported. I know there is always discussion about which language to write applications in, and some people do love Scala. However, I don't like it.

They currently use a JDK version which is a little bit old. Not all features are available on it. Maybe they should update the supported JDK version.

For how long have I used the solution?

I've used the solution for a year and a half.

What do I think about the stability of the solution?

The solution is stable. There are no bugs or glitches. It doesn't crash or freeze.

What do I think about the scalability of the solution?

The product scales well. It's fine to expand if needed.

Many teams use Spark. For example, we have a few kinds of pipelines, huge pipelines. One of them processes 300 billion events each day. It's our core technology currently.

We do not plan to increase usage. We keep our legacy system on Spark, and we are now discussing Flink and Spark and what we would prefer. However, most of the people are already migrating new systems to Flink. We will keep Spark for a few more years still.

How are customer service and support?

We have an internal team that participates in the process of developing Spark. They are Spark contributors, and if we have problems, we turn to them. They are our own people, yet they work on Spark. Generally, if the problem is more minor, we look at some websites, read discussions about Spark, or ask internal colleagues who have experience with Spark.

Which solution did I use previously and why did I switch?

We also use Flink.

Before Spark, I worked at another company where we used different technologies, including Kafka, Redis, PostgreSQL, S3, and Spring.

How was the initial setup?

I didn't handle the initial setup. We were using this pipeline and clusters already. I just installed it on my local server. However, in terms of difficulty, I didn't see any problem. The deployment might only take a few hours.

I found some documentation. I got the documentation from the site, downloaded the archive, unzipped it, and installed it. I can't say that I installed anything with a special configuration. I just installed a few nodes for debugging and for running locally, and that's all. Also, in one case I used a Docker configuration with Spark. It all worked fine.

What's my experience with pricing, setup cost, and licensing?

It's an open-source product. I don't know much about the licensing aspect.

Which other solutions did I evaluate?

We have compared Flink and Spark as two possible options.

What other advice do I have?

I can recommend the product. It's a nice system for batch processing huge data.

I'd rate the solution eight out of ten.

Disclosure: My company does not have a business relationship with this vendor other than being a customer.

Anshuman Kishore

Director Product Development at Mycom Osi

Apr 1, 2024

Available for free and can be deployed easily

Pros and Cons

"The product's deployment phase is easy."

"At times during the deployment process, the tool goes down, making it look less robust. To take care of the issues in the deployment process, users need to do manual interventions occasionally."

What is our primary use case?

I use the solution in my company for one of the cases where we have to deal with areas like topology engines and big topology chains.

What is most valuable?

Overall, my company likes the product since it is a good tool.

What needs improvement?

There can be challenges in getting a good developer for Apache Spark. Getting developers in the market with the right skill set for Apache Spark is tough. The aforementioned area can be considered for improvement in the product.

At times during the deployment process, the tool goes down, making it look less robust. To take care of the issues in the deployment process, users need to do manual interventions occasionally. I feel that the use of large datasets can be a cause of concern during the tool's deployment phase, making it an area where improvements are required.

For how long have I used the solution?

I have been using Apache Spark for seven to eight years.

What do I think about the stability of the solution?

Stability-wise, I rate the solution an eight and a half out of ten.

What do I think about the scalability of the solution?

It is a very scalable solution.

In our company, there are users of Apache Spark, and then there are users of the applications that were developed with it.

Currently, my company does not plan to increase the use of the product.

How was the initial setup?

The product's deployment phase is easy.

The product's deployment phase involved the CI/CD pipeline and Jenkins pipeline.

Earlier, the solution was deployed on an on-premises model. Later on, the solution was deployed on a cloud model.

Initially, during the product's deployment phase, it took more than four to five hours. With the passage of time, the product's deployment process became easier.

Around 50 to 100 people in my company are involved in the product's deployment process.

What's my experience with pricing, setup cost, and licensing?

Considering the product version used in my company, I feel that the tool is not costly since the product is available for free.

What other advice do I have?

The tool offers functionality that helps my company deal with data processing in projects on a near real-time basis.

The impact of in-memory processing capabilities on the improvement of computational efficiency is one of the reasons why my company chose Apache Spark.

At the moment, my company plans to explore data analysis with Apache Spark. My company primarily used the product for data processing and not for data analysis.

If you buy the product with the capabilities of Azure DevOps and use the tool's dashboard, you find the solution to be good. The tool has an in-built UI and other good capabilities.

I feel that the product is fine and easy to use for those who plan to use it in the future. I recommended the tool to others based on the performance and scalability features it offers.

I managed data partitioning and distribution with Apache Spark once in my company.

The benefits of using the product revolve around the fact that it was easy to get the data processing done in the quickest possible way with the help of its in-memory processing and performance.

I rate the solution an eight and a half to nine out of ten.

Which deployment model are you using for this solution?

On-premises

Disclosure: My company does not have a business relationship with this vendor other than being a customer.

Hamid M. Hamid

Data architect at Banking Sector

Feb 12, 2024

Along with the easy cluster deployment process, the tool also has the ability to process huge datasets

Pros and Cons

"The deployment of the product is easy."

"Technical expertise from an engineer is required to deploy and run high-tech tools, like Informatica, on Apache Spark, making it an area where improvements are required to make the process easier for users."

What is our primary use case?

In my company, the solution is used for batch processing or real-time processing.

What needs improvement?

The product has matured at the moment. The product's interoperability is an area of concern where improvements are required.

Apache Spark can be integrated with high-tech tools like Informatica. Technical expertise from an engineer is required to deploy and run high-tech tools, like Informatica, on Apache Spark, making it an area where improvements are required to make the process easier for users.

For how long have I used the solution?

I have been using Apache Spark for three years.

What do I think about the stability of the solution?

Stability-wise, I rate the solution a nine out of ten.

What do I think about the scalability of the solution?

It is a very scalable solution. Scalability-wise, I rate the solution a nine out of ten.

There is no separate count of users for Apache Spark in my company since it is used as a processing engine.

How are customer service and support?

Apache Spark is an open-source tool, so the only support users can get for the tool is from different vendors like Cloudera or HPE.

Which solution did I use previously and why did I switch?

In the past, my company has used certain ETL tools, like Informatica, based on the performance levels offered.

How was the initial setup?

The deployment of the product is easy.

Apache Spark's cluster deployment process is very easy.

There is only a deployment process required for an application to run on Apache Spark. Apache Spark itself is a setup tool. Deploying an application using Apache Spark is easy as a user since you just need to write the code in Scala and submit it to the cluster, and then the deployment process can be done in one step.

The solution is deployed on an on-premises model.

What's my experience with pricing, setup cost, and licensing?

Apache Spark is an open-source tool. It is not an expensive product.

What other advice do I have?

The tool is used for real-time data analytics as it is very powerful and reliable. The code that you write with Apache Spark provides stability. There are many bugs that can appear according to the code that you use, which could be Java or Scala. So this is amazing. Apache Spark is very reliable, powerful, and fast as an engine. Compared with a competitor like MapReduce, Apache Spark performs 100 times better.

The monitoring part of the product is good.

The product offers clusters that are resilient and can run on multiple nodes.

The tool can run with multiple clusters.

The integration capabilities of the product with other platforms to improve our company's workflow are good.

In terms of the improvements in the product in the data analysis area, new libraries have been launched to support AI and machine learning.

My company is able to process huge datasets with Apache Spark. There is a huge value added to the organization because of the tool's ability to process huge datasets.

I rate the overall solution a nine out of ten.

Which deployment model are you using for this solution?

On-premises

Disclosure: My company does not have a business relationship with this vendor other than being a customer.

Atal Upadhyay

AVP at MIDDAY INFOMEDIA LIMITED

Apr 8, 2024

Allows us to consume data from any data source and has a remarkable processing power

Pros and Cons

"With Spark, we parallelize our operations, efficiently accessing both historical and real-time data."

"It would be beneficial to enhance Spark's capabilities by incorporating models that utilize features not traditionally present in its framework."

What is our primary use case?

We pull data from various sources and employ a buzzword to process it for reporting purposes, utilizing a prominent visual analytics tool.

How has it helped my organization?

Our experience with using Spark for machine learning and big data analytics allows us to consume data from any data source, including freely available data. The processing power of Spark is remarkable, making it our top choice for file-processing tasks.

Utilizing Apache Spark's in-memory processing capabilities significantly enhances our computational efficiency. Unlike with Oracle, where customization is limited, we can tailor Spark to our needs. This allows us to pull data, perform tests, and save processing power. We maintain a historical record by loading intermediate results and retrieving data from previous iterations, ensuring our applications operate seamlessly. With Spark, we parallelize our operations, efficiently accessing both historical and real-time data.

We utilize Apache Spark for our data analysis tasks. Our data processing pipeline starts with receiving data in raw format. We employ a data factory to create pipelines for data processing. This ensures that the data is prepared and made ready for various purposes, such as supporting applications or analysis.

There are instances where we perform data cleansing operations and manage the database, including indexing. We've implemented automated tasks to analyze data and optimize performance, focusing specifically on database operations. These efforts are independent of the Spark platform but contribute to enhancing overall performance.

What needs improvement?

It would be beneficial to enhance Spark's capabilities by incorporating models that utilize features not traditionally present in its framework.

For how long have I used the solution?

I've been engaged with Apache Spark for about a year now, but my company has been utilizing it for over a decade.

What do I think about the stability of the solution?

It offers a high level of stability. I would rate it nine out of ten.

What do I think about the scalability of the solution?

It serves as a data node, making it highly scalable. It caters to a user base of around five thousand or so.

How was the initial setup?

The initial setup isn't complicated, but it varies from person to person. For me, it wasn't particularly complex; it was straightforward to use.

What about the implementation team?

Once the solution is prepared, we deploy it onto both the staging server and the production server. Previously, we had a dedicated individual responsible for deploying the solution across multiple machines. We manage three environments: development, staging, and production. The deployment process varies, sometimes utilizing a tenant model and other times employing blue-green deployment, depending on the situation. This ensures the seamless setup of servers and facilitates smooth operations.

What other advice do I have?

Given our extensive experience with it and its ability to meet all our requirements over time, I highly recommend it. Overall, I would rate it nine out of ten.

Disclosure: My company does not have a business relationship with this vendor other than being a customer.

UjjwalGupta

Module Lead at Mphasis

Mar 14, 2024

Helps to build ETL pipelines and load data to warehouses

Pros and Cons

"The tool's most valuable feature is its speed and efficiency. It's much faster than other tools and excels in parallel data processing. Unlike tools like Python or JavaScript, which may struggle with parallel processing, it allows us to handle large volumes of data with more power easily."

"Apache Spark could potentially improve in terms of user-friendliness, particularly for individuals with a SQL background. While it's suitable for those with programming knowledge, making it more accessible to those without extensive programming skills could be beneficial."

What is our primary use case?

We're using Apache Spark primarily to build ETL pipelines. This involves transforming data and loading it into our data warehouse. Additionally, we're working with Delta Lake file formats to manage the contents.

What is most valuable?

The tool's most valuable feature is its speed and efficiency. It's much faster than other tools and excels in parallel data processing. Unlike tools like Python or JavaScript, which may struggle with parallel processing, it allows us to handle large volumes of data with more power easily.

What needs improvement?

Apache Spark could potentially improve in terms of user-friendliness, particularly for individuals with a SQL background. While it's suitable for those with programming knowledge, making it more accessible to those without extensive programming skills could be beneficial.

For how long have I used the solution?

I have been using the product for six years.

What do I think about the stability of the solution?

Apache Spark is generally considered a stable product, with rare instances of breaking down. Issues may arise with sudden increases in data volume, leading to memory errors, but these can typically be managed with autoscaling clusters. Additionally, schema changes or irregularities in streaming data may pose challenges, but these could be addressed in future software versions.

What do I think about the scalability of the solution?

About 70-80 percent of employees in my company use the product.

How are customer service and support?

We haven't contacted Apache Spark support directly because it's an open-source tool. However, when using it as a product within Databricks, we've contacted Databricks support for assistance.

Which solution did I use previously and why did I switch?

The main reason our company opted for the product is its capability to process large volumes of data. While other options like Snowflake offer some advantages, they may have limitations regarding custom logic or modifications.

How was the initial setup?

The setup and installation of Apache Spark can vary in complexity depending on whether it's done in a standalone or cluster environment. The process is generally more straightforward in a standalone setup, especially if you're familiar with the concepts involved. However, setting up in a cluster environment may require more knowledge about clusters and networking, making it potentially more complex.

What's my experience with pricing, setup cost, and licensing?

The tool is an open-source product. If you're using the open-source Apache Spark, no fees are involved at any time. Charges only come into play when using it with other services like Databricks.

What other advice do I have?

If you're new to Apache Spark, the best way to learn is by using the Databricks Community Edition. It provides a cluster for Apache Spark where you can learn and test. I rate the product an eight out of ten.

If public cloud, private cloud, or hybrid cloud, which cloud provider do you use?

Amazon Web Services (AWS)

Disclosure: My company does not have a business relationship with this vendor other than being a customer.

Vineeth Marar

Cloud solution architect at 0

Mar 10, 2024

Offers seamless integration with Azure services and on-premises servers

Pros and Cons

"The solution is scalable."

"The setup I worked on was really complex."

What is our primary use case?

My contribution primarily focused on the networking aspect, ensuring secure and reliable connections between Azure services and on-premises servers. The solution was complex, involving private links, virtual machines, and custom firewall rules to facilitate secure data transmission.

I use Apache Spark, especially for data processing and analytics. My work involves a broad range of technologies, including PostgreSQL, Apache Kafka, Spark, and various Azure services. Previously, my focus was more on networking, cybersecurity, and Azure's data services like SQL and Active Directory.

How has it helped my organization?

We've set up a Spark cluster running in Azure to process real-time data. This setup involves connecting Azure applications to the Spark cluster via Azure Private Link, ensuring secure data flow.

The architecture required detailed network design, including routing through Linux firewalls and ensuring data could be securely transmitted to and from on-premises servers.

While I was heavily involved in the network design aspect, the Spark cluster was primarily used for processing and analyzing data streams for various applications.

Moreover, from my experience, I haven't encountered significant challenges with integrations involving Spark. The crucial factor is having established connectivity.

Whether Spark is operating in Azure or on-premises doesn't significantly affect our operations, thanks to high-bandwidth solutions like ExpressRoute. The main consideration then becomes the cost. As long as we maintain performance standards, I don't see any issues, regardless of the deployment environment.

Ensuring the collection of relevant metrics and logs is critical for assessing performance improvements. The specifics of how these are collected or which tools are used might vary, but the goal is to gather comprehensive data for ongoing monitoring and improvement.

What is most valuable?

What I liked about the solution was its uniqueness. We provided the customer with a solution that hadn't been offered by anyone else before.

It involved multiple components, such as a Spark cluster, CMAX, a backend VM, and a Linux VM that mapped the service processes to the backend running on-premises, where the Kafka service was hosted.

It was challenging for people to understand how to send traffic through the private link between all these services. Ensuring the traffic was sent to the correct destination with the correct source header without any operation issues was complex, but we achieved it.

We ran multiple instances of the services for fault tolerance and scalability.

What needs improvement?

The setup I worked on was really complex.

For how long have I used the solution?

I have been using it for a year.

What do I think about the stability of the solution?

The solution was definitely stable. There were no unstable services in it. Since most services were in Azure, everything worked better.

Azure's networking products, like ExpressRoute and Private Link service, are very stable. We didn't encounter any issues with the solution.

It took some time to complete, but since then we haven't had a single support case.

What do I think about the scalability of the solution?

The solution is scalable. We used a load balancer at each tier, with multiple instances of the services running.

It's all scalable and relevant. We didn't have a lot of issues and have been monitoring the traffic flow.

We even projected the requests for the next two to three years and created scalable instances accordingly.

There are many users of Spark in our organization. For example, many customers are using Spark, often in conjunction with requests from third-party vendors. They frequently use Spark plug-ins as well.

Which solution did I use previously and why did I switch?

I've been exploring its capabilities in the OpenAI context, rather than dealing with external databases.

I've also started using Apache Kafka for messaging and event streaming, which is essential since our solutions often integrate with applications running in Azure, including Event Hubs and Service Bus for messaging. This experience includes interfacing with various technologies, not just within Microsoft's ecosystem but also with Amazon Web Services.

Learning new technologies is a continuous process, and I've never found it difficult to adapt, especially with something as foundational as Apache Kafka.

How was the initial setup?

The setup I worked on was really complex, not specifically because of Spark but due to the integration with multiple services.

It took us about a week to finalize the solution, as understanding the entire workflow and brainstorming on how to maintain private traffic was intricate.

Regarding the deployment process, it involved thorough planning and testing to ensure minimal latency. We managed to achieve a latency of around 20 to 30 milliseconds, which was pretty good.

What about the implementation team?

For the deployment process, once we have a clear understanding of the workflow, the services to be included, how they should be integrated, the policies, and the configurations to be applied, it becomes easier to structure and incorporate it into the ops pipeline.

We may need to standardize it a bit based on different customer requirements. This standardization allows customers to apply the necessary customizations once it's deployed.

It's a hybrid solution, with about 90% of the services running in the cloud and 10% on-premises.

What's my experience with pricing, setup cost, and licensing?

The licensing costs for Spark would depend on the specific packages and the needs of the project. Costs can vary based on requirements, affordability, and customer expectations.

For instance, when purchasing a virtual machine, you're asked whether you want to take advantage of the hybrid benefit or have the license cost included in the price charged by the cloud service provider, such as Azure.

If you choose the hybrid benefit, it indicates you already possess a license for the operating system and wish to avoid additional charges for that specific VM in Azure. This approach allows for a reduction in licensing costs, charging only for the service and associated resources.

The licensing arrangements can differ based on the product and service. Some products might require a license purchase upfront, with subsequent charges based only on usage.

The availability of hybrid benefits can also influence licensing costs, especially if you're using third-party services like Palo Alto in a VM from the marketplace. If you have an existing license, your costs could be reduced, but purchasing a new license would include licensing fees in the overall cost.

What other advice do I have?

My advice is to thoroughly understand your own needs and environment before making a decision. Recommendations should be based on product features, quality, accuracy, and stability.

Cost is also a factor, but it should not be the only consideration. Depending on whether the priority is performance and scalability or cost-effectiveness, I would suggest a solution that best meets those needs, whether it's a managed service or a more cost-conscious option.

I would rate Spark as ten out of ten. I haven't had any issues with Spark in my experience.

Disclosure: My company does not have a business relationship with this vendor other than being a customer.

Atif Tariq

Cloud and Big Data Engineer | Developer at Huawei

Nov 29, 2023


A scalable solution that can be used for data computation and building data pipelines

Pros and Cons

"The most valuable feature of Apache Spark is its memory processing because it processes data over RAM rather than disk, which is much more efficient and fast."

"Apache Spark should add some resource management improvements to the algorithms."

What is our primary use case?

Apache Spark is used for data computation, building data pipelines, or building analytics on top of batch data. Apache Spark is used to handle big data efficiently.

What is most valuable?

The most valuable feature of Apache Spark is its in-memory processing: it processes data in RAM rather than on disk, which is much faster and more efficient.
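To make the in-memory point concrete, here is a minimal plain-Python sketch of the idea, not Spark itself. The class `CachedDataset` and the function `slow_source` are hypothetical names invented for this illustration. The assumption is that reading the source is the expensive step, so the rows are kept in RAM after the first read and every later computation reuses them. In Spark, `cache()` and `persist()` on a DataFrame or RDD play a comparable role.

```python
from typing import Callable, Iterable, List, Optional


class CachedDataset:
    """Toy dataset that keeps its rows in memory after the first load (hypothetical)."""

    def __init__(self, loader: Callable[[], Iterable[int]]) -> None:
        self._loader = loader
        self._rows: Optional[List[int]] = None
        self.loads = 0  # how many times the slow source was actually read

    def rows(self) -> List[int]:
        # Read from the slow source only once; afterwards serve rows from RAM.
        if self._rows is None:
            self.loads += 1
            self._rows = list(self._loader())
        return self._rows


def slow_source() -> Iterable[int]:
    # Stands in for an expensive read from disk or a remote store (assumption).
    return range(1, 6)


dataset = CachedDataset(slow_source)
total = sum(dataset.rows())    # first use triggers the single load
largest = max(dataset.rows())  # second use is served from memory
```

Both computations run against the same in-memory copy, so `dataset.loads` stays at one no matter how many times the rows are reused.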

What needs improvement?

Apache Spark should add some resource management improvements to its algorithms. That way, the solution could handle data skew more efficiently in its physical and logical plans when you join different data sets.

For how long have I used the solution?

I have been working with Apache Spark for six to seven years.

What do I think about the stability of the solution?

Apache Spark is a very stable solution. The community is still working on other parts, like performance and removing bottlenecks. However, from a stability point of view, the solution is very good.

I rate Apache Spark a nine out of ten for stability.

What do I think about the scalability of the solution?

Apache Spark is a scalable solution. Roughly 50 to 100 users are using the solution in our organization.

How are customer service and support?

Apache Spark's technical support team responds on time.

How would you rate customer service and support?

Positive

How was the initial setup?

The solution’s initial setup is very easy.

What's my experience with pricing, setup cost, and licensing?

Apache Spark is an open-source solution, and there is no cost involved in deploying the solution on-premises.

What other advice do I have?

I would recommend Apache Spark to users doing analytics, data computation, or pipelines.

Overall, I rate Apache Spark ten out of ten.

Which deployment model are you using for this solution?

Hybrid Cloud

Disclosure: My company does not have a business relationship with this vendor other than being a customer.

Suriya Senthilkumar

Analyst at Deloitte

Mar 3, 2024


Processes a larger volume of data efficiently and integrates with different platforms

Pros and Cons

"The product’s most valuable features are lazy evaluation and workload distribution."

"They could improve the issues related to programming language for the platform."

What is our primary use case?

We use the product in our environment for data processing and performing Data Definition Language (DDL) operations.

What is most valuable?

The product’s most valuable features are lazy evaluation and workload distribution.
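For readers unfamiliar with lazy evaluation, the following is a plain-Python analogy, not Spark code. The class `LazyPipeline` is a hypothetical name made up for this sketch. It assumes the same split Spark uses between transformations, which are only recorded, and actions, which trigger the actual work. Here `map` and `filter` are the recorded steps and `collect` is the action.

```python
from typing import Callable, Iterable, Iterator, List


class LazyPipeline:
    """Toy stand-in for a lazily evaluated dataset (hypothetical, not a Spark class)."""

    def __init__(self, source: Iterable[int]) -> None:
        self._source = source
        self._steps: List[Callable[[Iterator[int]], Iterator[int]]] = []
        self.rows_touched = 0  # counts rows actually read from the source

    def map(self, fn: Callable[[int], int]) -> "LazyPipeline":
        # Transformation: only recorded, nothing is computed yet.
        self._steps.append(lambda rows: (fn(r) for r in rows))
        return self

    def filter(self, pred: Callable[[int], bool]) -> "LazyPipeline":
        # Transformation: only recorded, nothing is computed yet.
        self._steps.append(lambda rows: (r for r in rows if pred(r)))
        return self

    def _counted(self) -> Iterator[int]:
        for row in self._source:
            self.rows_touched += 1
            yield row

    def collect(self) -> List[int]:
        # Action: chain the recorded steps and pull every row through them.
        rows: Iterator[int] = self._counted()
        for step in self._steps:
            rows = step(rows)
        return list(rows)


pipeline = LazyPipeline(range(10)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
touched_before = pipeline.rows_touched  # no rows read while the plan is being built
result = pipeline.collect()             # the whole plan runs here, in one pass
```

Because nothing runs until `collect` is called, the whole chain of steps is known before any row is read. That is what gives an engine like Spark room to plan the work before executing it.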

What needs improvement?

They could address the platform's programming-language-related issues.

For how long have I used the solution?

We have been using Apache Spark for around two and a half years.

What do I think about the stability of the solution?

The platform’s stability depends on how effectively we write the code. We encountered a few issues related to programming languages.

What do I think about the scalability of the solution?

We have more than 100 Apache Spark users in our organization.

Which solution did I use previously and why did I switch?

Before choosing Apache Spark for processing big data, we evaluated another option, Hadoop. In comparison, Spark emerged as the superior choice.

How was the initial setup?

The complexity of the initial setup depends on whether it's in the cloud or on-premises. For cloud deployments, especially on platforms like Databricks, the process is straightforward and easy to configure. On-premises deployments tend to be more time-consuming, although not overly complex.

What's my experience with pricing, setup cost, and licensing?

They provide an open-source license for the on-premises version. However, we have to pay for the cloud version, including data centers and virtual machines.

What other advice do I have?

Apache Spark is a good product for processing large volumes of data compared to other distributed systems. It provides efficient integration with Hadoop and other platforms.

I rate it a ten out of ten.

Disclosure: My company does not have a business relationship with this vendor other than being a customer.