What is our primary use case?
My contribution primarily focused on the networking aspect, ensuring secure and reliable connections between Azure services and on-premises servers. The solution was complex, involving private links, virtual machines, and custom firewall rules to facilitate secure data transmission.
I use Apache Spark, especially for data processing and analytics. My work involves a broad range of technologies, including PostgreSQL, Apache Kafka, Spark, and various Azure services. Previously, my focus was more on networking, cybersecurity, and Azure's data services like SQL and Active Directory.
How has it helped my organization?
We've set up a Spark cluster running in Azure to process real-time data. This setup involves connecting Azure applications to the Spark cluster via Azure Private Link, ensuring secure data flow.
The architecture required detailed network design, including routing through Linux firewalls and ensuring data could be securely transmitted to and from on-premises servers.
While I was heavily involved in the network design aspect, the Spark cluster was primarily used for processing and analyzing data streams for various applications.
Moreover, from my experience, I haven't encountered significant challenges with integrations involving Spark. The crucial factor is having established connectivity.
Whether Spark is operating in Azure or on-premises doesn't significantly affect our operations, thanks to high-bandwidth solutions like ExpressRoute. The main consideration then becomes the cost. As long as we maintain performance standards, I don't see any issues, regardless of the deployment environment.
Ensuring the collection of relevant metrics and logs is critical for assessing performance improvements. The specifics of how these are collected or which tools are used might vary, but the goal is to gather comprehensive data for ongoing monitoring and improvement.
What is most valuable?
What I liked about the solution was its uniqueness. We provided the customer with a solution that hadn't been offered by anyone else before.
It involved multiple components: a Spark cluster, CMAX, a backend VM, and a Linux VM that mapped the service processes to the backend, which ran on-premises alongside the Kafka service.
It was challenging for people to understand how to send traffic through the private link between all these services. Ensuring the traffic reached the correct destination with the correct source header, without any operational issues, was complex, but we achieved it.
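As a rough illustration of how a Spark job might consume the on-premises Kafka service once private connectivity is in place, here is a minimal sketch using Spark's built-in Kafka source. It assumes Spark reads directly from Kafka, which the setup above does not state explicitly. The host name `kafka.internal.example` and the topic `events` are hypothetical placeholders, not values from the actual deployment.

```python
from typing import Dict


def kafka_source_options(bootstrap_servers: str, topic: str) -> Dict[str, str]:
    """Build the options for Spark's Structured Streaming Kafka source."""
    return {
        # Address that resolves to the Kafka brokers over the private path.
        "kafka.bootstrap.servers": bootstrap_servers,
        # Topic to subscribe to.
        "subscribe": topic,
        # Only read records that arrive after the query starts.
        "startingOffsets": "latest",
    }


def read_kafka_stream(spark, options: Dict[str, str]):
    """Return a streaming DataFrame; `spark` is an existing SparkSession."""
    return spark.readStream.format("kafka").options(**options).load()


# Hypothetical endpoint and topic, for illustration only.
OPTIONS = kafka_source_options("kafka.internal.example:9092", "events")
```

On a cluster with the `spark-sql-kafka-0-10` connector available, `read_kafka_stream(spark, OPTIONS)` would return a DataFrame with the usual `key`, `value`, `topic`, `partition`, `offset`, and `timestamp` columns. From Spark's point of view the private routing is invisible, so the only thing the job needs is a broker address that resolves over the private path.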
We ran multiple instances of each service for fault tolerance and scalability.
What needs improvement?
The setup I worked on was really complex.
For how long have I used the solution?
I have been using it for a year.
What do I think about the stability of the solution?
The solution was definitely stable. There were no unstable services in it. Since most services were in Azure, everything worked better.
Azure's networking products, like ExpressRoute and Private Link service, are very stable. We didn't encounter any issues with the solution.
It took some time to complete, but since then we haven't had a single support case.
What do I think about the scalability of the solution?
The solution is scalable. We used a load balancer at each tier, with multiple instances of the services running.
Every tier is scalable. We haven't had many issues, and we continue to monitor the traffic flow.
We even projected the requests for the next two to three years and created scalable instances accordingly.
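A projection like this can be sketched as simple compound-growth arithmetic. The example below is only an illustration: the starting load (400 requests per second), the 50% annual growth rate, the per-instance capacity (160 requests per second), and the single spare instance are all hypothetical numbers, not figures from our project.

```python
import math
from typing import Dict


def projected_rps(current_rps: float, annual_growth: float, years: int) -> float:
    """Compound the current request rate forward by `years`."""
    return current_rps * (1.0 + annual_growth) ** years


def instances_needed(rps: float, rps_per_instance: float, spare: int = 1) -> int:
    """Instances required to serve `rps`, plus spare instances for failover."""
    return math.ceil(rps / rps_per_instance) + spare


# Hypothetical inputs: 400 req/s today, 50% yearly growth, 160 req/s per instance.
PLAN: Dict[int, int] = {
    year: instances_needed(projected_rps(400.0, 0.5, year), 160.0)
    for year in range(0, 4)
}
print(PLAN)  # {0: 4, 1: 5, 2: 7, 3: 10}
```

With these assumed inputs, the instance count behind each load balancer grows from 4 today to 10 in year three. The spare instance is a design choice so that losing one node does not push the tier over capacity.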
There are many users of Spark in our organization. For example, many customers are using Spark, often in conjunction with requests from third-party vendors. They frequently use Spark plug-ins as well.
Which solution did I use previously and why did I switch?
I've been exploring its capabilities in the OpenAI context, rather than dealing with external databases.
I've also started using Apache Kafka for messaging and event streaming, which is essential since our solutions often integrate with applications running in Azure, including Event Hubs and Service Bus for messaging. This experience includes interfacing with various technologies, not just within Microsoft's ecosystem but also with Amazon Web Services.
Learning new technologies is a continuous process, and I've never found it difficult to adapt, especially with something as foundational as Apache Kafka.
How was the initial setup?
The setup I worked on was really complex, not specifically because of Spark but due to the integration with multiple services.
It took us about a week to finalize the solution, because understanding the entire workflow and working out how to keep the traffic private was intricate.
Regarding the deployment process, it involved thorough planning and testing to ensure minimal latency. We managed to achieve a latency of around 20 to 30 milliseconds, which was pretty good.
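One simple way to check a latency target like this during testing is to time TCP connections across the private path and summarize the samples. The sketch below is an assumption about how such a check could look, not the method we actually used. TCP connect time is only a proxy for network round-trip time, and the sample values and the 30 ms budget are hypothetical.

```python
import math
import socket
import statistics
import time
from typing import Dict, List


def tcp_connect_ms(host: str, port: int, timeout: float = 2.0) -> float:
    """Time one TCP handshake to host:port, in milliseconds."""
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout):
        pass
    return (time.perf_counter() - start) * 1000.0


def summarize(samples_ms: List[float]) -> Dict[str, float]:
    """Median, nearest-rank 95th percentile, and maximum of the samples."""
    ordered = sorted(samples_ms)
    rank = max(1, math.ceil(0.95 * len(ordered)))
    return {
        "median": statistics.median(ordered),
        "p95": ordered[rank - 1],
        "max": ordered[-1],
    }


def within_budget(summary: Dict[str, float], budget_ms: float = 30.0) -> bool:
    """True when the 95th percentile stays inside the latency budget."""
    return summary["p95"] <= budget_ms


# Hypothetical samples; in practice collect them with tcp_connect_ms(...).
SUMMARY = summarize([21.0, 24.5, 22.3, 29.8, 26.1])
```

Judging the result on a high percentile rather than the average is deliberate, because a few slow hops through the firewalls can hide behind a good mean.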
What about the implementation team?
For the deployment process, once we have a clear understanding of the workflow, the services to be included, how they should be integrated, the policies, and the configurations to be applied, it becomes easier to structure and incorporate it into the ops pipeline.
We may need to standardize it a bit based on different customer requirements. This standardization allows customers to apply the necessary customizations once it's deployed.
It's a hybrid solution, with about 90% of the services running in the cloud and 10% on-premises.
What's my experience with pricing, setup cost, and licensing?
The licensing costs for Spark would depend on the specific packages and the needs of the project. Costs can vary based on requirements, affordability, and customer expectations.
Licensing costs can vary. For instance, when purchasing a virtual machine, you're asked if you want to take advantage of the Azure Hybrid Benefit or if you prefer the license costs to be included upfront by the cloud service provider, such as Azure.
If you choose the Hybrid Benefit, it indicates you already possess a license for the operating system and wish to avoid additional charges for that specific VM in Azure. This approach reduces licensing costs, because you are charged only for the service and associated resources.
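The effect on a single VM's bill can be shown with a small calculation. The hourly rates below are hypothetical and expressed in cents to keep the arithmetic exact. Real prices depend on the VM size, region, and operating system.

```python
def monthly_vm_cost_cents(
    compute_cents_per_hour: int,
    license_cents_per_hour: int,
    hours: int,
    hybrid_benefit: bool,
) -> int:
    """Monthly cost of one VM; the license charge is skipped with hybrid benefit."""
    hourly = compute_cents_per_hour
    if not hybrid_benefit:
        # No existing license, so the provider adds the OS license charge.
        hourly += license_cents_per_hour
    return hourly * hours


# Hypothetical rates: 20 cents/hour compute, 9 cents/hour license, 730 hours/month.
WITH_LICENSE = monthly_vm_cost_cents(20, 9, 730, hybrid_benefit=False)
WITH_BENEFIT = monthly_vm_cost_cents(20, 9, 730, hybrid_benefit=True)
SAVINGS = WITH_LICENSE - WITH_BENEFIT
print(WITH_LICENSE, WITH_BENEFIT, SAVINGS)  # 21170 14600 6570
```

Under these assumed rates, bringing an existing license saves 65.70 per VM per month, which adds up quickly across a multi-instance tier.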
The licensing arrangements can differ based on the product and service. Some products might require a license purchase upfront, with subsequent charges based only on usage.
The availability of hybrid benefits can also influence licensing costs, especially if you're using third-party services like Palo Alto in a VM from the marketplace. If you have an existing license, your costs could be reduced, but purchasing a new license would include licensing fees in the overall cost.
What other advice do I have?
My advice is to thoroughly understand your own needs and environment before making a decision. Recommendations should be based on product features, quality, accuracy, and stability.
Cost is also a factor, but it should not be the only consideration. Depending on whether the priority is performance and scalability or cost-effectiveness, I would suggest a solution that best meets those needs, whether it's a managed service or a more cost-conscious option.
I would rate Spark a ten out of ten. I haven't had any issues with Spark in my experience.
Disclosure: My company does not have a business relationship with this vendor other than being a customer.