What is our primary use case?
I use it to store large datasets and distribute them across multiple nodes. It manages big data well, especially large files such as media files, which makes it my preferred choice in those scenarios. Because the data is distributed, processing and retrieval are efficient.
How has it helped my organization?
It has been very useful for machine learning work because it is designed as a file storage system. That makes it well suited to development, since users can inspect the stored files directly during the development process.
What is most valuable?
A query against the core system does not retrieve the actual file; it returns the file's location and metadata. Durability comes from replication: a configurable replication factor determines how many copies of the data are maintained. It handles diverse data formats well and is particularly useful for building a data lake for companies with mixed data types, such as videos and structured data.
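To illustrate the lookup behavior described above, here is a minimal sketch in Python. It is a toy model and not the real HDFS API: the class names, the file path, and the node names are all hypothetical, and the real system tracks files in blocks rather than as whole files.

```python
from dataclasses import dataclass, field


@dataclass
class FileMeta:
    """Metadata a lookup returns: no file contents are stored here."""
    size_bytes: int
    replication: int
    # Hypothetical node names that each hold a copy of the file.
    locations: list[str] = field(default_factory=list)


class ToyNameNode:
    """Toy metadata index for illustration; not the real NameNode API."""

    def __init__(self) -> None:
        self._index: dict[str, FileMeta] = {}

    def register(self, path: str, size_bytes: int, locations: list[str]) -> None:
        # In this toy model the replication factor is the number of copies kept.
        self._index[path] = FileMeta(size_bytes, len(locations), list(locations))

    def locate(self, path: str) -> FileMeta:
        # Answers "where is it and how big is it", never the bytes themselves.
        return self._index[path]


name_node = ToyNameNode()
name_node.register("/lake/video/intro.mp4", 734_003_200, ["datanode-1", "datanode-2"])
meta = name_node.locate("/lake/video/intro.mp4")
print(meta.locations, meta.replication)
```

A client would use the returned locations to read the data from one of the listed nodes, which is why the lookup itself stays cheap even for very large files.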
What needs improvement?
It is less suitable for smaller datasets, where its performance may not be optimal.
For how long have I used the solution?
I have been using it for four and a half years.
What do I think about the stability of the solution?
It provides impressive stability and reliability. I would rate it nine out of ten.
What do I think about the scalability of the solution?
It scales very well. Closing all resources properly is critical, because the system's performance depends on careful resource management. Every service needs to run smoothly, and a lapse in any one of them can affect the system's functionality. In the cloud, providers let you add nodes horizontally, which adds to the overall scalability of the system. I would rate it eight out of ten.
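The point about closing resources can be sketched with a Python context manager. This is an assumption-laden illustration, not code from any Hadoop client: `ToyConnection` and `open_connection` are hypothetical names standing in for whatever file handle or client session a job opens.

```python
from contextlib import contextmanager


class ToyConnection:
    """Hypothetical stand-in for a storage client or file handle."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.closed = False

    def write(self, payload: bytes) -> int:
        if self.closed:
            raise ValueError("write on closed connection")
        return len(payload)

    def close(self) -> None:
        self.closed = True


@contextmanager
def open_connection(name: str):
    conn = ToyConnection(name)
    try:
        yield conn
    finally:
        # Runs on success and on error, so the handle is never leaked.
        conn.close()


# Normal path: the connection is closed when the block ends.
with open_connection("ingest-job") as conn:
    written = conn.write(b"sample record")

# Failure path: the connection is still closed despite the exception.
failed = None
try:
    with open_connection("failing-job") as failed:
        raise RuntimeError("simulated task failure")
except RuntimeError:
    pass

print(conn.closed, failed.closed)
```

Wrapping every handle this way keeps cleanup in one place, so a failed task does not leave connections open on the cluster.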
How was the initial setup?
The initial setup might take around one and a half days. I would rate it seven out of ten.
What about the implementation team?
Deployment can be a bit complex, especially in cloud environments. When working with entrepreneurs, the setup involves a name node and data nodes. Data nodes store the actual data, and the name node keeps track of metadata, including which data is stored on each data node. A replication factor of two or three provides redundancy, because the data held on one data node has a backup copy on another node. For a typical application, one name node and two data nodes are often sufficient.
Both Azure and AWS offer managed services that simplify deployment, but deploying without them is also possible. Open-source distributions can be installed on three machines (one for the name node and two for the data nodes) without relying on managed services from Azure or AWS. This approach offers flexibility: managed services are convenient, but they are not always necessary.
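The small layout described above (one name node, two data nodes) can be reasoned about with a short sketch. It uses a simplified round-robin placement rule that is my own assumption for illustration; real placement also considers racks and node load. The host names and block ids are hypothetical. Note that the number of copies is capped at the number of data nodes, so with two data nodes a replication factor of three cannot be fully honored.

```python
def place_replicas(blocks, data_nodes, replication):
    """Assign each block to distinct nodes, round-robin (simplified rule)."""
    # A block cannot have more copies than there are data nodes.
    copies = min(replication, len(data_nodes))
    placement = {}
    for i, block in enumerate(blocks):
        placement[block] = [
            data_nodes[(i + k) % len(data_nodes)] for k in range(copies)
        ]
    return placement


def survives_loss_of(placement, lost_node):
    """True if every block still has at least one copy on another node."""
    return all(
        any(node != lost_node for node in nodes) for nodes in placement.values()
    )


data_nodes = ["datanode-1", "datanode-2"]  # hypothetical host names
blocks = ["blk_0", "blk_1", "blk_2"]       # hypothetical block ids

replicated = place_replicas(blocks, data_nodes, replication=2)
single_copy = place_replicas(blocks, data_nodes, replication=1)

# With two copies, losing either data node leaves every block readable.
print(all(survives_loss_of(replicated, n) for n in data_nodes))
# With one copy, losing a node loses the blocks stored only there.
print(all(survives_loss_of(single_copy, n) for n in data_nodes))
```

In this sketch the two-copy layout tolerates the loss of either data node, while the single-copy layout does not, which is the redundancy argument for keeping the replication factor at two or more.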
What's my experience with pricing, setup cost, and licensing?
It is an open-source solution.
What other advice do I have?
If the use case is a data warehousing solution with complex functionality, data analytics, and extensive querying requirements, especially if a data lake is needed, then Hadoop combined with technologies like Hive and Spark is advisable. On the other hand, if the application mainly needs to handle large volumes of data without complex features, data warehousing, or advanced querying, and dashboard visibility is not a priority, then MongoDB could be a suitable choice. Overall, I would rate it nine out of ten.
Which deployment model are you using for this solution?
Public Cloud
If public cloud, private cloud, or hybrid cloud, which cloud provider do you use?
Microsoft Azure