Apache Hadoop provides a scalable, cost-effective open-source platform capable of handling vast data volumes with features like HDFS, distributed processing, and high integration capabilities.
| Product | Mindshare (%) |
|---|---|
| Apache Hadoop | 3.2% |
| Snowflake | 9.3% |
| Teradata | 8.7% |
| Other | 78.8% |
| Title | Rating | Mindshare | Recommending | Interviews |
|---|---|---|---|---|
| Dell PowerStore | 4.4 | 1.4% | 97% | 220 |
| Teradata | 4.1 | 8.7% | 88% | 83 |
| Company Size | Count |
|---|---|
| Small Business | 11 |
| Midsize Enterprise | 7 |
| Large Enterprise | 16 |
| Company Size | Count |
|---|---|
| Small Business | 77 |
| Midsize Enterprise | 42 |
| Large Enterprise | 147 |
Apache Hadoop is known for its distributed file system HDFS, which supports large data volumes efficiently. Its open-source nature allows cost-effective scalability and compatibility with tools like Spark for enhanced analytics. While it offers significant processing power, areas for improvement include user-friendliness, interface design, security measures, and real-time data handling. Users benefit from data storage for structured and unstructured data, facilitated by its distributed processing architecture. Data replication ensures fault tolerance, while its capability to integrate with tools like Apache Atlas and Talend highlights its versatility.
Industries leverage Apache Hadoop for Big Data analytics, data lakes, ETL tasks, and enterprise data hubs, handling unstructured and structured data from IoT, RDBMS, and real-time streams. Its applications extend to data warehousing, AI/ML projects, and data migration, employing tools like Apache Ranger, Hive, and Talend for effective data management and analysis.
| Author info | Rating | Review Summary |
|---|---|---|
| Financial Advisor at a financial services firm with 10,001+ employees | 4.0 | We had a limited on-premises deployment of Apache Hadoop which scaled well and was reliable, but maintaining it was challenging due to a lack of resources and expertise after the original team left. We prefer solutions with structured support. |
| Principal Network and Database Engr at Parsons Corporation | 4.5 | I use Apache Hadoop daily to analyze unstructured incident data, benefiting from its AI and machine learning capabilities, strong failover support, and fault tolerance, especially useful in our dual-server setup and field environments prone to hardware failures. |
| Database Administrator at Lacoste | 4.5 | I am working on migrating a customer's data warehouse from Oracle to Hadoop, specifically considering Cloudera, due to its data warehouse capabilities and scalability. The integration of reporting tools like Power BI with Hadoop poses some challenges. |
| Software Development Consultant at Synechron | 4.0 | I use Apache Hadoop in my company for its efficient analytical processing and organized data distribution. However, it struggles with incremental data processing and has high licensing costs. Setup and technical support need improvement for better user experience and resolution times. |
| IT Support Specialist at Convergys Corporation | 4.5 | We used Apache Hadoop mainly for data analysis and storage, benefiting from its low cost, open-source nature, and efficient performance on commodity hardware. While its flexibility and resilience are advantageous, improved security measures would enhance its capabilities. |
| Head of Data at an energy/utilities company with 51-200 employees | 4.5 | Apache Hadoop's distributed computing capability efficiently accelerates data processing by distributing tasks across multiple nodes. While it offers cost savings on compute resources through optimization, the availability of comprehensive training materials could be improved to enhance onboarding and skill development. |
| Senior Associate Consultant at Applied Materials | 3.0 | I use Apache Hadoop for data storage and report generation, appreciating its open-source nature, ability to handle large data volumes, and effectiveness in data processing and storage. However, limited support requires deeper knowledge and improvisation. |
| Head Of Data Governance at Alibaba Group | 4.0 | We use the Hadoop File System for big data due to its open-source nature and cost-effectiveness. However, dealing with data skewness requires custom solutions, unlike Spark, which is more efficient and faster due to in-memory processing. |
| Senior Data Architect at Yettel | 4.0 | I've been using Apache Hadoop for data collection, finding it effective for file storage and maintenance, though its stability needs improvement. Calculating ROI is difficult, and while we explored Azure, Hadoop proved more acceptable for our needs. |
| Manager at Robi Axiata Limited | 3.5 | I use Apache Hadoop for big data analysis and AI/ML, valuing its processing for newer technologies. However, I find its user-friendliness lacking, as GUI changes require advanced coding. Stability is good, and setup was easy. |
Neutral
My use cases for Apache Hadoop include the setups I completed, connecting to the database, and analyzing incidents, and it is a good tool for that work. Apache Hadoop helps us analyze all of the accidents escalated to a higher level.
I use Apache Hadoop for analyzing unstructured data because we have numerous incident reports. It actually uses Mango, I remember.
I rely on Apache Hadoop every day for data analysis, and it helps with failover.
The best feature of Apache Hadoop is that we can use it for analysis with AI and machine learning.
The data replication feature of Apache Hadoop is notable, though I'm not that advanced in using it.
I assess Apache Hadoop's fault tolerance during hardware failures positively since we have hardware failover, which works without problems. We have dual servers everywhere, and Apache Hadoop is installed in these dual servers.
Apache Hadoop helps us in cases of hardware failure because it works 24/7, and sometimes servers crash in the field. Not all of the servers are in a data center, and Apache Hadoop helps us to failover and analyze. It's very powerful.
Our use case is for a customer who wants to migrate their data warehouse to Hadoop. It's a request from a customer in Senegal who wants to migrate their Oracle data warehouse to Hadoop. I'm trying to migrate it to Hive or HBase.
They're choosing between upgrading Oracle or moving to Cloudera Hadoop. They seem to prefer Cloudera.
The current data warehouse runs on Oracle DB, but we have to migrate the analytics process to Hadoop.
My customers like the HDFS and the data warehouse capabilities within Hadoop.
They have integrated other tools as well, like Power BI and Oracle BI, both on Azure, for reporting. Oracle BI is difficult to integrate.
It is difficult to integrate them with Hadoop.
I have a little experience with it. I'm a database expert, not a Hadoop expert, so I haven't worked extensively with it.
If the customer chooses it, the solution must be scalable. They're a telecom company and need a high-level solution.
Hadoop is quite scalable, and that's the main requirement. They want a solution that can handle 100 terabytes for the first step.
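As a rough illustration of what a 100-terabyte first step means for cluster sizing, here is a minimal sketch in Python. The 100 TB figure comes from the review; the replication factor of 3 and the 128 MB block size are the HDFS defaults, and the 25% headroom is a hypothetical planning margin rather than a Hadoop setting.

```python
# Rough HDFS sizing estimate (illustrative only).
TB = 1024 ** 4
MB = 1024 ** 2

def raw_capacity_tb(logical_tb, replication=3, headroom=0.25):
    """Raw disk (in TB) needed to hold `logical_tb` of user data.

    `headroom` is an assumed safety margin for temporary and growth space.
    """
    return logical_tb * replication * (1 + headroom)

def block_count(logical_tb, block_size_mb=128):
    """Number of HDFS blocks before replication, assuming large files."""
    # Ceiling division on integers
    return -(-(logical_tb * TB) // (block_size_mb * MB))

print(raw_capacity_tb(100))  # 375.0
print(block_count(100))      # 819200
```

Under these assumptions, 100 TB of warehouse data translates to roughly 375 TB of raw disk across the cluster, which is worth factoring into any hardware proposal.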
I contacted only an integrator who works on the design process.
Oracle doesn't provide Big Data appliances now.
The client doesn't have experience with Hadoop. If they choose this solution, there will be training.
The integrator will provide a training program.
It took three months for migration.
The solution is expensive, but that's for the customer to decide.
We already have the cost of the materials and the license, but we can compare pricing with other integrators and can't see any specific difference.
It's an annual license. The customer is part of a larger company that already uses Hadoop. They know what they want and have a use case in mind, which is why they asked us for a proposal. We just give the pricing and information.
The customer is comparing it to Orange, another telecom operator in Senegal, who uses Hadoop successfully.
The main reason to choose Apache is the comparison to Orange, and that Hadoop is very scalable.
It's a good solution. Other companies have used it for over ten years with great success.
For the telecom sector, I would give it a nine out of ten.
I recommend it for the telecom sector. I know it well, and it's a good fit.

I use the solution in my company since it makes analytical processing easy. It takes data into one cluster and then processes it. Whatever GPU I am working on and whatever the analytics are, the processing is very fast, and I get insights from the data quickly.
The main features of the tool are its data distribution and the way it builds data clusters; the data is all very well organized in Apache Hadoop.
When working with Kafka, I saw that the data came in incrementally. The incremental data processing part is still not very effective in Apache Hadoop. If the data is already there, it can be processed very effectively, but not when data is arriving every second. For example, if you want to know the location of some data every second, such data is not processed effectively in Apache Hadoop.
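The difference between processing data that is already there and handling data that arrives every second can be sketched in plain Python. This is only an illustration of the two styles, not Hadoop or Kafka code; the event schema, device names, and class name are hypothetical.

```python
from typing import Dict, Iterable, Tuple

# (timestamp, device_id, location): a hypothetical event schema
Event = Tuple[int, str, str]

def latest_by_batch(events: Iterable[Event]) -> Dict[str, str]:
    """Batch style: rescan the full history each time an answer is needed."""
    latest: Dict[str, Tuple[int, str]] = {}
    for ts, device, loc in events:
        if device not in latest or ts >= latest[device][0]:
            latest[device] = (ts, loc)
    return {device: loc for device, (_, loc) in latest.items()}

class IncrementalLocations:
    """Incremental style: fold each new event into a small running state."""

    def __init__(self) -> None:
        self._state: Dict[str, Tuple[int, str]] = {}

    def update(self, event: Event) -> None:
        ts, device, loc = event
        if device not in self._state or ts >= self._state[device][0]:
            self._state[device] = (ts, loc)

    def current(self, device: str) -> str:
        return self._state[device][1]

history = [(1, "truck-1", "A"), (2, "truck-2", "B"), (3, "truck-1", "C")]
tracker = IncrementalLocations()
for event in history:
    tracker.update(event)

print(latest_by_batch(history))    # {'truck-1': 'C', 'truck-2': 'B'}
print(tracker.current("truck-1"))  # C
```

Both give the same answer, but the batch function does work proportional to the whole history on every call, while the incremental tracker does constant work per event. That is why per-second lookups tend to fit a streaming layer better than a batch job.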
One area where improvement is required is the tool's licensing cost. If the tool offered a pay-per-use licensing structure, similar to how AWS operates, organizations could get the look and feel of Apache Hadoop before committing. Apache Hadoop should also look into its capability for processing incremental data.
The tool's setup process is another area for improvement. It is not very simple because, during setup, we need to do all the server settings, including port listing and firewall configuration. Compared with other products on the market, it could be made simpler.
There are certain shortcomings when it comes to the product's technical support part, making it an area where improvements are required.
The time frame for the resolution is an area that needs to be improved. The overall communication part of the technical support team also needs improvement.
I have been using Apache Hadoop for three years.
Stability-wise, the tool remains stable even during a big sale event such as Big Billion Days. I rate the solution's stability an eight out of ten.
It is a reliable product. Even when a business scales up, it can manage the data and the processing. Scalability-wise, I rate the solution an eight out of ten.
The product is suitable for enterprise-sized companies with more than 1,000 employees.
I rate the technical support a seven out of ten.
Neutral
The product's initial setup phase is not very simple.
I required one DevOps engineer for the product's installation phase.
The solution is deployed on the cloud.
For configuration, we take about three days.
For any big enterprise the costs can be handled, and it is suitable for big enterprises because the scale of data is large. For medium and small enterprises, the tool is on the high-price side.
It is easy to integrate Hadoop with your IT workflow since I am using the cloud version only. In the cloud, I could install or put my data, but I did not have to get everything installed on my machine. The cloud can be accessed from anywhere, making it a really valuable tool.
I am building a GenAI model, and so, in our company, we are making a data pipeline that we have integrated into Apache Kafka. Earlier, we also included Apache Hadoop in the process. We have integrated the product into some AI tools. Whatever data is getting processed is coming out, and new data is coming in and getting processed. We have integrated the tool into some LLM models.
I will recommend the tool to any enterprise company that has an employee strength of more than 1,000.
I rate the tool an eight out of ten.

We used the product primarily for data analysis and storage. It helps handle large data sets, performing tasks like filtering, sorting, and joining. The platform is useful for data warehousing and provides distributed coordination and synchronization functionalities.
The solution has effectively supported our operations primarily due to its cost efficiency. It enables us to manage large data sets without incurring excessive subscription costs, resulting in more efficient data handling and operations.
The platform's most valuable feature is its low cost and open-source nature. It runs efficiently on commodity hardware and supports a large ecosystem of tools. Its flexibility in handling and storing large volumes of data is particularly beneficial, as is its resilience, which ensures data redundancy and fault tolerance.
Improvements in security measures would be beneficial, given the large volumes of data handled. Robust security features are essential to prevent data leaks or breaches. Additionally, integrating advanced capabilities similar to those of other solutions would enhance the platform's functionality.
I have worked with Apache Hadoop for about six to seven months. The duration varied based on the projects I was involved in, as we often switched to different projects with different applications.
Although I have encountered some performance issues, the platform has proven to be stable.
I would rate its stability as eight or nine.
This platform's scalability significantly impacts data management capabilities. It allows for simultaneously handling large data volumes, which other applications might struggle with.
I have contacted Apache tech support when encountering issues that could not be resolved internally. Their support is reliable and responsive.
Positive
The initial setup can be complex due to the need for precise coding and configuration. Setting up the required components and ensuring all dependencies are correctly configured is crucial for a successful deployment. Depending on network capability and system specifications, the setup typically takes 30 minutes to one hour if prerequisites are met. Maintenance involves regular updates to ensure the platform runs with the latest features and security patches.
The product is open-source, but some associated licensing fees depend on the subscription level. While it might be free for students, organizations typically need to pay for their subscriptions. The fees were reasonable for my usage, though I am not aware of recent changes to the pricing.
The product is highly effective for processing and managing large data sets. Integrating it with other solutions like AWS can provide additional functionalities, but the cost benefits of using this platform remain significant. I have also used the solution in AI-driven projects with machine learning models, and its integration with Apache Spark has been advantageous. I recommend it to organizations needing to handle large data sets due to its cost-effectiveness and robust capabilities.
I rate it a nine out of ten.

The product's distributed computing capability is the most effective. It allows us to distribute data processing tasks across multiple nodes, significantly speeding up processing time.
The product's availability of comprehensive training materials could be improved for faster onboarding and skill development among team members.
I've been working with Apache Hadoop for about four years now.
We encounter occasional issues like memory constraints during extensive data processing. They have been manageable through scaling adjustments.
I rate the stability an eight or nine.
Hadoop's scalability can be rated a nine out of ten due to its exceptional flexibility, allowing horizontal scaling by adding or removing nodes as required.
Technical support has been decent, although there can be delays in resolving new or complex issues.
Neutral
The product deployment process was quite straightforward, taking only a few minutes for installation.
The platform has resulted in significant cost savings on compute resources by optimizing usage based on workload demands.
I would rate the product's subscription-based pricing a six out of ten. It's reasonable, but there's room for improvement in cost-effectiveness.
The platform's quick data processing capabilities have been instrumental in supporting our AI-driven projects. I would recommend it, especially for organizations dealing with large-scale data processing and needing robust distributed computing capabilities.
I would rate Apache Hadoop an eight out of ten.
We use it to store data. Our team then takes this data to create reports on top of that.
We primarily use Kafka for intensive data streaming. For batch-based processing, we use Hadoop. Additionally, we have our own custom batch catalog that likely helps prepare data for further analysis or use.
We have many projects where our main data storage is done in Hadoop only. All projects take data from Hadoop to provide data insights and reports.
Hadoop YARN for resource management is a really good aspect. It is very good for managing large data volumes. It allows us to monitor data processing effectively. We can see how much data there is, the consumption of RAM or ROM, and how resources are allocated. It's good for managing and previewing the scale of data processing.
It's primarily open source. You can handle huge data volumes and create your own views, workflows, and tables. I can also use it for real-time data streaming.
Its ability to handle open data access is significant, and the support is substantial, though not as responsive as one might hope. It's very effective for data processing and storage.
Overall, it's very good for data processing and storage.
Since it is an open-source product, there won't be much support.
So, you have to have deeper knowledge. You need to improvise based on that.
I have been using it for five years.
It is a pretty stable product.
It is easy to scale. Almost everyone uses it, so there are 1000 end users.
It is open source. There is no support.
Neutral
The installation is a bit complex. We have to have a very good knowledge about the product.
The deployment took around ten hours.
We needed two to three architects. There were support people also.
Hadoop is a good database, and it's open-source, which makes it cost-effective. But, if you have a large budget and allocations, you can go with products that have advanced analytics tools or other extra features, like data lineage.
Overall, I would rate the solution a seven out of ten.

We use the Hadoop File System. We usually keep the data for our tables or big data on it. Hadoop has a query engine called Hive. We write SQL queries, and the tool usually processes in a parallel environment and gets us the data on Hive.
Hadoop File System is a perfect choice if we want to use any database systems or file systems because it is open-source. It has no cost. Or else, we’ll have to use Amazon S3 or Azure database, for which we will have to pay a lot. A lot of big data processing needs a proper partition and structure. Hadoop File System is compatible with almost all the query engines. That’s another reason why people would be very comfortable working with the Hadoop ecosystem.
The tool provides functionalities to deal with data skewness or a diverse set of data. There are some configurations that it usually provides. In certain cases, the configurations for dealing with data skewness do not make any sense. We usually have to deal with it using a custom solution.
Spark would deal with such cases efficiently. If Hadoop solves the issues the way Spark does, it can compete with Spark at the same level. Hive is a little slower than Spark. Spark is in-memory and parallel processing. Hive is not in-memory, but it is parallel processing.
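One common custom approach to data skew is key salting: a hot key is split across several artificial sub-keys so that no single worker receives all of its records, and the partial results are merged in a second step. The sketch below shows the idea in plain Python. It is an assumption about what such a custom solution might look like, not the reviewer's actual method or a Hive or Spark API, and `NUM_SALTS` and the sample records are hypothetical.

```python
import random
from collections import Counter, defaultdict

NUM_SALTS = 4  # hypothetical; in practice tuned to the number of reducers

def salted_key(key, rng):
    """Append a random salt so one hot key spreads over several partitions."""
    return f"{key}#{rng.randrange(NUM_SALTS)}"

def aggregate_with_salting(records, seed=0):
    rng = random.Random(seed)
    # Stage 1: partial sums per salted key (each could run on a different worker)
    partial = defaultdict(int)
    for key, value in records:
        partial[salted_key(key, rng)] += value
    # Stage 2: strip the salt and combine the partial sums
    totals = Counter()
    for skey, subtotal in partial.items():
        totals[skey.rsplit("#", 1)[0]] += subtotal
    return dict(totals), dict(partial)

# One heavily skewed key and one small key
records = [("hot", 1)] * 1000 + [("cold", 1)] * 10
totals, partial = aggregate_with_salting(records)
print(totals["hot"], totals["cold"])  # 1000 10
```

The final totals are unchanged, but in stage 1 the work for the hot key is divided among up to `NUM_SALTS` groups instead of landing on one.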
I have been using the solution for three and a half years.
Around 70 to 80 people use the product in our organization.
We can easily find support for Apache products. Support is a positive aspect of the solution.
The installation is a little difficult because it is an open-source tool. It is similar to Apache Spark. The product is not self-manageable. We will have to invest a little in the setup.
People who want to buy the solution must hire or work with someone who understands the architecture as per the use case. It should be good for the long run. Once Hadoop is set up, we can change the configuration, but the architecture cannot be changed frequently. We must invest more in the architecture. Once properly built, we can build or develop anything on it. Architecture is important for Hadoop. If the product is set up well, we will not find difficulties later. Overall, I rate the solution an eight out of ten.

I have been using the latest version of Apache Hadoop. It is a file system for data collection. There are nodes in this cluster that contain all the information, directories, and other files. The nodes are based on the MySQL database.
Hadoop isn't so problematic. It deals with file storage and maintenance. It is a network of file operations.
The stability of the solution needs improvement.
There are some issues with file retention and its stability but they can be worked through. There are a lot of things that are based on disk space that require the preparation of different and sophisticated controls. The software itself is not unstable, but sometimes its options can cause stability issues.
The scalability includes adding nodes and it is not so easy to do. It is a detailed process that requires precision.
There are almost 25 users, including data engineers and others, but no specialists. We plan to increase endpoint users and introduce running reports, automated reports, or reports based on some tools.
Apache Hadoop is open-source software and only has a community rather than customer support. Cloudera provides Apache Hadoop under license and offers support. At Cloudera, consultants with less knowledge handle support for small issues. As the case escalates, they provide more support with better technical expertise.
Positive
We checked a few solutions and tried solutions from Azure. There were pros and cons but this solution was more acceptable.
The setup depends on the data. Vast data can be hard to set up. You might have some issues with the setup, but it depends on the number of nodes. More nodes can cause issues and more time to resolve. The reshuffling is also complex and can cause problems.
The ROI is very hard to calculate. The source of data for the company can help to run different technologies and make many decisions based on the data, but it's very hard to calculate the return on investment.
I am not updated with the licensing cost, but you need to pay for a license if purchased from Cloudera.
If you plan to use Apache Hadoop, purchase the license from Cloudera because they provide you with technical support.
I rate the overall solution an eight out of ten.

I'm from the data governance team, and this is how my team uses Apache Hadoop: there's a GUI called Apache Atlas, which has an option called the "business glossary". My team uses the business glossary from Apache Atlas and also uses Apache Ranger. Apache Ranger is another GUI where you can check who is using which data source through the Apache Hadoop platform. My team also uses the Apache Hadoop platform for AI-related use cases: the data required for an AI use case is processed with ETL, specifically with the Talend tool. My team then loads the data into Apache Hadoop, organizes it into clusters, and uses it for AI/ML cases.
What I like about Apache Hadoop is that it's for big data, in particular big data analysis, and it's the easier solution. I like the data processing feature for AI/ML use cases the most because some solutions allow me to collect data from relational databases, while Hadoop provides me with more options for newer technologies.
What could be improved in Apache Hadoop is its user-friendliness. It's not that user-friendly, but maybe it's because I'm new to it. Sometimes it feels so tough to use, but it could be because of two aspects: one is my incompetency, for example, I don't know about all the features of Apache Hadoop, or maybe it's because of the limitations of the platform. For example, my team is maintaining the business glossary in Apache Atlas, but if you want to change any settings at the GUI level, an advanced level of coding or programming needs to be done in the back end, so it's not user-friendly.
Apache Hadoop has good stability.
I'm not sure how scalable Apache Hadoop is.
In terms of technical support from Apache Hadoop, we are working with an external vendor and they are the ones helping us in every case. They are helpful.
We used Oracle Exadata before using Apache Hadoop. It was one or two years ago when we started using the Apache Hadoop platform. We're still thinking about using both platforms in parallel or choosing one of the two. We're still looking into the benefits of each platform, but currently, we're using both Oracle Exadata and Apache Hadoop.
I wasn't part of the team that set up Apache Hadoop, but using it after it was set up was very easy. The solution was ready immediately, and the GUI was smooth and fast, with no issues.
Apache Hadoop was implemented by the IT team, so it was an in-house implementation.
If my company could use the cloud version of Apache Hadoop, particularly the cloud storage feature, it would be easier and would cost less, because an on-premises deployment has a higher storage cost, for example, though I don't know exactly how much Apache Hadoop costs.
My company is using both Apache Hadoop and Oracle Exadata.
I'm unsure which version of Apache Hadoop I'm using, but it could be the latest version.
Currently, the solution is deployed on-premises because here in Bangladesh, there's a limitation with transferring data outside of the country. As far as I know, there's no cloud solution internally in Bangladesh, so if you want to use a cloud solution here, you'll have to move your data outside Bangladesh, and this is why Apache Hadoop is still deployed on-premises.
More than fifty people use Apache Hadoop directly, particularly the IT and analytics expert teams. The solution is being used by developers, people in operations, and people who maintain security.
In my company, Apache Hadoop is not fully implemented yet. It's still in the implementation phase and at least for the next two to three years, there isn't any plan of discarding it.
I'm giving Apache Hadoop a rating of seven out of ten.
I don't have any recommendations currently for people who want to implement Apache Hadoop because I'm still in the learning phase and I don't have much knowledge yet. The IT team in my company is also struggling every time in terms of preparing everything and still needs help from external vendors because the team isn't an expert on Apache Hadoop yet. My company's expertise is in Oracle Exadata because usage of that product started in 2002 or 2003.
My company is a customer of Apache Hadoop.