As a DevOps engineer, my day-to-day task is to move files from one location to another, doing some transformation along the way. For example, I might pull messages from Kafka and put them into S3 buckets. Or I might move data from a GCS bucket to another location.
NiFi is really good for this because it has very good monitoring and metrics capabilities. When I design a pipeline in NiFi, I can see how much data is being processed, where it is at each stage, and what the total throughput is.
I can see all the metrics related to the complete pipeline. So, I personally like it very much.
The good thing about Apache NiFi is that it has a concept called a flow file, and there's something called a flow file processor. The processor is the building block of your entire job, and there are close to 500 processors, each built for a specific purpose.
For example, for reading from Kafka, NiFi has a processor called ConsumeKafka. To write to S3, there's a processor called PutS3Object. If I instead read from Kafka with my own application, I'd need to ensure the library I'm using tracks my offsets, and I'd need to handle failures by rereading messages and acknowledging them properly. All of that complexity is already handled by the NiFi processors, as the sketch below illustrates.
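To make that concrete, here is a minimal sketch of the hand-rolled alternative in Python, using kafka-python and boto3 with manual offset commits. The topic, broker, and bucket names are placeholders I've made up for illustration:

```python
import boto3
from kafka import KafkaConsumer

# Hypothetical topic, broker, and bucket names.
consumer = KafkaConsumer(
    "events",
    bootstrap_servers="kafka:9092",
    enable_auto_commit=False,  # commit only after S3 confirms the write
)
s3 = boto3.client("s3")

for message in consumer:
    key = f"events/{message.partition}/{message.offset}.json"
    # If the upload throws, the offset is never committed, so the message
    # is re-read after a restart -- bookkeeping that ConsumeKafka and
    # PutS3Object would otherwise handle for you.
    s3.put_object(Bucket="my-data-lake", Key=key, Body=message.value)
    consumer.commit()  # acknowledge only after a successful write
```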
The community invests significant effort into developing these processors. I can design my pipeline with a few clicks, export the entire workflow, and import it elsewhere. The exported format is directly usable, so NiFi is set up immediately.
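As a rough illustration, assuming a NiFi 1.x instance with the template feature, the export/import round trip looks something like this against NiFi's REST API; the hosts and IDs here are placeholders:

```python
import requests

SRC = "http://nifi-a:8080/nifi-api"   # source instance (placeholder host)
DST = "http://nifi-b:8080/nifi-api"   # destination instance (placeholder host)
TEMPLATE_ID = "my-template-id"        # placeholder template ID
TARGET_PG = "root"                    # target process group on the destination

# Download the flow template as XML from the source instance.
xml = requests.get(f"{SRC}/templates/{TEMPLATE_ID}/download").content

# Upload it into a process group on the destination instance.
resp = requests.post(
    f"{DST}/process-groups/{TARGET_PG}/templates/upload",
    files={"template": ("flow.xml", xml, "application/xml")},
)
resp.raise_for_status()
```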
It's also distributed in nature, so I can scale it across nodes based on the workload, and the nodes share their state. If one node goes down mid-processing, the data it was handling might be lost, but any subsequent data is safe. Such occurrences are rare.
In essence, if you want a quick solution, Apache NiFi is a strong contender. There are other solutions like Airflow and some paid pipeline options.
Airflow is open-source but can be complicated. For ETL or ELT solutions, there are pricier options. But if I need a pipeline that I can monitor step by step, Apache NiFi is a good choice. It integrates with Prometheus, so I can embed metrics in my workflow, as sketched below.
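NiFi exposes flow metrics over HTTP once a Prometheus reporting task is configured. A minimal sketch of reading that endpoint, assuming the task is enabled; the host is a placeholder and the metric name is only illustrative of what the endpoint exposes:

```python
import requests

# Assumes a PrometheusReportingTask has been added in NiFi, listening on
# whatever port you configured for it (placeholder host and port here).
metrics = requests.get("http://nifi-host:9092/metrics").text

for line in metrics.splitlines():
    # Illustrative filter: per-processor byte counters exposed by NiFi.
    if line.startswith("nifi_amount_bytes"):
        print(line)
```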
There's also a processor for Slack integration, and I can receive notifications when a workflow completes or fails.
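Under the hood, that kind of processor essentially posts to a Slack incoming webhook. A sketch of the same call done by hand, with a placeholder webhook URL:

```python
import requests

def notify_slack(text: str) -> None:
    """Post a message to a Slack incoming webhook (placeholder URL)."""
    requests.post(
        "https://hooks.slack.com/services/T000/B000/XXXX",  # placeholder
        json={"text": text},
        timeout=10,
    ).raise_for_status()

notify_slack("NiFi flow 'kafka-to-s3' completed successfully")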
Another feature I appreciate is "back pressure," which NiFi handles automatically. Every connection maintains its own queue, and NiFi addresses back-pressure issues internally. If, for instance, a downstream processor isn't fast enough, items accumulate in that queue until NiFi's back-pressure mechanism throttles the upstream side.
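Conceptually, this is a bounded queue between stages. A small Python sketch of the idea; the threshold of 100 is arbitrary, whereas NiFi's per-connection defaults are on the order of 10,000 flow files or 1 GB:

```python
import queue
import threading
import time

# Bounded queue, like a NiFi connection with an object-count threshold.
q = queue.Queue(maxsize=100)

def consumer() -> None:
    while True:
        q.get()
        time.sleep(0.01)  # a slow downstream stage
        q.task_done()

threading.Thread(target=consumer, daemon=True).start()

# The producer blocks whenever the queue is full, so the upstream is
# automatically throttled to the consumer's pace -- back pressure.
for i in range(1_000):
    q.put(i)

q.join()  # wait for the queue to drain
```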
There is room for improvement in integration with SSO; NiFi does not have any integration with SSO. And if I want to set up role-based access control across the organization, that is not possible.
So I have to create a separate username and password, and then I have to share it with each individual team. That is the pain point at the enterprise level.
I have been using it for one and a half years.
I would rate the stability a seven out of ten because there are a lot of processes that need to be implemented.
It's scalable. It can easily scale across multiple nodes, and depending on the workload, it handles distribution internally: the workers coordinate with each other and share the load. So it's pretty good in terms of scalability.
The initial setup is very easy, especially for users who are familiar with ETL or ELT.
NiFi is one of the easiest tools on the market to learn and use. It is also a quick-win solution, which is good for first-time users who are developing data pipelines for ETL. NiFi makes it easy to track and trace the status of your pipelines, so you can be sure they are working properly.
If I were to advise someone, I would ask them what endpoints they want to touch. If I want to read something from Kafka and put it into an S3 bucket, what alternatives do I have?
I have Kafka Connect, where I can connect to Kafka and put the data into an S3 bucket. Is this scalable? No. Does it have monitoring? No.
We can't monitor it. We can't scale it. It's a complete black box. Someone who knows Kafka Connect, or Kafka itself, can understand what is happening inside it. But with NiFi, by comparison, I literally don't need to understand what Kafka is.
I know, "Okay, this is Kafka. These are the endpoints, and this is the URL I have to point to." That's it. My job is done. I will create a complete flow pipeline within, let's say, thirty minutes or something without having any current knowledge. I can read, I can Google it, and I can just implement it.
For people who are new to big data technologies like Kafka and BigQuery, I would give this solution an eight out of ten.
Let's say you need to build a solution to read from Kafka and write to an S3 bucket. You could use Kafka Connect, but if your requirements change and you need to read from a database instead, Kafka Connect will not work. With Apache NiFi, you can easily modify your flow pipeline to read from the database instead.