What is our primary use case?
We are a reseller of this solution and have implemented it for a couple of our customers. In addition, we also use it as part of our own product.
Our customers use it as part of an on-premises accounts payable solution, whereas we utilize it within our own cloud solution that is used for mortgage classification and data extraction of mortgage documents.
How has it helped my organization?
Grooper allows us to automate data extraction and integrations, and we have done so in our own cloud solution. We have APIs for integration with loan origination systems, customer portals, and other proprietary systems. In our process, clients post documents to our API, for example a 300-page PDF file. We ingest that through Grooper, classify the documents, and extract all of the data. Then we either post all of the documents and data back to the customer using a return URL, or we make it available so that they can call another endpoint and download all of the information from us. That whole process is totally unattended, with no human intervention whatsoever. The ability to do this automatically is close to the number one factor for us.
When it comes to processing difficult source data with both unstructured and semi-structured content, it does very well from an ingestion standpoint. To begin with, there are different methods on how we can get those documents. We can ingest documents that come from, for instance, an SFTP site, a file system, or right from email accounts like in Exchange.
One of the nice things about it is, for instance, if it's a file that somebody scanned and produced, such as a PDF file, and we're going to classify and extract data from that, that's great. But, if we receive electronic files, such as an XML file, a text file, an HTML file, or a searchable PDF, those are considered electronic documents, meaning that the data is embedded within it. In those cases, Grooper will allow us to extract the data right from the electronic file itself, so that we don't have to convert it to an image to then turn around and OCR to then try and get data from it.
That is a huge advantage, as pretty much every other OCR system that is out there will take that electronic file, convert it to a TIFF file, OCR it, and then extract the data from the image file. This roundabout process is susceptible to the quality of the document, whereas with Grooper, if it's an electronic file, the data we extract will always be a hundred percent accurate because it's being pulled directly from the electronic file.
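The routing decision described above can be sketched roughly as follows. This is only an illustration in Python, not how Grooper implements it: the function name `choose_text_source`, the extension list, and the `has_text_layer` flag are all assumptions made for the example.

```python
from pathlib import Path

# Hypothetical set of formats whose text is embedded in the file itself.
ELECTRONIC_EXTENSIONS = {".xml", ".txt", ".html", ".htm"}


def choose_text_source(filename: str, has_text_layer: bool = False) -> str:
    """Return "native" to read embedded text directly, or "ocr" for images.

    has_text_layer is an assumed flag saying whether a PDF is searchable.
    """
    ext = Path(filename).suffix.lower()
    if ext in ELECTRONIC_EXTENSIONS:
        return "native"
    if ext == ".pdf" and has_text_layer:
        return "native"
    # Scanned PDFs and image formats (TIFF, JPEG, PNG, ...) still need OCR.
    return "ocr"
```

The point of the sketch is that OCR becomes the fallback rather than the default, so the quality of a rendered image never affects files that already carry their own text.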
Grooper allows us to consolidate mass amounts of data that would otherwise require a person to go through, page by page. When you set up what's called a data model, you can group fields into sections. As an example, consider a typical invoice. You have header data, footer data, and then you might have a line item table that has all of the individual line items. Each line item has a unit price, quantity, and line price. The line prices add up to your subtotal, and the subtotal plus tax and freight equals your invoice total.
In most systems, you define all of those fields, so when you look at the information and when the user has to fix things, it's all sequential. With Grooper, we can create a section for the header, a section for the table, a section for the footer, and then group those fields together. This means that when the user is presented with the data, side-by-side with the actual image document, it's very intuitive because the data gets presented pretty much in the same manner that it is on the actual document. It helps speed up the amount of time that a user would take in order to make corrections.
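To make the invoice example concrete, here is a minimal Python sketch of fields grouped into header, table, and footer sections, with the arithmetic check between them. The class and field names are invented for illustration and are not Grooper's data model objects.

```python
from dataclasses import dataclass, field
from decimal import Decimal


@dataclass
class LineItem:
    quantity: Decimal
    unit_price: Decimal

    @property
    def line_price(self) -> Decimal:
        return self.quantity * self.unit_price


@dataclass
class Footer:
    subtotal: Decimal
    tax: Decimal
    freight: Decimal
    invoice_total: Decimal


@dataclass
class Invoice:
    header: dict                      # e.g. invoice number, vendor, date
    footer: Footer
    line_items: list = field(default_factory=list)


def totals_are_consistent(invoice: Invoice) -> bool:
    """Line prices must sum to the subtotal; subtotal + tax + freight must equal the total."""
    line_sum = sum((item.line_price for item in invoice.line_items), Decimal("0"))
    f = invoice.footer
    return line_sum == f.subtotal and f.subtotal + f.tax + f.freight == f.invoice_total
```

Decimal is used instead of float so that currency amounts compare exactly.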
In certain types of jobs, I would estimate that using Grooper saves us 70% of the time it normally takes to complete it.
Using this solution has helped to reduce the number of people involved in data extraction and classification. As an example, our largest healthcare customer processes 2,000 invoices a day, and they had 75 AP clerks who were doing data entry into PeopleSoft. Last year, we implemented a Grooper process where we automatically ingest the invoices from an email, classify them, then extract the data. We also do all of the validations for their PeopleSoft system. The number of people went from 75 down to 14.
This company has more than 3,000 suppliers and not all of them were set up before they went into production. Since that time, there's been an effort where every week, as they bring on new suppliers through the automated process, they continue to provide my team things that need to be tweaked or introduce suppliers that we hadn't seen before and need to be added. One type of document they receive from a supplier like me is a direct invoice, which is something that they will approve automatically after receiving it in an email and after the GL coding and other aspects are verified. The last statistic I saw on the automatic processing of direct invoices is that 62% of those are going through without human intervention.
What is most valuable?
The classification feature is very good. That's the initial reason why we switched from the other product that we used to resell and then decided to utilize it within our own product. This feature doesn't require a bunch of samples like the previous technology that we utilized.
Previously, for instance, if we were classifying mortgage documents or bank statements, I had to get three or four representative samples of all of the bank statements that are out there in the country. With thousands of community banks, it's almost impossible to get all those samples. As such, we always had an issue with being able to classify a bank statement.
However, with Grooper we didn't even use samples. Instead, we put in what's called positive extractors that look for certain keywords or characteristics of what makes up a bank statement. By doing it that way, we were able to classify probably 98% of all bank statements without ever having received a sample of each.
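The general idea of classifying by keywords instead of by samples can be sketched like this. It is a simplified stand-in written in Python, not Grooper's positive extractor feature itself; the patterns, the document type names, and the `min_hits` threshold are assumptions chosen for the example.

```python
import re

# Hypothetical keyword patterns; a real project would tune these per document type.
POSITIVE_PATTERNS = {
    "bank_statement": [r"account\s+summary", r"beginning\s+balance", r"ending\s+balance"],
    "invoice": [r"invoice\s+(number|no\.?)", r"remit\s+to", r"amount\s+due"],
}


def score_document(text: str) -> dict:
    """Count how many positive patterns of each document type appear in the text."""
    lowered = text.lower()
    return {
        doc_type: sum(1 for pattern in patterns if re.search(pattern, lowered))
        for doc_type, patterns in POSITIVE_PATTERNS.items()
    }


def classify(text: str, min_hits: int = 2):
    """Return the best-scoring type, or None if too few patterns matched."""
    scores = score_document(text)
    best = max(scores, key=scores.get)
    return best if scores[best] >= min_hits else None
```

Because the patterns describe what a bank statement contains rather than what one particular bank's statement looks like, a statement from a bank we have never seen can still match.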
The second most valuable feature is extraction accuracy. That was an add-on bonus for us because initially, we were just doing classification, and being able to do more accurate extraction opened up another revenue source for us. We were able to add on the extraction capabilities to our classification and so now, pretty much everybody that we talk to wants not just classification, but they want extraction. Furthermore, when they see the accuracy of the extraction, everybody's very happy.
Grooper can extract from and ingest pretty much every image file type. It can handle TIFFs, JPEGs, PNGs, BMPs, basically all image file types, PDFs, all of the Office docs including Word, Excel, PowerPoint, Text files, XML files, and more. There's no limit on which file types they can process.
The data output and reporting are fully customizable. We have total control over what data we extract and have that included in an XML file. Grooper has a couple of export modules to allow you to export that XML data raw, it can do XSLT conversions to reformat it in a different manner if we have a specification for that, or we can output that to a database. For database output, we can have it inserted into the tables and fields in the way that we want them.
Grooper does not necessarily do the actual reporting, other than internal reporting as far as statistics like the batch state, how many batches, where they're at, if there are any errors, and that kind of thing. But, in terms of extraction data reporting, we do have the mechanisms to export all of the data to a database, XML files, and other formats. We will take it from there and load that into whatever system we're going to do the actual reports in.
The user interface is easy to use, and the flexibility is noteworthy. Because of the way the system is architected, different people can follow different approaches and get the same result. For example, there are three of us in my company that are trained on Grooper. If each of us were to do the same project, the chances are that each of us would do it differently. Depending on how you think and how you would set things up, such as the extraction and the order that you want to do things in, the approach could differ. However, the outcome would always be the same.
That's one of the nice things about it because it's not like, "Okay, you only can do it one way." Rather, you can do it in different ways. Some people don't like that, because they want to be taught using a fixed sequence like, "Okay, you do A, B, C, and D, and then you get your result." The system is flexible enough that I may do step D first and then A, and then C and then B and still get the same result.
From a user interface perspective, most things are available via drop-down menus, you can select references, and point back to your extractors, and other things like that. From a GUI perspective, it's very effective.
What needs improvement?
Currently, we're still using version 2.72, and they're about to do the beta release of version 2021. In this coming version, we expect that some of our issues will be fixed.
We've had challenges in classification tasks where similar documents were flagged as multiple matches. The system would identify them and say, "Hey, I think I've got multiple matches. It could either be this one or that one." Because of that, it required us to instruct the system to either leave it unclassified, or we had to halt the process for somebody to look at it.
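The "multiple matches" situation boils down to two candidate types scoring too close to call. A minimal sketch of that decision in Python follows; the score dictionary, the `margin` value, and the function name are hypothetical and do not reflect how Grooper scores documents internally.

```python
def resolve_match(scores: dict, margin: float = 0.05):
    """Return the winning type, or None when the top two scores are too close."""
    if not scores:
        return None
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    if len(ranked) > 1 and ranked[0][1] - ranked[1][1] < margin:
        # Ambiguous: leave unclassified, or halt for a person to review.
        return None
    return ranked[0][0]
```

Returning None here corresponds to the two options we had: leave the document unclassified or stop the process for somebody to look at it.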
With the new version for 2021, they have changed the paradigm. As it is now, we're using something called a form type, where pages within the document are referenced using a specific page number. For example, in a ten-page document, you might refer to information specifically on the first or fifth page. In the new paradigm, there is a first, middle, and last page concept, as opposed to having the different form types with all of the different pages. What they're telling me is that it's going to make the classification more accurate. Just because the first page of two different documents looks the same, they will not be considered duplicates. Having multiple points of reference will now allow it to better distinguish them.
The other area we have had challenges with is table extraction, in cases where the data headers were not defined or the tables did not have descriptions for the columns. My understanding is that they have shown the 2021 version handling that. Again, we don't have it and haven't been able to test it, but it's coming.
Technical support is definitely an area that they need improvement in, in terms of the front-line individuals.
For how long have I used the solution?
We have been using Grooper for two and a half years.
What do I think about the stability of the solution?
Grooper is a very stable product. As I mentioned, our cloud solution doesn't have any human intervention. Other than our automated tools, which monitor services and the like, I just have one support person who monitors things to ensure that there aren't any errors or other such problems. Errors arise occasionally, for example, if we get a corrupt image or somebody sends us a document that has security on it. From Grooper itself, we've not had any issues with it crashing or hanging.
One of the huge advantages is that Grooper supports a pool of computing resources, which means that if one of our servers goes down, the licensing server detects that and adaptively changes the workload. Specifically, it will not send any new work to that device because it's not online. It will just continue to distribute the work amongst the others that are available. When it comes back online, then it'll start giving it work, automatically. It's a very nice feature to have, to be able to distribute that work across multiple resources.
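The behavior can be pictured with a small Python sketch that hands out tasks while skipping any machine that is offline. Round-robin assignment is an assumption made purely for illustration; I don't know the actual scheduling strategy Grooper uses, and the `is_online` callback is a hypothetical health check.

```python
from itertools import cycle


def distribute(tasks, workers, is_online):
    """Assign tasks round-robin, skipping any worker that is_online reports as down."""
    assignments = {worker: [] for worker in workers}
    ring = cycle(workers)
    for task in tasks:
        # Try each worker at most once per task before giving up.
        for _ in range(len(workers)):
            worker = next(ring)
            if is_online(worker):
                assignments[worker].append(task)
                break
        else:
            raise RuntimeError("no workers online")
    return assignments
```

When a machine comes back, `is_online` simply starts returning True for it again and it resumes receiving work without any other change.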
What do I think about the scalability of the solution?
You can scale Grooper as much as you want. You can literally add as many servers as you require. If you're in a virtual environment, you could spin up a bunch of VMs, install Grooper on them, add those into the thread pool, and just tell it about them. They can now participate in the process.
Spinning up a VM and getting it prepared, including installing the software and adding it to the thread pool, can be done in about 20 minutes.
In our cloud solution, we maintain a service level of 10 minutes or less, from the time we receive a file from a customer to the time we deliver it back. Our average is four minutes. Early this year, we were starting to get that towards eight minutes because we were increasing volume. We literally just called our cloud provider and asked them to enable another server for us. We installed the software, added it into the thread pool, and we now are handling 30% more volume and we're back down to that four-minute turnaround time.
It really scales.
How are customer service and technical support?
The only time that we reach out to them is when we encounter issues like bugs. Sometimes we'll find something that doesn't look right, so we'll submit a ticket and have somebody review it. I will say that's probably one area that they would definitely need some improvement in, particularly with the front-line individuals.
When we submit a ticket, usually they'll ask all of the basic things, as well as request we send the logs and other relevant data. They'll go down the checklist. Specifically for our company, because we're a reseller and we know the product very well, we have already done all of these things. I know it's probably standard protocol, but I think they should train the individuals to know the difference between a regular customer who just implemented Grooper and our organization, which is an actual reseller that has implemented their solution and also uses it internally. It's a waste of time for them to ask all those things because we know that there really is a problem, and want to get on to solving it.
Once it gets beyond the first level, on to the engineering team or the development team, they have been very good and responsive about providing fixes and patches. From that aspect, I don't necessarily have an issue. It's more just the first level of support.
We don't have the level of support that would give us an assigned engineer. In a couple of cases where we ran into some issues that were more urgent, I reached out to our account manager. At that point, he got in contact with the product manager and they called me right away. They were able to get some people on the phone and handled it immediately, but there isn't a designated engineer for our account.
Which solution did I use previously and why did I switch?
We used to resell and implement another product prior to Grooper.
The most recent one we have worked with is the Ephesoft product. We're still a reseller of Ephesoft, technically, and have been for approximately seven years. We actually adopted version one, seven years ago or so. We've got perhaps 25 implementations, who are customers that we still support today.
As an example, I mentioned our cloud solution for mortgage classification extraction. I tried to build that three years ago with Ephesoft, but it just didn't lend itself to it. For one, the accuracy level wasn't there. The problem is that we would need to have representative samples of every document. We've got over 1,500 distinct documents in that model, so trying to find 20 samples of 1,500 documents would just take forever. The other problem was that there were large limitations on the extraction side, as far as table extractions. Even to this day, they still have issues with that. It is important to remember, however, that we used it because it was the best thing we had at the time.
Before that, we used PSIcapture, which is a PSIGEN product. We used that for between four and five years before we switched to Ephesoft. Of course, we used Captiva (now known as OpenText Intelligent Capture), and IBM's offering as well.
I can tell you that previously, with the last product that we used to resell, setting up that accounts payable system for the healthcare organization that I have described probably would have taken us six months. With Grooper, we were able to get the entire product all done, with the integration, in six weeks.
It's such a big improvement because there's just so much more that's out of the box. With the other product, we had to do a lot of scripting and write services around it in order to get data into it, and once we got the data back out, we had to do a lot of other stuff too. Whereas with Grooper, there's just so much functionality within the product itself that we don't actually have to write all those things.
In terms of the learning curve between products, from a training standpoint, Grooper is definitely more involved. With the other products, you could go through a two-day class and learn enough to be able to get started. With Grooper, you're going to spend a minimum of a week. Ideally, you should take the other classes as well. So, it's essentially a two-week training period, and that's assuming that you have a capture background.
By "capture", I mean that you should be familiar with scanning, image processing, all of the capabilities with respect to cleaning up images and OCR, and things like that. It is more involved because there's just so much more functionality within the product. Whereas the other products have a very simple user interface, but then you're very limited on what you can actually do.
One of the big benefits to Grooper, and one of the reasons why I switched our company away from Ephesoft, is that Ephesoft is licensed based on a number of cores. If you look at an entry-level four-core server, you can process 20 pages a minute. The time is consumed with OCR and the licensing that they use. When you do the math, even if that server is running 24/7 and processing non-stop, it would process a couple of million pages in a year.
Well, if all of a sudden, you need to do more volume, but in a shorter time, you have to add more servers and more cores. Of course, now you have to buy much higher licenses and then it just starts escalating from a cost standpoint. The way Grooper works, it's licensed based on the number of pages per year and they don't care how many resources, from a server perspective, you deploy.
In our case, as an example, we brought on some new customers to our cloud solution. What we did is we just added more servers, made those servers available into what's called a thread pool, and now Grooper started distributing work across multiple servers, all without it affecting my license at all. You can actually do what's called crowd computing processing.
In an organization, you could install Grooper on perhaps 50 desktops and then add them to the server. You would tell the server that these 50 computers are out there on the network and are available. Assume that each had four cores. What Grooper will do is to monitor through the day and night and determine whether any of those resources are available. It'll send them tasks automatically and lets those computers do the processing and offload some of the work. Because of that, we're able to get stuff through really fast. We could split up, for example, a batch of 300 pages, maybe across 20 computers. These don't have to be servers; rather, they can be desktops that are not being heavily used at the time. Now, we can process all of those tasks in a matter of seconds.
Not only is the work done more quickly, but the redundancy created by the pool of computing resources adds stability to the workload.
How was the initial setup?
With respect to setting Grooper up, it's straightforward. Where the complexity comes in is, figuring out how we're going to integrate it with the customers' systems. It's not necessarily a Grooper issue. It's really more on the client-side.
What was our ROI?
We have certainly seen a return on investment from this product.
We had a soft launch in 2019, but in 2020 is when we actually launched our mortgage debt platform. For us, this opened up a whole new revenue stream that wasn't there before. Also, when we look at what we're paying compared to our revenue, it's a fraction of the cost because what we're doing is something, really, that nobody in the industry could do before. As such, we're able to charge a higher premium per file than others in the past.
As an example, let's assume that somebody was charging $3 a loan. By contrast, we're charging $5 a loan, but we can justify it because we can automatically, without any human intervention at about a 98% to 99% accuracy level, process their documents and get it back to them within four minutes.
It is similar to a situation where you can buy a car, and you can choose either the Toyota Camry or you can buy the Lexus. Generally, you're going to step up and get more value for your money. It's going to cost you more, but you're going to be driving a Lexus, which is a much nicer car.
Our customers are also saving a lot of money. For example, the one customer we process 25,000 loan files a month for, is saving about $1.5 million a year, just on labor.
As an example, this same customer asked us to put in a process for them to try and mitigate fraud. The reason is that a couple of years ago, they got caught where a title company sent them instructions for a wire transfer via email. In midstream, somebody intercepted the communication and they changed the routing number and the account number, then they bounced it back to the company like the email was undeliverable.
The company then called and said, "Hey, I got this bounced back," and they responded to say, "Well, I don't think we're having problems. Go ahead and resend it." In turn, they forwarded that same document. Once that loan had closed, the money was wired to an account in Russia and it was a $575,000 loan. Consequently, the company was defrauded of $575,000. What they did was to put a process in place where they would have people checking the system to find out if more than one wiring instruction was added into the repository, and then somebody had to go and look at that.
When all of this happened, they asked us to write a program that checked those loans nonstop. Their volume is very high, at about 100,000 loans, so it would take a week for our system to cycle through them. Then within that week, they would get between 2,000 and 2,500 emails that we would have to look at because all we could tell was that there were two documents in that placeholder. We didn't know if they were two wiring instructions because at the time, we were using Ephesoft and we couldn't make that determination. The only thing we knew is that there were two documents.
Because of the necessity to check so many emails, they had approximately 18 people looking at them. It had to be done immediately because the loan is about to be funded, and they don't want to fund it before it's verified, otherwise, they run the risk of fraud again. Now that they are using our cloud-based classification extraction platform, they inquired about how our process could be further improved.
My suggestion was that we can do the same type of monitoring, but utilize an API, which is much faster than the SDK. Using the API, we can filter out specific loans, so instead of looking at 120,000 loans, we can look at perhaps 30,000 loans that are really active. Once we find more than one document, we'll pull those documents down, automatically ingest them, classify them, and extract the relevant data. We can now recognize, for example, that one document is a wiring instruction and the other is not.
In that case, I don't need to send an email because there's no issue. The only instance where we need to be concerned is if there are two wiring instructions, but the routing number and account number are different. If they are the same then it doesn't matter. However, if we find several wiring instructions and even one of them differs, then I need to alert the client.
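The alerting rule just described is simple enough to write down. The following Python sketch assumes each classified document arrives as a dictionary with `doc_type`, `routing_number`, and `account_number` keys; those names are made up for the example and are not our production schema.

```python
def needs_alert(documents) -> bool:
    """Alert only when two or more wiring instructions disagree on bank details."""
    wiring = [d for d in documents if d["doc_type"] == "wiring_instruction"]
    if len(wiring) < 2:
        # Zero or one wiring instruction: nothing to compare, no email.
        return False
    distinct = {(d["routing_number"], d["account_number"]) for d in wiring}
    # Identical copies collapse to one entry; any difference leaves more than one.
    return len(distinct) > 1
```

This is what turns "there are two documents in the placeholder" into "there are two conflicting wiring instructions", which is the only case a person needs to see.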
As a result of our newest implementation, our client is receiving perhaps eight emails each day, instead of hundreds of emails a day. Now, those 18 people can focus on their normal job, as opposed to having to go in and do research on these loans when really, once they get in there, 99% of the time there's not a problem.
What's my experience with pricing, setup cost, and licensing?
The way it's licensed is on an annual per-page basis, which is something that I don't see as an issue at all. Overall, their pricing is higher than the competitors, but they offer functionality that is otherwise not available. The way we justify that to the customers when we're implementing is to have them look at the additional functionality. If a competing product is cheaper but it can't do the job, it doesn't really matter.
From my perspective, when I talk to our prospects and customers, I explain that it does no good to compare prices. Some of them will compare Grooper to Ephesoft and point out that Ephesoft is whatever percentage cheaper, say 15% or 20%. In response, I explain that Ephesoft can't do what they are asking to be done, so it doesn't matter if it's 100% cheaper. It can't do it, so you have to think of things in a different mindset at that point, aside from the licensing aspect.
There really isn't anybody else that I'm aware of that's on their level, so I think they can command it. When somebody else comes along that can do the same things that they can do, then I think at that point the pricing will probably get adjusted.
Which other solutions did I evaluate?
We have been in this business a long time and have tried a variety of other products.
What other advice do I have?
We've written code for Grooper, although it has been utilized primarily for validations, rather than for the actual extraction. The extraction is something that we've pretty much handled all through the user interface. However, once we pull pieces of information and we want to validate that to a third-party system or an external database, for example, we have written our own scripts to take the extracted data. The operations will be things like a database lookup, performing validations, pulling back some more information, and then updating additional fields. But for the extraction itself, we really have not had to write code.
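As a rough idea of what such a validation step looks like, here is a Python sketch using an SQLite lookup. It is not Grooper scripting code, and the `vendors` table, its columns, and the field names are assumptions invented for the example.

```python
import sqlite3


def validate_vendor(conn: sqlite3.Connection, fields: dict) -> dict:
    """Look up the extracted vendor name and enrich the record with database values."""
    row = conn.execute(
        "SELECT vendor_id, payment_terms FROM vendors WHERE name = ?",
        (fields["vendor_name"],),
    ).fetchone()
    result = dict(fields)  # leave the caller's extracted data untouched
    if row is None:
        result["vendor_valid"] = False
    else:
        result["vendor_valid"] = True
        result["vendor_id"], result["payment_terms"] = row
    return result
```

The extracted value drives the lookup, the lookup result marks the field valid or invalid, and the additional columns pulled back update further fields on the record.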
The fact that extracting data didn't require scripting was not a deciding factor for us. However, it is an important factor because most people want to have a business analyst support the process, rather than having to hire a developer.
I would rate this solution a nine out of ten.