What is our primary use case?
Our use case for AWS Lake Formation was a government client that needed to migrate a large volume of data and give various entities access to it at a central location, since data was coming in from different departments across multiple government entities. They wanted to build a data lake that incorporated big data processing with Glue services and crawlers, while also proving out a concept for securing and governing the data lake and its data assets, so that different entities could access the data under secure conditions with scalability in mind. I was a consultant for one of the entities, involved in a POC to see whether they could use AWS Lake Formation along with Security Lake.
What is most valuable?
The most valuable feature of AWS Lake Formation was the access model itself. It allows filters with row-level and column-level security to mask data that certain entities shouldn't access, enabling granular control without exposing PII. Another useful feature is Glue Workflows, which orchestrate multiple Glue jobs to automate the entire process end-to-end. Blueprints, which provide out-of-the-box connectors for ingesting data from different sources, were also beneficial.
AWS Lake Formation is tightly integrated with IAM for authentication and authorization, as its permission model relies on IAM user groups and roles. Groups can be categorized by the access different users require, and access policies are then implemented within Lake Formation. This integration provides extensibility and scalability, since user groups that have been granted permissions can manage further access control for new users or groups. We were still in the early days of using AWS Lake Formation and were still exploring other features, such as federated access for users outside AWS.
The scalability of AWS Lake Formation is quite good: user groups can be created with grantable permissions, so they can manage access for new users onboarded to specific databases or tables and extend permissions to them as needed. The stability and reliability are impressive; once permissions are applied, the access flow is efficient. When a user runs a query from Athena, it interacts with AWS Lake Formation first, which uses temporary credentials to access the S3 buckets and presents the data securely. This centralized permission management adds a layer of security and makes it predictable what users can access, with the necessary filters applied before any data is exposed.
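
To make the row-level and column-level filtering described above concrete, here is a minimal sketch of what such a filter definition might look like. It only builds the `TableData` argument for the boto3 `create_data_cells_filter` call and does not contact AWS. The account ID, database, table, filter, and column names are all hypothetical and are not taken from the project. Note that this kind of filter hides the listed columns and non-matching rows rather than rewriting values.

```python
def build_data_cells_filter(catalog_id, database, table, name,
                            row_expression, excluded_columns):
    """Build the TableData argument for lakeformation.create_data_cells_filter.

    The filter keeps only rows matching row_expression and hides the
    columns listed in excluded_columns from principals granted the filter.
    """
    return {
        "TableCatalogId": catalog_id,
        "DatabaseName": database,
        "TableName": table,
        "Name": name,
        # Row-level security: a SQL-like predicate evaluated per row.
        "RowFilter": {"FilterExpression": row_expression},
        # Column-level security: expose every column except these.
        "ColumnWildcard": {"ExcludedColumnNames": list(excluded_columns)},
    }


# Hypothetical values, for illustration only.
table_data = build_data_cells_filter(
    catalog_id="111122223333",
    database="citizen_services",
    table="applications",
    name="dept_a_no_pii",
    row_expression="department = 'A'",
    excluded_columns=["national_id", "date_of_birth"],
)

# To create the filter (requires boto3 and AWS credentials):
#   import boto3
#   boto3.client("lakeformation").create_data_cells_filter(TableData=table_data)
```

The filter only takes effect for a principal once it has been granted on that filter, which keeps the definition of what is hidden separate from who it applies to.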
What needs improvement?
One area for improvement relates to a feature I noticed about 10 months ago: the ability to query data from Athena without accessing the S3 buckets directly, so that users granted access to tables can run SQL queries via Athena. However, I found Athena a bit clunky for writing queries. An easier way to work with query tools such as DataGrip, using the provided driver JARs, would be a worthwhile enhancement.
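
For readers who have not used this query path, the following is a rough sketch of how a query might be submitted to Athena programmatically instead of through the console. It only assembles the arguments for the boto3 `start_query_execution` call. The SQL, database, workgroup, and results location are hypothetical placeholders. The caller never reads the table's S3 location directly; the service is expected to enforce the Lake Formation permissions on their behalf.

```python
def build_athena_query(sql, database, workgroup, output_location):
    """Build keyword arguments for athena.start_query_execution."""
    return {
        "QueryString": sql,
        # The Glue Data Catalog database the table names resolve against.
        "QueryExecutionContext": {"Database": database},
        "WorkGroup": workgroup,
        # Where Athena writes result files; this is a results bucket,
        # not the data lake bucket holding the underlying table data.
        "ResultConfiguration": {"OutputLocation": output_location},
    }


# Hypothetical values, for illustration only.
request = build_athena_query(
    sql="SELECT application_id, status FROM applications LIMIT 10",
    database="citizen_services",
    workgroup="primary",
    output_location="s3://example-athena-results/queries/",
)

# To run it (requires boto3 and AWS credentials):
#   import boto3
#   response = boto3.client("athena").start_query_execution(**request)
#   query_id = response["QueryExecutionId"]
```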
For how long have I used the solution?
I've been working with AWS Lake Formation for about a year, including about three to six months on one project specifically related to Lake Formation.
What was my experience with deployment of the solution?
While I was not directly involved in the actual deployment of AWS Lake Formation, from an application perspective I found the features straightforward to understand, although there are conceptual aspects users need to grasp before implementing permissions. There was some trial and error, but over a period of three to six months we effectively scaled the permission models and controlled access to S3 files. My colleagues faced challenges with securing S3 files and managing policies such as ACLs, ACLEs, and NACLs during implementation.
What do I think about the stability of the solution?
The stability and reliability of AWS Lake Formation are impressive; once permissions are applied, the access flow is efficient. When a user runs a query from Athena, it interacts with AWS Lake Formation first, which uses temporary credentials to access S3 buckets and presents data securely.
What do I think about the scalability of the solution?
The scalability of AWS Lake Formation is quite good. User groups can be created with grantable permissions, so those groups can manage access for new users onboarded to specific databases or tables and extend permissions to them as needed.
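
The grantable permissions mentioned above can be sketched as a request for the boto3 `grant_permissions` call. This is an illustration under assumptions, not the configuration used on the project: the role ARN, database, and table names are hypothetical, and the principal shown is an IAM role. Listing a permission under `PermissionsWithGrantOption` is what lets the recipient pass it on to others.

```python
def build_table_grant(principal_arn, database, table,
                      permissions, grantable=()):
    """Build keyword arguments for lakeformation.grant_permissions."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {"Table": {"DatabaseName": database, "Name": table}},
        # Permissions the principal can use itself.
        "Permissions": list(permissions),
        # Subset the principal may, in turn, grant to other principals.
        "PermissionsWithGrantOption": list(grantable),
    }


# Hypothetical values, for illustration only.
grant = build_table_grant(
    principal_arn="arn:aws:iam::111122223333:role/DeptADataStewards",
    database="citizen_services",
    table="applications",
    permissions=["SELECT", "DESCRIBE"],
    grantable=["SELECT"],
)

# To apply it (requires boto3 and AWS credentials):
#   import boto3
#   boto3.client("lakeformation").grant_permissions(**grant)
```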
How was the initial setup?
I participated in the initial setup and deployment of AWS Lake Formation primarily from the Glue side, focusing on the data engineering aspects: building Glue jobs, creating S3 buckets, and implementing permissions for different user groups.
What about the implementation team?
As noted under the initial setup, my part in the implementation was on the Glue and data engineering side.
What other advice do I have?
For fine-grained access control of data assets in AWS Lake Formation, I used LF-Tags to categorize certain databases and tables based on tags assigned to specific user groups, allowing me to match tags and grant access accordingly. I also set Data Filters to mask either rows or columns on the tables and databases, and implemented S3 policies to restrict access to the S3 buckets. This involved creating groups and assigning permissions such as grant, select, and update access; the databases and tables themselves were populated using Glue crawlers. I did not use AWS Lake Formation's machine learning algorithms for data classification, as the use case was focused on managing large volumes of data, close to 1.5 petabytes, rather than implementing AI models. On a scale of 1-10, I rate AWS Lake Formation a 7.
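
As a rough illustration of the LF-Tag matching described above, the sketch below builds a tag-based grant for the boto3 `grant_permissions` call using an `LFTagPolicy` resource. The tag key, tag values, and role ARN are hypothetical. It assumes the LF-Tag already exists and has been assigned to the relevant databases or tables, which is a separate step not shown here.

```python
def build_lf_tag_grant(principal_arn, tag_key, tag_values,
                       resource_type="TABLE", permissions=("SELECT",)):
    """Build keyword arguments for a tag-based lakeformation.grant_permissions.

    The grant applies to every resource of resource_type whose LF-Tags
    match the expression, rather than to one named table.
    """
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "LFTagPolicy": {
                "ResourceType": resource_type,  # "DATABASE" or "TABLE"
                "Expression": [
                    {"TagKey": tag_key, "TagValues": list(tag_values)}
                ],
            }
        },
        "Permissions": list(permissions),
    }


# Hypothetical values, for illustration only.
tag_grant = build_lf_tag_grant(
    principal_arn="arn:aws:iam::111122223333:role/DeptAAnalysts",
    tag_key="department",
    tag_values=["dept_a"],
)

# To apply it (requires boto3 and AWS credentials):
#   import boto3
#   boto3.client("lakeformation").grant_permissions(**tag_grant)
```

With this approach, a table newly discovered by a crawler becomes accessible to the right group once it carries the matching tag, without a new per-table grant.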
Which deployment model are you using for this solution?
Public Cloud
If public cloud, private cloud, or hybrid cloud, which cloud provider do you use?