To begin with, we need to perform network configuration. We will open ports specific to RDS PostgreSQL DB, define inbound rules and create S3 endpoint to access data in S3 bucket (created as part of DMS Lab)
Note: Make sure your RDS instance is available. If the Status says “Storage Full”, increase the DB storage capacity and apply changes immediately to the DB instance and proceed further.
Security group associated with RDS instance will be automatically selected. Select Inbound Rules and Click Edit Rules.
Create Inbound Rules as shown below:
Create S3 VPC endpoint. Navigate to VPC Dashboard and select Endpoint from left section. Click on “Create Endpoint”.
In the next window, follow these steps:
Endpoint creation is successful.
Documentation Reference: Link
Create the IAM Role by following below steps:
On the IAM console, In the navigation pane, choose Roles, and then choose Create role.
On the Create role page, choose AWS service, and choose Glue. Then choose Next:Permissions.
Search for and select the AWSGlueServiceRole managed policy. Optionally, provide tags.
Then complete the wizard, naming the role LakeFormationWorkflowRole, and choose Create role.
Back on the Roles page, search for and choose LakeFormationWorkflowRole.
On the role Summary page, under the Permissions tab, choose Add inline policy, and add the following policy in JSON editor.
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "Lakeformation",
"Effect": "Allow",
"Action": [
"lakeformation:GetDataAccess",
"lakeformation:GrantPermissions"
],
"Resource": "*"
}
]
}
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "PassRolePermissions",
"Effect": "Allow",
"Action": ["iam:PassRole"],
"Resource": [
"arn:aws:iam::account-id:role/LakeFormationWorkflowRole"
]
}
]
}
On the AWS Glue menu, select Connections.
Click Add Connection.
Enter the connection name. This name should be descriptive and easily recognized (e.g ," glue-rds-connection").
Choose RDS for connection type and PostgreSQL for Database Engine
Optionally, enter the description. This should also be descriptive and easily recognized and Click Next.
Choose dmslabinstance as Instance and enter password –“master123” and Click Next and Click Finish.
glue-rds-connection was created successfully. To test it, select the connection, and choose Test connection.
To Test connection, choose the IAM role created in the previous step and then click on Test Connection.
Navigate to the AWS Lake Formation service.
If you are logging into the lake formation console for the first time then you must add administrators first in order to do that follow Steps 2 and 3. Else skip to Step 4.
Click Add administrators
Add your
Navigate to Databases on left pane. Select “ticketdata” and click on “Actions”, select “Grant” to grant permissions.
Under “IAM Users and Roles”, select Lake Formation role that you created – “LakeFormationWorkflowRole” and for user, select your username. Grant “super” permissions for Database permissions and Grantable permissions.
On the left pane navigate to Blueprints click Use blueprints.
a. For Blueprint Type, select Database snapshot
b. Under Import Source
i. For Database Connection choose the DB connection created in the glue. [Ex: “glue-rds-connection”]
ii. For Source Data Path enter “sportstickets/dms_sample/player”.
c. Under Import Target
i. For Target Database, choose existing “ticketdata”.
ii. For Target storage location choose the S3 bucket - For Bucket name, type the name of the s3 bucket that you created earlier in student lab.
iii. Add a folder at the end of the bucket url path.
NOTE: The value is similar to the following string,
d. For Import Frequency, Select Run On Demand
e. For Import Options;
i. Give a suitable Workflow Name
ii. For the IAM role choose the LakeFormationWorkflowRole created previously
iii. For Table prefix type “lakeformation_”
Leave other options as default, Choose Create, and wait for the console to report that the workflow was successfully created.
Once the blueprint gets created, click on Start it Now? [There may be a delay of 5-10s delay in the blueprint showing up. You may have to hit refresh. Select the blueprint and choose Start in Actions drop down]
Once the workflow starts executing, you will see the status changes from running discovering
The Lake Formation blueprint creates a Glue Workflow under the hood which contains Glue ETL jobs – both python shell and pyspark; Glue crawlers and triggers. It will take somewhere between 15-20 mins to finish execution. In the meantime, let us drill down to see what it creates for us;
Navigate to the Glue console
Navigate to Databases on the left panel and select ticketdata
Click on “Tables in ticketdata” and this table will be pre fixed by “lakeformation_”
Example: lakeformation__sportstickets_dms_sample_player
And Click Action -> View Data
This will now take you to Athena console, where you can preview the table contents, as show below;
Congratulation!!! You have completed lake formation lab. To explore more fine grain data lake security feature, continue to next section.
Before we start the querying the data, let us create an IAM User datalake_user and grant column level access on the table created by the Lake formation workflow above, to datalake_user.
Login as admin user to your account. Navigate to IAM Console and click on Add User.
Create a user named datalake_user and give it a password: master123.
Next click on Permissions and keep navigating to the next steps until reached the end. Review the details and click on “Create User”.
Navigate to the Lake Formation console, in the navigation pane, under Permissions, choose Data permissions.
Choose Grant, and in the Grant permissions dialog box, do the following: a. For IAM user and roles, choose datalake_user. b. For Database, choose ticketdata c. The Table list populates. d. For Table, choose the table shown. e. For Columns, select Include Columns and choose id, first_name f. For Table permissions, choose Select.
Choose Grant.
Using Athena, let us now explore the data set as the datalake_user.