If you use Amazon Virtual Private Cloud (Amazon VPC) to host your AWS resources, you can establish a private connection between your VPC and AWS Glue. You use this connection to enable AWS Glue to communicate with the resources in your VPC without going through the public internet.
Amazon VPC is an AWS service that you can use to launch AWS resources in a virtual network that you define. With a VPC, you have control over your network settings, such as the IP address range, subnets, route tables, and network gateways. To connect your VPC to AWS Glue, you define an interface VPC endpoint for AWS Glue. When you use an interface VPC endpoint, communication between your VPC and AWS Glue is conducted entirely and securely within the AWS network.
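For reference, the same interface endpoint can also be created programmatically. The following is a minimal boto3 sketch; the VPC, subnet, and security group IDs are placeholders you would replace with your own.

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Create an interface VPC endpoint for AWS Glue so that traffic to
# Glue stays on the AWS network instead of the public internet.
response = ec2.create_vpc_endpoint(
    VpcEndpointType="Interface",
    VpcId="vpc-0123456789abcdef0",               # placeholder: your VPC ID
    ServiceName="com.amazonaws.us-east-1.glue",
    SubnetIds=["subnet-0123456789abcdef0"],      # placeholder: a private subnet
    SecurityGroupIds=["sg-0123456789abcdef0"],   # placeholder: your security group
    PrivateDnsEnabled=True,
)
print(response["VpcEndpoint"]["VpcEndpointId"])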
With AWS Lake Formation, you can import your data using workflows. A workflow defines the data source and schedule to import data into your data lake. You can easily define workflows using blueprints, or templates, that Lake Formation provides.
When you create a workflow, you must assign it an AWS Identity and Access Management (IAM) role that enables Lake Formation to set up the necessary resources on your behalf to ingest the data. In this lab, we've pre-created an IAM role for you, called <random>-LakeFormationWorkflowRole.
Navigate to the AWS Glue console: https://console.aws.amazon.com/glue/home?region=us-east-1
On the AWS Glue menu (the left-hand navigation pane), select Connections.
Click Add Connection.
Enter glue-rds-connection as the connection name.
Choose JDBC for connection type.
Optionally, enter a description; make it descriptive and easily recognizable. Click Next.
Enter the JDBC URL in the format jdbc:postgresql://<RDS_Server_Name>:5432/sportstickets
Enter master as the username and master123 as the password.
For VPC, select the pre-created VPC ending with dmslstudv1
For Subnet, choose one of the existing subnets named private_subnet.
Select the security group with sgdefault in the name.
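For reference, the connection you just defined in the console could equally be created with the AWS SDK. A minimal boto3 sketch, with the RDS hostname, subnet, and security group as placeholders:

import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Mirrors the console steps above: a JDBC connection to the RDS
# PostgreSQL instance, reachable through the lab VPC.
glue.create_connection(
    ConnectionInput={
        "Name": "glue-rds-connection",
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            "JDBC_CONNECTION_URL": "jdbc:postgresql://<RDS_Server_Name>:5432/sportstickets",
            "USERNAME": "master",
            "PASSWORD": "master123",
        },
        "PhysicalConnectionRequirements": {
            "SubnetId": "subnet-0123456789abcdef0",           # placeholder: private subnet
            "SecurityGroupIdList": ["sg-0123456789abcdef0"],  # placeholder: sgdefault group
        },
    }
)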
Navigate to the AWS Lake Formation service:
https://console.aws.amazon.com/lakeformation/home?region=us-east-1#databases
If you are signing in to the Lake Formation console for the first time, you must add administrators first; to do so, follow Steps 2 and 3. Otherwise, skip to Step 4.
Click Add administrators
In the left pane, navigate to Blueprints and click Use blueprints.
a. For Blueprint Type, select Database snapshot.
b. Under Import Source, choose glue-rds-connection as the Database connection and enter sportstickets/dms_sample/% as the Source data path.
c. Under Import Target, choose ticketdata as the Target database.
d. For Import Frequency, select Run On Demand.
e. For Import Options, enter a workflow name, choose the pre-created <random>-LakeFormationWorkflowRole IAM role, and enter lakeformation as the Table prefix.
Leave other options as default, click Create, and wait for the console to report that the workflow was successfully created.
Once the blueprint is created, select it and click Actions -> Start. The blueprint may take 5-10 seconds to appear; you may need to click the refresh button.
Once the workflow starts executing, you will see the status change from Running -> Discovering -> Completed.
The Lake Formation blueprint creates a Glue workflow, which orchestrates Glue ETL jobs (both Python shell and PySpark), Glue crawlers, and triggers. Its first execution takes roughly 20-30 minutes to finish. In the meantime, let us drill down to see what it creates for us.
On the Lake Formation console, in the navigation pane, choose Blueprints
In the Workflow section, click the workflow name. This directs you to the workflow run page. Click the Run Id.
Here you can see a graphical representation of the Glue workflow built by the Lake Formation blueprint. Selecting an individual component displays its details (name, description, job run ID, start time, execution time).
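The same run details can also be retrieved programmatically. A minimal boto3 sketch, assuming a placeholder workflow name (copy the real name from the Blueprints page):

import boto3

glue = boto3.client("glue", region_name="us-east-1")

workflow_name = "lakeformation-workflow"  # placeholder: your workflow's name

# Find the most recent run, then fetch it together with its full
# graph of jobs, crawlers, and triggers.
runs = glue.get_workflow_runs(Name=workflow_name)["Runs"]
latest = max(runs, key=lambda r: r["StartedOn"])
run = glue.get_workflow_run(
    Name=workflow_name, RunId=latest["WorkflowRunId"], IncludeGraph=True
)["Run"]

print(run["Status"], run["Statistics"])
for node in run["Graph"]["Nodes"]:
    print(node["Type"], node["Name"])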
To see which Glue jobs were created as part of this workflow, click Jobs in the navigation pane.
Every job has History, Details, Script, and Metrics tabs. Review each of these tabs for any of the Python shell or PySpark jobs.
Navigate to the Lake Formation Console. https://console.aws.amazon.com/lakeformation/home?region=us-east-1#databases
Navigate to Databases on the left panel and select ticketdata
Click on View tables
Select the table lakeformation_sportstickets_dms_sample_player. Per the Import Options configured above, Lake Formation prefixed the table names with lakeformation.
Click Actions -> View data.
This takes you to the Athena console.
If you see a "Get Started" page, it is because this is the first time Athena has been used in this AWS account. To proceed, click Get Started.
Then click set up a query result location in Amazon S3 at the top
In the pop-up window, in the Query result location field, enter your S3 bucket location followed by /, so that it looks like s3://xxx-dmslabs3bucket-xxx/, and click Save.
On Athena Console, you can run some queries using query editor:
To select some rows from the table, try running:
SELECT * FROM "ticketdata"."lakeformation_sportstickets_dms_sample_player" limit 10;
To get a row count, run:
SELECT count(*) as recordcount FROM "ticketdata"."lakeformation_sportstickets_dms_sample_player";
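The same queries can also be submitted through the Athena API rather than the console. A minimal boto3 sketch of the row-count query, with the result location bucket as a placeholder:

import time
import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Submit the row-count query and poll until it finishes.
query = athena.start_query_execution(
    QueryString='SELECT count(*) AS recordcount FROM "ticketdata"."lakeformation_sportstickets_dms_sample_player"',
    QueryExecutionContext={"Database": "ticketdata"},
    ResultConfiguration={"OutputLocation": "s3://xxx-dmslabs3bucket-xxx/"},  # placeholder bucket
)
query_id = query["QueryExecutionId"]

while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    print(rows[1]["Data"][0]["VarCharValue"])  # rows[0] is the header row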
Congratulations! You have completed the Lake Formation lab. To explore more fine-grained data lake security features, continue to the next section.
Before we start querying the data, let us create an IAM user named datalake_user and grant it column-level access to the table created by the Lake Formation workflow above. Navigate to the IAM console, choose Users, and click Add user. Enter datalake_user as the user name, enable AWS Management Console access, and set Master123! as the console password.
Next click on Permissions
Choose Attach existing policies directly, search for AmazonAthenaFullAccess, and select it.
Continue through the remaining steps to the end, review the details, and click Create user.
On the final screen, write down the sign-in link and click Close.
Next, create a policy that grants access to the lab's S3 bucket: in the IAM console, choose Policies, click Create policy, and select the JSON tab. Use the following JSON snippet, replacing <your_dmslabs3bucket_unique_name> with the unique name of your dmslabs3bucket, so that each ARN looks like arn:aws:s3:::mod-08b80667356c4f8a-dmslabs3bucketnh54wqg771lk. Note that the bucket ARN itself is listed alongside the objects under it, because bucket-level actions such as s3:ListBucket and s3:GetBucketLocation (which Athena needs) apply to the bucket, not to its objects.
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:Put*",
                "s3:Get*",
                "s3:List*"
            ],
            "Resource": [
                "arn:aws:s3:::<your_dmslabs3bucket_unique_name>",
                "arn:aws:s3:::<your_dmslabs3bucket_unique_name>/*"
            ]
        }
    ]
}
Give the policy a name, such as athena_access, and click Create policy. Then attach the new policy to datalake_user. The IAM user now has the required policies.
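For reference, the user and policy setup above can also be scripted. Below is a minimal boto3 sketch; it attaches the S3 permissions as an inline policy rather than the standalone managed policy created in the console, and the bucket name is a placeholder.

import json
import boto3

iam = boto3.client("iam")
bucket = "<your_dmslabs3bucket_unique_name>"  # placeholder: your bucket name

# Create the user with console access, as in the wizard above.
iam.create_user(UserName="datalake_user")
iam.create_login_profile(
    UserName="datalake_user", Password="Master123!", PasswordResetRequired=False
)

# Attach the managed Athena policy and the S3 policy from the JSON above.
iam.attach_user_policy(
    UserName="datalake_user",
    PolicyArn="arn:aws:iam::aws:policy/AmazonAthenaFullAccess",
)
iam.put_user_policy(
    UserName="datalake_user",
    PolicyName="athena_access",
    PolicyDocument=json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": ["s3:Put*", "s3:Get*", "s3:List*"],
            "Resource": [f"arn:aws:s3:::{bucket}", f"arn:aws:s3:::{bucket}/*"],
        }],
    }),
)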
Next, navigate to the Lake Formation console and, under Permissions, choose Data permissions.
Choose Grant.
Once the Grant permissions window opens up:
For IAM user and roles, choose datalake_user.
For Database, choose ticketdata
The Table list populates.
For Table, choose lakeformation_sportstickets_dms_sample_player.
For Columns, select Include Columns and choose id, first_name
For Table permissions, uncheck Super and choose Select.
Click Grant
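The same column-level grant can be issued through the Lake Formation API. A minimal boto3 sketch, assuming a placeholder account ID in the principal ARN:

import boto3

lf = boto3.client("lakeformation", region_name="us-east-1")

# Grant SELECT on only the id and first_name columns to datalake_user.
lf.grant_permissions(
    Principal={
        "DataLakePrincipalIdentifier": "arn:aws:iam::111122223333:user/datalake_user"  # placeholder account ID
    },
    Resource={
        "TableWithColumns": {
            "DatabaseName": "ticketdata",
            "Name": "lakeformation_sportstickets_dms_sample_player",
            "ColumnNames": ["id", "first_name"],
        }
    },
    Permissions=["SELECT"],
)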
Using Athena, let us now explore the data set as the datalake_user.
On the same web page (the sign-in link you noted earlier), sign back in as the IAM user datalake_user, using Master123! as the password. Note: remove the hyphens ('-') from your account ID.
a. Make sure to change the region to us-east-1 (N. Virginia)
b. Navigate to Athena console
Then click set up a query result location in Amazon S3 at the top.
In the pop-up window, enter your S3 bucket location followed by '/' in the Query result location box (it looks like s3://xxx-dmslabs3bucket-xxx/), and click Save.
Next, ensure the database ticketdata is selected in the left-hand panel.
Now run a select query:
SELECT * FROM "ticketdata"."lakeformation_sportstickets_dms_sample_player" limit 10;
Notice that the result contains only the id and first_name columns that were granted. This illustrates that AWS Lake Formation can provide granular, table- and column-level access to IAM users.
Congratulations! You have successfully completed this lab!