Federated query is a new Amazon Athena feature that enables data analysts, engineers, and data scientists to execute SQL queries across data stored in relational, non-relational, object, and custom data sources. With Athena federated query, customers can submit a single SQL query and analyze data from multiple sources running on-premises or hosted on the cloud. Athena executes federated queries using Data Source Connectors that run on AWS Lambda.
As part of this lab, we will explore how to query different data sources from Athena. We will be using existing RDS database which was created as part of the initial setup as datasource-1 and data lake data (stored in S3) as datasource-2.
Choose “Connect data source” on the Query Editor
Select the data source as PostgreSQL to which you want to connect,
as shown in the following screenshot. Click ‘Next’
Click on ‘Configure new function’ and give a catalog name as
select ‘Configure new lambda function’
Fill the application setting fields as per below snapshot and click deploy.
|SpillBucket||Get the BucketName from EventEngine Team Dashboard|
replace <DATABASE_EDNPOINT> with the correct database endpoint (e.g. dmslabinstance.abcdshic87yz.eu-west-1.rds. amazonaws.com)
|SecurityGroupIds||Use the SecurityGroupId noted in prerequisites|
|SubnetIds||Use the SubnetId noted in prerequisites|
Wait for the function to deploy. Go back to the Athena window and
select the newly deployed Lambda function, for Catalog Name enter
Postgres_DB and click ‘Connect’
Once the function is deployed. Select the function and change
handler string in runtime setting as below.
Changed string below:
Verify that the new data source is visible
Refresh the query editor and check the new datasource is visible with corresponding database.
For this example, we will be using “sport_location” table from Postgres data source and “parquet_sporting_event” table from data lake. Copy below query and paste in the query editor
SELECT loc.city, count(distinct evt.id) AS events FROM "Postgres_DB"."dms_sample"."sport_location" AS loc JOIN "AwsDataCatalog"."ticketdata"."parquet_sporting_event" AS evt ON loc.id = evt.location_id GROUP BY loc.city ORDER BY loc.city ASC;