Athena Federated query

Federated query is a new Amazon Athena feature that enables data analysts, engineers, and data scientists to execute SQL queries across data stored in relational, non-relational, object, and custom data sources. With Athena federated query, customers can submit a single SQL query and analyze data from multiple sources running on-premises or hosted on the cloud. Athena executes federated queries using Data Source Connectors that run on AWS Lambda.

As part of this lab, we will explore how to query different data sources from Athena. We will be using existing RDS database which was created as part of the initial setup as datasource-1 and data lake data (stored in S3) as datasource-2.

Steps:

  1. Create an S3 endpoint to allow S3 access to Athena connecter Lambda, follow steps mentioned here to create S3 endpoint. Make sure to use the same subnet which will be used for Lambda function in next step. This will put a route entry in the subnet routing table to reach to S3 via AWS backbone.

  2. Choose “Connect data source” on the Query Editor

  1. Select the data source as ‘PostgreSQL’ to which you want to connect, as shown in the following screenshot. Click ‘Next’

  2. Click on ‘Configure new function’ and give a catalog name. And select ‘Configure new lambda function’

  1. Fill the application setting fields as per below snapshot and click deploy.

  1. Wait for the function to deploy. Go back to the Athena window and select the newly deployed Lambda function and click ‘Connect’

  1. Once the function is deployed. Select the function and change handler string in runtime setting as below.

“com.amazonaws.connectors.athena.jdbc.postgresql.PostGreSqlCompositeHandler”

Changed string below:

  1. Verify that the new data source is visible
  1. Refresh the query editor and check the new datasource is visible with corresponding database.
  1. For this example, we will be using “sport_location” table from Postgres data source and “parquet_sporting_event” table from data lake.

Copy below query and paste in the query editor

SELECT loc.city, count(distinct evt.id) AS events
FROM "Postgres_DB"."dms_sample"."sport_location" AS loc
JOIN "AwsDataCatalog"."ticketdata"."parquet_sporting_event" AS evt
ON loc.id = evt.location_id
GROUP BY loc.city
ORDER BY loc.city ASC;