Athena Federated query

Federated query is a new Amazon Athena feature that enables data analysts, engineers, and data scientists to execute SQL queries across data stored in relational, non-relational, object, and custom data sources. With Athena federated query, customers can submit a single SQL query and analyze data from multiple sources running on-premises or hosted on the cloud. Athena executes federated queries using Data Source Connectors that run on AWS Lambda.

As part of this lab, we will explore how to query different data sources from Athena. We will be using existing RDS database which was created as part of the initial setup as datasource-1 and data lake data (stored in S3) as datasource-2.

Prerequisites

Ingestion with DMS and Transforming data with Glue ETL labs are prerequisites for this lab.

If you are in an AWS Event use Event Engine Team Dashboard, if you are in your own account check Outputs section of the CloudFormation stack to get the following values.

  • Security Group Id for Athena Federated Query Connector Check the Security Group Name with string sgdefault from Amazon EC2 Security Group console (e.g. sg-006190abcdc987641).
  • SubnetId for Athena Federated Query Connector [Private Subnet ID from Amazon VPC Subnet console (e.g. subnet-07edabc12335e98bd).

Steps:

  1. Choose “Connect data source” on the Query Editor

  2. Select the data source as PostgreSQL to which you want to connect, as shown in the following screenshot. Click ‘Next’

  3. Click on ‘Configure new function’ and give a catalog name as PostgreSQL_DB. And select ‘Configure new lambda function’

  4. Fill the application setting fields as per below snapshot and click deploy.

    Field Value
    Application Name AthenaJdbcConnector
    SecretNamePrefix AthenaFed_
    SpillBucket Get the BucketName from EventEngine Team Dashboard
    DefaultConnectionString postgres://jdbc:postgresql://<DATABASE_ENDPOINT>:5432/sportstickets?user=master&password=master123
    replace <DATABASE_EDNPOINT> with the correct database endpoint (e.g. dmslabinstance.abcdshic87yz.eu-west-1.rds. amazonaws.com)
    DisableSpillEncryption false
    LambdaFunctionName postgresqlconnector
    LambdaMemory 3008
    LambdaTimeout 900
    SecurityGroupIds Use the SecurityGroupId noted in prerequisites
    SpillPrefix athena-spill-bucket
    SubnetIds Use the SubnetId noted in prerequisites

  5. Wait for the function to deploy. Go back to the Athena window and select the newly deployed Lambda function, for Catalog Name enter Postgres_DB and click ‘Connect’

  6. Once the function is deployed. Select the function and change handler string in runtime setting as below.
    com.amazonaws.connectors.athena.jdbc.postgresql.PostGreSqlCompositeHandler Changed string below:

  7. Verify that the new data source is visible

  8. Refresh the query editor and check the new datasource is visible with corresponding database.

  9. For this example, we will be using “sport_location” table from Postgres data source and “parquet_sporting_event” table from data lake. Copy below query and paste in the query editor

    SELECT loc.city, count(distinct evt.id) AS events
    FROM "Postgres_DB"."dms_sample"."sport_location" AS loc
    JOIN "AwsDataCatalog"."ticketdata"."parquet_sporting_event" AS evt
    ON loc.id = evt.location_id
    GROUP BY loc.city
    ORDER BY loc.city ASC;