PART B: Glue Job Bookmark (Optional)

Prerequisite: Completion of the CDC part of the DMS Migration Lab

Step 1: Create Glue Crawler for ongoing replication (CDC Data)

Now, let’s repeat the crawler-creation process to catalog the data produced by change data capture (CDC).

  1. On the AWS Glue menu, select Crawlers.

  2. Click Add crawler.

  3. Enter the crawler name for ongoing replication. This name should be descriptive and easily recognizable (e.g., glue-lab-cdc-crawler).

  4. Optionally, enter a description. This should also be descriptive and easily recognizable. Click Next.

  5. Choose Data stores as the Crawler source type, select Crawl all folders, and click Next.

  6. On the Add a data store page, make the following selections:

    • For Choose a data store, click the drop-down box and select S3.
    • For Crawl data in, select Specified path in my account.
    • For Include path, enter the target folder for your DMS ongoing replication, e.g., “s3://xxx-dmslabs3bucket-xxx/cdc/dms_sample”
  7. Click Next.

  8. On the Add another data store page, select No and click Next.

  9. On the Choose an IAM role page, make the following selections:

    • Select Choose an existing IAM role.
    • For IAM role, select xxx-GlueLabRole-xxx, e.g., “dmslab-student-GlueLabRole-ZOQDII7JTBUM”.
  10. Click Next.

  11. On the Create a schedule for this crawler page, for Frequency, select Run on demand and click Next.

  12. On the Configure the crawler’s output page, select the existing Database for crawler output (e.g., ticketdata).

  13. For Prefix added to tables, specify cdc_

  14. For Configuration options (optional), keep the default selections and click Next.

  15. Review the summary page, noting the Include path and Database target, and click Finish. The crawler is now ready to run.

  16. Tick the crawler name glue-lab-cdc-crawler and click the Run crawler button.

  17. The crawler’s Status will change from Starting to Stopping and then back to Ready. Wait until it returns to the Ready state; when it completes, you will see that it has created 2 tables.

  18. Click the database name (e.g., ticketdata) to browse the tables. Specify cdc as the filter to list only newly imported tables.
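If you prefer to script this step instead of clicking through the console, the same crawler can be created and run with boto3, the AWS SDK for Python. This is a minimal sketch, not part of the lab itself: the bucket path and role name below are placeholders you must replace with the values from your own environment.

import time
import boto3

glue = boto3.client("glue")

# Placeholders: substitute your own DMS bucket path and the lab's Glue role.
CDC_PATH = "s3://xxx-dmslabs3bucket-xxx/cdc/dms_sample"
ROLE = "xxx-GlueLabRole-xxx"

# Create a crawler that catalogs the CDC files into ticketdata with a cdc_ prefix.
glue.create_crawler(
    Name="glue-lab-cdc-crawler",
    Role=ROLE,
    DatabaseName="ticketdata",
    TablePrefix="cdc_",
    Targets={"S3Targets": [{"Path": CDC_PATH}]},
)

# Run it on demand and wait until it returns to the READY state.
glue.start_crawler(Name="glue-lab-cdc-crawler")
while glue.get_crawler(Name="glue-lab-cdc-crawler")["Crawler"]["State"] != "READY":
    time.sleep(30)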

Step 2: Create a Glue Job with Bookmark Enabled

  1. On the left-hand side of the Glue console, click Jobs, then click Add job.

  2. On the Job properties page, make the following selections:

    • For Name, type Glue-Lab-TicketHistory-Parquet-with-bookmark
    • For IAM role, choose the existing role “xxx-GlueLabRole-xxx”
    • For Type, select Spark
    • For Glue Version, select Spark 2.4, Python 3 (Glue version 2.0), or whichever is the latest version
    • For This job runs, select A proposed script generated by AWS Glue.
    • For Script file name, use the default.
    • For S3 path where the script is stored, provide a unique Amazon S3 path to store the scripts. (You can keep the default for this lab.)
    • For Temporary directory, provide a unique Amazon S3 directory for a temporary directory. (You can keep the default for this lab.)
  3. Expand the Advanced properties section. For Job bookmark, select Enable from the drop-down menu.

  4. Expand the Monitoring options section and enable Job metrics.

  5. Click Next.

  6. In Choose a data source, select cdc_ticket_purchase_hist, as we are generating new data entries for the ticket_purchase_hist table. Click Next.

  7. In Choose a transform type, select Change Schema and click Next.

  8. In Choose a data target:

    • Select Create tables in your data target
    • For Data store: select Amazon S3
    • Format: parquet
    • Target path: s3://xxx-dmslabs3bucket-xxx/cdc_bookmark/ticket_purchase_history/data/
    • Click Next
  9. In the Map the source columns to target columns window, leave everything as default and click Save job and edit script.

  10. In the next window, review the job script and click Run job, then click the close (X) mark at the top right of the window to close the screen.

  11. Once the job finishes its run, check the S3 bucket for the partitioned Parquet data.
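The console wizard generates the ETL script for you, but the bookmark setting itself is just a pair of job arguments. As a sketch of what the console configured on your behalf, here is roughly how the same job could be defined with boto3; the script location, temporary directory, and role are placeholders for values from your environment.

import boto3

glue = boto3.client("glue")

# Placeholders: the generated script's S3 path and the lab's Glue role.
SCRIPT = "s3://xxx-dmslabs3bucket-xxx/scripts/Glue-Lab-TicketHistory-Parquet-with-bookmark.py"
ROLE = "xxx-GlueLabRole-xxx"
JOB = "Glue-Lab-TicketHistory-Parquet-with-bookmark"

glue.create_job(
    Name=JOB,
    Role=ROLE,
    GlueVersion="2.0",
    Command={"Name": "glueetl", "ScriptLocation": SCRIPT, "PythonVersion": "3"},
    DefaultArguments={
        # The setting behind "Job bookmark: Enable" in the console.
        "--job-bookmark-option": "job-bookmark-enable",
        # The setting behind "Job metrics" under Monitoring options.
        "--enable-metrics": "",
        "--TempDir": "s3://xxx-dmslabs3bucket-xxx/temp/",
    },
)

# Start a run and check its state; with bookmarks enabled, a later re-run
# will skip any files this run has already processed.
run_id = glue.start_job_run(JobName=JOB)["JobRunId"]
print(glue.get_job_run(JobName=JOB, RunId=run_id)["JobRun"]["JobRunState"])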

Step 3: Create Glue crawler for Parquet data in S3

  1. Once you have the data in the S3 bucket, navigate to the Glue console. We will now crawl the Parquet data in S3 to add it to the Data Catalog.

  2. Click Add crawler.

  3. In the crawler configuration window, enter the crawler name glue_lab_cdc_bookmark_crawler and click Next.

  4. In Specify crawler source type, select Data stores and Crawl all folders. Click Next.

  5. In Add a data store:

    • For Choose a data store, select S3
    • For the Include path, click the folder icon and choose your target S3 bucket, then append /cdc_bookmark/ticket_purchase_history, e.g., “s3://xxx-dmslabs3bucket-xxx/cdc_bookmark/ticket_purchase_history”
  6. Click Next.

  7. For Add another data store, select No and click Next.

  8. In Choose an IAM role, select the existing IAM role whose name contains the text GlueLabRole. It looks something like this: xxx-GlueLabRole-xxx.

  9. On the Create a schedule for this crawler page, for Frequency, select Run on demand and click Next.

  10. For the crawler’s output:

    • For Database, select “ticketdata” database.
    • Optionally, add a prefix to the newly created tables for easy identification. Provide the prefix bookmark_parquet_
    • Click Next
  11. Review all the details and click Finish, then click Run crawler.

  12. After the crawler finishes running, click Databases, select ticketdata, and view the tables in this database. You will find the newly created table bookmark_parquet_ticket_purchase_history.

  13. Once the table is created, click Action and select View data from the drop-down menu. This opens the table in Athena.

If this is the first time you are using Athena in your AWS account, click Get Started on the Athena welcome screen.

To select some rows from the table, try running:

SELECT * FROM
"ticketdata"."bookmark_parquet_ticket_purchase_history" limit 10;

To get a row count, run:

SELECT count(*) as recordcount FROM
"ticketdata"."bookmark_parquet_ticket_purchase_history";

Before moving on to the next step, note the row count.
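If you want to capture this baseline programmatically rather than reading it off the console, a sketch like the following runs the same count query through the Athena API with boto3. The query-results location is a placeholder; Athena needs an S3 path it can write to in your account.

import time
import boto3

athena = boto3.client("athena")

# Placeholder: an S3 location Athena can write query results to.
RESULTS = "s3://xxx-dmslabs3bucket-xxx/athena-results/"

qid = athena.start_query_execution(
    QueryString='SELECT count(*) AS recordcount FROM "ticketdata"."bookmark_parquet_ticket_purchase_history"',
    QueryExecutionContext={"Database": "ticketdata"},
    ResultConfiguration={"OutputLocation": RESULTS},
)["QueryExecutionId"]

# Poll until the query finishes, then read back the single count value.
while athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"] in ("QUEUED", "RUNNING"):
    time.sleep(2)

rows = athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]
print("recordcount:", rows[1]["Data"][0]["VarCharValue"])  # rows[0] is the header row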

Step 4: Generate CDC data and observe bookmark functionality

Ask your instructor to generate more CDC data at the source database. If you ran the instructor setup on your own, make sure to follow the “Generate the CDC Data” section of the instructor prelab.

  1. To make sure the new data has been successfully generated, check the S3 bucket for CDC data; you will see new files. Note the time when the files were generated.
  2. Rerun the Glue job Glue-Lab-TicketHistory-Parquet-with-bookmark that you created in Step 2. Because the job bookmark is enabled, only the files added since the last run will be processed.
  3. Go to the Athena console and rerun the following query to see the increase in row count:
    SELECT count(*) as recordcount FROM
    "ticketdata"."bookmark_parquet_ticket_purchase_history";
    

    To review the latest transactions, run:

    SELECT * FROM
    "ticketdata"."bookmark_parquet_ticket_purchase_history" order by
    transaction_date_time desc limit 100;
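To see the bookmark itself, you can also inspect the state Glue stores for the job. Below is a minimal sketch using boto3's get_job_bookmark call, assuming the job name from Step 2 (the call raises an error if the job has never run with bookmarks enabled).

import boto3

glue = boto3.client("glue")

# Fetch the bookmark entry Glue keeps for this job. The Run counter advances
# with every bookmarked run, and JobBookmark records what has been processed.
entry = glue.get_job_bookmark(
    JobName="Glue-Lab-TicketHistory-Parquet-with-bookmark"
)["JobBookmarkEntry"]

print("run:", entry["Run"], "version:", entry["Version"])
print("state:", entry.get("JobBookmark"))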