This guide walks you through the Real-Time Clickstream Anomaly Detection lab using Amazon Kinesis Data Analytics.
Analyzing web log traffic to gain insights that drive business decisions has historically been performed using batch processing. Although effective, this approach results in delayed responses to emerging trends and user activities. There are solutions that process data in real time using streaming and micro-batching technologies, but they can be complex to set up and maintain. Amazon Kinesis Data Analytics is a managed service that makes it easy to identify and respond to changes in data behavior in real time.
In the prelab, you set up the prerequisites required to complete this lab. Now you will implement the following data pipeline.
Make sure you are in the US East (N. Virginia) us-east-1 Region.
Click Create application
On the application page, click Connect streaming data.
Select Choose source, and make the following selections:
In the Record pre-processing with AWS Lambda section, choose Disabled.
In the Access to chosen resources section, select Choose from IAM roles that Kinesis Analytics can assume.
In the IAM role box, search for the following role:
<stack name>-CSEKinesisAnalyticsRole-<random string>
Do not click Discover schema in this step yet.
You have set up the Kinesis Data Analytics application to receive data from a Kinesis Data Firehose and to use an IAM role from the pre-lab. However, you need to start sending some data to the Kinesis Data Firehose before you click Discover schema in your application.
Navigate to the Amazon Kinesis Data Generator (KDG) that you set up in the prelab and start sending the Schema Discovery Payload at 1 record per second by clicking the Send data button. Make sure to select the us-east-1 region. Now that your Kinesis Data Firehose is receiving data, you can continue configuring the Kinesis Data Analytics application.
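If you want to see what the KDG is doing conceptually, the sketch below builds minimal clickstream records and paces them at 1 record per second. The "browseraction" field comes from the lab's SQL; any other fields in the real payload templates are omitted, and the sender is a stub rather than a real Firehose client.

```python
import json
import time

def make_record(action):
    """Build a minimal clickstream record like the ones KDG sends.
    Only the "browseraction" field is taken from the lab's SQL; the
    real payload templates may contain additional fields."""
    return json.dumps({"browseraction": action})

def send_at_one_per_second(send, action, duration_seconds):
    """Call send(record) once per second for duration_seconds."""
    for _ in range(duration_seconds):
        send(make_record(action))
        time.sleep(1)

# Collect records with a stub sender instead of a real Firehose client.
sent = []
send_at_one_per_second(sent.append, "Click", 3)
```

In the lab itself the KDG performs this loop for you; you only choose the payload template, the rate, and the region.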
To learn more about the SQL logic, see the Analytics application section in the following blog post: https://aws.amazon.com/blogs/big-data/real-time-clickstream-anomaly-detection-with-amazon-kinesis-analytics/
Now that the logic to detect anomalies is in the Kinesis Data Analytics application, you must connect it to a destination (an AWS Lambda function) to notify you when there is an anomaly.
Click the Destination tab and click Connect to a Destination.
For Destination, choose AWS Lambda function.
In the Deliver records to AWS Lambda section, make the following selections:
Your parameters should look like the following image. This configuration allows your Kinesis Data Analytics Application to invoke your anomaly Lambda function and notify you when any anomalies are detected.
Now that all of the components are in place, you can test your analytics application. For this part of the lab, you will need to use your Kinesis Data Generator in five separate browser windows: one window sending the normal Impression payload, one window sending the normal Click payload, and three windows sending extra Click payload.
Make sure to select the us-east-1 region. Do not accept the default region.
In one of your browser windows, start sending the Impression payload at a rate of 1 record per second (keep this running).
On another browser window, start sending the Click payload at a rate of 1 record per second (keep this running).
On your last three browser windows, start sending the Click payload at a rate of 1 record per second for a period of 20 seconds. If you do not receive an anomaly email, open another KDG window and send additional concurrent Click payloads. Do not let these extra senders run for more than 10 to 20 seconds at a time; otherwise, AWS Lambda could send you multiple emails due to the number of anomalies you are creating.
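The arithmetic behind this test can be sketched in plain Python: the extra Click senders multiply the clicks seen in a 10-second window while impressions stay flat, so the click-through rate spikes. The window counts below (10 records per sender per window) are illustrative assumptions, not measured values.

```python
def ctr_percent(clicks, impressions):
    """Click-through rate over one 10-second window, as computed in
    the application's CTRSTREAM: clicks / impressions * 100."""
    return clicks / impressions * 100.0

# One Click sender and one Impression sender, each at 1 record/second,
# produce roughly 10 records each per 10-second window: 100% baseline CTR.
baseline = ctr_percent(10, 10)

# Adding three more Click senders roughly quadruples clicks per window
# while impressions stay flat; this jump is the spike that
# RANDOM_CUT_FOREST flags with a high anomaly score.
spike = ctr_percent(40, 10)
```

A sustained rate change would eventually look "normal" to the model, which is why the lab tells you to run the extra senders for only 10 to 20 seconds.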
You can monitor anomalies on the Real-time analytics tab in the DESTINATION_SQL_STREAM table. If an anomaly is detected, it displays in that table.
Make sure to click other streams and review the data.
Once an anomaly has been detected in your application, you will receive an email and a text message at the accounts you specified.
After you have completed the lab, click Actions > Stop Application to stop your application and avoid a flood of SMS and email messages.
To save on cost, delete the environment that you created during this lab. Make sure to empty the S3 buckets from the console before following the steps below:
In your AWS account, navigate to the CloudFormation console.
On the CloudFormation console, select the stack that you created during the prelab.
Click the Actions dropdown and select Delete stack, as shown in the screenshot below.
CREATE OR REPLACE STREAM "CLICKSTREAM" ( "CLICKCOUNT" DOUBLE );
CREATE OR REPLACE PUMP "CLICKPUMP" AS
INSERT INTO "CLICKSTREAM" ("CLICKCOUNT")
SELECT STREAM COUNT(*)
FROM "SOURCE_SQL_STREAM_001"
WHERE "browseraction" = 'Click'
GROUP BY FLOOR(
  ("SOURCE_SQL_STREAM_001".ROWTIME - TIMESTAMP '1970-01-01 00:00:00') SECOND / 10 TO SECOND
);

CREATE OR REPLACE STREAM "IMPRESSIONSTREAM" ( "IMPRESSIONCOUNT" DOUBLE );
CREATE OR REPLACE PUMP "IMPRESSIONPUMP" AS
INSERT INTO "IMPRESSIONSTREAM" ("IMPRESSIONCOUNT")
SELECT STREAM COUNT(*)
FROM "SOURCE_SQL_STREAM_001"
WHERE "browseraction" = 'Impression'
GROUP BY FLOOR(
  ("SOURCE_SQL_STREAM_001".ROWTIME - TIMESTAMP '1970-01-01 00:00:00') SECOND / 10 TO SECOND
);

CREATE OR REPLACE STREAM "CTRSTREAM" ( "CTR" DOUBLE );
CREATE OR REPLACE PUMP "CTRPUMP" AS
INSERT INTO "CTRSTREAM" ("CTR")
SELECT STREAM "CLICKCOUNT" / "IMPRESSIONCOUNT" * 100.000 AS "CTR"
FROM "IMPRESSIONSTREAM", "CLICKSTREAM"
WHERE "IMPRESSIONSTREAM".ROWTIME = "CLICKSTREAM".ROWTIME;

CREATE OR REPLACE STREAM "DESTINATION_SQL_STREAM" ( "CTRPERCENT" DOUBLE, "ANOMALY_SCORE" DOUBLE );
CREATE OR REPLACE PUMP "OUTPUT_PUMP" AS
INSERT INTO "DESTINATION_SQL_STREAM"
SELECT STREAM *
FROM TABLE (RANDOM_CUT_FOREST(
  CURSOR(SELECT STREAM "CTR" FROM "CTRSTREAM"), --inputStream
  100,    --numberOfTrees (default)
  12,     --subSampleSize
  100000, --timeDecay (default)
  1       --shingleSize (default)
))
WHERE ANOMALY_SCORE > 2;
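The GROUP BY FLOOR(... / 10 TO SECOND) clauses above implement 10-second tumbling windows: each record's ROWTIME is floored to a 10-second boundary, and COUNT(*) produces one count per bucket. The following plain-Python sketch (not AWS code) mimics that bucketing with epoch-second timestamps; the sample timestamps are made up for illustration.

```python
from collections import Counter

def window_start(epoch_seconds, width=10):
    """Floor a timestamp (seconds since 1970-01-01) to the start of its
    tumbling window, mirroring the SQL's FLOOR(... / 10 TO SECOND)."""
    return (int(epoch_seconds) // width) * width

def count_per_window(event_times):
    """Group event timestamps into 10-second windows and count each,
    like COUNT(*) ... GROUP BY the floored ROWTIME."""
    return Counter(window_start(t) for t in event_times)

# Three events land in the [100, 110) window, two in [110, 120).
counts = count_per_window([100.2, 103.9, 109.5, 110.0, 118.7])
```

Because CLICKPUMP and IMPRESSIONPUMP use the same window expression, their output rows line up on ROWTIME, which is what lets CTRPUMP join the two streams with a simple ROWTIME equality.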