AWS MSK Kafka To Process Event/Streaming Data
As we know, Apache Kafka is a distributed event streaming platform capable of handling large volumes of data from multiple sources.
It works in near real time, which makes it ideal for real-time data processing, where the processed data can be used for auditing, tracking, or immediate push notifications.
If you run open source Apache Kafka on your own, you have to do all the setup yourself: you need servers configured as a Kafka cluster, and even then scaling remains a challenge. To build a scalable Kafka solution, it is usually best to use a managed cloud offering, and AWS provides MSK, a managed Kafka cluster where all the infrastructure is handled by AWS so you can focus on your code and business logic.
Today I ran an experiment to capture logs from multiple EC2 instances: each instance produces events/messages that go to AWS MSK (Kafka), and from there AWS Glue processes the events/messages and sends the processed output to Redshift.
Here is the architecture diagram I wanted to achieve; below you will find the high-level steps if you want to try this solution yourself.
Demo Steps:
- Create an MSK cluster and, once it is successfully created, copy the bootstrap server and ZooKeeper connection strings to use in the steps below.
- Bootstrap servers: needed to produce and consume messages on a topic and store them in the Kafka cluster.
- ZooKeeper connection: needed to set up/create topics.
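- You can also fetch both strings from the AWS CLI instead of the console. A minimal sketch, assuming the CLI is configured and <ClusterArn> is your cluster's ARN:
# Bootstrap broker string(s)
aws kafka get-bootstrap-brokers --cluster-arn <ClusterArn>
# Cluster details, including ZookeeperConnectString
aws kafka describe-cluster --cluster-arn <ClusterArn>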
- Create a Linux EC2 instance as the client (event/message producer) and connect to it over SSH.
- Note: Make sure the client EC2 instance has an IAM role attached with the "AmazonMSKFullAccess" permission.
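- Optional sanity check (assuming the AWS CLI is installed on the instance): confirm the instance role credentials are being picked up.
# Should return the account and the assumed-role ARN of the attached instance role
aws sts get-caller-identity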
- Run the commands below on your EC2 instance:
- Install Java: sudo yum install java-1.8.0
- Download Kafka: wget https://archive.apache.org/dist/kafka/2.6.2/kafka_2.12-2.6.2.tgz
- Extract the archive: tar -xzf kafka_2.12-2.6.2.tgz
- Go into the Kafka bin directory: cd kafka_2.12-2.6.2/bin/
- Now run the command below to create the topic MSKTutorialTopic:
./kafka-topics.sh --create --zookeeper <ZookeeperConnectString> --replication-factor 3 --partitions 1 --topic MSKTutorialTopic
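- The producer and consumer commands below reference a client.properties file for TLS. A minimal sketch, assuming the default OpenJDK trust store path on Amazon Linux (your JVM path may differ):
# Copy the JVM trust store to use as the Kafka client trust store
cp /usr/lib/jvm/java-1.8.0-openjdk*/jre/lib/security/cacerts /tmp/kafka.client.truststore.jks
# Create client.properties next to the Kafka CLI scripts
cat > client.properties <<'EOF'
security.protocol=SSL
ssl.truststore.location=/tmp/kafka.client.truststore.jks
EOF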
- Run the command below to connect to the Kafka cluster as a producer, then type messages to publish them to the topic:
./kafka-console-producer.sh --broker-list <BootstrapBrokerString> --producer.config client.properties --topic MSKTutorialTopic
- Run the consumer below in a new tab to read messages in real time:
./kafka-console-consumer.sh --bootstrap-server <BootstrapBrokerStringTls> --consumer.config client.properties --topic MSKTutorialTopic --from-beginning
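- To feed real log events from the EC2 instance instead of typing messages by hand, you can pipe a log file into the console producer. A sketch only; the log path and the need for sudo are assumptions about your instance:
# Stream new lines from the system log into the topic
sudo tail -F /var/log/messages | ./kafka-console-producer.sh --broker-list <BootstrapBrokerString> --producer.config client.properties --topic MSKTutorialTopic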
Now let's see how I pushed the same messages to Redshift using AWS Glue.
- Redshift:
- Create a cluster/database
- Once the cluster status is active, create a table using the query editor, for example: create table msk_message(message VARCHAR(MAX))
- Glue:
- Add a connection for MSK Kafka
- Add a connection for Redshift
- Create a database
- Create a table manually, choose Kafka as the source, enter the topic name on which messages will be published, and select the MSK Kafka connection you created
- Create a crawler job - this is basically where we tell Glue where the data will be put, so select the Redshift connection created above as the source
- From the ETL section, create a job (I used the legacy option here)
- Select as the source the database table you created for the MSK Kafka connection above
- Select as the target the Redshift table you created for the Redshift connection above
- On the last step it will generate a script (code) that is stored in your S3 location; based on this code the job pulls data from Kafka and pushes it to Redshift
- Once it is created, you can run it manually or on a schedule (or from the CLI, as shown below), and you can check the status, metrics, and logs from the different tabs at the bottom.
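- If you would rather script the run than use the console, here is a sketch with the AWS CLI (the job name placeholder is whatever you named the Glue job above):
# Start the job; the response contains a JobRunId
aws glue start-job-run --job-name <GlueJobName>
# Check the status of that run
aws glue get-job-run --job-name <GlueJobName> --run-id <JobRunId>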
- Run the Glue job by selecting it; after the Glue job executes successfully, I can see in Redshift the same messages I had published to Kafka.
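- To verify from the command line instead of the Redshift query editor, the Redshift Data API can run the check. A sketch; the cluster identifier, database, and user are placeholders for your own values:
# Submit the query; note the Id returned in the response
aws redshift-data execute-statement --cluster-identifier <ClusterId> --database <DatabaseName> --db-user <DbUser> --sql "select count(*) from msk_message"
# Fetch the result once the statement has finished
aws redshift-data get-statement-result --id <StatementId>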
Categories/Tags: kafka~aws msk