Deploying Apache Kafka and Apache Zookeeper

Apache Kafka is an open-source message broker written in Scala that aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds [1].

Kafka relies on Apache Zookeeper, a distributed configuration and synchronization service for large distributed systems [2].

Kafka is similar in some ways to RabbitMQ and other messaging systems, in the sense that:
- It brokers messages that are organized into topics
- Producers push messages
- Consumers pull messages
- Kafka runs in a cluster where all nodes are called brokers

In this tutorial I'll install and configure Kafka and Zookeeper on 3 servers. Zookeeper maintains a quorum, so you'll need at least 3 servers, and in general an odd number of them (an ensemble of 2n+1 servers tolerates n failures). I'll be using 3 OpenVZ containers, but the steps are the same on any Linux host. The process is pretty straightforward:
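The quorum arithmetic is worth spelling out: a Zookeeper ensemble stays available as long as a majority of its servers are up, which is why an even-sized ensemble buys you nothing over the next smaller odd size.

```shell
# Majority quorum: an ensemble of n servers needs floor(n/2) + 1 of them
# up, so it tolerates n - quorum failures. With 3 servers that's 1 failure;
# a 4th server raises the quorum to 3 but still tolerates only 1 failure.
for n in 3 4 5; do
  quorum=$(( n / 2 + 1 ))
  echo "servers=$n quorum=$quorum tolerated_failures=$(( n - quorum ))"
done
```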

Download Zookeeper and Kafka on all three servers:

root@server:~# wget http://apache.claz.org/zookeeper/zookeeper-3.4.6/zookeeper-3.4.6.tar.gz
root@server:~# wget http://apache.spinellicreations.com/kafka/0.8.2.1/kafka_2.9.1-0.8.2.1.tgz
Install Zookeeper:

root@server:~# apt-get update && apt-get install openjdk-7-jdk
root@server:~# cd /usr/local/
root@server:/usr/local# tar zxfv /usr/src/zookeeper-3.4.6.tar.gz
root@server:/usr/local# mv zookeeper-3.4.6/ zookeeper
root@server:/usr/local# cp zookeeper/conf/zoo_sample.cfg zookeeper/conf/zoo.cfg
root@server:/usr/local# mkdir -p /var/zookeeper/data
Here's an example config file to get you started; just replace the IPs with those of your own servers:

root@server:/usr/local# cat zookeeper/conf/zoo.cfg
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/var/zookeeper/data # <--- Important
clientPort=2181
maxClientCnxns=60
autopurge.snapRetainCount=3
autopurge.purgeInterval=1
server.1=10.188.97.12:2888:3888 # <--- Important
server.2=10.188.97.13:2888:3888 # <--- Important
server.3=10.188.97.14:2888:3888 # <--- Important
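One caveat: the "# <--- Important" markers in the listing above are annotations for this post, not valid zoo.cfg syntax. Zookeeper reads the file as Java properties, where a trailing "#..." becomes part of the value, so strip them before use. A small sketch that cleans an annotated file and sanity-checks the ensemble size (the /tmp paths are just for illustration):

```shell
# Write an annotated sample config, then strip trailing "#" annotations
# and count the server.N lines (one per ensemble member).
cat > /tmp/zoo.cfg.annotated <<'EOF'
dataDir=/var/zookeeper/data # <--- Important
clientPort=2181
server.1=10.188.97.12:2888:3888 # <--- Important
server.2=10.188.97.13:2888:3888 # <--- Important
server.3=10.188.97.14:2888:3888 # <--- Important
EOF
sed 's/[[:space:]]*#.*$//' /tmp/zoo.cfg.annotated > /tmp/zoo.cfg
grep -c '^server\.' /tmp/zoo.cfg
```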
Install Kafka:

root@server:/usr/local# tar zxfv /usr/src/kafka_2.9.1-0.8.2.1.tgz
root@server:/usr/local# mv kafka_2.9.1-0.8.2.1/ kafka
Here's an example config file; I've noted the changes required:

root@server:/usr/local# cat kafka/config/server.properties
broker.id=1 #<--- Important
port=9092
host.name=10.188.97.12 #<--- Important
num.network.threads=3
num.io.threads=8
socket.send.buffer.bytes=102400
socket.receive.buffer.bytes=102400
socket.request.max.bytes=104857600
log.dirs=/tmp/kafka-logs
num.partitions=1
num.recovery.threads.per.data.dir=1
log.retention.hours=168
log.segment.bytes=1073741824
log.retention.check.interval.ms=300000
log.cleaner.enable=false
zookeeper.connect=10.188.97.12:2181,10.188.97.13:2181,10.188.97.14:2181 #<--- Important
zookeeper.connection.timeout.ms=6000
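Two of the marked values must differ on every node: broker.id and host.name; the rest of the file can be shared. A sketch that patches a common template for the second broker (the /tmp path is just for illustration). Note also that log.dirs=/tmp/kafka-logs will typically be wiped on reboot, so point it at persistent storage for anything beyond a test.

```shell
# Write a shared template, then patch the per-node values in place.
cat > /tmp/server.properties <<'EOF'
broker.id=1
host.name=10.188.97.12
zookeeper.connect=10.188.97.12:2181,10.188.97.13:2181,10.188.97.14:2181
EOF

BROKER_ID=2              # unique per broker
HOST_IP=10.188.97.13     # this node's address
sed -i -e "s/^broker\.id=.*/broker.id=$BROKER_ID/" \
       -e "s/^host\.name=.*/host.name=$HOST_IP/" /tmp/server.properties
grep -E '^(broker\.id|host\.name)=' /tmp/server.properties
```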
Create the unique Zookeeper identifiers (myid) on all the nodes; each node's myid must match its server.N number in zoo.cfg:

root@server1:/usr/local# echo "1" > /var/zookeeper/data/myid
root@server2:/usr/local# echo "2" > /var/zookeeper/data/myid
root@server3:/usr/local# echo "3" > /var/zookeeper/data/myid
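Instead of typing the echo by hand on every box, the id can be derived from the node's position in the server list. A sketch, using the IPs from the example config; THIS_IP would normally come from the host itself, and the real data directory is /var/zookeeper/data:

```shell
SERVERS="10.188.97.12 10.188.97.13 10.188.97.14"   # same order as server.1..3
THIS_IP="10.188.97.13"                             # this node's address
DATA_DIR="/tmp/zookeeper-data"                     # /var/zookeeper/data on real nodes
mkdir -p "$DATA_DIR"
id=0
for ip in $SERVERS; do
  id=$(( id + 1 ))
  if [ "$ip" = "$THIS_IP" ]; then
    echo "$id" > "$DATA_DIR/myid"                  # second server gets myid 2
  fi
done
cat "$DATA_DIR/myid"
```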
Start Zookeeper first, on all three servers:
root@server1:/usr/local# /usr/local/zookeeper/bin/zkServer.sh start
Then start Kafka on each node:
root@server:/usr/local# kafka/bin/kafka-server-start.sh kafka/config/server.properties &
Your cluster is now up and running and ready to accept messages.

Create a new topic with a replication factor of three:

root@server:/usr/local# kafka/bin/kafka-topics.sh --create --zookeeper 10.188.97.12:2181 --replication-factor 3 --partitions 1 --topic my-replicated-topic
Describe the replicated topic:
root@server:/usr/local# kafka/bin/kafka-topics.sh --describe --zookeeper 10.188.97.12:2181 --topic my-replicated-topic
Publish a few messages to the new replicated topic:
root@server:/usr/local# kafka/bin/kafka-console-producer.sh --broker-list 10.188.97.12:9092 --topic my-replicated-topic
Consume the messages:
root@server:/usr/local# kafka/bin/kafka-console-consumer.sh --zookeeper 10.188.97.12:2181 --from-beginning --topic my-replicated-topic
To test cluster failover, just kill Zookeeper and Kafka on one of the servers; you should still be able to consume the messages.
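If you prefer to take the node down cleanly rather than with a bare kill, both distributions ship stop scripts (paths from the install above):

```shell
# Stop the broker and the local Zookeeper on the node being failed over.
/usr/local/kafka/bin/kafka-server-stop.sh
/usr/local/zookeeper/bin/zkServer.sh stop
```

With a replication factor of 3 and a 3-node Zookeeper ensemble, the two remaining nodes keep both the quorum and a full copy of the topic.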

There are a few important things to note about Kafka at the time of this post:

- Kafka is not suited for multi-tenant environments, as it has no security features: no encryption, authorization, or authentication. Tenant isolation has to be implemented at a lower level, e.g. with iptables.
- Kafka is not an end-user solution; customers need to write custom code for it.
- Kafka does not have many ready-made producers and consumers.
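As a concrete example of the lower-level isolation mentioned above, the broker port can be fenced off per tenant with iptables. This is a sketch, not a hardening guide; the subnet is assumed from this setup, and the same pair of rules would be repeated for the Zookeeper client port 2181:

```shell
# Allow only a known tenant subnet to reach the Kafka broker port,
# then drop everyone else.
iptables -A INPUT -p tcp --dport 9092 -s 10.188.97.0/24 -j ACCEPT
iptables -A INPUT -p tcp --dport 9092 -j DROP
```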

Resources:

[1] http://kafka.apache.org/documentation.html
[2] https://zookeeper.apache.org/doc/trunk/zookeeperAdmin.html