Deploying Apache Kafka and Apache Zookeeper

Apache Kafka is an open-source message broker written in Scala that aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds [1].

Kafka relies on Apache Zookeeper, a distributed configuration and synchronization service for large distributed systems [2].

Kafka is similar in some ways to RabbitMQ and other messaging systems, in the sense that:
- It brokers messages that are organized into topics
- Producers push messages
- Consumers pull messages
- Kafka runs in a cluster where all nodes are called brokers

In this tutorial I'll install and configure Kafka and Zookeeper on 3 servers. Zookeeper maintains a quorum, so you'll need at least 3 servers, and in general an odd number of them (an ensemble of 2n+1 servers tolerates n failures). I'll be using 3 OpenVZ containers, but the steps are the same on any Linux host. The process is pretty straightforward:
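The quorum arithmetic is worth spelling out: a Zookeeper ensemble stays available as long as a majority of its servers are up, which is why an even-sized ensemble buys you nothing over the next smaller odd size.

```shell
# Majority quorum: an ensemble of n servers needs floor(n/2) + 1 of them
# up, so it tolerates n - quorum failures. With 3 servers that's 1 failure;
# a 4th server raises the quorum to 3 but still tolerates only 1 failure.
for n in 3 4 5; do
  quorum=$(( n / 2 + 1 ))
  echo "servers=$n quorum=$quorum tolerated_failures=$(( n - quorum ))"
done
```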

Download Zookeeper and Kafka on all three servers:

root@server:~# wget http://apache.claz.org/zookeeper/zookeeper-3.4.6/zookeeper-3.4.6.tar.gz
root@server:~# wget http://apache.spinellicreations.com/kafka/0.8.2.1/kafka_2.9.1-0.8.2.1.tgz
Install Zookeeper:

root@server:~# apt-get update && apt-get install openjdk-7-jdk
root@server:~# cd /usr/local/
root@server:/usr/local# tar zxfv /usr/src/zookeeper-3.4.6.tar.gz
root@server:/usr/local# mv zookeeper-3.4.6/ zookeeper
root@server:/usr/local# cp zookeeper/conf/zoo_sample.cfg zookeeper/conf/zoo.cfg
root@server:/usr/local# mkdir -p /var/zookeeper/data
Here's an example config file to get you started; just replace the IPs with those of your own servers:

root@server:/usr/local# cat zookeeper/conf/zoo.cfg
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/var/zookeeper/data # <--- Important
clientPort=2181
maxClientCnxns=60
autopurge.snapRetainCount=3
autopurge.purgeInterval=1
server.1=10.188.97.12:2888:3888 # <--- Important
server.2=10.188.97.13:2888:3888 # <--- Important
server.3=10.188.97.14:2888:3888 # <--- Important
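One caveat: the "# <--- Important" markers in the listing above are annotations for this post, not valid zoo.cfg syntax. Zookeeper reads the file as Java properties, where a trailing "#..." becomes part of the value, so strip them before use. A small sketch that cleans an annotated file and sanity-checks the ensemble size (the /tmp paths are just for illustration):

```shell
# Write an annotated sample config, then strip trailing "#" annotations
# and count the server.N lines (one per ensemble member).
cat > /tmp/zoo.cfg.annotated <<'EOF'
dataDir=/var/zookeeper/data # <--- Important
clientPort=2181
server.1=10.188.97.12:2888:3888 # <--- Important
server.2=10.188.97.13:2888:3888 # <--- Important
server.3=10.188.97.14:2888:3888 # <--- Important
EOF
sed 's/[[:space:]]*#.*$//' /tmp/zoo.cfg.annotated > /tmp/zoo.cfg
grep -c '^server\.' /tmp/zoo.cfg
```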
Install Kafka:

root@server:/usr/local# tar zxfv /usr/src/kafka_2.9.1-0.8.2.1.tgz
root@server:/usr/local# mv kafka_2.9.1-0.8.2.1/ kafka
Here's an example config file; I've noted the changes required:

root@server:/usr/local# cat kafka/config/server.properties
broker.id=1 #<--- Important
port=9092
host.name=10.188.97.12 #<--- Important
num.network.threads=3
num.io.threads=8
socket.send.buffer.bytes=102400
socket.receive.buffer.bytes=102400
socket.request.max.bytes=104857600
log.dirs=/tmp/kafka-logs
num.partitions=1
num.recovery.threads.per.data.dir=1
log.retention.hours=168
log.segment.bytes=1073741824
log.retention.check.interval.ms=300000
log.cleaner.enable=false
zookeeper.connect=10.188.97.12:2181,10.188.97.13:2181,10.188.97.14:2181 #<--- Important
zookeeper.connection.timeout.ms=6000
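Two of the marked values must differ on every node: broker.id and host.name; the rest of the file can be shared. A sketch that patches a common template for the second broker (the /tmp path is just for illustration). Note also that log.dirs=/tmp/kafka-logs will typically be wiped on reboot, so point it at persistent storage for anything beyond a test.

```shell
# Write a shared template, then patch the per-node values in place.
cat > /tmp/server.properties <<'EOF'
broker.id=1
host.name=10.188.97.12
zookeeper.connect=10.188.97.12:2181,10.188.97.13:2181,10.188.97.14:2181
EOF

BROKER_ID=2              # unique per broker
HOST_IP=10.188.97.13     # this node's address
sed -i -e "s/^broker\.id=.*/broker.id=$BROKER_ID/" \
       -e "s/^host\.name=.*/host.name=$HOST_IP/" /tmp/server.properties
grep -E '^(broker\.id|host\.name)=' /tmp/server.properties
```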
Create the unique Zookeeper identifiers (myid) on all the nodes; each node's myid must match its server.N number in zoo.cfg:

root@server1:/usr/local# echo "1" > /var/zookeeper/data/myid
root@server2:/usr/local# echo "2" > /var/zookeeper/data/myid
root@server3:/usr/local# echo "3" > /var/zookeeper/data/myid
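Instead of typing the echo by hand on every box, the id can be derived from the node's position in the server list. A sketch, using the IPs from the example config; THIS_IP would normally come from the host itself, and the real data directory is /var/zookeeper/data:

```shell
SERVERS="10.188.97.12 10.188.97.13 10.188.97.14"   # same order as server.1..3
THIS_IP="10.188.97.13"                             # this node's address
DATA_DIR="/tmp/zookeeper-data"                     # /var/zookeeper/data on real nodes
mkdir -p "$DATA_DIR"
id=0
for ip in $SERVERS; do
  id=$(( id + 1 ))
  if [ "$ip" = "$THIS_IP" ]; then
    echo "$id" > "$DATA_DIR/myid"                  # second server gets myid 2
  fi
done
cat "$DATA_DIR/myid"
```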
Start Zookeeper first, on all three servers:
root@server1:/usr/local# /usr/local/zookeeper/bin/zkServer.sh start
Then start Kafka on each node:
root@server:/usr/local# kafka/bin/kafka-server-start.sh kafka/config/server.properties &
Your cluster is now up and running and ready to accept messages.

Create a new topic with a replication factor of three:

root@server:/usr/local# kafka/bin/kafka-topics.sh --create --zookeeper 10.188.97.12:2181 --replication-factor 3 --partitions 1 --topic my-replicated-topic
Describe the replicated topic:
root@server:/usr/local# kafka/bin/kafka-topics.sh --describe --zookeeper 10.188.97.12:2181 --topic my-replicated-topic
Publish a few messages to the new replicated topic:
root@server:/usr/local# kafka/bin/kafka-console-producer.sh --broker-list 10.188.97.12:9092 --topic my-replicated-topic
Consume the messages:
root@server:/usr/local# kafka/bin/kafka-console-consumer.sh --zookeeper 10.188.97.12:2181 --from-beginning --topic my-replicated-topic
To test cluster failover, just kill Zookeeper and Kafka on one of the servers; you should still be able to consume the messages.
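If you prefer to take the node down cleanly rather than with a bare kill, both distributions ship stop scripts (paths from the install above):

```shell
# Stop the broker and the local Zookeeper on the node being failed over.
/usr/local/kafka/bin/kafka-server-stop.sh
/usr/local/zookeeper/bin/zkServer.sh stop
```

With a replication factor of 3 and a 3-node Zookeeper ensemble, the two remaining nodes keep both the quorum and a full copy of the topic.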

There are a few important things to note about Kafka at the time of this post:

- Kafka is not suited for multi-tenant environments, as it has no security features: no encryption, authorization, or authentication. Tenant isolation has to be implemented at a lower level, e.g. with iptables.
- Kafka is not an end-user solution; customers need to write custom code for it.
- Kafka does not have many ready-made producers and consumers.
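As a concrete example of the lower-level isolation mentioned above, the broker port can be fenced off per tenant with iptables. This is a sketch, not a hardening guide; the subnet is assumed from this setup, and the same pair of rules would be repeated for the Zookeeper client port 2181:

```shell
# Allow only a known tenant subnet to reach the Kafka broker port,
# then drop everyone else.
iptables -A INPUT -p tcp --dport 9092 -s 10.188.97.0/24 -j ACCEPT
iptables -A INPUT -p tcp --dport 9092 -j DROP
```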

Resources:

[1] http://kafka.apache.org/documentation.html
[2] https://zookeeper.apache.org/doc/trunk/zookeeperAdmin.html