Apache Kafka is well known for its high performance. It is able to process a high rate of messages while maintaining low latency. Apache Pulsar is a fast-growing alternative to Kafka. There are reports that suggest Pulsar has better performance characteristics than Kafka, but the raw results are not easy to find. Plus those reports are based on older versions of Pulsar and Kafka, which are both fast-moving projects. So, in a series of posts, we are going to run the latest stable Kafka (2.3.0) and Pulsar (2.4.0) versions through a number of performance tests and publish those results.
In this first post, we are going to focus on latency. In future posts, we will look at throughput. Before we dive into the test results, let’s go over the test methodology.
Performance Testing Of Messaging systems
Both Kafka and Pulsar provide performance testing tools as part of their packages. While it would be possible to modify the performance tools from either to work with both, we are going to use a third-party benchmark framework from the OpenMessaging Project, a Linux Foundation Collaborative Project. The OpenMessaging Project has a goal of providing vendor-neutral and language-independent standards for messaging and streaming technologies. The project includes a performance testing framework that supports various messaging technologies. All the code used in these tests is in in the OpenMessaging benchmark GitHub repository. The tests are designed to run in public cloud providers. In our case, we are going to running all the tests in Amazon Web Services (AWS) using standard EC2 instances.
We are publishing the full set of outputs from each test run as a series of GitHub gists. So, you are welcome to analyze the data and come up with your own insights. Of course, you could also run the tests yourself and generate new data. You should get similar results since we find the tests are reliable run after run even with different sets of EC2 instances. During our testing, we stood up and tore down the environment multiple times.
Although we used the OpenMessaging benchmark tools to run the tests which include a set of workloads, we are going to add some workloads that are inspired by the blog post on the LinkedIn Engineering site titled “Benchmarking Apache Kafka: 2 Million Writes Per Second (On Three Cheap Machines)” because we thought it would be interesting for comparison purposes. However, this is quite an old blog post. Today, hardware is better and what we are using is not necessarily cheap, though far from top-of-the-line. So — spoiler alert — both Kafka and Pulsar can handle 2 million writes per second in the blog title without breaking a sweat, which we will see in a future post.
The OpenMessaging benchmark is a framework that is open and extensible. To add a messaging technology to test, you just need to add a Terraform configuration, an Ansible playbook, and an implementation of a Java library that controls the producers and consumers in the test harness. The Terraform configuration currently provided for Kafka and Pulsar are the same, starting the same set of EC2 instances in AWS. Obviously, for comparison purposes, it is important that the test is run on the same hardware configuration. So, the existing benchmark code makes it easy to compare Kafka and Pulsar. As we mentioned, all the code to run these tests is in the OpenMessaging GitHub benchmark repository.
The tests all begin with a warm-up period before starting the actual measuring phase of the test. The latency tests publish at a constant rate, recording the latency and throughput at regular intervals.
If you are planning to run these tests yourself, be warned. Running the tests in AWS is not cheap. A significant number of well-powered EC2 instances are needed. This makes sense when doing benchmarks of software. You want to make sure the hardware isn’t the bottleneck, so you oversubscribe the hardware. But this oversubscription comes at a cost (over $5 an hour, $3.8K a month) that will make a mark on your AWS bill if you’re not careful. You don’t want to leave the environment running if you aren’t using it (for example, overnight), and make sure you delete the all resources after running the tests.
Before we dive into the test results, there are a few important concepts we need to cover to put the results in context. First, we need to go over what the test is measuring: latency. Next, we need to go over the durability of the messages, especially when it comes to storing messages to disk. And last we need to understand the different models of message replication used in Kafka and Pulsar.
There are many similarities between Kafka and Pulsar, but there are some significant differences that can impact performance. To fairly evaluate both systems, you need to understand these differences.
A Note on Latency Tests
All latency measurements necessarily include the network latency between the application and the messaging system. Assuming all tests are performed in the same network configuration and that network provides consistent latency, then the network latency is a constant that affects all tests equally. When comparing latency measurements, then, it is important the network is held constant when making comparisons.
We point this out because the latency in these tests differs from those published in the LinkedIn Engineering blog post. Those tests were run on (presumably) a dedicated 1 GB Ethernet network. Our tests are run in the black box network of a public cloud provider on instances that provide “Up to 10 Gigabit” network performance. So, the latency numbers in this test cannot be directly compared with those in that blog post. However, we should be able to compare the latency results between the two messaging systems in our tests, since we are keeping the network configuration constant.
There are two types of latency we are measuring: publishing latency and end-to-end latency.
Let’s start by discussing end-to-end latency since it is relatively straightforward. End-to-end latency is simply the time from when the message is sent by the producer to when it received by the consumer. In both the Pulsar and Kafka implementations of the benchmark, the publish timestamp is generated by the API when sending the message. When the message is received by the consumer, a second timestamp is taken. The difference between these two times is the end-to-end latency.
Where end-to-end latency gets complicated is with the clocks used to take those timestamp measurements. When measuring end-to-end latency the clocks used for the timestamps must be synchronized. If they are not synchronized, the difference between the clocks will impact your measurements. You end up measuring how much the clocks are different as well as the messaging system latency. Since clocks drift over time, the problem becomes worse in long-running tests.
Ideally, the producer and consumer are on the same server and therefore the timestamps are taken using the same clock, so there is no difference. Unfortunately, the benchmark tests are designed to separate the producer and consumer on to different servers to distribute the load.
The next best option is to synchronize the clocks between the servers as accurately as possible. Luckily, AWS has a free time sync service that, combined with chrony, appears to keep the clocks between EC2 instance in very close sync (within a handful of microseconds of the reference clock). For these tests, we installed chrony on all the client servers and configured them to use the AWS time source.
Publishing latency is the amount of time that passes from when the message is sent until the time an acknowledgment is received from the messaging system. The acknowledgment indicates that the messaging system has persisted the message and will guarantee its delivery. Essentially, the acknowledgment indicates the responsibility for handling the message has been successfully passed from the producing application to the messaging system. Low, consistent publishing latency is good for applications. When the application is ready to hand off the delivery of a message, the messaging system quickly accepts the message, allowing the application to continue working on application-level concerns, such as business logic. This handoff of responsibility for the message is a key feature of any messaging system.
In the benchmark tests, the messages are sent asynchronously, so the producer doesn’t actually block waiting for the acknowledgment of a message. The call to send the message returns immediately and a callback handles the acknowledgment when it arrives. It may seem that publishing latency is not that important when doing asynchronous sends, but it is. Both the Kafka (max.in.flight.requests.per.connection) and Pulsar (maxPendingMessages) producers have a buffer for holding unacknowledged messages. If this buffer fills up, calls to the send method will start to block (or fail, depending on configuration). So if the messaging system does not quickly acknowledge messages, it can lead to the producing application waiting for the messaging system.
In the benchmark tests, the publishing latency is measured from the time when calling the send method until the acknowledgment callback is triggered. These two timestamps are done in the producer, so clock synchronization is not a concern.
Durability and Flushing Messages to Disk
Durability in the context of a messaging system means that if parts of the system fail, messages are not lost. To ensure this, messages need to be stored somewhere that will survive a crash of the server running the messaging system software. Both Kafka and Pulsar ultimately write messages to disk as a way to provide durability. However, telling the operating system to write a message to the file system is not sufficient. All POSIX systems cache reads and writes to the file system in memory for improved performance. Writing to the file system only means that the data has been put into the write cache, but it is not necessarily stored safely on the physical disk. Since this cache resides in memory if the server crashes (for example, power loss, kernel panic), data that has been written to the cache but not yet written or flushed to disk will be lost. The operation that forces cached data to be written to the physical disk is called fsync. To guarantee that a message has been stored on disk, the file cache has to be flushed to disk after writing each message by triggering an fsync operation.
By default, Kafka does not explicitly flush each message to disk. It leaves the decision when to flush to the operating system. So in the event of a server crash, some undefined amount of message data can be lost. This is the default setting for performance reasons. Writing to the physical disk is slower than writing to the memory cache, so this flushing slows down message processing performance. It is possible to configure Kafka to flush messages regularly (or even for every message), but the Kafka documentation recommends against this for efficiency reasons.
Pulsar, on the other hand, flushes each message to the disk by default. The message is not acknowledged to the producer until it is stored on the physical disk. This provides stronger durability guarantees if the server crashes. It also, as we shall see, is able to provide these durability guarantees while maintaining high performance. It is able to do this because it uses Apache Bookkeeper to store messages, which is a distributed log storage system that has been optimized for this purpose. It is, however, possible to disable this fsync behavior in Pulsar.
Since the flushing to disk can have an impact on performance, we are going to run the performance tests twice for both Kafka and Pulsar, once with flushing each message to disk enabled and once with it disabled. This will allow for a better apples-to-apples comparison between the two systems.
Both Kafka and Pulsar provide additional message durability by making replicas of each message. That way, even if one of the copies of the message is lost for some reason, there will be other copies available for recovery. Message replication has an effect on performance and is implemented differently on Kafka and Pulsar. We want to make sure we are getting similar replication behavior between Kafka and Pulsar in our tests.
Kafka uses a leader-follower replication model. One of the Kafka brokers is elected the leader for a partition. All messages are initially written to the leader, and the followers read and replicate the messages from the leader. Kafka monitors if each follower is caught up, or “in sync” with the leader. With Kafka, you can control the total number of copies (replication-factor) that are made of a message and the minimum number of replicas that need to be in-sync (min.insync.replicas) before a message is considered successfully stored and acknowledged to the producer. A typical configuration would have Kafka make 3 copies of a message and only acknowledge once at least two (a majority) have been confirmed successfully written. This is the configuration (replication-factor=3, in.sync.replicas=2, acks=all) we will use for all our Kafka tests.
Pulsar uses a quorum-vote replication model. Multiple copies of the message (write quorum) are written in parallel. Once some number of copies have been confirmed stored, then the message is acknowledged (ack quorum). Unlike the leader-follower model which generally writes copies to the same set of leaders and followers for a particular topic partition, Pulsar can spread (or stripe) the copies over a set of storage nodes (ensemble) which can improve the read and write performance. In our tests, we only have 3 storage nodes and want to make 3 copies like with the Kafka configuration, so our settings will not take advantage of striping. Our configuration for Pulsar (ensemble=3, write quorum=3, ack quorum=2) gives similar replication behavior as Kafka: 3 copies of the messages, acknowledge after 2 confirmed.
Now that we have covered off some of these important concepts, let’s move on to the details of the tests.
Setting Up the Benchmark
To set up the benchmark tests, we followed the steps documented on the OpenMessaging site. After applying the Terraform configuration, you get the following set of EC2 instances:
|Count||EC2 Instance||Pulsar Purpose||Kafka Purpose|
|3||i3.4xlarge||Pulsar Broker and Bookkeeper Bookie||Kafka Broker|
|4||c5.2xlarge||Test Client||Test Client|
The i3.4xlarge instances used for Pulsar/Bookkeeper and the Kafka broker include 2 NVMe SSDs for high disk performance. These are pretty powerful virtual machines, sporting 16 vCPUs, 122 GiB of memory along with their high-performance disks. Having 2 SSDs is ideal for the Pulsar setup since it writes 2 streams of data which can be parallelized on the disks. Kafka also takes advantage of these two SSDs by distributing partitions over both drives.
The Ansible playbook for both Pulsar and Kafka tunes the performance for low latency using the tuned-adm command (latency-performance profile).
Although the benchmark comes with a set of workloads that we could run right out of the box, we are going to modify them a little bit so that they line up more closely with the benchmark results for Kafka in the LinkedIn Engineering blog. Defining new workloads is easy. You just need to create a YAML file with the updated parameters to use for the test.
If you look at tests in the LinkedIn blog, you will see that they are all run with 100-byte messages. The reason given “for focusing on small records in these tests is that it is the harder case for a messaging system (generally). ” That makes sense since there is a fixed amount of work to be done for each message regardless of its size, so small messages measure how efficient the system is in processing a message. More efficiency generally leads to higher performance. It also reduces the possibility that the test is being impacted by throughput limits of the network or disk. How well a messaging system performs when handling large messages can be an interesting benchmark, but for now, we are going to focus on small messages.
The other change from the stock benchmark workloads is that we are adding a 6-partition test. Six partitions are used extensively in the LinkedIn tests, so we wanted to include that workload in our set.
You may notice that the LinkedIn blog includes producer-only and consumer-only workloads. All our workloads are going to include both producers and consumers for two reasons. First, as it stands, the benchmark tests do not support standalone producer-only or consumer-only workloads. Second, in the real world, messaging systems will always be serving producers and consumers at the same time. So, both producing and consuming messages during the test gives a more realistic test scenario.
All that being said, here is the set of workloads we used for these tests:
|1||1||100 Byte||1||1||1||50 000||50 000||15|
|1||6||100 Byte||1||1||1||50 000||50 000||15|
|1||16||100 Byte||1||1||1||50 000||50 000||15|
A Kafka consumer group and a Pulsar subscription are similar concepts. They both allow one or more consumers to receive all the messages on a topic. If a topic has multiple consumer groups/subscriptions associated with it, the messaging system is providing multiple copies of each message in the topic, or “fanning out” the message. For each message published into the topic, one message is sent to each consumer group/subscription. If all messages are sent to a single topic that has a single consumer group/subscription, then the producer rate and consumer rate are equal. If, for example, there are two consumer groups/subscriptions, then the consumer rate is double the producer rate. For these tests, we are keeping things simple. There is only one consumer group/subscription, so the producer rate and consumer rate are equal.
Apache Pulsar Results
The following sections present the latency results for the Apache Pulsar tests. We first present the results with per-message flushing enabled since this is how Pulsar works out of the box, followed by the results with per-message flushing disabled. For each workload, we include two graphs: one for the 99th percentile publishing latency over the duration of the test, and one for the average end-to-end latency. The graphs are followed by a table that summarizes the latency distribution. The latency measurements reported in the tables are aggregated for the duration of the test. The percentile calculations for end-to-end latency have less precision than the publishing latency since the end-to-end calculations are using the timestamp that is automatically placed in the message header and that timestamp only has ms precision. The publishing latency is calculated using nanosecond precision.
All tests used a 100-byte message. The produce and consume rates were a constant 50K msgs/s during the 15-minute duration of each test. Only two client servers were used. The version of Apache Pulsar used for the tests was 2.4.0.
Latency with Flush
Test 1: 1 topic, 1 partition
You can get the command output and raw results here. Click on “view raw” to see a larger version of a graph.
TEST 2: 1 TOPIC, 6 PARTITIONs
You can get the command output and raw results here. Click on “view raw” to see a larger version of a graph.
Test 3: 1 Topic, 16 Partitions
You can get the command output and raw results here. Click on “view raw” to see a larger version of a graph.
Since partitions are a unit of parallelism in both Pulsar and Kafka, we would expect to see latency reduction as we increase the number of partitions, and this is exactly the result we get. Across the board, as the number of partitions increases both the publishing and the end-to-end latency decreases. There are some outliers in each test, but the maximum latency never exceeds 267 ms. The publishing latency is more tightly bounded than the end-to-end latency. The 99.99th percentile in publishing latency never exceeds 11.6 ms in any of the tests. The effect of extra partitions on latency is most apparent in the end-to-end latency results with 16 partitions. The 16-partition tests record an average latency (3 ms) that is one-third of the 1-partition tests (9 ms).
Pulsar provides consistent publishing latency over time. All the tests run for 15 minutes. As shown in the graphs, the average publishing latency shows very little variation over the duration of the test. The end-to-end latency shows some variability over time, with the average latency increasing by about 2 ms at a regular period of about 90 seconds. Interestingly, this 2-ms periodic bump appears to be constant regardless of the end-to-end latency. For example, the average end-to-end latency is 9 ms for 1 partition and only 3 ms for 16-partitions, but the spike is 2 ms (9 to 11, 3 to 5) in both cases.
Latency without Flush
The following tests are identical to the previous set, except that per-message flushing to disk is disabled by setting journalSyncData=false in the bookkeeper.conf file and restarting the software (Pulsar broker and Bookkeeper).
TEST 4: 1 TOPIC, 1 PARTITION
TEST 5: 1 TOPIC, 6 PARTITIONS
TEST 6: 1 TOPIC, 16 PARTITIONS
As expected, the no-flush results give lower latency, but not very much. For example, the 99th percentile publishing latency with 1 partition when flushing to disk is 4.129 ms but drops to only 3.928 ms when not flushing. In fact, in the 16-partition test, there is little difference between the flush and no-flush cases. The periodic 2-ms spike in end-to-end latency still exists in these tests with the same time interval.
Given the durability tradeoff that comes with disabling flushing to disk, it hardly seems worth disabling it from a latency standpoint when using Apache Pulsar.
Apache Kafka Results
Since the default behavior for Kafka is to not flush each message to disk, we are going to start with those results, followed by the results when flushing each message to disk. As with the Pulsar tests, all tests used a 100-byte message and had a message rate of 50K msgs/s. Only two client servers were used. The latency measurements reported in the tables are aggregated for the duration of the test.
The version of Apache Kafka for these tests was 2.11-2.3.0.
Latency with No Flush
Test 7: 1 topic, 1 partition
TEST 8: 1 TOPIC, 6 PARTITIONs
Test 9: 1 Topic, 16 Partitions
Looking first at the publishing latency with 1 partition, we see that on average Kafka without per-message flushing (2.19 ms) has lower latency than Pulsar whether Pulsar is flushing each message to disk or not ( 2.969 ms flush, 2.72 ms no flush.) However, in the distribution of latency, we see a major difference between Pulsar and Kafka. While Pulsar has tight latency distribution all the way up to the 99.9th percentile (2.916 to 4.095 ms from 50th to 99.9th), Kafka latency spikes up to 149.616 ms at the 99.9th percentile. This is quite a difference. At the 99.99th percentile with 1 partition, Pulsar latency is 52.958 ms and Kafka is nearly 4 times higher at 201.701 ms. And here we are comparing default modes, so Pulsar is flushing each message to disk, while Kafka is not. If you disable flushing to disk for Pulsar, the 99.99th latency drops to just 4.508 ms.
The reason for the large number of publishing latency outliers in Kafka seems pretty obvious when looking at the graph of 99th percentile publishing latency over time. Kafka shows periodic spikes where publishing latency jumps from single digits to over 100 ms. The effect is diminished as the number of partitions is increased, but is still present. Compare this to Pulsar where the 99th percentile publishing latency is essentially a straight line for the entire duration of the test.
Another interesting difference between Pulsar and Kafka is that increasing the number of partitions lowers the publishing latency for Pulsar, but has the opposite effect for Kafka. Although the average publishing latency for Kafka is lower than Pulsar for the 1-partition test, Pulsar is lower for the 6- and 16-partition tests. For the 16-partition test, Pulsar gives under 3 ms of average publishing latency, while Kafka has nearly 8.5 ms.
Looking at the average end-to-end latency, Kafka beats Pulsar at the lowest partition count, but as with the publishing latency, end-to-end latency increases with partition count, so that at 16-partitions, Kafka has an average end-to-end latency of 11 ms, while Pulsar is approaching 3 ms. With end-to-end latency over time, we saw a periodic 2 ms spike with Pulsar. With Kafka, we similarly see spikes, but they are more frequent and typically higher, often spiking over 5 ms.
Latency with Flush
These tests are identical to the previous set, except that per-message flushing (fsync) is enabled. This was configured (flush.messages=1, flush.ms=0) for all topics used in the test.
TEST 10: 1 TOPIC, 1 PARTITION
TEST 11: 1 TOPIC, 6 PARTITIONS
TEST 12: 1 TOPIC, 16 PARTITIONS
This set of tests are an apples-to-apples comparison to Pulsar’s default behavior of flushing each message to disk. In this comparison, Pulsar is clearly better. In the 1-partition test where Kafka had an advantage over Pulsar when it wasn’t flushing to disk, when both systems are flushing to disk, Pulsar’s average latency is 2.969 ms while Kafka’s is more than double at 6.652 ms. Because adding partitions still increases latency for Kafka in these tests, the disparity grows even greater at 16-partitions where Pulsar is giving 2.72 ms of latency and Kafka is clocking in at 18.454 ms, which is 6 times higher.
The large publishing latency spikes still remain when Kafka is configured to flush each message to disk, but happen less often.
For average end-to-latency, not surprisingly flushing to disk increases the latency for Kafka across the board. Kafka actually has an advantage in the 1-partition case ( 7.129 vs 9.052 ms), but Pulsar is clearly better in the 6- and 16-partitions cases. Looking at the end-to-end latency over time with Kafka, there are still periodic spikes as high as 5 ms.
Based on this set of results, we can draw the following conclusions:
- Pulsar gives more predictable latency over time. The graphs of latency over time are smoother with Pulsar than Kafka. This comparison chart (6-partition, average end-to-end latency, no flushing) shows one case where Kafka latency is actually lower than Pulsar, but the Pulsar plot is less variable:
- Pulsar has more tightly bounded latency. Most of the Kafka tests show elevated latency at the 99.9th percentile. In the few cases where Pulsar shows elevated latency, it occurs at the 99.99th percentile. This comparison chart (6-partition, 99th publishing latency, flushing), clearly shows just how bounded Pulsar latency is compared to Kafka:
- Increasing the partition count for a Pulsar topic lowers latency when using a single producer and consumer. Increasing the partition count for a Kafka topic increases latency for a single producer and consumer.
- Given a need for the highest message durability, Pulsar provides lower latency than Kafka.
- Disabling flushing of messages to disk with Pulsar provides small latency gains and is not warranted given the durability tradeoff.
For latency-sensitive workloads, Pulsar is the overall winner. It is able to provide consistent, low latency as well as strong durability guarantees. Of course, not all workloads are latency-sensitive. Some may be willing to tradeoff latency for higher throughput. In a future post, we will be doing a similar comparison of the throughput performance between Apache Kafka and Apache Pulsar.
If you found this post useful, please let us know in the comments.
Want to see what Pulsar is about? Just sign up for our free plan to give it a try. It only takes a minute to get started.
Apache Pulsar, Apache Kafka, and Apache BookKeeper are trademarks of the Apache Software Foundation.