06.09.2021
Open Source

Apache Pulsar: Seamless Storage Evolution and Ultra-High Performance with Persistent Memory

Joe Francis, Apache Pulsar Project Management Committee & Rajan Dhabalia, Principal Software Engineer, Verizon Media

light blue, red, orange streams

Introduction 

We have been using Apache Pulsar as a managed service in Yahoo! since 2014. After open-sourcing Pulsar in 2016, entering the Apache Incubator in 2017, and graduating as an Apache Top-Level Project in 2018, there have been a lot of improvements made and many companies have started using Pulsar for their messaging and streaming needs. At Yahoo, we run Pulsar as a hosted service, and more and more use cases run on Pulsar for different application requirements such as low latency, retention, cold reads, high fanout, etc. With the rise of the number of tenants and traffic in the cluster, we are always striving for a system that is both multi-tenant and can use the latest storage technologies to enhance performance and throughput without breaking the budget. Apache Pulsar provides us that true multi-tenancy by handling noisy-neighbor syndrome and serving users to achieve their SLA without impacting each other in a shared environment. Apache Pulsar also has a distinct architecture that allows Pulsar to adopt the latest storage technologies from time to time to enhance system performance by utilizing the unique characteristics of each technology to get the best performance out of it. 

In this blog post, we are going to discuss two important characteristics of Apache Pulsar, multi-tenancy and adoption of next-generation storage technologies like NVMe and Persistent memory to achieve optimum performance with very low-cost overhead. We will also discuss benchmark testing of Apache Pulsar with persistent memory that shows we have achieved 5x more throughput with Persistent memory and also reduced the overall cost of the storage cluster.

 

What is Multi-Tenancy?

Multi-tenancy can be easily understood with the real-estate analogy and by understanding the difference between an apartment building and a single residence home. In apartment buildings resources (exterior wall, utility, etc.) are shared among multiple tenants whereas in a single residence only one tenant consumes all resources of the house. When we use this analogy in technology, it describes multi-tenancy in a single instance of hardware or software that has more than one resident. And it's important that all residents on a shared platform operate their services without impacting each other. 

Apache Pulsar has an architecture distinct from other messaging systems. There is a clear separation between the compute layer (which does message processing and dispatching) and the storage layer (that handles persistent storage for messages using Apache BookKeeper). In BookKeeper, bookies (individual BookKeeper storage nodes) are designed to use three separate I/O paths for writes, tailing reads, and backlog reads. Separating these paths is important because writes and tailing reads use-cases require predictable low latency while throughput is more important for backlog reads use cases. 

Real-time applications such as databases and mission-critical online services need predictable low latency. These systems depend on low-latency messaging systems. In most messaging systems, under normal operating conditions, dispatch of messages occurs from in-memory caches. But when a message consumer falls behind, multiple interdependent factors get triggered. The first is storage backlog. Since the system guarantees delivery, messages need to be persistently stored until delivery, and a slow reader starts building a storage backlog. Second, when the slow consumer comes back online, it starts to consume messages from where it left off. Since this consumer is now behind, and older messages have been aged out of the in-memory cache, messages need to be read back from disk storage, and cold reads on the message store will occur. This backlog reads on the storage device will cause I/O contention with writes to persist incoming messages to storage getting published currently. This leads to general performance degradation for both reads and writes. In a system that handles many independent message topics, the backlog scenario is even more relevant, as backlogged topics will cause unbalanced storage across topics and I/O contention. Slow consumers force the storage system to read the data from the persistent storage medium, which could lead to I/O thrashing and page cache swap-in-and-out. This is worse when the storage I/O component shares a single path for writes, caught-up reads, and backlog reads. 

A true test of any messaging system should be a test of how it performs under backlog conditions. In general, published throughput benchmarks don't seem to account for these conditions and tend to produce wildly unrealistic numbers that cannot be scaled or related to provisioning a production system. Therefore, the benchmark testing that we are presenting in this blog is performed with random cold reads by draining backlog across multiple topics.

 

BookKeeper and I/O Isolation

Apache BookKeeper stores log streams as segmented ledgers in bookie hosts. These segments (ledgers) are replicated to multiple bookies. This maximizes data placement options, which yields several benefits, such as high write availability, I/O load balancing, and a simplified operational experience. Bookies manage data in a log-structured way using three types of files:

Journal contains BookKeeper transaction logs. Before any update to a ledger takes place, the bookie ensures that a transaction describing the update is written to non-volatile storage.

Entry log (Data-File) aggregates entries from different ledgers (topics) and writes sequentially and asynchronously. It is also known as Data File.

Entry log index manages an index of ledger entries so that when a reader wants to read an entry, the BookKeeper locates the entry in the appropriate entry log and offset using this index.

With two separate file systems, Journal and Data-file, BookKeeper is designed to use separate I/O paths for writes, caught-up reads, and backlog reads. BookKeeper does sequential writes into journal files and performs cold reads from data files for the backlog draining.

 

Figure 1: Pulsar I/O Isolation Architecture Diagram

[Figure 1: Pulsar I/O Isolation Architecture Diagram]

 

Adoption of Next-Generation Storage Technologies

In the last decade, storage technologies have evolved with different types of devices such as HDD, SSD, NVMe, persistent memory, etc. and we have been using these technologies for Pulsar storage as time changes. Adoption of the latest technologies is helpful in Pulsar to enhance system performance but it’s also important to design a system that can fully use a storage device based on its characteristics and squeeze the best performance out of each kind of storage.

Table 2. shows how each device can fit into the BookKeeper model to achieve optimum performance.

BookKeeper adaptation based on characteristics of storage devices

 [Table 2: BookKeeper adaptation based on characteristics of storage devices]

 

Hard Disk Drive (HDD)

From the 80s until a couple of years ago, database systems have relied on magnetic disks as secondary storage. The primary advantages of a hard disk drive are affordability from a capacity perspective and reasonably good sequential performance. As we have already discussed, bookies append transactions to journals and always write to journals sequentially. So, a bookie can use hard disk drives (HDDs) with a RAID controller and a battery-backed write cache to achieve writes at lower latency than latency expectations from a single HDD.

Bookie also writes entry log files sequentially to the data device. Bookies do random reads when multiple Pulsar topics are trying to read backlogged messages. So, in total, there will be an increased I/O load when multiple topics read backlog messages from bookies. Having journal and entry log files on separate devices ensures that this read I/O is isolated from writes. Thus Pulsar can always achieve higher effective throughput and low latency writes with HDDs. 

There are other messaging systems that use a single file to write and read data for a given stream. Such systems have to do a lot of random reads if consumers from multiple streams start reading backlog messages at the same time. In a multi-tenant environment, it’s not feasible for such systems to use HDDs to achieve consistent low-write latency along with backlog consumer reads because in HDD, random reads can directly impact both write and read latencies and eventually writes have to suffer due to random cold reads on the disk.

 

SATA Solid State Drives (SSD)

Solid-state disks (SSD)-based on NAND flash media have transformed the performance characteristics of secondary storage. SSDs are built from multiple individual flash chips wired in parallel to deliver tens of thousands of IOPS and latency in the hundred-microsecond range, as opposed to HDDs with hundreds of IOPS and latencies in milliseconds. Our experience (Figure 3) shows that SSD provides higher throughput and better latency for sequential writes compared to HDDs. We have seen significant bookie throughput improvements by replacing SSDs with HDD for just journal devices.

 

Non-Volatile Memory Express (NVMe) SSD

Non-Volatile Memory Express (NVMe) is another of the current technology industry storage choices. The reason is that NVMe creates parallel, low-latency data paths to underlying media to provide substantially higher performance and lower latency. NVMe can support multiple I/O queues, up to 64K with each queue having 64K entries. So, NVMe’s extreme performance and peak bandwidth will make it the protocol of choice for today’s latency-sensitive applications. However, in order to fully utilize the capabilities of NVMe, an application has to perform parallel I/O by spreading I/O loads to parallel processes. 

With BOOKKEEPER-963 [2], the bookie can be configured with multiple journals. Each individual thread sequentially writes to its dedicated journal. So, bookies can write into multiple journals in parallel and achieve parallel I/O based on NVMe capabilities. Pulsar performs 2x-3x better with NVMe compared to SATA/SAS drives when the bookie is configured to write to multiple journals.

 

Persistent Memory

There is a large performance gap between DRAM memory technology and the highest-performing block storage devices currently available in the form of solid-state drives. This gap can be reduced by a novel memory module solution called Intel Optane DC Persistent Memory  (DCPMM) [1]. The DCPMM is a byte-addressable cache coherent memory module device that exists on the DDR4 memory bus and permits Load/Store accesses without page caching. 

DCPMM is a comparatively expensive technology on unit storage cost to use for the entirety of durable storage. However, BookKeeper provides a near-perfect option to use this technology in a very cost-effective manner. Since the journal is short-lived and does not demand much storage, a small-sized DCPMM can be leveraged as the journal device. Since journal entries are going to be ultimately flushed to ledgers, the size of the journal device and hence the amount of persistent memory needed is in the tens of GB.

Adding a small capacity DCPMM on bookie increases the total cost of bookie 5 - 10%, but it gives significantly better performance by giving more than 5x throughput while maintaining low write latency.

 

Endurance Considerations of Persistent Memory vs SSD

Due to the guarantees needed on the data persistence, journals need to be synced often. On a high-performance Pulsar cluster, with SSDs as the journal device to achieve lower latencies, this eats into the endurance budget, thus shortening the useful lifespan of NAND flash-based media. So for high performance and low latency Pulsar deployment, storage media needs to be picked carefully.

This issue can, however, be easily addressed by taking advantage of persistent memory. Persistent memory has significantly higher endurance, and the write throughput required for a journal should be handled by this device. A small amount of persistent memory is cheaper than an SSD with equivalent endurance. So from the endurance perspective, Pulsar can take advantage of persistent memory technology at a lower cost. 

Latency vs throughput with different journal device in bookie

[Figure 3: Latency vs Throughput with Different Journal Device in Bookie]

Figure 3 shows the latency vs performance graph when we use different types of storage devices to store journal files. It illustrates that the Journal with NVMe device gives 350MB throughput and the PMEM device gives 900MB throughput by maintaining consistently low latency p99 5ms.

As we discussed earlier, this benchmark testing is performed under a real production situation and the test was performed under backlog conditions. Our primary focus for this test is (a) system throughput and (b) system latency. Most of the applications in our production environment have SLA of p99 5ms publish latency. Therefore, our benchmark setup tests throughput and latency of Apache Pulsar with various storage devices (HDD, SSD, NVMe, and Persistent memory) and with a mixed workload of writes, tail reads, and random cold reads across multiple topics. In the next section, let’s discuss the benchmark test setup and performance results in detail.

 

Benchmarking Pulsar Performance for Production Use Cases

 

Workload 

We measured the performance of Pulsar for a typical mixed workload scenario. In terms of throughput, higher numbers are achievable (up to the network limit), but those numbers don't help in decision-making for building production systems. There is no one-size-fits-all recommended configuration available for any system. The configuration depends on various factors such as hardware resources of brokers (memory, CPU, network bandwidth, etc.) and bookies (storage disk types, network bandwidth, memory, CPU, etc.), replication configurations (ensembleSize, writeQuorum, ackQuorum), traffic pattern, etc. 

The benchmark test configuration is set up to fully utilize system capabilities. Pulsar benchmark test includes various configurations such as a number of topics, message size, number of producers, and consumer processes. More importantly, we make an effort to ensure that cold-reads occur, which forces the system to read messages from the disk. This is typical for systems that do a replay, have downstream outages, and have multiple use cases with different consumption patterns.

In Verizon Media (Yahoo), most of our use cases are latency-sensitive and they have a publish latency SLA of p99 5ms. Hence these results are indicative of the throughput limits with that p99 limit, and not the absolute throughput that can be achieved with the setup. We evaluated the performance of Pulsar using different types of storage devices (HDD, SSD, NVMe, and PMEM) for BookKeeper Journal devices. However, NVMe and PMEM are more relevant to current storage technology trends. Therefore, our benchmark setup and results will be more focused on NVMe and PMEM to use them for BookKeeper journal devices. 

 

Quorum Count, Write Availability, and Device Tail Latencies

Pulsar has various settings to ensure durability vs availability tradeoffs.

Unlike other messaging systems, Pulsar does not halt writes to do recovery in a w=2/a=2 setup.  It does not require  a w=3/a=2 setup to ensure write availability during upgrades or single node failure.  Writing to 2 nodes (writeQuorum=2) and waiting for 2 acknowledgements (ackQuorum=2), provides write availability in Pulsar under those scenarios.  In this setup (w=2/a=2),  when a single node fails, writes can proceed without interruption instantaneously, while recovery executes in the background to restore the replication factor.  

Other messaging systems halt writes, while doing recovery under these scenarios.

While failure may be rare, the much more common scenario of a rolling upgrade is seamlessly possible with a Pulsar configuration of (w=2/a=2). 

We consider this a marked benefit out of the box, as we are able to get by with a data replication factor of 2 instead of 3 to handle these occasions, with storage provisioned for 2 copies.

 

Test Setup 

We use 3 Brokers, 3 Bookies, and 3 application clients.

 

Application Configuration: 

3 Namespaces, 150 Topics

Producer payload 100KB

Consumers: 100 Topics with consumers doing hot reads, 50 topics with consumers doing cold reads (disk access)

 

Broker Configuration: 

96GB RAM, 25Gb NIC 

Pulsar settings: bookkeeperNumberOfChannelsPerBookie=200 [4]

JVM settings: -XXMaxDirectMemorySize=60g -Xmx30g 


 

Bookie Configuration: 1

(Journal Device: NVMe(Device-1), Ledger/Data Device: NVMe(Device-2))

64GB RAM, 25Gb NIC

BookkeeperNumberofChannelsperbookie=200

Journal disk: Micron NVMe SSD 9300

Journal directories: 2 (Bookie configuration: journalDirectories)

Data disk: Micron NVMe SSD 9300

Ledger directories: 2 (Bookie configuration: ledgerDirectories)

JVM settings: -XXMaxDirectMemorySize=30g -Xmx30g 

 

Bookie Configuration: 2

(Journal Device: PMEM, Ledger/Data Device: NVMe)

64GB RAM, 25Gb NIC

BookkeeperNumberofChannelsperbookie=200

PMEM journal device: 2 DIIMs, each with 120GB, mounted as 2 devices

Journal directories: 4  (2 on each device) (Bookie configuration: journalDirectories)

Data disk: Micron NVMe SSD 9300

Ledger directories: 2  (Bookie configuration: ledgerDirectories)

JVM settings: -XXMaxDirectMemorySize=30g -Xmx30g 


 

Client Setup 

The Pulsar performance tool[3]: was used to run the benchmark test. 

 

Results

The performance test was performed on two separate Bookie configurations: Bookie configuration-1 uses two separate NVMe each for Journal and Data device and Bookie configuration-2 uses PMEM as Journal and NVMe as a Device device.

Pulsar Performance Evaluation

[Table 4: Pulsar Performance Evaluation]

As noted before, read/write latency variations occur when an NVMe SSD controller is busy with media management tasks such as Garbage Collection, Wear Leveling, etc. The p99 NVMe disk latency goes high with certain workloads, and that impacts the Pulsar p99 latency, under a replication configuration: e=2, w=2, a=2.   (The p95 NVMe disk latency is not affected, and so Pulsar 95 latencies are still under 5ms ) 

The impact of the NVME wear leveling and garbage collection can be mitigated by a replication configuration of e=3, w=3, and a=2, which helps flatten out the pulsar p99 latency graph across 3 bookies and achieves higher throughput while maintaining low 5ms p99 latency. We don’t see such improvements in the PMEM journal device set up with such a replication configuration.

The results demonstrate that Bookie with NVMe or PMEM storage devices gives fairly high throughput at around 900MB by maintaining low 5ms p99 latency. While performing benchmark tests on NVMe journal device setup with replication configuration e=3,w=3,ack=2, we have captured io-stats of each bookie. Figure 5 shows that Bookie with a PMEM device provides 900MB write throughput with consistent low latency ( < 5ms).

Latency Vs Time (PMEM Journal device with 900MB throughput

[Figure 5: Latency Vs Time (PMEM Journal Device with 900MB Throughput)]

 

Figure 6: Pulsar Bookie IO stats

[Figure 6: Pulsar Bookie IO Stats]

IO stats (Figure 6) shows that the journal device serves around 900MB writes and no reads. Data device also serves 900MB avg writes while serving 350MB reads from each bookie.

 

Performance & User Impact 

The potential user impact of software-defined storage is best understood in the context of the performance, scale, and latency that characterize most distributed systems today. You can determine if a software solution is using storage resources optimally in several different ways, and two important metrics are throughput and latency. We have been using Bookies with PMEM journal devices in production for some time by replacing HDD-RAID devices. Figure 7 shows the write throughput vs latency bucket graph for Bookies with HDD-RAID journal device and Figure 8 shows for PMEM journal device. Bookies with HDD-RAID configuration have high write latency with the spike in traffic and it shows that requests having > 50ms write-latency increase with the higher traffic. On the other hand, Bookies with PMEM journal device provides stable and consistent low latency with higher traffic and serves user requests within SLA. These graphs explain the user impact of PMEM which allows Bookies to serve latency-sensitive applications and meet their SLA with the spike in traffic as well.
 

Figure 7. Bookie publish latency buckets with HDD-RAID Bookie journal device

[Figure 7. Bookie Publish Latency Buckets with HDD-RAID Bookie Journal Device]

 

Figure 8. Bookie publish latency buckets with PMEM Bookie journal device

[Figure 8. Bookie Publish Latency Buckets with PMEM Bookie Journal Device]

 

Final Thoughts

Pulsar architecture can accommodate different types of hardware which allows users to balance performance and cost based on required throughput and latency. Pulsar has the capability to adapt to the next generation of storage devices to achieve better performance. We have also seen that persistent memory excels in the race to achieving higher write throughput by maintaining low latency. 

 

Appendix

[1] DC Persistent Memory Module.

https://www.intel.com/content/www/us/en/architectureand-technology/optane-dc-persistent-memory.html 

[2] Multiple Journal Support: https://issues.apache.org/jira/browse/BOOKKEEPER963.

[3] Pulsar Performance Tool: http://pulsar.apache.org/docs/en/performance-pulsar-perf/.

[4] Per Bookie Configurable Number of Channels: https://github.com/apache/pulsar/pull/7910.