Uber Drives Apache Kafka's Tiered Storage Feature; Sparks Efficiency Debate (2024)

Transportation company Uber has detailed its work on adding a new tiered storage feature to Apache Kafka, the popular distributed event streaming platform. The feature, added in Kafka 3.6.0 and currently in early access, aims to address the scalability and efficiency challenges faced by organizations running large Kafka clusters.

Tiered storage allows Kafka to extend its storage capabilities beyond local broker disks to remote storage systems like HDFS, Amazon S3, Google Cloud Storage, and Azure Blob Storage. This enhancement enables Kafka clusters to scale storage independently from compute resources, potentially reducing costs and operational complexity.

According to Uber's blog post, the project's motivation was to overcome limitations in how Kafka clusters are typically scaled.

"Kafka cluster storage is typically scaled by adding more broker nodes. But this also adds needless memory and CPUs to the cluster, making overall storage cost less efficient compared to storing the older data in external storage"

They added that larger clusters with more nodes increase deployment complexity and operational costs due to the tight coupling of storage and processing.

The tiered storage architecture introduces two storage tiers: local and remote. The local tier consists of the broker's local storage, while the remote tier is the extended storage such as HDFS or cloud object stores. Both tiers can have separate retention policies based on specific use cases.
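As a rough illustration of separate retention policies per tier, the sketch below builds the topic-level settings for a topic that keeps one hour of data on the brokers and one year in total (the same values AWS used in its test). The property names `remote.storage.enable`, `local.retention.ms`, and `retention.ms` are taken from the Kafka tiered storage documentation as best understood here; the helper `tiered_topic_config` is hypothetical, and the broker-side setup (enabling the remote log storage system and configuring a storage plugin) is assumed to be in place and is not shown.

```python
HOUR_MS = 60 * 60 * 1000
DAY_MS = 24 * HOUR_MS


def tiered_topic_config(local_retention_ms: int, total_retention_ms: int) -> dict[str, str]:
    """Build topic-level settings for a tiered topic (hypothetical helper).

    local_retention_ms: how long data stays on the broker's local disk.
    total_retention_ms: how long data is kept overall, including the remote tier.
    """
    # The local tier holds a subset of the topic's data, so its retention
    # should not be longer than the overall retention.
    if local_retention_ms > total_retention_ms:
        raise ValueError("local retention cannot exceed total retention")
    return {
        "remote.storage.enable": "true",
        "local.retention.ms": str(local_retention_ms),
        "retention.ms": str(total_retention_ms),
    }


config = tiered_topic_config(local_retention_ms=HOUR_MS, total_retention_ms=365 * DAY_MS)
for key, value in sorted(config.items()):
    print(f"{key}={value}")
```

The resulting key/value pairs could be passed as topic configuration when creating or altering a topic. Exact behavior should be checked against the documentation for the Kafka version in use.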

AWS conducted a real-world test to demonstrate these benefits using a three-node cluster with m7g instance types. They created a topic with a replication factor of three and ingested 300 GB of data. When adding three new brokers and moving all partitions from the existing brokers to the new ones, the operation took approximately 75 minutes without tiered storage and caused high CPU usage. After enabling tiered storage on the same topic, with a local retention period of 1 hour and a remote retention period of 1 year, they repeated the test. This time, the partition movement operation was completed in just under 15 minutes, with no noticeable CPU usage. AWS attributes this improvement to the fact that only the small active segment needs to be moved with tiered storage enabled, as all closed segments have already been transferred to tiered storage.
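To put the reported figures in perspective, the short calculation below works out the implied speedup. It uses only the two durations quoted above; because the tiered run is described as "just under 15 minutes", 15 is treated as an upper bound, so the results are conservative estimates rather than measured values.

```python
# Durations reported for the partition reassignment test, in minutes.
baseline_minutes = 75  # without tiered storage (approximate)
tiered_minutes = 15    # with tiered storage ("just under 15", so an upper bound)

speedup = baseline_minutes / tiered_minutes
minutes_saved = baseline_minutes - tiered_minutes
reduction_pct = 100 * minutes_saved / baseline_minutes

print(f"speedup: at least {speedup:.1f}x")
print(f"time saved: at least {minutes_saved} minutes")
print(f"reduction: at least {reduction_pct:.0f}%")
```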

However, not everyone in the industry shares the enthusiasm for tiered storage. Richard Artoul from WarpStream offers a more cautious perspective, arguing that while tiered storage can help reduce costs, it may introduce new complexities and potential failure modes. Artoul suggests that the added complexity of managing two storage tiers could increase operational overhead and impact system reliability.

Artoul raises concerns about the performance implications of fetching data from remote storage, which could introduce latency and affect real-time processing capabilities. He points out that the cost savings of tiered storage might be offset by the expenses associated with managing and accessing data in remote storage systems, particularly inter-zone networking fees in cloud environments. Furthermore, Artoul argues that tiered storage fails to address the two primary problems that users have with Kafka today: complexity and operational burden, and costs (specifically, inter-zone networking fees). He suggests that tiered storage may exacerbate these issues rather than solve them.

While tiered storage offers potential advantages, it's important to note some current limitations. As per Red Hat's analysis, the feature does not yet support multiple log directories (JBOD) or compacted topics. Additionally, turning off tiering on a topic requires transferring data to another topic or external storage before deleting the original topic.

Both Uber and Red Hat emphasize the importance of monitoring when using tiered storage. New metrics have been introduced to track remote storage operations, allowing users to monitor and create alerts for potential issues such as slow upload/download or high error rates.
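A minimal sketch of the kind of alert rule this monitoring makes possible is shown below. It assumes that request and error rates for remote copy operations are already being scraped into a monitoring system (Kafka's documentation lists per-second remote copy and fetch request and error rates, but the exact metric names should be verified for the version in use). The `RemoteCopySample` type, the 5% threshold, and the three-sample window are all hypothetical choices for illustration.

```python
from dataclasses import dataclass


@dataclass
class RemoteCopySample:
    """One scrape of remote-copy rates for a broker (hypothetical shape)."""
    requests_per_sec: float
    errors_per_sec: float


def error_ratio(sample: RemoteCopySample) -> float:
    """Fraction of remote copy requests that failed; 0.0 when there is no traffic."""
    if sample.requests_per_sec <= 0:
        return 0.0
    return sample.errors_per_sec / sample.requests_per_sec


def should_alert(samples: list[RemoteCopySample],
                 max_error_ratio: float = 0.05,
                 min_consecutive: int = 3) -> bool:
    """Return True if the error ratio stays above the threshold for
    min_consecutive samples in a row, which filters out one-off blips."""
    streak = 0
    for sample in samples:
        if error_ratio(sample) > max_error_ratio:
            streak += 1
            if streak >= min_consecutive:
                return True
        else:
            streak = 0
    return False


healthy = [RemoteCopySample(20.0, 0.0), RemoteCopySample(18.0, 0.5), RemoteCopySample(22.0, 0.0)]
failing = [RemoteCopySample(20.0, 4.0), RemoteCopySample(19.0, 5.0), RemoteCopySample(21.0, 6.0)]

print(should_alert(healthy))
print(should_alert(failing))
```

Requiring several consecutive bad samples is one way to avoid paging on transient object-store errors; a production rule would more likely live in the monitoring system itself than in application code.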

Uber has been running this feature in production for 1-2 years on different workloads, but it's still considered early access in the open-source Apache Kafka 3.6.0 release. Organizations considering adoption should carefully evaluate its current capabilities and limitations.

Tiered storage potentially enables more efficient and cost-effective management of large-scale data streams. As AWS's implementation in Amazon MSK demonstrates, it can dramatically improve cluster resilience and scalability in some scenarios. However, Artoul's critique highlights that the feature may not be a silver bullet for every Kafka user. As with any new feature, particularly one in early access, users are advised to thoroughly test and monitor its performance in their specific environments before deploying to production, weighing the potential benefits against the added complexity and operational challenges.

About the Author

Matt Saunders
