How REA Group approaches Amazon MSK cluster capacity planning

This post was written by Eunice Aguilar and Francisco Rodera from REA Group.

Enterprises that need to share and access large amounts of data across multiple domains and services need to build a cloud infrastructure that scales as needs change. REA Group, a digital business that specializes in real estate property, solved this problem using Amazon Managed Streaming for Apache Kafka (Amazon MSK) and a data streaming platform called Hydro.

REA Group's team of more than 3,000 people is guided by our purpose: to change the way the world experiences property. We help people with all aspects of their property experience (not just buying, selling, and renting) through the richest content, data and insights, valuation estimates, and home financing solutions. We deliver unparalleled value to our customers, Australia's real estate agents, by providing access to the largest and most engaged audience of property seekers.

To achieve this, the different technical products within the company regularly need to move data across domains and services efficiently and reliably.

Within the Data Platform team, we have built a data streaming platform called Hydro to provide this capability across the whole organization. Hydro is powered by Amazon MSK and other tools with which teams can move, transform, and publish data at low latency using event-driven architectures. This type of architecture is foundational at REA for building microservices and timely data processing for real-time and batch use cases like time-sensitive outbound messaging, personalization, and machine learning (ML).

In this post, we share our approach to MSK cluster capacity planning.

The problem

Hydro manages a large-scale Amazon MSK infrastructure by providing configuration abstractions, allowing users to focus on delivering value to REA without the cognitive overhead of infrastructure management. As the use of Hydro grows within REA, it's crucial to perform capacity planning to meet user demands while maintaining optimal performance and cost-efficiency.

Hydro uses provisioned MSK clusters in development and production environments. In each environment, Hydro manages a single MSK cluster that hosts multiple tenants with differing workload requirements. Proper capacity planning makes sure the clusters can handle high traffic and provide all users with the desired level of service.

Real-time streaming is a relatively new technology at REA. Many users aren't yet familiar with Apache Kafka, and accurately assessing their workload requirements can be challenging. As the custodians of the Hydro platform, it's our responsibility to find a way to perform capacity planning to proactively assess the impact of the user workloads on our clusters.

Goals

Capacity planning involves determining the appropriate size and configuration of the cluster based on current and projected workloads, as well as considering factors such as data replication, network bandwidth, and storage capacity.

Without proper capacity planning, Hydro clusters can become overwhelmed by high traffic and fail to provide users with the desired level of service. Therefore, it's very important to us to invest time and resources into capacity planning to make sure Hydro clusters can deliver the performance and availability that modern applications require.

The capacity planning approach we follow for Hydro covers three main areas:

The models used for the calculation of current and estimated future capacity needs, including the attributes used as variables in them
The models used to assess the approximate expected capacity required for a new Hydro workload joining the platform
The tooling available to operators and custodians to assess the historical and current capacity consumption of the platform and, based on them, the available headroom

The following diagram shows the interaction of capacity usage and the precalculated maximum utilization.

Although we don't have this capability yet, the goal is to take this approach one step further in the future and predict the approximate resource depletion time, as shown in the following diagram.
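To make the two ideas above concrete (headroom against a precalculated maximum, and a projected depletion time), here is a minimal sketch. It assumes usage grows roughly linearly and is sampled once per day. The function names and the linear model are illustrative assumptions, not something Hydro implements today.

```python
from typing import Optional, Sequence


def headroom_fraction(current: float, limit: float) -> float:
    """Fraction of the capacity limit still unused (0.0 when at or over the limit)."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    return max(0.0, 1.0 - current / limit)


def days_until_depletion(daily_usage: Sequence[float], limit: float) -> Optional[float]:
    """Fit a least-squares line through daily samples and extrapolate to the limit.

    Returns None when usage is flat or falling, because no depletion is projected.
    Assumption: one sample per day and roughly linear growth.
    """
    n = len(daily_usage)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(daily_usage) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(daily_usage))
    var = sum((x - mean_x) ** 2 for x in range(n))
    slope = cov / var
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    # Day index where the fitted line crosses the limit, relative to the last sample.
    crossing = (limit - intercept) / slope
    return max(0.0, crossing - (n - 1))
```

For example, with hypothetical samples of 100, 110, 120, and 130 units against a limit of 200, the fitted line grows by 10 units per day and reaches the limit 7 days after the last sample. A real predictor would need to account for seasonality and step changes when workloads are onboarded.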

To make sure our digital operations are resilient and efficient, we must maintain comprehensive observability of our current capacity usage. This detailed oversight allows us not only to understand the performance limits of our existing infrastructure, but also to identify potential bottlenecks before they impact our services and users.

By proactively setting and monitoring well-understood thresholds, we can receive timely alerts and take necessary scaling actions. This approach makes sure our infrastructure can meet demand spikes without compromising on performance, ultimately supporting a seamless user experience and maintaining the integrity of our system.

Solution overview

The MSK clusters in Hydro are configured with a PER_TOPIC_PER_BROKER level of monitoring, which provides metrics at the broker and topic levels. These metrics help us determine the attributes of the cluster usage effectively.

However, it wouldn't be wise to display an excessive number of metrics on our monitoring dashboards because that could lead to less clarity and slower insights on the cluster. It's more valuable to choose the most relevant metrics for capacity planning rather than displaying numerous metrics.

Cluster usage attributes

Based on the Amazon MSK best practices guidelines, we have identified several key attributes to assess the health of the MSK cluster. These attributes include the following:

In/out throughput
CPU usage
Disk space usage
Memory usage
Producer and consumer latency
Producer and consumer throttling

For more information on right-sizing your clusters, see Best practices for right-sizing your Apache Kafka clusters to optimize performance and cost, Best practices for Standard brokers, Monitor CPU usage, Monitor disk space, and Monitor Apache Kafka memory.

The following table contains the detailed list of all the attributes we use for MSK cluster capacity planning in Hydro.

Attribute Name	Attribute Type	Units	Comments
Bytes in	Throughput	Bytes per second	Depends on the aggregate Amazon EC2 network, Amazon EBS network, and Amazon EBS storage throughput
Bytes out	Throughput	Bytes per second	Depends on the aggregate Amazon EC2 network, Amazon EBS network, and Amazon EBS storage throughput
Consumer latency	Latency	Milliseconds	High or unacceptable latency values usually indicate user experience degradation before reaching actual resource (for example, CPU and memory) depletion
CPU usage	Capacity limits	% CPU user + CPU system	Should stay under 60%
Disk space usage	Persistent storage	Bytes	Should stay under 85%
Memory usage	Capacity limits	% Memory in use	Should stay under 60%
Producer latency	Latency	Milliseconds	High or unacceptable sustained latency values usually indicate user experience degradation before reaching actual capacity limits or actual resource (for example, CPU or memory) depletion
Throttling	Capacity limits	Milliseconds, bytes, or messages	High or unacceptable sustained throttling values indicate capacity limits are being reached before actual resource (for example, CPU or memory) depletion

By monitoring these attributes, we can quickly evaluate the performance of the clusters as we add more workloads to the platform. We then match these attributes to the relevant MSK metrics available.
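As an illustration of matching attributes to metrics, the following sketch maps each attribute to candidate Amazon MSK CloudWatch metric names in the AWS/Kafka namespace. The metric names are taken from the Amazon MSK monitoring documentation as we understand it, but the specific selection is an assumption, not Hydro's confirmed list. Verify the names and their monitoring levels against the current documentation before relying on them.

```python
from typing import Dict, List

# Attribute -> candidate Amazon MSK CloudWatch metric names (namespace AWS/Kafka).
# The selection is an illustrative assumption, not Hydro's confirmed list.
ATTRIBUTE_TO_METRICS: Dict[str, List[str]] = {
    "bytes_in": ["BytesInPerSec"],
    "bytes_out": ["BytesOutPerSec"],
    "cpu_usage": ["CpuUser", "CpuSystem"],
    "disk_space_usage": ["KafkaDataLogsDiskUsed"],
    "memory_usage": ["MemoryUsed", "MemoryFree"],
    "producer_latency": ["ProduceTotalTimeMsMean"],
    "consumer_latency": ["FetchConsumerTotalTimeMsMean"],
    "throttling": ["ProduceThrottleTime", "FetchThrottleTime"],
}


def cpu_usage_percent(cpu_user: float, cpu_system: float) -> float:
    """CPU usage as defined in the table: % CPU user + % CPU system."""
    return cpu_user + cpu_system


def within_limit(value: float, limit: float) -> bool:
    """True while the observed value stays under its capacity limit."""
    return value < limit
```

With hypothetical readings of 35% CPU user and 10% CPU system, `cpu_usage_percent` gives 45%, which is under the 60% guideline from the table.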

Cluster capacity limits

During the initial capacity planning, our MSK clusters weren't receiving enough traffic to give us a clear idea of their capacity limits. To address this, we used the AWS performance testing framework for Apache Kafka to evaluate the theoretical performance limits. We conducted performance and capacity tests on test MSK clusters that had the same cluster configurations as our development and production clusters. Running these various test scenarios gave us a more comprehensive understanding of the cluster's performance. The following figure shows an example of a test cluster's performance metrics.

To perform the tests within a specific timeframe and budget, we focused on the test scenarios that could efficiently measure the cluster's capacity. For instance, we conducted tests that involved sending high-throughput traffic to the cluster and creating topics with many partitions.

After every test, we collected the metrics of the test cluster and extracted the maximum values of the key cluster usage attributes. We then consolidated the results and determined the most appropriate limits of each attribute. The following screenshot shows an example of the exported test cluster's performance metrics.
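The extract-and-consolidate step can be sketched as follows. The post does not state which rule was used to pick the final limit from the per-test peaks, so the rule is a parameter here. The default of `max` (the highest value any test reached) is an assumption, and the attribute names in the usage example are hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Mapping, Sequence


def peak_per_attribute(samples: Mapping[str, Sequence[float]]) -> Dict[str, float]:
    """Maximum observed value of each attribute within a single test run."""
    return {name: max(values) for name, values in samples.items() if values}


def consolidate_limits(
    runs: Iterable[Mapping[str, Sequence[float]]],
    rule: Callable[[List[float]], float] = max,
) -> Dict[str, float]:
    """Combine per-run peaks into one limit per attribute.

    `rule` decides how peaks from different runs are merged (assumption: max).
    """
    peaks: Dict[str, List[float]] = defaultdict(list)
    for run in runs:
        for name, peak in peak_per_attribute(run).items():
            peaks[name].append(peak)
    return {name: rule(values) for name, values in peaks.items()}


# Hypothetical exported metrics from two test runs.
runs = [
    {"bytes_in": [10.0, 50.0, 30.0]},
    {"bytes_in": [20.0, 40.0], "produce_latency_ms": [5.0, 9.0]},
]
limits = consolidate_limits(runs)
```

Passing `rule=min` would give a more conservative limit where several runs exercise the same attribute. Which choice is appropriate depends on what each test scenario was designed to stress.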

Capacity monitoring dashboards

As part of our platform management process, we conduct monthly operational reviews to maintain optimal performance. This involves analyzing an automated operational report that covers all the systems on the platform. During the review, we evaluate the service level objectives (SLOs) based on select service level indicators (SLIs) and assess the monitoring alerts triggered in the previous month. By doing so, we can identify any issues and take corrective actions.

To assist us in conducting the operational reviews and to provide us with an overview of the cluster's usage, we developed a capacity monitoring dashboard, as shown in the following screenshot, for each environment. We built the dashboard as infrastructure as code (IaC) using the AWS Cloud Development Kit (AWS CDK). The dashboard is generated and managed automatically as part of the platform infrastructure, along with the MSK cluster.

By defining the maximum capacity limits of the MSK cluster in a configuration file, the limits are automatically loaded into the capacity dashboard as annotations in the Amazon CloudWatch graph widgets. The capacity limit annotations are clearly visible and provide us with a view of the cluster's capacity headroom based on usage.
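The idea of loading limits from a configuration file into graph annotations can be sketched without the AWS CDK by building the CloudWatch dashboard body JSON directly. Hydro builds its dashboards with the AWS CDK, so this is a simplified stand-in rather than the authors' code. The configuration layout, cluster name, region, broker ID, and limit values are hypothetical placeholders.

```python
import json
from typing import Any, Dict

# Hypothetical configuration file content; the layout and values are assumptions.
CONFIG_JSON = """
{
  "cluster_name": "example-cluster",
  "region": "ap-southeast-2",
  "limits": {"BytesInPerSec": 50000000, "KafkaDataLogsDiskUsed": 85}
}
"""


def build_widget(cluster: str, region: str, metric: str, limit: float,
                 broker_id: str = "1") -> Dict[str, Any]:
    """One metric graph widget with the capacity limit as a horizontal annotation."""
    return {
        "type": "metric",
        "width": 12,
        "height": 6,
        "properties": {
            "title": f"{metric} vs capacity limit",
            "region": region,
            "view": "timeSeries",
            "stat": "Maximum",
            "period": 300,
            "metrics": [
                ["AWS/Kafka", metric, "Cluster Name", cluster, "Broker ID", broker_id]
            ],
            "annotations": {
                "horizontal": [{"label": "Capacity limit", "value": limit}]
            },
        },
    }


def build_dashboard_body(config: Dict[str, Any]) -> str:
    """Serialize one widget per configured limit into a dashboard body string."""
    widgets = [
        build_widget(config["cluster_name"], config["region"], metric, limit)
        for metric, limit in config["limits"].items()
    ]
    return json.dumps({"widgets": widgets})


body = build_dashboard_body(json.loads(CONFIG_JSON))
```

The resulting string has the shape that the CloudWatch dashboard API accepts as a dashboard body. Changing a limit in the configuration changes only the annotation value, which is what makes the headroom visible at a glance on each graph.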

We determined the capacity limits for throughput, latency, and throttling through the performance testing. Capacity limits of the other metrics, such as CPU, disk space, and memory, are based on the Amazon MSK best practices guidelines.

During the operational reviews, we proactively assess the capacity monitoring dashboards to determine if more capacity needs to be added to the cluster. This approach allows us to identify and address potential performance issues before they have a significant impact on user workloads. It's a preventative measure rather than a reactive response to a performance degradation.

Preemptive CloudWatch alarms

We have implemented preemptive CloudWatch alarms in addition to the capacity monitoring dashboards. These alarms are configured to alert us before a specific capacity metric reaches its limit, notifying us when the sustained value reaches 80% of the capacity limit. This method of monitoring enables us to take immediate action instead of waiting for our monthly review cadence.
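A minimal sketch of the 80% rule follows. It derives the alarm threshold from a capacity limit and assembles a parameter dictionary using the field names of the CloudWatch PutMetricAlarm API. The post does not define "sustained", so treating it as three consecutive 5-minute periods is an assumption, as are the alarm name format and the sample values. The sketch only builds the parameters and makes no AWS calls.

```python
from typing import Any, Dict, Sequence

PREEMPTIVE_RATIO = 0.8  # alert at 80% of the capacity limit, as described above


def preemptive_threshold(limit: float, ratio: float = PREEMPTIVE_RATIO) -> float:
    """Alarm threshold set below the capacity limit."""
    return limit * ratio


def is_sustained_breach(datapoints: Sequence[float], threshold: float,
                        periods: int = 3) -> bool:
    """True when the last `periods` datapoints are all at or above the threshold.

    Assumption: "sustained" means consecutive evaluation periods.
    """
    if len(datapoints) < periods:
        return False
    return all(value >= threshold for value in datapoints[-periods:])


def alarm_params(cluster: str, broker_id: str, metric: str, limit: float,
                 periods: int = 3) -> Dict[str, Any]:
    """Parameters in the shape expected by CloudWatch PutMetricAlarm."""
    return {
        "AlarmName": f"{cluster}-{metric}-preemptive",  # hypothetical naming scheme
        "Namespace": "AWS/Kafka",
        "MetricName": metric,
        "Dimensions": [
            {"Name": "Cluster Name", "Value": cluster},
            {"Name": "Broker ID", "Value": broker_id},
        ],
        "Statistic": "Maximum",
        "Period": 300,
        "EvaluationPeriods": periods,
        "DatapointsToAlarm": periods,
        "Threshold": preemptive_threshold(limit),
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
    }
```

With a hypothetical limit of 100, the threshold is 80. Datapoints of 70, 81, 85, and 90 count as a sustained breach, while 85, 90, and 79 do not, because the most recent value dropped back under the threshold.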

Value added by our capacity planning approach

As operators of the Hydro platform, our approach to capacity planning has provided a consistent way to assess how far we are from the theoretical capacity limits of all our clusters, regardless of their configuration. Our capacity monitoring dashboards are a key observability instrument that we review regularly; they're also useful while troubleshooting performance issues. They help us quickly tell if capacity constraints could be a potential root cause of any ongoing issues. This means that we can use our current capacity planning approach and tooling both proactively and reactively, depending on the situation and need.

Another benefit of this approach is that we calculate the theoretical maximum usage values that a given cluster with a specific configuration can withstand from a separate cluster, without impacting any actual users of the platform. We spin up short-lived MSK clusters through our AWS CDK based automation and perform capacity tests on them. We do this quite often to assess the impact, if any, that changes made to the cluster's configurations have on the known capacity limits. In our current feedback loop, if these newly calculated limits differ from the previously known ones, they are used to automatically update our capacity dashboards and alarms in CloudWatch.

Future evolution

Hydro is a platform that is constantly improving with the introduction of new features. One of these features is the ability to conveniently create Kafka client applications. To meet the increasing demand, it's essential to stay ahead of capacity planning. Although the approach discussed here has served us well so far, it's by no means the final stage, and there are capabilities that we need to extend and areas we need to improve on.

Multi-cluster architecture

To support critical workloads, we're considering a multi-cluster architecture using Amazon MSK, which would also affect our capacity planning. In the future, we plan to profile workloads based on metadata, cross-check them with capacity metrics, and place them in the appropriate MSK cluster. In addition to the existing provisioned MSK clusters, we will evaluate how the Amazon MSK Serverless cluster type can complement our platform architecture.

Usage trends

We have added CloudWatch anomaly detection graphs to our capacity monitoring dashboards to track any unusual trends. However, because the CloudWatch anomaly detection algorithm only evaluates up to 2 weeks of metric data, we will reassess its usefulness as we onboard more workloads. Aside from identifying usage trends, we will explore options to implement an algorithm with predictive capabilities to detect when MSK cluster resources degrade and deplete.

Conclusion

Initial capacity planning lays a solid foundation for future improvements and provides a safe onboarding process for workloads. To achieve optimal performance of our platform, we must make sure that our capacity planning strategy evolves in line with the platform's growth. As a result, we maintain a close collaboration with AWS to continually develop additional features that meet our business needs and are in sync with the Amazon MSK roadmap. This makes sure we stay ahead of the curve and can deliver the best possible experience to our users.

We recommend that all Amazon MSK users start planning their capacity so they don't miss out on maximizing their cluster's potential. Implementing the strategies listed in this post is a great first step and will lead to smoother operations and significant savings in the long run.

About the Authors

Eunice Aguilar is a Staff Data Engineer at REA. She has worked in software engineering in various industries throughout the years and recently for property data. She's also an advocate for women interested in transitioning into tech, along with the well-versed who she takes inspiration from.

Francisco Rodera is a Staff Systems Engineer at REA. He has extensive experience building and operating large-scale distributed systems. His interests are automation, observability, and applying SRE practices to business-critical services and platforms.

Khizer Naeem is a Technical Account Manager at AWS. He specializes in Efficient Compute and has a deep passion for Linux and open-source technologies, which he leverages to help enterprise customers modernize and optimize their cloud workloads.
