This is a guest post written by Oleh Khoruzhenko, Senior Staff DevOps Engineer at Bazaarvoice, in partnership with AWS.
Bazaarvoice is an Austin-based company powering a world-leading ratings and reviews platform. Our system processes billions of shopper interactions through ratings, reviews, photos, and videos, helping brands and retailers build shopper confidence and drive sales by using authentic user-generated content (UGC) across the customer journey. The Bazaarvoice Trust Mark is the gold standard in authenticity.
Apache Kafka is one of the core components of our infrastructure, enabling real-time data streaming for the global review platform. Although Kafka's distributed architecture met our needs for high-throughput, fault-tolerant streaming, self-managing this complex system diverted significant engineering resources away from our core product development. Every component of our Kafka infrastructure required specialized expertise, ranging from configuring low-level parameters to maintaining the complex distributed systems that our customers rely on. The dynamic nature of our environment demanded continuous care and investment in automation. We found ourselves constantly managing upgrades, applying security patches, implementing fixes, and addressing scaling needs as our data volumes grew.
In this post, we show you the steps we took to migrate our workloads from self-hosted Kafka to Amazon Managed Streaming for Apache Kafka (Amazon MSK). We walk you through our migration process and highlight the improvements we achieved after this transition. We show how we minimized operational overhead, enhanced our security and compliance posture, automated key processes, and built a more resilient platform while maintaining the high performance our global customer base expects.
The need for modernization
As our platform grew to process billions of daily shopper interactions, we needed to find a way to scale our Kafka clusters efficiently while keeping a small team to manage the infrastructure. The limitations of self-managed Kafka clusters manifested in several key areas:
- Scaling operations – Although scaling our self-hosted Kafka clusters wasn't inherently complex, it required careful planning and execution. Every time we needed to add new brokers to handle increased workload, our team faced a multi-step process involving capacity planning, infrastructure provisioning, and configuration updates.
- Configuration complexity – Kafka offers hundreds of configuration parameters. Although we didn't actively manage all of them, understanding their impact was essential. Key settings such as I/O threads, memory buffers, and retention policies needed ongoing attention as we scaled. Even minor adjustments could have significant downstream effects, requiring our team to maintain deep expertise in these parameters and their interactions to ensure optimal performance and stability.
- Infrastructure management and capacity planning – Self-hosting Kafka required us to manage multiple scaling dimensions, including compute, memory, network throughput, storage throughput, and storage volume. We needed to carefully plan capacity across all these components, often making complex trade-offs. Beyond capacity planning, we were responsible for real-time management of our Kafka infrastructure, including promptly detecting and addressing component failures and performance issues. Our team needed to be highly responsive to alerts, often taking immediate action to maintain system stability.
- Specialized expertise requirements – Operating Kafka at scale demanded deep technical expertise across multiple domains. The team needed to:
- Monitor and analyze hundreds of performance metrics
- Conduct complex root cause analysis for performance issues
- Manage ZooKeeper ensemble coordination
- Execute rolling updates for zero-downtime upgrades and security patches
These challenges were compounded during peak business periods, such as Black Friday and Cyber Monday, when maintaining optimal performance was critical for Bazaarvoice's retail customers.
Choosing Amazon MSK
After evaluating various options, we selected Amazon MSK as our modernization solution. The decision was driven by the service's ability to minimize operational overhead, provide high availability out of the box with its three-Availability-Zone architecture, and offer seamless integration with our existing AWS infrastructure.
Key capabilities that made Amazon MSK the clear choice:
- AWS integration – We already used AWS services for data processing and analytics. Amazon MSK connected directly with these services, removing the need to build and maintain custom integrations. This meant our existing data pipelines would continue working with minimal changes.
- Automated operations management – Amazon MSK automated our most time-consuming tasks. We no longer need to manually monitor instances and storage for failures or respond to those issues ourselves.
- Enterprise-grade reliability – The platform's architecture matched our reliability requirements out of the box. Multi-AZ distribution and built-in replication gave us the same fault tolerance we had carefully built into our self-hosted system, now backed by AWS's service guarantees.
- Simplified upgrade process – Before Amazon MSK, version upgrades for our Kafka clusters required careful planning and execution. The process was complex, involving multiple steps and risks. Amazon MSK simplified our upgrade operations: we now use automated upgrades for dev and test workloads while retaining control over production environments. This shift reduced the need for extensive planning sessions and multiple engineers. As a result, we stay current with the latest Kafka versions and security patches, improving system reliability and performance.
- Enhanced security controls – Our platform required ISO 27001 compliance, which typically involves months of documentation and security controls implementation. Amazon MSK came with this certification built in, removing the need for separate compliance work. Amazon MSK encrypted our data, managed network access, and integrated with our existing security tools.
With Amazon MSK selected as our target platform, we began planning the complex task of migrating our critical streaming infrastructure without disrupting the billions of shopper interactions flowing through our system.
Bazaarvoice's migration journey
Moving our complex Kafka infrastructure to Amazon MSK required careful planning and precise execution. Our platform processes data through two main components: an Apache Kafka Streams pipeline that handles data processing and augmentation, and consumer applications that move this enriched data to downstream systems. With 40 TB of state across 250 internal topics, this migration demanded a methodical approach.
Planning phase
Working with AWS Solutions Architects proved critical for validating our migration strategy. Our platform's unique characteristics required special consideration:
- Multi-Region deployment across the US and EU
- Complex stateful applications with strict data consistency needs
- Vital business services requiring zero downtime
- A diverse consumer ecosystem with different migration requirements
Migration challenges
The biggest hurdle was migrating our stateful Kafka Streams applications. Our data processing runs as a directed acyclic graph (DAG) of applications across Regions, using static group membership to prevent disruptive rebalancing. It's important to note that Kafka Streams keeps its state in internal Kafka topics. For applications to recover properly, replicating this state accurately is critical. This characteristic of Kafka Streams added complexity to our migration process. Initially, we considered MirrorMaker2, the standard tool for Kafka migrations. However, two fundamental limitations made it challenging:
- Risk of losing state or incorrectly replicating state across our applications.
- Inability to run two instances of our applications concurrently, which meant we needed to shut down the main application and wait for it to recover its state from the MSK cluster. Given the size of our state, this recovery process exceeded our 30-minute SLA for downtime.
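Static group membership of the kind described above is configured per application instance. The following Kafka Streams properties are a minimal sketch; the application ID, endpoints, and timeout values are illustrative placeholders, not our production settings.

```properties
# Kafka Streams configuration sketch (names and endpoints are illustrative)
application.id=ugc-enrichment-pipeline
bootstrap.servers=broker-1:9092,broker-2:9092
# Static membership: a stable per-instance ID lets a restarted instance
# rejoin without triggering a full group rebalance, as long as it returns
# within the session timeout.
group.instance.id=enrichment-worker-1
session.timeout.ms=120000
# Standby replicas keep warm copies of state stores on other instances,
# shortening recovery when a member does permanently leave the group.
num.standby.replicas=1
```

The trade-off is that a truly failed instance is only detected after the longer session timeout, so the timeout has to balance rebalance avoidance against failure-detection latency.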
Our solution
We decided to deploy a parallel stack of Kafka Streams applications reading and writing data from Amazon MSK. This approach gave us sufficient time for testing and verification, and enabled the applications to hydrate their state before we delivered the output to our data warehouse for analytics. We used MirrorMaker2 for input topic replication, and our solution provided several advantages:
- Simplified monitoring of the replication process
- Avoided consistency issues between state stores and internal topics
- Allowed for gradual, controlled migration of consumers
- Enabled thorough validation before cutover
The trade-off was that we needed a coordinated transition plan for all consumers, because we couldn't transfer consumer offsets across clusters.
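A one-directional MirrorMaker2 setup for replicating only the input topics might look like the following sketch. Cluster aliases, endpoints, and the topic pattern are hypothetical placeholders; the internal Kafka Streams topics are deliberately excluded because the parallel stack rebuilds that state itself.

```properties
# connect-mirror-maker.properties sketch (aliases and endpoints are placeholders)
clusters = selfhosted, msk
selfhosted.bootstrap.servers = kafka-old-1:9092,kafka-old-2:9092
msk.bootstrap.servers = b-1.example.kafka.us-east-1.amazonaws.com:9098

# Replicate in one direction only: self-hosted -> MSK
selfhosted->msk.enabled = true
msk->selfhosted.enabled = false

# Mirror only the input topics; internal (changelog/repartition) topics
# are excluded so the parallel application stack hydrates its own state.
selfhosted->msk.topics = input\..*
replication.factor = 3
```

Run with `connect-mirror-maker.sh connect-mirror-maker.properties`; note that MirrorMaker2 prefixes replicated topics with the source cluster alias by default, which consumers on the target cluster need to account for.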
Consumer migration strategy
Each consumer type required a carefully tailored approach:
- Standard consumers – For applications supporting the Kafka Consumer Group protocol, we implemented a four-step migration. This approach risked some duplicate processing, but our applications were designed to handle this scenario. The steps were as follows:
- Configure consumers with `auto.offset.reset: latest`.
- Stop all DAG producers.
- Wait for existing consumers to process remaining messages.
- Cut over consumer applications to Amazon MSK.
- Apache Kafka Connect sinks – Our sink connectors served two critical databases:
- A distributed search and analytics engine – Document versioning relied on Kafka record offsets, making direct migration impossible. To address this, we implemented a solution that involved building new search engine clusters from scratch.
- A document-oriented NoSQL database – This supported direct migration without requiring new database instances, simplifying the process considerably.
- Apache Spark and Flink applications – These presented unique challenges because of their internal checkpointing mechanisms:
- Offsets managed outside Kafka's consumer groups
- Checkpoints incompatible between source and target clusters
- Required full data reprocessing from the beginning
We scheduled these migrations during off-peak hours to minimize impact.
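The standard-consumer cutover settings can be sketched as a small configuration helper. The broker endpoint and group ID below are placeholders, and the dictionary uses common Kafka client property names rather than our exact client code.

```python
# Sketch of the consumer settings for the standard-consumer cutover.
# Endpoints and group IDs are placeholders, not Bazaarvoice's real values.

def cutover_consumer_config(msk_bootstrap: str, group_id: str) -> dict:
    """Build the client configuration for a consumer being cut over to MSK.

    `auto.offset.reset: latest` means a group with no committed offsets on
    the new cluster starts at the log head, which is why producers must be
    stopped and the old cluster fully drained before the cutover.
    """
    return {
        "bootstrap.servers": msk_bootstrap,
        "group.id": group_id,
        # No offsets exist on the target cluster yet, so start from the
        # head rather than reprocessing the mirrored backlog.
        "auto.offset.reset": "latest",
        # Commit explicitly after processing so duplicates stay bounded
        # if the application restarts mid-batch.
        "enable.auto.commit": False,
    }

config = cutover_consumer_config(
    "b-1.example.kafka.us-east-1.amazonaws.com:9098", "review-enricher"
)
print(config["auto.offset.reset"])  # latest
```

Because offsets cannot move across clusters, any message produced to the new cluster before the consumer joins would be skipped under `latest`, which is exactly why the producer stop-and-drain step precedes the cutover.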
Technical benefits and improvements
Moving to Amazon MSK fundamentally changed how we manage our Kafka infrastructure. The transformation is best illustrated by comparing key operational tasks before and after the migration, summarized in the following table.
| Activity | Before: Self-hosted Kafka | After: Amazon MSK |
| --- | --- | --- |
| Security patching | Required dedicated team time for Kafka and OS updates | Fully automated |
| Broker recovery | Needed manual monitoring and intervention | Fully automated |
| Client authentication | Complex password rotation procedures | AWS Identity and Access Management (IAM) |
| Version upgrades | Complex procedure requiring extensive planning | Fully automated |
The details of these tasks are as follows:
- Security patching – Previously, our team spent 8 hours a month applying Kafka and operating system (OS) security patches across our broker fleet. Amazon MSK now handles these updates automatically, maintaining our security posture without engineering intervention.
- Broker recovery – Although our self-hosted Kafka had automatic recovery capabilities, each incident required careful monitoring and occasional manual intervention. With Amazon MSK, node failures and storage degradation issues such as Amazon Elastic Block Store (Amazon EBS) slowdowns are handled entirely by AWS and resolved within minutes, without our involvement.
- Authentication management – Our self-hosted implementation required password rotations for SASL/SCRAM authentication, a process that took two engineers several days to coordinate. The direct integration between Amazon MSK and AWS Identity and Access Management (IAM) minimized this overhead while strengthening our security controls.
- Version upgrades – Kafka version upgrades in our self-hosted environment required weeks of planning and testing, as well as weekend maintenance windows. Amazon MSK manages these upgrades automatically during off-peak hours, maintaining our SLAs without disruption.
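For JVM clients, the IAM integration is typically wired up with the `aws-msk-iam-auth` library; the properties below follow that library's documented client configuration and are shown as a sketch rather than our exact setup (credentials come from the standard AWS credential chain, not from the config).

```properties
# Kafka client properties for MSK IAM authentication
# (JVM clients with the aws-msk-iam-auth library on the classpath)
security.protocol=SASL_SSL
sasl.mechanism=AWS_MSK_IAM
sasl.jaas.config=software.amazon.msk.auth.iam.IAMLoginModule required;
sasl.client.callback.handler.class=software.amazon.msk.auth.iam.IAMClientCallbackHandler
```

With this in place, topic-level authorization moves from SCRAM credential management to IAM policies attached to the client's role, which is what removed our password-rotation procedures.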
These improvements proved especially valuable during high-traffic periods like Black Friday, when our team previously needed extensive operational readiness plans. Now, the built-in resiliency of Amazon MSK gives us reliable Kafka clusters that serve as mission-critical infrastructure for our business. The migration also made it possible to break our monolithic clusters into smaller, dedicated MSK clusters. This improved our data isolation, provided better resource allocation, and enhanced performance predictability for high-priority workloads.
Lessons learned
Our migration to Amazon MSK revealed several key insights that can help other organizations modernize their Kafka infrastructure:
- Expert validation – Working with AWS Solutions Architects to validate our migration strategy caught several critical issues early. Although our team knew our applications well, external Kafka experts identified potential problems with state management and consumer offset handling that we hadn't considered. This validation prevented costly missteps during the migration.
- Data verification – Comparing data across Kafka clusters proved challenging. We built tools to capture topic snapshots in Parquet format on Amazon Simple Storage Service (Amazon S3), enabling quick comparisons using Amazon Athena queries. This approach gave us confidence that data remained consistent throughout the migration.
- Start small – Beginning with our smallest data universe in QA helped us refine our process. Each subsequent migration went more smoothly as we applied lessons from earlier iterations. This gradual approach helped us maintain system stability while building team confidence.
- Detailed planning – We created specific migration plans with each team, considering their unique requirements and constraints. For example, our machine learning pipeline needed special handling because of strict offset management requirements. This granular planning prevented downstream disruptions.
- Performance optimization – We found that using Amazon MSK provisioned throughput provided clear cost advantages when storage throughput became a bottleneck. This feature made it possible to improve cluster performance without scaling instance sizes or adding brokers, providing a more efficient solution to our throughput challenges.
- Documentation – Maintaining detailed migration runbooks proved invaluable. When we encountered similar issues across different migrations, having documented solutions saved significant troubleshooting time.
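The data-verification idea boils down to an order-independent comparison of two topic snapshots, since replication can reorder records across partitions. In practice we compared Parquet snapshots with Athena; the helper below is only a sketch of the comparison principle, with a hypothetical record shape and function name.

```python
import hashlib


def snapshot_fingerprint(records: list[tuple[bytes, bytes]]) -> str:
    """Order-independent fingerprint of a topic snapshot.

    Replication can deliver records in a different order across clusters,
    so each (key, value) pair is hashed individually and the digests are
    combined in sorted order. The record shape is a simplified stand-in
    for real snapshot rows.
    """
    digests = sorted(
        hashlib.sha256(key + b"\x00" + value).hexdigest()
        for key, value in records
    )
    return hashlib.sha256("".join(digests).encode()).hexdigest()


source = [(b"p1", b"review-a"), (b"p2", b"review-b")]
mirrored = [(b"p2", b"review-b"), (b"p1", b"review-a")]  # same data, new order
print(snapshot_fingerprint(source) == snapshot_fingerprint(mirrored))  # True
```

A SQL engine over the snapshots (as Athena was for us) gives the added benefit of pinpointing *which* records differ, not just whether the sets match.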
Conclusion
In this post, we showed you how we modernized our Kafka infrastructure by migrating to Amazon MSK. We walked through our decision-making process, the challenges we faced, and the strategies we employed. Our journey transformed Kafka operations from a resource-intensive, self-managed infrastructure to a streamlined, managed service, improving operational efficiency, platform reliability, and team productivity. For enterprises managing self-hosted Kafka infrastructure, our experience demonstrates that successful transformation is achievable with proper planning and execution. As data streaming needs grow, modernizing infrastructure becomes a strategic imperative for maintaining competitive advantage.
For more information, visit the Amazon MSK product page, and explore the comprehensive Developer Guide to learn about the features available to help you build scalable and reliable streaming data applications on AWS.
