Jumia builds a next-generation data platform with metadata-driven specification frameworks

Jumia is a technology company founded in 2012, present in 14 African countries, with its main headquarters in Lagos, Nigeria. Jumia is built around a marketplace, a logistics service, and a payment service. The logistics service enables the delivery of packages through a network of local partners, and the payment service facilitates the payment of online transactions within Jumia’s ecosystem. Jumia is listed on the NYSE and has a market cap of $554 million.

In this post, we share part of the journey that Jumia took with AWS Professional Services to modernize its data platform, moving it from a Hadoop distribution to AWS serverless solutions. Some of the challenges that motivated the modernization were the high cost of maintenance, lack of agility to scale computing at specific times, job queuing, lack of innovation when it came to adopting more modern technologies, complex automation of the infrastructure and applications, and the inability to develop locally.

Solution overview

The basic concept of the modernization project is to create metadata-driven frameworks that are reusable, scalable, and able to respond to the different phases of the modernization process. These phases are: data orchestration, data migration, data ingestion, data processing, and data maintenance.

This standardization for each phase was conceived as a way to streamline the development workflows and minimize the risk of errors that can arise from using disparate methods. It also enabled the migration of different types of data following a similar approach regardless of the use case. By adopting this approach, data handling is consistent, more efficient, and easier to manage across different projects and teams. In addition, although the use cases have autonomy in their domain from a governance perspective, on top of them is a centralized governance model that defines the access control in the shared architectural components. Importantly, this implementation emphasizes data security by enforcing encryption across all services, including Amazon Simple Storage Service (Amazon S3) and Amazon DynamoDB. Furthermore, it adheres to the principle of least privilege, thereby enhancing overall system security and reducing potential vulnerabilities.

The following diagram describes the frameworks that were created. In this design, the workloads in the new data platform are divided by use case. Each use case requires the creation of a set of YAML files for each phase, from data migration to data flow orchestration, and these files are essentially the input of the system. The output is a set of DAGs that run the specific tasks.

Overview

In the following sections, we discuss the objectives, implementation, and learnings of each phase in more detail.

Data orchestration

The objective of this phase is to build a metadata-driven framework to orchestrate the data flows along the whole modernization process. The orchestration framework provides a robust and scalable solution with the following capabilities: dynamically create DAGs, integrate natively with non-AWS services, allow the creation of dependencies based on past executions, and generate accessible metadata for each execution. Therefore, it was decided to use Amazon Managed Workflows for Apache Airflow (Amazon MWAA), which, through the Apache Airflow engine, provides these functionalities while abstracting users from the management operations.

The following is the description of the metadata files that are provided as part of the data orchestration phase for a given use case that performs the data processing using Spark on Amazon EMR Serverless:

owner: # Use case owner
dags: # List of DAGs to be created for this use case
  - name: # Use case name
    type: # Type of DAG (it could be migration, ingestion, transformation or maintenance)
    tags: # List of tags
    notification: # Defines notifications for this DAG
      on_success_callback: true
      on_failure_callback: true
    spark: # Spark job information
      entrypoint: # Spark script
      arguments: # Arguments required by the Spark script
      spark_submit_parameters: # Spark submit parameters

The idea behind all the frameworks is to build reusable artifacts that enable the development teams to accelerate their work while providing reliability. In this case, the framework provides the capabilities to create DAG objects within Amazon MWAA based on configuration files (YAML files).

This particular framework is built on layers that add different functionalities to the final DAG:

DAGs – The DAGs are built based on the metadata information provided to the framework. The data engineers don’t have to write Python code to create the DAGs; they’re created automatically, and this module is responsible for performing this dynamic creation of DAGs.
Validations – This layer handles YAML file validation in order to prevent corrupted files from affecting the creation of other DAGs.
Dependencies – This layer handles dependencies among different DAGs in order to manage complex interconnections.
Notifications – This layer handles the types of notifications and alerts that are part of the workflows.
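As a rough illustration of the DAGs layer, the following Python sketch turns an already-parsed use case file into plain DAG specifications. The DagSpec class, the build_dag_specs function, the key names owner, name, and type, and the dag_id naming scheme are all assumptions made for this example. The actual framework creates Airflow DAG objects inside Amazon MWAA, which is not shown here.

```python
from dataclasses import dataclass, field


@dataclass
class DagSpec:
    """Hypothetical, framework-agnostic description of one DAG."""
    dag_id: str
    owner: str
    dag_type: str
    tags: list = field(default_factory=list)
    notify_on_success: bool = False
    notify_on_failure: bool = False
    spark: dict = field(default_factory=dict)


def build_dag_specs(use_case: dict) -> list:
    """Create one DagSpec per entry under the `dags` key of a parsed YAML file."""
    specs = []
    for dag in use_case.get("dags", []):
        notification = dag.get("notification") or {}
        specs.append(
            DagSpec(
                # Assumed naming scheme: <type>_<name>
                dag_id=f"{dag['type']}_{dag['name']}",
                owner=use_case["owner"],
                dag_type=dag["type"],
                tags=list(dag.get("tags") or []),
                notify_on_success=bool(notification.get("on_success_callback", False)),
                notify_on_failure=bool(notification.get("on_failure_callback", False)),
                spark=dict(dag.get("spark") or {}),
            )
        )
    return specs
```

In a real deployment, each specification would then be translated into an Airflow DAG and registered so that the scheduler can discover it. That step depends on the Airflow version in use and is left out of this sketch.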

Orchestration

One aspect to consider when using Amazon MWAA is that, although it is a managed service, it still requires some maintenance from the users, and it’s important to have a good understanding of the number of DAGs and processes that you expect to have in order to fine-tune the instance and obtain the desired performance. Some of the parameters that were fine-tuned during the engagement were core.dagbag_import_timeout, core.dag_file_processor_timeout, core.min_serialized_dag_update_interval, core.min_serialized_dag_fetch_interval, scheduler.min_file_process_interval, scheduler.max_dagruns_to_create_per_loop, scheduler.processor_poll_interval, scheduler.dag_dir_list_interval, and celery.worker_autoscale.

One of the layers described in the preceding diagram corresponds to validation. This was an important component for the creation of dynamic DAGs. Because the input to the framework consists of YAML files, it was decided to filter out corrupted files before attempting to create the DAG objects. Following this approach, Jumia could avoid undesired interruptions of the whole process. The module that actually builds DAGs only receives configuration files that follow the required specifications to successfully create them. In case of corrupted files, information regarding the specific issues is logged into Amazon CloudWatch so that developers can fix them.
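The filtering step described above can be sketched in a few lines of Python. This is a minimal sketch, not the framework’s actual validator: the function names, the required keys, and the logger name are assumptions, and the configurations are assumed to be parsed already (a file that failed to parse is represented by None). Only the list of DAG types comes from the metadata description earlier in this post.

```python
import logging

logger = logging.getLogger("dag_factory.validation")  # hypothetical logger name

REQUIRED_DAG_KEYS = ("name", "type")  # assumed minimum set of keys
ALLOWED_TYPES = {"migration", "ingestion", "transformation", "maintenance"}


def validate_use_case(config) -> list:
    """Return a list of human-readable problems; an empty list means valid."""
    if not isinstance(config, dict):
        return ["top-level document is not a mapping"]
    problems = []
    if not config.get("owner"):
        problems.append("missing 'owner'")
    dags = config.get("dags")
    if not isinstance(dags, list) or not dags:
        problems.append("'dags' must be a non-empty list")
        return problems
    for i, dag in enumerate(dags):
        if not isinstance(dag, dict):
            problems.append(f"dags[{i}] is not a mapping")
            continue
        for key in REQUIRED_DAG_KEYS:
            if not dag.get(key):
                problems.append(f"dags[{i}] is missing '{key}'")
        if dag.get("type") and dag["type"] not in ALLOWED_TYPES:
            problems.append(f"dags[{i}] has unknown type '{dag['type']}'")
    return problems


def filter_valid(configs: dict) -> dict:
    """Keep only valid configs; log the issues found in the others."""
    valid = {}
    for path, config in configs.items():
        problems = validate_use_case(config)
        if problems:
            # In Amazon MWAA, task and DAG processing logs can be sent to CloudWatch.
            logger.error("Skipping %s: %s", path, "; ".join(problems))
        else:
            valid[path] = config
    return valid
```

Because each file is validated independently, one broken file only removes its own DAGs from the result instead of interrupting the creation of all the others.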

Data migration

The objective of this phase is to build a metadata-driven framework for migrating data from HDFS to Amazon S3 in the Apache Iceberg storage format, which involves the least operational overhead, provides scalability during peak hours, and ensures data integrity and confidentiality.

The following diagram illustrates the architecture.

Migration

During this phase, a metadata-driven framework built in PySpark receives a configuration file as input so that the migration tasks can run in an Amazon EMR Serverless job. This job uses the PySpark framework as the script location. Then the orchestration framework described previously is used to create a migration DAG that runs the following tasks:

The first task creates the DDLs in Iceberg format in the AWS Glue Data Catalog using the migration framework within an Amazon EMR Serverless job.
After the tables are created, the second task transfers HDFS data to a landing bucket in Amazon S3 using AWS DataSync to sync customer data. This process brings data from all the different layers of the data lake.
When this process is complete, a third task converts data to Iceberg format from the landing bucket to the destination bucket (raw, process, or analytics), again using another option of the migration framework embedded in an Amazon EMR Serverless job.
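To make the first task more concrete, the following sketch builds the kind of Spark SQL statement that creates an Iceberg table in a catalog. It is only an illustration: the function name, the catalog name glue_catalog, the bucket, and the column list are assumptions for this example, not details of Jumia’s framework.

```python
def iceberg_ddl(catalog, database, table, columns, location, partition_by=()):
    """Build a Spark SQL CREATE TABLE statement for an Iceberg table.

    columns is a list of (name, spark_sql_type) pairs.
    """
    cols = ",\n  ".join(f"{name} {dtype}" for name, dtype in columns)
    ddl = (
        f"CREATE TABLE IF NOT EXISTS {catalog}.{database}.{table} (\n"
        f"  {cols}\n"
        f")\n"
        f"USING iceberg"
    )
    if partition_by:
        ddl += f"\nPARTITIONED BY ({', '.join(partition_by)})"
    ddl += f"\nLOCATION '{location}'"
    return ddl


# Hypothetical table definition taken from a migration config file.
statement = iceberg_ddl(
    catalog="glue_catalog",
    database="raw",
    table="orders",
    columns=[("order_id", "bigint"), ("created_at", "timestamp")],
    location="s3://example-bucket/raw/orders",
    partition_by=["days(created_at)"],
)
print(statement)
```

Inside the EMR Serverless job, a statement like this would be passed to spark.sql(...), assuming the Spark session is configured with an Iceberg catalog backed by the AWS Glue Data Catalog.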

Data transfer performance is better when the size of the files to be transferred is around 128–256 MB, so it’s recommended to compress the files at the source. Reducing the number of files shortens the metadata analysis and integrity verification phases, which speeds up the migration phase.
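A back-of-the-envelope count shows why fewer, larger files help. The sketch below assumes a 1 TiB dataset and compares an average file size of 1 MiB (an invented figure for illustration) with the lower end of the recommended range, treated here as 128 MiB.

```python
import math

MIB = 1024 ** 2
TIB = 1024 ** 4


def file_count(total_bytes, avg_file_bytes):
    """Number of files needed to hold total_bytes at a given average file size."""
    return math.ceil(total_bytes / avg_file_bytes)


small = file_count(TIB, 1 * MIB)    # 1,048,576 files
large = file_count(TIB, 128 * MIB)  # 8,192 files
print(small, large, small // large)  # prints: 1048576 8192 128
```

Under these assumptions, the transfer has 128 times fewer objects to list, compare, and verify, even though the total volume of data is unchanged.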

Data ingestion

The objective of this phase is to implement another metadata-based framework that supports the two data ingestion models. A batch mode is responsible for extracting data from different data sources (such as Oracle or PostgreSQL), and a micro-batch mode extracts data from a Kafka cluster and, based on configuration parameters, can also run as a native stream.

The following diagram illustrates the architecture for the batch, micro-batch, and streaming approach.

Ingestion

During this phase, a metadata-driven framework builds the logic to bring data from Kafka, databases, or external services, which is run using an ingestion DAG deployed in Amazon MWAA.

Spark Structured Streaming was used to ingest data from Kafka topics. The framework receives configuration files in YAML format that indicate which topics to read, what extraction processes should be performed, whether the data should be read in streaming or micro-batch mode, and in which destination table the data should be stored, among other settings.
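The following sketch shows one way such a configuration could be mapped to Spark Structured Streaming settings. The configuration keys (bootstrap_servers, topics, mode, destination_table) and the function names are hypothetical. Mapping micro-batch mode to the availableNow trigger and streaming mode to a processingTime trigger is also an assumption for this example, not a description of Jumia’s framework.

```python
def kafka_reader_options(bootstrap_servers, topics):
    """Options for spark.readStream.format("kafka")."""
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        "subscribe": ",".join(topics),
    }


def trigger_options(mode, interval="1 minute"):
    """Keyword arguments for DataStreamWriter.trigger(), chosen by mode."""
    if mode == "streaming":
        # Keep the query running and fire a micro-batch at every interval.
        return {"processingTime": interval}
    if mode == "micro-batch":
        # Process everything currently available, then stop the query.
        return {"availableNow": True}
    raise ValueError(f"unsupported ingestion mode: {mode!r}")


def plan_ingestion(config):
    """Translate one (hypothetical) ingestion config entry into Spark settings."""
    return {
        "reader": kafka_reader_options(config["bootstrap_servers"], config["topics"]),
        "trigger": trigger_options(config["mode"]),
        "target": config["destination_table"],
    }
```

With a Spark session available, the plan would be applied roughly as spark.readStream.format("kafka").options(**plan["reader"]).load(), followed by writeStream.trigger(**plan["trigger"]) on the transformed stream.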

For batch ingestion, a metadata-driven framework written in PySpark was implemented. In the same way as the previous one, the framework receives a configuration in YAML format with the tables to be migrated and their destination.

One of the aspects to consider in this type of migration is the synchronization of data between the ingestion phase and the migration phase, so that no data is lost and data isn’t reprocessed unnecessarily. To this end, a solution was implemented that saves the timestamps of the last historical data migrated (per table) in a DynamoDB table. Both types of frameworks are programmed to use this data the first time they are run. For micro-batching use cases, which use Spark Structured Streaming, Kafka data is read by assigning the value stored in DynamoDB to the startingTimestamp parameter. For all other executions, priority is given to the metadata in the checkpoint folder. This way, you can make sure ingestion is synchronized with the data migration.
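The decision described above can be sketched as a small function. The function name and the fallback to the earliest offsets when no timestamp is stored are assumptions. The startingTimestamp option (as spelled in the Spark documentation) and startingOffsets are options of Spark’s Kafka source, and Spark itself resumes from the checkpoint when one exists. Reading the value from DynamoDB is left out so that the example stays self-contained.

```python
def kafka_start_options(checkpoint_exists, last_migrated_ts_ms=None):
    """Choose where a Kafka read should start.

    checkpoint_exists: True if the streaming query already has a checkpoint folder.
    last_migrated_ts_ms: timestamp (epoch milliseconds) of the last migrated record
        for this table, as stored by the migration phase; None if nothing is stored.
    """
    if checkpoint_exists:
        # Later runs: the checkpoint metadata decides where to resume,
        # so no explicit starting position is passed.
        return {}
    if last_migrated_ts_ms is None:
        # Assumed fallback for this sketch: read the topic from the beginning.
        return {"startingOffsets": "earliest"}
    # First run: start right where the historical migration stopped.
    return {"startingTimestamp": str(last_migrated_ts_ms)}
```

The returned dictionary can be merged into the other Kafka reader options before the stream is created.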

Data processing

The objective in this phase was to be able to handle updates and deletions of data in object storage, so Iceberg was adopted throughout the project as the table format because of its ACID capabilities. Although all phases use Iceberg tables, the processing phase makes extensive use of Iceberg’s capabilities for incremental processing of data, building the processing layer with upserts through Iceberg’s ability to run MERGE INTO commands.
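As an illustration of the upsert pattern, the following sketch generates a MERGE INTO statement from a list of key and data columns. The function name, the table names, and the optional delete condition are assumptions for this example. The generated statement follows the MERGE INTO syntax that Iceberg supports in Spark SQL.

```python
def merge_into_sql(target, source, keys, columns, delete_condition=None):
    """Build an upsert statement that merges `source` rows into `target`.

    keys: columns that identify a row; columns: all columns to write.
    delete_condition: optional predicate on the source row that marks deletions.
    """
    non_keys = [c for c in columns if c not in keys]
    if not keys or not non_keys:
        raise ValueError("need at least one key column and one non-key column")
    on = " AND ".join(f"t.{k} = s.{k}" for k in keys)
    updates = ", ".join(f"t.{c} = s.{c}" for c in non_keys)
    col_list = ", ".join(columns)
    values = ", ".join(f"s.{c}" for c in columns)
    lines = [f"MERGE INTO {target} t", f"USING {source} s", f"ON {on}"]
    if delete_condition:
        lines.append(f"WHEN MATCHED AND {delete_condition} THEN DELETE")
    lines.append(f"WHEN MATCHED THEN UPDATE SET {updates}")
    lines.append(f"WHEN NOT MATCHED THEN INSERT ({col_list}) VALUES ({values})")
    return "\n".join(lines)


# Hypothetical usage: `updates` is a temporary view holding the incremental batch.
print(
    merge_into_sql(
        target="glue_catalog.process.orders",
        source="updates",
        keys=["order_id"],
        columns=["order_id", "status", "updated_at"],
        delete_condition="s.op = 'D'",
    )
)
```

Only the rows present in the incremental batch are touched, which is what makes this approach suitable for incremental processing instead of rewriting the whole table.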

The following diagram illustrates the architecture.

Processing

The architecture is similar to that of the ingestion phase, with the only change being that the data source is Amazon S3. This approach speeds up the delivery phase and maintains quality with a production-ready solution.

By default, Amazon EMR Serverless has the spark.dynamicAllocation.enabled parameter set to True. This option scales the number of executors registered within the application up or down based on the workload. This brings a lot of advantages when dealing with different types of workloads, but it also brings considerations when using Iceberg tables. For instance, while writing data into an Iceberg table, the Amazon EMR Serverless application can use a large number of executors in order to speed up the task. This can result in reaching Amazon S3 limits, specifically the number of requests per second per prefix. For this reason, it’s important to apply good data partitioning practices.

Another important aspect to consider in these cases is the object storage file layout. By default, Iceberg uses the Hive storage layout, but it can be set to use ObjectStoreLocationProvider. By setting this property, a deterministic hash is generated for each file, with the hash appended directly after write.data.path. This can greatly reduce throttled requests based on object prefix, as well as maximize throughput for Amazon S3 related I/O operations, because the files written are evenly distributed across multiple prefixes.
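As a sketch of how this layout could be switched on for an existing table, the following function builds an ALTER TABLE statement that sets the Iceberg table properties write.object-storage.enabled and, optionally, write.data.path. The function name, the table name, and the bucket are hypothetical, and whether Jumia set these properties this way is not stated in this post.

```python
def object_store_layout_sql(table, data_path=None):
    """Build a statement that enables Iceberg's object store file layout."""
    props = {"write.object-storage.enabled": "true"}
    if data_path:
        # Base location under which the hashed prefixes are created.
        props["write.data.path"] = data_path
    body = ", ".join(f"'{key}'='{value}'" for key, value in props.items())
    return f"ALTER TABLE {table} SET TBLPROPERTIES ({body})"


print(object_store_layout_sql("glue_catalog.raw.orders", "s3://example-bucket/raw/orders/data"))
```

Files written after the property is set are placed under hashed prefixes. Existing files stay where they are until they are rewritten.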

Data maintenance

When working with data lake table formats such as Iceberg, it’s essential to perform routine maintenance tasks to optimize table metadata file management, preventing large numbers of unnecessary files from accumulating and promptly removing any unused files. The objective of this phase was to build another framework that can perform these types of tasks on the tables within the data lake.

The following diagram illustrates the architecture.

Maintenance

The framework, like the other ones, receives a configuration file (YAML file) indicating the tables and the list of maintenance tasks with their respective parameters. It was built on PySpark so that it could run as an Amazon EMR Serverless job and be orchestrated using the orchestration framework, just like the other frameworks built as part of this solution.

The following maintenance tasks are supported by the framework:

Expire snapshots – Snapshots can be used for rollback operations as well as time travel queries. However, they can accumulate over time and lead to performance degradation. It’s highly recommended to regularly expire snapshots that are no longer needed.
Remove old metadata files – Metadata files can accumulate over time just like snapshots. Removing them regularly is also recommended, especially when dealing with streaming or micro-batching operations, which was one of the cases of the overall solution.
Compact files – As the number of data files increases, the amount of metadata stored in the manifest files also increases, and small data files can lead to less efficient queries. Because this solution uses a streaming and micro-batching application writing into Iceberg tables, the size of the files tends to be small. For this reason, a method to compact files was essential to improve the overall performance.
Hard delete data – One of the requirements was to be able to perform hard deletes on data older than a certain period of time. This implies expiring snapshots and removing metadata files.
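The first three tasks in the list above map onto Iceberg’s Spark stored procedures and table properties. The following sketch translates a list of task entries into SQL statements. The task names, the configuration keys, and the function name are hypothetical. The procedures expire_snapshots and rewrite_data_files and the write.metadata.* table properties are part of Iceberg, but this is not necessarily how Jumia’s framework invokes them. Hard deletes are not covered here.

```python
def maintenance_statements(catalog, table, tasks):
    """Translate (hypothetical) maintenance task entries into Spark SQL statements."""
    statements = []
    for task in tasks:
        name = task["task"]
        if name == "expire_snapshots":
            statements.append(
                f"CALL {catalog}.system.expire_snapshots("
                f"table => '{table}', "
                f"older_than => TIMESTAMP '{task['older_than']}', "
                f"retain_last => {int(task.get('retain_last', 1))})"
            )
        elif name == "remove_old_metadata":
            # Old metadata files are cleaned up by Iceberg on each commit
            # once these table properties are set.
            statements.append(
                f"ALTER TABLE {catalog}.{table} SET TBLPROPERTIES ("
                f"'write.metadata.delete-after-commit.enabled'='true', "
                f"'write.metadata.previous-versions-max'='{int(task['max_versions'])}')"
            )
        elif name == "compact_files":
            statements.append(
                f"CALL {catalog}.system.rewrite_data_files(table => '{table}')"
            )
        else:
            raise ValueError(f"unsupported maintenance task: {name!r}")
    return statements
```

In a PySpark job, each statement would be run with spark.sql(statement), assuming the session has the Iceberg SQL extensions enabled and a catalog registered under the given name.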

The maintenance tasks were scheduled with different frequencies depending on the use case and the specific task. For this reason, the schedule information for these tasks is defined in each of the YAML files of the specific use case.

At the time this framework was implemented, there was no automatic maintenance solution on top of Iceberg tables. At AWS re:Invent 2024, Amazon S3 Tables was launched to automate the maintenance of Iceberg tables. This functionality automates file compaction, snapshot management, and unreferenced file removal.

Conclusion

Building a data platform on top of standardized frameworks that use metadata for different aspects of the data handling process, from data migration and ingestion to orchestration, enhances the visibility and control over each of the phases and significantly speeds up implementation and development processes. Furthermore, by using services such as Amazon EMR Serverless and DynamoDB, you can bring all the benefits of serverless architectures, including scalability, simplicity, flexible integration, improved reliability, and cost-efficiency.

With this architecture, Jumia was able to reduce their data lake cost by 50%. Furthermore, with this approach, data and DevOps teams were able to deploy complete infrastructures and data processing capabilities by creating metadata files along with Spark SQL files. This approach has reduced turnaround time to production and lowered failure rates. Additionally, AWS Lake Formation provided the capabilities to collaborate and govern datasets on various storage layers on the AWS platform and externally.

“Leveraging AWS for our data platform has not only optimized and reduced our infrastructure costs but also standardized our workflows and ways of working across data teams and established a more trustworthy single source of truth for our data assets. This transformation has boosted our efficiency and agility, enabling faster insights and enhancing the overall value of our data platform.”
– Hélder Russa, Head of Data Engineering at Jumia Group.

Take the first step towards streamlining the data migration process now, with AWS.

About the Authors

Ramón Díez is a Senior Customer Delivery Architect at Amazon Web Services. He led the project with the firm conviction of using technology in service of the business.

Paula Marenco is a Data Architect at Amazon Web Services. She enjoys designing analytical solutions that bring light into complexity, turning intricate data processes into clear and actionable insights. Her work focuses on making data more accessible and impactful for decision-making.

Hélder Russa is the Head of Data Engineering at Jumia Group, contributing to the strategy definition, design, and implementation of multiple Jumia data platforms that support the overall decision-making process, as well as operational solutions, data science projects, and real-time analytics.

Pedro Gonçalves is a Principal Data Engineer at Jumia Group, responsible for designing and overseeing the data architecture, with an emphasis on the AWS platform and data lakehouse technologies, to ensure robust and agile data solutions and analytics capabilities.
