This post is co-written with Hemant Aggarwal and Naveen Kambhoji from Kaplan.
Kaplan, Inc. provides individuals, educational institutions, and businesses with a broad array of services, supporting our students and partners to meet their diverse and evolving needs throughout their educational and professional journeys. Our Kaplan culture empowers people to achieve their goals. Committed to fostering a learning culture, Kaplan is changing the face of education.
Kaplan data engineers empower data analytics using Amazon Redshift and Tableau. The infrastructure provides an analytics experience to hundreds of in-house analysts, data scientists, and student-facing frontend specialists. The data engineering team is on a mission to modernize its data integration platform to be agile, adaptive, and straightforward to use. To achieve this, they chose the AWS Cloud and its services. There are various types of pipelines that need to be migrated from the existing integration platform to the AWS Cloud, and the pipelines have different types of sources like Oracle, Microsoft SQL Server, MongoDB, Amazon DocumentDB (with MongoDB compatibility), APIs, software as a service (SaaS) applications, and Google Sheets. In terms of scale, at the time of writing over 250 objects are being pulled from three different Salesforce instances.
In this post, we discuss how the Kaplan data engineering team implemented data integration from the Salesforce application to Amazon Redshift. The solution uses Amazon Simple Storage Service (Amazon S3) as a data lake, Amazon Redshift as a data warehouse, Amazon Managed Workflows for Apache Airflow (Amazon MWAA) as an orchestrator, and Tableau as the presentation layer.
Solution overview
The high-level data flow starts with the source data stored in Amazon S3 and then integrated into Amazon Redshift using various AWS services. The following diagram illustrates this architecture.

Amazon MWAA is our main tool for data pipeline orchestration and is integrated with other tools for data migration. While searching for a tool to migrate data from a SaaS application like Salesforce to Amazon Redshift, we came across Amazon AppFlow. After some research, we found Amazon AppFlow to be well suited for our requirement to pull data from Salesforce. Amazon AppFlow provides the ability to directly migrate data from Salesforce to Amazon Redshift. However, in our architecture, we chose to separate the data ingestion and storage processes for the following reasons:
- We needed to store data in Amazon S3 (data lake) as an archive and a centralized location for our data infrastructure.
- From a future perspective, there might be scenarios where we need to transform the data before storing it in Amazon Redshift. By storing the data in Amazon S3 as an intermediate step, we can integrate transformation logic as a separate module without significantly impacting the overall data flow.
- Apache Airflow is the central point in our data infrastructure, and other pipelines are being built using various tools like AWS Glue. Amazon AppFlow is one part of our overall infrastructure, and we wanted to maintain a consistent approach across different data sources and targets.
To accommodate these requirements, we divided the pipeline into two parts:
- Migrate data from Salesforce to Amazon S3 using Amazon AppFlow
- Load data from Amazon S3 to Amazon Redshift using Amazon MWAA
This approach allows us to take advantage of the strengths of each service while maintaining flexibility and scalability in our data infrastructure. Amazon AppFlow can handle the first part of the pipeline without the need for any other tool, because Amazon AppFlow provides functionality like creating a connection to source and target, scheduling the data flow, and creating filters, and we can choose the type of flow (incremental or full load). With this, we were able to migrate the data from Salesforce to an S3 bucket. Afterwards, we created a DAG in Amazon MWAA that runs an Amazon Redshift COPY command on the data stored in Amazon S3 and moves the data into Amazon Redshift, as sketched below.
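The following is a minimal sketch of that second stage as an Airflow task callable, assuming the COPY is issued through the Redshift Data API with boto3. The cluster, database, user, table, bucket, and IAM role names are hypothetical, and the FORMAT clause assumes the flow writes AppFlow's default JSON output to Amazon S3.

```python
import boto3


def copy_s3_to_redshift(**context):
    """Issue a COPY that loads the files AppFlow landed in S3 into Redshift."""
    client = boto3.client("redshift-data")
    client.execute_statement(
        ClusterIdentifier="analytics-cluster",  # hypothetical cluster name
        Database="analytics",                   # hypothetical database
        DbUser="etl_user",                      # hypothetical database user
        Sql="""
            COPY salesforce.account
            FROM 's3://example-data-lake/salesforce/account/'
            IAM_ROLE 'arn:aws:iam::111122223333:role/redshift-copy-role'
            FORMAT AS JSON 'auto';
        """,
    )
```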
We faced the following challenges with this approach:
- To pull incremental data, we have to manually change the filter dates in the Amazon AppFlow flows, which isn't elegant. We wanted to automate that date filter change.
- The two parts of the pipeline were not in sync, because there was no way to know whether the first part of the pipeline was complete so that the second part could start. We wanted to automate these steps as well.
Implementing the solution
To automate and solve these challenges, we used Amazon MWAA. We created a DAG that acts as the control center for Amazon AppFlow. We developed an Airflow operator that can perform various Amazon AppFlow functions using the Amazon AppFlow APIs, such as creating, updating, deleting, and starting flows, and this operator is used in the DAG. Amazon AppFlow stores the connection data in an AWS Secrets Manager managed secret with the prefix appflow. The cost of storing the secret is included in the charges for Amazon AppFlow. With this, we were able to run the complete data flow using a single DAG.
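As an illustration only, the following is a minimal sketch of what such an operator could look like, wrapping the AppFlow API through the boto3 appflow client; the class name, parameters, and configuration shape are our assumptions, not the exact operator described above.

```python
import boto3
from airflow.models import BaseOperator


class AppFlowOperator(BaseOperator):
    """Run a single AppFlow API action: create, update, delete, or start."""

    def __init__(self, action, flow_name, flow_config=None, **kwargs):
        super().__init__(**kwargs)
        self.action = action              # one of: create, update, delete, start
        self.flow_name = flow_name
        self.flow_config = flow_config or {}  # trigger, source, destination, tasks

    def execute(self, context):
        client = boto3.client("appflow")
        if self.action == "create":
            client.create_flow(flowName=self.flow_name, **self.flow_config)
        elif self.action == "update":
            client.update_flow(flowName=self.flow_name, **self.flow_config)
        elif self.action == "delete":
            client.delete_flow(flowName=self.flow_name)
        elif self.action == "start":
            client.start_flow(flowName=self.flow_name)
        else:
            raise ValueError(f"Unsupported AppFlow action: {self.action}")
```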
The complete data flow consists of the following steps (a sketch of the full DAG follows the list):
- Create the flow in Amazon AppFlow using a DAG.
- Update the flow with the new filter dates using the DAG.
- After updating the flow, the DAG starts the flow.
- The DAG waits for the flow to complete by checking the flow's status repeatedly.
- A successful status indicates that the data has been migrated from Salesforce to Amazon S3.
- After the data flow is complete, the DAG runs the COPY command to copy data from Amazon S3 to Amazon Redshift.
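Putting it together, the following minimal sketch shows one way these steps can be wired into a single DAG, reusing the hypothetical AppFlowOperator and copy_s3_to_redshift function from the earlier snippets; the flow name and the build_flow_config() helper (which would assemble the flow definition with the updated filter dates) are illustrative assumptions.

```python
import time
from datetime import datetime

import boto3
from airflow import DAG
from airflow.operators.python import PythonOperator


def wait_for_flow(flow_name):
    """Poll the most recent AppFlow run until it succeeds or fails."""
    client = boto3.client("appflow")
    while True:
        runs = client.describe_flow_execution_records(
            flowName=flow_name, maxResults=1
        )["flowExecutions"]
        status = runs[0]["executionStatus"] if runs else None
        if status == "Successful":
            return
        if status in ("Error", "Canceled"):
            raise RuntimeError(f"AppFlow run for {flow_name} ended as {status}")
        time.sleep(60)  # check the flow status again in a minute


with DAG(
    dag_id="salesforce_to_redshift",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    update_flow = AppFlowOperator(
        task_id="update_filter_dates",
        action="update",
        flow_name="sf-account-flow",                # hypothetical flow name
        flow_config=build_flow_config(),            # hypothetical helper
    )
    start_flow = AppFlowOperator(
        task_id="start_flow", action="start", flow_name="sf-account-flow"
    )
    wait = PythonOperator(
        task_id="wait_for_flow",
        python_callable=wait_for_flow,
        op_kwargs={"flow_name": "sf-account-flow"},
    )
    copy = PythonOperator(
        task_id="copy_to_redshift", python_callable=copy_s3_to_redshift
    )
    update_flow >> start_flow >> wait >> copy
```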
This approach helped us solve the aforementioned issues, and the data pipelines have become more robust and simple to understand, are straightforward to use with no manual intervention, and are less prone to error because we control everything from a single point (Amazon MWAA). Amazon AppFlow, Amazon S3, and Amazon Redshift are all configured to use encryption to protect the data. We also implemented logging and monitoring, along with auditing mechanisms to track the data flow and access, using AWS CloudTrail and Amazon CloudWatch. The following figure shows a high-level diagram of the final approach we took.

Conclusion
In this post, we shared how Kaplan's data engineering team successfully implemented a robust and automated data integration pipeline from Salesforce to Amazon Redshift, using AWS services such as Amazon AppFlow, Amazon S3, Amazon Redshift, and Amazon MWAA. By creating a custom Airflow operator to control Amazon AppFlow functionality, we orchestrated the entire data flow seamlessly within a single DAG. This approach has not only resolved the challenges of incremental data loading and synchronization between different pipeline stages, but has also made the data pipelines more resilient, simpler to maintain, and less error-prone. We reduced the time to create a pipeline for a new object from an existing instance, or a new pipeline for a new source, by 50%. It also removed the complexity of using a delta column to get the incremental data, which helped reduce the cost per table by 80–90% compared to a full load of objects every time.
With this modern data integration platform in place, Kaplan is well positioned to provide its analysts, data scientists, and student-facing teams with timely and reliable data, empowering them to drive informed decisions and foster a culture of learning and growth.
Try out Airflow with Amazon MWAA and other enhancements to improve your data orchestration pipelines.
For more details and code examples for Amazon MWAA, refer to the Amazon MWAA User Guide and the Amazon MWAA examples GitHub repo.
About the Authors
Hemant Aggarwal is a Senior Data Engineer at Kaplan India Pvt Ltd, helping develop and manage ETL pipelines leveraging AWS, along with process and strategy development for the team.
Naveen Kambhoji is a Senior Manager at Kaplan Inc. He works with data engineers at Kaplan to build data lakes using AWS services. He is the facilitator for the entire migration process. His passion is building scalable distributed systems for efficiently managing data in the cloud. Outside work, he enjoys traveling with his family and exploring new places.
Jimy Matthews is an AWS Solutions Architect, with expertise in AI/ML. Jimy is based out of Boston and works with enterprise customers as they transform their business by adopting the cloud, helping them build efficient and sustainable solutions. He is passionate about his family, cars, and mixed martial arts.
