Implementing a Dimensional Data Warehouse with Databricks SQL: Part 2


As organizations consolidate analytics workloads on Databricks, they often need to adapt traditional data warehouse techniques. This series explores how to implement dimensional modeling, specifically star schemas, on Databricks. The first blog focused on schema design. This blog walks through ETL pipelines for dimension tables, including Slowly Changing Dimension (SCD) Type-1 and Type-2 patterns. The final blog will show you how to build ETL pipelines for fact tables.

Slowly Changing Dimensions (SCD)

In the last blog, we defined our star schema, including a fact table and its related dimensions. We highlighted one dimension table in particular, DimCustomer, as shown here (with some attributes removed to conserve space):
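The original table definition is not reproduced here, so the following is a minimal sketch of what that DDL might look like. The column list is abbreviated, and the attribute names and types are inferred from the AdventureWorksDW sample schema, so they may differ slightly in your deployment:

    CREATE TABLE IF NOT EXISTS DimCustomer (
      CustomerKey BIGINT GENERATED ALWAYS AS IDENTITY, -- surrogate key
      CustomerAlternateKey STRING,                     -- natural (business) key
      FirstName STRING,
      LastName STRING,
      MaritalStatus STRING,
      YearlyIncome DECIMAL(19, 4),
      NumberChildrenAtHome INT,
      -- additional attributes removed to conserve space
      StartDate TIMESTAMP,    -- start of the period this version is valid
      EndDate TIMESTAMP,      -- NULL while this version is current
      IsLateArriving BOOLEAN  -- placeholder record created during fact ETL?
    )
    CLUSTER BY (CustomerKey);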

The last three fields in this table, i.e., StartDate, EndDate and IsLateArriving, represent metadata that assists us with versioning records. As a given customer's income, marital status, home ownership, number of children at home, or other characteristics change, we'll want to create new records for that customer so that facts such as our online sales transactions in FactInternetSales are associated with the right representation of that customer. The natural (aka business) key, CustomerAlternateKey, will be the same across these records, but the metadata will differ, allowing us to understand the period for which that version of the customer was valid, as will the surrogate key, CustomerKey, allowing our facts to link to the right version.

NOTE: Because the surrogate key is commonly used to link facts and dimensions, dimension tables are often clustered on this key. Unlike traditional relational databases that rely on b-tree indexes over sorted records, Databricks implements a distinctive clustering technique known as liquid clustering. While the specifics of liquid clustering are outside the scope of this blog, we consistently use the CLUSTER BY clause on the surrogate key of our dimension tables in their definitions to leverage this feature effectively.

This pattern of versioning dimension records as attributes change is known as the Type-2 Slowly Changing Dimension (or simply Type-2 SCD) pattern. The Type-2 SCD pattern is the preferred approach for recording dimension data in the classic dimensional methodology. However, there are other ways to deal with changes in dimension records.

One of the most common ways to deal with changing dimension values is to update existing records in place. Only one version of the record is ever created, so that the business key remains the unique identifier for the record. For various reasons, not the least of which are performance and consistency, we still implement a surrogate key and link our fact records to these dimensions on those keys. However, the StartDate and EndDate metadata fields that describe the time periods over which a given dimension record is considered active are not needed. This is known as the Type-1 SCD pattern. The Promotion dimension in our star schema provides an example of a Type-1 dimension table implementation:
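Again, a sketch of what the Type-1 table definition might look like, with attributes abbreviated and names inferred from AdventureWorksDW. Note the absence of the versioning metadata fields, while the surrogate key and liquid clustering remain:

    CREATE TABLE IF NOT EXISTS DimPromotion (
      PromotionKey BIGINT GENERATED ALWAYS AS IDENTITY, -- surrogate key
      PromotionAlternateKey INT,                        -- business key
      EnglishPromotionName STRING,
      DiscountPct FLOAT,
      PromotionType STRING,
      PromotionCategory STRING
      -- no StartDate / EndDate / IsLateArriving versioning metadata
    )
    CLUSTER BY (PromotionKey);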

But what about the IsLateArriving metadata field seen in the Type-2 Customer dimension but missing from the Type-1 Promotion dimension? This field is used to flag records as late arriving. A late arriving record is one whose business key shows up during a fact ETL cycle, but for which no record for that key was located during prior dimension processing. In the case of the Type-2 SCDs, this field is used to denote that when the data for a late arriving record is first observed in a dimension ETL cycle, the record should be updated in place (just as in a Type-1 SCD pattern) and then versioned from that point forward. In the case of the Type-1 SCDs, this field isn't necessary because the record will be updated in place regardless.

NOTE: The Kimball Group recognizes additional SCD patterns, most of which are variations and combinations of the Type-1 and Type-2 patterns. Because the Type-1 and Type-2 SCDs are the most frequently implemented of these patterns, and the techniques used with the others are closely related to what's employed with these two, we're limiting this blog to just these two dimension types. For more information about the eight types of SCDs recognized by the Kimball Group, please see the Slowly Changing Dimension Techniques section of this document.

Implementing the Type-1 SCD Pattern

With data being updated in place, the Type-1 SCD workflow is the more straightforward of the two dimension ETL patterns. To support these types of dimensions, we simply:

  1. Extract the required data from our operational system(s)
  2. Perform any required data cleansing operations
  3. Compare our incoming records to those already in the dimension table
  4. Update any existing records where incoming attributes differ from what's already recorded
  5. Insert any incoming records that don't have a corresponding record in the dimension table

To illustrate a Type-1 SCD implementation, we'll define the ETL for the ongoing population of the DimPromotion table.

Step 1: Extract data from an operational system

Our first step is to extract the data from our operational system. As our data warehouse is patterned after the AdventureWorksDW sample database provided by Microsoft, we're using the closely related AdventureWorks (OLTP) sample database as our source. This database has been deployed to an Azure SQL Database instance and made accessible within our Databricks environment through a federated query. Extraction is then facilitated with a simple query (with some fields redacted to conserve space), with the query results persisted to a table in our staging schema (which is made accessible only to the data engineers in our environment through permission settings not shown here). This is but one of many ways we can access source system data in this environment:
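The extraction might look like the following sketch. The catalog name adventureworks (the federated Azure SQL Database connection) and the staging schema name are assumptions for illustration; Sales.SpecialOffer is the corresponding AdventureWorks OLTP source table:

    CREATE OR REPLACE TABLE staging.promotion AS
    SELECT
      SpecialOfferID,  -- business key in the source system
      Description,
      DiscountPct,
      Type,
      Category
      -- some fields redacted to conserve space
    FROM adventureworks.sales.specialoffer;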

Step 2: Compare incoming records to those in the table

Assuming we have no additional data cleansing steps to perform (which we could implement with an UPDATE or another CREATE TABLE AS statement), we can then tackle our dimension data update/insert operations in a single step using a MERGE statement, matching our staged data and dimension data on the business key:
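A sketch of that MERGE, assuming the staging table and column mappings from the previous step:

    MERGE INTO DimPromotion AS tgt
    USING staging.promotion AS src
      ON tgt.PromotionAlternateKey = src.SpecialOfferID  -- business key match
    WHEN MATCHED THEN
      UPDATE SET
        tgt.EnglishPromotionName = src.Description,
        tgt.DiscountPct = src.DiscountPct
        -- remaining column assignments redacted to conserve space
    WHEN NOT MATCHED THEN
      INSERT (PromotionAlternateKey, EnglishPromotionName, DiscountPct)
      VALUES (src.SpecialOfferID, src.Description, src.DiscountPct);

Because PromotionKey is defined as an identity column, it is omitted from the INSERT column list and generated automatically.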

One important thing to note about the statement as written here is that we update any existing records whenever a match is found between the staged and published dimension table data. We could add additional criteria to the WHEN MATCHED clause to limit updates to those instances where a record in staging has different information from what's found in the dimension table, but given the relatively small number of records in this particular table, we've elected to use the relatively leaner logic shown here. (We'll use the additional WHEN MATCHED logic with DimCustomer, which involves far more data.)

The Type-2 SCD pattern

The Type-2 SCD pattern is a bit more complex. To support these types of dimensions, we must:

  1. Extract the required data from our operational system(s)
  2. Perform any required data cleansing operations
  3. Update any late-arriving member records in the target table
  4. Expire any existing records in the target table for which new versions are found in staging
  5. Insert any new (or new versions of) records into the target table

Step 1: Extract and cleanse data from a source system

As in the Type-1 SCD pattern, our first steps are to extract and cleanse data from the source system. Using the same approach as above, we issue a federated query and persist the extracted data to a table in our staging schema:
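A sketch of the customer extract, under the same naming assumptions as before. The join between Sales.Customer and Person.Person reflects how the AdventureWorks OLTP schema splits customer and person attributes, and the column list is heavily abbreviated:

    CREATE OR REPLACE TABLE staging.customer AS
    SELECT
      c.AccountNumber AS CustomerAlternateKey,  -- business key
      p.FirstName,
      p.LastName
      -- demographic attributes redacted to conserve space
    FROM adventureworks.sales.customer c
      INNER JOIN adventureworks.person.person p
        ON c.PersonID = p.BusinessEntityID;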

Step 2: Compare to the dimension table

With this data landed, we can now compare it to our dimension table in order to make any required data modifications. The first of these is to update in place any records flagged as late arriving by prior fact table ETL processes. Please note that these updates are limited to those records flagged as late arriving, and that the IsLateArriving flag is reset with the update so that these records behave as normal Type-2 SCDs moving forward:
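A sketch of the late-arriving update, restricted to flagged records and resetting the flag in the same statement:

    MERGE INTO DimCustomer AS tgt
    USING staging.customer AS src
      ON tgt.CustomerAlternateKey = src.CustomerAlternateKey
    WHEN MATCHED AND tgt.IsLateArriving THEN  -- only late-arriving placeholders
      UPDATE SET
        tgt.FirstName = src.FirstName,
        tgt.LastName = src.LastName,
        -- remaining attribute assignments redacted to conserve space
        tgt.IsLateArriving = FALSE;  -- behaves as a normal Type-2 record from here on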

Step 3: Expire versioned records

The next set of data modifications is to expire any records that need to be versioned. It's important that the EndDate value we set for these matches the StartDate of the new record versions we'll create in the next step. For that reason, we'll set a timestamp variable to be used across these two steps:
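One way to do this is with a SQL session variable, as sketched below; the variable name change_ts is arbitrary, and the attribute comparisons are abbreviated:

    -- a single timestamp shared by the expire and insert steps
    DECLARE OR REPLACE VARIABLE change_ts TIMESTAMP DEFAULT current_timestamp();

    MERGE INTO DimCustomer AS tgt
    USING staging.customer AS src
      ON tgt.CustomerAlternateKey = src.CustomerAlternateKey
      AND tgt.EndDate IS NULL  -- match the current version of the record
    WHEN MATCHED AND NOT (
        equal_null(tgt.FirstName, src.FirstName)
        AND equal_null(tgt.LastName, src.LastName)
        AND equal_null(tgt.MaritalStatus, src.MaritalStatus)
        -- remaining attribute comparisons redacted to conserve space
      ) THEN
      UPDATE SET tgt.EndDate = change_ts;  -- expire the outgoing version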

NOTE: Depending on the data available to you, you may elect to use an EndDate value originating from the source system, in which case you wouldn't necessarily declare a variable as shown here.

Please note the additional criteria used in the WHEN MATCHED clause. Because we're only performing one operation with this statement, it would be possible to move this logic to the ON clause, but we kept it separated from the core matching logic (where we match to the current version of the dimension record) for readability and maintainability.

As part of this logic, we're making heavy use of the equal_null() function. This function returns TRUE when the first and second values are the same or both NULL; otherwise, it returns FALSE. This provides an efficient way to look for changes on a column-by-column basis. For more details on how Databricks supports NULL semantics, please refer to this document.
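Its behavior is easy to verify directly:

    SELECT
      equal_null('A', 'A'),    -- TRUE: both values match
      equal_null(NULL, NULL),  -- TRUE: both values are NULL
      equal_null('A', NULL);   -- FALSE: only one value is NULL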

At this stage, any prior versions of records in the dimension table that needed to expire have been end-dated.

Step 4: Insert new records

We can now insert new records, both truly new ones and new versions of existing ones:
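A sketch of the insert step, reusing the change_ts variable declared above so that new versions start exactly when the old ones expired; it is again written as a MERGE, though an INSERT would work equally well:

    MERGE INTO DimCustomer AS tgt
    USING staging.customer AS src
      ON tgt.CustomerAlternateKey = src.CustomerAlternateKey
      AND tgt.EndDate IS NULL  -- look for an unexpired version
    WHEN NOT MATCHED THEN      -- none found: truly new, or newly versioned
      INSERT (CustomerAlternateKey, FirstName, LastName,
              StartDate, EndDate, IsLateArriving)
      VALUES (src.CustomerAlternateKey, src.FirstName, src.LastName,
              change_ts, NULL, FALSE);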

As before, this could have been accomplished using an INSERT statement, but the end result is the same. With this statement, we have identified any records in the staging table that don't have an unexpired corresponding record in the dimension table. These records are simply inserted with a StartDate value consistent with any expired records that may exist in this table.

Next steps: implementing the fact table ETL

With the dimensions implemented and populated with data, we can now turn our attention to the fact tables. In the next blog, we'll demonstrate how the ETL for these tables can be implemented.

To learn more about Databricks SQL, visit our website or read the documentation. You can also check out the product tour for Databricks SQL. If you want to migrate your existing warehouse to a high-performance, serverless data warehouse with a great user experience and lower total cost, Databricks SQL is the solution. Try it for free.
