Data modeling is surely not a new concept, but it is a key one, especially in the age of big data. Data modeling emphasizes what data is needed and how it should be organized. Let’s talk about everyone’s favorite topic, Slowly Changing Dimensions.
A dataset typically contains your fact tables and dimension tables. Dimensions define what you want to analyze or to slice your facts by, such as calendar or geography with their hierarchies (customers, stores, employees, sales channels, etc.). Fact tables change very rapidly; dimensions stay pretty consistent.
Sometimes the data in those tables can change as well. How we handle those changes is what I want to talk about. There are three types of changes but I’m going to focus on the two changes that are most common.
Type 1 Slowly Changing Dimensions – This type occurs when we want to overwrite the data and it’s not necessary to preserve the history. In other words, when changes in data types are detected, a type 1 change always reflects the current values.
This can happen when there’s a data error that needs to be corrected, such as an incorrect entry of an employee birth date. We can go into the employee dimension table, update the birth date there and we are done. We don’t care if the historical records have changed or any reporting has been affected; our goal is just to correct the error.
Type 2 Slowly Changing Dimensions – This type is a bit more complex as we need to preserve the history. For example, if a person changes their address, location, or name, how do we want to handle that from the data perspective?
In this case we need to make sure there is versioning of the dimension numbers. Let’s say our customer, Sally, has moved to a different state and changed her address. We need to have two versions of Sally depending on when the change occurred.
As you can see below Natural Key, such as social security number, won’t help us here.
We need a surrogate key that identifies each version of Sally. So, if we need to run a report of shipments by state, we will appropriately show shipments to the same customer going to two different states.
If the source system doesn’t store versions and you have a data warehouse, the data warehouse load process will detect those changes. If there’s no data warehouse this is something to be aware of as you’re building your reporting solution.
It would also be a good idea to add the range validity of the version of Sally, such as start and end data columns as well as a flag column which states which is the current version of Sally.
Hopefully, you found this quick post on Slowly Changing Dimensions and how to handle data changes helpful. If you’d like to learn more about data modeling, data warehousing or any product or service within Azure, data and the cloud is what we’re all about.
Our experts can help you use your data to grow your business, whether you work on premise, in the cloud or a hybrid approach. Contact us to start a conversation or click the link below. Let us help you take your business from good to great.