Azure Data Week - Modern Data Warehouse Design Patterns

Pragmatic Works Nov 05, 2018

Azure Data Week - Modern Data Warehouse Design Patterns

BannerFinal Azure Data Week In his Azure Data Week session, Modern Data Warehouse Design Patterns, Bob Rubocki gave an overview of modern cloud-based data warehousing and data flow patterns based on Azure technologies including Azure Data Factory, Azure Logic Apps, Azure Data Lake Store, and Azure SQL DB.

There were many questions he was unable to answer during his session and we're happy to share them with you now.

Below are the Q&A from session attendees:

In your demo, did you assume the dimension table loads were not required?

A: Yes. I had loaded the dimensions prior to the demo. I created a pipeline with an activity to execute dimension loads, followed by an activity for loading fact tables. I only demonstrated the fact load process.

How did you identify source records from the file that are already loaded in the DW?

A: I loaded the dimension and fact tables with stored procedures. The procedures contain logic written in SQL to insert records from a stage table that do not exist in the dimension/fact table.

How is Azure Data Factory priced?

A: Data Factory pricing is generally based on 1) the number of pipeline executions, and 2) duration of pipeline execution. There is no monthly or recurring charge to use the activities I used in the demo.

Using the Azure SSIS runtime, however, does incur cost not based on execution. The Azure SSIS runtime allows execution of SSIS packages in Azure. This runtime operates on VM clusters, and there is a charge for uptime of the Azure SSIS runtime VMs, in addition to the cost of executing a package via a pipeline activity.

Following is a link to details on Azure Data Factory V2 pricing à https://azure.microsoft.com/en-us/pricing/details/data-factory/data-pipeline/

In the future as computing performance increases, do you anticipate Microsoft advising not needing to create fact tables or star schema and loading the operational data directly to data? Or even doing all analytics on top of OLTP data directly?

A: NO! To be clear from the start, as you suggest, there are many cases where a star schema is not required. But a well-designed star schema provides ease-of-use benefits to reporting analysts that cannot be discounted.

With my demo as an example, using a star schema makes it simple to analyze data from different data sets (Passing, Rushing, Receiving). In my star schema, I have three fact tables with a handful of common dimensions used in each (Team, Player, Date). As long as my fact tables have dimensions in common, I can very easily create a report showing measures from all three fact tables WITHOUT relating the three fact tables to each other directly. If I left my data in three separate data sets, as might happen in an OLTP scenario, it is challenging (and ugly!) to relate these to each other for reporting.

What is the reason for using staging tables? Why not load the DW directly from the raw files?

A: I implemented a pattern of using stored procedures to populate the DW tables and having data in SQL tables makes this much easier. Also, the ELT pattern (extract, load, transform) pattern is becoming much more common, particularly true if you are using Azure SQL Data Warehouse instead of Azure SQL Database. The ELT pattern generally includes staging data in tables on the DW database from where additional transformations could be applied.

As I write this, the Data Factory Data Flow activity is in preview, and Data Factory does not currently have much capability for DW-style data transformations. My demo (and response to this question) may change after this feature is generally available (GA)!

The reference architecture shows loading everything to file in Blob/ADLS, so it can be processed/joined/analyzed in different ways. If the source is a relational database, should that data still be pulled into Blob/ADLS?

A: Keeping the data in file storage outside the relational database provides many benefits. It reduces the number of times I need to access the source database, which is almost always a good thing. I may extract DB data for a specific use case. But when another analytics question is asked, I can very likely answer that question with data in the files without accessing the DB again.

Another reason is archiving changes to data. We commonly see incremental extract patterns that only pull new/changed items from a DB. Depending on the DB design, these types of changes may not be tracked in the source.

File storage also helps when a source system is shut down or migrated to something new. In that case, your analytics environment still has access to all the old data from that system.

I also prefer this pattern such that I can separate the process of extracting source data from the process of loading the DW. Particularly if my DW includes data from more than one source (very common!). With file storage as the source for my DW loading, my DW load processes are not hard-wired to source data extraction.

What type of data lineage is available from Data Factory and Logic Apps?

A: These tools currently do not provide any data lineage documentation.

Can we load Excel files using ADF or PolyBase without converting to CSV or TXT files?

A: No. According to documentation (link) from Microsoft, PolyBase supports extracting from delimited text files, RC, ORC, or Parquet files. I give more info on this topic in this Azure Every Day post as well.

Can you use dynamic SQL to query data to load to a stage table?

A: Yes. There are many different techniques for creating the dynamic SQL, however. Following are links to two articles (including one from Delora Bradish from Pragmatic Works) with illustrations of two techniques for doing this.

Is there a good way to migrate from ADF V1 to V2?

A: Microsoft has built an Azure Data Factory migration tool available here à link

Can you have read-only access to the Data Factory monitoring to see executions and status?

A: Yes. Azure and Data Factory provide a “Log Analytics Reader” role. From the Azure Portal, “Log Analytics Reader” can view and search all monitoring data, as well as view monitoring settings, including viewing the configuration of Azure diagnostics on all Azure resources.” This article (link) contains more information about Data Factory roles and permissions.