Advertisement

Data Warehouse Model: Diffuse Relationships

By on

Click here to learn more about author Michael Blaha.

A diffuse relationship is one of a group of similar relationships, which broadly apply to entities in a model. We can show them explicitly, but such an approach can become verbose and obscure the deeper content of a model. We coined the term “diffuse relationship” as we haven’t seen this phenomenon named by others.

An Example

Consider the following data warehouse model. The relationships to Source_System, Data_Warehouse_User, and Batch_Load_Run are diffuse relationships. Many of the tables have a source_system_key, created_by_key, updated_by_key, and batch_load_key. You can see the repetition, even in this small example.

diffuse-image1

The model is a proper model in that all relationships are defined. We can readily generate DDL code including foreign key constraints.

However there’s an aesthetic problem. The diffuse relationships cloud the underlying deeper meaning. Furthermore, if we consider the full warehouse with several hundred tables, we have severe clutter in the model. One purpose of a model is to generate DDL and the example model does that well. But another purpose is to foster deep understanding of a problem and communicate to others – the clutter of diffuse relationships undermines this latter purpose.

A Restated Model

Here’s a restated model where we omit the diffuse relationships.

diffuse-image2

Clearly the simplified model is better at showing meaningful content. End users that are writing analytical queries would much prefer the simplified model to the verbose one.

If we want to have foreign key constraints, we have two choices. One option is to dedicate one or more subject areas for defining diffuse relationships. For example, we could have one subject area with all Source_System relationships, another with all Batch_Load_Run relationships, and so forth. With this approach the domain subject areas are readable. However, there’s hassle in maintaining the diffuse relationship subject areas and avoiding oversights.

Another option is to forego defining diffuse relationships in the model. We could use metaSQL (see www.dataversity.net/using-metasql-to-load-table-comments-for-mysql) to generate the missing DDL. For example, we could inspect the system catalog to find all tables with source_system_key as a foreign key and generate DDL. (Sample SQL Server code is shown below.) The metaSQL-generated DDL would augment the DDL from the data modeling tool. We could also generate index statements this way (see www.dataversity.net/please-index-foreign-keys).

SELECT ‘ALTER TABLE ‘ + T.name + ‘ ADD FOREIGN KEY (source_system_key)
REFERENCES Source_System (source_system_key) ON DELETE NO ACTION;’
FROM sys.tables T JOIN sys.columns C ON C.object_id = T.object_id
WHERE C.name = ‘source_system_key’;

In Conclusion

This article uses a data warehouse example. But diffuse relationships are not confined to data warehouses and also occur for day-to-day operating applications. For example, many applications track the creator and last updater of a record.

There is no perfect solution for diffuse relationships. All in all, we favor the approach of omitting diffuse relationships from a model and augmenting model DDL with metaSQL-generated DDL.

Leave a Reply