April 16th, 2025

Validations and Correctness: How UDM Enables Devs to Build for Data Quality, Uptime, and Velocity

Introduction

Ensuring data correctness and integrity is crucial in any data-driven system. Poor data quality can lead to incorrect insights, disrupted business processes, and failed pipelines. The Unified Data Model (UDM) enforces robust validation rules to maintain high data quality, ensuring consistency across all assets. In this post, we’ll explore how UDM safeguards against missing or incorrect data, the role of schema enforcement and type validation, how data lineage tracking helps troubleshoot failures, and real-world scenarios where UDM prevents costly issues and speeds up end-to-end delivery time.

How UDM Safeguards Against Missing or Incorrect Data

One of the primary ways UDM ensures data quality is through rigorous validation checks before and after data ingestion.

We’ve categorized validations into four broad categories:

  • Availability Validation: Ensure data is always accessible. This check runs against the final table or output at regular intervals.
  • Correctness Validation: This covers several checks (a few of them are sketched in code after this list):
    1. Data Type Validation: Ensure data aligns with the expected data type. For instance, a numerical column must not contain any alphabetic characters. Implement this during data entry.
    2. Consistency Validation: Verify data is logically consistent. For instance, a person’s date of birth should not exceed the current date. Implement this during data entry and data processing.
    3. Uniqueness Validation: Check for duplicate data entries. For example, no two customers should have the same customer ID. Implement this during data entry and at regular intervals throughout the data lifecycle.
    4. Format Validation: Confirm the data follows the correct format. For instance, phone numbers, zip codes, and email addresses should follow their respective patterns. Implement this during data entry.
    5. Range Validation: Ensure data falls within a specific range. For instance, age data should ideally be between 0 and 120. Implement this during data entry and data processing.
    6. Completeness Validation: Verify all necessary data has been entered. For example, all required fields in a form should be filled out. Implement this right after data entry.
  • Stats Validation: Confirm that statistics are trending correctly over daily, weekly, or monthly periods. Typically implemented after the data asset is delivered.
  • Relationship Validation
    1. Referential Integrity Validation: Check if the data follows the defined database relationships. For instance, a customer ID mentioned in the orders table should also exist in the customer’s table. Implement this whenever data is added, updated, or deleted in the database.
    2. Cross-Field Validation: Validate data based on other data in the same record. For instance, a start date should be earlier than the end date. Implement this during data entry and data processing.
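
To make these categories concrete, here is a minimal T-SQL sketch of a few correctness checks. The Customers table and its columns are hypothetical and used only for illustration; each query returns the rows that violate a rule, so an empty result means the check passed.

-- Illustrative correctness checks against a hypothetical Customers table

-- Format Validation: flag email addresses that do not match a basic pattern
SELECT CustomerId, Email
FROM Customers
WHERE Email NOT LIKE '%_@_%._%';

-- Range Validation: flag ages outside the expected 0-120 range
SELECT CustomerId, Age
FROM Customers
WHERE Age < 0 OR Age > 120;

-- Uniqueness Validation: flag duplicate customer IDs
SELECT CustomerId, COUNT(*) AS DuplicateCount
FROM Customers
GROUP BY CustomerId
HAVING COUNT(*) > 1;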

We will revisit the gaming profile example from our previous post to show how these validations enable efficient debugging and maintenance of the profile, and to demonstrate how UDM enforces the principles above in practice.

By enforcing these rules before, during, and after data ingestion, UDM validations prevent bad data from propagating downstream.

We’ll take the Game Developer Profile and the Extension we built in the previous blog post and expand on them by adding a sample pre-validation script and a sample post-validation script.

Recall that our game developer profile looks like this:

Game Developer Profile Schema

Column Name  Data Type  Nullable  Privacy Category  Description
DeveloperId  GUID       No        Internal          Unique identifier for each developer
  • Primary Key: DeveloperId

Developer Core Properties Extension Schema

Column Name          Data Type  Nullable  Privacy Category  Description
DeveloperId          GUID       No        Internal          Unique identifier for each developer
DeveloperName        String     No        Public            Name of the game developer
FoundedYear          DateTime   Yes       Public            Year the company was founded
CountryId            Long       Yes       Public            Foreign key linking to the Country Dimension
TotalGamesPublished  Int        Yes       Public            Total number of games published by the developer
PrimaryGenre         String     Yes       Public            The main game genre the developer specializes in
  • Associated Base Profile: Game Developer Profile
  • Join Cardinality: 1:1
  • Primary Key: DeveloperId

Country Dimension Schema

Column Name  Data Type  Nullable  Privacy Category  Description
CountryId    Long       No        Internal          Unique identifier for the country (e.g., ISO 3166 code)
CountryName  String     No        Public            Full name of the country
Region       String     Yes       Public            Geographic region (e.g., North America, Europe)
Subregion    String     Yes       Public            More granular geographic grouping (e.g., Western Europe, Southeast Asia)
  • Primary Key: CountryId

The Role of Schema Enforcement and Type Validation

Schema enforcement ensures that data adheres to predefined structures. UDM uses:

  • Pre-validation scripts that check for missing attributes, incorrect types, and formatting issues before data is processed.
  • Post-validation scripts that compare current and historical data for anomalies.
  • Strict schema enforcement, ensuring that records match the expected data model and avoiding mismatched data types that could cause processing failures (see the table-definition sketch below).
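
As one way to picture strict schema enforcement, here is a minimal sketch of the extension defined with enforced types and constraints. The exact column types and sizes are assumptions for illustration; UDM’s actual storage definitions may differ.

-- Illustrative table definition: NOT NULL columns, a foreign key, and a
-- CHECK constraint reject malformed rows at write time, before any
-- validation script runs.
CREATE TABLE DeveloperCorePropertiesExtension (
    DeveloperId         UNIQUEIDENTIFIER NOT NULL PRIMARY KEY,
    DeveloperName       NVARCHAR(256)    NOT NULL,
    FoundedYear         DATETIME         NULL,
    CountryId           BIGINT           NULL REFERENCES CountryDimension (CountryId),
    TotalGamesPublished INT              NULL CHECK (TotalGamesPublished >= 0),
    PrimaryGenre        NVARCHAR(128)    NULL
);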

For example, in our validation script below, we ensure that every DeveloperId is a valid GUID, reducing the risk of errors in identity resolution.

Sample Pre-Validation Script

-- Pre-Validation Script for Game Developer Profile and Developer Core Properties Extension

-- Validate that DeveloperId is not NULL and is a valid GUID
SELECT DeveloperId
FROM GameDeveloperProfile
WHERE DeveloperId IS NULL 
OR TRY_CAST(DeveloperId AS UNIQUEIDENTIFIER) IS NULL;

-- Validate that DeveloperName is not NULL and non-empty
SELECT DeveloperId, DeveloperName
FROM DeveloperCorePropertiesExtension
WHERE DeveloperName IS NULL OR DeveloperName = '';

-- Validate that FoundedYear is a valid date (if present)
SELECT DeveloperId, FoundedYear
FROM DeveloperCorePropertiesExtension
WHERE FoundedYear IS NOT NULL 
AND TRY_CAST(FoundedYear AS DATE) IS NULL;

-- Validate that CountryId references an existing Country Dimension
SELECT d.DeveloperId, d.CountryId
FROM DeveloperCorePropertiesExtension d
LEFT JOIN CountryDimension c ON d.CountryId = c.CountryId
WHERE d.CountryId IS NOT NULL 
AND c.CountryId IS NULL;

-- Validate that TotalGamesPublished is non-negative
SELECT DeveloperId, TotalGamesPublished
FROM DeveloperCorePropertiesExtension
WHERE TotalGamesPublished IS NOT NULL 
AND TotalGamesPublished < 0;

-- Validate that PrimaryGenre is not an empty string (if provided)
SELECT DeveloperId, PrimaryGenre
FROM DeveloperCorePropertiesExtension
WHERE PrimaryGenre IS NOT NULL 
AND LEN(PrimaryGenre) = 0;
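
How these queries gate a pipeline depends on the orchestration in place, but a simple pattern is to raise an error whenever a check returns rows, so the ingestion step fails fast instead of loading bad data. A minimal sketch using the first check above (the error number and message are illustrative):

-- Fail the pipeline step if any DeveloperId is missing or malformed
IF EXISTS (
    SELECT 1
    FROM GameDeveloperProfile
    WHERE DeveloperId IS NULL
       OR TRY_CAST(DeveloperId AS UNIQUEIDENTIFIER) IS NULL
)
    THROW 50001, 'Pre-validation failed: invalid or missing DeveloperId.', 1;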

Data Lineage Tracking and Pipeline Failure Prevention

Data lineage tracking is critical for troubleshooting and preventing failures. UDM tracks data at every stage of its lifecycle, allowing engineers to:

  • Trace errors back to their source: If a validation check fails, lineage tracking helps pinpoint the dataset or transformation step that introduced the issue.
  • Ensure completeness over time: By maintaining historical validation results, teams can detect and address gradual data quality degradation before it impacts operations.
  • Automate alerts for high-priority failures: P0 scenarios such as a missing GameDeveloperId or a significant drop in record counts trigger severity 1 alerts, allowing on-call engineers to respond immediately (a minimal alerting sketch follows below).
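
As a minimal sketch of such an alert, the query below compares today’s record count against yesterday’s snapshot and raises an error on a sharp drop. The GameDeveloperProfile_DailyCounts snapshot table, the 10% threshold, and the error number are all assumptions for illustration; in practice the raised error would be mapped to a severity 1 alert by the pipeline’s alerting hooks.

-- Completeness alert: fail loudly if record volume drops sharply
DECLARE @Today BIGINT = (SELECT COUNT(*) FROM GameDeveloperProfile);
DECLARE @Yesterday BIGINT = (
    SELECT RecordCount
    FROM GameDeveloperProfile_DailyCounts  -- hypothetical snapshot table
    WHERE SnapshotDate = CAST(DATEADD(DAY, -1, GETDATE()) AS DATE)
);

IF @Yesterday IS NOT NULL AND @Today < @Yesterday * 0.9
    THROW 50002, 'Record count dropped more than 10% since yesterday.', 1;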

Sample Post-Validation Script

-- Post-Validation Script to ensure consistency and correctness after data insertion

-- Validate that all records in DeveloperCorePropertiesExtension have corresponding entries in GameDeveloperProfile
SELECT d.DeveloperId
FROM DeveloperCorePropertiesExtension d
LEFT JOIN GameDeveloperProfile g ON d.DeveloperId = g.DeveloperId
WHERE g.DeveloperId IS NULL;

-- Validate that DeveloperId is unique in both tables
SELECT DeveloperId, COUNT(*)
FROM GameDeveloperProfile
GROUP BY DeveloperId
HAVING COUNT(*) > 1;

SELECT DeveloperId, COUNT(*)
FROM DeveloperCorePropertiesExtension
GROUP BY DeveloperId
HAVING COUNT(*) > 1;

-- Validate that TotalGamesPublished is consistent with historical records (e.g., does not decrease)
SELECT d.DeveloperId, d.TotalGamesPublished
FROM DeveloperCorePropertiesExtension d
WHERE EXISTS (
    SELECT 1
    FROM DeveloperCorePropertiesExtension_History h
    WHERE h.DeveloperId = d.DeveloperId
      AND h.TotalGamesPublished > d.TotalGamesPublished
);

-- Ensure that every developer has at least one assigned genre if they have published games
SELECT DeveloperId
FROM DeveloperCorePropertiesExtension
WHERE TotalGamesPublished > 0 
AND (PrimaryGenre IS NULL OR PrimaryGenre = '');

-- Verify that FoundedYear is not greater than the current year
SELECT DeveloperId, FoundedYear
FROM DeveloperCorePropertiesExtension
WHERE FoundedYear IS NOT NULL 
AND YEAR(FoundedYear) > YEAR(GETDATE());

Real-World Scenarios: The Cost of Incorrect Data and UDM’s Role in Prevention

Poor data quality can have severe business implications. Here are some examples of how incorrect data can disrupt processes and how UDM prevents such issues:

  • Incorrect Customer Segmentation: If our game developers’ IsXboxSubscription data is incorrect, it can lead to misclassification of customers, affecting targeted campaigns and sales forecasts. UDM ensures this attribute remains accurate.
  • Duplicate Tenant Records: Duplicate GameDeveloperId entries can cause inconsistencies in billing and reporting. UDM’s duplicate validation prevents this issue.
  • Missing Subscription Data: If active subscriptions aren’t reflected in reports, it could lead to erroneous deactivations. UDM cross-checks subscription records against the commercial profile to detect missing data (a sketch of such a cross-check follows below).
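
Here is a minimal sketch of that subscription cross-check. The Subscriptions and CommercialProfile tables and their columns are hypothetical; the query surfaces active subscriptions with no matching commercial profile so they can be fixed before reports trigger deactivations.

-- Active subscriptions that are missing from the commercial profile
SELECT s.SubscriptionId, s.TenantId
FROM Subscriptions s
LEFT JOIN CommercialProfile p ON s.TenantId = p.TenantId
WHERE s.IsActive = 1
  AND p.TenantId IS NULL;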

Best Practices for Data Validation

To maintain high-quality data, organizations should follow these best practices:

  1. Implement Rigorous Pre-Validation Checks: Catch errors before data enters production.
  2. Leverage Schema Enforcement: Use strong data typing and required fields to avoid structural inconsistencies.
  3. Continuously Monitor Data Completeness: Track data changes over time to detect gradual loss.
  4. Automate Alerts for Critical Failures: Use severity-based alerting to ensure swift response to high-impact issues.
  5. Maintain Historical Validation Records: Store and analyze past validation results to identify long-term trends in data quality (see the logging sketch below).
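
For the last practice, one lightweight approach is to append every check’s outcome to a log table so trends can be queried over time. A minimal sketch, assuming a hypothetical ValidationResults table:

-- Record the outcome of the duplicate-DeveloperId check for trend analysis
INSERT INTO ValidationResults (CheckName, RunDate, ViolationCount)
SELECT 'DuplicateDeveloperId',
       CAST(GETDATE() AS DATE),
       COUNT(*)
FROM (
    SELECT DeveloperId
    FROM GameDeveloperProfile
    GROUP BY DeveloperId
    HAVING COUNT(*) > 1
) AS Duplicates;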

Conclusion

Ensuring data correctness is not a one-time task but a continuous process. UDM provides a robust framework for enforcing validation rules, tracking lineage, and preventing pipeline failures. By implementing best practices for data validation, organizations can improve decision-making, enhance operational efficiency, and prevent costly errors in their data-driven workflows.

If you’re looking to enhance your data validation strategies, consider leveraging UDM’s validation capabilities to ensure the highest level of data quality. Feel free to hit that like button and drop a comment—I’d love to hear how you’re tackling data quality or if you’re using UDM principles in your own projects. Let’s chat!

Author

Anirudh Ganesh
Software Engineer II

Anirudh Ganesh is a Software Engineer at Microsoft with a passion for building high-performance data platforms. With an MS in Computer Science from Ohio State University and experience at industry leaders like Twilio and Google, he brings a wealth of expertise in metadata discoverability and data governance.

Nitin Sood
Principal Software Eng Manager

Nitin Sood is a Principal Software Engineering Manager at Microsoft with over 15 years of experience. He leads innovative projects, collaborates with top engineers, and drives impactful solutions. His expertise lies in data modeling and LLM-based governance for distributed data assets.

Carlos Quito
Software Engineer

Carlos Quito is a Software Engineer working on the Core Data Platform in the IDEAs organization at Microsoft.
