Validating your Vault with Snowflake File Comparison

A wise friend keeps telling me that the best way to validate a data vault model is to recreate the source as a test. Recently I had the opportunity to try this in Snowflake, where the entire data pipeline was written with Snowflake stored procedures and orchestrated with tasks (this was for a proof of concept; I would always advocate for an automation tool to build out any data vault). In Snowflake, comparing two files, even large ones, is really simple. Here's how I went about it.

The source file was staged in an internal stage, and stored procedures and tasks migrated it through to the raw vault data structures. Once the data populated those structures, I created views that mimicked the original source file's structure and content.
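To make that setup concrete, here is a minimal sketch, reusing the stage and file names that appear later in this post; the hub and satellite names (HUB_MEMBER, SAT_MEMBER_ROSTER) and their columns are hypothetical stand-ins for whatever your raw vault actually contains:

create or replace stage int_file_import;
-- PUT runs from SnowSQL or another client, not the worksheet UI;
-- auto-compression gives you member_roster_202212.txt.gz in the stage:
-- put file://member_roster_202212.txt @int_file_import;

-- Rebuild the original file layout from the raw vault, keeping only the
-- latest satellite record per member (all names here are illustrative).
create or replace view MEMBER_ROSTER as
select
    s.FIRST_NAME,
    s.LAST_NAME,
    s.ADDRESS_1,
    s.ADDRESS_2,
    s.CITY,
    s.STATE,
    s.ZIP,
    s.BIRTHDAY
from HUB_MEMBER h
join SAT_MEMBER_ROSTER s
  on s.HK_MEMBER = h.HK_MEMBER
qualify row_number() over (
    partition by s.HK_MEMBER
    order by s.LOAD_DATETIME desc) = 1;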

From here, I used a file format that created the file as pipe-delimited text with a .txt extension, and then used COPY INTO to unload the data as a single, named file into an internal stage.

create or replace file format member_roster_extract
  type = csv
  field_delimiter = '|'
  field_optionally_enclosed_by = none
  file_extension = 'txt'
  empty_field_as_null = false
  null_if = ('')
  compression = 'gzip';
COPY INTO @int_file_delivery/member_roster_202212.txt
FROM (
  SELECT 
    FIRST_NAME,
    LAST_NAME,
    ADDRESS_1,
    ADDRESS_2,
    CITY,
    STATE,
    ZIP,
    BIRTHDAY
  FROM
    MEMBER_ROSTER)
FILE_FORMAT = (FORMAT_NAME = 'member_roster_extract')
HEADER = TRUE
SINGLE = TRUE;
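Before comparing, it's worth a quick sanity check that the unload produced the single gzipped file and that the row counts line up. A minimal sketch, reusing the member_roster_tsv read format from the comparison query below (if that format doesn't set skip_header = 1, expect one extra row for the header):

-- Confirm the single gzipped file landed in the stage
list @int_file_delivery;

-- Row counts should match between the view and the unloaded file
select count(*) from MEMBER_ROSTER;
select count(*)
from @int_file_delivery/member_roster_202212.txt.gz
     (file_format => 'member_roster_tsv');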

Then I wrote a query that reads the files from both stages (the one holding the original source file and the one holding the output file) and joins them to compare the contents. Any rows that don't match in either direction will show up in this query.

$1, $2, $3, and so on are positional references to the fields in the file that you want to use in the concatenation for the compare; this lets you pick which fields to include and in what order. I chose the concat so I could easily see all the fields in the rows that didn't match, which helps with troubleshooting.

with import_file as (
    select
        -- heads-up: with null_if = (''), an empty field nulls the entire
        -- concat_row; wrap each field in coalesce($n::VARCHAR, '') if your
        -- data can contain empty values
        $1::VARCHAR || '|' || $2::VARCHAR || '|' || $3::VARCHAR || '|' || $4::VARCHAR || '|' || $5::VARCHAR || '|' || $6::VARCHAR || '|' || $7::VARCHAR as concat_row
    from
        @int_file_import/member_roster_202212.txt.gz
        (file_format => 'member_roster_tsv')
),
export_file as (
    select
        $1::VARCHAR || '|' || $2::VARCHAR || '|' || $3::VARCHAR || '|' || $4::VARCHAR || '|' || $5::VARCHAR || '|' || $6::VARCHAR || '|' || $7::VARCHAR as concat_row
    from
        @int_file_delivery/member_roster_202212.txt.gz
        (file_format => 'member_roster_tsv')
)

select
    e.concat_row as export_row,
    i.concat_row as import_row
from
    export_file e
    FULL JOIN import_file i on i.concat_row = e.concat_row
where e.concat_row is null or i.concat_row is null;

What I love about this is that the comparison doesn't depend on the order of the columns, the sort order of the rows in the file, or even on bringing every field through (maybe one isn't relevant). You can generate the file in whatever column or sort order you want and still compare the two quickly.
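If you prefer set logic over the full join, the same comparison works with MINUS (Snowflake's EXCEPT); a minimal sketch under the same stage, file, and format assumptions as above:

-- Rows in the import file with no matching row in the export
select $1, $2, $3, $4, $5, $6, $7
from @int_file_import/member_roster_202212.txt.gz
     (file_format => 'member_roster_tsv')
minus
select $1, $2, $3, $4, $5, $6, $7
from @int_file_delivery/member_roster_202212.txt.gz
     (file_format => 'member_roster_tsv');

Run it a second time with the two selects swapped to catch rows that exist only in the export. A nice side effect: set operators treat NULLs as equal, so empty fields don't produce false mismatches the way they can with a NULL-swallowing concatenation.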