A wise friend keeps telling me that the best way to validate a data vault model is to recreate the source as a test. Recently I had the opportunity to try this out in Snowflake: the entire data pipeline was written with Snowflake stored procedures and orchestrated with tasks (this was for a proof of concept; I would always advocate for an automation tool to build out any production data vault). In Snowflake, comparing two files, even large ones, is really simple. Here's how I went about it.
The source file was staged in an internal stage. Stored procedures and tasks were then used to migrate it through to the raw vault data structures. Once the data populated those structures, views were created that mimicked the original source file’s structure and content.
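To make that setup concrete, here is a minimal sketch. The stage, hub, and satellite names are hypothetical stand-ins (the original pipeline isn't shown here); the point is simply a view over the raw vault that presents the same columns as the source file.

-- Hypothetical names throughout; a minimal sketch of the setup described above.
-- Stage the source file in an internal stage (run from SnowSQL or another client);
-- AUTO_COMPRESS gives the staged file a .gz extension:
PUT file:///data/member_roster_202212.txt @int_file_import AUTO_COMPRESS = TRUE;

-- After the pipeline loads the raw vault, expose a view that mimics the
-- source file's structure and content (hub/satellite names are assumed):
CREATE OR REPLACE VIEW member_roster AS
SELECT s.first_name,
       s.last_name,
       s.address_1,
       s.address_2,
       s.city,
       s.state,
       s.zip,
       s.birthday
FROM hub_member h
JOIN sat_member_roster s
  ON s.hk_member = h.hk_member;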
From here, I created a file format that writes a pipe-delimited file with a .txt extension, and then used COPY INTO to unload the data as a single file into an internal stage.
CREATE OR REPLACE FILE FORMAT member_roster_extract
  TYPE = CSV
  FIELD_DELIMITER = '|'
  FIELD_OPTIONALLY_ENCLOSED_BY = NONE
  FILE_EXTENSION = 'txt'
  EMPTY_FIELD_AS_NULL = FALSE
  NULL_IF = ('')
  COMPRESSION = 'gzip';
COPY INTO @int_file_export/member_roster_202212.txt
FROM (
  SELECT FIRST_NAME, LAST_NAME, ADDRESS_1, ADDRESS_2, CITY, STATE, ZIP, BIRTHDAY
  FROM MEMBER_ROSTER
)
FILE_FORMAT = (FORMAT_NAME = 'member_roster_extract')
HEADER = TRUE
SINGLE = TRUE;
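A quick way to confirm the unload landed where the compare query expects it (this check is my addition, not part of the original workflow):

-- Verify the exported file exists and check its name and size;
-- the gzip compression in the file format appends a .gz extension:
LIST @int_file_export;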
Then I wrote a query that joined the staged files directly (the stage where the source file resided and the stage where the output file resided) to compare the contents of both. Any lines that don't match, in either direction, will show up in this query's results.
The $1, $2, $3 references are the positional columns from the staged file that you want to use in the concatenation for the compare; this lets you pick the order of the fields and which ones to include. I chose to concatenate so I could easily see all the fields in the rows that didn't match, which helps with troubleshooting.
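The query below references a member_roster_tsv file format that isn't shown above. A plausible definition, mirroring the export format so both files parse identically, would be:

-- Assumed definition (not shown in the original post) of the read-side
-- file format referenced by the compare query below:
CREATE OR REPLACE FILE FORMAT member_roster_tsv
  TYPE = CSV
  FIELD_DELIMITER = '|'
  FIELD_OPTIONALLY_ENCLOSED_BY = NONE
  EMPTY_FIELD_AS_NULL = FALSE
  NULL_IF = ('')
  COMPRESSION = 'gzip';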
with import_file as (
  select $1::VARCHAR || '|' || $2::VARCHAR || '|' || $3::VARCHAR || '|' ||
         $4::VARCHAR || '|' || $5::VARCHAR || '|' || $6::VARCHAR || '|' ||
         $7::VARCHAR as concat_row
  from @int_file_import/member_roster_202212.txt.gz
    (file_format => 'member_roster_tsv')
),
export_file as (
  select $1::VARCHAR || '|' || $2::VARCHAR || '|' || $3::VARCHAR || '|' ||
         $4::VARCHAR || '|' || $5::VARCHAR || '|' || $6::VARCHAR || '|' ||
         $7::VARCHAR as concat_row
  from @int_file_export/member_roster_202212.txt.gz
    (file_format => 'member_roster_tsv')
)
select e.concat_row as export_row, i.concat_row as import_row
from export_file e
full join import_file i
  on i.concat_row = e.concat_row
where e.concat_row is null
   or i.concat_row is null;
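Before trusting an empty result, I'd also sanity-check the row counts on both sides, since a row duplicated in one file but not the other can still join cleanly and slip past the compare. This check is my addition rather than part of the original post:

-- Row counts should match on both sides
-- (header rows are included in the counts if HEADER = TRUE was used):
select
  (select count(*)
   from @int_file_import/member_roster_202212.txt.gz
     (file_format => 'member_roster_tsv')) as import_rows,
  (select count(*)
   from @int_file_export/member_roster_202212.txt.gz
     (file_format => 'member_roster_tsv')) as export_rows;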
What I love about this approach is that you might not care about the order of the columns, or the sort order of the data in the file, or maybe there's one field you aren't bringing through because it isn't relevant. You can still generate the file in whatever column or sort order you want and STILL be able to compare quickly.
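For example (a hypothetical variation on the query above), if the export happened to write LAST_NAME before FIRST_NAME, you would just swap the positional references on that side so the concatenations line up field for field:

-- Hypothetical: swap $1 and $2 on the export side only, so both
-- concat_row values are built in the same FIRST_NAME|LAST_NAME|... order:
select $2::VARCHAR || '|' || $1::VARCHAR || '|' || $3::VARCHAR as concat_row
from @int_file_export/member_roster_202212.txt.gz
  (file_format => 'member_roster_tsv');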