We implemented a data pipeline and used Snowflake tasks for orchestration. Immediately my mind started to think, how are we going to deploy this to the upper environments (QA and PROD)? The client does not have any automatic deployment, requires a QA Admin to deploy anything to QA (no developers are allowed to deploy) AND it has to be sql files that they execute so no cloning or other fancy Snowflake features.
As developers, we like to be able to develop code directly in the tool; I’m not going to write a sql file, apply it and then re-write it, re-apply until I get it correct. I’m going to go directly to Snowflake and write my sql statements and execute them to get the objects created for usage. However, in order to move this code to QA, we have to pull it back out to a flat sql file for the QA Admin to run (and it had to work and generate the exact objects we wanted to match our dev structure).
A database structure is not automatically checked into Git such that it can then be packaged and deployed like a web application. And for this project, we were not allowed to introduce any new tools (open source or enterprise).
Remember that for tasks, since they are allowed to have dependencies, the order that these tasks are applied in sql matters. Tasks have to be created in order such that if task A is dependent on task B, task B must be created before task A or the sql will fail. So for the first deployment, we opened a database visualization tool and pulled the ddl out manually for all the tasks and then by hand went through and made sure the dependencies were in the correct order. It was painful and would only get more complicated as subsequent releases would be much larger.
As long as the tasks exist in the DEV environment, there can be a programmatic way to pull them out, put them in the right order and provide a file to be deployed to the QA or PROD environment. I created a simple Snowpark function that did exactly this. This allows us to use DEV as a true development space and we can create, adjust, delete, test the pipelines until we get something we are satisfied with then run the Snowpark function to extract the sql to be handed over to the QA deployment admins.
In our architecture, DEV, QA and PROD are all three separate Snowflake databases with multiple schemas in each that have procedures orchestrated by tasks. The goal is to produce a sql of tasks in the correct order and then go through and find/replace to change the DEV references to QA.
Logical Outline:
- Get a list of all the tasks that exist in the designated schema (remember task dependencies have to all exist within one schema)
- Loop through the list of tasks and get the ddl for each of them
- Loop through the list of ddl for each task and use regular expressions to parse out the name of the task and a list of the dependent tasks
- Start creating a new list by looping through the list of parsed tasks
- Add the first item in the list to the new list
- With items 2-X,
– check if any of the dependencies in the item exist currently in the new list. If they do, add the item to the new list one spot down from the highest dependency. If there are no dependencies in the list, add the item to the bottom of the list.
– loop through the entire newly ordered list and move items around if any of the existing items have dependencies that are now above them in the list
Check out this GitHub repo to download the code and try it out!