One of the challenges of being a one-person BI team is that you have to do everything: gather requirements, select the software, architect the physical infrastructure, and build and support the system. This is not an easy task, especially when you need to keep pace with excited end users who are used to working with manually aggregated data in Excel workbooks (and no charts). One of the biggest areas I struggle with is the ETL system. No surprise, considering that 70% of BI project effort is expended on the ETL system.
The ETL Files will be a series of blog posts documenting how I tackle this beast once and for all. In this particular post, I am going to discuss the current state of my ETL development skills and my goals for the ETL system, and hopefully a few of my readers can share their opinions or links to quality tutorials and videos.
The Current State of the ETL
Right now all of my ETL jobs are built from scratch using SSIS or T-SQL stored procedures. Each project has its own standalone data mart, which has me thinking that this is probably not the optimal way to store conformed dimensions, but that is a different problem for a different day. None of my ETL jobs capture statistics about performance, such as run times and statuses. None of them are metadata driven, meaning I have to open each job and manually change items such as table names. And none of them perform any auditing of results (such as checking that the number of rows extracted equals the number of rows inserted, updated, or marked as duplicates). None of this makes me happy, and none of it leaves me feeling like BI is in a state where it can be introduced to the organization as a whole.
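To make that auditing goal concrete, here is a rough sketch of the kind of balance check I have in mind, assuming a hypothetical etl.BatchLog table where each run records its row counts (all table and column names here are illustrative, not from an existing system):

```sql
-- Hypothetical balance check: for a given batch, the rows extracted
-- should equal the rows inserted + updated + marked as duplicates.
DECLARE @AuditKey int = 42;  -- the batch being audited (illustrative)

SELECT
    b.AuditKey,
    b.RowsExtracted,
    b.RowsInserted + b.RowsUpdated + b.RowsMarkedDuplicate AS RowsAccountedFor,
    CASE WHEN b.RowsExtracted =
              b.RowsInserted + b.RowsUpdated + b.RowsMarkedDuplicate
         THEN 'Balanced' ELSE 'Investigate' END AS AuditStatus
FROM etl.BatchLog AS b
WHERE b.AuditKey = @AuditKey;
```

Nothing fancy, but even a check this simple would catch rows silently dropped between extract and load.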
My SSIS packages are all hand-built, so I spend a lot of time duplicating design patterns such as SCD Type 2 dimensions and accumulating snapshot fact tables. Any logging is also repeated in each package. For one ETL job that feeds 1 fact table and 5 dimensions, I have close to 30 packages implementing the logic. This is a big part of what is driving me to re-architect the ETL system, document it, and build a framework to standardize it, so I can spend more time on high-value items instead of repeating past development efforts.
Source control with versioning? What’s that?
What I Would Like the Future State to Be
I would be absolutely thrilled to only implement a design pattern once and then use metadata to execute it multiple times. For example, I code an SCD Type 2 dimension once and then use metadata and variables to execute it for each SCD Type 2 dimension in the data warehouse. This would allow me to implement a logging structure that gets followed for each task in the process. It would make auditing functions easier to implement. Finally, it would make the whole ETL system easier to maintain as I would only need to make changes a limited number of times.
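As a rough sketch of what I mean by metadata driven (all names below are hypothetical): the SCD Type 2 pattern would be parameterized by rows in a control table, and a single generic package, driven by an SSIS ForEach loop or a T-SQL cursor, would execute once per row instead of being copied per dimension.

```sql
-- Hypothetical control table: one row per SCD Type 2 dimension.
-- One generic load routine reads these rows and applies the same
-- pattern to each dimension, rather than duplicating 30 packages.
CREATE TABLE etl.DimensionLoad
(
    DimensionLoadId   int IDENTITY(1,1) PRIMARY KEY,
    SourceTable       sysname NOT NULL,        -- e.g. 'stg.Customer'
    TargetTable       sysname NOT NULL,        -- e.g. 'dbo.DimCustomer'
    BusinessKeyColumn sysname NOT NULL,        -- natural key used for lookups
    Scd2Columns       nvarchar(max) NOT NULL,  -- attributes that trigger a new row version
    IsEnabled         bit NOT NULL DEFAULT 1   -- switch a dimension off without code changes
);
```

Adding a new Type 2 dimension then becomes an INSERT into this table instead of a new package.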
Another thing I would like is to be able to put the ETL in source control and branch it more easily. I want to maintain test, QA, and production environments to have better control of the SDLC (software development life cycle). However, and correct me if I am wrong, SSIS packages don't branch and merge very well. Proper version control would allow me to spin up a new environment at a specific version to develop a fix for a bug, or to compare an older version with a new one to see where a bug might have been introduced.
Data lineage is also important. I want to be able to take a row in a fact table and trace it back through the ETL transformations to the source data to ensure that the ETL is working according to the predefined rules. Performance matters too: the future version of the ETL needs to run as much in parallel as is feasible. And did I already mention that being metadata driven is a must, in case the environment ever changes from on-premises to Azure?
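One common approach to lineage, along the lines of Kimball's audit dimension, is to stamp every fact row with the key of the batch that loaded it. A sketch, again with hypothetical table and column names:

```sql
-- Stamp each fact row with the batch that loaded it (names illustrative).
ALTER TABLE dbo.FactSales
    ADD AuditKey int NOT NULL
        CONSTRAINT DF_FactSales_AuditKey DEFAULT (-1);

-- Any row can then be traced back to a logged extract:
SELECT f.SalesOrderNumber, b.SourceSystem, b.ExtractStartTime
FROM dbo.FactSales AS f
JOIN etl.BatchLog  AS b
    ON b.AuditKey = f.AuditKey;
```

With that in place, "where did this row come from?" becomes a join instead of an archaeology project.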
Finally, I need the ETL system to be flexible and handle a wide variety of data sources ranging from SQL databases to SharePoint lists to JSON Web API data. And that’s just in the IT department!
The Path Forward
Looking at the differences between the current state of the ETL and my desired future state, it looks like T-SQL alone won't get me there. T-SQL works well for source control, template-based design patterns, and being metadata driven, but it drops the ball hard when dealing with data sources other than SQL Server and Excel. Manually feeding data into the ETL process is not something I want to build into my system.
However, SSIS doesn't branch and version well. My SSIS skills are also lagging behind my T-SQL skills, as I simply have more experience in T-SQL. I also can't speak to SSIS's ability to control configuration and execution with metadata. Finally, can you use templates in SSIS with more elegance than just copying and pasting packages?
Ok, gang, I need your collective wisdom – what tool should I be using to build my ETL? The Kimball books are technology-neutral on this matter, so they are no help here. Microsoft products are preferred, as I already have the licenses for them. Additionally, I am still working towards that MVP award, so I kind of need to stick with the big MS on this one 😉 The data warehouse is being delivered as a SQL Server relational database – no Analysis Services yet. Reporting is done via SSRS for paginated reports and Power BI for dashboards and mobile reports.
I guess that kind of eliminates the "something else" option… so the question is T-SQL or SSIS? Is this an area where Biml is useful?