Welcome to the first post in the “How I” files. This is a series of posts that details how I addressed specific problems and objectives in my particular environment. These posts won’t always represent the absolute best solutions to the problems at hand, and you may know of a better solution – and that is OK! Be sure to share your solution in the comments so we can all learn together. In this file, we look at how I put together a complete end-to-end hybrid (well, mostly on-premises) business intelligence solution. This is a 50,000-foot view of the solution – if you want to see specifics, let me know!
The System Design
The following image is a picture of a whiteboard drawing I did. Yes, I have a large whiteboard in my area that I use all the time, and yes, it is as amazing as it sounds. This image describes the general data flow from source operational systems to the analytical data consumers and shows the major components in the chain.
The major components of this system are the source systems, the operational data store (ODS), ETL, staging, the data warehouse, data mining, data models, reporting, and PowerBI. Each of these components is discussed in more detail below.
Source Systems
In this particular environment – well, at least this iteration of it – I have to deal with various source systems that include both on-premises sources such as Active Directory and cloud sources such as our IT ticket system, which exposes data via a JSON data feed. For this iteration, I am only dealing with IT service ticket data for our help desk, so I don’t have any ERP systems or other traditional data sources.
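To make the cloud side concrete, here is a minimal sketch of pulling tickets from a JSON feed into rows an ODS mirror could load. The endpoint path, bearer-token auth, and field names (`id`, `status`, `createdAt`) are all my own assumptions for illustration – the post doesn’t specify the ticket system’s API.

```python
# Sketch of extracting tickets from a cloud help-desk JSON feed.
# Endpoint, auth scheme, and field names are hypothetical examples,
# not the actual ticket system's API.
import json
from urllib.request import Request, urlopen


def flatten_ticket(ticket):
    """Keep only the columns the ODS mirror needs (assumed field names)."""
    return {
        "id": ticket["id"],
        "status": ticket["status"],
        "created": ticket["createdAt"],
    }


def fetch_tickets(base_url, api_key, page=1):
    """Fetch one page of tickets from a hypothetical /tickets.json endpoint."""
    req = Request(
        f"{base_url}/tickets.json?page={page}",
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urlopen(req) as resp:
        payload = json.load(resp)
    return [flatten_ticket(t) for t in payload["tickets"]]
```

In practice an SSIS package (or a script task inside one) would do this pull on a schedule and land the rows unchanged in the ODS.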
The ODS
The ODS serves as an exact mirror of the data as it exists in the source systems. I use it both to limit the load on the source systems during extractions and to support operational reports that our current software systems do not provide. The ODS exists as a set of databases named in the following convention: ODS_<division>_<system_name>. One example of an ODS name is ODS_IT_TeamworkDesk.
I know that having multiple databases makes backup and recovery a bit more complex, but my servers all exist as virtual machines in our data center and are backed up at that level. I will be maintaining my own SQL backups of everything to make minor recovery easier, but if a major event happens, I am covered by the VM level backups. I also like the organization that having multiple databases affords me by being able to filter on division, role, and system name using SQL Server Management Studio.
My ODS resides on its own dedicated SQL Server instance on a shared server.
ETL
This server exists to house and execute all the ETL packages. It handles the scheduling and orchestration of the packages and hosts the databases that hold ETL logging information. This server is provisioned with extra memory and CPU cores to meet the ETL processing needs. SQL Server Integration Services is the ETL tool of choice, and my design goal is to do as much of the ETL within Integration Services as possible rather than using SSIS merely as a workflow tool for executing T-SQL code.
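As a sketch of what the ETL logging databases might capture per package run, here is one possible log record. The field names and the table shape are my own illustration, not the author’s actual schema.

```python
# Illustrative shape of one row in an ETL logging table.
# Field names are assumptions, not the post's actual schema.
from dataclasses import dataclass
from datetime import datetime


@dataclass
class EtlRunLog:
    package_name: str          # SSIS package that executed
    started_at: datetime
    finished_at: datetime
    rows_extracted: int
    status: str = "Succeeded"  # or "Failed"

    @property
    def duration_seconds(self) -> float:
        """Elapsed run time, handy for spotting slow packages."""
        return (self.finished_at - self.started_at).total_seconds()
```

Logging at this grain makes it easy to report on package durations and failures straight from the logging database.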
Staging
This is a SQL Server instance on the same server as the ODS instance. Its purpose is to hold all of my data in staging before it is loaded into the data warehouse. I chose to use a separate instance for a few reasons. The first is resource management. Because my ODS, staging, and data warehouse environments are all on the same server, it is important that I can allocate the bulk of the memory to my data warehouse environment first, then to my staging environment, then to my ODS environment. I know that because my servers are all virtual machines I can just crank up the specs (well, at least in theory), but I still prefer separate instances for other reasons. Primarily, it keeps my SQL Server Management Studio from becoming cluttered with tons of databases, as I am using multiple databases at each layer. It also makes managing security a bit easier as the business intelligence team grows and more devs get hired on for specific functions such as ODS, staging, or data warehouse development.
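The memory-priority idea (warehouse first, then staging, then ODS) can be sketched as a simple split of the server’s total memory into per-instance max-memory targets. The 60/25/15 split is an assumed example, not a ratio stated in the post – the real values would be tuned and set via each instance’s max server memory setting.

```python
# Assumed priority split: data warehouse > staging > ODS.
# These weights are illustrative, not the author's actual numbers.
WEIGHTS = {"dw": 0.60, "staging": 0.25, "ods": 0.15}


def memory_split(total_gb, weights=WEIGHTS):
    """Return a per-instance max-memory target (GB) for the shared server."""
    return {name: round(total_gb * share) for name, share in weights.items()}
```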
Databases in the staging environment follow this naming convention: Staging_<division>. For the IT staging database, we would see Staging_IT.
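The ODS and staging naming conventions described above are simple enough to express as two tiny helpers, using the post’s own examples:

```python
# The two naming conventions from the post:
#   ODS:     ODS_<division>_<system_name>
#   Staging: Staging_<division>
def ods_db_name(division, system_name):
    return f"ODS_{division}_{system_name}"


def staging_db_name(division):
    return f"Staging_{division}"
```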
Data Warehouse
This is possibly one of the most important components of the system. It exists as a separate instance on the same server as the staging and ODS components. The mission of the data warehouse is to store cleansed and standardized dimensionally modeled data. No calculations or business logic are implemented in this layer, as I prefer to use DAX and Tabular models to provide that functionality. This is simply a repository of cleaned data from the source systems to support the calculations and analysis presented to the end user in the data model layer.
Within this layer, each division of the company gets its own database, as departments within each division typically share data with each other. The database boundary makes it easier to secure the data on the off chance an end user needs direct access to this layer, and it also makes development a bit more modular in that I can promote one database to production while keeping the other databases in testing or development. With a single-database approach, every promotion from testing to production carries a small chance of impacting other areas.
Conformed dimensions (those dimensions that are shared across the organization) are housed in their own database with the divisional databases pulling that data in via a view.
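The view-over-a-conformed-dimension pattern could be generated with something like the following. The database name `ConformedDimensions`, the `dbo` schema, and the `DimDate` example are all my own placeholders – the post only says that conformed dimensions live in their own database and reach the divisional databases through views.

```python
# Sketch: emit the T-SQL that exposes a conformed dimension inside a
# divisional database via a view. Names here are illustrative placeholders.
def conformed_dim_view(dim_table, conformed_db="ConformedDimensions"):
    return (
        f"CREATE VIEW dbo.{dim_table} AS\n"
        f"SELECT * FROM {conformed_db}.dbo.{dim_table};"
    )
```

Because each divisional database only holds a view, the dimension is maintained once and every division reads the same rows.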
Data Mining
This component is not slated for immediate implementation, but the plan does allow for it. Data mining (or machine learning) can be useful for finding complicated patterns in our data, but we just don’t have a need for it in this iteration. However, as BI expands to other parts of the organization, I can certainly see a need for it.
Data Models
This layer is on its own dedicated server because we are implementing SQL Server Analysis Services Tabular models. These are in-memory models, so I located them on their own server to avoid memory pressure; I don’t have to worry about other processes taking memory away from SSAS and degrading the user experience.
Reporting
This layer provides the paginated reports that managers want delivered to their inboxes but likely never look at. All of the system’s paginated reports will end up here. Other reporting tools, including dashboards and KPIs, will be housed in Excel or PowerBI.com.
PowerBI / Excel
This component houses the non-paginated reports and dashboards that I build for the end users, or that end users build for themselves. Data for these objects is sourced only from the data models layer to take advantage of consistent business logic and terminology.
And that’s it – a very high-level look at how I put together my business intelligence environment.