Last week I wrote a post about not knowing what to right about. Doug Lane stopped by to give me some good advice, and it got me thinking. I am fairly decent (at least, I am when my imposter syndrome isn’t flaring up) at exploring databases and finding relationships between the tables. So in this post I am going to layout the process I follow to perform exploratory data analysis and how I approach extracting data from databases.
Full Impulse – Look for Vendor Created Views
If you have a vendor created view in your database, your job is mostly done. Views typically contain multiple joins, and in many cases, are useful as written by the vendor.
Warp One – Identifying Interesting Tables
In this step, I am looking over all the tables in a database to see which ones contain numerical data. I typically look for tables that are named after nouns associated with business processes such as Transactions or Sales or Purchases. All of these nouns indicate something happened, and data about what happened is a lot more interesting that explanatory data such as Accounts or Customers.
Warp Two – Primary and Foreign Keys
In this step, we identify our primary key column. The primary key column is important because it is unique per event so we can COUNT DISTINCT on this column to count individual events. The foreign keys are important because they explicitly define relationships. Typically foreign keys will be defined to link events to explanatory data that provides more information about the event. If you are familiar with dimensional modeling, think fact and dimensions here.
Warp Three – Implied Relationships
If we reach this step that means we do not have any foreign key constraints defined on the tables and no vendor views to give us a clear view (pardon the pun) of the data. This is perhaps the most labor intensive step in the whole process. Since we have to infer the relationships between the tables, we need to start looking for clues. The first clue I look for is identical column names in two different tables. For example, take the column AccountNo in dbo.Accounts and dbo.Transactions. In my head, this would clue me in to join the Accounts table to the Transactions table on the AccountNo column.
Candidate columns for this sort of implied relationships typically have “ID” or “Key” in their name, and have matching data types which is typically int or bigint. In typical fashion, I typically over use the word typically, typically when giving examples so if you are the typical reader you would typically gloss over this or typically you would just bear with me in this typical example.
Warp Four – Diagramming the Relationships
In this step, we get a big, big whiteboard and we start drawing the relationships. I typically write the table name, then draw a line under it and list the key columns. Lines connect the key columns in the various tables, and when lines intersect I curve one line over the other so you can follow it easier.
Warp Five – Test the Relationship
In this step, I dump the data into a tool such as PowerBI and create the relationships. I use the various visuals to make sure that the data makes sense. For example, with a table such as Transactions and Customers, I would expect to see variance when looking at the SUM of the transactions over Customers. If the bars make a straight line, then we need to investigate further. As an analyst, this is largely a gut feel. If it passes muster by me, I typically share the visual with a few business users for their opinion on it. Sometimes I am right, sometimes I am wrong and can use the feedback to further refine the data model.
More Power, Scotty!
Once I have a model that passes testing by me and the business users, I amp up the power and create a view of that data in a separate database on the same server. This ensures that future upgrades to the system don’t destroy my view as well as allowing me to abstract the column names so schema changes don’t break my reporting and analysis tools.
How do you explore a foreign database when you have no map or data dictionary to consult?