Today, I learned something rather valuable. If you have a lookup component in a data flow package within a for each loop container that looks up a value in a database table, it might be a good idea to not turn on caching for that component. In this post, I walk you through my package that takes over 9 hours to process 8,000 rows. Join me after the jump for the walk of shame…
So I have this package that pulls data from an API. The data is IT service ticket data, and there is an object that retrieves all the time logged against a single ticket. There is no option for me to tell the API to return all the time logs with their associated ticket numbers, so I need to loop through all the ticket numbers and check for time logs row by agonizing row. So I have a Foreach Loop Container that has a single data flow that pulls the time log data for the current ticket number being processed by the loop, does a lookup in the local database to see if that time log exists, and then chooses to insert a new record or update existing record based on the results of that lookup.
Without thinking I left the lookup with the default options. I ran my package in Visual Studio on my local workstation. It started out fine, but as time went on performance slowly degraded. In looking at the output window, I see an ever-increasing amount of new rows being added to the cache.
Information: 0x40209340 at Extract Ticket Time Logs, Lookup : The number of unique rows added to the cache is 6401.
This occurs everytime the loop is executed. Also, memory usage being used on my dev workstation has slowly climbed up to 1.1 GB of memory being used by Visual Studio.
I think this is where the slow performance is coming from, Billy.
In order to fix this issue, I set the lookup to run with no cache. My hope is that the large amounts of caching are what was causing the problem.
So When to Use Caching?
In this instance, I needed my lookup to read every single row in the data table every single time it ran to check for existing ID numbers. By using full cache option on my lookup component, I was holding a copy of the rows returned via the lookup for every single execution of the loop in memory. This caused my job to run for over nine hours to process 8,292 rows!
My hope is that by changing to no caching the memory usage stays low and the loop executes just as fast on the 8,000th as it did on the first time because memory usage wouldn’t balloon up. Based on this, I would say caching is best left to one off lookups, and turn caching off if you are going to be hitting the database repeatedly retrieving new records.