Which are the different Lookup Cache Modes Available in SSIS?
- ‘Full cache’ is the default cache mode . Here, during the pre-execute phase of the data flow, the database is queried. The entire reference set is being pulled into memory.
- This method uses the most memory, and adds additional startup time for your data flow because all of the caching processes takes place before any rows are read from the data flow source(s).
- One important feature is that the lookup will not swap memory out to disk, hence your data flow will fail if you run out of memory.
- In this mode, At the beginning of the data flow, the lookup cache starts off empty. The lookup transform checks its cache for the matching values when a new row comes in.
- It queries the database if no match is found. The values are cached so they can be used the next time a matching row comes in If the match is found at the database.
- The startup time using a partial cache mode is less than it would be for a full cache since no caching is done during the pre-execute phase.
- However, as you will most likely be hitting the database more often your lookup operations would be slower.
- you can configure the maximum size of the cache when running in partial cache mode. There are two separate values – one for 64bit execution, and another for 32bit.
- The lookup transform will start dropping the least seen rows from the cache to make room for the new ones If the cache gets filled up.
- In this mode the lookup transform doesn’t maintain a lookup cache . This means that you’ll be hitting the database for every row in most situations, .
Cache mode is used when:
- When you’re processing a small number of rows
- When you have non-repeating lookup indexes
- When your reference table is changing (inserts, updates, deletes)
- When you have severe memory limitations
FULL CACHE : mode setting is the default cache mode selection in the SSIS lookup transformation. Like the name implies, full cache mode will cause the lookup transformation to retrieve and store in SSIS cache the entire set of data from the specified lookup location. As a result, the data flow in which the lookup transformation resides will not start processing any data buffers until all of the rows from the lookup query have been cached in SSIS.
The most commonly used cache mode is the full cache setting, and for good reason. The full cache setting has the most practical applications, and should be considered the go-to cache setting when dealing with an untested set of data. With a moderately sized set of reference data, a lookup transformation using full cache mode usually performs well. Full cache mode does not require multiple round trips to the database, since the entire reference result set is cached prior to data flow execution.
There are a few potential gotchas to be aware of when using full cache mode. First, you can see some performance issues – memory pressure in particular – when using full cache mode against large sets of reference data. If the table you use for the lookup is very large (either deep or wide, or perhaps both), there’s going to be a performance cost associated with retrieving and caching all of that data. Also, keep in mind that when doing a lookup on character data, full cache mode will always do a case-sensitive (and in some cases, space-sensitive) string comparison even if your database is set to a case-insensitive collation. This is because the in-memory lookup uses a .NET string comparison (which is case- and space-sensitive) as opposed to a database string comparison (which may be case sensitive, depending on collation). There’s a relatively easy workaround in which you can use the UPPER() or LOWER() function in the pipeline data and the reference data to ensure that case differences do not impact the success of your lookup operation. Again, neither of these present a reason to avoid full cache mode, but should be used to determine whether full cache mode should be used in a given situation.
Full cache mode is ideally useful when one or all of the following conditions exist:
- The size of the reference data set is small to moderately sized
- The size of the pipeline data set (the data you are comparing to the lookup table) is large, is unknown at design time, or is unpredictable
- Each distinct key value(s) in the pipeline data set is expected to be found multiple times in that set of data
When using the partial cache setting, lookup values will still be cached, but only as each distinct value is encountered in the data flow. Initially, each distinct value will be retrieved individually from the specified source, and then cached. To be clear, this is a row-by-row lookup for each distinct key value(s).
This is a less frequently used cache setting because it addresses a narrower set of scenarios. Because each distinct key value(s) combination requires a relational round trip to the lookup source, performance can be an issue, especially with a large pipeline data set to be compared to the lookup data set. If you have, for example, a million records from your pipeline data source, you have the potential for doing a million lookup queries against your lookup data source (depending on the number of distinct values in the key column(s)). Therefore, one has to be keenly aware of the expected row count and value distribution of the pipeline data to safely use partial cache mode.
Using partial cache mode is ideally suited for the conditions below:
- The size of the data in the pipeline (more specifically, the number of distinct key column) is relatively small
- The size of the lookup data is too large to effectively store in cache
- The lookup source is well indexed to allow for fast retrieval of row-by-row values
As you might guess, selecting no cache mode will not add any values to the lookup cache in SSIS. As a result, every single row in the pipeline data set will require a query against the lookup source. Since no data is cached, it is possible to save a small amount of overhead in SSIS memory in cases where key values are not reused. In the real world, I don’t see a lot of use of the no cache setting, but I can imagine some edge cases where it might be useful.
As such, it’s critical to know your data before choosing this option. Obviously, performance will be an issue with anything other than small sets of data, as the no cache setting requires row-by-row processing of all of the data in the pipeline.
I would recommend considering the no cache mode only when all of the below conditions are true:
- The reference data set is too large to reasonably be loaded into SSIS memory
- The pipeline data set is small and is not expected to grow
- There are expected to be very few or no duplicates of the key values(s) in the pipeline data set (i.e., there would be no benefit from caching these values)
SSIS Lookup transformation is used to find/validate existing reference record. This can be implemented in 3 modes.
- Full Cache – Referring dataset will be placed in memory before validation
- Partial Cache – Referring dataset will be accessed on run time and accessed records will be cached
- No Cache – Referring dataset will be accessed on run time for each record