Sunday, February 19, 2012

How to Choose "Cache Size" for a Lookup Transformation?

Hello,

I searched around and have not been able to find any guidance on this question: if I am designing a lookup transformation, how do I decide what to set the cache size to?

For the transformation on which I am currently working, the size of the lookup table will be small (like a dozen rows), so should I just reduce it to 1MB? Or should I not even bother with caching for a lookup table that small?

Hypothetically speaking, if I am working with much larger lookups in the future (let's say 30,000 rows in the lookup table), is there some method or formula that I should use to figure out the best cache size?

If the cache size is set larger than the actual data being cached, is the entire cache size allocated in memory, or is the cache size managed to only be as large as it needs to be? In other words, is the cache size a maximum to which the cache will grow, or is it a preset that sets the cache for that lookup to exactly the specified size?

Thanks,
Dan Read
SSIS Blogging

Dan,

I would suggest that if you have such a small LOOKUP set then you should definitely cache it all. The reason is that it is not necessarily the size of the lookup set that determines the length of execution; it is the number of LOOKUPs that you will be doing. In other words, the number of rows in the pipeline.
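As a purely illustrative sketch of that point (the numbers below are invented, not from any real package): with no cache the lookup sends one query per incoming row, while a full cache loads the reference set once up front, so the cost tracks the pipeline row count rather than the lookup table size.

```python
# Illustrative numbers only: why pipeline row count, not lookup-set
# size, drives cost. With no cache, the lookup fires one query per
# incoming row; with a full cache, it loads the reference set once.
pipeline_rows = 1_000_000   # rows flowing through the data flow
lookup_rows = 12            # tiny reference table

queries_no_cache = pipeline_rows   # one round trip per row
queries_full_cache = 1             # one query to populate the cache

print(f"no cache: {queries_no_cache:,} queries; "
      f"full cache: {queries_full_cache} query")
```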

My very simplistic advice would be "If you can cache it all, do so".

It would be very nice if such a method/formula existed, but if it does I am not aware of it. As ever, the best advice I can offer is "test and measure"!

-Jamie

|||Hi Jamie,

Thanks for the input. You wrote:

As ever, the best advice I can offer is "test and measure"!


Do you know of a method to observe how the cache specifically is performing (that is, whether it is being filled consistently, or what hit/miss ratio it is experiencing), or does one have to extrapolate from general performance measurements?

Thanks again,
Dan

|||

Dan,

I'm afraid I don't, no.

A very crude method would be to count, using Profiler, how many queries get sent to the DB. You can then take that as a percentage of the total rows in the pipeline to work out the cache hit ratio. Not pretty, but it would work. I can't think of another way.
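To make that arithmetic concrete, here is a minimal sketch (Python, purely for illustration; the row and query counts are made up, and in practice they would come from the pipeline's row count and a Profiler trace):

```python
# Sketch of the crude method above: every query that reaches the DB
# is a cache miss, so the hit ratio is the complement of the miss rate.
def cache_hit_ratio(pipeline_rows: int, db_queries: int) -> float:
    """Hit ratio = 1 - (queries traced / rows through the pipeline)."""
    if pipeline_rows == 0:
        return 0.0
    return 1.0 - (db_queries / pipeline_rows)

# e.g. 1,000,000 rows through the pipeline, 15,000 queries traced
print(f"hit ratio: {cache_hit_ratio(1_000_000, 15_000):.1%}")  # 98.5%
```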

If there's anyone from MS reading, can I suggest you add a LOOKUP perfmon counter that would enable us to measure LOOKUP cache performance?

-Jamie

|||I would assume that a crude method for calculating the memory usage of a full-cache lookup would be the extended row length (row length * number of rows).
For example, if you have 10,000 rows in the source table and your lookup retrieves 2 columns (col1 being int and col2 being varchar(10)), then the total row length would be 4 bytes + 10 * 1 byte, or 14 bytes. The extended row length would be 10,000 * 14 bytes = 140,000 bytes, or about 0.13 MB. I'm not sure, but perhaps something similar was discussed in one of the Project REAL documents.

Although this isn't exactly correct, since there is most likely some overhead associated with the underlying objects used to implement the lookup component, it should be a good estimate.
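To put the formula in runnable form, here is a minimal sketch (Python, purely illustrative; the column sizes mirror the example above, and any real per-row or per-object overhead is ignored):

```python
# Back-of-the-envelope full-cache estimate: row length * row count.
# Column sizes are illustrative, matching the int + varchar(10) example.
COLUMN_BYTES = {"col1 (int)": 4, "col2 (varchar(10))": 10}

def estimate_cache_bytes(row_count: int, column_bytes: dict) -> int:
    row_length = sum(column_bytes.values())   # 14 bytes per row here
    return row_count * row_length

size = estimate_cache_bytes(10_000, COLUMN_BYTES)
print(f"{size:,} bytes ~= {size / (1024 * 1024):.2f} MB")
# -> 140,000 bytes ~= 0.13 MB
```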

Perhaps someone from MS can post recommendations or clarifications for the memory usage of the lookup component.

Larry Pope
