leo_wyn
作者leo_wyn·2020-04-29 10:30
商业智能工程师·Security

Filtered Indexes and INCLUDEd Columns

字数 13005阅读 1006评论 0赞 0

Filtered Indexes and INCLUDEd Columns

March 29, 2019 by Rob Farley in SQL Indexes, SQL Performance, T-SQL Queries | 1 Comment

Filtered indexes are amazingly powerful, but I still see some confusion out there about them – particularly about the

columns that are used in the filters, and what happens when you want to tighten the filters.

A recent question on dba.stackexchange asked for help about why columns used in the filter of a filtered index

should be included in the 'included' columns of the index. Excellent question – except that I felt like it started on a

poor premise, because those columns shouldn't have to be included in the index. Yes they help, but not in the

way that the question seemed to suggest.

To save you looking at the question itself, here's a quick summary:

To satisfy this query…

SELECT Id, DisplayName FROM Users WHERE Reputation > 400000;

…the following filtered index is pretty good:

But despite having this index in place, the Query Optimizer recommends the following index if the filtered value is

tightened to, say, 450000.

CREATE NONCLUSTERED INDEX IndexThatWasMissing ON dbo.Users ( Reputation ) INCLUDE ( DisplayName, Id );

I'm paraphrasing the question a little here, which starts by referring to this situation and then builds a different

example, but the idea is the same. I just didn't want to make things more complicated by involving a separate table.

Point is – the index suggested by the QO is the original index but turned on its head. The original index had

Reputation in the INCLUDE list, and DisplayName and Id as key columns, while the new recommended index is the

opposite way around with Reputation as the key column and DisplayName & ID in the INCLUDE. Let's look into why.

The question refers to a post by Erik Darling, where he explains that he tuned the '450,000' query above by putting

Reputation into the INCLUDE column. Erik shows that without Reputation in the INCLUDE list, a query which filters

to a higher value of Reputation needs to do Lookups (bad!), or maybe even give up entirely on the filtered index

(potentially even worse). He concludes that having the Reputation column in the INCLUDE list lets SQL have

statistics, so that it can make better choices, and shows that with Reputation in the INCLUDE a variety of queries

that all filter on higher Reputation values all scan his filtered index.

In an answer to the dba.stackexchange question, Brent Ozar points out that Erik's improvements aren't particularly

great because they cause Scans. I'll come back to that one, because it's an interesting point in itself, and somewhat

incorrect.

CREATE UNIQUE NONCLUSTERED INDEX Users_400k_Club ON dbo.Users ( DisplayName, Id ) INCLUDE ( Reputation ) WHERE R

Home Archives Authors Categories Newsletters About Search... GO

10/24/2019 Filtered Indexes and INCLUDEd Columns - SQLPerformance.com

https://sqlperformance.com/2019/03/sql-performance/filtered-indexes-and-included-columns 2/4

First let's think a little about indexes in general.

An index provides an ordered structure to a set of data. (I could be pedantic and point out that reading through the

data in an index from start to end might jump you from page to page in a seeming haphazard way, but still as

you're reading through pages, following the pointers from one page to the next you can be confident the data is

ordered. Within each page you might even jump around to read the data in order, but there is a list showing you

which parts (slots) of the page should be read in which order. There really is no point in my pedantry except to

answer those equally pedantic who will comment if I don't.)

And this order is according to the key columns – that's the easy bit that everyone gets. It's useful not only for being

able to avoid re-ordering the data later, but also for being able to quickly locate any particular row or range of rows

by those columns.

The leaf levels of the index contain the values in any columns in the INCLUDE list, or in the case of a Clustered

Index, the values across all the columns in the table (except non-persisted computed columns). The other levels in

the index contain just the key columns and (if the index is not unique) the unique address of the row – which is

either the keys of the clustered index (with the row's uniquifier if the clustered index isn't unique either) or the

RowID value for a heap, enough to allow easy access to all the other column values for the row. The leaf levels also

include all the 'address' information.

But that's not the interesting bit for this post. The interesting bit for this post is what I mean by "to a set of data".

Remember I said "An index provides an ordered structure to a set of data".

In a clustered index, that set of data is the whole table, but it could be something else. You can probably already

picture how most non-clustered indexes don't involve all the columns of the table. This is one of the things that

make non-clustered indexes so useful, because they're typically a lot smaller than the underlying table.

In the case of an indexed view, our set of data it could be the results of a whole query, including joins across many

tables! That's for another post.

But in a filtered index, it's not just a copy of a subset of columns, but a subset of rows too. So in the example here,

the index is only across the users with more than 400k reputation.

This index takes the users who have more than 400k reputation, and orders them by DisplayName and Id. It can be

unique because (assumedly) the Id column is already unique. If you try something similar on your own table, you

might need to be careful of that.

But at this point, the index doesn't care what the Reputation is for each user – it just cares whether the Reputation

is high enough to be in the index or not. If a user's reputation gets updated and it tips over the threshold that

user's DisplayName and Id will be inserted into the index. If it drops below, it'll get deleted from the index. It's just

like having a separate table for the high rollers, except that we get people into that table by increasing their

Reputation value over the 400k threshold in the underlying table. It can do this without having to actually store the

Reputation value itself.

So now if we want to find people who have a threshold over 450k, that index is missing some information.

Sure, we could confidently say that everyone we'll find is in that index – but the index doesn't contain enough

information in itself to filter any further on Reputation. If I told you I had an alphabetical list of Best Picture Oscar

winning films from the 1990s (American Beauty, Braveheart, Dances With Wolves, English Patient, Forrest Gump,

Schindler's List, Shakespeare in Love, Silence of the Lambs, Titanic, Unforgiven), then I can assure you that the

CREATE UNIQUE NONCLUSTERED INDEX Users_400k_Club_NoInclude ON dbo.Users ( DisplayName, Id ) WHERE Reputation &g

10/24/2019 Filtered Indexes and INCLUDEd Columns - SQLPerformance.com

https://sqlperformance.com/2019/03/sql-performance/filtered-indexes-and-included-columns 3/4

winners for 1994-1996 would be a subset of those, but I can't answer the question without first getting some more

information.

Obviously my filtered index would be more useful if I had included the year, and potentially even more so if the year

were a key column, since my new query wants to find the ones for 1994-1996. But I probably designed this index

around a query to list all the movies from the 1990s in alphabetical order. That query doesn't care about what the

actual year is, only whether it's in the 1990s or not, and I don't even need to return the year – just the title – so I can

scan my filtered index to get the results. For that query I don't even need to reorder the results or find the starting

point – my index really is perfect.

A more practical example of not caring about the value of the column in the filter is on status, such as:

WHERE IsActive = 1

I frequently see code that moves data from one table to another when rows stop being 'active'. People don't want

old rows cluttering up their table, and they recognise that their 'hot' data is only a small subset of all their data. So

they move their cooling data off into an Archive table, keeping their Active table small.

A filtered index can do this for you. Behind the scenes. As soon as you update the row and change that IsActive

column to something other than 1. If you only care about having active data in most of your indexes, then filtered

indexes are ideal. It'll even bring rows back into the indexes if the IsActive value changes back to 1.

But you don't need to put IsActive in the INCLUDE list to achieve this. Why would you want to store the value – you

already know what the value is – it's 1! Unless you're asking to return the value you shouldn't need it. And why

would you return the value when you already know that the answer is 1, right?! Except that frustratingly, the

statistics that Erik refers to in his post will take advantage of being in the INCLUDE list. You don't need it for the

query, but you should include it for the stats.

Let's think about what the Query Optimizer needs to do to figure out the usefulness of an index.

Before it can do much at all, it needs to consider whether the index is a candidate. No point in using an index if it

doesn't have all the rows that might be needed – not unless we have an effective way of getting the rest. If I want

movies from 1985-1995, then my index of 1990s movies is pretty pointless. But for 1994-1996, maybe it's not bad.

At this point, just like any index consideration, I need to think about whether it will help enough for finding the data

and getting it into an order that will help executing the rest of the query (possibly for a Merge Join, Stream

Aggregate, satisfying an ORDER BY, or various other reasons). If my query filter matches the index filter exactly,

then I don't need to filter any further – just using the index is enough. This sounds great, but if it doesn't match

exactly, if my query filter is tighter that the index filter (like my 1994-1996 example, or Erik's 450,000), I'm going to

need to have those Year values or Reputation values to check – hopefully getting them either from the INCLUDEd at

the leaf level or somewhere in my key columns. If they're not in the index, I'm going to have to do a Lookup for

every row in my filtered index (and ideally, have an idea about how many times my Lookup will be called, which are

the stats that Erik wants the column included for).

Ideally, any index I plan to use is ordered correctly (via the keys), INCLUDEs all the columns I need to return, and is

pre-filtered to just the rows I need. That would be the perfect index, and my execution plan will be a Scan.

That's right, a SCAN. Not a Seek, but a Scan. It'll start on the first page of my index and keep giving me rows until

I've got as many as I need, or until there are no more rows to return. Not skipping any, not sorting them – just

giving me the rows in order.

A Seek would suggest that I don't need the whole index, which means I'm wasting resources in maintaining that

part of the index, and to query it I have to find the starting point and keep checking rows to see if I've hit the end or

10/24/2019 Filtered Indexes and INCLUDEd Columns - SQLPerformance.com

https://sqlperformance.com/2019/03/sql-performance/filtered-indexes-and-included-columns 4/4

Share this post: Facebook Twitter LinkedIn

not. If my Scan has a Predicate, then sure, I'm having to look through (and test) more data than I need to, but if my

index filters are perfect, then the Query Optimizer should recognise that and not have to perform those checks.

Final Thoughts

INCLUDEs aren't critical for filtered indexes. They're useful for providing easy access to columns which might be

useful for your query, and if you happen to be tightening what's in your filtered index by any column, whether it's

mentioned in the filter or not, you should consider having that column in the mix. But at that point you ought to be

asking whether your index's filter is the right one, what else you should have in your INCLUDE list, and even what

the key column(s) should be. Erik's queries weren't playing nicely because he needed information that wasn't in the

index, even though he had mentioned the column in the filter. He found a good use for the statistics too, and I

would still encourage you to include the filter columns for that reason. But putting them in an INCLUDE doesn't

allow them to suddenly start doing a Seek, because that's not how any index works, whether filtered or not.

I want you, reader, to understand filtered indexes really well. They're incredibly useful and, when you start to

picture them like tables in their own rights, can become part of your overall database design. They're also a reason

for always using the ANSI_NULLs and QUOTED_IDENTIFIER settings, because you'll get errors from filtered index

unless those settings are ON, but hopefully you already make sure they're always on anyway.

Oh, and those movies were Forrest Gump, Braveheart, and The English Patient.

如果觉得我的文章对您有用,请点赞。您的支持将鼓励我继续创作!

0

添加新评论0 条评论

Ctrl+Enter 发表

作者其他文章

相关文章

相关问题

相关资料

X社区推广