SQL Server 2008 : Index design (part 2) - Improving nonclustered index efficiency

2. Improving nonclustered index efficiency

The accumulated cost of random I/O involved in key/RID lookups often leads to nonclustered indexes being ignored in favor of sequential I/O with clustered index scans. To illustrate this and explore options for avoiding the key lookup process, let's walk through a number of examples using the Person. Contact table in the sample AdventureWorks database. In demonstrating how SQL Server uses different indexes for different queries, we'll view the graphical execution plans, which use different icons, as shown in figure 1, to represent different actions (lookups, scans, seeks, and so forth).

Seek vs. scan

Several important terms are used when discussing index usage. An index seek is used when the query optimizer chooses to navigate through the levels of a clustered or nonclustered index B-tree to quickly reach the appropriate leaf level pages. In contrast, an index scan, as the name suggests, scans the leaf level, left to right, one page at a time.

Figure 1. Common icons used in graphical execution plans

The Person.Contact table, as defined below (abbreviated table definition), contains approximately 20,000 rows. For the purposes of this test, we'll create a nonunique, nonclustered index on the LastName column:

-- Create a contact table with a nonclustered index on LastName
CREATE TABLE [Person].[Contact](
   [ContactID] [int] IDENTITY(1,1) PRIMARY KEY CLUSTERED
   , [Title] [nvarchar](8) NULL
   , [FirstName] [dbo].[Name] NOT NULL
   , [LastName] [dbo].[Name] NOT NULL
   , [EmailAddress] [nvarchar](50) NULL
)
GO

CREATE NONCLUSTERED INDEX [ixContactLastName] ON [Person].[Contact]
   ([LastName] ASC)

GO

For our first example, let's run a query to return all contacts with a LastName starting with C:

-- Statistics indicate too many rows for an index lookup
SELECT *
FROM Person.Contact
WHERE LastName like 'C%'

Despite the presence of a nonclustered index on LastName, which in theory could be used for this query, SQL Server correctly ignores it in favor of a clustered index scan. If we execute this query in SQL Server Management Studio using the Include Actual Execution Plan option (Ctrl+M, or select from the Query menu), we can see the graphical representation of the query execution, as shown in figure 2.

Figure 2. A clustered index scan is favored for this query in place of a nonclustered index seek plus key lookup.

No great surprises here; SQL Server is performing a clustered index scan to retrieve the results. Using an index hint, let's rerun this query and force SQL Server to use the ixContactLastName index:

-- Force the index lookup with an index hint
SELECT *
FROM Person.Contact WITH (index=ixContactLastName)
WHERE LastName like 'C%'

Looking at the graphical execution plan, we can confirm that the index is being used, as per figure 3.

Figure 3. Adding an index hint to the previous query results in an index seek plus key lookup.

On a small database such as AdventureWorks, the performance difference between these two methods is negligible; both complete in under a second. To better understand how much slower the index lookup method is, we can use the SET STATISTICS IO option, which returns disk usage statistics^[] alongside the query results. Consider the script in listing 12.

^[] Not to be confused with index statistics, query statistics refer to disk usage, such as the number of pages read from buffer or physical disk reads.

Example 1. Comparing query execution methods

-- Compare the Disk I/O with and without an index lookup
SET STATISTICS IO ON
GO

DBCC DROPCLEANBUFFERS
GO

SELECT *
FROM Person.Contact
WHERE LastName like 'C%'

DBCC DROPCLEANBUFFERS

SELECT *
FROM Person.Contact with (index=ixContactLastName)
WHERE LastName like 'C%'
GO

This script will run the query with and without the index hint. Before each query, we'll clear the buffer cache using DBCC DROPCLEANBUFFERS to eliminate the memory cache effects. The STATISTICS IO option will produce, for each query, the number of logical, physical, and read-ahead pages, defined as follows:

Logical Reads—Represents the number of pages read from the data cache.
Physical Reads—If the required page is not in cache, it will be read from disk. It follows that this value will be the same or less than the Logical Reads counter.

Figure 4. Forcing a nonclustered index seek plus key lookup significantly increases the number of pages read.
Read Ahead Reads—The SQL Server storage engine uses a performance optimization technique called Read Ahead, which anticipates a query's future page needs and prefetches those pages from disk into the data cache. In doing so, the pages are available in cache when required, avoiding the need for the query to wait on future physical page reads.

So with these definitions in mind, let's look at the STATISTICS IO output in figure 4.

These statistics make for some very interesting reading. Note the big increase in logical reads (3326 versus 569) for the second query, which contains the (index= ixContactLastName) hint. Why such a big increase? A quick check of sys.dm_ db_index_physical_stats, reveals there are only 570 pages in the table/clustered index. This is consistent with the statistics from the query that used the clustered index scan. So how can the query using the nonclustered index read so many more pages? The answer lies in the key lookup.

What's actually occurring here is that a number of clustered index pages are being read more than once. In addition to reading the nonclustered index pages for matching records, each key lookup reads pages from the clustered index to compete the query. In this case, a number of the key lookups are rereading the same clustered index page. Clearly a single clustered index scan is more efficient, and SQL Server was right to ignore the nonclustered index.

Let's move on to look at an example where SQL Server uses the nonclustered index without any index hints:

SELECT *
FROM Person.Contact
WHERE LastName like 'Carter%'

The graphical execution plan for this query is shown in figure 5, and it confirms the index is being used.

We can see that of the overall query cost, 98 percent is the key lookup. Eliminating this step will derive a further performance increase. You'll note that in our queries so far we've been using select *; what if we reduced the required columns for the query to only those actually required and included them in the index? Such an index is called a covering index.

Figure 5. This particular query uses the nonclustered index without any hints. Note the major cost of the query is the key lookup at 98 percent.

Covering indexes

Let's assume we actually need only FirstName, LastName, and EmailAddress. If we created a composite index containing these three columns, the key lookup wouldn't be required. Let's modify the index to include the columns and rerun the query:

-- Create a covering index
DROP INDEX [ixContactLastName] ON [Person].[Contact]
GO
CREATE NONCLUSTERED INDEX [ixContactLastName] ON [Person].[Contact]
(
   [LastName] ASC
   , [FirstName] ASC
   , [EmailAddress] ASC
)
GO

SELECT LastName, FirstName, EmailAddress
FROM Person.Contact
WHERE LastName LIKE 'Carter%'

The execution plan from the query with the new index is shown in figure 6.

As you can see, the query is now satisfied from the contents of the nonclustered index alone. No key lookups are necessary, as all of the required columns are contained in the nonclustered index. In some ways, this index can be considered a mini, alternatively clustered version of the table.

Figure 6. Covering the index eliminates the key lookup, significantly improving query performance.

Figure 7. By listing the required columns in the select clause and including them in the nonclustered index, the key lookups are eliminated, with logical reads dropping from 279 to 5.

Confirming the improvement from a disk-statistics perspective, the logical reads drop significantly, from 279 to 5, as shown in figure 7.

Including additional columns in the nonclustered index to avoid the key lookup process makes it a covering index. While this is an excellent performance-tuning technique, the one downside is that the additional columns are included at all levels of the index (root, all intermediate levels, and the leaf level). In our query above, given that we're not using the additional columns as predicates, that is, where clause conditions, they're not required at any level of the index other than the leaf level to avoid the key lookup. In small indexes, this is not really an issue. However, for very large indexes, the additional space taken up by the additional columns at each index level not only increases the index size but makes the index seek process less efficient. The included columns feature, introduced in SQL Server 2005, enhances covering indexes in several ways.

Included columns

While they're a relatively simple and very effective performance-tuning mechanism, covering indexes are not without their limitations; there can be a maximum of 16 columns in a composite index with a maximum combined size of 900 bytes. Further, columns of certain data types, including n/varchar(max), n/varbinary(max), n/text, XML, and image cannot be specified as index key columns.

Recognizing the value of covering indexes, SQL Server 2005 and above circumvent the size and data type limitations through indexes with included columns. Such indexes allow additional columns to be added to the leaf level of nonclustered indexes. In doing so, the additional columns are not counted in the 16 column and 900 byte maximum, and additional data types are allowed for these columns (n/varchar(max), n/varbinary(max), and XML). Consider the following create index statement:

-- Include columns at the leaf level of the index
CREATE NONCLUSTERED INDEX ixContactLastName
ON Person.Contact (LastName)
INCLUDE (FirstName, EmailAddress)

Notice the additional INCLUDE clause at the end of the statement; this index will offer all the benefits of the previous covering index. Further, if appropriate, we could add columns with data types not supported in traditional indexes, and we wouldn't be restricted by the 16-column maximum.

When deciding whether to place a column in the index definition as a key column or as an included column, the determining factor is whether the column will be used as a predicate, that is, a search condition in the where clause of a query. If a column is added purely to avoid the key lookup because of inclusion in the select list, then it makes sense for it to be an included column. Alternatively, a column used for filtering/searching purposes should be included as a key column in the index definition.

Let's take our previous example of a surname search. If a common search condition was on the combination of surname and first name, then it would make sense for both columns to be included in the index as key columns for more efficient lookups when seeking through the intermediate index levels. If the email address column is used purely as return information, that is, in the query's select list, but not as a predicate (where clause condition), then it makes sense for it to be an included column. Such an index definition would look like this:

CREATE NONCLUSTERED INDEX ixContactLastName
ON Person.Contact (LastName, FirstName)
INCLUDE (EmailAddress)

In summary, included column indexes retain the power of covering indexes while minimizing the index size and therefore maximizing lookup efficiency.

Before closing our section on nonclustered index design, let's spend some time covering an important new indexing feature included in SQL Server 2008: filtered indexes.

Filtered indexes

Filtered indexes are one of my favorite new features in SQL Server 2008. Before we investigate their many advantages, consider the following table used to store customer details, including a country code:

CREATE TABLE [Person].[Customer](
   [CustomerID] [int] IDENTITY(1,1) PRIMARY KEY CLUSTERED

, [Title] [nvarchar](8) NULL
   , [FirstName] [nvarchar](100) NOT NULL
   , [LastName] [nvarchar](100) NOT NULL
   , [EmailAddress] [nvarchar](100) NULL
   , [CountryCode] char(2) NULL
)
GO

Let's imagine this table is part of a database used around the globe on a 24/7 basis. The Customer table is used predominantly by a follow-the-sun call center, where customer details are accessed by call center staff from the same country or region as the calling customers.

Creating a nonclustered index on this table similar to the one earlier in the article where we included FirstName, LastName, and EmailAddress will enable lookups on customer name to return the required details. If this was a very large table, the size of the corresponding nonclustered indexes would also be large. Maintaining large indexes that are in use 24/7 presents some interesting challenges.

In our example here, a traditional (full table) index would be created similar to what we've already seen earlier in the article ; columns would be defined as key or included index columns, ideally as part of a covering index. All is fine so far, but wouldn't it be good if we could have separate versions of the index for specific countries? That would enable, for example, the Australian version of the index to be rebuilt when it's midnight in Australia and few, if any, Australian users are being accessed. Such an index design would reduce the impact on sections of users that are unlikely to be accessed at the time of the index maintenance.

Consider the following two index-creation statements:

-- Create 2 filtered indexes on the Customer table
CREATE NONCLUSTERED INDEX ixCustomerAustralia
ON Person.Customer (LastName, FirstName)
INCLUDE (EmailAddress)
WHERE CountryCode = 'AU'
GO

CREATE NONCLUSTERED INDEX  ixCustomerUnitedKingdom
ON Person.Customer (LastName, FirstName)
INCLUDE (EmailAddress)
WHERE CountryCode = 'UK'
GO

The indexes we've created here are similar to ones from earlier in the article with one notable exception: they have a predicate (where clause filter) as part of their definition. When a search is performed using a matching predicate and index keys, the query optimizer will consider using the index, subject to the usual considerations. For example, the ixCustomerAustralia index could be used for a query that includes the CountryCode = 'AU' predicate such as this:

SELECT FirstName, LastName, EmailAddress
FROM Person.Customer
WHERE

LastName = 'Colledge'
   AND FirstName like 'Rod%'
   AND CountryCode = 'AU'

Such indexes, known as filtered indexes, enable a whole range of benefits. Let's cover the major ones:

Segregated maintenance—As we've discussed, creating multiple smaller versions of a single larger index enables maintenance routines such as index rebuilds to be scheduled in isolation from other versions of the index that may be receiving heavy usage.
Smaller, faster indexes—Filtering an index makes it smaller and therefore faster. Best of all, covered filtered indexes support optimized lookups for specialized purposes. Consider a very large product table with a ProductCategory column; filtered indexes could be created for product categories, which include the appropriate columns specific to that category. When combined with application logic, such indexes enable fast, optimized lookups for sections of data within a table.
Creating unique indexes on nullable columns— Consider the Social Security number (SSN) column ; to support storing records for non-U.S. residents, we couldn't define the column as NOT NULL. This would mean that a percentage of the records would have a NULL SSN, but those that do have one should be unique. By creating a filtered unique nonclustered index, we can achieve both of these goals by defining the index with a WHERE SSN IS NOT NULL predicate.
More accurate statistics—Unless created with the FULLSCAN option , statistics work by sampling a subset of the index. In a filtered index, all of the sampled statistics are specific to the filter; therefore they are more accurate compared to an index that keeps statistics on all table data, some of which may never be used for index lookups.
Lower storage costs—The ability to exclude unwanted data from indexes enables the size, and therefore storage costs, to be reduced.

Some of the advantages of filtered indexes could be achieved in earlier versions of SQL Server using indexed views. While similar, there are important differences and restrictions to be aware of when choosing one method over the other.

SQL Server 2008 : Index design (part 3) - Indexed views

SQL Server 2008 : Index design (part 1) - Selecting a clustered index

Other

SQL Server 2008 : Index design and maintenance - An introduction to indexes

USB Hard Drive Hangout (Part 2) - Toshiba Canvio plus 3.0 1.5TB, Adata Dashdrive Elite 500GB HE720

USB Hard Drive Hangout (Part 1) - WD My Passport 2TB

WD 4 TB Black Review – It’s The One To Get

WD My Passport Edge - Fast Mobile Drive For Mac 500GB

Seagate Slim Portable – Thin Mobile Drive

SSDs: Solid And Reliable

Sandisk ReadyCache 32GB - Optimum Your Space To Cache

Silverlight : Data Binding - Providing reasonable defaults for bound data

Silverlight : Data Binding - Controlling Updates