SQL Server 2008 : Index design (part 1) - Selecting a clustered index

A good database design is made in conjunction with, and is conscious of, application data access logic. For example, in order to design indexes for a particular table, the database designer must know how users will be accessing the table from the application(s). If an application allows searching for data on a particular column or set of columns, then this needs to be considered from an indexing point of view. That's not to suggest that the application completely dictates index design. The reverse is often true; sometimes unrealistic application access must be modified in order to prevent user-generated activity that causes database performance problems.

In this section, we'll concentrate on generic index design strategies, beginning with the type of columns suitable for a clustered index. We'll then look at an area we touched on in our introduction, covering indexes and included columns, before concluding the section with coverage of a new feature in SQL Server 2008, filtered indexes, and how they compare with indexed views.

Let's begin with an important step in table design: selecting a clustered index.

1. Selecting a clustered index

When a table is created with a primary key constraint, as per the following example, a unique clustered index is automatically created on the column(s) in the primary key, unless specified otherwise.

-- Creates a clustered index by default on the clientCode primary key
CREATE TABLE dbo.client (
    clientCode int PRIMARY KEY
    , surname nvarchar(100)
    , firstName nvarchar(100)
    , SSN char(11)
    , DOB datetime
)
GO

In this example, the clientCode column will be used as the primary key of the table as well as the unique clustered index. Defining the column as the primary key means an explicit CREATE CLUSTERED INDEX command is not required. Should we wish to create the clustered index on a different column, SSN for example, we could create the table as follows:

-- Create a clustered index on a nonprimary key column
CREATE TABLE dbo.client (
    clientCode int PRIMARY KEY NONCLUSTERED
    , surname nvarchar(100)
    , firstName nvarchar(100)
    , SSN char(11)
    , DOB datetime
)
GO

CREATE UNIQUE CLUSTERED INDEX cixClientSSN ON dbo.client(SSN)
GO

Created in this manner, the client table will contain two indexes: a unique nonclustered index for the primary key constraint and a unique clustered index for the SSN column.

So, generally speaking, which types of columns make the best candidates for a clustered index?

The clustered index key is contained in the leaf node of each nonclustered index as the row locator. If the clustered index changes from one column to another, each nonclustered index needs to be updated in order to maintain the linkage from the nonclustered index to the base table. Further, if the column value of the clustered index changes, similar updates are required in each of the nonclustered indexes.
The width of the clustered index directly affects the size of each nonclustered index. Again, this is a consequence of including the clustered index key in the leaf node of each nonclustered index.
If a clustered index is not unique, SQL Server will make it so by adding a hidden uniqueifier column to the table for inclusion in the index.

It follows that the best candidates for a clustered index are columns that

Change infrequently (ideally not at all)—A stable column value avoids the need to maintain nonclustered index row locators.
Are narrow—They limit the size of each nonclustered index.
Are unique—They avoid the need for a uniqueifier.

With these attributes in mind, a common pattern for table design is to create what's called a surrogate key, using the IDENTITY property as per this example:

-- Use the IDENTITY property to create a clustered primary key column
CREATE TABLE dbo.client (
    clientKey int IDENTITY (1,1) PRIMARY KEY
    , surname nvarchar(100)
    , firstName nvarchar(100)
    , SSN char(11)
    , DOB datetime
)
GO

By adding the IDENTITY (1,1) property to the clientKey column definition, SQL Server will populate this column's value with an automatically incrementing number for each new row, starting at 1 for the first row and increasing upward by 1 for each new row.

Using the IDENTITY property to create a surrogate key in this manner meets the desired attributes for a clustered index. It's an arbitrary number used purely to identify the record, and therefore it has no reason to be modified. It's narrow: a single integer-based column will occupy only 4 bytes. Finally, it's unique; SQL Server will automatically take care of the uniqueness, courtesy of the IDENTITY property.

In our client table example, the other candidate for a clustered index, as well as the primary key, is the Social Security number. It's reasonably narrow (11 bytes), unlikely to change, and unique. In fact, if we made SSN the unique clustered primary key, we'd have no need for the identity-based clientKey column. But there's one big problem here. It's unique for those who have an SSN. What about those who don't have one or those who can't recall it? If the SSN was the primary key value, the lack of an SSN would prevent a row from being inserted into the table.^[] For this reason, the best primary keys/unique clustered indexes tend to be artificial or surrogate keys that lack meaning and use system-generated uniqueness features such as the identity column. Of course, there are exceptions to this rule, and this is a commonly argued point among database design professionals.

^[] As an Australian without a U.S. Social Security number, I've witnessed this firsthand.

The other consideration for a clustered index is column(s) involved in frequent range scans and queries that require sorted data.

GUIDs and clustered indexes

A common database design practice is to use globally unique identifier (GUID) columns as primary keys, which by default will also be the table's clustered index unless specified otherwise. Not only are GUIDs wide (16 bytes), they're randomly generated. Given such tables are clustered on the GUID column, newly inserted rows will be randomly positioned throughout the table, leading to page splits and subsequent fragmentation. This is a particular concern for tables with a high volume of data inserts. SQL Server 2005 introduced the NewSequentialID() function, which partially offsets this problem by generating GUIDs in a sequential manner. Removing the "randomness" from the GUID values helps in reducing both page splits and fragmentation.

Range scans and sort operations

The case where nonclustered indexes are sometimes ignored if the estimated number of rows to be returned exceeds a certain percentage. The reason for this is the accumulated cost of the individual key/RID lookup and random I/O operations for each row.

For tables that are frequently used in range-scanning operations, clustering on the column(s) used in the range scan can provide a big performance boost. As an example, consider a sales table with an orderDate column and frequent queries such as this one:

-- Range Scan - Potential for a clustered index on orderDate?
SELECT *
FROM dbo.sales
WHERE orderDate BETWEEN '1 Jan 2008' AND '1 Feb 2008'

Depending on the statistics, a nonclustered index seek on orderDate will more than likely be ignored because of the number of key lookups involved. However, a clustered index on orderDate would be ideal; using the clustered index, SQL Server would quickly locate the first order and then use sequential I/O to return all remaining orders for the date range.

Finally, queries that select large volumes of sorted (ORDER BY) data often benefit from clustered indexes on the column used in the ORDER BY clause. With the data already sorted in the clustered index, the sort operation is avoided, boosting performance.

Often, a number of attributes come together to make a column an ideal clustered index candidate. Take, for example, the previous query, which selects orders based on a date range; if that query also required orders to be sorted, then we could avoid both key lookups and sort operations by clustering on orderDate.

The process for choosing the best clustered index is obviously dependent on the specifics of each database table and knowledge of how applications use the table. That being said, the above recommendations hold true in most cases. In a similar manner, there are a number of common techniques used in designing nonclustered indexes.

SQL Server 2008 : Index design (part 3) - Indexed views

SQL Server 2008 : Index design (part 2) - Improving nonclustered index efficiency

Other

SQL Server 2008 : Index design and maintenance - An introduction to indexes

USB Hard Drive Hangout (Part 2) - Toshiba Canvio plus 3.0 1.5TB, Adata Dashdrive Elite 500GB HE720

USB Hard Drive Hangout (Part 1) - WD My Passport 2TB

WD 4 TB Black Review – It’s The One To Get

WD My Passport Edge - Fast Mobile Drive For Mac 500GB

Seagate Slim Portable – Thin Mobile Drive

SSDs: Solid And Reliable

Sandisk ReadyCache 32GB - Optimum Your Space To Cache

Silverlight : Data Binding - Providing reasonable defaults for bound data

Silverlight : Data Binding - Controlling Updates