Hello, my name is Stephane Faroult.
I am a French database consultant. I’d like to talk a little about indexes, because I
see a lot of mistakes in the field.
I have published two books with O’Reilly, one about how I think one should write SQL,
and more recently a second one more relevant to what takes most
of my professional time, namely improving SQL applications
written by people who have obviously NOT read my first book.
In both books, I had to talk about indexes, a subject that is often surprisingly misunderstood.
Almost everyone knows you mostly have two ways to access data in a relational table:
one is to scan the table from beginning to end, and the other
one is to descend an index that is most often a tree
tructure, find the physical address of rows of interest, then directly access them.
Neat and fast.
So, why the fuss? We should probaly index every column that intervenes in a WHERE condition.
Hold on. Some indexes are created to implement constraints, in particular to quickly identify
that you are trying to insert duplicates in columns that should contain unique values.
And you almost always have some obvious “entry points”, some columns that are particularly
important and selective search criteria. I have no issue
with these indexes.
But all indexes have a cost.
For one thing, they take a lot of storage. This shows the total storage taken by a real
8 Gigabyte table in a bank. The five indexes take together
more than three times as much storage as data.
You are going to tell me: who cares, storage is cheap nowadays.
True enough, but only to some extent. And of course I’m not only talking about the
production database. It is likely, if data is precious,
that disks are mirrored. And if the data is critical,
it is probably replicated to some distant disaster recovery site, where it can also
be mirrored.
We probably also have a copy of the production data that is used for development. And another
copy for maintenance of the current version of
the software. And a user-acceptance database copy, for testing
new functionalities. And a performance database for stressing the system.
A few gigabytes here, a few gigabytes there, and you are soon talking about real storage
requirements, that don’t come that cheap. But paying for
disks isn’t the only cost.
The time it takes to backup a database is directly proportional to its size.
Backups are usually planned. But recoveries often aren’t, and the problem is the same.
If your database takes a lot of space because of indexes, it will take longer to recover.
Franklin gave to an old saying its modern guise in 1748 in his “Advice to a Young
Tradesman”:
Remember that time is money.
There is even worse, and this is a test you can easily replicate. I have inserted 100,000
rows into a table of a dozen columns.
In the same time I could insert 100 rows without any index,
After creating a primary key index I inserted 65
After adding a second index I was down to 22,
After adding another index I could only insert 15
And a last index brought my throughput down to 5.
The actual numbers may vary, but since indexes are maintained in real time each index inflicts
a performance penalty, and it can be verified with any database management system.
It’s not a concern in a read-only decision support system. It can be one in an operational
system.
I certainly want some indexes. But since indexes are costly,
we must study the cost/benefit ratio and really weigh their usefulness.
Let’s first understand how they really work.
An index entry is the association of a key value with a physical addresses.
In an index, all the entries are stored in sorted order of the key values, and a tree
structure is plugged over this list to quickly access
the address of a key value by comparison of this key
with what you find in the various nodes.
Searching an index is in fact very similar to searching a phone book, in which the key
is the surname, firstname and perhaps the street, while the
associated value that is searched is the phone number.
By comparing the name you search to the top of the page, you reach the good line very
fast.
But when we are looking for a lot of values, either because the search criterion isn’t
very selective, or because we are using the result of a subquery
that returns many rows, it may actually be more
efficient to remorselessly scan the table than to get the address from the index each
time.
The question really is: do we need an index, like a book index that is great to find some
very specific information at one particular place
or do we need something which is more like the broad categorization of the table of contents,
because we need to get information from full chapters.
Databases know the equivalent of chapters, they are called partitions.
Even when you are certain about which columns really need indexing, you must be careful
about how you index them.
For instance, suppose we have a table called T, and that we have indexes on two columns
from T, country and year.
If we often run queries in which we combine a condition on the country with a condition
on the year, everything is fine …
except that the optimizer will have to think hard about how best to run the query.
It can use an index and forget about the other one,
or it can sometimes use both indexes to find the intersection of two result sets.
If both columns appear in most query conditions, it makes sense to create a composite index,
in which we shall take the year and the country, concatenate the two and call the result the
key.
The composite index will take us precisely to the rows we want, and in terms of index
maintenance during inserts and deletes, it will be faster
to maintain one index on two columns than two indexes
on one column.
But then there is a question.
Should the index be built on country and year,
or on year and country? It’s not exactly the same thing.
Remember that keys are sorted in indexes. An index on year and country might look like
this
While an index an country and year would resemble that.
If both country and year appear in the condition, the two indexes will be equivalent
and will allow to zoom fast on the result.
If only a condition on the year appears in the where clause, we shall be able to use
the index that starts with the year, but not the other
one, unless we scan it fully.
And of course the reverse will be true if we only have a condition on the name of the
country.
Searching an index is similar to searching a dictionary : you quickly find what you want
if you only know the beginning of a word the spelling
of which you want to check.
If you want to find words that rhyme together a regular dictionary is useless.
It also means that when you have an index on two columns C1 and C2, then having a single
column index on C1 is pointless. However, it may
make sense to have an index on C2 alone if it speeds up
a series of queries in which there is no condition on C1.
There are some other subtleties.
If we have a range condition on the year, both indexes will in theory be usable, but
searching the index that starts with the year will require
more work, and perhaps that the optimizer will
decide that it’s not worth the trouble.
There is much more to say about indexes, and this video is a mere introduction to the topic.
But remember that indexing indiscriminately, hoping to reach the target, will in most cases
not work, because indexes are not the silver bullet
that always makes queries fast.
If you want to index efficiently your tables, you don’t usually need many indexes,
but you need finely targeted indexes.
Thank you for listening.