Rewriting SQL queries for Performance in 9 minutes

Uploaded by roughsealtd on 18.02.2009

Hello, my name is Stephane Faroult.
I’m a French SQL and database performance consultant and I’d like to illustrate briefly
how you can try to improve query performance.
I’ve recently seen on the internet a query for which the poster was asking for help.
I have removed some of the columns that were returned to shorten the query and make it
more legible, but otherwise here it is. It’s fairly standard
SQL, but for the limit clause that hints at MySQL
or Postgres.
With Oracle, you would have had a condition on the pseudo-column rownum.
With SQL Server, you would have had a “top 20” clause after the SELECT.
What I am going to tell you would apply to any database product.
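To make the dialect differences concrete, here is the same “first 20 rows” idiom in each product (the column names are illustrative only):

```sql
-- MySQL / PostgreSQL
SELECT itemid, title FROM items ORDER BY itemid LIMIT 20;

-- Oracle: a condition on the rownum pseudo-column,
-- outside the query that contains the ORDER BY
SELECT * FROM (SELECT itemid, title FROM items ORDER BY itemid)
WHERE rownum <= 20;

-- SQL Server
SELECT TOP 20 itemid, title FROM items ORDER BY itemid;
```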
The first pieces of advice posted were along the lines of “what does EXPLAIN PLAN say?”
As a general rule, I don’t find looking at execution plans very useful for improving
queries, unless perhaps you are comparing two plans
– and even then.
In any case, I don’t believe that looking at execution plans is the first step to take.
If I can give a comparison, SQL query execution is a bit like automated translation.
Suppose that I want to translate a sentence by Montaigne from French to English. Not the
easiest French you can find.
With an automated tool, the raw translation reads like the instructions for some cheap
electronic device.
I’m pretty sure that knowing exactly how the translation engine works and tracing what it does
would be very interesting, but if I want a good result the simplest thing I can do is
probably, with all due respect to Montaigne, to rewrite
phrases in a simpler and less ambiguous way.
I take the same approach with SQL.
Some other people came up with questions I find more relevant, such as “what is the size
of your tables?” and “where are the indexes?”. I have recently
published with O’Reilly a book called “Refactoring SQL Applications” in which
I discuss all of this in detail.
But here I want to show you just how far you can go by analyzing a query alone.
Usually people come to me and say “Can you help me tune my query, which is too slow?”
I don’t know what exactly they expect, but I find the word
“tuning” very misleading, because it suggests small
changes, when in fact an SQL statement is a program in itself and modifying it often
has a huge performance impact.
The query I am given to “tune” usually looks like this, and the hardest part is to
try to make sense of it.
Once I have understood, I can delicately modify the query.
An essential point with SQL is that the result of a query is a data set in exactly the same
sense as a table is a data set. What is true for
a table should be true for the result of a query,
in particular the fact that there should be a key – columns that uniquely identify each row.
Finding the key is a great help in understanding the query.
In this particular example I have a mix of a right outer join and a left outer join that
I find very confusing. I never use right outer joins. The table that is at the core of the
query looks like the items table – the innermost
condition is applied to it, most columns that are returned
come from it, all other tables are joined to it.
The key looks like the “itemid”; votes come as an afterthought.
I’ll therefore start by rewriting the
from clause to give prominence to the items table.
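Since the original query isn’t reproduced here, the table and column names below are reconstructed from my description and purely hypothetical; what matters is the shape of the from clause:

```sql
-- Before: a right outer join and a left outer join mixed together, e.g.
--   FROM votes RIGHT OUTER JOIN items ... LEFT OUTER JOIN myitems ...
-- After: items is the driving table, and everything else hangs off it
FROM items
     LEFT OUTER JOIN myitems ON myitems.itemid = items.itemid
     LEFT OUTER JOIN votes   ON votes.itemid   = items.itemid
     INNER JOIN users        ON users.userid   = items.userid
```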
Now let’s start with the serious analysis. I concentrate first on what determines the
result set, that is what allows me to say that a row,
identified by its key, belongs or doesn’t belong to the
result set.
What are the filtering conditions?
A condition on items in an inner query, and an outer condition on one column of MyTable.
We need items. The column from MyTable actually comes from the myitems
table, which is directly joined to items. We also
have the “limit” clause, which is linked to
the “order by” that only applies to the itemid.
In other words, we want the first 20 identifiers of items for which the “deleted” flag
is 0, and which have no match in myitems or for
which userIDFav is 3 or undefined in this table.
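With the same hypothetical names, the conditions that actually define the result set boil down to this – note that a missing row in myitems shows up as a null after the left outer join, which is why one test covers both the “no match” and “undefined” cases:

```sql
WHERE items.deleted = 0
  AND (myitems.userIDFav = 3 OR myitems.userIDFav IS NULL)
ORDER BY items.itemid
LIMIT 20
```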
Do the other joins contribute to refining the result set? Not the join with ‘votes’,
since it’s an outer join.
Let’s remove the references to votes for the moment. What about the join with “users”?
It might be understood as a restriction if there were some values of userid in the items
table without any match in the users table. But
in all likelihood, we have a foreign key relationship between the two, and there is probably always
a matching row in users.
My conclusion is that it doesn’t contribute to defining the result set either, and it
can go too.
Now that I have cleaned up my query, I get one that returns fewer columns, but
the same identifiers as the original query.
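The cleaned-up query might look like this (again with reconstructed, hypothetical names):

```sql
SELECT items.itemid
FROM items
     LEFT OUTER JOIN myitems ON myitems.itemid = items.itemid
WHERE items.deleted = 0
  AND (myitems.userIDFav = 3 OR myitems.userIDFav IS NULL)
ORDER BY items.itemid
LIMIT 20
```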
Very interestingly, we no longer have the aggregate, which requires sorting data.
We can still simplify the query by moving up conditions that have no real reason to
be separated from the others.
If this were an Oracle query, it should remain as it is now, with a condition on rownum outside
the query that contains the order by clause. But
since this is MySQL, I can also push the limit clause
in the inner query.
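Schematically, the two placements look like this:

```sql
-- Oracle style: rownum must be applied outside the ordered subquery
SELECT * FROM (SELECT itemid FROM ... ORDER BY itemid) WHERE rownum <= 20;

-- MySQL / PostgreSQL: the limit can live inside the subquery itself
SELECT ... FROM (SELECT itemid FROM ... ORDER BY itemid LIMIT 20) AS first20 ...
```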
Now comes the reconstruction phase, and I am going to bring back one by one all the
pieces of data I have removed.
This is the initial query, with its select list.
We have the itemid. To return the count of votes, I need an outer join with the votes
table. I’ll add the GROUP BY when I’m done.
I have the title, and to return the user name I need to join on the users table.
At this point I have an issue, because in the original query there is a join on a
column from items that isn’t returned by the subquery.
I have to add it.
I add userIDFav and I’m done with the select list. I can now add the GROUP BY clause, as
well as an ORDER BY.
Adding a second order by is really important: it isn’t guaranteed that the joins that
have been added will keep the rows sorted on itemID, so I really
have to add it.
But I know that I’ll only have 20 rows at most to sort, which is nothing.
And here I am.
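Assembled, the rewritten query might read as follows; the schema is my reconstruction from the description and, like the original, it is untested:

```sql
SELECT first20.itemid,
       COUNT(votes.itemid) AS vote_count,
       first20.title,
       users.username,
       first20.userIDFav
FROM (SELECT items.itemid, items.title, items.userid,
             myitems.userIDFav
      FROM items
           LEFT OUTER JOIN myitems ON myitems.itemid = items.itemid
      WHERE items.deleted = 0
        AND (myitems.userIDFav = 3 OR myitems.userIDFav IS NULL)
      ORDER BY items.itemid
      LIMIT 20) AS first20
     LEFT OUTER JOIN votes ON votes.itemid = first20.itemid
     INNER JOIN users      ON users.userid = first20.userid
GROUP BY first20.itemid, first20.title, users.username,
         first20.userIDFav
ORDER BY first20.itemid;
```

Notice that the aggregation and the joins now apply to at most 20 rows coming out of the subquery.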
As I told you, I found the query on the internet, and I have had no opportunity to test it.
But although I have merely applied logic, I’m ready to bet that this version is faster
than the original one, because I aggregate votes
and join for only 20 items, instead of aggregating and joining for a potentially very large number
of items and keeping only the first 20.
Moreover, my query will scale much better when the number of items increases, because
I work on as little data as I can when I identify
the rows to be returned.
I have applied three key principles in the rewriting of the query.
Firstly, I always try to get rid of the rows that will not be needed in the final result
set as soon as possible. This is why I have limited
the number of rows before aggregating rather than the reverse.
Secondly, whenever you sort or aggregate, work on as little data as you can. It’s
always faster to sort only customer ids than customer ids
plus all the information associated with a customer.
Thirdly, join late. Wait until the very last stage to join in the information
you want to see returned but that doesn’t contribute to
limiting the number of rows that are returned.
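The last two principles can be illustrated together on a hypothetical customers/orders schema: sort and aggregate only the identifiers, then join late for the details.

```sql
-- Top 20 customers by order value: the sort and aggregate see only
-- customerid and amount; names and addresses are joined in at the end.
SELECT top20.customerid, customers.name, customers.address
FROM (SELECT customerid
      FROM orders
      GROUP BY customerid
      ORDER BY SUM(amount) DESC
      LIMIT 20) AS top20
     INNER JOIN customers ON customers.customerid = top20.customerid;
```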
Apply those three principles systematically, and you will probably see significant performance
improvements with many queries.
Thank you for your attention.