The main purpose of this article is to demonstrate how to look at MySQL joins. By "look at" I mean how to see through MySQL's eyes, so-to-speak. This task is fundamentally different from the rather straight-forward tasks of how to do joins or how to index for joins with MySQL. Even knowing how to join and how to index for joins does not completely demystify the process because there are many aspects about the query which MySQL takes into consideration when deciding on a join plan. Knowing how to look at MySQL joins (and how to join and how to index for joins) considerably demystifies the process. Particularly in cases like this, when we're tasked with optimizing a query without access to the actual database, we must be able to look at the join as MySQL would to know "if we re-write the query like this, then MySQL should execute a join plan like this." Therefore, this article could be titled, "Understanding MySQL Joins a priori."
With that in mind, a visitor sent me an email asking how the following query could be optimized. The query is part of NewBB Plus included in RunCMS (formerly E-Xoops). Its purpose is to retrieve the top N newest posts in the forums.
SELECT ... FROM mpn_bb_posts p, mpn_bb_forums f INNER JOIN mpn_bb_topics t ON t.topic_last_post_id = p.post_id WHERE f.forum_id = t.forum_id AND (f.forum_type = 0) AND (f.forum_id<>15) AND (f.forum_id<>25) AND (f.forum_id<>29) AND (f.forum_id<>33) GROUP BY p.topic_id ORDER BY t.topic_time DESC;
At first you may notice two parts of the query which are technically wrong, although the query does work: first, the query uses GROUP BY without any aggregate functions, and second, the order of tables listed in the FROM clause should be f, p, INNER JOIN t. First, we'll briefly discuss these two problems. That will clear the way for us to look at this join as MySQL would. Finally, we'll fix—optimize—this query.
What is that GROUP BY doing? Not much. It's attempted purpose is to limit the results by having no more than one result per forum topic. In case it's not clear, the hierarchy of the forum is form, topic, post. For our visitor who originally asked me about this query, they have roughly 35 forums, 49,000 topics, and 549,000 posts. Therefore, the GROUP BY limits the result set to ~49,000 records. Obviously this is a very poor limiter, especially when only the top 40 newest posts are displayed on the website. You may ask, "But if we remove the GROUP BY, will the query return more than one post per topic?" No, because the INNER JOIN clause joins only the newest post for each topic, and obviously there can't be more than one newest post. Removing the GROUP BY doesn't change MySQL's join plan for the query, but it needs to be removed anyway to allow us to work ORDER BY with LIMIT magic later.
The second problem is the order in which the tables are listed in the FROM clause. I have yet to test older versions of MySQL, but in v5.0.15 the query causes the following error:
The answer is stated in the final comment of the MySQL bug report 13832 (which is actually not a bug); I'll rephrase the answer. The comma has lower precedence than JOIN. The FROM clause is interpreted as:
FROM mpn_bb_posts p, (mpn_bb_forums f INNER JOIN mpn_bb_topics t ON t.topic_last_post_id = p.post_id)
In other words, the INNER JOIN joins tables f and t. Therefore, the ON clause for this join must reference columns in either tables f or t, but it doesn't, it references a column in table p (p.post_id). Hence, MySQL dies. The finer details of this are discussed in the MySQL manual section 13.2.7.1. JOIN Syntax. The fix is easy: swap p and f in the FROM clause:
FROM mpn_bb_forums f, mpn_bb_posts p INNER JOIN mpn_bb_topics t ON t.topic_last_post_id = p.post_id
A RunCMS developer suggests we can just get rid of the INNER JOIN statement and re-write the relevant portions of the query:
FROM runcms_bbplus_posts p, runcms_bbplus_forums f, runcms_bbplus_topics t WHERE t.topic_last_post_id = p.post_id AND f.forum_id = t.forum_id
Either way works; we'll use the INNER JOIN style. Now that the query is technically correct we can begin an a priori look at how MySQL will execute this query and then optimize it.
Three important ideas help guide our understanding of how to look at joins with MySQL:
I'll discuss each idea in turn, then we'll apply it to this query. By "criteria" I mean the columns of each table referenced in the query. These are often the columns used to relate or join the tables (commonly "ID" columns), but columns referenced anywhere in the query (WHERE, GROUP BY, ORDER BY, etc.) are included. For each table we can make a list of its criteria:
Table f | Table p | Table t |
---|---|---|
WHERE AND f.forum_id = t.forum_id AND (f.forum_id<>15) AND (f.forum_id<>25) AND (f.forum_id<>29) AND (f.forum_id<>33) | WHERE t.topic_last_post_id = p.post_id | WHERE t.topic_last_post_id = p.post_id AND f.forum_id = t.forum_id ORDER BY t.topic_time DESC |
Naturally there are overlaps because tables references each other (which is the whole idea of a join); these overlaps are how the tables relate to each other. Knowing each tables' criteria we can start to consider different ways to join the tables.
We have to pick a table to start with, but first a few words on two ways tables are conceptually joined. Usually, tables are joined through a series of relations. That is, the second table in the join plan has a relation to the first, the third table has a relation to the second, the forth to the third, etc. Therefore, column values from one table will be used to join the next table. The logical implication that follows is that each table should have a criteria that references the table prior to it in the join plan (except for the first table of course).
The alternative to joining tables through a series of relations is to join tables on a common relation. For example, the second and third tables may not share a relation to one another but both share a relation to the first table.
Stated visually, these two ways of joining tables looks like:
Either way, MySQL reads table 1, then 2, then 3—its "single-sweep multi-join method." The "(1)" in the common relation join means that when table 3 is read, rows in it are found using its relation to table 1, not table 2 as in the series of relations join. Joining tables through a series of relations has the advantage of reducing duplicate table reads if the first table is the most restrictive, the second table less restrictive, the third table even less restrictive, etc. Why this is is a positive consequence of MySQL's single-sweep multi-join method but not of necessary importance for this article.
Going back to picking a table to start with, there is only a limited number of possible join plans that follow from using each table first:
Actually there's more than three join plans: f, p, t could be a join plan, but this is neither a series of relations or a common relation because there is no relation between f and p. Therefore f, p, t—or p, f, t—is a very poor join plan that MySQL will certainly avoid. For these three join plans though, the first two have a series of relations and the third has a common relation (table t). Which table do we pick to start with?
The third idea guides us: use the least costly join plan. "Cost" is primarily a matter of of three factors:
Those factors are hierarchical: without proper indexes, all hope is lost; then we have to reduce table reads as much as possible because hard drives are really slow and while we're reading tables the tables are locked; finally, we can try to optimize out extra work if doing such also satisfies the previous two factors. Indexing for joins is discussed in the article How To Index For Joins With MySQL. In this article we're focusing on table reads and extra work (the tables are properly indexed). We have three join plans and obviously we want the one which costs the least amount of table reads. As stated earlier: joining tables through a series of relations has the advantage of reducing duplicate table reads. Therefore, we want to pick table f or p to start with. Since table f is extremely more restrictive than table p (34 forums verses 549,000 posts), this is the least costly join plan (f, t, p). Until now we haven't consulted EXPLAIN because the overall idea of this article is understanding MySQL joins a priori, but the time has come now to EXPLAIN and see that I haven't been lying to you. The following EXPLAIN is the original, unmodified from my visitor:
+-------+--------+---------------+---------+---------+----------------------+------+----------------------------------------------+ | table | type | possible_keys | key | key_len | ref | rows | Extra | +-------+--------+---------------+---------+---------+----------------------+------+----------------------------------------------+ | f | ALL | PRIMARY | NULL | NULL | NULL | 34 | Using where; Using temporary; Using filesort | | t | ref | idx | idx | 3 | f.forum_id | 1452 | | | p | eq_ref | PRIMARY | PRIMARY | 3 | t.topic_last_post_id | 1 | | +-------+--------+---------------+---------+---------+----------------------+------+----------------------------------------------+
MySQL has chosen join plan f, t, p just as we predicted. Therefore, we have proper indexes, and we're using the join plan with the least amount of table reads, the trick now is to get rid of "Using temporary; Using filesort".
The "Using temporary; Using filesort" is a consequence of the ORDER BY clause in the query. From the article ORDER BY With LIMIT and ANALYZE we already know why ORDER BY with LIMIT is a MySQL query optimization technique. Fortunately, we can apply the same technique to this query because we also need to limit the results (which was what GROUP BY was trying to do). However, the query in the previous article used a single table and didn't incur "Using temporary; Using filesort" so while the optimization technique works for this query too, it does so under a different premise which we'll examine.
In the previous article and query, the premise was that MySQL was using an index to avoid "Using temporary; Using filesort" even when EXPLAIN said there were no possible keys. For this query the premise is dramatically different because with multiple tables we have to properly index for the join first, then get the least costly join plan, and then finally try to remove the extra work of "Using temporary; Using filesort" by using an index that also accomplishes the prior two factors. Therefore, the ORDER BY with LIMIT technique is not so easy as it was in the previous case.
The first thing to do is index the column in the ORDER BY. In this case, t.topic_time is indexed. Then we have to examine whether a join plan starting with the table of this column is possible. From the list of possible join plans we discussed earlier, we know that t, f, p (same as t, p, f) is a possible common relation join. Normally this join plan would read many more rows than plan f, t, p, but since it can be used for the ORDER BY with LIMIT optimization technique it really only reads N rows, where N is the LIMIT. Basically, we're just lucky this works with the join. There are a number of situations that would prevent us from using this optimization:
Even before consulting EXPLAIN again we can be very sure MySQL will switch the join plan from f, t, p to t, f, p (or t, p, f). Furthermore, we can even predict what indexes it will use for each table. We know that table t will be first in the join plan and MySQL will use the index on column topic_time (called t_time). Then, consulting the table of relations we made earlier, we know that both tables f and p are related to t by their PRIMARY KEY column (forum_id and post_id respectively). Therefore, the new join plan will uses indexes t_time, PRIMARY, PRIMARY. This is what I told our visitor and then asked him for the new EXPLAIN:
+-------+--------+---------------+---------+---------+----------------------+-------+-------------+ | table | type | possible_keys | key | key_len | ref | rows | Extra | +-------+--------+---------------+---------+---------+----------------------+-------+-------------+ | t | index | idx | t_time | 4 | NULL | 49547 | | | f | eq_ref | PRIMARY | PRIMARY | 3 | t.forum_id | 1 | Using where | | p | eq_ref | PRIMARY | PRIMARY | 3 | t.topic_last_post_id | 1 | | +-------+--------+---------------+---------+---------+----------------------+-------+-------------+
We were correct again, a priori. At this point, the query can't really be optimized anymore. The 49,547 rows for table t is misleading. In reality MySQL will stop reading records after it's found N records that satisfy the join, and since the other tables are being joined on unique columns, MySQL should only need to read N records from t and join those with f and p. In a highly unlikely situation, MySQL may find a record in t that has no matching record in f or p, but hopefully the database has no such orphaned forums or posts (i.e., forms or posts which are not related to any topic).
In this article we've discussed how to look at MySQL joins. Different from a typical hands-on approach to working with queries and joins, we've taken an a priori approach to understanding how and why MySQL develops and chooses among various multi-table join plans. This process is framed by three ideas, the first of which is that every table has its criteria. To facilitate this idea, I find it helpful to make a side-by-side list of each table's criteria. These criteria also include the relation between tables; that is, where one table references another. Knowing these relations is important later. The second idea is that we have to pick a table to start the join plan with. Entailed within this idea are two conceptual ways to join tables: a series of relations and a common relation. A series of relations join incurs less table reads if the tables are progressively more restrictive from the start of the join plan. Since we know the relations between all the tables from the first idea, we can construct a list of all plausible join plans for each table as the starting table. A join plan is plausible if it is either a series of relations or a common relation join. Otherwise the join plan is possible but implausible and undesirable because it tends to incur full cross-product joins between the unrelated tables—very slow. The third idea tells us how to pick a table to start with: use the least costly join plan. "Cost" is a hierarchy of factors: indexes, table reads, and finally extra work. Proper indexes for the join are most important; without them the join is doomed. Then we prefer a join plan with the least amount of table reads. As stated earlier, this is usually the series of relations join plan with the most restrictive first table. Finally, if we're lucky, we can also index where necessary to avoid extra work like filesorts or temp tables, but only if this index still satisfies the first two factors.
Looks like a lot of information to know, but as always, if the problem is approached methodically as we've done in this article, you can play the part of the MySQL query execution planner in your mind with a good degree of accuracy. You can take the hands-on approach and tweak the query, EXPLAIN, and repeat, but to optimize the query and join in your mind a priori is a very effective way to geek out.
新闻热点
疑难解答