{question}
How to read an EXPLAIN output?
{question}
{answer}
The EXPLAIN command is one of the main tools to understand what is happening behind the screen when you run a query. You'll need to use it when you want to know how to improve the performance of your queries, and here we'll explain how to read it.
The EXPLAIN command shows an execution plan of the query based on the estimations, meaning it doesn't actually run the query and shows a plan it assumes will be the best according to the statistics it has on the tables.
In order to run an EXPLAIN, you just need to add the command EXPLAIN before the query you want to check, like in the example below.
We create two test tables:
create rowstore table table1 (a int, b int, shard key(a));
create rowstore table table2 (a int, x int, y varchar(20), shard key(x));
Insert a bit of data so we won't run on an empty table
insert into table1 select 1, 1000;
insert into table1 select 2, 2000;
insert into table1 select 3, 3000;
insert into table1 select 4, 4000;
insert into table2 select 1, 5, 'abc';
insert into table2 select 2, 25221, 'tyr';
insert into table2 select 3, 98, 'htj';
insert into table2 select 4, 259, 'lke';
First, we'll start with a simple query to get started:
EXPLAIN SELECT * FROM table1 WHERE a = 3 LIMIT 10;
And here's the output:
Gather partitions:single alias:remote_0
Project [table1.a, table1.b]
Top limit:10
IndexRangeScan test.table1, SHARD KEY a (a) scan:[a = 3] table_type:sharded_rowstore
In an EXPLAIN, you should start reading from bottom to top, the bottom commands will usually be the scan operator where the engine fetches the data from the tables, and the top will usually be the project operator where you send the results of the query back to the client.
In this case, we started with an IndexRangeScan on the shard key because the query was filtered by the column in the shard key, which means that we didn't need to scan the whole table to look for records that fit the filter.
Here you can see which table was scanned, which index was used and the type of the table (rowstore or columnstore)
The next step is the Top operator, since our query had the LIMIT operator after we fetched the data, we need to apply it, in this case only bring up to the next step the first 10 rows.
The next step is the Project operator which takes the data from the previous step and organizes it into a tabular format.
The final step is the Gather, which gathers all the data from the leaves to the aggregator that ran the query.
Notice that always all the steps below the Gather happen on the leaves, and above it happen on the aggregator.
Now let's EXPLAIN a query that is a bit more complicated
explain select t2.a, t2.y
from table1 t1
inner join table2 t2
on t1.a = t2.a;
And now let's look at the output:
Project [t2.a, t2.y] est_rows:4
Top limit:[@@SESSION.`sql_select_limit`]
HashJoin
|---HashTableProbe [t2.a = t1.a]
| HashTableBuild alias:t2
| Project [remote_1.a, remote_1.y] est_rows:4
| Gather partitions:all est_rows:4 alias:remote_1
| Project [t2_0.a, t2_0.y] est_rows:4
| TableScan test.table2 AS t2_0 table_type:sharded_rowstore est_table_rows:4 est_filtered:4
Gather partitions:all est_rows:4 alias:t1
Project [remote_0.a] est_rows:4
TableScan test.table1 AS remote_0 table_type:sharded_rowstore est_table_rows:4 est_filtered:4
What's important to notice here, is that now we have two parts in the plan.
Each individual work that is done on a table is shown in a separate part, and the end of the inside part (the top) will always be a join with the outside part.
in this case, the scan of table1 is on the outside, and the scan of table2 is on the inside part (marked with |).
Both of these can run in parallel, and each part is running from the bottom to the top until they both join together in the HashJoin.
Additional information can also be seen here.
{answer}