Mountain to Valley

Today was a wonderful day, and a horrible day. It was an uplifting day, and a discouraging day all in one. Why? Because I went on a ladies retreat with my church. It was tucked away in the woods, off…

Smartphone

独家优惠奖金 100% 高达 1 BTC + 180 免费旋转




The gentle Art of referring to Columns

Why we abandoned column names in Hyrise. Or did we really?

This post is part of a blog series about the progress of writing a query optimizer for Hyrise, a main-memory column-oriented research database.
So far, we have written about:

We, as humans, like to put labels on everything and use them to refer to one thing or another as we talk about them. In databases, this means that we give names like income or birthday to columns and persons to a table when we’re communicating with a database via SQL. To the core of the database, the part that actually executes our queries and is therefore often referred to as the “execution engine”, these names mean little —it just holds a list of tables and for each table a list of columns. At the bottom of it, this article is about the best way of achieving a mapping between the name that the user likes to work with and the index in a list, which is really the way of the database to access a column.

But before we dive deeper, let’s have a quick look at why the way we can name things in SQL is not as simple to process for a database as it might seem at first glance. The short answer is: prefixing and aliases. The long answer is best given by a big wall of text, or by some examples. Let us go for the latter, shall we?

First, assume we have the following tables loaded into our database:

Now, the first query we will consider is a relatively straight forward case:

Can you hear the ambiguity bell ringing? The column a is contained in both T1 and T2 — we would have to either refer to T1.a or T2.a to remove the ambiguity, while for column c the T2. prefix would be optional. Okay, let us fix the query and extend it a bit:

Looks correct to you? Well, it is not, because the column names of InnerResult are still a and c, the first column is not named T1.a, despite us using that name to refer to it.

For the last mind challenge, imagine yourself in the shoes of a query optimizer (Remember? We’re writing Hyrise’s query optimizer!) that takes this particular query above and realises…

…and then you go ahead and form the following query:

…and of course you would have stepped right into another trap SQL has laid out for you, because now InnerResult is not a known prefix anymore. And then it dawns on you that even removing the InnerResult prefix is not enough for the push down to work, but adding the T1 prefix again would be necessary and cold sweat starts running down your back and you scream “I’m just a query optimizer I don’t want to deal with this!!” and the execution engine agrees with you and starts wailing as well — and you know what? We think you are right — SQL’s naming complexity is not something the inner workings of Hyrise should have to worry about. To guard you from it, we will build a nice layer of defense around it, one that translates these names into nice, unambiguous column indices that are, as we will lay out further down, much easier to work with.

In the process outlined above the question is where to translate column names to column indices. In Hyrise, we decided to introduce the indices as early in the process as possible, because we think that the unambiguity of those indices helps us in all the following stages, namely optimization and execution.
The parser is not able to translate the column names to indices because it has no knowledge of the tables and columns that are used in the statement — it operates completely independent of the storage layer. Consequently, the SQLToASTTranslator would be the next possible component. Introducing indices here has the advantage that virtually the whole code base outside of that translator does not have to resolve column names. That includes the statistics component, the ASTToOperatorTranslator, as well as all operators and optimization rules. We found that to be a major advantage regarding the complexity of the code and therefore decided to follow that approach.

If a second scan follows the first one, the process is a bit different. The second scan has as its input the first scan, and therefore asks that node to resolve the column name. That node does not change the table structure and therefore simply redirects the task to its input, which would in this case be the StoredTableNode. The advantage is that we only have to define the lookup functionality for nodes whose input and output structures are different. In Hyrise, we currently have four nodes that can change the table layout: StoredTableNode, ProjectionNode, JoinNode, and AggregateNode.
ProjectionNode and AggregateNode are different from the other two in that they have the power to create new columns that do not stem from its input. In the case of the AggregateNode that would be aggregate functions, and for the Projection that would be arithmetic expressions. The respective nodes have to consider that when resolving column names.
This is important since all nodes provide a mapping from a node’s output column indices to its input column indices. For output columns that are created by this node, a special token is used.

As mentioned in the beginning of this post, SQL provides a way to specify which table a column is being accessed from. This is required if a column otherwise cannot be resolved unambiguously, and optional otherwise. To support that, nodes can be asked whether they manage a respective table qualifier, and will only be taken under consideration if they do. Currently, StoredTableNodes store the table name and can thereby take responsibility for that respective table qualifier. Other nodes simply ask their input nodes whether they manage that table.
The one interesting special case is the JoinNode, which handles two tables. It knows how many columns each of the two inputs specifies, and can therefore limit the search to the columns of the responsible table.
In the future, this concept needs to be extended to support aliases for subqueries as well — and we have plans prepared for that.

The main objective of the optimizer is to generate a query plan as close to optimal as possible within a reasonable time frame. Often times that means changing the order in which operators are being executed. By relying on column indices instead of names, changing the tree might involve the need to change column indices as well. Imagine a simple example, in which we first do an INNER JOIN of two tables and subsequently scan the join result on a column of the right table of the join. The optimizer might decide that it is cheaper to first scan the right table and then join the left table with the result of the scan.
However, the AST node representing the TableScan uses as a column index the column index that is correct for the table resulting from the INNER JOIN, which first contains all columns of the left input and then all columns of the right input. This column index is different if the scan is to be executed before the join, because then the input table for the scan only has the columns of the right table.

That means there are three cases to consider in the rules:

To ease the implementation of rules in Hyrise, we plan to provide helper functions in the future that handle these updates of column indices automatically.

While this makes the implementation of the operators somewhat cleaner, less redundant and even a slight bit faster, this approach has some downsides too, that we have to be open about:

Despite these problems we still decided to make the shift, since we think that the reduced complexity in the optimizer makes up for it.

Integrating column indices deep into Hyrise’s architecture at this point was a lot of work. But this work makes the life of the optimizer and execution engine coders so much more joyful — and that makes it worth it.
Dear reader, thank you for following us until here, but our journey is almost over. In fact this was the last content-heavy blog post of this semester’s master project at HPI’s EPIC chair. But don’t leave us just yet, because there is a nice wrap-up article for our whole project coming your way soon — see you then!

Add a comment

Related posts:

Wireshark Filter Command

Wireshark some commnd for filter. “Wireshark Filter Command” is published by CC(ChenChih) in Chen-Chih’s Portfolio Page.

Welding Innovations in 2020

Welding is one of the most important trades in the world of construction and manufacturing, whether it’s used for cars, buildings, or ships. It’s also a trade that’s changing very rapidly. Currently…

Paradox

This reached out and slapped me across the face. The consequence of self-ignorance is that which we are ignorant about being projected out into the world and running amok with a free hand…