On part 1 of this article, we introduced the problem of DB function fan-out, and we figured every such function is actually a multi-function with 2^n versions, given n inputs. On this part, we'll focus on current solutions to the problem on two open-source databases: MonetDB & ClickHouse. In addition we'll try to come up with simple, yet ineffective/incomplete solutions to the problem.
Simple standard solution recap
Last time we had a brief look at a standard solution to the concat problem, using 4 different functions - one that we called scalar, and 3 more that used at least one vector input. It looked like that:
Simply put, this solution reused the scalar version (first function on the snippet above) in the other 3 vector versions (below in the snippet). The scalar version serves for a "row execution": every version actually sets the data row-by-row (by looping on the input vector/vectors), and runs the scalar version for each row. Nice and easy, and very effective (given the scalar version is effective). However, as dicussed, this solution suffers from many disadvantages: horrible duplication of code (4 versions), comlicated to modify (imagine you'd be asked to change the data structure from vector to something else) and not easily testable. Writing this code is an annoying boilerplate code, and so does its testing code. We wouldn't like to do that, especially if we have to implement 100+ such multi-functions in a database.
Next, I'll try to give you a bird's eye view of the solutions offered to the problem on MonetDB & ClickHouse. The code on both projects is extremely complex, so I'll try to focus on the big picture without going into too much details. I think, though different, both conceptually aren't that different from the solution above.
Let's have a look at how this problem is solved in MonetDB*. This DB is implemented in the C-language, and isn't easy for most eyes. A single version of subString implementation (take a breath...) looks like that:
Don't worry, we are not interested in the implementation details of that function. What's more interesting to us, is the number of different versions this function has in the database. An easy place to examine this is in the registration of substring in the DB catalog file:
In MonetDB's terminology arg is a scalar (so arg("start", int) is a scalar parameter with name "start" and type int) and batarg is a vector with the same (name, type) declaration. In the snippet above, while looking at the last 3 arguments at every line of registration, you can see that they register all 7 different, non-scalar versions of subString (scalar version is registered in a different file). For example, in the 1st line, the last 3 arguments are: batarg("s",str),batarg("start",int),batarg("index",int) which means it registers the version of subString with 3 input vectors ("111" in our truth table from the previous article).
Conclusion: Ignoring all this database details-specific code (which is long as you've seen above. Now imagine 7-8 times of copy-paste of that code spread over hunderds of rows), we have 7 different, separate vector functions + 1 additional scalar function. In this database specifically you won't find any unit tests for any of the versions, and the code reuse in the 7 (very similar) functions doesn't exist - they have just duplicated the entire code with small changes for all versions. Bottom line: no better than the naive approach we've shown above with the concat example.
This DB is implemented in (a beautifully written) modern C++. It is truly my favourite analytical database, a great product used by many cloud-scale companies and with a great performance like no other. I can say that I've seen it from the inside, and the guys from Yandex have really shown great innovation, design and produced high-class software with great advanced techniques. That being said, the problem we discussed isn't solved there in a much better way, only with code that is much more advanced (being modern C++ instead of plain C). BTW, I think it is much more complex than the one in MonetDB above. Let's have a look.
Every such multi-function in ClickHouse is implemented as a class; For example: class FunctionSubstring or class ConcatImpl, both implementing the same class interface. Let's dive in and look at subString's executeImpl function:
ClickHouse version of this function is much more complex than the first version presented above, so let's try to focus on our interests:
- Their signature of the function is: subString(source, start, length), with length being optional.
- Right from the start, the code examines the 3 arguments to conclude whether each one is a constant or not (lines 125-136). It ends up with 2 possible versions for each one of the start & length arguments, and extracts the constant values if they are indeed scalars (lines 141-144). The code maintains the ambiguity of scalar/vector for each parameter by sustaining a variable for each version (e.g.: column_start & column_start_const for start parameter and start_value in case it is indeed a constant!).
- Ignoring lines 146-159 (this is a variant of the class not interesting to us), you can see that lines 161-172 calls executeForSource with all possible arguments and with different casting of the last argument source (oh my god...). I don't even want to go into the difference between the different castings on lines 163, 166, 169 & 172 - it's too complex. Eventually, all these lines call the function executeForSource, presented next:
- Again, without going into much details, please notice the codes' conditional paths that depends on whether each parameter is a constant or not (lines 91 & 105, and line 89 that tries to figure out if length parameter is in use or not).
- Eventually, there are more function calls on lines 94, 96, 101, 108, 110 & 115 - all are different functions!
- Agreed - they solve here more issues than the original one (e.g.: handling negative start value for some reason). But have a look at the code complexity! We already have so many conditional paths on the only 2 functions we examined together. If you'll drill down to any one of these 6 function calls, you'll find more types being used to determine how to run the underlying subString function. I don't know how this is sustainable, but it sure looks very complex.
I might be wrong, but I think that the root cause of that complexity is the way they deal with all these multi-function versions. In any case, I can't see why implementing such a simple function should take so many lines of code. Maybe I miss something and someone else can advise me. Bottom line: in this implementation we haven't seen any elegant, neat solution to the multi-function fan out problem (many if conditions and multiple functions weren't eliminated from this solution as well).
less naive solution, yet without using TMP
Finally, I'd like to show an improved solution to first one shown above. I ask myself - what is the "root cause" of our problem? I believe this is the fact that we can't know in advance whether a given parameter is a constant or a vector. If we could abstract that away effectively with a neat code, without duplications and boilerplate, we'll be on our way to the solution. So a (very) naive approach can basically turn every scalar parameter to a vector parameter: just initiate a vector of the same size with all cells equal to that single constant. E.g.: if source is a vector with 10M rows and start is the constant 0, create a vector of length 10M with all cells equal to 0. That would now allow us to use only a single version of the multi-function, the one with all-vectors! ('111' in our truth table).
Of course, this solution would be very inefficient in memory, but we can tweak it a bit. Instead of actually allocating a vector with 10M rows of constant value, we can keep that value once, and use some kind of abstraction that will give us that single constant value when we ask for the value at index i (as if it was a vector). Let's see how this could be implemented. Have a look at the following class Parameter:
This class is a template class with T being the type of the parameter. It serves as the abstraction we look for in our parameter of type T- whether it is a simple T val or a vector<T> values. This type lets us initiate it with either T or with vector<T> (note the 2 different constructors on the code snippet above). At last, by using getValue with index parameter, this type allows us an abstraction of constant vs vector, by allowing us to get the constant value at index i even though we don't really have a vector behind. Of course, when indeed there is a vector behind, it would just give us its ith element, as in any vector.
Now we can use subString with the following signature: subString(Parameter<String>, Parameter<int>, Parameter<Int> as our only subString version (we'd still need to write code that will "box" every possible value (constant or vector) into a Parameter<T> object). Now, why isn't it a good enough solution? Remember, performance here is a key. So instead of using vector-only access through regular loop or a constant value read, as in the first version above, we replaced it with frequent calls to getValue function (one per row). This function uses a condition for every single row calculation. That kills our performance! (it runs twice as slow as the first solution, if not slower. I'll try to provide an organized benchmarks table at the end of part 4 of this article). Why? In a nutshell: this condition won't let the compiler use any low-level optimization such as SIMD, pre-fetching vector elements in batch to memory/cache or other (like treating the scalar value as a single, never changing constant throughout the entire calcuation). Instead I believe it forces it to handle branch prediction on each call and and that is just irrelevant to this scenario (we created that problem with our solution! It wasn't there on the first solution). Well - replacing one problem with another wasn't the idea. And besides, a solution 2-3 times slower can't be a solution to the problem in a high-scale analytical database!
Before we'll wrap this part up, let me tell the modern C++ savvy readers that you might have implemented the above Paramater<T> class in a much more elegant way, by using C++ 17's std::variant:
I won't go into details about this implementation. It is really quite straightforward once you get familiar with std::variant (which by itself isn't a new concept in other languages). The surprising thing is that this version is even slower than the previous one without using std::variant! (Again, it would be part of the benchmarks table at the end of part 4 of this article). So, once again, this is not the solution we look for.
These different experiments with run-time type resolution might start to convince you that we look for an answer on the wrong place - we should try and work this out in compile time: TMP to the rescue!
That concludes this part of the article. Please add a comment if you have something to ask/remark/fix or hit the like button if you wish to encourage me :)
On the next part of this article, part III, I'll introduce some advanced modern C++ template meta-programming (TMP) techniques and classes that will be used as building blocks for my suggested solution on the final part that will conclude this set of articles. Stay tuned!
* MonetDB - an open-source column-oriented DBMS developed at the Centrum Wiskunde & Informatica in the Netherlands, that was designed to provide high performance on complex queries against large databases, such as combining tables with hundreds of columns and millions of rows (from Wikipedia).
** ClickHouse - an open-source column-oriented DBMS for online analytical processing. ClickHouse was developed by the Russian IT company Yandex for the Yandex.Metrica web analytics service. ClickHouse allows analysis of data that is updated in real time. The system is marketed for high performance (from Wikipedia).