Thursday, July 24, 2003

Databases - My Brilliant Career Tuning SQL
One of the first things I did when I started my first real job, way back when GWB I was prez, was to try to tune a bunch of complex SQL statements. So what's changed in 15 years? Well, the databases are much larger (hundreds of GBs vs. hundreds of MBs), we're now up to Oracle Version 9 (it was v5 then), PL/SQL and Java came along to change how we write programs, the net came along to change the kind of programs we write, Larry Ellison is even richer, etc., etc.

What hasn't changed? The basic nature of the problem. Conceptually, there has been no real progress. I'm still doing essentially the same thing - figuring out what the database engine should do and then finding a way to make the engine do it.

Yes, it's all bigger-faster-better, and yes, I don't have to ensure that every single statement is optimized like I did in days of yore. But the growth in the size and complexity of applications effectively cancels out whatever other progress has been made. And so I find myself back where I started. But I'm not alone. The whole database industry has the same problem.

It wasn't supposed to be this way. What was supposed to happen was that the database engine's optimizer would be able to always figure out the optimal execution plan for you. It would do this using the statistics collected in the database catalog together with a variety of expression transformation algorithms to produce the best possible access plan. No intervention required, and on to the next level of abstraction. But we haven't gotten there. And it's not looking like we will anytime soon.

Now, in the interests of full disclosure, I should tell you that SQL tuning has been very, very good to me. It's one of the reasons why I receive a handsome salary. While I consider it one of the lesser talents that I bring to the show, making something run several orders of magnitude faster than it did before I performed my magic tuning act certainly impresses the masses. And the semi-arcane knowledge required seems plenty daunting to the uninitiated. It does require a certain feel or intuitive sense, but so do lots of other things. In short, it's an overrated skill.

But the application of said knowledge and instinct ought to be completely unnecessary. Back in the early-to-mid '80s, database researchers and gurus were just about sure that the need for manual tuning could be eliminated. And indeed it can be, in a relational database system, because in such a system every query can be reduced to a set of canonical relational algebra expressions. Potential access paths can be analyzed using the statistics captured in the system catalog and the best ones chosen, the expressions can be sequenced in the most efficient order, and the statement executed - all with no human intervention required. So the effort previously devoted to manually optimizing physical data access could instead go into making applications and systems work better for the people who use and depend on them - a small example of the general way in which human progress is made: formerly labor-intensive tasks get automated, letting us move to the next higher level of abstraction.
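
To make that concrete, here's the flavor of the rewriting involved - the classic "selection pushdown" identity, using made-up EMP and DEPT tables (my own illustration, not anything from the literature):

    restrict(EMP join DEPT, loc = 'DALLAS')
      = EMP join restrict(DEPT, loc = 'DALLAS')

The two forms are provably equivalent in set-based relational algebra, but the second applies the cheap filter first and joins far fewer rows. An optimizer working over a canonical algebra can apply identities like this mechanically and then cost the alternatives against the catalog statistics.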

And for one brief, shining moment this was actually achieved. A 1988 article in Database Programming and Design compared several different database management systems (Oracle, Sybase, Ingres, etc.), each executing a variety of SQL queries. The twist was that all of the SQL statements were semantically equivalent - i.e., they meant the same thing and produced the same results. But thanks to the redundancy of SQL syntax, the same 3-table join could be expressed in a variety of different ways. So the article measured how fast each DBMS executed each variation of the statement. Ideally, every variation would take roughly the same time to execute, because the optimizer would produce the same execution plan, and the only variable would be how long the optimizer took to figure out the access plan.
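
I don't have the article's exact queries in front of me, but the redundancy it exploited looks something like this - three ways of asking the same question, on hypothetical EMP and DEPT tables (two tables instead of three for brevity; assume DEPTNO is the key of DEPT, so all three return the same rows):

    -- Formulation 1: a plain join
    SELECT e.ename
    FROM emp e, dept d
    WHERE e.deptno = d.deptno
      AND d.loc = 'DALLAS';

    -- Formulation 2: a nested subquery
    SELECT e.ename
    FROM emp e
    WHERE e.deptno IN
      (SELECT d.deptno FROM dept d WHERE d.loc = 'DALLAS');

    -- Formulation 3: a correlated EXISTS
    SELECT e.ename
    FROM emp e
    WHERE EXISTS
      (SELECT 1 FROM dept d
       WHERE d.deptno = e.deptno AND d.loc = 'DALLAS');

An ideal optimizer would reduce all three to the same plan and execute them in the same time.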

Well, it didn't work that way. Execution times ranged from about 0.1 seconds to over 1,000 seconds. Oracle, which at that time had a syntactically based optimizer, produced both the fastest and the slowest results. But the others weren't much better.

With one exception - Ingres. Every query got executed in about 2 seconds. No significant variations. A bit slow, perhaps - every other DBMS's best time was better - but almost perfectly consistent. Why? Because the Ingres optimizer produced the same execution plan every time, while the other DBMS's produced different plans for different formulations of the same statement. How? Well, the secret was that Ingres originally used a different query language - QUEL - instead of SQL. QUEL was much closer to the relational calculus languages that Codd originally proposed. Consequently, it was much easier to optimize, because every QUEL statement could be transformed into a canonical relational algebra expression. Ingres was eventually forced to support SQL, but they were able to apply what they'd learned about optimizing QUEL to SQL. Originally, in fact, they translated SQL into QUEL and then optimized it; by Ingres 7.x, they had eliminated the translation step and optimized the SQL directly. It didn't work as well as optimizing QUEL, but it was better than what anyone else was doing.
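
For anyone who never saw QUEL, here's my reconstruction of one simple query in both languages - note how QUEL's explicit range variable echoes the tuple variables of the relational calculus:

    /* QUEL */
    range of e is emp
    retrieve (e.ename)
    where e.sal > 50000

    -- SQL
    SELECT ename
    FROM emp
    WHERE sal > 50000;

There's essentially one way to write that retrieval in QUEL, which is exactly what made it so amenable to canonical transformation.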

So it can be done. Ingres was slow, but making it faster was an engineering problem, not a research problem. Engineering problems, in my definition, are ones that can be solved incrementally. Making cars more fuel-efficient is an engineering problem. Research problems, on the other hand, deal with bigger, more fundamental issues. Fusion power is a research problem. Nanotechnology is a research problem. Research leads to conceptual breakthroughs; engineering takes those breakthroughs and produces viable, usable products - like a relational DBMS with a reliable optimizer. It can be done.

But it won't be, at least not by our present crop of DBMS's. That's because they use SQL. And guess what? SQL can't really be optimized. Why not? Because it breaks so many of the fundamental rules of a relational language that it can't be systematically and unambiguously optimized. One of the properties of relational algebra is that equivalent expressions can be evaluated in any order and still produce the same result. But because SQL allows duplicate rows, this isn't true: different orders of evaluation can yield different results. The syntactic redundancy mentioned earlier complicates things further, making the task of transforming expressions into a canonical form even harder. An optimizer that can't always reduce equivalent expressions to a single form, and can't always arrange those expressions in the best order, is never going to reliably produce the best (or even an acceptable) execution plan.
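
A tiny illustration of why duplicates matter (mine, not from any particular paper): in set-based relational algebra a relation can't contain the same tuple twice, so R union R is just R, and a redundant branch can be pruned away. SQL's bag semantics say otherwise:

    -- In set algebra, R union R = R. Not so here:
    SELECT deptno FROM emp
    UNION ALL
    SELECT deptno FROM emp;   -- every row now appears twice

    -- And duplicate counts change answers, not just row counts:
    SELECT AVG(sal) FROM emp;           -- averages over all rows
    SELECT AVG(DISTINCT sal) FROM emp;  -- different result if two salaries match

Because the answer depends on exactly how many copies of each row survive each step, the optimizer can't freely reorder or collapse expressions the way a set-based algebra would permit.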

And so here we are, still hand-optimizing complex SQL queries instead of solving problems for people and businesses. One of the prevailing myths of this society is that progress moves in a straight, smooth path from one technological triumph to the next. But it doesn't. The path isn't smooth; it's full of ruts and dead ends, and we seldom know where it will lead us. Sometimes, we even lose the path completely for a time. That's why Western philosophy didn't advance much beyond Aristotle and Plato for over 1000 years, that's why we're still using the internal combustion engine 150 years after its invention, and that's why we don't have real relational databases.

1 comment:

Iggy Fernandez said...

Very well said. The 1988 article you mention was by Fabian Pascal. He has made it available at http://www.dbdebunk.com/page/page/1317920.htm.

Regards,

Iggy Fernandez