Correlated subqueries
Quote from Dave Voorhis on October 25, 2020, 10:22 amQuote from Jake Wheat on October 25, 2020, 3:14 amQuote from Darren Duncan on October 24, 2020, 4:07 amAny D language intended for industry will provide an analogy to correlated subqueries
Did anyone here who built a D include a feature like this?
There's no need. An implementation of relational algebra appears to nicely (arguably, more nicely) handle anything where you'd use a correlated subquery.
Correlated subqueries aren't a theoretical construct or a feature, they're an oddity driven by SQL's quirkiness.
Quote from Jake Wheat on October 25, 2020, 3:14 amQuote from Darren Duncan on October 24, 2020, 4:07 amAny D language intended for industry will provide an analogy to correlated subqueries
Did anyone here who built a D include a feature like this?
There's no need. An implementation of relational algebra appears to nicely (arguably, more nicely) handle anything where you'd use a correlated subquery.
Correlated subqueries aren't a theoretical construct or a feature, they're an oddity driven by SQL's quirkiness.
Quote from Erwin on October 25, 2020, 4:06 pmQuote from Jake Wheat on October 24, 2020, 2:23 amHi All,
Correlated subqueries in a D?
Let's say I have a new D based DBMS, and somehow I know I can persuade a load of existing SQL users to switch by rewriting their existing SQL to a D.
Some of them use correlated subqueries extensively. What could be the options to support these users - do you think any of the following make sense, are there other options?
1. implement a D which allows you to mix relational algebra and a complete relational calculus
2. do a SQL-inspired small fraction of relational calculus embedded in relational algebra to provide a way to easily rewrite any SQL csq query into this language (is this possible?)
3. ask customers to convert most of their csqs to joins themselves, and the ones they can't, then ... (not sure - ask them to use divide?)
Thanks,
Jake.Option 1. does not make sense to me because of the implementation redundancy imposed on the DBMS writer. You can do everything you need with the algebra. Why offer a second means to achieve the same ? It's more work for the DBMS writer, more opportunity to introduce errors inside the DBMS, more testing to do, more bugs remaining despite all the testing effort when the product gets shipped, etc. etc.
Option 2. does not make sense to me for exactly the same reasons, plus because of the phenomenon that it might then be "easy to rewrite into the style of the new language" but the magnitude of the problem for the DBMS system to find an optimal data access strategy seems likely to increase seriously. To me at least.
Option 3. would be what I'd always suggest, but with the remark that "convert to joins" is too narrow a view of what the solution leads to. "List the customers who have not placed any orders" involves WHERE NOT EXISTS (SELECT DISTINCT 1 AS DUMMY_ATTR FROM ORDERS WHERE C.CID = O.CID) and that's definitely a correlated subquery and supposing you have a "conversion strategy" that, say, takes advantage of the fact that D has RVA's (a natural direction to follow when seeking a solution because obviously "nested query evaluation" leads quite naturally to "nested relation values"), you might end up with a query that :
- EXTENDs the relational expression corresponding to the outer select with an RVA corresponding to the csq
- then, say, projects that RVA to {} (TRANSFORMs the expression to add a second RVA that is that projection of the first RVA, that is) (yielding either TABLE_DEE or TABLE_DUM) to cater for the NOT EXISTS portion
- then, say, wraps a restrict around the resulting relational expression to test whether THAT_SECOND_RVA = TABLE_DUM.
when the actual result of the conversion should really just be CUSTOMER NOT MATCHING ORDER.
What I meant where I said "does not exist" is "we are never going to find an algorithm that mechanically produces the algebra-based equivalent to any given SQL query and guaranteeably will not leave the users with an abomination like that coming out of my bulleted list". (Note that Codd's reduction algorithm applied only to Codd's algebra, and that's a very small subset of the thing we call "relational algebra" these days !) And if we are never going to find any such algorithm (whether for computer implementation or for follow-this-procedure-literally manuals handed out to, eurhm, lesser skilled human query [re]writers), then what kind of "support" is it that you are thinking of and that a DBMS writer can still possibly have to offer ? Thought I might have had to clarify that with my first reply so I'm doing it now.
Quote from Jake Wheat on October 24, 2020, 2:23 amHi All,
Correlated subqueries in a D?
Let's say I have a new D based DBMS, and somehow I know I can persuade a load of existing SQL users to switch by rewriting their existing SQL to a D.
Some of them use correlated subqueries extensively. What could be the options to support these users - do you think any of the following make sense, are there other options?
1. implement a D which allows you to mix relational algebra and a complete relational calculus
2. do a SQL-inspired small fraction of relational calculus embedded in relational algebra to provide a way to easily rewrite any SQL csq query into this language (is this possible?)
3. ask customers to convert most of their csqs to joins themselves, and the ones they can't, then ... (not sure - ask them to use divide?)
Thanks,
Jake.
Option 1. does not make sense to me because of the implementation redundancy imposed on the DBMS writer. You can do everything you need with the algebra. Why offer a second means to achieve the same ? It's more work for the DBMS writer, more opportunity to introduce errors inside the DBMS, more testing to do, more bugs remaining despite all the testing effort when the product gets shipped, etc. etc.
Option 2. does not make sense to me for exactly the same reasons, plus because of the phenomenon that it might then be "easy to rewrite into the style of the new language" but the magnitude of the problem for the DBMS system to find an optimal data access strategy seems likely to increase seriously. To me at least.
Option 3. would be what I'd always suggest, but with the remark that "convert to joins" is too narrow a view of what the solution leads to. "List the customers who have not placed any orders" involves WHERE NOT EXISTS (SELECT DISTINCT 1 AS DUMMY_ATTR FROM ORDERS WHERE C.CID = O.CID) and that's definitely a correlated subquery and supposing you have a "conversion strategy" that, say, takes advantage of the fact that D has RVA's (a natural direction to follow when seeking a solution because obviously "nested query evaluation" leads quite naturally to "nested relation values"), you might end up with a query that :
- EXTENDs the relational expression corresponding to the outer select with an RVA corresponding to the csq
- then, say, projects that RVA to {} (TRANSFORMs the expression to add a second RVA that is that projection of the first RVA, that is) (yielding either TABLE_DEE or TABLE_DUM) to cater for the NOT EXISTS portion
- then, say, wraps a restrict around the resulting relational expression to test whether THAT_SECOND_RVA = TABLE_DUM.
when the actual result of the conversion should really just be CUSTOMER NOT MATCHING ORDER.
What I meant where I said "does not exist" is "we are never going to find an algorithm that mechanically produces the algebra-based equivalent to any given SQL query and guaranteeably will not leave the users with an abomination like that coming out of my bulleted list". (Note that Codd's reduction algorithm applied only to Codd's algebra, and that's a very small subset of the thing we call "relational algebra" these days !) And if we are never going to find any such algorithm (whether for computer implementation or for follow-this-procedure-literally manuals handed out to, eurhm, lesser skilled human query [re]writers), then what kind of "support" is it that you are thinking of and that a DBMS writer can still possibly have to offer ? Thought I might have had to clarify that with my first reply so I'm doing it now.
Quote from dandl on October 25, 2020, 11:02 pmQuote from Dave Voorhis on October 25, 2020, 9:14 amQuote from dandl on October 25, 2020, 4:13 amA while ago I wrote some code to run correlated subqueries on a SQL engine, by converting them to joins. I started with TPC-H, I think I managed to convert all but one query with csqs to use joins instead mechanically, but these conversions were all special cases for particular shapes of csqs. I didn't see any way to generalize the rewrites. And there was one query that I couldn't figure out how to rewrite to a join even manually (I wasn't thinking much about the semantics of the query).
Do you have anything more to say about there not being a feasible mechanical solution? I think something along the lines of the method of converting relational calculus to relational algebra would work, but I have no idea whether this would actually be usable for product.
As I've been saying, my view is that you can mechanically rewrite a CSQ as a nested function, in which the body is the nested query and the 'correlated' bits are parameters. Do you see any reason why this would not work?
I think that window functions might be a little easier - I suspect there's a straightforward and reasonable mechanical conversion possible (i.e. you can produce an efficient query plan from the result).
No, there really isn't. The inherent features of window functions is the window, which is built on a sorted query. TTM and TD don't admit to the concept of the sorted query (core philosophy I think) so no sorts and therefore no windows.
More than core philosophy, it's per the definition of a relation. A "sorted relation" isn't part of any relational model.
Of course, that doesn't preclude defining a RANK operator that returns a relation indicating tuples ranked on some order, or defining some non-relation construct that defines a tuple order. Tutorial D defines an ARRAY construct for this, and you could certainly define a rich set of windowing (and other) operators for ARRAY. But that is outside the scope of the relational model.
We've discussed this before.
For the sake of those who missed that gripping discussion: I agree with (a), disagree strongly with (b). Andl has no concept of a 'sorted relation', but it does have a concept of an 'ordering function'. This is used by a variety of queries, including ORD (for sequence numbers), SKIP/TAKE (for paged queries), LEAD/LAG (eg to report differences over time), SUM/MIN/MAX/AVE (for running sums, moving averages, etc). This is fully compliant with the relational model, and is simply a further modest extension of the familiar Extended RA.
Arguments based on 'sorted relations' are plain wrong, but some refuse to be convinced.
Quote from Dave Voorhis on October 25, 2020, 9:14 amQuote from dandl on October 25, 2020, 4:13 amA while ago I wrote some code to run correlated subqueries on a SQL engine, by converting them to joins. I started with TPC-H, I think I managed to convert all but one query with csqs to use joins instead mechanically, but these conversions were all special cases for particular shapes of csqs. I didn't see any way to generalize the rewrites. And there was one query that I couldn't figure out how to rewrite to a join even manually (I wasn't thinking much about the semantics of the query).
Do you have anything more to say about there not being a feasible mechanical solution? I think something along the lines of the method of converting relational calculus to relational algebra would work, but I have no idea whether this would actually be usable for product.
As I've been saying, my view is that you can mechanically rewrite a CSQ as a nested function, in which the body is the nested query and the 'correlated' bits are parameters. Do you see any reason why this would not work?
I think that window functions might be a little easier - I suspect there's a straightforward and reasonable mechanical conversion possible (i.e. you can produce an efficient query plan from the result).
No, there really isn't. The inherent features of window functions is the window, which is built on a sorted query. TTM and TD don't admit to the concept of the sorted query (core philosophy I think) so no sorts and therefore no windows.
More than core philosophy, it's per the definition of a relation. A "sorted relation" isn't part of any relational model.
Of course, that doesn't preclude defining a RANK operator that returns a relation indicating tuples ranked on some order, or defining some non-relation construct that defines a tuple order. Tutorial D defines an ARRAY construct for this, and you could certainly define a rich set of windowing (and other) operators for ARRAY. But that is outside the scope of the relational model.
We've discussed this before.
For the sake of those who missed that gripping discussion: I agree with (a), disagree strongly with (b). Andl has no concept of a 'sorted relation', but it does have a concept of an 'ordering function'. This is used by a variety of queries, including ORD (for sequence numbers), SKIP/TAKE (for paged queries), LEAD/LAG (eg to report differences over time), SUM/MIN/MAX/AVE (for running sums, moving averages, etc). This is fully compliant with the relational model, and is simply a further modest extension of the familiar Extended RA.
Arguments based on 'sorted relations' are plain wrong, but some refuse to be convinced.
Quote from dandl on October 25, 2020, 11:10 pmQuote from Dave Voorhis on October 25, 2020, 10:22 amQuote from Jake Wheat on October 25, 2020, 3:14 amQuote from Darren Duncan on October 24, 2020, 4:07 amAny D language intended for industry will provide an analogy to correlated subqueries
Did anyone here who built a D include a feature like this?
There's no need. An implementation of relational algebra appears to nicely (arguably, more nicely) handle anything where you'd use a correlated subquery.
Correlated subqueries aren't a theoretical construct or a feature, they're an oddity driven by SQL's quirkiness.
True enough. But not really all that odd: they just use a function construct (with an unusual syntax) instead of allowing relational queries to be composed directly. That's why they're hard to rewrite optimally: it amounts to unwrapping a function into the value that will result form multiple calls to that function.
Quote from Dave Voorhis on October 25, 2020, 10:22 amQuote from Jake Wheat on October 25, 2020, 3:14 amQuote from Darren Duncan on October 24, 2020, 4:07 amAny D language intended for industry will provide an analogy to correlated subqueries
Did anyone here who built a D include a feature like this?
There's no need. An implementation of relational algebra appears to nicely (arguably, more nicely) handle anything where you'd use a correlated subquery.
Correlated subqueries aren't a theoretical construct or a feature, they're an oddity driven by SQL's quirkiness.
True enough. But not really all that odd: they just use a function construct (with an unusual syntax) instead of allowing relational queries to be composed directly. That's why they're hard to rewrite optimally: it amounts to unwrapping a function into the value that will result form multiple calls to that function.
Quote from dandl on October 25, 2020, 11:28 pm
What I meant where I said "does not exist" is "we are never going to find an algorithm that mechanically produces the algebra-based equivalent to any given SQL query and guaranteeably will not leave the users with an abomination like that coming out of my bulleted list". (Note that Codd's reduction algorithm applied only to Codd's algebra, and that's a very small subset of the thing we call "relational algebra" these days !) And if we are never going to find any such algorithm (whether for computer implementation or for follow-this-procedure-literally manuals handed out to, eurhm, lesser skilled human query [re]writers), then what kind of "support" is it that you are thinking of and that a DBMS writer can still possibly have to offer ? Thought I might have had to clarify that with my first reply so I'm doing it now.
No, that's not it. You can directly translate CSQ into Extended RA, provided your ERA offers functions over attribute values, and those functions support values of type relation. And you can express that function as a RELCON, as discussed previously. You just won't easily get the kind of code we like to write.
And no, the RA is not a 'small subset'. The only things added to the RA itself in the past 40 odd years are new values (EXTEND/RELCON), aggregation and while/transitive closure. Of course any usable implementation will also provide a type system, but that's a separate concern. The rest is all there or directly implied by Codd in 1972.
What I meant where I said "does not exist" is "we are never going to find an algorithm that mechanically produces the algebra-based equivalent to any given SQL query and guaranteeably will not leave the users with an abomination like that coming out of my bulleted list". (Note that Codd's reduction algorithm applied only to Codd's algebra, and that's a very small subset of the thing we call "relational algebra" these days !) And if we are never going to find any such algorithm (whether for computer implementation or for follow-this-procedure-literally manuals handed out to, eurhm, lesser skilled human query [re]writers), then what kind of "support" is it that you are thinking of and that a DBMS writer can still possibly have to offer ? Thought I might have had to clarify that with my first reply so I'm doing it now.
No, that's not it. You can directly translate CSQ into Extended RA, provided your ERA offers functions over attribute values, and those functions support values of type relation. And you can express that function as a RELCON, as discussed previously. You just won't easily get the kind of code we like to write.
And no, the RA is not a 'small subset'. The only things added to the RA itself in the past 40 odd years are new values (EXTEND/RELCON), aggregation and while/transitive closure. Of course any usable implementation will also provide a type system, but that's a separate concern. The rest is all there or directly implied by Codd in 1972.
Quote from Erwin on October 26, 2020, 1:05 pmQuote from dandl on October 25, 2020, 11:28 pmThe only things added to the RA itself in the past 40 odd years are new values (EXTEND/RELCON), aggregation and while/transitive closure. Of course any usable implementation will also provide a type system, but that's a separate concern. The rest is all there or directly implied by Codd in 1972.
Codd's RESTRICT was limited to one single attribute, no scalar functions, only the six usual comparison operators available, and only comparing to a literal. These days it's just any boolean expression so long as it's deterministic. I'm not going to bother finding yet other examples.
Quote from dandl on October 25, 2020, 11:28 pmThe only things added to the RA itself in the past 40 odd years are new values (EXTEND/RELCON), aggregation and while/transitive closure. Of course any usable implementation will also provide a type system, but that's a separate concern. The rest is all there or directly implied by Codd in 1972.
Codd's RESTRICT was limited to one single attribute, no scalar functions, only the six usual comparison operators available, and only comparing to a literal. These days it's just any boolean expression so long as it's deterministic. I'm not going to bother finding yet other examples.
Quote from Jake Wheat on October 27, 2020, 1:17 amQuote from dandl on October 25, 2020, 12:58 am...
As I said earlier, I don't agree, I think it is feasible. Here is the direct rewrite in Andl-like pseudo-code.Thanks David, I will try to digest your idea.
Quote from dandl on October 25, 2020, 12:58 am...
As I said earlier, I don't agree, I think it is feasible. Here is the direct rewrite in Andl-like pseudo-code.
Thanks David, I will try to digest your idea.
Quote from Jake Wheat on October 27, 2020, 1:21 amQuote from Darren Duncan on October 25, 2020, 6:33 amQuote from Jake Wheat on October 25, 2020, 3:47 amQuote from Darren Duncan on October 25, 2020, 3:21 amI am in the process of doing so with my in-development industrial D language. Also I see correlated subquery analogies as no big deal worth singling out for a big discussion, as implementing them is trivially easy, as much so as any other features, and anyone who uses this in SQL should be able to express the same logic in a D language with no greater effort than in SQL.
Can you explain the trivially easy implementation?
Muldis Data Language is architecturally like a generic application programming language, and everything users write in it has the form of generic pure functions (composed only of pure expressions per a functional language) and procedures (composed of generic statements and expressions), so users have the ability fo formulate arbitrarily complex user-defined relational types and operators.
With such a foundation, how can it not be trivially easy to express everything SQL can, including correlated subqueries?
If you feel otherwise then please explain how this language should not be capable of trivially expressing the logic of a correlated subquery in a similar looking arrangement?
If someone claims this is trivial, I expect something like 'here is some rough concrete syntax, here are some rough physical operators which look like they don't perform badly, and here's a sketch of how you get from one to the other'.
Your ideas about extensibility seem good in general, but did I misunderstand - you seem to be saying 'my language is extensible therefore users can implement correlated subqueries themselves, and so implementing them is trivial'? This doesn't seem like a complete argument to me.
Quote from Darren Duncan on October 25, 2020, 6:33 amQuote from Jake Wheat on October 25, 2020, 3:47 amQuote from Darren Duncan on October 25, 2020, 3:21 amI am in the process of doing so with my in-development industrial D language. Also I see correlated subquery analogies as no big deal worth singling out for a big discussion, as implementing them is trivially easy, as much so as any other features, and anyone who uses this in SQL should be able to express the same logic in a D language with no greater effort than in SQL.
Can you explain the trivially easy implementation?
Muldis Data Language is architecturally like a generic application programming language, and everything users write in it has the form of generic pure functions (composed only of pure expressions per a functional language) and procedures (composed of generic statements and expressions), so users have the ability fo formulate arbitrarily complex user-defined relational types and operators.
With such a foundation, how can it not be trivially easy to express everything SQL can, including correlated subqueries?
If you feel otherwise then please explain how this language should not be capable of trivially expressing the logic of a correlated subquery in a similar looking arrangement?
If someone claims this is trivial, I expect something like 'here is some rough concrete syntax, here are some rough physical operators which look like they don't perform badly, and here's a sketch of how you get from one to the other'.
Your ideas about extensibility seem good in general, but did I misunderstand - you seem to be saying 'my language is extensible therefore users can implement correlated subqueries themselves, and so implementing them is trivial'? This doesn't seem like a complete argument to me.
Quote from Jake Wheat on October 27, 2020, 1:23 amQuote from Dave Voorhis on October 25, 2020, 9:14 amMore than core philosophy, it's per the definition of a relation. A "sorted relation" isn't part of any relational model.
Of course, that doesn't preclude defining a RANK operator that returns a relation indicating tuples ranked on some order, or defining some non-relation construct that defines a tuple order. Tutorial D defines an ARRAY construct for this, and you could certainly define a rich set of windowing (and other) operators for ARRAY. But that is outside the scope of the relational model.
We've discussed this before.
So do you agree that with the addition of something like 'arrays of tuples' in a TTM language, you can implement window functions similar to SQL (and even create a general mechanical conversion) that will execute efficiently, or are there other issues?
Quote from Dave Voorhis on October 25, 2020, 9:14 amMore than core philosophy, it's per the definition of a relation. A "sorted relation" isn't part of any relational model.
Of course, that doesn't preclude defining a RANK operator that returns a relation indicating tuples ranked on some order, or defining some non-relation construct that defines a tuple order. Tutorial D defines an ARRAY construct for this, and you could certainly define a rich set of windowing (and other) operators for ARRAY. But that is outside the scope of the relational model.
We've discussed this before.
So do you agree that with the addition of something like 'arrays of tuples' in a TTM language, you can implement window functions similar to SQL (and even create a general mechanical conversion) that will execute efficiently, or are there other issues?
Quote from Jake Wheat on October 27, 2020, 1:32 amQuote from Erwin on October 25, 2020, 4:06 pmYou can do everything you need with the algebra. Why offer a second means to achieve the same ? It's more work for the DBMS writer, more opportunity to introduce errors inside the DBMS, more testing to do, more bugs remaining despite all the testing effort when the product gets shipped, etc. etc.
This is also what Dave and Hugh are saying? I think I'm pretty convinced by this argument now.
What I meant where I said "does not exist" is "we are never going to find an algorithm that mechanically produces the algebra-based equivalent to any given SQL query and guaranteeably will not leave the users with an abomination like that coming out of my bulleted list". (Note that Codd's reduction algorithm applied only to Codd's algebra, and that's a very small subset of the thing we call "relational algebra" these days !) And if we are never going to find any such algorithm (whether for computer implementation or for follow-this-procedure-literally manuals handed out to, eurhm, lesser skilled human query [re]writers), then what kind of "support" is it that you are thinking of and that a DBMS writer can still possibly have to offer ? Thought I might have had to clarify that with my first reply so I'm doing it now.
So there is a mechanical algorithm that can convert correlated subqueries into relational algebra, but it won't produce the queries that we would write from scratch, and we'd never want to use the result of such an algorithm because these queries would be difficult to read and maintain, and difficult to execute efficiently?
Quote from Erwin on October 25, 2020, 4:06 pmYou can do everything you need with the algebra. Why offer a second means to achieve the same ? It's more work for the DBMS writer, more opportunity to introduce errors inside the DBMS, more testing to do, more bugs remaining despite all the testing effort when the product gets shipped, etc. etc.
This is also what Dave and Hugh are saying? I think I'm pretty convinced by this argument now.
What I meant where I said "does not exist" is "we are never going to find an algorithm that mechanically produces the algebra-based equivalent to any given SQL query and guaranteeably will not leave the users with an abomination like that coming out of my bulleted list". (Note that Codd's reduction algorithm applied only to Codd's algebra, and that's a very small subset of the thing we call "relational algebra" these days !) And if we are never going to find any such algorithm (whether for computer implementation or for follow-this-procedure-literally manuals handed out to, eurhm, lesser skilled human query [re]writers), then what kind of "support" is it that you are thinking of and that a DBMS writer can still possibly have to offer ? Thought I might have had to clarify that with my first reply so I'm doing it now.
So there is a mechanical algorithm that can convert correlated subqueries into relational algebra, but it won't produce the queries that we would write from scratch, and we'd never want to use the result of such an algorithm because these queries would be difficult to read and maintain, and difficult to execute efficiently?