The Forum for Discussion about The Third Manifesto and Related Matters

Please or Register to create posts and topics.

Correlated subqueries

PreviousPage 4 of 8Next
Quote from Jake Wheat on October 27, 2020, 1:21 am
Quote from Darren Duncan on October 25, 2020, 6:33 am

Muldis Data Language is architecturally like a generic application programming language, and everything users write in it has the form of generic pure functions (composed only of pure expressions per a functional language) and procedures (composed of generic statements and expressions), so users have the ability fo formulate arbitrarily complex user-defined relational types and operators.

With such a foundation, how can it not be trivially easy to express everything SQL can, including correlated subqueries?

If someone claims this is trivial, I expect something like 'here is some rough concrete syntax, here are some rough physical operators which look like they don't perform badly, and here's a sketch of how you get from one to the other'.

Your ideas about extensibility seem good in general, but did I misunderstand - you seem to be saying 'my language is extensible therefore users can implement correlated subqueries themselves, and so implementing them is trivial'? This doesn't seem like a complete argument to me.

You are asking a broad question and I'm giving an equally broad answer.  The concept of a correlated subquery is quite broad.  It is as broad as saying "relation-valued function".  So I'm saying I support arbitrary user-defined relation-valued functions.  If you want concrete syntax from me then you have to give me one or more concrete SQL queries that you want to see the translations of.  Then I'll show you how trivial it is to map one to the other.

Quote from Jake Wheat on October 27, 2020, 1:23 am
Quote from Dave Voorhis on October 25, 2020, 9:14 am

More than core philosophy, it's per the definition of a relation. A "sorted relation" isn't part of any relational model.

Of course, that doesn't preclude defining a RANK operator that returns a relation indicating tuples ranked on some order, or defining some non-relation construct that defines a tuple order. Tutorial D defines an ARRAY construct for this, and you could certainly define a rich set of windowing (and other) operators for ARRAY. But that is outside the scope of the relational model.

We've discussed this before.

So do you agree that with the addition of something like 'arrays of tuples' in a TTM language, you can implement window functions similar to SQL (and even create a general mechanical conversion) that will execute efficiently, or are there other issues?

No, I think it's a terrible idea. I think ARRAY in TD is the bastard child of 'we can't think of anything better'. I found in writing Andl samples that having relations as the only collection was nearly enough, roughly comparable to C# with LINQ, and adding a few minor tweaks just about gets there. I never found any need for an array (I don't use them much in C# either). There are issues dealing with data that has duplicates, so you need a collection type for those or a way of adding a sequence number.

I don't think there is any reasonable path from ARRAY to window functions, but that's for others to propose and defend.

Andl - A New Database Language - andl.org
Quote from Jake Wheat on October 27, 2020, 1:32 am
Quote from Erwin on October 25, 2020, 4:06 pm

You can do everything you need with the algebra.  Why offer a second means to achieve the same ?  It's more work for the DBMS writer, more opportunity to introduce errors inside the DBMS, more testing to do, more bugs remaining despite all the testing effort when the product gets shipped, etc. etc.

This is also what Dave and Hugh are saying? I think I'm pretty convinced by this argument now.

What I meant where I said "does not exist" is "we are never going to find an algorithm that mechanically produces the algebra-based equivalent to any given SQL query and guaranteeably will not leave the users with an abomination like that coming out of my bulleted list".  (Note that Codd's reduction algorithm applied only to Codd's algebra, and that's a very small subset of the thing we call "relational algebra" these days !)  And if we are never going to find any such algorithm (whether for computer implementation or for follow-this-procedure-literally manuals handed out to, eurhm, lesser skilled human query [re]writers), then what kind of "support" is it that you are thinking of and that a DBMS writer can still possibly have to offer ?  Thought I might have had to clarify that with my first reply so I'm doing it now.

So there is a mechanical algorithm that can convert correlated subqueries into relational algebra, but it won't produce the queries that we would write from scratch, and we'd never want to use the result of such an algorithm because these queries would be difficult to read and maintain, and difficult to execute efficiently?

Almost. There is at least one mechanical approach to rewriting CSQ, based on nested functions; it's not one would (usually) write from scratch, but it's perfectly easy to read and maintain; it may execute with very high efficiency or not, but it's probably impossible to optimise while the inner query exists separately inside a function.

Andl - A New Database Language - andl.org
Quote from dandl on October 27, 2020, 2:28 am
Quote from Jake Wheat on October 27, 2020, 1:23 am
Quote from Dave Voorhis on October 25, 2020, 9:14 am

More than core philosophy, it's per the definition of a relation. A "sorted relation" isn't part of any relational model.

Of course, that doesn't preclude defining a RANK operator that returns a relation indicating tuples ranked on some order, or defining some non-relation construct that defines a tuple order. Tutorial D defines an ARRAY construct for this, and you could certainly define a rich set of windowing (and other) operators for ARRAY. But that is outside the scope of the relational model.

We've discussed this before.

So do you agree that with the addition of something like 'arrays of tuples' in a TTM language, you can implement window functions similar to SQL (and even create a general mechanical conversion) that will execute efficiently, or are there other issues?

No, I think it's a terrible idea. I think ARRAY in TD is the bastard child of 'we can't think of anything better'. I found in writing Andl samples that having relations as the only collection was nearly enough, roughly comparable to C# with LINQ, and adding a few minor tweaks just about gets there. I never found any need for an array (I don't use them much in C# either). There are issues dealing with data that has duplicates, so you need a collection type for those or a way of adding a sequence number.

I don't think there is any reasonable path from ARRAY to window functions, but that's for others to propose and defend.

It's straightforward and ARRAY is fine -- it simply addresses the need for random access, akin to SQL cursors.

I generalised aggregation in Tutorial D and implemented it in Rel, which is documented at https://reldb.org/c/index.php/read/user-defined-aggregate-operators/

As noted in the conclusion, implementing windowing functions is straightforward. The ad hoc generalisation of windowing is to simultaneously be able to access the current tuple (or attribute value) and arbitrarily any other tuple (or value of the same attribute.) Regardless what mechanism is used to implement it, semantically it is an array.

I'm the forum administrator and lead developer of Rel. Email me at dave@armchair.mb.ca with the Subject 'TTM Forum'. Download Rel from https://reldb.org
Quote from Jake Wheat on October 27, 2020, 1:23 am
Quote from Dave Voorhis on October 25, 2020, 9:14 am

More than core philosophy, it's per the definition of a relation. A "sorted relation" isn't part of any relational model.

Of course, that doesn't preclude defining a RANK operator that returns a relation indicating tuples ranked on some order, or defining some non-relation construct that defines a tuple order. Tutorial D defines an ARRAY construct for this, and you could certainly define a rich set of windowing (and other) operators for ARRAY. But that is outside the scope of the relational model.

We've discussed this before.

So do you agree that with the addition of something like 'arrays of tuples' in a TTM language, you can implement window functions similar to SQL (and even create a general mechanical conversion) that will execute efficiently, or are there other issues?

Yes, with something like arrays of tuples you can implement window functions similar to SQL. Whether ordering and selection of tuples is explicitly visible in the form of an array or implicitly via some ordering function doesn't make a substantial difference.

I'm the forum administrator and lead developer of Rel. Email me at dave@armchair.mb.ca with the Subject 'TTM Forum'. Download Rel from https://reldb.org
Quote from Dave Voorhis on October 27, 2020, 9:16 am
Quote from dandl on October 27, 2020, 2:28 am
Quote from Jake Wheat on October 27, 2020, 1:23 am
Quote from Dave Voorhis on October 25, 2020, 9:14 am

More than core philosophy, it's per the definition of a relation. A "sorted relation" isn't part of any relational model.

Of course, that doesn't preclude defining a RANK operator that returns a relation indicating tuples ranked on some order, or defining some non-relation construct that defines a tuple order. Tutorial D defines an ARRAY construct for this, and you could certainly define a rich set of windowing (and other) operators for ARRAY. But that is outside the scope of the relational model.

We've discussed this before.

So do you agree that with the addition of something like 'arrays of tuples' in a TTM language, you can implement window functions similar to SQL (and even create a general mechanical conversion) that will execute efficiently, or are there other issues?

No, I think it's a terrible idea. I think ARRAY in TD is the bastard child of 'we can't think of anything better'. I found in writing Andl samples that having relations as the only collection was nearly enough, roughly comparable to C# with LINQ, and adding a few minor tweaks just about gets there. I never found any need for an array (I don't use them much in C# either). There are issues dealing with data that has duplicates, so you need a collection type for those or a way of adding a sequence number.

I don't think there is any reasonable path from ARRAY to window functions, but that's for others to propose and defend.

It's straightforward and ARRAY is fine -- it simply addresses the need for random access, akin to SQL cursors.

Again, I'll wait and see. I've found no need for random access, and cursors form no part of the SQL query language. They support tuple-at-a-time programming, which TTM despises.

I generalised aggregation in Tutorial D and implemented it in Rel, which is documented at https://reldb.org/c/index.php/read/user-defined-aggregate-operators/

You did indeed. I was surprised at the complexity: I had implemented it in Andl with a single operator : fold().

As noted in the conclusion, implementing windowing functions is straightforward. The ad hoc generalisation of windowing is to simultaneously be able to access the current tuple (or attribute value) and arbitrarily any other tuple (or value of the same attribute.) Regardless what mechanism is used to implement it, semantically it is an array.

I look forward to seeing it. No, Andl has no arrays and the semantics of its windowing are not those of an array. There is no means provided to store a sorted dataset and the relative access provided by LEAD/LAG etc is transitory. The mental model is more like inspecting a flow of traffic: you can look at a car, or the one in front or the one behind, but you can never do the things you can with arrays, like absolute indexing or iterating over the whole array.

And because it is a relation and not an array, you keep the full power of the RA. You avoid the need to convert data into and out of ARRAY form (no UNORDER needed).

But I think you've already made up your mind.

 

Andl - A New Database Language - andl.org
Quote from dandl on October 27, 2020, 10:18 am
Quote from Dave Voorhis on October 27, 2020, 9:16 am
Quote from dandl on October 27, 2020, 2:28 am
Quote from Jake Wheat on October 27, 2020, 1:23 am
Quote from Dave Voorhis on October 25, 2020, 9:14 am

More than core philosophy, it's per the definition of a relation. A "sorted relation" isn't part of any relational model.

Of course, that doesn't preclude defining a RANK operator that returns a relation indicating tuples ranked on some order, or defining some non-relation construct that defines a tuple order. Tutorial D defines an ARRAY construct for this, and you could certainly define a rich set of windowing (and other) operators for ARRAY. But that is outside the scope of the relational model.

We've discussed this before.

So do you agree that with the addition of something like 'arrays of tuples' in a TTM language, you can implement window functions similar to SQL (and even create a general mechanical conversion) that will execute efficiently, or are there other issues?

No, I think it's a terrible idea. I think ARRAY in TD is the bastard child of 'we can't think of anything better'. I found in writing Andl samples that having relations as the only collection was nearly enough, roughly comparable to C# with LINQ, and adding a few minor tweaks just about gets there. I never found any need for an array (I don't use them much in C# either). There are issues dealing with data that has duplicates, so you need a collection type for those or a way of adding a sequence number.

I don't think there is any reasonable path from ARRAY to window functions, but that's for others to propose and defend.

It's straightforward and ARRAY is fine -- it simply addresses the need for random access, akin to SQL cursors.

Again, I'll wait and see. I've found no need for random access, and cursors form no part of the SQL query language. They support tuple-at-a-time programming, which TTM despises.

like random access. I have user interface mechanisms that make use of it, much like the LIMIT and OFFSET in many SQL dialects but more flexible.

TTM despises it in the relational model, but this is outside the relational model and in the realm of general programming.

I generalised aggregation in Tutorial D and implemented it in Rel, which is documented at https://reldb.org/c/index.php/read/user-defined-aggregate-operators/

You did indeed. I was surprised at the complexity: I had implemented it in Andl with a single operator : fold().

All of the complexity is to seamlessly integrate it into the existing Tutorial D framework without changes to the current Tutorial D specification. Conceptually, it's trivial.

Fold is fine, but is of course limited to a notion of the next tuple or value. It is not the same as an array construct, which is more general.

But I think you've already made up your mind.

One of the singular and notable strengths of TTM: it defined D to be a language family rather than a specific language, which leaves us free to choose much of the syntax and semantics as we see fit. I have made up my mind about my implementation. You are free to make up your mind about yours. We are all free to do the same.

Which is as it should be.

I'm the forum administrator and lead developer of Rel. Email me at dave@armchair.mb.ca with the Subject 'TTM Forum'. Download Rel from https://reldb.org
Quote from Jake Wheat on October 27, 2020, 1:32 am
Quote from Erwin on October 25, 2020, 4:06 pm

You can do everything you need with the algebra.  Why offer a second means to achieve the same ?  It's more work for the DBMS writer, more opportunity to introduce errors inside the DBMS, more testing to do, more bugs remaining despite all the testing effort when the product gets shipped, etc. etc.

This is also what Dave and Hugh are saying? I think I'm pretty convinced by this argument now.

What I meant where I said "does not exist" is "we are never going to find an algorithm that mechanically produces the algebra-based equivalent to any given SQL query and guaranteeably will not leave the users with an abomination like that coming out of my bulleted list".  (Note that Codd's reduction algorithm applied only to Codd's algebra, and that's a very small subset of the thing we call "relational algebra" these days !)  And if we are never going to find any such algorithm (whether for computer implementation or for follow-this-procedure-literally manuals handed out to, eurhm, lesser skilled human query [re]writers), then what kind of "support" is it that you are thinking of and that a DBMS writer can still possibly have to offer ?  Thought I might have had to clarify that with my first reply so I'm doing it now.

So there is a mechanical algorithm that can convert correlated subqueries into relational algebra, but it won't produce the queries that we would write from scratch, and we'd never want to use the result of such an algorithm because these queries would be difficult to read and maintain, and difficult to execute efficiently?

"So there is a mechanical algorithm" is an overinterpretation of what I said.  I'd only say "it exists" if I knew it did and knew what it looked like (in pretty minute detail) and believed myself competent and able to explain it to others.  None of the three conjuncts in there apply.  I intended something like "I wouldn't be terribly surprised if someone set out to find it and succeeded".  But I am very deeply convinced any such algorithm, if ever found, would indeed suffer from the characteristics you describe.  (Also note that despite Codd's reduction algorithm and the corresponding ***proof*** of equivalence between [his] calculus and [his] algebra, juniors and students are to this day still asking questions on StackOverflow of the ilk "how can I convert a given SQL query (expressions in a rather calculus-like language) to RA (expressions in algebra) ?".)

Quote from dandl on October 27, 2020, 10:18 am

I generalised aggregation in Tutorial D and implemented it in Rel, which is documented at https://reldb.org/c/index.php/read/user-defined-aggregate-operators/

You did indeed. I was surprised at the complexity: I had implemented it in Andl with a single operator : fold().

Can you sensibly provide all SQL window functionality via fold? Do you ask users to write window queries using fold directly instead? Or is this only about user defined window functions?

As noted in the conclusion, implementing windowing functions is straightforward. The ad hoc generalisation of windowing is to simultaneously be able to access the current tuple (or attribute value) and arbitrarily any other tuple (or value of the same attribute.) Regardless what mechanism is used to implement it, semantically it is an array.

I look forward to seeing it. No, Andl has no arrays and the semantics of its windowing are not those of an array. There is no means provided to store a sorted dataset and the relative access provided by LEAD/LAG etc is transitory. The mental model is more like inspecting a flow of traffic: you can look at a car, or the one in front or the one behind, but you can never do the things you can with arrays, like absolute indexing or iterating over the whole array.

I think there may be a way to iterate over a sorted array representing the entire window and do an operation that depends on that order in SQL windows, but I don't think I've ever seen it in use.

Quote from Darren Duncan on October 27, 2020, 2:19 am
Quote from Jake Wheat on October 27, 2020, 1:21 am
Quote from Darren Duncan on October 25, 2020, 6:33 am

Muldis Data Language is architecturally like a generic application programming language, and everything users write in it has the form of generic pure functions (composed only of pure expressions per a functional language) and procedures (composed of generic statements and expressions), so users have the ability fo formulate arbitrarily complex user-defined relational types and operators.

With such a foundation, how can it not be trivially easy to express everything SQL can, including correlated subqueries?

If someone claims this is trivial, I expect something like 'here is some rough concrete syntax, here are some rough physical operators which look like they don't perform badly, and here's a sketch of how you get from one to the other'.

Your ideas about extensibility seem good in general, but did I misunderstand - you seem to be saying 'my language is extensible therefore users can implement correlated subqueries themselves, and so implementing them is trivial'? This doesn't seem like a complete argument to me.

You are asking a broad question and I'm giving an equally broad answer.  The concept of a correlated subquery is quite broad.  It is as broad as saying "relation-valued function".  So I'm saying I support arbitrary user-defined relation-valued functions.  If you want concrete syntax from me then you have to give me one or more concrete SQL queries that you want to see the translations of.  Then I'll show you how trivial it is to map one to the other.

Here's an example, Q21 from TPC-H:

The Suppliers Who Kept Orders Waiting query identifies suppliers, for a given nation, whose product was part of a multi-supplier order (with current status of 'F') where they were the only supplier who failed to meet the committed delivery date.


select s_name, count(*) as numwait
from supplier, lineitem as l1, orders, nation
where s_suppkey = l1.l_suppkey
      and o_orderkey = l1.l_orderkey
      and o_orderstatus = 'F'
      and l1.l_receiptdate > l1.l_commitdate
      and exists (select *
                  from lineitem as l2
                  where l2.l_orderkey = l1.l_orderkey
                        and l2.l_suppkey <> l1.l_suppkey)
      and not exists (select *
                      from lineitem as l3
                      where l3.l_orderkey = l1.l_orderkey
                            and l3.l_suppkey <> l1.l_suppkey
                            and l3.l_receiptdate > l3.l_commitdate)
      and s_nationkey = n_nationkey
      and n_name = 'SAUDI ARABIA'
group by s_name
order by numwait desc, s_name asc
limit 100;
PreviousPage 4 of 8Next