The Third Manifesto ForumThe Third Manifesto: TTM General discussionThe case against range variables

Please or Register to create posts and topics.

The case against range variables

#1 · November 5, 2019, 3:50 pm

Sorry for not following the forum very well these days, but I think I saw a response to me, on the "need" for natural joins in which the writer wondered what's wrong with range variables, as used in Codd's relational calculus and SQL.

TTM requires all types to be "first-class", including relation types in particular. If type T is first-class, then one implication is that if x is a value of T, then it is possible to assign x to a variable of that type.

Consider the SQL expression SELECT * FROM T T1, T T2 WHERE T1.A < T2.A

No base table in SQL can have its columns defined to match those of the result of the above expression.

Now, for the very first edition of our TTM book, Chris drafted an appendix proposing a relational calculus based notation for relation expressions. In my review of his draft I asked about the attribute names of results and whether he could be sure that a result was always assignable to a variable of its (relation) type. I also enquired as to whether an attribute name sometimes had to be qualified an indefinite number of times by range variables: a.b.c....x, as a result of nested expressions.

In SQL qualifiers are not carried forward in SELECT expressions in the FROM clause. Try putting my expression above into a FROM clause:

SELECT * FROM (SELECT * FROM T T1, T T2 WHERE T1.A < T2.A) AS T3

The internal names T1.A and T2.A reference different columns. Which column does T3.A refer to? (I believe the expression is legal according to the SQL standard).

Anyway, Chris abandoned the attempt. I don't claim this a proof that range variables always militate against "all type first-class", but can anybody provide a counter-example (i.e. a notation that supports range variables and "all types first-class")?

Hugh

Coauthor of The Third Manifesto and related books.

#2 · November 5, 2019, 10:37 pm

Quote from AntC on November 5, 2019, 10:37 pm

Quote from Hugh on November 5, 2019, 3:50 pm

... a response to me, on the "need" for natural joins in which the writer wondered what's wrong with range variables, as used in Codd's relational calculus and SQL.

Thanks Hugh, that thread has been going fast and furious, with no proper closure for several of the points raised.

I'm familiar with 'range variables' in Codd's RC, and I kinda see a resemblance to what I call AS-names in SQL, but I didn't realise the SQL usage had an official name. As ever, SQL has managed a ghastly mongrel half-implementation of a sensible idea.

I guess the problem for language design is that the introduced names have no very clear scope: there's no BEGIN ... END or parenthesised expression with the names 'declared' textually at the start. Rather: the names are introduced/bound in some sub-expression then 'bubble up' to the top level. They're supposed not to clash with names coming from the database, but if all the compiler can see is SELECT * ..., it has to run around looking up the schema.

TTM requires all types to be "first-class", including relation types in particular. If type T is first-class, then one implication is that if x is a value of T, then it is possible to assign x to a variable of that type.

Consider the SQL expression SELECT * FROM T T1, T T2 WHERE T1.A < T2.A

No base table in SQL can have its columns defined to match those of the result of the above expression.

Now, for the very first edition of our TTM book, Chris drafted an appendix proposing a relational calculus based notation for relation expressions. In my review of his draft I asked about the attribute names of results and whether he could be sure that a result was always assignable to a variable of its (relation) type. I also enquired as to whether an attribute name sometimes had to be qualified an indefinite number of times by range variables: a.b.c....x, as a result of nested expressions.

Interesting history. Yes it's the stacking up of range variables needed due to nesting of query expressions that I pointed out in that earlier thread. (My comment also got lost in the turmoil.) That stacking up doesn't occur in Codd's RC.

In SQL qualifiers are not carried forward in SELECT expressions in the FROM clause. Try putting my expression above into a FROM clause:

SELECT * FROM (SELECT * FROM T T1, T T2 WHERE T1.A < T2.A) AS T3

The internal names T1.A and T2.A reference different columns. Which column does T3.A refer to? (I believe the expression is legal according to the SQL standard).

IIRC some of your critiques of SQL point out places where a sub-query must get a result name even though it's unusable, and other places where you can't give a result name even though it would be helpful -- for example the result from a UNION.

Anyway, Chris abandoned the attempt. I don't claim this a proof that range variables always militate against "all type first-class", but can anybody provide a counter-example (i.e. a notation that supports range variables and "all types first-class")?

Doesn't Codd's RC achieve that? Essentially every sub-query/nested result must get a distinct name, and no 'stacking up' of names. I'm not sure it's worth the effort of trying to patch up SQL, but I suppose:

Never use SELECT *, always explicitly list the result columns.

Always use ... AS ... for both column names and result names. [**]

Require AS-names to be unique throughout scope of the outermost SELECT.

[**] I guess SELECT A AS A ... would look dumb, so we might allow the AS-name to be omitted, where it's 'obvious' (haha).

[**] SELECT ... FROM T AS T WHERE ... also looks dumb; so allow the result name to be omitted in a FROM with only a single table? Except that same table might appear elsewhere in the same query (say in a subselect/self-join in the WHERE clause, then we can't omit the AS-name everywhere). Let's get the SQL Committee on to defining the rules for where omittable. That should add another doorstep's worth to the Standards doco.

Quote from Hugh on November 5, 2019, 3:50 pm

... a response to me, on the "need" for natural joins in which the writer wondered what's wrong with range variables, as used in Codd's relational calculus and SQL.

Thanks Hugh, that thread has been going fast and furious, with no proper closure for several of the points raised.

I'm familiar with 'range variables' in Codd's RC, and I kinda see a resemblance to what I call AS-names in SQL, but I didn't realise the SQL usage had an official name. As ever, SQL has managed a ghastly mongrel half-implementation of a sensible idea.

I guess the problem for language design is that the introduced names have no very clear scope: there's no BEGIN ... END or parenthesised expression with the names 'declared' textually at the start. Rather: the names are introduced/bound in some sub-expression then 'bubble up' to the top level. They're supposed not to clash with names coming from the database, but if all the compiler can see is SELECT * ..., it has to run around looking up the schema.

TTM requires all types to be "first-class", including relation types in particular. If type T is first-class, then one implication is that if x is a value of T, then it is possible to assign x to a variable of that type.

Consider the SQL expression SELECT * FROM T T1, T T2 WHERE T1.A < T2.A

No base table in SQL can have its columns defined to match those of the result of the above expression.

Now, for the very first edition of our TTM book, Chris drafted an appendix proposing a relational calculus based notation for relation expressions. In my review of his draft I asked about the attribute names of results and whether he could be sure that a result was always assignable to a variable of its (relation) type. I also enquired as to whether an attribute name sometimes had to be qualified an indefinite number of times by range variables: a.b.c....x, as a result of nested expressions.

Interesting history. Yes it's the stacking up of range variables needed due to nesting of query expressions that I pointed out in that earlier thread. (My comment also got lost in the turmoil.) That stacking up doesn't occur in Codd's RC.

In SQL qualifiers are not carried forward in SELECT expressions in the FROM clause. Try putting my expression above into a FROM clause:

SELECT * FROM (SELECT * FROM T T1, T T2 WHERE T1.A < T2.A) AS T3

The internal names T1.A and T2.A reference different columns. Which column does T3.A refer to? (I believe the expression is legal according to the SQL standard).

IIRC some of your critiques of SQL point out places where a sub-query must get a result name even though it's unusable, and other places where you can't give a result name even though it would be helpful -- for example the result from a UNION.

Anyway, Chris abandoned the attempt. I don't claim this a proof that range variables always militate against "all type first-class", but can anybody provide a counter-example (i.e. a notation that supports range variables and "all types first-class")?

Doesn't Codd's RC achieve that? Essentially every sub-query/nested result must get a distinct name, and no 'stacking up' of names. I'm not sure it's worth the effort of trying to patch up SQL, but I suppose:

Never use SELECT *, always explicitly list the result columns.
Always use ... AS ... for both column names and result names. [**]
Require AS-names to be unique throughout scope of the outermost SELECT.
[**] I guess SELECT A AS A ... would look dumb, so we might allow the AS-name to be omitted, where it's 'obvious' (haha).
[**] SELECT ... FROM T AS T WHERE ... also looks dumb; so allow the result name to be omitted in a FROM with only a single table? Except that same table might appear elsewhere in the same query (say in a subselect/self-join in the WHERE clause, then we can't omit the AS-name everywhere). Let's get the SQL Committee on to defining the rules for where omittable. That should add another doorstep's worth to the Standards doco.

#3 · November 6, 2019, 12:17 am

Quote from dandl on November 6, 2019, 12:17 am

Quote from Hugh on November 5, 2019, 3:50 pm

Sorry for not following the forum very well these days, but I think I saw a response to me, on the "need" for natural joins in which the writer wondered what's wrong with range variables, as used in Codd's relational calculus and SQL.

I can't find any definition of "range variable". Codd 1972 provides a careful definition of range term:

A monadic predicate followed by a tuple variable is called a range term. The range term Pjr is interpreted as stating that tuple variable r has relation R.

But I rather think you mean range variable in the rather narrow sense in SQL, where it is roughly equivalent to a table alias: FROM SOMETABLE AS Swhere S is the range variable.

TTM requires all types to be "first-class", including relation types in particular. If type T is first-class, then one implication is that if x is a value of T, then it is possible to assign x to a variable of that type.

Consider the SQL expression SELECT * FROM T T1, T T2 WHERE T1.A < T2.A

No base table in SQL can have its columns defined to match those of the result of the above expression.

Why so? In my part time role as an SQL compiler, I shall:

replace the '*' by a list of column names obtained from the schema

if a range variable (table alias) is involved, look through the alias back to the underlying table

note the duplicate column names, and mechanically replace each by an alias (equivalent to T1.A AS T1_A, T2.A AS T2_A)

Alternatively, I shall raise an error and require the programmer to insert sufficient column renaming terms so that every relation-valued expression has unique column names.

Now, for the very first edition of our TTM book, Chris drafted an appendix proposing a relational calculus based notation for relation expressions. In my review of his draft I asked about the attribute names of results and whether he could be sure that a result was always assignable to a variable of its (relation) type. I also enquired as to whether an attribute name sometimes had to be qualified an indefinite number of times by range variables: a.b.c....x, as a result of nested expressions.

In SQL qualifiers are not carried forward in SELECT expressions in the FROM clause. Try putting my expression above into a FROM clause:

SELECT * FROM (SELECT * FROM T T1, T T2 WHERE T1.A < T2.A) AS T3

The internal names T1.A and T2.A reference different columns. Which column does T3.A refer to? (I believe the expression is legal according to the SQL standard).

The solution I propose above (either of them) resolves this problem without difficulty. We might have to tweak the standard slightly, but at some point in the past that was not beyond the bounds of possibility.

Anyway, Chris abandoned the attempt. I don't claim this a proof that range variables always militate against "all type first-class", but can anybody provide a counter-example (i.e. a notation that supports range variables and "all types first-class")?

I would have thought it is always possible to write SQL giving explicit column names and aliases so that in the final result of the expression every name is unique, and the type is reliably traceable to a base column. Doesn't that resolve the issue?

Quote from Hugh on November 5, 2019, 3:50 pm

Sorry for not following the forum very well these days, but I think I saw a response to me, on the "need" for natural joins in which the writer wondered what's wrong with range variables, as used in Codd's relational calculus and SQL.

I can't find any definition of "range variable". Codd 1972 provides a careful definition of range term:

A monadic predicate followed by a tuple variable is called a range term. The range term Pjr is interpreted as stating that tuple variable r has relation R.

But I rather think you mean range variable in the rather narrow sense in SQL, where it is roughly equivalent to a table alias: FROM SOMETABLE AS Swhere S is the range variable.

TTM requires all types to be "first-class", including relation types in particular. If type T is first-class, then one implication is that if x is a value of T, then it is possible to assign x to a variable of that type.

Consider the SQL expression SELECT * FROM T T1, T T2 WHERE T1.A < T2.A

No base table in SQL can have its columns defined to match those of the result of the above expression.

Why so? In my part time role as an SQL compiler, I shall:

replace the '*' by a list of column names obtained from the schema
if a range variable (table alias) is involved, look through the alias back to the underlying table
note the duplicate column names, and mechanically replace each by an alias (equivalent to T1.A AS T1_A, T2.A AS T2_A)

Alternatively, I shall raise an error and require the programmer to insert sufficient column renaming terms so that every relation-valued expression has unique column names.

Now, for the very first edition of our TTM book, Chris drafted an appendix proposing a relational calculus based notation for relation expressions. In my review of his draft I asked about the attribute names of results and whether he could be sure that a result was always assignable to a variable of its (relation) type. I also enquired as to whether an attribute name sometimes had to be qualified an indefinite number of times by range variables: a.b.c....x, as a result of nested expressions.

In SQL qualifiers are not carried forward in SELECT expressions in the FROM clause. Try putting my expression above into a FROM clause:

SELECT * FROM (SELECT * FROM T T1, T T2 WHERE T1.A < T2.A) AS T3

The internal names T1.A and T2.A reference different columns. Which column does T3.A refer to? (I believe the expression is legal according to the SQL standard).

The solution I propose above (either of them) resolves this problem without difficulty. We might have to tweak the standard slightly, but at some point in the past that was not beyond the bounds of possibility.

Anyway, Chris abandoned the attempt. I don't claim this a proof that range variables always militate against "all type first-class", but can anybody provide a counter-example (i.e. a notation that supports range variables and "all types first-class")?

I would have thought it is always possible to write SQL giving explicit column names and aliases so that in the final result of the expression every name is unique, and the type is reliably traceable to a base column. Doesn't that resolve the issue?

Andl - A New Database Language - andl.org

#4 · November 6, 2019, 1:03 am

Quote from AntC on November 6, 2019, 1:03 am

Quote from dandl on November 6, 2019, 12:17 am

Quote from Hugh on November 5, 2019, 3:50 pm

Consider the SQL expression SELECT * FROM T T1, T T2 WHERE T1.A < T2.A

No base table in SQL can have its columns defined to match those of the result of the above expression.

Why so? In my part time role as an SQL compiler, I shall:

replace the '*' by a list of column names obtained from the schema

if a range variable (table alias) is involved, look through the alias back to the underlying table

note the duplicate column names, and mechanically replace each by an alias (equivalent to T1.A AS T1_A, T2.A AS T2_A)

Alternatively, I shall raise an error and require the programmer to insert sufficient column renaming terms so that every relation-valued expression has unique column names.

In SQL qualifiers are not carried forward in SELECT expressions in the FROM clause. Try putting my expression above into a FROM clause:

SELECT * FROM (SELECT * FROM T T1, T T2 WHERE T1.A < T2.A) AS T3

The internal names T1.A and T2.A reference different columns. Which column does T3.A refer to? (I believe the expression is legal according to the SQL standard).

The solution I propose above (either of them) resolves this problem without difficulty.

Em I don't think so. From the subquery we have names T1_A, T2_A by the rules you give. Which column does T3.A refer to?

...

I would have thought it is always possible to write SQL giving explicit column names and aliases so that in the final result of the expression every name is unique, and the type is reliably traceable to a base column. Doesn't that resolve the issue?

From a deep-nested subselect, we're going to end up with names like T7_T5_T3_T1_A. And the programmer will need to do careful analysis of the query to see that column has the same content as T6_T4_T2_A. I'll continue to prefer Natural JOIN with a sprinkling of RENAME, thanks. BTW how do your approaches deal with

SELECT * FROM ((SELECT A, B, C FROM T T1) UNION (SELECT D, E, F FROM S) ) AS T3;

Which column does T3.A refer to? Which column does T3.D refer to, if any? What's the type/heading of T3?

Quote from dandl on November 6, 2019, 12:17 am

Quote from Hugh on November 5, 2019, 3:50 pm

Consider the SQL expression SELECT * FROM T T1, T T2 WHERE T1.A < T2.A

No base table in SQL can have its columns defined to match those of the result of the above expression.

Why so? In my part time role as an SQL compiler, I shall:

replace the '*' by a list of column names obtained from the schema

if a range variable (table alias) is involved, look through the alias back to the underlying table

note the duplicate column names, and mechanically replace each by an alias (equivalent to T1.A AS T1_A, T2.A AS T2_A)

Alternatively, I shall raise an error and require the programmer to insert sufficient column renaming terms so that every relation-valued expression has unique column names.

In SQL qualifiers are not carried forward in SELECT expressions in the FROM clause. Try putting my expression above into a FROM clause:

SELECT * FROM (SELECT * FROM T T1, T T2 WHERE T1.A < T2.A) AS T3

The internal names T1.A and T2.A reference different columns. Which column does T3.A refer to? (I believe the expression is legal according to the SQL standard).

The solution I propose above (either of them) resolves this problem without difficulty.

Em I don't think so. From the subquery we have names T1_A, T2_A by the rules you give. Which column does T3.A refer to?

...

I would have thought it is always possible to write SQL giving explicit column names and aliases so that in the final result of the expression every name is unique, and the type is reliably traceable to a base column. Doesn't that resolve the issue?

From a deep-nested subselect, we're going to end up with names like T7_T5_T3_T1_A. And the programmer will need to do careful analysis of the query to see that column has the same content as T6_T4_T2_A. I'll continue to prefer Natural JOIN with a sprinkling of RENAME, thanks. BTW how do your approaches deal with

SELECT * FROM ((SELECT A, B, C FROM T T1) UNION (SELECT D, E, F FROM S) ) AS T3;

Which column does T3.A refer to? Which column does T3.D refer to, if any? What's the type/heading of T3?

#5 · November 6, 2019, 6:20 am

Quote from dandl on November 6, 2019, 6:20 am

Quote from AntC on November 6, 2019, 1:03 am

Quote from dandl on November 6, 2019, 12:17 am

In SQL qualifiers are not carried forward in SELECT expressions in the FROM clause. Try putting my expression above into a FROM clause:

SELECT * FROM (SELECT * FROM T T1, T T2 WHERE T1.A < T2.A) AS T3

The internal names T1.A and T2.A reference different columns. Which column does T3.A refer to? (I believe the expression is legal according to the SQL standard).

The solution I propose above (either of them) resolves this problem without difficulty.

Em I don't think so. From the subquery we have names T1_A, T2_A by the rules you give. Which column does T3.A refer to?

By the rules I gave, there is no T3.A. My compiler may:

apply the same rule again, so now there is a T3_T1_A and T3_T2_A but no T3_A

require the programmer to explicitly name result columns to resolve any ambiguity

flag an error.

My knowledge of SQL falls well short of what would be needed to analyse the potential solutions in depth, but these seem to be short-comings of the SQL language alone. I was under the impression that Hugh saw a problem inherent in the RA/RC, and that I don't see (yet).

I would have thought it is always possible to write SQL giving explicit column names and aliases so that in the final result of the expression every name is unique, and the type is reliably traceable to a base column. Doesn't that resolve the issue?

From a deep-nested subselect, we're going to end up with names like T7_T5_T3_T1_A. And the programmer will need to do careful analysis of the query to see that column has the same content as T6_T4_T2_A. I'll continue to prefer Natural JOIN with a sprinkling of RENAME, thanks. BTW how do your approaches deal with

That's if you choose to resolve the problem by the method of auto-renaming. The alternative is to require the programmer to fully specify every query (as many SQL shops do) and the problem disappears.

SELECT * FROM ((SELECT A, B, C FROM T T1) UNION (SELECT D, E, F FROM S) ) AS T3;

Which column does T3.A refer to? Which column does T3.D refer to, if any? What's the type/heading of T3?

As I understand it, Codd would not permit that union as the columns are from different domains (same as TTM if the attributes are different names/types). If by some means it is to be allowed then the union will proceed by taking the columns in order. The compiler may:

rename the 3 columns so that the result is (A,B,C)

flag an error

force the programmer to explicitly rename columns.

These are all straightforward problems for which the solution looks nicer in the TTM type system. What a pity no-one wants to use it!

Quote from AntC on November 6, 2019, 1:03 am

Quote from dandl on November 6, 2019, 12:17 am

In SQL qualifiers are not carried forward in SELECT expressions in the FROM clause. Try putting my expression above into a FROM clause:

SELECT * FROM (SELECT * FROM T T1, T T2 WHERE T1.A < T2.A) AS T3

The internal names T1.A and T2.A reference different columns. Which column does T3.A refer to? (I believe the expression is legal according to the SQL standard).

The solution I propose above (either of them) resolves this problem without difficulty.

Em I don't think so. From the subquery we have names T1_A, T2_A by the rules you give. Which column does T3.A refer to?

By the rules I gave, there is no T3.A. My compiler may:

apply the same rule again, so now there is a T3_T1_A and T3_T2_A but no T3_A
require the programmer to explicitly name result columns to resolve any ambiguity
flag an error.

My knowledge of SQL falls well short of what would be needed to analyse the potential solutions in depth, but these seem to be short-comings of the SQL language alone. I was under the impression that Hugh saw a problem inherent in the RA/RC, and that I don't see (yet).

I would have thought it is always possible to write SQL giving explicit column names and aliases so that in the final result of the expression every name is unique, and the type is reliably traceable to a base column. Doesn't that resolve the issue?

From a deep-nested subselect, we're going to end up with names like T7_T5_T3_T1_A. And the programmer will need to do careful analysis of the query to see that column has the same content as T6_T4_T2_A. I'll continue to prefer Natural JOIN with a sprinkling of RENAME, thanks. BTW how do your approaches deal with

That's if you choose to resolve the problem by the method of auto-renaming. The alternative is to require the programmer to fully specify every query (as many SQL shops do) and the problem disappears.

SELECT * FROM ((SELECT A, B, C FROM T T1) UNION (SELECT D, E, F FROM S) ) AS T3;

Which column does T3.A refer to? Which column does T3.D refer to, if any? What's the type/heading of T3?

As I understand it, Codd would not permit that union as the columns are from different domains (same as TTM if the attributes are different names/types). If by some means it is to be allowed then the union will proceed by taking the columns in order. The compiler may:

rename the 3 columns so that the result is (A,B,C)
flag an error
force the programmer to explicitly rename columns.

These are all straightforward problems for which the solution looks nicer in the TTM type system. What a pity no-one wants to use it!

Andl - A New Database Language - andl.org

#6 · November 6, 2019, 11:55 am

Quote from Hugh on November 6, 2019, 11:55 am

Quote from AntC on November 5, 2019, 10:37 pm

Quote from Hugh on November 5, 2019, 3:50 pm

... a response to me, on the "need" for natural joins in which the writer wondered what's wrong with range variables, as used in Codd's relational calculus and SQL.

Thanks Hugh, that thread has been going fast and furious, with no proper closure for several of the points raised.

I'm familiar with 'range variables' in Codd's RC, and I kinda see a resemblance to what I call AS-names in SQL, but I didn't realise the SQL usage had an official name. As ever, SQL has managed a ghastly mongrel half-implementation of a sensible idea.

I guess the problem for language design is that the introduced names have no very clear scope: there's no BEGIN ... END or parenthesised expression with the names 'declared' textually at the start. Rather: the names are introduced/bound in some sub-expression then 'bubble up' to the top level. They're supposed not to clash with names coming from the database, but if all the compiler can see is SELECT * ..., it has to run around looking up the schema.

TTM requires all types to be "first-class", including relation types in particular. If type T is first-class, then one implication is that if x is a value of T, then it is possible to assign x to a variable of that type.

Consider the SQL expression SELECT * FROM T T1, T T2 WHERE T1.A < T2.A

No base table in SQL can have its columns defined to match those of the result of the above expression.

Now, for the very first edition of our TTM book, Chris drafted an appendix proposing a relational calculus based notation for relation expressions. In my review of his draft I asked about the attribute names of results and whether he could be sure that a result was always assignable to a variable of its (relation) type. I also enquired as to whether an attribute name sometimes had to be qualified an indefinite number of times by range variables: a.b.c....x, as a result of nested expressions.

Interesting history. Yes it's the stacking up of range variables needed due to nesting of query expressions that I pointed out in that earlier thread. (My comment also got lost in the turmoil.) That stacking up doesn't occur in Codd's RC.

The reason why they don't stack up in Codd's RC is that he (wrongly) assumed nesting was never needed. The same wrong assumption carried itself forward into Sequel and original SQL. The mistake was corrected in SQL in the 1992 edition of the standard when "derived tables in the FROM clause" was added (but made optional and remaining so until fairly recently). I'm not sure what you mean later when you write, "every sub-query/nested result must get a distinct name".

Hugh

In SQL qualifiers are not carried forward in SELECT expressions in the FROM clause. Try putting my expression above into a FROM clause:

SELECT * FROM (SELECT * FROM T T1, T T2 WHERE T1.A < T2.A) AS T3

The internal names T1.A and T2.A reference different columns. Which column does T3.A refer to? (I believe the expression is legal according to the SQL standard).

IIRC some of your critiques of SQL point out places where a sub-query must get a result name even though it's unusable, and other places where you can't give a result name even though it would be helpful -- for example the result from a UNION.

Anyway, Chris abandoned the attempt. I don't claim this a proof that range variables always militate against "all type first-class", but can anybody provide a counter-example (i.e. a notation that supports range variables and "all types first-class")?

Doesn't Codd's RC achieve that? Essentially every sub-query/nested result must get a distinct name, and no 'stacking up' of names. I'm not sure it's worth the effort of trying to patch up SQL, but I suppose:

Never use SELECT *, always explicitly list the result columns.

Always use ... AS ... for both column names and result names. [**]

Require AS-names to be unique throughout scope of the outermost SELECT.

[**] I guess SELECT A AS A ... would look dumb, so we might allow the AS-name to be omitted, where it's 'obvious' (haha).

[**] SELECT ... FROM T AS T WHERE ... also looks dumb; so allow the result name to be omitted in a FROM with only a single table? Except that same table might appear elsewhere in the same query (say in a subselect/self-join in the WHERE clause, then we can't omit the AS-name everywhere). Let's get the SQL Committee on to defining the rules for where omittable. That should add another doorstep's worth to the Standards doco.

Quote from AntC on November 5, 2019, 10:37 pm

Quote from Hugh on November 5, 2019, 3:50 pm

... a response to me, on the "need" for natural joins in which the writer wondered what's wrong with range variables, as used in Codd's relational calculus and SQL.

Thanks Hugh, that thread has been going fast and furious, with no proper closure for several of the points raised.

I'm familiar with 'range variables' in Codd's RC, and I kinda see a resemblance to what I call AS-names in SQL, but I didn't realise the SQL usage had an official name. As ever, SQL has managed a ghastly mongrel half-implementation of a sensible idea.

I guess the problem for language design is that the introduced names have no very clear scope: there's no BEGIN ... END or parenthesised expression with the names 'declared' textually at the start. Rather: the names are introduced/bound in some sub-expression then 'bubble up' to the top level. They're supposed not to clash with names coming from the database, but if all the compiler can see is SELECT * ..., it has to run around looking up the schema.

TTM requires all types to be "first-class", including relation types in particular. If type T is first-class, then one implication is that if x is a value of T, then it is possible to assign x to a variable of that type.

Consider the SQL expression SELECT * FROM T T1, T T2 WHERE T1.A < T2.A

No base table in SQL can have its columns defined to match those of the result of the above expression.

Now, for the very first edition of our TTM book, Chris drafted an appendix proposing a relational calculus based notation for relation expressions. In my review of his draft I asked about the attribute names of results and whether he could be sure that a result was always assignable to a variable of its (relation) type. I also enquired as to whether an attribute name sometimes had to be qualified an indefinite number of times by range variables: a.b.c....x, as a result of nested expressions.

Interesting history. Yes it's the stacking up of range variables needed due to nesting of query expressions that I pointed out in that earlier thread. (My comment also got lost in the turmoil.) That stacking up doesn't occur in Codd's RC.

The reason why they don't stack up in Codd's RC is that he (wrongly) assumed nesting was never needed. The same wrong assumption carried itself forward into Sequel and original SQL. The mistake was corrected in SQL in the 1992 edition of the standard when "derived tables in the FROM clause" was added (but made optional and remaining so until fairly recently). I'm not sure what you mean later when you write, "every sub-query/nested result must get a distinct name".

Hugh

In SQL qualifiers are not carried forward in SELECT expressions in the FROM clause. Try putting my expression above into a FROM clause:

SELECT * FROM (SELECT * FROM T T1, T T2 WHERE T1.A < T2.A) AS T3

The internal names T1.A and T2.A reference different columns. Which column does T3.A refer to? (I believe the expression is legal according to the SQL standard).

IIRC some of your critiques of SQL point out places where a sub-query must get a result name even though it's unusable, and other places where you can't give a result name even though it would be helpful -- for example the result from a UNION.

Anyway, Chris abandoned the attempt. I don't claim this a proof that range variables always militate against "all type first-class", but can anybody provide a counter-example (i.e. a notation that supports range variables and "all types first-class")?

Doesn't Codd's RC achieve that? Essentially every sub-query/nested result must get a distinct name, and no 'stacking up' of names. I'm not sure it's worth the effort of trying to patch up SQL, but I suppose:

Never use SELECT *, always explicitly list the result columns.

Always use ... AS ... for both column names and result names. [**]

Require AS-names to be unique throughout scope of the outermost SELECT.

[**] I guess SELECT A AS A ... would look dumb, so we might allow the AS-name to be omitted, where it's 'obvious' (haha).

[**] SELECT ... FROM T AS T WHERE ... also looks dumb; so allow the result name to be omitted in a FROM with only a single table? Except that same table might appear elsewhere in the same query (say in a subselect/self-join in the WHERE clause, then we can't omit the AS-name everywhere). Let's get the SQL Committee on to defining the rules for where omittable. That should add another doorstep's worth to the Standards doco.

Coauthor of The Third Manifesto and related books.

#7 · November 7, 2019, 10:15 am

Quote from AntC on November 7, 2019, 10:15 am

Quote from Hugh on November 6, 2019, 11:55 am

Quote from AntC on November 5, 2019, 10:37 pm

Quote from Hugh on November 5, 2019, 3:50 pm

... a response to me, on the "need" for natural joins in which the writer wondered what's wrong with range variables, as used in Codd's relational calculus and SQL.

I'm familiar with 'range variables' in Codd's RC, and I kinda see a resemblance to what I call AS-names in SQL, but I didn't realise the SQL usage had an official name. As ever, SQL has managed a ghastly mongrel half-implementation of a sensible idea.

TTM requires all types to be "first-class", including relation types in particular. If type T is first-class, then one implication is that if x is a value of T, then it is possible to assign x to a variable of that type.

Consider the SQL expression SELECT * FROM T T1, T T2 WHERE T1.A < T2.A

No base table in SQL can have its columns defined to match those of the result of the above expression.

Now, for the very first edition of our TTM book, ...

Interesting history. Yes it's the stacking up of range variables needed due to nesting of query expressions that I pointed out in that earlier thread. (My comment also got lost in the turmoil.) That stacking up doesn't occur in Codd's RC.

The reason why they don't stack up in Codd's RC is that he (wrongly) assumed nesting was never needed.

Hmm? I might be mixing up half-remembered variants of Relational Calculus. Codd did't really contemplate aggregates being within the calculus/algebra, and that's where you need heavy nesting. Let's work with a specific example (avoiding aggregates). Take the Tutorial D nesting

(SP JOIN S) JOIN (P RENAME {PCITY := CITY})

The 'word-for-word translation' SQL version might go

SELECT * FROM (SELECT * FROM SP NATURAL JOIN S) AS SSP NATURAL JOIN (SELECT P#, PNAME, CITY AS PCITY, WEIGHT FROM P AS PC) AS SSPP;

What are the column names from that result? Specifically do we get SSPP.S# and/or SSP.S# and/or SP.S# and/or S.S#? Should we get nested SSPP.SSP.SP.S#? Those are genuine questions/I don't know because I'd never write it like that. Because I've never worked with an SQL that supported NATURAL JOIN. Then the more idiomatic (but tedious) way would be

SELECT SP.S#, SP.P#, SP.QTY, S.SNAME, S.CITY, S.STATUS, P.PNAME, P.CITY AS PCITY, P.WEIGHT

FROM SP JOIN S ON SP.S# = S.S# JOIN P ON SP.P# = P.P#

If I'm reading Codd 1972 correctly [Section 3.1], a WFF can mention three different base tables as distinct 'range terms', so it can express as a flattened 'theta-join term' what must be expressed in Tutorial D as a nesting -- RC broadly similar to my idiomatic SQL version. Except RC uses positional notation throughout (as Erwin pointed out), so no need for the RENAME/P.CITY AS PCITY -- whose purpose is to avoid an accidental clash of same-named attributes.

The same wrong assumption carried itself forward into Sequel and original SQL. The mistake was corrected in SQL in the 1992 edition of the standard when "derived tables in the FROM clause" was added (but made optional and remaining so until fairly recently). I'm not sure what you mean later when you write, "every sub-query/nested result must get a distinct name".

I'd half-remembered this from section 3.3:

z = (t₁, t₂, ..., t_k) : w.

Which to me looks like binding or assigning [**] some sub-query result to a temp/local variable z. Then z acts as if a base table and can be bound to some later 'range term'. IOW give a distinct name to what in the algebra would be a nesting.

[**] This being a calculus, that = is not destructive assignment as in a procedural programming language (although that's how it might get implemented), but merely a referentially transparent name binding to an expression.

In SQL qualifiers are not carried forward in SELECT expressions in the FROM clause. Try putting my expression above into a FROM clause:

SELECT * FROM (SELECT * FROM T T1, T T2 WHERE T1.A < T2.A) AS T3

The internal names T1.A and T2.A reference different columns. Which column does T3.A refer to? (I believe the expression is legal according to the SQL standard).

IIRC some of your critiques of SQL point out places where a sub-query must get a result name even though it's unusable, and other places where you can't give a result name even though it would be helpful -- for example the result from a UNION.

Anyway, Chris abandoned the attempt. I don't claim this a proof that range variables always militate against "all type first-class", but can anybody provide a counter-example (i.e. a notation that supports range variables and "all types first-class")?

Doesn't Codd's RC achieve that? Essentially every sub-query/nested result must get a distinct name, and no 'stacking up' of names. I'm not sure it's worth the effort of trying to patch up SQL, but I suppose: ...

Quote from Hugh on November 6, 2019, 11:55 am

Quote from AntC on November 5, 2019, 10:37 pm

Quote from Hugh on November 5, 2019, 3:50 pm

... a response to me, on the "need" for natural joins in which the writer wondered what's wrong with range variables, as used in Codd's relational calculus and SQL.

I'm familiar with 'range variables' in Codd's RC, and I kinda see a resemblance to what I call AS-names in SQL, but I didn't realise the SQL usage had an official name. As ever, SQL has managed a ghastly mongrel half-implementation of a sensible idea.

TTM requires all types to be "first-class", including relation types in particular. If type T is first-class, then one implication is that if x is a value of T, then it is possible to assign x to a variable of that type.

Consider the SQL expression SELECT * FROM T T1, T T2 WHERE T1.A < T2.A

No base table in SQL can have its columns defined to match those of the result of the above expression.

Now, for the very first edition of our TTM book, ...

Interesting history. Yes it's the stacking up of range variables needed due to nesting of query expressions that I pointed out in that earlier thread. (My comment also got lost in the turmoil.) That stacking up doesn't occur in Codd's RC.

The reason why they don't stack up in Codd's RC is that he (wrongly) assumed nesting was never needed.

Hmm? I might be mixing up half-remembered variants of Relational Calculus. Codd did't really contemplate aggregates being within the calculus/algebra, and that's where you need heavy nesting. Let's work with a specific example (avoiding aggregates). Take the Tutorial D nesting

(SP JOIN S) JOIN (P RENAME {PCITY := CITY})

The 'word-for-word translation' SQL version might go

SELECT * FROM (SELECT * FROM SP NATURAL JOIN S) AS SSP NATURAL JOIN (SELECT P#, PNAME, CITY AS PCITY, WEIGHT FROM P AS PC) AS SSPP;

What are the column names from that result? Specifically do we get SSPP.S# and/or SSP.S# and/or SP.S# and/or S.S#? Should we get nested SSPP.SSP.SP.S#? Those are genuine questions/I don't know because I'd never write it like that. Because I've never worked with an SQL that supported NATURAL JOIN. Then the more idiomatic (but tedious) way would be

SELECT SP.S#, SP.P#, SP.QTY, S.SNAME, S.CITY, S.STATUS, P.PNAME, P.CITY AS PCITY, P.WEIGHT

FROM SP JOIN S ON SP.S# = S.S# JOIN P ON SP.P# = P.P#

If I'm reading Codd 1972 correctly [Section 3.1], a WFF can mention three different base tables as distinct 'range terms', so it can express as a flattened 'theta-join term' what must be expressed in Tutorial D as a nesting -- RC broadly similar to my idiomatic SQL version. Except RC uses positional notation throughout (as Erwin pointed out), so no need for the RENAME/P.CITY AS PCITY -- whose purpose is to avoid an accidental clash of same-named attributes.

The same wrong assumption carried itself forward into Sequel and original SQL. The mistake was corrected in SQL in the 1992 edition of the standard when "derived tables in the FROM clause" was added (but made optional and remaining so until fairly recently). I'm not sure what you mean later when you write, "every sub-query/nested result must get a distinct name".

I'd half-remembered this from section 3.3:

z = (t₁, t₂, ..., t_k) : w.

Which to me looks like binding or assigning [**] some sub-query result to a temp/local variable z. Then z acts as if a base table and can be bound to some later 'range term'. IOW give a distinct name to what in the algebra would be a nesting.

[**] This being a calculus, that = is not destructive assignment as in a procedural programming language (although that's how it might get implemented), but merely a referentially transparent name binding to an expression.

In SQL qualifiers are not carried forward in SELECT expressions in the FROM clause. Try putting my expression above into a FROM clause:

SELECT * FROM (SELECT * FROM T T1, T T2 WHERE T1.A < T2.A) AS T3

The internal names T1.A and T2.A reference different columns. Which column does T3.A refer to? (I believe the expression is legal according to the SQL standard).

IIRC some of your critiques of SQL point out places where a sub-query must get a result name even though it's unusable, and other places where you can't give a result name even though it would be helpful -- for example the result from a UNION.

Anyway, Chris abandoned the attempt. I don't claim this a proof that range variables always militate against "all type first-class", but can anybody provide a counter-example (i.e. a notation that supports range variables and "all types first-class")?

Doesn't Codd's RC achieve that? Essentially every sub-query/nested result must get a distinct name, and no 'stacking up' of names. I'm not sure it's worth the effort of trying to patch up SQL, but I suppose: ...

#8 · November 7, 2019, 5:15 pm

The term I used, range variable, has been questioned. Perhaps it was Chris Date, or somebody else other than Codd, who introduced it. Anyway, it is defined in Date's The Relational Database Dictionary as a logic variable that ranges over a set, taking each element in turn. In the third edition (1981) of his An Introduction to Database Systems he used the term tuple variable but in the fifth edition both terms appear. I think the more general term range variable covers both the tuple relational calculus and the domain relational calculus. In the latter (which I have never been able to grasp), the variables range over something other than tuples (domain elements?).

Anyway, an SQL range variable (and I am the one responsible for the use of that term in the international standard, sorry!) ranges over the rows of a table, so perhaps should have been called a row variable.

But one thing these SQL things (and those RC things) are ABSOLUTELY NOT (excuse the shouting) is table or relation "aliases". Such a variable never stands for a table (or relation) in SQL (or RC).

Hugh

Coauthor of The Third Manifesto and related books.

#9 · November 8, 2019, 12:21 am

Thanks Hugh, I didn't know that. The term range variable does not seem to be in widespread use, and I had not run across it (that I remember). I'm not an avid reader of the SQL standard.

I understand your position, but I'm not sure it's widely held. To take a simple example from a widely used tutorial site w3schools:

Alias for Tables Example

The following SQL statement selects all the orders from the customer with CustomerID=4 (Around the Horn). We use the "Customers" and "Orders" tables, and give them the table aliases of "c" and "o" respectively (Here we use aliases to make the SQL shorter):

So what is the key distinction, and why is it so widely misunderstood? In the context of a SELECT statement, how is a variable that 'ranges over' the rows of a table different from an alias for a table? How are the rows of a table to be distinguished from the table itself (which may include a kind of 'virtual table' you get from a sub-query)?

Andl - A New Database Language - andl.org

#10 · November 8, 2019, 4:51 am

Quote from AntC on November 8, 2019, 4:51 am

Quote from Hugh on November 7, 2019, 5:15 pm

The term I used, range variable, has been questioned.

'Range' appears in Codd 1972 prefixed to various other words, but not 'range variable'. AFAICT 'range' does not appear in Codd 1970. It does appear a couple of times in Codd 1979, but in the sense of an operator/function having a domain and a range, so not relevant to this discussion. I didn't bother looking in other writings.

[Codd 1972 section 3.1] Suppose the relations [in the given database] are R₁, R₂, . . . . , R_N. Then, P_j indicates membership of tuples in relation R_j (j = 1, 2, . . . . N). A monadic predicate followed by a tuple variable is called a range term. The range term P_jr is interpreted as stating that tuple variable r has relation R_j as its range.

Codd 1972 goes on to consider 'range-separable', 'range WFF', 'range predicates', "the range of a tuple variable r", 'range-coupled quantifiers'.

Perhaps it was Chris Date, or somebody else other than Codd, who introduced it.

Yes I think we have to conclude that the exact term 'range variable' is not due to Codd.

Anyway, it is defined in Date's The Relational Database Dictionary as a logic variable that ranges over a set, taking each element in turn. In the third edition (1981) of his An Introduction to Database Systems he used the term tuple variable but in the fifth edition both terms appear. I think the more general term range variable covers both the tuple relational calculus and the domain relational calculus. In the latter (which I have never been able to grasp), the variables range over something other than tuples (domain elements?).

Anyway, an SQL range variable (and I am the one responsible for the use of that term in the international standard, sorry!) ranges over the rows of a table, so perhaps should have been called a row variable.

Hmm that would have been more consistent with Codd's 'tuple variable'. But frankly (see below) I think the terminology is inappropriate to SQL, and would best have been avoided. No wonder it's widely misunderstood/misused.

But one thing these SQL things (and those RC things) are ABSOLUTELY NOT (excuse the shouting) is table or relation "aliases". Such a variable never stands for a table (or relation) in SQL (or RC).

The first StackOverflow question I fell across for 'range variables' says "range variables (aka aliases) for two ... tables". And the first answer says "You threw me with the term "range variable". I've always seen that construct called an "alias"."

From some SQL support site: "Range variables are correlation names for tables, and more. Refer to the "Ingres SQL Reference Manual" for a discussion of correlation names." It says "for tables", not "for rows". BTW in the Ingres SQL Reference, 'range variable' does not appear. So at least one vendor thinks it's not a useful term.

Most of the other GHits for 'range variable' are to D&D texts.

So sorry, but usage trumps prescription. My education in SQL was rather learning-on-the-job, but until yesterday, I'd never heard AS-names as introduced in the FROM clause called 'range variables'. I had heard them called 'table aliases'. (I can also see that's wrong: they're naming or ranging over query results, not necessarily tables.)

Hugh, I think you'll just have to get over it. I chime with David's q: what harm results from thinking of/calling those names 'table aliases'?

Quote from dandl on November 8, 2019, 12:21 am

So what is the key distinction, and why is it so widely misunderstood? In the context of a SELECT statement, how is a variable that 'ranges over' the rows of a table different from an alias for a table? How are the rows of a table to be distinguished from the table itself (which may include a kind of 'virtual table' you get from a sub-query)?

In Codd 1972's exposition, the 'tuple variables' are explicitly quantified over ranges, in Predicate Logic style.

In SQL there's no apparent quantification. We try to think of SQL as operating whole-relation-at-a-time (Codd Rules 7, 12). Then something ranging over rows sounds awful like Row-by-Agonising-Row terminology. We might describe the semantics as kinda mapping a lambda-expression (the 'open formula') over the table/query as a collection so those 'variables' are lambda-bound, but since you seem to be proud in your ignorance of lambda calculus, and most SQL programmers wouldn't know it from a bar of soap, that's no help either.

I conclude 'range variable' should be expunged from SQL in favour of 'table alias'.

Addit: one rather well-known vendor of SQL, in their SQL syntax reference, calls the FROM AS-name table_alias. And another, arguably even better known vendor, calls it t_alias. AFAICT no vendor in their public-facing materials uses 'range variable'.

Quote from Hugh on November 7, 2019, 5:15 pm

The term I used, range variable, has been questioned.

'Range' appears in Codd 1972 prefixed to various other words, but not 'range variable'. AFAICT 'range' does not appear in Codd 1970. It does appear a couple of times in Codd 1979, but in the sense of an operator/function having a domain and a range, so not relevant to this discussion. I didn't bother looking in other writings.

[Codd 1972 section 3.1] Suppose the relations [in the given database] are R₁, R₂, . . . . , R_N. Then, P_j indicates membership of tuples in relation R_j (j = 1, 2, . . . . N). A monadic predicate followed by a tuple variable is called a range term. The range term P_jr is interpreted as stating that tuple variable r has relation R_j as its range.

Codd 1972 goes on to consider 'range-separable', 'range WFF', 'range predicates', "the range of a tuple variable r", 'range-coupled quantifiers'.

Perhaps it was Chris Date, or somebody else other than Codd, who introduced it.

Yes I think we have to conclude that the exact term 'range variable' is not due to Codd.

Anyway, it is defined in Date's The Relational Database Dictionary as a logic variable that ranges over a set, taking each element in turn. In the third edition (1981) of his An Introduction to Database Systems he used the term tuple variable but in the fifth edition both terms appear. I think the more general term range variable covers both the tuple relational calculus and the domain relational calculus. In the latter (which I have never been able to grasp), the variables range over something other than tuples (domain elements?).

Anyway, an SQL range variable (and I am the one responsible for the use of that term in the international standard, sorry!) ranges over the rows of a table, so perhaps should have been called a row variable.

Hmm that would have been more consistent with Codd's 'tuple variable'. But frankly (see below) I think the terminology is inappropriate to SQL, and would best have been avoided. No wonder it's widely misunderstood/misused.

But one thing these SQL things (and those RC things) are ABSOLUTELY NOT (excuse the shouting) is table or relation "aliases". Such a variable never stands for a table (or relation) in SQL (or RC).

The first StackOverflow question I fell across for 'range variables' says "range variables (aka aliases) for two ... tables". And the first answer says "You threw me with the term "range variable". I've always seen that construct called an "alias"."

From some SQL support site: "Range variables are correlation names for tables, and more. Refer to the "Ingres SQL Reference Manual" for a discussion of correlation names." It says "for tables", not "for rows". BTW in the Ingres SQL Reference, 'range variable' does not appear. So at least one vendor thinks it's not a useful term.

Most of the other GHits for 'range variable' are to D&D texts.

So sorry, but usage trumps prescription. My education in SQL was rather learning-on-the-job, but until yesterday, I'd never heard AS-names as introduced in the FROM clause called 'range variables'. I had heard them called 'table aliases'. (I can also see that's wrong: they're naming or ranging over query results, not necessarily tables.)

Hugh, I think you'll just have to get over it. I chime with David's q: what harm results from thinking of/calling those names 'table aliases'?

Quote from dandl on November 8, 2019, 12:21 am

So what is the key distinction, and why is it so widely misunderstood? In the context of a SELECT statement, how is a variable that 'ranges over' the rows of a table different from an alias for a table? How are the rows of a table to be distinguished from the table itself (which may include a kind of 'virtual table' you get from a sub-query)?

In Codd 1972's exposition, the 'tuple variables' are explicitly quantified over ranges, in Predicate Logic style.

In SQL there's no apparent quantification. We try to think of SQL as operating whole-relation-at-a-time (Codd Rules 7, 12). Then something ranging over rows sounds awful like Row-by-Agonising-Row terminology. We might describe the semantics as kinda mapping a lambda-expression (the 'open formula') over the table/query as a collection so those 'variables' are lambda-bound, but since you seem to be proud in your ignorance of lambda calculus, and most SQL programmers wouldn't know it from a bar of soap, that's no help either.

I conclude 'range variable' should be expunged from SQL in favour of 'table alias'.

Addit: one rather well-known vendor of SQL, in their SQL syntax reference, calls the FROM AS-name table_alias. And another, arguably even better known vendor, calls it t_alias. AFAICT no vendor in their public-facing materials uses 'range variable'.

TTM Forum

The Forum for Discussion about The Third Manifesto and Related Matters

The case against range variables

Alias for Tables Example

Alias for Tables Example