The Forum for Discussion about The Third Manifesto and Related Matters

Please or Register to create posts and topics.

Why do we need natural joins?

Page 1 of 5Next

Codd described theta joins, where theta represents a comparison operator between 'elements' of two relations. A theta join is possible whenever a comparison between elements is true or false, not undefined. A special case of theta-join is the equi-join. When the duplicate domains are removed by project, he defines the result as a natural join. At no point is there an implied or stated requirement for matching domains by name. Note that Codd uses named and numbered references to domains interchangeably. For example:

R[B = D]S(A B C E)

Codd syntax survives into SQL, where it would look something like this:

select A,B,C,E from R,S where B = D

So why exactly is the term 'natural join' now used to refer to matching by name, when Codd used it differently?

Further, there is widespread advice against using natural joins in SQL. So why did D&D choose to focus on natural join (by the new definition) to the exclusion of all others?

And finally, if we revisit that decision and use theta joins instead, is it not true that may of the problems with the TTM type system fall away?

 

Andl - A New Database Language - andl.org
Quote from dandl on November 2, 2019, 11:28 pm

Codd described theta joins, where theta represents a comparison operator between 'elements' of two relations. A theta join is possible whenever a comparison between elements is true or false, not undefined. A special case of theta-join is the equi-join. When the duplicate domains are removed by project, he defines the result as a natural join. At no point is there an implied or stated requirement for matching domains by name. Note that Codd uses named and numbered references to domains interchangeably. For example:

R[B = D]S(A B C E)

Codd syntax survives into SQL, where it would look something like this:

select A,B,C,E from R,S where B = D

So why exactly is the term 'natural join' now used to refer to matching by name, when Codd used it differently?

Further, there is widespread advice against using natural joins in SQL. So why did D&D choose to focus on natural join (by the new definition) to the exclusion of all others?

And finally, if we revisit that decision and use theta joins instead, is it not true that may of the problems with the TTM type system fall away?

Why would type system issues fall away?

Don't Codd's join and TTM JOIN differ only in that the former requires explicit choice of attributes to join (and then must check that they are type-compatible), whilst the latter joins on common names (and then must check that they are type-compatible)?

I'm the forum administrator and lead developer of Rel. Email me at dave@armchair.mb.ca with the Subject 'TTM Forum'. Download Rel from https://reldb.org

I would have thought it was obvious. In effect the type generators disappear, TUPLE is just a scalar with components and RELATION is a collection of TUPLE. That's perfectly compatible with any number of existing languages.

A JOIN now looks something like this:

S.Join(SP, (s,sp) => s.Sid = sp.Sid, (s,sp) => new { Sid=s.Sid, SNAME=s.SNAME, STATUS=s.STATUS, CITY=s.CITY, Pid=sp.Pid, QTY=sp.QTY })

This is pure LINQ. Yes, you have to spell out all the field names just like in SQL, but is that too big a price? The advantage is that you can see precisely what the output will be without going back to the declarations, or a previous expression, to find out. The compiler still does all the type checking, so the expression is type safe. You just have to code a little harder.

For TTM, all that's needed is to delete RM Pre 6 and 7, make the necessary consequential changes and we're done. TTM is now easy to do!

The grief comes 100% entirely from choosing natural join with name matching as the basis for an entire type system, where everybody else did not. Perhaps that is the one big blunder.

Andl - A New Database Language - andl.org
Quote from dandl on November 3, 2019, 1:42 am

I would have thought it was obvious. In effect the type generators disappear, TUPLE is just a scalar with components and RELATION is a collection of TUPLE. That's perfectly compatible with any number of existing languages.

A JOIN now looks something like this:

S.Join(SP, (s,sp) => s.Sid = sp.Sid, (s,sp) => new { Sid=s.Sid, SNAME=s.SNAME, STATUS=s.STATUS, CITY=s.CITY, Pid=sp.Pid, QTY=sp.QTY })

This is pure LINQ. Yes, you have to spell out all the field names just like in SQL, but is that too big a price? The advantage is that you can see precisely what the output will be without going back to the declarations, or a previous expression, to find out. The compiler still does all the type checking, so the expression is type safe. You just have to code a little harder.

For TTM, all that's needed is to delete RM Pre 6 and 7, make the necessary consequential changes and we're done. TTM is now easy to do!

The grief comes 100% entirely from choosing natural join with name matching as the basis for an entire type system, where everybody else did not. Perhaps that is the one big blunder.

I must be missing something here. I don't see how natural join is the basis for the type system, nor do I see how type generators disappear merely because you've made what appears to be little more than a syntactic change to JOIN. You'd still use TUPLE and RELATION to declare variables and literals. The only difference appears to be whether a JOIN requires you to explicitly identify attribute pairs vs the system identifying them by the same name.

I can see a different JOIN changing how often you might need to use RENAME -- maybe -- but not the type system. Indeed, the TTM/IM or even the TTM type system sans IM would be the same even if not used with the relational model.

But if used with the relational model and a different JOIN, all the same issues apply to UNION, MINUS, INTERSECT, no?

I'm the forum administrator and lead developer of Rel. Email me at dave@armchair.mb.ca with the Subject 'TTM Forum'. Download Rel from https://reldb.org

A mistake which seems to be universal is reading Codd as if he is prescribing language syntax when he defines operators in the abstract. So you get all these languages that implement the wrong idea that his explication of natural join must mean that the expression (R Join S) means that the names R and S represent joinable relations which they do not.

When named tables are assumed to be variables, or named relvars are assumed the mistake is only propagated.

Codd 1970 says: "A binary relation R is joinable with a binary relation S if there exists a ternary relation U such that pi 1,2 (U) = R and pi 2,3 (U) = S. Any such ternary relation is called a join of R with S. If R, S are binary relations such that pi 2 (R) = pi 1 (S), then R is joinable with S. One join that always exists in such a case is the natural join of R with S defined by ..."

There is no reason to think that his abstract definition signifies relvar names or table names or relation names that are the explicit names of the effective operands in a language implementation rather than implicit names. Note he says "if there exists a ternary relation U" which is as abstract as can be.

Note also that he defined projection before join because it's needed to define all joins, not only natural join. But given the widespread desire to optimize storage which applies normalization techniques, in practice most natural join values are not joins of explicit operands. Natural join is rather special because it doesn't depend on a host language for arithmetic or set ordering, only the ability to decide whether two values are the same value.

This should be quite obvious when a supposed natural join expression (R Join S) is replaced with ((R Union T) Join S). When no tuple of T appears as a sub-tuple of the join it should be clear that the tuples given by a natural join value are not necessarily derivable from the join operands named by a language.

The tuples of R that are not derivable from a join value are given by r =R Minus (R Minus (R Join S) {R-attribute set}). Codd could have been more clear if he had distinguished r from R and s from S but that's no excuse for languages that transpose his notation willy-nilly.

A bunch of useful equations/tautologies result when his definition is accurately applied. Then it can be understood that natural join is a way to de-normalize normalized relations, a transformation required by the data independence requirement. Chief among them are the logical simplifications that are possible in logical arguments that avoid explicit projections/quantifications since actual values are implied when they assume common attributes. Questions about programming types go away when the logical basis needs only one type because it doesn’t even need to be mentioned, likewise questions about relvar behaviour. Besides that the logical validity of deductive conclusions becomes possible. That possibility is itself a necessary requirement of system logical completeness. 

It's more than a little pathetic how the meaning of natural join has been usurped without much thought.

 

Quote from Dave Voorhis on November 3, 2019, 9:23 am
Quote from dandl on November 3, 2019, 1:42 am

I would have thought it was obvious. In effect the type generators disappear, TUPLE is just a scalar with components and RELATION is a collection of TUPLE. That's perfectly compatible with any number of existing languages.

A JOIN now looks something like this:

S.Join(SP, (s,sp) => s.Sid = sp.Sid, (s,sp) => new { Sid=s.Sid, SNAME=s.SNAME, STATUS=s.STATUS, CITY=s.CITY, Pid=sp.Pid, QTY=sp.QTY })

This is pure LINQ. Yes, you have to spell out all the field names just like in SQL, but is that too big a price? The advantage is that you can see precisely what the output will be without going back to the declarations, or a previous expression, to find out. The compiler still does all the type checking, so the expression is type safe. You just have to code a little harder.

For TTM, all that's needed is to delete RM Pre 6 and 7, make the necessary consequential changes and we're done. TTM is now easy to do!

The grief comes 100% entirely from choosing natural join with name matching as the basis for an entire type system, where everybody else did not. Perhaps that is the one big blunder.

I must be missing something here.

Obviously.

I don't see how natural join is the basis for the type system, nor do I see how type generators disappear merely because you've made what appears to be little more than a syntactic change to JOIN. You'd still use TUPLE and RELATION to declare variables and literals.

There is no distinct TUPLE type, there is only a scalar with components. It can be implemented as a class in Java, or a value type in C#. Two tuples with the same components are different types, as they are in most other languages. Some languages might allow structural equivalence of types.

There is no distinct RELATION type, there is merely a collection of tuples. In OO it's convenient for the collections to inherit from something. In Andl.NET I used an interface IRelatable.

The only difference appears to be whether a JOIN requires you to explicitly identify attribute pairs vs the system identifying them by the same name.

I can see a different JOIN changing how often you might need to use RENAME -- maybe -- but not the type system. Indeed, the TTM/IM or even the TTM type system sans IM would bethe same even if not used with the relational model.

There is no need for RENAME (as per SQL). And no, the two type generators exist only for the relatoinal model. Take out the RA of RM Pre 18 and they have no reason to exist.

But if used with the relational model and a different JOIN, all the same issues apply to UNION, MINUS, INTERSECT, no?

The set operations work only on tuples of the same type, so if they are not then one has to be transformed to match the other.

S1.Union(S2, s => new typeof(S1) { Sid = s.Sid, SNAME = s.SNAME, STATUS = s.STATUS, CITY = s.CITY }) // result is same type as S1, not S2

There's nothing weird about this. Most of it is already in LINQ. It works just fine, although it can get a bit verbose at times.

Andl - A New Database Language - andl.org

You don't "need" them.  I do, because I don't use SQL, but in any case I think "prefer" is more appropriate than "need".  To answer the question thus revised:

First, JOIN is the obvious natural counterpart to conjunction as used in predicate expressions, using predicate variables as attribute names.

Second, JOIN avoids the appearance of identical columns.  In the SQL example given with the original question, the writer uses projection to explicitly remove the column D.

Third, JOIN avoids the need for range variables in cases where operands have common attributes.

Fourth, The "Marriage Principle", as I called it when I wrote about it in 1989, neatly applies to all the dyadic operators.

I can't see any relevance to the type system of dyadic operators in RA, apart from the requirement for common attributes to be of the same type (which applies to all comparisons anyway).

Hugh

Coauthor of The Third Manifesto and related books.
Quote from dandl on November 3, 2019, 1:37 pm
Quote from Dave Voorhis on November 3, 2019, 9:23 am
Quote from dandl on November 3, 2019, 1:42 am

I would have thought it was obvious. In effect the type generators disappear, TUPLE is just a scalar with components and RELATION is a collection of TUPLE. That's perfectly compatible with any number of existing languages.

A JOIN now looks something like this:

S.Join(SP, (s,sp) => s.Sid = sp.Sid, (s,sp) => new { Sid=s.Sid, SNAME=s.SNAME, STATUS=s.STATUS, CITY=s.CITY, Pid=sp.Pid, QTY=sp.QTY })

This is pure LINQ. Yes, you have to spell out all the field names just like in SQL, but is that too big a price? The advantage is that you can see precisely what the output will be without going back to the declarations, or a previous expression, to find out. The compiler still does all the type checking, so the expression is type safe. You just have to code a little harder.

For TTM, all that's needed is to delete RM Pre 6 and 7, make the necessary consequential changes and we're done. TTM is now easy to do!

The grief comes 100% entirely from choosing natural join with name matching as the basis for an entire type system, where everybody else did not. Perhaps that is the one big blunder.

I must be missing something here.

Obviously.

I don't see how natural join is the basis for the type system, nor do I see how type generators disappear merely because you've made what appears to be little more than a syntactic change to JOIN. You'd still use TUPLE and RELATION to declare variables and literals.

There is no distinct TUPLE type, there is only a scalar with components. It can be implemented as a class in Java, or a value type in C#. Two tuples with the same components are different types, as they are in most other languages. Some languages might allow structural equivalence of types.

There is no distinct RELATION type, there is merely a collection of tuples. In OO it's convenient for the collections to inherit from something. In Andl.NET I used an interface IRelatable.

The only difference appears to be whether a JOIN requires you to explicitly identify attribute pairs vs the system identifying them by the same name.

I can see a different JOIN changing how often you might need to use RENAME -- maybe -- but not the type system. Indeed, the TTM/IM or even the TTM type system sans IM would bethe same even if not used with the relational model.

There is no need for RENAME (as per SQL). And no, the two type generators exist only for the relatoinal model. Take out the RA of RM Pre 18 and they have no reason to exist.

But if used with the relational model and a different JOIN, all the same issues apply to UNION, MINUS, INTERSECT, no?

The set operations work only on tuples of the same type, so if they are not then one has to be transformed to match the other.

S1.Union(S2, s => new typeof(S1) { Sid = s.Sid, SNAME = s.SNAME, STATUS = s.STATUS, CITY = s.CITY }) // result is same type as S1, not S2

There's nothing weird about this. Most of it is already in LINQ. It works just fine, although it can get a bit verbose at times.

You seem to be echoing what I'd suggested elsewhere -- such as https://forum.thethirdmanifesto.com/forum/topic/how-to-resolve-the-type-system-relational-algebra-mismatch/#postid-985925 -- in terms of using Java Streams or .NET LINQ, but I'm not clear how some mainly-syntactic tweaks to JOIN (i.e., implicit matching of attributes based on common names vs explicit matching of attributes whether names are common or not) now make LINQ/Streams usable in a way they apparently weren't before. I already know they're usable -- I've made extensive use of LINQ and Streams -- and you probably have too. Using a (dynamically generated & compiled) class to represent a tuple type is what I'm doing in my datasheet project, as I've noted several times.

Whilst neither Streams nor LINQ are a relational algebra per se, they unquestionably embody its essential strength -- i.e., a set of data structures abstracted under a unified interface and an algebra specifying transformations to values/instances of that interface.

I'm the forum administrator and lead developer of Rel. Email me at dave@armchair.mb.ca with the Subject 'TTM Forum'. Download Rel from https://reldb.org
Quote from Dave Voorhis on November 3, 2019, 4:29 pm

I'm not clear how some mainly-syntactic tweaks to JOIN (i.e., implicit matching of attributes based on common names vs explicit matching of attributes whether names are common or not) now make LINQ/Streams usable in a way they apparently weren't before.

I don't think they either.  But I do wonder if natural joins as we know them are a bad idea on PL design grounds.  In the same sort of way as structural equivalence of types, they allow accidental matchings between semantically unrelated but similarly-named things, in this case attributes.  Certainly ordinary renames are sufficient:  a RENAME {bar AS baz, my_goo AS goo} JOIN b RENAME{baz AS foo, your_goo AS goo} will suffice to join a{foo, bar, this, my_goo} with b{baz, bar, that, your_goo} to return something of type REL{foo, baz, this, that, goo} without a problem from the semantically unrelated attributes named bar.

But I think an unnatural-join syntax like a JOIN b {LEFT foo RIGHT bar AS foo, LEFT bar RIGHT baz AS baz, LEFT my_goo RIGHT your_goo AS goo} is easier to understand, not least because it puts the corresponding names together.  In the usual case where the attribute name of the result is not novel, this can be abbreviated to a JOIN b {RIGHT bar AS baz, LEFT bar AS baz, LEFT my_goo RIGHT your_goo AS goo}.

A more SQL syntax is also possible:  a JOIN b ON {a.foo = b.baz AS foo, a.bar = b.baz AS baz, a.my_goo = b.your_goo AS goo} or for short a JOIN b ON {a.foo AS b.baz, b.bar AS a.baz, a.my_goo = b.your_goo AS goo}.  Of course dot is a special syntax valid only in the ON-clause, not a general escape from identifier rules.

Quote from johnwcowan on November 3, 2019, 8:17 pm

But I do wonder if natural joins as we know them are a bad idea on PL design grounds.  In the same sort of way as structural equivalence of types, they allow accidental matchings between semantically unrelated but similarly-named things, in this case attributes.

That is a common criticism. I offer the usual response: design the schema such that if any pair of attributes X and Y are to be joined, they must have the same name and type. If attributes X and Y are not to be joined, they must have different names.

I'm the forum administrator and lead developer of Rel. Email me at dave@armchair.mb.ca with the Subject 'TTM Forum'. Download Rel from https://reldb.org
Page 1 of 5Next