Why do we need natural joins?

#31 · November 4, 2019, 10:25 pm

Quote from AntC on November 4, 2019, 10:25 pm

Quote from Vadim on November 4, 2019, 6:24 pm

Quote from AntC on November 3, 2019, 8:55 pm

Furthermore JOIN is the natural generalisation of INTERSECT when the operands don't have identical headings; and the natural generalisation of TIMES when the operands don't have disjoint headings.

It subsumes selection operator as well.

I did mention that. JOIN also subsumes EXTEND, especially obviously if the attribs in common are a key to the 'lookup' relation. So JOIN wrt the comparative headings of its operands:

no attribs in common = TIMES

same attribs = INTERSECT

one a subset = restrict/WHERE

attribs in common a key = EXTEND

attribs in common non-empty, and each operand's exclusive attribs also non-empty = vanilla JOIN

I'm puzzled why natural join generalizes 3 concepts, while inner union only 2...

For Union, we don't have a 'concept' of unioning relations with no attributes in common (compare TIMES) -- or do I mean a 'concept' of projecting on no attributes? Work through the bullet list above for InnerUnion the comparative headings of its operands:

no attribs in common = ?

same attribs = (Codd's) UNION

one a subset (and that subset relation empty) = project

attribs in common a key = ?

attribs in common non-empty, and each operand's exclusive attribs also non-empty, and one relation empty = generalised project

as above except neither relation empty = generalised UNION

Database practice, though, is messier than theory. If natural join is considered as a foundation of some query language, then what is the method of handling columns like CREATED_BY and CREATED_ON?

Would we be doing that if it weren't that SQL allows us to 'get away with it'? I've worked in shops where naming standards are that columns must have a prefix indicating the base table -- this avoids 'clash' of ubiquitous names, at cost of destroying any hope for Natural JOIN being useful.

Is RDBMS supposed to automatically generate "end-user-query-oriented" views that exclude such columns? Or every transaction is recorded with provenance information so that tuple modification date and user can be queried separately without having them in every table. Still, this design seems to be untrustworthy when adding columns to the schema breaks existing queries. Are there more examples of the columns similar to CREATED_BY and CREATED_ON?

Arguably audit fields like that should be kept in the systems/database commit log, not base tables. Never the less there might be a TRANSACTION_ID in the base table that links to the log.

Quote from Vadim on November 4, 2019, 6:24 pm

Quote from AntC on November 3, 2019, 8:55 pm

Furthermore JOIN is the natural generalisation of INTERSECT when the operands don't have identical headings; and the natural generalisation of TIMES when the operands don't have disjoint headings.

It subsumes selection operator as well.

I did mention that. JOIN also subsumes EXTEND, especially obviously if the attribs in common are a key to the 'lookup' relation. So JOIN wrt the comparative headings of its operands:

no attribs in common = TIMES
same attribs = INTERSECT
one a subset = restrict/WHERE
attribs in common a key = EXTEND
attribs in common non-empty, and each operand's exclusive attribs also non-empty = vanilla JOIN

I'm puzzled why natural join generalizes 3 concepts, while inner union only 2...

For Union, we don't have a 'concept' of unioning relations with no attributes in common (compare TIMES) -- or do I mean a 'concept' of projecting on no attributes? Work through the bullet list above for InnerUnion the comparative headings of its operands:

no attribs in common = ?
same attribs = (Codd's) UNION
one a subset (and that subset relation empty) = project
attribs in common a key = ?
attribs in common non-empty, and each operand's exclusive attribs also non-empty, and one relation empty = generalised project
as above except neither relation empty = generalised UNION

Database practice, though, is messier than theory. If natural join is considered as a foundation of some query language, then what is the method of handling columns like CREATED_BY and CREATED_ON?

Would we be doing that if it weren't that SQL allows us to 'get away with it'? I've worked in shops where naming standards are that columns must have a prefix indicating the base table -- this avoids 'clash' of ubiquitous names, at cost of destroying any hope for Natural JOIN being useful.

Is RDBMS supposed to automatically generate "end-user-query-oriented" views that exclude such columns? Or every transaction is recorded with provenance information so that tuple modification date and user can be queried separately without having them in every table. Still, this design seems to be untrustworthy when adding columns to the schema breaks existing queries. Are there more examples of the columns similar to CREATED_BY and CREATED_ON?

Arguably audit fields like that should be kept in the systems/database commit log, not base tables. Never the less there might be a TRANSACTION_ID in the base table that links to the log.

#32 · November 4, 2019, 10:33 pm

Quote from AntC on November 4, 2019, 10:25 pm

I'm puzzled why natural join generalizes 3 concepts, while inner union only 2...

For Union, we don't have a 'concept' of unioning relations with no attributes in common (compare TIMES) -- or do I mean a 'concept' of projecting on no attributes?

That would be outer join if we had outer join.

#33 · November 4, 2019, 11:29 pm

Quote from Dave Voorhis on November 4, 2019, 7:49 am

Quote from dandl on November 4, 2019, 6:54 am

Can you? What if the operands happen to have some attribute name in common? Or are you excluding that possibility? Then how do you express in Java's type system: these collections must have no attributes in common? How do you write a Cartesian product in Java? How do you write a theta-selection over a Cartesian product?

You can't express any of those things, Java's type system is way too simple. But they are not requirement for TTM and you can most certainly write a Cartesian product in C# LINQ, and a selection over that. See samples posted elsewhere.

Are you claiming .NET LINQ can do something vis-a-vis Cartesian product of collections that Java Streams can't?

They appear to be essentially equivalent in capability, being based on essentially the same language and type system. Differences are negligible. C#'s LINQ query syntax might look friendlier compared to LINQ method syntax or Java Streams, but the latter two are notionally equivalent.

Of course not. I know how to do it in LINQ, and so I assume there is a way in JS.

I only claim there is no way to get the type system to do the other things requested in this post. You can write the code and it will compile and avoid type errors, but you would need runtime code, probably based on Reflection, to do the rest.

Andl - A New Database Language - andl.org

#34 · November 4, 2019, 11:55 pm

Quote from dandl on November 4, 2019, 11:55 pm

Quote from Dave Voorhis on November 4, 2019, 8:10 am

Quote from dandl on November 3, 2019, 11:45 pm

Quote from Dave Voorhis on November 3, 2019, 4:29 pm

You seem to be echoing what I'd suggested elsewhere -- such as https://forum.thethirdmanifesto.com/forum/topic/how-to-resolve-the-type-system-relational-algebra-mismatch/#postid-985925 -- in terms of using Java Streams or .NET LINQ, but I'm not clear how some mainly-syntactic tweaks to JOIN (i.e., implicit matching of attributes based on common names vs explicit matching of attributes whether names are common or not) now make LINQ/Streams usable in a way they apparently weren't before. I already know they're usable -- I've made extensive use of LINQ and Streams -- and you probably have too. Using a (dynamically generated & compiled) class to represent a tuple type is what I'm doing in my datasheet project, as I've noted several times.

Whilst neither Streams nor LINQ are a relational algebra per se, they unquestionably embody its essential strength -- i.e., a set of data structures abstracted under a unified interface and an algebra specifying transformations to values/instances of that interface.

It's obviously a closely related theme, but what I'm saying is: D&D made a choice (natural join) in writing TTM, the choice led them to a certain kind of type system (with type generators and headings), and that leads to a dead end in terms of irreconcilable type systems. If they made a different choice (such as theta joins), could that lead to a TTM Mk II with a different kind of type system, more like that found in modern statically typed programming languages? Is it possible to throw out the tuple and relation type generators as written, replace them by something different, and keep all the other useful work? By my count, RM Pre 1-5, 8, 11-26 would survive with little or no change. The others would need to be rewritten.

I quoted LINQ-like code to show one possible approach, but with a different TTM type system as the starting point, there might be any number of ways to get there.

LINQ is only a small part of the solution. TTM aims to solve a bigger problem than just the type system and the RA.

As others have noted, the choice of natural join is the result of making choices about RA and type system design. The TTM RA and type system was not a result of making a choice to use natural join.

Chicken or egg?

I don't see how the choice of natural join or not relates to tuple and relation type generators. The use of tuple and relation type generators seem more related to specifying non-scalar literals, variable and parameter types -- and use of UNION, INTERSECT and MINUS, etc. -- than they do to natural join.

Natural join with its dependence on matching fields by name is the key construct in TTM that mandates a type system with generators based on headings. Headings and the generators that depend on them allow the compiler to type check a natural join, as per RM Pre 18. Without them it can only be done at runtime, and then you are exposed to type errors. The same goes for derivations like MATCHING and NON MATCHING. The set operations can be type checked if the types are identical, or left to runtime if they rely on name matching.

The theta join requires the programmer to express the type of inputs, outputs and join columns, which allows the compiler to type check a join and satisfy RM Pre 18. Similar considerations apply to set operations and projections. You lose convenience, but you gain access to a wide range of existing languages in the role of TTM/D.

Of course, there are various ways to implement an RA -- or a non-RA collections framework and associated algebra such as .NET LINQ or Java Streams -- but these are not TTM. That doesn't mean they wouldn't be equally valid or useful.

With this change, they could be.

Quote from Dave Voorhis on November 4, 2019, 8:10 am

Quote from dandl on November 3, 2019, 11:45 pm

Quote from Dave Voorhis on November 3, 2019, 4:29 pm

You seem to be echoing what I'd suggested elsewhere -- such as https://forum.thethirdmanifesto.com/forum/topic/how-to-resolve-the-type-system-relational-algebra-mismatch/#postid-985925 -- in terms of using Java Streams or .NET LINQ, but I'm not clear how some mainly-syntactic tweaks to JOIN (i.e., implicit matching of attributes based on common names vs explicit matching of attributes whether names are common or not) now make LINQ/Streams usable in a way they apparently weren't before. I already know they're usable -- I've made extensive use of LINQ and Streams -- and you probably have too. Using a (dynamically generated & compiled) class to represent a tuple type is what I'm doing in my datasheet project, as I've noted several times.

Whilst neither Streams nor LINQ are a relational algebra per se, they unquestionably embody its essential strength -- i.e., a set of data structures abstracted under a unified interface and an algebra specifying transformations to values/instances of that interface.

It's obviously a closely related theme, but what I'm saying is: D&D made a choice (natural join) in writing TTM, the choice led them to a certain kind of type system (with type generators and headings), and that leads to a dead end in terms of irreconcilable type systems. If they made a different choice (such as theta joins), could that lead to a TTM Mk II with a different kind of type system, more like that found in modern statically typed programming languages? Is it possible to throw out the tuple and relation type generators as written, replace them by something different, and keep all the other useful work? By my count, RM Pre 1-5, 8, 11-26 would survive with little or no change. The others would need to be rewritten.

I quoted LINQ-like code to show one possible approach, but with a different TTM type system as the starting point, there might be any number of ways to get there.

LINQ is only a small part of the solution. TTM aims to solve a bigger problem than just the type system and the RA.

As others have noted, the choice of natural join is the result of making choices about RA and type system design. The TTM RA and type system was not a result of making a choice to use natural join.

Chicken or egg?

I don't see how the choice of natural join or not relates to tuple and relation type generators. The use of tuple and relation type generators seem more related to specifying non-scalar literals, variable and parameter types -- and use of UNION, INTERSECT and MINUS, etc. -- than they do to natural join.

Natural join with its dependence on matching fields by name is the key construct in TTM that mandates a type system with generators based on headings. Headings and the generators that depend on them allow the compiler to type check a natural join, as per RM Pre 18. Without them it can only be done at runtime, and then you are exposed to type errors. The same goes for derivations like MATCHING and NON MATCHING. The set operations can be type checked if the types are identical, or left to runtime if they rely on name matching.

The theta join requires the programmer to express the type of inputs, outputs and join columns, which allows the compiler to type check a join and satisfy RM Pre 18. Similar considerations apply to set operations and projections. You lose convenience, but you gain access to a wide range of existing languages in the role of TTM/D.

Of course, there are various ways to implement an RA -- or a non-RA collections framework and associated algebra such as .NET LINQ or Java Streams -- but these are not TTM. That doesn't mean they wouldn't be equally valid or useful.

With this change, they could be.

Andl - A New Database Language - andl.org

#35 · November 5, 2019, 12:01 am

Quote from dandl on November 4, 2019, 11:55 pm

The theta join requires the programmer to express the type of inputs, outputs and join columns, which allows the compiler to type check a join and satisfy RM Pre 18.

What compiler? If you mean a conventional Java-ish compiler, I don't see how, except by reflection. How do you express the type of a column? Can you give an example?

#36 · November 5, 2019, 12:09 am

Quote from dandl on November 5, 2019, 12:09 am

Quote from p c on November 4, 2019, 10:26 am

Quote from dandl on November 4, 2019, 1:54 am

Quote from p c on November 3, 2019, 11:48 am

A mistake which seems to be universal is reading Codd as if he is prescribing language syntax when he defines operators in the abstract. So you get all these languages that implement the wrong idea that his explication of natural join must mean that the expression (R Join S) means that the names R and S represent joinable relations which they do not.

Please be aware that I am specifically referring to Codd's early writings, up to 1972. I find no mention of an expression in that form in his writings.

When named tables are assumed to be variables, or named relvars are assumed the mistake is only propagated.

Codd 1970 says: "A binary relation R is joinable with a binary relation S if there exists a ternary relation U such that a binary relation R is joinable with a binary relation S if there exists a ternary relation U such that pi1,2 (U) = R and pi2,3 (U) = S. Any such ternary relation is called a join of R with pi1,2 (U) = R and pi2,3 (U) = S. Any such ternary relation is called a join of R with S".

Where? I don't see any references to a ternary relation in my (limited) sources. And I find it hard to accept that as a direct quote, given the duplication in the wording.

...

The copy-and-paste from an Adobe pdf got mangled, perhaps by me trying to overcome the incompatibility with the Adobe font for the "pi" symbol. Anyway, here is a correction:

"A binary relation R is joinable with a binary relation S if there exists a ternary relation U such that pi 1,2 (U) = R and pi 2,3 (U) = S. Any such ternary relation is called a join of R with S. If R, S are binary relations such that pi 2 (R) = pi 1 (S), then R is joinable with S. One join that always exists in such a case is the natural join of R with S defined by ..."

I can't explain why you wouldn't find "ternary relation" in the pdf of Codd 1970. It appears in your quote. Maybe you are also looking at an edited copy that has also been mangled.

OK, got it. Relation Model for Large Shared Data Banks, 1970. I find the later 1972 "Relational Completeness of Database Sublanguages" more useful, and it was to that I was referring.

In the 1970 paper it's all very algebraic, but in 1972 he is clearly setting out a possible syntactical form for a query language.

Quote from p c on November 4, 2019, 10:26 am

Quote from dandl on November 4, 2019, 1:54 am

Quote from p c on November 3, 2019, 11:48 am

A mistake which seems to be universal is reading Codd as if he is prescribing language syntax when he defines operators in the abstract. So you get all these languages that implement the wrong idea that his explication of natural join must mean that the expression (R Join S) means that the names R and S represent joinable relations which they do not.

Please be aware that I am specifically referring to Codd's early writings, up to 1972. I find no mention of an expression in that form in his writings.

When named tables are assumed to be variables, or named relvars are assumed the mistake is only propagated.

Codd 1970 says: "A binary relation R is joinable with a binary relation S if there exists a ternary relation U such that a binary relation R is joinable with a binary relation S if there exists a ternary relation U such that pi1,2 (U) = R and pi2,3 (U) = S. Any such ternary relation is called a join of R with pi1,2 (U) = R and pi2,3 (U) = S. Any such ternary relation is called a join of R with S".

Where? I don't see any references to a ternary relation in my (limited) sources. And I find it hard to accept that as a direct quote, given the duplication in the wording.

...

The copy-and-paste from an Adobe pdf got mangled, perhaps by me trying to overcome the incompatibility with the Adobe font for the "pi" symbol. Anyway, here is a correction:

"A binary relation R is joinable with a binary relation S if there exists a ternary relation U such that pi 1,2 (U) = R and pi 2,3 (U) = S. Any such ternary relation is called a join of R with S. If R, S are binary relations such that pi 2 (R) = pi 1 (S), then R is joinable with S. One join that always exists in such a case is the natural join of R with S defined by ..."

I can't explain why you wouldn't find "ternary relation" in the pdf of Codd 1970. It appears in your quote. Maybe you are also looking at an edited copy that has also been mangled.

OK, got it. Relation Model for Large Shared Data Banks, 1970. I find the later 1972 "Relational Completeness of Database Sublanguages" more useful, and it was to that I was referring.

In the 1970 paper it's all very algebraic, but in 1972 he is clearly setting out a possible syntactical form for a query language.

Andl - A New Database Language - andl.org

#37 · November 5, 2019, 12:22 am

Quote from johnwcowan on November 4, 2019, 10:33 pm

Quote from AntC on November 4, 2019, 10:25 pm

I'm puzzled why natural join generalizes 3 concepts, while inner union only 2...

For Union, we don't have a 'concept' of unioning relations with no attributes in common (compare TIMES) -- or do I mean a 'concept' of projecting on no attributes?

That would be outer join if we had outer join.

Appendix A's <OR> is the closest to outer join (without introducing Nulls). I'd say noone had a 'concept' of <OR> -- it bears no close similarity to any of Codd's 1972 operators, which is where I presume Vadim is getting his 'concept's. Both <OR> and (FULL) Outer Join have a determinate result in case the operands have no attributes in common, but I'd describe that as a happy accident carefully arranged rather than a pre-existing 'concept' like TIMES/Cartesian product, which comes from math.

#38 · November 5, 2019, 2:19 am

Quote from johnwcowan on November 5, 2019, 12:01 am

Quote from dandl on November 4, 2019, 11:55 pm

The theta join requires the programmer to express the type of inputs, outputs and join columns, which allows the compiler to type check a join and satisfy RM Pre 18.

What compiler? If you mean a conventional Java-ish compiler, I don't see how, except by reflection. How do you express the type of a column? Can you give an example?

This will be the third time I've posted this. I don't see why it's so hard to understand.

A natural join on the D&D supplier database looks like this:

S join SP

The Codd theta join looks like this:

S[Sid=Sid]SP(Sid SNAME STATUS CITY Pid QTY) // projects away the duplicate Sid

The SQL version looks roughly like this:

select S.Sid, SNAME, STATUS, CITY, Pid, QTY from S, SP where S.Sid = SP.Sid

The C# LINQ version looks like this:

S.Join(SP, (s,sp) => s.Sid = sp.Sid, (s,sp) => new { Sid=s.Sid, SNAME=s.SNAME, STATUS=s.STATUS, CITY=s.CITY, Pid=sp.Pid, QTY=sp.QTY })

I don't know Java streams so I can't tell you, but I'm sure it looks a lot like the C#. The Join method takes two function arguments, one to do the comparison, the other to fabricate the new record.

The type of a column is anything for which assignment and compare equals are defined. That's any value type in C#.

Andl - A New Database Language - andl.org

#39 · November 5, 2019, 8:09 am

Quote from dandl on November 4, 2019, 11:55 pm

Quote from Dave Voorhis on November 4, 2019, 8:10 am

As others have noted, the choice of natural join is the result of making choices about RA and type system design. The TTM RA and type system was not a result of making a choice to use natural join.

Chicken or egg?

I'm not sure why natural JOIN should be singled out for determining the course of TTM type-system direction instead of relvar declarations, tuple variable declarations, parameter types, use of structural typing rather than nominal typing in general, the semantics of COMPOSE/MATCHING/UNION/INTERSECT/MINUS, etc. It seems more like type-system choices are deeply commingled with TTM in general (and visibly manifest in Tutorial D in particular) with natural JOIN being one (relatively small) part of that.

Dispensing with tuple and relation type generators would mean a rather dramatic change to TTM and Tutorial D just to suit a rather specialised use case of making TTM-compliant collection frameworks in languages that happen to be popular right now. I'm not convinced it's worth it.

Quote from dandl on November 4, 2019, 11:55 pm

Quote from Dave Voorhis on November 4, 2019, 8:10 am

Of course, there are various ways to implement an RA -- or a non-RA collections framework and associated algebra such as .NET LINQ or Java Streams -- but these are not TTM. That doesn't mean they wouldn't be equally valid or useful.

With this change, they could be.

TTM as a collections framework integrated into LINQ or Streams is a different purpose from defining alternatives to SQL. Is there some fundamental problem with LINQ or Streams that makes it compelling to redefine TTM to suit them?

I'm not sure there is.

If aspects of a relational algebra would make LINQ or Streams better, that's justification enough. It doesn't need TTM to change for that to happen or to be of value.

I'm the forum administrator and lead developer of Rel. Email me at dave@armchair.mb.ca with the Subject 'TTM Forum'. Download Rel from https://reldb.org

#40 · November 6, 2019, 1:31 pm

Quote from dandl on November 4, 2019, 6:23 am

A natural join on the D&D supplier database looks like this:

S join SP

The Codd theta join looks like this:

S[Sid=Sid]SP(Sid SNAME STATUS CITY Pid QTY) // projects away the duplicate Sid

The SQL version ...

Please, please, PLEASE.

Codd Theta-join also supported the 5 other comparison operators (< > <= >= <>) for matching tuples. So Codd theta-join could not possibly have room for "projects away the duplicate Sid" if no information was to be lost. If you are thinking of equi-join then please use the word equi-join.

Codd promoted (well, no, lip-serviced is a far better term for what he did) TTM's concept of attribute names [as the SOLE identifiers for attributes in a tuple/relation] because he probably had some awareness of the practical/pragmatic utility, but for the rest he was neck-deep into cartesian products with numbered attributes, at least in the earliest writings. It rippled through to (not to say poisoned) SQL and since whatever comes out of the industry is neck-deep into SQL, whatever comes out of the industry suffers exactly as much from the exact same trouble.

Author of SIRA_PRISE

The Forum for Discussion about The Third Manifesto and Related Matters