The Forum for Discussion about The Third Manifesto and Related Matters

You need to log in to create posts and topics.

Why do we need natural joins?

Quote from Dave Voorhis on November 3, 2019, 8:24 pm

That is a common criticism. I offer the usual response: design the schema such that if any pair of attributes X and Y are to be joined, they must have the same name and type. If attributes X and Y are not to be joined, they must have different names.

You could say exactly the same about structural equivalence, and yet Darwinian selection seems to have weeded it out of statically typed languages.  Go has a different form of structural equivalence depending on method names rather than attribute names: if you implement all the methods required by an interface, you implement the interface with no need to say so.  That seems less dangerous.

In addition, the schema designer is often not the database programmer, and waterfall design is far from dead.

Quote from Hugh on November 3, 2019, 2:58 pm

You don't "need" them.  I do, because I don't use SQL, but in any case I think "prefer" is more appropriate than "need".

Thank you Hugh, you've said nearly everything I wanted to say.

To answer the question thus revised:

First, JOIN is the obvious natural counterpart to conjunction as used in predicate expressions, using predicate variables as attribute names.

Furthermore JOIN is the natural generalisation of INTERSECT when the operands don't have identical headings; and the natural generalisation of TIMES when the operands don't have disjoint headings.

Second, JOIN avoids the appearance of identical columns.  In the SQL example given with the original question, the writer uses projection to explicitly remove the column D.

Third, JOIN avoids the need for range variables in cases where operands have common attributes.

So it's possible to define JOIN in terms of RENAME and theta-joins (or TIMES and theta-selection) and projections (as does Codd 1972); but why all that complexity?

Fourth, The "Marriage Principle", as I called it when I wrote about it in 1989, neatly applies to all the dyadic operators.

I can't see any relevance to the type system of dyadic operators in RA, apart from the requirement for common attributes to be of the same type (which applies to all comparisons anyway).

 

Quote from Dave Voorhis on November 3, 2019, 8:24 pm
Quote from johnwcowan on November 3, 2019, 8:17 pm

But I do wonder if natural joins as we know them are a bad idea on PL design grounds.  In the same sort of way as structural equivalence of types, they allow accidental matchings between semantically unrelated but similarly-named things, in this case attributes.

That is a common criticism. I offer the usual response: design the schema such that if any pair of attributes X and Y are to be joined, they must have the same name and type. If attributes X and Y are not to be joined, they must have different names.

I think it's also worth pointing out that JOIN with the "Marriage Principle" is the natural counterpart to normalisation: with a unnormalised schema you start with a single attribute; normalising vertically partitions the schema with the effect that one attribute appears in two (or more) relvars. It is not that we get two attributes with the same name and type by accident. So JOIN isn't causing "accidental matchings between semantically unrelated"; it's causing deliberate matchings that unify what was split asunder.

Quote from johnwcowan on November 3, 2019, 8:52 pm
Quote from Dave Voorhis on November 3, 2019, 8:24 pm

That is a common criticism. I offer the usual response: design the schema such that if any pair of attributes X and Y are to be joined, they must have the same name and type. If attributes X and Y are not to be joined, they must have different names.

You could say exactly the same about structural equivalence, and yet Darwinian selection seems to have weeded it out of statically typed languages.  Go has a different form of structural equivalence depending on method names rather than attribute names: if you implement all the methods required by an interface, you implement the interface with no need to say so.  That seems less dangerous.

In addition, the schema designer is often not the database programmer, and waterfall design is far from dead.

If you want belt and braces, use a form of JOIN in which the programmer must explicitly name all the expected attributes in common.

S JOIN SP ON {S#}

(Makes a mess of the infix syntax.)

Quote from Hugh on November 3, 2019, 2:58 pm

You don't "need" them.  I do, because I don't use SQL, but in any case I think "prefer" is more appropriate than "need".

The intent of the question is to ask: if choosing a particular kind of natural join leads to a serious problem of language design (the type generators), should we at least consider alternatives? If we choose a different kind of join (such as Codd's theta join or even Codd's natural join), can we keep all the other good ideas in TTM and avoid the type system problem?

The question of whether we 'prefer' or 'need' one or the other is moot. The question is, having found a roadblock, how far we can go down the other path?

You could respond to that by saying that natural joins are non-negotiable, or by demonstrating a different way around the roadblock, or by considering the consequences of such a choose and showing why that would or would not work.

To answer the question thus revised:

First, JOIN is the obvious natural counterpart to conjunction as used in predicate expressions, using predicate variables as attribute names.

Second, JOIN avoids the appearance of identical columns.  In the SQL example given with the original question, the writer uses projection to explicitly remove the column D.

Correct. That is a common feature of theta joins and SQL.

Third, JOIN avoids the need for range variables in cases where operands have common attributes.

That needs explanation.

Fourth, The "Marriage Principle", as I called it when I wrote about it in 1989, neatly applies to all the dyadic operators.

I can't see any relevance to the type system of dyadic operators in RA, apart from the requirement for common attributes to be of the same type (which applies to all comparisons anyway).

The problem is precisely that: the concept of 'common attributes' turns out to be next to impossible to implement in the type system of many modern programming languages.

 

Andl - A New Database Language - andl.org
Quote from Dave Voorhis on November 3, 2019, 4:29 pm

You seem to be echoing what I'd suggested elsewhere -- such as https://forum.thethirdmanifesto.com/forum/topic/how-to-resolve-the-type-system-relational-algebra-mismatch/#postid-985925 -- in terms of using Java Streams or .NET LINQ, but I'm not clear how some mainly-syntactic tweaks to JOIN (i.e., implicit matching of attributes based on common names vs explicit matching of attributes whether names are common or not) now make LINQ/Streams usable in a way they apparently weren't before. I already know they're usable -- I've made extensive use of LINQ and Streams -- and you probably have too. Using a (dynamically generated & compiled) class to represent a tuple type is what I'm doing in my datasheet project, as I've noted several times.

Whilst neither Streams nor LINQ are a relational algebra per se, they unquestionably embody its essential strength -- i.e., a set of data structures abstracted under a unified interface and an algebra specifying transformations to values/instances of that interface.

It's obviously a closely related theme, but what I'm saying is: D&D made a choice (natural join) in writing TTM, the choice led them to a certain kind of type system (with type generators and headings), and that leads to a dead end in terms of irreconcilable type systems. If they made a different choice (such as theta joins), could that lead to a TTM Mk II with a different kind of type system, more like that found in modern statically typed programming languages? Is it possible to throw out the tuple and relation type generators as written, replace them by something different, and keep all the other useful work? By my count, RM Pre 1-5, 8, 11-26 would survive with little or no change. The others would need to be rewritten.

I quoted LINQ-like code to show one possible approach, but with a different TTM type system as the starting point, there might be any number of ways to get there.

LINQ is only a small part of the solution. TTM aims to solve a bigger problem than just the type system and the RA.

Andl - A New Database Language - andl.org
Quote from dandl on November 3, 2019, 11:45 pm

It's obviously a closely related theme, but what I'm saying is: D&D made a choice (natural join) in writing TTM, the choice led them to a certain kind of type system (with type generators and headings),

I don't understand how one leads to the other, but that's by the way.

and that leads to a dead end in terms of irreconcilable type systems. If they made a different choice (such as theta joins), could that lead to a TTM Mk II with a different kind of type system, more like that found in modern statically typed programming languages? Is it possible to throw out the tuple and relation type generators as written, replace them by something different, and keep all the other useful work?

Type generators are just generic types by another name, cf. Haskell.   I don't understand the supposed difference between TTM and a subset of Java in which all objects are immutable.

Quote from p c on November 3, 2019, 11:48 am

A mistake which seems to be universal is reading Codd as if he is prescribing language syntax when he defines operators in the abstract. So you get all these languages that implement the wrong idea that his explication of natural join must mean that the expression (R Join S) means that the names R and S represent joinable relations which they do not.

Please be aware that I am specifically referring to Codd's early writings, up to 1972. I find no mention of an expression in that form in his writings.

When named tables are assumed to be variables, or named relvars are assumed the mistake is only propagated.

Codd 1970 says: "A binary relation R is joinable with a binary relation S if there exists a ternary relation U such that a binary relation R is joinable with a binary relation S if there exists a ternary relation U such that pi1,2 (U) = R and pi2,3 (U) = S. Any such ternary relation is called a join of R with pi1,2 (U) = R and pi2,3 (U) = S. Any such ternary relation is called a join of R with S".

Where? I don't see any references to a ternary relation in my (limited) sources. And I find it hard to accept that as a direct quote, given the duplication in the wording.

There is no reason to think that his abstract definition signifies relvar names or table names or relation names that are the explicit names of the effective operands in a language implementation rather than implicit names. Note he says "if there exists a ternary relation U" which is as abstract as can be.

Note also that he defined projection before join because it's needed to define all joins, not only natural join. But given the widespread desire to optimize storage which applies normalization techniques, in practice most natural join values are not joins of explicit operands. Natural join is rather special because it doesn't depend on a host language for arithmetic or set ordering, only the ability to decide whether two values are the same value.

This should be quite obvious when a supposed natural join expression (R Join S) is replaced with ((R Union T) Join S). When no tuple of T appears as a sub-tuple of the join it should be clear that the tuples given by a natural join value are not necessarily derivable from the join operands named by a language.

Do you claim this of Codd, or is this your own work?

The tuples of R that are not derivable from a join value are given by r =R Minus (R Minus (R Join S) {R-attribute set}). Codd could have been more clear if he had distinguished r from R and s from S but that's no excuse for languages that transpose his notation willy-nilly.

Again, where?

A bunch of useful equations/tautologies result when his definition is accurately applied. Then it can be understood that natural join is a way to de-normalize normalized relations, a transformation required by the data independence requirement. Chief among them are the logical simplifications that are possible in logical arguments that avoid explicit projections/quantifications since actual values are implied when they assume common attributes. Questions about programming types go away when the logical basis needs only one type because it doesn’t even need to be mentioned, likewise questions about relvar behaviour. Besides that the logical validity of deductive conclusions becomes possible. That possibility is itself a necessary requirement of system logical completeness. 

It's more than a little pathetic how the meaning of natural join has been usurped without much thought.

Please be specific: whose 'meanings' of natural join do you refer to, and what do they say?

 

Andl - A New Database Language - andl.org
Quote from johnwcowan on November 3, 2019, 11:56 pm
Quote from dandl on November 3, 2019, 11:45 pm

It's obviously a closely related theme, but what I'm saying is: D&D made a choice (natural join) in writing TTM, the choice led them to a certain kind of type system (with type generators and headings),

I don't understand how one leads to the other, but that's by the way.

Then tell me what purpose such an arcane type system serves other than to enable natural joins? Why not choose a record type similar to C, PL/I and Pascal, other than the difficulty of describing natural joins in this languages?

and that leads to a dead end in terms of irreconcilable type systems. If they made a different choice (such as theta joins), could that lead to a TTM Mk II with a different kind of type system, more like that found in modern statically typed programming languages? Is it possible to throw out the tuple and relation type generators as written, replace them by something different, and keep all the other useful work?

Type generators are just generic types by another name, cf. Haskell.   I don't understand the supposed difference between TTM and a subset of Java in which all objects are immutable.

I think the better term is a 'parameterised type'. And the difference: you cannot write or implement a natural join in Java (except by reflection or code generation). You can write and implement a theta join.

 

Andl - A New Database Language - andl.org
Quote from dandl on November 4, 2019, 2:06 am

And the difference: you cannot write or implement a natural join in Java (except by reflection or code generation). You can write and implement a theta join.

Okay, I understand now.  But I don't see how you can do θ-joins either in Java without reflection/generation.  To do it at the method call level, you need Attribute objects containing Strings and Classes that you can pass around, which means you can't represent tuples as POJOs: they would have to be of type Map<Attribute, Object> and can't be used without either explicit casting or reflective castAs.

Can you sketch what you have in mind?

Quote from dandl on November 4, 2019, 2:06 am
Quote from johnwcowan on November 3, 2019, 11:56 pm
Quote from dandl on November 3, 2019, 11:45 pm

It's obviously a closely related theme, but what I'm saying is: D&D made a choice (natural join) in writing TTM, the choice led them to a certain kind of type system (with type generators and headings),

I don't understand how one leads to the other, but that's by the way.

John's correct/the causation is the other way round: the (ahem) natural choice of relations-as-sets, headings-as-sets lead to the Prescriptions defining a relation (value); and that leads to Natural JOIN being the (ahem) natural expression of FOPL AND for the predicates of which the relation values are extension.

Then tell me what purpose such an arcane type system

I see no "arcane" type system. The TTM type system is easy to describe (leaving out the IM). I see a lot of benighted users of type-systems in currently popular languages that seem to have their understandings crippled.

serves other than to enable natural joins?

and disjunctions and relative complements (NOT MATCHING) and projections and restrictions-as-joins (ref Appendix A) and ... pretty much every operation you'd want in an expressive query language/algebra.

Why not choose a record type similar to C, PL/I and Pascal, other than the difficulty of describing natural joins in this languages?

The difficulty of "describing natural joins" is just a microcosm of the difficulty of the 'positional perspective' in general: for example, how do you specify the resulting column positions from a Cartesian Product? Remember your operations are in general nested, so that has to work for a Cartesian product whose operands are Cartesian products, etc, etc. Look how much of the SQL standard is preoccupied by trying to specify positional access to columns.

Perhaps I should be more explicit: the result of Natural JOIN is a relation value/the operation is 'closed over' relations. If your positional access has to say: this is a 4-ply deep JOIN/product, so you need 3-dot suffixes to access nested attributes, then you're working with values that are not relations/you have non-uniformity of representations.

and that leads to a dead end in terms of irreconcilable type systems. If they made a different choice (such as theta joins), could that lead to a TTM Mk II with a different kind of type system, more like that found in modern statically typed programming languages? Is it possible to throw out the tuple and relation type generators as written, replace them by something different, and keep all the other useful work?

Type generators are just generic types by another name, cf. Haskell.   I don't understand the supposed difference between TTM and a subset of Java in which all objects are immutable.

I think the better term is a 'parameterised type'. And the difference: you cannot write or implement a natural join in Java (except by reflection or code generation). You can write and implement a theta join.

Can you? What if the operands happen to have some attribute name in common? Or are you excluding that possibility? Then how do you express in Java's type system: these collections must have no attributes in common? How do you write a Cartesian product in Java? How do you write a theta-selection over a Cartesian product?