Java JOIN

#11 · May 16, 2020, 8:45 am

Quote from Dave Voorhis on May 16, 2020, 8:45 am

Quote from dandl on May 16, 2020, 2:34 am

Quote from Dave Voorhis on May 15, 2020, 8:47 pm

Partly inspired by dandl's recent TTM-in-C# efforts, and partly by my own work-in-progress "Java for Data" and datasheet efforts, I've created two variants on a JOIN operator in Java. One is a useful generic JOIN, the other is a hard-wired JOIN for comparison purposes.

The hard-wired JOIN only accepts the canonical S and P example relations (implemented using Java's generic HashSet) and joins them on their common city attributes.

The generic JOIN accepts any two collection instances and joins items in the collections based on the result of evaluating a user-defined expression. The expression must return true for every pair of items that should be included in the JOIN result, and return false for every pair of items that should not be included in the JOIN result.

Both JOIN operators employ an unusual way of representing their results. Rather than emitting a collection or Stream of a new tuple type whose attributes are the union of the attributes in both operands, it emits a Stream of PairS, where each Pair consists of a left item and a right item. The left item is from the left operand; the right item is the matching item from the right operand.

That's a straightforward theta join: Cartesian Product followed by compare operation as defined by Codd. Now you rely on the caller to do the final step of projecting the pair of tuples onto a single tuple, to remove excess attributes and duplicates.

Yes, it's a straightforward theta join and indeed, I suppose the caller might want a flat, de-duplicated result. I could probably provide a "flatten" method to make that easier.

But alternatively, they can almost certainly use the result as is, per the usual idiomatic approaches to using Java Streams and friends.

This will behave really badly in cases where the final cardinality is low. Say you have 1 thousand tuples each side, typical joins will return anywhere from a few thousand tuples down to a just few or none, but you have to generate 1 million pairs to get there.

It will also perform poorly if the end result it a semijoin or antijoin, but you can fix that more easily.

This approach requires no new type to be defined to represent the JOIN result and integrates nicely with existing Java Streams capability.

Although the JOIN shown here plus Streams is not an implementation of the relational model per se, it provides equivalent facility.

The theta join is OK, it's the lack of the final project that is non-RA.

Notably, it requires no reflection, is fully statically type-safe, and the performance of the generic JOIN is very close to that of the hard-wired equivalent.

I'm sure I could come up with sample data that would destroy that observation. Try 1000 suppliers and 1000 parts, with exactly one of each with CITY="Sydney". Rows processed: 1,000,000. Rows in final result: 1.

Per lines 51 and 79 of my source code, the JOIN strategy is avowedly suboptimal, but performance is not significantly affected by choice of hard-wired vs generic JOIN because they're algorithmically equivalent and both statically compiled.

The generic JOIN uses no run-time reflection.

Quote from dandl on May 16, 2020, 2:34 am

Quote from Dave Voorhis on May 15, 2020, 8:47 pm

Partly inspired by dandl's recent TTM-in-C# efforts, and partly by my own work-in-progress "Java for Data" and datasheet efforts, I've created two variants on a JOIN operator in Java. One is a useful generic JOIN, the other is a hard-wired JOIN for comparison purposes.

The hard-wired JOIN only accepts the canonical S and P example relations (implemented using Java's generic HashSet) and joins them on their common city attributes.

The generic JOIN accepts any two collection instances and joins items in the collections based on the result of evaluating a user-defined expression. The expression must return true for every pair of items that should be included in the JOIN result, and return false for every pair of items that should not be included in the JOIN result.

Both JOIN operators employ an unusual way of representing their results. Rather than emitting a collection or Stream of a new tuple type whose attributes are the union of the attributes in both operands, it emits a Stream of PairS, where each Pair consists of a left item and a right item. The left item is from the left operand; the right item is the matching item from the right operand.

That's a straightforward theta join: Cartesian Product followed by compare operation as defined by Codd. Now you rely on the caller to do the final step of projecting the pair of tuples onto a single tuple, to remove excess attributes and duplicates.

Yes, it's a straightforward theta join and indeed, I suppose the caller might want a flat, de-duplicated result. I could probably provide a "flatten" method to make that easier.

But alternatively, they can almost certainly use the result as is, per the usual idiomatic approaches to using Java Streams and friends.

This will behave really badly in cases where the final cardinality is low. Say you have 1 thousand tuples each side, typical joins will return anywhere from a few thousand tuples down to a just few or none, but you have to generate 1 million pairs to get there.

It will also perform poorly if the end result it a semijoin or antijoin, but you can fix that more easily.

This approach requires no new type to be defined to represent the JOIN result and integrates nicely with existing Java Streams capability.

Although the JOIN shown here plus Streams is not an implementation of the relational model per se, it provides equivalent facility.

The theta join is OK, it's the lack of the final project that is non-RA.

Notably, it requires no reflection, is fully statically type-safe, and the performance of the generic JOIN is very close to that of the hard-wired equivalent.

I'm sure I could come up with sample data that would destroy that observation. Try 1000 suppliers and 1000 parts, with exactly one of each with CITY="Sydney". Rows processed: 1,000,000. Rows in final result: 1.

Per lines 51 and 79 of my source code, the JOIN strategy is avowedly suboptimal, but performance is not significantly affected by choice of hard-wired vs generic JOIN because they're algorithmically equivalent and both statically compiled.

The generic JOIN uses no run-time reflection.

I'm the forum administrator and lead developer of Rel. Email me at dave@armchair.mb.ca with the Subject 'TTM Forum'. Download Rel from https://reldb.org

#12 · May 16, 2020, 8:50 am

Quote from Darren Duncan on May 16, 2020, 4:04 am

Having implemented generic relational join myself, unless I'm missing something, your implementation is missing a simple but important key detail if you want it to be performant.

You need to explicitly index on the attributes being joined, and not the tuples as a whole, but where those are one and the same.

How I implemented this is that in principle rather than using a HashSet I used a Dictionary, where the values are the whole tuple and the keys are a projection on the attributes being joined.

When you do this, the common equal-join is just order of N rather than order of N-squared, because it can iterate over the smaller cardinality argument and for each one do an order of one lookup at the index for the other argument (building the index is also order of N), so say you could join a thousand to a thousand or whatever in reasonable time, barring when you are explicitly doing a cartesian product.

The above assumes a hash index. I never bothered with the complexity of a tree or something that would support doing certain other kinds of theta join quickly.

Most of us do it that way. For semi/antijoin you only need a Set, not a Dict. But either way it's O(aM+bN) , where the a and b factors can be quite different. If one already has a usable index it's O(N) on the other one. And for theta/Cartesian it's generally O(M*N), with a number of special cases.

Andl - A New Database Language - andl.org

#13 · May 16, 2020, 8:52 am

Quote from dandl on May 16, 2020, 8:52 am

Quote from Dave Voorhis on May 16, 2020, 8:45 am

Quote from dandl on May 16, 2020, 2:34 am

Quote from Dave Voorhis on May 15, 2020, 8:47 pm

Partly inspired by dandl's recent TTM-in-C# efforts, and partly by my own work-in-progress "Java for Data" and datasheet efforts, I've created two variants on a JOIN operator in Java. One is a useful generic JOIN, the other is a hard-wired JOIN for comparison purposes.

The hard-wired JOIN only accepts the canonical S and P example relations (implemented using Java's generic HashSet) and joins them on their common city attributes.

The generic JOIN accepts any two collection instances and joins items in the collections based on the result of evaluating a user-defined expression. The expression must return true for every pair of items that should be included in the JOIN result, and return false for every pair of items that should not be included in the JOIN result.

Both JOIN operators employ an unusual way of representing their results. Rather than emitting a collection or Stream of a new tuple type whose attributes are the union of the attributes in both operands, it emits a Stream of PairS, where each Pair consists of a left item and a right item. The left item is from the left operand; the right item is the matching item from the right operand.

That's a straightforward theta join: Cartesian Product followed by compare operation as defined by Codd. Now you rely on the caller to do the final step of projecting the pair of tuples onto a single tuple, to remove excess attributes and duplicates.

Yes, it's a straightforward theta join and indeed, I suppose the caller might want a flat, de-duplicated result. I could probably provide a "flatten" method to make that easier.

But alternatively, they can almost certainly use the result as is, per the usual idiomatic approaches to using Java Streams and friends.

This will behave really badly in cases where the final cardinality is low. Say you have 1 thousand tuples each side, typical joins will return anywhere from a few thousand tuples down to a just few or none, but you have to generate 1 million pairs to get there.

It will also perform poorly if the end result it a semijoin or antijoin, but you can fix that more easily.

This approach requires no new type to be defined to represent the JOIN result and integrates nicely with existing Java Streams capability.

Although the JOIN shown here plus Streams is not an implementation of the relational model per se, it provides equivalent facility.

The theta join is OK, it's the lack of the final project that is non-RA.

Notably, it requires no reflection, is fully statically type-safe, and the performance of the generic JOIN is very close to that of the hard-wired equivalent.

I'm sure I could come up with sample data that would destroy that observation. Try 1000 suppliers and 1000 parts, with exactly one of each with CITY="Sydney". Rows processed: 1,000,000. Rows in final result: 1.

Per lines 51 and 79 of my source code, the JOIN strategy is avowedly suboptimal, but performance is not significantly affected by choice of hard-wired vs generic JOIN because they're algorithmically equivalent and both statically compiled.

I agree. Both equally bad (and seriously so).

The generic JOIN uses no run-time reflection.

Quote from Dave Voorhis on May 16, 2020, 8:45 am

Quote from dandl on May 16, 2020, 2:34 am

Quote from Dave Voorhis on May 15, 2020, 8:47 pm

Partly inspired by dandl's recent TTM-in-C# efforts, and partly by my own work-in-progress "Java for Data" and datasheet efforts, I've created two variants on a JOIN operator in Java. One is a useful generic JOIN, the other is a hard-wired JOIN for comparison purposes.

The hard-wired JOIN only accepts the canonical S and P example relations (implemented using Java's generic HashSet) and joins them on their common city attributes.

The generic JOIN accepts any two collection instances and joins items in the collections based on the result of evaluating a user-defined expression. The expression must return true for every pair of items that should be included in the JOIN result, and return false for every pair of items that should not be included in the JOIN result.

Both JOIN operators employ an unusual way of representing their results. Rather than emitting a collection or Stream of a new tuple type whose attributes are the union of the attributes in both operands, it emits a Stream of PairS, where each Pair consists of a left item and a right item. The left item is from the left operand; the right item is the matching item from the right operand.

That's a straightforward theta join: Cartesian Product followed by compare operation as defined by Codd. Now you rely on the caller to do the final step of projecting the pair of tuples onto a single tuple, to remove excess attributes and duplicates.

Yes, it's a straightforward theta join and indeed, I suppose the caller might want a flat, de-duplicated result. I could probably provide a "flatten" method to make that easier.

But alternatively, they can almost certainly use the result as is, per the usual idiomatic approaches to using Java Streams and friends.

This will behave really badly in cases where the final cardinality is low. Say you have 1 thousand tuples each side, typical joins will return anywhere from a few thousand tuples down to a just few or none, but you have to generate 1 million pairs to get there.

It will also perform poorly if the end result it a semijoin or antijoin, but you can fix that more easily.

This approach requires no new type to be defined to represent the JOIN result and integrates nicely with existing Java Streams capability.

Although the JOIN shown here plus Streams is not an implementation of the relational model per se, it provides equivalent facility.

The theta join is OK, it's the lack of the final project that is non-RA.

Notably, it requires no reflection, is fully statically type-safe, and the performance of the generic JOIN is very close to that of the hard-wired equivalent.

I'm sure I could come up with sample data that would destroy that observation. Try 1000 suppliers and 1000 parts, with exactly one of each with CITY="Sydney". Rows processed: 1,000,000. Rows in final result: 1.

Per lines 51 and 79 of my source code, the JOIN strategy is avowedly suboptimal, but performance is not significantly affected by choice of hard-wired vs generic JOIN because they're algorithmically equivalent and both statically compiled.

I agree. Both equally bad (and seriously so).

The generic JOIN uses no run-time reflection.

Andl - A New Database Language - andl.org

#14 · May 16, 2020, 9:02 am

Really this needs an implementation of RM VSS 6 generic relational operators.

Which is a really good argument in favour of 'C# as D', where that comes as part of the deal. Of if you know something helpful about your own problem, feel free to adapt/extend. The one enormous advantage over TD/Rel and similar is that there is nothing between you and the library source code, just waiting for you to extend or improve it.

Andl - A New Database Language - andl.org

#15 · May 16, 2020, 9:10 am

Quote from dandl on May 16, 2020, 9:02 am

Really this needs an implementation of RM VSS 6 generic relational operators.

Which is a really good argument in favour of 'C# as D', where that comes as part of the deal. Of if you know something helpful about your own problem, feel free to adapt/extend. The one enormous advantage over TD/Rel and similar is that there is nothing between you and the library source code, just waiting for you to extend or improve it.

Of course, that comes with its own set of tradeoffs. It presumes, for example, that you like C# (or Java or whatever.) If you consider reference semantics or static-class-based OO or whatever to be abominable, using C# (or Java or whatever) at all, let alone extending the library source code is -- at best -- an unpleasant chore rather than a programming pleasure.

An obvious -- and worthy -- justification for a new D is to get a better programming language in general, and not just to get an integrated implementation of the relational model in a language you consider inadequate for reasons unrelated to the relational model.

I'm the forum administrator and lead developer of Rel. Email me at dave@armchair.mb.ca with the Subject 'TTM Forum'. Download Rel from https://reldb.org

#16 · May 16, 2020, 10:50 am

Quote from dandl on May 16, 2020, 10:50 am

Quote from Dave Voorhis on May 16, 2020, 9:10 am

Quote from dandl on May 16, 2020, 9:02 am

Really this needs an implementation of RM VSS 6 generic relational operators.

Which is a really good argument in favour of 'C# as D', where that comes as part of the deal. Of if you know something helpful about your own problem, feel free to adapt/extend. The one enormous advantage over TD/Rel and similar is that there is nothing between you and the library source code, just waiting for you to extend or improve it.

Of course, that comes with its own set of tradeoffs. It presumes, for example, that you like C# (or Java or whatever.) If you consider reference semantics or static-class-based OO or whatever to be abominable, using C# (or Java or whatever) at all, let alone extending the library source code is -- at best -- an unpleasant chore rather than a programming pleasure.

An obvious -- and worthy -- justification for a new D is to get a better programming language in general, and not just to get an integrated implementation of the relational model in a language you consider inadequate for reasons unrelated to the relational model.

Quote from Dave Voorhis on May 16, 2020, 9:10 am

Quote from dandl on May 16, 2020, 9:02 am

Really this needs an implementation of RM VSS 6 generic relational operators.

Which is a really good argument in favour of 'C# as D', where that comes as part of the deal. Of if you know something helpful about your own problem, feel free to adapt/extend. The one enormous advantage over TD/Rel and similar is that there is nothing between you and the library source code, just waiting for you to extend or improve it.

Of course, that comes with its own set of tradeoffs. It presumes, for example, that you like C# (or Java or whatever.) If you consider reference semantics or static-class-based OO or whatever to be abominable, using C# (or Java or whatever) at all, let alone extending the library source code is -- at best -- an unpleasant chore rather than a programming pleasure.

First, yes, I was not singling out C#. My proposition is that you can add all the Pre features of TTM/D to any modern strongly typed programming language, and I think I've shown that is largely true. The only real challenges are (a) value types and (b) the tuple type in particular, and they're well within grasp.

So I'm proceeding on the basis that

users have picked out their favourite programming language on some other grounds

that language will be one that already sits somewhere in the upper rankings of the TIOBE list

if that language is strongly typed, it can form the basis for a D fully compliant with TTM.

An obvious -- and worthy -- justification for a new D is to get a better programming language in general, and not just to get an integrated implementation of the relational model in a language you consider inadequate for reasons unrelated to the relational model.

I don't presume to divine user preferences. To me it's perfectly obvious that C# is better than Java, but while popular in its own right it gets used far less. People make choices for all kinds of reasons, and studying TTM is probably not one of them. But reassuringly, whatever language they choose, you can be sure that if it's at least as good as those two, it can be a D.

Quote from Dave Voorhis on May 16, 2020, 9:10 am

Quote from dandl on May 16, 2020, 9:02 am

Really this needs an implementation of RM VSS 6 generic relational operators.

Which is a really good argument in favour of 'C# as D', where that comes as part of the deal. Of if you know something helpful about your own problem, feel free to adapt/extend. The one enormous advantage over TD/Rel and similar is that there is nothing between you and the library source code, just waiting for you to extend or improve it.

Of course, that comes with its own set of tradeoffs. It presumes, for example, that you like C# (or Java or whatever.) If you consider reference semantics or static-class-based OO or whatever to be abominable, using C# (or Java or whatever) at all, let alone extending the library source code is -- at best -- an unpleasant chore rather than a programming pleasure.

An obvious -- and worthy -- justification for a new D is to get a better programming language in general, and not just to get an integrated implementation of the relational model in a language you consider inadequate for reasons unrelated to the relational model.

Quote from Dave Voorhis on May 16, 2020, 9:10 am

Quote from dandl on May 16, 2020, 9:02 am

Really this needs an implementation of RM VSS 6 generic relational operators.

Which is a really good argument in favour of 'C# as D', where that comes as part of the deal. Of if you know something helpful about your own problem, feel free to adapt/extend. The one enormous advantage over TD/Rel and similar is that there is nothing between you and the library source code, just waiting for you to extend or improve it.

Of course, that comes with its own set of tradeoffs. It presumes, for example, that you like C# (or Java or whatever.) If you consider reference semantics or static-class-based OO or whatever to be abominable, using C# (or Java or whatever) at all, let alone extending the library source code is -- at best -- an unpleasant chore rather than a programming pleasure.

First, yes, I was not singling out C#. My proposition is that you can add all the Pre features of TTM/D to any modern strongly typed programming language, and I think I've shown that is largely true. The only real challenges are (a) value types and (b) the tuple type in particular, and they're well within grasp.

So I'm proceeding on the basis that

users have picked out their favourite programming language on some other grounds
that language will be one that already sits somewhere in the upper rankings of the TIOBE list
if that language is strongly typed, it can form the basis for a D fully compliant with TTM.

An obvious -- and worthy -- justification for a new D is to get a better programming language in general, and not just to get an integrated implementation of the relational model in a language you consider inadequate for reasons unrelated to the relational model.

I don't presume to divine user preferences. To me it's perfectly obvious that C# is better than Java, but while popular in its own right it gets used far less. People make choices for all kinds of reasons, and studying TTM is probably not one of them. But reassuringly, whatever language they choose, you can be sure that if it's at least as good as those two, it can be a D.

Andl - A New Database Language - andl.org

#17 · May 16, 2020, 11:09 am

Quote from dandl on May 16, 2020, 10:50 am

... To me it's perfectly obvious that C# is better than Java ...

Funny, that. Having written a lot of lines in both, I have to say that for 99.99999% of purposes, I can't tell them apart.

For the remaining 0.00001% of cases, where C# beats Java linguistically, Java beats C# in cross platform support and 3rd party library coverage.

Tie game.

... but while popular in its own right it gets used far less. People make choices for all kinds of reasons, and studying TTM is probably not one of them. But reassuringly, whatever language they choose, you can be sure that if it's at least as good as those two, it can be a D.

I think this highlights an interesting question: is the benefit of a D just the integrated relational model (and persistence?), or that it's a new language with certain semantics that just happens to integrate the relational model (and persistence?)?

I'm the forum administrator and lead developer of Rel. Email me at dave@armchair.mb.ca with the Subject 'TTM Forum'. Download Rel from https://reldb.org

#18 · May 16, 2020, 2:41 pm

Quote from Dave Voorhis on May 16, 2020, 11:09 am

Quote from dandl on May 16, 2020, 10:50 am

... To me it's perfectly obvious that C# is better than Java ...

Funny, that. Having written a lot of lines in both, I have to say that for 99.99999% of purposes, I can't tell them apart.

Then you probably never used struct, var, properties, delegates, operator overloading, implicit conversions, nullable, anonymous types, Tuple<>, Func<>, goto, conditional compilation, field layouts (and unions), Interop or (unsafe) pointers. But OTOH Java has useful enums, which C# lacks.

For the remaining 0.00001% of cases, where C# beats Java linguistically, Java beats C# in cross platform support and 3rd party library coverage.

Tie game.

Not really. Strong home game advantage, almost never a tie.

... but while popular in its own right it gets used far less. People make choices for all kinds of reasons, and studying TTM is probably not one of them. But reassuringly, whatever language they choose, you can be sure that if it's at least as good as those two, it can be a D.

I think this highlights an interesting question: is the benefit of a D just the integrated relational model (and persistence?), or that it's a new language with certain semantics that just happens to integrate the relational model (and persistence?)?

I don't think TTM offers any interesting or useful semantics on which to base the design or choice of a language. What it does offer is a fairly pedestrian value-based scalar type system, and an unusual but useful non-scalar type system. Based on this and TD, it would be fair to conclude that the authors of D had almost no clue about language design; they didn't even use the best available ideas of that time (1990s).

Nor did they pretend to. They left us to design (or choose) a language for other reasons, and simply prescribed that it should include a set of features. Which is exactly what I set out to do.

Andl - A New Database Language - andl.org

The Forum for Discussion about The Third Manifesto and Related Matters