Special cases for JOIN operations

#1 · June 16, 2020, 8:00 am

I've been able to identify the following special cases for relational operations, which should be reflected in implementation decisions for any TTM implementation. In each of these cases there is a different algorithm that may be faster than the generic.

Monadic or dyadic function with empty argument = empty or identity (depends on operator)
Dyadic function with one or both empty heading = empty or identity (depends on operator)
Join with disjoint headings = Cartesian product
Join with equal/superset/subset heading = Semijoin
Semijoin with disjoint heading = Empty
Semijoin with equal heading = Intersection
Antijoin with disjoint heading = Identity
Antijoin with equal heading = Symmetric difference

I would be interested to know if anyone (a) disagrees or (b) can add to the list or (c) can point to a nice little table of all the combinations.

Edit: Join was wrong.

Andl - A New Database Language - andl.org

#2 · June 16, 2020, 1:23 pm

Only three wrong ...

Perhaps give it a second try and perhaps ponder the possibility that you have very little reason for bragging with "special cases you have been able to identify" ...

No wait, FOUR wrong.

Author of SIRA_PRISE

#3 · June 20, 2020, 2:07 am

So does anyone have anything useful to say on the subject? Surely someone has given this some thought?

Andl - A New Database Language - andl.org

#4 · June 20, 2020, 4:34 am

Quote from dandl on June 20, 2020, 2:07 am

So does anyone have anything useful to say on the subject? Surely someone has given this some thought?

I thought Erwin's reply was very useful.

It'd be a good question to put on StackOverflow; tag it as relational-algebra so that Philip Kelly will pick it up.

Q1: what do you mean by 'special case'? What's special (for example) about Semijoin or Antijoin: the heading of their result is the same as heading of the left operand; why would the implementation be anything other than a filter over the left operand? -- oh, because of your 4 Semi/Antijoin cases, 3 you've given the wrong semantics.

Q2: why do you expect anybody to reply to a list of which nearly half are wrong -- other than to tell you they're wrong and leave it as an exercise for a newbie to correct it.

Q3: some of those cases are likely to result from specific surface language constructs (like WHERE), why not just implement those language constructs as a specific algorithm? (I believe Rel does so.)

So in terms of 'give this some thought': you start (which is what philipxy would tell you to do), amend the list to give the right semantics.

#5 · June 20, 2020, 5:44 am

Quote from dandl on June 20, 2020, 5:44 am

Quote from AntC on June 20, 2020, 4:34 am

Quote from dandl on June 20, 2020, 2:07 am

So does anyone have anything useful to say on the subject? Surely someone has given this some thought?

I thought Erwin's reply was very useful.

For some definition of 'useful', such as 'go away and don't bother me'.

It'd be a good question to put on StackOverflow; tag it as relational-algebra so that Philip Kelly will pick it up.

I don't think so. Do you?

Q1: what do you mean by 'special case'? What's special (for example) about Semijoin or Antijoin: the heading of their result is the same as heading of the left operand; why would the implementation be anything other than a filter over the left operand? -- oh, because of your 4 Semi/Antijoin cases, 3 you've given the wrong semantics.

It should have been clear: a case detectable at runtime if not earlier that justifies a different algorithm on performance grounds.

Join on disjoint headers: the cost of building an index is avoided

Semijoin on disjoint headers: nothing matches, the output is empty.

Semijoin on disjoint headers requires building a Set on the right argument and filtering the left. That's one pass through each file and enough memory allocation to hold the right file (or some clever caching strategy). If the headers are the same the operation is symmetric. If the left file is smaller, the available strategy is to reverse the inputs. If the left input is very small or already indexed, the saving is significant.

Q2: why do you expect anybody to reply to a list of which nearly half are wrong -- other than to tell you they're wrong and leave it as an exercise for a newbie to correct it.

So which do you now say are wrong and why? Or do you disagree with my semijoin strategy, and if so why?

Q3: some of those cases are likely to result from specific surface language constructs (like WHERE), why not just implement those language constructs as a specific algorithm? (I believe Rel does so.)

I'm only interested here in the cases where that strategy is not available.

So in terms of 'give this some thought': you start (which is what philipxy would tell you to do), amend the list to give the right semantics.

Quote from AntC on June 20, 2020, 4:34 am

Quote from dandl on June 20, 2020, 2:07 am

So does anyone have anything useful to say on the subject? Surely someone has given this some thought?

I thought Erwin's reply was very useful.

For some definition of 'useful', such as 'go away and don't bother me'.

It'd be a good question to put on StackOverflow; tag it as relational-algebra so that Philip Kelly will pick it up.

I don't think so. Do you?

Q1: what do you mean by 'special case'? What's special (for example) about Semijoin or Antijoin: the heading of their result is the same as heading of the left operand; why would the implementation be anything other than a filter over the left operand? -- oh, because of your 4 Semi/Antijoin cases, 3 you've given the wrong semantics.

It should have been clear: a case detectable at runtime if not earlier that justifies a different algorithm on performance grounds.

Join on disjoint headers: the cost of building an index is avoided

Semijoin on disjoint headers: nothing matches, the output is empty.

Semijoin on disjoint headers requires building a Set on the right argument and filtering the left. That's one pass through each file and enough memory allocation to hold the right file (or some clever caching strategy). If the headers are the same the operation is symmetric. If the left file is smaller, the available strategy is to reverse the inputs. If the left input is very small or already indexed, the saving is significant.

Q2: why do you expect anybody to reply to a list of which nearly half are wrong -- other than to tell you they're wrong and leave it as an exercise for a newbie to correct it.

So which do you now say are wrong and why? Or do you disagree with my semijoin strategy, and if so why?

Q3: some of those cases are likely to result from specific surface language constructs (like WHERE), why not just implement those language constructs as a specific algorithm? (I believe Rel does so.)

I'm only interested here in the cases where that strategy is not available.

So in terms of 'give this some thought': you start (which is what philipxy would tell you to do), amend the list to give the right semantics.

Andl - A New Database Language - andl.org

#6 · June 20, 2020, 9:18 am

Quote from AntC on June 20, 2020, 9:18 am
Go away and correct the errors in your O.P. before bothering this list any more. Your strategies below are wrong because your semantics is wrong.

IIRC the moderator has requested you listen before posting. Then listen (for the third time).

Quote from dandl on June 20, 2020, 5:44 am

Quote from AntC on June 20, 2020, 4:34 am

Quote from dandl on June 20, 2020, 2:07 am

So does anyone have anything useful to say on the subject? Surely someone has given this some thought?

I thought Erwin's reply was very useful.

For some definition of 'useful', such as 'go away and don't bother me'.

It'd be a good question to put on StackOverflow; tag it as relational-algebra so that Philip Kelly will pick it up.

I don't think so. Do you?

Q1: what do you mean by 'special case'? What's special (for example) about Semijoin or Antijoin: the heading of their result is the same as heading of the left operand; why would the implementation be anything other than a filter over the left operand? -- oh, because of your 4 Semi/Antijoin cases, 3 you've given the wrong semantics.

It should have been clear: a case detectable at runtime if not earlier that justifies a different algorithm on performance grounds.

Join on disjoint headers: the cost of building an index is avoided

Semijoin on disjoint headers: nothing matches, the output is empty.

Semijoin on disjoint headers requires building a Set on the right argument and filtering the left. That's one pass through each file and enough memory allocation to hold the right file (or some clever caching strategy). If the headers are the same the operation is symmetric. If the left file is smaller, the available strategy is to reverse the inputs. If the left input is very small or already indexed, the saving is significant.

Q2: why do you expect anybody to reply to a list of which nearly half are wrong -- other than to tell you they're wrong and leave it as an exercise for a newbie to correct it.

So which do you now say are wrong and why? Or do you disagree with my semijoin strategy, and if so why?

Q3: some of those cases are likely to result from specific surface language constructs (like WHERE), why not just implement those language constructs as a specific algorithm? (I believe Rel does so.)

I'm only interested here in the cases where that strategy is not available.

So in terms of 'give this some thought': you start (which is what philipxy would tell you to do), amend the list to give the right semantics.

Go away and correct the errors in your O.P. before bothering this list any more. Your strategies below are wrong because your semantics is wrong.

IIRC the moderator has requested you listen before posting. Then listen (for the third time).

Quote from dandl on June 20, 2020, 5:44 am

Quote from AntC on June 20, 2020, 4:34 am

Quote from dandl on June 20, 2020, 2:07 am

So does anyone have anything useful to say on the subject? Surely someone has given this some thought?

I thought Erwin's reply was very useful.

For some definition of 'useful', such as 'go away and don't bother me'.

It'd be a good question to put on StackOverflow; tag it as relational-algebra so that Philip Kelly will pick it up.

I don't think so. Do you?

Q1: what do you mean by 'special case'? What's special (for example) about Semijoin or Antijoin: the heading of their result is the same as heading of the left operand; why would the implementation be anything other than a filter over the left operand? -- oh, because of your 4 Semi/Antijoin cases, 3 you've given the wrong semantics.

It should have been clear: a case detectable at runtime if not earlier that justifies a different algorithm on performance grounds.

Join on disjoint headers: the cost of building an index is avoided

Semijoin on disjoint headers: nothing matches, the output is empty.

Semijoin on disjoint headers requires building a Set on the right argument and filtering the left. That's one pass through each file and enough memory allocation to hold the right file (or some clever caching strategy). If the headers are the same the operation is symmetric. If the left file is smaller, the available strategy is to reverse the inputs. If the left input is very small or already indexed, the saving is significant.

Q2: why do you expect anybody to reply to a list of which nearly half are wrong -- other than to tell you they're wrong and leave it as an exercise for a newbie to correct it.

So which do you now say are wrong and why? Or do you disagree with my semijoin strategy, and if so why?

Q3: some of those cases are likely to result from specific surface language constructs (like WHERE), why not just implement those language constructs as a specific algorithm? (I believe Rel does so.)

I'm only interested here in the cases where that strategy is not available.

So in terms of 'give this some thought': you start (which is what philipxy would tell you to do), amend the list to give the right semantics.

#7 · June 20, 2020, 11:02 am

Quote from dandl on June 20, 2020, 2:07 am

So does anyone have anything useful to say on the subject? Surely someone has given this some thought?

I'm not sure what you're looking for here. General optimisations should be obvious, specific optimisations will be implementation-dependent.

I'm the forum administrator and lead developer of Rel. Email me at dave@armchair.mb.ca with the Subject 'TTM Forum'. Download Rel from https://reldb.org

The Forum for Discussion about The Third Manifesto and Related Matters

Special cases for JOIN operations