Special cases for JOIN operations
Quote from dandl on June 16, 2020, 8:00 amI've been able to identify the following special cases for relational operations, which should be reflected in implementation decisions for any TTM implementation. In each of these cases there is a different algorithm that may be faster than the generic.
- Monadic or dyadic function with empty argument = empty or identity (depends on operator)
- Dyadic function with one or both empty heading = empty or identity (depends on operator)
- Join with disjoint headings = Cartesian product
- Join with equal/superset/subset heading = Semijoin
- Semijoin with disjoint heading = Empty
- Semijoin with equal heading = Intersection
- Antijoin with disjoint heading = Identity
- Antijoin with equal heading = Symmetric difference
I would be interested to know if anyone (a) disagrees or (b) can add to the list or (c) can point to a nice little table of all the combinations.
Edit: Join was wrong.
I've been able to identify the following special cases for relational operations, which should be reflected in implementation decisions for any TTM implementation. In each of these cases there is a different algorithm that may be faster than the generic.
- Monadic or dyadic function with empty argument = empty or identity (depends on operator)
- Dyadic function with one or both empty heading = empty or identity (depends on operator)
- Join with disjoint headings = Cartesian product
- Join with equal/superset/subset heading = Semijoin
- Semijoin with disjoint heading = Empty
- Semijoin with equal heading = Intersection
- Antijoin with disjoint heading = Identity
- Antijoin with equal heading = Symmetric difference
I would be interested to know if anyone (a) disagrees or (b) can add to the list or (c) can point to a nice little table of all the combinations.
Edit: Join was wrong.
Quote from Erwin on June 16, 2020, 1:23 pmOnly three wrong ...
Perhaps give it a second try and perhaps ponder the possibility that you have very little reason for bragging with "special cases you have been able to identify" ...
No wait, FOUR wrong.
Only three wrong ...
Perhaps give it a second try and perhaps ponder the possibility that you have very little reason for bragging with "special cases you have been able to identify" ...
No wait, FOUR wrong.
Quote from dandl on June 20, 2020, 2:07 amSo does anyone have anything useful to say on the subject? Surely someone has given this some thought?
So does anyone have anything useful to say on the subject? Surely someone has given this some thought?
Quote from AntC on June 20, 2020, 4:34 amQuote from dandl on June 20, 2020, 2:07 amSo does anyone have anything useful to say on the subject? Surely someone has given this some thought?
I thought Erwin's reply was very useful.
It'd be a good question to put on StackOverflow; tag it as
relational-algebra
so that Philip Kelly will pick it up.Q1: what do you mean by 'special case'? What's special (for example) about
Semijoin
orAntijoin
: the heading of their result is the same as heading of the left operand; why would the implementation be anything other than a filter over the left operand? -- oh, because of your 4Semi/Antijoin
cases, 3 you've given the wrong semantics.Q2: why do you expect anybody to reply to a list of which nearly half are wrong -- other than to tell you they're wrong and leave it as an exercise for a newbie to correct it.
Q3: some of those cases are likely to result from specific surface language constructs (like
WHERE
), why not just implement those language constructs as a specific algorithm? (I believe Rel does so.)So in terms of 'give this some thought': you start (which is what
philipxy
would tell you to do), amend the list to give the right semantics.
Quote from dandl on June 20, 2020, 2:07 amSo does anyone have anything useful to say on the subject? Surely someone has given this some thought?
I thought Erwin's reply was very useful.
It'd be a good question to put on StackOverflow; tag it as relational-algebra
so that Philip Kelly will pick it up.
Q1: what do you mean by 'special case'? What's special (for example) about Semijoin
or Antijoin
: the heading of their result is the same as heading of the left operand; why would the implementation be anything other than a filter over the left operand? -- oh, because of your 4 Semi/Antijoin
cases, 3 you've given the wrong semantics.
Q2: why do you expect anybody to reply to a list of which nearly half are wrong -- other than to tell you they're wrong and leave it as an exercise for a newbie to correct it.
Q3: some of those cases are likely to result from specific surface language constructs (like WHERE
), why not just implement those language constructs as a specific algorithm? (I believe Rel does so.)
So in terms of 'give this some thought': you start (which is what philipxy
would tell you to do), amend the list to give the right semantics.
Quote from dandl on June 20, 2020, 5:44 amQuote from AntC on June 20, 2020, 4:34 amQuote from dandl on June 20, 2020, 2:07 amSo does anyone have anything useful to say on the subject? Surely someone has given this some thought?
I thought Erwin's reply was very useful.
For some definition of 'useful', such as 'go away and don't bother me'.
It'd be a good question to put on StackOverflow; tag it as
relational-algebra
so that Philip Kelly will pick it up.I don't think so. Do you?
Q1: what do you mean by 'special case'? What's special (for example) about
Semijoin
orAntijoin
: the heading of their result is the same as heading of the left operand; why would the implementation be anything other than a filter over the left operand? -- oh, because of your 4Semi/Antijoin
cases, 3 you've given the wrong semantics.It should have been clear: a case detectable at runtime if not earlier that justifies a different algorithm on performance grounds.
Join on disjoint headers: the cost of building an index is avoided
Semijoin on disjoint headers: nothing matches, the output is empty.
Semijoin on disjoint headers requires building a Set on the right argument and filtering the left. That's one pass through each file and enough memory allocation to hold the right file (or some clever caching strategy). If the headers are the same the operation is symmetric. If the left file is smaller, the available strategy is to reverse the inputs. If the left input is very small or already indexed, the saving is significant.
Q2: why do you expect anybody to reply to a list of which nearly half are wrong -- other than to tell you they're wrong and leave it as an exercise for a newbie to correct it.
So which do you now say are wrong and why? Or do you disagree with my semijoin strategy, and if so why?
Q3: some of those cases are likely to result from specific surface language constructs (like
WHERE
), why not just implement those language constructs as a specific algorithm? (I believe Rel does so.)I'm only interested here in the cases where that strategy is not available.
So in terms of 'give this some thought': you start (which is what
philipxy
would tell you to do), amend the list to give the right semantics.
Quote from AntC on June 20, 2020, 4:34 amQuote from dandl on June 20, 2020, 2:07 amSo does anyone have anything useful to say on the subject? Surely someone has given this some thought?
I thought Erwin's reply was very useful.
For some definition of 'useful', such as 'go away and don't bother me'.
It'd be a good question to put on StackOverflow; tag it as
relational-algebra
so that Philip Kelly will pick it up.
I don't think so. Do you?
Q1: what do you mean by 'special case'? What's special (for example) about
Semijoin
orAntijoin
: the heading of their result is the same as heading of the left operand; why would the implementation be anything other than a filter over the left operand? -- oh, because of your 4Semi/Antijoin
cases, 3 you've given the wrong semantics.
It should have been clear: a case detectable at runtime if not earlier that justifies a different algorithm on performance grounds.
Join on disjoint headers: the cost of building an index is avoided
Semijoin on disjoint headers: nothing matches, the output is empty.
Semijoin on disjoint headers requires building a Set on the right argument and filtering the left. That's one pass through each file and enough memory allocation to hold the right file (or some clever caching strategy). If the headers are the same the operation is symmetric. If the left file is smaller, the available strategy is to reverse the inputs. If the left input is very small or already indexed, the saving is significant.
Q2: why do you expect anybody to reply to a list of which nearly half are wrong -- other than to tell you they're wrong and leave it as an exercise for a newbie to correct it.
So which do you now say are wrong and why? Or do you disagree with my semijoin strategy, and if so why?
Q3: some of those cases are likely to result from specific surface language constructs (like
WHERE
), why not just implement those language constructs as a specific algorithm? (I believe Rel does so.)
I'm only interested here in the cases where that strategy is not available.
So in terms of 'give this some thought': you start (which is what
philipxy
would tell you to do), amend the list to give the right semantics.
Quote from AntC on June 20, 2020, 9:18 amGo away and correct the errors in your O.P. before bothering this list any more. Your strategies below are wrong because your semantics is wrong.IIRC the moderator has requested you listen before posting. Then listen (for the third time).Quote from dandl on June 20, 2020, 5:44 amQuote from AntC on June 20, 2020, 4:34 amQuote from dandl on June 20, 2020, 2:07 amSo does anyone have anything useful to say on the subject? Surely someone has given this some thought?
I thought Erwin's reply was very useful.
For some definition of 'useful', such as 'go away and don't bother me'.
It'd be a good question to put on StackOverflow; tag it as
relational-algebra
so that Philip Kelly will pick it up.I don't think so. Do you?
Q1: what do you mean by 'special case'? What's special (for example) about
Semijoin
orAntijoin
: the heading of their result is the same as heading of the left operand; why would the implementation be anything other than a filter over the left operand? -- oh, because of your 4Semi/Antijoin
cases, 3 you've given the wrong semantics.It should have been clear: a case detectable at runtime if not earlier that justifies a different algorithm on performance grounds.
Join on disjoint headers: the cost of building an index is avoided
Semijoin on disjoint headers: nothing matches, the output is empty.
Semijoin on disjoint headers requires building a Set on the right argument and filtering the left. That's one pass through each file and enough memory allocation to hold the right file (or some clever caching strategy). If the headers are the same the operation is symmetric. If the left file is smaller, the available strategy is to reverse the inputs. If the left input is very small or already indexed, the saving is significant.
Q2: why do you expect anybody to reply to a list of which nearly half are wrong -- other than to tell you they're wrong and leave it as an exercise for a newbie to correct it.
So which do you now say are wrong and why? Or do you disagree with my semijoin strategy, and if so why?
Q3: some of those cases are likely to result from specific surface language constructs (like
WHERE
), why not just implement those language constructs as a specific algorithm? (I believe Rel does so.)I'm only interested here in the cases where that strategy is not available.
So in terms of 'give this some thought': you start (which is what
philipxy
would tell you to do), amend the list to give the right semantics.
Quote from dandl on June 20, 2020, 5:44 amQuote from AntC on June 20, 2020, 4:34 amQuote from dandl on June 20, 2020, 2:07 amSo does anyone have anything useful to say on the subject? Surely someone has given this some thought?
I thought Erwin's reply was very useful.
For some definition of 'useful', such as 'go away and don't bother me'.
It'd be a good question to put on StackOverflow; tag it as
relational-algebra
so that Philip Kelly will pick it up.I don't think so. Do you?
Q1: what do you mean by 'special case'? What's special (for example) about
Semijoin
orAntijoin
: the heading of their result is the same as heading of the left operand; why would the implementation be anything other than a filter over the left operand? -- oh, because of your 4Semi/Antijoin
cases, 3 you've given the wrong semantics.It should have been clear: a case detectable at runtime if not earlier that justifies a different algorithm on performance grounds.
Join on disjoint headers: the cost of building an index is avoided
Semijoin on disjoint headers: nothing matches, the output is empty.
Semijoin on disjoint headers requires building a Set on the right argument and filtering the left. That's one pass through each file and enough memory allocation to hold the right file (or some clever caching strategy). If the headers are the same the operation is symmetric. If the left file is smaller, the available strategy is to reverse the inputs. If the left input is very small or already indexed, the saving is significant.
Q2: why do you expect anybody to reply to a list of which nearly half are wrong -- other than to tell you they're wrong and leave it as an exercise for a newbie to correct it.
So which do you now say are wrong and why? Or do you disagree with my semijoin strategy, and if so why?
Q3: some of those cases are likely to result from specific surface language constructs (like
WHERE
), why not just implement those language constructs as a specific algorithm? (I believe Rel does so.)I'm only interested here in the cases where that strategy is not available.
So in terms of 'give this some thought': you start (which is what
philipxy
would tell you to do), amend the list to give the right semantics.
Quote from Dave Voorhis on June 20, 2020, 11:02 amQuote from dandl on June 20, 2020, 2:07 amSo does anyone have anything useful to say on the subject? Surely someone has given this some thought?
I'm not sure what you're looking for here. General optimisations should be obvious, specific optimisations will be implementation-dependent.
Quote from dandl on June 20, 2020, 2:07 amSo does anyone have anything useful to say on the subject? Surely someone has given this some thought?
I'm not sure what you're looking for here. General optimisations should be obvious, specific optimisations will be implementation-dependent.