The Forum for Discussion about The Third Manifesto and Related Matters

Please or Register to create posts and topics.

ANNOUNCE: a relational algebra with only one primitive operator (tentative) (please check/critique)

Page 1 of 12Next

[It'll take a few edits to write this up. Draft for now, I'll remove this message when done.]

Motivation

We'd like an algebra in which we can reason algebraically about the equivalence of queries. For example

  • When is it equivalent to 'push restrict through (natural) Join'? The A algebra goes some way to answer that. It reduces surface-language WHERE STATUS = 10 to <AND> REL{TUP{ STATUS 10}}; then we can reason from the Associativity of <AND>.
  • When is it equivalent to 'push project through Join'? But the A algebra has no answer here: project/REMOVE are not susceptible to equational reasoning, because their attribute-comma-list is not first-class: there's no operators to manipulate lists of attributes, obtain them from relation(al expressions), etc. Only literals are allowed.
  • And yet A at several points wants to manipulate attribute-lists: project in the surface language is to be implemented via REMOVE (of the complement of those attributes, relative to the relation being projected over); COMPOSE "─we have allowed ourselves the luxury (some might think) of including a "macro" operator", what do the scare-quotes indicate? how could any "macro" look at the headings of COMPOSE's operands and infer the attributes in common?
  • (I'm not 'picking on' A specifically; it's just that D&D are being admirably clear about requirements. Codd's algebra suffers the same difficulties. Whatever optimisation/equivalences/manipulation goes on inside SQL engines and their internal 'algebra' is similarly not easy to reason with.)

We could introduce machinery into the algebra to extract the heading of a relation(al expression) to an attribute list, and provide set operators to manipulate the list. But

  • That gives two 'sorts' (in the mathematical sense) of entities to reason about, with a tricky inter-relationship as they shuttle from heading to set operator back to operand of relational operator.
  • We already have relational operators that carry out some set operations on headings: specifically, <AND>, <OR> take the union of headings.
  • So several on this forum have already advocated for using relation values to denote their headings, when appearing as operands in positions that want a heading -- that is, the operator will ignore the relation value's content, and/or the algebra will 'emptify' the relation.

(Empty) Relations denoting headings

Consider Tropashko's 'Inner Union' aka SQL's UNION CORRESPONDING aka ISBL UNION:

  • It projects each operand's body on the attributes in common; then forms the set-union of those projections/bodies.
  • Then we can define r PROJECT s =df r InnerUnion emptify(s).
  • r COMPOSE s =df (r NatJoin s) REMOVE emptify(r InnerUnion s).
  • r PROJECT s =df r REMOVE emptify(r REMOVE s) -- alternative definition.
  • emptify(r) =df r NotMatching r.

Only one primitive operator

  • InnerUnion has some very nice properties for equational reasoning: it is commutative, idempotent, associative -- indeed as is NatJoin. Unfortunately, InnerUnion is not distributive over NatJoin, nor vice versa.
  • But (ANNOUNCE) if we could define InnerUnion in terms of NatJoin (embedding the definition in the machinery of FOPL with equality) then we can reason around equations using both operators.
  • Furthermore (further ANNOUNCE) if we could define PROJECT/REMOVE, NotMatching in terms of NatJoin (embedding etc) then we can reason all the way round equations.

At this juncture, I'd better point out that this is not intended as an 'executable' algebra any more than A is. And it would be horribly inefficient to go about emptifying relations just to get their heading -- or even to evaluate the relational expression argument to emptify( )! It's an algebra purely for reasoning equationally about the equivalence of queries.

My attempts at such definitions have foundered in the past because they relied on DUM being defined; and I couldn't obtain a definition for DUM that fixed it uniquely in the structure of relations. (Reminder: we can use in the definitions only relation values unanalysed, and operators on them. There's no ability to manipulate 'headings' or 'bodies' (as sets), because that would introduce additional 'sorts' into the algebra.)

The definition for DEE hadn't suffered that difficulty, because we can fix DEE as the identity element for NatJoin. And it turns out (claim not yet demonstrated) we can define the other operators without mentioning DUM; then define DUM from them:

  • DEE =df s s.t. ∀ r. s NatJoin r = r.
  • DUM =df DEE NotMatching DEE.
  • DUM =df ∀ r. (r NotMatching r) REMOVE r equivalently  ∀ r. (r REMOVE r) NotMatching r .
  • (Yes, that's three equivalent definitions for DUM -- proven equivalent by equational reasoning.)

Defining InnerUnion in terms of NatJoin

In the formalism of lattices, these two operators are complementary, and that can be expressed equationally in the absorption laws. Or we can equivalently define InnerUnion in terms of NatJoin, then obtain the absorption laws, and other properties of InnerUnion by equational reasoning:

  • r InnerUnion (r NatJoin s) = r -- InnerUnion absorbs NatJoin.
  • r NatJoin (r InnerUnion s) = r -- NatJoin absorbs InnerUnion.
  • ∀ r1 r2 s. (r1 = (r1 NatJoin s) & r2 = (r2 NatJoin s)) ≡ ((r1 InnerUnion r2) = (r1 InnerUnion r2) NatJoin s).
    (A uniqueness proof is available for InnerUnion defined this way -- verified with good old Prover9.)
  • Having defined DEE as the identity for NatJoin, we now have a half-bounded lattice with DEE as top. In fact we have a fully-bounded lattice, but there's no cute name for lattice bottom (I call it Dumpty), and it's a rather vertiginous empty relation with the maximum possible heading; I haven't needed to mention it in the definitions, so it'll play no further part.

Ok that's all the lattice theory I'm going to introduce -- indeed I didn't need to let on it was anything to do with lattices, you can just go by the equations.

The equational technique to note in that equation for InnerUnion is there are many candidate relations s s.t. r1 = (r1 NatJoin s) & r2 = (r2 NatJoin s); s = DEE would be a solution; (r1 InnerJoin r2) is fixed to be the bottommost such s/ the closest to r1, r2/ the furthest from DEE.

Defining NotMatching in terms of NatJoin, InnerUnion

Noting (from Tropashko) that (r1 NatJoin r2) InnerUnion (r1 NotMatching r2) = r1;in general there's many candidates s s.t. (r1 NatJoin r2) InnerUnion s = r1 , including s = r1. This uses the same equational technique to require the NotMatching be the bottommost such s/ furthest from r1:

  • r1 NatJoin s = s & s NatJoin (r1 NotMatching r2) = (r1 NotMatching r2) ≡ (r1 NatJoin r2) InnerUnion s = r1.
  • TentativeDUM =df DEE NotMatching DEE.

And now I have to tell of an ugliness. Help requested to find a more elegant equation than the following. Prover9 alleges the above definition is sufficient to make NotMatching unique; and to make TentativeDUM unique. But (using Test-Driven-Development) lots of the equational reasoning I expected to be fixed for NotMatching turned out to have non-unique answers. Adding this (ugly) axiom resolved it. (Ugly in the sense of needing a negation, which makes the whole model potentially unstable -- Litak et al private correspondence.)

  • (r NotMatching r) ≠ r ≡ r InnerUnion (DEE NotMatching DEE) = DEE.
  • Informally: (r NotMatching r) ≠ r means r is non-empty; equivalently projecting r on no attributes = DEE means r is non-empty.
  • The crucial effect of that axiom is to 'skew' the lattice: We can derive that there are no relation values falling between DEE and (DEE NotMatching DEE). Whereas for most r, (r NotMatching r) are a long way apart: what's in between is all the relations whose bodies form the powerset of the tuples in r. (BTW that powerset structure forms a (latticeland conventional) distributive sub-lattice -- as is provable from the axioms, contrast Litak et al need a specific axiom to fix that. Perhaps there's some equivalence?)
  • So I've postponed defining DUM until REMOVE is also defined; then DUM is defined as above as the result from 'generalised NotMatching/REMOVE', which I hope locks the equations together.

Defining REMOVE in terms of NatJoin, InnerUnion, NotMatching, DEE

tba [it's late].

Quote from AntC on May 12, 2021, 11:46 am

[It'll take a few edits to write this up. Draft for now, I'll remove this message when done.]

Motivation

We'd like an algebra in which we can reason algebraically about the equivalence of queries. For example

  • When is it equivalent to 'push restrict through (natural) Join'? The A algebra goes some way to answer that. It reduces surface-language WHERE STATUS = 10 to <AND> REL{TUP{ STATUS 10}}; then we can reason from the Associativity of <AND>.
  • When is it equivalent to 'push project through Join'? But the A algebra has no answer here: project/REMOVE are not susceptible to equational reasoning, because their attribute-comma-list is not first-class: there's no operators to manipulate lists of attributes, obtain them from relation(al expressions), etc. Only literals are allowed.
  • And yet A at several points wants to manipulate attribute-lists: project in the surface language is to be implemented via REMOVE (of the complement of those attributes, relative to the relation being projected over); COMPOSE "─we have allowed ourselves the luxury (some might think) of including a "macro" operator", what do the scare-quotes indicate? how could any "macro" look at the headings of COMPOSE's operands and infer the attributes in common?
  • (I'm not 'picking on' A specifically; it's just that D&D are being admirably clear about requirements. Codd's algebra suffers the same difficulties. Whatever optimisation/equivalences/manipulation goes on inside SQL engines and their internal 'algebra' is similarly not easy to reason with.)

We could introduce machinery into the algebra to extract the heading of a relation(al expression) to an attribute list, and provide set operators to manipulate the list. But

  • That gives two 'sorts' (in the mathematical sense) of entities to reason about, with a tricky inter-relationship as they shuttle from heading to set operator back to operand of relational operator.
  • We already have relational operators that carry out some set operations on headings: specifically, <AND>, <OR> take the union of headings.
  • So several on this forum have already advocated for using relation values to denote their headings, when appearing as operands in positions that want a heading -- that is, the operator will ignore the relation value's content, and/or the algebra will 'emptify' the relation.

(Empty) Relations denoting headings

Consider Tropashko's 'Inner Union' aka SQL's UNION CORRESPONDING aka ISBL UNION:

  • It projects each operand's body on the attributes in common; then forms the set-union of those projections/bodies.
  • Then we can define r PROJECT s =df r InnerUnion emptify(s).
  • r COMPOSE s =df (r NatJoin s) REMOVE emptify(r InnerUnion s).
  • r PROJECT s =df r REMOVE emptify(r REMOVE s) -- alternative definition.
  • emptify(r) =df r NotMatching r.

Only one primitive operator

  • InnerUnion has some very nice properties for equational reasoning: it is commutative, idempotent, associative -- indeed as is NatJoin. Unfortunately, InnerUnion is not distributive over NatJoin, nor vice versa.
  • But (ANNOUNCE) if we could define InnerUnion in terms of NatJoin (embedding the definition in the machinery of FOPL with equality) then we can reason around equations using both operators.
  • Furthermore (further ANNOUNCE) if we could define PROJECT/REMOVE, NotMatching in terms of NatJoin (embedding etc) then we can reason all the way round equations.

At this juncture, I'd better point out that this is not intended as an 'executable' algebra any more than A is. And it would be horribly inefficient to go about emptifying relations just to get their heading -- or even to evaluate the relational expression argument to emptify( )! It's an algebra purely for reasoning equationally about the equivalence of queries.

No problem down to here, but I see trouble ahead. I was expecting the one operator to be a form of negation (like NAND). This choice looks doomed to failure.

My attempts at such definitions have foundered in the past because they relied on DUM being defined; and I couldn't obtain a definition for DUM that fixed it uniquely in the structure of relations. (Reminder: we can use in the definitions only relation values unanalysed, and operators on them. There's no ability to manipulate 'headings' or 'bodies' (as sets), because that would introduce additional 'sorts' into the algebra.)

The definition for DEE hadn't suffered that difficulty, because we can fix DEE as the identity element for NatJoin. And it turns out (claim not yet demonstrated) we can define the other operators without mentioning DUM; then define DUM from them:

  • DEE =df s s.t. ∀ r. s NatJoin r = r.
  • DUM =df DEE NotMatching DEE.
  • DUM =df ∀ r. (r NotMatching r) REMOVE r equivalently  ∀ r. (r REMOVE r) NotMatching r .
  • (Yes, that's three equivalent definitions for DUM -- proven equivalent by equational reasoning.)

Seems to me it shouldn't be too hard to Dee and Dum, but only if you can negate.

Defining InnerUnion in terms of NatJoin

In the formalism of lattices, these two operators are complementary, and that can be expressed equationally in the absorption laws. Or we can equivalently define InnerUnion in terms of NatJoin, then obtain the absorption laws, and other properties of InnerUnion by equational reasoning:

  • r InnerUnion (r NatJoin s) = r -- InnerUnion absorbs NatJoin.
  • r NatJoin (r InnerUnion s) = r -- NatJoin absorbs InnerUnion.
  • ∀ r1 r2 s. (r1 = (r1 NatJoin s) & r2 = (r2 NatJoin s)) ≡ ((r1 InnerUnion r2) = (r1 InnerUnion r2) NatJoin s).
    (A uniqueness proof is available for InnerUnion defined this way -- verified with good old Prover9.)
  • Having defined DEE as the identity for NatJoin, we now have a half-bounded lattice with DEE as top. In fact we have a fully-bounded lattice, but there's no cute name for lattice bottom (I call it Dumpty), and it's a rather vertiginous empty relation with the maximum possible heading; I haven't needed to mention it in the definitions, so it'll play no further part.

Ok that's all the lattice theory I'm going to introduce -- indeed I didn't need to let on it was anything to do with lattices, you can just go by the equations.

The equational technique to note in that equation for InnerUnion is there are many candidate relations s s.t. r1 = (r1 NatJoin s) & r2 = (r2 NatJoin s); s = DEE would be a solution; (r1 InnerJoin r2) is fixed to be the bottommost such s/ the closest to r1, r2/ the furthest from DEE.

Defining NotMatching in terms of NatJoin, InnerUnion

Noting (from Tropashko) that (r1 NatJoin r2) InnerUnion (r1 NotMatching r2) = r1;in general there's many candidates s s.t. (r1 NatJoin r2) InnerUnion s = r1 , including s = r1. This uses the same equational technique to require the NotMatching be the bottommost such s/ furthest from r1:

  • r1 NatJoin s = s & s NatJoin (r1 NotMatching r2) = (r1 NotMatching r2) ≡ (r1 NatJoin r2) InnerUnion s = r1.
  • TentativeDUM =df DEE NotMatching DEE.

And now I have to tell of an ugliness. Help requested to find a more elegant equation than the following. Prover9 alleges the above definition is sufficient to make NotMatching unique; and to make TentativeDUM unique. But (using Test-Driven-Development) lots of the equational reasoning I expected to be fixed for NotMatching turned out to have non-unique answers. Adding this (ugly) axiom resolved it. (Ugly in the sense of needing a negation, which makes the whole model potentially unstable -- Litak et al private correspondence.)

  • (r NotMatching r) ≠ r ≡ r InnerUnion (DEE NotMatching DEE) = DEE.
  • Informally: (r NotMatching r) ≠ r means r is non-empty; equivalently projecting r on no attributes = DEE means r is non-empty.
  • The crucial effect of that axiom is to 'skew' the lattice: We can derive that there are no relation values falling between DEE and (DEE NotMatching DEE). Whereas for most r, (r NotMatching r) are a long way apart: what's in between is all the relations whose bodies form the powerset of the tuples in r. (BTW that powerset structure forms a (latticeland conventional) distributive sub-lattice -- as is provable from the axioms, contrast Litak et al need a specific axiom to fix that. Perhaps there's some equivalence?)
  • So I've postponed defining DUM until REMOVE is also defined; then DUM is defined as above as the result from 'generalised NotMatching/REMOVE', which I hope locks the equations together.

Defining REMOVE in terms of NatJoin, InnerUnion, NotMatching, DEE

No, I really don't follow. I don't know whether it's the lattices or the toolset but you lost me here. As I see it REMOVE is just AntiJoin. Easy if you start with negation.

tba [it's late].

Care to mention why you didn't follow the NAND lead in App-A?

Andl - A New Database Language - andl.org

Consider a database with all relations having the same set of attributes. Then, it is a Boolean Algebra. It can be axiomatized with a single operation, but this operation is not a lattice one.

P.S. A lot of questions still hang from our past discussions. For example, what do you mean by the "fixed" algebraic object (zilliary operation DUM, unary negation, etc.)? It looks like the Category Theory might provide formalization (the same object/morphism up to equivalency, up to isomorphism etc). However, I will not pretend I mastered this subject yet. (And how do they write books like "Category theory for working [scientist]" (replace the "scientist" with anything but "mathematician") when all the category examples are mathematical objects, of which the reader might be familiar with only one or two?)

 

Quote from AntC on May 12, 2021, 11:46 am

[It'll take a few edits to write this up. Draft for now, I'll remove this message when done.]

Motivation

We'd like an algebra in which we can reason algebraically about the equivalence of queries. For example

  • When is it equivalent to 'push restrict through (natural) Join'? The A algebra goes some way to answer that. It reduces surface-language WHERE STATUS = 10 to <AND> REL{TUP{ STATUS 10}}; then we can reason from the Associativity of <AND>.
  • When is it equivalent to 'push project through Join'? But the A algebra has no answer here: project/REMOVE are not susceptible to equational reasoning, because their attribute-comma-list is not first-class: there's no operators to manipulate lists of attributes, obtain them from relation(al expressions), etc. Only literals are allowed.
  • And yet A at several points wants to manipulate attribute-lists: project in the surface language is to be implemented via REMOVE (of the complement of those attributes, relative to the relation being projected over); COMPOSE "─we have allowed ourselves the luxury (some might think) of including a "macro" operator", what do the scare-quotes indicate? how could any "macro" look at the headings of COMPOSE's operands and infer the attributes in common?
  • (I'm not 'picking on' A specifically; it's just that D&D are being admirably clear about requirements. Codd's algebra suffers the same difficulties. Whatever optimisation/equivalences/manipulation goes on inside SQL engines and their internal 'algebra' is similarly not easy to reason with.)

Are you asking this out of a concern about "query optimisation" ?  If not, don't read any further.  If so, you may but it's at your own risk.

From having written SIRA_PRISE from the ground up, I have come to believe that "query rewrite" is the wrong strategy, or the wrong problem, or the wrong solution, or, well, the wrong something.  I have come to believe that people asking themselves (and worse, others too) these questions need to get their head cleared (***totally***) of their own preconceptions and ask themselves ***WHAT*** problem it is they are trying to solve and while doing so, wonder about ***every single aspect of the definition they find*** whether it is an aspect that pertains to the 'model' space or one that pertains to the 'implementation' space.

The first reason for my saying so is that when you have a query that you want to find an optimal execution strategy for, and you manage to rewrite that query to a provably equivalent one, what else do you have after that than yet another query that you want to find an optimal execution strategy for ?  (The answer to that question is : "a query that has certain properties/satisfies certain conditions that the original query didn't".  I suggest you now start to wonder about ***what exactly those properties/conditions are***.  I think you are bound to find, and cannot escape, that the answer is "properties/conditions that are determined by what the optimizer does and doesn't do".  But if your goal is to ***find*** what the optimizer should and shouldn't do, then you don't know what those properties/conditions are, and as a consequence you don't know what qualities the result of the rewrite is supposed to meet/adhere to/abide by/... .)

The second reason for my saying so is, why do you only mention 'restrict' and 'project' as candidates for "pushing through join" ?  "Pushing extend through join" is a worthwhile strategy to consider if the EXTEND is computationally expensive [and the pushing is possible].

The third reason for my saying so, is that from having to find reliable answers to these kinds of question or otherwise I'd never have gotten the working product I have, I know that "push restrict through join" (and "pushing extend through join" is -as far as the semantics of the query at hand are concerned- in the same category because RESTRICT and EXTEND are in the very same 'A' category of operators, or perhaps I should say "are cases of the very same 'A' operator") still has distinct cases to consider.  Say,  relational expression A with attrs Ax,Cy and relational expression B with attrs Cy,Bz.  "WHERE f(Ai,Bj)" : not pushable on the [natural] join of A and B.  Ditto for EXTEND.  "WHERE f(Ai,Cj)" : pushable, because the result can by definition include only tuples for which f(Ai,Cj) is true and that's evaluatable on A on its own.  Ditto for f(Ai,Cj) used in EXTEND, but ***only*** useful in the case where f is known to be computationally expensive, ***and*** the risk is non-negligible that the very same computation would have to be carried out several times for each tuple matched in the JOIN.  The point of this very long paragraph being : "push restrict through join" is just simply not enough as a detailed definition of the thing you are thinking of.

Further elaboration : now, f(Ax,By) is a logical conjunct where any Ax might be any of Ai,Cj and the same holds for By.  If there is any (Ax,By) where Ax is actually an Ai and By is actually a Bj, then the conjunction as a whole is not "pushable".  However, there might be constituants of the CNF form of the logical conjuct that ***are*** "pushable".  If you can find that CNF form and observe certain sub-constituants that are indeed "pushable" and that effectively reduce (per the best esimates you can possible make) the cost estimate that your optimizer will associate with "carrying out this particular strategy", then push them.

Somewhat similar observations for the case where F(Ax,By) is a logical disjunct, but I'm afraid I'm getting too tired to continue writing and not start uttering stupidities.

Quote from dandl on May 12, 2021, 1:41 pm
Quote from AntC on May 12, 2021, 11:46 am

Motivation

We'd like an algebra in which we can reason algebraically about the equivalence of queries. For example

  • When is it equivalent to 'push restrict through (natural) Join'? The A algebra goes some way to answer that. It reduces surface-language WHERE STATUS = 10 to <AND> REL{TUP{ STATUS 10}}; then we can reason from the Associativity of <AND>.
  • When is it equivalent to 'push project through Join'? But the A algebra has no answer here: project/REMOVE are not susceptible to equational reasoning, because their attribute-comma-list is not first-class: there's no operators to manipulate lists of attributes, obtain them from relation(al expressions), etc. Only literals are allowed.
  • And yet A at several points wants to manipulate attribute-lists: project in the surface language is to be implemented via REMOVE (of the complement of those attributes, relative to the relation being projected over); COMPOSE "─we have allowed ourselves the luxury (some might think) of including a "macro" operator", what do the scare-quotes indicate? how could any "macro" look at the headings of COMPOSE's operands and infer the attributes in common?
  • (I'm not 'picking on' A specifically; it's just that D&D are being admirably clear about requirements. Codd's algebra suffers the same difficulties. Whatever optimisation/equivalences/manipulation goes on inside SQL engines and their internal 'algebra' is similarly not easy to reason with.)

We could introduce machinery into the algebra to extract the heading of a relation(al expression) to an attribute list, and provide set operators to manipulate the list. But

  • That gives two 'sorts' (in the mathematical sense) of entities to reason about, with a tricky inter-relationship as they shuttle from heading to set operator back to operand of relational operator.
  • We already have relational operators that carry out some set operations on headings: specifically, <AND>, <OR> take the union of headings.
  • So several on this forum have already advocated for using relation values to denote their headings, when appearing as operands in positions that want a heading -- that is, the operator will ignore the relation value's content, and/or the algebra will 'emptify' the relation.

(Empty) Relations denoting headings

Consider Tropashko's 'Inner Union' aka SQL's UNION CORRESPONDING aka ISBL UNION:

  • It projects each operand's body on the attributes in common; then forms the set-union of those projections/bodies.
  • Then we can define r PROJECT s =df r InnerUnion emptify(s).
  • r COMPOSE s =df (r NatJoin s) REMOVE emptify(r InnerUnion s).
  • r PROJECT s =df r REMOVE emptify(r REMOVE s) -- alternative definition.
  • emptify(r) =df r NotMatching r.

Only one primitive operator

  • InnerUnion has some very nice properties for equational reasoning: it is commutative, idempotent, associative -- indeed as is NatJoin. Unfortunately, InnerUnion is not distributive over NatJoin, nor vice versa.
  • But (ANNOUNCE) if we could define InnerUnion in terms of NatJoin (embedding the definition in the machinery of FOPL with equality) then we can reason around equations using both operators.
  • Furthermore (further ANNOUNCE) if we could define PROJECT/REMOVE, NotMatching in terms of NatJoin (embedding etc) then we can reason all the way round equations.

At this juncture, I'd better point out that this is not intended as an 'executable' algebra any more than A is. And it would be horribly inefficient to go about emptifying relations just to get their heading -- or even to evaluate the relational expression argument to emptify( )! It's an algebra purely for reasoning equationally about the equivalence of queries.

No problem down to here, but I see trouble ahead. I was expecting the one operator to be a form of negation (like NAND).

Appendix A claims NAND could replace only <AND>, <OR>, <NOT>; it doesn't go near project/REMOVE, COMPOSE. So that's a non-starter. (Also it would introduce domain-dependence, which gives me a queasy feeling.)

...

My attempts at such definitions have foundered in the past because they relied on DUM being defined; and I couldn't obtain a definition for DUM that fixed it uniquely in the structure of relations. (Reminder: we can use in the definitions only relation values unanalysed, and operators on them. There's no ability to manipulate 'headings' or 'bodies' (as sets), because that would introduce additional 'sorts' into the algebra.)

The definition for DEE hadn't suffered that difficulty, because we can fix DEE as the identity element for NatJoin. And it turns out (claim not yet demonstrated) we can define the other operators without mentioning DUM; then define DUM from them:

  • DEE =df s s.t. ∀ r. s NatJoin r = r.
  • DUM =df DEE NotMatching DEE.
  • DUM =df ∀ r. (r NotMatching r) REMOVE r equivalently  ∀ r. (r REMOVE r) NotMatching r .
  • (Yes, that's three equivalent definitions for DUM -- proven equivalent by equational reasoning.)

Seems to me it shouldn't be too hard to Dee and Dum, but only if you can negate.

DEE is easy/doesn't need negate. DUM = DEE NotMatching DEE turns out to have problems/is not unique enough, as I discovered. Lattice algebras tend to presume the structure is distributive, complemented unless you give a strong instruction otherwise -- which is my 'ugly' axiom.

...

No, I really don't follow. I don't know whether it's the lattices or the toolset but you lost me here. As I see it REMOVE is just AntiJoin. Easy if you start with negation.

AntiJoin could be defined as the negation of MATCHING; but r1 MATCHING r2 = r1 NatJoin r2 {attributes-from-r1}, so you still need projection

Quote from Erwin on May 12, 2021, 7:34 pm
Quote from AntC on May 12, 2021, 11:46 am

[It'll take a few edits to write this up. Draft for now, I'll remove this message when done.]

Motivation

We'd like an algebra in which we can reason algebraically about the equivalence of queries. For example

  • When is it equivalent to 'push restrict through (natural) Join'? The A algebra goes some way to answer that. It reduces surface-language WHERE STATUS = 10 to <AND> REL{TUP{ STATUS 10}}; then we can reason from the Associativity of <AND>.
  • When is it equivalent to 'push project through Join'? But the A algebra has no answer here: project/REMOVE are not susceptible to equational reasoning, because their attribute-comma-list is not first-class: there's no operators to manipulate lists of attributes, obtain them from relation(al expressions), etc. Only literals are allowed.
  • And yet A at several points wants to manipulate attribute-lists: project in the surface language is to be implemented via REMOVE (of the complement of those attributes, relative to the relation being projected over); COMPOSE "─we have allowed ourselves the luxury (some might think) of including a "macro" operator", what do the scare-quotes indicate? how could any "macro" look at the headings of COMPOSE's operands and infer the attributes in common?
  • (I'm not 'picking on' A specifically; it's just that D&D are being admirably clear about requirements. Codd's algebra suffers the same difficulties. Whatever optimisation/equivalences/manipulation goes on inside SQL engines and their internal 'algebra' is similarly not easy to reason with.)

Are you asking this out of a concern about "query optimisation" ?

To be clear, this isn't a mechanism to tell an implementor how to optimise a query. It is aimed at mechanically checking whether a query is equivalent to some (alleged) optimisation. The optimisation strategy  is likely to rely on observing that mentioned relations do/don't have attributes in common. From what I've shown about (empty) relations-denoting-headings, you should be able to see the algebra is expressive enough. Optimisations also rely on considerations like keys and foreign keys. Then (I claim, not yet substantiated) the algebra is powerful enough to express those properties.

  If not, don't read any further.  If so, you may but it's at your own risk.

From having written SIRA_PRISE from the ground up, I have come to believe that "query rewrite" is the wrong strategy, or the wrong problem, or the wrong solution, or, well, the wrong something.

Then you're swimming against the tide of modern compiler practice. "AST rewrite" is the strategy for implementors: rewrite the surface-language AST to some reduced-instruction-set algebra/AST; optimise that by manipulating the AST; type-check as you go, as a sanity check over the manipulations; rewrite to some assembly-level or LLVM/abstract JVM AST; translate to object code.

  I have come to believe that people asking themselves (and worse, others too) these questions need to get their head cleared (***totally***) of their own preconceptions and ask themselves ***WHAT*** problem it is they are trying to solve and while doing so, wonder about ***every single aspect of the definition they find*** whether it is an aspect that pertains to the 'model' space or one that pertains to the 'implementation' space.

The first reason for my saying so is that when you have a query that you want to find an optimal execution strategy for, and you manage to rewrite that query to a provably equivalent one, ...

And how are you proving equivalence? From what I see of SQL engines, it's black arts and rule-of-thumb. Of course SQL is severely complicated by dealing with Null, duplicate rows, duplicate/anonymous columns.

what else do you have after that than yet another query that you want to find an optimal execution strategy for ?  (The answer to that question is : "a query that has certain properties/satisfies certain conditions that the original query didn't".  I suggest you now start to wonder about ***what exactly those properties/conditions are***.  I think you are bound to find, and cannot escape, that the answer is "properties/conditions that are determined by what the optimizer does and doesn't do".  But if your goal is to ***find*** what the optimizer should and shouldn't do, then you don't know what those properties/conditions are, and as a consequence you don't know what qualities the result of the rewrite is supposed to meet/adhere to/abide by/... .)

The second reason for my saying so is, why do you only mention 'restrict' and 'project' as candidates for "pushing through join" ?  "Pushing extend through join" is a worthwhile strategy to consider if the EXTEND is computationally expensive [and the pushing is possible].

I mentioned 'restrict' and 'project' push-through-join as the most common examples. And the same technique works for push-EXTEND-through join; and already is catered for by the A algebra. Frankly if you can't already see that, I'm wondering how competent an implementor you are. To give an example using only A:

  • Assuming we have relvar DoubleQty := ((PLUS WHERE X = Y) {ALL BUT Y}) RENAME { QTY := X, QTYx2 := Z }
  • Take query ((S JOIN SP) WHERE STATUS = 10) EXTEND { QTYx2 := QTY + QTY };
  • Translate to A ((S <AND> SP) <AND> REL{TUP{ STATUS 10 }}) <AND> DoubleQty;
  • <AND> is associative, commutative, so shuffle around the parens: ((S <AND> REL{TUP{ STATUS 10 }}) <AND> (SP <AND> DoubleQty));
  • Translate back to Tutorial D ((S WHERE STATUS = 10) JOIN (SP EXTEND { QTYx2 := QTY + QTY}));
  • QED: push EXTEND through join, and push restrict through join.
  • I didn't show the bit about checking for attributes in common. Going only by associativity of <AND> could end up with this:
    ((S EXTEND { QTYx2 := QTY + QTY }) JOIN (SP WHERE STATUS = 10)); which might still work if actually executed by an (imaginary) A engine, but would then rely on domain-dependent semantics/infinite relations, so clearly isn't any sort of optimisation.

I think the following 2 paras might be saying the same as my demonstration. But your prolixity has buried any point. I am not saying that push-x-through-join is necessarily an optimisation. Only the implementor can know that, wrt the specifics of their engine.

The third reason for my saying so, is that from having to find reliable answers to these kinds of question or otherwise I'd never have gotten the working product I have, I know that "push restrict through join" (and "pushing extend through join" is -as far as the semantics of the query at hand are concerned- in the same category because RESTRICT and EXTEND are in the very same 'A' category of operators, or perhaps I should say "are cases of the very same 'A' operator") still has distinct cases to consider.  Say,  relational expression A with attrs Ax,Cy and relational expression B with attrs Cy,Bz.  "WHERE f(Ai,Bj)" : not pushable on the [natural] join of A and B.  Ditto for EXTEND.  "WHERE f(Ai,Cj)" : pushable, because the result can by definition include only tuples for which f(Ai,Cj) is true and that's evaluatable on A on its own.  Ditto for f(Ai,Cj) used in EXTEND, but ***only*** useful in the case where f is known to be computationally expensive, ***and*** the risk is non-negligible that the very same computation would have to be carried out several times for each tuple matched in the JOIN.  The point of this very long paragraph being : "push restrict through join" is just simply not enough as a detailed definition of the thing you are thinking of.

Further elaboration : now, f(Ax,By) is a logical conjunct where any Ax might be any of Ai,Cj and the same holds for By.  If there is any (Ax,By) where Ax is actually an Ai and By is actually a Bj, then the conjunction as a whole is not "pushable".  However, there might be constituants of the CNF form of the logical conjuct that ***are*** "pushable".  If you can find that CNF form and observe certain sub-constituants that are indeed "pushable" and that effectively reduce (per the best esimates you can possible make) the cost estimate that your optimizer will associate with "carrying out this particular strategy", then push them.

Somewhat similar observations for the case where F(Ax,By) is a logical disjunct, but I'm afraid I'm getting too tired to continue writing and not start uttering stupidities.

 

Quote from Vadim on May 12, 2021, 5:28 pm

Consider a database with all relations having the same set of attributes. Then, it is a Boolean Algebra. It can be axiomatized with a single operation, but this operation is not a lattice one.

Thanks Vadim. True. I don't see the relevance: we know a database of all relations (relvars) having the same heading is not adequate.

P.S. A lot of questions still hang from our past discussions. For example, what do you mean by the "fixed" algebraic object (zilliary operation DUM, unary negation, etc.)?

Compare: zilliary DEE is the identity value for NatJoin, therefore lattice top, known as R01 in your papers. Is there any difficulty with that? DUM is just some other specified value -- that is, providing I can arrive at a definition that fixes it. Is there something illegitimate with zilliary operations?

Unary negation (A's <NOT>) I have not defined/have not used. I don't "mean" anything by it. It appears in some of your papers, what do you mean by it?

 

It looks like the Category Theory might provide formalization (the same object/morphism up to equivalency, up to isomorphism etc). However, I will not pretend I mastered this subject yet. (And how do they write books like "Category theory for working [scientist]" (replace the "scientist" with anything but "mathematician") when all the category examples are mathematical objects, of which the reader might be familiar with only one or two?)

 

Some people (by which I mean many propeller-headed Haskellers) seem to think Category Theory is some kind of magic mathematical dust that you can sprinkle on any abstraction and make it more easy to reason about. I don't see any Category-theoretic treatment of lattices; I don't understand wikipedia on Category Theory; I do understand wikipedia on Lattices.

Only one primitive operator

  • InnerUnion has some very nice properties for equational reasoning: it is commutative, idempotent, associative -- indeed as is NatJoin. Unfortunately, InnerUnion is not distributive over NatJoin, nor vice versa.
  • But (ANNOUNCE) if we could define InnerUnion in terms of NatJoin (embedding the definition in the machinery of FOPL with equality) then we can reason around equations using both operators.
  • Furthermore (further ANNOUNCE) if we could define PROJECT/REMOVE, NotMatching in terms of NatJoin (embedding etc) then we can reason all the way round equations.

At this juncture, I'd better point out that this is not intended as an 'executable' algebra any more than A is. And it would be horribly inefficient to go about emptifying relations just to get their heading -- or even to evaluate the relational expression argument to emptify( )! It's an algebra purely for reasoning equationally about the equivalence of queries.

No problem down to here, but I see trouble ahead. I was expecting the one operator to be a form of negation (like NAND).

Appendix A claims NAND could replace only <AND>, <OR>, <NOT>; it doesn't go near project/REMOVE, COMPOSE. So that's a non-starter. (Also it would introduce domain-dependence, which gives me a queasy feeling.)

It says:

"In fact, we will show in the next section that we do not really
need RENAME either; thus, we could in fact reduce our algebra
still further to just the two operators REMOVE and either
NOR or NAND (plus TCLOSE)."

Project/Remove is simply InnerUnion with an empty relation representing the new heading.

...

My attempts at such definitions have foundered in the past because they relied on DUM being defined; and I couldn't obtain a definition for DUM that fixed it uniquely in the structure of relations. (Reminder: we can use in the definitions only relation values unanalysed, and operators on them. There's no ability to manipulate 'headings' or 'bodies' (as sets), because that would introduce additional 'sorts' into the algebra.)

The definition for DEE hadn't suffered that difficulty, because we can fix DEE as the identity element for NatJoin. And it turns out (claim not yet demonstrated) we can define the other operators without mentioning DUM; then define DUM from them:

  • DEE =df s s.t. ∀ r. s NatJoin r = r.
  • DUM =df DEE NotMatching DEE.
  • DUM =df ∀ r. (r NotMatching r) REMOVE r equivalently  ∀ r. (r REMOVE r) NotMatching r .
  • (Yes, that's three equivalent definitions for DUM -- proven equivalent by equational reasoning.)

Seems to me it shouldn't be too hard to Dee and Dum, but only if you can negate.

DEE is easy/doesn't need negate. DUM = DEE NotMatching DEE turns out to have problems/is not unique enough, as I discovered. Lattice algebras tend to presume the structure is distributive, complemented unless you give a strong instruction otherwise -- which is my 'ugly' axiom.

I have no idea what this means, or why I should care. My aim was to point out that I don't see any solution with a single primitive unless it's some kind of negation, like NAND. Whether or not you get a 'nice' DUM or not this way, I think you won't get there any other way.

...

No, I really don't follow. I don't know whether it's the lattices or the toolset but you lost me here. As I see it REMOVE is just AntiJoin. Easy if you start with negation.

AntiJoin could be defined as the negation of MATCHING; but r1 MATCHING r2 = r1 NatJoin r2 {attributes-from-r1}, so you still need projection

AntiJoin is cognate with Not Matching: same heading, result is the complement of SemiJoin (aka Matching). I prefer to avoid the distinctive/idiosyncratic TD names for operators, they come with too much baggage.

But I have to say I'm not optimistic about your quest. In highly informal terms, you can have an operator that removes attributes and an operator that removes rows and apply them twice to get the opposite, but a single operator to do all that seems far-fetched. More power if you can get there.

 

Andl - A New Database Language - andl.org
Quote from Vadim on May 12, 2021, 5:28 pm

Consider a database with all relations having the same set of attributes. Then, it is a Boolean Algebra.

 

Since all the relation values with the same heading form a Boolean Algebra (i.e. a distributive, complemented sub-lattice within the bigger structure), we can take any element r1 within it: the sub-sub-lattice from that element (as sub-sub-lattice top) 'down' to the empty relation with that heading is also a Boolean Algebra. That empty relation value is given by r1 NotMatching r1.

Sub-Sub-Lattice complement of r2 coincides with r1 NotMatching r2. IOW if r2 NatJoin s = (r1 NotMatching r1) & r2 InnerUnion s = r1 then s = r1 NotMatching r2. I've added that as an axiom.  Also added the distributivity within the sub-sub-lattice from r1. That still doesn't fix enough, without the 'ugly' axiom.

Quote from AntC on May 13, 2021, 11:53 am
Quote from Vadim on May 12, 2021, 5:28 pm

Consider a database with all relations having the same set of attributes. Then, it is a Boolean Algebra.

Since all the relation values with the same heading form a Boolean Algebra (i.e. a distributive, complemented sub-lattice within the bigger structure), we can take any element r1 within it: the sub-sub-lattice from that element (as sub-sub-lattice top) 'down' to the empty relation with that heading is also a Boolean Algebra. That empty relation value is given by r1 NotMatching r1.

Sub-Sub-Lattice complement of r2 coincides with r1 NotMatching r2. IOW if r2 NatJoin s = (r1 NotMatching r1) & r2 InnerUnion s = r1 then s = r1 NotMatching r2. I've added that as an axiom.  Also added the distributivity within the sub-sub-lattice from r1. That still doesn't fix enough, without the 'ugly' axiom.

The simplest possible counter example is the database consisting of only two relations: DEE and DUM. Yes, this example is far fetched from database practice, but counterexamples can be trivial/silly.

The other [rather artificial] example is the database with all the empty relations.

Therefore, we are in the realm of Boolean Algebra. There, reduction to a single operation is well known (NAND, NOR), which is not consistent with your presentation.

In general, a reduction of the operation set in Universal Algebra to some basic set is understood as representing each derived operation via composition of basic operatons. In other words, you are not allowed to use the First Order Logic sentenses and specify a constraint on the derived operation this way; you have to exhibit an equality with the derived operation alone on the left hand side.

 

Page 1 of 12Next