The Forum for Discussion about The Third Manifesto and Related Matters

Please or Register to create posts and topics.

Aggregates in constraints

Further to my previous post Aggregates in relational calculus (or lack thereof). I mentioned in that thread that the reason for my interest is for efficient evaluation of integrity constraints.

The approach I've been using so far is to deal with a constraint that is written in a calculus form ie. quantifiers, conjunction, disjunction, negation, predicate symbols. There has been a fair bit of research done on efficiently evaluating constraints in such a form (J.M.Nicolas possibly being the seminal case and many others since of course). Thus, for constraints specified in algebra, I've first been converting into calculus using the usual substitution rules (which is where I came unstuck).

I thought it might be useful/interesting to elaborate on how the technique works. The general idea is that one can derive from the constraint expression another expression(s) that is (/are) inherently more efficient to evaluate. This is primarily done by substituting into the original expression, where possible, the values from the transaction delta that caused the constraint to be rechecked (this can be done for variables that appear exclusively under a universal quantifier). The necessity for a recheck is dictated by the polarity under which the relevant relations occurs (-ve on INSERT, +ve on DELETE). Moreover, the transaction deltas themselves can be subject to the same constants that appear in the relevant relation in the original expression (acting as a "relevance filter").

Example:

A schema of emp(id,dept,job) (id being the key) meaning "Person with <id> is employed by department <dept> with job title <job>".

A constraint of ∀a emp(_,a,Manager) ⇒ emp(_,a,Clerk) meaning "All departments that employ a manager also employ a clerk".

If we rewrite the constraint to negation innermost form but favouring existential quantifiers:

¬∃a emp(_,a,Manager) ∧ ¬emp(_,a,Clerk)

We can now (more easily) see that emp appears under both polarities. It is therefore subject to recheck on both INSERT and DELETE. The variable a is universally quantified so we can substitute in values from the transaction delta. We do this by assuming the existence of a transition relation (RINS for INSERTs and RDEL for DELETEs) and bind the relevant variables.

Given that, the general form of the recheck expression becomes:

∃v0,...,vn (RINS|RDEL)(c0,...,cn) ∧ ¬IC

(v are variables, c are constants, IC is the integrity constraint)

which will evaluate to true when the constraint is violated (ie. note the negation of IC). For the example, this gives the following two expressions:

∃a empINS(_,a,Manager) ∧ emp(_,a,Manager) ∧ ¬emp(_,a,Clerk)
∃a empDEL(_,a,Clerk) ∧ emp(_,a,Manager) ∧ ¬emp(_,a,'Clerk')

There is a further optimisation in the insert case: we can replace emp(_,a,Manager) with true as we know for certain that it exists as we have just inserted it. Obviously that simplifies down to:

∃a empINS(_,a,Manager) ∧ ¬emp(_,a,Clerk)

The converse isn't possible in this instance, as, for the delete case, it can only be done when the bound variables are a superset of a key.

This brings us on to aggregates. It seems to me there are two general approaches:

(1) abandon the use of calculus and instead use algebra-with-aggregates; or
(2) continue the use of calculus and find a way of expressing aggregates that "works" with the aforementioned optimisations.

I've not really explored (1). I would assume that it is possible given that calculus can be transformed into algebra. The main optimisation relies upon determining the applicable quantifiers for any variables. I don't know how difficult (or not) that would be in algebra vs calculus.

Perhaps obviously, I'd favour (2) if possible. Working through a simple example seemed promising. eg. A constraint of "All departments employ at most 10 people". Using pidgin logic to illustrate (hopefully with obvious semantics?):

∀a Count[id](emp(_,a,_)) <= 10

Intuitively if we were inserting tuples into emp then we'd know to recheck only for the affected department. This would seem to work by using the transition relations mentioned earlier e.g.:

   ∃a empINS(_,a,_) ∧ ¬(Count[id](emp(_,a,_)) <= 10)
≡ ∃a empINS(_,a,_) ∧ Count[id](emp(_,a,_)) > 10

(A point of note here is that the particular comparison operator has an affect on the recheck circumstance eg. count(R)<=n should be rechecked on INSERT, count(R)>n rechecked on DELETE. Ceri & Widom in "Deriving Production Rules for Constraint Maintainance" describe combinations of aggregates + comparison operator that help determine this).

Now, what happens when the expression being counted is not just a relation? I can see it working for conjunctive expressions e.g.

∀a Count(R(a) ∧ S(a)) <= 10

For disjunctive expressions it strikes me that there is a possible rewrite that is applicable for Count:

∀a Count(P ∨ Q) <= 10
∀a Count(P) + Count(Q) <= 10

The same would be applicable for Sum. Min and Max would be similar also, but would then have the non-aggregate/traditional version applied to the results e.g.:

∀a Max(P ∨ Q) <= 10
∀a max(Max(P), Max(Q)) <= 10

The downside to this sort of thing is: does it work in general for all aggregates? Or does each aggregate require their own special rules? What if you have an aggregate that requires ordering (say, median) to calculate? Maybe they are future questions to answer. I imagine many of the most useful constraints would utilise sum & count the most, so there is still value, even if they are special-cased.

Anyway. It seems like this approach may have some mileage. I am posting here to see if there are other people working on similar things, and, if you are, does this chime with what you are doing?

Quote from Joe on October 29, 2019, 7:48 pm

For disjunctive expressions it strikes me that there is a possible rewrite that is applicable for Count:

∀a Count(P ∨ Q) <= 10
∀a Count(P) + Count(Q) <= 10

Take P=Q and assume ∀a Count(P)=6. Then, the first assertion is true, while the second false?

 

Quote from Vadim on October 29, 2019, 9:33 pm
Quote from Joe on October 29, 2019, 7:48 pm

For disjunctive expressions it strikes me that there is a possible rewrite that is applicable for Count:

∀a Count(P ∨ Q) <= 10
∀a Count(P) + Count(Q) <= 10

Take P=Q and assume ∀a Count(P)=6. Then, the first assertion is true, while the second false?

 

Yes I’ve managed to overlook this. Clearly the rewrite is actually invalid (maybe with bag semantics rather than set, it would make sense).

The proof of equivalence between RA and RC assumes the First Order RA, that is, it does not extend to open expressions, which are found in the Second Order RA.

I have never seen the RC used to express a computation, so I would start with that. The SO-RA expression QTY*PRICE computes an extended price for each tuple.

Within the SO-RA, aggregates are a special form of open expression, where the argument to the function is a collection of values, rather than a single value. SUM(QTY) computes a total quantity from a collection of QTY values. That should not be much harder.

Andl - A New Database Language - andl.org
Quote from Joe on October 29, 2019, 7:48 pm

Further to my previous post Aggregates in relational calculus (or lack thereof). I mentioned in that thread that the reason for my interest is for efficient evaluation of integrity constraints.

I thought it might be useful/interesting to elaborate on how the technique works.

Hi Joe, it's always struck me that you could put huge efforts into constraints-in-general, especially constraints using aggregates; but that the constraints actually used in real life are really in very limited forms. So it would be much better to put efforts into implementing those forms efficiently. (Which I think is pretty much done.) The "limited forms" would be: Inclusion dependencies (aka Foreign Key constraints); Equality Dependencies (every Department has an Employee, every Employee is in a Department); Exclusion Dependencies (a Manager of a Department can't be an Employee in that Department).

I've never know constraints using aggregates (or at least that can't be expressed avoiding aggregates: D&D use COUNT( ) to express a Functional Dependency, but an efficient constraint-enforcer wouldn't go counting tuples through the whole relation.) I don't think that's merely because current tools (SQL) can't/won't express fancy constraints.

...

Thanks for the background on the technique.

This brings us on to aggregates. It seems to me there are two general approaches:

(1) abandon the use of calculus and instead use algebra-with-aggregates; or
(2) continue the use of calculus and find a way of expressing aggregates that "works" with the aforementioned optimisations.

 

Perhaps obviously, I'd favour (2) if possible. Working through a simple example seemed promising. eg. A constraint of "All departments employ at most 10 people".

Has anybody here seen a constraint like that in the wild? I haven't. There might be an informal business rule about size of departments or teams or projects; but not so hard-edged that you'd write it as a constraint. You'd put it as a Warning in the application: 'That makes a Department of 11 Employees, are you sure?'.

Or an after-the-fact exception report/action message to a queue: the Programming Department is getting beyond manageable size, time to segregate responsibilities?

The downside to this sort of thing is: does it work in general for all aggregates?

I doubt it. But then I don't think the topic merits an in-general approach.

 

Quote from AntC on October 30, 2019, 12:18 am
Quote from Joe on October 29, 2019, 7:48 pm

Further to my previous post Aggregates in relational calculus (or lack thereof). I mentioned in that thread that the reason for my interest is for efficient evaluation of integrity constraints.

I thought it might be useful/interesting to elaborate on how the technique works.

Hi Joe, it's always struck me that you could put huge efforts into constraints-in-general, especially constraints using aggregates; but that the constraints actually used in real life are really in very limited forms.

Hi Ant.  I too have wondered about how applicable/useful generalised constraints would be in real-world databases. My experience is that "in the wild", many (most, really) schemas do not even follow 3rd normal form, are littered with NULLs, often don't have keys, or, have keys and indices conflated, have very few if any tuple constraints, etc. In other words, the existing facilities are already so poorly misunderstood and abused, that surely adding a more generalised facility is just asking for trouble? Well, there are, of course, instances where this isn't the case; and, it is those people who I think would find more generalised constraints to be useful.

 

So it would be much better to put efforts into implementing those forms efficiently. (Which I think is pretty much done.) The "limited forms" would be: Inclusion dependencies (aka Foreign Key constraints); Equality Dependencies (every Department has an Employee, every Employee is in a Department); Exclusion Dependencies (a Manager of a Department can't be an Employee in that Department).

What is interesting about the technique is that the generated checks for the "limited forms" you mention are actually optimal despite the algorithm being generalised. Going back to my example; if one were to insert 10,000 managers into the same department (think Accenture or KPMG ;-)) then the check would occur only once for the affected department. In contrast a typical foreign key implementation would actual do the check per-tuple check (ie. it would check the same rule 10,000 times).

Another common use-case that comes to mind is enforcing disjoint sub-types (I dislike "sub-types" as terminology, but can't think of a better phrase. There probably is one in the lingo of  TTM?). But, I suppose one could argue that it is a combination of equality, inclusion & exclusion dependencies?

 

I've never know constraints using aggregates (or at least that can't be expressed avoiding aggregates: D&D use COUNT( ) to express a Functional Dependency, but an efficient constraint-enforcer wouldn't go counting tuples through the whole relation.) I don't think that's merely because current tools (SQL) can't/won't express fancy constraints.

I can imagine one or two ways to avoid the use of aggregation, certainly for counting (and probably min/max). e.g. have a constraint asserting that a tuple have a contiguous "index" attribute, and another constraint that asserts there must exist the highest "index" attribute. Or something. I'm not convinced it would be very pretty in contrast to using an aggregate though.

I will go away and read about the D&D stuff on FDs + COUNT (). That is new to me (or, old to me, and I've forgotten).

BTW a real-world example from my own work life of wanting aggregate constraints is when working with a double-entry ledger. Depends on the exact ledger implementation, of course, but say a posting can affect multiple ledgers and the requirement being that the postings obviously net off to zero. A priori it won't necessarily be known how many ledgers are involved (hence wanting to use an aggregate). A similar example was when I worked on a trading system that distributed orders to various parties: the sum of the distributions need to equal the total of what was being distributed.

 

Perhaps obviously, I'd favour (2) if possible. Working through a simple example seemed promising. eg. A constraint of "All departments employ at most 10 people".

Has anybody here seen a constraint like that in the wild? I haven't. There might be an informal business rule about size of departments or teams or projects; but not so hard-edged that you'd write it as a constraint. You'd put it as a Warning in the application: 'That makes a Department of 11 Employees, are you sure?'.

Or an after-the-fact exception report/action message to a queue: the Programming Department is getting beyond manageable size, time to segregate responsibilities?

So in your example I might favour the following: have a constraint that says the maximum number of people employed in a department must be at most the quota unless another fact has been recorded that justifies an exemption. Any failure due to this constraint would then cause the "Are you sure?" message (and prompt for the reason if they were sure).

The downside to this sort of thing is: does it work in general for all aggregates?

I doubt it. But then I don't think the topic merits an in-general approach.

One thought that did cross my mind: aggregates are typically implemented using a state transition function and a finalising function. The transition function maintains whatever state is required for the aggregate (e.g. cumulative sum). If the state that was being maintained was itself exposed as a "return" value for the partially aggregated data (rather than whatever the finaliser returned) then perhaps it would then be possible to combine those two states and then run the finaliser against that. Obviously that would be "lifting" something internal in the DBMS into the calculus which is unusual. Just a random thought.

 

Quote from AntC on October 30, 2019, 12:18 am

A constraint of "All departments employ at most 10 people".

Has anybody here seen a constraint like that in the wild? I haven't.

I haven't personally either.  But a constraint like "An office can have at most four occupants" seems perfectly realistic to me.  As for airplanes WHERE model = 'Boeing 747-400' AND configuration = 'single class' can carry at most 660 passengers", that's completely inflexible: violate it, and the plane literally doesn't get off the ground.

Quote from johnwcowan on October 30, 2019, 6:34 pm
Quote from AntC on October 30, 2019, 12:18 am

A constraint of "All departments employ at most 10 people".

Has anybody here seen a constraint like that in the wild? I haven't.

I haven't personally either.  But a constraint like "An office can have at most four occupants" seems perfectly realistic to me.  As for airplanes WHERE model = 'Boeing 747-400' AND configuration = 'single class' can carry at most 660 passengers", that's completely inflexible: violate it, and the plane literally doesn't get off the ground.

Planned future sessions of trainings/whatever often come associated with a max.number of subscriptions/reservations.  (I know I know.  They usually allow more subscriptions than the max, putting the "surplus" in a kind of waiting queue so they can fill up late unannounced cancellations, but technically the constraint [and the case for it] is there - e.g. the "waiting queue" could be claimed to need an "ordinal position" attribute which the "confirmed" reservations do not need so there's a case for claiming there's distinct entities at work here.)

Quote from Erwin on October 30, 2019, 11:01 pm
Quote from johnwcowan on October 30, 2019, 6:34 pm
Quote from AntC on October 30, 2019, 12:18 am

A constraint of "All departments employ at most 10 people".

Has anybody here seen a constraint like that in the wild? I haven't.

I haven't personally either.  But a constraint like "An office can have at most four occupants" seems perfectly realistic to me.  As for airplanes WHERE model = 'Boeing 747-400' AND configuration = 'single class' can carry at most 660 passengers", that's completely inflexible: violate it, and the plane literally doesn't get off the ground.

Planned future sessions of trainings/whatever often come associated with a max.number of subscriptions/reservations.  (I know I know.  They usually allow more subscriptions than the max, putting the "surplus" in a kind of waiting queue so they can fill up late unannounced cancellations, but technically the constraint [and the case for it] is there - e.g. the "waiting queue" could be claimed to need an "ordinal position" attribute which the "confirmed" reservations do not need so there's a case for claiming there's distinct entities at work here.)

Your "I know I know" applies equally to John's airplane booking model: airlines always overbook, in the expectation some of the passengers will be no-shows. And in the worst case the fine print on a $10 bargain ticket says: you're not guaranteed a seat at the time shown. So if the plane's full, and a full-fare passenger turns up at the last moment, the $10 will get 'bumped'. There should be a stricter rule for issuing Boarding Passes (but that's a case of seat number must be unique/is a key). OTOH I've seen passengers taken off a plane when somebody turns up with the same seat number. (Admittedly that was Garuda Indonesia. My advice: avoid Garuda if at all possible, for lots of reasons.)

Whatever the business rules, I'm still not convinced that's the sort of thing you'd want as a constraint declared in the database, rather than a rule in the application. I'm thinking timing differences: we've moved/deplaned this passenger; the system's slow in freeing up availablity.

a real-world example from my own work life of wanting aggregate constraints is when working with a double-entry ledger.

Yes fair point, and I have met that business requirement (both distribution journals and invoice detail + tax summing to invoice heading).

In the accounting system that wasn't declared as a constraint. Possibly because SQL didn't support that. Possibly because the package was distributed on all major vendors' DBMSs, and there were subtle differences and levels of support. Chiefly because:

Most such transactions come from feeder systems (Sales Ordering/ERP module) or are prepared via spreadsheet. They've already been proven to balance there. Or if manual: when a bookkeeper is direct-entering a dozens-of-lines journal or invoice, it's hugely frustrating if it won't balance; and it's usually urgent to get a 'best guess' into the accounts (in draft) for month-end reporting (which is the only time accountants post manually). So the packages I worked on allowed draft input where the ledger/invoice detail lines went into exactly the same table as 'proper'/balancing transactions. Later they got approved/authorised by a supervisor, which required them to balance. Of course you could declare the constraint to apply only for approved entries; but that now requires the constraint looking at the transaction header table as well as at the detail.

The other point about (approved/finalised) journal entries is that they're never updated nor deleted. So you never need the fancy state transition/incremental algorithms that Joe describes.