The Forum for Discussion about The Third Manifesto and Related Matters

You need to log in to create posts and topics.

7NF and ℵ

One of the disadvantages of 6NF is the need to replicate the attributes of the central relation across all the others.  If the central relation's candidate key (the whole tuple) is lengthy or contains large attribute values, the cost in storage can be large.  In 7NF (which I propose here), we add an synthetic attribute named ℵ that is also a candidate key for the central relation.  By doing this, we can make ℵ an attribute, and the candidate key, for all the peripheral relations, so that they are all of degree exactly 2, affording a considerable simplification.

Another advantage to 7NF is that in 6NF k + 1 relations are required to represent a 1NF relation, where k is the rank of the original relation minus the rank of the central relation.  But in 7NF, if the values of ℵ are made unique across the whole database, then distinct central relations may share 7NF peripheral relations if they contain the same attribute.  Thus a single peripheral relation with heading (ℵ, PURCHASE_PRICE) can be shared by many central relations: a product relation, an invoice relation, an asset relation, and so on.

The type of ℵ is arbitrary as long as its value space is sufficiently large.  It can be a serially assigned integer; a random integer or string with a sufficiently large number of bits, such as a UUID; or (worst case) the concatenation of the relation name with the string image of all the attributes in the central relation, in which case the improvement is merely theoretical, as the storage consumption may well be worse than before.

Of course, if a DBMS restricted all public relvars to 6NF, it would be possible to implement them with 7NF under the (erm) table.  But since that isn't likely to be the case, 7NF is like the other NFs, a design principle.

Comments?

I'm surprised that these days "the cost in storage" is any concern at all.

we add an synthetic attribute named ℵ that is also a candidate key for the central relation.

So in at least the central relation, there is duplicated information, with the usual opportunities for anomalies. Or do you propose that once an ℵ is allocated to a (central relation's) tuple, there's no need to change it even if the 'business key's attributes change? Then you have to distinguish UPDATEs (which don't change the ℵ) vs DELETE/INSERT which will need to cascade any new ℵ to the peripheral relations.

But in 7NF, if the values of ℵ are made unique across the whole database, then distinct central relations may share 7NF peripheral relations if they contain the same attribute.

I'd like to see the characteristic predicate for a relvar with an ℵ and a MONEY that variously holds PURCHASE_PRICE, COST_PRICE, EXTENDED_PRICE, TAX_AMOUNT, .... What's its INCLUSION DEPENDENCY (aka FOREIGN KEY constraint) going to look like? There's nothing in the ℵ tells which central relation it belongs to. What if we want two MONEYs peripheral to the same central tuple?

And I see no grounds for calling this "7NF": it's not beyond 6NF; as soon as you have Foreign Keys, you can potentially use a synthetic Primary Persistent Key. Every darn application I've been working with recently does this already. Bane of my life. Dratted ORM (done badly, according to Dave).

Quote from AntC on September 8, 2019, 10:29 pm

I'm surprised that these days "the cost in storage" is any concern at all.

It's true we have Big Storage nowadays, but Big Data is outrunning it.  We had a zettabyte (10²¹ bytes) of storage in use in 2012 and 18 times that now, with 175 zettabytes expected by 2025.

 

Or do you propose that once an ℵ is allocated to a (central relation's) tuple, there's no need to change it even if the 'business key's attributes change? Then you have to distinguish UPDATEs (which don't change the ℵ) vs DELETE/INSERT which will need to cascade any new ℵ to the peripheral relations.

"Exactly so!" quoth he.

But in 7NF, if the values of ℵ are made unique across the whole database, then distinct central relations may share 7NF peripheral relations if they contain the same attribute.

I'd like to see the characteristic predicate for a relvar with an ℵ and a MONEY that variously holds PURCHASE_PRICE, COST_PRICE, EXTENDED_PRICE, TAX_AMOUNT, .... What's its INCLUSION DEPENDENCY (aka FOREIGN KEY constraint) going to look like? There's nothing in the ℵ tells which central relation it belongs to. What if we want two MONEYs peripheral to the same central tuple?

I understand 6NF to be about attributes, not domains; you have a peripheral relation for each non-key attribute of the original relation, and it's irrelevant if they all have the same domain, all have different domains, or in between.  What if we want two MONEYs in the same non-6NF relation?  We give them different names, and just so here.

So here we don't merge any of these, we only merge PURCHASE_PRICE in one original (non-7NF) relation with PURCHASE_PRICE attributes in the other original relations that contain it.

 

we only merge PURCHASE_PRICE in one original (non-7NF) relation with PURCHASE_PRICE attributes in the other original relations that contain it.

You're not answering the point about the Foreign Key constraint:

  • I'm inserting a TUPLE{ ℵ PKEY,  MONEY PURCHASE_PRICE} into a peripheral relvar.
  • That's a peripheral for several "original" relvars.
  • I need to validate that the PKEY exists already (or as part of this multiple update) in the PURCHASE_ORDER_DETAIL relvar.
  • How does the constraint checker know which relvar to go looking in?
  • Or do you suggest it looks in every relvar for which PURCHASE_PRICE is peripheral, and be happy if it finds that just anywhere?
  • (That last bullet was heavy with sarcasm, I probably need to point out.)
  • This is a issue about the information structure of the schema. There's a separate issue about the performance/indexing of the peripheral relvar, we don't want to rob Peter to pay Paul -- i.e. win on disk space but lose on processing for a bigger number of index entries.
  • Does mashing several different PURCHASE_PRICEs into one peripheral relvar actually save anything at all? (It cuts down on the number of relvars/names; is that any kind of saving?) What's wrong with a separate relvar for each distinct 'role' of PURCHASE_PRICE?
  • Oh, I seem to have reinvented 6NF (with surrogate keys).

 

Quote from AntC on September 9, 2019, 6:54 am
  • Does mashing several different PURCHASE_PRICEs into one peripheral relvar actually save anything at all? (It cuts down on the number of relvars/names; is that any kind of saving?)

Perhaps not, and there are practical difficulties, as you point out.  But merging tables was just an afterthought on my part: the essence here is ℵ.

  • Oh, I seem to have reinvented 6NF (with surrogate keys).

I think the use of ℵ is different enough from 6NF as usually defined to deserve a new name, but I won't argue that.

Another afterthought:  if all ℵ attributes throughout the database have a common domain ℸ (what you called PKEY above), then it is possible to treat the database as a graph database (in the modern sense, not Codd's sense of a Codasyl network database).  Each node in a graph DB has a type which prescribes that the node has zero or more required properties (the attributes of the central relation) and zero or more optional properties (defined in peripheral relations), where a property is a pair <name, value> and value may be of type ℸ as well as any other type.  For this purpose we relax the idea that the non-ℵ attributes of the central relation form a candidate key.

Graph databases in turn are isomorphic to RDF triple stores (with a slight limitation due to static typing), so we now have an embedding of both models in the relational model; we can, for example, write a RM database that accepts GraphQL or Sparql queries on databases of the right overall shape.

(I'm prepared to have this idea shot down too.)

 

As I see it, there is only one benefit to adding surrogates and renormalizing, but in order for it to actually be of benefit, (1) all surrogates must be drawn from the same domain, the entity domain or E-domain, (2) there must be a single separate unary E-relation, and (3) there must be one or more foreign key constraint from every other relation to the E-relation.

In this scheme, so long as full normalization has been achieved, both vertically and horizontally, it's possible to capture the fact that the same entity instantiates multiple predicates.  For instance, an employee may also be a customer or a customer may also be a supplier.  Without a separate E-relation, the same atomic fact--the assertion of the existence of the entity in question--may appear in multiple relations, which implies that the predicate of each of those atomic facts must be identical.  Such a schema is clearly redundant.  On the other hand, when there is a separate E-relation along with the requisite foreign key constraints, the redundancy is eliminated.  The foreign key constraints introduce implications, for example, being a customer implies being an entity, and being an employee implies being an entity, and as a consequence the atomic facts in question are each represented in the one E-relation rather than spread out amongst EMPLOYEE, CUSTOMER and SUPPLIER.

The question therefore becomes whether this is all worthwhile.  On the one hand, having a single entity domain simplifies the underlying logic somewhat, since definite description terms are no longer required; on the other hand, the complexity of the schema increases due to the additional keys.

Normalization from 5NF to 6NF is contraindicated unless the database is temporal.  Vertical normalization beyond 5NF always introduces cycles of inclusion dependencies.  When there are temporal attributes--especially when those attributes are intervals, circumlocution can be eliminated, and when the intervals in question are significant, some ambiguity can be eliminated by decomposing a 5NF temporal relation to 6NF.  This is not the case for relations without temporal attributes.  When there are no temporal attributes, decomposing a 5NF relation introduces redundancy.  A database schema exhibits redundancy if the same predicate is instantiated by tuples from more than one relvar.  When there is a cycle of inclusion dependencies amongst a group of relations, the same predicate is instantiated by tuples from more than one relvar.  For example, the 5NF relation

EMPLOYEE (EMP#, TAXID, RATE, OTRATE, SHIFT)

can be non-loss decomposed into

EMPTAXID (EMP#, TAXID), EMPRATE (EMP#, RATE), EMPOTRATE (EMP#, OTRATE), EMPSHIFT (EMP#, SHIFT)

but there must also exist a set of inclusion dependencies amongst them all

EMPTAXID[EMP#] = EMPRATE[EMP#] = EMPORATE[EMP#] = EMPSHIFT[EMP#]

And as a consequence, the predicate whose instantiations assert the existence of employees is no longer associated with a single relation, EMPLOYEE, but rather all of EMPTAXID, EMPRATE, EMPOTRATE and EMPSHIFT.

Now there may be a justification for decomposing past 5NF for non-temporal relations, but not vertically--that is, not by taking projections.  What it boils down to is

(P OR Q) IMPLIES (P AND Q)

where P and Q are formulas with identical free variables.  This also applies when there are more than two formulas.

(P OR Q OR R) IMPLIES (P AND Q AND R)

If this implication is not satisfied, then decomposition is not only warranted, but also necessary.  For example, suppose that you have a relation BUSINESS_ENTITIES.  There are several kinds of business entities: customers, suppliers, contractors and employees

Being a customer does not imply being a supplier, nor does being a supplier imply being a customer.  It therefore stands to reason that there should be separate relations for each kind of business entity.  In this case, it also makes sense to keep the relation BUSINESS_ENTITIES in order to capture the cases where an employee is also a customer or where a supplier is also a customer, but in general, that is not always the case.

For each relvar there is a family of predicates.  In the case of EMPLOYEE above, there are predicates for each functional dependency where the determinant is the key.  The functional dependencies, when instantiated, assert facts of the form there is one and only one x such that Pkx, where P is the predicate for the FD and k is the key value.  Notice that the predicate of such facts are unary, ranging only over the key.  So, in the case of EMPLOYEE,

P     There is one and only one TAXID such that the employee identified by EMP# has rate TAXID.

Q      There is one and only one RATE such that the employee identified by EMP# has rate RATE.

R     There is one and only one OTRATE such that the employee identified by EMP# has rate OTRATE.

S     There is one and only one SHIFT such that the employee identified by EMP# has rate SHIFT.

All of the above formulas have the same arity as

T   There is an employee identified by EMP#.

Now, suppose that not every employee gets paid overtime.  Then instead of allowing inapplicable nulls for OTRATE, the schema must be decomposed.  So if EMP#(1234) does not get paid overtime, then it is not the case that "there is one and only one OTRATE such that the employee identified by EMP#(1234) has rate OTRATE."  And therefore it is not always the case that

(P OR Q OR R OR S OR T) IMPLIES (P AND Q AND R AND S AND T)

So as you can see, decomposition is necessary unless all instantiations of all predicates that range over domains identified by the same set of attributes in the family of predicates for a relvar satisfy the implication.

 

Brian

Insofar as I understood that (probably not much), I think it is a formal restatement of what I said.  A single unary relation of type ℵ for all the things in the database, many binary relations of type {ℵ, <something>} expressing particular facts about a specific thing, throwing away all higher-level information about what facts might be associated with each thing, or relegating it to the domain of metadata (itself expressible as facts about things).  That is the graph/RDF model.

In the limit, though, as the number of binary relations grows without bound, it pays to abandon first-order RAs like SQL or Tutorial D in favor of truly second-order algebras like SPARQL or calculi like modern Prolog, which can easily answer questions of the form "What binary relations subsist between two specified things?" and "What binary relations subsist between all things and a single thing, or a set of things, or all things?"  It is possible to renormalize this into a ternary relationship of type {ℵ, relation_name, ℵ} and use a first-order RA to query it, but not very naturally.

In addition, for binary relations between things and abstracta like strings or numbers that aren't part of the ℵ domain, you'd need separate ternary relations in a statically typed DBMS.  In SQLite you wouldn't, because column types are just coercion targets, and it cheerfully coerces numbers to strings, integers to floats, floats to integers iff their value is an integer, and strings to integers or floats provided their content matches the grammar of an int or float literal.

It really isn't a formal restatement of what you said.  Adding surrogates and renormalizing doesn't necessarily yield a set of binary relations.  Again, decomposition from 5NF to 6NF for non-temporal relations is counterproductive.  In addition, entities may stand not only in binary relations but also ternary or even quaternary or quinary relations.

Decomposition of temporal relations to 5NF to 6NF is necessary from a logical standpoint only when the intervals during which what is the case are significant.  And when such is the case the relationships amongst the resulting relations are not strictly INDs.  On the other hand, decomposition to 6NF is only logically necessary for non-temporal 5NF relations when the instantiations of the predicates of the relvar by any tuple doesn't satisfy the constraint that whenever any that range over the domains identified by the same sequence of attributes hold, all of them hold.

For each relvar there is a family of predicates.  There is at least one predicate in the family of predicates for each functional dependency where the determinant is a key, including the trivial functional dependencies where the determinant and the dependent are the same key.  There may be more than one predicate for a given functional dependency, and decomposition becomes logically necessary only when there may be a tuple in the relation that doesn't satisfy all of the predicates in the family of predicates, for then it would not be possible to determine precisely which facts hold and which don't.

The schema is redundant when a predicate ranges over the domains for a set of attributes that is not a superkey.  In addition, when the same predicate is in the family of predicates for more than one relvar, then the schema is redundant.  When you decompose a non-temporal 5NF relvar, where every predicate in its family of predicates must be satisfied by every tuple in each possible relation, to a set of 6NF relvars, a circular set of INDs is required in order to prevent information loss.  As a result, the same predicate is in the family of predicates for every member of the 6NF relvars, and thus the schema becomes redundant.   In short, arbitrary decomposition to 6NF for non-temporal relations violates POOD.

 

Brian