DRY -- "ThirdNormalForm is the analogous principle for data." (?)

#1 · October 7, 2025, 2:31 am

I recently bumped into a particularly austere (desiccated?) proponent of DRY/Don't Repeat Yourself. My Subject quote is from the WikiWikiWeb discussion listed under 'External links' on the Wikipedia page for DRY. I get the drift that Real Programmers are DRY; ~~tables~~ relations with a Foreign Key are intrinsically suspect/wimpy for 'repeating' a value appearing elsewhere as Primary Key.

That WikiWeb discussion has a few mentions of Normalisation theory, but no acknowledgment of BCNF, 4NF or higher.
The mention of source code -- and of application configuration files in general, which is what I recently bumped into -- makes 'flat' ~~tables~~ relations theory less applicable: config is typically held in directories, which are hierarchical. If your .yaml is in a directory foo, does calling it foo.yaml amount to repeating yourself? If the .yaml includes text literals foo/"foo", is that more repeating? If you clone directory foo to directory bar, you then have to go through the contents overwriting all appearances of foo, so I guess you were repeating yourself.

"a function signature is duplicated at the function call site and the function definition site."

"in CeePlusPlus the interface and implementation for a class are typically specified in separate files, duplicating knowledge. ... duplication between .c and .h files is annoying"

These seem parallel to Primary Key vs Foreign Key references.

If Normal Form analysis doesn't give guidance as to what counts as repeating, is there some more abstract analysis framework to cover directories[**]/hierarchies/trees?

[**] I still recall fondly the System 38/AS400's refusal to support hierarchical directories. "Single-level store" applied for both memory/disk addressing and databases and applications. Life was simpler.

#2 · October 7, 2025, 6:34 am

IMO this is just stirring the pot. There is nothing here.

DRY says you should not assert the same fact twice. If the fact is a function to compute a value, it should not be repeated. Ditto if it's the value of a data item. DRY is not violated by multiple references to that function, or to that data item. FKs that reference a PK do not inherently violate DRY.

Non-normalised relations may violate DRY, which normalisation may alleviate.
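
A minimal Python sketch of the assertion-versus-reference distinction, assuming made-up supplier and shipment data (all names here are illustrative, not from any real schema). The city is asserted once and reached by key; a non-normalised copy asserts it again and goes stale when the single assertion is corrected.

```python
# Hypothetical sample data; the names are illustrative only.
# The supplier's city is asserted once, keyed on supplier number (the PK).
suppliers = {
    "S1": {"name": "Smith", "city": "London"},
    "S2": {"name": "Jones", "city": "Paris"},
}

# Each shipment refers to its supplier by key (the FK); no fact is restated.
shipments = [("S1", "P1", 300), ("S1", "P2", 200), ("S2", "P1", 100)]

# A non-normalised copy that asserts the city again on every shipment row.
shipments_wide = [(s, p, q, suppliers[s]["city"]) for s, p, q in shipments]

# Correct the single assertion.
suppliers["S1"]["city"] = "Leeds"

# Following the reference sees the correction...
cities_by_reference = {suppliers[s]["city"] for s, _, _ in shipments if s == "S1"}
# ...while the repeated assertion is now stale.
cities_by_copy = {city for s, _, _, city in shipments_wide if s == "S1"}
```

The references all agree with the corrected fact, while the copied values still carry the old one, which is the kind of update anomaly normalisation is meant to remove.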

Andl - A New Database Language - andl.org

#3 · October 7, 2025, 12:23 pm

Quote from AntC on October 7, 2025, 2:31 am

If Normal Form analysis doesn't give guidance as to what counts as repeating, is there some more abstract analysis framework to cover directories[**]/hierarchies/trees?

I don't think Normal Form analysis is entirely relevant for the dryness, although I'm quite sure it corresponds to cleaner/better/more orthogonal code in other ways.

I'd say the master programmer's perception of DRY is more closely related to predicate and set logic: each fact or logic rule is represented exactly once.

The novice programmer will often get confused by repetition on the implementation level, where exactly the same implementation could be duplicated for different logical purposes, which is NOT repetition.
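
A small Python sketch of that last point, using two hypothetical business rules (the 20% figures are assumptions for illustration). The bodies are textually identical, but each states a different fact, so merging them would couple rules that can change independently.

```python
def vat(net_price):
    """Tax rule: VAT at an assumed rate of 20%."""
    return net_price * 20 / 100


def staff_discount(net_price):
    """HR rule: staff discount, which today also happens to be 20%."""
    return net_price * 20 / 100


# Same implementation, different logical purpose: two facts, not one repeated.
same_today = vat(50) == staff_discount(50)
```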

#4 · October 7, 2025, 9:31 pm

DRY is a spectrum not a boolean, and a desirable inclination rather than an absolute.

Yes, a foreign key value or reference to a function or variable is repeating yourself, but less so than storing the same configuration value in multiple places or the same fact in multiple tables/relvars. Eliminating all repetition may -- depending on circumstances -- be worse than accepting some.

Evaluate tradeoffs.

I'm the forum administrator and lead developer of Rel. Email me at dave@armchair.mb.ca with the Subject 'TTM Forum'. Download Rel from https://reldb.org

#5 · October 7, 2025, 11:32 pm

I disagree. Asserting a fact twice is repeating yourself. Asserting it once and referring to it once is not (the fact and the reference are different things). The DRY principle advises against multiple assertions, but not multiple references to a single assertion. Do you have any authority that would say otherwise?

Multiple references to a single fact are the subject of a quite different issue: dependencies. The parallel of DRY is: Minimise Dependencies. If the fact asserted is wrong, every reference to it will be wrong too.

Andl - A New Database Language - andl.org

#6 · October 8, 2025, 4:38 pm

Quote from dandl on October 7, 2025, 11:32 pm

I disagree. Asserting a fact twice is repeating yourself. Asserting it once and referring to it once is not (the fact and the reference are different things). The DRY principle advises against multiple assertions, but not multiple references to a single assertion. Do you have any authority that would say otherwise?

One example of asserting the same fact multiple times is in microservice architecture, where some duplication of -- for example -- configuration or even database contents is accepted (the same table may be manually duplicated in multiple services, each with their own local database) because the cost of dependencies is considered worse than the cost of managing duplication.

Another example is denormalisation in analytics schemas to maximise performance when the cost of joins is prohibitive.

Another -- and this may be the controversial one -- is use of foreign keys on data values instead of immutable generated keys.

Do you have any authority that would say otherwise?

Me. I'm old enough, have been doing this long enough, and am ornery and arrogant enough, to claim authority on everything.

I'm the forum administrator and lead developer of Rel. Email me at dave@armchair.mb.ca with the Subject 'TTM Forum'. Download Rel from https://reldb.org

#7 · October 9, 2025, 7:06 am

Quote from dandl on October 7, 2025, 11:32 pm

... Do you have any authority that would say otherwise?

What's sauce for the goose is sauce for the gander: do _you_ have any authority that would say so? These are practitioners' rules of thumb/conventions/best practices (for some vague sense of 'best'). I doubt we're going to find an 'authority' doing any scientific analysis.

..., but not multiple references to a single assertion.

My o.p./the example I'd bumped into was such an example: inside directory foo (which holds source and build components for a package), you need a .yaml to tell the build tool where to go looking for all the bits, and the options to build them. Does naming it foo.yaml constitute a superfluous reference -- since it's already in directory foo? Does naming it package.yaml constitute a superfluous reference -- since the content makes it clear it's yaml for building a package? (And indeed you can probably find literals foo/"foo" inside -- are they all superfluous/should they say something like THE_Package?)
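
One way to picture the THE_Package idea is the Python sketch below. It assumes a made-up config format and made-up keys (`name`, `sources`), and it is not how any particular build tool behaves. The package name is stated once, by the directory, and every other occurrence is a placeholder resolved from it.

```python
from pathlib import PurePosixPath

PLACEHOLDER = "THE_Package"


def resolve(template: str, config_path: str) -> str:
    """Fill the placeholder with the name of the directory holding the config."""
    package = PurePosixPath(config_path).parent.name
    return template.replace(PLACEHOLDER, package)


# Hypothetical config text; the keys are assumptions, not a real tool's schema.
template = "name: THE_Package\nsources: THE_Package/src\n"

as_foo = resolve(template, "projects/foo/package.yaml")
as_bar = resolve(template, "projects/bar/package.yaml")
```

Under this arrangement, cloning directory foo to bar needs no edits inside the file, at the price of a dependency on the directory name.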

Multiple references to a single fact are the subject of a quite different issue: dependencies. The parallel of DRY is: Minimise Dependencies. If the fact asserted is wrong, every reference to it will be wrong too.

Consider a Variant type -- as they're usually called in schema design, or 'tagged union' within a programming language:

A P Product is one of Count, Bulk or Kit (Assembly):

Count means we can take a discrete one of these from the warehouse shelf; we need to know its weight and price, etc.

Bulk means we measure out so many metres/kilos/litres from a roll/hopper/tank; we need to know its weight and price per unit of measure.

Kit means we don't hold stock of this directly -- upon needing some, we go round the store grabbing the appropriate quantities of Count/Bulk components; we don't hold the weight/price, because that would be repeating what can be derived by summing the components.

How to design a DRY schema for that? That is, without introducing superfluous dependencies -- by which I mean declaring constraints to the effect that a PNo appearing in the Count sub-relvar can't also appear in the other sub-relvars.

Should we include a tag in the main P Product relvar? But that's only repeating what can be gleaned by (rather laboriously) looking through the sub-relvars.

The superfluous dependency bites if data entry inadvertently sets up a P with the wrong tag.
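
For comparison, here is how the programming-language side ('tagged union') might look as a Python sketch. The class and field names are assumptions for illustration. It sidesteps the relational question rather than answering it: the variant is the tag, stated once, and a Kit's weight and price are derived from its components rather than stored.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Count:
    pno: str
    weight: float  # per item
    price: float   # per item


@dataclass(frozen=True)
class Bulk:
    pno: str
    unit: str      # e.g. "metre", "kilo", "litre"
    weight: float  # per unit of measure
    price: float   # per unit of measure


@dataclass(frozen=True)
class Kit:
    pno: str
    components: tuple  # (product, quantity) pairs; no stored weight or price


Product = Union[Count, Bulk, Kit]


def weight(p: Product, qty: float = 1) -> float:
    """Weight of qty of p; for a Kit, derived by summing the components."""
    if isinstance(p, Kit):
        return qty * sum(weight(c, n) for c, n in p.components)
    return qty * p.weight


def price(p: Product, qty: float = 1) -> float:
    """Price of qty of p; for a Kit, derived by summing the components."""
    if isinstance(p, Kit):
        return qty * sum(price(c, n) for c, n in p.components)
    return qty * p.price


bolt = Count("P1", weight=0.5, price=2.0)
wire = Bulk("P2", unit="metre", weight=0.25, price=1.0)
kit = Kit("P3", components=((bolt, 4), (wire, 2)))
```

Holding products in a dictionary keyed on `pno` would give the 'a PNo is exactly one variant' constraint for free, which is precisely what the sub-relvar design has to declare separately.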

Quote from Dave Voorhis on October 8, 2025, 4:38 pm

... Another example is denormalisation in analytics schemas to maximise performance when the cost of joins is prohibitive.

I think this is OK (and is mentioned in the WikiWeb). The source transactional database is normalised. Then the facts are indeed stated once only. The analytics schema is generated from it purely mechanically [**]. No human is repeating themselves, only the machine.

[**] At least it had darned better be. Anybody who goes poking additional data only within the analytics tool deserves ... everything they'll get.
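
A minimal Python sketch of 'only the machine repeats itself', with hypothetical customer and order data. The wide table is never edited; it is regenerated from the normalised source, so a correction made once in the source reaches every derived row.

```python
# Normalised source (hypothetical data): each fact stated once.
customers = {"C1": "Leeds", "C2": "Paris"}               # customer -> city
orders = [("O1", "C1", 40), ("O2", "C1", 60), ("O3", "C2", 25)]


def build_analytics(customers, orders):
    """Regenerate the wide, denormalised table purely from the source."""
    return [(ono, cno, customers[cno], amount) for ono, cno, amount in orders]


wide = build_analytics(customers, orders)

# A correction is made once, in the source, and the derived table is rebuilt.
customers["C1"] = "York"
wide = build_analytics(customers, orders)
```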

#8 · October 18, 2025, 7:08 pm

I'll just take issue with the title : the analogous principle is *NOT* third normal form, it is the POOD (Principle of Orthogonal Design).

FWIW, Date ultimately came out with a formal definition of the principle (*), in the "Stating the obvious" book (I hope I'm not violating principles myself by quoting it here - I'll claim fair use) :

There must not exist relvars R1 and R2 (not necessarily distinct) such that :

There exists a join dependency STAR{X1, ..., Xn} that's irreducible with respect to R1, and
There exists some Xi (1 <= i <= n) and some possibly empty set of attribute renamings on the projection, R1X say, of R1 on Xi that maps [ed. after applying the renamings to R1X, I suppose] R1X into [ed. some ???] R1Y, say, and
R1Y has the same heading as some subset Y (distinct from Xi if R1 and R2 are one and the same) of the heading of R2, and
There exist restriction conditions c1 and c2, neither of which is a logical contradiction in the logical sense of that term, and
The following equality dependency holds : R1Y where c1 = R2Y where c2 (where R2Y is the projection of R2 on Y)

(*) I don't know if he ever did this before anywhere, nor if anyone ever saw it. Nor do I know whether, while it was certainly developed in collaboration with McGoveran, Date and McGoveran are still in full agreement that what I quoted is indeed the final definition of the idea that they had.

Home exercise : assess the time complexity of the algorithm that assesses compliance of a given schema to this principle.

Home exercise 2 : argue whether a single-relvar database (say, {vendor:vendors product:products manufacturer:manufacturers}) complies with the POOD as quoted here in the case where that single relvar is subject to a JD (say, {{vendor,product} {vendor,manufacturer} {product,manufacturer}}).
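
As a warm-up for exercise 2, the Python sketch below checks whether a given relation value satisfies that three-way JD, by joining its three binary projections and comparing with the original. It says nothing about the POOD itself, and it tests one value rather than the relvar constraint. Rows are assumed to be (vendor, product, manufacturer) tuples and the sample data is made up.

```python
def satisfies_jd(rel):
    """True if rel equals the join of its {v,p}, {v,m} and {p,m} projections."""
    vp = {(v, p) for v, p, m in rel}
    vm = {(v, m) for v, p, m in rel}
    pm = {(p, m) for v, p, m in rel}
    joined = {
        (v, p, m)
        for (v, p) in vp
        for (v2, m) in vm
        if v == v2 and (p, m) in pm
    }
    return joined == rel


# Hypothetical value that violates the JD: the join produces an extra tuple.
violating = {("V1", "P1", "M2"), ("V1", "P2", "M1"), ("V2", "P1", "M1")}

# Adding the tuple the join demands makes the value satisfy the JD.
satisfying = violating | {("V1", "P1", "M1")}
```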

Author of SIRA_PRISE

#9 · October 20, 2025, 12:21 am

Quote from Erwin on October 18, 2025, 7:08 pm

... the title : ...

Hi Erwin, It's a quote. My following '(?)' was the hint that I, too, was sceptical.

There must not exist relvars R1 and R2 ...

The trouble with trying to use a Principle or Normal Form specified in terms of RelVars is how to interpret into structures (like source code/components or configuration specs) that are not RelVars? For example:

...

... a logical contradiction in the logical sense of that term, ...

A CeeLanguage .h file contains type specifications for functions, typedefs, etc., but no executable code.

Function bodies giving the code are in .c files, which #include the .h -- so which is the declaration, which is the use? Or (in database terms) which holds the Primary Key, which the Foreign Key?

**But** .c files also contain a type specification -- indeed they have to be able to do that for private functions not exported/shared in the .h.

So we can get a mis-match/contradiction between the .c vs .h. Or if they're consistent, one is repeating the other(?).

... the algorithm that assesses compliance of a given schema to this principle.

Haskell source isn't as repeaty as .h, .c:

There is a single source file.

Its module header lists the names to be exported (other names are local scope), bare name only, no specification or other attributes.

A name's type is given by a stand-alone type signature (or declaration).

A function's body is given by a set of equations.

**But** the compiler can usually infer a function's type from its body, so type signatures are optional -- that is, repeaty.

Nevertheless, giving stand-alone type signatures is best practice, with which I agree:

It's much easier to write a type signature than get the function's body correct, especially because a set of equations must yield a generalised signature, which is easy for compilers to figure out, humans not so much.

Recursive function bodies are particularly prone to getting over-generalised to yield an infinite type.

In this case the repeat/redundancy is a Good Thing/belt-and-braces/sanity check.

Using the POOD, how to account for a contradiction between the stand-alone type signature -- which is the sort of fact that could be stated in a RelVar -- and an inferred type, which is not 'stated' anywhere but is implied by a function body via the rules of type inference?

The running time of type inference is typically measured in minutes -- that is, way more than would be acceptable for database updates.

#10 · October 20, 2025, 6:57 pm

"The trouble with trying to use a Principle or Normal Form specified in terms of RelVars is how to interpret into structures (like source code/components or configuration specs) that are not RelVars?"

I'm pretty sure the POOD was not defined with structures in mind "that are not relvars" ...

Or perhaps IOW, the problem you talk about might precisely be *you* *wanting* to find such an interpretation ...

(BTW I'm also pretty confident that anything like "configuration specs" is really trivially mapped to relations (/relvars), most often constrained to be (/hold exclusively relations) of cardinality one.)
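
A rough Python sketch of that mapping, assuming a hypothetical heading for a build configuration (the attribute names are invented for illustration). The config is a relation, modelled here as a set of tuples, constrained to hold exactly one tuple.

```python
from typing import NamedTuple


class BuildConfig(NamedTuple):
    """Hypothetical heading for a build configuration relvar."""
    package: str
    source_dir: str
    optimise: bool


def the_tuple(relation):
    """Return the only tuple, enforcing the cardinality-one constraint."""
    if len(relation) != 1:
        raise ValueError(f"expected exactly one tuple, found {len(relation)}")
    (row,) = relation
    return row


config = {BuildConfig(package="foo", source_dir="src", optimise=True)}
settings = the_tuple(config)
```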

Author of SIRA_PRISE

The Forum for Discussion about The Third Manifesto and Related Matters
