Where to start for a beginner?
Quote from p c on November 25, 2020, 11:53 amI enjoyed this sentence in one of the replies to your where-to-start question: "...almost anything that isn't self-evident or mathematically provable is controversial to some -- as is the case when you reach a certain point in most fields."
Mathematical proofs for specific situations are almost completely absent from the reading list so far. Yet they are the formal foundation of Codd's invention. Worse, much of the list is based on unprovable seat-of-the-pants implementation practices.
There are practical reasons for such formality, an important one being that it allows accurate (logically valid) comparison of alternative data designs. For example you might define two different designs as equivalent if their interpretations never contradict each other. Proving that requires logical technique which is one reason why Codd wrote the 1970 paper in logical terms.
Regarding reaching "a certain point" beyond which controversy arises one would hope such a point isn't reached before the most elementary fundamentals have been reached.
There is an informal, somewhat propositional argument in Appendix E of the Dtatrm book, 2nd edition, pages 459-461, at the Third Manifesto site, where it's claimed that a desired assertion (the argument's premise) has an "indeterminate" conclusion (in other words not relationally possible because there are supposedly three non-equivalent representations for asserting it). But it is not a universally valid mathematical argument because it presumes that a particular choice of implementation representation always determines all possible logical interpretations.
The desired assertion is :
"IT IS NOT THE CASE THAT THE_SHOP_IS_OPEN_AND_THE_ALARM_IS_SET".
Logically, that assertion is a premise represented by the negation of a conjunction of two relations, for example, it can be represented by the term ¬(A∧B) such that A and B could name either propositions or relation extensions or predicates. Codd allowed those representations.
But mathematically, the informal argument suffers from ignoring this sentence in Codd 1970:
Section 1.3 A Relational View of Data, footnote 1: "More concisely, R is a subset of the Cartesian product ..."
The significance of that sentence is that a tuple can be said to represent a logical conjunction of attribute values. For example, if a tuple {a,b} does not satisfy predicate Rab, then {a,b} is not an element of relation R, likewise when a tuple {a,b,c} doesn't satisfy predicate Rabc. Nor can any non-conjunctive combination of projections on R evaluate to such a tuple.
Mathematically, this entails that the tuple {a,b} can be represented as two tuples, which in turn can be thought of as singleton unary relations, call them A and B, whose conjunction R is predicated on these equivalences (aka bi-conditions or bi-implications/entailments): (A∧ B↔A) and (A∧ B↔B). They must be premises of a mathematical argument involving the logical implications of R.
By definition, according to footnote 1, R cannot represent the disjunction A∨ B, because to do so some tuples would need to accompany each value for an attribute of A with a number of values for attribute B, and vice-versa.
Logical equivalence (not necessarily equality) of propositions P and Q means that when P is true, Q is true and when P is false, Q is false and when Q is true, P is true and when Q is false, P is false. In the relational model the equivalence (R↔A) (equivalence of two relations represented as a proposition) can be evaluated as a relation only when the set of tuples of R is equivalent in the above sense to the set of tuples of A.
Logical technique makes it easy to see that (A∧B↔A)∧(A∧B↔B) is logically equivalent to (A↔B). To be logically valid, any formal logical argument involving tuple {a,b} must include (A↔B) as a premise. This means that to be relational any database system must necessarily reflect that premise for all tuples. In such a system, the premise (A↔B) is part of what Codd called the "internal representation" (see the very first sentence of the 1970 paper - representation is supreme in the RM).
Formally, the "indeterminate" conclusion of the informal argument can be written as a union (¬A∨¬B), in other words, de Morgan's law applies. The presence of the disjunctive connective '∨' should be enough to stop right here and reject the argument but the point can be pushed a little further.
First, the formal version of the informal argument can be written as de Morgan's equivalence:
¬(A∧B) // premise of informal argument
↔ // ... is equivalent to...
¬A∨¬B // this logically valid conclusionBut that formal argument omits the relationally necessary premise (A↔B). Now form a logical version of the informal argument that includes the equivalence (A↔B) as a premise:
(A↔B) // Cartesian product is a premise
∧ // ... and a second premise is ...
¬(A∧B) // ...premise of informal argument
↔ // ... is NOT equivalent to...
¬A∨¬B // this invalid conclusion.The conclusion is logically invalid because when the conjunction in the second line is false, the conclusion is true, in other words the conclusion is not universally true, is never not dependent on arbitrary interpretation.
(It is easy to make a simple truth table to verify the invalidity. A clever philospher has made an online tool which makes such verification even easier, which you can test by pasting the argument, without my '//' comments, at https://umsu.de/trees ...)
By contrast, this argument is valid:
(A↔B)∧¬(A∧B)↔¬A∧¬B // valid argument
(If you are unfamiliar with formal notations you may need to look up the customary precedence and associativity rules.) The usefulness of such an equivalence is that it is extremely declarative, for example the argument itself is equivalent to an argument that reverses the conclusion and the premise.
This argument makes it quite clear that no matter when A and B are represented by tables or relvars, the assertion is satisfed only when they are both negated.
Codd assumed a database system that reflects predicate logic formulas. If you are unfamiliar with such formulas, you could look at chapter 4 of a free online book that was suggested to this group by a long-ago poster, Forallx by PD Magnus. It mentions why it is sometimes valid to interpret a predicate formula as a propositional formula, that is when both interpretations are logically equivalent.
Here is a predicate formula that expresses the valid argument:
∀x∀y((Ax↔By)∧¬(Ax∧By)↔¬Ax∧¬By)
Like with the propositional arguments, the umsu proof generator gives a detailed proof when a predicate formula is valid. You might want to try it for other situations involving normalized relations which will often show that for many purposes type theory and normalization theory are not nearly as fundamental as they are often said to be.
I enjoyed this sentence in one of the replies to your where-to-start question: "...almost anything that isn't self-evident or mathematically provable is controversial to some -- as is the case when you reach a certain point in most fields."
Mathematical proofs for specific situations are almost completely absent from the reading list so far. Yet they are the formal foundation of Codd's invention. Worse, much of the list is based on unprovable seat-of-the-pants implementation practices.
There are practical reasons for such formality, an important one being that it allows accurate (logically valid) comparison of alternative data designs. For example you might define two different designs as equivalent if their interpretations never contradict each other. Proving that requires logical technique which is one reason why Codd wrote the 1970 paper in logical terms.
Regarding reaching "a certain point" beyond which controversy arises one would hope such a point isn't reached before the most elementary fundamentals have been reached.
There is an informal, somewhat propositional argument in Appendix E of the Dtatrm book, 2nd edition, pages 459-461, at the Third Manifesto site, where it's claimed that a desired assertion (the argument's premise) has an "indeterminate" conclusion (in other words not relationally possible because there are supposedly three non-equivalent representations for asserting it). But it is not a universally valid mathematical argument because it presumes that a particular choice of implementation representation always determines all possible logical interpretations.
The desired assertion is :
"IT IS NOT THE CASE THAT THE_SHOP_IS_OPEN_AND_THE_ALARM_IS_SET".
Logically, that assertion is a premise represented by the negation of a conjunction of two relations, for example, it can be represented by the term ¬(A∧B) such that A and B could name either propositions or relation extensions or predicates. Codd allowed those representations.
But mathematically, the informal argument suffers from ignoring this sentence in Codd 1970:
Section 1.3 A Relational View of Data, footnote 1: "More concisely, R is a subset of the Cartesian product ..."
The significance of that sentence is that a tuple can be said to represent a logical conjunction of attribute values. For example, if a tuple {a,b} does not satisfy predicate Rab, then {a,b} is not an element of relation R, likewise when a tuple {a,b,c} doesn't satisfy predicate Rabc. Nor can any non-conjunctive combination of projections on R evaluate to such a tuple.
Mathematically, this entails that the tuple {a,b} can be represented as two tuples, which in turn can be thought of as singleton unary relations, call them A and B, whose conjunction R is predicated on these equivalences (aka bi-conditions or bi-implications/entailments): (A∧ B↔A) and (A∧ B↔B). They must be premises of a mathematical argument involving the logical implications of R.
By definition, according to footnote 1, R cannot represent the disjunction A∨ B, because to do so some tuples would need to accompany each value for an attribute of A with a number of values for attribute B, and vice-versa.
Logical equivalence (not necessarily equality) of propositions P and Q means that when P is true, Q is true and when P is false, Q is false and when Q is true, P is true and when Q is false, P is false. In the relational model the equivalence (R↔A) (equivalence of two relations represented as a proposition) can be evaluated as a relation only when the set of tuples of R is equivalent in the above sense to the set of tuples of A.
Logical technique makes it easy to see that (A∧B↔A)∧(A∧B↔B) is logically equivalent to (A↔B). To be logically valid, any formal logical argument involving tuple {a,b} must include (A↔B) as a premise. This means that to be relational any database system must necessarily reflect that premise for all tuples. In such a system, the premise (A↔B) is part of what Codd called the "internal representation" (see the very first sentence of the 1970 paper - representation is supreme in the RM).
Formally, the "indeterminate" conclusion of the informal argument can be written as a union (¬A∨¬B), in other words, de Morgan's law applies. The presence of the disjunctive connective '∨' should be enough to stop right here and reject the argument but the point can be pushed a little further.
First, the formal version of the informal argument can be written as de Morgan's equivalence:
¬(A∧B) // premise of informal argument
↔ // ... is equivalent to...
¬A∨¬B // this logically valid conclusion
But that formal argument omits the relationally necessary premise (A↔B). Now form a logical version of the informal argument that includes the equivalence (A↔B) as a premise:
(A↔B) // Cartesian product is a premise
∧ // ... and a second premise is ...
¬(A∧B) // ...premise of informal argument
↔ // ... is NOT equivalent to...
¬A∨¬B // this invalid conclusion.
The conclusion is logically invalid because when the conjunction in the second line is false, the conclusion is true, in other words the conclusion is not universally true, is never not dependent on arbitrary interpretation.
(It is easy to make a simple truth table to verify the invalidity. A clever philospher has made an online tool which makes such verification even easier, which you can test by pasting the argument, without my '//' comments, at https://umsu.de/trees ...)
By contrast, this argument is valid:
(A↔B)∧¬(A∧B)↔¬A∧¬B // valid argument
(If you are unfamiliar with formal notations you may need to look up the customary precedence and associativity rules.) The usefulness of such an equivalence is that it is extremely declarative, for example the argument itself is equivalent to an argument that reverses the conclusion and the premise.
This argument makes it quite clear that no matter when A and B are represented by tables or relvars, the assertion is satisfed only when they are both negated.
Codd assumed a database system that reflects predicate logic formulas. If you are unfamiliar with such formulas, you could look at chapter 4 of a free online book that was suggested to this group by a long-ago poster, Forallx by PD Magnus. It mentions why it is sometimes valid to interpret a predicate formula as a propositional formula, that is when both interpretations are logically equivalent.
Here is a predicate formula that expresses the valid argument:
∀x∀y((Ax↔By)∧¬(Ax∧By)↔¬Ax∧¬By)
Like with the propositional arguments, the umsu proof generator gives a detailed proof when a predicate formula is valid. You might want to try it for other situations involving normalized relations which will often show that for many purposes type theory and normalization theory are not nearly as fundamental as they are often said to be.
Quote from Brian S on November 27, 2020, 5:05 amQuote from Hugh on November 20, 2020, 3:01 pmQuote from Efrit on November 19, 2020, 7:31 pmThanks for the advice Dave Voorhis, I’ll definitely take a look to the text you mentionned.
To answer your question dandl, I do try to be a better developer in general, but I am also interested in having a better insight in general in the field. What I liked about the article in French I mentioned is that his author took time to define the basic definitions of concepts used in the relational model, by referring for example to researchers who seems to be well-known, as Edgar Codd, Chris Date or Hugh Darwens. This article however said a lot of things that were very different from what I’ve learned in my engineering school. Not only about the fact that the SQL model is pretty different than the relational one, but also about what is accepted as good practices in the design of a database. For example, related to normal forms, I have often read that, as a good first approach, or by default, we should try to normalize a relational database to the 3rd normal form. But this author argued that this advice was obsolete, and that today you should aim to the 5th one (possibly by starting directly by Boyce-Codd normal form instead of the 1st if you are experimented). Another example would about the primary key and candidate keys. I only heard before about the first one, and never the latter.
So the first thing I think I should do is to read a text that introduce the field with precision. But I also don’t want to be lost in too much advanced details and above all about things that are too much controversial yet. I hope that An Introduction to Database Systems (8th Edition) is the adequate one.
I am not opposed at all to look into Datalog, why not? I read a bit the Wikipedia article on it, and the first thing that is troubling for me is that it is a subset of Prolog, which I happen to know because I had a course on it in my engineering school some years ago (and I was pretty bad at it unfortunately). As a side note, I have also discovered Mercury some times ago. If you have good introductory texts about all of this, I would gladly read them.
Of course, I also hope that learning all of this will help me to do a better work as a developer. I’ll open another topic to ask advice about a problem I encountered in my job.
I was going to suggest to get a good understanding of relational database theory before delving into normal forms, but for the record I'd like you to know that the 6NF defined by me and Chris Date is also important, especially in a database that is designed, either by choice or necessity, to avoid constructs akin to SQL's NULL. But remember that normal forms are only advice. It's okay to violate them so long as the designer is happy that with particular cases either the general disadvantages of doing that are inapplicable or suitable countermeasures can be put in place.
As for relational theory, my online free book An Introduction to Relational Database Theory starts at the very beginning and was derived from a course I used to teach to undergraduates. There's also a companion book, SQL: A Comparative Survey, in which I tried to draw attention to the controversies surrounding that language without expressing my own strong opinions that have been expressed in some of my books with Chris Date.
Hugh
6NF is important, but not for avoiding constructs akin to SQL NULL. Decomposing a relation schema from 5NF to 6NF "always" requires the introduction of a cyclical set of referential constraints. For temporal schemes, the benefits outweigh the cost, but that is not the case for non-temporal schemes. The predicate of a non-temporal 5NF relation that is not already in 6NF is conjunctive--e.g., {EMP#, RATE, DEPT#}:"The employee identified by employee number <EMP#> is paid <RATE> per hour and the employee identified by <EMP#> works in the department identified by <DEPT#>." Decomposition from 5NF to 6NF changes the logical connective from a conjunction to a disjunction. It follows in this example, that the FD EMP# -> {RATE,DEPT#} is lost, which means that you can have an employee with a pay rate but no department or vice-versa. The only way to prevent that is to introduce a proper circular set of INDs.
Decomposition to avoid NULLs carries its own set of problems. For instance, how do you explicitly indicate that a particular attribute is always applicable, even if it isn't always supplied? Using the decomposition method, you need a separate relation just for that purpose. That kind of redundancy can be avoided by substituting a relation-valued (ZeroOrOne) attribute in its place. There are several benefits to this approach. (1) The indication is explicit. (2) The underlying predicate is retained. (3) Decomposition is not necessary. (4) There is no need to introduce proper circular sets of INDs. And (5) fewer joins are needed to answer queries.
That is not to say that decomposition to avoid SQL NULLs is always a problem. There are cases where it makes sense--in particular, when a given attribute is not always applicable. In such cases, the predicate of the relation is already disjunctive, and the dependency from the key to the attribute is not strictly functional, as the attribute in question is not always applicable. There is therefore no downside--at least not from a logical standpoint--to decomposing such a schema.
This is also not to say that there aren't problems when ZOO attributes are used, but those problems are only due to the fact that information has not been supplied, rather than being due to the mechanism for representing it.
Brian
Quote from Hugh on November 20, 2020, 3:01 pmQuote from Efrit on November 19, 2020, 7:31 pmThanks for the advice Dave Voorhis, I’ll definitely take a look to the text you mentionned.
To answer your question dandl, I do try to be a better developer in general, but I am also interested in having a better insight in general in the field. What I liked about the article in French I mentioned is that his author took time to define the basic definitions of concepts used in the relational model, by referring for example to researchers who seems to be well-known, as Edgar Codd, Chris Date or Hugh Darwens. This article however said a lot of things that were very different from what I’ve learned in my engineering school. Not only about the fact that the SQL model is pretty different than the relational one, but also about what is accepted as good practices in the design of a database. For example, related to normal forms, I have often read that, as a good first approach, or by default, we should try to normalize a relational database to the 3rd normal form. But this author argued that this advice was obsolete, and that today you should aim to the 5th one (possibly by starting directly by Boyce-Codd normal form instead of the 1st if you are experimented). Another example would about the primary key and candidate keys. I only heard before about the first one, and never the latter.
So the first thing I think I should do is to read a text that introduce the field with precision. But I also don’t want to be lost in too much advanced details and above all about things that are too much controversial yet. I hope that An Introduction to Database Systems (8th Edition) is the adequate one.
I am not opposed at all to look into Datalog, why not? I read a bit the Wikipedia article on it, and the first thing that is troubling for me is that it is a subset of Prolog, which I happen to know because I had a course on it in my engineering school some years ago (and I was pretty bad at it unfortunately). As a side note, I have also discovered Mercury some times ago. If you have good introductory texts about all of this, I would gladly read them.
Of course, I also hope that learning all of this will help me to do a better work as a developer. I’ll open another topic to ask advice about a problem I encountered in my job.
I was going to suggest to get a good understanding of relational database theory before delving into normal forms, but for the record I'd like you to know that the 6NF defined by me and Chris Date is also important, especially in a database that is designed, either by choice or necessity, to avoid constructs akin to SQL's NULL. But remember that normal forms are only advice. It's okay to violate them so long as the designer is happy that with particular cases either the general disadvantages of doing that are inapplicable or suitable countermeasures can be put in place.
As for relational theory, my online free book An Introduction to Relational Database Theory starts at the very beginning and was derived from a course I used to teach to undergraduates. There's also a companion book, SQL: A Comparative Survey, in which I tried to draw attention to the controversies surrounding that language without expressing my own strong opinions that have been expressed in some of my books with Chris Date.
Hugh
6NF is important, but not for avoiding constructs akin to SQL NULL. Decomposing a relation schema from 5NF to 6NF "always" requires the introduction of a cyclical set of referential constraints. For temporal schemes, the benefits outweigh the cost, but that is not the case for non-temporal schemes. The predicate of a non-temporal 5NF relation that is not already in 6NF is conjunctive--e.g., {EMP#, RATE, DEPT#}:"The employee identified by employee number <EMP#> is paid <RATE> per hour and the employee identified by <EMP#> works in the department identified by <DEPT#>." Decomposition from 5NF to 6NF changes the logical connective from a conjunction to a disjunction. It follows in this example, that the FD EMP# -> {RATE,DEPT#} is lost, which means that you can have an employee with a pay rate but no department or vice-versa. The only way to prevent that is to introduce a proper circular set of INDs.
Decomposition to avoid NULLs carries its own set of problems. For instance, how do you explicitly indicate that a particular attribute is always applicable, even if it isn't always supplied? Using the decomposition method, you need a separate relation just for that purpose. That kind of redundancy can be avoided by substituting a relation-valued (ZeroOrOne) attribute in its place. There are several benefits to this approach. (1) The indication is explicit. (2) The underlying predicate is retained. (3) Decomposition is not necessary. (4) There is no need to introduce proper circular sets of INDs. And (5) fewer joins are needed to answer queries.
That is not to say that decomposition to avoid SQL NULLs is always a problem. There are cases where it makes sense--in particular, when a given attribute is not always applicable. In such cases, the predicate of the relation is already disjunctive, and the dependency from the key to the attribute is not strictly functional, as the attribute in question is not always applicable. There is therefore no downside--at least not from a logical standpoint--to decomposing such a schema.
This is also not to say that there aren't problems when ZOO attributes are used, but those problems are only due to the fact that information has not been supplied, rather than being due to the mechanism for representing it.
Brian
Quote from p c on November 27, 2020, 1:55 pm"6NF is important, but not for avoiding constructs [sic] akin to SQL NULL" is completely correct as far as normalization is concerned. When a design reflects all requirements, normalization can minimize the overall storage needed as well as minimize the amount and complexity of constraints. That's not so for a design that is missing requirements. Missing requirements do not necessarily entail missing data and vice versa.
When a requirement arises that the present design doesn't reflect, introducing a disjunction via decomposition can sometimes enable the requirement.
A logical disjunction corresponds to a set union in relational algebra.
For example, the heading {EMP#, RATE, DEPT#} can support exactly three predicates when only {EMP#} is a key. When a fourth predicate is needed, allowing employees without one or two of rates and departments, the designer must usually expect an additional stored relation, at least when using present dbms'es and assuming that no domain augmentation such as allowing the dependent attribute values to be empty strings or numeric zero is compatible (doesn't contradict existing results) with existing applications, doesn't misinterpret existing relation relation representations to applications.
That additional unary relation introduces a union-compatible disjunction in addition to existing conjunctions implicit in the original extensions. No application change should be needed, such as introducing SQL Nulls, if a relational dbms is available that doesn't use the typical crippled catalog. The catalog merely replaces the internal representations of the application relations.
In fact, strictly speaking, from a non-physical perspective the existing extension doesn't even need to be replaced, only its catalog definition. The requirement is equivalent to adding an optional attribute to an existing relation.
But no conjunction 'changes' into a disjunction. Instead something is replaced with one or more other things. As the old joke goes, it doesn't take more than one psychiatrist to change a connective, provided the connective wants to change.
Disjunction in decompositions simply reflects the definition of a relational tuple.
Normalization is an optimization. It exploits the happy coincidence when storage simplification and logical simplification have mutual effect.
Nulls are an attempt to encode and invoke multiple predicate reresentations with a single non-relational representation. There appears to be no logical reason why that couldn't be implemented but requiring applications to both enforce null logic and prove results moves much of the logical burden from the dbms to all applications, just as so-called parameterized tables do in an even more complicated way (apparently the latter were once called Entity-Attribute-Value tables - they were notorious for creating career application coding jobs and extraordinary physical machine consumption, it didn't occur to implementers that they needed to write a compiler).
Exposing internal representations to applications is extremely uneconomic. Implementations of nulls should be required to show how they simplify storage and data constraints. When they don't, a severe lack of systems thinking is exposed.
"6NF is important, but not for avoiding constructs [sic] akin to SQL NULL" is completely correct as far as normalization is concerned. When a design reflects all requirements, normalization can minimize the overall storage needed as well as minimize the amount and complexity of constraints. That's not so for a design that is missing requirements. Missing requirements do not necessarily entail missing data and vice versa.
When a requirement arises that the present design doesn't reflect, introducing a disjunction via decomposition can sometimes enable the requirement.
A logical disjunction corresponds to a set union in relational algebra.
For example, the heading {EMP#, RATE, DEPT#} can support exactly three predicates when only {EMP#} is a key. When a fourth predicate is needed, allowing employees without one or two of rates and departments, the designer must usually expect an additional stored relation, at least when using present dbms'es and assuming that no domain augmentation such as allowing the dependent attribute values to be empty strings or numeric zero is compatible (doesn't contradict existing results) with existing applications, doesn't misinterpret existing relation relation representations to applications.
That additional unary relation introduces a union-compatible disjunction in addition to existing conjunctions implicit in the original extensions. No application change should be needed, such as introducing SQL Nulls, if a relational dbms is available that doesn't use the typical crippled catalog. The catalog merely replaces the internal representations of the application relations.
In fact, strictly speaking, from a non-physical perspective the existing extension doesn't even need to be replaced, only its catalog definition. The requirement is equivalent to adding an optional attribute to an existing relation.
But no conjunction 'changes' into a disjunction. Instead something is replaced with one or more other things. As the old joke goes, it doesn't take more than one psychiatrist to change a connective, provided the connective wants to change.
Disjunction in decompositions simply reflects the definition of a relational tuple.
Normalization is an optimization. It exploits the happy coincidence when storage simplification and logical simplification have mutual effect.
Nulls are an attempt to encode and invoke multiple predicate reresentations with a single non-relational representation. There appears to be no logical reason why that couldn't be implemented but requiring applications to both enforce null logic and prove results moves much of the logical burden from the dbms to all applications, just as so-called parameterized tables do in an even more complicated way (apparently the latter were once called Entity-Attribute-Value tables - they were notorious for creating career application coding jobs and extraordinary physical machine consumption, it didn't occur to implementers that they needed to write a compiler).
Exposing internal representations to applications is extremely uneconomic. Implementations of nulls should be required to show how they simplify storage and data constraints. When they don't, a severe lack of systems thinking is exposed.
Quote from Brian S on November 28, 2020, 12:10 amI'm sorry, p c, but I just don't follow your train of thought: Are you denying that the original predicate is the conjunction of two distinct predicates? Or are you denying that the database predicate of the 6NF decomposition of the original relation is disjunctive? It's also unclear why you chose to mention applications, for from the standpoint of the relevant database theory, the application does not bear on what information is represented in the database--that is, whether the information contained within is the logical sum of the propositions formed by instantiating the conjunctive predicate, "The employee identified by employee number <EMP#> is paid <RATE> per hour and the employee identified by <EMP#> works in the department identified by <DEPT#>" or by instantiating the predicates "The employee identified by employee number <EMP#> is paid <RATE> per hour" or "The employee identified by <EMP#> works in the department identified by <DEPT#>" or both.
By decomposing the 5NF relation to 6NF, what is instantiated by each tuple is no longer the conjunction of the two individual predicates, but rather the disjunction.
Note that the unary predicate over EMP# is implied by each of the binary predicates, and I would argue cannot stand apart from them due to the FD EMP# -> {RATE, DEPT#}.
I'm sorry, p c, but I just don't follow your train of thought: Are you denying that the original predicate is the conjunction of two distinct predicates? Or are you denying that the database predicate of the 6NF decomposition of the original relation is disjunctive? It's also unclear why you chose to mention applications, for from the standpoint of the relevant database theory, the application does not bear on what information is represented in the database--that is, whether the information contained within is the logical sum of the propositions formed by instantiating the conjunctive predicate, "The employee identified by employee number <EMP#> is paid <RATE> per hour and the employee identified by <EMP#> works in the department identified by <DEPT#>" or by instantiating the predicates "The employee identified by employee number <EMP#> is paid <RATE> per hour" or "The employee identified by <EMP#> works in the department identified by <DEPT#>" or both.
By decomposing the 5NF relation to 6NF, what is instantiated by each tuple is no longer the conjunction of the two individual predicates, but rather the disjunction.
Note that the unary predicate over EMP# is implied by each of the binary predicates, and I would argue cannot stand apart from them due to the FD EMP# -> {RATE, DEPT#}.
Quote from Hugh on November 28, 2020, 2:39 pmFrom Brian S: "Decomposing a relation schema from 5NF to 6NF "always" requires the introduction of a cyclical set of referential constraints."
Does "always", in inverted commas like that, mean "not quite always"? Or what?
Just to be clear, I assume that "Decomposing ... to two or more relation schemas" is what Brian meant. If there is only one non-prime attribute, then the 5NF schema is already in 6NF. But it is not the case that in all other cases cyclical referential constraints arise. Specifically, if the reason for decomposing is to avoid nulls (or even default values), then FKs in one direction only suffice. For example, if not all suppliers have names, not all have statuses, and not all have locations, then we could have four 6NF relvars, S, SN, SS, and SL. The last three each have foreign key {S#} referencing S and S has no foreign keys.
Hugh
From Brian S: "Decomposing a relation schema from 5NF to 6NF "always" requires the introduction of a cyclical set of referential constraints."
Does "always", in inverted commas like that, mean "not quite always"? Or what?
Just to be clear, I assume that "Decomposing ... to two or more relation schemas" is what Brian meant. If there is only one non-prime attribute, then the 5NF schema is already in 6NF. But it is not the case that in all other cases cyclical referential constraints arise. Specifically, if the reason for decomposing is to avoid nulls (or even default values), then FKs in one direction only suffice. For example, if not all suppliers have names, not all have statuses, and not all have locations, then we could have four 6NF relvars, S, SN, SS, and SL. The last three each have foreign key {S#} referencing S and S has no foreign keys.
Hugh
Quote from Brian S on November 28, 2020, 5:22 pmQuote from Hugh on November 28, 2020, 2:39 pmFrom Brian S: "Decomposing a relation schema from 5NF to 6NF "always" requires the introduction of a cyclical set of referential constraints."
Does "always", in inverted commas like that, mean "not quite always"? Or what?
Just to be clear, I assume that "Decomposing ... to two or more relation schemas" is what Brian meant. If there is only one non-prime attribute, then the 5NF schema is already in 6NF. But it is not the case that in all other cases cyclical referential constraints arise. Specifically, if the reason for decomposing is to avoid nulls (or even default values), then FKs in one direction only suffice. For example, if not all suppliers have names, not all have statuses, and not all have locations, then we could have four 6NF relvars, S, SN, SS, and SL. The last three each have foreign key {S#} referencing S and S has no foreign keys.
Hugh
The quotation marks were meant for emphasis. The point of my post is that decomposition just to avoid nulls--applicable nulls in particular--is logically flawed. If a relation R is in 5NF to begin with, then all dependent attributes are applicable, and the propositions represented by each tuple in R satisfies the constraint
P1 ∨ P2 ∨ ... ∨ Pk ∨ ... ∨ Pn → P1 ∧ P2 ∧ ... ∧ Pk ∧ ... ∧ Pn
where each Pk ranges over a 6NF projection on R. Some of the predicates that range over candidate keys only are existentially quantified. For example, each instantiation of the binary predicate "the employee identified by employee number <EMP#> is paid <RATE> per hour" implies the instantiation of the unary predicate "There is exactly one RATE such that the employee identified by employee number <EMP#> is paid RATE per hour, which is due to the functional dependency EMP# -> RATE. Information that is actually missing is characterized by only the unary predicate being instantiated.
It is redundant to associate the same predicate symbol with more than one relation schema. Decomposition to avoid applicable nulls does precisely that. For instance, if predicate symbol P1 were assigned to the predicate "the employee identified by employee number <EMP#> is paid <RATE> per hour", and the 5NF relation schema were decomposed, then that same predicate symbol would be associated with two relation schemas, namely those that participate in the foreign key constraint.
Also, it is not clear just from the structure of the database whether the implied unary predicate is associated with just the referencing relation schema or also the referenced one. Substitute the attribute OT_RATE above, which does not apply to exempt salaried employees. In that case the implied unary predicate is only associated with the referencing relation rather than both. One cannot therefore just assume that the implied unary predicate is also associated with the referenced relation schema. This indeterminacy is a consequence of the redundancy caused by associating the same predicate symbol with more than one relation schema.
It is for this reason that it makes sense to use ZOO attributes in place of applicable nulls and to decompose to eliminate inapplicable nulls. When ZOO attributes are used for information that may actually be missing, each atomic predicate should be associated with one schema only. (unless there is a POOD violation. One could certainly argue that the result of decomposing to avoid applicable nulls violates POOD, as one can obtain the same information from two distinct relations--especially in the case where no information is actually missing.) For inapplicable nulls, on the other hand, I would argue that when inapplicable nulls are permitted, the schema is not in 5NF to begin with, and so decomposition is therefore indicated.
Finally, the addressing of missing information should be as an afterthought, rather than a primary design goal. The database schema should be designed first and normalized under the assumption that all relevant information will always be supplied. The resulting schema can then be used to identify the relevant internal predicates and by extension the structure of the external predicates. Only then should adjustments be made to accommodate information that is actually missing. It is best, therefore, to substitute ZOO attributes where necessary to make that accommodation rather than introducing redundancy by decomposing the schema further.
Brian
Quote from Hugh on November 28, 2020, 2:39 pmFrom Brian S: "Decomposing a relation schema from 5NF to 6NF "always" requires the introduction of a cyclical set of referential constraints."
Does "always", in inverted commas like that, mean "not quite always"? Or what?
Just to be clear, I assume that "Decomposing ... to two or more relation schemas" is what Brian meant. If there is only one non-prime attribute, then the 5NF schema is already in 6NF. But it is not the case that in all other cases cyclical referential constraints arise. Specifically, if the reason for decomposing is to avoid nulls (or even default values), then FKs in one direction only suffice. For example, if not all suppliers have names, not all have statuses, and not all have locations, then we could have four 6NF relvars, S, SN, SS, and SL. The last three each have foreign key {S#} referencing S and S has no foreign keys.
Hugh
The quotation marks were meant for emphasis. The point of my post is that decomposition just to avoid nulls--applicable nulls in particular--is logically flawed. If a relation R is in 5NF to begin with, then all dependent attributes are applicable, and the propositions represented by each tuple in R satisfies the constraint
P1 ∨ P2 ∨ ... ∨ Pk ∨ ... ∨ Pn → P1 ∧ P2 ∧ ... ∧ Pk ∧ ... ∧ Pn
where each Pk ranges over a 6NF projection on R. Some of the predicates that range over candidate keys only are existentially quantified. For example, each instantiation of the binary predicate "the employee identified by employee number <EMP#> is paid <RATE> per hour" implies the instantiation of the unary predicate "There is exactly one RATE such that the employee identified by employee number <EMP#> is paid RATE per hour, which is due to the functional dependency EMP# -> RATE. Information that is actually missing is characterized by only the unary predicate being instantiated.
It is redundant to associate the same predicate symbol with more than one relation schema. Decomposition to avoid applicable nulls does precisely that. For instance, if predicate symbol P1 were assigned to the predicate "the employee identified by employee number <EMP#> is paid <RATE> per hour", and the 5NF relation schema were decomposed, then that same predicate symbol would be associated with two relation schemas, namely those that participate in the foreign key constraint.
Also, it is not clear just from the structure of the database whether the implied unary predicate is associated with just the referencing relation schema or also the referenced one. Substitute the attribute OT_RATE above, which does not apply to exempt salaried employees. In that case the implied unary predicate is only associated with the referencing relation rather than both. One cannot therefore just assume that the implied unary predicate is also associated with the referenced relation schema. This indeterminacy is a consequence of the redundancy caused by associating the same predicate symbol with more than one relation schema.
It is for this reason that it makes sense to use ZOO attributes in place of applicable nulls and to decompose to eliminate inapplicable nulls. When ZOO attributes are used for information that may actually be missing, each atomic predicate should be associated with one schema only. (unless there is a POOD violation. One could certainly argue that the result of decomposing to avoid applicable nulls violates POOD, as one can obtain the same information from two distinct relations--especially in the case where no information is actually missing.) For inapplicable nulls, on the other hand, I would argue that when inapplicable nulls are permitted, the schema is not in 5NF to begin with, and so decomposition is therefore indicated.
Finally, the addressing of missing information should be as an afterthought, rather than a primary design goal. The database schema should be designed first and normalized under the assumption that all relevant information will always be supplied. The resulting schema can then be used to identify the relevant internal predicates and by extension the structure of the external predicates. Only then should adjustments be made to accommodate information that is actually missing. It is best, therefore, to substitute ZOO attributes where necessary to make that accommodation rather than introducing redundancy by decomposing the schema further.
Brian
Quote from Erwin on November 28, 2020, 8:37 pmQuote from Brian S on November 27, 2020, 5:05 amQuote from Hugh on November 20, 2020, 3:01 pmQuote from Efrit on November 19, 2020, 7:31 pmThanks for the advice Dave Voorhis, I’ll definitely take a look to the text you mentionned.
To answer your question dandl, I do try to be a better developer in general, but I am also interested in having a better insight in general in the field. What I liked about the article in French I mentioned is that his author took time to define the basic definitions of concepts used in the relational model, by referring for example to researchers who seems to be well-known, as Edgar Codd, Chris Date or Hugh Darwens. This article however said a lot of things that were very different from what I’ve learned in my engineering school. Not only about the fact that the SQL model is pretty different than the relational one, but also about what is accepted as good practices in the design of a database. For example, related to normal forms, I have often read that, as a good first approach, or by default, we should try to normalize a relational database to the 3rd normal form. But this author argued that this advice was obsolete, and that today you should aim to the 5th one (possibly by starting directly by Boyce-Codd normal form instead of the 1st if you are experimented). Another example would about the primary key and candidate keys. I only heard before about the first one, and never the latter.
So the first thing I think I should do is to read a text that introduce the field with precision. But I also don’t want to be lost in too much advanced details and above all about things that are too much controversial yet. I hope that An Introduction to Database Systems (8th Edition) is the adequate one.
I am not opposed at all to look into Datalog, why not? I read a bit the Wikipedia article on it, and the first thing that is troubling for me is that it is a subset of Prolog, which I happen to know because I had a course on it in my engineering school some years ago (and I was pretty bad at it unfortunately). As a side note, I have also discovered Mercury some times ago. If you have good introductory texts about all of this, I would gladly read them.
Of course, I also hope that learning all of this will help me to do a better work as a developer. I’ll open another topic to ask advice about a problem I encountered in my job.
I was going to suggest to get a good understanding of relational database theory before delving into normal forms, but for the record I'd like you to know that the 6NF defined by me and Chris Date is also important, especially in a database that is designed, either by choice or necessity, to avoid constructs akin to SQL's NULL. But remember that normal forms are only advice. It's okay to violate them so long as the designer is happy that with particular cases either the general disadvantages of doing that are inapplicable or suitable countermeasures can be put in place.
As for relational theory, my online free book An Introduction to Relational Database Theory starts at the very beginning and was derived from a course I used to teach to undergraduates. There's also a companion book, SQL: A Comparative Survey, in which I tried to draw attention to the controversies surrounding that language without expressing my own strong opinions that have been expressed in some of my books with Chris Date.
Hugh
6NF is important, but not for avoiding constructs akin to SQL NULL. Decomposing a relation schema from 5NF to 6NF "always" requires the introduction of a cyclical set of referential constraints. For temporal schemes, the benefits outweigh the cost, but that is not the case for non-temporal schemes. The predicate of a non-temporal 5NF relation that is not already in 6NF is conjunctive--e.g., {EMP#, RATE, DEPT#}:"The employee identified by employee number <EMP#> is paid <RATE> per hour and the employee identified by <EMP#> works in the department identified by <DEPT#>." Decomposition from 5NF to 6NF changes the logical connective from a conjunction to a disjunction. It follows in this example, that the FD EMP# -> {RATE,DEPT#} is lost, which means that you can have an employee with a pay rate but no department or vice-versa. The only way to prevent that is to introduce a proper circular set of INDs.
Decomposition to avoid NULLs carries its own set of problems. For instance, how do you explicitly indicate that a particular attribute is always applicable, even if it isn't always supplied? Using the decomposition method, you need a separate relation just for that purpose. That kind of redundancy can be avoided by substituting a relation-valued (ZeroOrOne) attribute in its place. There are several benefits to this approach. (1) The indication is explicit. (2) The underlying predicate is retained. (3) Decomposition is not necessary. (4) There is no need to introduce proper circular sets of INDs. And (5) fewer joins are needed to answer queries.
That is not to say that decomposition to avoid SQL NULLs is always a problem. There are cases where it makes sense--in particular, when a given attribute is not always applicable. In such cases, the predicate of the relation is already disjunctive, and the dependency from the key to the attribute is not strictly functional, as the attribute in question is not always applicable. There is therefore no downside--at least not from a logical standpoint--to decomposing such a schema.
This is also not to say that there aren't problems when ZOO attributes are used, but those problems are only due to the fact that information has not been supplied, rather than being due to the mechanism for representing it.
Brian
Pointing out in the following quote "Using the decomposition method, you need a separate relation just for that purpose. That kind of redundancy ... " :
If you ***NEED*** some thing X (e.g. "a separate relation"), how can that same thing X (that same separate relation) possibly be called ***REDUNDANT*** ???
"Redundant" as applied to components of some [database] design has always stood to mean "superfluous", "unneeded", "unnecessary", ... To my simple mind at least. If you ***NEED*** it, then it's not derivable from any of the things you do have, and then it's not redundant, and if it's redundant, then it's derivable from the other things you already have present in the design, and then it's not needed. That's the reason why the title of that early original Codd paper was (emphasis mine) "redundancy ***AND DERIVABILITY*** in ...". Simple. Unfortunately the second portion has, throughout the decades, got snowed under by an unwarranted exclusive focus on [a particularly narrow interpretation of] the first.
Quote from Brian S on November 27, 2020, 5:05 amQuote from Hugh on November 20, 2020, 3:01 pmQuote from Efrit on November 19, 2020, 7:31 pmThanks for the advice Dave Voorhis, I’ll definitely take a look to the text you mentionned.
To answer your question dandl, I do try to be a better developer in general, but I am also interested in having a better insight in general in the field. What I liked about the article in French I mentioned is that his author took time to define the basic definitions of concepts used in the relational model, by referring for example to researchers who seems to be well-known, as Edgar Codd, Chris Date or Hugh Darwens. This article however said a lot of things that were very different from what I’ve learned in my engineering school. Not only about the fact that the SQL model is pretty different than the relational one, but also about what is accepted as good practices in the design of a database. For example, related to normal forms, I have often read that, as a good first approach, or by default, we should try to normalize a relational database to the 3rd normal form. But this author argued that this advice was obsolete, and that today you should aim to the 5th one (possibly by starting directly by Boyce-Codd normal form instead of the 1st if you are experimented). Another example would about the primary key and candidate keys. I only heard before about the first one, and never the latter.
So the first thing I think I should do is to read a text that introduce the field with precision. But I also don’t want to be lost in too much advanced details and above all about things that are too much controversial yet. I hope that An Introduction to Database Systems (8th Edition) is the adequate one.
I am not opposed at all to look into Datalog, why not? I read a bit the Wikipedia article on it, and the first thing that is troubling for me is that it is a subset of Prolog, which I happen to know because I had a course on it in my engineering school some years ago (and I was pretty bad at it unfortunately). As a side note, I have also discovered Mercury some times ago. If you have good introductory texts about all of this, I would gladly read them.
Of course, I also hope that learning all of this will help me to do a better work as a developer. I’ll open another topic to ask advice about a problem I encountered in my job.
I was going to suggest to get a good understanding of relational database theory before delving into normal forms, but for the record I'd like you to know that the 6NF defined by me and Chris Date is also important, especially in a database that is designed, either by choice or necessity, to avoid constructs akin to SQL's NULL. But remember that normal forms are only advice. It's okay to violate them so long as the designer is happy that with particular cases either the general disadvantages of doing that are inapplicable or suitable countermeasures can be put in place.
As for relational theory, my online free book An Introduction to Relational Database Theory starts at the very beginning and was derived from a course I used to teach to undergraduates. There's also a companion book, SQL: A Comparative Survey, in which I tried to draw attention to the controversies surrounding that language without expressing my own strong opinions that have been expressed in some of my books with Chris Date.
Hugh
6NF is important, but not for avoiding constructs akin to SQL NULL. Decomposing a relation schema from 5NF to 6NF "always" requires the introduction of a cyclical set of referential constraints. For temporal schemes, the benefits outweigh the cost, but that is not the case for non-temporal schemes. The predicate of a non-temporal 5NF relation that is not already in 6NF is conjunctive--e.g., {EMP#, RATE, DEPT#}:"The employee identified by employee number <EMP#> is paid <RATE> per hour and the employee identified by <EMP#> works in the department identified by <DEPT#>." Decomposition from 5NF to 6NF changes the logical connective from a conjunction to a disjunction. It follows in this example, that the FD EMP# -> {RATE,DEPT#} is lost, which means that you can have an employee with a pay rate but no department or vice-versa. The only way to prevent that is to introduce a proper circular set of INDs.
Decomposition to avoid NULLs carries its own set of problems. For instance, how do you explicitly indicate that a particular attribute is always applicable, even if it isn't always supplied? Using the decomposition method, you need a separate relation just for that purpose. That kind of redundancy can be avoided by substituting a relation-valued (ZeroOrOne) attribute in its place. There are several benefits to this approach. (1) The indication is explicit. (2) The underlying predicate is retained. (3) Decomposition is not necessary. (4) There is no need to introduce proper circular sets of INDs. And (5) fewer joins are needed to answer queries.
That is not to say that decomposition to avoid SQL NULLs is always a problem. There are cases where it makes sense--in particular, when a given attribute is not always applicable. In such cases, the predicate of the relation is already disjunctive, and the dependency from the key to the attribute is not strictly functional, as the attribute in question is not always applicable. There is therefore no downside--at least not from a logical standpoint--to decomposing such a schema.
This is also not to say that there aren't problems when ZOO attributes are used, but those problems are only due to the fact that information has not been supplied, rather than being due to the mechanism for representing it.
Brian
Pointing out in the following quote "Using the decomposition method, you need a separate relation just for that purpose. That kind of redundancy ... " :
If you ***NEED*** some thing X (e.g. "a separate relation"), how can that same thing X (that same separate relation) possibly be called ***REDUNDANT*** ???
"Redundant" as applied to components of some [database] design has always stood to mean "superfluous", "unneeded", "unnecessary", ... To my simple mind at least. If you ***NEED*** it, then it's not derivable from any of the things you do have, and then it's not redundant, and if it's redundant, then it's derivable from the other things you already have present in the design, and then it's not needed. That's the reason why the title of that early original Codd paper was (emphasis mine) "redundancy ***AND DERIVABILITY*** in ...". Simple. Unfortunately the second portion has, throughout the decades, got snowed under by an unwarranted exclusive focus on [a particularly narrow interpretation of] the first.
Quote from Erwin on November 28, 2020, 9:19 pmQuote from p c on November 25, 2020, 11:53 am...The desired assertion is :
"IT IS NOT THE CASE THAT THE_SHOP_IS_OPEN_AND_THE_ALARM_IS_SET".
Logically, that assertion is a premise represented by the negation of a conjunction of two relations, for example, it can be represented by the term ¬(A∧B) such that A and B could name either propositions or relation extensions or predicates. Codd allowed those representations.
But mathematically, the informal argument suffers from ignoring this sentence in Codd 1970:
Section 1.3 A Relational View of Data, footnote 1: "More concisely, R is a subset of the Cartesian product ..."
The significance of that sentence is that a tuple can be said to represent a logical conjunction of attribute values. For example, if a tuple {a,b} does not satisfy predicate Rab, then {a,b} is not an element of relation R, likewise when a tuple {a,b,c} doesn't satisfy predicate Rabc. Nor can any non-conjunctive combination of projections on R evaluate to such a tuple.
Mathematically, this entails that the tuple {a,b} can be represented as two tuples, ...
The case of "IT IS NOT THE CASE THAT THE_SHOP_IS_OPEN_AND_THE_ALARM_IS_SET" is represented by TUPLE{}, not by TUPLE{a,b} (and/or by whatever can be said about the latter).
Maybe this lack of understanding explains all the nonsense the author of that post has ever written on this forum before.
"IT IS NOT THE CASE THAT THE_SHOP_IS_OPEN_AND_THE_ALARM_IS_SET" is logically equivalent to "IT IS NOT THE CASE THAT THE_SHOP_IS_OPEN OR IT IS NOT THE CASE THAT THE_ALARM_IS_SET". Which is relationally represented by TUPLE{} being present in the nilary relation that is the extension of the nilary predicate "IT IS NOT THE CASE THAT THE_SHOP_IS_OPEN", or by TUPLE{} being present in the nilary relation that is the extension of the nilary predicate "IT IS NOT THE CASE THAT THE_ALARM_IS_SET", or being present in both. Which is relationally represented by TUPLE{} being absent from the nilary relation that is the extension of the nilary predicate "THE_SHOP_IS_OPEN", or being absent from the nilary relation that is the extension of the nilary predicate "THE_ALARM_IS_SET", or [being absent of] both.
All nilary, not a TUPLE{a,b} in sight.
Quote from p c on November 25, 2020, 11:53 am...The desired assertion is :
"IT IS NOT THE CASE THAT THE_SHOP_IS_OPEN_AND_THE_ALARM_IS_SET".
Logically, that assertion is a premise represented by the negation of a conjunction of two relations, for example, it can be represented by the term ¬(A∧B) such that A and B could name either propositions or relation extensions or predicates. Codd allowed those representations.
But mathematically, the informal argument suffers from ignoring this sentence in Codd 1970:
Section 1.3 A Relational View of Data, footnote 1: "More concisely, R is a subset of the Cartesian product ..."
The significance of that sentence is that a tuple can be said to represent a logical conjunction of attribute values. For example, if a tuple {a,b} does not satisfy predicate Rab, then {a,b} is not an element of relation R, likewise when a tuple {a,b,c} doesn't satisfy predicate Rabc. Nor can any non-conjunctive combination of projections on R evaluate to such a tuple.
Mathematically, this entails that the tuple {a,b} can be represented as two tuples, ...
The case of "IT IS NOT THE CASE THAT THE_SHOP_IS_OPEN_AND_THE_ALARM_IS_SET" is represented by TUPLE{}, not by TUPLE{a,b} (and/or by whatever can be said about the latter).
Maybe this lack of understanding explains all the nonsense the author of that post has ever written on this forum before.
"IT IS NOT THE CASE THAT THE_SHOP_IS_OPEN_AND_THE_ALARM_IS_SET" is logically equivalent to "IT IS NOT THE CASE THAT THE_SHOP_IS_OPEN OR IT IS NOT THE CASE THAT THE_ALARM_IS_SET". Which is relationally represented by TUPLE{} being present in the nilary relation that is the extension of the nilary predicate "IT IS NOT THE CASE THAT THE_SHOP_IS_OPEN", or by TUPLE{} being present in the nilary relation that is the extension of the nilary predicate "IT IS NOT THE CASE THAT THE_ALARM_IS_SET", or being present in both. Which is relationally represented by TUPLE{} being absent from the nilary relation that is the extension of the nilary predicate "THE_SHOP_IS_OPEN", or being absent from the nilary relation that is the extension of the nilary predicate "THE_ALARM_IS_SET", or [being absent of] both.
All nilary, not a TUPLE{a,b} in sight.
Quote from Brian S on November 29, 2020, 10:51 amQuote from Erwin on November 28, 2020, 8:37 pmQuote from Brian S on November 27, 2020, 5:05 amQuote from Hugh on November 20, 2020, 3:01 pmQuote from Efrit on November 19, 2020, 7:31 pmThanks for the advice Dave Voorhis, I’ll definitely take a look to the text you mentionned.
To answer your question dandl, I do try to be a better developer in general, but I am also interested in having a better insight in general in the field. What I liked about the article in French I mentioned is that his author took time to define the basic definitions of concepts used in the relational model, by referring for example to researchers who seems to be well-known, as Edgar Codd, Chris Date or Hugh Darwens. This article however said a lot of things that were very different from what I’ve learned in my engineering school. Not only about the fact that the SQL model is pretty different than the relational one, but also about what is accepted as good practices in the design of a database. For example, related to normal forms, I have often read that, as a good first approach, or by default, we should try to normalize a relational database to the 3rd normal form. But this author argued that this advice was obsolete, and that today you should aim to the 5th one (possibly by starting directly by Boyce-Codd normal form instead of the 1st if you are experimented). Another example would about the primary key and candidate keys. I only heard before about the first one, and never the latter.
So the first thing I think I should do is to read a text that introduce the field with precision. But I also don’t want to be lost in too much advanced details and above all about things that are too much controversial yet. I hope that An Introduction to Database Systems (8th Edition) is the adequate one.
I am not opposed at all to look into Datalog, why not? I read a bit the Wikipedia article on it, and the first thing that is troubling for me is that it is a subset of Prolog, which I happen to know because I had a course on it in my engineering school some years ago (and I was pretty bad at it unfortunately). As a side note, I have also discovered Mercury some times ago. If you have good introductory texts about all of this, I would gladly read them.
Of course, I also hope that learning all of this will help me to do a better work as a developer. I’ll open another topic to ask advice about a problem I encountered in my job.
I was going to suggest to get a good understanding of relational database theory before delving into normal forms, but for the record I'd like you to know that the 6NF defined by me and Chris Date is also important, especially in a database that is designed, either by choice or necessity, to avoid constructs akin to SQL's NULL. But remember that normal forms are only advice. It's okay to violate them so long as the designer is happy that with particular cases either the general disadvantages of doing that are inapplicable or suitable countermeasures can be put in place.
As for relational theory, my online free book An Introduction to Relational Database Theory starts at the very beginning and was derived from a course I used to teach to undergraduates. There's also a companion book, SQL: A Comparative Survey, in which I tried to draw attention to the controversies surrounding that language without expressing my own strong opinions that have been expressed in some of my books with Chris Date.
Hugh
6NF is important, but not for avoiding constructs akin to SQL NULL. Decomposing a relation schema from 5NF to 6NF "always" requires the introduction of a cyclical set of referential constraints. For temporal schemes, the benefits outweigh the cost, but that is not the case for non-temporal schemes. The predicate of a non-temporal 5NF relation that is not already in 6NF is conjunctive--e.g., {EMP#, RATE, DEPT#}:"The employee identified by employee number <EMP#> is paid <RATE> per hour and the employee identified by <EMP#> works in the department identified by <DEPT#>." Decomposition from 5NF to 6NF changes the logical connective from a conjunction to a disjunction. It follows in this example, that the FD EMP# -> {RATE,DEPT#} is lost, which means that you can have an employee with a pay rate but no department or vice-versa. The only way to prevent that is to introduce a proper circular set of INDs.
Decomposition to avoid NULLs carries its own set of problems. For instance, how do you explicitly indicate that a particular attribute is always applicable, even if it isn't always supplied? Using the decomposition method, you need a separate relation just for that purpose. That kind of redundancy can be avoided by substituting a relation-valued (ZeroOrOne) attribute in its place. There are several benefits to this approach. (1) The indication is explicit. (2) The underlying predicate is retained. (3) Decomposition is not necessary. (4) There is no need to introduce proper circular sets of INDs. And (5) fewer joins are needed to answer queries.
That is not to say that decomposition to avoid SQL NULLs is always a problem. There are cases where it makes sense--in particular, when a given attribute is not always applicable. In such cases, the predicate of the relation is already disjunctive, and the dependency from the key to the attribute is not strictly functional, as the attribute in question is not always applicable. There is therefore no downside--at least not from a logical standpoint--to decomposing such a schema.
This is also not to say that there aren't problems when ZOO attributes are used, but those problems are only due to the fact that information has not been supplied, rather than being due to the mechanism for representing it.
Brian
Pointing out in the following quote "Using the decomposition method, you need a separate relation just for that purpose. That kind of redundancy ... " :
If you ***NEED*** some thing X (e.g. "a separate relation"), how can that same thing X (that same separate relation) possibly be called ***REDUNDANT*** ???
"Redundant" as applied to components of some [database] design has always stood to mean "superfluous", "unneeded", "unnecessary", ... To my simple mind at least. If you ***NEED*** it, then it's not derivable from any of the things you do have, and then it's not redundant, and if it's redundant, then it's derivable from the other things you already have present in the design, and then it's not needed. That's the reason why the title of that early original Codd paper was (emphasis mine) "redundancy ***AND DERIVABILITY*** in ...". Simple. Unfortunately the second portion has, throughout the decades, got snowed under by an unwarranted exclusive focus on [a particularly narrow interpretation of] the first.
You're missing the point, Erwin. A separate relation schema--just for the purpose of indicating whether a particular attribute is applicable--is required ***ONLY*** when the decomposition method is used to eliminate applicable nulls. It is not ***NEEDED*** when the ZOO attribute method is used for that purpose. Moreover, redundancy in the form of the same underlying predicate being associated with multiple relation schemas is introduced when the decomposition method is used to eliminate applicable nulls.
Redundancy is characterized by the requirement for the same information to be recorded more than once in the database, or for information to be recorded when it can be derived from other information in the database. But another problem is indeterminacy. If you decompose the schema {EMP#, NAME, RATE, OT_RATE} into 6NF schemas {EMP#, NAME}, {EMP#, RATE}, {EMP#, OT_RATE}, with FKs that reference {EMP#, NAME}[EMP#], how do you know just by examining the structure of the database that all employees are supposed to have a RATE but only some are supposed to have an OT_RATE? If ZOO attributes are used for the former and decomposition for the latter, then whenever you see a ZOO attribute, you should know that there is supposed to be a value for every tuple, even if it hasn't been supplied. Of course, if ZOO attributes are also used for the latter, then you're back to square one.
Quote from Erwin on November 28, 2020, 8:37 pmQuote from Brian S on November 27, 2020, 5:05 amQuote from Hugh on November 20, 2020, 3:01 pmQuote from Efrit on November 19, 2020, 7:31 pmThanks for the advice Dave Voorhis, I’ll definitely take a look to the text you mentionned.
To answer your question dandl, I do try to be a better developer in general, but I am also interested in having a better insight in general in the field. What I liked about the article in French I mentioned is that his author took time to define the basic definitions of concepts used in the relational model, by referring for example to researchers who seems to be well-known, as Edgar Codd, Chris Date or Hugh Darwens. This article however said a lot of things that were very different from what I’ve learned in my engineering school. Not only about the fact that the SQL model is pretty different than the relational one, but also about what is accepted as good practices in the design of a database. For example, related to normal forms, I have often read that, as a good first approach, or by default, we should try to normalize a relational database to the 3rd normal form. But this author argued that this advice was obsolete, and that today you should aim to the 5th one (possibly by starting directly by Boyce-Codd normal form instead of the 1st if you are experimented). Another example would about the primary key and candidate keys. I only heard before about the first one, and never the latter.
So the first thing I think I should do is to read a text that introduce the field with precision. But I also don’t want to be lost in too much advanced details and above all about things that are too much controversial yet. I hope that An Introduction to Database Systems (8th Edition) is the adequate one.
I am not opposed at all to look into Datalog, why not? I read a bit the Wikipedia article on it, and the first thing that is troubling for me is that it is a subset of Prolog, which I happen to know because I had a course on it in my engineering school some years ago (and I was pretty bad at it unfortunately). As a side note, I have also discovered Mercury some times ago. If you have good introductory texts about all of this, I would gladly read them.
Of course, I also hope that learning all of this will help me to do a better work as a developer. I’ll open another topic to ask advice about a problem I encountered in my job.
I was going to suggest to get a good understanding of relational database theory before delving into normal forms, but for the record I'd like you to know that the 6NF defined by me and Chris Date is also important, especially in a database that is designed, either by choice or necessity, to avoid constructs akin to SQL's NULL. But remember that normal forms are only advice. It's okay to violate them so long as the designer is happy that with particular cases either the general disadvantages of doing that are inapplicable or suitable countermeasures can be put in place.
As for relational theory, my online free book An Introduction to Relational Database Theory starts at the very beginning and was derived from a course I used to teach to undergraduates. There's also a companion book, SQL: A Comparative Survey, in which I tried to draw attention to the controversies surrounding that language without expressing my own strong opinions that have been expressed in some of my books with Chris Date.
Hugh
6NF is important, but not for avoiding constructs akin to SQL NULL. Decomposing a relation schema from 5NF to 6NF "always" requires the introduction of a cyclical set of referential constraints. For temporal schemes, the benefits outweigh the cost, but that is not the case for non-temporal schemes. The predicate of a non-temporal 5NF relation that is not already in 6NF is conjunctive--e.g., {EMP#, RATE, DEPT#}:"The employee identified by employee number <EMP#> is paid <RATE> per hour and the employee identified by <EMP#> works in the department identified by <DEPT#>." Decomposition from 5NF to 6NF changes the logical connective from a conjunction to a disjunction. It follows in this example, that the FD EMP# -> {RATE,DEPT#} is lost, which means that you can have an employee with a pay rate but no department or vice-versa. The only way to prevent that is to introduce a proper circular set of INDs.
Decomposition to avoid NULLs carries its own set of problems. For instance, how do you explicitly indicate that a particular attribute is always applicable, even if it isn't always supplied? Using the decomposition method, you need a separate relation just for that purpose. That kind of redundancy can be avoided by substituting a relation-valued (ZeroOrOne) attribute in its place. There are several benefits to this approach. (1) The indication is explicit. (2) The underlying predicate is retained. (3) Decomposition is not necessary. (4) There is no need to introduce proper circular sets of INDs. And (5) fewer joins are needed to answer queries.
That is not to say that decomposition to avoid SQL NULLs is always a problem. There are cases where it makes sense--in particular, when a given attribute is not always applicable. In such cases, the predicate of the relation is already disjunctive, and the dependency from the key to the attribute is not strictly functional, as the attribute in question is not always applicable. There is therefore no downside--at least not from a logical standpoint--to decomposing such a schema.
This is also not to say that there aren't problems when ZOO attributes are used, but those problems are only due to the fact that information has not been supplied, rather than being due to the mechanism for representing it.
Brian
Pointing out in the following quote "Using the decomposition method, you need a separate relation just for that purpose. That kind of redundancy ... " :
If you ***NEED*** some thing X (e.g. "a separate relation"), how can that same thing X (that same separate relation) possibly be called ***REDUNDANT*** ???
"Redundant" as applied to components of some [database] design has always stood to mean "superfluous", "unneeded", "unnecessary", ... To my simple mind at least. If you ***NEED*** it, then it's not derivable from any of the things you do have, and then it's not redundant, and if it's redundant, then it's derivable from the other things you already have present in the design, and then it's not needed. That's the reason why the title of that early original Codd paper was (emphasis mine) "redundancy ***AND DERIVABILITY*** in ...". Simple. Unfortunately the second portion has, throughout the decades, got snowed under by an unwarranted exclusive focus on [a particularly narrow interpretation of] the first.
You're missing the point, Erwin. A separate relation schema--just for the purpose of indicating whether a particular attribute is applicable--is required ***ONLY*** when the decomposition method is used to eliminate applicable nulls. It is not ***NEEDED*** when the ZOO attribute method is used for that purpose. Moreover, redundancy in the form of the same underlying predicate being associated with multiple relation schemas is introduced when the decomposition method is used to eliminate applicable nulls.
Redundancy is characterized by the requirement for the same information to be recorded more than once in the database, or for information to be recorded when it can be derived from other information in the database. But another problem is indeterminacy. If you decompose the schema {EMP#, NAME, RATE, OT_RATE} into 6NF schemas {EMP#, NAME}, {EMP#, RATE}, {EMP#, OT_RATE}, with FKs that reference {EMP#, NAME}[EMP#], how do you know just by examining the structure of the database that all employees are supposed to have a RATE but only some are supposed to have an OT_RATE? If ZOO attributes are used for the former and decomposition for the latter, then whenever you see a ZOO attribute, you should know that there is supposed to be a value for every tuple, even if it hasn't been supplied. Of course, if ZOO attributes are also used for the latter, then you're back to square one.
Quote from p c on December 18, 2020, 11:08 amQuote from Brian S on November 28, 2020, 12:10 amI'm sorry, p c, but I just don't follow your train of thought: Are you denying that the original predicate is the conjunction of two distinct predicates? Or are you denying that the database predicate of the 6NF decomposition of the original relation is disjunctive? It's also unclear why you chose to mention applications, for from the standpoint of the relevant database theory, the application does not bear on what information is represented in the database--that is, whether the information contained within is the logical sum of the propositions formed by instantiating the conjunctive predicate, "The employee identified by employee number <EMP#> is paid <RATE> per hour and the employee identified by <EMP#> works in the department identified by <DEPT#>" or by instantiating the predicates "The employee identified by employee number <EMP#> is paid <RATE> per hour" or "The employee identified by <EMP#> works in the department identified by <DEPT#>" or both.
By decomposing the 5NF relation to 6NF, what is instantiated by each tuple is no longer the conjunction of the two individual predicates, but rather the disjunction.
Note that the unary predicate over EMP# is implied by each of the binary predicates, and I would argue cannot stand apart from them due to the FD EMP# -> {RATE, DEPT#}.
"Are you denying that the original predicate is the conjunction of two distinct predicates? Or are you denying that the database predicate of the 6NF decomposition of the original relation is disjunctive? It's also unclear why you chose to mention applications..."
Nice questions. As this group shows, not everybody is wlling or capable of taking the steps needed to logically conflate conjunction and disjunction.
A little pathology is telling. In 1997 the TTM book chapter later known as Appendix A claimed to "dispense with Union", only 27 years (!) after Codd had already dispensed with it in a single sentence (for "noninferential systems", in other words for a relational data sublanguage) replacing it with infinite disjunction and simultaneously dispensed with the tiny sublanguage (only five logical connectives), without justification. But Codd saw no need for replacement.
Almost invisibly, with that chapter the data sublanguage disappeared, just as it disappeared in SQL systems and the separation of concepts lost by both permitted the encouragement and even promotion of database contradictions, the opposites of Codd's desired consistency.
(Also without justification, a tuple attribute list somehow became a set and logical conjunction somehow became natural join.)
Disjunction follows from every relational conjunction but not vice versa. That is why projection and join are sufficient to query for flags that are red or blue and for parts that are bolts or washers.
The elementary constraint of every relational tuple with attributes A and B is given by the logical equivalence (A ↔ B), meaning A if and only if B, in other words if A is present, B is present and vice-versa. The equivalence doesn't assert tuple presence in a database. The assertion of their prsence, their mutual coincidence, is given by the conjunction (A ∧ B ).
When the constraint and the assertion are combined with the disjunction (A ∨ B) an elementary conflation of equivalences results:
(A ↔ B) ↔ (A ∧ B ) ↔ (A ∨ B).
Apparently, the conflation is logically valid for all first-order models. Because it applies to relational tuples, it applies to sets of similar tuples and therefore union-compatible relations. But union-compatibility doesn't entail at all any particular algebraic operation. All that is entailed is the enforcement of the elementary constraint (A ↔ B).
(The FD's you also mention are irrelevant to the constraint, they might happen to be obeyed but there's no constant reason for them to be always embodied.)
The disjunction of the decomposed relations in the example is not a relation, so it is irrelevant to database internal representation no matter how relevant it might seem in the eye of a user or application program coder.
The mention of applications is important for the same reason data sublanguage is important, system efficiency is lost by a universal relational language that to be logically correct must otherwise reflect the elementary constraint (A ↔ B) because an application becomes logically incorrect without the elementary conflation. Apparently, the intended foundational A-Algebra can't separate the elementary constraint from application representations so presumably neither can complying D-languages.
Quote from Brian S on November 28, 2020, 12:10 amI'm sorry, p c, but I just don't follow your train of thought: Are you denying that the original predicate is the conjunction of two distinct predicates? Or are you denying that the database predicate of the 6NF decomposition of the original relation is disjunctive? It's also unclear why you chose to mention applications, for from the standpoint of the relevant database theory, the application does not bear on what information is represented in the database--that is, whether the information contained within is the logical sum of the propositions formed by instantiating the conjunctive predicate, "The employee identified by employee number <EMP#> is paid <RATE> per hour and the employee identified by <EMP#> works in the department identified by <DEPT#>" or by instantiating the predicates "The employee identified by employee number <EMP#> is paid <RATE> per hour" or "The employee identified by <EMP#> works in the department identified by <DEPT#>" or both.
By decomposing the 5NF relation to 6NF, what is instantiated by each tuple is no longer the conjunction of the two individual predicates, but rather the disjunction.
Note that the unary predicate over EMP# is implied by each of the binary predicates, and I would argue cannot stand apart from them due to the FD EMP# -> {RATE, DEPT#}.
"Are you denying that the original predicate is the conjunction of two distinct predicates? Or are you denying that the database predicate of the 6NF decomposition of the original relation is disjunctive? It's also unclear why you chose to mention applications..."
Nice questions. As this group shows, not everybody is wlling or capable of taking the steps needed to logically conflate conjunction and disjunction.
A little pathology is telling. In 1997 the TTM book chapter later known as Appendix A claimed to "dispense with Union", only 27 years (!) after Codd had already dispensed with it in a single sentence (for "noninferential systems", in other words for a relational data sublanguage) replacing it with infinite disjunction and simultaneously dispensed with the tiny sublanguage (only five logical connectives), without justification. But Codd saw no need for replacement.
Almost invisibly, with that chapter the data sublanguage disappeared, just as it disappeared in SQL systems and the separation of concepts lost by both permitted the encouragement and even promotion of database contradictions, the opposites of Codd's desired consistency.
(Also without justification, a tuple attribute list somehow became a set and logical conjunction somehow became natural join.)
Disjunction follows from every relational conjunction but not vice versa. That is why projection and join are sufficient to query for flags that are red or blue and for parts that are bolts or washers.
The elementary constraint of every relational tuple with attributes A and B is given by the logical equivalence (A ↔ B), meaning A if and only if B, in other words if A is present, B is present and vice-versa. The equivalence doesn't assert tuple presence in a database. The assertion of their prsence, their mutual coincidence, is given by the conjunction (A ∧ B ).
When the constraint and the assertion are combined with the disjunction (A ∨ B) an elementary conflation of equivalences results:
(A ↔ B) ↔ (A ∧ B ) ↔ (A ∨ B).
Apparently, the conflation is logically valid for all first-order models. Because it applies to relational tuples, it applies to sets of similar tuples and therefore union-compatible relations. But union-compatibility doesn't entail at all any particular algebraic operation. All that is entailed is the enforcement of the elementary constraint (A ↔ B).
(The FD's you also mention are irrelevant to the constraint, they might happen to be obeyed but there's no constant reason for them to be always embodied.)
The disjunction of the decomposed relations in the example is not a relation, so it is irrelevant to database internal representation no matter how relevant it might seem in the eye of a user or application program coder.
The mention of applications is important for the same reason data sublanguage is important, system efficiency is lost by a universal relational language that to be logically correct must otherwise reflect the elementary constraint (A ↔ B) because an application becomes logically incorrect without the elementary conflation. Apparently, the intended foundational A-Algebra can't separate the elementary constraint from application representations so presumably neither can complying D-languages.