The Forum for Discussion about The Third Manifesto and Related Matters


Codd 1970 'domain' does not mean Date 2016 'type' [was: burble about Date's IM]

Quote from dandl on March 15, 2020, 12:54 am

My point in starting this thread was to reject the attribution coming from Chris Date that Codd's 'domain' we should take to mean modern 'data type'. I don't think it's that Codd merely didn't know the term in that sense and would have used it if he did. I think his 'domain' is something different, and that Chris Date is plain wrong. Indeed Chris Date seems to be persistently wrong in a great number of his readings of other authors, and in bending well-established terminology to weird senses.

AFAICT Codd (early papers)  used 'domain' precisely in this sense: https://en.wikipedia.org/wiki/Data_domain: "the values which a data element may contain".

No. See Codd 1970 Fig 2. (I expect the same figure is in 1969.)

The data elements headed part, quantity both contain values 1, 2, 3 amongst others. Yet Codd says "A relation with two identical domains".  That is, the identical two are those headed part. Note that the example content of those two columns is not the same set of values; neither is either the same set of values as for quantity. So "two identical domains" must mean same column name and same (potential) set of values (or for the "domain-unordered counterparts", same domain name even if different role name: "we require in each case that the domain name be qualified by a distinctive role name," 1970 p.380 and footnote 2).

So for the quote Erwin picked up from 1979, "supplier 3" must mean a value 3 appearing in an attribute with role supplier, and distinct from a value 3 appearing in role part or quantity. This is independent of the machine representation of that value; or the more programming-language oriented conceptual data type (PossRep in TTM).

The same word is frequently used in precisely the same sense today, by data people (DP) in talking about data. Codd used this valuable idea to distinguish those attributes that could join from those that could not.

Hmmm. That wiki talks about enumerated types and/or values with foreign keys to 'reference data', and yes data people would be thinking of those. I see no evidence Codd had that in mind from the 1970 paper. Perhaps the 1979 paper with its entity types might be thinking about foreign-key linkages (for part, supplier); but there's still  no enumerated types nor reference data.

If Codd has a concept of what attributes can join, I'd expect in the 1972 paper a definition for 'join-compatible' in the same vein as 'union-compatible'. Furthermore where it explains how to get from Cartesian product to natural join by eliminating "redundant domains", it should talk about more than mere equality of values: "In the case of the equi-join, two of the domains of the resulting relation are identical in content. If one of the domains is removed by projection, the result is the natural join of the given relations, as defined in [2]." [p.10]

That should say "... are identical in content and domain name". Thus Natural Join would be a partial function/operation, same as the union-compatible operations. But that would be needlessly restrictive: although it might seem daft to join part to supplier even though they're both Integer, we might want to join supplier-city to part-stored-city.

... That has changed, but SQL is stuck in a time warp.

Hmm. SQL seems to be stuck at Codd 1972. (Or I should say at an IBM engineers' misunderstanding of Codd 1972.)

IMO the single most striking feature of  TTM is the unification of data domain with programming language type system.

TTM does not use any idea of 'data domain' (the word "domain" barely appears, and only to contrast to SQL). Perhaps that's how enumerated types got overlooked(?) It has only PL type system. Let's not conflate the two.

... There are other issues in TTM, such as the treatment of enumerated types, which are a big deal to DP. The data domain is obvious (Red, Amber, Green), but the programming type is problematic. Is it possible that the DP view and the CP view should not be unified?

I blow hot and cold as to whether the Relational Model needs enumerated types in the database content, as opposed to within a programming language. Red/Amber/Green would I suspect be reference data in the database and foreign keys to them, with additional attributes such as a description or label to show on the screen. A typical move when internationalising an application is to rip out all the hard-coded (pseudo-)enumerated types, and replace them with reference data so that you can offer translated labels. (Reference data is tiny, so you load those tables into memory.)

So my question for TFM (The Fourth Manifesto) is whether it is feasible to put some distance between the data domain and the language type, so both the DP and the CP get what they need, and if there are benefits in doing so.

 

If Codd's 'domain' has any use here within a modern nominative type system, it would be to distinguish the PL PossRep (which will be close to the PhysRep) from what I'll call 'NomRep': the PossRep named to some role name; in which the named type shares the PossRep and therefore can share the implementation/overloading of operations -- especially for equi-testing in a Join. So the Poss/PhysRep for part 3 is identical to the Poss/PhysRep for supplier 3, but you can't compare or Join on them because they're distinct nominative types.

Now we need to be canny with RA RENAME: newly-named attributes (say assembly-part vs component-part) must still be comparable; whereas RENAME part to supplier should still leave them non-comparable (or get rejected).
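A rough sketch of the 'NomRep' idea, in Python purely for illustration (nothing here is Tutorial D or TTM). The names NomRep, PartNo, SupplierNo and equi_join are all made up for the example. The assumption is that two nominal types share one integer representation, and that equality is only defined within a single nominal type, so a join across part and supplier gets rejected while a join across two differently named part attributes goes through.

```python
class NomRep:
    """A value whose representation is a plain int but whose type is nominal."""

    def __init__(self, value: int):
        self.value = value

    def __eq__(self, other):
        # Comparison is only defined between values of the same nominal type.
        if type(other) is not type(self):
            raise TypeError(
                f"cannot compare {type(self).__name__} with {type(other).__name__}"
            )
        return self.value == other.value

    def __hash__(self):
        return hash((type(self).__name__, self.value))


class PartNo(NomRep):
    """Hypothetical nominal type for part numbers."""


class SupplierNo(NomRep):
    """Hypothetical nominal type for supplier numbers."""


def equi_join(left, right, left_attr, right_attr):
    """Naive nested-loop equi-join over lists of dicts (one dict per tuple)."""
    return [
        {**l, **r}
        for l in left
        for r in right
        if l[left_attr] == r[right_attr]
    ]
```

Here assembly_part and component_part can both be PartNo attributes and so stay comparable after a rename, whereas an attempt to join a PartNo attribute to a SupplierNo attribute raises TypeError even though both hold the same underlying integer.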

Quote from AntC on March 15, 2020, 2:53 am
Quote from dandl on March 15, 2020, 12:54 am

My point in starting this thread was to reject the attribution coming from Chris Date that Codd's 'domain' we should take to mean modern 'data type'. I don't think it's that Codd merely didn't know the term in that sense and would have used it if he did. I think his 'domain' is something different, and that Chris Date is plain wrong. Indeed Chris Date seems to be persistently wrong in a great number of his readings of other authors, and in bending well-established terminology to weird senses.

AFAICT Codd (early papers)  used 'domain' precisely in this sense: https://en.wikipedia.org/wiki/Data_domain: "the values which a data element may contain".

No. See Codd 1970 Fig 2. (I expect the same figure is in 1969.)

It's not, but close enough. See Fig 2 p4.

The data elements headed part, quantity both contain values 1, 2, 3 amongst others. Yet Codd says "A relation with two identical domains".  That is, the identical two are those headed part. Note that the example content of those two columns is not the same set of values; neither is either the same set of values as for quantity. So "two identical domains" must mean same column name and same (potential) set of values (or for the "domain-unordered counterparts", same domain name even if different role name: "we require in each case that the domain name be qualified by a distinctive role name," 1970 p.380 and footnote 2).

With licence, yes. In Fig 2 there appears a selection of values, from which we may infer others without actually being sure what those others might be. Let us assume here that there are two domains: part has the values 1-50 inclusive and quantity has the values 0-99 inclusive. The first two attributes of component are both of domain part, so we distinguish them by assigning a different role to each. In some cases the domain is sufficient to identify an attribute, in others we need both domain and role. In Codd's case the attribute names are sub.part, super.part, quantity. Thus far it all makes sense.
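To make that reading concrete, here is a small Python sketch. The ranges 1-50 and 0-99 are the assumed ones from the paragraph above, and the Domain and Attribute classes are hypothetical names invented for the example, not anything taken from Codd.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Domain:
    """A named set of permitted integer values (assumed here to be a range)."""
    name: str
    low: int
    high: int

    def contains(self, value: int) -> bool:
        return self.low <= value <= self.high


@dataclass(frozen=True)
class Attribute:
    """A domain, optionally qualified by a role name."""
    domain: Domain
    role: str = ""

    @property
    def name(self) -> str:
        # The role is only needed when the domain alone is ambiguous.
        return f"{self.role}.{self.domain.name}" if self.role else self.domain.name


PART = Domain("part", 1, 50)
QUANTITY = Domain("quantity", 0, 99)

# Heading of the 'component' relation: two attributes share the domain 'part'.
COMPONENT = (Attribute(PART, "sub"), Attribute(PART, "super"), Attribute(QUANTITY))
```

The value 3 is permitted by both domains, yet sub.part and super.part are the only pair of attributes that are "identical" in the sense of being drawn from the same domain.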

So for the quote Erwin picked up from 1979, "supplier 3" must mean a value 3 appearing in an attribute with role supplier, and distinct from a value 3 appearing in role part or quantity. This is independent of the machine representation of that value; or the more programming-language oriented conceptual data type (PossRep in TTM).

I demur. In my view 'supplier 3' refers to the domain supplier, which is allowed to take the value 3. No role is needed in this case.

The same word is frequently used in precisely the same sense today, by data people (DP) in talking about data. Codd used this valuable idea to distinguish those attributes that could join from those that could not.

Hmmm. That wiki talks about enumerated types and/or values with foreign keys to 'reference data', and yes data people would be thinking of those. I see no evidence Codd had that in mind from the 1970 paper. Perhaps the 1979 paper with its entity types might be thinking about foreign-key linkages (for part, supplier); but there's still  no enumerated types nor reference data.

If Codd has a concept of what attributes can join, I'd expect in the 1972 paper a definition for 'join-compatible' in the same vein as 'union-compatible'. Furthermore where it explains how to get from Cartesian product to natural join by eliminating "redundant domains", it should talk about more than mere equality of values: "In the case of the equi-join, two of the domains of the resulting relation are identical in content. If one of the domains is removed by projection, the result is the natural join of the given relations, as defined in [2]." [p.10]

I confess that I find Codd's definition of union-compatible disappointing. From this one would expect that the part and quantity mentioned above would be 'union-compatible', and surely that cannot be so. My only explanation here is that Codd intended to refine the issue to its essence. But it troubles me. This use of domain does not match my expectations.

Yes, p10 is problematic. How can two domains be 'identical in content' if they are not of the same domain?

Please note that I am not trying to defend Codd. Maybe by this paper he had drifted away from a simple 'set of permitted values' into something different.

That should say "... are identical in content and domain name". Thus Natural Join would be a partial function/operation, same as the union-compatible operations. But that would be needlessly restrictive: although it might seem daft to join part to supplier even though they're both Integer, we might want to join supplier-city to part-stored-city.

... That has changed, but SQL is stuck in a time warp.

Hmm. SQL seems to be stuck at Codd 1972. (Or I should say at an IBM engineers' misunderstanding of Codd 1972.)

IMO the single most striking feature of  TTM is the unification of data domain with programming language type system.

TTM does not use any idea of 'data domain' (the word "domain" barely appears, and only to contrast to SQL). Perhaps that's how enumerated types got overlooked(?) It has only PL type system. Let's not conflate the two.

... There are other issues in TTM, such as the treatment of enumerated types, which are a big deal to DP. The data domain is obvious (Red, Amber, Green), but the programming type is problematic. Is it possible that the DP view and the CP view should not be unified?

I blow hot and cold as to whether the Relational Model needs enumerated types in the database content, as opposed to within a programming language. Red/Amber/Green would I suspect be reference data in the database and foreign keys to them, with additional attributes such as a description or label to show on the screen. A typical move when internationalising an application is to rip out all the hard-coded (pseudo-)enumerated types, and replace them with reference data so that you can offer translated labels. (Reference data is tiny, so you load those tables into memory.)

I won't argue, but it's still clear that many useful attributes are restricted to taking on a small, known set of values, which the DP refer to as 'the domain'. No matter how you present them visually (Red or Rouge or Rot or Rojo or Rosso or a coloured panel) the underlying concept is the same.

  • The database has the job of (a) storing and retrieving the value (b) making sure every value stored is valid (c) storing metadata that reliably connects the permitted set of values to a defined business requirement.
  • The programming language has the job of (a) storing and retrieving all possible values for that domain as a value of a type (b) encoding business logic to manipulate values of that type according to business rules (c) encoding display logic to input and output values of that type according to some cultural expectation.

I don't see that (in general) unifying the database domains (for want of a better term) with a programming language type system is the solution to the problem.

So my question for TFM (The Fourth Manifesto) is whether it is feasible to put some distance between the data domain and the language type, so both the DP and the CP get what they need, and if there are benefits in doing so.

 

If Codd's 'domain' has any use here within a modern nominative type system, it would be to distinguish the PL PossRep (which will be close to the PhysRep) from what I'll call 'NomRep': the PossRep named to some role name; in which the named type shares the PossRep and therefore can share the implementation/overloading of operations -- especially for equi-testing in a Join. So the Poss/PhysRep for part 3 is identical to the Poss/PhysRep for supplier 3, but you can't compare or Join on them because they're distinct nominative types.

Now we need to be canny with RA RENAME: newly-named attributes (say assembly-part vs component-part) must still be comparable; whereas RENAME part to supplier should still leave them non-comparable (or get rejected).

Now I think you're trying to do too much in the PL. We need a DL.

Andl - A New Database Language - andl.org
Quote from dandl on March 15, 2020, 12:54 am

My point in starting this thread was to reject the attribution coming from Chris Date that Codd's 'domain' we should take to mean modern 'data type'. I don't think it's that Codd merely didn't know the term in that sense and would have used it if he did. I think his 'domain' is something different, and that Chris Date is plain wrong. Indeed Chris Date seems to be persistently wrong in a great number of his readings of other authors, and in bending well-established terminology to weird senses.

AFAICT Codd (early papers)  used 'domain' precisely in this sense: https://en.wikipedia.org/wiki/Data_domain: "the values which a data element may contain". The same word is frequently used in precisely the same sense today, by data people (DP) in talking about data. Codd used this valuable idea to distinguish those attributes that could join from those that could not.

In Codd's time, code people (CP) had a very limited idea of programming types, just barely above the physical representation. Codd was thinking as a DP not a CP, so definitely not contemplating a programming language type. That has changed, but SQL is stuck in a time warp.

Exactly.  The prevailing languages of the early yrs of the RM offered nothing at all to distinguish "the number 3 where it is used as a supplier id" from "the number 3 where it is used as anything other".  They also offered nothing to help the system understand that the number 9796 is not a valid account number, because it does not satisfy the complement-modulo-97 rule ( mod(N,100) === 97 - mod([N/100] , 97) ).  And those are exactly the two reasons I see why "domain" had to be something more than just what the languages of the day could offer.
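The complement-modulo-97 rule quoted above can be written out as a short check. This is only a sketch in Python: the function name is made up, and it assumes that [N/100] means integer division, i.e. the number with its last two digits dropped.

```python
def is_valid_account_number(n: int) -> bool:
    """Check mod(N, 100) == 97 - mod(floor(N / 100), 97)."""
    check_digits = n % 100   # the last two digits
    base = n // 100          # everything before the last two digits (assumed reading of [N/100])
    return check_digits == 97 - base % 97
```

For 9796 the base is 97, so the expected check digits are 97 - 0 = 97, whereas the actual last two digits are 96; the number is rejected, as stated. A plain integer type cannot express this rule, which is the point being made about 'domain'.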

Quote from Erwin on March 15, 2020, 11:27 am
Quote from dandl on March 15, 2020, 12:54 am

My point in starting this thread was to reject the attribution coming from Chris Date that Codd's 'domain' we should take to mean modern 'data type'. I don't think it's that Codd merely didn't know the term in that sense and would have used it if he did. I think his 'domain' is something different, and that Chris Date is plain wrong. Indeed Chris Date seems to be persistently wrong in a great number of his readings of other authors, and in bending well-established terminology to weird senses.

AFAICT Codd (early papers)  used 'domain' precisely in this sense: https://en.wikipedia.org/wiki/Data_domain: "the values which a data element may contain". The same word is frequently used in precisely the same sense today, by data people (DP) in talking about data. Codd used this valuable idea to distinguish those attributes that could join from those that could not.

In Codd's time, code people (CP) had a very limited idea of programming types, just barely above the physical representation. Codd was thinking as a DP not a CP, so definitely not contemplating a programming language type. That has changed, but SQL is stuck in a time warp.

Exactly.  The prevailing languages of the early yrs of the RM offered nothing at all to distinguish "the number 3 where it is used as a supplier id" from "the number 3 where it is used as anything other".  They also offered nothing to help the system understand that the number 9796 is not a valid account number, because it does not satisfy the complement-modulo-97 rule ( mod(N,100) === 97 - mod([N/100] , 97) ).  And those are exactly the two reasons I see why "domain" had to be something more than just what the languages of the day could offer.

I think we easily forget that user-defined types are a relatively recent mainstream consideration. It wasn't until the late 90's that you could mention the idea in typical code shops without raised eyebrows or even open scorn, and there are still developers -- both old and new -- who insist (for various reasons) that the only types we need are boolean, int, decimal, timestamp and string -- or some variation thereof -- because user-defined types are too hard for average programmers, type constraints will only come back to bite you, etc., etc.

That said, I will (as usual) suggest that there's little to be gained in trying to (re)interpret old papers as anything except passing interest. Rather than endless re-readings of 70's-era publications, far more useful -- and interesting -- would be to reconsider, from scratch and without reference to old papers, database language design (and perhaps database systems in general) in light of modern type theory, modern language design, and modern requirements.

I'm the forum administrator and lead developer of Rel. Email me at dave@armchair.mb.ca with the Subject 'TTM Forum'. Download Rel from https://reldb.org

Personally I see little point in trying to guess what Codd really meant by "domain".  I do, however, strongly suspect that his understanding of the term "data type" was in terms of physical representations and built-in types only as in languages like Fortran and Cobol.

Does TTM really militate against support for enumerated types?  I don't remember discussing these with Chris.  If it is thought that a special prescription is needed, or some additional text in RM Pre 4, then I can only apologise and plead for liberal interpretation of RM Pre 4.  Tutorial D has no special syntax for such types, but surely an enumeration of a set of values can be coded in a possrep constraint.

What is the typical treatment of enumerated types in strongly-typed languages that support them?  How would Tuesday be denoted as a value of type WEEKDAY?
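As a loose analogue of coding an enumeration as a constraint on a representation, here is a Python sketch. It is not Tutorial D and makes no claim about TTM conformance; the class Weekday and its ALLOWED tuple are hypothetical. The assumption is a type with a single character-string component whose constraint admits exactly seven values.

```python
class Weekday:
    """Hypothetical scalar type: one string component, constrained to seven values."""

    ALLOWED = ("MONDAY", "TUESDAY", "WEDNESDAY", "THURSDAY",
               "FRIDAY", "SATURDAY", "SUNDAY")

    def __init__(self, name: str):
        # The constraint: reject anything outside the enumerated set.
        if name not in self.ALLOWED:
            raise ValueError(f"{name!r} is not a value of type Weekday")
        self.name = name

    def __eq__(self, other):
        return isinstance(other, Weekday) and self.name == other.name

    def __hash__(self):
        return hash(self.name)

    def __repr__(self):
        return f"Weekday({self.name!r})"


# Under this scheme Tuesday is denoted by invoking the selector on a string.
tuesday = Weekday("TUESDAY")
```

Because each value is denoted by a selector applied to a string, the member names never enter the language's identifier namespace.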

Hugh

Coauthor of The Third Manifesto and related books.
Quote from Hugh on March 15, 2020, 12:46 pm

Personally I see little point in trying to guess what Codd really meant by "domain".  I do, however, strongly suspect that his understanding of the term "data type" was in terms of physical representations and built-in types only as in languages like Fortran and Cobol.

Does TTM really militate against support for enumerated types?  I don't remember discussing these with Chris.  If it is thought that a special prescription is needed, or some additional text in RM Pre 4, then I can only apologise and plead for liberal interpretation of RM Pre 4.  Tutorial D has no special syntax for such types, but surely an enumeration of a set of values can be coded in a possrep constraint.

What is the typical treatment of enumerated types in strongly-typed languages that support them?  How would Tuesday be denoted as a value of type WEEKDAY?

Hugh

Pascal was the first language where I saw this supported and it went somewhat like

var WEEKDAY myweekdayvar;
myweekdayvar := TUESDAY;

var SUIT mysuitvar := CLUBS;

which illustrates Anthony's remark about internationalisation, because 'TUESDAY' and 'CLUBS' would also be how those values get displayed.

Quote from Erwin on March 15, 2020, 2:24 pm
Quote from Hugh on March 15, 2020, 12:46 pm

Personally I see little point in trying to guess what Codd really meant by "domain".  I do, however, strongly suspect that his understanding of the term "data type" was in terms of physical representations and built-in types only as in languages like Fortran and Cobol.

Does TTM really militate against support for enumerated types?  I don't remember discussing these with Chris.  If it is thought that a special prescription is needed, or some additional text in RM Pre 4, then I can only apologise and plead for liberal interpretation of RM Pre 4.  Tutorial D has no special syntax for such types, but surely an enumeration of a set of values can be coded in a possrep constraint.

What is the typical treatment of enumerated types in strongly-typed languages that support them?  How would Tuesday be denoted as a value of type WEEKDAY?

Hugh

Pascal was the first language where I saw this supported and it went somewhat like

var WEEKDAY myweekdayvar;
myweekdayvar := TUESDAY;

var SUIT mysuitvar := CLUBS;

which illustrates Anthony's remark about internationalisation, because 'TUESDAY' and 'CLUBS' would also be how those values get displayed.

Hmm.  Obviously there's no way of simulating that in TD, and I can see that to claim TTM conformance for it would depend on a liberal interpretation of RM Pre 4.  Does anything go wrong if such a literal clashes with a keyword in the language?

E.g., var RELOP myrelop := JOIN;

Hugh

Coauthor of The Third Manifesto and related books.

In Pascal, as in most languages with enumerated types, it is necessary to give the names of the members of the enumeration explicitly.  These names are in effect variables that are immutably bound to the enumerated objects themselves.  Thus by writing type weekday = (Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, Sunday), weekday becomes the name of a novel type and the other seven identifiers are bound to the seven unique instances of that type.  The scope of these names is the current procedure, block, or program.  The expression Low(weekday) is equivalent to Monday, and High(weekday) to Sunday.  In addition, Ord(Friday) is equal to 4.

Enumerated types are not usable externally in Pascal; it is necessary to write your own code to convert a string like "Tuesday" or "Dienstag" to (on input) or from (on output) Tuesday.  Java, on the other hand, provides such conversions automatically.

It is also possible to create types that are subsets of existing enumerated types:  type workday = Monday..Friday, for example, creates a subtype workday with only five values, which are the same as the corresponding five values of weekday.  Coercion from workday values to weekday values is therefore automatic.

In Pascal in particular, the boolean type is a predeclared enumerated type with two members, false and true.
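For comparison, a sketch of roughly the same facilities using Python's standard enum module. This is an analogy, not a translation of the Pascal: Python has no subrange types, so the workdays list below is just a slice of the members rather than a subtype, and the explicit numbering from 0 is chosen here to mirror Pascal's Ord.

```python
from enum import Enum


class Weekday(Enum):
    MONDAY = 0
    TUESDAY = 1
    WEDNESDAY = 2
    THURSDAY = 3
    FRIDAY = 4
    SATURDAY = 5
    SUNDAY = 6


members = list(Weekday)          # members come back in definition order
low, high = members[0], members[-1]

# Conversion between the member and its external name is built in.
from_text = Weekday["TUESDAY"]
as_text = Weekday.TUESDAY.name

# No subrange types: this is an ordinary list, not a subtype of Weekday.
workdays = members[Weekday.MONDAY.value : Weekday.FRIDAY.value + 1]
```

Here low is MONDAY, high is SUNDAY, and Weekday.FRIDAY.value is 4, matching the ordinal numbering described above.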

Quote from Erwin on March 15, 2020, 2:24 pm
Quote from Hugh on March 15, 2020, 12:46 pm

Personally I see little point in trying to guess what Codd really meant by "domain".  I do, however, strongly suspect that his understanding of the term "data type" was in terms of physical representations and built-in types only as in languages like Fortran and Cobol.

Does TTM really militate against support for enumerated types?  I don't remember discussing these with Chris.  If it is thought that a special prescription is needed, or some additional text in RM Pre 4, then I can only apologise and plead for liberal interpretation of RM Pre 4.  Tutorial D has no special syntax for such types, but surely an enumeration of a set of values can be coded in a possrep constraint.

What is the typical treatment of enumerated types in strongly-typed languages that support them?  How would Tuesday be denoted as a value of type WEEKDAY?

Hugh

Pascal was the first language where I saw this supported and it went somewhat like

var WEEKDAY myweekdayvar;
myweekdayvar := TUESDAY;

var SUIT mysuitvar := CLUBS;

which illustrates Anthony's remark about internationalisation, because 'TUESDAY' and 'CLUBS' would also be how those values get displayed.

Such values should never be displayed, except perhaps to a developer for debugging purposes. Enumerated types are intended to be internal to program code; useful for things like constructing state machines, where the current state of the machine is -- and can only be -- one of the values of an enumerated type. They should represent values confined to the computational domain, never the user domain.

I saw a case recently where testing revealed that a raw enumeration value could "leak" out of a system and appear in end-user messages. This set off an immediate round of inquiries and requests for fixes, as raw enumeration text should never appear outside the system. The fact that the "end-user" in this case could only ever be another developer was not a mitigating factor. Any such value must be appropriately translated into a human-friendly message and transformed by the i18n layer, and given consideration -- if there seems some compulsion to expose it to users -- as to whether or not it should be an enumerated type at all.
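A minimal sketch of that separation, in Python. Everything named here (OrderState, MESSAGES, user_message, the locales and the wording) is invented for the example; a real system would presumably use its own message catalogue rather than a dict.

```python
from enum import Enum


class OrderState(Enum):
    """Hypothetical internal state; its names are for code, not for users."""
    PENDING = "PENDING"
    SHIPPED = "SHIPPED"


# Stand-in for an i18n message catalogue, keyed by locale and then by state.
MESSAGES = {
    "en": {
        OrderState.PENDING: "Your order is being prepared.",
        OrderState.SHIPPED: "Your order has shipped.",
    },
    "de": {
        OrderState.PENDING: "Ihre Bestellung wird vorbereitet.",
        OrderState.SHIPPED: "Ihre Bestellung wurde versandt.",
    },
}


def user_message(state: OrderState, locale: str = "en") -> str:
    """Translate an internal state into user-facing text; fall back to English."""
    catalogue = MESSAGES.get(locale, MESSAGES["en"])
    return catalogue[state]
```

The raw text "SHIPPED" never reaches the user; only the translated message does.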

I'm the forum administrator and lead developer of Rel. Email me at dave@armchair.mb.ca with the Subject 'TTM Forum'. Download Rel from https://reldb.org

I think we easily forget that user-defined types are a relatively recent mainstream consideration. It wasn't until the late 90's that you could mention the idea in typical code shops without raised eyebrows or even open scorn, and there are still developers -- both old and new -- who insist (for various reasons) that the only types we need are boolean, int, decimal, timestamp and string -- or some variation thereof -- because user-defined types are too hard for average programmers, type constraints will only come back to bite you, etc., etc.

My recollection is that we were writing code in C in the late 1980s with an evolving sense of defining new types. We had typedef, struct, union and macros, and we adopted a method-like coding convention whereby the first argument to a function was a pointer to struct. Pascal had scalar subranges, C had typedefs, Java/C# have neither. I miss them.

That said, I will (as usual) suggest that there's little to be gained in trying to (re)interpret old papers as anything except passing interest. Rather than endless re-readings of 70's-era publications, far more useful -- and interesting -- would be to reconsider, from scratch and without reference to old papers, database language design (and perhaps database systems in general) in light of modern type theory, modern language design, and modern requirements.

I agree. Mea culpa.

To return to my theme:

  • the DP view is that a domain is the permitted set of values for an attribute. It can be scalar (mainly numbers, strings, dates and subranges thereof) or structured (having field members or being a collection, recursively), and subject to a variety of rules.
  • the CP view is that a type is a named set of values which can be used as the arguments or return value of functions/operators. It should be based on a familiar data type for easy manipulation by standard libraries.

There is no particular reason for these views to be unified, but every reason they should be interoperable.

RM Pre 23 defines a type constraint and I think this may be a mistake. The real requirement is a domain constraint, to limit what can be stored as an attribute in the database. That is a critical DP need, but there is no need to impose that limit on the CP view or the type system.

To give an example, the DP might want to constrain the contents of an attribute to be a 'date that is a weekday' and another to be a 'date that is a weekend'. The CP will retrieve that value as a simple date and write code to manipulate it, but check any value against the database domain constraint before storing a new value.
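Sketched in Python, with invented attribute names (delivery_date, maintenance_date) and an in-memory list standing in for the database. The only assumption beyond the paragraph above is that the constraint is looked up per attribute and checked at the point of storing.

```python
from datetime import date


def is_weekday(d: date) -> bool:
    # date.weekday() numbers Monday as 0 through Sunday as 6.
    return d.weekday() < 5


def is_weekend(d: date) -> bool:
    return d.weekday() >= 5


# Domain constraints belong to the attribute, not to the date type itself.
DOMAIN_CONSTRAINTS = {
    "delivery_date": is_weekday,      # hypothetical attribute
    "maintenance_date": is_weekend,   # hypothetical attribute
}


def store(table: list, attribute: str, value: date) -> None:
    """Append a one-attribute row after checking the attribute's domain constraint."""
    if not DOMAIN_CONSTRAINTS[attribute](value):
        raise ValueError(f"{value} is outside the domain of {attribute}")
    table.append({attribute: value})
```

The code side works with a plain date throughout; the same date value can be acceptable for one attribute and rejected for another.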

It's pervasive. The DP might define an attribute as taking on the values "1,2,3,4,5,NA,MISSING", which the CP would prefer to code for as the integer values "1,2,3,4,5,-99,0". [There might be a 'survey result' data type if this is a common need, but the specific data values will still vary from one case to another.]
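The mapping in that survey example could look something like this in Python. The dictionary and function names are made up; the codes -99 and 0 are the ones given above.

```python
# Domain values as the DP sees them, mapped to the integer codes the CP prefers.
TO_CODE = {
    "1": 1, "2": 2, "3": 3, "4": 4, "5": 5,
    "NA": -99,
    "MISSING": 0,
}
FROM_CODE = {code: value for value, code in TO_CODE.items()}


def to_code(domain_value: str) -> int:
    """Convert a domain value to its integer code; reject anything not in the domain."""
    if domain_value not in TO_CODE:
        raise ValueError(f"{domain_value!r} is not in the domain")
    return TO_CODE[domain_value]


def from_code(code: int) -> str:
    """Convert an integer code back to the domain value."""
    if code not in FROM_CODE:
        raise ValueError(f"{code} does not correspond to a domain value")
    return FROM_CODE[code]
```

The code works with plain integers, and the domain check happens only at the boundary where values are converted.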

The DP might define "day of the week in national language" which the CP wants as "0,1,2,3,4,5,6". No code will ever check 'is this Tuesday?' so an enumerated type is pointless. Integer does fine.

The DP wants "future date, local timezone, local format", the CP just wants an object of type Date, with access to a vast range of useful library functions.

A domain (modern usage, forget Codd) is not a type.

Andl - A New Database Language - andl.org