The Forum for Discussion about The Third Manifesto and Related Matters

Codd 1970 'domain' does not mean Date 2016 'type' [was: burble about Date's IM]

Hmm.  Obviously there's no way of simulating that in TD, and I can see that to claim TTM conformance for it would depend on a liberal interpretation of RM Pre 4.  Does anything go wrong if such a literal clashes with a keyword in the language?

E.g., var RELOP myrelop := JOIN;

It's worse than that. It's perfectly sensible to have an enumeration of values that cannot be represented in a programming language, such as 'the punctuation characters' or words such as office-holder or maître-d’ or thirdmanifesto.com. Enums in practice are often an alias for the 'real' value.

Andl - A New Database Language - andl.org
Quote from Dave Voorhis on March 15, 2020, 11:42 am
Quote from Erwin on March 15, 2020, 11:27 am
Quote from dandl on March 15, 2020, 12:54 am

My point in starting this thread was to reject the attribution coming from Chris Date that Codd's 'domain' we should take to mean modern 'data type'. I don't think it's that Codd merely didn't know the term in that sense and would have used it if he did. I think his 'domain' is something different, and that Chris Date is plain wrong. Indeed Chris Date seems to be persistently wrong in a great number of his readings of other authors, and in bending well-established terminology to weird senses.

AFAICT Codd (early papers)  used 'domain' precisely in this sense: https://en.wikipedia.org/wiki/Data_domain: "the values which a data element may contain". The same word is frequently used in precisely the same sense today, by data people (DP) in talking about data. Codd used this valuable idea to distinguish those attributes that could join from those that could not.

In Codd's time, code people (CP) had a very limited idea of programming types, just barely above the physical representation. Codd was thinking as a DP not a CP, so definitely not contemplating a programming language type. That has changed, but SQL is stuck in a time warp.

Exactly.  The prevailing languages of the early years of the RM offered nothing at all to distinguish "the number 3 where it is used as a supplier id" from "the number 3 where it is used as anything else".  They also offered nothing to help the system understand that the number 9796 is not a valid account number, because it does not satisfy the complement-modulo-97 rule ( mod(N,100) === 97 - mod([N/100] , 97) ): here mod(9796,100) is 96, whereas 97 - mod(97,97) is 97.  And those are exactly the two reasons I see why "domain" had to be something more than just what the languages of the day could offer.
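
A minimal sketch of the second point, written in Haskell rather than in any language of that era. The names AccountNo and mkAccountNo are made up for illustration, and restricting to non-negative integers is an assumption of the sketch, not part of the rule as stated:

```haskell
-- Hypothetical account-number "domain": an integer plus a check-digit constraint.
newtype AccountNo = AccountNo Integer
  deriving (Eq, Show)

-- Accept n only if mod(n,100) equals 97 - mod(n div 100, 97).
-- Anything else (including negative input, by assumption) is rejected.
mkAccountNo :: Integer -> Maybe AccountNo
mkAccountNo n
  | n >= 0 && n `mod` 100 == 97 - (n `div` 100) `mod` 97 = Just (AccountNo n)
  | otherwise                                            = Nothing
```

With this, mkAccountNo 9796 gives Nothing, while mkAccountNo 9797 gives Just (AccountNo 9797). If the AccountNo constructor were not exported from its module, the constraint could not be bypassed.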

I think we easily forget that user-defined types are a relatively recent mainstream consideration. It wasn't until the late 90's that you could mention the idea in typical code shops without raised eyebrows or even open scorn, and there are still developers -- both old and new -- who insist (for various reasons) that the only types we need are boolean, int, decimal, timestamp and string -- or some variation thereof -- because user-defined types are too hard for average programmers, type constraints will only come back to bite you, etc., etc.

That said, I will (as usual) suggest that there's little to be gained in trying to (re)interpret old papers as anything except passing interest.

Said by the man with the retirement project of trying to (re)interpret Childs 1968.

Rather than endless re-readings of 70's-era publications, far more useful -- and interesting -- would be to reconsider, from scratch and without reference to old papers, database language design (and perhaps database systems in general) in light of modern type theory, modern language design, and modern requirements.

I'm not advocating trying to implement ideas from that era, even if the papers were wonderfully clear -- which Codd isn't, and neither is Childs. I am interested in what problem/difficulty for the RM Codd was struggling with. I think I have some understanding. I'm sure SQL doesn't address it. I'm pretty sure TTM doesn't address it. I can be pretty sure modern type theory/language design doesn't get near it either, and certainly not the self-proclaimed replacements for SQL.

BTW there's plenty of other disciplines going back over old material/thinkers. I think your attitude is just plain wrong. For instance after the GFC, which was amongst other things an epic failure of NeoLiberal Economic Theory, people have been going back to Keynes 1936 and even Adam Smith 1759 -- that is not the Smith treatise that right-wingers quote from, but the earlier Theory of Moral Sentiments that explains the psychology of markets. In Linguistics, if you want to understand Chomsky, you're far better reading 1957 or even the earlier 'Three Models' than his more recent turgid impenetrable stuff. And Linguisticians trying to recover from the evil dominion of Chomsky are going back to de Saussure, Jespersen, Jakobson, even Bloomfield, J.R.Firth, M.A.K.Halliday.

Modern is not better; with 'progress' it's two steps forward and one back, as my very wise grandmother (b. 1896) used to say.

Quote from Hugh on March 15, 2020, 12:46 pm

Personally I see little point in trying to guess what Codd really meant by "domain".

So please explain how TTM distinguishes supplier 3 from part 3, given that people will give Integer identifiers to pretty much everything. Or should they not do that? Then where does TTM proscribe it?

I do, however, strongly suspect that his understanding of the term "data type" was in terms of physical representations and built-in types only as in languages like Fortran and Cobol.

As I said to Erwin, I don't think he had the least care for the term "data type" or whatever it meant. What evidence can you provide that "his understanding" denotes something?

Does TTM really militate against support for enumerated types?  I don't remember discussing these with Chris.

Yes TTM does militate against. I think a relevant question would be: does PL/I support enumerated types, or rather did it when Chris was designing Tutorial D?

How does it militate against? The values within an enumerated type (and hence their PossRep) are distinct from any other values in any other type, and are created 'out of thin air' by declaring the type. Their PhysRep might well be the same as some existing type (typically Integer), but that's opaque from the users' point of view.

If it is thought that a special prescription is needed, or some additional text in RM Pre 4, then I can only apologise and plead for liberal interpretation of RM Pre 4.  Tutorial D has no special syntax for such types, but surely an enumeration of a set of values can be coded in a possrep constraint.

I think enumerated types could be mimicked with some of the abuse of the IM/Union type that Dave has shown. It would suffer from excessive circumlocution, compared to languages with proper support. (You would have to choose the PhysRep as, say, Integer, and constrain that to 0, 1, 2, ... for each distinct value/distinct type in the enumeration; then Union all those individual value/types.)

What is the typical treatment of enumerated types in strongly-typed languages that support them?  How would Tuesday be denoted as a value of type WEEKDAY?

Take this Haskell declaration for example:

data WeekDay = Monday | Tuesday | Wednesday | Thursday | Friday | Saturday | Sunday
               deriving (Eq, Ord, Enum, Show, Read);

data says I'm declaring a datatype; WeekDay is the type name; the tags separated by | give the values in the set, in ascending sequence; deriving ( ... ) asks to automatically generate overloadings for library-defined operators/functions. Eq, Ord says we can compare two WeekDay values for Equality or Ordering; Enum says we can map a WeekDay to/from an Int (if we want to do arithmetic) and step to the next or previous value; Upper and Lower bound values come from Bounded, which would need adding to the deriving list; Show gives a toString( ) method, which'll merely be the tag name; Read gives a fromString( ) method, inverse of toString( ).
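
To make that concrete, here is a small sketch of what the derived instances give you. Note one assumption: Bounded has been added to the deriving list (it is not in the declaration above) so that minBound and maxBound are available; the names on the left of each definition are invented for the example.

```haskell
data WeekDay = Monday | Tuesday | Wednesday | Thursday | Friday | Saturday | Sunday
               deriving (Eq, Ord, Enum, Bounded, Show, Read)

-- Ord: values compare in declaration order.
tuesdayBeforeFriday :: Bool
tuesdayBeforeFriday = Tuesday < Friday

-- Enum: zero-based position in the declaration, and successor/predecessor.
tuesdayIndex :: Int
tuesdayIndex = fromEnum Tuesday          -- 1

dayAfterTuesday :: WeekDay
dayAfterTuesday = succ Tuesday           -- Wednesday

-- Show and Read: the tag name as text, and back again.
tuesdayText :: String
tuesdayText = show Tuesday               -- "Tuesday"

parsedDay :: WeekDay
parsedDay = read "Tuesday"

-- Bounded (the added class): enumerate every value of the type.
allDays :: [WeekDay]
allDays = [minBound .. maxBound]
```

Nothing here had to be written by hand beyond the declaration itself; an assignment such as Tuesday = Friday is simply not expressible.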

Note that Tuesday is not drawn from some pre-existing pool of values, as TTM explains types. Rather it is brought into existence as a value by appearing in the declaration. It does conform to RM Pre 2 in that Tuesday "carries with it" an identification that it is type WeekDay, and only that type.

John mentions that Pascal supports a subrange type. That was actually removed from the language. (Haskell doesn't support such a thing.)

 

Quote from Hugh on March 15, 2020, 5:34 pm
Quote from Erwin on March 15, 2020, 2:24 pm
Quote from Hugh on March 15, 2020, 12:46 pm

...

What is the typical treatment of enumerated types in strongly-typed languages that support them?  How would Tuesday be denoted as a value of type WEEKDAY?

Pascal was the first language where I saw this supported and it went somewhat like

var WEEKDAY myweekdayvar;
myweekdayvar := TUESDAY;

var SUIT mysuitvar := CLUBS;

which illustrates Anthony's remark about internationalisation, because 'TUESDAY' and 'CLUBS' would also be how those values get displayed.

Note that before you can declare a var of type WEEKDAY or assign value TUESDAY to it, you must declare together the type and its set of values. As Erwin says, to internationalise it would be problematic, unless the language offered some sort of alias system.

Hmm.  Obviously there's no way of simulating that in TD, and I can see that to claim TTM conformance for it would depend on a liberal interpretation of RM Pre 4.  Does anything go wrong if such a literal clashes with a keyword in the language?

To answer the last question first, yes: TUESDAY must not be a keyword, nor appear in any other user declaration (as a var name, for example). And yes that's problematic if you want to use the same name for different purposes, typical example:

data NumBase = Bin | Oct | Dec | Hex;
data Month = Jan | Feb | ... | Oct | Nov | Dec;   -- Oct, Dec clash

To simulate in Rel (I'm guessing a bit)

TYPE TUESDAY INTTUESDAY(x INT) CONSTRAIN x = 1   // zero-base for MONDAY
             POSSREP TUESDAY() = INTTUESDAY(1);  // niladic Selector
... // likewise for other days of week

TYPE WEEKDAY UNION { MONDAY, TUESDAY, ..., SUNDAY }

The 'hidden implementation' Selector INTTUESDAY is a dummy to specify the based-on type and constrain its value. The publicly visible Selector TUESDAY() is niladic.

There are some languages (ALGOL 68, Ada I think, and UK MoD Coral) that allow you to declare the underlying types and values for enumerations, rather than the language semantics determining it. In the case of Coral, it's for specifying hardware interfaces where you need to map from a program-visible meaningful name to a port number or interrupt code, etc. So to do that in Rel (if you can) isn't beyond the pale; but it's going a long way into implementation detail beyond what TTM wants for abstract PossReps. (And for example couldn't prevent a malicious actor allocating random numbers to days of the week, or indeed exactly the same number to each of them. So Joining/Filtering would get messy.)
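
As a rough illustration of hand-chosen underlying values (in Haskell, not Coral or Rel), the sketch below maps symbolic names to codes via an ordinary function. The type Device, the functions portCode, fromPortCode and codesAreDistinct, and the numbers themselves are all invented for the example:

```haskell
import Data.List (nub)

data Device = Timer | Serial | Keyboard
              deriving (Eq, Ord, Enum, Bounded, Show)

-- Programmer-chosen underlying codes (made-up values).
portCode :: Device -> Int
portCode Timer    = 0x40
portCode Serial   = 0x3F8
portCode Keyboard = 0x60

-- Inverse mapping, built by enumerating every Device.
fromPortCode :: Int -> Maybe Device
fromPortCode n = lookup n [ (portCode d, d) | d <- [minBound .. maxBound] ]

-- The language does not stop two names sharing a code, so check it explicitly.
codesAreDistinct :: Bool
codesAreDistinct = length codes == length (nub codes)
  where codes = map portCode [minBound .. maxBound]
```

The explicit codesAreDistinct check is exactly the sort of thing that is unnecessary when the language allocates the values itself.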

Quote from Dave Voorhis on March 15, 2020, 8:25 pm
Quote from Erwin on March 15, 2020, 2:24 pm
Quote from Hugh on March 15, 2020, 12:46 pm

Personally I see little point in trying to guess what Codd really meant by "domain".  I do, however, strongly suspect that his understanding of the term "data type" was in terms of physical representations and built-in types only as in languages like Fortran and Cobol.

Does TTM really militate against support for enumerated types?  I don't remember discussing these with Chris.  If it is thought that a special prescription is needed, or some additional text in RM Pre 4, then I can only apologise and plead for liberal interpretation of RM Pre 4.  Tutorial D has no special syntax for such types, but surely an enumeration of a set of values can be coded in a possrep constraint.

What is the typical treatment of enumerated types in strongly-typed languages that support them?  How would Tuesday be denoted as a value of type WEEKDAY?

Hugh

Pascal was the first language where I saw this supported and it went somewhat like

var WEEKDAY myweekdayvar;
myweekdayvar := TUESDAY;

var SUIT mysuitvar := CLUBS;

which illustrates Anthony's remark about internationalisation, because 'TUESDAY' and 'CLUBS' would also be how those values get displayed.

Such values should never be displayed, except perhaps to a developer for debugging purposes.

Sheesh you're being argumentative. Should a DBMS display a Supplier number as 3 to business users? Or should it always display a meaningful name 'Acme Supply Co' ? If Acme deliver on TUESDAY, what should (say) the delivery docket show if not "TUESDAY"? There's plenty of intensive users who prefer numeric codes over verbosity.

Strictly speaking with enumerated types, yes TUESDAY is a program-internal symbol. (John above calls it a variable. I demur: it's a manifest constant, and your language should prevent nonsense like TUESDAY := FRIDAY, no matter how quickly I'd sometimes like to get the week over ;-). Then strictly speaking your language should support a toString( ) method for type WEEKDAY; but give us a break, can't the compiler fill in the result "TUESDAY" for us?

Enumerated types are intended to be internal to program code; useful for things like constructing state machines, where the current state of the machine is -- and can only be -- one of the values of an enumerated type. They should represent values confined to the computational domain, never the user domain.

Supplier 3 delivers on TUESDAY. That's user domain. As I mused above, perhaps TUESDAY should be reference data in a relvar (so we can hold translations for it), but then we don't get the type guarantees from using an enumerated type. Or are you (gasp!) advocating for an ORM layer that maps, em "TUESDAY" to TUESDAY and back? Give us a break.

I saw a case recently where testing revealed that a raw enumeration value could "leak" out of a system and appear in end-user messages. This set off an immediate round of inquiries and requests for fixes, as raw enumeration text should never appear outside the system. The fact that the "end-user" in this case could only ever be another developer was not a mitigating factor. Any such value must be appropriately translated into a human-friendly message and transformed by the i18n layer, and given consideration -- if there seems some compulsion to expose it to users -- as to whether or not it should be an enumerated type at all.

What do you mean by "raw enumeration"? Was that a name TUESDAY or the PhysRep 1? I agree the PhysRep should not leak out -- even to developers/debuggers.

What's wrong with a (business) user-visible type being from an enumeration? Are businesses allowed to invent extra days of the week or something 'as time-varying data'? What's the benefit to anybody of insisting on extra machinery to map from enumeration value TUESDAY to String "Tuesday" and back? That's the sort of machinery compilers are really good at and programmers not.
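
For what it's worth, a sketch of how little machinery that mapping needs in a language with enumerated types. This is Haskell, and Lang, label and parseLabel are hypothetical names; only English and French are covered, purely as an example:

```haskell
data WeekDay = Monday | Tuesday | Wednesday | Thursday | Friday | Saturday | Sunday
               deriving (Eq, Ord, Enum, Bounded, Show, Read)

-- Hypothetical set of display languages.
data Lang = En | Fr
            deriving (Eq, Show)

-- Display text for a day. English reuses the compiler-derived Show;
-- French needs a hand-written table, one equation per value.
label :: Lang -> WeekDay -> String
label En d         = show d
label Fr Monday    = "lundi"
label Fr Tuesday   = "mardi"
label Fr Wednesday = "mercredi"
label Fr Thursday  = "jeudi"
label Fr Friday    = "vendredi"
label Fr Saturday  = "samedi"
label Fr Sunday    = "dimanche"

-- The way back is derived from label by enumerating the type,
-- so the two directions cannot drift apart.
parseLabel :: Lang -> String -> Maybe WeekDay
parseLabel lang s = lookup s [ (label lang d, d) | d <- [minBound .. maxBound] ]
```

Because label is defined by cases over a closed type, a compiler with incomplete-pattern warnings enabled can flag a missing day, which a table of reference data in a relvar cannot promise.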

Quote from AntC on March 16, 2020, 1:26 am

Yes TTM does militate against. I think a relevant question would be: does PL/I support enumerated types, or rather did it when Chris was designing Tutorial D?

ORDINAL types were not in the 1976 ANSI standard, but were present in the 1992 IBM "new" compiler for OS/2, AIX, Linux, and z/OS.  They were based directly on Pascal enumerations, with a few additions: operators to deliver the next and previous element, for example.  I don't know when TD was designed.

John mentions that Pascal supports a subrange type. That was actually removed from the language.

If by "the language" you mean Pascal, then that turns out not to be the case: it's obviously still in the standards and is supported by GNU Pascal.

Modern is not better; with 'progress' it's two steps forward and one back, as my very wise grandmother (b. 1896) used to say.

Modern is better to the extent that it has more and better facts. Core theories in physics and chemistry benefit enormously from the wealth of experimental data gathered over the years. Eratosthenes was spot on with his method for measuring the circumference of the Earth, but his final result was based on wrong facts.

Modern is better to the extent that it has access to better tools. Whether it's making physical objects to atomic precision or performing a Pluto flypast, or the vast libraries of mathematical and computing tools, the gap between the Greeks and us now is vast.

But the music Bach wrote 300 years ago is as good now as ever, and probably has more listeners now than ever. So while there is something in what you say, for most things modern is indeed better; just not all.

Andl - A New Database Language - andl.org
Quote from Dave Voorhis on March 15, 2020, 11:42 am
Quote from Erwin on March 15, 2020, 11:27 am
Quote from dandl on March 15, 2020, 12:54 am

My point in starting this thread was to reject the attribution coming from Chris Date that Codd's 'domain' we should take to mean modern 'data type'. I don't think it's that Codd merely didn't know the term in that sense and would have used it if he did. I think his 'domain' is something different, and that Chris Date is plain wrong. Indeed Chris Date seems to be persistently wrong in a great number of his readings of other authors, and in bending well-established terminology to weird senses.

...

...

...

That said, I will (as usual) suggest that there's little to be gained in trying to (re)interpret old papers as anything except passing interest. Rather than endless re-readings of 70's-era publications, far more useful -- and interesting -- would be to reconsider, from scratch and without reference to old papers, database language design (and perhaps database systems in general) in light of modern type theory, modern language design, and modern requirements.

There's a couple of concepts (vaguely expressed) in old papers that I would put squarely in the, er ... domain of 'modern type theory'; and which fit exactly modern requirements for anything based on the RM. Only AFAICT there's no type theory that comes anywhere near supporting them. If you're aware (of either a supporting theory or a theoretical explanation for why they're incoherent), please explain.

  1. Codd 1970 footnote 2 "In mathematical terms, a relationship is an equivalence class of those relations that are equivalent under permutation of domains" [my emphasis]
    Leaving aside the bulk of the terminology, which is 'skunked', and concentrating on those underlined words: where is a modern type theory that can account for two data structures being of different datatype, but being in the same 'equivalence class'? Obvious examples in the RM: position-based tuples (in the modern programming language sense) with the same content, and same types for that content, but in different positions; two collections/sets, one whose elements are the datatype of one such tuple, the other of the datatype of an equivalence-class tuple, where we want to merge/union those collections.
    We might say in TTM terminology an 'equivalence class' gives same PossRep, but different PhysRep -- which is backwards relative to RM Pre 4.
  2. Childs 1968 'indexed set' (or we might say 'indexed collection' in more modern terminology)
    in which each element has an index that uniquely identifies the element: attribute name identifies elements of a RM-style tuple (and you can't have a tuple with repeated attribute name even if the <A, v> full element values are distinct); key identifies tuples within a relation value.

You could meet those requirements with run-time values being association lists (and that's probably what Codd was thinking if he thought about implementation at all, and it is how Childs appears to have implemented it, in compact form). But:

  • that could easily fail at run-time because the association lists don't correspond;
  • there's no static type-level guarantees that operations will succeed (and yet we know all the applicable structure at compile time);
  • it's horribly inefficient to walk down association lists matching across between different structures;
  • it's horribly wasteful of store to keep tag values and pointers in the structure when we know the tags and their structure at compile time, and could apply type erasure semantics to deliver a far more compact vector-based data format with the structure represented as 'phantom types'. (Childs appears to have done some of that, by 'factoring out' the index part of the association structure.)

(Yes those bullets are double-counting.)
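
To show what I mean by the run-time approach, here is a minimal sketch in Haskell using Data.Map from the containers package as the 'association list'. Value, Tuple, joinTuples and the attribute names are assumptions made up for the example, not anything from Codd or Childs:

```haskell
import qualified Data.Map.Strict as Map

-- Run-time tagged values: the attribute's type is only known dynamically.
data Value = IntV Int | StrV String
  deriving (Eq, Ord, Show)

-- A tuple as an indexed set: the attribute name is the index,
-- so a name cannot appear twice.
type Tuple = Map.Map String Value

-- The same content listed in two different orders ("permutation of domains").
t1, t2 :: Tuple
t1 = Map.fromList [("S#", IntV 3), ("CITY", StrV "London")]
t2 = Map.fromList [("CITY", StrV "London"), ("S#", IntV 3)]

-- Merge two tuples if they agree on every attribute they share.
-- A mismatch is only discovered when this runs.
joinTuples :: Tuple -> Tuple -> Maybe Tuple
joinTuples a b
  | and (Map.elems (Map.intersectionWith (==) a b)) = Just (Map.union a b)
  | otherwise                                       = Nothing
```

Here t1 == t2 holds, so the permutation problem goes away, but joining t1 with a tuple that holds StrV "3" under "S#" type-checks happily and only yields Nothing at run time. The attribute names and value tags are also carried in every tuple, which is the store overhead complained about above.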

Now I did "reconsider from scratch" my dissatisfaction with the TTM model and modern type theory, and I could put my finger on what I thought was wrong. But it wasn't until I made "reference to old papers" that I began to see a glimmer of what might be an alternative approach. Perhaps I'm just not very clever or imaginative or widely-read enough to figure it out. OTOH why try to reinvent the wheel when Codd and Childs have already been thinking about it?

Quote from AntC on March 16, 2020, 1:26 am
Quote from Hugh on March 15, 2020, 12:46 pm

Personally I see little point in trying to guess what Codd really meant by "domain".

So please explain how TTM distinguishes supplier 3 from part 3, given that people will give Integer identifiers to pretty much everything. Or should they not do that? Then where does TTM proscribe it?

Surely you know that it has never been an aim of TTM to prescribe or proscribe user behaviour.  UDT support allows users to make such distinctions if they so wish.
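
Purely as an illustration of such a user-made distinction (in Haskell, not Tutorial D), with SupplierNo, PartNo and sameSupplier being names assumed for this sketch:

```haskell
-- Two distinct user-defined types over the same underlying integers.
newtype SupplierNo = SupplierNo Int deriving (Eq, Ord, Show)
newtype PartNo     = PartNo Int     deriving (Eq, Ord, Show)

supplier3 :: SupplierNo
supplier3 = SupplierNo 3

part3 :: PartNo
part3 = PartNo 3

-- Comparison is only defined within one type.
sameSupplier :: SupplierNo -> SupplierNo -> Bool
sameSupplier = (==)

-- The expression  supplier3 == part3  is rejected by the type checker,
-- even though both wrap the integer 3.
```

Whether to wrap identifiers this way remains the user's choice; nothing obliges them to.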

Hugh

Coauthor of The Third Manifesto and related books.
Quote from AntC on March 16, 2020, 10:04 am
Now I did "reconsider from scratch" my dissatisfaction with the TTM model and modern type theory, and I could put my finger on what I thought was wrong. But it wasn't until I made "reference to old papers" that I began to see a glimmer of what might be an alternative approach. Perhaps I'm just not very clever or imaginative or widely-read enough to figure it out. OTOH why try to reinvent the wheel when Codd and Childs have already been thinking about it?

That's certainly reasonable, and I wasn't intending to single you out for abuse. I'm deprecating a general tendency toward academic nostalgia that I've seen here and elsewhere, which involves endless re-reading of some seminal paper(s), desperate mining for insights that never come, ever more questionable reinterpretations, and a complete disregard -- and sometimes scorn -- for anything and everything newer.

I'm the forum administrator and lead developer of Rel. Email me at dave@armchair.mb.ca with the Subject 'TTM Forum'. Download Rel from https://reldb.org