Has the RM really been defined completely ?

#1 · March 25, 2019, 8:40 am

In 1980, Codd defined a "data model" to be something that provided, in summary, all features necessary for "structure, integrity and manipulation" of [data/information present in] the database. The RM's answer has typically always been "relations, constraints defined using expressions of the relational algebra [ASSERTIONs], and relational algebra.".

However, it occurred to me that "manipulation" must involve/include/incorporate both the retrieval activity as the updating activity. And relational algebra, in its capacity as a mere system of computation starting from [relation] values that are given, has nothing to offer in the realm of database updating.

Codd kind of sidestepped this issue by accepting that databases would be kept in the old technologies and relational views just provided as a layer on top of them, so his model didn't need to address database updating.

TTM of course addresses this by seeing the database as a variable, or set thereof, but [f]actually I have never seen very much in the way of mathematical formalization of the concept. E.g. there might be mathematical proofs that such a system must necessaily involve the higher-order logics that DMG seeks so desperately to avoid (and AFAICT seeking the solution in the observation that database updating is in fact function application).

Is there any sense in these observations, and if so, does it mean the title question is not entirely unwarranted ?

Author of SIRA_PRISE

#2 · March 25, 2019, 2:11 pm

Hasn't computer science in general already said enough about variables and assignment?

Hugh

Coauthor of The Third Manifesto and related books.

#3 · March 25, 2019, 6:24 pm

Perhaps, but then it's rather curious that a thing such as the assignment principle is presented to some extent as an important finding ...

No ?

Author of SIRA_PRISE

#4 · March 26, 2019, 12:19 am

Quote from AntC on March 26, 2019, 12:19 am

Quote from Erwin on March 25, 2019, 6:24 pm

Perhaps, but then it's rather curious that a thing such as the assignment principle is presented to some extent as an important finding ...

Its "finding" is important only because Date & McGoveran's 1994 view updating rules could produce updates that broke the principle.

Something equivalent to the 'Assignment Principle' (how could you write it without caps?) was already explicit in Dayal&Bernstein 1978, just not with such a grandiloquent name. D&McG 1994 was a significant step backwards, in very many respects.

McGoveran has since written that his co-author misunderstood what he [McG] was trying to do in 1994. And, as you acknowledge, has eschewed the whole idea of variables/assignment as being outside FOL.

Codd kind of sidestepped this issue by accepting that databases would be kept in the old technologies and relational views just provided as a layer on top of them, so his model didn't need to address database updating.

Hmm. Codd held various positions at various times. One of those positions was that all relations must have a Primary Key; and that the UPDATE operation must not change the Primary Key. There are hints of that thinking in some of the '12 rules' [1985]. And IIRC the 1990 text had all sorts of RM-mixed-up-with-Entity-Relationship-theory that said the Primary Key was a persistent characteristic of the Entity. (Sorry to be vague: I don't pay much attention to it. But it seems to be a significant delusion for Fabian Pascal in particular.)

Then yes UPDATE looks very like "old technologies" file/record updates; and can't be interpreted as equivalent to some assignment holus bolus to a var. The other advantage is we can talk about concurrent computing: different processes UPDATing different rows, again without assigning a value to the whole var/database. (And that feeds nicely into all the theory on serializability.)

[DMG] seeking the solution in the observation that database updating is in fact function application

You can get that thinking to work in a single user/single thread programming environment: some function operates on a value of type database, returning a value of type database; in a recursive process that takes update or retrieval commands as an extra argument.

IMO that thinking breaks down when either the program terminates but you want the 'final' value of type database to persist; or in a multi-user environment where other users want access to the time-varying value of type database.

You could say: there's a database server process; all users make update/retrieval requests to it; it returns results via messaging. I don't like that: how do you explain what happens when the server/machine is shut down, but the database value persists? How do you avoid the need to talk about the program internals of the server process? In my thinking, the data structures come first; the processes manipulating them are secondary.

Also how do you explain schema changes such as adding/dropping attributes to a relvar or adding/dropping relvars in a dbvar? I don't know of any programming language theory that explains type-changing assignment to vars -- particularly vars that are shared amongst concurrent processes. (Again it could be explained under a single-thread value-passing model: we treat the invoked recursive function as polymorphic; it requires its argument to be a database, but the specific datatype might be different at each invocation.)

Quote from Erwin on March 25, 2019, 6:24 pm

Perhaps, but then it's rather curious that a thing such as the assignment principle is presented to some extent as an important finding ...

Its "finding" is important only because Date & McGoveran's 1994 view updating rules could produce updates that broke the principle.

Something equivalent to the 'Assignment Principle' (how could you write it without caps?) was already explicit in Dayal&Bernstein 1978, just not with such a grandiloquent name. D&McG 1994 was a significant step backwards, in very many respects.

McGoveran has since written that his co-author misunderstood what he [McG] was trying to do in 1994. And, as you acknowledge, has eschewed the whole idea of variables/assignment as being outside FOL.

Codd kind of sidestepped this issue by accepting that databases would be kept in the old technologies and relational views just provided as a layer on top of them, so his model didn't need to address database updating.

Hmm. Codd held various positions at various times. One of those positions was that all relations must have a Primary Key; and that the UPDATE operation must not change the Primary Key. There are hints of that thinking in some of the '12 rules' [1985]. And IIRC the 1990 text had all sorts of RM-mixed-up-with-Entity-Relationship-theory that said the Primary Key was a persistent characteristic of the Entity. (Sorry to be vague: I don't pay much attention to it. But it seems to be a significant delusion for Fabian Pascal in particular.)

Then yes UPDATE looks very like "old technologies" file/record updates; and can't be interpreted as equivalent to some assignment holus bolus to a var. The other advantage is we can talk about concurrent computing: different processes UPDATing different rows, again without assigning a value to the whole var/database. (And that feeds nicely into all the theory on serializability.)

[DMG] seeking the solution in the observation that database updating is in fact function application

You can get that thinking to work in a single user/single thread programming environment: some function operates on a value of type database, returning a value of type database; in a recursive process that takes update or retrieval commands as an extra argument.

IMO that thinking breaks down when either the program terminates but you want the 'final' value of type database to persist; or in a multi-user environment where other users want access to the time-varying value of type database.

You could say: there's a database server process; all users make update/retrieval requests to it; it returns results via messaging. I don't like that: how do you explain what happens when the server/machine is shut down, but the database value persists? How do you avoid the need to talk about the program internals of the server process? In my thinking, the data structures come first; the processes manipulating them are secondary.

Also how do you explain schema changes such as adding/dropping attributes to a relvar or adding/dropping relvars in a dbvar? I don't know of any programming language theory that explains type-changing assignment to vars -- particularly vars that are shared amongst concurrent processes. (Again it could be explained under a single-thread value-passing model: we treat the invoked recursive function as polymorphic; it requires its argument to be a database, but the specific datatype might be different at each invocation.)

#5 · March 26, 2019, 7:55 am

Quote from AntC on March 26, 2019, 12:19 am

You can get that thinking to work in a single user/single thread programming environment: some function operates on a value of type database, returning a value of type database; in a recursive process that takes update or retrieval commands as an extra argument.

IMO that thinking breaks down when either the program terminates but you want the 'final' value of type database to persist; or in a multi-user environment where other users want access to the time-varying value of type database.

...

Also how do you explain schema changes such as adding/dropping attributes to a relvar or adding/dropping relvars in a dbvar? I don't know of any programming language theory that explains type-changing assignment to vars -- particularly vars that are shared amongst concurrent processes. (Again it could be explained under a single-thread value-passing model: we treat the invoked recursive function as polymorphic; it requires its argument to be a database, but the specific datatype might be different at each invocation.)

The 'db=variable' approach suffers from the multi-user issue too.

Schema changes are like deleting a declaration and adding a new one with the same variable name but of a different type. The closest similar thing in regular programming is editing a class that has, say, "int i=1;" and creating a new version that has, say, "long i=1;". That's just software maintenance and version control, and there ain't no "programming language theory" that covers such issues either, AFAICT. Do you know of one ?

Author of SIRA_PRISE

#6 · March 26, 2019, 12:09 pm

I'm grateful to see that we weren't the first to define and embrace "the assignment principle". I had always assumed Chris got this from some early paper. I'm not surprised if it turns out that he came up with the name for it, though.

Anyway, some interesting implementation issues have been raised under this topic, but I'm not sure any of them are appropriate issues to be addressed in the RM per se. Certainly not the business of redefining an existing database relvar, for example. And certainly not the problems arising with concurrent updates in a multi-user environment (which surely have been well covered in journals over the years, with formal treatments of locking protocols and other approaches?).

Of course, Codd's model was a bit short, to say the least, on integrity, and no doubt TTM's treatment isn't considered to be sufficiently formal. Is there anything there to justify Erwin's claim?

Hugh

Coauthor of The Third Manifesto and related books.

#7 · March 26, 2019, 1:05 pm

Quote from Erwin on March 25, 2019, 8:40 am

TTM of course addresses this by seeing the database as a variable, or set thereof, but [f]actually I have never seen very much in the way of mathematical formalization of the concept.

I suppose the general mathematical formalisation is the Turing Machine. There are various formal treatments of some specific subject in computing or logic that define assignment and variables for some purpose. E.g., TTM is one; with a hasty search I found indirect reference to another in a discussion about David Hilbert & Wilhelm Ackermann's, Principles of Mathematical Logic, which indicates there is a definition of an assignment function on page 164¹. There are others.

As is typical with academic publications, there might be a general formalisation somewhere in the literature but it's probably well hidden. Popularity is driven by utility. Unless formalising the notion of "time-varying variable" provides some pragmatic purpose -- similar to the way relational algebra provides the basis for practical optimisation of query languages, and formalisation allows it to be proven correct -- it's not likely to be easily found, assuming it exists at all.

--
¹ - https://math.stackexchange.com/questions/663578/how-is-a-computer-variable-defined-mathematically

I'm the forum administrator and lead developer of Rel. Email me at dave@armchair.mb.ca with the Subject 'TTM Forum'. Download Rel from https://reldb.org

#8 · March 26, 2019, 8:18 pm

Quote from Dave Voorhis on March 26, 2019, 1:05 pm

Popularity is driven by utility.

All circumstances considered, a very offending statement towards individual TTM implementations as well as perhaps even to TTM itself, no ?

Author of SIRA_PRISE

#9 · March 26, 2019, 8:39 pm

Quote from Erwin on March 26, 2019, 8:18 pm

Quote from Dave Voorhis on March 26, 2019, 1:05 pm

Popularity is driven by utility.

All circumstances considered, a very offending statement towards individual TTM implementations as well as perhaps even to TTM itself, no ?

I was referring only to the popularity of published papers. If it applies to other things -- and I'm not saying it does or doesn't -- how can it be an offending statement if it's true?

I'm the forum administrator and lead developer of Rel. Email me at dave@armchair.mb.ca with the Subject 'TTM Forum'. Download Rel from https://reldb.org

#10 · March 26, 2019, 11:47 pm

Quote from dandl on March 26, 2019, 11:47 pm

Quote from Erwin on March 25, 2019, 8:40 am

...

However, it occurred to me that "manipulation" must involve/include/incorporate both the retrieval activity as the updating activity. And relational algebra, in its capacity as a mere system of computation starting from [relation] values that are given, has nothing to offer in the realm of database updating.

TTM of course addresses this by seeing the database as a variable, or set thereof, but ...

The RM is an abstract data model, consisting of (nameless and typeless) values, attributes, tuples and relations. Query languages (RA/RC/etc) impose names (but not types) and allow extraction of useful information from pre-existing RM values. A language (TTM/D, Rel, Andl) imposes a type system and an update mechanism that are peculiar to that language. Layer on layer.

The RM is complete. The RA is complete (and equivalent to the RC). The languages are sufficiently complete to be useful. Is there a missing link?

I read the concern underlying your question as: does there (or should there) exist a formalism akin to the RA to describe how an RM value is created or updated, without invoking constructs peculiar to a particular programming language? Is there a layer above the RA and below the programming language? Or if not, is there a language-like formalism that can avoid the need to define update in terms of a particular language? Is there a Turing Machine of RM update?

If there were, it would necessarily construct a new RM value from values that are not RM values, but arranged in what form? In Codd's query examples he makes free use of "primitive elements" (integers and strings) interspersed with named relations and various operators. Creating an RM value should use only primitive elements and operators, but should of course be formalised without recourse to literals in some particular programming language.

Updating an RM value R1 involves up to three steps. First, construct a new value R2 from primitive elements. Then construct another new value from R1 and R2 using the RA. Then replace R1 by R3. Either of the first two steps may be omitted in some instances. The final step might be represented as assignment in some programming language, but should be formalised in some other way, since not every language has assignment of the kind required.

Yes, I share your concerns but I can't answer them.

Quote from Erwin on March 25, 2019, 8:40 am

...

However, it occurred to me that "manipulation" must involve/include/incorporate both the retrieval activity as the updating activity. And relational algebra, in its capacity as a mere system of computation starting from [relation] values that are given, has nothing to offer in the realm of database updating.

TTM of course addresses this by seeing the database as a variable, or set thereof, but ...

The RM is an abstract data model, consisting of (nameless and typeless) values, attributes, tuples and relations. Query languages (RA/RC/etc) impose names (but not types) and allow extraction of useful information from pre-existing RM values. A language (TTM/D, Rel, Andl) imposes a type system and an update mechanism that are peculiar to that language. Layer on layer.

The RM is complete. The RA is complete (and equivalent to the RC). The languages are sufficiently complete to be useful. Is there a missing link?

I read the concern underlying your question as: does there (or should there) exist a formalism akin to the RA to describe how an RM value is created or updated, without invoking constructs peculiar to a particular programming language? Is there a layer above the RA and below the programming language? Or if not, is there a language-like formalism that can avoid the need to define update in terms of a particular language? Is there a Turing Machine of RM update?

If there were, it would necessarily construct a new RM value from values that are not RM values, but arranged in what form? In Codd's query examples he makes free use of "primitive elements" (integers and strings) interspersed with named relations and various operators. Creating an RM value should use only primitive elements and operators, but should of course be formalised without recourse to literals in some particular programming language.

Updating an RM value R1 involves up to three steps. First, construct a new value R2 from primitive elements. Then construct another new value from R1 and R2 using the RA. Then replace R1 by R3. Either of the first two steps may be omitted in some instances. The final step might be represented as assignment in some programming language, but should be formalised in some other way, since not every language has assignment of the kind required.

Yes, I share your concerns but I can't answer them.

Andl - A New Database Language - andl.org

The Forum for Discussion about The Third Manifesto and Related Matters

Has the RM really been defined completely ?