The Forum for Discussion about The Third Manifesto and Related Matters

Has the RM really been defined completely?

Quote from Dave Voorhis on March 26, 2019, 1:05 pm
Quote from Erwin on March 25, 2019, 8:40 am

TTM of course addresses this by seeing the database as a variable, or set thereof, but [f]actually I have never seen very much in the way of mathematical formalization of the concept.

I suppose the general mathematical formalisation is the Turing Machine. There are also various formal treatments of specific subjects in computing or logic that define assignment and variables for some purpose.

For me the first that comes to mind is Hoare Logic. It uses language very similar to TTM: "The assignment axiom states that, after the assignment, any predicate that was previously true for the right-hand side of the assignment now holds for the variable." In particular, TTM is concerned with the predicate 'x == e' (where 'x' is the variable, 'e' is the expression assigned, and '==' is the equality test). By talking about "any predicate" it shows we're in second-order logic/Leibniz equality.
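That axiom is standardly written as the following backwards-reasoning rule (a textbook formulation, not quoted from TTM):

```latex
% Hoare's assignment axiom: for P to hold of x after the assignment,
% P with e substituted for x must hold beforehand.
\{\, P[e/x] \,\}\quad x := e \quad \{\, P \,\}

% Instantiating P as the predicate x = e (the TTM case) gives the
% trivially true precondition e = e, so x = e holds afterwards:
\{\, e = e \,\}\quad x := e \quad \{\, x = e \,\}
```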

With a hasty search I found an indirect reference to another, in a discussion about David Hilbert & Wilhelm Ackermann's Principles of Mathematical Logic, which indicates there is a definition of an assignment function on page 164.

That question represents mostly the worst of Stackoverflow. The page 164 is from Zohar Manna, Mathematical Theory of Computation (1974), not from Hilbert & Ackermann. (The quote from H&A is purely about the mathematics/algebra sense of variable.) And even the page 164 quote doesn't help: it is in effect explaining multiple assignment -- which is useful if we regard a DBvar as representing a collection/tuple of variables, one per relvar.

This bit of the answer is more useful:

"In the context of computer languages, when we use the instruction

x1 ← x1+1

the symbol x1 is not a variable: it is a constant."

Then go to Strachey 1967 talking about lvalues vs rvalues: symbol x1 denotes a constant; where x1 appears within an expression, and particularly on the RHS of an assignment, we (I mean the language's semantics) take its rvalue -- what's traditionally just called 'value'; where it appears on the LHS of an assignment, we take its lvalue -- what's traditionally called 'address'. Beware Strachey has a very abstract sense of 'address' within an abstract model of 'store': no expectation that it consists of uniform-sized cells. (An array is a collection of addresses, accessed by an address-calculation algorithm, which can return either lvalues or rvalues as needed.) Then assignment places the rvalue into the location pointed to by the address.
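That store model can be sketched like this (my own toy rendering in Python, not Strachey's notation -- `declare`, `lvalue`, `rvalue` are illustrative names):

```python
# Toy model of Strachey's abstract store: a mapping from lvalues
# (abstract locations) to rvalues (values). A name denotes a location.
store = {}   # location -> rvalue
env = {}     # name -> lvalue (location)

def declare(name):
    loc = len(store)          # allocate a fresh abstract location
    env[name] = loc
    store[loc] = None

def rvalue(name):
    # Name used in an expression (e.g. RHS of assignment): dereference.
    return store[env[name]]

def lvalue(name):
    # Name used on the LHS of assignment: the location itself.
    return env[name]

def assign(name, value):
    # Assignment places the rvalue into the location the lvalue denotes.
    store[lvalue(name)] = value

declare("x1")
assign("x1", 41)
assign("x1", rvalue("x1") + 1)   # x1 <- x1 + 1
print(rvalue("x1"))              # 42
```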

As is typical with academic publications, there might be a general formalisation somewhere in the literature but it's probably well hidden. Popularity is driven by utility. Unless formalising the notion of "time-varying variable" provides some pragmatic purpose -- similar to the way relational algebra provides the basis for practical optimisation of query languages, and formalisation allows it to be proven correct -- it's not likely to be easily found, assuming it exists at all.

I think that's an important observation: there are various academic formalisations; but we programmers all 'just know' what assignment is/does, so there's little utility in making a song and dance about it.

I think it's worth pointing out that in

x1 ← x1+1

what's happening in Hoare Logic or in a more mathematical logic flavoured treatment is that there are two distinct variables, with disjoint scopes. x1 appearing on LHS creates a fresh variable bound to the result of evaluating the RHS. Because it's same-named as the 'old' x1, that one's scope is shadowed so becomes inaccessible. (Similarly if a procedure calls itself recursively, there's a whole fresh set of variables created that shadow the scopes of the caller.) Then we can keep up the idea that a variable 'stands for' the expression that was assigned to it.
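That fresh-variable-and-shadowing reading can be made concrete in a small sketch (my own illustration; the inner parameter deliberately reuses the name x1):

```python
# Functional reading of x1 <- x1 + 1: the 'assignment' introduces a
# fresh, same-named variable bound to the result of evaluating the RHS.
# The old x1's scope is shadowed, so it becomes inaccessible.
def program():
    x1 = 41                        # the 'old' x1
    def after_assignment(x1):      # fresh x1; outer x1 is shadowed here
        return x1                  # x1 now 'stands for' the assigned expression
    return after_assignment(x1 + 1)

print(program())  # 42
```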

This is again all very well for a single-user/single-threaded semantics. But if we have a long-running program repeatedly updating the dbvar (that is, repeatedly creating a fresh variable with the same name as the initial dbvar), how do we explain multiple users getting hold of the 'current value' of 'the' dbvar? ('the' in scare quotes because there's no longer a single var.)

Then another formalisation (aimed more at explaining the semantics of OOP) is the pi calculus. It treats each variable as a distinct process, with message passing: assignment is passing a message saying "if anybody asks what your value is, from now on you're to tell them it's one more than your current value". ("Current value" meaning the value you have at the time of receiving this instruction.) A process can fork arbitrarily many new processes and/or send an instruction to a process to halt -- which perhaps could explain creating/dropping new relvars.
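A toy rendering of that message-passing picture (threads and queues standing in for processes; this is an illustration of the idea, not a faithful pi-calculus encoding):

```python
import queue
import threading

def variable_process(inbox):
    """Each 'variable' is a process; its value is local state,
    changed only by receiving messages."""
    value = None
    while True:
        msg = inbox.get()
        if msg[0] == "set":
            # Assignment message: "from now on, your value is f(current value)".
            value = msg[1](value)
        elif msg[0] == "get":
            # Query: reply with the current value on the supplied channel.
            msg[1].put(value)
        elif msg[0] == "halt":
            # Halting the process could model dropping the relvar.
            return

inbox = queue.Queue()
threading.Thread(target=variable_process, args=(inbox,), daemon=True).start()

inbox.put(("set", lambda _old: 41))        # initialise
inbox.put(("set", lambda old: old + 1))    # x1 <- x1 + 1
reply = queue.Queue()
inbox.put(("get", reply))
result = reply.get()
print(result)   # 42
inbox.put(("halt",))
```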

But I still don't get how to explain simultaneous/multiple assignment: I think we have to regard the whole db as a single atomic process. We can't regard it as a collection of processes (one for each relvar), because pi-calculus deliberately doesn't define synchronisation amongst processes. (Rather, it gives primitives to build it.) Or we regard the dbvar as a process that's a gatekeeper for the relvars: no outside process can send messages to the relvars directly; the dbvar process synchronises updates such that after receiving a multiple update request, it will queue up further requests until all the relvar sub-processes have sent back a confirmation message.
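The gatekeeper idea can be sketched as follows (a sketch under my own assumptions: relvars as plain frozensets, a lock standing in for queuing up further requests, and right-hand sides as functions of the pre-update state):

```python
import threading

class DbVar:
    """Gatekeeper: no outside process touches the relvars directly;
    the dbvar serialises updates so each multiple assignment is atomic."""
    def __init__(self, **relvars):
        self._relvars = dict(relvars)
        self._lock = threading.Lock()

    def multiple_assign(self, **updates):
        # updates: relvar name -> function from the old db state to a new value.
        with self._lock:                 # further requests queue up here
            old = dict(self._relvars)    # every RHS sees the same old state
            for name, f in updates.items():
                self._relvars[name] = f(old)

    def value(self, name):
        with self._lock:
            return self._relvars[name]

db = DbVar(S=frozenset({"s1"}), SP=frozenset())
# Simultaneous assignment: both RHSs are evaluated against the pre-update state.
db.multiple_assign(
    S=lambda old: old["S"] | {"s2"},
    SP=lambda old: old["SP"] | {("s1", "p1")},
)
```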

[Hugh] Is there anything there to justify Erwin's claim?

To be precise, Erwin is posing a question, not a claim. I'd say the question of (multiple) assignment to var(s) in a multiprocess environment is not "defined completely". But that's not an RM-specific issue. So we might say: the RM has been defined completely, relying on other mathematical and programming language abstractions. But not all of those abstractions have been defined completely.

Quote from Erwin on March 26, 2019, 7:55 am
Quote from AntC on March 26, 2019, 12:19 am

...

Also how do you explain schema changes such as adding/dropping attributes to a relvar or adding/dropping relvars in a dbvar? I don't know of any programming language theory that explains type-changing assignment to vars -- particularly vars that are shared amongst concurrent processes. (Again it could be explained under a single-thread value-passing model: we treat the invoked recursive function as polymorphic; it requires its argument to be a database, but the specific datatype might be different at each invocation.)

The 'db=variable' approach suffers from the multi-user issue too.

Schema changes are like deleting a declaration and adding a new one with the same variable name but of a different type.  The closest similar thing in regular programming is editing a class that has, say, "int i=1;" and creating a new version that has, say, "long i=1;".  That's just software maintenance and version control, ...

In software maintenance you typically stop the program, delete its current executable, compile the code to create a new executable, then fire that up. I don't see how schema changes bear any relation to that. A DBMS wouldn't be worthy of the name if every time you want to change an int attribute to a long, you have to delete the whole database. TTM RM Pre 17 "D shall provide facilities for a transaction to define new relvars, or destroy existing ones, within its associated database" -- and for creating/destroying all sorts of artefacts, and quite right too.

If the Heading of a relvar is (part of) its type, and the collection of relvars and their types are (part of) the type of the dbvar; then any change to the schema is type-changing. Then in what sense is it the 'same' database persisting? Are we required to recompile every D program that accesses 'the' database, because the dbvar's type has changed? Or do we treat programs as accessing the database dynamically/with dynamic typing, such that at the start of execution we have attribute i int in some relvar, but by the end we have i long?

How does that fit with the preference for static typing, and type-safety guarantees?

Quote from Hugh on March 26, 2019, 12:09 pm

I'm grateful to see that we weren't the first to define and embrace "the assignment principle".  I had always assumed Chris got this from some early paper.  I'm not surprised if it turns out that he came up with the name for it, though.

(We've discussed this before.)

I'd put it: D&McG were the first to break (what came to be known as) "the assignment principle". This is specifically to do with updating through views. Earlier papers treat it as blimmin' obvious/not needing any great hullabaloo.

It would have saved a great deal of poppycock if Date had taken it from an earlier paper before publishing the 1994 treatment -- specifically from Dayal & Bernstein 1978, or equivalently Bancilhon & Spyratos 1981, both of which were cited in the 1994 work. I see no evidence that either of the 1994 authors actually read those references before writing their treatment.

Those early papers refer to 'no side effects' from updating through the view. From D&B 1978

The set of updates U has no side effects if only the desired update is performed on the view. For example, consider the insertion ...

However, now the tuple ... also appears in the view extension. This is a side effect, which we judge to be undesirable because it is not requested by the user and yet alters the view extension.

I suppose D&B are a little vague what is "the desired update" or "requested by the user". They're using the traditional INSERT/DELETE/UPDATE operations (to a view), rather than assignment to a var.

Quote from AntC on March 27, 2019, 5:01 am

In software maintenance you typically stop the program, delete its current executable, compile the code to create a new executable, then fire that up. I don't see how schema changes bear any relation to that. ...

If the Heading of a relvar is (part of) its type, and the collection of relvars and their types are (part of) the type of the dbvar; then any change to the schema is type-changing. Then in what sense is it the 'same' database persisting? Are we required to recompile every D program that accesses 'the' database, because the dbvar's type has changed? Or do we treat programs as accessing the database dynamically/with dynamic typing, such that at the start of execution we have attribute i int in some relvar, but by the end we have i long?

How does that fit with the preference for static typing, and type-safety guarantees?

I "stop the relvar" (or tell the gatekeeper that all accesses to that particular relvar are to be blocked).  I delete its current "compiled definition" (i.e. the more machine-oriented form of the same defining information as is recorded as relations in the catalog).  I compile all impacted code against the proposed new definition (e.g. all the constraint-defining expressions of constraints known to involve the named relvar).  I run any conversion processes needed.  If all succeeds, I fire up the new stuff.  I'd say the parallels are blimmin' obvious.

You know as well as I do that we don't "need to recompile every program" but of course that's just in a capacity as a "programmer who just knows what assignment is/does" (in this case a "programmer who knows what impact analysis is and how to perform it").

Reverting to Erwin's original point :-

"it occurred to me that "manipulation" must involve/include/incorporate both the retrieval activity as the updating activity. And relational algebra, in its capacity as a mere system of computation starting from [relation] values that are given, has nothing to offer in the realm of database updating."

It may be of interest that the 'Z Schema Calculus' formal method (and probably other formal methods) allows the change in a variable's value to be formally specified.  I have followed the spirit of this in the RAQUEL notation.

So in the statement
R  <--Insert   someRelationalValue
(assuming a valid insert)
this generates a parse tree where the relvar R appears twice, once representing it with its original value, and once representing it with its new value after 'someRelationalValue' has been inserted.

One can formally define the new value of R in terms of its old value and 'someRelationalValue'.  I have written formal definitions for <--Insert and other updating assignments.

The advantage of this approach is that one can write valid assignment expressions that contain multiple assignments.

This formalism ensures that the change in a relvar's value is a single atomic transaction.  I haven't considered the ramifications of this for concurrent updates (although if two relvars are independent of each other, there is no impediment to their concurrent updating, assuming that fits the logic of the statement that expresses the concurrent updates).
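One plausible reading of that definition (my guess at the intent, not David's actual RAQUEL definition): a valid insert requires the inserted tuples to be absent, and the new value of R is defined purely in terms of its old value and the inserted relation value.

```python
def insert(old, increment):
    """R <--Insert s: the new value of R as a pure function of the old
    value and s. 'Valid insert' is taken here to mean no inserted tuple
    is already present in R (one common convention)."""
    if old & increment:
        raise ValueError("invalid insert: tuple(s) already present")
    return old | increment    # R' = R union s

R = frozenset({("s1", "Smith")})
R = insert(R, frozenset({("s2", "Jones")}))
```

Because each update is a pure function from old value to new value, a multiple assignment can evaluate every right-hand side against the old values before installing any new ones.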

David,

I observe that the gist of your reply already assumes the "variable" paradigm (and consequently, also assumes the "<--" paradigm) (or I under-understand it, which possibility I won't exclude).

My question also sort of questioned those very assumptions.  (I am not ***opposed*** to the paradigm, seeing as I built a complete working system that happily assumes it too; I was just wondering.  As an engineer, I can't see any alternatives.  But scientist and engineer are different professions.)

Quote from Erwin on March 28, 2019, 9:29 pm

That's right.
The relational model, as expressed in TTM, already uses the 'variable' and '<--' paradigms, so I was just continuing with them.

Quote from Erwin on March 28, 2019, 9:29 pm

My question also sort of questioned those very assumptions.  ... As an engineer, I can't see any alternatives.  But scientist and engineer are different professions.

Pure functional languages get rid of '<--' and variables, but to me the essence of a DB is the storage of state data (i.e. the DB's value), so a programming language that handles state data has to include the 'variable' and '<--' paradigms.

I was just pointing out that one can have an algebra of assignments, just as one can have an algebra of operators (i.e. pure functions), the latter being what the relational algebra is; and that one can use the two algebras together in combination.

Sorry if this is not the line of thought you were looking for.

Quote from David Livingstone on March 29, 2019, 9:33 pm
Quote from Erwin on March 28, 2019, 9:29 pm

That's right.
The relational model, as expressed in TTM, already uses the 'variable' and '<--' paradigms, so I was just continuing with them.

Quote from Erwin on March 28, 2019, 9:29 pm

My question also sort of questioned those very assumptions.  ... As an engineer, I can't see any alternatives.  But scientist and engineer are different professions.

These are exactly the parts that should be questioned. The RM and RA come from Codd; type system, variables, assignment and other language features are by D&D. They may be more or less "set in stone" but that is no reason not to question them.

TTM is presented as "better than SQL", but in answer to what question? Perhaps if we ask the right question, then (as Dave suggested an eon ago) the answer is Datalog.

Andl - A New Database Language - andl.org