The Forum for Discussion about The Third Manifesto and Related Matters

Please or Register to create posts and topics.

"The Programmer As Navigator"

12

I've been trying to understand Bachmann's Turing Award lecture, where he gives a list of ways to navigate through a Codasyl database.  Here it is:

1. He [the programmer] can start at the beginning of the database, or at any known record, and sequentially access the "next"
record in the database until he reaches a record of interest or reaches the end.

2. He can enter the database with a database key that provides direct access to the physical location of a record. (A database key is the permanent virtual memory address assigned to a record at the time that it was created.)

3. He can enter the database in accordance with the value of a primary data key. (Either the indexed sequential or randomized access techniques will yield the same result.)

4. He can enter the database with a secondary data key value and sequentially access all records having that particular data value for the field.

5. He can start from the owner of a set and sequentially access all the member records. (This is equivalent to converting a primary data key into a secondary data key.)

6. He can start with any member record of a set and access either the next or prior member of that set.

7. He can start from any member of a set and access the owner of the set, thus converting a secondary data key into a primary data key.

Now methods 2, 5, 6, and 7 are clear enough.  But 1, 3, and 4 run up against (as far as I can tell) his description of what a database is as distinct from a single file:  "In a database, it is common to have several or many different kinds of records. For an example, in a personnel database there might be employee records, department records, skill records, deduction records, work history records, and education records. Each type of record has its own unique primary data key, and all of its other fields are potential secondary data keys."  So:

  • In the sequential access of method 1, what order is implied?  Are the records retrieved in order of insertion, or in some deterministic but unknown order, or in an order that can vary from run to run?  And are they segregated by record type, or must the programmer be prepared to handle a mixture of record types?
  • Similarly, in methods 3 and 4, how is it known what record type the keys apply to?  The phrase "enter the database in accordance with the value of a primary data key" gives no indication.  If the primary key values of different record types are the same, presumably this does not mean that all of them are retrieved.

Can anyone with an understanding of the Codasyl model make sense of this?  Thanks.

Quote from johnwcowan on November 15, 2019, 11:05 pm

I've been trying to understand Bachmann's Turing Award lecture, where he gives a list of ways to navigate through a Codasyl database.  Here it is:

1. He [the programmer] can start at the beginning of the database, or at any known record, and sequentially access the "next"
record in the database until he reaches a record of interest or reaches the end.

2. He can enter the database with a database key that provides direct access to the physical location of a record. (A database key is the permanent virtual memory address assigned to a record at the time that it was created.)

3. He can enter the database in accordance with the value of a primary data key. (Either the indexed sequential or randomized access techniques will yield the same result.)

4. He can enter the database with a secondary data key value and sequentially access all records having that particular data value for the field.

5. He can start from the owner of a set and sequentially access all the member records. (This is equivalent to converting a primary data key into a secondary data key.)

6. He can start with any member record of a set and access either the next or prior member of that set.

7. He can start from any member of a set and access the owner of the set, thus converting a secondary data key into a primary data key.

Now methods 2, 5, 6, and 7 are clear enough.  But 1, 3, and 4 run up against (as far as I can tell) his description of what a database is as distinct from a single file:  "In a database, it is common to have several or many different kinds of records. For an example, in a personnel database there might be employee records, department records, skill records, deduction records, work history records, and education records. Each type of record has its own unique primary data key, and all of its other fields are potential secondary data keys."  So:

  • In the sequential access of method 1, what order is implied?  Are the records retrieved in order of insertion, or in some deterministic but unknown order, or in an order that can vary from run to run?  And are they segregated by record type, or must the programmer be prepared to handle a mixture of record types?
  • Similarly, in methods 3 and 4, how is it known what record type the keys apply to?  The phrase "enter the database in accordance with the value of a primary data key" gives no indication.  If the primary key values of different record types are the same, presumably this does not mean that all of them are retrieved.

Can anyone with an understanding of the Codasyl model make sense of this?  Thanks.

Yes.  I recognize it very very well from what I remember of IDMS DML, that is, I don't really know how well IDMS DML complies to the official codasyl stuff.

  1. is the process of a "sweep".  You go through all the records in a particular area of the database in order of physical recording.  You can inspect what record type it is by looking at the RECORD_NAME field of the SUBSCHEMA_CTRL communication block and then do a case construct on that field.  Alternatively you could ask the system to retrieve only the records of a particular record type.
  2. is record access using a dbkey.  an IDMS dbkey is a unique identifier of a record within the whole database.  In my days, it was a 4-byte value with the first three bytes holding the "page number within database" and the last byte holding the "record sequence number within page" (deleting a record with sequence nbr 2 would not cause record nbr 3 to become nbr 2 though).  Bad practice to rely on this mechanism too heavily, though sometimes you needed it to "restore database currency" (operations like "OBTAIN NEXT ... WITHIN <area>" were always relative to "current record of ...")
  3. Is the process of accessing, say, a customer record using a customer number value as key.  If the physical org was hashing-based, it would be OBTAIN CALC REC_CUST, if the physical org was index-key based, then it would be OBTAIN REC_CUST USING <index key field in program>.  IDMS also allowed the DBA to allow duplicates for a calc key, which could be accessed one by one using OBTAIN DUPLICATE after a successful OBTAIN CALC.
  4. Point 5 (!!!!!!) is the process of first making, say, a given customer (CUST_REC) "current" in your run-unit and then accessing all of that particular customer's order, or addresses, or contracts, or anything connected to the CUST_REC via a SET.  That is, something like OBTAIN NEXT <member_rec_name> WITHIN <set-name>.  Sets could also be multi-member and you could also request OBTAIN NEXT WITHIN <set-name> and then you'd first have to inspect the RECORD_NAME field to know what record type the DBMS returned to you.
  5. Point 4 (!!!!!!!) appears to appeal to the fact that sets could be ordered (and the ordering key for that set then being that "secondary key value", so that, say, all orders of a given customer could easily be accessed in order date order.  You could obviously then also "position" your run-unit somewhere in the middle of that set and thus process, say, only the orders as of this-or-that particular date (of this particular customer).
  6. Is just the facility you need to use to "loop through the members of a set" as described in 4/5.  Don't see why it's a bullet in its own right here.
  7. OBTAIN OWNER is the facility that got mostly used when processing BOM structures.  From a given part, you could access, say, PART_CONTAINMENT records using typically methods 4/5/6 (say, OBTAIN FIRST PART_CONTAINMENT WITHIN CONTAINS_PART), but if you wanted to get to the actual data of the contained part you'd have to follow the "owner" link of the second set between the two record types, say, OBTAIN OWNER WITHIN CONTAINED_BY_PART.

The questions he asks at the end have been, in the case of IDMS, answered as design choices made by the authors.

Thanks, and sorry for being slow to reply.  #6 is different from #4 and #5 because having sequential access to, er, a sequence does not necessarily imply that given a member of that sequence you can find the followingand previous members (Lisp lists, for example, give no access to the previous member, and indeed it may not even be unique).  Of course, if you have #6, then #4 and #5 are going to be available.

Quote from johnwcowan on November 15, 2019, 11:05 pm

I've been trying to understand Bachmann's Turing Award lecture, where he gives a list of ways to navigate through a Codasyl database.  Here it is:

I recall using 2 or 3 different network-style databases back in the day. I really don' t know whether they were Codasyl compliant, but they offered all the operations listed here.

1. He [the programmer] can start at the beginning of the database, or at any known record, and sequentially access the "next"
record in the database until he reaches a record of interest or reaches the end.

2. He can enter the database with a database key that provides direct access to the physical location of a record. (A database key is the permanent virtual memory address assigned to a record at the time that it was created.)

3. He can enter the database in accordance with the value of a primary data key. (Either the indexed sequential or randomized access techniques will yield the same result.)

4. He can enter the database with a secondary data key value and sequentially access all records having that particular data value for the field.

5. He can start from the owner of a set and sequentially access all the member records. (This is equivalent to converting a primary data key into a secondary data key.)

6. He can start with any member record of a set and access either the next or prior member of that set.

7. He can start from any member of a set and access the owner of the set, thus converting a secondary data key into a primary data key.

Now methods 2, 5, 6, and 7 are clear enough.

I think you underestimate 5 and 7. This is the means by which you can can take any record you happen to have at hand and visit a vast number of loosely connected records by moving up and down the tree. 6 is implied by 5 and 7.

So finding a starting point is the thing. I recall using 2 only for 'bookmarking' a location to return to later in a program. Access by a known key and/or sub-key as per 3 and 4 is the main way of getting started.

But 1, 3, and 4 run up against (as far as I can tell) his description of what a database is as distinct from a single file:  "In a database, it is common to have several or many different kinds of records. For an example, in a personnel database there might be employee records, department records, skill records, deduction records, work history records, and education records. Each type of record has its own unique primary data key, and all of its other fields are potential secondary data keys."  So:

  • In the sequential access of method 1, what order is implied?  Are the records retrieved in order of insertion, or in some deterministic but unknown order, or in an order that can vary from run to run?  And are they segregated by record type, or must the programmer be prepared to handle a mixture of record types?
  • Similarly, in methods 3 and 4, how is it known what record type the keys apply to?  The phrase "enter the database in accordance with the value of a primary data key" gives no indication.  If the primary key values of different record types are the same, presumably this does not mean that all of them are retrieved.

A single physical file containing multiple record types and child or list collections is a database. An application's 'database' might be one or several such files.The use of the term is open to ambiguity.

As per 1, it is possible to visit every record in a single physical database file in a physical order, each record being returned with an indication of its binary length and record type (structure layout). But you could also 'walk the tree': visit the collection of master records, and their children and attached collections in a predictable logical order.

The physical order is meta-stable: it might change a lot after a reorg or compaction, but not otherwise.

Key lookups are tied to a particular index so yes, you know what record type to expect (but there might be variants).

Interestingly, item 1 is a major omission of TTM and the RM. There is no conceptual or actual way to iterate over database relvars. If you assume that the reference to a catalog implies a way to query on relvar names, there is still no way to treat an attribute of a catalog relation as a reference to a relvar for the purpose of iterating over the database. The type system just won't allow it.

Andl - A New Database Language - andl.org
Quote from dandl on February 12, 2020, 11:51 pm

Interestingly, item 1 is a major omission of TTM and the RM. There is no conceptual or actual way to iterate over database relvars. If you assume that the reference to a catalog implies a way to query on relvar names, there is still no way to treat an attribute of a catalog relation as a reference to a relvar for the purpose of iterating over the database. The type system just won't allow it.

That's not a "major omission".  That's a "feature" that became unnecessary/superfluous by the RM's fundamental design option to let semantics/meaning prevail over work at the physical level.  Predicates and business meaning first, data access strategies after that.  Separation of concerns, and insulating the business side (predicates, semantics and meaning) from the physical side.  Perhaps you think it was a bad idea.

(And before you start another round of "sometimes people just want to go inside the database and see what's inside, and then they need something like 1" : no they don't.  They look in the catalog and/or the doco, they find the abstract definitions (external predicates) contained within and then they can start querying the database without using any feature like 1.)

Quote from Erwin on February 13, 2020, 7:43 am
Quote from dandl on February 12, 2020, 11:51 pm

Interestingly, item 1 is a major omission of TTM and the RM. There is no conceptual or actual way to iterate over database relvars. If you assume that the reference to a catalog implies a way to query on relvar names, there is still no way to treat an attribute of a catalog relation as a reference to a relvar for the purpose of iterating over the database. The type system just won't allow it.

That's not a "major omission".  That's a "feature" that became unnecessary/superfluous by the RM's fundamental design option to let semantics/meaning prevail over work at the physical level.  Predicates and business meaning first, data access strategies after that.  Separation of concerns, and insulating the business side (predicates, semantics and meaning) from the physical side.  Perhaps you think it was a bad idea.

I think you've misunderstood. I made no mention of access to the underlying physical layer. The omission I refer to is that relvars are referred to by name and as such are not first class. You can write a function that takes a relation value as an argument, but you cannot write one that takes a relvar as an argument, or iterate over relvars, or form a reference to a relvar. This restriction is embedded both in TTM (RM Pre 13) and in the type system.

(And before you start another round of "sometimes people just want to go inside the database and see what's inside, and then they need something like 1" : no they don't.  They look in the catalog and/or the doco, they find the abstract definitions (external predicates) contained within and then they can start querying the database without using any feature like 1.)

But they cannot write a program to query the catalog and then use that information to query a relvar mentioned in the catalog. [At least not in a typesafe way -- it could always be done using generated code or dynamic types.]

There are several things TTM requires to be named: types, components, attributes, variables, databases, and probably also constraints. These are all second class. You cannot (for example) write a query that resolves to a variable attribute based on some computation. You cannot iterate over any of these things. Every one of these things has to be named explicitly in every complying D.

And yes, it's perfectly easy to come with sound business requirements that would best be met by facilities of this kind. People do it all the time when they generate SQL, as I'm sure you are aware.

Andl - A New Database Language - andl.org
Quote from dandl on February 13, 2020, 8:23 am

But they cannot write a program to query the catalog and then use that information to query a relvar mentioned in the catalog. [At least not in a typesafe way -- it could always be done using generated code or dynamic types.]

...

And yes, it's perfectly easy to come with sound business requirements that would best be met by facilities of this kind. People do it all the time when they generate SQL, as I'm sure you are aware.

I'd be interest in being actually shown such a "sound business requirement", because the program that addresses it will essentially do nothing but spew out relations, the [number and] structure of which is undefined/unpredictable/itself variable, meaning the semantics of what comes out of such a program is likewise.

Quote from dandl on February 13, 2020, 8:23 am
Quote from Erwin on February 13, 2020, 7:43 am
Quote from dandl on February 12, 2020, 11:51 pm

Interestingly, item 1 is a major omission of TTM and the RM. There is no conceptual or actual way to iterate over database relvars. If you assume that the reference to a catalog implies a way to query on relvar names, there is still no way to treat an attribute of a catalog relation as a reference to a relvar for the purpose of iterating over the database. The type system just won't allow it.

That's not a "major omission".  That's a "feature" that became unnecessary/superfluous by the RM's fundamental design option to let semantics/meaning prevail over work at the physical level.  Predicates and business meaning first, data access strategies after that.  Separation of concerns, and insulating the business side (predicates, semantics and meaning) from the physical side.  Perhaps you think it was a bad idea.

I think you've misunderstood. I made no mention of access to the underlying physical layer. The omission I refer to is that relvars are referred to by name and as such are not first class. You can write a function that takes a relation value as an argument, but you cannot write one that takes a relvar as an argument, or iterate over relvars, or form a reference to a relvar. This restriction is embedded both in TTM (RM Pre 13) and in the type system.

(And before you start another round of "sometimes people just want to go inside the database and see what's inside, and then they need something like 1" : no they don't.  They look in the catalog and/or the doco, they find the abstract definitions (external predicates) contained within and then they can start querying the database without using any feature like 1.)

But they cannot write a program to query the catalog and then use that information to query a relvar mentioned in the catalog. [At least not in a typesafe way -- it could always be done using generated code or dynamic types.]

There are several things TTM requires to be named: types, components, attributes, variables, databases, and probably also constraints. These are all second class. You cannot (for example) write a query that resolves to a variable attribute based on some computation. You cannot iterate over any of these things. Every one of these things has to be named explicitly in every complying D.

And yes, it's perfectly easy to come with sound business requirements that would best be met by facilities of this kind. People do it all the time when they generate SQL, as I'm sure you are aware.

Arguably, the real achievement of the relational model -- one sustained by TTM -- is not the model itself, but the fact that it demonstrated the practical value of defining systems in terms of immutable objects (like relations) that are manipulated by a composable and elegant set of primitive operations (like a relational algebra).

It is, after all, part of the reason for the success of SQL, despite not being an implementation of the relational model.

As such, if you find the relational model and/or TTM insufficient for your needs, feel free to define a new model more suited to your needs.

Defining it with something akin to the mathematical rigour of the relational model would be ideal, but isn't strictly necessary. If utility is your goal, let utility guide your design.

I'm the forum administrator and lead developer of Rel. Email me at dave@armchair.mb.ca with the Subject 'TTM Forum'. Download Rel from https://reldb.org
Quote from Erwin on February 13, 2020, 8:54 am
Quote from dandl on February 13, 2020, 8:23 am

But they cannot write a program to query the catalog and then use that information to query a relvar mentioned in the catalog. [At least not in a typesafe way -- it could always be done using generated code or dynamic types.]

...

And yes, it's perfectly easy to come with sound business requirements that would best be met by facilities of this kind. People do it all the time when they generate SQL, as I'm sure you are aware.

I'd be interest in being actually shown such a "sound business requirement", because the program that addresses it will essentially do nothing but spew out relations, the [number and] structure of which is undefined/unpredictable/itself variable, meaning the semantics of what comes out of such a program is likewise.

My observation is that certain operations are not possible in an implementation conforming to TTM, and that observation is independent of my success or failure in articulating specific business needs. If these are things a business wants to do, then an implementation will have to provide facilities that extend and possibly contradict what is set out in TTM.

An example: let us say a business wants to institute a new type constraint on dates, prohibiting any dates before (say) 1/1/1970. In order to do that it needs to determine whether the database already contains any dates earlier than that limit. It's easy enough to search the catalog for attributes of type "date", but it's not possible to use that information to construct a query to examine the values of those attributes, scattered across a range of relvars.

Another: the business wants to set up a training facility, using production software running on a series of throwaway relvars that are a copy of current production data, but can be created and removed at will. It's easy enough to check the catalog for the desired relvars and choose temporary names that are not currently in use, but you can't use that new relvar name in a D language statement to create a relvar.

There are ways to solve these problems, but that isn't the point. It's just an observation that named entities are second class in the language, and certain operations on them are thereby simply not possible.

TTM is deeply invested in the idea of unifying database update with relvar assignment. An alternative would be to treat database update as an operation on a value of type database, where the the relvar is referenced as a value of type relvar.This would have to be coupled with a generic capability on relation/tuple types. The above operations are now possible, but this is not TTM any more.

Andl - A New Database Language - andl.org
Quote from Dave Voorhis on February 13, 2020, 10:15 am
Quote from dandl on February 13, 2020, 8:23 am
Quote from Erwin on February 13, 2020, 7:43 am
Quote from dandl on February 12, 2020, 11:51 pm

Interestingly, item 1 is a major omission of TTM and the RM. There is no conceptual or actual way to iterate over database relvars. If you assume that the reference to a catalog implies a way to query on relvar names, there is still no way to treat an attribute of a catalog relation as a reference to a relvar for the purpose of iterating over the database. The type system just won't allow it.

That's not a "major omission".  That's a "feature" that became unnecessary/superfluous by the RM's fundamental design option to let semantics/meaning prevail over work at the physical level.  Predicates and business meaning first, data access strategies after that.  Separation of concerns, and insulating the business side (predicates, semantics and meaning) from the physical side.  Perhaps you think it was a bad idea.

I think you've misunderstood. I made no mention of access to the underlying physical layer. The omission I refer to is that relvars are referred to by name and as such are not first class. You can write a function that takes a relation value as an argument, but you cannot write one that takes a relvar as an argument, or iterate over relvars, or form a reference to a relvar. This restriction is embedded both in TTM (RM Pre 13) and in the type system.

(And before you start another round of "sometimes people just want to go inside the database and see what's inside, and then they need something like 1" : no they don't.  They look in the catalog and/or the doco, they find the abstract definitions (external predicates) contained within and then they can start querying the database without using any feature like 1.)

But they cannot write a program to query the catalog and then use that information to query a relvar mentioned in the catalog. [At least not in a typesafe way -- it could always be done using generated code or dynamic types.]

There are several things TTM requires to be named: types, components, attributes, variables, databases, and probably also constraints. These are all second class. You cannot (for example) write a query that resolves to a variable attribute based on some computation. You cannot iterate over any of these things. Every one of these things has to be named explicitly in every complying D.

And yes, it's perfectly easy to come with sound business requirements that would best be met by facilities of this kind. People do it all the time when they generate SQL, as I'm sure you are aware.

Arguably, the real achievement of the relational model -- one sustained by TTM -- is not the model itself, but the fact that it demonstrated the practical value of defining systems in terms of immutable objects (like relations) that are manipulated by a composable and elegant set of primitive operations (like a relational algebra).

Hear, hear! Codd got that right.

It is, after all, part of the reason for the success of SQL, despite not being an implementation of the relational model.

I will argue that SQL offers a superset of the RM. At least, I don't know anything the RM/RA can do that SQL can't, although it certainly takes discipline to toe the line.

As such, if you find the relational model and/or TTM insufficient for your needs, feel free to define a new model more suited to your needs.

Defining it with something akin to the mathematical rigour of the relational model would be ideal, but isn't strictly necessary. If utility is your goal, let utility guide your design.

I'm not arguing with the RM/RA or the core concept of TTM. I'm just pointing to a specific limitation that arises from the unification in D of the type system with column names and of relvar assignment with database update. Different choices here and in a few other areas like natural join, aggregation, recursion could produce something different. Just tweaking TTM is probably not enough.

Andl - A New Database Language - andl.org
12