Making illegal states unrepresentable

#1 · March 24, 2023, 8:11 am

I'm having a little discussion with Hillel Wayne sparked by https://buttondown.email/hillelwayne/archive/ergonomic-apis-channel-invariants-and-data-views/ where I speculated that having relations (and constraints on those) available in a programming language might go a long way towards making illegal states unrepresentable.

Related to the claim "there are multiple possible data layouts and they have different advantages" (which is obviously true), I speculated again that a relational representation might be no worse than a constant factor against the optimal representation for a particular case.

Challenge:

Here's an exercise to try: try modeling files and folders as SQL tables, where each file and folder belongs to at most one parent folder, and folders can contain any number of files and folder. Then try modeling the following:

The query "all files for a given folder"
The query "the most specific folder, if any, that transitively contains both file X and file Y"
The invariant "there are no cycles, where folder A belongs to B belongs to C belongs to A."

I believe I have solved this challenge but curious about what this distinguished group can add, either to the challenge or in general.

#2 · March 24, 2023, 8:47 pm

Quote from tobega on March 24, 2023, 8:11 am

Challenge:

... folders can contain any number of files and folder. ...

3. The invariant "there are no cycles, where folder A belongs to B belongs to C belongs to A."

Notoriously, relational structures cope very badly with nested hierarchies -- especially where the depth of nesting is arbitrary. The classic example is Sales territories/groupings, where in high-sales regions (such as Cities), there's to be fine-detail nesting (by suburb) but in the rural wastelands whole-county is good enough.

I've not seen that well expressed in SQL: usually there's a bunch of custom code to validate the structure. (Transitive closure/TTM TCLOSE would be the technique, but most SQLs don't like it.)

#3 · March 25, 2023, 10:01 am

Quote from tobega on March 24, 2023, 8:11 am

I'm having a little discussion ... I speculated that having relations (and constraints on those) available in a programming language might go a long way towards making illegal states unrepresentable.

I'm not following the notation in that discussion:

// Map of time + room to talk
sched1 = {(Time, Room): Talk}

// Map of talk to time + room
sched2 = {Talk: (Time, Room)}

And then "wait isn’t this just database views?". Well, no: in Relational terms (I'm guessing), that sched looks like _one_ table with three attributes {Time, Room, Talk} with two indexes (aka 'keys' _not_ views): {Time, Room}, {Talk} -- and our relational term for 'indexes' is UNIQUE constraints. It might be 'multiple possible layouts' in your notation. It's a single relation (although not adequate as it stands -- see below):

Related to the claim "there are multiple possible data layouts and they have different advantages" (which is obviously true), I speculated again that a relational representation might be no worse than a constant factor against the optimal representation for a particular case.

If you need to maintain a schedule, you'd put it into a database wouldn't you? A relation _is_ an optimal representation (providing you've carried through normalisation). What's your "no worse" allegation?

Or at least inside something like Datalog. I'm not seeing why you'd represent it only inside a constraint-handling tool. (Or is 'MiniZinc' comparable to DataLog?)

There's a domain-specific challenge with scheduling: Talks have a duration; some clashing Talk2 might start in the same Room at a later Time2, but still within the duration of Time1 -- see Date, Darwen, Lorentzos 'Time and Relational Theory', for a thoroughgoing way to handle durations as intervals (ranges of times).

You'll also need to worry about double-booking lecturers; and audience members. Presuming your venue has multiple Rooms.

#4 · March 25, 2023, 10:37 am

As far as I can see this article has very little to do with relations, and your question has very little to do with either.

The RM for the article is simply { Room, Time, Talk }. The article is about how to apply constraints to that model, not about the model itself. The constraints are easily expressed, but less easily optimised. No trees involved.

Your question is about a tree of files and folders, and the RM does not deal well with that. It can be expressed as the relations { Parent, Child } and { Folder, File } and it can be queried with transitive closure, but with limits.

Andl - A New Database Language - andl.org

#5 · March 27, 2023, 2:58 pm

Quote from tobega on March 27, 2023, 2:58 pm
Quote from AntC on March 25, 2023, 10:01 am

Quote from tobega on March 24, 2023, 8:11 am

I'm having a little discussion ... I speculated that having relations (and constraints on those) available in a programming language might go a long way towards making illegal states unrepresentable.

I'm not following the notation in that discussion:

Plain text

Copy to clipboard

Open code in new window

EnlighterJS 3 Syntax Highlighter

// Map of time + room to talk

sched1 = {(Time, Room): Talk}

// Map of talk to time + room

sched2 = {Talk: (Time, Room)}

// Map of time + room to talk sched1 = {(Time, Room): Talk} // Map of talk to time + room sched2 = {Talk: (Time, Room)}
// Map of time + room to talk
sched1 = {(Time, Room): Talk}

// Map of talk to time + room
sched2 = {Talk: (Time, Room)}
And then "wait isn’t this just database views?". Well, no: in Relational terms (I'm guessing), that sched looks like _one_ table with three attributes {Time, Room, Talk} with two indexes (aka 'keys' _not_ views): {Time, Room}, {Talk} -- and our relational term for 'indexes' is UNIQUE constraints. It might be 'multiple possible layouts' in your notation. It's a single relation (although not adequate as it stands -- see below):

Related to the claim "there are multiple possible data layouts and they have different advantages" (which is obviously true), I speculated again that a relational representation might be no worse than a constant factor against the optimal representation for a particular case.

If you need to maintain a schedule, you'd put it into a database wouldn't you? A relation _is_ an optimal representation (providing you've carried through normalisation). What's your "no worse" allegation?
Yes, I would put it into a database, or at least I would like to be able to create "tables" in my programming language, which is the point I am trying to make.

The author of the article didn't consider it because most languages don't have it, so he is stuck with key-value maps or other structures.

My "no worse" allegation concerns performance for any usage of the data structure, where I conjecture that the normalised relational representation is only a constant-factor away in big-O terms. In the key-value map a reverse lookup would be costly, for example, but with a relation we can go either way equally cheaply, provided the relevant indexing is in place.

Or at least inside something like Datalog. I'm not seeing why you'd represent it only inside a constraint-handling tool. (Or is 'MiniZinc' comparable to DataLog?)

There's a domain-specific challenge with scheduling: Talks have a duration; some clashing Talk2 might start in the same Room at a later Time2, but still within the duration of Time1 -- see Date, Darwen, Lorentzos 'Time and Relational Theory', for a thoroughgoing way to handle durations as intervals (ranges of times).

You'll also need to worry about double-booking lecturers; and audience members. Presuming your venue has multiple Rooms.

Good points, although I think we should consider this as a contrived example to try to show that "data layout severely affects performance of your algorithm", not an actual scheduling implementation.

Quote from AntC on March 25, 2023, 10:01 am

Quote from tobega on March 24, 2023, 8:11 am

I'm having a little discussion ... I speculated that having relations (and constraints on those) available in a programming language might go a long way towards making illegal states unrepresentable.

I'm not following the notation in that discussion:

Plain text

Copy to clipboard

Open code in new window

EnlighterJS 3 Syntax Highlighter

// Map of time + room to talk

sched1 = {(Time, Room): Talk}

// Map of talk to time + room

sched2 = {Talk: (Time, Room)}

// Map of time + room to talk sched1 = {(Time, Room): Talk} // Map of talk to time + room sched2 = {Talk: (Time, Room)}
// Map of time + room to talk
sched1 = {(Time, Room): Talk}

// Map of talk to time + room
sched2 = {Talk: (Time, Room)}
And then "wait isn’t this just database views?". Well, no: in Relational terms (I'm guessing), that sched looks like _one_ table with three attributes {Time, Room, Talk} with two indexes (aka 'keys' _not_ views): {Time, Room}, {Talk} -- and our relational term for 'indexes' is UNIQUE constraints. It might be 'multiple possible layouts' in your notation. It's a single relation (although not adequate as it stands -- see below):

Related to the claim "there are multiple possible data layouts and they have different advantages" (which is obviously true), I speculated again that a relational representation might be no worse than a constant factor against the optimal representation for a particular case.

If you need to maintain a schedule, you'd put it into a database wouldn't you? A relation _is_ an optimal representation (providing you've carried through normalisation). What's your "no worse" allegation?

Yes, I would put it into a database, or at least I would like to be able to create "tables" in my programming language, which is the point I am trying to make.

The author of the article didn't consider it because most languages don't have it, so he is stuck with key-value maps or other structures.

My "no worse" allegation concerns performance for any usage of the data structure, where I conjecture that the normalised relational representation is only a constant-factor away in big-O terms. In the key-value map a reverse lookup would be costly, for example, but with a relation we can go either way equally cheaply, provided the relevant indexing is in place.

Or at least inside something like Datalog. I'm not seeing why you'd represent it only inside a constraint-handling tool. (Or is 'MiniZinc' comparable to DataLog?)

There's a domain-specific challenge with scheduling: Talks have a duration; some clashing Talk2 might start in the same Room at a later Time2, but still within the duration of Time1 -- see Date, Darwen, Lorentzos 'Time and Relational Theory', for a thoroughgoing way to handle durations as intervals (ranges of times).

You'll also need to worry about double-booking lecturers; and audience members. Presuming your venue has multiple Rooms.

Good points, although I think we should consider this as a contrived example to try to show that "data layout severely affects performance of your algorithm", not an actual scheduling implementation.

#6 · March 27, 2023, 3:04 pm

Quote from AntC on March 24, 2023, 8:47 pm

Quote from tobega on March 24, 2023, 8:11 am

Challenge:

... folders can contain any number of files and folder. ...

3. The invariant "there are no cycles, where folder A belongs to B belongs to C belongs to A."

Notoriously, relational structures cope very badly with nested hierarchies -- especially where the depth of nesting is arbitrary. The classic example is Sales territories/groupings, where in high-sales regions (such as Cities), there's to be fine-detail nesting (by suburb) but in the rural wastelands whole-county is good enough.

I've not seen that well expressed in SQL: usually there's a bunch of custom code to validate the structure. (Transitive closure/TTM TCLOSE would be the technique, but most SQLs don't like it.)

Right, this is often stated as a problem, but if you have an in-memory tree structure, you are still doing a memory traversal for each level down. So a relational representation is just a (perhaps large) constant factor away in big-O thinking.

My idea for 3. would be to have the parent-child relation have a foreign-key constraint that a parent must exist in the union of root-dirs and children

#7 · March 27, 2023, 3:08 pm

Quote from dandl on March 25, 2023, 10:37 am

As far as I can see this article has very little to do with relations, and your question has very little to do with either.

The RM for the article is simply { Room, Time, Talk }. The article is about how to apply constraints to that model, not about the model itself. The constraints are easily expressed, but less easily optimised. No trees involved.

Your question is about a tree of files and folders, and the RM does not deal well with that. It can be expressed as the relations { Parent, Child } and { Folder, File } and it can be queried with transitive closure, but with limits.

Right, the article concerns representing data in your programming language, both for efficiency and proof of correctness, whereas my response is that you really want something like andl.

#8 · April 6, 2023, 7:11 pm

Quote from tobega on March 24, 2023, 8:11 am

I'm having a little discussion with Hillel Wayne sparked by https://buttondown.email/hillelwayne/archive/ergonomic-apis-channel-invariants-and-data-views/ where I speculated that having relations (and constraints on those) available in a programming language might go a long way towards making illegal states unrepresentable.

Database constraints as per TTM are targeted exclusively at, serve no other purpose than, and completely resolve the problem of, "making illegal states unrepresentable". Database constraints as per TTM obviously *require* relations, so that's a construct that *must* be natively understood by the programming language at hand. (The other consequence of this requirement is that program-local scalar variables are not taken into consideration.) Database constraints as per TTM have been implemented in full in SIRA_PRISE. (But since SIRA_PRISE was developed with a focus on the database side, once again its algorithms do not cover the case of program-local scalar variables.)

I am 100% convinced that the SIRA_PRISE enforcement algorithms, extended to cover the case of program-local scalar variables, must be the ultimate final nail in the coffin of "illegal states being representable". Of course the compiler will have to be able to generate the code that executes these enforcement algorithms. It already does that with java constructs like "assert" but those constructs are highly targeted toward scalars, not set-level objects and certainly not correlations between two distinct set-level objects where the correlation between the two is to some degree reminiscent of foreign keys and such.

Author of SIRA_PRISE

#9 · April 6, 2023, 7:40 pm

Quote from Erwin on April 6, 2023, 7:40 pm

Quote from tobega on March 24, 2023, 8:11 am

Related to the claim "there are multiple possible data layouts and they have different advantages" (which is obviously true), I speculated again that a relational representation might be no worse than a constant factor against the optimal representation for a particular case.

"multiple possible data layouts" is too vague. Does it refer to differences in logical structure, or merely to differences in physical structure ?

For any given problem there might be a multitude of relational logical structures to tackle that problem, each with its own "different advantages" (which will mostly ultimately boil down to "our compiler doesn't deal with relational operator XYZ so very well", so the "advantages" are ultimately really not logical at all, but never mind). It's the reason why Date ended up discussing the topic of "logical database equivalence". However, constraint enforcement algorithms that just take a logical expression of "something that is definitely wrong" and just deal with it, should always at least deal correctly with the "faults expression" (expressed on the logical structure as chosen by the designer) that it is given. One must however keep in mind that this makes both the "faults expression" ***AND THE*** "logical structure as chosen by the designer" the input to what any compiler/optimizer will get to see, and I consider it an unrealistic and unreasonable expectation to expect from the compiler/optimizer that it be able to consider the entire set of all possible logically equivalent designs, and to try and find the "optimal" solution considering each and every member [iow, equivalent logical design] of that "entire set". First reason why I believe that is : compute that entire set. (And can it even proved to be finite ???)

And if "multiple possible data layouts" refers exclusively to differences in physical structure, well I guess you'll already know that TTM has no answer to such matters. Separation of concerns between the logical and the physical was the *very reason* why the relational model saw the light of day.

Quote from tobega on March 24, 2023, 8:11 am

Related to the claim "there are multiple possible data layouts and they have different advantages" (which is obviously true), I speculated again that a relational representation might be no worse than a constant factor against the optimal representation for a particular case.

"multiple possible data layouts" is too vague. Does it refer to differences in logical structure, or merely to differences in physical structure ?

For any given problem there might be a multitude of relational logical structures to tackle that problem, each with its own "different advantages" (which will mostly ultimately boil down to "our compiler doesn't deal with relational operator XYZ so very well", so the "advantages" are ultimately really not logical at all, but never mind). It's the reason why Date ended up discussing the topic of "logical database equivalence". However, constraint enforcement algorithms that just take a logical expression of "something that is definitely wrong" and just deal with it, should always at least deal correctly with the "faults expression" (expressed on the logical structure as chosen by the designer) that it is given. One must however keep in mind that this makes both the "faults expression" ***AND THE*** "logical structure as chosen by the designer" the input to what any compiler/optimizer will get to see, and I consider it an unrealistic and unreasonable expectation to expect from the compiler/optimizer that it be able to consider the entire set of all possible logically equivalent designs, and to try and find the "optimal" solution considering each and every member [iow, equivalent logical design] of that "entire set". First reason why I believe that is : compute that entire set. (And can it even proved to be finite ???)

And if "multiple possible data layouts" refers exclusively to differences in physical structure, well I guess you'll already know that TTM has no answer to such matters. Separation of concerns between the logical and the physical was the *very reason* why the relational model saw the light of day.

Author of SIRA_PRISE

#10 · April 6, 2023, 8:46 pm

Quote from Erwin on April 6, 2023, 8:46 pm

Quote from tobega on March 24, 2023, 8:11 am

Challenge:

Here's an exercise to try: try modeling files and folders as SQL tables, where each file and folder belongs to at most one parent folder, and folders can contain any number of files and folder. Then try modeling the following:

The query "all files for a given folder"

The query "the most specific folder, if any, that transitively contains both file X and file Y"

The invariant "there are no cycles, where folder A belongs to B belongs to C belongs to A."

I believe I have solved this challenge but curious about what this distinguished group can add, either to the challenge or in general.

As usual, the problem statement is the problem.

Let me start with my question here : does everyone in the room here have *EXACTLY* the same idea of what it means to be "a file". Now of course everyone is going to shout "of course we do, we've all been working with files our whole lives, WE ALL KNOW WHAT A FILE IS".

So let me recant a little anecdote I've heard from one of those bone-headed idealist business analysts (with a proper background in underlying technology and theory, mind you), who got involved in an airplane traffic project where control towers and airflight companies were required to start working together. He was obsessed with "first getting the business dictionary right" so he wanted to be sure that everyone at the table had the *same* understanding of what a "flight" was. When he expressed that idea, the rather senior managers in the room all rose up simultaneously and shouted "of course we know what a flight is, we've all been working with flights our whole lives, WE ALL KNOW WHAT A FLIGHT IS (so skip that dictionary nonsense and can we pls get to business)". But the business analyst was bone-headed and didn't bend. And what turned out ? To the airplane people, a "flight" was a plane departing at X and landing (or at least scheduled to land) at Y. To the tower control people, a "flight" was a plane scheduled to either depart or land at the exact airport where we are. Rest of the story to be left as an exercise for you to fill in :-) (Hint : flight control people thought that "incoming" vs "outgoing" was a *property* of a "flight", while airplanes people thought that *there was no such property at all* (or at least you had to establish which airport you were looking from to be able to derive it).)

Now, reverting to the files. How many times have you wondered why, if you take a copy of some file into another folder, then the "archived" attribute of the file doesn't get set ? Or does, if that's what your OS does ? After all, it's the very essence of "COPY" that the contents don't get altered, isn't it ? So DON'T set the "archived" attribute. But otoh, If the folder that the copy is being taken into previously had 3 files in it, and now has 4, there's *definitely* something new in there, isnt't it ? So DO set the "archived" attribute (after all, backup procedures might be based on this "archived" attribute). Rest of the story left as an exercise for you to fill in, but for the remark that there's something of "value vs. variable" in there ...

Yes, I'm illustrating here how well MS-DOS was "thoroughly thought through" when it was first launched. However, I'm also illustrating how more than 40 yrs later, we still haven't managed to pull our minds together on what a "file" really is ...

(Additional thought : if a user asks you to "just send me the file" then under the first approach (he wants the value), the only possible right answer to give is "I can't, I can only send you its contents", and under the second approach (he wants the variable) indeed you can do that, but you are quite likely to meet that same user again soon after, asking "I need the updates too" (which is indicative of a user desiring a distributed database, and most probably in a setup that CAP theorem proves to be impossible to achieve).)

Quote from tobega on March 24, 2023, 8:11 am

Challenge:

Here's an exercise to try: try modeling files and folders as SQL tables, where each file and folder belongs to at most one parent folder, and folders can contain any number of files and folder. Then try modeling the following:

The query "all files for a given folder"

The query "the most specific folder, if any, that transitively contains both file X and file Y"

The invariant "there are no cycles, where folder A belongs to B belongs to C belongs to A."

I believe I have solved this challenge but curious about what this distinguished group can add, either to the challenge or in general.

As usual, the problem statement is the problem.

Let me start with my question here : does everyone in the room here have *EXACTLY* the same idea of what it means to be "a file". Now of course everyone is going to shout "of course we do, we've all been working with files our whole lives, WE ALL KNOW WHAT A FILE IS".

So let me recant a little anecdote I've heard from one of those bone-headed idealist business analysts (with a proper background in underlying technology and theory, mind you), who got involved in an airplane traffic project where control towers and airflight companies were required to start working together. He was obsessed with "first getting the business dictionary right" so he wanted to be sure that everyone at the table had the *same* understanding of what a "flight" was. When he expressed that idea, the rather senior managers in the room all rose up simultaneously and shouted "of course we know what a flight is, we've all been working with flights our whole lives, WE ALL KNOW WHAT A FLIGHT IS (so skip that dictionary nonsense and can we pls get to business)". But the business analyst was bone-headed and didn't bend. And what turned out ? To the airplane people, a "flight" was a plane departing at X and landing (or at least scheduled to land) at Y. To the tower control people, a "flight" was a plane scheduled to either depart or land at the exact airport where we are. Rest of the story to be left as an exercise for you to fill in :-) (Hint : flight control people thought that "incoming" vs "outgoing" was a *property* of a "flight", while airplanes people thought that *there was no such property at all* (or at least you had to establish which airport you were looking from to be able to derive it).)

Now, reverting to the files. How many times have you wondered why, if you take a copy of some file into another folder, then the "archived" attribute of the file doesn't get set ? Or does, if that's what your OS does ? After all, it's the very essence of "COPY" that the contents don't get altered, isn't it ? So DON'T set the "archived" attribute. But otoh, If the folder that the copy is being taken into previously had 3 files in it, and now has 4, there's *definitely* something new in there, isnt't it ? So DO set the "archived" attribute (after all, backup procedures might be based on this "archived" attribute). Rest of the story left as an exercise for you to fill in, but for the remark that there's something of "value vs. variable" in there ...

Yes, I'm illustrating here how well MS-DOS was "thoroughly thought through" when it was first launched. However, I'm also illustrating how more than 40 yrs later, we still haven't managed to pull our minds together on what a "file" really is ...

(Additional thought : if a user asks you to "just send me the file" then under the first approach (he wants the value), the only possible right answer to give is "I can't, I can only send you its contents", and under the second approach (he wants the variable) indeed you can do that, but you are quite likely to meet that same user again soon after, asking "I need the updates too" (which is indicative of a user desiring a distributed database, and most probably in a setup that CAP theorem proves to be impossible to achieve).)

Author of SIRA_PRISE

TTM Forum

The Forum for Discussion about The Third Manifesto and Related Matters

Making illegal states unrepresentable