The Forum for Discussion about The Third Manifesto and Related Matters

Please or Register to create posts and topics.

Making illegal states unrepresentable

PreviousPage 2 of 2
Quote from tobega on March 24, 2023, 8:11 am

Challenge:

Here's an exercise to try: try modeling files and folders as SQL tables, where each file and folder belongs to at most one parent folder, and folders can contain any number of files and folder. Then try modeling the following:
  1. The query "all files for a given folder"
  2. The query "the most specific folder, if any, that transitively contains both file X and file Y"
  3. The invariant "there are no cycles, where folder A belongs to B belongs to C belongs to A."

 

I believe I have solved this challenge but curious about what this distinguished group can add, either to the challenge or in general.

All foregoing having been said, I'll now try to come up with a detailed answer, assuming the following :

  • "Files" are an object in the "variable" kind of sense, not in the "value" kind of sense
  • There are no hidden requirements akin to unix & linux "logical links" which kind of declare that "while this file appears to reside in another location, it's still at all times guaranteed to have the exact same contents as this other file in this other location"
  • (and a disclaimer : none of what appears to be syntax in what follows is actually valid SIRA_PRISE (or even any other) syntax.  It's hypothetical syntax designed/invented for readability)

First aspect : we are talking a strict hierarchy here, and a hierarchy (well at least the filesystem ones) always has a single root, which cannot be allowed to not exist.  Two SIRA_PRISE options :

  • define a base relvar that is not allowed to be nonempty and is limited to at most one tuple (rather dubious design choice if you ask me mostly because of "not allowed to be nonempty")
  • define a virtual relvar that is defined to be equivalent to a relation literal that satisfies both of these conditions by definition (the one I'd go for at this point - except if it had been explicitly stated that there might have to be "multiple roots", such as in the windows file system with C: , D: etc. etc. ...)

So I start with [virtual relvar] FILESYSTEM_ROOT === REL { TUP { ( some attributes that I already think I know but really shouldn't at this point of the analysis ) } }.

Now I have something that I can add both files and folders to, and since there's a clear distinction between the two (folders can also contain files / folders in turn, but files cannot (thus disregarding ZIP and the likes)), I'll need two [base] relvars :

  • FILES ( some attributes that I seem to like to be identical to those of the FOLDERS relvar to come )
  • FOLDERS ( some attributes that are surely best chosen to be identical to those of FILESYSTEM_ROOT, since FILESYSTEM_ROOT is supposed to represent a "folder" too )

Now let's look at the supposedly expected identifiers :

  • Within a single folder, whether the root or any other one, both files and folders are identified by a "name", and that name must be unique within the entire folder.  So no two files can have the same name if they're within the same folder, no two folders can have the same name if they're within the same [parent] folder, nor can a file in a parent folder have the same name as a folder within that same parent folder.

That looks like a key consisting of two attributes NAME (name within folder) and PARENT_NAME (name of parent folder), so we need those attribute in our three relvars (assumed all type CHAR) :

  • FILESYSTEM_ROOT === REL { TUP { PARENT_NAME "" , NAME "" } }
  • FILES ( NAME , PARENT_NAME )
  • FOLDERS ( NAME , PARENT_NAME )

Now for the distributed key, at this point I can suffice with declaring FILES and FOLDERS having to be disjoint :

  • DB_CONSTRAINT ... IS_EMPTY ( INTERSECT ( FILES,FOLDERS) )

There must perhaps better also be a constraint that no entry in the FOLDERS relvar makes it look like it represents itself the root folder :

  • TUPLE_CONSTRAINT ... ON FOLDERS LENGTH(NAME) > 0

(For practical reasons we might want a similar constraint on FILES to prevent any file from having the "empty" name and thus having a "full" name like A/B/ - note the ending separator token)

Now, for " each file and folder belongs to at most (*** I'm going to change this to EXACTLY ***) one folder ".  Of course this looks like a "foreign key", but it raises two questions :

  • from what relvar to what relvar
  • and what are the attribute in that "foreign key"

Answering the first question, the answer is that " a folder " means that it can either be the root folder or some (any) other folder listed in the FOLDERS relvar, therefore the referent of the "foreign key" is the UNION of FOLDERS with FILESYSTEM_ROOT.  SIRA_PRISE allows this.

Answering the second question, the answer is that the "parent" name of any entry listed in either FILES or FOLDERS must be equal to the "combined" ("concatenated" or "full") name of any folder listed in either FOLDERS or FILESYSTEM_ROOT (hence of the foregoing union, but then that "union" must contain such a "concatenated"/"full" name attribute).

So we define that union as another virtual relvar, and make sure that virtual relvar also includes the desired "full name" attribute :

  • ALL_FOLDERS === EXTEND ( UNION ( FILESYSTEM_ROOT, FOLDERS) , FULL_NAME (CONCAT (PARENT_NAME , SEP , NAME) ) ) )

SEP is some constant here, such as '/' or '\' and is needed for being able to distinguish A/BC from AB/C.  Note that for robustness, this in turn creates a requirement that NAME values cannot include this SEP character anywhere.

And now we can define the "referential constraint" as PARENT_NAME in both FILES and FOLDERS referencing FULL_NAME in this view (SIRA_PRISE allows that) :

  • REF_CONSTRAINT ... FILES (PARENT_NAME) REFERENCES ALL_FOLDERS(FULL_NAME)
  • REF_CONSTRAINT ... FOLDERS (PARENT_NAME) REFERENCES ALL_FOLDERS(FULL_NAME)

Note that this implies that the PARENT_NAME attribute values must include the SEP tokens where appropriate

Also note that this is nowhere near actual SIRA_PRISE syntax for these constraint.  The actual SIRA_PRISE syntax would consist of SEMIMINUS and RENAME invocations.  The do the exact same thing, but are probably less readable for the non-illuminated.

A final note on an implementation concern.  There is, relatively obviously, no way that an optimizer is ever going to spot the opportunity that these constraints can be checked using an index on the FOLDERS relvar (on the PARENT_NAME attribute).  Two reasons : it requires the optimizer to analyze and 'understand' the nature and properties of the scalar CONCAT expression in the EXTEND, and the fact that the subject itself of that extend is in turn a UNION, one of whose components is just a relation literal (and obviously no physical index exists on that one).

The design option to be taken to tackle that problem is to make the ALL_FOLDERS view physically recorded ("materialized view" in SQL lingo), at which point it does become possible to define any index on it we like (at the expense of updates on the underlying base relvars becoming "more expensive", obviously).  But those are implementation concerns.

I think this covers the design, I'll get back to the queries and invariants later.

 

Quote from tobega on March 24, 2023, 8:11 am

Challenge:

Then try modeling the following:
  1. The query "all files for a given folder"
  2. The query "the most specific folder, if any, that transitively contains both file X and file Y"
  3. The invariant "there are no cycles, where folder A belongs to B belongs to C belongs to A."

 

I believe I have solved this challenge but curious about what this distinguished group can add, either to the challenge or in general.

Now I can address the queries and invariants, and the treatment can in fact be very short :

  • The query "all files for a given folder"

RESTRICT (FILES , EQ ( PARENT_NAME , <your full folder name here>) )

  • The query "the most specific folder, if any, that transitively contains both file X and file Y"

There is, alas, a definitional problem with 'transitively contains both file X and file Y'.  Are they the full filenames (including the chaining of parent folder names) ?  Or do they represent "some file named X" (regardless of parent folder) and "some file named Y" (regardless of parent folder) ?  If it's the former, why bother speaking of files, since you already have two folder names and you just want the most-specific parent of those two (maybe you want the response to be empty if either of the two given files simply do not exist but if that's the case, please say so).  If it's the latter, more clarification is needed on "transitively contains".

  • The invariant "there are no cycles, where folder A belongs to B belongs to C belongs to A."

By nature of the design, such invariants are made inherently unnecessary.  By virtue of the properties of the CONCAT operator, all files and folders must by definition reside in a folder whose full name is strictly shorter than that of the files/folders it contains.  But if anyone is interested in the general style of a constraint "no ncycles in a graph", it looks like

IS_EMPTY ( RESTRICT ( TCLOSE ( R , (P_ID,CH_ID) ) , EQ ( P_ID, CH_ID ) )

And that's all there's to it.  SIRA_PRISE uses the construct itself to ensure no views are defined in terms of themselves (at **ANY*** level of indirection).

This is all true, but really, you want a way to embed an RM and a suitable extended RA into some programming language. If you can do that, the rest is trivial.

Andl.Net is an attempt to do just that.

Andl - A New Database Language - andl.org
Quote from dandl on April 8, 2023, 1:30 pm

This is all true, but really, you want a way to embed an RM and a suitable extended RA into some programming language. If you can do that, the rest is trivial.

Andl.Net is an attempt to do just that.

To paraphrase CJD : The RA needs neither extension nor perversion to be able to achieve what it has always been intended to achieve".

("Integration" of a "suitable RA" within some given programming language -where by "integration" I mean "integration at the native level of the language definition"- is exactly what CJD tried to achieve inside IBM with the PL/1 people, but might also have been why he chose to leave IBM not very long after that.)

Whether it can make mere programmers realise that this is plain and simply true, is another matter.  Which is why I sometimes find myself obligated to make these kinds of remark.

Quote from dandl on April 8, 2023, 1:30 pm

This is all true, but really, you want a way to embed an RM and a suitable extended RA into some programming language. If you can do that, the rest is trivial.

Andl.Net is an attempt to do just that.

There's a barrier between the assumptions/axioms/premises/... we start from, and alas therefore also between the conclusions we arrive at, and even more alas I'm afraid I won't live to see the day when that barrier disappears.

Thanks for the detailed response @erwin !

I think we can translate "folder" to "set" and "file" to "non-set element of a set" for the purposes of this exercise. I think that clears up what "transitively contains" means. We should also specify that each element (set or not) can only belong to one set.

Your model indeed solves the required problems, although I'm not sure we need the FULL_NAME. For 2) we can just intersect the transitive closures of the parent relations for X and Y and see which one is not a parent.

The question to be answered, then, having proven that a relational representation is indeed possible, is if, when considering each problem in isolation, our relational model is as efficient as any representation we could use in a programming language such as lists, trees, hash maps and so on.

Quote from tobega on May 2, 2023, 12:31 pm

Thanks for the detailed response @erwin !

I think we can translate "folder" to "set" and "file" to "non-set element of a set" for the purposes of this exercise. I think that clears up what "transitively contains" means. We should also specify that each element (set or not) can only belong to one set.

Your model indeed solves the required problems, although I'm not sure we need the FULL_NAME. For 2) we can just intersect the transitive closures of the parent relations for X and Y and see which one is not a parent.

The question to be answered, then, having proven that a relational representation is indeed possible, is if, when considering each problem in isolation, our relational model is as efficient as any representation we could use in a programming language such as lists, trees, hash maps and so on.

@tobega,

"I think that clears up what "transitively contains" means."  No, it doesn't.  It doesn't clear up anything wrt to how to interpret the 'X' and the 'Y' in the problem statement : are they "full names" or are they "simple" names (that identify a file only within the folder in which it resides).

"I'm not sure we need the FULL_NAME".  Without any such "full name" construct, how are you going to have an integrity rule that each and every file MUST INDEED belong to some folder ?  I'm convinced it is indeed necessary.  (Not necessarily in this form of a value of type CHAR, the "parent folder name" stuff could also be represented/modeled/designed as a relation representing an ordered list of "simple" folder names, for example.  But it would still be needed.)

"when considering each problem in isolation, our relational model is as efficient as any representation we could use in a programming language such as lists, trees, hash maps and so on.".  Usual suspects here :

  • first, "efficiency" is not a property of a model, it is a property of an implementation of such a model
  • second, "efficiency" is not even properly defined here, but since the IT profession these days has nothing left but techno-freaks, no doubt it's meant to mean "runtime efficiency".  But here's a second possible intended meaning for the term "efficiency" : the time it takes a user to write his queries versus the time it takes the programmers to write a whole program for each and every query that arises, using their "efficient" "lists, trees and hashmaps and so on".  Now if it's a C-type manager asking the original question, would you think he'd also be concerned with the "efficiency" of programmers writing code to get the job done ?  (And that is not to mention the exacerbating difference between cost per second of a CPU and cost per second of a programmer.)
Quote from Erwin on May 2, 2023, 6:14 pm
Quote from tobega on May 2, 2023, 12:31 pm

Thanks for the detailed response @erwin !

I think we can translate "folder" to "set" and "file" to "non-set element of a set" for the purposes of this exercise. I think that clears up what "transitively contains" means. We should also specify that each element (set or not) can only belong to one set.

Your model indeed solves the required problems, although I'm not sure we need the FULL_NAME. For 2) we can just intersect the transitive closures of the parent relations for X and Y and see which one is not a parent.

The question to be answered, then, having proven that a relational representation is indeed possible, is if, when considering each problem in isolation, our relational model is as efficient as any representation we could use in a programming language such as lists, trees, hash maps and so on.

@tobega,

"I think that clears up what "transitively contains" means."  No, it doesn't.  It doesn't clear up anything wrt to how to interpret the 'X' and the 'Y' in the problem statement : are they "full names" or are they "simple" names (that identify a file only within the folder in which it resides).

They should be taken as unique identifiers of the entity they represent

"I'm not sure we need the FULL_NAME".  Without any such "full name" construct, how are you going to have an integrity rule that each and every file MUST INDEED belong to some folder ?  I'm convinced it is indeed necessary.  (Not necessarily in this form of a value of type CHAR, the "parent folder name" stuff could also be represented/modeled/designed as a relation representing an ordered list of "simple" folder names, for example.  But it would still be needed.)

Not at all, you simply need what you already have

FOLDERS ( NAME , PARENT_NAME )

with the constraint that PARENT_NAME is in FOLDERS(NAME) UNION ROOT_DIRS

The NAMEs need to be unique IDs, though. I suppose we need to take care of how updates (moves) may be performed so that a directory is not moved into one of its subdirectories (or itself)

"when considering each problem in isolation, our relational model is as efficient as any representation we could use in a programming language such as lists, trees, hash maps and so on.".  Usual suspects here :

  • first, "efficiency" is not a property of a model, it is a property of an implementation of such a mode
  • second, "efficiency" is not even properly defined here, but since the IT profession these days has nothing left but techno-freaks, no doubt it's meant to mean "runtime efficiency".  But here's a second possible intended meaning for the term "efficiency" : the time it takes a user to write his queries versus the time it takes the programmers to write a whole program for each and every query that arises, using their "efficient" "lists, trees and hashmaps and so on".  Now if it's a C-type manager asking the original question, would you think he'd also be concerned with the "efficiency" of programmers writing code to get the job done ?  (And that is not to mention the exacerbating difference between cost per second of a CPU and cost per second of a programmer.)

We are indeed discussing implementations of the model. One possible implementation is as relations. Other possible implementations are as trees or maps or lists.

There is merit to your argument about efficiency, but you still cannot ignore big-O efficiency at internet scale. When implementing without relations, which is where the question originates, some operations become very inefficient indeed, so the temptation might be to have multiple synchronized implementations of the model.

PreviousPage 2 of 2