The Forum for Discussion about The Third Manifesto and Related Matters


Aggregates in relational calculus

Page 4 of 6
Quote from Erwin on August 23, 2019, 11:07 am
Quote from Dave Voorhis on August 23, 2019, 10:16 am
Quote from dandl on August 23, 2019, 12:24 am

OK. Rather than argue further about who said what, when and where, is it safe to say:

  1. there is a 'core' RA defined by Codd, bounded by the limits of FOPL
  2. add the named perspective
  3. add simple recursion (TC in TTM, see Alice p271)
  4. add RVA/TVA (no longer normalised)
  5. add a way to generate new values by computation (TBD)
  6. add aggregation (second order)
  7. add fixed point recursion (GTC in TTM, WHILE) (second order)
  8. add D (now Turing Complete)

A spectrum/progression, with no agreement how far to go and still call it 'an' RA.

And no agreement whether (5) new values should be included or made part of some different 'domain algebra'.

[Incidentally, creating new values as closed expressions is first order, but at least in Andl creating new values as open expressions using an SQL backend requires callbacks, which is seriously hard to implement. Doesn't that make it second order?]

Sure, but I'm not sure agreement is necessary. Do what seems right.

A given relational algebra can be formally defined, but there isn't any technical or academic authority that mandates what "relational algebra" must be in general, though by convention it's a "relation" construct and a collection of operators on that construct. It's all a bit arbitrary and rather casually defined, but I'm not sure there's much compulsion for it to be otherwise.

In the software world, there isn't much interest in whether or not a language adheres to some theoretical formalism. Far more important (to the extent that even this matters) is the degree to which it's expressive and powerful.

Utility trumps purity.

But for the fact that purity in itself often provides kinds of utility that, alas, go unseen, such as longevity and robustness.

Those are not what I'm looking for. I look to formalism as a possible way to open up a new perspective on a familiar problem. Consider these programming language kinds:

  1. General purpose languages, found almost everywhere, can do almost everything (C/C++, Java, C#, Python, etc).
  2. Special environment languages, exist only within and monopolise a specific environment (SQL, JS, Smalltalk, Forth, Pick Basic, etc).

Since the creation of Node.js, JS is now a GP language, but it still monopolises the browser. There is AFAIK no such democratisation of SQL. You can only write and execute SQL inside an RDBMS, and you can only query data in the RDBMS. The only widely available implementation of the RA is locked up and out of reach in some RDBMS, where it cannot be used on a host of interesting problems. Why is that? The list above helps to explain why it's so hard to create a desktop 'NodeSQL' or embed any RA-based query language into a GP host language. First and foremost, the named perspective is totally alien to any common GP type system. It's all downhill from there.

This is an interesting line of thinking. I may have more to say.

Andl - A New Database Language - andl.org
Quote from dandl on August 24, 2019, 1:34 am
Quote from Erwin on August 23, 2019, 11:07 am
Quote from Dave Voorhis on August 23, 2019, 10:16 am
Quote from dandl on August 23, 2019, 12:24 am

OK. Rather than argue further about who said what, when and where, is it safe to say:

  1. there is a 'core' RA defined by Codd, bounded by the limits of FOPL
  2. add the named perspective
  3. add simple recursion (TC in TTM, see Alice p271)
  4. add RVA/TVA (no longer normalised)
  5. add a way to generate new values by computation (TBD)
  6. add aggregation (second order)
  7. add fixed point recursion (GTC in TTM, WHILE) (second order)
  8. add D (now Turing Complete)

A spectrum/progression, with no agreement how far to go and still call it 'an' RA.

And no agreement whether (5) new values should be included or made part of some different 'domain algebra'.

[Incidentally, creating new values as closed expressions is first order, but at least in Andl creating new values as open expressions using an SQL backend requires callbacks, which is seriously hard to implement. Doesn't that make it second order?]

Sure, but I'm not sure agreement is necessary. Do what seems right.

A given relational algebra can be formally defined, but there isn't any technical or academic authority that mandates what "relational algebra" must be in general, though by convention it's a "relation" construct and a collection of operators on that construct. It's all a bit arbitrary and rather casually defined, but I'm not sure there's much compulsion for it to be otherwise.

In the software world, there isn't much interest in whether or not a language adheres to some theoretical formalism. Far more important (to the extent that even this matters) is the degree to which it's expressive and powerful.

Utility trumps purity.

But for the fact that purity in itself often provides kinds of utility that, alas, go unseen, such as longevity and robustness.

Those are not what I'm looking for. I look to formalism as a possible way to open up a new perspective on a familiar problem. Consider these programming language kinds:

  1. General purpose languages, found almost everywhere, can do almost everything (C/C++, Java, C#, Python, etc).
  2. Special environment languages, exist only within and monopolise a specific environment (SQL, JS, Smalltalk, Forth, Pick Basic, etc).

Since the creation of Node.js, JS is now a GP language, but it still monopolises the browser. There is AFAIK no such democratisation of SQL. You can only write and execute SQL inside an RDBMS, and you can only query data in the RDBMS. The only widely available implementation of the RA is locked up and out of reach in some RDBMS, where it cannot be used on a host of interesting problems. Why is that? The list above helps to explain why it's so hard to create a desktop 'NodeSQL' or embed any RA-based query language into a GP host language. First and foremost, the named perspective is totally alien to any common GP type system. It's all downhill from there.

Is the named perspective really alien to any common general purpose type system?

In general-purpose languages we use static tuple types and static tuples extensively and have done for a long time, only we call them "classes" and "instances". These certainly employ a "named perspective".

Increasingly, we avoid record-by-record manipulations in favour of high-level container-oriented operations -- like .NET's LINQ or Java Streams -- which are notionally analogous to a relational algebra (though with semantics more influenced by functional programming than the relational model), operating on various kinds of containers in lieu of relations/tables/relvars.

These modern object-oriented language conceptual similarities to an "embedded relational algebra" -- particularly in .NET's LINQ, which goes so far as to offer an optional SQL-like syntax -- arguably outweigh the differences.
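The analogy can be sketched in a few lines -- TypeScript here, purely for illustration, with a hypothetical supplier record type standing in for a relation:

```typescript
// Hypothetical supplier data; attribute names are illustrative only.
type Supplier = { sno: string; sname: string; city: string };

const suppliers: Supplier[] = [
  { sno: "S1", sname: "Smith", city: "London" },
  { sno: "S2", sname: "Jones", city: "Paris" },
];

// Restrict (WHERE) and project (SELECT) as chained container operations,
// in the style of LINQ or Java Streams:
const londonNames = suppliers
  .filter(s => s.city === "London")   // restrict
  .map(s => ({ sname: s.sname }));    // project
```

Note the bag semantics: arrays keep duplicates and ordering, which is part of where the functional-programming influence departs from the relational model.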

I'm the forum administrator and lead developer of Rel. Email me at dave@armchair.mb.ca with the Subject 'TTM Forum'. Download Rel from https://reldb.org
  1. The statically typed GP languages use a record structure with (a) ordered members (b) type name equivalence. The named perspective presumes a tuple with (a) a set of attributes (b) equivalence for types that share the same set of attributes.
  2. The statically typed GP languages do not provide a mechanism to index into a record (by name or otherwise). This allows no way to write generic routines for project, join, union etc. You can't even write a generic equals method.

Obviously there are ways to solve all the problems by going outside the language: code generation, a preprocessor, reflection, etc. The problem doesn't go away, we just paper over it.

[I wrote code to do all this in Andl.NET, based on reflection. It works just fine, until it doesn't. You have to treat it nicely, and that's no way to write robust tools.]
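The reflection-style approach can be sketched as follows -- a TypeScript illustration (not the Andl.NET code), where runtime key enumeration plays the role that .NET reflection plays:

```typescript
// A tuple as an attribute-name-to-value map.
type Tuple = Record<string, unknown>;

// Generic projection: keep only the named attributes, whatever the tuple's shape.
function project(t: Tuple, attrs: string[]): Tuple {
  const out: Tuple = {};
  for (const a of attrs) {
    if (a in t) out[a] = t[a];
  }
  return out;
}

// Generic equality: same attribute set and same values, order-independent.
function tupleEquals(a: Tuple, b: Tuple): boolean {
  const ka = Object.keys(a).sort();
  const kb = Object.keys(b).sort();
  return ka.length === kb.length &&
         ka.every((k, i) => k === kb[i] && a[k] === b[k]);
}
```

This works for any shape of tuple, but notice what was given up: the attribute list is a runtime value, so nothing checks at compile time that the attributes named actually exist.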

Or you can give up on type inference and just use dynamic typing and hash tables for everything. Be prepared for some runtime surprises.
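A minimal sketch of that fallback, with hypothetical attribute names; the misspelt key is caught by nothing until run time:

```typescript
// Tuples as untyped hash tables: flexible, but typos surface only at run time.
const t: Record<string, unknown> = { sno: "S1", city: "London" };

const c1 = t["city"];   // "London"
const c2 = t["ctiy"];   // undefined -- no compile-time error, just a surprise
```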

And I wouldn't want to suggest going back to the unnamed perspective is any better. If anything it's probably worse.

Andl - A New Database Language - andl.org
Quote from dandl on August 24, 2019, 2:23 pm
  1. The statically typed GP languages use a record structure with (a) ordered members (b) type name equivalence. The named perspective presumes a tuple with (a) a set of attributes (b) equivalence for types that share the same set of attributes.
  2. The statically typed GP languages do not provide a mechanism to index into a record (by name or otherwise). This allows no way to write generic routines for project, join, union etc. You can't even write a generic equals method.

Obviously there are ways to solve all the problems by going outside the language: code generation, a preprocessor, reflection, etc. The problem doesn't go away, we just paper over it.

[I wrote code to do all this in Andl.NET, based on reflection. It works just fine, until it doesn't. You have to treat it nicely, and that's no way to write robust tools.]

Or you can give up on type inference and just use dynamic typing and hash tables for everything. Be prepared for some runtime surprises.

And I wouldn't want to suggest going back to the unnamed perspective is any better. If anything it's probably worse.

Indeed, there is no ideal solution here. It's all tradeoffs, and a matter of deciding what you can tolerate vs what you can't.

For example, in the usual popular general purpose languages, class x {int p} is a different type from class y {int p}, so if you use classes for tuples, does that represent a show-stopping problem?

Or is it a tolerable quirk, one you can accept in return for being able to use a conventional and popular language?

Or, do you accept the awkwardness of a code generator in order to use a conventional and popular language and gain TTM-style tuples?

Or, do you give up all the benefits of conventional and popular languages in order to gain the benefits of a new language?

Tradeoffs.

I didn't realise a "named perspective" presumes "a tuple with (a) a set of attributes (b) equivalence for types that share the same set of attributes". I thought a "named perspective" meant named references (e.g., parameters in a function definition in C/C++) rather than positional references (e.g., arguments to a function invocation in C/C++.)

I thought "a tuple with (a) a set of attributes (b) equivalence for types that share the same set of attributes" was largely a TTM perspective -- at least in this context -- though not necessarily limited to TTM.

I'm the forum administrator and lead developer of Rel. Email me at dave@armchair.mb.ca with the Subject 'TTM Forum'. Download Rel from https://reldb.org
Quote from Dave Voorhis on August 24, 2019, 5:27 pm
Quote from dandl on August 24, 2019, 2:23 pm
  1. The statically typed GP languages use a record structure with (a) ordered members (b) type name equivalence. The named perspective presumes a tuple with (a) a set of attributes (b) equivalence for types that share the same set of attributes.
  2. The statically typed GP languages do not provide a mechanism to index into a record (by name or otherwise). This allows no way to write generic routines for project, join, union etc. You can't even write a generic equals method.

Obviously there are ways to solve all the problems by going outside the language: code generation, a preprocessor, reflection, etc. The problem doesn't go away, we just paper over it.

[I wrote code to do all this in Andl.NET, based on reflection. It works just fine, until it doesn't. You have to treat it nicely, and that's no way to write robust tools.]

Or you can give up on type inference and just use dynamic typing and hash tables for everything. Be prepared for some runtime surprises.

And I wouldn't want to suggest going back to the unnamed perspective is any better. If anything it's probably worse.

Indeed, there is no ideal solution here. It's all tradeoffs, and a matter of deciding what you can tolerate vs what you can't.

For example, in the usual popular general purpose languages, class x {int p} is a different type from class y {int p}, so if you use classes for tuples, does that represent a show-stopping problem?

It does, because of the new types generated by relational operators. You might adopt a convention of alphabetically ordering member names, but that may break when you join {A,B} and {B,C} vs join {B,C} and {A,B}.
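A sketch of the point, with hypothetical attributes A, B, C: natural join yields the set union of the two headings, so only a set-based notion of tuple type can recognise the two results as the same type:

```typescript
type Tuple = Record<string, unknown>;

// Natural join of two tuples: succeeds when the shared attributes agree.
// The result heading is the *set* union of the headings, whichever
// argument comes first.
function joinTuples(a: Tuple, b: Tuple): Tuple | null {
  for (const k of Object.keys(a)) {
    if (k in b && a[k] !== b[k]) return null; // mismatch on a common attribute
  }
  return { ...a, ...b };
}

const ab = { A: 1, B: 2 };
const bc = { B: 2, C: 3 };
const r1 = joinTuples(ab, bc); // heading {A,B,C}
const r2 = joinTuples(bc, ab); // heading {A,B,C} too, attribute order aside
```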

Or is it a tolerable quirk, one you can accept in return for being able to use a conventional and popular language?

Or, do you accept the awkwardness of a code generator in order to use a conventional and popular language and gain TTM-style tuples?

This is probably the method of choice (it was good enough for Stroustrup). [The alternative is to hack into the compiler, or figure out a whole new take on the RM.]

What you're going to finish up with is something similar to LINQ. Operations on local data can use the full richness of the host language, operations on server data would have to be translated into SQL and live with whatever restrictions that implies.

Or, do you give up all the benefits of conventional and popular languages in order to gain the benefits of a new language?

Tradeoffs.

I didn't realise a "named perspective" presumes "a tuple with (a) a set of attributes (b) equivalence for types that share the same set of attributes". I thought a "named perspective" meant named references (e.g., parameters in a function definition in C/C++) rather than positional references (e.g., arguments to a function invocation in C/C++.)

I thought "a tuple with (a) a set of attributes (b) equivalence for types that share the same set of attributes" was largely a TTM perspective -- at least in this context -- though not necessarily limited to TTM.

Yes, using the word 'type' at this stage is premature. The essence of the named perspective is spelled out in HHT: the selectors (attribute names) form a set rather than an ordered list. Relations are union-compatible if they have the same set of selectors. Project and Join result in relations each with a new set of selectors. The effect is similar: these are not features that fit comfortably in a statically typed language.
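A minimal sketch of the HHT view, treating the heading purely as a set of selectors (attribute names are illustrative):

```typescript
type Tuple = Record<string, unknown>;

// The heading is the set of selectors; sorting is only a canonical
// representation of the set, not an ordering the model cares about.
function heading(t: Tuple): string[] {
  return Object.keys(t).sort();
}

// Union-compatible iff the two headings are the same set of selectors.
function unionCompatible(a: Tuple, b: Tuple): boolean {
  const ha = heading(a), hb = heading(b);
  return ha.length === hb.length && ha.every((k, i) => k === hb[i]);
}
```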

 

Andl - A New Database Language - andl.org
Quote from Dave Voorhis on August 24, 2019, 5:27 pm
Quote from dandl on August 24, 2019, 2:23 pm
  1. The statically typed GP languages use a record structure with (a) ordered members (b) type name equivalence. The named perspective presumes a tuple with (a) a set of attributes (b) equivalence for types that share the same set of attributes.
  2. The statically typed GP languages do not provide a mechanism to index into a record (by name or otherwise). This allows no way to write generic routines for project, join, union etc. You can't even write a generic equals method.

Obviously there are ways to solve all the problems by going outside the language: code generation, a preprocessor, reflection, etc. The problem doesn't go away, we just paper over it.

Or you can give up on type inference and just use dynamic typing and hash tables for everything. Be prepared for some runtime surprises.

And I wouldn't want to suggest going back to the unnamed perspective is any better. If anything it's probably worse.

Indeed, there is no ideal solution here. It's all tradeoffs, and a matter of deciding what you can tolerate vs what you can't.

For example, in the usual popular general purpose languages, class x {int p} is a different type from class y {int p}, so if you use classes for tuples, does that represent a show-stopping problem?

...

I didn't realise a "named perspective" presumes "a tuple with (a) a set of attributes (b) equivalence for types that share the same set of attributes". I thought a "named perspective" meant named references (e.g., parameters in a function definition in C/C++) rather than positional references (e.g., arguments to a function invocation in C/C++.)

"Named perspective" is the term in the Alice book [Section 3.2 ff]. I don't think I've seen it elsewhere. They draw out what it means in dribs and drabs, in parallel with drawing out the "Unnamed perspective" -- i.e. positional. IMO they're not too clear on all the implications. But then their mission is to get to Datalog, which is positional only.

They claim the two perspectives are equivalent because you can always impose an ordering on the attribute names, then for any given tuple type you can translate the names into offsets. But that doesn't work for typical sum-of-products style programming language tuples: they define tuples positionally; define each name as corresponding to some position within that tuple type only; and don't impose any ordering on names within the tuple.
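The claimed translation can be sketched -- sort the attribute names, then read the tuple positionally (names here are illustrative):

```typescript
// Alice-book equivalence: fix an order on the attribute names (say,
// alphabetical), then read a named tuple positionally.
function toPositional(t: Record<string, unknown>): unknown[] {
  return Object.keys(t).sort().map(k => t[k]);
}

const named = { SNAME: "Smith", SNO: "S1" };
const positional = toPositional(named); // ["Smith", "S1"], since SNAME < SNO

// The catch for sum-of-products-style language tuples: there, a name means
// "position N of *this* declared tuple type" and nothing orders SNAME
// against SNO at all, so the translation has no canonical ordering to use.
```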

I thought "a tuple with (a) a set of attributes (b) equivalence for types that share the same set of attributes" was largely a TTM perspective -- at least in this context -- though not necessarily limited to TTM.

Typically the programming languages we're talking about only regard names as a shorthand for accessing a component of a structure, that is access for fetch or overwrite in situ. For any whole-tuple or whole-set-of-tuples operation (not necessarily limited to TTM, as you say) they typically expect the whole-tuples to have same type, that is: same component names accessing corresponding component positions. Most can't build-a-new-tuple-on-the-fly: you must give a static declaration naming each tuple type the program is going to use.

For a bit of theoretical purity, I think David's right to say "equivalence" of types rather than "equality".

The problem with either hacking the compiler or reflection/templating/code generators is polymorphism: we want to write (say) functions over sets of tuples whose heading includes at least attributes S#, SNAME and/or excludes attribute CITY -- because robust relational expressions typically mention only the attributes of interest for some query, and try to be tolerant of small-scale schema change. I'd say David's point b) re "same set of attributes" is only a small part of the challenge.
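The "includes at least" half is expressible in at least one mainstream GP language. A TypeScript sketch, with S# written as SNO and the attribute names hypothetical, of a function polymorphic over any heading containing SNO and SNAME:

```typescript
// "Heading includes at least SNO and SNAME": a structural constraint that
// admits any record type carrying extra attributes as well.
function describe<T extends { SNO: string; SNAME: string }>(t: T): string {
  return `${t.SNO}: ${t.SNAME}`;
}

const s = { SNO: "S1", SNAME: "Smith", STATUS: 20 }; // extra attribute is fine
const label = describe(s);

// "Heading *excludes* CITY" is the hard half: mainstream type systems,
// TypeScript included, have no direct way to demand the absence of an
// attribute, which is what a safe EXTEND-with-CITY would need.
```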

SQL can somewhat get there by declaring stored procedures, and recompiling them each time the schema changes. But it's weird to say that what's locked up inside the clunky old DBMS is more polymorphic than what you can achieve in modern GP languages. (Haskell's super-power polymorphism and overloading mechanisms are no better here: you can't express 'has no attribute CITY' -- which is what you want for polymorphic code that's going to extend a set-of-tuples with attribute CITY.) So then the GP language program passes a query request as a String (the universal data type), and gets results back as a String (or maybe JSON etc), and any static type checking has gone out the window.

Quote from dandl on August 25, 2019, 1:48 am
Quote from Dave Voorhis on August 24, 2019, 5:27 pm
Quote from dandl on August 24, 2019, 2:23 pm
  1. The statically typed GP languages use a record structure with (a) ordered members (b) type name equivalence. The named perspective presumes a tuple with (a) a set of attributes (b) equivalence for types that share the same set of attributes.
  2. The statically typed GP languages do not provide a mechanism to index into a record (by name or otherwise). This allows no way to write generic routines for project, join, union etc. You can't even write a generic equals method.

Obviously there are ways to solve all the problems by going outside the language: code generation, a preprocessor, reflection, etc. The problem doesn't go away, we just paper over it.

[I wrote code to do all this in Andl.NET, based on reflection. It works just fine, until it doesn't. You have to treat it nicely, and that's no way to write robust tools.]

Or you can give up on type inference and just use dynamic typing and hash tables for everything. Be prepared for some runtime surprises.

And I wouldn't want to suggest going back to the unnamed perspective is any better. If anything it's probably worse.

Indeed, there is no ideal solution here. It's all tradeoffs, and a matter of deciding what you can tolerate vs what you can't.

For example, in the usual popular general purpose languages, class x {int p} is a different type from class y {int p}, so if you use classes for tuples, does that represent a show-stopping problem?

It does, because of the new types generated by relational operators. You might adopt a convention of alphabetically ordering member names, but that may break when you join {A,B} and {B,C} vs join {B,C} and {A,B}.

Or is it a tolerable quirk, one you can accept in return for being able to use a conventional and popular language?

Or, do you accept the awkwardness of a code generator in order to use a conventional and popular language and gain TTM-style tuples?

This is probably the method of choice (it was good enough for Stroustrup). [The alternative is to hack into the compiler, or figure out a whole new take on the RM.]

What you're going to finish up with is something similar to LINQ. Operations on local data can use the full richness of the host language, operations on server data would have to be translated into SQL and live with whatever restrictions that implies.

Or, do you give up all the benefits of conventional and popular languages in order to gain the benefits of a new language?

Tradeoffs.

I didn't realise a "named perspective" presumes "a tuple with (a) a set of attributes (b) equivalence for types that share the same set of attributes". I thought a "named perspective" meant named references (e.g., parameters in a function definition in C/C++) rather than positional references (e.g., arguments to a function invocation in C/C++.)

I thought "a tuple with (a) a set of attributes (b) equivalence for types that share the same set of attributes" was largely a TTM perspective -- at least in this context -- though not necessarily limited to TTM.

Yes, using the word 'type' at this stage is premature. The essence of the named perspective is spelled out in HHT: the selectors (attribute names) form a set rather than an ordered list. Relations are union-compatible if they have the same set of selectors. Project and Join result in relations each with a new set of selectors. The effect is similar: these are not features that fit comfortably in a statically typed language.

I wasn't intending to debate this or that specific point, but to point out that all choices here involve tradeoffs.

For example, you might regard use of classes to implement tuples in popular general-purpose programming languages as a breaking point because join treats class p {int x} and class q {int x} as different types. The solution, then, is a preprocessor or different language, or maybe a different approach using operators not based on a relational algebra. Now you have a solution you're happy with, but that someone else considers broken because there's a preprocessor or a different language or an approach not based on a relational algebra...

In other words, there can be no single "right" approach. There will only ever be implementations that trade off one set of limitations for another.

--

Though I note, largely as an aside, that I'm increasingly questioning the assumption that a relational algebra is the right approach, particularly as what characterises the "power" of a database language for most users is not adherence to a relational algebra (or what's influenced by it -- SQL) but the fact that it's reasonably intuitive and provides persistence and ACID-compliant transactional state. (Though with much of NoSQL you can leave out ACID compliance and nobody seems to mind.)

Note that .NET LINQ and Java Streams and other collection frameworks provide an effective means to query containers without being a familiar relational algebra. Indeed, what they lack isn't a relational algebra but easy persistence and, often, ACID-compliant transactional state. Does the average -- or even well above average -- programmer care whether they implement a relational algebra (or not), as long as they're powerful and expressive -- and, ideally, provide easy persistence and ACID compliance?

I'm the forum administrator and lead developer of Rel. Email me at dave@armchair.mb.ca with the Subject 'TTM Forum'. Download Rel from https://reldb.org
Quote from AntC on August 25, 2019, 6:07 am
Quote from Dave Voorhis on August 24, 2019, 5:27 pm
Quote from dandl on August 24, 2019, 2:23 pm
  1. The statically typed GP languages use a record structure with (a) ordered members (b) type name equivalence. The named perspective presumes a tuple with (a) a set of attributes (b) equivalence for types that share the same set of attributes.
  2. The statically typed GP languages do not provide a mechanism to index into a record (by name or otherwise). This allows no way to write generic routines for project, join, union etc. You can't even write a generic equals method.

Obviously there are ways to solve all the problems by going outside the language: code generation, a preprocessor, reflection, etc. The problem doesn't go away, we just paper over it.

Or you can give up on type inference and just use dynamic typing and hash tables for everything. Be prepared for some runtime surprises.

And I wouldn't want to suggest going back to the unnamed perspective is any better. If anything it's probably worse.

Indeed, there is no ideal solution here. It's all tradeoffs, and a matter of deciding what you can tolerate vs what you can't.

For example, in the usual popular general purpose languages, class x {int p} is a different type from class y {int p}, so if you use classes for tuples, does that represent a show-stopping problem?

...

I didn't realise a "named perspective" presumes "a tuple with (a) a set of attributes (b) equivalence for types that share the same set of attributes". I thought a "named perspective" meant named references (e.g., parameters in a function definition in C/C++) rather than positional references (e.g., arguments to a function invocation in C/C++.)

"Named perspective" is the term in the Alice book [Section 3.2 ff]. I don't think I've seen it elsewhere. They draw out what it means in dribs and drabs, in parallel with drawing out the "Unnamed perspective" -- i.e. positional. IMO they're not too clear on all the implications. But then their mission is to get to Datalog, which is positional only.

...

Ah, I completely missed that "unnamed perspective" was a specific reference. There's a reason academic papers always include citations, even for terminology that should be obvious to their readers, and this is it. (Or at least a big part of it.)

Quote from AntC on August 25, 2019, 6:07 am

...

The problem with either hacking the compiler or reflection/templating/code generators is polymorphism: we want to write (say) functions over sets of tuples whose heading includes at least attributes S#, SNAME and/or excludes attribute CITY -- because robust relational expressions typically mention only the attributes of interest for some query, and try to be tolerant of small-scale schema change. I'd say David's point b) re "same set of attributes" is only a small part of the challenge.

SQL can somewhat get there by declaring stored procedures, and recompiling them each time the schema changes. But it's weird to say that what's locked up inside the clunky old DBMS is more polymorphic than what you can achieve in modern GP languages. (Haskell's super-power polymorphism and overloading mechanisms are no better here: you can't express 'has no attribute CITY' -- which is what you want for polymorphic code that's going to extend a set-of-tuples with attribute CITY.) So then the GP language program passes a query request as a String (the universal data type), and gets results back as a String (or maybe JSON etc), and any static type checking has gone out the window.

Yes. Though per my response in another post, I'm not sure Yet Another Language (or a preprocessor to turn an existing language into Yet Another Language) that solves this represents anything but a set of tradeoffs that meet a particular set of ideals whilst ensuring the language will only ever be used by a vanishingly small group of users.

I'm the forum administrator and lead developer of Rel. Email me at dave@armchair.mb.ca with the Subject 'TTM Forum'. Download Rel from https://reldb.org
Quote from Dave Voorhis on August 25, 2019, 9:16 am

Though I note, largely as an aside, that I'm increasingly questioning the assumption that a relational algebra is the right approach, particularly as what characterises the "power" of a database language for most users is not adherence to a relational algebra (or what's influenced by it -- SQL) but the fact that it's reasonably intuitive and provides persistence and ACID-compliant transactional state. (Though with much of NoSQL you can leave out ACID compliance and nobody seems to mind.)

Note that .NET LINQ and Java Streams and other collection frameworks provide an effective means to query containers without being a familiar relational algebra. Indeed, what they lack isn't a relational algebra but easy persistence and, often, ACID-compliant transactional state. Does the average -- or even well above average -- programmer care whether they implement a relational algebra (or not), as long as they're powerful and expressive -- and, ideally, provide easy persistence and ACID compliance?

Those are two quite different things. Only a tiny part of the accesses to a data repository are for the purposes of update, and an even tinier fraction assume a formal database and concurrent/atomic update. A database language to support those updates is important, but merely a niche. I write tools to make it possible, but I never use them. ACID is way down the list on what I need.

Most data is queried far more often than it is updated, and it is reasonable to expend energy on better query languages that operate within a familiar environment. Recent announcements include PartiQL, NRQL, DQL, and I'm sure I've missed a few. Are they based on the RA? I have no idea, but they do give some idea where the action is happening.

Andl - A New Database Language - andl.org
Quote from dandl on August 24, 2019, 2:23 pm
  1. The statically typed GP languages use a record structure with (a) ordered members (b) type name equivalence. The named perspective presumes a tuple with (a) a set of attributes (b) equivalence for types that share the same set of attributes.

That turns out not to be the case, at least not always. C, C++, and Cobol have ordered members, but there is no ordering in Java or C#, where objects aren't even necessarily laid out in memory until class loading time. Algol 68 uses structural type equivalence exclusively, and OCaml, Go, and the C++ template sublanguage use it to varying extents.
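A sketch of the contrast, using TypeScript (whose type system is structurally equivalent throughout) as the illustration:

```typescript
// Structural type equivalence, as in Algol 68 and (to varying extents)
// OCaml and Go: two independently declared types with the same members
// are interchangeable.
interface X { p: number }
interface Y { p: number }

const x: X = { p: 1 };
const y: Y = x;   // accepted: X and Y are structurally the same type

// Under name equivalence (Java, C#) the assignment above would be rejected:
// `class x {int p}` and `class y {int p}` are distinct types.
```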
