The Forum for Discussion about The Third Manifesto and Related Matters

You need to log in to create posts and topics.

How to resolve the type system-relational algebra mismatch

We talk about the Object Relational Mismatch, but TTM shows us the problem is deeper than that. TTM/D presents a perfectly plausible type system in two parts:

  1. A scalar type system (RM Pre 4) reasonably similar to and compatible with value types (but not OO types) in other languages
  2. A non-scalar type system (RM Pre 6 and 7) designed specifically for the convenience of the 'usual operators of the relational algebra' (RM Pre 18), and highly incompatible with just about every other language.

The solutions on offer are:

  1. Implement TTM/D, as an island language with no bridges to anywhere. Write applications in industrial D. [This is the thrust of D&D]
  2. Implement TTM/D as a hosted language. Write applications in a mix of HL and D. [This is Andl, or Rel]
  3. Ignore TTM, implement a simple type system and a hackneyed version of the RA in a unique language (SQL). Write applications in a mix of HL and SQL, with some kind of bridge between them. [This is ORM, xDBC, +/- stored procedures]
  4. (variant on 3): embed or generate SQL in HL [This is LINQ to SQL, jOOQ, etc].
  5. Implement a modestly powerful type system and a simplified version of the RA in a minimal language, embed in HL. [This was the proposal I described recently, based on domains and JSON.]
  6. Find a new compromise: the intersection of a rich type system and a 'logical equivalent of the relation algebra' that is compatible with existing languages and can manage data held locally or on remote servers. [A dream?]
  7. Find a new direction: some method other than a novel type system to embed a 'logical equivalent of the relation algebra' into existing languages.

It seems (1) is a dead-end. I've done (2) and it's not going anywhere that I can see. The world has gone for 3 and 4, but not without pain. I know how to do (5), and I think it has possibilities.

But (6) and (7) would perhaps be more interesting. Thoughts?

 

Andl - A New Database Language - andl.org
Quote from dandl on October 29, 2019, 12:21 am

 

The solutions on offer are:

  1. ...
  2. ...
  3. Ignore TTM, implement a simple type system and a hackneyed version of the RA in a unique language (SQL). Write applications in a mix of HL and SQL, with some kind of bridge between them. [This is ORM, xDBC, +/- stored procedures]

I wasn't aware SQL has a type system corresponding to TTM's non-scalars. AFAICT, what SQL calls a type systems is all to do with column types (and user-defined-ish types for columns). How SQL deals with tuples (not really), and database tables, and query results (multisets with possibly anonymous, possibly duplicate column names), and table literals (badly) seems to be all ad-hoc and hand-waving.

Please tell the secret sauce of SQL non-scalar types. (More than a few people would like to know.)

 

Here's a sketch of one way to implement alternative #7.

  1. Start with an ordinary OO language such as Java or C#.  I'll call it "Java".
  2. Figure out an appropriate syntax for tuple types, preferably one that is easy to recognize (by a regular expression, perhaps) when embedded in Java.  The resulting language is Java+T.
  3. Write a preprocessor that translates Java+T into Java by mapping each tuple type to a class whose name can be deduced from the attribute names and type names that form the tuple: thus TUP {a int, b int} might map to TUP.a_int.b_int, where the attributes are put into alphabetical order.   (There was a pre-C++ preprocessor for providing singly dispatched classes that worked something like this, but I can't find a link.)
  4. The preprocessor also generates Java files defining these classes, each of which has a private field per tuple element, a public constructor, and a public accessor per tuple element.  If the same class is constructed by more than one preprocessor run, it will have the same name, so the collision is unimportant.
  5.  A relation type is an instantiation of the native Java interface Set<T>, where T is a tuple name.  The RA (including transitive closure, quota, etc.) is implemented using generic methods.  You'll probably need reflection, which Java has.

Alternative #8:  Embed the RA into a dynamically typed reflective language like Python.  The Tuple class is a facade over dictionaries, and the Relation class is a facade over sets.  Don't worry about static typing except at the boundary to SQL engines that are used for storage.  (Greg Gaughan's Dee.)

Quote from AntC on October 29, 2019, 5:08 am
Quote from dandl on October 29, 2019, 12:21 am

 

The solutions on offer are:

  1. ...
  2. ...
  3. Ignore TTM, implement a simple type system and a hackneyed version of the RA in a unique language (SQL). Write applications in a mix of HL and SQL, with some kind of bridge between them. [This is ORM, xDBC, +/- stored procedures]

I wasn't aware SQL has a type system corresponding to TTM's non-scalars. AFAICT, what SQL calls a type systems is all to do with column types (and user-defined-ish types for columns). How SQL deals with tuples (not really), and database tables, and query results (multisets with possibly anonymous, possibly duplicate column names), and table literals (badly) seems to be all ad-hoc and hand-waving.

Please tell the secret sauce of SQL non-scalar types. (More than a few people would like to know.)

 

As you are well aware, it doesn't. That's what I meant by a 'simple type system', the most minimal thing that could possibly work, as per SQL

There is currently no evidence to the effect that a type system that incorporates TTM's non-scalar types is a good thing, and good reason to say that it's not. I could make a perfectly good case to use JSON+dates, and store RVAs in JSON's array of object.

Even better, introduce a new single type called 'relation'. It would save a lot of grief. The compiler would check all relational expressions for syntactic correctness and then perform attribute checking separately. It's putting the attribute names into the type system that causes grief.

Andl - A New Database Language - andl.org
Quote from johnwcowan on October 29, 2019, 5:21 am

Here's a sketch of one way to implement alternative #7.

  1. Start with an ordinary OO language such as Java or C#.  I'll call it "Java".
  2. Figure out an appropriate syntax for tuple types, preferably one that is easy to recognize (by a regular expression, perhaps) when embedded in Java.  The resulting language is Java+T.
  3. Write a preprocessor that translates Java+T into Java by mapping each tuple type to a class whose name can be deduced from the attribute names and type names that form the tuple: thus TUP {a int, b int} might map to TUP.a_int.b_int, where the attributes are put into alphabetical order.   (There was a pre-C++ preprocessor for providing singly dispatched classes that worked something like this, but I can't find a link.)
  4. The preprocessor also generates Java files defining these classes, each of which has a private field per tuple element, a public constructor, and a public accessor per tuple element.  If the same class is constructed by more than one preprocessor run, it will have the same name, so the collision is unimportant.
  5.  A relation type is an instantiation of the native Java interface Set<T>, where T is a tuple name.  The RA (including transitive closure, quota, etc.) is implemented using generic methods.  You'll probably need reflection, which Java has.

Alternative #8:  Embed the RA into a dynamically typed reflective language like Python.  The Tuple class is a facade over dictionaries, and the Relation class is a facade over sets.  Don't worry about static typing except at the boundary to SQL engines that are used for storage.  (Greg Gaughan's Dee.)

It's a reasonable start, but a couple of issues.

  1. You haven't spelled out the form of the actual code that present the RA. Is it a novel syntax, or does it look like Java function calls?
  2. Changing the syntax of the developer language does not play nicely with modern tools. You won't get syntax completion, type checking, debugging to work nicely, if at all.
  3. I think you're assuming an existing database, SQL, and just SQL types to deal with.

This is the 'cfront' approach. The alternative is the 'separate module' approach.

  1. As above. I'll go with Java, although I do prefer C#.
  2. Figure out an appropriate syntax for a stand-alone mini language, complete with a type system and explicit imports/exports. The resulting language is RAJ (RA for Java).
  3. Write a compiler that translates RAJ into Java, creating suitable exported function calls and types as you wish. The exports should look familiar and natural to a Java programmer.
  4. The compiler generates classes, each of which has a private field per tuple element, a public constructor, and a public accessor per tuple element.
  5. A relation value can be any enumerated type over a tuple. Set <> is OK if you want. Best to use a templating approach so this can be changed as needed.

In this case the RAJ module looks exactly like a Java class definition. It gets compiled first, and the developer sees all the exports as expected. The whole thing checks into the repo. Sweet.

I would expect to find that someone has already done this with SQL as the sole backend. I would prefer to provide multiple backends, including the use of local or in-memory storage. There are some serious problems to solve on the implementation side.

 

Andl - A New Database Language - andl.org
Quote from dandl on October 29, 2019, 6:37 am
My "cfront" design:

1) You haven't spelled out the form of the actual code that present the RA. Is it a novel syntax, or does it look like Java function calls?

Definitely the latter, as in Dee.  The only non-native syntax is for tuple types.

2) Changing the syntax of the developer language does not play nicely with modern tools. You won't get syntax completion, type checking, debugging to work nicely, if at all.

Granted.  I don't use any of that in any language (except type checking in Scala, where there is no escaping it), so it doesn't naturally occur to me.  But it is probably a fatal objection nowadays.  Kotlin comes with the necessary and so does Scala, but Scala's support has ongoing problems (as does Scala).

3) I think you're assuming an existing database, SQL, and just SQL types to deal with.

No, not at all.  Never entered my mind.

Your "separate module" design:

2) Figure out an appropriate syntax for a stand-alone mini language, complete with a type system and explicit imports/exports. The resulting language is RAJ (RA for Java).

How mini is that?  Is it at the Tutorial D level?  That would provide two competing imperative languages.  Perhaps it should just provide pure functions that allow you to assign names to open RA expressions (that is, expressions with lambda-bound variables), plus update operators?  Presumably these would look like static methods to Java.

4) The compiler generates classes, each of which has a private field per tuple element, a public constructor, and a public accessor per tuple element.

Is this meant to use the same (ugly) collision-avoidance strategy as I outlined, or does the programmer have to invent a name for every tuple type that is referred to in an exported function?

5) A relation value can be any enumerated type over a tuple. Set <> is OK if you want. Best to use a templating approach so this can be changed as needed.

If it's not a set, it doesn't have RA semantics, but of course many classes implement Set<T>, and this system could add new ones, including facades over external database relations.

 

Quote from dandl on October 29, 2019, 12:21 am

We talk about the Object Relational Mismatch, but TTM shows us the problem is deeper than that. TTM/D presents a perfectly plausible type system in two parts:

  1. A scalar type system (RM Pre 4) reasonably similar to and compatible with value types (but not OO types) in other languages
  2. A non-scalar type system (RM Pre 6 and 7) designed specifically for the convenience of the 'usual operators of the relational algebra' (RM Pre 18), and highly incompatible with just about every other language.

The solutions on offer are:

  1. Implement TTM/D, as an island language with no bridges to anywhere. Write applications in industrial D. [This is the thrust of D&D]
  2. Implement TTM/D as a hosted language. Write applications in a mix of HL and D. [This is Andl, or Rel]
  3. Ignore TTM, implement a simple type system and a hackneyed version of the RA in a unique language (SQL). Write applications in a mix of HL and SQL, with some kind of bridge between them. [This is ORM, xDBC, +/- stored procedures]
  4. (variant on 3): embed or generate SQL in HL [This is LINQ to SQL, jOOQ, etc].
  5. Implement a modestly powerful type system and a simplified version of the RA in a minimal language, embed in HL. [This was the proposal I described recently, based on domains and JSON.]
  6. Find a new compromise: the intersection of a rich type system and a 'logical equivalent of the relation algebra' that is compatible with existing languages and can manage data held locally or on remote servers. [A dream?]
  7. Find a new direction: some method other than a novel type system to embed a 'logical equivalent of the relation algebra' into existing languages.

It seems (1) is a dead-end. I've done (2) and it's not going anywhere that I can see. The world has gone for 3 and 4, but not without pain. I know how to do (5), and I think it has possibilities.

But (6) and (7) would perhaps be more interesting. Thoughts?

6 and 7 are indeed more interesting, but lead to an inevitable question: Why do we want RA?

RA is uncontroversially (and perhaps obviously, but I would say that as a "fan" of TTM and the RM) the right approach for even a non-D, non-TTM form of 1 and 2. But I think we all agree that 1 and 2 aren't exactly setting the database world on fire.

3 is the "standard" approach; probably the majority of deployed applications in use today use it in one form or another. There are almost certainly many ways to improve it. Even my datasheet is certainly 3-flavoured. It is not the RA, other than to the extent that writing SQL queries -- which it generally doesn't attempt to hide or abstract away -- is RA.

4 is ok for queries, unsuitable (or unable) to generate (clean) SQL schemas. Often leads to frustration with limitations, and that leads back to 3.

5 sounds like 3 but with a potentially awkward abstraction layer around SQL. It doesn't sound appealing, at least not to me.

7 is purely speculative, so we can only imagine what novel and as-yet-unconsidered formalisms and practical approaches might be appropriate here.

That leaves 6.

Does 6 need -- or so strongly benefit from -- RA, such that there is virtually no alternative?

I would argue that if we're talking writing host-language code (for the masses, usually C# or Java or Python) we are almost certainly better off writing queries to data providers in native SQL + prepared statements via API calls, or MongoDB via API calls, or Hadoop API calls, or Apache POI/XXE (MS Excel) API calls, or whatever API calls.

Attempts to abstract these under one unifying API usually winds up with disappointing lack of capability ("the connector for Hadoop won't let me do x and I need x!", etc.)  and/or grossly suboptimal code and/or non-support for the latest data provider until somebody writes a "connector" for the new data provider so that it's supported by the unifying API. Of course, the "connector" for that new data provider usually turns out to be inadequate to support some key feature of the new data provider, so the "connector" API needs to be changed so that every data provider supports it... And so on.

We're almost certainly better off not trying to unify all data providers under one "data" API. It may be tempting to turn data providers and their data into relvars, but I think that will end in a #4-like frustration.

So no RA there.

Data providers, when queried via their API's, typically emit collections -- usually (if Java) some Set<T> or List<T> or Iterator<T>, or similar. I suggest -- and I genuinely suggest, this is not a rhetorical "suggest" that is actually an attempt to persuade -- that for manipulation of result-of-query collections, we don't need the RA. We already have a suitable composable, expressive, container/collection abstraction: .NET LINQ or Java Streams. These provide all the abstraction and expressive composability that we seek from the RA, fully-integrated with the rich(ish) type system of the host language, and both are capable and familiar to Java/C# programmers. Yes, it's not the RA we're used to, and it requires hand-optimisation of queries, but I'm not sure that's a compelling reason to switch back to the RA and give up the clean integration that LINQ/Streams give us.

It does suggest there's maybe work to be done on automated re-writing of LINQ/Streams queries to provide query optimisation.

So where does that leave the RA?

Unfortunately, given these options, it leaves it behind. It's really 1 and 2 that we want -- and that's where the RA would be most effective and most applicable -- but that's not what happened.

 

I'm the forum administrator and lead developer of Rel. Email me at dave@armchair.mb.ca with the Subject 'TTM Forum'. Download Rel from https://reldb.org
Quote from johnwcowan on October 29, 2019, 5:21 am

Here's a sketch of one way to implement alternative #7.

  1. Start with an ordinary OO language such as Java or C#.  I'll call it "Java".
  2. Figure out an appropriate syntax for tuple types, preferably one that is easy to recognize (by a regular expression, perhaps) when embedded in Java.  The resulting language is Java+T.
  3. Write a preprocessor that translates Java+T into Java by mapping each tuple type to a class whose name can be deduced from the attribute names and type names that form the tuple: thus TUP {a int, b int} might map to TUP.a_int.b_int, where the attributes are put into alphabetical order.   (There was a pre-C++ preprocessor for providing singly dispatched classes that worked something like this, but I can't find a link.)
  4. The preprocessor also generates Java files defining these classes, each of which has a private field per tuple element, a public constructor, and a public accessor per tuple element.  If the same class is constructed by more than one preprocessor run, it will have the same name, so the collision is unimportant.
  5.  A relation type is an instantiation of the native Java interface Set<T>, where T is a tuple name.  The RA (including transitive closure, quota, etc.) is implemented using generic methods.  You'll probably need reflection, which Java has.

Alternative #8:  Embed the RA into a dynamically typed reflective language like Python.  The Tuple class is a facade over dictionaries, and the Relation class is a facade over sets.  Don't worry about static typing except at the boundary to SQL engines that are used for storage.  (Greg Gaughan's Dee.)

Dynamically typed "scripting" languages like Python are great for one-person projects and such but dire for enterprise scale, unless you have a wall of tests and epic code coverage -- or some magnificent (i.e., impossible) team-wide developer discipline -- to take the place of static typing guarantees from the compiler.

So I don't like #8. I appreciate that others might find it ideal.

My datasheet project has an evolving core a lot like your #7 (or should it be #6? Have lost track...)

  1. It starts with Java.
  2. A tuple type is simply an ordinary class with public getters and setters. It's not Java+T, it's just Java.
  3. Tuple types are generated dynamically as Java source code -- akin to your preprocessor -- and compiled into loadable classes at runtime using the Eclipse ECJ java compiler library. Typing is still nominal -- a tuple defined as class T1 { x int, y String } is a different type from a tuple defined as class T2 { x int, y String }, but I consider this experimental -- I'm curious whether it will turn out to be a problem or not given #5, below. If it turns out to be a problem, then I will explore faking structural typing -- perhaps using inheritance to give multiple classes of the same structural type different names by inheriting from a base type. E.g., class TBase {x int, y String }; class T1 extends TBase {}; class T2 extends TBase {};, etc.
  4. Set<T> or List<T> are the usual representations of collections of tuples. They are not necessarily relations in the RA sense, though Set<T> can certainly be understood that way.
  5. There is no RA; it uses the Streams API.
I'm the forum administrator and lead developer of Rel. Email me at dave@armchair.mb.ca with the Subject 'TTM Forum'. Download Rel from https://reldb.org
Quote from johnwcowan on October 29, 2019, 7:40 am
Quote from dandl on October 29, 2019, 6:37 am
My "cfront" design:

1) You haven't spelled out the form of the actual code that present the RA. Is it a novel syntax, or does it look like Java function calls?

Definitely the latter, as in Dee.  The only non-native syntax is for tuple types.

This is problematic. If you write join(relA,relB) what types are the the two arguments, to make that possible?

3) I think you're assuming an existing database, SQL, and just SQL types to deal with.

No, not at all.  Never entered my mind.

So how do you  decide which scalar types to support?

Your "separate module" design:

2) Figure out an appropriate syntax for a stand-alone mini language, complete with a type system and explicit imports/exports. The resulting language is RAJ (RA for Java).

How mini is that?  Is it at the Tutorial D level?  That would provide two competing imperative languages.  Perhaps it should just provide pure functions that allow you to assign names to open RA expressions (that is, expressions with lambda-bound variables), plus update operators?  Presumably these would look like static methods to Java.

Not imperative. It's an expression language which generates function calls visible in Java, with parameters passed in. There is a specific complication around dealing with open expressions, which I mentioned earlier and needs more thought.

The one exception: it might be useful to allow local temporary variables as a factoring convenience. They don't actually store anything.

4) The compiler generates classes, each of which has a private field per tuple element, a public constructor, and a public accessor per tuple element.

Is this meant to use the same (ugly) collision-avoidance strategy as I outlined, or does the programmer have to invent a name for every tuple type that is referred to in an exported function?

I lean towards default generated nonsense type names, with the ability to provide a meaningful name. The name is never really needed.

5) A relation value can be any enumerated type over a tuple. Set <> is OK if you want. Best to use a templating approach so this can be changed as needed.

If it's not a set, it doesn't have RA semantics, but of course many classes implement Set<T>, and this system could add new ones, including facades over external database relations.

The semantics don't matter. The query returns a collection; an update consumes a collection. The form of that collection is unimportant. Duplicates will be removed by the database.

Andl - A New Database Language - andl.org
Quote from dandl on October 29, 2019, 10:28 am

This is problematic. If you write join(relA,relB) what types are the the two arguments, to make that possible?

Something on the lines of Set<T extends Tuple>.  Introspection will be necessary to determine joinability.

So how do you  decide which scalar types to support?

Any serializable type.  The serialization algorithm doesn't have to be the standard Java one.

Not imperative. It's an expression language which generates function calls visible in Java, with parameters passed in. There is a specific complication around dealing with open expressions, which I mentioned earlier and needs more thought.

The one exception: it might be useful to allow local temporary variables as a factoring convenience. They don't actually store anything.

LGTM, including the variables.  Being able to log things is also valuable.

I lean towards default generated nonsense type names, with the ability to provide a meaningful name.

I think that presumes that the RAJ compiler has access to all the .raj files in the repo simultaneously, otherwise it cannot be sure of using the same class for the same tuple type throughout.  My ugly convention allows separate compilation.

The name is never really needed.

If the Java code is to have variables whose type is a tuple, it needs access to the type names.

The semantics don't matter. The query returns a collection; an update consumes a collection. The form of that collection is unimportant. Duplicates will be removed by the database.

I suppose that works.  But the semantics is that of a set, and using a Set interface more tightly constrains the behavior of the Java code.