The Forum for Discussion about The Third Manifesto and Related Matters


User-defined operators vs. SQL back ends


Lots of TD implementations seem to have SQL back ends.  I'm not speaking here of being able to treat a SQL database as a data source, on all fours with a spreadsheet or CSV or whatever; I'm speaking of storing the relations exposed to the user in a SQL database, for ACIDity or what not.

Generally, the more work you can offload to the SQL engine, the better: it's optimized for it, and in the case of a remote server it minimizes network traffic.  The obvious cases are select, project, and join.  The last two are straightforward (assuming the TD end keeps its own catalog relvars, which I do).  However, outsourcing selection bumps up against user-defined operators.  Is it typical to translate calls on those operators (at least the pure and functional ones) into lower-level operations known to the SQL engine (since after all TD has the source available), or to just punt on WHERE clauses that include them, processing them on the TD side?  Or a mixed strategy, with partial selection in SQL, the rest in TD?
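To make the mixed strategy concrete, here is roughly what I have in mind, sketched in Python against SQLite; the relvar, the operator and the split are all invented for illustration. Conjuncts the engine understands go into the shipped WHERE clause, and the remainder is applied to the returned tuples on the TD side.

    from math import sinh
    import sqlite3

    # Each WHERE conjunct carries a SQL rendering (None when the engine cannot
    # evaluate it, e.g. because it calls a user-defined operator) plus a
    # client-side predicate for fallback evaluation.
    conjuncts = [
        ("r > 0", lambda r: r > 0),
        (None,    lambda r: sinh(r) > 10),   # user-defined sinh: unknown to SQL
    ]
    pushable = [sql for sql, _ in conjuncts if sql is not None]
    residual = [pred for sql, pred in conjuncts if sql is None]

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE RR (r REAL)")
    conn.executemany("INSERT INTO RR VALUES (?)", [(0.5,), (2.0,), (4.0,)])

    query = "SELECT r FROM RR"
    if pushable:
        query += " WHERE " + " AND ".join(pushable)

    rows = conn.execute(query).fetchall()          # coarse filtering in the engine
    result = [row for row in rows if all(p(row[0]) for p in residual)]
    print(result)                                  # [(4.0,)]

The open question is how much of that second conjunct could instead be rewritten into something the engine does understand.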

I don't think there are lots of TD implementations. I think there are currently three -- MighTyD (no longer active) with PostgreSQL for a storage engine, DuroDBMS with the Berkeley DB for a storage engine, and Rel with the Berkeley DB (Java edition) for a storage engine.  There's also Project:M36, but it's more Tutorial D-ish than Tutorial D; not sure what it uses for a storage engine.

So there's only one Tutorial D implementation that relies on a SQL back end, and as I recall, MighTyD doesn't support user-defined operators.

Other D implementations are based on SQL DBMSs and may offer a better answer. Ask @david-bennett-2 about Andl.

I'm the forum administrator and lead developer of Rel. Email me at dave@armchair.mb.ca with the Subject 'TTM Forum'. Download Rel from https://reldb.org
Quote from johnwcowan on August 28, 2019, 2:23 pm

Lots of TD implementations seem to have SQL back ends.  I'm not speaking here of being able to treat a SQL database as a data source, on all fours with a spreadsheet or CSV or whatever; I'm speaking of storing the relations exposed to the user in a SQL database, for ACIDity or what not.

Shall we broaden that to TTM/D implementations? TD is just one syntax and feature set.

Having said that, the only one I know is Andl, the one I wrote.

Generally, the more work you can offload to the SQL engine, the better: it's optimized for it, and in the case of a remote server it minimizes network traffic.  The obvious cases are select, project, and join.  The last two are straightforward (assuming the TD end keeps its own catalog relvars, which I do).  However, outsourcing selection bumps up against user-defined operators.  Is it typical to translate calls on those operators (at least the pure and functional ones) into lower-level operations known to the SQL engine (since after all TD has the source available), or to just punt on WHERE clauses that include them, processing them on the TD side?  Or a mixed strategy, with partial selection in SQL, the rest in TD?

If you check my post on RM by degrees, this is precisely the point. Expressions written according to the FORA (First Order RA), no matter what dialect, can be translated reliably into SQL and shipped to the server. SORA (Second Order) expressions in general cannot. SORA relies on 'open expressions', which have to be evaluated multiple times to return a result. The definition of an open expression is in the TD docs, nowhere else.

To the extent that SQL is a SORA implementation you can translate a subset of function calls, but it's hard work. As it happens, this is exactly how LINQ to SQL works, and just implementing a provider for that is a ton of work, let alone implementing a framework. I looked, and decided not to.

Andl is a full implementation of the SORA: the lot, everything I could find. It runs natively on two SQL backends: SQLite and Postgres. In each case it has to run 'in-process' and it relies on callbacks from the SQL engine to evaluate open expressions. The SQLite implementation is reasonably complete; the Postgres implementation has some limits I could not escape, but it does work. I haven't tried any others. Please note that the callback process inevitably introduces its own overheads.
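For anyone unfamiliar with the mechanism, this is the shape of it. This is not Andl's code; it's a minimal sketch using Python's sqlite3 binding, which exposes the same callback hook Andl reaches through its own host language, with an invented table and function.

    import sqlite3
    from math import exp

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE RR (r REAL)")
    conn.executemany("INSERT INTO RR VALUES (?)", [(1.0,), (3.0,)])

    # Register the open expression's evaluator with the engine; SQLite then
    # calls back into the host language for every row that needs the value.
    conn.create_function("sinh", 1, lambda x: (exp(x) - exp(-x)) / 2.0)

    print(conn.execute("SELECT r, sinh(r) AS hs FROM RR").fetchall())

Every row that crosses the engine/host boundary like this is where the overhead comes from.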

I would like to think there is another approach, but so far no-one has suggested one.

 

Andl - A New Database Language - andl.org
Quote from dandl on August 28, 2019, 11:55 pm

Shall we broaden that to TTM/D implementations? TD is just one syntax and feature set.

Having said that, the only one I know is Andl, the one I wrote.

Sure.  Indeed, any implementation of the TTM principles would be equally relevant, whether it is technically a D or not.

To the extent that SQL is a SORA implementation you can translate a subset of function calls, but it's hard work. As it happens, this is exactly how LINQ to SQL works, and just implementing a provider for that is a ton of work, let alone implementing a framework. I looked, and decided not to.

This isn't quite the case I had in mind.  Suppose that the user defines the hyperbolic sine function, which is a pure function, thus: define sinh(x) = (exp(x) - exp(-x)) / 2.  Given a relation RR with just a real-valued attribute r, insert RR rel{r = sinh(3.141592653)} can be transformed into INSERT INTO RR(R) VALUES(11.5487393573), because this is a closed expression.  What's more, if you want to extend RR with a new attribute s such that s = sin(r), then you can send SELECT R, SIN(R) AS S FROM RR even though sin(r) is an open expression, because the SQL engine (or PostgreSQL at any rate) will understand it.

But if you want to extend RR with a new attribute hs whose value is sinh(r), you can't just send SELECT R, SINH(R) AS HS FROM RR, because PostgreSQL doesn't understand SINH.  But it does understand SELECT R, (EXP(R) - EXP(-R)) / 2 AS HS FROM RR, and since you have the definition of sinh available to you, you can make the expansion.

Obviously this is hardly worth doing for extension, but selection and join exhibit exactly the same issues and matter much more.  So for them you need to do a sort of decompilation of these functions.
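The "decompilation" needn't be anything deeper than inlining the operator's body as SQL text wherever it is applied. A toy sketch, in which the expression-tree encoding and the rendering rules are invented purely for illustration:

    # Body of a pure user-defined operator as a little expression tree:
    # leaves are parameter names or literals, nodes are (operator, operands...).
    SINH_BODY = ("/", ("-", ("exp", "x"), ("exp", ("neg", "x"))), 2.0)

    def to_sql(expr, bindings):
        # Render an expression tree as a SQL scalar expression.
        if isinstance(expr, str):
            return bindings[expr]                  # parameter -> attribute name
        if isinstance(expr, (int, float)):
            return repr(expr)
        op, *args = expr
        rendered = [to_sql(a, bindings) for a in args]
        if op == "exp":
            return "EXP(" + rendered[0] + ")"
        if op == "neg":
            return "-(" + rendered[0] + ")"
        return "(" + rendered[0] + " " + op + " " + rendered[1] + ")"

    hs = to_sql(SINH_BODY, {"x": "R"})
    print("SELECT R, " + hs + " AS HS FROM RR")
    # prints: SELECT R, ((EXP(R) - EXP(-(R))) / 2.0) AS HS FROM RR

The same rendered text can just as well be dropped into a WHERE clause, which is the case that actually matters.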

Quote from dandl on August 28, 2019, 11:55 pm

I would like to think there is another approach, but so far no-one has suggested one.

To using a SQL engine as a back-end?

Probably not, or at least not without turning the SQL engine into nothing more than storage for relvars.

Of course, not using a SQL engine as a back-end is an option, which is why several D implementations use (for example) the Berkeley DB, a transactional key/value store. At least one other implements its own back-end (e.g., SIRA_PRISE, if I recall correctly).

Quote from johnwcowan on August 29, 2019, 3:10 am

Sure.  Indeed, any implementation of the TTM principles would be equally relevant, whether it is technically a D or not.

A given implementation of the TTM principles would almost certainly be a D, because those principles are delineated in the TTM prescriptions and proscriptions, and any complete implementation of the TTM pre/pro-scriptions would be, by definition, a D.

D is the family of languages that adhere to the TTM pre/pro-scriptions. D is neither a specified syntax, nor a specified paradigm. D languages may have many different syntaxes and belong to different paradigms. When we write D, we usually mean any member of that family of languages. This is similar to the way Lisp typically refers to a family of languages with a recognisable parenthesised prefix syntax, rather than a specific member of the family like Common Lisp, Scheme or Clojure.

Tutorial D -- frequently abbreviated as TD -- is one specific member of the D family, with a specified syntax and semantics and imperative paradigm, and intended primarily for illustration and pedagogy. When we write Tutorial D or TD, we usually mean just that language, with the specific syntax and semantics defined by Date & Darwen.

Quote from Dave Voorhis on August 29, 2019, 6:24 am
Quote from johnwcowan on August 29, 2019, 3:10 am
Quote from dandl on August 28, 2019, 11:55 pm

Shall we broaden that to TTM/D implementations?

Sure.  Indeed, any implementation of the TTM principles would be equally relevant, whether it is technically a D or not.

A given implementation of the TTM principles would almost certainly be a D, because those principles are delineated in the TTM prescriptions and proscriptions, and any complete implementation of the TTM pre/pro-scriptions would be, by definition, a D.

... recognisable parenthesised prefix syntax ...

SIRA_PRISE was built from the ground up, for reasons all related to the issue depicted in the original question:

  • Problems supporting MA (multiple assignment), because SQL doesn't know it
  • "Minimizing network traffic" is outright false: multiple queries cannot be evaluated in one single call, because SQL engines don't support EXTEND TABLE_DEE {query1 ... query2 ...}
  • Exposing SQL's relational mistakes is always right around the corner
  • The level of SQL support varies too much between engines for the translation feature to stay manageable (especially relevant in the UDT/UDO arena)
  • etc etc
Quote from johnwcowan on August 29, 2019, 3:10 am
To the extent that SQL is a SORA implementation you can translate a subset of function calls, but it's hard work. As it happens, this is exactly how LINQ to SQL works, and just implementing a provider for that is a ton of work, let alone implementing a framework. I looked, and decided not to.

This isn't quite the case I had in mind.  Suppose that the user defines the hyperbolic sine function, which is a pure function, thus: define sinh(x) = (exp(x) - exp(-x)) / 2.  Given a relation RR with just a real-valued attribute r, insert RR rel{r = sinh(3.141592653)} can be transformed into INSERT INTO RR(R) VALUES(11.5487393573), because this is a closed expression.  What's more, if you want to extend RR with a new attribute s such that s = sin(r), then you can send SELECT R, SIN(R) AS S FROM RR even though sin(r) is an open expression, because the SQL engine (or PostgreSQL at any rate) will understand it.

But if you want to extend RR with a new attribute hs whose value is sinh(r), you can't just send SELECT R, SINH(R) AS HS FROM RR, because PostgreSQL doesn't understand SINH.  But it does understand SELECT R, (EXP(R) - EXP(-R)) / 2 AS HS FROM RR, and since you have the definition of sinh available to you, you can make the expansion.

Your choice of function is a bit unhelpful. All you did was raise the bar on how to rewrite functions so that SQL can handle them. There are plenty of functions for which this kind of expansion is impossible, or at least impracticable given typical SQL compiler limits. We don't really want to re-implement NumPy or LINPACK or IMSL in SQL code.

Obviously this is hardly worth doing for extension, but selection and join exhibit exactly the same issues and matter much more.  So for them you need to do a sort of decompilation of these functions.

SQL is a pretty bad target for cross-compilation.  Not going there.

The relevant places in TD where open expressions are required are WHERE, EXTEND and aggregation (not join). For EXTEND it matters little whether the computation is done before or after the data is shipped to the client, but for WHERE and aggregation it matters a lot. If we think of the evaluation of a function as a relation, then if that relation has a low cardinality it will be efficient to send data from server to client for computation, back to the server to evaluate the query, then send the final results back to the client. Efficient, but not easily arranged!
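Roughly the choreography I mean, with SQLite standing in for the server and the relvar and values invented: pull the distinct argument values, evaluate the open expression on the client, ship the results back as a temporary table, and let the engine finish the WHERE.

    import sqlite3
    from math import sinh

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE RR (grp TEXT, r REAL)")
    conn.executemany("INSERT INTO RR VALUES (?, ?)",
                     [("a", 1.0), ("a", 1.0), ("b", 3.0), ("b", 1.0)])

    # 1. Pull only the distinct argument values (the low-cardinality relation).
    args = [row[0] for row in conn.execute("SELECT DISTINCT r FROM RR")]

    # 2. Evaluate the open expression on the client.
    conn.execute("CREATE TEMP TABLE sinh_map (r REAL PRIMARY KEY, hs REAL)")
    conn.executemany("INSERT INTO sinh_map VALUES (?, ?)",
                     [(r, sinh(r)) for r in args])

    # 3. Let the engine finish the query with the shipped-back results.
    print(conn.execute(
        "SELECT RR.grp, RR.r FROM RR JOIN sinh_map ON sinh_map.r = RR.r "
        "WHERE sinh_map.hs > 2").fetchall())   # [('b', 3.0)]

With a remote server the two extra round trips are exactly the "not easily arranged" part.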

 

Quote from Dave Voorhis on August 29, 2019, 6:24 am

D is the family of languages that adhere to the TTM pre/pro-scriptions. D is neither a specified syntax, nor a specified paradigm. D languages may have many different syntaxes and belong to different paradigms.

Technically yes.  But it's clear that D was specified from the perspective of statically typed imperative languages.  It's possible to mess with the rules a bit to make it more or less fit another paradigm, but the result would not be, strictly speaking, D.

This is similar to the way Lisp typically refers to a family of languages with a recognisable parenthesised prefix syntax, rather than a specific member of the family like Common Lisp, Scheme or Clojure.

I wish (and so did John McCarthy, and got his way as far as standardization is concerned).  But in practice Common Lispers often arrogate the name "Lisp" to Common Lisp exclusively.

Quote from dandl on August 29, 2019, 11:25 am

We don't really want to re-implement NumPy or LINPACK or IMSL in SQL code.

No, we don't.  But there's no reason that SQL back ends shouldn't have access to arbitrary scalar libraries, either.  PL/Java for PostgreSQL is a step in this direction, and most engines support some sort of C plugins, though they usually require pretty messy C glue.
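For instance, with one of PostgreSQL's procedural languages installed, the TD layer could in principle register a pure user-defined operator once and then ship selections that mention it untranslated. A purely hypothetical sketch, DDL only, using PL/Python rather than PL/Java simply to keep these sketches in one language; names and types are illustrative.

    # One-time registration, to be executed against the back end (e.g. via psycopg).
    REGISTER_SINH = """
    CREATE FUNCTION sinh(x double precision) RETURNS double precision AS $$
        import math
        return math.sinh(x)
    $$ LANGUAGE plpython3u;
    """

    # Once registered, a selection involving the operator can be pushed whole:
    PUSHED_QUERY = "SELECT r FROM RR WHERE sinh(r) > 10.0"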

Quote from johnwcowan on August 29, 2019, 1:40 pm
Quote from Dave Voorhis on August 29, 2019, 6:24 am

D is the family of languages that adhere to the TTM pre/pro-scriptions. D is neither a specified syntax, nor a specified paradigm. D languages may have many different syntaxes and belong to different paradigms.

Technically yes.  But it's clear that D was specified from the perspective of statically typed imperative languages.  It's possible to mess with the rules a bit to make it more or less fit another paradigm, but the result would not be, strictly speaking, D.

Only if it fails to meet the prescriptions and proscriptions. If it meets the prescriptions and proscriptions, then regardless of paradigm it's a D. That's what a D is.

Quote from johnwcowan on August 29, 2019, 1:40 pm
Quote from Dave Voorhis on August 29, 2019, 6:24 am

This is similar to the way Lisp typically refers to a family of languages with a recognisable parenthesised prefix syntax, rather than a specific member of the family like Common Lisp, Scheme or Clojure.

I wish (and so did John McCarthy, and got his way as far as standardization is concerned).  But in practice Common Lispers often arrogate the name "Lisp" to Common Lisp exclusively.

Only among Common Lispers. The rest of us typically use Lisp to refer to the language family, not Common Lisp.
