The Forum for Discussion about The Third Manifesto and Related Matters

Please or Register to create posts and topics.

Does a database need anything more than a storage-based type system?

Page 1 of 4Next

TTM describes a language with a sophisticated type system and used-defined operators using those types, with the expectation that those types and operators live in the database, supporting database relvars. Is this a good idea?

I have argued that relation should be treated as templated collections rather than types in their own right, which would preclude storing RVAs (and this is consistent with principles of normalisation). Is it a good idea to store typed values in the database that can only be used in conjunction with a particular user-defined type system (also stored in the database)?

Would it not be better to define a general purpose database type system, with enough detail to allow the database to make choices about storage allocation but not presuming or preempting any particular code that might use the data?

SQL already goes close, but has a fair bit of legacy baggage. My preferred list would be: boolean, integer, decimal, real, text, binary, time, enum and UDT.

Time is point in time, effectively non-zoned datetime. You want timezone, use a UDT.

Enum is a type for which every permitted value is enumerated. In SQL terms, it's probably just text strings.

UDT is a type defined as a record structure with one or more named fields, each of one of the other types. In SQL terms, it's just a group of columns.

Who needs more (in the database)?

Andl - A New Database Language - andl.org
Quote from dandl on April 10, 2020, 11:19 am

TTM describes a language with a sophisticated type system and used-defined operators using those types, with the expectation that those types and operators live in the database, supporting database relvars. Is this a good idea?

I have argued that relation should be treated as templated collections rather than types in their own right, which would preclude storing RVAs (and this is consistent with principles of normalisation). Is it a good idea to store typed values in the database that can only be used in conjunction with a particular user-defined type system (also stored in the database)?

Would it not be better to define a general purpose database type system, with enough detail to allow the database to make choices about storage allocation but not presuming or preempting any particular code that might use the data?

SQL already goes close, but has a fair bit of legacy baggage. My preferred list would be: boolean, integer, decimal, real, text, binary, time, enum and UDT.

Time is point in time, effectively non-zoned datetime. You want timezone, use a UDT.

Enum is a type for which every permitted value is enumerated. In SQL terms, it's probably just text strings.

UDT is a type defined as a record structure with one or more named fields, each of one of the other types. In SQL terms, it's just a group of columns.

Who needs more (in the database)?

That seems like an unnecessarily large and rather domain-specific set -- at least in a computer science sense -- and predefined, unextendable sets of types inevitably run into a need for one more type that's really appropriate for this problem but turns up nowhere else (or in this domain but not that one, or often enough to need it here but not there, etc.)

At least that's the case with things I write, but I'm a proponent of "typeful programming" and encouraging use of user-defined types.

In a language like Tutorial D, the primitive types -- BOOLEAN, CHARACTER, RATIONAL, INTEGER -- are a sufficient basis for defining any user-defined type via composition as long as the compositional facilities are sufficiently rich, though I'd be inclined to add some form of Binary Large Object or byte array as a primitive. You can interface with any other language by providing a correspondence with (or mapping from) database language primitives to some set of types in the interfacing language. Most languages natively -- or at least easily -- support equivalents to the Tutorial D primitives.

User-defined types in the database can then be arbitrarily complex. That does mean interacting with other languages requires reduction to primitives, such that given some D-side value v of some given user-defined type V with a component FOO, itself a user-defined type with a component BAR, itself a user-defined type with a component BAZ, itself a user-defined type which has a CHARACTER component called STRING, you may wind up with something like THE_STRING(THE_BAZ(THE_BAR(THE_FOO(v))))1 to reach the "root" primitive, but that actually works quite well. It's fairly typical to build rich data-processing capabilities on the database side, rich UI (or other system interfacing) capabilities on the client-side, and the two are otherwise decoupled. Coupling is entirely via mapped primitives.

In other words, one set of rich data-oriented user-defined types lives in the database-side, a different set of rich UI or interface -oriented user-defined types lives in the client-side, and there is no (or little) overlap between user-defined types on either side. The interface between the client side and the database side is via mapped primitives rather than user-defined types.

When there is some overlap, it's relatively straightforward to reconstruct the relatively few shared types on each side.

--
1 This is perhaps exaggerated, for the sake of illustration. Code I've written has never nested this deep. One or two levels is typical.

I'm the forum administrator and lead developer of Rel. Email me at dave@armchair.mb.ca with the Subject 'TTM Forum'. Download Rel from https://reldb.org

I believe the database language should natively be computationally complete and have a decent way of representing any user-defined non-cyclic data structure without too much indirection.  I also believe that compositionally recursive types should be supported, including nested tuples or relations to arbitrary depth.  I can get into more detail another time.

Quote from Dave Voorhis on April 10, 2020, 4:31 pm
Quote from dandl on April 10, 2020, 11:19 am

TTM describes a language with a sophisticated type system and used-defined operators using those types, with the expectation that those types and operators live in the database, supporting database relvars. Is this a good idea?

I have argued that relation should be treated as templated collections rather than types in their own right, which would preclude storing RVAs (and this is consistent with principles of normalisation). Is it a good idea to store typed values in the database that can only be used in conjunction with a particular user-defined type system (also stored in the database)?

Would it not be better to define a general purpose database type system, with enough detail to allow the database to make choices about storage allocation but not presuming or preempting any particular code that might use the data?

SQL already goes close, but has a fair bit of legacy baggage. My preferred list would be: boolean, integer, decimal, real, text, binary, time, enum and UDT.

Time is point in time, effectively non-zoned datetime. You want timezone, use a UDT.

Enum is a type for which every permitted value is enumerated. In SQL terms, it's probably just text strings.

UDT is a type defined as a record structure with one or more named fields, each of one of the other types. In SQL terms, it's just a group of columns.

Who needs more (in the database)?

That seems like an unnecessarily large and rather domain-specific set -- at least in a computer science sense -- and predefined, unextendable sets of types inevitably run into a need for one more type that's really appropriate for this problem but turns up nowhere else (or in this domain but not that one, or often enough to need it here but not there, etc.)

You lost me. In what sense?

At least that's the case with things I write, but I'm a proponent of "typeful programming" and encouraging use of user-defined types.

In a language like Tutorial D, the primitive types -- BOOLEAN, CHARACTER, RATIONAL, INTEGER -- are a sufficient basis for defining any user-defined type via composition as long as the compositional facilities are sufficiently rich, though I'd be inclined to add some form of Binary Large Object or byte array as a primitive.

Then we are in total agreement, except that I have added TIME and DECIMAL. My personal preference is to store only one numeric type (as per JSON), so having 3 is a nod to SQL and TD, 2 is just the wrong number (pick 1 or 3). My UDTs are simple compositions. Where else do you think we differ?

You can interface with any other language by providing a correspondence with (or mapping from) database language primitives to some set of types in the interfacing language. Most languages natively -- or at least easily -- support equivalents to the Tutorial D primitives.

Isn't that what I said?

User-defined types in the database can then be arbitrarily complex. That does mean interacting with other languages requires reduction to primitives, such that given some D-side value v of some given user-defined type V with a component FOO, itself a user-defined type with a component BAR, itself a user-defined type with a component BAZ, itself a user-defined type which has a CHARACTER component called STRING, you may wind up with something like THE_STRING(THE_BAZ(THE_BAR(THE_FOO(v))))1 to reach the "root" primitive, but that actually works quite well. It's fairly typical to build rich data-processing capabilities on the database side, rich UI (or other system interfacing) capabilities on the client-side, and the two are otherwise decoupled. Coupling is entirely via mapped primitives.

In other words, one set of rich data-oriented user-defined types lives in the database-side, a different set of rich UI or interface -oriented user-defined types lives in the client-side, and there is no (or little) overlap between user-defined types on either side. The interface between the client side and the database side is via mapped primitives rather than user-defined types.

Yes. And your point is? Sounds like furious agreement to me (but I will stick to insisting on TIME as a type).

When there is some overlap, it's relatively straightforward to reconstruct the relatively few shared types on each side.

--
1 This is perhaps exaggerated, for the sake of illustration. Code I've written has never nested this deep. One or two levels is typical.

 

Andl - A New Database Language - andl.org
Quote from dandl on April 11, 2020, 12:30 am
Quote from Dave Voorhis on April 10, 2020, 4:31 pm
Quote from dandl on April 10, 2020, 11:19 am

TTM describes a language with a sophisticated type system and used-defined operators using those types, with the expectation that those types and operators live in the database, supporting database relvars. Is this a good idea?

I have argued that relation should be treated as templated collections rather than types in their own right, which would preclude storing RVAs (and this is consistent with principles of normalisation). Is it a good idea to store typed values in the database that can only be used in conjunction with a particular user-defined type system (also stored in the database)?

Would it not be better to define a general purpose database type system, with enough detail to allow the database to make choices about storage allocation but not presuming or preempting any particular code that might use the data?

SQL already goes close, but has a fair bit of legacy baggage. My preferred list would be: boolean, integer, decimal, real, text, binary, time, enum and UDT.

Time is point in time, effectively non-zoned datetime. You want timezone, use a UDT.

Enum is a type for which every permitted value is enumerated. In SQL terms, it's probably just text strings.

UDT is a type defined as a record structure with one or more named fields, each of one of the other types. In SQL terms, it's just a group of columns.

Who needs more (in the database)?

That seems like an unnecessarily large and rather domain-specific set -- at least in a computer science sense -- and predefined, unextendable sets of types inevitably run into a need for one more type that's really appropriate for this problem but turns up nowhere else (or in this domain but not that one, or often enough to need it here but not there, etc.)

You lost me. In what sense?

At least that's the case with things I write, but I'm a proponent of "typeful programming" and encouraging use of user-defined types.

In a language like Tutorial D, the primitive types -- BOOLEAN, CHARACTER, RATIONAL, INTEGER -- are a sufficient basis for defining any user-defined type via composition as long as the compositional facilities are sufficiently rich, though I'd be inclined to add some form of Binary Large Object or byte array as a primitive.

Then we are in total agreement, except that I have added TIME and DECIMAL. My personal preference is to store only one numeric type (as per JSON), so having 3 is a nod to SQL and TD, 2 is just the wrong number (pick 1 or 3). My UDTs are simple compositions. Where else do you think we differ?

You can interface with any other language by providing a correspondence with (or mapping from) database language primitives to some set of types in the interfacing language. Most languages natively -- or at least easily -- support equivalents to the Tutorial D primitives.

Isn't that what I said?

User-defined types in the database can then be arbitrarily complex. That does mean interacting with other languages requires reduction to primitives, such that given some D-side value v of some given user-defined type V with a component FOO, itself a user-defined type with a component BAR, itself a user-defined type with a component BAZ, itself a user-defined type which has a CHARACTER component called STRING, you may wind up with something like THE_STRING(THE_BAZ(THE_BAR(THE_FOO(v))))1 to reach the "root" primitive, but that actually works quite well. It's fairly typical to build rich data-processing capabilities on the database side, rich UI (or other system interfacing) capabilities on the client-side, and the two are otherwise decoupled. Coupling is entirely via mapped primitives.

In other words, one set of rich data-oriented user-defined types lives in the database-side, a different set of rich UI or interface -oriented user-defined types lives in the client-side, and there is no (or little) overlap between user-defined types on either side. The interface between the client side and the database side is via mapped primitives rather than user-defined types.

Yes. And your point is? Sounds like furious agreement to me (but I will stick to insisting on TIME as a type).

When there is some overlap, it's relatively straightforward to reconstruct the relatively few shared types on each side.

--
1 This is perhaps exaggerated, for the sake of illustration. Code I've written has never nested this deep. One or two levels is typical.

Short answer (because it's late) is that user-defined types are absolutely mandatory. Only having simple records is entirely insufficient. Ideally, and at a minimum, I expect to see a rich type system. TTM's is fine (debate over that is elsewhere; let's not repeat it here) as is any system based on algebraic sum-of-product types. Rich polymorphism is a necessity; e.g., multiple dispatch, generics, etc.

If user-defined types are supported, the only built-in types needed for a reasonably performant language are (per Tutorial D) BOOLEAN, CHARACTER, RATIONAL or (preferably) FLOAT to map to standard hardware IEEE 754, INTEGER (map to canonical long or equivalent), and (ideally) a byte array though CHARACTER will suffice via encoding (if necessary.)

With user-defined type support, additional types -- TIMESTAMP, DECIMAL, DATE, whatever, etc. can be provided where appropriate via a non-builtin, non-primitive standard library or equivalent.

I'm the forum administrator and lead developer of Rel. Email me at dave@armchair.mb.ca with the Subject 'TTM Forum'. Download Rel from https://reldb.org
Quote from Darren Duncan on April 10, 2020, 8:00 pm

I believe the database language should natively be computationally complete and have a decent way of representing any user-defined non-cyclic data structure without too much indirection.  I also believe that compositionally recursive types should be supported, including nested tuples or relations to arbitrary depth.  I can get into more detail another time.

I'm not expressing an opinion on the computational capability of any DML/DDL, just how data gets stored in the database. I have proposed a set of simple scalar types plus a mechanism for composition. In my view the database should not contain or rely on user-supplied code, just data.

I did not specify (a) whether new types should be named or (b) whether composition types should allow self-reference. You can't have (b) without both (a) and some kind of NULL.

For the present I would say:

  • The list of types is: BOOL,INT,DEC,REAL,TIME,TEXT,BINARY,ENUM,STRUCT (names may vary; numeric types under review)
  • INT (range), ENUM (values) and STRUCT (components) are defined at the point of declaration and are unnamed
  • INT, DEC and REAL values are ordered and comparable to each other
  • TEXT values are ordered, BINARY are not
  • ENUM values are ordered and comparable to each other and to text, as if they were text.
  • STRUCT values with the same components are comparable and ordered, by comparing their components in the order of definition.
Andl - A New Database Language - andl.org
Quote from dandl on April 11, 2020, 12:30 am

My personal preference is to store only one numeric type (as per JSON), so having 3 is a nod to SQL and TD, 2 is just the wrong number (pick 1 or 3).

I find that for core numeric types, 2 is exactly the right number, where I have Integer and Fraction as those 2.  Both of these are exact and unlimited precision and scale.

A Fraction will represent any rational number, limited only by memory.  Each Fraction can be stored however the implementation wishes behind the scenes, optionally as a machine float if the value in question is representable by one, or likewise with other compact formats for common values, and otherwise typically as a pair of Integer for the general case.

Now in theory I could simply have Fraction, with Integer being a proper subtype rather than a disjoint type, but I like the idea of Integer being fundamental, given the arguably much greater number of use cases for integers than for general rationals.

And I feel conceptually defining Fraction in terms of Integer is better than the other way.

Because, if you start with defining the more specific type in terms of the more general type, then that goes down the rabbit hole of, what about the various mathematical types that are supersets of rationals, the most common example in computing being complex numbers.  Why not then make Complex your single type and make rationals and integers a specialization of that?  Where does it end?

Or this may not be a good argument, but its one that weighs on me.

Quote from Darren Duncan on April 11, 2020, 2:42 am
Quote from dandl on April 11, 2020, 12:30 am

My personal preference is to store only one numeric type (as per JSON), so having 3 is a nod to SQL and TD, 2 is just the wrong number (pick 1 or 3).

I find that for core numeric types, 2 is exactly the right number, where I have Integer and Fraction as those 2.  Both of these are exact and unlimited precision and scale.

A Fraction will represent any rational number, limited only by memory.  Each Fraction can be stored however the implementation wishes behind the scenes, optionally as a machine float if the value in question is representable by one, or likewise with other compact formats for common values, and otherwise typically as a pair of Integer for the general case.

Now in theory I could simply have Fraction, with Integer being a proper subtype rather than a disjoint type, but I like the idea of Integer being fundamental, given the arguably much greater number of use cases for integers than for general rationals.

And I feel conceptually defining Fraction in terms of Integer is better than the other way.

Because, if you start with defining the more specific type in terms of the more general type, then that goes down the rabbit hole of, what about the various mathematical types that are supersets of rationals, the most common example in computing being complex numbers.  Why not then make Complex your single type and make rationals and integers a specialization of that?  Where does it end?

Or this may not be a good argument, but its one that weighs on me.

I don't recall ever needing fractions with denominators other than 1, 2 and 10, and INT, DEC and REAL have those covered. Transcendentals are always a lost cause, unless you're proposing a denominator of Pi, and that only helps with some.

I treat complex the same as point, rectangle, colour etc, as composite types with named components. A composite of a number and an enum is perfect for quantities with units. Time plus enum gives you timezone or non-Gregorian calendars. It's simple, easy to understand and covers most of the main use cases.

I should have said before: no collections. No arrays, no trees, no lists. The whole point is that this is relational and relations are the only collection (basic 3NF). The query language has recursion built in, so it can do TC and GTC on parts explosions.

 

Andl - A New Database Language - andl.org
Quote from dandl on April 11, 2020, 7:24 am
Quote from Darren Duncan on April 11, 2020, 2:42 am
Quote from dandl on April 11, 2020, 12:30 am

My personal preference is to store only one numeric type (as per JSON), so having 3 is a nod to SQL and TD, 2 is just the wrong number (pick 1 or 3).

I find that for core numeric types, 2 is exactly the right number, where I have Integer and Fraction as those 2.  Both of these are exact and unlimited precision and scale.

A Fraction will represent any rational number, limited only by memory.  Each Fraction can be stored however the implementation wishes behind the scenes, optionally as a machine float if the value in question is representable by one, or likewise with other compact formats for common values, and otherwise typically as a pair of Integer for the general case.

I don't recall ever needing fractions with denominators other than 1, 2 and 10, and INT, DEC and REAL have those covered.

The problem with that is math would behave counter-intuitively without other denominators.

Take the simple expression "((3.0 * (1.0/3.0)) = 1.0".

With general Fraction support, that expression would result in TRUE, but with only denominator powers of 1/2/10 that expression would result in FALSE.

Its one thing to expect FALSE if you explicitly declare that you are rounding the result of each step in the expression to a denominator power of 1/2/10 but if you aren't then one would reasonably expect a TRUE result.

I tell you, some common programming languages today DO result in TRUE for that expression, and I would expect this is a feature that any well designed modern type system should have.

 

Quote from Dave Voorhis on April 11, 2020, 1:03 am
Quote from dandl on April 11, 2020, 12:30 am
Quote from Dave Voorhis on April 10, 2020, 4:31 pm
Quote from dandl on April 10, 2020, 11:19 am

TTM describes a language with a sophisticated type system and used-defined operators using those types, with the expectation that those types and operators live in the database, supporting database relvars. Is this a good idea?

I have argued that relation should be treated as templated collections rather than types in their own right, which would preclude storing RVAs (and this is consistent with principles of normalisation). Is it a good idea to store typed values in the database that can only be used in conjunction with a particular user-defined type system (also stored in the database)?

Would it not be better to define a general purpose database type system, with enough detail to allow the database to make choices about storage allocation but not presuming or preempting any particular code that might use the data?

SQL already goes close, but has a fair bit of legacy baggage. My preferred list would be: boolean, integer, decimal, real, text, binary, time, enum and UDT.

Time is point in time, effectively non-zoned datetime. You want timezone, use a UDT.

Enum is a type for which every permitted value is enumerated. In SQL terms, it's probably just text strings.

UDT is a type defined as a record structure with one or more named fields, each of one of the other types. In SQL terms, it's just a group of columns.

Who needs more (in the database)?

That seems like an unnecessarily large and rather domain-specific set -- at least in a computer science sense -- and predefined, unextendable sets of types inevitably run into a need for one more type that's really appropriate for this problem but turns up nowhere else (or in this domain but not that one, or often enough to need it here but not there, etc.)

You lost me. In what sense?

At least that's the case with things I write, but I'm a proponent of "typeful programming" and encouraging use of user-defined types.

In a language like Tutorial D, the primitive types -- BOOLEAN, CHARACTER, RATIONAL, INTEGER -- are a sufficient basis for defining any user-defined type via composition as long as the compositional facilities are sufficiently rich, though I'd be inclined to add some form of Binary Large Object or byte array as a primitive.

Then we are in total agreement, except that I have added TIME and DECIMAL. My personal preference is to store only one numeric type (as per JSON), so having 3 is a nod to SQL and TD, 2 is just the wrong number (pick 1 or 3). My UDTs are simple compositions. Where else do you think we differ?

You can interface with any other language by providing a correspondence with (or mapping from) database language primitives to some set of types in the interfacing language. Most languages natively -- or at least easily -- support equivalents to the Tutorial D primitives.

Isn't that what I said?

User-defined types in the database can then be arbitrarily complex. That does mean interacting with other languages requires reduction to primitives, such that given some D-side value v of some given user-defined type V with a component FOO, itself a user-defined type with a component BAR, itself a user-defined type with a component BAZ, itself a user-defined type which has a CHARACTER component called STRING, you may wind up with something like THE_STRING(THE_BAZ(THE_BAR(THE_FOO(v))))1 to reach the "root" primitive, but that actually works quite well. It's fairly typical to build rich data-processing capabilities on the database side, rich UI (or other system interfacing) capabilities on the client-side, and the two are otherwise decoupled. Coupling is entirely via mapped primitives.

In other words, one set of rich data-oriented user-defined types lives in the database-side, a different set of rich UI or interface -oriented user-defined types lives in the client-side, and there is no (or little) overlap between user-defined types on either side. The interface between the client side and the database side is via mapped primitives rather than user-defined types.

Yes. And your point is? Sounds like furious agreement to me (but I will stick to insisting on TIME as a type).

When there is some overlap, it's relatively straightforward to reconstruct the relatively few shared types on each side.

--
1 This is perhaps exaggerated, for the sake of illustration. Code I've written has never nested this deep. One or two levels is typical.

Short answer (because it's late) is that user-defined types are absolutely mandatory. Only having simple records is entirely insufficient. Ideally, and at a minimum, I expect to see a rich type system. TTM's is fine (debate over that is elsewhere; let's not repeat it here) as is any system based on algebraic sum-of-product types. Rich polymorphism is a necessity; e.g., multiple dispatch, generics, etc.

Now I'm confused. There are plenty of languages out there with user-defined types, etc, etc supported by user-written code. I'm proposing a specific set of 9 types for the database, including user-defined ENUM and STRUCT, but no code. No selectors, no operators, and in fact no type declarations and no collections. That leaves the job of application type system and the database interface layer as a job for code, but that code is written in a GP language of your choice. The database simply promises that you can access the data natively in your chosen language.

If user-defined types are supported, the only built-in types needed for a reasonably performant language are (per Tutorial D) BOOLEAN, CHARACTER, RATIONAL or (preferably) FLOAT to map to standard hardware IEEE 754, INTEGER (map to canonical long or equivalent), and (ideally) a byte array though CHARACTER will suffice via encoding (if necessary.)

With user-defined type support, additional types -- TIMESTAMP, DECIMAL, DATE, can be provided where appropriate via a non-builtin, non-primitive standard library or equivalent.

I disagree about your reasons: I see no reason to omit any of the ones I listed, they are all widely available, widely used and trivial to implement. The question is not whether you would leave any out, the question is: do you agree that they are sufficient?

Andl - A New Database Language - andl.org
Page 1 of 4Next