SQL is a standard?

#1 · February 16, 2019, 12:51 am

Next time somebody alleges there's an SQL Standard, you could refer them to this tale of woe. (That is, the difficulties they faced, not the specifics of how they tackled the challenge.)

In particular, the mess with different vendors' character encodings and handling of the varieties of (VAR)CHARs. It would be helpful if SQL specified a type system.

I guess we all know that T-SQL or PL-SQL is no sort of standard. Package vendors I've worked for, who want their package to be OS-agnostic, know to avoid any sort of procedural features. Also to limit themselves to a subset of standard-ish SQL. And still there's a bunch of ugly OS- and SQL-vendor-aware code that optimises each SQL call.

In particular one package I worked on was aimed at the international market, to be potentially translated into Asian languages. We had enormous difficulties with the CHAR Length function and substringing/concatenation because of DBCS. (Much of the screen labels would be in international English mixed in with Chinese characters.)

#2 · February 16, 2019, 12:39 pm

Quote from AntC on February 16, 2019, 12:51 am

...

In particular, the mess with different vendors' character encodings and handling of the varieties of (VAR)CHARs. It would be helpful if SQL specified a type system.

...

Are you saying that the international standard for SQL's treatment of types doesn't, in your opinion, constitute a (type) system, or isn't very systematic? I might agree but it certainly has a lot to say about types, especially user-defined ones.

Hugh

Coauthor of The Third Manifesto and related books.

#3 · February 16, 2019, 11:15 pm

A good part of that tale of woe has nothing to do with SQL, it's about conniptions of the Unicode standard. MS got into Unicode really early (what they did is now called UCS-2), but everyone now wants UTF-8 (without actually knowing what that means). And in the mix are thousands of users who don't want to change their legacy code just because there's a new standard to fit in with some weird language no-one's ever heard of. And Postgres doesn't do a really good job in their implementation of the standard.

In fact that's the real story here. There are very robust standards for both SQL and Unicode, but no guarantees of a faithful implementation by all relevant vendors. There is a perfectly good standard for SQL/PSM, but instead we have T-SQL and (a rather weak effort) called pgSQL. Until customers demand vendor compliance with published standards this is going to persist, to the detriment of all.

Andl - A New Database Language - andl.org

#4 · February 18, 2019, 8:34 am

Quote from AntC on February 18, 2019, 8:34 am

Quote from Hugh on February 16, 2019, 12:39 pm

Quote from AntC on February 16, 2019, 12:51 am

...

In particular, the mess with different vendors' character encodings and handling of the varieties of (VAR)CHARs. It would be helpful if SQL specified a type system.

...

Are you saying that the international standard for SQL's treatment of types doesn't, in your opinion, constitute a (type) system, or isn't very systematic? I might agree but it certainly has a lot to say about types, especially user-defined ones.

Thanks Hugh, no I didn't mean user-defined types. (The article I linked to doesn't mention those.) Despite the SQL standard's "lots to say", I've never known any installation to declare user-defined types. Any custom types are declared in the application layer/programming language, with a mapping/ORM to the database, in which they're held as INT or CHAR.

I meant the non-systematic system-defined types and the system-supplied operations on them. I'll borrow some terms from TTM Pre's.

Contra David B's comments, I don't see much to do with UNICODE/UTF-8/etc.

One example is DATE ranges: a type is supposed to be a set of values (RM Pre 1). Then how can one vendor's set of values start from year 0001 whereas another's start from 1753?

Re (VAR)CHAR: two values are supposed to compare equal iff they're "the very same value" (RM Pre 8). And if they're the same value, then any operation on them must produce the same result/must not expose any distinguishable difference. So some equality tests ignore trailing whitespace, and some don't. Some length functions count trailing whitespace, some don't. Some functions had distinguishable effects depending on the value's type (VARCHAR vs TEXT) -- which I suppose is within the rules, but unhelpful; some functions did expose a distinguishable difference (SQL-Server LEN()).

If it depends on the type, I'd suspect I could convert two allegedly equal TEXTs to VARCHAR (or v.v.) and expose a distinguishable difference.

I know from many data conversion exercises over cronky old legacy systems, that most organisations' CHAR fields are riddled with trailing whitespace (including stray 'invisible' characters copy'n'pasted from other applications into data entry screens). These get exposed if the application concatenates some field into an error message, for example. Then the SQL standard's specification of what it is for two CHAR values to be equal should also say that there must be no way to distinguish values that are alleged to be equal. Straightforward Leibniz equality.

Quote from Hugh on February 16, 2019, 12:39 pm

Quote from AntC on February 16, 2019, 12:51 am

...

In particular, the mess with different vendors' character encodings and handling of the varieties of (VAR)CHARs. It would be helpful if SQL specified a type system.

...

Are you saying that the international standard for SQL's treatment of types doesn't, in your opinion, constitute a (type) system, or isn't very systematic? I might agree but it certainly has a lot to say about types, especially user-defined ones.

Thanks Hugh, no I didn't mean user-defined types. (The article I linked to doesn't mention those.) Despite the SQL standard's "lots to say", I've never known any installation to declare user-defined types. Any custom types are declared in the application layer/programming language, with a mapping/ORM to the database, in which they're held as INT or CHAR.

I meant the non-systematic system-defined types and the system-supplied operations on them. I'll borrow some terms from TTM Pre's.

Contra David B's comments, I don't see much to do with UNICODE/UTF-8/etc.

One example is DATE ranges: a type is supposed to be a set of values (RM Pre 1). Then how can one vendor's set of values start from year 0001 whereas another's start from 1753?

Re (VAR)CHAR: two values are supposed to compare equal iff they're "the very same value" (RM Pre 8). And if they're the same value, then any operation on them must produce the same result/must not expose any distinguishable difference. So some equality tests ignore trailing whitespace, and some don't. Some length functions count trailing whitespace, some don't. Some functions had distinguishable effects depending on the value's type (VARCHAR vs TEXT) -- which I suppose is within the rules, but unhelpful; some functions did expose a distinguishable difference (SQL-Server LEN()).

If it depends on the type, I'd suspect I could convert two allegedly equal TEXTs to VARCHAR (or v.v.) and expose a distinguishable difference.

I know from many data conversion exercises over cronky old legacy systems, that most organisations' CHAR fields are riddled with trailing whitespace (including stray 'invisible' characters copy'n'pasted from other applications into data entry screens). These get exposed if the application concatenates some field into an error message, for example. Then the SQL standard's specification of what it is for two CHAR values to be equal should also say that there must be no way to distinguish values that are alleged to be equal. Straightforward Leibniz equality.

#5 · February 19, 2019, 1:05 pm

Quote from Hugh on February 19, 2019, 1:05 pm

Quote from AntC on February 18, 2019, 8:34 am

Quote from Hugh on February 16, 2019, 12:39 pm

Quote from AntC on February 16, 2019, 12:51 am

...

In particular, the mess with different vendors' character encodings and handling of the varieties of (VAR)CHARs. It would be helpful if SQL specified a type system.

...

Are you saying that the international standard for SQL's treatment of types doesn't, in your opinion, constitute a (type) system, or isn't very systematic? I might agree but it certainly has a lot to say about types, especially user-defined ones.

Thanks Hugh, no I didn't mean user-defined types. (The article I linked to doesn't mention those.) Despite the SQL standard's "lots to say", I've never known any installation to declare user-defined types. Any custom types are declared in the application layer/programming language, with a mapping/ORM to the database, in which they're held as INT or CHAR.

I meant the non-systematic system-defined types and the system-supplied operations on them. I'll borrow some terms from TTM Pre's.

Contra David B's comments, I don't see much to do with UNICODE/UTF-8/etc.

One example is DATE ranges: a type is supposed to be a set of values (RM Pre 1). Then how can one vendor's set of values start from year 0001 whereas another's start from 1753?

Re (VAR)CHAR: two values are supposed to compare equal iff they're "the very same value" (RM Pre 8). And if they're the same value, then any operation on them must produce the same result/must not expose any distinguishable difference. So some equality tests ignore trailing whitespace, and some don't. Some length functions count trailing whitespace, some don't. Some functions had distinguishable effects depending on the value's type (VARCHAR vs TEXT) -- which I suppose is within the rules, but unhelpful; some functions did expose a distinguishable difference (SQL-Server LEN()).

If it depends on the type, I'd suspect I could convert two allegedly equal TEXTs to VARCHAR (or v.v.) and expose a distinguishable difference.

I know from many data conversion exercises over cronky old legacy systems, that most organisations' CHAR fields are riddled with trailing whitespace (including stray 'invisible' characters copy'n'pasted from other applications into data entry screens). These get exposed if the application concatenates some field into an error message, for example. Then the SQL standard's specification of what it is for two CHAR values to be equal should also say that there must be no way to distinguish values that are alleged to be equal. Straightforward Leibniz equality.

Thanks, AntC. I was surprised but pleased when I heard that there had been so little take-up of SQL UDTs after SQL:1998 was published. I had been convinced by my IBM colleagues at the database labs (note, labs, plural, different ones for the different DB2s!) that customers were screaming out for it. I suspect it turned out that "customers" meant one or two particularly important ones, such as Boeing and SAS. I hated having to participate in the development of that awful material for the standard (for which I and three IBM colleagues shared a substantial award!).

As it happens, personally, I've not yet found a need for Tutorial D UDTs in my various Rel databases. I did want to define subtypes of INTEGER but didn't like seeing, e.g., POSINT(3) instead of just 3 in tabular displays of results, especially when I'm saving them as HTML and posting them to a web site.

Hugh

Quote from AntC on February 18, 2019, 8:34 am

Quote from Hugh on February 16, 2019, 12:39 pm

Quote from AntC on February 16, 2019, 12:51 am

...

In particular, the mess with different vendors' character encodings and handling of the varieties of (VAR)CHARs. It would be helpful if SQL specified a type system.

...

Are you saying that the international standard for SQL's treatment of types doesn't, in your opinion, constitute a (type) system, or isn't very systematic? I might agree but it certainly has a lot to say about types, especially user-defined ones.

Thanks Hugh, no I didn't mean user-defined types. (The article I linked to doesn't mention those.) Despite the SQL standard's "lots to say", I've never known any installation to declare user-defined types. Any custom types are declared in the application layer/programming language, with a mapping/ORM to the database, in which they're held as INT or CHAR.

I meant the non-systematic system-defined types and the system-supplied operations on them. I'll borrow some terms from TTM Pre's.

Contra David B's comments, I don't see much to do with UNICODE/UTF-8/etc.

One example is DATE ranges: a type is supposed to be a set of values (RM Pre 1). Then how can one vendor's set of values start from year 0001 whereas another's start from 1753?

Re (VAR)CHAR: two values are supposed to compare equal iff they're "the very same value" (RM Pre 8). And if they're the same value, then any operation on them must produce the same result/must not expose any distinguishable difference. So some equality tests ignore trailing whitespace, and some don't. Some length functions count trailing whitespace, some don't. Some functions had distinguishable effects depending on the value's type (VARCHAR vs TEXT) -- which I suppose is within the rules, but unhelpful; some functions did expose a distinguishable difference (SQL-Server LEN()).

If it depends on the type, I'd suspect I could convert two allegedly equal TEXTs to VARCHAR (or v.v.) and expose a distinguishable difference.

I know from many data conversion exercises over cronky old legacy systems, that most organisations' CHAR fields are riddled with trailing whitespace (including stray 'invisible' characters copy'n'pasted from other applications into data entry screens). These get exposed if the application concatenates some field into an error message, for example. Then the SQL standard's specification of what it is for two CHAR values to be equal should also say that there must be no way to distinguish values that are alleged to be equal. Straightforward Leibniz equality.

Thanks, AntC. I was surprised but pleased when I heard that there had been so little take-up of SQL UDTs after SQL:1998 was published. I had been convinced by my IBM colleagues at the database labs (note, labs, plural, different ones for the different DB2s!) that customers were screaming out for it. I suspect it turned out that "customers" meant one or two particularly important ones, such as Boeing and SAS. I hated having to participate in the development of that awful material for the standard (for which I and three IBM colleagues shared a substantial award!).

As it happens, personally, I've not yet found a need for Tutorial D UDTs in my various Rel databases. I did want to define subtypes of INTEGER but didn't like seeing, e.g., POSINT(3) instead of just 3 in tabular displays of results, especially when I'm saving them as HTML and posting them to a web site.

Hugh

Coauthor of The Third Manifesto and related books.

#6 · February 20, 2019, 11:21 am

As it happens, personally, I've not yet found a need for Tutorial D UDTs in my various Rel databases. I did want to define subtypes of INTEGER but didn't like seeing, e.g., POSINT(3) instead of just 3 in tabular displays of results, especially when I'm saving them as HTML and posting them to a web site.

Hugh

I struggled with that a long time too. The solution I eventually came up with was to have the type implementations offer not only, say, a renderAsString() method but also a renderAsStringVerbose(). The latter producing the POSINT(3) variant and the former just 3. The DBrowser always uses the shorter one and an option toggles the explicit display of the typename in the heading of the table (so you can have "COUNT" or "COUNT:POSINT"). Never pursued it any further so offering choice of possrep as well is still in the queue. That one's more difficult to apply across attributes of distinct types so I'd need a user-level config saying "when displaying this type, use that possrep". But it still won't play well with the possrep system used for units of measurement. You'd get WEIGHT(KG(1.0)) / WEIGHT(PNDS(0.454)) or else just 1.0 / 0.454 but still no way to leave out WEIGHT but retain KG/PNDS. Two distinct levels of "verbosity". Addressing all that is too laborious for the futile relevance it has to getting the data engine working properly.

Author of SIRA_PRISE

#7 · February 21, 2019, 11:55 am

Quote from Hugh on February 21, 2019, 11:55 am

Quote from Erwin on February 20, 2019, 11:21 am

As it happens, personally, I've not yet found a need for Tutorial D UDTs in my various Rel databases. I did want to define subtypes of INTEGER but didn't like seeing, e.g., POSINT(3) instead of just 3 in tabular displays of results, especially when I'm saving them as HTML and posting them to a web site.

Hugh

I struggled with that a long time too. The solution I eventually came up with was to have the type implementations offer not only, say, a renderAsString() method but also a renderAsStringVerbose(). The latter producing the POSINT(3) variant and the former just 3. The DBrowser always uses the shorter one and an option toggles the explicit display of the typename in the heading of the table (so you can have "COUNT" or "COUNT:POSINT"). Never pursued it any further so offering choice of possrep as well is still in the queue. That one's more difficult to apply across attributes of distinct types so I'd need a user-level config saying "when displaying this type, use that possrep". But it still won't play well with the possrep system used for units of measurement. You'd get WEIGHT(KG(1.0)) / WEIGHT(PNDS(0.454)) or else just 1.0 / 0.454 but still no way to leave out WEIGHT but retain KG/PNDS. Two distinct levels of "verbosity". Addressing all that is too laborious for the futile relevance it has to getting the data engine working properly.

Perhaps the best solution lies in something like BS12's FORMAT clause for specifying the format in which attribute values are to be presented to the application requesting the result of a query. In you gave a format specification for each attribute, defaulting to the system-defined one for the relevant type. For a UDT one could specify the default format in the type definition. Something like.

TYPE POSINT ORDINAL IS{INTEGER CONSTRAINT INTEGER > 0} FORMAT AS_FOR(INTEGER)

To override that default:

OUTPUT r ORDER(...) FORMAT(a1 AS_FOR(POSINT), a2 AS_FOR(EVEN_INT));

Could be a bit less messy than using EXTEND with all sorts of UDFs or extra built-in conversion functions.

Hugh

Quote from Erwin on February 20, 2019, 11:21 am

As it happens, personally, I've not yet found a need for Tutorial D UDTs in my various Rel databases. I did want to define subtypes of INTEGER but didn't like seeing, e.g., POSINT(3) instead of just 3 in tabular displays of results, especially when I'm saving them as HTML and posting them to a web site.

Hugh

I struggled with that a long time too. The solution I eventually came up with was to have the type implementations offer not only, say, a renderAsString() method but also a renderAsStringVerbose(). The latter producing the POSINT(3) variant and the former just 3. The DBrowser always uses the shorter one and an option toggles the explicit display of the typename in the heading of the table (so you can have "COUNT" or "COUNT:POSINT"). Never pursued it any further so offering choice of possrep as well is still in the queue. That one's more difficult to apply across attributes of distinct types so I'd need a user-level config saying "when displaying this type, use that possrep". But it still won't play well with the possrep system used for units of measurement. You'd get WEIGHT(KG(1.0)) / WEIGHT(PNDS(0.454)) or else just 1.0 / 0.454 but still no way to leave out WEIGHT but retain KG/PNDS. Two distinct levels of "verbosity". Addressing all that is too laborious for the futile relevance it has to getting the data engine working properly.

Perhaps the best solution lies in something like BS12's FORMAT clause for specifying the format in which attribute values are to be presented to the application requesting the result of a query. In you gave a format specification for each attribute, defaulting to the system-defined one for the relevant type. For a UDT one could specify the default format in the type definition. Something like.

TYPE POSINT ORDINAL IS{INTEGER CONSTRAINT INTEGER > 0} FORMAT AS_FOR(INTEGER)

To override that default:

OUTPUT r ORDER(...) FORMAT(a1 AS_FOR(POSINT), a2 AS_FOR(EVEN_INT));

Could be a bit less messy than using EXTEND with all sorts of UDFs or extra built-in conversion functions.

Hugh

Coauthor of The Third Manifesto and related books.

#8 · February 21, 2019, 9:41 pm

Quote from Erwin on February 21, 2019, 9:41 pm

Quote from AntC on February 18, 2019, 8:34 am

One example is DATE ranges: a type is supposed to be a set of values (RM Pre 1). Then how can one vendor's set of values start from year 0001 whereas another's start from 1753?

Re (VAR)CHAR: two values are supposed to compare equal iff they're "the very same value" (RM Pre 8). And if they're the same value, then any operation on them must produce the same result/must not expose any distinguishable difference. So some equality tests ignore trailing whitespace, and some don't. Some length functions count trailing whitespace, some don't. Some functions had distinguishable effects depending on the value's type (VARCHAR vs TEXT) -- which I suppose is within the rules, but unhelpful; some functions did expose a distinguishable difference (SQL-Server LEN()).

Regarding the DATE ranges : it appears to me that TTM as such does not prohibit such scenario's from occurring in a D installation either, simply because it does not define the scope within which a type name has to be unique. Any user in any installation (possibly even within the same organization) can define some type and name if FOO or BAR or TIMESTAMP and define it in such a way that the respective value sets are not "identical" (in a DBE kind of way, by which I refer to the chpt in DBE where there is talk of the number 1 being the same "individual" but its inclusion in, say, the types INT and LONGINT do not count as being the same/identical value DESPITE being "the same" individual). In fact, the idea of user-defined types and how they need to be assigned a name and how for obvious reasons only the single DBMS instance(s) to which the definition(s) is(are) visible can be aware of those definitions, implies of necessity that there can potentially be distinct DBMS instances that know some type named identically but whose value sets are unequal.

As for the SQL CHAR types mess : it appears to me to be important to recall Philip's old but very to-the-point remark that TTM's equality and SQL '=' are NOT THE SAME THING. Much of your commentary seems to be based on the notion that they are [the same thing], which they aren't.

Quote from AntC on February 18, 2019, 8:34 am

One example is DATE ranges: a type is supposed to be a set of values (RM Pre 1). Then how can one vendor's set of values start from year 0001 whereas another's start from 1753?

Re (VAR)CHAR: two values are supposed to compare equal iff they're "the very same value" (RM Pre 8). And if they're the same value, then any operation on them must produce the same result/must not expose any distinguishable difference. So some equality tests ignore trailing whitespace, and some don't. Some length functions count trailing whitespace, some don't. Some functions had distinguishable effects depending on the value's type (VARCHAR vs TEXT) -- which I suppose is within the rules, but unhelpful; some functions did expose a distinguishable difference (SQL-Server LEN()).

Regarding the DATE ranges : it appears to me that TTM as such does not prohibit such scenario's from occurring in a D installation either, simply because it does not define the scope within which a type name has to be unique. Any user in any installation (possibly even within the same organization) can define some type and name if FOO or BAR or TIMESTAMP and define it in such a way that the respective value sets are not "identical" (in a DBE kind of way, by which I refer to the chpt in DBE where there is talk of the number 1 being the same "individual" but its inclusion in, say, the types INT and LONGINT do not count as being the same/identical value DESPITE being "the same" individual). In fact, the idea of user-defined types and how they need to be assigned a name and how for obvious reasons only the single DBMS instance(s) to which the definition(s) is(are) visible can be aware of those definitions, implies of necessity that there can potentially be distinct DBMS instances that know some type named identically but whose value sets are unequal.

As for the SQL CHAR types mess : it appears to me to be important to recall Philip's old but very to-the-point remark that TTM's equality and SQL '=' are NOT THE SAME THING. Much of your commentary seems to be based on the notion that they are [the same thing], which they aren't.

Author of SIRA_PRISE

#9 · February 21, 2019, 9:44 pm

Quote from Erwin on February 21, 2019, 9:41 pm

TTM's equality and SQL '=' are NOT THE SAME THING.

Which BTW is one of the VERY REASONS TTM ever got written in the first place.

Author of SIRA_PRISE

#10 · February 22, 2019, 7:54 am

Quote from Erwin on February 21, 2019, 9:44 pm

Quote from Erwin on February 21, 2019, 9:41 pm

TTM's equality and SQL '=' are NOT THE SAME THING.

Which BTW is one of the VERY REASONS TTM ever got written in the first place.

I think we should say: SQL's = is not the same thing as any other language's = (or rather == in many cases). I don't think TTM's Prescription for equality is particularly unusual. (We need to go into a diversion about pointer equality vs value equality, I suppose; but that's really a discussion about when pointer equality returns False even though value equality returns True.)

So clearly the tale of woe was from the point of view of 'any other language' and the sufferers of woe's expectations. (Specifically the sufferers had a Haskell mindset, so they would expect value equality.)

Or have I got memory lapse? Was it common at the time SQL started out for equality tests to be (in)sensitive to trailing whitespace? Is it that we've learnt better since? IIRC languages I was programming at the time, if you wanted to ignore trailing whitespace, then you had to explicitly TRIM( ) it off before comparing. (Typically you'd do that immediately upon receiving a field from the screen, so it never got into the application or the database. Or you had fixed-width fields and you had to PAD( ) with spaces. Loading a short field from screen into fixed-width WORKING STORAGE that you hadn't space-filled was a fruitful source of bugs. Oh, and you had to deal with leading whitespace.)

Perhaps every data migration project should recruit a team member with white hair, who can remember what COBOL stands for.

I think the reasons for woe stand: however arcane are the rules in SQL, there's no excuse for different vendors producing different answers for something as basic as equality testing.

TTM Forum

The Forum for Discussion about The Third Manifesto and Related Matters

SQL is a standard?