The Forum for Discussion about The Third Manifesto and Related Matters

Please or Register to create posts and topics.

Does RM Pro 4 really preclude NULLs?

Quote from dandl on June 5, 2020, 6:35 am
Quote from Dave Voorhis on June 4, 2020, 12:45 pm
Quote from dandl on June 4, 2020, 11:28 am
Quote from Dave Voorhis on June 4, 2020, 7:45 am
Quote from dandl on June 4, 2020, 7:10 am

Consider the following.

  • Every type shall be assumed to include a value called NULL, unless declared or constrained otherwise.
  • Every instance of NULL is a typed value, which can be created by a selector with some literal argument. For example INTEGER(NULL).
  • An instance of NULL compares equal to NULL of the same  type and not equal to every other value of that type. NULLs of different types may not be compared.
  • For ordered types, NULL compares less than any other value of that type.
  • An outer join returns a value in which some tuples may have attributes containing a NULL value, providing this is not prohibited for the type.
  • Aggregation by default ignores NULL values in the aggregated attribute (other options are possible).

As far as I can tell this complies with RM Pro 4, and alleviates a number of problems in dealing with SQL-like queries.

I bet Date & Darwen would object -- along with most of the rest of us -- to any attempt to lawyer your way into defining a general SQL-like NULL.

Do you read it that way? I think most people find a general need to deal with missing values in their data, and most people find the SQL outer join a convenience. I'm exploring whether the benefits of NULL can be achieved without 3-level logic or SQL syntactic oddities; but also without having to build and implement a custom type just to do such a simple and generic thing.

I'm looking for feedback on issues that I might have missed. If it's just "D&D wouldn't like it", that doesn't help a whole lot.

Not only D & D wouldn't like it...

We have gone over this in some detail before, to the point that it's one of the various annual discussions, along with update-through-views, etc.

My preference is for an option type defined via the IM's UNION facility, others prefer Zero-Or-One ("ZOO") RVAs, others -- no doubt -- prefer other things. I'd expect a useful D to ideally support at least allowing option types and ZOOs.

But I'm sure we've discussed various TTM-compliant ways of handling missing values before, including TTM-complaint ways of handling (for example) import of SQL query results into a D using various kinds of union types to define option types, both with or without the IM.

I think it was considerations like this that persuaded Dataphor it couldn't do without some kind of NULL, but I don't know what they actually finished up doing. But it certainly should make life easier, and that can't be all bad.

NULL in its typical SQL and classic Java/C/C++ programming language guises is all bad. The literature -- and various programming language implementations -- are flush with suggestions for mitigating these.

Agreed, so we don't do that. But languages with typed NULL treated mostly as an ordinary value? Do you know one?

Almost everything that uses IEEE754 floats expose NaN.

Haskell and friends generalise it as option aka 'Maybe' types, a union of the value type and nothing.

Non-functional languages handle it to a greater or lesser degree. Some like Kotlin handle it fairly well and syntactically integrate it. Some like Java provide parametric Option<T> types, which is adequate.

See my previous post re sentinel values. In short, no.

I'm the forum administrator and lead developer of Rel. Email me at dave@armchair.mb.ca with the Subject 'TTM Forum'. Download Rel from https://reldb.org
Quote from Dave Voorhis on June 5, 2020, 7:46 am
Quote from dandl on June 5, 2020, 2:24 am
Quote from Hugh on June 4, 2020, 1:34 pm
Quote from dandl on June 4, 2020, 7:10 am

Consider the following.

  • Every type shall be assumed to include a value called NULL, unless declared or constrained otherwise.
  • Every instance of NULL is a typed value, which can be created by a selector with some literal argument. For example INTEGER(NULL).
  • An instance of NULL compares equal to NULL of the same  type and not equal to every other value of that type. NULLs of different types may not be compared.
  • For ordered types, NULL compares less than any other value of that type.
  • An outer join returns a value in which some tuples may have attributes containing a NULL value, providing this is not prohibited for the type.
  • Aggregation by default ignores NULL values in the aggregated attribute (other options are possible).

As far as I can tell this complies with RM Pro 4, and alleviates a number of problems in dealing with SQL-like queries.

Although I don't much like the idea, it certainly appears to comply with RM Pro 4.  Above all, it doesn't depart from 2VL.  I once asked a similar question in this forum (though in even less detail than dandl) but the idea was pretty much debunked, as I recall.

I suppose that the semantics of arithmetic operations on INTEGER(NULL) and RATIONAL(NULL) would be inspired by existing treatments of NaN.

Just so, but with a well-behaved NULL, so NULL === NULL

(Does INTEGER(NULL)/0 raise an exception?)

Returns NULL I think.

What about string operations involving CHARACTER(NULL)?

I've not found it useful to distinguish NULL from the empty string in stored data, other than the special treatment provided by SQL. My preference is that it's just an empty string, nothing special.

Question to dandl: Do you really mean every type?  Even UDTs, enumerated types, tuple types, and relation types?

Yes. Most values already have a value suitable for use as a NULL. Here is my list of eleven:

  • bool: actually no, this type should not be NULL, that would be 3VL
  • int (INTEGER): the value -2^63 could be reserved for this purpose (assuming 64 bit integers)
  • double (RATIONAL): NaN, but library functions need to treat it right
  • decimal: dunno, pick one
  • date/time: year 0
  • text (CHARACTER); empty string (of characters)
  • binary: empty string (of bytes)
  • enum: the word 'null' (as per JS)
  • UDT: every component is NULL
  • tuple: every attribute is NULL
  • relation: empty relation.

Every sentinel value you choose will show up in real data with some other meaning, either someone else's semantically-different-from-yours sentinel value or some real data.

For example, the empty string seems like an obvious choice for strings until you need to import/link a SQL table that has semantically distinguished null strings from empty strings. This happens a lot.

The proper way to handle it is with true option types. Ideally, you'd have a language with structural obligation to force handling the 'just a' and 'nothing' options (as in Haskell, Kotlin, etc.), but the usual approaches to handling option types in the popular languages (usually a parametric Option<T> type) will suffice.

It's what they're for.

I don't care about sentinel value conflict -- it's missing if it's missing. I agree there is a problem with int, decimal and text, but the solution based on advanced language features is too intrusive for a query language. The aim is to feed in data from a variety of sources, do a spot of filtering (such as sentinel value conversion), and then have data that preserves missing values, and then does basic NULL-like things such as propagates through library functions, omits from aggregates and doesn't drop tuples on joins. Sentinel values are the best way I know to do that.

Andl - A New Database Language - andl.org
Quote from dandl on June 5, 2020, 2:24 pm
Quote from Dave Voorhis on June 5, 2020, 7:46 am
Quote from dandl on June 5, 2020, 2:24 am
Quote from Hugh on June 4, 2020, 1:34 pm
Quote from dandl on June 4, 2020, 7:10 am

Consider the following.

  • Every type shall be assumed to include a value called NULL, unless declared or constrained otherwise.
  • Every instance of NULL is a typed value, which can be created by a selector with some literal argument. For example INTEGER(NULL).
  • An instance of NULL compares equal to NULL of the same  type and not equal to every other value of that type. NULLs of different types may not be compared.
  • For ordered types, NULL compares less than any other value of that type.
  • An outer join returns a value in which some tuples may have attributes containing a NULL value, providing this is not prohibited for the type.
  • Aggregation by default ignores NULL values in the aggregated attribute (other options are possible).

As far as I can tell this complies with RM Pro 4, and alleviates a number of problems in dealing with SQL-like queries.

Although I don't much like the idea, it certainly appears to comply with RM Pro 4.  Above all, it doesn't depart from 2VL.  I once asked a similar question in this forum (though in even less detail than dandl) but the idea was pretty much debunked, as I recall.

I suppose that the semantics of arithmetic operations on INTEGER(NULL) and RATIONAL(NULL) would be inspired by existing treatments of NaN.

Just so, but with a well-behaved NULL, so NULL === NULL

(Does INTEGER(NULL)/0 raise an exception?)

Returns NULL I think.

What about string operations involving CHARACTER(NULL)?

I've not found it useful to distinguish NULL from the empty string in stored data, other than the special treatment provided by SQL. My preference is that it's just an empty string, nothing special.

Question to dandl: Do you really mean every type?  Even UDTs, enumerated types, tuple types, and relation types?

Yes. Most values already have a value suitable for use as a NULL. Here is my list of eleven:

  • bool: actually no, this type should not be NULL, that would be 3VL
  • int (INTEGER): the value -2^63 could be reserved for this purpose (assuming 64 bit integers)
  • double (RATIONAL): NaN, but library functions need to treat it right
  • decimal: dunno, pick one
  • date/time: year 0
  • text (CHARACTER); empty string (of characters)
  • binary: empty string (of bytes)
  • enum: the word 'null' (as per JS)
  • UDT: every component is NULL
  • tuple: every attribute is NULL
  • relation: empty relation.

Every sentinel value you choose will show up in real data with some other meaning, either someone else's semantically-different-from-yours sentinel value or some real data.

For example, the empty string seems like an obvious choice for strings until you need to import/link a SQL table that has semantically distinguished null strings from empty strings. This happens a lot.

The proper way to handle it is with true option types. Ideally, you'd have a language with structural obligation to force handling the 'just a' and 'nothing' options (as in Haskell, Kotlin, etc.), but the usual approaches to handling option types in the popular languages (usually a parametric Option<T> type) will suffice.

It's what they're for.

I don't care about sentinel value conflict -- it's missing if it's missing. I agree there is a problem with int, decimal and text, but the solution based on advanced language features is too intrusive for a query language. The aim is to feed in data from a variety of sources, do a spot of filtering (such as sentinel value conversion), and then have data that preserves missing values, and then does basic NULL-like things such as propagates through library functions, omits from aggregates and doesn't drop tuples on joins. Sentinel values are the best way I know to do that.

Sentinel values are almost always a bad way to do that. Speaking as someone who's been paid to do data analysis and processing from the early 1980's to today, sentinel values always start out looking like a good idea, but invariably wind up being a justification for assigning retroactive abortion to whoever picked value as the sentinel. This happens again and again and again.

The best way to "have data that preserves missing values" is to represent missing values in source data with option type values in the processing environment, not sentinel values and -- and I really can't emphasise this enough -- let the end user be able to decide how missing values should propagate through library functions (unless it's obvious and incontrovertible), are omitted from aggregates, and (ideally, but perhaps not practically) do or don't drop tuples on joins.

Otherwise, you simply replace the awfulness of NULL's rules with your own awful NULL-substitute rules. I.e., no improvement, just different.

I'm the forum administrator and lead developer of Rel. Email me at dave@armchair.mb.ca with the Subject 'TTM Forum'. Download Rel from https://reldb.org

What I believe is best, and what I implement, is a strictly 2VL system, where every value equals itself, that supports type unions.

I have a system-defined singleton type named "Ignorance" whose meaning is, we don't have a normal value here AND we have no explanation for why.

So Ignorance is the direct analogy to a generic null of many languages that carries no other semantics.

Users can also choose to define other singleton types that correspond to more specific reasons a normal value might be missing.

All of my normal types like numbers or strings or relations expressly do NOT include any special values.  Users can union these with Ignorance or other special values when they want that.

 

Quote from dandl on June 4, 2020, 11:28 am

I think it was considerations like this that persuaded Dataphor it couldn't do without some kind of NULL, but I don't know what they actually finished up doing. But it certainly should make life easier, and that can't be all bad.

Having worked on and with Dataphor for a period in the past, I can tell you that its owners still (as of 2017) consider 3VL NULL to be the most appropriate from a logical standpoint, that it is actually the most logically correct approach that NULL != NULL; they consider that it is not logically appropriate for an unknown to compare as either the same or as not the same as anything including an unknown, because being unknown you can't reasonably make any such assertions.

Quote from Darren Duncan on June 7, 2020, 7:27 pm

What I believe is best, and what I implement, is a strictly 2VL system, where every value equals itself, that supports type unions.

I have a system-defined singleton type named "Ignorance" whose meaning is, we don't have a normal value here AND we have no explanation for why.

So Ignorance is the direct analogy to a generic null of many languages that carries no other semantics.

Users can also choose to define other singleton types that correspond to more specific reasons a normal value might be missing.

All of my normal types like numbers or strings or relations expressly do NOT include any special values.  Users can union these with Ignorance or other special values when they want that.

This may well be the best approach (but I prefer 'Missing'). For it to play nicely I think you need:

  • easy to define and use T? as 'type T plus Missing', so all the libraries on T work easily with T? [Nullable<> in C# works quite well in this regard.]
  • works consistently with ordered types and aggregation
  • Outer Join that emits Missing if attribute type is T?.
Andl - A New Database Language - andl.org
Quote from dandl on June 8, 2020, 2:14 am
Quote from Darren Duncan on June 7, 2020, 7:27 pm

What I believe is best, and what I implement, is a strictly 2VL system, where every value equals itself, that supports type unions.

I have a system-defined singleton type named "Ignorance" whose meaning is, we don't have a normal value here AND we have no explanation for why.

So Ignorance is the direct analogy to a generic null of many languages that carries no other semantics.

Users can also choose to define other singleton types that correspond to more specific reasons a normal value might be missing.

All of my normal types like numbers or strings or relations expressly do NOT include any special values.  Users can union these with Ignorance or other special values when they want that.

This may well be the best approach (but I prefer 'Missing'). For it to play nicely I think you need:

  • easy to define and use T? as 'type T plus Missing', so all the libraries on T work easily with T? [Nullable<> in C# works quite well in this regard.]
  • works consistently with ordered types and aggregation
  • Outer Join that emits Missing if attribute type is T?.

These are option types, as I mentioned above. Option<T> in Java; Nullable<T> in C#.

It's nice if you can provide an outer join that per user choice either emits option types or separates the return tuples into missing_left, missing_right and matched collections.

Sentinel values simply don't make sense when we know about union types.

I'm the forum administrator and lead developer of Rel. Email me at dave@armchair.mb.ca with the Subject 'TTM Forum'. Download Rel from https://reldb.org
Quote from Darren Duncan on June 7, 2020, 7:37 pm
Quote from dandl on June 4, 2020, 11:28 am

I think it was considerations like this that persuaded Dataphor it couldn't do without some kind of NULL, but I don't know what they actually finished up doing. But it certainly should make life easier, and that can't be all bad.

Having worked on and with Dataphor for a period in the past, I can tell you that its owners still (as of 2017) consider 3VL NULL to be the most appropriate from a logical standpoint, that it is actually the most logically correct approach that NULL != NULL; they consider that it is not logically appropriate for an unknown to compare as either the same or as not the same as anything including an unknown, because being unknown you can't reasonably make any such assertions.

That particular aspect is correct.  However after some NULL == <something> comparison has correctly yielded "I don't know" as an answer, something has to decide, e.g. if it happens inside a WHERE, whether or not to include this tuple that gave rise to the "I don't know", and the answer to "include or not" cannot be "I don't know".  So in SQL, "unknown" ultimately does get treated as "false" which is exactly the "not logically appropriate" thing that was to be avoided.

Quote from Dave Voorhis on June 8, 2020, 8:10 am
Quote from dandl on June 8, 2020, 2:14 am
Quote from Darren Duncan on June 7, 2020, 7:27 pm

What I believe is best, and what I implement, is a strictly 2VL system, where every value equals itself, that supports type unions.

I have a system-defined singleton type named "Ignorance" whose meaning is, we don't have a normal value here AND we have no explanation for why.

So Ignorance is the direct analogy to a generic null of many languages that carries no other semantics.

Users can also choose to define other singleton types that correspond to more specific reasons a normal value might be missing.

All of my normal types like numbers or strings or relations expressly do NOT include any special values.  Users can union these with Ignorance or other special values when they want that.

This may well be the best approach (but I prefer 'Missing'). For it to play nicely I think you need:

  • easy to define and use T? as 'type T plus Missing', so all the libraries on T work easily with T? [Nullable<> in C# works quite well in this regard.]
  • works consistently with ordered types and aggregation
  • Outer Join that emits Missing if attribute type is T?.

These are option types, as I mentioned above. Option<T> in Java; Nullable<T> in C#.

I might have misunderstood, but I don't think so. In C# at least all nulls are the same.

It's nice if you can provide an outer join that per user choice either emits option types or separates the return tuples into missing_left, missing_right and matched collections.

Sentinel values simply don't make sense when we know about union types.

For union types to work for this purpose in a language quite a few things have to be done right, and RM Pro 4 looks like a roadblock, hence my question. If we read RM Pro 4 as not posing a barrier, as well as TTM allowing union types without the IM, then it becomes worthwhile to talk about ways and means.

Sentinel values allow you to encode missing values on input and for interchange, which union types do not. It's not a gimme.

Andl - A New Database Language - andl.org
Quote from dandl on June 8, 2020, 10:29 am
Quote from Dave Voorhis on June 8, 2020, 8:10 am
Quote from dandl on June 8, 2020, 2:14 am
Quote from Darren Duncan on June 7, 2020, 7:27 pm

What I believe is best, and what I implement, is a strictly 2VL system, where every value equals itself, that supports type unions.

I have a system-defined singleton type named "Ignorance" whose meaning is, we don't have a normal value here AND we have no explanation for why.

So Ignorance is the direct analogy to a generic null of many languages that carries no other semantics.

Users can also choose to define other singleton types that correspond to more specific reasons a normal value might be missing.

All of my normal types like numbers or strings or relations expressly do NOT include any special values.  Users can union these with Ignorance or other special values when they want that.

This may well be the best approach (but I prefer 'Missing'). For it to play nicely I think you need:

  • easy to define and use T? as 'type T plus Missing', so all the libraries on T work easily with T? [Nullable<> in C# works quite well in this regard.]
  • works consistently with ordered types and aggregation
  • Outer Join that emits Missing if attribute type is T?.

These are option types, as I mentioned above. Option<T> in Java; Nullable<T> in C#.

I might have misunderstood, but I don't think so. In C# at least all nulls are the same.

Nullable<T> is not C#'s null, it's a parametric type. But C#'s implementation is peculiar -- tied up as it is with C#'s nullable types -- and seems only halfway to being a true union type. Far better is Java's Optional<T> (which is what I meant, not Option<T>) and comes closer to being a real union type. It would be trivial to create a C# equivalent.

It's nice if you can provide an outer join that per user choice either emits option types or separates the return tuples into missing_left, missing_right and matched collections.

Sentinel values simply don't make sense when we know about union types.

For union types to work for this purpose in a language quite a few things have to be done right, and RM Pro 4 looks like a roadblock, hence my question. If we read RM Pro 4 as not posing a barrier, as well as TTM allowing union types without the IM, then it becomes worthwhile to talk about ways and means.

Now that I think more about it, C#'s Nullable<T> is probably a violation of RM Pro 4, but an equivalent to Java's Optional<T> would be fine.

Sentinel values allow you to encode missing values on input and for interchange, which union types do not. It's not a gimme.

Sentinel values are almost inevitable in external CSV files and the like, and often unavoidable if you're outputting CSV, but you really don't want to be keeping them around inside a data processing system, particularly when file X uses one sentinel value and file Y uses a different sentinel value, and you need to UNION or JOIN X and Y. 

Convert them to an option type as early as possible, or you will wind up with the unpleasant situation of file having one sentinel value, file Y having a different sentinel value, and your internal systems using a third sentinel value -- perhaps one that file X and Y both define as valid data. Much ugliness ensues.

I'm the forum administrator and lead developer of Rel. Email me at dave@armchair.mb.ca with the Subject 'TTM Forum'. Download Rel from https://reldb.org