Null, NaN, NaV, and 2VL

#1 · October 9, 2019, 9:43 pm

Quote from johnwcowan on October 9, 2019, 9:43 pm
I am going to assume for now that SQL NULL represents one of three things: (a) a value which is knowable but presently not known, like John Cowan's death date; (b) something that is not applicable to the tuple it is in (the fuel efficiency of an all-electric vehicle; (c) a soft error that does not throw an exception. I am going to dismiss case (a), as it is in some sense transient, even if the lag time before the value is known may be very large.

The value NaN in the floating-point domain can be used as either (b) or (c); in principle, (a) is possible, but usually floats are in scientific applications where everything is either known or never going to be known. In the interval interpretation of floats, a float f₁ represents the half-open interval [f₁ - (f₁ - f₀)/2, f1 + (f₂ - f₁)/2), where f₀ and f₂ are the floats just above and just below f, and +Inf and -Inf are the half-open intervals from the largest/smallest float to ±∞. NaN represents the interval from -∞ to +∞, as in 1.0/0.0, or else the null interval, as in √ᵣ-2 where √ᵣ is the real-valued square root, so it truly is a floating-point value rather than a "mark".

However, it is off the real number line, because every affirmative comparison operator (= < > ≤ ≥) involving NaN returns false. Thus 1 < NaN is false, -5 >= NaN is false, and in particular NaN = NaN is false. This means that < is no longer the negation of >= and likewise in all similar cases, which is a bit messy, but could be made easier by adding operators ≠ </ >/ ≤/ ≥/. This is done because we do not want to assert that 1.0/0.0 =√ᵣ-2 is true.

The extension of infinity and NaN to the domain of exact numbers (integral or fractional) is straightforward, so we can have a value NaEN ("not an exact number"). Complex numbers are more complicated, because either the real part or the imaginary part can be NaN separately (this turns out to be what people doing computation on complex numbers want). If we have a relation with information about gods and mortals, we may well say that a god's death date is NaD, and perhaps this is also true of some gods' birth dates; by the same token, the age of a god is NaEN. If "abc" * 3 is "abcabcabc", then "abcabcabd"/3 is NaS.

All this can be generalized to NaV, the (typeless) union of all such values, as long as Boolean values are excluded. There is no NaB, because booleans might be presently unknown, but they can never be inapplicable, nor can they be the result of a soft error, since operations on booleans are truly closed. So we have provided for some (not all) of the functions of NULL without gaining th complexities of 3VL. In particular, 3VL with UNKNOWN makes sense only using an open-world assumption in which a tuple in a relation represents a fact but the absence of a tuple does not represent the falsity of that fact, which is not the normal case. So even for full-fledged SQL NULL, we can reject the use of 2VL (at the expense of the standard, of course). In short, the devil may still be a bit sulfurous, but he is none so bad as he has been painted.

I am going to assume for now that SQL NULL represents one of three things: (a) a value which is knowable but presently not known, like John Cowan's death date; (b) something that is not applicable to the tuple it is in (the fuel efficiency of an all-electric vehicle; (c) a soft error that does not throw an exception. I am going to dismiss case (a), as it is in some sense transient, even if the lag time before the value is known may be very large.

The value NaN in the floating-point domain can be used as either (b) or (c); in principle, (a) is possible, but usually floats are in scientific applications where everything is either known or never going to be known. In the interval interpretation of floats, a float f₁ represents the half-open interval [f₁ - (f₁ - f₀)/2, f1 + (f₂ - f₁)/2), where f₀ and f₂ are the floats just above and just below f, and +Inf and -Inf are the half-open intervals from the largest/smallest float to ±∞. NaN represents the interval from -∞ to +∞, as in 1.0/0.0, or else the null interval, as in √ᵣ-2 where √ᵣ is the real-valued square root, so it truly is a floating-point value rather than a "mark".

However, it is off the real number line, because every affirmative comparison operator (= < > ≤ ≥) involving NaN returns false. Thus 1 < NaN is false, -5 >= NaN is false, and in particular NaN = NaN is false. This means that < is no longer the negation of >= and likewise in all similar cases, which is a bit messy, but could be made easier by adding operators ≠ </ >/ ≤/ ≥/. This is done because we do not want to assert that 1.0/0.0 =√ᵣ-2 is true.

The extension of infinity and NaN to the domain of exact numbers (integral or fractional) is straightforward, so we can have a value NaEN ("not an exact number"). Complex numbers are more complicated, because either the real part or the imaginary part can be NaN separately (this turns out to be what people doing computation on complex numbers want). If we have a relation with information about gods and mortals, we may well say that a god's death date is NaD, and perhaps this is also true of some gods' birth dates; by the same token, the age of a god is NaEN. If "abc" * 3 is "abcabcabc", then "abcabcabd"/3 is NaS.

All this can be generalized to NaV, the (typeless) union of all such values, as long as Boolean values are excluded. There is no NaB, because booleans might be presently unknown, but they can never be inapplicable, nor can they be the result of a soft error, since operations on booleans are truly closed. So we have provided for some (not all) of the functions of NULL without gaining th complexities of 3VL. In particular, 3VL with UNKNOWN makes sense only using an open-world assumption in which a tuple in a relation represents a fact but the absence of a tuple does not represent the falsity of that fact, which is not the normal case. So even for full-fledged SQL NULL, we can reject the use of 2VL (at the expense of the standard, of course). In short, the devil may still be a bit sulfurous, but he is none so bad as he has been painted.

#2 · March 1, 2020, 6:10 am

I have investigated this further and under two approaches and found it inconsistent both times. I conclude there is nothing to be done in this direction, which is a pity.

#3 · March 1, 2020, 6:39 am

Quote from AntC on March 1, 2020, 6:39 am

Quote from johnwcowan on October 9, 2019, 9:43 pm

...

All this can be generalized to NaV, the (typeless) union of all such values, as long as Boolean values are excluded. There is no NaB, because booleans might be presently unknown, but they can never be inapplicable,

Suppose we have a person record, with marital status. For Status Married, there's a Boolean DivorcedYet? Consider a person who's Single. Their DivorcedYet? value is not applicable.

Is that a dumb design? Yes. But I see it all the time; because SQL's Null makes it so easy to scatter non-applicable fields all over the schema. I see it in particular on enterprise-wide packages, because they must be protean over all sorts of businesses and compliance regimes. A typical ERP/accounting package will have a bunch of nullable fields for representing VAT/GST and a different bunch of nullable fields for representing Sales tax. (Typically the package got designed for one regime then got 'internationalised' as a later venture.) A possibly-inapplicable Boolean there might be a flag saying whether GST is calculated and rounded per-line-item or whole-invoice. Note that Tax regulations allow either way, and therefore allow a sum-of-line-item rounded values to be slightly 'wrong' compared to the invoice total calculation.

nor can they be the result of a soft error, since operations on booleans are truly closed.

Therefore no.

So we have provided for some (not all) of the functions of NULL without gaining th complexities of 3VL. In particular, 3VL with UNKNOWN makes sense only using an open-world assumption in which a tuple in a relation represents a fact but the absence of a tuple does not represent the falsity of that fact, which is not the normal case. So even for full-fledged SQL NULL, we can reject the use of 2VL (at the expense of the standard, of course). In short, the devil may still be a bit sulfurous, but he is none so bad as he has been painted.

Codd 1985 and later went to enormous lengths to categorise different varieties of Null and varieties of operations over those differently-categorised nulls. It's a veil of tears.

Quote from johnwcowan on October 9, 2019, 9:43 pm

...

All this can be generalized to NaV, the (typeless) union of all such values, as long as Boolean values are excluded. There is no NaB, because booleans might be presently unknown, but they can never be inapplicable,

Suppose we have a person record, with marital status. For Status Married, there's a Boolean DivorcedYet? Consider a person who's Single. Their DivorcedYet? value is not applicable.

Is that a dumb design? Yes. But I see it all the time; because SQL's Null makes it so easy to scatter non-applicable fields all over the schema. I see it in particular on enterprise-wide packages, because they must be protean over all sorts of businesses and compliance regimes. A typical ERP/accounting package will have a bunch of nullable fields for representing VAT/GST and a different bunch of nullable fields for representing Sales tax. (Typically the package got designed for one regime then got 'internationalised' as a later venture.) A possibly-inapplicable Boolean there might be a flag saying whether GST is calculated and rounded per-line-item or whole-invoice. Note that Tax regulations allow either way, and therefore allow a sum-of-line-item rounded values to be slightly 'wrong' compared to the invoice total calculation.

nor can they be the result of a soft error, since operations on booleans are truly closed.

Therefore no.

So we have provided for some (not all) of the functions of NULL without gaining th complexities of 3VL. In particular, 3VL with UNKNOWN makes sense only using an open-world assumption in which a tuple in a relation represents a fact but the absence of a tuple does not represent the falsity of that fact, which is not the normal case. So even for full-fledged SQL NULL, we can reject the use of 2VL (at the expense of the standard, of course). In short, the devil may still be a bit sulfurous, but he is none so bad as he has been painted.

Codd 1985 and later went to enormous lengths to categorise different varieties of Null and varieties of operations over those differently-categorised nulls. It's a veil of tears.

#4 · March 1, 2020, 6:45 am

I agree entirely. It was considerations like this, plus a little elementary propositional logic, that led me to the same conclusion.

Another example of null fields is the lab-test record, which records the results of lab tests. There are thousands of such tests, but doing more than, say, thirty on any one occasion would be very unusual. The obvious approach is to have a record with thousands of fields, most of which are NULL, but a bit of reflection shows a much better way to do this.

In the extreme case, one can go to RDF, which does allow first-class attributes. (They don't look first-class, because they are all named, but having an unbounded number of names is the moral equivalent of having none.)

#5 · March 1, 2020, 1:19 pm

It's a bit off topic, but I've recently been playing with Knime, which is mostly use for numerical analysis. A recurrent feature of all kinds of data acquisition (labs, surveys, etc) is the missing value. You just can't avoid them. While behaviour varies, mostly you want missing values to compare nicely but propagate through any calculation that uses them. Sometimes there is more than one MV, and the same applies, possibly with one taking priority over another. Numerical analysts would be lost without them.

They're not quite an SQL NULL or a Nan, although obviously they live somewhere nearby. Note that the data is often little strings like Yes/No/Maybe/NA so NaN is no use. Here NA is the MV, so we want it to compare first or last, not somewhere in the middle.

I suppose the solution might be a set of types augmented by missing values, and corresponding operators, functions and so on, but that seems like overkill.

Andl - A New Database Language - andl.org

#6 · March 1, 2020, 1:38 pm

Quote from dandl on March 1, 2020, 1:19 pm

It's a bit off topic, but I've recently been playing with Knime, which is mostly use for numerical analysis. A recurrent feature of all kinds of data acquisition (labs, surveys, etc) is the missing value. You just can't avoid them. While behaviour varies, mostly you want missing values to compare nicely but propagate through any calculation that uses them. Sometimes there is more than one MV, and the same applies, possibly with one taking priority over another. Numerical analysts would be lost without them.

They're not quite an SQL NULL or a Nan, although obviously they live somewhere nearby. Note that the data is often little strings like Yes/No/Maybe/NA so NaN is no use. Here NA is the MV, so we want it to compare first or last, not somewhere in the middle.

I suppose the solution might be a set of types augmented by missing values, and corresponding operators, functions and so on, but that seems like overkill.

Option types are conceptually the right tool for this job. The environment needs to make them easy to use.

I'm the forum administrator and lead developer of Rel. Email me at dave@armchair.mb.ca with the Subject 'TTM Forum'. Download Rel from https://reldb.org

#7 · March 1, 2020, 11:25 pm

Option types certainly help, but they don't really deliver the kind of cross-cutting generic behaviour expected of missing values. You can define a type and add an 'IsMissing' attribute, but propagating that through all the possible function calls is a lot of work. It's usually handled by the expression evaluator, before passing off to function calls.

TTM already fails to offer much in the way of enumerated types so there's that too.

Andl - A New Database Language - andl.org

#8 · March 2, 2020, 12:56 pm

Quote from dandl on March 1, 2020, 11:25 pm

Option types certainly help, but they don't really deliver the kind of cross-cutting generic behaviour expected of missing values. You can define a type and add an 'IsMissing' attribute, but propagating that through all the possible function calls is a lot of work. It's usually handled by the expression evaluator, before passing off to function calls.

TTM already fails to offer much in the way of enumerated types so there's that too.

Is it TTM that fails here or just Tutorial D?

I don't think there's anything in TTM that precludes defining a type system and associated user interface that makes it easy to specify union types, option types, etc., for data processing purposes. You could probably add syntactic sweetener to what's already there.

If there is some fundamental obstacle, perhaps that's justification for a new type system and associated user interface that makes it easy to specify union types, option types, etc., for data processing purposes.

Propagating missing values is indeed work -- necessary work if you want missing values to make logical sense, but not a lot of work with a reasonably expressive language and type system.

Otherwise, if you choose to lazily specify a generic null and friends, you wind up with all the illogic and other abominations that SQL frequently demonstrates.

I'm the forum administrator and lead developer of Rel. Email me at dave@armchair.mb.ca with the Subject 'TTM Forum'. Download Rel from https://reldb.org

The Forum for Discussion about The Third Manifesto and Related Matters

Null, NaN, NaV, and 2VL