Type checking/inference, practical application?

#11 · July 5, 2021, 7:05 am

I want to make it clear that I support the goal and ideal of making a language do as much type checking at compile time as possible.

However I believe for the general case of programs that doing 100% of type checking at compile time is impossible, and there is always a fraction that can only be done at runtime.

A key situation where that is true is if the application supports the user defining ad-hoc queries without arbitrary limitations.

In that case, program code is being created and executed at runtime, meaning this code couldn't possibly have been checked at compile time before the user received the program to run it.

In practice there could be multiple distinct compile and run times, such as the runtime invoking the compiler with the user's ad-hoc query definition, but in that case the second compile time is effectively happening at runtime relative to the original compile.

And so at least some type checks can only be done at runtime or else the system is not computationally complete and hence is much less interesting and useful.

#12 · July 5, 2021, 8:46 am

Quote from Dave Voorhis on July 5, 2021, 8:46 am

Quote from Darren Duncan on July 5, 2021, 6:50 am

Quote from dandl on July 5, 2021, 5:31 am

Quote from mamcx on July 4, 2021, 6:31 pm

Quote from Darren Duncan on July 4, 2021, 7:10 am

In my opinion, for a FIRST version, I feel that it is considerably easier to implement a language whose types and type checks are implemented at runtime in the general case, with compile time checks done just for a subset that are relatively easy, with the assumption it is impossible to do things 100% at compile time without either severely limiting language functionality or else requiring orders of magnitude more time or intelligence to solve how to do it

Yeah, this is what I have done so far. But "compile time checks done just for a subset" is a nice goal, but figure WHICH is this subset is what is not as obvious to me. One major reason for doing this is actually to get nice error messages, and type checking already visits the code to get the info for them (be it a compile or runtime).

I would disagree. Strong type checking is a deeply pervasive decision. I much prefer it for big durable programs (like application suites) and find it a nuisance for ad hoc scripts. Your choice, but I think you have to choose.

David, you seem to be talking about a different thing than I and mancx were.

We were talking about when type checking happens, compile time vs runtime. You're talking about strong type checking which is orthogonal to this and completely different.

Strong type checking says for example an attempt to use a text where a number is expected would fail, while weak type checking would not fail and instead try to extract a number from the text.

Whereas, I think mancx and I both agreed, or at least I think, we want to have the strong typing regardless, and compile versus runtime checking is just a matter of when the failure is seen, rather than whether the failure happens.

The type system characteristic matrix, sometimes helpful in discussions like these:

Strength * Inference Static/Dynamic Example (worst case)

Weak None None FORTH, typical assembly language

Manifest Dynamic

Static C

Inferred Dynamic

Static

Moderate Manifest Dynamic

Static

Inferred Dynamic PHP, Python, JavaScript

Static

Strong Manifest Dynamic

Static

Inferred Dynamic Java, C#, C++

Static

* Weak = type mismatches result in undefined behaviour

Moderate = type mismatches result in specified implicit casts or runtime error

Strong = type mismatches result in defined error

Filling in more examples is left as an exercise for the reader, and note that the given examples are worst-case behaviour. E.g., modern Java and C# are notionally a strong, inferred, static type system, but with features that permit them to be -- at worst -- a dynamically-typed system.

On that basis, and in retrospect, I probably should have put C in the weak, manifest, dynamic category -- which you can do by making every pointer void *, etc., though that's intentionally pathological. Java and C# have features that put them in the strong, inferred, dynamically-typed category when used properly.

Quote from Darren Duncan on July 5, 2021, 6:50 am

Quote from dandl on July 5, 2021, 5:31 am

Quote from mamcx on July 4, 2021, 6:31 pm

Quote from Darren Duncan on July 4, 2021, 7:10 am

In my opinion, for a FIRST version, I feel that it is considerably easier to implement a language whose types and type checks are implemented at runtime in the general case, with compile time checks done just for a subset that are relatively easy, with the assumption it is impossible to do things 100% at compile time without either severely limiting language functionality or else requiring orders of magnitude more time or intelligence to solve how to do it

Yeah, this is what I have done so far. But "compile time checks done just for a subset" is a nice goal, but figure WHICH is this subset is what is not as obvious to me. One major reason for doing this is actually to get nice error messages, and type checking already visits the code to get the info for them (be it a compile or runtime).

I would disagree. Strong type checking is a deeply pervasive decision. I much prefer it for big durable programs (like application suites) and find it a nuisance for ad hoc scripts. Your choice, but I think you have to choose.

David, you seem to be talking about a different thing than I and mancx were.

We were talking about when type checking happens, compile time vs runtime. You're talking about strong type checking which is orthogonal to this and completely different.

Strong type checking says for example an attempt to use a text where a number is expected would fail, while weak type checking would not fail and instead try to extract a number from the text.

Whereas, I think mancx and I both agreed, or at least I think, we want to have the strong typing regardless, and compile versus runtime checking is just a matter of when the failure is seen, rather than whether the failure happens.

The type system characteristic matrix, sometimes helpful in discussions like these:

Strength *	Inference	Static/Dynamic	Example (worst case)
Weak	None	None	FORTH, typical assembly language

	Manifest	Dynamic
	Manifest	Static	C
	Inferred	Dynamic
	Inferred	Static
Moderate	Manifest	Dynamic
	Manifest	Static
	Inferred	Dynamic	PHP, Python, JavaScript
	Inferred	Static
Strong	Manifest	Dynamic
	Manifest	Static
	Inferred	Dynamic	Java, C#, C++
	Inferred	Static

*	Weak = type mismatches result in undefined behaviour
	Moderate = type mismatches result in specified implicit casts or runtime error
	Strong = type mismatches result in defined error

Filling in more examples is left as an exercise for the reader, and note that the given examples are worst-case behaviour. E.g., modern Java and C# are notionally a strong, inferred, static type system, but with features that permit them to be -- at worst -- a dynamically-typed system.

On that basis, and in retrospect, I probably should have put C in the weak, manifest, dynamic category -- which you can do by making every pointer void *, etc., though that's intentionally pathological. Java and C# have features that put them in the strong, inferred, dynamically-typed category when used properly.

I'm the forum administrator and lead developer of Rel. Email me at dave@armchair.mb.ca with the Subject 'TTM Forum'. Download Rel from https://reldb.org

#13 · July 5, 2021, 3:06 pm

Quote from dandl on July 5, 2021, 5:31 am

The basic (Codd) Relational Algebra defines 6 operators: select, project, join, rename, union, not.

The Extended RA adds 3: extend, aggregate, while.

Of course, that's a relational algebra. There is nothing so definitive as to be considered the Relational Algebra.

I'm the forum administrator and lead developer of Rel. Email me at dave@armchair.mb.ca with the Subject 'TTM Forum'. Download Rel from https://reldb.org

#14 · July 5, 2021, 9:54 pm

Quote from Dave Voorhis on July 5, 2021, 3:06 pm

Quote from dandl on July 5, 2021, 5:31 am

The basic (Codd) Relational Algebra defines 6 operators: select, project, join, rename, union, not.

The Extended RA adds 3: extend, aggregate, while.

Of course, that's a relational algebra. There is nothing so definitive as to be considered the Relational Algebra.

while is specific to the Alice book. I don't think I've seen any other algebra alleged to be relational that includes while. Most use some sort of grouping operator -- such as G in wikipedia -- with transitive closure.

#15 · July 6, 2021, 12:49 am

Whereas, I think mancx and I both agreed, or at least I think, we want to have the strong typing regardless, and compile versus runtime checking is just a matter of when the failure is seen, rather than whether the failure happens.

Yes, this is more what I will do. Strong typing is normally done if compiled and dynamic if not, but in fact, is something you can do at parsing time even if the lang is interpreted.

#16 · July 6, 2021, 12:55 am

Quote from dandl on July 5, 2021, 5:31 am

That works, but those languages are quite low level. You really need excellent support for business data types. My list is: bool, int, float, decimal, time, text, blob, enum (string), record and collections thereof. Records/tuples are a serious problem if you want to go the TTM route.

Yeah, my list is nearly identical (with extra: date, dateTime & time using nodatime-like API, enums are like in Rust, so strings + chars + ints + algebraic types). But why Records/tuples are trouble? BTW I have them more like an internal detail I need for iterators/streams but have wonder if make them first-class to the user, just think the main containers are enough for that.

The basic (Codd) Relational Algebra defines 6 operators: select, project, join, rename, union, not.

The Extended RA adds 3: extend, aggregate, while.

Update is just assign: replace the old value of some variable by a new value.

That's all there is. It does everything in SQL.

Oh, nice, I misunderstood about update. I was too focused on sugar like insert/update/delete/upsert - and that all values are immutable by default and marked mutable explicitly- and is more obvious looking this way.

#17 · July 6, 2021, 1:27 am

Quote from Darren Duncan on July 5, 2021, 7:05 am

I want to make it clear that I support the goal and ideal of making a language do as much type checking at compile time as possible.

However I believe for the general case of programs that doing 100% of type checking at compile time is impossible, and there is always a fraction that can only be done at runtime.

I disagree. IMO it is possible to design a language to allow 100% type checking at compile time, and I can prove it with examples and strategies. This is my safe goal. The only OO language I know that comes close is Eiffel.

A key situation where that is true is if the application supports the user defining ad-hoc queries without arbitrary limitations.

In that case, program code is being created and executed at runtime, meaning this code couldn't possibly have been checked at compile time before the user received the program to run it.

In practice there could be multiple distinct compile and run times, such as the runtime invoking the compiler with the user's ad-hoc query definition, but in that case the second compile time is effectively happening at runtime relative to the original compile.

Correct. Many application systems need to supply end-user computation. The main examples are ad hoc queries and rule-based computations like pricing and payroll. In that case the end-user facility must itself provide type-safe compilation and execution.

And so at least some type checks can only be done at runtime or else the system is not computationally complete and hence is much less interesting and useful.

This is patently untrue. Happy to demolish any proposed examples on request.

Andl - A New Database Language - andl.org

#18 · July 6, 2021, 1:45 am

Quote from AntC on July 5, 2021, 9:54 pm

Quote from Dave Voorhis on July 5, 2021, 3:06 pm

Quote from dandl on July 5, 2021, 5:31 am

The basic (Codd) Relational Algebra defines 6 operators: select, project, join, rename, union, not.

The Extended RA adds 3: extend, aggregate, while.

Of course, that's a relational algebra. There is nothing so definitive as to be considered the Relational Algebra.

while is specific to the Alice book. I don't think I've seen any other algebra alleged to be relational that includes while. Most use some sort of grouping operator -- such as G in wikipedia -- with transitive closure.

What I have provided is complete: a superset of every RA operation in TD, SQL and any similar query language using the named perspective. Feel free to provide your own version in some other form, but if complete they will be equivalent and there will be a trivial transformation between them.

The while operator is represented by CTE RECURSIVE in SQL, and is a superset of transitive closure. Given while, TC is a trivial shorthand.

Grouping is aggregation, not while. I don't know G -- please provide a link.

Andl - A New Database Language - andl.org

#19 · July 6, 2021, 2:46 am

Quote from dandl on July 6, 2021, 2:46 am

Quote from mamcx on July 6, 2021, 12:55 am

Quote from dandl on July 5, 2021, 5:31 am

That works, but those languages are quite low level. You really need excellent support for business data types. My list is: bool, int, float, decimal, time, text, blob, enum (string), record and collections thereof. Records/tuples are a serious problem if you want to go the TTM route.

Yeah, my list is nearly identical (with extra: date, dateTime & time using nodatime-like API, enums are like in Rust, so strings + chars + ints + algebraic types). But why Records/tuples are trouble? BTW I have them more like an internal detail I need for iterators/streams but have wonder if make them first-class to the user, just think the main containers are enough for that.

My time is a storage format for a point in time. You need lots of library routines to turn that into useful stuff like YMDHMS or whatever, but you don't need any more types.

My enum is a storage format for one of a set of short strings. Enums as a gloss over integers are a hangover from when storage was expensive, and should never be used now.

The char data type is likewise useless. It's a hangover from other times and serves no useful purpose.

My record is a storage format for a set of values identified by name. The type specifies a set of <name,type> pairs.

Records hark all the way back to EQUIVALENT in Fortran, REDEFINES in Cobol and struct in C: old ways of defining storage layout. The problem now is that (a) the order of field definition matters (b) the names are lost. That doesn't work well for a storage format or for TTM tuples. It's a complex topic, and nobody really has a perfect answer.

The basic (Codd) Relational Algebra defines 6 operators: select, project, join, rename, union, not.

The Extended RA adds 3: extend, aggregate, while.

Update is just assign: replace the old value of some variable by a new value.

That's all there is. It does everything in SQL.

Oh, nice, I misunderstood about update. I was too focused on sugar like insert/update/delete/upsert - and that all values are immutable by default and marked mutable explicitly- and is more obvious looking this way.

Just so. Values are immutable and variables hold a value which may vary over time.

Quote from mamcx on July 6, 2021, 12:55 am

Quote from dandl on July 5, 2021, 5:31 am

That works, but those languages are quite low level. You really need excellent support for business data types. My list is: bool, int, float, decimal, time, text, blob, enum (string), record and collections thereof. Records/tuples are a serious problem if you want to go the TTM route.

Yeah, my list is nearly identical (with extra: date, dateTime & time using nodatime-like API, enums are like in Rust, so strings + chars + ints + algebraic types). But why Records/tuples are trouble? BTW I have them more like an internal detail I need for iterators/streams but have wonder if make them first-class to the user, just think the main containers are enough for that.

My time is a storage format for a point in time. You need lots of library routines to turn that into useful stuff like YMDHMS or whatever, but you don't need any more types.

My enum is a storage format for one of a set of short strings. Enums as a gloss over integers are a hangover from when storage was expensive, and should never be used now.

The char data type is likewise useless. It's a hangover from other times and serves no useful purpose.

My record is a storage format for a set of values identified by name. The type specifies a set of <name,type> pairs.

Records hark all the way back to EQUIVALENT in Fortran, REDEFINES in Cobol and struct in C: old ways of defining storage layout. The problem now is that (a) the order of field definition matters (b) the names are lost. That doesn't work well for a storage format or for TTM tuples. It's a complex topic, and nobody really has a perfect answer.

The basic (Codd) Relational Algebra defines 6 operators: select, project, join, rename, union, not.

The Extended RA adds 3: extend, aggregate, while.

Update is just assign: replace the old value of some variable by a new value.

That's all there is. It does everything in SQL.

Oh, nice, I misunderstood about update. I was too focused on sugar like insert/update/delete/upsert - and that all values are immutable by default and marked mutable explicitly- and is more obvious looking this way.

Just so. Values are immutable and variables hold a value which may vary over time.

Andl - A New Database Language - andl.org

#20 · July 6, 2021, 3:11 am

Quote from Darren Duncan on July 6, 2021, 3:11 am

Quote from dandl on July 6, 2021, 1:27 am

Quote from Darren Duncan on July 5, 2021, 7:05 am

And so at least some type checks can only be done at runtime or else the system is not computationally complete and hence is much less interesting and useful.

This is patently untrue. Happy to demolish any proposed examples on request.

I think this just comes down to definitions of what "compile time" and "runtime" actually mean, the assumption there is exactly on instance of the first which precedes all of the second.

If we consider the restricted traditional scenario for language L that person P1 is writing an application in L, and then compiling that program to a machine binary and giving just that binary to person P2 to run their business on, then the entirety of "compile time" is when the program was in the hands of P1 and the entirety of the time the machine binary is being used in the hands of P2 is "runtime".

In this scenario, the only time "compile time type checks" could be done is by P1.

Now if the program has features to allow P2 to express arbitrary queries or business logic, arbitrary as in P2 has a computationally complete environment to work with, then P2 would in the general case be defining their own types and routines and executing them. But all of these type checks would be "at runtime" from the perspective of the program because P2 only has the machine binary which is "already compiled", and it isn't being recompiled to include whatever P2 defined.

So it is impossible for the types defined by P2 to have been checked at compile time of the main program because the compilation was finished before P2 had the program.

In contrast, if we define "compile time" as something that can be caused by a runtime, and there can be more than one, then my comment about impossibility doesn't apply.

In the latter case, the compile/runtime distinction is a lot more arbitrary as they tend to blur together into one undifferentiated stage which in effect is just runtime.

Quote from dandl on July 6, 2021, 1:27 am

Quote from Darren Duncan on July 5, 2021, 7:05 am

And so at least some type checks can only be done at runtime or else the system is not computationally complete and hence is much less interesting and useful.

This is patently untrue. Happy to demolish any proposed examples on request.

I think this just comes down to definitions of what "compile time" and "runtime" actually mean, the assumption there is exactly on instance of the first which precedes all of the second.

If we consider the restricted traditional scenario for language L that person P1 is writing an application in L, and then compiling that program to a machine binary and giving just that binary to person P2 to run their business on, then the entirety of "compile time" is when the program was in the hands of P1 and the entirety of the time the machine binary is being used in the hands of P2 is "runtime".

In this scenario, the only time "compile time type checks" could be done is by P1.

Now if the program has features to allow P2 to express arbitrary queries or business logic, arbitrary as in P2 has a computationally complete environment to work with, then P2 would in the general case be defining their own types and routines and executing them. But all of these type checks would be "at runtime" from the perspective of the program because P2 only has the machine binary which is "already compiled", and it isn't being recompiled to include whatever P2 defined.

So it is impossible for the types defined by P2 to have been checked at compile time of the main program because the compilation was finished before P2 had the program.

In contrast, if we define "compile time" as something that can be caused by a runtime, and there can be more than one, then my comment about impossibility doesn't apply.

In the latter case, the compile/runtime distinction is a lot more arbitrary as they tend to blur together into one undifferentiated stage which in effect is just runtime.

The Forum for Discussion about The Third Manifesto and Related Matters