Big Data/Data Science/AI, etc

#11 · June 8, 2019, 10:14 am

Quote from dandl on June 8, 2019, 1:09 am

Quote from Dave Voorhis on June 7, 2019, 3:15 pm

We created a Big Data undergrad course a few years ago. I was its programme leader¹. We attracted so few students that we shut it down. It turns out that 16/17 year old kids have no idea what "data science" is, but whatever it is, it sounds like no fun at all.

I don't think it works at under-grad. It looks more like a Masters to me.

Yes. We offer a Masters in Big Data Analytics and it's reasonably successful.

Quote from dandl on June 8, 2019, 1:09 am

Quote from Dave Voorhis on June 7, 2019, 3:15 pm

It's inspired by using Rel for "production" desktop data management, and having reflected at length on its strengths and limitations. In particular, I've recognised there are the things it doesn't do at all (or does badly) but that it needs to do really, really well. The solution won't be a D and won't even be a new language per se, but it will respect the ideals of TTM, even though it won't necessarily embrace all the prescriptions and proscriptions. It will help programmers and "power users" work with data and build data-oriented applications, using popular languages and tools in a new but familiar -- and hopefully very appealing and powerful -- way.

Is it a secret, or do you plan to blog about it?

I intend to 'blog about it when I have something to release. Otherwise, I'm just arm-waving about vapourware.

I'm the forum administrator and lead developer of Rel. Email me at dave@armchair.mb.ca with the Subject 'TTM Forum'. Download Rel from https://reldb.org

#12 · June 8, 2019, 2:48 pm

Quote from Dave Voorhis on June 8, 2019, 10:14 am

Quote from dandl on June 8, 2019, 1:09 am

Is it a secret, or do you plan to blog about it?

I intend to 'blog about it when I have something to release. Otherwise, I'm just arm-waving about vapourware.

"What is now proved was once, only imagin'd." —William Blake, "Proverbs of Hell"

#13 · June 12, 2019, 6:11 pm

Quote from David Livingstone on June 7, 2019, 10:36 am

A lot of computing is conditioned by compiled languages. There everything has to be static so that a programmer can write program code that works for all (the limited range of) possible cases. Interpreted languages can be be dynamic. Therefore a programmer can write code that inspects each case first, and then processes the data on the basis of what is found.

Perhaps we just need to add more dynamic typing to relational DBMSs ?

David Livingstone

There is no such necessary relationship between interpreted languages and dynamic typing. Common Lisp and Scheme are dynamically typed languages, but there are implementations of both that compile to machine code (via C or not) ahead of time. Cecil, also dynamically typed, has as far as I know only a compiler implementation. Often a good deal of runtime type checking can be compiled away.

That isn't even considering the JIT compilers for Smalltalk, Self, Java and other JVM languages, and C# and other CLR languages, Pure, Julia, etc.

#14 · June 14, 2019, 6:48 pm

Quote from johnwcowan on June 12, 2019, 6:11 pm

Quote from David Livingstone on June 7, 2019, 10:36 a

There is no such necessary relationship between interpreted languages and dynamic typing.

I take your point. The relationship is not NECESSARY.

To clarify perhaps - what I was thinking of were languages which allow meta data about variables, etc, to be obtained via some operator/assignment/... in the language - often called 'Reflection' these days - AND then allow that data to be used within an expression or statement in the language.

For example in such a relational DB language, one could write an expression(s) to obtain the sets of attributes (names and types) of several existing DB relvars, and then (say) write an expression which derives a new set of attributes from those sets (using Unions, Intersects, Differences, as required), and then finally create a new DB relvar with that new set of derived attributes.

All I meant to say was that in my experience, the habits of thought of programmers who use compiled languages all the time, tend not to think in ways consistent with the above example.

But as you point out, every language is different in what it provides, and every programmer has different experiences which condition their typical way of thinking about coding.

David Livingstone

#15 · June 15, 2019, 12:50 am

Quote from dandl on June 15, 2019, 12:50 am

Quote from David Livingstone on June 14, 2019, 6:48 pm

Quote from johnwcowan on June 12, 2019, 6:11 pm

Quote from David Livingstone on June 7, 2019, 10:36 a

There is no such necessary relationship between interpreted languages and dynamic typing.

I take your point. The relationship is not NECESSARY.

To clarify perhaps - what I was thinking of were languages which allow meta data about variables, etc, to be obtained via some operator/assignment/... in the language - often called 'Reflection' these days - AND then allow that data to be used within an expression or statement in the language.

For example in such a relational DB language, one could write an expression(s) to obtain the sets of attributes (names and types) of several existing DB relvars, and then (say) write an expression which derives a new set of attributes from those sets (using Unions, Intersects, Differences, as required), and then finally create a new DB relvar with that new set of derived attributes.

All I meant to say was that in my experience, the habits of thought of programmers who use compiled languages all the time, tend not to think in ways consistent with the above example.

But as you point out, every language is different in what it provides, and every programmer has different experiences which condition their typical way of thinking about coding.

Here I sit firmly on the fence, but I reject your core proposition. I love using dynamic languages for small programs, because I get results fast. Ruby is my preferred poison, but I've used them all.

And I hate using dynamic languages for larger programs, because they break so easily. My habit is to rely heavily on the compiler and IDE to catch typos, reduce typing and allow brutal refactoring. When I write JavaScript I have to waste heaps of time on fixing dumb bugs and writing silly unit tests because I don't have a compiler to do it for me. JS sucks, TS rules.

But your core proposition is a purported advantage for dynamic languages in managing de novo types, and this I reject. There is none. Static compilation can easily figure out the types of newly created relvars and other types. Either the programmer wrote them and can rely on the compiler to check them, or the language provides compile-time type operators and the language can generate them so they're never wrong. Haskell is one example of a language that does this, but there are others. An industrial D should be one of them.

Quote from David Livingstone on June 14, 2019, 6:48 pm

Quote from johnwcowan on June 12, 2019, 6:11 pm

Quote from David Livingstone on June 7, 2019, 10:36 a

There is no such necessary relationship between interpreted languages and dynamic typing.

I take your point. The relationship is not NECESSARY.

To clarify perhaps - what I was thinking of were languages which allow meta data about variables, etc, to be obtained via some operator/assignment/... in the language - often called 'Reflection' these days - AND then allow that data to be used within an expression or statement in the language.

For example in such a relational DB language, one could write an expression(s) to obtain the sets of attributes (names and types) of several existing DB relvars, and then (say) write an expression which derives a new set of attributes from those sets (using Unions, Intersects, Differences, as required), and then finally create a new DB relvar with that new set of derived attributes.

All I meant to say was that in my experience, the habits of thought of programmers who use compiled languages all the time, tend not to think in ways consistent with the above example.

But as you point out, every language is different in what it provides, and every programmer has different experiences which condition their typical way of thinking about coding.

Here I sit firmly on the fence, but I reject your core proposition. I love using dynamic languages for small programs, because I get results fast. Ruby is my preferred poison, but I've used them all.

And I hate using dynamic languages for larger programs, because they break so easily. My habit is to rely heavily on the compiler and IDE to catch typos, reduce typing and allow brutal refactoring. When I write JavaScript I have to waste heaps of time on fixing dumb bugs and writing silly unit tests because I don't have a compiler to do it for me. JS sucks, TS rules.

But your core proposition is a purported advantage for dynamic languages in managing de novo types, and this I reject. There is none. Static compilation can easily figure out the types of newly created relvars and other types. Either the programmer wrote them and can rely on the compiler to check them, or the language provides compile-time type operators and the language can generate them so they're never wrong. Haskell is one example of a language that does this, but there are others. An industrial D should be one of them.

Andl - A New Database Language - andl.org

#16 · June 15, 2019, 1:15 am

Quote from dandl on June 15, 2019, 12:50 am

And I hate using dynamic languages for larger programs, because they break so easily. My habit is to rely heavily on the compiler and IDE to catch typos, reduce typing and allow brutal refactoring. When I write JavaScript I have to waste heaps of time on fixing dumb bugs and writing silly unit tests because I don't have a compiler to do it for me. JS sucks, TS rules.

~~ shrug ~~ To each their own. I have lots of experience with both dynamic and static languages at both large and small scales, and I don't have strong preferences for either. Then again, I am not obsessed with speed of development.

But your core proposition is a purported advantage for dynamic languages in managing de novo types, and this I reject. There is none. Static compilation can easily figure out the types of newly created relvars and other types. Either the programmer wrote them and can rely on the compiler to check them, or the language provides compile-time type operators and the language can generate them so they're never wrong. Haskell is one example of a language that does this, but there are others. An industrial D should be one of them.

Reflection is basically an escape hatch by which static languages can do dynamic things. That's distinct from type inference, which is an unmitigated good and all statically typed programming languages should have it.

#17 · June 15, 2019, 5:01 am

Quote from AntC on June 15, 2019, 5:01 am

Quote from David Livingstone on June 14, 2019, 6:48 pm

Quote from johnwcowan on June 12, 2019, 6:11 pm

Quote from David Livingstone on June 7, 2019, 10:36 a

There is no such necessary relationship between interpreted languages and dynamic typing.

I take your point. The relationship is not NECESSARY.

To clarify perhaps - what I was thinking of were languages which allow meta data about variables, etc, to be obtained via some operator/assignment/... in the language - often called 'Reflection' these days - AND then allow that data to be used within an expression or statement in the language.

For example in such a relational DB language, one could write an expression(s) to obtain the sets of attributes (names and types) of several existing DB relvars, and then (say) write an expression which derives a new set of attributes from those sets (using Unions, Intersects, Differences, as required), and then finally create a new DB relvar with that new set of derived attributes.

David (L), I think you're being unduly pessimistic about the capabilities of static type inference. The headings (and keys) of relations/relvars are known statically. Then we can infer the heading (and keys) of any derived relational expression. Thanks to TTM's approach of headings being sets, as you say, we need only a few relational operators:

The Union of two headings is the heading of the JOIN of relations with those headings.

For the Difference of two headings there's no operation within "the usual operators of the relational algebra" (RM Pre 18), but if your algebra is expressively complete [**], you can take the heading from 'generalised REMOVE': r1 REMOVE r2 is r1 project away all attributes in common with r2.

For the Intersection (of headings), there's already ISBL generalised UNION ≡ SQL UNION CORRESPONDING ≡ Tropashko's Inner Union: note this is union of the tuples but intersection of the headings, that is: project each operand relation on the attributes in common, union the projected tuples.

All I meant to say was that in my experience, the habits of thought of programmers who use compiled languages all the time, tend not to think in ways consistent with the above example.

I use compiled/static languages all the time. I've already built libraries of operators that derive Union/Intersection/Difference of headings statically. All I did was borrow work from people doing that (in Haskell) fifteen years ago. In my experience, the habits of thought of programmers who use dynamically-typed languages all the time, tend not to not understand how powerful static type inference can be.

Haskell also has powerful facilities for 'Reflection'/generic programming -- statically type-safe. But I didn't need to draw on any of that.

[Note **] 'expressively complete'. I deliberately avoided Codd's 'relationally complete' there. I find his 1972 definitions fraught with difficulties, and almost certainly inadequate (insofar as it's possible to make them precise). For me, 'complete' includes at least "closed over relations", then IMO all relational operators should take only relations as operands (as do JOIN, UNION, [NOT] MATCHING), even if we allow some operations with (sets of) attribute names baked in (such as project/remove, RENAME/EXTEND). Arguably those baked-in attribute names are not operands, because (at least in Tutorial D) they must be name literals/constants, and (sets of) attribute names are not first-class.

Then I prefer Tropashko's set of operators, which can be proved to be expressively complete under a lattice interpretation of relations. In fact we need only one operator -- I choose JOIN -- all others can be defined in terms of it, using lattice properties.

Quote from David Livingstone on June 14, 2019, 6:48 pm

Quote from johnwcowan on June 12, 2019, 6:11 pm

Quote from David Livingstone on June 7, 2019, 10:36 a

There is no such necessary relationship between interpreted languages and dynamic typing.

I take your point. The relationship is not NECESSARY.

To clarify perhaps - what I was thinking of were languages which allow meta data about variables, etc, to be obtained via some operator/assignment/... in the language - often called 'Reflection' these days - AND then allow that data to be used within an expression or statement in the language.

For example in such a relational DB language, one could write an expression(s) to obtain the sets of attributes (names and types) of several existing DB relvars, and then (say) write an expression which derives a new set of attributes from those sets (using Unions, Intersects, Differences, as required), and then finally create a new DB relvar with that new set of derived attributes.

David (L), I think you're being unduly pessimistic about the capabilities of static type inference. The headings (and keys) of relations/relvars are known statically. Then we can infer the heading (and keys) of any derived relational expression. Thanks to TTM's approach of headings being sets, as you say, we need only a few relational operators:

The Union of two headings is the heading of the JOIN of relations with those headings.
For the Difference of two headings there's no operation within "the usual operators of the relational algebra" (RM Pre 18), but if your algebra is expressively complete [**], you can take the heading from 'generalised REMOVE': r1 REMOVE r2 is r1 project away all attributes in common with r2.
For the Intersection (of headings), there's already ISBL generalised UNION ≡ SQL UNION CORRESPONDING ≡ Tropashko's Inner Union: note this is union of the tuples but intersection of the headings, that is: project each operand relation on the attributes in common, union the projected tuples.

All I meant to say was that in my experience, the habits of thought of programmers who use compiled languages all the time, tend not to think in ways consistent with the above example.

I use compiled/static languages all the time. I've already built libraries of operators that derive Union/Intersection/Difference of headings statically. All I did was borrow work from people doing that (in Haskell) fifteen years ago. In my experience, the habits of thought of programmers who use dynamically-typed languages all the time, tend not to not understand how powerful static type inference can be.

Haskell also has powerful facilities for 'Reflection'/generic programming -- statically type-safe. But I didn't need to draw on any of that.

[Note **] 'expressively complete'. I deliberately avoided Codd's 'relationally complete' there. I find his 1972 definitions fraught with difficulties, and almost certainly inadequate (insofar as it's possible to make them precise). For me, 'complete' includes at least "closed over relations", then IMO all relational operators should take only relations as operands (as do JOIN, UNION, [NOT] MATCHING), even if we allow some operations with (sets of) attribute names baked in (such as project/remove, RENAME/EXTEND). Arguably those baked-in attribute names are not operands, because (at least in Tutorial D) they must be name literals/constants, and (sets of) attribute names are not first-class.

Then I prefer Tropashko's set of operators, which can be proved to be expressively complete under a lattice interpretation of relations. In fact we need only one operator -- I choose JOIN -- all others can be defined in terms of it, using lattice properties.

#18 · June 15, 2019, 10:23 pm

Quote from AntC on June 15, 2019, 5:01 am

Then I prefer Tropashko's set of operators, which can be proved to be expressively complete under a lattice interpretation of relations. In fact we need only one operator -- I choose JOIN -- all others can be defined in terms of it, using lattice properties.

Tropashko gives UNION and JOIN as his minimal base at the expense of a number of infinite relations, but how do you reduce UNION to JOIN?

I wonder if the nullology prescription should be extended to omnology (totology? ugly either way) as well. Has anyone done the work to demonstrate that all the operators do the Right Thing on relations (necessarily virtual) with an infinite number of tuples? Tropashko particularly depends on EQ and NEQ, the obvious relations whose tuples are every pair of the form (k, k) and the complement of it.

I also would like an explanation of TENSOR_PRODUCT as applied to relations (I know from nothing about tensors).

#19 · June 16, 2019, 2:20 am

Quote from AntC on June 16, 2019, 2:20 am

Quote from johnwcowan on June 15, 2019, 10:23 pm

Quote from AntC on June 15, 2019, 5:01 am

Then I prefer Tropashko's set of operators, which can be proved to be expressively complete under a lattice interpretation of relations. In fact we need only one operator -- I choose JOIN -- all others can be defined in terms of it, using lattice properties.

Tropashko gives UNION and JOIN as his minimal base ...

Careful: that's his 'Inner Union', not SQL's (or D's or set theory's) UNION. And he defines several other operations in terms of those. (Those are definitions, not reductions.)

at the expense of a number of infinite relations, but how do you reduce UNION to JOIN?

I said "defined in terms of", not "reduce". We're in a lattice structure (where the base partially ordered set is the set of relations). Not quite sure why you're bringing in "infinite": I expect the set of relations to be finitely denumerable. Then when I say "I choose JOIN", I mean I define the partial ordering (following Tropashko) using relational JOIN as lattice meet. Then InnerUnion is lattice join. Note that in usual definitions of lattice structures those two operations are taken as axiomatic, with the absorption laws, then associativity is derived. I prefer

JOIN defined as closed over the base set of relations; the operation must be commutative, associative, idempotent.

Lattice (partial) ordering: (r1 <= r2) =_df r1 == (r1 JOIN r2) -- standard definition

The identity for JOIN is DEE (by definition): forall r. DEE JOIN r == r -- i.e. lattice top

(r1 InnerUnion r2) =_df r3 s.t. (r1 == (r1 JOIN r3)) ∧ (r2 == (r2 JOIN r3)) and r3 is the least such (by lattice ordering), that is: forall r3'. (r3' == (r1 JOIN r3')) ∧ (r3' == (r2 JOIN r3')) ⇒ (r3 == (r3 JOIN r3')).

There's a side condition on JOIN (which applies mutatis mutandis for every definition in terms of JOIN): the operands to JOIN must be join-compatible, that is attribute names in common must be at same attribute type. Otherwise the result of JOIN is undefined. Then operands to InnerUnion must be join-compatible. (Tropashko doesn't apply this side-condition: his attributes are untyped.)

We can show InnerUnion exists and is unique for all pairs of relations (providing those JOINs are well-defined). Also can show DEE exists and is unique.

I wonder if the nullology prescription should be extended to omnology (totology? ugly either way) as well. Has anyone done the work to demonstrate that all the operators do the Right Thing on relations (necessarily virtual) with an infinite number of tuples? Tropashko particularly depends on EQ and NEQ, the obvious relations whose tuples are every pair of the form (k, k) and the complement of it.

Again I'm not sure what you're getting at. Tropashko's operators are domain-independent. (Unlike Appendix A's <OR>, <NOT>.) Then yes it goes fine with infinite relations. AFAIR his definitional semantics doesn't draw on dis-equality of relations, only on equality. Must admit I'm a little rusty on it. To be precise, he does explore/model a lattice structure 'at right angles' to the JOIN-based one, with DUM aka R00 as lattice bottom for Appendix A's <OR> and his R11 as lattice top. That won't go well with infinite relations; but none of that modelling is essential to his core operators.

If we follow Hall, Hitchcock, Todd 1975's idea of 'algorithmic relations' then those are potentially infinite. But HHT give rules for constructing queries (similar to the rules for 'safe queries') to make sure you never expose an infinite relation outermost in a query. Those rules are easily expressed using type inference for the relational operators (treating algorithmic relations as a different type constructor vs grounded relations). That type inference is statically type safe, David L please note.

I also would like an explanation of TENSOR_PRODUCT as applied to relations (I know from nothing about tensors).

There used to be a contributor here talked in terms of cylindric algebras/Tarski semantics for relations. I never managed to get my head round that.

Quote from johnwcowan on June 15, 2019, 10:23 pm

Quote from AntC on June 15, 2019, 5:01 am

Then I prefer Tropashko's set of operators, which can be proved to be expressively complete under a lattice interpretation of relations. In fact we need only one operator -- I choose JOIN -- all others can be defined in terms of it, using lattice properties.

Tropashko gives UNION and JOIN as his minimal base ...

Careful: that's his 'Inner Union', not SQL's (or D's or set theory's) UNION. And he defines several other operations in terms of those. (Those are definitions, not reductions.)

at the expense of a number of infinite relations, but how do you reduce UNION to JOIN?

I said "defined in terms of", not "reduce". We're in a lattice structure (where the base partially ordered set is the set of relations). Not quite sure why you're bringing in "infinite": I expect the set of relations to be finitely denumerable. Then when I say "I choose JOIN", I mean I define the partial ordering (following Tropashko) using relational JOIN as lattice meet. Then InnerUnion is lattice join. Note that in usual definitions of lattice structures those two operations are taken as axiomatic, with the absorption laws, then associativity is derived. I prefer

JOIN defined as closed over the base set of relations; the operation must be commutative, associative, idempotent.
Lattice (partial) ordering: (r1 <= r2) =_df r1 == (r1 JOIN r2) -- standard definition
The identity for JOIN is DEE (by definition): forall r. DEE JOIN r == r -- i.e. lattice top
(r1 InnerUnion r2) =_df r3 s.t. (r1 == (r1 JOIN r3)) ∧ (r2 == (r2 JOIN r3)) and r3 is the least such (by lattice ordering), that is: forall r3'. (r3' == (r1 JOIN r3')) ∧ (r3' == (r2 JOIN r3')) ⇒ (r3 == (r3 JOIN r3')).

There's a side condition on JOIN (which applies mutatis mutandis for every definition in terms of JOIN): the operands to JOIN must be join-compatible, that is attribute names in common must be at same attribute type. Otherwise the result of JOIN is undefined. Then operands to InnerUnion must be join-compatible. (Tropashko doesn't apply this side-condition: his attributes are untyped.)

We can show InnerUnion exists and is unique for all pairs of relations (providing those JOINs are well-defined). Also can show DEE exists and is unique.

I wonder if the nullology prescription should be extended to omnology (totology? ugly either way) as well. Has anyone done the work to demonstrate that all the operators do the Right Thing on relations (necessarily virtual) with an infinite number of tuples? Tropashko particularly depends on EQ and NEQ, the obvious relations whose tuples are every pair of the form (k, k) and the complement of it.

Again I'm not sure what you're getting at. Tropashko's operators are domain-independent. (Unlike Appendix A's <OR>, <NOT>.) Then yes it goes fine with infinite relations. AFAIR his definitional semantics doesn't draw on dis-equality of relations, only on equality. Must admit I'm a little rusty on it. To be precise, he does explore/model a lattice structure 'at right angles' to the JOIN-based one, with DUM aka R00 as lattice bottom for Appendix A's <OR> and his R11 as lattice top. That won't go well with infinite relations; but none of that modelling is essential to his core operators.

If we follow Hall, Hitchcock, Todd 1975's idea of 'algorithmic relations' then those are potentially infinite. But HHT give rules for constructing queries (similar to the rules for 'safe queries') to make sure you never expose an infinite relation outermost in a query. Those rules are easily expressed using type inference for the relational operators (treating algorithmic relations as a different type constructor vs grounded relations). That type inference is statically type safe, David L please note.

I also would like an explanation of TENSOR_PRODUCT as applied to relations (I know from nothing about tensors).

There used to be a contributor here talked in terms of cylindric algebras/Tarski semantics for relations. I never managed to get my head round that.

#20 · June 16, 2019, 4:27 am

Quote from AntC on June 16, 2019, 2:20 am

Careful: that's his 'Inner Union', not SQL's (or D's or set theory's) UNION. And he defines several other operations in terms of those.

Ah. Clearly I need to read the paper more closely.

I said "defined in terms of", not "reduce".

What's the distinction? Where I come from, if you can define A in terms of B (using a constructive definition, at least), you have reduced A to B.

Not quite sure why you're bringing in "infinite": I expect the set of relations to be finitely denumerable.

No, it's the relations themselves that T uses in his definitions of RENAME and MINUS that contain an infinite number of tuples. Siimlarly, the niladic version of INTERSECT in the TTM book requires a heading to be supplied, and it is the universal relation for that heading; i.e. all possible tuples that meet the type restrictions. This is given as the only case where INTERSECT and JOIN differ.

There's a side condition on JOIN (which applies mutatis mutandis for every definition in terms of JOIN): the operands to JOIN must be join-compatible, that is attribute names in common must be at same attribute type.

See my post on type-indexed rows, which makes that tautologous.

Again I'm not sure what you're getting at. Tropashko's operators are domain-independent. (Unlike Appendix A's <OR>, <NOT>.) Then yes it goes fine with infinite relations.

But TTM, as noted above, requires relations of infinite cardinality already. Once you have exposed such things, you have to know what they are going to do when operated on.

The Forum for Discussion about The Third Manifesto and Related Matters