The Third Manifesto ForumDatabases: Database TheoryWhat do set-based operations buy …

Please or Register to create posts and topics.

What do set-based operations buy us?

#21 · February 19, 2021, 1:19 pm

Quote from Dave Voorhis on February 19, 2021, 1:19 pm

Quote from tobega on February 19, 2021, 12:39 pm

Quote from Dave Voorhis on February 19, 2021, 11:35 am

Quote from tobega on February 19, 2021, 11:09 am

Quote from Dave Voorhis on February 19, 2021, 9:40 am

Quote from tobega on February 19, 2021, 5:22 am

On another tack, I will still note that the SQL recursive CTE version will calculate the correct value even for the sloppy predicates, so it is arguably easier to use. Since we seem to agree that a language should make it easier for programmers to get correct results and harder for them to create obscure bugs, judging by the strong stance on static typing, I still worry that we might be straining at gnats and swallowing camels. In studies performed, static languages do not have lower defect rates over all and as long as the language has strong typing, any type errors are easily found, so the bugs are not obscure.

Comparing defect rates between statically and dynamically typed languages strikes me as largely missing the point. I wouldn't expect a dynamically-typed program to be any buggier than an equivalent statically-typed program -- assuming the developer is competent.

The real difference is the amount of effort required to create the program.

At the risk of getting into an undesirable side-discussion, you are actually not entirely correct, effort also seems to be on par (but you may be confusing static/dynamic with the weakly/strongly typed distinction). Capers Jones did a study on hours spent per code point, also bug frequency, for lots of languages and Smalltalk was one of the most efficient, almost twice as productive as Java/C#. Smalltalk is dynamic but strongly typed. On the other hand, JavaScript (weak dynamic) lagged significantly behind Java/C#. I believe that JavaScript programmers deceive themselves by confusing activity for productivity.

Smalltalk is a rather special case, an uber-consistent language with an unusually well-integrated toolset. Is Smalltalk productivity due to its dynamic typing, or in spite of it?

I'd argue it's in spite of it. A comparison of Smalltalk vs Strongtalk would be interesting.

I think the main take-away is that the effect of typing is both smaller and different to what we tend to imagine.

Typing model is perhaps more a handy label than an essential and defining characteristic; it's a way of easily categorising languages whose utility (or lack thereof) is defined by more than just typing.

That said, it raises a question: What benefit does dynamic typing give you?

Imagine a hypothetical C# or Java, same as now but variables do not have declared types and can be arbitrarily assigned values of any type. E.g., C# as now, but every variable is implicitly, automatically, and unavoidably type dynamic. What, if anything, do we gain from that?

If the answer is nothing (or little) then perhaps the same applies to the usual set of popular dynamically typed languages.

Research aside, I don't think anyone that's actually had to maintain real large code bases would argue that the difference in maintenance effort in the popular dynamically typed languages (PHP, Python, JavaScript, Ruby, shell scripts) vs statically, manifest-typed (or mainly manifest-typed) languages is anything but profound, unless you're fortunate to be working on a dynamically-typed code-base that uses only primitive types. Then it can be manageable. If it defines classes (e.g., PHP, Python) extensively, then it's a nightmare.

In maintenance I would say the documentation effect is very strong. Any code with explicitly stated types is much easier to analyze, so without them you would have to spend much more time just understanding how and where to modify the code.

When it comes to Python there are plenty of weaknesses, especially in how classes are defined, making it hard to understand what was defined and where, but it's not necessarily down to typing. I have had to try to maintain a largeish chunk of Python classes and I neither know nor like the language, it was a nightmare for me. But the entire production code of YouTube is written in Python, they have had to evolve some specific disciplines, but seem to be managing all right. As a counterpoint the code review tool that had been written in Python by none less than Guido van Rossum himself was rewritten in java to make it more maintainable.

It's not unusual to see language justifications of the form "<insert big site here> extensively uses <dynamically-typed language>."

I wonder how often <dynamically-typed language> is used in single large projects, as opposed to being hundreds or thousands of wee independent scripts that are a nice manageable eight to forty or so lines apiece.

I've just spent the morning documenting a gaggle of cron jobs on a server. You could easily argue that the business successfully runs thousands of lines of shell script, and in a sense it does, except most of them are entirely self-contained and tiny. That's easy to maintain. The same number of lines in one project-worth of interdependencies would be a nightmare.

I also wonder how often a site or service has grown from a manageable Python (for example) prototype maintained by happy developers, to a massive behemoth of legacy Pythonic production horror maintained by miserable developers who would like nothing more than to rewrite it in C#, Java, or whatever but there's no money, time, and/or management support. ("You guys chose to use Python! You wanted to use Python! Why are you complaining now?")

Quote from tobega on February 19, 2021, 12:39 pm

Quote from Dave Voorhis on February 19, 2021, 11:35 am

Quote from tobega on February 19, 2021, 11:09 am

Quote from Dave Voorhis on February 19, 2021, 9:40 am

Quote from tobega on February 19, 2021, 5:22 am

On another tack, I will still note that the SQL recursive CTE version will calculate the correct value even for the sloppy predicates, so it is arguably easier to use. Since we seem to agree that a language should make it easier for programmers to get correct results and harder for them to create obscure bugs, judging by the strong stance on static typing, I still worry that we might be straining at gnats and swallowing camels. In studies performed, static languages do not have lower defect rates over all and as long as the language has strong typing, any type errors are easily found, so the bugs are not obscure.

Comparing defect rates between statically and dynamically typed languages strikes me as largely missing the point. I wouldn't expect a dynamically-typed program to be any buggier than an equivalent statically-typed program -- assuming the developer is competent.

The real difference is the amount of effort required to create the program.

At the risk of getting into an undesirable side-discussion, you are actually not entirely correct, effort also seems to be on par (but you may be confusing static/dynamic with the weakly/strongly typed distinction). Capers Jones did a study on hours spent per code point, also bug frequency, for lots of languages and Smalltalk was one of the most efficient, almost twice as productive as Java/C#. Smalltalk is dynamic but strongly typed. On the other hand, JavaScript (weak dynamic) lagged significantly behind Java/C#. I believe that JavaScript programmers deceive themselves by confusing activity for productivity.

Smalltalk is a rather special case, an uber-consistent language with an unusually well-integrated toolset. Is Smalltalk productivity due to its dynamic typing, or in spite of it?

I'd argue it's in spite of it. A comparison of Smalltalk vs Strongtalk would be interesting.

I think the main take-away is that the effect of typing is both smaller and different to what we tend to imagine.

Typing model is perhaps more a handy label than an essential and defining characteristic; it's a way of easily categorising languages whose utility (or lack thereof) is defined by more than just typing.

That said, it raises a question: What benefit does dynamic typing give you?

Imagine a hypothetical C# or Java, same as now but variables do not have declared types and can be arbitrarily assigned values of any type. E.g., C# as now, but every variable is implicitly, automatically, and unavoidably type dynamic. What, if anything, do we gain from that?

If the answer is nothing (or little) then perhaps the same applies to the usual set of popular dynamically typed languages.

Research aside, I don't think anyone that's actually had to maintain real large code bases would argue that the difference in maintenance effort in the popular dynamically typed languages (PHP, Python, JavaScript, Ruby, shell scripts) vs statically, manifest-typed (or mainly manifest-typed) languages is anything but profound, unless you're fortunate to be working on a dynamically-typed code-base that uses only primitive types. Then it can be manageable. If it defines classes (e.g., PHP, Python) extensively, then it's a nightmare.

In maintenance I would say the documentation effect is very strong. Any code with explicitly stated types is much easier to analyze, so without them you would have to spend much more time just understanding how and where to modify the code.

When it comes to Python there are plenty of weaknesses, especially in how classes are defined, making it hard to understand what was defined and where, but it's not necessarily down to typing. I have had to try to maintain a largeish chunk of Python classes and I neither know nor like the language, it was a nightmare for me. But the entire production code of YouTube is written in Python, they have had to evolve some specific disciplines, but seem to be managing all right. As a counterpoint the code review tool that had been written in Python by none less than Guido van Rossum himself was rewritten in java to make it more maintainable.

It's not unusual to see language justifications of the form "<insert big site here> extensively uses <dynamically-typed language>."

I wonder how often <dynamically-typed language> is used in single large projects, as opposed to being hundreds or thousands of wee independent scripts that are a nice manageable eight to forty or so lines apiece.

I've just spent the morning documenting a gaggle of cron jobs on a server. You could easily argue that the business successfully runs thousands of lines of shell script, and in a sense it does, except most of them are entirely self-contained and tiny. That's easy to maintain. The same number of lines in one project-worth of interdependencies would be a nightmare.

I also wonder how often a site or service has grown from a manageable Python (for example) prototype maintained by happy developers, to a massive behemoth of legacy Pythonic production horror maintained by miserable developers who would like nothing more than to rewrite it in C#, Java, or whatever but there's no money, time, and/or management support. ("You guys chose to use Python! You wanted to use Python! Why are you complaining now?")

I'm the forum administrator and lead developer of Rel. Email me at dave@armchair.mb.ca with the Subject 'TTM Forum'. Download Rel from https://reldb.org

#22 · February 19, 2021, 2:42 pm

Quote from tobega on February 19, 2021, 2:42 pm

Quote from Dave Voorhis on February 19, 2021, 1:19 pm

Quote from tobega on February 19, 2021, 12:39 pm

Quote from Dave Voorhis on February 19, 2021, 11:35 am

Quote from tobega on February 19, 2021, 11:09 am

Quote from Dave Voorhis on February 19, 2021, 9:40 am

Quote from tobega on February 19, 2021, 5:22 am

On another tack, I will still note that the SQL recursive CTE version will calculate the correct value even for the sloppy predicates, so it is arguably easier to use. Since we seem to agree that a language should make it easier for programmers to get correct results and harder for them to create obscure bugs, judging by the strong stance on static typing, I still worry that we might be straining at gnats and swallowing camels. In studies performed, static languages do not have lower defect rates over all and as long as the language has strong typing, any type errors are easily found, so the bugs are not obscure.

Comparing defect rates between statically and dynamically typed languages strikes me as largely missing the point. I wouldn't expect a dynamically-typed program to be any buggier than an equivalent statically-typed program -- assuming the developer is competent.

The real difference is the amount of effort required to create the program.

At the risk of getting into an undesirable side-discussion, you are actually not entirely correct, effort also seems to be on par (but you may be confusing static/dynamic with the weakly/strongly typed distinction). Capers Jones did a study on hours spent per code point, also bug frequency, for lots of languages and Smalltalk was one of the most efficient, almost twice as productive as Java/C#. Smalltalk is dynamic but strongly typed. On the other hand, JavaScript (weak dynamic) lagged significantly behind Java/C#. I believe that JavaScript programmers deceive themselves by confusing activity for productivity.

Smalltalk is a rather special case, an uber-consistent language with an unusually well-integrated toolset. Is Smalltalk productivity due to its dynamic typing, or in spite of it?

I'd argue it's in spite of it. A comparison of Smalltalk vs Strongtalk would be interesting.

I think the main take-away is that the effect of typing is both smaller and different to what we tend to imagine.

Typing model is perhaps more a handy label than an essential and defining characteristic; it's a way of easily categorising languages whose utility depends on more than just typing.

That said, it raises a question: What benefit does dynamic typing give you?

Imagine a hypothetical C# or Java, same as now but variables do not have declared types and can be arbitrarily assigned values of any type. E.g., C# as now, but every variable is implicitly, automatically, and unavoidably type dynamic. What, if anything, do we gain from that?

I think that is the right question to ask. (Although is that actually Clojure you're describing?)

The beginning of the answer is the same as what you get from type inference, but I think that is much too shallow and I don't really know the answer.

I have always been firmly in the static typing camp, but I have to acknowledge that the research doesn't necessarily support a very strong stance. I mentioned how Uncle Bob has swung over and the basis of the argument seems to be that TDD mandates about the same amount of tests whatever your type system, but you avoid having to do all the type declaration stuff in a dynamic system. (On trying to look up solid arguments I found this which was somewhat interesting: https://labs.ig.com/static-typing-promise )

Go is a bit of an interesting case because it counts as statically typed but interfaces actually work by duck-typing, so you don't have to specifically mention the interfaces you implement on the implementing side. Each consumer can have its own partially overlapping interface requirement.

My Tailspin language is currently dynamically typed but I have the intention of adding explicit typing to it, possibly by contracts a la Eiffel. So far it feels quite nice to just create types on the fly, but I can't say it's really compelling. Well, it is a lot easier than implementing a java value class with builder and getters, but that's not necessarily the typing, rather all the other stuff. Java's records that are coming are mostly lovely.

Occasionally it is convenient to be able to put in any type. In java I certainly use Object and instanceof occasionally, rather than try to pin down five levels of generics. Also sometimes useful to output one result type and one error type. But what if the type system has proper support for sum types?

Andreas Stefik has tried to create an evidence-based programming language called Quorum. Don't know how great it is, but it seems to be used a lot by blind people. I think it is from there that I got the idea that typing information is mainly important on the specification of functions/methods.

Perhaps the thing is really how we can make use of typing lighter and keep the benefits where they matter and relax things where they don't. The really big problems are the bugs that are hard to find and hard to fix and in that respect typing possibly doesn't do much either way even if, as I said, there are some clear benefits of explicit type information in trying to understand code.

Quote from Dave Voorhis on February 19, 2021, 1:19 pm

Quote from tobega on February 19, 2021, 12:39 pm

Quote from Dave Voorhis on February 19, 2021, 11:35 am

Quote from tobega on February 19, 2021, 11:09 am

Quote from Dave Voorhis on February 19, 2021, 9:40 am

Quote from tobega on February 19, 2021, 5:22 am

On another tack, I will still note that the SQL recursive CTE version will calculate the correct value even for the sloppy predicates, so it is arguably easier to use. Since we seem to agree that a language should make it easier for programmers to get correct results and harder for them to create obscure bugs, judging by the strong stance on static typing, I still worry that we might be straining at gnats and swallowing camels. In studies performed, static languages do not have lower defect rates over all and as long as the language has strong typing, any type errors are easily found, so the bugs are not obscure.

Comparing defect rates between statically and dynamically typed languages strikes me as largely missing the point. I wouldn't expect a dynamically-typed program to be any buggier than an equivalent statically-typed program -- assuming the developer is competent.

The real difference is the amount of effort required to create the program.

At the risk of getting into an undesirable side-discussion, you are actually not entirely correct, effort also seems to be on par (but you may be confusing static/dynamic with the weakly/strongly typed distinction). Capers Jones did a study on hours spent per code point, also bug frequency, for lots of languages and Smalltalk was one of the most efficient, almost twice as productive as Java/C#. Smalltalk is dynamic but strongly typed. On the other hand, JavaScript (weak dynamic) lagged significantly behind Java/C#. I believe that JavaScript programmers deceive themselves by confusing activity for productivity.

Smalltalk is a rather special case, an uber-consistent language with an unusually well-integrated toolset. Is Smalltalk productivity due to its dynamic typing, or in spite of it?

I'd argue it's in spite of it. A comparison of Smalltalk vs Strongtalk would be interesting.

I think the main take-away is that the effect of typing is both smaller and different to what we tend to imagine.

Typing model is perhaps more a handy label than an essential and defining characteristic; it's a way of easily categorising languages whose utility depends on more than just typing.

That said, it raises a question: What benefit does dynamic typing give you?

Imagine a hypothetical C# or Java, same as now but variables do not have declared types and can be arbitrarily assigned values of any type. E.g., C# as now, but every variable is implicitly, automatically, and unavoidably type dynamic. What, if anything, do we gain from that?

I think that is the right question to ask. (Although is that actually Clojure you're describing?)

The beginning of the answer is the same as what you get from type inference, but I think that is much too shallow and I don't really know the answer.

I have always been firmly in the static typing camp, but I have to acknowledge that the research doesn't necessarily support a very strong stance. I mentioned how Uncle Bob has swung over and the basis of the argument seems to be that TDD mandates about the same amount of tests whatever your type system, but you avoid having to do all the type declaration stuff in a dynamic system. (On trying to look up solid arguments I found this which was somewhat interesting: https://labs.ig.com/static-typing-promise )

Go is a bit of an interesting case because it counts as statically typed but interfaces actually work by duck-typing, so you don't have to specifically mention the interfaces you implement on the implementing side. Each consumer can have its own partially overlapping interface requirement.

My Tailspin language is currently dynamically typed but I have the intention of adding explicit typing to it, possibly by contracts a la Eiffel. So far it feels quite nice to just create types on the fly, but I can't say it's really compelling. Well, it is a lot easier than implementing a java value class with builder and getters, but that's not necessarily the typing, rather all the other stuff. Java's records that are coming are mostly lovely.

Occasionally it is convenient to be able to put in any type. In java I certainly use Object and instanceof occasionally, rather than try to pin down five levels of generics. Also sometimes useful to output one result type and one error type. But what if the type system has proper support for sum types?

Andreas Stefik has tried to create an evidence-based programming language called Quorum. Don't know how great it is, but it seems to be used a lot by blind people. I think it is from there that I got the idea that typing information is mainly important on the specification of functions/methods.

Perhaps the thing is really how we can make use of typing lighter and keep the benefits where they matter and relax things where they don't. The really big problems are the bugs that are hard to find and hard to fix and in that respect typing possibly doesn't do much either way even if, as I said, there are some clear benefits of explicit type information in trying to understand code.

#23 · February 19, 2021, 5:27 pm

Quote from Dave Voorhis on February 19, 2021, 5:27 pm

Quote from tobega on February 19, 2021, 2:42 pm

Quote from Dave Voorhis on February 19, 2021, 1:19 pm

Quote from tobega on February 19, 2021, 12:39 pm

Quote from Dave Voorhis on February 19, 2021, 11:35 am

Quote from tobega on February 19, 2021, 11:09 am

Quote from Dave Voorhis on February 19, 2021, 9:40 am

Quote from tobega on February 19, 2021, 5:22 am

On another tack, I will still note that the SQL recursive CTE version will calculate the correct value even for the sloppy predicates, so it is arguably easier to use. Since we seem to agree that a language should make it easier for programmers to get correct results and harder for them to create obscure bugs, judging by the strong stance on static typing, I still worry that we might be straining at gnats and swallowing camels. In studies performed, static languages do not have lower defect rates over all and as long as the language has strong typing, any type errors are easily found, so the bugs are not obscure.

Comparing defect rates between statically and dynamically typed languages strikes me as largely missing the point. I wouldn't expect a dynamically-typed program to be any buggier than an equivalent statically-typed program -- assuming the developer is competent.

The real difference is the amount of effort required to create the program.

At the risk of getting into an undesirable side-discussion, you are actually not entirely correct, effort also seems to be on par (but you may be confusing static/dynamic with the weakly/strongly typed distinction). Capers Jones did a study on hours spent per code point, also bug frequency, for lots of languages and Smalltalk was one of the most efficient, almost twice as productive as Java/C#. Smalltalk is dynamic but strongly typed. On the other hand, JavaScript (weak dynamic) lagged significantly behind Java/C#. I believe that JavaScript programmers deceive themselves by confusing activity for productivity.

Smalltalk is a rather special case, an uber-consistent language with an unusually well-integrated toolset. Is Smalltalk productivity due to its dynamic typing, or in spite of it?

I'd argue it's in spite of it. A comparison of Smalltalk vs Strongtalk would be interesting.

I think the main take-away is that the effect of typing is both smaller and different to what we tend to imagine.

Typing model is perhaps more a handy label than an essential and defining characteristic; it's a way of easily categorising languages whose utility depends on more than just typing.

That said, it raises a question: What benefit does dynamic typing give you?

Imagine a hypothetical C# or Java, same as now but variables do not have declared types and can be arbitrarily assigned values of any type. E.g., C# as now, but every variable is implicitly, automatically, and unavoidably type dynamic. What, if anything, do we gain from that?

I think that is the right question to ask. (Although is that actually Clojure you're describing?)

Not by intent. I was imagining a C# or Java where variables and parameters cannot have type annotations and can accept values of any type. I.e., pure dynamic typing, but the language is otherwise the same. If you could compare that with statically-typed C# or Java, that would be a fair comparison, unlike the usual language debates that compare Python apples to Java oranges, or JavaScript grapes to Kotlin kiwis, etc., i.e., nothing alike.

I suspect you can emulate it pretty closely in C# by declaring every variable and parameter to be type dynamic. There isn't a Java equivalent. (Declare all variables and parameters as Object doesn't do it -- you'd need a barrage of casts, at least, so not a realistic comparison with "real" dynamically-typed languages.)

The beginning of the answer is the same as what you get from type inference, but I think that is much too shallow and I don't really know the answer.

I have always been firmly in the static typing camp, but I have to acknowledge that the research doesn't necessarily support a very strong stance. I mentioned how Uncle Bob has swung over and the basis of the argument seems to be that TDD mandates about the same amount of tests whatever your type system, but you avoid having to do all the type declaration stuff in a dynamic system.

I see the but-you'll-have-unit-tests argument come up frequently, which is fine for type safety... If you have unit tests. For various reasons, real projects often don't. Maybe they should, but they don't. Sometimes the brief is, "Dave, this code has no unit tests, for... Reasons. Add unit tests. Figure out what it does whilst you're at it, and have the eight new Jiras I've assigned you done by Tuesday."

If it's Java or C#, it's manageable. If it's a typical no-type-manifests dynamically-typed language, that's grown from somebody's pet project into a multi-headed, forty-armed, has-legs-and-wheels-and-wings mission-critical monster, then...

Those distant vomiting noises you hear are me.

Of course, that's just type safety. It's no help at all for code readability. That -- as we've both noted -- depends largely on manifest typing, which is notionally orthogonal to static/dynamic typing but code written in dynamically-typed languages with optional type annotations usually doesn't have any (the compiler doesn't enforce them, so why use them?) so you're back to the same old readability problem.

(On trying to look up solid arguments I found this which was somewhat interesting: https://labs.ig.com/static-typing-promise )

The 100% code coverage suggested by http://blog.cleancoder.com/uncle-bob/2016/05/01/TypeWars.html simply doesn't happen. It should, but... No.

There's the converse argument, too -- if we didn't use dynamically-typed languages, maybe we wouldn't need 100% code coverage...

Go is a bit of an interesting case because it counts as statically typed but interfaces actually work by duck-typing, so you don't have to specifically mention the interfaces you implement on the implementing side. Each consumer can have its own partially overlapping interface requirement.

My Tailspin language is currently dynamically typed but I have the intention of adding explicit typing to it, possibly by contracts a la Eiffel. So far it feels quite nice to just create types on the fly, but I can't say it's really compelling. Well, it is a lot easier than implementing a java value class with builder and getters, but that's not necessarily the typing, rather all the other stuff. Java's records that are coming are mostly lovely.

Occasionally it is convenient to be able to put in any type. In java I certainly use Object and instanceof occasionally, rather than try to pin down five levels of generics.

I use Object when I really mean any Java instance, and instanceof almost never. They tend to show up more often in tools than applications, too. Indeed, I can't remember the last time I used Object or instanceof in a business application.

I never use them out of convenience, or as a shortcut. There needs to be a compelling reason. Lazy coding isn't one.

Also sometimes useful to output one result type and one error type.

I usually create a class with two constructors, one for the result type and one for the error type, and an isError(). Of course, that is far from ideal.

But what if the type system has proper support for sum types?

That would be preferable.

Quote from tobega on February 19, 2021, 2:42 pm

Quote from Dave Voorhis on February 19, 2021, 1:19 pm

Quote from tobega on February 19, 2021, 12:39 pm

Quote from Dave Voorhis on February 19, 2021, 11:35 am

Quote from tobega on February 19, 2021, 11:09 am

Quote from Dave Voorhis on February 19, 2021, 9:40 am

Quote from tobega on February 19, 2021, 5:22 am

On another tack, I will still note that the SQL recursive CTE version will calculate the correct value even for the sloppy predicates, so it is arguably easier to use. Since we seem to agree that a language should make it easier for programmers to get correct results and harder for them to create obscure bugs, judging by the strong stance on static typing, I still worry that we might be straining at gnats and swallowing camels. In studies performed, static languages do not have lower defect rates over all and as long as the language has strong typing, any type errors are easily found, so the bugs are not obscure.

Comparing defect rates between statically and dynamically typed languages strikes me as largely missing the point. I wouldn't expect a dynamically-typed program to be any buggier than an equivalent statically-typed program -- assuming the developer is competent.

The real difference is the amount of effort required to create the program.

At the risk of getting into an undesirable side-discussion, you are actually not entirely correct, effort also seems to be on par (but you may be confusing static/dynamic with the weakly/strongly typed distinction). Capers Jones did a study on hours spent per code point, also bug frequency, for lots of languages and Smalltalk was one of the most efficient, almost twice as productive as Java/C#. Smalltalk is dynamic but strongly typed. On the other hand, JavaScript (weak dynamic) lagged significantly behind Java/C#. I believe that JavaScript programmers deceive themselves by confusing activity for productivity.

Smalltalk is a rather special case, an uber-consistent language with an unusually well-integrated toolset. Is Smalltalk productivity due to its dynamic typing, or in spite of it?

I'd argue it's in spite of it. A comparison of Smalltalk vs Strongtalk would be interesting.

I think the main take-away is that the effect of typing is both smaller and different to what we tend to imagine.

Typing model is perhaps more a handy label than an essential and defining characteristic; it's a way of easily categorising languages whose utility depends on more than just typing.

That said, it raises a question: What benefit does dynamic typing give you?

Imagine a hypothetical C# or Java, same as now but variables do not have declared types and can be arbitrarily assigned values of any type. E.g., C# as now, but every variable is implicitly, automatically, and unavoidably type dynamic. What, if anything, do we gain from that?

I think that is the right question to ask. (Although is that actually Clojure you're describing?)

Not by intent. I was imagining a C# or Java where variables and parameters cannot have type annotations and can accept values of any type. I.e., pure dynamic typing, but the language is otherwise the same. If you could compare that with statically-typed C# or Java, that would be a fair comparison, unlike the usual language debates that compare Python apples to Java oranges, or JavaScript grapes to Kotlin kiwis, etc., i.e., nothing alike.

I suspect you can emulate it pretty closely in C# by declaring every variable and parameter to be type dynamic. There isn't a Java equivalent. (Declare all variables and parameters as Object doesn't do it -- you'd need a barrage of casts, at least, so not a realistic comparison with "real" dynamically-typed languages.)

The beginning of the answer is the same as what you get from type inference, but I think that is much too shallow and I don't really know the answer.

I have always been firmly in the static typing camp, but I have to acknowledge that the research doesn't necessarily support a very strong stance. I mentioned how Uncle Bob has swung over and the basis of the argument seems to be that TDD mandates about the same amount of tests whatever your type system, but you avoid having to do all the type declaration stuff in a dynamic system.

I see the but-you'll-have-unit-tests argument come up frequently, which is fine for type safety... If you have unit tests. For various reasons, real projects often don't. Maybe they should, but they don't. Sometimes the brief is, "Dave, this code has no unit tests, for... Reasons. Add unit tests. Figure out what it does whilst you're at it, and have the eight new Jiras I've assigned you done by Tuesday."

If it's Java or C#, it's manageable. If it's a typical no-type-manifests dynamically-typed language, that's grown from somebody's pet project into a multi-headed, forty-armed, has-legs-and-wheels-and-wings mission-critical monster, then...

Those distant vomiting noises you hear are me.

Of course, that's just type safety. It's no help at all for code readability. That -- as we've both noted -- depends largely on manifest typing, which is notionally orthogonal to static/dynamic typing but code written in dynamically-typed languages with optional type annotations usually doesn't have any (the compiler doesn't enforce them, so why use them?) so you're back to the same old readability problem.

(On trying to look up solid arguments I found this which was somewhat interesting: https://labs.ig.com/static-typing-promise )

The 100% code coverage suggested by http://blog.cleancoder.com/uncle-bob/2016/05/01/TypeWars.html simply doesn't happen. It should, but... No.

There's the converse argument, too -- if we didn't use dynamically-typed languages, maybe we wouldn't need 100% code coverage...

Go is a bit of an interesting case because it counts as statically typed but interfaces actually work by duck-typing, so you don't have to specifically mention the interfaces you implement on the implementing side. Each consumer can have its own partially overlapping interface requirement.

My Tailspin language is currently dynamically typed but I have the intention of adding explicit typing to it, possibly by contracts a la Eiffel. So far it feels quite nice to just create types on the fly, but I can't say it's really compelling. Well, it is a lot easier than implementing a java value class with builder and getters, but that's not necessarily the typing, rather all the other stuff. Java's records that are coming are mostly lovely.

Occasionally it is convenient to be able to put in any type. In java I certainly use Object and instanceof occasionally, rather than try to pin down five levels of generics.

I use Object when I really mean any Java instance, and instanceof almost never. They tend to show up more often in tools than applications, too. Indeed, I can't remember the last time I used Object or instanceof in a business application.

I never use them out of convenience, or as a shortcut. There needs to be a compelling reason. Lazy coding isn't one.

Also sometimes useful to output one result type and one error type.

I usually create a class with two constructors, one for the result type and one for the error type, and an isError(). Of course, that is far from ideal.

But what if the type system has proper support for sum types?

That would be preferable.

I'm the forum administrator and lead developer of Rel. Email me at dave@armchair.mb.ca with the Subject 'TTM Forum'. Download Rel from https://reldb.org

#24 · February 20, 2021, 8:26 am

Quote from tobega on February 20, 2021, 8:26 am
On what we get from dynamic typing, I've never entirely felt the attraction in general, but I have a reflection that allows me to partially understand it:

I remember the early days of XML when the format was free. The original purpose of XML was to replace SGML as a simpler general document format by limiting the specific character representation but otherwise retaining complete freedom to represent the contents of any document. The one mistake they made was to have a DTD. Without a DTD, XML would have been perfect. Luckily DTD was optional, so it was possible to use XML in a free way.

This freedom to express any data in any combination can be very useful in the right context. Consider being able to represent a player tag inside a team tag versus a team tag inside a player tag, it just depends on your perspective. The data and relationship is still possible to easily understand. It is conceivable that semi-structured combinations can be made meaningful in data transfer and, by extension, in algorithms.

In messaging between computers in particular it is very useful to have a format that will not be validated on any other level than the application level. All fields optional as far as the messaging format goes is the first requirement for being able to provide forwards and backwards compatibility.

So these two properties pretty much equate to dynamic typing.

Then I got to work with a system (written in java) where the data was kept as XML throughout the system (a text chat relay). There were some advantages such as the code at the entry point could attach some data that was needed at the exit point that the code in between didn't even have to know about. In the end, the challenge was how to know what data was available at any point depending on which path it had taken through the system. And even if you could test for that, you might not know what it really meant or where it was set. So again, a documentation problem, but one solution is to change to a typed object which forces you to "document" and also enables you to search for accesses in a better way than text search. So was the problem mostly that it was just text rather than a dynamic object? That said, I do think that you should keep your serialization formats confined on the edge.

Tragically, XML fell prey to productization, you can't create a messaging product if the format and parser is entirely free and the application does all the validation. So first came namespaces, then schemas and everything just ground to a halt and became too heavy to use. Thus the world switched to JSON.

On what we get from dynamic typing, I've never entirely felt the attraction in general, but I have a reflection that allows me to partially understand it:

I remember the early days of XML when the format was free. The original purpose of XML was to replace SGML as a simpler general document format by limiting the specific character representation but otherwise retaining complete freedom to represent the contents of any document. The one mistake they made was to have a DTD. Without a DTD, XML would have been perfect. Luckily DTD was optional, so it was possible to use XML in a free way.

This freedom to express any data in any combination can be very useful in the right context. Consider being able to represent a player tag inside a team tag versus a team tag inside a player tag, it just depends on your perspective. The data and relationship is still possible to easily understand. It is conceivable that semi-structured combinations can be made meaningful in data transfer and, by extension, in algorithms.

In messaging between computers in particular it is very useful to have a format that will not be validated on any other level than the application level. All fields optional as far as the messaging format goes is the first requirement for being able to provide forwards and backwards compatibility.

So these two properties pretty much equate to dynamic typing.

Then I got to work with a system (written in java) where the data was kept as XML throughout the system (a text chat relay). There were some advantages such as the code at the entry point could attach some data that was needed at the exit point that the code in between didn't even have to know about. In the end, the challenge was how to know what data was available at any point depending on which path it had taken through the system. And even if you could test for that, you might not know what it really meant or where it was set. So again, a documentation problem, but one solution is to change to a typed object which forces you to "document" and also enables you to search for accesses in a better way than text search. So was the problem mostly that it was just text rather than a dynamic object? That said, I do think that you should keep your serialization formats confined on the edge.

Tragically, XML fell prey to productization, you can't create a messaging product if the format and parser is entirely free and the application does all the validation. So first came namespaces, then schemas and everything just ground to a halt and became too heavy to use. Thus the world switched to JSON.

#25 · February 20, 2021, 10:22 am

I have always been firmly in the static typing camp, but I have to acknowledge that the research doesn't necessarily support a very strong stance. I mentioned how Uncle Bob has swung over and the basis of the argument seems to be that TDD mandates about the same amount of tests whatever your type system, but you avoid having to do all the type declaration stuff in a dynamic system. (On trying to look up solid arguments I found this which was somewhat interesting: https://labs.ig.com/static-typing-promise )

You still haven't quoted any research of any authority. What I see here is bunk.

The problem with this argument (apart from the close resemblance to a religious war) is that it lumps every kind of programming in together. It's like comparing apples to oranges, battleships and the colour yellow. You just can't.

Speaking for myself, I like to have a scripting language for massaging text files or iterating over files. For this purpose I like Ruby, and I hate C#.

Then I like to have a language at the level of bits and bytes, to diddle with ports and memory, use protocols. I like C++, Python is a dog.

Then I write a few thousand lines of code for a language compiler and VM, hack it until it works, refactor until it's right. I want a static type compiler, modules and interfaces, assertions and tests (but not TDD). C# fits the bill. And so on.

When you show me research that reflects that, you have my attention.

Andl - A New Database Language - andl.org

#26 · February 20, 2021, 12:51 pm

Quote from tobega on February 20, 2021, 12:51 pm

Quote from dandl on February 20, 2021, 10:22 am

I have always been firmly in the static typing camp, but I have to acknowledge that the research doesn't necessarily support a very strong stance. I mentioned how Uncle Bob has swung over and the basis of the argument seems to be that TDD mandates about the same amount of tests whatever your type system, but you avoid having to do all the type declaration stuff in a dynamic system. (On trying to look up solid arguments I found this which was somewhat interesting: https://labs.ig.com/static-typing-promise )

You still haven't quoted any research of any authority. What I see here is bunk.

The problem with this argument (apart from the close resemblance to a religious war) is that it lumps every kind of programming in together. It's like comparing apples to oranges, battleships and the colour yellow. You just can't.

Speaking for myself, I like to have a scripting language for massaging text files or iterating over files. For this purpose I like Ruby, and I hate C#.

Then I like to have a language at the level of bits and bytes, to diddle with ports and memory, use protocols. I like C++, Python is a dog.

Then I write a few thousand lines of code for a language compiler and VM, hack it until it works, refactor until it's right. I want a static type compiler, modules and interfaces, assertions and tests (but not TDD). C# fits the bill. And so on.

When you show me research that reflects that, you have my attention.

You have a point about different languages being suited for different scenarios.

As regards your asking for solid research that supports my claim that the advantages of static typing are a lot less significant than what we generally want to believe, that is understandable and I suppose quite reasonable. Except I have no interest in proving anything to you and it really is your choice whether you take me seriously or not.

I am, however, interested in learning things, so your personal preferences listed above are of some interest given that they reflect your experience. And if you should want to make a claim that static typing is vastly superior in some sense, I would be happy to see what research you can provide to support that claim. It should be easy to find if it were true, certainly a lot easier than for me to provide dozens of papers that I've come across over dozens of years that fail to prove a huge advantage of static typing.

Quote from dandl on February 20, 2021, 10:22 am

I have always been firmly in the static typing camp, but I have to acknowledge that the research doesn't necessarily support a very strong stance. I mentioned how Uncle Bob has swung over and the basis of the argument seems to be that TDD mandates about the same amount of tests whatever your type system, but you avoid having to do all the type declaration stuff in a dynamic system. (On trying to look up solid arguments I found this which was somewhat interesting: https://labs.ig.com/static-typing-promise )

You still haven't quoted any research of any authority. What I see here is bunk.

The problem with this argument (apart from the close resemblance to a religious war) is that it lumps every kind of programming in together. It's like comparing apples to oranges, battleships and the colour yellow. You just can't.

Speaking for myself, I like to have a scripting language for massaging text files or iterating over files. For this purpose I like Ruby, and I hate C#.

Then I like to have a language at the level of bits and bytes, to diddle with ports and memory, use protocols. I like C++, Python is a dog.

Then I write a few thousand lines of code for a language compiler and VM, hack it until it works, refactor until it's right. I want a static type compiler, modules and interfaces, assertions and tests (but not TDD). C# fits the bill. And so on.

When you show me research that reflects that, you have my attention.

You have a point about different languages being suited for different scenarios.

As regards your asking for solid research that supports my claim that the advantages of static typing are a lot less significant than what we generally want to believe, that is understandable and I suppose quite reasonable. Except I have no interest in proving anything to you and it really is your choice whether you take me seriously or not.

I am, however, interested in learning things, so your personal preferences listed above are of some interest given that they reflect your experience. And if you should want to make a claim that static typing is vastly superior in some sense, I would be happy to see what research you can provide to support that claim. It should be easy to find if it were true, certainly a lot easier than for me to provide dozens of papers that I've come across over dozens of years that fail to prove a huge advantage of static typing.

#27 · February 20, 2021, 1:13 pm

As I feared we got sidelined in a typing discussion, although I suppose it has its purpose too. I'm certainly looking again what's come up on the internet about the matter.

Are there any comments on the idea that the quality of code performing relational operations can be assessed by the quality of the predicate being produced? Maybe that is just obvious to all but me?

What things, if any, about a predicate statement could be considered to be a code smell? Can we say, for example, that a workable relation must consist of a conjuction of facts in definite form? Does a tuple need to be self-contained or can it refer to other factors not specified in the tuple in any way?

Thinking further, I suppose the "type" of a relation should include the predicate? What if it could be represented as part of the type system and the compiler could type check the derived predicates resulting from operations?

#28 · February 20, 2021, 3:34 pm

Quote from tobega on February 20, 2021, 1:13 pm

As I feared we got sidelined in a typing discussion, although I suppose it has its purpose too. I'm certainly looking again what's come up on the internet about the matter.

Are there any comments on the idea that the quality of code performing relational operations can be assessed by the quality of the predicate being produced? Maybe that is just obvious to all but me?

Perhaps. Maybe that's because (to me, at least) a predicate just... Is. I'm not sure what "quality of the predicate" means.

As for static-vs-dynamic typing, no matter how much research there is (or isn't) for one side or the other, it comes down heavily to personal preference. Some of us prefer statically-typed languages for everything, some for some things and not others, some prefer them for nothing, etc. I suspect personal language preference drives far, far, far more than we'd like to think (even with knowing it drives a lot), at least from an engineering point of view.

Indeed, it almost certainly drives far more than we'd like to think in all engineering disciplines, not just (relatively undisciplined) software engineering.

What things, if any, about a predicate statement could be considered to be a code smell? Can we say, for example, that a workable relation must consist of a conjuction of facts in definite form? Does a tuple need to be self-contained or can it refer to other factors not specified in the tuple in any way?

Thinking further, I suppose the "type" of a relation should include the predicate? What if it could be represented as part of the type system and the compiler could type check the derived predicates resulting from operations?

Sorry, you've lost me a bit here. I can infer things, but they're probably not what you've intended. Examples?

I'm the forum administrator and lead developer of Rel. Email me at dave@armchair.mb.ca with the Subject 'TTM Forum'. Download Rel from https://reldb.org

#29 · February 20, 2021, 3:46 pm

Quote from Dave Voorhis on February 20, 2021, 3:46 pm

Quote from dandl on February 20, 2021, 10:22 am

I have always been firmly in the static typing camp, but I have to acknowledge that the research doesn't necessarily support a very strong stance. I mentioned how Uncle Bob has swung over and the basis of the argument seems to be that TDD mandates about the same amount of tests whatever your type system, but you avoid having to do all the type declaration stuff in a dynamic system. (On trying to look up solid arguments I found this which was somewhat interesting: https://labs.ig.com/static-typing-promise )

You still haven't quoted any research of any authority. What I see here is bunk.

I suppose it depends what you define as "authority."

In some circles, Bob Martin is considered as valid an authority as you can get. In others, he's ignored in favour of Knuth and other academic notables. In yet other circles, if it doesn't come straight from Microsoft, ignore it. Etc.

In computing in general we have a bit of a problem identifying authority, which is perhaps not a bad thing.

The problem with this argument (apart from the close resemblance to a religious war) is that it lumps every kind of programming in together. It's like comparing apples to oranges, battleships and the colour yellow. You just can't.

Speaking for myself, I like to have a scripting language for massaging text files or iterating over files. For this purpose I like Ruby, and I hate C#.

Then I like to have a language at the level of bits and bytes, to diddle with ports and memory, use protocols. I like C++, Python is a dog.

Then I write a few thousand lines of code for a language compiler and VM, hack it until it works, refactor until it's right. I want a static type compiler, modules and interfaces, assertions and tests (but not TDD). C# fits the bill. And so on.

When you show me research that reflects that, you have my attention.

There is a problem that research of this sort often starts with a subject group consisting of a cohort of final-year university computer science (or software engineering, if you're lucky) students, which is about as non-reflective of real industry practitioners as you can get.

Not that it matters, anyway. See my other post re language preferences. If definitive research was presented to condemn dynamic typing, would you give up Ruby?

Or if definitive research condemned static typing, would you give up C# or C++?

I think I know the answer, and it would be the same answer for the majority of us.

Quote from dandl on February 20, 2021, 10:22 am

I have always been firmly in the static typing camp, but I have to acknowledge that the research doesn't necessarily support a very strong stance. I mentioned how Uncle Bob has swung over and the basis of the argument seems to be that TDD mandates about the same amount of tests whatever your type system, but you avoid having to do all the type declaration stuff in a dynamic system. (On trying to look up solid arguments I found this which was somewhat interesting: https://labs.ig.com/static-typing-promise )

You still haven't quoted any research of any authority. What I see here is bunk.

I suppose it depends what you define as "authority."

In some circles, Bob Martin is considered as valid an authority as you can get. In others, he's ignored in favour of Knuth and other academic notables. In yet other circles, if it doesn't come straight from Microsoft, ignore it. Etc.

In computing in general we have a bit of a problem identifying authority, which is perhaps not a bad thing.

The problem with this argument (apart from the close resemblance to a religious war) is that it lumps every kind of programming in together. It's like comparing apples to oranges, battleships and the colour yellow. You just can't.

Speaking for myself, I like to have a scripting language for massaging text files or iterating over files. For this purpose I like Ruby, and I hate C#.

Then I like to have a language at the level of bits and bytes, to diddle with ports and memory, use protocols. I like C++, Python is a dog.

Then I write a few thousand lines of code for a language compiler and VM, hack it until it works, refactor until it's right. I want a static type compiler, modules and interfaces, assertions and tests (but not TDD). C# fits the bill. And so on.

When you show me research that reflects that, you have my attention.

There is a problem that research of this sort often starts with a subject group consisting of a cohort of final-year university computer science (or software engineering, if you're lucky) students, which is about as non-reflective of real industry practitioners as you can get.

Not that it matters, anyway. See my other post re language preferences. If definitive research was presented to condemn dynamic typing, would you give up Ruby?

Or if definitive research condemned static typing, would you give up C# or C++?

I think I know the answer, and it would be the same answer for the majority of us.

I'm the forum administrator and lead developer of Rel. Email me at dave@armchair.mb.ca with the Subject 'TTM Forum'. Download Rel from https://reldb.org

#30 · February 20, 2021, 7:03 pm

Quote from Erwin on February 20, 2021, 7:03 pm

Quote from tobega on February 11, 2021, 8:54 am

From an article I came across (https://towardsdatascience.com/apache-arrow-read-dataframe-with-zero-memory-69634092b1a):

Traditionally, data is stored on disk in a row-by-row manner. Columnar storage was born out of the necessity to analyze large datasets and aggregate them efficiently. Data analytics is less interested in rows of data (e.g. one customer transaction, one call log, …) but on aggregations thereof (e.g. total amount spent by customer, total call minutes by region, …).

which tied into a thought I had been pondering about the efficiency of checking for set-ness all the time, i.e. is the SQL bag way more efficient in general? What does a set-algebra buy us of value?

When it comes to storage, having each tuple be unique is just one more constraint, I suppose. I feel there could be problems when we involve blobs/clobs. Or what about just filenames, where the referenced files have identical contents? If we sometimes have to account for tricky notions of equality and distinctness, is it better to always have to consider these things?

Getting back to data analytics, that is of course vastly different from managing a supplier database. You have data pumping in at an alarming rate, so it's maybe just a completely different thing that should have different tools.

I have skimmed most of the responses and I don't think what I want to mention in connection with this has already been touched on. Apologies if it has and I'm redundantly stating facts ( :-) ) about bag algebra.

Bag algebra "creates" issues of "how to handle the duplicates" for every single operator of the relation algebra for which an equivalent/analogue/... is to be included in the bag algebra. Every single one.

When mimicking the operators of the RA in some "bag algebra" to be newly devised, the following basic principles determine the "conceptual integrity" (thank you Fred Brooks) of that new algebra :

can a tuple be "matched" more than once with tuples from the other argument of the [binary] operator at hand ?

Are you even going to "match" at all ?

what about commutativity of the operator at hand ?

... (myself hoping to find out as I'm writing this)

For example, when INTERSECTing a bag of two identical tuples (say, B2) with a bag of three identical tuples (say, B3) that also happen to identical to those in B2, are you going to make INTERSECT return a bag of cardinality 2 or of cardinality 3 ? Don't think for a moment that either option won't have its use cases.

if the answer to bullet 1 is Y, then the result is of cardinality 2 and B2 INTERSECT B3 === B3 INTERSECT B2   (BA operator is commutative)

if the answer to bullet 1 is N, then on the intuitive face of it it seems like B2 INTERSECT B3 must yield cardinality 2 and B3 INTERSECT B2 must yield cardinality 3 (operator not commutative).

When UNIONing B2 with B3 :

If the answer to bullet 2 is N, then you're going to return a bag of cardinality 5, but note that B2 UNION B2 \= B2 !!!

If the answer to bullet 2 is Y, then you're going to return a bag of cardinality 3, and B2 UNION B2 === B2

(observe that in both cases, B2 INTERSECT B3 === B3 INTERSECT B2   (BA operator is commutative)

When MINUSing the two :

If the answer to bullet 1 is Y, then B2 MINUS B3 === B3 MINUS B2 === FI (empty bag)

If the answer to bullet 1 is N, then B3 MINUS B2 will yield a bag of cardinality 1.

For each of the three foregoing operators, think (and I mean long and thoroughly) about ***how*** you would formulate the predicate of the INTERSECT/UNION/MIUNUS expression such that its extension is indeed the resulting bag value (as is indeed the case with the operators of the ***relational*** algebra).

JOIN is even worse.

Say B21 = { TUP {A1 'A' B2A2 'Z'} TUP{A1 'A' B2A2 'Z'} }

Say B22 = { TUP {A1 'A' B2A2 'Z'} TUP{A1 'A' B2A2 'X'} }   (in fact a relation)

Say B31 = { TUP {A1 'A' B3A2 'Z'} TUP{A1 'A' B3A2 'Z'} TUP{A1 'A' B3A2 'Z'} }

Say B32 = { TUP {A1 'A' B3A2 'Z'} TUP{A1 'A' B3A2 'Y'} TUP{A1 'A' B3A2 'X'} }   (in fact a relation)

When JOINing, we assume the answer to bullet 2 is Y because at least superficially it doesn't seem to make sense to "not match at all" in a JOIN operator. Furthermore, the most intuitive option for defining what it means to "match" in a JOIN-like operator, is "equality of the sub-tuple formed by the [tuple] projection on the set of JOIN attributes of both tuples considered for matching". Then, if the answer to bullet 1 is N (each and every tuple matched at most once) :

B22 JOIN B32 is indeterminate in general because there's no way of controlling/predicting/... which of the particular B32 tuples will be matched with some specific B22 tuple, so in general, you get that B22 JOIN B32 \= B22 JOIN B32   (!!!!!!!!!!!!). Ditto for B21 JOIN B32.

So it seems like considering the analogue of a JOIN operator in "bag algebra" ***FORCES*** us to accept that the answer to bullet 1 must necessarily be Y, and then even at the very best of circumstances we're left (as a DBMS writer) with the problem of how to support the "reasonable" use cases of INTERSECT/UNION/MINUS where the answer to bullet 1 was in fact N.

I have no doubt this "analysis" is extremely superficial, but it's the best I have to offer at the moment : what "set-ness" has to offer is that it makes the ensuing, eurhm, "data-paradigm-algebra" :

definable for the definers

implementable for the implementors

both understandable and manageable for the users (= it becomes manageable for the user to store the entire set of operators in that part of their brains that we call in my native language "parate kennis" - that proper subset of all the things we have stored in our brains that we use on a day-to-day basis and that our brains make accessible for ourselves in a whimpse (*) )

(*) The first time I used that word [on this forum] it turned out that I had subconsciously conflated the words "whiff" and "glimpse" when I wasn't even aware that the former was indeed a word. I was so proud of that after reading the comment pointing that out to me, that I've been seeking ever since for a good excuse to use it a second time :-) Thanks for the occasion :-)

Quote from tobega on February 11, 2021, 8:54 am

From an article I came across (https://towardsdatascience.com/apache-arrow-read-dataframe-with-zero-memory-69634092b1a):

Traditionally, data is stored on disk in a row-by-row manner. Columnar storage was born out of the necessity to analyze large datasets and aggregate them efficiently. Data analytics is less interested in rows of data (e.g. one customer transaction, one call log, …) but on aggregations thereof (e.g. total amount spent by customer, total call minutes by region, …).

which tied into a thought I had been pondering about the efficiency of checking for set-ness all the time, i.e. is the SQL bag way more efficient in general? What does a set-algebra buy us of value?

When it comes to storage, having each tuple be unique is just one more constraint, I suppose. I feel there could be problems when we involve blobs/clobs. Or what about just filenames, where the referenced files have identical contents? If we sometimes have to account for tricky notions of equality and distinctness, is it better to always have to consider these things?

Getting back to data analytics, that is of course vastly different from managing a supplier database. You have data pumping in at an alarming rate, so it's maybe just a completely different thing that should have different tools.

I have skimmed most of the responses and I don't think what I want to mention in connection with this has already been touched on. Apologies if it has and I'm redundantly stating facts ( :-) ) about bag algebra.

Bag algebra "creates" issues of "how to handle the duplicates" for every single operator of the relation algebra for which an equivalent/analogue/... is to be included in the bag algebra. Every single one.

When mimicking the operators of the RA in some "bag algebra" to be newly devised, the following basic principles determine the "conceptual integrity" (thank you Fred Brooks) of that new algebra :

can a tuple be "matched" more than once with tuples from the other argument of the [binary] operator at hand ?
Are you even going to "match" at all ?
what about commutativity of the operator at hand ?
... (myself hoping to find out as I'm writing this)

For example, when INTERSECTing a bag of two identical tuples (say, B2) with a bag of three identical tuples (say, B3) that also happen to identical to those in B2, are you going to make INTERSECT return a bag of cardinality 2 or of cardinality 3 ? Don't think for a moment that either option won't have its use cases.

if the answer to bullet 1 is Y, then the result is of cardinality 2 and B2 INTERSECT B3 === B3 INTERSECT B2 (BA operator is commutative)
if the answer to bullet 1 is N, then on the intuitive face of it it seems like B2 INTERSECT B3 must yield cardinality 2 and B3 INTERSECT B2 must yield cardinality 3 (operator not commutative).

When UNIONing B2 with B3 :

If the answer to bullet 2 is N, then you're going to return a bag of cardinality 5, but note that B2 UNION B2 \= B2 !!!
If the answer to bullet 2 is Y, then you're going to return a bag of cardinality 3, and B2 UNION B2 === B2
(observe that in both cases, B2 INTERSECT B3 === B3 INTERSECT B2 (BA operator is commutative)

When MINUSing the two :

If the answer to bullet 1 is Y, then B2 MINUS B3 === B3 MINUS B2 === FI (empty bag)
If the answer to bullet 1 is N, then B3 MINUS B2 will yield a bag of cardinality 1.

For each of the three foregoing operators, think (and I mean long and thoroughly) about ***how*** you would formulate the predicate of the INTERSECT/UNION/MIUNUS expression such that its extension is indeed the resulting bag value (as is indeed the case with the operators of the ***relational*** algebra).

JOIN is even worse.

Say B21 = { TUP {A1 'A' B2A2 'Z'} TUP{A1 'A' B2A2 'Z'} }
Say B22 = { TUP {A1 'A' B2A2 'Z'} TUP{A1 'A' B2A2 'X'} } (in fact a relation)
Say B31 = { TUP {A1 'A' B3A2 'Z'} TUP{A1 'A' B3A2 'Z'} TUP{A1 'A' B3A2 'Z'} }
Say B32 = { TUP {A1 'A' B3A2 'Z'} TUP{A1 'A' B3A2 'Y'} TUP{A1 'A' B3A2 'X'} } (in fact a relation)

When JOINing, we assume the answer to bullet 2 is Y because at least superficially it doesn't seem to make sense to "not match at all" in a JOIN operator. Furthermore, the most intuitive option for defining what it means to "match" in a JOIN-like operator, is "equality of the sub-tuple formed by the [tuple] projection on the set of JOIN attributes of both tuples considered for matching". Then, if the answer to bullet 1 is N (each and every tuple matched at most once) :

B22 JOIN B32 is indeterminate in general because there's no way of controlling/predicting/... which of the particular B32 tuples will be matched with some specific B22 tuple, so in general, you get that B22 JOIN B32 \= B22 JOIN B32 (!!!!!!!!!!!!). Ditto for B21 JOIN B32.

So it seems like considering the analogue of a JOIN operator in "bag algebra" ***FORCES*** us to accept that the answer to bullet 1 must necessarily be Y, and then even at the very best of circumstances we're left (as a DBMS writer) with the problem of how to support the "reasonable" use cases of INTERSECT/UNION/MINUS where the answer to bullet 1 was in fact N.

I have no doubt this "analysis" is extremely superficial, but it's the best I have to offer at the moment : what "set-ness" has to offer is that it makes the ensuing, eurhm, "data-paradigm-algebra" :

definable for the definers
implementable for the implementors
both understandable and manageable for the users (= it becomes manageable for the user to store the entire set of operators in that part of their brains that we call in my native language "parate kennis" - that proper subset of all the things we have stored in our brains that we use on a day-to-day basis and that our brains make accessible for ourselves in a whimpse (*) )

(*) The first time I used that word [on this forum] it turned out that I had subconsciously conflated the words "whiff" and "glimpse" when I wasn't even aware that the former was indeed a word. I was so proud of that after reading the comment pointing that out to me, that I've been seeking ever since for a good excuse to use it a second time :-) Thanks for the occasion :-)

Author of SIRA_PRISE

The Forum for Discussion about The Third Manifesto and Related Matters

What do set-based operations buy us?