The Forum for Discussion about The Third Manifesto and Related Matters


What do set-based operations buy us?

Quote from dandl on February 16, 2021, 6:33 am
Quote from tobega on February 16, 2021, 5:38 am
Quote from dandl on February 15, 2021, 11:31 pm
Quote from tobega on February 15, 2021, 6:17 pm

So what I'm hearing is that the set-based model helps us avoid accidental duplication in data entry.

But what about accidental deduplication in a projection step? E.g. where two shipments happen to have the same amount of parts? Seems like it would be a very obscure bug.
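The concern can be sketched in a few lines of Python, treating sets of tuples as stand-in relations (the shipment data is invented for illustration):

```python
# Hypothetical data: two distinct shipments that happen to share a part count.
shipments = {
    ("Tuesday", 27),
    ("Friday", 27),
}

# Projecting away the day collapses the two facts into one: the accidental
# deduplication described above.
amounts = {amount for (day, amount) in shipments}
assert amounts == {27}                  # both shipments merged into one tuple

# Any total computed from the projection silently loses 27 parts.
assert sum(amounts) == 27               # not the 54 actually shipped

# Keeping a distinguishing attribute (here, the day) preserves both facts.
assert sum(amount for (day, amount) in shipments) == 54
```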

 

Sorry, I can't see where you heard that.

The set-based model is the direct consequence of viewing a tuple in a relation as the assertion of a fact. If 'there was a shipment of 27 parts' is true, there is no reason to assert it twice. If there is reason to distinguish two such shipments it must be because each shipment is a separate fact to be asserted, eg 'the shipment on Tuesday was of 27 parts' and 'the shipment on Friday was of 27 parts'.

You cannot have a bag of true facts, it has to be a set. If that's not what you need, then choose a different data structure, often a list but sometimes a map.

 

It's not that simple. I'm talking about when you start out with a proper relation, but while doing operations on it you must always take care to keep facts that may be irrelevant to your final result yet necessary in the intermediate step; necessary only to avoid deduplication in that step, so it may not be clear why you need them.

It becomes particularly obscure when uniqueness is determined by a combination of attributes and one attribute will be elided in the result.

Ah, now I understand what you don't understand. You are describing aggregation. The aggregate of your bag is a tuple with an attribute that is the count.
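In Python terms, aggregating a bag into a relation of (value, count) tuples might look like this (the data is invented for illustration):

```python
from collections import Counter

# Hypothetical bag of observed part counts. Instead of keeping the bag,
# aggregate it: each distinct value becomes a tuple carrying its count.
observations = [27, 27, 13]

relation = {(value, count) for value, count in Counter(observations).items()}

# {(27, 2), (13, 1)} is a proper set: no duplicates, and nothing is lost.
assert relation == {(27, 2), (13, 1)}
```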

An example is transitive closure where the definition of TRANCLO carefully avoids the issue by renaming an attribute on both sides of the join:

OPERATOR TRANCLO ( XY RELATION { X P#, Y P# } ) RETURNS RELATION { X P#, Y P# } ;
  RETURN ( WITH ( XY UNION ( ( XY RENAME ( Y AS Z ) ) COMPOSE
                             ( XY RENAME ( X AS Z ) ) ) ) AS TTT :
    IF TTT = XY THEN TTT          /* unwind recursion */
    ELSE TRANCLO ( TTT )          /* recursive invocation */
    END IF ) ;
END OPERATOR
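A Python sketch of the same fixpoint recursion, using a set of pairs as a stand-in for the relation (the names and data here are illustrative, not taken from DTATRM):

```python
def tranclo(xy: frozenset) -> frozenset:
    """Transitive closure of a set of (x, y) pairs, mirroring TRANCLO above:
    union the relation with its self-composition, then recurse until a
    fixed point is reached."""
    composed = {(x, y2) for (x, z) in xy for (z2, y2) in xy if z == z2}
    ttt = xy | composed
    return ttt if ttt == xy else tranclo(frozenset(ttt))

pairs = frozenset({(1, 2), (2, 3), (3, 4)})
assert tranclo(pairs) == {(1, 2), (2, 3), (3, 4), (1, 3), (2, 4), (1, 4)}
```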

But if you add a part count to that, as in the explosion of parts example in recommendation 6 of chapter 10 of DTATRM, it becomes more complicated, and the book doesn't attempt to define the operator, merely describing its properties.

Correct. This is the operator referred to as GTC. I can tell you exactly how to do that, and I have a formal description. I published it here, but I don't think anyone paid it much attention. See https://forum.thethirdmanifesto.com/forum/topic/formal-definitions-for-relation-constant-transitive-closure-extended-transitive-closure/.

I had exactly this problem in the adventofcode day 7 that is part of driving the Tailspin implementation and I fell neatly into the trap, although it did give the correct answer for my case. So I have a bug where I get the correct answer most of the time, but not when I least suspect it.

templates contents
  when <?($::count <=0>)> do $!
  otherwise
    $({multiplier: §.amount, container: §.contained})
      -> ($ join $bagrules) -> $({container:, contained:, amount: §.amount * §.multiplier})
      -> ($ union ($ -> contents)) !
end contents

or, a clumsy attempt to translate it to Tutorial-D

OPERATOR CONTENTS ( CCA RELATION { CONTAINER C_ID, CONTAINED C_ID, AMOUNT INTEGER },
                    RULES RELATION { CONTAINER C_ID, CONTAINED C_ID, AMOUNT INTEGER } )
    RETURNS RELATION { CONTAINER C_ID, CONTAINED C_ID, AMOUNT INTEGER } ;
  IF ISEMPTY(CCA)
    RETURN CCA;
  END IF
  RETURN (WITH ((RENAME CCA {CONTAINED, AMOUNT} (CONTAINED AS CONTAINER, AMOUNT AS MULTIPLIER)) JOIN RULES) AS NEXT:
    ((NEXT EXTEND (TOTAL AS MULTIPLIER*AMOUNT)) {CONTAINER, CONTAINED, TOTAL} RENAME (TOTAL AS AMOUNT)) UNION CONTENTS(NEXT, RULES));
END OPERATOR
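The trap can be reproduced in a few lines of Python, using sets of tuples as stand-in relations. The containment rules below are invented for illustration and assumed acyclic:

```python
# Hypothetical containment rules: (container, contained, amount).
rules = {
    ("gold", "red", 1),
    ("gold", "blue", 1),
    ("red", "green", 2),
    ("blue", "green", 2),
}

def explode(frontier, rules):
    """Naive set-based parts explosion in the style of CONTENTS above."""
    if not frontier:
        return frontier
    step = {
        (outer, inner, amount * multiplier)
        for (outer, mid, multiplier) in frontier
        for (mid2, inner, amount) in rules
        if mid == mid2
    }
    return step | explode(step, rules)

exploded = rules | explode(rules, rules)
# The two distinct chains gold->red->green and gold->blue->green both yield
# the tuple ('gold', 'green', 2); set union silently merges them.
total_in_gold = sum(amount for (outer, inner, amount) in exploded if outer == "gold")
assert total_in_gold == 4   # the correct bag-of-parts answer is 6
```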

What you're relying on here is operator recursion, which places it outside the RA. Given the ability to define operators you can construct your own data structures, but that's not what this is about.

So this might be the key observation here, but you'll have to explain a bit more.

As a user of Tutorial-D I am defining an operator that takes relations as parameters and is producing what I (mistakenly) believe to be a relation, using only relational operators in the process, so it certainly looks relational at first glance.

So the key question is, by what criteria do you determine that this is outside the RA? If I am a code reviewer, how can I determine that we are on dangerous ground and that this is perhaps not the code the author intended? Bonus points if we can get a compiler warning here.

Quote from tobega on February 16, 2021, 5:38 am
Quote from dandl on February 15, 2021, 11:31 pm
Quote from tobega on February 15, 2021, 6:17 pm

So what I'm hearing is that the set-based model helps us avoid accidental duplication in data entry.

But what about accidental deduplication in a projection step? E.g. where two shipments happen to have the same amount of parts? Seems like it would be a very obscure bug.

Sorry, I can't see where you heard that.

The set-based model is the direct consequence of viewing a tuple in a relation as the assertion of a fact. If 'there was a shipment of 27 parts' is true, there is no reason to assert it twice. If there is reason to distinguish two such shipments it must be because each shipment is a separate fact to be asserted, eg 'the shipment on Tuesday was of 27 parts' and 'the shipment on Friday was of 27 parts'.

You cannot have a bag of true facts, it has to be a set. If that's not what you need, then choose a different data structure, often a list but sometimes a map.

It's not that simple. I'm talking about when you start out with a proper relation but while doing operations on it, you must always make sure of keeping facts that may be irrelevant to your final result but necessary in the intermediate step. But necessary only to avoid deduplication in that step, so it may not be clear why you need it.

It becomes particularly obscure when uniqueness is determined by a combination of attributes and one attribute will be elided in the result. [...]

Note that I am not suggesting relvars be lists or bags, just the values resulting from relational operations.

  1. Tutorial D defines ARRAYs for this (essentially), though they're perhaps better considered the Tutorial D equivalent to SQL cursors; and
  2. This does come up in -- as @dandl mentioned -- aggregation. See https://reldb.org/c/wp-content/uploads/2016/06/User-Defined-Aggregate-Operators-in-Tutorial-D-and-Rel.pdf where the bottom of page 4 briefly addresses handling duplicates in an intermediate result, to ensure correct results for aggregation.
I'm the forum administrator and lead developer of Rel. Email me at dave@armchair.mb.ca with the Subject 'TTM Forum'. Download Rel from https://reldb.org

What you're relying on here is operator recursion, which places it outside the RA. Given the ability to define operators you can construct your own data structures, but that's not what this is about.

So this might be the key observation here, but you'll have to explain a bit more.

As a user of Tutorial-D I am defining an operator that takes relations as parameters and is producing what I (mistakenly) believe to be a relation, using only relational operators in the process, so it certainly looks relational at first glance.

So the key question is, by what criteria do you determine that this is outside the RA? If I am a code reviewer, how can I determine that we are on dangerous ground and that this is perhaps not the code the author intended? Bonus points if we can get a compiler warning here.

The RA is a set of operations on relations. The RA itself does not provide any means to create new attribute values.

  • First, Codd created new values with literals (only) and hinted that more would be needed.
  • TTM defines a type system and a language, in which new values are created outside the RA and imported using language features.
  • The Extended RA of Algebra A suggests creating new values as relational constants (relcons) but gives no formal definitions.
  • I have given formal definitions for creating new values by replacing relcons with pure functions (without a type system or a language).

Pure functions are still first order, just like the RA. They cannot be recursive (a function cannot refer to itself), and they are 'safe': guaranteed to terminate and return a value within the domain.

For the code reviewer it is easy: insist that every function be provably pure.
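For illustration only, here is a closure written as a non-recursive fixpoint loop in Python, in the spirit of the iterative formulation referred to here (a sketch under those constraints, not the formal definition itself):

```python
def tranclo_iterative(xy):
    """Transitive closure as a fixpoint loop: no function refers to itself,
    and termination is guaranteed because the closure over a finite set of
    pairs can only grow a bounded number of times."""
    ttt = set(xy)
    while True:
        step = {(x, y2) for (x, z) in ttt for (z2, y2) in ttt if z == z2}
        if step <= ttt:          # nothing new: fixed point reached
            return ttt
        ttt |= step

assert tranclo_iterative({(1, 2), (2, 3)}) == {(1, 2), (2, 3), (1, 3)}
```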

Andl - A New Database Language - andl.org
Quote from dandl on February 16, 2021, 11:21 pm

What you're relying on here is operator recursion, which places it outside the RA. Given the ability to define operators you can construct your own data structures, but that's not what this is about.

So this might be the key observation here, but you'll have to explain a bit more.

As a user of Tutorial-D I am defining an operator that takes relations as parameters and is producing what I (mistakenly) believe to be a relation, using only relational operators in the process, so it certainly looks relational at first glance.

So the key question is, by what criteria do you determine that this is outside the RA? If I am a code reviewer, how can I determine that we are on dangerous ground and that this is perhaps not the code the author intended? Bonus points if we can get a compiler warning here.

The RA is a set of operations on relations. The RA itself does not provide any means to create new attribute values.

  • First, Codd created new values with literals (only) and hinted that more would be needed.
  • TTM defines a type system and a language, in which new values are created outside the RA and imported using language features.
  • The Extended RA of Algebra A suggests creating new values as relational constants (relcons) but gives no formal definitions
  • I have given formal definitions for creating new values by replacing relcons with pure functions (without a type system or a language).

Pure functions are still first order, just like the RA. They cannot be recursive (a function cannot refer to itself), and they are 'safe': guaranteed to terminate and return a value within the domain.

For the code reviewer it is easy: insist that every function be provably pure.

Thanks for the attempt to come up with a constructive idea! Unfortunately, it seems to be generally accepted in computer science that pure functions may be recursive. There is also a proof that recursion can always be rewritten as iteration and vice versa. I'm fairly certain that the recursion here is equivalent to the iterative sequence you specify in https://forum.thethirdmanifesto.com/forum/topic/formal-definitions-for-relation-constant-transitive-closure-extended-transitive-closure/ and suffers from the same problems when it comes to formulating the successor function.

However, your formal specification for the successor function did give me an idea that it all comes down to the predicate.

In my buggy code the predicate would be "The container is the immediate container of the contained in a chain from the outer container in which the contained will appear as many times as the sum of similar chains of which this amount is a component"

I think we concluded previously in another discussion that a predicate containing a disjunction, such as "Employee E works in department D or on project J", must inevitably lead to trouble. In the same way I think we can surmise that any imprecision in the predicate is likely to be problematic. What do you mean by "a chain"? What do you mean by "a component"? What is the "outer container" that seems to be a major player in this predicate? Also "similar chains" might alert us to a duplication problem.

Rewriting the code until the predicate is crisp enough, viz. "The container is the outermost container that contains this amount of the contained via the given path", with the actual path as a part of the relation will give a correct answer. If we wish, we could replace the path with a generated unique identifier, in which case "the given path" would be "the path uniquely identified by the given tempid".

// Here we have a solution that uses relational algebra correctly, where each tuple states the fact that
// "The outer container contains this amount of the contained type via the given path".
// For sake of illustration, the actual path is generated, but it could be represented by a tempid as above.
// If the entire bagrules relation is input, the answer for all bags will be given at the same time.
templates contents3
  when <?($::count <=0>)> do $!
  otherwise
    $({container:, path:, multiplier: §.amount, z: §.contained})
      -> ($ join $bagrules({z: §.container, contained:, amount:}))
      -> $({container:, contained:, amount: §.amount * §.multiplier, path: [§.path..., §.z]})
      -> ($ union ($ -> contents3)) !
end contents3

($bagrules join {[{container: 'shiny gold', path: []}]}) -> ($ union ($ -> contents3))
-> $(by {[{container: 'shiny gold'}]} collect {totalInside:Sum&{of: :(amount:)}}) ... -> '$.totalInside;
' -> !OUT::write
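A rough Python rendering of the same fix, carrying the path in each tuple so that distinct chains can never collide (the rule data is invented for illustration and assumed acyclic):

```python
# Hypothetical containment rules: (container, contained, amount).
rules = {
    ("gold", "red", 1),
    ("gold", "blue", 1),
    ("red", "green", 2),
    ("blue", "green", 2),
}

def contents(frontier, rules):
    """Parts explosion where each tuple carries the path that produced it,
    so the tuples from distinct chains remain distinct under set union."""
    if not frontier:
        return frontier
    step = {
        (outer, inner, amount * multiplier, path + (mid,))
        for (outer, mid, multiplier, path) in frontier
        for (mid2, inner, amount) in rules
        if mid == mid2
    }
    return step | contents(step, rules)

# Seed each rule with an empty path, then explode.
seeded = {(outer, inner, amount, ()) for (outer, inner, amount) in rules}
exploded = seeded | contents(seeded, rules)

total_in_gold = sum(a for (outer, inner, a, path) in exploded if outer == "gold")
assert total_in_gold == 6   # 1 red + 1 blue + 2 green via red + 2 green via blue
```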
One more thing that falls out of this is the true power of the relational algebra to allow us to calculate many results at once: we can simply pass the whole original relation in as a parameter to our corrected function, whereas doing so would just have produced rubbish in our buggy attempt.

It might be interesting to consider which exact properties we are justified in requiring of a predicate for it to be "crisp" enough. Can we say that a workable relation must consist of a conjunction of facts in definite form? Does it need to be self-contained?

On another tack, I will still note that the SQL recursive CTE version will calculate the correct value even for the sloppy predicates, so it is arguably easier to use. Since we seem to agree that a language should make it easier for programmers to get correct results and harder for them to create obscure bugs, judging by the strong stance on static typing, I still worry that we might be straining at gnats and swallowing camels. In studies performed, static languages do not have lower defect rates overall, and as long as the language has strong typing, any type errors are easily found, so the bugs are not obscure. But despite the code review solution, the bugs from accidental deduplication will probably be both hard to detect and hard to understand.

The question then would be how easy it is to create bugs that are hard to detect and difficult to understand if we allow duplication. If the amount of duplication is semi-unpredictable, as https://perlmonks.org/?node_id=515776 seems to show, then I suppose we just have to crisp up our thinking and deduplicate.
Quote from tobega on February 19, 2021, 5:22 am
On another tack, I will still note that the SQL recursive CTE version will calculate the correct value even for the sloppy predicates, so it is arguably easier to use. Since we seem to agree that a language should make it easier for programmers to get correct results and harder for them to create obscure bugs, judging by the strong stance on static typing, I still worry that we might be straining at gnats and swallowing camels. In studies performed, static languages do not have lower defect rates over all and as long as the language has strong typing, any type errors are easily found, so the bugs are not obscure.

Comparing defect rates between statically and dynamically typed languages strikes me as largely missing the point. I wouldn't expect a dynamically-typed program to be any buggier than an equivalent statically-typed program -- assuming the developer is competent.

The real difference is the amount of effort required to create the program.

Dynamically-typed languages are deceptively simple; they lure you in with fewer keystrokes only to bite you with having to do type inference in your head. In reality, dynamically-typed languages often take more effort -- sometimes a lot more effort -- to read and maintain because you're having to mentally do the type inference and type checking that the compiler would do for you. Though you can gain some of that back by using IDEs that do it for you, plus you usually write a gaggle of unit tests (but that's more effort you need to make.) And to be fair, injudicious use of type inference over manifest types in statically-typed languages that support type inference can be almost as bad as dynamic typing without manifest typing, from a readability point of view.

With small programs (for an undefined value of "small"), the difference between statically-typed and dynamically-typed development/maintenance effort is also small and might indeed favour fewer keystrokes, but the development/maintenance difficulty of dynamically-typed programs grows exponentially with program size. It only takes a few thousand lines of Python or PHP or JavaScript for development and maintenance to become a nightmare, whilst an equivalent-sized Java or C# program -- of the sort often deprecated for being "too verbose" -- remains relatively easy to read and maintain.

I note that dynamic-typing advocacy has largely (though not entirely) grown up post-Web, where finding production bugs is usually associated with clicking a button and getting an immediate response, vs the heavy-duty and sometimes long-running data processing implied by TTM et al. In the latter, finding a bug in production might mean the monthly payroll run fails at 2:00AM after processing 1,409,398 records of 2,489,398 because someone fat-fingered an entry on a timesheet. Rare as it might be, that is usually far, far worse than some button on a Web app failing.

But despite the code review solution, the bugs from accidental deduplication will probably be both hard to detect and hard to understand.

Is it any worse, or harder to detect and understand, than accidental duplication? My experience is that accidental duplication is often the source of difficult-to-detect bugs in SQL.

The question then would be how easy it is to create bugs that are hard to detect and difficult to understand if we allow duplication. If the amount of duplication is semi-unpredictable, as https://perlmonks.org/?node_id=515776 seems to show, then I suppose we just have to crisp up our thinking and deduplicate.

I think the arguments in favour of sets over bags are compelling, and those in favour of bags over sets are not. :-)

[Note: Sorry about creating, then deleting, then re-creating this post. I'm trying to debug a weird formatting issue.]

Quote from Dave Voorhis on February 19, 2021, 9:40 am
Quote from tobega on February 19, 2021, 5:22 am
On another tack, I will still note that the SQL recursive CTE version will calculate the correct value even for the sloppy predicates, so it is arguably easier to use. Since we seem to agree that a language should make it easier for programmers to get correct results and harder for them to create obscure bugs, judging by the strong stance on static typing, I still worry that we might be straining at gnats and swallowing camels. In studies performed, static languages do not have lower defect rates over all and as long as the language has strong typing, any type errors are easily found, so the bugs are not obscure.

Comparing defect rates between statically and dynamically typed languages strikes me as largely missing the point. I wouldn't expect a dynamically-typed program to be any buggier than an equivalent statically-typed program -- assuming the developer is competent.

The real difference is the amount of effort required to create the program.

At the risk of getting into an undesirable side-discussion, you are actually not entirely correct, effort also seems to be on par (but you may be confusing static/dynamic with the weakly/strongly typed distinction). Capers Jones did a study on hours spent per function point produced, and also considering bug frequency, for lots of languages and Smalltalk was one of the most efficient, almost twice as productive as Java/C#. Smalltalk is dynamic but strongly typed. On the other hand, JavaScript (weak dynamic) lagged significantly behind Java/C#. I believe that JavaScript programmers deceive themselves by confusing activity for productivity.

The one clear research-confirmed benefit of static typing seems to be the documentation aspect. Another seems to be performance, that the static typing helps the compiler.

Dynamically-typed languages are deceptively simple; they lure you in with fewer keystrokes only to bite you with having to do type inference in your head. In reality, dynamically-typed languages often take more effort -- sometimes a lot more effort -- to read and maintain because you're having to mentally do the type inference and type checking that the compiler would do for you. Though you can gain some of that back by using IDEs that do it for you, plus you usually write a gaggle of unit tests (but that's more effort you need to make.) And to be fair, injudicious use of type inference over manifest types in statically-typed languages that support type inference can be almost as bad as dynamic typing without manifest typing, from a readability point of view.

Yes, indeed. With type inference you lose the one clear benefit of static typing. A similar aspect is at work concerning anonymous lambdas, although there you also often have a temporal confusion aspect where the code is not executed in the sequence that you write it.

With small programs (for an undefined value of "small"), the difference between statically-typed and dynamically-typed development/maintenance effort is also small and might indeed favour fewer keystrokes, but the development/maintenance difficulty of dynamically-typed programs grows exponentially with program size. It only takes a few thousand lines of Python or PHP or JavaScript for development and maintenance to become a nightmare, whilst an equivalent-sized Java or C# program -- of the sort often deprecated for being "too verbose" -- remains relatively easy to read and maintain.

JavaScript is weakly typed, so there you do have a maintenance nightmare when the runtime decides to help you by gratuitously converting incompatible types. (A weakly, statically typed language like C has similar issues.)

Python tends to do better, because it is strongly typed. There are challenges involved in maintaining very large Python codebases, but I don't think typing is the main culprit. Maybe from the documentation angle, but you won't get obscure typing bugs. The main problem Google had with Python code was the significant whitespace which can render code invisibly broken.

I note that dynamic-typing advocacy has largely (though not entirely) grown up post-Web, where finding production bugs is usually associated with clicking a button and getting an immediate response, vs the heavy-duty and sometimes long-running data processing implied by TTM et al. In the latter, finding a bug in production might mean the monthly payroll run fails at 2:00AM after processing 1,409,398 records of 2,489,398 because someone fat-fingered an entry on a timesheet. Rare as it might be, that is usually far, far worse than some button on a Web app failing.

But despite the code review solution, the bugs from accidental deduplication will probably be both hard to detect and hard to understand.

Is it any worse or harder to detect or hard to understand than accidental duplication?

My experience is that accidental duplication is often the source of difficult-to-detect bugs in SQL.

I'd be interested in any stories.

The question then would be how easy it is to create bugs that are hard to detect and difficult to understand if we allow duplication. If the amount of duplication is semi-unpredictable, as https://perlmonks.org/?node_id=515776 seems to show, then I suppose we just have to crisp up our thinking and deduplicate.

I think the arguments in favour of sets over bags are compelling, and those in favour of bags over sets are not. :-)

[Note: Sorry about creating, then deleting, then re-creating this post. I'm trying to debug a weird formatting issue.]

I noticed that the paragraph formatting of my post went wonky after the code block. Tried to edit it several times.

Quote from tobega on February 19, 2021, 11:09 am
Quote from Dave Voorhis on February 19, 2021, 9:40 am
Quote from tobega on February 19, 2021, 5:22 am
On another tack, I will still note that the SQL recursive CTE version will calculate the correct value even for the sloppy predicates, so it is arguably easier to use. Since we seem to agree that a language should make it easier for programmers to get correct results and harder for them to create obscure bugs, judging by the strong stance on static typing, I still worry that we might be straining at gnats and swallowing camels. In studies performed, static languages do not have lower defect rates over all and as long as the language has strong typing, any type errors are easily found, so the bugs are not obscure.

Comparing defect rates between statically and dynamically typed languages strikes me as largely missing the point. I wouldn't expect a dynamically-typed program to be any buggier than an equivalent statically-typed program -- assuming the developer is competent.

The real difference is the amount of effort required to create the program.

At the risk of getting into an undesirable side-discussion, you are actually not entirely correct, effort also seems to be on par (but you may be confusing static/dynamic with the weakly/strongly typed distinction). Capers Jones did a study on hours spent per code point, also bug frequency, for lots of languages and Smalltalk was one of the most efficient, almost twice as productive as Java/C#. Smalltalk is dynamic but strongly typed. On the other hand, JavaScript (weak dynamic) lagged significantly behind Java/C#. I believe that JavaScript programmers deceive themselves by confusing activity for productivity.

Smalltalk is a rather special case, an uber-consistent language with an unusually well-integrated toolset. Is Smalltalk productivity due to its dynamic typing, or in spite of it?

I'd argue it's in spite of it. A comparison of Smalltalk vs Strongtalk would be interesting.

Research aside, I don't think anyone that's actually had to maintain real large code bases would argue that the difference in maintenance effort in the popular dynamically typed languages (PHP, Python, JavaScript, Ruby, shell scripts) vs statically, manifest-typed (or mainly manifest-typed) languages is anything but profound, unless you're fortunate to be working on a dynamically-typed code-base that uses only primitive types. Then it can be manageable. If it defines classes (e.g., PHP, Python) extensively, then it's a nightmare.

Is it any worse or harder to detect or hard to understand than accidental duplication?

My experience is that accidental duplication is often the source of difficult-to-detect bugs in SQL.

I'd be interested in any stories.

I can't bring to mind any specific anecdotes, only the general experience.

I'm sure everyone who works with SQL has run into inadvertent duplicates. They're rarely immediately visible (unless you carefully construct test data, and use ORDER BY on every query) during development and tend to show up in QA or production when some user notes a report has the same content repeated, or an auditor (or auditing-minded user or tester) discovers that the output of x has more rows than the output of y, despite the fact that they should be the same.
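A minimal sqlite3 sketch of that pattern, with an invented schema, shows the row-count discrepancy an auditor might catch:

```python
import sqlite3

# Invented schema: joining through shipments fans out any order that was
# shipped in more than one box, and SQL's bag semantics keeps the duplicates.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE orders (order_id INTEGER, customer TEXT);
    CREATE TABLE shipments (order_id INTEGER, box TEXT);
    INSERT INTO orders VALUES (1, 'alice'), (2, 'bob');
    INSERT INTO shipments VALUES (1, 'A'), (1, 'B'), (2, 'C');
""")

# "Customers with a shipped order": order 1 appears once per box.
with_dups = db.execute(
    "SELECT customer FROM orders JOIN shipments USING (order_id)"
).fetchall()
distinct = db.execute(
    "SELECT DISTINCT customer FROM orders JOIN shipments USING (order_id)"
).fetchall()

assert len(with_dups) == 3   # alice appears twice
assert len(distinct) == 2    # the count one would expect
```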
The question then would be how easy it is to create bugs that are hard to detect and difficult to understand if we allow duplication. If the amount of duplication is semi-unpredictable, as https://perlmonks.org/?node_id=515776 seems to show, then I suppose we just have to crisp up our thinking and deduplicate.

I think the arguments in favour of sets over bags are compelling, and those in favour of bags over sets are not. :-)

[Note: Sorry about creating, then deleting, then re-creating this post. I'm trying to debug a weird formatting issue.]

I noticed that the paragraph formatting of my post went wonky after the code block. Tried to edit it several times.

Releases of the source code plugin sometimes don't play nicely with the forum plugin. I just installed a new update to the forum plugin, which might or might not fix it. If it doesn't, there will probably be an update along shortly.

Quote from tobega on February 19, 2021, 5:22 am
Quote from dandl on February 16, 2021, 11:21 pm

What you're relying on here is operator recursion, which places it outside the RA. Given the ability to define operators you can construct your own data structures, but that's not what this is about.

So this might be the key observation here, but you'll have to explain a bit more.

As a user of Tutorial-D I am defining an operator that takes relations as parameters and is producing what I (mistakenly) believe to be a relation, using only relational operators in the process, so it certainly looks relational at first glance.

So the key question is, by what criteria do you determine that this is outside the RA? If I am a code reviewer, how can I determine that we are on dangerous ground and that this is perhaps not the code the author intended? Bonus points if we can get a compiler warning here.

The RA is a set of operations on relations. The RA itself does not provide any means to create new attribute values.

  • First, Codd created new values with literals (only) and hinted that more would be needed.
  • TTM defines a type system and a language, in which new values are created outside the RA and imported using language features.
  • The Extended RA of Algebra A suggests creating new values as relational constants (relcons) but gives no formal definitions.
  • I have given formal definitions for creating new values by replacing relcons with pure functions (without a type system or a language).

Pure functions are still first order, just like the RA. They cannot be recursive (a function cannot refer to itself), and they are 'safe': guaranteed to terminate and return a value within the domain.

For the code reviewer it is easy: insist that every function be provably pure.
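As a sketch of the idea (in Python, with invented names and data, not anyone's actual formal definition): an EXTEND-style operation imports new values into a relation by applying a pure, non-recursive function to each tuple's attributes:

```python
def total_price(qty: int, unit_price: int) -> int:
    """Pure: same inputs always yield the same output, no side effects,
    guaranteed to terminate and return a value in the domain."""
    return qty * unit_price

# A relation as a set of tuples: (part, qty, unit_price).
orders = {("widget", 3, 5), ("gadget", 2, 7)}

# EXTEND sketched as a set comprehension: the new attribute's value is
# created by the pure function, not by the relational operators themselves.
extended = {(part, qty, price, total_price(qty, price))
            for (part, qty, price) in orders}

print(sorted(extended))  # [('gadget', 2, 7, 14), ('widget', 3, 5, 15)]
```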

Thanks for the attempt to come up with a constructive idea! Unfortunately, it seems that generally accepted computer science allows pure functions to be recursive. There is also a proof that recursion can always be rewritten as iteration and vice versa. I'm fairly certain that the recursion here is equivalent to the iterative sequence you specify in https://forum.thethirdmanifesto.com/forum/topic/formal-definitions-for-relation-constant-transitive-closure-extended-transitive-closure/ and suffers from the same problems when it comes to formulating the successor function.

Clearly you missed my point. The 'pure functions' I refer to are an abstraction, comparable to a maths function. They take arguments and always return the same value for those arguments, with no side effects and without failing. This 'pure function' is a requirement for the formal definition. It has nothing to do with writing implementation code.

However, your formal specification for the successor function did give me an idea that it all comes down to the predicate.

No, it doesn't. The formal specification is complete as it stands. Writing a better predicate does not mean that your unsafe code becomes safe.

In my buggy code the predicate would be "The container is the immediate container of the contained in a chain from the outer container in which the contained will appear as many times as the sum of similar chains of which this amount is a component"

I think we concluded previously in another discussion that a predicate containing a disjunction, such as "Employee E works in department D or on project J", must inevitably lead to trouble. In the same way I think we can surmise that any imprecision in the predicate is likely to be problematic. What do you mean by "a chain"? What do you mean by "a component"? What is the "outer container" that seems to be a major player in this predicate? Also "similar chains" might alert us to a duplication problem.

Rewriting the code until the predicate is crisp enough, viz. "The container is the outermost container that contains this amount of the contained via the given path", with the actual path as a part of the relation will give a correct answer. If we wish, we could replace the path with a generated unique identifier, in which case "the given path" would be "the path uniquely identified by the given tempid".

// Here we have a solution that uses relational algebra correctly, where each tuple states the fact that
// "The outer container contains this amount of the contained type via the given path".
// For sake of illustration, the actual path is generated, but it could be represented by a tempid as above.
// If the entire bagrules relation is input, the answer for all bags will be given at the same time.
templates contents3
  when <?($::count <=0>)> do $!
  otherwise
    $({container:, path:, multiplier: §.amount, z: §.contained})
      -> ($ join $bagrules({z: §.container, contained:, amount:}))
      -> $({container:, contained:, amount: §.amount * §.multiplier, path: [§.path..., §.z]})
      -> ($ union ($ -> contents3)) !
end contents3

($bagrules join {[{container: 'shiny gold', path: []}]}) -> ($ union ($ -> contents3))
-> $(by {[{container: 'shiny gold'}]} collect {totalInside:Sum&{of: :(amount:)}}) ... -> '$.totalInside;
' -> !OUT::write
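For readers who don't follow Tailspin, the same idea can be sketched in Python (my own illustration, not the code above; the bag rules are invented sample data, assumed acyclic): each tuple carries its path, so set semantics cannot collapse two distinct chains that happen to agree on container, contained and amount:

```python
# Bag rules as a relation: (container, contained, amount).
rules = {
    ("shiny gold", "dark red", 2),
    ("dark red", "dark orange", 2),
    ("dark orange", "dark yellow", 2),
}

def contents(frontier):
    """Each tuple is (outer_container, contained, amount, path).
    Join the frontier against the rules, multiplying amounts and
    extending the path, then union in the recursive step."""
    if not frontier:
        return set()
    step = {
        (outer, inner, amt * n, path + (mid,))
        for (outer, mid, amt, path) in frontier
        for (container, inner, n) in rules
        if container == mid
    }
    return step | contents(step)

start = {("shiny gold", c, n, ()) for (b, c, n) in rules if b == "shiny gold"}
closure = start | contents(start)
print(sum(amt for (_, _, amt, _) in closure))  # 2 + 4 + 8 = 14
```

Dropping the path attribute from the tuples would reintroduce the bug: two chains with equal amounts would merge and the sum would come out too small.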
One more thing that falls out of this is the true power of the relational algebra to let us calculate many results at once: we can simply pass the whole original relation as a parameter to our corrected function, whereas the buggy attempt would just have produced rubbish.
It might be interesting to consider which exact properties we are justified in requiring of a predicate for it to be "crisp" enough. Can we say that a workable relation must consist of a conjunction of facts in definite form? Does it need to be self-contained?
A predicate for a relation is a statement about a fact held to be true in the universe of discourse. The question you put is beyond my comprehension.
On another tack, I will still note that the SQL recursive CTE version will calculate the correct value even for the sloppy predicates, so it is arguably easier to use. Since we seem to agree that a language should make it easier for programmers to get correct results and harder for them to create obscure bugs, judging by the strong stance on static typing, I still worry that we might be straining at gnats and swallowing camels. In studies performed, static languages do not have lower defect rates overall, and as long as the language has strong typing, any type errors are easily found, so the bugs are not obscure. But despite the code review solution, the bugs from accidental deduplication will probably be both hard to detect and hard to understand.
Bunk. Please quote your source.
The question then would be how easy it is to create bugs that are hard to detect and difficult to understand if we allow duplication. If the amount of duplication is semi-unpredictable, as https://perlmonks.org/?node_id=515776 seems to show, then I suppose we just have to crisp up our thinking and deduplicate.

Makes no sense to me and leads nowhere interesting.

Andl - A New Database Language - andl.org
Quote from Dave Voorhis on February 19, 2021, 9:40 am
Quote from tobega on February 19, 2021, 5:22 am
On another tack, I will still note that the SQL recursive CTE version will calculate the correct value even for the sloppy predicates, so it is arguably easier to use. Since we seem to agree that a language should make it easier for programmers to get correct results and harder for them to create obscure bugs, judging by the strong stance on static typing, I still worry that we might be straining at gnats and swallowing camels. In studies performed, static languages do not have lower defect rates overall, and as long as the language has strong typing, any type errors are easily found, so the bugs are not obscure.

Comparing defect rates between statically and dynamically typed languages strikes me as largely missing the point. I wouldn't expect a dynamically-typed program to be any buggier than an equivalent statically-typed program -- assuming the developer is competent.

The real difference is the amount of effort required to create the program.

Dynamically-typed languages are deceptively simple; they lure you in with fewer keystrokes only to bite you with having to do type inference in your head. In reality, dynamically-typed languages often take more effort -- sometimes a lot more effort -- to read and maintain, because you're having to mentally do the type inference and type checking that the compiler would do for you. You can gain some of that back by using IDEs that do it for you, plus you usually write a gaggle of unit tests (but that's more effort you need to make). And to be fair, injudicious use of type inference over manifest types in statically-typed languages that support type inference can be almost as bad as dynamic typing without manifest typing, from a readability point of view.

With small programs (for an undefined value of "small"), the difference between statically-typed and dynamically-typed development/maintenance effort is also small and might indeed favour fewer keystrokes, but the development/maintenance difficulty of dynamically-typed programs grows exponentially with program size. It only takes a few thousand lines of Python or PHP or JavaScript for development and maintenance to become a nightmare, whilst an equivalent-sized Java or C# program -- of the sort often deprecated for being "too verbose" -- remains relatively easy to read and maintain.

Just so. I find dynamic languages preferable for short programs that use mostly scalars (strings and numbers), and horrible for big programs with classes. The killer is the unit tests you need just to do what the compiler should have done for you.

Andl - A New Database Language - andl.org
Quote from Dave Voorhis on February 19, 2021, 11:35 am
Quote from tobega on February 19, 2021, 11:09 am
Quote from Dave Voorhis on February 19, 2021, 9:40 am
Quote from tobega on February 19, 2021, 5:22 am
On another tack, I will still note that the SQL recursive CTE version will calculate the correct value even for the sloppy predicates, so it is arguably easier to use. Since we seem to agree that a language should make it easier for programmers to get correct results and harder for them to create obscure bugs, judging by the strong stance on static typing, I still worry that we might be straining at gnats and swallowing camels. In studies performed, static languages do not have lower defect rates overall, and as long as the language has strong typing, any type errors are easily found, so the bugs are not obscure.

Comparing defect rates between statically and dynamically typed languages strikes me as largely missing the point. I wouldn't expect a dynamically-typed program to be any buggier than an equivalent statically-typed program -- assuming the developer is competent.

The real difference is the amount of effort required to create the program.

At the risk of getting into an undesirable side-discussion, you are actually not entirely correct: effort also seems to be on a par (but you may be confusing the static/dynamic distinction with the weakly/strongly typed one). Capers Jones did a study on hours spent per function point, and also on bug frequency, for lots of languages, and Smalltalk was one of the most efficient, almost twice as productive as Java/C#. Smalltalk is dynamic but strongly typed. On the other hand, JavaScript (weak, dynamic) lagged significantly behind Java/C#. I believe that JavaScript programmers deceive themselves by confusing activity with productivity.

Smalltalk is a rather special case, an uber-consistent language with an unusually well-integrated toolset. Is Smalltalk productivity due to its dynamic typing, or in spite of it?

I'd argue it's in spite of it. A comparison of Smalltalk vs Strongtalk would be interesting.

I think the main take-away is that the effect of typing is both smaller and different to what we tend to imagine.

Research aside, I don't think anyone who has actually had to maintain real large code bases would argue that the difference in maintenance effort between the popular dynamically typed languages (PHP, Python, JavaScript, Ruby, shell scripts) and statically typed, (mainly) manifest-typed languages is anything but profound, unless you're fortunate enough to be working on a dynamically-typed code base that uses only primitive types. Then it can be manageable. If it defines classes extensively (e.g., PHP, Python), then it's a nightmare.

In maintenance I would say the documentation effect is very strong. Any code with explicitly stated types is much easier to analyze, so without them you would have to spend much more time just understanding how and where to modify the code.

When it comes to Python there are plenty of weaknesses, especially in how classes are defined, making it hard to understand what was defined and where, but that's not necessarily down to typing. I have had to try to maintain a largeish chunk of Python classes, and since I neither know nor like the language, it was a nightmare for me. But the entire production code of YouTube is written in Python; they have had to evolve some specific disciplines, but they seem to be managing all right. As a counterpoint, the code review tool that had been written in Python by none less than Guido van Rossum himself was rewritten in Java to make it more maintainable.