The Forum for Discussion about The Third Manifesto and Related Matters


What do set-based operations buy us?

From an article I came across (https://towardsdatascience.com/apache-arrow-read-dataframe-with-zero-memory-69634092b1a):

Traditionally, data is stored on disk in a row-by-row manner. Columnar storage was born out of the necessity to analyze large datasets and aggregate them efficiently. Data analytics is less interested in rows of data (e.g. one customer transaction, one call log, …) but on aggregations thereof (e.g. total amount spent by customer, total call minutes by region, …).

which tied into a thought I had been pondering about the efficiency of checking for set-ness all the time, i.e. is the SQL bag way more efficient in general? What does a set-algebra buy us of value?

When it comes to storage, having each tuple be unique is just one more constraint, I suppose. I  feel there could be problems when we involve blobs/clobs. Or what about just filenames, where the referenced files have identical contents? If we sometimes have to account for tricky notions of equality and distinctness, is it better to always have to consider these things?

Getting back to data analytics, that is of course vastly different from managing a supplier database. You have data pumping in at an alarming rate, so it's maybe just a completely different thing that should have different tools.

Quote from tobega on February 11, 2021, 8:54 am

From an article I came across (https://towardsdatascience.com/apache-arrow-read-dataframe-with-zero-memory-69634092b1a):

Traditionally, data is stored on disk in a row-by-row manner. Columnar storage was born out of the necessity to analyze large datasets and aggregate them efficiently. Data analytics is less interested in rows of data (e.g. one customer transaction, one call log, …) but on aggregations thereof (e.g. total amount spent by customer, total call minutes by region, …).

which tied into a thought I had been pondering about the efficiency of checking for set-ness all the time, i.e. is the SQL bag way more efficient in general? What does a set-algebra buy us of value?

Sets are obviously semantically cleaner: per relational theory, a duplicate tuple (if it could exist) would redundantly repeat the statement of a fact.

Practically, duplicate tuples/rows have the problem of failing to distinguish multiple things from one thing incorrectly repeated, and the resulting awkwardness with joins, etc., often leads to explicit de-duplication (e.g., SELECT DISTINCT).

In theory, de-duplication could be a relatively expensive operation, but my impression is that in practice it generally isn't. Though I've never looked at how often de-duplication is internally and automatically invoked in (say) Rel's Tutorial D queries, as opposed to having to be explicitly invoked in a bag-oriented query language to achieve semantically-equivalent correct results.
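The contrast can be sketched as a toy in Python (the data is invented for illustration; a list stands in for a SQL-style bag, a set for a relation):

```python
# Hypothetical toy data: (supplier, part) pairs, invented for illustration.
rows = [("S1", "P1"), ("S1", "P2"), ("S2", "P1")]

# Bag-style projection onto supplier keeps duplicates, as SQL does by default:
bag_projection = [s for (s, p) in rows]
assert bag_projection == ["S1", "S1", "S2"]

# Set-style projection de-duplicates automatically, as a set-based system must;
# this is what SELECT DISTINCT has to request explicitly:
set_projection = {s for (s, p) in rows}
assert set_projection == {"S1", "S2"}
```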

When it comes to storage, having each tuple be unique is just one more constraint, I suppose. I  feel there could be problems when we involve blobs/clobs. Or what about just filenames, where the referenced files have identical contents? If we sometimes have to account for tricky notions of equality and distinctness, is it better to always have to consider these things?

This is a potential problem, and it does require clear definition of equality for the purposes of comparison. Blobs/clobs and the like can be relatively inexpensively hashed (for some undefined, size-and-system dependent value of "inexpensively") and hashes compared, but some equality comparisons are NP-hard. E.g., are these two lambda expressions equal, where "equal" is defined as semantically identical?

In that case, you have to fall back to string comparisons.
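The cheap-hash comparison described above can be sketched in Python (a toy illustration; the data and the choice of SHA-256 are mine, and this only addresses value equality of byte contents, not semantic equality of expressions):

```python
import hashlib

def digest(blob: bytes) -> str:
    # SHA-256 of the content; equal digests are treated as equal values here,
    # since collisions are astronomically unlikely in practice.
    return hashlib.sha256(blob).hexdigest()

blob_a = b"x" * 1_000_000
blob_b = b"x" * 1_000_000
blob_c = b"x" * 999_999 + b"y"

assert digest(blob_a) == digest(blob_b)  # identical contents compare equal
assert digest(blob_a) != digest(blob_c)  # one differing byte changes the hash
```

Comparing fixed-length digests is O(1) per pair once each blob has been hashed once, which is the sense in which it is "inexpensive".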

Getting back to data analytics, that is of course vastly different from managing a supplier database. You have data pumping in at an alarming rate, so it's maybe just a completely different thing that should have different tools.

Yes.

There will always be "big data" (in the original meaning of the term, not the modern "a big spreadsheet" common usage) problems that stretch the bounds of system capability. Not long before leaving academia, I was very peripherally involved with science experiments that generate something on the order of a terabyte of sensor data per second, all day long. You expect that sort of thing to require specialist systems that make compromises everywhere, not to maximise throughput but to make getting useful results possible.

I'm the forum administrator and lead developer of Rel. Email me at dave@armchair.mb.ca with the Subject 'TTM Forum'. Download Rel from https://reldb.org
Quote from tobega on February 11, 2021, 8:54 am

From an article I came across (https://towardsdatascience.com/apache-arrow-read-dataframe-with-zero-memory-69634092b1a):

Traditionally, data is stored on disk in a row-by-row manner. Columnar storage was born out of the necessity to analyze large datasets and aggregate them efficiently. Data analytics is less interested in rows of data (e.g. one customer transaction, one call log, …) but on aggregations thereof (e.g. total amount spent by customer, total call minutes by region, …).

I'd be careful to distinguish stored-on-disk vs dynamic query results, which typically are represented only in memory, and might use all sorts of cunning storage structures. Column stores on disk are the norm with data warehouses, where you're concerned about efficiency of queries/aggregates rather than efficiency of transaction update.

which tied into a thought I had been pondering about the efficiency of checking for set-ness all the time, i.e. is the SQL bag way more efficient in general? What does a set-algebra buy us of value?

Firstly, remember that all tables/relvars have keys, and RM VSS 2 ('infer candidate keys') expects that query results will usually have keys too. There's only a risk of losing set-ness if the query projects away attributes of keys, and that can be detected at 'compile time' for the query. The query engine is not checking for set-ness "all the time"; it is checking for uniqueness of keys at every update.

What does SQL lose by dealing in bags? Since the quote is talking about aggregates: your COUNTs stop making sense, and your aggregates over subsets of the relation don't add back up to the aggregate for the whole. Part P#123 has a total quantity of 250 on order. OK, break that down by Supplier, so we project away the Purchase Order number. Supplier S#567 appears twice: is that one order of qty 25, or two? Is the total 250 or only 225? (The sort of bug that hardly ever shows up, because usually we place only one order with each Supplier for each product; or at least the orders are for different quantities.)
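The P#123 example can be run as a toy Python sketch (invented data; a list stands in for the bag, a set for a naive projection that forgets to keep the key): projecting away the order number as a set collapses the two equal S#567 rows, giving 225 instead of 250, which is exactly the ambiguity described.

```python
# Hypothetical purchase orders: (po_num, supplier, part, qty). Invented data.
orders = [
    ("PO1", "S567", "P123", 25),
    ("PO2", "S567", "P123", 25),
    ("PO3", "S890", "P123", 200),
]

# Total quantity on order for the part:
total = sum(qty for (_, _, _, qty) in orders)
assert total == 250

# Project away the PO number as a bag: the per-supplier figures still add up.
bag = [(s, qty) for (_, s, _, qty) in orders]
assert sum(q for (_, q) in bag) == 250

# Project away the PO number as a set: the two identical S567 rows collapse,
# and the breakdown silently loses 25. A set-based query must keep the key
# (or aggregate per supplier) rather than project it away.
as_set = {(s, qty) for (_, s, _, qty) in orders}
assert sum(q for (_, q) in as_set) == 225
```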

When it comes to storage, having each tuple be unique is just one more constraint, I suppose. I  feel there could be problems when we involve blobs/clobs. Or what about just filenames, where the referenced files have identical contents? If we sometimes have to account for tricky notions of equality and distinctness, is it better to always have to consider these things?

In storage even SQL insists every table has a key. The engine doesn't compare whole tuples.

"Files"? Not sure I follow. Why should two referenced blobs having the same value be different to two Quantity attributes on different tuples having the same value? Probably things like blobs are held in specialised data structure on disk, with the tables holding the address only. So opportunity for sharing.

Getting back to data analytics, that is of course vastly different from managing a supplier database. You have data pumping in at an alarming rate, so it's maybe just a completely different thing that should have different tools.

Yes. Typically the transaction store is continually (asynchronously) pumping its updates to the data warehouse.

Re the article: the buzz at the beginning is about natural language processing [**]. Then there's a complete disconnect, and it goes on to talk about (relatively conventional) tabular data stored in conventional column stores. (Been around since the late '60s, it says there. Chris Date wrote, a long time ago, a much more sober piece about one particular DBMS.)

That breathless stuff in the article about compressed storage is a) utterly not new; b) described in terms that are utter bollox, to use a technical term. You hold one index entry for each distinct value of a column. Of course you then don't repeat the same value across multiple tuples; instead you hold multiple tuple-IDs for each value, and you repeat those tuple-IDs in every other column's index. It may be that tuple-IDs give more efficient storage/indexing/scatter characteristics than VARCHARs.
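That value-to-tuple-IDs encoding can be sketched in Python (hypothetical column data; all names are mine):

```python
from collections import defaultdict

# Hypothetical column of city names; the tuple-ID is just the row position.
column = ["London", "Paris", "London", "London", "Paris"]

# One index entry per distinct value, mapping to the tuple-IDs that hold it:
index = defaultdict(list)
for tid, value in enumerate(column):
    index[value].append(tid)

assert dict(index) == {"London": [0, 2, 3], "Paris": [1, 4]}
# Each VARCHAR is stored once; the repetition moves into the tuple-ID lists,
# and those tuple-IDs recur in every other column's index in the same way.
```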

D.L.Childs also claimed impressive performance for queries, back in the day. His 'indexed sets' are perhaps much the same technology.

[**] Anything agog about nlp immediately gets 5,000 demerit points in my book. Have you actually tried Google translate? Or 'Grammarly'? Or any other style-guides?

Quote from Dave Voorhis on February 11, 2021, 9:22 am
Quote from tobega on February 11, 2021, 8:54 am

When it comes to storage, having each tuple be unique is just one more constraint, I suppose. I  feel there could be problems when we involve blobs/clobs. Or what about just filenames, where the referenced files have identical contents? If we sometimes have to account for tricky notions of equality and distinctness, is it better to always have to consider these things?

This is a potential problem, and it does require clear definition of equality for the purposes of comparison. Blobs/clobs and the like can be relatively inexpensively hashed (for some undefined, size-and-system dependent value of "inexpensively") and hashes compared, but some equality comparisons are NP-hard. E.g., are these two lambda expressions equal, where "equal" is defined as semantically identical?

Actually, worse than NP-hard -- not computable; halting-problem equivalent.


So what I'm hearing is that the set-based model helps us avoid accidental duplication in data entry.

But what about accidental deduplication in a projection step? E.g. where two shipments happen to have the same amount of parts? Seems like it would be a very obscure bug.

 

Quote from tobega on February 15, 2021, 6:17 pm

So what I'm hearing is that the set-based model helps us avoid accidental duplication in data entry.

But what about accidental deduplication in a projection step? E.g. where two shipments happen to have the same amount of parts? Seems like it would be a very obscure bug.

Avoiding accidental duplication in data entry -- or in the intermediate steps of a query -- is a practical outcome of a set-based model.

But in a system that is fundamentally about representing facts and meaning -- where the notion of a "duplicate fact" is meaningless -- what would two shipments of the same part, distinguished only by a possibly-different but possibly-identical amount of parts, even mean?

Indeed, it would hopefully be a very unusual bug. It belongs to a certain category of bugs; the kind where the developer who makes that bug in a set-based system would be the source of many bugs in a bag-based system.

Quote from tobega on February 15, 2021, 6:17 pm

So what I'm hearing is that the set-based model helps us avoid accidental duplication in data entry.

But what about accidental deduplication in a projection step? E.g. where two shipments happen to have the same amount of parts? Seems like it would be a very obscure bug.

 

Sorry, I can't see where you heard that.

The set-based model is the direct consequence of viewing a tuple in a relation as the assertion of a fact. If 'there was a shipment of 27 parts' is true, there is no reason to assert it twice. If there is reason to distinguish two such shipments it must be because each shipment is a separate fact to be asserted, eg 'the shipment on Tuesday was of 27 parts' and 'the shipment on Friday was of 27 parts'.

You cannot have a bag of true facts, it has to be a set. If that's not what you need, then choose a different data structure, often a list but sometimes a map.

 

Andl - A New Database Language - andl.org
Quote from dandl on February 15, 2021, 11:31 pm
Quote from tobega on February 15, 2021, 6:17 pm

So what I'm hearing is that the set-based model helps us avoid accidental duplication in data entry.

But what about accidental deduplication in a projection step? E.g. where two shipments happen to have the same amount of parts? Seems like it would be a very obscure bug.

 

Sorry, I can't see where you heard that.

The set-based model is the direct consequence of viewing a tuple in a relation as the assertion of a fact. If 'there was a shipment of 27 parts' is true, there is no reason to assert it twice. If there is reason to distinguish two such shipments it must be because each shipment is a separate fact to be asserted, eg 'the shipment on Tuesday was of 27 parts' and 'the shipment on Friday was of 27 parts'.

You cannot have a bag of true facts, it has to be a set. If that's not what you need, then choose a different data structure, often a list but sometimes a map.

 

It's not that simple. I'm talking about when you start out with a proper relation, but while doing operations on it you must always make sure to keep attributes that may be irrelevant to your final result yet necessary in the intermediate step. Necessary only to avoid deduplication in that step, that is, so it may not be clear why you need them.

It becomes particularly obscure when uniqueness is determined by a combination of attributes and one attribute will be elided in the result.

An example is transitive closure where the definition of TRANCLO carefully avoids the issue by renaming an attribute on both sides of the join:

OPERATOR TRANCLO ( XY RELATION { X P#, Y P# } ) RETURNS RELATION { X P#, Y P# } ;
  RETURN ( WITH ( XY UNION ( ( XY RENAME (Y AS Z) ) COMPOSE
                             ( XY RENAME (X AS Z) ) ) ) AS TTT:
    IF TTT = XY THEN TTT        /* unwind recursion */
    ELSE TRANCLO ( TTT )        /* recursive invocation */
    END IF ) ;
END OPERATOR
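For intuition, the same fixpoint computation can be sketched in set-based Python (a loose rendition of TRANCLO's behaviour, not Tutorial D semantics verbatim; the example pairs are mine):

```python
def tranclo(xy):
    """Transitive closure of a set of (x, y) pairs, mirroring TRANCLO:
    union XY with its self-composition, repeat until a fixpoint."""
    while True:
        # Compose XY with itself on the shared middle attribute.
        step = {(x, y2) for (x, z) in xy for (z2, y2) in xy if z == z2}
        ttt = xy | step
        if ttt == xy:        # fixpoint reached: "unwind recursion"
            return xy
        xy = ttt

pairs = {("a", "b"), ("b", "c"), ("c", "d")}
assert tranclo(pairs) == {("a", "b"), ("b", "c"), ("c", "d"),
                          ("a", "c"), ("b", "d"), ("a", "d")}
```

Because both sides carry the same two attributes, the set union can only ever merge genuinely identical pairs, which is why TRANCLO itself is safe from the deduplication trap.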

But if you add a part count to that, as in the explosion of parts example in recommendation 6 of chapter 10 of DTATRM, it becomes more complicated and the book doesn't attempt to define the operator, merely describe its properties.

I had exactly this problem in the Advent of Code day 7 puzzle that is part of driving the Tailspin implementation, and I fell neatly into the trap, although it did give the correct answer for my case. So I have a bug where I get the correct answer most of the time, but not when I least suspect it.

templates contents
  when <?($::count <=0>)> do $!
  otherwise
    $({multiplier: §.amount, container: §.contained})
      -> ($ join $bagrules) -> $({container:, contained:, amount: §.amount * §.multiplier})
      -> ($ union ($ -> contents)) !
end contents

or, a clumsy attempt to translate it to Tutorial-D (Edit: corrected)

OPERATOR CONTENTS( CCA RELATION { CONTAINER C_ID, CONTAINED C_ID, AMOUNT INTEGER }, RULES RELATION { CONTAINER C_ID, CONTAINED C_ID, AMOUNT INTEGER } )
    RETURNS RELATION { CONTAINER C_ID, CONTAINED C_ID, AMOUNT INTEGER } ;
  IF ISEMPTY(CCA)
    RETURN CCA;
  END IF
  RETURN (WITH ((RENAME CCA {CONTAINED, AMOUNT} (CONTAINED AS CONTAINER, AMOUNT AS MULTIPLIER)) JOIN RULES) AS LEFT,
    ((LEFT EXTEND(TOTAL AS MULTIPLIER*AMOUNT)){CONTAINER, CONTAINED, TOTAL} RENAME(TOTAL AS AMOUNT)) AS NEXT:
    NEXT UNION CONTENTS(NEXT, RULES);
END OPERATOR

So while I understand the idea of the purity of stating a fact once and once only, and that purity usually pays off in the end, I am still asking what we gain by mandating deduplication all the way, compared to what we might lose.

The idea of using a list instead is not bad, but are we then perhaps in the situation where it is prudent to always work with lists/bags until explicit deduplication is asked for?

Note that I am not suggesting relvars be lists or bags, just the values resulting from relational operations.

Quote from tobega on February 16, 2021, 5:38 am
Quote from dandl on February 15, 2021, 11:31 pm
Quote from tobega on February 15, 2021, 6:17 pm

So what I'm hearing is that the set-based model helps us avoid accidental duplication in data entry.

But what about accidental deduplication in a projection step? E.g. where two shipments happen to have the same amount of parts? Seems like it would be a very obscure bug.

 

Sorry, I can't see where you heard that.

The set-based model is the direct consequence of viewing a tuple in a relation as the assertion of a fact. If 'there was a shipment of 27 parts' is true, there is no reason to assert it twice. If there is reason to distinguish two such shipments it must be because each shipment is a separate fact to be asserted, eg 'the shipment on Tuesday was of 27 parts' and 'the shipment on Friday was of 27 parts'.

You cannot have a bag of true facts, it has to be a set. If that's not what you need, then choose a different data structure, often a list but sometimes a map.

 

It's not that simple. I'm talking about when you start out with a proper relation but while doing operations on it, you must always make sure of keeping facts that may be irrelevant to your final result but necessary in the intermediate step. But necessary only to avoid deduplication in that step, so it may not be clear why you need it.

It becomes particularly obscure when uniqueness is determined by a combination of attributes and one attribute will be elided in the result.

Ah, now I understand what you don't understand. You are describing aggregation. The aggregate of your bag is a tuple with an attribute that is the count.

An example is transitive closure where the definition of TRANCLO carefully avoids the issue by renaming an attribute on both sides of the join:

OPERATOR TRANCLO ( XY RELATION { X P#, Y P# } ) RETURNS RELATION { X P#, Y P# } ;
  RETURN ( WITH ( XY UNION ( ( XY RENAME (Y AS Z) ) COMPOSE
                             ( XY RENAME (X AS Z) ) ) ) AS TTT:
    IF TTT = XY THEN TTT        /* unwind recursion */
    ELSE TRANCLO ( TTT )        /* recursive invocation */
    END IF ) ;
END OPERATOR

But if you add a part count to that, as in the explosion of parts example in recommendation 6 of chapter 10 of DTATRM, it becomes more complicated and the book doesn't attempt to define the operator, merely describe its properties.

Correct. This is the operator referred to as GTC. I can tell you exactly how to do that, and I have a formal description. I published it here, but I don't think anyone paid it much attention. See https://forum.thethirdmanifesto.com/forum/topic/formal-definitions-for-relation-constant-transitive-closure-extended-transitive-closure/.

I had exactly this problem in the adventofcode day 7 that is part of driving the Tailspin implementation and I fell neatly into the trap, although it did give the correct answer for my case. So I have a bug where I get the correct answer most of the time, but not when I least suspect it.

templates contents
  when <?($::count <=0>)> do $!
  otherwise
    $({multiplier: §.amount, container: §.contained})
      -> ($ join $bagrules) -> $({container:, contained:, amount: §.amount * §.multiplier})
      -> ($ union ($ -> contents)) !
end contents

or, a clumsy attempt to translate it to Tutorial-D

OPERATOR CONTENTS( CCA RELATION { CONTAINER C_ID, CONTAINED C_ID, AMOUNT INTEGER }, RULES RELATION { CONTAINER C_ID, CONTAINED C_ID, AMOUNT INTEGER } )
    RETURNS RELATION { CONTAINER C_ID, CONTAINED C_ID, AMOUNT INTEGER } ;
  IF ISEMPTY(CCA)
    RETURN CCA;
  END IF
  RETURN (WITH ((RENAME CCA {CONTAINED, AMOUNT} (CONTAINED AS CONTAINER, AMOUNT AS MULTIPLIER)) JOIN RULES) AS NEXT:
    ((NEXT EXTEND(TOTAL AS MULTIPLIER*AMOUNT)){CONTAINER, CONTAINED, TOTAL} RENAME(TOTAL AS AMOUNT)) UNION CONTENTS(NEXT, RULES);
END OPERATOR

What you're relying on here is operator recursion, which places it outside the RA. Given the ability to define operators you can construct your own data structures, but that's not what this is about.

Basically it's two steps.

  1. Add extra tuples with AMOUNT calculated from the two parent tuples (this is my Extended TC).
  2. Aggregate the tuples to get the final result (which D&D call Generalised TC).
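Those two steps can be sketched in Python (my own rendition, under the assumption that carrying the derivation path as an attribute is one acceptable way to stop distinct derivations collapsing; the formal definitions referenced above may differ in detail, and the rules are assumed acyclic):

```python
from collections import defaultdict

def parts_explosion(rules):
    """Bill-of-materials explosion over (container, contained, amount) rules.
    Assumes the rules form a DAG (no cyclic containment)."""
    # Step 1 (roughly the "extended TC"): derive one tuple per derivation
    # *path*, keeping the path itself as an attribute so that two paths with
    # equal multiplied amounts cannot collapse into one tuple.
    paths = {(c1, (c1, c2), a) for (c1, c2, a) in rules}
    frontier = paths
    while frontier:
        new = {(c1, path + (c3,), a * a2)
               for (c1, path, a) in frontier
               for (c2, c3, a2) in rules if path[-1] == c2}
        new -= paths
        paths |= new
        frontier = new
    # Step 2 (roughly the "generalised TC"): aggregate over paths, summing
    # the amounts per (container, contained) pair.
    totals = defaultdict(int)
    for (c1, path, a) in paths:
        totals[(c1, path[-1])] += a
    return dict(totals)

# Two derivation paths from A to D, both contributing amount 2: the path
# attribute keeps them distinct, so the aggregate is 4, not a collapsed 2.
result = parts_explosion({("A", "B", 1), ("A", "C", 1),
                          ("B", "D", 2), ("C", "D", 2)})
assert result[("A", "D")] == 4
```

Without the path attribute, step 1 would produce the set {(A, D, 2)} for both routes and step 2 would report 2, which is precisely the deduplication trap under discussion.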

So while I understand the idea of the purity of stating a fact once and once only, and that purity usually pays off in the end, I am still asking what we gain by mandating deduplication all the way, compared to what we might lose.

The idea of using a list instead is not bad, but are we then perhaps in the situation where it is prudent to always work with lists/bags until explicit deduplication is asked for?

Note that I am not suggesting relvars be lists or bags, just the values resulting from relational operations.

I don't know of such a case, and this is certainly not one.
