Aggregation in TD and Rel
Quote from dandl on March 28, 2020, 5:44 amAccording to the TD spec I have, there are 5 aggregation functions for numbers: COUNT, SUM, AVG, MAX, MIN. Questions:
- Is that it? Does Rel have more? Should there be?
- Is it possible to aggregate user-defined types (such as DATE or POINT)?
According to the TD spec I have, there are 5 aggregation functions for numbers: COUNT, SUM, AVG, MAX, MIN. Questions:
- Is that it? Does Rel have more? Should there be?
- Is it possible to aggregate user-defined types (such as DATE or POINT)?
Quote from johnwcowan on March 28, 2020, 6:32 amIt needs to be a commutative and associative operator, since the operands are unordered. Looking at the list of reducible APL operators, you could add product for numbers and conjunction and disjunction for booleans.
In addition, here are some possibly useful derived functions: sum of squares, harmonic sum, geometric mean, harmonic mean, population and sample variance, population and sample standard deviation, and range (max - min). The median and mode are also useful, but more-than-linear to compute. Skewness and kurtosis are more remote possibilities.
It needs to be a commutative and associative operator, since the operands are unordered. Looking at the list of reducible APL operators, you could add product for numbers and conjunction and disjunction for booleans.
Quote from Dave Voorhis on March 28, 2020, 8:51 amQuote from dandl on March 28, 2020, 5:44 amAccording to the TD spec I have, there are 5 aggregation functions for numbers: COUNT, SUM, AVG, MAX, MIN. Questions:
- Is that it? Does Rel have more? Should there be?
- Is it possible to aggregate user-defined types (such as DATE or POINT)?
It provides the ones defined in Tutorial D and no more. These:
sys.OperatorsBuiltin WHERE STARTS_WITH(Name, "AGGREGATE")
Name
CHARACTERSignature
CHARACTERReturnsType
CHARACTERDefinition
CHARACTERAGGREGATE_AND AGGREGATE_AND(ARRAY TUPLE {}) BOOLEAN // Logical AND of r
AGGREGATE_AND(r ARRAY OF TUPLE {AGGREGAND BOOLEAN, AGGREGATION_SERIAL INT}) RETURNS BOOLEANAGGREGATE_SUM_RATIONAL AGGREGATE_SUM_RATIONAL(ARRAY TUPLE {}) RATIONAL // RATIONAL sum of r
AGGREGATE_SUM_RATIONAL(r ARRAY OF TUPLE {AGGREGAND RATIONAL, AGGREGATION_SERIAL INT}) RETURNS RATIONALAGGREGATE_OR AGGREGATE_OR(ARRAY TUPLE {}) BOOLEAN // Logical OR of r
AGGREGATE_OR(r ARRAY OF TUPLE {AGGREGAND BOOLEAN, AGGREGATION_SERIAL INT}) RETURNS BOOLEANAGGREGATE_UNION AGGREGATE_UNION(ARRAY TUPLE {}) RELATION {} // UNION of r
AGGREGATE_UNION(r ARRAY OF TUPLE {AGGREGAND RELATION {*}, AGGREGATION_SERIAL INT}) RETURNS RELATION {*}AGGREGATE_D_UNION AGGREGATE_D_UNION(ARRAY TUPLE {}) RELATION {} // disjoint-UNION of r
AGGREGATE_DUNION(r ARRAY OF TUPLE {AGGREGAND RELATION {*}, AGGREGATION_SERIAL INT}) RETURNS RELATION {*}AGGREGATE_EQUIV AGGREGATE_EQUIV(ARRAY TUPLE {}) BOOLEAN // Logical EQUIV (aka '=') of r
AGGREGATE_EQUIV(r ARRAY OF TUPLE {AGGREGAND BOOLEAN, AGGREGATION_SERIAL INT}) RETURNS BOOLEANAGGREGATE_MAX AGGREGATE_MAX(ARRAY TUPLE {}) RATIONAL // Maximum of r
AGGREGATE_MAX(r ARRAY OF TUPLE {AGGREGAND ALPHA, AGGREGATION_SERIAL INT}) RETURNS ALPHAAGGREGATE_MIN AGGREGATE_MIN(ARRAY TUPLE {}) RATIONAL // Minimum of r
AGGREGATE_MIN(r ARRAY OF TUPLE {AGGREGAND ALPHA, AGGREGATION_SERIAL INT}) RETURNS ALPHAAGGREGATE_XUNION AGGREGATE_XUNION(ARRAY TUPLE {}) RELATION {} // exclusive-UNION of r
AGGREGATE_XUNION(r ARRAY OF TUPLE {AGGREGAND RELATION {*}, AGGREGATION_SERIAL INT}) RETURNS RELATION {*}AGGREGATE_INTERSECT AGGREGATE_INTERSECT(ARRAY TUPLE {}) RELATION {} // INTERSECT of r
AGGREGATE_INTERSECT(r ARRAY OF TUPLE {AGGREGAND RELATION {*}, AGGREGATION_SERIAL INT}) RETURNS RELATION {*}AGGREGATE_AVG_INTEGER AGGREGATE_AVG_INTEGER(ARRAY TUPLE {}) RATIONAL // Arithmetic mean of r
AGGREGATE_AVG_INTEGER(r ARRAY OF TUPLE {AGGREGAND INT, AGGREGATION_SERIAL INT}) RETURNS RATIONALAGGREGATE_AVG_RATIONAL AGGREGATE_AVG_RATIONAL(ARRAY TUPLE {}) RATIONAL // Arithmetic mean of r
AGGREGATE_AVG_INTEGER(r ARRAY OF TUPLE {AGGREGAND RATIONAL, AGGREGATION_SERIAL INT}) RETURNS RATIONALAGGREGATE_SUM_INTEGER AGGREGATE_SUM_INTEGER(ARRAY TUPLE {}) INTEGER // INTEGER sum of r
AGGREGATE_SUM_INTEGER(r ARRAY OF TUPLE {AGGREGAND INT, AGGREGATION_SERIAL INT}) RETURNS INTEGERAGGREGATE_XOR AGGREGATE_XOR(ARRAY TUPLE {}) BOOLEAN // Logical exclusive-OR of r
AGGREGATE_XOR(r ARRAY OF TUPLE {AGGREGAND BOOLEAN, AGGREGATION_SERIAL INT}) RETURNS BOOLEANIt is possible to aggregate user-defined types, but that may involve some work to define what the aggregates mean, by either aggregating on some POSSREP component and/or providing -- where necessary -- the additional operators needed to support aggregating them. See https://reldb.org/c/wp-content/uploads/2016/06/User-Defined-Aggregate-Operators-in-Tutorial-D-and-Rel.pdf
I note in passing that the operators I use most often, in both Rel and SQL, are COUNT, MAX and MIN. Others barely get a look in.
Quote from dandl on March 28, 2020, 5:44 amAccording to the TD spec I have, there are 5 aggregation functions for numbers: COUNT, SUM, AVG, MAX, MIN. Questions:
- Is that it? Does Rel have more? Should there be?
- Is it possible to aggregate user-defined types (such as DATE or POINT)?
It provides the ones defined in Tutorial D and no more. These:
Name CHARACTER |
Signature CHARACTER |
ReturnsType CHARACTER |
Definition CHARACTER |
---|---|---|---|
AGGREGATE_AND | AGGREGATE_AND(ARRAY TUPLE {}) | BOOLEAN | // Logical AND of r AGGREGATE_AND(r ARRAY OF TUPLE {AGGREGAND BOOLEAN, AGGREGATION_SERIAL INT}) RETURNS BOOLEAN |
AGGREGATE_SUM_RATIONAL | AGGREGATE_SUM_RATIONAL(ARRAY TUPLE {}) | RATIONAL | // RATIONAL sum of r AGGREGATE_SUM_RATIONAL(r ARRAY OF TUPLE {AGGREGAND RATIONAL, AGGREGATION_SERIAL INT}) RETURNS RATIONAL |
AGGREGATE_OR | AGGREGATE_OR(ARRAY TUPLE {}) | BOOLEAN | // Logical OR of r AGGREGATE_OR(r ARRAY OF TUPLE {AGGREGAND BOOLEAN, AGGREGATION_SERIAL INT}) RETURNS BOOLEAN |
AGGREGATE_UNION | AGGREGATE_UNION(ARRAY TUPLE {}) | RELATION {} | // UNION of r AGGREGATE_UNION(r ARRAY OF TUPLE {AGGREGAND RELATION {*}, AGGREGATION_SERIAL INT}) RETURNS RELATION {*} |
AGGREGATE_D_UNION | AGGREGATE_D_UNION(ARRAY TUPLE {}) | RELATION {} | // disjoint-UNION of r AGGREGATE_DUNION(r ARRAY OF TUPLE {AGGREGAND RELATION {*}, AGGREGATION_SERIAL INT}) RETURNS RELATION {*} |
AGGREGATE_EQUIV | AGGREGATE_EQUIV(ARRAY TUPLE {}) | BOOLEAN | // Logical EQUIV (aka '=') of r AGGREGATE_EQUIV(r ARRAY OF TUPLE {AGGREGAND BOOLEAN, AGGREGATION_SERIAL INT}) RETURNS BOOLEAN |
AGGREGATE_MAX | AGGREGATE_MAX(ARRAY TUPLE {}) | RATIONAL | // Maximum of r AGGREGATE_MAX(r ARRAY OF TUPLE {AGGREGAND ALPHA, AGGREGATION_SERIAL INT}) RETURNS ALPHA |
AGGREGATE_MIN | AGGREGATE_MIN(ARRAY TUPLE {}) | RATIONAL | // Minimum of r AGGREGATE_MIN(r ARRAY OF TUPLE {AGGREGAND ALPHA, AGGREGATION_SERIAL INT}) RETURNS ALPHA |
AGGREGATE_XUNION | AGGREGATE_XUNION(ARRAY TUPLE {}) | RELATION {} | // exclusive-UNION of r AGGREGATE_XUNION(r ARRAY OF TUPLE {AGGREGAND RELATION {*}, AGGREGATION_SERIAL INT}) RETURNS RELATION {*} |
AGGREGATE_INTERSECT | AGGREGATE_INTERSECT(ARRAY TUPLE {}) | RELATION {} | // INTERSECT of r AGGREGATE_INTERSECT(r ARRAY OF TUPLE {AGGREGAND RELATION {*}, AGGREGATION_SERIAL INT}) RETURNS RELATION {*} |
AGGREGATE_AVG_INTEGER | AGGREGATE_AVG_INTEGER(ARRAY TUPLE {}) | RATIONAL | // Arithmetic mean of r AGGREGATE_AVG_INTEGER(r ARRAY OF TUPLE {AGGREGAND INT, AGGREGATION_SERIAL INT}) RETURNS RATIONAL |
AGGREGATE_AVG_RATIONAL | AGGREGATE_AVG_RATIONAL(ARRAY TUPLE {}) | RATIONAL | // Arithmetic mean of r AGGREGATE_AVG_INTEGER(r ARRAY OF TUPLE {AGGREGAND RATIONAL, AGGREGATION_SERIAL INT}) RETURNS RATIONAL |
AGGREGATE_SUM_INTEGER | AGGREGATE_SUM_INTEGER(ARRAY TUPLE {}) | INTEGER | // INTEGER sum of r AGGREGATE_SUM_INTEGER(r ARRAY OF TUPLE {AGGREGAND INT, AGGREGATION_SERIAL INT}) RETURNS INTEGER |
AGGREGATE_XOR | AGGREGATE_XOR(ARRAY TUPLE {}) | BOOLEAN | // Logical exclusive-OR of r AGGREGATE_XOR(r ARRAY OF TUPLE {AGGREGAND BOOLEAN, AGGREGATION_SERIAL INT}) RETURNS BOOLEAN |
It is possible to aggregate user-defined types, but that may involve some work to define what the aggregates mean, by either aggregating on some POSSREP component and/or providing -- where necessary -- the additional operators needed to support aggregating them. See https://reldb.org/c/wp-content/uploads/2016/06/User-Defined-Aggregate-Operators-in-Tutorial-D-and-Rel.pdf
I note in passing that the operators I use most often, in both Rel and SQL, are COUNT, MAX and MIN. Others barely get a look in.
Quote from dandl on March 28, 2020, 12:41 pmQuote from Dave Voorhis on March 28, 2020, 8:51 amQuote from dandl on March 28, 2020, 5:44 amAccording to the TD spec I have, there are 5 aggregation functions for numbers: COUNT, SUM, AVG, MAX, MIN. Questions:
- Is that it? Does Rel have more? Should there be?
- Is it possible to aggregate user-defined types (such as DATE or POINT)?
It provides the ones defined in Tutorial D and no more. These:
Thanks Dave. I thought that was so. I'm doing aggregation for my Knime nodes, and just thought I'd check.
Andl has generalised aggregation with a fold() function, but that's beyond what I'm trying to do here.
It is possible to aggregate user-defined types, but that may involve some work to define what the aggregates mean, by either aggregating on some POSSREP component and/or providing -- where necessary -- the additional operators needed to support aggregating them. See https://reldb.org/c/wp-content/uploads/2016/06/User-Defined-Aggregate-Operators-in-Tutorial-D-and-Rel.pdf
I recall this. It's a really interesting approach, not one I've seen elsewhere. I'm tempted to use it here.
MAX and MIN for dates is the main one I think.
I note in passing that the operators I use most often, in both Rel and SQL, are COUNT, MAX and MIN. Others barely get a look in.
I use SUM a fair bit if there are money values in the table. Seems kind of reasonable.
Quote from Dave Voorhis on March 28, 2020, 8:51 amQuote from dandl on March 28, 2020, 5:44 amAccording to the TD spec I have, there are 5 aggregation functions for numbers: COUNT, SUM, AVG, MAX, MIN. Questions:
- Is that it? Does Rel have more? Should there be?
- Is it possible to aggregate user-defined types (such as DATE or POINT)?
It provides the ones defined in Tutorial D and no more. These:
Thanks Dave. I thought that was so. I'm doing aggregation for my Knime nodes, and just thought I'd check.
Andl has generalised aggregation with a fold() function, but that's beyond what I'm trying to do here.
It is possible to aggregate user-defined types, but that may involve some work to define what the aggregates mean, by either aggregating on some POSSREP component and/or providing -- where necessary -- the additional operators needed to support aggregating them. See https://reldb.org/c/wp-content/uploads/2016/06/User-Defined-Aggregate-Operators-in-Tutorial-D-and-Rel.pdf
I recall this. It's a really interesting approach, not one I've seen elsewhere. I'm tempted to use it here.
MAX and MIN for dates is the main one I think.
I note in passing that the operators I use most often, in both Rel and SQL, are COUNT, MAX and MIN. Others barely get a look in.
I use SUM a fair bit if there are money values in the table. Seems kind of reasonable.
Quote from Dave Voorhis on March 28, 2020, 1:38 pmQuote from dandl on March 28, 2020, 12:41 pmQuote from Dave Voorhis on March 28, 2020, 8:51 amQuote from dandl on March 28, 2020, 5:44 amAccording to the TD spec I have, there are 5 aggregation functions for numbers: COUNT, SUM, AVG, MAX, MIN. Questions:
- Is that it? Does Rel have more? Should there be?
- Is it possible to aggregate user-defined types (such as DATE or POINT)?
It provides the ones defined in Tutorial D and no more. These:
Thanks Dave. I thought that was so. I'm doing aggregation for my Knime nodes, and just thought I'd check.
Doesn't KNIME provide that sort of thing already?
In other words, can't the output of a relational pipeline be passed to whatever KNIME already provides for statistical and aggregate operations?
Andl has generalised aggregation with a fold() function, but that's beyond what I'm trying to do here.
It is possible to aggregate user-defined types, but that may involve some work to define what the aggregates mean, by either aggregating on some POSSREP component and/or providing -- where necessary -- the additional operators needed to support aggregating them. See https://reldb.org/c/wp-content/uploads/2016/06/User-Defined-Aggregate-Operators-in-Tutorial-D-and-Rel.pdf
I recall this. It's a really interesting approach, not one I've seen elsewhere. I'm tempted to use it here.
MAX and MIN for dates is the main one I think.
I note in passing that the operators I use most often, in both Rel and SQL, are COUNT, MAX and MIN. Others barely get a look in.
I use SUM a fair bit if there are money values in the table. Seems kind of reasonable.
Yes. If I had to rank them, COUNT, MAX and MIN get used constantly. SUM gets used sometimes. The rest, almost never.
Quote from dandl on March 28, 2020, 12:41 pmQuote from Dave Voorhis on March 28, 2020, 8:51 amQuote from dandl on March 28, 2020, 5:44 amAccording to the TD spec I have, there are 5 aggregation functions for numbers: COUNT, SUM, AVG, MAX, MIN. Questions:
- Is that it? Does Rel have more? Should there be?
- Is it possible to aggregate user-defined types (such as DATE or POINT)?
It provides the ones defined in Tutorial D and no more. These:
Thanks Dave. I thought that was so. I'm doing aggregation for my Knime nodes, and just thought I'd check.
Doesn't KNIME provide that sort of thing already?
In other words, can't the output of a relational pipeline be passed to whatever KNIME already provides for statistical and aggregate operations?
Andl has generalised aggregation with a fold() function, but that's beyond what I'm trying to do here.
It is possible to aggregate user-defined types, but that may involve some work to define what the aggregates mean, by either aggregating on some POSSREP component and/or providing -- where necessary -- the additional operators needed to support aggregating them. See https://reldb.org/c/wp-content/uploads/2016/06/User-Defined-Aggregate-Operators-in-Tutorial-D-and-Rel.pdf
I recall this. It's a really interesting approach, not one I've seen elsewhere. I'm tempted to use it here.
MAX and MIN for dates is the main one I think.
I note in passing that the operators I use most often, in both Rel and SQL, are COUNT, MAX and MIN. Others barely get a look in.
I use SUM a fair bit if there are money values in the table. Seems kind of reasonable.
Yes. If I had to rank them, COUNT, MAX and MIN get used constantly. SUM gets used sometimes. The rest, almost never.
Quote from dandl on March 29, 2020, 12:49 amThanks Dave. I thought that was so. I'm doing aggregation for my Knime nodes, and just thought I'd check.
Doesn't KNIME provide that sort of thing already?
In other words, can't the output of a relational pipeline be passed to whatever KNIME already provides for statistical and aggregate operations?
Knime overlaps quite a bit in various areas. My theme is to provide a consistent set of node operators that more or less cover the same territory as SQL, but from an RA perspective so no NULLs and no duplicate rows. They could be used as an education tool, or to replace existing SQL when the input is not an RDBMS. Hence a minimal set of aggregation functions, and not really any motivation to add more.
Thanks Dave. I thought that was so. I'm doing aggregation for my Knime nodes, and just thought I'd check.
Doesn't KNIME provide that sort of thing already?
In other words, can't the output of a relational pipeline be passed to whatever KNIME already provides for statistical and aggregate operations?
Knime overlaps quite a bit in various areas. My theme is to provide a consistent set of node operators that more or less cover the same territory as SQL, but from an RA perspective so no NULLs and no duplicate rows. They could be used as an education tool, or to replace existing SQL when the input is not an RDBMS. Hence a minimal set of aggregation functions, and not really any motivation to add more.