## The Forum for Discussion about The Third Manifesto and Related Matters

Forum breadcrumbs - You are here:
Please or Register to create posts and topics.

# Relations as sets of columns

Quote from Erwin on April 15, 2021, 3:13 pm
Quote from Vadim on April 14, 2021, 4:54 pm

As you have mentioned, a relation is not really a set, but a pair of header and body, each considered to be a set. Again, this traditional view has been challenged (or complemented) by the "columnar" database community, which considers a relation as a set of columns. There has to be conforming condition when "gluing" columns to construct relations of higher arity, of course, but the situation is similar to requiring tuples to match in conventional database theory.

I'll take one more step, and suggest that using "a set of" in database foundation is problematic. First of all, tuple- and column- perspectives clearly disagree what the set structure of a relation is. Second, mathematics field gradually shifted from emphasizing object's set structure to categorical view. Technically speaking, only some categories are structured sets. It would be interesting to formalize database relations in terms of objects and arrows, and I wonder if David Spivak's database category is the right one.

I'm wondering what the "guaranteed access rule" would look like in such a "columnar perspective".  You cannot in any way refer to "that unique tuple which has some particular given [combination of] key attribute value[s] because both "combinations of key values" and tuples themselves are inherently undefinable and can only be introduced if you first reconstitute the relation [or the relational structure] from the "columns".

Those 12 rules are informal requirements for practical RDBMS. A more formal question would be "Can you describe the set structure of column based DBMS, and specify classic relational algebra operations in those terms?"  Here is a semiformal sketch:

A column based database is a set of relations. A relation is an equivalence class of tables. A table is a map: attribute name -> set of values, where the sets of values match each other. For a table there is no requirements of the duplicate row exclusion, we simply consider tables with duplicates as equivalent. For example, the tables

Attr1 -> {1,2,2}
Attr2 -> {a,b,b}

and

Attr1 -> {1,2,2,2}
Attr2 -> {a,b,b,b}

are considered equivalent. A natural representative of the equivalence class in the above example would be

Attr1 -> {1,2}
Attr2 -> {a,b}

Quote from Vadim on April 15, 2021, 8:21 pm
Quote from Erwin on April 15, 2021, 3:13 pm

Those 12 rules are informal requirements for practical RDBMS. A more formal question would be "Can you describe the set structure of column based DBMS, and specify classic relational algebra operations in those terms?"  Here is a semiformal sketch:

A column based database is a set of relations. A relation is an equivalence class of tables. A table is a map: attribute name -> set of values, where the sets of values match each other. For a table there is no requirements of the duplicate row exclusion, we simply consider tables with duplicates as equivalent. For example, the tables

Attr1 -> {1,2,2}
Attr2 -> {a,b,b}

and

Attr1 -> {1,2,2,2}
Attr2 -> {a,b,b,b}

are considered equivalent. A natural representative of the equivalence class in the above example would be

Huh? So what does SELECT COUNT(*) FROM <this equiv class); return? Can you declare keys/UNIQUE constraint? What does it mean to have a key?

How many rows returned from SELECT * FROM <this equiv class> NATURAL JOIN <this equiv class>? IOW is NATURAL JOIN idempotent?

I don't see anything "simply" about it. And I don't think this is how columnar databases are implemented. Supporting keys/uniqueness (at least in base tables) is pretty fundamental, despite SQL making a poor job of it.

The example

Attr1 -> {1,2,2}
Attr2 -> {a,b,b}

is faulty: what are those repeated elements inside the sets?

The column elements have to be organized into a map. Therefore, a table is a map of columns, where each column value is map. No set in the definition whatsoever!

The corrected example:

{
Attr1 -> {row#1->1, row#2->2, row#3->2},
Attr2 -> {row#1->a, row#2->b, row#3->b}
}

Yes, in one form or another, row identifiers seems to be unavoidable.

Now, defining the equivalence among the tables becomes less obvious. For example, how the above table with 3 rows is equivalent to this

{
Attr1 -> {rx->1, ry->2},
Attr2 -> {rx->a, ry->b}
}

For relational operators, first we have to define them over the tables. Then, we need to prove that those operations are well defined, that is respect the equivalence classes.  For example, if we join relations A and B, then we pick some representatives - tables - from each equivalence class, calculate the join over those table objects,  and the result would be another table, which should represent the relation $A \bowtie B$. Any other choice of table representatives for A and B should give the table from the same equivalence class, that is the same relation.

Now, since we have row ids, then defining the relation key seems to be easy: just require the map to be bijection. In the above example

{
Attr1 -> {rx->1, ry->2},
Attr2 -> {rx->a, ry->b}
}

both attribute values are bijective maps.  This definition however, seems to depend on the choice of table representative, as it fails for the tables with duplicate rows...