The Forum for Discussion about The Third Manifesto and Related Matters


Big Data/Data Science/AI, etc

Quote from Dave Voorhis on June 20, 2019, 10:52 am

>>  it's certainly entirely reasonable to define streams and operations on streams.

I agree.
I was just wondering about the logical model that Stonebraker (and/or others) has/have implemented with regard to streams.
If we knew what that was - I have no experience of streams - then it ought to be logically possible to see how it compares to the relational model.

David Livingstone

Quote from David Livingstone on June 20, 2019, 11:47 am
Quote from Dave Voorhis on June 20, 2019, 10:42 am

>>> Do you mean, is there a generic relvar that can be programmatically defined to connect to some new kind of data source?

> If so, yes, but it's not exposed in Rel's Tutorial D dialect. It's defined in Java.

Actually I was just wondering about what (I assume are) the general principles underlying your external relvars.  Presumably you have some general rules that are implemented by Java code?

Yes, there is a Java interface called TableExternal that every external relvar must implement. You can see it at https://github.com/DaveVoorhis/Rel/blob/master/ServerV0000/src/org/reldb/rel/v0/storage/tables/TableExternal.java

Quote from David Livingstone on June 20, 2019, 11:53 am
Quote from Dave Voorhis on June 20, 2019, 10:52 am

>>  it's certainly entirely reasonable to define streams and operations on streams.

I agree.
I was just wondering about the logical model that Stonebraker (and/or others) has/have implemented with regard to streams.
If we knew what that was - I have no experience of streams - then it ought to be logically possible to see how it compares to the relational model.

David Livingstone

You mean a theoretical foundation for streams?

I suppose there might be one -- if there is, I haven't paid enough attention to know what it is -- but they might be defined, like most code, in an ad hoc fashion.

The various stream processing engines (there are a number of them) may give some insights into their fundamental operations.

Quote from Dave Voorhis on June 20, 2019, 10:46 am

Streams are finite but unbounded. You can always count how many data items you've received until now -- hence finite -- but the stream never ends, hence unbounded.

So presumably operations are carried out on the stream at (possibly very small) time intervals?  On just those 'tuples' (or whatever) that have arrived during that interval?

Is it logically possible to envisage finite but unbounded sets of tuples?
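For instance (purely a sketch, with invented relvar and attribute names, in Rel-style Tutorial D), if each interval's arrivals were collected into an ordinary relvar, the per-interval operations could be plain relational ones:

    VAR CurrentWindow REAL RELATION {arrival_time INTEGER, reading RATIONAL} KEY {arrival_time};

    // at each tick, a driver outside the language might:
    // 1. insert the tuples that arrived during the interval
    INSERT CurrentWindow RELATION {TUPLE {arrival_time 1001, reading 23.5}};
    // 2. apply ordinary relational operators to the interval's contents
    WRITELN AVG(CurrentWindow, reading);
    // 3. empty the window, ready for the next interval
    DELETE CurrentWindow WHERE TRUE;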

David Livingstone

Some things don't map well to relations, like graph databases, document stores, XML documents, Word documents, and so on.

Those cases happen to be straightforward.  A graph database is an RM database with one main relation, call it Graph, and a few subsidiary relations.  The main relation's attributes are subject, property, and object, and all of them are of type graph_node_id.  There needs to be a source to generate unique or unique-enough values of this type (involving a special relation which holds the latest graph_node_id, or a high-quality random-UUID generator, or what have you).  Each tuple asserts that the subject stands in the specified property relationship to the object.  For example, if some subject, call it 1, stands in the relationship of (animal) parent, call this relationship 2, to another subject, call it 3, then there is a tuple {subject: 1, property: 2, object: 3}.

Note that properties are themselves capable of being subjects or objects, and so we can assert meta-properties, meta-meta-properties, etc.:

So, naturalists observe, a flea
Has smaller fleas that on him prey;
And these have smaller still to bite 'em,
And so proceed ad infinitum.  —Jonathan Swift

The other relations hold relationships in which the object is of some other type (without loss of generality, we can insist that subjects are always graph nodes).  For example, if the name of subject 1 is "Thomas Cowan", of subject 2 is "father-of", and of subject 3 is "John Cowan", then the StringGraph relation, whose object attribute is of type String, has three appropriate tuples whose property is 4, and a fourth tuple telling us that 4's name is "name-of".  We wouldn't need these relations if there were a supertype of Id, String, Int, Float, etc.
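In a Tutorial D-style D, this might look something like the following (a sketch; the selector values and the INTEGER possrep are invented for illustration):

    TYPE graph_node_id POSSREP {id INTEGER};

    VAR Graph REAL RELATION
      {subject graph_node_id, property graph_node_id, object graph_node_id}
      KEY {subject, property, object};

    VAR StringGraph REAL RELATION
      {subject graph_node_id, property graph_node_id, object CHAR}
      KEY {subject, property, object};

    // 1 stands in the parent relationship (2) to 3
    INSERT Graph RELATION
      {TUPLE {subject graph_node_id(1), property graph_node_id(2), object graph_node_id(3)}};

    // names are held in StringGraph via property 4 ("name-of"), including 4's own name
    INSERT StringGraph RELATION
      {TUPLE {subject graph_node_id(1), property graph_node_id(4), object 'Thomas Cowan'},
       TUPLE {subject graph_node_id(2), property graph_node_id(4), object 'father-of'},
       TUPLE {subject graph_node_id(3), property graph_node_id(4), object 'John Cowan'},
       TUPLE {subject graph_node_id(4), property graph_node_id(4), object 'name-of'}};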

XML is more complicated, but roughly speaking each element lives in a relation E that has the attributes xml_element_id, name, and attributes (a tuple-valued attribute).  We also need another relation C with the attributes xml_element_id, parent_element_id, and ordinal_position to represent child elements, and yet another relation T with attributes xml_element_id, ordinal_position, and text to represent content.  It is a constraint that tuples with the same key {xml_element_id, ordinal_position} can't appear in both C and T.  Elements in E may appear in C or T or both or neither; the root element cannot appear in C.
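Sketched the same way (names invented; I've used a relation-valued attribute for the XML attributes, since a tuple-valued attribute would need a fixed heading):

    VAR E REAL RELATION
      {xml_element_id INTEGER, name CHAR,
       attributes RELATION {attr_name CHAR, attr_value CHAR}}
      KEY {xml_element_id};

    VAR C REAL RELATION
      {xml_element_id INTEGER, parent_element_id INTEGER, ordinal_position INTEGER}
      KEY {xml_element_id, ordinal_position};

    VAR T REAL RELATION
      {xml_element_id INTEGER, ordinal_position INTEGER, text CHAR}
      KEY {xml_element_id, ordinal_position};

    // tuples with the same {xml_element_id, ordinal_position} can't appear in both C and T
    CONSTRAINT element_or_text
      IS_EMPTY ((C {xml_element_id, ordinal_position}) JOIN (T {xml_element_id, ordinal_position}));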

D implementations that can join relations hundreds or thousands of times with ease (unlike most SQL implementations) should be excellent for handling both these kinds of data.
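For instance (assuming the Graph sketch above, where property 2 is the parent relationship), a two-hop "grandparent" query is just a self-join:

    // grandparent/grandchild pairs: two parent-of hops
    WITH (parents := (Graph WHERE property = graph_node_id(2)) {subject, object}) :
      ((parents RENAME {object AS mid}) JOIN (parents RENAME {subject AS mid, object AS grandchild}))
        {subject, grandchild}

Deeper traversals just repeat the same join, which is why join performance matters so much here.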

Quote from johnwcowan on June 20, 2019, 3:05 pm

Some things don't map well to relations, like graph databases, document stores, XML documents, Word documents, and so on.

Those cases happen to be straightforward.  ...

That's a mapping, yes, but is it mapping well?

And I'm not sure why a D implementation would be able to join relations hundreds or thousands of times with any more ease than most SQL implementations, given that the core in both is likely to be roughly the same.

Though obviously they don't have to be. The core of a D certainly could conceivably map XML documents and graph databases to relvars (and I am assuming a live mapping here, not some form of ETL) to perhaps absorb graphs and XML into the relational world without too much pain -- along with, perhaps, XML/graph-specific join mechanisms for performance, some graph/hierarchy-friendly new operators for ease of use, and a language happy to validate XML against XML schemas to enforce safety whilst allowing flexibility, and so forth.

But I chose the cases I gave (relatively) carefully, as precisely those which are usually described as non-relational, are often pointed out to be conformable to the relational model -- or its closest relative, SQL -- but in practice turn out to be unpleasant to work with, or to involve undesirable performance (or other) tradeoffs or impedance mismatches. It's precisely this go-around -- oh yes the relational model / SQL can do this; oh no it can't -- that drives NoSQL adoption.

I suggest again, as I've suggested before, that as long as we persist in trying to subsume all other models into the relational model, we will face opposition from NoSQL and its ilk. Far better, methinks, for future database products to heterogeneously support NoSQL and its relatives along with the relational model, and treat all as equals. I.e., embrace, but don't subsume.

Quote from Dave Voorhis on June 20, 2019, 12:01 pm
Quote from David Livingstone on June 20, 2019, 11:53 am
Quote from Dave Voorhis on June 20, 2019, 10:52 am

>>  it's certainly entirely reasonable to define streams and operations on streams.

I agree.
I was just wondering about the logical model that Stonebraker (and/or others) has/have implemented with regard to streams.
If we knew what that was - I have no experience of streams - then it ought to be logically possible to see how it compares to the relational model.

>> You mean a theoretical foundation for streams?

>> I suppose there might be one --

Yes there is. (I've no idea whether Stonebraker is working from it.) It's in amongst the theory on folds and maps, extended from finite sequences/lists to infinite streams.

I'm rusty on it, but essentially it validates that a process guarantees progress by consuming each element of the stream and moving on to the next, and that no process waits for or expects the stream to terminate. This is used to prove that (say) operating systems or server processes run in constant space/don't have memory leaks. IOW a process that reports whether the stream has at least ten million elements is OK; something that reports the total count of elements is not. That kind of useless theory.

Contrast the theory of folds/maps over finite lists, which validates that a process at each iteration makes the problem smaller by progressing towards the (assumed) end of the list.

>> if there is, I haven't paid enough attention to know what it is -- but they might be defined, like most code, in an ad hoc fashion.

>> The various stream processing engines (there are a number of them) may give some insights into their fundamental operations.

Given that most stream processing engines are written in low-level code, for efficiency reasons, I doubt the theory gives much help in reasoning about them.

Quote from Dave Voorhis on June 20, 2019, 4:23 pm
Quote from johnwcowan on June 20, 2019, 3:05 pm

Some things don't map well to relations, like graph databases, document stores, XML documents, Word documents, and so on.

Those cases happen to be straightforward.  ...

>>  That's a mapping, yes, but is it mapping well?

Regarding graphs, I would think it depends on how well the designs of the relevant relvars fit how the users understand the graphs.  (Psychologically it would also be useful if the application using the DB could present the contents of a 'graph relvar' as a graph and not a table).

Even more importantly I think it depends on what relational operator(s) are available to manipulate 'graph relvalues' so that it's easy for the user to picture themselves carrying out graph-like operations on their graphs/relations.  I don't think these operators are well developed enough at the moment.  It would also be useful if one could assign integrity constraints to a 'graph-relvar' to ensure its value does indeed conform to a hierarchy/acyclic/cyclic graph, as the case may be.
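For example (a sketch, assuming the Graph relvar with subject/property/object attributes described earlier in the thread, and a D that supports TCLOSE), an acyclicity constraint could require that the transitive closure of the subject/object projection relates no node to itself:

    CONSTRAINT graph_is_acyclic
      IS_EMPTY ((TCLOSE (Graph {subject, object})) WHERE subject = object);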

>> And I'm not sure why a D implementation would be able to join relations hundreds or thousands of times with any more ease than most SQL implementations, given that the core in both is likely to be roughly the same.

A good point.
However, in the O'Reilly book 'Graph Databases', page 8 states: "One compelling reason, then, for choosing a graph database is the sheer performance increase when dealing with connected data versus relational databases and NOSQL."  So graph DBMSs must have a means of storage that facilitates high performance.
Ideally good relational DBMSs provide several storage methods, and the DBA chooses the most appropriate method for each relvar, or the DBMS automates this.  So perhaps the solution is adding suitable 'graph storage method(s)' to relational DBMSs?


>> I suggest again, as I've suggested before, that as long as we persist in trying to subsume all other models into the relational model, we will face opposition from NoSQL and its ilk. Far better, methinks, for future database products to heterogeneously support NoSQL and its relatives along with the relational model, and treat all as equals. I.e., embrace, but don't subsume.

A reasonable point of view, and very practical.
However it leaves the DB user having to map between the different data models that are embraced by the 'Embracing DBMS'.
If it were possible to express them all relationally (with possibly a standard library of specialist relational operators?) and have the appropriate scalar types, it would make life easier for the DB user.
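For example (a sketch with invented names, assuming the graph_node_id type discussed earlier), one entry in such a library might simply wrap TCLOSE as a named operator:

    OPERATOR ANCESTORS (g RELATION {subject graph_node_id, object graph_node_id})
      RETURNS SAME_TYPE_AS (g);
      RETURN TCLOSE (g);
    END OPERATOR;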

So I think 'subsuming' can be a worthwhile long-term aim, even if at the moment it's more of a pipe-dream.

David Livingstone
