Big Data/Data Science/AI, etc

#31 · June 20, 2019, 2:32 am

Yes. I don't care a hoot about big data (I haven't got any) but I care about streams and variability. The Codd RM expects the data to have a defined order and cardinality. A stream has no cardinality and data that we receive as inputs has no fixed heading. We don't pre-process them into relvars. We dump them into a NoSQL database and then try to find a query tool to extract useful rows and columns that we can report on. It's not a tuple soup, it's a tuple stream and the tuples are variable, so they have missing and extraneous data compared to what we need. How do we deal with it?
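One way to picture the "missing and extraneous data" problem is as a projection of each incoming record onto the heading a report needs. The sketch below is only an illustration under assumptions: the heading `REQUIRED`, the attribute names, and the helpers `conform` and `split` are all hypothetical, and reporting missing attributes separately is just one possible policy.

```python
from typing import Any, Iterable, Iterator

# Hypothetical heading: the attributes a report actually needs.
REQUIRED = ("device", "ts", "reading")


def conform(records: Iterable[dict[str, Any]],
            heading: tuple[str, ...] = REQUIRED) -> Iterator[tuple[dict, list]]:
    """Project each variable record onto a fixed heading.

    Extraneous keys are dropped. Missing attributes are reported by name
    rather than being filled in with a null or a default value.
    """
    for rec in records:
        kept = {name: rec[name] for name in heading if name in rec}
        missing = [name for name in heading if name not in rec]
        yield kept, missing


def split(records, heading=REQUIRED):
    """Separate records that fit the heading from those that fall short."""
    complete, incomplete = [], []
    for kept, missing in conform(records, heading):
        if missing:
            incomplete.append((kept, missing))
        else:
            complete.append(kept)
    return complete, incomplete
```

Only the `complete` list has a single fixed heading and so could be treated as a relation body; what to do with the `incomplete` records is a separate design decision that this sketch leaves open.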

Andl - A New Database Language - andl.org

#32 · June 20, 2019, 9:23 am

Quote from Dave Voorhis on June 20, 2019, 9:23 am

Quote from David Livingstone on June 19, 2019, 11:02 pm

Quote from Dave Voorhis on June 19, 2019, 11:28 am

>> A defining characteristic of 'modern' data ... is variety,
>> which means you are dealing with images, document files of every conceivable type, graphs (e.g., social networks, semantic networks, etc.), spreadsheets, outputs from IoT gadgets, remote REST/SOAP Web service endpoints, financial data, blockchains, buying history,

Can we split this up into a variety of data types, structures, etc. in order to see how the relational model might cope with (at least some of) these ?
How about starting with :

Ability to plug in new scalar types (which must include their implementation). Some of these will be what I would call 'large physical types' because a single logical scalar value (say an image) would have its value stored in one physical file (not a field/record within a file), so the storage mechanism will have to cope with that.

Ability to cope with a greater variety of structures of relational values than we are traditionally used to, so as to handle, say, spreadsheets and financial data. Maybe we don't have to be obliged to normalise every relational value (if a more complex one actually represents the real world, and is the way users want to view things) and can have a set of operators that make it easy for us to manipulate these more structurally-complex relvalues ? Maybe we could specify an attribute as being relvalues whose reltype is a Powerset of scalar types ?

When it comes to graphs, graph structures can be represented relationally, and TTM has a Generalised Transitive Closure operator that manipulates them. But I think that this operator lacks the power and flexibility to do all the things that Graph DBMSs claim they can do. So maybe the real problem is to improve the set of relational operators available for use with graphs ?

Other things ?

Just throwing out a few possibilities for consideration.

David Livingstone
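The point in the quoted post about representing graphs relationally can be made concrete with a small sketch. It treats a graph as a binary relation of (src, dst) pairs and computes a plain transitive closure by repeated self-join. This is an assumption-laden illustration in Python, not the Generalised Transitive Closure operator mentioned above, whose extra capabilities are not modelled here; the function name and attribute names are hypothetical.

```python
def transitive_closure(edges: set[tuple[str, str]]) -> set[tuple[str, str]]:
    """Return every (src, dst) pair connected by a path of one or more edges."""
    closure = set(edges)
    while True:
        # Join the relation with itself where the first pair's dst
        # equals the second pair's src, then project onto (src, dst).
        new_pairs = {(a, d)
                     for (a, b) in closure
                     for (c, d) in closure
                     if b == c}
        if new_pairs <= closure:
            return closure  # fixpoint reached: nothing new was derived
        closure |= new_pairs
```

For the chain a -> b -> c -> d this derives the three extra pairs (a, c), (b, d) and (a, d). Queries such as shortest path or pattern matching would need more than this, which is the gap the quoted post is pointing at.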

In Rel, special relvars are the means to connect to external data. I call them external relvars. You can define external relvars that are bound to CSV files, (individual sheets in) Excel spreadsheet documents, tables in any SQL DBMS that can be reached via ODBC/JDBC, Microsoft Access tables, and relvars in other Rel databases. I supported Hadoop for a while, but it was a bit of a pest to maintain and nobody cared so I dropped it.

Every attribute in every external relvar is automatically defined to be of type CHARACTER; nulls become empty strings. It's up to the user to map -- via VIEWs or whatever -- those character values to whatever locally-defined types, built-in or user-defined, the user sees fit.

This works well for external data that can be fairly easily mapped to relations.
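The two-step idea described above (everything arrives as character data, then a view maps it to local types) can be sketched outside Rel as follows. This is not Rel's implementation or its Tutorial D syntax; it is a Python illustration, and the helper names, the `name` and `qty` attributes, and the policy of excluding rows with an empty `qty` are all assumptions made for the example.

```python
import csv
import io


def external_relvar(text: str) -> list[dict[str, str]]:
    """Read CSV text so that every attribute value is a string.

    Fields missing from a short row become empty strings (restval="").
    """
    return list(csv.DictReader(io.StringIO(text), restval=""))


def typed_view(rows):
    """Hypothetical 'view': map character values to local types.

    Rows whose qty is empty are excluded here; that is one possible
    policy, chosen only to keep the example free of nulls.
    """
    for row in rows:
        if row["qty"] != "":
            yield {"name": row["name"], "qty": int(row["qty"])}
```

Keeping the raw, all-string layer separate from the typed layer means a malformed value only breaks the view that interprets it, not the connection to the external data.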

Some things don't map well to relations, like graph databases, document stores, XML documents, Word documents, and so on. These make me question trying to shoehorn everything into the relational model. Maybe the right approach is not to try to recast every model into the relational model, but to try to create overarching tools that incorporate the relational model along with other ways of representing data.

I'm the forum administrator and lead developer of Rel. Email me at dave@armchair.mb.ca with the Subject 'TTM Forum'. Download Rel from https://reldb.org

#33 · June 20, 2019, 10:27 am

Quote from David Livingstone on June 20, 2019, 10:27 am

Quote from Dave Voorhis on June 20, 2019, 9:23 am

>> In Rel, relvars are the means to connect to external data. I call them external relvars. You can define external relvars that are bound to CSV files, (individual sheets in) Excel spreadsheet documents, tables in any SQL DBMS that can be reached via ODBC/JDBC, Microsoft Access tables, and relvars in other Rel databases. I supported Hadoop for a while, but it was a bit of a pest to maintain and nobody cared so I dropped it.

I have followed a similar strategy in RAQUEL. I've called them source and sink relvars. However only sink relvars are currently implemented.

In RAQUEL, an external/source/sink relvar is not under the control of the DBMS, unlike a DB relvar which is under the control of the DBMS. Do you differentiate similarly ?

>> Every attribute in every external relvar is automatically defined to be of type CHARACTER; nulls become empty strings. It's up to the user to map -- via VIEWs or whatever -- those character values to whatever locally-defined types, built-in or user-defined, the user sees fit.

I have worked on the basis that a binding can be assigned to a source or sink, which maps between the logical relvar and its underlying physical storage. So the binding chosen would cope with these issues.
Do you have a general-purpose binding, that can be applied to a range of storage mechanisms, so that you can achieve the generality of attribute type CHARACTER ? That sounds very useful.
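A general-purpose binding of the kind asked about here might look something like the sketch below. It does not describe RAQUEL or Rel; the `Binding` protocol, the two example bindings, and the `relation` helper are hypothetical names invented for illustration. The only idea taken from the discussion is that each binding hides the physical format and delivers tuples whose attribute values are all character strings.

```python
import csv
import io
import json
from typing import Iterator, Protocol


class Binding(Protocol):
    """Maps some physical storage onto rows of string-valued attributes."""

    def rows(self) -> Iterator[dict[str, str]]: ...


class CsvBinding:
    def __init__(self, text: str) -> None:
        self.text = text

    def rows(self) -> Iterator[dict[str, str]]:
        yield from csv.DictReader(io.StringIO(self.text), restval="")


class JsonLinesBinding:
    def __init__(self, text: str) -> None:
        self.text = text

    def rows(self) -> Iterator[dict[str, str]]:
        for line in self.text.splitlines():
            if line.strip():
                obj = json.loads(line)
                # Render every value as a string; JSON null becomes "".
                yield {k: "" if v is None else str(v) for k, v in obj.items()}


def relation(binding: Binding) -> set[frozenset]:
    """Materialise a binding as a set of tuples, so duplicates collapse."""
    return {frozenset(row.items()) for row in binding.rows()}
```

With this shape, the same logical content stored as CSV or as JSON lines yields the same set of tuples, which is the kind of physical data independence a binding is meant to provide.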

>> These make me question trying to shoehorn everything into the relational model.
>> Maybe the right approach is not to try to recast every model into the relational model, but to try to create overarching tools that incorporate the relational model along with other ways of representing data.

Interesting.

I've been working on the assumption that ultimately the relational model can in principle handle all sorts of data. Then the problem is to provide the model with suitable scalar types and suitable relational operators and assignments to make it easy to handle a range of different relational structures.
Of course many implementation problems arise with this strategy, which can rather limit what is achievable.

I have also assumed that it's useful to have sequences and bags of tuples which you can map onto sets of tuples (i.e. relational values) - thereby providing another form of data independence. Of course you need sequence and bag operators to provide the DB user with the ease of use wrt them. I think partial sequences might be another useful higher-level concept to map onto sets.
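As a rough sketch of the mapping described above, a sequence can be carried by a relation with an explicit position attribute, and a bag by a relation with an explicit count attribute. The Python below is illustrative only: the headings {pos, value} and {value, count} and the function names are assumptions, and relations are modelled simply as sets of Python tuples.

```python
from collections import Counter


def sequence_to_relation(seq):
    """Heading {pos, value}; pos is a key, so order survives in a set."""
    return {(pos, value) for pos, value in enumerate(seq, start=1)}


def relation_to_sequence(rel):
    """Recover the sequence by ordering the tuples on pos."""
    return [value for _, value in sorted(rel, key=lambda t: t[0])]


def bag_to_relation(bag):
    """Heading {value, count}; value is a key, so duplicates become a count."""
    return set(Counter(bag).items())


def relation_to_bag(rel):
    """Recover the bag (as a Counter) from the {value, count} tuples."""
    return Counter(dict(rel))
```

Both mappings round-trip without loss, which is what lets sequence and bag operators be offered to the user while the stored values remain sets of tuples.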

Would you suggest any useful limitations for relations, or other data models for use in certain areas ?

David Livingstone

#34 · June 20, 2019, 10:38 am

Quote from johnwcowan on June 20, 2019, 12:15 am

>> A stream is not finite just because hardware is finite.

Agreed.

But in the real world/universe surely every physical thing is finite, even if it is absolutely colossal in size. Don't your 2 examples illustrate this ?

Is it not the case that in maths, the logic that applies to finite things cannot always be applied to infinite things ?

Thanks for your useful points about streams.

David Livingstone

#35 · June 20, 2019, 10:42 am

Quote from David Livingstone on June 20, 2019, 10:27 am

Quote from Dave Voorhis on June 20, 2019, 9:23 am

>> In Rel, relvars are the means to connect to external data. I call them external relvars. You can define external relvars that are bound to CSV files, (individual sheets in) Excel spreadsheet documents, tables in any SQL DBMS that can be reached via ODBC/JDBC, Microsoft Access tables, and relvars in other Rel databases. I supported Hadoop for a while, but it was a bit of a pest to maintain and nobody cared so I dropped it.

I have followed a similar strategy in RAQUEL. I've called them source and sink relvars. However only sink relvars are currently implemented.

In RAQUEL, an external/source/sink relvar is not under the control of the DBMS, unlike a DB relvar which is under the control of the DBMS. Do you differentiate similarly ?

External relvars are specified in Rel's dialect of Tutorial D; they're just different syntax from standard Tutorial D relvars.

Quote from David Livingstone on June 20, 2019, 10:27 am

I have worked on the basis that a binding can be assigned to a source or sink, which maps between the logical relvar and its underlying physical storage. So the binding chosen would cope with these issues.
Do you have a general-purpose binding, that can be applied to a range of storage mechanisms, so that you can achieve the generality of attribute type CHARACTER ? That sounds very useful.

Do you mean, is there a generic relvar that can be programmatically defined to connect to some new kind of data source?

If so, yes, but it's not exposed in Rel's Tutorial D dialect. It's defined in Java.

Quote from David Livingstone on June 20, 2019, 10:27 am

Would you suggest any useful limitations for relations, or other data models for use in certain areas ?

Sorry, I'm not clear on what you're asking here.

#36 · June 20, 2019, 10:46 am

Quote from David Livingstone on June 20, 2019, 10:38 am

Quote from johnwcowan on June 20, 2019, 12:15 am

>> A stream is not finite just because hardware is finite.

Agreed.

But in the real world/universe surely every physical thing is finite, even if it is absolutely colossal in size. Don't your 2 examples illustrate this ?

Is it not the case that in maths, the logic that applies to finite things cannot always be applied to infinite things ?

Streams are finite but unbounded. You can always count how many data items you've received until now -- hence finite -- but the stream never ends, hence unbounded.
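The "finite but unbounded" distinction can be illustrated with a generator that never terminates, from which any prefix taken so far is an ordinary finite set. This is a sketch under assumptions: `sensor_stream` and its (seq_no, reading) tuples are made up for the example.

```python
import itertools
from typing import Iterator


def sensor_stream() -> Iterator[tuple[int, float]]:
    """A hypothetical never-ending source of (seq_no, reading) tuples."""
    for n in itertools.count(1):
        yield n, 20.0 + (n % 5)


def received_so_far(stream, k: int) -> set:
    """Consume exactly k items and return them as a finite set of tuples."""
    return set(itertools.islice(stream, k))
```

At any moment the tuples already received have a definite cardinality, yet the source itself can always be asked for one more.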

#37 · June 20, 2019, 10:47 am

Quote from dandl on June 20, 2019, 2:32 am

>> I care about streams and variability. The Codd RM expects the data to have a defined order and cardinality. A stream has no cardinality and data that we receive as inputs has no fixed heading.

So the question is, is it logically possible to map streams onto relations ? For example can we map streams onto a mathematical concept - say a sequence - which we can then map onto a set and hence a relation ? Can we map a stream onto any mathematical concept ?

If we can't, then logically we have to exclude streams from relational DBs.

David Livingstone

#38 · June 20, 2019, 10:52 am

Quote from David Livingstone on June 20, 2019, 10:47 am

Quote from dandl on June 20, 2019, 2:32 am

>> I care about streams and variability. The Codd RM expects the data to have a defined order and cardinality. A stream has no cardinality and data that we receive as inputs has no fixed heading.

So the question is, is it logically possible to map streams onto relations ? For example can we map streams onto a mathematical concept - say a sequence - which we can then map onto a set and hence a relation ? Can we map a stream onto any mathematical concept ?

If we can't, then logically we have to exclude streams from relational DBs.

David Livingstone

This doesn't seem to have stopped Stonebraker from developing StreamSQL. I suppose it's a matter of some terminological debate as to whether streams and their associated operations are the relational model, an extension to the relational model, or something else entirely, but it's certainly entirely reasonable to define streams and operations on streams.
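One common way to define operations on a stream is to cut it into windows, each of which is a finite set of tuples that ordinary relational operators could be applied to. The sketch below shows count-based tumbling windows in Python. It is not StreamSQL syntax and makes no claim about how StreamSQL is implemented; the function name and the choice of a count-based window are assumptions for illustration.

```python
import itertools
from typing import Iterable, Iterator


def tumbling_windows(stream: Iterable[tuple], size: int) -> Iterator[frozenset]:
    """Yield successive non-overlapping windows of up to `size` tuples.

    Each window is a frozenset, so it can be treated as a relation body.
    On an unbounded stream this generator never finishes by itself.
    """
    it = iter(stream)
    while True:
        window = frozenset(itertools.islice(it, size))
        if not window:
            return  # only reached when the underlying source ends
        yield window
```

Whether the sequence of windows counts as the relational model, an extension of it, or something else is exactly the terminological question raised above; the code only shows that each individual window is a well-defined finite set.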

#39 · June 20, 2019, 11:47 am

Quote from Dave Voorhis on June 20, 2019, 10:42 am

>> Do you mean, is there a generic relvar that can be programmatically defined to connect to some new kind of data source?

>> If so, yes, but it's not exposed in Rel's Tutorial D dialect. It's defined in Java.

Actually I was just wondering about what (I assume are) the general principles underlying your external relvars. Presumably you have some general rules that are implemented by Java code ?

Quote from David Livingstone on June 20, 2019, 10:27 am

>> Would you suggest any useful limitations for relations, or other data models for use in certain areas ?

>> Sorry, I'm not clear on what you're asking here.

I was just wondering if you had got round to formalising your practical experience, as regards logical models. Or thinking about guidelines ?

David Livingstone

#40 · June 20, 2019, 11:50 am

Quote from David Livingstone on June 20, 2019, 11:47 am

Quote from Dave Voorhis on June 20, 2019, 10:42 am

>> Do you mean, is there a generic relvar that can be programmatically defined to connect to some new kind of data source?

>> If so, yes, but it's not exposed in Rel's Tutorial D dialect. It's defined in Java.

Actually I was just wondering about what (I assume are) the general principles underlying your external relvars. Presumably you have some general rules that are implemented by Java code ?

Quote from David Livingstone on June 20, 2019, 10:27 am

>> Would you suggest any useful limitations for relations, or other data models for use in certain areas ?

>> Sorry, I'm not clear on what you're asking here.

I was just wondering if you had got round to formalising your practical experience, as regards logical models. Or thinking about guidelines ?

Ah.

No.

The Forum for Discussion about The Third Manifesto and Related Matters