# Big Data/Data Science/AI, etc

Quote from AntC on June 16, 2019, 7:46 am

Quote from johnwcowan on June 16, 2019, 4:27 am

Quote from AntC on June 16, 2019, 2:20 am

Careful: that's his 'Inner Union', not SQL's (or D's or set theory's) UNION. And he defines several other operations in terms of those.

Ah. Clearly I need to read the paper more closely.

Yes. That's paper*s* plural -- which contradict each other. And don't assume that what Tropashko means by a relation is what *TTM* means. Tropashko just waves relations into existence as elements of the lattice structure. Nowhere does he give carefully-numbered points to construct a relation. Only by repeated questioning did I get the answer from him that he sees (all) attributes as untyped. Furthermore his English is so ... (shall we say) ... idiosyncratic that getting precise meaning out of what he's saying is hazardous. (The joint papers are a little easier to follow.)

I said "defined in terms of", not "reduce".

What's the distinction?

See the definition I gave for `InnerUnion`. (And apologies I'm so rusty I got it wrong the first couple of times. I've just edited again.)

Where I come from, if you can define A in terms of B (using a constructive definition, at least), you have reduced A to B.

No, I didn't give a constructive definition; I gave in effect some simultaneous equations, of which `InnerUnion` is the (unique) solution. So the lattice structure is a model. We then have to show it felicitously maps to relations and relational operations -- as defined in *TTM*.

Not quite sure why you're bringing in "infinite": I expect the set of relations to be finitely denumerable.

No, it's the relations themselves that T uses in his definitions of RENAME and MINUS that contain an infinite number of tuples.

OK I need to re-read those definitions. Are we looking at the same paper(s)? `MINUS` (at least by one definition) is the solution to a bunch of equations, same as I used for `InnerUnion`. Yes, `RENAME` is in effect a `COMPOSE` with a (simple) algorithmic relation, viz. an equi-relation -- that's no different in principle to `PLUS` in Appendix A. So what I said above about HHT and 'safe queries' applies. To put it sharply: if all the relations mentioned in your query expression (excluding certain trivial expressions) are infinite, then your query is non-terminating. It might still be decidable, in the sense that if you offer some tuple in the domain, it can say whether that's in the result. That's not useful as a query language.
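To make the decidability point concrete, here's a minimal sketch (my own illustration; the attribute names `x`, `y` echo Tropashko's `⌈x=y⌉` equi-relation): an infinite relation represented intensionally as a predicate supports a membership test, but enumerating its tuples as a query result can never terminate.

```python
# Sketch: an infinite relation represented intensionally as a predicate,
# here the equi-relation { (x, y) | x = y } over the integers.

def equi_relation(t):
    """Membership test: decidable -- offer a tuple, get yes/no."""
    return t['x'] == t['y']

assert equi_relation({'x': 3, 'y': 3}) is True
assert equi_relation({'x': 3, 'y': 4}) is False

# But 'running the query' -- materialising the result -- cannot terminate,
# because the satisfying tuples are unbounded:
#
#   for x in itertools.count():
#       emit({'x': x, 'y': x})   # never finishes
```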

Similarly, the niladic version of INTERSECT in the *TTM* book requires a heading to be supplied, and it is the universal relation for that heading; i.e., all possible tuples that meet the type restrictions. This is given as the only case where INTERSECT and JOIN differ.

Tropashko doesn't have niladic `INTERSECT`. Don't go muddling up his stuff with *TTM*. They are just not directly comparable.

There's a side condition on `JOIN` (which applies *mutatis mutandis* for every definition in terms of `JOIN`): the operands to `JOIN` must be join-compatible, that is, attribute names in common must be at the same attribute type.

See my post on type-indexed rows, which makes that tautologous.

Nope. That post is a long way wrong. I'll reply there.

Again I'm not sure what you're getting at. Tropashko's operators are domain-independent. (Unlike Appendix A's <OR>, <NOT>.) Then yes it goes fine with infinite relations.

But TTM, as noted above, requires relations of infinite cardinality already. Once you have exposed such things, you have to know what they are going to do when operated on.

Yes both Tropashko and *TTM* "know what they're doing". But each is doing different things. If you mash the two together you'll get mush only.

Quote from AntC on June 16, 2019, 2:04 pm

Quote from AntC on June 16, 2019, 7:46 am

Quote from johnwcowan on June 16, 2019, 4:27 am

Quote from AntC on June 16, 2019, 2:20 am

Careful: that's his 'Inner Union', not SQL's (or D's or set theory's) UNION. And he defines several other operations in terms of those.

Ah. Clearly I need to read the paper more closely.

For Tropashko's definition of set `MINUS` he writes `A \ B`; I'm looking at 'First Steps in Relational Lattice' pages 2-3. That's a later paper than the one you seem to be drawing from.

I said "defined in terms of", not "reduce".

What's the distinction?

I've picked `MINUS` because it's an example from Tropashko of what he calls an "equational definition" "mimicking the way the minus operation is introduced in arithmetics [sic]"; not a construction to which `A \ B` can be reduced.

For `MINUS` you definitely need to read the later paper, and read the earlier paper closer. He uses the notation `⌈z⌉` to mean "an empty relation" with attribute `z` (very bottom of page 2). Those ceiling brackets are very confusingly used also in `⌈x=y⌉`, meaning an infinite relation with attributes `{x, y}` s.t. the `x` value equals the `y` value. Then we have (wordpress won't do his bowtie nor swizzled bowtie):

`A \ B =df X` s.t. -- A, B have a single attribute z

`X JOIN B = ⌈z⌉` -- the empty relation with attribute z

`X InnerUnion B = A`

No infinite relation has been harmed in that definition. OTOH this definition is what we'd technically call *wrong*. If `B` contains rows that don't appear in `A`, that last `InnerUnion` will not equal `A`. So that last equation should be `X InnerUnion (B JOIN A) = A`. We'd really like a generalised version of difference that scales to relations of arbitrary degree, and doesn't require the two operands to have the same heading. See the improved version in a still later paper, discussed below.
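To see the correction concretely, here is a toy Python model (my own encoding of relations as heading/body pairs; `rel`, `join`, `inner_union` are hypothetical helper names, not from the papers) that checks both forms of the second equation:

```python
# Toy model: a relation is (heading, body) -- the heading a frozenset of
# attribute names, each tuple a frozenset of (attribute, value) pairs.

def rel(heading, rows):
    return (frozenset(heading), {frozenset(r.items()) for r in rows})

def join(a, b):
    """Natural join: union the headings; keep merged pairs of tuples
    that agree on the shared attributes."""
    (ha, ba), (hb, bb) = a, b
    out = set()
    for ta in ba:
        for tb in bb:
            da, db = dict(ta), dict(tb)
            if all(da[k] == db[k] for k in ha & hb):
                out.add(frozenset({**da, **db}.items()))
    return (ha | hb, out)

def inner_union(a, b):
    """Inner union: intersect the headings; union both bodies projected
    onto the shared heading."""
    (ha, ba), (hb, bb) = a, b
    h = ha & hb
    proj = lambda body: {frozenset((k, v) for k, v in t if k in h) for t in body}
    return (h, proj(ba) | proj(bb))

# A \ B with single attribute z; X is the expected result.
A = rel({'z'}, [{'z': 1}, {'z': 2}, {'z': 3}])
B = rel({'z'}, [{'z': 2}, {'z': 4}])
X = rel({'z'}, [{'z': 1}, {'z': 3}])

assert join(X, B) == rel({'z'}, [])     # X JOIN B = ⌈z⌉ (empty)
assert inner_union(X, B) != A           # original second equation fails...
assert inner_union(X, join(B, A)) == A  # ...but the corrected one holds
```

With `B` holding a row (`z = 4`) absent from `A`, the original equation fails while the corrected one holds.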

I see there's also a definition in section '3.5 Difference' of the earlier paper 'Relational Algebra as non-Distributive Lattice' -- which is not co-authored and is almost unreadable. Yes, that's a constructive definition that uses a `NEQ( )` relation that is infinite. That definition is also what we'd technically call *wrong*. If in `A \ B`, `B` is empty, that definition will return an empty result, whereas it should return `A`. If `A, B` are of degree zero -- for example both `DEE` -- I can't see what it would do, but I'd be pretty sure those `JOIN`s to `NEQ( )` would malfunction. It's also what we'd technically call unusable, because a) it doesn't scale for relations of arbitrary degree; b) you have to construct a different formula for each possible set of attributes.

I suggest you file that earlier paper in the round cabinet. Here's a version from 'Relational Lattice Axioms' Section '6. AntiJoin'. This operation is **Tutorial D**'s `NOT MATCHING` -- what we might call 'generalised difference' because it doesn't require the two operands to have the same heading. Unfortunately this paper uses a different (lattice-based) notation. I'll recast to the earlier style:

`A \ B =df X` s.t. -- A, B have headings of arbitrary degree, not nec. the same, but must be join-compatible

`X JOIN (B JOIN A) = A JOIN B JOIN DUM` -- the empty relation with heading the union of `A, B`'s headings

`X InnerUnion (B JOIN A) = A` -- B might have tuples not in A; we must ignore them

Where did that `DUM` come from? It can't be defined as the identity for some operation (as we did for `DEE` from `JOIN`). It's the solution to a couple of simultaneous equations (provably existent, provably unique):

`forall r. [(r JOIN DUM = r) ≡ (r InnerUnion DUM = DUM)]`

-- r empty

`forall r. [(r JOIN DUM ≠ r) ≡ (r InnerUnion DUM = DEE)]`

-- r non-empty

To check our definitions, we can algebraically derive `DUM = DEE \ DEE`.
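Continuing in the same spirit, here is a toy Python check (again my own encoding, with hypothetical helper names) of the two `DUM` equations and of the derived identity `DUM = DEE \ DEE`, reading `\` as the antijoin:

```python
# Toy model: relation = (heading, body); heading a frozenset of attribute
# names, each tuple a frozenset of (attribute, value) pairs.

def join(a, b):
    """Natural join."""
    (ha, ba), (hb, bb) = a, b
    out = set()
    for ta in ba:
        for tb in bb:
            da, db = dict(ta), dict(tb)
            if all(da[k] == db[k] for k in ha & hb):
                out.add(frozenset({**da, **db}.items()))
    return (ha | hb, out)

def inner_union(a, b):
    """Inner union: project both bodies onto the shared heading, then union."""
    (ha, ba), (hb, bb) = a, b
    h = ha & hb
    proj = lambda body: {frozenset((k, v) for k, v in t if k in h) for t in body}
    return (h, proj(ba) | proj(bb))

def not_matching(a, b):
    """Antijoin (Tutorial D's NOT MATCHING): tuples of a with no join partner in b."""
    (ha, ba), (hb, bb) = a, b
    shared = ha & hb
    def agree(ta, tb):
        da, db = dict(ta), dict(tb)
        return all(da[k] == db[k] for k in shared)
    return (ha, {ta for ta in ba if not any(agree(ta, tb) for tb in bb)})

DEE = (frozenset(), {frozenset()})  # empty heading, one empty tuple
DUM = (frozenset(), set())          # empty heading, no tuples

r_empty = (frozenset({'x'}), set())
r_full  = (frozenset({'x'}), {frozenset({('x', 1)})})

# r empty:     r JOIN DUM = r   and  r InnerUnion DUM = DUM
assert join(r_empty, DUM) == r_empty and inner_union(r_empty, DUM) == DUM
# r non-empty: r JOIN DUM != r  and  r InnerUnion DUM = DEE
assert join(r_full, DUM) != r_full and inner_union(r_full, DUM) == DEE
# and the derived identity: DUM = DEE \ DEE
assert not_matching(DEE, DEE) == DUM
```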

Quote from johnwcowan on June 16, 2019, 5:13 pm

I only looked into Tropashko because you mentioned him, and given the difficulties I think I will not invest in him further.

Quote from David Livingstone on June 19, 2019, 11:05 am

Quote from dandl on June 15, 2019, 12:50 am

>> ... I reject your core proposition.

I love using dynamic languages for small programs, because ... And I hate using dynamic languages for larger programs, because ...

Apologies for the delay in responding, due to other pressures.

I hadn't intended to make a core proposition. Sorry if it came across that way. My original statement (7 June) said :

"A lot of semi-structured data appears to be what I would call 'dynamic data'. ..... Perhaps we just need to add more dynamic typing to relational DBMSs ?"

**Please note the question mark.**

I take your point about type inference, what compilers can provide us with, etc. These are all worthwhile tools we should use when appropriate.

If I *were* to make a core proposition, I would say that we should try to be software ENGINEERS, and try to use the right tools for the particular job in hand. But then I think most/all people in this forum will be trying to do that anyway. And it's not always easy to accomplish.

So I suppose that leads to the questions "what sort of tools should a TTM-compliant relational language and DBMS provide ?" "What sort of additional tools should we have ? (e.g DB design tools)."

I could refine the question area(s) by considering the talk in the NoSQL camp. Broadly speaking this seems to me to be :

- SQL fails by not having the right physical data storage. Since physical storage is independent of the logical model, and is a matter of how well a given DBMS handles it, we can ignore it in this forum.
- SQL is not dynamic enough and doesn't handle 'modern' data. This is potentially relevant. What tools/facilities should a TTM-compliant relational language and DBMS provide to cope with such data, in order to go beyond what SQL does ? But first, what does the criticism actually mean ?

Would anyone like to make any suggestions ? (I'm deliberately not suggesting anything so that I can't subvert the topic).

My thanks to those who, meanwhile, have made some interesting points.

David Livingstone

Quote from Dave Voorhis on June 19, 2019, 11:28 am

Quote from David Livingstone on June 19, 2019, 11:05 amQuote from dandl on June 15, 2019, 12:50 am

I love using dynamic languages for small programs, because ... And I hate using dynamic languages for larger programs, because ...

Apologies for the delay in responding, due to other pressures.

Please note the question mark.

I take your point about type inference, what compilers can provide us with, etc. These are all worthwhile tools we should use when appropriate.

If I *were* to make a core proposition, I would say that we should try to be software ENGINEERS, and try to use the right tools for the particular job in hand. But then I think most/all people in this forum will be trying to do that anyway. And it's not always easy to accomplish.

My thanks to those who, meanwhile, have made some interesting points.

David Livingstone

A defining characteristic of 'modern' data -- one of the explicitly-noted "three V's" of "big data" -- is *variety*, which means you are dealing with images, document files of every conceivable type, graphs (e.g., social networks, semantic networks, etc.), spreadsheets, outputs from IoT gadgets, remote REST/SOAP Web service endpoints, financial data, blockchains, buying history, you-name-it. Some of it is already in SQL databases. Some of it isn't in SQL databases, but fits very neatly into SQL databases. Some of it doesn't fit at all.

Furthermore, some of it demonstrates the other two of the "three V's" of "big data" -- volume and velocity -- which means it doesn't fit into SQL databases (they can't keep up with the velocity; they can't store the volume), and if it's *really* "big data" per the classic definitions it can't "fit" into storage at all. You can only (maybe) process it as a stream.

Therefore, what's needed -- at least from a big data / data science point of view -- is not a single solution to SQL's limitations, but the ability to seamlessly and effectively integrate all the various ways of dealing with the variety (and volume and velocity) of real-world data.

*I'm the forum administrator and lead developer of Rel. Email me at dave@armchair.mb.ca with the Subject 'TTM Forum'. Download Rel from https://reldb.org*

Quote from David Livingstone on June 19, 2019, 7:07 pm

Quote from Dave Voorhis on June 19, 2019, 11:28 am

> the explicitly-noted "three V's" of "big data"

I think this is a good starting point.

> *variety*, which means you are dealing with images, document files of every conceivable type, graphs (e.g., social networks, semantic networks, etc.), spreadsheets, outputs from IoT gadgets, remote REST/SOAP Web service endpoints, financial data, blockchains, buying history,

A lot could be said here.

> *volume* ... if it's *really* "big data" per the classic definitions it can't "fit" into storage at all.

Doesn't this imply an implementation problem, rather than a relational model problem ? DBMSs just have to be able to plug in more powerful, or more efficient, physical storage ?

> *velocity* -- they can't keep up because too much velocity ... you can only (maybe) process it as a stream.

Is this also an implementation problem, to do with efficiency ? What would you say is the significance of the 'stream' as regards the relational model ?

> what's needed ... is not a single solution to SQL's limitations, but the ability to seamlessly and effectively integrate all the various ways of dealing with the variety (and volume and velocity) of real-world data.

I entirely agree that relational 'developments' should fit seamlessly and effectively into the relational model, and BE an integral part of it.

David Livingstone

Quote from Dave Voorhis on June 19, 2019, 9:31 pm

Quote from David Livingstone on June 19, 2019, 7:07 pm

Quote from Dave Voorhis on June 19, 2019, 11:28 am

> the explicitly-noted "three V's" of "big data"

I think this is a good starting point.

> *variety*, which means you are dealing with images, document files of every conceivable type, graphs (e.g., social networks, semantic networks, etc.), spreadsheets, outputs from IoT gadgets, remote REST/SOAP Web service endpoints, financial data, blockchains, buying history,

A lot could be said here.

> *volume* ... if it's *really* "big data" per the classic definitions it can't "fit" into storage at all.

It's part of the classic definition of Big Data. If you can relatively easily store the data using off-the-shelf disks and/or disk arrays, then it's not big data -- it's just data. It's considered Big Data when there's too much data to store using existing solutions, so you are forced to invent new storage strategies.

To a certain degree, yes, if a given database management product won't let you fit your data within your 64 petabyte storage array, then maybe it needs to support dynamic compression storage plugins to make the data fit. That isn't a model problem *per se*, but are you planning to run general TTM-like or SQL-like queries on the (compressed) data?

If so, it may become too slow to meet requirements, particularly if most of the queries actually involve navigating a graph...

There are always tradeoffs with these things.

Quote from David Livingstone on June 19, 2019, 7:07 pm
Quote from Dave Voorhis on June 19, 2019, 11:28 am
*velocity* -- they can't keep up because too much velocity ... you can only (maybe) process it as a stream.

It may be the same sort of problem as the *volume* problem, or it may be that your sensor network generates 2 or 3 terabytes of data per second.

Or it may be that the data generation is continuous and unbounded.

How do you deal with those?

There is a SQL variant called StreamSQL developed by Stonebraker et al. for dealing with streaming data. See https://en.wikipedia.org/wiki/StreamSQL

I must emphasise that I'm using the *classic* definition of Big Data, which focuses on variety, volume and velocity, and the notion that it's only Big Data if one or more of the three *can't* be handled with existing products, so something new has to be invented; i.e., code has to be written or even hardware has to be created. If variety, volume and velocity can all be handled with existing products, then by definition it's not Big Data, it's just data.

Because software (and sometimes hardware) invention is a scary prospect for many business types, there are attempts to redefine Big Data to (for example) mean "Big" isn't about volume/variety/velocity (they're unimportant; simply pay a consultant to deal with that or, as various product vendors will tell you, their products *do* handle Big Data) but its commercial significance to your organisation (it's *big* important!) and so the three V's are *really* value, veracity and visualisation. Or maybe they're not three V's but seventeen. And so on. The field tends to be infected with marketing and commercial interests.

*I'm the forum administrator and lead developer of Rel. Email me at dave@armchair.mb.ca with the Subject 'TTM Forum'. Download Rel from https://reldb.org*

Quote from David Livingstone on June 19, 2019, 10:34 pm
Quote from Dave Voorhis on June 19, 2019, 9:31 pm

*volume* ... if it's *really* "big data" per the classic definitions it can't "fit" into storage at all.

>> It's part of the classic definition of Big Data. If you can relatively easily store the data using off-the-shelf disks and/or disk arrays, then it's not big data -- it's just data.

I take your general point.

But if 'Big Data' is actually to be used - which we constantly hear it is - then there must be some means of storing it, otherwise it's just a pipe dream.

Perhaps a relational DBMS needs to be designed so that you can plug in a variety of storage mechanisms (like you plug Berkeley DB into Rel) so that we can always plug in some appropriate super-duper storage engine.

Ergo, it's of great practical significance to a relational DBMS, but not of theoretical significance, since performance doesn't appear in the relational model.
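The pluggable-storage idea might be sketched as a minimal interface the relational engine programs against, with the physical layout hidden behind it. All of the names and methods below are hypothetical illustrations, not Rel's actual Berkeley DB binding:

```python
from abc import ABC, abstractmethod
from typing import Iterator, Tuple

class StorageEngine(ABC):
    """Hypothetical storage-plugin interface: the relational engine
    sees only tuples, never the physical representation."""

    @abstractmethod
    def insert(self, relvar: str, tup: Tuple) -> None: ...

    @abstractmethod
    def scan(self, relvar: str) -> Iterator[Tuple]: ...

class InMemoryEngine(StorageEngine):
    """Trivial engine for illustration; a Berkeley DB wrapper or some
    super-duper big-data engine would implement the same two methods."""
    def __init__(self):
        self._data = {}

    def insert(self, relvar, tup):
        # A relation body is a set of tuples, so duplicates vanish.
        self._data.setdefault(relvar, set()).add(tup)

    def scan(self, relvar):
        return iter(self._data.get(relvar, set()))

engine = InMemoryEngine()
engine.insert("S", ("S1", "Smith", 20))
engine.insert("S", ("S2", "Jones", 10))
print(sorted(engine.scan("S")))  # [('S1', 'Smith', 20), ('S2', 'Jones', 10)]
```

The point of the design is that swapping the storage mechanism changes which class is instantiated, not how queries are expressed.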

Quote from David Livingstone on June 19, 2019, 7:07 pm
Quote from Dave Voorhis on June 19, 2019, 11:28 am
*velocity* -- they can't keep up because too much velocity ... you can only (maybe) process it as a stream.

>> It may be the same sort of problem as the *volume* problem, or it may be that your sensor network generates 2 or 3 terabytes of data per second.

>> There is a SQL variant called StreamSQL developed by Stonebraker et al. for dealing with streaming data. See https://en.wikipedia.org/wiki/StreamSQL

A stream is 'an infinite sequence of tuples' (quoting from the reference). So we could map sequences to sets of tuples for the logical model. 'Infinite' is a fudge, since real hardware only deals with finite quantities - probably needs some theoretical clarification. Otherwise we're back to the underpinning implementation again (?).
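The sequence-to-set mapping suggested above can be sketched in a few lines: a generator stands in for the conceptually unbounded stream, and a finite prefix of it is materialised as an ordinary set of tuples, i.e. a relation body the logical model can handle. All names here are illustrative, not anyone's proposed API:

```python
import itertools
from typing import Iterator, Set, Tuple

def sensor_stream() -> Iterator[Tuple[int, float]]:
    """A conceptually unbounded stream of (timestamp, reading) tuples.
    Synthesised here; in practice it would wrap a live feed."""
    for t in itertools.count():
        yield (t, t * 0.5)

def finite_window(stream: Iterator[Tuple[int, float]], n: int) -> Set[Tuple[int, float]]:
    """Map a finite prefix of the infinite sequence to a set of tuples."""
    return set(itertools.islice(stream, n))

rel = finite_window(sensor_stream(), 3)
print(sorted(rel))  # [(0, 0.0), (1, 0.5), (2, 1.0)]
```

The "infinite" part never has to exist in hardware: only the finite prefix demanded so far is ever materialised, which is perhaps the theoretical clarification being asked for.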

>> I must emphasise that I'm using the *classic* definition of Big Data, which focuses on variety, volume and velocity, and the notion that it's only Big Data if one or more of the three *can't* be handled with existing products, so something new has to be invented;

Point taken.

>> The field tends to be infected with marketing and commercial interests.

Indeed !

From a relational model viewpoint, variety of types would appear to be the big issue - how do they relate to TTM ?

David Livingstone


Quote from David Livingstone on June 19, 2019, 11:02 pm
Quote from Dave Voorhis on June 19, 2019, 11:28 am
>> A defining characteristic of 'modern' data ... is

variety,

>> which means you are dealing with images, document files of every conceivable type, graphs (e.g., social networks, semantic networks, etc.), spreadsheets, outputs from IoT gadgets, remote REST/SOAP Web service endpoints, financial data, blockchains, buying history,

Can we split this up into a variety of data types, structures, etc in order to see how the relational model might cope with (at least some of) these ?

How about starting with :

- Ability to plug in new scalar types (which must include their implementation). Some of these will be what I would call 'large physical types' because a single logical scalar value (say an image) would have its value stored in one physical file (not a field/record within a file), so the storage mechanism will have to cope with that.
- Ability to cope with a greater variety of structures of relational values than we are traditionally used to, so as to cope with, say, spreadsheets or financial data. Maybe we don't have to be obliged to normalise every relational value (if a more complex one actually represents the real world, and is the way users want to view things), and could have a set of operators that make it easy for us to manipulate these more structurally-complex relvalues ? Maybe we could specify an attribute as being relvalues whose reltype is a Powerset of scalar types ?
- When it comes to graphs, graph structures can be represented relationally, and TTM has a Generalised Transitive Closure operator that manipulates them. But I think that this operator lacks the power and flexibility to do all the things that Graph DBMSs claim they can do. So maybe the real problem is to improve the set of relational operators available for use with graphs ?
- Other things ?
Just throwing out a few possibilities for consideration.
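The transitive-closure point in the list above can be illustrated with a naive fixed-point computation over a binary relation represented as a set of pairs. This is plain transitive closure, not TTM's Generalised Transitive Closure, and the function name is mine:

```python
def transitive_closure(edges):
    """Transitive closure of a binary relation (a set of (from, to)
    pairs): repeatedly join the relation with itself on the middle
    attribute and union in the new pairs until a fixed point."""
    closure = set(edges)
    while True:
        new = {(a, d) for (a, b) in closure for (c, d) in closure if b == c}
        if new <= closure:          # fixed point reached
            return closure
        closure |= new

g = {("a", "b"), ("b", "c"), ("c", "d")}
print(sorted(transitive_closure(g)))
# [('a', 'b'), ('a', 'c'), ('a', 'd'), ('b', 'c'), ('b', 'd'), ('c', 'd')]
```

What Graph DBMSs add beyond this (shortest paths, weighted aggregation along paths, pattern matching) is exactly the extra power being asked about; closure alone only answers reachability.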

David Livingstone


Quote from johnwcowan on June 20, 2019, 12:15 amA stream is not finite just because hardware is finite. A well-chosen stream is only finite insofar as the universe has a finite lifetime. For example, a stream of seismological data can run in principle from now to the destruction of the Earth by the expanding Sun, and a stream of observed proton decays (if any) could go on for the next 3E43 years.

A few thoughts about adding streams to the RM:

- I model streams like this: they are similar to relations, except that they have exactly one of something I will call an *ordered key*. This key is an ordinary RM key whose attributes must be of ordered types; it is also the case that the attributes in the key have an order, unlike ordinary keys. This is in practical terms the sort with which the tuples in the stream arrive, often just a timestamp.
- RENAME on a stream renames the attributes in the ordered key to match the new names of the attributes of the stream.
- Joins (antijoins, semijoins) of streams with other streams or with relations are only possible on the ordered key.
- GROUP is only possible on the ordered key. Note that the new attribute after grouping contains relations, not streams.
- REMOVE can only remove an attribute in the ordered key if it removes all later attributes in the key as well. The equivalent restriction is imposed on PROJECT.
- WRAP likewise cannot wrap an attribute in the ordered key unless it wraps all later attributes in the key as well.
- UNION, XUNION, and MINUS of two streams require that they not only have the same heading but the same ordered key. Of course, if one argument is a relation the second requirement is not imposed.
- A new operation WINDOW takes a stream and two tuples whose attributes are the attributes of the ordered key. It returns a relation containing just those tuples of the stream whose ordered key is >= the first tuple and < the second tuple in lexicographical order.
- A new operation BATCH takes a stream and a relation whose attributes are the attributes of the ordered key, and produces partitions of the stream by ranges of the ordered key. Details below.

BATCH effectively partitions a stream into many streams each of which has a limited range of ordered keys. The tuples of the input relation specify the boundaries of the partition. The result relation has the same cardinality as the input relation and one additional attribute, the partition of the stream. The other attributes of the result contain intervals rather than individual values of the ordered-key attributes, where an interval is either an implicitly defined scalar data type or a tuple with two attributes *min* (inclusive) and *max* (exclusive). The *min* values are exactly those in the input relation.

This is very preliminary. Comments welcome.
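As a rough illustration of the WINDOW and BATCH semantics sketched above: the code below simplifies by assuming the ordered key occupies the leading attributes of each tuple, and the function names and representations are mine, not a proposed syntax:

```python
def window(stream, lo, hi):
    """WINDOW sketch: the relation (set) of stream tuples whose
    ordered-key prefix is >= lo and < hi in lexicographic order.
    lo and hi are tuples over the ordered-key attributes."""
    k = len(lo)
    return {t for t in stream if lo <= t[:k] < hi}

def batch(stream, boundaries):
    """BATCH sketch: boundaries is a sorted list of ordered-key values,
    the *min* of each partition; partition i holds tuples whose key lies
    in [min_i, min_{i+1}), with the last partition open-ended."""
    k = len(boundaries[0])
    parts = []
    for i, lo in enumerate(boundaries):
        hi = boundaries[i + 1] if i + 1 < len(boundaries) else None
        body = {t for t in stream
                if t[:k] >= lo and (hi is None or t[:k] < hi)}
        parts.append((lo, hi, body))  # (min, max, partition of the stream)
    return parts

# Stream of (timestamp, reading) tuples; the ordered key is (timestamp,).
s = [(1, "a"), (2, "b"), (3, "c"), (4, "d")]
print(sorted(window(s, (2,), (4,))))  # [(2, 'b'), (3, 'c')]
```

Note the finite list `s` stands in for a stream; a real implementation would evaluate WINDOW and BATCH lazily, since a stream's tuples arrive in ordered-key order and never all exist at once.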
