The Forum for Discussion about The Third Manifesto and Related Matters

Big Data/Data Science/AI, etc

'Data' is becoming fashionable again, even if 'Databases' are not.  We're always hearing about Big Data, AI and machine learning from big 'Data Sets', Data Science, etc.  (Northumbria University is currently developing a Data Science course).

Does this create a new opportunity for relational concepts and relational DBMSs (where relational is defined by TTM) ?

Many of us in this forum appear to have grown up with traditional business DBs, which have been dominated by SQL - I include myself in this area.  Do we now need to think about what other, new subject and application areas could really benefit from applying a true relational formalism ?

I like to distinguish between the relational formalism itself - as defined by TTM - and its application to designing queries and DBs.  Where and how else could we apply it ?  Is the TTM-relational formalism more than just a super-SQL ?

David Livingstone

If we're talking about AI as in ML, then as far as I know the data they deal with is always represented as arrays of numbers, and that's not a very relational thing.

So you have a database of images (or documents or measurements of something), each of them alike. Each of them will have one or more labels attached. The data is pre-processed so that each labelled item is in exactly the same form. Typically that will be an array of numbers, each the same length. In relational terms you have two columns: a key and an array of numbers. Another table might hold the labels applied to each data item by key, two columns again.
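
A minimal sketch of the two-table layout described above, assuming Python sets of named tuples stand in for relations. The names (Item, Label, FEATURE_LENGTH, join_on_key) and all the values are hypothetical, chosen only for illustration.

```python
from typing import NamedTuple, Tuple


class Item(NamedTuple):
    key: int
    features: Tuple[float, ...]  # the pre-processed array of numbers


class Label(NamedTuple):
    key: int
    label: str


FEATURE_LENGTH = 4  # assumption: every array has this length after pre-processing

# First relation: a key and an array of numbers.
items = {
    Item(1, (0.1, 0.9, 0.0, 0.3)),
    Item(2, (0.7, 0.2, 0.5, 0.1)),
}

# Second relation: the labels applied to each item, by key.
labels = {
    Label(1, "cat"),
    Label(1, "animal"),
    Label(2, "car"),
}

# Pre-processing is what guarantees that every array is the same length.
assert all(len(item.features) == FEATURE_LENGTH for item in items)


def join_on_key(items, labels):
    """Natural join of the two relations on `key`."""
    return {
        (item.key, item.features, lab.label)
        for item in items
        for lab in labels
        if item.key == lab.key
    }


training_rows = join_on_key(items, labels)
```

The join yields one row per (item, label) pairing, which is roughly the shape a training loop would consume.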

The challenging part is the pre-processing and the choice of training parameters. The data management issues are pretty simple (apart from the sheer quantity of data).

Andl - A New Database Language - andl.org
Quote from David Livingstone on June 5, 2019, 2:13 pm

'Data' is becoming fashionable again, ... Big Data, AI and machine learning from big 'Data Sets', Data Science, etc.

I wasn't aware 'Data' had ever dropped out of fashion. 'Big Data' has been promising for a decade and a half, even with occasional achievements. But mostly it has been data scientists with no knowledge of the subject matter claiming discoveries: spuriously decoding the Voynich manuscript; spuriously unscrambling human DNA and making claims about migration patterns with no archaeological evidence; and, today, claiming to identify authorial style by patterns of punctuation. No: what punctuation identifies is editorial/publishing style. You can't identify Shakespeare by punctuation, because C19th and early C20th publications completely ignored punctuation and spelling in his manuscripts.

Does this create a new opportunity for relational concepts and relational DBMSs (where relational is defined by TTM) ?

I doubt it: 'Big Data' is about using statistical methods/sophisticated pattern matching over unstructured or semi-structured data, paying attention to metadata as much as content. DBMSs (relational or not) are about structured data, where the structure has been designed (to use the term loosely in some cases) through business analysis.

I like to distinguish between the relational formalism itself - as defined by TTM - and its application to designing queries and DBs.  Where and how else could we apply it ?  Is the TTM-relational formalism more than just a super-SQL ?

There seems to be a continuing trend to think that semi-structured/Key-value stores/NoSQL are somehow more flexible in answering the sort of ad-hoc queries data scientists go in for. Those would be the sort of data scientists who don't understand statistics, don't understand confirmation bias, don't understand statistical 'priming'. But if they're no good at their jobs, then no, a relational approach won't help. To laugh or to cry?

Quote from dandl on June 6, 2019, 2:04 am

If we're talking about AI as in ML, then as far as I know the data they deal with is always represented as arrays of numbers, and that's not a very relational thing.

I was thinking in very general terms, which was why I listed several topics.  I hoped people might come up with interesting novel suggestions.  I deliberately avoided suggestions of my own in order to prevent any discussion being limited to their specifics.  Also my suggestions might not turn out to be very good.

I presume by ML you mean machine learning ?  I know little about that.

One specific example that I do know of, and that might be considered machine learning (?), is work in the area of shoeprint recognition that has been done at Northumbria University.  In addition to an image of a shoeprint, a shoeprint per se is described by a vector of 'Keypoint Descriptors', where the latter do indeed comprise a lot of numbers.  However you can think of a vector of 'Keypoint Descriptors' as the physical storage of a single logical scalar value that you could call a 'Shoeprint Description'.  Suppose then all the logical operators on 'Shoeprint Descriptions' are packaged into an object class - they typically are anyway these days.  Then use the class to implement a logical scalar type, which can be plugged into your DBMS.  In this case, the DBMS will also have to be able to cope with 'Large Physical Scalars', i.e. logical scalar values that need an entire OS file to physically store the value of one logical value, because a vector of 'Keypoint Descriptors' is stored in a file.

To sum up, you deal with all the numbers by putting them in a data type.
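
A minimal sketch of that idea in Python, assuming a plain class stands in for the user-defined scalar type. The class name, the generate and compare methods, and the cosine-similarity measure are all illustrative assumptions; real keypoint extraction and matching would be far more involved.

```python
import math
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class ShoeprintDescription:
    """One logical scalar value; the vector is only its physical representation."""

    _descriptors: Tuple[float, ...]

    @classmethod
    def generate(cls, image: Sequence[float]) -> "ShoeprintDescription":
        # Stand-in for real keypoint extraction (assumption): scale the pixel
        # values to unit length so that compare() is independent of brightness.
        norm = math.sqrt(sum(p * p for p in image)) or 1.0
        return cls(tuple(p / norm for p in image))

    def compare(self, other: "ShoeprintDescription") -> float:
        """Similarity score: cosine similarity of the two descriptor vectors."""
        return sum(a * b for a, b in zip(self._descriptors, other._descriptors))


print_a = ShoeprintDescription.generate([3.0, 4.0])
print_b = ShoeprintDescription.generate([4.0, -3.0])
similarity_same = print_a.compare(print_a)       # close to 1.0
similarity_different = print_a.compare(print_b)  # close to 0.0
```

Callers only ever see whole values and the type's operators, never the numbers inside, which is what lets a DBMS treat the value as a single attribute.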

Examples of usage (expressed in RAQUEL) :

SOCshoeprints  <--Real  SOCshoeprints Extend[ ShoeprintDesc <-- ShoeprintImg Generate ]

This creates a description of a shoeprint from an image of one in every tuple of the relvar SOCshoeprints, and adds it to SOCshoeprints in a new attribute called 'ShoeprintDesc'.
<--Real is a RAQUEL assignment.
Extend is a RAQUEL operator.
Generate is an operator of the 'Shoeprint Description' type.
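
For readers who don't know RAQUEL, a rough Python analogue of the Extend step, assuming a relation is modelled as a list of dicts. The extend helper, the PrintId attribute and the stand-in generate function are hypothetical; this is not how RAQUEL itself is implemented.

```python
def extend(relation, new_attr, fn):
    """Return a new relation in which every tuple gains `new_attr`, computed by `fn`."""
    return [{**t, new_attr: fn(t)} for t in relation]


def generate(image):
    # Stand-in for the type's Generate operator (assumption): summarise the
    # image as (min, max, mean) instead of extracting real keypoints.
    return (min(image), max(image), sum(image) / len(image))


soc_shoeprints = [
    {"PrintId": 1, "ShoeprintImg": (5, 1, 3)},
    {"PrintId": 2, "ShoeprintImg": (2, 2, 9)},
]

# Rebinding the name plays the role of the relational assignment.
soc_shoeprints = extend(
    soc_shoeprints, "ShoeprintDesc", lambda t: generate(t["ShoeprintImg"])
)
```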

Suppose you then want to compare all the shoeprints of suspect 'Joe' with those found at a crime scene.  The following expression does this for you :

Suspect  Restrict[ SuName = 'Joe' ]  Gen[ S-ShoeprintDesc  Compare  C-ShoeprintDesc  >  n ]  CrimeScene
Restrict is the RAQUEL restriction operator.
Gen is the RAQUEL generalised join operator.
Compare is the comparison operator of the 'Shoeprint Description' type.  It returns a numeric value indicating the degree of similarity.  If the similarity is greater than that defined by the number 'n' (put in your choice of number here) then you want the relevant suspect and crime scene tuples merged and put in the result.
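
The same query can be sketched in Python, again assuming relations are lists of dicts. The helpers restrict and gen_join, the SceneId attribute, the sample data and the toy compare function are illustrative assumptions, not part of RAQUEL or of the Northumbria work.

```python
def restrict(relation, predicate):
    """Keep only the tuples that satisfy the predicate."""
    return [t for t in relation if predicate(t)]


def gen_join(left, right, predicate):
    """Pair every left tuple with every right tuple; keep the merged pairs that satisfy the predicate."""
    return [{**lt, **rt} for lt in left for rt in right if predicate(lt, rt)]


def compare(a, b):
    # Stand-in similarity measure (assumption): fraction of matching positions.
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / max(len(a), len(b), 1)


suspect = [
    {"SuName": "Joe", "S-ShoeprintDesc": (1, 2, 3, 4)},
    {"SuName": "Joe", "S-ShoeprintDesc": (9, 9, 9, 9)},
    {"SuName": "Ann", "S-ShoeprintDesc": (1, 2, 3, 4)},
]
crime_scene = [
    {"SceneId": 7, "C-ShoeprintDesc": (1, 2, 3, 0)},
    {"SceneId": 8, "C-ShoeprintDesc": (5, 5, 5, 5)},
]
n = 0.7  # similarity threshold; put in your choice of number here

result = gen_join(
    restrict(suspect, lambda t: t["SuName"] == "Joe"),
    crime_scene,
    lambda s, c: compare(s["S-ShoeprintDesc"], c["C-ShoeprintDesc"]) > n,
)
```

With this sample data only one of Joe's prints is similar enough to a crime-scene print, so the result holds a single merged tuple.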

Over the years, a lot of people at Northumbria have worked on recognition systems - machine parts, logos, shoeprints, faces, palmprints - and they all find it a pain to handle the files of data that they have to cope with.  If a DBMS could handle it for them, they would be delighted.

If that's not the sort of machine learning you're interested in, then my apologies for boring you.

David Livingstone

Quote from AntC on June 7, 2019, 6:13 am
Quote from David Livingstone on June 5, 2019, 2:13 pm

'Data' is becoming fashionable again, ... Big Data, AI and machine learning from big 'Data Sets', Data Science, etc.

I wasn't aware 'Data' had ever dropped out of fashion.

Data in terms of databases has declined as a subject, and in terms of research, over the years in academia.  The same appears to me to be true as regards the computing media.  'Big Data' etc sometimes appears to be considered as a brand new subject, and nothing to do with databases.  Perhaps what you say about the people who work with 'Big Data' provides the explanation for that ?

Does this create a new opportunity for relational concepts and relational DBMSs (where relational is defined by TTM) ?

I doubt it: 'Big Data' is about using statistical methods/sophisticated pattern matching over unstructured or semi-structured data, paying attention to metadata as much as content. DBMSs (relational or not) are about structured data, where the structure has been designed (to use the term loosely in some cases) through business analysis.

A lot of 'semi-structured data' appears to be really what I would call 'dynamic data'.  One instance of 'semi-structured data' has one structure, another instance has another structure.  So as one proceeds through the whole collection, the individual types vary dynamically.  This calls for dynamic data typing to deal with it.

A lot of computing is conditioned by compiled languages.  There everything has to be static so that a programmer can write program code that works for all (the limited range of) possible cases.  Interpreted languages can be dynamic.  Therefore a programmer can write code that inspects each case first, and then processes the data on the basis of what is found.

Perhaps we just need to add more dynamic typing to relational DBMSs ?
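
As a small illustration of "inspect each case first, then process", here is a sketch in Python, an interpreted language. The describe function and the sample collection are hypothetical; how such inspection would fit into a relational DBMS is left open.

```python
def describe(value):
    """Inspect a value's runtime structure and return a uniform (kind, size) summary."""
    if isinstance(value, dict):
        return ("record", len(value))
    if isinstance(value, (list, tuple)):
        return ("sequence", len(value))
    if isinstance(value, str):
        return ("text", len(value))
    return ("scalar", 1)


# Each instance in the collection has a different structure.
collection = [
    {"name": "Joe", "shoe_size": 9},
    ["left", "right"],
    "free-form note",
    42,
]

summaries = [describe(item) for item in collection]
```

Whatever shape each instance has, the output is a set of uniform (kind, size) pairs, which is again something a relation can hold.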

David Livingstone

One specific example that I do know of, and that might be considered machine learning (?), is work in the area of shoeprint recognition that has been done at Northumbria University.  In addition to an image of a shoeprint, a shoeprint per se is described by a vector of 'Keypoint Descriptors', where the latter do indeed comprise a lot of numbers.  However you can think of a vector of 'Keypoint Descriptors' as the physical storage of a single logical scalar value that you could call a 'Shoeprint Description'.  Suppose then all the logical operators on 'Shoeprint Descriptions' are packaged into an object class - they typically are anyway these days.  Then use the class to implement a logical scalar type, which can be plugged into your DBMS.  In this case, the DBMS will also have to be able to cope with 'Large Physical Scalars', i.e. logical scalar values that need an entire OS file to physically store the value of one logical value, because a vector of 'Keypoint Descriptors' is stored in a file.

To sum up, you deal with all the numbers by putting them in a data type.

Indeed you can. First you put them all in an array, where all the values are the same data type (float) and only the value and its position in the array are significant. Then if you like you can package it all up with metadata (including labels) and provide some handy accessor functions, but we already know how to do that. It's just that the data in a format useful for ML and the data used in business processing are poles apart.

Andl - A New Database Language - andl.org
Quote from David Livingstone on June 5, 2019, 2:13 pm

'Data' is becoming fashionable again, even if 'Databases' are not.  We're always hearing about Big Data, AI and machine learning from big 'Data Sets', Data Science, etc.  (Northumbria University is currently developing a Data Science course).

A certain amount of this is pure hype, as industry media desperately tries to find something new and interesting to report. Indeed, "Big Data" is already talked about much less than it was several years ago. The big talking points now are Blockchain (though that seems to be on a slight decline), machine learning (also on a slight decline, maybe) and Internet-of-Things (on the way up).

We created a Big Data undergrad course a few years ago. I was its programme leader [1]. We attracted so few students that we shut it down. It turns out that 16/17 year old kids have no idea what "data science" is, but whatever it is, it sounds like no fun at all.

Of course, every university is different and attracts students with different expectations, background, and inclinations. It's entirely possible that Northumbria, unlike Derby, will be mobbed by prospective Data Science students.

--

[1] I'm currently the undergrad computer science programme leader, though I'm retiring from academia in a week (last day, June 14th 2019) to work full time on turning my software ideas into what will hopefully become viable (and paying, either directly or indirectly) products.

Quote from David Livingstone on June 5, 2019, 2:13 pm

'Data' is becoming fashionable again, even if 'Databases' are not.  We're always hearing about Big Data, AI and machine learning from big 'Data Sets', Data Science, etc.  (Northumbria University is currently developing a Data Science course).

Does this create a new opportunity for relational concepts and relational DBMSs (where relational is defined by TTM) ?

Many of us in this forum appear to have grown up with traditional business DBs, which have been dominated by SQL - I include myself in this area.  Do we now need to think about what other, new subject and application areas could really benefit from applying a true relational formalism ?

I like to distinguish between the relational formalism itself - as defined by TTM - and its application to designing queries and DBs.  Where and how else could we apply it ?  Is the TTM-relational formalism more than just a super-SQL ?

David Livingstone

Indeed, I don't think SQL is a feasible target, at least not right now. The problem is that most people employed to use SQL love it. If they don't love it, they wind up doing something that doesn't require SQL. To make any inroads here, a SQL alternative needs to be so mind-bogglingly amazing that even the most ardent SQL supporter can't resist looking at the new alternative. I don't think any TTM-inspired implementation or anything else -- including some of the relatively-successful NoSQL DBMSs like MongoDB -- has come remotely close. MongoDB (or whatever) is often a preferred choice for those who don't like SQL, but to succeed, an alternative needs to be irresistible to the decision-makers and developers who love SQL in spite of its limitations and currently wouldn't dream of using anything else.

I do think there is application for TTM outside of conventional database management systems.

As I've suggested before, I think the relational model can be an organising principle for manipulating collections of things or machine resources, like servers in a cloud or sensors in a sensor network or objects in an operating system. Regarding that last item, a year or two ago I ran across an open source project that used a SQL dialect to manipulate operating system resources, and it seemed quite usable. Unfortunately, I can't find it now. I'll look again when I get a moment, but it does suggest one possible direction for TTM ideas outside of trying to make something to compete with SQL.
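
To make the idea concrete, here is a minimal sketch in Python that treats the files in a directory as a relation and queries it with restrict and project. The helper names and the attribute set (name, ext, size) are assumptions made for illustration; this is not the open source project mentioned above.

```python
import os
import tempfile


def file_relation(directory):
    """Return one tuple (a dict) per regular file in `directory`."""
    rows = []
    with os.scandir(directory) as entries:
        for entry in entries:
            if entry.is_file():
                rows.append({
                    "name": entry.name,
                    "ext": os.path.splitext(entry.name)[1],
                    "size": entry.stat().st_size,
                })
    return rows


def restrict(relation, predicate):
    """Keep only the tuples that satisfy the predicate."""
    return [t for t in relation if predicate(t)]


def project(relation, attrs):
    """Keep only `attrs`; a relation has no duplicate tuples, so de-duplicate."""
    seen = {tuple(t[a] for a in attrs) for t in relation}
    return [dict(zip(attrs, values)) for values in sorted(seen)]


# Demonstration on a throwaway directory so that the result is predictable.
with tempfile.TemporaryDirectory() as d:
    for name, payload in [("a.log", b"x" * 10), ("b.log", b"x" * 2000), ("c.txt", b"x" * 3000)]:
        with open(os.path.join(d, name), "wb") as f:
            f.write(payload)
    big_files = project(
        restrict(file_relation(d), lambda t: t["size"] > 1024),
        ["name"],
    )
```

The same two operators would work unchanged over processes, servers or sensors, provided each resource can be presented as a set of tuples.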

I also think there is room for applying TTM ideas to general-purpose data manipulation and data management outside of conventional (i.e., SQL) DBMSs. It's what I'm leaving academia to work on.

It's inspired by using Rel for "production" desktop data management, and having reflected at length on its strengths and limitations. In particular, I've recognised there are the things it doesn't do at all (or does badly) but that it needs to do really, really well. The solution won't be a D and won't even be a new language per se, but it will respect the ideals of TTM, even though it won't necessarily embrace all the prescriptions and proscriptions. It will help programmers and "power users" work with data and build data-oriented applications, using popular languages and tools in a new but familiar -- and hopefully very appealing and powerful -- way.

Unfortunately, it's still a good way away from any sort of public release. I haven't made anywhere near as much progress as I would have liked since I last wrote along these lines roughly a year ago (was it a year ago?), as far too many employment (and life in general) obligations got in the way. But starting Monday June 17th, making it happen will become my full-time job.

I'm the forum administrator and lead developer of Rel. Email me at dave@armchair.mb.ca with the Subject 'TTM Forum'. Download Rel from https://reldb.org
Quote from Dave Voorhis on June 7, 2019, 3:15 pm
1 I'm currently the undergrad computer science programme leader, though I'm retiring from academia in a week (last day, June 14th 2019) to work full time on turning my software ideas into what will hopefully become viable (and paying, either directly or indirectly) products.
Congratulations Dave! What does that mean for Rel?
On a couple of topics from this thread that would be grist to Fabien's mill:
  • The so-called data scientists using big data to identify authors by their punctuation made available a web front-end that will take any passage and guess its authorship. Somebody tried the last paragraph of James Joyce's Ulysses, which famously has no punctuation, at the author's insistence. (The passage is immediately recognisable by anybody who knows literature for its rhythmic crescendo of insistent "Yes"es.) Result: divide by zero error.

 

  • "The big talking points now are Blockchain ..." There are a few local software houses running blockchain/virtual currency stuff. One of them got hacked; people unknown syphoned US$30m out of customers' accounts. Because it's blockchain there's no (visible) audit trail: nobody can tell which customers or how much each has lost (they have to look for themselves). Nobody can tell where the money's gone. Nobody knows how to get it back. The police have given up because the perpetrators are almost certainly overseas/outside their jurisdiction. Customers are suing but the local company has of course gone bankrupt/40 IT staff have lost their jobs.

Blockchain is going to revolutionise banking/trading systems ... because why?

Quote from Dave Voorhis on June 7, 2019, 3:15 pm

A certain amount of this is pure hype, as industry media desperately tries to find something new and interesting to report. Indeed, "Big Data" is already talked about much less than it was several years ago. The big talking points now are Blockchain (though that seems to be on a slight decline), machine learning (also on a slight decline, maybe) and Internet-of-Things (on the way up).

Big Data hasn't so much gone away as gone mainstream. Trainable pattern recognisers, mostly based on ANNs, are being widely used to harvest commercial opportunities, to the point where there are legitimate privacy concerns. There doesn't seem to be a 'next big thing' in AI, although the people at DeepMind would have you believe otherwise.

We created a Big Data undergrad course a few years ago. I was its programme leader [1]. We attracted so few students that we shut it down. It turns out that 16/17 year old kids have no idea what "data science" is, but whatever it is, it sounds like no fun at all.

I don't think it works at under-grad. It looks more like a Masters to me.

[1] I'm currently the undergrad computer science programme leader, though I'm retiring from academia in a week (last day, June 14th 2019) to work full time on turning my software ideas into what will hopefully become viable (and paying, either directly or indirectly) products.

Now that is a subject always of interest to me.

Indeed, I don't think SQL is a feasible target, at least not right now. The problem is that most people employed to use SQL love it. If they don't love it, they wind up doing something that doesn't require SQL. To make any inroads here, a SQL alternative needs to be so mind-bogglingly amazing that even the most ardent SQL supporter can't resist looking at the new alternative. I don't think any TTM-inspired implementation or anything else -- including some of the relatively-successful NoSQL DBMSs like MongoDB -- has come remotely close. MongoDB (or whatever) is often a preferred choice for those who don't like SQL, but to succeed, an alternative needs to be irresistible to the decision-makers and developers who love SQL in spite of its limitations and currently wouldn't dream of using anything else.

Agreed.

It's inspired by using Rel for "production" desktop data management, and having reflected at length on its strengths and limitations. In particular, I've recognised there are the things it doesn't do at all (or does badly) but that it needs to do really, really well. The solution won't be a D and won't even be a new language per se, but it will respect the ideals of TTM, even though it won't necessarily embrace all the prescriptions and proscriptions. It will help programmers and "power users" work with data and build data-oriented applications, using popular languages and tools in a new but familiar -- and hopefully very appealing and powerful -- way.

Is it a secret, or do you plan to blog about it?

Andl - A New Database Language - andl.org
Quote from AntC on June 8, 2019, 1:04 am
Quote from Dave Voorhis on June 7, 2019, 3:15 pm
1 I'm currently the undergrad computer science programme leader, though I'm retiring from academia in a week (last day, June 14th 2019) to work full time on turning my software ideas into what will hopefully become viable (and paying, either directly or indirectly) products.
Congratulations Dave! What does that mean for Rel?

For Rel, it means business as usual as an open source, largely education-oriented desktop DBMS that I will continue to develop and maintain. It belongs to me -- to the extent that it belongs to anyone, being open source -- not the university.

I'm the forum administrator and lead developer of Rel. Email me at dave@armchair.mb.ca with the Subject 'TTM Forum'. Download Rel from https://reldb.org