The Forum for Discussion about The Third Manifesto and Related Matters

Please or Register to create posts and topics.

"And the moral is, don't use Excel as a scientific database ..."

Page 1 of 5Next

Quote from Dave Voorhis on January 7, 2020, 10:40 am
Quote from AntC on January 7, 2020, 6:04 am
Quote from Vadim on January 6, 2020, 8:05 pm
Vadim: Consequently, the utility of databases in science is dubious, as they are essentially glorified storage engines.

AntC: If a database isn't a storage engine, it's useless to anybody. I dispute that SQL (if that's what you mean) makes anything glorified: quite the reverse. NoSQL usually makes the engine worse than SQL. Most practicing scientists seem to think that Excel is a storage engine.

Dave: Excel is a storage engine. And it's an easy storage engine. Almost everything else, SQL, NoSQL, Tableau, Qlikview, Hadoop, you-name-it is, for anyone but a theoretician or computer power-user -- which most of the world is not interested in becoming -- is too damn hard.

Even Excel is too damn hard, but most scientists or other technical people have achieved some notional facility with it, ...

The title quote is at about 14:00 here. Someone analysed a bunch of papers on genome research whose supporting evidence was in spreadsheets attached to the publication. Some Genome codes look a lot like dates: MARCH5, SEP8. Excel had interpreted them as dates at input. The genome statistics were garbage.

[Actually starts at about 11:30).]

No, there was no evidence presented that this affected the results. Perhaps it just didn't matter, but it certainly is sloppy.

Most of this talk is about spreadsheets containing calculations and various kinds of mistakes in them. There is one throwaway about 'why use Excel to store data if there is no calculation'?

My question is: if you've got some data in tabular format (a CSV) what would you use to maintain it other than Excel? What else is there? Where is the data storage product that is 'better than Excel'?

 

 

Andl - A New Database Language - andl.org
Quote from dandl on January 25, 2020, 9:36 am

My question is: if you've got some data in tabular format (a CSV) what would you use to maintain it other than Excel? What else is there? Where is the data storage product that is 'better than Excel'?

Even in the MS Office suite, Access is much better than Excel at handling data in a tabular format (it happily reads Excel, CSV, fixed-column text, ODBC, and others) and it gives you easy form and report creation.

But that should be obvious, if only because it's a database management system.

The problem is and has always been that the notion of defining a column with a type and a possible calculation that applies to all rows (i.e., EXTEND in Tutorial D, a calculated column in SQL) is yet another abstraction too far for the typical non-technical -- and, often, otherwise very capable and intelligent -- data user. It's far easier for him or her to conceptually understand a calculation that occurs on each cell in a spreadsheet, than the virtually incomprehensible magic of a calculation that automatically and only applies to every row in the given column.

It may sound like I'm being facetious, but I'm not. That really is a mental hurdle for most folks outside of us "database people", and is too big a mental gap to ever bridge for many.

I'm the forum administrator and lead developer of Rel. Email me at dave@armchair.mb.ca with the Subject 'TTM Forum'. Download Rel from https://reldb.org

No, Access isn't it. I know plenty about the "notion of defining a column" and I use Excel for preference, despite its limitations.

Why? Because of that one word: abstraction. Excel provides a concrete view of data in tables with a highly expressive layer of formatting on top. Access (and every other DBMS) imposes a highly restrictive abstraction of what it will allow and how it can be presented. Excel is like keeping money in your wallet, to spend as you wish. Access is like a bank account, you have to play by the bank's rules.

No, my question was not about switching from freeform spreadsheet to straitjacket DBMS, it was about a better tool for handling personal data. The good bits of Excel, but with more emphasis on being a good data store.

Andl - A New Database Language - andl.org

The question of utility databases in science is more fundamental than the use of particular storage engine (Excel, etc). Conventional wisdom is that Relational Databases have many capabilities but excel at none (pun intended).

Take biology, for example, and genetics, in particular. Sure DNA sequences are strings, so that there is little confusion what domain to use to represent them, but do we ever perform equijoin, on this domain? What about a query across genomics and proteomics databases where the join condition leverages genomics code? AFAIK, researchers do store data in relational databases, but then use BLAST program to perform analysis.

Matching/parsing/translating strings is relatively easy, but what do you do for spatial data? Certainly, you can capture protein molecule structure in a table of atomic coordinates, but then, can you actually query the spatial data?

Database research community is well aware of these issues, and frontier research projects like aforementioned LARA are aimed at expanding the reach of classic relational databases. (And I don't follow the connection to Child's work mentioned by Anthony).

Quote from Dave Voorhis on January 25, 2020, 10:20 am
Quote from dandl on January 25, 2020, 9:36 am

My question is: if you've got some data in tabular format (a CSV) what would you use to maintain it other than Excel? What else is there? Where is the data storage product that is 'better than Excel'?

Even in the MS Office suite, Access is much better than Excel at handling data in a tabular format ...

The problem is and has always been that the notion of defining a column with a type ...

It may sound like I'm being facetious, but I'm not. That really is a mental hurdle for most folks outside of us "database people", and is too big a mental gap to ever bridge for many.

Leaving aside what you go on to say about calculations, how hard can it be in Excel to click on a column head and apply a datatype setting? Youdathought genome researchers would have bumped into auto-reformatting problems many times, and have adopted a few disciplines. I guess when you insert a blank row or press return to append a row, Excel doesn't inherit the formatting from columns above/below? (That's one thing that Quantrix fixes IIRC; and of course MsAccess.)

I'm so used to database thinking, I can't even see what mental hurdle it is you're talking about. So I guess I (or anybody else here) am entirely unsuited to design the Next Big Thing (TM).

Quote from AntC on January 26, 2020, 5:58 am
Quote from Dave Voorhis on January 25, 2020, 10:20 am
Quote from dandl on January 25, 2020, 9:36 am

My question is: if you've got some data in tabular format (a CSV) what would you use to maintain it other than Excel? What else is there? Where is the data storage product that is 'better than Excel'?

Even in the MS Office suite, Access is much better than Excel at handling data in a tabular format ...

The problem is and has always been that the notion of defining a column with a type ...

It may sound like I'm being facetious, but I'm not. That really is a mental hurdle for most folks outside of us "database people", and is too big a mental gap to ever bridge for many.

Leaving aside what you go on to say about calculations, how hard can it be in Excel to click on a column head and apply a datatype setting?

But that's exactly the point. Excel only has two data types: numbers (floating point) and not numbers (text). If you import a CSV and don't pay attention, Excel will import your column of data as a date, convert it to a number and store that. And given the US heritage, there is a really good chance that even if it is a date, you'll get MMDDYYYY regardless of reality. And your original data is lost, never to be seen again.

Yes, you can click on a column heading and apply formatting, but it's still just a number or not a number.

Youdathought genome researchers would have bumped into auto-reformatting problems many times, and have adopted a few disciplines. I guess when you insert a blank row or press return to append a row, Excel doesn't inherit the formatting from columns above/below? (That's one thing that Quantrix fixes IIRC; and of course MsAccess.)

I'm so used to database thinking, I can't even see what mental hurdle it is you're talking about. So I guess I (or anybody else here) am entirely unsuited to design the Next Big Thing (TM).

Typical RDBMS have the opposite problem: you have to create the abstraction of columns and types before you can stick any data into it. Types are pervasive, and can really get in the way when you just want to play with data and add a bit of basic formatting.

Excel is astonishingly durable, given how bad the basic model looks by modern standards. The Next Big Thing now would be something like Google Sheets or AirTable I guess.

Andl - A New Database Language - andl.org
Quote from dandl on January 25, 2020, 1:17 pm

No, Access isn't it. I know plenty about the "notion of defining a column" and I use Excel for preference, despite its limitations.

Why? Because of that one word: abstraction. Excel provides a concrete view of data in tables with a highly expressive layer of formatting on top. Access (and every other DBMS) imposes a highly restrictive abstraction of what it will allow and how it can be presented. Excel is like keeping money in your wallet, to spend as you wish. Access is like a bank account, you have to play by the bank's rules.

That's not abstraction, that's conflation.

Excel conflates formatting with data, which is perfectly adequate -- though it demands a lot of manual effort (which some like; they think the effort is being "creative") -- when you only need one formatted "report" on one "table" of local data.

When you want multiple formatted views from one set of data, or want the data be drawn from multiple external data sources, it quickly becomes dire. It turns into spreadsheet hell. I have endless stories from my last workplace about spreadsheet hell and the people who preferred spreadsheet hell to anything even remotely database-like, even when databases would have so often solved so many problems.

But if your personal data is usually just one formatted "report" on one local "table" of data -- and you like that approach -- then Excel is probably the right tool for the job.

No, my question was not about switching from freeform spreadsheet to straitjacket DBMS, it was about a better tool for handling personal data. The good bits of Excel, but with more emphasis on being a good data store.

I'm making good progress on that, but I can almost guarantee you won't like it. I suspect you'll find it too databasey and not enough spreadsheety.

But I will certainly like it, and that's what matters.

Though I've shown what I'm working on to a few others. They like it too.

I'm the forum administrator and lead developer of Rel. Email me at dave@armchair.mb.ca with the Subject 'TTM Forum'. Download Rel from https://reldb.org
Quote from AntC on January 26, 2020, 5:58 am

I'm so used to database thinking, I can't even see what mental hurdle it is you're talking about. So I guess I (or anybody else here) am entirely unsuited to design the Next Big Thing (TM).

That's it. You're used to database thinking, I'm used to database thinking, and we've all been doing technical computery stuff for nigh unto half a century, give or take. Therefore, an awful lot of what seems intuitively obviously common knowledge to us is mind-bendingly complicated, confusing, and abstract to the beginner or average non-technical user.

I'm the forum administrator and lead developer of Rel. Email me at dave@armchair.mb.ca with the Subject 'TTM Forum'. Download Rel from https://reldb.org
Quote from dandl on January 26, 2020, 7:23 am
Quote from AntC on January 26, 2020, 5:58 am

...

I'm so used to database thinking, I can't even see what mental hurdle it is you're talking about. So I guess I (or anybody else here) am entirely unsuited to design the Next Big Thing (TM).

Typical RDBMS have the opposite problem: you have to create the abstraction of columns and types before you can stick any data into it. Types are pervasive, and can really get in the way when you just want to play with data and add a bit of basic formatting.

That turns out to be easy to fix from an interaction point of view. You can let the user simply plunk data into an arbitrary table/sheet and not worry about types, at least at first. What goes on behind the scenes is a bit more complex, as the engine sometimes has to convert data from one type to another and it sometimes fails (for reasonable reasons) and you sometimes have to tell the user and help him deal with it by fixing the data or specifying a type. But I think that's solvable and results in usable interaction.

Formatting is something else. As I suggested in a previous post, conflating formatting and data rapidly and inevitably becomes an abomination, no matter how convenient and superficially desirable it seems at first. Data and information ("information is data in formation") -- i.e., data entry and reports -- need to be kept separate and appreciated to be distinct, or we wind up back in spreadsheet hell.

 

I'm the forum administrator and lead developer of Rel. Email me at dave@armchair.mb.ca with the Subject 'TTM Forum'. Download Rel from https://reldb.org
Page 1 of 5Next