Meaning of a relation
Quote from tobega on January 22, 2021, 8:58 am
Quote from tobega on January 21, 2021, 3:47 pm
Quote from Erwin on January 21, 2021, 2:51 pm
Quote from tobega on January 21, 2021, 10:27 am
Quote from Erwin on January 20, 2021, 8:07 pm
Quote from tobega on January 13, 2021, 2:12 pm
"All the tuples are true and each tuple means that Employee X works in Department Y OR on project Z"
The problem I have with that is that when I project on (Employee, Department), I get a relation value where it is suddenly very difficult to assert anything about the employee's relation to the department. Since it would (I presume) still be required to have "AND" between all the tuples, my only recourse seems to be to say that each tuple means "Employee X may work in Department Y".
Three things I want to say here.
If the predicate of a relation (/relational expression) truly is "Employee X works in Department Y OR on project Z", [and the relation has three attributes X, Y and Z] then the CWA dictates that :
- if X works in dept.Y, then a tuple must be present for each and every possible Z value (each and every possible project) and the particular (X,Y) combination
- if X works on project Z, then a tuple must be present for each and every possible Y value (each and every possible department) and the particular (X,Z) combination.
Note very carefully that where I mentioned this example in my reply, I was talking very specifically of a relation of degree 1, just attribute X, and your Y and Z were ***NOT*** free variables in my predicate.
Representing this predicate in the form of a relation of degree 3, with Y (department) and Z (project) also being attributes of the relation, gets us straight into the territory of OUTER UNION. And while <OR> in A algebra is indeed intended to correspond to that form of union, there are practical reasons why we'd all want to avoid using it in practice, same reasons why real DBMS's ***require*** "heading compatibility" between the arguments used in an invocation of relational UNION.
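A minimal Python sketch of what the CWA forces here; the employee, department and project values, and the deliberately tiny finite domains, are purely illustrative:

```python
# Hypothetical facts and deliberately tiny domains, just to make the point visible.
works_in = {("Alice", "D1")}              # "Employee X works in Department Y"
works_on = {("Alice", "P7")}              # "Employee X works on project Z"
DEPARTMENTS = {"D1", "D2", "D3"}          # every *thinkable* department value
PROJECTS = {"P7", "P8"}                   # every *thinkable* project value
EMPLOYEES = {e for e, _ in works_in} | {e for e, _ in works_on}

# The degree-3 relation whose predicate is
#   "Employee X works in Department Y OR on project Z"
# must, under the CWA, contain a tuple for every (X, Y, Z) combination that
# makes the disjunction true -- hence the flood of tuples described above.
or_relation = {(x, y, z)
               for x in EMPLOYEES
               for y in DEPARTMENTS
               for z in PROJECTS
               if (x, y) in works_in or (x, z) in works_on}

print(sorted(or_relation))
# 4 tuples already for one employee and these tiny domains; with an
# unconstrained 4-byte department identifier it would be billions.
```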
Two, about the precise nature of the predicate that is satisfied by the result coming out of your projection (and of any projection in general, actually). In your example, that predicate is (study very carefully some 15-20 times, and only if and when you're sufficiently awake) :
- THERE EXISTS SOME PROJECT Z such that ...
- ... Employee X works in Department Y OR on project Z
Note how the 'Z' has, in terms of logic, become a ***bound*** variable of the predicate, meaning it's no longer a ***free*** variable, meaning there's no corresponding Z attribute in the relation representing that.
The tuples that ***should*** appear in the relation are :
- the (X,Y) tuples that satisfy "Employee X works in Department Y"
- ***ALL*** the (X,Y) tuples such that Y is ***some*** department and X works on ***some*** project Z (further unspecified in this relation)
PONDER THAT SECOND BULLET.
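A short Python sketch of that projection, using the same toy data as the previous sketch; the values are again purely illustrative:

```python
# Same toy data as the previous sketch.
works_in = {("Alice", "D1")}
works_on = {("Alice", "P7")}
DEPARTMENTS = {"D1", "D2", "D3"}

# Projecting the degree-3 "OR" relation onto (X, Y) yields a relation whose
# predicate is
#   "THERE EXISTS SOME PROJECT Z such that
#    Employee X works in Department Y OR on project Z"
# -- Z is now bound, so the result has no Z attribute.  Its tuples are
# exactly the two bullets above:
projection = ({(x, y) for (x, y) in works_in} |
              {(x, y) for (x, _) in works_on for y in DEPARTMENTS})

print(sorted(projection))
# [('Alice', 'D1'), ('Alice', 'D2'), ('Alice', 'D3')]
# i.e. the (X, Y) pairs satisfying "X works in Y", plus *all* (X, Y) pairs
# such that X works on *some* project.
```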
Three, about "it is suddenly very difficult to assert anything about the employee's relation to the department".
If the informational problem you are trying to address in some specific particular piece of code you're writing is about "the employee's relation to the department", ***and nothing else but that***, then you ***SHOULDN'T*** be inspecting anything that imports any concept into the data other than, precisely, "the employee's relation to the department". Meaning, you ***SHOULDN'T*** be inspecting things that say anything, at whatever level of vagueness, about the possible existence of projects that the employee at hand might possibly also be working on. Meaning, if your ***business problem*** is "does this employee work in this department or not", then ***use the relvar that TELLS you that thing (and nothing else)***.
Thanks for this carefully considered response, very useful to see the logic.
For my own purposes I concluded that the crucial thing is the CWA.
I also concluded that I do not want to have an OR between attributes. It remains to be seen if it is possible to work around this in general, but in this case I would use the predicate "It may be true that Employee X works for Department Y AND in Project Z", which holds when we generate the combinations given. It is a somewhat weaker assertion, but I can separately keep track of the fact that I expect to be able to replace "may be" by "is" for at least one tuple. The crucial point is that, by the CWA, no missing tuple can have "may be" replaced by "is".
You are of course correct that in the simple example given, the projection onto (employee, department) will by necessity contain all departments. But it will not be so if we meanwhile receive other pieces of information such as "nobody on project J works in department C".
As for a use case, are you familiar with the problem "Who owns the zebra?", otherwise known as Einstein's riddle? That is probably a good example of the type of work to do (without going to Datalog). In addition, of course, to more usual business purposes, since it's a language feature.
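A small Python sketch of this "may be" reading; the employee, department and project values and the sample constraint are purely illustrative:

```python
from itertools import product

# Toy data; every tuple asserts "it MAY be true that Employee X works for
# Department Y AND on Project Z".  By the CWA, a combination that is absent
# is definitely ruled out.
DEPARTMENTS = {"A", "B", "C"}
PROJECTS = {"J", "K"}
candidates = {("Alice", d, p) for d, p in product(DEPARTMENTS, PROJECTS)}

# New information arrives: "nobody on project J works in department C".
candidates = {(x, d, p) for (x, d, p) in candidates
              if not (d == "C" and p == "J")}

# Separately tracked expectation: at least one candidate per employee should
# eventually have its "may be" strengthened to "is".
assert any(x == "Alice" for (x, _, _) in candidates)
print(sorted(candidates))   # 5 remaining candidate combinations
```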
Little addendum FWIW :
Where I wrote "***ALL*** the (X,Y) tuples such that Y is ***some*** department and X works on ***some*** project Z (further unspecified in this relation)", the part "Y is ***some department***" specifically means "any thinkable department", ***including*** all the ones that are not claimed to exist (in some other relvar dedicated to that purpose of stating which departments are in existence).
So if departments are identified by, say, a 4-byte integer that has no further constraints on it (such as having to be positive, or having to be even, or whatever notion that would rule out some of the integers), then that's over 4bn tuples for every single employee who works on some project Z.
Take the tuple {X "Hugh Darwen", Y "does_not_exist", Z "The third Manifesto"}, fill in the values in the predicate and verify for yourself that the resulting proposition (the disjunction) is true (and will also be true for every other thinkable department identifier) if it is true that "Hugh Darwen works on The third Manifesto".
Well, all is all, as defined by the domain, which may or may not be all possible values of the 4 byte integer. It could also be restricted to a list of N distinct 4-byte values, in which case it would be N values. Otherwise why care about a restriction imposed by some semi-arbitrarily selected physical limit if you don't care about the restriction imposed by the business?
I think that clearly nonsensical alternatives can be safely ignored anyway, since ignoring them cannot affect the outcome. But maybe you've made a good argument against using OR in the predicate that applies to a tuple if we would have to add tuples even for nonsense values on the leg that doesn't matter? I am at least much happier that the AND form is easier to apply further restrictions to.
Thinking more about this, Erwin does have a point especially with regards to potential future values, which is why we might then have to have every possible (and impossible) value represented. So, yes, we could not safely persist such a relation, nor anything derived from it.
Quote from dandl on January 22, 2021, 1:41 pm
Quote from Dave Voorhis on January 20, 2021, 2:42 pm
Quote from dandl on January 20, 2021, 1:29 pm
My experience in using tools like this to build software is limited: Rails, MS EF, couple of others. I defer to experts.
But I know about building tools that are used in this way by others, and I know how hard it is to create the right abstraction and defend it over time, at scale and across change. I think the fundamental premise of the ORM is wrong, but while I think a 'layer over SQL' is probably a better way, at least for smaller apps, I don't think it scales or delivers the other benefits of a true abstraction layer.
These days, I get paid to use (and sometimes build) tools like this. ORMs are kind of like dynamically typed languages -- they're deceptively simple and make you think you're writing less code, and maybe you are, but you actually wind up going to a lot more effort.
I think a layer over SQL probably is a better way. Over time, I'll find out as I use it.
Having used true abstraction layers of at least the ORM variety -- such as Entity Framework -- with large applications, they seem to wind up encouraging a certain amount of code bloat and a lot of awfully SQL-like non-SQL, plus some actual SQL shoehorned in because something runs too slowly.
That doesn't seem particularly scalable.
The ORM is wrong because it maintains a fiction of an OO database. If you had an OO database you wouldn't use it, so why pretend?
Layer over SQL is less wrong (because it keeps the good bits) but it really only gives you better glue code, it doesn't raise the level of the language (or the database).
I would prefer a solution that was function oriented (like Linq or Java Streams) and composable, but providing the benefits of TTM: extensible type system, RA, MA. Visible SQL won't get you there.
Yes, the full benefits of the TTM ideal would be nice.
But, tradeoffs, as usual.
Writing less code is almost always good, but the main thing at scale is to change less code. Say you move between Postgres and MySQL and you find you have to change the case in every query. Or even the convention for joins is different. Or it's Oracle, which is even weirder. If the SQL queries are all abstracted into a generator, that's one thing to change; if they're all hand-written, that could be thousands. Even automated tools can struggle.
I've always been a fan of placing one layer of indirection between my code and someone else's API, so you don't repeat the same code fragments over and over, and so you know where to go to diagnose, log or update specific to the API. Same principle applies to an external data model: one layer of abstraction over the data model. Same goes for business logic like a pricing model or a regulatory compliance model.
A layer in which you can still see the SQL is not really a layer, by my reckoning. It's leaky.
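By way of illustration, a minimal Python sketch of "the SQL abstracted into a generator"; all class and function names here are hypothetical:

```python
# Dialect quirks live in one place; moving from PostgreSQL to MySQL (or
# Oracle) means changing one class, not thousands of hand-written queries.
class Dialect:
    def quote(self, identifier: str) -> str:
        return f'"{identifier}"'        # ANSI/PostgreSQL-style quoting

class MySqlDialect(Dialect):
    def quote(self, identifier: str) -> str:
        return f"`{identifier}`"        # MySQL-style quoting

def select(dialect: Dialect, table: str, columns: list[str], where: str) -> str:
    cols = ", ".join(dialect.quote(c) for c in columns)
    return f"SELECT {cols} FROM {dialect.quote(table)} WHERE {where}"

print(select(MySqlDialect(), "employee", ["name", "dept"], "dept = 'D1'"))
# SELECT `name`, `dept` FROM `employee` WHERE dept = 'D1'
```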
Quote from Dave Voorhis on January 22, 2021, 1:52 pm
Quote from dandl on January 22, 2021, 1:41 pm
Quote from Dave Voorhis on January 20, 2021, 2:42 pm
Quote from dandl on January 20, 2021, 1:29 pm
My experience in using tools like this to build software is limited: Rails, MS EF, couple of others. I defer to experts.
But I know about building tools that are used in this way by others, and I know how hard it is to create the right abstraction and defend it over time, at scale and across change. I think the fundamental premise of the ORM is wrong, but while I think a 'layer over SQL' is probably a better way, at least for smaller apps, I don't think it scales or delivers the other benefits of a true abstraction layer.
These days, I get paid to use (and sometimes build) tools like this. ORMs are kind of like dynamically typed languages -- they're deceptively simple and make you think you're writing less code, and maybe you are, but you actually wind up going to a lot more effort.
I think a layer over SQL probably is a better way. Over time, I'll find out as I use it.
Having used true abstraction layers of at least the ORM variety -- such as Entity Framework -- with large applications, they seem to wind up encouraging a certain amount of code bloat and a lot of awfully SQL-like non-SQL, plus some actual SQL shoehorned in because something runs too slowly.
That doesn't seem particularly scalable.
The ORM is wrong because it maintains a fiction of an OO database. If you had an OO database you wouldn't use it, so why pretend?
Layer over SQL is less wrong (because it keeps the good bits) but it really only gives you better glue code, it doesn't raise the level of the language (or the database).
I would prefer a solution that was function oriented (like Linq or Java Streams) and composable, but providing the benefits of TTM: extensible type system, RA, MA. Visible SQL won't get you there.
Yes, the full benefits of the TTM ideal would be nice.
But, tradeoffs, as usual.
Writing less code is almost always good, but the main thing at scale is to change less code. Say you move between Postgres and MySQL and you find you have to change the case in every query. Or even the convention for joins is different. Or it's Oracle, which is even weirder. If the SQL queries are all abstracted into a generator, that's one thing to change; if they're all hand-written, that could be thousands. Even automated tools can struggle.
In roughly 30+ years of developing, supporting, and consulting on SQL-based database-driven applications, I've seen a requirement to move from one DBMS to a different DBMS once.
Notably, that was late last year; a proposed move from Oracle Database to PostgreSQL for an application.
In my domains, the majority of applications start and finish on one DBMS.
I've always been a fan of placing one layer of indirection between my code and someone else's API, so you don't repeat the same code fragments over and over, and so you know where to go to diagnose, log or update specific to the API. Same principle applies to an external data model: one layer of abstraction over the data model. Same goes for business logic like a pricing model or a regulatory compliance model.
Certainly. But in that respect, having a SQL-based database abstraction layer is no different from having an ORM-QL-based database abstraction layer.
A layer in which you can still see the SQL is not really a layer, by my reckoning. It's leaky.
Perhaps, but having repeatedly experienced the downsides of ORMs and ORM-QLs and other hide-almost-all-the-SQL SQL database abstraction layers, I'm starting to think that seeing the SQL -- leaky abstraction or not -- may be the least evil of all evils if you're using a SQL database.
Indeed, maybe it's better to expose the SQL and make it easier to integrate. Hence, my "SQL amplifier."
But, entirely in the database abstraction layer, not peppered throughout the application. That isn't what I have in mind.
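A minimal Python sketch of SQL that is visible but confined to the database abstraction layer; the table, column and class names are made up:

```python
import sqlite3

# The application calls repository methods and never writes SQL itself,
# yet the SQL is right there in the one layer to read, log and tune.
class EmployeeRepository:
    def __init__(self, conn: sqlite3.Connection):
        self.conn = conn

    def departments_of(self, employee: str) -> list[str]:
        sql = "SELECT dept FROM works_in WHERE employee = ?"
        return [row[0] for row in self.conn.execute(sql, (employee,))]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE works_in (employee TEXT, dept TEXT)")
conn.execute("INSERT INTO works_in VALUES ('Alice', 'D1')")
print(EmployeeRepository(conn).departments_of("Alice"))   # ['D1']
```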
Quote from Erwin on January 22, 2021, 4:54 pm
Quote from Dave Voorhis on January 22, 2021, 1:52 pm
Quote from dandl on January 22, 2021, 1:41 pm
Quote from Dave Voorhis on January 20, 2021, 2:42 pm
Quote from dandl on January 20, 2021, 1:29 pm
My experience in using tools like this to build software is limited: Rails, MS EF, couple of others. I defer to experts.
But I know about building tools that are used in this way by others, and I know how hard it is to create the right abstraction and defend it over time, at scale and across change. I think the fundamental premise of the ORM is wrong, but while I think a 'layer over SQL' is probably a better way, at least for smaller apps, I don't think it scales or delivers the other benefits of a true abstraction layer.
These days, I get paid to use (and sometimes build) tools like this. ORMs are kind of like dynamically typed languages -- they're deceptively simple and make you think you're writing less code, and maybe you are, but you actually wind up going to a lot more effort.
I think a layer over SQL probably is a better way. Over time, I'll find out as I use it.
Having used true abstraction layers of at least the ORM variety -- such as Entity Framework -- with large applications, they seem to wind up encouraging a certain amount of code bloat and a lot of awfully SQL-like non-SQL, plus some actual SQL shoehorned in because something runs too slowly.
That doesn't seem particularly scalable.
The ORM is wrong because it maintains a fiction of an OO database. If you had an OO database you wouldn't use it, so why pretend?
Layer over SQL is less wrong (because it keeps the good bits) but it really only gives you better glue code, it doesn't raise the level of the language (or the database).
I would prefer a solution that was function oriented (like Linq or Java Streams) and composable, but providing the benefits of TTM: extensible type system, RA, MA. Visible SQL won't get you there.
Yes, the full benefits of the TTM ideal would be nice.
But, tradeoffs, as usual.
Writing less code is almost always good, but the main thing at scale is to change less code. Say you move between Postgres and MySQL and you find you have to change the case in every query. Or even the convention for joins is different. Or it's Oracle, which is even weirder. If the SQL queries are all abstracted into a generator, that's one thing to change; if they're all hand-written, that could be thousands. Even automated tools can struggle.
In roughly 30+ years of developing, supporting, and consulting on SQL-based database-driven applications, I've seen a requirement to move from one DBMS to a different DBMS once.
Notably, that was late last year; a proposed move from Oracle Database to PostgreSQL for an application.
In my domains, the majority of applications start and finish on one DBMS.
I've always been a fan of placing one layer of indirection between my code and someone else's API, so you don't repeat the same code fragments over and over, and so you know where to go to diagnose, log or update specific to the API. Same principle applies to an external data model: one layer of abstraction over the data model. Same goes for business logic like a pricing model or a regulatory compliance model.
Certainly. But in that respect, having a SQL-based database abstraction layer is no different from having an ORM-QL-based database abstraction layer.
A layer in which you can still see the SQL is not really a layer, by my reckoning. It's leaky.
Perhaps, but having repeatedly experienced the downsides of ORMs and ORM-QLs and other hide-almost-all-the-SQL SQL database abstraction layers, I'm starting to think that seeing the SQL -- leaky abstraction or not -- may be the least evil of all evils if you're using a SQL database.
Indeed, maybe it's better to expose the SQL and make it easier to integrate. Hence, my "SQL amplifier."
I've participated once in the construction of such a "data access layer" (as a user and thus primarily as a critic). It had pretty avant-garde features for those days (+- 1995) :
- Every business entity modeled and implemented for history-keeping (largely in line with TDATRM and TaRT)
- No SQL in any other module than those constituting the DAL
- Modules of the DAL classified in three distinct categories : query one, query list, update (covering all of INSERT, DELETE, UPDATE). The only exception ever made was by me where I had a case of having to query by three distinct time dimensions simultaneously (and technically I made it fit the "query list" paradigm). I still get a kick out of having to explain that one to others.
- "Query list" facilitated "querying in chunks", say, "give me the first twenty of this", and on the next invocation "give me the first twenty after the last one I got from the previous invocation" etc. etc.
- All data modules compiled, and thus no components run-time-generating the SQL with the server side parsing and compiling it (not even as a reusable "prepared statement" which is as far as any ORM can go).
- All update modules including optimistic locking by design.
- (maybe some more I've forgotten about by now)
- (In particular : messaging between modules of distinct layers was managed by the communication layer, which ensured absolute platform independence between individual modules and made it possible, e.g., to go read data from any database, and in particular the non-DB2 thing that 99% of the business was still running on : it's the comm layer that decided whether to do an assembler CALL or to use LU6.2 (the mainframe's posh name for TCP/IP, so I was told).)
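A minimal Python sketch of the "query list" chunking described above, as keyset pagination; the schema and names are made up:

```python
import sqlite3

# "Give me the first twenty", then "the first twenty after the last one I
# got from the previous invocation" -- keyset pagination on the primary key.
def query_list(conn, after_id=None, chunk=20):
    if after_id is None:
        return conn.execute(
            "SELECT id, name FROM customer ORDER BY id LIMIT ?",
            (chunk,)).fetchall()
    return conn.execute(
        "SELECT id, name FROM customer WHERE id > ? ORDER BY id LIMIT ?",
        (after_id, chunk)).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO customer (name) VALUES (?)",
                 [(f"c{i}",) for i in range(50)])
first = query_list(conn)
second = query_list(conn, after_id=first[-1][0])
print(len(first), len(second))   # 20 20
```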
Quote from Dave Voorhis on January 22, 2021, 5:43 pm
Quote from Erwin on January 22, 2021, 4:54 pm
I've participated once in the construction of such a "data access layer" (as a user and thus primarily as a critic). It had pretty avant-garde features for those days (+- 1995) :
...
- "Query list" facilitated "querying in chunks", say, "give me the first twenty of this", and on the next invocation "give me the first twenty after the last one I got from the previous invocation" etc. etc.
...
As an aside, I see this a lot.
As you'd expect, it's often used to support paginated lists on Web sites.
I occasionally see it show up in contexts that have nothing to do with Web pagination.
Like, "We only have 128 gigabytes on the server! There's no way it can possibly handle more than a thousand records at a time!"
Each record is less than 1K bytes. What?
Or, "We have to process data in small chunks or we get charged extra for database use!"
What!?
Or, "We process the data in batches because there's no way we're going to let the DBMS process it all at once! What if it failed part way through? We'd have to start over from scratch!"
What??!?!
There's no countering this sort of madness.
Quote from Erwin on January 22, 2021, 6:38 pm
Quote from Dave Voorhis on January 22, 2021, 5:43 pm
Quote from Erwin on January 22, 2021, 4:54 pm
I've participated once in the construction of such a "data access layer" (as a user and thus primarily as a critic). It had pretty avant-garde features for those days (+- 1995) :
...
- "Query list" facilitated "querying in chunks", say, "give me the first twenty of this", and on the next invocation "give me the first twenty after the last one I got from the previous invocation" etc. etc.
...
As an aside, I see this a lot.
As you'd expect, it's often used to support paginated lists on Web sites.
I occasionally see it show up in contexts that have nothing to do with Web pagination.
Like, "We only have 128 gigabytes on the server! There's no way it can possibly handle more than a thousand records at a time!"
Each record is less than 1K bytes. What?
Or, "We have to process data in small chunks or we get charged extra for database use!"
What!?
Or, "We process the data in batches because there's no way we're going to let the DBMS process it all at once! What if it failed part way through? We'd have to start over from scratch!"
What??!?!
There's no countering this sort of madness.
Possibly because there's a valid point somewhere in there.
In the days (and on the machine) I'm talking about, an I/O still took 10ms. And our COBOL environment (in the broad sense) still did not support "records" >32K. Row-level locking had already been invented, but we couldn't afford the extra memory required to really go do that. (We were still in the phase of being impressed with MVS/ESA supporting >16M for the entire machine, and addressing was still 31-bit, so at any rate memory was still limited to 2G. Though maybe they already had techniques to stash more physical memory in the machine and just make it seem to an individual task that 2G was the max.)
I attracted piles of dismissal and scorn and mockery when I tried to convince colleagues that a table with zero elements was a valid way for a "query list" module to inform its caller that no entries were found. I would only run into the term "nullologist" more than 10 yrs later.
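A tiny Python sketch of that nullological point; the data and names are made up:

```python
# An empty result is a perfectly valid answer from a "query list" module:
# no error code, no None, no sentinel value needed.
def query_list(rows, predicate):
    return [r for r in rows if predicate(r)]

result = query_list([{"id": 1, "dept": "D1"}], lambda r: r["dept"] == "D9")
for r in result:        # simply does nothing when no entries were found
    print(r)
print(len(result))      # 0
```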
In those days, there was one particularly effective way to piss off the chief systems guy to almost no end, and that was by quoting the chief data guy's dictum(/boutade) that his entire architecture was built on : "the machine is big enough and money isn't a problem". It did start becoming a problem the moment our entire department was put under curatorship of our shareholder's department of bean counters :-) .
I sense I'm getting onto my pet-peeve hobbyhorse here, so perhaps better to stop here :-)
Quote from Dave Voorhis on January 22, 2021, 6:44 pm
Quote from Erwin on January 22, 2021, 6:38 pm
Quote from Dave Voorhis on January 22, 2021, 5:43 pm
Quote from Erwin on January 22, 2021, 4:54 pm
I've participated once in the construction of such a "data access layer" (as a user and thus primarily as a critic). It had pretty avant-garde features for those days (+- 1995) :
...
- "Query list" facilitated "querying in chunks", say, "give me the first twenty of this", and on the next invocation "give me the first twenty after the last one I got from the previous invocation" etc. etc.
...
As an aside, I see this a lot.
As you'd expect, it's often used to support paginated lists on Web sites.
I occasionally see it show up in contexts that have nothing to do with Web pagination.
Like, "We only have 128 gigabytes on the server! There's no way it can possibly handle more than a thousand records at a time!"
Each record is less than 1K bytes. What?
Or, "We have to process data in small chunks or we get charged extra for database use!"
What!?
Or, "We process the data in batches because there's no way we're going to let the DBMS process it all at once! What if it failed part way through? We'd have to start over from scratch!"
What??!?!
There's no countering this sort of madness.
Possibly because there's a valid point somewhere in there.
...
Indeed, I suspect some of the madness comes from a time when it wasn't madness, twenty or thirty or forty years ago.
Conditions have changed.
Quote from Erwin on January 22, 2021, 7:43 pm
Quote from Dave Voorhis on January 22, 2021, 6:44 pm
Quote from Erwin on January 22, 2021, 6:38 pm
Quote from Dave Voorhis on January 22, 2021, 5:43 pm
Quote from Erwin on January 22, 2021, 4:54 pm
I've participated once in the construction of such a "data access layer" (as a user and thus primarily as a critic). It had pretty avant-garde features for those days (+- 1995) :
...
- "Query list" facilitated "querying in chunks", say, "give me the first twenty of this", and on the next invocation "give me the first twenty after the last one I got from the previous invocation" etc. etc.
...
As an aside, I see this a lot.
As you'd expect, it's often used to support paginated lists on Web sites.
I occasionally see it show up in contexts that have nothing to do with Web pagination.
Like, "We only have 128 gigabytes on the server! There's no way it can possibly handle more than a thousand records at a time!"
Each record is less than 1K bytes. What?
Or, "We have to process data in small chunks or we get charged extra for database use!"
What!?
Or, "We process the data in batches because there's no way we're going to let the DBMS process it all at once! What if it failed part way through? We'd have to start over from scratch!"
What??!?!
There's no countering this sort of madness.
Possibly because there's a valid point somewhere in there.
...
Indeed, I suspect some of the madness comes from a time when it wasn't madness, twenty or thirty or forty years ago.
Conditions have changed.
The numbers have changed. Possibly here and there also the relative ratio between some pair of them. What I am deeply convinced of is that very little has changed in the ratio between the user's expectations and the machine's intrinsic capabilities.
These days, our users (I'm in a data warehousing service these days) expect to be handed tens if not hundreds of GB of data within less than half a workday of having formulated the question (just the FTP of such a file from the mainframe to the analysis environment already takes a couple of hours if it's overnight). When we made our entire archive of medical records available in the analysis environment (600,000,000 1K records per year), a clever user saw fit to initiate a comparative analysis over time and he was genuinely surprised that it would still be running after 4 hrs. And since these "dinky toy" environments (term of my invention) are ***not*** the champions of scheduling in such a way that every user gets his fair share of computer time, ***all*** other users suffered the consequences throughout all that time until someone decided to kick that resource hog out. The cost is not machine time, the cost is the other business users getting stalled in their daily work and being unproductive.
I bought Don Knuth's magnum opus recently as a New Year's present to myself because I needed some consolation. He describes the very same problems and the very same concerns I'm describing in this post!!! It's not really the "conditions" that have changed, it's just the numbers.
Quote from Dave Voorhis on January 22, 2021, 8:32 pm
Quote from Erwin on January 22, 2021, 7:43 pm
Quote from Dave Voorhis on January 22, 2021, 6:44 pm
Quote from Erwin on January 22, 2021, 6:38 pm
Quote from Dave Voorhis on January 22, 2021, 5:43 pm
Quote from Erwin on January 22, 2021, 4:54 pm
I've participated once in the construction of such a "data access layer" (as a user and thus primarily as a critic). It had pretty avant-garde features for those days (+- 1995) :
...
- "Query list" facilitated "querying in chunks", say, "give me the first twenty of this", and on the next invocation "give me the first twenty after the last one I got from the previous invocation" etc. etc.
...
As an aside, I see this a lot.
As you'd expect, it's often used to support paginated lists on Web sites.
I occasionally see it show up in contexts that have nothing to do with Web pagination.
Like, "We only have 128 gigabytes on the server! There's no way it can possibly handle more than a thousand records at a time!"
Each record is less than 1K bytes. What?
Or, "We have to process data in small chunks or we get charged extra for database use!"
What!?
Or, "We process the data in batches because there's no way we're going to let the DBMS process it all at once! What if it failed part way through? We'd have to start over from scratch!"
What??!?!
There's no countering this sort of madness.
Possibly because there's a valid point somewhere in there.
...
Indeed, I suspect some of the madness comes from a time when it wasn't madness, twenty or thirty or forty years ago.
Conditions have changed.
The numbers have changed. Possibly here and there also the relative ratio between some pair of them. What I am deeply convinced of is that very little has changed in the ratio between the user's expectations and the machine's intrinsic capabilities.
These days, our users (I'm in a data warehousing service these days) expect to be handed tens if not hundreds of GB of data within less than half a workday of having formulated the question (just the FTP of such a file from the mainframe to the analysis environment already takes a couple of hours, even when it's run overnight). When we made our entire archive of medical records available in the analysis environment (600.000.000 1K records per year), a clever user saw fit to initiate a comparative analysis over time and was genuinely surprised that it was still running after 4 hrs. And since these "dinky toy" environments (term of my invention) are ***not*** champions of scheduling in such a way that every user gets his fair share of computer time, ***all*** other users suffered the consequences throughout all that time, until someone decided to kick that resource hog out. The cost is not machine time; the cost is the other business users getting stalled in their daily work and being unproductive.
I bought Don Knuth's magnum opus recently as a New Year's present to myself because I needed some consolation. He describes the very same problems and the very same concerns I'm describing in this post! It's not really the "conditions" that have changed, it's just the numbers.
Well, yes. Unquestionably, certain types of problems are the same old problem on a modern scale. Demand expands to fill capacity, etc.
The kinds of issues I mentioned were user/tech scale/capacity misunderstandings, each a case of refusing to use the DBMS to do things a DBMS does well, and instead demanding a complex, failure-prone, iterative, multiple-batch bodge with multiple query hits to the DBMS instead of one. Not because it's necessary, but because someone -- perhaps remembering how some system used to be, decades ago -- thinks it's still necessary when it isn't.
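To make that contrast concrete, here is a minimal sketch using Python's built-in sqlite3 and an invented orders(id, status, total) table (purely illustrative, not anyone's actual schema): one set-oriented statement handed to the DBMS in a single transaction, versus the iterative row-at-a-time bodge.

# Minimal sketch contrasting one set-oriented statement with a
# row-at-a-time "batch bodge". The orders table is invented.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT, total REAL)")
conn.executemany("INSERT INTO orders (status, total) VALUES (?, ?)",
                 [("open", i * 1.0) for i in range(10_000)])
conn.commit()

# 1. Let the DBMS do it: one set-oriented statement, one atomic transaction.
with conn:
    conn.execute("UPDATE orders SET status = 'archived' WHERE total < 5000")

# 2. The batch bodge: pull the keys out, then hit the DBMS once per row,
#    each little update in its own transaction, with a failure window per trip.
ids = [row[0] for row in conn.execute("SELECT id FROM orders WHERE total >= 5000")]
for order_id in ids:
    conn.execute("UPDATE orders SET status = 'reviewed' WHERE id = ?", (order_id,))
    conn.commit()

The single-statement version is also the answer to the "what if it fails part way through?" worry: the DBMS applies the whole update atomically or not at all.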
Another was an IT professional I knew who was categorically insistent that there be no queries with joins -- i.e., denormalise and duplicate so that every necessary query has all the data needed in one table -- because "everyone knows joins are too slow."
That was even after I demonstrated, with the production volume of live data (a few hundred records) on the production-grade DBMS installation, that joined queries against a normalised (3NF) database were effectively instantaneous.
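The kind of demonstration meant here takes only a few lines to reproduce. A sketch with an invented, normalised customers/orders schema (again sqlite3; the tables and row counts are made up for illustration, not the actual system):

# A toy normalised schema and a joined query over it.
# The customers/orders tables and row counts are invented for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
        total       REAL NOT NULL
    );
""")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(i, f"Customer {i}") for i in range(500)])
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(i, i % 500, i * 2.5) for i in range(5000)])
conn.commit()

# The joined query that the "no joins, ever" position claims is too slow;
# at these volumes it returns effectively instantaneously.
rows = conn.execute("""
    SELECT c.name, COUNT(*) AS order_count, SUM(o.total) AS grand_total
    FROM customers c JOIN orders o ON o.customer_id = c.customer_id
    GROUP BY c.customer_id, c.name
    ORDER BY grand_total DESC
    LIMIT 10
""").fetchall()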
Now that I think about it, I've been up against that one multiple times, and never yet because it's warranted.
I think it's the same sort of misunderstanding that has slowly changed the meaning of "big data" from "bigger and faster and more variation than we have technology to handle, so we have to invent new technology" to "bigger than the usual off-the-shelf solutions can handle, so we have to buy something expensive" to "we'll have to use MySQL instead of Excel" to "a really, really big spreadsheet" to "a bigger spreadsheet than most of the ones I've seen today."
Rant, rant, rant.
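Regarding the "Query list" / querying-in-chunks feature quoted above: the modern incarnation of "give me the first twenty after the last one I got" is usually keyset pagination. A minimal sketch, assuming a hypothetical items(id, label) table queried through Python's built-in sqlite3 (none of this comes from the 1995 layer being described):

# Keyset ("next twenty after the last one I got") pagination,
# sketched against a hypothetical items(id, label) table.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, label TEXT)")
conn.executemany("INSERT INTO items (label) VALUES (?)",
                 [(f"item {i}",) for i in range(95)])
conn.commit()

def fetch_page(after_id=0, page_size=20):
    """Return the next page_size rows whose id is greater than after_id."""
    return conn.execute(
        "SELECT id, label FROM items WHERE id > ? ORDER BY id LIMIT ?",
        (after_id, page_size)).fetchall()

page = fetch_page()                        # "give me the first twenty"
while page:
    last_id = page[-1][0]                  # remember the last key seen
    page = fetch_page(after_id=last_id)    # "the first twenty after the last one"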
Quote from Erwin on January 22, 2021, 8:50 pm
Quote from Dave Voorhis on January 22, 2021, 8:32 pm
Quote from Erwin on January 22, 2021, 7:43 pm
Quote from Dave Voorhis on January 22, 2021, 6:44 pm
Quote from Erwin on January 22, 2021, 6:38 pm
Quote from Dave Voorhis on January 22, 2021, 5:43 pm
Quote from Erwin on January 22, 2021, 4:54 pm
I've participated once in the construction of such a "data access layer" (as a user and thus primarily as a critic). It had pretty avant-garde features for those days (ca. 1995):
...
- "Query list" facilitated "querying in chunks", say, "give me the first twenty of this", and on the next invocation "give me the first twenty after the last one I got from the previous invocation" etc. etc.
...
As an aside, I see this a lot.
As you'd expect, it's often used to support paginated lists on Web sites.
I occasionally see it show up in contexts that have nothing to do with Web pagination.
Like, "We only have 128 gigabytes on the server! There's no way it can possibly handle more than a thousand records at a time!"
Each record is less than 1K bytes. What?
Or, "We have to process data in small chunks or we get charged extra for database use!"
What!?
Or, "We process the data in batches because there's no way we're going to let the DBMS process it all at once! What if it failed part way through? We'd have to start over from scratch!"
What??!?!
There's no countering this sort of madness.
Possibly because there's a valid point somewhere in there.
...
Indeed, I suspect some of the madness comes from a time when it wasn't madness, twenty or thirty or forty years ago.
Conditions have changed.
The numbers have changed. Possibly here and there also the relative ratio between some pair of them. What I am deeply convinced of is that very little has changed in the ratio between the user's expectations and the machine's intrinsic capabilities.
These days, our users (I'm in a data warehousing service these days) expect to be handed tens if not hundreds of GB of data within less than half a workday of having formulated the question (just the FTP of such a file from the mainframe to the analysis environment already takes a couple of hours, even when it's run overnight). When we made our entire archive of medical records available in the analysis environment (600.000.000 1K records per year), a clever user saw fit to initiate a comparative analysis over time and was genuinely surprised that it was still running after 4 hrs. And since these "dinky toy" environments (term of my invention) are ***not*** champions of scheduling in such a way that every user gets his fair share of computer time, ***all*** other users suffered the consequences throughout all that time, until someone decided to kick that resource hog out. The cost is not machine time; the cost is the other business users getting stalled in their daily work and being unproductive.
I bought Don Knuth's magnum opus recently as a New Year's present to myself because I needed some consolation. He describes the very same problems and the very same concerns I'm describing in this post! It's not really the "conditions" that have changed, it's just the numbers.
Well, yes. Unquestionably, certain types of problems are the same old problem on a modern scale. Demand expands to fill capacity, etc.
The kinds of issues I mentioned were user/tech scale/capacity misunderstandings, each a case of refusing to use the DBMS to do things a DBMS does well, and instead demanding a complex, failure-prone, iterative, multiple-batch bodge with multiple query hits to the DBMS instead of one. Not because it's necessary, but because someone -- perhaps remembering how some system used to be, decades ago -- thinks it's still necessary when it isn't.
Another was an IT professional I knew who was categorically insistent that there be no queries with joins -- i.e., denormalise and duplicate so that every necessary query has all the data needed in one table -- because "everyone knows joins are too slow."
That was even after I demonstrated, with the production volume of live data (a few hundred records) on the production-grade DBMS installation, that joined queries against a normalised (3NF) database were effectively instantaneous.
Now that I think about it, I've been up against that one multiple times, and never yet because it's warranted.
I think it's the same sort of misunderstanding that has slowly changed the meaning of "big data" from "bigger and faster and more variation than we have technology to handle, so we have to invent new technology" to "bigger than the usual off-the-shelf solutions can handle, so we have to buy something expensive" to "we'll have to use MySQL instead of Excel" to "a really, really big spreadsheet" to "a bigger spreadsheet than most of the ones I've seen today."
Rant, rant, rant.
Yeah. That data set of 600.000.000 1K records per year is the result of "denormalization" taken to the extreme, because "having all the data in one single table is ***way*** simpler" -- as, supposedly, is handling all the "absence markers", whether present explicitly or implied by "strange" values such as EBCDIC blanks in a packed decimal field, which could have you believe a certain amount is 4040404 (in whatever your monetary unit is) if you're not careful. Never mind. To our mutual regret, we might agree on more issues than either of us is willing to admit :-). And yes, I needed the rant. Thanks for the occasion :-).
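The 4040404 remark can be made concrete: an EBCDIC blank is the byte 0x40, and in packed decimal (COMP-3) each byte carries two digit nibbles with the final nibble reserved for the sign, so a field left as blanks decodes, if you read it naively, to the digits 4-0-4-0-4-0-4. A small sketch; the four-byte field width is just an example, not taken from the actual record layout.

# Why a packed-decimal (COMP-3) field full of EBCDIC blanks reads as 4040404.
# An EBCDIC blank is 0x40; each packed byte holds two digit nibbles,
# and the last nibble is the sign. Four-byte width is illustrative only.

def naive_unpack_comp3(raw: bytes) -> int:
    nibbles = []
    for b in raw:
        nibbles.append(b >> 4)      # high nibble
        nibbles.append(b & 0x0F)    # low nibble
    sign_nibble = nibbles.pop()     # should be 0xC, 0xD or 0xF in real data
    value = int("".join(str(n) for n in nibbles))
    return -value if sign_nibble == 0x0D else value

blank_field = b"\x40\x40\x40\x40"   # "    " in EBCDIC, i.e. an absent amount
print(naive_unpack_comp3(blank_field))   # -> 4040404, not a real amount

# A careful reader checks for the all-blanks case before decoding:
if blank_field == b"\x40" * len(blank_field):
    print("field is blank -- treat as absent, not as 4040404")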