The Forum for Discussion about The Third Manifesto and Related Matters

Please or Register to create posts and topics.

`SELECT DISTINCT` -- a(nother) disaster waiting to happen

"[SELECT DISTINCT] is precisely an awfulterriblehorrible way to write queries. First, select distinct incurs overhead, so it should only be used when necessary. More importantly, it not should not be used to remove duplicates unless the query-writer thoroughly understands the data and the query and why distinct values might appear."

(The guy has a rep of 1.1m, mostly obtained answering SQL queries. He's the same guy thinks "Natural Join is a disaster waiting to happen".)

Funny old industry we're in. I always (at least mentally) write SELECT DISTINCT at first; then assure myself the query necessarily returns distinct rows, before removing the DISTINCT as optimisation. And I thought I was following industry best practice.

You'd'a thought that the compiler would be smart enough to see when DISTINCT was superfluous, and optimise it away for itself, without me (or Mr. Brain-the-size-of-a-planet) needing to scratch our heads about it. You'd'a thought the compiler or language would have a way to say 'I'm expecting no duplicates here, please advise if I'm wrong' vs 'I understand there might be duplicates here, and I deliberately want you to squish them out' vs '... deliberately produce them all' vs 'I'm a newbie with no idea what/why duplicates are, reject the query if there's a risk of duplicates' -- with that last being the default action.

Instead we get the nonsense with UNION (nothing) being the default; no way to say explicitly UNION DISTINCT vs UNION ALL; and no way to say SELECT ALL.

Quote from AntC on May 18, 2021, 11:53 am

"[SELECT DISTINCT] is precisely an awfulterriblehorrible way to write queries. First, select distinct incurs overhead, so it should only be used when necessary. More importantly, it not should not be used to remove duplicates unless the query-writer thoroughly understands the data and the query and why distinct values might appear."

(The guy has a rep of 1.1m, mostly obtained answering SQL queries. He's the same guy thinks "Natural Join is a disaster waiting to happen".)

Funny old industry we're in. I always (at least mentally) write SELECT DISTINCT at first; then assure myself the query necessarily returns distinct rows, before removing the DISTINCT as optimisation. And I thought I was following industry best practice.

There's that bad word again. Did you mean that you think this change will help the query planner to construct a lower cost plan? Why?

You'd'a thought that the compiler would be smart enough to see when DISTINCT was superfluous, and optimise it away for itself, without me (or Mr. Brain-the-size-of-a-planet) needing to scratch our heads about it. You'd'a thought the compiler or language would have a way to say 'I'm expecting no duplicates here, please advise if I'm wrong' vs 'I understand there might be duplicates here, and I deliberately want you to squish them out' vs '... deliberately produce them all' vs 'I'm a newbie with no idea what/why duplicates are, reject the query if there's a risk of duplicates' -- with that last being the default action.

You should not assume that the compiler (responsible for parsing, syntax and type checking) and the query planner (responsible for choosing how to execute the query) are the same thing or even happen at the same time. And last time I checked the query planner was perfectly capable of ignoring DISTINCT on an indexed field

Instead we get the nonsense with UNION (nothing) being the default; no way to say explicitly UNION DISTINCT vs UNION ALL; and no way to say SELECT ALL.

 

Andl - A New Database Language - andl.org
Quote from dandl on May 18, 2021, 2:12 pm
Quote from AntC on May 18, 2021, 11:53 am

"[SELECT DISTINCT] is precisely an awfulterriblehorrible way to write queries. First, select distinct incurs overhead, so it should only be used when necessary. More importantly, it not should not be used to remove duplicates unless the query-writer thoroughly understands the data and the query and why distinct values might appear."

(The guy has a rep of 1.1m, mostly obtained answering SQL queries. He's the same guy thinks "Natural Join is a disaster waiting to happen".)

Funny old industry we're in. I always (at least mentally) write SELECT DISTINCT at first; then assure myself the query necessarily returns distinct rows, before removing the DISTINCT as optimisation. And I thought I was following industry best practice.

There's that bad word again. Did you mean that you think this change will help the query planner to construct a lower cost plan? Why?

I think you're disagreeing with the guy's "select distinct incurs overhead ..."? You're saying the query planner knows when squishing out duplicates is unnecessary? Does it always? (Remembering that not always do schemas declare UNIQUE constraints, even when the schema designer know of them.) Did it always back in the '80's or whenever he learnt his craft?

You'd'a thought that the compiler would be smart enough to see when DISTINCT was superfluous, and optimise it away for itself, without me (or Mr. Brain-the-size-of-a-planet) needing to scratch our heads about it. You'd'a thought the compiler or language would have a way to say 'I'm expecting no duplicates here, please advise if I'm wrong' vs 'I understand there might be duplicates here, and I deliberately want you to squish them out' vs '... deliberately produce them all' vs 'I'm a newbie with no idea what/why duplicates are, reject the query if there's a risk of duplicates' -- with that last being the default action.

You should not assume that the compiler (responsible for parsing, syntax and type checking) and the query planner (responsible for choosing how to execute the query) are the same thing or even happen at the same time.

I wasn't distinguishing the stages that go on inside the black box. I (as a query writer/runner) can't intervene between those stages. Why bother to split that hair?

And last time I checked the query planner was perfectly capable of ignoring DISTINCT on an indexed field

Perhaps you'd care to tell the guy; because he thinks otherwise. [Edit: Ah, I see you did. And I see he disagrees. hmm] Though (some of) the advanced queries he's writing presumably he knows produce distinct rows without a suitable index being visible.

Instead we get the nonsense with UNION (nothing) being the default; no way to say explicitly UNION DISTINCT vs UNION ALL; and no way to say SELECT ALL.

 

 

Quote from AntC on May 18, 2021, 9:57 pm
Quote from dandl on May 18, 2021, 2:12 pm
Quote from AntC on May 18, 2021, 11:53 am

"[SELECT DISTINCT] is precisely an awfulterriblehorrible way to write queries. First, select distinct incurs overhead, so it should only be used when necessary. More importantly, it not should not be used to remove duplicates unless the query-writer thoroughly understands the data and the query and why distinct values might appear."

(The guy has a rep of 1.1m, mostly obtained answering SQL queries. He's the same guy thinks "Natural Join is a disaster waiting to happen".)

Funny old industry we're in. I always (at least mentally) write SELECT DISTINCT at first; then assure myself the query necessarily returns distinct rows, before removing the DISTINCT as optimisation. And I thought I was following industry best practice.

There's that bad word again. Did you mean that you think this change will help the query planner to construct a lower cost plan? Why?

I think you're disagreeing with the guy's "select distinct incurs overhead ..."? You're saying the query planner knows when squishing out duplicates is unnecessary? Does it always? (Remembering that not always do schemas declare UNIQUE constraints, even when the schema designer know of them.) Did it always back in the '80's or whenever he learnt his craft?

What I'm saying is that to talk about 'overhead' or 'optimisation' in the abstract without providing a specific instance is plain wrong. In principle at least the query planner knows more about your data than you do and (at the time of execution) can make better decisions than you can (at the time of writing the query). If you claim this is not so, you need to provide a specific instance and show why your intervention will produce a better result. BTW this is my 'safer' theme: I really shouldn't have to think about this, the compiler/planner should do it for me so I can think about important stuff.

You'd'a thought that the compiler would be smart enough to see when DISTINCT was superfluous, and optimise it away for itself, without me (or Mr. Brain-the-size-of-a-planet) needing to scratch our heads about it. You'd'a thought the compiler or language would have a way to say 'I'm expecting no duplicates here, please advise if I'm wrong' vs 'I understand there might be duplicates here, and I deliberately want you to squish them out' vs '... deliberately produce them all' vs 'I'm a newbie with no idea what/why duplicates are, reject the query if there's a risk of duplicates' -- with that last being the default action.

You should not assume that the compiler (responsible for parsing, syntax and type checking) and the query planner (responsible for choosing how to execute the query) are the same thing or even happen at the same time.

I wasn't distinguishing the stages that go on inside the black box. I (as a query writer/runner) can't intervene between those stages. Why bother to split that hair?

Because compilation happens 'now' and query planning may happen 'later'; your language suggested it a warning 'now', but the information for it may not be available until 'later'. I agree it would be useful for the compiler to tell you what it knows 'now', but you need another mechanism to deal with what is known 'later'.

And last time I checked the query planner was perfectly capable of ignoring DISTINCT on an indexed field

Perhaps you'd care to tell the guy; because he thinks otherwise. [Edit: Ah, I see you did. And I see he disagrees. hmm] Though (some of) the advanced queries he's writing presumably he knows produce distinct rows without a suitable index being visible.

This is someone with a high opinion of his own worth, but an inability to provide deep analysis. I don't expect an answer to my question (but he treats me like a moron, which is good).

As a rule, the query planner will

  • in a join on a UNIQUE INDEX field: use index lookup and infer DISTINCT
  • in a join on a very small table: read into memory, use brute force/build in-memory index (and may then infer DISTINCT)
  • in a join on large table with no index: build an index, usually on the smaller side of the join (and may then infer DISTINCT)

Some query engines do a sort-merge; some build an ISAM index for non-DISTINCT and a hash index for DISTINCT. The hash index is usually quicker. And some just do something stupid (at least they did when we were investigating this stuff for the Powerflex SQL backend). IMO you just can't give general advice on this topic that applies across all engines.

And of course DISTINCT in more complicated queries is (how shall I put it?) more complicated.

Instead we get the nonsense with UNION (nothing) being the default; no way to say explicitly UNION DISTINCT vs UNION ALL; and no way to say SELECT ALL.

 

 

Sometimes you'd think they'd never heard of the RA.

Andl - A New Database Language - andl.org
Quote from dandl on May 19, 2021, 1:04 am
Quote from AntC on May 18, 2021, 9:57 pm
Quote from dandl on May 18, 2021, 2:12 pm
Quote from AntC on May 18, 2021, 11:53 am

"[SELECT DISTINCT] is precisely an awfulterriblehorrible way to write queries. First, select distinct incurs overhead, so it should only be used when necessary. More importantly, it not should not be used to remove duplicates unless the query-writer thoroughly understands the data and the query and why distinct values might appear."

(The guy has a rep of 1.1m, mostly obtained answering SQL queries. He's the same guy thinks "Natural Join is a disaster waiting to happen".)

Funny old industry we're in. I always (at least mentally) write SELECT DISTINCT at first; then assure myself the query necessarily returns distinct rows, before removing the DISTINCT as optimisation. And I thought I was following industry best practice.

There's that bad word again. Did you mean that you think this change will help the query planner to construct a lower cost plan? Why?

I think you're disagreeing with the guy's "select distinct incurs overhead ..."? You're saying the query planner knows when squishing out duplicates is unnecessary? Does it always? (Remembering that not always do schemas declare UNIQUE constraints, even when the schema designer know of them.) Did it always back in the '80's or whenever he learnt his craft?

What I'm saying is that to talk about 'overhead' or 'optimisation' in the abstract without providing a specific instance is plain wrong. In principle at least the query planner knows more about your data than you do and (at the time of execution) can make better decisions than you can (at the time of writing the query). If you claim this is not so, you need to provide a specific instance and show why your intervention will produce a better result. BTW this is my 'safer' theme: I really shouldn't have to think about this, the compiler/planner should do it for me so I can think about important stuff.

Ok. (I'm not disagreeing with you, just trying to tease out how the overall system can be more user-friendly, and at what stage.)

  • At compile time (looking at the schema, before execution and before looking at relative volumes of records), the compiler should be able to look at the declared indexes/constraints and figure out whether a given SELECT might produce duplicate rows.
  • After compile time but before execution time, the schema might change: constraints and/or indexes might come and go. But in that case the system should really re-compile all queries (as it does with stored procedures), and (re-)issue warnings about potential duplicates.
  • But of course queries aren't in a visible repository somewhere, they're assembled dynamically by the calling program as strings; then passed to the engine for immediate compile/query-plan/execute. Where in that does something warn of potential duplicates? And who gets the message? (Not the end-user, who just wants to see the invoices with some peculiar characteristic.)
  • At query-plan time, the planner can see relative volumes of rows, and for the specific Customer/Product/date the query is asking for. It might choose to ignore indexes; it might therefore be compelled to laboriously squish out duplicates -- but (we presume it knows what it's doing) that's more efficient than going by some access path that's guaranteed to avoid duplicates.  Then it seems to me wots-'is-name can't say SELECT DISTINCT is always less efficient/or never more efficient.
  • In the case in point, there's SELECT by timestamp. I guess that means a very high likelihood each row is distinct, even though that's not an index. ... Until it isn't distinct ... let me see ... An event that occurs on the night daylight saving ends; with another event happening exactly an hour later, on true sun time.

You'd'a thought that the compiler would be smart enough to see when DISTINCT was superfluous, and optimise it away for itself, without me (or Mr. Brain-the-size-of-a-planet) needing to scratch our heads about it. You'd'a thought the compiler or language would have a way to say 'I'm expecting no duplicates here, please advise if I'm wrong' vs 'I understand there might be duplicates here, and I deliberately want you to squish them out' vs '... deliberately produce them all' vs 'I'm a newbie with no idea what/why duplicates are, reject the query if there's a risk of duplicates' -- with that last being the default action.

You should not assume that the compiler (responsible for parsing, syntax and type checking) and the query planner (responsible for choosing how to execute the query) are the same thing or even happen at the same time.

I wasn't distinguishing the stages that go on inside the black box. I (as a query writer/runner) can't intervene between those stages. Why bother to split that hair?

Because compilation happens 'now' and query planning may happen 'later'; your language suggested it a warning 'now', but the information for it may not be available until 'later'. I agree it would be useful for the compiler to tell you what it knows 'now', but you need another mechanism to deal with what is known 'later'.

So now we can distinguish what the query means (given the schema/constraints visible at compile-time) vs how to query-plan it. ('Means' as in does the derived predicate allow for duplicates?) And if the schema/constraints change in between, that potentially changes the 'meaning' of some queries. Did I say "funny old industry we're in"? And people want to bet their whole enterprises on this stuff?

And last time I checked the query planner was perfectly capable of ignoring DISTINCT on an indexed field

Perhaps you'd care to tell the guy; because he thinks otherwise. [Edit: Ah, I see you did. And I see he disagrees. hmm] Though (some of) the advanced queries he's writing presumably he knows produce distinct rows without a suitable index being visible.

This is someone with a high opinion of his own worth, but an inability to provide deep analysis. I don't expect an answer to my question (but he treats me like a moron, which is good).

As a rule, the query planner will

  • in a join on a UNIQUE INDEX field: use index lookup and infer DISTINCT
  • in a join on a very small table: read into memory, use brute force/build in-memory index (and may then infer DISTINCT)
  • in a join on large table with no index: build an index, usually on the smaller side of the join (and may then infer DISTINCT)

Some query engines do a sort-merge; some build an ISAM index for non-DISTINCT and a hash index for DISTINCT. The hash index is usually quicker. And some just do something stupid (at least they did when we were investigating this stuff for the Powerflex SQL backend). IMO you just can't give general advice on this topic that applies across all engines.

And of course DISTINCT in more complicated queries is (how shall I put it?) more complicated.

So DISTINCT is a way:

  • for the query-writer to say: I don't want duplicates, no matter what indexes/schema changes come and go; and
  • for the query-planner to say uh-oh, I'd better not execute the query this way, because I might produce duplicates, and then it would be inefficient to have to squish them out; but
  • not for the smart-alec to say: I know better than the query-planner (there will never be duplicate timestamp), I'll shave off a bit of squishing (which they probably won't).

 

Instead we get the nonsense with UNION (nothing) being the default; no way to say explicitly UNION DISTINCT vs UNION ALL; and no way to say SELECT ALL.

 

 

Sometimes you'd think they'd never heard of the RA.

Clearly the IBM engineers who threw together (what became) SQL didn't understand/hadn't heard a tenth of what Codd said. Clearly they thought they were producing a prototype/proof of concept that was never going to get used in anger. It was almost like one of those software designs that start as a joke or anti-pattern, only 'commercial management' have no sense of humour/irony.

Quote from AntC on May 19, 2021, 5:38 am
Quote from dandl on May 19, 2021, 1:04 am
Quote from AntC on May 18, 2021, 9:57 pm
Quote from dandl on May 18, 2021, 2:12 pm
Quote from AntC on May 18, 2021, 11:53 am

"[SELECT DISTINCT] is precisely an awfulterriblehorrible way to write queries. First, select distinct incurs overhead, so it should only be used when necessary. More importantly, it not should not be used to remove duplicates unless the query-writer thoroughly understands the data and the query and why distinct values might appear."

(The guy has a rep of 1.1m, mostly obtained answering SQL queries. He's the same guy thinks "Natural Join is a disaster waiting to happen".)

Funny old industry we're in. I always (at least mentally) write SELECT DISTINCT at first; then assure myself the query necessarily returns distinct rows, before removing the DISTINCT as optimisation. And I thought I was following industry best practice.

There's that bad word again. Did you mean that you think this change will help the query planner to construct a lower cost plan? Why?

I think you're disagreeing with the guy's "select distinct incurs overhead ..."? You're saying the query planner knows when squishing out duplicates is unnecessary? Does it always? (Remembering that not always do schemas declare UNIQUE constraints, even when the schema designer know of them.) Did it always back in the '80's or whenever he learnt his craft?

What I'm saying is that to talk about 'overhead' or 'optimisation' in the abstract without providing a specific instance is plain wrong. In principle at least the query planner knows more about your data than you do and (at the time of execution) can make better decisions than you can (at the time of writing the query). If you claim this is not so, you need to provide a specific instance and show why your intervention will produce a better result. BTW this is my 'safer' theme: I really shouldn't have to think about this, the compiler/planner should do it for me so I can think about important stuff.

Ok. (I'm not disagreeing with you, just trying to tease out how the overall system can be more user-friendly, and at what stage.)

  • At compile time (looking at the schema, before execution and before looking at relative volumes of records), the compiler should be able to look at the declared indexes/constraints and figure out whether a given SELECT might produce duplicate rows.

I guess so, but that introduces an ordering: if the dev writes the query and then the admin adds the index, you won't get that behaviour unless recompile.

  • After compile time but before execution time, the schema might change: constraints and/or indexes might come and go. But in that case the system should really re-compile all queries (as it does with stored procedures), and (re-)issue warnings about potential duplicates.

Is that universal? It raises the spectre of a minor schema change triggering hundreds of warnings. Does that happen?

  • But of course queries aren't in a visible repository somewhere, they're assembled dynamically by the calling program as strings; then passed to the engine for immediate compile/query-plan/execute. Where in that does something warn of potential duplicates? And who gets the message? (Not the end-user, who just wants to see the invoices with some peculiar characteristic.)
  • At query-plan time, the planner can see relative volumes of rows, and for the specific Customer/Product/date the query is asking for. It might choose to ignore indexes; it might therefore be compelled to laboriously squish out duplicates -- but (we presume it knows what it's doing) that's more efficient than going by some access path that's guaranteed to avoid duplicates.  Then it seems to me wots-'is-name can't say SELECT DISTINCT is always less efficient/or never more efficient.
  • In the case in point, there's SELECT by timestamp. I guess that means a very high likelihood each row is distinct, even though that's not an index. ... Until it isn't distinct ... let me see ... An event that occurs on the night daylight saving ends; with another event happening exactly an hour later, on true sun time.

Again, could be system dependent? I seem to recall one system that used UCT and guaranteed never to issue a duplicate/reversed timestamp during system uptime.

You'd'a thought that the compiler would be smart enough to see when DISTINCT was superfluous, and optimise it away for itself, without me (or Mr. Brain-the-size-of-a-planet) needing to scratch our heads about it. You'd'a thought the compiler or language would have a way to say 'I'm expecting no duplicates here, please advise if I'm wrong' vs 'I understand there might be duplicates here, and I deliberately want you to squish them out' vs '... deliberately produce them all' vs 'I'm a newbie with no idea what/why duplicates are, reject the query if there's a risk of duplicates' -- with that last being the default action.

You should not assume that the compiler (responsible for parsing, syntax and type checking) and the query planner (responsible for choosing how to execute the query) are the same thing or even happen at the same time.

I wasn't distinguishing the stages that go on inside the black box. I (as a query writer/runner) can't intervene between those stages. Why bother to split that hair?

Because compilation happens 'now' and query planning may happen 'later'; your language suggested it a warning 'now', but the information for it may not be available until 'later'. I agree it would be useful for the compiler to tell you what it knows 'now', but you need another mechanism to deal with what is known 'later'.

So now we can distinguish what the query means (given the schema/constraints visible at compile-time) vs how to query-plan it. ('Means' as in does the derived predicate allow for duplicates?) And if the schema/constraints change in between, that potentially changes the 'meaning' of some queries. Did I say "funny old industry we're in"? And people want to bet their whole enterprises on this stuff?

And last time I checked the query planner was perfectly capable of ignoring DISTINCT on an indexed field

Perhaps you'd care to tell the guy; because he thinks otherwise. [Edit: Ah, I see you did. And I see he disagrees. hmm] Though (some of) the advanced queries he's writing presumably he knows produce distinct rows without a suitable index being visible.

This is someone with a high opinion of his own worth, but an inability to provide deep analysis. I don't expect an answer to my question (but he treats me like a moron, which is good).

As a rule, the query planner will

  • in a join on a UNIQUE INDEX field: use index lookup and infer DISTINCT
  • in a join on a very small table: read into memory, use brute force/build in-memory index (and may then infer DISTINCT)
  • in a join on large table with no index: build an index, usually on the smaller side of the join (and may then infer DISTINCT)

Some query engines do a sort-merge; some build an ISAM index for non-DISTINCT and a hash index for DISTINCT. The hash index is usually quicker. And some just do something stupid (at least they did when we were investigating this stuff for the Powerflex SQL backend). IMO you just can't give general advice on this topic that applies across all engines.

And of course DISTINCT in more complicated queries is (how shall I put it?) more complicated.

So DISTINCT is a way:

  • for the query-writer to say: I don't want duplicates, no matter what indexes/schema changes come and go; and
  • for the query-planner to say uh-oh, I'd better not execute the query this way, because I might produce duplicates, and then it would be inefficient to have to squish them out; but
  • not for the smart-alec to say: I know better than the query-planner (there will never be duplicate timestamp), I'll shave off a bit of squishing (which they probably won't).

Yes, pretty much. We just had some really bad outcomes when we tried to outsmart the query planner.  SQL is the only thing I ever ran across where essentially the same query could run slower/faster by over 3 orders of magnitude. Our best ever was a query that went from 16 minutes down to ~300 msec, and passed exactly the same test suite. You need to do something the right way to avoid that happening, but broad brush advice generally doesn't cover it.

 

Instead we get the nonsense with UNION (nothing) being the default; no way to say explicitly UNION DISTINCT vs UNION ALL; and no way to say SELECT ALL.

 

 

Sometimes you'd think they'd never heard of the RA.

Clearly the IBM engineers who threw together (what became) SQL didn't understand/hadn't heard a tenth of what Codd said. Clearly they thought they were producing a prototype/proof of concept that was never going to get used in anger. It was almost like one of those software designs that start as a joke or anti-pattern, only 'commercial management' have no sense of humour/irony.

The worrying thing is there always seems to be enough time to do it just well enough to get paid, and never enough time to do it right.

Andl - A New Database Language - andl.org