The Forum for Discussion about The Third Manifesto and Related Matters

Please or Register to create posts and topics.

Key questions

12

(I'm looking at the alarming number of q's on StackOverflow in which the student is given a schema where every table has a column <entity>ID as well as there being a perfectly serviceable business key CustomerNumber, ProductNumber and/or SKU, etc.  I guess some tools allocate an ID column as soon as you start the schema for a table (Ms Access does this). Do students/lecturers think this is somehow best practice?)

 

  1. How often after you've analysed business requirements and normalised a schema to 6NF do you find more than one candidate key? (I'm not saying the eventual physical design will be in 6NF, but that's a useful discipline to make sure you understand requirements.)
  2. Even if there's more than one candidate key, how often does there seem one that's a stronger candidate? (On grounds, for example, that its value will persist unchanged for the lifetime of the denoted entity, or that it's a natural target to declare a Foreign Key constraint, or that although (say) SSN or VIN is just as unique and persistent, data entry staff won't necessarily know it at the point they need to create an Employee or Vehicle record.)
  3. How often is it a composite key where the components might change over the lifetime of the participating entities?

 

I'm wondering if TTM should climb down from the ivory tower a little, and acknowledge there is usually something old-timers would call a 'Primary Key'. (Could be dubbed PKey, where the P might stand for Persistent.)

DBMS artefacts like transition constraints [RM VSS 3] and update-through-view [RM Pre 14, 21] might be easier to implement if there was a single favoured key or declared PKey.

There's a habit in Data Warehousing of allocating a surrogate key even if the source schemas have perfectly serviceable business keys. In so far as I can get a rationale for that from DBAs, they seem to harbour Fear/Uncertainty/Doubt that any business key is stable. Of course once in a blue moon some business will find that its 6-digit Customer number no longer has enough digits (probably because they've segmented it into ranges, and one of the ranges is exhausted). Then I see DW schemas with both a DW surrogate and an <entity>ID -- which AFAICT is also a surrogate -- replicated throughout a star-schema.

Are whole-word binary keys really that more 'efficient' than alpha keys or composite keys? My experience is that's premature optimisation: what you might gain on 'efficient' lookup you lose in practice because every application has to lookup a Customer Number index first to obtain the CustomerID. (Users typically memorise shortish alpha-numeric Customer Numbers. 64-bit random numbers in Hex not so much.)

 

Quote from AntC on October 1, 2020, 6:46 am

(I'm looking at the alarming number of q's on StackOverflow in which the student is given a schema where every table has a column <entity>ID as well as there being a perfectly serviceable business key CustomerNumber, ProductNumber and/or SKU, etc.  I guess some tools allocate an ID column as soon as you start the schema for a table (Ms Access does this). Do students/lecturers think this is somehow best practice?)

Students usually don't know what is/isn't best practice. That comes from years of experience.

The same applies to lecturers. Sadly, the person who teaches the ubiquitous undergrad database course is rarely a person with both theoretical database background and industry experience. Usually, it's a grad student working on a PhD in "Internet-of-Things Blockchain Security Using Machine Learning" (or was it "Internet-of-Things Machine Learning Security Using Blockchain"? "Machine Learning Security in Blockchain Using Internet-of-Things"??) with no interest in databases, or a mathematics lecturer who needs to fill a teaching slot.

"If MSAccess auto-generates auto-numbered numeric primary whadjamacallits, it must be right. (I guess...)"

That said, there is a certain push in some practitioner circles to strictly define surrogate keys and never use natural keys...

... In so far as I can get a rationale for that from DBAs, they seem to harbour Fear/Uncertainty/Doubt that any business key is stable.

...and that's why.

If challenged, some horror story gets trotted out about the pain of changing all the customer/product/staff/whatever ID's across the 1,783 separate database systems and in all 1,938,394 mission-critical spreadsheets the last three times the company changed ownership, etc.

"Simmonds, you might have thought six-digit numeric keys for Widgets was okay when you were at Innotech, but now that you're here at TechInno you're going to do it right and put a 'T' in front of the first digit and a 'Q' at the end like God intended!"

Are whole-word binary keys really that more 'efficient' than alpha keys or composite keys? My experience is that's premature optimisation: what you might gain on 'efficient' lookup you lose in practice because every application has to lookup a Customer Number index first to obtain the CustomerID. (Users typically memorise shortish alpha-numeric Customer Numbers. 64-bit random numbers in Hex not so much.)

True.

I'm the forum administrator and lead developer of Rel. Email me at dave@armchair.mb.ca with the Subject 'TTM Forum'. Download Rel from https://reldb.org

So what is the received wisdom in relational circles on externally visible identifiers? I worked in the hospital environment, where patient UR nos were handed out with gay abandon, and it was routine to update them. A patient might give a slightly different name and get a new UR, or be confused with a relative -- all variations possible. A simple key change wouldn't fix it, because there were always linkages, some of them to external or paper systems that couldn't be updated. Other identifiers like supplier and employee IDs didn't have the same problems, or less often.

It seemed to me that changing a key that was used for linkage always caused grief, and externally visible keys by definition are linked (to the outside world). Logic then dictates to avoid using externally visible (business) keys for FK linkage and use surrogates instead, but that brings its own issues.

 

 

Andl - A New Database Language - andl.org

"I'm wondering if TTM should climb down from the ivory tower a little"

TTM voices no objection to designating one key as special and this was a deliberate decision.  From Chapter 2 of DTATRM:

"... the relational model has historically required ... that one [key] be chosen as the primary key, and the others are then alternate keys. While this discipline (i.e., choosing a primary key) might be useful in practice, we do not insist on it, because we regard the idea of making one key somehow 'more equal than others' as a psychological issue merely [52]."

[52] is a reference to Chris's article "The Primacy of Primary Keys: An Investigation", which was included in Relational Database Writings 1991-1994.

Hugh

Coauthor of The Third Manifesto and related books.
Quote from dandl on October 1, 2020, 11:32 am

So what is the received wisdom in relational circles on externally visible identifiers? I worked in the hospital environment, where patient UR nos were handed out with gay abandon, and it was routine to update them. A patient might give a slightly different name and get a new UR, or be confused with a relative -- all variations possible. A simple key change wouldn't fix it, because there were always linkages, some of them to external or paper systems that couldn't be updated. Other identifiers like supplier and employee IDs didn't have the same problems, or less often.

It seemed to me that changing a key that was used for linkage always caused grief, and externally visible keys by definition are linked (to the outside world). Logic then dictates to avoid using externally visible (business) keys for FK linkage and use surrogates instead, but that brings its own issues.

When I taught databases, I sometimes made this a subject of in-class debate (other times I assigned looking into it as an exercise), with the goal of getting students to recognise that there isn't a single right answer. Sometimes natural keys are preferable and sometimes surrogate keys are preferable, but (like many choices) neither is ideal and either is inevitably a compromise. (And, what are the requirements?)

If a student didn't raise it in the debate, I would point out that our useable natural keys are often someone else's surrogate keys. SINs, SSNs, VINs, ISBNs, ISSNs and so on are "their" publicised surrogate keys that become "our" natural keys.

I'm the forum administrator and lead developer of Rel. Email me at dave@armchair.mb.ca with the Subject 'TTM Forum'. Download Rel from https://reldb.org
Quote from dandl on October 1, 2020, 11:32 am

So what is the received wisdom in relational circles on externally visible identifiers? I worked in the hospital environment, where patient UR nos were handed out with gay abandon, and it was routine to update them. A patient might give a slightly different name and get a new UR, or be confused with a relative -- all variations possible.

I worked implementing a hospital system for half the country. Clerical staff went to extreme lengths to avoid allocating a new Patient ID -- even if the patient was unconscious and bleeding profusely. (FYI staff are legally entitled/obliged to rifle through the pockets of unconscious patients, looking for ID.) All New Zealanders are supposed to be given a National Health Index at birth and carry it to the grave. Getting the right NHI is critical for patient safety: if they're allergic to anaesthetic, or taking some powerful medication, we want to know before administering anything.

If all else fails, there's a series of 'temp' NHI numbers, using which will prod staff regularly to try to connect them to a proper record.

A simple key change wouldn't fix it, because there were always linkages, some of them to external or paper systems that couldn't be updated.

Yes sometimes a patient got mis-identified. (Those coming in with gunshot wounds or drug overdose are particularly anxious not to be identified.) So there was a key-correction/re-linkage process, about as ugly as you'd expect. And yes a re-linkage would occasionally still be wrong, needing double-ugly unpicking of linkages.

Yes there's links to GP's practices, consultants, physios/allied health, etc. They all need to be sure of the medical history for this body.

Other identifiers like supplier and employee IDs didn't have the same problems, or less often.

It seemed to me that changing a key that was used for linkage always caused grief, and externally visible keys by definition are linked (to the outside world). Logic then dictates to avoid using externally visible (business) keys for FK linkage and use surrogates instead, but that brings its own issues.

I'm not sure how a surrogate patient ID helps there: every medical interaction must be with a known patient. So the interaction (aka 'episode of care', which might have multiple events within it) gets a surrogate anyway. A temp ID is an NHI for all intents and purposes except that it's within a special range that nags to get properly linked.

 

Quote from Hugh on October 1, 2020, 11:49 am

"I'm wondering if TTM should climb down from the ivory tower a little"

TTM voices no objection to designating one key as special and this was a deliberate decision.  From Chapter 2 of DTATRM:

"... the relational model has historically required ... that one [key] be chosen as the primary key, and the others are then alternate keys. While this discipline (i.e., choosing a primary key) might be useful in practice, we do not insist on it, because we regard the idea of making one key somehow 'more equal than others' as a psychological issue merely [52]."

[52] is a reference to Chris's article "The Primacy of Primary Keys: An Investigation", which was included in Relational Database Writings 1991-1994.

 

Thanks Hugh, so the plans have been available on Alpha Centauri for 26 years. That paper in particular I would like to read -- I've searched quite hard for it. But I'm not willing to buy the whole tome.

I disagree this is "a psychological issue merely". By making that claim, TTM plays into the hands of those who want to splash surrogate keys everywhere. There are downsides to having multiple keys doing the same job (that is, _if_ it's the same job): update anomalies in particular, and that is surely a model issue.

I would expect TTM to VSS relvars have a PKey; with something to the effect that should be the one favoured to be persistent, used in transition constraints, etc. I note that when it comes to update-through-view, Codd thought this possible only wrt 'Primary key'.

I wonder how SIRA_PRISE's transition constraints are declared? Is there reliance on something that amounts to PKey? What if a relvar has multiple keys?

Quote from AntC on October 2, 2020, 3:30 am
Quote from dandl on October 1, 2020, 11:32 am

So what is the received wisdom in relational circles on externally visible identifiers? I worked in the hospital environment, where patient UR nos were handed out with gay abandon, and it was routine to update them. A patient might give a slightly different name and get a new UR, or be confused with a relative -- all variations possible.

I worked implementing a hospital system for half the country. Clerical staff went to extreme lengths to avoid allocating a new Patient ID -- even if the patient was unconscious and bleeding profusely. (FYI staff are legally entitled/obliged to rifle through the pockets of unconscious patients, looking for ID.) All New Zealanders are supposed to be given a National Health Index at birth and carry it to the grave. Getting the right NHI is critical for patient safety: if they're allergic to anaesthetic, or taking some powerful medication, we want to know before administering anything.

If all else fails, there's a series of 'temp' NHI numbers, using which will prod staff regularly to try to connect them to a proper record.

Nice if you've got it. The NZ NHI system started in 1993, presumably in an effort to resolve exactly the issues we were running into in the 1980s. Many countries (most?) don't have one and even if they do, there will be plenty of non-resident/non-documented/outright deceptive bodies to mess things up.

A simple key change wouldn't fix it, because there were always linkages, some of them to external or paper systems that couldn't be updated.

Yes sometimes a patient got mis-identified. (Those coming in with gunshot wounds or drug overdose are particularly anxious not to be identified.) So there was a key-correction/re-linkage process, about as ugly as you'd expect. And yes a re-linkage would occasionally still be wrong, needing double-ugly unpicking of linkages.

Yes there's links to GP's practices, consultants, physios/allied health, etc. They all need to be sure of the medical history for this body.

You wish. In this country most patients never get good linkage of records outside the walls of one hospital/GP/path lab. So the problem often resolves to: I've this this ID for a patient at this date

Other identifiers like supplier and employee IDs didn't have the same problems, or less often.

It seemed to me that changing a key that was used for linkage always caused grief, and externally visible keys by definition are linked (to the outside world). Logic then dictates to avoid using externally visible (business) keys for FK linkage and use surrogates instead, but that brings its own issues.

I'm not sure how a surrogate patient ID helps there: every medical interaction must be with a known patient. So the interaction (aka 'episode of care', which might have multiple events within it) gets a surrogate anyway. A temp ID is an NHI for all intents and purposes except that it's within a special range that nags to get properly linked.

If a patient had acquired multiple UR nos, the surrogate could provide linkage. But I'm not claiming we solved the problem; that's why I asked.

 

 

Andl - A New Database Language - andl.org
Quote from AntC on October 2, 2020, 3:53 am

I wonder how SIRA_PRISE's transition constraints are declared? Is there reliance on something that amounts to PKey? What if a relvar has multiple keys?

If a relvar has >1 key, then either :

  • one key is picked as "primary" for the purpose of that particular transition constraint (and the semantics won't be enforced in the case there's a change to the value of some other key)
  • the "same" constraint is declared multiple times, reformulated in terms of each of the applicable keys (this is the very first time I thought of this possibility myself, so I never thought this through and have no idea about feasibility or possible downsides, apart from the blatantly obvious "redundancy" downside)
  • there's a relvar somewhere in the design that is supposed to document stuff like "the country that was called 'Zaire' from 1970 to 1998 is the same as the one that was called 'Congo' from 1905 to 1970." and the constraint is formulated in such a way as to take note of that relvar (which assumes the user is properly updating that relvar too).  This is going to be ***very hard*** to do because of the transitive closures it is going to require, and which involve a temporal aspect by definition.  I don't even know if it is technically possible using the TTM algebra (in particular the TC/temporal combo, though I suspect it must be) and if so, its feasability (which I suspect to be on the low side).

***Having*** declarative transition constraints in the first place was enough for me.

 

Quote from Erwin on October 12, 2020, 5:37 am
Quote from AntC on October 2, 2020, 3:53 am

I wonder how SIRA_PRISE's transition constraints are declared? Is there reliance on something that amounts to PKey? What if a relvar has multiple keys?

If a relvar has >1 key, then either :

  • one key is picked as "primary" for the purpose of that particular transition constraint (and the semantics won't be enforced in the case there's a change to the value of some other key)
  • the "same" constraint is declared multiple times, reformulated in terms of each of the applicable keys (this is the very first time I thought of this possibility myself, so I never thought this through and have no idea about feasibility or possible downsides, apart from the blatantly obvious "redundancy" downside)

If there's multiple constraints, each taking a different key as 'Persistent', my guess is that it would be impossible to change any of the keys.

  • Suppose the S&P data model but relvar S having as keys both {{SNO}, {SNAME}} -- that is, Supplier names must be unique as well as Supplier numbers.
  • And suppose a transition constraint that STATUS if changed can only be increased. Or rather two constraints: one saying that if as-is-SNO is same as to-be-SNO then as-is-STATUS must be less-equal to-be-STATUS; and another with the same logic in terms of SNAME as PKey.
  • To prevent cheating the transition constraint by 'modifying a tuple' in one transaction to change 'its' SNO, then re-modifying in a second transaction the tuple with the modified SNO back to as-was-SNO with decreased STATUS, you'd need a transition constraint that no SNO can disappear. Same anti-cheating constraint for SNAME.
  • Now I think you're banjaxed: you can't persist a SNO and change 'its' SNAME; neither can you persist a SNAME and change 'its' SNO. Either of those attempts would make a SNO/SNAME disappear.
  • So after inserting a tuple, its initial SNO, SNAME remain in lock-step.

(?)

  • there's a relvar somewhere in the design that is supposed to document stuff like "the country that was called 'Zaire' from 1970 to 1998 is the same as the one that was called 'Congo' from 1905 to 1970." and the constraint is formulated in such a way as to take note of that relvar (which assumes the user is properly updating that relvar too).  This is going to be ***very hard*** to do because of the transitive closures it is going to require, and which involve a temporal aspect by definition.  I don't even know if it is technically possible using the TTM algebra (in particular the TC/temporal combo, though I suspect it must be) and if so, its feasability (which I suspect to be on the low side).

***Having*** declarative transition constraints in the first place was enough for me.

 

Oh yes, top marks for achieving that.

With the 'Congo'/'Zaire' case, presumably its code/abbreviation changed at the same time as its name? wikipedia is only giving me 'Republic of the Congo'/.cg/COG. both name and code changing at the same time would be indistinguishable from one country 'disappearing' and some other country appearing fresh.

Or does SIRA_PRISE look at the form of UPDATE ... : {CDomain = '.cg', CName = 'Congo'}? Which smells suspiciously like it's sidestepping RM Pre 21: we can translate that UPDATE into assignments; but having done that we can't enforce the transition constraints against deleting a country code.

12