Adding a new attribute to an existing relation
Quote from Efrit on November 19, 2020, 8:01 pmHello,
I recently encountered an issue in my job: we had to add an attribute attr1 to an existing relation t1. However, this attribute is relevant only for a few tuples of the relation, which means that a lot of the entities represented by this relation do not have this attribute at all. From there, we had 3 options:
- Add the column attr1 directly to the table t1. This seemed the more straightforward, but it would have meant that there would be a lot of NULLs for this column, which are always hard to deal with correctly.
- The second one is to add a separate table t2 that would have two columns: the foreign key f1 referring to the table t1, and the attr1. The downside is that it complexifies a bit the schema. Also, would that mean that we have to add table each time we encounter a similar issue?
- Add a parameterized table t3. It would have 3 columns: a foreign key f1 referring to the table t1, a column name_attr and a column value_attr. The idea is to have in the name_attr column the name of the attribute we want to add to an entity of t1, and the column value_attr would have the value of this attribute for this entity.
I think that the solution 1 is bad as it leads to a lot of NULLs, and that conceptually it doesn’t capture the characteristic of the entities we want to represent. The second one came from myself, I thought about based on what I know (or think I know) about relational model. But the idea of adding another table each time we have to add data like this in the schema seemed to scare people about how it can complexifies the schema and the application code. I didn’t really understand this argument, maybe I am missing something here? The third one is the most interesting for me, as I never thought about this kind of table. I think it’s interesting, but the first limitation I see is that the type of the value would certainly be constrained to be strings, and that the application code would have then to interpret the data of this column according to the data of the name_attr. But then since the vast majority of our data is strings…
What are you the pros and cons of each solution? Is there a better one in general there? At least in the relational model?
Hello,
I recently encountered an issue in my job: we had to add an attribute attr1 to an existing relation t1. However, this attribute is relevant only for a few tuples of the relation, which means that a lot of the entities represented by this relation do not have this attribute at all. From there, we had 3 options:
- Add the column attr1 directly to the table t1. This seemed the more straightforward, but it would have meant that there would be a lot of NULLs for this column, which are always hard to deal with correctly.
- The second one is to add a separate table t2 that would have two columns: the foreign key f1 referring to the table t1, and the attr1. The downside is that it complexifies a bit the schema. Also, would that mean that we have to add table each time we encounter a similar issue?
- Add a parameterized table t3. It would have 3 columns: a foreign key f1 referring to the table t1, a column name_attr and a column value_attr. The idea is to have in the name_attr column the name of the attribute we want to add to an entity of t1, and the column value_attr would have the value of this attribute for this entity.
I think that the solution 1 is bad as it leads to a lot of NULLs, and that conceptually it doesn’t capture the characteristic of the entities we want to represent. The second one came from myself, I thought about based on what I know (or think I know) about relational model. But the idea of adding another table each time we have to add data like this in the schema seemed to scare people about how it can complexifies the schema and the application code. I didn’t really understand this argument, maybe I am missing something here? The third one is the most interesting for me, as I never thought about this kind of table. I think it’s interesting, but the first limitation I see is that the type of the value would certainly be constrained to be strings, and that the application code would have then to interpret the data of this column according to the data of the name_attr. But then since the vast majority of our data is strings…
What are you the pros and cons of each solution? Is there a better one in general there? At least in the relational model?
Quote from Dave Voorhis on November 19, 2020, 8:19 pmQuote from Efrit on November 19, 2020, 8:01 pmHello,
I recently encountered an issue in my job: we had to add an attribute attr1 to an existing relation t1. However, this attribute is relevant only for a few tuples of the relation, which means that a lot of the entities represented by this relation do not have this attribute at all. From there, we had 3 options:
- Add the column attr1 directly to the table t1. This seemed the more straightforward, but it would have meant that there would be a lot of NULLs for this column, which are always hard to deal with correctly.
- The second one is to add a separate table t2 that would have two columns: the foreign key f1 referring to the table t1, and the attr1. The downside is that it complexifies a bit the schema. Also, would that mean that we have to add table each time we encounter a similar issue?
- Add a parameterized table t3. It would have 3 columns: a foreign key f1 referring to the table t1, a column name_attr and a column value_attr. The idea is to have in the name_attr column the name of the attribute we want to add to an entity of t1, and the column value_attr would have the value of this attribute for this entity.
- Flawed for the reason you've stated, but it's SQL and SQL has NULLs, so maybe it's the best solution.
- The (arguably slight) schema complexity increase -- and JOIN cost in queries -- may be worth it to gain freedom from NULL awkwardnesses. Or maybe not.
- This is (at least bordering on) an approach often described as "entity-attribute-value" or EAV, which is sometimes chosen for flexibility but almost never warranted except when it is.
I generally keep out of database design discussions both here and elsewhere, because they start out pleasant but often devolve into invective and insults.
Consider the specific requirements, examine all the specific trade-offs, and note that (a) whatever approach you choose will probably be fine, and (b) if it turns out not to be fine, you can fix it. It's a database, so it's not like the data is going to go away. The worst that will happen is it will run slowly, and maybe some of the queries you need will be more difficult to write with some approaches than with other approaches. If that's a problem, change it so it goes faster and/or is easier to query.
Quote from Efrit on November 19, 2020, 8:01 pmHello,
I recently encountered an issue in my job: we had to add an attribute attr1 to an existing relation t1. However, this attribute is relevant only for a few tuples of the relation, which means that a lot of the entities represented by this relation do not have this attribute at all. From there, we had 3 options:
- Add the column attr1 directly to the table t1. This seemed the more straightforward, but it would have meant that there would be a lot of NULLs for this column, which are always hard to deal with correctly.
- The second one is to add a separate table t2 that would have two columns: the foreign key f1 referring to the table t1, and the attr1. The downside is that it complexifies a bit the schema. Also, would that mean that we have to add table each time we encounter a similar issue?
- Add a parameterized table t3. It would have 3 columns: a foreign key f1 referring to the table t1, a column name_attr and a column value_attr. The idea is to have in the name_attr column the name of the attribute we want to add to an entity of t1, and the column value_attr would have the value of this attribute for this entity.
- Flawed for the reason you've stated, but it's SQL and SQL has NULLs, so maybe it's the best solution.
- The (arguably slight) schema complexity increase -- and JOIN cost in queries -- may be worth it to gain freedom from NULL awkwardnesses. Or maybe not.
- This is (at least bordering on) an approach often described as "entity-attribute-value" or EAV, which is sometimes chosen for flexibility but almost never warranted except when it is.
I generally keep out of database design discussions both here and elsewhere, because they start out pleasant but often devolve into invective and insults.
Consider the specific requirements, examine all the specific trade-offs, and note that (a) whatever approach you choose will probably be fine, and (b) if it turns out not to be fine, you can fix it. It's a database, so it's not like the data is going to go away. The worst that will happen is it will run slowly, and maybe some of the queries you need will be more difficult to write with some approaches than with other approaches. If that's a problem, change it so it goes faster and/or is easier to query.
Quote from dandl on November 19, 2020, 10:16 pmI generally keep out of database design discussions both here and elsewhere, because they start out pleasant but often devolve into invective and insults.
Just so. I like to think of database design as akin to accounting. It represents an opinion of how to best represent a set of real world facts using a specific set of IT tools. Some opinions are backed up by better theory and/or experience; some are more just like quoting from a religion. It's the religious bit that tends to get nasty.
Consider the specific requirements, examine all the specific trade-offs, and note that (a) whatever approach you choose will probably be fine, and (b) if it turns out not to be fine, you can fix it. It's a database, so it's not like the data is going to go away. The worst that will happen is it will run slowly, and maybe some of the queries you need will be more difficult to write with some approaches than with other approaches. If that's a problem, change it so it goes faster and/or is easier to query.
Yes, but I would add one thing. Design is always about choices and trade-offs, so you really better cast your net wide and record what you chose and why. Most of us don't get it right the first time, and if it doesn't turn out right, you need to know why and what to try next.
I generally keep out of database design discussions both here and elsewhere, because they start out pleasant but often devolve into invective and insults.
Just so. I like to think of database design as akin to accounting. It represents an opinion of how to best represent a set of real world facts using a specific set of IT tools. Some opinions are backed up by better theory and/or experience; some are more just like quoting from a religion. It's the religious bit that tends to get nasty.
Consider the specific requirements, examine all the specific trade-offs, and note that (a) whatever approach you choose will probably be fine, and (b) if it turns out not to be fine, you can fix it. It's a database, so it's not like the data is going to go away. The worst that will happen is it will run slowly, and maybe some of the queries you need will be more difficult to write with some approaches than with other approaches. If that's a problem, change it so it goes faster and/or is easier to query.
Yes, but I would add one thing. Design is always about choices and trade-offs, so you really better cast your net wide and record what you chose and why. Most of us don't get it right the first time, and if it doesn't turn out right, you need to know why and what to try next.
Quote from Efrit on November 20, 2020, 6:52 pmThanks for the detailed replies.
We went for the second solution finally. Related to the specific requirements, the problem is that, in this context, it’s kind of hard to determine them as I feel that we have to go fast pretty much all the time. But what I have noticed is that, in our backend talking to the database (a Django REST Framework application), we had several issues with None values that were coming from NULL values. So I think it’s better to avoid them as much as possible. For the solution 3, the flexibility isn’t really needed, as we can stop the tool we are developing nearly as much as we want (it an internal migration software). Also I don’t really know how the queries would look like through the ORM. I would like to try what it looks like in the case of my job, but there are a lot of things to do first. ^^ Therefore I advocated for the second solution and that’s the one that has been chosen, we’ll see if it resists over the time.
Thanks again.
Thanks for the detailed replies.
We went for the second solution finally. Related to the specific requirements, the problem is that, in this context, it’s kind of hard to determine them as I feel that we have to go fast pretty much all the time. But what I have noticed is that, in our backend talking to the database (a Django REST Framework application), we had several issues with None values that were coming from NULL values. So I think it’s better to avoid them as much as possible. For the solution 3, the flexibility isn’t really needed, as we can stop the tool we are developing nearly as much as we want (it an internal migration software). Also I don’t really know how the queries would look like through the ORM. I would like to try what it looks like in the case of my job, but there are a lot of things to do first. ^^ Therefore I advocated for the second solution and that’s the one that has been chosen, we’ll see if it resists over the time.
Thanks again.