Wednesday, 14 February 2018

What's the Single Thing that Improves Data Quality?

Is he 98 or 298? It all depends on the data type ....
When a new database is being designed there is often frustration mixed in with the excitement. Creating anything new is fun, but there is a certain amount of groundwork to get any project going.

This includes figuring out what information needs to be stored, which can be more complicated than it first appears. It is a careful process of going through the existing business processes, envisioning the new system, and deciding what is or is not needed. All now with an eye towards GDPR.

Yet making the right decisions about your data will make or break the success of the system.

Having decided what data needs to be stored, there are more questions. What range of values do you expect to hold? Will you always know the value when a new record is entered? Could the value ever be negative? What is the largest value the file might hold? It is at this point that the frustration invariably turns to annoyance. "It doesn’t matter", comes back the reply. "We will think about that later" is another favourite. "After all, we’re not short of disk space, so what’s the problem?"

Yet each one of those unmade decisions is potentially a bug that will strike at the most inconvenient time in the system’s future; I can guarantee it. Because these decisions are not about running out of disk space, but about data quality. The data type for each field is the most basic and most valuable constraint in a database. It is what separates relational databases from other ways of storing data, such as spreadsheets.

The data that can be entered into a particular field is constrained by a number of things:
  • The data type. This ensures that, for example, a date cannot be entered into a field designed to hold an integer.
  • Primary and foreign keys. As well as providing relationships between tables, they also check data as it is being entered. A value entered into a foreign key field must match the value in another table’s primary key.
  • NULL or NOT NULL. Fields that allow NULLs risk introducing hard-to-debug issues when data is added to the system. It is often necessary to allow NULLs, because the information may not always be available when records are added. However, fields that allow NULL can produce query results that are not the expected results.
  • Allowable ranges. Whilst some data types automatically limit the range that can be added, others do not. For example, a SQL Server tinyint cannot take values greater than 255, or values that are negative. This might make a tinyint a good data type for a field holding age values. A SQL Server smallint, on the other hand, may be negative or positive, and allows values between -32,768 and +32,767, making is a poor choice for an age field, as incorrect data could easily be entered.  
There are very many more examples of how data types and constraints help keep data quality high. Whilst I fully admit that the age example is not ideal (it would be far better to hold date of birth rather than age), it is easily understood and illustrates the point.

Of course, things can be changed in a database system. You can add constraints later, or change constraints you got wrong. But everything comes at a price. And sometimes that price is reliability because every change in a system has consequences - sometimes unexpected ones.

So if you are getting impatient with never ending questions about your data, think of it as an investment in keeping the quality of your data high for years to come.

No comments:

Post a Comment