Data Quality

What is data quality?

Data quality is a measure of the value of data to the business, so it depends on the task the data are being used for. Data that are high quality in one context can be low quality in another. However, some dimensions are common across contexts, and these are detailed here. Also discussed are the relation of Data Integrity to Data Quality, the kinds of Data Errors, and how to improve Data Quality by decreasing the number of errors.

Dimensions

Data Accuracy

Data accuracy is a measure of how well the values in the database match the real-world values they represent. An example of inaccurate data would be an Opportunity that lists the wrong value for a contract.
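
As a rough illustration, accuracy can be estimated as the share of stored values that match a trusted reference. The sketch below assumes a hypothetical reference source (for example, signed contracts) keyed by opportunity ID. The names `accuracy_rate`, `stored`, and `reference`, and all the sample values, are illustrative and not part of any existing system.

```python
def accuracy_rate(stored: dict, reference: dict) -> float:
    """Share of checkable records whose stored value matches the reference."""
    # Only records present in both sources can be checked.
    shared_keys = stored.keys() & reference.keys()
    if not shared_keys:
        raise ValueError("no overlapping records to compare")
    matches = sum(1 for key in shared_keys if stored[key] == reference[key])
    return matches / len(shared_keys)


# Hypothetical data: contract values on Opportunities vs. the signed contracts.
stored = {"opp-1": 10_000, "opp-2": 25_000, "opp-3": 7_500}
reference = {"opp-1": 10_000, "opp-2": 52_000, "opp-3": 7_500}

rate = accuracy_rate(stored, reference)  # two of the three values match
```

In practice the hard part is obtaining the reference values; the comparison itself is simple once they exist.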

Information Completeness

Information completeness refers to the extent to which data are of sufficient breadth, depth, and scope for the task at hand. Breadth, depth, and scope can be measured in absolute terms, but the same measurements will indicate different levels of completeness depending on how the data are used.
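
To make the dependence on usage concrete, here is a minimal sketch that scores the same records against two different sets of required fields. The function name `completeness`, the sample accounts, and the two hypothetical tasks (billing and segmentation) are assumptions made for illustration. Treating `None` and the empty string as "missing" is also an assumption.

```python
def completeness(records: list, required_fields: list) -> float:
    """Fraction of required field slots that hold a non-empty value."""
    if not records or not required_fields:
        raise ValueError("need at least one record and one required field")
    total = len(records) * len(required_fields)
    filled = sum(
        1
        for record in records
        for field in required_fields
        # Assumption: None and "" both count as missing.
        if record.get(field) not in (None, "")
    )
    return filled / total


# Hypothetical account records.
accounts = [
    {"name": "Acme", "industry": "Retail", "employees": 120},
    {"name": "Globex", "industry": "", "employees": None},
]

# The same data, measured against the needs of two different tasks.
billing_score = completeness(accounts, ["name"])
segmentation_score = completeness(accounts, ["name", "industry", "employees"])
```

The records do not change between the two calls; only the task's requirements do, which is why the scores differ.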

Time Dimensions

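The time-related dimensions of data quality have three components: currency (how recently the data were updated), volatility (how frequently the values change), and timeliness (whether the data are current enough for the task at hand).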
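
One way to read these components, offered as an assumption and not as a definition from this page, is to treat currency as the age of a record and timeliness as whether that age is acceptable for a given task. Volatility then informs how short the acceptable age should be. The function names, dates, and thresholds below are hypothetical.

```python
from datetime import datetime, timedelta


def currency(last_updated: datetime, now: datetime) -> timedelta:
    """Age of the record: time elapsed since it was last updated."""
    return now - last_updated


def is_timely(last_updated: datetime, now: datetime, max_age: timedelta) -> bool:
    """True if the record is fresh enough for a task that tolerates max_age."""
    return currency(last_updated, now) <= max_age


# Hypothetical record that was last updated one week before "now".
now = datetime(2024, 1, 31)
last_updated = datetime(2024, 1, 24)
age = currency(last_updated, now)

# A highly volatile field would call for a smaller max_age.
ok_for_monthly_report = is_timely(last_updated, now, max_age=timedelta(days=30))
ok_for_daily_dashboard = is_timely(last_updated, now, max_age=timedelta(days=1))
```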

Data Consistency

Data consistency, also known as coherence and validity, is a measure of whether the data violate or adhere to semantic rules. That is, given a set of allowed values or ranges for a particular field, do the data adhere to them? An example of this would be Turn Around Time. Common values would be 0 to 365 days: we would expect some amount of time to pass before a result is returned (> 0), but more than a year seems unreasonable. Zero is an indication of no value for that field. With more experience we might realize that 0 to 180 days is a more reasonable range.
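
The Turn Around Time rule can be sketched as a simple range check. This is one possible encoding of the rule described above, not an existing implementation: the function name `check_turn_around_time` and the three labels are hypothetical, and treating `None` the same as zero is an added assumption.

```python
def check_turn_around_time(days, max_days=365):
    """Classify a Turn Around Time value against the range rule."""
    # Zero indicates no value for the field; None is assumed to mean the same.
    if days is None or days == 0:
        return "missing"
    # Some time must pass (> 0), but no more than max_days.
    if 0 < days <= max_days:
        return "consistent"
    return "violation"


values = [12, 0, 400, -3, 365]
labels = [check_turn_around_time(v) for v in values]

# Tightening the rule to 180 days, as experience might suggest.
stricter = check_turn_around_time(200, max_days=180)
```

Keeping the upper bound as a parameter makes it easy to tighten the rule later without touching the check itself.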

Data Uniqueness

Data uniqueness, also known as data deduplication, is a measure of whether or not real-world entities are represented multiple times in the same dataset. This could likewise apply to data features within a database entry. Uniqueness also has implications for reconciling data across databases (e.g. a curated products database vs. an ordered products database).
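
A minimal sketch of duplicate detection follows, assuming that two records refer to the same real-world entity when their names match after lowercasing and collapsing whitespace. That matching rule is a deliberate simplification, and the function names and sample products are hypothetical.

```python
from collections import defaultdict


def normalize(name: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(name.lower().split())


def find_duplicates(records: list, key_field: str) -> dict:
    """Group records by normalized key; keep only groups with more than one record."""
    groups = defaultdict(list)
    for record in records:
        groups[normalize(record[key_field])].append(record)
    return {key: group for key, group in groups.items() if len(group) > 1}


def uniqueness(records: list, key_field: str) -> float:
    """Ratio of distinct entities to records; 1.0 means no duplicates."""
    distinct = {normalize(record[key_field]) for record in records}
    return len(distinct) / len(records)


# Hypothetical product records with one near-identical pair.
products = [
    {"id": 1, "name": "Widget Pro"},
    {"id": 2, "name": "widget  pro"},
    {"id": 3, "name": "Gadget"},
]

duplicates = find_duplicates(products, "name")
score = uniqueness(products, "name")
```

Real deduplication usually needs fuzzier matching than this, but the grouping step stays the same once a matching key is chosen.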

Other dimensions may include precision, accessibility, credibility, traceability, and confidentiality. I have chosen to exclude these dimensions because they are either sufficiently covered by the other dimensions (precision and credibility) or not particularly relevant to the broader quality conversation (accessibility, traceability, and confidentiality).

Data Integrity

Data Integrity is the trustworthiness of the data, typically built upon Consistency and Accuracy. Therefore, Data Integrity is a result of Data Quality. Data with a high degree of integrity can be of low quality if it’s not suited for the task at hand and does not provide value to the business.

Data Errors

Errors occur, by definition, when data is inaccurate, incomplete, out of date, inconsistent, or duplicated.

Data Quality Improvement

Data Quality improvement is separated into three categories: Prevention, Detection, and Repair of errors. Data errors are different for each of the above data quality dimensions, but in all cases they can be addressed using these three techniques.
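
The three categories can be sketched against a single rule, reusing the Turn Around Time range from the consistency example. This is an illustration under stated assumptions, not a prescribed process: the in-memory `table`, the function names, and the idea of repairing from a dictionary of trusted corrections are all hypothetical.

```python
MAX_DAYS = 365  # upper bound taken from the Turn Around Time example


def is_valid(days) -> bool:
    """The rule itself: a positive whole number of days, at most MAX_DAYS."""
    return isinstance(days, int) and 0 < days <= MAX_DAYS


def insert_record(table: list, record: dict) -> None:
    """Prevention: reject a bad value before it is stored."""
    if not is_valid(record["turn_around_days"]):
        raise ValueError(f"invalid turn_around_days: {record['turn_around_days']!r}")
    table.append(record)


def detect_errors(table: list) -> list:
    """Detection: return the indexes of stored rows that break the rule."""
    return [i for i, row in enumerate(table) if not is_valid(row["turn_around_days"])]


def repair_errors(table: list, corrections: dict) -> None:
    """Repair: overwrite bad values with valid corrections from a trusted source."""
    for index, value in corrections.items():
        if is_valid(value):
            table[index]["turn_around_days"] = value


# Hypothetical legacy data that was loaded without any checks.
legacy = [{"id": 1, "turn_around_days": 14}, {"id": 2, "turn_around_days": 900}]

bad = detect_errors(legacy)           # row at index 1 breaks the rule
repair_errors(legacy, {1: 90})        # assumed correction from a trusted source
remaining = detect_errors(legacy)

try:
    insert_record(legacy, {"id": 3, "turn_around_days": -5})
    rejected = False
except ValueError:
    rejected = True
```

Sharing one `is_valid` rule across all three steps keeps prevention, detection, and repair in agreement about what counts as an error.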