Data quality refers to the condition of a set of data. There are many definitions of data quality, but data is generally considered high quality if it is fit for its intended uses in operations, decision making, and planning. Alternatively, data is deemed of high quality if it correctly represents the real-world construct to which it refers. Beyond these definitions, as data volume increases, the question of internal consistency within data becomes significant, regardless of fitness for use for any particular external purpose.
People’s views on data quality can often be in disagreement, even when discussing the same set of data used for the same purpose. When a general definition of quality is applied, data quality can be defined as the degree to which a set of characteristics of data fulfills requirements. Examples of characteristics are completeness, validity, accuracy, consistency, availability, and timeliness. Requirements are defined as needs or expectations that are stated, generally implied, or obligatory.
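The “degree to which characteristics fulfill requirements” can be made concrete by scoring each characteristic as a fraction. A minimal sketch, in which the sample records and the email validity rule are illustrative assumptions, not part of any standard:

```python
import re

# Illustrative records: one complete and valid, one missing a value,
# one present but invalid.
records = [
    {"name": "Alice", "email": "alice@example.com"},
    {"name": "Bob", "email": None},              # incomplete
    {"name": "Carol", "email": "not-an-email"},  # invalid
]

def completeness(rows, field):
    """Fraction of rows where the field is present (non-null)."""
    return sum(1 for r in rows if r.get(field) is not None) / len(rows)

def validity(rows, field, rule):
    """Fraction of present values that satisfy the rule."""
    present = [r[field] for r in rows if r.get(field) is not None]
    return sum(1 for v in present if rule(v)) / len(present)

# A deliberately simple rule for the example, not a full RFC check.
email_rule = lambda v: re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", v) is not None

print(completeness(records, "email"))            # 2 of 3 present
print(validity(records, "email", email_rule))    # 1 of 2 present values valid
```

Each score expresses how far the data fulfills one requirement; a quality report is then a vector of such scores rather than a single pass/fail verdict.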
A considerable amount of data quality research involves investigating and describing various categories of desirable attributes (or dimensions) of data. These lists commonly include accuracy, correctness, currency, completeness, and relevance. In practice, data quality is a concern for professionals involved with a wide range of information systems, from data warehousing and business intelligence to customer relationship management and supply chain management. Problems with data quality do not arise only from incorrect data; inconsistent data is a problem as well. Eliminating data shadow systems and centralizing data in a warehouse is one of the initiatives a company can take to ensure data consistency.
The market is going some way toward providing data quality assurance. Most data quality tools offer a series of functions for improving data, which may include some or all of the following:
- Data profiling – initially assessing the data to understand its quality challenges
- Data standardization – a business rules engine that ensures that data conforms to quality rules
- Geocoding – for name and address data; corrects data to U.S. and worldwide postal standards
- Matching or linking – a way to compare data so that similar, but slightly different, records can be aligned. Matching may use “fuzzy logic” to find duplicates in the data, recognizing, for example, that ‘Bob’ and ‘Robert’ may be the same individual. It may also be able to manage ‘householding’ – finding links between spouses at the same address, for example.
- Monitoring – keeping track of data quality over time and reporting variations in the quality of data. Software can also auto-correct the variations based on pre-defined business rules.
- Batch and real time – once the data is initially cleansed (batch), companies often want to build these processes into enterprise applications to keep it clean.
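The first item above, data profiling, is typically a simple descriptive pass over each column. A minimal sketch, where the column names and sample data are illustrative assumptions:

```python
# Profile one column: row count, missing values, distinct values,
# and how many values parse as plain digits.
rows = [
    {"age": "34", "country": "US"},
    {"age": "", "country": "US"},
    {"age": "x7", "country": "DE"},
]

def profile(rows, column):
    values = [r[column] for r in rows]
    return {
        "count": len(values),
        "missing": sum(1 for v in values if not v),
        "distinct": len(set(values)),
        "numeric": sum(1 for v in values if v.isdigit()),
    }

print(profile(rows, "age"))
# → {'count': 3, 'missing': 1, 'distinct': 3, 'numeric': 1}
```

Even a profile this crude surfaces the quality challenges the later steps must address: a missing value and a non-numeric age in a column that should be numeric.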
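Data standardization, described above as a business rules engine, can be sketched as a table of per-field rules applied in order. The field names and the rules themselves are illustrative assumptions:

```python
# Map each field to an ordered list of normalization rules.
RULES = {
    "country": [str.strip, str.upper],
    "phone": [lambda v: "".join(ch for ch in v if ch.isdigit())],
}

def standardize(record, rules):
    """Return a copy of the record with each field's rules applied in order."""
    out = dict(record)
    for field, fns in rules.items():
        if field in out:
            for fn in fns:
                out[field] = fn(out[field])
    return out

raw = {"country": " us ", "phone": "(555) 123-4567"}
print(standardize(raw, RULES))
# → {'country': 'US', 'phone': '5551234567'}
```

Keeping the rules as data rather than hard-coded logic is what makes this an "engine": business users can extend the rule table without touching the pipeline code.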
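The matching item above – finding that ‘Bob’ and ‘Robert’ may be the same individual – can be sketched with a nickname table plus a string-similarity score. The nickname mapping and the threshold are illustrative assumptions, not a standard:

```python
from difflib import SequenceMatcher

# Small illustrative nickname table; real matchers use much larger ones.
NICKNAMES = {"bob": "robert", "bill": "william", "peggy": "margaret"}

def canonical(name):
    """Lowercase the name and expand known nicknames token by token."""
    tokens = name.strip().lower().split()
    return " ".join(NICKNAMES.get(t, t) for t in tokens)

def similarity(a, b):
    return SequenceMatcher(None, canonical(a), canonical(b)).ratio()

def is_probable_duplicate(rec_a, rec_b, threshold=0.85):
    # Same address plus high name similarity suggests a duplicate,
    # or a 'household' link between people at one address.
    same_address = rec_a["address"].lower() == rec_b["address"].lower()
    return same_address and similarity(rec_a["name"], rec_b["name"]) >= threshold

a = {"name": "Bob Smith", "address": "12 Oak St"}
b = {"name": "Robert Smith", "address": "12 Oak St"}
print(is_probable_duplicate(a, b))  # → True
```

The threshold is the usual tuning knob: raising it reduces false merges at the cost of missed duplicates, which is why matching tools generally let users review borderline pairs.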