This question bank covers the data quality dimensions defined by the Data Management Association UK (DAMA) and outlined in The Government Data Quality Framework. The dimensions and definitions are the same as those used in our Administrative Data Quality Framework (ADQF). The dimensions covered are:

  • accuracy

  • validity

  • completeness

  • uniqueness

  • consistency

  • timeliness

We have used these dimensions because they were developed by experts in data quality to assess the fitness for purpose of data. Identifying which dimensions matter most to you will help you decide how fit for purpose the data are for your needs.

In future publications of this question bank, we intend to include questions based on relevant selected principles from the European Statistics Code of Practice. The principles covered will be:

  • relevance: coverage, content, purpose and collection

  • accessibility and clarity: accessing the data, data format, availability of supporting information, quality and sufficiency of metadata, illustrations and accompanying advice

Before going into the questions for each data quality dimension, we have provided some general questions, which you can use to ensure that you have a fundamental understanding of the data and their quality.


Questions to ask to gain insights into the data’s quality in general

Q1 How are the data from this dataset collected? For example, through public contact with services over the phone, registration forms, etc.
Q2 What organisation(s) collects the data?
Q3 Are there different organisations which collect different data or variables in the dataset? For example, where one organisation is responsible for collecting income-related data, and another organisation is responsible for collecting demographic data, and these data are combined to create one composite dataset.
Q4 If so, which data are collected by which organisation?
Q5 Are the data collected differently?
Q6 Does this have an impact on the quality?
Q7 How do suppliers quality assure the data?
Q8 Are there any known quality issues?
Q9 What thresholds have the suppliers put in place regarding the data’s quality? For example, an acceptable number of duplicate records, or an acceptable amount of missing data.
Q10 How is the quality for this dataset documented?
Q11 Are there any supplementary documents related to the dataset that can be shared? For example, a data dictionary, a metadata list.
Q12 Are there training manuals related to the work that can be shared? For example, for coding, updating or maintaining the dataset.


Accuracy and Validity


Accuracy and validity definition

Administrative data accuracy refers to how well the data match reality: do the data capture what you are trying to measure?

Validity is defined as the extent to which the data conform to the expected format, type, and range. For example, an email address must have an ‘@’ symbol.
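As an illustration only, the sketch below shows how format, type and range checks might be expressed in Python. The variable names (email, age), the age range of 0 to 120 and the sample records are assumptions made for this example; they are not taken from any framework or supplier specification.

```python
def is_valid_email(value):
    """Format check: a string with exactly one '@' and text on both sides."""
    if not isinstance(value, str):
        return False
    local, sep, domain = value.partition("@")
    return bool(local) and sep == "@" and bool(domain) and "@" not in domain


def is_valid_age(value, minimum=0, maximum=120):
    """Type and range check: a whole number within an assumed plausible range."""
    is_whole_number = isinstance(value, int) and not isinstance(value, bool)
    return is_whole_number and minimum <= value <= maximum


def invalid_entries(records):
    """Return (row index, variable) pairs for every value that fails its check."""
    checks = {"email": is_valid_email, "age": is_valid_age}
    failures = []
    for index, record in enumerate(records):
        for variable, check in checks.items():
            if not check(record.get(variable)):
                failures.append((index, variable))
    return failures


# Hypothetical records, invented for this example.
records = [
    {"email": "a.person@example.org", "age": 34},
    {"email": "no-at-symbol.example.org", "age": 34},
    {"email": "b.person@example.org", "age": 212},
]

print(invalid_entries(records))  # [(1, 'email'), (2, 'age')]
```

Counting the failures by variable and by type of record gives a starting point for the invalid entry questions in this section. A value that passes these checks is valid, but it may still be inaccurate.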



Questions to ask to gain insights into accuracy of data

Q13 How accurate are the supplied data?
Q14 How well do the data meet the needs of the intended statistical use?
Q15 How accurate are the items, or variables in the supplied data?
Q16 How accurate are the units, or records in the supplied data?
Q17 What are the accuracy issues in the supplied data?
Q18 If there are accuracy issues, how are they identified? For example, through a formal auditing process, or an automatic flagging system.
Q19 What methods are implemented by the suppliers to prevent any accuracy issues? For example, checks built into the data collection instrument.
Q20 If there are accuracy issues, how are they resolved by the suppliers, and for which variables and types of records?
Q21 What data accuracy issues are not addressed?
Q22 Why are the issues not addressed?
Q23 What happens to data accuracy issues that are not addressed? For example, logged or reported to a specific team.
Q24 How are users of the data informed about these data accuracy issues?



Invalid entry questions

Q25 What are the types of invalid data entries in the data?
Q26 How many invalid entries are there in the data?
Q27 What variables have invalid data entries?
Q28 What types of records have invalid data entries?
Q29 What methods are used to identify invalid data entries?
Q30 What methods are used to resolve invalid data entries?



Error, typos or mistakes questions

Q31 What kinds of errors, typos or mistakes are there in the data?
Q32 Which variables have typos, errors or mistakes?
Q33 Which types of records have typos, errors or mistakes?
Q34 What are the causes of these errors, typos or mistakes in the data?
Q35 How are errors, typos or mistakes identified in the data?
Q36 How are errors, typos or mistakes in the data resolved?



Completeness and Uniqueness


Completeness definition

Completeness describes the degree to which all values within each variable are present (that is, not blank, null or empty). Completeness applies both at data item (variable) level and at unit (record) level. At a data item level, an individual’s record may be missing a value, for example a date of birth. Alternatively, at a unit level, a full record may be missing; that is, the individual is missing from the dataset entirely.

Depending on the completeness of your data there may be under-coverage or over-coverage. Please see the ‘Completeness’ section of the Administrative Data Quality Framework for more information on over and under-coverage.

To assess completeness, you will need to identify how many items or records are missing versus present. This dimension is sometimes described in terms of “missingness” rather than “completeness”, but the quality issue is the same. Data are ‘complete’ when all the data required for your purposes are both present and available for use. This does not mean your data need 100% of the fields to be complete, but that the values and units you need are present. A ‘complete’ dataset may still be inaccurate if it has values that are not correct.
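The counting described above can be sketched in Python as follows. This is a minimal example under stated assumptions: the variable names, the list of expected identifiers and the rule that None, empty and whitespace-only strings count as missing are all invented for illustration. Your own definition of a missing value may differ.

```python
def is_missing(value):
    """Treat None, empty strings and whitespace-only strings as missing (an assumption)."""
    return value is None or (isinstance(value, str) and value.strip() == "")


def item_completeness(records, variables):
    """Proportion of records with a non-missing value, for each variable."""
    total = len(records)
    return {
        variable: (
            sum(1 for r in records if not is_missing(r.get(variable))) / total
            if total
            else 0.0
        )
        for variable in variables
    }


def coverage_check(records, expected_ids, id_var="person_id"):
    """Compare supplied units with an expected list to flag under- and over-coverage."""
    supplied = {r[id_var] for r in records}
    expected = set(expected_ids)
    return {
        "missing_units": sorted(expected - supplied),     # under-coverage
        "unexpected_units": sorted(supplied - expected),  # over-coverage
    }


# Hypothetical records, invented for this example.
records = [
    {"person_id": 1, "date_of_birth": "1980-02-14", "postcode": "AB1 2CD"},
    {"person_id": 2, "date_of_birth": "", "postcode": "AB1 2CD"},
    {"person_id": 3, "date_of_birth": "1975-11-30", "postcode": None},
    {"person_id": 6, "date_of_birth": "1990-06-01", "postcode": "  "},
]

print(item_completeness(records, ["date_of_birth", "postcode"]))
# {'date_of_birth': 0.75, 'postcode': 0.5}
print(coverage_check(records, expected_ids=[1, 2, 3, 4]))
# {'missing_units': [4], 'unexpected_units': [6]}
```

The first function addresses item completeness and the second addresses unit completeness. The unit check is only possible if you have an independent list of the units that should be present.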


Questions to ask to gain insights into completeness of data



Unit completeness questions

Q37 How complete or incomplete are the data?
Q38 How many records in the data are considered complete, or to have good coverage?
Q39 What types of records need to be in the data to be considered complete?
Q40 What types of records are missing from the data where they should be included?
Q41 Why are they missing?
Q42 What types of records are included in the data where they should not be?
Q43 Why are these included?
Q44 How are records missing from the data identified as missing?
Q45 How are records missing in the data resolved?
Q46 How are records that are wrongly included in the data identified?
Q47 How are records that are wrongly included in the data resolved?
Q48 Unit imputation is when missing data are replaced with an imputed record or unit. Are any of the records in the supplied data imputed?
Q49 Why are these records imputed?
Q50 How are they imputed?
Q51 What changes have been made to exclusion and inclusion criteria in the data over time? For example, due to policy changes.



Item completeness questions

Q52 Which variables or values have missing data?
Q53 If there are missing data in variables or values, are there any particular types of records that have data within variables or values missing?
Q54 How are missing data within variables or values identified?
Q55 How are missing data within variables or values resolved?
Q56 How are data, variables or values that are wrongly included in the dataset identified?
Q57 How are data, variables or values that are wrongly included in the dataset resolved?
Q58 Item imputation is when missing data are replaced with an imputed value or variable. Which variables or values in the data are imputed?
Q59 Why are these variables or values imputed?
Q60 How are they imputed?



Uniqueness definition

Uniqueness describes the degree to which there is no duplication in records. This means that the data contain only one record for each entity they represent, and each value is stored only once.

A record is unique if it appears only once in a dataset. A record can be a duplicate even if it has some fields that are different. For example, a person may have two patient records with matching information in some fields (for example, name and date of birth) but different addresses and contact numbers in each record, so the records are treated as belonging to two separate people. Depending on what you are using the data for, this may or may not be a uniqueness issue. If you want to know the total number of visits for every patient, this is not a problem. However, if you want to know how many patients you have on your roster, you could be counting the same person twice. It is therefore important to consider uniqueness in context when assessing quality and when combining datasets, as it can affect the coverage of the data.
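The patient example above can be illustrated with a short Python sketch that groups records sharing the same values in a chosen set of fields. The field names and records are hypothetical, and matching on name and date of birth is an assumption made for this example: two different people can share both, so flagged groups are candidates for review rather than confirmed duplicates.

```python
from collections import defaultdict


def duplicate_groups(records, key_vars):
    """Group row indexes whose key variables match after basic normalisation."""
    groups = defaultdict(list)
    for index, record in enumerate(records):
        # Ignore case and surrounding spaces so trivial differences do not hide a match.
        key = tuple(str(record.get(v, "")).strip().lower() for v in key_vars)
        groups[key].append(index)
    return [indexes for indexes in groups.values() if len(indexes) > 1]


# Hypothetical patient records, invented for this example.
records = [
    {"name": "Ada Smith", "date_of_birth": "1980-02-14", "address": "1 High Street"},
    {"name": "Ben Jones", "date_of_birth": "1975-11-30", "address": "2 Low Road"},
    {"name": "ada smith ", "date_of_birth": "1980-02-14", "address": "9 New Lane"},
]

# Matching on name and date of birth flags rows 0 and 2 as possible duplicates.
print(duplicate_groups(records, ["name", "date_of_birth"]))  # [[0, 2]]

# Matching on every field finds none, because the addresses differ.
print(duplicate_groups(records, ["name", "date_of_birth", "address"]))  # []
```

Which fields you match on depends on what each row is meant to represent and on how you intend to use the data.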



Questions to ask to gain insights into uniqueness of data


Q61 How often do identical records appear in the data more than once?
Q62 Should records appear more than once in the data?
Q63 If records appear more than once, what is the reason?
Q64 How unique are the records in the data?
Q65 What type of records appear in the data more than once?
Q66 What does each row in the dataset represent?
Q67 How is each unique record identified? For example, a record ID number.
Q68 What measures are carried out to prevent records appearing more than once in the data during data collection?
Q69 What measures are carried out to prevent records appearing more than once in the data during data processing?
Q70 How are records that appear more than once in the data identified?
Q71 What do duplicate records look like in the data?
Q72 How are records that appear more than once in the data resolved?



Consistency and Timeliness


Consistency definition

Consistency is achieved when data values do not conflict with other values within a dataset or across different datasets. For example, the date of birth for the same person should be recorded as the same date within the same dataset and between datasets. It should also match the age recorded for that person. Similarly, their postcode should not conflict with their address. Another example is where two people who are each other’s spouses should both have the same marital status recorded.
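Two of the examples above, age against date of birth and marital status between spouses, can be sketched in Python as follows. The field names, the spouse_id link and the reference date are assumptions made for this illustration; a real dataset may record these relationships differently.

```python
from datetime import date


def age_on(date_of_birth, reference_date):
    """Age in completed years on the reference date."""
    years = reference_date.year - date_of_birth.year
    # Subtract one year if the birthday has not yet occurred in the reference year.
    if (reference_date.month, reference_date.day) < (date_of_birth.month, date_of_birth.day):
        years -= 1
    return years


def age_inconsistencies(records, reference_date):
    """Identifiers of records whose recorded age conflicts with their date of birth."""
    return [
        r["person_id"]
        for r in records
        if age_on(r["date_of_birth"], reference_date) != r["age"]
    ]


def spouse_inconsistencies(records):
    """Pairs of linked spouses whose recorded marital statuses differ."""
    by_id = {r["person_id"]: r for r in records}
    problems = []
    for r in records:
        spouse = by_id.get(r.get("spouse_id"))
        # Report each pair once, from the record with the lower identifier.
        if (
            spouse is not None
            and r["person_id"] < spouse["person_id"]
            and r["marital_status"] != spouse["marital_status"]
        ):
            problems.append((r["person_id"], spouse["person_id"]))
    return problems


# Hypothetical records, invented for this example.
records = [
    {"person_id": 1, "date_of_birth": date(1980, 2, 14), "age": 44,
     "marital_status": "married", "spouse_id": 2},
    {"person_id": 2, "date_of_birth": date(1975, 11, 30), "age": 50,
     "marital_status": "single", "spouse_id": 1},
    {"person_id": 3, "date_of_birth": date(1990, 6, 1), "age": 34,
     "marital_status": "single", "spouse_id": None},
]

print(age_inconsistencies(records, reference_date=date(2024, 6, 30)))  # [2]
print(spouse_inconsistencies(records))  # [(1, 2)]
```

A check like this shows that two values conflict, but not which of them is wrong. Resolving the conflict usually needs further information from the supplier.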



Questions to ask to gain insights into consistency of data


Q73 How consistent are the data between variables?
Q74 Which variables have inconsistent information? What is the reason for this?
Q75 Which types of records, if any, have inconsistent information?
Q76 What is the reason for this?
Q77 If you have a composite dataset (a dataset compiled from different sources), how consistent are the data across the different sources?
Q78 How consistent are the data over time?
Q79 Have there been any changes to the way the data are collected over time?
Q80 What changes have there been to the variables over time? For example, changes to definition.
Q81 Which variables, if any, were changed?
Q82 What is used to measure consistency or identify inconsistencies in the supplied data?
Q83 What aspects of the data are checked for consistency? For example, all data items, certain variables, or certain time points.
Q84 How are inconsistencies in the data resolved?



Timeliness definition

Timeliness refers to how well the data reflect the period they are supposed to represent. It also describes how up to date the data are.

The attributes represented in some data might stay the same over time – e.g., the day you were born does not change, no matter how much time passes. Other attributes, such as income, may change.

Your data are also ‘timely’ if the lag between their collection and their availability for your use is appropriate for your needs. Are the data available when expected and needed? Do they reflect the time they are supposed to?
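The lag described above can be measured if each record carries a date showing when it was collected or last updated. The Python sketch below assumes a hypothetical reference_date field, a hypothetical supply date and an acceptable lag of 90 days; these are illustrative values, not recommended thresholds.

```python
from datetime import date


def supply_lag_days(records, supplied_on, reference_var="reference_date"):
    """Days between each record's reference date and the date the data were supplied."""
    return {r["record_id"]: (supplied_on - r[reference_var]).days for r in records}


def stale_records(records, supplied_on, max_lag_days):
    """Identifiers of records whose lag exceeds the acceptable maximum."""
    lags = supply_lag_days(records, supplied_on)
    return [record_id for record_id, lag in lags.items() if lag > max_lag_days]


# Hypothetical records, invented for this example.
records = [
    {"record_id": "A", "reference_date": date(2024, 6, 1)},
    {"record_id": "B", "reference_date": date(2023, 1, 15)},
]
supplied_on = date(2024, 6, 30)

print(supply_lag_days(records, supplied_on)["A"])              # 29
print(stale_records(records, supplied_on, max_lag_days=90))    # ['B']
```

Whether a given lag is acceptable depends on the attribute. A long lag matters little for a date of birth, but it may matter a great deal for income or address.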



Questions to ask to gain insights into timeliness of data


Q85 When are the data collected? For example, constantly or over a certain timeframe?
Q86 Up to date refers to whether the data supplied are the latest version. For example, if new data are being collected but are not reflected in the current data, then the data are not up to date.
Q87 How up to date are the data at the point they are supplied?
Q88 What can impact how up to date the data are?
Q89 Reference dates refer to timestamps which indicate when the data have been changed. Are there any reference dates for each record?
Q90 At what point of the data collection phase are reference dates produced? For example, when the data are collected, or when the data were last updated.
Q91 How up to date are the variables at the point they are supplied?
Q92 Which types of records do not have up to date information in these variables?
Q93 What methods are used to check that the data are up to date?
Q94 What methods are carried out to resolve data if they are not up to date?
Q95 How often are the data updated?
Q96 What information is updated?
Q97 Are there any time lags between the reference dates in the data and the date on which the data are supplied?
Q98 What are the different processes by which new records are added?
Q99 How often are existing records within the data updated with new information?
Q100 What are the different processes by which existing records are updated with new information?
Q101 What are the different processes by which variables or values are updated with new information?
Q102 How often are the data updated to remove records from the data?
Q103 Under what circumstances are records removed from the data?
Q104 What are the different processes by which unwanted records are removed?
Q105 When records meet the criteria for removal, how long does it typically take for them to be deleted from the supplied data?
Q106 How often are existing records within the data updated to correct for any errors?
Q107 How often are variables within records updated to correct for any errors?
Q108 What are the different processes by which existing records are updated to correct for any errors?