This question bank is organised around the data quality dimensions defined by the Data Management Association UK (DAMA), as outlined in the Government Data Quality Framework. The dimensions and definitions used are the same as those in our Administrative Data Quality Framework (ADQF). The dimensions covered are:
accuracy
validity
completeness
uniqueness
consistency
timeliness
We have used these because they were developed by experts in data quality to assess the fitness for purpose of data. Finding which dimensions are important for you will help you make decisions around how fit for purpose the data are for your needs.
In future publications of this question bank, we intend to include questions based on relevant selected principles from the European Statistics Code of Practice. The principles covered will be:
relevance: coverage, content, purpose and collection
accessibility and clarity: accessing the data, data format, availability of supporting information, quality and sufficiency of metadata, illustrations and accompanying advice
Before going into the questions for each data quality dimension, we have provided some general questions that you can use to make sure you have a fundamental understanding of the data and their quality.
Q1 | How are the data from this dataset collected? For example, through public contact with services over the phone, registration forms, etc. |
Q2 | What organisation(s) collects the data? |
Q3 | Are there different organisations which collect different data or variables in the dataset? For example, where one organisation is responsible for collecting income-related data, and another organisation is responsible for collecting demographic data, and these data are combined to create one composite dataset. |
Q4 | If so, which data are collected by which organisation? |
Q5 | Are the data collected differently? |
Q6 | Does this have an impact on the quality? |
Q7 | How do suppliers quality assure the data? |
Q8 | Are there any known quality issues? |
Q9 | What thresholds have the suppliers put in place regarding the data’s quality? For example, an acceptable number of duplicate records, or an acceptable amount of missing data. |
Q10 | How is the quality for this dataset documented? |
Q11 | Are there any supplementary documents related to the dataset that can be shared? For example, a data dictionary, a metadata list. |
Q12 | Are there training manuals related to the work that can be shared? For example, for coding, updating or maintaining the dataset. |
Administrative data accuracy refers to how well the data match reality: do the data capture what you are trying to measure?
Validity is defined as the extent to which the data conform to the expected format, type, and range. For example, an email address must have an ‘@’ symbol.
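As a minimal sketch of what format, type and range checks can look like in practice, the Python below flags invalid entries by variable. The field names (`email`, `age`), the age limits and the sample records are assumptions made up for illustration; they are not taken from any real dataset or supplier specification.

```python
# Illustrative validity checks. All field names, limits and records are hypothetical.

def is_valid_email(value):
    """Format check: a string with text on both sides of a single '@'."""
    if not isinstance(value, str):
        return False
    local, sep, domain = value.partition("@")
    return bool(local) and sep == "@" and bool(domain) and "@" not in domain

def is_valid_age(value, minimum=0, maximum=120):
    """Type and range check: an integer between minimum and maximum (assumed limits)."""
    is_integer = isinstance(value, int) and not isinstance(value, bool)
    return is_integer and minimum <= value <= maximum

records = [
    {"id": 1, "email": "a.person@example.org", "age": 34},
    {"id": 2, "email": "no-at-symbol.example.org", "age": 34},   # format problem
    {"id": 3, "email": "b.person@example.org", "age": 250},      # range problem
    {"id": 4, "email": "c.person@example.org", "age": "forty"},  # type problem
]

# Map each record ID to the variables that failed a check.
invalid_entries = {}
for record in records:
    failed = []
    if not is_valid_email(record["email"]):
        failed.append("email")
    if not is_valid_age(record["age"]):
        failed.append("age")
    if failed:
        invalid_entries[record["id"]] = failed

print(invalid_entries)
```

A record that passes these checks is valid but not necessarily accurate: a well-formed email address may still belong to someone else.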
Q13 | How accurate are the supplied data? |
Q14 | How well do the data meet the needs of the statistical use? |
Q15 | How accurate are the items, or variables, in the supplied data? |
Q16 | How accurate are the units, or records, in the supplied data? |
Q17 | What are the accuracy issues in the supplied data? |
Q18 | If there are accuracy issues, how are they identified? For example, through a formal auditing process, or an automatic flagging system. |
Q19 | What methods are implemented by the suppliers to prevent any accuracy issues? For example, checks built into the data collection instrument. |
Q20 | If there are accuracy issues, how are they resolved by the suppliers, and for which variables and types of records? |
Q21 | What data accuracy issues are not addressed? |
Q22 | Why are the issues not addressed? |
Q23 | What happens to data accuracy issues that are not addressed? For example, logged or reported to a specific team. |
Q24 | How are users of the data informed about these data accuracy issues? |
Q25 | What are the types of invalid data entries in the data? |
Q26 | How many invalid entries are there in the data? |
Q27 | What variables have invalid data entries? |
Q28 | What types of records have invalid data entries? |
Q29 | What methods are used to identify invalid data entries? |
Q30 | What methods are used to resolve invalid data entries? |
Q31 | What kinds of errors, typos or mistakes are there in the data? |
Q32 | Which variables have typos, errors or mistakes? |
Q33 | Which types of records have typos, errors or mistakes? |
Q34 | What are the causes of these errors, typos or mistakes in the data? |
Q35 | How are errors, typos or mistakes identified in the data? |
Q36 | How are errors, typos or mistakes in the data resolved? |
Completeness describes the degree to which all values within each variable are present (that is, not blank, null or empty). Completeness applies both at data item (variable) level and at unit (record) level. At data item level, a single value, for example a date of birth, may be missing from an individual’s record within a dataset. At unit level, a full record may be missing, so that the individual is absent from the dataset entirely.
Depending on the completeness of your data there may be under-coverage or over-coverage. Please see the ‘Completeness’ section of the Administrative Data Quality Framework for more information on over and under-coverage.
To assess completeness, you will need to identify how many items or records are missing versus present. This dimension is sometimes described in terms of “missingness” rather than “completeness”, but the quality issue is the same. Data are ‘complete’ when all the data required for your purposes are both present and available for use. This does not mean your data need 100% of the fields to be complete, but that the values and units you need are present. A ‘complete’ dataset may still be inaccurate if it has values that are not correct.
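The counting described above can be sketched as follows. This is an illustration only: the variable names, the choice of which variables are required, and the list of expected record IDs are all assumptions. In practice the expected population would come from a separate source, such as a register.

```python
# Illustrative completeness checks. All names and records are hypothetical.

def is_missing(value):
    """Treat None and blank strings as missing."""
    return value is None or (isinstance(value, str) and value.strip() == "")

records = [
    {"id": "A1", "name": "Ana", "date_of_birth": "1980-02-01", "postcode": "AB1 2CD"},
    {"id": "A2", "name": "Ben", "date_of_birth": None, "postcode": "EF3 4GH"},
    {"id": "A3", "name": "", "date_of_birth": "1975-11-30", "postcode": None},
]

# Item (variable) level: how many values are missing in each variable?
variables = ["name", "date_of_birth", "postcode"]
missing_per_variable = {
    var: sum(is_missing(r.get(var)) for r in records) for var in variables
}

# Only the variables needed for the intended use must be present (an assumption).
required = ["name", "date_of_birth"]
complete_record_ids = [
    r["id"] for r in records if not any(is_missing(r.get(var)) for var in required)
]

# Unit (record) level: compare the IDs present with the IDs expected.
expected_ids = {"A1", "A2", "A3", "A4"}
present_ids = {r["id"] for r in records}
missing_units = sorted(expected_ids - present_ids)     # possible under-coverage
unexpected_units = sorted(present_ids - expected_ids)  # possible over-coverage

print(missing_per_variable, complete_record_ids, missing_units, unexpected_units)
```

Record A3 has no postcode but would still count as complete if only `name` and `date_of_birth` were required and present, which mirrors the point that completeness depends on what you need the data for.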
Q37 | How complete or incomplete are the data? |
Q38 | How many records in the data are considered complete, or to have good coverage? |
Q39 | What types of records need to be in the data to be considered complete? |
Q40 | What types of records are missing from the data where they should be included? |
Q41 | Why are they missing? |
Q42 | What types of records are included in the data where they should not be? |
Q43 | Why are these included? |
Q44 | How are records missing from the data identified as missing? |
Q45 | How are records missing in the data resolved? |
Q46 | How are records that are wrongly included in the data identified? |
Q47 | How are records that are wrongly included in the data resolved? |
Q48 | Unit imputation is when missing data are replaced with a record or unit. Are any records in the supplied data imputed? |
Q49 | Why are these records imputed? |
Q50 | How are they imputed? |
Q51 | What changes have been made to exclusion and inclusion criteria in the data over time? For example, due to policy changes. |
Q52 | Which variables or values have missing data? |
Q53 | If there are missing data in variables or values, are there particular types of records in which these data are missing? |
Q54 | How are missing data within variables or values identified? |
Q55 | How are missing data within variables or values resolved? |
Q56 | How are data, variables or values that are wrongly included in the dataset identified? |
Q57 | How are data, variables or values that are wrongly included in the dataset resolved? |
Q58 | Item imputation is when missing data are replaced with a value or variable. Which variables or values in the data are imputed? |
Q59 | Why are these variables or values imputed? |
Q60 | How are they imputed? |
Uniqueness describes the degree to which there is no duplication in records. This means that the data contains only one record for each entity it represents, and each value is stored only once.
A record is unique if it appears only once in a dataset. A record can be a duplicate even if some of its fields are different. For example, a person may have two patient records with matching information in some fields (for example, name and date of birth) but different addresses and contact numbers in each record, so they are treated as two separate people. Depending on what you are using the data for, this may or may not be a uniqueness issue. If you want to know the total number of visits for every patient, this is not a problem. However, if you want to know how many patients you have on your roster, you could be counting the same person twice. As such, it is important to take uniqueness, and the context of use, into account when assessing the quality of datasets and when combining them, as it can affect the coverage of the data.
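The patient example can be sketched in Python as below. The records are made up, and the matching rule (same name after normalising case and spacing, plus the same date of birth) is an assumption chosen for illustration; real record linkage usually needs more careful rules than this.

```python
# Illustrative duplicate check on a subset of fields. All records are hypothetical.
from collections import defaultdict

records = [
    {"record_id": 1, "name": "Jo Smith", "date_of_birth": "1990-05-17", "address": "1 High Street"},
    {"record_id": 2, "name": "jo smith ", "date_of_birth": "1990-05-17", "address": "22 Mill Lane"},
    {"record_id": 3, "name": "Sam Jones", "date_of_birth": "1984-01-09", "address": "5 Park Road"},
]

def match_key(record):
    """Assumed matching rule: normalised name plus date of birth."""
    normalised_name = " ".join(record["name"].lower().split())
    return (normalised_name, record["date_of_birth"])

# Group record IDs by the matching key.
groups = defaultdict(list)
for record in records:
    groups[match_key(record)].append(record["record_id"])

# Keys shared by more than one record are possible duplicates.
possible_duplicates = {key: ids for key, ids in groups.items() if len(ids) > 1}

row_count = len(records)      # counts rows, for example visits
entity_count = len(groups)    # counts distinct people under the assumed rule

print(possible_duplicates, row_count, entity_count)
```

Records 1 and 2 differ in address, so an exact whole-row comparison would not flag them. Whether they should be merged depends on the use: a count of visits can keep both rows, while a count of patients should not.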
Q61 | How often do identical records appear in the data more than once? |
Q62 | Should any records appear more than once in the data? |
Q63 | If records appear more than once, what is the reason? |
Q64 | How unique are the records in the data? |
Q65 | What type of records appear in the data more than once? |
Q66 | What does each row in the dataset represent? |
Q67 | How is each unique record identified? For example, a record ID number. |
Q68 | What measures are carried out to prevent records appearing more than once in the data during data collection? |
Q69 | What measures are carried out to prevent records appearing more than once in the data during data processing? |
Q70 | How are records that appear more than once in the data identified? |
Q71 | What do duplicate records look like in the data? |
Q72 | How are records that appear more than once in the data resolved? |
Consistency is achieved when data values do not conflict with other values within a dataset or across different datasets. For example, the date of birth for the same person should be recorded as the same date within the same dataset and between datasets. It should also match the age recorded for that person. Similarly, their postcode should not conflict with their address. Another example is that two people who are each other’s spouses should both have the same marital status recorded.
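Two of the checks described above, date of birth against recorded age and marital status between spouses, can be sketched as follows. The field names, the reference date used to calculate age and the records themselves are assumptions for illustration.

```python
# Illustrative consistency checks. All field names and records are hypothetical.
from datetime import date

def age_on(date_of_birth, reference_date):
    """Age in whole years on the reference date."""
    had_birthday = (reference_date.month, reference_date.day) >= (
        date_of_birth.month,
        date_of_birth.day,
    )
    return reference_date.year - date_of_birth.year - (0 if had_birthday else 1)

reference_date = date(2024, 6, 30)  # assumed date at which age was recorded

records = [
    {"id": 1, "date_of_birth": date(1990, 7, 15), "age": 33,
     "marital_status": "married", "spouse_id": 2},
    {"id": 2, "date_of_birth": date(1988, 3, 2), "age": 40,
     "marital_status": "single", "spouse_id": 1},
]
by_id = {r["id"]: r for r in records}

# Between variables: does the recorded age agree with the date of birth?
age_conflicts = [
    r["id"] for r in records
    if age_on(r["date_of_birth"], reference_date) != r["age"]
]

# Between records: do linked spouses have the same marital status?
spouse_conflicts = sorted({
    tuple(sorted((r["id"], r["spouse_id"])))
    for r in records
    if r["spouse_id"] in by_id
    and by_id[r["spouse_id"]]["marital_status"] != r["marital_status"]
})

print(age_conflicts, spouse_conflicts)
```

A check like this shows that two values conflict but not which one is wrong, so resolving the inconsistency still needs a rule or another source.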
Q73 | How consistent are the data between variables? |
Q74 | Which variables have inconsistent information? What is the reason for this? |
Q75 | Which types of records, if any, have inconsistent information? |
Q76 | What is the reason for this? |
Q77 | If you have a composite dataset (a dataset compiled from different sources), how consistent are the data across the different sources? |
Q78 | How consistent are the data over time? |
Q79 | Have there been any changes to the way the data are collected over time? |
Q80 | What changes have there been to the variables over time? For example, changes to definition. |
Q81 | Which variables, if any, were changed? |
Q82 | What is used to measure consistency or identify inconsistencies in the supplied data? |
Q83 | What aspects of the data are checked for consistency? For example, all data items, certain variables or certain time points. |
Q84 | How are inconsistencies in the data resolved? |
Timeliness refers to how well the data reflect the period they are supposed to represent. It also describes how up to date the data are.
The attributes represented in some data might stay the same over time. For example, the day you were born does not change, no matter how much time passes. Other attributes, such as income, may change.
Your data are also ‘timely’ if the lag between their collection and their availability for your use is appropriate for your needs. Are the data available when expected and needed? Do they reflect the time they are supposed to?
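The lag between collection and availability can be measured directly when each record carries a collection date. The sketch below assumes a hypothetical `collected_on` field, a single supply date and a 90-day acceptable lag; all three are illustrative choices, and the right threshold depends on your needs.

```python
# Illustrative timeliness check. Field names, dates and threshold are hypothetical.
from datetime import date

supplied_on = date(2024, 6, 30)  # assumed date the data were made available
max_lag_days = 90                # assumed acceptable lag for the intended use

records = [
    {"id": 1, "collected_on": date(2024, 6, 1)},
    {"id": 2, "collected_on": date(2023, 12, 31)},
    {"id": 3, "collected_on": date(2024, 4, 1)},
]

# Lag in days between collection and supply, per record.
lag_days = {r["id"]: (supplied_on - r["collected_on"]).days for r in records}

# Records whose lag exceeds the assumed threshold.
stale_ids = [record_id for record_id, lag in lag_days.items() if lag > max_lag_days]

print(lag_days, stale_ids)
```

The same lag may be acceptable for a stable attribute, such as date of birth, and too long for one that changes, such as income.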
Q85 | When are the data collected? For example, constantly or over a certain timeframe? |
Q86 | Up to date refers to whether the data supplied are the latest version. For example, if new data are being collected but are not reflected in the current data, then the data are not up to date. |
Q87 | How up to date are the data at the point of being supplied? |
Q88 | What can impact how up to date the data are? |
Q89 | Reference dates refer to timestamps which indicate when the data have been changed. Are there any reference dates for each record? |
Q90 | At what point of the data collection phase are reference dates produced? For example, when the data are collected, or when the data were last updated. |
Q91 | How up to date are the variables at the point of being supplied? |
Q92 | Which types of records do not have up to date information in these variables? |
Q93 | What methods are used to check that the data are up to date? |
Q94 | What methods are carried out to resolve data if they are not up to date? |
Q95 | How often are the data updated? |
Q96 | What information is updated? |
Q97 | Are there any time lags between the reference dates in the data and the date on which the data are supplied? |
Q98 | What are the different processes by which new records are added? |
Q99 | How often are existing records within the data updated with new information? |
Q100 | What are the different processes by which existing records are updated with new information? |
Q101 | What are the different processes by which variables or values are updated with new information? |
Q102 | How often are the data updated to remove records from the data? |
Q103 | Under what circumstances are records removed from the data? |
Q104 | What are the different processes by which unwanted records are removed? |
Q105 | When records meet the criteria for removal, how long would it typically take for the record to be deleted from the data supplied? |
Q106 | How often are existing records within the data updated to correct for any errors? |
Q107 | How often are variables within records updated to correct for any errors? |
Q108 | What are the different processes by which existing records are updated to correct for any errors? |