‘Good Data’: a data practitioner perspective

In their recent chapter, Claire Trentham and Adam Steer offer a set of responses to the tricky question of ‘What makes data good?’. Their ‘manifesto’ joins a growing list of guidelines and recommendations for desirable data generation and use – from the EU’s GDPR through to the Manifesto for Data Practices and the Code of Ethics for Data Science. None of these is binding in the context of Australian schools, yet all point to issues and challenges that need to be considered in any use of educational data.

Trentham and Steer are quick to stress their practical (rather than theoretical) interest in this question. As they describe themselves:

“We are not ethicists, nor data privacy experts. We are Australian data practitioners with experience managing and working with petabyte-scale data collections; advisors on and contributors to continent-scale data infrastructures. We love high quality data but want to make sure the data we produce and consume considers more than fidelity, precision, accuracy, and reproducibility” (p.38).

In this spirit, Trentham and Steer address what they distinguish as ‘technical’ and ‘human’ aspects of datafication. As they put it, this includes deceptively straightforward questions such as “What do we do with all these data? How do we catalogue them? How should we use them?”. The phrasing of that last question, ‘should’ rather than ‘could’, sensibly places the ensuing discussion of ‘good data’ in the realm of values, judgements and politics – reminding us that the use of digital data in society should always be approached as a choice. The fact that school data infrastructures are becoming seemingly inescapable should not deter us from considering the possibility of alternate technological pathways and different data futures.

The centrepiece of Trentham and Steer’s chapter is the following ‘Guidelines for Good Data’:

Good data are…	Considerations	Questions we may ask
Usable: fit for purpose	Well described; Include uncertainties/limitations; Readable; FAIR (Findable, Accessible, Interoperable, Reusable); Reproducible; Timely; Appropriately licenced	Is the purpose of the dataset well defined? Are these the best data for the task? Are the data well described, including limitations and uncertainties? Is the dataset discoverable, accessible, and readable? Are the data reproducible? Is the method by which open data was produced also open?
Collected with respect to…	Humans and their rights; The natural world	Was the data collected/produced for this purpose, not incidentally?
Published	With respect to openness; Maintaining privacy; Carrying owner licensing	Is the dataset published with a DOI and version? Does the data carry an appropriate licence?
Revisable	Personal: opt-in/out alternatives; Long term accuracy: data may change over time; Older versions of data may be decommissioned	For human-related data, could participants realistically opt-out? Are the data time dependent?
Form useful social capital	Valuable to society; Persistent, open; Available for ethical use	Have we considered ethics around the data?

Many dot-points on this list speak for themselves, and give us plenty of food for thought. Amidst these various contentions, some particular areas of interest include the following points:

There is long-standing interest amongst data practitioners in ensuring that data-sets are FAIR (findable, accessible, interoperable and reusable). However, Trentham and Steer contend that in addition to these characteristics, ‘good data’ also need to be ethical (i.e. no entity will be harmed in the collection or use of the data), and revisable (i.e. errata can be identified and corrected, and updated versions of the dataset released with older versions archived).
Use of data should be ‘defensible’ – i.e. it is demonstrable that the data can be validly used for its primary and/or secondary purposes.
Good data are self-deferential – datasets are open about their limitations.
Good data are self-explanatory – i.e. accompanied by meta-data that not only describe how the dataset was created, but also who funded and collected the data, for what purposes, and any subsequent post-processing steps.
Data should be released to primary users as soon as possible after collection, and should still be relevant when they are released to the wider community.
Emphasis should be placed on minimising the impact of data collection and retention on the natural world – i.e. in terms of energy costs, depletion of natural resources and other ‘physical cost[s] of holding, cataloguing, accessing and processing data’.
Personal data should be generated on an opt-in basis, with the option to opt out at any time (including long after the data has been generated) – “in the context of ubiquitous data collection about individuals, ‘good data’ respects the right to be forgotten” (p.46).
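
To make the ‘self-explanatory’ point a little more concrete, the following is a minimal sketch of what a metadata record along these lines might look like, and how a school might check which descriptions are still missing. It is not taken from Trentham and Steer’s chapter: the `DatasetRecord` class, its field names and the example values are all hypothetical, chosen only to mirror the items listed above (how the data were created, who funded and collected them, for what purposes, any post-processing, and known limitations).

```python
from dataclasses import dataclass, field, fields


@dataclass
class DatasetRecord:
    """Hypothetical metadata record; field names are illustrative only."""
    title: str
    creation_method: str = ""   # how the dataset was created
    funder: str = ""            # who funded the collection
    collector: str = ""         # who collected the data
    purposes: list[str] = field(default_factory=list)         # intended uses
    post_processing: list[str] = field(default_factory=list)  # later processing steps
    limitations: list[str] = field(default_factory=list)      # known limitations


def missing_descriptions(record: DatasetRecord) -> list[str]:
    """Return the names of metadata fields that are still empty."""
    return [f.name for f in fields(record) if not getattr(record, f.name)]


# Invented example: a partly described dataset.
record = DatasetRecord(
    title="Example attendance dataset",
    creation_method="Exported from a school roll-marking system",
    purposes=["attendance reporting"],
)
print(missing_descriptions(record))
```

A check like this only shows that a description is absent, not that an existing one is adequate. An empty `post_processing` list, for instance, might mean either that no processing took place or that nobody recorded it, which is exactly the kind of ambiguity that good documentation is meant to remove.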

These contentions have significant implications for schools’ data processes and practices. For example, many school datasets are not especially timely or well documented. That said, Trentham and Steer acknowledge that no dataset will be able to ‘tick every box’, and that some of these criteria may be logically impossible for particular types of data. Nevertheless, the key qualities that arise from their list offer a useful set of criteria against which to judge data use in schools. In short, these revolve around the following three underpinning characteristics:

Data consistency – e.g. is there agreement about what to call things between data providers?
Data accessibility – e.g. can data be openly accessed and engaged with?
Data provenance – e.g. is it clear where data come from?
