Preserving data
Preserving research data refers to the practice of keeping data available and usable in the longer term, beyond the end of your research project.
Why preserve data?
The main reasons for data preservation are:
- Ensuring that your research can be verified and reproduced
- Maintaining data for future reuse, e.g. for further research or teaching
Increasingly, funders, publishers, and institutions (including Ghent University) require certain research data to be retained for a specified period and/or for a specific purpose.
Preserving vs. storing data
Preserving data from completed research is different from storing and backing up data files while your research is still ongoing. The latter typically involves data that are mutable; the former concerns data (or milestone versions of data) that are ‘frozen’ and not in active use.
Long-term preservation requires appropriate actions to prevent data from becoming unavailable or unusable over time, for example because of:
- Outdated software or hardware
- Storage media degradation
- A lack of sufficient descriptive and contextual information to keep data understandable
In other words, data preservation involves more than simply not deleting the data files created and stored in the course of your research project!
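One common precaution against silent corruption from storage media degradation is to record a checksum for each preserved file and to re-verify those checksums periodically. The following is a minimal sketch of that idea, not a prescribed procedure: the function names (`sha256_of`, `build_manifest`, `changed_files`) are hypothetical, and the choice of SHA-256 is an assumption.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(folder: Path) -> dict[str, str]:
    """Map each file's path (relative to folder) to its checksum."""
    return {
        p.relative_to(folder).as_posix(): sha256_of(p)
        for p in sorted(folder.rglob("*"))
        if p.is_file()
    }


def changed_files(folder: Path, manifest: dict[str, str]) -> list[str]:
    """List files that are missing or whose content no longer matches."""
    problems = []
    for name, expected in manifest.items():
        target = folder / name
        if not target.is_file() or sha256_of(target) != expected:
            problems.append(name)
    return problems
```

The manifest would be created once, when the data are frozen, and stored alongside (or separately from) the data. Running `changed_files` later against the same folder returns an empty list if every file is still present and unchanged.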
What to keep?
Not all research data (or all versions or parts of them) can or must be kept indefinitely.
Maintaining data in a usable form for the longer term takes effort and has a considerable cost. Selecting which (parts of) data to keep, and for how long, is therefore an essential component of data preservation.
As a researcher, you have a key role in deciding what to retain and what to discard, as you know your data best. Such decisions may depend on factors such as:
- The type of data involved
- The norms in your discipline
- Whether you are keeping data for potential future reuse, for verification, or for other purposes. Depending on the purpose, you may need to keep the raw data or data in a more processed form (or perhaps you want to preserve different forms of the same dataset for different purposes, and for different retention periods).
Appraisal and selection of research data is still an evolving field, but some generic, high-level criteria are emerging to guide decisions on what to keep. Common criteria for keeping data include:
- Legal or ethical requirements to keep (certain) data for a specified retention period (e.g. for clinical trials data)
- Funder, institutional or publisher policies
- High potential reuse value of the data
- Great scientific, historical, or cultural significance of the data
- The data are unique and/or cannot easily be re-created
- The benefits outweigh the costs of data preservation
Conversely, there can be valid reasons for disposing of (parts or versions of) data once your research is finished (e.g. duplicate copies, superseded versions, …) or later on, after the applicable retention period has expired.
Without associated information, research data quickly become useless. For all data selected for preservation, you should therefore keep a data package consisting of:
- The research data files themselves
- The necessary accompanying documentation and metadata to ensure that those data remain findable, comprehensible, and (re)usable
It is also important to document and justify your choices to keep or remove data, for example in your Data Management Plan.
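As an illustration of the data package described above, the following sketch checks whether a folder contains data files, documentation, and metadata. The file names it looks for (`README.txt`, `README.md`, `metadata.json`, `metadata.xml`) and the function name `package_gaps` are hypothetical conventions chosen for this example; your discipline or repository may expect different ones.

```python
from pathlib import Path

# Hypothetical naming conventions; adapt to your own project or repository.
DOCUMENTATION_NAMES = {"readme.txt", "readme.md"}
METADATA_NAMES = {"metadata.json", "metadata.xml"}


def package_gaps(folder: Path) -> list[str]:
    """Return the components that appear to be missing from a data package."""
    names = {p.name.lower() for p in folder.rglob("*") if p.is_file()}
    data_files = names - DOCUMENTATION_NAMES - METADATA_NAMES
    gaps = []
    if not data_files:
        gaps.append("no data files")
    if not names & DOCUMENTATION_NAMES:
        gaps.append("no documentation (README)")
    if not names & METADATA_NAMES:
        gaps.append("no metadata file")
    return gaps
```

An empty list means all three components were found. The check only confirms that the files exist; whether the documentation is sufficient still has to be judged by someone who knows the data.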
Where to keep data?
Research data and documentation selected for retention should be kept in a suitable location and in a secure manner to ensure that they remain available and usable beyond the end of your project, with appropriate access rights.
Where appropriate, depositing data in an established, trustworthy research data repository (sometimes also called a data center, data archive or scientific database) is generally the preferred option. This has the added benefit of allowing you to make your data available to others at the same time.
Data repository types
There are different kinds of data repository, including:
- General-purpose repositories: accept a wide range of data types (and sometimes other research outputs as well) from all disciplines. Examples are:
  - Zenodo
  - Open Science Framework (single sign-on with Ghent University credentials possible)
  - Dryad
- Domain-specific repositories: focus on specific data types or data from specific research domains.
- Institutional repositories: hold research data outputs from a particular research institution.
Data repositories are mostly suitable for research data that can be publicly shared – although that doesn’t necessarily have to mean sharing in a fully open way (see degrees of data sharing). Some data repositories can cater for data that cannot be made (immediately) available under full open access, for example by allowing temporary embargoes, or by offering more restricted or controlled levels of access.
However, sometimes it may not be possible or appropriate to deposit data in an external repository, e.g. for legal, ethical, contractual, practical, or other reasons. In such cases, research data selected for preservation will need to be kept in-house.
There are hundreds of existing data repositories or archives to choose from. Keep in mind, however, that not all repositories are created equal. Some repositories focus more on disseminating and making your data visible than on ensuring their preservation in the long term.
Basic tips
- Check the list of repositories recommended by your journal/publisher. Many journals and publishers with data sharing policies recommend, and for some data types even require, the use of specific repositories. For example, see the list of recommended repositories from Springer Nature or PLOS.
- Deposit data in a broadly recognised domain-specific repository if one is available for your specific domain or data type. Trusted domain repositories might not accept all individual datasets, however: they tend to focus on high-quality data with potential for reuse.
- Select a general-purpose repository, such as Zenodo or Open Science Framework, if no established repository exists for your research domain.
Additional considerations
- Does the repository match your data needs (e.g. in terms of accepted data types and formats, access levels, licenses, legal requirements for data protection…)?
- Does it charge for its services?
- Does it have an explicit commitment to long-term preservation?
- Does it provide a landing page for each dataset, with publicly available metadata?
- Does it assign persistent and unique identifiers?
- Does it provide clarity about access levels and conditions?
- Does it provide information about usage licenses?
- Is it certified?
- Is it community-based, or a commercial solution?
Non-digital research data and materials
Research data management (RDM) mostly focuses on digital research data. However, you may also collect analogue research data (e.g. surveys on paper…) as part of your research, or other non-digital materials that strictly speaking do not constitute research data (e.g. samples). Sometimes such non-digital data and materials also need to be retained after the end of your project.
- Consider whether digitizing the data is an option (e.g. this may be worthwhile for data that will be kept permanently for future reuse).
- If not, your Faculty, Department, research group, lab etc. may offer facilities to retain your data for verification or legal compliance purposes for a finite retention period. An example is the Faculty of Psychology and Educational Sciences’ Archive for Research Material.
- Contact rdm.support@ugent.be if you have paper research data that could merit permanent preservation for future reuse.
There are also repositories for non-digital materials you can make use of. Examples include:
- BCCM (Belgian Co-ordinated Collections of Micro-Organisms): for the deposit of biological materials
- Bioresource Center Ghent: offers support for biobanking projects as well as a central biobanking facility at Ghent University Hospital.
Preparing data for preservation
There are certain preconditions for maintaining data in a usable form over time, such as:
- Availability of sufficient documentation and metadata
- Where needed, converting data to file formats suitable for sustainable access
- Properly structured data files
- Having the necessary rights and/or permissions in place to preserve and share data
Keeping data safe for the future therefore requires some preparation and effort on your part.
Established domain-specific repositories will usually only accept data that meet their standards for file formats, documentation and metadata, data quality… If you plan to use a data repository or archive, check in advance what the requirements are, so you can adequately prepare your data for deposit.
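When checking a repository's file format requirements, it can help to list the files in your dataset that fall outside the accepted or recommended formats. The sketch below does this by file extension. The set `PREFERRED_SUFFIXES` is an illustrative assumption only and does not reflect any particular repository's policy; the function name `files_to_review` is likewise hypothetical.

```python
from pathlib import Path

# Illustrative assumption only: replace with the formats your chosen
# repository actually accepts or recommends.
PREFERRED_SUFFIXES = {".csv", ".txt", ".json", ".xml", ".pdf", ".tif", ".tiff"}


def files_to_review(folder: Path) -> list[str]:
    """List files whose extension is not in the preferred set."""
    return sorted(
        p.relative_to(folder).as_posix()
        for p in folder.rglob("*")
        if p.is_file() and p.suffix.lower() not in PREFERRED_SUFFIXES
    )
```

Files returned by `files_to_review` are candidates for conversion, or for a note in the documentation explaining which software is needed to open them. An extension is only a hint about the actual format, so the result is a starting point for review rather than a definitive check.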
More information
- N. Beagrie (2019), What to Keep: A Jisc research data study
- DCC (2014), Five steps to decide what data to keep: a checklist for appraising research data
- H. Tjalsma & J. Rombouts (2011), Selection of Research Data. Guidelines for appraising and selecting research data
- A. Whyte (2015), Where to keep research data. DCC checklist for evaluating data repositories
- Science Europe (2018), Practical Guide to the International Alignment of Research Data Management. This guide contains a section on ‘Criteria for the selection of trustworthy repositories’.