Documenting data

Data documentation comprises any contextual and descriptive information needed to find, assess, understand, and (re)use research data.

Why document data?

Research data should always be accompanied by documentation because it:

Enables you to understand/interpret data later
Makes data independently understandable, i.e. reusable
Makes results independently reproducible, starting from raw data
Helps avoid incorrect use/misinterpretation

As such, documentation is an essential step in making your data FAIR.

When?

The main pitfall in data documentation is procrastination. Will you really remember next month why you manipulated the data the way you did? Or next year? And how about the abbreviations you used for your variables? Will you still remember them in two years’ time?

Best practice is to start gathering meaningful information as early as possible in the research process.

Filenames and the organization of folders can also provide important context about a data collection.

How?

Data should be documented both at the study level and the data (item) level.

Study-level documentation

This type of documentation concerns contextual information about a study/project:

The background, aims, objectives, and the hypotheses that shape the research
Procedural & methodological information such as the methods used, data preparations and manipulations, summaries of findings, and interim results

Examples:

Data collection protocols & procedures
List of all datasets
Preliminary findings (ranked by date)
Extent of the research (i.e. temporal or geographical)
Instruments used - questionnaires, showcards, interview schedules, etc.
Temporal/geographic coverage
Description of data validation methods - cleaning, error-checking
Compilation of derived variables
Source(s) of secondary data
List of selection criteria for literature study
Reports, final publication, user guide, working paper, lab books
Blank informed consent forms
Records of interviewees and their demographic characteristics

Data-level documentation

Data-level documentation provides information about datasets and/or individual data items, and about variables within data sets. It is the primary source for others and your future self to understand the data.

Information about datasets can be:

An inventory of the data files
Relationships between data files
Annotated processing scripts that generate/process data sets
Anonymisation methods carried out
…

Information about variables can be:

Labels, codes, classifications
(Coding of) missing values
Explanations of variable derivations and aggregations

Codebook

Information about data items can be recorded in a codebook. Typically this is a separate file, but some data formats allow you to embed this information in the data file itself (e.g. the SPSS .sav file format).

While embedding variable information within the data file may seem easy at first glance, it also comes with risks. For instance, when the data file is converted to another format, the embedded documentation often gets lost. Also, embedded documentation only allows for a predefined set of metadata. It is therefore advisable to use a separate codebook file, which is more flexible.
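A separate codebook can be as simple as a CSV file with one row per variable. The following is a minimal sketch, not a prescribed format: the column names, the example codes and the missing-value code -99 are assumptions made for illustration only.

```python
import csv
import io

# Hypothetical column layout for a codebook; adapt it to your own project.
FIELDS = ["variable", "label", "type", "values", "missing"]

# Hypothetical entries; the value codes and missing-value code are illustrative.
CODEBOOK = [
    {"variable": "motoc", "label": "Mother's occupation", "type": "categorical",
     "values": "1=employed; 2=unemployed; 3=retired", "missing": "-99"},
    {"variable": "rtms", "label": "Reaction time in milliseconds", "type": "numeric",
     "values": "", "missing": "-99"},
]


def write_codebook(entries, handle):
    """Write the codebook entries as CSV rows, preceded by a header line."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(entries)


def codebook_csv(entries):
    """Return the codebook as CSV text, e.g. to save next to the data file."""
    buffer = io.StringIO()
    write_codebook(entries, buffer)
    return buffer.getvalue()
```

Because the result is plain text, it stays readable when the data file itself is converted to another format.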

Variable names: best practices

Use valid variable names in tabular data:

Use meaningful abbreviations such as “motoc” (mother's occupation) or “rtms” (reaction time in milliseconds).
Use variable names that are related to your data collection method. E.g. question number system related to questions in a survey/questionnaire: q1a, q1b, q2, q3a.
Avoid a simplistic numerical ordering system such as v1, v2, v3.

Be consistent:

Don’t change variable names across (versions of) datasets (e.g. don’t switch between “gender” and “sex”).
Use one language.
Use a maximum of 8 characters.
Do not use spaces or special characters, and use lower case (e.g. “gender”, not “Gender”).
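The formal rules above (length, lower case, no spaces or special characters) can be checked automatically. The sketch below is one possible check, not an official tool: the function name is hypothetical, and allowing digits and underscores is an assumption, since the guidance only rules out spaces and special characters.

```python
import re

MAX_LENGTH = 8  # maximum length recommended in the list above

# Assumption: lower-case letters, digits and underscores are acceptable.
ALLOWED = re.compile(r"[a-z0-9_]+")


def check_variable_name(name):
    """Return a list of problems found in a variable name (empty if none)."""
    if not name:
        return ["is empty"]
    problems = []
    if len(name) > MAX_LENGTH:
        problems.append(f"longer than {MAX_LENGTH} characters")
    if name != name.lower():
        problems.append("contains upper-case letters")
    # Check the character set on the lower-cased name so that case and
    # character problems are reported separately.
    if not ALLOWED.fullmatch(name.lower()):
        problems.append("contains characters other than a-z, 0-9 and underscore")
    return problems


if __name__ == "__main__":
    for candidate in ["rtms", "q1a", "Gender", "mother occ"]:
        print(candidate, check_variable_name(candidate))
```

A check like this only covers the formal rules; whether a name is meaningful and used consistently across datasets still needs human judgement.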

Qualitative research

Data list

Qualitative research projects often collect a large number of files (e.g. interview recordings, transcripts). To keep an overview, it is often a good idea to maintain a 'data list':

For a collection of qualitative data (e.g. interview transcripts, a/v-recordings)
Single at-a-glance finding aid: easily identify and locate relevant items within a data collection
Identifying details of the data items: e.g. file name(s), description, format, size, date
Descriptive attributes of participants/entities studied: e.g. age, gender, occupation or location
Unique identifier for each item
Notes, e.g. to indicate where parts of data are missing

Data list templates can be found via the UK Data Service's Data-level documentation page.
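The file-related columns of a data list can be pre-filled from a folder of files, leaving descriptions, participant attributes and notes to be completed by hand. The sketch below is an illustration under assumptions: the column names and the "item001" identifier scheme are hypothetical and not taken from the UK Data Service templates.

```python
import csv
import datetime
from pathlib import Path

# Hypothetical column layout for a data list.
FIELDS = ["id", "file_name", "format", "size_bytes", "date", "description", "notes"]


def build_data_list(folder):
    """Return one record per file in the folder, sorted by file name."""
    files = sorted(p for p in Path(folder).iterdir() if p.is_file())
    records = []
    for number, path in enumerate(files, start=1):
        stat = path.stat()
        records.append({
            "id": f"item{number:03d}",  # hypothetical identifier scheme
            "file_name": path.name,
            "format": path.suffix.lstrip(".").lower(),
            "size_bytes": stat.st_size,
            # Last-modified date of the file, in ISO format (YYYY-MM-DD).
            "date": datetime.date.fromtimestamp(stat.st_mtime).isoformat(),
            "description": "",  # to be filled in by hand
            "notes": "",        # e.g. where parts of the data are missing
        })
    return records


def write_data_list(records, target):
    """Save the data list as a CSV file at the given path."""
    with open(target, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(records)
```

Saving the list outside the data folder avoids it being picked up as a data item the next time the list is rebuilt.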

NVivo

Transcribing and documenting qualitative data is often done using NVivo. NVivo is available for UGent researchers through Athena.

For more information about NVivo, see https://onderzoektips.ugent.be/nl/tips/00001699/.

Metadata

Metadata are 'data about data', used to describe and annotate data. They are a highly structured, machine-readable form of data documentation.

A well-known example of metadata is the information about a book typically found in a library catalogue:

Author
Publication year
Title
Publisher
Location

Similarly, trusted data repositories expose descriptive metadata about the datasets they hold online, so that they can be searched and discovered. Other types of metadata also exist, e.g. administrative, technical, rights, or preservation metadata.

Metadata can be embedded within data files, or captured in separate files.

Metadata schemas

Metadata comprise a fixed set of elements, as defined by a particular metadata schema.

When creating metadata, it is good practice not to invent your own schema, but to make use of existing metadata standards. A famous example is the Dublin Core schema, which in its simplest form comprises a set of 15 metadata elements applicable to a wide range of data types. Many more metadata standards exist, some of which are domain-specific.
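To give an idea of what a record based on the 15 simple Dublin Core elements might look like, here is a minimal sketch. The dataset described is entirely made up, and the helper function is hypothetical; real repositories usually define their own input forms and serialisations for such metadata.

```python
import json

# The 15 elements of the simple Dublin Core element set.
DC_ELEMENTS = (
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
)


def unknown_elements(record):
    """Return the keys of a record that are not Dublin Core elements."""
    return sorted(set(record) - set(DC_ELEMENTS))


# Hypothetical dataset description; all values are invented for illustration.
record = {
    "title": "Interview transcripts on commuting habits",
    "creator": "Doe, Jane",
    "date": "2023-05-01",
    "type": "Dataset",
    "format": "text/plain",
    "language": "en",
    "rights": "CC BY 4.0",
}

if __name__ == "__main__":
    # A structured, machine-readable form of the record.
    print(json.dumps(record, indent=2))
    print(unknown_elements(record))
```

Restricting a record to a fixed set of element names is what makes it possible for different systems to search and exchange it.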

There are a number of registries where you can look for disciplinary metadata standards:
