Documenting data

Data documentation comprises any contextual and descriptive information needed to find, assess, understand, and (re)use research data.

Why document data?

Research data should always be accompanied by documentation because it:

  • Enables you to understand/interpret data later
  • Makes data independently understandable, i.e. reusable
  • Makes results independently reproducible, starting from raw data
  • Helps avoid incorrect use/misinterpretation

As such, documentation is an essential step in making your data FAIR.

When?

The main pitfall in data documentation is procrastination. Will you really remember next month why you manipulated the data the way you did? Or next year? And how about the abbreviations you used for your variables? Will you still remember them in two years’ time?

Best practice is to start gathering meaningful information as early as possible in the research process.

Filenames and the organization of folders can also provide important context about a data collection. 

How?

Data should be documented both at the study level and the data (item) level.

Study-level documentation concerns contextual information about a study/project:

  • The background, aims, objectives, and the hypotheses that shape the research
  • Procedural & methodological information such as the methods used, data preparations and manipulations, summaries of findings, and temporary results

Examples:

  • Data collection protocols & procedures
  • List of all datasets
  • Preliminary findings (ranked by date)
  • Extent of the research (i.e. temporal or geographical)
  • Instruments used - questionnaires, showcards, interview schedules, etc.
  • Temporal/geographic coverage
  • Description of data validation methods - cleaning, error-checking
  • Compilation of derived variables
  • Source(s) of secondary data
  • List of selection criteria for literature study
  • Reports, final publication, user guide, working paper, lab books
  • Blank informed consent forms
  • Records of interviewees and their demographic characteristics

Data-level documentation provides information about datasets and/or individual data items, and about variables within data sets. It is the primary source for others and your future self to understand the data.

Information about datasets can be:

  • An inventory of the data files
  • Relationships between data files
  • Annotated processing scripts that generate/process data sets
  • Anonymisation methods carried out

Information about variables can be:

  • Labels, codes, classifications
  • (Coding of) missing values
  • Explanations of variable derivations and aggregations

Information about data items can be recorded in a codebook. Typically this is a separate file, but some data formats allow this information to be embedded in the data file (e.g. the SPSS .sav file format).

While embedding variable information within the data file may seem easy at first glance, it also comes with risks. For instance, when the data file is converted to another format, the embedded documentation often gets lost. Also, embedded documentation is limited to a predefined set of metadata fields. It is therefore advisable to use a separate codebook file, which is more flexible.
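As an illustration of a separate codebook file, the following minimal sketch writes variable-level documentation to CSV using only the Python standard library. The column names (variable, label, type, values, missing) and the example entries are assumptions made for this illustration, not a prescribed codebook format.

```python
import csv
import io

# Hypothetical codebook columns; adapt them to your own project.
FIELDS = ["variable", "label", "type", "values", "missing"]

# Hypothetical example entries; the labels and codes are illustrative only.
CODEBOOK = [
    {"variable": "q1a", "label": "Age of respondent in years",
     "type": "integer", "values": "", "missing": "-99 = not answered"},
    {"variable": "motoc", "label": "Mother's occupation",
     "type": "categorical",
     "values": "1 = employed; 2 = self-employed; 3 = not working",
     "missing": "-99 = not answered"},
    {"variable": "rtms", "label": "Reaction time in milliseconds",
     "type": "float", "values": "", "missing": "empty cell = trial skipped"},
]


def write_codebook(rows, handle):
    """Write the codebook rows as CSV to an open text handle."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


def codebook_as_text(rows):
    """Return the codebook as a CSV string, e.g. for a quick preview."""
    buffer = io.StringIO(newline="")
    write_codebook(rows, buffer)
    return buffer.getvalue()


if __name__ == "__main__":
    # Store the codebook next to the data file it describes.
    with open("codebook.csv", "w", newline="", encoding="utf-8") as handle:
        write_codebook(CODEBOOK, handle)
```

Because the codebook is plain CSV, it remains readable when the data file itself is converted to another format.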

Use valid variable names in tabular data:

  • Use meaningful abbreviations such as “motoc” (mother's occupation) or “rtms” (reaction time in milliseconds).
  • Use variable names that are related to your data collection method, e.g. a question-numbering system that matches the questions in a survey/questionnaire: q1a, q1b, q2, q3a.
  • Avoid a simplistic numerical order system such as v1, v2, v3.

Be consistent:

  • Don’t change variable names across (versions of) datasets (e.g. don't switch between “gender” and “sex”).
  • Use one language.
  • Use a maximum of 8 characters.
  • Do not use spaces or special characters, and use lower case (e.g. “gender”, not “Gender”).
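The naming rules listed above can be checked automatically. The following is a minimal sketch, assuming that a valid name starts with a lower-case letter and contains only lower-case letters and digits; that pattern, the function name check_variable_name and the wording of the messages are assumptions made for this illustration.

```python
import re

MAX_LENGTH = 8

# Assumption: a name starts with a lower-case letter, followed only by
# lower-case letters and digits (no spaces, no special characters).
NAME_PATTERN = re.compile(r"[a-z][a-z0-9]*")

# Names such as v1, v2, v3 carry no meaning and are flagged separately.
NUMBERED_PATTERN = re.compile(r"v\d+")


def check_variable_name(name):
    """Return a list of problems found in a variable name (empty if none)."""
    problems = []
    if len(name) > MAX_LENGTH:
        problems.append("longer than 8 characters")
    if name != name.lower():
        problems.append("contains upper-case letters")
    if not NAME_PATTERN.fullmatch(name.lower()):
        problems.append("contains spaces or special characters")
    if NUMBERED_PATTERN.fullmatch(name):
        problems.append("simplistic numbered name")
    return problems


if __name__ == "__main__":
    for candidate in ["motoc", "q1a", "Gender", "v1", "reaction time"]:
        print(candidate, "->", check_variable_name(candidate) or "ok")
```

Such a check could be run on the column headers of a tabular file before it is shared or archived. It cannot judge whether an abbreviation is meaningful, so the codebook remains necessary.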

Data list

In qualitative research projects, many files are often collected (e.g. interview recordings, transcripts). To keep an overview, it is a good idea to maintain a 'data list':

  • For a collection of qualitative data (e.g. interview transcripts, a/v-recordings)
  • Single at-a-glance finding aid: easily identify and locate relevant items within a data collection
  • Identifying details of the data items: e.g. file name(s), description, format, size, date
  • Descriptive attributes of participants/entities studied: e.g. age, gender, occupation or location
  • Unique identifier for each item
  • Notes, e.g. to indicate where parts of data are missing

Data list templates can be found via the UK Data Service's Data-level documentation page.
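The identifying details of a data list (file name, format, size, date, unique identifier) can be collected automatically as a starting point. The sketch below is one possible approach using only the Python standard library; the column names, the identifier scheme (item001, item002, ...) and the folder name in the usage example are assumptions made for this illustration. Descriptions, participant attributes and notes still have to be filled in by hand.

```python
import csv
from datetime import datetime, timezone
from pathlib import Path

# Hypothetical columns for the data list.
FIELDS = ["item_id", "file_name", "format", "size_bytes",
          "modified", "description", "notes"]


def build_data_list(folder):
    """Return one row per file in the folder, sorted by file path."""
    rows = []
    files = sorted(p for p in Path(folder).iterdir() if p.is_file())
    for number, path in enumerate(files, start=1):
        info = path.stat()
        modified = datetime.fromtimestamp(info.st_mtime, tz=timezone.utc)
        rows.append({
            "item_id": f"item{number:03d}",  # assumed identifier scheme
            "file_name": path.name,
            "format": path.suffix.lstrip(".").lower(),
            "size_bytes": info.st_size,
            "modified": modified.date().isoformat(),
            "description": "",  # to be completed manually
            "notes": "",        # e.g. where parts of the data are missing
        })
    return rows


def save_data_list(rows, target):
    """Write the data list rows to a CSV file."""
    with open(target, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)


if __name__ == "__main__":
    # "interviews" is a hypothetical folder name.
    save_data_list(build_data_list("interviews"), "data_list.csv")
```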

NVivo

Transcribing and documenting qualitative data is often done using NVivo. NVivo is available for UGent researchers through Athena.

For more information about NVivo, see https://onderzoektips.ugent.be/nl/tips/00001699/.

Metadata

Metadata are 'data about data', used to describe and annotate data. They are a highly structured, machine-readable form of data documentation. 

A well-known example of metadata is the information about a book typically found in a library catalogue:

  • Author
  • Publication year
  • Title
  • Publisher
  • Location

Similarly, trusted data repositories expose descriptive metadata about the datasets they hold online, so that they can be searched and discovered. Other types of metadata also exist, e.g. administrative, technical, rights, or preservation metadata.

Metadata can be embedded within data files, or captured in separate files.

Metadata schemas

Metadata comprise a fixed set of elements, as defined by a particular metadata schema.

When creating metadata, it is good practice not to invent your own schema, but to make use of existing metadata standards. A famous example is the Dublin Core schema, which in its simplest form comprises a set of 15 metadata elements applicable to a wide range of data types. Many more metadata standards exist, some of which are domain-specific.
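To make this concrete, the sketch below builds a small machine-readable record from the 15 elements of the simple Dublin Core element set, using Python's standard xml.etree.ElementTree module. The wrapper element name "metadata", the function name build_dc_record and all example values are assumptions made for this illustration; a repository will normally prescribe its own record format.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set, version 1.1.
DC_NAMESPACE = "http://purl.org/dc/elements/1.1/"

# The 15 elements of simple Dublin Core.
DC_ELEMENTS = (
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
)


def build_dc_record(fields):
    """Build an XML record from a dict of Dublin Core element values.

    A value may be a single string or a list of strings, since every
    Dublin Core element is optional and repeatable.
    """
    ET.register_namespace("dc", DC_NAMESPACE)
    record = ET.Element("metadata")  # hypothetical wrapper element
    for name, value in fields.items():
        if name not in DC_ELEMENTS:
            raise ValueError(f"{name!r} is not a simple Dublin Core element")
        values = [value] if isinstance(value, str) else value
        for item in values:
            element = ET.SubElement(record, f"{{{DC_NAMESPACE}}}{name}")
            element.text = item
    return record


# Hypothetical example values.
example = build_dc_record({
    "title": "Interview transcripts on commuting habits",
    "creator": ["Doe, Jane", "Roe, Richard"],
    "date": "2020-05-01",
    "type": "Dataset",
    "language": "en",
})
xml_text = ET.tostring(example, encoding="unicode")

if __name__ == "__main__":
    print(xml_text)
```

Rejecting names outside the fixed element list reflects the point made above: a schema defines which elements are allowed, so a field such as "author" has to be mapped to the schema's own element ("creator").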

There are a number of registries where you can look for disciplinary metadata standards:

More information