Collecting and organizing data

Research data can be gathered through observation, manual or automatic measurements in the laboratory or in the field, with remote sensing techniques, by interviews, by modelling and simulation, etc. Data can also be stored in many formats.

Considering what data will be collected, generated, and/or reused, and how you will organize them, is an important part of Research Data Management.

File formats

What is a file format?

A file format describes how information is stored within a digital file. Although each file format is unique, different file formats exist for similar types of information (e.g. text can be stored in a plain text file as well as in a Word file).

On most computer systems, the format of a file is indicated by the ‘extension’ in the filename (e.g. .txt, .csv). The extension provides an immediate clue about the type of data within a file. For example, we expect that a file with a .jpg extension is an image, whereas a .docx should contain formatted text.
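
As a minimal sketch, the extension can be read programmatically with Python's standard pathlib module. The lookup table and the function name guess_content below are hypothetical and for illustration only. Keep in mind that an extension is just a hint: it does not guarantee what the file actually contains.

```python
from pathlib import Path

# Hypothetical lookup table, for illustration only (not an exhaustive registry).
EXPECTED_CONTENT = {
    ".txt": "plain text",
    ".csv": "comma-separated values",
    ".jpg": "image",
    ".docx": "formatted text",
}


def guess_content(filename: str) -> str:
    """Guess the type of content from the file extension alone."""
    extension = Path(filename).suffix.lower()
    return EXPECTED_CONTENT.get(extension, "unknown")


print(guess_content("holiday.JPG"))   # image
print(guess_content("report.docx"))   # formatted text
print(guess_content("data.xyz"))      # unknown
```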

The difference between file formats is situated at the following levels:

Simple vs complex formats: e.g. the .txt format is a very simple way of storing text, while a .docx file has more complex properties.
Open vs closed file formats: closed (or proprietary) file formats are not open, in the sense that they cannot be freely used. Often they are owned by companies or are patented. Open formats can be used and implemented by anyone.

Which file formats to use?

The choice of file formats to use for research data depends on:

Discipline-specific standards and customs
Planned data analyses
Software availability/cost
Hardware used – e.g. audio capture, fMRI scanner

Risks

Using a specific format can carry risks. For instance, a format that can only be opened with specific software makes the digital data vulnerable to obsolescence of that software. This can lead to situations where you are locked out of your own data.

Converting data from one format to another can also cause metadata or formatting to be lost. Therefore, it is good practice to choose your formats with long-term access in mind.

Best practices

To offer the best long-term guarantees in terms of usability, accessibility and sustainability, file formats should have the following characteristics:

Non-proprietary (not protected by trademark, patent or copyright)
Open, documented standard
Common usage by research community
Standard representation (ASCII, Unicode)
Lossless compression (as opposed to lossy compression)
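
The last characteristic can be illustrated with a small sketch. Lossless compression means that decompressing gives back exactly the original bytes. Python's built-in zlib module is used here only as a convenient example of a lossless method, and the sample data are invented.

```python
import zlib

# Invented, highly repetitive sample data.
original = b"temperature,21.5\n" * 1000

compressed = zlib.compress(original)
restored = zlib.decompress(compressed)

# Lossless: the restored bytes are identical to the original bytes.
print(restored == original)             # True
# The compressed form is smaller for this repetitive sample.
print(len(compressed) < len(original))  # True
```

A lossy method (as used by many image, audio and video formats) discards information to save space, so the original data cannot be fully recovered afterwards.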

Recommended formats

Examples of recommended file formats for different types of data can be found via:

DANS, File formats
UK Data Service, Recommended formats

File naming

A file name is the principal identifier of a file. Therefore, good file names should:

Provide useful cues to content, status and version
Uniquely identify a file
Help to classify and sort files

As such, file names that reflect the content of the file will facilitate searching, discovering and understanding of the data.

File name elements

File names can be constructed using the following elements:

Project acronym
Content description
Date
Location
Creator name/initials
Status information (e.g. draft or final)
….

Best practices

When creating a file name try to:

Give a unique name.
Use elements essential to identify the file.
Avoid long names, remove unnecessary elements.
For dates, use the ISO 8601 standard (e.g. YYYYMMDD). This will keep the files sorted chronologically.
For versioning via filename, use ascending, decimal version numbers.
Try to only use alphanumerical characters (A-Z, a-z, 0-9).
Do not use special characters like \ / : * ? " < > | ! % & - ; = () + , .
Do not use spaces. Use an underscore ("_") for separation.
Do not alter or remove the extension of a file (e.g. .txt, .sav, .mp4, .docx).
Be consistent in how you build up names and make sure there is consensus among all team/project members.
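
The tips above could be applied in code roughly as follows. This is a sketch under assumptions: the function names build_file_name and follows_conventions, the order of the elements and the "v" prefix for the version number are all hypothetical choices, not a prescribed convention.

```python
import re
from datetime import date


def build_file_name(project: str, content: str, day: date,
                    version: int, extension: str) -> str:
    """Join the elements with underscores; the date is written as YYYYMMDD."""
    parts = [project, content, day.strftime("%Y%m%d"), f"v{version:02d}"]
    return "_".join(parts) + extension


def follows_conventions(file_name: str) -> bool:
    """True if the name uses only A-Z, a-z, 0-9 and underscores before one extension."""
    return re.fullmatch(r"[A-Za-z0-9_]+\.[A-Za-z0-9]+", file_name) is not None


name = build_file_name("DEMO", "survey", date(2021, 5, 4), 1, ".csv")
print(name)                                          # DEMO_survey_20210504_v01.csv
print(follows_conventions(name))                     # True
print(follows_conventions("my notes (final).csv"))   # False
```

Building names with one shared function is one way to keep them consistent across a team, since everyone then gets the same element order and date format.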

Examples

Some examples of good file names are:

CONS_INT1_12-03-2019.rtf	Result of interview 1 of the Consumers research on 12/03/2019
GC-MS1_20180912_POLY03.ms	Polymer 3 measured on GC-MS machine 1 on September 12, 2018
GC-MS2_20180914_POLY08.ms	Polymer 8 measured on GC-MS machine 2 on September 14, 2018
GC-MS2_20180914_POLY08.pptx	Chromatogram of the polymer 8 measurement, shown in a PowerPoint presentation with all relevant peaks labeled
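
A consistent pattern also makes file names machine-readable. The sketch below assumes names of the form INSTRUMENT_YYYYMMDD_SAMPLE.ext, as in the GC-MS examples above. The function name parse_name and the dictionary keys are hypothetical.

```python
from datetime import datetime
from pathlib import Path


def parse_name(file_name: str) -> dict:
    """Split an INSTRUMENT_YYYYMMDD_SAMPLE.ext name into its elements."""
    path = Path(file_name)
    instrument, day, sample = path.stem.split("_")
    return {
        "instrument": instrument,
        "date": datetime.strptime(day, "%Y%m%d").date(),
        "sample": sample,
        "extension": path.suffix,
    }


info = parse_name("GC-MS2_20180914_POLY08.ms")
print(info["instrument"])        # GC-MS2
print(info["date"].isoformat())  # 2018-09-14
print(info["sample"])            # POLY08

# Sorting on the parsed date puts the measurements in chronological order.
names = ["GC-MS2_20180914_POLY08.ms", "GC-MS1_20180912_POLY03.ms"]
by_date = sorted(names, key=lambda n: parse_name(n)["date"])
print(by_date[0])                # GC-MS1_20180912_POLY03.ms
```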

Folder/data organization

To store data in such a way that results can be reproduced and data can be re-used, it is important to work in a well-organized folder structure. Using a standard way of organizing research files has clear advantages, both for your daily work and when sharing data with colleagues or others.

Almost all research domains require specific ways of organizing and structuring the stored research data. Therefore, it is difficult to provide general guidelines. To demonstrate how the general principle of structuring research data can be implemented, we provide some examples. All examples are based on real studies.

Examples

Experimental research (https://github.ugent.be/jlammert/folders-experimental)
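
As a generic illustration, a folder skeleton can also be created with a few lines of Python, so that every new project starts from the same structure. The folder names and the function name create_project_skeleton below are hypothetical and are not taken from the linked example; adapt them to the conventions of your own domain.

```python
import tempfile
from pathlib import Path

# Hypothetical folder names, for illustration only.
FOLDERS = [
    "data/raw",
    "data/processed",
    "documentation",
    "analysis",
    "results",
]


def create_project_skeleton(root: Path) -> list[Path]:
    """Create the folders under root and return all created directories, sorted."""
    for folder in FOLDERS:
        (root / folder).mkdir(parents=True, exist_ok=True)
    return sorted(p for p in root.rglob("*") if p.is_dir())


# Demonstration in a temporary directory, so nothing is left behind on disk.
with tempfile.TemporaryDirectory() as tmp:
    created = create_project_skeleton(Path(tmp) / "my_project")
    for folder in created:
        print(folder.relative_to(tmp))
```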

Version control

When you work with different versions of a file, it can be a challenge to locate the 'correct' version or to know how versions differ from each other. If versions are not managed well, it can even be difficult to know which file preceded which.

The matter is complicated further when files are kept in multiple locations and edited by multiple users. To avoid confusion and safeguard against accidental loss, a versioning system can be put in place.

Different approaches can be taken to provide version information about a file: manual or automatic version control methods.

Manual methods

File names

File names are a simple way to manually give information about the version/status of a file. This can be done by:

Including a date in the file name, e.g.: HealthTest-2008-04-06.docx
Including a version number in the file name, e.g.: HealthTest-00-02.docx or HealthTest-v02.docx
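
When version numbers are part of the file name, the next name can be derived automatically. The sketch below assumes the "-vNN" pattern from the HealthTest-v02.docx example. The function name next_version is hypothetical.

```python
import re


def next_version(file_name: str) -> str:
    """Increase a trailing '-vNN' version number by one, keeping its width."""
    match = re.fullmatch(r"(.+-v)(\d+)(\.[A-Za-z0-9]+)", file_name)
    if match is None:
        raise ValueError(f"no version number found in {file_name!r}")
    prefix, number, extension = match.groups()
    # Zero-pad to the same number of digits as the current version.
    return f"{prefix}{int(number) + 1:0{len(number)}d}{extension}"


print(next_version("HealthTest-v02.docx"))  # HealthTest-v03.docx
print(next_version("HealthTest-v09.docx"))  # HealthTest-v10.docx
```

Zero-padded numbers (v02 rather than v2) keep the files in the right order when they are sorted by name.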

Version history table

A version history table is kept within the file itself or in a separate file (e.g. a file history, a version control table or notes). It is used to record versions, dates, authors and details of changes to files.

Example:

Version	What was changed?	By whom	When?
1	Initial draft	Godfried Bomans	12/05/2019
2	Revised Intro	Godfried Bomans	14/05/2019
3	Added Methodology	Louis Paul Boon	18/05/2019
4	Reviewed by promotor	Matthaeus, Marcus, Lucas, Johannes	21/06/2019
5	Accepted changes V4; added final figures	Godfried Bomans	26/06/2019
6	Final version for submission	Godfried Bomans	03/07/2019
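
A table like this can be kept in a separate tab-separated file and updated with a small script. The following is a sketch only: the function name add_version_entry and the file name history.tsv are hypothetical, and the entries in the demonstration are invented.

```python
import csv
import tempfile
from pathlib import Path

HEADER = ["Version", "What was changed?", "By whom", "When?"]


def add_version_entry(table: Path, change: str, author: str, when: str) -> int:
    """Append a row to the history table and return the new version number."""
    rows = []
    if table.exists():
        with table.open(newline="", encoding="utf-8") as handle:
            rows = list(csv.reader(handle, delimiter="\t"))
    if not rows:
        rows = [HEADER]
    # The header occupies one row, so the row count equals the next version number.
    version = len(rows)
    rows.append([str(version), change, author, when])
    with table.open("w", newline="", encoding="utf-8") as handle:
        csv.writer(handle, delimiter="\t").writerows(rows)
    return version


# Demonstration in a temporary directory with invented entries.
with tempfile.TemporaryDirectory() as tmp:
    history = Path(tmp) / "history.tsv"
    add_version_entry(history, "Initial draft", "Author A", "2019-05-12")
    add_version_entry(history, "Revised intro", "Author A", "2019-05-14")
    print(history.read_text(encoding="utf-8"))
```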

Best practices

When working with different versions of files, try to take these tips into account:

Identify milestone versions of files to keep. Avoid clutter.
Use a systematic naming convention.
Record version and status of a datafile, e.g. raw, cleaned.
Document what changes are made to a file when a new version is created.
Document relationships between files where needed.
Track the location of files. When collaborating, use a common ‘workspace’ (e.g. a shared folder or network share) to avoid different versions of files lingering in different locations.
Keep file-sharing out of e-mail.

Automatic methods

Built-in versioning

Some software and platforms provide built-in version control. For instance, all Microsoft Office files stored on SharePoint or OneDrive have an automatic version history.

Versioning software

Specific software exists to systematically manage version information about files. Two of the most widely used examples of versioning software are Git and Subversion.

Cloud platforms also exist that allow for simultaneous collaboration and version control. Examples are GitHub and GitLab.
