Choosing file formats
When preparing to collect research data, you should choose open, well-documented and non-proprietary formats wherever possible. The best choice of format depends on how you plan to analyse, store and share your data.
You should consider the key questions below, and consult the advice on specific media, long and short term formats, dissemination and preservation formats, and proprietary formats. Guides with recommended formats are available below.
Four key questions in choosing formats
- How do you plan to use the data that you produce? How will you store, share and analyse your data?
- Do you have any funding for new software, if it is required? You can check the UCL ISD software database, which lists software that is available to you.
- Do your peers expect your data to appear in certain formats? Do you have access to expertise in particular software or to best practice information for your discipline or research area?
- Does your funder have expectations regarding how you present your data? Read our advice on funders’ expectations.
Specific advice for text, images, audio and video files
Before choosing formats, you should consider the following advice:
- All text-based files should be encoded in UTF-8 where possible. This option is usually available when saving a file, and allows characters to display correctly between different text viewing tools. Read more about choosing a character encoding scheme.
- In choosing a format for images, you should consider the trade-off between file size, image quality, and support in image viewers. TIFF and JP2 are preferable as preservation formats because they can store image data uncompressed or with lossless compression. However, the resulting files are large and cannot be opened in all image viewers. JPG is generally accepted to be a good dissemination format, but its compression is lossy, so some image information is discarded.
- Audio files should be in WAV format for preservation and can be compressed to MP3 for dissemination.
- There is little consensus on the best format for video data, but that should not stop you from proceeding with collecting data. The UCL CAVA project addressed some of these issues in its File Formats report. See the Jisc Digital Media Infokit for advice on how to proceed.
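As a minimal sketch of the UTF-8 advice above, the following Python functions re-save a text file as UTF-8 and check whether a file already decodes cleanly as UTF-8. The function names are hypothetical, and the sketch assumes you already know the encoding of the original file; it does not attempt to detect it.

```python
from pathlib import Path


def convert_to_utf8(source: Path, target: Path, source_encoding: str) -> None:
    """Re-save a text file as UTF-8.

    Assumption: the caller knows the original encoding (for example
    "latin-1" or "cp1252"); this sketch does not guess it.
    """
    # newline="" keeps the original line endings unchanged.
    with open(source, "r", encoding=source_encoding, newline="") as f:
        text = f.read()
    with open(target, "w", encoding="utf-8", newline="") as f:
        f.write(text)


def is_valid_utf8(path: Path) -> bool:
    """Return True if the file's bytes decode cleanly as UTF-8."""
    try:
        path.read_bytes().decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

Writing to a separate target file, rather than overwriting the source, means the original is still available if the assumed source encoding turns out to be wrong.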
Recommended format guides
You should consult one of the following guides to see current opinion on best practice for choosing file formats:
Long term preservation and short term dissemination formats
When choosing preservation formats, you need to consider the trade-off between functionality and longevity. For example, while Office formats are well-documented, they are proprietary and might be altered by Microsoft. Although it is unlikely that they will become obsolete, you may choose a truly open format such as PDF/A instead of DOCX, CSV instead of XLSX, or XML instead of a database.
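To illustrate the "XML instead of a database" option, here is a minimal sketch that exports one table from an SQLite database to a simple XML tree using only the Python standard library. The function name and the `table`/`row`/`field` element layout are assumptions made for this example, not a recommended schema; a real project would document its own structure.

```python
import sqlite3
import xml.etree.ElementTree as ET


def table_to_xml(conn: sqlite3.Connection, table: str) -> ET.ElementTree:
    """Export every row of one table as a simple XML tree.

    Assumption: `table` is a trusted name, because it is interpolated
    directly into the SQL statement.
    """
    cursor = conn.execute(f'SELECT * FROM "{table}"')
    columns = [description[0] for description in cursor.description]
    root = ET.Element("table", name=table)
    for row in cursor:
        row_element = ET.SubElement(root, "row")
        for column, value in zip(columns, row):
            field = ET.SubElement(row_element, "field", name=column)
            if value is None:
                # Mark NULL explicitly so it is not confused with "".
                field.set("null", "true")
            else:
                # Values are stored as text; binary (BLOB) columns
                # would need separate handling.
                field.text = str(value)
    return ET.ElementTree(root)
```

The returned tree can then be saved with `tree.write("samples.xml", encoding="utf-8", xml_declaration=True)`, where `samples.xml` is a placeholder file name. Column types are not recorded in this sketch, so you would need to describe them in accompanying documentation.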
Consider exactly what it is that you want to preserve. Are the formatting and functionality important, or do you just want to preserve the raw information? It may be important to preserve everything, including the look and feel of the original, for evidential or ethics purposes. Your funder may also have requirements in this regard. Please check our advice on funders' expectations.
You may want to keep working on your data as you begin to store it. In this case, some preservation formats may make it more difficult to alter your data. For example, PDF files are harder to edit than DOCX files. You might periodically save long-term preservation copies of your data, while keeping working copies in a short-term format. See our advice on producing a Data Management Plan.
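One possible way to keep periodic preservation copies alongside a working copy is sketched below. It assumes the file has already been exported to a preservation format (the conversion itself is not shown), and the function name, the dated file naming scheme and the read-only step are all illustrative choices rather than a prescribed workflow.

```python
import shutil
import stat
from datetime import date
from pathlib import Path


def save_preservation_copy(working_file: Path, archive_dir: Path, on: date) -> Path:
    """Copy a file into archive_dir under a dated name and make the copy read-only."""
    archive_dir.mkdir(parents=True, exist_ok=True)
    target = archive_dir / f"{working_file.stem}_{on.isoformat()}{working_file.suffix}"
    if target.exists():
        # Never overwrite an earlier snapshot silently.
        raise FileExistsError(f"{target} already exists; refusing to overwrite")
    shutil.copy2(working_file, target)
    # Remove write permission so the snapshot is not edited by accident.
    target.chmod(stat.S_IREAD | stat.S_IRGRP | stat.S_IROTH)
    return target
```

For example, `save_preservation_copy(Path("survey.csv"), Path("preservation"), date.today())` would leave `survey.csv` editable and add a dated, read-only copy in the `preservation` folder; both names are placeholders.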
Dissemination and preservation formats
You may decide to distinguish between formats for preservation and for dissemination of your data. In selecting both preservation and dissemination formats, you should consider the following:
- How do you expect your data to be reused? If you want others to edit and manipulate your data, choose a format that allows this. If you want your data to be read-only, choose a format that does not allow editing.
- What kind of licence will apply to your dissemination copies? See our advice on copyright and licences.
- How will dissemination copies of your data be accessed? A dissemination version is likely to be accessible through a repository or other web-based resource, and should be optimised for this. See a case study from the UCL CAVA project.
It is not always necessary to produce both preservation and dissemination copies. For example, if your data appears in text documents, you could use PDF/A for both the preservation and the dissemination format. For images, you may want to use TIFF or JP2 for preservation and web-optimised JPG for dissemination.
Dealing with proprietary formats
Sometimes you may have no choice but to use proprietary software to produce your data. For example, researchers producing transcriptions of audio and video data may need to use ELAN, and researchers working with GIS data are likely to use ArcGIS or Google Earth.
The outputs from these tools may not be convertible to non-proprietary formats, or some functionality may be lost in converting the files. The software provider may also choose to alter the software, and in doing so may remove the functionality that you rely on.
The best course of action is to be aware of the limitations of the software you use. You should:
- Always keep up to date with new versions of the software, and migrate your data to the latest versions where possible. This might be as simple as re-saving the files.
- Keep the most important information in a different format – for example, a spreadsheet or text file.
If you have questions about how to deal with a specific format, or any other questions, please contact the Digital Curation Team.
This advice is partly based on the University of Cambridge Data Management Format FAQs.