LibGuides: SOM Data Management: Naming and Format

File Organization

File Naming

Having a consistent file naming scheme will help you keep track of all your data. It will help you avoid computational mistakes when you analyze the data. It will help you browse your data and see what is in a file folder at a glance. Finally, when you return to old data, it will help you remember what is in each file.

The most important part of creating a file naming scheme is choosing something that you can consistently follow. The second most important part is documenting it. This information can go in a plain-text README in the folder(s) where you are storing your files.

Source: File Naming Best Practices document created by Christine Malinowski, MIT Libraries
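
As an illustration only, the sketch below builds and checks names under one hypothetical scheme, `YYYYMMDD_project_description_vNN.ext`. The scheme, the function names `build_name` and `is_valid_name`, and the sample values are assumptions made for this example and are not taken from this guide or the MIT document; substitute whatever convention you document in your README.

```python
import re
from datetime import date

# Hypothetical scheme: YYYYMMDD_project_description_vNN.ext
# (lowercase letters, digits, and hyphens inside each part; no spaces)
NAME_PATTERN = re.compile(
    r"^\d{8}_[a-z0-9-]+_[a-z0-9-]+_v\d{2}\.[a-z0-9]+$"
)


def build_name(day: date, project: str, description: str, version: int, ext: str) -> str:
    """Assemble a file name that follows the hypothetical scheme above."""
    def clean(part: str) -> str:
        # Lowercase, then turn each run of other characters into one hyphen.
        return re.sub(r"[^a-z0-9]+", "-", part.lower()).strip("-")

    extension = ext.lower().lstrip(".")
    return f"{day:%Y%m%d}_{clean(project)}_{clean(description)}_v{version:02d}.{extension}"


def is_valid_name(name: str) -> bool:
    """Return True if the name matches the scheme."""
    return NAME_PATTERN.fullmatch(name) is not None


name = build_name(date(2024, 3, 5), "Sleep Study", "survey responses", 2, "csv")
print(name)                              # 20240305_sleep-study_survey-responses_v02.csv
print(is_valid_name(name))               # True
print(is_valid_name("final FINAL.csv"))  # False
```

Putting the date first in `YYYYMMDD` form means an alphabetical listing of the folder is also chronological, and the zero-padded version number keeps `v02` ahead of `v10`.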

File Formats

Anyone who has tried to load a file created in an obsolete software program knows the pain of unstable file formats. For example, you may have attempted to open an old WordPerfect document saved in the obsolete .wpd format. File formats can become obsolete or orphaned, and the programs that read them can become abandonware when their creators stop maintaining them. Stable file formats are unlikely to suffer from these issues. Using a stable file format helps to preserve your data for yourself and other researchers who may want to use it in the future.

What file format should I use?

Stable file formats have these characteristics:

Non-proprietary. A proprietary format is created and controlled by a single company. For instance, .pptx is proprietary to Microsoft. By contrast, non-proprietary formats can be used across multiple operating systems and pieces of software without restriction. When using proprietary software, you can often choose to export to a non-proprietary format.
Uncompressed. Lossy compression algorithms make files smaller by discarding bits of ‘nonessential’ information, which modifies your data. If your data analysis was done on uncompressed data, sharing only the compressed data can make your results nonreproducible. Lossless compression, such as ZIP or GZIP, restores the original data exactly when the file is unpacked.
Unencrypted. The type of encryption in popular use in software can change over time, making older files inaccessible.
Commonly used in your research community. Using common file types makes your work accessible to a wider set of researchers.
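
The concern about compression applies to lossy methods, which discard information. A minimal Python sketch of the difference, using made-up readings: lossless compression with the standard-library `gzip` module restores the original bytes exactly, whereas rounding the values (a stand-in here for lossy compression, not a real codec) cannot be undone.

```python
import gzip

# Made-up measurements, for illustration only.
readings = [0.123456789, 2.718281828, 3.141592654]
raw = ",".join(repr(x) for x in readings).encode("utf-8")

# Lossless: decompressing gives back exactly the bytes that went in.
restored = gzip.decompress(gzip.compress(raw))
print(restored == raw)       # True

# Lossy stand-in: rounding to 2 decimals throws detail away for good.
rounded = [round(x, 2) for x in readings]
print(rounded)               # [0.12, 2.72, 3.14]
print(rounded == readings)   # False
```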

Preferred File Formats

Data type	Preferred file format examples
Containers	TAR, GZIP, ZIP
Databases	XML, CSV, SQLITE
Geospatial	SHP, DBF, GeoTIFF, NetCDF
Moving images	MOV, MPEG, AVI, MXF
Sounds	WAVE, AIFF, MP3, MXF
Statistics	ASCII, DTA, POR, SAS, SAV
Still images	TIFF, JPEG2000, PDF, PNG, GIF, BMP
Tabular data	CSV
Text	XML, PDF/A, HTML, ASCII, UTF-8
Web archive	WARC

(Table Source)
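
One way to apply the table is to flag files whose extensions fall outside it. The sketch below is a rough illustration: the mapping from format names to file extensions (for example TIFF to `.tif` or `.tiff`) is an assumption added here and covers only a few rows, and the function name `flag_nonpreferred` and the sample file names are hypothetical.

```python
from pathlib import Path

# Assumed extensions for a few formats from the table above (not exhaustive).
PREFERRED_EXTENSIONS = {
    ".csv",            # tabular data
    ".xml", ".html",   # text
    ".tif", ".tiff",   # still images (TIFF)
    ".png",            # still images
    ".wav",            # sound (WAVE)
    ".warc",           # web archive
}


def flag_nonpreferred(filenames):
    """Return the names whose extension is not in PREFERRED_EXTENSIONS."""
    return [
        name for name in filenames
        if Path(name).suffix.lower() not in PREFERRED_EXTENSIONS
    ]


files = ["results.csv", "figure1.TIFF", "slides.pptx", "notes.wpd"]
print(flag_nonpreferred(files))   # ['slides.pptx', 'notes.wpd']
```

A flagged file is a prompt to look for an export option, not proof of a problem; an extension alone does not guarantee the contents are in the named format.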

In addition to this general advice, some repositories, such as Dryad, tell researchers which stable file formats to use. Your funder or research IT may also have preferred file formats.

Some data does not have a file format that meets the standards laid out here and must be saved in a proprietary format. When sharing data in a proprietary format, document in your README the name of the program (and version number, if applicable) that can be used to read the data.
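
A note of this kind might be generated along the following lines. The layout of the entry, the function name `readme_entry`, and the file, program, and version names shown are all invented for illustration; this guide does not prescribe a format.

```python
def readme_entry(filename: str, program: str, version: str = "") -> str:
    """Format one plain-text line naming the software needed to open a file."""
    software = f"{program} {version}".strip()
    return f"{filename}: open with {software}"


# Hypothetical file and software names.
lines = [
    readme_entry("scan_001.xyz", "ExampleViewer", "4.2"),
    readme_entry("model.abc", "ExampleModeler"),
]
print("\n".join(lines))
# scan_001.xyz: open with ExampleViewer 4.2
# model.abc: open with ExampleModeler
```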

Some data types, such as GIS files, require multiple files working together to be read. In this case, make sure you supply all the files needed and document the file structure in your README.
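
For example, an Esri shapefile needs at least its `.shp`, `.shx`, and `.dbf` files side by side. Below is a minimal sketch of a completeness check to run before sharing; the function name `missing_shapefile_parts` and the file names are hypothetical, and optional companions such as `.prj` are not checked.

```python
from pathlib import Path
import tempfile

# The three mandatory parts of an Esri shapefile.
REQUIRED_SUFFIXES = (".shp", ".shx", ".dbf")


def missing_shapefile_parts(shp_path):
    """Return the required companion files that do not exist next to shp_path."""
    base = Path(shp_path)
    return [
        base.with_suffix(suffix).name
        for suffix in REQUIRED_SUFFIXES
        if not base.with_suffix(suffix).exists()
    ]


# Demo in a temporary folder with one part deliberately left out.
with tempfile.TemporaryDirectory() as folder:
    for suffix in (".shp", ".dbf"):
        (Path(folder) / f"parcels{suffix}").touch()
    print(missing_shapefile_parts(Path(folder) / "parcels.shp"))   # ['parcels.shx']
```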

Contact Us

For questions or comments, email us at lena.g.bohman@hofstra.edu.

File Structure and Naming Prompt Sheet
from University of Illinois
File naming convention worksheet
from Caltech

File naming scheme examples

Bulk File Renaming Programs

Library of Congress recommended file formats
PRONOM
Technical registry from the UK National Archives for file types.
FileInfo.com
Database of file formats