
SOM Data Management: Naming and Format

File Organization

A consistent file naming scheme helps you keep track of all your data. It helps you avoid computational mistakes when you analyze the data, browse your data and see at a glance what a folder contains, and remember what is in each file when you return to old data.

The most important part of creating a file naming scheme is choosing something that you can consistently follow. The second most important part is documenting it. This information can go in a plain-text README in the folder(s) where you are storing your files.
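As an illustration, a naming scheme can be written down as a pattern and checked automatically. The sketch below assumes a hypothetical convention, project_YYYYMMDD_description_vNN.ext; the convention and the names NAME_PATTERN, build_name and follows_convention are made up for this example and do not come from the source document. Adapt the pattern to whatever scheme you document in your README.

```python
import re
from datetime import date

# Hypothetical convention (an assumption for this example):
#   <project>_<YYYYMMDD>_<description>_v<NN>.<ext>
# Lowercase letters and digits only; hyphens allowed inside the description.
NAME_PATTERN = re.compile(
    r"^(?P<project>[a-z0-9]+)_"
    r"(?P<date>\d{8})_"
    r"(?P<description>[a-z0-9-]+)_"
    r"v(?P<version>\d{2})"
    r"\.(?P<ext>[a-z0-9]+)$"
)


def build_name(project, day, description, version, ext):
    """Assemble a file name and confirm it follows the convention."""
    name = f"{project}_{day:%Y%m%d}_{description}_v{version:02d}.{ext}"
    if not NAME_PATTERN.match(name):
        raise ValueError(f"name does not follow the convention: {name}")
    return name


def follows_convention(filename):
    """Return True if an existing file name matches the convention."""
    return NAME_PATTERN.match(filename) is not None


example = build_name("sleepstudy", date(2023, 4, 5), "raw-actigraphy", 1, "csv")
print(example)
print(follows_convention("Final data (2).xlsx"))
```

Putting the date in YYYYMMDD order means that sorting files by name also sorts them by date, and a zero-padded version number keeps v02 ahead of v10.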

 

Source: File Naming Best Practices document created by Christine Malinowski, MIT Libraries

Anyone who has tried to open a file created in an obsolete software program knows the pain of unstable file formats. For example, you may have tried to open an old document created in WordPerfect and saved in the obsolete .wpd format. A file format can become obsolete or orphaned when the creator of a program abandons it, leaving the software as abandonware. Stable file formats are formats that are unlikely to suffer from these issues. Using a stable file format helps to preserve your data for yourself and for other researchers who may want to use it in the future.

What file format should I use?

Stable file formats have these characteristics:

  • Non-proprietary. A proprietary format is created and controlled by a single company. For instance, .pptx is a format proprietary to Microsoft. By contrast, non-proprietary formats can be used across multiple operating systems and pieces of software without restriction. When using proprietary software, you can often choose to export to a non-proprietary format.
  • Uncompressed. Lossy compression algorithms modify your data in order to make files smaller by rounding off bits of ‘nonessential’ information. If your data analysis was done on uncompressed data, sharing only the lossy-compressed data can make your results nonreproducible. Lossless compression, such as ZIP or GZIP, restores the original data exactly and does not raise this problem.
  • Unencrypted. The type of encryption in popular use in software can change over time, making older files inaccessible.
  • Commonly used in your research community. Using common file types makes your work accessible to a wider set of researchers.
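The difference between compression that preserves data and compression that discards it can be checked directly. The sketch below uses Python's standard gzip and hashlib modules to confirm that a lossless round trip returns identical bytes, and uses simple rounding as a stand-in for what a lossy method does. The measurement values are made up for the example.

```python
import gzip
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 checksum of a byte string as hex."""
    return hashlib.sha256(data).hexdigest()


# Made-up measurements, serialized as plain text (one value per line).
values = [70.123456, 65.654321]
original = "\n".join(f"{v:.6f}" for v in values).encode("ascii")

# Lossless: compressing and then decompressing gives back the same bytes,
# so the checksums match.
restored = gzip.decompress(gzip.compress(original))
lossless_round_trip_ok = sha256_hex(restored) == sha256_hex(original)

# Stand-in for lossy compression: rounding discards digits permanently.
rounded = [round(v, 1) for v in values]
lossy_changed_data = rounded != values

print(lossless_round_trip_ok, lossy_changed_data)
```

Recording a checksum of each shared file in your README lets other researchers confirm that the copy they received matches the one you analyzed.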

Preferred File Formats

Data type        Preferred file format examples

Containers       TAR, GZIP, ZIP
Databases        XML, CSV, SQLITE
Geospatial       SHP, DBF, GeoTIFF, NetCDF
Moving images    MOV, MPEG, AVI, MXF
Sounds           WAVE, AIFF, MP3, MXF
Statistics       ASCII, DTA, POR, SAS, SAV
Still images     TIFF, JPEG2000, PDF, PNG, GIF, BMP
Tabular data     CSV
Text             XML, PDF/A, HTML, ASCII, UTF-8
Web archive      WARC

(Table Source)

In addition to this general advice, some repositories, such as Dryad, give researchers directions on which stable file formats to use. Your funder or research IT group may also have preferred file formats.

Some data have no file format that meets the standards laid out here and must be saved in a proprietary format. When sharing data in a proprietary format, document in your README the name of the program (and version number, if applicable) that can be used to read the data.
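One way to keep these notes consistent is to generate each README line from the same small helper. This is a minimal sketch; the function readme_entry, the file name results.xyz and the program name ExampleStat are all hypothetical and stand in for your own files and software.

```python
def readme_entry(filename, program, version=None):
    """Format one README line naming the software needed to read a file."""
    if version is None:
        software = program
    else:
        software = f"{program} (version {version})"
    return f"{filename}: proprietary format; open with {software}"


# Hypothetical file and program names, for illustration only.
print(readme_entry("results.xyz", "ExampleStat", "2.4"))
print(readme_entry("notes.abc", "ExampleNotes"))
```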

Some data types, for instance GIS files, require multiple files that work together in order to be read. In this case, make sure you supply all the files needed and document the file structure in your README.
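For example, an Esri shapefile needs at least three files that share a base name: .shp (geometry), .shx (index) and .dbf (attributes). A short check before sharing can catch a missing component. The sketch below is one possible approach; the function name missing_shapefile_parts and the file name roads.shp are hypothetical, and the check only looks for lowercase extensions.

```python
from pathlib import Path

# The three components every shapefile needs in order to be read.
REQUIRED_SUFFIXES = (".shp", ".shx", ".dbf")


def missing_shapefile_parts(shp_path):
    """List the required companion files that are absent next to shp_path."""
    base = Path(shp_path)
    return [
        suffix
        for suffix in REQUIRED_SUFFIXES
        if not base.with_suffix(suffix).exists()
    ]


# Hypothetical path: an empty list would mean all three parts are present.
print(missing_shapefile_parts("roads.shp"))
```

Optional companions such as a .prj file (the coordinate system definition) are not checked here, but they are worth supplying and listing in the README when you have them.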
 

Contact Us

For questions or comments, email us at medicine.datacatalog@hofstra.edu.

HOFSTRA UNIVERSITY Hempstead, NY 11549-1000 (516) 463-6600 © 2000-2009 Hofstra University