SOM Data Management: Description

Data Documentation

Metadata, Codebooks, ReadMes, and Data Dictionaries

Source: Data Management Best Practices: Documentation, Created by the University of Pennsylvania Libraries

What is a ReadMe File?

The ReadMe file is an important piece of the data documentation process. The ReadMe file provides basic information about a data file or dataset to ensure that the data will be interpreted correctly by all users.

For best practices and recommended content, read the guide below created by Cornell University's Research Data Management Service Group:

Guide to writing "readme" style metadata | Research Data Management Service Group

A readme file provides information about a data file and is intended to help ensure that the data can be correctly interpreted, by yourself at a later date or by others when sharing or publishing data. Standards-based metadata is generally preferable, but where no appropriate standard exists, for internal use, writing "readme" style metadata is an appropriate strategy.
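
As a rough illustration of what a plain-text readme might contain, the sketch below assembles one from a few fields. The function name `build_readme`, the chosen fields, and all sample values are hypothetical and are not taken from the Cornell guide; consult that guide for the recommended content.

```python
def build_readme(title, creator, contact, date_written, files, notes=""):
    """Return plain-text readme content describing a dataset.

    The field selection here is an assumption for illustration only.
    """
    lines = [
        f"Title: {title}",
        f"Creator: {creator}",
        f"Contact: {contact}",
        f"Date written: {date_written}",
        "",
        "Files:",
    ]
    # One line per data file, with a short description of its contents.
    for name, description in files.items():
        lines.append(f"  {name}: {description}")
    if notes:
        lines += ["", "Notes:", f"  {notes}"]
    return "\n".join(lines) + "\n"


# Hypothetical example values.
readme_text = build_readme(
    title="Example Health Survey, 2020",
    creator="J. Researcher",
    contact="researcher@example.edu",
    date_written="2021-03-15",
    files={
        "survey.csv": "One row per respondent",
        "codebook.txt": "Variable names, labels, and codes",
    },
)
print(readme_text)
```

Saving the returned text as a file next to the data keeps the description and the dataset together.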

What is a codebook?

A codebook contains information on the structure, contents, and layout of a data file. The purpose of the codebook is to summarize the key information about the variables in a research project.

Structure of a codebook:

ICPSR recommends including the following structure and elements when creating a codebook:

Front Matter:

  • Study title
  • Name of the principal investigator(s)
  • Table of contents
  • Introduction describing the purpose and format of the codebook
  • Methodological details (optional)

Variable Information:

  • Variable name: The name or number assigned to each variable in the data collection. Some researchers prefer to use mnemonic abbreviations (e.g., EMPLOY1), while others use alphanumeric patterns (e.g., VAR001). For survey data, try to name variables after the question numbers (e.g., Q1, Q2b). [In the source example, H40-SF12-2]
  • Variable label: A brief description to identify the variable for the user. Where possible, use the exact question or research wording. ["SF12 - ASSESSMENT OF R'S GENERAL HEALTH"]
  • Question text: Where applicable, the exact wording from survey questions. ["In general, would you say your health is ..."]
  • Values: The actual coded values in the data for this variable. [1, 2, 3, 4, 5]
  • Value labels: The textual descriptions of the codes. [Excellent, Very Good, Good, Fair, Poor]
  • Summary statistics: Where appropriate and depending on the type of variable, provide unweighted summary statistics for quick reference. For categorical variables, for instance, frequency counts showing the number of times a value occurs and the percentage of cases that value represents for the variable are appropriate. For continuous variables, minimum, maximum, and median values are relevant.
  • Missing data: Where applicable, the values and labels of missing data. Missing data can bias an analysis and is important to convey in study documentation. Remember to describe all missing codes, including "system missing" and blank. [e.g., Refusal (-1)]
  • Universe skip patterns: Where applicable, information about the population to which the variable refers, as well as the preceding and following variables. [e.g., Default Next Question: H00035.00]
  • Notes: Additional notes, remarks, or comments that contextualize the information conveyed in the variable or relay special instructions. For measures or questions from copyrighted instruments, the notes field is the appropriate location to cite the source.
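
The variable-level elements above can be sketched in code. The example below is a minimal illustration, not an ICPSR tool: the function names `codebook_entry` and `continuous_summary` are hypothetical, and the response data are invented for demonstration. It prints unweighted frequency counts and percentages for a categorical variable, including the missing-data code, and the minimum, maximum, and median for a continuous one.

```python
from collections import Counter
from statistics import median


def codebook_entry(name, label, question, value_labels, missing_codes, responses):
    """Format a codebook entry with unweighted frequencies for a categorical variable."""
    counts = Counter(responses)
    total = len(responses)
    lines = [
        f"Variable name: {name}",
        f"Variable label: {label}",
        f"Question text: {question}",
        "Values:",
    ]
    # List substantive codes first, then missing-data codes.
    for code, text in {**value_labels, **missing_codes}.items():
        n = counts.get(code, 0)
        pct = 100 * n / total if total else 0.0
        lines.append(f"  {code:>3}  {text:<10} n={n}  ({pct:.1f}%)")
    return "\n".join(lines)


def continuous_summary(values):
    """Return the minimum, maximum, and median of a continuous variable."""
    return {"min": min(values), "max": max(values), "median": median(values)}


# Invented responses for illustration; -1 is the refusal code.
responses = [1, 2, 2, 3, 5, -1, 2, 4]
entry = codebook_entry(
    name="H40-SF12-2",
    label="SF12 - ASSESSMENT OF R'S GENERAL HEALTH",
    question="In general, would you say your health is ...",
    value_labels={1: "Excellent", 2: "Very Good", 3: "Good", 4: "Fair", 5: "Poor"},
    missing_codes={-1: "Refusal"},
    responses=responses,
)
print(entry)

# Invented ages for illustration.
age_summary = continuous_summary([23, 35, 41, 29, 52])
print(age_summary)
```

Listing the missing codes alongside the substantive values makes it visible how many cases each one accounts for.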

Source: What is a Codebook?, ICPSR

Additional Resources:

  • Codebook Cookbook created by Patrick Bélisle and Lawrence Joseph, Department of Clinical Epidemiology, McGill University Health Centre
  • What is a Codebook? created by the Substance Abuse & Mental Health Data Archive (SAMHDA)

What is metadata?

Metadata is data that provides information about other data. Its main purpose is to provide description and context that help in organizing, finding, and understanding data.

Metadata standards

Most data repositories require that your project metadata follow a specific standard. The best-known, commonly used, domain-agnostic standard is Dublin Core. The following are the fifteen properties used to describe data using Dublin Core:

  • Contributor - An entity responsible for making contributions to the resource.
  • Coverage - The spatial or temporal topic of the resource, the spatial applicability of the resource, or the jurisdiction under which the resource is relevant.
  • Creator - An entity primarily responsible for making the resource.
  • Date - A point or period of time associated with an event in the life cycle of the resource.
  • Description - An account of the resource.
  • Format - The file format, physical medium, or dimensions of the resource.
  • Identifier - An unambiguous reference to the resource within a given context.
  • Language - A language of the resource.
  • Publisher - An entity responsible for making the resource available.
  • Relation - A related resource.
  • Rights - Information about rights held in and over the resource.
  • Source - A related resource from which the described resource is derived.
  • Subject - The topic of the resource.
  • Title - A name given to the resource.
  • Type - The nature or genre of the resource.
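
To show how these properties might be recorded in practice, the sketch below writes a few of them as XML using Python's standard library. The wrapper element `metadata`, the function name `dublin_core_record`, and the sample values are assumptions for illustration; repositories typically define their own required wrapper and fields.

```python
import xml.etree.ElementTree as ET

# Namespace of the fifteen-element Dublin Core set.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)

DC_ELEMENTS = {
    "contributor", "coverage", "creator", "date", "description",
    "format", "identifier", "language", "publisher", "relation",
    "rights", "source", "subject", "title", "type",
}


def dublin_core_record(fields):
    """Serialize a dict of Dublin Core element names and values to an XML string."""
    root = ET.Element("metadata")  # hypothetical wrapper element
    for element, value in fields.items():
        if element not in DC_ELEMENTS:
            raise ValueError(f"Not a Dublin Core element: {element}")
        child = ET.SubElement(root, f"{{{DC_NS}}}{element}")
        child.text = value
    return ET.tostring(root, encoding="unicode")


# Hypothetical example values.
record_xml = dublin_core_record({
    "title": "Example Health Survey",
    "creator": "J. Researcher",
    "date": "2021-03-15",
    "format": "text/csv",
    "language": "en",
})
print(record_xml)
```

Rejecting names outside the fifteen-element set catches misspelled properties before a record is submitted.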

For information on additional metadata standards, visit the Digital Curation Centre's site or the Research Data Alliance's site.

Source: Research Data Management LibGuide, Created by University of Michigan, Taubman Health Sciences Library

HOFSTRA UNIVERSITY Hempstead, NY 11549-1000 (516) 463-6600 © 2000-2009 Hofstra University