Harvard Forest Data Archive


Scientific Data Provenance in R: RDataTracker and DDG Explorer



  • Lead: Emery Boose, Aaron Ellison, Elizabeth Fong, Matthew Lau, Barbara Lerner, Thomas Pasquier, Margo Seltzer
  • Investigators:
  • Contact: Information Manager
  • Start date:
  • End date:
  • Status: ongoing
  • Location:
  • Latitude:
  • Longitude:
  • Elevation:
  • Taxa:
  • Release date: 2014
  • Revisions: This software is under development. For the most recent version, please see the project website on GitHub.
  • EML file: knb-lter-hfr.91.26
  • DOI: digital object identifier
  • EDI: data package
  • DataONE: data package
  • Related links:
  • Study type: modeling
  • Research topic: ecological informatics and modelling
  • LTER core area: disturbance
  • Keywords: analytical tools, modeling
  • Abstract:

    Scientific data provenance is the information required to document the history of an item of data, including how it was created and how it was transformed. Data provenance has great potential to improve the transparency, reliability, and reproducibility of scientific results. However, domain scientists have made little use of it to date, because most systems that collect provenance require them to learn specialized software tools and jargon. This project is developing tools that allow scientists to collect, visualize, and query provenance directly from the R statistical language. The first tool (RDataTracker) is a library of R functions that can be downloaded and installed as an R package. RDataTracker allows the scientist to collect data provenance while executing an R script or during an R console session. The resulting provenance is stored on the scientist's computer as a data derivation graph (DDG). The second tool (DDG Explorer) is written in Java and allows the scientist to visualize, store, and query DDGs. Both tools are included in a single install package.

  • Methods:

    To install: (1) make sure that R is installed on your computer, (2) download the installation file (hf091-01) from this website to your computer and rename it "RDataTracker.tar.gz" (do not uncompress it), (3) install RDataTracker as an R package (e.g. with install.packages in the R console or Tools / Install Packages in RStudio), and (4) load RDataTracker (e.g. with the R library command) in each session where it will be used. For more details, please see the RDataTracker help messages and the project web page on GitHub.
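
    Steps (3) and (4) can be sketched in the R console as follows. This is a minimal sketch, not the project's official procedure: it assumes the renamed file "RDataTracker.tar.gz" sits in the current working directory, and the helper name install_local_package is hypothetical, introduced here only for illustration. The calls to install.packages and library are standard base R.

```r
# Hypothetical helper (not part of RDataTracker): installs a local source
# tarball as an R package and then loads it for the current session.
install_local_package <- function(tarball = "RDataTracker.tar.gz",
                                  package = "RDataTracker") {
  # Assumption: the file was downloaded and renamed as described in step (2).
  if (!file.exists(tarball)) {
    stop("Cannot find ", tarball, "; download and rename the file first.")
  }

  # repos = NULL makes R install from the local file instead of a repository;
  # type = "source" because the tarball is an uncompiled source package.
  install.packages(tarball, repos = NULL, type = "source")

  # character.only = TRUE lets library() accept the package name as a string.
  library(package, character.only = TRUE)

  invisible(TRUE)
}
```

    Calling install_local_package() with no arguments uses the file name from step (2). The helper stops with an error if the file is missing, so a mistyped path is reported before R attempts the installation.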

  • Use:

    This dataset is released to the public under Creative Commons CC0 1.0 (No Rights Reserved). Please keep the dataset creators informed of any plans to use the dataset. Consultation with the original investigators is strongly encouraged. Publications and data products that make use of the dataset should include proper acknowledgement.

  • Citation:

    Boose E, Ellison A, Fong E, Lau M, Lerner B, Pasquier T, Seltzer M. 2014. Scientific Data Provenance in R: RDataTracker and DDG Explorer. Harvard Forest Data Archive: HF091 (v.26). Environmental Data Initiative:

Detailed Metadata

HF091-01: RDataTracker v. 2.26.0

  • Compression: tar.gz
  • Format: R package
  • Type: script