Harvard Forest Data Archive

HF091

Software Tools to Collect and Use Provenance in R

Overview

  • Lead: Emery Boose, Aaron Ellison, Elizabeth Fong, Matthew Lau, Barbara Lerner, Thomas Pasquier, Margo Seltzer
  • Investigators:
  • Contact: Information Manager
  • Start date:
  • End date:
  • Status: complete
  • Location: Global
  • Latitude: -90 to +90 degrees
  • Longitude: -180 to +180 degrees
  • Elevation:
  • Datum: WGS84
  • Taxa:
  • Release date: 2024
  • Language: English
  • EML file: knb-lter-hfr.91.28
  • DOI: https://doi.org/10.6073/pasta/c622edd9114927f407cd55adb323fee7
  • EDI: data package
  • DataONE: data package
  • Related links:
  • Study type: modeling
  • Research topic: ecological informatics and modelling
  • LTER core area: disturbance patterns
  • Keywords: analytical tools, modeling
  • Abstract:

    The software tools that scientists use to process and analyze data are typically optimized for performance and ease of use. Few if any such tools are designed to capture and record the details of what happens as the tool performs its task. This detailed information, and more generally the history of an item of data from its creation to its present state, is known as provenance. Provenance has the potential to make science more transparent, reliable, and reproducible.

    This project focused on collecting and using provenance for scripts written in the R statistical language, which is widely used by ecologists and environmental scientists for data analysis and visualization. Our tools include a provenance collector (rdtLite), which collects provenance as an R script executes (or during a console session), as well as other tools that use the collected provenance to document and visualize the execution or to support activities such as script debugging. The R packages included here are also available on CRAN. For more details, see the project website on GitHub (https://end-to-end-provenance.github.io).

  • Methods:

    1. rdtLite collects provenance as an R script executes (or during a console session) and saves it in extended PROV-JSON format to the file prov.json. By default, this file is written to the R session temporary directory (to meet CRAN requirements) and is overwritten in subsequent executions of the same script (or console session), but you can choose to save it elsewhere and to save time-stamped versions if desired. Simple data values are automatically saved in the prov.json file. Complex data values (e.g. R lists or data frames) may optionally be saved (wholly or in part) as separate snapshot files.

    2. provDebugR uses the provenance collected by rdtLite to support time-traveling debugging of an R script without the need to set breakpoints or insert print statements and rerun the script.

    3. provExplainR uses the provenance collected by rdtLite from two different executions of a script to help explain why the script results differ.

    4. provGraphR creates an adjacency matrix from the provenance object created by provParseR. The adjacency matrix can then be used to quickly traverse the provenance graph. This package supports other packages and is not intended to be used directly.

    5. provParseR facilitates access to the provenance information collected by rdtLite. The prov.parse function accepts this information as a string or file in extended PROV-JSON format and returns it as an R object. Access functions then extract the desired information from this object and return it as a data frame. This package supports other packages and is not intended to be used directly.

    6. provSummarizeR creates a concise high-level summary of the provenance collected by rdtLite, including information about computing environment, loaded libraries, sourced scripts, and inputs and outputs.

    7. provTraceR uses the provenance collected by rdtLite for a single R script or a series of R scripts to identify input files, output files, and exchanged files based on file hash values.

    8. provViz provides an R interface to a visualization tool, written in Java, that allows you to view and query the provenance graph directly. You will need to have Java installed for this to work.
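
    To illustrate why an adjacency matrix (item 4) makes traversal of a provenance graph fast, here is a minimal base-R sketch. It does not use provGraphR or provParseR: the edge list, the node names (p = procedure, d = data), the edge direction, and the functions build_adjacency and downstream are all hypothetical, and the layout of the real provGraphR matrix may differ.

    ```r
    # Hypothetical edge list for a tiny provenance graph.
    # Each row is one edge "from -> to"; node names are made up.
    edges <- data.frame(
      from = c("p1", "d1", "p2", "d2"),
      to   = c("d1", "p2", "d2", "p3"),
      stringsAsFactors = FALSE
    )

    # Build a square logical adjacency matrix indexed by node name.
    # adj[a, b] is TRUE when there is an edge a -> b.
    build_adjacency <- function(edges) {
      nodes <- sort(unique(c(edges$from, edges$to)))
      adj <- matrix(FALSE, nrow = length(nodes), ncol = length(nodes),
                    dimnames = list(nodes, nodes))
      # A two-column character matrix indexes cells by row/column name.
      adj[cbind(edges$from, edges$to)] <- TRUE
      adj
    }

    # Breadth-first walk: every node reachable from `start` along the edges.
    downstream <- function(adj, start) {
      reached <- character(0)
      frontier <- start
      while (length(frontier) > 0) {
        node <- frontier[1]
        frontier <- frontier[-1]
        successors <- colnames(adj)[adj[node, ]]
        unseen <- setdiff(successors, reached)
        reached <- c(reached, unseen)
        frontier <- c(frontier, unseen)
      }
      reached
    }

    adj <- build_adjacency(edges)
    lineage <- downstream(adj, "d1")
    ```

    Finding the successors of a node is a single row lookup in the matrix, so the walk does not need to scan the edge list again at each step.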

    Note: the R packages included here have been renamed for archival purposes. Please rename the package file after downloading to remove the archival prefix before installing as an R package (e.g. rename "hf091-01-rdtLite_1.4.tar.gz" to "rdtLite_1.4.tar.gz").
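
    The renaming step in the note above can be sketched in base R as follows. The helper strip_archival_prefix is hypothetical, and its pattern ("hf", digits, hyphen, digits, hyphen) is an assumption inferred from the single example in the note; adjust it if other archived files are named differently.

    ```r
    # Strip an archival prefix such as "hf091-01-" from a package file name.
    # Assumption: the prefix always has the form hf<digits>-<digits>-.
    # Only the file name is returned; any directory part is dropped.
    strip_archival_prefix <- function(file_name) {
      sub("^hf[0-9]+-[0-9]+-", "", basename(file_name))
    }

    archived <- "hf091-01-rdtLite_1.4.tar.gz"
    restored <- strip_archival_prefix(archived)

    # With the downloaded file in the working directory, one could then
    # rename it and install from source (left commented out here because
    # it changes files and the installed library):
    # file.rename(archived, restored)
    # install.packages(restored, repos = NULL, type = "source")
    ```

    Names that do not carry the prefix are returned unchanged, so the helper is safe to apply to a mixed list of files.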

  • Organization: Harvard Forest. 324 North Main Street, Petersham, MA 01366, USA. Phone (978) 724-3302. Fax (978) 724-3595.

  • Project: The Harvard Forest Long-Term Ecological Research (LTER) program examines ecological dynamics in the New England region resulting from natural disturbances, environmental change, and human impacts.

  • Funding: National Science Foundation LTER grants: DEB-8811764, DEB-9411975, DEB-0080592, DEB-0620443, DEB-1237491, DEB-1832210.

  • Use: This dataset is released to the public under Creative Commons CC0 1.0 (No Rights Reserved). Please keep the dataset creators informed of any plans to use the dataset. Consultation with the original investigators is strongly encouraged. Publications and data products that make use of the dataset should include proper acknowledgement.

  • License: Creative Commons Zero v1.0 Universal (CC0-1.0)

  • Citation: Boose E, Ellison A, Fong E, Lau M, Lerner B, Pasquier T, Seltzer M. 2024. Software Tools to Collect and Use Provenance in R. Harvard Forest Data Archive: HF091 (v.28). Environmental Data Initiative: https://doi.org/10.6073/pasta/c622edd9114927f407cd55adb323fee7.

Detailed Metadata

HF091-01: rdtLite v. 1.4

  • Compression: tar.gz
  • Format: R package
  • Type: script

HF091-02: provDebugR v. 1.0.1

  • Compression: tar.gz
  • Format: R package
  • Type: script

HF091-03: provExplainR v. 1.1.1

  • Compression: tar.gz
  • Format: R package
  • Type: script

HF091-04: provGraphR v. 1.0.1

  • Compression: tar.gz
  • Format: R package
  • Type: script

HF091-05: provParseR v. 1.0

  • Compression: tar.gz
  • Format: R package
  • Type: script

HF091-06: provSummarizeR v. 1.5.1

  • Compression: tar.gz
  • Format: R package
  • Type: script

HF091-07: provTraceR v. 1.0

  • Compression: tar.gz
  • Format: R package
  • Type: script

HF091-08: provViz v. 1.0.9

  • Compression: tar.gz
  • Format: R package
  • Type: script