NSF DIBBS ACI 1443047

Analysis and management of large data sets are vital for progress in the data-intensive realm of scientific research and education. Scientists are producing, analyzing, storing, and retrieving massive amounts of data. The anticipated growth in the analysis of scientific data raises complex issues of stewardship, curation, and long-term access. Scientific data is tracked and described by metadata. This award will fund the design, development, and deployment of metadata-aware workflows to enable the management of large data sets produced by scientific analysis. Scientific workflows for data analysis are used by a broad community of scientists in fields including astronomy, biology, ecology, and physics. Making workflows metadata-aware is an important step towards making scientific results easier to share and reuse, and towards supporting reproducibility. This project will pilot new workflow tools using data from the Laser Interferometer Gravitational-wave Observatory (LIGO), a data-intensive project at the frontiers of astrophysics. The goal of LIGO is to use gravitational waves (ripples in the fabric of spacetime) to explore the physics of black holes and understand the nature of gravity.

Efficient methods for accessing and mining the large data sets generated by LIGO’s diverse gravitational-wave searches are critical to the overall success of gravitational-wave physics and astronomy. Providing these capabilities will maximize existing NSF investments in LIGO, support new modes of collaboration within the LIGO Scientific Collaboration, and better enable scientists to explain their results to a wider community, including the critical issue of data and analysis provenance for LIGO’s first detections. The interdisciplinary collaboration involved in this project brings together computational and informatics theories and methods to solve data and workflow management problems in gravitational-wave physics. The research generated from this project will make a significant contribution to theory and methods for identifying science requirements, metadata modeling, eScience workflow management, data provenance, reproducibility, and data discovery and analysis. The LIGO scientists participating in this project will ensure that the needs of the community are met. The cyberinfrastructure and data-management scientists will ensure that the software products are well-designed and that the work funded by this award is useful to a broader community.