Photo: a cormorant diving. Cormorants are a family of diving birds common on the Columbia River.

Cormorant: Scientific Data Management

We consider Scientific Data Management to involve three forms of data:
  • Science Data: Potentially massive numeric datasets and derived data products; these data are the currency of scientific research.
  • Catalog Data: Metadata descriptors of the files used to store science data; these data enable users and client programs to access the science data.
  • Task Data: Descriptions of the derivations used to produce science data; these descriptions constitute provenance information when used post-derivation, and can drive monitoring and scheduling tools when used more proactively.
The context of our investigation is CORIE, an Environmental Observation and Forecasting System supporting investigation of the Columbia River Estuary through computer simulation augmented by an observation network of dozens of sensors.

The Science Data consists of simulation runs and sensor feeds, as well as derived GIF animations, timeseries plots, and composite images. The simulations produce 5GB of forecasted data each day, and an archive of “hindcasts” covering each day since 1997 is materializing concurrently. Managing the variety and complexity of the data products is a primary challenge.

The Catalog Data is somewhat obscured by the technical artifacts of the system’s implementation. Metadata is encoded in the names of files and directories, or as anonymous fields within the files themselves. Such implicit modes of data representation interfere with efforts to automate data access. Further, scientific systems are subject to frequent and significant evolution; both the information being recorded and the manner in which it is represented are likely to change. A database schema designed to organize these data would be subject to frequent revision, which is known to be at best inconvenient. A more flexible approach is desirable.
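To illustrate (with a hypothetical path layout, not the actual CORIE naming convention), metadata hidden in a file path can be recovered only by a script that hard-codes the convention, which is exactly what breaks when the convention evolves:

    import re

    # Hypothetical CORIE-style path layout: run date, run number, and the
    # simulated variable are encoded in directory and file names.
    PATH_PATTERN = re.compile(
        r"/home/corie/(?P<year>\d{4})-(?P<day>\d{3})/run(?P<run>\d+)/"
        r"(?P<variable>\w+)\.63$"
    )

    def extract_descriptors(path):
        """Recover implicit metadata (name-value pairs) from a file path."""
        match = PATH_PATTERN.match(path)
        if match is None:
            raise ValueError("path violates the assumed convention: " + path)
        return match.groupdict()

    print(extract_descriptors("/home/corie/2004-123/run17/salt.63"))
    # {'year': '2004', 'day': '123', 'run': '17', 'variable': 'salt'}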

Managing Science Data: GridFields for Manipulating Simulation Results

Environmental Observation and Forecasting Systems (EOFS) create new opportunities and challenges for the generation and use of environmental data products. The number and diversity of these data products, however, have been artificially constrained by the lack of a simple descriptive language for expressing them. Data products that can be described simply in English take pages of obtuse scripts to generate. The scripts obfuscate the original intent of the data product, making it difficult for users and scientists to understand the overall product catalog. The problem is exacerbated by the evolution of modern EOFS into data product "factories" subject to reliability requirements and daily production schedules. New products must be developed and assimilated into the product suite as quickly as they are imagined. Reliability must be maintained despite changing hardware, changing software, changing file formats, and changing staff.

We have developed an algebra of GridFields for naturally expressing executable data product recipes over structured and unstructured computational grids. Informed by relational database theory, we have defined a simple data model and a set of operators that can be composed to express complex visualizations, plots, and transformations of gridded datasets. The language provides a medium for design, discussion, and development of new data products without committing to particular data structures or algorithms.
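As a flavor of the approach, here is a minimal sketch in Python of operator composition in the GridField style. The class and the operators bind and restrict are simplified illustrations of the idea, not the actual GridField implementation or API:

    class GridField:
        """Nodes of a grid, with named attribute arrays bound to them."""

        def __init__(self, nodes, attributes=None):
            self.nodes = list(nodes)                  # node identifiers
            self.attributes = dict(attributes or {})  # name -> one value per node

        def bind(self, name, values):
            """Attach an attribute array (one value per node) to the grid."""
            attrs = dict(self.attributes)
            attrs[name] = list(values)
            return GridField(self.nodes, attrs)

        def restrict(self, predicate):
            """Keep only the nodes whose attribute values satisfy the predicate."""
            keep = [i for i in range(len(self.nodes))
                    if predicate({k: v[i] for k, v in self.attributes.items()})]
            return GridField(
                [self.nodes[i] for i in keep],
                {k: [v[i] for i in keep] for k, v in self.attributes.items()},
            )

    # A "recipe" for a data product, near-surface salinity, written as a
    # composition of operators rather than an ad hoc script:
    estuary = GridField(nodes=[0, 1, 2, 3])
    surface_salt = (estuary
                    .bind("depth", [0.5, 1.5, 8.0, 12.0])
                    .bind("salt",  [28.1, 27.4, 30.2, 31.0])
                    .restrict(lambda row: row["depth"] < 2.0))

    print(surface_salt.nodes)               # [0, 1]
    print(surface_salt.attributes["salt"])  # [28.1, 27.4]

The point is that the recipe reads as a composition of a few well-defined operators rather than pages of scripting.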

Figure: two data products from the CORIE system. (a) A vertical profile of salinity in the estuary, along a deep channel. (b) A horizontal slice derived from the same grid function.
Our language does not subsume highly tuned algorithms for grid computations; rather, it serves as a starting point for reasoning about and selecting an appropriate algorithm for given data. Algebraic properties of the language admit optimization via rewrite rules. Our development is in the context of CORIE, an EOFS simulating and observing the dynamics of the Columbia River Estuary. The CORIE system produces over 5GB of forecasted data and data products daily, and a large archive of hindcasts is growing steadily. Such intensive computing requirements motivate a transition from ad hoc scripting to a new factory model of data product provision.
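As one example of such a rewrite (an assumed law in the spirit of relational predicate pushdown, not a rule quoted from the algebra), a restriction whose predicate does not mention a newly bound attribute can be applied before the bind, so that values are attached only to surviving nodes. Reusing the GridField sketch above:

    # Assumed rewrite law: restrict(p) after bind(a) equals bind(a) after
    # restrict(p), whenever p ignores attribute a. Restricting first means
    # values are bound only to the nodes that survive.
    g = GridField(nodes=[0, 1, 2, 3]).bind("depth", [0.5, 1.5, 8.0, 12.0])
    shallow = lambda row: row["depth"] < 2.0

    lhs = g.bind("salt", [28.1, 27.4, 30.2, 31.0]).restrict(shallow)

    pruned = g.restrict(shallow)              # restrict first...
    rhs = pruned.bind("salt", [28.1, 27.4])   # ...bind only surviving values

    assert lhs.nodes == rhs.nodes and lhs.attributes == rhs.attributes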

Managing Catalog Data: A Schema-less Metadata Repository

We have designed an architecture that supports schema experimentation, but does not impose a schema on the metadata. Metadata descriptors are collected as name-value pairs associated with a particular file. These descriptors are loaded into a database consisting of just one logical table over which several views are defined. Unlike standard DBMS views, our views must support transposition (so-called “crosstab”) operations to promote descriptor names into attributes. Web interfaces and other client programs can then use such views to locate files and thereby process science data.
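A minimal sketch of the idea, using SQLite from Python with hypothetical table, column, and view names; the production repository's details may differ:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    -- The single logical table: one row per (file, descriptor-name, value).
    CREATE TABLE descriptor (file TEXT, name TEXT, value TEXT);

    INSERT INTO descriptor VALUES
      ('run17/salt.63', 'variable', 'salt'),
      ('run17/salt.63', 'year',     '2004'),
      ('run17/salt.63', 'run',      '17'),
      ('run17/hvel.64', 'variable', 'hvel'),
      ('run17/hvel.64', 'year',     '2004'),
      ('run17/hvel.64', 'run',      '17');

    -- A crosstab-style view promoting descriptor names to attributes, so
    -- clients can query files as if a conventional schema existed.
    CREATE VIEW run_files AS
    SELECT file,
           MAX(CASE WHEN name = 'variable' THEN value END) AS variable,
           MAX(CASE WHEN name = 'year'     THEN value END) AS year,
           MAX(CASE WHEN name = 'run'      THEN value END) AS run
    FROM descriptor
    GROUP BY file;
    """)

    for row in conn.execute("SELECT * FROM run_files WHERE variable = 'salt'"):
        print(row)   # ('run17/salt.63', 'salt', '2004', '17')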

Our web interface is based on free software that provides generic interfaces to a database schema. We have modified the software to expose database views as well as tables. With this interface, various potential ‘schemas’ can be tested for utility before being imposed on the database. Additionally, users can create and modify their own views, so even users unfamiliar with database technology can contribute to the schema design effort.

To harvest the metadata descriptors, we require scripts that ‘know’ about naming conventions, file formats, and other standards. These scripts must be maintained by the scientists who devise the standards. However, the effort is manageable, as our infrastructure has been specifically designed to impose as few restrictions as possible on the manner in which the scripts are written. Scripts may be written in one of several languages (Perl being favored by the current group), and need only call a function for each descriptor they wish to record. No file formats, database access libraries, or XML schemas need be agreed upon. Further, such a flexible scheme is better suited to inevitable changes in the number, type, and organization of the metadata that the scientists wish to record. We consider our metadata pipeline a way to jot down ideas programmatically: if a scientist thinks of some information that might be useful to record for a particular file, they can simply write (or modify) a script to collect it, and the system will incorporate it without further changes. Of course, no views will yet be defined that make use of this information.
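For example (record_descriptor is a hypothetical stand-in for the single recording function; how descriptors actually reach the database is not specified here), a harvesting script might look like this:

    import os

    def record_descriptor(path, name, value):
        # Hypothetical stand-in for the one function the infrastructure
        # provides; here it just prints the (file, name, value) triple
        # that the loader would insert into the repository.
        print(path, name, value, sep="\t")

    def harvest(path):
        """A scientist's script that 'knows' one local convention: the
        parent directory names the run, and the file stem names the
        simulated variable."""
        run_dir = os.path.basename(os.path.dirname(path))       # e.g. 'run17'
        variable = os.path.splitext(os.path.basename(path))[0]  # e.g. 'salt'
        record_descriptor(path, "run", run_dir)
        record_descriptor(path, "variable", variable)

    harvest("/home/corie/2004-123/run17/salt.63")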

To improve this framework, we are investigating the use of data mining techniques to automatically identify patterns and thereby infer schema from the set of metadata descriptors. Patterns such as “these three descriptors always appear together” suggest potential schema constraints. Perhaps the three repeated descriptors should be factored out as a table of their own. Although we are working with a very simple name-value pair data model, schema inference for semi-structured data (particularly XML) is likely applicable.
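A toy version of one such co-occurrence check (for flavor only; the mining techniques actually under investigation are not described here) flags descriptor pairs that always appear together and are therefore candidates for factoring out:

    from itertools import combinations

    # file -> set of descriptor names observed for it (toy data)
    observations = {
        "run17/salt.63": {"variable", "year", "run"},
        "run17/hvel.64": {"variable", "year", "run"},
        "station01.ts":  {"station", "sensor"},
    }

    def always_together(observations):
        """Pairs of descriptors present in exactly the same set of files."""
        names = set().union(*observations.values())
        return {(a, b)
                for a, b in combinations(sorted(names), 2)
                if all((a in seen) == (b in seen)
                       for seen in observations.values())}

    print(always_together(observations))
    # e.g. {('run', 'variable'), ('run', 'year'), ('variable', 'year'),
    #       ('sensor', 'station')}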

Additional features envisioned for the comprehensive system include:

  • Query Language: a declarative query language to complement the GridField algebra.
  • Production Planning: interfaces for planning and executing simulation campaigns consisting of many related runs.
  • Monitoring: instrumentation to monitor and record the status of simulation runs during execution.

Do you manage or process simulation results? We want to help!

maintained by howe at cs.pdx.edu, updated 3/31/2005