FDD: Full Disclosure of Data Preparation and Use

Research Focus

The goals of this project are to:

Understand data manipulation activities by analysts (i.e., data users) in studies that use datasets as input, especially data preparation, integration and selection.
Devise methods and tools to document data manipulation activities in studies in order to greatly facilitate the specification and conduct of new studies (making the data user's job easier) and enable automatic processing of study data by the original or subsequent users.
Exploit the specification of research studies, e.g. to enable reuse and customization of studies.

We conducted a study in which we interviewed a number of data users to identify ways we could help them conduct individual studies with greater ease, to see whether they documented their data manipulations, and to learn whether they thought documentation would be useful. Subjects were recruited from several disciplines, including transportation, urban planning, utility management (e.g., for economic analysis of utility usage patterns), and economic analysis at the state level for a variety of purposes. The interviews were conducted between December 2010 and February 2011. Based on our analysis of the interview results, we focused on tracking datasets at the file level. Since the data users' work is often exploratory, they may try various approaches to manipulate and process their datasets. As a result, it was often hard for them to remember or document precisely which steps were used to produce the final version of their datasets.

We developed a mock-up of a software tool that could be used by data users to document their work processes easily. The tool is called WHIM: Work History Information Manager. WHIM generalizes the capabilities of a version control system typically used in software development to this more general environment of data users.

A mock-up of WHIM is available at:

To validate our ideas, we conducted a second round of individual interviews with our research subjects where we showed them our WHIM mock-up and asked for their opinions of WHIM. Their reactions were uniformly positive; they provided several suggestions to further improve WHIM. We are currently pursuing several funding options to develop the WHIM tool as an open source project.

Overall, we found that:

1. Dataset reuse is highly desirable in many settings because of the amount of human effort and attention that must be invested to prepare and analyze the datasets when a study is originally conducted.

2. Dataset reuse is difficult because subsequent analysts need to know, often in great detail, precisely what was done to a dataset during the earlier study. Without those details, researchers cannot trust that the data was properly filtered, transformed, cleaned, etc. for the purposes of the new study.

3. Even for dedicated data users, the details of how datasets were manipulated are typically not recorded. One reason is that there are no standard mechanisms or tools that make documentation easy or that otherwise encourage users to document as they work.

4. Although it is often hard to encourage users to learn and use a new tool (such as the one we propose), we have evidence that direct, near-term benefits for the user might significantly encourage the capture of documentation using our tool. These near-term benefits could include the ability to easily see the history of steps that led to a given dataset (to refresh one's memory or to confirm that a process was followed), to undo certain steps, and to visualize the results of certain processing steps (e.g., showing the size of the datasets before and after the steps).
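
The near-term benefits in the last finding can be illustrated with a small Python sketch. It is an assumption-laden illustration, not the WHIM design: the `Step` and `WorkHistory` classes, their fields, and the example dataset names and row counts are all hypothetical. The sketch shows how a log of steps could answer "which steps led to this dataset?", drop the most recent step from the log, and report dataset sizes before and after each step.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Step:
    """One data manipulation step (hypothetical record layout)."""
    description: str
    input_name: Optional[str]  # dataset the step read; None for an initial import
    output_name: str           # dataset the step produced
    rows_before: int
    rows_after: int


@dataclass
class WorkHistory:
    """An append-only log of steps with simple queries over it."""
    steps: List[Step] = field(default_factory=list)

    def record(self, step: Step) -> None:
        self.steps.append(step)

    def undo(self) -> Step:
        """Remove and return the most recently recorded step."""
        return self.steps.pop()

    def lineage(self, dataset: str) -> List[Step]:
        """Return the steps that led to `dataset`, oldest first.

        Exploratory steps that produced other datasets are skipped.
        """
        chain: List[Step] = []
        current: Optional[str] = dataset
        for step in reversed(self.steps):
            if step.output_name == current:
                chain.append(step)
                current = step.input_name
        chain.reverse()
        return chain

    def size_report(self, dataset: str) -> List[str]:
        """Describe dataset size before and after each step in the lineage."""
        return [
            f"{s.description}: {s.rows_before} -> {s.rows_after} rows"
            for s in self.lineage(dataset)
        ]


# Example usage with made-up dataset names and row counts.
history = WorkHistory()
history.record(Step("import raw survey", None, "raw.csv", 0, 1000))
history.record(Step("drop incomplete responses", "raw.csv", "clean.csv", 1000, 940))
history.record(Step("exploratory: keep urban only", "raw.csv", "urban_try.csv", 1000, 400))
history.record(Step("filter to 2010", "clean.csv", "final.csv", 940, 310))

for line in history.size_report("final.csv"):
    print(line)
```

Because `lineage` follows input and output names backward from the requested dataset, the abandoned exploratory step is left out of the report for `final.csv`. This mirrors the difficulty that interview subjects described in recalling which steps produced their final datasets. Note that `undo` here only removes the step from the log; restoring earlier file contents would require storing them, which this sketch does not do.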


Publications and Products


This project was supported in part by NSF Grant Number 0954268. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.