Accumulation and Aggregation of XML Data
(The Merge Operator)

XML is rapidly becoming a standard for data exchange on the World Wide Web, and together with the proliferation of the Internet it is enabling information exchange as never before. One can now envision streams of XML data flowing throughout the Web: a stream of stock quotes; play-by-play action broadcast as XML fragments, one for each play; the latest news updates published as a stream of XML documents. The rise of XML and the Web is fundamentally changing the nature of data and queries. Data may now arrive in streams instead of residing in persistent local files; data may be irregular, incomplete, and often heavily nested; and long-running queries that monitor and aggregate data on the Web appear to be increasingly important. We have developed a new operation, Merge, which can create aggregates over streams of data and can take fragments of XML from different inputs and piece them together into a new XML document. How two XML documents are to be merged is specified by a flexible mechanism called a Merge Template. The Merge operation effectively handles highly nested, semi-structured data and has features that make it useful in an environment with long-running queries and stream-based data sources. Merge provides a simple step towards making intelligent querying of data on the Internet a reality.

An important step towards improving the quality of querying available on the Web is to understand how to process streaming XML data effectively. This problem presents interesting challenges due to the structure of XML data and the nature of Web users’ queries. Data stored in XML is expected to be semi-structured: it has some structure, but it will likely be irregular and incomplete. In addition, queries run over XML documents available on the Internet may not resemble traditional database queries. In particular, they might operate over live, unreliable streams of data instead of files, and they may live for days or even months. Our new operation, Merge, performs an important function over streams: accumulating and aggregating data from an incoming stream. Merge is specifically designed to work on irregular and incomplete data, features inherent in Web-resident XML data. It is also a general-purpose operation that is useful for replication, cache maintenance, and incremental query processing. Merge takes two XML documents as input and produces a merged XML document as its result, using a Merge Template, itself expressed in XML, to specify how the documents should be merged. A merge-based accumulate operator has been implemented in Version 1.0 of the Niagara Internet Query Engine.
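
The accumulate behaviour described in the paragraph above can be illustrated with a minimal Python sketch. It is not the Niagara implementation, and the real Merge Template is an XML document whose syntax is not given in this section; the dictionary TEMPLATE, the functions merge and accumulate, and the stock-quote element names are all hypothetical stand-ins chosen for illustration. The sketch assumes that elements are matched on a key attribute, that listed numeric children are summed, that other children are replaced by the newest value, and that anything missing from the accumulated document is simply copied in, which is one way to tolerate irregular and incomplete fragments.

```python
import copy
import xml.etree.ElementTree as ET

# Hypothetical stand-in for a Merge Template (the real one is expressed in XML).
# For each element tag: the attribute used to match elements across documents,
# and the child tags whose numeric text is aggregated by summing.
TEMPLATE = {"stock": {"key": "symbol", "sum": {"volume"}}}


def merge(accumulator, fragment, template):
    """Merge the children of `fragment` into `accumulator` and return it."""
    for incoming in fragment:
        rule = template.get(incoming.tag)
        match = None
        if rule is not None:
            key = incoming.get(rule["key"])
            for existing in accumulator.findall(incoming.tag):
                if existing.get(rule["key"]) == key:
                    match = existing
                    break
        if match is None:
            # No rule or no matching element yet: copy the fragment in as-is.
            accumulator.append(copy.deepcopy(incoming))
            continue
        for child in incoming:
            target = match.find(child.tag)
            if target is None:
                # Child absent from the accumulated document: copy it in.
                match.append(copy.deepcopy(child))
            elif child.tag in rule["sum"]:
                # Aggregate: add the incoming value to the running total.
                target.text = str(int(target.text) + int(child.text))
            else:
                # Default: the most recent value replaces the old one.
                target.text = child.text
    return accumulator


def accumulate(fragments, template, root_tag="summary"):
    """Fold a stream of XML fragments (strings) into one accumulated document."""
    result = ET.Element(root_tag)
    for text in fragments:
        merge(result, ET.fromstring(text), template)
    return result


# A small simulated stream; the second fragment is incomplete (no <last>).
stream = [
    '<quotes><stock symbol="ABC"><volume>100</volume><last>10.5</last></stock></quotes>',
    '<quotes><stock symbol="XYZ"><volume>40</volume></stock></quotes>',
    '<quotes><stock symbol="ABC"><volume>25</volume><last>10.7</last></stock></quotes>',
]
summary = accumulate(stream, TEMPLATE)
print(ET.tostring(summary, encoding="unicode"))
```

In this sketch each arriving fragment is merged into the running result, so a long-running query only ever holds the accumulated document rather than the whole stream. Keeping the matching and aggregation rules in a separate template structure mirrors the role the text assigns to the Merge Template, although the actual rule vocabulary may differ.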

The figure below shows the merge of a sample baseball game event into a roster file for a team. The input XML documents are shown in graphical form at the top of the figure, and the resulting document appears below them.