Nov-16-2021

A team of collaborators from the U.S. Department of Energy’s Oak Ridge National Laboratory, Google Inc., Snowflake Inc. and Ververica GmbH has tested a computing concept that could help speed up real-time processing of data streamed on mobile and other electronic devices.

The concept explores the function of watermarks, which are considered the most efficient mechanism for tracking the completeness of streaming data processing. Watermarks allow new tasks to be processed immediately after prior tasks are completed.
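As a rough illustration of the idea, the sketch below counts events in fixed-size time windows and uses a watermark to decide when a window is complete enough to emit. It is a simplified teaching example and not the implementation used in Apache Flink, Google Cloud Dataflow or the team's paper. The function name `tumbling_window_counts`, its parameters and the rule "watermark = latest event time seen minus a fixed delay" are all assumptions chosen for the example.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_size, max_delay):
    """Count events per fixed-size window, emitting a window once the
    watermark passes its end.

    events: iterable of (event_time, value) pairs in arrival order.
    Hypothetical heuristic: the watermark trails the largest event time
    seen so far by `max_delay` time units.
    """
    counts = defaultdict(int)   # window start -> events counted so far
    emitted = []                # (window start, count) in emission order
    dropped = []                # events that arrived after their window closed
    watermark = float("-inf")

    for event_time, value in events:
        window_start = (event_time // window_size) * window_size

        if window_start + window_size <= watermark:
            # The watermark already declared this window complete.
            dropped.append((event_time, value))
        else:
            counts[window_start] += 1

        # Advance the watermark; it never moves backward.
        watermark = max(watermark, event_time - max_delay)

        # Emit every open window whose end the watermark has reached.
        ready = sorted(s for s in counts if s + window_size <= watermark)
        for start in ready:
            emitted.append((start, counts.pop(start)))

    return emitted, dropped, dict(counts)


if __name__ == "__main__":
    stream = [(1, "a"), (4, "b"), (12, "c"), (3, "d"),
              (15, "e"), (27, "f"), (2, "g")]
    print(tumbling_window_counts(stream, window_size=10, max_delay=5))
```

In this sketch an out-of-order event such as `(3, "d")` is still counted because the watermark has not yet passed the end of its window, while `(2, "g")` arrives too late and is set aside. A larger `max_delay` tolerates more disorder but postpones results, and a smaller one does the reverse.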

To better understand how watermarks might be useful, the researchers studied the computation of data streams on two different stream processing systems. They presented the results at the 47th International Conference on Very Large Data Bases, held in August in Copenhagen, Denmark, and virtually. The paper they presented is one of the first to formally test and examine watermarks in a basic research setting.

“There hasn’t been a clear, efficient mechanism for tracking phenomena of interest in a data stream over time and across different data processing pipelines,” said Edmon Begoli, AI Systems section head in ORNL’s National Security Sciences Directorate. “Watermarking is an up-and-coming concept that advances the state-of-the-art in stream processing frameworks.”

Computer scientists are continually looking for ways of studying real-time data so they can better anticipate consumer needs, estimate supply and demand, and deliver more accurate information to consumers. But over the last 10 years, data management has grown increasingly challenging. This challenge is in part due to the jump in real-time computing and interactions on social media sites, in autonomous platforms like self-driving cars and on mobile devices.

To determine how different platforms might effectively process real-time data, the team compared watermarks on the two platforms that currently offer the most advanced implementations of them: Apache Flink, an open-source stream- and batch-processing framework, and Google Cloud Dataflow, a streaming analytics service.