Dataflow

Dataflow is a fully managed data processing service for executing Apache Beam pipelines within the Google Cloud Platform (GCP) ecosystem. It supports both streaming and batch processing with low latency, and as a cloud-based service it manages the dynamic allocation of resources, allowing organizations to focus on the design and implementation of their applications rather than on infrastructure.

Dataflow - Powered by Stacktics

At its core, Dataflow uses the Apache Beam SDK, which builds on the concepts of MapReduce operations (popularized by Hadoop) for batch processing and data windowing for real-time streaming. The MapReduce framework works by partitioning the input dataset into independent chunks, which are then processed in parallel by worker nodes distributed over several machines.
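To make the MapReduce pattern concrete, here is a minimal sketch using the Beam Python SDK: the map phase splits lines into words, and the reduce phase sums counts per word. It runs locally on Beam's DirectRunner, and the tiny in-memory input is purely illustrative.

```python
import apache_beam as beam

# Minimal MapReduce-style word count; the in-memory input is a placeholder.
with beam.Pipeline() as pipeline:
    (
        pipeline
        | "Read" >> beam.Create(["the cat sat", "the dog ran"])
        | "Map" >> beam.FlatMap(lambda line: line.split())  # map: line -> words
        | "Pair" >> beam.Map(lambda word: (word, 1))        # emit (key, 1) pairs
        | "Reduce" >> beam.CombinePerKey(sum)               # reduce: sum per key
        | "Print" >> beam.Map(print)                        # e.g. ('the', 2)
    )
```

On Dataflow itself, the same pipeline code is simply submitted with the DataflowRunner, and the service distributes the map and reduce stages across its managed workers.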


Dataflow supports an additional collection of software development kits and application programming interfaces, such as REST and .NET, that allow software developers the flexibility to design and implement their streaming or batch-based data pipelines.


Additional features provided by Dataflow include, but are not limited to:

  • Horizontal autoscaling of worker resources, for cost-efficient, enterprise-level performance on a fault-tolerant cluster architecture
  • Flexible job scheduling, which places lower-priority batch jobs in a queue with guaranteed execution within a six-hour window (launch options for both features are sketched after this list)
  • An open-source programming model: the Beam SDK is developed in the open, so individuals can contribute to it
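Both the autoscaling and the flexible-scheduling behaviors are controlled through pipeline options when a job is submitted. The following is a plausible sketch for the Beam Python SDK; the project ID, region, and bucket names are placeholders, and the exact flag names should be checked against the current Dataflow documentation.

```python
from apache_beam.options.pipeline_options import PipelineOptions

# Hedged sketch of Dataflow launch options; all resource names are placeholders.
options = PipelineOptions(
    runner="DataflowRunner",
    project="my-gcp-project",                  # placeholder project ID
    region="us-central1",
    temp_location="gs://my-bucket/tmp",        # placeholder staging bucket
    autoscaling_algorithm="THROUGHPUT_BASED",  # horizontal autoscaling
    max_num_workers=10,                        # cap on scaled-out workers
    flexrs_goal="COST_OPTIMIZED",              # opt in to queued (FlexRS) scheduling
)
```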


Most importantly, Dataflow is part of Google Cloud, so it integrates seamlessly with other services such as Pub/Sub, Datastore, BigQuery, and Bigtable.
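As an illustration of that integration, the sketch below reads events from a hypothetical Pub/Sub topic, applies one-minute fixed windows, counts events per window, and appends the results to a hypothetical BigQuery table. The topic, table, and schema are assumptions made for the example.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms.window import FixedWindows

options = PipelineOptions(streaming=True)  # Pub/Sub sources require streaming mode

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | "ReadEvents" >> beam.io.ReadFromPubSub(
            topic="projects/my-gcp-project/topics/events")  # placeholder topic
        | "Window" >> beam.WindowInto(FixedWindows(60))     # 60-second windows
        | "One" >> beam.Map(lambda _msg: ("events", 1))
        | "Count" >> beam.CombinePerKey(sum)                # events per window
        | "Format" >> beam.Map(lambda kv: {"name": kv[0], "count": kv[1]})
        | "WriteBQ" >> beam.io.WriteToBigQuery(
            "my-gcp-project:analytics.event_counts",        # placeholder table
            schema="name:STRING,count:INTEGER",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        )
    )
```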

Connect with Stacktics and get the most out of data processing with Dataflow. Have a question? Get an answer. We would be happy to chat.