Showing posts from January, 2016

Introducing Apache Beam (dataflow)

As part of the Google Cloud ecosystem, Google created the Dataflow SDK. Now, as a joint effort by Google, Talend, Cask, data Artisans, PayPal, and Cloudera, we are proposing Apache Dataflow to the Apache Incubator. I’m proud, glad, and excited to be the champion of the Apache Dataflow proposal.

But first, I would like to thank James Malone and Frances Perry from Google for their help, their open-mindedness, and the interesting discussions. It’s really great to work with them! Let’s take a quick tour of what Apache Dataflow will be.

Architecture and Programming Model

Imagine you have a Hadoop cluster where you run MapReduce jobs. Now you want to “migrate” these jobs to Spark: you have to refactor all of them, which takes a lot of work and costs a lot. After that, consider the effort and cost if you want to move to a new platform like Flink: you have to refactor your jobs again.

Dataflow aims to provide an abstraction layer between your code and the execution runtime. The SDK allows you to use a