Showing posts from 2016

Apache Beam in action: same code, several execution engines

If the previous article was an introduction to Apache Beam, it's now time to see some of the key features it provides. The timing is perfect, as Apache Beam 0.2.0-incubating has just been released. This article will show a first pipeline use case and execute the same pipeline code on different execution engines.

Context: GDELT analyses

For this article, we are going to create a pipeline to analyse GDELT data and count the number of events per location in the world. The GDELT project gathers all events happening in the world. It creates daily CSV files, containing one line per event. For instance, an event looks like:

545037848 20150530 201505 2015 2015.4110 JPN TOKYO JPN 1 046 046 04 1 7.0 15 1 15 -1.06163552535792 0
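Before wiring this into a Beam pipeline, the core counting logic can be sketched in plain Python. This is only an illustration, not the article's actual pipeline code: the column layout and the location column index used below are assumptions for the sake of the example, not the real GDELT schema.

```python
from collections import Counter

# Index of the column assumed to hold the event location.
# This value is illustrative, not the real GDELT schema position.
LOCATION_COLUMN = 6

def count_events_per_location(lines):
    """Count how many events occurred at each location."""
    counts = Counter()
    for line in lines:
        fields = line.split("\t")
        if len(fields) > LOCATION_COLUMN:
            counts[fields[LOCATION_COLUMN]] += 1
    return dict(counts)

# Tab-separated sample lines, shaped loosely like GDELT events.
sample = [
    "545037848\t20150530\t201505\t2015\t2015.4110\tJPN\tTOKYO",
    "545037849\t20150530\t201505\t2015\t2015.4110\tFRA\tPARIS",
    "545037850\t20150531\t201505\t2015\t2015.4137\tJPN\tTOKYO",
]
print(count_events_per_location(sample))  # {'TOKYO': 2, 'PARIS': 1}
```

In a Beam pipeline, this grouping-and-counting step would be expressed as transforms over a PCollection rather than a loop, which is what lets the same code run on several engines.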

Introducing Apache Beam (dataflow)

As part of the Google Cloud ecosystem, Google created the Dataflow SDK. Now, in a joint effort by Google, Talend, Cask, data Artisans, PayPal, and Cloudera, we are proposing Dataflow to the Apache Incubator. I'm proud, glad, and excited to be the champion of the Apache Dataflow proposal. But first, I would like to thank James Malone and Frances Perry from Google for their help and for always open-minded and interesting discussions. It's really great to work with them!

Let's take a quick tour of what Apache Dataflow will be.

Architecture and Programming Model

Imagine you have a Hadoop cluster where you use MapReduce jobs. Now you want to "migrate" these jobs to Spark: you have to refactor all your jobs, which requires a lot of work and costs a lot. And after that, consider the effort and cost if you want to change to a new platform like Flink: you have to refactor your jobs again. Dataflow aims to provide an abstraction layer between your code and the execution runtime. The SDK allows you to use a
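The abstraction-layer idea described above can be sketched in a few lines. This is not the real Dataflow/Beam API; the class and method names below are invented to show the pattern: the pipeline is built once, and only the runner passed at execution time decides where it runs.

```python
class DirectRunner:
    """Toy runner that executes each transform in-process.
    A Spark or Flink runner would translate the same transforms
    to its own engine instead."""
    def run(self, transforms, data):
        for transform in transforms:
            data = transform(data)
        return data

class Pipeline:
    """Holds a chain of transforms, independent of any engine."""
    def __init__(self):
        self.transforms = []

    def apply(self, transform):
        self.transforms.append(transform)
        return self

    def run(self, runner, data):
        # Swapping the runner does not touch the pipeline code above,
        # only this call site.
        return runner.run(self.transforms, data)

p = Pipeline()
p.apply(lambda events: [e.upper() for e in events])
p.apply(sorted)
print(p.run(DirectRunner(), ["tokyo", "paris"]))  # ['PARIS', 'TOKYO']
```

Migrating from one engine to another then means changing one argument, not refactoring every job, which is exactly the cost the paragraph above describes.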