Distributed Batch Processing with Apache Spark

Farhad Mehta

13.09.2017 09:00 - 17:00, Hochschule Luzern – Informatik
Max. participants: 15
Not yet confirmed to take place


If you have ever wondered how a search engine can reply in under a second, this workshop is for you. Distributed batch processing is central to big data, machine learning and data science in general. It is also an application area in which the benefits of functional programming can be demonstrated elegantly. Apache Spark is an open-source cluster-computing framework for large-scale data processing. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance. The Spark API accepts programs written in Java, Scala, Python or R. Apache Spark can be deployed on many different types of clusters, although deployment will not be a topic of this workshop.


The workshop starts with an introduction to distributed batch processing using the well-known map-reduce framework. We then dive right into how large amounts of data are processed using Apache Spark. The focus is on the following concepts:

  • Resilient Distributed Datasets (RDDs),
  • the properties of RDDs, and
  • operations on RDDs.

Hands-on programming sessions using the Apache Spark Scala API provide a practical understanding of these concepts.
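
To give a first impression of the map-reduce idea mentioned above, here is a minimal sketch of a word count written with ordinary Scala collections. It needs neither Spark nor a cluster and is not taken from the workshop material: the object name `WordCountSketch`, the helper names `mapPhase` and `reducePhase`, and the sample sentences are assumptions made purely for illustration. Spark RDDs offer operations of the same shape (for example `flatMap`, `map` and `reduceByKey`), with the data split across the machines of a cluster.

```scala
// Illustrative sketch only: all names and sample data are hypothetical.
object WordCountSketch {

  // Map phase: split each line into words and emit one (word, 1) pair per word.
  def mapPhase(lines: Seq[String]): Seq[(String, Int)] =
    lines
      .flatMap(_.toLowerCase.split("\\s+"))
      .filter(_.nonEmpty)
      .map(word => (word, 1))

  // Reduce phase: group the pairs by word and sum the counts in each group.
  def reducePhase(pairs: Seq[(String, Int)]): Map[String, Int] =
    pairs
      .groupBy { case (word, _) => word }
      .map { case (word, group) => (word, group.map(_._2).sum) }

  // The whole batch job is the composition of the two phases.
  def wordCount(lines: Seq[String]): Map[String, Int] =
    reducePhase(mapPhase(lines))

  def main(args: Array[String]): Unit = {
    val lines = Seq("to be or not to be", "to do is to be")
    wordCount(lines).toSeq.sortBy(_._1).foreach {
      case (word, count) => println(s"$word: $count")
    }
  }
}
```

For the two sample lines, `wordCount` counts `to` four times and `be` three times. Because neither phase has side effects, the map phase can run independently on separate chunks of the input, which is the property a distributed framework relies on.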


After the workshop, all participants can:

  • evaluate whether a specific problem can be solved using distributed batch processing.
  • explain how a batch process is executed in a distributed environment.
  • implement a simple distributed batch process in Scala with Apache Spark.


Programmers who are curious as to how large amounts of data can be processed quickly and elegantly.


  • You should be able to program in a general-purpose programming language (e.g. Java, C#).
  • Scala will be used in the programming exercises. Prior knowledge of Scala is not necessary; the language will be briefly introduced during the workshop.
  • The workshop will be held in English.


  • Please bring your own laptop with a Wi-Fi adapter and a modern web browser to use during the practical sessions.
  • To make sure that you can use your laptop for the practical session, please access https://try.jupyter.org and try clicking through one of the notebooks there.


Have a look at Farhad Mehta's website: https://www.ifs.hsr.ch/Farhad-Metha.13846.0.html