
Introduction to Apache Spark

1 Introduction

Apache Spark is a batch computing framework, commonly used as a replacement for MapReduce in Hadoop. It can be deployed on Apache YARN, Mesos, or Kubernetes.

1.1 Comparison with Flink

By contrast, Flink is a stream computing framework used for millisecond-latency computing.

2 Use Case

2.1 Interactive analysis with the Spark shell
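
A minimal sketch of an interactive session. Inside ./bin/spark-shell a SparkSession named spark is already available; the file path below is illustrative.

    // Typed at the scala> prompt of ./bin/spark-shell, where `spark` is pre-created.
    val textFile = spark.read.textFile("README.md")           // Dataset[String], path is illustrative
    textFile.count()                                          // total number of lines
    textFile.filter(line => line.contains("Spark")).count()   // lines that mention "Spark"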

2.2 Self-contained applications written with the Spark API
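
A minimal sketch of a self-contained application; the object name, application name, and input path are illustrative. Such an application would typically be packaged (e.g. with sbt) and launched with spark-submit.

    import org.apache.spark.sql.SparkSession

    // A small self-contained Spark application (names and paths are illustrative).
    object SimpleApp {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("Simple Application").getOrCreate()
        val logData = spark.read.textFile("README.md").cache()
        val numAs = logData.filter(line => line.contains("a")).count()
        val numBs = logData.filter(line => line.contains("b")).count()
        println(s"Lines with a: $numAs, lines with b: $numBs")
        spark.stop()
      }
    }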

3 Programming Guides

3.1 RDD

The core and older low-level API.
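
A small sketch of the RDD API through SparkContext, using the classic word count; the input path is illustrative.

    import org.apache.spark.{SparkConf, SparkContext}

    // Word count on the low-level RDD API (input path is illustrative).
    object RddWordCount {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("RddWordCount"))
        val counts = sc.textFile("input.txt")
          .flatMap(_.split("\\s+"))        // split lines into words
          .map(word => (word, 1))          // pair each word with a count of 1
          .reduceByKey(_ + _)              // sum the counts per word
        counts.take(10).foreach(println)
        sc.stop()
      }
    }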

3.2 Spark SQL, Datasets, DataFrames

Terms

  • Dataset is a newer interface, added in Spark 1.6.
  • DataFrame is a Dataset organized into named columns, conceptually equivalent to a table in a relational database.

Getting Started

  1. Create SparkSession
  2. Create DataFrames (see the sketch below)
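
A minimal sketch of these two steps; the JSON file and its schema (name, age) are illustrative.

    import org.apache.spark.sql.SparkSession

    // Step 1: create (or reuse) a SparkSession, the entry point to the Dataset/DataFrame API.
    val spark = SparkSession.builder()
      .appName("Spark SQL getting started")
      .master("local[*]")                  // local mode; adjust for a real cluster
      .getOrCreate()

    // Step 2: create a DataFrame, e.g. from a JSON file (path and schema are illustrative).
    val df = spark.read.json("people.json")
    df.printSchema()
    df.show()

    // DataFrames can also be queried with SQL once registered as a view.
    df.createOrReplaceTempView("people")
    spark.sql("SELECT name FROM people WHERE age > 21").show()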

3.3 Structured Streaming

Built on Datasets and DataFrames; a newer streaming API than DStreams.
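
A minimal sketch of a Structured Streaming word count that reads lines from a socket (host and port are illustrative, e.g. fed by nc -lk 9999) and keeps running counts.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("StructuredWordCount").getOrCreate()
    import spark.implicits._

    // Source: a streaming DataFrame of lines from a socket (host/port are illustrative).
    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load()

    // The same Dataset/DataFrame operations apply to the streaming DataFrame.
    val wordCounts = lines.as[String]
      .flatMap(_.split(" "))
      .groupBy("value")
      .count()

    // Sink: print the running counts to the console until the query is stopped.
    val query = wordCounts.writeStream
      .outputMode("complete")
      .format("console")
      .start()
    query.awaitTermination()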

3.4 Spark Streaming using DStreams

Built on DStreams, the older streaming API.
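
For contrast, a sketch of the same word count on the DStream API, processed in 1-second micro-batches (host and port are illustrative).

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // A StreamingContext with a 1-second batch interval.
    val conf = new SparkConf().setAppName("DStreamWordCount")
    val ssc = new StreamingContext(conf, Seconds(1))

    // DStream of lines from a socket (host/port are illustrative).
    val lines = ssc.socketTextStream("localhost", 9999)
    val counts = lines.flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    counts.print()

    ssc.start()
    ssc.awaitTermination()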

3.5 MLlib

Machine learning algorithms.
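
A minimal sketch of fitting one MLlib model (logistic regression) with the DataFrame-based API; the LIBSVM input path and hyperparameters are illustrative.

    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("MLlibExample").getOrCreate()

    // Load training data in LIBSVM format (path is illustrative).
    val training = spark.read.format("libsvm").load("sample_libsvm_data.txt")

    // Fit a logistic regression model (hyperparameters are illustrative).
    val lr = new LogisticRegression()
      .setMaxIter(10)
      .setRegParam(0.01)
    val model = lr.fit(training)

    println(s"Coefficients: ${model.coefficients}  Intercept: ${model.intercept}")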

3.6 GraphX

Graph computing.
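
A minimal sketch of building a small property graph and running PageRank on it; the vertex and edge data are illustrative.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.graphx.{Edge, Graph}

    val sc = new SparkContext(new SparkConf().setAppName("GraphXExample"))

    // Vertices are (id, attribute) pairs; edges connect vertex ids (data is illustrative).
    val vertices = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
    val edges = sc.parallelize(Seq(
      Edge(1L, 2L, "follows"),
      Edge(2L, 3L, "follows"),
      Edge(3L, 1L, "follows")))

    val graph = Graph(vertices, edges)

    // Run PageRank until convergence and print the resulting vertex ranks.
    val ranks = graph.pageRank(0.0001).vertices
    ranks.collect().foreach(println)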