Day 05: Integration & Deployment

Apache Spark & Production Scala

Apache Spark is the dominant distributed data processing framework, and it is written in Scala. Today you write Spark jobs, understand Spark's execution model, and look at Scala's role in the big data ecosystem.

~1 hour · Hands-on · Precision AI Academy

Today’s Objective


Spark's Execution Model

Spark distributes computation across a cluster. Dataset[T] is a typed, immutable distributed collection, and SparkSession is the entry point. Transformations (map, filter, groupBy) are lazy — they only build an execution plan. Actions (collect, count, write) trigger execution. Before execution, the Catalyst optimizer rewrites the plan for efficiency. DataFrame (an alias for Dataset[Row]) adds SQL-like operations and schema inference.
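The transformation/action split can be hard to internalize at first. As a rough analogy — plain Scala, no Spark required — a collection view also defers work until a terminal operation forces it:

```scala
// Rough analogy for Spark's laziness using a plain Scala view:
// "transformations" on a view only describe work; nothing runs
// until an "action" (here, toList) forces evaluation.
var evaluations = 0

val plan = (1 to 10).view
  .map { n => evaluations += 1; n * 2 } // "transformation": not run yet
  .filter(_ > 10)                       // still not run

// No elements have been processed yet -- `plan` is just a description.
val before = evaluations

val result = plan.toList // "action": forces the whole pipeline to run
```

After `toList`, `evaluations` is 10 and `result` is `List(12, 14, 16, 18, 20)` — the map ran exactly once per element, and only when forced. Spark's plans work the same way, except the plan is also optimized and distributed before running.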

Spark SQL and DataFrames

DataFrames support SQL queries: spark.sql("SELECT country, AVG(gdp) FROM nations GROUP BY country"). DataFrame API: df.select("col1", "col2"), df.filter($"age" > 18), df.groupBy("country").agg(avg("gdp"), count("*")). DataFrames are untyped at compile time but optimized by Catalyst. For type safety, convert to a Dataset[T]: df.as[CaseClass].
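A minimal sketch of the untyped-to-typed conversion — this assumes a local SparkSession and a students.csv whose columns match the case class; the file and field names are illustrative, not part of a complete program:

```scala
import org.apache.spark.sql.SparkSession

// Field names and types must line up with the CSV's columns.
case class Student(name: String, country: String, score: Double)

val spark = SparkSession.builder()
  .appName("typed-demo")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("students.csv")

// Untyped: column names are only checked at runtime.
df.select("name", "score").show()

// Typed: as[Student] fails fast if the schema doesn't match,
// and subsequent lambdas are checked by the compiler.
val ds = df.as[Student]
ds.filter(_.score > 90).map(_.name).show()
```

The trade-off: the typed API catches mistakes at compile time, while lambda-based operations are opaque to Catalyst, so column-expression code can sometimes optimize better.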

Scala in the Data Engineering Ecosystem

Scala powers major data infrastructure: Apache Kafka's core broker is written in Scala, Apache Flink offers a Scala API for stream processing, Databricks provides Spark as a managed service, and LinkedIn's Scalding wraps Hadoop MapReduce in a Scala DSL. sbt, the standard Scala build tool, manages dependencies. The Typelevel ecosystem provides functional Scala: Cats (type classes and instances), Cats Effect (IO monad, async runtime), fs2 (streaming), Doobie (database access), and Http4s (HTTP servers and clients). These libraries are favored for high-correctness production services.

import org.apache.spark.sql.{SparkSession, functions => F}

val spark = SparkSession.builder()
  .appName("PrecisionAI Analytics")
  .master("local[*]")
  .getOrCreate()

import spark.implicits._

// Load data
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("students.csv")

// Transformation pipeline (lazy -- nothing runs yet)
val result = df
  .filter($"score" >= 70)
  .withColumn("letter_grade",
    F.when($"score" >= 90, "A")
      .when($"score" >= 80, "B")
      .when($"score" >= 70, "C")
      .otherwise("F")) // unreachable after the filter above; kept as a safe fallback
  .groupBy("letter_grade")
  .agg(
    F.count("*").as("count"),
    F.avg("score").as("avg_score")
  )
  .orderBy("letter_grade")

// Action triggers execution
result.show()

// Write to Parquet
result.write.mode("overwrite").parquet("output/grades")

spark.stop()
Add .cache() to a DataFrame that is used multiple times in your Spark job. Without caching, Spark re-reads and re-processes the data for each action. With caching, it stores the result in memory/disk after the first action.
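For example, a sketch reusing the df from the job above (variable names are illustrative):

```scala
// Sketch: assumes `df` and `spark` from the job above.
// cache() is itself lazy -- it only marks the DataFrame for caching.
val passing = df.filter($"score" >= 70).cache()

passing.count() // first action: reads the CSV, filters, populates the cache
passing.show()  // second action: served from the cache, no re-read

passing.unpersist() // release the cached partitions when finished
```

A common habit is to cache right after an expensive read or join that feeds several downstream actions, and to unpersist once those actions have run.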
📝 Day 5 Exercise: Write Your First Spark Job
  1. Add Spark to build.sbt: libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.5.0"
  2. Create a SparkSession with master('local[*]') for local development
  3. Generate a CSV file with 10,000 rows of student data
  4. Write a Spark job that computes grade distribution and writes results to Parquet
  5. Run with: sbt run — check the Spark UI at http://localhost:4040 while it runs
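For step 3, one way to generate the input file is plain Scala with no Spark needed — the file name, column names, and score range here are just a suggestion:

```scala
import java.io.PrintWriter
import scala.util.Random

// Generate students.csv with 10,000 rows of synthetic student data.
val rng = new Random(42) // fixed seed for reproducible runs

val rows = (1 to 10000).map { i =>
  val score = 40 + rng.nextInt(61) // scores in 40..100
  s"student_$i,Country${rng.nextInt(5)},$score"
}

val writer = new PrintWriter("students.csv")
try {
  writer.println("name,country,score") // header row for option("header", "true")
  rows.foreach(writer.println)
} finally writer.close()
```

The fixed seed means each run produces the same file, which makes it easier to sanity-check your Spark job's output while iterating.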

Supporting Resources

Go deeper with these references.

Scala Docs — Official Scala Documentation: complete Scala language specification, standard library reference, and tutorials.
Coursera — Functional Programming in Scala: Martin Odersky's course on functional programming principles.
Apache — Apache Spark Documentation: official docs for Spark 3.x with Scala and Python API references.

Day 5 Checkpoint

Before moving on, make sure you can answer these without looking:

  1. What is the difference between a transformation and an action in Spark, and which one triggers execution?
  2. What does the Catalyst optimizer do, and when does it run?
  3. How does a Dataset[T] differ from a DataFrame, and how do you convert one to the other?
  4. When does .cache() help, and what happens on the first action after it?

Course Complete
Return to Scala in 5 Days Overview