PySpark Interview Questions

Top PySpark Interview Questions and Answers

May 25th, 2026

PySpark is one of the most in-demand skills in the big data and data engineering world right now. Companies like Netflix, Uber, LinkedIn, and Amazon use Apache Spark at scale, and PySpark is their go-to tool for writing Spark jobs in Python. If you are preparing for a data engineering, data science, or big data analyst interview, you will almost certainly face PySpark questions.

With hands-on experience working on data processing pipelines, handling large-scale datasets, and optimizing Spark jobs for performance, I have seen how critical PySpark is in real-world data engineering workflows. From building ETL pipelines to improving query efficiency and managing distributed systems, practical exposure has helped me understand not just the concepts but how they are applied in production environments.

This article covers 40 PySpark interview questions and answers, organized into four clear sections: beginner, intermediate, experienced, and scenario-based. Each answer is detailed enough to help you understand the concept, not just memorize a definition. Whether you are just getting started or you are a senior engineer brushing up before a big interview, this guide has you covered.

Let us get into it.

PySpark Interview Questions for Freshers

These questions test your foundational understanding of PySpark. Interviewers ask these to check if you understand the core concepts before moving to advanced topics.

Q1. What is PySpark and how is it different from Apache Spark?

PySpark is the Python API for Apache Spark. Apache Spark is the underlying distributed computing engine written in Scala. PySpark lets you interact with Spark using Python code instead of Scala or Java.

The key difference is the language interface. Spark itself runs on the JVM (Java Virtual Machine). When using PySpark, your Python code communicates with the JVM through a bridge called Py4J. This means you get Python's readability and ecosystem while still running jobs on Spark's distributed engine.

Performance-wise, DataFrame and SQL operations run inside the JVM regardless of the language, so PySpark is close to Scala for most real-world jobs. The Python-JVM overhead mainly shows up in RDD operations that call Python functions and in Python UDFs, and the productivity gains from Python usually outweigh that difference.

Q2. What is an RDD in PySpark?

RDD stands for Resilient Distributed Dataset. It is the fundamental data structure in Apache Spark. An RDD is an immutable, distributed collection of objects that you can process in parallel across a cluster.

Three core properties define an RDD:

  • Resilient: It recovers automatically from node failures using lineage information.
  • Distributed: Data is spread across multiple nodes in a cluster.
  • Dataset: It represents a collection of partitioned data with values.

You create an RDD in two ways:

# From an existing Python collection
rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5])

# From an external data source
rdd = spark.sparkContext.textFile("hdfs://path/to/file.txt")

RDDs support two types of operations: transformations (like map, filter, flatMap) that return a new RDD, and actions (like count, collect, saveAsTextFile) that trigger actual computation.

Q3. What is the difference between a transformation and an action in PySpark?

Transformations describe how to process data without executing anything; each one returns a new RDD or DataFrame. Actions trigger execution of the recorded transformations and either return a result to the driver or write output.

Basis | Transformation | Action
Execution | Lazy (not executed immediately) | Triggers execution
Purpose | Defines new RDD or DataFrame | Produces result or output
Return Type | New RDD or DataFrame | Returns value or writes output
Optimization | Enables DAG optimization | Executes optimized DAG
Examples | map(), filter(), select() | collect(), count(), show()

Q4. What is a SparkSession and why do you need it?

SparkSession is the single entry point to all Spark functionality in PySpark 2.0 and later. Before Spark 2.0, you had to create separate contexts for different APIs: SparkContext for RDDs, SQLContext for DataFrames, and HiveContext for Hive. SparkSession replaced all these with one unified object.

You create a SparkSession like this:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("MyApp") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

The .getOrCreate() method creates a new session or returns an existing one. You use SparkSession to read data, run SQL queries, create DataFrames, and configure your Spark application.

Q5. What is the difference between an RDD, a DataFrame and a Dataset in PySpark?

RDD, DataFrame, and Dataset are Spark's three data abstractions. They differ in structure, performance, and ease of use, depending on whether you want low-level access to the data or high-level, optimized operations. Note that the typed Dataset API exists only in Scala and Java; PySpark exposes RDDs and DataFrames, and a PySpark DataFrame corresponds to Dataset[Row] on the JVM.

Basis | RDD | DataFrame | Dataset
Level | Low-level | High-level | High-level and typed
Schema | No schema | Has schema | Has schema
Optimization | No optimization | Uses Catalyst optimizer | Uses Catalyst optimizer
Type Safety | Compile-time in Scala/Java; none in Python | Not type safe | Type safe (Scala/Java only)
Performance | Slower | Faster | Fast (JVM languages only)

Q6. What is lazy evaluation in Spark and why does it matter?

Lazy evaluation means Spark does not execute transformations immediately when you write them. Instead, it builds a DAG (Directed Acyclic Graph) of all the transformations. It only executes that DAG when you call an action.

Why does this matter? Because Spark can look at your entire pipeline of transformations together and optimize the execution plan before running a single line.

For example, if you filter and then select columns, Spark can push the filter earlier in the pipeline to reduce the amount of data it processes.

# Spark just records these — no computation yet
df = spark.read.csv("large_file.csv", header=True)
df_filtered = df.filter(df["age"] > 30)
df_selected = df_filtered.select("name", "age")

# This action triggers the full optimized execution
df_selected.show()

Without lazy evaluation, Spark would run each step immediately without any opportunity to optimize. Lazy evaluation is one of the main reasons Spark is so fast.
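
The record-first, run-later behavior can be imitated in plain Python. The sketch below is only a toy model of the idea, not how Spark is implemented; LazyPipeline and its methods are hypothetical names chosen for illustration, and it runs without a Spark installation.

```python
class LazyPipeline:
    """Toy model of lazy evaluation: transformations are recorded, actions run them."""

    def __init__(self, source, steps=()):
        self._source = source        # any iterable standing in for the input data
        self._steps = tuple(steps)   # recorded transformations, not yet executed

    def filter(self, predicate):
        # Transformation: return a new pipeline with one more recorded step.
        return LazyPipeline(self._source, self._steps + (("filter", predicate),))

    def map(self, func):
        # Transformation: also just recorded.
        return LazyPipeline(self._source, self._steps + (("map", func),))

    def collect(self):
        # Action: only now is the data actually traversed.
        rows = iter(self._source)
        for kind, func in self._steps:
            rows = filter(func, rows) if kind == "filter" else map(func, rows)
        return list(rows)


calls = []

def is_over_30(person):
    calls.append(person["name"])   # side effect lets us observe when work happens
    return person["age"] > 30

people = [{"name": "Ana", "age": 41}, {"name": "Bo", "age": 25}]
pipeline = LazyPipeline(people).filter(is_over_30).map(lambda p: p["name"])

work_before_action = len(calls)   # still 0: nothing has run yet
result = pipeline.collect()       # the action triggers execution
```

Because the whole chain is known before anything runs, an engine built this way has the chance to reorder or fuse steps, which is the opening Spark's optimizer uses.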

Q7. What are PySpark's main data sources for reading and writing data?

PySpark supports a wide range of data sources through its unified spark.read and df.write API:

# Reading different formats
df_csv    = spark.read.csv("path/file.csv", header=True, inferSchema=True)
df_json   = spark.read.json("path/file.json")
df_parquet = spark.read.parquet("path/file.parquet")
df_orc    = spark.read.orc("path/file.orc")
df_jdbc   = spark.read.jdbc(url=jdbc_url, table="table_name", properties=props)

# Writing data
df.write.csv("output/path")
df.write.parquet("output/path")
df.write.mode("overwrite").parquet("output/path")

Parquet is the most commonly used format in production Spark pipelines. It is columnar, compressed, and reads very fast because Spark only reads the columns it needs.

Q8. How does PySpark handle missing data?

PySpark provides several built-in methods to handle null and missing values through the DataFrame.na accessor:

# Drop rows with any null values
df.na.drop()

# Drop rows only if ALL columns are null
df.na.drop(how="all")

# Drop rows where specific columns are null
df.na.drop(subset=["age", "salary"])

# Fill null values with a constant
df.na.fill(0)
df.na.fill({"age": 0, "name": "Unknown"})

# Replace specific values
df.na.replace(["old_value"], ["new_value"], subset=["column_name"])

For more advanced scenarios, you can use PySpark's Imputer from pyspark.ml.feature to fill missing numeric values with the mean or median:

from pyspark.ml.feature import Imputer

imputer = Imputer(
    inputCols=["age", "salary"],
    outputCols=["age_imputed", "salary_imputed"]
).setStrategy("mean")

model = imputer.fit(df)
df_imputed = model.transform(df)

Q9. What is a partition in PySpark and why does it matter?

A partition is a chunk of data stored on a single node in the cluster. Spark processes each partition in parallel. The number of partitions directly affects the parallelism of your job.

If you have too few partitions, some nodes sit idle and your job runs slower than it should. If you have too many, you create excessive scheduling overhead.

You can inspect and control partitions with these methods:

# Check current number of partitions
print(df.rdd.getNumPartitions())

# Repartition — increases or decreases partitions, causes a full shuffle
df_repartitioned = df.repartition(10)

# Coalesce — only decreases partitions, avoids full shuffle (faster for reducing)
df_coalesced = df.coalesce(4)

Q10. What is PySpark's SparkContext and how does it relate to SparkSession?

SparkContext (often written as sc) is the original entry point to Spark functionality. It represents the connection to a Spark cluster and is responsible for coordinating distributed execution.

SparkSession (available from Spark 2.0 onwards) wraps SparkContext and also provides access to DataFrame, SQL, and Streaming APIs. When you create a SparkSession, Spark automatically creates an underlying SparkContext that you can access via spark.sparkContext.

spark = SparkSession.builder.appName("MyApp").getOrCreate()

# Access the underlying SparkContext
sc = spark.sparkContext

# You can still use sc for RDD operations
rdd = sc.parallelize([1, 2, 3])

In modern PySpark development, you always start with SparkSession. You only access SparkContext directly when you need to work with RDDs or set low-level Spark configurations.


PySpark Interview Questions for Intermediate Level

These questions go deeper. Interviewers ask them to check if you understand how Spark actually works under the hood and if you can write efficient PySpark code.

Q1. What is the Catalyst Optimizer in PySpark?

The Catalyst Optimizer is Spark SQL's built-in query optimization engine. It automatically optimizes your DataFrame and SQL queries before executing them. When you write a DataFrame transformation, Catalyst converts it into an optimized execution plan through four phases:

  • Analysis: Resolves column names and table references against the schema
  • Logical Optimization: Applies rule-based optimizations like predicate pushdown, constant folding, and column pruning
  • Physical Planning: Converts the logical plan into one or more physical plans and selects the best one using a cost model
  • Code Generation: Generates JVM bytecode for the selected plan

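Q2. What is the difference between groupByKey and reduceByKey in PySpark?

Both groupByKey and reduceByKey aggregate key-value RDDs, but they differ in how much data they move between nodes. groupByKey shuffles every record and aggregates afterwards, while reduceByKey combines values within each partition before the shuffle. (The DataFrame groupBy().agg() API also performs partial aggregation before the shuffle.)

Basis | groupByKey | reduceByKey
Shuffle | High (all records) | Low (pre-combined records)
Aggregation | After the shuffle | Partly before the shuffle (map-side combine), finished after it
Memory Usage | Higher | Lower
Performance | Slower | Faster
Use Case | Requires all values per key | Requires only an aggregated value per key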
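
The difference in shuffle volume can be illustrated with a small plain-Python simulation. This is a sketch under simplifying assumptions (two in-memory "partitions", a sum as the aggregation); the function names are hypothetical and nothing here calls Spark.

```python
from collections import defaultdict

def shuffle_group_by_key(partitions):
    """groupByKey-style: every (key, value) record crosses the shuffle."""
    shuffled = [record for part in partitions for record in part]
    grouped = defaultdict(list)
    for key, value in shuffled:
        grouped[key].append(value)
    totals = {key: sum(values) for key, values in grouped.items()}
    return totals, len(shuffled)

def shuffle_reduce_by_key(partitions):
    """reduceByKey-style: combine within each partition first, then shuffle."""
    shuffled = []
    for part in partitions:
        local = defaultdict(int)
        for key, value in part:
            local[key] += value          # map-side combine
        shuffled.extend(local.items())   # one record per key per partition
    totals = defaultdict(int)
    for key, value in shuffled:
        totals[key] += value             # final merge after the shuffle
    return dict(totals), len(shuffled)

partitions = [
    [("a", 1), ("a", 2), ("b", 3)],
    [("a", 4), ("b", 5), ("b", 6)],
]
group_totals, group_records = shuffle_group_by_key(partitions)
reduce_totals, reduce_records = shuffle_reduce_by_key(partitions)
```

Both approaches produce the same totals, but the second one sends 4 records through the shuffle instead of 6. The gap widens as the number of values per key grows.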

Q3. What is a shuffle in Spark and why should you minimize it?

A shuffle is when Spark needs to redistribute data across partitions, usually across different nodes in the cluster. Shuffles happen during operations like groupBy, join, repartition, distinct, and orderBy.

Shuffles are expensive because they involve:

  • Writing data to disk on the source partitions
  • Transferring data over the network
  • Reading data from disk on the destination partitions

You can minimize shuffles in several ways:

  • Use broadcast join instead of a regular join when one table is small (see Q5 in this section)
  • Use reduceByKey instead of groupByKey for RDDs
  • Use coalesce instead of repartition when reducing the number of partitions
  • Filter data as early as possible to reduce what needs to be shuffled
  • Cache data that you use multiple times to avoid recomputation

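You can see if a shuffle is happening by looking for "Exchange" operators in the query plan (via df.explain() or the Spark UI's SQL tab). In the Stages tab, each shuffle shows up as a boundary between stages, with shuffle read and write sizes.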

Q4. How does caching and persistence work in PySpark?

By default, Spark recomputes a DataFrame or RDD from scratch every time you call an action on it. If you use the same data multiple times, you can cache it to avoid repeated computation.

cache() stores data using the default storage level: MEMORY_AND_DISK for DataFrames and MEMORY_ONLY for RDDs.

persist() lets you choose the storage level explicitly:

from pyspark import StorageLevel

# Cache with the default storage level
df.cache()

# Or use persist with a specific storage level
df.persist(StorageLevel.MEMORY_AND_DISK)
df.persist(StorageLevel.MEMORY_ONLY)
df.persist(StorageLevel.DISK_ONLY)

# Always unpersist when done
df.unpersist()

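Storage levels at a glance:

Level | Memory | Disk | Serialized
MEMORY_ONLY | Yes | No | No
MEMORY_AND_DISK | Yes | Yes (spill) | No
DISK_ONLY | No | Yes | Yes
MEMORY_ONLY_SER | Yes | No | Yes

These flags describe the JVM storage levels. PySpark always stores Python objects in serialized (pickled) form, and MEMORY_ONLY_SER is available only in the Scala and Java APIs.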

Use caching when you call multiple actions on the same DataFrame, especially in iterative machine learning algorithms. Do not cache everything — memory is finite, and unnecessary caching can push useful data out of cache.
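
The cost of recomputation, and what caching saves, can be shown with a toy model in plain Python. ToyDataset is a hypothetical class written for this illustration; it only counts how often the underlying computation runs and does not reflect Spark's storage internals.

```python
class ToyDataset:
    """Toy model: an expensive computation that can optionally be cached."""

    def __init__(self, compute):
        self._compute = compute
        self._cache_enabled = False
        self._cached_rows = None
        self.compute_calls = 0          # how many times the source was recomputed

    def cache(self):
        # Like Spark's cache(), this only marks the data; nothing is computed yet.
        self._cache_enabled = True
        return self

    def unpersist(self):
        self._cache_enabled = False
        self._cached_rows = None
        return self

    def _rows(self):
        if self._cached_rows is not None:
            return self._cached_rows
        self.compute_calls += 1
        rows = self._compute()
        if self._cache_enabled:
            self._cached_rows = rows
        return rows

    # Two "actions" that both need the full data.
    def count(self):
        return len(self._rows())

    def total(self):
        return sum(self._rows())


uncached = ToyDataset(lambda: [x * x for x in range(5)])
uncached.count()
uncached.total()

cached = ToyDataset(lambda: [x * x for x in range(5)]).cache()
cached.count()
cached.total()
```

After two actions, the uncached dataset has computed its rows twice and the cached one only once, which is the saving caching buys when a DataFrame is reused.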

Q5. What is a broadcast join in PySpark and when should you use it?

A broadcast join is a join optimization where Spark sends a complete copy of a smaller table to every node in the cluster. This eliminates the need to shuffle the larger table across the network, making the join much faster.

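You should use a broadcast join when one of the tables is small enough to fit in memory on every executor. Spark broadcasts automatically when a table is below spark.sql.autoBroadcastJoinThreshold (10 MB by default), and you can raise that threshold or add an explicit hint for tables up to a few hundred MB.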

from pyspark.sql.functions import broadcast
# Explicitly broadcast the smaller DataFrame
result = large_df.join(broadcast(small_df), on="user_id", how="inner")

You can also configure the automatic broadcast threshold:

# Spark automatically broadcasts tables smaller than this size
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "100mb")

# Disable auto-broadcast
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
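
How a broadcast join avoids moving the large table can be modeled in plain Python. The sketch below assumes rows are dictionaries and partitions are lists; broadcast_hash_join is a hypothetical helper for illustration, not a Spark API.

```python
def broadcast_hash_join(large_partitions, small_rows, key):
    """Toy inner join: the small side becomes a lookup table copied to every partition."""
    lookup = {}
    for row in small_rows:
        lookup.setdefault(row[key], []).append(row)

    joined_partitions = []
    for partition in large_partitions:
        # Each partition joins locally against its own copy of `lookup`;
        # rows of the large side never move between partitions.
        joined = [
            {**left, **right}
            for left in partition
            for right in lookup.get(left[key], [])
        ]
        joined_partitions.append(joined)
    return joined_partitions

large_partitions = [
    [{"user_id": 1, "amount": 10}, {"user_id": 2, "amount": 20}],
    [{"user_id": 1, "amount": 30}, {"user_id": 3, "amount": 40}],
]
small_rows = [{"user_id": 1, "country": "DE"}, {"user_id": 2, "country": "FR"}]

joined = broadcast_hash_join(large_partitions, small_rows, "user_id")
```

The price is that every partition holds a full copy of the lookup table, which is why the technique only pays off when one side is small.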

Q6. How do you perform window functions in PySpark?

Window functions let you perform calculations across a set of rows related to the current row — like row numbers, running totals, or rankings — without collapsing the rows into groups.

from pyspark.sql import Window
from pyspark.sql import functions as F  # avoids shadowing Python's built-in sum

# Define the window: partition by department, order by salary
window_spec = Window.partitionBy("department").orderBy("salary")

# Row number within each department
df = df.withColumn("row_num", F.row_number().over(window_spec))

# Running total of salary within each department
df = df.withColumn("running_total", F.sum("salary").over(window_spec))

# Previous row's salary (lag)
df = df.withColumn("prev_salary", F.lag("salary", 1).over(window_spec))
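
To make the semantics concrete, the same three calculations can be written out by hand in plain Python. This is a sketch on made-up rows with distinct salaries; apply_window is a hypothetical helper, and it does not call Spark.

```python
from itertools import groupby
from operator import itemgetter

rows = [
    {"department": "eng", "name": "Ana", "salary": 100},
    {"department": "eng", "name": "Bo", "salary": 120},
    {"department": "ops", "name": "Cy", "salary": 90},
    {"department": "eng", "name": "Di", "salary": 150},
]

def apply_window(rows):
    """Toy version of partitionBy("department").orderBy("salary")."""
    out = []
    ordered = sorted(rows, key=itemgetter("department", "salary"))
    for _, group in groupby(ordered, key=itemgetter("department")):
        running_total = 0
        prev_salary = None
        for row_num, row in enumerate(group, start=1):
            running_total += row["salary"]
            out.append({**row, "row_num": row_num,
                        "running_total": running_total,
                        "prev_salary": prev_salary})
            prev_salary = row["salary"]   # becomes the lag value for the next row
    return out

windowed = apply_window(rows)
```

Every input row survives with extra columns attached, which is the key contrast with groupBy. One caveat: with an orderBy and no explicit frame, Spark uses a range-based frame, so rows with equal salaries share the same running total, whereas this model advances row by row.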

Q7. What is the difference between map and flatMap in PySpark?

map() and flatMap() are both transformations that apply a function to every element of a dataset, but they differ in the shape of the output. map() returns exactly one output element per input element, while flatMap() lets each input produce zero or more elements and flattens them into a single collection.

Basis | map() | flatMap()
Output | One-to-one | One-to-many
Structure of Output | Nested structure | Flattened structure
Return Type | Same number of elements | Variable number of elements
Use Case | Simple transformation | Exploding data
Example | x → x + 1 | "a b" → ["a", "b"]
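
The same contrast can be reproduced with ordinary Python lists, which may help when explaining it on a whiteboard. This is an analogy using the standard library, not the RDD API itself.

```python
from itertools import chain

lines = ["a b", "c d e"]

# map-style: one output element per input element (here, one list per line)
mapped = [line.split(" ") for line in lines]

# flatMap-style: each input may yield several elements, flattened into one sequence
flat_mapped = list(chain.from_iterable(line.split(" ") for line in lines))
```

Here mapped keeps two nested lists, while flat_mapped contains the five individual words.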

Q8. How do you handle data skew in PySpark?

Data skew happens when data is not evenly distributed across partitions. Some partitions end up with far more data than others, causing some tasks to take much longer while others finish quickly. This creates a bottleneck.

Here are the main strategies to handle skew:

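1. Salting: Append a random suffix to the skewed key to spread it across multiple partitions:

from pyspark.sql.functions import concat, lit, col, floor, rand

num_salt = 10
# Append a random salt in [0, num_salt) so one hot key becomes num_salt sub-keys.
# In a join, the other table must be expanded with every salt value so keys still match.
df_salted = df.withColumn(
    "salted_key",
    concat(col("skewed_key").cast("string"), lit("_"),
           floor(rand() * num_salt).cast("string")))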

2. Broadcast join: If one table is small, broadcast it to avoid the shuffle entirely.

3. Repartition on another key: Repartition using a column with better distribution before joining.

4. Adaptive Query Execution (Spark 3.x): Let AQE split skewed partitions at runtime:

spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

Spark 3.0's Adaptive Query Execution (AQE) can detect and handle skew automatically, which is a great point to mention in interviews.
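
The effect of salting on partition sizes can be checked with a small simulation in plain Python. It is a sketch under stated assumptions: zlib.crc32 stands in for Spark's hash partitioner, the key names and row counts are made up, and the helper functions are hypothetical.

```python
import random
import zlib
from collections import Counter

def partition_of(key, num_partitions):
    # Deterministic stand-in for a hash partitioner.
    return zlib.crc32(key.encode("utf-8")) % num_partitions

def rows_per_partition(keys, num_partitions):
    return Counter(partition_of(k, num_partitions) for k in keys)

rng = random.Random(0)   # fixed seed so the run is repeatable
num_partitions = 8
num_salt = 10

# 9,000 rows share one hot key; 1,000 rows have distinct keys.
keys = ["hot"] * 9000 + [f"user_{i}" for i in range(1000)]

# Salting: append a random suffix so the hot key becomes num_salt different keys.
salted_keys = [f"{k}_{rng.randrange(num_salt)}" for k in keys]

largest_before = max(rows_per_partition(keys, num_partitions).values())
largest_after = max(rows_per_partition(salted_keys, num_partitions).values())
```

Before salting, one partition receives all 9,000 hot rows; afterwards the largest partition is noticeably smaller because the hot rows are split across several salted keys.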

Q9. What is the difference between persist() and cache() in PySpark?

PySpark provides two methods for keeping data around for re-use: cache() and persist(). The key difference is that cache() always uses the default storage level (MEMORY_ONLY for RDDs, MEMORY_AND_DISK for DataFrames), while persist() lets you choose the storage level.

Comparison | cache() | persist()
Storage Level | Default only (MEMORY_ONLY for RDDs, MEMORY_AND_DISK for DataFrames) | Configurable storage level
Flexibility | Limited | Flexible
Common Use | Simple caching | More complex data storage
Syntax | rdd.cache() | rdd.persist(level)
Performance | Same as persist() with the default level | Depends on selected storage level

Q10. How does PySpark's explain() function help with optimization?

explain() prints the execution plan that Spark will use for a query. It is the first tool you should use when debugging slow queries or understanding what Spark is doing with your code.

# Simple plan (default)
df.explain()

# Extended plan (logical + physical)
df.explain(extended=True)

# Formatted plan (easiest to read, Spark 3.0+)
df.explain(mode="formatted")

The plan shows you:

  • Whether Spark is doing a broadcast join or a sort-merge join
  • Whether predicate pushdown is happening (filters moving to the scan)
  • Where shuffles occur (look for "Exchange" in the plan)
  • Whether columnar scanning is being used for Parquet files

Always use explain() before running expensive queries in production. Understanding the execution plan is a sign of an experienced PySpark developer.


PySpark Interview Questions for Experienced Professionals

These questions are for senior engineers and architects. They test deep knowledge of Spark internals, performance tuning, and production best practices.

Q1. How does Spark's Tungsten execution engine improve performance?

Tungsten is Spark's execution engine initiative. Its first pieces shipped in Spark 1.4 and 1.5, and whole-stage code generation followed in Spark 2.0. It improves performance through three main mechanisms:

1. Explicit memory management: Tungsten stores data in a compact binary format and manages that memory itself, optionally outside the JVM heap (spark.memory.offHeap.enabled). This reduces garbage collection (GC) pressure, which is a major performance bottleneck in JVM-based systems.

2. Cache-aware computation: Tungsten uses data structures and algorithms designed to maximize CPU cache utilization. It operates on binary data in memory rather than Java objects, which reduces the memory footprint significantly.

3. Whole-stage code generation: Tungsten generates JVM bytecode for entire query stages at runtime (called "codegen"). Instead of calling separate functions for each operator, it generates a single tight loop that the JVM can optimize aggressively.

Q2. What is Adaptive Query Execution (AQE) in Spark 3.0?

Adaptive Query Execution (AQE) is a feature introduced in Spark 3.0 that allows Spark to change its query execution plan at runtime based on actual data statistics collected during execution. Traditional Spark planning is static — it makes decisions upfront based on estimates that may be wrong.

AQE provides three key capabilities:

1. Dynamic coalescing of shuffle partitions: Spark automatically merges small shuffle partitions into larger ones after a shuffle stage completes, reducing the number of tasks in subsequent stages.

2. Dynamic switching of join strategies: If a table turns out to be smaller than estimated, Spark can switch from a sort-merge join to a broadcast join at runtime.

3. Dynamic skew join optimization: Spark detects skewed partitions and splits them into smaller sub-tasks automatically.

# Enable AQE (enabled by default in Spark 3.2+)
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

AQE is one of the most impactful performance improvements in modern Spark. Mentioning it in interviews shows you are up to date with the Spark ecosystem.
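
The idea behind dynamic coalescing can be sketched as a greedy merge of adjacent shuffle partitions. The code below is a simplified model, not Spark's actual algorithm; the partition sizes are made up, coalesce_shuffle_partitions is a hypothetical helper, and the 64 MB target mirrors the default of spark.sql.adaptive.advisoryPartitionSizeInBytes.

```python
def coalesce_shuffle_partitions(partition_sizes_mb, target_mb):
    """Greedily merge adjacent shuffle partitions until each group nears target_mb."""
    groups, current, current_size = [], [], 0
    for index, size in enumerate(partition_sizes_mb):
        # Close the current group if adding this partition would exceed the target.
        if current and current_size + size > target_mb:
            groups.append(current)
            current, current_size = [], 0
        current.append(index)
        current_size += size
    if current:
        groups.append(current)
    return groups

sizes_mb = [5, 10, 3, 60, 2, 2, 40, 30]   # observed sizes after a shuffle
groups = coalesce_shuffle_partitions(sizes_mb, target_mb=64)
```

In this example eight shuffle partitions collapse into four groups, so the next stage would run four tasks instead of eight. The important point is that the sizes are measured at runtime rather than estimated in advance.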

Q3. How do you tune Spark executor memory configuration?

Spark executor memory is divided into several regions. Understanding this is critical for tuning large-scale jobs:

spark.executor.memory = Execution Memory + Storage Memory + User Memory + Reserved Memory

Key configuration parameters:

# Total memory per executor
spark.executor.memory = "8g"

# Fraction of executor memory for execution + storage (default: 0.6)
spark.memory.fraction = 0.6

# Within unified memory, fraction for storage (default: 0.5)
spark.memory.storageFraction = 0.5

# Overhead memory (for Python processes, off-heap, etc.)
spark.executor.memoryOverhead = "2g"

# Number of cores per executor
spark.executor.cores = 4

# Number of executors
spark.executor.instances = 10

Practical sizing rule: Use 4-5 cores per executor and allocate 4-8 GB of memory per core. Avoid executors with too many cores (more than 5) because HDFS throughput degrades with too many concurrent threads.

For PySpark specifically, always configure spark.executor.memoryOverhead to account for the Python worker process memory. A common starting point is 10-15% of spark.executor.memory.
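
As a worked example of the breakdown above, the arithmetic for an 8 GB executor with default fractions can be written out in a few lines of Python. This is an approximation under stated assumptions: it uses Spark's fixed 300 MB reserved memory, ignores that the JVM reports slightly less than the configured heap, and executor_memory_regions is a hypothetical helper.

```python
RESERVED_MB = 300  # fixed reserved memory in Spark's unified memory manager

def executor_memory_regions(executor_memory_mb, memory_fraction=0.6, storage_fraction=0.5):
    usable = executor_memory_mb - RESERVED_MB
    unified = usable * memory_fraction      # execution + storage
    storage = unified * storage_fraction    # soft boundary inside unified memory
    return {
        "reserved_mb": RESERVED_MB,
        "unified_mb": round(unified, 1),
        "storage_mb": round(storage, 1),
        "execution_mb": round(unified - storage, 1),
        "user_mb": round(usable - unified, 1),
    }

regions = executor_memory_regions(8 * 1024)
```

For 8 GB this gives roughly 4,735 MB of unified memory (about 2,368 MB each for storage and execution) and about 3,157 MB of user memory. The storage/execution split is a soft boundary: either side can borrow from the other when it is idle. spark.executor.memoryOverhead is allocated on top of these numbers, outside the heap.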

Q4. What is Delta Lake and how does it work with PySpark?

Delta Lake is an open-source storage layer that adds ACID transaction support, versioning, and schema enforcement to data lakes. It stores data as Parquet files with a transaction log (called the Delta Log) that tracks every change.

Delta Lake integrates natively with PySpark:

# Write data as a Delta table
df.write.format("delta").save("/path/to/delta-table")

# Read a Delta table
df = spark.read.format("delta").load("/path/to/delta-table")

# Update records (not possible with regular Parquet)
from delta.tables import DeltaTable

deltaTable = DeltaTable.forPath(spark, "/path/to/delta-table")
deltaTable.update(
    condition="age < 18",
    set={"category": "'minor'"}
)

# Time travel — read data as it was at a previous version
df_old = spark.read.format("delta") \
    .option("versionAsOf", 5) \
    .load("/path/to/delta-table")

Key Delta Lake features for interviews:

  • ACID transactions: Multiple writers can safely write to the same table
  • Time travel: Query data at any previous point in time
  • Schema evolution: Safely add or modify columns
  • Upserts (MERGE): Update existing records or insert new ones in one operation

Delta Lake is the foundation of the modern Lakehouse architecture used by companies running Databricks and similar platforms.

Q5. How do you optimize PySpark jobs for production at scale?

Production PySpark optimization involves multiple layers. Here are the most important ones:

1. Use the right file format: Parquet or ORC for analytical workloads. They are columnar, compressed, and support predicate pushdown.

2. Partition your data properly: Partition large tables by frequently-filtered columns (like date or region) when storing to disk.

3. Enable AQE: spark.sql.adaptive.enabled = true

4. Use broadcast joins for small lookup tables.

5. Avoid UDFs when possible: Python UDFs break Tungsten optimization. Use built-in Spark functions from pyspark.sql.functions instead. If you must use a Python UDF, use Pandas UDFs (vectorized UDFs) which are significantly faster.

# Slow: Python UDF (breaks optimization)
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

@udf(returnType=StringType())
def my_udf(value):
    return value.upper()

# Faster: Built-in function
from pyspark.sql.functions import upper
df.withColumn("name_upper", upper(df["name"]))

# Better Python UDF alternative: Pandas UDF
from pyspark.sql.functions import pandas_udf
import pandas as pd

@pandas_udf(StringType())
def my_pandas_udf(series: pd.Series) -> pd.Series:
    return series.str.upper()

6. Z-order clustering (Delta Lake): For Delta tables, use Z-ordering to co-locate related data and speed up selective queries.

7. Monitor with Spark UI: Always profile jobs in the Spark UI to find skewed tasks, long GC times, and shuffle spills.

Q6. What is the difference between sort-merge join and broadcast join?

Sort-merge join and broadcast join are two physical join strategies Spark can choose from. They differ in the data sizes they suit, how much shuffling they need, and how fast they run.

Basis | Sort-Merge Join | Broadcast Join
Data Size | Large datasets | One small dataset
Shuffle | Required on both sides | None for the large table
Speed | Slower | Faster
Requirement | Requires sorting | Requires memory for broadcasting
Use Case | Big-to-big joins | Small-to-big joins

Q7. How does Spark handle fault tolerance?

Spark achieves fault tolerance through several mechanisms: lineage and checkpointing for RDDs, checkpointing and write-ahead logs for streaming, and automatic task retries.

  • Lineage (RDD fault tolerance): Every RDD remembers the sequence of transformations that created it (the lineage graph). If a partition is lost due to a node failure, Spark can recompute just that partition by replaying the transformations from the last checkpoint or from the original data source.
  • Checkpointing: For long lineage chains, replaying from the beginning is expensive. Checkpointing saves an RDD or DataFrame to a reliable storage system (like HDFS) and cuts the lineage at that point:

spark.sparkContext.setCheckpointDir("hdfs://path/to/checkpoints")
rdd.checkpoint()

  • Structured Streaming fault tolerance: Spark Structured Streaming uses checkpointing and write-ahead logs to maintain exactly-once processing guarantees. State information is saved to reliable storage so streams can recover from failures without data loss or duplication.
  • Task and stage retry: If a task fails, Spark retries it automatically (up to spark.task.maxFailures attempts, default 4). If a task exhausts its attempts, the job fails. When shuffle output is lost (a fetch failure), Spark resubmits the stage that produced it.
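
Lineage-based recovery can be illustrated with a toy model in plain Python. ToyRDD is a hypothetical class for this sketch: each dataset remembers its parent and the function that derives it, so a lost partition can be rebuilt on demand. Real Spark keeps computed partitions only when they are cached; the model stores them to make the "loss" visible.

```python
class ToyRDD:
    """Toy lineage: each dataset knows its parent and the function that derives it."""

    def __init__(self, partitions=None, parent=None, func=None):
        self.parent = parent
        self.func = func
        self._source = partitions      # only set on the root dataset
        self._materialized = {}        # partition index -> computed rows

    def map(self, func):
        return ToyRDD(parent=self, func=func)

    def compute(self, index):
        if index in self._materialized:
            return self._materialized[index]
        if self.parent is None:
            rows = list(self._source[index])
        else:
            # Replay the recorded transformation on the parent's partition.
            rows = [self.func(x) for x in self.parent.compute(index)]
        self._materialized[index] = rows
        return rows

    def lose(self, index):
        # Simulate a node failure that drops one computed partition.
        self._materialized.pop(index, None)

source = ToyRDD(partitions=[[1, 2], [3, 4]])
doubled = source.map(lambda x: x * 2)
first_run = [doubled.compute(i) for i in range(2)]

doubled.lose(1)                  # partition 1 is gone
recovered = doubled.compute(1)   # rebuilt from lineage; partition 0 is untouched
```

Only the missing partition is recomputed, which is why lineage is cheaper than replicating every intermediate result.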

Q8. What are Pandas UDFs (vectorized UDFs) and why are they faster than regular Python UDFs?

Pandas UDFs, also called vectorized UDFs, are a special type of Python UDF that use Apache Arrow to transfer data between the JVM and Python in columnar batches instead of row by row.

Regular Python UDFs process one row at a time. For each row, Spark serializes the data, sends it to Python, runs the function, serializes the result, and returns it to the JVM. This row-by-row overhead adds up enormously for large datasets.

Pandas UDFs transfer entire columns as Arrow-serialized Pandas Series, apply the function to the whole batch at once, and return the result. This is dramatically faster.

from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import DoubleType
import pandas as pd

# Pandas UDF — operates on a full column (Pandas Series)
@pandas_udf(DoubleType())
def multiply_by_two(series: pd.Series) -> pd.Series:
    return series * 2

# Apply it to a DataFrame
df.withColumn("value_doubled", multiply_by_two(df["value"]))

Q9. How do you implement schema evolution in PySpark with Delta Lake?

Schema evolution allows you to add new columns to a table without breaking existing queries or requiring a full table rewrite. With plain Parquet directories, readers have to reconcile files with different schemas themselves (for example with the mergeSchema read option) or rewrite the existing data.

With Delta Lake, you enable schema evolution with a single option:

# Auto-merge schema when writing (adds new columns automatically)
df_with_new_column.write \
    .format("delta") \
    .option("mergeSchema", "true") \
    .mode("append") \
    .save("/path/to/delta-table")

For explicit schema changes (like adding or renaming columns), you use Delta's ALTER TABLE SQL commands:

spark.sql("ALTER TABLE delta.`/path/to/delta-table` ADD COLUMN new_col STRING")
# Renaming columns requires column mapping to be enabled on the Delta table
spark.sql("ALTER TABLE delta.`/path/to/delta-table` RENAME COLUMN old_name TO new_name")

Schema evolution is critical for production pipelines where source data changes over time. Without it, a new column in your source data would cause your pipeline to fail.

Q10. What is the Spark DAG and how does the DAG Scheduler work?

A DAG (Directed Acyclic Graph) is Spark's internal representation of the computation steps needed to complete a job. "Directed" means each step flows in one direction. "Acyclic" means there are no circular dependencies.

When you call an action, Spark's DAG Scheduler converts your transformation pipeline into a DAG of stages and tasks:

  • RDD lineage to DAG: The DAG Scheduler looks at the RDD lineage and identifies where wide transformations (shuffles) break the computation into separate stages.
  • Stage creation: Narrow transformations (like map, filter) that do not require a shuffle are grouped into a single stage. Wide transformations (like groupBy, join) create stage boundaries.
  • Task scheduling: Each stage is divided into tasks (one per partition). Tasks are submitted to the Task Scheduler, which assigns them to available executor slots.
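  • Execution: A stage runs only after the stages it depends on have finished (the output of one stage feeds the next); stages with no dependency on each other can run concurrently. Within a stage, tasks run in parallel, limited by the number of available executor cores.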

You can visualize the DAG in the Spark UI under the "Jobs" and "Stages" tabs. Looking at the DAG is the best way to understand why a job is slow — you can see exactly how many stages and tasks your job creates.
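
The stage-splitting rule described above can be sketched in a few lines of plain Python. This is a toy model for a linear chain of operations only; the WIDE set and split_into_stages are hypothetical names, and placing the wide operation at the start of the next stage is a modeling choice (in Spark the shuffle write ends one stage and the shuffle read begins the next).

```python
WIDE = {"groupBy", "join", "repartition", "distinct", "orderBy"}

def split_into_stages(operations):
    """Cut a linear chain of operations into stages at each wide transformation."""
    stages, current = [], []
    for op in operations:
        if op in WIDE and current:
            stages.append(current)   # the shuffle ends the current stage
            current = []
        current.append(op)
    if current:
        stages.append(current)
    return stages

stages = split_into_stages(
    ["read", "filter", "map", "groupBy", "map", "orderBy", "select"]
)
```

Two wide transformations produce three stages here, which matches what you would expect to see as stage boundaries in the Spark UI for a similar pipeline.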


Scenario-Based PySpark Interview Questions

These questions test your ability to apply PySpark knowledge to real-world problems. Interviewers use them to see how you think and solve practical challenges.

Q1. Your PySpark job runs fine on small data but crashes with an OOM error on large data. How do you debug and fix it?

Out-of-memory (OOM) errors are one of the most common production issues in Spark. Here is the systematic approach I would take:

Step 1: Identify the cause using Spark UI

  • Open the Spark UI and find the failing stage
  • Look at the "Tasks" section for that stage and check for tasks with high memory usage or shuffle spill
  • Check the executor logs for the specific OOM error message (heap space vs. GC overhead limit)

Step 2: Common causes and fixes

  • Cause 1: Data skew. A few tasks process far more data than others.
    Fix: Check key distribution, use salting or broadcast joins, enable AQE
  • Cause 2: Collecting too much data to the driver. df.collect() pulls all data to the driver, so never call it on large datasets.
    Fix: Use df.write to save results, or use df.limit(n).collect() for sampling
  • Cause 3: Too many cached DataFrames.
    Fix: Call df.unpersist() when you no longer need cached data
  • Cause 4: Too little executor memory.
    Fix: Increase spark.executor.memory and spark.executor.memoryOverhead
  • Cause 5: Too many records per partition, so each task handles too much data.
    Fix: Increase the number of partitions with df.repartition(n) before heavy operations

Step 3: Check for PySpark-specific OOM

Python worker processes use memory outside the JVM heap. Increase spark.executor.memoryOverhead and check whether any Python UDFs are loading large objects into memory.

Q2. You need to join a 500 GB table with a 50 MB lookup table. What is your approach?

This is a classic case for a broadcast join. The 50 MB table is small enough to fit in memory on every executor. Broadcasting it eliminates the need to shuffle the 500 GB table.

from pyspark.sql.functions import broadcast

# Read the large table
large_df = spark.read.parquet("hdfs://path/to/large_table/")

# Read the small lookup table
small_df = spark.read.parquet("hdfs://path/to/lookup_table/")

# Perform the broadcast join
result = large_df.join(broadcast(small_df), on="id", how="left")

# Verify Spark is actually broadcasting
result.explain()
# Look for "BroadcastHashJoin" in the plan

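The default value of spark.sql.autoBroadcastJoinThreshold is 10 MB, so a 50 MB table is not broadcast automatically. The explicit broadcast() hint above handles that, but I would also raise the threshold so Spark picks a broadcast join on its own for tables of this size: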

spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "100mb")

If the lookup table were, say, 500 MB (too big to broadcast), I would instead:

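  • Pre-bucket both tables on the join key when writing them to storage (bucketBy with saveAsTable)
  • Use the same number of buckets on both sides so the join can avoid a full reshuffle
  • Also sort within buckets (sortBy) if this join runs frequently, so the sort-merge join may skip part of the sort work as well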
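A minimal sketch of the bucketing step is below. It uses DataFrameWriter.bucketBy and sortBy, which only work together with saveAsTable. The helper name, table names, and bucket count are hypothetical.

```python
def write_bucketed(df, table_name, key, num_buckets):
    """Persist `df` as a table bucketed and sorted on `key`."""
    (
        df.write
        .bucketBy(num_buckets, key)   # same bucket count on both sides of the join
        .sortBy(key)                  # keep rows sorted within each bucket
        .mode("overwrite")
        .saveAsTable(table_name)      # bucketing metadata lives in the table catalog
    )


# Hypothetical usage, assuming `large_df` and `medium_df` already exist:
# write_bucketed(large_df, "facts_bucketed", "id", 256)
# write_bucketed(medium_df, "lookup_bucketed", "id", 256)
# joined = spark.table("facts_bucketed").join(spark.table("lookup_bucketed"), "id")
```

Bucketing pays the shuffle cost once at write time, so it is worth it mainly when the same join runs repeatedly.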

Q3. You have a streaming pipeline using PySpark Structured Streaming. How do you ensure exactly-once processing?

Exactly-once processing means every record is processed exactly once — no duplicates, no data loss. Structured Streaming achieves this through a combination of checkpointing, idempotent sinks, and offset tracking.

Here is the implementation approach:

from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StringType, IntegerType

spark = SparkSession.builder.appName("ExactlyOnce").getOrCreate()

# Read from Kafka — Spark tracks offsets automatically
df = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "broker:9092") \
    .option("subscribe", "my_topic") \
    .option("startingOffsets", "earliest") \
    .load()

# Parse the message
schema = StructType().add("id", IntegerType()).add("value", StringType())
parsed = df.select(from_json(col("value").cast("string"), schema).alias("data")) \
    .select("data.*")

# Write with checkpointing for exactly-once semantics
query = parsed.writeStream \
    .format("delta") \
    .option("checkpointLocation", "hdfs://path/to/checkpoint") \
    .outputMode("append") \
    .start("hdfs://path/to/output")

query.awaitTermination()

Key components for exactly-once:

  • Checkpointing: Stores the progress (offsets processed) to reliable storage so the stream can resume from where it left off after a failure
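  • Idempotent or transactional sink: The sink must tolerate a replayed batch without producing duplicates. Delta Lake and the built-in file sink do this. Spark's Kafka sink only guarantees at-least-once delivery, so downstream consumers have to deduplicate
  • Replayable source with offset tracking: Structured Streaming records which offsets it has read from each source, and the source (such as Kafka) must be able to replay those offsets after a failure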

Without all three, you risk "at-least-once" delivery (possible duplicates on failure recovery).
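For sinks that are not idempotent out of the box, a common pattern is to deduplicate on the batch ID that foreachBatch passes to your function. The sketch below simulates that idea in plain Python with an in-memory stand-in for the sink. The class and method names are illustrative, not part of Spark.

```python
class IdempotentSink:
    """In-memory stand-in for a sink that commits each micro-batch at most once."""

    def __init__(self):
        self.rows = []
        self.committed_batches = set()

    def write_batch(self, batch_rows, batch_id):
        # A batch replayed after a failure carries the same batch_id, so skip it
        if batch_id in self.committed_batches:
            return False
        self.rows.extend(batch_rows)
        self.committed_batches.add(batch_id)
        return True


sink = IdempotentSink()
sink.write_batch([{"id": 1}, {"id": 2}], batch_id=0)
sink.write_batch([{"id": 1}, {"id": 2}], batch_id=0)  # replay after a simulated failure
sink.write_batch([{"id": 3}], batch_id=1)
```

In a real job the rows and the committed batch ID would have to be written in the same transaction, otherwise a crash between the two steps could still duplicate or lose a batch.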

Q4. A data pipeline processes daily sales data. Some days, some product categories have 10x more records than others. Your Spark job keeps taking too long and sometimes fails. How do you fix this?

This is a data skew problem. The uneven distribution of product categories causes some tasks to run much longer than others, creating a bottleneck.

Here is my step-by-step approach:

Step 1: Confirm the skew

# Check the distribution of records per category
df.groupBy("category").count().orderBy("count", ascending=False).show(20)

Step 2: Enable AQE (Spark 3.0+)

spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionFactor", "5")

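AQE detects skewed partitions in sort-merge joins and splits them into smaller tasks automatically. It does not rebalance a skewed groupBy aggregation, so it helps most when the slow stage is a join.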
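The rule AQE applies can be sketched in a few lines. A partition counts as skewed when it is larger than skewedPartitionFactor times the median partition size and also larger than spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes (256 MB by default). The function below is an illustration of that rule, not Spark's own code, and the sample sizes are made up.

```python
from statistics import median


def skewed_partitions(sizes_bytes, factor=5.0, threshold_bytes=256 * 1024**2):
    """Return the indexes of partitions that the skew rule would flag."""
    mid = median(sizes_bytes)
    return [
        i for i, size in enumerate(sizes_bytes)
        if size > factor * mid and size > threshold_bytes
    ]


mb = 1024**2
sizes = [100 * mb, 120 * mb, 110 * mb, 90 * mb, 2000 * mb]
flagged = skewed_partitions(sizes)
```

Lowering the factor or the threshold makes AQE more aggressive, which can help when the skewed partitions are large in relative terms but still under 256 MB.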

Step 3: If AQE is not enough, apply salting

from pyspark.sql.functions import array, col, concat, explode, floor, lit, rand

num_salt = 20

# Add a random salt (0 to num_salt - 1) to the skewed table
df_salted = df.withColumn("salt", floor(rand() * num_salt).cast("string")) \
              .withColumn("salted_key", concat(col("category"), lit("_"), col("salt")))

# Replicate each lookup row once per salt value so every salted key finds a match
# (lookup_df is assumed to be the small table keyed by category)
salt_values = [lit(str(i)) for i in range(num_salt)]
lookup_exploded = lookup_df.withColumn("salt", explode(array(*salt_values))) \
                           .withColumn("salted_key", concat(col("category"), lit("_"), col("salt"))) \
                           .drop("category", "salt")  # avoid duplicate columns after the join

# Join on the salted key, then drop the helper columns
result = df_salted.join(lookup_exploded, on="salted_key", how="left") \
                  .drop("salt", "salted_key")
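To build intuition for why salting helps, the plain-Python simulation below counts rows per key before and after salting. The category names and row counts are made up for illustration, and the random seed is fixed only so the run is repeatable.

```python
import random
from collections import Counter


def salted_key_counts(keys, num_salt, seed=0):
    """Count rows per salted key after appending a random salt to each key."""
    rng = random.Random(seed)
    return Counter(f"{key}_{rng.randrange(num_salt)}" for key in keys)


# One hot category with 9,000 rows and one small category with 1,000 rows
keys = ["electronics"] * 9000 + ["books"] * 1000
before = Counter(keys)
after = salted_key_counts(keys, num_salt=20)

largest_before = max(before.values())
largest_after = max(after.values())
```

The hot key's rows end up spread across up to 20 salted keys, so no single task has to process all of them. The cost is that the lookup side grows by a factor of num_salt.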

Step 4: Repartition on a column with better distribution

If categories are used frequently in joins, consider repartitioning the data by a composite key (like category + date) that distributes more evenly.

Q5. You need to process 10 TB of raw log files stored in S3 and compute daily aggregates. What is your architecture and PySpark approach?

For 10 TB of log data, here is the production-grade architecture I would design:

Storage layer:

  • Store raw logs in S3 in Parquet format, partitioned by year/month/day
  • Use Delta Lake for the aggregated output layer

PySpark job design:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F  # avoids shadowing Python's built-in sum

spark = SparkSession.builder \
    .appName("DailyAggregation") \
    .config("spark.sql.adaptive.enabled", "true") \
    .config("spark.sql.parquet.filterPushdown", "true") \
    .getOrCreate()

# Read only the run date's partition directory instead of the full dataset
today = "2026-04-23"
year, month, day = today.split("-")
df = spark.read.parquet(f"s3://bucket/logs/year={year}/month={month}/day={day}/")

# Apply early filtering to reduce data volume
df_clean = df.filter(
    (F.col("status_code") != 200) | (F.col("response_time") > 1000)
)

# Compute aggregations and tag each row with the run date
daily_agg = df_clean.groupBy("service_name", "endpoint") \
    .agg(
        F.count("*").alias("request_count"),
        F.sum("response_time").alias("total_response_time"),
        (F.sum("response_time") / F.count("*")).alias("avg_response_time")
    ) \
    .withColumn("date", F.to_date(F.lit(today)))  # replaceWhere below needs this column

# Write to Delta Lake, replacing only this date's rows for idempotency
daily_agg.write \
    .format("delta") \
    .mode("overwrite") \
    .option("replaceWhere", f"date = '{today}'") \
    .save("s3://bucket/daily-aggregates/")

Key optimization decisions:

  • Read only the partition for today — avoids reading 10 TB when you only need one day
  • Apply filters early (predicate pushdown)
  • Use AQE to handle any skew in service names automatically
  • Write to Delta with replaceWhere so re-running the job is idempotent (safe to retry)
  • Size executors for the aggregation step, for example spark.executor.memory = "16g" with 4 cores per executor as a starting point, then adjust based on spill and GC time in the Spark UI
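Since the job is driven by a run date, it helps to derive the partition path from that date rather than typing it by hand. The helper below is a small sketch, assuming the year=/month=/day= layout described above. The function names and bucket path are illustrative.

```python
from datetime import date, timedelta


def partition_path(base_path, run_date):
    """Build the partition directory for one day of logs."""
    return (
        f"{base_path.rstrip('/')}/"
        f"year={run_date:%Y}/month={run_date:%m}/day={run_date:%d}/"
    )


def backfill_paths(base_path, start, end):
    """List the partition directories for every day from start to end, inclusive."""
    days = (end - start).days
    return [partition_path(base_path, start + timedelta(days=i)) for i in range(days + 1)]


single_day = partition_path("s3://bucket/logs/", date(2026, 4, 23))
backfill = backfill_paths("s3://bucket/logs/", date(2026, 4, 29), date(2026, 5, 1))
```

Because each run replaces only its own date in the output, the same helper can drive a backfill by looping over the returned paths and running the job once per day.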

Wrapping Up

PySpark makes it practical to process massive datasets on Apache Spark from Python. This guide covered core concepts, optimization techniques, and realistic scenarios. The best way to learn PySpark is to focus on how distributed execution works rather than trying to memorize every answer. If you understand concepts such as DAGs, joins, and partitioning, you will be able to build the efficient, scalable data pipelines that today's data engineering roles require.

FAQs

1. What is PySpark mainly used for?

PySpark is mainly used for distributed processing of large datasets: building and running ETL pipelines, machine learning workflows, and real-time stream processing at scale.

2. Does PySpark require Hadoop?

No. PySpark can run without Hadoop, for example in local mode or on a standalone or Kubernetes cluster. In production it is often deployed alongside Hadoop components such as HDFS for storage and YARN for resource management.

3. What is the most important concept in PySpark interviews?

Lazy evaluation and the DAG execution model are among the most important concepts to understand when preparing for PySpark interviews.

4. How do I improve PySpark job performance?

Maximize job performance in PySpark by optimizing partitions, minimizing shuffles, smartly using caching, using built-in functions rather than UDFs, and using AQE.

About the Author
Sanjay Prajapat

Sanjay Prajapat is a Data Engineer and technology writer with expertise in Python, SQL, data visualization, and machine learning. He simplifies complex concepts into engaging content, helping beginners and professionals learn effectively while exploring emerging fields like AI, ML, and cybersecurity in today’s evolving tech landscape.
