PySpark is one of the most in-demand skills in the big data and data engineering world right now. Companies like Netflix, Uber, LinkedIn, and Amazon use Apache Spark at scale, and PySpark is their go-to tool for writing Spark jobs in Python. If you are preparing for a data engineering, data science, or big data analyst interview, you will almost certainly face PySpark questions.
With hands-on experience working on data processing pipelines, handling large-scale datasets, and optimizing Spark jobs for performance, I have seen how critical PySpark is in real-world data engineering workflows. From building ETL pipelines to improving query efficiency and managing distributed systems, practical exposure has helped me understand not just the concepts but how they are applied in production environments.
This article covers 40 PySpark interview questions and answers, organized into four clear sections: beginner, intermediate, experienced, and scenario-based. Each answer is detailed enough to help you understand the concept, not just memorize a definition. Whether you are just getting started or you are a senior engineer brushing up before a big interview, this guide has you covered.
Let us get into it.
These questions test your foundational understanding of PySpark. Interviewers ask these to check if you understand the core concepts before moving to advanced topics.
PySpark is the Python API for Apache Spark. Apache Spark is the underlying distributed computing engine written in Scala. PySpark lets you interact with Spark using Python code instead of Scala or Java.
The key difference is the language interface. Spark itself runs on the JVM (Java Virtual Machine). When using PySpark, your Python code communicates with the JVM through a bridge called Py4J. This means you get Python's readability and ecosystem while still running jobs on Spark's distributed engine.
Performance wise, native Scala Spark is slightly faster because it avoids the Python-JVM overhead. However, PySpark is close in performance for most real-world jobs, and the productivity gains from Python usually outweigh the minor speed difference.
RDD stands for Resilient Distributed Dataset. It is the fundamental data structure in Apache Spark. An RDD is an immutable, distributed collection of objects that you can process in parallel across a cluster.
Three core properties define an RDD:
You create an RDD in two ways:
|
RDDs support two types of operations: transformations (like map, filter, flatMap) that return a new RDD, and actions (like count, collect, save) that trigger actual computation.
Transformations are a way of outlining how to process data without actually executing that process but are subsequently able to call actions to execute those transformations and get back some type of output or results due to calling an action.
| Basis | Transformation | Action |
| Execution | Lazy (not executed immediately) | Triggers execution |
| Purpose | Defines new RDD or DataFrame | Produces result or output |
| Return Type | New RDD or DataFrame | Returns value or writes output |
| Optimization | Enables DAG optimization | Executes optimized DAG |
| Examples | map(), filter(), select() | collect(), count(), show() |
SparkSession is the single entry point to all Spark functionality in PySpark 2.0 and later. Before Spark 2.0, you had to create separate contexts for different APIs: SparkContext for RDDs, SQLContext for DataFrames, and HiveContext for Hive. SparkSession replaced all these with one unified object.
You create a SparkSession like this:
|
The .getOrCreate() method creates a new session or returns an existing one. You use SparkSession to read data, run SQL queries, create DataFrames, and configure your Spark application.
RDD, DataFrame, and Dataset are three different ways to interact with data in PySpark. These three types of data differ in their structure, how they perform, and how easy it is to work with them based on whether you want low-level access to the data or high-level, optimized operations.
| Basis | RDD | DataFrame | Dataset |
| Level | Low-level | High-level | High-level and typed |
| Schema | No schema | Has schema | Has schema |
| Optimization | No optimization | Uses Catalyst optimizer | Uses Catalyst optimizer |
| Type Safety | Type safe | Not type safe | Type safe |
| Performance | Slower | Faster | Fast (best in JVM languages) |
Lazy evaluation means Spark does not execute transformations immediately when you write them. Instead, it builds a DAG (Directed Acyclic Graph) of all the transformations. It only executes that DAG when you call an action.
Why does this matter? Because Spark can look at your entire pipeline of transformations together and optimize the execution plan before running a single line.
For example, if you filter and then select columns, Spark can push the filter earlier in the pipeline to reduce the amount of data it processes.
|
Without lazy evaluation, Spark would run each step immediately without any opportunity to optimize. Lazy evaluation is one of the main reasons Spark is so fast.
PySpark supports a wide range of data sources through its unified spark.read and df.write API:
|
Parquet is the most commonly used format in production Spark pipelines. It is columnar, compressed, and reads very fast because Spark only reads the columns it needs
PySpark provides several built-in methods to handle null and missing values through the DataFrame.na accessor:
|
For more advanced scenarios, you can use PySpark's Imputer from pyspark.ml.feature to fill missing numeric values with the mean or median:
from pyspark.ml.feature import Imputer
|
A partition is a chunk of data stored on a single node in the cluster. Spark processes each partition in parallel. The number of partitions directly affects the parallelism of your job.
If you have too few partitions, some nodes sit idle and your job runs slower than it should. If you have too many, you create excessive scheduling overhead.
You can control partitions with three methods:
|
SparkContext (often written as sc) is the original entry point to Spark functionality. It represents the connection to a Spark cluster and is responsible for coordinating distributed execution.
SparkSession (available from Spark 2.0 onwards) wraps SparkContext and also provides access to DataFrame, SQL, and Streaming APIs. When you create a SparkSession, Spark automatically creates an underlying SparkContext that you can access via spark.sparkContext.
|
In modern PySpark development, you always start with SparkSession. You only access SparkContext directly when you need to work with RDDs or set low-level Spark configurations.
Read Also: Top Data Science Interview Questions and Answers
These questions go deeper. Interviewers ask them to check if you understand how Spark actually works under the hood and if you can write efficient PySpark code.
The Catalyst Optimizer is Spark SQL's built-in query optimization engine. It automatically optimizes your DataFrame and SQL queries before executing them. When you write a DataFrame transformation, Catalyst converts it into an optimized execution plan through four phases:
Different approaches exist within PySpark for aggregating data with groupBy vs. reduceByKey. However, they differ in efficiency regarding data transfer and calculations performed between nodes in a distributed environment.
| Basis | groupBy | reduceByKey |
| Shuffle | High | Low |
| Aggregation | After grouping | During shuffle |
| Memory Usage | Higher | Lower |
| Performance | Slower | Faster |
| Use Case | Requires full data | Requires only aggregated data |
A shuffle is when Spark needs to redistribute data across partitions, usually across different nodes in the cluster. Shuffles happen during operations like groupBy, join, repartition, distinct, and orderBy.
Shuffles are expensive because they involve:
You can minimize shuffles in several ways:
You can see if a shuffle is happening in your job by looking at the Spark UI's Stages tab and checking for "Exchange" steps in the query plan.
By default, Spark recomputes a DataFrame or RDD from scratch every time you call an action on it. If you use the same data multiple times, you can cache it to avoid repeated computation.
cache() stores data in memory using the default storage level (MEMORY_AND_DISK).
persist() lets you choose the storage level explicitly:
|
Storage levels at a glance:
| Level | Memory | Disk | Serialized |
| MEMORY_ONLY | Yes | No | No |
| MEMORY_AND_DISK | Yes | Yes (spill) | No |
| DISK_ONLY | No | Yes | Yes |
| MEMORY_ONLY_SER | Yes | No | Yes |
Use caching when you call multiple actions on the same DataFrame, especially in iterative machine learning algorithms. Do not cache everything — memory is finite, and unnecessary caching can push useful data out of cache.
A broadcast join is a join optimization where Spark sends a complete copy of a smaller table to every node in the cluster. This eliminates the need to shuffle the larger table across the network, making the join much faster.
You should use a broadcast join when one of the tables is small enough to fit in memory (typically under 10 MB, configurable up to a few hundred MB).
|
You can also configure the automatic broadcast threshold:
|
Window functions let you perform calculations across a set of rows related to the current row — like row numbers, running totals, or rankings — without collapsing the rows into groups.
from pyspark.sql import Window
from pyspark.sql.functions import row_number, rank, sum, lag
|
In PySpark, there are two commonly used transformation functions named map() and flatMap(), both are different ways to perform some operation on the elements of a dataset, but they return different types of structured data as output. Here's a comparison of how the map() and flatMap() methods differ as well as when you would use each method:
| Basis | map() | flatMap() |
| Output | One-to-one | One-to-many |
| Structure of Output | Nested structure | Flattened structure |
| Return Type | Same number of elements | Variable number of elements |
| Use Case | Simple transformation | Exploding data |
| Example | x → x + 1 | "a b" → ["a", "b"] |
Data skew happens when data is not evenly distributed across partitions. Some partitions end up with far more data than others, causing some tasks to take much longer while others finish quickly. This creates a bottleneck.
Here are the main strategies to handle skew:
1. Salting: Add a random prefix to the skewed key to spread it across multiple partitions:
|
2. Broadcast join: If one table is small, broadcast it to avoid the shuffle entirely.
3. Repartition on another key: Repartition using a column with better distribution before joining.
4. Skew hints (Spark 3.x): Use adaptive query execution (AQE):
|
Spark 3.0's Adaptive Query Execution (AQE) can detect and handle skew automatically, which is a great point to mention in interviews.
PySpark provides two options for in-memory data storage, cache and persist, to improve the speed of data re-use. The key difference between the two is that cache has a default storage level (Memory Only), while persist allows the user to select from a variety of storage options.
| Comparison | cache() | persist() |
| Storage Level | MEMORY_ONLY | Configurable storage level |
| Flexibility | Limited | Flexible |
| Common Use | Simple caching | More complex data storage |
| Syntax | rdd.cache() | rdd.persist(level) |
| Performance | Same as MEMORY_ONLY | Depends on selected storage level |
explain() prints the execution plan that Spark will use for a query. It is the first tool you should use when debugging slow queries or understanding what Spark is doing with your code.
|
The plan shows you:
Read Also: Data Engineer Interview Questions and Answers
These questions are for senior engineers and architects. They test deep knowledge of Spark internals, performance tuning, and production best practices.
Tungsten is Spark's physical execution engine introduced in Spark 1.4. It improves performance through three main mechanisms:
1. Off-heap memory management: Tungsten manages memory manually outside of the JVM heap. This avoids garbage collection (GC) pauses, which are a major performance bottleneck in JVM-based systems.
2. Cache-aware computation: Tungsten uses data structures and algorithms designed to maximize CPU cache utilization. It operates on binary data in memory rather than Java objects, which reduces the memory footprint significantly.
3. Whole-stage code generation: Tungsten generates JVM bytecode for entire query stages at runtime (called "codegen"). Instead of calling separate functions for each operator, it generates a single tight loop that the JVM can optimize aggressively.
Adaptive Query Execution (AQE) is a feature introduced in Spark 3.0 that allows Spark to change its query execution plan at runtime based on actual data statistics collected during execution. Traditional Spark planning is static — it makes decisions upfront based on estimates that may be wrong.
AQE provides three key capabilities:
1. Dynamic coalescing of shuffle partitions: Spark automatically merges small shuffle partitions into larger ones after a shuffle stage completes, reducing the number of tasks in subsequent stages.
2. Dynamic switching of join strategies: If a table turns out to be smaller than estimated, Spark can switch from a sort-merge join to a broadcast join at runtime.
3. Dynamic skew join optimization: Spark detects skewed partitions and splits them into smaller sub-tasks automatically.
|
AQE is one of the most impactful performance improvements in modern Spark. Mentioning it in interviews shows you are up to date with the Spark ecosystem.
Spark executor memory is divided into several regions. Understanding this is critical for tuning large-scale jobs:
|
Key configuration parameters:
|
Practical sizing rule: Use 4-5 cores per executor and allocate 4-8 GB of memory per core. Avoid executors with too many cores (more than 5) because HDFS throughput degrades with too many concurrent threads.
For PySpark specifically, always configure spark.executor.memoryOverhead to account for the Python worker process memory. A common starting point is 10-15% of spark.executor.memory.
Delta Lake is an open-source storage layer that adds ACID transaction support, versioning, and schema enforcement to data lakes. It stores data as Parquet files with a transaction log (called the Delta Log) that tracks every change.
Delta Lake integrates natively with PySpark:
|
Key Delta Lake features for interviews:
Delta Lake is the foundation of the modern Lakehouse architecture used by companies running Databricks and similar platforms.
Production PySpark optimization involves multiple layers. Here are the most important ones:
1. Use the right file format: Parquet or ORC for analytical workloads. They are columnar, compressed, and support predicate pushdown.
2. Partition your data properly: Partition large tables by frequently-filtered columns (like date or region) when storing to disk.
3. Enable AQE: spark.sql.adaptive.enabled = true
4. Use broadcast joins for small lookup tables.
5. Avoid UDFs when possible: Python UDFs break Tungsten optimization. Use built-in Spark functions from pyspark.sql.functions instead. If you must use a Python UDF, use Pandas UDFs (vectorized UDFs) which are significantly faster.
|
6. Z-order clustering (Delta Lake): For Delta tables, use Z-ordering to co-locate related data and speed up selective queries.
7. Monitor with Spark UI: Always profile jobs in the Spark UI to find skewed tasks, long GC times, and shuffle spills.
PySpark provides two options for joining datasets: The Sort-Merge Join and the Broadcast Join. The main difference between both options is that they approach joining datasets from different perspectives based upon data size, shuffle, and performance.
| Basis | Sort-Merge Join | Broadcast Join |
| Data Size | Large datasets | One small dataset |
| Shuffle | Required | Minimal |
| Speed | Slower | Faster |
| Requirement | Requires sorting | Requires memory for broadcasting |
| Use Case | Big-to-big joins | Small-to-big joins |
Spark achieves fault tolerance through two mechanisms: lineage for RDDs and write-ahead logging for streaming.
|
Pandas UDFs, also called vectorized UDFs, are a special type of Python UDF that use Apache Arrow to transfer data between the JVM and Python in columnar batches instead of row by row.
Regular Python UDFs process one row at a time. For each row, Spark serializes the data, sends it to Python, runs the function, serializes the result, and returns it to the JVM. This row-by-row overhead adds up enormously for large datasets.
Pandas UDFs transfer entire columns as Arrow-serialized Pandas Series, apply the function to the whole batch at once, and return the result. This is dramatically faster.
|
Schema evolution allows you to add new columns to a table without breaking existing queries or requiring a full table rewrite. Without Delta Lake, adding a new column to a Parquet table requires rewriting all existing data.
With Delta Lake, you enable schema evolution with a single option:
|
For more complex schema changes (like renaming columns or changing data types), you use Delta's ALTER TABLE SQL commands:
|
Schema evolution is critical for production pipelines where source data changes over time. Without it, a new column in your source data would cause your pipeline to fail.
A DAG (Directed Acyclic Graph) is Spark's internal representation of the computation steps needed to complete a job. "Directed" means each step flows in one direction. "Acyclic" means there are no circular dependencies.
When you call an action, Spark's DAG Scheduler converts your transformation pipeline into a DAG of stages and tasks:
You can visualize the DAG in the Spark UI under the "Jobs" and "Stages" tabs. Looking at the DAG is the best way to understand why a job is slow — you can see exactly how many stages and tasks your job creates.
Read Also: Python Tutorial for Beginners
These questions test your ability to apply PySpark knowledge to real-world problems. Interviewers use them to see how you think and solve practical challenges.
Out-of-memory (OOM) errors are one of the most common production issues in Spark. Here is the systematic approach I would take:
Open the Spark UI and find the failing stage
Look at the "Tasks" tab for that stage — check for tasks with high memory usage or shuffle spill
Check the executor logs for the specific OOM error message (heap space vs. GC overhead limit)
Fix: Check key distribution, use salting or broadcast joins, enable AQE
df.collect() pulls all data to the driver — never do this on large datasets
Fix: Use df.write to save results, or use df.limit(n).collect() for sampling
Fix: Call df.unpersist() when you no longer need cached data
Fix: Increase spark.executor.memory and spark.executor.memoryOverhead
Fix: Increase the number of partitions with df.repartition(n) before heavy operations
Increase spark.executor.memoryOverhead and check if any Python UDFs are loading large objects into memory.
This is a classic case for a broadcast join. The 50 MB table is small enough to fit in memory on every executor. Broadcasting it eliminates the need to shuffle the 500 GB table.
from pyspark.sql.functions import broadcast
|
I would also make sure the broadcast threshold is set high enough to catch this automatically:
|
If the lookup table were, say, 500 MB (too big to broadcast), I would instead:
Exactly-once processing means every record is processed exactly once — no duplicates, no data loss. Structured Streaming achieves this through a combination of checkpointing, idempotent sinks, and offset tracking.
Here is the implementation approach:
|
Key components for exactly-once:
Without all three, you risk "at-least-once" delivery (possible duplicates on failure recovery).
This is a data skew problem. The uneven distribution of product categories causes some tasks to run much longer than others, creating a bottleneck.
Here is my step-by-step approach:
|
|
AQE detects skewed partitions and splits them into smaller tasks automatically.
|
If categories are used frequently in joins, consider repartitioning the data by a composite key (like category + date) that distributes more evenly.
For 10 TB of log data, here is the production-grade architecture I would design:
Storage layer:
PySpark job design:
|
Key optimization decisions:
Apache Spark makes it easy to process massive amounts of data with PySpark. We discussed basic concepts, optimization methods, and realistic examples when using PySpark throughout this tutorial. It's best to learn how to use PySpark by focusing on the concept of Distributed Execution, rather than trying to memorize everything. If you understand concepts such as DAGs, Joins, and Partitioning, you will be able to develop efficient, scalable Data Pipelines that are useful in today's Data Engineering positions.
PySpark is mainly used to process large amounts of data on an ongoing basis and to create and execute ETL pipelines, machine learning workflows, and stream data at scale in real-time.
PySpark does not require Hadoop to run, but it does typically operate in a production environment with Hadoop (HDFS, YARN) as part of its architecture.
The concept and understanding of lazy evaluation and the concept of using a DAG will be critically important for those who are preparing for interviews.
Maximize job performance in PySpark by optimizing partitions, minimizing shuffles, smartly using caching, using built-in functions rather than UDFs, and using AQE.