List of Data Engineer Interview questions for Apache Spark…
Brief about Spark Architecture?
Apache Spark follows a master/slave architecture with two main daemons and a cluster manager –
- Master Daemon — (Master/Driver Process)
- Worker Daemon -(Slave Process)
A spark cluster has a single Master and any number of Slaves/Workers. The driver and the executors run their individual Java processes. Users can run them on the same horizontal spark cluster or separate machines, i.e., in a vertical spark cluster or mixed machine configuration.
Explain about Spark submission
The spark-submit script in Spark’s bin directory is used to launch applications on a cluster. It can use all of Spark’s supported cluster managers through a uniform interface, so you don’t have to configure your application, especially for each one.
./bin/spark-submit \ — class org.apache.spark.examples.SparkPi \ — master spark://126.96.36.199:7077 \ — deploy-mode cluster \ — supervise \ — executor-memory 20G \ — total-executor-cores 100 \ /path/to/examples.jar arguments1
Difference Between RDD, Dataframe, Dataset?
Resilient Distributed Dataset (RDD)
RDD was the primary user-facing API in Spark since its inception. At the core, an RDD is an immutable distributed collection of elements of your data, partitioned across nodes in your cluster that can be operated in parallel with a low-level API that offers transformations and actions.
Like an RDD, a DataFrame is an immutable distributed collection of data. Unlike an RDD, data is organized into named columns, like a table in a relational database. Designed to make large data sets processing even easier, DataFrame allows developers to impose a structure onto a distributed collection of data, allowing higher-level abstraction; it provides a domain-specific language API to manipulate your distributed data.
Starting in Spark 2.0, Dataset takes on two distinct APIs characteristics: a strongly-typed API and an untyped API, as shown below. Conceptually, consider DataFrame as an alias for a collection of generic objects Dataset[Row], where a Row is a generic untyped JVM object.
When to use RDDs?
Consider these scenarios or common use cases for using RDDs when:
1. you want low-level transformation and actions and control on your dataset; your data is unstructured, such as media streams or streams of text;
2. you want to manipulate your data with functional programming constructs than domain-specific expressions;
3. you don’t care about imposing a schema, such as columnar format while processing or accessing data attributes by name or column; and
4. you can forgo some optimization and performance benefits available with DataFrames and Datasets for structured and semi-structured data.
What are the various modes in which Spark runs on YARN? (Client vs. Cluster Mode)
YARN client mode: The driver runs on the machine from which the client is connected
YARN Cluster Mode: The driver runs inside the cluster
What is DAG — Directed Acyclic Graph?
Directed Acyclic Graph — DAG is a graph data structure with an directional edge and does not have any loops or cycles. It is a way of representing dependencies between objects. It is widely used in computing.
What is an RDD, and How it works internally?
RDD (Resilient Distributed Dataset) represents data located on an Immutable network — You can operate on the RDD to produce another RDD, but you can’t alter it.
Partitioned / Parallel — The data located on RDD is operated in parallel. Any operation on RDD is done using multiple nodes.
Resilience — If one of the nodes hosting the partition fails, other nodes take its data.
You can always think of RDD as a big array under the hood spread over many completely abstract computers. So, RDD is made up of many partitions, each partition on different computers.
8. What do we mean by Partitions or slices?
Partitions, also known as ‘Slice’ in HDFS, is a logical chunk of data set which may be in the range of Petabyte, Terabytes and distributed across the cluster.
By default, Spark creates one Partition for each block of the file (For HDFS)
The default block size for HDFS block is 64 MB (Hadoop Version 1) / 128 MB (Hadoop Version 2), so as the split size.
However, one can explicitly specify the number of partitions to be created. Partitions are basically used to speed up data processing.
If you are loading data from an existing memory using sc.parallelize(), you can enforce your number of partitions bypassing the second argument.
You can change the number of partitions later using repartition().
If you want certain operations to consume the whole partitions at a time, you can use: mappartition().
9. What is the difference between map and flatMap?
Map and flatmap both functions are applied on each element of RDD. The only difference is that the function applied as part of the map must return only one value, while flatmap can return a list of values.
So, flatmap can convert one element into multiple elements of RDD, while a map can only result in an equal number of elements.
So, if we are loading RDD from a text file, each element in a sentence, to convert this RDD into an RDD of words, we will have to apply using flatmap a function that would split a string into an array of words. If we have just to cleanup each sentence or change the case of each sentence, we would be using a map instead of flatmap.
10. How can you minimize data transfers when working with Spark?
The various ways in which data transfers can be minimized when working with Apache Spark are:
- Broadcast Variable– Broadcast variable enhances the efficiency of joins between small and large RDDs.
2. Accumulators — Accumulators help update the values of variables in parallel while executing.
3. The most common way is to avoid operations ByKey, repartition, or other operations that trigger shuffles.
11. Why is there a need for broadcast variables when working with Apache Spark?
These are read-only variables, present in-memory cache on every machine. When working with Spark, usage of broadcast variables eliminates the necessity to ship copies of a variable for every task so that data can be processed faster. Broadcast variables help store a lookup table inside the memory, which enhances the retrieval efficiency compared to an RDD lookup ().
12. How can you trigger automatic clean-ups in Spark to handle accumulated metadata?
You can trigger the clean-ups by setting the parameter “spark.cleaner.TTL” or dividing the long-running jobs into different batches and writing the intermediary results to the disk.
13. Why is BlinkDB used?
BlinkDB is a query engine for executing interactive SQL queries on huge volumes of data and renders query results marked with meaningful error bars. BlinkDB helps users balance ‘query accuracy’ with response time.
14. What is Sliding Window operation?
Sliding Window controls the transmission of data packets between various computer networks. Spark Streaming library provides windowed computations where the transformations on RDDs are applied over a sliding window of data. Whenever the window slides, the RDDs that fall within the particular window are combined and operated upon to produce new RDDs of the windowed DStream.
15. What is Catalyst Optimiser?
Catalyst Optimizer is a new optimization framework present in Spark SQL. It allows Spark to automatically transform SQL queries by adding new optimizations to build a faster processing system.
16. What do you understand by Pair RDD?
Paired RDD is a distributed collection of data with the key-value pair. It is a subset of the Resilient Distributed Dataset. So it has all the features of RDD and some new features for the key-value pair. There are many transformation operations available for Paired RDD. These operations on Paired RDD are beneficial to solve many use cases that require sorting, grouping, reducing some value/function.
Commonly used operations on paired RDD are:
groupByKey() reduceByKey() countByKey() join()
17. What is the difference between persist() and cache()?
persist () allows the user to specify the storage level, whereas cache () uses the default storage level(MEMORY_ONLY).
18. What are the various levels of persistence in Apache Spark?
Apache Spark automatically persists the intermediary data from various shuffle operations; however, it is often suggested that users call the persist () method on the RDD if they plan to reuse it. Spark has various persistence levels to store the RDDs on disk or in memory or as a combination of both with different replication levels.
The various storage/persistence levels in Spark are –
• MEMORY_AND_DISK_SER, DISK_ONLY
19. Does Apache Spark provide checkpointing?
Lineage graphs are always useful to recover RDDs from failure, but this is generally time-consuming if the RDDs have long lineage chains. Spark has an API for checkpointing, i.e., a REPLICATE flag to persist. However, the decision on which data to the checkpoint — is decided by the user. Checkpoints are useful when the lineage graphs are long and have wide dependencies.
20. What do you understand by Lazy Evaluation?
Spark is intellectual in the manner in which it operates on data. When you tell Spark to operate on a given dataset, it needs the instructions and makes a note of it so that it does not forget — but it does nothing unless asked for the final result. When a transformation like a map () is called on an RDD-the operation is not performed immediately. Transformations in Spark are not evaluated till you act. This helps optimize the overall data processing workflow.
21. What do you understand by SchemaRDD?
An RDD consists of row objects (wrappers around basic string or integer arrays) with schema information about the type of data in each column. Dataframe is an example of SchemaRDD.
22. What are the disadvantages of using Apache Spark over Hadoop MapReduce?
Apache Spark’s in-memory capability at times comes a major roadblock for the cost-efficient processing of big data. Apache spark does not scale well for compute-intensive jobs and consumes a large number of system resources. Also, Spark does have its own file management system and hence needs to be integrated with other cloud-based data platforms or apache Hadoop.
23. What is “Lineage Graph” in Spark?
Whenever a series of transformations are performed on an RDD, they are not evaluated immediately but lazily(Lazy Evaluation). When a new RDD has been created from an existing RDD, that new RDD contains a pointer to the parent RDD. Similarly, all the dependencies between the RDDs will be logged in a graph rather than the actual data. This graph is called the lineage graph.
Spark does not support data replication in the memory. It is a process that reconstructs lost data partitions. In the event of any data loss, it is rebuilt using the “RDD Lineage.”
24. What do you understand by Executor Memory in a Spark application?
Every spark application has the same fixed heap size and fixed number of cores for a spark executor. The heap size is referred to as the Spark executor memory, which is controlled with the spark.
executor.memory property of the -executor-memory flag. Every spark application will have one executor on each worker node. The executor memory is basically a measure of how much memory of the worker node the application will utilize.
25. What is an “Accumulator”?
“Accumulators” are Spark’s offline debuggers. Similar to “Hadoop Counters,” “Accumulators” provides the number of “events” in a program.
Accumulators are the variables that can be added through associative operations. Spark natively supports accumulators of numeric value types and standard mutable collections. “AggregrateByKey()” and “combineByKey()” uses accumulators.
26. What is SparkContext .?
sparkContext was used as a channel to access all spark functionality.
The spark driver program uses spark context to connect to the cluster through a resource manager (YARN or Mesos.). sparkConf is required to create the spark context object, which stores configuration parameters like appName (to identify your spark driver), application, number of core, and memory size of executor running on the worker node.
To use APIs of SQL, HIVE, and Streaming, separate contexts need to be created.
creating sparkConf :
val conf = new SparkConf().setAppName(“Project”).setMaster(“spark://master:7077”) creation of sparkContext: val sc = new SparkContext(conf)
27. What is SparkSession?
SPARK 2.0.0 onwards, SparkSession provides a single point of entry to interact with underlying Spark functionality, and it allows Spark programming with DataFrame and Dataset APIs. All the functionality available with sparkContext are also available in sparkSession.
To use SQL, HIVE, and Streaming APIs, there is no need to create separate contexts as sparkSession includes all the APIs.
Once the SparkSession is instantiated, we can configure Spark’s run-time config properties.
Example: Creating Spark session:
val spark = SparkSession.builder.appName(“WorldBankIndex”).getOrCreate() Configuring properties:
28. Why is RDD immutable?
Following are the reasons:
- Immutable data is always safe to share across multiple processes as well as multiple threads. Since RDD is immutable, we can recreate the RDD at any time. (From lineage graph).
- If the computation is time-consuming, that we can cache the RDD, which results in performance improvement.
29. What is Partitioner?
A partitioner is an object that defines how the elements in a key-value pair RDD are partitioned by key, maps each key to a partition ID from 0 to numPartitions — 1. It captures the data distribution at the output. The contract of partitioner ensures that records for a given key have to reside on a single partition. With the help of a partitioner, the scheduler can optimize future operations.
We should choose a partitioner to use for cogroup-like operations. If any of the RDDs already has a partitioner, we should choose that one. Otherwise, we use a default HashPartitioner.
There are three types of partitioners in Spark :
- Hash Partitioner: Partitioning attempts to spread the data evenly across various partitions based on the key.
- Range Partitioner: In Range- Partitioning method, tuples having keys with the same range will appear on the same machine.
- Custom Partitioner
RDDs can be created with specific partitioning in two ways :
- Providing explicit partitioner by calling partition method on an RDD
- Applying transformations that return RDDs with specific partitioners.
30. What are the benefits of DataFrames?
1 DataFrame is distributed collection of data. In DataFrames, data is organized in the named column.
2 They are conceptually similar to a table in a relational database. Also, have richer optimizations.
3. Data Frames empower SQL queries and the DataFrame API.
4. we can process both structured and unstructured data formats through it, such as Avro, CSV, elastic search, and Cassandra. Also, it deals with storage systems HDFS, HIVE tables, MySQL, etc.
5. In Data Frames, Catalyst supports optimization(catalyst Optimizer). There are general libraries available to represent trees. In four phases, DataFrame uses Catalyst tree transformation:
- Analyze logical plan to solve references
- Logical plan optimization
- Physical planning
- Code generation to compile part of a query to Java bytecode.
6. The Data Frame APIs are available in various programming languages. For example, Java, Scala, Python, and R.
7. It provides Hive compatibility. We can run unmodified Hive queries on the existing Hive warehouse.
8. It can scale from kilobytes of data on a single laptop to petabytes of data on a large cluster.