Apache Spark is an open-source framework for big data processing, used by professional data scientists and engineers to perform actions on large amounts of data. Since processing large amounts of data demands speed, the processing engine must be efficient enough to keep up. Spark uses a DAG scheduler, in-memory caching, and optimized query execution to process data as fast as possible, which is what makes it suitable for handling large datasets.

The data structure of Spark is based on the RDD (an acronym for Resilient Distributed Dataset). An RDD is an immutable (unchangeable) distributed collection of objects; these datasets may contain any type of Python, Java, or Scala objects, including user-defined classes. (A short sketch below illustrates this RDD behavior.)

The wide usage of Apache Spark is also due to the working mechanism it follows. Apache Spark works on a master/slave pattern: the central coordinator in Spark is known as the "driver" (acting as the master), and its distributed workers are called "executors" (acting as the slaves). The third main component of Spark is the "Cluster Manager"; as the name indicates, it is a manager that manages the executors and drivers. The executors are launched by the Cluster Manager, and in some cases the drivers are launched by this manager as well. Lastly, Spark's built-in manager is responsible for launching any Spark application on the machines. (A second sketch below shows how a driver attaches to this manager.)

Features

Apache Spark has a number of notable features that are worth discussing here to highlight why it is used in large data processing. So, the features of Apache Spark are described below:
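Before moving on to the feature list, here is the RDD sketch referenced above. It is a minimal PySpark example, not from the original post; the local-mode setup and the sample numbers are illustrative assumptions. It shows that transformations return new RDDs rather than modifying the original, that results can be cached in memory, and that an action is what triggers the DAG scheduler to run the job.

```python
from pyspark.sql import SparkSession

# Start a local SparkSession; in local[*] mode the driver and executors
# run on this one machine (an illustrative setup, not a requirement).
spark = SparkSession.builder.master("local[*]").appName("rdd-demo").getOrCreate()
sc = spark.sparkContext

# Distribute a local Python collection across partitions as an RDD.
numbers = sc.parallelize([1, 2, 3, 4, 5])

# Transformations are lazy and return new RDDs; `numbers` itself is
# never modified, which is the "unchangeable" (immutable) property.
squares = numbers.map(lambda x: x * x)

# Keep the computed partitions in memory for reuse (Spark's caching).
squares.cache()

# Actions trigger the DAG scheduler to actually run the computation.
print(squares.collect())                                # [1, 4, 9, 16, 25]
print(squares.filter(lambda x: x % 2 == 0).collect())   # [4, 16]

spark.stop()
```

Because `squares` is cached, the second action reuses the in-memory result instead of recomputing the map from scratch.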
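And the second sketch referenced above: how a driver program attaches to Spark's built-in standalone cluster manager, which then launches executors on the worker machines. The host name spark-master-host and the resource settings are placeholders, not values from the original post.

```python
from pyspark.sql import SparkSession

# The driver is this program. The master URL points at Spark's built-in
# standalone cluster manager ("spark-master-host" is a placeholder);
# that manager then launches executors on the worker machines.
spark = (
    SparkSession.builder
    .master("spark://spark-master-host:7077")
    .appName("cluster-demo")
    .config("spark.executor.memory", "2g")  # memory per executor (placeholder)
    .config("spark.cores.max", "4")         # total cores across executors (placeholder)
    .getOrCreate()
)

# Work defined on the driver is split into tasks and shipped to the
# executors that the cluster manager started for this application.
rdd = spark.sparkContext.parallelize(range(1000), numSlices=8)
print(rdd.sum())  # 499500, computed in parallel on the executors

spark.stop()
```

The same program runs unchanged in local mode by swapping the master URL back to local[*], which is why the driver/executor split is usually invisible during development.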