Spark Core

Public Classes

| Class | Description |
| --- | --- |
| SparkContext | Main entry point for Spark functionality. |
| RDD | A Resilient Distributed Dataset (RDD), the basic abstraction in Spark. |
| Broadcast | A broadcast variable created with SparkContext.broadcast(). |
| Accumulator | A shared variable that can be accumulated, i.e., has a commutative and associative "add" operation. |
| AccumulatorParam | Helper object that defines how to accumulate values of a given type. |
| SparkConf | Configuration for a Spark application. |
| SparkFiles | Resolves paths to files added through SparkContext.addFile(). |
| StorageLevel | Flags for controlling the storage of an RDD. |
| TaskContext | Contextual information about a task which can be read or mutated during execution. |
| RDDBarrier | Wraps an RDD in a barrier stage, which forces Spark to launch tasks of this stage together. |
| BarrierTaskContext | A TaskContext with extra contextual info and tooling for tasks in a barrier stage. |
| BarrierTaskInfo | Carries all task infos of a barrier task. |
| InheritableThread | Thread that is recommended to be used in PySpark when the pinned thread mode is enabled. |
| util.VersionUtils | Provides utility methods for determining Spark versions from a given input string. |
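
These classes are typically used together: a SparkConf configures the application, a SparkContext is the entry point that creates RDDs, and StorageLevel controls how an RDD is persisted. A minimal sketch, assuming a local master URL and a placeholder application name:

```python
from pyspark import SparkConf, SparkContext, StorageLevel

# Placeholder configuration: the app name and local master are assumptions for illustration.
conf = SparkConf().setAppName("example-app").setMaster("local[2]")
sc = SparkContext(conf=conf)

rdd = sc.parallelize(range(10))         # RDD: the basic abstraction
rdd.persist(StorageLevel.MEMORY_ONLY)   # StorageLevel controls how the RDD is cached

print(rdd.count())                      # 10
sc.stop()
```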

Spark Context APIs

| API | Description |
| --- | --- |
| SparkContext.accumulator | Create an Accumulator with the given initial value, using a given AccumulatorParam helper object to define how to add values of the data type if provided. |
| SparkContext.addArchive | Add an archive to be downloaded with this Spark job on every node. |
| SparkContext.addFile | Add a file to be downloaded with this Spark job on every node. |
| SparkContext.addJobTag | Add a tag to be assigned to all the jobs started by this thread. |
| SparkContext.addPyFile | Add a .py or .zip dependency for all tasks to be executed on this SparkContext in the future. |
| SparkContext.applicationId | A unique identifier for the Spark application. |
| SparkContext.binaryFiles | Read a directory of binary files from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI as a byte array. |
| SparkContext.binaryRecords | Load data from a flat binary file, assuming each record is a set of numbers with the specified numerical format (see ByteBuffer), and the number of bytes per record is constant. |
| SparkContext.broadcast | Broadcast a read-only variable to the cluster, returning a Broadcast object for reading it in distributed functions. |
| SparkContext.cancelAllJobs | Cancel all jobs that have been scheduled or are running. |
| SparkContext.cancelJobGroup | Cancel active jobs for the specified group. |
| SparkContext.cancelJobsWithTag | Cancel active jobs that have the specified tag. |
| SparkContext.clearJobTags | Clear the current thread's job tags. |
| SparkContext.defaultMinPartitions | Default minimum number of partitions for Hadoop RDDs when not given by the user. |
| SparkContext.defaultParallelism | Default level of parallelism to use when not given by the user (e.g. for reduce tasks). |
| SparkContext.dump_profiles | Dump the profile stats into a directory path. |
| SparkContext.emptyRDD | Create an RDD that has no partitions or elements. |
| SparkContext.getCheckpointDir | Return the directory where RDDs are checkpointed. |
| SparkContext.getConf | Return a copy of this SparkContext's configuration SparkConf. |
| SparkContext.getJobTags | Get the tags that are currently set to be assigned to all the jobs started by this thread. |
| SparkContext.getLocalProperty | Get a local property set in this thread, or None if it is missing. |
| SparkContext.getOrCreate | Get or instantiate a SparkContext and register it as a singleton object. |
| SparkContext.hadoopFile | Read an 'old' Hadoop InputFormat with arbitrary key and value class from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI. |
| SparkContext.hadoopRDD | Read an 'old' Hadoop InputFormat with arbitrary key and value class, from an arbitrary Hadoop configuration, which is passed in as a Python dict. |
| SparkContext.listArchives | Return a list of archive paths that are added to resources. |
| SparkContext.listFiles | Return a list of file paths that are added to resources. |
| SparkContext.newAPIHadoopFile | Read a 'new API' Hadoop InputFormat with arbitrary key and value class from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI. |
| SparkContext.newAPIHadoopRDD | Read a 'new API' Hadoop InputFormat with arbitrary key and value class, from an arbitrary Hadoop configuration, which is passed in as a Python dict. |
| SparkContext.parallelize | Distribute a local Python collection to form an RDD. |
| SparkContext.pickleFile | Load an RDD previously saved using RDD.saveAsPickleFile(). |
| SparkContext.range | Create a new RDD of int containing elements from start to end (exclusive), increased by step every element. |
| SparkContext.resources | Return the resource information of this SparkContext. |
| SparkContext.removeJobTag | Remove a tag previously added to be assigned to all the jobs started by this thread. |
| SparkContext.runJob | Execute the given partitionFunc on the specified set of partitions, returning the result as an array of elements. |
| SparkContext.sequenceFile | Read a Hadoop SequenceFile with arbitrary key and value Writable class from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI. |
| SparkContext.setCheckpointDir | Set the directory under which RDDs are going to be checkpointed. |
| SparkContext.setInterruptOnCancel | Set the behavior of job cancellation for jobs started in this thread. |
| SparkContext.setJobDescription | Set a human-readable description of the current job. |
| SparkContext.setJobGroup | Assign a group ID to all the jobs started by this thread until the group ID is set to a different value or cleared. |
| SparkContext.setLocalProperty | Set a local property that affects jobs submitted from this thread, such as the Spark fair scheduler pool. |
| SparkContext.setLogLevel | Control the log level. |
| SparkContext.setSystemProperty | Set a Java system property, such as spark.executor.memory. |
| SparkContext.show_profiles | Print the profile stats to stdout. |
| SparkContext.sparkUser | Get SPARK_USER for the user who is running SparkContext. |
| SparkContext.startTime | Return the epoch time when the SparkContext was started. |
| SparkContext.statusTracker | Return a StatusTracker object. |
| SparkContext.stop | Shut down the SparkContext. |
| SparkContext.textFile | Read a text file from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI, and return it as an RDD of Strings. |
| SparkContext.uiWebUrl | Return the URL of the SparkUI instance started by this SparkContext. |
| SparkContext.union | Build the union of a list of RDDs. |
| SparkContext.version | The version of Spark on which this application is running. |
| SparkContext.wholeTextFiles | Read a directory of text files from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI. |
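
A minimal sketch of a few of the SparkContext methods above; the data and the commented-out file path are placeholders for illustration:

```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()                    # get or instantiate the singleton SparkContext

nums = sc.parallelize([1, 2, 3, 4], numSlices=2)   # distribute a local collection
evens = sc.range(0, 10, step=2)                    # RDD of ints, end exclusive
both = sc.union([nums, evens])                     # union of a list of RDDs

print(sc.version, sc.defaultParallelism)           # runtime and environment info
print(both.collect())

# lines = sc.textFile("hdfs:///path/to/data.txt")  # placeholder path, commented out
sc.stop()
```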

RDD APIs

| API | Description |
| --- | --- |
| RDD.aggregate | Aggregate the elements of each partition, and then the results for all the partitions, using given combine functions and a neutral "zero value". |
| RDD.aggregateByKey | Aggregate the values of each key, using given combine functions and a neutral "zero value". |
| RDD.barrier | Marks the current stage as a barrier stage, where Spark must launch all tasks together. |
| RDD.cache | Persist this RDD with the default storage level (MEMORY_ONLY). |
| RDD.cartesian | Return the Cartesian product of this RDD and another one, that is, the RDD of all pairs of elements (a, b) where a is in self and b is in other. |
| RDD.checkpoint | Mark this RDD for checkpointing. |
| RDD.cleanShuffleDependencies | Removes an RDD's shuffles and its non-persisted ancestors. |
| RDD.coalesce | Return a new RDD that is reduced into numPartitions partitions. |
| RDD.cogroup | For each key k in self or other, return a resulting RDD that contains a tuple with the list of values for that key in self as well as other. |
| RDD.collect | Return a list that contains all the elements in this RDD. |
| RDD.collectAsMap | Return the key-value pairs in this RDD to the master as a dictionary. |
| RDD.collectWithJobGroup | When collecting an RDD, use this method to specify a job group. |
| RDD.combineByKey | Generic function to combine the elements for each key using a custom set of aggregation functions. |
| RDD.context | The SparkContext that this RDD was created on. |
| RDD.count | Return the number of elements in this RDD. |
| RDD.countApprox | Approximate version of count() that returns a potentially incomplete result within a timeout, even if not all tasks have finished. |
| RDD.countApproxDistinct | Return the approximate number of distinct elements in the RDD. |
| RDD.countByKey | Count the number of elements for each key, and return the result to the master as a dictionary. |
| RDD.countByValue | Return the count of each unique value in this RDD as a dictionary of (value, count) pairs. |
| RDD.distinct | Return a new RDD containing the distinct elements in this RDD. |
| RDD.filter | Return a new RDD containing only the elements that satisfy a predicate. |
| RDD.first | Return the first element in this RDD. |
| RDD.flatMap | Return a new RDD by first applying a function to all elements of this RDD, and then flattening the results. |
| RDD.flatMapValues | Pass each value in the key-value pair RDD through a flatMap function without changing the keys; this also retains the original RDD's partitioning. |
| RDD.fold | Aggregate the elements of each partition, and then the results for all the partitions, using a given associative function and a neutral "zero value". |
| RDD.foldByKey | Merge the values for each key using an associative function "func" and a neutral "zeroValue" which may be added to the result an arbitrary number of times, and must not change the result (e.g., 0 for addition, or 1 for multiplication). |
| RDD.foreach | Applies a function to all elements of this RDD. |
| RDD.foreachPartition | Applies a function to each partition of this RDD. |
| RDD.fullOuterJoin | Perform a full outer join of self and other. |
| RDD.getCheckpointFile | Gets the name of the file to which this RDD was checkpointed. |
| RDD.getNumPartitions | Returns the number of partitions in this RDD. |
| RDD.getResourceProfile | Get the ResourceProfile specified with this RDD, or None if it wasn't specified. |
| RDD.getStorageLevel | Get the RDD's current storage level. |
| RDD.glom | Return an RDD created by coalescing all elements within each partition into a list. |
| RDD.groupBy | Return an RDD of grouped items. |
| RDD.groupByKey | Group the values for each key in the RDD into a single sequence. |
| RDD.groupWith | Alias for cogroup but with support for multiple RDDs. |
| RDD.histogram | Compute a histogram using the provided buckets. |
| RDD.id | A unique ID for this RDD (within its SparkContext). |
| RDD.intersection | Return the intersection of this RDD and another one. |
| RDD.isCheckpointed | Return whether this RDD is checkpointed and materialized, either reliably or locally. |
| RDD.isEmpty | Returns true if and only if the RDD contains no elements at all. |
| RDD.isLocallyCheckpointed | Return whether this RDD is marked for local checkpointing. |
| RDD.join | Return an RDD containing all pairs of elements with matching keys in self and other. |
| RDD.keyBy | Creates tuples of the elements in this RDD by applying f. |
| RDD.keys | Return an RDD with the keys of each tuple. |
| RDD.leftOuterJoin | Perform a left outer join of self and other. |
| RDD.localCheckpoint | Mark this RDD for local checkpointing using Spark's existing caching layer. |
| RDD.lookup | Return the list of values in the RDD for key key. |
| RDD.map | Return a new RDD by applying a function to each element of this RDD. |
| RDD.mapPartitions | Return a new RDD by applying a function to each partition of this RDD. |
| RDD.mapPartitionsWithIndex | Return a new RDD by applying a function to each partition of this RDD, while tracking the index of the original partition. |
| RDD.mapPartitionsWithSplit | Return a new RDD by applying a function to each partition of this RDD, while tracking the index of the original partition. |
| RDD.mapValues | Pass each value in the key-value pair RDD through a map function without changing the keys; this also retains the original RDD's partitioning. |
| RDD.max | Find the maximum item in this RDD. |
| RDD.mean | Compute the mean of this RDD's elements. |
| RDD.meanApprox | Approximate operation to return the mean within a timeout or meet the confidence. |
| RDD.min | Find the minimum item in this RDD. |
| RDD.name | Return the name of this RDD. |
| RDD.partitionBy | Return a copy of the RDD partitioned using the specified partitioner. |
| RDD.persist | Set this RDD's storage level to persist its values across operations after the first time it is computed. |
| RDD.pipe | Return an RDD created by piping elements to a forked external process. |
| RDD.randomSplit | Randomly splits this RDD with the provided weights. |
| RDD.reduce | Reduces the elements of this RDD using the specified commutative and associative binary operator. |
| RDD.reduceByKey | Merge the values for each key using an associative and commutative reduce function. |
| RDD.reduceByKeyLocally | Merge the values for each key using an associative and commutative reduce function, but return the results immediately to the master as a dictionary. |
| RDD.repartition | Return a new RDD that has exactly numPartitions partitions. |
| RDD.repartitionAndSortWithinPartitions | Repartition the RDD according to the given partitioner and, within each resulting partition, sort records by their keys. |
| RDD.rightOuterJoin | Perform a right outer join of self and other. |
| RDD.sample | Return a sampled subset of this RDD. |
| RDD.sampleByKey | Return a subset of this RDD sampled by key (via stratified sampling). |
| RDD.sampleStdev | Compute the sample standard deviation of this RDD's elements (which corrects for bias in estimating the standard deviation by dividing by N-1 instead of N). |
| RDD.sampleVariance | Compute the sample variance of this RDD's elements (which corrects for bias in estimating the variance by dividing by N-1 instead of N). |
| RDD.saveAsHadoopDataset | Output a Python RDD of key-value pairs (of form RDD[(K, V)]) to any Hadoop file system, using the old Hadoop OutputFormat API (mapred package). |
| RDD.saveAsHadoopFile | Output a Python RDD of key-value pairs (of form RDD[(K, V)]) to any Hadoop file system, using the old Hadoop OutputFormat API (mapred package). |
| RDD.saveAsNewAPIHadoopDataset | Output a Python RDD of key-value pairs (of form RDD[(K, V)]) to any Hadoop file system, using the new Hadoop OutputFormat API (mapreduce package). |
| RDD.saveAsNewAPIHadoopFile | Output a Python RDD of key-value pairs (of form RDD[(K, V)]) to any Hadoop file system, using the new Hadoop OutputFormat API (mapreduce package). |
| RDD.saveAsPickleFile | Save this RDD as a SequenceFile of serialized objects. |
| RDD.saveAsSequenceFile | Output a Python RDD of key-value pairs (of form RDD[(K, V)]) to any Hadoop file system, using the "org.apache.hadoop.io.Writable" types that we convert from the RDD's key and value types. |
| RDD.saveAsTextFile | Save this RDD as a text file, using string representations of elements. |
| RDD.setName | Assign a name to this RDD. |
| RDD.sortBy | Sorts this RDD by the given keyfunc. |
| RDD.sortByKey | Sorts this RDD, which is assumed to consist of (key, value) pairs. |
| RDD.stats | Return a StatCounter object that captures the mean, variance and count of the RDD's elements in one operation. |
| RDD.stdev | Compute the standard deviation of this RDD's elements. |
| RDD.subtract | Return each value in self that is not contained in other. |
| RDD.subtractByKey | Return each (key, value) pair in self that has no pair with matching key in other. |
| RDD.sum | Add up the elements in this RDD. |
| RDD.sumApprox | Approximate operation to return the sum within a timeout or meet the confidence. |
| RDD.take | Take the first num elements of the RDD. |
| RDD.takeOrdered | Get the N elements from an RDD ordered in ascending order or as specified by the optional key function. |
| RDD.takeSample | Return a fixed-size sampled subset of this RDD. |
| RDD.toDebugString | A description of this RDD and its recursive dependencies for debugging. |
| RDD.toLocalIterator | Return an iterator that contains all of the elements in this RDD. |
| RDD.top | Get the top N elements from an RDD. |
| RDD.treeAggregate | Aggregates the elements of this RDD in a multi-level tree pattern. |
| RDD.treeReduce | Reduces the elements of this RDD in a multi-level tree pattern. |
| RDD.union | Return the union of this RDD and another one. |
| RDD.unpersist | Mark the RDD as non-persistent, and remove all blocks for it from memory and disk. |
| RDD.values | Return an RDD with the values of each tuple. |
| RDD.variance | Compute the variance of this RDD's elements. |
| RDD.withResources | Specify a ResourceProfile to use when calculating this RDD. |
| RDD.zip | Zips this RDD with another one, returning key-value pairs with the first element in each RDD, second element in each RDD, etc. |
| RDD.zipWithIndex | Zips this RDD with its element indices. |
| RDD.zipWithUniqueId | Zips this RDD with generated unique Long ids. |
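
Many of the RDD methods above compose into simple pipelines. The word-count-style sketch below uses flatMap, map, reduceByKey, sortBy, take, and count; the input sentences are illustrative placeholders:

```python
from operator import add
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Illustrative sentences; any RDD of strings would do.
lines = sc.parallelize(["to be or not to be", "that is the question"])

counts = (
    lines.flatMap(lambda line: line.split())         # split lines into words
         .map(lambda word: (word, 1))                # pair each word with a count of 1
         .reduceByKey(add)                           # sum the counts per word
         .sortBy(lambda kv: kv[1], ascending=False)  # order by descending count
)

print(counts.take(3))   # first few (word, count) pairs
print(counts.count())   # number of distinct words
sc.stop()
```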

Broadcast and Accumulator

| API | Description |
| --- | --- |
| Broadcast.destroy | Destroy all data and metadata related to this broadcast variable. |
| Broadcast.dump | Write a pickled representation of the value to the open file or socket. |
| Broadcast.load | Read a pickled representation of the value from the open file or socket. |
| Broadcast.load_from_path | Read the pickled representation of an object from the open file and return the reconstituted object hierarchy specified therein. |
| Broadcast.unpersist | Delete cached copies of this broadcast on the executors. |
| Broadcast.value | Return the broadcasted value. |
| Accumulator.add | Add a term to this accumulator's value. |
| Accumulator.value | Get the accumulator's value; only usable in the driver program. |
| AccumulatorParam.addInPlace | Add two values of the accumulator's data type, returning a new value; for efficiency, can also update value1 in place and return it. |
| AccumulatorParam.zero | Provide a "zero value" for the type, compatible in dimensions with the provided value (e.g., a zero vector). |
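
A small sketch of how a broadcast variable and an accumulator are typically used together; the lookup dictionary and keys are illustrative placeholders:

```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

lookup = sc.broadcast({"a": 1, "b": 2})   # read-only data shipped once to each executor
misses = sc.accumulator(0)                # counter written by tasks, readable on the driver

def score(key):
    if key not in lookup.value:           # Broadcast.value: the broadcasted value
        misses.add(1)                     # Accumulator.add: add a term to the value
        return 0
    return lookup.value[key]

total = sc.parallelize(["a", "b", "x"]).map(score).sum()
print(total, misses.value)                # the accumulator's value is only usable on the driver

lookup.unpersist()                        # delete cached copies on the executors
sc.stop()
```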

Management

| API | Description |
| --- | --- |
| inheritable_thread_target | Return a thread target wrapper that is recommended to be used in PySpark when the pinned thread mode is enabled. |
| SparkConf.contains | Does this configuration contain a given key? |
| SparkConf.get | Get the configured value for some key, or return a default otherwise. |
| SparkConf.getAll | Get all values as a list of key-value pairs. |
| SparkConf.set | Set a configuration property. |
| SparkConf.setAll | Set multiple parameters, passed as a list of key-value pairs. |
| SparkConf.setAppName | Set the application name. |
| SparkConf.setExecutorEnv | Set an environment variable to be passed to executors. |
| SparkConf.setIfMissing | Set a configuration property, if not already set. |
| SparkConf.setMaster | Set the master URL to connect to. |
| SparkConf.setSparkHome | Set the path where Spark is installed on worker nodes. |
| SparkConf.toDebugString | Return a printable version of the configuration, as a list of key=value pairs, one per line. |
| SparkFiles.get | Get the absolute path of a file added through SparkContext.addFile(). |
| SparkFiles.getRootDirectory | Get the root directory that contains files added through SparkContext.addFile(). |
| TaskContext.attemptNumber | How many times this task has been attempted. |
| TaskContext.cpus | CPUs allocated to the task. |
| TaskContext.get | Return the currently active TaskContext. |
| TaskContext.getLocalProperty | Get a local property set upstream in the driver, or None if it is missing. |
| TaskContext.partitionId | The ID of the RDD partition that is computed by this task. |
| TaskContext.resources | Resources allocated to the task. |
| TaskContext.stageId | The ID of the stage that this task belongs to. |
| TaskContext.taskAttemptId | An ID that is unique to this task attempt (within the same SparkContext, no two task attempts will share the same attempt ID). |
| RDDBarrier.mapPartitions | Return a new RDD by applying a function to each partition of the wrapped RDD, where tasks are launched together in a barrier stage. |
| RDDBarrier.mapPartitionsWithIndex | Return a new RDD by applying a function to each partition of the wrapped RDD, while tracking the index of the original partition. |
| BarrierTaskContext.allGather | This function blocks until all tasks in the same stage have reached this routine. |
| BarrierTaskContext.attemptNumber | How many times this task has been attempted. |
| BarrierTaskContext.barrier | Set a global barrier and wait until all tasks in this stage hit this barrier. |
| BarrierTaskContext.cpus | CPUs allocated to the task. |
| BarrierTaskContext.get | Return the currently active BarrierTaskContext. |
| BarrierTaskContext.getLocalProperty | Get a local property set upstream in the driver, or None if it is missing. |
| BarrierTaskContext.getTaskInfos | Return BarrierTaskInfo for all tasks in this barrier stage, ordered by partition ID. |
| BarrierTaskContext.partitionId | The ID of the RDD partition that is computed by this task. |
| BarrierTaskContext.resources | Resources allocated to the task. |
| BarrierTaskContext.stageId | The ID of the stage that this task belongs to. |
| BarrierTaskContext.taskAttemptId | An ID that is unique to this task attempt (within the same SparkContext, no two task attempts will share the same attempt ID). |
| util.VersionUtils.majorMinorVersion | Given a Spark version string, return the (major version number, minor version number). |
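
A short sketch combining the SparkConf and TaskContext APIs above; the configuration values and master URL are placeholders chosen for a local run:

```python
from pyspark import SparkConf, SparkContext, TaskContext

conf = (
    SparkConf()
    .setAppName("conf-demo")                     # placeholder application name
    .setMaster("local[2]")                       # placeholder master URL
    .setIfMissing("spark.ui.enabled", "false")   # only applied if not already configured
)
print(conf.contains("spark.app.name"), conf.get("spark.master", "unset"))
print(conf.toDebugString())                      # key=value pairs, one per line

sc = SparkContext(conf=conf)

def tag_partition(iterator):
    ctx = TaskContext.get()                      # the active TaskContext inside a task
    return [(ctx.partitionId(), ctx.stageId(), x) for x in iterator]

print(sc.parallelize(range(4), 2).mapPartitions(tag_partition).collect())
sc.stop()
```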