@InterfaceStability.Evolving
public interface DataSourceWriter
WriteSupport.createWriter(String, StructType, SaveMode, DataSourceOptions)
/
StreamWriteSupport.createStreamWriter(
String, StructType, OutputMode, DataSourceOptions)
.
It can mix in various writing optimization interfaces to speed up the data saving. The actual
writing logic is delegated to DataWriter
.
If an exception was throw when applying any of these writing optimizations, the action will fail
and no Spark job will be submitted.
The writing procedure is:
1. Create a writer factory by createWriterFactory()
, serialize and send it to all the
partitions of the input data(RDD).
2. For each partition, create the data writer, and write the data of the partition with this
writer. If all the data are written successfully, call DataWriter.commit()
. If
exception happens during the writing, call DataWriter.abort()
.
3. If all writers are successfully committed, call commit(WriterCommitMessage[])
. If
some writers are aborted, or the job failed with an unknown reason, call
abort(WriterCommitMessage[])
.
While Spark will retry failed writing tasks, Spark won't retry failed writing jobs. Users should
do it manually in their Spark applications if they want to retry.
Please refer to the documentation of commit/abort methods for detailed specifications.Modifier and Type | Method and Description |
---|---|
void |
abort(WriterCommitMessage[] messages)
Aborts this writing job because some data writers are failed and keep failing when retry,
or the Spark job fails with some unknown reasons,
or
onDataWriterCommit(WriterCommitMessage) fails,
or commit(WriterCommitMessage[]) fails. |
void |
commit(WriterCommitMessage[] messages)
Commits this writing job with a list of commit messages.
|
DataWriterFactory<org.apache.spark.sql.catalyst.InternalRow> |
createWriterFactory()
Creates a writer factory which will be serialized and sent to executors.
|
default void |
onDataWriterCommit(WriterCommitMessage message)
Handles a commit message on receiving from a successful data writer.
|
default boolean |
useCommitCoordinator()
Returns whether Spark should use the commit coordinator to ensure that at most one task for
each partition commits.
|
DataWriterFactory<org.apache.spark.sql.catalyst.InternalRow> createWriterFactory()
default boolean useCommitCoordinator()
default void onDataWriterCommit(WriterCommitMessage message)
abort(WriterCommitMessage[])
would be called.void commit(WriterCommitMessage[] messages)
DataWriter.commit()
.
If this method fails (by throwing an exception), this writing job is considered to to have been
failed, and abort(WriterCommitMessage[])
would be called. The state of the destination
is undefined and @abort(WriterCommitMessage[])
may not be able to deal with it.
Note that speculative execution may cause multiple tasks to run for a partition. By default,
Spark uses the commit coordinator to allow at most one task to commit. Implementations can
disable this behavior by overriding useCommitCoordinator()
. If disabled, multiple
tasks may have committed successfully and one successful commit message per task will be
passed to this commit method. The remaining commit messages are ignored by Spark.void abort(WriterCommitMessage[] messages)
onDataWriterCommit(WriterCommitMessage)
fails,
or commit(WriterCommitMessage[])
fails.
If this method fails (by throwing an exception), the underlying data source may require manual
cleanup.
Unless the abort is triggered by the failure of commit, the given messages should have some
null slots as there maybe only a few data writers that are committed before the abort
happens, or some data writers were committed but their commit messages haven't reached the
driver when the abort is triggered. So this is just a "best effort" for data sources to
clean up the data left by data writers.