[SPARK-20791][PYSPARK] Use Arrow to create Spark DataFrame from Pandas
author Bryan Cutler <cutlerb@gmail.com>
Mon, 13 Nov 2017 04:16:01 +0000 (13:16 +0900)
committer hyukjinkwon <gurwls223@gmail.com>
Mon, 13 Nov 2017 04:16:01 +0000 (13:16 +0900)
commit 209b9361ac8a4410ff797cff1115e1888e2f7e66
tree 4aae0900a14dac60e01a0cacba06ad3ac050ae0d
parent 3d90b2cb384affe8ceac9398615e9e21b8c8e0b0
[SPARK-20791][PYSPARK] Use Arrow to create Spark DataFrame from Pandas

## What changes were proposed in this pull request?

This change uses Arrow to optimize the creation of a Spark DataFrame from a Pandas DataFrame. The input DataFrame is sliced into chunks according to the default parallelism, and each slice is converted to Arrow record batches. The optimization is gated by the existing conf "spark.sql.execution.arrow.enabled" and is disabled by default.

## How was this patch tested?

Added new unit tests that create a DataFrame with and without the optimization enabled, then compare the results.

Author: Bryan Cutler <cutlerb@gmail.com>
Author: Takuya UESHIN <ueshin@databricks.com>

Closes #19459 from BryanCutler/arrow-createDataFrame-from_pandas-SPARK-20791.
python/pyspark/context.py
python/pyspark/java_gateway.py
python/pyspark/serializers.py
python/pyspark/sql/session.py
python/pyspark/sql/tests.py
python/pyspark/sql/types.py
sql/core/src/main/scala/org/apache/spark/sql/api/python/PythonSQLUtils.scala
sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowConverters.scala