spark.git
Felix Cheung [Mon, 13 Nov 2017 19:04:27 +0000 (19:04 +0000)] 
Preparing Spark release v2.2.1-rc1

Xianyang Liu [Mon, 13 Nov 2017 12:19:13 +0000 (06:19 -0600)] 
[MINOR][CORE] Using bufferedInputStream for dataDeserializeStream

## What changes were proposed in this pull request?

Small fix: use a `BufferedInputStream` in `dataDeserializeStream`.
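The idea can be sketched in Python, where `io.BufferedReader` plays the role of Java's `BufferedInputStream` (a hedged illustration, not the actual Spark code):

```python
import io

def open_deserialize_stream(raw, buffer_size=64 * 1024):
    # Wrap the raw input in a buffered reader so the deserializer's many
    # small reads don't each hit the underlying stream.
    return io.BufferedReader(raw, buffer_size=buffer_size)

stream = open_deserialize_stream(io.BytesIO(b"serialized bytes"))
assert stream.read(10) == b"serialized"
```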

## How was this patch tested?

Existing UT.

Author: Xianyang Liu <xianyang.liu@intel.com>

Closes #19735 from ConeyLiu/smallfix.

(cherry picked from commit 176ae4d53e0269cfc2cfa62d3a2991e28f5a9182)
Signed-off-by: Sean Owen <sowen@cloudera.com>
Liang-Chi Hsieh [Mon, 13 Nov 2017 11:41:42 +0000 (12:41 +0100)] 
[SPARK-22442][SQL][BRANCH-2.2][FOLLOWUP] ScalaReflection should produce correct field names for special characters

## What changes were proposed in this pull request?

`val TermName: TermNameExtractor` is new in Scala 2.11. For 2.10, we should use the deprecated `newTermName` instead.

## How was this patch tested?

Built locally with Scala 2.10.

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #19736 from viirya/SPARK-22442-2.2-followup.

Liang-Chi Hsieh [Mon, 13 Nov 2017 05:19:15 +0000 (21:19 -0800)] 
[SPARK-22442][SQL][BRANCH-2.2] ScalaReflection should produce correct field names for special characters

## What changes were proposed in this pull request?

For a class with field name of special characters, e.g.:
```scala
case class MyType(`field.1`: String, `field 2`: String)
```

Although we can manipulate DataFrame/Dataset, the field names are encoded:
```scala
scala> val df = Seq(MyType("a", "b"), MyType("c", "d")).toDF
df: org.apache.spark.sql.DataFrame = [field$u002E1: string, field$u00202: string]
scala> df.as[MyType].collect
res7: Array[MyType] = Array(MyType(a,b), MyType(c,d))
```

It causes resolving problem when we try to convert the data with non-encoded field names:
```scala
spark.read.json(path).as[MyType]
...
[info]   org.apache.spark.sql.AnalysisException: cannot resolve '`field$u002E1`' given input columns: [field 2, field.1];
[info]   at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
...
```

We should use the decoded field names in the Dataset schema.
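As a rough sketch of the decoding described above (the real logic lives in Scala's `NameTransformer`, not in this hypothetical helper), each `$uXXXX` escape is a 4-digit hex code point:

```python
import re

def decode_field_name(name):
    # Turn encoded names like 'field$u002E1' back into 'field.1':
    # each $uXXXX escape stands for one Unicode code point.
    return re.sub(r"\$u([0-9A-Fa-f]{4})",
                  lambda m: chr(int(m.group(1), 16)), name)

assert decode_field_name("field$u002E1") == "field.1"
assert decode_field_name("field$u00202") == "field 2"
```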

## How was this patch tested?

Added tests.

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #19734 from viirya/SPARK-22442-2.2.

hyukjinkwon [Sun, 12 Nov 2017 22:37:20 +0000 (14:37 -0800)] 
[SPARK-21694][R][ML] Reduce max iterations in Linear SVM test in R to speed up AppVeyor build

This PR proposes to reduce the max iterations in the Linear SVM test in SparkR. This particular test takes roughly 5 minutes on my Mac and over 20 minutes on Windows.

The root cause appears to be that the default of 100 max iterations triggers roughly 2,500 jobs. On Linux, `daemon.R` is forked, but on Windows a separate process is launched each time, which is extremely slow.

So, from my observation, the many non-forked processes launched on Windows account for the difference in elapsed time.

After reducing the max iteration to 10, the total jobs in this single test is reduced to 550ish.

After reducing the max iteration to 5, the total jobs in this single test is reduced to 360ish.

Manually tested the elapsed times.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #19722 from HyukjinKwon/SPARK-21693-test.

(cherry picked from commit 3d90b2cb384affe8ceac9398615e9e21b8c8e0b0)
Signed-off-by: Felix Cheung <felixcheung@apache.org>
Felix Cheung [Sun, 12 Nov 2017 22:27:49 +0000 (14:27 -0800)] 
[SPARK-19606][BUILD][BACKPORT-2.2][MESOS] fix mesos break

## What changes were proposed in this pull request?

Fix break from cherry pick

## How was this patch tested?

Jenkins

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #19732 from felixcheung/fixmesosdriverconstraint.

gatorsmile [Sun, 12 Nov 2017 22:17:06 +0000 (23:17 +0100)] 
[SPARK-22464][BACKPORT-2.2][SQL] No pushdown for Hive metastore partition predicates containing null-safe equality

## What changes were proposed in this pull request?
`<=>` is not supported by Hive metastore partition predicate pushdown. We should not push it down to the Hive metastore when it is used in partition predicates.
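The semantics of the `<=>` (null-safe equality) operator that the metastore cannot handle can be sketched as follows:

```python
def null_safe_eq(a, b):
    # SQL's <=> operator: NULL <=> NULL is true, NULL <=> x is false,
    # otherwise ordinary equality (None stands in for SQL NULL here).
    if a is None and b is None:
        return True
    if a is None or b is None:
        return False
    return a == b

assert null_safe_eq(None, None) is True   # unlike '=', which yields NULL
assert null_safe_eq(None, 1) is False
assert null_safe_eq(1, 1) is True
```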

## How was this patch tested?
Added a test case

Author: gatorsmile <gatorsmile@gmail.com>

Closes #19724 from gatorsmile/backportSPARK-22464.

gatorsmile [Sun, 12 Nov 2017 22:15:58 +0000 (23:15 +0100)] 
[SPARK-22488][BACKPORT-2.2][SQL] Fix the view resolution issue in the SparkSession internal table() API

## What changes were proposed in this pull request?

The current internal `table()` API of `SparkSession` bypasses the Analyzer and directly calls `sessionState.catalog.lookupRelation` API. This skips the view resolution logics in our Analyzer rule `ResolveRelations`. This internal API is widely used by various DDL commands, public and internal APIs.

Users might get a strange error, caused by view resolution being skipped, when the default database is different.
```
Table or view not found: t1; line 1 pos 14
org.apache.spark.sql.AnalysisException: Table or view not found: t1; line 1 pos 14
at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
```

This PR fixes the issue by forcing the API to use `ResolveRelations` to resolve the table.

## How was this patch tested?
Added a test case and modified the existing test cases

Author: gatorsmile <gatorsmile@gmail.com>

Closes #19723 from gatorsmile/backport22488.

Kazuaki Ishizaki [Sun, 12 Nov 2017 21:44:47 +0000 (22:44 +0100)] 
[SPARK-21720][SQL] Fix 64KB JVM bytecode limit problem with AND or OR

This PR changes `AND`/`OR` code generation to place the condition and the expressions' generated code into separate methods when their size could be large. When such a method is generated, variables for `isNull` and `value` are declared as instance variables so these values (e.g. `isNull1409` and `value1409`) can be passed to the callers of the generated method.

This PR resolved two cases:

* large code size of left expression
* large code size of right expression
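The splitting idea can be sketched with a simple size threshold (the real codegen tracks generated-code length per expression; this helper is hypothetical):

```python
def split_into_methods(code_blocks, threshold=1024):
    # Group generated code fragments into separate "methods" whenever the
    # accumulated size would exceed the threshold, mirroring how large
    # AND/OR expression code is moved out of one huge method.
    methods, current, size = [], [], 0
    for block in code_blocks:
        if current and size + len(block) > threshold:
            methods.append("".join(current))
            current, size = [], 0
        current.append(block)
        size += len(block)
    if current:
        methods.append("".join(current))
    return methods

chunks = split_into_methods(["x" * 700, "y" * 700, "z" * 100])
assert len(chunks) == 2  # 700 bytes in the first method, 800 in the second
```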

Added a new test case into `CodeGenerationSuite`

Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>

Closes #18972 from kiszk/SPARK-21720.

(cherry picked from commit 9bf696dbece6b1993880efba24a6d32c54c4d11c)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Paul Mackles [Sun, 12 Nov 2017 19:21:23 +0000 (11:21 -0800)] 
[SPARK-19606][MESOS] Support constraints in spark-dispatcher

As discussed in SPARK-19606, this adds a new config property named "spark.mesos.constraints.driver" for constraining drivers running on a Mesos cluster.

A corresponding unit test was added; also tested locally on a Mesos cluster.


Author: Paul Mackles <pmackles@adobe.com>

Closes #19543 from pmackles/SPARK-19606.

(cherry picked from commit b3f9dbf48ec0938ff5c98833bb6b6855c620ef57)
Signed-off-by: Felix Cheung <felixcheung@apache.org>
Rekha Joshi [Fri, 10 Nov 2017 23:18:11 +0000 (15:18 -0800)] 
[SPARK-21667][STREAMING] ConsoleSink should not fail streaming query with checkpointLocation option

## What changes were proposed in this pull request?
Fix to allow the console sink to run with a checkpointLocation and avoid the checkpoint exception.

## How was this patch tested?
existing tests
manual tests (replicated the error and saw no checkpoint error after the fix)

Author: Rekha Joshi <rekhajoshm@gmail.com>
Author: rjoshi2 <rekhajoshm@gmail.com>

Closes #19407 from rekhajoshm/SPARK-21667.

(cherry picked from commit 808e886b9638ab2981dac676b594f09cda9722fe)
Signed-off-by: Shixiong Zhu <zsxwing@gmail.com>
Shixiong Zhu [Fri, 10 Nov 2017 22:14:47 +0000 (14:14 -0800)] 
[SPARK-19644][SQL] Clean up Scala reflection garbage after creating Encoder (branch-2.2)

## What changes were proposed in this pull request?

Backport #19687 to branch-2.2. The major difference is `cleanUpReflectionObjects` is protected by `ScalaReflectionLock.synchronized` in this PR for Scala 2.10.

## How was this patch tested?

Jenkins

Author: Shixiong Zhu <zsxwing@gmail.com>

Closes #19718 from zsxwing/SPARK-19644-2.2.

Kazuaki Ishizaki [Fri, 10 Nov 2017 20:17:49 +0000 (21:17 +0100)] 
[SPARK-22284][SQL] Fix 64KB JVM bytecode limit problem in calculating hash for nested structs

## What changes were proposed in this pull request?

This PR avoids generating a huge method for calculating a murmur3 hash for nested structs, by splitting the huge method (e.g. `apply_4`) into multiple smaller methods.

Sample program
```scala
  val structOfString = new StructType().add("str", StringType)
  var inner = new StructType()
  for (_ <- 0 until 800) {
    inner = inner.add("structOfString", structOfString)
  }
  var schema = new StructType()
  for (_ <- 0 until 50) {
    schema = schema.add("structOfStructOfStrings", inner)
  }
  GenerateMutableProjection.generate(Seq(Murmur3Hash(exprs, 42)))
```
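What the generated code does — folding one hash seed through every nested field — can be illustrated with a small Python sketch (Python's built-in `hash` stands in for murmur3; this is not Spark's code):

```python
def hash_nested(value, seed=42):
    # Fold the seed through nested structs field by field, the way the
    # generated murmur3 code above does; None leaves the seed unchanged.
    if value is None:
        return seed
    if isinstance(value, (list, tuple)):
        for field in value:
            seed = hash_nested(field, seed)
        return seed
    return hash((seed, value)) & 0xFFFFFFFF

assert hash_nested(None) == 42
assert hash_nested((("a",),) * 3) == hash_nested((("a",),) * 3)  # deterministic
```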

Without this PR
```
/* 005 */ class SpecificMutableProjection extends org.apache.spark.sql.catalyst.expressions.codegen.BaseMutableProjection {
/* 006 */
/* 007 */   private Object[] references;
/* 008 */   private InternalRow mutableRow;
/* 009 */   private int value;
/* 010 */   private int value_0;
...
/* 034 */   public java.lang.Object apply(java.lang.Object _i) {
/* 035 */     InternalRow i = (InternalRow) _i;
/* 036 */
/* 037 */
/* 038 */
/* 039 */     value = 42;
/* 040 */     apply_0(i);
/* 041 */     apply_1(i);
/* 042 */     apply_2(i);
/* 043 */     apply_3(i);
/* 044 */     apply_4(i);
/* 045 */     nestedClassInstance.apply_5(i);
...
/* 089 */     nestedClassInstance8.apply_49(i);
/* 090 */     value_0 = value;
/* 091 */
/* 092 */     // copy all the results into MutableRow
/* 093 */     mutableRow.setInt(0, value_0);
/* 094 */     return mutableRow;
/* 095 */   }
/* 096 */
/* 097 */
/* 098 */   private void apply_4(InternalRow i) {
/* 099 */
/* 100 */     boolean isNull5 = i.isNullAt(4);
/* 101 */     InternalRow value5 = isNull5 ? null : (i.getStruct(4, 800));
/* 102 */     if (!isNull5) {
/* 103 */
/* 104 */       if (!value5.isNullAt(0)) {
/* 105 */
/* 106 */         final InternalRow element6400 = value5.getStruct(0, 1);
/* 107 */
/* 108 */         if (!element6400.isNullAt(0)) {
/* 109 */
/* 110 */           final UTF8String element6401 = element6400.getUTF8String(0);
/* 111 */           value = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashUnsafeBytes(element6401.getBaseObject(), element6401.getBaseOffset(), element6401.numBytes(), value);
/* 112 */
/* 113 */         }
/* 114 */
/* 115 */
/* 116 */       }
/* 117 */
/* 118 */
/* 119 */       if (!value5.isNullAt(1)) {
/* 120 */
/* 121 */         final InternalRow element6402 = value5.getStruct(1, 1);
/* 122 */
/* 123 */         if (!element6402.isNullAt(0)) {
/* 124 */
/* 125 */           final UTF8String element6403 = element6402.getUTF8String(0);
/* 126 */           value = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashUnsafeBytes(element6403.getBaseObject(), element6403.getBaseOffset(), element6403.numBytes(), value);
/* 127 */
/* 128 */         }
/* 129 */
/* 130 */
/* 131 */       }
/* 132 */
/* 133 */
/* 134 */       if (!value5.isNullAt(2)) {
/* 135 */
/* 136 */         final InternalRow element6404 = value5.getStruct(2, 1);
/* 137 */
/* 138 */         if (!element6404.isNullAt(0)) {
/* 139 */
/* 140 */           final UTF8String element6405 = element6404.getUTF8String(0);
/* 141 */           value = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashUnsafeBytes(element6405.getBaseObject(), element6405.getBaseOffset(), element6405.numBytes(), value);
/* 142 */
/* 143 */         }
/* 144 */
/* 145 */
/* 146 */       }
/* 147 */
...
/* 12074 */       if (!value5.isNullAt(798)) {
/* 12075 */
/* 12076 */         final InternalRow element7996 = value5.getStruct(798, 1);
/* 12077 */
/* 12078 */         if (!element7996.isNullAt(0)) {
/* 12079 */
/* 12080 */           final UTF8String element7997 = element7996.getUTF8String(0);
/* 12083 */         }
/* 12084 */
/* 12085 */
/* 12086 */       }
/* 12087 */
/* 12088 */
/* 12089 */       if (!value5.isNullAt(799)) {
/* 12090 */
/* 12091 */         final InternalRow element7998 = value5.getStruct(799, 1);
/* 12092 */
/* 12093 */         if (!element7998.isNullAt(0)) {
/* 12094 */
/* 12095 */           final UTF8String element7999 = element7998.getUTF8String(0);
/* 12096 */           value = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashUnsafeBytes(element7999.getBaseObject(), element7999.getBaseOffset(), element7999.numBytes(), value);
/* 12097 */
/* 12098 */         }
/* 12099 */
/* 12100 */
/* 12101 */       }
/* 12102 */
/* 12103 */     }
/* 12104 */
/* 12105 */   }
/* 12106 */
/* 12107 */
/* 12108 */   private void apply_1(InternalRow i) {
...
```

With this PR
```
/* 005 */ class SpecificMutableProjection extends org.apache.spark.sql.catalyst.expressions.codegen.BaseMutableProjection {
/* 006 */
/* 007 */   private Object[] references;
/* 008 */   private InternalRow mutableRow;
/* 009 */   private int value;
/* 010 */   private int value_0;
/* 011 */
...
/* 034 */   public java.lang.Object apply(java.lang.Object _i) {
/* 035 */     InternalRow i = (InternalRow) _i;
/* 036 */
/* 037 */
/* 038 */
/* 039 */     value = 42;
/* 040 */     nestedClassInstance11.apply50_0(i);
/* 041 */     nestedClassInstance11.apply50_1(i);
...
/* 088 */     nestedClassInstance11.apply50_48(i);
/* 089 */     nestedClassInstance11.apply50_49(i);
/* 090 */     value_0 = value;
/* 091 */
/* 092 */     // copy all the results into MutableRow
/* 093 */     mutableRow.setInt(0, value_0);
/* 094 */     return mutableRow;
/* 095 */   }
/* 096 */
...
/* 37717 */   private void apply4_0(InternalRow value5, InternalRow i) {
/* 37718 */
/* 37719 */     if (!value5.isNullAt(0)) {
/* 37720 */
/* 37721 */       final InternalRow element6400 = value5.getStruct(0, 1);
/* 37722 */
/* 37723 */       if (!element6400.isNullAt(0)) {
/* 37724 */
/* 37725 */         final UTF8String element6401 = element6400.getUTF8String(0);
/* 37726 */         value = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashUnsafeBytes(element6401.getBaseObject(), element6401.getBaseOffset(), element6401.numBytes(), value);
/* 37727 */
/* 37728 */       }
/* 37729 */
/* 37730 */
/* 37731 */     }
/* 37732 */
/* 37733 */     if (!value5.isNullAt(1)) {
/* 37734 */
/* 37735 */       final InternalRow element6402 = value5.getStruct(1, 1);
/* 37736 */
/* 37737 */       if (!element6402.isNullAt(0)) {
/* 37738 */
/* 37739 */         final UTF8String element6403 = element6402.getUTF8String(0);
/* 37740 */         value = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashUnsafeBytes(element6403.getBaseObject(), element6403.getBaseOffset(), element6403.numBytes(), value);
/* 37741 */
/* 37742 */       }
/* 37743 */
/* 37744 */
/* 37745 */     }
/* 37746 */
/* 37747 */     if (!value5.isNullAt(2)) {
/* 37748 */
/* 37749 */       final InternalRow element6404 = value5.getStruct(2, 1);
/* 37750 */
/* 37751 */       if (!element6404.isNullAt(0)) {
/* 37752 */
/* 37753 */         final UTF8String element6405 = element6404.getUTF8String(0);
/* 37754 */         value = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashUnsafeBytes(element6405.getBaseObject(), element6405.getBaseOffset(), element6405.numBytes(), value);
/* 37755 */
/* 37756 */       }
/* 37757 */
/* 37758 */
/* 37759 */     }
/* 37760 */
/* 37761 */   }
...
/* 218470 */
/* 218471 */     private void apply50_4(InternalRow i) {
/* 218472 */
/* 218473 */       boolean isNull5 = i.isNullAt(4);
/* 218474 */       InternalRow value5 = isNull5 ? null : (i.getStruct(4, 800));
/* 218475 */       if (!isNull5) {
/* 218476 */         apply4_0(value5, i);
/* 218477 */         apply4_1(value5, i);
/* 218478 */         apply4_2(value5, i);
...
/* 218742 */         nestedClassInstance.apply4_266(value5, i);
/* 218743 */       }
/* 218744 */
/* 218745 */     }
```

## How was this patch tested?

Added new test to `HashExpressionsSuite`

Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>

Closes #19563 from kiszk/SPARK-22284.

(cherry picked from commit f2da738c76810131045e6c32533a2d13526cdaf6)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Santiago Saavedra [Fri, 10 Nov 2017 18:57:58 +0000 (10:57 -0800)] 
[SPARK-22294][DEPLOY] Reset spark.driver.bindAddress when starting a Checkpoint

## What changes were proposed in this pull request?

It seems that recovering from a checkpoint can replace the old
driver and executor IP addresses, as the workload can now be running
in a different cluster configuration. It follows that the
bindAddress for the master may also have changed. Thus we should not
keep the old one; instead it should be added to the list of properties
to reset and recreate from the new environment.
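A minimal sketch of that reset-on-recovery logic (the property tuple and helper are illustrative, not Spark's actual `propertiesToReload` list):

```python
def recover_conf(checkpointed, fresh_env,
                 reset_props=("spark.driver.bindAddress",)):
    # On checkpoint recovery, drop environment-specific properties from the
    # saved conf and take their values from the new environment instead.
    conf = {k: v for k, v in checkpointed.items() if k not in reset_props}
    for prop in reset_props:
        if prop in fresh_env:
            conf[prop] = fresh_env[prop]
    return conf

old = {"spark.driver.bindAddress": "10.0.0.1", "spark.app.name": "app"}
new = recover_conf(old, {"spark.driver.bindAddress": "10.0.9.9"})
assert new["spark.driver.bindAddress"] == "10.0.9.9"  # fresh value wins
assert new["spark.app.name"] == "app"                 # other conf preserved
```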

## How was this patch tested?

This patch was tested via manual testing on AWS, using the experimental (not yet merged) Kubernetes scheduler, which uses bindAddress to bind to a Kubernetes service (and thus is how I first encountered the bug), but the code path is not scheduler-specific, and the issue may have slipped through when SPARK-4563 was merged.

Author: Santiago Saavedra <ssaavedra@openshine.com>

Closes #19427 from ssaavedra/fix-checkpointing-master.

(cherry picked from commit 5ebdcd185f2108a90e37a1aa4214c3b6c69a97a4)
Signed-off-by: Shixiong Zhu <zsxwing@gmail.com>
ZouChenjun [Thu, 2 Nov 2017 18:06:37 +0000 (11:06 -0700)] 
[SPARK-22243][DSTREAM] spark.yarn.jars should reload from config when checkpoint recovery

## What changes were proposed in this pull request?
The previous [PR](https://github.com/apache/spark/pull/19469) was deleted by mistake.
The solution is straightforward: add "spark.yarn.jars" to `propertiesToReload` so this property is reloaded from the config during checkpoint recovery.

## How was this patch tested?

manual tests

Author: ZouChenjun <zouchenjun@youzan.com>

Closes #19637 from ChenjunZou/checkpoint-yarn-jars.

Felix Cheung [Fri, 10 Nov 2017 18:22:42 +0000 (10:22 -0800)] 
[SPARK-22344][SPARKR] clean up install dir if running test as source package

## What changes were proposed in this pull request?

Remove Spark if it was downloaded and installed during the test run.

## How was this patch tested?

manually by building package
Jenkins, AppVeyor

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #19657 from felixcheung/rinstalldir.

(cherry picked from commit b70aa9e08b4476746e912c2c2a8b7bdd102305e8)
Signed-off-by: Felix Cheung <felixcheung@apache.org>
Wenchen Fan [Fri, 10 Nov 2017 05:56:20 +0000 (21:56 -0800)] 
[SPARK-22472][SQL] add null check for top-level primitive values

## What changes were proposed in this pull request?

One powerful feature of `Dataset` is, we can easily map SQL rows to Scala/Java objects and do runtime null check automatically.

For example, let's say we have a parquet file with schema `<a: int, b: string>`, and we have a `case class Data(a: Int, b: String)`. Users can easily read this parquet file into `Data` objects, and Spark will throw NPE if column `a` has null values.

However, the null check is missing for top-level primitive values. For example, say we have a parquet file with schema `<a: Int>`, and we read it into a Scala `Int`. If column `a` has null values, we will get some weird results.
```
scala> val ds = spark.read.parquet(...).as[Int]

scala> ds.show()
+----+
|v   |
+----+
|null|
|1   |
+----+

scala> ds.collect
res0: Array[Long] = Array(0, 1)

scala> ds.map(_ * 2).show
+-----+
|value|
+-----+
|-2   |
|2    |
+-----+
```

This is because Spark internally uses special default values for primitive types, but never expects users to see or operate on these default values directly.

This PR adds a null check for top-level primitive values.
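The failure mode can be sketched in Python (a hypothetical reader, not Spark's deserializer; 0 stands in for the internal Int default):

```python
def read_int_column(values, null_check=True):
    # Without the null check, a null in a primitive column silently becomes
    # the internal default value (0 for Int), producing weird results like
    # those shown above instead of an error.
    out = []
    for v in values:
        if v is None:
            if null_check:
                raise ValueError("Null value appeared in non-nullable field")
            out.append(0)  # internal default leaks to the user
        else:
            out.append(v)
    return out

assert read_int_column([None, 1], null_check=False) == [0, 1]  # old behavior
```

With the check enabled, the same input raises instead of returning garbage.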

## How was this patch tested?

new test

Author: Wenchen Fan <wenchen@databricks.com>

Closes #19707 from cloud-fan/bug.

(cherry picked from commit 0025ddeb1dd4fd6951ecd8456457f6b94124f84e)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
# Conflicts:
# sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/ScalaReflectionSuite.scala
# sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala

Paul Mackles [Fri, 10 Nov 2017 00:42:33 +0000 (16:42 -0800)] 
[SPARK-22287][MESOS] SPARK_DAEMON_MEMORY not honored by MesosClusterDispatcher

## What changes were proposed in this pull request?

Allow JVM max heap size to be controlled for MesosClusterDispatcher via SPARK_DAEMON_MEMORY environment variable.

## How was this patch tested?

Tested on local Mesos cluster

Author: Paul Mackles <pmackles@adobe.com>

Closes #19515 from pmackles/SPARK-22287.

(cherry picked from commit f5fe63f7b8546b0102d7bfaf3dde77379f58a4d1)
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
Wing Yew Poon [Fri, 10 Nov 2017 00:20:55 +0000 (16:20 -0800)] 
[SPARK-22403][SS] Add optional checkpointLocation argument to StructuredKafkaWordCount example

## What changes were proposed in this pull request?

When run in YARN cluster mode, the StructuredKafkaWordCount example fails because Spark tries to create a temporary checkpoint location in a subdirectory of the path given by java.io.tmpdir, and YARN sets java.io.tmpdir to a path in the local filesystem that usually does not correspond to an existing path in the distributed filesystem.
Add an optional checkpointLocation argument to the StructuredKafkaWordCount example so that users can specify the checkpoint location and avoid this issue.

## How was this patch tested?

Built and ran the example manually on YARN client and cluster mode.

Author: Wing Yew Poon <wypoon@cloudera.com>

Closes #19703 from wypoon/SPARK-22403.

(cherry picked from commit 11c4021044f3a302449a2ea76811e73f5c99a26a)
Signed-off-by: Shixiong Zhu <zsxwing@gmail.com>
Takuya UESHIN [Thu, 9 Nov 2017 07:56:50 +0000 (16:56 +0900)] 
[SPARK-22417][PYTHON][FOLLOWUP][BRANCH-2.2] Fix for createDataFrame from pandas.DataFrame with timestamp

## What changes were proposed in this pull request?

This is a follow-up of #19646 for branch-2.2.
The original PR broke branch-2.2 because the cherry-picked patch doesn't include some code that exists in master.

## How was this patch tested?

Existing tests.

Author: Takuya UESHIN <ueshin@databricks.com>

Closes #19704 from ueshin/issues/SPARK-22417_2.2/fup1.

Henry Robinson [Thu, 9 Nov 2017 07:34:22 +0000 (23:34 -0800)] 
[SPARK-22211][SQL][FOLLOWUP] Fix bad merge for tests

## What changes were proposed in this pull request?

The merge of SPARK-22211 to branch-2.2 dropped a couple of important lines that made sure the tests that compared plans did so with both plans having been analyzed. Fix by reintroducing the correct analysis statements.

## How was this patch tested?

Re-ran LimitPushdownSuite. All tests passed.

Author: Henry Robinson <henry@apache.org>

Closes #19701 from henryr/branch-2.2.

Felix Cheung [Wed, 8 Nov 2017 05:02:14 +0000 (21:02 -0800)] 
[SPARK-22281][SPARKR] Handle R method breaking signature changes

## What changes were proposed in this pull request?

This fixes the code for the latest R changes in R-devel when running the CRAN check:
```
checking for code/documentation mismatches ... WARNING
Codoc mismatches from documentation object 'attach':
attach
Code: function(what, pos = 2L, name = deparse(substitute(what),
backtick = FALSE), warn.conflicts = TRUE)
Docs: function(what, pos = 2L, name = deparse(substitute(what)),
warn.conflicts = TRUE)
Mismatches in argument default values:
Name: 'name' Code: deparse(substitute(what), backtick = FALSE) Docs: deparse(substitute(what))

Codoc mismatches from documentation object 'glm':
glm
Code: function(formula, family = gaussian, data, weights, subset,
na.action, start = NULL, etastart, mustart, offset,
control = list(...), model = TRUE, method = "glm.fit",
x = FALSE, y = TRUE, singular.ok = TRUE, contrasts =
NULL, ...)
Docs: function(formula, family = gaussian, data, weights, subset,
na.action, start = NULL, etastart, mustart, offset,
control = list(...), model = TRUE, method = "glm.fit",
x = FALSE, y = TRUE, contrasts = NULL, ...)
Argument names in code not in docs:
singular.ok
Mismatches in argument names:
Position: 16 Code: singular.ok Docs: contrasts
Position: 17 Code: contrasts Docs: ...
```

With `attach`, it pulls in the function definition from `base::attach`. We need to disable that, but we still need a function signature for roxygen2 to build with.

With `glm`, it pulls in the function definition (ie. "usage") from `stats::glm`. Since this is "compiled in" when we build the source package into the .Rd file, it won't match the latest signature when that changes at runtime or in the CRAN check. The solution is not to pull in from `stats::glm`, since there isn't much value in doing so (we don't actually use most of its parameters, and the ones we do use are explicitly documented).

Also, with `attach` we are changing to call it dynamically.

## How was this patch tested?

Manually.
- [x] check documentation output - yes
- [x] check help `?attach` `?glm` - yes
- [x] check on other platforms, r-hub, on r-devel etc..

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #19557 from felixcheung/rattachglmdocerror.

(cherry picked from commit 2ca5aae47a25dc6bc9e333fb592025ff14824501)
Signed-off-by: Felix Cheung <felixcheung@apache.org>
Felix Cheung [Wed, 8 Nov 2017 04:58:29 +0000 (20:58 -0800)] 
[SPARK-22327][SPARKR][TEST][BACKPORT-2.2] check for version warning

## What changes were proposed in this pull request?

Will need to port this to branch-~~1.6~~, -2.0, -2.1, -2.2.

## How was this patch tested?

manually
Jenkins, AppVeyor

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #19619 from felixcheung/rcranversioncheck22.

Bryan Cutler [Tue, 7 Nov 2017 20:32:37 +0000 (21:32 +0100)] 
[SPARK-22417][PYTHON] Fix for createDataFrame from pandas.DataFrame with timestamp

Currently, a pandas.DataFrame that contains a timestamp of type 'datetime64[ns]', when converted to a Spark DataFrame with `createDataFrame`, will have its values interpreted as LongType. This fix checks for a timestamp type and converts the values to microseconds, which allows Spark to read them as TimestampType.
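The unit conversion at the heart of the fix can be sketched as follows (a hedged illustration of the nanosecond-to-microsecond truncation, not the actual PySpark code path):

```python
from datetime import datetime, timedelta

def ns_epoch_to_timestamp(ns):
    # datetime64[ns] stores an epoch offset in nanoseconds; truncate to
    # microseconds, the precision Spark's TimestampType represents.
    return datetime(1970, 1, 1) + timedelta(microseconds=ns // 1000)

assert ns_epoch_to_timestamp(0) == datetime(1970, 1, 1)
assert ns_epoch_to_timestamp(1_000_000_000) == datetime(1970, 1, 1, 0, 0, 1)
```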

Added unit test to verify Spark schema is expected for TimestampType and DateType when created from pandas

Author: Bryan Cutler <cutlerb@gmail.com>

Closes #19646 from BryanCutler/pyspark-non-arrow-createDataFrame-ts-fix-SPARK-22417.

(cherry picked from commit 1d341042d6948e636643183da9bf532268592c6a)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Shivaram Venkataraman [Mon, 6 Nov 2017 16:58:42 +0000 (08:58 -0800)] 
[SPARK-22315][SPARKR] Warn if SparkR package version doesn't match SparkContext

## What changes were proposed in this pull request?

This PR adds a check between the R package version used and the version reported by SparkContext running in the JVM. The goal here is to warn users when they have a R package downloaded from CRAN and are using that to connect to an existing Spark cluster.

This is raised as a warning rather than an error as users might want to use patch versions interchangeably (e.g., 2.1.3 with 2.1.2 etc.)
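The check described above — warn on major.minor mismatch, tolerate patch differences — can be sketched like this (a hypothetical helper, not the SparkR code):

```python
import warnings

def check_version(package_version, spark_version):
    # Warn, rather than error, only when major.minor differ; patch versions
    # (e.g. 2.1.2 vs 2.1.3) stay interchangeable.
    if package_version.split(".")[:2] != spark_version.split(".")[:2]:
        warnings.warn("Version mismatch between package (%s) and Spark JVM (%s)"
                      % (package_version, spark_version))

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    check_version("2.1.2", "2.1.3")   # patch difference: no warning
    check_version("2.2.0", "2.1.0")   # minor difference: warns
assert len(caught) == 1
```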

## How was this patch tested?

Manually by changing the `DESCRIPTION` file

Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>

Closes #19624 from shivaram/sparkr-version-check.

(cherry picked from commit 65a8bf6036fe41a53b4b1e4298fa35d7fa4e9970)
Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
Tristan Stevens [Sun, 5 Nov 2017 09:10:40 +0000 (09:10 +0000)] 
[SPARK-22429][STREAMING] Streaming checkpointing code does not retry after failure

## What changes were proposed in this pull request?

SPARK-14930/SPARK-13693 put in a change to set the fs object to null after a failure; however, the retry loop does not include initialization. Moved the fs initialization inside the retry while loop to aid recoverability.
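The shape of the fix can be sketched in Python (a hedged illustration of the control flow, not the Scala checkpointing code):

```python
def checkpoint_with_retry(write, open_fs, attempts=3):
    # The fs handle is (re)initialized inside the retry loop, so an attempt
    # that failed and nulled it out can still recover on the next pass.
    fs = None
    for _ in range(attempts):
        try:
            if fs is None:
                fs = open_fs()        # moved inside the loop by this change
            write(fs)
            return True
        except IOError:
            fs = None                 # reset on failure, per SPARK-14930
    return False

state = {"opens": 0}
def open_fs():
    state["opens"] += 1
    return object()
def write(fs):
    if state["opens"] < 2:
        raise IOError("transient failure")

assert checkpoint_with_retry(write, open_fs) is True
assert state["opens"] == 2  # first attempt failed, second re-initialized fs
```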

## How was this patch tested?

Passes all existing unit tests.

Author: Tristan Stevens <tristan@cloudera.com>

Closes #19645 from tmgstevens/SPARK-22429.

(cherry picked from commit fe258a7963361c1f31bc3dc3a2a2ee4a5834bb58)
Signed-off-by: Sean Owen <sowen@cloudera.com>
Henry Robinson [Sun, 5 Nov 2017 05:47:25 +0000 (22:47 -0700)] 
[SPARK-22211][SQL] Remove incorrect FOJ limit pushdown

It's not safe in all cases to push down a LIMIT below a FULL OUTER
JOIN. If the limit is pushed to one side of the FOJ, the physical
join operator can not tell if a row in the non-limited side would have a
match in the other side.

*If* the join operator guarantees that unmatched tuples from the limited
side are emitted before any unmatched tuples from the other side,
pushing down the limit is safe. But this is impractical for some join
implementations, e.g. SortMergeJoin.

For now, disable limit pushdown through a FULL OUTER JOIN, and we can
evaluate whether a more complicated solution is necessary in the future.
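A toy example of why the pushdown is unsafe (a sketch on (key, value) pairs, not Spark's operators):

```python
def full_outer_join(left, right):
    # Minimal full outer join on (key, value) pairs.
    lkeys = {k for k, _ in left}
    out = []
    for k, lv in left:
        matches = [rv for rk, rv in right if rk == k]
        if matches:
            out.extend((k, lv, rv) for rv in matches)
        else:
            out.append((k, lv, None))
    out.extend((k, None, rv) for k, rv in right if k not in lkeys)
    return out

left = [(1, "a"), (2, "b")]
right = [(2, "x")]
# Correct plan (join first, limit after): row 2 finds its match.
assert (2, "b", "x") in full_outer_join(left, right)
# LIMIT 1 pushed below the join (keep only the first left row): the join can
# no longer tell that (2, "x") would have matched, and emits a wrong row.
assert (2, None, "x") in full_outer_join(left[:1], right)
```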

Ran org.apache.spark.sql.* tests. Altered full outer join tests in
LimitPushdownSuite.

Author: Henry Robinson <henry@cloudera.com>

Closes #19647 from henryr/spark-22211.

(cherry picked from commit 6c6626614e59b2e8d66ca853a74638d3d6267d73)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
Wenchen Fan [Thu, 2 Nov 2017 11:37:52 +0000 (12:37 +0100)] 
[SPARK-22306][SQL][2.2] alter table schema should not erase the bucketing metadata at hive side

## What changes were proposed in this pull request?

When we alter a table schema, we set the new schema on the Spark `CatalogTable`, convert it to a Hive table, and finally call `hive.alterTable`. This causes a problem in Spark 2.2, because Hive bucketing metadata is not recognized by Spark, which means a Spark `CatalogTable` representing a Hive table is always non-bucketed, and when we convert it to a Hive table and call `hive.alterTable`, the original Hive bucketing metadata will be removed.

To fix this bug, we should read out the raw hive table metadata, update its schema, and call `hive.alterTable`. By doing this we can guarantee only the schema is changed, and nothing else.

Note that this bug doesn't exist in the master branch, because we've added hive bucketing support and the hive bucketing metadata can be recognized by Spark. I think we should merge this PR to master too, for code cleanup and to reduce the difference between the master and 2.2 branches for backporting.

## How was this patch tested?

new regression test

Author: Wenchen Fan <wenchen@databricks.com>

Closes #19622 from cloud-fan/infer.

3 months ago[MINOR][DOC] automatic type inference supports also Date and Timestamp
Marco Gaido [Thu, 2 Nov 2017 00:30:03 +0000 (09:30 +0900)] 
[MINOR][DOC] automatic type inference supports also Date and Timestamp

## What changes were proposed in this pull request?

Easy fix in the documentation, which reports that only numeric types and string are supported in type inference for partition columns, while Date and Timestamp have also been supported since 2.1.0, thanks to SPARK-17388.

## How was this patch tested?

n/a

Author: Marco Gaido <mgaido@hortonworks.com>

Closes #19628 from mgaido91/SPARK-22398.

(cherry picked from commit b04eefae49b96e2ef5a8d75334db29ef4e19ce58)
Signed-off-by: hyukjinkwon <gurwls223@gmail.com>
3 months ago[SPARK-22333][SQL][BACKPORT-2.2] timeFunctionCall(CURRENT_DATE, CURRENT_TIMESTAMP...
donnyzone [Tue, 31 Oct 2017 17:37:27 +0000 (10:37 -0700)] 
[SPARK-22333][SQL][BACKPORT-2.2] timeFunctionCall(CURRENT_DATE, CURRENT_TIMESTAMP) has conflicts with columnReference

## What changes were proposed in this pull request?

This is a backport pr of https://github.com/apache/spark/pull/19559
for branch-2.2

## How was this patch tested?
unit tests

Author: donnyzone <wellfengzhu@gmail.com>

Closes #19606 from DonnyZone/branch-2.2.

3 months ago[SPARK-19611][SQL][FOLLOWUP] set dataSchema correctly in HiveMetastoreCatalog.convert...
Wenchen Fan [Tue, 31 Oct 2017 10:35:32 +0000 (11:35 +0100)] 
[SPARK-19611][SQL][FOLLOWUP] set dataSchema correctly in HiveMetastoreCatalog.convertToLogicalRelation

We made a mistake in https://github.com/apache/spark/pull/16944 . In `HiveMetastoreCatalog#inferIfNeeded` we infer the data schema, merge it with the full schema, and return the new full schema. At the caller side we treat the full schema as the data schema and set it to `HadoopFsRelation`.

This doesn't cause any problem because both parquet and orc can work with a wrong data schema that has extra columns, but it's better to fix this mistake.

N/A

Author: Wenchen Fan <wenchen@databricks.com>

Closes #19615 from cloud-fan/infer.

(cherry picked from commit 4d9ebf3835dde1abbf9cff29a55675d9f4227620)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
3 months ago[SPARK-22291][SQL] Conversion error when transforming array types of uuid, inet and...
Jen-Ming Chung [Mon, 30 Oct 2017 08:09:11 +0000 (09:09 +0100)] 
[SPARK-22291][SQL] Conversion error when transforming array types of uuid, inet and cidr to StringType in PostgreSQL

## What changes were proposed in this pull request?

This PR fixes the conversion error when transforming array types of `uuid`, `inet` and `cidr` to `StringType` in PostgreSQL.

## How was this patch tested?

Added test in `PostgresIntegrationSuite`.

Author: Jen-Ming Chung <jenmingisme@gmail.com>

Closes #19604 from jmchung/SPARK-22291-FOLLOWUP.

3 months ago[SPARK-22344][SPARKR] Set java.io.tmpdir for SparkR tests
Shivaram Venkataraman [Mon, 30 Oct 2017 01:53:47 +0000 (18:53 -0700)] 
[SPARK-22344][SPARKR] Set java.io.tmpdir for SparkR tests

This PR sets the java.io.tmpdir for CRAN checks and also disables the hsperfdata for the JVM when running CRAN checks. Together this prevents files from being left behind in `/tmp`

## How was this patch tested?
Tested manually on a clean EC2 machine

Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>

Closes #19589 from shivaram/sparkr-tmpdir-clean.

(cherry picked from commit 1fe27612d7bcb8b6478a36bc16ddd4802e4ee2fc)
Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
3 months ago[SPARK-19727][SQL][FOLLOWUP] Fix for round function that modifies original column
Wenchen Fan [Sun, 29 Oct 2017 01:24:18 +0000 (18:24 -0700)] 
[SPARK-19727][SQL][FOLLOWUP] Fix for round function that modifies original column

## What changes were proposed in this pull request?

This is a followup of https://github.com/apache/spark/pull/17075 , to fix the bug in codegen path.

## How was this patch tested?

new regression test

Author: Wenchen Fan <wenchen@databricks.com>

Closes #19576 from cloud-fan/bug.

(cherry picked from commit 7fdacbc77bbcf98c2c045a1873e749129769dcc0)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
3 months ago[SPARK-22356][SQL] data source table should support overlapped columns between data...
Wenchen Fan [Fri, 27 Oct 2017 00:39:53 +0000 (17:39 -0700)] 
[SPARK-22356][SQL] data source table should support overlapped columns between data and partition schema

This is a regression introduced by #14207. After Spark 2.1, we store the inferred schema when creating the table, to avoid inferring the schema again at the read path. However, there is one special case: overlapped columns between data and partition. This case breaks the assumption of the table schema that there is no overlap between the data and partition schema, and that partition columns are at the end. The result is that for Spark 2.1, the table scan has an incorrect schema that puts partition columns at the end. For Spark 2.2, we added a check in CatalogTable to validate the table schema, which fails in this case.

To fix this issue, a simple and safe approach is to fall back to the old behavior when overlapped columns are detected, i.e. store an empty schema in the metastore.

new regression test

Author: Wenchen Fan <wenchen@databricks.com>

Closes #19579 from cloud-fan/bug2.

3 months ago[SPARK-22355][SQL] Dataset.collect is not threadsafe
Wenchen Fan [Fri, 27 Oct 2017 00:51:16 +0000 (17:51 -0700)] 
[SPARK-22355][SQL] Dataset.collect is not threadsafe

It's possible that users create a `Dataset` and call `collect` on it from many threads at the same time. Currently `Dataset#collect` just calls `encoder.fromRow` to convert spark rows to objects of type T, and this encoder is per-dataset. This means `Dataset#collect` is not thread-safe, because the encoder uses a projection to output the object to a re-usable row.

This PR fixes this problem by creating a new projection when calling `Dataset#collect`, so that we have a re-usable row per method call, instead of per Dataset.
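The per-dataset vs. per-call distinction can be illustrated with a toy Python analogue of a reusable-row buffer (the names are hypothetical, not Spark's actual classes):

```python
class Projection:
    """Mimics a projection that serializes into one reusable row buffer."""
    def __init__(self):
        self._row = [None]

    def __call__(self, value):
        self._row[0] = value
        return self._row           # the same list object on every call!

shared = Projection()              # one projection per Dataset (old behavior)
a, b = shared(1), shared(2)
assert a is b and a == [2]         # the first result was silently clobbered

fresh_a = list(Projection()(1))    # a new projection (and a copy) per collect()
fresh_b = list(Projection()(2))
assert (fresh_a, fresh_b) == ([1], [2])
```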

N/A

Author: Wenchen Fan <wenchen@databricks.com>

Closes #19577 from cloud-fan/encoder.

(cherry picked from commit 5c3a1f3fad695317c2fff1243cdb9b3ceb25c317)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
3 months ago[SPARK-22328][CORE] ClosureCleaner should not miss referenced superclass fields
Liang-Chi Hsieh [Thu, 26 Oct 2017 20:41:45 +0000 (21:41 +0100)] 
[SPARK-22328][CORE] ClosureCleaner should not miss referenced superclass fields

When the given closure uses some fields defined in a super class, `ClosureCleaner` can't find them and doesn't set them properly, so those fields end up null.

Added test.

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #19556 from viirya/SPARK-22328.

(cherry picked from commit 4f8dc6b01ea787243a38678ea8199fbb0814cffc)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
3 months ago[SPARK-17902][R] Revive stringsAsFactors option for collect() in SparkR
hyukjinkwon [Thu, 26 Oct 2017 11:54:36 +0000 (20:54 +0900)] 
[SPARK-17902][R] Revive stringsAsFactors option for collect() in SparkR

## What changes were proposed in this pull request?

This PR proposes to revive `stringsAsFactors` option in collect API, which was mistakenly removed in https://github.com/apache/spark/commit/71a138cd0e0a14e8426f97877e3b52a562bbd02c.

Simply, it casts `character` to `factor` if it meets the condition `stringsAsFactors && is.character(vec)` in primitive type conversion.

## How was this patch tested?

Unit test in `R/pkg/tests/fulltests/test_sparkSQL.R`.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #19551 from HyukjinKwon/SPARK-17902.

(cherry picked from commit a83d8d5adcb4e0061e43105767242ba9770dda96)
Signed-off-by: hyukjinkwon <gurwls223@gmail.com>
3 months ago[SPARK-21991][LAUNCHER][FOLLOWUP] Fix java lint
Andrew Ash [Wed, 25 Oct 2017 21:41:02 +0000 (14:41 -0700)] 
[SPARK-21991][LAUNCHER][FOLLOWUP] Fix java lint

## What changes were proposed in this pull request?

Fix java lint

## How was this patch tested?

Run `./dev/lint-java`

Author: Andrew Ash <andrew@andrewash.com>

Closes #19574 from ash211/aash/fix-java-lint.

(cherry picked from commit 5433be44caecaeef45ed1fdae10b223c698a9d14)
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
3 months ago[SPARK-22332][ML][TEST] Fix NaiveBayes unit test occasionly fail (cause by test datas...
WeichenXu [Wed, 25 Oct 2017 21:31:36 +0000 (14:31 -0700)] 
[SPARK-22332][ML][TEST] Fix NaiveBayes unit test occasionally failing (caused by a non-deterministic test dataset)

## What changes were proposed in this pull request?

Fix the NaiveBayes unit test occasionally failing:
Set a seed for `BrzMultinomial.sample`, making `generateNaiveBayesInput` output a deterministic dataset.
(If we do not set a seed, the generated dataset will be random, and the model may exceed the tolerance in the test, which triggers this failure.)

## How was this patch tested?

Manually ran the tests multiple times and checked that the output models contain the same values each time.

Author: WeichenXu <weichen.xu@databricks.com>

Closes #19558 from WeichenXu123/fix_nb_test_seed.

(cherry picked from commit 841f1d776f420424c20d99cf7110d06c73f9ca20)
Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
3 months ago[SPARK-22227][CORE] DiskBlockManager.getAllBlocks now tolerates temp files
Sergei Lebedev [Wed, 25 Oct 2017 21:15:44 +0000 (22:15 +0100)] 
[SPARK-22227][CORE] DiskBlockManager.getAllBlocks now tolerates temp files

Prior to this commit getAllBlocks implicitly assumed that the directories
managed by the DiskBlockManager contain only the files corresponding to
valid block IDs. In reality, this assumption was violated during shuffle,
which produces temporary files in the same directory as the resulting
blocks. As a result, calls to getAllBlocks during shuffle were unreliable.

The fix could be made more efficient, but this is probably good enough.
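A minimal sketch of the tolerant listing, assuming simplified block-ID formats (the real `BlockId` parser covers more cases):

```python
import re

# hypothetical simplified block-ID patterns for illustration only
BLOCK_ID = re.compile(r"(rdd_\d+_\d+|shuffle_\d+_\d+_\d+|broadcast_\d+)$")

def get_all_blocks(filenames):
    # skip anything that doesn't parse as a block ID (e.g. shuffle temp
    # files) instead of failing while a shuffle writes to the same dirs
    return [f for f in filenames if BLOCK_ID.match(f)]

files = ["rdd_0_1", "temp_shuffle_6f8ab8dd", "shuffle_0_0_0"]
assert get_all_blocks(files) == ["rdd_0_1", "shuffle_0_0_0"]
```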

`DiskBlockManagerSuite`

Author: Sergei Lebedev <s.lebedev@criteo.com>

Closes #19458 from superbobry/block-id-option.

(cherry picked from commit b377ef133cdc38d49b460b2cc6ece0b5892804cc)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
3 months ago[SPARK-21991][LAUNCHER] Fix race condition in LauncherServer#acceptConnections
Andrea zito [Wed, 25 Oct 2017 17:10:24 +0000 (10:10 -0700)] 
[SPARK-21991][LAUNCHER] Fix race condition in LauncherServer#acceptConnections

## What changes were proposed in this pull request?
This patch changes the order in which _acceptConnections_ starts the client thread and schedules the client timeout action, ensuring that the latter has been scheduled before the former gets a chance to cancel it.

## How was this patch tested?
Due to the non-deterministic nature of the issue, I wasn't able to add a new test for it.

Author: Andrea zito <andrea.zito@u-hopper.com>

Closes #19217 from nivox/SPARK-21991.

(cherry picked from commit 6ea8a56ca26a7e02e6574f5f763bb91059119a80)
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
3 months ago[SPARK-21936][SQL][FOLLOW-UP] backward compatibility test framework for HiveExternalC...
Sean Owen [Tue, 24 Oct 2017 12:56:10 +0000 (13:56 +0100)] 
[SPARK-21936][SQL][FOLLOW-UP] backward compatibility test framework for HiveExternalCatalog

## What changes were proposed in this pull request?

Adjust Spark download in test to use Apache mirrors and respect its load balancer, and use Spark 2.1.2. This follows on a recent PMC list thread about removing the cloudfront download rather than update it further.

## How was this patch tested?

Existing tests.

Author: Sean Owen <sowen@cloudera.com>

Closes #19564 from srowen/SPARK-21936.2.

(cherry picked from commit 8beeaed66bde0ace44495b38dc967816e16b3464)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
3 months ago[SPARK-22319][CORE][BACKPORT-2.2] call loginUserFromKeytab before accessing hdfs
Steven Rand [Mon, 23 Oct 2017 06:26:03 +0000 (14:26 +0800)] 
[SPARK-22319][CORE][BACKPORT-2.2] call loginUserFromKeytab before accessing hdfs

In SparkSubmit, call loginUserFromKeytab before attempting to make RPC calls to the NameNode.

Same as https://github.com/apache/spark/pull/19540, but for branch-2.2.

Manually tested for master as described in https://github.com/apache/spark/pull/19540.

Author: Steven Rand <srand@palantir.com>

Closes #19554 from sjrand/SPARK-22319-branch-2.2.

Change-Id: Ic550a818fd6a3f38b356ac48029942d463738458

4 months ago[SPARK-21551][PYTHON] Increase timeout for PythonRDD.serveIterator
peay [Thu, 19 Oct 2017 04:07:04 +0000 (13:07 +0900)] 
[SPARK-21551][PYTHON] Increase timeout for PythonRDD.serveIterator

Backport of https://github.com/apache/spark/pull/18752 (https://issues.apache.org/jira/browse/SPARK-21551)

(cherry picked from commit 9d3c6640f56e3e4fd195d3ad8cead09df67a72c7)

Author: peay <peay@protonmail.com>

Closes #19512 from FRosner/branch-2.2.

4 months ago[SPARK-22249][FOLLOWUP][SQL] Check if list of value for IN is empty in the optimizer
Marco Gaido [Wed, 18 Oct 2017 16:14:46 +0000 (09:14 -0700)] 
[SPARK-22249][FOLLOWUP][SQL] Check if list of value for IN is empty in the optimizer

## What changes were proposed in this pull request?

This PR addresses the comments by gatorsmile on [the previous PR](https://github.com/apache/spark/pull/19494).

## How was this patch tested?

Previous UT and added UT.

Author: Marco Gaido <marcogaido91@gmail.com>

Closes #19522 from mgaido91/SPARK-22249_FOLLOWUP.

(cherry picked from commit 1f25d8683a84a479fd7fc77b5a1ea980289b681b)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
4 months ago[SPARK-22271][SQL] mean overflows and returns null for some decimal variables
Huaxin Gao [Tue, 17 Oct 2017 19:50:41 +0000 (12:50 -0700)] 
[SPARK-22271][SQL] mean overflows and returns null for some decimal variables

## What changes were proposed in this pull request?

In Average.scala, it has
```
  override lazy val evaluateExpression = child.dataType match {
    case DecimalType.Fixed(p, s) =>
      // increase the precision and scale to prevent precision loss
      val dt = DecimalType.bounded(p + 14, s + 4)
      Cast(Cast(sum, dt) / Cast(count, dt), resultType)
    case _ =>
      Cast(sum, resultType) / Cast(count, resultType)
  }

  def setChild (newchild: Expression) = {
    child = newchild
  }

```
It is possible that `Cast(Cast(count, dt), resultType)` will make the precision of the decimal number bigger than 38, and this causes overflow. Since count is an integer and doesn't need a scale, I will cast it using `DecimalType.bounded(38, 0)`.
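A rough sketch of the precision arithmetic, assuming Spark's decimal cap of 38 (`DecimalType.bounded` clamps precision and scale to that maximum):

```python
MAX_PRECISION = 38

def bounded(precision, scale):
    # mirrors DecimalType.bounded: clamp both at the system maximum
    return min(precision, MAX_PRECISION), min(scale, MAX_PRECISION)

p, s = 38, 4                     # e.g. an input column of decimal(38, 4)
dp, ds = bounded(p + 14, s + 4)  # the widened type used for sum and count
assert (dp, ds) == (38, 8)

# Cast to decimal(38, 8), count keeps only 38 - 8 = 30 integral digits;
# decimal(38, 0) preserves all 38 digits for a whole-number count.
assert dp - ds == 30
```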
## How was this patch tested?
In DataFrameSuite, I will add a test case.


Author: Huaxin Gao <huaxing@us.ibm.com>

Closes #19496 from huaxingao/spark-22271.

(cherry picked from commit 28f9f3f22511e9f2f900764d9bd5b90d2eeee773)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
# Conflicts:
# sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala

4 months ago[SPARK-22249][SQL] isin with empty list throws exception on cached DataFrame
Marco Gaido [Tue, 17 Oct 2017 07:41:23 +0000 (09:41 +0200)] 
[SPARK-22249][SQL] isin with empty list throws exception on cached DataFrame

## What changes were proposed in this pull request?

As pointed out in the JIRA, there is a bug which causes an exception to be thrown if `isin` is called with an empty list on a cached DataFrame. The PR fixes it.
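The optimizer-side fix can be sketched as folding an empty IN list to a false literal (a hypothetical simplified rule, not Spark's actual optimizer code, and ignoring SQL null semantics):

```python
def simplify_in(value, in_list):
    # an IN over an empty value list can never match, so fold it to False
    # instead of building an In expression the cached plan can't evaluate
    if not in_list:
        return False
    return value in in_list

assert simplify_in(5, []) is False
assert simplify_in(5, [5, 6]) is True
assert simplify_in(7, [5, 6]) is False
```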

## How was this patch tested?

Added UT.

Author: Marco Gaido <marcogaido91@gmail.com>

Closes #19494 from mgaido91/SPARK-22249.

(cherry picked from commit 8148f19ca1f0e0375603cb4f180c1bad8b0b8042)
Signed-off-by: Sean Owen <sowen@cloudera.com>
4 months ago[SPARK-22223][SQL] ObjectHashAggregate should not introduce unnecessary shuffle
Liang-Chi Hsieh [Mon, 16 Oct 2017 05:37:58 +0000 (13:37 +0800)] 
[SPARK-22223][SQL] ObjectHashAggregate should not introduce unnecessary shuffle

`ObjectHashAggregateExec` should override `outputPartitioning` in order to avoid unnecessary shuffle.

Added Jenkins test.

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #19501 from viirya/SPARK-22223.

(cherry picked from commit 0ae96495dedb54b3b6bae0bd55560820c5ca29a2)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
4 months ago[SPARK-21549][CORE] Respect OutputFormats with no/invalid output directory provided
Mridul Muralidharan [Mon, 16 Oct 2017 01:40:53 +0000 (18:40 -0700)] 
[SPARK-21549][CORE] Respect OutputFormats with no/invalid output directory provided

## What changes were proposed in this pull request?

PR #19294 added support for nulls, but spark 2.1 handled other error cases where the path argument can be invalid.
Namely:

* empty string
* URI parse exception while creating Path

This is a resubmission of PR #19487, which I messed up while updating my repo.

## How was this patch tested?

Enhanced test to cover new support added.

Author: Mridul Muralidharan <mridul@gmail.com>

Closes #19497 from mridulm/master.

(cherry picked from commit 13c1559587d0eb533c94f5a492390f81b048b347)
Signed-off-by: Mridul Muralidharan <mridul@gmail.com>
4 months ago[SPARK-22273][SQL] Fix key/value schema field names in HashMapGenerators.
Takuya UESHIN [Sat, 14 Oct 2017 06:24:36 +0000 (23:24 -0700)] 
[SPARK-22273][SQL] Fix key/value schema field names in HashMapGenerators.

## What changes were proposed in this pull request?

When fixing schema field names using escape characters with `addReferenceMinorObj()` at [SPARK-18952](https://issues.apache.org/jira/browse/SPARK-18952) (#16361), the double-quotes around the names remained, and the names became something like `"((java.lang.String) references[1])"`.

```java
/* 055 */     private int maxSteps = 2;
/* 056 */     private int numRows = 0;
/* 057 */     private org.apache.spark.sql.types.StructType keySchema = new org.apache.spark.sql.types.StructType().add("((java.lang.String) references[1])", org.apache.spark.sql.types.DataTypes.StringType);
/* 058 */     private org.apache.spark.sql.types.StructType valueSchema = new org.apache.spark.sql.types.StructType().add("((java.lang.String) references[2])", org.apache.spark.sql.types.DataTypes.LongType);
/* 059 */     private Object emptyVBase;
```

We should remove the double-quotes to refer the values in `references` properly:

```java
/* 055 */     private int maxSteps = 2;
/* 056 */     private int numRows = 0;
/* 057 */     private org.apache.spark.sql.types.StructType keySchema = new org.apache.spark.sql.types.StructType().add(((java.lang.String) references[1]), org.apache.spark.sql.types.DataTypes.StringType);
/* 058 */     private org.apache.spark.sql.types.StructType valueSchema = new org.apache.spark.sql.types.StructType().add(((java.lang.String) references[2]), org.apache.spark.sql.types.DataTypes.LongType);
/* 059 */     private Object emptyVBase;
```

## How was this patch tested?

Existing tests.

Author: Takuya UESHIN <ueshin@databricks.com>

Closes #19491 from ueshin/issues/SPARK-22273.

(cherry picked from commit e0503a7223410289d01bc4b20da3a451730577da)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
4 months ago[SPARK-14387][SPARK-16628][SPARK-18355][SQL] Use Spark schema to read ORC table inste...
Dongjoon Hyun [Fri, 13 Oct 2017 15:09:12 +0000 (23:09 +0800)] 
[SPARK-14387][SPARK-16628][SPARK-18355][SQL] Use Spark schema to read ORC table instead of ORC file schema

Before Hive 2.0, the ORC file schema has invalid column names like `_col1` and `_col2`. This is a well-known limitation and there are several Apache Spark issues with `spark.sql.hive.convertMetastoreOrc=true`. This PR ignores the ORC file schema and uses the Spark schema.

Pass the newly added test case.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #19470 from dongjoon-hyun/SPARK-18355.

(cherry picked from commit e6e36004afc3f9fc8abea98542248e9de11b4435)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
4 months ago[SPARK-22252][SQL][2.2] FileFormatWriter should respect the input query schema
Wenchen Fan [Fri, 13 Oct 2017 04:54:00 +0000 (21:54 -0700)] 
[SPARK-22252][SQL][2.2] FileFormatWriter should respect the input query schema

## What changes were proposed in this pull request?

https://github.com/apache/spark/pull/18386 fixes SPARK-21165 but breaks SPARK-22252. This PR reverts https://github.com/apache/spark/pull/18386 and picks the patch from https://github.com/apache/spark/pull/19483 to fix SPARK-21165.

## How was this patch tested?

new regression test

Author: Wenchen Fan <wenchen@databricks.com>

Closes #19484 from cloud-fan/bug.

4 months ago[SPARK-22217][SQL] ParquetFileFormat to support arbitrary OutputCommitters
Steve Loughran [Thu, 12 Oct 2017 23:40:26 +0000 (08:40 +0900)] 
[SPARK-22217][SQL] ParquetFileFormat to support arbitrary OutputCommitters

## What changes were proposed in this pull request?

`ParquetFileFormat` to relax its requirement of output committer class from `org.apache.parquet.hadoop.ParquetOutputCommitter` or subclass thereof (and so implicitly Hadoop `FileOutputCommitter`) to any committer implementing `org.apache.hadoop.mapreduce.OutputCommitter`

This enables output committers which don't write to the filesystem the way `FileOutputCommitter` does to save parquet data from a dataframe: at present you cannot do this.

Before using a committer which isn't a subclass of `ParquetOutputCommitter`, it checks to see if the context has requested summary metadata by setting `parquet.enable.summary-metadata`. If true, and the committer class isn't a parquet committer, it raises a RuntimeException with an error message.

(It could downgrade, of course, but raising an exception makes it clear there won't be a summary. It also makes the behaviour testable.)

Note that `SQLConf` already states that any `OutputCommitter` can be used, but that typically it's a subclass of ParquetOutputCommitter. That's not currently true. This patch makes the code consistent with the docs, adding tests to verify.
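The fail-fast check can be sketched in Python (class and function names hypothetical, mirroring the committers described below):

```python
def verify_committer(committer_cls, conf, parquet_base):
    # if the job asked for summary metadata but the committer can't write
    # it, fail fast instead of silently producing no summary file
    wants_summary = conf.get("parquet.enable.summary-metadata") == "true"
    if wants_summary and not issubclass(committer_cls, parquet_base):
        raise RuntimeError(
            "Summary metadata requested, but %s is not a ParquetOutputCommitter"
            % committer_cls.__name__)

class ParquetOutputCommitter: pass
class MarkingFileOutputCommitter: pass    # hypothetical non-parquet committer

# parquet committer + summary: fine
verify_committer(ParquetOutputCommitter,
                 {"parquet.enable.summary-metadata": "true"},
                 ParquetOutputCommitter)

# non-parquet committer + summary: exception
try:
    verify_committer(MarkingFileOutputCommitter,
                     {"parquet.enable.summary-metadata": "true"},
                     ParquetOutputCommitter)
    raised = False
except RuntimeError:
    raised = True
assert raised
```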

## How was this patch tested?

The patch includes a test suite, `ParquetCommitterSuite`, with a new committer, `MarkingFileOutputCommitter` which extends `FileOutputCommitter` and writes a marker file in the destination directory. The presence of the marker file can be used to verify the new committer was used. The tests then try the combinations of Parquet committer summary/no-summary and marking committer summary/no-summary.

| committer | summary | outcome |
|-----------|---------|---------|
| parquet   | true    | success |
| parquet   | false   | success |
| marking   | false   | success with marker |
| marking   | true    | exception |

All tests are happy.

Author: Steve Loughran <stevel@hortonworks.com>

Closes #19448 from steveloughran/cloud/SPARK-22217-committer.

4 months ago[SPARK-21907][CORE][BACKPORT 2.2] oom during spill
Eyal Farago [Thu, 12 Oct 2017 12:56:04 +0000 (14:56 +0200)] 
[SPARK-21907][CORE][BACKPORT 2.2] oom during spill

back-port #19181 to branch-2.2.

## What changes were proposed in this pull request?
1. a test reproducing [SPARK-21907](https://issues.apache.org/jira/browse/SPARK-21907)
2. a fix for the root cause of the issue.

`org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill` calls `org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.reset`, which may trigger another spill.
When this happens, the `array` member is already de-allocated but still referenced by the code, which causes the nested spill to fail with an NPE in `org.apache.spark.memory.TaskMemoryManager.getPage`.
This patch introduces a reproduction in a test case and a fix: the fix simply sets the in-mem sorter's array member to an empty array before actually performing the allocation. This prevents the spilling code from 'touching' the de-allocated array.

## How was this patch tested?
introduced a new test case: `org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorterSuite#testOOMDuringSpill`.

Author: Eyal Farago <eyal@nrgene.com>

Closes #19481 from eyalfa/SPARK-21907__oom_during_spill__BACKPORT-2.2.

4 months ago[SPARK-22218] spark shuffle services fails to update secret on app re-attempts
Thomas Graves [Mon, 9 Oct 2017 19:56:37 +0000 (12:56 -0700)] 
[SPARK-22218] spark shuffle services fails to update secret on app re-attempts

This patch fixes application re-attempts when running spark on yarn using the external shuffle service with security on. Currently executors fail to launch on any application re-attempt when launched on a nodemanager that had an executor from the first attempt. The reason for this is that we aren't updating the secret key after the first application attempt. The fix here is to just remove the containsKey check to see if it already exists. In this way, we always add it and make sure it's the most recent secret. Similarly, remove the containsKey check on the remove path, since it's just an extra check that isn't really needed.

Note this worked before spark 2.2 because the check used to be contains (which was looking for the value) rather than containsKey, so it never matched and the new secret was always added.
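The overwrite-vs-guard distinction reduces to a dict put (a toy sketch with hypothetical names, not the shuffle service's actual API):

```python
secrets = {}

def register_app(app_id, secret):
    # always overwrite: a re-attempt must replace the stale secret left by
    # the first attempt (the old containsKey guard skipped this update)
    secrets[app_id] = secret

def unregister_app(app_id):
    secrets.pop(app_id, None)   # no containsKey pre-check needed here either

register_app("app_1", "secret-attempt-1")
register_app("app_1", "secret-attempt-2")   # application re-attempt
assert secrets["app_1"] == "secret-attempt-2"

unregister_app("app_1")
unregister_app("app_1")                     # removing twice is harmless
assert "app_1" not in secrets
```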

The patch was tested on a 10-node cluster, and a unit test was added.
The test run was a wordcount where the output directory already existed. With the bug present, the application attempt failed with the maximum number of executor failures, which were all saslExceptions. With the fix present, the application re-attempts fail with "directory already exists", or, when the directory is removed between attempts, the re-attempts succeed.

Author: Thomas Graves <tgraves@unharmedunarmed.corp.ne1.yahoo.com>

Closes #19450 from tgravescs/SPARK-22218.

(cherry picked from commit a74ec6d7bbfe185ba995dcb02d69e90a089c293e)
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
4 months ago[SPARK-21549][CORE] Respect OutputFormats with no output directory provided
Sergey Zhemzhitsky [Sat, 7 Oct 2017 03:43:53 +0000 (20:43 -0700)] 
[SPARK-21549][CORE] Respect OutputFormats with no output directory provided

## What changes were proposed in this pull request?

Fix for https://issues.apache.org/jira/browse/SPARK-21549 JIRA issue.

Since version 2.2 Spark does not respect OutputFormat with no output paths provided.
The examples of such formats are [Cassandra OutputFormat](https://github.com/finn-no/cassandra-hadoop/blob/08dfa3a7ac727bb87269f27a1c82ece54e3f67e6/src/main/java/org/apache/cassandra/hadoop2/AbstractColumnFamilyOutputFormat.java), [Aerospike OutputFormat](https://github.com/aerospike/aerospike-hadoop/blob/master/mapreduce/src/main/java/com/aerospike/hadoop/mapreduce/AerospikeOutputFormat.java), etc. which do not have an ability to rollback the results written to an external systems on job failure.

An output directory is required by Spark to allow files to be committed to an absolute output location; that is not the case for output formats which write data to external systems.

This pull request prevents accessing the `absPathStagingDir` method, which causes the error described in SPARK-21549, unless there are files to rename in `addedAbsPathFiles`.

## How was this patch tested?

Unit tests

Author: Sergey Zhemzhitsky <szhemzhitski@gmail.com>

Closes #19294 from szhem/SPARK-21549-abs-output-commits.

(cherry picked from commit 2030f19511f656e9534f3fd692e622e45f9a074e)
Signed-off-by: Mridul Muralidharan <mridul@gmail.com>
4 months ago[SPARK-22206][SQL][SPARKR] gapply in R can't work on empty grouping columns
Liang-Chi Hsieh [Thu, 5 Oct 2017 14:36:18 +0000 (23:36 +0900)] 
[SPARK-22206][SQL][SPARKR] gapply in R can't work on empty grouping columns

## What changes were proposed in this pull request?

Looks like `FlatMapGroupsInRExec.requiredChildDistribution` didn't consider empty grouping attributes. This becomes a problem when running `EnsureRequirements`, and `gapply` in R can't work on empty grouping columns.

## How was this patch tested?

Added test.

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #19436 from viirya/fix-flatmapinr-distribution.

(cherry picked from commit ae61f187aa0471242c046fdeac6ed55b9b98a3f6)
Signed-off-by: hyukjinkwon <gurwls223@gmail.com>
4 months ago[SPARK-20466][CORE] HadoopRDD#addLocalConfiguration throws NPE
Sahil Takiar [Tue, 3 Oct 2017 23:53:32 +0000 (16:53 -0700)] 
[SPARK-20466][CORE] HadoopRDD#addLocalConfiguration throws NPE

## What changes were proposed in this pull request?

Fix for SPARK-20466; the full description of the issue is in the JIRA. To summarize, `HadoopRDD` uses a metadata cache to cache `JobConf` objects. The cache uses soft references, which means the JVM can delete entries from the cache whenever there is GC pressure. `HadoopRDD#getJobConf` had a bug where it would check if the cache contained the `JobConf`; if it did, it would get the `JobConf` from the cache and return it. This doesn't work when soft references are used, as the JVM can delete the entry between the existence check and the get call.
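The check-then-get race can be reproduced deterministically with a toy cache whose entries vanish right after the containment check (a Python sketch of the failure mode, not Spark's code):

```python
class EvictingCache(dict):
    """Simulates a soft-reference cache where GC clears the entry
    between the containment check and the subsequent get."""
    def __contains__(self, key):
        present = dict.__contains__(self, key)
        self.pop(key, None)          # "GC" evicts right after the check
        return present

def get_job_conf_buggy(cache, key, create):
    if key in cache:                 # entry can vanish right here...
        return cache[key]            # ...so this lookup may raise KeyError
    conf = cache[key] = create()
    return conf

def get_job_conf_fixed(cache, key, create):
    conf = cache.get(key)            # single lookup: no check-then-get gap
    if conf is None:
        conf = cache[key] = create()
    return conf

try:
    get_job_conf_buggy(EvictingCache(a="conf"), "a", lambda: "rebuilt")
    buggy_raised = False
except KeyError:
    buggy_raised = True
assert buggy_raised
assert get_job_conf_fixed(EvictingCache(a="conf"), "a", lambda: "rebuilt") == "conf"
```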

## How was this patch tested?

Haven't thought of a good way to test this yet, given the issue only occurs sometimes and happens during high GC pressure. I was thinking of using mocks to verify `#getJobConf` is doing the right thing. I deleted the method `HadoopRDD#containsCachedMetadata` so that we don't hit this issue again.

Author: Sahil Takiar <stakiar@cloudera.com>

Closes #19413 from sahilTakiar/master.

(cherry picked from commit e36ec38d89472df0dfe12222b6af54cd6eea8e98)
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
4 months ago[SPARK-22178][SQL] Refresh Persistent Views by REFRESH TABLE Command
gatorsmile [Tue, 3 Oct 2017 19:40:22 +0000 (12:40 -0700)] 
[SPARK-22178][SQL] Refresh Persistent Views by REFRESH TABLE Command

## What changes were proposed in this pull request?
The underlying tables of persistent views are not refreshed when users issue the REFRESH TABLE command against the persistent views.

## How was this patch tested?
Added a test case

Author: gatorsmile <gatorsmile@gmail.com>

Closes #19405 from gatorsmile/refreshView.

(cherry picked from commit e65b6b7ca1a7cff1b91ad2262bb7941e6bf057cd)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
4 months ago[SPARK-22158][SQL][BRANCH-2.2] convertMetastore should not ignore table property
Dongjoon Hyun [Tue, 3 Oct 2017 18:42:55 +0000 (11:42 -0700)] 
[SPARK-22158][SQL][BRANCH-2.2] convertMetastore should not ignore table property

## What changes were proposed in this pull request?

From the beginning, **convertMetastoreOrc** ignores table properties and uses an empty map instead. This PR fixes that. **convertMetastoreParquet** also ignores them.

```scala
val options = Map[String, String]()
```

- [SPARK-14070: HiveMetastoreCatalog.scala](https://github.com/apache/spark/pull/11891/files#diff-ee66e11b56c21364760a5ed2b783f863R650)
- [Master branch: HiveStrategies.scala](https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala#L197)

## How was this patch tested?

Pass the Jenkins with an updated test suite.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #19417 from dongjoon-hyun/SPARK-22158-BRANCH-2.2.

4 months ago[SPARK-22167][R][BUILD] sparkr packaging issue allow zinc
Holden Karau [Mon, 2 Oct 2017 18:46:51 +0000 (11:46 -0700)] 
[SPARK-22167][R][BUILD] sparkr packaging issue allow zinc

## What changes were proposed in this pull request?

When zinc is running, the pwd might be in the root of the project. A quick solution to this is to not go a level up in case we are in the root rather than in root/core/. If we are in the root everything works fine; if we are in core, add a script which goes a level up and runs from there.

## How was this patch tested?

set -x in the SparkR install scripts.

Author: Holden Karau <holden@us.ibm.com>

Closes #19402 from holdenk/SPARK-22167-sparkr-packaging-issue-allow-zinc.

(cherry picked from commit 8fab7995d36c7bc4524393b20a4e524dbf6bbf62)
Signed-off-by: Holden Karau <holden@us.ibm.com>
4 months ago[SPARK-22146] FileNotFoundException while reading ORC files containing special characters
Marco Gaido [Fri, 29 Sep 2017 06:14:53 +0000 (23:14 -0700)] 
[SPARK-22146] FileNotFoundException while reading ORC files containing special characters

## What changes were proposed in this pull request?

Reading ORC files containing special characters like '%' fails with a FileNotFoundException.
This PR aims to fix the problem.

## How was this patch tested?

Added UT.

Author: Marco Gaido <marcogaido91@gmail.com>
Author: Marco Gaido <mgaido@hortonworks.com>

Closes #19368 from mgaido91/SPARK-22146.

4 months ago[SPARK-22161][SQL] Add Impala-modified TPC-DS queries
gatorsmile [Fri, 29 Sep 2017 15:59:42 +0000 (08:59 -0700)] 
[SPARK-22161][SQL] Add Impala-modified TPC-DS queries

## What changes were proposed in this pull request?

Added IMPALA-modified TPCDS queries to TPC-DS query suites.

- Ref: https://github.com/cloudera/impala-tpcds-kit/tree/master/queries

## How was this patch tested?
N/A

Author: gatorsmile <gatorsmile@gmail.com>

Closes #19386 from gatorsmile/addImpalaQueries.

(cherry picked from commit 9ed7394a68315126b2dd00e53a444cc65b5a62ea)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
4 months ago[SPARK-22129][SPARK-22138] Release script improvements
Holden Karau [Fri, 29 Sep 2017 15:04:14 +0000 (08:04 -0700)] 
[SPARK-22129][SPARK-22138] Release script improvements

## What changes were proposed in this pull request?

Use the GPG_KEY param, fix lsof to non-hardcoded path, remove version swap since it wasn't really needed. Use EXPORT on JAVA_HOME for downstream scripts as well.

## How was this patch tested?

Rolled 2.1.2 RC2

Author: Holden Karau <holden@us.ibm.com>

Closes #19359 from holdenk/SPARK-22129-fix-signing.

(cherry picked from commit ecbe416ab5001b32737966c5a2407597a1dafc32)
Signed-off-by: Holden Karau <holden@us.ibm.com>
4 months ago[SPARK-22143][SQL][BRANCH-2.2] Fix memory leak in OffHeapColumnVector
Herman van Hovell [Thu, 28 Sep 2017 15:51:40 +0000 (17:51 +0200)] 
[SPARK-22143][SQL][BRANCH-2.2] Fix memory leak in OffHeapColumnVector

This is a backport of https://github.com/apache/spark/commit/02bb0682e68a2ce81f3b98d33649d368da7f2b3d.

## What changes were proposed in this pull request?
`WriteableColumnVector` does not close its child column vectors. This can create memory leaks for `OffHeapColumnVector`, where we do not clean up the memory allocated by a vector's children. This can be especially bad for string columns (which use a child byte column vector).
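A toy Python sketch of the fix described above (hypothetical class, not Spark's actual column vector code): closing a parent vector must also close its children so their allocations are released.

```python
class FakeColumnVector:
    """Hypothetical stand-in for a column vector that owns child vectors."""

    def __init__(self, children=()):
        self.children = list(children)
        self.closed = False

    def close(self):
        # The fix: release this vector AND recursively close its children,
        # so e.g. a string column's child byte vector is also freed.
        for child in self.children:
            child.close()
        self.closed = True


child = FakeColumnVector()
parent = FakeColumnVector(children=[child])
parent.close()
assert child.closed  # without the recursive close, the child would leak
```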

## How was this patch tested?
I have updated the existing tests to always use both on-heap and off-heap vectors. Testing and diagnosis was done locally.

Author: Herman van Hovell <hvanhovell@databricks.com>

Closes #19378 from hvanhovell/SPARK-22143-2.2.

4 months ago[SPARK-22135][MESOS] metrics in spark-dispatcher not being registered properly
Paul Mackles [Thu, 28 Sep 2017 06:43:31 +0000 (14:43 +0800)] 
[SPARK-22135][MESOS] metrics in spark-dispatcher not being registered properly

## What changes were proposed in this pull request?

Fix a trivial bug in how metrics are registered in the Mesos dispatcher: the bug resulted in a new registry being created each time the metricRegistry() method was called.
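A minimal Python sketch of the bug and fix (toy class, not the dispatcher's actual code): the registry accessor must memoize one instance, or metrics registered through an earlier call are silently lost.

```python
class Dispatcher:
    """Toy stand-in for the dispatcher; metric_registry is illustrative."""

    def __init__(self):
        self._registry = None

    def metric_registry(self):
        # The fix: create the registry once and reuse it on later calls.
        # Before, each call returned a brand-new (empty) registry.
        if self._registry is None:
            self._registry = {}
        return self._registry


d = Dispatcher()
d.metric_registry()["waitingDrivers"] = 0
assert "waitingDrivers" in d.metric_registry()  # survives a second call
```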

## How was this patch tested?

Verified manually on local mesos setup

Author: Paul Mackles <pmackles@adobe.com>

Closes #19358 from pmackles/SPARK-22135.

(cherry picked from commit f20be4d70bf321f377020d1bde761a43e5c72f0a)
Signed-off-by: jerryshao <sshao@hortonworks.com>
4 months ago[SPARK-22140] Add TPCDSQuerySuite
gatorsmile [Thu, 28 Sep 2017 00:03:42 +0000 (17:03 -0700)] 
[SPARK-22140] Add TPCDSQuerySuite

## What changes were proposed in this pull request?
Now, we are not running TPC-DS queries as regular test cases. Thus, we need to add a test suite using empty tables to ensure that new code changes will not break them. For example, optimizer/analyzer batches should not exceed the max iteration.

## How was this patch tested?
N/A

Author: gatorsmile <gatorsmile@gmail.com>

Closes #19361 from gatorsmile/tpcdsQuerySuite.

(cherry picked from commit 9244957b500cb2b458c32db2c63293a1444690d7)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
4 months ago[SPARK-22141][BACKPORT][SQL] Propagate empty relation before checking Cartesian products
Wang Gengliang [Wed, 27 Sep 2017 15:40:31 +0000 (17:40 +0200)] 
[SPARK-22141][BACKPORT][SQL] Propagate empty relation before checking Cartesian products

Back port https://github.com/apache/spark/pull/19362 to branch-2.2

## What changes were proposed in this pull request?

When inferring constraints from children, a Join's condition can be simplified to None.
For example,
```
val testRelation = LocalRelation('a.int)
val x = testRelation.as("x")
val y = testRelation.where($"a" === 2 && !($"a" === 2)).as("y")
x.join(y).where($"x.a" === $"y.a")
```
The plan will become
```
Join Inner
:- LocalRelation <empty>, [a#23]
+- LocalRelation <empty>, [a#224]
```
And the Cartesian product check will throw an exception for the above plan.

Propagating empty relations before checking for Cartesian products resolves the issue.
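The rule-ordering idea can be sketched in a few lines of Python (toy tuple-based plans, nothing like Spark's actual optimizer):

```python
def propagate_empty_relation(plan):
    # A join where either side is an empty relation is itself empty.
    if isinstance(plan, tuple) and plan[0] == "join" and "empty" in plan[1:3]:
        return "empty"
    return plan


def check_cartesian(plan):
    # Raise on a join whose condition was simplified away entirely.
    if isinstance(plan, tuple) and plan[0] == "join" and plan[3] is None:
        raise ValueError("Detected implicit cartesian product")
    return plan


plan = ("join", "empty", "t2", None)  # condition simplified to None
# Running the Cartesian check first would raise; propagating emptiness
# first collapses the plan, and the check then passes it through.
safe = check_cartesian(propagate_empty_relation(plan))
assert safe == "empty"
```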

## How was this patch tested?

Unit test

Author: Wang Gengliang <ltnwgl@gmail.com>

Closes #19366 from gengliangwang/branch-2.2.

4 months ago[SPARK-22120][SQL] TestHiveSparkSession.reset() should clean out Hive warehouse directory
Greg Owen [Mon, 25 Sep 2017 21:16:11 +0000 (14:16 -0700)] 
[SPARK-22120][SQL] TestHiveSparkSession.reset() should clean out Hive warehouse directory

## What changes were proposed in this pull request?
During TestHiveSparkSession.reset(), which is called after each TestHiveSingleton suite, we now delete and recreate the Hive warehouse directory.

## How was this patch tested?
Ran full suite of tests locally, verified that they pass.

Author: Greg Owen <greg@databricks.com>

Closes #19341 from GregOwen/SPARK-22120.

(cherry picked from commit ce204780ee2434ff6bae50428ae37083835798d3)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
4 months ago[SPARK-22083][CORE] Release locks in MemoryStore.evictBlocksToFreeSpace
Imran Rashid [Mon, 25 Sep 2017 19:02:30 +0000 (12:02 -0700)] 
[SPARK-22083][CORE] Release locks in MemoryStore.evictBlocksToFreeSpace

## What changes were proposed in this pull request?

MemoryStore.evictBlocksToFreeSpace acquires write locks for all the
blocks it intends to evict up front.  If there is a failure to evict
blocks (eg., some failure dropping a block to disk), then we have to
release the lock.  Otherwise the lock is never released and an executor
trying to get the lock will wait forever.
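The locking discipline described above can be sketched in Python (toy locks and a hypothetical `drop_block` callback, not Spark's actual MemoryStore): locks acquired up front must be released in a `finally`, even when dropping a block fails.

```python
import threading

locks = {"block-%d" % i: threading.Lock() for i in range(3)}


def evict_blocks_to_free_space(block_ids, drop_block):
    """Acquire all write locks up front; release them even on failure."""
    acquired = []
    try:
        for bid in block_ids:
            locks[bid].acquire()
            acquired.append(bid)
        for bid in block_ids:
            drop_block(bid)
    finally:
        # Without this finally, a failure mid-eviction would leave the
        # locks held forever, blocking any thread that later needs them.
        for bid in acquired:
            locks[bid].release()


def failing_drop(bid):
    raise IOError("disk failure while dropping " + bid)


try:
    evict_blocks_to_free_space(list(locks), failing_drop)
except IOError:
    pass
assert all(not lk.locked() for lk in locks.values())
```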

## How was this patch tested?

Added unit test.

Author: Imran Rashid <irashid@cloudera.com>

Closes #19311 from squito/SPARK-22083.

(cherry picked from commit 2c5b9b1173c23f6ca8890817a9a35dc7557b0776)
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
4 months ago[SPARK-22107] Change as to alias in python quickstart
John O'Leary [Mon, 25 Sep 2017 00:16:27 +0000 (09:16 +0900)] 
[SPARK-22107] Change as to alias in python quickstart

## What changes were proposed in this pull request?

Updated docs so that a line of python in the quick start guide executes. Closes #19283

## How was this patch tested?

Existing tests.

Author: John O'Leary <jgoleary@gmail.com>

Closes #19326 from jgoleary/issues/22107.

(cherry picked from commit 20adf9aa1f42353432d356117e655e799ea1290b)
Signed-off-by: hyukjinkwon <gurwls223@gmail.com>
4 months ago[SPARK-22109][SQL][BRANCH-2.2] Resolves type conflicts between strings and timestamps...
hyukjinkwon [Sat, 23 Sep 2017 17:51:04 +0000 (02:51 +0900)] 
[SPARK-22109][SQL][BRANCH-2.2] Resolves type conflicts between strings and timestamps in partition column

## What changes were proposed in this pull request?

This PR backports https://github.com/apache/spark/commit/04975a68b583a6175f93da52374108e5d4754d9a into branch-2.2.

## How was this patch tested?

Unit tests in `ParquetPartitionDiscoverySuite`.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #19333 from HyukjinKwon/SPARK-22109-backport-2.2.

4 months ago[SPARK-22092] Reallocation in OffHeapColumnVector.reserveInternal corrupts struct...
Ala Luszczak [Sat, 23 Sep 2017 14:09:47 +0000 (16:09 +0200)] 
[SPARK-22092] Reallocation in OffHeapColumnVector.reserveInternal corrupts struct and array data

`OffHeapColumnVector.reserveInternal()` will only copy already inserted values during reallocation if `data != null`. In vectors containing arrays or structs this is incorrect, since the `data` field is not used at all for those types. We need to check `nulls` instead.

Adds new tests to `ColumnVectorSuite` that reproduce the errors.

Author: Ala Luszczak <ala@databricks.com>

Closes #19323 from ala/port-vector-realloc.

4 months ago[SPARK-18136] Fix SPARK_JARS_DIR for Python pip install on Windows
Jakub Nowacki [Sat, 23 Sep 2017 12:04:10 +0000 (21:04 +0900)] 
[SPARK-18136] Fix SPARK_JARS_DIR for Python pip install on Windows

## What changes were proposed in this pull request?

Fix the setup of `SPARK_JARS_DIR` on Windows: it looks for the `%SPARK_HOME%\RELEASE` file instead of `%SPARK_HOME%\jars` as it should, and the RELEASE file is not included in the `pip` build of PySpark.
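A small Python sketch of the intended lookup (illustrative only; the real fix lives in the Windows launcher scripts): check for the jars directory itself rather than relying on a RELEASE file that pip installs do not ship.

```python
import os
import tempfile


def find_spark_jars_dir(spark_home):
    """Locate the jars dir without depending on the RELEASE marker file."""
    jars = os.path.join(spark_home, "jars")
    if os.path.isdir(jars):
        return jars
    raise RuntimeError("Cannot find Spark jars directory under " + spark_home)


# A pip-style layout: jars/ exists, no RELEASE file.
home = tempfile.mkdtemp()
os.makedirs(os.path.join(home, "jars"))
assert find_spark_jars_dir(home).endswith("jars")
```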

## How was this patch tested?

Local install of PySpark on Anaconda 4.4.0 (Python 3.6.1).

Author: Jakub Nowacki <j.s.nowacki@gmail.com>

Closes #19310 from jsnowacki/master.

(cherry picked from commit c11f24a94007bbaad0835645843e776507094071)
Signed-off-by: hyukjinkwon <gurwls223@gmail.com>
4 months ago[SPARK-22072][SPARK-22071][BUILD] Improve release build scripts
Holden Karau [Fri, 22 Sep 2017 07:14:57 +0000 (00:14 -0700)] 
[SPARK-22072][SPARK-22071][BUILD] Improve release build scripts

## What changes were proposed in this pull request?

Check JDK version (with javac) and use SPARK_VERSION for publish-release

## How was this patch tested?

Manually tried local build with wrong JDK / JAVA_HOME & built a local release (LFTP disabled)

Author: Holden Karau <holden@us.ibm.com>

Closes #19312 from holdenk/improve-release-scripts-r2.

(cherry picked from commit 8f130ad40178e35fecb3f2ba4a61ad23e6a90e3d)
Signed-off-by: Holden Karau <holden@us.ibm.com>
4 months ago[SPARK-22094][SS] processAllAvailable should check the query state
Shixiong Zhu [Fri, 22 Sep 2017 04:55:07 +0000 (21:55 -0700)] 
[SPARK-22094][SS] processAllAvailable should check the query state

`processAllAvailable` should also check the query state and if the query is stopped, it should return.

The new unit test.

Author: Shixiong Zhu <zsxwing@gmail.com>

Closes #19314 from zsxwing/SPARK-22094.

(cherry picked from commit fedf6961be4e99139eb7ab08d5e6e29187ea5ccf)
Signed-off-by: Shixiong Zhu <zsxwing@gmail.com>
4 months ago[SPARK-21928][CORE] Set classloader on SerializerManager's private kryo
Imran Rashid [Thu, 21 Sep 2017 17:20:19 +0000 (10:20 -0700)] 
[SPARK-21928][CORE] Set classloader on SerializerManager's private kryo

## What changes were proposed in this pull request?

We have to make sure that SerializerManager's private instance of
kryo also uses the right classloader, regardless of the current thread
classloader.  In particular, this fixes serde during remote cache
fetches, as those occur in netty threads.

## How was this patch tested?

Manual tests & existing suite via jenkins.  I haven't been able to reproduce this in a unit test, because when a remote RDD partition can't be fetched, there is a warning message and then the partition is just recomputed locally.  I manually verified the warning message is no longer present.

Author: Imran Rashid <irashid@cloudera.com>

Closes #19280 from squito/SPARK-21928_ser_classloader.

(cherry picked from commit b75bd1777496ce0354458bf85603a8087a6a0ff8)
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
4 months ago[SPARK-21384][YARN] Spark + YARN fails with LocalFileSystem as default FS
Devaraj K [Wed, 20 Sep 2017 23:22:36 +0000 (16:22 -0700)] 
[SPARK-21384][YARN] Spark + YARN fails with LocalFileSystem as default FS

## What changes were proposed in this pull request?

When the libraries temp directory (i.e. the __spark_libs__*.zip dir) and the staging dir (destination) are on the same file system, the __spark_libs__*.zip is not copied to the staging directory. But after making this decision, the libraries zip file is deleted immediately and becomes unavailable for the Node Manager's localization.

With this change, client copies the files to remote always when the source scheme is "file".

## How was this patch tested?

I have verified it manually in yarn/cluster and yarn/client modes with hdfs and local file systems.

Author: Devaraj K <devaraj@apache.org>

Closes #19141 from devaraj-kavali/SPARK-21384.

(cherry picked from commit 55d5fa79db883e4d93a9c102a94713c9d2d1fb55)
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
5 months ago[SPARK-22076][SQL] Expand.projections should not be a Stream
Wenchen Fan [Wed, 20 Sep 2017 16:00:43 +0000 (09:00 -0700)] 
[SPARK-22076][SQL] Expand.projections should not be a Stream

## What changes were proposed in this pull request?

Spark with Scala 2.10 fails with a group by cube:
```
spark.range(1).select($"id" as "a", $"id" as "b").write.partitionBy("a").mode("overwrite").saveAsTable("rollup_bug")
spark.sql("select 1 from rollup_bug group by rollup ()").show
```

It can be traced back to https://github.com/apache/spark/pull/15484 , which made `Expand.projections` a lazy `Stream` for group by cube.

In scala 2.10 `Stream` captures a lot of stuff, and in this case it captures the entire query plan which has some un-serializable parts.

This change is also good for master branch, to reduce the serialized size of `Expand.projections`.
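A Python analogy for why a lazy collection is dangerous here (toy example; Scala 2.10 `Stream` capture behaves differently in detail): a lazily evaluated generator keeps a reference to its enclosing context and cannot be serialized, while an eagerly materialized list can.

```python
import pickle

big_context = list(range(1000))  # stands in for the captured query plan

lazy_projections = (x * 2 for x in big_context)    # deferred, holds context
eager_projections = [x * 2 for x in big_context]   # materialized up front

try:
    pickle.dumps(lazy_projections)
    lazy_ok = True
except TypeError:
    lazy_ok = False

assert not lazy_ok                        # generators are not picklable
assert pickle.dumps(eager_projections)    # a plain list serializes fine
```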

## How was this patch tested?

manually verified with Spark with Scala 2.10.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #19289 from cloud-fan/bug.

(cherry picked from commit ce6a71e013c403d0a3690cf823934530ce0ea5ef)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
5 months ago[SPARK-22052] Incorrect Metric assigned in MetricsReporter.scala
Taaffy [Tue, 19 Sep 2017 09:20:04 +0000 (10:20 +0100)] 
[SPARK-22052] Incorrect Metric assigned in MetricsReporter.scala

The current implementation of processingRate-total uses the wrong metric: it mistakenly reads inputRowsPerSecond instead of processedRowsPerSecond.

## What changes were proposed in this pull request?
Adjust processingRate-total from using inputRowsPerSecond to processedRowsPerSecond
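The one-line nature of the fix can be sketched in Python (hypothetical progress object; only the two field names mirror the real streaming progress API):

```python
class Progress:
    def __init__(self, input_rps, processed_rps):
        self.inputRowsPerSecond = input_rps
        self.processedRowsPerSecond = processed_rps


def register_gauges(progress):
    return {
        "inputRate-total": progress.inputRowsPerSecond,
        # Before the fix this gauge also read inputRowsPerSecond, so both
        # CSV metrics files always showed identical values.
        "processingRate-total": progress.processedRowsPerSecond,
    }


g = register_gauges(Progress(input_rps=100.0, processed_rps=80.0))
assert g["inputRate-total"] != g["processingRate-total"]
```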

## How was this patch tested?

Built Spark from source with the proposed change and tested the output. Before the change, the CSV metrics files for inputRate-total and processingRate-total displayed the same values due to the error. After changing MetricsReporter.scala, the processingRate-total CSV file displayed the correct metric.
<img width="963" alt="processed rows per second" src="https://user-images.githubusercontent.com/32072374/30554340-82eea12c-9ca4-11e7-8370-8168526ff9a2.png">

Author: Taaffy <32072374+Taaffy@users.noreply.github.com>

Closes #19268 from Taaffy/patch-1.

(cherry picked from commit 1bc17a6b8add02772a8a0a1048ac6a01d045baf4)
Signed-off-by: Sean Owen <sowen@cloudera.com>
5 months ago[SPARK-22047][FLAKY TEST] HiveExternalCatalogVersionsSuite
Wenchen Fan [Tue, 19 Sep 2017 03:53:50 +0000 (11:53 +0800)] 
[SPARK-22047][FLAKY TEST] HiveExternalCatalogVersionsSuite

## What changes were proposed in this pull request?

This PR tries to download Spark for each test run, to make sure each test run is absolutely isolated.

## How was this patch tested?

N/A

Author: Wenchen Fan <wenchen@databricks.com>

Closes #19265 from cloud-fan/test.

(cherry picked from commit 10f45b3c84ff7b3f1765dc6384a563c33d26548b)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
5 months ago[SPARK-22047][TEST] ignore HiveExternalCatalogVersionsSuite
Wenchen Fan [Mon, 18 Sep 2017 08:42:08 +0000 (16:42 +0800)] 
[SPARK-22047][TEST] ignore HiveExternalCatalogVersionsSuite

## What changes were proposed in this pull request?

As reported in https://issues.apache.org/jira/browse/SPARK-22047 , HiveExternalCatalogVersionsSuite is failing frequently, let's disable this test suite to unblock other PRs, I'm looking into the root cause.

## How was this patch tested?
N/A

Author: Wenchen Fan <wenchen@databricks.com>

Closes #19264 from cloud-fan/test.

(cherry picked from commit 894a7561de2c2ff01fe7fcc5268378161e9e5643)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
5 months ago[SPARK-22043][PYTHON] Improves error message for show_profiles and dump_profiles
hyukjinkwon [Mon, 18 Sep 2017 04:20:11 +0000 (13:20 +0900)] 
[SPARK-22043][PYTHON] Improves error message for show_profiles and dump_profiles

## What changes were proposed in this pull request?

This PR proposes to improve error message from:

```
>>> sc.show_profiles()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../spark/python/pyspark/context.py", line 1000, in show_profiles
    self.profiler_collector.show_profiles()
AttributeError: 'NoneType' object has no attribute 'show_profiles'
>>> sc.dump_profiles("/tmp/abc")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../spark/python/pyspark/context.py", line 1005, in dump_profiles
    self.profiler_collector.dump_profiles(path)
AttributeError: 'NoneType' object has no attribute 'dump_profiles'
```

to

```
>>> sc.show_profiles()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../spark/python/pyspark/context.py", line 1003, in show_profiles
    raise RuntimeError("'spark.python.profile' configuration must be set "
RuntimeError: 'spark.python.profile' configuration must be set to 'true' to enable Python profile.
>>> sc.dump_profiles("/tmp/abc")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../spark/python/pyspark/context.py", line 1012, in dump_profiles
    raise RuntimeError("'spark.python.profile' configuration must be set "
RuntimeError: 'spark.python.profile' configuration must be set to 'true' to enable Python profile.
```
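The pattern behind the improvement, as a toy sketch (only the `profiler_collector` field name and the error message mirror the real PySpark code):

```python
class ToyContext:
    """Toy stand-in for SparkContext."""

    def __init__(self, profiling_enabled):
        self.profiler_collector = object() if profiling_enabled else None

    def show_profiles(self):
        if self.profiler_collector is None:
            # Fail early with an actionable message instead of letting an
            # AttributeError escape from the None attribute.
            raise RuntimeError(
                "'spark.python.profile' configuration must be set to 'true' "
                "to enable Python profile.")


msg = ""
try:
    ToyContext(profiling_enabled=False).show_profiles()
except RuntimeError as e:
    msg = str(e)
assert "spark.python.profile" in msg
```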

## How was this patch tested?

Unit tests added in `python/pyspark/tests.py` and manual tests.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #19260 from HyukjinKwon/profile-errors.

(cherry picked from commit 7c7266208a3be984ac1ce53747dc0c3640f4ecac)
Signed-off-by: hyukjinkwon <gurwls223@gmail.com>
5 months ago[SPARK-21953] Show both memory and disk bytes spilled if either is present
Andrew Ash [Mon, 18 Sep 2017 02:42:24 +0000 (10:42 +0800)] 
[SPARK-21953] Show both memory and disk bytes spilled if either is present

As written now, there must be both memory and disk bytes spilled to show either of them. If there is only one of those types of spill recorded, it will be hidden.
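The condition change is tiny; a sketch of the predicate (hypothetical helper, not the actual UI code):

```python
def should_show_spill(memory_spilled, disk_spilled):
    # Before: memory_spilled > 0 and disk_spilled > 0 -- a memory-only or
    # disk-only spill was hidden. After: either one makes the columns show.
    return memory_spilled > 0 or disk_spilled > 0


assert should_show_spill(1024, 0)      # memory-only spill is now visible
assert should_show_spill(0, 2048)      # disk-only spill too
assert not should_show_spill(0, 0)
```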

Author: Andrew Ash <andrew@andrewash.com>

Closes #19164 from ash211/patch-3.

(cherry picked from commit 6308c65f08b507408033da1f1658144ea8c1491f)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
5 months ago[SPARK-21985][PYSPARK] PairDeserializer is broken for double-zipped RDDs
Andrew Ray [Sun, 17 Sep 2017 17:46:27 +0000 (02:46 +0900)] 
[SPARK-21985][PYSPARK] PairDeserializer is broken for double-zipped RDDs

## What changes were proposed in this pull request?
(edited)
Fixes a bug introduced in #16121

In PairDeserializer, convert each batch of keys and values to lists (if they do not have `__len__` already) so that we can check that they are the same size. Normally they are already lists, so this should not have a performance impact, but it is needed when repeated `zip`s are done.
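A simplified sketch of the length check (hypothetical function name; not the actual serializer code): iterators produced by a double zip lack `__len__`, so they must be materialized before comparison.

```python
def load_pairs(key_batch, val_batch):
    keys = key_batch if hasattr(key_batch, "__len__") else list(key_batch)
    vals = val_batch if hasattr(val_batch, "__len__") else list(val_batch)
    if len(keys) != len(vals):
        raise ValueError(
            "Can not deserialize PairRDD with different number of items "
            "in batches: (%d, %d)" % (len(keys), len(vals)))
    return list(zip(keys, vals))


# Iterators (as produced by repeated zips) have no __len__ but still work:
assert load_pairs(iter([1, 2]), iter(["a", "b"])) == [(1, "a"), (2, "b")]
```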

## How was this patch tested?

Additional unit test

Author: Andrew Ray <ray.andrew@gmail.com>

Closes #19226 from aray/SPARK-21985.

(cherry picked from commit 6adf67dd14b0ece342bb91adf800df0a7101e038)
Signed-off-by: hyukjinkwon <gurwls223@gmail.com>
5 months ago[SPARK-18608][ML][FOLLOWUP] Fix double caching for PySpark OneVsRest.
Yanbo Liang [Thu, 14 Sep 2017 06:09:44 +0000 (14:09 +0800)] 
[SPARK-18608][ML][FOLLOWUP] Fix double caching for PySpark OneVsRest.

## What changes were proposed in this pull request?
#19197 fixed double caching for MLlib algorithms, but missed PySpark ```OneVsRest```, this PR fixed it.

## How was this patch tested?
Existing tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #19220 from yanboliang/SPARK-18608.

(cherry picked from commit c76153cc7dd25b8de5266fe119095066be7f78f5)
Signed-off-by: Yanbo Liang <ybliang8@gmail.com>
5 months ago[SPARK-21980][SQL] References in grouping functions should be indexed with semanticEquals
donnyzone [Wed, 13 Sep 2017 17:06:53 +0000 (10:06 -0700)] 
[SPARK-21980][SQL] References in grouping functions should be indexed with semanticEquals

## What changes were proposed in this pull request?

https://issues.apache.org/jira/browse/SPARK-21980

This PR fixes the issue in ResolveGroupingAnalytics rule, which indexes the column references in grouping functions without considering case sensitive configurations.

The problem can be reproduced by:

```scala
val df = spark.createDataFrame(Seq((1, 1), (2, 1), (2, 2))).toDF("a", "b")
df.cube("a").agg(grouping("A")).show()
```

## How was this patch tested?
unit tests

Author: donnyzone <wellfengzhu@gmail.com>

Closes #19202 from DonnyZone/ResolveGroupingAnalytics.

(cherry picked from commit 21c4450fb24635fab6481a3756fefa9c6f6d6235)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
5 months ago[SPARK-18608][ML] Fix double caching
Zheng RuiFeng [Tue, 12 Sep 2017 18:37:05 +0000 (11:37 -0700)] 
[SPARK-18608][ML] Fix double caching

## What changes were proposed in this pull request?
`df.rdd.getStorageLevel` => `df.storageLevel`

Used the command `find . -name '*.scala' | xargs -i bash -c 'egrep -in "\.rdd\.getStorageLevel" {} && echo {}'` to make sure all algorithms involved in this issue are fixed.

Previous discussion in other PRs: https://github.com/apache/spark/pull/19107, https://github.com/apache/spark/pull/17014

## How was this patch tested?
existing tests

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #19197 from zhengruifeng/double_caching.

(cherry picked from commit c5f9b89dda40ffaa4622a7ba2b3d0605dbe815c0)
Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
5 months ago[DOCS] Fix unreachable links in the document
Kousuke Saruta [Tue, 12 Sep 2017 14:07:04 +0000 (15:07 +0100)] 
[DOCS] Fix unreachable links in the document

## What changes were proposed in this pull request?

Recently, I found two unreachable links in the document and fixed them.
Because these are small changes related to the documentation, I didn't file this issue in JIRA, but please let me know if you think it's needed.

## How was this patch tested?

Tested manually.

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #19195 from sarutak/fix-unreachable-link.

(cherry picked from commit 957558235b7537c706c6ab4779655aa57838ebac)
Signed-off-by: Sean Owen <sowen@cloudera.com>
5 months ago[SPARK-21976][DOC] Fix wrong documentation for Mean Absolute Error.
FavioVazquez [Tue, 12 Sep 2017 09:33:35 +0000 (10:33 +0100)] 
[SPARK-21976][DOC] Fix wrong documentation for Mean Absolute Error.

## What changes were proposed in this pull request?

Fixed wrong documentation for Mean Absolute Error.

Even though the code is correct for the MAE:

```scala
@Since("1.2.0")
def meanAbsoluteError: Double = {
  summary.normL1(1) / summary.count
}
```
In the documentation the division by N is missing.
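For reference, the formula with the division by N made explicit, as a short sketch (plain Python, not the MLlib implementation):

```python
def mean_absolute_error(predictions, labels):
    # MAE = (1/N) * sum(|prediction_i - label_i|)  -- i.e. normL1 / count
    errors = [abs(p - l) for p, l in zip(predictions, labels)]
    return sum(errors) / len(errors)


# |0| + |1| + |2| = 3, divided by N = 3:
assert mean_absolute_error([1.0, 2.0, 4.0], [1.0, 3.0, 2.0]) == 1.0
```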

## How was this patch tested?

All of spark tests were run.

Author: FavioVazquez <favio.vazquezp@gmail.com>
Author: faviovazquez <favio.vazquezp@gmail.com>
Author: Favio André Vázquez <favio.vazquezp@gmail.com>

Closes #19190 from FavioVazquez/mae-fix.

(cherry picked from commit e2ac2f1c71a0f8b03743d0d916dc0ef28482a393)
Signed-off-by: Sean Owen <sowen@cloudera.com>
5 months ago[SPARK-20098][PYSPARK] dataType's typeName fix
Peter Szalai [Sun, 10 Sep 2017 08:47:45 +0000 (17:47 +0900)] 
[SPARK-20098][PYSPARK] dataType's typeName fix

## What changes were proposed in this pull request?
The `typeName` classmethod has been fixed by using a type -> typeName map.

## How was this patch tested?
local build

Author: Peter Szalai <szalaipeti.vagyok@gmail.com>

Closes #17435 from szalai1/datatype-gettype-fix.

(cherry picked from commit 520d92a191c3148498087d751aeeddd683055622)
Signed-off-by: hyukjinkwon <gurwls223@gmail.com>
5 months ago[SPARK-21954][SQL] JacksonUtils should verify MapType's value type instead of key...
Liang-Chi Hsieh [Sat, 9 Sep 2017 10:10:52 +0000 (19:10 +0900)] 
[SPARK-21954][SQL] JacksonUtils should verify MapType's value type instead of key type

## What changes were proposed in this pull request?

`JacksonUtils.verifySchema` verifies if a data type can be converted to JSON. For `MapType`, it now verifies the key type. However, in `JacksonGenerator`, when converting a map to JSON, we only care about its values and create a writer for the values. The keys in a map are treated as strings by calling `toString` on the keys.

Thus, we should change `JacksonUtils.verifySchema` to verify the value type of `MapType`.
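Plain Python's `json` module shows the same behavior the commit relies on: JSON object keys are always strings, so only a map's values need type verification.

```python
import json

# An int key is coerced to a string; the values keep their JSON types.
assert json.dumps({1: [2, 3]}) == '{"1": [2, 3]}'
```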

## How was this patch tested?

Added tests.

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #19167 from viirya/test-jacksonutils.

(cherry picked from commit 6b45d7e941eba8a36be26116787322d9e3ae25d0)
Signed-off-by: hyukjinkwon <gurwls223@gmail.com>
5 months ago[SPARK-21128][R][BACKPORT-2.2] Remove both "spark-warehouse" and "metastore_db" befor...
hyukjinkwon [Fri, 8 Sep 2017 16:47:45 +0000 (09:47 -0700)] 
[SPARK-21128][R][BACKPORT-2.2] Remove both "spark-warehouse" and "metastore_db" before listing files in R tests

## What changes were proposed in this pull request?

This PR proposes to list the files in the test _after_ removing both "spark-warehouse" and "metastore_db", so that the next run of the R tests passes; hitting this failure on reruns is sometimes a bit annoying.

## How was this patch tested?

Manually running multiple times R tests via `./R/run-tests.sh`.

**Before**

Second run:

```
SparkSQL functions: Spark package found in SPARK_HOME: .../spark
...............................................................................................................................................................
...............................................................................................................................................................
...............................................................................................................................................................
...............................................................................................................................................................
...............................................................................................................................................................
....................................................................................................1234.......................

Failed -------------------------------------------------------------------------
1. Failure: No extra files are created in SPARK_HOME by starting session and making calls (test_sparkSQL.R#3384)
length(list1) not equal to length(list2).
1/1 mismatches
[1] 25 - 23 == 2

2. Failure: No extra files are created in SPARK_HOME by starting session and making calls (test_sparkSQL.R#3384)
sort(list1, na.last = TRUE) not equal to sort(list2, na.last = TRUE).
10/25 mismatches
x[16]: "metastore_db"
y[16]: "pkg"

x[17]: "pkg"
y[17]: "R"

x[18]: "R"
y[18]: "README.md"

x[19]: "README.md"
y[19]: "run-tests.sh"

x[20]: "run-tests.sh"
y[20]: "SparkR_2.2.0.tar.gz"

x[21]: "metastore_db"
y[21]: "pkg"

x[22]: "pkg"
y[22]: "R"

x[23]: "R"
y[23]: "README.md"

x[24]: "README.md"
y[24]: "run-tests.sh"

x[25]: "run-tests.sh"
y[25]: "SparkR_2.2.0.tar.gz"

3. Failure: No extra files are created in SPARK_HOME by starting session and making calls (test_sparkSQL.R#3388)
length(list1) not equal to length(list2).
1/1 mismatches
[1] 25 - 23 == 2

4. Failure: No extra files are created in SPARK_HOME by starting session and making calls (test_sparkSQL.R#3388)
sort(list1, na.last = TRUE) not equal to sort(list2, na.last = TRUE).
10/25 mismatches
x[16]: "metastore_db"
y[16]: "pkg"

x[17]: "pkg"
y[17]: "R"

x[18]: "R"
y[18]: "README.md"

x[19]: "README.md"
y[19]: "run-tests.sh"

x[20]: "run-tests.sh"
y[20]: "SparkR_2.2.0.tar.gz"

x[21]: "metastore_db"
y[21]: "pkg"

x[22]: "pkg"
y[22]: "R"

x[23]: "R"
y[23]: "README.md"

x[24]: "README.md"
y[24]: "run-tests.sh"

x[25]: "run-tests.sh"
y[25]: "SparkR_2.2.0.tar.gz"

DONE ===========================================================================
```

**After**

Second run:

```
SparkSQL functions: Spark package found in SPARK_HOME: .../spark
...............................................................................................................................................................
...............................................................................................................................................................
...............................................................................................................................................................
...............................................................................................................................................................
...............................................................................................................................................................
...............................................................................................................................
```

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #18335 from HyukjinKwon/SPARK-21128.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #19166 from felixcheung/rbackport21128.

5 months ago[SPARK-21946][TEST] fix flaky test: "alter table: rename cached table" in InMemoryCat...
Kazuaki Ishizaki [Fri, 8 Sep 2017 16:39:20 +0000 (09:39 -0700)] 
[SPARK-21946][TEST] fix flaky test: "alter table: rename cached table" in InMemoryCatalogedDDLSuite

## What changes were proposed in this pull request?

This PR fixes flaky test `InMemoryCatalogedDDLSuite "alter table: rename cached table"`.
Since this test validates a distributed DataFrame, the result should be checked by using `checkAnswer`. The original version used the `df.collect().Seq` method, which does not guarantee the order of the elements in the result.
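The `checkAnswer` idea can be sketched in Python (hypothetical helper; the real one lives in Spark's Scala test utilities): compare distributed results as multisets, never relying on collection order.

```python
def check_answer(actual_rows, expected_rows):
    # Order-insensitive comparison: partitions may return rows in any order.
    assert sorted(actual_rows) == sorted(expected_rows), \
        (actual_rows, expected_rows)


# Passes regardless of which order the rows happened to come back in:
check_answer([("b", 2), ("a", 1)], [("a", 1), ("b", 2)])
```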

## How was this patch tested?

Use existing test case

Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>

Closes #19159 from kiszk/SPARK-21946.

(cherry picked from commit 8a4f228dc0afed7992695486ecab6bc522f1e392)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
5 months ago[SPARK-21936][SQL][2.2] backward compatibility test framework for HiveExternalCatalog
Wenchen Fan [Fri, 8 Sep 2017 16:35:41 +0000 (09:35 -0700)] 
[SPARK-21936][SQL][2.2] backward compatibility test framework for HiveExternalCatalog

backport https://github.com/apache/spark/pull/19148 to 2.2

Author: Wenchen Fan <wenchen@databricks.com>

Closes #19163 from cloud-fan/test.

5 months ago[SPARK-21915][ML][PYSPARK] Model 1 and Model 2 ParamMaps Missing
MarkTab marktab.net [Fri, 8 Sep 2017 07:08:09 +0000 (08:08 +0100)] 
[SPARK-21915][ML][PYSPARK] Model 1 and Model 2 ParamMaps Missing

dongjoon-hyun HyukjinKwon

Error in PySpark example code:
/examples/src/main/python/ml/estimator_transformer_param_example.py

The original Scala code says
`println("Model 2 was fit using parameters: " + model2.parent.extractParamMap)`

The parent is lr

In PySpark, there is no method for accessing the parent as is done in Scala.

This code has been tested in Python and returns values consistent with Scala.

## What changes were proposed in this pull request?

Proposing to call the lr variable instead of model1 or model2

## How was this patch tested?

This patch was tested with Spark 2.1.0 by comparing the Scala and PySpark results. At present, PySpark returns nothing for those two print lines.

The output for model2 in PySpark should be

{Param(parent='LogisticRegression_4187be538f744d5a9090', name='tol', doc='the convergence tolerance for iterative algorithms (>= 0).'): 1e-06,
Param(parent='LogisticRegression_4187be538f744d5a9090', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.0,
Param(parent='LogisticRegression_4187be538f744d5a9090', name='predictionCol', doc='prediction column name.'): 'prediction',
Param(parent='LogisticRegression_4187be538f744d5a9090', name='featuresCol', doc='features column name.'): 'features',
Param(parent='LogisticRegression_4187be538f744d5a9090', name='labelCol', doc='label column name.'): 'label',
Param(parent='LogisticRegression_4187be538f744d5a9090', name='probabilityCol', doc='Column name for predicted class conditional probabilities. Note: Not all models output well-calibrated probability estimates! These probabilities should be treated as confidences, not precise probabilities.'): 'myProbability',
Param(parent='LogisticRegression_4187be538f744d5a9090', name='rawPredictionCol', doc='raw prediction (a.k.a. confidence) column name.'): 'rawPrediction',
Param(parent='LogisticRegression_4187be538f744d5a9090', name='family', doc='The name of family which is a description of the label distribution to be used in the model. Supported options: auto, binomial, multinomial'): 'auto',
Param(parent='LogisticRegression_4187be538f744d5a9090', name='fitIntercept', doc='whether to fit an intercept term.'): True,
Param(parent='LogisticRegression_4187be538f744d5a9090', name='threshold', doc='Threshold in binary classification prediction, in range [0, 1]. If threshold and thresholds are both set, they must match.e.g. if threshold is p, then thresholds must be equal to [1-p, p].'): 0.55,
Param(parent='LogisticRegression_4187be538f744d5a9090', name='aggregationDepth', doc='suggested depth for treeAggregate (>= 2).'): 2,
Param(parent='LogisticRegression_4187be538f744d5a9090', name='maxIter', doc='max number of iterations (>= 0).'): 30,
Param(parent='LogisticRegression_4187be538f744d5a9090', name='regParam', doc='regularization parameter (>= 0).'): 0.1,
Param(parent='LogisticRegression_4187be538f744d5a9090', name='standardization', doc='whether to standardize the training features before fitting the model.'): True}
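The gap can be illustrated with a minimal stand-in (the classes here are hypothetical, not the PySpark API): unlike Scala's `Model`, the fitted object in this sketch carries no `parent` link back to its estimator, so the example script must go through the `lr` variable directly.

```python
class TinyEstimator:
    """Minimal stand-in for an estimator such as LogisticRegression."""
    def __init__(self, **params):
        self._params = dict(params)

    def extractParamMap(self):
        return dict(self._params)

    def fit(self):
        return TinyModel()  # the returned model keeps no .parent link

class TinyModel:
    pass  # unlike Scala's Model, no parent accessor here

lr = TinyEstimator(maxIter=30, regParam=0.1, threshold=0.55)
model2 = lr.fit()

# Scala: model2.parent.extractParamMap
# PySpark fix per this patch: go through the estimator variable directly.
print("Model 2 was fit using parameters:", lr.extractParamMap())
```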

Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: MarkTab marktab.net <marktab@users.noreply.github.com>

Closes #19152 from marktab/branch-2.2.

5 months ago[SPARK-21950][SQL][PYTHON][TEST] pyspark.sql.tests.SQLTests2 should stop SparkContext.
Takuya UESHIN [Fri, 8 Sep 2017 05:26:07 +0000 (14:26 +0900)] 
[SPARK-21950][SQL][PYTHON][TEST] pyspark.sql.tests.SQLTests2 should stop SparkContext.

## What changes were proposed in this pull request?

`pyspark.sql.tests.SQLTests2` doesn't stop the newly created SparkContext in the test, which might affect the following tests.
This PR makes `pyspark.sql.tests.SQLTests2` stop the `SparkContext`.
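The pattern behind the fix can be sketched with a stand-in for `SparkContext` (a hypothetical class; the real test uses pyspark): stop the newly created context unconditionally so later tests start from a clean slate.

```python
class FakeContext:
    """Stand-in for SparkContext; only tracks whether stop() ran."""
    def __init__(self):
        self.stopped = False

    def stop(self):
        self.stopped = True

def run_isolated_test():
    sc = FakeContext()           # the test creates its own context ...
    try:
        assert not sc.stopped    # ... runs its assertions against it ...
    finally:
        sc.stop()                # ... and always stops it, even on failure.
    return sc

assert run_isolated_test().stopped
```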

## How was this patch tested?

Existing tests.

Author: Takuya UESHIN <ueshin@databricks.com>

Closes #19158 from ueshin/issues/SPARK-21950.

(cherry picked from commit 57bc1e9eb452284cbed090dbd5008eb2062f1b36)
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
5 months ago[SPARK-21890] Credentials not being passed to add the tokens
Sanket Chintapalli [Thu, 7 Sep 2017 17:20:39 +0000 (10:20 -0700)] 
[SPARK-21890] Credentials not being passed to add the tokens

## What changes were proposed in this pull request?
I observed this while running an Oozie job trying to connect to HBase via Spark.
It looks like the creds are not being passed in https://github.com/apache/spark/blob/branch-2.2/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/security/HadoopFSCredentialProvider.scala#L53 for the 2.2 release.
More info on why it fails on a secure grid:
The Oozie client gets the necessary tokens the application needs before launching. It passes those tokens along to the Oozie launcher job (an MR job), which then actually calls the Spark client to launch the Spark app and passes the tokens along.
The Oozie launcher job cannot get any more tokens because all it has is tokens (you can't get tokens with tokens; you need a TGT or keytab).
The error here is because the launcher job runs the Spark client to submit the Spark job, but the Spark client doesn't see that it already has the HDFS tokens, so it tries to get more, which ends with the exception.
There was a change with SPARK-19021 to generalize the HDFS credentials provider; it changed things so that we don't pass the existing credentials into the call to get tokens, and the provider therefore doesn't realize it already has the necessary tokens.

https://issues.apache.org/jira/browse/SPARK-21890
Modified to pass creds to get delegation tokens
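The failure mode can be sketched abstractly (names here are hypothetical, not the YARN provider API): if the existing credentials are not passed in, every needed token looks missing, and fetching is impossible with tokens alone.

```python
def obtain_delegation_tokens(needed, existing, can_authenticate):
    # Tokens the caller does not already hold. Before the fix, `existing`
    # was effectively empty at this point, so everything appeared missing.
    missing = set(needed) - set(existing)
    if missing and not can_authenticate:
        # Mirrors the Oozie case: only tokens on hand, no TGT or keytab.
        raise RuntimeError("cannot fetch new tokens without a TGT or keytab")
    return set(existing) | missing

# With the existing creds passed in, nothing is missing and no fetch is needed.
assert obtain_delegation_tokens({"hdfs"}, {"hdfs"}, can_authenticate=False) == {"hdfs"}
```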

## How was this patch tested?
Manual testing on our secure cluster

Author: Sanket Chintapalli <schintap@yahoo-inc.com>

Closes #19103 from redsanket/SPARK-21890.

5 months agoFixed pandoc dependency issue in python/setup.py
Tucker Beck [Thu, 7 Sep 2017 00:38:00 +0000 (09:38 +0900)] 
Fixed pandoc dependency issue in python/setup.py

## Problem Description

When pyspark is listed as a dependency of another package, installing
the other package can cause pyspark's install to fail. While the
other package is being installed, pyspark's setup_requires requirements
are installed, including pypandoc. The exception handling on
setup.py:152 therefore does not work, because the pypandoc module is indeed
available. However, the pypandoc.convert() function fails if pandoc
itself is not installed (in our use cases it is not). This raises an
OSError that is not handled, and setup fails.

The following is a sample failure:
```
$ which pandoc
$ pip freeze | grep pypandoc
pypandoc==1.4
$ pip install pyspark
Collecting pyspark
  Downloading pyspark-2.2.0.post0.tar.gz (188.3MB)
    100% |████████████████████████████████| 188.3MB 16.8MB/s
    Complete output from command python setup.py egg_info:
    Maybe try:

        sudo apt-get install pandoc
    See http://johnmacfarlane.net/pandoc/installing.html
    for installation options
    ---------------------------------------------------------------

    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-build-mfnizcwa/pyspark/setup.py", line 151, in <module>
        long_description = pypandoc.convert('README.md', 'rst')
      File "/home/tbeck/.virtualenvs/cem/lib/python3.5/site-packages/pypandoc/__init__.py", line 69, in convert
        outputfile=outputfile, filters=filters)
      File "/home/tbeck/.virtualenvs/cem/lib/python3.5/site-packages/pypandoc/__init__.py", line 260, in _convert_input
        _ensure_pandoc_path()
      File "/home/tbeck/.virtualenvs/cem/lib/python3.5/site-packages/pypandoc/__init__.py", line 544, in _ensure_pandoc_path
        raise OSError("No pandoc was found: either install pandoc and add it\n"
    OSError: No pandoc was found: either install pandoc and add it
    to your PATH or or call pypandoc.download_pandoc(...) or
    install pypandoc wheels with included pandoc.

    ----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-build-mfnizcwa/pyspark/
```

## What changes were proposed in this pull request?

This change simply adds an additional exception handler for the OSError
that is raised. This allows pyspark to be installed client-side without requiring pandoc to be installed.
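The shape of the fix can be sketched as below; `build_long_description` is a hypothetical wrapper (the real change lives inline in setup.py). In setup.py, `import pypandoc` can raise ImportError, while `pypandoc.convert()` raises OSError when the pandoc binary is absent; this sketch folds both into one fallback path.

```python
def build_long_description(convert, fallback="See README.md"):
    # Treat a missing pandoc binary (OSError) the same as a missing
    # pypandoc module (ImportError): fall back instead of aborting install.
    try:
        return convert('README.md', 'rst')
    except (ImportError, OSError):
        return fallback

def convert_without_pandoc(source, to_format):
    # Simulates pypandoc.convert() when the pandoc binary is not on PATH.
    raise OSError("No pandoc was found")

assert build_long_description(convert_without_pandoc) == "See README.md"
```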

## How was this patch tested?

I tested this by building a wheel package of pyspark with the change applied. Then, in a clean virtual environment with pypandoc installed but pandoc not available on the system, I installed pyspark from the wheel.

Here is the output

```
$ pip freeze | grep pypandoc
pypandoc==1.4
$ which pandoc
$ pip install --no-cache-dir ../spark/python/dist/pyspark-2.3.0.dev0-py2.py3-none-any.whl
Processing /home/tbeck/work/spark/python/dist/pyspark-2.3.0.dev0-py2.py3-none-any.whl
Requirement already satisfied: py4j==0.10.6 in /home/tbeck/.virtualenvs/cem/lib/python3.5/site-packages (from pyspark==2.3.0.dev0)
Installing collected packages: pyspark
Successfully installed pyspark-2.3.0.dev0
```

Author: Tucker Beck <tucker.beck@rentrakmail.com>

Closes #18981 from dusktreader/dusktreader/fix-pandoc-dependency-issue-in-setup_py.

(cherry picked from commit aad2125475dcdeb4a0410392b6706511db17bac4)
Signed-off-by: hyukjinkwon <gurwls223@gmail.com>