incubator-hivemall.git
3 weeks agoMerge pull request #250 from CalvinKirs/patch-1 master
Gav [Tue, 6 Sep 2022 15:55:16 +0000 (17:55 +0200)] 
Merge pull request #250 from CalvinKirs/patch-1

Create RETIRED.txt

4 weeks agoCreate RETIRED.txt 250/head
Kirs [Fri, 2 Sep 2022 03:22:22 +0000 (11:22 +0800)] 
Create RETIRED.txt

10 months ago[HIVEMALL-317] Update documentation about Amazon EMR
boomkim [Sun, 28 Nov 2021 04:08:19 +0000 (13:08 +0900)] 
[HIVEMALL-317] Update documentation about Amazon EMR

## What changes were proposed in this pull request?

Update documentation about Amazon EMR.

Just a small change to make the bootstrap script work.
The previous script had a dead link.

## What type of PR is it?

Documentation

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-317

## How was this patch tested?

Document update. Test not needed.

Author: boomkim <bhk3177@gmail.com>

Closes #246 from boomkim/emr_docs.

15 months ago[HIVEMALL-316] Improve error message for duplicate entries error in Tokenizer user...
Makoto Yui [Fri, 2 Jul 2021 06:15:20 +0000 (15:15 +0900)] 
[HIVEMALL-316] Improve error message for duplicate entries error in Tokenizer user dictionary

## What changes were proposed in this pull request?

Improve error message for duplicate entries error in Tokenizer user dictionary

## What type of PR is it?

Improvement

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-316

## Checklist

(Please remove this section if not needed; check `x` for YES, blank for NO)

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [ ] Did you run system tests on Hive (or Spark)?

Author: Makoto Yui <myui@apache.org>

Closes #245 from myui/HIVEMALL-316.

16 months ago[HIVEMALL-314] fixed Spark DDLs
Makoto Yui [Fri, 14 May 2021 18:12:19 +0000 (03:12 +0900)] 
[HIVEMALL-314] fixed Spark DDLs

## What changes were proposed in this pull request?

fixed Spark DDLs

## What type of PR is it?

Bug Fix

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-314

## How was this patch tested?

manual tests

## Checklist

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [x] Did you run system tests on Hive (or Spark)?

Author: Makoto Yui <myui@apache.org>

Closes #244 from myui/HIVEMALL-314-fix-spark-ddls.

16 months agoFixed links
Makoto Yui [Fri, 14 May 2021 03:35:11 +0000 (12:35 +0900)] 
Fixed links

16 months ago[HIVEMALL-307][DOC] Update tokenize_ko examples
Makoto Yui [Fri, 14 May 2021 03:25:13 +0000 (12:25 +0900)] 
[HIVEMALL-307][DOC] Update tokenize_ko examples

## What changes were proposed in this pull request?

Update tokenize_ko examples

## What type of PR is it?

Documentation

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-307

Author: Makoto Yui <myui@apache.org>

Closes #243 from myui/update_tokenize_ko_example.

16 months ago[HIVEMALL-312] Changed default constructor accessor to public for Spark
Makoto Yui [Thu, 6 May 2021 08:18:22 +0000 (17:18 +0900)] 
[HIVEMALL-312] Changed default constructor accessor to public for Spark

## What changes were proposed in this pull request?

Changed default constructor accessor to public for Spark

```
Exception in thread "main" org.apache.spark.sql.AnalysisException: No handler for UDF/UDAF/UDTF 'hivemall.factorization.fm.FMPredictGenericUDAF': java.lang.IllegalAccessException: Class org.apache.spark.sql.hive.HiveShim$HiveFunctionWrapper can not access a member of class hivemall.factorization.fm.FMPredictGenericUDAF with modifiers "private"; line 6 pos 0
        at sun.reflect.Reflection.ensureMemberAccess(Reflection.java:102)
        at java.lang.Class.newInstance(Class.java:436)
        at org.apache.spark.sql.hive.HiveShim$HiveFunctionWrapper.createFunction(HiveShim.scala:220)
        at org.apache.spark.sql.hive.HiveUDAFFunction.newEvaluator(hiveUDFs.scala:343)
        at org.apache.spark.sql.hive.HiveUDAFFunction.org$apache$spark$sql$hive$HiveUDAFFunction$$finalHiveEvaluator$lzycompute(hiveUDFs.scala:366)
        at org.apache.spark.sql.hive.HiveUDAFFunction.org$apache$spark$sql$hive$HiveUDAFFunction$$finalHiveEvaluator(hiveUDFs.scala:365)
        at org.apache.spark.sql.hive.HiveUDAFFunction.dataType$lzycompute(hiveUDFs.scala:394)
        at org.apache.spark.sql.hive.HiveUDAFFunction.dataType(hiveUDFs.scala:394)
        at org.apache.spark.sql.hive.HiveSessionCatalog$$anonfun$makeFunctionExpression$1$$anonfun$apply$2.apply(HiveSessionCatalog.scala:85)
        at org.apache.spark.sql.hive.HiveSessionCatalog$$anonfun$makeFunctionExpression$1$$anonfun$apply$2.apply(HiveSessionCatalog.scala:71)
        at scala.util.Try.getOrElse(Try.scala:79)
        at org.apache.spark.sql.hive.HiveSessionCatalog$$anonfun$makeFunctionExpression$1.apply(HiveSessionCatalog.scala:71)
        at org.apache.spark.sql.hive.HiveSessionCatalog$$anonfun$makeFunctionExpression$1.apply(HiveSessionCatalog.scala:71)
        at ...
```

## What type of PR is it?

Bug Fix, Hot fix

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-312

## How was this patch tested?

unit tests

Author: Makoto Yui <myui@apache.org>

Closes #242 from myui/HIVEMALL-312.

17 months ago[HIVEMALL-311] Upgrade Kryo version from 2.21 to 2.24.0
Makoto Yui [Sun, 2 May 2021 06:55:46 +0000 (15:55 +0900)] 
[HIVEMALL-311] Upgrade Kryo version from 2.21 to 2.24.0

## What changes were proposed in this pull request?

The xgboost4j and xgboost modules used Kryo 2.21, which has a bug in serializing generic collections, so this PR updates Kryo to 2.24.0 as a precaution.

## What type of PR is it?

Bug Fix

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-311

## How was this patch tested?

unit tests

## Checklist

(Please remove this section if not needed; check `x` for YES, blank for NO)

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [ ] Did you run system tests on Hive (or Spark)?

Author: Makoto Yui <myui@apache.org>

Closes #241 from myui/kryo_update.

17 months ago[HIVEMALL-310] Remove old release artifacts and unlink them
Makoto Yui [Wed, 28 Apr 2021 02:43:37 +0000 (11:43 +0900)] 
[HIVEMALL-310] Remove old release artifacts and unlink them

## What changes were proposed in this pull request?

Update links for old release artifacts

## What type of PR is it?

Documentation

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-310

Author: Makoto Yui <myui@apache.org>

Closes #240 from myui/HIVEMALL-310.

17 months ago[HIVEMALL-308] Relocate kryo packages in shaded jar
Makoto Yui [Fri, 23 Apr 2021 12:22:30 +0000 (21:22 +0900)] 
[HIVEMALL-308] Relocate kryo packages in shaded jar

## What changes were proposed in this pull request?

Relocate Kryo packages in fat jar to avoid conflicts

## What type of PR is it?

Hot Fix

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-308

## How was this patch tested?

manual tests on EMR

## Checklist

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [x] Did you run system tests on Hive (or Spark)?

Author: Makoto Yui <myui@apache.org>

Closes #239 from myui/relocate_kryo.

17 months ago[HIVEMALL-309] Enhance tokenize_ko to support stopwords and external user dict
Makoto Yui [Fri, 23 Apr 2021 10:17:14 +0000 (19:17 +0900)] 
[HIVEMALL-309] Enhance tokenize_ko to support stopwords and external user dict

## What changes were proposed in this pull request?

Enhance tokenize_ko to support stopwords and external user dict

## What type of PR is it?

Improvement

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-309

## How was this patch tested?

unit tests, manual tests on EMR

## How to use this feature?

```sql
-- default stopwords (null), default stoptags (null), custom dict
select tokenize_ko('나는 C++ 언어를 프로그래밍 언어로 사랑한다.', '-mode discard', null, null, array('C++'));
> ["나","c++","언어","프로그래밍","언어","사랑"]

select tokenize_ko('나는 c++ 프로그래밍을 즐긴다.', '-mode discard', null, null, 'https://raw.githubusercontent.com/apache/lucene/main/lucene/analysis/nori/src/test/org/apache/lucene/analysis/ko/userdict.txt');

> ["나","c++","프로그래밍","즐기"]
```

## Checklist

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [x] Did you run system tests on Hive (or Spark)?

Author: Makoto Yui <myui@apache.org>

Closes #238 from myui/korean-enhancement.

17 months agoImplement Korean text tokenizer
Makoto Yui [Thu, 22 Apr 2021 14:53:10 +0000 (23:53 +0900)] 
Implement Korean text tokenizer

## What changes were proposed in this pull request?

Implement Korean text tokenizer

## What type of PR is it?

Feature

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-307

## How was this patch tested?

unit tests and manual tests on EMR

## How to use this feature?

```sql
-- show version of lucene-analyzers-nori
select tokenize_ko();
> 8.8.2

select tokenize_ko("소설 무궁화꽃이 피었습니다.");
> ["소설","무궁","화","꽃","피"]

select tokenize_ko("소설 무궁화꽃이 피었습니다.", null, "mixed");
> ["소설","무궁화","무궁","화","꽃","피"]

select tokenize_ko("소설 무궁화꽃이 피었습니다.", null, "discard", array("E", "VV"));
> ["소설","무궁","화","꽃","이"]

select tokenize_ko("Hello, world.", null, "none", array(), true);
> ["h","e","l","l","o","w","o","r","l","d"]

select tokenize_ko("Hello, world.", null, "none", array(), false);
> ["hello","world"]

select tokenize_ko("나는 C++ 언어를 프로그래밍 언어로 사랑한다.", null, "discard", array());
> ["나","는","c","언어","를","프로그래밍","언어","로","사랑","하","ᆫ다"]

select tokenize_ko("나는 C++ 언어를 프로그래밍 언어로 사랑한다.", array("C++"), "discard", array());
> ["나","는","c++","언어","를","프로그래밍","언어","로","사랑","하","ᆫ다"]
```

## Checklist

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [x] Did you run system tests on Hive (or Spark)?

Author: Makoto Yui <myui@apache.org>

Closes #237 from myui/korean_tokenizer.

17 months agoFixed gitbook build
Makoto Yui [Thu, 22 Apr 2021 12:24:33 +0000 (21:24 +0900)] 
Fixed gitbook build

## What changes were proposed in this pull request?

Fixed gitbook build

## What type of PR is it?

Documentation

Author: Makoto Yui <myui@apache.org>

Closes #236 from myui/fix_gitbook.

17 months ago[HIVEMALL-305] Kuromoji Japanese tokenizer with Neologd dictionary
Makoto Yui [Thu, 22 Apr 2021 03:39:21 +0000 (12:39 +0900)] 
[HIVEMALL-305] Kuromoji Japanese tokenizer with Neologd dictionary

## What changes were proposed in this pull request?

Add tokenize_ja_neologd UDF that uses Neologd dictionary for Kuromoji tokenization.

## What type of PR is it?

Feature

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-305

## How was this patch tested?

unit tests and manual tests on EMR

## How to use this feature?

```sql
-- tokenize_ja_neologd(text input, optional const text mode = "normal", optional const array<string> stopWords, const array<string> stopTags, const array<string> userDict)

select tokenize_ja_neologd("彼女はペンパイナッポーアッポーペンと恋ダンスを踊った。");
> ["彼女","ペンパイナッポーアッポーペン","恋ダンス","踊る"]
```

## Checklist

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [x] Did you run system tests on Hive (or Spark)?

Author: Makoto Yui <myui@apache.org>

Closes #235 from myui/neologd.

17 months ago[HIVEMALL-304] Updated lucene version from 5.5.5 (java7) to 8.8.2 (java8)
Makoto Yui [Mon, 19 Apr 2021 06:39:03 +0000 (15:39 +0900)] 
[HIVEMALL-304] Updated lucene version from 5.5.5 (java7) to 8.8.2 (java8)

## What changes were proposed in this pull request?

Updated lucene version from 5.5.5 (java7) to 8.8.2 (java8)

## What type of PR is it?

Improvement

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-304

## How was this patch tested?

unit tests

## How to use this feature?

## Checklist

(Please remove this section if not needed; check `x` for YES, blank for NO)

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [ ] Did you run system tests on Hive (or Spark)?

Author: Makoto Yui <myui@apache.org>

Closes #234 from myui/lucene_version_up.

17 months ago[HIVEMALL-303] Changed compilation target to Java 8
Makoto Yui [Thu, 15 Apr 2021 05:29:23 +0000 (14:29 +0900)] 
[HIVEMALL-303] Changed compilation target to Java 8

## What changes were proposed in this pull request?

Change compilation target to Java 8 from Java 7.

## What type of PR is it?

Improvement

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-303

## How was this patch tested?

unit tests

## Checklist

(Please remove this section if not needed; check `x` for YES, blank for NO)

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [ ] Did you run system tests on Hive (or Spark)?

Author: Makoto Yui <myui@apache.org>

Closes #233 from myui/HIVEMALL-303-java8.

18 months ago[HIVEMALL-301] Remove macros and replace them with UDF
Makoto Yui [Mon, 29 Mar 2021 07:42:58 +0000 (16:42 +0900)] 
[HIVEMALL-301] Remove macros and replace them with UDF

## What changes were proposed in this pull request?

Remove macros and replace them with UDF

## What type of PR is it?

Improvement, Refactoring

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-301

## How was this patch tested?

manual tests

## Checklist

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [x] Did you run system tests on Hive (or Spark)?

Author: Makoto Yui <myui@apache.org>

Closes #232 from myui/HIVEMALL-301-tfidf.

2 years agoRevised bagging doc entry
Makoto Yui [Thu, 27 Aug 2020 09:06:58 +0000 (18:06 +0900)] 
Revised bagging doc entry

2 years agoTrivial doc fix
Makoto Yui [Fri, 21 Aug 2020 09:54:15 +0000 (18:54 +0900)] 
Trivial doc fix

2 years agoAdded user guide entry for bagging classifiers
Makoto Yui [Fri, 21 Aug 2020 06:21:45 +0000 (15:21 +0900)] 
Added user guide entry for bagging classifiers

2 years ago[HIVEMALL-297] Fixed null element handling in feature vector
Makoto Yui [Thu, 6 Aug 2020 07:05:37 +0000 (16:05 +0900)] 
[HIVEMALL-297] Fixed null element handling in feature vector

## What changes were proposed in this pull request?

Fixed null element handling in feature vector

## What type of PR is it?

Bug Fix

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-297

## How was this patch tested?

unit tests

## Checklist

(Please remove this section if not needed; check `x` for YES, blank for NO)

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [ ] Did you run system tests on Hive (or Spark)?

Author: Makoto Yui <myui@apache.org>

Closes #231 from myui/HIVEMALL-297.

2 years ago[HIVEMALL-296][BUGFIX] Fixed corner case NPE bug when count is zero
Makoto Yui [Mon, 27 Jul 2020 06:36:48 +0000 (15:36 +0900)] 
[HIVEMALL-296][BUGFIX] Fixed corner case NPE bug when count is zero

## What changes were proposed in this pull request?

Fixed corner case NPE bug when count is zero.

```
Caused by: java.lang.NullPointerException
at hivemall.GeneralLearnerBaseUDTF.forwardModel(GeneralLearnerBaseUDTF.java:763)
at hivemall.GeneralLearnerBaseUDTF.close(GeneralLearnerBaseUDTF.java:560)
at org.apache.hadoop.hive.ql.exec.UDTFOperator.closeOp(UDTFOperator.java:152)
at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:697)
at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:711)
at org.apache.hadoop.hive.ql.exec.mr.ExecReducer.close(ExecReducer.java:279)
```

## What type of PR is it?

Bug Fix

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-296

## How was this patch tested?

unit tests

## Checklist

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [ ] Did you run system tests on Hive (or Spark)?

Author: Makoto Yui <myui@apache.org>

Closes #230 from myui/HIVEMALL-296.

2 years ago[HIVEMALL-295][BUGFIX] transpose_and_dot throws UDFArgumentException for 0 rows input
Makoto Yui [Thu, 4 Jun 2020 05:21:05 +0000 (14:21 +0900)] 
[HIVEMALL-295][BUGFIX] transpose_and_dot throws UDFArgumentException for 0 rows input

## What changes were proposed in this pull request?

Fix `transpose_and_dot`, which throws `UDFArgumentException` when the input has 0 rows.

```
WITH INPUT AS(
  SELECT
    ARRAY(1.0,2.0,3.0) AS X,
    ARRAY(0,1) AS Y
)
SELECT
  transpose_and_dot(Y,X) AS observed
FROM
  INPUT
WHERE false

Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: org.apache.hadoop.hive.ql.exec.UDFArgumentException
        at org.apache.hadoop.hive.ql.exec.GroupByOperator.closeOp(GroupByOperator.java:1126)
        at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:697)
        at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:711)
        at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:711)
        at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:711)
        at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:711)
        at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.close(MapRecordProcessor.java:464)
        ... 15 more
Caused by: org.apache.hadoop.hive.ql.exec.UDFArgumentException
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
        at java.lang.Class.newInstance(Class.java:442)
        at hivemall.utils.lang.Preconditions.checkNotNull(Preconditions.java:43)
        at hivemall.tools.matrix.TransposeAndDotUDAF$TransposeAndDotUDAFEvaluator.iterate(TransposeAndDotUDAF.java:172)
        at org.apache.hadoop.hive.ql.udf.generic.GenericUDAFEvaluator.aggregate(GenericUDAFEvaluator.java:192)
        at org.apache.hadoop.hive.ql.exec.GroupByOperator.closeOp(GroupByOperator.java:1117)
        ... 21 more
```
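
For reference, the aggregation involved can be modeled in a few lines of Python. This is only an illustrative sketch, not Hivemall's Java implementation: it assumes `transpose_and_dot(X, Y)` accumulates the outer product of each row's `X` and `Y` (that is, X^T·Y), and it chooses to return `None` when no rows arrive. The commit message does not say what the fixed UDAF returns in that case.

```python
def transpose_and_dot(rows):
    """Accumulate sum_i outer(x_i, y_i) over an iterable of (x, y) rows.

    Hypothetical model of the UDAF; returning None for empty input is an
    assumption of this sketch, not confirmed behavior.
    """
    acc = None
    for x, y in rows:
        if acc is None:
            # Allocate the len(x) by len(y) accumulator lazily, on the first row.
            acc = [[0.0] * len(y) for _ in x]
        for i, xi in enumerate(x):
            for j, yj in enumerate(y):
                acc[i][j] += xi * yj
    return acc


# Zero rows: nothing was accumulated, so there is no matrix to return.
print(transpose_and_dot([]))
# One row, with the same values as the query above (Y first, then X).
print(transpose_and_dot([([0, 1], [1.0, 2.0, 3.0])]))
```

Allocating the accumulator lazily is what makes the zero-row case a distinct code path: any code that assumes the accumulator exists at close time has to handle it explicitly.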

## What type of PR is it?

Bug Fix

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-295

## How was this patch tested?

manual tests on EMR

## Checklist

(Please remove this section if not needed; check `x` for YES, blank for NO)

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [x] Did you run system tests on Hive (or Spark)?

Author: Makoto Yui <myui@apache.org>

Closes #229 from myui/HIVEMALL-295.

2 years ago[HIVEMALL-294] Fix XGboost to report progress report for each iteration
Makoto Yui [Fri, 29 May 2020 07:42:14 +0000 (16:42 +0900)] 
[HIVEMALL-294] Fix XGboost to report progress report for each iteration

## What changes were proposed in this pull request?

Fix XGBoost to report progress for each iteration.

## What type of PR is it?

Improvement, Hot Fix

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-294

## How was this patch tested?

unit tests

## Checklist

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [ ] Did you run system tests on Hive (or Spark)?

Author: Makoto Yui <myui@apache.org>

Closes #228 from myui/HIVEMALL-294.

2 years ago[HIVEMALL-291] Fixed dedup behavior of to_ordered_list UDAF
Makoto Yui [Tue, 3 Mar 2020 06:44:04 +0000 (15:44 +0900)] 
[HIVEMALL-291] Fixed dedup behavior of to_ordered_list UDAF

## What changes were proposed in this pull request?

Fixed dedup behavior of to_ordered_list UDAF

## What type of PR is it?

Hot Fix

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-291

## How was this patch tested?

unit tests

## Checklist

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [x] Did you run system tests on Hive (or Spark)?

Author: Makoto Yui <myui@apache.org>

Closes #227 from myui/HIVEMALL-291-2.

2 years ago[HIVEMALL-291] Support deduplication in to_order_list UDAF
Makoto Yui [Thu, 23 Jan 2020 10:20:55 +0000 (19:20 +0900)] 
[HIVEMALL-291] Support deduplication in to_order_list UDAF

## What changes were proposed in this pull request?

Add -dedup option to to_ordered_list.

## What type of PR is it?

Improvement

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-291

## How was this patch tested?

manual tests on EMR

## How to use this feature?

```sql
WITH data as (
    SELECT 5 as key, 'apple' as value
    UNION ALL
    SELECT 3 as key, 'banana' as value
    UNION ALL
    SELECT 4 as key, 'candy' as value
    UNION ALL
    SELECT 1 as key, 'donut' as value
    UNION ALL
    SELECT 2 as key, 'egg' as value
    UNION ALL
    SELECT 4 as key, 'candy' as value -- both key and value duplicates
)
select
  to_ordered_list(value, key, '-k 4 -dedup -vk_map'),
  to_ordered_list(value, key, '-k 4 -vk_map'),
  to_ordered_list(value, key, '-k 4 -dedup'),
  to_ordered_list(value, key, '-k 4')
from
  data
```

> {"apple":5,"candy":4,"banana":3,"egg":2}        {"apple":5,"candy":4,"banana":3}        ["apple","candy","banana","egg"]        ["apple","candy","candy","banana"]

## Checklist

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [x] Did you run system tests on Hive (or Spark)?

Author: Makoto Yui <myui@apache.org>

Closes #226 from myui/HIVEMALL-291.
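
The option semantics shown in this change (`-k`, `-dedup`, `-vk_map`) can be modeled with a short Python sketch. It is an assumption-laden illustration rather than the UDAF's actual code: it treats `-k N` as "keep the N entries with the largest keys" and `-dedup` as "drop rows whose value and key are both repeated", which is what the example output suggests.

```python
def to_ordered_list(pairs, k, dedup=False, vk_map=False):
    """pairs is an iterable of (value, key); hypothetical model of the UDAF options."""
    items = list(pairs)
    if dedup:
        # dict.fromkeys keeps the first occurrence of each (value, key) pair.
        items = list(dict.fromkeys(items))
    # Stable sort by key, largest first, then keep the top k.
    top = sorted(items, key=lambda vk: vk[1], reverse=True)[:k]
    if vk_map:
        return {value: key for value, key in top}
    return [value for value, _ in top]


data = [("apple", 5), ("banana", 3), ("candy", 4),
        ("donut", 1), ("egg", 2), ("candy", 4)]
print(to_ordered_list(data, 4, dedup=True, vk_map=True))
print(to_ordered_list(data, 4, vk_map=True))
print(to_ordered_list(data, 4, dedup=True))
print(to_ordered_list(data, 4))
```

Without `-dedup`, the duplicate `candy` row takes one of the four slots, which is why the map variant ends up with only three entries.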

2 years agoFixed docs
Makoto Yui [Tue, 21 Jan 2020 09:25:37 +0000 (18:25 +0900)] 
Fixed docs

2 years ago[HIVEMALL-289] Add str_contain(string str, array<string> match, boolean or=true) UDF
Makoto Yui [Thu, 26 Dec 2019 07:48:02 +0000 (16:48 +0900)] 
[HIVEMALL-289] Add str_contain(string str, array<string> match, boolean or=true) UDF

## What changes were proposed in this pull request?

Add str_contains(string str, array<string> match, boolean or=true) UDF

## What type of PR is it?

Feature

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-289

## How was this patch tested?

manual tests on EMR

## How to use this feature?

```sql
select
  str_contains('There are apple and orange', array('apple')),
  str_contains('There are apple and orange', array('apple', 'banana'), true),
  str_contains('There are apple and orange', array('apple', 'banana'), false);
> true, true, false
```
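
The role of the third argument can be sketched in Python. This is an illustrative model, not the UDF's source: it assumes that `or=true` means "any of the strings occurs in `str`" and `or=false` means "all of them occur", which matches the three results above. The parameter is named `or_` here only because `or` is a Python keyword.

```python
def str_contains(text, matches, or_=True):
    """Hypothetical model: any-match when or_ is True, all-match otherwise."""
    hits = (m in text for m in matches)
    return any(hits) if or_ else all(hits)


s = "There are apple and orange"
print(str_contains(s, ["apple"]))
print(str_contains(s, ["apple", "banana"], True))
print(str_contains(s, ["apple", "banana"], False))
```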

## Checklist

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [x] Did you run system tests on Hive (or Spark)?

Author: Makoto Yui <myui@apache.org>

Closes #225 from myui/HIVEMALL-289.

2 years agoFixed docs for UDF preparation
Makoto Yui [Wed, 25 Dec 2019 09:30:06 +0000 (18:30 +0900)] 
Fixed docs for UDF preparation

2 years agoFixed a typo
Makoto Yui [Fri, 20 Dec 2019 15:45:22 +0000 (00:45 +0900)] 
Fixed a typo

2 years agoAdded ChangeLog
Makoto Yui [Fri, 20 Dec 2019 13:05:12 +0000 (22:05 +0900)] 
Added ChangeLog

2 years agoReplaced http with https and added verification procedure
Makoto Yui [Thu, 19 Dec 2019 12:31:55 +0000 (21:31 +0900)] 
Replaced http with https and added verification procedure

2 years agoFixed links in doc
Makoto Yui [Thu, 19 Dec 2019 10:18:04 +0000 (19:18 +0900)] 
Fixed links in doc

2 years agoUpdated download page
Makoto Yui [Thu, 19 Dec 2019 08:36:51 +0000 (17:36 +0900)] 
Updated download page

2 years agoMerge remote-tracking branch 'origin/v0.6.0'
Makoto Yui [Thu, 19 Dec 2019 08:06:44 +0000 (17:06 +0900)] 
Merge remote-tracking branch 'origin/v0.6.0'

2 years agoUpdated copyrights holders
Makoto Yui [Thu, 19 Dec 2019 05:18:25 +0000 (14:18 +0900)] 
Updated copyrights holders

2 years ago[HIVEMALL-288] mf_predict throws SemanticException No matching method with (array...
Makoto Yui [Thu, 12 Dec 2019 08:32:27 +0000 (17:32 +0900)] 
[HIVEMALL-288] mf_predict throws SemanticException No matching method with (array<double>, array<double>, int)

## What changes were proposed in this pull request?

Fix `mf_predict`, which throws `SemanticException: No matching method` when called with `(array<double>, array<double>, int)`.

## What type of PR is it?

Bug Fix

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-288

## How was this patch tested?

manual tests on EMR

```sql
select
  -- 3 arguments
  mf_predict(array(cast(1.0 as float),cast(2.0 as float),cast(3.0 as float)), array(cast(1.0 as float),cast(2.0 as float),cast(3.0 as float)), 1),
  mf_predict(array(1.0,2.0,3.0), array(1.0,2.0,3.0), 1),
  mf_predict(array(cast(1.0 as DOUBLE),cast(2.0 as DOUBLE),cast(3.0 as DOUBLE)), array(cast(1.0 as DOUBLE),cast(2.0 as DOUBLE),cast(3.0 as DOUBLE)), 1),
  -- 2 arguments
  mf_predict(array(1.0,2.0,3.0), array(1.0,2.0,3.0)),
  -- 4 arguments
  mf_predict(array(1.0,2.0,3.0), array(1.0,2.0,3.0), 0, 0),
  -- 5 arguments
  mf_predict(array(1.0,2.0,3.0), array(1.0,2.0,3.0), 0, 0, 1);
```
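
As background for the argument counts tested above, the standard biased matrix factorization prediction is the dot product of the user and item factor vectors plus optional bias terms. The Python sketch below illustrates that formula only. The keyword names `bu`, `bi`, and `mu` are assumptions of this example, and the commit does not state how the positional SQL arguments map onto them.

```python
def mf_predict(pu, qi, bu=0.0, bi=0.0, mu=0.0):
    """Biased MF prediction: mu + bu + bi + dot(pu, qi).

    Illustrative only; parameter names are hypothetical, not Hivemall's API.
    """
    dot = sum(p * q for p, q in zip(pu, qi))
    return mu + bu + bi + dot


print(mf_predict([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))          # factors only
print(mf_predict([1.0, 2.0, 3.0], [1.0, 2.0, 3.0], mu=1.0))  # plus a global mean
```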

## Checklist

(Please remove this section if not needed; check `x` for YES, blank for NO)

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [x] Did you run system tests on Hive (or Spark)?

Author: Makoto Yui <myui@apache.org>

Closes #224 from myui/HIVEMALL-288.

2 years agoUpdate date
Makoto Yui [Tue, 3 Dec 2019 06:21:35 +0000 (15:21 +0900)] 
Update date

2 years ago[DOC] update titanic random forest doc for decision_path
Makoto Yui [Mon, 2 Dec 2019 10:25:54 +0000 (19:25 +0900)] 
[DOC] update titanic random forest doc for decision_path

2 years agoFixed release guide
Makoto Yui [Thu, 28 Nov 2019 18:26:51 +0000 (03:26 +0900)] 
Fixed release guide

2 years ago[maven-release-plugin] prepare for next development iteration v0.6.0
Makoto Yui [Thu, 28 Nov 2019 16:43:53 +0000 (01:43 +0900)] 
[maven-release-plugin] prepare for next development iteration

2 years ago[maven-release-plugin] prepare release v0.6.0-rc1 v0.6.0-rc1
Makoto Yui [Thu, 28 Nov 2019 16:43:43 +0000 (01:43 +0900)] 
[maven-release-plugin] prepare release v0.6.0-rc1

2 years agoBumped version string to 0.6.0-incubating
Makoto Yui [Thu, 28 Nov 2019 16:41:45 +0000 (01:41 +0900)] 
Bumped version string to 0.6.0-incubating

2 years agoMinor refactoring and fixed function docs
Makoto Yui [Thu, 28 Nov 2019 07:46:02 +0000 (16:46 +0900)] 
Minor refactoring and fixed function docs

2 years ago[HIVEMALL-159][DOC] Add documentation about One-hot encoding
Makoto Yui [Thu, 28 Nov 2019 07:11:17 +0000 (16:11 +0900)] 
[HIVEMALL-159][DOC] Add documentation about One-hot encoding

## What changes were proposed in this pull request?

Add documentation about One-hot encoding

## What type of PR is it?

Documentation

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-159

## How to use this feature?

See userguide

Author: Makoto Yui <myui@apache.org>

Closes #223 from myui/onehot_docs.

2 years ago[HIVEMALL-56][DOC] Add documentation about Similarity/Distance functions
Makoto Yui [Wed, 27 Nov 2019 09:03:41 +0000 (18:03 +0900)] 
[HIVEMALL-56][DOC] Add documentation about Similarity/Distance functions

## What changes were proposed in this pull request?

Add documentation about Similarity/Distance functions

## What type of PR is it?

Documentation

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-56

## Checklist

Author: Makoto Yui <myui@apache.org>

Closes #222 from myui/HIVEMALL-56.

2 years ago[HIVEMALL-158][DOC] Refine deprecated userguide contents
Makoto Yui [Wed, 27 Nov 2019 07:42:34 +0000 (16:42 +0900)] 
[HIVEMALL-158][DOC] Refine deprecated userguide contents

## What changes were proposed in this pull request?

Refine deprecated userguide contents

## What type of PR is it?

Documentation

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-158

Author: Makoto Yui <myui@apache.org>

Closes #221 from myui/HIVEMALL-158.

2 years ago[HIVEMALL-285] Add -inspect_opts option to show hyperparameters
Makoto Yui [Wed, 27 Nov 2019 07:11:56 +0000 (16:11 +0900)] 
[HIVEMALL-285] Add -inspect_opts option to show hyperparameters

## What changes were proposed in this pull request?

Add `-inspect_opts` option to show hyperparameters

## What type of PR is it?

Improvement

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-285

## How was this patch tested?

manual tests on EMR

## How to use this feature?

```sql
select train_regressor(array(), 0, '-inspect_opts -optimizer adam -reg elasticnet');

FAILED: UDFArgumentException Inspected Optimizer options ...
{disable_cvtest=false, regularization=ElasticNet, loss_function=SquaredLoss, eps=1.0E-8, decay=0.0, iterations=10, eta0=0.1, l1_ratio=0.5, lambda=1.0E-4, eta=Invscaling, optimizer=adam, beta1=0.9, beta2=0.999, alpha=1.0, cv_rate=0.005, power_t=0.1}
```

## Checklist

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [x] Did you run system tests on Hive (or Spark)?

Author: Makoto Yui <myui@apache.org>

Closes #220 from myui/HIVEMALL-285.

2 years agoRevised exception type
Makoto Yui [Tue, 26 Nov 2019 06:43:09 +0000 (15:43 +0900)] 
Revised exception type

2 years agoMinor refactoring
Makoto Yui [Tue, 26 Nov 2019 06:39:30 +0000 (15:39 +0900)] 
Minor refactoring

2 years ago[HIVEMALL-283] Bump up netty version to 4.1.42.Final
Makoto Yui [Tue, 26 Nov 2019 04:54:43 +0000 (13:54 +0900)] 
[HIVEMALL-283] Bump up netty version to 4.1.42.Final

## What changes were proposed in this pull request?

Bump up netty version to 4.1.42.Final

This closes #206 and closes #207

## What type of PR is it?

Hot Fix

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-283

## How was this patch tested?

unit tests

## Checklist

(Please remove this section if not needed; check `x` for YES, blank for NO)

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [ ] Did you run system tests on Hive (or Spark)?

Author: Makoto Yui <myui@apache.org>

Closes #219 from myui/HIVEMALL-283.

2 years ago[HIVEMALL-226] Move hivemall.fm and hivemall.mf packages to under hivemall.factorization
Makoto Yui [Mon, 25 Nov 2019 18:58:42 +0000 (03:58 +0900)] 
[HIVEMALL-226] Move hivemall.fm and hivemall.mf packages to under hivemall.factorization

## What changes were proposed in this pull request?

Move the `hivemall.fm` and `hivemall.mf` packages under `hivemall.factorization`.

## What type of PR is it?

Refactoring

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-226

## How was this patch tested?

unit tests and manual tests on EMR

## Checklist

(Please remove this section if not needed; check `x` for YES, blank for NO)

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [ ] Did you run system tests on Hive (or Spark)?

Author: Makoto Yui <myui@apache.org>

Closes #218 from myui/HIVEMALL-266.

2 years agoUpdate javadoc and applied formatter
Makoto Yui [Mon, 25 Nov 2019 17:05:56 +0000 (02:05 +0900)] 
Update javadoc and applied formatter

2 years ago[HIVEMALL-165] Fixed to accept any primitive
Makoto Yui [Mon, 25 Nov 2019 16:53:29 +0000 (01:53 +0900)] 
[HIVEMALL-165] Fixed to accept any primitive

## What changes were proposed in this pull request?

Fix a bug where the `array_remove` UDF throws an exception when the first argument is null

## What type of PR is it?

Bug Fix

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-165

## How was this patch tested?

manual tests on EMR

## How to use this feature?

```sql
WITH data4 as (
  select false as n, array(2.0, 3.0, 4.0) as nums
  union all
  select true as n, array(2.0, 3.0, 4.0) as nums
)
select
  array_remove(if(n = true, null, nums), 2.0) as c1,
  array_remove(if(n = true, null, nums), array(3.0,2.0)) as c2,
  array_remove(if(n = false, null, nums), 2.0) as c3
from
  data4;
> c1      c2      c3
> [3,4]   [4]     NULL
> NULL    NULL    [3,4]

select array_remove(array(2.0,2.1,3.0,4.0,2.0),2), array_remove(array(2.0,3.0,4.0),array(3,2.0));
> [2.1,3,4]       [4]

SELECT array_remove(array(1,null,3),null);
> [1,3]

SELECT array_remove(array(1,null,3,null,5),null);
> [1,3,5]

SELECT array_remove(array(1,null,3),array(null));
> [1,3]

SELECT array_remove(array('aaa','bbb'),'bbb');
> ["aaa"]

SELECT array_remove(array('aaa','bbb','ccc','bbb'), array('bbb','ccc'));
> ["aaa"]

select array_remove(array(null),null);
> []

select array_remove(array(null,'bbb'),'aaa');
> [null,"bbb"]
```
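
The outputs above suggest the following semantics, sketched here in plain Python for illustration only. This is not Hivemall's implementation; the helper reuses the UDF name but is hypothetical. A NULL array yields NULL, and the second argument may be a single value or an array of values to drop.

```python
from typing import Any, List, Optional


def array_remove(values: Optional[List[Any]], target: Any) -> Optional[List[Any]]:
    """Hypothetical Python model of the behaviour shown in the SQL examples."""
    if values is None:
        # A NULL first argument yields NULL instead of raising an exception.
        return None
    # The target may be one value or a list of values to remove.
    targets = target if isinstance(target, list) else [target]
    return [v for v in values if v not in targets]


print(array_remove([2.0, 3.0, 4.0], [3.0, 2.0]))  # [4.0]
print(array_remove(None, 2.0))                     # None
print(array_remove([1, None, 3], None))            # [1, 3]
```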

## Checklist

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [x] Did you run system tests on Hive (or Spark)?

Author: Makoto Yui <myui@apache.org>

Closes #217 from myui/HIVEMALL-165.

2 years ago[HIVEMALL-121] Add -libsvm formatting option to feature_hashing UDF
Makoto Yui [Mon, 25 Nov 2019 10:03:15 +0000 (19:03 +0900)] 
[HIVEMALL-121] Add -libsvm formatting option to feature_hashing UDF

## What changes were proposed in this pull request?

Add `-libsvm` formatting option for `feature_hashing` UDF

## What type of PR is it?

Improvement

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-121

## How was this patch tested?

unit tests, manual tests on EMR

## How to use this feature?

```sql
select feature_hashing(array('aaa:1.0','aaa','bbb:2.0'), '-libsvm');
> ["4063537:1.0","4063537:1","8459207:2.0"]

select feature_hashing(array('aaa:1.0','aaa','bbb:2.0'), '-features 10 -libsvm');
> ["1:2.0","7:1.0","7:1"]
```

## Checklist

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [x] Did you run system tests on Hive (or Spark)?

Author: Makoto Yui <myui@apache.org>

Closes #216 from myui/HIVEMALL-121.

2 years ago[HIVEMALL-249] Fix fmeasure UDAF to support any integers
Makoto Yui [Mon, 25 Nov 2019 08:50:35 +0000 (17:50 +0900)] 
[HIVEMALL-249] Fix fmeasure UDAF to support any integers

## What changes were proposed in this pull request?

Fix fmeasure UDAF to support any integers

## What type of PR is it?

Hot Fix

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-249

## How to use this feature?

```sql
create table data2 as
  select 1.1 as truth, 0 as predicted
union all
  select 0.0 as truth, 1 as predicted
union all
  select 0.0 as truth, 0 as predicted
union all
  select 1.0 as truth, 1 as predicted
union all
  select 0.0 as truth, 1 as predicted
union all
  select 0.0 as truth, 0 as predicted
;

select fmeasure(truth, predicted, '-average binary') from data2;
```
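
For reference, the binary F-measure with beta = 1 can be written as 2TP / (2TP + FP + FN). The Python sketch below applies that standard definition to the six rows above, treating the truth column as 0/1 labels (an assumption, since the first row uses 1.1). It is not Hivemall's implementation, and the value Hivemall prints for this query is not confirmed here.

```python
from typing import Sequence


def binary_f1(truth: Sequence[int], predicted: Sequence[int], positive: int = 1) -> float:
    """Standard binary F1: 2*TP / (2*TP + FP + FN)."""
    pairs = list(zip(truth, predicted))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    denom = 2 * tp + fp + fn
    # With no positives at all the score is conventionally reported as 0.
    return 2 * tp / denom if denom else 0.0


# Same rows as the table above, with truth read as 0/1 labels (assumption).
truth = [1, 0, 0, 1, 0, 0]
predicted = [0, 1, 0, 1, 1, 0]
print(binary_f1(truth, predicted))  # 0.4
```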

## How was this patch tested?

manual tests on EMR

## Checklist

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [x] Did you run system tests on Hive (or Spark)?

Author: Makoto Yui <myui@apache.org>

Closes #215 from myui/HIVEMALL-249.

2 years ago[HIVEMALL-276] Stable support for XGBoost v0.90
Makoto Yui [Fri, 22 Nov 2019 15:56:36 +0000 (00:56 +0900)] 
[HIVEMALL-276] Stable support for XGBoost v0.90

## What changes were proposed in this pull request?

- Fix xgboost module to create DMatrix from CSRMatrix
- Support xgboost v0.90 hyperparameters
- Replace xgboost4j with [xgboost-predictor](https://github.com/komiya-atsushi/xgboost-predictor-java) for prediction
- Add documentation about XGBoost

## What type of PR is it?

Refactoring, Improvement

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-276
https://issues.apache.org/jira/browse/HIVEMALL-275
https://issues.apache.org/jira/browse/HIVEMALL-279
https://issues.apache.org/jira/browse/HIVEMALL-272
https://issues.apache.org/jira/browse/HIVEMALL-27

## How to use this feature?

as described in [user guide](http://hivemall.apache.org/userguide/index.html).

## How was this patch tested?

unit tests and manual tests on EMR

## Checklist

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [x] Did you run system tests on Hive (or Spark)?

Author: Makoto Yui <myui@apache.org>

Closes #213 from myui/HIVEMALL-275-2.

2 years ago[HIVEMALL-281] Support max_by, min_by, majority_vote UDAFs
Makoto Yui [Fri, 22 Nov 2019 14:17:11 +0000 (23:17 +0900)] 
[HIVEMALL-281] Support max_by, min_by, majority_vote UDAFs

## What changes were proposed in this pull request?

Support max_by, min_by, majority_vote UDFs

## What type of PR is it?

Feature

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-281

## How was this patch tested?

manual tests on EMR

## How to use this feature?

```sql

create table data1 as (
  select 'jake' as name, 18 as age
  union all
  select 'tom' as name, 64 as age
  union all
  select 'lisa' as name, 32 as age
);

select
  max_by(name, age) as max_name,
  min_by(name, age) as min_name
from
  data1;
> tom, jake

create table data2 as
  select
    explode(array('1', '2', '2', '2', '5', '4', '1', '2')) as k;

select
  majority_vote(k) as k
from
  data2;
> 2
```
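
As a rough model of what these aggregates compute, the Python sketch below reproduces the results shown above. It is an illustration under the assumption that `max_by(x, y)` returns the `x` of the row with the largest `y`, `min_by` the `x` of the row with the smallest `y`, and `majority_vote` the most frequent value; tie-breaking in Hivemall is not described here.

```python
from collections import Counter
from typing import Any, Iterable, Sequence, Tuple


def max_by(rows: Sequence[Tuple[Any, Any]]) -> Any:
    """Return x from the (x, y) row with the largest y."""
    return max(rows, key=lambda row: row[1])[0]


def min_by(rows: Sequence[Tuple[Any, Any]]) -> Any:
    """Return x from the (x, y) row with the smallest y."""
    return min(rows, key=lambda row: row[1])[0]


def majority_vote(values: Iterable[Any]) -> Any:
    """Return the most frequent value."""
    return Counter(values).most_common(1)[0][0]


data1 = [("jake", 18), ("tom", 64), ("lisa", 32)]
print(max_by(data1), min_by(data1))  # tom jake
print(majority_vote(["1", "2", "2", "2", "5", "4", "1", "2"]))  # 2
```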

## Checklist

(Please remove this section if not needed; check `x` for YES, blank for NO)

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [x] Did you run system tests on Hive (or Spark)?

Author: Makoto Yui <myui@apache.org>

Closes #214 from myui/HIVEMALL-281.

2 years ago[HOTFIX] bumped matrix4j version to 0.9.2
Makoto Yui [Mon, 11 Nov 2019 05:38:54 +0000 (14:38 +0900)] 
[HOTFIX] bumped matrix4j version to 0.9.2

2 years ago[HIVEMALL-278] Bumped matrix4j version to v0.9.1
Makoto Yui [Fri, 1 Nov 2019 09:27:53 +0000 (18:27 +0900)] 
[HIVEMALL-278] Bumped matrix4j version to v0.9.1

## What changes were proposed in this pull request?

Bumped matrix4j version to v0.9.1 since matrix4j v0.9.0 had a bug when constructing a CSRMatrix from entries whose column indices are not given in sorted order.
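
To illustrate why column order matters, the following Python sketch builds the three CSR arrays from rows whose entries arrive in arbitrary column order, sorting each row by column index first. It is a generic illustration of the CSR layout, not matrix4j's API; `build_csr` and its input format are hypothetical.

```python
from typing import List, Sequence, Tuple


def build_csr(rows: Sequence[Sequence[Tuple[int, float]]]):
    """Build (indptr, indices, data) from rows of (column, value) pairs."""
    indptr: List[int] = [0]
    indices: List[int] = []
    data: List[float] = []
    for row in rows:
        # Sort by column so that lookups which assume ascending column
        # indices within a row (for example binary search) stay correct.
        for col, val in sorted(row, key=lambda entry: entry[0]):
            indices.append(col)
            data.append(val)
        indptr.append(len(indices))
    return indptr, indices, data


# The first row lists column 2 before column 0.
print(build_csr([[(2, 1.0), (0, 2.0)], [(1, 3.0)]]))
# ([0, 2, 3], [0, 2, 1], [2.0, 1.0, 3.0])
```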

## What type of PR is it?

Bug Fix

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-278

## How was this patch tested?

unit tests

## Checklist

(Please remove this section if not needed; check `x` for YES, blank for NO)

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [ ] Did you run system tests on Hive (or Spark)?

Author: Makoto Yui <myui@apache.org>

Closes #212 from myui/HIVEMALL-278.

2 years agoadd missing junit dependency
Makoto Yui [Thu, 31 Oct 2019 10:58:20 +0000 (19:58 +0900)] 
add missing junit dependency

2 years agoAdded SparseDMatrixBuilder 211/head
Makoto Yui [Thu, 31 Oct 2019 10:17:54 +0000 (19:17 +0900)] 
Added SparseDMatrixBuilder

2 years agoRenamed XGBoostUDTF as XGBoostBaseUDTF
Makoto Yui [Thu, 31 Oct 2019 10:17:31 +0000 (19:17 +0900)] 
Renamed XGBoostUDTF as XGBoostBaseUDTF

2 years ago[HIVEMALL-274] Fix wrong column name of train_regressor() in tutorial
Aki Ariga [Thu, 31 Oct 2019 07:44:44 +0000 (16:44 +0900)] 
[HIVEMALL-274] Fix wrong column name of train_regressor() in tutorial

## What changes were proposed in this pull request?

Fix the documentation bug reported in HIVEMALL-274

## What type of PR is it?

Documentation

## What is the Jira issue?

https://issues.apache.org/jira/projects/HIVEMALL/issues/HIVEMALL-274

## How was this patch tested?

N/A

## Checklist

(Please remove this section if not needed; check `x` for YES, blank for NO)

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [ ] Did you run system tests on Hive (or Spark)?

Author: Aki Ariga <ariga@treasure-data.com>

Closes #210 from chezou/HIVEMALL-274.

2 years agoAdded document about xgboost_version() UDF
Makoto Yui [Wed, 30 Oct 2019 08:59:49 +0000 (17:59 +0900)] 
Added document about xgboost_version() UDF

2 years ago[HIVEMALL-273] Support xgboost v0.90
Makoto Yui [Wed, 30 Oct 2019 07:41:21 +0000 (16:41 +0900)] 
[HIVEMALL-273] Support xgboost v0.90

## What changes were proposed in this pull request?

Support xgboost v0.90

## What type of PR is it?

Improvement

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-273

## How was this patch tested?

unit tests and manual tests on EMR

## How to use this feature?

https://gist.github.com/myui/aa6e142a95ca8f995cc8e49146dbe2eb

## Checklist

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [x] Did you run system tests on Hive (or Spark)?

Author: Makoto Yui <myui@apache.org>

Closes #209 from myui/HIVEMALL-273.

2 years ago[HIVEMALL-260] Remove dependencies to Scala library in xgboost classifier
Makoto Yui [Tue, 29 Oct 2019 06:37:43 +0000 (15:37 +0900)] 
[HIVEMALL-260] Remove dependencies to Scala library in xgboost classifier

## What changes were proposed in this pull request?

Remove dependencies to Scala library in xgboost classifier

## What type of PR is it?

Bug Fix, Hot Fix

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-260

## How was this patch tested?

manual tests on EMR

## How to use this feature?

to appear

## Checklist

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [ ] Did you run system tests on Hive (or Spark)?

Author: Makoto Yui <myui@apache.org>

Closes #205 from myui/HIVEMALL-260.

2 years agoRemove rand_gid/rand_gid2 macro
Makoto Yui [Wed, 23 Oct 2019 09:44:41 +0000 (18:44 +0900)] 
Remove rand_gid/rand_gid2 macro

## What changes were proposed in this pull request?

Remove rand_gid/rand_gid2 macro

## What type of PR is it?

Hot Fix

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-270

Author: Makoto Yui <myui@apache.org>

Closes #204 from myui/HIVEMALL-270.

2 years ago[HIVEMALL-261][HIVEMALL-262] argmin/argmax/argsort UDF
Makoto Yui [Wed, 23 Oct 2019 09:01:51 +0000 (18:01 +0900)] 
[HIVEMALL-261][HIVEMALL-262] argmin/argmax/argsort UDF

## What changes were proposed in this pull request?

Introduce argmin/argmax/argsort UDF

## What type of PR is it?

Feature

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-261
https://issues.apache.org/jira/browse/HIVEMALL-262

## How was this patch tested?

unit tests, manual tests on EMR

## How to use this feature?

```sql
SELECT argmax(array(5,2,0,1));
> 0

SELECT array_slice(array(5,2,0,1), argmax(array(5,2,0,1)));
> 5

SELECT argmin(array(5,2,0,1));
> 2

SELECT argsort(array(5,2,0,1));
> 2, 3, 1, 0

SELECT array_slice(array(5,2,0,1), argsort(array(5,2,0,1)));
> 0, 1, 2, 5

SELECT argsort(argsort(array(5,2,0,1))), argrank(array(5,2,0,1));
> 3, 2, 0, 1

SELECT arange(5), arange(1, 5), arange(1, 5, 1), arange(0, 5, 1);
> [0,1,2,3,4]     [1,2,3,4]       [1,2,3,4]       [0,1,2,3,4]

SELECT arange(1, 6, 2);
> 1, 3, 5

SELECT arange(-1, -6, 2);
> -1, -3, -5

SELECT argsort(array(5, 2, 0, 1)), argrank(array(5, 2, 0, 1)), argsort(argsort(array(5, 2, 0, 1)));
> [2,3,1,0]       [3,2,0,1]       [3,2,0,1]
```
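
The relationship between these functions can be checked with a small Python model. This is only a sketch of the semantics implied by the outputs above (0-based indices, ascending sort), not Hivemall's code; in particular it shows why `argsort(argsort(x))` equals `argrank(x)`.

```python
from typing import List, Sequence


def argsort(xs: Sequence[float]) -> List[int]:
    """Indices that would sort xs in ascending order."""
    return sorted(range(len(xs)), key=lambda i: xs[i])


def argrank(xs: Sequence[float]) -> List[int]:
    """Rank of each element; equal to applying argsort twice."""
    return argsort(argsort(xs))


def argmax(xs: Sequence[float]) -> int:
    """Index of the largest element (first one on ties)."""
    return max(range(len(xs)), key=lambda i: xs[i])


def argmin(xs: Sequence[float]) -> int:
    """Index of the smallest element (first one on ties)."""
    return min(range(len(xs)), key=lambda i: xs[i])


xs = [5, 2, 0, 1]
print(argmax(xs), argmin(xs))            # 0 2
print(argsort(xs))                       # [2, 3, 1, 0]
print(argrank(xs))                       # [3, 2, 0, 1]
print([xs[i] for i in argsort(xs)])      # [0, 1, 2, 5]
```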

## Checklist

(Please remove this section if not needed; check `x` for YES, blank for NO)

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [ ] Did you run system tests on Hive (or Spark)?

Author: Makoto Yui <myui@apache.org>

Closes #197 from myui/argmax.

2 years ago[HIVEMALL-244] Support Java9, Java11(LTS)
Makoto Yui [Mon, 21 Oct 2019 07:22:05 +0000 (16:22 +0900)] 
[HIVEMALL-244] Support Java9, Java11(LTS)

## What changes were proposed in this pull request?

Support Java9, Java11(LTS)

## What type of PR is it?

Improvement | Hot Fix

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-244

## How was this patch tested?

unit tests

## Checklist

(Please remove this section if not needed; check `x` for YES, blank for NO)

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [ ] Did you run system tests on Hive (or Spark)?

Author: Makoto Yui <myui@apache.org>

Closes #203 from myui/HIVEMALL-244.

2 years ago[HIVEMALL-269] Modified to use matrix4j for matrix module
Makoto Yui [Fri, 18 Oct 2019 08:42:16 +0000 (17:42 +0900)] 
[HIVEMALL-269] Modified to use matrix4j for matrix module

## What changes were proposed in this pull request?

Use matrix4j for matrix module

## What type of PR is it?

Hot Fix | Refactoring

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-269

## How was this patch tested?

unit tests

## Checklist

(Please remove this section if not needed; check `x` for YES, blank for NO)

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [ ] Did you run system tests on Hive (or Spark)?

Author: Makoto Yui <myui@apache.org>

Closes #202 from myui/HIVEMALL-269.

2 years agoFixed annotations
Makoto Yui [Tue, 8 Oct 2019 07:15:24 +0000 (16:15 +0900)] 
Fixed annotations

2 years agoMoved matrix/random package to utils/random
Makoto Yui [Mon, 7 Oct 2019 07:16:19 +0000 (16:16 +0900)] 
Moved matrix/random package to utils/random

2 years agoMerged ArrayUtilsTest
Makoto Yui [Mon, 7 Oct 2019 05:44:39 +0000 (14:44 +0900)] 
Merged ArrayUtilsTest

2 years ago[HIVEMALL-267] Drop Spark Dataframe support (SparkSQL remain supported)
Makoto Yui [Fri, 4 Oct 2019 05:28:49 +0000 (14:28 +0900)] 
[HIVEMALL-267] Drop Spark Dataframe support (SparkSQL remain supported)

## What changes were proposed in this pull request?

Drop Spark DataFrame support (SparkSQL remains supported).

## What type of PR is it?

Hot Fix, Refactoring

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-267

## How was this patch tested?

unit tests, manual tests

## Checklist

(Please remove this section if not needed; check `x` for YES, blank for NO)

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [ ] Did you run system tests on Hive (or Spark)?

Author: Makoto Yui <myui@apache.org>

Closes #201 from myui/HIVEMALL-267.

3 years ago[HIVEMALL-268] Fix the default vInit, eta initialization bug in FactorizationMachines
Makoto Yui [Thu, 3 Oct 2019 08:34:10 +0000 (17:34 +0900)] 
[HIVEMALL-268] Fix the default vInit, eta initialization bug in FactorizationMachines

## What changes were proposed in this pull request?

Fix the default vInit, eta initialization bug in FactorizationMachines

## What type of PR is it?

Bug Fix

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-268

## How was this patch tested?

unit tests, manual tests on EMR

## Checklist

(Please remove this section if not needed; check `x` for YES, blank for NO)

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [ ] Did you run system tests on Hive (or Spark)?

Author: Makoto Yui <myui@apache.org>

Closes #200 from myui/HIVEMALL-268.

3 years ago[HIVEMALL-171] Tracing functionality for prediction of DecisionTrees
Makoto Yui [Fri, 27 Sep 2019 18:39:01 +0000 (03:39 +0900)] 
[HIVEMALL-171] Tracing functionality for prediction of DecisionTrees

## What changes were proposed in this pull request?

Introduce `decision_path` UDF providing tracing of decision tree prediction paths

## What type of PR is it?

Feature

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-171

## How was this patch tested?

unit tests, manual tests on EMR

## How to use this feature?

to be described in the user guide

## Checklist

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [x] Did you run system tests on Hive (or Spark)?

Author: Makoto Yui <myui@apache.org>

Closes #199 from myui/HIVEMALL-171.

3 years ago[HIVEMALL-245] Refactor RandomForest for Sparse Data handling
Makoto Yui [Fri, 13 Sep 2019 09:23:00 +0000 (18:23 +0900)] 
[HIVEMALL-245] Refactor RandomForest for Sparse Data handling

## What changes were proposed in this pull request?

Refactor RandomForest for Sparse Data handling

## What type of PR is it?

Refactoring

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-245
https://issues.apache.org/jira/browse/HIVEMALL-171

## How was this patch tested?

unit tests, manual tests on EMR

## Checklist

(Please remove this section if not needed; check `x` for YES, blank for NO)

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [ ] Did you run system tests on Hive (or Spark)?

Author: Makoto Yui <myui@apache.org>

Closes #198 from myui/HIVEMALL-245.

3 years agoFixed a documentation bug
Makoto Yui [Fri, 26 Jul 2019 07:33:22 +0000 (16:33 +0900)] 
Fixed a documentation bug

3 years agoAdd test of sparse input for randomforest classifier
Makoto Yui [Thu, 18 Jul 2019 07:51:33 +0000 (16:51 +0900)] 
Add test of sparse input for randomforest classifier

3 years agoFixed a minor typo in doc
Makoto Yui [Sat, 13 Jul 2019 14:45:52 +0000 (23:45 +0900)] 
Fixed a minor typo in doc

3 years agoAdded sanity checks for training data in RandomForest
Makoto Yui [Wed, 10 Jul 2019 07:17:20 +0000 (16:17 +0900)] 
Added sanity checks for training data in RandomForest

3 years agoRefactor Matrix module for NNZ and zero value handling
Makoto Yui [Wed, 10 Jul 2019 05:58:39 +0000 (14:58 +0900)] 
Refactor Matrix module for NNZ and zero value handling

## What changes were proposed in this pull request?

Refactor Matrix module for NNZ and zero value handling.

## What type of PR is it?

Hot Fix, Refactoring

## What is the Jira issue?

no JIRA issue

## How was this patch tested?

Unit tests

## Checklist

(Please remove this section if not needed; check `x` for YES, blank for NO)

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [ ] Did you run system tests on Hive (or Spark)?

Author: Makoto Yui <myui@apache.org>

Closes #196 from myui/refactor_randomforest.

3 years agoFixed ToC
Makoto Yui [Fri, 28 Jun 2019 16:57:48 +0000 (01:57 +0900)] 
Fixed ToC

3 years agoAdded usage for feature_binning UDF
Makoto Yui [Fri, 28 Jun 2019 16:55:39 +0000 (01:55 +0900)] 
Added usage for feature_binning UDF

3 years agoFixed a doc
Makoto Yui [Fri, 28 Jun 2019 16:30:53 +0000 (01:30 +0900)] 
Fixed a doc

3 years agoFixed feature binning documentation
Makoto Yui [Fri, 28 Jun 2019 06:43:05 +0000 (15:43 +0900)] 
Fixed feature binning documentation

3 years ago[HIVEMALL-259][DOC] Refactor feature_binning UDF
Makoto Yui [Thu, 27 Jun 2019 18:02:38 +0000 (03:02 +0900)] 
[HIVEMALL-259][DOC] Refactor feature_binning UDF

## What changes were proposed in this pull request?

Refactor feature_binning UDF and update the function usage

## What type of PR is it?

Documentation, Refactoring

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-259

## How was this patch tested?

unit tests, manual tests on EMR

## How to use this feature?

```sql
WITH extracted as (
  select
    extract_feature(feature) as index,
    extract_weight(feature) as value
  from
    input l
    LATERAL VIEW explode(features) r as feature
),
mapping as (
  select
    index,
    build_bins(value, 5, true) as quantiles -- 5 bins with auto bin shrinking
  from
    extracted
  group by
    index
),
bins as (
   select
    to_map(index, quantiles) as quantiles
   from
    mapping
)
select
  l.features as original,
  feature_binning(l.features, r.quantiles) as features
from
  input l
  cross join bins r
```

see https://gist.github.com/myui/f943fa3ce1a7e1ac3f2dd9a7f9fa703b
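
The query above first derives per-feature bin boundaries and then maps each feature value onto a bin. The Python sketch below models only the second step, given boundaries that are already known. The boundary handling (which side of an edge is inclusive, how end bins are treated) is an assumption, and `bin_index` and `bin_features` are hypothetical helpers, not Hivemall functions.

```python
from bisect import bisect_right
from typing import Dict, List, Sequence


def bin_index(value: float, edges: Sequence[float]) -> int:
    """Number of edges less than or equal to value (edges must be sorted)."""
    return bisect_right(edges, value)


def bin_features(features: Sequence[str], edges_by_name: Dict[str, Sequence[float]]) -> List[str]:
    """Replace the value of each 'name:value' feature with its bin index."""
    binned = []
    for feature in features:
        name, _, raw = feature.partition(":")
        edges = edges_by_name.get(name)
        if edges is None:
            # No boundaries known for this feature: keep it unchanged.
            binned.append(feature)
        else:
            binned.append(f"{name}:{bin_index(float(raw), edges)}")
    return binned


print(bin_features(["age:23", "height:1.8"], {"age": [18.0, 30.0, 50.0]}))
# ['age:1', 'height:1.8']
```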

## Checklist

(Please remove this section if not needed; check `x` for YES, blank for NO)

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [x] Did you run system tests on Hive (or Spark)?

Author: Makoto Yui <myui@apache.org>

Closes #195 from myui/HIVEMALL-259.

3 years agoFixed imports
Makoto Yui [Tue, 25 Jun 2019 12:52:12 +0000 (21:52 +0900)] 
Fixed imports

3 years ago[HIVEMALL-253-2] map_roulette UDF
Solodye [Tue, 25 Jun 2019 10:31:02 +0000 (19:31 +0900)] 
[HIVEMALL-253-2] map_roulette UDF

revise #192

Author: Makoto Yui <myui@apache.org>

Closes #193 from myui/HIVEMALL-253-2.

3 years ago[HIVEMALL-258] Add UDF to convert feature/label in Libsvm format
Makoto Yui [Thu, 20 Jun 2019 10:35:42 +0000 (19:35 +0900)] 
[HIVEMALL-258] Add UDF to convert feature/label in Libsvm format

## What changes were proposed in this pull request?

Add UDF to convert feature/label in Libsvm format

## What type of PR is it?

Feature

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-258

## How was this patch tested?

unit tests and manual tests

## How to use this feature?

```sql
-- Usage:
 select to_libsvm_format(array('apple:3.4','orange:2.1'))
 > 6284535:3.4 8104713:2.1
 select to_libsvm_format(array('apple:3.4','orange:2.1'), '-features 10')
 > 3:2.1 7:3.4
 select to_libsvm_format(array('7:3.4','3:2.1'), 5.0)
 > 5.0 3:2.1 7:3.4
```
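
The third example above, where features already carry integer indices, can be modelled in a few lines of Python: entries are sorted by index and the label, if any, is placed first. This sketch deliberately leaves out the hashing of string feature names seen in the first two examples, and `to_libsvm_line` is a hypothetical helper rather than Hivemall's implementation.

```python
from typing import Optional, Sequence


def to_libsvm_line(features: Sequence[str], label: Optional[float] = None) -> str:
    """Format 'index:value' features as one LIBSVM line, sorted by index."""
    pairs = []
    for feature in features:
        index, _, value = feature.partition(":")
        # Keep the value as text so that its formatting is preserved.
        pairs.append((int(index), value))
    pairs.sort(key=lambda pair: pair[0])
    body = " ".join(f"{index}:{value}" for index, value in pairs)
    return body if label is None else f"{label} {body}"


print(to_libsvm_line(["7:3.4", "3:2.1"], 5.0))  # 5.0 3:2.1 7:3.4
```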

## Checklist

(Please remove this section if not needed; check `x` for YES, blank for NO)

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [x] Did you run system tests on Hive (or Spark)?

Author: Makoto Yui <myui@apache.org>

Closes #194 from myui/libsvm.

3 years agoFixed a bug in document
Makoto Yui [Thu, 20 Jun 2019 07:09:16 +0000 (16:09 +0900)] 
Fixed a bug in document

3 years agoFixed the usage of min-max scaling and zscore
Makoto Yui [Wed, 19 Jun 2019 10:12:03 +0000 (19:12 +0900)] 
Fixed the usage of min-max scaling and zscore

3 years agoIncreased write buffer from 1MB to 2MB
Makoto Yui [Wed, 12 Jun 2019 08:27:24 +0000 (17:27 +0900)] 
Increased write buffer from 1MB to 2MB

3 years agoUpdate doc
Makoto Yui [Fri, 19 Apr 2019 07:16:32 +0000 (16:16 +0900)] 
Update doc

3 years ago[HIVEMALL-251] Add option to return PartOfSpeech information for tokenize_ja
Makoto Yui [Fri, 19 Apr 2019 07:04:01 +0000 (16:04 +0900)] 
[HIVEMALL-251] Add option to return PartOfSpeech information for tokenize_ja

## What changes were proposed in this pull request?

Add option to return PartOfSpeech information for `tokenize_ja` UDF.

## What type of PR is it?

Feature, Improvement

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-251

## How was this patch tested?

unit tests and manual tests on EMR

## How to use this feature?

```sql
WITH tmp as (
  select
    tokenize_ja('kuromojiを使った分かち書きのテストです。','-mode search -pos') as r
)
select
  r.tokens,
  r.pos,
  r.tokens[0] as token0,
  r.pos[0] as pos0
from
  tmp;
```

| tokens |pos | token0 | pos0 |
|:-:|:-:|:-:|:-:|
| ["kuromoji","使う","分かち書き","テスト"] | ["名詞-一般","動詞-自立","名詞-一般","名詞-サ変接続"] | kuromoji | 名詞-一般 |

## Checklist

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [x] Did you run system tests on Hive (or Spark)?

Author: Makoto Yui <myui@apache.org>

Closes #191 from myui/HIVEMALL-251.

3 years ago[HIVEMALL-246] Add feature name validation in feature UDF
Makoto Yui [Sat, 13 Apr 2019 21:24:42 +0000 (06:24 +0900)] 
[HIVEMALL-246] Add feature name validation in feature UDF

## What changes were proposed in this pull request?

This PR adds feature name validation in feature UDF

`feature(name, value)` should validate that `name` does not include ":". Fail-fast behavior is preferable.
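
A minimal Python sketch of this fail-fast check, assuming the function builds a `name:value` string (the exception type and message are illustrative and not taken from Hivemall):

```python
def feature(name: str, value: float) -> str:
    """Build a 'name:value' feature, rejecting names that contain ':'."""
    if ":" in name:
        # Fail immediately rather than emit an ambiguous 'a:b:1.0' feature.
        raise ValueError(f"feature name must not contain ':': {name!r}")
    return f"{name}:{value}"


print(feature("price", 1.5))  # price:1.5
```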

## What type of PR is it?

Hot Fix

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-246

## How was this patch tested?

unit tests

## Checklist

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [ ] Did you run system tests on Hive (or Spark)?

Author: Makoto Yui <myui@apache.org>

Closes #190 from myui/HIVEMALL-246.

3 years ago[HIVEMALL-237-1] Add usage in ML function reference page
Makoto Yui [Sat, 13 Apr 2019 20:37:14 +0000 (05:37 +0900)] 
[HIVEMALL-237-1] Add usage in ML function reference page

## What changes were proposed in this pull request?

Add usage in ML function reference page

## What type of PR is it?

Documentation

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-237

## How was this patch tested?

via CI

## Checklist

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?

Author: Makoto Yui <myui@apache.org>
Author: Makoto YUI <yuin405@gmail.com>

Closes #183 from myui/HIVEMALL-237.

3 years ago[HIVEMALL-248] UDF for Kuromoji stoptags
Makoto Yui [Sat, 13 Apr 2019 20:09:38 +0000 (05:09 +0900)] 
[HIVEMALL-248] UDF for Kuromoji stoptags

## What changes were proposed in this pull request?

In `tokenize_ja`, users need to provide stoptags, and tokens matching those tags are removed from the token stream. Stoptags are therefore an "exclusive" rule.

## What type of PR is it?

Feature

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-248

## How was this patch tested?

unit tests, functional test on EMR

## How to use this feature?

```sql
select tokenize_ja("kuromojiを使った分かち書きのテストです。", "normal", array("kuromoji"), stoptags_exclude(array("名詞")));
```
> ["分かち書き","テスト"]

`stoptags_exclude(array<string> tags [, const string lang='ja'])` is a UDF that returns the [stoptags](https://github.com/apache/lucene-solr/blob/master/lucene/analysis/kuromoji/src/resources/org/apache/lucene/analysis/ja/stoptags.txt) with the given part-of-speech tags excluded, as seen below:

```sql
select stoptags_exclude(array("名詞-固有名詞"));
```
> ["その他","その他-間投","フィラー","副詞","副詞-一般","副詞-助詞類接続","助動詞","助詞","助詞-並立助詞","助詞-係助詞","助詞-副助詞","助詞-副助詞/並立助詞/終助詞","助詞-副詞化","助詞-接続助詞","助詞-格助詞","助詞-格助詞-一般","助詞-格助詞-引用","助詞-格助詞-連語","助詞-特殊","助詞-終助詞","助詞-連体化","助詞-間投助詞","動詞","動詞-接尾","動詞-自立","動詞-非自立","名詞","名詞-サ変接続","名詞-ナイ形容詞語幹","名詞-一般","名詞-代名詞","名詞-代名詞-一般","名詞-代名詞-縮約","名詞-副詞可能","名詞-動詞非自立的","名詞-引用文字列","名詞-形容動詞語幹","名詞-接尾","名詞-接尾-サ変接続","名詞-接尾-一般","名詞-接尾-人名","名詞-接尾-副詞可能","名詞-接尾-助動詞語幹","名詞-接尾-助数詞","名詞-接尾-地域","名詞-接尾-形容動詞語幹","名詞-接尾-特殊","名詞-接続詞的","名詞-数","名詞-特殊","名詞-特殊-助動詞語幹","名詞-非自立","名詞-非自立-一般","名詞-非自立-副詞可能","名詞-非自立-助動詞語幹","名詞-非自立-形容動詞語幹","形容詞","形容詞-接尾","形容詞-自立","形容詞-非自立","感動詞","接続詞","接頭詞","接頭詞-動詞接続","接頭詞-名詞接続","接頭詞-形容詞接続","接頭詞-数接","未知語","記号","記号-アルファベット","記号-一般","記号-句点","記号-括弧閉","記号-括弧開","記号-空白","記号-読点","語断片","連体詞","非言語音"]

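The two examples above are consistent with a simple filter over the full tag list. The Python sketch below models that behaviour on a small hand-picked subset of tags. It assumes that excluding a tag also excludes its sub-tags (for instance `名詞-固有名詞` also drops `名詞-固有名詞-人名`), which matches the output above but is not stated in this message; the function is a hypothetical model, not Hivemall's code.

```python
from typing import List, Sequence


def stoptags_exclude(all_tags: Sequence[str], excluded: Sequence[str]) -> List[str]:
    """Return all_tags without the excluded tags and their sub-tags."""

    def is_excluded(tag: str) -> bool:
        # A tag is dropped if it equals an excluded tag or is nested under it.
        return any(tag == e or tag.startswith(e + "-") for e in excluded)

    return [tag for tag in all_tags if not is_excluded(tag)]


# Small illustrative subset of part-of-speech tags.
tags = ["助詞", "動詞", "名詞", "名詞-一般", "名詞-固有名詞", "名詞-固有名詞-人名"]
print(stoptags_exclude(tags, ["名詞-固有名詞"]))  # ['助詞', '動詞', '名詞', '名詞-一般']
print(stoptags_exclude(tags, ["名詞"]))           # ['助詞', '動詞']
```

Passing the second result as stoptags would drop particles and verbs and keep nouns, in line with the `tokenize_ja` example above where only noun tokens remain.
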
## Checklist

- [x] Did you apply source code formatter, i.e., `./bin/format_code.sh`, for your commit?
- [x] Did you run system tests on Hive (or Spark)?

Author: Makoto Yui <myui@apache.org>

Closes #189 from myui/HIVEMALL-248.