arrow.git
72 min agoARROW-17883: [Java] implement immutable table (#14316) master
Larry White [Thu, 6 Oct 2022 14:58:43 +0000 (10:58 -0400)] 
ARROW-17883: [Java] implement immutable table (#14316)

Table is a new immutable tabular data structure based on FieldVectors.

This PR is described in detail in the included README.md file. The original design discussion can be found [here](https://docs.google.com/document/d/1J77irZFWNnSID7vK71z26Nw_Pi99I9Hb9iryno8B03c/edit#heading=h.a1lebwljypq5), if you're interested.

Note to reviewers:
- This is a fairly large change set. Most of the code is in "getters" in the Row class. These methods are fairly well covered by tests, but it would be good to have someone look especially at the complex vector types.
- The only changes to existing classes were three new export methods added to the Data class. These use the logic for exporting VectorSchemaRoots.

Lead-authored-by: Larry White <ljw1001@gmail.com>
Co-authored-by: Larry White <lwhite1@users.noreply.github.com>
Signed-off-by: David Li <li.davidm96@gmail.com>
6 hours agoARROW-17911: [R] Implement `across()` within `transmute()` (#14290)
Nic Crane [Thu, 6 Oct 2022 09:58:39 +0000 (10:58 +0100)] 
ARROW-17911: [R] Implement `across()` within `transmute()` (#14290)

Authored-by: Nic Crane <thisisnic@gmail.com>
Signed-off-by: Nic Crane <thisisnic@gmail.com>
8 hours agoARROW-17771: [Docs][Python] Add the use of CONDA_DLL_SEARCH_MODIFICATION_ENABLE to...
Alenka Frim [Thu, 6 Oct 2022 07:42:24 +0000 (09:42 +0200)] 
ARROW-17771: [Docs][Python] Add the use of CONDA_DLL_SEARCH_MODIFICATION_ENABLE to the docs (#14302)

This PR adds info to the Python dev docs about the need to use `CONDA_DLL_SEARCH_MODIFICATION_ENABLE=1` in Python versions < 3.10.

![Screenshot 2022-10-04 at 09 00 54](https://user-images.githubusercontent.com/16418547/193755270-d289d3b0-55a5-4c70-b2ab-0d6ca3a2ecbe.png)

Lead-authored-by: Alenka Frim <frim.alenka@gmail.com>
Co-authored-by: Alenka Frim <AlenkaF@users.noreply.github.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
9 hours agoARROW-17099: [Python] pyarrow build does not support RELWITHDEBINFO build type (...
Alenka Frim [Thu, 6 Oct 2022 06:45:52 +0000 (08:45 +0200)] 
ARROW-17099: [Python] pyarrow build does not support RELWITHDEBINFO build type (#14324)

Add support for `RelWithDebInfo` in the PyArrow build process defined in `setup.py`.

Authored-by: Alenka Frim <frim.alenka@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
19 hours agoARROW-17861: [C++] Deprecate Plasma (#14305)
Antoine Pitrou [Wed, 5 Oct 2022 20:58:16 +0000 (22:58 +0200)] 
ARROW-17861: [C++] Deprecate Plasma (#14305)

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
22 hours agoARROW-17929: [C#] Improve the NuGet packages. (#14312)
Theodore Tsirpanis [Wed, 5 Oct 2022 17:18:13 +0000 (20:18 +0300)] 
ARROW-17929: [C#] Improve the NuGet packages. (#14312)

Authored-by: Theodore Tsirpanis <theodore.tsirpanis@tiledb.com>
Signed-off-by: Eric Erhardt <eric.erhardt@microsoft.com>
26 hours agoARROW-17938: [Python] Fix compilation error on python_test.cc (#14321)
Antoine Pitrou [Wed, 5 Oct 2022 13:35:40 +0000 (15:35 +0200)] 
ARROW-17938: [Python] Fix compilation error on python_test.cc (#14321)

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
29 hours agoARROW-17939: [Docs][Python] Update python dev page after PyArrow C++ tests change...
Alenka Frim [Wed, 5 Oct 2022 10:17:09 +0000 (12:17 +0200)] 
ARROW-17939: [Docs][Python] Update python dev page after PyArrow C++ tests change (#14322)

Authored-by: Alenka Frim <frim.alenka@gmail.com>
Signed-off-by: Alenka Frim <frim.alenka@gmail.com>
32 hours agoARROW-17924: [Doc][Format] Clarify immutability assumption in C Data Interface (...
Antoine Pitrou [Wed, 5 Oct 2022 08:06:30 +0000 (10:06 +0200)] 
ARROW-17924: [Doc][Format] Clarify immutability assumption in C Data Interface (#14304)

Also create a "Semantics" section to better structure the document.

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
32 hours agoARROW-17834: [Python] Allow creating ExtensionArray through pa.array(..) constructor...
Joris Van den Bossche [Wed, 5 Oct 2022 08:00:17 +0000 (10:00 +0200)] 
ARROW-17834: [Python] Allow creating ExtensionArray through pa.array(..) constructor (#14253)

Authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
32 hours agoARROW-17829: [Python] Avoid pandas groupby deprecation warning write_to_dataset ...
Alenka Frim [Wed, 5 Oct 2022 07:59:36 +0000 (09:59 +0200)] 
ARROW-17829: [Python] Avoid pandas groupby deprecation warning write_to_dataset (#14306)

This PR addresses the deprecation warning from pandas 1.5.0 related to a length-1 tuple used in the `groupby()` operation; see https://github.com/pandas-dev/pandas/issues/42795.

Currently we use the `groupby` operation with a length-1 tuple in:
- `multisourcefs()` fixture in `test_dataset.py`:
https://github.com/apache/arrow/blob/466018084861f86a9171803c6d689e9c1c1efb5a/python/pyarrow/tests/test_dataset.py#L197
- `write_to_dataset()` in `pyarrow/parquet/core.py`:
https://github.com/apache/arrow/blob/466018084861f86a9171803c6d689e9c1c1efb5a/python/pyarrow/parquet/core.py#L3346-L3348

This PR fixes the test and adds a check in `parquet/core.py` to avoid the warning.
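
A check of this kind could look roughly like the sketch below. It is an illustration only, not the code from `parquet/core.py`: the helper name `groupby_keys` is hypothetical, and it assumes the warning is avoided by passing a single column name directly instead of a length-1 sequence.

```python
from typing import List, Sequence, Union


def groupby_keys(partition_cols: Sequence[str]) -> Union[str, List[str]]:
    """Return a key argument suitable for DataFrame.groupby().

    Hypothetical helper: a single partition column is unwrapped to a plain
    string so that groupby() is never called with a length-1 sequence.
    """
    keys = list(partition_cols)
    if not keys:
        raise ValueError("at least one partition column is required")
    if len(keys) == 1:
        return keys[0]
    return keys


print(groupby_keys(["year"]))
print(groupby_keys(["year", "month"]))
```

The result would then be passed as `df.groupby(groupby_keys(cols))`, so the multi-column case keeps its existing behavior.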

Lead-authored-by: Alenka Frim <frim.alenka@gmail.com>
Co-authored-by: Alenka Frim <AlenkaF@users.noreply.github.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
38 hours agoMINOR: [R] Adapt stringr::str_c mapping for upcoming release (#14296)
Neal Richardson [Wed, 5 Oct 2022 01:19:24 +0000 (21:19 -0400)] 
MINOR: [R] Adapt stringr::str_c mapping for upcoming release (#14296)

Previously, `str_c(sep = NA)` would just emit `NA`, but in the upcoming stringr release, it will error.

cc @hadley

Authored-by: Neal Richardson <neal.p.richardson@gmail.com>
Signed-off-by: Neal Richardson <neal.p.richardson@gmail.com>
39 hours agoARROW-17930: [CI][C++] Valgrind failure in PrintValue<arrow::dataset::ScannerTestPara...
Ben Harkins [Wed, 5 Oct 2022 00:13:17 +0000 (20:13 -0400)] 
ARROW-17930: [CI][C++] Valgrind failure in PrintValue<arrow::dataset::ScannerTestParams> (#14317)

Addresses [ARROW-17930](https://issues.apache.org/jira/browse/ARROW-17930).

Not terribly obvious, but this seems to fix the issue.

Authored-by: benibus <bpharks@gmx.com>
Signed-off-by: David Li <li.davidm96@gmail.com>
41 hours agoARROW-17687: ScanningStress test is flaky in CI (#14314)
Weston Pace [Tue, 4 Oct 2022 22:22:42 +0000 (12:22 -1000)] 
ARROW-17687: ScanningStress test is flaky in CI (#14314)

Fixed a race condition in AsyncTaskScheduler

Authored-by: Weston Pace <weston.pace@gmail.com>
Signed-off-by: Weston Pace <weston.pace@gmail.com>
45 hours agoARROW-17934: [R] Use tempfile instead of working directory for dataset test (#14315)
Dewey Dunnington [Tue, 4 Oct 2022 19:04:57 +0000 (16:04 -0300)] 
ARROW-17934: [R] Use tempfile instead of working directory for dataset test (#14315)

In recent checkouts I get the folder `tests/testthat/another_dataset` left over after running `devtools::test()`. This PR just uses a tempfile instead and cleans it up afterward!

Authored-by: Dewey Dunnington <dewey@fishandwhistle.net>
Signed-off-by: Dewey Dunnington <dewey@fishandwhistle.net>
46 hours agoARROW-17585: [Java] Update GenerateSampleData.java (#14289)
Larry White [Tue, 4 Oct 2022 17:23:21 +0000 (13:23 -0400)] 
ARROW-17585: [Java] Update GenerateSampleData.java (#14289)

Adds support for generating data for the four Uint FieldVectors

Authored-by: Larry White <lwhite1@users.noreply.github.com>
Signed-off-by: David Li <li.davidm96@gmail.com>
2 days agoMINOR: [C++][Docs] Fixing a typo in scanner MaterializedFields function docs (#14309)
Vibhatha Lakmal Abeykoon [Tue, 4 Oct 2022 12:31:25 +0000 (18:01 +0530)] 
MINOR: [C++][Docs] Fixing a typo in scanner MaterializedFields function docs (#14309)

There is a typo in one of the expressions; this PR fixes it.

Authored-by: Vibhatha Lakmal Abeykoon <vibhatha@gmail.com>
Signed-off-by: David Li <li.davidm96@gmail.com>
2 days agoARROW-17903: [JS] Update dependencies (#14285)
Dominik Moritz [Tue, 4 Oct 2022 12:18:23 +0000 (08:18 -0400)] 
ARROW-17903: [JS] Update dependencies (#14285)

I am not updating the flatbuffers dependency yet since it requires rebuilding the protobufs.

Authored-by: Dominik Moritz <domoritz@gmail.com>
Signed-off-by: Neal Richardson <neal.p.richardson@gmail.com>
2 days agoARROW-16879: [R][CI] Test R GCS bindings with testbench (#13542)
Will Jones [Tue, 4 Oct 2022 12:09:06 +0000 (05:09 -0700)] 
ARROW-16879: [R][CI] Test R GCS bindings with testbench (#13542)

This PR:

 * Moves minio integration tests into a generic suite that is now run on minio (S3 emulator) and GCS testbench (GCS emulator). This is run in CI.
 * Moves MinIO and GCS test server initialization into the tests. This makes it easier to set up the background processes in a cross-platform way.
 * MinIO and GCS tests are now run on R Ubuntu CI. MinIO is now run on Windows CI. I couldn't get GCS to run on Windows CI yet, due to some issue where the tests hang (I believe this is an issue with the test setup and not the functionality). See follow up at: [ARROW-17149: [R] Enable GCS tests for Windows](https://issues.apache.org/jira/browse/ARROW-17149)
 * Sets the default retry timeout to 15 seconds to mitigate the issue described by ARROW-17020. This affects explicitly created filesystems with `GcsFileSystem$create()` (and `gs_bucket()` introduced in #13601), but not URIs.

Authored-by: Will Jones <willjones127@gmail.com>
Signed-off-by: Neal Richardson <neal.p.richardson@gmail.com>
2 days agoARROW-17450 : [C++][Parquet] Support RLE decode for boolean datatype (#14147)
Nishanth Thimmegowda [Tue, 4 Oct 2022 07:00:31 +0000 (00:00 -0700)] 
ARROW-17450 : [C++][Parquet] Support RLE decode for boolean datatype (#14147)

Currently, parquet-cpp does not support boolean columns encoded with RLE. According to the [Parquet encodings documentation](https://parquet.apache.org/docs/file-format/data-pages/encodings/), RLE is used for only three kinds of data: repetition and definition levels, dictionary indices, and boolean values in data pages. Its use for boolean values is rare, but some implementations (Athena on AWS) do encode boolean columns this way. Although encoding and decoding are already supported for repetition and definition levels, there is no support for boolean columns with RLE.

This PR extends column scanning to support RLE-encoded columns. The first 4 bytes of the data hold the size of the encoded data; this size is parsed first and the encoded data is then passed to the decoder.
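
To make the layout concrete, here is a minimal Python sketch of decoding such a page. It is not the parquet-cpp implementation: the function names are hypothetical, and it assumes the layout from the Parquet encodings documentation, namely a 4-byte little-endian length prefix followed by RLE/bit-packed hybrid runs at bit width 1.

```python
import struct
from typing import List, Tuple


def read_uleb128(buf: bytes, pos: int) -> Tuple[int, int]:
    """Read an unsigned LEB128 varint; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not (byte & 0x80):
            return result, pos
        shift += 7


def decode_rle_booleans(page: bytes, num_values: int) -> List[bool]:
    """Decode a length-prefixed RLE/bit-packed hybrid run of 1-bit values."""
    # First 4 bytes: little-endian size of the encoded data that follows.
    (encoded_size,) = struct.unpack_from("<I", page, 0)
    data = page[4:4 + encoded_size]
    if len(data) != encoded_size:
        raise ValueError("page is shorter than its length prefix claims")

    values: List[bool] = []
    pos = 0
    while len(values) < num_values and pos < len(data):
        header, pos = read_uleb128(data, pos)
        if header & 1:
            # Bit-packed run: (header >> 1) groups of 8 values, one byte per
            # group at bit width 1, least significant bit first.
            for _ in range(header >> 1):
                byte = data[pos]
                pos += 1
                values.extend(bool((byte >> bit) & 1) for bit in range(8))
        else:
            # RLE run: one byte holds the value, repeated (header >> 1) times.
            value = bool(data[pos] & 1)
            pos += 1
            values.extend([value] * (header >> 1))
    if len(values) < num_values:
        raise ValueError("encoded data ended before num_values were decoded")
    return values[:num_values]  # drop padding from the last bit-packed group


# Hand-built page: an RLE run of ten 1s, then one bit-packed group 0b00000101.
page = struct.pack("<I", 4) + bytes([0x14, 0x01, 0x03, 0x05])
print(decode_rle_booleans(page, 13))
```

The caller has to supply `num_values` because a bit-packed run always covers a multiple of 8 values, so the last group may carry padding bits.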

Added two tests that use an RLE-encoded boolean Parquet file to validate that values can be parsed individually and in a batch.

Lead-authored-by: Nishanth Thimmegowda <nishanth.thimmegowda@snowflake.com>
Co-authored-by: sfc-gh-nthimmegowda <nishanth.thimmegowda@snowflake.com>
Co-authored-by: Ubuntu <ubuntu@ip-172-29-51-62.us-west-2.compute.internal>
Co-authored-by: Ubuntu <ubuntu@ip-172-29-51-79.us-west-2.compute.internal>
Co-authored-by: Ubuntu <ubuntu@ip-172-29-51-6.us-west-2.compute.internal>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
2 days agoARROW-17856: [CI][Archery] Add new Archery command to delete old branches and tags...
Raúl Cumplido [Mon, 3 Oct 2022 23:58:41 +0000 (01:58 +0200)] 
ARROW-17856: [CI][Archery] Add new Archery command to delete old branches and tags on crossbow repo (#14248)

I have tested the script locally like:
```
$ archery crossbow delete-old-branches --days 600
Total number of references to delete: 342
```

Before running, the remote repository had:
```
52,188 branches
50,866 tags
```
After running we have:
```
52,059 branches
50,768 tags
```

Authored-by: Raúl Cumplido <raulcumplido@gmail.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
2 days agoMINOR: [Dev] Fix typo and improve instructions in merge script (#14297)
Dominik Moritz [Mon, 3 Oct 2022 23:52:24 +0000 (19:52 -0400)] 
MINOR: [Dev] Fix typo and improve instructions in merge script (#14297)

I think it's best not to recommend using sudo to install pip dependencies. Also, fixed a typo.

Authored-by: Dominik Moritz <domoritz@gmail.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
2 days agoARROW-17016: [C++][Python] Move Arrow Python C++ tests into Cython (#14117)
Alenka Frim [Mon, 3 Oct 2022 16:45:00 +0000 (18:45 +0200)] 
ARROW-17016: [C++][Python] Move Arrow Python C++ tests into Cython (#14117)

This PR tries to connect the PyArrow C++ tests with PyArrow tests so they can all be run from `pytest`. This will remove GoogleTest as a dependency for PyArrow and therefore a change is needed in the C++ tests so they return a Status which can then be checked through Cython/Python.

Example of pytest run:

```
pyarrow/tests/test_cpp_internals.py::test_owned_ref_moves PASSED
pyarrow/tests/test_cpp_internals.py::test_owned_ref_nogil_moves PASSED
pyarrow/tests/test_cpp_internals.py::test_check_pyerror_status PASSED
pyarrow/tests/test_cpp_internals.py::test_check_pyerror_status_nogil PASSED
pyarrow/tests/test_cpp_internals.py::test_restore_pyerror_basics PASSED
pyarrow/tests/test_cpp_internals.py::test_pybuffer_invalid_input_object PASSED
pyarrow/tests/test_cpp_internals.py::test_pybuffer_numpy_array PASSED
pyarrow/tests/test_cpp_internals.py::test_numpybuffer_numpy_array PASSED
pyarrow/tests/test_cpp_internals.py::test_python_decimal_to_string PASSED
pyarrow/tests/test_cpp_internals.py::test_infer_precision_and_scale PASSED
pyarrow/tests/test_cpp_internals.py::test_infer_precision_and_negative_scale PASSED
pyarrow/tests/test_cpp_internals.py::test_infer_all_leading_zeros PASSED
pyarrow/tests/test_cpp_internals.py::test_infer_all_leading_zeros_exponential_notation_positive PASSED
pyarrow/tests/test_cpp_internals.py::test_infer_all_leading_zeros_exponential_notation_negative PASSED
pyarrow/tests/test_cpp_internals.py::test_object_block_write_fails PASSED
pyarrow/tests/test_cpp_internals.py::test_mixed_type_fails PASSED
pyarrow/tests/test_cpp_internals.py::test_from_python_decimal_rescale_not_truncateable PASSED
pyarrow/tests/test_cpp_internals.py::test_from_python_decimal_rescale_truncateable PASSED
pyarrow/tests/test_cpp_internals.py::test_from_python_negative_decimal_rescale PASSED
pyarrow/tests/test_cpp_internals.py::test_decimal128_from_python_integer PASSED
pyarrow/tests/test_cpp_internals.py::test_decimal256_from_python_integer PASSED
pyarrow/tests/test_cpp_internals.py::test_decimal128_overflow_fails PASSED
pyarrow/tests/test_cpp_internals.py::test_decimal256_overflow_fails PASSED
pyarrow/tests/test_cpp_internals.py::test_none_and_nan PASSED
pyarrow/tests/test_cpp_internals.py::test_mixed_precision_and_scale PASSED
pyarrow/tests/test_cpp_internals.py::test_mixed_precision_and_scale_sequence_convert PASSED
pyarrow/tests/test_cpp_internals.py::test_simple_inference PASSED
pyarrow/tests/test_cpp_internals.py::test_update_with_nan PASSED
```

Lead-authored-by: Alenka Frim <frim.alenka@gmail.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
2 days agoARROW-17888: [Docs] Add reference of the cookbook contrib page to New Contributor...
Alenka Frim [Mon, 3 Oct 2022 16:41:01 +0000 (18:41 +0200)] 
ARROW-17888: [Docs] Add reference of the cookbook contrib page to New Contributor's Guide (#14283)

This PR adds a reference for Apache Arrow Cookbook contributing info page to the New Contributor's Guide.

![Screenshot 2022-09-30 at 15 27 21](https://user-images.githubusercontent.com/16418547/193280252-9974705d-20f0-4833-bfa2-11c2ef4ed87d.png)

Lead-authored-by: Alenka Frim <AlenkaF@users.noreply.github.com>
Co-authored-by: Alenka Frim <frim.alenka@gmail.com>
Signed-off-by: Alenka Frim <frim.alenka@gmail.com>
3 days agoARROW-17857: [C++] Fix segfault in Table::CombineChunksToBatch (#14249)
David Li [Mon, 3 Oct 2022 16:00:03 +0000 (12:00 -0400)] 
ARROW-17857: [C++] Fix segfault in Table::CombineChunksToBatch (#14249)

Authored-by: David Li <li.davidm96@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
3 days agoARROW-17697: [Python] Fix Cython warning in types.pxi (#14280)
Miles Granger [Mon, 3 Oct 2022 15:56:32 +0000 (17:56 +0200)] 
ARROW-17697: [Python] Fix Cython warning in types.pxi (#14280)

Will fix [ARROW-17697](https://issues.apache.org/jira/browse/ARROW-17697)

- Turn warnings into errors
- Change signature of DataType.field as others like `StructType.field` allowed taking `int` or `str`.

Authored-by: Miles Granger <miles59923@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
3 days agoARROW-11841: [R][C++] Allow cancelling long-running commands (#13635)
Dewey Dunnington [Mon, 3 Oct 2022 14:44:19 +0000 (11:44 -0300)] 
ARROW-11841: [R][C++] Allow cancelling long-running commands (#13635)

Lead-authored-by: Dewey Dunnington <dewey@fishandwhistle.net>
Co-authored-by: Dewey Dunnington <dewey@voltrondata.com>
Signed-off-by: Dewey Dunnington <dewey@voltrondata.com>
3 days agoARROW-17287: [C++] Create scan node that doesn't rely on the merged generator (#13782)
Weston Pace [Mon, 3 Oct 2022 12:02:32 +0000 (02:02 -1000)] 
ARROW-17287: [C++] Create scan node that doesn't rely on the merged generator (#13782)

**Primary Goal:** Create a scanner that "cancels" properly.  In other words, when the scan node is marked finished then all scan-related thread tasks will be finished.  This is different than the current model where I/O tasks are allowed to keep parts of the scan alive via captures of shared_ptr state.

**Secondary Goal:** Remove our dependency on the merged generator and make the scanner more accessible.  The merged generator is complicated and does not support cancellation, and it is currently only understood by a very small set of people.

**Secondary Goal:** Add interfaces for schema evolution.  This wasn't originally a goal but arose from my attempt to codify and normalize what we are currently doing.  These interfaces should eventually allow for things like filling a missing field with a default value or using the parquet column id for field resolution.

Performance isn't a goal for this rework but ideally this should not degrade performance.

Authored-by: Weston Pace <weston.pace@gmail.com>
Signed-off-by: Weston Pace <weston.pace@gmail.com>
5 days agoARROW-17868: [C++][Python] Restore the ARROW_PYTHON CMake option (#14273)
Sutou Kouhei [Sat, 1 Oct 2022 11:46:17 +0000 (20:46 +0900)] 
ARROW-17868: [C++][Python] Restore the ARROW_PYTHON CMake option (#14273)

Restore the option, but mark it as deprecated because the Python component in Apache Arrow C++ was moved to PyArrow by ARROW-16340. It will be removed in a future release.

Users should use CMake presets instead of ARROW_PYTHON, but CMake presets require CMake 3.19 or later.

Lead-authored-by: Sutou Kouhei <kou@clear-code.com>
Co-authored-by: Sutou Kouhei <kou@cozmixng.org>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
5 days agoARROW-17355: [R] Refactor the handle_* utility functions for a better dev experience...
Nic Crane [Sat, 1 Oct 2022 08:13:06 +0000 (09:13 +0100)] 
ARROW-17355: [R] Refactor the handle_* utility functions for a better dev experience (#14030)

For context: these `handle_*` functions originally caught an error and, if it contained a particular string, raised an augmented error message with extra guidance for the user (and if not, raised the original error message).

This became problematic in a later PR where we wanted to test multiple conditions and only raise the original error if none of the conditions were met - the temporary approach was to move the responsibility for the raising of the original error to outside of the `handle_*` functions.  The issue here is that this makes it easy for developers to forget to add in this line of code.

The proposed solution here implements a generic error augmentation function `augment_io_error_msg()`, which tests all conditions, raises an error with an augmented message if any conditions are met, or raises the original error if not.
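
The pattern can be sketched in Python as follows. This is not the R implementation of `augment_io_error_msg()`: every name below (`AugmentedError`, `RULES`, `call_with_guidance`) is hypothetical, and the rule messages are made up for illustration. The point is only that all conditions are tested in one place and that the original error propagates when none of them match.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")

# A rule pairs a predicate on the original message with extra guidance.
Rule = Tuple[Callable[[str], bool], str]


class AugmentedError(RuntimeError):
    """Original error message plus guidance for the user."""


# Made-up rules for illustration only; these are not Arrow's real messages.
RULES: List[Rule] = [
    (lambda msg: "codec" in msg, "Hint: rebuild with the codec enabled."),
    (lambda msg: "schema" in msg, "Hint: pass an explicit schema."),
]


def call_with_guidance(func: Callable[[], T], rules: Sequence[Rule]) -> T:
    """Run func; on failure, test every rule against the error message."""
    try:
        return func()
    except Exception as err:
        message = str(err)
        hints = [hint for matches, hint in rules if matches(message)]
        if not hints:
            raise  # no condition met: the original error propagates
        raise AugmentedError("\n".join([message] + hints)) from err
```

Because the fallback `raise` lives inside the generic function, a developer who adds a new rule cannot forget to re-raise the original error.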

Authored-by: Nic Crane <thisisnic@gmail.com>
Signed-off-by: Nic Crane <thisisnic@gmail.com>
5 days agoARROW-17848: [R] Skip lubridate::format_ISO8601 tests until next release (#14282)
Neal Richardson [Sat, 1 Oct 2022 00:07:31 +0000 (20:07 -0400)] 
ARROW-17848: [R] Skip lubridate::format_ISO8601 tests until next release (#14282)

Code comment explains the story. If all CI passes, this has worked.

Authored-by: Neal Richardson <neal.p.richardson@gmail.com>
Signed-off-by: Neal Richardson <neal.p.richardson@gmail.com>
6 days agoMINOR: [CI] Use secrets for bucket name in preview-docs job (#14270)
Jacob Wujciak-Jens [Fri, 30 Sep 2022 15:54:24 +0000 (17:54 +0200)] 
MINOR: [CI] Use secrets for bucket name in preview-docs job (#14270)

For technical reasons we have to change the S3 bucket. To avoid having to change the workflow in the future, I have changed the environment variables so they use secrets.

SSL on bucket hosted static sites is non-trivial and not really necessary (it's a static site after all).

Here is a render with the changes:
```yml
      - name: Upload preview to S3
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.CROSSBOW_DOCS_AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.CROSSBOW_DOCS_AWS_SECRET_ACCESS_KEY }}
          AWS_DEFAULT_REGION: ${{ secrets.CROSSBOW_DOCS_S3_BUCKET_REGION }}
          BUCKET: ${{ secrets.CROSSBOW_DOCS_S3_BUCKET }}
        run: |
          aws s3 cp build/docs/ $BUCKET/pr_docs/master/ --recursive
          echo ":open_book: You can find the preview here: http://crossbow.voltrondata.com/pr_docs/master" >> $GITHUB_STEP_SUMMARY
```

Authored-by: Jacob Wujciak-Jens <jacob@wujciak.de>
Signed-off-by: Antoine Pitrou <antoine@python.org>
6 days agoARROW-17753: [Python][Docs] Document cleaning for fixing build environment issues...
Anja Kefala [Fri, 30 Sep 2022 13:28:31 +0000 (06:28 -0700)] 
ARROW-17753: [Python][Docs] Document cleaning for fixing build environment issues (#14260)

![Screenshot from 2022-09-27 17-18-05](https://user-images.githubusercontent.com/7489659/192659718-b4f77e41-8b00-48f3-bc1b-cfdaa36e31da.png)

Authored-by: anjakefala <anja@voltrondata.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
6 days agoARROW-17822: [C++][FlightRPC] Fix crash on invalid transport scheme (#14267)
David Li [Fri, 30 Sep 2022 11:37:47 +0000 (07:37 -0400)] 
ARROW-17822: [C++][FlightRPC] Fix crash on invalid transport scheme (#14267)

Authored-by: David Li <li.davidm96@gmail.com>
Signed-off-by: David Li <li.davidm96@gmail.com>
6 days agoARROW-17880: [Go] Add support for Decimal128 and Decimal256 to CSV writer (#14278)
Mitch [Fri, 30 Sep 2022 01:05:04 +0000 (21:05 -0400)] 
ARROW-17880: [Go] Add support for Decimal128 and Decimal256 to CSV writer (#14278)

Authored-by: Mitch Devenport <mitch@spice.ai>
Signed-off-by: Matt Topol <zotthewizard@gmail.com>
6 days agoARROW-17864: [Plasma][Ruby] Deprecate Plasma Ruby bindings (#14258)
Sutou Kouhei [Thu, 29 Sep 2022 21:09:15 +0000 (06:09 +0900)] 
ARROW-17864: [Plasma][Ruby] Deprecate Plasma Ruby bindings (#14258)

See discussion at
https://lists.apache.org/thread/nw232k2lzmg9kcl8ts475m9ybl34j81p

Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
6 days agoARROW-17862: [Plasma][GLib] Deprecate Plasma C GLib bindings (#14259)
Sutou Kouhei [Thu, 29 Sep 2022 21:08:57 +0000 (06:08 +0900)] 
ARROW-17862: [Plasma][GLib] Deprecate Plasma C GLib bindings (#14259)

See discussion at
https://lists.apache.org/thread/nw232k2lzmg9kcl8ts475m9ybl34j81p

Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
6 days agoARROW-17889: [CI] Remove Kartothek integration tests (#14274)
Raúl Cumplido [Thu, 29 Sep 2022 21:08:39 +0000 (23:08 +0200)] 
ARROW-17889: [CI] Remove Kartothek integration tests (#14274)

This PR removes the Kartothek integration tests as discussed on the Mailing List.
https://lists.apache.org/thread/bqj7mordgf53ctws5jl1kdr0l4671nlj

Authored-by: Raúl Cumplido <raulcumplido@gmail.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
6 days agoARROW-17550: [C++][CI][MinGW] Use system Python for GCS testbench (#14272)
Sutou Kouhei [Thu, 29 Sep 2022 21:07:25 +0000 (06:07 +0900)] 
ARROW-17550: [C++][CI][MinGW] Use system Python for GCS testbench (#14272)

We don't need to use MinGW Python because ARROW-16340 moved cpp/src/arrow/python/ to python/pyarrow/src/.

Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
6 days agoARROW-17480: [Java] add setNull() to FieldVector interface (#14244)
Larry White [Thu, 29 Sep 2022 19:07:36 +0000 (15:07 -0400)] 
ARROW-17480: [Java] add setNull() to FieldVector interface (#14244)

Implemented setNull for vector types where it was not supported (including the abstract ExtensionTypeVector class), and added it to the FieldVector interface.

The change makes it possible to call setNull() on arbitrary field vectors without casting.

Lead-authored-by: Larry White <ljw1001@gmail.com>
Co-authored-by: Larry White <lwhite1@users.noreply.github.com>
Signed-off-by: David Li <li.davidm96@gmail.com>
7 days agoARROW-17181: [Docs][Python] Scalar UDF Experimental Documentation (#13687)
Vibhatha Lakmal Abeykoon [Thu, 29 Sep 2022 08:41:46 +0000 (14:11 +0530)] 
ARROW-17181: [Docs][Python] Scalar UDF Experimental Documentation (#13687)

This PR contains Python Scalar UDF documentation as an experimental version of docs.
At the moment we only support Scalar UDFs and the code snippets include how to use
UDFs with PyArrow.

Lead-authored-by: Vibhatha Abeykoon <vibhatha@gmail.com>
Co-authored-by: Vibhatha Lakmal Abeykoon <vibhatha@users.noreply.github.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
7 days agoARROW-17512: [Doc] Updates to crossbow documentation for clarity (#13993)
lafiona [Thu, 29 Sep 2022 07:52:36 +0000 (03:52 -0400)] 
ARROW-17512: [Doc] Updates to crossbow documentation for clarity (#13993)

## Overview
While setting up a queue repository for testing changes to `crossbow`, we noticed some updates that can be made to help future developers set up their environment.

## Implementation
1. Clarify Travis CI auto-cancellation default behavior.
2. Fix broken links referenced by instructions.
3. Minor typos.

## Testing
1. Qualified by performing a directory level sphinx build and visually verifying the changes.

## Notes
Thank you for your help on this pull request, @kevingurney!

Lead-authored-by: Fiona La <fionala7@gmail.com>
Co-authored-by: Kevin Gurney <kgurney@mathworks.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
7 days agoARROW-17850: [Java] Upgrade netty + grpc + protobuf + jackson BOM versions (#14265)
david dali susanibar arce [Wed, 28 Sep 2022 22:17:13 +0000 (17:17 -0500)] 
ARROW-17850: [Java] Upgrade netty + grpc + protobuf + jackson BOM versions (#14265)

Authored-by: david dali susanibar arce <davi.sarces@gmail.com>
Signed-off-by: David Li <li.davidm96@gmail.com>
7 days agoARROW-15479 : [C++] Cast fixed size list to compatible fixed size list type (other...
Kshiteej K [Wed, 28 Sep 2022 22:16:12 +0000 (03:46 +0530)] 
ARROW-15479 : [C++] Cast fixed size list to compatible fixed size list type (other values type, other field name) (#14181)

Add kernel for FSL to FSL casting.

Authored-by: kshitij12345 <kshitijkalambarkar@gmail.com>
Signed-off-by: David Li <li.davidm96@gmail.com>
7 days agoARROW-17811: [Java][Doc] Added high-level documentation for Dictionary Encoding in...
Larry White [Wed, 28 Sep 2022 18:29:21 +0000 (14:29 -0400)] 
ARROW-17811: [Java][Doc] Added high-level documentation for Dictionary Encoding in Java (#14213)

Added some overview documentation to the ValueVector tutorial page, describing how dictionary encoding works in Arrow Java.

Lead-authored-by: Larry White <ljw1001@gmail.com>
Co-authored-by: Larry White <lwhite1@users.noreply.github.com>
Signed-off-by: David Li <li.davidm96@gmail.com>
7 days agoARROW-17847: [C++] Support unquoted decimal in JSON parser (#14242)
Jin Shang [Wed, 28 Sep 2022 16:52:56 +0000 (00:52 +0800)] 
ARROW-17847: [C++] Support unquoted decimal in JSON parser (#14242)

Support both quoted and unquoted decimal in JSON parser automatically.
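
As an illustration of what accepting both forms means, the sketch below uses Python's standard `json` module, not Arrow's C++ JSON parser; the helper names `to_decimal` and `read_decimal_field` are hypothetical.

```python
import json
from decimal import Decimal
from typing import Any


def to_decimal(value: Any) -> Decimal:
    """Convert a JSON value that is either a number or a string to Decimal."""
    if isinstance(value, Decimal):
        return value                      # unquoted number with a fraction
    if isinstance(value, str):
        return Decimal(value)             # quoted: "12.50"
    if isinstance(value, int) and not isinstance(value, bool):
        return Decimal(value)             # unquoted integer: 12
    raise TypeError(f"cannot read {value!r} as a decimal")


def read_decimal_field(line: str, field: str) -> Decimal:
    # parse_float=Decimal keeps the literal's digits instead of going through
    # a binary float, so 12.50 stays exactly 12.50.
    row = json.loads(line, parse_float=Decimal)
    return to_decimal(row[field])


print(read_decimal_field('{"price": 12.50}', "price"))
print(read_decimal_field('{"price": "12.50"}', "price"))
```

Both calls yield the same `Decimal` value, which is the behavior the commit describes for quoted and unquoted input.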

Authored-by: Jin Shang <shangjin1997@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
8 days agoARROW-17865: [Java] Deprecate Java Plasma JNI bindings (#14262)
david dali susanibar arce [Wed, 28 Sep 2022 15:03:42 +0000 (10:03 -0500)] 
ARROW-17865: [Java] Deprecate Java Plasma JNI bindings (#14262)

Deprecate Java Plasma JNI bindings based on https://lists.apache.org/thread/nw232k2lzmg9kcl8ts475m9ybl34j81p

Authored-by: david dali susanibar arce <davi.sarces@gmail.com>
Signed-off-by: David Li <li.davidm96@gmail.com>
8 days agoARROW-17875: [C++] Remove assorted pre-C++17 compatibility measures (#14263)
Antoine Pitrou [Wed, 28 Sep 2022 14:01:38 +0000 (16:01 +0200)] 
ARROW-17875: [C++] Remove assorted pre-C++17 compatibility measures (#14263)

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
8 days agoARROW-17854: [CI][Developer] Host preview docs on S3 (#14247)
Jacob Wujciak-Jens [Wed, 28 Sep 2022 13:35:42 +0000 (15:35 +0200)] 
ARROW-17854: [CI][Developer] Host preview docs on S3 (#14247)

Authored-by: Jacob Wujciak-Jens <jacob@wujciak.de>
Signed-off-by: Antoine Pitrou <antoine@python.org>
8 days agoMINOR: [C++] Update parquet-testing submodule (#14251)
Muthunagappan Muthuraman [Wed, 28 Sep 2022 11:14:01 +0000 (16:44 +0530)] 
MINOR: [C++] Update parquet-testing submodule (#14251)

Authored-by: Muthunagappan Muthuraman <m.muthuraman@snowflake.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
8 days agoARROW-16431: [C++][Python] Improve AppendRowGroups error when schemas differ (#14029)
Miles Granger [Wed, 28 Sep 2022 08:08:42 +0000 (10:08 +0200)] 
ARROW-16431: [C++][Python] Improve AppendRowGroups error when schemas differ (#14029)

Fix [ARROW-16431](https://issues.apache.org/jira/browse/ARROW-16431)

Feel free to opine on specific error messages or the implementation as a whole. 👌

Examples

```python
# meta1 and meta2 differ in column types
meta1.append_row_groups(meta2)
*** RuntimeError: AppendRowGroups requires equal schemas.
The two columns with index 0 differ.
column descriptor = {
  name: col1,
  path: col1,
  physical_type: INT64,
  converted_type: NONE,
  logical_type: None,
  max_definition_level: 1,
  max_repetition_level: 0,
}
column descriptor = {
  name: col2,
  path: col2,
  physical_type: INT64,
  converted_type: NONE,
  logical_type: None,
  max_definition_level: 1,
  max_repetition_level: 0,
}

# meta1 and meta2 differ in number of columns
meta1.append_row_groups(meta2)
*** RuntimeError: This schema has 2 columns, other has 1
```

Authored-by: Miles Granger <miles59923@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
8 days agoARROW-17427: [Java] Add Windows build script that produces DLLs (#14203)
Sutou Kouhei [Wed, 28 Sep 2022 03:14:22 +0000 (12:14 +0900)] 
ARROW-17427: [Java] Add Windows build script that produces DLLs (#14203)

Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
8 days agoMINOR: [C++] Remove nonexistent class from io/type_fwd.h (#14254)
David Li [Wed, 28 Sep 2022 02:33:19 +0000 (22:33 -0400)] 
MINOR: [C++] Remove nonexistent class from io/type_fwd.h (#14254)

Authored-by: David Li <li.davidm96@gmail.com>
Signed-off-by: Yibo Cai <yibo.cai@arm.com>
8 days agoARROW-17773: [CI][C++] Fix sccache error on Travis-CI Arm64 build (#14201)
Yibo Cai [Wed, 28 Sep 2022 02:14:59 +0000 (10:14 +0800)] 
ARROW-17773: [CI][C++] Fix sccache error on Travis-CI Arm64 build (#14201)

Authored-by: Yibo Cai <yibo.cai@arm.com>
Signed-off-by: Yibo Cai <yibo.cai@arm.com>
8 days agoARROW-17795: [C++][R] Add missing PKG_CONFIG_PATH to use system zstd (#14202)
Sutou Kouhei [Tue, 27 Sep 2022 20:34:58 +0000 (05:34 +0900)] 
ARROW-17795: [C++][R] Add missing PKG_CONFIG_PATH to use system zstd (#14202)

Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
8 days agoMINOR: [Doc][Java] Fix inline literals in "Ordering comparison" (#14252)
Leo Gertsenshteyn [Tue, 27 Sep 2022 20:21:44 +0000 (13:21 -0700)] 
MINOR: [Doc][Java] Fix inline literals in "Ordering comparison" (#14252)

An extra space before double-backticks was causing incorrect rendering of the list of literals.

Authored-by: Leo Gertsenshteyn <146586+leoger@users.noreply.github.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
8 days agoMINOR: [C++][Docs] Minor fixes to C++ API docs for AsyncTaskScheduler (#14245)
Percy Camilo Triveño Aucahuasi [Tue, 27 Sep 2022 16:35:35 +0000 (11:35 -0500)] 
MINOR: [C++][Docs] Minor fixes to C++ API docs for AsyncTaskScheduler (#14245)

Authored-by: Percy Camilo Triveño Aucahuasi <aucahuasi@users.noreply.github.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
8 days agoMINOR: [R] Import the missing `rlang::quo` function (#14091)
eitsupi [Tue, 27 Sep 2022 16:32:20 +0000 (01:32 +0900)] 
MINOR: [R] Import the missing `rlang::quo` function (#14091)

The `rlang::quo` function is used in lines added in PR #13786 (d5f80cbe2b2e8801127639b15fd24f829478ea84) but was not imported.

```r
mtcars |> arrow::arrow_table() |> dplyr::mutate(dplyr::across(starts_with("c"), as.character)) |> dplyr::collect()
#> Error in quo(!!call2(.x, sym(.y))) : could not find function "quo"
```

Authored-by: SHIMA Tatsuya <ts1s1andn@gmail.com>
Signed-off-by: Neal Richardson <neal.p.richardson@gmail.com>
9 days agoARROW-17669: [Go] Take Function kernels for Record batch, Tables and Chunked Arrays...
Matt Topol [Tue, 27 Sep 2022 14:35:47 +0000 (10:35 -0400)] 
ARROW-17669: [Go] Take Function kernels for Record batch, Tables and Chunked Arrays (#14214)

Authored-by: Matt Topol <zotthewizard@gmail.com>
Signed-off-by: Matt Topol <zotthewizard@gmail.com>
9 days agoARROW-17853: temporary revert fix for test_write_dataset_max_rows_per_file (#14246)
Joris Van den Bossche [Tue, 27 Sep 2022 13:40:15 +0000 (15:40 +0200)] 
ARROW-17853: temporary revert fix for test_write_dataset_max_rows_per_file (#14246)

Revert "ARROW-17614: [CI][Python] test test_write_dataset_max_rows_per_file is producing several nightly build failures (#14199)" (commit acd69f92ee92140f42b64714a348d0735a368931), as this is causing new failures in `test_write_dataset_s3_put_only`.

Authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
9 days agoARROW-17770: [C++][Gandiva] Fix const correctness of Gandiva projector Evaluate ...
Jin Shang [Tue, 27 Sep 2022 13:05:40 +0000 (21:05 +0800)] 
ARROW-17770: [C++][Gandiva] Fix const correctness of Gandiva projector Evaluate (#14165)

I was trying to figure out the thread safety of Gandiva projector evaluation, i.e., whether I can use a single Projector to evaluate multiple inputs concurrently. I assumed it isn't safe because the Evaluate function is not marked const. However, as far as I understand, the Evaluate function merely executes a compiled function on the input, which doesn't modify the projector's internal state, so it should be const.

Authored-by: Jin Shang <shangjin1997@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
9 days agoARROW-17320: [Python] Refine pyarrow.parquet API exposure (#14096)
Miles Granger [Tue, 27 Sep 2022 09:11:58 +0000 (11:11 +0200)] 
ARROW-17320: [Python] Refine pyarrow.parquet API exposure (#14096)

Fixes [ARROW-17320](https://issues.apache.org/jira/browse/ARROW-17320?page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel&focusedCommentId=17577330)

Added a deprecation for `_filters_to_expression` -> `filters_to_expression`, in https://github.com/apache/arrow/pull/14096/commits/c7fdff3b50ca0fe95ae4c771cee81d4d59aa98d0 let me know if that commit should be dropped. :)

Authored-by: Miles Granger <miles59923@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
9 days agoARROW-17846: [C++] Use `if constexpr` in CSV subsystem (#14241)
Antoine Pitrou [Tue, 27 Sep 2022 07:58:01 +0000 (09:58 +0200)] 
ARROW-17846: [C++] Use `if constexpr` in CSV subsystem (#14241)

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
9 days agoARROW-17842: [C++][CI] Use Brew installed clang for MacOS verify-rc (#14236)
Jin Shang [Tue, 27 Sep 2022 05:55:17 +0000 (13:55 +0800)] 
ARROW-17842: [C++][CI] Use Brew installed clang for MacOS verify-rc (#14236)

Same problem as https://github.com/apache/arrow/pull/14187#issuecomment-1253322948

Authored-by: Jin Shang <shangjin1997@gmail.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
9 days agoARROW-6858: [C++] Simplify transitive build option dependencies (#14224)
Sutou Kouhei [Tue, 27 Sep 2022 05:37:13 +0000 (14:37 +0900)] 
ARROW-6858: [C++] Simplify transitive build option dependencies (#14224)

This approach adds "depends" information to each (bool) build option and uses it to resolve transitive build option dependencies automatically. It implements a topological sort in CMake to resolve transitive dependencies.

Another approach was proposed in the associated Jira issue: a Python script that generates CMake code (a .cmake file) that handles transitive dependencies. Dependency information is written in the Python script.

I think that this approach is better than the other approach because:

* We can put option definitions and their dependencies into the same place. (We don't need to put them into .cmake and .py.)
* We don't need to regenerate a .cmake file when we update option dependencies.
* We can specify dependency information in a simple way. (We can just add "DEPENDS ARROW_XXX ARROW_YYY ..." to an option definition.)

Here are downsides of this approach:

* We need to maintain a topological sort implementation in CMake, because CMake doesn't expose the topological sort feature that it uses internally. But the topological sort algorithm is well known (Tarjan's algorithm was published in 1976) and its implementation in this approach has only 20+ CMake lines. I think that we can maintain it.
* This can't support complex conditions such as "ARROW_X AND NOT ARROW_Y". But we don't have any complex condition for now.
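
The dependency resolution described above can be illustrated outside CMake. The following is a minimal Python sketch of a depth-first topological sort, not the CMake code from this change: the `resolve_options` helper and the `DEPENDS` table are hypothetical, and the option names are the placeholders used in the text, not real dependency relationships.

```python
def resolve_options(depends, enabled):
    """Return the enabled options plus their transitive dependencies.

    Dependencies come before the options that need them, so the result
    can be processed in order. `depends` maps an option name to the
    list of options it depends on (hypothetical data, see below).
    """
    order = []
    state = {}  # option -> "visiting" while on the DFS stack, then "done"

    def visit(option):
        if state.get(option) == "done":
            return
        if state.get(option) == "visiting":
            raise ValueError(f"dependency cycle at {option}")
        state[option] = "visiting"
        for dep in depends.get(option, ()):
            visit(dep)
        state[option] = "done"
        order.append(option)

    for option in enabled:
        visit(option)
    return order


# Placeholder names taken from the "DEPENDS ARROW_XXX ARROW_YYY ..." example.
DEPENDS = {
    "ARROW_XXX": ["ARROW_YYY"],
    "ARROW_YYY": ["ARROW_ZZZ"],
}

print(resolve_options(DEPENDS, ["ARROW_XXX"]))
# ['ARROW_ZZZ', 'ARROW_YYY', 'ARROW_XXX']
```

Enabling one option turns on everything it needs, directly or indirectly, which is the behavior the "depends" information is meant to provide.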

Lead-authored-by: Sutou Kouhei <kou@clear-code.com>
Co-authored-by: Sutou Kouhei <kou@cozmixng.org>
Signed-off-by: Yibo Cai <yibo.cai@arm.com>
9 days agoARROW-17603: [C++][FlightRPC] Be verbose about failures when REQUIRE_TLSCREDENTIALSOP...
David Li [Mon, 26 Sep 2022 23:42:20 +0000 (19:42 -0400)] 
ARROW-17603: [C++][FlightRPC] Be verbose about failures when REQUIRE_TLSCREDENTIALSOPTIONS is on (#14034)

This option is often used on CI. When enabled, be verbose about failures to detect gRPC versions to hopefully make issues easier to debug.

Authored-by: David Li <li.davidm96@gmail.com>
Signed-off-by: David Li <li.davidm96@gmail.com>
9 days agoARROW-17736: [C++] Added a fallback name resolution mechanism to the Substrait produc...
Weston Pace [Mon, 26 Sep 2022 20:41:43 +0000 (10:41 -1000)] 
ARROW-17736: [C++] Added a fallback name resolution mechanism to the Substrait producer. (#14143)

This will only be used if the URI part of the function reference is empty or /.  This is for compatibility with Isthmus which still hasn't decided what to use for the URI portion.  Longer term we may remove this.  Or we may leave it in as a utility for developers of new functions that don't want to have to worry about the URI initially.

Authored-by: Weston Pace <weston.pace@gmail.com>
Signed-off-by: Weston Pace <weston.pace@gmail.com>
9 days agoARROW-17845: [CI][Conan] Re-enable Flight in Conan CI check (#14240)
Will Jones [Mon, 26 Sep 2022 20:01:43 +0000 (13:01 -0700)] 
ARROW-17845: [CI][Conan] Re-enable Flight in Conan CI check (#14240)

Authored-by: Will Jones <willjones127@gmail.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
9 days agoARROW-17844: [C++] Remove atomic shared_ptr compatibility functions (#14239)
Antoine Pitrou [Mon, 26 Sep 2022 17:20:24 +0000 (19:20 +0200)] 
ARROW-17844: [C++] Remove atomic shared_ptr compatibility functions (#14239)

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
9 days agoARROW-17840: [Java] Disable flaky JaCoCo coverage check (#14231)
David Li [Mon, 26 Sep 2022 17:02:49 +0000 (13:02 -0400)] 
ARROW-17840: [Java] Disable flaky JaCoCo coverage check (#14231)

The check doesn't add much value, and makes things flaky because whether a branch is covered or not can come down to chance based on where exactly an exception occurs.

Authored-by: David Li <li.davidm96@gmail.com>
Signed-off-by: David Li <li.davidm96@gmail.com>
9 days agoARROW-17614: [CI][Python] test test_write_dataset_max_rows_per_file is producing...
Weston Pace [Mon, 26 Sep 2022 16:24:43 +0000 (06:24 -1000)] 
ARROW-17614: [CI][Python] test test_write_dataset_max_rows_per_file is producing several nightly build failures (#14199)

Authored-by: Weston Pace <weston.pace@gmail.com>
Signed-off-by: Weston Pace <weston.pace@gmail.com>
10 days agoARROW-17794: [Java] Force delete jni lib file on JVM exit (#14189)
Jacky Lee [Mon, 26 Sep 2022 14:32:26 +0000 (22:32 +0800)] 
ARROW-17794: [Java] Force delete jni lib file on JVM exit (#14189)

Use `File.deleteOnExit` to delete the JNI lib file on JVM exit. `File.deleteOnExit` actually adds a shutdown hook to make sure the file is deleted.

Authored-by: jackylee-ch <lijunqing@baidu.com>
Signed-off-by: David Li <li.davidm96@gmail.com>
10 days agoARROW-14958: [C++][Python][FlightRPC] Implement Flight middleware for OpenTelemetry...
David Li [Mon, 26 Sep 2022 11:42:28 +0000 (07:42 -0400)] 
ARROW-14958: [C++][Python][FlightRPC] Implement Flight middleware for OpenTelemetry propagation (#11920)

Adds a client middleware that sends span/trace ID to the server, and a server middleware that gets the span/trace ID and starts a child span.

The middleware are available in builds without OpenTelemetry; they simply do nothing.

Authored-by: David Li <li.davidm96@gmail.com>
Signed-off-by: David Li <li.davidm96@gmail.com>
10 days agoMINOR: [C++] Use the array returned by TweakValidityBit (#14221)
Kshiteej K [Mon, 26 Sep 2022 11:38:21 +0000 (17:08 +0530)] 
MINOR: [C++] Use the array returned by TweakValidityBit (#14221)

TweakValidityBit returns a new Array so the calling function should use it.
https://github.com/apache/arrow/blob/6cc37cf2d1ba72c46b64fbc7ac499bd0d7296d20/cpp/src/arrow/testing/gtest_util.cc#L568-L579

Authored-by: kshitij12345 <kshitijkalambarkar@gmail.com>
Signed-off-by: David Li <li.davidm96@gmail.com>
10 days agoARROW-17824: [C++][Gandiva] Implement preallocation for variable length output buffer...
Jin Shang [Mon, 26 Sep 2022 05:06:41 +0000 (13:06 +0800)] 
ARROW-17824: [C++][Gandiva] Implement preallocation for variable length output buffer (#14230)

When the output type of an expression is of variable length, e.g. string, Gandiva would realloc the output buffer to make space for new outputs for each row. When the number of rows is high, some memory allocators perform poorly.

We can use a std::vector-like approach to amortize the allocation cost. First allocate some initial space depending on the input size. Each time we run out of space, double the buffer size. In the end shrink it to fit the actual size. Arrow string builder also uses this approach.
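
As an illustration of the growth strategy only, here is a small Python sketch. It is not Gandiva's implementation; the `GrowableBuffer` class and its `reallocations` counter are hypothetical names introduced to make the amortization visible.

```python
class GrowableBuffer:
    """Append-only byte buffer that doubles its capacity when full."""

    def __init__(self, initial_capacity):
        # Initial space, e.g. an estimate derived from the input size.
        self._data = bytearray(max(1, initial_capacity))
        self._size = 0
        self.reallocations = 0  # hypothetical counter, for illustration

    def append(self, value):
        needed = self._size + len(value)
        if needed > len(self._data):
            # Out of space: double the capacity until the value fits.
            capacity = len(self._data)
            while capacity < needed:
                capacity *= 2
            new_data = bytearray(capacity)
            new_data[:self._size] = self._data[:self._size]
            self._data = new_data
            self.reallocations += 1
        self._data[self._size:needed] = value
        self._size = needed

    def finish(self):
        # Shrink to the number of bytes actually written.
        return bytes(self._data[:self._size])


buf = GrowableBuffer(initial_capacity=4)
for _ in range(1000):
    buf.append(b"ab")
result = buf.finish()
print(len(result), buf.reallocations)
```

With doubling, the number of reallocations grows logarithmically with the output size, instead of one reallocation per row.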

Authored-by: Jin Shang <shangjin1997@gmail.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
10 days agoARROW-17823: [C++] Revert std::make_shared change for CUDA (#14233)
Sutou Kouhei [Mon, 26 Sep 2022 01:39:37 +0000 (10:39 +0900)] 
ARROW-17823: [C++] Revert std::make_shared change for CUDA (#14233)

This is a follow-up of #14216. We can't use std::make_shared for CUDA related classes because their constructors aren't public.

Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
11 days agoARROW-17830: [C++][Gandiva] Temporarily pin LLVM version on AppVeyor (#14228)
Jin Shang [Sun, 25 Sep 2022 10:51:58 +0000 (18:51 +0800)] 
ARROW-17830: [C++][Gandiva] Temporarily pin LLVM version on AppVeyor (#14228)

Temporarily pin the LLVM version on AppVeyor due to a bug in Conda's packaging of LLVM.

Authored-by: Jin Shang <shangjin1997@gmail.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
11 days agoARROW-17823: [C++] Prefer std::make_shared/std::make_unique over constructor with...
Jin Shang [Sat, 24 Sep 2022 20:16:36 +0000 (04:16 +0800)] 
ARROW-17823: [C++] Prefer std::make_shared/std::make_unique over constructor with new (#14216)

Advantages: readability, exception safety, and efficiency (only for shared_ptr).

Cases that don't apply: when calling a private/protected constructor within a class member function, make_shared/make_unique can't work.

Authored-by: Jin Shang <shangjin1997@gmail.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
13 days agoARROW-17814: [C++] Fix style (#14218)
Sutou Kouhei [Fri, 23 Sep 2022 15:39:54 +0000 (00:39 +0900)] 
ARROW-17814: [C++] Fix style (#14218)

This is a follow-up of #14204.

Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: David Li <li.davidm96@gmail.com>
13 days agoARROW-17785: [Java] Suppress flakiness from gRPC in JDBC driver tests (#14210)
David Li [Fri, 23 Sep 2022 13:51:21 +0000 (09:51 -0400)] 
ARROW-17785: [Java] Suppress flakiness from gRPC in JDBC driver tests (#14210)

I couldn't reproduce it, so I added a suppression instead.

In both cases, the error is that the server is uncontactable. That shouldn't happen, but I changed the tests to also bind to port 0 instead of using a potentially flaky free port finder.
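
For readers unfamiliar with the port 0 technique, here is a minimal Python sketch (the tests in this change are Java; `bind_ephemeral` is a hypothetical helper). Binding to port 0 asks the operating system to pick a free port, so the server holds the port from the moment it is chosen, rather than finding a free port first and binding later.

```python
import socket


def bind_ephemeral(host="127.0.0.1"):
    """Bind a listening TCP socket to an OS-assigned port.

    Returns the socket and the port number the OS selected.
    """
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind((host, 0))  # port 0: let the OS choose a free port
    server.listen()
    port = server.getsockname()[1]
    return server, port


server, port = bind_ephemeral()
print(f"listening on port {port}")
server.close()
```

A separate free-port finder leaves a window between discovering the port and binding it, during which another process can take the port.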

Authored-by: David Li <li.davidm96@gmail.com>
Signed-off-by: David Li <li.davidm96@gmail.com>
13 days agoARROW-17787: [Java] Fix Javadoc build (#14212)
David Li [Fri, 23 Sep 2022 13:50:40 +0000 (09:50 -0400)] 
ARROW-17787: [Java] Fix Javadoc build (#14212)

Don't document the javadocs to avoid errors when there is nothing documentable in the javadocs

Authored-by: David Li <li.davidm96@gmail.com>
Signed-off-by: David Li <li.davidm96@gmail.com>
13 days agoARROW-17810: [Java] Use jacoco-maven-plugin 0.8.8 for Java 18 support (#14197)
David Li [Fri, 23 Sep 2022 13:50:19 +0000 (09:50 -0400)] 
ARROW-17810: [Java] Use jacoco-maven-plugin 0.8.8 for Java 18 support (#14197)

Authored-by: David Li <li.davidm96@gmail.com>
Signed-off-by: David Li <li.davidm96@gmail.com>
13 days agoMINOR: [Dev] Fix actions/labeler's tag (#14215)
Sutou Kouhei [Fri, 23 Sep 2022 04:26:46 +0000 (13:26 +0900)] 
MINOR: [Dev] Fix actions/labeler's tag (#14215)

This is a follow-up of ARROW-17621 / #14155.

Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
13 days agoARROW-17818: [R] Skip duckdb test that is failing until the issue is resolved (#14209)
Dewey Dunnington [Fri, 23 Sep 2022 00:44:03 +0000 (21:44 -0300)] 
ARROW-17818: [R] Skip duckdb test that is failing until the issue is resolved (#14209)

This just skips the test to stop it from obscuring other failures until we diagnose it properly (ARROW-17818).

Authored-by: Dewey Dunnington <dewey@voltrondata.com>
Signed-off-by: Dewey Dunnington <dewey@fishandwhistle.net>
13 days agoARROW-17621: [CI] Audit workflows (#14155)
Jacob Wujciak-Jens [Thu, 22 Sep 2022 20:31:04 +0000 (22:31 +0200)] 
ARROW-17621: [CI] Audit workflows (#14155)

In this PR I:
- reduced the scope of the automatically generated `GITHUB_TOKEN` as much as possible (technically `contents:none` would be the minimum, but it is a bit unintuitive as it does not prevent checkout of public repos, so I set `contents:read` in those cases)
- updated all actions used to the newest version (checking for breaking changes; the only exception is actions/github-script, which remains on v3 for that reason -> follow up)
- moved the creation of envvars containing secrets as close to their usage as possible (-> the step they are used in); this duplicates them in workflows with multiple jobs but is safer.

I have opted **NOT** to pin the different actions by SHA, as recommended in some places, because the cons outweigh the possible protection in my opinion. The main danger with pinning tags or branches is that a malicious actor changes the commit the tag points to and exfiltrates secrets (either repository secrets or, in the case of private repos, code/IP) or takes some other damaging action like deleting branches, rewriting history, etc.

We only ever pass actions the `GITHUB_TOKEN`, which is ephemeral (deleted after the workflow is finished) and scope-limited, so exfiltration of that token would in the worst case allow an attacker to create/delete labels and PR comments as well as modify PR branches (if the submitter activated the checkbox for maintainer access). Actions cannot access secrets without the workflow author explicitly passing them as input (envvars might reveal them, though).

The Apache Org limits the actions that can be used in repos, so we only use well-known, allow-listed actions. While this does of course not prevent malicious actions, it reduces the risk substantially.

Pinning SHAs would mitigate these risks (provided the action at that sha was audited...) but would also necessitate regularly checking + re-auditing the actions as to not miss security patches in these actions (e.g. [here](https://github.com/matlab-actions/setup-matlab/releases/tag/v1.1.1)). IMHO that would be a considerable effort (+ needing real expertise in typescript/node to spot any malicious additions outside of blatant secret exfiltration or nuking) resulting in a small gain.

Lead-authored-by: Jacob Wujciak-Jens <jacob@wujciak.de>
Co-authored-by: assignUser <jacob@wujciak.de>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
13 days agoARROW-16226: [C++] Add better coverage for filesystem tell. (#14064)
Ben Harkins [Thu, 22 Sep 2022 16:56:22 +0000 (12:56 -0400)] 
ARROW-16226: [C++] Add better coverage for filesystem tell. (#14064)

Based on [ARROW-16226](https://issues.apache.org/jira/browse/ARROW-16226).

Adds coverage to GenericFileSystemTest::TestOpenInput(Stream|File) for validating Tell() and reads after seeking.

Authored-by: benibus <bpharks@gmx.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
13 days agoARROW-17814: [C++] Remove make_unique reimplementation (#14204)
Gang Wu [Thu, 22 Sep 2022 16:48:52 +0000 (00:48 +0800)] 
ARROW-17814: [C++] Remove make_unique reimplementation (#14204)

Use std::make_unique instead.

Authored-by: Gang Wu <ustcwg@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
13 days agoARROW-17604: [Docs][Java] Make it more obvious that --add-opens is required (#14066)
David Li [Thu, 22 Sep 2022 16:48:35 +0000 (12:48 -0400)] 
ARROW-17604: [Docs][Java] Make it more obvious that --add-opens is required (#14066)

- Improve docs about this
- Improve error message about this

  ```
  java.lang.RuntimeException: Failed to initialize MemoryUtil. Was Java started with `--add-opens=java.base/java.nio=ALL-UNNAMED`? (See https://arrow.apache.org/docs/java/install.html)
  at org.apache.arrow.memory.util.MemoryUtil.<clinit>(MemoryUtil.java:138)
  at org.apache.arrow.memory.DefaultAllocationManagerFactory.<clinit>(DefaultAllocationManagerFactory.java:31)
  ```

Authored-by: David Li <li.davidm96@gmail.com>
Signed-off-by: David Li <li.davidm96@gmail.com>
2 weeks agoARROW-17803: [C++] Use [[nodiscard]] (#14193)
Antoine Pitrou [Thu, 22 Sep 2022 14:59:10 +0000 (16:59 +0200)] 
ARROW-17803: [C++] Use [[nodiscard]] (#14193)

C++17 supports the standard attribute `[[nodiscard]]`, use that instead of the clang-specific `__attribute__((warn_unused_result))`.

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
2 weeks agoMINOR: [C++] Bump vendored cpplint version (#14206)
Antoine Pitrou [Thu, 22 Sep 2022 13:14:54 +0000 (15:14 +0200)] 
MINOR: [C++] Bump vendored cpplint version (#14206)

Fetched from upstream https://raw.githubusercontent.com/cpplint/cpplint/fa12a0bbdafa15291276ddd2a2dcd2ac7a2ce4cb/cpplint.py

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
2 weeks agoARROW-17792: [C++] Use lambda capture move construction (#14188)
Antoine Pitrou [Thu, 22 Sep 2022 12:57:02 +0000 (14:57 +0200)] 
ARROW-17792: [C++] Use lambda capture move construction (#14188)

Simplify some code using functors for the purpose of capturing a moved value.

I did not modify all such places, only where readability would seem to improve.

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
2 weeks agoARROW-17815: [Python] Warn, not error out, when SetSignalStopSource fails (#14205)
Antoine Pitrou [Thu, 22 Sep 2022 12:23:44 +0000 (14:23 +0200)] 
ARROW-17815: [Python] Warn, not error out, when SetSignalStopSource fails (#14205)

In complex scenarios, the global signal-receiving StopSource may have already been created.

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
2 weeks agoARROW-17790: [C++][Gandiva] Adapt to LLVM opaque pointer (#14187)
Jin Shang [Thu, 22 Sep 2022 08:40:42 +0000 (16:40 +0800)] 
ARROW-17790: [C++][Gandiva] Adapt to LLVM opaque pointer (#14187)

Starting from LLVM 13, LLVM IR has been shifting towards a unified opaque pointer type, i.e. pointers without pointee types. It has provided workarounds until LLVM 15. The temporary workarounds need to be replaced in order to support LLVM 15 and onwards. We need to supply the pointee type to the CreateGEP and CreateLoad methods.

For more background info, see https://llvm.org/docs/OpaquePointers.html and https://lists.llvm.org/pipermail/llvm-dev/2015-February/081822.html

Related issues:

https://issues.apache.org/jira/browse/ARROW-14363

https://issues.apache.org/jira/browse/ARROW-17728

https://issues.apache.org/jira/browse/ARROW-17775

Lead-authored-by: Jin Shang <shangjin1997@gmail.com>
Co-authored-by: jinshang <jinshang@tencent.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
2 weeks agoARROW-17659: [Java] Populate JDBC schema name metadata when config.shouldIncludeMetad...
Igor Suhorukov [Wed, 21 Sep 2022 20:44:41 +0000 (23:44 +0300)] 
ARROW-17659: [Java] Populate JDBC schema name metadata when config.shouldIncludeMetadata provided (#14196)

The current implementation includes [catalog,table,column,type](https://github.com/apache/arrow/blob/master/java/adapter/jdbc/src/main/java/org/apache/arrow/adapter/jdbc/JdbcToArrowUtils.java#L248) metadata, but the schema metadata field is missing. In PostgreSQL terms, a catalog is a database and a schema is a namespace inside a database, so the catalog name is insufficient for addressing a table without the schema.

The proposed change is: + metadata.put(Constants.SQL_SCHEMA_KEY, rsmd.getSchemaName(i));

Authored-by: igor.suhorukov <igor.suhorukov@gmail.com>
Signed-off-by: David Li <li.davidm96@gmail.com>
2 weeks agoARROW-17800: [C++] Fix failures in jemalloc stats tests (#14194)
Antoine Pitrou [Wed, 21 Sep 2022 20:07:09 +0000 (22:07 +0200)] 
ARROW-17800: [C++] Fix failures in jemalloc stats tests (#14194)

- Provide compatibility for 32-bit platforms
- Avoid memory leak in tests
- Make checks less strict

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
2 weeks agoARROW-17804: [Go][CSV] Add Date32 and Time32 parsers (#14192)
Matt Topol [Wed, 21 Sep 2022 19:20:56 +0000 (15:20 -0400)] 
ARROW-17804: [Go][CSV] Add Date32 and Time32 parsers (#14192)

Given the recent addition of schema inference for CSVs, the reader should be able to parse all the types that can be inferred.

Authored-by: Matt Topol <zotthewizard@gmail.com>
Signed-off-by: Matt Topol <zotthewizard@gmail.com>
2 weeks agoARROW-17696: [C++] arrow-compute-asof-join-node-test inordinately slow (#14190)
rtpsw [Wed, 21 Sep 2022 15:25:36 +0000 (18:25 +0300)] 
ARROW-17696: [C++] arrow-compute-asof-join-node-test inordinately slow (#14190)

See https://issues.apache.org/jira/browse/ARROW-17696

Authored-by: Yaron Gvili <rtpsw@hotmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
2 weeks agoARROW-17169: [Go][Parquet] Panic in bitmap writer with Nullable List of Struct (...
Matt Topol [Wed, 21 Sep 2022 15:05:57 +0000 (11:05 -0400)] 
ARROW-17169: [Go][Parquet] Panic in bitmap writer with Nullable List of Struct (#14183)

When building the Nullable List of Struct column for record reading we didn't account for the worst-case scenario for building the final array. We need to handle the upper bound case of `offsetData[validityIO.Read]`+`validityIO.NullCount`

Authored-by: Matt Topol <zotthewizard@gmail.com>
Signed-off-by: Matt Topol <zotthewizard@gmail.com>
2 weeks agoARROW-17699: [R] Add better error message for if a non-schema passed into open_datase...
Nic Crane [Wed, 21 Sep 2022 14:46:58 +0000 (15:46 +0100)] 
ARROW-17699: [R] Add better error message for if a non-schema passed into open_dataset() (#14108)

Example below shows what happens if the `schema()` function is passed in as the `schema` argument via the `...` argument to `open_dataset()`.

Before (a later check for something else catches it, thus giving a misleading message):

```
library(dplyr)
library(arrow)
tf <- tempfile()
dir.create(tf)
write_dataset(mtcars, tf, format = "csv")
open_dataset(tf, format = "csv", schema = schema) %>% collect()
#> Error in `CsvFileFormat$create()`:
#> ! Values in `column_names` must match `schema` field names
#> ✖ `column_names` and `schema` field names match but are not in the same order
```

After (more accurate error message):

```
library(dplyr)
library(arrow)
tf <- tempfile()
dir.create(tf)
write_dataset(mtcars, tf, format = "csv")
open_dataset(tf, format = "csv", schema = schema) %>% collect()
#> Error in `CsvFileFormat$create()`:
#> ! `schema` must be an object of class 'Schema' not 'function'.
```

Authored-by: Nic Crane <thisisnic@gmail.com>
Signed-off-by: Nic Crane <thisisnic@gmail.com>
2 weeks agoARROW-16174: [Python] Fix FixedSizeListArray.flatten() on sliced input (#14000)
Miles Granger [Wed, 21 Sep 2022 14:37:49 +0000 (16:37 +0200)] 
ARROW-16174: [Python] Fix FixedSizeListArray.flatten() on sliced input (#14000)

[ARROW-16174](https://issues.apache.org/jira/browse/ARROW-16174)

Current behavior
```python
import pyarrow as pa

array = pa.array([[1], [2], [3]], type=pa.list_(pa.int64(), list_size=1))
array[2:].flatten().to_pylist()
# -> [1, 2, 3]
```

After this patch
```python
import pyarrow as pa

array = pa.array([[1], [2], [3]], type=pa.list_(pa.int64(), list_size=1))
array[2:].flatten().to_pylist()
# -> [3]
```

Authored-by: Miles Granger <miles59923@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>