Apache Arrow 0.2.0 (2017-02-18)
Bug Fixes
- ARROW-112 - Changed constexprs to kValue naming.
- ARROW-202 - Integrate with appveyor ci for windows
- ARROW-220 - [C++] Build conda artifacts in a build environment with better cross-linux ABI compatibility
- ARROW-224 - [C++] Address static linking of boost dependencies
- ARROW-230 - Python: Do not name modules like native ones (i.e. rename pyarrow.io)
- ARROW-239 - Test reading remainder of file in HDFS with read() with no args
- ARROW-261 - Refactor String/Binary code paths to reflect unnested (non-list-based) structure
- ARROW-273 - Lists use unsigned offset vectors instead of signed (as defined in the spec)
- ARROW-275 - Add tests for UnionVector in Arrow File
- ARROW-294 - [C++] Do not use platform-dependent fopen/fclose functions for MemoryMappedFile
- ARROW-322 - [C++] Remove ARROW_HDFS option, always build the module
- ARROW-323 - [Python] Opt-in to pyarrow.parquet extension rather than attempting and failing silently
- ARROW-334 - [Python] Remove INSTALL_RPATH_USE_LINK_PATH
- ARROW-337 - UnionListWriter.list() is doing more than it should, this …
- ARROW-339 - [Dev] Lingering Python 3 fixes
- ARROW-339 - Python 3 compatibility in merge_arrow_pr.py
- ARROW-340 - [C++] Opening a writeable file on disk that already exists does not truncate to zero
- ARROW-342 - Set Python version on release
- ARROW-345 - libhdfs integration doesn't work for Mac
- ARROW-346 - Use conda environment to build API docs
- ARROW-348 - [Python] Add build-type command line option to setup.py, build CMake extensions in a build type subdirectory
- ARROW-349 - Add six as a requirement
- ARROW-351 - Time type has no unit
- ARROW-354 - Fix comparison of arrays of empty strings
- ARROW-357 - Use a single RowGroup for Parquet files as default.
- ARROW-358 - Add explicit environment variable to locate libhdfs in one's environment
- ARROW-362 - Fix memory leak in zero-copy arrow to NumPy/pandas conversion
- ARROW-371 - Handle pandas-nullable types correctly
- ARROW-375 - Fix unicode Python 3 issue in columns argument of parquet.read_table
- ARROW-384 - Align Java and C++ RecordBatch data and metadata layout
- ARROW-386 - [Java] Respect case of struct / map field names
- ARROW-387 - [C++] Verify zero-copy Buffer slices from BufferReader retain reference to parent Buffer
- ARROW-390 - Only specify dependencies for json-integration-test on ARROW_BUILD_TESTS=ON
- ARROW-392 - [C++/Java] String IPC integration testing / fixes. Add array / record batch pretty-printing
- ARROW-393 - [Java] JSON file reader fails to set the buffer size on String data vector
- ARROW-395 - Arrow file format writes record batches in reverse order.
- ARROW-398 - Java file format requires bitmaps of all 1's to be written…
- ARROW-399 - ListVector.loadFieldBuffers ignores the ArrowFieldNode len…
- ARROW-400 - set struct length on load
- ARROW-401 - Floating point vectors should do an approximate comparison…
- ARROW-402 - Fix reference counting issue with empty buffers. Close #232
- ARROW-403 - [Java] Create transfer pairs for internal vectors in UnionVector transfer impl
- ARROW-404 - [Python] Fix segfault caused by HdfsClient getting closed before an HdfsFile
- ARROW-405 - Use vendored hdfs.h if not found in include/ in $HADOOP_HOME
- ARROW-406 - [C++] Set explicit 64K HDFS buffer size, test large reads
- ARROW-408 - Remove defunct conda recipes
- ARROW-414 - [Java] "Buffer too large to resize to ..." error
- ARROW-420 - Align DATE type with Java implementation
- ARROW-421 - [Python] Retain parent reference in PyBytesReader
- ARROW-422 - IPC should depend on rapidjson_ep if RapidJSON is vendored
- ARROW-429 - Revert ARROW-379 until git-archive issues are resolved
- ARROW-433 - Correctly handle Arrow to Python date conversion for timezones west of London
- ARROW-434 - [Python] Correctly handle Python file objects in Parquet read/write paths
- ARROW-435 - Fix spelling of RAPIDJSON_VENDORED
- ARROW-437 - [C++] Fix clang compiler warning
- ARROW-445 - arrow_ipc_objlib depends on Flatbuffer generated files
- ARROW-447 - Always return unicode objects for UTF-8 strings
- ARROW-455 - [C++] Add dtor to BufferOutputStream that calls Close()
- ARROW-469 - C++: Add option so that resize doesn't decrease the capacity
- ARROW-481 - [Python] Fix 2.7 regression in Parquet path to open file code path
- ARROW-486 - [C++] Use virtual inheritance for diamond inheritance
- ARROW-487 - Python: ConvertTableToPandas segfaults if ObjectBlock::Write fails
- ARROW-494 - [C++] Extend lifetime of memory mapped data if any buffers reference it
- ARROW-499 - Update file serialization to use the streaming serialization format.
- ARROW-505 - [C++] Fix compiler warning in gcc in release mode
- ARROW-511 - Python: Implement List conversions for single arrays
- ARROW-513 - [C++] Fixing Appveyor / MSVC build
- ARROW-516 - Building pyarrow with parquet
- ARROW-519 - [C++] Refactor array comparison code into a compare.h / compare.cc in part to resolve Xcode 6.1 linker issue
- ARROW-523 - Python: Account for changes in PARQUET-834
- ARROW-533 - [C++] arrow::TimestampArray / TimeArray has a broken constructor
- ARROW-535 - [Python] Add type mapping for NPY_LONGLONG
- ARROW-537 - [C++] Do not compare String/Binary data in null slots when comparing arrays
- ARROW-540 - [C++] Build fixes after ARROW-33, PARQUET-866
- ARROW-543 - C++: Lazily computed null_counts counts number of non-null entries
- ARROW-544 - [C++] Test writing zero-length record batches, zero-length BinaryArray fixes
- ARROW-545 - [Python] Ignore non .parq/.parquet files when reading directories as Parquet datasets
- ARROW-548 - [Python] Add nthreads to Filesystem.read_parquet and pass through
- ARROW-551 - C++: Construction of Column with nullptr Array segfaults
- ARROW-556 - [Integration] Configure C++ integration test executable with a single environment variable. Update README
- ARROW-561 - [Java][Python] Update java & python dependencies to improve downstream packaging experience
- ARROW-562 - Mockito should be in test scope
New Features and Improvements
- ARROW-33 - [C++] Implement zero-copy array slicing, integrate with IPC code paths
- ARROW-81 - [Format] Augment dictionary encoding metadata to accommodate additional use cases
- ARROW-96 - Add C++ API documentation
- ARROW-97 - API documentation via sphinx-apidoc
- ARROW-108 - [C++] Add Union implementation and IPC/JSON serialization tests
- ARROW-189 - Build 3rd party with ExternalProject.
- ARROW-191 - Python: Provide infrastructure for manylinux1 wheels
- ARROW-221 - Add switch for writing Parquet 1.0 compatible logical types
- ARROW-227 - [C++/Python] Hook arrow_io generic reader / writer interface into arrow_parquet
- ARROW-228 - [Python] Create an Arrow-cpp-compatible interface for reading bytes from Python file-like objects
- ARROW-240 - Installation instructions for pyarrow
- ARROW-243 - [C++] Add option to switch between libhdfs and libhdfs3 when creating HdfsClient
- ARROW-268 - [C++] Flesh out union implementation to have all required methods for IPC
- ARROW-303 - [C++] Also build static libraries for leaf libraries
- ARROW-312 - [Java] IPC file round trip tool for integration testing
- ARROW-312 - Read and write Arrow IPC file format from Python
- ARROW-317 - Add Slice, Copy methods to Buffer
- ARROW-327 - [Python] Remove conda builds from Travis CI setup
- ARROW-328 - Return shared_ptr<T> by value instead of const-ref
- ARROW-330 - CMake functions to simplify shared / static library configuration
- ARROW-332 - Add RecordBatch.to_pandas method
- ARROW-333 - Make writers update their internal schema even when no data is written
- ARROW-335 - Improve Type apis and toString() by encapsulating flatbuffers better
- ARROW-336 - Run Apache Rat in Travis builds
- ARROW-338 - Implement visitor pattern for IPC loading/unloading
- ARROW-344 - Instructions for building with conda
- ARROW-350 - Added Kerberos to HDFS client
- ARROW-353 - Arrow release 0.2
- ARROW-355 - Add tests for serialising arrays of empty strings to Parquet
- ARROW-356 - Add documentation about reading Parquet
- ARROW-359 - Document ARROW_LIBHDFS_DIR
- ARROW-360 - C++: Add method to shrink PoolBuffer using realloc
- ARROW-361 - Python: Support reading a column-selection from Parquet files
- ARROW-363 - [Java/C++] integration testing harness, initial integration tests
- ARROW-365 - Python: Provide Array.to_pandas()
- ARROW-366 - Java Dictionary Vector
- ARROW-367 - converter json <=> Arrow file format for Integration tests
- ARROW-368 - Added note for LD_LIBRARY_PATH in Python README
- ARROW-369 - [Python] Convert multiple record batches at once to Pandas
- ARROW-370 - Python: Pandas conversion from `datetime.date` columns
- ARROW-372 - json vector serialization format
- ARROW-373 - [C++] JSON serialization format for testing
- ARROW-374 - More precise handling of bytes vs unicode in Python API
- ARROW-377 - Python: Add support for conversion of Pandas.Categorical
- ARROW-379 - Use setuptools_scm for Python versioning
- ARROW-380 - [Java] optimize null count when serializing vectors
- ARROW-381 - [C++] Simplify primitive array type builders to use a default type singleton
- ARROW-382 - Extend Python API documentation
- ARROW-383 - [C++] Integration testing CLI tool
- ARROW-389 - Python: Write Parquet files to pyarrow.io.NativeFile objects
- ARROW-394 - [Integration] Generate test cases for numeric types, strings, lists, structs
- ARROW-396 - [Python] Add pyarrow.schema.Schema.equals
- ARROW-409 - [Python] Change record batches conversion to Table
- ARROW-410 - [C++] Add virtual Writeable::Flush
- ARROW-411 - [Java] Move compactor functions in Integration to a separate Validator module
- ARROW-415 - C++: Add Equals implementation to compare Tables
- ARROW-416 - C++: Add Equals implementation to compare Columns
- ARROW-417 - Add Equals implementation to compare ChunkedArrays
- ARROW-418 - [C++] Array / Builder class code reorganization, flattening
- ARROW-419 - [C++] Promote util/{status.h, buffer.h, memory-pool.h} to top level of arrow/ source directory
- ARROW-423 - Define BUILD_BYPRODUCTS for CMake 3.2+
- ARROW-425 - Add private API to get python Table from a C++ object
- ARROW-426 - Python: Conversion from pyarrow.Array to a Python list
- ARROW-427 - [C++] Implement dictionary array type
- ARROW-428 - [Python] Multithreaded conversion from Arrow table to pandas.DataFrame
- ARROW-430 - Improved version handling
- ARROW-432 - [Python] Construct precise pandas BlockManager structure for zero-copy DataFrame initialization
- ARROW-438 - [C++/Python] Implement zero-data-copy record batch and table concatenation.
- ARROW-440 - [C++] Support pkg-config
- ARROW-441 - [Python] Expose Arrow's file and memory map classes as NativeFile subclasses
- ARROW-442 - [Python] Inspect Parquet file metadata from Python
- ARROW-444 - [Python] Native file reads into pre-allocated memory. Some IO API cleanup / niceness
- ARROW-449 - Python: Conversion from pyarrow.{Table,RecordBatch} to a Python dict
- ARROW-450 - Fixes for PARQUET-818
- ARROW-456 - Add jemalloc based MemoryPool
- ARROW-457 - Python: Better control over memory pool
- ARROW-458 - [Python] Expose jemalloc MemoryPool
- ARROW-461 - [Python] Add Python interfaces to DictionaryArray data, pandas interop
- ARROW-463 - C++: Support jemalloc 4.x
- ARROW-466 - Add ExternalProject for jemalloc
- ARROW-467 - [Python] Run Python parquet-cpp unit tests in Travis CI
- ARROW-468 - Python: Conversion of nested data in pd.DataFrames
- ARROW-470 - [Python] Add "FileSystem" abstraction to access directories of files in a uniform way
- ARROW-471 - [Python] Enable ParquetFile to pass down separately-obtained file metadata
- ARROW-472 - [Python] Expose more C++ IO interfaces. Add equals methods to Parquet schemas. Pass Parquet metadata separately in reader
- ARROW-474 - [Java] Add initial version of streaming serialized format.
- ARROW-475 - [Python] Add support for reading multiple Parquet files as a single pyarrow.Table
- ARROW-476 - Add binary integration test fixture, add Java support
- ARROW-477 - [Java] Add support for second/microsecond/nanosecond timestamps in-memory and in IPC/JSON layer
- ARROW-478 - Consolidate BytesReader and BufferReader to accept PyBytes or Buffer
- ARROW-479 - Python: Test for expected schema in Pandas conversion
- ARROW-484 - Revise README to include more detail about software components
- ARROW-485 - [Java] Users are required to initialize VariableLengthVectors.offsetVector before calling VariableLengthVectors.mutator.getSafe
- ARROW-490 - Python: Update manylinux1 build scripts
- ARROW-495 - [C++] Implement streaming binary format, refactoring
- ARROW-497 - Integration harness for streaming file format
- ARROW-498 - [C++] Add command line utilities that convert between stream and file.
- ARROW-503 - [Python] Implement Python interface to streaming file format
- ARROW-506 - Java: Implement echo server for integration testing.
- ARROW-508 - [C++] Add basic threadsafety to normal files and memory maps
- ARROW-509 - [Python] Add support for multithreaded Parquet reads
- ARROW-512 - C++: Add method to check for primitive types
- ARROW-514 - [Python] Automatically wrap pyarrow.io.Buffer in BufferReader
- ARROW-515 - [Python] Add read_all methods to FileReader, StreamReader
- ARROW-521 - [C++] Track peak allocations in default memory pool
- ARROW-524 - provide apis to access nested vectors and buffers
- ARROW-525 - Python: Add more documentation to the package
- ARROW-527 - Remove drill-module.conf file
- ARROW-529 - Python: Add jemalloc and Python 3.6 to manylinux1 build
- ARROW-531 - Python: Document jemalloc, extend Pandas section, add Getting Involved
- ARROW-538 - [C++] Set up AddressSanitizer (ASAN) builds
- ARROW-546 - Python: Account for changes in PARQUET-867
- ARROW-547 - [Python] Add zero-copy slice methods to Array, RecordBatch
- ARROW-553 - C++: Faster valid bitmap building
- ARROW-558 - Add KEYS files