uk.elementarysoftware:quick-csv-streamer


Quick CSV Parser with Java 8 Streams API

Quick CSV Streamer

Quick CSV Streamer is a high-performance CSV parsing library built on the Java 8 Stream API. The library operates in "zero-copy" mode and parses only what the client requires. The amount of garbage produced is also kept low, which reduces pressure on the garbage collector. Parallel, multi-core parsing is supported transparently via the Java Stream API.

Compared to other open source Java CSV parsing libraries, Quick CSV achieves speedups in the 2x to 10x range in sequential (single-threaded) mode. Parallel mode improves performance further. See the benchmarking results below for more details.

The library is limited to so-called "line-optimal" charsets such as UTF-8, US-ASCII, ISO-8859-1 and some others. In a line-optimal charset, line feed ('\n'), carriage return ('\r') and the CSV separator are easily distinguished from other encoded characters.
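One way to read the "line-optimal" property is that a byte-level scan for a line feed never produces a false match inside another character. The sketch below illustrates that reading with the JDK only; the class and method names are made up for this example and are not part of the library. It encodes a single character that is not a line feed and counts the bytes equal to 0x0A.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class LineOptimalCheck {

    /** Counts how many bytes in the encoded form of text equal the given byte value. */
    static int countByte(String text, Charset charset, int value) {
        int count = 0;
        for (byte b : text.getBytes(charset)) {
            if ((b & 0xFF) == value) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // U+010A (LATIN CAPITAL LETTER C WITH DOT ABOVE) is not a line feed.
        String text = "\u010A";
        int lineFeed = 0x0A;

        // UTF-8 encodes U+010A as C4 8A, so no byte equals 0x0A.
        System.out.println("UTF-8:    " + countByte(text, StandardCharsets.UTF_8, lineFeed));
        // UTF-16BE encodes U+010A as 01 0A, so a byte-level scan would see a false line feed.
        System.out.println("UTF-16BE: " + countByte(text, StandardCharsets.UTF_16BE, lineFeed));
    }
}
```

Under this reading UTF-8 qualifies because every byte of a multi-byte sequence has its high bit set, while UTF-16 does not.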

Maven dependency

Available from Maven Central:

<dependency>
    <groupId>uk.elementarysoftware</groupId>
    <artifactId>quick-csv-streamer</artifactId>
    <version>0.2.4</version>
</dependency>

Example usage

Suppose the following CSV file needs to be parsed:

Country,City,AccentCity,Region,Population,Latitude,Longitude
ad,andorra,Andorra,07,,42.5,1.5166667
gb,city of london,City of London,H9,,51.514125,-.093689
ua,kharkiv,Kharkiv,07,,49.980814,36.252718

First, define a Java class to represent the records:

public class City {
    private final String city;
    private final int population;
    private final double latitude;
    private final double longitude;

    ...
}

Here we source 4 fields from the file and ignore the other 3.

Parsing the file is simple:

import uk.elementarysoftware.quickcsv.api.*;

CSVParser<City> parser = CSVParserBuilder.aParser(City::new, City.CSVFields.class).forRfc4180().build();

The parser uses CSV separators as per RFC 4180 and the default encoding, and expects the header as the first record in the source. Custom separators, quotes, encodings and header sources are supported.

The actual mapping is done in the City constructor:

public class City {

    // Only the fields listed here are sourced from the CSV file.
    public static enum CSVFields {
        AccentCity,
        Population,
        Latitude,
        Longitude
    }

    // Populates the instance from one CSV record, looking fields up by enum value.
    public City(CSVRecordWithHeader<CSVFields> r) {
        this.city = r.getField(CSVFields.AccentCity).asString();
        this.population = r.getField(CSVFields.Population).asInt();
        this.latitude = r.getField(CSVFields.Latitude).asDouble();
        this.longitude = r.getField(CSVFields.Longitude).asDouble();
    }
}

First, the CSVFields enum specifies which fields should be sourced; only these fields are actually parsed. The CSVRecordWithHeader instance is then used to populate the City instance fields, referring to CSV fields by enum value.

Mapping can also be done outside the domain class constructor: just pass a different Function<CSVRecordWithHeader, City> to CSVParserBuilder.

The resulting stream can be processed in parallel or sequentially with the usual Java Stream API. For example, to parse sequentially on a single thread:

Stream<City> stream = parser.parse(source).sequential();
stream.forEach(System.out::println);    

By default the parser operates in parallel mode.
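Because the result is an ordinary java.util.stream.Stream, the usual operations apply regardless of mode. The following is a minimal sketch that runs without the library: the City class is a simplified stand-in, and a small in-memory list with made-up values takes the place of parser.parse(source).

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Stream;

public class CityStreamDemo {

    /** Simplified stand-in for the City class above; not the sample project's class. */
    static final class City {
        final String city;
        final int population;

        City(String city, int population) {
            this.city = city;
            this.population = population;
        }
    }

    /** Sums the population of all cities with a non-zero population. */
    static long totalPopulation(Stream<City> cities) {
        return cities
                .filter(c -> c.population > 0)
                .mapToLong(c -> c.population)
                .sum();
    }

    public static void main(String[] args) {
        // Made-up sample values standing in for the result of parser.parse(source).
        List<City> cities = Arrays.asList(
                new City("Alpha", 1000),
                new City("Beta", 0),
                new City("Gamma", 2500));

        // The same pipeline gives the same total in either mode.
        System.out.println(totalPopulation(cities.stream().sequential()));
        System.out.println(totalPopulation(cities.stream().parallel()));
    }
}
```

Stateless operations such as filter, mapToLong and sum are safe in both modes. Side effects inside forEach on a parallel stream run in no particular order.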

Please see the sample project for the full source code of the above example.

Special cases for headers

When the header contains special characters, the fields cannot simply be encoded as enum literals. In such cases toString should be overridden, for example:

enum Fields {
    Latitude("City Latitude"),
    Longitude("City Longitude"),
    City("City name"),
    Population("City Population");

    private final String headerFieldName;

    private Fields(String headerFieldName) {
        this.headerFieldName = headerFieldName;
    }

    @Override public String toString() {
        return headerFieldName;
    }
}
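To see why overriding toString is enough, here is a rough sketch of how header names could be matched to enum constants. This is an assumption about the general technique, not the library's actual implementation; HeaderLookupDemo, columnIndexes and the sample header are hypothetical.

```java
import java.util.Arrays;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

public class HeaderLookupDemo {

    enum Fields {
        Latitude("City Latitude"),
        Longitude("City Longitude"),
        City("City name"),
        Population("City Population");

        private final String headerFieldName;

        Fields(String headerFieldName) {
            this.headerFieldName = headerFieldName;
        }

        @Override public String toString() {
            return headerFieldName;
        }
    }

    /** Maps each enum constant to the index of the header column whose name equals its toString(). */
    static <E extends Enum<E>> Map<E, Integer> columnIndexes(Class<E> fields, List<String> header) {
        Map<E, Integer> indexes = new EnumMap<>(fields);
        for (E field : fields.getEnumConstants()) {
            int index = header.indexOf(field.toString());
            if (index < 0) {
                throw new IllegalArgumentException("Header has no column named '" + field + "'");
            }
            indexes.put(field, index);
        }
        return indexes;
    }

    public static void main(String[] args) {
        // Hypothetical header line, already split into column names.
        List<String> header = Arrays.asList(
                "City name", "Region", "City Population", "City Latitude", "City Longitude");
        System.out.println(columnIndexes(Fields.class, header));
    }
}
```

Columns that no enum constant names, such as "Region" here, are simply left out of the mapping.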

If the header is missing from the source, it can be supplied during parser construction:

CSVParserBuilder
    .aParser(City::new, City.CSVFields.class)
    .usingExplicitHeader("Country", "City", "AccentCity", "Region", "Population", "Latitude", "Longitude")
    .build();

Advanced usage

About a 10% performance improvement over normal usage can be achieved by referencing fields by position instead of by name. In this case parser construction is even simpler:

CSVParser<City> parser = CSVParserBuilder.aParser(City::new).build();

because the enumeration specifying field names is not needed. However, the constructor now uses the CSVRecord interface:

public City(CSVRecord r) {
    r.skipFields(2);                              // skip Country and City
    this.city = r.getNextField().asString();      // AccentCity
    r.skipField();                                // skip Region
    this.population = r.getNextField().asInt();
    this.latitude = r.getNextField().asDouble();
    this.longitude = r.getNextField().asDouble();
}

Effectively, this encodes the field order of the CSV source in the constructor.
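The skip counts have to match the column order exactly, which is easy to get wrong. The sketch below checks the counts against one of the sample rows using a hypothetical FieldCursor class. It only mimics the skip/next pattern with plain strings; it is not the library's CSVRecord and does not handle quoting.

```java
public class FieldCursorDemo {

    /** Hypothetical stand-in for positional access; it only mimics the skip/next pattern above. */
    static final class FieldCursor {
        private final String[] fields;
        private int position = 0;

        FieldCursor(String line) {
            // Limit -1 keeps trailing empty fields; quoting is deliberately not handled here.
            this.fields = line.split(",", -1);
        }

        void skipFields(int count) {
            position += count;
        }

        String nextField() {
            return fields[position++];
        }
    }

    public static void main(String[] args) {
        // Column order: Country,City,AccentCity,Region,Population,Latitude,Longitude
        FieldCursor r = new FieldCursor("ua,kharkiv,Kharkiv,07,,49.980814,36.252718");

        r.skipFields(2);                     // skip Country and City
        System.out.println(r.nextField());   // AccentCity
        r.skipFields(1);                     // skip Region
        System.out.println(r.nextField());   // Population (empty in this row)
        System.out.println(r.nextField());   // Latitude
        System.out.println(r.nextField());   // Longitude
    }
}
```

If a column is added to or removed from the source, every positional constructor has to be updated, whereas the header-based variant keeps working as long as the column names stay the same.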

Performance

The best way to check the performance of the library is to run the benchmark on your target system with

gradle jmh

Reports can then be found in build/reports/jmh.

It is important to appreciate that performance might vary dramatically depending on the actual CSV content. As a very rough guideline, see below the sample output of "gradle jmh" on an i7 2700k Ubuntu system, using a cities.txt similar to the example above, expanded to 3173800 rows and 157 MB in size:

Benchmark                        Mode  Cnt     Score      Error  Units
OpenCSVParser                    avgt    5  2393.921  ± 262.347  ms/op
Quick CSV Parallel with header   avgt    5   205.013  ±   1.739  ms/op
Quick CSV Parallel (advanced)    avgt    5   177.262  ±   1.739  ms/op
Quick CSV Sequential             avgt    5   648.462  ±  45.991  ms/op

The comparison is with the OpenCSV library v3.8. Performance of other libraries can be extrapolated using the chart from https://github.com/uniVocity/csv-parsers-comparison
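The relative speedups implied by the sample output above can be worked out directly from the average times. The small helper below does that arithmetic; the class and method names are illustrative only, and the numbers are copied from the sample output.

```java
public class SpeedupDemo {

    /** Ratio of baseline time to candidate time; values above 1 mean the candidate is faster. */
    static double speedup(double baselineMs, double candidateMs) {
        return baselineMs / candidateMs;
    }

    public static void main(String[] args) {
        // Average times in ms/op, copied from the sample output above.
        double openCsv = 2393.921;
        double parallelWithHeader = 205.013;
        double parallelAdvanced = 177.262;
        double sequential = 648.462;

        System.out.printf("Sequential vs OpenCSV:           %.1fx%n", speedup(openCsv, sequential));
        System.out.printf("Parallel with header vs OpenCSV: %.1fx%n", speedup(openCsv, parallelWithHeader));
        System.out.printf("Parallel advanced vs OpenCSV:    %.1fx%n", speedup(openCsv, parallelAdvanced));
        // Gain from positional access over header-based access in parallel mode.
        System.out.printf("Advanced vs with header:         %.2fx%n", speedup(parallelWithHeader, parallelAdvanced));
    }
}
```

For this data set the sequential mode comes out at roughly 3.7x faster than OpenCSV, and both parallel modes at more than 11x. These ratios apply to this machine and this file only.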

Prerequisites

The Quick CSV Streamer library requires Java 8; it has no other dependencies.

License

The library is licensed under the terms of the GPL v2.0 license. Please contact me if you wish to use this library under a more commercially friendly license or want to extend it, for example to add async parsing or support for different file formats.

Package last updated on 19 Aug 2017
