A Regular expression(Joni) parser plugin for Embulk
This is a regular expression parser plugin for Embulk.
It use Joni regular expression library.
The Joni is Java port of Oniguruma regexp library.
The Fluentd also use Joni/Oniguruma.
This plugin aim compatibility with fluentd regexp parser plugin format.
Overview
- Plugin type: parser
- Guess supported: yes (trivial)
Configuration
- type: Specify this parser as
joni_regexp
- columns: Specify column name and type. See below (array, required)
- stop_on_invalid_record: Stop bulk load transaction if a file includes invalid record (such as invalid timestamp) (boolean, default: false)
- default_timezone: Default timezone of the timestamp (string, default: UTC)
- default_timestamp_format: Default timestamp format of the timestamp (string, default:
%Y-%m-%d %H:%M:%S.%N %z
) - newline: Newline character (CRLF, LF or CR) (string, default: CRLF)
- charset: Character encoding (eg. ISO-8859-1, UTF-8) (string, default: UTF-8)
- format: Regular expression string Supported expression (string, required)
columns
- name: Name of the column (string, required)
- type: Type of the column (string, required)
- timezone: Timezone of the timestamp if type is timestamp (string, default: default_timestamp)
- format: Format of the timestamp if type is timestamp (string, default: default_format)
Example
in:
type: any file input plugin type
parser:
type: joni_regexp
columns:
- { name: host, type: string }
- { name: user, type: string }
- { name: time, type: timestamp, format: "%d/%b/%Y:%H:%M:%S %z" }
- { name: method, type: string }
- { name: path, type: string }
- { name: code, type: string }
- { name: size, type: string }
- { name: referer, type: string }
- { name: agent, type: string }
format: '^(?<host>[^ ]*) [^ ]* (?<user>[^ ]*) \[(?<time>[^\]]*)\] "(?<method>\S+)(?: +(?<path>[^ ]*) +\S*)?" (?<code>[^ ]*) (?<size>[^ ]*)(?: "(?<referer>[^\"]*)" "(?<agent>[^\"]*)")?$'
Guess
This plugin also support minimul guess
command.
The guess
command require type
and fomat
parameters.
seed.yml
example.
in:
type: file
path_prefix: example/test2.txt
parser:
type: joni_regexp
format: "(?<name>[^,]+),(?<birth>\\d{4}-\\d{2}-\\d{2}),(?<age>\\d+)"
out:
type: stdout
execute guess
command.
$ embulk guess -g joni_regexp config.yml -o guessed.yml
The guess
command read format
parameter and generate columns
.
in:
type: file
path_prefix: example/test2.txt
parser:
type: joni_regexp
format: (?<name>[^,]+),(?<birth>\d{4}-\d{2}-\d{2}),(?<age>\d+)
charset: UTF-8
newline: LF
columns:
- {name: name, type: string}
- {name: birth, type: string}
- {name: age, type: string}
out: {type: stdout}
Install
$ embulk gem install embulk-parser-joni
Build
$ ./gradlew gem # -t to watch change of files and rebuild continuously