Kaldiio
A pure python module for reading and writing kaldi ark files
Introduction
What are ark
and scp
?
kaldiio
is an IO utility implemented in pure Python language for several file formats used in kaldi, which are named asark
and scp
. ark
and scp
are used in in order to archive some objects defined in Kaldi, typically it is Matrix object of Kaldi.
In this section, we describe the basic concept of ark
and scp
. More detail about the File-IO in Kaldi-asr
: http://kaldi-asr.org/doc/io.html
Basic of File IO in kaldi: Ark and copy-feats
ark
is an archive format to save any Kaldi objects
. This library mainly support KaldiMatrix/KaldiVector
.
This ia an example of ark file of KaldiMatrix: ark file
If you have Kaldi
, you can convert it to text format as following
copy-feats ark:test.ark ark,t:text.ark
copy-feats
is designed to have high affinity with unix command line:
-
ark
can be flushed to and from unix pipe.
cat test.ark | copy-feats ark:- ark,t:- | less # Show the contents in the ark
-
indicates standard input stream or output stream.
-
Unix command can be used as read-specifier
and wspecifier
copy-feats ark:'gunzip -c some.ark.gz |' ark:some.ark
Scp file
scp
is a text file such as,
uttid1 /some/where/feats.ark:123
uttid2 /some/where/feats.ark:156
uttid3 /some/where/feats.ark:245
The first column, uttid1
, indicates the utterance id and the second, /some/where/feats.ark:123
, is the file path of matrix/vector of kaldi formats. The number after colon is a starting addressof the object of the file.
scp
looks very simple format, but has several powerful features.
-
Mutual conversion betweenark
and scp
copy-feats scp:foo.scp ark:foo.ark # scp -> ark
copy-feats ark:foo.ark ark,scp:bar.ark,bar.scp # ark -> ark,scp
-
Unix command can be used insead of direct file path
For example, the following scp file can be also used.
uttid1 cat /some/where/feats1.mat |
uttid2 cat /some/where/feats2.mat |
uttid3 cat /some/where/feats3.mat |
wav.scp
wav.scp
is a scp
to describe wave file paths.
uttid1 /some/path/a.wav
uttid2 /some/path/b.wav
uttid3 /some/path/c.wav
wav.scp
is also can be embeded unix command as normal scp file. This is often used for converting file format in kaldi recipes.
uttid1 sph2pipe -f wav /some/path/a.wv1 |
uttid2 sph2pipe -f wav /some/path/b.wv1 |
uttid3 sph2pipe -f wav /some/path/c.wv1 |
Features
Kaldiio supports:
- Read/Write for archive formats: ark, scp
- Binary/Text - Float/Double Matrix: DM, FM
- Binary/Text - Float/Double Vector: DV, FV
- Compressed Matrix for loading: CM, CM2, CM3
- Compressed Matrix for writing: All compressoin_method are supported: 1,2,3,4,5,6,7
- Binary/Text for Int-vector, typically used for
ali
files.
- Read/Write via a pipe: e.g. "ark: cat feats.ark |"
- Read wav.scp / wav.ark
- (New!) Some extended ark format not supported in Kaldi originally.
- The ark file for numpy, pickle, wav, flac files.
The followings are not supported
- Write in existing scp file
- NNet2/NNet3 egs
- Lattice file
Similar projects
- Python-C++ binding
- Pure Python
Install
pip install kaldiio
Usage
kaldiio
doesn't distinguish the API for each kaldi-objects, i.e.
Kaldi-Matrix
, Kaldi-Vector
, not depending on whether it is binary or text, or compressed or not,
can be handled by the same API.
ReadHelper
ReadHelper
supports sequential accessing for scp
or ark
. If you need to access randomly, then use kaldiio.load_scp
.
from kaldiio import ReadHelper
with ReadHelper('scp:file.scp') as reader:
for key, numpy_array in reader:
...
from kaldiio import ReadHelper
with ReadHelper('ark: gunzip -c file.ark.gz |') as reader:
for key, numpy_array in reader:
...
with ReadHelper('ark: gunzip -c exp/tri3_ali/ali.*.gz |') as reader:
for key, numpy_array in reader:
...
from kaldiio import ReadHelper
with ReadHelper('scp:wav.scp') as reader:
for key, (rate, numpy_array) in reader:
...
- v2.11.0: Removed wav
option. You can load wav.scp
without any addtional argument.
- Read wav.scp with segments
from kaldiio import ReadHelper
with ReadHelper('scp:wav.scp', segments='segments') as reader
for key, (rate, numpy_array) in reader:
...
from kaldiio import ReadHelper
with ReadHelper('ark:-') as reader:
for key, numpy_array in reader:
...
WriteHelper
- Write matrices and vectors in a ark with scp
import numpy
from kaldiio import WriteHelper
with WriteHelper('ark,scp:file.ark,file.scp') as writer:
for i in range(10):
writer(str(i), numpy.random.randn(10, 10))
- Write in compressed matrix
import numpy
from kaldiio import WriteHelper
with WriteHelper('ark:file.ark', compression_method=2) as writer:
for i in range(10):
writer(str(i), numpy.random.randn(10, 10))
import numpy
from kaldiio import WriteHelper
with WriteHelper('ark,t:file.ark') as writer:
for i in range(10):
writer(str(i), numpy.random.randn(10, 10))
import numpy
from kaldiio import WriteHelper
with WriteHelper('ark:| gzip -c > file.ark.gz') as writer:
for i in range(10):
writer(str(i), numpy.random.randn(10, 10))
import numpy
from kaldiio import WriteHelper
with WriteHelper('ark:-') as writer:
for i in range(10):
writer(str(i), numpy.random.randn(10, 10))
- (New!) Extended ark format using numpy, pickle, soundfile
import numpy
from kaldiio import WriteHelper
with WriteHelper('ark:-', write_function="numpy") as writer:
writer("foo", numpy.random.randn(10, 10))
with WriteHelper('ark:-', write_function="pickle") as writer:
writer("foo", numpy.random.randn(10, 10))
with WriteHelper('ark:-', write_function="soundfile_flac") as writer:
writer("foo", numpy.random.randn(1000))
Note that soundfile
is an optional module and you need to install it to use this feature.
pip install soundfile
More low level API
WriteHelper
and ReadHelper
are high level wrapper of the following API to support kaldi style arguments.
load_ark
import kaldiio
d = kaldiio.load_ark('a.ark')
for key, numpy_array in d:
...
with open('a.ark') as fd:
for key, numpy_array in kaldiio.load_ark(fd):
...
from kaldiio import open_like_kaldi
with open_like_kaldi('gunzip -c file.ark.gz |', 'r') as f:
for key, numpy_array in kaldiio.load_ark(fd):
...
load_ark
can load both matrices of ark and vectors of ark and also, it can be both text and binary.
load_scp
load_scp
creates "lazy dict", i.e.
The data are loaded in memory when accessing the element.
import kaldiio
d = kaldiio.load_scp('a.scp')
for key in d:
numpy_array = d[key]
with open('a.scp') as fd:
kaldiio.load_scp(fd)
d = kaldiio.load_scp('data/train/wav.scp', segments='data/train/segments')
for key in d:
rate, numpy_array = d[key]
The object created by load_scp
is a dict-like object, thus it has methods of dict
.
import kaldiio
d = kaldiio.load_scp('a.scp')
d.keys()
d.items()
d.values()
'uttid' in d
d.get('uttid')
load_scp_sequential (from v2.13.0)
load_scp_sequential
creates "generator" as same as load_ark
.
If you don't need random-accessing for each elements
and use it just to iterate for whole data,
then this method possibly performs faster than load_scp
.
import kaldiio
d = kaldiio.load_scp_sequential('a.scp')
for key, numpy_array in d:
...
load_wav_scp
d = kaldiio.load_scp('wav.scp')
for key in d:
rate, numpy_array = d[key]
d = kaldiio.load_scp('data/train/wav.scp', segments='data/train/segments')
for key in d:
rate, numpy_array = d[key]
- v2.11.0:
load_wav_scp
is deprecated now. Use load_scp
.
load_mat
numpy_array = kaldiio.load_mat('a.mat')
numpy_array = kaldiio.load_mat('a.ark:1134')
rate, numpy_array = kaldiio.load_mat('a.wav')
load_mat
can load kaldi-matrix, kaldi-vector, and wave
save_ark
kaldiio.save_ark('b.ark', {'key': numpy_array, 'key2': numpy_array2})
kaldiio.save_ark('b.ark', {'key': numpy_array, 'key2': numpy_array2},
scp='b.scp')
import sys
kaldiio.save_ark(sys.stdout, {'key': numpy_array})
kaldiio.save_ark('a.ark', {'key': numpy_array, 'key2': numpy_array2})
kaldiio.save_ark('a.ark', {'key3': numpy_array3}, append=True)
from kaldiio import open_like_kaldi
with open_like_kaldi('| gzip a.ark.gz', 'w') as f:
kaldiio.save_ark(f, {'key': numpy_array})
kaldiio.save_ark(f, {'key2': numpy_array2})
save_mat
kaldiio.save_mat('a.mat', numpy_array)
save_mat
can save both kaldi-matrix and kaldi-vector
open_like_kaldi
kaldiio.open_like_kaldi
is a useful tool if you are familiar with Kaldi. This function can performs as following,
from kaldiio import open_like_kaldi
with open_like_kaldi('echo -n hello |', 'r') as f:
assert f.read() == 'hello'
with open_like_kaldi('| cat > out.txt', 'w') as f:
f.write('hello')
with open('out.txt', 'r') as f:
assert f.read() == 'hello'
import sys
with open_like_kaldi('-', 'r') as f:
assert f is sys.stdin
with open_like_kaldi('-', 'w') as f:
assert f is sys.stdout
For example, if there are gziped alignment file, then you can load it as:
from kaldiio import open_like_kaldi, load_ark
with open_like_kaldi('gunzip -c exp/tri3_ali/ali.*.gz |', 'rb') as f:
g = load_ark(f)
for k, numpy_array in g:
...
parse_specifier
from kaldiio import parse_specifier, open_like_kaldi, load_ark
rspecifier = 'ark:gunzip -c file.ark.gz |'
spec_dict = parse_specifier(rspecifier)
with open_like_kaldi(spec_dict['ark'], 'rb') as fark:
for key, numpy_array in load_ark(fark):
...