csv-dataset

CsvDataset helps to read a csv file and create descriptive and efficient input pipelines for deep learning.

CsvDataset iterates the records of the csv file in a streaming fashion, so the full dataset does not need to fit into memory.

Install

$ pip install csv-dataset

Usage

Suppose we have a csv file whose absolute path is filepath:

open_time,open,high,low,close,volume
1576771200000,7145.99,7150.0,7141.01,7142.33,21.094283
1576771260000,7142.89,7142.99,7120.7,7125.73,118.279931
1576771320000,7125.76,7134.46,7123.12,7123.12,41.03628
1576771380000,7123.74,7128.06,7117.12,7126.57,39.885367
1576771440000,7127.34,7137.84,7126.71,7134.99,25.138154
1576771500000,7134.99,7144.13,7132.84,7141.64,26.467308
...
from csv_dataset import (
    Dataset,
    CsvReader
)

dataset = Dataset(
    CsvReader(
        filepath,
        float,
        # Skip the first column (open_time) and pick only the following five
        indexes=[1, 2, 3, 4, 5],
        header=True
    )
).window(3, 1).batch(2)

for element in dataset:
    print(element)

The output of a single iteration (one batch of two windows, three rows each) looks like this:

[[[7145.99,  7150.0,   7141.01,  7142.33,   21.094283]
  [7142.89,  7142.99,  7120.7,   7125.73,  118.279931]
  [7125.76,  7134.46,  7123.12,  7123.12,   41.03628 ]]

 [[7142.89,  7142.99,  7120.7,   7125.73,  118.279931]
  [7125.76,  7134.46,  7123.12,  7123.12,   41.03628 ]
  [7123.74,  7128.06,  7117.12,  7126.57,   39.885367]]]

...

Dataset(reader: AbstractReader)

dataset.window(size: int, shift: int = None, stride: int = 1) -> self

Defines the window size, shift and stride.

The default window size is 1, which means the dataset is not windowed.

Parameter explanation

Suppose we have a raw data set

[ 1  2  3  4  5  6  7  8  9 ... ]

And the following is a window of (size=4, shift=3, stride=2)

          |-------------- size:4 --------------|
          |- stride:2 -|                       |
          |            |                       |
win 0:  [ 1            3           5           7  ] --------|-----
                                                       shift:3
win 1:  [ 4            6           8           10 ] --------|-----

win 2:  [ 7            9           11          13 ]

...
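
As an independent illustration (plain Python, not the library's implementation), the windows above can be reproduced like this:

def windows(data, size, shift, stride):
    # Each window takes `size` elements spaced `stride` apart;
    # the next window starts `shift` elements after the previous one.
    start = 0
    while start + (size - 1) * stride < len(data):
        yield [data[start + i * stride] for i in range(size)]
        start += shift

for win in windows(list(range(1, 14)), size=4, shift=3, stride=2):
    print(win)
# [1, 3, 5, 7]
# [4, 6, 8, 10]
# [7, 9, 11, 13]
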
dataset.batch(batch: int) -> self

Defines batch size.

The default batch size of the dataset is 1, which means each batch contains a single window.

If batch is 2:

batch 0:  [[ 1            3           5           7  ]
           [ 4            6           8           10 ]]

batch 1:  [[ 7            9           11          13 ]
           [ 10           12          14          16 ]]

...
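
Again as a standalone sketch (not the library's code), batching simply groups consecutive windows; this sketch drops a trailing incomplete batch, and the library's behaviour at the end of the file may differ:

def batches(windows, batch_size):
    # Group consecutive windows into batches of `batch_size`.
    for i in range(0, len(windows) - batch_size + 1, batch_size):
        yield windows[i:i + batch_size]

wins = [[1, 3, 5, 7], [4, 6, 8, 10], [7, 9, 11, 13], [10, 12, 14, 16]]
for batch in batches(wins, 2):
    print(batch)
# [[1, 3, 5, 7], [4, 6, 8, 10]]
# [[7, 9, 11, 13], [10, 12, 14, 16]]
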
dataset.get() -> Optional[np.ndarray]

Gets the data of the next batch
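
For example, the for loop in the Usage section can also be written with get() directly. This sketch assumes get() returns None once the csv file is exhausted (implied by the Optional return type):

dataset.reset()
while True:
    batch = dataset.get()
    if batch is None:      # assumed end-of-data signal
        break
    print(batch.shape)     # e.g. (2, 3, 5) for batch(2), window(3, ...) and 5 columns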

dataset.reset() -> self

Resets the dataset

dataset.read(amount: int, reset_buffer: bool = False)
  • amount the maximum length of data the dataset will read
  • reset_buffer if True, the dataset resets the data of the previous window held in the buffer

Reads multiple batches at a time

If reset_buffer is True, the next read will not reuse the existing data in the buffer, so the result will have no overlap with the last read.
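
A minimal sketch, reusing the dataset object from the Usage section:

dataset.reset()
first = dataset.read(4)                      # read with amount=4
second = dataset.read(4, reset_buffer=True)  # buffer is reset first, so no overlap with `first`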

dataset.reset_buffer() -> None

Resets the buffer, so that the next read will have no overlap with the last one

dataset.lines_need(reads: int) -> int

Calculates and returns how many lines of the underlying data are needed to read reads times

dataset.max_reads(max_lines: int) -> int | None

Calculates how many reads max_lines lines could afford

dataset.max_reads() -> int | None

Calculates how many reads the current reader could afford.

If max_lines of the current reader is unset, it returns None
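
A sketch combining these helpers with the max_lines setter documented below, using the filepath from the Usage section:

reader = CsvReader(filepath, float, indexes=[1, 2, 3, 4, 5], header=True)
dataset = Dataset(reader).window(3, 1).batch(2)

needed = dataset.lines_need(10)   # csv lines required to read 10 times
reader.max_lines = needed         # cap the reader at exactly that many lines
print(dataset.max_reads())        # with max_lines set, this should return an int instead of None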

CsvReader(filepath, dtype, indexes, **kwargs)

  • filepath str absolute path of the csv file
  • dtype Callable data type. We should only use float or int for this argument.
  • indexes List[int] column indexes to pick from the lines of the csv file
  • kwargs
    • header bool = False whether we should skip reading the header line.
    • splitter str = ',' the column splitter of the csv file
    • normalizer List[NormalizerProtocol] list of normalizers to normalize each column of data. A NormalizerProtocol must contain two methods, normalize(float) -> float to normalize the given datum and restore(float) -> float to restore a normalized datum; see the sketch after this list.
    • max_lines int = -1 the maximum number of lines of the csv file to read. Defaults to -1, which means no limit.
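
A hypothetical normalizer that satisfies the protocol described above (the class name, value ranges and column assignment are illustrative, not part of the library; presumably one normalizer is needed per selected column):

class MinMaxNormalizer:
    """Maps values from [lower, upper] into [0, 1] and restores them back."""

    def __init__(self, lower: float, upper: float):
        self._lower = lower
        self._span = upper - lower

    def normalize(self, value: float) -> float:
        return (value - self._lower) / self._span

    def restore(self, value: float) -> float:
        return value * self._span + self._lower

reader = CsvReader(
    filepath,
    float,
    indexes=[1, 2, 3, 4, 5],
    header=True,
    # hypothetical ranges: four price columns and one volume column
    normalizer=[MinMaxNormalizer(7000.0, 8000.0)] * 4 + [MinMaxNormalizer(0.0, 200.0)]
)
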
reader.reset()

Resets the reader position

property reader.max_lines

Gets max_lines

setter reader.max_lines = lines

Changes max_lines

reader.readline() -> list

Returns the converted value of the next line

property reader.lines

Returns the number of lines that have been read
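
For example, using the reader on its own with the csv file from the Usage section (the values in the comments assume that file's contents):

reader = CsvReader(filepath, float, indexes=[1, 2, 3, 4, 5], header=True)

row = reader.readline()   # [7145.99, 7150.0, 7141.01, 7142.33, 21.094283]
print(reader.lines)       # 1, since one line has been read
reader.reset()            # rewind to the beginning of the file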

License

MIT
