pytortilla
The file format behind TACO. 🫓


GitHub: https://github.com/tacofoundation/tortilla-python 🌐

PyPI: https://pypi.org/project/pytortilla/ 🛠️

Tortilla 🫓

Hello! I'm a Tortilla, a format to serialize your EO data 🤗.

pytortilla is a Python package that simplifies the creation and management of .tortilla files. These files encapsulate metadata, dataset information, and links to the relevant files in remote sensing and AI workflows.

This package is “re-exported” within tacotoolbox, specifically under tacotoolbox.tortilla. Therefore, by installing and using pytortilla, you can also leverage it from tacotoolbox.tortilla.
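For example, the following imports should be interchangeable (the exact re-export path is an assumption here; verify it against the tacotoolbox docs):

# Native import:
from pytortilla.datamodel import Sample, Samples

# Assumed re-exported path (verify against the tacotoolbox docs):
# from tacotoolbox.tortilla.datamodel import Sample, Samples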

Goals

  • Metadata handling: Defines classes (Sample, Samples) to describe and structure your data’s information.
  • Dataset structuring: Easily generate training, validation, and testing splits, and store them in .tortilla files.
  • Internal validation: Validate your dataset’s integrity (e.g., opening each file with rasterio).
  • Integration with Earth Engine (ee): Combines local data operations with GEE functionalities.
  • Unified usage with tacotoolbox: Load and manipulate these datasets with tacoreader and other helper functions from tacotoolbox.


Installation

pip install pytortilla

or from source:

git clone https://github.com/tacofoundation/tortilla-python.git
cd tortilla-python
pip install .

Note: You may also install it as part of tacotoolbox, where pytortilla is included as a dependency.

Usage guide

This guide walks through the creation of .tortilla files step by step, with tips and best practices along the way.

import pathlib
import rasterio
import pandas as pd
from sklearn.model_selection import train_test_split
import pytortilla

If you need Earth Engine:

import ee
ee.Initialize()  # Requires prior authentication if not done already
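If you have never authenticated on this machine, run the one-time credential flow first:

import ee

# One-time, browser-based authentication (stores credentials locally)
ee.Authenticate()

# Subsequent sessions only need this
ee.Initialize()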

Files

Move the files from Hugging Face to your local machine

import os

# URL path to the Hugging Face repository
path = "https://huggingface.co/datasets/tacofoundation/tortilla_demo/resolve/main/"

# List of demo files to download
files = [
    "demo/high__test__ROI_0010__20190125T112341_20190125T112624_T28QFG.tif",
    "demo/high__test__ROI_0011__20190130T103251_20190130T104108_T31REP.tif",
    "demo/high__test__ROI_0011__20190830T102029_20190830T102552_T31REP.tif",
    "demo/high__test__ROI_0064__20190317T015619_20190317T020354_T51JVH.tif",
    "demo/high__test__ROI_0120__20191219T045209_20191219T045214_T45TXE.tif",
    "demo/high__test__ROI_0141__20190316T141049_20190316T142437_T19FDE.tif",
    "demo/high__test__ROI_0159__20200403T143721_20200403T144642_T19HBV.tif",
    "demo/high__test__ROI_0235__20200402T053639_20200402T053638_T44UNV.tif"
]

# Create a local folder called 'demo' (if not already existing)
os.makedirs("demo", exist_ok=True)

# Download each file into the 'demo' folder (the paths in `files`
# already include the 'demo/' prefix)
for file in files:
    os.system(f"wget {path}{file} -O {file}")

Note: Depending on your environment, you might prefer using requests or urllib instead of os.system for downloading files.
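As a portable, dependency-free sketch, the same download can be done with the standard library (reusing the path and files variables from above):

import os
import urllib.request

# Create the target folder, then fetch each file with the standard library
os.makedirs("demo", exist_ok=True)

for file in files:
    urllib.request.urlretrieve(path + file, file)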

At this point, you should have a demo/ folder populated with several .tif files.

Creating samples

Now, we will create samples using pytortilla:

import pathlib
import pandas as pd
from sklearn.model_selection import train_test_split
import rasterio
from pytortilla.datamodel import Sample, Samples

# Define the local path containing the TIFF files
demo_path = pathlib.Path("./demo")

# Collect all .tif files in the demo folder
all_files = list(demo_path.glob("*.tif"))

# Split into train, val, and test
train_files, test_files = train_test_split(all_files, test_size=0.2, random_state=42)
train_files, val_files = train_test_split(train_files, test_size=0.2, random_state=42)

train_df = pd.DataFrame({"path": train_files, "split": "train"})
val_df = pd.DataFrame({"path": val_files, "split": "validation"})
test_df = pd.DataFrame({"path": test_files, "split": "test"})
dataset_full = pd.concat([train_df, val_df, test_df], ignore_index=True)

# Build a list of Sample objects
samples_list = []
for _, row in dataset_full.iterrows():
    with rasterio.open(row.path) as src:
        metadata = src.profile
        sample_obj = Sample(
            id=row.path.stem,
            path=str(row.path),
            file_format="GTiff",
            data_split=row.split,
            stac_data={
                "crs": str(metadata["crs"]),
                "geotransform": metadata["transform"].to_gdal(),
                "raster_shape": (metadata["height"], metadata["width"])
            }
        )
        samples_list.append(sample_obj)

samples_obj = Samples(samples=samples_list)

Validation and adding metadata

Validate each .tif file by trying to open it:

samples_obj.deep_validator(read_function=lambda x: rasterio.open(x))
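The lambda above leaves each dataset handle open. A slightly stricter sketch, assuming deep_validator simply calls the function on each file path and treats an exception as a validation failure, reads a band and closes the file cleanly:

def strict_read(path):
    # Open, read the first band, then close the dataset cleanly
    with rasterio.open(path) as src:
        src.read(1)

samples_obj.deep_validator(read_function=strict_read)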

If you need RAI metadata (or any other additional metadata) in your workflow, you can include it:

samples_obj = samples_obj.include_rai_metadata(
    sample_footprint=5120,  # Example footprint value
    cache=False,
    quiet=False
)

Generating the .tortilla file

Use pytortilla.create.main.create() (or the equivalent tacotoolbox.tortilla.create if you have tacotoolbox installed):

from pytortilla.create.main import create

# Generate the .tortilla file
output_file = create(
    samples=samples_obj,
    output="demo_dataset.tortilla"
)

print(f"Tortilla file generated: {output_file}")

For large datasets, the output may be split into multiple part files (demo_dataset.0000.part.tortilla, demo_dataset.0001.part.tortilla, and so on).
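To check which form you got, list the files that were actually written:

import glob

# Shows either the single output file or the numbered parts
print(sorted(glob.glob("demo_dataset*.tortilla")))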

Loading and using the .tortilla file

Finally, load the .tortilla file (or its parts) with tacoreader:

import os

import tacoreader
import pandas as pd

dataset_chunks = []

if os.path.exists("demo_dataset.tortilla"):
    # Single-file output
    dataset_chunks.append(tacoreader.load("demo_dataset.tortilla"))
else:
    # Multi-part output (assuming a maximum of 4 parts for this example)
    for i in range(4):
        part_file = f"demo_dataset.{i:04d}.part.tortilla"
        if not os.path.exists(part_file):
            break  # Stop at the first missing part
        dataset_chunks.append(tacoreader.load(part_file))

if dataset_chunks:
    dataset = pd.concat(dataset_chunks, ignore_index=True)
    print(dataset.head())
else:
    print("No tortilla files found.")
