The file format behind TACO. 🫓
GitHub: https://github.com/tacofoundation/tortilla-python 🌐
PyPI: https://pypi.org/project/pytortilla/ 🛠️
Tortilla 🫓
Hello! I'm a Tortilla, a format to serialize your EO data 🤗.
pytortilla is a Python package that simplifies the creation and management of .tortilla files. These files are designed to encapsulate metadata, dataset information, and links to relevant files in remote sensing or AI workflows.
This package is re-exported within tacotoolbox, specifically under tacotoolbox.tortilla. Therefore, by installing pytortilla, you can also leverage it from tacotoolbox.tortilla.
Goals
- Metadata handling: defines classes (Sample, Samples) to describe and structure your data's information.
- Dataset structuring: easily generate training, validation, and testing splits, and store them in .tortilla files.
- Internal validation: validate your dataset's integrity (e.g., by opening each file with rasterio).
- Integration with Earth Engine (ee): combines local data operations with GEE functionality.
- Unified usage with tacotoolbox: load and manipulate these datasets with tacoreader and other helper functions from tacotoolbox.
Installation
pip install pytortilla
or from source:
git clone https://github.com/tacofoundation/tortilla-python.git
cd tortilla-python
pip install .
Note: You may also install it as part of tacotoolbox, where pytortilla is included as a dependency.
Usage guide
In this guide, we walk through the step-by-step creation of .tortilla files, with tips and best practices along the way.
import pathlib
import rasterio
import pandas as pd
from sklearn.model_selection import train_test_split
import pytortilla
If you need Earth Engine:
import ee
ee.Initialize()
Files
Move the Files from Hugging Face to Your Local Machine
import os
path = "https://huggingface.co/datasets/tacofoundation/tortilla_demo/resolve/main/"
files = [
"demo/high__test__ROI_0010__20190125T112341_20190125T112624_T28QFG.tif",
"demo/high__test__ROI_0011__20190130T103251_20190130T104108_T31REP.tif",
"demo/high__test__ROI_0011__20190830T102029_20190830T102552_T31REP.tif",
"demo/high__test__ROI_0064__20190317T015619_20190317T020354_T51JVH.tif",
"demo/high__test__ROI_0120__20191219T045209_20191219T045214_T45TXE.tif",
"demo/high__test__ROI_0141__20190316T141049_20190316T142437_T19FDE.tif",
"demo/high__test__ROI_0159__20200403T143721_20200403T144642_T19HBV.tif",
"demo/high__test__ROI_0235__20200402T053639_20200402T053638_T44UNV.tif"
]
os.makedirs("demo", exist_ok=True)
for file in files:
    os.system(f"wget {path}{file} -O {file}")
Note: Depending on your environment, you might prefer using requests or urllib instead of os.system for downloading files.
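As a sketch of that alternative, the snippet below downloads the same files using only the Python standard library (urllib) instead of shelling out to wget. The remote_url and download helper names are ours for illustration, not part of pytortilla:

```python
import os
import urllib.request

BASE = "https://huggingface.co/datasets/tacofoundation/tortilla_demo/resolve/main/"

def remote_url(base: str, rel_path: str) -> str:
    """Join the dataset base URL with a relative file path."""
    return base + rel_path

def download(rel_paths, base=BASE):
    """Stream each remote file to a local path mirroring the remote layout."""
    for rel in rel_paths:
        os.makedirs(os.path.dirname(rel) or ".", exist_ok=True)
        urllib.request.urlretrieve(remote_url(base, rel), rel)
```

Calling download(files) then reproduces the wget loop above without requiring wget to be installed.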
At this point, you should have a demo/ folder populated with several .tif files.
Creating samples
Now, we will create samples using pytortilla:
import pathlib
import pandas as pd
from sklearn.model_selection import train_test_split
import rasterio
from pytortilla.datamodel import Sample, Samples
demo_path = pathlib.Path("./demo")
all_files = list(demo_path.glob("*.tif"))
train_files, test_files = train_test_split(all_files, test_size=0.2, random_state=42)
train_files, val_files = train_test_split(train_files, test_size=0.2, random_state=42)
train_df = pd.DataFrame({"path": train_files, "split": "train"})
val_df = pd.DataFrame({"path": val_files, "split": "validation"})
test_df = pd.DataFrame({"path": test_files, "split": "test"})
dataset_full = pd.concat([train_df, val_df, test_df], ignore_index=True)
samples_list = []
for _, row in dataset_full.iterrows():
    with rasterio.open(row.path) as src:
        metadata = src.profile
    sample_obj = Sample(
        id=row.path.stem,
        path=str(row.path),
        file_format="GTiff",
        data_split=row.split,
        stac_data={
            "crs": str(metadata["crs"]),
            "geotransform": metadata["transform"].to_gdal(),
            "raster_shape": (metadata["height"], metadata["width"])
        }
    )
    samples_list.append(sample_obj)

samples_obj = Samples(samples=samples_list)
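As a sanity check, the two chained train_test_split calls above yield roughly a 64% / 16% / 20% train/validation/test split: the first call holds out 20% for testing, and the second takes 20% of the remaining 80% for validation. A minimal sketch with dummy file names (no pytortilla needed):

```python
from sklearn.model_selection import train_test_split

# Dummy "files" standing in for the .tif paths above.
all_files = [f"file_{i}.tif" for i in range(100)]

# First split: 80 train+val, 20 test.
train, test = train_test_split(all_files, test_size=0.2, random_state=42)
# Second split: 20% of the remaining 80 becomes validation (16 overall).
train, val = train_test_split(train, test_size=0.2, random_state=42)

print(len(train), len(val), len(test))  # 64 16 20
```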
Validation and adding metadata
Validate each .tif file by trying to open it:
samples_obj.deep_validator(read_function=lambda x: rasterio.open(x))
If you need RAI metadata (or any other additional metadata) in your workflow, you can include it:
samples_obj = samples_obj.include_rai_metadata(
sample_footprint=5120,
cache=False,
quiet=False
)
Generating the .tortilla file
Use pytortilla.create.main.create() (or the equivalent tacotoolbox.tortilla.create if you have tacotoolbox installed):
from pytortilla.create.main import create
output_file = create(
samples=samples_obj,
output="demo_dataset.tortilla"
)
print(f"Tortilla file generated: {output_file}")
For large datasets, the .tortilla output may be split into multiple part files (.0000.part.tortilla, etc.).
Loading and using the .tortilla file
Finally, load the .tortilla file (or its parts) with tacoreader:
import tacoreader
import pandas as pd
dataset_chunks = []
for i in range(4):
    part_file = f"demo_dataset.{i:04d}.part.tortilla"
    try:
        dataset_part = tacoreader.load(part_file)
        dataset_chunks.append(dataset_part)
    except FileNotFoundError:
        break

if dataset_chunks:
    dataset = pd.concat(dataset_chunks, ignore_index=True)
    print(dataset.head())
else:
    print("No tortilla parts found.")
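If you don't know in advance how many parts were written, you can discover them with a glob instead of probing a fixed range. find_parts below is our own helper name, and it assumes the stem.NNNN.part.tortilla naming shown above:

```python
import pathlib

def find_parts(stem: str = "demo_dataset", directory: str = "."):
    """Return all tortilla part files for `stem`, sorted so part order is kept."""
    return sorted(pathlib.Path(directory).glob(f"{stem}.*.part.tortilla"))
```

Each returned path can then be passed to tacoreader.load() and the results concatenated exactly as in the loop above.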