
Research
2025 Report: Destructive Malware in Open Source Packages
Destructive malware is rising across open source registries, using delays and kill switches to wipe code, break builds, and disrupt CI/CD.
md-spreadsheet-parser
A robust, zero-dependency Python library for parsing, validating, and manipulating Markdown tables, including conversion from Excel to Markdown.
md-spreadsheet-parser elevates Markdown tables from simple text to first-class data structures. It offers a precise, zero-dependency engine to parse, validate, and manipulate tables with the ease of a spreadsheet and the power of Python.
Need a quick solution? Check out the Cookbook for copy-pasteable recipes (Excel conversion, Pandas integration, Markdown table manipulation, and more).
Features:
- Typed validation into dataclasses with automatic type conversion, including customizable boolean logic (I18N) and custom type converters.
- Full GFM alignment markers (:--, :--:, --:) and correct handling of pipes within inline code (`|`).
- Lightweight output via dict/TypedDict.
Install with pip install md-spreadsheet-parser.
Single Table
Parse a standard Markdown table into a structured object.
from md_spreadsheet_parser import parse_table
markdown = """
| Name | Age |
| --- | --- |
| Alice | 30 |
| Bob | 25 |
"""
result = parse_table(markdown)
print(result.headers)
# ['Name', 'Age']
print(result.rows)
# [['Alice', '30'], ['Bob', '25']]
Multiple Tables (Workbook)
Parse a file containing multiple sheets (sections). By default, it looks for # Tables as the root marker and ## Sheet Name for sheets.
from md_spreadsheet_parser import parse_workbook, MultiTableParsingSchema
markdown = """
# Tables
## Users
| ID | Name |
| -- | ---- |
| 1 | Alice|
## Products
| ID | Item |
| -- | ---- |
| A | Apple|
"""
# Use default schema
schema = MultiTableParsingSchema()
workbook = parse_workbook(markdown, schema)
for sheet in workbook.sheets:
    print(f"Sheet: {sheet.name}")
    for table in sheet.tables:
        print(table.rows)
Lookup API & Metadata
Retrieve sheets and tables directly by name, and access parsed metadata like descriptions.
from md_spreadsheet_parser import parse_workbook
markdown = """
# Tables
## Sales Data
### Q1 Results
Financial performance for the first quarter.
| Year | Revenue |
| ---- | ------- |
| 2023 | 1000 |
"""
workbook = parse_workbook(markdown)
# Access by name
sheet = workbook.get_sheet("Sales Data")
if sheet:
    # Retrieve table by name (from ### Header)
    table = sheet.get_table("Q1 Results")
    print(table.description)
    # "Financial performance for the first quarter."
    print(table.rows)
    # [['2023', '1000']]
Simple Scan Interface
If you want to extract all tables from a document regardless of its structure (ignoring sheets and headers), use scan_tables.
from md_spreadsheet_parser import scan_tables
markdown = """
| ID | Name |
| -- | ---- |
| 1 | Alice|
... text ...
| ID | Item |
| -- | ---- |
| A | Apple|
"""
# Returns a flat list of all tables found
tables = scan_tables(markdown)
print(len(tables)) # 2
File Loading Helpers
For convenience, you can parse directly from a file path (str or Path) or file-like object using the _from_file variants:
from md_spreadsheet_parser import parse_workbook_from_file
# Clean and easy
workbook = parse_workbook_from_file("data.md")
Available helpers:
- parse_table_from_file(path_or_file)
- parse_workbook_from_file(path_or_file)
- scan_tables_from_file(path_or_file)
The parser strictly adheres to GitHub Flavored Markdown (GFM) specifications for tables.
Column Alignment
Alignment markers in the separator row are parsed and preserved.
markdown = """
| Left | Center | Right |
| :--- | :----: | ----: |
| 1 | 2 | 3 |
"""
table = parse_table(markdown)
print(table.alignments)
# ["left", "center", "right"]
Pipes in Code & Escaping
Pipes (`|`) inside inline code spans (backticks) or escaped with `\` are correctly treated as content, not column separators.
markdown = """
| Code | Escaped |
| ----- | ------- |
| `a|b` | \| |
"""
table = parse_table(markdown)
# table.rows[0] == ["`a|b`", "|"]
The most powerful feature of this library is converting loose markdown tables into strongly-typed Python objects using dataclasses. This ensures your data is valid and easy to work with.
from dataclasses import dataclass
from md_spreadsheet_parser import parse_table, TableValidationError
@dataclass
class User:
    name: str
    age: int
    is_active: bool = True
markdown = """
| Name | Age | Is Active |
|---|---|---|
| Alice | 30 | yes |
| Bob | 25 | no |
"""
try:
    # Parse and validate in one step
    users = parse_table(markdown).to_models(User)
    for user in users:
        print(f"{user.name} is {user.age} years old.")
        # Alice is 30 years old.
        # Bob is 25 years old.
except TableValidationError as e:
    print(e)
Features:
- Automatic conversion to int, float, and bool using standard rules.
- Boolean parsing accepts true/false, yes/no, on/off, and 1/0. (See Advanced Type Conversion for customization.)
- Optional[T] support by converting empty strings to None.
For more advanced validation (email format, ranges, regex), you can use Pydantic models instead of dataclasses. This feature is enabled automatically if pydantic is installed.
from pydantic import BaseModel, Field, EmailStr
class User(BaseModel):
    name: str = Field(alias="User Name")
    age: int = Field(gt=0)
    email: EmailStr
# Automatically detects Pydantic model and uses it for validation
users = parse_table(markdown).to_models(User)
The parser respects Pydantic's alias and Field constraints.
Sometimes you don't want to define a full Dataclass or Pydantic model, or you have columns containing JSON strings.
Simple Dictionary Output
Convert tables directly to a list of dictionaries. Keys are derived from headers.
# Returns list[dict[str, Any]] (Values are raw strings)
rows = parse_table(markdown).to_models(dict)
print(rows[0])
# {'Name': 'Alice', 'Age': '30'}
TypedDict Support
Use TypedDict for lightweight type safety. The parser uses the type annotations to convert values automatically.
from typing import TypedDict
class User(TypedDict):
    name: str
    age: int
    active: bool
rows = parse_table(markdown).to_models(User)
print(rows[0])
# {'name': 'Alice', 'age': 30, 'active': True}
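Conceptually, this annotation-driven conversion can be sketched with `typing.get_type_hints` (an illustrative sketch, not the library's actual implementation):

```python
from typing import TypedDict, get_type_hints

class User(TypedDict):
    name: str
    age: int
    active: bool

def convert_row(cls, cells: dict) -> dict:
    # Look up the annotated target type for each column and convert the raw string
    hints = get_type_hints(cls)
    out = {}
    for key, raw in cells.items():
        target = hints.get(key, str)
        if target is bool:
            out[key] = raw.strip().lower() in ("true", "yes", "on", "1")
        elif target in (int, float):
            out[key] = target(raw)
        else:
            out[key] = raw
    return out

print(convert_row(User, {"name": "Alice", "age": "30", "active": "yes"}))
# {'name': 'Alice', 'age': 30, 'active': True}
```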
Column-Level JSON Parsing
If a field is typed as dict or list (in a Dataclass or Pydantic model), the parser automatically parses the cell value as JSON.
@dataclass
class Config:
    id: int
    metadata: dict  # Cell: '{"debug": true}' -> Parsed to dict
    tags: list      # Cell: '["a", "b"]' -> Parsed to list
# Pydantic models also work without Json[] wrapper
class ConfigModel(BaseModel):
    metadata: dict
Limitations:
- Cell values must be valid JSON (e.g., {"a": 1}). Malformed JSON raises a ValueError.
- to_models(dict) does not automatically parse inner JSON strings unless you use a custom schema. It only creates a shallow dictionary of strings.
This library is designed to be a bridge between Markdown and Data Science tools like Pandas.
Convert to DataFrame (Easiest Way)
The cleanest way to create a DataFrame is using to_models(dict). This returns a list of dictionaries that Pandas can ingest directly.
import pandas as pd
from md_spreadsheet_parser import parse_table
markdown = """
| Date | Sales | Region |
|------------|-------|--------|
| 2023-01-01 | 100 | US |
| 2023-01-02 | 150 | EU |
"""
table = parse_table(markdown)
# 1. Convert to list of dicts
data = table.to_models(dict)
# 2. Create DataFrame
df = pd.DataFrame(data)
# 3. Post-Process: Convert types (Pandas usually infers strings initially)
df["Sales"] = pd.to_numeric(df["Sales"])
df["Date"] = pd.to_datetime(df["Date"])
print(df.dtypes)
# Date datetime64[ns]
# Sales int64
# Region object
Convert from Type-Safe Objects
If you want to validate data before creating a DataFrame (e.g., ensuring "Sales" is an integer during parsing), use a dataclass and then convert to Pandas.
from dataclasses import dataclass, asdict
@dataclass
class SalesRecord:
    date: str
    amount: int
    region: str
# 1. Parse and Validate (Raises TableValidationError if invalid)
records = parse_table(markdown).to_models(SalesRecord)
# 2. Convert to DataFrame using asdict()
df = pd.DataFrame([asdict(r) for r in records])
# The 'amount' column is already int64 because validation handled conversion
print(df["amount"].dtype) # int64
JSON Export
All result objects (Workbook, Sheet, Table) have a .json property that returns a dictionary structure suitable for serialization.
import json
# Export entire workbook structure
print(json.dumps(workbook.json, indent=2))
Import Excel data (via TSV/CSV or openpyxl) with intelligent handling of merged cells and hierarchical headers.
[!NOTE] Importing from TSV/CSV text works with zero dependencies. Direct .xlsx file loading requires openpyxl (a user-managed optional dependency).
Basic Usage
See the Cookbook for more comprehensive recipes.
from md_spreadsheet_parser import parse_excel
# From TSV/CSV (Zero Dependency)
table = parse_excel("Name\tAge\nAlice\t30")
# From .xlsx (requires openpyxl)
import openpyxl
wb = openpyxl.load_workbook("data.xlsx")
table = parse_excel(wb.active)
Merged Header Handling
When Excel exports merged cells, they appear as empty cells. The parser automatically forward-fills these gaps:
Excel (merged headers):
┌──────────────────────────────┬────────┐
│ Category (3 cols)            │ Info   │
├─────────┬─────────┬──────────┼────────┤
│ A       │ B       │ C        │ D      │
└─────────┴─────────┴──────────┴────────┘
          ↓ parse_excel()
Markdown:
| Category | Category | Category | Info |
|----------|----------|----------|------|
| A | B | C | D |
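The forward-fill described above carries the last non-empty header cell to the right. A minimal sketch of that logic (`fill_merged_headers` here is an illustrative helper, not the library's function):

```python
def fill_merged_headers(cells):
    # Empty cells left behind by merged ranges inherit the value to their left
    filled, last = [], ""
    for cell in cells:
        if cell.strip():
            last = cell
        filled.append(last)
    return filled

print(fill_merged_headers(["Category", "", "", "Info"]))
# ['Category', 'Category', 'Category', 'Info']
```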
2-Row Hierarchical Headers
For complex headers with parent-child relationships, use ExcelParsingSchema(header_rows=2):
Excel (2-row header):
┌───────────────────┬───────────────────┐
│ Info              │ Metrics           │  ← Row 1 (Parent)
├─────────┬─────────┼─────────┬─────────┤
│ Name    │ ID      │ Score   │ Rank    │  ← Row 2 (Child)
├─────────┼─────────┼─────────┼─────────┤
│ Alice   │ 001     │ 95      │ 1       │
└─────────┴─────────┴─────────┴─────────┘
          ↓ parse_excel(schema=ExcelParsingSchema(header_rows=2))
Markdown:
| Info - Name | Info - ID | Metrics - Score | Metrics - Rank |
|-------------|-----------|-----------------|----------------|
| Alice | 001 | 95 | 1 |
Note: Currently supports up to 2 header rows. For deeper hierarchies, pre-process your data before parsing.
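The flattening shown above amounts to forward-filling the parent row and joining it with the child row. A sketch under the default " - " separator (illustrative helper, not the library's code):

```python
def flatten_headers(parent, child, sep=" - "):
    # Forward-fill the parent row, then join parent and child per column
    filled, last = [], ""
    for cell in parent:
        if cell.strip():
            last = cell
        filled.append(last)
    return [f"{p}{sep}{c}" for p, c in zip(filled, child)]

print(flatten_headers(["Info", "", "Metrics", ""], ["Name", "ID", "Score", "Rank"]))
# ['Info - Name', 'Info - ID', 'Metrics - Score', 'Metrics - Rank']
```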
Excel to Structured Objects (The "Killer" Feature)
Don't just convert to text: convert Excel directly to valid, type-safe Python objects in one step.
@dataclass
class SalesRecord:
    category: str
    item: str
    amount: int  # Automatic string-to-int conversion
# 1. Parse Excel (handles merged cells automatically)
# 2. Validate & Convert to objects
records = parse_excel(ws).to_models(SalesRecord)  # ws: an openpyxl worksheet
# Now you have clean, typed data
assert records[0].amount == 1000
Configuration
Use ExcelParsingSchema to customize parsing behavior:
from md_spreadsheet_parser import parse_excel, ExcelParsingSchema
schema = ExcelParsingSchema(
    header_rows=2,
    fill_merged_headers=True,
    header_separator=" / "
)
table = parse_excel(source, schema)
| Option | Default | Description |
|---|---|---|
| header_rows | 1 | Number of header rows (1 or 2). |
| fill_merged_headers | True | Forward-fill empty header cells. |
| header_separator | " - " | Separator for flattened 2-row headers. |
| delimiter | "\t" | Column separator for TSV/CSV. |
You can modify parsed objects and convert them back to Markdown strings using to_markdown(). This enables a complete "Parse -> Modify -> Generate" workflow.
from md_spreadsheet_parser import parse_table, ParsingSchema
markdown = "| A | B |\n|---|---| \n| 1 | 2 |"
table = parse_table(markdown)
# Modify data
table.rows.append(["3", "4"])
# Generate Markdown
# You can customize the output format using a schema
schema = ParsingSchema(require_outer_pipes=True)
print(table.to_markdown(schema))
# | A | B |
# | --- | --- |
# | 1 | 2 |
# | 3 | 4 |
Metadata Extraction Configuration
By default, the parser captures table names (level 3 headers) and descriptions. You can customize this behavior with MultiTableParsingSchema.
from md_spreadsheet_parser import MultiTableParsingSchema
schema = MultiTableParsingSchema(
    table_header_level=3,     # Treat ### Header as table name
    capture_description=True  # Capture text between header and table
)
# Pass schema to parse_workbook...
You can customize how string values are converted to Python objects by passing a ConversionSchema to to_models(). This is useful for internationalization (I18N) and handling custom types.
Internationalization (I18N): Custom Boolean Pairs
Configure which string pairs map to True/False (case-insensitive).
from md_spreadsheet_parser import parse_table, ConversionSchema
markdown = """
| User | Active? |
| --- | --- |
| Tanaka | はい |
| Suzuki | いいえ |
"""
# Configure "はい" -> True, "いいえ" -> False
schema = ConversionSchema(
    boolean_pairs=(("はい", "いいえ"),)
)
users = parse_table(markdown).to_models(User, conversion_schema=schema)
# Tanaka.active is True
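Pair-based boolean conversion amounts to scanning the configured pairs for a case-insensitive match. A sketch of the assumed behavior (not the library's actual code):

```python
def convert_bool(value, pairs):
    # Each pair is (true_word, false_word); matching is case-insensitive
    v = value.strip().lower()
    for true_word, false_word in pairs:
        if v == true_word.lower():
            return True
        if v == false_word.lower():
            return False
    raise ValueError(f"Cannot convert {value!r} to bool")

pairs = (("はい", "いいえ"),)
print(convert_bool("はい", pairs))    # True
print(convert_bool("いいえ", pairs))  # False
```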
Custom Type Converters
Register custom conversion functions for specific types. You can use ANY Python type as a key, including:
- Built-ins such as int, float, bool (to override default behavior)
- Standard library types such as Decimal, datetime, date, ZoneInfo, UUID
- Your own classes
Example using standard library types and a custom class:
from dataclasses import dataclass
from uuid import UUID
from zoneinfo import ZoneInfo
from md_spreadsheet_parser import ConversionSchema, parse_table
@dataclass
class Color:
    r: int
    g: int
    b: int
@dataclass
class Config:
    timezone: ZoneInfo
    session_id: UUID
    theme_color: Color
markdown = """
| Timezone | Session ID | Theme Color |
| --- | --- | --- |
| Asia/Tokyo | 12345678-1234-5678-1234-567812345678 | 255,0,0 |
"""
schema = ConversionSchema(
    custom_converters={
        # Standard Library Types
        ZoneInfo: lambda v: ZoneInfo(v),
        UUID: lambda v: UUID(v),
        # Custom Class
        Color: lambda v: Color(*map(int, v.split(",")))
    }
)
data = parse_table(markdown).to_models(Config, conversion_schema=schema)
# data[0].timezone is ZoneInfo("Asia/Tokyo")
# data[0].theme_color is Color(255, 0, 0)
Field-Specific Converters
For granular control, you can define converters for specific field names, which take precedence over type-based converters.
def parse_usd(val): ...
def parse_jpy(val): ...
schema = ConversionSchema(
    # Type-based defaults (Low priority)
    custom_converters={
        Decimal: parse_usd
    },
    # Field-name overrides (High priority)
    field_converters={
        "price_jpy": parse_jpy,
        "created_at": lambda x: datetime.strptime(x, "%Y/%m/%d")
    }
)
# price_usd (no override) -> custom_converters (parse_usd)
# price_jpy (override) -> field_converters (parse_jpy)
data = parse_table(markdown).to_models(Product, conversion_schema=schema)
Standard Converters Library
For common patterns (currencies, lists), you can use the built-in helper functions in md_spreadsheet_parser.converters instead of writing your own.
from md_spreadsheet_parser.converters import (
    to_decimal_clean,         # Handles "$1,000", "¥500" -> Decimal
    make_datetime_converter,  # Factory for parse/TZ logic
    make_list_converter,      # "a,b,c" -> ["a", "b", "c"]
    make_bool_converter       # Custom strict boolean sets
)
schema = ConversionSchema(
    custom_converters={
        # Currency: removes $, ¥, €, £, comma, space
        Decimal: to_decimal_clean,
        # DateTime: ISO format default, attach Tokyo TZ if naive
        datetime: make_datetime_converter(tz=ZoneInfo("Asia/Tokyo")),
        # Lists: Split by comma, strip whitespace
        list: make_list_converter(separator=",")
    },
    field_converters={
        # Custom boolean for specific field
        "is_valid": make_bool_converter(true_values=["OK"], false_values=["NG"])
    }
)
The parser is designed to handle imperfect markdown tables gracefully.
from md_spreadsheet_parser import parse_table
markdown = """
| A | B |
|---|---|
| 1 | <-- Missing column
| 1 | 2 | 3 <-- Extra column
"""
table = parse_table(markdown)
print(table.rows)
# [['1', ''], ['1', '2']]
This ensures that table.rows always matches the structure of table.headers, preventing crashes during iteration or validation.
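The normalization rule is simply: pad short rows with empty strings and truncate long ones to the header width. A sketch:

```python
def normalize_row(row, width):
    # Pad with empty strings, then truncate to exactly `width` cells
    return (row + [""] * width)[:width]

print(normalize_row(["1"], 2))            # ['1', '']
print(normalize_row(["1", "2", "3"], 2))  # ['1', '2']
```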
The parser automatically converts HTML line breaks to Python newlines (\n). This enables handling multiline cells naturally.
Supported Tags (Case-Insensitive): <br>, <br/>, <br />
markdown = "| Line1<br>Line2 |"
table = parse_table(markdown)
# table.rows[0][0] == "Line1\nLine2"
Round-Trip Support:
When generating Markdown (e.g., table.to_markdown()), Python newlines (\n) are automatically converted back to <br> tags to preserve the table structure.
To disable this, set convert_br_to_newline=False in ParsingSchema.
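The round trip can be sketched with a case-insensitive regex (illustrative; the library's internals may differ):

```python
import re

BR_PATTERN = re.compile(r"<br\s*/?>", re.IGNORECASE)

def br_to_newline(cell):
    # <br>, <br/>, <br /> (any case) become "\n"
    return BR_PATTERN.sub("\n", cell)

def newline_to_br(cell):
    # Reverse direction for to_markdown()-style output
    return cell.replace("\n", "<br>")

print(repr(br_to_newline("Line1<BR/>Line2")))  # 'Line1\nLine2'
```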
Do you really have a 10GB Markdown file?
Probably not. We sincerely hope you don't. Markdown wasn't built for that.
But if you doβperhaps you're generating extensive logs or auditing standard convertersβthis library has your back. While Excel gives up after 1,048,576 rows, md-spreadsheet-parser supports streaming processing for files of unlimited size, keeping memory usage constant.
scan_tables_iter:
This function reads the file line-by-line and yields Table objects as they are found. It does not load the entire file into memory.
from md_spreadsheet_parser import scan_tables_iter
# Process a massive log file (e.g., 10GB)
# Memory usage remains low (only the size of a single table block)
for table in scan_tables_iter("huge_server_log.md"):
    print(f"Found table with {len(table.rows)} rows")
    # Process rows...
    for row in table.rows:
        pass
This is ideal for data pipelines, log analysis, and processing exports that are too large to open in standard spreadsheet editors.
The library provides immutable methods to modify the data structure. These methods return a new instance of the object with the changes applied, keeping the original object unchanged.
Workbook Operations
# Add a new sheet (creates a default table with headers A, B, C)
new_wb = workbook.add_sheet("New Sheet")
# Rename a sheet
new_wb = workbook.rename_sheet(sheet_index=0, new_name="Budget 2024")
# Delete a sheet
new_wb = workbook.delete_sheet(sheet_index=1)
Sheet Operations
# Rename sheet (direct method)
new_sheet = sheet.rename("Q1 Data")
# Update table metadata
new_sheet = sheet.update_table_metadata(
    table_index=0,
    name="Expenses",
    description="Monthly expense report"
)
Table Operations
# Update a cell (automatically expands table if index is out of bounds)
new_table = table.update_cell(row_idx=5, col_idx=2, value="Updated")
# Delete a row (structural delete)
new_table = table.delete_row(row_idx=2)
# Clear column data (keeps headers and row structure, empties cells)
new_table = table.clear_column_data(col_idx=3)
The library supports persisting visual state (like column widths and filter settings) without altering the Markdown table structure itself. This is achieved via a hidden HTML comment appended after the table.
| A | B |
|---|---|
| 1 | 2 |
<!-- md-spreadsheet-metadata: {"columnWidths": [100, 200]} -->
This ensures that the table itself remains valid GFM (renderers simply ignore the HTML comment), while editors can restore the visual state when the file is reopened.
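Because the payload is plain JSON inside an HTML comment, other tools can read it with the standard library alone. The helper below is a hypothetical reader, not part of the library's API:

```python
import json
import re
from typing import Optional

# Matches a trailing md-spreadsheet-metadata comment and captures its JSON payload
META_RE = re.compile(r"<!--\s*md-spreadsheet-metadata:\s*(\{.*?\})\s*-->", re.DOTALL)

def read_view_metadata(markdown: str) -> Optional[dict]:
    match = META_RE.search(markdown)
    return json.loads(match.group(1)) if match else None

doc = '| A | B |\n|---|---|\n| 1 | 2 |\n<!-- md-spreadsheet-metadata: {"columnWidths": [100, 200]} -->'
print(read_view_metadata(doc))  # {'columnWidths': [100, 200]}
```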
You can use the md-spreadsheet-parser command to parse Markdown files and output JSON. This is useful for piping data to other tools.
# Read from file
md-spreadsheet-parser input.md
# Read from stdin (pipe)
cat input.md | md-spreadsheet-parser
Options:
- --scan: Scan for all tables ignoring workbook structure (returns a list of tables).
- --root-marker: Set the root marker (default: # Tables).
- --sheet-header-level: Set sheet header level (default: 2).
- --table-header-level: Set table header level (default: 3).
- --capture-description: Capture table descriptions (default: True).
- --column-separator: Character used to separate columns (default: |).
- --header-separator-char: Character used in the separator row (default: -).
- --no-outer-pipes: Allow tables without outer pipes (default: False).
- --no-strip-whitespace: Do not strip whitespace from cell values (default: False).
- --no-br-conversion: Disable automatic conversion of <br> tags to newlines (default: False).
Customize parsing behavior using ParsingSchema and MultiTableParsingSchema.
| Option | Default | Description |
|---|---|---|
| column_separator | \| | Character used to separate columns. |
| header_separator_char | - | Character used in the separator row. |
| require_outer_pipes | True | If True, generated markdown tables will include outer pipes. |
| strip_whitespace | True | If True, whitespace is stripped from cell values. |
| convert_br_to_newline | True | If True, <br> tags are converted to \n (and back). |
| root_marker | # Tables | (MultiTable) Marker indicating start of data section. |
| sheet_header_level | 2 | (MultiTable) Header level for sheets. |
| table_header_level | 3 | (MultiTable) Header level for tables. |
| capture_description | True | (MultiTable) Capture text between header and table. |
This parser is the core foundation of a new ecosystem: Text-Based Spreadsheet Management.
We are preparing to release a Rich VS Code Extension that acts as a full GUI Spreadsheet Editor for Markdown files.
[!NOTE] Coming Soon: The official release of the VS Code Extension will be announced here alongside v1.0.0 of this library. Stay tuned!
The Vision: "Excel-like UX, Git-native Data"
By combining a high-performance editor with this robust parser, we aim to solve the long-standing problem of managing binary spreadsheet files in software projects.
Upcoming Features
We are actively working on:
This project is licensed under the MIT License.