
@lingo-reader/epub-parser
An epub parser which can extract chapter contents from an epub file
The EPUB file format is used to store ebook content, containing both the book's chapter materials and files specifying how these chapters should be sequentially read.
An EPUB file is essentially a .zip archive. Its content structure is built using HTML and CSS, and can theoretically include JavaScript as well. By changing the file extension to .zip and extracting the contents, you can directly view chapter content by opening the corresponding HTML/XHTML files. However, the chapters will appear in random order. If certain chapters or resources are encrypted, this zip extraction method will fail.
When parsing EPUB files:
(1) The first step involves parsing files like container.xml, .opf, and .ncx, which contain metadata (title, author, publication date, etc.), resource information (paths to images and other assets within the EPUB), and sequential chapter display information (Spine).
(2) The second step handles resource paths within chapters. References to resources in chapter files are only valid internally, so they must be converted to paths usable in the display environment—either as blob URLs in browsers or absolute filesystem paths in Node.js.
(3) The encryption information of an EPUB file is stored in the META-INF/encryption.xml file. Version 0.3.x supports parsing encrypted EPUB files, but only for specific encryption schemes, and a private key must be provided for decryption. The supported encryption methods are detailed in the initEpubFile section.
(4) EPUB files may also include signatures and rights management information, stored in the signatures.xml and rights.xml files, respectively. Like container.xml, these files are located in the /META-INF/ directory and have fixed filenames. Support for parsing these files will be added in future updates of @lingo-reader/epub-parser.
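Step (2) above depends on turning a reference that is relative to a chapter file into a path inside the archive. The following is a minimal sketch of that idea only; resolveArchivePath is a hypothetical helper, not part of @lingo-reader/epub-parser, and the sample paths are made up for illustration.

```typescript
// Hypothetical helper (not the library's implementation): resolve a relative
// reference found in a chapter against that chapter's path in the archive.
function resolveArchivePath(chapterPath: string, reference: string): string {
  // Drop any fragment or query; only the file path is resolved here.
  const cut = reference.search(/[#?]/)
  const cleanRef = cut === -1 ? reference : reference.slice(0, cut)
  // Start from the directory that contains the chapter file.
  const segments = chapterPath.split('/').slice(0, -1)
  for (const part of cleanRef.split('/')) {
    if (part === '' || part === '.') continue
    if (part === '..') segments.pop()
    else segments.push(part)
  }
  return segments.join('/')
}

// Made-up example paths: a stylesheet referenced from a chapter file.
const resolvedCssPath = resolveArchivePath('OEBPS/text/ch1.xhtml', '../styles/main.css')
console.log(resolvedCssPath)
```

The archive path obtained this way is what would then be mapped to a blob URL in the browser or to an absolute filesystem path in Node.js.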
The parser follows the EPUB 3.3 and Open Packaging Format (OPF) 2.0.1 v1.0 specifications. Its API aims to expose all available file information comprehensively.
pnpm install @lingo-reader/epub-parser
import { initEpubFile } from '@lingo-reader/epub-parser'
const epub = await initEpubFile('./example/alice.epub')
const spine = epub.getSpine()
const fileInfo = epub.getFileInfo()
// Load the first chapter:
// - html: Processed chapter HTML string
// - css: Chapter CSS files (absolute paths in Node.js, directly readable)
const { html, css } = await epub.loadChapter(spine[0].id)
// ...
import { initEpubFile } from '@lingo-reader/epub-parser'
async function initEpub(file: File) {
const epub = await initEpubFile(file)
const spine = epub.getSpine()
const fileInfo = epub.getFileInfo()
// Load the first chapter:
// - html: Processed chapter HTML string
// - css: Chapter CSS files (provided as blob URLs, fetchable)
const { html, css } = await epub.loadChapter(spine[0].id)
}
// ...
import { initEpubFile } from '@lingo-reader/epub-parser'
import type { EpubFile } from '@lingo-reader/epub-parser'
/*
interface EpubFileOptions {
rsaPrivateKey?: string | Uint8Array
aesSymmetricKey?: string | Uint8Array
}
// resourceSaveDir defaults to './images'; options defaults to {}
type initEpubFile = (epubPath: string | File, resourceSaveDir?: string, options?: EpubFileOptions) => Promise<EpubFile>
*/
const epub: EpubFile = await initEpubFile(
file,
'./images', // The default is './images'. If you don't want to change it, you can simply pass undefined.
{
// The RSA private key in PKCS#8 format should be provided either as a Base64-encoded string or a Uint8Array.
rsaPrivateKey: 'MIIEvgIBADANBgkqhkiG9w0BAQEFAASCBKgwggSkAgEAAoIBAQ......',
aesSymmetricKey: 'D2wVcst49HU6KqC......',
}
)
The primary API exposed by @lingo-reader/epub-parser is initEpubFile. Given a file path or File object, it returns a promise that resolves to an initialized EpubFile instance, which provides methods to read metadata, Spine information, and other EPUB data.
Parameters:
epubPath: string | File: File path or File object.
resourceSaveDir?: string: Optional (Node.js only). Specifies where to save resources like images. Default: './images'
options?: EpubFileOptions: Optional. Used to pass in key information.
interface EpubFileOptions {
// The RSA private key in PKCS#8 format should be provided either as a Base64-encoded string or a Uint8Array.
rsaPrivateKey?: string | Uint8Array
aesSymmetricKey?: string | Uint8Array
}
Returns:
Promise<EpubFile>: Resolves to an initialized EpubFile object.
Note: The accepted type of the epubPath parameter differs between environments:
Browser: File | Uint8Array. Passing a string will result in an error.
Node.js: string | Uint8Array. Passing a File will result in an error.
The 0.3.x version of epub-parser supports decryption using two encryption schemes:
One scheme uses the rsaPrivateKey option in EpubFileOptions. This method supports storing multiple AES key entries within the encryption.xml file.
For the other scheme, the aesSymmetricKey option must be provided for decryption.
The decryption logic is implemented in the parseEncryption method within epub-parser/src/parseFiles.ts.
Decryption does not rely on any third-party libraries—it is built on the native Web Crypto API in the browser and Node's crypto module, allowing the parser to run in both browser and Node environments.
Note that the browser supports fewer cryptographic algorithms than Node; however, all browser-supported algorithms are also available in Node. Therefore, the set of supported algorithms is aligned with browser compatibility, effectively a subset of Node's capabilities.
Supported Algorithms:
RSA-OAEP
RSA-OAEP-MGF1P
AES-256-CBC
AES-256-CTR
AES-256-GCM
AES-128-CBC
AES-128-CTR
AES-128-GCM
AES-192 is not supported in browsers and will throw an error if used to encrypt EPUB content, although it is fully supported in Node.js. The IV used for encryption should be placed at the beginning of the encrypted file. The expected key lengths for AES are:
AES-256: 32 bytes
AES-128: 16 bytes
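The IV-prefix layout and the AES key sizes can be illustrated with a small sketch. The helpers below (expectedAesKeyBytes, splitIvAndCiphertext) are hypothetical and are not the library's parseEncryption code; the 16-byte IV in the example is an assumption that matches the AES block size used by CBC, and other modes may use a different IV length.

```typescript
// Raw AES key size in bytes, derived from the algorithm name (e.g. 'AES-256-CBC').
// AES-192 is rejected here because the text says browsers do not support it.
function expectedAesKeyBytes(algorithm: string): number {
  if (algorithm.startsWith('AES-128-')) return 16
  if (algorithm.startsWith('AES-256-')) return 32
  throw new Error(`Unsupported algorithm: ${algorithm}`)
}

// Split an encrypted resource into its leading IV and the remaining ciphertext.
function splitIvAndCiphertext(
  data: Uint8Array,
  ivLength: number,
): { iv: Uint8Array; ciphertext: Uint8Array } {
  if (data.length < ivLength) throw new Error('Encrypted data is shorter than the IV')
  return { iv: data.subarray(0, ivLength), ciphertext: data.subarray(ivLength) }
}

// Assumed example: 16-byte IV followed by 4 bytes of ciphertext.
const sampleEncrypted = new Uint8Array(20)
const sampleParts = splitIvAndCiphertext(sampleEncrypted, 16)
console.log(sampleParts.iv.length, sampleParts.ciphertext.length, expectedAesKeyBytes('AES-256-CBC'))
```

Checking the key length up front gives a clearer error than letting the underlying crypto API reject the key later.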
The EpubFile class exposes these methods:
import { EpubFile } from '@lingo-reader/epub-parser'
import { EBookParser } from '@lingo-reader/shared'
declare class EpubFile implements EBookParser {
getFileInfo(): EpubFileInfo
getMetadata(): EpubMetadata
getManifest(): Record<string, ManifestItem>
getSpine(): EpubSpine
getGuide(): GuideReference[]
getCollection(): CollectionItem[]
getToc(): EpubToc
getPageList(): PageList
getNavList(): NavList
loadChapter(id: string): Promise<EpubProcessedChapter>
resolveHref(href: string): EpubResolvedHref | undefined
destroy(): void
}
getManifest retrieves all resources contained in the EPUB (HTML files, images, etc.).
import { getManifest } from '@lingo-reader/epub-parser'
import type { ManifestItem } from '@lingo-reader/epub-parser'
/*
type getManifest = () => Record<string, ManifestItem>
*/
// Keys represent resource `id`
const manifest: Record<string, ManifestItem> = epub.getManifest()
Parameters: none
Returns:
Record<string, ManifestItem> - A dictionary mapping each resource id to its descriptor:
interface ManifestItem {
// Unique resource identifier
id: string
// Path within the EPUB (ZIP) archive
href: string
// MIME type (e.g., "application/xhtml+xml")
mediaType: string
// Special role (e.g., "cover-image")
properties?: string
// Associated media overlay for audio/video
mediaOverlay?: string
// Fallback resources when this item cannot be loaded
fallback?: string[]
}
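As a rough illustration of working with the manifest dictionary, the sketch below picks out entries by MIME type prefix. The ManifestItem interface is a trimmed local copy of the one above, itemsByMediaType is a hypothetical helper, and the sample entries are invented.

```typescript
// Trimmed local copy of ManifestItem, limited to the fields used here.
interface ManifestItem {
  id: string
  href: string
  mediaType: string
  properties?: string
}

// Hypothetical helper: collect manifest entries whose MIME type starts with a prefix.
function itemsByMediaType(manifest: Record<string, ManifestItem>, prefix: string): ManifestItem[] {
  return Object.keys(manifest)
    .map(key => manifest[key])
    .filter(item => item.mediaType.indexOf(prefix) === 0)
}

// Invented sample data in the shape returned by epub.getManifest().
const sampleManifest: Record<string, ManifestItem> = {
  ch1: { id: 'ch1', href: 'text/ch1.xhtml', mediaType: 'application/xhtml+xml' },
  cover: { id: 'cover', href: 'images/cover.jpg', mediaType: 'image/jpeg', properties: 'cover-image' },
  css: { id: 'css', href: 'styles/main.css', mediaType: 'text/css' },
}

const manifestImages = itemsByMediaType(sampleManifest, 'image/')
console.log(manifestImages.map(item => item.href))
```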
getSpine returns the reading order of all content documents in the EPUB.
The linear property in SpineItem indicates whether the item is part of the primary reading flow (values: "yes" or "no").
import { getSpine } from '@lingo-reader/epub-parser'
import type { EpubSpine } from '@lingo-reader/epub-parser'
/*
type getSpine = () => EpubSpine
*/
const spine: EpubSpine = epub.getSpine()
Parameters: none
Returns:
EpubSpine - An ordered array of spine items:
type SpineItem = ManifestItem & {
/**
* Reading progression flag
* - "yes": Primary reading content (default)
* - "no": Supplementary material
*/
linear?: string
}
type EpubSpine = SpineItem[]
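Because linear defaults to "yes", a reader that only wants the primary reading flow can drop items explicitly marked "no". This is a sketch under that assumption; SpineItem is a trimmed local copy and primaryReadingOrder is a hypothetical helper, not a library export.

```typescript
// Trimmed local copy of SpineItem, limited to the fields used here.
interface SpineItem {
  id: string
  href: string
  mediaType: string
  linear?: string
}

// Keep items in the primary reading flow: linear is "yes" or absent (default "yes").
function primaryReadingOrder(spine: SpineItem[]): SpineItem[] {
  return spine.filter(item => item.linear !== 'no')
}

// Invented sample data in the shape returned by epub.getSpine().
const sampleSpine: SpineItem[] = [
  { id: 'ch1', href: 'text/ch1.xhtml', mediaType: 'application/xhtml+xml', linear: 'yes' },
  { id: 'notes', href: 'text/notes.xhtml', mediaType: 'application/xhtml+xml', linear: 'no' },
  { id: 'ch2', href: 'text/ch2.xhtml', mediaType: 'application/xhtml+xml' },
]

const primaryIds = primaryReadingOrder(sampleSpine).map(item => item.id)
console.log(primaryIds)
```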
The loadChapter function takes a chapter id as its parameter and returns a promise that resolves to a processed chapter object, or to undefined if the chapter doesn't exist.
const spine = epub.getSpine()
const fileInfo = epub.getFileInfo()
// Load the first chapter. 'html' is the processed HTML chapter string,
// 'css' is the chapter's CSS file, provided as an absolute path in Node.js,
// which can be directly read.
const { html, css } = await epub.loadChapter(spine[0].id)
Parameters:
id: string - The chapter id from spine
Returns:
Promise<EpubProcessedChapter | undefined> - Processed chapter content
// css
interface EpubCssPart {
id: string
href: string
}
// media-overlay
interface Par {
// element id
textDOMId: string
// unit: s
clipBegin: number
clipEnd: number
}
interface SmilAudio {
audioSrc: string
pars: Par[]
}
type SmilAudios = SmilAudio[]
// chapter
interface EpubProcessedChapter {
css: EpubCssPart[]
html: string
mediaOverlays?: SmilAudios
}
In an EPUB ebook file, each chapter is typically an XHTML (or HTML) file. Thus, the processed chapter object consists of two parts: one is the HTML content string under the <body> tag, and the other is the CSS. The CSS is parsed from the <link> tags in the chapter file and provided here in the form of a blob URL (or as an absolute filesystem path in a Node.js environment), represented by the href field in EpubCssPart, along with a corresponding id for the URL. The CSS blob URL can be directly referenced in a <link> tag or fetched via the Fetch API (using the absolute path in Node.js) to obtain the CSS text for further processing.
In EPUB, SMIL files enable read-aloud functionality by mapping segments of an audio track to specific text elements in the document. During playback, the current audio time can be used to locate the corresponding text element and highlight it in the DOM. When processed, a SMIL file is represented in an EpubProcessedChapter as the optional mediaOverlays property.
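The time-to-element lookup described above might be sketched as follows. The Par and SmilAudio interfaces are local copies of the ones shown earlier; findActivePar is a hypothetical helper, and the sample clip times are invented.

```typescript
// Local copies of the media-overlay types shown earlier.
interface Par {
  textDOMId: string
  clipBegin: number // seconds
  clipEnd: number // seconds
}

interface SmilAudio {
  audioSrc: string
  pars: Par[]
}

// Hypothetical helper: find the Par whose clip range contains the current audio time.
function findActivePar(audio: SmilAudio, currentTime: number): Par | undefined {
  for (const par of audio.pars) {
    if (currentTime >= par.clipBegin && currentTime < par.clipEnd) return par
  }
  return undefined
}

// Invented sample overlay: two sentences read back to back.
const sampleOverlay: SmilAudio = {
  audioSrc: 'audio/ch1.mp3',
  pars: [
    { textDOMId: 's1', clipBegin: 0, clipEnd: 2.5 },
    { textDOMId: 's2', clipBegin: 2.5, clipEnd: 6 },
  ],
}

const activePar = findActivePar(sampleOverlay, 3)
console.log(activePar ? activePar.textDOMId : 'no active element')
```

In a UI, the returned textDOMId would typically be used to look up the element in the rendered chapter and toggle a highlight class on it.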
mediaOverlays is an array of SmilAudio objects; each SmilAudio holds an audio source and a list of Par mappings.
Internal chapter navigation in EPUBs is handled through the href attributes of <a> tags. To distinguish internal links from external links and to simplify internal navigation logic, internal links are prefixed with epub:. These links can be resolved using the resolveHref function. Handling of such links is left to the UI layer; epub-parser only provides the corresponding chapter HTML and the selector.
resolveHref parses internal links into a chapter ID and a CSS selector within the book's HTML.
If an external link (e.g., https://www.example.com) or an invalid internal link is provided, it returns undefined.
const toc: EpubToc = epub.getToc()
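// 'id' is the chapter ID, 'selector' is a DOM selector (e.g., `[id="ididid"]`)
// resolveHref may return undefined, so check before destructuring
const resolved = epub.resolveHref(toc[0].href)
if (resolved) {
const { id, selector } = resolved
}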
Parameters:
href: string: The internal resource path.
Returns:
EpubResolvedHref | undefined: The resolved internal link. Returns undefined if the path is invalid.
interface EpubResolvedHref {
id: string
selector: string
}
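One way a UI layer could route link clicks is sketched below. It relies only on the behaviour described above (resolveHref returns undefined for external or invalid links, and internal links carry the epub: prefix). HrefResolver, LinkTarget, and classifyLink are hypothetical names, and the stub resolver and its href strings are invented purely for the example.

```typescript
interface EpubResolvedHref {
  id: string
  selector: string
}

// Minimal shape assumed here: anything exposing resolveHref(), as EpubFile does.
interface HrefResolver {
  resolveHref(href: string): EpubResolvedHref | undefined
}

type LinkTarget =
  | { kind: 'internal'; chapterId: string; selector: string }
  | { kind: 'invalid'; href: string }
  | { kind: 'external'; url: string }

// Hypothetical helper: decide how a clicked link should be handled.
function classifyLink(parser: HrefResolver, href: string): LinkTarget {
  const resolved = parser.resolveHref(href)
  if (resolved) return { kind: 'internal', chapterId: resolved.id, selector: resolved.selector }
  // An epub:-prefixed link that did not resolve is treated as a broken internal link.
  if (href.indexOf('epub:') === 0) return { kind: 'invalid', href }
  return { kind: 'external', url: href }
}

// Stub standing in for an EpubFile; the href format used here is made up.
const stubResolver: HrefResolver = {
  resolveHref: href => (href === 'epub:known' ? { id: 'ch2', selector: '[id="note"]' } : undefined),
}

console.log(classifyLink(stubResolver, 'epub:known').kind)
console.log(classifyLink(stubResolver, 'https://www.example.com').kind)
```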
getToc returns the toc structure, which corresponds to the navMap section of the EPUB's .ncx file and contains the book's navigation hierarchy.
import { getToc } from '@lingo-reader/epub-parser'
import type { EpubToc } from '@lingo-reader/epub-parser'
/*
type getToc = () => EpubToc
*/
const toc: EpubToc = epub.getToc()
Parameters: none
Returns:
EpubToc:
interface NavPoint {
// Display text of the table of contents entry
label: string
// Resource path within the EPUB file (preprocessed format).
// Can be resolved using resolveHref()
href: string
// Chapter identifier
id: string
// Reading order sequence
playOrder: string
// Nested sub-entries (optional)
children?: NavPoint[]
}
/** EPUB table of contents structure (NCX navMap representation) */
type EpubToc = NavPoint[]
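Since NavPoint entries can nest through children, rendering a table of contents often starts by flattening the tree while tracking depth. The sketch below assumes only the NavPoint shape shown above; flattenToc and FlatTocEntry are hypothetical names, and the sample entries are invented.

```typescript
// Local copy of the NavPoint shape shown above.
interface NavPoint {
  label: string
  href: string
  id: string
  playOrder: string
  children?: NavPoint[]
}

interface FlatTocEntry {
  label: string
  href: string
  depth: number
}

// Hypothetical helper: depth-first walk that records each entry's nesting level.
function flattenToc(toc: NavPoint[], depth = 0): FlatTocEntry[] {
  const entries: FlatTocEntry[] = []
  for (const point of toc) {
    entries.push({ label: point.label, href: point.href, depth })
    if (point.children) {
      for (const child of flattenToc(point.children, depth + 1)) entries.push(child)
    }
  }
  return entries
}

// Invented sample data in the shape returned by epub.getToc().
const sampleToc: NavPoint[] = [
  {
    label: 'Part I',
    href: 'epub:part1',
    id: 'part1',
    playOrder: '1',
    children: [{ label: 'Chapter 1', href: 'epub:ch1', id: 'ch1', playOrder: '2' }],
  },
  { label: 'Part II', href: 'epub:part2', id: 'part2', playOrder: '3' },
]

const flatToc = flattenToc(sampleToc)
console.log(flatToc.map(entry => `${'  '.repeat(entry.depth)}${entry.label}`).join('\n'))
```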
Supported since v0.4.1.
Returns the URL of the cover image.
Parameters: none
Returns:
string: the URL of the cover image
destroy cleans up generated resources (such as blob URLs) created during file parsing to prevent memory leaks. In Node.js environments, it also deletes the corresponding temporary files.
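To make sure destroy runs even when processing throws, the cleanup can be wrapped in try/finally. This is only a sketch of that pattern: Destroyable and withParser are hypothetical names, and the fake parser stands in for an EpubFile so the example runs without the library.

```typescript
// Minimal shape assumed here: anything exposing destroy(), as EpubFile does.
interface Destroyable {
  destroy(): void
}

// Hypothetical helper: open a parser, use it, and always clean up afterwards.
async function withParser<P extends Destroyable, R>(
  openParser: () => Promise<P>,
  useParser: (parser: P) => Promise<R> | R,
): Promise<R> {
  const parser = await openParser()
  try {
    return await useParser(parser)
  } finally {
    // Runs on success and on error, so blob URLs or temporary files are released.
    parser.destroy()
  }
}

// Fake parser used only to demonstrate the control flow.
let destroyCalls = 0
const fakeParser = {
  destroy() {
    destroyCalls += 1
  },
  chapterCount: 12,
}

const chapterCountPromise = withParser(async () => fakeParser, parser => parser.chapterCount)
chapterCountPromise.then(count => console.log(count, destroyCalls))
```

With the real library, the first argument would be something like () => initEpubFile(file), and the second a function that reads whatever is needed before the parser is destroyed.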
import type { EpubFileInfo } from '@lingo-reader/epub-parser'
/*
type getFileInfo = () => EpubFileInfo
*/
const fileInfo: EpubFileInfo = epub.getFileInfo()
getFileInfo returns an EpubFileInfo, which currently includes two attributes: fileName is the file name, and mimetype is the file type of the EPUB file. The mimetype is read from the /mimetype file and is always application/epub+zip.
Parameters: none
Returns:
EpubFileInfo:
interface EpubFileInfo {
fileName: string
mimetype: string
}
getMetadata returns the metadata recorded in the book.
import type { EpubMetadata } from '@lingo-reader/epub-parser'
/*
type getMetadata = () => EpubMetadata
*/
const metadata: EpubMetadata = epub.getMetadata()
Parameters: none
Returns:
EpubMetadata:
interface EpubMetadata {
// Title of the book
title: string
// Language of the book
language: string
// Description of the book
description?: string
// Publisher of the EPUB file
publisher?: string
// General type/genre of the book, such as novel, biography, etc.
type?: string
// MIME type of the EPUB file
format?: string
// Original source of the book content
source?: string
// Related external resources
relation?: string
// Coverage of the publication content
coverage?: string
// Copyright statement
rights?: string
// Includes creation time, publication date, update time, etc. of the book
// Specific fields depend on opf:event, such as modification
date?: Record<string, string>
identifier: Identifier
packageIdentifier: Identifier
creator?: Contributor[]
contributor?: Contributor[]
subject?: Subject[]
metas?: Record<string, string>
links?: Link[]
}
id represents the unique identifier of the resource. The scheme specifies the system or authority used to generate or assign the identifier, such as ISBN or DOI. identifierType indicates the type of identifier used by id, which is similar to scheme.
interface Identifier {
id: string
scheme?: string
identifierType?: string
}
packageIdentifier is essentially also an Identifier. Typically, the <package> tag references it through the unique-identifier attribute, whose value matches the id of the relevant <identifier> element.
<package unique-identifier="id">
<dc:identifier id="id" opf:scheme="URI">uuid:19c0c5cb-002b-476f-baa7-fcf510414f95</dc:identifier>
</package>
Describes the various contributors.
interface Contributor {
// Name of the contributor
contributor: string
// Sort-friendly version of the name
fileAs?: string
// Role of the contributor
role?: string
// The encoding scheme used for role or alternateScript,
// can also represent a language, such as English or Chinese
scheme?: string
// Alternative script or writing system for the contributor's name
alternateScript?: string
}
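As an illustration of how fileAs might be used, the sketch below orders contributors by their sort-friendly name and falls back to the display name when fileAs is missing. Contributor is a trimmed local copy of the interface above, sortContributors is a hypothetical helper, and the sample records are invented.

```typescript
// Trimmed local copy of Contributor, limited to the fields used here.
interface Contributor {
  contributor: string
  fileAs?: string
  role?: string
}

// Hypothetical helper: sort by fileAs when present, otherwise by display name.
function sortContributors(people: Contributor[]): Contributor[] {
  const sortKey = (person: Contributor): string => person.fileAs || person.contributor
  // Copy first so the metadata array itself is not reordered.
  return people.slice().sort((a, b) => sortKey(a).localeCompare(sortKey(b)))
}

// Invented sample data in the shape of metadata.creator.
const sampleCreators: Contributor[] = [
  { contributor: 'Lewis Carroll', fileAs: 'Carroll, Lewis', role: 'aut' },
  { contributor: 'Jane Austen', fileAs: 'Austen, Jane', role: 'aut' },
]

const sortedCreators = sortContributors(sampleCreators)
console.log(sortedCreators.map(person => person.contributor).join(', '))
```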
The subject or theme of the book.
interface Subject {
// Subject, such as fiction, essay, etc.
subject: string
// The authority or organization providing the code or identifier
authority?: string
// Associated subject code or term
term?: string
}
Provides additional related resources or external links.
interface Link {
// URL or path to the resource
href: string
// Language of the resource
hreflang?: string
// id
id?: string
// MIME type of the resource (e.g., image/jpeg, application/xml)
mediaType?: string
// Additional properties
properties?: string
// Purpose or function of the link
rel: string
}
getGuide returns the preview chapters of the book. The first few chapters from the spine can be used instead.
import { getGuide } from '@lingo-reader/epub-parser'
import type { EpubGuide } from '@lingo-reader/epub-parser'
/*
type getGuide = () => EpubGuide
*/
const guide: EpubGuide = epub.getGuide()
Parameters: none
Returns:
EpubGuide:
interface GuideReference {
title: string
// The role of the resource, such as toc, loi, cover-image, etc.
type: string
// The path to the resource within the EPUB file
href: string
}
type EpubGuide = GuideReference[]
getCollection returns the content under the <collection> tag in the .opf file, which specifies whether an EPUB file belongs to a particular collection, such as a series, a category, or a group of publications.
import { getCollection } from '@lingo-reader/epub-parser'
import type { EpubCollection } from '@lingo-reader/epub-parser'
/*
type getCollection = () => EpubCollection
*/
const collection: EpubCollection = epub.getCollection()
Parameters: none
Returns:
EpubCollection:
interface CollectionItem {
// The role played within the Collection
role: string
// Links to related resources
links: string[]
}
type EpubCollection = CollectionItem[]
For getPageList, refer to https://idpf.org/epub/20/spec/OPF_2.0.1_draft.htm#Section2.4.1.2. Here correspondId is the resource's ID, and the remaining fields correspond to the specification.
import { getPageList } from '@lingo-reader/epub-parser'
import type { PageList } from '@lingo-reader/epub-parser'
/*
type getPageList = () => PageList
*/
const pageList: PageList = epub.getPageList()
Parameters: none
Returns:
PageList:
interface PageTarget {
label: string
// Page number
value: string
href: string
playOrder: string
type: string
correspondId: string
}
interface PageList {
label: string
pageTargets: PageTarget[]
}
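A "go to page" feature could look up a PageTarget by its page number, as in the sketch below. The interfaces are local copies of the ones above; findPageTarget is a hypothetical helper, and the sample values (including the type strings and hrefs) are invented.

```typescript
// Local copies of the page-list types shown above.
interface PageTarget {
  label: string
  value: string
  href: string
  playOrder: string
  type: string
  correspondId: string
}

interface PageList {
  label: string
  pageTargets: PageTarget[]
}

// Hypothetical helper: find the target whose page number matches.
function findPageTarget(pageList: PageList, pageNumber: string): PageTarget | undefined {
  for (const target of pageList.pageTargets) {
    if (target.value === pageNumber) return target
  }
  return undefined
}

// Invented sample data in the shape returned by epub.getPageList().
const samplePageList: PageList = {
  label: 'Pages',
  pageTargets: [
    { label: '1', value: '1', href: 'epub:ch1', playOrder: '1', type: 'normal', correspondId: 'ch1' },
    { label: '2', value: '2', href: 'epub:ch1-p2', playOrder: '2', type: 'normal', correspondId: 'ch1' },
  ],
}

const pageTwo = findPageTarget(samplePageList, '2')
console.log(pageTwo ? pageTwo.href : 'page not found')
```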
For getNavList, refer to https://idpf.org/epub/20/spec/OPF_2.0.1_draft.htm#Section2.4.1.2. Here correspondId is the resource's ID, label corresponds to the content of navLabel.text, and href is the path to the resource within the EPUB file.
import { getNavList } from '@lingo-reader/epub-parser'
import type { NavList } from '@lingo-reader/epub-parser'
/*
type getNavList = () => NavList
*/
const navList: NavList = epub.getNavList()
Parameters: none
Returns:
NavList:
interface NavTarget {
label: string
href: string
correspondId: string
}
interface NavList {
label: string
navTargets: NavTarget[]
}