
flappa-doormal
Arabic text marker pattern library for generating regex from declarative configurations
Declarative Arabic text segmentation library
Split pages of content into logical segments using human-readable patterns.
🚀 Live Demo • 📦 npm • 📚 GitHub
Working with Arabic hadith and Islamic text collections requires splitting continuous text into segments (individual hadiths, chapters, verses). This traditionally means hand-writing cryptic regex such as ^[\u0660-\u0669]+\s*[-–—ـ]\s* and worrying about diacritic variants (حَدَّثَنَا vs حدثنا).
flappa-doormal provides:
✅ Readable templates: {{raqms}} {{dash}} instead of cryptic regex
✅ Named captures: {{raqms:hadithNum}} auto-extracts to meta.hadithNum
✅ Fuzzy matching: Auto-enabled for {{bab}}, {{kitab}}, {{basmalah}}, {{fasl}}, {{naql}} (override with fuzzy: false)
✅ Content limits: maxPages and maxContentLength (safety-hardened) control segment size
✅ Page tracking: Know which page each segment came from
✅ Declarative rules: Describe what to match, not how
npm install flappa-doormal
# or
bun add flappa-doormal
# or
yarn add flappa-doormal
import { segmentPages } from 'flappa-doormal';
// Your pages from a hadith book
const pages = [
{ id: 1, content: '٦٦٩٦ - حَدَّثَنَا أَبُو بَكْرٍ عَنِ النَّبِيِّ...\n٦٦٩٧ - أَخْبَرَنَا عُمَرُ قَالَ...' },
{ id: 2, content: '٦٦٩٨ - حَدَّثَنِي مُحَمَّدٌ...' },
];
const segments = segmentPages(pages, {
rules: [{
lineStartsAfter: ['{{raqms:num}} {{dash}} '],
split: 'at',
}]
});
// Result:
// [
// { content: 'حَدَّثَنَا أَبُو بَكْرٍ عَنِ النَّبِيِّ...', from: 1, meta: { num: '٦٦٩٦' } },
// { content: 'أَخْبَرَنَا عُمَرُ قَالَ...', from: 1, meta: { num: '٦٦٩٧' } },
// { content: 'حَدَّثَنِي مُحَمَّدٌ...', from: 2, meta: { num: '٦٦٩٨' } }
// ]
Use validateSegments() to sanity-check segmentation output against the input pages and options. This is useful for detecting page attribution issues or maxPages violations before sending segments to downstream systems.
import { segmentPages, validateSegments } from 'flappa-doormal';
const segments = segmentPages(pages, { rules, maxPages: 0 });
const report = validateSegments(pages, { rules, maxPages: 0 }, segments);
if (!report.ok) {
console.log(report.summary);
console.log(report.issues[0]);
}
Example issue entry (truncated):
{
"type": "page_attribution_mismatch",
"severity": "error",
"segmentIndex": 2,
"expected": { "from": 5 },
"actual": { "from": 4 },
"evidence": "Content found in page 5, but segment.from=4."
}
Replace regex with readable tokens:
| Token | Matches | Regex Equivalent |
|---|---|---|
| {{raqms}} | Arabic-Indic digits | [\\u0660-\\u0669]+ |
| {{raqm}} | Single Arabic digit | [\\u0660-\\u0669] |
| {{nums}} | ASCII digits | \\d+ |
| {{num}} | Single ASCII digit | \\d |
| {{dash}} | Dash variants | [-–—ـ] |
| {{harf}} | Arabic letter | [أ-ي] |
| {{harfs}} | Single-letter codes separated by spaces | [أ-ي](?:\s+[أ-ي])* |
| {{rumuz}} | Source abbreviations (rijāl/takhrīj rumuz), incl. multi-code blocks | e.g. خت ٤, خ سي, خ فق, د ت سي ق |
| {{numbered}} | Hadith numbering ٢٢ - | {{raqms}} {{dash}} |
| {{fasl}} | Section markers | فصل \| مسألة |
| {{tarqim}} | Punctuation marks | [.!?؟؛] |
| {{bullet}} | Bullet points | [•*°] |
| {{newline}} | Newline character | \n |
| {{naql}} | Narrator phrases | حدثنا \| أخبرنا \| ... |
| {{kitab}} | "كتاب" (book) | كتاب |
| {{bab}} | "باب" (chapter) | باب |
| {{basmalah}} | "بسم الله" | بسم الله |
| {{hr}} | Horizontal rule (5+ chars) | [-–—ـ_=]{5,} |
{{kitab}} – Matches "كتاب" (Book). Used in hadith collections to mark major book divisions. Example: كتاب الإيمان (Book of Faith).
{{bab}} – Matches "باب" (Chapter). Example: باب ما جاء في الصلاة (Chapter on what came regarding prayer).
{{fasl}} – Matches "فصل" or "مسألة" (Section/Issue). Common in fiqh books.
{{basmalah}} – Matches "بسم الله" or "﷽". Commonly appears at the start of chapters, books, or documents.
{{naql}} matches common hadith transmission phrases:
{{rumuz}} matches rijāl/takhrīj source abbreviations used in narrator biography books:
Matches blocks of codes separated by whitespace (e.g., خ سي, خ فق, خت ٤, د ت سي ق).
Note: Single-letter rumuz like ع are only matched when they appear as standalone codes, not as the first letter of words like عَن.
| Token | Matches | Example |
|---|---|---|
| {{raqms}} | One or more Arabic-Indic digits (٠-٩) | ٦٦٩٦ in ٦٦٩٦ - حدثنا |
| {{raqm}} | Single Arabic-Indic digit | ٥ |
| {{nums}} | One or more ASCII digits (0-9) | 123 |
| {{num}} | Single ASCII digit | 5 |
| {{numbered}} | Common hadith format: {{raqms}} {{dash}} | ٢٢ - حدثنا |
{{dash}} matches:
- "-" (hyphen-minus, U+002D)
- "–" (en dash, U+2013)
- "—" (em dash, U+2014)
- "ـ" (tatweel, U+0640, Arabic elongation character)

Example: ٦٦٩٦ - حدثنا or ٦٦٩٦ ـ حدثنا
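As a plain-regex sketch of the expansions above (illustrative only, not the library's compiled output), the common number-dash prefix can be matched like this:

```typescript
// Illustrative regex equivalents of {{raqms}} and {{dash}}, taken from the token table
const raqms = '[\\u0660-\\u0669]+'; // one or more Arabic-Indic digits
const dash = '[-\\u2013\\u2014\\u0640]'; // hyphen-minus, en dash, em dash, tatweel
const numbered = new RegExp(`^(${raqms})\\s*${dash}\\s*`, 'u');

const m = '٦٦٩٦ - حدثنا'.match(numbered);
console.log(m?.[1]); // '٦٦٩٦'
console.log(numbered.test('٦٦٩٦ ـ حدثنا')); // true — tatweel also counts as a dash
```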
For better IDE support, use the Token constants instead of raw strings:
import { Token, withCapture } from 'flappa-doormal';
// Instead of:
{ lineStartsWith: ['{{kitab}}', '{{bab}}'] }
// Use:
{ lineStartsWith: [Token.KITAB, Token.BAB] }
// With named captures:
const pattern = withCapture(Token.RAQMS, 'hadithNum') + ' ' + Token.DASH + ' ';
// Result: '{{raqms:hadithNum}} {{dash}} '
{ lineStartsAfter: [pattern], split: 'at' }
// segment.meta.hadithNum will contain the matched number
Available constants: Token.BAB, Token.BASMALAH, Token.BULLET, Token.DASH, Token.FASL, Token.HARF, Token.HARFS, Token.HR, Token.KITAB, Token.NAQL, Token.NUM, Token.NUMS, Token.NUMBERED, Token.RAQM, Token.RAQMS, Token.RUMUZ, Token.TARQIM
Extract metadata automatically with the {{token:name}} syntax:
// Capture hadith number
{ template: '^{{raqms:hadithNum}} {{dash}} ' }
// Result: meta.hadithNum = '٦٦٩٦'
// Capture volume and page
{ template: '^{{raqms:vol}}/{{raqms:page}} {{dash}} ' }
// Result: meta.vol = '٣', meta.page = '٤٥٦'
// Capture rest of content
{ template: '^{{raqms:num}} {{dash}} {{:text}}' }
// Result: meta.num = '٦٦٩٦', meta.text = 'حَدَّثَنَا أَبُو بَكْرٍ'
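Under the hood this corresponds to JavaScript named capture groups; a hand-rolled sketch (not the library's actual compilation) of the first example:

```typescript
// Plain-regex sketch of what {{raqms:hadithNum}} ... {{:text}} extraction amounts to
const re = /^(?<hadithNum>[\u0660-\u0669]+)\s*[-–—ـ]\s*(?<text>.+)$/u;
const groups = re.exec('٦٦٩٦ - حَدَّثَنَا أَبُو بَكْرٍ')?.groups;
console.log(groups?.hadithNum); // '٦٦٩٦'
console.log(groups?.text);      // 'حَدَّثَنَا أَبُو بَكْرٍ'
```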
Match Arabic text regardless of harakat:
const rules = [{
fuzzy: true,
lineStartsAfter: ['{{kitab:book}} '],
split: 'at',
}];
// Matches both:
// - 'كِتَابُ الصلاة' (with diacritics)
// - 'كتاب الصيام' (without diacritics)
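A minimal sketch of what diacritic-insensitive matching amounts to, assuming fuzzy matching ignores the harakat range U+064B–U+0652 (the library's exact normalization may differ):

```typescript
// Strip Arabic diacritics (harakat) before comparison — illustrative only
const stripHarakat = (s: string): string => s.replace(/[\u064B-\u0652]/gu, '');

console.log(stripHarakat('كِتَابُ الصلاة')); // 'كتاب الصلاة'
console.log(stripHarakat('كتاب الصيام'));    // unchanged: 'كتاب الصيام'
```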
| Type | Marker in content? | Use case |
|---|---|---|
| lineStartsWith | ✅ Included | Keep marker, segment at boundary |
| lineStartsAfter | ❌ Excluded | Strip marker, capture only content |
| lineEndsWith | ✅ Included | Match patterns at end of line |
| template | Depends | Custom pattern with full control |
| regex | Depends | Raw regex for complex cases |
The library exports PATTERN_TYPE_KEYS (a const array) and PatternTypeKey (a type) for building UIs that let users select pattern types:
import { PATTERN_TYPE_KEYS, type PatternTypeKey } from 'flappa-doormal';
// PATTERN_TYPE_KEYS = ['lineStartsWith', 'lineStartsAfter', 'lineEndsWith', 'template', 'regex']
// Build a dropdown/select
PATTERN_TYPE_KEYS.map(key => <option value={key}>{key}</option>)
// Type-safe validation
const isPatternKey = (k: string): k is PatternTypeKey =>
(PATTERN_TYPE_KEYS as readonly string[]).includes(k);
When matching at line starts (e.g., {{naql}}), a new page can begin with a marker that is actually a continuation of the previous page (page wrap), not a true new segment.
Use pageStartGuard to allow a rule to match at the start of a page only if the previous page’s last non-whitespace character matches a pattern (tokens supported):
const segments = segmentPages(pages, {
rules: [{
fuzzy: true,
lineStartsWith: ['{{naql}}'],
split: 'at',
// Only allow a split at the start of a new page if the previous page ended with sentence punctuation:
pageStartGuard: '{{tarqim}}'
}]
});
This guard applies only at page starts. Mid-page line starts are unaffected.
In lineStartsWith, lineStartsAfter, lineEndsWith, and template patterns, parentheses () and square brackets [] are automatically escaped. This means you can write intuitive patterns without manual escaping:
// Write this (clean and readable):
{ lineStartsAfter: ['({{harf}}): '], split: 'at' }
// Instead of this (verbose escaping):
{ lineStartsAfter: ['\\({{harf}}\\): '], split: 'at' }
Important: Brackets inside {{tokens}} are NOT escaped - token patterns like {{harf}} which expand to [أ-ي] work correctly.
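A minimal sketch of this auto-escaping behavior (illustrative only, not the library's implementation):

```typescript
// Escape ()[] everywhere except inside {{token}} chunks
const escapeOutsideTokens = (pattern: string): string =>
  pattern
    .split(/(\{\{[^}]*\}\})/) // odd indices are {{token}} chunks, kept intact
    .map((part, i) => (i % 2 === 1 ? part : part.replace(/[()[\]]/g, '\\$&')))
    .join('');

console.log(escapeOutsideTokens('({{harf}}): ')); // '\({{harf}}\): '
```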
For full regex control (character classes, capturing groups), use the regex pattern type which does NOT auto-escape:
// Character class [أب] matches أ or ب
{ regex: '^[أب] ', split: 'at' }
// Capturing group (test|text) matches either
{ regex: '^(test|text) ', split: 'at' }
// Named capture groups extract metadata from raw regex too!
{ regex: '^(?<num>[٠-٩]+)\\s+[أ-ي\\s]+:\\s*(.+)' }
// meta.num = matched number, content = captured (.+) group
Limit rules to specific page ranges:
{
lineStartsWith: ['## '],
split: 'at',
min: 10, // Only pages 10+
max: 100, // Only pages up to 100
}
Split oversized segments based on character count:
{
maxContentLength: 500, // Split after 500 characters
prefer: 'longer', // Try to fill the character bucket
breakpoints: ['\\.'], // Recommended: split on punctuation within window
}
The library implements safety hardening for character-based splits:
- maxContentLength must be at least 50.

Apply text normalization transforms before segmentation rules are evaluated:
segmentPages(pages, {
preprocess: [
'removeZeroWidth', // Strip invisible Unicode control characters
'condenseEllipsis', // "..." → "…" (prevents {{tarqim}} false matches)
'fixTrailingWaw', // " و " → " و" (joins waw to next word)
],
rules: [...],
});
Available transforms:
| Transform | Effect | Use Case |
|---|---|---|
| removeZeroWidth | Strips U+200B–U+200F, U+202A–U+202E, U+2060–U+2064, U+FEFF | Invisible chars interfering with patterns |
| condenseEllipsis | ... → … | Prevent {{tarqim}} matching inside ellipsis |
| fixTrailingWaw | " و " → " و" | Fix OCR artifacts with detached waw |
Page constraints:
preprocess: [
'removeZeroWidth', // All pages
{ type: 'condenseEllipsis', min: 100 }, // Pages 100+
{ type: 'fixTrailingWaw', min: 50, max: 500 }, // Pages 50-500
]
removeZeroWidth modes:
// Default: strip entirely
{ type: 'removeZeroWidth', mode: 'strip' }
// Alternative: replace with space (preserves word boundaries)
// Note: Won't insert space after existing whitespace (space, newline, tab)
{ type: 'removeZeroWidth', mode: 'space' }
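The three transforms can be sketched as plain string replacements (illustrative; the library's implementations may handle more edge cases):

```typescript
// Illustrative sketches of the three preprocess transforms
const removeZeroWidth = (s: string): string =>
  s.replace(/[\u200B-\u200F\u202A-\u202E\u2060-\u2064\uFEFF]/gu, '');
const condenseEllipsis = (s: string): string => s.replace(/\.{3,}/g, '…');
const fixTrailingWaw = (s: string): string => s.replace(/ و /g, ' و');

console.log(removeZeroWidth('حد\u200Bثنا')); // 'حدثنا'
console.log(condenseEllipsis('قال...'));     // 'قال…'
console.log(fixTrailingWaw('قال و هذا'));    // 'قال وهذا'
```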
Refine rule matching with page-specific constraints:
{
lineStartsWith: ['### '],
split: 'at',
// Range constraints
min: 10, // Only match on pages 10 and above
max: 500, // Only match on pages 500 and below
exclude: [50, [100, 110]], // Skip page 50 and range 100-110
// Negative lookahead: skip rule if content matches this pattern
// (e.g. skip chapter marker if it appears inside a table/list)
skipWhen: '^\\s*- ',
}
Pass an optional logger to trace segmentation decisions or enable debug to attach match metadata to segments:
const segments = segmentPages(pages, {
rules: [...],
debug: true, // Enables detailed match metadata
logger: {
debug: (msg, data) => console.log(`[DEBUG] ${msg}`, data),
info: (msg, data) => console.info(`[INFO] ${msg}`, data),
warn: (msg, data) => console.warn(`[WARN] ${msg}`, data),
error: (msg, data) => console.error(`[ERROR] ${msg}`, data),
}
});
// Helper to format debug reason
// import { getSegmentDebugReason } from 'flappa-doormal';
// console.log(getSegmentDebugReason(segments[0])); // "Rule #0 (lineStartsWith) [idx:2] (Matched: '{{naql}}')"
When debug: true is enabled, the library attaches a _flappa object to each segment's meta property. This is extremely useful for understanding exactly why a segment was created and which pattern matched.
The metadata includes different fields based on the split reason:
1. Rule-based Splits
If a segment was created by one of your rules:
{
"meta": {
"_flappa": {
"rule": {
"index": 0, // Index of the rule in your rules array
"patternType": "lineStartsWith", // The type of pattern that matched
"wordIndex": 2, // Index of the specific pattern in the array
"word": "{{naql}}" // The specific pattern string that matched
}
}
}
}
2. Breakpoint-based Splits
If a segment was created by a breakpoint pattern (e.g. because it exceeded maxPages or maxContentLength):
{
"meta": {
"_flappa": {
"breakpoint": {
"index": 0, // Index of the breakpoint in your array
"pattern": "\\.", // The pattern (or `regex`) that matched
"kind": "pattern", // "pattern", "regex", or "pageBoundary"
"wordIndex": 1, // Index in `words` array (if using `words` field)
"word": "ثم " // The specific word that matched
}
}
}
}
3. Safety Fallback Splits (maxContentLength)
If no rule or breakpoint matched and the library was forced to perform a safety fallback split:
{
"meta": {
"_flappa": {
"contentLengthSplit": {
"maxContentLength": 5000,
"splitReason": "whitespace" // "whitespace", "unicode_boundary", or "grapheme_cluster"
}
}
}
}
- whitespace: Found a safe space/newline to split at.
- unicode_boundary: No whitespace found, split at a safe character boundary (avoiding surrogate pairs).
- grapheme_cluster: Split at a grapheme boundary (avoiding diacritic/ZWJ corruption).

Control how text from different pages is stitched together:
// Default: space ' ' joiner
// Result: "...end of page 1. Start of page 2..."
segmentPages(pages, { pageJoiner: 'space' });
// Result: "...end of page 1.\nStart of page 2..."
segmentPages(pages, { pageJoiner: 'newline' });
When a segment exceeds maxPages or maxContentLength, breakpoints split it at the "best" available match:
{
maxPages: 1, // Maximum segment size (page span)
breakpoints: ['{{tarqim}}'],
// 'longer' (default): Greedy. Finds the match furthest in the window.
// Result: Segments stay close to the max limit.
prefer: 'longer',
// 'shorter': Conservative. Finds the first available match.
// Result: Segments split as early as possible.
prefer: 'shorter',
}
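In plain-string terms, the two strategies amount to taking the last vs. the first breakpoint match inside the window (illustrative sketch, not the library's code):

```typescript
// 'longer' ≈ furthest match in the window; 'shorter' ≈ first match
const windowText = 'جملة أولى. جملة ثانية. بقية النص';
const greedy = windowText.lastIndexOf('.');   // 'longer': split as late as possible
const conservative = windowText.indexOf('.'); // 'shorter': split as early as possible
console.log(greedy > conservative); // true — the two strategies pick different positions
```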
When a breakpoint pattern matches, the split position is controlled by the split option:
⚠️ Split Defaults Differ: Rules default to split: 'at', while breakpoints default to split: 'after'.
{
breakpoints: [
// Default: split AFTER the match (match included in previous segment)
{ pattern: '{{tarqim}}' }, // or { pattern: '{{tarqim}}', split: 'after' }
// Alternative: split AT the match (match starts next segment)
{ pattern: 'ولهذا', split: 'at' },
],
}
split: 'after' (default)
// Pattern "ولهذا" with split: 'after' on "النص الأول ولهذا النص الثاني"
// - Segment 1: "النص الأول ولهذا" (ends WITH match)
// - Segment 2: "النص الثاني" (starts AFTER match)
split: 'at'
// Pattern "ولهذا" with split: 'at' on "النص الأول ولهذا النص الثاني"
// - Segment 1: "النص الأول" (ends BEFORE match)
// - Segment 2: "ولهذا النص الثاني" (starts WITH match)
Note: For the empty pattern '' (page boundary fallback), split is ignored since there is no matched text to include or exclude.
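The two split positions can be sketched with plain string slicing (illustrative; real segments are also trimmed, as shown):

```typescript
// 'after' keeps the match in the previous segment; 'at' starts the next segment with it
const content = 'النص الأول ولهذا النص الثاني';
const i = content.indexOf('ولهذا');
const end = i + 'ولهذا'.length;

const splitAfter = [content.slice(0, end).trim(), content.slice(end).trim()];
const splitAt = [content.slice(0, i).trim(), content.slice(i).trim()];

console.log(splitAfter); // ['النص الأول ولهذا', 'النص الثاني']
console.log(splitAt);    // ['النص الأول', 'ولهذا النص الثاني']
```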
Pattern order matters - the first matching pattern wins:
{
// Patterns are tried in order
breakpoints: [
'\\.', // Try punctuation first (no need for \\s* - segments are trimmed)
'ولهذا', // Then try specific word
'', // Finally, fall back to page boundary
],
}
// If punctuation is found, "ولهذا" is never tried
Note on lookahead patterns: Zero-length patterns like (?=X) are not supported for breakpoints because they can cause non-progress scenarios. Use { pattern: 'X', split: 'at' } instead to achieve "split before X" behavior.
Note on whitespace: Segments are trimmed by default. With split: 'at', if the match consists only of whitespace, it will be trimmed from the start of the next segment. This is usually desirable for delimiter patterns.
Tip: \s* after punctuation is redundant. Because segments are trimmed, {{tarqim}}\s* produces identical output to {{tarqim}}; the trailing whitespace captured by \s* gets trimmed anyway. Save yourself the extra characters!
pattern vs regex Field
Breakpoints support two pattern fields:
| Field | Bracket escaping | Use case |
|---|---|---|
| pattern | ()[] auto-escaped | Simple patterns, token-friendly |
| regex | None (raw regex) | Complex regex with groups, lookahead |
// Use `pattern` for simple patterns (brackets are auto-escaped)
{ pattern: '(a)', split: 'after' } // Matches literal "(a)"
{ pattern: '{{tarqim}}', split: 'after' } // Token expansion works
// Use `regex` for complex patterns with regex groups
{ regex: '\\s+(?:ولهذا|وكذلك|فلذلك)', split: 'at' } // Non-capturing group
{ regex: '{{tarqim}}', split: 'after' } // Tokens work in regex too!
If both pattern and regex are specified, regex takes precedence.
Breakpoint patterns match substrings, not whole words. A pattern like ولهذا will match inside مَولهذا, causing a mid-word split:
// Content: "النص الأول مَولهذا النص"
// Pattern: { pattern: 'ولهذا', split: 'at' }
// Result:
// - Segment 1: "النص الأول مَ" ← orphaned letter!
// - Segment 2: "ولهذا النص"
Solution: Require whitespace before the pattern to ensure whole-word matching:
// Single word - require preceding whitespace
{ pattern: '\\s+ولهذا', split: 'at' }
// Multiple words using alternation - each needs whitespace prefix
{ pattern: '\\s+(?:ولهذا|وكذلك|فلذلك)', split: 'at' }
Why not \b? JavaScript's \b word boundary does not work with Arabic text. Since Arabic letters aren't considered "word characters" (\w = [a-zA-Z0-9_]), using \b will match nothing - not even standalone words. Always use a \s+ prefix instead.
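You can verify this directly in JavaScript:

```typescript
// \b never matches adjacent to Arabic letters (they are not \w), so \b-based
// word boundaries silently fail; an explicit \s+ prefix works
console.log(/\bولهذا/.test('قال ولهذا'));  // false — \b finds no boundary
console.log(/\s+ولهذا/.test('قال ولهذا')); // true — whitespace prefix matches
```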
words Field (Simplified Word Breakpoints)
For breaking on multiple words, the words field provides a simpler syntax with automatic whitespace boundaries:
{
breakpoints: [
// Instead of manually writing:
// { regex: '\\s+(?:فهذا|ثم|أقول)', split: 'at' }
// Use the `words` field:
{ words: ['فهذا', 'ثم', 'أقول'], min: 100 }
],
}
Features:
- Automatic \s+ prefix for whole-word matching
- Defaults to split: 'at' (can be overridden)
- Token expansion works ({{naql}} expands as usual)

// Override split behavior
{ words: ['والله أعلم'], split: 'after' } // Include phrase in previous segment
// Use tokens in words
{ words: ['{{naql}}', 'وكذلك'] } // Token expansion works
// Note: `words` cannot be combined with `pattern` or `regex`
// Note: Empty `words: []` is filtered out (no-op), NOT treated as page-boundary fallback
⚠️ Partial Word Matching: The words field matches text that starts with the word, not complete words only. For example, words: ['ثم'] will also match ثمامة (a name starting with ثم).
To match only complete words, add a trailing space:
// ❌ Matches 'ثم' anywhere, including inside 'ثمامة'
{ words: ['فهذا', 'ثم', 'أقول'] }
// ✅ Matches only standalone words followed by space
{ words: ['فهذا ', 'ثم ', 'أقول '] }
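The difference is easy to verify with the \s+-prefixed regex that the words field conceptually compiles to (illustrative sketch):

```typescript
// Without a trailing space, 'ثم' also matches the start of the name 'ثمامة'
console.log(/\s+ثم/.test('قال ثمامة'));   // true — partial match inside a name
console.log(/\s+ثم /.test('قال ثمامة'));  // false — trailing space requires a standalone word
console.log(/\s+ثم /.test('قال ثم قام')); // true — standalone word still matches
```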
Security note (ReDoS): Breakpoints (and raw regex rules) compile user-provided regular expressions. Do not accept untrusted patterns (e.g. from end users) without validation/sandboxing; some regexes can trigger catastrophic backtracking and hang the process.
Control which matches to use:
{
lineEndsWith: ['\\.'],
split: 'after',
occurrence: 'last', // Only split at LAST period on page
}
Use {{numbered}} for the common "number - content" format:
const segments = segmentPages(pages, {
rules: [{
lineStartsAfter: ['{{numbered}}'],
split: 'at',
meta: { type: 'hadith' }
}]
});
// Matches: ٢٢ - حدثنا, ٦٦٩٦ – أخبرنا, etc.
// Content starts AFTER the number and dash
For capturing the hadith number, use explicit capture syntax:
const segments = segmentPages(pages, {
rules: [{
lineStartsAfter: ['{{raqms:hadithNum}} {{dash}} '],
split: 'at',
meta: { type: 'hadith' }
}]
});
// Each segment has:
// - content: The hadith text (without number prefix)
// - from/to: Page range
// - meta: { type: 'hadith', hadithNum: '٦٦٩٦' }
const segments = segmentPages(pages, {
rules: [{
lineStartsAfter: ['{{raqms:vol}}/{{raqms:page}} {{dash}} '],
split: 'at'
}]
});
// meta: { vol: '٣', page: '٤٥٦' }
const segments = segmentPages(pages, {
rules: [{
fuzzy: true,
lineStartsAfter: ['{{kitab:book}} '],
split: 'at',
meta: { type: 'chapter' }
}]
});
// Matches "كِتَابُ" or "كتاب" regardless of diacritics
const segments = segmentPages(pages, {
rules: [{
fuzzy: true,
lineStartsWith: ['{{naql:phrase}}'],
split: 'at'
}]
});
// meta.phrase captures which narrator phrase was matched:
// 'حدثنا', 'أخبرنا', 'حدثني', etc.
// Only capture the number, not the letter
const segments = segmentPages(pages, {
rules: [{
lineStartsWith: ['{{raqms:num}} {{harf}} {{dash}} '],
split: 'at'
}]
});
// Input: '٥ أ - البند الأول'
// meta: { num: '٥' } // harf not captured (no :name suffix)
Use {{rumuz}} for matching rijāl/takhrīj source abbreviations (common in narrator biography books and takhrīj notes):
const segments = segmentPages(pages, {
rules: [{
lineStartsAfter: ['{{raqms:num}} {{rumuz}}:'],
split: 'at'
}]
});
// Matches: ١١١٨ ع: ... / ١١١٨ خ سي: ... / ١١١٨ خ فق: ...
// meta: { num: '١١١٨' }
// content: '...' (rumuz stripped)
Supported codes: Single-letter (ع, خ, م, د, etc.), two-letter (خت, عس, سي, etc.), digit ٤, and the word تمييز (used in jarḥ wa taʿdīl books).
Note: Single-letter rumuz like ع are only matched when they appear as standalone codes, not as the first letter of words like عَن. The pattern is diacritic-safe.
If your data uses only single-letter codes separated by spaces (e.g., د ت س ي ق), you can also use {{harfs}}.
Use analyzeCommonLineStarts(pages) to discover common line-start signatures across a book, useful for rule authoring:
import { analyzeCommonLineStarts } from 'flappa-doormal';
const patterns = analyzeCommonLineStarts(pages);
// [{ pattern: "{{numbered}}", count: 1234, examples: [...] }, ...]
You can control what gets analyzed and how results are ranked:
import { analyzeCommonLineStarts } from 'flappa-doormal';
// Top 20 most common line-start signatures (by frequency)
const topByCount = analyzeCommonLineStarts(pages, {
sortBy: 'count',
topK: 20,
});
// Only analyze markdown H2 headings (lines beginning with "##")
// This shows what comes AFTER the heading marker (e.g. "## {{bab}}", "## {{numbered}}\\[", etc.)
const headingVariants = analyzeCommonLineStarts(pages, {
lineFilter: (line) => line.startsWith('##'),
sortBy: 'count',
topK: 40,
});
// Support additional prefix styles without changing library code
// (e.g. markdown blockquotes ">> ..." + headings)
const quotedHeadings = analyzeCommonLineStarts(pages, {
lineFilter: (line) => line.startsWith('>') || line.startsWith('#'),
prefixMatchers: [/^>+/u, /^#+/u],
sortBy: 'count',
topK: 40,
});
Key options:
- sortBy: 'specificity' (default) or 'count' (highest frequency first)
- lineFilter: restrict which lines are counted (e.g. only headings)
- prefixMatchers: consume syntactic prefixes (default includes headings via /^#+/u) so you can see variations after the prefix
- normalizeArabicDiacritics: true by default (helps token matching like وأَخْبَرَنَا → {{naql}})
- whitespace: how whitespace is represented in returned patterns:
  - 'regex' (default): uses \\s* placeholders between tokens
  - 'space': uses literal single spaces (' ') between tokens (useful if you don't want \\s to later match newlines when reusing these patterns)

Note on brackets in returned patterns:
- analyzeCommonLineStarts() returns template-like signatures, not “ready-to-run regex”.
- It does not escape () / [] in the returned pattern (e.g. (ح) stays (ح)).
- If you reuse a pattern in lineStartsWith / lineStartsAfter / template, that's fine: those template pattern types auto-escape ()[] outside {{tokens}}.
- If you reuse it in a raw regex rule, you may need to escape literal brackets yourself.

For texts without line breaks (continuous prose), use analyzeRepeatingSequences():
import { analyzeRepeatingSequences } from 'flappa-doormal';
const patterns = analyzeRepeatingSequences(pages, {
minElements: 2,
maxElements: 4,
minCount: 3,
topK: 20,
});
// [{ pattern: "{{naql}}\\s*{{harf}}", count: 42, examples: [...] }, ...]
Key options:
- minElements / maxElements: N-gram size range (default 1-3)
- minCount: Minimum occurrences to include (default 3)
- topK: Maximum patterns to return (default 20)
- requireToken: Only patterns containing {{tokens}} (default true)
- normalizeArabicDiacritics: Ignore diacritics when matching (default true)

Use analysis functions to discover patterns, then pass to segmentPages().
For prose-like text without structural line breaks:
import { analyzeRepeatingSequences, segmentPages, type Page } from 'flappa-doormal';
// Continuous Arabic text with narrator phrases
const pages: Page[] = [
{ id: 1, content: 'حدثنا أحمد بن محمد عن عمر قال سمعت النبي حدثنا خالد بن زيد عن علي' },
{ id: 2, content: 'حدثنا سعيد بن جبير عن ابن عباس أخبرنا يوسف عن أنس' },
];
// Step 1: Discover repeating patterns
const patterns = analyzeRepeatingSequences(pages, { minCount: 2, topK: 10 });
// [{ pattern: '{{naql}}', count: 5, examples: [...] }, ...]
// Step 2: Build rules from discovered patterns
const rules = patterns.filter(p => p.count >= 3).map(p => ({
lineStartsWith: [p.pattern],
split: 'at' as const,
fuzzy: true,
}));
// Step 3: Segment
const segments = segmentPages(pages, { rules });
// [{ content: 'حدثنا أحمد بن محمد عن عمر قال سمعت النبي', from: 1 }, ...]
For hadith-style numbered entries:
import { analyzeCommonLineStarts, segmentPages, type Page } from 'flappa-doormal';
// Numbered hadith text
const pages: Page[] = [
{ id: 1, content: '٦٦٩٦ - حَدَّثَنَا أَبُو بَكْرٍ عَنِ النَّبِيِّ\n٦٦٩٧ - أَخْبَرَنَا عُمَرُ قَالَ' },
{ id: 2, content: '٦٦٩٨ - حَدَّثَنِي مُحَمَّدٌ عَنْ عَائِشَةَ' },
];
// Step 1: Discover common line-start patterns
const patterns = analyzeCommonLineStarts(pages, { topK: 10, minCount: 2 });
// [{ pattern: '{{raqms}}\\s*{{dash}}', count: 3, examples: [...] }, ...]
// Step 2: Build rules (add named capture for hadith number)
const topPattern = patterns[0]?.pattern ?? '{{raqms}} {{dash}} ';
const rules = [{
lineStartsAfter: [topPattern.replace('{{raqms}}', '{{raqms:num}}')],
split: 'at' as const,
meta: { type: 'hadith' }
}];
// Step 3: Segment
const segments = segmentPages(pages, { rules });
// [
// { content: 'حَدَّثَنَا أَبُو بَكْرٍ...', from: 1, meta: { type: 'hadith', num: '٦٦٩٦' } },
// { content: 'أَخْبَرَنَا عُمَرُ قَالَ', from: 1, meta: { type: 'hadith', num: '٦٦٩٧' } },
// { content: 'حَدَّثَنِي مُحَمَّدٌ...', from: 2, meta: { type: 'hadith', num: '٦٦٩٨' } },
// ]
If you already have pre-segmented data (e.g., records from a database or JSON file) and want to use flappa-doormal's token system to extract metadata and clean the content without further splitting, you can use the Metadata Extraction pattern.
By setting maxPages: 0, you guarantee a 1:1 mapping: each input page produces exactly one output segment, regardless of how much text is on the page.
import { segmentPages, type Page } from 'flappa-doormal';
const excerpts = [
{ nass: '٧٠١٦ - ١ - ١ - فَقَصَّتْهَا حَفْصَةُ', id: 1 },
{ nass: '٧٠١٧ (أ) - بَابُ الْقَيْدِ', id: 2 },
{ nass: 'باب الصلاة - الفصل الأول', id: 3 },
];
// Convert your data to the Page format
const pages: Page[] = excerpts.map(e => ({ content: e.nass, id: e.id }));
const result = segmentPages(pages, {
maxPages: 0, // IMPORTANT: Guarantees each page stays isolated (no merging/splitting)
rules: [
// 1. Extract triple numbers: ٧٠١٦ - ١ - ١
{
lineStartsAfter: ['{{raqms:num}} {{dash}} {{raqms:num2}} {{dash}} {{raqms:num3}} '],
},
// 2. Extract number + indicator: ٧٠١٧ (أ)
{
lineStartsAfter: ['{{raqms:num}} ({{harf:indicator}}) {{dash}} '],
},
// 3. Mark chapters using fuzzy tokens
{
fuzzy: true,
lineStartsWith: ['{{bab}} '],
meta: { type: 'Chapter' },
},
],
});
// Segment 0: { content: 'فَقَصَّتْهَا حَفْصَةُ', meta: { num: '٧٠١٦', num2: '١', num3: '١' }, ... }
// Segment 1: { content: 'بَابُ الْقَيْدِ', meta: { num: '٧٠١٧', indicator: 'أ' }, ... }
// Segment 2: { content: 'باب الصلاة - الفصل الأول', meta: { type: 'Chapter' }, ... }
- Use tokens like {{raqms}}, {{dash}}, and {{harf}} instead of writing raw regex for every edge case.
- lineStartsAfter automatically removes the matched pattern, leaving only the clean text.
- Named captures like {{raqms:num}} automatically populate the meta object.
- Use fuzzy: true to match headers like "Book" or "Chapter" regardless of Arabic diacritics.

Use optimizeRules() to automatically merge compatible rules, remove duplicate patterns, and sort rules by specificity (longest patterns first):
import { optimizeRules } from 'flappa-doormal';
const rules = [
// These will be merged because meta/fuzzy options match
{ lineStartsWith: ['{{kitab}}'], fuzzy: true, meta: { type: 'header' } },
{ lineStartsWith: ['{{bab}}'], fuzzy: true, meta: { type: 'header' } },
// This will be kept separate
{ lineStartsAfter: ['{{numbered}}'], meta: { type: 'entry' } },
];
const { rules: optimized, mergedCount } = optimizeRules(rules);
// Result:
// optimized[0] = {
// lineStartsWith: ['{{kitab}}', '{{bab}}'],
// fuzzy: true,
// meta: { type: 'header' }
// }
// optimized[1] = { lineStartsAfter: ['{{numbered}}'], ... }
Use validateRules() to detect common mistakes in rule patterns before running segmentation:
import { validateRules } from 'flappa-doormal';
const issues = validateRules([
{ lineStartsAfter: ['raqms:num'] }, // Missing {{}}
{ lineStartsWith: ['{{unknown}}'] }, // Unknown token
{ lineStartsAfter: ['## (rumuz:rumuz)'] } // Typo - should be {{rumuz:rumuz}}
]);
// issues[0]?.lineStartsAfter?.[0]?.type === 'missing_braces'
// issues[1]?.lineStartsWith?.[0]?.type === 'unknown_token'
// issues[2]?.lineStartsAfter?.[0]?.type === 'missing_braces'
// To get a simple list of error strings for UI display:
import { formatValidationReport } from 'flappa-doormal';
const errors = formatValidationReport(issues);
// [
// 'Rule 1, lineStartsAfter: Missing {{}} around token "raqms:num"',
// 'Rule 2, lineStartsWith: Unknown token "{{unknown}}"',
// ...
// ]
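As a toy illustration of the first check (not the library's validator), a bare-token detector for a few known token names could look like:

```typescript
// Toy sketch of the "missing braces" check, assuming a small set of known
// token names; validateRules() covers far more than this
const TOKENS = ['raqms', 'raqm', 'bab', 'kitab'];
const bareTokenRe = new RegExp(
  `(?<!\\{)\\b(?:${TOKENS.join('|')})(?::\\w+)?\\b(?!\\})`,
);

console.log(bareTokenRe.test('raqms:num'));     // true  → missing {{ }}
console.log(bareTokenRe.test('{{raqms:num}}')); // false → properly braced
```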
Checks performed:
- Missing braces: token text like raqms:num without {{}}
- Unknown tokens: names in {{}} that don't exist (e.g., {{nonexistent}})

When building UIs for rule editing, it's often useful to separate the token pattern (e.g., {{raqms}}) from the capture name (e.g., {{raqms:hadithNum}}).
import { applyTokenMappings, stripTokenMappings } from 'flappa-doormal';
// 1. Apply user-defined mappings to a raw template
const template = '{{raqms}} {{dash}}';
const mappings = [{ token: 'raqms', name: 'num' }];
const result = applyTokenMappings(template, mappings);
// result = '{{raqms:num}} {{dash}}'
// 2. Strip captures to get back to the canonical pattern
const raw = stripTokenMappings(result);
// raw = '{{raqms}} {{dash}}'
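A minimal re-implementation sketch of these two helpers (the real exports may handle more cases):

```typescript
// Illustrative sketches of applyTokenMappings / stripTokenMappings
type TokenMapping = { token: string; name: string };

const applyTokenMappings = (template: string, mappings: TokenMapping[]): string =>
  mappings.reduce(
    (t, m) => t.split(`{{${m.token}}}`).join(`{{${m.token}:${m.name}}}`),
    template,
  );

const stripTokenMappings = (template: string): string =>
  template.replace(/\{\{(\w+):\w+\}\}/g, '{{$1}}');

console.log(applyTokenMappings('{{raqms}} {{dash}}', [{ token: 'raqms', name: 'num' }]));
// '{{raqms:num}} {{dash}}'
console.log(stripTokenMappings('{{raqms:num}} {{dash}}'));
// '{{raqms}} {{dash}}'
```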
Before prompting an LLM, you can quickly extract high-signal pattern hints from the book using:
- analyzeCommonLineStarts(pages, options) (from src/line-start-analysis.ts): common line-start signatures (tokenized)
- analyzeTextForRule(text) / detectTokenPatterns(text) (from src/pattern-detection.ts): turn a single representative line into a token template suggestion

These help the LLM avoid guessing and focus on the patterns actually present.
import { analyzeCommonLineStarts } from 'flappa-doormal';
const top = analyzeCommonLineStarts(pages, {
sortBy: 'count',
topK: 40,
minCount: 10,
});
console.log(top.map((p) => ({ pattern: p.pattern, count: p.count, example: p.examples[0] })));
Typical output (example):
[
{ pattern: "{{numbered}}", count: 1200, example: { pageId: 50, line: "١ - حَدَّثَنَا ..." } },
{ pattern: "{{bab}}", count: 180, example: { pageId: 66, line: "باب ..." } },
{ pattern: "##\\s*{{bab}}", count: 140, example: { pageId: 69, line: "## باب ..." } }
]
If you only want to analyze headings (to see what comes after ##):
const headingVariants = analyzeCommonLineStarts(pages, {
lineFilter: (line) => line.startsWith('##'),
sortBy: 'count',
topK: 40,
});
Pick 3–10 representative line prefixes from the book (often from the examples returned above) and run:
import { analyzeTextForRule } from 'flappa-doormal';
console.log(analyzeTextForRule("٢٩- خ سي: أحمد بن حميد ..."));
// -> { template: "{{raqms}}- {{rumuz}}: أحمد...", patternType: "lineStartsAfter", fuzzy: false, ... }
When you prompt the LLM, include a short “Hints” section:
- analyzeCommonLineStarts patterns (with counts + 1–2 examples)
- analyzeTextForRule(...) results

Then instruct the LLM to prioritize rules that align with those hints.
You can use an LLM to generate SegmentationOptions by pasting it a random subset of pages and asking it to infer robust segmentation rules. Here’s a ready-to-copy plain-text prompt:
You are helping me generate JSON configuration for a text-segmentation function called segmentPages(pages, options).
It segments Arabic book pages (e.g., Shamela) into logical segments (books/chapters/sections/entries/hadiths).
I will give you a random subset of pages so you can infer patterns. You must respond with ONLY JSON (no prose).
I will paste a random subset of pages. Each page has:
- id: page number (not necessarily consecutive)
- content: plain text; line breaks are \n
Output ONLY a JSON object compatible with SegmentationOptions (no prose, no code fences).
SegmentationOptions shape:
- rules: SplitRule[]
- optional: maxPages, breakpoints, prefer
SplitRule constraints:
- Each rule must use exactly ONE of: lineStartsWith, lineStartsAfter, lineEndsWith, template, regex
- Optional fields: split ("at" | "after"), meta, min, max, exclude, occurrence ("first" | "last"), fuzzy
Important behaviors:
- lineStartsAfter matches at line start but strips the marker from segment.content.
- Template patterns (lineStartsWith/After/EndsWith/template) auto-escape ()[] outside tokens.
- Raw regex patterns do NOT auto-escape and can include groups, named captures, etc.
Available tokens you may use in templates:
- {{basmalah}} (بسم الله / ﷽)
- {{kitab}} (كتاب)
- {{bab}} (باب)
- {{fasl}} (فصل | مسألة)
- {{naql}} (حدثنا/أخبرنا/... narration phrases)
- {{raqm}} (single Arabic-Indic digit)
- {{raqms}} (Arabic-Indic digits)
- {{num}} (single ASCII digit)
- {{nums}} (ASCII digits)
- {{dash}} (dash variants)
- {{tarqim}} (punctuation [. ! ? ؟ ؛])
- {{harf}} (Arabic letter)
- {{harfs}} (single-letter codes separated by spaces; e.g. "د ت س ي ق")
- {{rumuz}} (rijāl/takhrīj source abbreviations; matches blocks like "خت ٤", "خ سي", "خ فق")
Named captures:
- {{raqms:num}} captures to meta.num
- {{:name}} captures arbitrary text to meta.name
Your tasks:
1) Identify document structure from the sample:
- book headers (كتاب), chapter headers (باب), sections (فصل/مسألة), hadith numbering, biography entries, etc.
2) Propose a minimal but robust ordered ruleset:
- Put most-specific rules first.
- Use fuzzy:true for Arabic headings where diacritics vary.
- Use lineStartsAfter when you want to remove the marker (e.g., hadith numbers, rumuz prefixes).
3) Use constraints:
- Use min/max/exclude when front matter differs or specific pages are noisy.
4) If segments can span many pages:
- Set maxPages and breakpoints.
- Suggested breakpoints (in order): "{{tarqim}}", "\\n", "" (page boundary)
- Prefer "longer" unless there’s a reason to prefer shorter segments.
5) Capture useful metadata:
- For numbering patterns, capture the number into meta.num (e.g., {{raqms:num}}).
Examples (what good answers look like):
Example A: hadith-style numbered segments
Input pages:
PAGE 10:
٣٤ - حَدَّثَنَا ...\n... (rest of hadith)
PAGE 11:
٣٥ - حَدَّثَنَا ...\n... (rest of hadith)
Good JSON answer:
{
"rules": [
{
"lineStartsAfter": ["{{raqms:num}} {{dash}}\\s*"],
"split": "at",
"meta": { "type": "hadith" }
}
]
}
Example B: chapter markers + hadith numbers
Input pages:
PAGE 50:
كتاب الصلاة\nباب فضل الصلاة\n١ - حَدَّثَنَا ...\n...
PAGE 51:
٢ - حَدَّثَنَا ...\n...
Good JSON answer:
{
"rules": [
{ "fuzzy": true, "lineStartsWith": ["{{kitab}}"], "split": "at", "meta": { "type": "book" } },
{ "fuzzy": true, "lineStartsWith": ["{{bab}}"], "split": "at", "meta": { "type": "chapter" } },
{ "lineStartsAfter": ["{{raqms:num}}\\s*{{dash}}\\s*"], "split": "at", "meta": { "type": "hadith" } }
]
}
Example C: narrator/rijāl entries with rumuz (codes) + colon
Input pages:
PAGE 257:
٢٩- خ سي: أحمد بن حميد...\nوكان من حفاظ الكوفة.
PAGE 258:
١٠٢- ق: تمييز ولهم شيخ آخر...\n...
Good JSON answer:
{
"rules": [
{
"lineStartsAfter": ["{{raqms:num}}\\s*{{dash}}\\s*{{rumuz}}:\\s*"],
"split": "at",
"meta": { "type": "entry" }
}
]
}
Now wait for the pages.
const segments = segmentPages(pages, {
rules: [{
lineEndsWith: ['\\.'],
split: 'after',
occurrence: 'last',
}]
});
const segments = segmentPages(pages, {
rules: [
// First: Chapter headers (highest priority)
{ fuzzy: true, lineStartsAfter: ['{{kitab:book}} '], split: 'at', meta: { type: 'chapter' } },
// Second: Sub-chapters
{ fuzzy: true, lineStartsAfter: ['{{bab:section}} '], split: 'at', meta: { type: 'section' } },
// Third: Individual hadiths
{ lineStartsAfter: ['{{raqms:num}} {{dash}} '], split: 'at', meta: { type: 'hadith' } },
]
});
segmentPages(pages, options)
Main segmentation function.
import { segmentPages, type Page, type SegmentationOptions, type Segment } from 'flappa-doormal';
const pages: Page[] = [
{ id: 1, content: 'First page content...' },
{ id: 2, content: 'Second page content...' },
];
const options: SegmentationOptions = {
// Optional preprocessing transforms (run before pattern matching)
// See "7.1 Preprocessing" section for details
preprocess: ['removeZeroWidth', 'condenseEllipsis'],
rules: [
{ lineStartsWith: ['## '], split: 'at' }
],
// How to join content across page boundaries in OUTPUT segments:
// - 'space' (default): page boundaries become spaces
// - 'newline': preserve page boundaries as newlines
pageJoiner: 'newline',
// Breakpoint preferences for resizing oversized segments:
// - 'longer' (default): maximizes segment size within limits
// - 'shorter': minimizes segment size (splits at first match)
prefer: 'longer',
// Post-structural limit: split if segment spans more than 2 pages
maxPages: 2,
// Post-structural limit: split if segment exceeds 5000 characters
maxContentLength: 5000,
// Enable match metadata in segments (meta.debug)
debug: true,
// Custom logger for tracing
logger: {
info: (m) => console.log(m),
warn: (m) => console.warn(m),
}
};
const segments: Segment[] = segmentPages(pages, options);
Recovering markers (lineStartsAfter was used by accident)
If you accidentally used lineStartsAfter for markers that should have been preserved (e.g. Arabic connective phrases like وروى / وذكر), you can recover those missing prefixes from existing segments.
recoverMistakenLineStartsAfterMarkers(pages, segments, options, selector, opts?)
This function returns new segments with recovered content plus a report describing what happened.
Recommended (deterministic) mode: rerun segmentation with selected rules converted to lineStartsWith, then merge recovered content back.
import { recoverMistakenLineStartsAfterMarkers, segmentPages } from 'flappa-doormal';
const pages = [{ id: 1, content: 'وروى أحمد\nوذكر خالد' }];
const options = { rules: [{ lineStartsAfter: ['وروى '] }, { lineStartsAfter: ['وذكر '] }] };
const segments = segmentPages(pages, options);
// segments[0].content === 'أحمد' (marker stripped)
const { segments: recovered, report } = recoverMistakenLineStartsAfterMarkers(
pages,
segments,
options,
{ type: 'rule_indices', indices: [0] }, // recover only the first rule
);
// recovered[0].content === 'وروى أحمد'
// recovered[1].content === 'خالد' (unchanged)
console.log(report.summary);
Optional: best-effort anchoring mode attempts to recover without rerunning first, then falls back to rerun for unresolved segments:
const { segments: recovered } = recoverMistakenLineStartsAfterMarkers(
pages,
segments,
options,
{ type: 'rule_indices', indices: [0] },
{ mode: 'best_effort_then_rerun' }
);
Notes:
- Recovery only applies to the rules identified by the selector; it will not “guess” which rules are mistaken.
recoverMistakenMarkersForRuns(runs, opts?)
Batch version of recoverMistakenLineStartsAfterMarkers. Processes multiple independent segmentation runs (e.g. from different books) and returns a consolidated report.
import { recoverMistakenMarkersForRuns } from 'flappa-doormal';
const results = recoverMistakenMarkersForRuns([
{ pages: pages1, segments: segments1, options: options1, selector: selector1 },
{ pages: pages2, segments: segments2, options: options2, selector: selector2 },
]);
validateSegments(pages, options, segments, validationOptions?)
Validates that segments correctly map back to the source pages and adhere to constraints.
import { validateSegments } from 'flappa-doormal';
const report = validateSegments(pages, options, segments, {
// Optional: Max content length to search before falling back (default: 500)
// Segments longer than this are checked via fast path unless issues are found.
fullSearchThreshold: 1000,
});
Returns a SegmentValidationReport containing:
- ok: boolean
- summary: counts of errors/warnings
- issues: detailed list of problems (page attribution mismatch, maxPages violation, etc.)
stripHtmlTags(html)
Remove all HTML tags from content, keeping only text.
import { stripHtmlTags } from 'flappa-doormal';
const text = stripHtmlTags('<p>Hello <b>World</b></p>');
// Returns: 'Hello World'
For more sophisticated HTML to Markdown conversion (like converting <span data-type="title"> to ## headers), you can implement your own function. Here's an example:
const htmlToMarkdown = (html: string): string => {
return html
// Convert title spans to markdown headers
.replace(/<span[^>]*data-type=["']title["'][^>]*>(.*?)<\/span>/gi, '## $1')
// Strip narrator links but keep text
.replace(/<a[^>]*href=["']inr:\/\/[^"']*["'][^>]*>(.*?)<\/a>/gi, '$1')
// Strip all remaining HTML tags
.replace(/<[^>]*>/g, '');
};
expandTokens(template)
Expand template tokens to a regex pattern.
import { expandTokens } from 'flappa-doormal';
const pattern = expandTokens('{{raqms}} {{dash}}');
// Returns: '[\u0660-\u0669]+ [-–—ـ]'
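The returned string is ordinary regex source, so it can be compiled directly. A plain-JS check, with the expansion shown above hard-coded rather than imported:

```javascript
// '{{raqms}} {{dash}}' expands to Arabic-Indic digits, a space, and a dash class.
const pattern = '[\\u0660-\\u0669]+ [-–—ـ]';
const re = new RegExp('^' + pattern);

re.test('٦٦٩٦ - حَدَّثَنَا'); // true: leading number, space, dash
re.test('حدثنا');             // false: no leading number
```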
makeDiacriticInsensitive(text)
Make Arabic text diacritic-insensitive for fuzzy matching.
import { makeDiacriticInsensitive } from 'flappa-doormal';
const pattern = makeDiacriticInsensitive('حدثنا');
// Returns regex pattern matching 'حَدَّثَنَا', 'حدثنا', etc.
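A minimal sketch of the underlying idea, assuming the common harakat range (this is not the library's implementation): allow an optional run of diacritics after every base letter.

```javascript
// Sketch: tolerate optional diacritics (U+064B–U+0652 harakat, U+0670 dagger
// alif) after each base letter. Not the library's actual implementation.
const DIACRITICS = '[\\u064B-\\u0652\\u0670]*';
const escapeRegex = (s) => s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
const diacriticInsensitiveSketch = (text) =>
    [...text].map((ch) => escapeRegex(ch) + DIACRITICS).join('');

const re = new RegExp(diacriticInsensitiveSketch('حدثنا'));
re.test('حَدَّثَنَا'); // true: fully vocalized form
re.test('حدثنا');      // true: bare form
```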
TOKEN_PATTERNS
Access available token definitions.
import { TOKEN_PATTERNS } from 'flappa-doormal';
console.log(TOKEN_PATTERNS.narrated);
// 'حدثنا|أخبرنا|حدثني|وحدثنا|أنبأنا|سمعت'
These functions help auto-detect tokens in text, useful for building UI tools that suggest rule configurations from user-highlighted text.
detectTokenPatterns(text)
Analyzes text and returns all detected token patterns with their positions.
import { detectTokenPatterns } from 'flappa-doormal';
const detected = detectTokenPatterns("٣٤ - حدثنا");
// Returns:
// [
// { token: 'raqms', match: '٣٤', index: 0, endIndex: 2 },
// { token: 'dash', match: '-', index: 3, endIndex: 4 },
// { token: 'naql', match: 'حدثنا', index: 5, endIndex: 10 }
// ]
generateTemplateFromText(text, detected)
Converts text to a template string using detected patterns.
import { detectTokenPatterns, generateTemplateFromText } from 'flappa-doormal';
const text = "٣٤ - ";
const detected = detectTokenPatterns(text);
const template = generateTemplateFromText(text, detected);
// Returns: "{{raqms}} {{dash}} "
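As a mental model (a hypothetical re-implementation, not the library's code), template generation splices {{token}} placeholders over the detected spans, working right-to-left so earlier indices stay valid:

```javascript
// Hypothetical sketch: replace each detected span with its {{token}}
// placeholder, processing from the highest index down.
const toTemplateSketch = (text, detected) =>
    [...detected]
        .sort((a, b) => b.index - a.index)
        .reduce(
            (s, d) => s.slice(0, d.index) + `{{${d.token}}}` + s.slice(d.endIndex),
            text,
        );

toTemplateSketch('٣٤ - ', [
    { token: 'raqms', match: '٣٤', index: 0, endIndex: 2 },
    { token: 'dash', match: '-', index: 3, endIndex: 4 },
]);
// '{{raqms}} {{dash}} '
```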
suggestPatternConfig(detected)
Suggests the best pattern type and options based on detected patterns.
import { detectTokenPatterns, suggestPatternConfig } from 'flappa-doormal';
// For numbered patterns (hadith-style)
const hadithDetected = detectTokenPatterns("٣٤ - ");
suggestPatternConfig(hadithDetected);
// Returns: { patternType: 'lineStartsAfter', fuzzy: false, metaType: 'hadith' }
// For structural patterns (chapter markers)
const chapterDetected = detectTokenPatterns("باب الصلاة");
suggestPatternConfig(chapterDetected);
// Returns: { patternType: 'lineStartsWith', fuzzy: true, metaType: 'bab' }
analyzeTextForRule(text)
Complete analysis that combines detection, template generation, and config suggestion.
import { analyzeTextForRule } from 'flappa-doormal';
const result = analyzeTextForRule("٣٤ - حدثنا");
// Returns:
// {
// template: "{{raqms}} {{dash}} {{naql}}",
// patternType: 'lineStartsAfter',
// fuzzy: false,
// metaType: 'hadith',
// detected: [...]
// }
// Use the result to build a rule:
const rule = {
[result.patternType]: [result.template],
split: 'at',
fuzzy: result.fuzzy,
meta: { type: result.metaType }
};
Some tokens are composites (e.g. {{numbered}}), which are great for quick signatures but less convenient when you want to add named captures (e.g. capture the number).
You can expand composites back into their underlying template form:
import { expandCompositeTokensInTemplate } from 'flappa-doormal';
const base = expandCompositeTokensInTemplate('{{numbered}}');
// base === '{{raqms}} {{dash}} '
// Now you can add a named capture:
const withCapture = base.replace('{{raqms}}', '{{raqms:num}}');
// withCapture === '{{raqms:num}} {{dash}} '
SplitRule
type SplitRule = {
// Pattern (choose one)
lineStartsWith?: string[];
lineStartsAfter?: string[];
lineEndsWith?: string[];
template?: string;
regex?: string;
// Split behavior
split?: 'at' | 'after'; // Default: 'at'
occurrence?: 'first' | 'last' | 'all';
fuzzy?: boolean;
// Constraints
min?: number;
max?: number;
exclude?: (number | [number, number])[]; // Single page or [start, end] range
skipWhen?: string; // Regex pattern (tokens supported)
meta?: Record<string, unknown>;
};
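For example, a rule that restricts matching to a page window and skips a noisy range (the field values here are illustrative, not from a real book):

```typescript
import type { SplitRule } from 'flappa-doormal';

const chapterRule: SplitRule = {
    lineStartsWith: ['{{bab}}'],
    fuzzy: true,              // tolerate diacritic variation in باب headings
    min: 10,                  // ignore matches before page 10
    max: 300,                 // ...and after page 300
    exclude: [15, [40, 45]],  // skip page 15 and the range 40–45
    meta: { type: 'chapter' },
};
```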
Segment
type Segment = {
content: string;
from: number;
to?: number;
meta?: Record<string, unknown>;
};
DetectedPattern
Result from pattern detection utilities.
type DetectedPattern = {
token: string; // Token name (e.g., 'raqms', 'dash')
match: string; // The matched text
index: number; // Start index in original text
endIndex: number; // End index (exclusive)
};
// app/api/segment/route.ts (Next.js App Router)
import { segmentPages } from 'flappa-doormal';
import { NextResponse } from 'next/server';
export async function POST(request: Request) {
const { pages, rules } = await request.json();
const segments = segmentPages(pages, { rules });
return NextResponse.json({ segments });
}
// Node.js script
import { segmentPages, stripHtmlTags } from 'flappa-doormal';
const pages = rawPages.map((p, i) => ({
id: i + 1,
content: stripHtmlTags(p.html)
}));
const segments = segmentPages(pages, {
rules: [{
lineStartsAfter: ['{{raqms:num}} {{dash}} '],
split: 'at'
}]
});
console.log(`Found ${segments.length} segments`);
# Install dependencies
bun install
# Run tests
bun test
# Build
bun run build
# Run performance test (generates 50K pages, measures segmentation speed/memory)
bun run perf
# Lint
bunx biome lint .
# Format
bunx biome format --write .
Why {{token}}?
Single braces conflict with regex quantifiers {n,m}. Double braces are visually distinct and match common template syntax (Handlebars, Mustache).
lineStartsAfter vs lineStartsWith
- lineStartsWith: keeps the marker in content (for detection only)
- lineStartsAfter: strips the marker, capturing only the content (for clean extraction)
Fuzzy transforms are applied to raw Arabic text before wrapping in regex groups. This prevents corruption of regex metacharacters like (, ), |.
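The lineStartsWith / lineStartsAfter distinction can be illustrated with a plain regex, outside the library:

```javascript
// Plain-regex illustration of the two marker behaviors.
const line = '٣٤ - حدثنا أبو بكر';
const marker = /^[\u0660-\u0669]+\s*[-–—ـ]\s*/;

// lineStartsWith-style: the marker only decides WHERE to split;
// the segment content keeps it.
const withMarker = marker.test(line) ? line : null;

// lineStartsAfter-style: the marker is stripped from segment content.
const withoutMarker = line.replace(marker, '');
// withoutMarker === 'حدثنا أبو بكر'
```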
Complex logic is intentionally split into small, independently testable modules:
- src/segmentation/match-utils.ts: match filtering + capture extraction
- src/segmentation/rule-regex.ts: SplitRule → compiled regex builder (buildRuleRegex, processPattern)
- src/segmentation/breakpoint-utils.ts: breakpoint windowing/exclusion helpers, page boundary join normalization, and progressive prefix page detection for accurate from/to attribution
- src/segmentation/breakpoint-processor.ts: breakpoint post-processing engine (applies breakpoints after structural segmentation)
The library concatenates all pages into a single string for pattern matching across page boundaries. Memory usage scales linearly with total content size:
| Pages | Avg Page Size | Approximate Memory |
|---|---|---|
| 1,000 | 5 KB | ~5 MB |
| 6,000 | 5 KB | ~30 MB |
| 40,000 | 5 KB | ~200 MB |
For typical book processing (up to 6,000 pages), memory usage is well within Node.js defaults. For very large books (40,000+ pages), ensure adequate heap size.
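A back-of-envelope estimate, assuming memory is dominated by the concatenated content string as in the table above:

```javascript
// Rough ballpark: pages × average page size ≈ concatenated content size.
const estimateMB = (pageCount, avgPageKB) => (pageCount * avgPageKB) / 1024;

estimateMB(1000, 5);   // ≈ 4.9 MB
estimateMB(6000, 5);   // ≈ 29.3 MB
estimateMB(40000, 5);  // ≈ 195.3 MB
```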
See AGENTS.md for:
An interactive demo is available at flappa-doormal.surge.sh.
The demo source code is located in the demo/ directory and includes:
To run the demo locally:
cd demo
bun install
bun run dev
To deploy updates:
cd demo
bun run deploy
MIT