InterText provides pre-packaged solutions for a number of tasks in text formatting and typesetting that tend to show up frequently. I'm aiming at conducting comparative benchmarks and soundness checks for all solutions (see Benchmarks, below, for available data). The areas covered so far and planned for the future include:
InterText HYPH for hyphenating text in multiple languages (only en-US covered so far, but underlying software is multilingual and configurable).
InterText SLABS for segmenting and re-assembling text according to Unicode Standard Annex #14: Unicode Line Breaking Algorithm (UAX#14); this is useful to determine line breaking opportunities (LBOs) for running text. So far, ASCII spaces (U+0020), Soft Hyphens (U+00ad) and implicit CJK Inter-Character Breaks work.
InterText HTML for parsing and generating HTML markup.
InterText ?ANSI? for colorizing console output.
InterText ?TBL? for tabulating console output; includes facilities to determine display width of individual characters and running text, taking into account 'wide' and 'narrow' characters.
InterText ?FMT? for formatting numbers.
Implemented with mnater/hyphenopoly.
INTERTEXT.HYPH.hyphenate = ( text ) -> : Return the text with soft hyphens (U+00ad) inserted. For languages other than US English, INTERTEXT.HYPH.new_hyphenator = ( settings ) -> may in a future version be used to obtain a custom hyphenation function.
INTERTEXT.HYPH.count_soft_hyphens = ( text ) -> : Count occurrences of U+00ad in text.
INTERTEXT.HYPH.reveal_hyphens = ( text, replacement = '-' ) -> : Replace all soft hyphens with replacement.
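The two helper methods are simple enough to illustrate in a few lines. The following is a minimal JavaScript sketch of the described behavior, not InterText's actual implementation; the names `countSoftHyphens` and `revealHyphens` are hypothetical stand-ins.

```javascript
// Hypothetical stand-ins for the documented behavior; not the library's own code.
const SHY = '\u00ad'; // U+00ad SOFT HYPHEN

// Count occurrences of U+00ad in text.
function countSoftHyphens(text) {
  return text.split(SHY).length - 1;
}

// Replace all soft hyphens with `replacement` (a visible hyphen by default).
function revealHyphens(text, replacement = '-') {
  return text.split(SHY).join(replacement);
}

const hyphenated = `cro${SHY}mu${SHY}lent so${SHY}lu${SHY}tion`;
console.log(countSoftHyphens(hyphenated)); // 4
console.log(revealHyphens(hyphenated));    // cro-mu-lent so-lu-tion
```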
@slabs_from_text = ( text ) -> : Given a text, return an slb object as described below that describes all the UAX#14-compliant linebreak opportunities (LBOs) in that text.
@assemble = ( slb, first_idx = null, last_idx = null ) -> : Given an slb object and, optionally, two slab indices, return a line of text, honoring the intermediate and final LBOs as needed for typesetting.
slb Objects
An slb is a plain JS object with two attributes:
slb.slabs is a list of strings containing the individual subparts of the original text;
slb.ends is a string of codepoints (in the range U+0021..U+ffff, excluding surrogates, non-printables, specials, and whitespace) of the same length as slb.slabs, each code unit using one of a number of codes to describe how the end (right edge in the case of LTR scripts, left edge in the case of RTL scripts) of the corresponding slab is to be treated when re-assembling lines from slabs.
Given a text that one would like to break into properly hyphenated lines of approximately equal length of, say, up to 14 characters each:
a very fine day for a cromulent solution
The first step is to hyphenate the text. InterText HYPH.hyphenate text inserts 'Soft' (Discretionary) Hyphen characters (U+00ad) into the text, here symbolized with 🞛:
a very fine day for a cro🞛mu🞛lent so🞛lu🞛tion
Passing the hyphenated text to InterText SLABS.slabs_from_text() returns this slb object:
{ slabs: [
'a', 'very', 'fine',
'day', 'for', 'a',
'cro', 'mu', 'lent',
'so', 'lu', 'tion'
],
ends: '______||_||x'
}
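To make the relation between the hyphenated text and the slb object concrete, here is a deliberately simplified JavaScript sketch. It assumes that only ASCII spaces and soft hyphens count as break opportunities, so it is not a UAX#14 implementation and not the library's code; `slabsFromText` is a hypothetical name.

```javascript
const SHY = '\u00ad'; // U+00ad SOFT HYPHEN

// Simplified sketch: split on spaces and soft hyphens only (no CJK, no other UAX#14 rules).
function slabsFromText(text) {
  const slabs = [];
  let ends = '';
  let current = '';
  for (const chr of text) {
    if (chr === ' ' || chr === SHY) {
      if (current !== '') {
        slabs.push(current);
        ends += chr === ' ' ? '_' : '|'; // '_' marks a space, '|' a hyphenation point
        current = '';
      }
    } else {
      current += chr;
    }
  }
  if (current !== '') {
    slabs.push(current);
    ends += 'x'; // nothing follows the last slab
  }
  return { slabs, ends };
}

const text = `a very fine day for a cro${SHY}mu${SHY}lent so${SHY}lu${SHY}tion`;
console.log(slabsFromText(text).ends); // ______||_||x
```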
As it stands, SLABS slb objects use three different single-character markers in the ends string to indicate how to treat the corresponding slab with the same index:
x indicates 'none': insert nothing (empty string) whether non-final or final
_ indicates 'space': insert space (U+0020) when non-final, insert nothing (empty string) when final
| indicates 'hyphen': insert nothing when non-final, add hyphen (U+002d) when final
These may change in the future.
One can then use ( INTERTEXT.SLABS.assemble slb, 0, idx for idx in [ 0 ... slb.slabs.length ] ) to re-assemble all possible initial lines:
a 0 0 1
a very 0 1 6
a very fine 0 2 11
a very fine day 0 3 15
a very fine day for 0 4 19
a very fine day for a 0 5 21
a very fine day for a cro- 0 6 26
a very fine day for a cromu- 0 7 28
a very fine day for a cromulent 0 8 31
a very fine day for a cromulent so- 0 9 35
a very fine day for a cromulent solu- 0 10 37
a very fine day for a cromulent solution 0 11 40
We can stop at the third iteration (idx == 2) since that yields a line that fits into the desired length while the next one exceeds our 14-character limit. Continuing with a first_idx of 3, the candidates for the second line are:
day 3 3 3
day for 3 4 7
day for a 3 5 9
day for a cro- 3 6 14
day for a cromu- 3 7 16
which gives us day for a cro- as second line. Going on, one arrives at this finely formatted paragraph:
--------------
a very fine
day for a cro-
mulent solu-
tion
--------------
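The walk-through above can be condensed into a short JavaScript sketch. It is an illustration under stated assumptions rather than the library's implementation: `assemble` follows the three end markers as described, `wrap` is a hypothetical greedy helper, and `String#length` stands in for display width purely for demonstration.

```javascript
// slb object copied from the example above.
const slb = {
  slabs: ['a', 'very', 'fine', 'day', 'for', 'a', 'cro', 'mu', 'lent', 'so', 'lu', 'tion'],
  ends: '______||_||x',
};

// Join slabs firstIdx..lastIdx (inclusive) into one line, honoring the end markers.
function assemble(slb, firstIdx = 0, lastIdx = slb.slabs.length - 1) {
  let line = '';
  for (let idx = firstIdx; idx <= lastIdx; idx++) {
    const isFinal = idx === lastIdx;
    const end = slb.ends[idx];
    line += slb.slabs[idx];
    if (end === '_' && !isFinal) line += ' '; // space only between slabs
    if (end === '|' && isFinal) line += '-';  // hyphen only at the end of a line
  }
  return line;
}

// Greedily put as many slabs on each line as fit into maxLength characters.
function wrap(slb, maxLength) {
  const lines = [];
  let firstIdx = 0;
  while (firstIdx < slb.slabs.length) {
    let lastIdx = firstIdx; // always take at least one slab, even if it is too long
    while (
      lastIdx + 1 < slb.slabs.length &&
      assemble(slb, firstIdx, lastIdx + 1).length <= maxLength
    ) lastIdx++;
    lines.push(assemble(slb, firstIdx, lastIdx));
    firstIdx = lastIdx + 1;
  }
  return lines;
}

console.log(wrap(slb, 14).join('\n'));
// a very fine
// day for a cro-
// mulent solu-
// tion
```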
Hardly rocket science but also best not coded in too much of a cobbled-together ad-hoc way, all the more since this barely scratches the surface of the complexities in line-oriented typesetting, which include but are not limited to the following considerations:
When the output is indeed monospaced as shown here, we still have to take care of wide glyphs (e.g. Chinese characters); InterText ?TBL? will provide solutions for that. Generally speaking, using JavaScript String#length as a proxy for display width is a bad idea and has only been done here for presentation.
When lines are considerably longer than the average slab width, a lot of unnecessary computations are performed. In real life situations, it will probably be more performant to estimate how many slabs will fit onto a given line and start looking from there instead of trying out all the solutions that are probably much too short anyway.
Outside of the most restricted of environments, ligatures have to be taken into account, meaning that one has to either reconstruct font metrics in one's software (don't, see the next point) or else try each line candidate in the targeted application (e.g. a web browser) and retrieve the resulting typeset lengths. Needless to say, this will exacerbate performance considerations, so it is best to strive to limit the number of attempts needed for each line.
If text with mixed styles (different fonts, italic, bold, subscripts) is taken into consideration, all of a sudden the task shifts from "let's just reconstruct the metrics of this TTF font so we can add all the character widths" to "let's write a full-fledged universal font rendering engine that takes account of all the OpenType features and all the scripts and languages of the world". In other words, don't. Even. Try. Instead, use an existing piece of software.
I still believe that under many circumstances, hyphenation paired with 'slabification' gives a good enough approximation to cut down the number of line candidates in a meaningful way, especially when the typesetting algorithm used to turn slabs into paragraphs has a good grasp on the spatial statistics of what it is trying to achieve (as in 'most lines contain between x and y English slabs, and each CJK codepoint is worth around 0.8 English slabs on average'). One can't partition a long text in one go from end to end with confidence using these estimates, but one can use such numbers as a starting point to estimate how many of a given sequence of slabs will probably fit into a given line.
In advanced typesetting, and maybe even when outputting to the console or typesetting a technical manual in all-monospace, using hanging punctuation may result in a more balanced look. One will then have to adjust the right edge (and maybe the left one, too) depending on the last (and first) characters of each candidate line.
CSS properties like word-spacing and letter-spacing as well as variable fonts provide an opportunity to typeset material (almost) imperceptibly denser or to distribute excessive whitespace among spaces proper, inter-letter spacing, and stretched letters. This means that depending on preferences, it may be allowable to put material into a single line that is just a teeny bit too long by condensing letter shapes or tracking just a teeeeeny bit.
Some writing systems (Arabic, Hebrew) allow or call for elongated letters that depend on available space; others may not use hyphens when breaking words.
When scripts are mixed, boundaries between two different scripts require our attention. This is a considerably more vexing problem when mixing LTR (left-to-right) and RTL (right-to-left) scripts than in, say, mixing Latin and CJK in a paragraph, but this is not to say the latter isn't blessed with a good number of interesting questions that do not necessarily have unique answers.
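The idea of starting the search from an estimate instead of trying every candidate from the shortest upward can be sketched as follows. This is a hypothetical helper, not part of InterText; `fits( firstIdx, lastIdx )` is an assumed callback that reports whether the candidate line fits (for example by measuring it in the target application), and the sketch assumes that a longer candidate never fits when a shorter one does not.

```javascript
// Hypothetical helper: find the last slab index that still fits on a line,
// starting from an estimated slab count rather than from a single slab.
function lastFittingIdx(firstIdx, slabCount, estimate, fits) {
  // Clamp the estimate to the slabs that are actually available.
  let lastIdx = Math.min(slabCount - 1, Math.max(firstIdx, firstIdx + estimate - 1));
  let backedOff = false;
  // Estimate too optimistic: drop slabs until the candidate fits.
  while (lastIdx > firstIdx && !fits(firstIdx, lastIdx)) {
    lastIdx--;
    backedOff = true;
  }
  // Estimate not too optimistic: extend while the next candidate still fits.
  if (!backedOff) {
    while (lastIdx + 1 < slabCount && fits(firstIdx, lastIdx + 1)) lastIdx++;
  }
  return lastIdx;
}

// Demonstration with plain words and String#length as a stand-in for real measurement.
const words = ['a', 'very', 'fine', 'day', 'for', 'a'];
let attempts = 0;
const fits = (firstIdx, lastIdx) => {
  attempts++;
  return words.slice(firstIdx, lastIdx + 1).join(' ').length <= 14;
};

console.log(lastFittingIdx(0, words.length, 4, fits)); // 2
console.log(attempts);                                  // 2
```

With a good estimate, only a couple of candidates have to be measured per line, which matters when each measurement means a round trip to a rendering engine.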
The addressable unit of memory on the NCR 315 series is a "slab", short for "syllable", consisting of 12 data bits and a parity bit. Its size falls between a byte and a typical word (hence the name, 'syllable'). A slab may contain three digits (with at sign, comma, space, ampersand, point, and minus treated as digits) or two alphabetic characters of six bits each.—Wikipedia, "NCR 315"
Slabs used to be known as 'Logotypes' in typesetting:
There were later attempts to speed up the typesetting process by casting syllables or entire words as one piece. Those pieces were called logotypes—from Ancient Greek “lógos” meaning “word”.—(typography.guru)[https://typography.guru/journal/words-and-phrases-in-common-use-which-originated-in-the-field-of-typography-r78/]
HTML parsing uses atlassubbed/atlas-html-stream to turn HTML5 texts into series of datoms. Two HTML formats are supported:
Unless you know what you're after you'll probably want to use the plain HTML5 flavor.
After { HTML, } = require 'intertext', use one of these methods:
HTML.html_as_datoms = ( text ) ->
to turn HTML fragments or entire documents into a list of datoms, or
HTML.mkts_html_as_datoms = ( text ) ->
to do the same with MKTScript.
Both methods work pretty much the same and are the inverse operations to HTML.datom_as_html():
Opening tags are turned into datoms whose $key is the tagname prefixed with the left pointy bracket as sigil, with attribute name/value pairs becoming properties of the datom.
Closing tags are turned into datoms whose $key is the tagname prefixed with the right pointy bracket as sigil.
A further kind of datom has a $key that is the tagname prefixed with the caret as sigil (the example below shows ^doctype).
Texts are turned into datoms whose $key is ^text and whose contents are stored under the text property.
In SteamPipe streams, use the transforms returned by $html_as_datoms() and $mkts_html_as_datoms() for the same functionality; both transforms accept texts and buffers as inputs.
{ HTML, } = require 'intertext'
HTML.datom_as_html = ( d ) ->
For the tagname:
d.$key will become the tagname
For the attributes:
true (the boolean, not the text) will be turned into 'lone attributes', such that { $key: '<p', contenteditable: true, } will result in <p contenteditable>
' (single quotes)
'' (two single quotes)
$ will be ignored
If d.$value is an object, its facets will be turned into HTML attributes; all other keys are ignored
Open questions:
[, ~, ])?
# `log` is assumed to be a plain console logger
log = console.log
text = """<!DOCTYPE html>
<h1><strong>CHAPTER VI.</strong> <name ref=hd553>Humpty Dumpty</h1>
<p id=p227>However, the egg only got larger and larger, and <em>more and more human</em>:<br>
when she had come within a few yards of it, she saw that it had eyes and a nose and mouth; and when she
had come close to it, she saw clearly that it was <name ref=hd556>HUMPTY DUMPTY</name> himself. ‘It can’t
be anybody else!’ she said to herself.<br/>
‘I’m as certain of it, as if his name were written all over his face.’
"""
# keep the datoms around so they can be turned back into HTML below
datoms = HTML.html_as_datoms text
for d in datoms
  log JSON.stringify d
log '-'.repeat 108
log ( HTML.datom_as_html d for d in datoms ).join ''
... will produce:
{ "$key": "^doctype", "$value": "html", }
{ "$key": "^text", "text": "\n", }
{ "$key": "<h1", }
{ "$key": "<strong", }
{ "$key": "^text", "text": "CHAPTER VI.", }
{ "$key": ">strong", }
{ "$key": "^text", "text": " ", }
{ "$key": "<name", "ref": "hd553", }
{ "$key": "^text", "text": "Humpty Dumpty", }
{ "$key": ">h1", }
{ "$key": "^text", "text": "\n\n", }
{ "$key": "<p", "id": "p227", }
{ "$key": "^text", "text": "However, the egg only got larger and larger, and ", }
{ "$key": "<em", }
{ "$key": "^text", "text": "more and more human", }
{ "$key": ">em", }
{ "$key": "^text", "text": ":", }
{ "$key": "<br", }
{ "$key": "^text", "text": "\n\nwhen she had come within ... she saw clearly that it was ", }
{ "$key": "<name", "ref": "hd556", }
{ "$key": "^text", "text": "HUMPTY DUMPTY", }
{ "$key": ">name", }
{ "$key": "^text", "text": " himself. ‘It can’t\nbe anybody else!’ she said to herself.", }
{ "$key": "<br", }
{ "$key": ">br", }
{ "$key": "^text", "text": "\n\n‘I’m as certain ... all over his face.’\n", }
<!DOCTYPE html>
<h1><strong>CHAPTER VI.</strong> <name ref=hd553>Humpty Dumpty</h1>
<p id=p227>However, the egg only got larger and larger, and <em>more and more human</em>:<br>
when she had come within a few yards of it, she saw that it had eyes and a nose and mouth; and when she
had come close to it, she saw clearly that it was <name ref=hd556>HUMPTY DUMPTY</name> himself. ‘It can’t
be anybody else!’ she said to herself.<br></br>
‘I’m as certain of it, as if his name were written all over his face.’
As can be seen, no validation will be done, and the parser will happily produce events for unclosed and unbalanced closing tags. There is a minor issue with the <br></br> tag pair which will get resolved in a future version.
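The datom-to-HTML direction shown in the example can be approximated in a few lines of JavaScript. This is a sketch of the documented conventions only, not the code behind HTML.datom_as_html(). `datomAsHtml` is a hypothetical name. Skipping keys that start with `$` and writing attribute values unquoted (as in the sample output) are assumptions, so values containing spaces or quotes are not handled.

```javascript
// Hypothetical sketch of turning a single datom back into HTML.
function datomAsHtml(d) {
  const key = d.$key;
  if (key === '^text') return d.text;
  if (key === '^doctype') return `<!DOCTYPE ${d.$value}>`;
  const sigil = key[0];
  const tagname = key.slice(1);
  if (sigil === '>') return `</${tagname}>`;
  if (sigil !== '<') return ''; // other kinds of datoms are skipped in this sketch
  let attributes = '';
  for (const [name, value] of Object.entries(d)) {
    if (name.startsWith('$')) continue; // assumption: `$`-prefixed keys are not attributes
    // `true` becomes a lone attribute; other values are written unquoted (assumption).
    attributes += value === true ? ` ${name}` : ` ${name}=${value}`;
  }
  return `<${tagname}${attributes}>`;
}

const datoms = [
  { $key: '^doctype', $value: 'html' },
  { $key: '^text', text: '\n' },
  { $key: '<h1' },
  { $key: '<strong' },
  { $key: '^text', text: 'CHAPTER VI.' },
  { $key: '>strong' },
  { $key: '^text', text: ' ' },
  { $key: '<name', ref: 'hd553' },
  { $key: '^text', text: 'Humpty Dumpty' },
  { $key: '>h1' },
];

console.log(datoms.map(datomAsHtml).join(''));
// <!DOCTYPE html>
// <h1><strong>CHAPTER VI.</strong> <name ref=hd553>Humpty Dumpty</h1>
```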
Against 100,000 words randomly selected anew for each test case from /usr/share/dict/american-english (102,305 words) over 5 runs, total time needed was 32s; observe:
fresh
hyphenate_mnater_hyphenopoly_sync 242,819 Hz 100.0 % │████████████▌│
hyphenate_sergeysolovev_hyphenated 176,003 Hz 72.5 % │█████████ │
hyphenate_bramstein_hypher 107,437 Hz 44.2 % │█████▌ │
hyphenate_ytiurin_hyphen 658 Hz 0.3 % │ │
These figures have been reproduced several times; if we do not re-generate the selection of words for each test case but have all hyphenators hyphenate the same collection of words, performance seems to improve slightly:
same
hyphenate_mnater_hyphenopoly_sync 345,892 Hz 100.0 % │████████████▌│
hyphenate_sergeysolovev_hyphenated 219,550 Hz 63.5 % │███████▉ │
hyphenate_bramstein_hypher 121,050 Hz 35.0 % │████▍ │
hyphenate_ytiurin_hyphen 707 Hz 0.2 % │ │
Curiously, when only a single run is done, bramstein/hypher and sergeysolovev/hyphenated change places and, curiouser still, almost exactly swap their relative performances; also note how overall performance seems to drop:
00:09 BENCHMARKS ▶ hyphenate_mnater_hyphenopoly_sync 144,789 Hz 100.0 % │████████████▌│
00:09 BENCHMARKS ▶ hyphenate_bramstein_hypher 108,914 Hz 75.2 % │█████████▍ │
00:09 BENCHMARKS ▶ hyphenate_sergeysolovev_hyphenated 46,895 Hz 32.4 % │████ │
00:09 BENCHMARKS ▶ hyphenate_ytiurin_hyphen 638 Hz 0.4 % │ │
/etc/dictionaries-common/words, total 102,305 English words (so probably the exact same as /usr/share/dict/american-english)
hypher would appear to have a rather serious flaw in that it insists on inserting a hyphen before the last letter of a word when that word ends in an apostrophe (or a single quote) plus letter s to indicate a genitive (so far I have not tested whether that strange behavior also occurs in other situations involving apostrophes or quotes); this occurs in 3,057 (3%) of all words in the list:
hyphenopoly hypher
—————————————————————————————————————————————————————————
thun-der-storm’s thun-der-stor-m’s
tib-ia’s tib-i-a’s
tights’s tight-s’s
time-stamp’s time-stam-p’s
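A check for this pattern is easy to script. The following JavaScript sketch is an illustration of how such cases could be flagged, not the procedure actually used for the figures above; the regular expression is an assumption about what counts as a suspect genitive.

```javascript
// A hyphen directly before a single letter that is followed by
// an apostrophe (or single quote) plus a word-final s.
const suspectGenitive = /-\p{L}['’]s$/u;

// Hyphenated samples taken from the comparison table above.
const samples = [
  'thun-der-storm’s', 'thun-der-stor-m’s',
  'tib-ia’s',         'tib-i-a’s',
  'tights’s',         'tight-s’s',
  'time-stamp’s',     'time-stam-p’s',
];

const flagged = samples.filter((word) => suspectGenitive.test(word));
console.log(flagged);
// [ 'thun-der-stor-m’s', 'tib-i-a’s', 'tight-s’s', 'time-stam-p’s' ]
```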
In a very small number of words (36 or 0.035%), hyphenopoly inserts fewer hyphens than hypher; many of these have letters with diacritics; observe that some words with diacritics are hyphenated by hyphenopoly:
hyphenopoly hypher
—————————————————————————————————————————————————————————
Düssel-dorf Düs-sel-dorf
Es-terházy Es-ter-házy
Furtwängler Furtwän-gler
Göteborg Göte-borg
Pokémon Poké-mon
Pétain Pé-tain
abbés ab-bés
as-so-ciate as-so-ci-ate
as-so-ciates as-so-ci-ates
châtelaine châte-laine
châtelaines châte-laines
clientèle clien-tèle
clientèles clien-tèles
croûton croû-ton
croûtons croû-tons
di-vorcée di-vor-cée
di-vorcées di-vor-cées
décol-leté dé-col-leté
détente dé-tente
flambéed flam-béed
ingénue in-génue
ingénues in-génues
matinée mat-inée
matinées mat-inées
present pre-sent
presents pre-sents
project pro-ject
projects pro-jects
protégé pro-tégé
protégés pro-tégés
précis pré-cis
précised pré-cised
précis-ing pré-cis-ing
recherché recher-ché
reci-procity rec-i-proc-ity
smörgåsbord smörgås-bord
In terms of speed, ytiurin/hyphen is clearly the loser, being up to almost 500 times slower than the consistently fastest hyphenator, mnater/hyphenopoly.
bramstein/hypher and sergeysolovev/hyphenated vie for second place to the extent that modifying the test setup somewhat will make them change places; however, at least bramstein/hypher has some serious flaws, which seems surprising in view of its popularity. Given their poor configurability and the fact that they take two to four times as long as hyphenopoly while apparently not catching more opportunities than that library, the choice becomes a very easy one.
mnater/hyphenopoly is the clear winner: it has the most extensive tweaking configuration (including per-language exceptions, minimum number of letters to be left on both ends of words and so on); it is extensively documented (see https://github.com/mnater/Hyphenopoly/docs). In no case have we observed a hyphen placement that could be termed unacceptable. If anything, hyphenopoly misses some obvious opportunities; in particular, it seems to have an aversion (but not a strict taboo) against hyphenating words in the genitive case. That, be it said, is still much better than suggesting to write tight-s’s as hypher would have it.
_format = require 'number-format.js'
format_float = ( x ) -> _format '#,##0.000', x
format_integer = ( x ) -> _format '#,##0.', x
format_as_percentage = ( x ) -> _format '#,##0.00', x * 100
width_of
JS regex unicode properties:
/\p{Script_Extensions=Latin}/u
/\p{Script=Latin}/u
/\p{Script_Extensions=Cyrillic}/u
/\p{Script_Extensions=Greek}/u
/\p{Unified_Ideograph}/u
/\p{Script=Han}/u
/\p{Script_Extensions=Han}/u
/\p{Ideographic}/u
/\p{IDS_Binary_Operator}/u
/\p{IDS_Trinary_Operator}/u
/\p{Radical}/u
/\p{White_Space}/u
/\p{Script_Extensions=Hiragana}/u
/\p{Script=Hiragana}/u
/\p{Script_Extensions=Katakana}/u
/\p{Script=Katakana}/u
regex_cid_ranges =
  hiragana: '[\u3041-\u3096]'
  katakana: '[\u30a1-\u30fa]'
  kana: '[\u3041-\u3096\u30a1-\u30fa]'
  ideographic: '[\u3006-\u3007\u3021-\u3029\u3038-\u303a\u3400-\u4db5\u4e00-\u9fef\uf900-\ufa6d\ufa70-\ufad9\u{17000}-\u{187f7}\u{18800}-\u{18af2}\u{1b170}-\u{1b2fb}\u{20000}-\u{2a6d6}\u{2a700}-\u{2b734}\u{2b740}-\u{2b81d}\u{2b820}-\u{2cea1}\u{2ceb0}-\u{2ebe0}\u{2f800}-\u{2fa1d}]'
Should be extensible (extending/diminishing existing categories, add new ones)
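As an illustration of how such Unicode property escapes could feed a width_of function, here is a minimal JavaScript sketch. It assumes that Han ideographs and kana occupy two columns and everything else one column. This is a simplification that ignores combining marks, fullwidth forms and emoji, and `widthOf` is a hypothetical name, not the planned API.

```javascript
// Assumption: ideographs and kana are 'wide' (2 columns); everything else is 'narrow' (1 column).
const wide = /[\p{Script=Han}\p{Script=Hiragana}\p{Script=Katakana}]/u;

// Sum the assumed column widths of all codepoints in `text`.
function widthOf(text) {
  let width = 0;
  // for...of iterates over codepoints, so astral ideographs count once.
  for (const chr of text) width += wide.test(chr) ? 2 : 1;
  return width;
}

console.log(widthOf('cromulent')); // 9
console.log(widthOf('日本語'));     // 6
console.log(widthOf('a 漢字 b'));   // 8
```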