The word-division algorithm developed for TeX by Donald Knuth and Frank Liang is not the only algorithm for the purpose, but has some attractive features: it has been freely available for twenty years, with thorough documentation in volumes A and B of Knuth's Computers & Typesetting, and, because it is guided by pattern sets in a specified form, it benefits from many patterns compiled by the TeX user community that support word division in a variety of languages. Many of the patterns were not compiled ad hoc, but using a documented method like that in Liang's dissertation, and are paired with published papers describing how they were compiled and measuring their effectiveness in finding acceptable and avoiding unacceptable division points. No longer used only in TeX, the algorithm has been incorporated into the typesetting programs groff from the GNU project and troff from Plan9. Because of its popularity, it has also been made available in the programming languages Perl and Ruby. An earlier port to the PostScript language was done by Olavi Sakari.
Mr. Sakari's port of the algorithm must be credited for establishing the practicality of such a project. On a benchmark (below), it hyphenated words at about 1/46 the speed of TeX itself, compiled code, on the same hardware, while occupying about 445 kilobytes of interpreter memory. That is quite usable performance, and quite respectable for an essentially direct, partial port of TeX's original Pascal data structures to an interpreted language.
The purposes of a new PostScript port were to explore the performance effect of recreating the algorithm using native PostScript data structures rather than replicas of the originals, in particular using PostScript dictionaries in place of Knuth's original trie structures, and to provide several additional features:

- support for Adobe's ISOLatin1Encoding as well as common encodings such as iso-8859-1 (which is not equivalent) and utf-8;
- among them utf-8, a single encoding that is PostScript-friendly and accommodates any language that can be written in Unicode.
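For readers who have not seen the pattern mechanism before, the Knuth-Liang matching step is compact enough to sketch. The following Python fragment (an illustration only, not part of the ProcSet) uses a plain dictionary keyed by each pattern's letters in place of the packed trie, the same substitution this port makes with PostScript dictionaries; the tiny pattern set is the classic example from Liang's dissertation.

```python
# Minimal sketch of Knuth-Liang hyphenation. A dictionary keyed by the
# letters of each pattern stands in for the packed trie of the original.
# Patterns interleave digits with letters; at each inter-letter position
# the highest digit from any matching pattern wins, and odd digits mark
# acceptable division points.

def parse(pattern):
    """Split 'hy3ph' into letters 'hyph' and digit positions {2: 3}."""
    letters, digits, pos = "", {}, 0
    for ch in pattern:
        if ch.isdigit():
            digits[pos] = int(ch)
        else:
            letters += ch
            pos += 1
    return letters, digits

def hyphenate(word, patterns, leftmin=2, rightmin=3):
    table = dict(parse(p) for p in patterns)
    w = "." + word + "."               # dots mark the word boundaries
    values = [0] * (len(word) + 1)     # values[k]: weight before word[k]
    for i in range(len(w)):
        for j in range(i + 1, len(w) + 1):
            if w[i:j] in table:
                for pos, d in table[w[i:j]].items():
                    k = i + pos - 1    # offset into the bare word
                    if 0 <= k <= len(word):
                        values[k] = max(values[k], d)
    out = ""
    for k, ch in enumerate(word):
        if leftmin <= k <= len(word) - rightmin and values[k] % 2 == 1:
            out += "-"
        out += ch
    return out

liang_demo = ["hy3ph", "he2n", "hena4", "hen5at", "1na", "n2at",
              "1tio", "2io", "o2n"]
print(hyphenate("hyphenation", liang_demo))   # hy-phen-ation
```

Every substring of the dotted word is looked up in the dictionary, the highest digit wins at each inter-letter position, and odd digits that respect the leftmin and rightmin limits become division points.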
Benchmarking suggests that redesigning to exploit the native data structures of PostScript does offer substantial benefit: although the list of added features above could raise fears of bloat, net.anastigmatix.Hyphenate hyphenates words at about 1/8 the speed of compiled TeX on the same hardware, occupying about 300 kilobytes of interpreter memory with the plain TeX English patterns loaded. In the event a multi-byte character encoding is used for either the patterns or the input text, and at least one multi-byte-encoded character does appear in a pattern or exception entry, Hyphenate falls back to a fully general but slower procedure for encoding and case mapping, with roughly a 13 percent speed penalty.

The benchmark, described below, also verifies that Hyphenate's word divisions do match those of TeX itself on a list of 234,964 English words, with only these exceptions:
chaffless raffishly snuffless cliffless raffishness tariffless raffish skiffless trufflesque
The exceptions seem to be the only words in the list that both satisfy the lefthyphenmin and righthyphenmin settings and include what TeX would make automatic ffi or ffl ligatures, the subject of “TeX's interesting approach” that is not described in a double-dangerous-bend note on page 455 of The TeXbook.
Hyphenate is a ProcSet resource. To make it available to your own code, include in the setup section of your file:

/net.anastigmatix.Hyphenate /ProcSet findresource
The findresource will succeed, leaving a dictionary on the operand stack, if you have made the Hyphenate resource file [download] available in any of these ways:

- you have copied the resource file bodily into your file's prolog, ahead of the findresource (which belongs in the setup section);
- you have installed the resource file where your PostScript interpreter looks for resources;
- you use the %%DocumentNeededResources and %%IncludeResource DSC comments: you include these comments at the right position in your file to specify that it needs net.anastigmatix.Hyphenate, your document manager software is configured to automatically insert needed resources in files being printed, and you have put the Hyphenate resource file where your document manager can find it.
Hyphenate relies on several other resources, and you will need those files also. If you use the first method, directly including resources in your file's prolog, the prolog has to contain all of the needed resources, in any order as long as no resource comes before another resource it needs, and categories come before resources that belong in them. Any of the other methods should just work as long as all the files are where they need to be. These are the resources you will need:
Resource | Category | Description
---|---|---
net.anastigmatix.MetaPre | ProcSet | Staged-programming extensions for PostScript.
net.anastigmatix.hyphenpattern | Category | Category to contain hyphen-pattern resources usable with Hyphenate.
choice of pattern | net.anastigmatix.hyphenpattern | The hyphen pattern resource you intend to use. The one named hyphen.tex contains the original TeX patterns for US English derived by Frank Liang.
net.anastigmatix.encoding | Category | Category for resources that describe character-encoding schemes. Several of the common encodings—us-ascii, iso-8859-1, utf-8, and Adobe's ISOLatin1Encoding (which is similar, but not identical, to iso-8859-1)—are defined in the category resource file itself, so no separate encoding resource file is needed.
choice of encoding | net.anastigmatix.encoding | An encoding resource that describes the character encoding you have used in your text; it should ordinarily correspond to the encoding vector you will use with your fonts (see below). Some common encodings are in the category file itself, so you do not need another file here; see above for the list.
net.anastigmatix.unicodedata | Category | Category for resources that supply data from the Unicode character database.
CaseFoldingC+S | net.anastigmatix.unicodedata | Describes the simple upper-to-lowercase mappings for Unicode characters.
net.anastigmatix.Hyphenate | ProcSet | The main attraction.
The resource files are in a compact form. That is for efficiency, not to keep you from viewing them; there is a script for that on the resource packaging page.
The Hyphenate dictionary is read-only. It contains only one entry, the procedure hyphenator. There is no need to put the dictionary on the lookup stack.

The dictionary obtained with /net.anastigmatix.Hyphenate /ProcSet findresource contains a single entry:
Given pattern, a dictionary returned by findresource for the chosen hyphen pattern, and inputcoding, returned by findresource for the chosen input encoding, hyphenator generates the dictionary dict that can be used for hyphenating words encoded in inputcoding according to the pattern pattern. The generated dictionary has five entries, lefthyphenmin, righthyphenmin, uchyph, hyphenation, and query, described next.
The dictionary generated by hyphenator based on a given hyphen pattern and input encoding contains these entries:

lefthyphenmin: The length in characters of the leading part of a word that will not be divided. The default is the same as that in plain TeX.

righthyphenmin: The length in characters of the trailing part of a word that will not be divided. The default is the same as that in plain TeX.

uchyph: Whether any word that begins with an uppercase letter (that is, a letter that is mapped to some other character by the CaseFoldingC+S resource) should be hyphenated. The default is the same as that in plain TeX. If this flag is set to false, query will immediately return all zeros (no acceptable division point) for any word that begins with an uppercase letter.
query: Return in string an indication of the acceptable and unacceptable division points in the string word, which is assumed to be in the input encoding specified when hyphenator generated the dictionary. The length of string is one less than the number of characters in word, which is not necessarily the length in bytes of word except when a single-byte encoding is in use. The first byte of string indicates whether division is acceptable between the first and second characters of word, and so on. Any odd value indicates division is acceptable at that position; any even value indicates it is not.
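As a concrete illustration of the convention (in Python, outside PostScript): the sketch below inserts a hyphen wherever the corresponding result byte is odd. The sample result bytes are those shown for (example) in the worked example later on this page.

```python
# Interpret a query result: byte k of the result says whether division is
# acceptable between characters k and k+1 of the word (odd = acceptable).
def apply_query(word: str, result: bytes) -> str:
    assert len(result) == len(word) - 1
    out = word[0]
    for ch, mark in zip(word[1:], result):
        if mark % 2 == 1:
            out += "-"
        out += ch
    return out

# The result shown for (example) is (\012\001\004\003\012\012):
print(apply_query("example", bytes([0o12, 0o1, 0o4, 0o3, 0o12, 0o12])))
# ex-am-ple
```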
hyphenation: Add the elements of words, an array, as hyphenation exceptions. Each element is a word with the hyphen character U+002D HYPHEN-MINUS inserted at any acceptable division point. All other positions are considered unacceptable. To add an exception word without hyphens is to prevent its being divided at all. Words are folded to lowercase for comparison, so an exception governs the division of any matching word regardless of case (except for words beginning with uppercase letters when uchyph is false, which are not divided at all).

To allow a convenient form of input, words may be names or strings, and it does not matter if the words array is literal or executable, so these examples both work:

[ (man-u-script) /man-u-scripts (ap-pen-dix) ] hyphenation
{ man-u-script man-u-scripts ap-pen-dix } hyphenation
If local allocation was in effect when hyphenator was executed, the exception dictionary is in local VM and subject to save and restore, unlike TeX where exceptions are cumulative without respect to grouping. If hyphenator was run with global allocation in effect, then the exception dictionary is not subject to save and restore, but then hyphenation can only be executed when global allocation is in effect, and the exceptions must either be names or globally-allocated strings.
It must be possible to represent all of the words in the same encoding used by the hyphenation patterns in use, which was fixed when the patterns were compiled and need not be the same as the input encoding. When patterns are coded in utf-8, which can express all of Unicode, no problem is possible.

If the input coding in use does not include the U+002D HYPHEN-MINUS character, it is not possible to do anything with this procedure except specify words not to be hyphenated at all. Adobe's ISOLatin1Encoding does have it, but at the goofy location AD hex/255 octal, so unless your keyboard has a key for that, if you are using this encoding you would add exceptions this way:

{ (man\255u\255script) } hyphenation
The ordinary use of Hyphenate would be within a larger library for setting filled, justified text, but that would make an oversized example. Here is a sample, though, of code that could appear in a program that wanted to set up Hyphenate and eventually query for the hyphenation of the word example.

% select the desired hyphen pattern.
% hyphen.tex is Knuth's original US English:
/hyphen.tex /net.anastigmatix.hyphenpattern findresource
% what is the encoding of the input file?
/iso-8859_1 /net.anastigmatix.encoding findresource
% set up hyphenation:
/net.anastigmatix.Hyphenate /ProcSet findresource
/hyphenator get % hyphenator is the only thing in Hyphenate's dictionary
% stack now holds: hyphenpattern inputcoding hyphenator
exec % hyphenation dict
/myhyph exch def
myhyph begin userdict begin
/lefthyphenmin 2 store % unnecessary in real life, 2 is default
{ man-u-script man-u-scripts } hyphenation % add some exceptions
(example) query == % result: (\012\001\004\003\012\012)
% the result means (odd numbers are division points) ex-am-ple
% or use the hyphenator without its dictionary on the lookup stack:
end end
{ ap-pen-dix } myhyph /hyphenation get exec
(appendix) myhyph /query get exec == % (\012\001\000\000\001\012\012)
The \012s in the result (which some interpreters may show as \n) reserve the positions that can never be division points according to lefthyphenmin and righthyphenmin. As always, an even number is an unacceptable division, an odd number an acceptable one.
The DSC comments were omitted above for brevity, but you can view or download the complete example file here, either for viewing in a PostScript viewer or as text. It is commented for Language Level 3 only because hyphen patterns, to save file size, may use Flate compression; they do not use any other Level 3 features and can be made available in longer, uncompressed, Level 2 versions.
The “bare file” download will only run properly if you have already downloaded the necessary resources to your viewer or printer. The “self-contained file” version has been through a script that replaced each IncludeResource comment with the corresponding resource file—being sure to do the same for other IncludeResource comments in the included file—so it can be sent to any PostScript printer or viewer without any prearrangements.
 | Bare file | Self-contained file
---|---|---
Hyphenation setup example | PostScript view, text view | PostScript view, text view
There are only 256 possible byte values, but many more known characters in Unicode and many more glyphs in a font. Various encodings exist to make the most-often-needed characters in a given language easy to represent in strings of bytes. Many have been defined by standards bodies independent of PostScript, and some are defined by Adobe and little used except in PostScript code. Examples of standard encodings are us-ascii, iso-8859-1, and utf-8; Adobe encodings include StandardEncoding, ISOLatin1Encoding, and CE.
Two encodings will typically matter for working in PostScript: the encoding that maps numbers to glyphs in a font, determined by the font's encoding vector, and the encoding of the text in the file, determined by your text editor settings. Life offers the fewest surprises when these are the same, and that can be arranged by explicitly setting up fonts with encoding vectors that correspond to the encoding used in editing the text.
It is also possible to edit the file in one encoding, use another encoding vector for the fonts, and use PostScript code to transcode strings before they are shown. The net.anastigmatix.encoding resources can transcode several common encodings to and from Unicode, so it is possible to transcode from one encoding to another—for those characters present in both encodings—by going through Unicode as a “pivot encoding.”
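As a concrete illustration of the pivot idea (in Python rather than PostScript), transcoding is just a decode to Unicode code points followed by an encode in the target scheme:

```python
# Transcode between two byte encodings by pivoting through Unicode:
# decode the source bytes to code points, then encode those code points
# in the target encoding. The encoding names are standard IANA names.
def transcode(data: bytes, source: str, target: str) -> bytes:
    return data.decode(source).encode(target)

latin1_text = b"na\xefve"  # "naive" with i-diaeresis, in iso-8859-1
utf8_text = transcode(latin1_text, "iso-8859-1", "utf-8")
print(utf8_text)  # b'na\xc3\xafve' -- the two-byte UTF-8 sequence
```

Characters absent from the target encoding raise an error at the encode step, matching the caveat above that only characters present in both encodings survive the round trip.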
Hyphenate faces a similar issue: each hyphenation pattern set is stored in some encoding, chosen when the pattern set was compiled, and that may not be the same encoding you use in your text editor for preparing the input. By telling Hyphenate what input encoding you have used—the pattern encoding is named in the pattern set—you enable Hyphenate to transcode input words correctly to match them against the patterns.
Some encodings use strictly one byte per character. They have the advantage of simplicity, in that the number of characters in a string is the same as its length, and the disadvantage that no more than 256 characters—in practice usually fewer—can be encoded. Many encodings exist that choose different sets of 256 or fewer characters to support the writing of different languages. The ISO 8859 family of encodings can support many European languages; all encodings in the family have 160 slots that always map the same, exactly the range the PostScript language level 2 relies on, and 96 slots that map different sets of characters appropriate to different languages. That makes the ISO 8859 encodings convenient for use with PostScript. The characters beyond ASCII can even be used in PostScript programs, as for variable or procedure names with accented letters.
Other encodings can use more than one byte per character, sometimes a variable number, with the most-used characters given shorter encodings. A popular example is utf-8, which can represent any character in Unicode using four bytes or fewer, with the 128 characters essential to original PostScript represented in single bytes exactly as PostScript expects them. That makes utf-8 also a convenient encoding for PostScript files. It does not safely allow non-ASCII characters to appear in PostScript names, because of possible collisions with the level 2 binary token format, but in strings, or in portions of the file read by Markup, there are no surprises.
The most convenient way to edit direct PostScript documents is to pick a suitable encoding, configure your text editor to use it, and in the PostScript program install a matching encoding vector in each font to be used. That is a bit more easily said than done, as few operating systems and text editors directly support the Adobe-specific encodings, and PostScript doesn't come with many font-encoding vectors that match standard encodings. There is, however, a font encoding vector named iso-8859_1-1998 defined as a side effect by the net.anastigmatix.encoding category resource used with Hyphenate. You can find it in the normal PostScript resource category Encoding, as in:

/iso-8859_1-1998 /Encoding findresource

but only after the net.anastigmatix.encoding category has been defined. If the font encoding vectors, your text editor settings, and the encoding resource given as the input encoding for Hyphenate all match, your experience should be free of encoding mysteries. Much more information on character encoding can be found at Jukka Korpela's site.
The ability to set up a font with a variable-length encoding like utf-8 did not enter PostScript until the CIDFonts of language level 3. It can still be possible, though, to use utf-8 as a convenient encoding for editing a direct PostScript file, set up a different encoding vector in the fonts, and transcode.

Because utf-8 has the property that no subsequence of bytes from a multibyte-encoded character can be mistaken for a shorter encoding of some other character, it is usable as a pattern encoding. Other multibyte encodings, generally lacking that property, are not.
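That property can be checked outside PostScript; the following Python fragment (an illustration, not part of the package) shows why a byte-wise match against utf-8 text cannot fire in the middle of a multibyte character.

```python
# Every byte of a UTF-8 multibyte sequence has its high bit set, so no
# slice of such a sequence can be mistaken for an ASCII character; and
# lead bytes (0b11xxxxxx) differ from continuation bytes (0b10xxxxxx),
# so no proper substring of one character's encoding is itself valid.
seq = "\u20ac".encode("utf-8")   # the euro sign, a three-byte character
print(seq)                        # b'\xe2\x82\xac'

def is_valid_utf8(bs: bytes) -> bool:
    try:
        bs.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

assert all(b >= 0x80 for b in seq)   # no byte falls in the ASCII range
proper_substrings = [seq[i:j] for i in range(len(seq))
                     for j in range(i + 1, len(seq) + 1)
                     if (i, j) != (0, len(seq))]
assert not any(is_valid_utf8(s) for s in proper_substrings)
```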
The file RegressHyphenTeX.ps (2.5 MB) contains 234,964 English words from a dictionary file distributed with the NetBSD 1.6.2 operating system and derived from Webster's Second New International Dictionary of the English Language. Each word is listed twice, once with hyphens placed by the default settings of plain TeX and once without, followed by the word cmp, like this:

(hy-phen) (hyphen) cmp
(hy-phen-ate) (hyphenate) cmp
(hy-phen-ated) (hyphenated) cmp
The file can be used to test any PostScript implementation of TeX hyphenation, simply by writing a procedure cmp that takes the word on top of stack, uses the procedure under test to add hyphens, and compares it to the word as hyphenated by TeX itself.
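In outline, the cmp procedure's job looks like this (sketched here in Python rather than PostScript, with a hypothetical hyphenate_under_test stand-in for the procedure being benchmarked):

```python
# Build a cmp that hyphenates the plain word with the implementation
# under test and records any disagreement with TeX's reference form.
def make_cmp(hyphenate_under_test):
    mismatches = []
    def cmp(reference: str, plain: str):
        result = hyphenate_under_test(plain)
        if result != reference:
            mismatches.append((reference, result))
    return cmp, mismatches

# Usage with a trivial stand-in that "knows" only one word:
cmp, mismatches = make_cmp(lambda w: "hy-phen" if w == "hyphen" else w)
cmp("hy-phen", "hyphen")         # matches the reference
cmp("hy-phen-ate", "hyphenate")  # the stand-in fails here
print(mismatches)                # [('hy-phen-ate', 'hyphenate')]
```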
The following short file is the necessary setup to benchmark Hyphenate. Simply download the file—the self-contained version if you have not already downloaded the resources it uses—append RegressHyphenTeX.ps (download separately, above), and feed it to your favorite PostScript interpreter. To keep the size of RegressHyphenTeX.ps manageable, it uses Flate compression, which may be supported in some Language Level 2 interpreters but is only guaranteed supported in Level 3.
 | Bare file | Self-contained file
---|---|---
Setup file to benchmark Hyphenate | PostScript view, text view | PostScript view, text view
A patch file was used against the earlier PostScript port of Knuth-Liang hyphenation done by Olavi Sakari to enable it to run the same benchmark. The procedure is simply to run patch (Larry Wall's program for applying differences to files) to modify his original, and then append RegressHyphenTeX.ps to it as before. The patch restricts the pattern set to the original plain TeX patterns, because those are what TeX used to generate the regression file, skips the demonstration text at the end of the file, and changes the fixed policy for the number of letters left undivided at a word end from two to three to match the default value of TeX's righthyphenmin; otherwise many thousands of spurious mismatches are reported.