The word-division algorithm developed for TeX by Donald Knuth and Frank Liang is not the only algorithm for the purpose, but has some attractive features: it has been freely available for twenty years, with thorough documentation in volumes A and B of Knuth's Computers & Typesetting, and, because it is guided by pattern sets in a specified form, it benefits from many patterns compiled by the TeX user community that support word division in a variety of languages. Many of the patterns were not compiled ad hoc, but using a documented method like that in Liang's dissertation, and are paired with published papers describing how they were compiled and measuring their effectiveness in finding acceptable and avoiding unacceptable division points. No longer used only in TeX, the algorithm has been incorporated into the typesetting programs groff from the GNU project and troff from Plan9. Because of its popularity, it has also been made available in the programming languages Perl and Ruby. An earlier port to the PostScript language was done by Olavi Sakari.
Mr. Sakari's port of the algorithm must be credited for establishing the practicality of such a project. On a benchmark (below), it hyphenated words at about 1/46 the speed of TeX itself, compiled code, on the same hardware, while occupying about 445 kilobytes of interpreter memory. That is quite usable performance, and quite respectable for an essentially direct, partial port of TeX's original Pascal data structures to an interpreted language.
The purposes of a new PostScript port were to explore the performance effect of recreating the algorithm using native PostScript data structures rather than replicas of the originals, in particular using PostScript dictionaries in place of Knuth's original trie structures, and to provide several additional features:

- support for Adobe's ISOLatin1Encoding as well as common encodings such as iso-8859-1 (which is not equivalent) and utf-8;
- among them utf-8, a single encoding that is PostScript-friendly and accommodates any language that can be written in Unicode.
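For readers who have not seen the pattern mechanism before, the Knuth-Liang matching step is compact enough to sketch. The following Python fragment (an illustration only, not part of the ProcSet) uses a plain dictionary keyed by each pattern's letters in place of the packed trie, the same substitution this port makes with PostScript dictionaries; the tiny pattern set is the classic example from Liang's dissertation.

```python
# Minimal sketch of Knuth-Liang hyphenation. A dictionary keyed by the
# letters of each pattern stands in for the packed trie of the original.
# Patterns interleave digits with letters; at each inter-letter position
# the highest digit from any matching pattern wins, and odd digits mark
# acceptable division points.

def parse(pattern):
    """Split 'hy3ph' into letters 'hyph' and digit positions {2: 3}."""
    letters, digits, pos = "", {}, 0
    for ch in pattern:
        if ch.isdigit():
            digits[pos] = int(ch)
        else:
            letters += ch
            pos += 1
    return letters, digits

def hyphenate(word, patterns, leftmin=2, rightmin=3):
    table = dict(parse(p) for p in patterns)
    w = "." + word + "."               # dots mark the word boundaries
    values = [0] * (len(word) + 1)     # values[k]: weight before word[k]
    for i in range(len(w)):
        for j in range(i + 1, len(w) + 1):
            if w[i:j] in table:
                for pos, d in table[w[i:j]].items():
                    k = i + pos - 1    # offset into the bare word
                    if 0 <= k <= len(word):
                        values[k] = max(values[k], d)
    out = ""
    for k, ch in enumerate(word):
        if leftmin <= k <= len(word) - rightmin and values[k] % 2 == 1:
            out += "-"
        out += ch
    return out

liang_demo = ["hy3ph", "he2n", "hena4", "hen5at", "1na", "n2at",
              "1tio", "2io", "o2n"]
print(hyphenate("hyphenation", liang_demo))   # hy-phen-ation
```

Every substring of the dotted word is looked up in the dictionary, the highest digit wins at each inter-letter position, and odd digits that respect the leftmin and rightmin limits become division points.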
Benchmarking suggests that redesigning to exploit the native data structures of PostScript does offer substantial benefit: although the list of added features above could raise fears of bloat, net.anastigmatix.Hyphenate hyphenates words at about 1/8 the speed of compiled TeX on the same hardware, occupying about 300 kilobytes of interpreter memory with the plain TeX English patterns loaded. In the event a multi-byte character encoding is used for either the patterns or the input text, and at least one multi-byte-encoded character does appear in a pattern or exception entry, Hyphenate falls back to a fully general but slower procedure for encoding and case mapping, with roughly a 13 percent speed penalty.

The benchmark, described below, also verifies that Hyphenate's word divisions do match those of TeX itself on a list of 234,964 English words, with only these exceptions:
chaffless raffishly snuffless cliffless raffishness tariffless raffish skiffless trufflesque
The exceptions seem to be the only words in the list that both satisfy the lefthyphenmin and righthyphenmin settings and include what TeX would make automatic ffi or ffl ligatures, the subject of “TeX's interesting approach” that is not described in a double-dangerous-bend note on page 455 of The TeXbook.
Hyphenate is a ProcSet resource. To make it available to your own code, include in the setup section of your file:

/net.anastigmatix.Hyphenate /ProcSet findresource
The findresource will succeed, leaving a dictionary on the operand stack, if you have made the Hyphenate resource file [download] available in any of these ways:

- you have copied the resource file bodily into your file's prolog, ahead of the findresource (which belongs in the setup section);
- you have installed the resource file where your PostScript interpreter looks for resources;
- you use the %%DocumentNeededResources and %%IncludeResource DSC comments: you include these comments at the right position in your file to specify that it needs net.anastigmatix.Hyphenate, your document manager software is configured to automatically insert needed resources in files being printed, and you have put the Hyphenate resource file where your document manager can find it.
Hyphenate relies on several other resources, and you will need those files also. If you use the first method, directly including resources in your file's prolog, the prolog has to contain all of the needed resources, in any order as long as no resource comes before another resource it needs, and categories come before resources that belong in them. Any of the other methods should just work as long as all the files are where they need to be. These are the resources you will need:
Resource | Category | Description
---|---|---
net.anastigmatix.MetaPre | ProcSet | Staged-programming extensions for PostScript.
net.anastigmatix.hyphenpattern | Category | Category to contain hyphen-pattern resources usable with Hyphenate.
choice of pattern | net.anastigmatix.hyphenpattern | The hyphen pattern resource you intend to use. The one named hyphen.tex contains the original TeX patterns for US English derived by Frank Liang.
net.anastigmatix.encoding | Category | Category for resources that describe character-encoding schemes. Several of the common encodings—us-ascii, iso-8859-1, utf-8, and Adobe's ISOLatin1Encoding (which is similar, but not identical, to iso-8859-1)—are defined in the category resource file itself, so no separate encoding resource file is needed.
choice of encoding | net.anastigmatix.encoding | An encoding resource that describes the character encoding you have used in your text; it should ordinarily correspond to the encoding vector you will use with your fonts (see below). Some common encodings are in the category file itself, so you do not need another file here; see above for the list.
net.anastigmatix.unicodedata | Category | Category for resources that supply data from the Unicode character database.
CaseFoldingC+S | net.anastigmatix.unicodedata | Describes the simple upper-to-lowercase mappings for Unicode characters.
net.anastigmatix.Hyphenate | ProcSet | The main attraction.
The resource files are in a compact form. That is for efficiency, not to keep you from viewing them; there is a script for that on the resource packaging page.
The Hyphenate dictionary is read-only. It contains only one entry, the procedure hyphenator. There is no need to put the dictionary on the lookup stack.

The dictionary obtained with /net.anastigmatix.Hyphenate /ProcSet findresource contains a single entry:
Given pattern, a dictionary returned by findresource for the chosen hyphen pattern, and inputcoding, returned by findresource for the chosen input encoding, hyphenator generates the dictionary dict that can be used for hyphenating words encoded in inputcoding according to the pattern pattern. The generated dictionary has five entries, lefthyphenmin, righthyphenmin, uchyph, hyphenation, and query, described next.
The dictionary generated by hyphenator based on a given hyphen pattern and input encoding contains these entries:

lefthyphenmin: The length in characters of the leading part of a word that will not be divided. The default is the same as that in plain TeX.

righthyphenmin: The length in characters of the trailing part of a word that will not be divided. The default is the same as that in plain TeX.

uchyph: Whether any word that begins with an uppercase letter (that is, a letter that is mapped to some other character by the CaseFoldingC+S resource) should be hyphenated. The default is the same as that in plain TeX. If this flag is set to false, query will immediately return all zeros (no acceptable division point) for any word that begins with an uppercase letter.
query: Return in string an indication of the acceptable and unacceptable division points in the string word, which is assumed to be in the input encoding specified when hyphenator generated the dictionary. The length of string is one less than the number of characters in word, which is not necessarily the length in bytes of word except when a single-byte encoding is in use. The first byte of string indicates whether division is acceptable between the first and second characters of word, and so on. Any odd value indicates division is acceptable at that position; any even value indicates it is not.
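As a concrete illustration of the convention (in Python, outside PostScript): the sketch below inserts a hyphen wherever the corresponding result byte is odd. The sample result bytes are those shown for (example) in the worked example later on this page.

```python
# Interpret a query result: byte k of the result says whether division is
# acceptable between characters k and k+1 of the word (odd = acceptable).
def apply_query(word: str, result: bytes) -> str:
    assert len(result) == len(word) - 1
    out = word[0]
    for ch, mark in zip(word[1:], result):
        if mark % 2 == 1:
            out += "-"
        out += ch
    return out

# The result shown for (example) is (\012\001\004\003\012\012):
print(apply_query("example", bytes([0o12, 0o1, 0o4, 0o3, 0o12, 0o12])))
# ex-am-ple
```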
hyphenation: Add the elements of words, an array, as hyphenation exceptions. Each element is a word with the hyphen character U+002D HYPHEN-MINUS inserted at any acceptable division point. All other positions are considered unacceptable. To add an exception word without hyphens is to prevent its being divided at all. Words are folded to lowercase for comparison, so an exception governs the division of any matching word regardless of case (except for words beginning with uppercase letters when uchyph is false, which are not divided at all).

To allow a convenient form of input, words may be names or strings, and it does not matter if the words array is literal or executable, so these examples both work:

[ (man-u-script) /man-u-scripts (ap-pen-dix) ] hyphenation
{ man-u-script man-u-scripts ap-pen-dix } hyphenation
If local allocation was in effect when hyphenator was executed, the exception dictionary is in local VM and subject to save and restore, unlike TeX where exceptions are cumulative without respect to grouping. If hyphenator was run with global allocation in effect, then the exception dictionary is not subject to save and restore, but then hyphenation can only be executed when global allocation is in effect, and the exceptions must either be names or globally-allocated strings.
It must be possible to represent all of the words in the same encoding used by the hyphenation patterns in use, which was fixed when the patterns were compiled and need not be the same as the input encoding. When patterns are coded in utf-8, which can express all of Unicode, no problem is possible.

If the input coding in use does not include the U+002D HYPHEN-MINUS character, it is not possible to do anything with this procedure except specify words not to be hyphenated at all. Adobe's ISOLatin1Encoding does have it, but at the goofy location AD hex/255 octal, so unless your keyboard has a key for that, if you are using this encoding you would add exceptions this way:

{ (man\255u\255script) } hyphenation
The ordinary use of Hyphenate would be within a larger library for setting filled, justified text, but that would make an oversized example. Here is a sample, though, of code that could appear in a program that wanted to set up Hyphenate and eventually query for the hyphenation of the word example.

% select the desired hyphen pattern.
% hyphen.tex is Knuth's original US English:
/hyphen.tex /net.anastigmatix.hyphenpattern findresource
% what is the encoding of the input file?
/iso-8859_1 /net.anastigmatix.encoding findresource
% set up hyphenation:
/net.anastigmatix.Hyphenate /ProcSet findresource
/hyphenator get % hyphenator is the only thing in Hyphenate's dictionary
% stack now holds: hyphenpattern inputcoding hyphenator
exec % hyphenation dict
/myhyph exch def
myhyph begin userdict begin
/lefthyphenmin 2 store % unnecessary in real life, 2 is default
{ man-u-script man-u-scripts } hyphenation % add some exceptions
(example) query == % result: (\012\001\004\003\012\012)
% the result means (odd numbers are division points) ex-am-ple
% or use the hyphenator without its dictionary on the lookup stack:
end end
{ ap-pen-dix } myhyph /hyphenation get exec
(appendix) myhyph /query get exec == % (\012\001\000\000\001\012\012)
The \012s in the result (which some interpreters may show as \n) reserve the positions that can never be division points according to lefthyphenmin and righthyphenmin. As always, an even number is an unacceptable division, an odd number an acceptable one.
The DSC comments were omitted above for brevity, but you can view or download the complete example file here, either for viewing in a PostScript viewer or as text. It is commented for Language Level 3 only because hyphen patterns, to save file size, may use Flate compression; they do not use any other Level 3 features and can be made available in longer, uncompressed, Level 2 versions.
The “bare file” download will only run properly if you have already downloaded the necessary resources to your viewer or printer. The “self-contained file” version has been through a script that replaced each IncludeResource comment with the corresponding resource file—being sure to do the same for other IncludeResource comments in the included file—so it can be sent to any PostScript printer or viewer without any prearrangements.
 | Bare file | Self-contained file
---|---|---
Hyphenation setup example | PostScript view, text view | PostScript view, text view
There are only 256 possible byte values, but many more known characters in Unicode and many more glyphs in a font. Various encodings exist to make the most-often-needed characters in a given language easy to represent in strings of bytes. Many have been defined by standards bodies independent of PostScript, and some are defined by Adobe and little used except in PostScript code. Examples of standard encodings are us-ascii, iso-8859-1, and utf-8; Adobe encodings include StandardEncoding, ISOLatin1Encoding, and CE.
Two encodings will typically matter for working in PostScript: the encoding that maps numbers to glyphs in a font, determined by the font's encoding vector, and the encoding of the text in the file, determined by your text editor settings. Life offers the fewest surprises when these are the same, and that can be arranged by explicitly setting up fonts with encoding vectors that correspond to the encoding used in editing the text.
It is also possible to edit the file in one encoding, use another encoding vector for the fonts, and use PostScript code to transcode strings before they are shown. The net.anastigmatix.encoding resources can transcode several common encodings to and from Unicode, so it is possible to transcode from one encoding to another—for those characters present in both encodings—by going through Unicode as a “pivot encoding.”
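As a concrete illustration of the pivot idea (in Python rather than PostScript), transcoding is just a decode to Unicode code points followed by an encode in the target scheme:

```python
# Transcode between two byte encodings by pivoting through Unicode:
# decode the source bytes to code points, then encode those code points
# in the target encoding. The encoding names are standard IANA names.
def transcode(data: bytes, source: str, target: str) -> bytes:
    return data.decode(source).encode(target)

latin1_text = b"na\xefve"  # "naive" with i-diaeresis, in iso-8859-1
utf8_text = transcode(latin1_text, "iso-8859-1", "utf-8")
print(utf8_text)  # b'na\xc3\xafve' -- the two-byte UTF-8 sequence
```

Characters absent from the target encoding raise an error at the encode step, matching the caveat above that only characters present in both encodings survive the round trip.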
Hyphenate faces a similar issue: each hyphenation pattern set is stored in some encoding, chosen when the pattern set was compiled, and that may not be the same encoding you use in your text editor for preparing the input. By telling Hyphenate what input encoding you have used—the pattern encoding is named in the pattern set—you enable Hyphenate to transcode input words correctly to match them against the patterns.
Some encodings use strictly one byte per character. They have the advantage of simplicity, in that the number of characters in a string is the same as its length, and the disadvantage that no more than 256 characters—in practice usually fewer—can be encoded. Many encodings exist that choose different sets of 256 or fewer characters to support the writing of different languages. The ISO 8859 family of encodings can support many European languages; all encodings in the family have 160 slots that always map the same, exactly the range the PostScript language level 2 relies on, and 96 slots that map different sets of characters appropriate to different languages. That makes the ISO 8859 encodings convenient for use with PostScript. The characters beyond ASCII can even be used in PostScript programs, as for variable or procedure names with accented letters.
Other encodings can use more than one byte per character, sometimes a variable number, with the most-used characters given shorter encodings. A popular example is utf-8, which can represent any character in Unicode using four bytes or fewer, with the 128 characters essential to original PostScript represented in single bytes exactly as PostScript expects them. That makes utf-8 also a convenient encoding for PostScript files. It does not safely allow non-ASCII characters to appear in PostScript names, because of possible collisions with the level 2 binary token format, but in strings, or in portions of the file read by Markup, there are no surprises.
The most convenient way to edit direct PostScript documents is to pick a suitable encoding, configure your text editor to use it, and in the PostScript program install a matching encoding vector in each font to be used. That is a bit more easily said than done, as few operating systems and text editors directly support the Adobe-specific encodings, and PostScript doesn't come with many font-encoding vectors that match standard encodings. There is, however, a font encoding vector named iso-8859_1-1998 defined as a side effect by the net.anastigmatix.encoding category resource used with Hyphenate. You can find it in the normal PostScript resource category Encoding, as in:

/iso-8859_1-1998 /Encoding findresource

but only after the net.anastigmatix.encoding category has been defined. If the font encoding vectors, your text editor settings, and the encoding resource given as the input encoding for Hyphenate all match, your experience should be free of encoding mysteries. Much more information on character encoding can be found at Jukka Korpela's site.
The ability to set up a font with a variable-length encoding like utf-8 did not enter PostScript until the CIDFonts of language level 3. It can still be possible, though, to use utf-8 as a convenient encoding for editing a direct PostScript file, set up a different encoding vector in the fonts, and transcode.

Because utf-8 has the property that no subsequence of bytes from a multibyte-encoded character can be mistaken for a shorter encoding of some other character, it is usable as a pattern encoding. Other multibyte encodings, generally lacking that property, are not.
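That property can be checked outside PostScript; the following Python fragment (an illustration, not part of the package) shows why a byte-wise match against utf-8 text cannot fire in the middle of a multibyte character.

```python
# Every byte of a UTF-8 multibyte sequence has its high bit set, so no
# slice of such a sequence can be mistaken for an ASCII character; and
# lead bytes (0b11xxxxxx) differ from continuation bytes (0b10xxxxxx),
# so no proper substring of one character's encoding is itself valid.
seq = "\u20ac".encode("utf-8")   # the euro sign, a three-byte character
print(seq)                        # b'\xe2\x82\xac'

def is_valid_utf8(bs: bytes) -> bool:
    try:
        bs.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

assert all(b >= 0x80 for b in seq)   # no byte falls in the ASCII range
proper_substrings = [seq[i:j] for i in range(len(seq))
                     for j in range(i + 1, len(seq) + 1)
                     if (i, j) != (0, len(seq))]
assert not any(is_valid_utf8(s) for s in proper_substrings)
```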
The file RegressHyphenTeX.ps (2.5 MB) contains 234,964 English words from a dictionary file distributed with the NetBSD 1.6.2 operating system and derived from Webster's Second New International Dictionary of the English Language. Each word is listed twice, once with hyphens placed by the default settings of plain TeX and once without, followed by the word cmp, like this:

(hy-phen) (hyphen) cmp
(hy-phen-ate) (hyphenate) cmp
(hy-phen-ated) (hyphenated) cmp
The file can be used to test any PostScript implementation of TeX hyphenation, simply by writing a procedure cmp that takes the word on top of stack, uses the procedure under test to add hyphens, and compares it to the word as hyphenated by TeX itself.
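In outline, the cmp procedure's job looks like this (sketched here in Python rather than PostScript, with a hypothetical hyphenate_under_test stand-in for the procedure being benchmarked):

```python
# Build a cmp that hyphenates the plain word with the implementation
# under test and records any disagreement with TeX's reference form.
def make_cmp(hyphenate_under_test):
    mismatches = []
    def cmp(reference: str, plain: str):
        result = hyphenate_under_test(plain)
        if result != reference:
            mismatches.append((reference, result))
    return cmp, mismatches

# Usage with a trivial stand-in that "knows" only one word:
cmp, mismatches = make_cmp(lambda w: "hy-phen" if w == "hyphen" else w)
cmp("hy-phen", "hyphen")         # matches the reference
cmp("hy-phen-ate", "hyphenate")  # the stand-in fails here
print(mismatches)                # [('hy-phen-ate', 'hyphenate')]
```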
The following short file is the necessary setup to benchmark Hyphenate. Simply download the file—the self-contained version if you have not already downloaded the resources it uses—append RegressHyphenTeX.ps (download separately, above), and feed it to your favorite PostScript interpreter. To keep the size of RegressHyphenTeX.ps manageable, it uses Flate compression, which may be supported in some Language Level 2 interpreters but is only guaranteed supported in Level 3.
 | Bare file | Self-contained file
---|---|---
Setup file to benchmark Hyphenate | PostScript view, text view | PostScript view, text view
A patch file was used against the earlier PostScript port of Knuth-Liang hyphenation done by Olavi Sakari to enable it to run the same benchmark. The procedure is simply to run patch (Larry Wall's program for applying differences to files) to modify his original, and then append RegressHyphenTeX.ps to it as before. The patch restricts the pattern set to the original plain TeX patterns, because those are what TeX used to generate the regression file, skips the demonstration text at the end of the file, and changes the fixed policy for the number of letters left undivided at a word end from two to three to match the default value of TeX's righthyphenmin; otherwise many thousands of spurious mismatches are reported.