Line Breaks

January 27, 2017

All text processors need code to split the words of a paragraph into lines no wider than a given width, a process known as line breaking. Algorithms for the task range from simple to complex, and they produce output of varying degrees of “estheticness.” Most try to make all the lines of a paragraph approximately the same length, which reduces visual disparities in the text that might distract the reader.

One simple line-breaking algorithm is the greedy algorithm: pack onto each line as many words as will fit, then go to the next line. For instance, given the text “aaa bb cc ddddd” and a line width of 6, the output would be as shown below left:

    ------        ------
    aaa bb        aaa
    cc            bb cc
    ddddd         ddddd
    ------        ------
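The greedy rule can be sketched in a few lines of Python. This is only an illustration: the function name `greedy_break` is made up for this example, and it assumes that words are separated by single spaces and that no word is longer than the line width.

```python
def greedy_break(words, width):
    """Pack as many words as fit on each line, then start a new line."""
    lines = []
    current = ""
    for word in words:
        if not current:
            # First word on a line: place it (assumed to fit within width).
            current = word
        elif len(current) + 1 + len(word) <= width:
            # The word fits after a single separating space.
            current += " " + word
        else:
            # The word does not fit: close this line and start the next one.
            lines.append(current)
            current = word
    if current:
        lines.append(current)
    return lines


print(greedy_break("aaa bb cc ddddd".split(), 6))
# ['aaa bb', 'cc', 'ddddd']
```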

The greedy algorithm minimizes the number of lines used, but most line-breaking algorithms prefer to minimize the amount of “raggedness.” One common measure of estheticness minimizes the slack at the end of each line; specifically, it seeks a minimum sum of the squares of the number of spaces at the end of each line. The format shown above left has no space at the end of the first line, 4 spaces at the end of the second line, and 1 space at the end of the third line, for a total cost of 0² + 4² + 1² = 17. The purpose of squaring is to penalize large amounts of slack more heavily.

A better format is shown above right. That has 3 spaces at the end of the first line, 1 space at the end of the second line, and 1 space at the end of the third line, for a total cost of 3² + 1² + 1² = 11.
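The two totals can be checked with a small helper. The name `raggedness` is hypothetical, and the sketch assumes, as the arithmetic above does, that every line (including the last) contributes its squared trailing slack.

```python
def raggedness(lines, width):
    """Sum of squared trailing spaces over all lines."""
    return sum((width - len(line)) ** 2 for line in lines)


print(raggedness(["aaa bb", "cc", "ddddd"], 6))  # 0 + 16 + 1 = 17
print(raggedness(["aaa", "bb cc", "ddddd"], 6))  # 9 + 1 + 1 = 11
```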

From an algorithmic point of view, this is a minimization problem that can be solved in quadratic time by dynamic programming: Walk down the list of words, computing after each word the minimum slack to that point, then add the next word and recompute. The primary data structure used in computing the minimization is an upper-triangular matrix, shown below left:

    aaa    bb    cc  ddddd            aaa    bb    cc  ddddd
   ----- ----- ----- -----           ----- ----- ----- -----
 0    3     0    -3    -9          0    3     0
 1          4     1    -5          1          4     1
 2                4    -2          2                4
 3                      1          3                      1
   ----- ----- ----- -----           ----- ----- ----- -----

Each row of the matrix corresponds to the first word on a line and each column to the last word on that line; an entry is the number of spaces left over when those words are placed together on one line. The first row is computed as 3, which is the number of spaces remaining after placing aaa on a line, 0, which is the number of spaces remaining after placing aaa bb on a line, -3, which is the number of spaces remaining after placing aaa bb cc on a line, and -9, which is the number of spaces remaining after placing aaa bb cc ddddd on a line; the last two entries on the first row are infeasible, as the text is wider than the line. The second row is computed as 4, which is the number of spaces remaining after placing bb on a line, 1, which is the number of spaces remaining after placing bb cc on a line, and -5, which is the number of spaces remaining after placing bb cc ddddd on a line; the last entry on the row is infeasible. The third and fourth rows are computed likewise. The feasible portion of the upper-triangular matrix is shown above right.
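As a rough sketch, the matrix can be built like this in Python. The name `slack_matrix` is an assumption for illustration; cells below the diagonal are left as `None`, and infeasible cells keep their negative values so the output can be compared with the table above.

```python
def slack_matrix(words, width):
    """slack[i][j] = spaces left after placing words[i..j] on one line."""
    n = len(words)
    slack = [[None] * n for _ in range(n)]
    for i in range(n):
        used = -1  # cancels the separating space counted for the first word
        for j in range(i, n):
            used += 1 + len(words[j])
            slack[i][j] = width - used  # negative means infeasible
    return slack


for row in slack_matrix("aaa bb cc ddddd".split(), 6):
    print(row)
# [3, 0, -3, -9]
# [None, 4, 1, -5]
# [None, None, 4, -2]
# [None, None, None, 1]
```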

The next step is to take the minimum feasible value in each column: 3, 0, 1, and 1; if you square those and compute the sum, you get 3² + 0² + 1² + 1² = 11, which is the cost we computed above. More interesting is to take the index of the minimum feasible value in each column, which is 0, 0, 1, and 3 (the 3 in the aaa column is at index 0, the 0 in the bb column is at index 0, the 1 in the cc column is at index 1, and the 1 in the ddddd column is at index 3). Then we compute the line breaks using the index minimums pairwise as follows: the first pair 0, 0 is empty; the second pair 0, 1 defines the bounds of the first output line; the third pair 1, 3 defines the bounds of the second output line; and the implicit pair 3, 4 (4 is the end of the input) defines the bounds of the third output line.
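The index bookkeeping in that walkthrough can be transcribed fairly literally. The sketch below is only that, a transcription of the steps as described, with a made-up name `column_min_breaks`; it treats each pair of indexes as a half-open range of words and skips empty pairs.

```python
def column_min_breaks(words, width):
    """Take the row index of the minimum feasible slack in each column,
    then read the line bounds off consecutive pairs of those indexes."""
    n = len(words)
    starts = []
    for j in range(n):
        best_i = None
        used = -1
        # Walk up column j from the diagonal. Slack only shrinks as the line
        # starts earlier, so the last feasible row holds the column minimum.
        for i in range(j, -1, -1):
            used += 1 + len(words[i])
            if used > width:
                break
            best_i = i
        if best_i is None:
            raise ValueError("a word is longer than the line width")
        starts.append(best_i)
    bounds = starts + [n]  # the implicit final index is the end of the input
    return [" ".join(words[a:b]) for a, b in zip(bounds, bounds[1:]) if a < b]


print(column_min_breaks("aaa bb cc ddddd".split(), 6))
# ['aaa', 'bb cc', 'ddddd']
```

On this example the result matches the better format. It should not be read as a general guarantee, though: with the words "a a a a" and a width of 3, the same steps give the lines "a", "a", "a a" (cost 8), while "a a", "a a" has cost 0. Carrying a running minimum cost from word to word avoids that.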

And that’s the algorithm. Beware that reducing it to code can be tricky (I got it wrong more than once) because you have to be careful to keep the row and column indexes straight and you have to remember when to add and subtract 1 to point to the previous or next column or row. The algorithm obviously has quadratic time and space complexity to compute and manipulate the upper-triangular matrix.
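For comparison, here is a minimal sketch of the textbook dynamic-programming recurrence for this cost: the best cost of the first j words is the minimum, over every feasible start i of the last line, of the best cost of the first i words plus the squared slack of that last line. It is not the suggested solution referred to below; the names `break_lines`, `best`, and `prev` are assumptions for illustration, and the sketch computes each slack on the fly rather than storing the whole matrix.

```python
def break_lines(words, width):
    """Return (lines, cost) minimizing the sum of squared trailing slack."""
    n = len(words)
    inf = float("inf")
    best = [0] + [inf] * n  # best[j]: minimum cost of formatting words[:j]
    prev = [0] * (n + 1)    # prev[j]: index of the first word on the last line
    for j in range(1, n + 1):
        used = -1
        # Try every feasible last line words[i:j], shortest first.
        for i in range(j - 1, -1, -1):
            used += 1 + len(words[i])
            if used > width:
                break  # starting any earlier only makes the line wider
            cost = best[i] + (width - used) ** 2
            if cost < best[j]:
                best[j], prev[j] = cost, i
    if best[n] == inf:
        raise ValueError("a word is longer than the line width")
    # Follow the prev links back from the end to recover the lines.
    lines = []
    j = n
    while j > 0:
        i = prev[j]
        lines.append(" ".join(words[i:j]))
        j = i
    lines.reverse()
    return lines, best[n]


print(break_lines("aaa bb cc ddddd".split(), 6))
# (['aaa', 'bb cc', 'ddddd'], 11)
```

The two nested loops keep the running time quadratic in the number of words, while the two arrays need only linear space.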

Your task is to write a program to format paragraphs by the dynamic programming algorithm described above. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.
