Wrapping Text With Regular Expressions
I have been asked “how to wrap text” a handful of times in the last year or so, and I have needed to do it myself a couple of times as well, so here is a Ruby function which does the job:
def wrap_text(txt, col = 80)
  txt.gsub(/(.{1,#{col}})( +|$\n?)|(.{1,#{col}})/, "\\1\\3\n")
end
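To see the function in action, here is a minimal sketch that wraps a short sentence at column 15. The sample text and the narrow column are my own choices for illustration only.

```ruby
def wrap_text(txt, col = 80)
  txt.gsub(/(.{1,#{col}})( +|$\n?)|(.{1,#{col}})/, "\\1\\3\n")
end

# Sample input chosen for illustration; any single-line string works.
puts wrap_text("The quick brown fox jumps over the lazy dog", 15)
# Prints:
#   The quick brown
#   fox jumps over
#   the lazy dog
```

Note that every output line, including the last one, ends with a newline, since the replacement string always appends one.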
The Pattern in Detail
This pattern is parameterized on the wrap column, so let us inject that parameter before further analysis:
(.{1,80})( +|$\n?)|(.{1,80})
Now let us step through that:
-
(.{1,80})
The parentheses make this a capture, meaning we can refer to the matched text in the replacement string (as \1, since this is the first capture group). The period (.) matches any character except a newline, and {1,80} says that we want to repeat that 1-80 times. It is greedy, so it will do as many repeats as possible.
So basically this is a match for at most 80 non-newline characters, with the matched text stored in capture group one.
-
( +|$\n?)
Here we also use parentheses, though this time not to capture the matched text but because we want an either/or, i.e. we want to match one of two things. Either we match one or more spaces, expressed with a literal space followed by + to denote one or more matches, or (expressed with the | operator) we match the end of the line, indicated by $, optionally followed by a newline, indicated by \n and a question mark to make it optional.
The reason we make the newline optional is that we could be faced with the last line of the text, and while there is an end of that line, there is not necessarily a newline.
So this part will match a consecutive run of spaces, or end of line. The magic happens when we combine that with the previous pattern, which matched 1-80 non-newline characters, because this additional match requires that the 1-80 non-newline characters are followed either by the consecutive run of spaces or by the end of the line.
-
|
The previous two patterns combined will match at most 80 non-newline characters followed by spaces or end of line. A problem exists if there are more than 80 non-newline characters not containing any spaces.
For this reason we add special handling for that case, using the alternation (|) operator. Alternations are tried left-to-right and the first one which matches is used. So only if the left side of this operator does not match will we advance to the next pattern.
-
(.{1,80})
This is the same as the first pattern, but since we do not follow it with the “space or end of line” clause, it will match up to 80 characters of the line (greedy) regardless of what follows the “break.”
Like the first pattern, the result is put in a capture group, but since this is the third set of parentheses, it is capture group three and can thus be referenced as \3 in the replacement string.
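The steps above can be checked with String#scan, which returns the capture groups of every match. This is only a sketch: the column is lowered to 10 and the sample string is made up so that the overlong-word case is triggered. WRAP_PATTERN is a name I introduce here for illustration.

```ruby
# Same pattern as above, with the column set to 10 instead of 80.
WRAP_PATTERN = /(.{1,10})( +|$\n?)|(.{1,10})/

# Each inner array is [group one, group two, group three] for one match.
p "short words abcdefghijklmnop".scan(WRAP_PATTERN)
# => [["short", " ", nil], ["words", " ", nil], [nil, nil, "abcdefghij"], ["klmnop", "", nil]]
```

The third match is the interesting one: the left side of the alternation fails because no space or end of line follows within 10 characters, so only group three is set. In the last match group two is an empty string, because end of line matched without a trailing newline.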
Replacement String
Since we get the text we need either in capture group one or three, the replacement string needs to insert one of these, and a newline.
As only one of them will match, I am simply inserting both using \\1\\3\n (the double backslashes are to cater for Ruby string escaping).
Be aware that some implementations of regexp replacing use $1, $2, and so on instead of \1, \2, and so on for referencing a capture group.
Personally I favor the dollar notation: it makes the reference look like a variable, which opens up for normal variable modification syntax, and it frees up the backslash. For example, to prefix a capture group with a backslash in Ruby you easily end up with 6 backslashes in a row, just to insert one. So, as you may have figured, in TextMate capture groups are referenced with the dollar notation.
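A small sketch of the "insert both groups" trick, again with the column lowered to 10 and a made-up sample string. The group that did not take part in the match expands to nothing, so only one of the two contributes text. For comparison I also show Ruby's block form of gsub, which exposes the groups as $1 and $3; that is Ruby's own mechanism and not the TextMate replacement syntax. PATTERN is a name introduced for this example.

```ruby
PATTERN = /(.{1,10})( +|$\n?)|(.{1,10})/

# Backslash references inside a double-quoted string need escaping.
with_backrefs = "one two three four".gsub(PATTERN, "\\1\\3\n")

# Block form: $1 and $3 are set per match; a nil group interpolates as "".
with_block = "one two three four".gsub(PATTERN) { "#{$1}#{$3}\n" }

p with_backrefs               # => "one two\nthree four\n"
p with_backrefs == with_block # => true
```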
Notes
-
It might make sense to make this method a member of the String class.
-
It is possible to do a non-capturing group using (?:…), which would make the function:
def wrap_text(txt, col = 80)
  txt.gsub(/(.{1,#{col}})(?: +|$\n?)|(.{1,#{col}})/, "\\1\\2\n")
end
This is technically better, but at the expense of a slightly more complex regexp.
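Both notes can be combined in one sketch: reopening String and using the non-capturing group. The method name wrap is hypothetical (Ruby's core String class has no such method), and reopening a core class affects the whole program, so take this as an illustration rather than a recommendation.

```ruby
class String
  # Hypothetical helper, not part of Ruby's core library.
  # The space/end-of-line group is non-capturing, so the fallback
  # match is capture group two rather than three.
  def wrap(col = 80)
    gsub(/(.{1,#{col}})(?: +|$\n?)|(.{1,#{col}})/, "\\1\\2\n")
  end
end

p "The quick brown fox".wrap(10)
# => "The quick\nbrown fox\n"
```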
2006-06-30: Based on Florian's comment, here is a version which deals with lines that end with trailing spaces:
def wrap_text(txt, col = 80)
  txt.gsub(/(.{1,#{col}})( +|$)\n?|(.{#{col}})/, "\\1\\3\n")
end
I also changed the right side to do exactly col repeats instead of the (unnecessary) range.
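Here is a sketch of the difference between the two versions. I am assuming the problem case is a line that reaches the wrap column exactly and is then followed by trailing spaces and a newline; the comment itself is not reproduced here, so that is my reading. The method names wrap_text_original and wrap_text_updated are introduced only to run the two side by side.

```ruby
# First version from the top of the article.
def wrap_text_original(txt, col = 80)
  txt.gsub(/(.{1,#{col}})( +|$\n?)|(.{1,#{col}})/, "\\1\\3\n")
end

# Updated version: the optional newline now follows either branch.
def wrap_text_updated(txt, col = 80)
  txt.gsub(/(.{1,#{col}})( +|$)\n?|(.{#{col}})/, "\\1\\3\n")
end

input = "abcde \nfgh" # "abcde" fills column 5, then a trailing space

p wrap_text_original(input, 5) # => "abcde\n\nfgh\n"
p wrap_text_updated(input, 5)  # => "abcde\nfgh\n"
```

In the original, the run of spaces is consumed but the newline after it is not, so it survives untouched next to the inserted newline and produces an empty line. In the updated pattern the \n? sits outside the group and swallows that newline as well.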