Wrapping Text With Regular Expressions
I have been asked “how to wrap text” a handful of times in the last year or so, and I have needed to do it myself a couple of times as well, so here is a Ruby function which does the job:
def wrap_text(txt, col = 80)
  txt.gsub(/(.{1,#{col}})( +|$\n?)|(.{1,#{col}})/,
    "\\1\\3\n")
end
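As a quick illustration of how the function behaves, here is a small usage sketch. The sample sentence and the 10-column width are arbitrary choices made for this example.

```ruby
# The function from above, repeated so the example is self-contained.
def wrap_text(txt, col = 80)
  txt.gsub(/(.{1,#{col}})( +|$\n?)|(.{1,#{col}})/,
    "\\1\\3\n")
end

# Sample input chosen for illustration only.
sample = "The quick brown fox jumps over the lazy dog"
wrapped = wrap_text(sample, 10)
puts wrapped
# Prints:
# The quick
# brown fox
# jumps over
# the lazy
# dog
```

Note that every output line, including the last one, ends with a newline.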
The Pattern in Detail
This pattern is parameterized on the wrap column, so let us inject that parameter before further analysis:
(.{1,80})( +|$\n?)|(.{1,80})
Now let us step through that:
(.{1,80})
The parentheses make this a capture group, meaning we can refer to the matched text in the replacement string (as \1, since this is the first capture group). The period (.) matches any character except a newline, and {1,80} says that we want to repeat that 1 to 80 times. It is greedy, so it will do as many repeats as possible.
So basically this is a match for at most 80 non-newline characters, with the matched text stored in capture group one.
( +|$\n?)
Here we also use parentheses, though this time not to capture the matched text but because we want an either/or, i.e. we want to match one of two things: either one or more spaces (a literal space followed by + to denote one or more matches) or, via the | operator, the end of the line (indicated by $) optionally followed by a newline (\n, with a question mark to make it optional).
The reason we make the newline optional is that we could be at the last line of the text, and while that line has an end, it does not necessarily have a newline.
So this part will match a consecutive run of spaces, or end of line. The magic happens when we combine that with the previous pattern, which matched 1-80 non-newline characters, because this additional match requires that the 1-80 non-newline characters are followed either by the consecutive run of spaces, or end of the line.
|
The previous two patterns combined will match at most 80 non-newline characters followed by spaces or end of line. A problem exists if there are more than 80 non-newline characters not containing any spaces.
For this reason we add special handling for that case, using the alternation (|) operator. Alternations are tried left to right and the first one which matches is used, so only if the left side of this operator did not match will we advance to the next pattern.
(.{1,80})
This is the same as the first pattern, but since we do not follow it with the “space or end of line” clause, it will match up to 80 characters of the line (greedy) regardless of what follows the “break.”
Like the first pattern, the result is put in a capture group, but since this is the third use of parentheses, it is capture group three and can thus be referenced as \3 in the replacement string.
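To see which capture group ends up holding the text in each case, one can run the pattern through scan, which returns the three groups for every match. This is only a sketch: the column is lowered to 5 and the input string is made up so that all three cases show up.

```ruby
# Same pattern as above, with the wrap column set to 5 for a short demo.
pattern = /(.{1,5})( +|$\n?)|(.{1,5})/

# Made-up input: an 8-letter word (too long for 5 columns), a space, a short word.
groups = "abcdefgh ij".scan(pattern)
groups.each { |g| p g }
# [nil, nil, "abcde"]   forced break: only group three matched
# ["fgh", " ", nil]     break at a space: group one holds the text
# ["ij", "", nil]       end of text: group two matched an empty string
```

Groups that did not take part in a match come back as nil, which is why inserting both \1 and \3 in the replacement is harmless.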
Replacement String
Since we get the text we need either in capture group one or three, the replacement string needs to insert one of these, and a newline.
As only one of them will match, I am simply inserting both using \\1\\3\n (the double backslashes are to cater for Ruby string escaping.)
Be aware that some implementations of regexp replacing use $1-n instead of \1-n for referencing a capture group.
Personally I favor the dollar notation, as it makes it seem like a variable, thus opening up for normal variable modification syntax, and it also frees up the backslash. For example, to prefix a capture group with a backslash in Ruby you easily end up with 6 backslashes in a row, just to insert one. So, as you may have figured, in TextMate capture groups are referenced with the dollar notation.
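For what it is worth, Ruby itself offers a couple of ways around the doubled backslashes. The sketch below compares three equivalent spellings of the replacement; the sample text and the 7-column width are made up for the example.

```ruby
# Made-up sample text and a 7-column pattern for illustration.
text = "aaa bbb ccc"
pattern = /(.{1,7})( +|$\n?)|(.{1,7})/

# Double-quoted string: backslashes must be doubled.
double_quoted = text.gsub(pattern, "\\1\\3\n")

# Single-quoted string: \1 and \3 stay literal, the newline is appended separately.
single_quoted = text.gsub(pattern, '\1\3' + "\n")

# Block form: the groups are available as $1 and $3 (nil interpolates as "").
block_form = text.gsub(pattern) { "#{$1}#{$3}\n" }

p double_quoted  # => "aaa bbb\nccc\n"
```

All three calls produce the same string; the block form is the closest Ruby gets to the dollar notation mentioned above.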
Notes
It might make sense to make this method a member of the String class.
It is possible to do a non-capturing group using (?:…). This would make the function:
def wrap_text(txt, col = 80)
  txt.gsub(/(.{1,#{col}})(?: +|$\n?)|(.{1,#{col}})/,
    "\\1\\2\n")
end
This is technically better, but at the expense of a slightly more complex regexp.
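A minimal sketch of the first note, combined with the non-capturing group from the second. The method name wrap is a hypothetical choice for this example, and reopening String like this affects every string in the program, so it is a matter of taste.

```ruby
class String
  # `wrap` is a hypothetical method name chosen for this sketch.
  # Uses the non-capturing variant, so the forced-break text is group two.
  def wrap(col = 80)
    gsub(/(.{1,#{col}})(?: +|$\n?)|(.{1,#{col}})/, "\\1\\2\n")
  end
end

wrapped = "The quick brown fox".wrap(10)
puts wrapped
# Prints:
# The quick
# brown fox
```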
2006-06-30: Based on Florian’s comment, here is a version which deals with lines that end with trailing spaces:
def wrap_text(txt, col = 80)
  txt.gsub(/(.{1,#{col}})( +|$)\n?|(.{#{col}})/,
    "\\1\\3\n")
end
I also changed the right side to do exactly col repeats instead of the (unnecessary) range.
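The difference between the two versions shows up when a line's trailing spaces run past the wrap column. The sketch below puts both side by side; the names wrap_text_old and wrap_text_new and the test input are made up for the comparison.

```ruby
# First version: the optional newline is only consumed after "end of line".
def wrap_text_old(txt, col = 80)
  txt.gsub(/(.{1,#{col}})( +|$\n?)|(.{1,#{col}})/, "\\1\\3\n")
end

# Updated version: the optional newline is consumed after spaces as well.
def wrap_text_new(txt, col = 80)
  txt.gsub(/(.{1,#{col}})( +|$)\n?|(.{#{col}})/, "\\1\\3\n")
end

# Made-up input: 8 letters plus 3 trailing spaces, wrapped at column 10.
input = "Testtest   \nTest\n"
old_result = wrap_text_old(input, 10)
new_result = wrap_text_new(input, 10)
p old_result  # => "Testtest  \n\nTest\n"  (spurious empty line)
p new_result  # => "Testtest  \nTest\n"
```

In the first version the run of spaces ends the match just before the original newline, so that newline survives in addition to the inserted one.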
28 Jun 2006 | # Benoit Gagnon wrote…
This method looks a lot like the one that comes with Rails :)
http://api.rubyonrails.org/classes/ActionView/Helpers/TextHelper.html
28 Jun 2006 | # Allan Odgaard wrote…
I hadn’t seen that one, but I wouldn’t say that they look a lot like each other: the Rails method makes use of two gsub calls and one strip where I settle with just the one gsub.
In addition the behavior is a little different. The Rails method eats empty lines, e.g.:
Test TEXT
Will output (removing the paragraph break):
And it doesn’t force a break for lines too long, e.g. providing 2 above as the line-width still gives the same result.
Not sure if this is by design; personally, though, I wouldn’t want this behavior ;)
28 Jun 2006 | # Jay Soffian wrote…
Can you provide a little more context about why you’re doing this with a regex instead of, say Reformat Paragraph or /usr/bin/fmt?
If folks are looking for language specific solutions, in Perl there is the Text::Wrap module available via CPAN. In Python you can make use of TextWrapper in the textwrap module (included with Python).
j.
28 Jun 2006 | # Allan Odgaard wrote…
Jay: The entry was inspired by a friend of mine who asked over IM specifically for a regexp to word wrap text at column 80 (and force a break if a line was too long). I do not know what environment he was in, other than he is a .NET whore, so neither TextMate, fmt, perl, or similar was of any help to him ;)
Since coincidentally he was not the first to ask me about this, I decided to write it up on this blog, also for the hopefully educational value in deconstructing the regexp (as many TM users are still not fully comfortable with regexps.)
If you watch my customization screencast I actually do call fmt myself from Ruby :) though the last time I needed to do something like word wrap myself was for emails sent out. Here the text needed to be both wrapped and indented, resulting in the function below (which is what I based this entry on):
This is btw for the ticket system. When I release a new build of TM the change log is scanned for ticket IDs and an email is sent to people involved with that ticket ID quoting the relevant part of the release notes, but nicely wrapped and indented, using the function above.
Anything else? :)
28 Jun 2006 | # Allan Odgaard wrote…
This is a little funny, I just noticed that the TextMate Ruby bundle has a snippet to insert a word wrapping gsub :)
28 Jun 2006 | # Soryu wrote…
Haha, I know who the .net whore is! ;)
29 Jun 2006 | # Jeremy Dunck wrote…
Come come now, .Net developers are sharecroppers, not whores. :)
30 Jun 2006 | # Florian Hars wrote…
Your solution ignores tabs and converts trailing whitespace into spurious paragraph breaks:
$ perl -e '$a = "Testtest \nTest\n"; $a =~ s/(.{1,10})( +|$)|(.{1,10})/$1$3\n/gm; print $a'
Testtest

Test
(Where is the preview button so I can check that markdown did in fact interpret my code example correctly?)
30 Jun 2006 | # Florian Hars wrote…
See, it didn’t. Here are some of the missing backslashes, please insert as appropriate:
30 Jun 2006 | # Allan Odgaard wrote…
Florian: not handling tabs was deliberate since you can’t wrap text which contain tab characters to a given column without knowing the tab size which will be used for display.
So running the text through tab expansion would generally be the practical solution, though you can make the match for space into space-or-tab if you’re more interested in a character wrap, e.g. for wrapping email/MIME text where there is a line width limit (make sure the . then matches a byte and not a (potentially multi-byte) character).
Adding a match for an optional newline after the one-or-more spaces should handle the excessive use of spaces in your example.
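A rough sketch of the space-or-tab variant, with the optional newline after the whitespace included. The name wrap_chars and the sample input are made up for illustration, and each tab simply counts as one character here, so this is a character wrap rather than a column wrap.

```ruby
# Hypothetical variant: break at runs of spaces or tabs, counting a tab as one character.
def wrap_chars(txt, col = 80)
  txt.gsub(/(.{1,#{col}})([ \t]+|$)\n?|(.{#{col}})/, "\\1\\3\n")
end

# Made-up input with a tab between the first two words.
result = wrap_chars("foo\tbar baz", 5)
p result  # => "foo\nbar\nbaz\n"
```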
As for backslashes, this is unfortunately a display bug in WordPress, if/when I fix it, they should show up for your comment :)
02 Jul 2006 | # Allan Odgaard wrote…
For anyone interested, the backslash swallow problem has been fixed :)
The markdown plug-in which I use does disable the problematic WP comment text filter (wpautop) but it didn’t provide the priority under which it was added (30) which made remove_filter a no-op.
08 Aug 2006 | # Kaspar Schiess wrote…
This’ll even hyphenate ;)
#!/usr/bin/env ruby
require 'rubygems'
require 'text/reform'
r = Text::Reform.new
while line = gets
  puts r.format('[' * 40, line)
end
You should
before running this inside TextMate.
greetings
15 Aug 2006 | # Mitch wrote…
I was looking for a similar solution for coldfusion. This works, but then I found that you can use the coldfusion function “wrap” like such: wrap(str,5). Got to love CF.
13 Oct 2006 | # Artūras Šlajus wrote…
This is a version for when you need strictly cut lines:
03 Dec 2006 | # Kike Lahuerta wrote…
This version is for vb.net
End Sub
06 Mar 2007 | # david wrote…
Can you tell me if this is a good tutorial?
I want to learn regexp but I don’t know where I can find a good tutorial.
Can you help me?
05 Jul 2007 | # Tony wrote…
I’m trying to write a little reflow command for TextMate and I was wondering if there’s a way to get the current document’s wrap column.
In other words, instead of passing the column width col = 80, is there a global TextMate variable that stores the document’s wrap column?
05 Jul 2007 | # Allan Odgaard wrote…
Tony: The TM_COLUMNS variable has the number. When there is a column selection it changes to the number of columns selected (i.e. the width of the selection), but for your command that’s probably desired.
03 Aug 2007 | # Anonymous wrote…
wwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwww
03 Aug 2007 | # Jose wrote…
Sorry, that was me above, testing whether it would wrap or not. I’m looking for a way to word wrap text in non-fixed-width fonts. Because the same number of w’s and i’s won’t have the same display width, is there a way to make them look “equal”?
31 Aug 2007 | # Guillaume wrote…
Hmm, would it be possible to put a “-” before the line break when it is cutting a word (and not put a “-” when it is cutting at white space)?
Thanks
Guillaume.
01 Sep 2007 | # Allan Odgaard wrote…
Guillaume: In ruby you would do something like below.
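A minimal sketch of one way to do it, not necessarily the snippet originally posted: use the block form of gsub and append a “-” only when the forced-break alternative matched. The name wrap_text_hyphen is made up, the forced break takes col - 1 characters so the hyphen still fits in the column, and col is assumed to be at least 2.

```ruby
# Hypothetical helper: like wrap_text, but marks forced breaks with a hyphen.
def wrap_text_hyphen(txt, col = 80)
  txt.gsub(/(.{1,#{col}})( +|$)\n?|(.{#{col - 1}})/) do
    # $1 is set when the break falls on spaces or end of line,
    # otherwise $3 holds the cut-off piece of a long word.
    $1 ? "#{$1}\n" : "#{$3}-\n"
  end
end

result = wrap_text_hyphen("abcdefghij kl", 5)
puts result
# Prints:
# abcd-
# efgh-
# ij kl
```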