path: root/roff_escape.c
Commit message | Author | Age | Files | Lines
* Since \. is not a character escape sequence, re-classify it from the wrong parsing class ESCAPE_SPECIAL to the better-suited parsing class ESCAPE_UNDEF, exactly like it is already done for the similar \\, which isn't a character escape sequence either. No formatting change is intended just yet, but this will matter for upcoming improvements in the parser for roff(7) macro, string, and register names. See the node "5.23.2 Copy Mode" in "info groff" regarding what \\ and \. really mean.
  Author: Ingo Schwarze | Age: 2022-06-02 | Files: 1 | Lines: -2/+2
* Avoid the layering violation of re-parsing for \E in roff_expand(). To that end, add another argument to roff_escape() returning the index of the escape name. This also makes the code in roff_escape() a bit more uniform in so far as it no longer needs the "char esc_name" local variable but now does everything with indices into buf[]. No functional change.
  Author: Ingo Schwarze | Age: 2022-06-02 | Files: 1 | Lines: -19/+22
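The idea of handing the escape name index back to the caller can be sketched as follows. This is a simplified illustration, not the actual roff_escape() code: the names scan_escape() and escape_name() are hypothetical, and every escape is assumed to have a one-character name with no argument.

```c
#include <stddef.h>

/*
 * Hypothetical, simplified stand-in for an escape parser.
 * Returns the index just past the escape sequence starting at
 * buf[start] and stores the index of the escape name in *inam.
 * Any number of 'E' characters after the backslash is skipped here,
 * once, so that callers never have to re-parse for \E themselves.
 */
static size_t
scan_escape(const char *buf, size_t start, size_t *inam)
{
	size_t i = start;

	*inam = start;
	if (buf[i] != '\\')
		return start;		/* not an escape sequence */
	i++;
	while (buf[i] == 'E')		/* treat "\E" like "\" */
		i++;
	*inam = i;
	if (buf[i] == '\0')
		return i;		/* incomplete: stop at the terminator */
	return i + 1;			/* assumed one-character name */
}

/* Caller side: look up the name through the returned index. */
static char
escape_name(const char *buf, size_t start)
{
	size_t inam;

	(void)scan_escape(buf, start, &inam);
	return buf[inam];
}
```

With the input "a\En" and start index 1, escape_name() yields 'n' without the caller ever looking at the 'E'.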
* Fix a buffer overrun in the roff(7) escape sequence parser that could be triggered by macro arguments ending in double backslashes, for example if people wrote .Sq "\\" instead of the correct .Sq "\e".

  The bug was hard to find because it caused a segfault only very rarely, according to my measurements with a probability of less than one permille. I'm sorry that the first one to hit the bug was an arm64 release build run by deraadt@. Thanks to bluhm@ for providing access to an arm64 machine for debugging purposes. In the end, the bug turned out to be architecture-independent.

  The reason for the bug was that i assumed an invariant that does not exist. The function roff_parse_comment() is very careful to make sure that the input buffer does not end in an escape character before passing it on, so i assumed this is still true when reaching roff_expand() immediately afterwards. But roff_expand() can also be reached from roff_getarg(), in which case there *can* be a lone escape character at the end of the buffer in case copy mode processing found and converted a double backslash. Fix this by handling a trailing escape character correctly in the function roff_escape().

  The lesson here probably is to refrain from assuming an invariant unless verifying that the invariant actually holds is reasonably simple. In some cases, in particular for invariants that are important but not simple, it might also make sense to assert(3) rather than just assume the invariant. An assertion failure is so much better than a buffer overrun...
  Author: Ingo Schwarze | Age: 2022-06-01 | Files: 1 | Lines: -0/+3
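The kind of check described above, refusing to step past a lone escape character at the end of the buffer, might look roughly like this. It is a minimal sketch under stated assumptions, not the three lines actually added to roff_escape(): the enum esc_result and the function classify_escape() are hypothetical, and escapes are again assumed to have a one-character name.

```c
#include <stddef.h>

/* Hypothetical result codes for this sketch. */
enum esc_result { ESC_NONE, ESC_INCOMPLETE, ESC_OK };

/*
 * Classify the escape sequence at buf[pos] and store the index of
 * the first byte after it in *end.  The important case is a
 * backslash directly followed by the NUL terminator: *end must
 * point at the terminator, never beyond it.
 */
static enum esc_result
classify_escape(const char *buf, size_t pos, size_t *end)
{
	*end = pos;
	if (buf[pos] != '\\')
		return ESC_NONE;
	if (buf[pos + 1] == '\0') {
		/* Lone trailing escape character: do not read further. */
		*end = pos + 1;
		return ESC_INCOMPLETE;
	}
	*end = pos + 2;		/* assumed one-character name */
	return ESC_OK;
}
```

Because the function tests for the terminator itself, it no longer matters whether the caller is roff_parse_comment(), which guarantees the invariant, or some other path that does not.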
* Rudimentary implementation of the \A escape sequence, following groff semantics (test identifier for syntactical validity), not at all following the completely unrelated Heirloom semantics (define hyperlink target position).

  The main motivation for providing this implementation is to get \A into the parsing class ESCAPE_EXPAND that corresponds to groff parsing behaviour, which is quite similar to the \B escape sequence (test numerical expression for syntactical validity). This is likely to improve parsing of nested escape sequences in the future.

  Validation isn't perfect yet. In particular, this implementation rejects \A arguments containing some escape sequences that groff allows to slip through. But that is unlikely to cause trouble even in documents using \A for non-trivial purposes. Rejecting the nested escapes in question might even improve robustness because the rejected names are unlikely to really be usable for practical purposes - no matter that groff dubiously considers them syntactically valid.
  Author: Ingo Schwarze | Age: 2022-05-31 | Files: 1 | Lines: -3/+18
* Trivial patch to put the roff(7) \g (interpolate format of register) escape sequence into the correct parsing class, ESCAPE_EXPAND. Expansion of \g is supposed to work exactly like the expansion of the related escape sequence \n (interpolate register value), but since we ignore the .af (assign output format) request, we just interpolate an empty string to replace the \g sequence. Surprising as it may seem, this actually makes a formatting difference for deviant input like ".O\gNx" which used to raise bogus "escaped character not allowed in a name" and "skipping unknown macro" errors and printed nothing, whereas now it correctly prints "OpenBSD".
  Author: Ingo Schwarze | Age: 2022-05-31 | Files: 1 | Lines: -1/+1
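To see why interpolating an empty string turns ".O\gNx" into the macro line ".Ox", consider the following sketch. It is not mandoc code: strip_g_escapes() is a hypothetical helper that only knows the three common name forms (\gx, \g(xx and \g[name]) and copies every other escape through unchanged.

```c
#include <stddef.h>
#include <string.h>

/*
 * Copy src to dst, replacing every \g escape sequence with the
 * empty string.  Returns 0 on success or -1 if dst is too small.
 */
static int
strip_g_escapes(const char *src, char *dst, size_t dstsz)
{
	size_t n, o = 0;
	int k;

	while (*src != '\0') {
		if (src[0] == '\\' && src[1] == 'g') {
			src += 2;
			if (*src == '(') {		/* \g(xx */
				src++;
				for (k = 0; k < 2 && *src != '\0'; k++)
					src++;
			} else if (*src == '[') {	/* \g[name] */
				while (*src != '\0' && *src != ']')
					src++;
				if (*src == ']')
					src++;
			} else if (*src != '\0')	/* \gx */
				src++;
			continue;	/* nothing is written to dst */
		}
		/*
		 * Copy any other escape as a two-character unit so that
		 * the 'g' in "\\g" is not mistaken for an escape name.
		 */
		n = (src[0] == '\\' && src[1] != '\0') ? 2 : 1;
		if (o + n >= dstsz)
			return -1;
		memcpy(dst + o, src, n);
		o += n;
		src += n;
	}
	if (o >= dstsz)
		return -1;
	dst[o] = '\0';
	return 0;
}
```

Applied to ".O\gNx", the register name N disappears together with the \g, leaving ".Ox", which mdoc(7) renders as "OpenBSD".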
* Dummy implementation of the roff(7) \V (interpolate environment variable) escape sequence. This is needed to get \V into the correct parsing class, ESCAPE_EXPAND. It is intentional that mandoc(1) output is *not* influenced by environment variables, so interpolate the name of the variable with some decorating punctuation rather than interpolating its value.
  Author: Ingo Schwarze | Age: 2022-05-30 | Files: 1 | Lines: -1/+1
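A placeholder of this kind could be produced as in the sketch below. The commit message does not say which punctuation is used, so the "${name}" form is an assumption made for illustration, and format_env_placeholder() is a hypothetical helper rather than a mandoc function.

```c
#include <stdio.h>

/*
 * Write a placeholder for the environment variable called
 * name[0..namelen-1] into dst.  The environment is deliberately
 * not consulted, so the output does not depend on it.
 * The "${...}" decoration is an assumption of this sketch.
 * Returns 0 on success or -1 if dst is too small.
 */
static int
format_env_placeholder(const char *name, size_t namelen,
    char *dst, size_t dstsz)
{
	int n;

	n = snprintf(dst, dstsz, "${%.*s}", (int)namelen, name);
	return (n < 0 || (size_t)n >= dstsz) ? -1 : 0;
}
```

The point of the design is reproducibility: formatting the same input file gives the same output regardless of who runs the formatter and in which environment.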
* Re-classify the roff(7) \r (reverse line feed) escape sequence from "ignore" to "unsupported" because when an input file uses it, mandoc(1) is likely to significantly misformat the output, usually showing parts of the output in a different order than the author intended.
  Author: Ingo Schwarze | Age: 2022-05-20 | Files: 1 | Lines: -1/+1
* Make roff_expand() parse left-to-right rather than right-to-left. Some escape sequences have side effects on global state, implying that the order of evaluation matters. For example, this fixes the long-standing bug that "\n+x\n+x\n+x" after ".nr x 0 1" used to print "321"; now it correctly prints "123".

  Right-to-left parsing was convenient because it implicitly handled nested escape sequences. With correct left-to-right parsing, nesting now requires an explicit implementation, here solved as follows:

  1. Handle nested expanding escape sequences iteratively. When finding one, expand it, then retry parsing the enclosing escape sequence from the beginning, which will ultimately succeed as soon as it no longer contains any nested expanding escape sequences.
  2. Handle nested non-expanding escape sequences recursively. When finding one, the escape sequence parser calls itself to find the end of the inner sequence, then continues parsing the outer sequence after that point.

  This requires the mandoc_escape() function to operate in two different modes. The roff(7) parser uses it in a mode where it generates diagnostics and may return an expansion request instead of a parse result. All other callers, in particular the formatters, use it in a simpler mode that never generates diagnostics and always returns a definite parsing result, but that requires all expanding escape sequences to already have been expanded earlier. The bulk of the code is the same for both modes.

  Since this required a major rewrite of the function anyway, move it into its own new file roff_escape.c and out of the file mandoc.c, which was misnamed in the first place and lacks a clear focus.

  As a side benefit, this also fixes a number of assertion failures that tb@ found with afl(1), for example "\n\\\\*0", "\v\-\\*0", and "\w\-\\\\\$0*0".

  As another side benefit, it also resolves some code duplication between mandoc_escape() and roff_expand() and centralizes all handling of escape sequences (except for expansion) in roff_escape.c, hopefully easing maintenance and feature improvements in the future.

  While here, also move end-of-input handling out of the complicated function roff_expand() and into the simpler function roff_parse_comment(), making the logic easier to understand.

  Since this is a major reorganization of a central component of mandoc(1), stability of the program might slightly suffer for a few weeks, but i believe that's not a problem at this point of the release cycle. The new code already satisfies the regression suite, but more tweaking and regression testing to further improve the handling of various escape sequences will likely follow in the near future.
  Author: Ingo Schwarze | Age: 2022-05-19 | Files: 1 | Lines: -0/+477
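The "321" versus "123" example can be reproduced with a small model of an auto-incrementing register. This is an illustrative sketch only: struct reg, interpolate_autoinc() and expand_all() are hypothetical names, and the real parser works on text buffers rather than on an array of integers.

```c
/* A number register as set up by ".nr x 0 1": value 0, step 1. */
struct reg {
	int	val;
	int	step;
};

/* Model of \n+x: add the step first, then interpolate the new value. */
static int
interpolate_autoinc(struct reg *r)
{
	r->val += r->step;
	return r->val;
}

/*
 * Expand n consecutive \n+x sequences.  out[k] receives the value
 * that ends up at the k-th position in the output text.  The order
 * in which the positions are visited determines which position
 * sees which side effect.
 */
static void
expand_all(struct reg *r, int *out, int n, int left_to_right)
{
	int k;

	for (k = 0; k < n; k++)
		out[left_to_right ? k : n - 1 - k] =
		    interpolate_autoinc(r);
}
```

Visiting the three positions from the right assigns 1 to the last position and 3 to the first, which is the old "321" behaviour; visiting them from the left gives "123".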