xr

Description
Convert string regexp to rx notation
Latest
xr-2.0.0.20240802.81742.tar (.sig), 2024-Aug-02, 120 KiB
Maintainer
Mattias Engdegård <mattiase@acm.org>
Website
https://github.com/mattiase/xr

To install this package from Emacs, use package-install or list-packages.

Full description

                xr -- Emacs regexp parser and analyser
                ======================================

XR converts Emacs regular expressions to the structured rx form, thus
being an inverse of rx. It can also find mistakes and questionable
constructs inside regexp strings.

It can be useful for:

- Migrating existing code to rx form
- Understanding what a regexp string really means
- Finding errors in regexp strings

It can also parse and find mistakes in skip-sets, the regexp-like
arguments to skip-chars-forward and skip-chars-backward.

The xr package can be used interactively or by other code as a library.


* Example

  (xr-pp "\\`\\(?:[^^]\\|\\^\\(?: \\*\\|\\[\\)\\)")

  outputs

  (seq bos 
       (or (not (any "^"))
           (seq "^"
                (or " *" "["))))


* Installation

  From GNU ELPA (https://elpa.gnu.org/packages/xr.html):

    M-x package-install RET xr RET


* Interface

  Functions parsing regexp strings:
  
   xr       --  convert regexp to rx
   xr-pp    --  convert regexp to rx and pretty-print
   xr-lint  --  find mistakes in regexp
  
  Functions parsing skip sets:
  
   xr-skip-set       --  convert skip-set to rx
   xr-skip-set-pp    --  convert skip-set to rx and pretty-print
   xr-skip-set-lint  --  find mistakes in skip-set
  
  Utility:
  
   xr-pp-rx-to-str  --  pretty-print rx expression to string
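
  As a minimal sketch of library use, the functions above can be
  combined with the built-in rx-to-string. The helper names
  my-xr-round-trip and my-xr-explain are hypothetical and not part of
  xr, and the sketch assumes the package is installed.

```elisp
;; Sketch only: assumes the xr package is installed.  The require is
;; non-fatal so that the definitions below still load without it.
(require 'xr nil 'noerror)
(require 'rx)

(defun my-xr-round-trip (regexp)
  "Translate REGEXP to rx form with `xr', then back with `rx-to-string'.
Hypothetical helper, not part of xr."
  (rx-to-string (xr regexp) t))

(defun my-xr-explain (regexp)
  "Return the pretty-printed rx form of REGEXP as a string.
Hypothetical helper; it only combines `xr' and `xr-pp-rx-to-str'."
  (xr-pp-rx-to-str (xr regexp)))
```

  The string returned by my-xr-round-trip need not be identical to the
  input, but it should match the same text.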
  

* What the diagnostics mean

  - Unescaped literal 'X'

    A special character is taken literally because it occurs in a
    position where it does not need to be backslash-escaped. It is
    good style to do so anyway (assuming that it should occur as a
    literal character).
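
    The effect can be observed with the built-in string-match-p alone.
    In the sketch below, my-unescaped-literal-demo is a hypothetical
    name; it shows two positions where Emacs takes a special character
    literally, next to the escaped spellings that mean the same thing.

```elisp
(defun my-unescaped-literal-demo ()
  "Return match positions showing that unescaped `$' and `*' can be literal.
Hypothetical demo function using only built-in `string-match-p'."
  (list (string-match-p "a$b" "a$b")      ; `$' mid-regexp is literal
        (string-match-p "a\\$b" "a$b")    ; escaped form, same meaning
        (string-match-p "*a" "*a")        ; leading `*' is literal
        (string-match-p "\\*a" "*a")))    ; escaped form, same meaning

(my-unescaped-literal-demo)   ; => (0 0 0 0)
```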

  - Escaped non-special character 'X'
  
    A character is backslash-escaped even though this is not necessary
    and does not turn it into a special sequence. Maybe the backslash
    was in error, or should be doubled if a literal backslash was
    expected.
  
  - Duplicated 'X' inside character alternative
  
    A character occurs twice inside [...]; this is obviously
    pointless. In particular, backslashes are not special inside
    [...]; they have no escaping power, and do not need to be escaped
    in order to include a literal backslash.
  
  - Repetition of repetition
  - Repetition of option
  - Optional repetition
  - Optional option
  
    A repetition construct is applied to an expression that is already
    repeated, such as a*+ or \(x?\)?. These expressions can be written
    with a single repetition and often indicate a different mistake,
    perhaps a missing backslash.

    When a repetition construct is ? or ??, it is termed 'option'
    instead; the principle is the same.

  - Reversed range 'Y-X' matches nothing

    The last character of a range precedes the first and therefore
    includes no characters at all (not even the endpoints). Most such
    ranges are caused by a misplaced hyphen.
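
    A small check with built-in functions illustrates this, assuming
    Emacs 27 or later (the version this package requires). The
    function name my-reversed-range-demo is hypothetical.

```elisp
(defun my-reversed-range-demo ()
  "Return match results for a reversed range and its corrected form.
Hypothetical demo function using only built-in `string-match-p'."
  (let ((case-fold-search nil))
    (list (string-match-p "[z-a]" "a")      ; nil: the range is empty
          (string-match-p "[z-a]" "z")      ; nil: endpoints are excluded too
          (string-match-p "[z-a]" "m")      ; nil
          (string-match-p "[a-z]" "m"))))   ; 0: endpoints in the right order

(my-reversed-range-demo)   ; => (nil nil nil 0)
```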

  - Character 'B' included in range 'A-C'
  - Range 'A-C' includes character 'B'

    A range includes a character that also occurs individually. This
    is often caused by a misplaced hyphen.

  - Ranges 'A-M' and 'D-Z' overlap

    Two ranges have at least one character in common. This is often
    caused by a misplaced hyphen.

  - Two-character range 'A-B'

    A range only consists of its two endpoints, since they have
    consecutive character codes. This is often caused by a misplaced
    hyphen.

  - Range 'A-z' between upper and lower case includes symbols

    A range spans over upper and lower case letters, which also
    includes some symbols. This is probably unintentional. To cover
    both upper and lower case letters, use separate ranges, as in
    [A-Za-z].
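
    The symbols in question are the ASCII characters between Z and a,
    such as the underscore. A short sketch with the built-in
    string-match-p (my-mixed-case-range-demo is a hypothetical name):

```elisp
(defun my-mixed-case-range-demo ()
  "Return match results comparing [A-z] with [A-Za-z].
Hypothetical demo function using only built-in `string-match-p'."
  (let ((case-fold-search nil))
    (list (string-match-p "[A-z]" "_")        ; 0: `_' lies between Z and a
          (string-match-p "[A-Za-z]" "_")     ; nil: letters only
          (string-match-p "[A-Za-z]" "q"))))  ; 0

(my-mixed-case-range-demo)   ; => (0 nil 0)
```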

  - Suspect character range '+-X': should '-' be literal?

    A range has + as one of its endpoints, which could mean that the
    hyphen was actually intended to be literal in order to match both
    + and -.
    This check is only enabled when the 'checks' argument is 'all'.

  - Possibly erroneous '\X' in character alternative

    A character alternative includes something that looks like an
    escape sequence, but no escape sequences are allowed there since
    backslash is not a special character in that context.
    It could also be caused by too many backslashes.

    For example, "[\\n\\t]" matches the characters 'n', 't' and
    backslash, but could be an attempt to match newline and tab.
    This check is only enabled when the 'checks' argument is 'all'.
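
    The difference between the two spellings can be checked directly.
    The sketch below uses only built-in functions;
    my-backslash-in-alternative-demo is a hypothetical name.

```elisp
(defun my-backslash-in-alternative-demo ()
  "Compare a doubly escaped character alternative with the intended one.
Hypothetical demo function using only built-in `string-match-p'."
  (let ((suspect "[\\n\\t]")    ; the set holding backslash, n and t
        (intended "[\n\t]"))    ; the set holding newline and tab
    (list (string-match-p suspect "n")       ; 0
          (string-match-p suspect "\\")      ; 0: backslash is a member
          (string-match-p suspect "\n")      ; nil: newline is not
          (string-match-p intended "\n")     ; 0
          (string-match-p intended "n"))))   ; nil

(my-backslash-in-alternative-demo)   ; => (0 0 nil 0 nil)
```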

  - Duplicated character class '[:class:]'

    A character class occurs twice in a single character alternative
    or skip set.

  - Or-pattern more efficiently expressed as character alternative

    When an or-pattern can be written as a character alternative, it
    becomes more efficient and reduces regexp stack usage.
    For example, a\|b is better written [ab], and \s-\|\sw is usually
    better written [[:space:][:word:]]. (There is a subtle difference
    in how syntax properties are handled but it rarely matters.)
    This check is only enabled when the 'checks' argument is 'all'.

  - Duplicated alternative branch

    The same expression occurs in two different branches, like in
    A\|A. This has the effect of only including it once.

  - Branch matches superset/subset of a previous branch

    A branch in an or-expression matches a superset or subset of what
    another branch matches, like in [ab]\|a. This means that one of
    the branches can be eliminated without changing the meaning of the
    regexp.

  - Repetition subsumes/subsumed by preceding repetition

    A repeating expression matches a superset or subset of what the
    previous expression matches, in such a way that one of them is
    unnecessary. For example, [ab]+a* matches the same text as [ab]+,
    so the a* could be removed without changing the meaning of the
    regexp.

  - First/last item in repetition subsumes last/first item (wrapped)

    The first and last items in a repeated sequence, being effectively
    adjacent, match a superset or subset of each other, which makes
    for an unexpected inefficiency. For example, \(?:a*c[ab]+\)* can
    be seen as a*c[ab]+a*c[ab]+... where the [ab]+a* in the middle is
    a slow way of writing [ab]+ which is made worse by the outer
    repetition. The general remedy is to move the subsumed item out of
    the repeated sequence, resulting in a*\(?:c[ab]+\)* in the example
    above.

  - Non-newline follows end-of-line anchor
  - Line-start anchor follows non-newline

    A pattern that does not match a newline occurs right after an
    end-of-line anchor ($) or before a line-start anchor (^).
    This combination can never match.

  - Non-empty pattern follows end-of-text anchor

    A pattern that only matches a non-empty string occurs right after
    an end-of-text anchor (\'). This combination can never match.

  - Use \` instead of ^ in file-matching regexp
  - Use \' instead of $ in file-matching regexp

    In a regexp used for matching a file name, newlines are usually
    not relevant. Line-start and line-end anchors should therefore
    probably be replaced with string-start and string-end,
    respectively. Otherwise, the regexp may fail for file names that
    do contain newlines.

  - Possibly unescaped '.' in file-matching regexp

    In a regexp used for matching a file name, a naked dot is usually
    more likely to be a mistake (missing escaping backslash) than an
    actual intent to match any character except newline, since literal
    dots are very common in file name patterns.
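
    Both file-matching pitfalls above (line anchors and the naked dot)
    can be reproduced with built-in functions. In this sketch the name
    my-file-regexp-demo and the sample file names are made up for
    illustration.

```elisp
(defun my-file-regexp-demo ()
  "Return match positions for loose and strict file-name regexps.
Hypothetical demo function using only built-in `string-match-p'."
  (list (string-match-p "^init" "dir\ninit")     ; 4: `^' matches after newline
        (string-match-p "\\`init" "dir\ninit")   ; nil: only at string start
        (string-match-p ".el\\'" "model")        ; 2: `.' matches the `d'
        (string-match-p "\\.el\\'" "model")      ; nil: a literal dot is needed
        (string-match-p "\\.el\\'" "init.el")))  ; 4

(my-file-regexp-demo)   ; => (4 nil 2 nil 4)
```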

  - Uncounted repetition

    The construct A\{,\} repeats A zero or more times which was
    probably not intended.

  - Implicit zero repetition

    The construct A\{\} only matches the empty string, which was
    probably not intended.

  - Suspect '[' in char alternative

    This warning indicates badly-placed square brackets in a character
    alternative, as in [A[B]C]. A literal ] must come first
    (possibly after a negating ^).
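
    To see how Emacs reads such a regexp, the following sketch compares
    [A[B]C] with a spelling that puts the literal ] first. It uses only
    built-in functions; my-bracket-demo is a hypothetical name.

```elisp
(defun my-bracket-demo ()
  "Show how a misplaced `]' changes a character alternative.
Hypothetical demo function using only built-in `string-match-p'."
  (let ((case-fold-search nil))
    (list
     ;; [A[B]C] is the set {A, [, B} followed by the literal text C]
     (string-match-p "[A[B]C]" "AC]")   ; 0
     (string-match-p "[A[B]C]" "C")     ; nil: C alone does not match
     ;; With `]' first, the set is {], A, [, B, C}
     (string-match-p "[]A[BC]" "]")     ; 0
     (string-match-p "[]A[BC]" "["))))  ; 0

(my-bracket-demo)   ; => (0 nil 0 0)
```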

  - Literal '-' not first or last

    It is good style to put a literal hyphen last in character
    alternatives and skip sets, to clearly indicate that it was not
    intended as part of a range.

  - Repetition of zero-width assertion
  - Optional zero-width assertion

    A repetition operator was applied to a zero-width assertion, like
    ^ or \<, which is completely pointless. The error may be a missing
    escaping backslash.

  - Repetition of expression matching an empty string
  - Optional expression matching an empty string

    A repetition operator was applied to a sub-expression that could
    match the empty string; this is not necessarily wrong, but such
    constructs run very slowly on Emacs's regexp engine. Consider
    rewriting them into a form where the repeated expression cannot
    match the empty string.

    Example: \(?:a*b*\)* is equivalent to the much faster \(?:a\|b\)*.

    Another example: \(?:a?b*\)? is better written a?b*. 

    In general, A?, where A matches the empty string, can be
    simplified to just A.

  - Repetition of effective repetition

    A repetition construct is applied to an expression that itself
    contains a repetition, in addition to some patterns that may match
    the empty string. This can lead to bad matching performance.

    Example: \(?:a*b+\)* is equivalent to the much faster \(?:a*b\)* .

    Another example: \(?:a*b+\)+ is better written a*b[ab]* .

  - Possibly mistyped ':?' at start of group

    A group starts as \(:? which makes it likely that it was really
    meant to be \(?: -- i.e., a non-capturing group.
    This check is only enabled when the 'checks' argument is 'all'.

  - Unnecessarily escaped 'X'

    A character is backslash-escaped in a skip set despite not being
    one of the three special characters - (hyphen), \ (backslash) and
    ^ (caret). It could be unnecessary, or a backslash that should
    have been escaped.

  - Single-element range 'X-X'

    A range in a skip set has identical first and last elements. It is
    rather pointless to have it as a range.

  - Stray '\\' at end of string

    A single backslash at the end of a skip set is always ignored;
    double it if you want a literal backslash to be included.

  - Suspect skip set framed in '[...]'

    A skip set appears to be enclosed in [...], as if it were a
    regexp. Skip sets are not regexps and do not use brackets. To
    include the brackets themselves, put them next to each other.

  - Suspect character class framed in '[...]'

    A skip set contains a character class enclosed in double pairs of
    square brackets, as if it were a regexp. Character classes in skip
    sets are written inside a single pair of square brackets, like
    [:digit:].
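
    The two bracket-related diagnostics above can be illustrated with
    skip-chars-forward itself. The helper my-skip-end below is
    hypothetical; it returns the buffer position reached after skipping,
    where position 1 is the start of the text.

```elisp
(defun my-skip-end (skip-set text)
  "Insert TEXT in a temp buffer and return point after skipping SKIP-SET.
Hypothetical helper built only on `skip-chars-forward'."
  (with-temp-buffer
    (insert text)
    (goto-char (point-min))
    (skip-chars-forward skip-set)
    (point)))

(list (my-skip-end "a-z" "abc]123")          ; 4: stops before `]'
      (my-skip-end "[a-z]" "abc]123")        ; 5: `]' is a set member too
      (my-skip-end "[:digit:]" "[1]2x")      ; 1: `[' is not a digit
      (my-skip-end "[[:digit:]]" "[1]2x"))   ; 5: brackets are skipped as well
```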

  - Empty set matches nothing

    The empty string is a skip set that does not match anything, and
    is therefore pointless.

  - Negated empty set matches anything

    The string "^" is a skip set that matches anything, and is therefore
    pointless.


* See also

  The relint package (https://elpa.gnu.org/packages/relint.html) uses xr
  to find regexp mistakes in elisp code.

  The lex package (https://elpa.gnu.org/packages/lex.html), a lexical
  analyser generator, provides the lex-parse-re function which
  translates regexps to rx, but does not attempt to handle all the
  edge cases of Elisp's regexp syntax or pretty-print the result.

  The pcre2el package (https://github.com/joddie/pcre2el), a regexp
  syntax converter and interactive regexp explainer, can also be used
  for translating regexps to rx. However, xr is more accurate for this
  purpose.

Old versions

xr-1.25.0.20240801.165631.tar.lz, 2024-Aug-01, 23.2 KiB
xr-1.25.0.20240401.74532.tar.lz, 2024-Apr-24, 20.8 KiB
xr-1.25.0.20240123.121048.tar.lz, 2024-Jan-23, 20.8 KiB
xr-1.25.0.20231216.93050.tar.lz, 2023-Dec-21, 20.8 KiB
xr-1.25.0.20231026.84432.tar.lz, 2023-Oct-26, 20.8 KiB
xr-1.24.0.20230901.120103.tar.lz, 2023-Sep-09, 20.8 KiB
xr-1.23.0.20230731.161809.tar.lz, 2023-Jul-31, 19.1 KiB
xr-1.22.0.20220405.84021.tar.lz, 2022-Apr-05, 18.5 KiB
xr-1.21.0.20210430.163339.tar.lz, 2021-Jul-27, 23.8 KiB
xr-1.20.0.20201130.92903.tar.lz, 2021-Jan-21, 23.6 KiB

News

                          xr version history
                          ==================

Version 2.0

- Compatibility break: `xr-lint` and `xr-skip-set-lint` now return a
  two-level list of diagnostics, each of which now includes an endpoint
  and a severity field. The diagnostics are grouped so that messages
  that apply to the same problem are in the same middle-level list.

- Most warnings are now accompanied by info-level messages that point out
  related parts of the input string.

- `xr-lint` and `xr-skip-set-lint` no longer signal errors for invalid
  syntax; such problems are now returned as error-level messages. Other
  functions such as `xr` now signal `xr-parse-error` when the input
  string contains something that Emacs would not accept.

- Emacs version 27 or later now required

- Further performance improvements
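
A caller written against the 2.0 interface might look like the sketch
below. It assumes that each diagnostic is a list of the form
(BEG END MESSAGE SEVERITY); the notes above only say that endpoint and
severity fields were added, so check the `xr-lint` docstring for the
exact layout. The names `my-xr-lint-report` and `my-xr-safe` are
hypothetical.

```elisp
;; Sketch only: assumes the xr package (2.0 or later) is installed.
(require 'xr nil 'noerror)

(defun my-xr-lint-report (regexp)
  "Return a flat list of strings describing the diagnostics for REGEXP.
Hypothetical helper.  Assumes `xr-lint' returns a list of groups, each
group being a list of (BEG END MESSAGE SEVERITY) diagnostics."
  (let (lines)
    (dolist (group (xr-lint regexp))
      (dolist (diag group)
        (pcase-let ((`(,beg ,end ,message ,severity) diag))
          (push (format "%s [%s-%s]: %s" severity beg end message)
                lines))))
    (nreverse lines)))

(defun my-xr-safe (regexp)
  "Return the rx form of REGEXP, or nil if `xr' signals `xr-parse-error'.
Hypothetical helper."
  (condition-case nil
      (xr regexp)
    (xr-parse-error nil)))
```

Flattening the groups loses the association between a warning and its
info-level companions; code that needs it should process one group at
a time.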

Version 1.25
- Effective repetition of repetition check now always enabled
- Some performance improvements

Version 1.24
- \w and \W are now translated to (syntax word) and (not (syntax word)),
  instead of [[:word:]] and [^[:word:]] which are not exact equivalents.
- Repetition operators are now literals after \`. For example,
  \`* is now (seq bos "*"), not (* bos), because this is how Emacs works.
- New lint check: find [A-z] (range between upper and lower case)
- New `checks' argument to xr-lint, used to enable these new checks:
  - Detect [+-X] and [X-+] (range to/from '+')
  - Detect [\\t] etc (escape sequences in character alternative)
  - Detect \(:?...\), as a possible typo for \(?:...\)
  - Detect a\|b that could be [ab] which is more efficient
  - Detect effective repetition of repetition such as \(A+B*\)*

Version 1.23
- Represent explicitly the gap in ranges from ASCII to raw bytes:
  "[A-\xbb]" becomes (any "A-\x7f\x80-\xbb") because that is how
  Emacs regexps work. This also suppresses some false positives
  in `xr-lint' and `xr-skip-set-lint'.

Version 1.22
- More compact distribution

Version 1.21
- Suppress false complaint about (? (+ X))

Version 1.20
- Fix duplication removal in character alternatives, like [aaa]
- All diagnostics are now described in the README file
- Improved anchor conflict checks

Version 1.19
- Added filename-specific checks; new PURPOSE argument to `xr-lint'
- Warn about wrapped subsumption, like \(A*C[AB]*\)+
- Improved scope and accuracy of all subsumption checks
- Warn about anchors in conflict with other expressions, like \(A$\)B

Version 1.18
- Fix test broken in Emacs 26

Version 1.17
- Performance improvements

Version 1.16
- Translate [^\n] into nonl
- Better character class subset/superset analysis
- More accurate repetition subsumption check
- Use text quoting for messages

Version 1.15
- Warn about subsuming repetitions in sequence, like [AB]+A*

Version 1.14
- Warn about repetition of grouped repetition

Version 1.13
- More robust pretty-printing, especially for characters
- Generate (category CHAR) for unknown categories

Version 1.12
- Warn about branch subsumption, like [AB]\|A

Version 1.11
- Warn about repetition of empty-matching expressions
- Detect `-' not first or last in char alternatives or skip-sets
- Stronger ad-hoc [...] check in skip-sets

Version 1.10
- Warn about [[:class:]] in skip-sets
- Warn about two-character ranges like [*-+] in regexps

Version 1.9
- Don't complain about [z-a] and [^z-a] specifically
- Improved skip set checks

Version 1.8
- Improved skip set checks

Version 1.7
- Parse skip-sets, adding `xr-skip-set', `xr-skip-set-pp' and
  `xr-skip-set-lint'
- Ad-hoc check for misplaced `]' in regexps

Version 1.6
- Detect duplicated branches like A\|A

Version 1.5
- Add dialect option to `xr' and `xr-pp'
- Negative empty sets, [^z-a], now become `anything'

Version 1.4
- Detect overlap in character alternatives

Version 1.3
- Improved xr-lint warnings

Version 1.2
- `xr-lint' added