Module talk:User:Theknightwho/template regex

From Wiktionary, the free dictionary
Latest comment: 1 month ago by Theknightwho in topic Template/argument validation regex
Jump to navigation Jump to search

Template/argument validation regex

[edit]

@Benwing2 @JeffDoozan @Erutuon @Surjection @-sche @Chuck Entz @This, that and the other Not sure how useful this is, but I've written a template/argument validation regex, which is a stripped-down version of Module:template parser compressed into regex form. It can deal with arbitrary levels of nesting and all the various weird formatting constructs that impact templates/arguments on a page, but due to the inherent nature of regex it only captures the outermost template/argument, so you'd need to run it recursively to (e.g.) get every template on the page. There might be some clever way to use capture groups to do it all at once - I'm not sure.

I initially wrote this intending to use it as a way to catch stray extra { } or unpaired double/triple braces via an abuse filter (hence the focus on high accuracy), and in most situations it's well-enough optimised to be able to do that, but I'm reluctant to deploy it because it's still possible to generate abusive inputs that cause catastrophic backtracking if you put a large number of open braces in a row. In the template parser, it's simply a case of working rightmost-to-leftmost {{(3){{{(2){{{(1) }}}(1)}}}(2)}}(3) and discarding any leftover leftmost braces, so you never get runaway inputs, but I don't think that's possible in regex, as you can't move the pointer left. Theknightwho (talk) 00:29, 2 May 2024 (UTC)Reply

@Theknightwho Wow that is quite complex. It's too bad we are limited to regexes in abuse filters and such; I can't imagine it would be easy to debug the regex. Benwing2 (talk) 00:38, 2 May 2024 (UTC)Reply
@Benwing2 regex101 has a debugger, thankfully. Theknightwho (talk) 01:18, 2 May 2024 (UTC)Reply
@Benwing2 I think I've found a solution to this. The main issue is that we need a way for inner matches to be retained once they've been found, because otherwise things become exponentially complicated which rapidly gets out of hand. The conventional approach of parsing from the left and relying on backtracking makes this impossible, because that info is wiped every time it backtracks into a long set of opening braces.
The solution is to do something like this (which I've simplified for illustrative purposes - this one handles single-brace pairs, and the content can only contain \w and other pairs of braces):
(?={*({(?(1)\1)(?:\w+|(?R))*})){1,100}{*\K\1
The main lookahead contains {* (forcing a rightmost match), with the rest being a capture group with enclosing single braces, containing (?(1)\1) (if it's captured before, whatever it captured last time), plus (?:\w+|(?R))* (any remaining content, which could be a recursive match). The lookahead has the quantifier {1,100} to force it to run repeatedly, which effectively makes it a while loop ({{{{{foo}}}}}, {{{{{foo}}}}}, {{{{{foo}}}}} etc.). The final {*\K\1 matches the final successful match from the while loop, and \K shaves off any superfluous opening braces.
Since all the inner matches are captured via backreferences (i.e. as plaintext), it's O(N). I don't think it's possible to avoid the finite limit on the loop, though: the lookahead runs once if you use * and twice with +, and if you try to use a recursive subroutine instead the regex quits out after two loops because it thinks it's stuck in an infinite loop, because the same subroutine is being run at the same pointer location repeatedly, and it's not intelligent enough to realise that the changing backreference will eventually cause it to terminate.
I've not fully implemented this yet, but once I have it should be good to go. Theknightwho (talk) 13:35, 5 May 2024 (UTC)Reply