Module talk:abq-translit


string.gsub

Theknightwho: Sorry for reverting. This is a classic case where string.gsub is best, from what I remember, because there are a lot of replacements performed in a row and none of them use Unicode-aware character sets or repetition notation. But I could have asked about your reasoning first. — Eru·tuon 21:16, 10 December 2022 (UTC)
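To illustrate the distinction, a minimal sketch (not code from this module; it assumes a Scribunto environment where mw.ustring is available):

```lua
-- Sketch only; assumes a Scribunto environment where mw.ustring is available.
local text = "абвгд"

-- Literal replacement: the Cyrillic letters are fixed byte sequences, so the
-- byte-oriented string.gsub matches them correctly and quickly.
text = string.gsub(text, "б", "b")

-- A character set or repetition over Cyrillic is different: string.gsub would
-- treat each UTF-8 byte separately, so the Unicode-aware function is needed.
text = mw.ustring.gsub(text, "[а-я]+", "?")
```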

@Erutuon I did this based on my experience with the sortkeys, where I was finding that mw.ustring.gsub was more memory efficient (by a tiny amount), and I felt there was a small advantage in simplifying the module layout anyway, for usability purposes. I've noticed that there are a few pages in CAT:E, but none of them invoke this module. Theknightwho (talk) 21:26, 10 December 2022 (UTC)
@Theknightwho: Ah, the actual difference that I encountered was that string.gsub is much faster. Benchmark here, with some things commented out; if you bump the repetitions up to 100,000 (add a zero), the version that uses some string.gsub finishes in several seconds, but the mw.ustring.gsub-only version runs out of time. It probably wouldn't make an entry run out of time, but it would noticeably slow down parsing of a long frequency list with transliterations (though there isn't one for Abaza). The benchmark didn't show a significant difference in memory usage for me, but there could be a bigger change on another run or on another page, because the garbage collector behaves unpredictably.
I bet string.gsub uses Lua memory, while mw.ustring.gsub doesn't (except to construct the final string), because it calls into PHP. But the memory total takes into account how much memory the garbage collector deallocates, which changes unpredictably: using less memory at one point in a page might make the garbage collector deallocate later, deallocate less, or not at all. — Eru·tuon 22:13, 10 December 2022 (UTC)
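For reference, a rough sketch of the kind of comparison being described (illustrative only, not the linked benchmark; it assumes Scribunto, where os.clock and mw.log are available). As noted above, at 100,000 repetitions the mw.ustring.gsub half may well hit the time limit:

```lua
-- Rough benchmark sketch; illustrative only.
local input = string.rep("абвгдежз", 50)

local function time(f, reps)
	local start = os.clock()
	for _ = 1, reps do
		f(input)
	end
	return os.clock() - start
end

local byteTime = time(function(s) return (string.gsub(s, "б", "b")) end, 100000)
local uTime    = time(function(s) return (mw.ustring.gsub(s, "б", "b")) end, 100000)
mw.log(("string.gsub: %.3fs; mw.ustring.gsub: %.3fs"):format(byteTime, uTime))
```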
@Erutuon That is a good point. Regardless of whether there's a long list of Abaza terms, it still becomes relevant on pages with large numbers of transliterations. At that point, the scant savings in memory (they did seem to be consistent, but were under 1 KB) are probably not worth it, given the sheer randomness. I also didn't do any kind of systematic checking; it was just a pattern that I noticed over time. It may be something as small as the fact that two fewer variables get created.
On a somewhat related note, I wonder if there might be a way to force garbage collection by creating a holding while loop when deleting certain objects, such that the logic does not continue until that object has actually been removed from memory. I suspect it might be possible using a weak-referenced table (perhaps converting a table to weak references at the point it is no longer needed). Even if that is possible, though, it may cause intolerably long delays in parsing time. Theknightwho (talk) 22:29, 10 December 2022 (UTC)
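A sketch of the weak-table idea, purely to show the shape of it (it assumes Lua 5.1 weak-table semantics; if collectgarbage is unavailable, as I believe it is in Scribunto's sandbox, the loop can only allocate and wait for the incremental collector, so it is capped to avoid spinning forever):

```lua
-- Sketch of the idea only; hypothetical, not from any module.
-- A weak-valued table lets us observe when an object has actually been collected.
local tracker = setmetatable({}, { __mode = "v" })

local big = { payload = string.rep("x", 2 ^ 20) }
tracker.obj = big
big = nil -- drop the only strong reference

-- Busy-wait until the entry disappears, allocating a little garbage each pass
-- to give the incremental collector work; capped so it cannot loop indefinitely.
local attempts = 0
while tracker.obj ~= nil and attempts < 100000 do
	attempts = attempts + 1
	local _ = { attempts }
end
```

Whether the collector ever runs inside such a loop is the open question; the cap at least keeps the wait bounded.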
It might be worth making another module that compiles the long list of replacements into the minimum number of Lua patterns, which could then be copied into the transliteration module. The patterns might have to use mw.ustring.gsub, but I imagine it would mean fewer intermediate strings being created (a new string object is created whenever a replacement succeeds and there isn't already a copy of that string in Lua's string interner) and fewer passes through the string. It would require editing the replacements in two steps: first editing the other module, then copying its output into the transliteration module. I'm not sure exactly how to do this yet, though I once saw some Lua code that converted Unicode code point ranges into patterns matching UTF-8 sequences, which is a similar problem. — Eru·tuon 00:54, 11 December 2022 (UTC)
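One concrete shape the compiled output could take, as a sketch with made-up mappings (not the module's actual tables): collapse all the one-to-one rules into a single character class with a table as the replacement, so a single Unicode-aware pass handles them all:

```lua
-- Sketch only: hypothetical mappings, not the real Abaza tables.
local single = {
	["а"] = "a",
	["б"] = "b",
	["в"] = "v",
}

local function translit(text)
	-- Multi-character sequences still need their own, earlier passes.
	text = mw.ustring.gsub(text, "гь", "g")
	-- Every remaining one-to-one rule collapses into a single pass; the table
	-- replacement means no separate gsub call (and intermediate string) per rule.
	text = mw.ustring.gsub(text, "[абв]", single)
	return text
end
```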
@Erutuon That sounds like a good idea, but implementing it is above my head. On the subject of reducing the number of intermediate strings created, one relatively straightforward change that we could make is:
  1. An initial check to determine whether a string is (a) lowercase, (b) initial uppercase or (c) fully uppercase;
  2. Convert the string to lowercase;
  3. Carry out all the usual substitutions;
  4. Convert the output into the format detected by the initial check.
This would not only reduce the size of the tables (and make things simpler for editors by reducing duplication), but would also (almost) halve the number of mw.ustring.gsub stages in these Caucasian translit modules. Admittedly it wouldn't be able to handle nonstandard capitalisation, but such entries are extremely rare (and I suspect nonexistent in these languages). A rough sketch of what this could look like follows below. Theknightwho (talk) 01:18, 11 December 2022 (UTC)
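A minimal sketch of steps 1 to 4 (hypothetical mapping and helper, not taken from the actual module; it assumes mw.ustring is available):

```lua
-- Sketch only; hypothetical lowercase-only mapping.
local map = { ["а"] = "a", ["б"] = "b" }

local function translit(text)
	-- 1. Detect whether the input is lowercase, initial-uppercase or fully uppercase.
	local lower = mw.ustring.lower(text)
	local upper = mw.ustring.upper(text)
	local allUpper  = text == upper and text ~= lower
	local initUpper = not allUpper
		and mw.ustring.sub(text, 1, 1) == mw.ustring.sub(upper, 1, 1)
		and mw.ustring.sub(text, 2) == mw.ustring.sub(lower, 2)

	-- 2. Work on the lowercase form, so the tables only need lowercase entries.
	local out = mw.ustring.gsub(lower, "[аб]", map)

	-- 3. (The rest of the usual substitutions would go here.)

	-- 4. Restore the casing detected in step 1.
	if allUpper then
		out = mw.ustring.upper(out)
	elseif initUpper then
		out = mw.ustring.upper(mw.ustring.sub(out, 1, 1)) .. mw.ustring.sub(out, 2)
	end
	return out
end
```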
I have drawn up an example at Module:User:Theknightwho/abq-translit. Theknightwho (talk) 02:13, 11 December 2022 (UTC)