Please change {{t-}} to behave like {{t}} when diacritics are autoremoved.
But I never changed the interwiki link at all. It was {{t-}} before and my bot didn't change that, yours did. So I'm not sure what you want me to do?
Oh, I see. Well, that has never actually been the practice as far as I know. {{tø}} is used to omit the interwiki link, which makes sense if there is no Wiktionary to link to, but also if the transliteration is a SOP term that is unlikely to have an entry on any Wiktionary. {{t-SOP}} already omitted the interwiki for that reason, so replacing it with {{tø}} is the most sensible option.
Re: "So I'm not sure what you want me to do?": As I said, I'm suggesting that you change {{t-}} to behave like {{t}} in cases where it removes diacritics, because there are now (most likely) a few thousand pages that wrongly have {{t-}} in those cases, and it will probably be a while before that situation can be repaired.
Re: second paragraph: I see you've updated the documentation now; thank you.
Actually, on second thought — I should be able to fix these entries by editing the bot to incorporate the diacritic-removing logic from Module:languages. That's probably not a permanent fix (I'm guessing that that logic is not very stable?), but it should at least let me undo the existing damage.
So, please disregard for now. I'll let you know if I run into problems.
Thanks,
The definitions in Module:languages could be changed if editors decide to, yes. Ideally, the bot would fetch the current rules before doing any work, but I don't know how feasible that is, because the bot would then need to parse the Lua code itself. On the other hand, you could write your own module that reads Module:languages and converts all the necessary information into a format your bot can easily parse and use. MewBot does something similar as well with inflection tables.
It would help if there were a function that could output the data in a more machine-readable format (tab-delimited?), maybe at Module:languages/display.
Lua itself is a machine-readable format, isn't it? :) You may want to look at Module:list of languages, which is meant for generating human-readable tables. You could use it for inspiration, at least. I do think that it should go in a private module (Module:User:Rukhabot), because it's unlikely that the module would be useful to anyone else, not that I can foresee anyway. If someone else ends up needing it, it can always be renamed.
Yes, machine readable by Lua. And it's possible to just call eval with a local Lua instance. I see that there are Lua parsers for Python, which is good enough for me, but if you're using Perl you may have more trouble. All this doesn't matter anyway assuming the spacing on the page is consistent. Carry on.
If the format of Module:languages isn't suited to a bot, you can always write a module to convert it into a more suitable format. For example, one of the functions in such a module could generate a simple list of all languages that have the entry_name property, and list the replacement rules along with each language, in whatever format the bot owner finds convenient to parse.
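To illustrate what a bot might do with such a list, here is a minimal Python sketch. The rule table, its shape (ordered from/to pairs per language code) and the function name `make_entry_name` are all assumptions made for illustration; the real rules in Module:languages may be structured differently and may use patterns rather than plain strings.

```python
# Hypothetical rule table: language code -> ordered (from, to) pairs.
# This shape is an assumption; the real data lives in Module:languages.
RULES = {
    "la": [("ā", "a"), ("ē", "e"), ("ī", "i"), ("ō", "o"), ("ū", "u")],
}


def make_entry_name(lang_code, text, rules=RULES):
    """Apply a language's replacement rules to text, in order.

    Languages without an entry in the table are returned unchanged.
    """
    for source, target in rules.get(lang_code, []):
        text = text.replace(source, target)
    return text


print(make_entry_name("la", "amīcus"))
print(make_entry_name("en", "café"))
```

With this table, the Latin term loses its macron, while the English term is left alone because no rules are listed for "en".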
@DTLHS: Re: "it's possible to just call eval with a local Lua instance": That will work in some cases, but not in others: many of our modules refer to MW features (such as the UTF-8 support), or to other modules. I think CodeCat's suggested approach, writing a module to present the data in a tractable format, is better even for Python.
@CodeCat: That worked out pretty well. What I did was, I added a formatDiacriticRemovalRulesAsJson function to Module:User:Ruakh that formats all the diacritic-removal rules into a JSON "text", and changed my bot to call action=expandtemplates&text={{%23invoke:User:Ruakh|formatDiacriticRemovalRulesAsJson}} to retrieve it, and apply the transformations accordingly. Thank you for the suggestion!
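For anyone wanting to try the same thing, here is a rough Python sketch of that round trip. It is not the bot's actual code: the endpoint URL, the helper names, and the shape of the rules inside the JSON are assumptions, and the key holding the expanded text ("*" or "wikitext") depends on the API version and parameters, so the sketch accepts either.

```python
import json
from urllib.parse import urlencode

# Assumed endpoint; adjust for the wiki being queried.
API_URL = "https://en.wiktionary.org/w/api.php"


def build_expand_url(module, function, api_url=API_URL):
    """Build an action=expandtemplates URL that invokes a module function."""
    wikitext = "{{#invoke:%s|%s}}" % (module, function)
    query = urlencode(
        {"action": "expandtemplates", "format": "json", "text": wikitext}
    )  # urlencode turns "#" into %23, as in the call quoted above
    return api_url + "?" + query


def extract_rules(response_body):
    """Decode the API response, then decode the JSON text the module emitted."""
    expanded = json.loads(response_body)["expandtemplates"]
    # The expanded text may sit under "wikitext" or under "*".
    inner = expanded.get("wikitext", expanded.get("*"))
    return json.loads(inner)


# Offline stand-in for a real response; the rule shape is hypothetical.
sample_body = json.dumps(
    {"expandtemplates": {"*": json.dumps({"la": [["ā", "a"]]})}}
)

print(build_expand_url("User:Ruakh", "formatDiacriticRemovalRulesAsJson"))
print(extract_rules(sample_body))
```

Note the two layers of decoding: the API wraps the module's output in its own JSON response, so the bot parses the response first and the embedded JSON text second.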
I think this approach will be useful for other bots/people/tools/tasks as well, so I'll see about starting a Module:Json to hold utility methods for generating JSON . . .
You suggested I "write a module to convert [the data] into a more suitable format", this format being "whatever format the bot owner finds convenient to parse". I, agreeing with your criteria, chose JSON, because it's a widely supported and well-understood format, easily capable of representing anything we're likely to want it to. (And because the bot already uses a JSON parser to parse API responses with format=json.)
Oh, I see. No, JSON, despite the name, is not particularly related to JavaScript; it's just a generic serialization/storage/interchange format, rather like XML or CDF or ASN.1. It imposes a number of restrictions that JavaScript notation does not, and it's used in a lot of environments where JavaScript is not. (For example, at my job we use JSON for data used by various JUnit tests.) Even the common claim that JSON is a subset of JavaScript is technically not quite true, since there are a few characters that JSON allows in string literals but that JavaScript does not.
(The related JSONP, however, really is JavaScript-specific AFAIK.)
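Both points can be checked from outside JavaScript, for example with Python's standard json module. This is only an illustration: the helper `is_valid_json` is made up for the example, and U+2028 (LINE SEPARATOR) is assumed to be one of the characters meant above; JavaScript string literals did not accept it unescaped before ECMAScript 2019.

```python
import json

# U+2028 (LINE SEPARATOR) appears unescaped inside this JSON string.
raw = '"a\u2028b"'
value = json.loads(raw)
print(len(value))


def is_valid_json(text):
    """Return True if text parses as JSON, False otherwise."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


# Notation that JavaScript object literals accept but JSON does not.
print(is_valid_json("{'a': 1}"))   # single-quoted strings
print(is_valid_json('{"a": 1,}'))  # trailing comma
print(is_valid_json('{a: 1}'))     # unquoted key
print(is_valid_json('{"a": 1}'))   # plain JSON
```

The first part shows a JSON parser accepting the raw character; the second shows JSON rejecting several forms that JavaScript notation allows.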
So it's a bit like XML in purpose, but looks more like what you would find in a programming language?