Please change {{t-}} to behave like {{t}} when diacritics are autoremoved.

Fragment of a discussion from User talk:Rua

Re: first paragraph: The problem is that (for example) [[dubbing]] now contains {{t-|ru|дубли́рование|n|tr=dublírovanije}}, which (correctly) links to [[ru:дублирование]] but (incorrectly) presents it as a redlink.

Re: second paragraph: The documentation states only that {{}} is to be used "If there is no Wiktionary at all for that language". Judging from your bot's edit, that statement is no longer correct.

Ruakh (talk) 21:01, 31 August 2013

But I never changed the interwiki link at all. It was {{t-}} before and my bot didn't change that, yours did. So I'm not sure what you want me to do?

Oh, I see. Well, that has never actually been the practice as far as I know. {{}} is used to omit the interwiki link, which makes sense if there is no Wiktionary to link to, but also if the transliteration is a SOP term that is unlikely to have an entry on any Wiktionary. {{t-SOP}} already omitted the interwiki for that reason, so replacing it with {{}} is the most sensible option.

CodeCat 21:05, 31 August 2013

Re: "So I'm not sure what you want me to do?": As I said, I'm suggesting that you change {{t-}} to behave like {{t}} in cases where it removes diacritics, because there are now (most likely) a few thousand pages that wrongly have {{t-}} in those cases, and it will probably be a while before that situation can be repaired.

Re: second paragraph: I see you've updated the documentation now; thank you.

Ruakh (talk) 21:31, 31 August 2013

Actually, on second thought — I should be able to fix these entries by editing the bot to incorporate the diacritic-removing logic from Module:languages. That's probably not a permanent fix (I'm guessing that that logic is not very stable?), but it should at least let me undo the existing damage.

So, please disregard for now. I'll let you know if I run into problems.

Thanks,

Ruakh (talk) 00:53, 1 September 2013
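
For illustration, the diacritic-removing step on the bot side might look roughly like this. It is a minimal Python sketch, not the actual logic from Module:languages: the `REMOVAL_RULES` table and the `entry_name` function are hypothetical names, and the single Russian rule (deleting the combining acute accent, U+0301) is an assumption based on the [[dubbing]] example above.

```python
# Hypothetical per-language rules: characters to delete from the displayed
# form to obtain the page title. Not copied from Module:languages.
REMOVAL_RULES = {
    "ru": ["\u0301"],  # combining acute accent, used here as a stress mark
}


def entry_name(lang_code: str, display_form: str) -> str:
    """Return the page title for a displayed term by deleting the characters
    listed for that language. Unknown languages pass through unchanged."""
    for char in REMOVAL_RULES.get(lang_code, []):
        display_form = display_form.replace(char, "")
    return display_form


if __name__ == "__main__":
    stressed = "дубли\u0301рование"  # the stressed form from the template call
    print(entry_name("ru", stressed))
```

Deleting only the listed characters, rather than every combining mark after Unicode decomposition, avoids side effects such as turning й into и.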

The definitions in Module:languages could be changed if editors decide to, yes. Ideally, the bot would fetch the current rules before doing any work, but I don't know how feasible that is, because the bot would then need to parse the Lua code itself. On the other hand, you could write your own module that reads Module:languages and converts all the necessary information into a format your bot can easily parse and use. MewBot does something similar as well with inflection tables.

CodeCat 01:30, 1 September 2013

It would help if there were a function that could output the data in a more machine-readable format (tab-delimited?), maybe at Module:languages/display.

DTLHS (talk) 01:35, 1 September 2013
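
As a rough idea of what a bot could do with tab-delimited output, here is a minimal Python sketch. The three-column layout (language code, text to find, replacement text) and the sample rows are assumptions made up for illustration; no such output exists at Module:languages/display as far as this thread shows.

```python
# Hypothetical output: one rule per line, with language code, text to find,
# and replacement text (empty means "delete"), separated by tabs.
SAMPLE = "ru\t\u0301\t\nla\t\u0101\ta\n"


def parse_rules(text: str) -> dict[str, list[tuple[str, str]]]:
    """Group (find, replace) pairs by language code, preserving rule order."""
    rules: dict[str, list[tuple[str, str]]] = {}
    for line in text.split("\n"):
        if not line:
            continue  # skip the empty string after the final newline
        code, find, replace = line.split("\t")
        rules.setdefault(code, []).append((find, replace))
    return rules


if __name__ == "__main__":
    for code, pairs in parse_rules(SAMPLE).items():
        print(code, pairs)
```

The sketch splits on "\n" explicitly because `str.splitlines()` also breaks on other Unicode line separators, which could in principle appear inside a rule.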

Lua itself is a machine-readable format, isn't it? :) You may want to look at Module:list of languages, which is meant for generating human-readable tables. You could use it for inspiration, at least. I do think that it should go in a private module (Module:User:Rukhabot) because it's unlikely that the module would be useful to anyone else. Not that I can foresee, anyway. If someone else ends up needing it, it can always be renamed.

CodeCat 01:47, 1 September 2013

Yes, machine readable by Lua. And it's possible to just call eval with a local Lua instance. I see that there are Lua parsers for Python, which is good enough for me, but if you're using Perl you may have more trouble. All this doesn't matter anyway assuming the spacing on the page is consistent. Carry on.

DTLHS (talk) 01:51, 1 September 2013

If the format of Module:languages isn't suited to a bot, you can always write a module to convert it into a more suitable format. For example, one of the functions in such a module could generate a simple list of all languages that have the entry_name property, and list the replacement rules along with each language, in whatever format the bot owner finds convenient to parse.

CodeCat 01:56, 1 September 2013

That's a good idea! I'll give it a shot . . .

Ruakh (talk) 02:42, 1 September 2013
 

@DTLHS: Re: "it's possible to just call eval with a local Lua instance": That will work in some cases, but not in others: many of our modules refer to MW features (such as the UTF-8 support), or to other modules. I think CodeCat's suggested approach, writing a module to present the data in a tractable format, is better even for Python.

@CodeCat: That worked out pretty well. What I did was, I added a formatDiacriticRemovalRulesAsJson function to Module:User:Ruakh that formats all the diacritic-removal rules into a JSON "text", and changed my bot to call action=expandtemplates&text={{%23invoke:User:Ruakh|formatDiacriticRemovalRulesAsJson}} to retrieve it, and apply the transformations accordingly. Thank you for the suggestion!

I think this approach will be useful for other bots/people/tools/tasks as well, so I'll see about starting a Module:Json to hold utility methods for generating JSON . . .

Ruakh (talk) 17:58, 1 September 2013
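
The retrieval described above could be sketched on the bot side as follows. This is a hedged Python illustration only: the endpoint URL, the shape of the JSON payload (`SAMPLE_RULES_JSON`), and the helper names are assumptions, since the thread does not show what formatDiacriticRemovalRulesAsJson actually emits. The sketch builds the request URL and applies a made-up payload without making any network call.

```python
import json
from urllib.parse import urlencode

API_URL = "https://en.wiktionary.org/w/api.php"  # assumed endpoint
INVOKE = "{{#invoke:User:Ruakh|formatDiacriticRemovalRulesAsJson}}"

# Made-up payload; the real output of the module function may look different.
SAMPLE_RULES_JSON = '{"ru": [{"from": "\\u0301", "to": ""}]}'


def build_request_url() -> str:
    """Build the expandtemplates request; urlencode escapes '#' as %23."""
    query = urlencode(
        {"action": "expandtemplates", "format": "json", "text": INVOKE}
    )
    return f"{API_URL}?{query}"


def apply_rules(rules_json: str, lang_code: str, term: str) -> str:
    """Apply each find/replace rule for the language, in order."""
    rules = json.loads(rules_json)
    for rule in rules.get(lang_code, []):
        term = term.replace(rule["from"], rule["to"])
    return term


if __name__ == "__main__":
    print(build_request_url())
    print(apply_rules(SAMPLE_RULES_JSON, "ru", "дубли\u0301рование"))
```

Fetching the rules at the start of each run, rather than hard-coding them in the bot, means the bot picks up any later change to the rules without a code change.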

I'm not sure if I understand. What did you use JSON for?

CodeCat 17:59, 1 September 2013
Edited by author.
Last edit: 18:11, 1 September 2013

You suggested I "write a module to convert [the data] into a more suitable format", this format being "whatever format the bot owner finds convenient to parse". I, agreeing with your criteria, chose JSON, because it's a widely-supported and well-understood format, easily capable of representing anything we're likely to want it to. (And because the bot already uses a JSON parser to parse API responses with format=json.)

Ruakh (talk) 18:10, 1 September 2013

Oh, I thought you used JavaScript or something...

CodeCat 18:11, 1 September 2013

Oh, I see. No, JSON, despite the name, is not particularly related to JavaScript; it's just a generic serialization/storage/interchange format, rather like XML or CDF or ASN.1. It imposes a number of restrictions that JavaScript notation does not, and it's used in a lot of environments where JavaScript is not. (For example, at my job we use JSON for data used by various JUnit tests.) Even the common claim that JSON is a subset of JavaScript is technically not quite true, since there are a few characters that JSON allows in string literals but that JavaScript does not.

(The related JSONP, however, really is JavaScript-specific AFAIK.)

Ruakh (talk) 00:39, 2 September 2013
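
The characters in question are presumably U+2028 (LINE SEPARATOR) and U+2029 (PARAGRAPH SEPARATOR), which JSON permits unescaped inside strings but which JavaScript of that period treated as line terminators. A small Python sketch, using only the standard `json` module, shows that a JSON parser accepts such a string and how a serializer can sidestep the problem by escaping; the variable names are illustrative.

```python
import json

# A JSON text whose string value contains a raw (unescaped) U+2028.
json_text = '"line\u2028separator"'

# Accepted: JSON allows U+2028 unescaped inside a string.
value = json.loads(json_text)

# With the default ensure_ascii=True, Python writes the character as a
# \u escape, which is safe to embed in JavaScript source as well.
escaped = json.dumps(value)

if __name__ == "__main__":
    print(len(value))
    print(escaped)
```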

So it's a bit like XML in purpose, but looks more like what you would find in a programming language?

CodeCat 00:41, 2 September 2013

Yup, exactly. (A lot of people prefer it because it's more terse, or because they find its structure and semantics more intuitive, or more compatible (technically or philosophically) with languages they prefer, such as JavaScript/Perl/Python/PHP/Ruby. But of course this is all subjective.)

Ruakh (talk) 01:23, 2 September 2013