Talk:아대륙

Transliteration with hanja in brackets is gone when a word is boldfaced


Hi.

As you can see below, the first example is not working.

Please look at these examples:

인도 아대륙(印度亞大陸) ― Indo adaeryuk ― Indian subcontinent
인도 아대륙 ― Indo adaeryuk ― Indian subcontinent
인도 아대륙(印度亞大陸) ― Indo adaeryuk ― Indian subcontinent

Hopefully, it's easy to fix.

BTW, I am also unable to combine capitalisation and hyperlinking, e.g. like this: ^인도. Anatoli T. (обсудить/вклад) 22:21, 21 March 2023 (UTC)

@Atitarev Unfortunately, this is non-trivial to solve. @Benwing2 This is related to the fact that strings are fed through the transliteration (and other substitution) modules in chunks, so as to avoid sending PUA characters through the substitution modules. These PUA characters are stand-ins for formatting, to prevent it from being messed around with. Unfortunately, this means the Korean transliteration can't know that the hangul in bold is followed by hanja in brackets. Ideally, we would be able to have some way of sending the full text through in one go, while somehow ensuring that the PUA characters are ignored by any substitution modules. That's not easy to do at all. Theknightwho (talk) 22:28, 21 March 2023 (UTC)

@Theknightwho: Thanks for the response. Should we just make sure no highlighting is done? Anatoli T. (обсудить/вклад) 22:31, 21 March 2023 (UTC)

@Theknightwho Why don't you just have a language-specific flag to indicate whether to send everything through in one chunk? I don't understand why the PUA chars are there at all; it seems very strange to me and needs explanation, but there's no reason why you can't conditionalize like this. Benwing2 (talk) 22:55, 21 March 2023 (UTC)

@Benwing2 One of the aims is to allow substitution modules to be simpler and more uniform, by removing the need to handle formatting altogether (i.e. external links, bold, italics etc). The PUA characters are just there as markers, so that the formatting can get put back again at the end. This also allows Module:languages itself to do its thing without them being disrupted at all, because it's a way of cutting out the need to continually tell operations to avoid these specific patterns. For example, {{l|de|'''w:de:Deutschland'''}} will correctly display as Deutschland, with the wikilink prefix still being correctly detected. Even inputs like {{l|de|w:'''de:Deutschland'''}} still work fine: Deutschland. This also applies to checks for a bunch of other things, like unsupported titles etc. e.g. {{l|mul|'''['''}} still links properly: [. This also makes it trivial to ensure that proper formatting is retained during transliteration etc.

It would probably be quite low risk to send these PUA chars through the substitution modules, but it would mean needing to update a bunch of them to handle them properly (like Module:ko-translit in this case). I feel like it would be better to have a solution that avoided the need for that, as this will inevitably come up again in the future with some other language. Theknightwho (talk) 00:03, 22 March 2023 (UTC)
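A minimal sketch of the placeholder scheme described above may help readers who have not seen it. It is written in Python purely for illustration and is not the code of Module:languages; the marker code points, the function names and the restriction to bold markup are all assumptions. Formatting is swapped for PUA markers, the text between markers is handed to the substitution function one chunk at a time, and the formatting is restored at the end.

```python
import re

# Hypothetical PUA stand-ins for opening and closing bold markup.
BOLD_OPEN, BOLD_CLOSE = "\uE000", "\uE001"


def protect(text):
    """Replace each '''...''' pair with PUA markers around its content."""
    return re.sub(r"'''(.*?)'''", BOLD_OPEN + r"\1" + BOLD_CLOSE, text)


def restore(text):
    """Turn the PUA markers back into bold markup."""
    return text.replace(BOLD_OPEN, "'''").replace(BOLD_CLOSE, "'''")


def substitute_in_chunks(text, substitute):
    """Apply `substitute` to each run of text between markers.

    The markers themselves are never passed to `substitute`.
    """
    chunks = re.split(f"([{BOLD_OPEN}{BOLD_CLOSE}])", text)
    return "".join(
        c if c in (BOLD_OPEN, BOLD_CLOSE) else substitute(c) for c in chunks
    )


def process(text, substitute):
    """Protect formatting, substitute chunk by chunk, then restore formatting."""
    return restore(substitute_in_chunks(protect(text), substitute))


if __name__ == "__main__":
    seen = []
    process("인도 '''아대륙'''(印度亞大陸)", lambda s: seen.append(s) or s)
    # Each chunk reaches the substitution step without its neighbours.
    print(seen)
```

In this sketch the bracketed hanja arrive as a chunk of their own, so the substitution step cannot see the hangul that precedes them, which is the limitation discussed in this thread.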

@Theknightwho I see, this use of PUA chars makes sense but I still don't see why you're objecting to a lang-specific flag that either sends everything through raw (without the use of PUA chars) or sends everything through with the PUA chars but in a single chunk. The Korean transliteration module seems to already know how to handle boldface correctly, so you may as well not try to out-clever it and just send the raw translit. Benwing2 (talk) 00:07, 22 March 2023 (UTC)

@Theknightwho Ping. Benwing2 (talk) 00:07, 22 March 2023 (UTC)

@Benwing2 Sorry - I should've been clearer: I think it's a good solution in the short-term, but it would be good to have a way to avoid needing to do this. Doing this for Korean will necessitate flagging Early Modern Korean, Jeju, Middle Korean and (maybe) Old Korean, and those are just the ones I can think of off the top of my head. I'd like to move away from flagging specific langs wherever possible, as they inevitably cause smaller languages to get forgotten. Plus, if this scenario comes up again, we won't know about it until someone complains.

All we need to do with Module:ko-translit is to tell it to ignore PUA characters as well. That way, it'll be able to handle weirder scenarios like <span> tags, too. Theknightwho (talk) 00:16, 22 March 2023 (UTC)

@Theknightwho Sure, but I don't see an obvious way of doing this properly; sometimes the perfect is the enemy of the good. Benwing2 (talk) 00:17, 22 March 2023 (UTC)

@Benwing2 Yeah, you're right. I've had a couple of ideas, but it's a nontrivial problem. It'll probably just come down to having some kind of standardised layout for substitution modules that includes a way of accounting for these as part of the package. Theknightwho (talk) 00:24, 22 March 2023 (UTC)

@Atitarev @Benwing2 Good news - in this particular case, I managed to think of a different way to solve the issue. Turns out Module:ko-translit was still removing the hanja, but was returning nil because it was left with the empty string. I've refactored it so that the empty string doesn't cause a transliteration failure, as it can be safely left up to Module:languages to handle it, because it knows whether it's part of a longer translit or not. If any hanja don't get removed, Module:languages will know and still return nil, so it was a redundant check anyway. Theknightwho (talk) 00:52, 22 March 2023 (UTC)
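The fix described above can be illustrated with a small Python sketch. It is not the actual code of Module:ko-translit or Module:languages: the romanisation table, the function names and the `fail_on_empty` switch are hypothetical, and `None` stands in for Lua's nil. The point is only that a chunk which becomes empty once its hanja are removed should not fail the whole transliteration, because the caller can still check the joined result for leftover hanja.

```python
import re

# Hypothetical toy romanisation table; the real module computes romanisation.
TOY_ROMANISATION = {"인도": "indo", "아대륙": "adaeryuk"}

HANJA = re.compile(r"[\u4E00-\u9FFF]")
BRACKETED_HANJA = re.compile(r"\([\u4E00-\u9FFF]+\)")


def translit_chunk(chunk, fail_on_empty):
    """Transliterate one chunk; None signals failure (like nil in Lua)."""
    text = BRACKETED_HANJA.sub("", chunk)
    if fail_on_empty and text == "":
        # Old behaviour in this sketch: an empty result counts as a failure.
        return None
    for hangul, roman in TOY_ROMANISATION.items():
        text = text.replace(hangul, roman)
    return text


def translit_all(chunks, fail_on_empty):
    """Caller-side logic: join the chunks, failing if any chunk failed
    or if any hanja survived in the joined result."""
    results = [translit_chunk(c, fail_on_empty) for c in chunks]
    if any(r is None for r in results):
        return None
    joined = "".join(results)
    if HANJA.search(joined):
        return None
    return joined


if __name__ == "__main__":
    chunks = ["인도 ", "아대륙", "(印度亞大陸)"]
    print(translit_all(chunks, fail_on_empty=True))   # whole translit fails
    print(translit_all(chunks, fail_on_empty=False))  # empty chunk tolerated
```

With the empty-string check removed, the caller's own check for leftover hanja is the only one needed, which matches the observation that the check in the transliteration step was redundant.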

@Theknightwho, Benwing2, Fish bowl: Yay! Thanks, Theknightwho! Sorry for making you work hard on this and thanks to Benwing2 for the suggestions. Anatoli T. (обсудить/вклад) 00:55, 22 March 2023 (UTC)

@Atitarev @Fish bowl No worries. More generally, would it be worth switching to the // format? 아대륙／亞大陸 (adaeryuk) feels more consistent with other languages, and is easier to deal with from a technical perspective. Theknightwho (talk) 01:10, 22 March 2023 (UTC)

@Theknightwho: Perhaps see Wiktionary talk:About Korean#Presentation of hanja in links. —Fish bowl (talk) 01:11, 22 March 2023 (UTC)

@Theknightwho: The presentation form for Korean hanja is always this way: 아대륙(亞大陸) (adaeryuk) i.e. hangeul(hanja). There may be some trick on the wiki-input, I don't know. Anatoli T. (обсудить/вклад) 01:13, 22 March 2023 (UTC)

@Atitarev @Fish bowl I think it should be possible to vary the presentation, even if the underlying format was still with //. This would still work fine in usexes:

인도 아대륙／印度亞大陸 (indo adaeryuk) could be displayed with brackets, instead.

I would need to enable the // syntax for the usex template, as at the moment it only works in link templates. The main reason for standardising the syntax is because it makes it much more straightforward for the modules to handle inputs with multiple forms. How it's then displayed is a matter of preference. Theknightwho (talk) 01:18, 22 March 2023 (UTC)

Please feel free to experiment @Theknightwho. Anatoli T. (обсудить/вклад) 01:22, 22 March 2023 (UTC)

@Benwing2 Just as an FYI re the above. I think this could be folded into the functionality of generate_forms. We could have a default layout (form1 (translit1)／form2 (translit2) etc.), which can then be adjusted by a generate_forms module to meet language-specific needs (such as only displaying one translit, as with the Chinese lects). For Korean, the output would use brackets, and so on. Theknightwho (talk) 01:31, 22 March 2023 (UTC)

While you're at it, please see if the symbols ^ and - can still be used effectively with or without brackets, boldface or italics. My original examples don't have hyperlinks, since they don't work with ^. It did work at some stage (not related to previous edits). 인도 아대륙(印度亞大陸) (indo adaeryuk) Anatoli T. (обсудить/вклад) 01:41, 22 March 2023 (UTC)

@Theknightwho In case you didn't see the link I posted above: What would something like 공(空)하다 (gonghada) be in the formatting you propose? —Fish bowl (talk) 01:39, 22 March 2023 (UTC)

I can imagine the input will be something like 공／空하다 (gong) — This unsigned comment was added by Atitarev (talk • contribs).

@Fish bowl I'd need to think about it. How would 공하다 typically be written in a hanja text? Does the suffix still use hangul? Theknightwho (talk) 01:53, 22 March 2023 (UTC)

Ah okay - if that's the case, then I'd suggest {{l|ko|공하다//空하다}}. The generate_forms module would be able to recognise the identical suffixes, and handle the output accordingly. Doing it the way you suggest would be more difficult, because it would have no way of knowing that 하다 is part of both terms (and therefore wouldn't know not to put the whole of the second term in brackets). Theknightwho (talk) 01:55, 22 March 2023 (UTC)
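As a rough illustration of the suffix handling suggested above, here is a Python sketch. It is not an existing generate_forms implementation: the function names are hypothetical, and it assumes the input holds exactly one hangul form and one hanja form separated by //. It finds the longest suffix shared by the two forms and puts only the differing hanja stem in brackets.

```python
def bracket_form(hangul, hanja):
    """Combine a hangul form and its hanja form as stem(hanja-stem)suffix."""
    # Count how many trailing characters the two forms share.
    n = 0
    while n < min(len(hangul), len(hanja)) and hangul[-1 - n] == hanja[-1 - n]:
        n += 1
    stem_hangul = hangul[: len(hangul) - n]
    stem_hanja = hanja[: len(hanja) - n]
    suffix = hangul[len(hangul) - n :]
    if not stem_hanja:
        # Nothing distinct to show in brackets.
        return hangul
    return f"{stem_hangul}({stem_hanja}){suffix}"


def display(term):
    """Render input of the form 'hangul//hanja' in the bracketed style."""
    hangul, sep, hanja = term.partition("//")
    if not sep:
        return hangul
    return bracket_form(hangul, hanja)


if __name__ == "__main__":
    print(display("공하다//空하다"))
    print(display("아대륙//亞大陸"))
```

This sketch only looks at a shared suffix. Terms with a shared prefix, or with hangul in the middle of the hanja form, would need more careful alignment.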
