User talk:Atitarev/Thai translit test cases

Manchester in Thai

@Alifshinobi, @Octahedron80: Hi,

Please don't define the term แมนเช็สเตอร์ (mɛɛn-chés-dtəə, Manchester) just yet (leave it red), but please check whether the respelling is right. I will need to make more test cases for the transliteration modules, re: Wiktionary:Grease_pit/2023/November#Inline_for_Template:th-usex_and_Template:km-usex. Anatoli T. (обсудить/вклад) 09:15, 23 November 2023 (UTC)

The word must be แมนเชสเตอร์, dude. Read as แมน-เช้ด-เต้อ or แมน-เช้ส-เต้อ. --Octahedron80 (talk) 09:30, 23 November 2023 (UTC)

@Octahedron80 Should the vowel in เช้ด be short? --A.S. (talk) 16:04, 23 November 2023 (UTC)
@Octahedron80, @Alifshinobi: Thank you, both! Indeed, should the second vowel be short even when the word is spelled "แมนเชสเตอร์" (the more common spelling)?
The spelling "แมนเช็สเตอร์" is from the textbook, so I left it unchanged. From the suggestions I gather that ส can be read as either "t" or "s". The reader pronounced "s", actually something like "man-chés-dtə̂ə" to my ear. Anatoli T. (обсудить/вклад) 22:04, 23 November 2023 (UTC)
I always hear from media that เช้ด is long. Anyway, Manchester is actually read män-chis-ter (UK) but Thais don't follow that. --Octahedron80 (talk) 01:57, 24 November 2023 (UTC)

Test cases

@Theknightwho, @Benwing2: Hi. I made the test cases here. It may not be the most appropriate or visible place. Let me know if you want them moved or reformatted.

Since the examples are longer and {{m}} is failing with Thai sentences, I made separate sections. To make it clear: each line has its desired output on the corresponding line of the other section, e.g.

  1. สวัสดี ค่ะ คุณ ชื่อ อะไร คะ (test case) = สวัสดีค่ะ คุณชื่ออะไรคะ(sà-wàt-dii kâ · kun chʉ̂ʉ à-rai ká) (expected)

Etc. Hope it makes sense. Anatoli T. (обсудить/вклад) 01:14, 24 November 2023 (UTC)

@Atitarev @Theknightwho Yes this seems to make sense. I expect there will have to be the following:
  1. A translit method that scrapes the translits of the individual space-separated words. Single spaces map to single spaces but double spaces map to a center dot with spaces around it.
  2. A makeEntryName method that removes single spaces but links the space-separated components individually, and converts double spaces to single spaces.
  3. A display method that essentially does the same thing as #2.
@Theknightwho Can you help me understand what sort of transformations happen before calling the above three methods? Does the chop-it-up functionality apply to all of them or only to transliteration? If I disable the chop-it-up behavior, why do private-use characters get sent to the translit method instead of just sending the raw input? What do these private-use characters stand for? Thanks.
Benwing2 (talk) 03:44, 24 November 2023 (UTC)
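The space-handling rules in the list above could be sketched roughly as follows. This is only a Python illustration, not the actual Lua module code: the `TRANSLITS` table is a hypothetical stand-in for scraping transliterations from the individual entries, the function names are assumptions, and the per-component linking mentioned for makeEntryName is not modeled.

```python
import re

# Hypothetical lookup standing in for transliterations scraped from entries.
TRANSLITS = {
    "สวัสดี": "sà-wàt-dii",
    "ค่ะ": "kâ",
    "คุณ": "kun",
    "ชื่อ": "chʉ̂ʉ",
    "อะไร": "à-rai",
    "คะ": "ká",
}


def split_phrases(text):
    """Split on runs of two or more spaces, then each phrase on single spaces."""
    phrases = re.split(r" {2,}", text.strip())
    return [phrase.split(" ") for phrase in phrases if phrase]


def translit(text, lookup=TRANSLITS):
    """Single spaces stay single spaces; double spaces become ' · '."""
    return " · ".join(
        " ".join(lookup[word] for word in words) for words in split_phrases(text)
    )


def display(text):
    """Remove single spaces; convert double spaces to single spaces."""
    return " ".join("".join(words) for words in split_phrases(text))


# Assumed to behave like display for the purposes of this sketch.
make_entry_name = display

if __name__ == "__main__":
    sample = "สวัสดี ค่ะ  คุณ ชื่อ อะไร คะ"  # note the double space after ค่ะ
    print(display(sample))
    print(translit(sample))
```

Splitting on double spaces first keeps the phrase boundaries available to all three methods, so they cannot disagree about where the center dot or the displayed space belongs.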
@Benwing2 It applies to all of them. The private use characters represent formatting that shouldn't be touched, so they'll be standing in for things like [[ or '''. Theknightwho (talk) 03:47, 24 November 2023 (UTC)
@Benwing2, @Theknightwho: Please don't forget to handle {} for respelling (if this method is confirmed to be used): แมนเช็สเตอร์{แมน-เช็ส-เตอ} = แมนเช็สเตอร์ (mɛɛn-chés-dtəə). Hopefully, it can be applied to Mandarin {pinyin} et al. as well. Anatoli T. (обсудить/вклад) 03:51, 24 November 2023 (UTC)
@Theknightwho The problem is that these methods may need to know where the links are, and handle them specially. So having the left and right (square) brackets obscured by private-use characters will be inconvenient. The same goes for single braces, if they are also converted to private-use chars. Benwing2 (talk) 04:57, 24 November 2023 (UTC)
@Atitarev My plan is to use a slightly different format for single braces, so instead of writing แมนเช็สเตอร์{แมน-เช็ส-เตอ} you'd write {แมนเช็สเตอร์/แมน-เช็ส-เตอ}, similar to the current |subst= mechanism. This adds one character to the required typing, but it means the code doesn't need to guess where the substituted segment starts. Benwing2 (talk) 04:59, 24 November 2023 (UTC)
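To illustrate why the {written/respelling} form is easy to parse, here is a minimal Python sketch (not the module's Lua code). The pattern, the function name `split_respelling` and the returned pair are all assumptions made for the illustration.

```python
import re

# Matches {written/respelling}; both delimiters are explicit, so no guessing
# about where the substituted segment starts is needed.
SUBST = re.compile(r"\{([^{}/]+)/([^{}]*)\}")


def split_respelling(text):
    """Return (written form, phonetic form) for text with {a/b} segments."""
    written = SUBST.sub(lambda m: m.group(1), text)
    phonetic = SUBST.sub(lambda m: m.group(2), text)
    return written, phonetic


if __name__ == "__main__":
    written, phonetic = split_respelling("{แมนเช็สเตอร์/แมน-เช็ส-เตอ}")
    print(written)   # the form to display and link
    print(phonetic)  # the form to hand to transliteration
```

With the older trailing form แมนเช็สเตอร์{แมน-เช็ส-เตอ}, the code would instead have to work out how far back the written segment extends.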
@Benwing2: OK, thanks. I've changed the cases in the page accordingly. Anatoli T. (обсудить/вклад) 05:04, 24 November 2023 (UTC)
@Atitarev Your test cases need to be modified slightly so as to include the braces as well as the slash. You can put a space on the outside of the single braces as necessary to avoid three braces in a row; this extra space will be ignored by the translit and other methods mentioned above. Benwing2 (talk) 05:18, 24 November 2023 (UTC)
@Benwing2: Did I put it right just now?
I've also added a case with the duplication character ๆ, which is surrounded by spaces in standard orthography. It can duplicate a syllable, a phrase or a whole sentence, so it may cause some translit issues; we probably need to mark the start and end of the portion to be duplicated. Anatoli T. (обсудить/вклад) 05:28, 24 November 2023 (UTC)
@Atitarev Hmm. I can add support for the duplication character, but in general how do you know how much to duplicate? Is it contextual? What I can do is something like this: if there's a duplication character, by default it gets rendered in translit by copying the translit of the most recent scraped segment (i.e. the space-delimited segment before the duplication character). If something else needs to be duplicated, use a syntax like this: {ๆ/...} where the ... represents a respelling of the stuff to be duplicated. Benwing2 (talk) 06:02, 24 November 2023 (UTC)
@Benwing2: The default behaviour is to repeat the last defined word or syllable (it checks for spaces).
Even the current module, which is used by {{th-x}}, doesn't know about cases where ๆ is supposed to repeat more than the last word: ไปแล้ว (bpai lɛ́ɛo · lɛ́ɛo) is incorrect (because it doesn't know). It should give "bpai lɛ́ɛo · bpai lɛ́ɛo". The last two words (syllables) are to be reduplicated. I'll make a case with {ๆ/...}, but the current modules can't handle it correctly, so "ไป แล้ว ๆ" should be respelled like this by the old template: ไปแล้ว ไปแล้ว(bpai lɛ́ɛo · bpai lɛ́ɛo) (BTW, the template {{th-xi}} needs a space before a bracket). Anatoli T. (обсудить/вклад) 06:10, 24 November 2023 (UTC)
@Benwing2: Yeah, it's sort of contextual. No way a module will be able to figure it out. It's OK to default to the last word, but if more needs to be reduplicated, let's use {ๆ/...}. Anatoli T. (обсудить/вклад) 06:12, 24 November 2023 (UTC)
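The agreed behaviour (default to the last space-delimited segment, with {ๆ/...} as an explicit override) might look roughly like this. It is a Python sketch under stated assumptions, not the module code: `LOOKUP` is a hypothetical stand-in for scraped transliterations, the function name is made up, and the center-dot separator between repetitions is not modeled.

```python
import re

MAI_YAMOK = "ๆ"  # the Thai duplication character

# Explicit override: {ๆ/respelling of the portion to duplicate}
OVERRIDE = re.compile(r"\{" + re.escape(MAI_YAMOK) + r"/([^{}]*)\}")

# Hypothetical stand-in for transliterations scraped from entries.
LOOKUP = {"ไป": "bpai", "แล้ว": "lɛ́ɛo"}


def translit_with_repetition(text, lookup=LOOKUP):
    """Transliterate space-separated words, expanding the duplication mark."""
    # Replace each override with its respelling before splitting into words.
    expanded = OVERRIDE.sub(lambda m: m.group(1), text)
    out = []
    for word in expanded.split():
        if word == MAI_YAMOK:
            # Default: copy the translit of the most recent segment, if any.
            if out:
                out.append(out[-1])
        else:
            out.append(lookup[word])
    return " ".join(out)


if __name__ == "__main__":
    print(translit_with_repetition("ไป แล้ว ๆ"))            # default rule
    print(translit_with_repetition("ไป แล้ว {ๆ/ไป แล้ว} "))  # explicit override
```

The default rule repeats only the last word, which is the wrong reading for this particular phrase; the override form repeats both words, as requested above.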
@Atitarev What do you mean by "the template {{th-xi}} needs a space before a bracket"? Can you give an example? Benwing2 (talk) 06:17, 24 November 2023 (UTC)
@Benwing2: The output needs a space before the opening bracket "(", just before the start of the transliteration. Not "...ๆ(bpai ..." but "...ๆ (bpai ..." Anatoli T. (обсудить/вклад) 06:22, 24 November 2023 (UTC)
@Atitarev Ahh, I see, you're saying there's a bug in the current {{th-xi}} implementation. Benwing2 (talk) 06:58, 24 November 2023 (UTC)
@Atitarev As for your more complex case, you need to write it something like this:
{{m|th|ไป แล้ว {ๆ/ไป แล้ว} }}
Here, what goes inside the inner braces after the slash is the Thai-script phonetic respelling of the duplicated term, which in this case appears to be the same as the written form. This will cause it to display and link using the ๆ symbol but transliterate using the transliteration of the respelling "ไป แล้ว". If you want to be able to write the source written form on the right side of the slash and have it be scraped, this may be possible but would require some additional syntax and logic. Benwing2 (talk) 07:05, 24 November 2023 (UTC)
@Benwing2: That's what I meant, thanks (re bug). I've made a change.
I'll leave this with you guys. When the work starts, I can add more cases or improve/clarify existing ones. Anatoli T. (обсудить/вклад) 07:52, 24 November 2023 (UTC)