Module talk:ja

test cases: Module talk:ja/testcases

Thanks ZxxZxxZ! Wyang (talk) 04:13, 11 April 2013 (UTC)

no problem, I hope it works correctly since I don't know Japanese and can't test it. --Z 04:19, 11 April 2013 (UTC)
I noticed this: ["ヴ"]="vye",["フ"]. It's incorrect, ヴ is "vu" but becomes "v" with any small vowel letter (ヴャ - vya), same with "フ" (fu). --Anatoli (обсудить/вклад) 04:23, 11 April 2013 (UTC)
That's not the actual conversion. In the function kata_to_romaji(f) the three character string is detected as "'([ヴフ])ィェ'"; eg. ツュチフィェ -> tsyuchifye (a hypothetical sequence of course). Wyang (talk) 04:26, 11 April 2013 (UTC)
I haven't changed because it's not clear how "kr3" function is used. Perhaps "ィェ" should be "ye" but "ヴ" and "フ" just "vu" and "fu". "ヴ" and "フ" become "v" and "f" in front of small vowel letters ァ, ィ, ェ and ォ (and those with "y"). The same is true for a few other letters, like ツ, デ, ト, ド, etc., which are sometimes used in loanwords. --Anatoli (обсудить/вклад) 04:37, 11 April 2013 (UTC)
kr3 is used only if ヴ/フ is followed by ィェ. ヴ is still -> "vu". Wyang (talk) 04:40, 11 April 2013 (UTC)
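The ordering described above, where three-kana sequences are tried before shorter ones, can be sketched roughly as follows. The names kr3, kr2 and kr1 follow the discussion, but their contents here are a tiny illustrative subset and not the module's real data; the sketch also uses the plain Lua string library, whereas the module itself would presumably use mw.ustring.

```lua
-- Illustrative subset only; the real tables live in the module.
-- Each pass is an ordered list of {katakana, romaji} pairs.
local kr3 = { {"ヴィェ", "vye"}, {"フィェ", "fye"} }
local kr2 = { {"ツュ", "tsyu"} }
local kr1 = { {"ヴ", "vu"}, {"フ", "fu"}, {"ツ", "tsu"}, {"チ", "chi"} }

local function kata_to_romaji(text)
  -- Longest sequences first, so フィェ is consumed before フ alone.
  for _, pass in ipairs({ kr3, kr2, kr1 }) do
    for _, pair in ipairs(pass) do
      -- Kana are multi-byte UTF-8 with no pattern magic characters,
      -- so they can be used directly as literal gsub patterns.
      text = text:gsub(pair[1], pair[2])
    end
  end
  return text
end

print(kata_to_romaji("ツュチフィェ")) --> tsyuchifye
print(kata_to_romaji("ヴ"))           --> vu
```

Because the three-kana pass runs first, a lone ヴ or フ still falls through to the single-kana pass and comes out as "vu" or "fu".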

Moved from User talk:ZxxZxxZ

Hi. Thanks for your help there. There is one more debugging request: for function M.romaji_to_kata(f), I want to have the string replaced using rk4, rk3, rk2, rk1 sequentially, like the previous one. When I invoke it, however, "tsyuchifye" should generate "ツュチフィェ" but it instead now generates "ツュチフィェ". Could you please take a look and see where it went wrong? Thanks. Wyang (talk) 04:48, 11 April 2013 (UTC)

NP, I just took a look at how the Japanese writing system works to see if there are better ways to convert terms. One thing that I noted is that romaji looks to be irreversible (for example, wi may be both ヰ and ウィ), so is it really possible to convert from romaji to katakana? --Z 05:13, 11 April 2013 (UTC)
Ah, I should have removed these one-to-many ones (also zu). It is convertible. ヰ is obsolete in modern Japanese, so wi should be mapped to ウィ. Similarly zu should correspond to ズ, not ヅ. I suppose there are alternative ways of writing this, by analysing what follows the consonant. It definitely requires more work; don't know if that would work though. Wyang (talk) 05:21, 11 April 2013 (UTC)
It's irreversible, unfortunately, or at least may not be very accurate. People can usually bear with "おう" being converted to "ou" in words where it should be "ō", but the other way around is worse. We don't romanise "東京" (とうきょう) as "toukyou" but as "Tōkyō", yet "大きい" (おおきい) is also romanised with "ō", as "ōkii". The letter "ヅ" can be romanised as "dzu" to distinguish it from "ズ" (so it's used when typing), but usually it's "zu". --Anatoli (обсудить/вклад) 05:31, 11 April 2013 (UTC)
Katakana/hiragana to romaji would be useful to create romaji transliteration and romaji entries, so would katakana to hiragana (to build sorting keys in categories). Not sure about hiragana to katakana but most animals, onomatopoeia, etc. have variant spellings in katakana. --Anatoli (обсудить/вклад) 05:36, 11 April 2013 (UTC)
This shouldn't pose a difficulty, if the algorithm is: 1) de-macron, "ō" -> "ou", 2) "to" -> "と", 3) "o" -> "お". Wyang (talk) 05:40, 11 April 2013 (UTC)
You probably missed the section about "ōkii", it's おおきい (ookii), not おうきい (oukii). I don't understand what you meant by 2) "to" -> "と", 3) "o" -> "お". --Anatoli (обсудить/вклад) 05:46, 11 April 2013 (UTC)
I see what you mean. Macron 'o' is essentially a conflation of the combinations 'oo' and 'ou'. There would be no ambiguity if "ō" is disallowed in the input from the beginning (or if not disallowed, set to 'ou' by default unless specified, as 'ou' from Sino-Japanese words would greatly outnumber 'oo', which is mainly of native origin). Wyang (talk) 06:03, 11 April 2013 (UTC)
What I mean is, when this is used at romaji entries: Tōkyō may be used with {{ja-romaji}} with no specifications (as ō is by default 'ou') and this produces とうきょう, but ōkii has to have {{ja-romaji|rom=ookii}} or {{ja-romaji|hira=おおきい}} to limit it to 'oo'. Wyang (talk) 06:08, 11 April 2013 (UTC)
(before edit conflict) That's just how it is, the standard is to use "ō" here and many publications. Notable exceptions: the particles "は" (letter "ha") "へ" (letter "he") are transliterated as "wa" and "e", letter "を" is transliterated as "o", not "wo" (in any position). The tool can still be useful if the transliteration standard is not changed but will require manual override. Archaic letters can be ignored in romaji-kana conversion.
(after edit conflict) I see what you mean. we could have additional params for back translit but it remains to be seen where and how these modules are used, so that an adjustment or a collective decision could be made. We didn't use automatic transliteration before, so... --Anatoli (обсудить/вклад) 06:17, 11 April 2013 (UTC)
BTW, I don't know if a conversion table is necessary for ACCEL creation of JA entries (as the script like Template:ja new (Japanese version of Template:cmn new) will not be using Lua). But if you do, we could agree to type romaji like "Toukyou" and "ookii", so that the conversion to hiragana happened correctly. --Anatoli (обсудить/вклад) 06:31, 11 April 2013 (UTC)
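The default proposed in this thread (treat "ō" as 'ou' unless the editor says otherwise) might look something like the sketch below. The function name expand_macrons and its override argument are hypothetical, standing in for a manual parameter such as rom=ookii; only "ō" is handled, since that is the only macron vowel the discussion covers.

```lua
-- Hypothetical helper: expand a macron romaji string into the "typed"
-- form (Toukyou, ookii) that can be mapped to kana unambiguously.
-- `override` stands in for a manual parameter such as rom=ookii.
local function expand_macrons(romaji, override)
  if override then
    return override
  end
  -- Default assumption from the discussion: ō means 'ou'.
  local expanded = romaji:gsub("ō", "ou"):gsub("Ō", "Ou")
  return expanded
end

print(expand_macrons("Tōkyō"))          --> Toukyou
print(expand_macrons("ōkii"))           --> oukii  (wrong without an override)
print(expand_macrons("ōkii", "ookii"))  --> ookii
```

The second call shows why the override is needed: without it, native words such as おおきい would be expanded to the wrong kana.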

to do

1)

Needs to convert string-final "n" to ン in kana_to_romaji(f). I added

if mw.ustring.sub(text,mw.ustring.len(text),mw.ustring.len(text)) == 'n' then text = (mw.ustring.sub(text,1,mw.ustring.len(text)-1) .. "ン") end

but it didn't work. Wyang (talk) 21:17, 11 April 2013 (UTC)

2) hidx

3) geminate consonants (done) Wyang (talk) 04:14, 12 April 2013 (UTC)

1) You mean when it is at the end of the word it should be ン? --Z 04:52, 12 April 2013 (UTC)
Yes. See the testcases. shinkansen is not converted correctly. I think converting final 'n' to ン prior to list conversion would solve that problem. Wyang (talk) 04:55, 12 April 2013 (UTC)
I'm not sure what the exact problem is with ン but ン is ALWAYS "n", also in front of ナ, ニ, etc. It gets an apostrophe ' in front of any vowel (large) - 遠泳 (えんえい = en'ei). Small ones are not used after ン and we don't ever romanise it as m, ng. --Anatoli (обсудить/вклад) 05:01, 12 April 2013 (UTC)
See Module talk:ja/testcases. shinkansen -> シンカンsエン. I guess it's because 'en' is converted first and there is nothing to convert the remaining 's' to. Although converting the final 'n' to ン first doesn't seem to work either. Wyang (talk) 05:06, 12 April 2013 (UTC)
n/s case fixed. --Z 05:23, 12 April 2013 (UTC)

Thanks! Looks like everything listed has been done now (1,2,3). Wyang (talk) 05:25, 12 April 2013 (UTC)
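For reference, one way to avoid the シンカンsエン failure guessed at above ('en' being converted before 'se') is to scan the string left to right and take the longest match at each position, with a bare "n" as the last resort. This is only a sketch under that assumption, not necessarily the fix that was applied; the table rk is an illustrative subset and not the module's data.

```lua
-- Illustrative subset of a romaji-to-katakana table (not the module's data).
local rk = {
  shi = "シ", ka = "カ", se = "セ", na = "ナ",
  n = "ン",  -- bare n, reached only when no longer syllable matches
}
local MAX_LEN = 3  -- length of the longest key in rk

local function romaji_to_kata(text)
  local out, i = {}, 1
  while i <= #text do
    local matched = false
    -- Try the longest candidate first, never reading past the end.
    for len = math.min(MAX_LEN, #text - i + 1), 1, -1 do
      local kana = rk[text:sub(i, i + len - 1)]
      if kana then
        out[#out + 1] = kana
        i = i + len
        matched = true
        break
      end
    end
    if not matched then
      -- Unknown character: copy it through unchanged.
      out[#out + 1] = text:sub(i, i)
      i = i + 1
    end
  end
  return table.concat(out)
end

print(romaji_to_kata("shinkansen")) --> シンカンセン
print(romaji_to_kata("kana"))       --> カナ
```

Because matching always starts at the current position, the "n" in "shinkansen" is only turned into ン after "nka" and "nk" have failed to match, and the following "se" stays intact.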

Documentation

Can somebody please write the documentation? I would, but I don't know how everything in it works. Please, we need to be careful to document our modules so others can use them more easily. —Μετάknowledgediscuss/deeds 04:51, 15 April 2013 (UTC)

Excellent. Thank you! —Μετάknowledgediscuss/deeds 20:16, 15 April 2013 (UTC)

romanization of ~っ

@TAKASUGI Shinji I'm not sure 't' is suitable either; なーんてねっ (nānte net) seems odd to me. I chose "h" so that あっ would become ah (I had totally forgotten about h as another method of romanizing long vowels).

Also, FWIW, I rewrote the romanization code recently and the old code simply didn't romanize ~っ at the end of a phrase at all, i.e. あっ (a), which I thought was somewhat problematic. What is your opinion on that behavior? —suzukaze (tc) 07:46, 22 January 2017 (UTC)

@Suzukaze-c: As you know, there is no established transcription for the final っ. I used t because it matches well at least for あっという間. Some scholars use q ([1]), which is based on the long tradition of the phonemic notation /q/ but may look too exotic. Others use an apostrophe ([2], [3]). @Atitarev, Eirikr, Haplology, Wyang, エリック・キィ: What do you think of romanization of the final っ? — TAKASUGI Shinji (talk) 09:20, 22 January 2017 (UTC)
This was discussed before: Wiktionary:Tea room/2014/August#六. Wyang (talk) 09:28, 22 January 2017 (UTC)
I’d forgotten it, thanks. We just omitted the final っ until the revision as of 2016-11-10T11:11:06, which used #. — TAKASUGI Shinji (talk) 09:57, 22 January 2017 (UTC)
"#" was a shortlived personal experiment that got published by accident, please don't mind that part.
I was unaware of both the discussion and the policy, thanks. A lot of policies here seem kind of outdated in comparison with current practice though, for example the points under 'relaxed rules' and the entry layout on Wiktionary:About Japanese. Can we use this opportunity to consider changing the policy on final っ? —suzukaze (tc) 10:05, 22 January 2017 (UTC)
  • There's policy, and then there's the technical side. I think the policy arose in part because omitting it is much easier -- if we make it "t" instead, or "h" instead, there are all kinds of odd corner cases that go funny, as explored in this current go-round.
So long as those corner cases can be properly thought through and planned for, I'm open to being convinced to change current practice. FWIW, I think omission works and is reasonably clear. ‑‑ Eiríkr Útlendi │Tala við mig 18:05, 22 January 2017 (UTC)
Maybe in the case of あっという間に where there is a と to consider it could be transliterated as 't', but in other cases it could be transliterated as something else. —suzukaze (tc) 00:16, 23 January 2017 (UTC)
How about deleting a space after っ in あっという間? That will yield atto iu ma. — TAKASUGI Shinji (talk) 13:54, 23 January 2017 (UTC)
It's totally reasonable but I also feel like morphologically it's あっ+と+いう+間+に and should maybe be romanized as such. —suzukaze (tc) 21:07, 23 January 2017 (UTC)
In this particular case, あっ and と are completely fused. There is no pause between them. — TAKASUGI Shinji (talk) 23:38, 23 January 2017 (UTC)
Which is why I proposed "where there is a と to consider it could be transliterated as 't'". I know there's no glottal stop in the pronunciation in the case of あっという間. —suzukaze (tc) 17:34, 24 January 2017 (UTC)
げっ (ge') / あっという間 (at to iu ma) suzukaze (tc) 08:28, 25 January 2017 (UTC)
We shouldn’t use h. It is for a long vowel. — TAKASUGI Shinji (talk) 11:23, 25 January 2017 (UTC)
Hmm, but we already use Hepburn-style rōmaji anyway. I personally am all for alternatives like q and ' but I also fear that it may confuse casual users of Wiktionary. Of course we could also do the previous status quo of romanizing it as nothing but I am of the opinion that romanizing it visibly is beneficial. Would directly using ʔ be too radical? —suzukaze (tc) 12:01, 25 January 2017 (UTC)
Sorry guys. I have limited Internet access as I'm on holiday in Thailand. I think っ after vowels shouldn't be romanised at all, or should be romanised as nothing. That's the common practice out there, and this is not the only situation where a foreign letter is romanised as nothing. --Anatoli T. (обсудить/вклад) 15:42, 25 January 2017 (UTC)
Nan de syô, kô, pah' to akaruku nattari, pah' to kuraku nattari surun de syô?
suzukaze (tc) 01:54, 1 September 2017 (UTC)
It should be an apostrophe if anything. It's a fairly common way to denote the glottal stop. h is inherently ambiguous and confusing since it is often used for long vowels in non-Hepburn (or rather, non-Unicode) romanization. According to this paper, while Kenkyusha's New Japanese-English Dictionary (研究社新和英大辞典), which the ALA-LC Romanization Table refers to, employs a breve ˘ to denote the glottal stop, selected words in OCLC WorldCat records represented the glottal stop with simply no representation or with an apostrophe, but none with ˘, let alone h or t.
(The quote provided by Suzukaze above is a weird one, as it employs both h and ' in place of っ. I wonder if Ishikawa actually meant a long vowel by h, but he also uses macrons (or circumflexes) so it seems unlikely, which is why it's so strange.) Nardog (talk) 14:54, 2 September 2017 (UTC)
The apostrophe representing the glottal stop is also seen in Random House Japanese-English English-Japanese Dictionary (1997) and Pocket Kenkyusha Japanese Dictionary (2003). Nardog (talk) 18:46, 8 September 2017 (UTC)
Since you have that much evidence, apostrophes are alright by me. —suzukaze (tc) 02:31, 9 September 2017 (UTC)
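The outcome of this thread, a phrase-final っ written as an apostrophe, could be sketched as below. The kana table is a small illustrative subset, and the function is a stand-in for the module's real romanization code, which is not reproduced here.

```lua
-- Illustrative kana table; the real module has its own data.
local kana = { ["あ"] = "a", ["げ"] = "ge", ["ね"] = "ne" }

local function romanize(text)
  -- A っ at the very end of the string has no following consonant
  -- to double, so turn it into an apostrophe before the lookup.
  text = text:gsub("っ$", "'")
  -- Match one UTF-8 character at a time and look it up in the table;
  -- anything not in the table (including the apostrophe) is kept as is.
  return (text:gsub("[\1-\127\194-\244][\128-\191]*", function(ch)
    return kana[ch] or ch
  end))
end

print(romanize("げっ")) --> ge'
print(romanize("あっ")) --> a'
```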

mw.loadData being used by gsub

A note in the module says that mw.ustring.gsub can't use arrays loaded by mw.loadData(). I had switched the module to directly using the various subtables of Module:ja/data before seeing the note, and it seemed to work, so I guess that note is wrong. — Eru·tuon 05:59, 17 August 2017 (UTC)

Are では and とは two words?

Currently with this module, in order to romanize the particles では or とは as dewa or towa, you have to put a space between the two kana, producing de_wa and to_wa. However, in certain contexts, では and とは both exhibit behaviors unpredictable from the way the case particles で and と are used. Simply で + は would suggest 'in, at, on', but when used at the beginning of a sentence, it could mean 'then' or, as an interjection, 'bye' (both often simplified as じゃ or じゃあ in speech); simply と + は would suggest 'with, to (sth)', but it could introduce a definition or a question asking for one, as in 「Wiktionaryとは?」("What is Wiktionary?"). Thus I think では and とは, when separated by spaces or in isolation, should be able to be recognized as one particle with irregular pronunciation, along with は. Nardog (talk) 10:33, 28 August 2017 (UTC)

I agree with you. Wyang (talk) 10:34, 28 August 2017 (UTC)
I think they should still be romanized as de wa and to wa. —suzukaze (tc) 01:08, 31 August 2017 (UTC)
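The behavior described at the top of this section, where は is read as "wa" only when it stands alone between spaces, can be illustrated with the sketch below. The tables and function names are illustrative assumptions, not the module's code; the point is only to show why では written without a space does not come out as "dewa".

```lua
-- Particles with irregular readings, applied only to whole tokens.
local particle = { ["は"] = "wa", ["へ"] = "e", ["を"] = "o" }
-- Regular per-kana readings (tiny illustrative subset).
local kana = { ["で"] = "de", ["と"] = "to", ["は"] = "ha" }

local function romanize_token(token)
  if particle[token] then
    return particle[token]
  end
  -- Otherwise romanize one UTF-8 character at a time.
  return (token:gsub("[\1-\127\194-\244][\128-\191]*", function(ch)
    return kana[ch] or ch
  end))
end

local function romanize(text)
  local out = {}
  -- Tokens are the space-separated pieces of the kana input.
  for token in text:gmatch("[^ ]+") do
    out[#out + 1] = romanize_token(token)
  end
  return table.concat(out, " ")
end

print(romanize("で は")) --> de wa
print(romanize("では"))  --> deha
```

Recognizing では and とは as single particles, as proposed above, would amount to adding them to the particle table in a sketch like this one.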

export.romaji_to_kata()

might be redundant to Module:typing-aids/data/ja now. —suzukaze (tc) 08:50, 1 September 2017 (UTC)

😢, if the time has come to farewell one of the earliest functions of the module. (although – shouldn't the data be at Module:ja/data instead? The function is still used in many Japanese modules.) Wyang (talk) 09:43, 1 September 2017 (UTC)
It's probably not redundant as far as speed is concerned. The version in Module:typing-aids has got to be a bit slower than the original function, because it uses mw.ustring.gsub over and over. Then again, if the function isn't used heavily (that is, as heavily as kana-to-romaji would be used in a list of terms), that might not matter. — Eru·tuon 03:01, 14 November 2017 (UTC)

ruby stuff

Example from では (de wa):

* {{ja-usex|'''では''' <tt>C-v</tt> (次の画面を見る)をタイプして次の画面に進んで下さい。(さあ、やってみましょう。コントロールキーを押しながら <tt>v</tt> です)|^'''で は''' <tt>C-v</tt> (つぎ の がめん を みる)を タイプ して つぎ の がめん に すすんで ください。(^さあ、やってみましょう。^コントロールキー を おしながら <tt>v</tt> です)|'''Now''' type <tt>C-v</tt> (View next screen) to move to the next screen. (go ahead, do it by depressing the control key and <tt>v</tt> together)}}

The ruby function is trying to find kana that correspond to C-v, but there are none. Should this be fixed by creating a way to exclude text from ruby annotation, and marking C-v with whatever that code happens to be? (That can easily be done with the code that now excludes decimal character entities and HTML tags.) — Eru·tuon 00:59, 2 September 2017 (UTC)

What I ended up doing is making the ruby function ignore HTML tags (for instance, <tt>), named entities (&nbsp;), and numeric entities (&#32;), as well as anything encircled in double ampersands, in both the annotated text and the annotation. It's a makeshift solution, so there may be problems with it later on. — Eru·tuon 18:32, 24 September 2017 (UTC)
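As a rough illustration of what "ignoring" these spans could mean, the sketch below simply removes them from a string before any kana matching is done. The function name strip_ignored is hypothetical and the patterns are a guess at the general idea, not the module's actual implementation.

```lua
-- Hypothetical helper: drop the spans that ruby matching should skip.
local function strip_ignored(text)
  text = text:gsub("&&.-&&", "")   -- anything wrapped in double ampersands
  text = text:gsub("<[^>]*>", "")  -- HTML tags such as <tt> and </tt>
  text = text:gsub("&#%d+;", "")   -- numeric entities such as &#32;
  text = text:gsub("&%a+;", "")    -- named entities such as &nbsp;
  return text
end

print(strip_ignored("<tt>&&C-v&&</tt>&nbsp;&#32;です")) --> です
```

Note that stripping tags alone leaves their content in place, so text like C-v would still need the double-ampersand marking to be skipped.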

kanji sortkey possibility - XJIS

https://www.google.com/search?q=xjis+sorting (not sure what happens to characters outside of XJIS though) —suzukaze (tc) 09:16, 24 September 2017 (UTC)