Module talk:zh-han

Variations in stroke orders (some complex cases)

@Erutuon I've just added snmjk, sncjk, snmjk+. Because I'm using 3 letters, the "snX" parameter will not display (the "asX" parameter works fine). Would you mind fixing this?

Also, there are some cases when I need to use two "snX" parameters instead of one because there are three different stroke counts depending on region. This results in a jumbled catastrophe (the "sn" parameter displays as "Chinese traditional Chinese").

Can the code be edited so that the "sn" parameter only takes what is specified in "snX1" so that it appears much cleaner? This may take some time but help is much appreciated. KevinUp (talk) 05:46, 15 December 2018 (UTC)[reply]

@KevinUp: I've made the module accept the new parameter names (I think). If it's not working, please give some examples.
Allowing two sn parameters would probably be pretty complicated. I was sort of working on making the parameters be handled in a more systematic way (rather than with lots of repetitive code) in Module:zh-han/sandbox, which might make that easier.
I don't understand what you mean by "snX1". — Eru·tuon 06:02, 15 December 2018 (UTC)[reply]
@Erutuon: Thanks! It seems to be working now, except for snmjk+ which displays as "mainland China and Japanese and Korean" rather than "Chinese (mainland China, Hong Kong), Japanese and Korean".
What I mean by "snX1" is that when two "snX" parameters such as "snjk" and "snm" are used at the same time, only the first one, "snjk" is used to display the arguments for "sn".
I'm also planning to add something like snhjk for cases where the strokes are different in Hong Kong compared to Taiwan.
Here's a complicated example because is counted as 2 strokes in mainland China:
(radical 170, (GHJKV) or ⿰阝⿳小(T))
  • +11 in traditional Chinese (Taiwan)
  • +10 in Chinese (mainland China, Hong Kong), Japanese and Korean
  • 14 strokes in traditional Chinese (Taiwan)
  • 13 strokes in Chinese (Hong Kong), Japanese and Korean
  • 12 strokes in mainland China
Thanks for the quick reply. KevinUp (talk) 06:15, 15 December 2018 (UTC)[reply]
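The arithmetic behind the example above can be sketched as follows. This is only an illustration, not code from Module:zh-han. The radical stroke count of 2 for mainland China is stated in the comment above; the count of 3 for the other regions is inferred from the listed figures (14 - 11 and 13 - 10). The names `TOTAL_STROKES`, `RADICAL_STROKES` and `additional_strokes` are hypothetical.

```python
# Total stroke counts per region, copied from the example above.
TOTAL_STROKES = {
    "Taiwan": 14,
    "Hong Kong, Japan, Korea": 13,
    "mainland China": 12,
}

# Stroke count of the radical per region. The value 2 for mainland China is
# stated in the discussion; 3 elsewhere is inferred from the listed totals.
RADICAL_STROKES = {
    "Taiwan": 3,
    "Hong Kong, Japan, Korea": 3,
    "mainland China": 2,
}


def additional_strokes(total, radical):
    """Additional strokes are the total strokes minus the radical's strokes."""
    return total - radical


ADDITIONAL = {
    region: additional_strokes(TOTAL_STROKES[region], RADICAL_STROKES[region])
    for region in TOTAL_STROKES
}

for region, extra in ADDITIONAL.items():
    print(f"+{extra} in {region} ({TOTAL_STROKES[region]} strokes in total)")
```

The result reproduces the three-way split in the example: +11 for Taiwan and +10 for the other two groups, even though all three totals differ.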
I can maybe look at this again, but I probably first need to figure out how to test that combinations of template parameters are interpreted correctly. Then you can input combinations of parameters and what they should signify in a testcases module, and I can try to implement a system that results in the correct interpretation: for instance, when the parameters are |sn=11 and |snhjk=12, which regions or varieties of Chinese each stroke number should apply to. — Eru·tuon 22:43, 9 January 2019 (UTC)[reply]
@KevinUp, Erutuon, I think this kind of parameter naming is quite unsustainable. Wouldn't it be better to use something like |sn=g:12,t:14,hjk:13? — justin(r)leung (t...) | c=› } 22:58, 9 January 2019 (UTC)[reply]
I think that would be much easier to understand and implement. — Eru·tuon 23:04, 9 January 2019 (UTC)[reply]
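A minimal sketch of how a value such as g:12,t:14,hjk:13 might be interpreted, written in Python for illustration rather than in the module's Lua. The region letters follow the GHTJKV abbreviations used in this thread; the function name `parse_stroke_param` and the error handling are assumptions, not an existing interface.

```python
# Region letters as used in this discussion: G (mainland China), H (Hong Kong),
# T (Taiwan), J (Japan), K (Korea), V (Vietnam).
REGION_CODES = set("ghtjkv")


def parse_stroke_param(value):
    """Parse 'g:12,t:14,hjk:13' into a mapping of region letter to stroke count.

    Hypothetical helper: each comma-separated item is '<regions>:<count>',
    where <regions> is one or more region letters sharing the same count.
    """
    result = {}
    for item in value.split(","):
        codes, sep, count = item.strip().partition(":")
        if not sep or not codes or not (count.isascii() and count.isdigit()):
            raise ValueError(f"malformed item: {item!r}")
        for code in codes:
            if code not in REGION_CODES:
                raise ValueError(f"unknown region code: {code!r}")
            if code in result:
                raise ValueError(f"region given twice: {code!r}")
            result[code] = int(count)
    return result


print(parse_stroke_param("g:12,t:14,hjk:13"))
```

With one parameter holding every region, a third distinct stroke count needs no new parameter name, which is the problem that snmjk, sncjk and snmjk+ ran into.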
Yes. I think that would be better. Perhaps we could use a box to display the stroke counts for different regions instead? Too much text is cluttered and confusing.
I have the mainland China stroke count based on Xinzixing (新字形) listed at Special:PrefixIndex/Module:mul/guoxue-data/. I can also get the data for Japan, Taiwan and Korea stroke counts but not the data for "additional stroke count" (minus the radical stroke count) based on different regions.
Because a lot of cleanup is needed, could we use an external module subpage to keep all of this data instead? One suggestion I could think of is to use the old as= and sn= parameters for the default stroke count supplied by the Unihan database (manual input) and an additional box below detailing the different regional stroke counts that is automatically generated. KevinUp (talk) 23:44, 9 January 2019 (UTC)[reply]
@Suzukaze-c Any comments regarding this? KevinUp (talk) 23:44, 9 January 2019 (UTC)[reply]
Update: Korean stroke count for 10758 hanja characters now available at Module:mul/hanja-data. Japan and Taiwan data will be available soon. KevinUp (talk) 00:54, 10 January 2019 (UTC)[reply]
I wonder about the "translingualness" of this information. Perhaps we could figure out a way to present data that is specifically tied to a certain character shape and a certain region. —Suzukaze-c 01:11, 10 January 2019 (UTC)[reply]
Another way of dealing with this is to have the stroke numbers listed separately under the Chinese, Japanese and Korean sections respectively. The translingual section isn't really useful. Much of its data was imported from the Unihan database which has various errors in it. There's also a separate project ongoing at wikidata (WikiProject CJKV character). KevinUp (talk) 01:42, 10 January 2019 (UTC)[reply]
@Suzukaze-c, KevinUp: I would support putting this info in the respective languages instead. This would be the right move after moving the glyph origin into the respective languages. The stroke orders should definitely not be bunched up to the right under translingual. Of course, the Chinese section would have to deal with G, T and H, but that's more reasonable than bunching all this info up in translingual. We'd also have to see whether the "Derived characters" and "Related characters" sections and the notes in "Alternative forms" or "Usage notes" discussing Han unification still make any sense if we give up on the translingual section. — justin(r)leung (t...) | c=› } 03:12, 10 January 2019 (UTC)[reply]
@Justinrleung: I just remembered, some of Hong Kong's stroke count in 常用字字形表 is different from Taiwan, particularly (Hong Kong: 7, Taiwan: 8) and (Hong Kong: 11, Taiwan: 10) although both glyphs are written the same. Do you know where I can query the stroke count for Hong Kong (other than 常用字字形表) to confirm this? KevinUp (talk) 07:25, 10 January 2019 (UTC)[reply]
Found what I'm looking for here. So now I'm able to compile the stroke counts for all the regions (GHTJKV including Vietnam). When it's done I'll put up a beer parlour proposal to relocate stroke count information. I'm also keen to move the stroke order diagrams to the relevant language sections rather than the translingual section. In the meantime, I'll upload unsorted data to Special:PrefixIndex/Module:mul/ (mainland China and Korea available for now). KevinUp (talk) 10:02, 10 January 2019 (UTC)[reply]
In the Unihan database, some Chinese characters such as (xún) and (chū) are listed under a different radical — based on their right component and (dāo), rather than the left component (chuò) and ()) because of the convention used by the Kangxi Dictionary. For (xún) and (chū), four regions (mainland China/Hong Kong/Taiwan/Japan) use the radical (chuò) and ()) whereas Unihan/Unicode and South Korea still use the original radical found in the Kangxi dictionary. Since our sort keys for Han characters are based on the radical number used by Unihan/Unicode, the abolishment of the translingual section might cause some problems. Perhaps we could still maintain the translingual section to list down the radical form and stroke number used by Unihan/Unicode? KevinUp (talk) 08:54, 13 January 2019 (UTC)[reply]
Or should we just fix (chū) in Module:zh-sortkey/data/016 (U+521D, decimal=21021) from "刀05" (Unihan) to "衣02" (more common for Chinese). KevinUp (talk) 12:42, 13 January 2019 (UTC)[reply]
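To make the two candidate values concrete, here is a small sketch of the key shape seen above: the radical followed by a two-digit count of additional strokes. It is modelled only on the strings "刀05" and "衣02" quoted in this thread and is not the actual code of Module:zh-sortkey; `make_sortkey` is a hypothetical name.

```python
def make_sortkey(radical, additional_strokes):
    """Build a key shaped like the ones quoted above: radical + 2-digit count."""
    return f"{radical}{additional_strokes:02d}"


# The two candidate keys for U+521D discussed above.
unihan_key = make_sortkey("刀", 5)   # radical per Unihan
chinese_key = make_sortkey("衣", 2)  # radical more common for Chinese

print(unihan_key, chinese_key)
```

Because the radical comes first in the key, changing it moves the character to a different place in sorted lists, which is why the choice of radical matters more here than the stroke count.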

To be used in entries, Module:mul/guoxue-data/ submodules and Module:mul/hanja-data will probably have to be broken up into smaller modules, as was done for the Chinese sortkey data modules, or they will push a bunch of pages over the Lua memory limit. Module:mul/guoxue-data/cjk for instance adds around 16 MB to the Lua memory usage when I load it with mw.loadData in Module:sandbox. — Eru·tuon 01:04, 10 January 2019 (UTC)[reply]

One way of dealing with this is to put the data into more refined blocks, e.g. 3400-4000, 4001-5000, 20000-21000, etc. Also, the data I provided contains some false positives: the hanja-data includes some simplified Chinese characters, while the guoxue-data contains data for all glyphs even if the character is not used in mainland China. I'll need to further clean up that data based on the headers we have, i.e. removing the value if it's not Chinese/Korean, etc. KevinUp (talk) 01:42, 10 January 2019 (UTC)[reply]
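A rough sketch of the splitting idea, in Python for illustration only. It groups a character-to-stroke-count table into blocks by code point so that a page would only need to load the one block containing its character. The block size of 1000 code points and the name `split_into_blocks` are assumptions; the real data modules may be divided differently.

```python
from collections import defaultdict

BLOCK_SIZE = 1000  # assumed block width in code points, for illustration only


def split_into_blocks(data, block_size=BLOCK_SIZE):
    """Group {character: stroke count} into {block index: {character: count}}."""
    blocks = defaultdict(dict)
    for char, strokes in data.items():
        blocks[ord(char) // block_size][char] = strokes
    return dict(blocks)


def block_index(char, block_size=BLOCK_SIZE):
    """Index of the block that would hold the data for this character."""
    return ord(char) // block_size


# Tiny sample table (total stroke counts).
sample = {"一": 1, "刀": 2, "初": 7}
blocks = split_into_blocks(sample)

# Looking up one character touches only its own block.
print(blocks[block_index("初")]["初"])
```

The trade-off is between many small modules, which are cheap to load but numerous, and a few large ones, which risk the memory limit mentioned above.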
Update: Hanja data has been cleaned up and Module:mul/hk-edu-data is now available. KevinUp (talk) 08:54, 13 January 2019 (UTC)[reply]
To be done (for own reference): (1) Generate data for mainland China (List only G-Source glyphs, calculate "additional strokes" based on deduction of radical stroke count from IDS) (2) Upload stroke number data for Taiwan, Japan, Vietnam (Vietnam data to be based on Nguyễn (2014)). (3) Combine data from GHTJK into one file using parameters such as "asj", "snj", "asg", "sng", "ash", "snh" (Vietnam data is unstable and to be excluded), then break into smaller subpages to make the module usable. KevinUp (talk) 12:42, 13 January 2019 (UTC)[reply]
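Step (3) could look roughly like the sketch below, which merges per-region tables into one record per character using keys of the form "sn" + region and "as" + region, as named in the to-do list. The function name and the input layout are hypothetical, and the sample values are illustrative only (they reflect the different radicals for U+521D mentioned earlier in this thread).

```python
def combine_regions(totals, radical_strokes):
    """Merge per-region stroke data into one record per character.

    totals:          {region: {char: total stroke count}}
    radical_strokes: {region: {char: stroke count of the radical in that region}}
    Returns {char: {"sn<region>": total, "as<region>": additional strokes}}.
    """
    combined = {}
    for region, per_char in totals.items():
        for char, total in per_char.items():
            record = combined.setdefault(char, {})
            record["sn" + region] = total
            record["as" + region] = total - radical_strokes[region][char]
    return combined


# Illustrative input: same total in both regions, but a different radical.
totals = {"g": {"初": 7}, "k": {"初": 7}}
radical_strokes = {"g": {"初": 5}, "k": {"初": 2}}

combined = combine_regions(totals, radical_strokes)
print(combined["初"])
```

Keeping the additional stroke count derived from the total and the radical, rather than stored separately, avoids the two figures drifting apart during cleanup.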

Uh, help?

@KevinUp I tried to increase the range of the category script, but it refused to work. Help me? Johnny Shiz (talk) 22:01, 11 February 2019 (UTC)[reply]

@Johnny Shiz: Fixed. Also, be aware that it takes a while for pages to start appearing in the category. — Eru·tuon 22:35, 11 February 2019 (UTC)[reply]
Also, triangular triplications will not be automatically added to Category:CJKV triplications because their IDS does not indicate that they are a triplication. For instance, the IDS for is ⿱子孖 (the IDS of in turn is ⿰子子). Not sure if these can be added to the category automatically or if they should be done manually.... — Eru·tuon 22:41, 11 February 2019 (UTC)[reply]
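One possible way to catch such cases automatically is to expand each component's own IDS before counting, sketched below in Python. This is an assumption about how it could be done, not what the category script currently does; `IDS_TABLE`, `leaf_components` and `is_triplication` are hypothetical names, and the table holds only the one entry quoted above.

```python
# Ideographic description characters (operators) are skipped when counting.
IDS_OPERATORS = set("⿰⿱⿲⿳⿴⿵⿶⿷⿸⿹⿺⿻")

# Known decompositions; only the entry quoted in the comment above.
IDS_TABLE = {"孖": "⿰子子"}


def leaf_components(ids, table):
    """Expand an IDS string into leaf components, following table entries.

    Assumes the table has no circular entries; a character whose IDS
    contained itself would make this recurse forever.
    """
    leaves = []
    for ch in ids:
        if ch in IDS_OPERATORS:
            continue
        if ch in table:
            leaves.extend(leaf_components(table[ch], table))
        else:
            leaves.append(ch)
    return leaves


def is_triplication(ids, table=IDS_TABLE):
    """True if the fully expanded IDS consists of one component three times."""
    leaves = leaf_components(ids, table)
    return len(leaves) == 3 and len(set(leaves)) == 1


print(is_triplication("⿱子孖"))  # triangular case from the comment above
print(is_triplication("⿰子子"))  # a duplication, not a triplication
```

This ignores the layout operators entirely, so it treats triangular and side-by-side arrangements alike; whether that is desirable for the category is a separate question.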
@Erutuon: We already have categories for duplicated, triplicated and quadruplicated Chinese characters, so I don't think there's a need to have Category:CJKV triplications. To the best of my knowledge, all triplicated and quadruplicated characters are of Chinese origin while Japanese kokuji, Korean gukja and Vietnamese Chu Nom don't usually have such characters. I think Category:CJKV triplications can be deleted. @Justinrleung, any thoughts on this? KevinUp (talk) 11:53, 12 February 2019 (UTC)[reply]
@Johnny Shiz, Erutuon, KevinUp: Perhaps they should be moved to Category:Duplicated CJKV characters, Category:Triplicated CJKV characters and Category:Quadruplicated CJKV characters since many of such characters are not restricted to use in Chinese. — justin(r)leung (t...) | c=› } 16:28, 12 February 2019 (UTC)[reply]
OMG thanks. By the way, "CJKV" = Chinese, Japanese, Korean, and Vietnamese. Johnny Shiz (talk) 19:46, 12 February 2019 (UTC)[reply]
It's also used in the Sawndip writing system, not just CJKV. The term Han character is much more appropriate. Once again, you are advised to seek community consensus, and wait until other editors have replied, before making unnecessary moves or decisions. KevinUp (talk)
@Justinrleung, Suzukaze-c, Dokurrat, Geographyinitiative, Dine2016 Compare the following categories. It's a mess in my opinion.
Category 1: Category:Duplicated Chinese characters Category 2: Category:Duplicated CJKV characters
Category 1: Category:Triplicated Chinese characters Category 2: Category:Triplicated CJKV characters
Category 1: Category:Quadruplicated Chinese characters Category 2: Category:Quadruplicated CJKV characters
Some of the characters that appear in the "CJKV" category don't appear in the "Chinese" category and vice versa. Previously, the Chinese characters were categorized using {{zh-etym-double}}, {{zh-etym-triple}}, {{zh-etym-quadruple}} but now those get included in the "CJKV" category instead.
Also, some entries such as (shān) are categorized by {{zh-cat|Triplicated}} instead of {{zh-etym-triple}} (See User talk:Dokurrat/Archive_1#叒巛彡). KevinUp (talk) 07:27, 13 February 2019 (UTC)[reply]
My opinion is that the "CJKV" category is confusing, and should be removed altogether. We can have separate "Japanese", "Korean", "Vietnamese", "Zhuang" duplications/triplications/etc but only if the character is of widespread usage, eg. Japanese Jōyō kanji or Korean Basic Hanja for educational use. Cleanup of Korean hanja and Vietnamese Hán Nôm is ongoing at a very slow pace, so we'll deal with those later.
By the way, Chinese Wikipedia has pages for various 疊字 (diézì): 二疊字, 三疊字 (sāndiézì), 四疊字, 五疊字, 六疊字, 八疊字 while Japanese Wikipedia has a page on 理義字. KevinUp (talk) 07:27, 13 February 2019 (UTC)[reply]
TLDR: Shall we just delete the "CJKV" categories altogether and deal with these characters on a case-by-case basis? KevinUp (talk) 07:27, 13 February 2019 (UTC)[reply]
I think that we should name them "Category:X Han characters", cf. Category:Han characters by language. Splitting by language + "widespread usage" is not worth it, IMO. Some of these don't have widespread usage in Chinese anyway. —Suzukaze-c 00:49, 15 February 2019 (UTC)[reply]