Script recognition module

Do we have any module that recognizes the script of characters? Given "J", it would return "Latn". Given "Δ", it would return "Grek".

I noticed that {{l|sh|во̀да}} (во̀да) is correctly labelled as Cyrl and {{l|sh|vòda}} (vòda) as Latn, without using the "sc=" parameter.

--Daniel Carrero (talk) 21:04, 27 July 2016

That would be Module:scripts. Specifically the findBestScript function.

CodeCat 21:06, 27 July 2016
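For readers unfamiliar with the idea, here is a minimal Python sketch of language-restricted script detection. It is not the code of Module:scripts: the character ranges, the `LANGUAGE_SCRIPTS` table and the "most matching characters wins" rule are all assumptions made for illustration.

```python
import re

# Hypothetical, heavily simplified script data: script code -> character class.
SCRIPT_CHARACTERS = {
    "Latn": "A-Za-z\u00C0-\u024F",
    "Cyrl": "\u0400-\u04FF",
    "Grek": "\u0370-\u03FF",
}

# Hypothetical table of the scripts each language may be written in.
LANGUAGE_SCRIPTS = {
    "sh": ["Latn", "Cyrl"],
    "el": ["Grek"],
}


def find_best_script(text, lang_code):
    """Return the candidate script that matches the most characters, or None."""
    best_code, best_count = None, 0
    for code in LANGUAGE_SCRIPTS[lang_code]:
        count = len(re.findall("[" + SCRIPT_CHARACTERS[code] + "]", text))
        if count > best_count:
            best_code, best_count = code, count
    return best_code
```

Under these assumptions, `find_best_script("vòda", "sh")` picks the Latin candidate and the Cyrillic spelling picks "Cyrl", because only the scripts listed for "sh" are ever compared.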

Thank you.

--Daniel Carrero (talk) 21:07, 27 July 2016

I would like it if {{auto cat}} (or {{charactercat}} or whatever template), when used in Category:Bb, automatically recognized that "Bb" is in Latin script. For example, the category could be placed into "Category:Latin script something", it could have "Latin script" in the description, and the "Bb" in the description would have the right script label in the code.

Likewise, Category:Δδ can be created for Greek script.

And Category:Bb: ⠃ (Latin–Braille) already exists. The category name has a mixture of scripts, but the module is already prepared to recognize the different contents before and after the colon.

But findBestScript requires a language code and the categories mentioned are multi-language categories. Can't we change the module so that it iterates over all scripts, when the language is und or something?

--Daniel Carrero (talk) 00:13, 28 July 2016

That can work, but what about cases like Latn vs Latinx? A language would never have both as its script, but if it blindly goes over all the scripts, it's different.

CodeCat 00:34, 28 July 2016

You're right. A letter like "C" is probably both Latn and Latinx. The same problem probably would happen with pa-Arab, ota-Arab, etc. if we had similar categories for the Arabic script.

Maybe it's not feasible, but can findBestScript iterate over all scripts while giving priority to the 4-letter scripts? If it finds something in Latn or Arab, it stops the search and does not iterate over Latinx and fa-Arab.

Or maybe just give priority to Latn over Latinx and forget Arab and the others unless they become a problem at some point.

--Daniel Carrero (talk) 00:46, 28 July 2016

We could also change the data format of the scripts a bit, giving them a "hierarchy" of some sort.

CodeCat 00:58, 28 July 2016

Suggestion: in Latinx, nv-Latn, pjt-Latn... add parent = "Latin",.

In Latn, Grek, Cyrl... add parent = "top",.

And in findBestScript, give priority to scripts that have "parent = top".

--Daniel Carrero (talk) 01:20, 28 July 2016
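The suggestion above could be sketched roughly as follows, in Python rather than the module's own language. The `SCRIPTS` table, its character ranges and the function names are hypothetical, and parent values are written as script codes, which is also an assumption. Top-level scripts are tried first, and a variant script wins only if it matches strictly more characters.

```python
import re

# Hypothetical script data; "parent" follows the suggestion in this thread.
SCRIPTS = {
    "Latinx": {"characters": "A-Za-z\u00C0-\u024F\u1E00-\u1EFF", "parent": "Latn"},
    "Latn": {"characters": "A-Za-z\u00C0-\u024F", "parent": "top"},
    "Grek": {"characters": "\u0370-\u03FF", "parent": "top"},
    "Cyrl": {"characters": "\u0400-\u04FF", "parent": "top"},
}


def count_matches(text, characters):
    """Count the characters of text that fall inside the given character class."""
    return len(re.findall("[" + characters + "]", text))


def find_script_any_language(text):
    """Search all scripts, trying those with parent == "top" first."""
    # False sorts before True, so top-level scripts come first (sort is stable).
    ordered = sorted(SCRIPTS, key=lambda code: SCRIPTS[code]["parent"] != "top")
    best_code, best_count = None, 0
    for code in ordered:
        count = count_matches(text, SCRIPTS[code]["characters"])
        # Strict comparison: on a tie the earlier, top-level script is kept.
        if count > best_count:
            best_code, best_count = code, count
    return best_code
```

With this ordering, a letter such as "C" that belongs to both Latn and Latinx resolves to "Latn", while text that only the variant covers can still resolve to the variant.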

Yeah, something like that.

CodeCat 01:35, 28 July 2016

I added the parent in all scripts of Module:scripts/data. Feel free to check if I did it right. I'm not sure what to do with cases like Jpan, Hira, Kana, Hani, Hans, where scripts overlap, so when in doubt I used parent = "top", in all cases.

I also created a function :getParent(). I tested it; it's working.

I don't know yet if I would be able to make findBestScript give priority to scripts that have parent = "top",. If you'd like to do it, please be my guest. Otherwise, I think I should try later.

--Daniel Carrero (talk) 08:37, 29 July 2016

I think there should just be no parent when there isn't any, rather than "top".

CodeCat 15:44, 29 July 2016
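As an illustration of the "no parent" convention, here is a small Python sketch in which a top-level script simply has no parent entry. The `SCRIPT_DATA` table and the `Script` class are hypothetical stand-ins and do not reproduce the module's :getParent().

```python
# Hypothetical data: only variant scripts carry a "parent" entry.
SCRIPT_DATA = {
    "Latn": {},
    "Latinx": {"parent": "Latn"},
    "Arab": {},
    "fa-Arab": {"parent": "Arab"},
}


class Script:
    def __init__(self, code):
        self.code = code
        self._data = SCRIPT_DATA[code]

    def get_parent(self):
        """Return the parent script code, or None when there is no parent."""
        return self._data.get("parent")

    def is_top_level(self):
        """A script is top-level exactly when it has no parent."""
        return self.get_parent() is None
```

A caller can then test for a missing parent directly, without comparing against a sentinel string such as "top".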

Woopsie! Could we add ancestor information too? Sorry I'm such a goofus!

JohnC5 03:22, 1 August 2016

Ok. I removed all instances of parent="top".

--Daniel Carrero (talk) 21:52, 9 August 2016