Script recognition module

Do we have a module that recognizes the script of a character? Given "J", it would return "Latn". Given "Δ", it would return "Grek".

I noticed that {{l|sh|во̀да}} (во̀да) is correctly labelled as Cyrl and {{l|sh|vòda}} (vòda) as Latn, without using the "sc=" parameter.

--Daniel Carrero (talk)

That would be Module:scripts, specifically the findBestScript function.

—CodeCat
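For readers unfamiliar with the idea, here is a minimal Python sketch of per-language script detection. It is not the actual code of Module:scripts (which is written in Lua); the data tables, the character ranges, and the name find_best_script are simplified assumptions made only for illustration.

```python
import re

# Hypothetical, heavily simplified script data: script code -> character class.
SCRIPT_CHARS = {
    "Latn": r"[A-Za-z\u00C0-\u024F]",
    "Cyrl": r"[\u0400-\u04FF]",
    "Grek": r"[\u0370-\u03FF]",
}

# Hypothetical language data: language code -> scripts the language is written in.
LANGUAGE_SCRIPTS = {
    "sh": ["Latn", "Cyrl"],
    "el": ["Grek"],
}


def find_best_script(text, lang):
    """Return the script of `lang` that matches the most characters in `text`.

    Returns None when no character of `text` belongs to any script of `lang`.
    """
    best_code, best_count = None, 0
    for code in LANGUAGE_SCRIPTS[lang]:
        count = len(re.findall(SCRIPT_CHARS[code], text))
        if count > best_count:
            best_code, best_count = code, count
    return best_code
```

Because only the scripts listed for the language are tried, a Serbo-Croatian word is classified as either Latn or Cyrl without any "sc=" hint.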

Thank you.

--Daniel Carrero (talk)

I would like it if {{auto cat}} (or {{charactercat}}, or whichever template), when used in Category:Bb, automatically recognized that "Bb" is in Latin script. For example, the category could be placed in "Category:Latin script something", its description could mention "Latin script", and the "Bb" in the description would carry the right script label in the code.

Likewise, Category:Δδ can be created for Greek script.

And Category:Bb: ⠃ (Latin–Braille) already exists. The category name has a mixture of scripts, but the module is already prepared to recognize the different contents before and after the colon.

But findBestScript requires a language code, and the categories mentioned are multi-language categories. Can't we change the module so that it iterates over all scripts when the language is und or something similar?

--Daniel Carrero (talk)

That can work, but what about cases like Latn vs Latinx? A language would never have both as its scripts, but if the function blindly goes over all the scripts, both can match the same text.

—CodeCat
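The ambiguity can be shown with a small Python sketch. The character ranges below are invented for illustration (the real ranges in Module:scripts/data differ), and matching_scripts is a hypothetical helper, not an existing function.

```python
import re

# Hypothetical character classes; Latinx is modelled as a superset of Latn.
SCRIPT_CHARS = {
    "Latn": r"[A-Za-z]",
    "Latinx": r"[A-Za-z\u0180-\u024F]",
    "Grek": r"[\u0370-\u03FF]",
}


def matching_scripts(text):
    """Return every script code whose character class covers all of `text`."""
    return [
        code
        for code, pattern in SCRIPT_CHARS.items()
        if text and all(re.fullmatch(pattern, ch) for ch in text)
    ]
```

With no language to narrow the candidates, "C" matches both Latn and Latinx, so the caller still needs some rule to choose between them.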

You're right. A letter like "C" is probably both Latn and Latinx. The same problem probably would happen with pa-Arab, ota-Arab, etc. if we had similar categories for the Arabic script.

Maybe it's not feasible, but could findBestScript iterate over all scripts while giving priority to scripts with 4-letter codes? If it finds a match in Latn or Arab, it stops the search and does not go on to Latinx and fa-Arab.

Or maybe just give priority to Latn over Latinx and forget Arab and the others unless they become a problem at some point.

--Daniel Carrero (talk)
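A rough Python sketch of the "4-letter codes first" idea follows. The script table, its character ranges, and the function names are assumptions for illustration only; nothing here is taken from the real module.

```python
import re

# Hypothetical script data, deliberately listed with the variants first.
SCRIPT_CHARS = {
    "Latinx": r"[A-Za-z\u0180-\u024F]",
    "fa-Arab": r"[\u0600-\u06FF]",
    "Latn": r"[A-Za-z]",
    "Arab": r"[\u0600-\u06FF]",
}


def is_plain_code(code):
    """True for plain 4-letter codes such as 'Latn', false for 'Latinx' or 'fa-Arab'."""
    return re.fullmatch(r"[A-Z][a-z]{3}", code) is not None


def find_script_any_language(text):
    """Try plain 4-letter scripts first and stop at the first script covering `text`."""
    # sorted() is stable, so the table order is kept within each group.
    ordered = sorted(SCRIPT_CHARS, key=lambda code: not is_plain_code(code))
    for code in ordered:
        if text and all(re.fullmatch(SCRIPT_CHARS[code], ch) for ch in text):
            return code
    return None
```

In this sketch "C" resolves to Latn even though Latinx also matches, while a character found only in the Latinx range still resolves to Latinx.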

We could also change the data format of the scripts a bit, giving them a "hierarchy" of some sort.

—CodeCat

Suggestion: in Latinx, nv-Latn, pjt-Latn, and so on, add the line parent = "Latn", to the script's data.

In Latn, Grek, Cyrl, and so on, add parent = "top", instead.

And in findBestScript, give priority to scripts that have parent = "top".

--Daniel Carrero (talk)

Yeah, something like that.

—CodeCat

I added a parent to every script in Module:scripts/data. Feel free to check whether I did it right. I'm not sure what to do with cases like Jpan, Hira, Kana, Hani, and Hans, where scripts overlap, so when in doubt I used parent = "top" for all of them.

I also created a function, :getParent(). I tested it, and it works.

I don't know yet whether I will be able to make findBestScript give priority to scripts that have parent = "top". If you'd like to do it, please be my guest. Otherwise, I'll try later.

--Daniel Carrero (talk)

I think a script that has no parent should simply have no parent field, rather than parent = "top".

—CodeCat
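As a sketch of how a parent field with no "top" sentinel might be used, consider the Python below. The table layout, the character ranges, and both function names are hypothetical stand-ins; the real data lives in Module:scripts/data and the real accessor is the Lua method :getParent().

```python
import re

# Hypothetical script data: variants name their parent, top-level scripts omit the field.
SCRIPTS = {
    "Latinx": {"characters": r"[A-Za-z\u0180-\u024F]", "parent": "Latn"},
    "fa-Arab": {"characters": r"[\u0600-\u06FF]", "parent": "Arab"},
    "Latn": {"characters": r"[A-Za-z]"},
    "Arab": {"characters": r"[\u0600-\u06FF]"},
}


def get_parent(code):
    """Return the parent script code, or None when the script has no parent."""
    return SCRIPTS[code].get("parent")


def find_best_script(text):
    """Among scripts covering `text`, prefer one without a parent."""
    candidates = [
        code
        for code, data in SCRIPTS.items()
        if text and all(re.fullmatch(data["characters"], ch) for ch in text)
    ]
    for code in candidates:
        if get_parent(code) is None:
            return code
    # No parentless script matched; fall back to the first variant, if any.
    return candidates[0] if candidates else None
```

A missing field reads naturally as "no parent", so callers test for None instead of comparing against a magic string.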

Woopsie! Could we add ancestor information too? Sorry I'm such a goofus!

—John C5

OK. I removed all instances of parent = "top".

--Daniel Carrero (talk)
