Thanks for your reply.
Here's what I'm asking for.
First of all, the pairs with nuqta (a dot underneath) and without it should be searchable the same way Roman letters with diacritics and without are searchable.
The letters are not identical, but if a user typed खून, ख़ून should also be listed.
- Words containing the diacritics ॉ (candra) or ् (virama) should be treated as equal to those without them: चॉकलेट / चाकलेट, सन् / सन. This is similar to the way English entries with a space are treated as equal to those with a hyphen (-) in its place.
- Different forms of alif: ا, أ, إ, ﺁ and ٱ should be searchable together, e.g. أمس and امس, etc.
- Words containing any of these diacritics could be searchable as if they don't have them and the other way around:
Is it possible? Anatoli 12:51, 31 January 2011 (UTC)
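To make the request above concrete, here is a minimal sketch of the folding being asked for. It is not how MediaWiki search actually works; `FOLD_TABLE` and `search_key` are hypothetical names, and the table covers only the characters mentioned in this thread. The idea is that two titles are listed together when their keys are equal.

```python
import unicodedata

# Hypothetical folding table: only the characters named in this thread.
FOLD_TABLE = {
    0x093C: None,    # Devanagari nuqta: dropped
    0x094D: None,    # Devanagari virama: dropped
    0x0949: 0x093E,  # candra o (ॉ) -> aa (ा)
    0x0623: 0x0627,  # alif with hamza above (أ) -> plain alif (ا)
    0x0625: 0x0627,  # alif with hamza below (إ) -> plain alif
    0x0622: 0x0627,  # alif with madda (آ) -> plain alif
    0x0671: 0x0627,  # alif wasla (ٱ) -> plain alif
}


def search_key(title: str) -> str:
    """Return a comparison key; the title itself is never changed."""
    # NFKC splits the precomposed nuqta letters U+0958..U+095F into
    # base letter + nuqta, and maps Arabic presentation forms such as
    # U+FE81 to the ordinary letters, so the table can stay small.
    return unicodedata.normalize("NFKC", title).translate(FOLD_TABLE)


# Pairs from the request above; each line prints True.
print(search_key("\u0916\u093C\u0942\u0928") == search_key("\u0916\u0942\u0928"))  # ख़ून / खून
print(search_key("\u0938\u0928\u094D") == search_key("\u0938\u0928"))              # सन् / सन
print(search_key("\u0623\u0645\u0633") == search_key("\u0627\u0645\u0633"))        # أمس / امس
```

Dropping every virama also flattens conjuncts, which is harmless as long as both the typed text and the page titles go through the same function, but it may be broader than what is wanted.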
Tim Starling added a bug for this, and I will add all of the characters you have listed above to the bug. I told him that we wanted this behavior in AutoComplete (in the search fields) as well as in the DidYouMean extension, which suggests alternative results when your search doesn't lead to an existing page. Is there anywhere else this should be enabled?
Yes, it's what you're asking. The treatment for alternative letters should be the same as for Roman words containing any of Appendix:Variations_of_"a" or others, so that when you type etre, you can see être in the search box. Anatoli 19:21, 31 January 2011 (UTC)
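For comparison, the Roman-letter behaviour described here (etre finding être) can be approximated as below. This is only a sketch under the assumption that the folding works by removing combining marks; the thread does not say how the search box actually does it, and `fold_latin` is a hypothetical name.

```python
import unicodedata


def fold_latin(text: str) -> str:
    """Strip combining marks from Latin text: 'être' becomes 'etre'."""
    # NFD splits ê (U+00EA) into e + combining circumflex (U+0302).
    decomposed = unicodedata.normalize("NFD", text)
    # Drop nonspacing marks (general category Mn), keep everything else.
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


print(fold_latin("être"))  # etre
```

The same blanket rule would not do for Hindi: Devanagari vowel signs such as ू are also nonspacing marks, so they would be stripped along with the nuqta. That is why the Hindi and Arabic cases need an explicit list of characters.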
Thank you. I tested various Arabic alifs and they seem to work; no luck with the Hindi variants, though. Anatoli 21:41, 31 January 2011 (UTC)
I am pretty sure no modifications have been made yet. The bug was only just submitted, and I understand that the guy who works on search functions for MediaWiki is pretty busy at the moment.
Then it means that various versions of alifs have already been addressed before. They could be used as a template, perhaps? Please let me know if you don't understand any of the requirements, not sure I expressed myself well - I wrote between jobs and then late at night. Anatoli 22:09, 31 January 2011 (UTC)
Persian often uses a zero-width non-joiner (U+200C, &#x200C;) as in ویکی‌پدیا. People who don’t know how to access it tend to substitute a space: ویکی پدیا. It’s a misspelling, but lots of people can’t help it.
In languages like Khmer and Thai that do not use word spaces, there is often a zero-width space (U+200B, &#x200B;) as in តើ​អ្នក​និយាយ​ភាសា​អង់គ្លេស​ទេ. More often than not, it is simply left out (តើអ្នកនិយាយភាសាអង់គ្លេសទេ). Both spellings are correct.
Do you know how MediaWiki currently handles this? We obviously don't want all spaces to be normalized to a zero-width character, and we don't want those characters to be normalized to a space either (or terrible matches would be made).
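One way to get the matches without normalizing anything in the titles themselves is to compare keys instead. The sketch below is an assumption about how that could look, not a description of current MediaWiki behaviour; `MATCH_TABLE` and `match_key` are hypothetical names. Ordinary spaces are left alone, the zero-width non-joiner is treated like the space people type in its place, and the zero-width space is ignored.

```python
ZWNJ = "\u200c"  # zero-width non-joiner (Persian)
ZWSP = "\u200b"  # zero-width space (Khmer, Thai)

# Used only to build comparison keys; stored titles are never rewritten.
MATCH_TABLE = {
    ord(ZWNJ): " ",   # compare as if a space had been typed
    ord(ZWSP): None,  # optional in the spelling, so ignore it
}


def match_key(title: str) -> str:
    """Return a key for search matching; ordinary spaces are kept as-is."""
    return title.translate(MATCH_TABLE)


# The Persian spelling with ZWNJ matches the common spaced misspelling.
print(match_key("ویکی\u200cپدیا") == match_key("ویکی پدیا"))  # True
```

Because ordinary spaces are never removed, two English titles that differ only by a space still get different keys.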