Module talk:IPA

Order by articulation?

This is just a small suggestion... Would it make sense to sort the symbols by their articulation type, so that the order here resembles the order used in the IPA chart? I know that only matters for Lua coders, but still. :) —CodeCat 21:40, 20 March 2013 (UTC)[reply]

This order is from the reference, w:X-SAMPA, which our {{X-SAMPA}} links to. Probably makes more sense to list them systematically by the source material. So, I think yes. Michael Z. 2013-03-20 22:48 z

Translating symbols without precomposed characters

These could probably be transliterated as well, with a precaution. Precomposed characters would always have to be substituted first, so that the longest possible sequence is always used. I don't know how gsub behaves in this case, though, so it may be necessary to add a bit of extra code. —CodeCat 02:33, 21 March 2013 (UTC)

I’m not sure that’s necessary for the characters that only have a form with diacritics. I believe that mw.ustring.gsub replaces all of the diacritics separately. Needs testing.
These lines in the code are mainly placeholders for when Unicode does add the characters. I will comment this more clearly.
Good point about order, though. MW’s Lua has undocumented quirks. Michael Z. 2013-03-21 04:10 z
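The longest-sequence-first precaution raised above can be sketched as follows. This is a minimal Python illustration, not the module's Lua code. It uses a few X-SAMPA-to-IPA entries because their keys are visibly prefixes of one another, but the same ordering applies in the IPA-to-X-SAMPA direction; the table and the names `build_pattern` and `convert` are assumptions made for the example.

```python
import re

# Illustrative subset only (an assumption); the module's real table is much larger.
XSAMPA_TO_IPA = {
    "r": "r",       # alveolar trill
    "r\\": "ɹ",     # alveolar approximant
    "r\\`": "ɻ",    # retroflex approximant
    "`": "˞",       # rhoticity hook on its own
}

def build_pattern(table):
    # Sort keys longest first so a long sequence always wins over its own prefix.
    keys = sorted(table, key=len, reverse=True)
    return re.compile("|".join(re.escape(key) for key in keys))

PATTERN = build_pattern(XSAMPA_TO_IPA)

def convert(text):
    # Characters that match no key are left unchanged.
    return PATTERN.sub(lambda match: XSAMPA_TO_IPA[match.group(0)], text)
```

Replacing the short keys first would turn the `r` in `r\` into a trill and leave a stray backslash behind, which is the failure the ordering is meant to prevent.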

Verification?

I think it would be a good idea to add something that also verifies the validity of the IPA, so that it checks whether there are any characters in the string that don't belong in IPA, and either shows an error or puts the entry in a category if it fails. There would have to be some leeway though, because current practice allows for brackets ( ) and spaces even though they're not strictly part of IPA. —CodeCat 02:38, 21 March 2013 (UTC)[reply]

Spaces are converted to dashes, following the X-SAMPA table.
Other characters will just pass through, so this should be reasonably safe. It’s just for converting the IPA entered in Wiktionary, so it is presumed to be checked by an editor. Validating the IPA that is entered into Wiktionary is out of scope, although that might be a valuable project. But this conversion does need testing. Michael Z. 2013-03-21 04:28 z
I don't see why it's out of scope. This module is called "IPA" isn't it? Besides, everything that is needed is already in place (the list of valid characters) so it would be very easy to do. In fact, it would be almost the same as converting it to X-SAMPA, except that you check afterwards whether any characters haven't been converted. —CodeCat 13:28, 21 March 2013 (UTC)[reply]
Okay; I was imagining something more complex. What’s the behaviour? If there are non-IPA characters included, then show an error message on the page? Or maybe better to add a class attribute and a hidden category, so that it only presents itself to editors. Michael Z. 2013-03-21 15:40 z
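A rough sketch of the check proposed in this thread, in Python rather than the module's Lua. The symbol set is a tiny hypothetical excerpt, the tolerated set encodes the leeway for brackets and spaces mentioned above, and the function names are assumptions. The text is passed through unchanged and only a hidden category is attached, so the problem is visible to editors rather than readers.

```python
# Hypothetical, heavily abbreviated symbol set; a real list would come from the module's data.
VALID_IPA = set("aæbdeəɛfhiɪjklmnŋoɔprsʃtuʊvwzʒθð") | {"\u0261", "ˈ", "ˌ", "ː", "."}

# Leeway for characters that current practice allows although they are not IPA proper.
TOLERATED = set("/[]() ")

def invalid_characters(transcription):
    """Return the characters that are neither valid nor tolerated, in order, without duplicates."""
    found = []
    for ch in transcription:
        if ch not in VALID_IPA and ch not in TOLERATED and ch not in found:
            found.append(ch)
    return found

def render(transcription):
    """Return (text, categories); the text itself is never altered."""
    bad = invalid_characters(transcription)
    categories = ["IPA pronunciations with invalid IPA characters"] if bad else []
    return transcription, categories
```

With this excerpt, an ASCII g (U+0067) is reported while ɡ (U+0261) passes.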

Why IPA to X-SAMPA?

We've abandoned X-SAMPA, so I don't think it's necessary to do this. Rather, the module should pass through IPA unchanged, while converting X-SAMPA to IPA as well. The output should always be IPA. —CodeCat 15:53, 18 May 2014 (UTC)[reply]

I think the module was started before we had given up on X-SAMPA, and some people knew IPA but didn't know X-SAMPA so they created this module to produce X-SAMPA from IPA. But yeah, now what we need is a module to convert X-SAMPA (easier to type) into IPA (easier to read). Kenny's been doing some work on it already today, but see the test cases on the documentation page for some problems it's still having. I hope the bugs can be cleared up and that very soon we can replace {{IPAchar| in Template:IPA with {{#invoke:IPA|XSAMPA_to_IPA| or the like. —Aɴɢʀ (talk) 20:15, 18 May 2014 (UTC)[reply]

X-SAMPA to IPA test cases

Discussion moved from Module:IPA/documentation.

For [bɑːdʲ] above I forgot the velarization mark after the b. For some reason, when I added it to the testcases subpages, that broke the palatalization mark after the d. Why do [bɑːdʲ] and [bˠɑː] work but [bˠɑːdʲ] doesn't? Can it not handle more than one underscore per term? —Aɴɢʀ (talk) 13:52, 19 May 2014 (UTC)[reply]

No. The problem was that the underscore fell on an odd character index (or shall I say even, because Lua uses 1-based indices). mw.ustring.gsub() with the pattern .. finds every non-overlapping pair of characters, in this case "[b", "_G", "A:", "d_" and "j]". Then the function replaces each pair with an entry in a given lookup table, or leaves it alone if there is no entry in the table. And there was no entry for "d_" or "j]". So there was no replacement.
I changed the implementation to avoid string substitution, and instead scan the X-SAMPA string incrementally.
(The same thing bit me while writing transliteration modules some time ago.) Keφr
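The failure and the fix described above can be reproduced in a small Python sketch. It is an illustration only: `pairwise` imitates a substitution over fixed two-character windows, `scan` walks the string incrementally, and the four-entry table is an assumed excerpt rather than the module's data.

```python
import re

# Assumed excerpt of an X-SAMPA -> IPA table.
XSAMPA_TO_IPA = {"_G": "ˠ", "_j": "ʲ", "A": "ɑ", ":": "ː"}

def pairwise(text):
    # Imitates gsub with the pattern "..": fixed, non-overlapping two-character windows.
    return re.sub(r"..", lambda m: XSAMPA_TO_IPA.get(m.group(0), m.group(0)), text, flags=re.S)

def scan(text, table=XSAMPA_TO_IPA):
    # Walk the string, trying the longest key that matches at the current position.
    longest = max(len(key) for key in table)
    out, i = [], 0
    while i < len(text):
        for size in range(longest, 0, -1):
            chunk = text[i:i + size]
            if chunk in table:
                out.append(table[chunk])
                i += len(chunk)
                break
        else:
            # No key matches here: copy one character through unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)
```

For the input `[b_GA:d_j]`, the windows are `[b`, `_G`, `A:`, `d_`, `j]`, so `pairwise` converts only `_G`, while `scan` finds `_j` regardless of where it falls.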
OK, good, glad you figured it out! Some more tests:
  • [d̪ˠoːɟ] ← should be [d̪ˠoːɟ] (dóigh) looks OK
  • [ˈmɝdɚ] ← should be [ˈmɝdɚ] (murder) with precomposed ɝ and ɚ rather than ɜ and ə followed by the rhoticization diacritic.
Aɴɢʀ (talk) 15:07, 19 May 2014 (UTC)[reply]
Hmm. I've already updated Module:IPA/data accordingly. --kc_kennylau (talk) 15:26, 19 May 2014 (UTC)[reply]

Nonstandard IPA

There are a whole lot of obsolete and nonstandard symbols in the International Phonetic Alphabet that we could have this module convert into the standard symbols, but the only ones that I see with any regularity are g (U+0067) instead of ɡ (U+0261) and the ligatures (which Unicode confusingly calls digraphs) ʣ ʤ ʥ ʦ ʧ ʨ instead of dz dʒ dʑ ts tʃ tɕ. These seven at least should be added to the module for conversion to their standard counterparts. (Actually g → ɡ is already part of XSAMPA to IPA conversion, so we don't need to worry about it separately.) —Aɴɢʀ (talk) 14:40, 19 May 2014 (UTC)[reply]

Are you suggesting to merge Module:validate IPA into this module? Keφr 20:24, 20 May 2014 (UTC)[reply]
I guess not, since that module just puts things into a cleanup category. If someone enters {{IPA|/tiːʧ/|lang=en}} on a page, I don't want the page put into a cleanup category, I just want the page to display "IPA(key): /tiːtʃ/" as if they had entered it correctly. —Aɴɢʀ (talk) 20:48, 20 May 2014 (UTC)[reply]
I do think they should be merged. I can't really think of any reason not to merge them. Validation and conversion is really more or less a single process. —CodeCat 20:55, 20 May 2014 (UTC)[reply]
I guess I don't mind if they're put into a cleanup category, but I don't really see the point. As long as it displays correctly, that's good enough for me. —Aɴɢʀ (talk) 21:08, 20 May 2014 (UTC)[reply]
Aren’t the digraphs supposed to represent the combined sound; e.g., I think ʧ is replaced by t͡ʃ.
A cleanup category is a good idea. It can be ignored, but incorrect entry will be discouraged if editors choose to look at hidden categories or see others correct the input. Michael Z. 2014-05-21 06:21 z
We can replace ʧ etc. with t͡ʃ etc. In most languages there's no difference (in English /tiːtʃ/ and /tiːt͡ʃ/ are exactly equivalent), but in a few languages like Polish there is a difference, so t͡ʃ etc. probably makes more sense. —Aɴɢʀ (talk) 12:16, 21 May 2014 (UTC)
"teat she" is the same as "teachy"? And Polish has t͡ʂ, not t͡ʃ. Czech has t͡ʃ. Keφr 16:36, 21 May 2014 (UTC)[reply]
"teat she" is two words and wouldn't be a dictionary entry, but even if it were a dictionary-worthy phrase it could be represented /tiːt ʃi/ or /tiːt.ʃi/ as opposed to /tiːtʃi/ or /tiː.tʃi/. My point about Polish is just that it distinguishes t͡ʂ and (e.g. czy vs. trzy) so the tie bar is not optional in that language as it is in English. —Aɴɢʀ (talk) 18:28, 21 May 2014 (UTC)[reply]
English /tʃ/ in hotshoe and batshit is audibly different from /t͡ʃ/ (or /ʧ/) in achoo and ratchet. A period separator would emphasize it, but the tie bar makes the distinction. Michael Z. 2014-05-22 06:17 z
My point is merely that using the period/full stop to separate syllables means that the tie bar is not indispensable in English. I have no objection at all to people using the tie bar to transcribe English, but I personally don't use it when I transcribe English, and that doesn't make my transcriptions wrong. (Granted, I don't usually use the syllable separator either since I don't think English syllables always cleanly divide between segments, but in a word like batshit I'd pretty much have to.) —Aɴɢʀ (talk) 08:12, 22 May 2014 (UTC)
I'm curious what the actual phonetic difference between those is. —CodeCat 18:54, 21 May 2014 (UTC)[reply]
Between which? Keφr 19:02, 21 May 2014 (UTC)[reply]
Between those two Polish words. I can hear a difference in the audio pronunciations, but I can't pinpoint what the difference actually is, let alone express it in IPA. —CodeCat 19:04, 21 May 2014 (UTC)[reply]
Is this not what the tie bar is for? Reading Appendix:Polish pronunciation suggests reading w:affricate consonant. Keφr 19:10, 21 May 2014 (UTC)[reply]
I think in terms of articulation, there's just no coarticulation in trzy as there is in czy. Also, listening to the audio files, I'm gonna say the stop in trzy is dental [t̪] while in czy the stop portion of the affricate is postalveolar, maybe even retroflex [ʈ]. —Aɴɢʀ (talk) 19:30, 21 May 2014 (UTC)[reply]
I've been wondering about this as well. Since Polish /ʂ/ is retroflex, the entire affricate ought to be retroflex, so that would make it /ʈʂ/. Similarly, the English sound is properly a postalveolar affricate, so that would make its first element a postalveolar stop, for which IPA lacks a symbol. —CodeCat 19:39, 21 May 2014 (UTC)[reply]
Well, it lacks a simple symbol, but you can certainly use [t̠] for it if you want. But I'd never write /t̠͡ʃ/ at the phonemic level, and I'd probably never even write [t̠͡ʃ] at the phonetic level unless I was making a point specifically about how the starting point of the "ch" affricate is further back than a normal "t". —Aɴɢʀ (talk) 20:15, 21 May 2014 (UTC)[reply]
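The ligature normalization proposed at the top of this thread, with the tie bar as an option, might look like the Python sketch below. The six pairs are the ones listed in the opening comment; the function name and the `use_tie_bar` switch are assumptions, and whether the tie bar should be the default is exactly what the thread leaves open. The g to ɡ case is left out because it is said to be handled by the X-SAMPA conversion already.

```python
TIE_BAR = "\u0361"  # combining double inverted breve, written between the two letters

# Ligature -> two-letter sequence, as listed in the opening comment of this thread.
LIGATURES = {
    "ʣ": "dz", "ʤ": "dʒ", "ʥ": "dʑ",
    "ʦ": "ts", "ʧ": "tʃ", "ʨ": "tɕ",
}

def normalize(text, use_tie_bar=True):
    out = []
    for ch in text:
        pair = LIGATURES.get(ch)
        if pair is None:
            out.append(ch)                            # not a ligature: keep as entered
        elif use_tie_bar:
            out.append(pair[0] + TIE_BAR + pair[1])   # e.g. t + tie bar + ʃ
        else:
            out.append(pair)                          # plain two-letter sequence
    return "".join(out)
```

The input is displayed as if it had been entered in the standard form, and a cleanup category could still be added separately if wanted.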

Characters that need to be escaped in template syntax

In all MediaWiki template calls, the characters = and | have a specific meaning in the syntax, so here at Wikimedia projects, if you need to use them in templates, you replace them with {{=}} and {{!}} respectively. When I do that with this module, it works fine unless I try to subst the module invocation:

  • {{#invoke:IPA|XSAMPA_to_IPA|/"d{zl{{=}}/}} renders as: /ˈdæzl̩/ (English dazzle)
  • {{#invoke:IPA|XSAMPA_to_IPA|[i:"{{!}}\i:{{!}}\i]}} renders as: [iːˈǀiːǀi] (Zulu icici)

but:

  • {{subst:#invoke:IPA|XSAMPA_to_IPA|/"d{zl{{=}}/}} renders as: /ˈdæzlææ̩ʉʉ/ (English dazzle)
  • {{subst:#invoke:IPA|XSAMPA_to_IPA|[i:"{{!}}\i:{{!}}\i]}} renders as: [iːˈææꜜʉʉ\iːææꜜʉʉ\i] (Zulu icici)

Is there any way to fix that? Or should I just avoid subst'ing it in in those cases? —Aɴɢʀ (talk) 13:33, 22 May 2014 (UTC)[reply]

Substitute {{=}} and {{!}} as well (i.e. {{subst:=}} and {{subst:!}}). Or in the case of =, use explicitly numbered parameters: {{subst:#invoke:IPA|XSAMPA_to_IPA|1=/"d{zl=/}}. Otherwise no. Keφr 15:12, 22 May 2014 (UTC)

Taking it live

Okay, I've added the tiebar with pseudo-XSAMPA equivalent __ (i.e. two underscores) and it works at Module:IPA/testcases. Is there any reason not to edit {{IPA}} and {{rhymes}} so that they invoke this module directly, allowing XSAMPA input with IPA output? —Aɴɢʀ (talk) 11:07, 7 June 2014 (UTC)[reply]

Well, I tried it and it didn't work. Any ideas? —Aɴɢʀ (talk) 11:10, 17 June 2014 (UTC)[reply]
@Angr Both the open big bracket and the close big bracket are well-defined X-SAMPA codes, that's why you see æææɨʉʉʉ. --kc_kennylau (talk) 11:15, 17 June 2014 (UTC)[reply]
Yeah, I know; the question is, how do I get the template to realize I'm not using them that way? —Aɴɢʀ (talk) 11:17, 17 June 2014 (UTC)[reply]
As long as you provide the first parameter, it would not treat it literally. Note that other templates also take un-passed parameters literally. For example: {{cmn-pron/hom}} gives Template:cmn-pron/hom. --kc_kennylau (talk) 11:23, 17 June 2014 (UTC)[reply]

IPA pronunciations with invalid IPA characters

Which symbol is "invalid" in газовая колонка [ˈgazəvəjə kɐˈlonkə]? --Anatoli T. (обсудить/вклад) 06:20, 23 July 2014 (UTC)[reply]

The g. IPA uses a separate letter ɡ (U+0261), not the usual letter g (U+0067). The difference between them is not apparent in all fonts. —Aɴɢʀ (talk) 09:31, 23 July 2014 (UTC)[reply]
Thank you. There's still something wrong in газовая колонка. --Anatoli T. (обсудить/вклад) 10:24, 23 July 2014 (UTC)[reply]
I think there's something wrong in Module:ru-pron. I tried to fix it, but it doesn't seem to have worked. —Aɴɢʀ (talk) 10:42, 23 July 2014 (UTC)[reply]
Thank you. User:Wyang has fixed it. --Anatoli T. (обсудить/вклад) 13:03, 23 July 2014 (UTC)[reply]

A long mark in parentheses seems to produce a false positive, as in /krɛ(ː)m/ at crème#nl, where the vowel can be read short or long. (My native language, th, has many such words.) Please also check [2]. The long mark in parentheses should be valid. --Octahedron80 (talk) 07:18, 24 March 2015 (UTC)

Following the link I posted, I found more false positives where the symbol is placed after certain letters. In fact, the long mark can be placed after any vowel or consonant. Please debug them. --Octahedron80 (talk) 07:37, 24 March 2015 (UTC)
That can be avoided by listing the two variants separately, as I have done here. —Aɴɢʀ (talk) 15:49, 25 March 2015 (UTC)[reply]

Representation marks

Is it possible to arrange it so that {{IPA}} throws an error message if what it encloses doesn't begin and end with either /.../ or [...], but for {{IPAchar}} not to throw the error message? There are plenty of times (e.g. pronunciation appendices) when {{IPAchar}} doesn't need the representation marks. —Aɴɢʀ (talk) 10:49, 4 August 2014 (UTC)[reply]
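One way to read this request, as a hedged Python sketch rather than the module's actual behaviour: a single check for /.../ or [...] that the caller can switch off. The names `has_representation_marks`, `check`, and `require_marks` are hypothetical; the idea is that {{IPA}} would call it with the requirement on and {{IPAchar}} with it off.

```python
def has_representation_marks(text):
    """True if the text is wrapped in /.../ or [...]."""
    text = text.strip()
    if len(text) < 2:
        return False
    return (text[0], text[-1]) in {("/", "/"), ("[", "]")}

def check(text, require_marks):
    # require_marks=True models the stricter template, False the lenient one (an assumption).
    if require_marks and not has_representation_marks(text):
        raise ValueError("transcription must be enclosed in /.../ or [...]: " + text)
    return text
```

A mixed pair such as `/...]` is rejected as well, since the opening and closing marks have to match.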

According to the documentation for {{qualifier}}, it's meant to be used in definitions, not in pronunciation sections. Shouldn't {{accent}} be used instead? — Eru·tuon 20:14, 12 October 2016 (UTC)[reply]

The difference is in the names. Though I'd say that qualifiers belong anywhere but definitions. —CodeCat 20:16, 12 October 2016 (UTC)[reply]
Oh, I misread. {{qual}} is for lists of synonyms and such things. Yes, the difference is in the names: the two templates have the same data. But either the documentation for {{qual}} should not prohibit it being used in pronunciation sections, or {{IPA}} should deprecate the parameter |qualN= in favor of |aN=. It is inconsistent. — Eru·tuon 20:48, 12 October 2016 (UTC)[reply]
They're meant for different things. Qualifier is for things like "rare", and "rare" isn't an accent, but pronunciations can be rare. —CodeCat 20:50, 12 October 2016 (UTC)[reply]
I don't think "rare" is ever supposed to be used on its own in a pronunciation section. The variety in which a pronunciation is rare always has to be specified. — Eru·tuon 21:50, 12 October 2016 (UTC)[reply]
Or maybe I just don't know how this works... I guess documentation for {{qual}} doesn't say that {{qual|rare}} can't be used in a pronunciation section... — Eru·tuon 21:53, 12 October 2016 (UTC)[reply]
Well, 'rare', 'verb' (and so on) aren't accents so really should use {{qualifier}} while things like 'UK', 'US' (and so on) use {{accent}}. Renard Migrant (talk) 22:00, 12 October 2016 (UTC)[reply]
* {{a|US}} {{IPA|lang=en|/blabla/|/blublu/|qual2=rare}} CodeCat 22:07, 12 October 2016 (UTC)

IPA on Reconstruction pages

Many of the pages in Category:IPA pronunciations with invalid IPA characters are Reconstruction pages. Some of them were included because they had asterisks inside the bracketing (/*krap/ in the page on *krap), others outside (*[xáwis] in the page on *h₂éwis). I added some rules to allow asterisks in these positions, but it is an ad-hoc solution. Something more elegant is needed.

In addition, there are many odd characters used in reconstructions, such as capitals and superscripts (for examples, go preview the page *krap). These place some of the Reconstruction pages in the error categories. Not sure what would be the best way of dealing with this. I would rather not add a bunch of capitals and superscripts to the "valid" list in Module:IPA/data/symbols, because in most transcriptions these would be incorrect. It would make more sense to have a list of symbols that are used in a particular reconstructed language's transcription system. These lists could then be accessed when Module:IPA is adding the error categories. — Eru·tuon 10:37, 21 February 2017 (UTC)[reply]

I would rather that the characters used in reconstructions not be put inside IPA templates. PIE and PST, for example, are both conventionally written using all sorts of characters that aren't IPA, so they shouldn't be in {{IPA}} or {{IPAchar}}. —Aɴɢʀ (talk) 18:45, 21 February 2017 (UTC)[reply]
What would you call the transcription if it's not IPA? What should a new template be called? DTLHS (talk) 18:50, 21 February 2017 (UTC)[reply]
Do we need a new template at all? If they're in a proto-language for which we have no code, we could just write {{l|und||/*krap-I, kraʔ-II/}}, for example, instead of {{IPAchar|/*krap-I, kraʔ-II/}}. —Aɴɢʀ (talk) 19:09, 21 February 2017 (UTC)[reply]
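For comparison, the per-language symbol lists suggested earlier in this thread could be sketched like this in Python. Everything here is hypothetical: the language code `xx-pro`, the tiny base set, and the extra symbols are placeholders, not data from Module:IPA/data/symbols.

```python
# Placeholder base set and leeway characters, far smaller than any real list.
BASE_SYMBOLS = set("akpr") | {"ʔ"}
TOLERATED = set("/[]() ")

# Extra symbols accepted only for a given (hypothetical) reconstructed language.
EXTRA_SYMBOLS = {
    "xx-pro": {"*", "-", "I"},
}

def invalid_for(transcription, lang):
    """Return the characters not allowed for this language, in order of appearance."""
    allowed = BASE_SYMBOLS | TOLERATED | EXTRA_SYMBOLS.get(lang, set())
    return [ch for ch in transcription if ch not in allowed]
```

Capitals and asterisks then stay invalid in ordinary transcriptions while being accepted for the languages whose conventions use them.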

Suggest replacement characters

@Erutuon Do you think we could suggest replacement characters in some cases in the error message? g to ɡ for example. DTLHS (talk) 22:28, 1 March 2017 (UTC)[reply]

@DTLHS: I like the idea. It should be possible, if we create a table of replacements indexed by the incorrect characters: { ["g"] = "ɡ", ["'"] = "ˈ", }. — Eru·tuon 22:32, 1 March 2017 (UTC)[reply]
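The replacement table above translates directly into a lookup when the error message is built. A minimal Python sketch, assuming the list of invalid characters has already been computed elsewhere; the message wording and the function name are made up for illustration.

```python
# Incorrect character -> suggested replacement, following the table proposed above.
SUGGESTIONS = {
    "g": "\u0261",  # ASCII g (U+0067) -> IPA ɡ (U+0261)
    "'": "ˈ",       # apostrophe -> primary stress mark
}

def error_message(transcription, invalid):
    """Build an error message that suggests a replacement where one is known."""
    parts = []
    for ch in invalid:
        if ch in SUGGESTIONS:
            parts.append('"{}" (replace with "{}")'.format(ch, SUGGESTIONS[ch]))
        else:
            parts.append('"{}"'.format(ch))
    return "invalid IPA characters in {}: {}".format(transcription, ", ".join(parts))
```

Characters without an entry are still reported, just without a suggestion.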

Categories for words by syllable

I would suggest disabling this for any term that contains whitespace characters, since it looks weird to have a "16-syllable word" that is actually several words. SURJECTION ·talk·contr·log· 22:11, 12 June 2018 (UTC)

Note that I do recognize the existence of nocount, but perhaps it should apply automatically for some languages (the ones that don't use spaces to separate syllables, but words) if the term contains spaces. SURJECTION ·talk·contr·log· 22:25, 12 June 2018 (UTC)[reply]

...on second thought, it is probably better to implement these on the language specific IPA templates/modules. SURJECTION ·talk·contr·log· 22:27, 12 June 2018 (UTC)[reply]