Wiktionary talk:Votes/pl-2012-02/Handling of superscript and subscript letters

The voting system is confusing. We've got an option 1, an option 2 and presumably a status quo, but options to Support, Oppose or Abstain (from what)?--Prosfilaes 02:00, 20 February 2012 (UTC)

Oh... I forgot to make separate Support-Oppose-Abstain sections for each option. - -sche (discuss) 02:02, 20 February 2012 (UTC)
Alright, how's it look now? - -sche (discuss) 02:22, 20 February 2012 (UTC)

re Option 1

If option 1 passes, we will be in an awkward position if we find e.g. "ад" in superscript in a Russian book: will we write "ᵃ<sup>д</sup>"?! - -sche (discuss) 02:25, 20 February 2012 (UTC)

There's "COMBINING CYRILLIC LETTER DE". When combined with a non-breaking space, it becomes a makeshift modifier letter. -- Liliana 11:12, 20 February 2012 (UTC)
What about "й" or "ζ"? You see the point I'm trying to make: there aren't modifiers for everything. Also, what is "COMBINING CYRILLIC LETTER DE" supposed to be used for? - -sche (discuss) 01:50, 21 February 2012 (UTC)
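The gap raised in this thread can be checked against the Unicode Character Database. The following is a minimal sketch, assuming Python's standard `unicodedata` module; the helper name `superscript_forms` is hypothetical, and the result depends on the Unicode version bundled with the interpreter. It lists the characters whose compatibility decomposition is `<super>` plus a given base letter, which is one way to see whether letters such as й or ζ have a dedicated raised form.

```python
import sys
import unicodedata


def superscript_forms(base: str) -> list[str]:
    """Return characters that decompose to '<super> base' in the UCD.

    Hypothetical helper for illustration; it scans every code point, so it
    is slow but needs no external data.
    """
    wanted = f"<super> {ord(base):04X}"
    return [
        chr(cp)
        for cp in range(sys.maxunicode + 1)
        if unicodedata.decomposition(chr(cp)) == wanted
    ]


if __name__ == "__main__":
    for letter in ["r", "д", "й", "ζ"]:
        forms = superscript_forms(letter)
        names = [unicodedata.name(ch, "?") for ch in forms]
        print(f"{letter!r}: {names or 'no <super> compatibility character found'}")
```

An empty result only means that this interpreter's Unicode data has no such compatibility character; it says nothing about combining marks such as COMBINING CYRILLIC LETTER DE, which are encoded differently.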

another option

I think there should be an option of allowing Unicode superscript characters iff they're meant as superscript characters (not allowing the modifier letters). This, in case Unicode adds such characters (or has them already in some script (as I think it does for (some?) numbers)).​—msh210 (talk) 01:11, 21 February 2012 (UTC)

Wouldn't it be better to replace option 2 with that? Alternatively, might it not be more appropriate to wait until Unicode adds such characters and reconsider our policy at that time, after having a look at them? - -sche (discuss) 01:47, 21 February 2012 (UTC)
I don't quite get what you mean? Do you mean we should ban ʷ and ʲ from IPA usage because they aren't superscripts? -- Liliana 02:28, 21 February 2012 (UTC)
I was not thinking of banning them from IPA usage. I was thinking of allowing future Unicode characters that are meant as superscript-styled normal letters.—msh210 (talk) 00:24, 22 February 2012 (UTC)

Oppose votes.

I tend to feel that every vote should allow editors to explicitly oppose, if only as a sort of "meta-opposition" or "objection" to the vote itself ("I don't think this is a valid subject for a vote", "I don't think enough time was given for discussion", "I don't think this vote is set up properly", "I object to the vote's language about a 'two-thirds majority' because I think that's the prerogative of a closing admin", etc.). This is not always the same as supporting some different option, unless we stretch the word "option" a bit beyond what's intuitive ("I support the option of deleting this page", "I support the option of having further discussion and then redoing this vote", "I support the option of redoing this vote, except setting it up properly", "I support the option that is equivalent to one of the above options but without being subject to the 'two-thirds majority' cutoff", etc.). —RuakhTALK 23:08, 21 February 2012 (UTC)

OK, I'll revert my removal of the "oppose" section. (Now, I'm not sure if I should leave the "I support something else" section, or let it be part of "oppose". I suppose the latter.) Re the "two-thirds" part: I included that out of concern that the question might arise "option 1 has 6 votes, option 2 has 2 votes, and the status quo has 3... does option one pass because it has a majority over option 2 (6:2) or fail because it has the support of only barely more than half the voters (6:5)?" I admit it's a clumsy clause... If you think we're better off without it, please remove or rewrite it. - -sche (discuss) 00:53, 22 February 2012 (UTC)
Don't worry, I'm not actually opposed to that; it was just an example (but potentially a realistic one: a certain editor, who shall remain nameless, once started a vote between two options, with the prefatory text saying that whichever option got a simple majority would win, and I strenuously objected to that). —RuakhTALK 02:41, 22 February 2012 (UTC)
I absolutely agree. Every vote has to have an oppose section, containing the keyword "oppose". --Dan Polansky 10:58, 22 February 2012 (UTC)

Mʳ and its ilk

How will this affect entries like Mʳ, which actually use the modifier letters in question? — Raifʻhār Doremítzwr ~ (U · T · C) ~ 17:26, 22 February 2012 (UTC)

If I understand it, it would be moved to Mr and <sup> tags would be used in the headword line. —CodeCat 17:53, 22 February 2012 (UTC)
Your interpretation seems to contradict the part of the vote that states that "[n]either option affects the use of Unicode modifier letters to reproduce modifier letters". Mʳ is supported by five citations, all from Usenet, which unambiguously use the modifier letter 〈ʳ〉. I take it that that means that Mʳ stays. In that case, does that mean that the printed examples of M r will antedate or M r (in Mr)? — Raifʻhār Doremítzwr ~ (U · T · C) ~ 17:23, 26 February 2012 (UTC)
Because some people will copy-and-paste terms to look them up, I'm not opposed to including (as {{alternative form of}}s) terms like Mʳ which are specifically attested in that form. They're like haĉek, IMO: technically incorrect but attested, to be included and given usage notes. - -sche (discuss) 18:51, 26 February 2012 (UTC)
Yeah, that seems like a pretty reasonable solution. — Raifʻhār Doremítzwr ~ (U · T · C) ~ 23:52, 26 February 2012 (UTC)

The descriptive dictionary attests English usage, including its spelling. Character encoding, on the other hand, is not part of the living language, but an issue of technical representation, governed by prescriptive standards. These are encoding errors, not alternate spellings. Michael Z. 2012-03-12 05:48 z

What do we do for people who copy-and-paste search Mʳ? — Raifʻhār Doremítzwr ~ (U · T · C) ~ 14:29, 12 March 2012 (UTC)
I'm not opposed to having a redirect or an “alternative form” listing (preferably with a note about the encoding problem), as long as it's something that renders in browsers. Michael Z. 2012-03-12 14:41 z
What might the rendering problem be? And what can we do to solve it? — Raifʻhār Doremítzwr ~ (U · T · C) ~ 15:05, 12 March 2012 (UTC)
In this case I don't foresee one. I'm thinking of other cases where people have been using character encoding from other systems, so in Unicode a character may display as a completely different one (as in the custom encoding requiring a custom font in the nonsensical sample text in Fraktur), or as some kind of non-displaying control character (as I think can happen when copying text entered in Windows-1252 encoding). Michael Z. 2012-03-12 19:00 z
I'd reserve the phrase "encoding error" for an actual technical encoding error, not an unusual choice. In this case, it's a willful choice, that displays as intended. If the French want to use U+02B3 as part of the set of characters they use in writing French, we have to accept that. We can include Usage notes, but it's an intentional choice and we should respect that.--Prosfilaes (talk) 22:51, 12 March 2012 (UTC)
Give a man a dictionary, and he thinks everything is lexicography! Technical standards like Unicode are not living languages, whose usage is to be documented and evolve over time. They are prescriptive. All the words in the standards document comprise the standard, and directly contravening them is an error, plain and simple. Errors happen, but to systematically enforce them in an open Web project would be appalling.
Have “the French” codified a spin-off project of Unicode, and did the Wikimedia community adopt it while I was sleeping? Or did some guy just write something in a few Usenet postings, and make it try to look right? We know what letters he meant to write, and we know how to represent them in Unicode.
If you're not happy with HTML and Unicode as our chosen standards, then join a committee to rewrite MediaWiki software or to amend Unicode to your liking with consensus, but please don't try to establish a tradition of subverting Web standards. Michael Z. 2012-03-13 01:45 z
Dude, I've been on the Unicode mailing list for a decade now. When it was pointed out that the Lakota were using a capital form of LATIN SMALL LETTER N WITH LONG RIGHT LEG, they didn't get told they were wrong, LATIN CAPITAL LETTER N WITH LONG RIGHT LEG was added. Unicode does not try and dictate orthographies to users. If Americans decide that 0 is a letter (e.g. pr0n), the Unicode experts will grumble that this is a bad choice and that it will stay category "Number, Decimal Digit [Nd]" and continue to have BIDI properties of "European number", but they won't go out and ban it. Tidy does not spit out an error for pr0n or Mʳ, and never will.
I'm happy to dispose of questions about interpreting platonic characters from ink blots in exchange for perfectly clear digital codes. If someone wrote some characters on computer, we know exactly what Unicode code points they used, and in this case it is completely clear that they intended to use those characters.--Prosfilaes (talk) 05:27, 13 March 2012 (UTC)
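The properties mentioned in this exchange can be read straight from the Unicode Character Database. A minimal sketch, assuming Python's standard `unicodedata` module; the helper name `describe` is made up for illustration:

```python
import unicodedata


def describe(ch: str) -> dict[str, str]:
    """Collect a few UCD properties for one character (illustrative helper)."""
    return {
        "code_point": f"U+{ord(ch):04X}",
        "name": unicodedata.name(ch),
        "category": unicodedata.category(ch),         # general category, e.g. Nd or Lm
        "bidi_class": unicodedata.bidirectional(ch),  # bidirectional class, e.g. EN
    }


if __name__ == "__main__":
    # The digit in "pr0n" and the modifier letter in "Mʳ".
    for ch in ["0", "\u02b3"]:
        print(describe(ch))
```

The digit is reported with general category `Nd` and bidi class `EN`, while U+02B3 is reported as `Lm`, a letter category. The properties themselves do not settle whether either use is advisable.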
Dude, unlike your examples, what we're talking about is something Unicode 6 says not to do. Michael Z. 2012-03-13 13:53 z
The Unicode Standard is very clear that 0 is not a letter, that it's a number with all the properties thereof. Yet when people abuse it as a letter, we record it as such and nothing bad happens. If someone writes a French word as U+004D U+02B3, then that's what it is. I don't suggest that we use them, any more than we should start spelling words with 0, but we certainly should record it as such.--Prosfilaes (talk) 22:45, 13 March 2012 (UTC)
I very much agree with Michael that we should not misuse Unicode; thus, we should not use modifier letters for the superscripts in old books. I also very much agree with Prosfilaes that citations from Usenet, the technical encoding of which is discernible, should have copy-and-paste entries: so if "Mʳ" is attested (which it is), we should have an entry [[Mʳ]] soft-redirecting people to Mr (or to some unsupported titles/ entry that will display Mr). - -sche (discuss) 23:06, 13 March 2012 (UTC)
we certainly should record it as such – Prosfilaes, a dictionary records words, terms in the language, including their usage and spelling. So it records that the abbreviation of Mister is spelled Mr.
To some extent, we also record styling, especially in our quotations. Italics, quotation marks, punctuation, etc., all help indicate the writer's intent and the meaning of the words they are applied to. So we attest the form of Mister with a raised letter, Mr (although some high-quality professional dictionaries do not).
But lexicography doesn't concern itself with whether the source of the raised letter was printed with a smaller r type with some of the metal kerned away, or a dedicated superscript type. The words and letters remain the same whether they are set in a metal slug with swashes, whether in a font engraved by Goudy or Garamond, whether in hot metal, phototypeset, digital type, from a typewriter or dot-matrix printer. The r remains a letter r whether it's chiselled in stone, impressed in clay tablets, papyrus, lambskin vellum, PDF, or ePub, whether carved in 1-inch wood type for posters, printed from an etched copper plate, or originally stored in ASCII, DOS CP-866, Windows CP-1252, or Unicode. (And some of our sources have been durably recorded several of these ways.) It doesn't matter if some schmoe chatting on Usenet happened to copy something that looked right to reproduce a raised letter r – it is still a raised letter r, regardless of the technical details.
But when we make a website based on HTML and Unicode standards, to be read in any web browser, screen reader, or braille display, then it does matter very much that we reproduce the raised English Latin letter r not with a character that represents an intrusive (linking) r phoneme, or indicates a preceding r-coloured phoneme, but with a raised Unicode Latin letter r. Michael Z. 2012-03-14 02:35 z
I have a feeling you've missed the (pragmatic) crux of the opposing argument. We should include Mʳ because, as it says in the CFI, "it's likely that someone would run across it and want to know what it means". We can nevertheless render its entry a sort of soft-redirect, and include a usage note explaining why the authors used ʳ, why they ought not (technically speaking) have done so, and what ʳ is meant to be used for. Hopefully, all that is uncontroversial. — Raifʻhār Doremítzwr ~ (U · T · C) ~ 03:20, 14 March 2012 (UTC)
No, I have no problem with creating a search target, as long as we don't also blindly (or wilfully) duplicate and perpetuate non-standard technical practices. Michael Z. 2012-03-14 14:03 z
So if someone uses Mʳ on Usenet, it's just a raised letter r, but if we use it, it's something else?
If we make a webpage with an entry for Mʳ, people should expect and get the exact same results as when they read the Usenet post that they're trying to look up here, be it via braille display or screen reader. It may actually be more confusing to silently change it on them, since they can't visually verify that Mʳ looks like Mr.
Doremítzwr presents a purely pragmatic view, but it's not purely pragmatic to me. I have spent a good part of my life doing stuff for Project Gutenberg, including stuff like [1]. If you ask me whether that's a faithful transcription, I have to hem and haw. There were choices we had to make, and choices that were forced upon us by Unicode and HTML (neither of which were designed for 16th century books). I'm not sure that I can be sanguine about our transcriptions of the Early English Text Society books, either; I suspect between the 19th century scholars forcing the manuscripts into book form, and us forcing the books into Unicode HTML, that we've lost, obscured or mangled many features of the original.
So now, when I can work with something born digital, and can just copy a string of Unicode code-points, it's a pleasure. No longer do I have to worry about the Platonic identity of blobs of ink; when someone writes "Mʳ", I can simply report that as "Mʳ", no mangling, no second-guessing, no interpretation, no transcription.--Prosfilaes (talk) 05:12, 14 March 2012 (UTC)
Digital sources are not going to keep us from having to think about what we do, even ones that are coded in (questionable) Unicode. The standard is a 550-page book, and I don't think its existence magically makes things work by ignoring it, nor especially by willfully violating it, as this vote proposes to do. Michael Z. 2012-03-14 14:03 z
Digital sources let us copy them exactly and not mangle them. I fail to see that mangling data when we don't have to is a good thing. Again, we've already been down this road; pr0n uses a digit for a letter, which is surely worse than using a modifier letter as a letter. There is no technical problem with using ʳ here; any Unicode-compliant system must accept arbitrary strings of Unicode code-points, even including code points unassigned in their version of the standard. (And if your copy of the standard is only 550 pages, I'm not sure it can be taken as authoritative; my copy of Unicode 3 is over a thousand pages, and therefore is obviously more authoritative.)--Prosfilaes (talk) 22:49, 14 March 2012 (UTC)
Digital sources let us copy them exactly and not mangle them – If they are not in well-formed UTF-8, then we will have to “mangle” them, unless you want Wiktionary to be full of little rectangles or question marks, returning HTML validation errors, and potentially failing to display text or load pages. Sources exist in EBCDIC, MARC-8, Mac Cyrillic, KOI8-U, UTF-32, Unicode with deprecated characters, and a thousand other encodings, with HTML, PDF, WordPerfect, LaTeX, Setext, and a hundred other markup or formatting systems, and the idea that we will someday be able to “copy them exactly” without another thought is shortsighted.
pr0n uses a digit for a letter, which is surely worse than using a modifier letter – I think the writer's intent is pretty clear: “pr0n” is not “porn,” but a self-consciously jocular, cyber-bowdlerized expression of porn. It's nothing like Mr with an IPA character, in a medium where HTML is not used. Here the writer's intent is obviously to write Mr with a raised letter R, and not to indicate an “r-coloured pronunciation of M.”
my copy of Unicode 3 is over a thousand pages – You win. Michael Z. 2012-03-15 17:38 z
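The re-encoding step described in this comment can be illustrated with a short sketch. It assumes Python's built-in codecs for a few of the legacy encodings named above (`koi8_u`, `mac_cyrillic`, `cp1252`); the helper `reencode` and the sample word are hypothetical, chosen only for illustration.

```python
def reencode(raw: bytes, source_encoding: str) -> str:
    """Decode legacy bytes into a Unicode string (hypothetical helper).

    The caller has to know the source encoding; the bytes do not say.
    """
    return raw.decode(source_encoding)


if __name__ == "__main__":
    original = "ад"
    legacy = original.encode("koi8_u")  # bytes as a KOI8-U source would store them

    print(reencode(legacy, "koi8_u"))   # decoded with the matching table
    print(reencode(legacy, "cp1252"))   # decoded with the wrong table: unrelated letters
```

The same bytes yield different text depending on the table used to read them, which is why text from such sources has to be converted deliberately before it can be stored as UTF-8.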
┌─────────────────────────────────┘
The problem of modifiers displaying as "little rectangles or question marks, returning HTML validation errors" is a general one, i.e., they will display as rectangles whether they are in IPA sections or headwords. - -sche (discuss) 20:56, 15 March 2012 (UTC)
I'm not saying that modifiers cause display problems (they don't on my machine). I'm saying that we still have to re-encode text for many reasons (mischaracterized as “mangling”), and that will never change. Michael Z. 2012-03-16 16:22 z
Unicode is designed so that there is no ambiguity in converting most character sets to Unicode. (We certainly shouldn't be mangling deprecated Unicode codepoints; they're still part of the standard for a reason.) I don't get your exception; it's okay to use 0 as a letter (which it's not), but not ʳ as a letter (which it is) provided you have the right state of mind? Within the bounds of CFI, we could attest a lot more of l33t5p34k than we do. I'm guessing we could attest a whole lot of the w:Arabic chat alphabet from Usenet, which again abuses digits as letters.--Prosfilaes (talk) 22:47, 15 March 2012 (UTC)
Somebody spelled a word “pee–are–zero–en,” and entered all of the intended characters correctly. Somebody else spelled a word “em–are,” but entered a character that actually represents r-colouring of the preceding phoneme in IPA, in violation of Unicode's standard, because that character resembles what was intended. If you want, we can indicate in a note that this happens (if there are three independent instances), but it is simply an attestation of the spelling “em–are”, with a raised “are”. The dictionary attests spellings; it does not attest usage of HTML, text encoding, metal type, penmanship, or other non-lexical technical details. (Am I still being unclear?) Michael Z. 2012-03-16 16:22 z
Aha, I see what you're getting at, Michael. It's perhaps a bit like the WT:About Japanese discussion of whether to use half- or fullwidth numbers (1 vs １), where we could decide to use 1 even if １ (speaking hypothetically) were more common on Usenet, because they're both just representations of 1, and ʳ in Mʳ is just a representation of a raised r. But even if we make 10日 the standard, we'll have entries for １０日: and even though we're poised to make Mr the standard, I expect that community consensus will be to still have entries for Mʳ, etc (when attested), because people will copy-and-paste what they find on Usenet and look it up here. And really, it takes enough technical skill to locate ʳ (it's not on the keyboard) that I must conclude its use is conscious. Actually, we all seem to be in agreement (am I understanding your last comment correctly?): have entries for strings like Mʳ when they're attested, and explain that they're technically improper. - -sche (discuss) 07:58, 17 March 2012 (UTC)
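One way to picture the "just representations" point is Unicode compatibility normalization. The sketch below assumes Python's standard `unicodedata.normalize`; using NFKC to pick a soft-redirect target is an illustration only, not something proposed in this discussion, and `redirect_target` is a hypothetical name.

```python
import unicodedata
from typing import Optional


def redirect_target(attested: str) -> Optional[str]:
    """Return the NFKC-folded form if it differs from the attested string.

    Hypothetical helper: None means the string is already in folded form.
    """
    folded = unicodedata.normalize("NFKC", attested)
    return folded if folded != attested else None


if __name__ == "__main__":
    # A modifier-letter spelling, a fullwidth-digit spelling, and a plain one.
    for title in ["M\u02b3", "\uff11\uff10\u65e5", "Mr"]:
        print(repr(title), "->", redirect_target(title))
```

NFKC folds both the modifier letter and the fullwidth digits to their ordinary counterparts, but it also discards the fact that the r was raised, so any superscript styling in the target entry would still have to come from markup.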
Why did you say they spelled a word "pee-are-zero-en" instead of "pee-are-oh-en"? How do we know that? Looking at l33t5p34k, I don't know whether most of the users would spell it "l-e-e-t-s-p-e-a-k with numbers substituted for the letters" or "l-3-3-t-5-p-3-4-k", but I suspect there would be people on both sides.
The dictionary attests spellings. The difference is that you look at Mʳ, say that the spelling is {{Platonic entity that bears some relationship to U+004D}} {{Platonic entity that bears some relationship to U+0072}} and map that to U+004D U+0072. I look at Mʳ and say that the spelling is U+004D U+02B3. The fact that we can play around with mappings to ill-defined Platonic entities, like we had to in the analog days, doesn't mean that we should; let us accept that the fundamental elements of spelling in a 21st century digital environment are Unicode codepoints and go on with our lives.--Prosfilaes (talk) 11:10, 17 March 2012 (UTC)
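For a born-digital citation, the two readings being contrasted here can be written out as code point sequences. A minimal sketch in plain Python; the function name `code_points` is an assumption made for illustration:

```python
def code_points(text: str) -> list[str]:
    """List the code points of a string in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]


if __name__ == "__main__":
    as_typed = "M\u02b3"   # the string as it appears in the Usenet posts
    as_mapped = "Mr"       # the string after mapping the raised letter to plain r

    print(code_points(as_typed))
    print(code_points(as_mapped))
    print(as_typed == as_mapped)  # the two strings are distinct page titles
```

Because the two sequences differ, a lookup for one does not find the other unless an entry or redirect exists for it.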
What is a “platonic entity?”
Let me ask you this: in the 1856 document reproduced in File:Majty.jpg, what is the last letter used to write Govr? Is it the same or a different letter than the one someone used to write “Mʳ” in a Usenet posting? Michael Z. 2012-03-18 00:35 z
A platonic entity, form or ideal is from Plato's w:Theory of Forms; it's something that's abstract and non-material, yet somehow real.
We agree to represent what was printed in that document as a r (U+0072) styled superscript. That doesn't make it true; that's just a convenient abstraction. I'd agree that the printed letter is some sort of R, like e, é and Ë are some sort of E, at least in English. Whether or not it should be transcribed as U+02B3 is a conventional thing; we've agreed that it should not. I think that assigning a Platonic ideal to ʳ is pointless, since we can represent it directly. Of course we should state that it's nonstandard and proscribed.--Prosfilaes (talk) 07:56, 18 March 2012 (UTC)
Okay, we may be moving towards a mutual understanding, but I don't agree with some of your apparent assumptions. If there is a platonic ideal here, it is a concept of a lexical entity, a word or symbol that existed before Unicode or digital text were ever conceived, and not fundamentally changed by the existence of Unicode and other encodings. Although Unicode attempts a very broad model for representing text, it is just one model alongside things like inked brush strokes and impressions of cast metal type, and it is not authoritative over or superseding them. Furthermore, Unicode is not everything, because representation of language includes visual attributes represented otherwise, like position and visual relationship, size, style, typographic weight, typeface, orientation, etc.
And note the simple fact that Unicode's dedicated modifier-letter code points don't include an entire uppercase or lowercase alphabet, Latin, Greek, Cyrillic or other (and probably never will), so they can't constitute a system for representing superscript alphabetic writing. HTML <sup> or an analog is a requisite part of such a system.
(What we naïvely call “superscripts” and “subscripts” isn't one thing, anyway: in typography, there are various raised and lowered-glyph entities, often treated differently, and varying between different languages and writing systems, including alphabetic abbreviations, ordinals, references for notes, specialized symbols, pronunciation modifiers, numerical fractions, parts of equations, etc.) Michael Z. 2012-03-18 16:37 z
Unicode is not everything, but it is our base, and as much as possible I'd like to transfer data from Unicode to Unicode, not Unicode -> Platonic interpretation -> Unicode. I'm not arguing for using them ourselves, so completeness is irrelevant; I'm merely arguing for recording how others use them. Usage notes, about how it might cause apoplexy in your readers, are entirely appropriate.--Prosfilaes (talk) 23:54, 18 March 2012 (UTC)
General comment, not in response to any specific comment above: just to be sure we're all on the same page going into this: this vote explicitly does not concern whether or not we have entries like Mʳ. It neither allows nor forbids them. - -sche (discuss) 07:58, 17 March 2012 (UTC)

Templates for superscribing and subscribing

Currently, simply using <sup> and <sub> tags creates (respectively) superscript and subscript characters that are too big, respectively too high and too low, and, in the case of italicised superscripts, causes the affected text to appear on top of the normal character to its left; moreover, both tags cause line spacing problems. In an attempt to mitigate these problems, I have created {{SUP}} and {{SUB}}. (The names {{sup}} and {{sub}} cannot be used because those names are reserved for ISO codes, {{sup}} potentially and {{sub}} actually, viz. Suku.) By combining the <sup> and <sub> tags with <small> tags, the templates render super- and subscripts of the correct proportions; furthermore, calling the second parameter of {{SUP}} adds a hair space before the superscript text, solving the italicisation problem. However, line spacing is still a problem. Could someone who knows how, please modify {{SUP}} and {{SUB}} to solve the line-spacing problem? If it is solved, we shall have means of effecting super- and subscripts that display properly without resorting to using modifier letters. — Raifʻhār Doremítzwr ~ (U · T · C) ~ 00:51, 3 March 2012 (UTC)

I don't think you should use a template to do things like insert invisible characters or non-semantic html. (A hair space or other whitespace turns Majty into Maj ty, and the HTML small element “represents side comments such as small print”[2]) If a browser, or rendering engine, or font doesn't display the text correctly, that is its bug, and we shouldn't be inserting invisible content to work around it. It may be that my browser and font display things just as we expect, even if yours doesn't, but using such hacky “fixes” changes the actual text that is indexed by search engines, cut and pasted, or read out by assistive software. (In fact superscripts and subscripts display fine in my browser, except for the line-height problem, which I've fixed with my user CSS.)
The integrity of the content is most important. Style problems should be fixed with a styling solution like CSS. If the letters are too close together, then change the letter-spacing or add a left-margin. If vertical-alignment wrecks the line-height, then use position: relative instead.
The line spacing problem can be fixed using CSS position: relative (instead of vertical-align: super or sub), to raise or lower the displayed glyph without expanding the line-height. See my summary at w: User:Mzajac/monobook.css/Superscript_fix – it's old, but I think it still works fine. Michael Z. 2012-03-11 22:18 z
Even simpler: try pasting the following in your user style sheet (User:Doremítzwr/vector.css). If it works, then we can add it to MediaWiki:Common.css to fix the appearance of all superscripts and subscripts, for all users.
sup, 
sub { 
 line-height: 1em; /* neutralize extra line spacing */
 margin-left: 0.2em; /* typographic thin space */
 font-size: 0.7em; /* default superscripts and subscripts are “smaller”, or about 0.8em */
 }
 Michael Z. 2012-03-11 23:22 z
That looks good. The character height, character size, and line spacing are all correct using that coding. However, the hair space was a little too wide, so I changed that to 0.1em, which looks better. Said hair space is only necessary for superscripts when the preceding regular text is italicised; is there any way to restrict the addition of the hair space just to superscript text with preceding italic text? Or, failing that, to superscripts only, and not to subscripts? — Raifʻhār Doremítzwr ~ (U · T · C) ~ 14:28, 12 March 2012 (UTC)
Here's code to add the space to superscripts only.
I'll think about how to add it for italics only. Do we want the space when both preceding text and superscript are italicized, or only preceding text, or both situations? Michael Z. 2012-03-12 14:44 z
sup, 
sub { 
 line-height: 1em; /* neutralize extra line spacing */
 font-size: 0.7em; /* default superscripts and subscripts are “smaller”, or about 0.8em */
 }
 
sup { 
 margin-left: 0.2em; /* typographic thin space */
 }
Both situations. — Raifʻhār Doremítzwr ~ (U · T · C) ~ 15:06, 12 March 2012 (UTC)
Okay. I'm updating the code again. I think margin is better than padding for this, but in my browser margin-left: 0.2em makes the gap match padding-left: 0.1em, for some reason. Michael Z. 2012-03-12 15:10 z
Actually, sorry to be a pain, but, looking again, the superscripts are too high and the subscripts too low. Please see User:Doremítzwr/super- and subscript æquivalencies; the first column uses the Unicode modifier letters, the second column uses <sup> and <sub> tags, and the third column uses {{SUP}} and {{SUB}}. The superscripts in the third column are the perfect height and size, whereas the subscripts in the first column are the perfect height and size. Can you make it so that the superscripts in the second column appear like those in the third column and that the subscripts in the second column appear like those in the first column? — Raifʻhār Doremítzwr ~ (U · T · C) ~ 15:26, 12 March 2012 (UTC)
No worries. I think I can do that, but probably can't get to it for a day or so. What browser and OS are you using? — it's possible that what I see is quite different from what you see. Michael Z. 2012-03-12 16:00 z
Great, thanks. I'm using Firefox on Windows Vista. — Raifʻhār Doremítzwr ~ (U · T · C) ~ 16:46, 12 March 2012 (UTC)
The following code makes <sup> and <sub> identical to your respective samples in Safari/Mac. Try it.
Minor caveat: it only accounts for the usual ''tick marks''-generated italics for now. If it's successful, I could add code to account for em and cite tags and other kinds of italics.
Major caveat: the detailed tweaks to size and position the letters work for my browser's default font and its metrics. Different browsers use different fonts based on their default settings, their OS's installed fonts, and user preferences. Different fonts will give different results. This will give unpredictable results, and is probably not suitable for the public style sheet. Michael Z. 2012-03-16 19:44 z
/* fix raised and lowered text */
 
sup, 
sub { 
        line-height: 1em; /* neutralize extra line spacing */
        font-size: 0.7em; /* normally, superscripts and subscripts are css “smaller”, or about 0.83em */
        }
 
sup {
        position: relative;
        left: 0;
        bottom: 0.1em;
        }
 
        i+sup, 
        i>sup { /* with italics */
                margin-left: 0.2em; /* typographic thin space */
                }
 
sub {
        position: relative;
        left: 0;
        top: 0.2em;
        }
I tried it out, but there's one clear problem primâ facie — it adds the margin before the superscript when the superscript only is italicised; it should only add the margin when the preceding text is italicised. Is that fixable? Sorry for the terse response; I'll respond more fully tomorrow. — Raifʻhār Doremítzwr ~ (U · T · C) ~ 10:28, 20 March 2012 (UTC)
See if this fixes it. I haven't had time to test this one. Michael Z. 2012-03-22 15:37 z
/* fix raised and lowered text */
 
sup, 
sub { 
        line-height: 1em; /* neutralize extra line spacing */
        font-size: 0.7em; /* normally, superscripts and subscripts are css “smaller”, or about 0.83em */
        }
 
sup {
        position: relative;
        left: 0;
        bottom: 0.1em;
        }
 
        i+sup, 
        i>sup { /* with italics */
                margin-left: 0.2em; /* typographic thin space */
                }
  
                i>sup:first-child { /* but no margin if the superscript is the first item in italics */
                        margin-left: 0;
                        }

sub {
        position: relative;
        left: 0;
        top: 0.2em;
        }
I've done my own tweaking and ended up with this:
/* fix raised and lowered text */

sup, 
sub { 
        line-height: 1em; /* neutralize extra line spacing */
        font-size: 0.7em; /* normally, superscripts and subscripts are css “smaller”, or about 0.83em */
        }

sup {
        position: relative;
        left: 0;
        bottom: 0.1em;
        }

        i+sup, 
        i>sup { /* with italics */
                margin-left: 0.2em; /* typographic thin space */
                }
 
                i>sup:first-child { /* but no margin if the superscript is the first item in italics */
                        margin-left: 0;
                        }
sub {
        position: relative;
        left: 0;
        bottom: 0.3em;
        }
However, the characters are too bunched together when both are italicised; I know not why. How much of this is going into the public style sheet? — Raifʻhār Doremítzwr ~ (U · T · C) ~ 17:22, 26 March 2012 (UTC)
Nothing, yet. Michael Z. 2012-03-26 19:29 z
I meant how much will be…? — Raifʻhār Doremítzwr ~ (U · T · C) ~ 20:09, 26 March 2012 (UTC)
Good question. I think the line-height fix might be a good idea for all superscripts. It improves the design in several browsers, and is probably otherwise harmless. The font-size fix is overriding the browser's default rendering, but I'm not sure that browser makers have done a great job with this in the first place.
The spacing adjustments, on the other hand, are overriding the detailed rendering of the particular typefaces we are using, and they might be just as likely to make rendering worse in some browser–OS combination as they are to make it better in ours. Also, the CSS can't account for every combination of markup that italicizes the superscripts – I can probably work around the bunched-together problem you mention, but there are probably dozens of other combinations of em, i, cite, and other tags that I can't account for without writing dozens of over-specialized CSS declarations. Michael Z. 2012-04-01 01:01 z
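One possible explanation for the bunching reported above, offered as an untested sketch rather than a diagnosis of the style sheet actually in use: in CSS, `:first-child` and the `+` combinator look only at element nodes and ignore text nodes. The selectors below are copied from the drafts; the markup in the comments, including the sample words, is hypothetical.

```css
/* Matches a sup that is the first child ELEMENT of an i.
   Text nodes do not count, so both of these match and lose the margin:
     <i><sup>ty</sup></i>        superscript alone in italics
     <i>Maj<sup>ty</sup></i>     italic text before the superscript
   In the second case the thin space set by i>sup is reset to 0. */
i>sup:first-child {
        margin-left: 0;
        }

/* Likewise, the adjacent-sibling combinator skips intervening text,
   so this matches <i>Maj</i><sup>ty</sup> but also
   <i>Maj</i> and so on<sup>1</sup>, where upright text sits between. */
i+sup {
        margin-left: 0.2em;
        }
```

Because a selector cannot test whether text precedes an element, plain CSS has no direct way to tell the first two cases apart. Marking up the italic text and the superscript as separate sibling elements is one way around it, at the cost of more markup.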
After having looked at User:Doremítzwr/super- and subscript æquivalencies on a number of computers, I am disappointed to find that display is frequently inconsistent vis-à-vis italicisation and horizontal spacing. The line-height fix is the only thing that always works. Oh well, perhaps this is something we can't fix completely. Could you go ahead and make the appropriate changes to the public CSS please? — Raifʻhār Doremítzwr ~ (U · T · C) ~ 14:19, 5 April 2012 (UTC)
Adding the line-height fix to MediaWiki:Common.css. Not changing the font-size and distance above the baseline (so the browser sets them). Michael Z. 2012-04-06 05:34 z
Thanks. — Raifʻhār Doremítzwr ~ (U · T · C) ~ 10:22, 9 April 2012 (UTC)
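For reference, the line-height portion of the drafts above would look something like the following on its own. This is a sketch assembled from those drafts; the exact rule that went into MediaWiki:Common.css is not quoted in this discussion.

```css
/* fix raised and lowered text: neutralize extra line spacing only */
sup,
sub {
        line-height: 1em;
        /* font-size and vertical offset are left to the browser */
        }
```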

Non-standard

Are we really taking a vote to contravene both HTML and Unicode standards? Seriously?

  • HTML5: “The [HTML] sup element represents superscript.”[3]
  • HTML5: “The sub element represents subscript.”[4]
  • Unicode 6.0: “Superscript modifier letters are intended for cases where the letters carry a specific meaning, as in phonetic transcription systems, and are not a substitute for generic styling mechanisms for superscripting of text, as for footnotes, mathematical and chemical expressions, and the like.” (p 229)[5]

They're standards, people! You can't just do something they explicitly forbid and pretend you're not wrecking your own work.

Majty is spelled em-ay-jay-tee-wye. That's the abbreviation Majty, or perhaps Maj-ty, with superscripted English letters ty, for clarity and economy in dense writing. Unicode's modifier letter code points do not represent the alphabetics t and y, as used in English words. They don't even have all 26 Latin letters, because these are meant for a completely different use. (Are we also planning on adding entries for English Mr, Comr, Comrs, Archd, Depty, Qu, Rts, Esqr, decd, ad nauseam?)

As to Microsoft's shitty default behaviour of forcing people to write with raised ordinals, that's the result of some software engineer showing his pointy-haired boss how frightfully clever he was, and, sadly, the idiot ran with it. Whoever added that feature to the software knew squat about good typography. In running text, just plain l.c. like 10th is better, but there's no reason to decorate functional writing like a dictionary with ordinals anyway. Michael Z. 2012-03-11 22:48 z

Unicode charts also give specific guidelines, for example:
  • “Phonetic Extensions, Range: 1D00–1D7F” (v 6.1, 2012): “These are non-IPA phonetic extensions, mostly for the Uralic Phonetic Alphabet (UPA). The small capitals, superscript, and subscript forms are for phonetic representations where style variations are semantically important. For general text, use regular Latin, Greek or Cyrillic letters with markup instead.”[6]
  • “Spacing Modifier Letters, Range: 02B0–02FF” (v 6.1, 2012): “r MODIFIER LETTER SMALL R [. . .] preceding four used for r-coloring or r-offglides”[7]
The current explicit guidelines underscore the intent that the standard has maintained for the use of these modifier code points, for over a decade:
  • Unicode 3.0, (2000) p 177, “Latin Superscripts. Graphically, some of the phonetic modifier signs are raised or super-scripted, some are lowered or subscripted, and some are vertically centered. Only those few forms that have specific usage in IPA or other major phonetic systems are encoded.”[8]
 Michael Z. 2012-03-13 15:28 z
I just created ʳ, quod vide. Does it say what you want it to say? — Raifʻhār Doremítzwr ~ (U · T · C) ~ 14:25, 17 March 2012 (UTC)
Thank you. Of course it's better to actually see what we're talking about. Do you know if the usage as a raised letter is attested independently three times over a one-year span, in English, French, or any other language?
I have to do some thinking about this. We create entries for symbols as lexical items, in addition to terms, but our necessary reliance on Unicode makes it feel natural to “define” code points (which are not lexical items, but a way to represent them). The latter often corresponds to the former, but it's easy to lose sight of the fact that the concept is fundamentally different. This problem is not about to go away. Michael Z. 2012-03-18 00:20 z
Well, 3ʳᵈ and 1ᵉʳ get some Google-Groups hits, but none from Usenet that I can find. It'd be really tedious to find more, since searching for ʳ doesn't return hits for it when it forms a part of longer strings. There are also the two Usenet uses that I found of 4ᵗʰ, but they're from the same person in the same Usenet group, so they count as just one citation for our purposes. — Raifʻhār Doremítzwr ~ (U · T · C) ~ 10:24, 20 March 2012 (UTC)

Status quo

Voters can also explicitly vote for the status quo, if they prefer the lack of regulation to both of the proposed regulations.

The status quo is not “lack of regulation.” We don't use XML, TeX, OPML, SGML, Markdown, Setext, or Textile, and we don't just make up new stuff on a whim.

The status quo is that we use Wikitext, HTML, and CSS.

By consensus we use HTML, implicit in the community-developed MediaWiki software, and explicit in every web page's DTD. HTML relies on the Unicode standard, and this is also declared in the meta Content-Type tag in every web page. There's also an unchallenged page saying “Wiktionary uses Unicode encoding.”

Questions with such technical precision as this one may not come up too often, but when they do, the way to resolve them is by consulting our long-established standards, not by declaring a free-for-all and requiring us to vote on every HTML tag and every Unicode code point.

If you explicitly contravene these standards, without any vote supporting you, then you're going against community consensus. Michael Z. 2012-03-12 00:03 z

Er, so are you claiming that the status quo is to use <sup>th</sup>, or that the status quo is to use ᵗʰ? (I know you well enough to be certain that you mean the former, but in an effort to emphasize the supposed self-evidence of your position, you've intentionally failed to make that explicit.) —RuakhTALK 00:40, 12 March 2012 (UTC)
I think the previous section makes it clear he wants <sup>th</sup>.--Prosfilaes (talk) 02:12, 12 March 2012 (UTC)
I am claiming that the status quo is not “you can use a Unicode code point to be whatever you feel like because we haven't specifically voted on using Unicode.” I am claiming that the “status quo” choice provided in the vote misrepresents our consensus treatment of HTML and Unicode, and it shouldn't imply that anything goes. The wording in the vote makes it sound like using Latin modifier letters for superscripts in English words is okay because we haven't disallowed it, but actually it's not okay at all, because it directly contravenes the standards that we have agreed on as the basis for our work here.
(For me it follows that we should follow the standards and represent superscripts with <sup>...</sup>, but that is a separate question which I spoke to above, in #Non-standard.) Michael Z. 2012-03-12 04:58 z
Come to think of it, the vote's wording is completely biased, and doesn't address the direct ramifications of the choices. I will edit boldly. Michael Z. 2012-03-12 05:08 z
Searchability is another issue. A text search for "majty" will not find "maj + SYMBOL RESEMBLING t + SYMBOL RESEMBLING y". As someone commented above, the letters in "majty" are supposed to be the letters t and y, just written smaller. Equinox 15:11, 12 March 2012 (UTC)
They are found in searches; for example, 1ˢᵗ is the second search result that appears when one searches for 1st. — Raifʻhār Doremítzwr ~ (U · T · C) ~ 15:30, 12 March 2012 (UTC)
Can we rely on all search engines making this equivalence (Google, Yahoo etc.) and not just our internal one, which is less commonly used? And should we rely on it, given the violation of standards? Equinox 15:32, 12 March 2012 (UTC)
Search engines do assign similarity or equivalence to various search terms (e.g., cafe and café). The details of their algorithms are unknowable and ever-changing. It doesn't pay to go crazy with SEO or second-guess them too much. We can improve searchability and ensure forwards-compatibility by applying standards correctly to our work. Michael Z. 2012-03-12 15:53 z

starting the vote

I prefer it when we reach consensus on what to do, and do that, without a vote, and it actually seems we have got a consensus here (or are you still holding out for modifiers, Raif?), but I suppose this vote should be started, anyway. I'll set it to start a week from now; push it back as necessary. - -sche (discuss) 23:13, 13 March 2012 (UTC)

If Michael fixes the <sup> and <sub> tags, as he says he can in #Templates for superscribing and subscribing, I'll drop my insistence on using modifier letters for super- and subscribing, except in cases where modifier letters are shown to have been used and which might lead a person to look them up here using copy-and-paste searching (I refer to cases like , discussed in #Mʳ and its ilk; my impression is that that exception is part of the consensus, too). Given that, I would hope that we'd agree to present these forms using <sup> or {{SUP}}, including in page titles by using {{DISPLAYTITLE}}, as in the case of princip l. — Raifʻhār Doremítzwr ~ (U · T · C) ~ 00:22, 14 March 2012 (UTC)

POV in the vote options

Regarding this change to the vote:

Undo revision 16422823 by Mzajac (talk): those changes "load" the vote (introduce a lot of POV); the points have already been made in the BP and now the talk page.

Sche, Option 1 sets a precedent for routinely ignoring the open Web's standards, which are a cornerstone of Wikimedia's and Wiktionary's accessibility, quality, and openness. It's the most important thing about this vote. To obscure this fact from the vote is highly biased. Michael Z. 2012-03-14 02:01 z

Some editors have expressed their preference that votes not include rationales at all; those editors suggest that the talk page and discussions that precede the vote are the place to express the rationales and arguments for and against the vote. I get that you feel very strongly about this, but perhaps other editors will chime in with regard to the changes. - -sche (discuss) 20:55, 15 March 2012 (UTC)
Understood. But this isn't the rationale behind what we're voting to do, this is what we're voting to do. The vote's wording could be much simpler, clearer, and more accurate if it said something along the lines of 1. Enter text according to the Unicode standard, 2. Allow violations of the Unicode standard as long as the glyphs look right in a visual browser. Michael Z. 2012-03-18 00:46 z