Template talk:Str len

Page contents not supported in other languages.
From Wikipedia, the free encyclopedia

This is the discussion/talk-page for: Template:Str len.

Merge the str len templates[edit]

checkY Done

I think we should merge {{str len}}, {{strlen}} and {{str len full}}. But not right away, this needs some testing and discussion first.

First a reminder to everyone: For most usage cases it is much easier and much less server costly to use {{str ≥ len}} or one of its sister templates.

Anyway, regarding this merge suggestion:

{{str len}}: Has the best name. But it can only count up to 80 in length. And it doesn't fail gracefully, since it produces an error message for strings longer than 80. And it counts the whitespace that surrounds its input, thus {{str len| a }} returns 3 instead of 1.

{{strlen}}: Can only count up to 64 in length. Returns 64 for strings that are too long, which is okay.

{{str len full}}: I think this is the best one, it on average uses the least resources and it can count up to 500 in length. It returns 500 for strings that are too long, which is okay. But it is brand new so it needs a little more testing before we can trust it. (I created it today. Just finished testing it, now going to document it.)

Some week from now when I hope some more users have tested {{str len full}} and we have studied the other templates that depend on these templates, then I think we should move {{str len full}} to {{str len}} and redirect {{strlen}} there.

Below are the results of the "NewPP limit reports" for each template with different input. I executed them each alone on an otherwise empty page. Those reports are inserted by the Wikipedia servers and can be found if you "view the source" of a generated page in your web browser. They show some data about how much server resources the templates on a page cost. (It works on previews too, so no need to save a page when doing tests like this.)

NewPP limit reports

{{str len|a}}:
Preprocessor node count: 31/1000000
Post-expand include size: 18/2048000 bytes
Template argument size: 11/2048000 bytes

{{strlen| a }}:
Preprocessor node count: 649/1000000
Post-expand include size: 4162/2048000 bytes
Template argument size: 387/2048000 bytes

{{str len full| a }}:
Preprocessor node count: 99/1000000
Post-expand include size: 641/2048000 bytes
Template argument size: 127/2048000 bytes

******************************************

{{str len|60-characters...}}:
Preprocessor node count: 1329/1000000
Post-expand include size: 8044/2048000 bytes
Template argument size: 14926/2048000 bytes

{{strlen| 60-characters... }}:
Preprocessor node count: 59/1000000
Post-expand include size: 624/2048000 bytes
Template argument size: 682/2048000 bytes

{{str len full| 60-characters... }}:
Preprocessor node count: 147/1000000
Post-expand include size: 1486/2048000 bytes
Template argument size: 1828/2048000 bytes

******************************************

Worst case for each string length template:

{{str len|80-characters...}}:
Preprocessor node count: 1769/1000000
Post-expand include size: 13924/2048000 bytes
Template argument size: 26306/2048000 bytes

{{strlen| a }}:
Preprocessor node count: 649/1000000
Post-expand include size: 4162/2048000 bytes
Template argument size: 387/2048000 bytes

{{str len full| 499-characters... }}:
Preprocessor node count: 176/1000000
Post-expand include size: 10017/2048000 bytes
Template argument size: 17740/2048000 bytes

******************************************

Too long strings:

{{str len|501-characters...}} = "String longer than max string length of 80"
Preprocessor node count: 1781/1000000
Post-expand include size: 14470/2048000 bytes
Template argument size: 161534/2048000 bytes

{{strlen| 501-characters... }} = "64"
Preprocessor node count: 19/1000000
Post-expand include size: 1006/2048000 bytes
Template argument size: 1509/2048000 bytes

{{str len full| 501-characters... }} = "500"
Preprocessor node count: 13/1000000
Post-expand include size: 1008/2048000 bytes
Template argument size: 1006/2048000 bytes

As can be seen above, they all cost very differently depending on what input they get.

{{str len full}} and {{strlen}} both use very little resources when they get too long strings. But {{str len}} uses a lot of resources in spite that it just returns an error message.

--David Göthberg (talk) 14:38, 26 March 2009 (UTC)[reply]

I don't really care what happens with the naming. If {{str len full}} can do more with fewer resources, then great. The error message can probably be eliminated or changed since there aren't that many things that depend on this particular behavior and they could probably be patched at the same time. However, I believe some of the other string functions do strictly depend on the fact that white space counts, so changing that behavior may be more burdensome. If you can plan out a way to seamlessly replace str_len with str_len_full, including fixing any dependencies in other templates, then I would have no objection to that. Dragons flight (talk) 05:51, 27 March 2009 (UTC)[reply]
Thanks for the heads up on the whitespace. And I see on your talk page that you consider counting whitespace to be a feature and not a bug. So I have now carefully checked all usage of {{str len}} and {{strlen}}, but none of the cases used whitespace or the error messages. So I have now changed all cases to use {{str len full}} instead.
But I converted {{str index}} to instead use {{str ≥ len}}, since that is exactly the kind of case where using {{str ≥ len}} is much better. For instance {{italictitle}} uses these string templates, and the updates I did today made {{italictitle}} cost less than half as much to parse as before. So now it "only" costs as much as a very large navbox...
Now I can move {{str len full}} to {{str len}}. There's two ways I can do that:
1: First delete {{str len}}, then move {{str len full}} there, then restore the old versions of {{str len}} to merge the edit histories. (The dates in their edit histories do not overlap, so it won't be confusing.)
2: I can first move {{str len}} to a new name, if you want to keep it?
--David Göthberg (talk) 07:07, 29 March 2009 (UTC)[reply]
I'm fine with delete/merge. At the end of the day it is the functionality that counts more than how it is accomplished. I do worry that having string functions that automatically trim whitespace is going to bite some programmer some day, but whatever. My main interest is in providing an stopgap solution and a demonstration that having real StringFunctions would be A Good Thing. Dragons flight (talk) 07:16, 29 March 2009 (UTC)[reply]
Okay, so a delete+move it is.
In most cases the whitespace is just a nuisance. But yeah, there are some situations where one need to handle whitespace. I am thinking of doing some changes to {{italictitle}}, and that will need some string functions that actually use whitespace. But I will know more about that when I have analysed it more.
Ah, so that is your motivation! My motivation is mostly that this is interesting programming. And that we needed the {{str ≥ len}} for some of the small article message boxes.
Oh, and before I go on and totally rework the other string templates I have to say this: Thanks Dragons flight and everyone else who have done the earlier versions of these templates, without the stuff you discovered/invented I could not have done the stuff I am doing now. I would never have realised we could "abuse" padleft like this! :))
--David Göthberg (talk) 19:28, 29 March 2009 (UTC)[reply]
checkY Done - And I changed all usage out there to {{str len}}, and deleted most of the subpages of {{str len full}} and {{strlen}} since they now became unused redirects. But I left the redirect from {{strlen}} to keep its edit history. And I left the redirect from {{str len full}} for now, but we should perhaps delete that one.
--David Göthberg (talk) 14:46, 30 March 2009 (UTC)[reply]

Why such complexity is used[edit]

This Template:Str_len is a clever, but complicated, way to overcome the lack of a builtin string-length function, back in 2009. The logic repeatedly uses parser-function {{padleft:{{{1}}}|n}} (to pad zeroes left on parameter 1), until the padded strings no longer match the original string, hence counting the length when those strings last matched each other, quickened by counting length in hundreds, tens and ones. This has been a desperate attempt to compensate for needing a quick, internal string-length function. Most software systems have internal character-handling functions, such as for string-length and substring-extraction, but Wikipedia's MediaWiki (1.6) software has been lacking such capability for many years. There was an attempt in 2007-2008 to gain approval to activate several typical string parser-functions (such as {{#len:xxxx}} length), but they were denied, even though they would have been much more efficient than the way {Str_len} uses {padleft:}, so many times, to eventually match the length of the string. If you think this whole situation is nutzoid, you are not alone. Many people have tried to get efficient string functions authorized for inclusion in Wikipedia's MediaWiki software, so the issues are still being debated. The role caused by bureaucracy, or committee, delays is not exactly known. Meanwhile, the complexity within {Str_len} is intended to quickly detect string length, by separation of total length, such as 236, as 2 hundreds, 3 tens and 6 ones (200+30+6), finding 300 was too long, then 40 was too long, and 7 was too long. The separation then uses Template:Str_len/core. That is much faster (and shorter) than having 237 simple if-expressions to compare padding ...234, 235, 236, and find 237 was 1 character too long. The complexity is used to quicken the process of detecting any string length, up to 500 long, and that is why a 2nd template is needed, as {Str_len/core}. -Wikid77 (talk) 01:13, 17 November 2010 (UTC)[reply]

Shorter algorithms to determine length[edit]

05-Dec-2010: The subtemplate Template:Str_len/core has used nested if-else-expressions to compare lengths in successive multiples of 100s, 10s and ones. That approach has led to many levels of nesting for the if-else logic. Instead, a #switch could be used to branch to the first matching condition (listed in reverse order), where each #switch is only 1 level deep compared to nested if-else logic over 7 levels deep. For example, to get the hundreds digit, then use a #switch branch as follows.

SWITCH LOGIC to find hundreds:
{{#switch: x{{{1}}}
  | x{{{{{|safesubst:}}}padleft:{{{1|}}}| 400 }} = 4
  | x{{{{{|safesubst:}}}padleft:{{{1|}}}| 300 }} = 3
  | x{{{{{|safesubst:}}}padleft:{{{1|}}}| 200 }} = 2
  | x{{{{{|safesubst:}}}padleft:{{{1|}}}| 100 }} = 1
  | #default= <!--return nothing to avoid a leading 0-->
}}

Compare the switch (above) to the nested if-else logic for the hundreds case, below.

IF-ELSE LOGIC to find hundreds:
{{{{{|safesubst:}}}#ifeq: x{{{1}}} | x{{{{{|safesubst:}}}padleft:{{{1}}}|100}}
| {{{{{|safesubst:}}}#ifeq: x{{{1}}} | x{{{{{|safesubst:}}}padleft:{{{1}}}|200}}
  | {{{{{|safesubst:}}}#ifeq: x{{{1}}}|x{{{{{|safesubst:}}}padleft:{{{1}}}|300}}
    | {{{{{|safesubst:}}}#ifeq:x{{{1}}}|x{{{{{|safesubst:}}}padleft:{{{1}}}|400}}
      | 4
      | 3
      }}
    | 2
    }}
  | 1
  }}
|<!--Don't return 0, since tens/ones don't want a leading 0-->
}}

The switch-logic uses only 1 level of nesting, compared to the 4 levels of if-else logic needed to handle strings above 400. The switch-logic is also shorter because it has only 1 instance of "x{{{1}}}". When using a binary search, then more levels of nesting would be needed. However, simply dividing a search into 2 parts would increase nesting from 1 to only 2 levels, as in the example, below, to find the tens digit, with comparing as above/below 50.

LOGIC to find tens:
{{{2}}}{{{{{|safesubst:}}}
#ifeq:x{{{1}}}|x{{{{{|safesubst:}}}padleft:{{{1}}}|{{{2}}}50}}
| {{#switch: x{{{1}}}
  | x{{{{{|safesubst:}}}padleft:{{{1|}}}| {{{2}}}90 }} = 9
  | x{{{{{|safesubst:}}}padleft:{{{1|}}}| {{{2}}}80 }} = 8
  | x{{{{{|safesubst:}}}padleft:{{{1|}}}| {{{2}}}70}} = 7
  | x{{{{{|safesubst:}}}padleft:{{{1|}}}| {{{2}}}60}} = 6
  | #default= 5 <!--when >= 50 and none of the above-->
  }}
| {{#switch: x{{{1}}}
  | x{{{{{|safesubst:}}}padleft:{{{1|}}}| {{{2}}}40 }} = 4
  | x{{{{{|safesubst:}}}padleft:{{{1|}}}| {{{2}}}30 }} = 3
  | x{{{{{|safesubst:}}}padleft:{{{1|}}}| {{{2}}}20}} = 2
  | x{{{{{|safesubst:}}}padleft:{{{1|}}}| {{{2}}}10}} = 1
  | #default= {{{{{|safesubst:}}}#if:{{{2|}}}|0}}
  }}
}}<!--endif-->

Compare the logic (above) to the 7 levels of nested if-else used by {Str_len/core} to find the tens digit, below (omitting most "safesubst:" for brevity).

IF-ELSE LOGIC to find tens:</nowiki>
{{{2|}}}{{
{{{|safesubst:}}}#ifeq: x{{{1|}}} | x{{padleft:{{{1|}}}| {{{2|}}}10}}
| {{#ifeq: x{{{1|}}} | x{{padleft:{{{1|}}}| {{{2|}}}20}}
  | {{#ifeq: x{{{1|}}} | x{{padleft:{{{1|}}}| {{{2|}}}30}}
    | {{#ifeq: x{{{1|}}} | x{{padleft:{{{1|}}}| {{{2|}}}40}}
      | {{#ifeq: x{{{1|}}} | x{{padleft:{{{1|}}}| {{{2|}}}60}}
        | {{#ifeq: x{{{1|}}} | x{{padleft:{{{1|}}}| {{{2|}}}80}}
          | {{#ifeq: x{{{1|}}} | x{{padleft:{{{1|}}}| {{{2|}}}90}}
            | 9
            | 8
            }}
          | {{#ifeq: x{{{1|}}} | x{{padleft:{{{1|}}}| {{{2|}}}70}}
            | 7
            | 6
            }}
          }}
        | {{#ifeq: x{{{1|}}} | x{{padleft:{{{1|}}}| {{{2|}}}50}}
          | 5
          | 4
          }}
        }}
      | 3
      }}
    | 2
    }}
  | 1
  }}
| {{{{{|safesubst:}}}#if:{{{2|}}}|0}} <!--return 0 if >=100-->
}}</nowiki>

Comparisons of the logic (above) show some improved algorithms for the coding of subtemplate {Str_len/core}. -Wikid77 (talk) 10:54, 5 December 2010 (UTC)[reply]

Hi Wikid77. As you might know I coded this optimised {{str len}} template. I based it on older {{str len}} templates made by other users and I also got input from other users, so as usual it really was a teamwork.
Unfortunately your code would cost more server resources to run. Here is what we knew when we designed this template:
1: Calling the magic word {{padleft:}} is very costly (heavy on the Wikimedia servers). This was told to us several times by the guys managing the Wikimedia servers and/or the guys coding the server software, I don't remember which. So we want to minimize the number of calls to padleft. That's what this is all about!
2: Parsing several if-cases cost almost as little as parsing a switch-case. Usually what makes the difference is what code is used in each comparison line inside the switch or if-case.
3: The great majority of usage cases out there checks very short strings. That is, it is way more common that the input to {{str len}} is 5 characters long than that it is 205 characters long.
So, let's look at your first code suggestion above. If we feed it a string that is 0-99 characters, your code uses four padleft calls. Since your switch-case will first check if the string is >= 400, then >= 300 and so on. While the old code (second code example above) only costs one padleft call to check that string. So in the majority of the usage cases out there your checking is four times as costly for the servers to perform. And when using padleft and a switch-case you can't check for short strings first, so we have to use if-cases to optimise it. Sorry.
And your second code suggestion above (third code example above) have the same problem. Your code costs more padleft calls when checking short strings. Remember, {{str len}} is mostly used to check short strings. For instance, for strings that is less than 20 characters long your code costs five padleft calls, while the old code only costs 1-2 padleft calls. Below is my full comparison of your search and the old search. See the documentation at {{str len/core}} for more on what these diagrams mean.
David's search, code example 4 above:

1--2--3--4--6--8--9
            |  |
            5  7

0 1 2 3 4 5 6 7 8 9 = Values to find.
1+2+3+4+6+6+7+7+7+7 = 50 comparisons to find all values once.


Wikid's search, code example 3 above:

5--9--8--7--6
|
4--3--2--1

0 1 2 3 4 5 6 7 8 9 = Values to find.
5+5+4+3+2+5+5+4+3+2 = 38 comparisons to find all values once.


Binary search:

4-----6--8--9
|     |  |
2--3  5  7
|
1

0 1 2 3 4 5 6 7 8 9 = Values to find.
3+3+3+3+3+3+4+4+4+4 = 34 comparisons to find all values once.
The diagram shows that a binary search of course is the most efficient on average. And Wikid's search comes fairly close to the binary search, except for short input strings where Wikid's search is much less efficient. But, for short input my search (David's) is more efficient than both the others. And short input is much more common so in total my search costs less server resources. Back when I coded this template the wast majority of usage cases were strings that were really short, most were below 20 characters long and only rarely longer than 39 characters. I guess that is still true.
So this template probably already is as optimised as is possible with template code. (Disclaimer: I kind of retired from Wikipedia in spring 2010 so I might have missed out some new functionality that might be useful to improve this.)
Oh, and since I am more or less retired I won't notice any comments here unless someone pokes me on my talkpage.
--David Göthberg (talk) 23:54, 24 January 2011 (UTC)[reply]
30-Jan-2011: The new algorithm below (under "#Optimizing for real string lengths") actually runs somewhat faster than the "already is as optimised as is possible" version in Template:Str_len; in fact, it runs almost 2x times (twice) as fast for actual strings of Wikipedia titles. The reason the new algorithm is so much incredibly faster (and shorter) is because it seeks to match the most-common string lengths, first, rather than treat every length having an end-digit "3" by using the same search order. In fact, almost no titles are 3 characters long, whereas many are 13 or 23, and treating digit "3" the same was a waste of search steps. The basic logic failure, above, was erroneously thinking that optimization could be done independent of the actual data being optimized. Case in point: imagine a string-length template which processed any string using 6 comparisons, but was used on data where 99% of strings were 19 characters long: changing to check for "19" (as the first comparison) would optimize performance as just 1 comparison in 99% of cases. That concept was extended, in the new algorithm below, to check for lengths in prioritized order, to match the most-likely real-world lengths sooner. If the world changed tomorrow, to rename most films to have 3-letter titles, then the algorithm would perform badly, and would need to be re-optimized for the actual data being processed. -Wikid77 10:52, 30 January 2011 (UTC)[reply]
  • That use of nested if-else-else-else seeks to optimize by reducing the use of {padleft} without considering ways to optimize the expansion-depth levels. Consequently, {str_len} needs 9 expansion levels for lengths 1-9, 10 levels for an 11-character string, but 13 expansion levels for a URL of 61 characters. By using other algorithms, the expansion levels can be reduced. -Wikid77 21:39, 27 January 2011 (UTC)[reply]

Strlen_short nests 3 levels when str_len uses 13[edit]

Expansion depth tests during 2010, of the nesting of Template:Str_len, show it used 9 expansion levels to count string lengths 1-9, 10 levels for an 11-character string, but 13 if-else expansion levels to find the length of a URL of 61 characters. Str_len always uses 4 or more calls to {padleft}, and can nest up to 18 levels for long strings. By using other algorithms, the expansion levels can be reduced. For example, Template:Strlen_short simply checks against a simple, prioritized linear search, never increasing beyond 3 levels of expansion depth nesting, for strings of 0-50 characters. For string lengths of 5,6,7 or 4, Strlen_short uses at most 4 calls to {padleft}, but it could be reduced to just 1 call to {padleft} for the most common length (assumed 5). Strlen_short can also be subst'd, as in {{subst:strlen_short|ABCDEF}} storing just the digit "6". -Wikid77 21:39, 27 January 2011 (UTC)[reply]

Optimizing for real string lengths[edit]

30-Jan-2011: The Template:Strlen_quick was created, as a faster (and shorter) alternative to {str_len}, by optimizing for real string data as used in articles. Using the actual string searches, from existing Wikipedia articles, it has been possible to determine the most-likely string lengths, such as 17/18 characters for titles. Then, a template is optimized to match those lengths faster: for example, suppose the top 1,000 articles all used an infobox code of 9 letters, in that case, checking for length 9, first, could avoid checking other lengths. In the case of 353,000 articles using {{Italic_title}}, the string lengths range from 2-99 letters, with the most-common lengths between 16-19 long, and 88% of all titles < 30 long. The distribution of lengths of titles has been as follows:

  • 84% > 10, 12% < 10, 51% in 10-19, 25% in 20-29, 7% in 30-39, 1.7% in 40-49, 0.6% >50.

For lengths 0-9, the increase is dramatic: almost no titles are 1 or 2 characters, a few are 3, some are 4, then more have lengths 5, 6, 7, 8, with 9 as 19x times more common than length 3. In trying to match title-length quickly, then check for the most-common first, as length 9-to-1 in reverse order.
Among lengths 10-19, the most common are at 17/18, then fewer when farther away, with 10 being the least-frequent length among those. Above 20, the lengths decrease in frequency, 21-to-29, as the reverse of 9-1, so checking 21, first, is 3x times more likely to match than 29. Among 30-39, the titles are quite rare, with 31 being as rare as length 5, and 39 being 3x times more rare, as occurring only 43-per-10,000 titles. By optimizing for the actual lengths of titles, those lengths can be matched perhaps twice as quickly. A pure binary search would give unfair advantage to rare lengths, so the string-search has been prioritized in favor of the more common lengths.

Tests of Template:Strlen_quick show that it, in fact, processes actual title lengths about 2x times (twice) as fast as the binary-search markup logic which has been used in {str_len} and {str_len/core}. More later. -Wikid77 10:52, 30 January 2011 (UTC)[reply]

String length of appeared text instead of raw markup[edit]

Right now this template/module checks the actual markup length instead of the appeared text, e.g. {{Str len|[[I love you|y]]}} gives "16" instead of "1". Is there any way to gain the latter result? Thanks. -- Sameboat - 同舟 (talk · contri.) 02:32, 5 February 2016 (UTC)[reply]