Wiktionary talk:Todo/Translations templates outside translations sections

Code

time \
  bzip2 -d < enwiktionary-20120812-pages-articles.xml.bz2 \
  | perl -e \
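      ' use warnings;
        use strict;
        $/ = "</page>\n";            # read the dump one <page> element at a time
        while(<>)
        {
          last unless m{</page>};    # trailing chunk after the final page
          die unless m{<ns>(\d+)</ns>};
          next unless $1 eq "0";     # main namespace only
          next if m{<text xml:space="preserve" />};   # empty page
          die unless m{<title>([^<]+)</title>};
          my $title = $1;
          # reduce the record to the bare wikitext
          die unless s{^.*<text xml:space="preserve">}{}s;
          die unless s{</text>.*$}{\n}s;
          # drop each Translations section: the header line plus every
          # following line that does not start with "="
          s/^=+ *Translations *=+ *\n((?!=).*\n)*//gm;
          # braces escaped so that newer Perls accept them as literals
          print "* [[$title]]\n" if m/\{\{t[-+ø|]/;
        }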
       ' \
  > trans-template-outside-trans-section.txt

RuakhTALK 03:12, 19 August 2012 (UTC)


Alternatively, here's a pure-Perl version, provided the dump has already been decompressed (from *.xml.bz2 to just *.xml):

use warnings;
use strict;
 
use utf8;
 
open my $dump_fh, '<:encoding(UTF-8)',
                  'enwiktionary-20120831-pages-articles.xml'
  or die "Couldn't open dump: $!";
 
open my $output_fh, '>:encoding(UTF-8)',
                    'trans-template-outside-trans-section.txt'
  or die "Couldn't open output: $!";
 
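$/ = "</page>\n";   # read the dump one <page> element at a time
 
while(<$dump_fh>)
{
  last unless m{</page>};            # trailing chunk after the final page
  die unless m{<ns>(\d+)</ns>};
  next unless $1 eq "0";             # main namespace only
  next if m{<text xml:space="preserve" />};   # empty page
  die unless m{<title>([^<]+)</title>};
  my $title = $1;
  # reduce the record to the bare wikitext
  die unless s{^.*<text xml:space="preserve">}{}s;
  die unless s{</text>.*$}{\n}s;
  # drop each Translations section: the header line plus every following
  # line that does not start with "="
  s/^=+ *Translations *=+ *\n((?!=).*\n)*//gm;
  # $1 below is whichever of t, t+, t-, tø matched first
  print $output_fh
        "* [[$title]] — {{temp|$1}}\n" if m/\{\{(t[-+ø]?)[|]/;
}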

Notes:

  • This may take an age and a half to run because of the added UTF-8 handling; code points in the range 128–255 seem to cause major pessimization. (But then, I'm running a really old version of Perl. Newer Perls have hopefully fixed that.)
  • Be sure to save the above script as UTF-8; the line use utf8; tells Perl to expect the script to be encoded in UTF-8.
  • I've modified the output to indicate which translation template is used. (Of course, if a page uses multiple translation-templates outside translations-sections, then this will indicate only one of them.) This is to address one of -sche's questions.
  • The second-to-last line contains a \n. Depending on your system and your Perl installation, you may need to write \r\n instead. To find out whether that's the case, run the command perl -e "print qq{a\nb\nc\n}" > tmp.txt and open the file tmp.txt. If you see a, b, and c on separate lines, then all is well. If you see them all on one line, perhaps with boxes or weird characters between them, then you need to use \r\n rather than \n.

RuakhTALK 12:16, 6 September 2012 (UTC)
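
For anyone who would rather not judge the \n question by eye, here is a minimal Python sketch that looks at the raw bytes instead. The function name detect_line_ending is made up for this illustration and is not part of either script above. It only reports which line terminator is present; whether your editor is happy with that terminator is still for you to decide.

```python
def detect_line_ending(data):
    """Classify the line terminator used in a bytes object."""
    if b"\r\n" in data:
        return "CRLF"
    if b"\n" in data:
        return "LF"
    if b"\r" in data:
        return "CR"
    return "none"


# What the test command writes when "\n" is left untranslated:
print(detect_line_ending(b"a\nb\nc\n"))        # LF
# What it writes when each "\n" becomes carriage return plus line feed:
print(detect_line_ending(b"a\r\nb\r\nc\r\n"))  # CRLF

# To inspect the real file, read it in binary mode so that Python does
# not normalise the line endings first:
#     with open("tmp.txt", "rb") as fh:
#         print(detect_line_ending(fh.read()))
```

If tmp.txt comes back as LF and your editor shows everything on one line, that matches the case where \r\n is needed.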

Creative error

Now, this — this is creative misuse of a template, lol. Using a translation template to do anything other than provide a translation. Although some of them, like Дмитрий, sorta make sense. - -sche (discuss) 01:46, 20 August 2012 (UTC)

Liffey

I can't tell why that entry showed up on the list. It seems to only use {{t}} in a trans table. I thought I'd leave a note here in case I'm missing something obvious, but if it's just a false positive, that's fine. - -sche (discuss) 14:50, 27 August 2012 (UTC)

This isn't a list of translations-templates outside translations tables, it's a list of translations-templates outside translations sections; that is, outside the bit of wikitext that starts with a "Translations" header and ends with some other header. See Liffey?diff=17704949. —RuakhTALK 15:23, 27 August 2012 (UTC)
Yep, a lot of the English entries listed merely need the ====Translations==== header, so they're very easy to fix. Mglovesfun (talk) 15:42, 27 August 2012 (UTC)
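
The section-versus-table distinction described above can be sketched in a few lines. This is a Python counterpart of the two Perl regexes, not the script that produced the list, and the sample wikitext is a made-up stand-in rather than the real Liffey entry. The names stray_template, with_header and without_header are likewise invented for the example.

```python
import re

# Python counterpart of the Perl substitution
#   s/^=+ *Translations *=+ *\n((?!=).*\n)*//gm
# It matches a "Translations" header line plus every following line that
# does not begin with "=", i.e. everything up to the next header.
TRANSLATIONS_SECTION = re.compile(
    r"^=+ *Translations *=+ *\n(?:(?!=).*\n)*", re.M
)

# Counterpart of m/{{(t[-+ø]?)[|]/ from the pure-Perl version.
TRANSLATION_TEMPLATE = re.compile(r"\{\{(t[-+ø]?)\|")


def stray_template(wikitext):
    """Return the first translation template found outside any
    Translations section, or None if there is none."""
    remainder = TRANSLATIONS_SECTION.sub("", wikitext)
    match = TRANSLATION_TEMPLATE.search(remainder)
    return match.group(1) if match else None


# Hypothetical entry: {{t}} sits in a trans table under the right header.
with_header = (
    "==English==\n"
    "===Proper noun===\n"
    "# A river.\n"
    "====Translations====\n"
    "{{trans-top|river}}\n"
    "* Irish: {{t|ga|An Life}}\n"
    "{{trans-bottom}}\n"
    "===See also===\n"
    "* [[Dublin]]\n"
)

# Same entry with the ====Translations==== header missing: the trans
# table is still there, but it is no longer inside a Translations section.
without_header = with_header.replace("====Translations====\n", "")

print(stray_template(with_header))     # None
print(stray_template(without_header))  # t
```

The second case is the kind of entry that lands on the list even though its {{t}} is inside a trans table.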

Appendix pages

Can this be extended to Appendix pages as well? There are still several cases of this in the descendants sections of PIE entries. —CodeCat 18:10, 7 September 2012 (UTC)

[1] works pretty well, though obviously it ignores section headers. Mglovesfun (talk) 18:21, 7 September 2012 (UTC)
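
One way the namespace test in the scripts above might be widened to take in Appendix pages is sketched below, in Python rather than Perl. It assumes that the Appendix namespace is numbered 100 on English Wiktionary; that number is not stated anywhere on this page, so check the namespaces list at the top of the dump before relying on it. The names WANTED_NAMESPACES and wanted_title, and the sample page chunks, are hypothetical.

```python
import re

# Assumption: "0" is the main namespace and "100" is Appendix. Verify the
# numbers against the <namespaces> list at the top of the dump.
WANTED_NAMESPACES = {"0", "100"}

NS = re.compile(r"<ns>(\d+)</ns>")
TITLE = re.compile(r"<title>([^<]+)</title>")


def wanted_title(page_xml):
    """Return the title of a <page> chunk if it is in a wanted namespace,
    otherwise None. Raise ValueError if the chunk lacks <ns> or <title>,
    mirroring the "die unless" checks in the Perl scripts."""
    ns = NS.search(page_xml)
    title = TITLE.search(page_xml)
    if ns is None or title is None:
        raise ValueError("malformed <page> chunk")
    return title.group(1) if ns.group(1) in WANTED_NAMESPACES else None


# Made-up page chunks, trimmed to the two elements the check reads.
samples = [
    "<page><title>Liffey</title><ns>0</ns></page>\n",
    "<page><title>Appendix:Example</title><ns>100</ns></page>\n",
    "<page><title>Talk:Example</title><ns>1</ns></page>\n",
]

for chunk in samples:
    print(wanted_title(chunk))
```

In the Perl scripts the corresponding change would be to the line that compares $1 with "0". Appendix pages normally have no Translations header at all, so every translation template on them would be reported.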