Last modified: 2011-12-18 17:08:36 UTC
Bug detected at: http://de.wikipedia.org/w/api.php?format=yamlfm&action=parse&page=BD:Label5&prop=section
Two 3rd-level headers are embedded in a template call, and the parsed results are messed up: the byteoffsets from number=1.1 onward give the end-of-page offset, and there is no index and no fromtitle. This may have the same cause as 25203#c3 (“The api isn't at fault here, its only displaying what the parser output says there is.”).
Hmm... { "warnings": { "parse": { "*": "Unrecognized value for parameter 'prop': section" } }, "parse": { "title": "Benutzer Diskussion:Label5" } }
(In reply to comment #1)
Awww, somehow the s at the end of the URL got lost. Correct link: http://de.wikipedia.org/w/api.php?format=yamlfm&action=parse&page=BD:Label5&prop=sections
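For anyone scripting against this, a minimal sketch of building the corrected request and spotting the "Unrecognized value" warning from comment #1. It assumes format=json rather than the yamlfm used in the links above, and the helper names sections_url and parse_warning are made up for illustration; nothing is fetched here.

```python
from urllib.parse import urlencode

API = "http://de.wikipedia.org/w/api.php"  # endpoint taken from the report


def sections_url(page, fmt="json"):
    """Build an action=parse request that asks only for the section list."""
    # format=json is an assumption; the report itself uses format=yamlfm.
    query = {"format": fmt, "action": "parse", "page": page, "prop": "sections"}
    return API + "?" + urlencode(query)


def parse_warning(response):
    """Return the parse module's warning text from a decoded response, or None."""
    return response.get("warnings", {}).get("parse", {}).get("*")
```

Calling parse_warning on the decoded response before reading the sections would have flagged the prop=section typo right away instead of returning a response with no section list.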
Ok I can confirm your results there.

The first two sections (first one is 'regular', second is in the templated text):

{
  "toclevel": 1,
  "level": "2",
  "line": "Gr\u00fc\u00df Gott und Herzlich Willkommen auf meiner Benutzer-Diskussionsseite",
  "number": "1",
  "index": "1",
  "fromtitle": "Benutzer_Diskussion:Label5",
  "byteoffset": 3417,
  "anchor": "Gr.C3.BC.C3.9F_Gott_und_Herzlich_Willkommen_auf_meiner_Benutzer-Diskussionsseite"
},
{
  "toclevel": 2,
  "level": "3",
  "line": "Meine WP-W\u00fcnsche f\u00fcr 2011",
  "number": "1.1",
  "index": "",
  "fromtitle": false,
  "byteoffset": 7897,
  "anchor": "Meine_WP-W.C3.BCnsche_f.C3.BCr_2011"
},

Since this second one comes from within a template, the current parser can't really assign it a byte position within the article text. I'm not too familiar with how this output is generated so will have to take a peek to say more.

Ideally it at least shouldn't mess up the later sections, but I'm not sure how a "byteoffset" helps if you don't have a "bytelength"... possibly this is just a bad data structure that's not really suitable for how sections are handled. :(
(In reply to comment #3)
> Since this second one comes from within a template, the current parser can't
> really assign it a byte position within the article text. I'm not too familiar
> with how this output is generated so will have to take a peek to say more.
> Ideally it at least shouldn't mess up the later sections, but I'm not sure how
> a "byteoffset" helps if you don't have a "bytelength"... possibly this is just
> a bad data structure that's not really suitable for how sections are handled.
> :(

Why is it actually called byteoffset when it is a character offset and not a byte offset? I propose renaming it, maybe to charoffset.

I understand that the parser has no notion of sections in templates, and I don't really care about that. What I do care about are the byteoffsets, or rather where a section starts (and then, implicitly, where it ends), so that I can take the sections apart.
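To make the byte-versus-character question concrete, here is a rough sketch of "taking the sections apart" from a list of start offsets, where each section implicitly ends at the next start. The function name split_by_offsets and its unit switch are assumptions for illustration, not part of the API; which unit the API really reports is exactly what is being asked above.

```python
def split_by_offsets(text, offsets, unit="char"):
    """Cut text into chunks, one per start offset; each chunk ends at the next start.

    unit="char" treats offsets as code point indexes into text,
    unit="byte" treats them as indexes into the UTF-8 encoding of text.
    Text before the first offset (the lead section) is not returned.
    """
    starts = sorted(set(offsets))
    if unit == "byte":
        data = text.encode("utf-8")
        bounds = starts + [len(data)]
        # errors="replace" keeps the sketch from raising if an offset
        # lands in the middle of a multi-byte character.
        return [data[a:b].decode("utf-8", errors="replace")
                for a, b in zip(bounds, bounds[1:])]
    bounds = starts + [len(text)]
    return [text[a:b] for a, b in zip(bounds, bounds[1:])]
```

On pure ASCII pages both units give the same cuts; on pages with umlauts (like the headings quoted above) they drift apart by one for every two-byte character that precedes the offset, so using the wrong unit cuts sections in the wrong place.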
This affects not only templates but also tables (Benutzer Diskussion:Caliban@dewiki), <div> elements (Benutzer Diskussion:Elchbauer@dewiki), and parser functions (Benutzer Diskussion:4Frankie@dewiki).
Can confirm this bug on de:wiki 1.18mwf.

(In reply to comment #3)
> Ok I can confirm your results there.
>
> The first two sections (first one is 'regular', second is in the templated
> text):
>
> {
> "toclevel": 1,
> "level": "2",
> "line": "Gr\u00fc\u00df Gott und Herzlich Willkommen auf meiner
> Benutzer-Diskussionsseite",
> "number": "1",
> "index": "1",
> "fromtitle": "Benutzer_Diskussion:Label5",
> "byteoffset": 3417,
> "anchor":
> "Gr.C3.BC.C3.9F_Gott_und_Herzlich_Willkommen_auf_meiner_Benutzer-Diskussionsseite"
> },
> {
> "toclevel": 2,
> "level": "3",
> "line": "Meine WP-W\u00fcnsche f\u00fcr 2011",
> "number": "1.1",
> "index": "",
> "fromtitle": false,
> "byteoffset": 7897,
> "anchor": "Meine_WP-W.C3.BCnsche_f.C3.BCr_2011"
> },
>
> Since this second one comes from within a template, the current parser can't
> really assign it a byte position within the article text. I'm not too familiar
> with how this output is generated so will have to take a peek to say more.
> Ideally it at least shouldn't mess up the later sections, but I'm not sure how
> a "byteoffset" helps if you don't have a "bytelength"... possibly this is just
> a bad data structure that's not really suitable for how sections are handled.
> :(

The point is that the byteoffset field should be "" in order to be recognized correctly, e.g. by DrTrigonBot. Look at [1]: there you have e.g. index="T-7" byteoffset="" for all template entries, except for the level 3 headings, where you get e.g. index="" byteoffset="137405", which confuses my bot a little bit!

My workaround is to catch the empty index string, but since this is considered to be a bug I cannot rely on the fact that there will always be an empty index string...

[1] http://de.wikipedia.org/w/api.php?action=parse&page=Wikipedia:L%C3%B6schkandidaten/12.%20Dezember%202009&prop=sections

Greetings
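For reference, the workaround described above could look roughly like this. This is only a sketch and not DrTrigonBot's actual code: the function name editable_sections is made up, and treating both an empty index and a "T-" prefixed index as "not from the page's own wikitext" is an assumption based on the values reported in this thread.

```python
def editable_sections(sections):
    """Keep only entries whose byteoffset can be trusted for this page's wikitext."""
    result = []
    for sec in sections:
        index = str(sec.get("index", ""))
        # Empty index: the buggy case reported here (heading from a template,
        # table, <div> or parser function, yet a byteoffset is still given).
        # "T-" prefix: template entries as seen in [1], e.g. index="T-7".
        if index == "" or index.startswith("T-"):
            continue
        # Entries without an offset cannot be located in the text either.
        if sec.get("byteoffset") in ("", None):
            continue
        result.append(sec)
    return result
```

Checking both conditions means the filter keeps working if the bug is fixed and the affected entries start reporting byteoffset="" like the other template entries do.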