Last modified: 2011-11-25 07:43:29 UTC
The wiki page http://wikitech.wikimedia.org/view/Text_storage_data shows the change in the number of text storage objects of each type over the last 18 months. Some of those object types are changing when they shouldn't be. As written in that page, "The rise in "object/historyblobcurstub" doesn't really make sense. The rise in "gzip,external/simple pointer" is concerning." Find out why they're changing and fix it.
object/historyblobcurstub entries should only exist on old content that referenced the MediaWiki 1.4-era "cur" table. You should absolutely *never* see an increase in those objects unless something weird is happening, like existing ones getting duplicated in the course of trying to merge data... It should be possible to compare their contents to see whether the new ones are unique or duplicates. 'gzip,external/simple pointer' sounds like the default that gets created when saving a new revision, so if there's editing activity, and in the absence of anything else, I'd expect those to go up over time.
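Comparing contents as suggested above could be sketched with queries along these lines (a sketch only: the flag and serialized-class matching here is approximate and would need adjusting to however the wikitech counts classify objects):

```sql
-- Sketch: per-flag-combination counts in the text table, roughly what the
-- wikitech graph tracks. Run against a single wiki's database.
SELECT old_flags, COUNT(*) AS n
FROM text
GROUP BY old_flags
ORDER BY n DESC;

-- Sketch: look for byte-identical historyblobcurstub payloads, which would
-- suggest duplication rather than genuinely new objects. The LIKE pattern on
-- the serialized class name is an assumption.
SELECT old_text, COUNT(*) AS copies
FROM text
WHERE old_flags LIKE '%object%'
  AND old_text LIKE '%HistoryBlobCurStub%'
GROUP BY old_text
HAVING COUNT(*) > 1;
```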
Ah, the 'gzip,external/simple pointer' is stuff not marked as UTF-8... that might be a bit worrying actually. :) Off the top of my head, that shouldn't occur on new entries unless there's some config special case. Whether those entries are problematic or not depends on what the $wgLegacyEncoding setting is on the sites those blobs belong to.
(In reply to comment #2)
> Ah, the 'gzip,external/simple pointer' is stuff not marked as UTF-8... that
> might be a bit worrying actually. :)
>
> Shouldn't occur on new entries unless there's some config special case off the
> top of my head. Whether those entries are problematic or not depends on what
> the $wgLegacyEncoding setting is on the sites those blobs belong to.

This is interesting. I went to inquire about a just-produced one (eswiki):

+----------+---------------------+
| old_id   | old_flags           |
+----------+---------------------+
| 51956028 | utf-8,gzip,external |
| 51956027 | utf-8,gzip,external |
| 51956026 | utf-8,gzip,external |
| 51956025 | utf-8,gzip,external |
| 51956024 | utf-8,gzip,external |
| 51956023 | gzip,external       |
| 51956022 | utf-8,gzip,external |
| 51956021 | utf-8,gzip,external |
| 51956020 | utf-8,gzip,external |
| 51956019 | utf-8,gzip,external |
+----------+---------------------+

It turns out the gzip,external entry (apparently) has no corresponding revision:

select rev_id, rev_text_id, old_flags from revision join text on (rev_text_id=old_id) where rev_id <= 51506304 order by rev_id desc limit 10;

+----------+-------------+---------------------+
| rev_id   | rev_text_id | old_flags           |
+----------+-------------+---------------------+
| 51506304 | 51956026    | utf-8,gzip,external |
| 51506303 | 51956025    | utf-8,gzip,external |
| 51506302 | 51956024    | utf-8,gzip,external | <--
| 51506301 | 51956022    | utf-8,gzip,external | <--
| 51506300 | 51956021    | utf-8,gzip,external |
| 51506299 | 51956020    | utf-8,gzip,external |
| 51506298 | 51956019    | utf-8,gzip,external |
| 51506297 | 51956018    | utf-8,gzip,external |
| 51506296 | 51956016    | utf-8,gzip,external |
| 51506295 | 51956015    | utf-8,gzip,external |
+----------+-------------+---------------------+

Special:RecentChanges doesn't show anything suspicious around those two entries.
There were three AbuseFilter hits at that time (Especial:AbuseLog/1077805-1077807), which more or less correlates with the number of gzip,external entries. AbuseFilter is indeed storing items in the text table, *and not setting the utf-8 flag on them*. So I think all of these will turn out to be AbuseFilter hits that are UTF-8 but not marked as such, not content in a legacy encoding.
Yes, select afl_var_dump from abuse_filter_log order by afl_id desc confirms that suspicion.
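Assuming afl_var_dump holds a pointer into the text table (the 'stored-text:<old_id>' format used below is an assumption; verify against the actual column contents first), the unflagged rows could be enumerated with something like:

```sql
-- Sketch: count text rows referenced by AbuseFilter var dumps that are
-- missing the utf-8 flag. Assumes afl_var_dump references text.old_id in the
-- form 'stored-text:<old_id>' -- check the real format before relying on this.
SELECT COUNT(*)
FROM abuse_filter_log
JOIN text
  ON old_id = CAST(SUBSTRING(afl_var_dump, LENGTH('stored-text:') + 1) AS UNSIGNED)
WHERE afl_var_dump LIKE 'stored-text:%'
  AND old_flags NOT LIKE '%utf-8%';
```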
Good catch, Platonides! Looks like this is where AbuseFilter stores variable state dumps on filter matches so they can be inspected later. AbuseFilter::storeVarDump() and AbuseFilter::loadVarDump() are manually using the text table and ExternalStore -- this should probably go through a common interface underneath Revision's use of the same storage.

Since it doesn't use the Revision code paths right now, the var dumps won't be run through $wgLegacyEncoding conversion on load; but if they get refactored into a common code path, that conversion might start happening before a refactored AbuseFilter::loadVarDump gets its data back. This would not be fatal, but it would at least cause those old entries to display incorrectly on the sites[1] where the text contained non-ASCII characters.

[1] from http://noc.wikimedia.org/conf/highlight.php?file=InitialiseSettings.php

'wgLegacyEncoding' => array(
	'enwiki' => 'windows-1252',
	'dawiki' => 'windows-1252',
	'svwiki' => 'windows-1252',
	'nlwiki' => 'windows-1252',
	'dawiktionary' => 'windows-1252',
	'svwiktionary' => 'windows-1252',
	'default' => false, // all other sites will not attempt conversion
),

A bigger worry is that batch recompression or other maintenance work might try to renormalize those entries in a way that AbuseFilter's code path doesn't recognize. Switching to a common code path would avoid having to worry about new data formats (or about saving the wrong data format, as it currently does!), but we'd still have to fix the old data entries or devise a workaround.
Fixing those should be easy enough. What about also adding an abusefilter flag to them? That way maintenance scripts could at least easily recognise and skip them.
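A repair along those lines might look like the sketch below: it adds the missing utf-8 flag and tags the rows with a hypothetical abusefilter flag. Same caveat as before about the assumed 'stored-text:<old_id>' pointer format, and the loaders would need to be taught to ignore the new flag, so this should be dry-run with a SELECT first:

```sql
-- Sketch: mark AbuseFilter var-dump rows as utf-8 and tag them with a
-- hypothetical 'abusefilter' flag so recompression jobs can recognise and
-- skip them. The pointer format is an assumption.
UPDATE text
SET old_flags = CONCAT(old_flags, ',utf-8,abusefilter')
WHERE old_flags NOT LIKE '%utf-8%'
  AND old_id IN (
    SELECT CAST(SUBSTRING(afl_var_dump, LENGTH('stored-text:') + 1) AS UNSIGNED)
    FROM abuse_filter_log
    WHERE afl_var_dump LIKE 'stored-text:%'
  );
```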