Last modified: 2014-07-08 12:01:25 UTC

Wikimedia Bugzilla is closed!

Wikimedia migrated from Bugzilla to Phabricator. Bug reports are handled in Wikimedia Phabricator.
This static website is read-only and kept for historical purposes. It is not possible to log in, and apart from displaying bug reports and their history, links might be broken. See T55895, the corresponding Phabricator task, for complete and up-to-date bug report information.
Bug 53895 - (hash-mismatch) Gerrit SSH: Intermittent key_verify failed for server_host_key and 'hash mismatch'
Status: RESOLVED FIXED
Product: Wikimedia
Classification: Unclassified
Component: Git/Gerrit
Version: wmf-deployment
Hardware: All All
Importance: High normal with 1 vote
Target Milestone: ---
Assigned To: Nobody - You can work on this!
Duplicates: 57483
Depends on:
Blocks:

Reported: 2013-09-07 20:11 UTC by Raimond Spekking
Modified: 2014-07-08 12:01 UTC
CC List: 16 users

Description Raimond Spekking 2013-09-07 20:11:23 UTC
On translatewiki.net, while running the repoupdate script, the script randomly bails out with:

hash mismatch
key_verify failed for server_host_key
fatal: The remote end hung up unexpectedly
error: Could not fetch origin

This has been happening since the migration of Gerrit to the new server two days ago.
Comment 1 Chad H. 2013-09-08 03:32:09 UTC
We retained the same key for exactly this reason...
Comment 2 Chad H. 2013-09-09 22:14:50 UTC
So, this sounds like you've got an old entry in your known_hosts files pointing to the old box.

We changed IP addresses when moving servers (shouldn't have to ever happen again), so please check your known_hosts for any outdated entries that you can remove.
Comment 3 Siebrand Mazeland 2013-09-09 23:23:49 UTC
(In reply to comment #2)
> We changed IP addresses when moving servers (shouldn't have to ever happen
> again), so please check your known_hosts for any outdated entries that you
> can remove.

How can I identify outdated entries? There are no IP addresses in known_hosts. Sample entry:

 |1|umKi+qzw6pf8uXi/Z6/KtqlisCw=|YFoX/CdDjXhcVUVJ803EiP9nyro= ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAIEA2JmNg8ir9QvWwmS/C2k0PEqty1O26D0Nq24YGKC5jq1cr/0a92Pk7wa9FMMM/2O88bbe6rXZUPBKzDX1vVtYD+5vR4/c1XTnHWlNJ9sd6xSYjHhznqYs81VnjGMCLMPV1GhlIfUZsnQ+w1FaQUvJe39TEtwADA7ZOFAfT0M/Oqk=
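
For reference, hashed known_hosts entries can still be looked up by host name. The following is a minimal sketch, assuming OpenSSH's ssh-keygen is installed. The function names find_known_host and forget_known_host are made up for this illustration. Later comments in this report trace the actual failure to a different cause, so this only answers the lookup question.

```shell
# Print the known_hosts entries (hashed or plain) that match a host name.
# Arguments: host name, optional path to a known_hosts file.
find_known_host() {
    ssh-keygen -F "$1" -f "${2:-$HOME/.ssh/known_hosts}"
}

# Remove every entry for a host name from a known_hosts file.
# ssh-keygen keeps the previous contents in a backup file next to it.
# Arguments: host name, optional path to a known_hosts file.
forget_known_host() {
    ssh-keygen -R "$1" -f "${2:-$HOME/.ssh/known_hosts}"
}
```

Entries for a non-default port are stored in bracketed form, so for Gerrit's SSH port the lookup would presumably be find_known_host '[gerrit.wikimedia.org]:29418'.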
Comment 4 Raimond Spekking 2013-09-21 19:30:42 UTC
Still seeing this error randomly.
Comment 5 Ori Livneh 2013-09-26 09:05:31 UTC
Several reports of this in the last few days. Reporters include Krenair, YuviPanda, and Krinkle.
Comment 6 Yuvi Panda 2013-09-26 09:06:41 UTC
(Worked for me when I tried again)
Comment 7 Ori Livneh 2013-10-10 07:01:50 UTC
Just happened to me w/operations/puppet.

$ git pull
hash mismatch
key_verify failed for server_host_key
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.
Comment 8 Siebrand Mazeland 2013-10-10 11:25:34 UTC
(In reply to comment #7)
> Just happened to me w/operations/puppet.
> 
> $ git pull
> hash mismatch
> key_verify failed for server_host_key
> fatal: Could not read from remote repository.
> 
> Please make sure you have the correct access rights
> and the repository exists.

We see similar errors very regularly when updating 600 or so extension repos at translatewiki.net. I'm pretty certain that we have the correct access rights with L10n-bot, have the correct access rights at the local machine, and have consistent scripting to update the repos.

A run I did just now resulted in the following errors:

Permission denied (publickey).
fatal: The remote end hung up unexpectedly
error: Could not fetch origin
/resources/siebrand/mediawiki-extensions/extensions/CategoryMagicWords failed to update

Permission denied (publickey).
fatal: The remote end hung up unexpectedly
error: Could not fetch origin
/resources/siebrand/mediawiki-extensions/extensions/ReplaceSet failed to update

Just to make sure that it wasn't me configuring the two above repos incorrectly, I ran the updates again. This time with the following result:

Permission denied (publickey).
fatal: The remote end hung up unexpectedly
error: Could not fetch origin
/resources/siebrand/mediawiki-extensions/extensions/DidYouKnow failed to update

Permission denied (publickey).
fatal: The remote end hung up unexpectedly
error: Could not fetch gerrit
/resources/siebrand/mediawiki-extensions/extensions/FormatDates failed to update

Permission denied (publickey).
fatal: The remote end hung up unexpectedly
error: Could not fetch gerrit
/resources/siebrand/mediawiki-extensions/extensions/GoogleDocTag failed to update

Permission denied (publickey).
fatal: The remote end hung up unexpectedly
error: Could not fetch origin
/resources/siebrand/mediawiki-extensions/extensions/InviteSignup failed to update

Permission denied (publickey).
fatal: The remote end hung up unexpectedly
error: Could not fetch origin
/resources/siebrand/mediawiki-extensions/extensions/LightweightRDFa failed to update

Permission denied (publickey).
fatal: The remote end hung up unexpectedly
error: Could not fetch origin
Permission denied (publickey).
fatal: The remote end hung up unexpectedly
error: Could not fetch gerrit
/resources/siebrand/mediawiki-extensions/extensions/Numbertext failed to update

/resources/siebrand/mediawiki-extensions/extensions/NumberOfWikis failed to update

Permission denied (publickey).
fatal: The remote end hung up unexpectedly
error: Could not fetch gerrit
/resources/siebrand/mediawiki-extensions/extensions/PageLanguage failed to update

Permission denied (publickey).
fatal: The remote end hung up unexpectedly
error: Could not fetch origin
/resources/siebrand/mediawiki-extensions/extensions/SidebarDonateBox failed to update

hash mismatch
key_verify failed for server_host_key
fatal: The remote end hung up unexpectedly
error: Could not fetch gerrit
/resources/siebrand/mediawiki-extensions/extensions/UserStatus failed to update

Permission denied (publickey).
fatal: The remote end hung up unexpectedly
error: Could not fetch gerrit
/resources/siebrand/mediawiki-extensions/extensions/VersionView failed to update



For comparison, when updating repos on localhost from GitHub, I've not seen a similar error once.
Comment 9 Antoine "hashar" Musso (WMF) 2013-12-17 14:32:30 UTC
*** Bug 57483 has been marked as a duplicate of this bug. ***
Comment 10 Antoine "hashar" Musso (WMF) 2013-12-17 14:33:29 UTC
This happens once or twice per day on Zuul. Usually these are "hash mismatch" errors, though we had some "host key verification failed" errors on Nov 20th.
Comment 11 Niklas Laxström 2013-12-17 14:57:52 UTC
Also got one today on the command line.

FYI: the -1s in Jenkins caused by this are very confusing.
Comment 12 Ed Sanders 2014-01-12 14:47:58 UTC
Is this related: https://gerrit.wikimedia.org/r/#/c/107036/ ?
Comment 13 Antoine "hashar" Musso (WMF) 2014-01-13 10:46:19 UTC
(In reply to comment #12)
> Is this related: https://gerrit.wikimedia.org/r/#/c/107036/ ?

Looking at the Zuul debug log on gallium.wikimedia.org, it is a different issue. Filed bug 59991 for it. Seems to be an issue in the python git module.
Comment 14 Antoine "hashar" Musso (WMF) 2014-02-06 14:24:05 UTC
Another example, this time with the job that syncs VisualEditor in mediawiki/extensions.git. The merge of https://gerrit.wikimedia.org/r/#/c/111608/ triggered job http://integration.wikimedia.org/ci/job/mwext-VisualEditor-sync-gerrit/61/console which shows:

  ssh -i /var/lib/jenkins/.ssh/jenkins-mwext-sync_id_rsa \
    -p 29418 jenkins-mwext-sync@gerrit.wikimedia.org \
    'gerrit review --code-review +2 --verified +2 --submit b519550809bba725b017281fe6c33c4c2fd123c1'
  hash mismatch
  key_verify failed for server_host_key
Comment 15 Bartosz Dziewoński 2014-02-16 18:59:29 UTC
This continues to happen, nearly daily.

You could probably get a good list of affected changesets by grepping logs of #wikimedia-dev for my name and "ignore jenkins" :/
Comment 16 Bartosz Dziewoński 2014-06-11 21:24:05 UTC
Subsided for a while, then started happening a bit more often for me locally. Example in Gerrit from today: https://gerrit.wikimedia.org/r/#/c/138992/1
Comment 17 Marcin Cieślak 2014-06-11 23:56:41 UTC
Could somebody tcpdump it? It seems to me more like a broken (suddenly terminated) connection, probably occurring (mostly) early in the SSH negotiation phase.
Comment 18 Bartosz Dziewoński 2014-06-12 15:40:45 UTC
Today again: https://gerrit.wikimedia.org/r/#/c/139047/
Comment 19 Bartosz Dziewoński 2014-06-17 00:39:52 UTC
Two examples from just today:
* https://gerrit.wikimedia.org/r/#/c/139807/
* https://gerrit.wikimedia.org/r/#/c/140046/
Comment 20 Antoine "hashar" Musso (WMF) 2014-06-19 11:02:45 UTC
Bartosz: there is no need for more examples. We have traces of those errors in the Zuul log, and it happens a couple of times per day.

Marcin: we could tcpdump it if only we had a way to reliably reproduce the issue :-(
Comment 21 Paul Bourke 2014-06-19 11:11:20 UTC
Hi, I've been able to reproduce this on a local Gerrit instance quite reliably by running the following:

while true; do ssh <gerrit> -p 29418; done

A workaround that does work is to use the Bouncy Castle crypto library. See the following thread for more info: https://groups.google.com/forum/#!topic/repo-discuss/JE7OM6o7DMs
Comment 22 Krinkle 2014-06-27 13:35:10 UTC
The Google Groups topic mentions this issue in Apache mina-sshd (upstream of Gerrit):

https://issues.apache.org/jira/browse/SSHD-330

Which has been fixed in https://git-wip-us.apache.org/repos/asf?p=mina-sshd.git;a=commit;h=2aed686bdb21681a421033c6ee5997e5cd8a9a83

If that is indeed the root issue, we need them to make a minor release and Gerrit to upgrade to it.
Comment 23 christian 2014-07-01 15:31:34 UTC
The description of the SSHD-330 issue explains pretty much every aspect of
the bug that we experienced.

From its sporadic nature to the way some people could reproduce it, but others
couldn't.

I'll see to preparing a new gerrit release ... hopefully we can get something
deployed around that.
Comment 24 Gerrit Notification Bot 2014-07-01 19:49:07 UTC
Change 143388 had a related patch set uploaded by QChris:
Upgrade sshd to include the fix for hash mismatch

https://gerrit.wikimedia.org/r/143388
Comment 25 Antoine "hashar" Musso (WMF) 2014-07-02 08:40:40 UTC
Christian, could you possibly provide a gerrit.war that has the patch? I would like to test it out on the labs instance I am using for CI dev. Thanks!
Comment 26 christian 2014-07-02 09:12:05 UTC
(In reply to Antoine "hashar" Musso from comment #25)
> Christian, could you possibly provide a gerrit.war that has the patch?

Sure. For the next 2 weeks, you can fetch it from

  http://quelltextlich.at/gerrit-2.8.1-4-ga1048ce.war

> I
> would like to test it out on the labs instance I am using for CI dev.

Seeing the description of SSHD-330 allowed me to come up with an environment
in which the bug can be reproduced. There, our deployed gerrit war failed for
14 of 10000 connection attempts. The war I linked above showed 0 failures for
10000 connection attempts.
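
A measurement like this could be scripted roughly as follows. This is only a sketch, not the script that was actually used. The function name count_hash_mismatches is hypothetical, and it assumes that a failing attempt prints "hash mismatch" on stderr, as in the reports above.

```shell
# Run a command repeatedly and count the attempts whose stderr contains
# "hash mismatch". Arguments: number of attempts, then the command to run.
count_hash_mismatches() {
    attempts=$1
    shift
    failures=0
    i=0
    while [ "$i" -lt "$attempts" ]; do
        # Send stderr into the pipe and discard stdout, so that only
        # error output is searched.
        if "$@" 2>&1 >/dev/null </dev/null | grep -q 'hash mismatch'; then
            failures=$((failures + 1))
        fi
        i=$((i + 1))
    done
    printf '%s of %s attempts failed with hash mismatch\n' "$failures" "$attempts"
}

# Example invocation (host and port are placeholders for a test instance):
#   count_hash_mismatches 10000 ssh -o BatchMode=yes -p 29418 localhost
```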

^d already said he'll discuss deploying the war with greg-g. So we'll
hopefully see it live soon.
Comment 27 Antoine "hashar" Musso (WMF) 2014-07-02 09:31:58 UTC
I have upgraded Gerrit on my test instance integration-dev.eqiad.wmflabs. No hash mismatch is triggered anymore when running the following for a while:

while true; do ssh -p 29418 localhost; done;
Comment 28 christian 2014-07-02 13:58:04 UTC
Since this bug has been around for a while and has affected quite some
people, I've been asked to give a short explanation of the root issue
and what SSHD-330 does.

Gerrit uses Apache Mina's SSHD [1] as ssh server. When connecting to
gerrit through ssh, this ssh server uses Java's own crypto/security
implementations to negotiate session keys (i.e.: different for each
connection attempt) with the client. Java's default provider yielded
those session keys without leading zero bytes, and Apache Mina's SSHD
relied on no leading zero bytes being present.

But at some point Java [2] changed behaviour and is no longer
stripping leading zero bytes, but Apache Mina SSHD still relied on no
leading zero bytes being present. Hence assumptions mismatched and
caused the issue.

The Java we use at gerrit.wikimedia.org is recent enough to no longer
strip leading zero bytes. So when connecting to our gerrit through
ssh, either

* the negotiated session key starts with a non-zero byte, and
  everything works nicely. This case happens most of the time.

* the negotiated session key starts with a zero byte. Then gerrit's
  built-in Apache Mina SSHD falsely treats the key as if there were no
  leading zero bytes and the connection setup with the client fails.

SSHD-330 adds stripping of leading zero bytes from the session key to
Apache Mina SSHD and thereby fixes the issue we are seeing.
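
To illustrate the normalization described above, here is a small shell sketch that works on hex strings rather than on the real key exchange. It is not the SSHD-330 patch, which is Java code in Apache Mina SSHD. The function names strip_leading_zero_bytes and same_secret are hypothetical.

```shell
# Drop leading zero bytes ("00" pairs) from a hex-encoded byte string,
# always keeping at least one byte. Assumes an even number of hex digits.
strip_leading_zero_bytes() {
    hex=$1
    while [ "${#hex}" -gt 2 ] && [ "${hex#00}" != "$hex" ]; do
        hex=${hex#00}
    done
    printf '%s\n' "$hex"
}

# Succeed if two hex-encoded values are equal once leading zero bytes
# are ignored on both sides.
same_secret() {
    [ "$(strip_leading_zero_bytes "$1")" = "$(strip_leading_zero_bytes "$2")" ]
}
```

With this, same_secret 00a1b2 a1b2 succeeds, while a plain comparison of the two strings would not. That is the kind of mismatch between the two sides' assumptions described above.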

------

There was recently some FUD around OpenSSL generated keys not being
affected. That did not work for me, and I do not see in code how this
would make a difference.

Also, there was some recent discussion around extracting the keys from
the keystore to proper files. I did not get a chance to try that, but
that could do the trick too ... indirectly.
Because in order to get gerrit to use keys from separate files, one
needs to install BouncyCastle libraries to gerrit. BouncyCastle will
act as provider for the needed security/crypto functionality and
get used instead of Java's default providers. As the BouncyCastle
providers (for now) also strip the leading zero bytes, that could
work out.

Regardless, having Apache Mina SSHD strip leading zero bytes seems the
most reliable approach, so we backported Apache Mina SSHD's upstream fix to
the version used in our gerrit, and rebuilt gerrit using that custom-built
Apache Mina SSHD.

[1] https://mina.apache.org/sshd-project/
[2] I know that OpenJDK versions up to
  OpenJDK Runtime Environment (IcedTea7 2.2.1) (Gentoo build 1.7.0_05-b21)
work and the default providers strip the leading zeros, while the ones from
  OpenJDK Runtime Environment (IcedTea 2.4.7) (7u55-2.4.7-1ubuntu1~0.12.04.2)
do not strip them.


Thanks Krinkle for the pointer to SSHD-330!
Comment 29 Gerrit Notification Bot 2014-07-02 22:00:17 UTC
Change 143388 merged by Chad:
Upgrade sshd to include the fix for hash mismatch

https://gerrit.wikimedia.org/r/143388
Comment 30 christian 2014-07-08 07:51:56 UTC
The fix has been deployed at gerrit.wikimedia.org.
Comment 31 Bartosz Dziewoński 2014-07-08 11:55:38 UTC
<3
Comment 32 Ori Livneh 2014-07-08 12:01:25 UTC
(In reply to christian from comment #28)
And thank you for the analysis and the informative summary -- well done!
