amarao January 4, 2017 at 17:50

The story of one bug (# 1653967)

Abstract : A real story from the life of real administrators on catching an idiotic bug.
The instructive part: Never underestimate dependency dependencies.

Introduction

In-house upgrade in the laboratory from Openstack Mitaka to Openstack Newton (newer version). Several deprecated options in the configuration files, keystone moved from eventlet to WSGI and broke the existing configuration with haproxy; because of the typical “ipv6 listen”, apache did not conflict with haproxy for the same ports used on the star (one listened to ipv6, the other ipv4 only), so requests went to haproxy instead of apache, where they died from 503, because there was no upstream ... However, the story is not about that.

After the main problems were sorted out, Nova (one of the components of Openstack) started to crash when starting:ConfigFileValueError: Value for option url is not valid: invalid URI: 'http://neutron-server.example.com:21345'.. It was very strange. Considering that 100,500 options have been changed in the config, there is a suspicion that we are using an outdated option that should no longer be used. However, the documentation said that an example of an option is url = http://controller:9696.

Debugging

Obvious debugging steps:

Comment out option - doesn't fall
Repeat the option from the example - does not fall
Replace the port option with "ours" - perhaps you can not use too large a port number - it does not fall
Replace url options with ours - crashes
To return "controller" into place - does not fall
Suspicion: fqdn cannot: replace controller with controller.dns - doesn't crash
Suspicion: too many points (we had 8 points in the url in real code) - controller.dns1.dns2.dns3.dns4 - does not fall
Leave only the first part of our name: http://neutron-server:9696- falls! the hypothesis is already clear.
Check1: http://neutronserver:9696- does not fall
Check2: http://with-dashes:9696- is falling!

In total, a bug: the presence of a hyphen in the hostname raises a ConfigFileValueError. A bug is reported: bugs.launchpad.net/ubuntu/+source/nova/+bug/1653967

Check that this is a bug: RFC3986 states that:

unreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~"
reg-name      = *( unreserved / pct-encoded / sub-delims ) 
host          = IP-literal / IPv4address / reg-name

(this is a BNF notation that says a hyphen can be used on host).

We all know this anyway, but it's always better to double-check.

Examining the code that reports the error:

            try:
                return convert(opt._get_from_namespace(namespace, group_name))
            except KeyError:  # nosec: Valid control flow instruction
                pass
            except ValueError as ve:
                raise ConfigFileValueError(
                    "Value for option %s is not valid: %s"
                    % (opt.name, str(ve)))

The error occurred on two options: url and novncproxy_base_url. The error is identical, although grep is more convenient in the second. We begin to look for the second. Here's how it is defined in code:

    cfg.URIOpt(
        'novncproxy_base_url',
        default='http://127.0.0.1:6080/vnc_auto.html',
        deprecated_group='DEFAULT',
        help="""

Yeah. And cfg is from from oslo_config import cfg. oslo.config is the Openstack library for working with configs. We look raw.

We see:

class URI(ConfigType):
...
    def __call__(self, value):
        if not rfc3986.is_valid_uri(value, require_scheme=True,
                                    require_authority=True):
            raise ValueError('invalid URI: %r' % value)

Suddenly:

>>> import rfc3986
>>> rfc3986.is_valid_uri('http://test.com')
True
>>> rfc3986.is_valid_uri('http://test-test.com')
False

Oops The mess. But there is: github.com/sigmavirus24/rfc3986/issues/11 The
bug has been fixed for a long time. In version 0.2.2. And on the host we have:

apt-cache policy python-rfc3986
python-rfc3986:
  Installed: 0.2.0-2
  Candidate: 0.2.0-2
  Version table:
 *** 0.2.0-2 500
        500 http://archive.ubuntu.com/ubuntu xenial/main amd64 Packages
        100 /var/lib/dpkg/status

But in a more recent version in zesty there is version 0.3.1-2, which does not suffer from such a problem.

Further proceedings

Once upon a time a bug was made. He was there for a while, then he was fixed. But during this time, the Code in which the Bug was located was blocked, and no one paid attention to the Fix Bug, and the version with the Bug remained in the deb repository for years. She didn't care about anyone - until two commits in oslo.config and nova happened:

commit 45ee2bed52a57b9801435b43ad45d8f50204580d
Author: Masaki Matsushita 
Date:   Mon Sep 28 20:28:28 2015 +0900
    Add URIOpt
    This change add URIOpt which validates string as URI.
    Closes-Bug: #1500398
    Change-Id: Ie8736b8654b9feb2a2b174159f08dbea03568d84

commit 6091de77eda12286786e28ae4f0779e7efc54634
Author: Maciej Szankin 
Date:   Thu Jul 28 10:30:59 2016 -0500
    Improve consistency in VNC opts
    * Updated header flags
    * Moved all vars to list
    * Removed possible values and related options sections where they were not
      needed
    * Changed IntOpt to PortOpt where needed
    Change-Id: I3255a867091f8e14c907c7fde9a2aa3abc249ae9
    Implements: Blueprint centralize-config-options-newton

This commit was made from StrOpt by UriOpt and started using (via oslo.conf) python-rfc3986. Due to the fact that the old version of python-rfc3986 was blocked, an unexpected regression occurred in the software.

Bonus: how will we fix it

Usually in such cases, if upgrading to a newer version is easy (and does not cause other problems), then we simply take the package from a newer version of the distribution kit (in this case, zesty that has not yet been released, aka ubuntu-17.04). We will put it in our private repository running aptly (as is) and will use it when installing / configuring the server. If such a package did not exist in nature, we would configure the CI job to package it and publish it (to the aptly repository). If this option were not available (for example, incompatible changes), then we would add another patch to our patchqueue for nova that would make StrOpt instead of UriOpt. This implies that we will be rebuilding nova from the ubuntu package with our own patches. This is done by CI, which publishes packages to our very own private repository.

A little flame

And how would this problem be solved in a proprietary environment? Everyone makes mistakes (otherwise we would have software without bugs). After the error was fixed in support of the first level, after the butting about the installed versions, updates and contracts, it would reach the support of the second level, third level, and so on up to a person with real qualifications who can look into the code. He found and fixed the problem. What is the estimate for that bug fix? Two hours to the first level, another hour to the second, business day to research the problem, another business day to fix, maybe another day to release and test. This is the perfect scenario. In practice, my most optimistic estimates speak of the weeks turning into "the next release in half a year, corrected."

How much did it take for me, in the opensource project, to fix the problem on my own? ~ 14:30, today the problem was discovered, and I reported it to the launchpad. At 15:20, it was already known about the problem with addiction, at 15:30 it was verified that there was no such problem with the new version of python-rfc3986. At 16:50 (according to the Cypriot time) I finish writing this post on Habr.

Tags: