The story of one bug (# 1653967)

    Abstract : A real story from the life of real administrators on catching an idiotic bug.
    The instructive part: Never underestimate dependency dependencies.

    Introduction


    In-house upgrade in the laboratory from Openstack Mitaka to Openstack Newton (newer version). Several deprecated options in the configuration files, keystone moved from eventlet to WSGI and broke the existing configuration with haproxy; because of the typical “ipv6 listen”, apache did not conflict with haproxy for the same ports used on the star (one listened to ipv6, the other ipv4 only), so requests went to haproxy instead of apache, where they died from 503, because there was no upstream ... However, the story is not about that.

    After the main problems were sorted out, Nova (one of the components of Openstack) started to crash when starting:ConfigFileValueError: Value for option url is not valid: invalid URI: 'http://neutron-server.example.com:21345'.. It was very strange. Considering that 100,500 options have been changed in the config, there is a suspicion that we are using an outdated option that should no longer be used. However, the documentation said that an example of an option is url = http://controller:9696.

    Debugging


    Obvious debugging steps:
    • Comment out option - doesn't fall
    • Repeat the option from the example - does not fall
    • Replace the port option with "ours" - perhaps you can not use too large a port number - it does not fall
    • Replace url options with ours - crashes
    • To return "controller" into place - does not fall
    • Suspicion: fqdn cannot: replace controller with controller.dns - doesn't crash
    • Suspicion: too many points (we had 8 points in the url in real code) - controller.dns1.dns2.dns3.dns4 - does not fall
    • Leave only the first part of our name: http://neutron-server:9696- falls! the hypothesis is already clear.
    • Check1: http://neutronserver:9696- does not fall
    • Check2: http://with-dashes:9696- is falling!

    In total, a bug: the presence of a hyphen in the hostname raises a ConfigFileValueError. A bug is reported: bugs.launchpad.net/ubuntu/+source/nova/+bug/1653967

    Check that this is a bug: RFC3986 states that:
    unreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~"
    reg-name      = *( unreserved / pct-encoded / sub-delims ) 
    host          = IP-literal / IPv4address / reg-name
    

    (this is a BNF notation that says a hyphen can be used on host).

    We all know this anyway, but it's always better to double-check.

    Examining the code that reports the error:

                try:
                    return convert(opt._get_from_namespace(namespace, group_name))
                except KeyError:  # nosec: Valid control flow instruction
                    pass
                except ValueError as ve:
                    raise ConfigFileValueError(
                        "Value for option %s is not valid: %s"
                        % (opt.name, str(ve)))
    

    The error occurred on two options: url and novncproxy_base_url. The error is identical, although grep is more convenient in the second. We begin to look for the second. Here's how it is defined in code:

        cfg.URIOpt(
            'novncproxy_base_url',
            default='http://127.0.0.1:6080/vnc_auto.html',
            deprecated_group='DEFAULT',
            help="""
    

    Yeah. And cfg is from from oslo_config import cfg. oslo.config is the Openstack library for working with configs. We look raw.

    We see:

    class URI(ConfigType):
    ...
        def __call__(self, value):
            if not rfc3986.is_valid_uri(value, require_scheme=True,
                                        require_authority=True):
                raise ValueError('invalid URI: %r' % value)
    

    Suddenly:

    >>> import rfc3986
    >>> rfc3986.is_valid_uri('http://test.com')
    True
    >>> rfc3986.is_valid_uri('http://test-test.com')
    False
    

    Oops The mess. But there is: github.com/sigmavirus24/rfc3986/issues/11 The
    bug has been fixed for a long time. In version 0.2.2. And on the host we have:

    apt-cache policy python-rfc3986
    python-rfc3986:
      Installed: 0.2.0-2
      Candidate: 0.2.0-2
      Version table:
     *** 0.2.0-2 500
            500 http://archive.ubuntu.com/ubuntu xenial/main amd64 Packages
            100 /var/lib/dpkg/status
    

    But in a more recent version in zesty there is version 0.3.1-2, which does not suffer from such a problem.

    Further proceedings


    Once upon a time a bug was made. He was there for a while, then he was fixed. But during this time, the Code in which the Bug was located was blocked, and no one paid attention to the Fix Bug, and the version with the Bug remained in the deb repository for years. She didn't care about anyone - until two commits in oslo.config and nova happened:

    commit 45ee2bed52a57b9801435b43ad45d8f50204580d
    Author: Masaki Matsushita 
    Date:   Mon Sep 28 20:28:28 2015 +0900
        Add URIOpt
        This change add URIOpt which validates string as URI.
        Closes-Bug: #1500398
        Change-Id: Ie8736b8654b9feb2a2b174159f08dbea03568d84
    

    commit 6091de77eda12286786e28ae4f0779e7efc54634
    Author: Maciej Szankin 
    Date:   Thu Jul 28 10:30:59 2016 -0500
        Improve consistency in VNC opts
        * Updated header flags
        * Moved all vars to list
        * Removed possible values and related options sections where they were not
          needed
        * Changed IntOpt to PortOpt where needed
        Change-Id: I3255a867091f8e14c907c7fde9a2aa3abc249ae9
        Implements: Blueprint centralize-config-options-newton
    

    This commit was made from StrOpt by UriOpt and started using (via oslo.conf) python-rfc3986. Due to the fact that the old version of python-rfc3986 was blocked, an unexpected regression occurred in the software.

    Bonus: how will we fix it


    Usually in such cases, if upgrading to a newer version is easy (and does not cause other problems), then we simply take the package from a newer version of the distribution kit (in this case, zesty that has not yet been released, aka ubuntu-17.04). We will put it in our private repository running aptly (as is) and will use it when installing / configuring the server. If such a package did not exist in nature, we would configure the CI job to package it and publish it (to the aptly repository). If this option were not available (for example, incompatible changes), then we would add another patch to our patchqueue for nova that would make StrOpt instead of UriOpt. This implies that we will be rebuilding nova from the ubuntu package with our own patches. This is done by CI, which publishes packages to our very own private repository.

    A little flame


    And how would this problem be solved in a proprietary environment? Everyone makes mistakes (otherwise we would have software without bugs). After the error was fixed in support of the first level, after the butting about the installed versions, updates and contracts, it would reach the support of the second level, third level, and so on up to a person with real qualifications who can look into the code. He found and fixed the problem. What is the estimate for that bug fix? Two hours to the first level, another hour to the second, business day to research the problem, another business day to fix, maybe another day to release and test. This is the perfect scenario. In practice, my most optimistic estimates speak of the weeks turning into "the next release in half a year, corrected."

    How much did it take for me, in the opensource project, to fix the problem on my own? ~ 14:30, today the problem was discovered, and I reported it to the launchpad. At 15:20, it was already known about the problem with addiction, at 15:30 it was verified that there was no such problem with the new version of python-rfc3986. At 16:50 (according to the Cypriot time) I finish writing this post on Habr.

    Also popular now: