Fedora Quality Planet

Getting started with Pagure CI

Posted by Adam Williamson on February 17, 2017 06:50 AM

I spent a few hours today setting up a couple of the projects I look after, fedfind and resultsdb_conventions, to use Pagure CI. It was surprisingly easy! Many thanks to Pingou and Lubomir for working on this, and of course Kevin for helping me out with the Jenkins side.

You really do just have to request a Jenkins project and then follow the instructions. I followed the step-by-step, submitted a pull request, and everything worked first time. So the interesting part for me was figuring out exactly what to run in the Jenkins job.

The instructions get you to the point where you’re in a checkout of the git repository with the pull request applied, and then you get to do…whatever you can given what you’re allowed to do in the Jenkins builder environment. That doesn’t include installing packages or running mock. So I figured what I’d do for my projects – which are both Python – is set up a good tox profile. With all the stuff discussed below, the actual test command in the Jenkins job – after the boilerplate from the guide that checks out and merges the pull request – is simply tox.

First things first, the infra Jenkins builders didn’t have tox installed, so Kevin kindly fixed that for me. I also convinced him to install all the variant Python version packages – python26, and the non-native Python 3 packages – on each of the Fedora builders, so I can be confident I get pretty much the same tox run no matter which of the builders the job winds up on.

Of course, one thing worth noting at this point is that tox installs all dependencies from PyPI: if something your code depends on isn’t in there (or installed on the Jenkins builders), you’ll be stuck. So another thing I got to do was start publishing fedfind on PyPI! That was pretty easy, though I did wind up cribbing a neat trick from this PyPI issue so I can keep my README in Markdown format but have setup.py convert it to rst when using it as the long_description for PyPI, so it shows up properly formatted, as long as pypandoc is installed (but still works even if it isn’t, so you don’t need pandoc just to install the project).
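
The gist of the trick is a conditional conversion in setup.py – a sketch, assuming the README lives at README.md in the repo root:

try:
    import pypandoc
    # convert the Markdown README to reStructuredText for PyPI
    longdesc = pypandoc.convert('README.md', 'rst')
except (ImportError, OSError):
    # no pypandoc (or no pandoc binary): fall back to the raw
    # Markdown, which installs fine but renders less prettily on PyPI
    with open('README.md') as readmefh:
        longdesc = readmefh.read()

# ...and then pass long_description=longdesc to setup()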

After playing with it for a bit, I figured out that what I really wanted was to have two workflows. One is to run just the core test suite, without any unnecessary dependencies, with python setup.py test – this is important when building RPM packages, to make sure the tests pass in the exact environment the package is built in (and for). And then I wanted to be able to run the tests across multiple environments, with coverage and linting, in the CI workflow. There’s no point running code coverage or a linter while building RPMs, but you certainly want to do it for code changes.

So I put the install, test and CI requirements into three separate text files in each repo – install.requires, tests.requires and tox.requires – and adjusted the setup.py files to do this in their setup():

install_requires = open('install.requires').read().splitlines(),
tests_require = open('tests.requires').read().splitlines(),
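
For illustration, the split for fedfind looks roughly like this (abbreviated; the pinned py26 variants of these files show up later in this post):

[install.requires]
cached-property
productmd
setuptools
six

[tests.requires]
mock
pytest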

In tox.ini I started with this:

deps=-r{toxinidir}/install.requires
     -r{toxinidir}/tests.requires
     -r{toxinidir}/tox.requires

so the tox runs get the extra dependencies. I usually write pytest tests, so to start with in tox.ini I just had this command:

commands=py.test

Pytest integration for setuptools can be done in various ways, but I use this one. Add a class to setup.py:

import sys
from setuptools import setup, find_packages
from setuptools.command.test import test as TestCommand

class PyTest(TestCommand):
    user_options = [('pytest-args=', 'a', "Arguments to pass to py.test")]

    def initialize_options(self):
        TestCommand.initialize_options(self)
        self.pytest_args = ''
        self.test_suite = 'tests'

    def run_tests(self):
        # import here, because outside the eggs aren't loaded
        import pytest
        errno = pytest.main(self.pytest_args.split())
        sys.exit(errno)

and then this line in setup():

cmdclass = {'test': PyTest},
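
At this point the whole tox.ini is still tiny – something like this illustrative sketch (not the exact file; fedfind's real envlist shows up later in this post):

[tox]
envlist = py27,py36
skip_missing_interpreters=true
[testenv]
deps=-r{toxinidir}/install.requires
     -r{toxinidir}/tests.requires
     -r{toxinidir}/tox.requires
commands=py.test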

And that’s about the basic shape of it. With an envlist, we get the core tests running both through tox and setup.py. But we can do better! Let’s add some extra deps to tox.requires:

coverage
diff-cover
pylint
pytest-cov

and tweak the commands in tox.ini:

commands=py.test --cov-report term-missing --cov-report xml --cov fedfind
         diff-cover coverage.xml --fail-under=90
         diff-quality --violations=pylint --fail-under=90

By adding a few args to our py.test call we get a coverage report for our library with the pull request applied. The subsequent commands use the neat diff_cover tool to add some more information. diff-cover basically takes the full coverage report (coverage.xml is produced by --cov-report xml) and considers only the lines that are touched by the pull request; the --fail-under arg tells it to fail if there is less than 90% coverage of the modified lines. diff-quality runs a linter (in this case, pylint) on the code and, again, considers only the lines changed by the pull request. As you might expect, --fail-under=90 tells it to fail if the ‘quality’ of the changed code is below 90% (it normalizes all the linter scores to a percentage scale, so that really means a pylint score of less than 9.0).

So without messing around with shipping all our stuff off to hosted services, we get a pretty decent indicator of the test coverage and code quality of the pull request, and it shows up as failing tests if they’re not good enough.

It’s kind of overkill to run the coverage and linter on all the tested Python environments, but it is useful to do it at least on both Python 2 and 3, since the pylint results may differ, and the code might hit different paths. Running them on every minor version isn’t really necessary, but it doesn’t take that long so I’m not going to sweat it too much.

But that does bring me to the last refinement I made, because you can vary what tox does in different environments. One thing I wanted for fedfind was to run the tests not just on Python 2.6, but with the ancient versions of several dependencies that are found in RHEL / EPEL 6. And there’s also an interesting bug in pylint which makes it crash when running on fedfind under Python 3.6. So my tox.ini really looks like this:

[tox]
envlist = py26,py27,py34,py35,py36,py37
skip_missing_interpreters=true
[testenv]
deps=py27,py34,py35,py36,py37: -r{toxinidir}/install.requires
     py26: -r{toxinidir}/install.requires.py26
     py27,py34,py35,py36,py37: -r{toxinidir}/tests.requires
     py26: -r{toxinidir}/tests.requires.py26
     py27,py34,py35,py36,py37: -r{toxinidir}/tox.requires
     py26: -r{toxinidir}/tox.requires.py26
commands=py27,py34,py35,py36,py37: py.test --cov-report term-missing --cov-report xml --cov fedfind
         py26: py.test
         py27,py34,py35,py36,py37: diff-cover coverage.xml --fail-under=90
         # pylint breaks on functools imports in python 3.6+
         # https://github.com/PyCQA/astroid/issues/362
         py27,py34,py35: diff-quality --violations=pylint --fail-under=90
setenv =
    PYTHONPATH = {toxinidir}

As you can probably guess, what’s going on there is we’re installing different dependencies and running different commands in different tox ‘environments’. pip doesn’t really have a proper dependency solver, which – among other things – unfortunately means tox barfs if you try and do something like listing the same dependency twice, the first time without any version restriction, the second time with a version restriction. So I had to do a bit more duplication than I really wanted, but never mind. What the files wind up doing is telling tox to install specific, old versions of some dependencies for the py26 environment:

[install.requires.py26]
cached-property
productmd
setuptools == 0.6.rc10
six == 1.7.3

[tests.requires.py26]
pytest==2.3.5
mock==1.0.1

tox.requires.py26 is just shorter, skipping the coverage and pylint bits, because it turns out to be a pain trying to provide old enough versions of various other things to run those checks with the older pytest, and there’s no real need to run the coverage and linter on py26 as long as they run on py27 (see above). As you can see in the commands section, we just run plain py.test and skip the other two commands on py26; on py36 and py37 we skip the diff-quality run because of the pylint bug.

So now on every pull request, we check the code (and tests – it’s usually the tests that break, because I use some pytest feature that didn’t exist in 2.3.5…) still work with the ancient RHEL 6 Python, pytest, mock, setuptools and six, check it on various other Python interpreter versions, and enforce some requirements for test coverage and code quality. And the package builds can still just do python setup.py test and not require coverage or pylint. Who needs github and coveralls? 😉

Of course, after doing all this I needed a pull request to check it on. For resultsdb_conventions I just made a dumb fake one, but for fedfind, because I’m an idiot, I decided to write that better compose ID parser I’ve been meaning to do for the last week. So that took another hour and a half. And then I had to clean up the test suite…sigh.

Bluetooth in Fedora

Posted by Nathaniel McCallum on February 16, 2017 08:53 PM

So… Bluetooth. It’s everywhere now. Well, everywhere except Fedora. Fedora does, of course, support Bluetooth. But even the most common workflows are somewhat spotty. We should improve this.

To this end, I’ve enlisted the help of Don Zickus, kernel developer extraordinaire, and Adam Williamson, the inimitable Fedora QA guru. The plan is to create a set of user tests for the most common Bluetooth tasks. This plan has several goals.

First, we’d like to know when stuff is broken. For example, the recent breakage in linux-firmware. Catching this stuff early is a huge plus.

Second, we’d like to get high quality bug reports. When things do break, vague bug reports often cause things to sit in limbo for a while. Making sure we have all the debugging information up front can make reports actionable.

Third, we’d (eventually) like to block a new Fedora release if major functionality is broken. We’re obviously not ready for this step yet. But once the majority of workflows work on the hardware we care about, we need to ensure that we don’t ship a Fedora release with broken code.

To this end we are targeting three workflows which cover the most common cases:

  • Keyboards
  • Headsets
  • Mice

For more information, or to help develop the user testing, see the Fedora QA bug. Here’s to a better future!

Announcing the resultsdb-users mailing list

Posted by Adam Williamson on February 16, 2017 01:28 AM

I’ve been floating an idea around recently to people who are currently using ResultsDB in some sense – either sending reports to it, or consuming reports from it – or plan to do so. The idea was to have a group where we can discuss (and hopefully co-ordinate) use of ResultsDB – a place to talk about result metadata conventions and so forth.

It seemed to get a bit of traction, so I’ve created a new mailing list: resultsdb-users. If you’re interested, please do subscribe, through the web interface, or by sending a mail with ‘subscribe’ in the subject to this address.

If you’re not familiar with ResultsDB – well, it’s a generic storage engine for test results. It’s more or less a database with a REST API and some very minimal rules for what constitutes a ‘test result’. The only requirements really are some kind of test name plus a result, chosen from four options; results can include any other arbitrary key:value pairs you like, and a few have special meaning in the web UI, but that’s about it. This is one of the reasons for the new list: because ResultsDB is so generic, if we want to make it easily and reliably possible to find related groups of results in any given ResultsDB, we need to come up with ways to ensure related results share common metadata values, and that’s one of the things I expect we’ll be talking about on the list.
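
For a flavour of how minimal that is, here’s a rough sketch of what reporting a result through the REST API might look like (the instance URL and test name here are invented, and I’m using the python-requests library):

import requests

result = {
    'testcase': {'name': 'compose.example.test'},  # invented test name
    'outcome': 'PASSED',  # one of the four allowed outcomes
    # everything beyond this is freeform key:value metadata
    'data': {
        'item': 'Fedora-Rawhide-20170216.n.0',
        'type': 'compose',
    },
}
resp = requests.post(
    'https://resultsdb.example.org/api/v2.0/results', json=result)
resp.raise_for_status()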

It began life as Taskotron‘s result storage engine, but it’s pretty independent, and you could certainly get value out of a ResultsDB instance without any of the other bits of Taskotron.

Right now ResultsDB is used in production in Fedora for storing results from Taskotron, openQA and Autocloud, and an instance is also used inside Red Hat for storing results from some RH test systems.

Please note: despite the list being a fedoraproject one, the intent is to co-ordinate with folks from CentOS, Red Hat and maybe even further afield as well; we’re just using an fp.o list as it’s a quick convenient way to get a nice mailman3/hyperkitty list without having to go set up a list server on taskotron.org or something.

The future of Fedora QA

Posted by Adam Williamson on February 12, 2017 05:33 PM

Welcome to version 2.0 of this blog post! This space was previously occupied by a whole bunch of longwinded explanation about some changes that are going on in Fedoraland, and are going to be accelerating (I think) in the near future. But it was way too long. So here’s the executive summary!

First of all: if you do nothing else to get up to speed on Stuff That’s Going On, watch Ralph Bean’s Factory 2.0 talk and Adam Samalik’s Modularity talk from Devconf 2017. Stephen Gallagher’s Fedora Server talk and Dennis Gilmore’s ‘moving everyone to Rawhide’ talk are also valuable, but please at least watch Ralph’s. It’s a one-hour overview of all the big stuff that people really want to build for Fedora (and RH) soon.

To put it simply: Fedora (and RH) don’t want to be only in the business of releasing a bunch of RPMs and operating system images every X months (or years) any more. And we’re increasingly moving away from the traditional segmented development process where developers/package maintainers make the bits, then release engineering bundles them all up into ‘things’, and then QA looks at the ‘things’ and says “er, it doesn’t boot, try again”, and we do that for several months until QA is happy, then we release it and start over. There is a big project to completely overhaul the way we build and ship products, using a pipeline that involves true CI, where each proposed change to Fedora produces an immediate feedback loop of testing and the change is blocked if the testing fails. Again, watch Ralph’s talk, because what he basically does is put up a big schematic of this entire system and go into a whole bunch of detail about his vision for how it’s all going to work.

As part of this, some of the folks in RH’s Fedora QA team whose job has been to work on ‘automated testing’ – a concept that is very tied to the traditional model for building and shipping a ‘distribution’, and just means taking some of the tasks assigned to QA/QE in that model and automating them – are now instead going to be part of a new team at Red Hat whose job is to work on the infrastructure that supports this CI pipeline. That doesn’t mean they’re leaving Fedora, or we’re going to throw away all the work we’ve invested in the components of Taskotron and start all over again, but it does mean that some or all of the components of Taskotron are going to be re-envisaged as part of a modernized pipeline for building and shipping whatever it is we want to call Fedora in the future – and also, if things go according to plan, for building and shipping CentOS and Red Hat products, as part of the vision is that as many components of the pipeline as possible will be shared among many projects.

So that’s one thing that’s happening to Fedora QA: the RH team is going to get a bit smaller, but it’s for good and sensible reasons. You’re also not going to see those folks disappear into some kind of internal RH wormhole, they’ll still be right here working on Fedora, just in a somewhat different context.

Of course, all of this change has other implications for Fedora QA as well, and I reckon this is a good time for those of us still wearing ‘Fedora QA’ hats – whether we’re paid by Red Hat or not – to be reconsidering exactly what our goals and priorities ought to be. Much like with Taskotron, we really haven’t sat down and done that for several years. I’ve been thinking about it myself for a while, and I wouldn’t say I have it all figured out, but I do have some thoughts.

For a start I think we should be looking ahead to the time when we’re no longer on what the anaconda team used to call ‘the blocker treadmill’, where a large portion of our working time is eaten up by a more or less constant cycle of waking up, finding out what broke in Rawhide or Branched today, and trying to get it fixed. If the plans above come about, that should happen a lot less for a couple of reasons: firstly Fedora won’t just be a project which releases a bunch of OS images every six months any more, and secondly, distribution-level CI ought to mean that things aren’t broken all the damn time any more. In an ideal scenario, a lot of the basic fundamental breakage that, right now, is still mostly caught by QA – and that we spend a lot of our cycles on dealing with – will just no longer be our problem. In a proper CI system, it becomes truly the developers’ responsibility: developers don’t get to throw in a change that breaks everything and then wait for QA to notice and tell them about it. If they try and send a change that breaks everything, it gets rejected, and hopefully, the breakage never really ‘happens’.

Sadly (or happily, given I still have a mortgage to pay off) this probably doesn’t mean Project Colada will finally be reality and we all get to sit on the beach drinking cocktails for the rest of our lives. CI is a great process for ensuring your project basically works all the time, but ‘basically works’ is a long way from ‘perfect’. Software is still software, after all, and a CI process is never going to catch all of the bugs. Freeing QA from the blocker treadmill lets us look up and think, well, what else can we do?

To be clear, I think we’re still going to need ‘release validation’. In fact, if the bits of the plan about having more release streams than just ‘all the bits, every six months’ come off, we’ll need more release validation. But hopefully there’ll be a lot more “well, this doesn’t quite work right in this quite involved real-world scenario” and less “it doesn’t boot and I think it ate my cat” involved. For the near future, we’re going to have to keep up the treadmill: bar a few proofs of concept and stuff, Fedora 26 is still an ‘all the bits, every six months’ release, and there’s still an awful lot of “it doesn’t boot” involved. (Right now, Rawhide doesn’t even compose, let alone boot!) But it’s not too early to start thinking about how we might want to revise the ‘release validation’ concept for a world where the wheels don’t fall off the bus every five minutes. It might be a good idea to go back to the teams responsible for all the Fedora products – Server, Workstation, Atomic et al. – and see if we need to take another good look at the documents that define what those products should deliver, and the test processes we have in place to try and determine whether they deliver them.

We’re also still going to be doing ‘updates testing’ and ‘test days’, I think. In fact, the biggest consequence of a world where the CI stuff works out might be that we are free to do more of those. There may be some change in what ‘updates’ are – it may not just be RPM packages any more – but whatever interesting forms of ‘update’ we wind up shipping out to people, we’re still going to need to make sure they work properly, and manual testing is always going to be able to find things that automated tests miss there.

I think the question of to what extent we still have a role in ‘automated testing’ and what it should be is also a really interesting one. One of the angles of the ‘more collaboration between RH and Fedora’ bit here is that RH is now very interested in ‘upstreaming’ a bunch of its internal tests that it previously considered to be sort of ‘RH secret sauce’. Specifically, there’s a set of tests from RH’s ‘Platform QE’ team which currently run through a pipeline using RH’s Beaker test platform which we’d really like to have at least a subset of running on Fedora. So there’s an open question about whether and to what extent Fedora QA would have a role in adapting those tests to Fedora and overseeing their operation. The nuts and bolts of ‘make sure Fedora has the necessary systems in place to be able to run the tests at all’ is going to be the job of the new ‘infrastructure’ team, but we may well wind up being involved in the work of adapting the tests themselves to Fedora and deciding which ones we want to run and for what purposes. In general, there is likely still going to be a requirement for ‘automated testing’ that isn’t CI – it’s still going to be necessary to test the things we build at a higher level. I don’t think we can yet know exactly what requirements we’ll have there, but it’s something to think about and figure out as we move forward, and I think it’s definitely going to be part of our job.

We may also need to reconsider how Fedora QA, and indeed Fedora as a whole, decides what is really important. Right now, there’s a pretty solid process for this, but it’s quite tied to the ‘all the things, every six months’ release cycle. For each release we decide which Fedora products are ‘release blocking’, and we care about those, and the bits that go into them and the tools for building them, an awful lot more than we care about anything else. This works pretty well to focus our limited resources on what’s really important. But if we’re going to be moving to having more and more varied ‘Fedora’ products with different release streams, the binary ‘is it release blocking?’ question doesn’t really work any more. Fedora as a whole might need a better way of doing that, and QA should have a role to play in figuring that out and making sure we work out our priorities properly from it.

So there we go! I hope that was useful and thought-provoking. We’ve got a QA meeting coming up tomorrow (2017-02-13) at 1600 UTC where I’m hoping we can chew these topics over a bit, just to serve as an opportunity to get people thinking. Hope to see you there, or on the mailing list!

openQA and Autocloud result submission to ResultsDB

Posted by Adam Williamson on February 07, 2017 05:18 AM

So I’ve just arrived back from a packed two weeks in Brno, and I’ll probably have some more stuff to post soon. But let’s lead with some big news!

One of the big topics at Devconf and around the RH offices was the ongoing effort to modernize both Fedora and RHEL’s overall build processes to be more flexible and involve a lot more testing (or, as some people may have put it, “CI CI CI”). A lot of folks wearing a lot of hats are involved in different bits of this effort, but one thing that seems to stay constant is that ResultsDB will play a significant role.

ResultsDB started life as the result storage engine for AutoQA, and the concept and name was preserved as AutoQA was replaced by Taskotron. Its current version, however, is designed to be a scalable, capable and generic store for test results from any test system, not just Taskotron. Up until last week, though, we’d never quite got around to hooking up any other systems to it to demonstrate this.

Well, that’s all changed now! In the course of three days, Jan Sedlak and I got both Fedora’s openQA instance and Autocloud reporting to ResultsDB. As results come out of both those systems, fedmsg consumers take the results, process them into a common format, and forward them to ResultsDB. This means there are groups with results from both systems for the same compose together, and you’ll find metadata in very similar format attached to the results from both systems. This is all deployed in production right now – the results from every daily compose from both openQA and Autocloud are being forwarded smoothly to ResultsDB.
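
The consumers are not very big or clever; the rough shape of one is something like this (a simplified sketch – the topic, class name and config key are invented, this is not the production code):

import fedmsg.consumers

class ResultForwarder(fedmsg.consumers.FedmsgConsumer):
    # listen for 'job done' messages from one of the test systems
    topic = 'org.fedoraproject.prod.openqa.job.done'  # invented topic
    config_key = 'resultforwarder.enabled'  # invented config key

    def consume(self, message):
        # the interesting payload arrives under body/msg
        msg = message['body']['msg']
        # here we'd massage msg into the common result format and
        # POST it to the ResultsDB API (omitted in this sketch)
        pass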

To aid in this effort I wrote a thing we’re calling resultsdb_conventions for now. I think of it as being a code representation of some ‘conventions’ for formatting and organizing results in ResultsDB, as well as a tool for conveniently reporting results in line with those conventions. The attraction of ResultsDB is that it’s very little more than a RESTful API for a database; it enforces a pretty bare minimum in terms of required data for each result. A result must provide only a test name, an ‘item’ that was tested, and a status (‘outcome’) from a choice of four. ResultsDB allows a result to include as much more data as it likes, in the form of a freeform key:value data store, but it does not require any extra data to be provided, or impose any policy on its form.

This makes ResultsDB flexible, but also means we will need to establish conventions where appropriate to ensure related results can be conveniently located and reasoned about. resultsdb_conventions is my initial contribution to this effort, originally written just to reduce duplication between the openQA and Autocloud result submitters and ensure they used a common layout, but intended to perhaps cover far more use cases in the future.

Having this data in ResultsDB is likely to be practically useful either immediately or in the very near future, but we’re also hoping it acts as a demonstration that using ResultsDB to consolidate results from multiple test sources is not only possible but quite easy. And I’m hoping resultsdb_conventions can be a starting point for a discussion and some consensus around what metadata we provide, and in what format, for various types of result. If all goes well, we’re hoping to hook up manual test result submission to ResultsDB next, via the relval-ng project that’s had some discussion on the QA mailing lists. Stay tuned for more on that!

Welcome Fedora Quality Planet

Posted by Kamil Páral on January 31, 2017 10:31 AM

Hello, I’d like to introduce a new sub-planet of Fedora Planet to you, located at http://fedoraplanet.org/quality/ (you don’t need to remember the URL, there’s a sub-planet picker in the top right corner of Fedora Planet pages that allows you to switch between sub-planets).

Fedora Quality Planet will contain news and useful information about QA tools and processes present in Fedora, updates on our quality automation efforts, guides for package maintainers (and other teams) how to interact with our tools and checks or understand the reported failures, announcements about critical issues in Fedora releases, and more.

Our goal is to have a single place for you to visit (or subscribe to) and get a good overview of what’s happening in the Fedora Quality space. Of course all Fedora Quality posts should also show up in the main Fedora Planet feed, so if you’re already subscribed to that, you shouldn’t miss our posts either.

If you want to join our effort and publish some interesting quality-related posts into Fedora Quality Planet, you’re more than welcome! Please see the instructions on how to syndicate your blog. If you have any questions or need help, ask in the test mailing list or ping kparal or adamw on the #fedora-qa freenode IRC channel. Thanks!


The Tale Of The Two-Day, One-Character Patch

Posted by Adam Williamson on January 12, 2017 02:57 AM

I’m feeling like writing a very long explanation of a very small change again. Some folks have told me they enjoy my attempts to detail the entire step-by-step process of debugging some somewhat complex problem, so sit back, folks, and enjoy…The Tale Of The Two-Day, One-Character Patch!

Recently we landed Python 3.6 in Fedora Rawhide. A Python version bump like that requires all Python-dependent packages in the distribution to be rebuilt. As usually happens, several packages failed to rebuild successfully, so among other work, I’ve been helping work through the list of failed packages and fixing them up.

Two days ago, I reached python-deap. As usual, I first simply tried a mock build of the package: sometimes it turns out we already fixed whatever had previously caused the build to fail, and simply retrying will make it work. But that wasn’t the case this time.

The build failed due to build dependencies not being installable – python2-pypandoc, in this case. It turned out that this depends on pandoc-citeproc, and that wasn’t installable because a new ghc build had been done without rebuilds of the set of pandoc-related packages that must be rebuilt after a ghc bump. So I rebuilt pandoc, and ghc-aeson-pretty (an updated version was needed to build an updated pandoc-citeproc which had been committed but not built), and finally pandoc-citeproc.

With that done, I could do a successful scratch build of python-deap. I tweaked the package a bit to enable the test suites – another thing I’m doing for each package I’m fixing the build of, if possible – and fired off an official build.

Now you may notice that this looks a bit odd, because all the builds for the different arches succeeded (they’re green), but the overall ‘State’ is “failed”. What’s going on there? Well, if you click “Show result”, you’ll see this:

BuildError: The following noarch package built differently on different architectures: python-deap-doc-1.0.1-2.20160624git232ed17.fc26.noarch.rpm
rpmdiff output was:
error: cannot open Packages index using db5 - Permission denied (13)
error: cannot open Packages database in /var/lib/rpm
error: cannot open Packages database in /var/lib/rpm
removed     /usr/share/doc/python-deap/html/_images/cma_plotting_01_00.png
removed     /usr/share/doc/python-deap/html/examples/es/cma_plotting_01_00.hires.png
removed     /usr/share/doc/python-deap/html/examples/es/cma_plotting_01_00.pdf
removed     /usr/share/doc/python-deap/html/examples/es/cma_plotting_01_00.png

So, this is a good example of where background knowledge is valuable. Getting from step to step in this kind of debugging/troubleshooting process is a sort of combination of logic, knowledge and perseverance. Always try to be logical and methodical. When you start out you won’t have an awful lot of knowledge, so you’ll need a lot of perseverance; hopefully, the longer you go on, the more knowledge you’ll pick up, and thus the less perseverance you’ll need!

In this case the error is actually fairly helpful, but I also know a bit about packages (which helps) and remembered a recent mailing list discussion. Fedora allows arched packages with noarch subpackages, and this is how python-deap is set up: the main packages are arched, but there is a python-deap-docs subpackage that is noarch. We’re concerned with that package here. I recalled a recent mailing list discussion of this “built differently on different architectures” error.

As discussed in that thread, we’re failing a Koji check specific to this kind of package. If all the per-arch builds succeed individually, Koji will take the noarch subpackage(s) from each arch and compare them; if they’re not all the same, Koji will consider this an error and fail the build. After all, the point of a noarch package is that its contents are the same for all arches and so it shouldn’t matter which arch build we take the noarch subpackage from. If it comes out different on different arches, something is clearly up.

So this left me with the problem of figuring out which arch was different (it’d be nice if the Koji message actually told us…) and why. I started out just looking at the build logs for each arch and searching for ‘cma_plotting’. This is actually another important thing: one of the most important approaches to have in your toolbox for this kind of work is just ‘searching for significant-looking text strings’. That might be a grep or it might be a web search, but you’ll probably wind up doing a lot of both. Remember good searching technique: try to find the most ‘unusual’ strings you can to search for, ones for which the results will be strongly correlated with your problem. This quickly told me that the problematic arch was ppc64. The ‘removed’ files were not present in that build, but they were present in the builds for all other arches.

So I started looking more deeply into the ppc64 build log. If you search for ‘cma_plotting’ in that file, you’ll see the very first result is “WARNING: Exception occurred in plotting cma_plotting”. That sounds bad! Below it is a long Python traceback – the text starting “Traceback (most recent call last):”.

So what we have here is some kind of Python thing crashing during the build. If we quickly compare with the build logs on other arches, we don’t see the same thing at all – there is no traceback in those build logs. Especially since this shows up right when the build process should be generating the files we know are the problem (the cma_plotting files, remember), we can be pretty sure this is our culprit.

Now this is a pretty big scary traceback, but we can learn some things from it quite easily. One is very important: we can see quite easily what it is that’s going wrong. If we look at the end of the traceback, we see that all the last calls involve files in /usr/lib64/python2.7/site-packages/matplotlib. This means we’re dealing with a Python module called matplotlib. We can quite easily associate that with the package python-matplotlib, and now we have our next suspect.

If we look a bit before the traceback, we can get a bit more general context of what’s going on, though it turns out not to be very important in this case. Sometimes it is, though. In this case we can see this:

+ sphinx-build-2 doc build/html
Running Sphinx v1.5.1

Again, background knowledge comes in handy here: I happen to know that Sphinx is a tool for generating documentation. But if you didn’t already know that, you should quite easily be able to find it out, by good old web search. So what’s going on is the package build process is trying to generate python-deap’s documentation, and that process uses this matplotlib library, and something is going very wrong – but only on ppc64, remember – in matplotlib when we try to generate one particular set of doc files.

So next I start trying to figure out what’s actually going wrong in matplotlib. As I mentioned, the traceback is pretty long. This is partly just because matplotlib is big and complex, but it’s more because it’s a fairly rare type of Python error – an infinite recursion. You’ll see the traceback ends with many, many repetitions of this line:

  File "/usr/lib64/python2.7/site-packages/matplotlib/mathtext.py", line 861, in _get_glyph
    return self._get_glyph('rm', font_class, sym, fontsize)

followed by:

  File "/usr/lib64/python2.7/site-packages/matplotlib/mathtext.py", line 816, in _get_glyph
    uniindex = get_unicode_index(sym, math)
  File "/usr/lib64/python2.7/site-packages/matplotlib/mathtext.py", line 87, in get_unicode_index
    if symbol == '-':
RuntimeError: maximum recursion depth exceeded in cmp

What ‘recursion’ means is pretty simple: it just means that a function can call itself. A common example of where you might want to do this is if you’re trying to walk a directory tree. In Python it might look a bit like this:

import os

def read_directory(directory):
    print(directory)
    for entry in os.listdir(directory):
        path = os.path.join(directory, entry)
        if os.path.isfile(path):
            print(entry)
        elif os.path.isdir(path):
            # a directory inside a directory: recurse
            read_directory(path)

To deal with directories nested in other directories, the function just calls itself. The danger is if you somehow mess up when writing code like this, and it winds up in a loop, calling itself over and over and never escaping: this is ‘infinite recursion’. Python, being a nice language, notices when this is going on, and bails after a certain number of recursions, which is what’s happening here.
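
You can see this guard rail in action with a trivial example (on Python 2 it surfaces as a RuntimeError, as in the traceback above; Python 3.5 and later raise the more specific RecursionError):

import sys

def loop():
    return loop()

print(sys.getrecursionlimit())  # the bail-out threshold, typically 1000
loop()  # dies with 'maximum recursion depth exceeded'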

So now we know where to look in matplotlib, and what to look for. Let’s go take a look! matplotlib, like most everything else in the universe these days, is in github, which is bad for ecosystem health but handy just for finding stuff. Let’s go look at the function from the backtrace.

Well, this is pretty long, and maybe a bit intimidating. But an interesting thing is, we don’t really need to know what this function is for – I actually still don’t know precisely (according to the name it should be returning a ‘glyph’ – a single visual representation for a specific character from a font – but it actually returns a font, the unicode index for the glyph, the name of the glyph, the font size, and whether the glyph is italicized, for some reason). What we need to concentrate on is the question of why this function is getting in a recursion loop on one arch (ppc64) but not any others.

First let’s figure out how the recursion is actually triggered – that’s vital to figuring out what the next step in our chain is. The line that triggers the loop is this one:

                return self._get_glyph('rm', font_class, sym, fontsize)

That’s where it calls itself. It’s kinda obvious that the authors expect that call to succeed – it shouldn’t run down the same logical path, but instead get to the ‘success’ path (the return font, uniindex, symbol_name, fontsize, slanted line at the end of the function) and thus break the loop. But on ppc64, for some reason, it doesn’t.

So what’s the logic path that leads us to that call, both initially and when it recurses? Well, it’s down three levels of conditionals:

    if not found_symbol:
        if self.cm_fallback:
            <other path>
        else:
            if fontname in ('it', 'regular') and isinstance(self, StixFonts):
                return self._get_glyph('rm', font_class, sym, fontsize)

So we only get to this path if found_symbol is not set by the time we reach that first if, then if self.cm_fallback is not set, then if the fontname given when the function was called was ‘it’ or ‘regular’ and if the class instance this function (actually method) is a part of is an instance of the StixFonts class (or a subclass). Don’t worry if we’re getting a bit too technical at this point, because I did spend a bit of time looking into those last two conditions, but ultimately they turned out not to be that significant. The important one is the first one: if not found_symbol.

By this point, I’m starting to wonder if the problem is that we’re failing to ‘find’ the symbol – in the first half of the function – when we shouldn’t be. Now there are a couple of handy logical shortcuts we can take here that turned out to be rather useful. First we look at the whole logic flow of the found_symbol variable and see that it’s a bit convoluted. From the start of the function, there are two different ways it can be set True – the if self.use_cmex block and then the ‘fallback’ if not found_symbol block after that. Then there’s another block that starts if found_symbol: where it gets set back to False again, and another lookup is done:

    if found_symbol:
    (...)
        found_symbol = False
        font = self._get_font(new_fontname)
        if font is not None:
            glyphindex = font.get_char_index(uniindex)
            if glyphindex != 0:
                found_symbol = True

At first, though, we don’t know if we’re even hitting that block, or if we’re failing to ‘find’ the symbol earlier on. It turns out, though, that it’s easy to tell – because of this earlier block:

    if not found_symbol:
        try:
            uniindex = get_unicode_index(sym, math)
            found_symbol = True
        except ValueError:
            uniindex = ord('?')
            warn("No TeX to unicode mapping for '%s'" %
                 sym.encode('ascii', 'backslashreplace'),
                 MathTextWarning)

Basically, if we don’t find the symbol there, the code logs a warning. We can see from our build log that we don’t see any such warning, so we know that the code does initially succeed in finding the symbol – that is, when we get to the if found_symbol: block, found_symbol is True. That logically means that it’s that block where the problem occurs – we have found_symbol going in, but where that block sets it back to False then looks it up again (after doing some kind of font substitution, I don’t know why, don’t care), it fails.

The other thing I noticed while poking through this code is a later warning. Remember that the infinite recursion only happens if fontname in ('it', 'regular') and isinstance(self, StixFonts)? Well, what happens if that’s not the case is interesting:

            if fontname in ('it', 'regular') and isinstance(self, StixFonts):
                return self._get_glyph('rm', font_class, sym, fontsize)
            warn("Font '%s' does not have a glyph for '%s' [U+%x]" %
                 (new_fontname,
                  sym.encode('ascii', 'backslashreplace').decode('ascii'),
                  uniindex),
                 MathTextWarning)

that is, if that condition isn’t satisfied, instead of calling itself, the next thing the function does is log a warning. So it occurred to me to go and see if there are any of those warnings in the build logs. And, whaddayaknow, there are four such warnings in the ppc64 build log:

/usr/lib64/python2.7/site-packages/matplotlib/mathtext.py:866: MathTextWarning: Font 'rm' does not have a glyph for '1' [U+1d7e3]
  MathTextWarning)
/usr/lib64/python2.7/site-packages/matplotlib/mathtext.py:867: MathTextWarning: Substituting with a dummy symbol.
  warn("Substituting with a dummy symbol.", MathTextWarning)
/usr/lib64/python2.7/site-packages/matplotlib/mathtext.py:866: MathTextWarning: Font 'rm' does not have a glyph for '0' [U+1d7e2]
  MathTextWarning)
/usr/lib64/python2.7/site-packages/matplotlib/mathtext.py:866: MathTextWarning: Font 'rm' does not have a glyph for '-' [U+2212]
  MathTextWarning)
/usr/lib64/python2.7/site-packages/matplotlib/mathtext.py:866: MathTextWarning: Font 'rm' does not have a glyph for '2' [U+1d7e4]
  MathTextWarning)

but there are no such warnings in the logs for other arches. That’s really rather interesting. It makes one possibility very unlikely: that we do reach the recursed call on all arches, but it fails on ppc64 and succeeds on the other arches. It’s looking far more likely that the problem is the “re-discovery” bit of the function – the if found_symbol: block where it looks up the symbol again – is usually working on other arches, but failing on ppc64.

So just by looking at the logical flow of the function, particularly what happens in different conditional branches, we’ve actually been able to figure out quite a lot, without knowing or even caring what the function is really for. By this point, I was really focusing in on that if found_symbol: block. And that leads us to our next suspect. The most important bit in that block is where it actually decides whether to set found_symbol to True or not, here:

        font = self._get_font(new_fontname)
        if font is not None:
            glyphindex = font.get_char_index(uniindex)
            if glyphindex != 0:
                found_symbol = True

I didn’t actually know whether it was failing because self._get_font didn’t find anything, or because font.get_char_index returned 0. I think I just played a hunch that get_char_index was the problem, but it wouldn’t be too difficult to find out by just editing the code a bit to log a message telling you whether or not font was None, and re-running the test suite.

Anyhow, I wound up looking at get_char_index, so we need to go find that. You could work backwards through the code and figure out what font is an instance of so you can find it, but that’s boring: it’s far quicker just to grep the damn code. If you do that, you get various results that are calls of it, then this:

src/ft2font_wrapper.cpp:const char *PyFT2Font_get_char_index__doc__ =
src/ft2font_wrapper.cpp:    "get_char_index()\n"
src/ft2font_wrapper.cpp:static PyObject *PyFT2Font_get_char_index(PyFT2Font *self, PyObject *args, PyObject *kwds)
src/ft2font_wrapper.cpp:    if (!PyArg_ParseTuple(args, "k:get_char_index", &ccode)) {
src/ft2font_wrapper.cpp:        {"get_char_index", (PyCFunction)PyFT2Font_get_char_index, METH_VARARGS, PyFT2Font_get_char_index__doc__},

Which is the point at which I started mentally buckling myself in, because now we’re out of Python and into C++. Glorious C++! I should note at this point that, while I’m probably a half-decent Python coder at this point, I am still pretty awful at C(++). I may be somewhat or very wrong in anything I say about it. Corrections welcome.

So I buckled myself in and went for a look at this ft2font_wrapper.cpp thing. I’ve seen this kind of thing a couple of times before, so by squinting at it a bit sideways, I could more or less see that this is what Python calls an extension module: basically, it’s a Python module written in C or C++. This gets done if you need to create a new built-in type, or for speed, or – as in this case – because the Python project wants to work directly with a system shared library (in this case, freetype), either because it doesn’t have Python bindings or because the project doesn’t want to use them for some reason.

This code pretty much provides a few classes for working with Freetype fonts. It defines a class called matplotlib.ft2font.FT2Font with a method get_char_index, and that’s what the code back up in mathtext.py is dealing with: that font we were dealing with is an FT2Font instance, and we’re using its get_char_index method to try and ‘find’ our ‘symbol’.
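
Incidentally, you can poke at this method from a Python prompt to convince yourself what it does; the font path here is just an example of one commonly present on Fedora:

from matplotlib.ft2font import FT2Font

font = FT2Font('/usr/share/fonts/dejavu/DejaVuSans.ttf')  # example path
# get_char_index() returns the glyph index for a character code,
# or 0 if the font has no glyph for it - exactly the check that
# mathtext.py's found_symbol logic relies on
print(font.get_char_index(ord('A')))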

Fortunately, this get_char_index method is actually simple enough that even I can figure out what it’s doing:

static PyObject *PyFT2Font_get_char_index(PyFT2Font *self, PyObject *args, PyObject *kwds)
{
    FT_UInt index;
    FT_ULong ccode;

    if (!PyArg_ParseTuple(args, "I:get_char_index", &ccode)) {
        return NULL;
    }

    index = FT_Get_Char_Index(self->x->get_face(), ccode);

    return PyLong_FromLong(index);
}

(If you’re playing along at home for MEGA BONUS POINTS, you now have all the necessary information and you can try to figure out what the bug is. If you just want me to explain it, keep reading!)

There’s really not an awful lot there. It’s calling FT_Get_Char_Index with a couple of args and returning the result. Not rocket science.

In fact, this seemed like a good point to start just doing a bit of experimenting to identify the precise problem, because we’ve reduced the problem to a very small area. So this is where I stopped just reading the code and started hacking it up to see what it did.

First I tweaked the relevant block in mathtext.py to just log the values it was feeding in and getting out:

        font = self._get_font(new_fontname)
        if font is not None:
            glyphindex = font.get_char_index(uniindex)
            warn("uniindex: %s, glyphindex: %s" % (uniindex, glyphindex))
            if glyphindex != 0:
                found_symbol = True

Sidenote: how exactly to just print something out to the console when you’re building or running tests can vary quite a bit depending on the codebase in question. What I usually do is just look at how the project already does it – find some message that is being printed when you build or run the tests, and then copy that. Thus in this case we can see that the code is using this warn function (it’s actually warnings.warn), and we know those messages are appearing in our build logs, so…let’s just copy that.

Then I ran the test suite on both x86_64 and ppc64, and compared. This told me that the Python code was passing the same uniindex values to the C code on both x86_64 and ppc64, but getting different results back – that is, I got the same recorded uniindex values, but on x86_64 the resulting glyphindex value was always something larger than 0, but on ppc64, it was sometimes 0.

The next step should be pretty obvious: log the input and output values in the C code.

index = FT_Get_Char_Index(self->x->get_face(), ccode);
printf("ccode: %lu index: %u\n", ccode, index);

Another sidenote: one of the more annoying things with this particular issue was just being able to run the tests with modifications and see what happened. First, I needed an actual ppc64 environment to use. The awesome Patrick Uiterwijk of Fedora release engineering provided me with one. Then I built a .src.rpm of the python-matplotlib package, ran a mock build of it, and shelled into the mock environment. That gives you an environment with all the necessary build dependencies and the source and the tests all there and prepared already. Then I just copied the necessary build, install and test commands from the spec file.

For a simple pure-Python module this is all usually pretty easy and you can just check the source out and do it right in your regular environment or in a virtualenv or something, but for something like matplotlib, which has this C++ extension module too, it’s more complex. The spec builds the code, then installs it, then runs the tests out of the source directory with PYTHONPATH=BUILDROOT/usr/lib64/python2.7/site-packages, so the code that was actually built and installed is used for the tests. When I wanted to modify the C part of matplotlib, I edited it in the source directory, then re-ran the ‘build’ and ‘install’ steps, then ran the tests; if I wanted to modify the Python part I just edited it directly in the BUILDROOT location and re-ran the tests.

When I ran the tests on ppc64, I noticed that several hundred of them failed with exactly the bug we’d seen in the python-deap package build – this infinite recursion problem. Several others failed due to not being able to find the glyph, without hitting the recursion. (It turned out the package maintainer had disabled the tests on ppc64, and so Fedora 24+’s python-matplotlib has been broken on ppc64 since about April.)

So anyway, with that modified C code built and used to run the test suite, I finally had a smoking gun. Running this on x86_64 and ppc64, the logged ccode values were totally different. The values logged on ppc64 were huge. But as we know from the previous logging, there was no difference in the value when the Python code passed it to the C code (the uniindex value logged in the Python code).

So now I knew: the problem lay in how the C code took the value from the Python code. At this point I started figuring out how that worked. The key line is this one:

if (!PyArg_ParseTuple(args, "I:get_char_index", &ccode)) {

That PyArg_ParseTuple function is what the C code is using to read in the value that mathtext.py calls uniindex and it calls ccode, the one that’s somehow being messed up on ppc64. So let’s read the docs!

This is one unusual example where the Python docs, which are usually awesome, are a bit difficult, because that’s a very thin description which doesn’t provide the references you usually get. But all you really need to do is read up – go back to the top of the page, and you get a much more comprehensive explanation. Reading carefully through the whole page, we can see pretty much what’s going on in this call. It basically means that args is expected to be a structure representing a single Python object, a number, which we will store into the C variable ccode. The tricky bit is that second arg, "I:get_char_index". This is the ‘format string’ that the Python page goes into a lot of helpful detail about.

As it tells us, PyArg_ParseTuple “use[s] format strings which are used to tell the function about the expected arguments…A format string consists of zero or more “format units.” A format unit describes one Python object; it is usually a single character or a parenthesized sequence of format units. With a few exceptions, a format unit that is not a parenthesized sequence normally corresponds to a single address argument to these functions.” Next we get a list of the ‘format units’, and I is one of those:

 I (integer) [unsigned int]
    Convert a Python integer to a C unsigned int, without overflow checking.

You might also notice that the list of format units include several for converting Python integers to other things, like i for ‘signed int’ and h for ‘short int’. This will become significant soon!

The :get_char_index bit threw me for a minute, but it’s explained further down:

“A few other characters have a meaning in a format string. These may not occur inside nested parentheses. They are: … : The list of format units ends here; the string after the colon is used as the function name in error messages (the “associated value” of the exception that PyArg_ParseTuple() raises).” So in our case here, we have only a single ‘format unit’ – I – and get_char_index is just a name that’ll be used in any error messages this call might produce.

So now we know what this call is doing. It’s saying “when some Python code calls this function, take the args it was called with and parse them into C structures so we can do stuff with them. In this case, we expect there to be just a single arg, which will be a Python integer, and we want to convert it to a C unsigned integer, and store it in the C variable ccode.”

(If you’re playing along at home but you didn’t get it earlier, you really should be able to get it now! Hint: read up just a few lines in the C code. If not, go refresh your memory about architectures…)

And once I understood that, I realized what the problem was. Let’s read up just a few lines in the C code:

FT_ULong ccode;

Unlike Python, C and C++ are ‘typed languages’. That just means that all variables must be declared to be of a specific type, unlike Python variables, which you don’t have to declare explicitly and which can change type any time you like. This is a variable declaration: it’s simply saying “we want a variable called ccode, and it’s of type FT_ULong“.

If you know anything at all about C integer types, you should know what the problem is by now (you probably worked it out a few paragraphs back). But if you don’t, now’s a good time to learn!

There are several different types you can use for storing integers in C: short, int, long, and possibly long long (depends on your arch). This is basically all about efficiency: you can only put a small number in a short, but if you only need to store small numbers, it might be more efficient to use a short than a long. Theoretically, when you use a short the compiler will allocate less memory than when you use an int, which uses less memory again than a long, which uses less than a long long. Practically speaking some of them wind up being the same size on some platforms, but the basic idea’s there.

All the types have signed and unsigned variants. The difference there is simple: signed numbers can be negative, unsigned ones can’t. Say an int is big enough to let you store 101 different values: a signed int would let you store any number between -50 and +50, while an unsigned int would let you store any number between 0 and 100.

Now look at that ccode declaration again. What is its type? FT_ULong. That ULong…sounds a lot like unsigned long, right?

Yes it does! Here, have a cookie. C code often declares its own aliases for standard C types like this; we can find Freetype’s in its API documentation, which I found by the cunning technique of doing a web search for FT_ULong. That finds us this handy definition: “A typedef for unsigned long.”

Aaaaaaand herein lies our bug! Whew, at last. As, hopefully, you can now see, this ccode variable is declared as an unsigned long, but we’re telling PyArg_ParseTuple to convert the Python object such that we can store it as an unsigned int, not an unsigned long.

But wait, you think. Why does this seem to work OK on most arches, and only fail on ppc64? Again, some of you will already know the answer, good for you, now go read something else. 😉 For the rest of you, it’s all about this concept called ‘endianness’, which you might have come across and completely failed to understand, like I did many times! But it’s really pretty simple, at least if we skate over it just a bit.

Consider the number “forty-two”. Here is how we write it with numerals: 42. Right? At least, that’s how most humans do it, these days, unless you’re a particularly hardy survivor of the fall of Rome, or something. This means we humans are ‘big-endian’. If we were ‘little-endian’, we’d write it like this: 24. ‘Big-endian’ just means the most significant element comes ‘first’ in the representation; ‘little-endian’ means the most significant element comes last.

All the arches Fedora supports except for ppc64 are little-endian. On little-endian arches, this error doesn’t actually cause a problem: even though we used the wrong format unit, the value winds up being correct. On (64-bit) big-endian arches, however, it does cause a problem – when you tell PyArg_ParseTuple to convert to an unsigned int, but store the result into a variable that was declared as an unsigned long, you get a completely different value (it’s multiplied by 2^32). The reasons for this involve getting into a more technical understanding of little-endian vs. big-endian (we actually have to get into the icky details of how values are really represented in memory), which I’m going to skip since this post is already long enough.
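
If you’d like to see the effect without writing any C, you can fake up the memory layout with Python’s struct module. A sketch: write a 4-byte unsigned int into the start of a zeroed 8-byte buffer (as the wrong format unit does), then read the whole buffer back as an unsigned long, once per byte order:

import struct

uniindex = 0x1d7e3  # one of the indexes from the warnings above (120803)

# little-endian: the 4 bytes written are the *least* significant ones
buf = bytearray(8)
struct.pack_into('<I', buf, 0, uniindex)
print(struct.unpack('<Q', bytes(buf))[0])  # 120803 - the value survives

# big-endian: the same 4 bytes are now the *most* significant ones
buf = bytearray(8)
struct.pack_into('>I', buf, 0, uniindex)
print(struct.unpack('>Q', bytes(buf))[0])  # 120803 * 2**32 - very wrong

(In the real bug the remaining bytes are uninitialized stack memory rather than zeros, but the principle is the same.)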

But you don’t really need to understand it completely, certainly not to be able to spot problems like this. All you need to know is that there are little-endian and big-endian arches, and little-endian are far more prevalent these days, so it’s not unusual for low-level code to have weird bugs on big-endian arches. If something works fine on most arches but not on one or two, check if the ones where it fails are big-endian. If so, then keep a careful eye out for this kind of integer type mismatch problem, because it’s very, very likely to be the cause.

So now all that remained was to fix the problem. And here we go, with our one-character patch:

diff --git a/src/ft2font_wrapper.cpp b/src/ft2font_wrapper.cpp
index a97de68..c77dd83 100644
--- a/src/ft2font_wrapper.cpp
+++ b/src/ft2font_wrapper.cpp
@@ -971,7 +971,7 @@ static PyObject *PyFT2Font_get_char_index(PyFT2Font *self, PyObject *args, PyObj
     FT_UInt index;
     FT_ULong ccode;

-    if (!PyArg_ParseTuple(args, "I:get_char_index", &ccode)) {
+    if (!PyArg_ParseTuple(args, "k:get_char_index", &ccode)) {
         return NULL;
     }

There’s something I just love about a one-character change that fixes several hundred test failures. 🙂 As you can see, we simply change the I – the format unit for unsigned int – to k – the format unit for unsigned long. And with that, the bug is solved! I applied this change on both x86_64 and ppc64, re-built the code and re-ran the test suite, and observed that several hundred errors disappeared from the test suite on ppc64, while the x86_64 tests continued to pass.

So I was able to send that patch upstream, apply it to the Fedora package, and once the package build went through, I could finally build python-deap successfully, two days after I’d first tried it.

Bonus extra content: even though I’d fixed the python-deap problem, as I’m never able to leave well enough alone, it wound up bugging me that there were still several hundred other failures in the matplotlib test suite on ppc64. So I wound up looking into all the other failures, and finding several other similar issues, which got the failure count down to just two sets of problems that are too domain-specific for me to figure out, and actually also happen on aarch64 and ppc64le (they’re not big-endian issues). So to both the people running matplotlib on ppc64…you’re welcome 😉

Seriously, though, I suspect without these fixes, we might have had some odd cases where a noarch package’s documentation would suddenly get messed up if the package happened to get built on a ppc64 builder.

QA protip of the day: make sure your test runner fails properly

Posted by Adam Williamson on January 01, 2017 01:44 AM

Just when you thought you were safe…it’s time for a blog post!

For the last few days I’ve been working on fixing Rawhide packages that failed to build as part of the Python 3.6 mass rebuild. In the course of this, I’ve been enabling test suites for packages that have one which we can plausibly run and weren’t running before, because tests are great and running them during package builds is great. (And it’s in the guidelines).

I’ve now come across two projects which have a unittest-based test script which does something like this:

#!/usr/bin/python3

import unittest

class SomeTests(unittest.TestCase):
    [tests here]

def main():
    suite = unittest.TestLoader().loadTestsFromTestCase(SomeTests)
    unittest.TextTestRunner(verbosity=3).run(suite)

if __name__ == '__main__':
    main()

Now if you just run this script manually all the time and inspect its output, you’ll be fine, because it’ll tell you whether the tests passed or not. However, if you try to use it in any kind of automated way you’re going to have trouble, because this script will always exit 0, even if some or all of the tests fail. That, of course, makes it rather useless for running during a package build, because the build will never fail even if all the tests do.

If you’re going to write your own test script like this (though seriously consider whether you should just rely on unittest’s own test discovery instead – there’s a sketch below – or use nose(2), or use pytest…), then it’s really a good idea to make sure your test script actually fails if any of the tests fail. Thus:

#!/usr/bin/python3

import sys
import unittest

class SomeTests(unittest.TestCase):
    [tests here]

def main():
    suite = unittest.TestLoader().loadTestsFromTestCase(SomeTests)
    ret = unittest.TextTestRunner(verbosity=3).run(suite)
    if ret.wasSuccessful():
        sys.exit()
    else:
        sys.exit("Test(s) failed!")

if __name__ == '__main__':
    main()

(note: just doing sys.exit() will exit 0; doing sys.exit('any string') prints the string and exits 1).
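
And here’s what relying on unittest’s own discovery looks like – a minimal sketch (the trivial test is just a placeholder for real tests) where unittest.main() finds and runs the test cases in the module and exits non-zero if any of them fail, which is exactly what a package build needs:

#!/usr/bin/python3

import unittest

class SomeTests(unittest.TestCase):
    def test_placeholder(self):
        # Stand-in for real tests.
        self.assertTrue(True)

if __name__ == '__main__':
    # unittest.main() runs all tests in the module and exits non-zero
    # if any of them fail, so automation sees the failure.
    unittest.main(verbosity=3)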

Packagers, look out for this kind of bear trap when packaging…if the package doesn’t use a common test pattern or system but has a custom script like this, check it and make sure it behaves sanely.

Oooh, look! A shiny thing!

Posted by Adam Williamson on November 02, 2016 01:27 AM

Hmm.

Today I was supposed to be finalizing the test cases for Thursday’s switchable graphics Test Day.

Instead, thanks to this:

(jeff) adamw: Hey, I’ve been approached by 2 different people telling me dnf system-upgrade was failing. In both cases they had to import the F24 key manually. And in both cases they were going F21->F24. Do you know if that’s a documented limitation somewhere?

I somehow wound up spending the day using yum and dnf to upgrade a virtual machine from Fedora 13 to 15, to 16, to 17, to 23, to 24, and making a bunch of improvements along the way.

Easily distracted?

Moi?

Oh, sorry, I saw something shiny over there…(wanders off)

Fedora 25 switchable graphics Test Day this Thursday, 2016-11-03

Posted by Adam Williamson on October 31, 2016 09:30 PM

Yep, it’s Test Day time again – most likely the final Test Day of the Fedora 25 cycle. This Thursday, 2016-11-03, will be switchable graphics Test Day!

‘Switchable graphics’ refers to the fairly common current practice of laptops having two graphics adapters: one low-power adapter for general-purpose use, and one more powerful one for applications that require more oomph (e.g. games or 3D rendering applications). NVIDIA brands this as ‘Optimus’, and AMD just as ‘Switchable Graphics’ or ‘Dynamic Switchable Graphics’.

There are some enhancements to Fedora’s support for such systems in Fedora 25, and part of the Test Day’s purpose is to test those enhancements. The other part of the Test Day’s purpose is to ensure that support for switchable graphics on Fedora 25 Workstation with Wayland by default is as good as it can be, and that in some cases where we know Wayland support is not sufficient, fallback to X.org works as expected.

If you’re not sure whether you have a system with switchable graphics, you can run xrandr --listproviders. If the output from this command lists more than one ‘provider’, you likely have switchable graphics. If you do, please come along to the Test Day and help us test, if you can spare the time. If you believe your system has switchable graphics, but the command only lists one provider, please come join the Test Day chat on the day – we may be able to investigate and figure out what’s going on!

The Test Day page and the test cases are still being revised and tweaked as I write this, but as the week goes along, all the instructions you need to run the tests will be present there. As always, the event will be in #fedora-test-day on Freenode IRC. If you don’t know how to use IRC, you can read these instructions, or just use WebIRC.

What just happened?

Posted by Adam Williamson on October 27, 2016 08:08 AM

4pm: “Well, guess it’s time to write the F25 Final blocker status mail.”

4:10pm: “Yeesh, I guess I’d better figure out which of the three iSCSI blocker bugs is actually still valid, and maybe take a quick look at what the problem is.”

1:06am: “Well, I think I’m done fixing iSCSI now. But I seem to have sprouted four new openQA action items. Blocker status mail? What blocker status mail?”

Fedora 25 Workstation Wayland-by-default Test Day report

Posted by Adam Williamson on October 14, 2016 10:30 PM

Hi folks! As yesterday’s Test Day was pretty popular and widely covered, I thought I’d blog the report as well as emailing it out.

We had a great Test Day! 49 testers combined ran a total of 341 tests and filed or referenced 35 bugs. 9 of those have since been closed as duplicates, leaving 26 open reports:

  • #1299505 gnome-calculator prints “Currency LTL is not provided by IMF or ECB” in Financial mode
  • #1330034 [abrt] calibre: QObject::disconnect() : python2.7 killed by SIGSEGV
  • #1367846 Scrolling is way too fast in writer
  • #1376471 global menu can’t be used for 15-30 seconds, startup notification stuck, missing item in alt+tab (on wayland)
  • #1379098 [Regression] Gnome-shell crashes on switching back from tty ([abrt] gnome-shell: wl_resource_post_event() : gnome-shell killed by SIGSEGV)
  • #1383471 [abrt] WARNING: CPU: 3 PID: 12176 at ./include/linux/swap.h:276 page_cache_tree_insert+0x1cc/0x1e0
  • #1384431 activities screen shows applications and search results at the same time
  • #1384440 dragging gnome dash application to a specific workplace doesn’t open the application to this workspace (on wayland)
  • #1384489 Music not recognizing/importing files
  • #1384502 Recent tab not available
  • #1384537 Opening a new gnome-software window creates a new entry without a proper icon
  • #1384546 Removing application does not bring back the install icon immediately
  • #1384551 Printing directions using maps do not show the marked path
  • #1384560 Screenshot of gnome-maps does not show the map part at all
  • #1384569 Places dropdown search does not function if weather is open on secondary monitor
  • #1384570 gnome-initial-setup does not exit at the end
  • #1384572 Places dropdown search does not function if clocks is open on secondary monitor
  • #1384590 [abrt] gnome-photos: babl_get_name() : gnome-photos killed by SIGABRT
  • #1384596 gnome-boxes: starting fails without any feedback
  • #1384599 gnome-calculator currency conversion is hard to use
  • #1384616 thumbnail “border” seems misaligned in activities overview
  • #1384651 Selecting city should automatically be added on Clock Application
  • #1384665 [abrt] authconfig-gtk: gdk_window_enable_synchronized_configure() : python2.7 killed by SIGSEGV
  • #1384671 system-config-language does not work under Wayland
  • #1384675 system-config-users does not work under Wayland
  • #1384678 Missing top-left icon (and full application name) on Wayland

Some of these are not Wayland bugs, but it’s not a bad thing that people found some non-Wayland bugs as well while testing! We did find several new Wayland issues, but on the positive side, no really big bugs that weren’t already known and on the radar for the final release.

So the event looks like a success on all fronts: we found some new bugs to squish, but it also gives us a decent indication that the Workstation-on-Wayland experience is in a good enough condition for a first stable release. We also confirmed that the Workstation-on-X11 session is available as a fallback and that works properly, for anyone who can’t use Wayland for any reason.

Many thanks to all the testers for their hard work!

wikitcms, relval, fedfind and testdays moved to Pagure

Posted by Adam Williamson on October 14, 2016 09:54 PM

Today I moved several of my pet projects – wikitcms, relval, fedfind and testdays – from the cgit instance on this server to Pagure; you can now find them there.

The home page URLs for each project on this server – e.g. https://www.happyassassin.net/fedfind – also now redirect to the Pagure project pages.

I also deleted some other repos that were hosted in my cgit instance entirely, because I don’t think they were any longer of interest to anyone and I didn’t want to maintain them. Those were mostly related to Fedlet, which I haven’t been working on for 2-3 years now.

For now the repos for the three main projects – wikitcms, relval and fedfind – remain in my cgit instance, containing just a single text file documenting the move to Pagure; in a month or so I will remove these repositories and decommission the cgit instance. So, update your checkouts! 🙂

This saves me maintaining the repos, provides pull request and issue mechanisms, and it’s a good thing to have all Fedora-ish code projects in Pagure in general, I think.

Many thanks to pingou and everyone else who works on Pagure, it’s a great project!

Fedora Workstation Wayland Test Day: 2016-10-13!

Posted by Adam Williamson on October 12, 2016 12:23 AM

Hi folks! Time to announce another Fedora 25 Test Day: Wayland Test Day! Note, the wiki page doesn’t exist yet, as the wiki is having issues right now, but I wanted to make the announcement early enough to give people time to prepare.

You may have read that we plan to make Fedora 25 Workstation the first release to default to Wayland (rather than X11) as the graphical server. This has been in place since Fedora 25 Alpha, but to prepare for the general release, we’d like to run a test day and get some broad-based testing to ensure that Wayland is at least good enough for an initial release, and that the option to switch to X11 works properly for those cases where it might be necessary.

You’ll be able to run most of the tests from a live image (without doing a permanent installation). All the test instructions will be on the wiki page and there will be QA and developer folks around all day in the IRC channel to help you test and report any issues you find.

Just about anyone with a computer can help with this testing, and we’d like to have feedback from as many users as possible, so please, if you have a little time on Thursday, come help out! As always, the event will be in #fedora-test-day on Freenode IRC. If you don’t know how to use IRC, you can read these instructions, or just use WebIRC.

X crash during Fedora update when system has hybrid graphics and systemd-udev is in the update

Posted by Adam Williamson on October 04, 2016 09:36 PM

Hi folks! This is a PSA about a fairly significant bug we’ve recently been able to pin down in Fedora 24+.

Here’s the short version: especially if your system has hybrid graphics (that is, it has an Intel video adapter and also an AMD or NVIDIA one, and it’s supposed to switch to the most appropriate one for what you’re currently doing – NVIDIA calls this ‘Optimus’), DON’T UPDATE YOUR SYSTEM BY RUNNING DNF FROM THE DESKTOP. (This also applies if you have multiple graphics adapters that aren’t strictly ‘hybrid graphics’; the bug affects any case with multiple adapters.)

Here’s the slightly longer version. If your system has more than one graphics adapter, and you update the systemd-udev package while X is running, X may well crash. So if the update process was running inside the X session, it will also crash and will not complete. This will leave you in the unfortunate situation where RPM thinks you have two versions of several packages installed at the same time (and also a bunch of package scripts that should have run will not have run).

The bug is actually triggered by restarting systemd-udev-trigger.service; anything which does that will cause X to crash on an affected system. So far only systems with multiple adapters are reported to be affected; not absolutely all such systems are affected, but a good percentage appear to be. It occurs when the systemd-udev package is updated because the package %postun scriptlet – which is run on update when the old version of the package is removed – restarts that service.

The safest possible way to update a Fedora system is to use the ‘offline updates’ mechanism. If you use GNOME, this is how updates work if you just wait for the notifications to appear, the ones that tell you you can reboot to install updates now. What’s actually happening there is that the system has downloaded and cached the updates, and when you click ‘reboot’, it will boot to a special state where very few things are running – just enough to run the package update – run the package update, then reboot back to the normal system. This is the safest way to apply updates. If you don’t want to wait for notifications, you can run GNOME Software, click the Updates button, and click the little circular arrow to force a refresh of available updates.

If you don’t use GNOME, you can use the offline update system via pkcon, like this:

sudo pkcon refresh force && \
sudo pkcon update --only-download && \
sudo pkcon offline-trigger && \
sudo systemctl reboot

If you don’t want to use offline updates, the second safest approach is to run the update from a virtual terminal. That is, instead of opening a terminal window in your desktop, hit ctrl-alt-f3 and you’ll get a console login screen. Log in and run the update from this console. If your system is affected by the bug, and you leave your desktop running during the update, X will still crash, but the update process will complete successfully.

If your system only has a single graphics adapter, this bug should not affect you. However, it’s still not a good idea to run system updates from inside your desktop, as any other bug which happens to cause either the terminal app, or the desktop, or X to crash will also kill the update process. Using offline updates or at least installing updates from a VT is much safer.

The bug reports for this issue are:

  • #1341327 – for the X part of the problem
  • #1378974 – for the systemd part of the problem

Updates for Fedora 24 and Fedora 25 are currently being prepared. However, the nature of the bug actually means that installing the update will trigger the bug, for the last time. The updates will ensure that subsequent updates to systemd-udev will no longer cause the problem. We are aiming to get the fix into Fedora 25 Beta, so that systems installed from Fedora 25 Beta release images will not suffer from the bug at all, but existing Fedora 25 systems will encounter the bug when installing the update.

UEFI for QEMU now in Fedora repositories

Posted by Kamil Páral on June 27, 2016 12:55 PM

I haven’t seen any announcement, but I noticed Fedora repositories now contain the edk2-ovmf package. That is the package necessary to emulate UEFI in QEMU/KVM virtual machines. It seems all licensing issues have finally been resolved, and now you can easily run UEFI systems in your virtual machines!

I have updated the Using_UEFI_with_QEMU wiki page accordingly.

Enjoy.


‘Package XXX is not signed’ error during upgrade to Fedora 24

Posted by Kamil Páral on June 22, 2016 11:54 AM

Many people hit issues like this when trying to upgrade to Fedora 24:

 Error: Package a52dec-0.7.4-19.fc24.x86_64.rpm is not signed

You can easily see that this is a very widespread issue if you look at the comments section under our upgrade guide on Fedora Magazine. In fact, this issue probably affects everyone who has the rpmfusion repository enabled (which is a very popular third-party repository). Usually the a52dec package is mentioned, because it’s early in the alphabetical listing, but it can be a different one (depending on what you installed from rpmfusion).

The core issue is that even though their Fedora 24 repository is available, the packages in it are not signed yet – they simply have not had time to do that. However, rpmfusion repository metadata from Fedora 23 demands that all packages be signed (which is a good thing – package signing is crucial to prevent all kinds of nasty security attacks). The outcome is that DNF rejects the transaction as insecure.

According to the rpmfusion maintainers, they are working on signing their repositories, and it should hopefully be done soon. So if you’re not in a hurry with your upgrade, just wait a while and the problem should disappear.

But, if you insist that you want to upgrade now, what are your options?

Some people suggest you can add the --nogpgcheck option to the command line. Please don’t do that! That completely bypasses all security checks, even for proper Fedora packages! It will leave you vulnerable to security attacks.

A much better option is to temporarily remove the rpmfusion repositories:

$ sudo dnf remove 'rpmfusion-*-release'

and run the upgrade command again. You’ll likely need to add the --allowerasing option, because it will probably want to remove some packages that you installed from rpmfusion (like vlc):

$ sudo dnf system-upgrade download --releasever=24 --allowerasing

This is OK: after you upgrade your system, you can enable the rpmfusion repositories again and install the packages that were removed prior to the upgrade.

(I recommend really removing the rpmfusion repositories rather than just disabling them, because they manage their repos in a non-standard way, enabling and disabling their updates and updates-testing repos during the system lifecycle according to their needs. That makes it hard to know which repos to enable after the system upgrade – they are not the same ones that were enabled before. What they are doing is really rather ugly, and it’s much better to perform a clean installation of their repos.)

After the system upgrade finishes, simply visit their website, install the repos again, and install any packages that you’re missing. This way, your upgrade was performed safely. The packages installed from rpmfusion might still be installed unsafely (depending on whether they manage to sign the repo by that time or not), but that’s much better than upgrading your whole system unsafely.

To close this up, I’m sorry that people are hit by these complications, but it’s not something the Fedora project can directly influence (except by banning third-party repos during system upgrades completely, or some similarly drastic measure). This is in the hands of those third-party repos. Hopefully a lot of this pain will go away once we start using Flatpak.