Fedora Quality Planet

New openQA tests: update live image build/install

Posted by Adam Williamson on February 08, 2019 06:20 PM

Hot on the heels of adding installer image build/install tests to openQA, I’ve now added tests which do just the same, but for the Workstation live image.

That means that, when running the desktop tests for an update, openQA will also run a test that builds a Workstation live image and a test that boots and installs it. The packages from the update will be used – if relevant – in the live image creation environment, and included in the live image itself. This will allow us to catch problems in updates that relate to the build and basic functionality of live images.

Here’s an update where you can see that both the installer and live image build tests ran successfully and passed – see the updates-everything-boot-iso and updates-workstation-live-iso flavors.

I’m hoping this will help us catch compose issues much more easily during the upcoming Fedora 30 release process.

Devconf.cz 2019 trip report

Posted by Adam Williamson on February 06, 2019 07:01 PM

I’ve just got back from my Devconf.cz 2019 trip, after spending a few days after the conference in Red Hat’s Brno office with other Fedora QA team members, then a few days visiting family.

I gave both my talks – Don’t Move That Fence ‘Til You Know Why It’s There and Things Fedora QA Robots Do – and both were well-attended and, I think, well-received. The slide decks are up on the talk pages, and recordings should I believe go up on the Devconf Youtube channel soon.

I attended many other talks, my personal favourite being Stef Walter’s Using machine learning to find Linux bugs. Stef noticed something I also have noticed in our openQA work – that “test flakes” are very often not just some kind of “random blip” but genuine bugs that can be investigated and fixed with a little care – and ran with it, using the extensive amount of results generated by the automated test suite for Cockpit as input data for a machine learning-based system which clusters “test flakes” based on an analysis of key data from the logs for each test. In this way they can identify when a large number of apparent “flakes” seem to have significant common features and are likely to be occurrences of the same bug, allowing them then to go back and analyze the commonalities between those cases and identify the underlying bug. We likely aren’t currently running enough tests in openQA to utilize the approach Stef outlined in full, but the concept is very interesting and may be useful in future with more data, and perhaps for Fedora CI results.

Other useful and valuable talks I saw included Dan Walsh on podman, Lennart Poettering on portable services, Daniel Mach and Jaroslav Mracek on the future of DNF, Kevin Fenzi and Stephen Smoogen on the future of EPEL, Jiri Benc and Marian Šámal on a summer coding camp for kids, Ben Cotton on community project management, the latest edition of Will Woods’ and Stephen Gallagher’s crusade to kill all scriptlets, and the Fedora Council BoF.

There were also of course lots of useful “hallway track” sessions with Miroslav Vadkerti, Kevin Fenzi, Mohan Boddu, Patrick Uiterwijk, Alexander Bokovoy, Dominik Perpeet, Matthew Miller, Brian Exelbierd and many more – it is invaluable to be able to catch up with people in person and discuss things that are harder to do in tickets and IRC.

As usual it was an enjoyable and productive event, and the rum list at the Bar That Doesn’t Exist remains as impressive as ever…;)

Devconf.cz 2019

Posted by Adam Williamson on January 24, 2019 08:31 PM

For anyone who – inexplicably – hasn’t already had it in their social calendar in pink sharpie for months, I will be at Devconf.cz 2019 this weekend, at FIT VUT in Brno. I’ll be doing two talks: Things Fedora QA Robots Do on Friday at 3pm (which is basically a brain dump about the pile of little fedmsg consumers that do quite important jobs that probably no-one knows about but me), and Don’t Move That Fence ‘Til You Know Why It’s There on Saturday at 11am, which is a less QA-specific talk that’s about how I reckon you ought to go about changing code. The slides for both talks are up now, if you want a sneak preview (though if you do, you’re disqualified from the audience participation section of the “fence” talk!)

Do come by to the talks, if you’re around and there’s nothing more interesting in that timeslot. Otherwise feel free to buttonhole me around the conference any time.

New openQA tests: update installer tests and desktop app start/stop test

Posted by Adam Williamson on January 23, 2019 10:20 AM

It’s been a while since I wrote about significant developments in Fedora openQA, so today I’ll be writing about two! I wrote about one of them a bit in my last post, but that was primarily about a bug I ran into along the way, so now let’s focus on the changes themselves.

Testing of install media built from packages in updates-testing

We have long had a problem in Fedora testing that we could not always properly test installer changes. This is most significant during the period of development after a new release has branched from Rawhide, but before it is released as the new stable Fedora release (we use the name ‘Branched’ to refer to a release in this state; in a month or so, Fedora 30 will branch from Rawhide and become the current Branched release).

During most of this time, the Bodhi update system is enabled for the release. New packages built for the release do not immediately appear in any repositories, but – as with stable releases – must be submitted as “updates”, sometimes together with related packages. Once submitted as an update, the package(s) are sent to the “updates-testing” repository for the release. This repository is enabled on installed Branched systems by default (this is a difference from stable releases), so testers who have already installed Branched will receive the package(s) at this point (unless they disable the “updates-testing” repository, which some do). However, the package is still not truly a part of the release at this point. It is not included in the nightly testing composes, nor will it be included in any Beta or Final candidate composes that may be run while it is in updates-testing. That means that if the actual release media were composed while the package was still in updates-testing, it would not be a part of the release proper. Packages only become part of these composes once they pass through Bodhi and are ‘pushed stable’.

This system allows us to back out packages that turn out to be problematic, and hopefully to prevent them from destabilizing the test and release composes by not pushing them stable if they turn out to cause problems. It also means more conservative testers have the option to disable the “updates-testing” repository and avoid some destabilizing updates, though of course if all the testers did this, no-one would be finding the problems. In the last few years we have also been running several automated tests on updates (via Taskotron, openQA and the CI pipeline) and reporting results from those to Bodhi, allowing packagers to pull the update if the tests find problems.

However, there has long been a bit of a problem in this process: if the update works fine on an installed system but causes problems when included in (for example) an installer image or live image, we had no good way to find that out. There was no system for automatically building media like this that include the updates currently in testing so they could be tested. The only way to find this sort of problem was for testers to manually create test media – a process that is not widely understood, is time consuming, and can be somewhat difficult. We also, of course, could not do automated testing without media to test.

We’ve looked at different ways of addressing this in the past, but ultimately none of them came to much (yet), so last year I decided to just go ahead and do something. And after a bit of a roadblock (see that last post), that something is now done!

Our openQA now has two new tests it runs on all the updates it tests. The first test – here’s an example run – builds a network install image, and the second – example run – tests it. Most importantly, any packages from the update under testing are both used in the process of building the install image (if they are relevant to that process) and included in the installer image (if they are packages which would usually be in such an image). Thus if the update breaks the production of the image, or the basic functionality of the image itself, this will be caught. This (finally) means that we have some idea whether a new anaconda, lorax, pykickstart, systemd, dbus, blivet, dracut or any one of dozens of other key packages might break the installer. If you’re a packager and you see that one of these two tests has failed for your update, we should look into that! If you’re not sure how to go about that, you can poke me, bcl, or the anaconda developers in Freenode #anaconda, and we should be able to help.

It is also possible for a human tester to download the image produced by the first test and run more in-depth tests on it manually; I haven’t yet done anything to make that possibility more visible or easier, but will try to look into ways of doing that over the next few weeks.

GNOME application start/stop testing

My colleague Lukáš Růžička has recently been looking into what we might be able to do to streamline and improve our desktop application testing, something I’d honestly been avoiding because it seemed quite intractable! After some great work by Lukáš, one major fruit of this work is now visible in Fedora openQA: a GNOME application start/stop test suite. Here’s an example run of it – note that more recent runs have a ton of failures caused by a change in GNOME, Lukáš has proposed a change to the test to address that but I have not yet reviewed it.

This big test suite just tests starting and then exiting a large number of the default installed applications on the Fedora Workstation edition, making sure they both launch and exit successfully. This is of course pretty easy for a human to do – but it’s extremely tedious and time-consuming, so it’s something we don’t do very often at all (usually only a handful of times per release cycle), meaning we may not notice that an application which perhaps we don’t commonly use has a very critical bug (like failing to launch at all) for some time.

Making an automated system like openQA do this is actually quite a lot of work, so it was a great job by Lukáš to get it working. Now by monitoring the results of this test on the nightly composes closely, we should find out much more quickly if one of the tested applications is completely broken (or has gone missing entirely).

AdamW’s Debugging Adventures: The Mysterious Disappearing /proc

Posted by Adam Williamson on January 18, 2019 03:15 AM

Yep, folks, it’s that time again – time for one of old Grandpa Adam’s tall tales of root causing adventure…

There’s a sort of catch-22 situation in Fedora that has been a personal bugbear for a very long time. It mainly affects Branched releases – each new Fedora release, when it has branched from Rawhide, but before it has been released. During this period the Bodhi update system is in effect, meaning all new packages have to go through Bodhi review before they are included in the composes for the release. This means, in theory, we should be able to make sure nothing really broken lands in the release. However, there’s a big class of really important updates we have never been able to test properly at all: updates that affect the installer.

The catch-22 is this – release engineering only builds install media from the ‘stable’ package set, those packages that have gone through review. So if a package under review breaks the installer, we can’t test whether it breaks the installer unless we push it stable. Well, you can, but it’s quite difficult – you have to learn how to build an installer image yourself, then build one containing the packages from the update and test it. I can do that, but most other people aren’t going to bother.

I’ve filed bugs and talked to people about ways to resolve this multiple times over many years, but a few months back I just got sick of the problem and decided to fix it myself. So I wrote an openQA update test which automates the process: it builds an installer image, with the packages from the update available to the installer image build tool. I also included a subsequent test which takes that image and runs an install with it. Since I already had the process for doing this manually down pat, it wasn’t actually very difficult.

Only…when I deployed the test to the openQA staging instance and actually tried it out, I found the installer image build would frequently fail in a rather strange way.

The installer image build process works (more or less) by creating a temporary directory, installing a bunch of packages to it (using dnf’s feature of installing to an alternative ‘root’), fiddling around with that environment a bit more, creating a disk image whose root is that temporary directory, then fiddling with the image a bit to make it into a bootable ISO. (HANDWAVE HANDWAVE). However, I was finding it would commonly fail in the ‘fiddling around with the environment’ stage, because somehow some parts of the environment had disappeared. Specifically, it’d show this error:

FileNotFoundError: [Errno 2] No such file or directory: '/var/tmp/lorax.q8xfvc0p/installroot//proc/modules'

lorax was, at that point, trying to touch that file (never mind why) – the /proc/modules inside the temporary root, basically. The question was, why was it disappearing? And why had neither I nor bcl (the lorax maintainer) seen it happening previously in manual use, or in official composes?

I tried reproducing it in a virtual machine…and failed. Then I tried again, and succeeded. Then I ran the command again…and it worked! That pattern turned out to repeat: I could usually get it to happen the first time I tried it in a VM, but any subsequent attempts in the same VM succeeded.

So this was seeming really pretty mysterious. Brian couldn’t get it to happen at all.

At this point I wrote a dumb, short Python script which just constantly monitored the disappearing location and told me when it appeared and disappeared. I hacked up the openQA test to run this script, and upload the result. Using the timestamps, I was able to figure out exactly what bit of lorax was running when the directory suddenly disappeared. But…I couldn’t immediately see why anything in that chunk of lorax would wind up deleting the directory.
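The script itself isn’t reproduced here, but the idea is simple enough that a minimal version is easy to sketch (the path and poll interval below are illustrative, not the ones I actually used):

```python
import os
import time


def watch(path, interval=0.1, duration=600):
    """Poll `path`, printing a timestamped line whenever it appears or
    disappears, for up to `duration` seconds. The timestamps can then be
    correlated with what lorax was doing at each moment."""
    present = os.path.exists(path)
    deadline = time.monotonic() + duration
    while time.monotonic() < deadline:
        now = os.path.exists(path)
        if now != present:
            state = "appeared" if now else "disappeared"
            print(f"{time.strftime('%H:%M:%S')}: {path} {state}", flush=True)
            present = now
        time.sleep(interval)
```

Nothing clever – but pointing something like this at the problem location and uploading its output from the test was enough to pin down the timing.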

At this point, other work became more important, and I wound up leaving this on the back burner for a couple of months. Then I came back to it a couple days ago. I picked back up where I left off, and did a custom build of lorax with some debug logging statements strewn around the relevant section, to figure out really precisely where we were when things went wrong. But this turned out to be a bit of a brick wall, because it turned out that at the time the directory disappeared, lorax was just…running mksquashfs. And I could not figure out any plausible reason at all why a run of mksquashfs would cause the directory to vanish.

After a bit, though, the thought struck me – maybe it’s not lorax itself wiping the directory out at all! Maybe something else is doing it. So I thought to look at the system logs. And lo and behold, I found my smoking gun. At the exact time my script logged that the directory had disappeared, this message appeared in the system log:

Jan 18 01:57:30 ibm-p8-kvm-03-guest-02.virt.pnr.lab.eng.rdu2.redhat.com systemd[1]: Starting Cleanup of Temporary Directories...

Now, remember our problem directory is in /var/tmp. So this smells very suspicious indeed! So I figured out what that service actually is – to do this, you just grep for the description (“Cleanup of Temporary Directories”) in /usr/lib/systemd/system – and it turned out to be /usr/lib/systemd/system/systemd-tmpfiles-clean.service, which is part of systemd’s systemd-tmpfiles mechanism, which you can read up on in great detail in man systemd-tmpfiles and man tmpfiles.d.

I had run into it a few times before, so I had a vague idea what I was dealing with and what to look for. It’s basically a mechanism for managing temporary files and directories: you can write config snippets which systemd will read and do stuff like creating expected temporary files or directories on boot (this lets packages manage temporary directories without doing it themselves in scriptlets). I poked through the docs again and, sure enough, it turns out another thing the system can do is delete temporary files that reach a certain age:

The date field, when set, is used to decide what files to delete when cleaning. If a file or directory is
older than the current time minus the age field, it is deleted. The field format is a series of integers
each followed by one of the following suffixes for the respective time units: s, m or min, h, d, w, ms, and
us, meaning seconds, minutes, hours, days, weeks, milliseconds, and microseconds, respectively. Full names
of the time units can be used too.

This systemd-tmpfiles-clean.service does that job. So I went looking for tmpfiles.d snippets that cover /var/tmp, and sure enough, found one, in Fedora’s stock config file /usr/lib/tmpfiles.d/tmp.conf:

q /var/tmp 1777 root root 30d

The 30d there is the ‘age’ field. So this tells the tmpfiles mechanism that it’s fine to wipe anything under /var/tmp which is older than 30 days.

Of course, naively we might think our directory won’t be older than 30 days – after all, we only just ran lorax! But remember, lorax installs packages into this temporary directory, and files and directories in packages get some of their time attributes from the package. So we (at this point, Brian and I were chatting about the problem as I poked it) looked into how systemd-tmpfiles defines age, precisely:

The age of a file system entry is determined from its last modification timestamp (mtime), its last access
timestamp (atime), and (except for directories) its last status change timestamp (ctime). Any of these three
(or two) values will prevent cleanup if it is more recent than the current time minus the age field.

So since our thing is a directory, its mtime and atime are relevant. So Brian and I both looked into those. He did it manually, while I hacked up my check script to also print the mtime and atime of the directory when it existed. And sure enough, it turned out these were several months in the past – they were obviously related to the date the filesystem package (from which /proc/modules comes) was built. They were certainly longer than 30 days ago.
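That check is easy to replicate by hand: os.stat() exposes both timestamps, so a small helper can apply the same rule the docs describe for directories (this is a sketch of the rule, not systemd’s actual implementation):

```python
import os
import time


def tmpfiles_age_ok(path, max_age_days=30):
    """Mimic systemd-tmpfiles' cleanup rule for directories: the entry
    is safe from cleanup if either its mtime or its atime is newer than
    now minus the configured age (ctime is ignored for directories)."""
    st = os.stat(path)
    cutoff = time.time() - max_age_days * 86400
    return st.st_mtime > cutoff or st.st_atime > cutoff
```

Running something like this against the installroot’s /proc gave the same answer Brian got manually: both timestamps were months old, well past the 30-day cutoff.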

Finally, I looked into what was actually running systemd-tmpfiles-clean.service; it’s run on a timer, systemd-tmpfiles-clean.timer. That timer is set to run the service 15 minutes after the system boots, and every day thereafter.

So all of this hooked up nicely into a convincing story. openQA kept running into this problem because it always runs the test in a freshly-booted VM – that ’15 minutes after boot’ was turning out to be right in the middle of the image creation. My manual reproductions were failing on the first try for the same reason – but then succeeding on the second and subsequent tries because the cleaner would not run again until the next day. And Brian and I had either never or rarely seen this when we ran lorax manually for one reason or another because it was pretty unlikely the “once a day” timer would happen to wake up and run just when we had lorax running (and if it did happen, we’d try again, and when it worked, we’d figure it was just some weird transient failure). The problem likely never happens in official composes, I think, because the tmpfiles timer isn’t active at all in the environment lorax gets run in (haven’t double-checked this, though).

Brian now gets to deal with the thorny problem of trying to fix this somehow on the lorax side (so the tmpfiles cleanup won’t remove bits of the temporary tree even if it does run while lorax is running). Once I knew what was going on, it was easy enough to work around in the openQA test – I just have the test do systemctl stop systemd-tmpfiles-clean.timer before running the image build.

AdamW’s Debugging Adventures: Python 3 Porting 201

Posted by Adam Williamson on January 09, 2019 04:12 AM

Hey folks! Time for another edition of AdamW’s Debugging Adventures, wherein I boast about how great I am at fixin’ stuff.

Today’s episode is about a bug in the client for Fedora’s Koji buildsystem which has been biting more and more Fedora maintainers lately. The most obvious thing it affects is task watching. When you do a package build with fedpkg, it will by default “watch” the build task – it’ll update you when the various subtasks start and finish, and not quit until the build ultimately succeeds or fails. You can also directly watch tasks with koji watch-task. So this is something Fedora maintainers see a lot. There’s also a common workflow where you chain something to the successful completion of a fedpkg build or koji watch-task, which relies on the task watch exiting 0 if (and only if) the build actually completed.

However, recently, people noticed that this task watching seemed to be just…failing, quite a lot. While the task was still running, it’d suddenly exit, usually showing this message:

ConnectionError: (‘Connection aborted.’, RemoteDisconnected(‘Remote end closed connection without response’,))

After a while, nirik realized that this seemed to be associated with the client going from running under Python 2 by default to running under Python 3 by default: the failure shows up when running on Python 3, but not on Python 2.

Today I finally decided it had got annoying enough that I’d spend some time trying to track it down.

It’s pretty obvious that the message we see relates to an exception, in some way. But ultimately something is catching that exception and printing it out and then exiting (we’re not actually getting a traceback, as you do if the exception is ultimately left to reach the interpreter). So my first approach was to dig into the watch-task code from the top down, and try and find something that handles exceptions that looks like it might be the bit we were hitting.

And…I failed! This happens, sometimes – in fact I still haven’t found the exact bit of code that prints the message and exits. It’s OK. Don’t give up. Try something else!

So what I did next was kind of a long shot – I just grepped the code for the exception text. I wasn’t really expecting this to work, as there’s nothing to suggest the actual exception is part of Koji; it’s most likely the code doesn’t contain any of that text at all. But hey, it’s easy to do, so why not? And as it happened, I got lucky and hit paydirt: there happens to be a comment with some of the text from the error we’re hitting. And it sure looks like it might be relevant to the problem we’re having! The comment itself, and the function it’s in, looked so obviously promising that I went ahead and dug a little deeper.

That function, is_conn_error(), is used by only one other thing: this _sendCall() method in the same file. And that seems very interesting, because what it does can be boiled down to: “hey, we got an error! OK, send it to is_conn_error(). If that returns True, then just log a debug message and kick the session. If that returns False, then raise an exception”. That behaviour obviously smells a lot like it could be causing our problem. So, I now had a working theory: for some reason, given some particular server behaviour, is_conn_error() returns True on Python 2 but False on Python 3. That causes this _sendCall() to raise an exception instead of just resetting the session and carrying on, and some other code – which we no longer need to find – catches that exception, prints it, and quits.

The next step was to test this theory – because at this point it’s only a theory, it could be entirely wrong. I’ve certainly come up with entirely plausible theories like this before which turned out to be not what was going on at all. So, like a true lazy shortcut enthusiast, I hacked up my local copy of Koji’s __init__.py and sprinkled a bunch of lines like print("HERE 1!") and print("HERE 2!") through the whole of is_conn_error(). Then I just ran koji watch-task commands on random tasks until one failed.

This is fine. When you’re just trying to debug the problem you don’t need to be super elegant about it. You don’t need to do a proper git patch and rebuild the Koji package for your system and use proper logging methods and all the rest of it. Just dumping some print lines in a working copy of the file is just fine, if it works. Just remember to put everything back as it was before later. 🙂

So, as it happened the god of root causing was on my side today, and it turned out I was right on the money. When one of the koji watch-task commands failed, it hit my HERE 1! and HERE 3! lines right when it died. Those told me we were indeed running through is_conn_error() right before the error, and further, where we were coming out of it. We were entering the if isinstance(e, socket.error) block at the start of the function, and returning False because the exception (e) did appear to be an instance of socket.error, but either did not have an errno attribute, or it was not one of errno.ECONNRESET, errno.ECONNABORTED, or errno.EPIPE.
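In simplified form – this is a sketch of the structure just described, not Koji’s exact source – the path we were taking looks like this:

```python
import errno
import socket


def is_conn_error_sketch(e):
    """Sketch of the relevant part of is_conn_error(), simplified from
    the description above; not Koji's actual code."""
    if isinstance(e, socket.error):
        if getattr(e, "errno", None) in (
                errno.ECONNRESET, errno.ECONNABORTED, errno.EPIPE):
            return True
        # this is the return False we were hitting
        return False
    # (the real function has further checks for other exception
    # types after this point)
    return False
```

So the exception was getting into the socket.error branch, failing the errno check, and falling out of that early return False.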

Obviously, this made me curious as to what the exception actually is, whether it has an errno at all, and if so, what it is. So I threw in a few more debugging lines – to print out type(e), and getattr(e, 'errno', 'foobar'). The result of this was pretty interesting. The second print statement gave me ‘foobar’, meaning the exception doesn’t have an errno attribute at all. And the type of the exception was…requests.exceptions.ConnectionError.

That’s a bit curious! You wouldn’t necessarily expect requests.exceptions.ConnectionError to be an instance of socket.error, would you? So why are we in a block that only handles instances of socket.error? Also, it’s clear the code doesn’t expect this, because there’s a block later in the function that explicitly handles instances of requests.exceptions.ConnectionError – but because this earlier block that handles socket.error instances always returns, we will never reach that block if requests.exceptions.ConnectionError instances are also instances of socket.error. So there’s clearly something screwy going on here.

So of course the next thing to do is…look up socket.error in the Python 2 and Python 3 docs. ANY TIME you’re investigating a mysterious Python 3 porting issue, remember this can be useful. Here’s the Python 2 socket.error entry, and the Python 3 socket.error entry. And indeed there’s a rather significant difference! The Python 2 docs talk about socket.error as an exception that is, well, its own unique thing. However, the Python 3 docs say: “A deprecated alias of OSError.” – and even tell us specifically that this changed in Python 3.3: “Changed in version 3.3: Following PEP 3151, this class was made an alias of OSError.” Obviously, this is looking an awful lot like one more link in the chain of what’s going wrong here.

With a bit of Python knowledge you should be able to figure out what’s going on now. Think: if socket.error is now just an alias of OSError, what does if isinstance(e, socket.error) mean, in Python 3.3+? It means just the same as if isinstance(e, OSError). And guess what? requests.exceptions.ConnectionError happens to be a subclass of OSError. Thus, if e is an instance of requests.exceptions.ConnectionError, isinstance(e, socket.error) will return True in Python 3.3+. In Python 2, it returns False. It’s easy to check this in an interactive Python shell or with a test script, to confirm.
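A quick confirmation in that spirit, using the built-in ConnectionError (requests’ ConnectionError ultimately subclasses IOError, i.e. OSError, so it behaves the same way):

```python
import socket

# In Python 3.3+, socket.error is literally the same class as OSError:
assert socket.error is OSError

# The built-in ConnectionError subclasses OSError, so the isinstance
# check that was meant to catch only socket errors matches it too:
assert isinstance(ConnectionError(), socket.error)

# On Python 2, socket.error was its own class, so the same isinstance
# check would return False for exceptions like these.
print("both checks pass on Python 3.3+")
```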

Because of this, when we run under Python 3 and e is a requests.exceptions.ConnectionError, we’re unexpectedly entering this block intended for handling socket.error exceptions, and – because that block always returns, with the return False line getting hit if the errno attribute check fails – we never reach the later block that’s intended to handle requests.exceptions.ConnectionError instances at all; we return False before we get there.

There are a few different ways you could fix this – you could just drop the return False short-circuit line in the socket.error block, for instance, or change the ordering so the requests.exceptions.ConnectionError handling is done first. In the end I sent a pull request which drops the return False, but also drops the if isinstance(e, socket.error) checks (there’s another, for nested exceptions, later) entirely. Since socket.error is meant to be deprecated in Python 3.3+ we shouldn’t really use it, and we probably don’t need to – we can just rely on the errno attribute check alone. Whatever type the exception is, if it has an errno attribute and that attribute is errno.ECONNRESET, errno.ECONNABORTED, or errno.EPIPE, I think we can be pretty sure this is a connection error.
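The errno-only version of the check is tiny – here’s a sketch of that approach (not the exact text of the pull request):

```python
import errno


def is_conn_error(e):
    """Treat any exception carrying one of the classic connection errnos
    as a connection error, regardless of the exception's type. A sketch
    of the approach taken in the fix, not its exact code."""
    return getattr(e, "errno", None) in (
        errno.ECONNRESET, errno.ECONNABORTED, errno.EPIPE)
```

Because it never looks at the exception’s class, this behaves identically on Python 2 and Python 3, sidestepping the alias problem entirely.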

What’s the moral of this debugging tale? I guess it’s this: when porting from Python 2 to Python 3 (or doing anything similar to that), fixing the things that outright crash or obviously behave wrong is sometimes the easy part. Even if everything seems to be working fine on a simple test, it’s certainly possible that subtler issues like this could be lurking in the background, causing unexpected failures or (possibly worse) subtly incorrect behaviour. And of course, that’s just another reason to add to the big old “Why To Have A Really Good Test Suite” list!

There’s also a ‘secondary moral’, I guess, and that’s this: predicting all the impacts of an interface change like this is hard. Remember the Python 3 docs mentioned a PEP associated with this change? Well, here it is. If you read it, it’s clear the proposers actually put quite a lot of effort into thinking about how existing code might be affected by the change, but it looks like they still didn’t consider a case like this. They talk about “Careless (or “naïve”) code” which “blindly catches any of OSError, IOError, socket.error, mmap.error, WindowsError, select.error without checking the errno attribute”, and about “Careful code is defined as code which, when catching any of the above exceptions, examines the errno attribute to determine the actual error condition and takes action depending on it” – and claim that “useful compatibility doesn’t alter the behaviour of careful exception-catching code”. However, Koji’s code here clearly matches their definition of “careful” code – it considers both the exception’s type, and the errno attribute, in making decisions – but because it is not just doing except socket.error as e or similar, but catching the exception elsewhere and then passing it to this function and using isinstance, it still gets tripped up by the change.

So…the ur-moral, as always, is: software is hard!

AdamW’s Debugging Adventures: Has Anyone Seen My Kernel?

Posted by Adam Williamson on November 05, 2018 11:59 PM

Welp, since I haven’t blogged for a while, here’s another root-causing write up! Honestly, root-causing things is one of my favorite parts of the job, lately.

I’ve been on vacation recently, and when I came back, it looked like several things were wrong with Rawhide. Several of these were relatively easy fixes: live images not behaving right at all and the installer not launching properly any more both only took a couple of hours to track down. But there were also a couple of bugs causing more recent composes to fail entirely. The inimitable puiterwijk dealt with one of those (aarch64 cloud images not building properly), and I wound up taking the other one: overnight, most live image composes had suddenly stopped working.

What’s The Problem?

The starting point for this debugging odyssey was the error messages we got in the failed composes:

DEBUG pylorax.treebuilder: kernels=[]
ERROR livemedia-creator: No kernels found, cannot rebuild_initrds

So, the first step is pretty easy: let’s just go look up where those errors come from. pylorax and livemedia-creator are both part of the lorax tree, so we’ll start there. It’s easy to use grep to find the sources of those two messages: they’re both in treebuilder.py, the first here, in findkernels() and the second here, in TreeBuilder.rebuild_initrds(). As the second happens if there is no self.kernels, and we can see just a few lines further back that self.kernels is a property based on a call to findkernels(), it’s pretty obvious that ultimately what’s going wrong here is that findkernels() isn’t finding any kernels.

So…Why Aren’t We Finding Any Kernels?

So next, of course, I put my thinking cap on and had a good look at findkernels() – not just at the code itself, but at its history. When something that was working breaks, you’re always looking for a change that caused it. There were no recent changes to findkernels(), and I couldn’t see anything obviously wrong in its implementation – it’s basically looking for files named vmlinuz-(something) in /boot – so it didn’t look like the quest was over yet.
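To make the “basically looking for files named vmlinuz-(something) in /boot” part concrete, here’s a much-simplified sketch of that kind of scan. This is illustrative only – the names are mine, and the real pylorax findkernels() does more work (parsing flavours, collecting matching initrds and so on):

```python
import os
import re

def find_kernels(boot_dir="/boot"):
    """Collect version strings from vmlinuz-* files in boot_dir.

    A heavily simplified sketch of the kind of scan pylorax's
    findkernels() performs - not the actual lorax implementation.
    """
    kernels = []
    for name in sorted(os.listdir(boot_dir)):
        match = re.match(r"vmlinuz-(.+)", name)
        if match:
            kernels.append(match.group(1))
    return kernels
```

If /boot contains no vmlinuz-* files at all, a function like this returns an empty list – which is exactly the `kernels=[]` we saw in the compose logs.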

So at this point there were kinda two possibilities: findkernels() was broken, or it was working fine but there really weren’t any kernels to find. I decided the easiest way to figure out which we were dealing with was just to reproduce the problem and take a look. With any sufficiently complex problem, you can usually get some way into figuring it out just by looking at logs and code and thinking about things logically, but at some point you’ll want to try and reproduce the problem in a controlled environment you can poke at – knowing when you’re at that point isn’t an exact science, but if you find yourself with multiple possibilities and no easy way to decide which you’re dealing with, that’s often a good indicator.

Doing a test live image build is a bit awkward for Fedora at present, unfortunately, but not really too bad. I’ve been meaning to write a helper tool for a while, but never got around to it. The way I do it – because I want to replicate as close as possible how the official images are built – is just to try and do exactly what the build system does, only on my local system and by hand. What the build system does is spin up a mock root, install the live image creation tools and a flattened kickstart into it, and run the tool. So, that’s what I do! A handy ‘trick’ here is just to look at the logs from a real live image creation task – like this one – and shadow what they do. Note these logs will get garbage collected at some point, but I can’t do anything about that.

The main log to look at is root.log. First, we see that the tools are installed to the mock:

DEBUG util.py:439:  Marking packages as installed by the group:
DEBUG util.py:439:   @livemedia-build glibc-all-langpacks coreutils               lorax-lmc-novirt
DEBUG util.py:439:                    util-linux          selinux-policy-targeted bash            
DEBUG util.py:439:                    shadow-utils

So we’ll do just the same with our mock:

mock -r fedora-rawhide-x86_64 --install glibc-all-langpacks coreutils lorax-lmc-novirt util-linux selinux-policy-targeted bash shadow-utils

(Really all you want to do is install the livemedia-build group, but I haven’t actually found an equivalent of mock --groupinstall, so I just do that).

We can find the actual image compose command in the task log again, by looking for ‘livemedia-creator’ – it looks like this:

INFO backend.py:285:  Running in chroot: ['/sbin/livemedia-creator', '--ks', '/chroot_tmpdir/koji-image-f30-build-30594219.ks', '--logfile', '/chroot_tmpdir/lmc-logs/livemedia-out.log', '--no-virt', '--resultdir', '/chroot_tmpdir/lmc', '--project', 'Fedora-KDE-Live', '--make-iso', '--volid', 'Fedora-KDE-Live-rawh-20181101.n.', '--iso-only', '--iso-name', 'Fedora-KDE-Live-x86_64-Rawhide-20181101.n.0.iso', '--releasever', 'Rawhide', '--title', 'Fedora-KDE-Live', '--macboot']

we can easily turn that Python array into a console command by just replacing occurrences of ', ' with spaces and stripping the quotes and brackets:

/sbin/livemedia-creator --ks /chroot_tmpdir/koji-image-f30-build-30594219.ks --logfile /chroot_tmpdir/lmc-logs/livemedia-out.log --no-virt --resultdir /chroot_tmpdir/lmc --project Fedora-KDE-Live --make-iso --volid Fedora-KDE-Live-rawh-20181101.n. --iso-only --iso-name Fedora-KDE-Live-x86_64-Rawhide-20181101.n.0.iso --releasever Rawhide --title Fedora-KDE-Live --macboot

We can see that’s using a scratch directory called /chroot_tmpdir, and a kickstart file called koji-image-f30-build-30594219.ks. This kickstart can be found as one of the task assets, so we’ll grab it and copy it into the mock’s /root for now:

sudo cp koji-image-f30-build-30606109.ks /var/lib/mock/fedora-rawhide-x86_64/root/root

Then finally, we’re going to get a shell in the mock root, using the old-style chroot implementation (this is necessary for live image builds to work, the new systemd-based implementation doesn’t work yet):

mock -r fedora-rawhide-x86_64 --shell --old-chroot

Inside the mock, we’ll create the /chroot_tmpdir scratch dir, copy the kickstart into it, and finally run the image creation:

mkdir -p /chroot_tmpdir
cd /chroot_tmpdir
cp /root/koji-image-f30-build-30606109.ks .
/sbin/livemedia-creator --ks /chroot_tmpdir/koji-image-f30-build-30594219.ks --logfile /chroot_tmpdir/lmc-logs/livemedia-out.log --no-virt --resultdir /chroot_tmpdir/lmc --project Fedora-KDE-Live --make-iso --volid Fedora-KDE-Live-rawh-20181101.n. --iso-only --iso-name Fedora-KDE-Live-x86_64-Rawhide-20181101.n.0.iso --releasever Rawhide --title Fedora-KDE-Live --macboot

And when I did that, it worked away for a while – half an hour or so – and eventually failed exactly like the ‘official’ build had! So now I had a failure in a controlled environment (my little mock root) to look at. Note that if you’re playing along at home, this will only work so long as you can grab that kickstart from Koji, and the 20181101.n.0 compose files are kept around, which will only be for another two weeks or so – after that you won’t be able to reproduce this, but you can of course follow the same procedure with a newer Koji task if you want to reproduce a newer official live image build.

Next, I needed to examine the actual filesystem produced by the image build process and see if it really had any kernels in it (remember, that’s what we were trying to figure out). This requires a bit of knowledge about how livemedia-creator works, which you’d have had to look up if you didn’t know it already: it creates an image file, loopback mounts it, and installs into the loopback-mounted directory. When it fails, it leaves the image file around, and you can just mount it again and poke around. The file will be in the lmc/ subdirectory of the directory where the image build command was run, with a filename like lmc-disk-ctfz98m5.img (the alphanumeric soup bit is random), and we mount it like this:

mount -o loop lmc/lmc-disk-ctfz98m5.img /mnt/sysimage

(the mount point having been created and left around by the tool). Now we can look in /mnt/sysimage/boot, and when I did that…I found that, indeed, it contained no vmlinuz-* files at all! So, I had eliminated the possibility that findkernels() was going wrong: it was doing its job just fine, and it wasn’t finding any kernels because…there were no kernels to find.

OK…So Why Aren’t There Any Kernels?

So now I had to try and work out: why were there no kernels in the image’s /boot? I knew from the logs of earlier, successful image composes that, in successful composes, there really were kernels to find: it wasn’t that there had never been kernels, but this had only recently become fatal for some reason. The difference really was that there used to be kernels present when this lorax code ran, but now there weren’t.

This led me into a fun new piece of research: figuring out how kernel files get into /boot in a Fedora system at all. You might think – I did – that they’re simply perfectly normal packaged files installed by the kernel-core package. But it turns out, they’re not! The kernel-core package does own the relevant /boot/vmlinuz-* file, but it doesn’t actually directly install it: it uses an RPM directive called %ghost to own it without installing it. So the file must get installed some other way. Here again I cheated a bit with prior knowledge – I knew this overall mechanism existed, though I didn’t know until now that it really installed the kernel file itself – but if you don’t have that, you could look at the %posttrans script in the kernel package: when a kernel is installed, a command called kernel-install gets run.

I also found out (by diffing the logged packages from the 20181030.n.0 and 20181101.n.0 live image composes) that the kernel itself had been updated in the 20181101.n.0 compose (which was when things started breaking). So once again I had a couple of promising lines of inquiry: the new kernel, and this kernel-install path.

Well, turns out systemd haters in the audience can get very excited, because kernel-install is part of systemd:

[adamw@adam lorax (master %)]$ rpm -qf `which kernel-install`

Anyway, I refreshed my memory a bit about what kernel-install does, but it’s kinda complicated and it calls out to various other things, including /usr/sbin/new-kernel-pkg (part of grubby) and some plugins in /usr/lib/kernel/install.d (various of which come from systemd, grub2, and dracut). So I think what I did next (my memory gets a bit hazy) was to wonder whether the same problem would affect a regular install from the same packages.

I got the last working Rawhide network install image, and set it to install from the Everything repo from the failed 20181101.n.0 compose. I let that install run, then checked that the /boot/vmlinuz-(whatever) file existed in the installed system…which it did. This sort of let out one theory I had: that the new kernel package had somehow messed something up such that the kernel file never actually got installed properly at all.

So, I got to wondering whether kernel-install really was the thing that put the /boot/vmlinuz-(whatever) file in place (I still wasn’t sure at this point), whether it reproducibly failed to do so in the live image creation environment but succeeded in doing so in the regular install environment, and if so, what differed between the two.

I could see the exact kernel-install command just by examining the kernel-core package scripts:

rpm -q --scripts kernel-core | grep kernel-install
/bin/kernel-install add 4.20.0-0.rc0.git7.1.fc30.x86_64 /lib/modules/4.20.0-0.rc0.git7.1.fc30.x86_64/vmlinuz || exit $?

So I tried just deleting /boot/vmlinuz* and re-running that command in the installed system…and sure enough, the file re-appeared! So now I was pretty sure kernel-install was the thing that’s supposed to install it. I also tried doing this in my live image creation mock environment:

chroot /mnt/sysimage
/bin/kernel-install add 4.20.0-0.rc0.git7.1.fc30.x86_64 /lib/modules/4.20.0-0.rc0.git7.1.fc30.x86_64/vmlinuz

…and sure enough, it didn’t create the /boot/vmlinuz-(foo) file. So now I was narrowing in on the problem: something about the live image creation environment meant that this kernel-install invocation didn’t install the file, when it probably should.

OK…So Why Isn’t kernel-install Installing Kernels?

At this point I probably could’ve figured out the problem by reading the source if I’d read it carefully enough, but I decided to carry on with the practical experimentation. I tried running the script through sh -x in each environment, to see exactly what commands were run by the script in each case, and somehow – I forget how – I zeroed in on one of the /usr/lib/kernel/install.d plugin scripts: /usr/lib/kernel/install.d/20-grub.install. This is part of grub2. I think I found these scripts from the sh -x output, and noticed that this one has what looks like the code to actually install the kernel file to /boot. So I made that script run with -x as well, and this finally got me my next big breakthrough. In the installed system I could see that script doing a lot of stuff, but in the live environment it seemed to exit almost as soon as it started:

+ [[ -n '' ]]
+ exit 0

It’s not 100% obvious, but I was pretty sure that just meant it was failing in the test right at the start:

if ! [[ $KERNEL_INSTALL_MACHINE_ID ]]; then
    exit 0
fi

So I went and looked up $KERNEL_INSTALL_MACHINE_ID and the references suggested that it’s basically tied to /etc/machine-id. So I looked, and, lo and behold, in the regular installed system, that file contained a random alphanumeric string, but in the live image creation environment, the file was empty! This sure looked suspicious.

I read through some references on the file, and found that it’s usually meant to get created by a call to /usr/bin/systemd-machine-id-setup in systemd package scripts. So I tried running systemd-machine-id-setup in the live image creation environment, and suddenly the file got a random string, and when I ran kernel-install again, the kernel file did get installed to /boot!

OK…So Why Isn’t There A Machine ID?

So now I’d moved along to a new problem again: why was /etc/machine-id empty when the kernel %post script ran, but if I tried to generate it again, it worked? Was the initial generation failing? Was it happening too late? Was it working, but the file getting touched by something else?

Again, I looked at quite a lot of code to try and figure it out – there’s code that at least could touch /etc/machine-id in anaconda, for instance, and in lorax – but in the end I decided to go with practical experimentation again. So I did a custom scratch build of systemd to try and find out what happened when the %post script ran. I changed the command from this:

systemd-machine-id-setup &>/dev/null || :

to this:

systemd-machine-id-setup > /tmp/smids.log 2>&1
cat /etc/machine-id >> /tmp/smids.log

pulled that build into a side repository, edited the live kickstart to use that side repository, and re-ran the live image creation. And that hit paydirt in a big way, because in smids.log I saw this:

systemd-machine-id-setup: error while loading shared libraries: libssl.so.1.1: cannot open shared object file: No such file or directory

…and here was the ultimate solution to our mystery! The attempt to set the machine-id in systemd %post was failing because it needs libssl, but it obviously wasn’t present yet. libssl is part of openssl-libs, but the systemd spec did not specify that its %post script needs openssl-libs installed. What I surmise had happened was that up until 20181030, some other dependency in some other package happened to mean that dnf would always choose to install openssl-libs before installing systemd, so no-one had ever noticed this missing dependency…but on 20181101, some change to some package caused dnf to start installing systemd before openssl-libs, and suddenly, this problem showed up. So – as is very often the case – once I’d finally managed to nail down the problem, the fix was simple: we just add the missing dependency to systemd, so that openssl-libs will always be installed before systemd’s %post is run. With that fix, generation of /etc/machine-id will succeed again, and so the plugin script that installs the kernel file to /boot won’t bail early, and so there will be a kernel file in /boot, and lorax won’t fail when it tries to regenerate initrds because there aren’t any kernels present!
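In spec terms, the fix amounts to an ordering dependency – something like this fragment (schematic; the actual change landed in Fedora’s systemd spec file):

```spec
# Ensure openssl-libs is installed before systemd's %%post runs, since
# systemd-machine-id-setup links against libssl
Requires(post): openssl-libs
```

RPM’s Requires(post) is exactly for this case: it tells the dependency solver the named package must be present by the time this package’s %post script executes, not merely at some point in the same transaction.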

…and so ends another exciting episode of AdamW’s Debugging Adventures 🙂

PSA: System update fails when trying to remove rtkit-0.11-19.fc29

Posted by Kamil Páral on October 15, 2018 11:20 AM
Recently a bug in rtkit packaging has been fixed, but the update will fail on all Fedora 29 pre-release installations that have rtkit installed (Workstation has it for sure). The details and the workaround are described here:


Adam’s Debugging Adventures: The Immutable Mutable Object

Posted by Adam Williamson on June 27, 2018 11:03 PM

Here’s a puzzle for you, Python fans:

[adamw@adam dnf (master %)]$ python3
Python 3.6.5 (default, Apr 23 2018, 22:53:50) 
[GCC 8.0.1 20180410 (Red Hat 8.0.1-0.21)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from dnf.conf import MainConf
>>> mc = MainConf()
>>> print(mc.group_package_types)
['mandatory', 'default', 'conditional']
>>> mc.group_package_types.append('foo')
>>> print(mc.group_package_types)
['mandatory', 'default', 'conditional']

Note: if you want to try reproducing this…make sure you use DNF 3. It works as expected with DNF < 3. That’s why it just showed up as a problem.

Before I explain what’s going on there…let’s unpick the problem a bit for non-Pythonistas.

In Python (and in many other languages) some things – objects – are ‘mutable’, and some are ‘immutable’. A ‘mutable’ object can be changed. An ‘immutable’ object cannot.

In the Python session above, we create an object, mc, which is an instance of the class MainConf (don’t worry if you don’t entirely understand that, it’s not compulsory). We then examine one attribute of mc: mc.group_package_types. Python tells us that this is ['mandatory', 'default', 'conditional'] – which is a Python list containing the values 'mandatory', 'default' and 'conditional'.

In Python, lists are ‘mutable’. That means you can take a list and change it somehow – remove an item from it, add an item to it, re-order it – and it’s still the same object. Any existing reference to the object is still valid, and now refers to the changed list.

For comparison, an example of an ‘immutable’ object is a string. If you do this:

mystring = "here's some text"

you can’t then change the actual string object referenced by mystring there. It has no methods that let you change it in any way, and there are no functions that can operate on it and change it. You can do this:

mystring = "here's some text"
mystring = "here's some different text"
mystring = mystring.replace("some", "some more")

and at each step the contents of the string to which the name mystring refers are different – but also, at each step mystring refers to a DIFFERENT object. (That’s why the replace() string method returns a new string – it can’t mutate the existing string). So if you did this:

mystring = "here's some text"
otherref = mystring
mystring = "here's some different text"

then at the end, otherref still points to the first-created string object and its value is still "here's some text", while mystring points to a new string object and its value is "here's some different text". Let’s compare with a similar case using a mutable object:

mylist = [1, 2, 3]
otherref = mylist
mylist.append(4)
print(mylist)
print(otherref)

In this case, when we get to the end, both mylist and otherref are still pointing to the same object, the original object, and both prints will print [1, 2, 3, 4]. No new list object was created at any point after the initial creation of mylist.

So with that understood, take a look back at the original example, and maybe you can see why it’s so weird. By all appearances, it looks like we have a pretty simple scenario here: we have an object that has an attribute which is a list, and we just append a value to that list. Then we go look at its value again, and it…hasn’t changed at all? But we didn’t get any kind of crash, or error, or anything. Our append call returned apparently successfully. It just…didn’t seem to change anything. The list is an immutable mutable object!

This is a real problem in real code: it broke the most recent Fedora Rawhide compose. So, obviously, I was pretty keen to figure out what the hell was going on, here! It turns out that it’s down to dnf getting clever (arguably over-clever).

Python’s a very…flexible language. The key to the problem here turned out to be exactly what happens when we actually look up the group_package_types attribute of mc, the dnf.conf.MainConf instance.

Getting and setting attributes of objects in Python is usually a kind of core operation that you never really think about, you just expect it to work in the ‘usual way’. A simple approximation of how it works is that the object has a Python dict (like a ‘hash’ in some languages – a key/value store, more or less) whose keys are attribute names and whose values are the attribute values. When you ask for an attribute of an object, Python checks if its name is one of the keys in that object’s dict, and if it is, returns the value. If it’s not, Python raises an error. When you set an attribute, Python stuffs it into the dict.

But since Python’s flexible, it provides some mechanisms to let you mess around with this stuff, if you want to. You can define __setattr__, __getattr__ and/or __getattribute__ methods in a class, and they’ll affect this behaviour.

The base object class that almost all Python classes inherit from defines the default __setattr__ and __getattribute__, which work sort of like the approximation I gave above. If you override __setattr__ in a class, then when something tries to set an attribute for an instance of that class, that method will get called instead of the default object.__setattr__. If you override __getattribute__, then that method will get called instead of object.__getattribute__ when something tries to look up an attribute of an instance of that class.

If you leave __getattribute__ alone but define __getattr__, then when something tries to look up an attribute, first the stock object.__getattribute__ will be used to try and look it up, but if that doesn’t find it, rather than raising an exception immediately, Python will try your __getattr__ method.
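That fallback-only behaviour of __getattr__ is easy to demonstrate with a toy class:

```python
class Fallback:
    def __getattr__(self, name):
        # only reached when the normal lookup fails to find the attribute
        return "no such attribute: " + name

f = Fallback()
f.real = 1
print(f.real)     # 1 - found by object.__getattribute__, __getattr__ never runs
print(f.missing)  # no such attribute: missing
```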

We can actually override __setattr__ and __getattr__ to do a very simplified demonstration of how the default implementation usually works:


class TestClass(object):
    def __init__(self):
        self.__dict__["_attrs"] = {}

    def __setattr__(self, name, value):
        print("Hi, setattr here, just settin' attrs...")
        self._attrs[name] = value

    def __getattr__(self, name):
        print("Hi, getattr here, here's your attr!")
        return self._attrs[name]

tc = TestClass()
tc.foo = [1, 2, 3]
print(tc.foo)
tc.foo.append(4)
print(tc.foo)

Note that __dict__ is the store that object.__getattribute__ uses, so that’s why we set up our backing store with self.__dict__["_attrs"] = {} – that ensures that when we look up self._attrs, we will find it via __getattribute__. We can’t just do self._attrs = {} because then we wind up in an infinite recursion in our __getattr__.

If you save that and run it, you’ll see [1, 2, 3] then [1, 2, 3, 4] (plus the messages that prove our new methods are being used). Our mutable attribute is nice and properly mutable. We can append things to it and everything. Notice that when we append the value, we hit __getattr__ but not __setattr__.

So, how does this manage not to work with dnf config classes? (MainConf is a subclass of BaseConfig, and so are a ton of other config-related classes in dnf – we actually encountered this bug with another subclass, RepoConf). It turns out to be because dnf overrides BaseConfig.__setattr__ and BaseConfig.__getattr__ to do some “clever” stuff, and it breaks this!

We don’t need to go into what its __setattr__ does in detail, except to note one thing: it doesn’t store the values in the __dict__ store, so object.__getattribute__ can never find them. When looking up any attribute on an instance of one of these classes except _config (which is the store the class’ __setattr__ and __getattr__ methods themselves use, just like _attrs in our example, and is created directly in __dict__ in the same way), we always wind up in the class’s __getattr__.

Here’s the whole of current dnf’s BaseConfig.__getattr__:

def __getattr__(self, name):
    option = getattr(self._config, name, None)
    if option is None:
        return None
    try:
        value = option().getValue()
    except Exception as ex:
        return None
    if isinstance(value, cfg.VectorString):
        return list(value)
    if isinstance(value, str):
        return ucd(value)
    return value

There is some more stuff going on in the background here that we don’t need to worry about too much (a feature of DNF, I have found, is that it has layers upon layers. It contains multitudes. You usually can’t debug anything in DNF without going through at least eight levels of things calling other things that turn out to be yet other things that turn out to be written in C++ just cuz.) In the case of the group_package_types option, and also the option we were actually dealing with in the buggy case (the baseurl option for a repo), the option is basically a list-y type, so we wind up in the if isinstance(value, cfg.VectorString): branch here.

So if you follow it through, when we asked for mc.group_package_types, unlike in the default Python implementation or our simplified example, we didn’t just pull an existing list object out from some sekrit store in the mc object. No. We got some kind of object (fact fans: it’s a libdnf.conf.OptionStringList – libdnf is the C++ bit I mentioned earlier…) out of the self._config dict that’s acting as our sort-of attribute store, and ran its getValue method to get some other sort of object (fact fans: it’s a libdnf.conf.VectorString), then we ran list() on that object, and returned that.

The problem is that the thing that gets returned is basically a temporary copy of the ‘real’ backing value. It’s a mutable object – it really is a list! – and we can mutate it…but the next time anyone looks up the same attribute we looked up to get that list, they won’t get the same list object we got. This wacky __getattr__ will run through the same indirection maze and return a new listified copy of the backing value. Every time you look up the attribute, it does that. We can mutate the copies all we like, but doing that doesn’t actually change the backing value.

Now, it’s easy enough to work around this, once you know what’s going on. The overridden __setattr__, of course, actually does change the backing store. So if you explicitly try to set an attribute (rather than getting one and mutating it), that does work:

[adamw@adam dnf (master %)]$ python3
Python 3.6.5 (default, Apr 23 2018, 22:53:50) 
[GCC 8.0.1 20180410 (Red Hat 8.0.1-0.21)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from dnf.conf import MainConf
>>> mc = MainConf()
>>> print(mc.group_package_types)
['mandatory', 'default', 'conditional']
>>> mc.group_package_types = mc.group_package_types + ['foo']
>>> print(mc.group_package_types)
['mandatory', 'default', 'conditional', 'foo']

That time, it worked because we didn’t try to mutate our magical immutable mutable object. We just flat out replaced it with a new list. When we explicitly set the attribute like that, we hit the overridden __setattr__ method, which does the necessary magic to write the new value to the backing store.

But any regular Pythonista who sees that the instance attribute foo.bar is a mutable object is naturally going to assume that they can go ahead and mutate it. That’s just…standard. They aren’t going to figure they should ignore the fact that it’s a mutable object and just replace it every time they want to change it. That’s exactly what we did do in the code that got broken. That’s the exact code that used to work with DNF 2 but no longer does with DNF 3.

So, that took a few hours to figure out! But I got there in the end. I really just wanted to write up my latest ‘interesting’ debugging adventure! But if there’s a moral to this story…I guess it’s “think really hard about whether messing with core behaviour like this is the right way to go about implementing your special thing”?

Oh, and by the way: comments should be working again now! I just disabled the plugin that was interfering with them. So, you know, comment away.

QA: the glamorous bit

Posted by Adam Williamson on May 31, 2018 04:28 PM

Of course, we all know that working in QA is more or less a 24×7 whirl of red carpets and high-end cocktail parties…but today is particularly glamorous! Here’s what I’m doing right now:

  1. Build an RPM of a git snapshot of Plymouth
  2. Put it in a temporary repo
  3. Build an installer image containing it
  4. Boot the installer image in a VM, see if it reaches anaconda
  5. Repeat, more or less ad infinitum

I just can’t take the excitement!

Fedora 28, broken comments, QA goings on…

Posted by Adam Williamson on April 27, 2018 11:40 PM

Long time no blog, once more!

Fedora 28

So of course, the big news of the week is that Fedora 28 was signed off yesterday and will be coming out on 2018-05-01. If you examine the Fedora 28 schedule, you will observe that this was in fact the originally targeted date for the release. The earliest targeted date.

Yes. It’s a Fedora release. Coming out on time. That noise you hear is the approaching meteor that will wipe out all life on Earth. You’re welcome. 😉

We’ve always said the schedules for Fedora are really estimates and we don’t consider it a problem if there’s a week or two delay to fix up bugs, and that’s still the case. We may well wind up slipping again for F29. But hey, it’s nice to get it done “on time” just once. I did, in fact, check, and this really is the very first time a Fedora release has ever been entirely on time. Fedora 8 was close – it was only a day late, if you discount a very early draft schedule – but still, a day’s a day!

There are, as always, a few bugs I really wish we’d been able to fix for the release. But that’s pretty much always the case, and these are no worse than ones we’ve shipped before. We have to draw a line somewhere, for a distro that releases as often as Fedora. This should be another pretty solid release. My desktop and main laptop are running it already, and it’s pretty fine.

Comments: Yes, They’re Broken

Quick note for people who keep emailing me: yes, posting comments on this blog appears to be broken. No, I’m not particularly bothered. I actually have been meaning to convert this into an entirely static blog with no commenting for years, I just don’t want to deal with WordPress or any dynamic blog framework really any more. But I never have time to do it, as I want to include existing comments in the conversion, which isn’t straightforward. I’m gonna get it done one of these days, though.

openQA news: upgrade tests for updates, aarch64 testing…

I’ve been doing a lot of miscellaneous stuff I haven’t blogged about lately, but here’s one thing I’m pretty proud of: Simo and Rob from the FreeIPA team asked if it would be possible to test whether Fedora package updates broke FreeIPA upgrades, as Simo had noticed a case where upgrading a server to Fedora 27 didn’t work. We already had tests that test deploying a FreeIPA server and client on one Fedora release, then upgrading both to the next Fedora release and seeing if things still worked – but we weren’t running them on updates, we only ran them on nightly composes of Branched and Rawhide. So effectively we knew, all the way up until a given release came out, whether upgrading worked for it – but once it came out, we didn’t know if upgrading was suddenly broken by a later update.

These tests are some of the longest-running we have, so I was a bit worried about whether we’d have the resources to run them on updates, but I figured I’d go ahead and try it, and after a day or two of bashing was able to get it running in staging. After a week, staging seemed to be keeping up with the load, so I’ve pushed this out into production today. If you look at recent openQA update tests, like this one, you’ll see an updates-server-upgrade flavor with a couple of tests in it: these are testing that installing the previous Fedora release, deploying a FreeIPA server and client, then upgrading them to the release the update is for, with the update included, works OK. I’m quite happy with that! I may extend this basic mechanism to also run the Workstation upgrade test as well. Note that these tests don’t run for updates that are for the oldest current stable Fedora, as we don’t support upgrades from EOL releases (and openQA doesn’t keep the necessary base disk images for EOL releases around, so we actually couldn’t run the tests).

Aside from that, the biggest openQA news lately is that we got the staging instance testing on aarch64. Here’s the aarch64 tests for Fedora 28 Final, for instance. This isn’t perfect yet – there are several spurious failures each time the tests run. I think this is because the workers are kind of overloaded, they’re a bit short on RAM and especially on storage bandwidth (they each just have a single consumer-grade 7200RPM hard disk). I’m working with infra to try and improve that situation before we consider pushing this into production.

Other QA goings on

One thing that’s been quite pleasant for me lately is I’m no longer trying to do quite so much of…everything (and inevitably missing some things). Sumantro and coremodule have done a great job of taking over Test Day co-ordination and some other community-ish tasks, so I don’t have to worry about trying to keep up with those any more. Sumantro has been bringing a whole bundle of energy to organizing Test Days and onboarding events, so we’ve had lots more Test Days these last two cycles, and more people to take part in them, which is great. We’ve also had more folks taking part in validation testing. It’s made life a lot less stressful around here!

I’ve been mostly concentrating on co-ordinating things like release validation testing, doing a bit of mentoring for the newer team members, and keeping openQA ticking over. It’s nice to be able to focus a bit more.

Whitelisting rpmlint errors in Taskotron/Bodhi

Posted by Kamil Páral on March 05, 2018 01:14 PM

If you submit a new Fedora update into Bodhi, you’ll see an Automated Tests tab on that update page (an example), and one of the test results (once it’s done) will be from rpmlint. If you click on it, you’ll get a full log with rpmlint output.

If you wish to whitelist some errors which are not relevant for your package or are clearly a mistake (like spelling issues, etc), it is now possible. The steps for doing this are described at:


This has been often requested, so hopefully this will help you have the automated test results all green, instead of being bothered by invalid errors. If something doesn’t work, and it seems to be our bug in how we execute rpmlint (instead of a bug in rpmlint itself), please file a bug in task-rpmlint or contact us (qa-devel mailing list, #fedora-qa IRC channel on Freenode).

Linux kernel 4.13 and SMB protocol version fun

Posted by Adam Williamson on November 04, 2017 01:40 AM

There’s been a rather interesting change in the Linux kernel recently, which may affect you if you’re mounting network drives using SMB (the Windows native protocol, occasionally also called CIFS).

There have been several versions of the protocol – Wikipedia has a good writeup. Both servers and clients may support different versions; when accessing a shared resource, the client tells the server which protocol version it wants to use, and if the server supports that version then everyone’s happy and the access goes ahead; if the server doesn’t support that version, you get an error and no-one’s happy.

Up until kernel 4.13, the kernel’s default SMB protocol version was 1.0. So when you mount an SMB share, if you don’t explicitly specify a protocol version with the vers= mount option, with kernel 4.12 or earlier, SMB 1.0 will be used.

With kernel 4.13, the default protocol version is changed to 3.0. So now, when mounting SMB mounts that don’t explicitly specify a version, your system will request 3.0.

As I understand it, the main reason for this is security: SMB 3.0 is considerably more secure as a protocol than 1.0. Microsoft has been gradually trying to push Windows users towards later versions of the protocol over the last few releases.

Kernel 4.13 has been released as an update for Fedora 25 and Fedora 26, so users of those Fedora releases will hit this change when updating the kernel. Fedora 27 comes with kernel 4.13 out of the box.

Obviously, this comes with some compatibility consequences. If the server providing the share is running Windows 8 or later, you should be fine. However, in other cases, you may find your SMB mount suddenly fails after the kernel update. Older versions of Windows do not support SMB 3.0.

Samba added SMB 3.0 support in version 4.2, at least according to this page, so mounts provided by earlier Samba versions similarly will not work.

If your server is a NAS, it may or may not support SMB 3.0. My NAS is a Thecus N5550, so I know that for ThecusOS 5-based NASes, a firmware update added SMB 3.0 support. However, it’s not enabled by default; you have to log into the admin UI, go to Network Service, select Samba/CIFS, and set ‘SMB Max Protocol’ to 3. Note that with this update, the default SMB minimum version is set to 2, so the NAS will no longer support 1.0 – you can change the minimum version to ‘NT1’ if you have a client which cannot do 2 or 3, though.

If you know information about SMB protocol support for any other NAS brand or other common SMB server of any kind, please post a comment and I’ll add it to this post.

If you get caught out by this, the best solution is to somehow update the server end of your setup so that it supports SMB 3.0. However, if you can’t do that, you can use the vers mount option. Use the highest version that works – 2.x isn’t as good as 3.0, but better than 1.0. The available choices are documented in man mount.cifs; at present they are 1.0, 2.0, 2.1 and 3.0.
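
If you want to script that “use the highest version that works” advice, a fallback loop might look like this (just a sketch – the share and mount point are placeholders, and the mount line is commented out so it can be dry-run safely):

```shell
#!/bin/sh
# Sketch: try SMB protocol versions newest-first until one mounts.
# SHARE and MNT are placeholders; the real mount call is commented out
# so this can be dry-run - uncomment it on an actual system.
smb_vers_candidates() {
    # newest first, per the options listed in man mount.cifs
    printf '%s\n' 3.0 2.1 2.0 1.0
}

SHARE=//nas.example/share
MNT=/mnt/share
for v in $(smb_vers_candidates); do
    echo "would try: mount -t cifs -o vers=$v $SHARE $MNT"
    # sudo mount -t cifs -o "vers=$v" "$SHARE" "$MNT" && break
done
```

The loop stops at the first version the server accepts, which matches the advice above: prefer 3.0, fall back to 2.x, and only use 1.0 as a last resort.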

Automatically shrink your VM disk images when you delete files

Posted by Kamil Páral on October 06, 2017 04:02 PM

If you use VMs a lot, you know that with the most popular qcow2 disk format, the disk image starts small, but grows with every filesystem change happening inside the VM. Deleting files inside the VM doesn’t shrink it. Of course that wastes a lot of disk space on your host – the VMs often contain gigabytes of freed space inside the VM, but not on the host. Shrinking the VM images is possible, but tedious and slow. Well, recently I learned that’s actually not true anymore. You can use the TRIM command, used to signal to SSD drives that some space can be freed, to do the same in the virtualization stack – signal from the VM to the host that some space can be freed, and the disk image shrunk. How to do that? As usual, this is a shameless copy of instructions found elsewhere on the Internets. The instructions assume you’re using virt-manager or libvirt directly.

First, you need to be using qcow2 images, not raw images (you can configure this when adding new disks to your VM).

Second, you need to set your disk bus to SCSI (not VirtIO, which is the default).


Third, you need to set your SCSI Controller to VirtIO SCSI (not hypervisor default).


Fourth, you need to edit your VM configuration file using virsh edit vmname and adjust your hard drive’s driver line to include discard='unmap', e.g. like this:

<disk type='file' device='disk'>
 <driver name='qemu' type='qcow2' discard='unmap'/>
 ...
</disk>

And that’s it. Now you boot your VM and try to issue:

$ sudo fstrim -av
/boot: 319.8 MiB (335329280 bytes) trimmed
/: 101.5 GiB (108928946176 bytes) trimmed

You should see some output printed, even if it’s just 0 bytes trimmed, and not an error.

If you’re using LVM, you’ll also need to edit /etc/lvm/lvm.conf and set:

issue_discards = 1

Then it should work, after a reboot.

Now, if you want trimming to occur automatically in your VM, you have two options (I usually do both):

Enable the fstrim timer that trims the system once a week by default:

$ sudo systemctl enable fstrim.timer

And configure the root filesystem (and any other filesystem you’re interested in) to issue the discard command automatically after each file is deleted. Edit /etc/fstab and add the discard mount option, like this:

UUID=6d368798-f4c2-44f9-8334-6be3c64cc449 / ext4 defaults,discard 1 1

And that’s it. Try to create a big file using dd, watch your VM image grow. Then delete the file, watch the image shrink. Awesome. If only we had this by default.
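
If you want to see what “shrinking” means at the filesystem level, the apparent-size vs. allocated-size distinction shows up with any sparse file on the host (a generic sketch assuming GNU coreutils, not qcow2-specific):

```shell
# A sparse file has a large apparent size but little actual allocation -
# the same distinction that makes a trimmed qcow2 image small on disk.
# GNU coreutils assumed (stat -c, du -B1).
img=$(mktemp)
truncate -s 100M "$img"                  # 100 MiB apparent, ~0 allocated
apparent=$(stat -c %s "$img")
allocated=$(du -B1 "$img" | cut -f1)
echo "apparent=$apparent allocated=$allocated"
rm -f "$img"
```

With a qcow2 image you’d watch the allocated size (e.g. via du or qemu-img info) drop after fstrim runs in the guest, while the apparent size stays put.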

SSH to your VMs without knowing their IP address

Posted by Kamil Páral on October 06, 2017 03:31 PM

This is a shameless copy of this blog post, but I felt like I need to put it here as well, so that I can find it the next time I need it 🙂

libvirt approach

When you run a lot of VMs, especially for testing, every time with a fresh operating system, connecting to them is a pain, because you always need to figure out their IP address first. Turns out that is no longer true. I simply added this snippet to my ~/.ssh/config:

# https://penguindroppings.wordpress.com/2017/09/20/easy-ssh-into-libvirt-vms-and-lxd-containers/
# NOTE: doesn't work with uppercase VM names
Host *.vm
 CheckHostIP no
 Compression no
 UserKnownHostsFile /dev/null
 StrictHostKeyChecking no
 ProxyCommand nc $(virsh domifaddr $(echo %h | sed "s/\.vm//g") | awk -F'[ /]+' '{if (NR>2 && $5) print $5}') %p

and now I can simply execute ssh test.vm for a VM named test and I’m connected! A huge time saver. It doesn’t work with uppercase letters in VM names and I didn’t bother to try to fix that. Also, since I run VMs just for testing purposes, I disabled all ssh security checks (you should not do that for important machines).
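
If you’re curious what that ProxyCommand’s awk pipeline actually extracts, you can feed it canned virsh domifaddr output (the MAC and IP here are made up; note the leading space on each output line, which is why the address ends up in field 5):

```shell
# Canned 'virsh domifaddr' output (values made up). The leading space on
# each line means field 1 is empty with FS='[ /]+', so the IP is $5 and
# the /24 prefix length becomes $6 and is discarded.
sample=' Name       MAC address          Protocol     Address
-------------------------------------------------------------
 vnet0      52:54:00:aa:bb:cc    ipv4         192.168.122.45/24'

ip=$(echo "$sample" | awk -F'[ /]+' '{if (NR>2 && $5) print $5}')
echo "$ip"   # 192.168.122.45
```

NR>2 skips the header and separator lines, so only the actual interface rows are considered.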

avahi approach

There’s also a second approach I used for persistent VMs (those that survive for longer than a single install&reboot cycle). You can use Avahi to search for a hostname on the .local domain to find the IP address. Fedora has this enabled by default (if you have nss-mdns package installed, I believe, which should be by default). So, in the VM, set a custom hostname, for example f27:

$ sudo hostnamectl set-hostname f27
$ reboot

Now, you can run ssh f27.local and it should connect you to the VM automatically.

Flock 2017: trip report

Posted by Adam Williamson on September 16, 2017 03:56 AM

Better late than never, here’s my report from Flock 2017!

Thanks to my excellent foresight in the areas of ‘being white’ and ‘being Canadian’ I had no particular trouble getting through security / immigration, which was nice. The venue was kinda interesting – the whole town had this very specific flavor that seems to be shared among slightly second-class seaside towns the world over. Blackpool, White Rock or Hyannis, there’s something about them all…but the rooms were fairly clean, the hot water worked, the power worked, and the wifi worked fairly well for a conference, so all the important stuff was OK. Hyannis seriously needs to discover the crosswalk, though – I nearly got killed four times travelling about 100 yards from the hotel to a Subway right across the street and back. Unfortunately the ‘street’ was a large rotary with exactly zero accommodations for pedestrians…

Attendance seemed a bit thinner than usual, and quite heavily Red Hat-y; I’ve heard different reasons for this, from budget issues to Trump-related visa / immigration issues. It was a shame. There were definitely still enough people to make the event worthwhile, but it felt like some groups who would normally be there just weren’t.

From the QA team we had myself, Tim Flink, Sumantro Mukherjee and Lukas Brabec. We got some in-person planning / discussion done, of course, and had a team dinner. It was particularly nice to be in the same place as Sumantro for a while, as usually our time zones are awful, he gets to the office right when I’m going to bed – so we were able to talk over a lot of stuff and agree on quite a list of future projects.

The talks, as usual, were generally very practical, focused and useful – one of the nicest things about Flock is it’s a very low-BS conference. I managed to do some catch-up on modularity plans and status by following the track of modularity talks on Thursday. Aside from that, some of the talks I saw included the Hubs status update, Stef’s dist-git tests talk, the Greenwave session, the Bodhi hackfest, Sumantro’s kernel testing session, and a few others.

I gave a talk on how packagers can work with our automated test systems. As always seems to be the case I got scheduled very early in the conference, and again as always seems to be the case, I wound up writing my talk about an hour before giving it. Which was especially fun because while I still had about ten slides to write, my laptop starting suffering from a rather odd firmware bug which caused it to get stuck at the lowest possible CPU speed. Pro tip: LibreOffice does not like running at 400MHz. So I wasn’t entirely as prepared as I could have been, but I think it went OK. I had the usual thing where, once I reached the end of the talk, I realized how I should have started it, but never mind. If I ever get to give the talk again, I’ll tweak it. As a footnote, Peter Jones – being Peter Jones – naturally had all the tools and the know-how necessary to take my laptop apart and disconnect the battery, which turned out to be the only possible way to clear the CPU-throttling firmware state, so thanks very much to him for that!

As usual, though, the most productive thing about the conference was just being in the same place at the same time as lots of the folks who really make stuff happen in Fedora, and being able to work on things in real time, make plans, and pick brains. So I spent quite a lot of time bouncing around between Kevin Fenzi, Dennis Gilmore, and Peter Jones, trying to fix up Fedora 27 and Rawhide composes; we got an awful lot of bugs solved during the week. I got to talk to Ralph Bean, Pingou, Randy Barlow, Pengfei Jia, Dan Callaghan, Ryan Lerch, Jeremy Cline and various others about Bodhi, Pagure, Greenwave and various other key bits of current and future infrastructure; this was very useful in planning how we’re going to move forward with compose gating and a few other things. In the kernel testing session, Sumantro, Laura Abbott and myself came up with a plan to run regular Test Days around kernel rebases for stable releases, which should help reduce the amount of issues caused by those rebases.

We started working on a ‘rerun test’ button for automated tests in Bodhi during the Bodhi hackfest; this is still a work in progress but it’s going in interesting directions.

PSA: If you had dnf-automatic enabled and updated to Fedora 26, it probably stopped working

Posted by Adam Williamson on September 15, 2017 02:34 AM

So the other day I noticed this rather unfortunate bug on one of my servers.

Fedora 26 included a jump from DNF 1.x to DNF 2.x. It seems that DNF 2.x came with a poorly-documented change to the implementation of dnf-automatic, the tool it provides for automatically notifying of, downloading and/or installing updates.

Simply put: if you had enabled dnf-automatic in Fedora 25 or earlier, using the standard mechanism it provided – edit /etc/dnf/automatic.conf to configure the behaviour you want, and run systemctl enable dnf-automatic.timer – and then upgraded to Fedora 26, it probably just stopped working entirely. If you were relying on it to install updates for you…it probably hasn’t been. You can read the full details on why this is the case in the bug report.

We’ve now fixed this by sending out an update to dnf which should restore compatibility with the DNF 1.x implementation of dnf-automatic, by restoring dnf-automatic.service and dnf-automatic.timer (which function just as they did before) while preserving the new mechanisms introduced in DNF 2.x (the function-specific timers and services). But of course, you’ll have to install this update manually on any systems which need it. So if you do have any F26 systems where you’re expecting dnf-automatic to work…you probably want to log into them and run ‘dnf update’ manually to get the fixed dnf.

PSA ends!

A modest proposal

Posted by Adam Williamson on September 07, 2017 06:01 PM
                                                       PROPOSED STANDARD
                                                            Errata Exist

Internet Engineering Task Force (IETF)                   Adam Williamson
Request for Comments: 9999                                       Red Hat
Updates: 7159                                             September 2017
Category: Standards Track
ISSN: 9999-9999

     Let Me Put a Fucking Comma There, Goddamnit, JSON


   Seriously, JSON, for the love of all that is fucking holy, let me
   end a series of items with a fucking comma.

Fedora 26 Upgrade Test Day tomorrow (2017-06-30)!

Posted by Adam Williamson on June 29, 2017 09:07 PM

It’s that time again: we have another test day coming up! Tomorrow (Friday 2017-06-30) will be Fedora 26 Upgrade Test Day. As the name might suggest, we’ll be testing upgrades to Fedora 26. It’d be great to have coverage of as many configurations and architectures as possible, so please, if you have a bit of spare time and some kind of environment to which you can install Fedora 24 or 25 and test upgrading to Fedora 26, come out and help test!

The Test Day page contains all the instructions you need to run the tests and send along your results. As always, the event is in #fedora-test-day on Freenode IRC. If you don’t know how to use IRC, you can read these instructions, or just use WebIRC.

Taskotron: depcheck task replaced by rpmdeplint

Posted by Kamil Páral on June 22, 2017 10:27 AM

If you are a Fedora packager, you might be interested to know that in Taskotron we replaced the depcheck task with the rpmdeplint task. So if there are any dependency issues with a new update you submit to Bodhi, you’ll see that as a dist.rpmdeplint failure (in the Automated Tests tab). The failure logs should look very similar to the depcheck ones (basically, the logs contain the errors dnf would spit out if it tried to install that package), so there should be no transitioning effort needed.

If you listen for depcheck results somehow, e.g. in FMN, make sure to update your rules to listen for dist.rpmdeplint instead. We have updated the default filters in FMN, so if you haven’t changed them, you should receive notifications for failures in rpmdeplint (and also upgradepath and abicheck) for submitted updates owned by you.

The reason for this switch is that we wanted to get rid of custom dependency checking (done directly on top of libsolv), and use an existing tool for that instead. That saves us time, we don’t need to study all the pitfalls of dependency resolution, and we benefit from someone else maintaining and developing the tool (that doesn’t mean we won’t send patches if needed). rpmdeplint offered exactly what we were looking for.

We will decommission the depcheck task from Taskotron execution in the next few days, if there are no issues. Rpmdeplint results are already being published for all proposed updates.

If you have any questions, please ask in comments or reach us at #fedora-qa freenode irc channel or qa-devel (or test or devel) mailing list.

PSA: RPM database issues after update to libdb-5.3.28-21 on Fedora 24 and Fedora 25

Posted by Adam Williamson on June 09, 2017 09:12 PM

Hi there, folks!

This is an important PSA for Fedora 24, 25 and 26 (pre-release) users. tl;dr version: if you recently updated and got some kind of error or crash and now you’re getting RPM database errors, you need to do the old reliable RPM database fix dance:

# rm -f /var/lib/rpm/__db*
# rpm --rebuilddb

and all should be well again. We do apologize for this.

Longer version: there’s a rather subtle and tricky bug in libdb (the database that RPM uses) which has been causing problems with upgrades from Fedora 24/25 to Fedora 26. The developers have made a few attempts to fix this, and testing this week had indicated that the most recent attempt – libdb-5.3.28-21 – was working well. We believed the fix needed to be applied both on the ‘from’ and the ‘to’ end of any affected transaction, so we went ahead and sent the -21 update out to Fedora 24, 25 and 26.

Unfortunately it now seems like -21 may still have bugs that were not found in the testing; in the last few hours several people have reported that they hit some kind of crash during an update involving libdb -21, and subsequently there was a problem with their RPM database.

While we investigate and figure out what to do about fixing this properly, in the short term, if you’re affected, just doing the old “rebuild the RPM database” trick seems to resolve the problem:

# rm -f /var/lib/rpm/__db*
# rpm --rebuilddb

EDIT: Update 2017-06-13: We briefly sent a -22 build to updates-testing for 24, 25 and 26 with the fixes reverted. It turns out that updating from -21 to -22 can, again, cause the same kinds of problem, which can be resolved in the same way. We’ve removed the -22 update now and are sticking with -21, and will just be advising affected people to rebuild their databases if they hit database issues. Note that if you updated from -21 to -22 while it was in updates-testing, that update may also have caused database issues, which can be resolved in the same way; and when you update from -22 to -23 or later, the same may happen again.

It’s unfortunate that we have to break that one out of cold storage (I hadn’t had to do it for so long I’d almost forgotten it…), but it should at least get you back up and working for now.

We do apologize sincerely for this mess, and we’ll try and do all we can to fix it up ASAP.

Kernel Performance Testing on ARM

Posted by sumantro on June 06, 2017 12:54 AM
This post talks about how you can do kernel regression, stress, and performance testing on the ARM architecture.


To set up your ARM device, you need an image to get started. I was intending to test the latest compose (Fedora 26 Beta 1.4 on a Raspberry Pi 3 Model B). Download the file (Workstation raw-xz for armhfp) or any variant that you want to test.

Once the file is downloaded, all you need to do is get an SD card and write the image to it.

There are two ways of doing it: using Fedora Media Writer, which can now write images for ARM devices, or the old dd approach.
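
If you go the dd route, the invocation is roughly the following (a sketch – the image filename and /dev/sdX are placeholders, and the command is only printed here because dd will happily overwrite the wrong device):

```shell
# Placeholders: IMG is whatever raw-xz file you downloaded, DEV is the SD
# card device (check with lsblk first - dd overwrites it without asking).
IMG=Fedora-Workstation-armhfp-26_Beta-1.4-sda.raw.xz
DEV=/dev/sdX
CMD="xzcat $IMG | sudo dd of=$DEV bs=4M status=progress conv=fsync"
echo "$CMD"   # copy-paste the printed command once DEV is verified
```

Double-checking the device name with lsblk before running the printed command is the important part; everything else is boilerplate.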

Once dd has finished successfully, it’s time to plug the SD card into the ARM device and boot it up. Once the device is booted, all you need to do is clone the kernel test suite from here.

Dependencies and Execution:

You will need two packages, which you can install by executing "sudo dnf install fedora-python gcc"

Executing test cases:

Each test should be contained in a unique directory within the appropriate top level. The directory must contain an executable 'runtest.sh' which will drive the specific test. There is no guarantee on the order of execution. Each test should be fully independent, and have no dependency on other tests. The top level directories are reflective of how the master test suite is called. Each option is a super-set of the options before it. At this time we have:
  • minimal: This directory should include small, fast, and important tests which should be run on every system.
  • default: This directory will include most tests which are not destructive, or particularly long to run. When a user runs with no flags, all tests in both default and minimal will be run.
  • stress: This directory will include longer running and more resource intensive tests which a user might not want to run in the common case due to time or resource constraints.
  • destructive: This directory contains tests which have a higher probability of causing harm to a system even in the pass case. This would include things like potential for data loss.
  • performance: This directory contains longer running performance tests. These tests should typically be the only load on a system to get an accurate result.

After executing:
$ sudo ./runtests.sh -t performance
Each test is executed by the control script by calling runtest.sh. stdout and stderr are both redirected to the log. Any user running with default flags should see nothing but the name of the directory and pass/fail/skip. The runtest.sh should manage the full test run. This includes compiling any necessary source, checking for any specific dependencies, and skipping if they are not met. 
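
As an illustration of that contract, a minimal runtest.sh skeleton might look like this (hypothetical – the SKIP wording and exit codes here are my assumptions, not the suite's documented behaviour):

```shell
#!/bin/sh
# Hypothetical runtest.sh skeleton: check dependencies, skip if unmet,
# otherwise run the test body and report PASS or FAIL on stdout.
command -v cat >/dev/null 2>&1 || { echo "SKIP: cat missing"; exit 0; }

run_test_body() {
    # placeholder for compiling and running the real test
    return 0
}

if run_test_body; then
    echo PASS
else
    echo FAIL
    exit 1
fi
```

The dependency check at the top is what lets the control script run the whole directory tree blindly: tests that can't run on a given system skip themselves instead of failing.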

View the log file with cat <file path>; it will give the device information and the test result.

The test is complete and the result is "PASS" in this case.

Near the end of the file you will find the data and values from the tests.

[Retrospection] Fedora QA Global Onboarding Call 2017-06-03

Posted by sumantro on June 05, 2017 07:11 AM


We had a Fedora QA onboarding call on 2017-06-03, and it was successful. The agenda and the feedback can be found on this etherpad. People from different countries and regions found the call useful.

A few changes which made things better:

1. Using bluejeans was smooth and better than Hangouts.
2. Starting the doodle 2 weeks before the call and giving enough time to vote.
3. Using a bunch of quick links as reference points and then explaining.

Action Items

1. Consistency is the key to success: doing the onboarding call every 2 months will be more engaging. It also gives new contributors a sense of assurance that they can simply join one of these calls and start from there; even if they miss one, they will still be able to contribute during the release cycle. The proposal is to create a wiki page and link it from Fedora QA Join, where people will benefit from it.

2. Feedback, FAQs, quick links, logs and recordings should be kept on a wiki page, which will constantly tell us where we need improvement and answer a few general questions for new contributors.

Proposed Timeline

After Branched

This is the time we generally plan the test days and start off with pre-alpha testing. An onboarding call during this time will help us gather community ideas to drive test day planning, and if someone wants to run a test day we can help them plan accordingly.

Before Beta

After the alpha, we are mostly in a phase where test days are happening, blocker bugs are being filed, and a lot of test coverage (release validation) needs to happen. Having an onboarding call at this time will help new contributors work on something specific that is aligned with our goals. It will also be a chance to get more people participating and helping us test the ISO and features during test days.

Before Final
This is a good time because we are done with most of the Change-related test days, and we can conduct a few more – namely the system upgrade test day and the kernel test day. A call here will help us test on off-the-shelf hardware and ensure that a whole range of hardware is covered. This is also the time when we need the most validation across the architectures, so it will help us keep contributors engaged.

Fedora QA Onboarding Call 2017-06-03 1400-1500 UTC

Posted by sumantro on June 02, 2017 05:50 PM

There is going to be a Fedora QA onboarding call on 2017-06-03, 1400-1500 UTC, over Bluejeans. While release validation for Fedora Beta 1.3 is underway, this is a good time to get started. In this call we will talk about how you, as a contributor, can get started with Fedora QA. The agenda can be found on this etherpad.

Hope to see you all!

LinuxFest Northwest report

Posted by Adam Williamson on May 10, 2017 02:17 AM

EDIT: recording link added!

Hi folks!

This weekend was LinuxFest Northwest 2017, and as usual I was down in Bellingham to attend it. Had a good time, again as usual. Luckily I got to do my talk first thing and get it out of the way. Here’s a recording, and here’s the slide deck. It was a general talk on Fedora’s past, present and future.

I saw several other good talks, including Bryan Lunduke‘s ‘Lunduke Hour Live’ featuring a great discussion with John Sullivan of the Free Software Foundation. I also saw the openSUSE 101 talk he did with James Mason – it was quite interesting to compare and contrast the openSUSE organization with Fedora’s. Together with James and an Ubuntu developer, I formed a heckler’s row at Kevin Burkeland’s Linux 102 talk on choosing a distribution; it was actually a great talk that was pretty well thought-through and had nice things to say about Fedora and openSUSE, so our heckling was sadly pre-empted.

I spent a few hours working on the booth too, but as usual the Jeffs Sandys and Fitzmaurice were the real booth heroes, so thanks once more to them.

The trivia event on Saturday night was pretty fun (and our team, The Unholy Alliance (of SUSE and Fedora folks) won with only minor cheating!). My now-traditional Sunday afternoon board gaming with Jakob Perry and co. was also fun (and I managed not to come last…)

Got to chat with Jesse Keating, Brian Lane, Laura Abbott (briefly – hope your voice is recovered by now!) and many other fine folks too. It was also really nice to hear from a whole bunch of different people that they tried out a recent Fedora release and really liked it – almost feels like we’re doing something right!

If I promised you something at the conference and I don’t get in touch by the end of this week, please do give me a poke and remind me, I probably forgot…

Test Day DNF 2.0

Posted by sumantro on May 08, 2017 11:02 AM

Tuesday, 2017-05-09, is the DNF 2.0 Test Day! As part of this planned Change for Fedora 26, we need your help to test DNF 2.0!

Why test DNF 2.0?

DNF-2 is the upstream DNF version, the only version actively developed. Currently upstream contains many user-requested features, increased compatibility with yum, and over 30 bug fixes. Backporting patches from upstream to DNF-1 is difficult, so only critical security and usability fixes will be cherry-picked into Fedora.

With DNF 2.0 in place, users will notice usability improvements like better messages during resolution errors, showing whether a package was installed as a weak dependency, better handling of obsolete packages, fewer tracebacks, etc. One command-line option and one configuration option changed semantics, so DNF could behave differently in some cases (these changes are compatible with yum but incompatible with DNF-1). We hope to see whether it’s working well enough and catch any remaining issues.

We need your help!

All the instructions are on the wiki page, so please read through and come help us test! As always, the event will be in #fedora-test-day on Freenode IRC.

Automated *non*-critical path update functional testing for Fedora

Posted by Adam Williamson on April 28, 2017 11:06 PM

Yep, this here is a sequel to my most recent best-seller, Automated critical path update functional testing for Fedora 🙂

When I first thought about running update tests with openQA, I wasn’t actually thinking about testing critical path packages. I just made that the first implementation because it was easy. But I first thought about doing it when we added the FreeIPA tests to openQA – it seemed pretty obvious that it’d be handy to run the same tests on FreeIPA-related updates as well as running them on the nightly development release composes. So all along, I was planning to come up with a way to do that too.

Funnily enough, right after I push out the critpath update testing stuff, a FreeIPA-related update that broke FreeIPA showed up, and Stephen Gallagher poked me on IRC and said “hey, it sure would be nice if we could run the openQA tests on FreeIPA-related updates!”, so I said “funny you should ask…”

I bumped the topic up my todo list a bit, and wrote it that afternoon, and now it’s deployed in production. For now, it’s pretty simple: we just have a hand-written list of packages that we want to run some of the update tests for, whenever an update shows up with one of those packages in it. Simple enough, but it works: whenever an update containing one of those packages is submitted or edited, the server update tests (including the FreeIPA tests) will get run, and the results will be visible in Bodhi.

Here’s a run on the staging instance that was triggered using the new code; since I sent it to the production instance no relevant updates have been submitted or edited, but it should work just the same there. So from now on whenever our FreeIPA-ish overlords submit an update, we’ll get an idea of whether it breaks everything right away.

We can extend this system to other packages, but I couldn’t think of any (besides postgresql, which I threw in there) which would really benefit from the current update tests but aren’t already in the critical path (all the important bits of GNOME are in the critical path, for example, so all the desktop update tests get run on all GNOME updates already). If you can think of any, go ahead and let us know.

Automated critical path update functional testing for Fedora

Posted by Adam Williamson on April 25, 2017 01:08 AM

A little thing I’ve been working on lately finally went live today…this thing:

openQA test results in Bodhi

Several weeks ago now, I adapted Fedora’s openQA to run an appropriate subset of tests on critical path updates. We originally set up our openQA deployment strictly to run tests at the distribution compose level, but I noticed that most of the post-install tests would actually also be quite useful things to test for critical path updates, too.

First, I set up a slightly different openQA workflow that starts from an existing disk image of a clean installed system, downloads the packages from a given update, sets up a local repository containing the packages, and runs dnf -y update before going ahead with the main part of the test.
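Conceptually, the ‘local repository’ step boils down to dropping in a repo file that points at the downloaded update packages before updating – something like this sketch (the repo name and path here are invented for illustration, not the actual openQA test code):

```ini
# Hypothetical /etc/yum.repos.d/advisory.repo written by the test
[advisory]
name=Packages from the update under test
baseurl=file:///opt/update_repo
enabled=1
gpgcheck=0
```

With that in place, dnf -y update pulls in the update’s packages before the main part of the test runs.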

Then, I adapted our openQA scheduler to trigger this workflow whenever a critical path update is submitted or edited, and forward the results to ResultsDB.

All of this went into production a few weeks ago, and the tests have been run on every critical path update since then. But there was a big piece missing: making the information easily available to the update submitter (and anyone else interested). So I wanted to make the results visible in Bodhi, alongside the Taskotron results. So I sent a patch for Bodhi, and the new Bodhi release with that change included was deployed to production today.

The last two Bodhi releases actually make some other great improvements to the display of automated test results, thanks to Ryan Lerch and Randy Barlow. The results are actually retrieved from ResultsDB by client-side Javascript every time someone views an update. Previously, this was done quite inefficiently and the results were shown at the top of the main update page, which meant they would show up piecemeal for several seconds after the page had mostly loaded, which was rather annoying especially for large updates.

Now the results are retrieved in a much more efficient manner and shown on a separate tab, where a count of the results is displayed once they’ve all been retrieved.

So with Bodhi 2.6, you should have a much more pleasant experience viewing automated test results in Bodhi’s web UI – and for critical path updates, you’ll now see results from openQA functional testing as well as Taskotron tests!

At present, the tests openQA runs fall into three main groups:

  1. Some simple ‘base’ tests, which check that SELinux is enabled, service manipulation (enabling, disabling, starting and stopping services) works, no default-enabled services fail to start, and updating the system with dnf works.

  2. Some desktop tests (currently run only on GNOME): launching and using a graphical terminal works, launching Firefox and doing some basic tests in it works, and updating the system with the graphical updater (GNOME Software in GNOME’s case) works.

  3. Some server tests: checking that the firewall is configured and working as expected, that Cockpit is enabled by default and basically works, and both server and client tests for the database server (PostgreSQL) and domain controller (FreeIPA) server roles.

So if any of these fail for a critical path update, you should be able to see it. You can click any of the results to see the openQA webUI view of the test.

At present you cannot request a re-run of a single test. We’re thinking about mechanisms for allowing this at present. You can cause the entire set of openQA tests to be run again by editing the update: you don’t have to add or remove any builds, any kind of edit (just change a character in the description) will do.

If you need help interpreting any openQA test results, please ask on the test@ mailing list or drop by #fedora-qa. garretraziel or I should be available there most of the time.

Please do send along any thoughts, questions, suggestions or complaints to test@ or as a comment on this blog post. We’ll certainly be looking to extend and improve this system in future!

Fedora Media Writer Test Day (Fedora 26 edition) on Thursday 2017-04-20!

Posted by Adam Williamson on April 18, 2017 11:43 PM

It’s that time again: we have another test day coming up this week! Thursday 2017-04-20 will be another Fedora Media Writer Test Day. We’ve run these during the Fedora 24 and 25 cycles as well, but we want to make sure the tool is ready for the Fedora 26 release, and also test a major new feature it has this time around: support for writing ARM images. So please, if you have a bit of spare time and a system to test booting on – especially a Fedora-supported ARM device – come out and help us test!

The Test Day page contains all the instructions you need to run the tests and send along your results. As always, the event is in #fedora-test-day on Freenode IRC. If you don’t know how to use IRC, you can read these instructions, or just use WebIRC.

[Fedora Classroom] Fedora QA 101 and 102

Posted by sumantro on April 17, 2017 09:39 PM
This post sums up what we will be covering as part of the Fedora QA classroom this season. The idea is to learn how to do things the right way and to grow the contributor base.

The topics covered will be:
1. Working with Fedora
2. Installing on VM(s)
3. Configuring and Installing Fedora
4. Fedora Release Cycle
5. Live boot and Fedora Media Writer
6. Setting up accounts
7. Types of testing
8. Finding Test Cases
9. Writing Test Cases for Packages
10. Github
11. Bugzilla
12. Release Validation Testing
13. Update Testing
14. Manual Testing
14.1 Release validation
14.2 Update Testing

The 102 session will cover automated testing, and how to host your own test days during the release cycle.

To make the workflow smooth, we have made a book which will act as a reference even after the classrooms are over.


Fedora Media Writer Test Day 2017-04-20

Posted by sumantro on April 16, 2017 08:11 AM
Fedora Media Writer is a very handy tool to create live USB media, and became the primary downloadable in Fedora 25. We ran a test day installment to check it on the 3 major OSes: Windows, macOS and Fedora. That test day focused on writing Fedora images (Workstation/Server/spins) to a flash drive.

This installment of the test day will focus on out-of-the-box support for the ARMv7 architecture, in addition to 64-bit and 32-bit Intel. Testers can download an image of their choice, verify it by checksum, and boot it on KVM and, of course, bare metal.

We will be calling this test day on 2017-04-20. Grab a blank SD card or USB stick; with a good internet connection, it will take roughly 30 minutes to complete the test case.

Details will be published in Fedora community blog and @test-announce list. 

The wiki page says it all https://fedoraproject.org/wiki/Test_Day:2017-04-20_Fedora_Media_Writer

[Test Day Report]Anaconda BlivetGUI test day

Posted by sumantro on April 10, 2017 07:26 AM
Hey Testers,

I just wanted to share the test day report for the Anaconda Blivet-GUI Test Day. It was a huge success, and we had about 28 testers (many new faces).

Testers: 28

Bugs filed: 12

Blog of the test day : https://communityblog.fedoraproject.org/anaconda-blivetgui-test-day/

I would like to thank each and every tester and the change owners for helping us test this crucial feature!

If you are one of those people who couldn't make it to the test day, you can go ahead and grab a copy of Fedora 26 Alpha 1.7 and start installing Fedora using the Blivet GUI. If something breaks, make sure to file a bug against blivet-gui.

On behalf of Fedora QA team

Fedora 26 Alpha released, and blivet-gui Anaconda Test Day on Thursday (2017-04-06)

Posted by Adam Williamson on April 05, 2017 01:57 AM

Hi again folks! Two bits of Fedora 26 news today. First off, Fedora 26 Alpha has been released! It got delayed by a couple of weeks due to rather a grab-bag of issues – mainly problems with FreeIPA and several kernel bugs – but the delays did at least mean we wound up with a really pretty solid build, according to our testing so far. Please do grab the Alpha, play around with it, and see how it works for you. Remember to read the Common Bugs page, though I’m still working on it at the moment.

Secondly, we have another Test Day coming up this Thursday, 2017-04-06! Anaconda blivet-gui Test Day will be a pretty big one. In Fedora 26, an additional partitioning interface is added to Anaconda (the Fedora installer). As well as anaconda’s own custom partitioning interface, there is now a choice to run the blivet-gui partitioning tool from anaconda. This tool is built on the same backend as anaconda itself, but provides an alternative user interface. It’s been available as a standalone tool since Fedora 21, but Fedora 26 is the first time it can be run from the installer to do install-time partitioning. The Test Day will be all about testing out this new feature and making sure it integrates properly with anaconda and works properly in various situations. Please do come along and help out if you have time!

The Test Day page contains all the instructions you need to run the tests and send along your results. As always, the event is in #fedora-test-day on Freenode IRC. If you don’t know how to use IRC, you can read these instructions, or just use WebIRC.

[Test Day Annoucement] Anaconda Blivet GUI

Posted by sumantro on April 04, 2017 09:44 AM

Thursday 2017-04-06 will be Anaconda Blivet-GUI Test Day!
blivet-gui integration in Anaconda is a planned Change for Fedora 26, so this is an important Test Day!

We'll be testing the new detailed, bottom-up configuration screen, which has been long requested by users; the inclusion of blivet-gui in Anaconda finally makes it a reality. It simply adds a new option without changing the existing advanced storage configuration, so users who prefer the top-down configuration can still use it. The goal is to see whether it's working well enough, and to catch any remaining issues.
It's also pretty easy to join in: all you'll need is Alpha 1.7 (which you can grab from the wiki page).
Anaconda grew a rather important new option in F26: as well as the two existing partitioning choices (automatic, and anaconda's own custom partitioning interface), there's now a *third* choice. You can do custom partitioning with blivet-gui run within anaconda, as well as using anaconda's own interface (because there just weren't enough ways for custom partitioning to go wrong already). So we'll have a test day for using that interface, to try and shake out whatever problems it inevitably has. As always, the event will be in #fedora-test-day on Freenode IRC.

Fedora 26 crypto policy Test Day today (2017-03-30)!

Posted by Adam Williamson on March 30, 2017 05:39 PM

Sorry for the short notice, folks! Today is Fedora 26 crypto policy Test Day. This event is intended to test the Fedora 26 changes and updates to the ongoing crypto policy feature which intends to provide a centralized and unified configuration system for the various cryptography libraries commonly used in Fedora.

The Test Day page contains all the instructions you need to run the tests and send along your results. As always, the event is in #fedora-test-day on Freenode IRC. If you don’t know how to use IRC, you can read these instructions, or just use WebIRC.

Fedora Activity Day, Bangalore 2017

Posted by Kanika Murarka on March 13, 2017 01:53 PM

The Fedora Activity Day (FAD) is a regional event (either one-day or a multi-day) that allows Fedora contributors to gather together in order to work on specific tasks related to the Fedora Project.

On February 25th ’17, a FAD was conducted at one of the most admired universities in Bangalore, University Visvesvaraya College of Engineering (UVCE). It was not a typical “hackathon” or “DocSprint”, but a series of productive and interactive sessions on different tools.

The goal of this FAD was to make students aware of Fedora so that they can test, develop and contribute. It was a one-day event, starting at 10:30 in the morning and concluding at 3 in the afternoon.

The first talk was on Ansible, an IT automation tool that can configure systems, deploy software, and orchestrate more advanced IT tasks such as continuous deployments or zero-downtime rolling updates. The session was run by Vipul Siddharth and Prakash Mishra, who are Fedora contributors. They discussed the importance of such automation tools and gave a small demo of getting started with Ansible.

The Ansible session was followed by a session on contributing to the Linux kernel, given by our esteemed guest Vaishali Thakkar (@kernel_girl). Vaishali is a Linux kernel developer at Oracle, working in the kernel security engineering group, and is associated with open source internship programs and some community groups. She highlighted every aspect of the kernel one should know about before contributing, and discussed the lifecycle and the how, where and when of a patch submission. The session was 3 hours long, with a short lunch break: the first part focused on the theoretical aspects of sending your first patch to the kernel community, and the second part was a demo session where she sent a patch from scratch (Slides).

The last session, on pathways to contributing to Fedora, was given by Sumantro Mukherjee (Fedora Ambassador) and me, and ended with a short interactive session.

The speakers were presented with t-shirts as a token of appreciation. I would like to thank Sumantro Mukherjee, the Fedora community and the IEEE subchapter of UVCE for making the FAD possible.

Kernel testing made easy!

Posted by sumantro on February 21, 2017 11:37 PM
Hey folks! This is a sincere effort to help people who want to stay on top of the game and run a bleeding-edge kernel. The most important part is to check whether a new kernel version supports your system fine. If it does, that's awesome; if it doesn't, you might want to report it to the team with the proper failure logs, which will be helpful for future reference.

To get started, you need a bleeding-edge kernel. You can get the latest kernel from Bodhi.

Most new kernels arrive as updates, and hence you need to enable the updates-testing repo to install the kernel from it.
Once you are done testing, you can disable the repo again by executing "dnf config-manager --set-disabled updates-testing". While I'm writing this, the latest kernel in updates-testing for F25 was "kernel-4.9.10-200.fc25".

Once you are done installing, next comes the part of checking that all the vital features of your machine work smoothly. After a quick manual inspection, you can trigger the test suite, which will test the major parts for you.

First, you need to install some required packages, although many of you might already have them all.

Once you have the required packages, you need to clone the Pagure repo: "git clone https://pagure.io/kernel-tests.git".

After cloning, you can simply start the test suite: switch to the cloned folder and execute "cp config.example .config". Then open the .config file in vi or any text editor and set "submit=authenticated" and "username=<fas username>". Once you are done, run the tests by executing "sudo ./runtests.sh"; for performance testing, run "sudo ./runtests.sh -t performance". Both of these tests are most likely to pass; if they don't, you need to send/update the log and post karma on Bodhi so that people can note the regression.

For any changes, refer to:

Getting started with Pagure CI

Posted by Adam Williamson on February 17, 2017 06:50 AM

I spent a few hours today setting up a couple of the projects I look after, fedfind and resultsdb_conventions, to use Pagure CI. It was surprisingly easy! Many thanks to Pingou and Lubomir for working on this, and of course Kevin for helping me out with the Jenkins side.

You really do just have to request a Jenkins project and then follow the instructions. I followed the step-by-step, submitted a pull request, and everything worked first time. So the interesting part for me was figuring out exactly what to run in the Jenkins job.

The instructions get you to the point where you’re in a checkout of the git repository with the pull request applied, and then you get to do…whatever you can given what you’re allowed to do in the Jenkins builder environment. That doesn’t include installing packages or running mock. So I figured what I’d do for my projects – which are both Python – is set up a good tox profile. With all the stuff discussed below, the actual test command in the Jenkins job – after the boilerplate from the guide that checks out and merges the pull request – is simply tox.

First things first, the infra Jenkins builders didn’t have tox installed, so Kevin kindly fixed that for me. I also convinced him to install all the variant Python version packages – python26, and the non-native Python 3 packages – on each of the Fedora builders, so I can be confident I get pretty much the same tox run no matter which of the builders the job winds up on.

Of course, one thing worth noting at this point is that tox installs all dependencies from PyPI: if something your code depends on isn’t in there (or installed on the Jenkins builders), you’ll be stuck. So another thing I got to do was start publishing fedfind on PyPI! That was pretty easy, though I did wind up cribbing a neat trick from this PyPI issue so I can keep my README in Markdown format but have setup.py convert it to rst when using it as the long_description for PyPI, so it shows up properly formatted, as long as pypandoc is installed (but works even if it isn’t, so you don’t need pandoc just to install the project).

After playing with it for a bit, I figured out that what I really wanted was to have two workflows. One is to run just the core test suite, without any unnecessary dependencies, with python setup.py test – this is important when building RPM packages, to make sure the tests pass in the exact environment the package is built in (and for). And then I wanted to be able to run the tests across multiple environments, with coverage and linting, in the CI workflow. There’s no point running code coverage or a linter while building RPMs, but you certainly want to do it for code changes.

So I put the install, test and CI requirements into three separate text files in each repo – install.requires, tests.requires and tox.requires – and adjusted the setup.py files to do this in their setup():

install_requires = open('install.requires').read().splitlines(),
tests_require = open('tests.requires').read().splitlines(),

In tox.ini I started with this:
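(The snippet itself appears to have been lost from this page; judging from the full tox.ini quoted near the end of the post, it was presumably along these lines – a reconstruction, not the original:)

```ini
[tox]
envlist = py26,py27,py34,py35,py36,py37

[testenv]
deps=-r{toxinidir}/install.requires
     -r{toxinidir}/tests.requires
     -r{toxinidir}/tox.requires
```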


so the tox runs get the extra dependencies. I usually write pytest tests, so to start with in tox.ini I just had this command:
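(This snippet is also missing; going by the plain py26 command line in the final tox.ini quoted later, it was presumably just:)

```ini
commands=py.test
```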


Pytest integration for setuptools can be done in various ways, but I use this one. Add a class to setup.py:

import sys
from setuptools import setup, find_packages
from setuptools.command.test import test as TestCommand

class PyTest(TestCommand):
    user_options = [('pytest-args=', 'a', "Arguments to pass to py.test")]

    def initialize_options(self):
        TestCommand.initialize_options(self)
        self.pytest_args = ''
        self.test_suite = 'tests'

    def run_tests(self):
        # import here, because outside the eggs aren't loaded
        import pytest
        errno = pytest.main(self.pytest_args.split())
        sys.exit(errno)

and then this line in setup():

cmdclass = {'test': PyTest},

And that’s about the basic shape of it. With an envlist, we get the core tests running both through tox and setup.py. But we can do better! Let’s add some extra deps to tox.requires:
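(The tox.requires listing is missing here; given the coverage and diff-cover commands that follow, it presumably named packages along these lines – the exact names and any version pins are my guess:)

```
pytest-cov
diff_cover
pylint
```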


and tweak the commands in tox.ini:

commands=py.test --cov-report term-missing --cov-report xml --cov fedfind
         diff-cover coverage.xml --fail-under=90
         diff-quality --violations=pylint --fail-under=90

By adding a few args to our py.test call we get a coverage report for our library with the pull request applied. The subsequent commands use the neat diff_cover tool to add some more information. diff-cover basically takes the full coverage report (coverage.xml is produced by --cov-report xml) and considers only the lines that are touched by the pull request; the --fail-under arg tells it to fail if there is less than 90% coverage of the modified lines. diff-quality runs a linter (in this case, pylint) on the code and, again, considers only the lines changed by the pull request. As you might expect, --fail-under=90 tells it to fail if the ‘quality’ of the changed code is below 90% (it normalizes all the linter scores to a percentage scale, so that really means a pylint score of less than 9.0).

So without messing around with shipping all our stuff off to hosted services, we get a pretty decent indicator of the test coverage and code quality of the pull request, and it shows up as failing tests if they’re not good enough.

It’s kind of overkill to run the coverage and linter on all the tested Python environments, but it is useful to do it at least on both Python 2 and 3, since the pylint results may differ, and the code might hit different paths. Running them on every minor version isn’t really necessary, but it doesn’t take that long so I’m not going to sweat it too much.

But that does bring me to the last refinement I made, because you can vary what tox does in different environments. One thing I wanted for fedfind was to run the tests not just on Python 2.6, but with the ancient versions of several dependencies that are found in RHEL / EPEL 6. And there’s also an interesting bug in pylint which makes it crash when running on fedfind under Python 3.6. So my tox.ini really looks like this:

envlist = py26,py27,py34,py35,py36,py37
deps=py27,py34,py35,py36,py37: -r{toxinidir}/install.requires
     py26: -r{toxinidir}/install.requires.py26
     py27,py34,py35,py36,py37: -r{toxinidir}/tests.requires
     py26: -r{toxinidir}/tests.requires.py26
     py27,py34,py35,py36,py37: -r{toxinidir}/tox.requires
     py26: -r{toxinidir}/tox.requires.py26
commands=py27,py34,py35,py36,py37: py.test --cov-report term-missing --cov-report xml --cov fedfind
         py26: py.test
         py27,py34,py35,py36,py37: diff-cover coverage.xml --fail-under=90
         # pylint breaks on functools imports in python 3.6+
         # https://github.com/PyCQA/astroid/issues/362
         py27,py34,py35: diff-quality --violations=pylint --fail-under=90
setenv =
    PYTHONPATH = {toxinidir}

As you can probably guess, what’s going on there is we’re installing different dependencies and running different commands in different tox ‘environments’. pip doesn’t really have a proper dependency solver, which – among other things – unfortunately means tox barfs if you try and do something like listing the same dependency twice, the first time without any version restriction, the second time with a version restriction. So I had to do a bit more duplication than I really wanted, but never mind. What the files wind up doing is telling tox to install specific, old versions of some dependencies for the py26 environment:

setuptools == 0.6.rc10
six == 1.7.3


tox.requires.py26 is just shorter, skipping the coverage and pylint bits, because it turns out to be a pain trying to provide old enough versions of various other things to run those checks with the older pytest, and there’s no real need to run the coverage and linter on py26 as long as they run on py27 (see above). As you can see in the commands section, we just run plain py.test and skip the other two commands on py26; on py36 and py37 we skip the diff-quality run because of the pylint bug.

So now on every pull request, we check the code (and tests – it’s usually the tests that break, because I use some pytest feature that didn’t exist in 2.3.5…) still work with the ancient RHEL 6 Python, pytest, mock, setuptools and six, check it on various other Python interpreter versions, and enforce some requirements for test coverage and code quality. And the package builds can still just do python setup.py test and not require coverage or pylint. Who needs github and coveralls? 😉

Of course, after doing all this I needed a pull request to check it on. For resultsdb_conventions I just made a dumb fake one, but for fedfind, because I’m an idiot, I decided to write that better compose ID parser I’ve been meaning to do for the last week. So that took another hour and a half. And then I had to clean up the test suite…sigh.

Bluetooth in Fedora

Posted by Nathaniel McCallum on February 16, 2017 08:53 PM

So… Bluetooth. It’s everywhere now. Well, everywhere except Fedora. Fedora does, of course support bluetooth. But even the most common workflows are somewhat spotty. We should improve this.

To this end, I’ve enlisted the help of Don Zickus, kernel developer extraordinaire, and Adam Williamson, the inimitable Fedora QA guru. The plan is to create a set of user tests for the most common bluetooth tasks. This plan has several goals.

First, we’d like to know when stuff is broken. For example, the recent breakage in linux-firmware. Catching this stuff early is a huge plus.

Second, we’d like to get high quality bug reports. When things do break, vague bug reports often cause things to sit in limbo for a while. Making sure we have all the debugging information up front can make reports actionable.

Third, we’d (eventually) like to block a new Fedora release if major functionality is broken. We’re obviously not ready for this step yet. But once the majority of workflows work on the hardware we care about, we need to ensure that we don’t ship a Fedora release with broken code.

To this end we are targeting three workflows which cover the most common cases:

  • Keyboards
  • Headsets
  • Mice

For more information, or to help develop the user testing, see the Fedora QA bug. Here’s to a better future!

Announcing the resultsdb-users mailing list

Posted by Adam Williamson on February 16, 2017 01:28 AM

I’ve been floating an idea around recently to people who are currently using ResultsDB in some sense – either sending reports to it, or consuming reports from it – or plan to do so. The idea was to have a group where we can discuss (and hopefully co-ordinate) use of ResultsDB – a place to talk about result metadata conventions and so forth.

It seemed to get a bit of traction, so I’ve created a new mailing list: resultsdb-users. If you’re interested, please do subscribe, through the web interface, or by sending a mail with ‘subscribe’ in the subject to this address.

If you’re not familiar with ResultsDB – well, it’s a generic storage engine for test results. It’s more or less a database with a REST API and some very minimal rules for what constitutes a ‘test result’. The only requirements really are some kind of test name plus a result, chosen from four options; results can include any other arbitrary key:value pairs you like, and a few have special meaning in the web UI, but that’s about it. This is one of the reasons for the new list: because ResultsDB is so generic, if we want to make it easily and reliably possible to find related groups of results in any given ResultsDB, we need to come up with ways to ensure related results share common metadata values, and that’s one of the things I expect we’ll be talking about on the list.
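To make that concrete, here’s a minimal sketch of building such a result in Python – the field names follow my recollection of the ResultsDB v2 API, and the testcase name and metadata values are invented, so treat it as illustrative rather than authoritative:

```python
import json

# The four permitted outcomes, per the ResultsDB result model.
OUTCOMES = {"PASSED", "FAILED", "INFO", "NEEDS_INSPECTION"}

def make_result(testcase, outcome, **extra):
    """Build a result payload: a test name, one of the four outcomes,
    and arbitrary extra key:value metadata."""
    if outcome not in OUTCOMES:
        raise ValueError("outcome must be one of %s" % sorted(OUTCOMES))
    return {"testcase": testcase, "outcome": outcome, "data": extra}

# Invented example values; a real client would POST this as JSON to the
# ResultsDB API endpoint.
payload = make_result("update.base_selinux", "PASSED",
                      item="FEDORA-2017-c0ffee", type="bodhi_update",
                      arch="x86_64")
print(json.dumps(payload, sort_keys=True))
```

Agreeing on conventions is mostly about exactly which keys (item, type and so on) go into that arbitrary metadata.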

It began life as Taskotron‘s result storage engine, but it’s pretty independent, and you could certainly get value out of a ResultsDB instance without any of the other bits of Taskotron.

Right now ResultsDB is used in production in Fedora for storing results from Taskotron, openQA and Autocloud, and an instance is also used inside Red Hat for storing results from some RH test systems.

Please note: despite the list being a fedoraproject one, the intent is to co-ordinate with folks from CentOS, Red Hat and maybe even further afield as well; we’re just using an fp.o list as it’s a quick convenient way to get a nice mailman3/hyperkitty list without having to go set up a list server on taskotron.org or something.

The future of Fedora QA

Posted by Adam Williamson on February 12, 2017 05:33 PM

Welcome to version 2.0 of this blog post! This space was previously occupied by a whole bunch of longwinded explanation about some changes that are going on in Fedoraland, and are going to be accelerating (I think) in the near future. But it was way too long. So here’s the executive summary!

First of all: if you do nothing else to get up to speed on Stuff That’s Going On, watch Ralph Bean’s Factory 2.0 talk and Adam Samalik’s Modularity talk from Devconf 2017. Stephen Gallagher’s Fedora Server talk and Dennis Gilmore’s ‘moving everyone to Rawhide’ talk are also valuable, but please at least watch Ralph’s. It’s a one-hour overview of all the big stuff that people really want to build for Fedora (and RH) soon.

To put it simply: Fedora (and RH) don’t want to be only in the business of releasing a bunch of RPMs and operating system images every X months (or years) any more. And we’re increasingly moving away from the traditional segmented development process where developers/package maintainers make the bits, then release engineering bundles them all up into ‘things’, and then QA looks at the ‘things’ and says “er, it doesn’t boot, try again”, and we do that for several months until QA is happy, then we release it and start over. There is a big project to completely overhaul the way we build and ship products, using a pipeline that involves true CI, where each proposed change to Fedora produces an immediate feedback loop of testing and the change is blocked if the testing fails. Again, watch Ralph’s talk, because what he basically does is put up a big schematic of this entire system and go into a whole bunch of detail about his vision for how it’s all going to work.

As part of this, some of the folks in RH’s Fedora QA team whose job has been to work on ‘automated testing’ – a concept that is very tied to the traditional model for building and shipping a ‘distribution’, and just means taking some of the tasks assigned to QA/QE in that model and automating them – are now instead going to be part of a new team at Red Hat whose job is to work on the infrastructure that supports this CI pipeline. That doesn’t mean they’re leaving Fedora, or we’re going to throw away all the work we’ve invested in the components of Taskotron and start all over again, but it does mean that some or all of the components of Taskotron are going to be re-envisaged as part of a modernized pipeline for building and shipping whatever it is we want to call Fedora in the future – and also, if things go according to plan, for building and shipping CentOS and Red Hat products, as part of the vision is that as many components of the pipeline as possible will be shared among many projects.

So that’s one thing that’s happening to Fedora QA: the RH team is going to get a bit smaller, but it’s for good and sensible reasons. You’re also not going to see those folks disappear into some kind of internal RH wormhole, they’ll still be right here working on Fedora, just in a somewhat different context.

Of course, all of this change has other implications for Fedora QA as well, and I reckon this is a good time for those of us still wearing ‘Fedora QA’ hats – whether we’re paid by Red Hat or not – to be reconsidering exactly what our goals and priorities ought to be. Much like with Taskotron, we really haven’t sat down and done that for several years. I’ve been thinking about it myself for a while, and I wouldn’t say I have it all figured out, but I do have some thoughts.

For a start I think we should be looking ahead to the time when we’re no longer on what the anaconda team used to call ‘the blocker treadmill’, where a large portion of our working time is eaten up by a more or less constant cycle of waking up, finding out what broke in Rawhide or Branched today, and trying to get it fixed. If the plans above come about, that should happen a lot less for a couple of reasons: firstly Fedora won’t just be a project which releases a bunch of OS images every six months any more, and secondly, distribution-level CI ought to mean that things aren’t broken all the damn time any more. In an ideal scenario, a lot of the basic fundamental breakage that, right now, is still mostly caught by QA – and that we spend a lot of our cycles on dealing with – will just no longer be our problem. In a proper CI system, it becomes truly the developers’ responsibility: developers don’t get to throw in a change that breaks everything and then wait for QA to notice and tell them about it. If they try and send a change that breaks everything, it gets rejected, and hopefully, the breakage never really ‘happens’.

Sadly (or happily, given I still have a mortgage to pay off) this probably doesn’t mean Project Colada will finally be reality and we all get to sit on the beach drinking cocktails for the rest of our lives. CI is a great process for ensuring your project basically works all the time, but ‘basically works’ is a long way from ‘perfect’. Software is still software, after all, and a CI process is never going to catch all of the bugs. Freeing QA from the blocker treadmill lets us look up and think, well, what else can we do?

To be clear, I think we’re still going to need ‘release validation’. In fact, if the bits of the plan about having more release streams than just ‘all the bits, every six months’ come off, we’ll need more release validation. But hopefully there’ll be a lot more “well, this doesn’t quite work right in this quite involved real-world scenario” and less “it doesn’t boot and I think it ate my cat” involved. For the near future, we’re going to have to keep up the treadmill: bar a few proofs of concept and stuff, Fedora 26 is still an ‘all the bits, every six months’ release, and there’s still an awful lot of “it doesn’t boot” involved. (Right now, Rawhide doesn’t even compose, let alone boot!) But it’s not too early to start thinking about how we might want to revise the ‘release validation’ concept for a world where the wheels don’t fall off the bus every five minutes. It might be a good idea to go back to the teams responsible for all the Fedora products – Server, Workstation, Atomic et al. – and see if we need to take another good look at the documents that define what those products should deliver, and the test processes we have in place to try and determine whether they deliver them.

We’re also still going to be doing ‘updates testing’ and ‘test days’, I think. In fact, the biggest consequence of a world where the CI stuff works out might be that we are free to do more of those. There may be some change in what ‘updates’ are – it may not just be RPM packages any more – but whatever interesting forms of ‘update’ we wind up shipping out to people, we’re still going to need to make sure they work properly, and manual testing is always going to be able to find things that automated tests miss there.

I think the question of to what extent we still have a role in ‘automated testing’ and what it should be is also a really interesting one. One of the angles of the ‘more collaboration between RH and Fedora’ bit here is that RH is now very interested in ‘upstreaming’ a bunch of its internal tests that it previously considered to be sort of ‘RH secret sauce’. Specifically, there’s a set of tests from RH’s ‘Platform QE’ team which currently run through a pipeline using RH’s Beaker test platform which we’d really like to have at least a subset of running on Fedora. So there’s an open question about whether and to what extent Fedora QA would have a role in adapting those tests to Fedora and overseeing their operation. The nuts and bolts of ‘make sure Fedora has the necessary systems in place to be able to run the tests at all’ is going to be the job of the new ‘infrastructure’ team, but we may well wind up being involved in the work of adapting the tests themselves to Fedora and deciding which ones we want to run and for what purposes. In general, there is likely still going to be a requirement for ‘automated testing’ that isn’t CI – it’s still going to be necessary to test the things we build at a higher level. I don’t think we can yet know exactly what requirements we’ll have there, but it’s something to think about and figure out as we move forward, and I think it’s definitely going to be part of our job.

We may also need to reconsider how Fedora QA, and indeed Fedora as a whole, decides what is really important. Right now, there’s a pretty solid process for this, but it’s quite tied to the ‘all the things, every six months’ release cycle. For each release we decide which Fedora products are ‘release blocking’, and we care about those, and the bits that go into them and the tools for building them, an awful lot more than we care about anything else. This works pretty well to focus our limited resources on what’s really important. But if we’re going to be moving to having more and more varied ‘Fedora’ products with different release streams, the binary ‘is it release blocking?’ question doesn’t really work any more. Fedora as a whole might need a better way of doing that, and QA should have a role to play in figuring that out and making sure we work out our priorities properly from it.

So there we go! I hope that was useful and thought-provoking. We’ve got a QA meeting coming up tomorrow (2017-02-13) at 1600 UTC where I’m hoping we can chew these topics over a bit, just to serve as an opportunity to get people thinking. Hope to see you there, or on the mailing list!

Fedorahosted to Pagure

Posted by Kanika Murarka on February 12, 2017 06:48 AM

Fedorahosted.org was established in late 2007 using Trac for issues and wiki pages, Fedora Account System groups for access control and source uploads, and offering a variety of Source Control Management tools (git, svn, hg, bzr). With the rise of new workflows and source repositories, fedorahosted.org has ceased to grow, adding just one new project this year and a handful the year before.

As we all know, Fedorahosted is shutting down at the end of this month, so it’s time to migrate your projects from Fedorahosted to one of the following:

  1. Pagure
  2. Hosting and managing own Trac instance on OpenShift
  3. JIRA
  4. Phabricator
  5. GitHub
  6. Taiga

Pagure is the brainchild of Pierre-Yves Chibon, a member of the Fedora Engineering team. We will be looking into Pagure migration because Pagure is a new, full-featured git repository service, it’s open source, and we ❤ open source.

So, Pagure provides a test instance where we can create projects and test importing data. Note: it is cleared out from time to time, so do not use it for anything long-term.

Here is how Pagure supports the features Fedorahosted projects rely on:

Priorities
  - Fedorahosted: can add as many priority levels as required, with weights. Pagure: same.
  - Fedorahosted: can assign a default priority. Pagure: no such option.
  - Fedorahosted: custom priority tags. Pagure: same.

Milestones
  - Fedorahosted: can add as many milestones as we want. Pagure: same.
  - Fedorahosted: option to add a due date. Pagure: same.
  - Fedorahosted: keeps track of completed time. Pagure: does not record completed time.
  - Fedorahosted: option to select a default milestone. Pagure: no such option.

Resolutions
  - Fedorahosted: can add as many resolutions as we want. Pagure: same.
  - Fedorahosted: can set a default resolution type. Pagure: by default issues are closed as ‘None’.

Other things
  - Fedorahosted: separate fields for severity, component and version. Pagure: simpler, it has just tags.
  - Navigation and searching: difficult in Fedorahosted, easy in Pagure.
  - Permissions: Fedorahosted has different types of permission; Pagure has only an ‘admin’ permission.
  - Creating and maintaining tickets: difficult in Fedorahosted, easy in Pagure.
  - Enabling plug-ins: easy in both.

So, let’s try importing something to the staging Pagure repo. I’ll demo using the Fedora QA repo, which has recently been moved from Fedorahosted to Pagure.

  1. You should have XML_RPC permission or admin rights for the Fedorahosted repo.
  2. We will use pagure-importer to do the migration.
  3. Install it using pip: python3 -m pip install pagure_importer
  4. Create a new repo, e.g. Test-fedoraqa.
  5. Go to Settings and make the repo ticket-friendly by adding new milestones and priorities.
  6. Clone the issue tracker from Pagure: pgimport clone ssh://git@stg.pagure.io/tickets/Test-fedoraqa.git. This will clone the Pagure ticket repository into the default /tmp directory as /tmp/Test-fedoraqa.git
  7. Activate the Pagure tickets hook from the project settings. This is a necessary step, so that the Pagure database is also updated for ticket repository changes.
  8. Deactivate the Pagure Fedmsg hook from the project settings. This avoids the issue import spamming the fedmsg bus. The hook can be reactivated once the import has completed.
  9. The fedorahosted command can be used to import issues from a fedorahosted project to pagure.
    $ pgimport fedorahosted --help
        Usage: pgimport fedorahosted [OPTIONS] PROJECT_URL
        --tags  Import pagure tags as well.
        --private By default make all issues private.
        --username TEXT FAS username
        --password TEXT FAS password
        --offset INTEGER Number of issue in pagure before import
        --nopush Do not push the result of pagure-importer back
        --help  Show this message and exit.
    $ pgimport fedorahosted https://fedorahosted.org/fedora-qa --tags

    This command will import all the ticket information, with all tags, to the /tmp/foobar.git repository. If you get this error: ERROR: Error in response: {u'message': u'XML_RPC privileges are required to perform this operation', u'code': 403, u'name': u'JSONRPCError'}, it means you don't have XML_RPC permission.

  10. You will be prompted for your FAS username and password.
  11. Go to the /tmp folder: cd /tmp
  12. Now we need to push the tickets to the new repo. The push command pushes a cloned Pagure ticket repo back to Pagure:
    $ pgimport push Test-fedoraqa.git
  13. Refresh your repo to see the imported tickets.
  14. Now you can edit tickets in any way you want.

Stuck somewhere? Feel free to comment and contact. Thanks for reading this 🙂

openQA and Autocloud result submission to ResultsDB

Posted by Adam Williamson on February 07, 2017 05:18 AM

So I’ve just arrived back from a packed two weeks in Brno, and I’ll probably have some more stuff to post soon. But let’s lead with some big news!

One of the big topics at Devconf and around the RH offices was the ongoing effort to modernize both Fedora and RHEL’s overall build processes to be more flexible and involve a lot more testing (or, as some people may have put it, “CI CI CI”). A lot of folks wearing a lot of hats are involved in different bits of this effort, but one thing that seems to stay constant is that ResultsDB will play a significant role.

ResultsDB started life as the result storage engine for AutoQA, and the concept and name was preserved as AutoQA was replaced by Taskotron. Its current version, however, is designed to be a scalable, capable and generic store for test results from any test system, not just Taskotron. Up until last week, though, we’d never quite got around to hooking up any other systems to it to demonstrate this.

Well, that’s all changed now! In the course of three days, Jan Sedlak and I got both Fedora’s openQA instance and Autocloud reporting to ResultsDB. As results come out of both those systems, fedmsg consumers take the results, process them into a common format, and forward them to ResultsDB. This means there are groups with results from both systems for the same compose together, and you’ll find metadata in very similar format attached to the results from both systems. This is all deployed in production right now – the results from every daily compose from both openQA and Autocloud are being forwarded smoothly to ResultsDB.

To aid in this effort I wrote a thing we’re calling resultsdb_conventions for now. I think of it as being a code representation of some ‘conventions’ for formatting and organizing results in ResultsDB, as well as a tool for conveniently reporting results in line with those conventions. The attraction of ResultsDB is that it’s very little more than a RESTful API for a database; it enforces a pretty bare minimum in terms of required data for each result. A result must provide only a test name, an ‘item’ that was tested, and a status (‘outcome’) from a choice of four. ResultsDB allows a result to include as much more data as it likes, in the form of a freeform key:value data store, but it does not require any extra data to be provided, or impose any policy on its form.
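To make that bare minimum concrete, here is a small Python sketch of building such a result. The field names (testcase, outcome, data) and the four outcome values are inferred from the description above, so treat them as illustrative rather than the literal ResultsDB API:

```python
import json

# "a status ('outcome') from a choice of four"
OUTCOMES = {"PASSED", "FAILED", "INFO", "NEEDS_INSPECTION"}

def make_result(testcase, item, outcome, **extra_data):
    """Build the minimal required result, plus freeform key:value data.

    ResultsDB requires only a test name, a tested 'item' and an outcome;
    everything else goes into the unconstrained data store.
    """
    if outcome not in OUTCOMES:
        raise ValueError("unknown outcome: %s" % outcome)
    data = {"item": item}
    data.update(extra_data)  # no policy is imposed on extra keys
    return {"testcase": testcase, "outcome": outcome, "data": data}

result = make_result(
    "compose.install_default",      # test name
    "Fedora-Rawhide-20170207.n.0",  # the 'item' that was tested
    "PASSED",
    arch="x86_64",                  # freeform extra metadata
)
print(json.dumps(result, sort_keys=True))
```

Something like this payload would then be POSTed to the ResultsDB REST API; resultsdb_conventions is essentially a layer that standardizes which extra keys go into data for each kind of result.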

This makes ResultsDB flexible, but also means we will need to establish conventions where appropriate to ensure related results can be conveniently located and reasoned about. resultsdb_conventions is my initial contribution to this effort, originally written just to reduce duplication between the openQA and Autocloud result submitters and ensure they used a common layout, but intended to perhaps cover far more use cases in the future.

Having this data in ResultsDB is likely to be practically useful either immediately or in the very near future, but we’re also hoping it acts as a demonstration that using ResultsDB to consolidate results from multiple test sources is not only possible but quite easy. And I’m hoping resultsdb_conventions can be a starting point for a discussion and some consensus around what metadata we provide, and in what format, for various types of result. If all goes well, we’re hoping to hook up manual test result submission to ResultsDB next, via the relval-ng project that’s had some discussion on the QA mailing lists. Stay tuned for more on that!

Welcome Fedora Quality Planet

Posted by Kamil Páral on January 31, 2017 10:31 AM

Hello, I’d like to introduce a new sub-planet of Fedora Planet to you, located at http://fedoraplanet.org/quality/ (you don’t need to remember the URL, there’s a sub-planet picker in the top right corner of Fedora Planet pages that allows you to switch between sub-planets).

Fedora Quality Planet will contain news and useful information about QA tools and processes present in Fedora, updates on our quality automation efforts, guides for package maintainers (and other teams) how to interact with our tools and checks or understand the reported failures, announcements about critical issues in Fedora releases, and more.

Our goal is to have a single place for you to visit (or subscribe to) and get a good overview of what’s happening in the Fedora Quality space. Of course all Fedora Quality posts should also show up in the main Fedora Planet feed, so if you’re already subscribed to that, you shouldn’t miss our posts either.

If you want to join our effort and publish some interesting quality-related posts into Fedora Quality Planet, you’re more than welcome! Please see the instructions on how to syndicate your blog. If you have any questions or need help, ask in the test mailing list or ping kparal or adamw on #fedora-qa freenode IRC channel. Thanks!

The Tale Of The Two-Day, One-Character Patch

Posted by Adam Williamson on January 12, 2017 02:57 AM

I’m feeling like writing a very long explanation of a very small change again. Some folks have told me they enjoy my attempts to detail the entire step-by-step process of debugging some somewhat complex problem, so sit back, folks, and enjoy…The Tale Of The Two-Day, One-Character Patch!

Recently we landed Python 3.6 in Fedora Rawhide. A Python version bump like that requires all Python-dependent packages in the distribution to be rebuilt. As usually happens, several packages failed to rebuild successfully, so among other work, I’ve been helping work through the list of failed packages and fixing them up.

Two days ago, I reached python-deap. As usual, I first simply tried a mock build of the package: sometimes it turns out we already fixed whatever had previously caused the build to fail, and simply retrying will make it work. But that wasn’t the case this time.

The build failed due to build dependencies not being installable – python2-pypandoc, in this case. It turned out that this depends on pandoc-citeproc, and that wasn’t installable because a new ghc build had been done without rebuilds of the set of pandoc-related packages that must be rebuilt after a ghc bump. So I rebuilt pandoc, and ghc-aeson-pretty (an updated version was needed to build an updated pandoc-citeproc which had been committed but not built), and finally pandoc-citeproc.

With that done, I could do a successful scratch build of python-deap. I tweaked the package a bit to enable the test suites – another thing I’m doing for each package I’m fixing the build of, if possible – and fired off an official build.

Now you may notice that this looks a bit odd, because all the builds for the different arches succeeded (they’re green), but the overall ‘State’ is “failed”. What’s going on there? Well, if you click “Show result”, you’ll see this:

BuildError: The following noarch package built differently on different architectures: python-deap-doc-1.0.1-2.20160624git232ed17.fc26.noarch.rpm
rpmdiff output was:
error: cannot open Packages index using db5 - Permission denied (13)
error: cannot open Packages database in /var/lib/rpm
error: cannot open Packages database in /var/lib/rpm
removed     /usr/share/doc/python-deap/html/_images/cma_plotting_01_00.png
removed     /usr/share/doc/python-deap/html/examples/es/cma_plotting_01_00.hires.png
removed     /usr/share/doc/python-deap/html/examples/es/cma_plotting_01_00.pdf
removed     /usr/share/doc/python-deap/html/examples/es/cma_plotting_01_00.png

So, this is a good example of where background knowledge is valuable. Getting from step to step in this kind of debugging/troubleshooting process is a sort of combination of logic, knowledge and perseverance. Always try to be logical and methodical. When you start out you won’t have an awful lot of knowledge, so you’ll need a lot of perseverance; hopefully, the longer you go on, the more knowledge you’ll pick up, and thus the less perseverance you’ll need!

In this case the error is actually fairly helpful, but I also know a bit about packages (which helps) and remembered a recent mailing list discussion. Fedora allows arched packages with noarch subpackages, and this is how python-deap is set up: the main packages are arched, but there is a python-deap-docs subpackage that is noarch. We’re concerned with that package here. I recalled a recent mailing list discussion of this “built differently on different architectures” error.

As discussed in that thread, we’re failing a Koji check specific to this kind of package. If all the per-arch builds succeed individually, Koji will take the noarch subpackage(s) from each arch and compare them; if they’re not all the same, Koji will consider this an error and fail the build. After all, the point of a noarch package is that its contents are the same for all arches and so it shouldn’t matter which arch build we take the noarch subpackage from. If it comes out different on different arches, something is clearly up.
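The check Koji performs can be modelled as a comparison of the noarch subpackage’s contents as built on each arch. This is only a toy sketch (real Koji runs rpmdiff on the actual RPMs, and these file lists are made up), but it captures the idea:

```python
from collections import Counter

def differing_arches(per_arch_filelists):
    """Return arches whose noarch file list differs from the most common one."""
    counts = Counter(frozenset(files) for files in per_arch_filelists.values())
    majority = counts.most_common(1)[0][0]
    return sorted(arch for arch, files in per_arch_filelists.items()
                  if frozenset(files) != majority)

builds = {
    "x86_64": {"doc/index.html", "doc/cma_plotting_01_00.png"},
    "armhfp": {"doc/index.html", "doc/cma_plotting_01_00.png"},
    "ppc64":  {"doc/index.html"},  # the plotting files went missing here
}
print(differing_arches(builds))  # → ['ppc64']
```

An empty result means the noarch subpackage came out identical everywhere and the build would pass the check.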

So this left me with the problem of figuring out which arch was different (it’d be nice if the Koji message actually told us…) and why. I started out just looking at the build logs for each arch and searching for ‘cma_plotting’. This is actually another important thing: one of the most important approaches to have in your toolbox for this kind of work is just ‘searching for significant-looking text strings’. That might be a grep or it might be a web search, but you’ll probably wind up doing a lot of both. Remember good searching technique: try to find the most ‘unusual’ strings you can to search for, ones for which the results will be strongly correlated with your problem. This quickly told me that the problematic arch was ppc64. The ‘removed’ files were not present in that build, but they were present in the builds for all other arches.

So I started looking more deeply into the ppc64 build log. If you search for ‘cma_plotting’ in that file, you’ll see the very first result is “WARNING: Exception occurred in plotting cma_plotting”. That sounds bad! Below it is a long Python traceback – the text starting “Traceback (most recent call last):”.

So what we have here is some kind of Python thing crashing during the build. If we quickly compare with the build logs on other arches, we don’t see the same thing at all – there is no traceback in those build logs. Especially since this shows up right when the build process should be generating the files we know are the problem (the cma_plotting files, remember), we can be pretty sure this is our culprit.

Now this is a pretty big scary traceback, but we can learn some things from it quite easily. One is very important: we can see quite easily what it is that’s going wrong. If we look at the end of the traceback, we see that all the last calls involve files in /usr/lib64/python2.7/site-packages/matplotlib. This means we’re dealing with a Python module called matplotlib. We can quite easily associate that with the package python-matplotlib, and now we have our next suspect.

If we look a bit before the traceback, we can get a bit more general context of what’s going on, though it turns out not to be very important in this case. Sometimes it is, though. In this case we can see this:

+ sphinx-build-2 doc build/html
Running Sphinx v1.5.1

Again, background knowledge comes in handy here: I happen to know that Sphinx is a tool for generating documentation. But if you didn’t already know that, you should quite easily be able to find it out, by good old web search. So what’s going on is the package build process is trying to generate python-deap’s documentation, and that process uses this matplotlib library, and something is going very wrong – but only on ppc64, remember – in matplotlib when we try to generate one particular set of doc files.

So next I start trying to figure out what’s actually going wrong in matplotlib. As I mentioned, the traceback is pretty long. This is partly just because matplotlib is big and complex, but it’s more because it’s a fairly rare type of Python error – an infinite recursion. You’ll see the traceback ends with many, many repetitions of this line:

  File "/usr/lib64/python2.7/site-packages/matplotlib/mathtext.py", line 861, in _get_glyph
    return self._get_glyph('rm', font_class, sym, fontsize)

followed by:

  File "/usr/lib64/python2.7/site-packages/matplotlib/mathtext.py", line 816, in _get_glyph
    uniindex = get_unicode_index(sym, math)
  File "/usr/lib64/python2.7/site-packages/matplotlib/mathtext.py", line 87, in get_unicode_index
    if symbol == '-':
RuntimeError: maximum recursion depth exceeded in cmp

What ‘recursion’ means is pretty simple: it just means that a function can call itself. A common example of where you might want to do this is if you’re trying to walk a directory tree. In Python-y pseudo-code it might look a bit like this:

def read_directory(directory):
    for entry in directory:
        if entry is file:
            read_file(entry)
        if entry is directory:
            read_directory(entry)

To deal with directories nested in other directories, the function just calls itself. The danger is if you somehow mess up when writing code like this, and it winds up in a loop, calling itself over and over and never escaping: this is ‘infinite recursion’. Python, being a nice language, notices when this is going on, and bails after a certain number of recursions, which is what’s happening here.
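A minimal, deliberately broken example shows that guard rail in action (on Python 3 the exception is RecursionError; Python 2, as in the build log above, raised a plain RuntimeError with the "maximum recursion depth exceeded" message):

```python
import sys

def loop():
    # Deliberately broken: the function always calls itself, with no branch
    # that ever returns without recursing.
    return loop()

caught = False
try:
    loop()
except RecursionError:  # plain RuntimeError on Python 2, as in the log
    caught = True

print("Python bailed at a recursion limit of %d" % sys.getrecursionlimit())
```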

So now we know where to look in matplotlib, and what to look for. Let’s go take a look! matplotlib, like most everything else in the universe these days, is in github, which is bad for ecosystem health but handy just for finding stuff. Let’s go look at the function from the backtrace.

Well, this is pretty long, and maybe a bit intimidating. But an interesting thing is, we don’t really need to know what this function is for – I actually still don’t know precisely (according to the name it should be returning a ‘glyph’ – a single visual representation for a specific character from a font – but it actually returns a font, the unicode index for the glyph, the name of the glyph, the font size, and whether the glyph is italicized, for some reason). What we need to concentrate on is the question of why this function is getting in a recursion loop on one arch (ppc64) but not any others.

First let’s figure out how the recursion is actually triggered – that’s vital to figuring out what the next step in our chain is. The line that triggers the loop is this one:

                return self._get_glyph('rm', font_class, sym, fontsize)

That’s where it calls itself. It’s kinda obvious that the authors expect that call to succeed – it shouldn’t run down the same logical path, but instead get to the ‘success’ path (the return font, uniindex, symbol_name, fontsize, slanted line at the end of the function) and thus break the loop. But on ppc64, for some reason, it doesn’t.

So what’s the logic path that leads us to that call, both initially and when it recurses? Well, it’s down three levels of conditionals:

    if not found_symbol:
        if self.cm_fallback:
            <other path>
        else:
            if fontname in ('it', 'regular') and isinstance(self, StixFonts):
                return self._get_glyph('rm', font_class, sym, fontsize)

So we only get to this path if found_symbol is not set by the time we reach that first if, then if self.cm_fallback is not set, then if the fontname given when the function was called was ‘it’ or ‘regular’ and if the class instance this function (actually method) is a part of is an instance of the StixFonts class (or a subclass). Don’t worry if we’re getting a bit too technical at this point, because I did spend a bit of time looking into those last two conditions, but ultimately they turned out not to be that significant. The important one is the first one: if not found_symbol.

By this point, I’m starting to wonder if the problem is that we’re failing to ‘find’ the symbol – in the first half of the function – when we shouldn’t be. Now there are a couple of handy logical shortcuts we can take here that turned out to be rather useful. First we look at the whole logic flow of the found_symbol variable and see that it’s a bit convoluted. From the start of the function, there are two different ways it can be set True – the if self.use_cmex block and then the ‘fallback’ if not found_symbol block after that. Then there’s another block that starts if found_symbol: where it gets set back to False again, and another lookup is done:

    if found_symbol:
        found_symbol = False
        font = self._get_font(new_fontname)
        if font is not None:
            glyphindex = font.get_char_index(uniindex)
            if glyphindex != 0:
                found_symbol = True

At first, though, we don’t know if we’re even hitting that block, or if we’re failing to ‘find’ the symbol earlier on. It turns out, though, that it’s easy to tell – because of this earlier block:

    if not found_symbol:
        try:
            uniindex = get_unicode_index(sym, math)
            found_symbol = True
        except ValueError:
            uniindex = ord('?')
            warn("No TeX to unicode mapping for '%s'" %
                 sym.encode('ascii', 'backslashreplace'),
                 MathTextWarning)

Basically, if we don’t find the symbol there, the code logs a warning. We can see from our build log that we don’t see any such warning, so we know that the code does initially succeed in finding the symbol – that is, when we get to the if found_symbol: block, found_symbol is True. That logically means that it’s that block where the problem occurs – we have found_symbol going in, but where that block sets it back to False then looks it up again (after doing some kind of font substitution, I don’t know why, don’t care), it fails.

The other thing I noticed while poking through this code is a later warning. Remember that the infinite recursion only happens if fontname in ('it', 'regular') and isinstance(self, StixFonts)? Well, what happens if that’s not the case is interesting:

            if fontname in ('it', 'regular') and isinstance(self, StixFonts):
                return self._get_glyph('rm', font_class, sym, fontsize)
            warn("Font '%s' does not have a glyph for '%s' [U+%x]" %
                 (new_fontname,
                  sym.encode('ascii', 'backslashreplace').decode('ascii'),
                  uniindex),
                 MathTextWarning)

that is, if that condition isn’t satisfied, instead of calling itself, the next thing the function does is log a warning. So it occurred to me to go and see if there are any of those warnings in the build logs. And, whaddayaknow, there are four such warnings in the ppc64 build log:

/usr/lib64/python2.7/site-packages/matplotlib/mathtext.py:866: MathTextWarning: Font 'rm' does not have a glyph for '1' [U+1d7e3]
/usr/lib64/python2.7/site-packages/matplotlib/mathtext.py:867: MathTextWarning: Substituting with a dummy symbol.
  warn("Substituting with a dummy symbol.", MathTextWarning)
/usr/lib64/python2.7/site-packages/matplotlib/mathtext.py:866: MathTextWarning: Font 'rm' does not have a glyph for '0' [U+1d7e2]
/usr/lib64/python2.7/site-packages/matplotlib/mathtext.py:866: MathTextWarning: Font 'rm' does not have a glyph for '-' [U+2212]
/usr/lib64/python2.7/site-packages/matplotlib/mathtext.py:866: MathTextWarning: Font 'rm' does not have a glyph for '2' [U+1d7e4]

but there are no such warnings in the logs for other arches. That’s really rather interesting. It makes one possibility very unlikely: that we do reach the recursed call on all arches, but it fails on ppc64 and succeeds on the other arches. It’s looking far more likely that the problem is the “re-discovery” bit of the function – the if found_symbol: block where it looks up the symbol again – is usually working on other arches, but failing on ppc64.

So just by looking at the logical flow of the function, particularly what happens in different conditional branches, we’ve actually been able to figure out quite a lot, without knowing or even caring what the function is really for. By this point, I was really focusing in on that if found_symbol: block. And that leads us to our next suspect. The most important bit in that block is where it actually decides whether to set found_symbol to True or not, here:

        font = self._get_font(new_fontname)
        if font is not None:
            glyphindex = font.get_char_index(uniindex)
            if glyphindex != 0:
                found_symbol = True

I didn’t actually know whether it was failing because self._get_font didn’t find anything, or because font.get_char_index returned 0. I think I just played a hunch that get_char_index was the problem, but it wouldn’t be too difficult to find out by just editing the code a bit to log a message telling you whether or not font was None, and re-running the test suite.
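That kind of quick instrumentation might look something like this hypothetical sketch; find_glyph and FakeFont are stand-in names for illustration, not matplotlib’s real code:

```python
def find_glyph(get_font, fontname, uniindex):
    # Instrumented version of the suspect block: report which of the two
    # lookups is the one that fails.
    font = get_font(fontname)
    if font is None:
        print("debug: get_font(%r) returned None" % fontname)
        return False
    glyphindex = font.get_char_index(uniindex)
    if glyphindex == 0:
        print("debug: get_char_index(0x%x) returned 0" % uniindex)
        return False
    return True

class FakeFont:
    # Pretend font that 'knows' no glyphs at all, like the failing ppc64 case.
    def get_char_index(self, uniindex):
        return 0

found = find_glyph(lambda name: FakeFont(), "rm", 0x1d7e3)
print(found)  # → False: the second lookup is the one failing
```

Re-running the test suite with a couple of debug prints like these tells you immediately which branch to chase.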

Anyhow, I wound up looking at get_char_index, so we need to go find that. You could work backwards through the code and figure out what font is an instance of so you can find it, but that’s boring: it’s far quicker just to grep the damn code. If you do that, you get various results that are calls of it, then this:

src/ft2font_wrapper.cpp:const char *PyFT2Font_get_char_index__doc__ =
src/ft2font_wrapper.cpp:    "get_char_index()\n"
src/ft2font_wrapper.cpp:static PyObject *PyFT2Font_get_char_index(PyFT2Font *self, PyObject *args, PyObject *kwds)
src/ft2font_wrapper.cpp:    if (!PyArg_ParseTuple(args, "I:get_char_index", &ccode)) {
src/ft2font_wrapper.cpp:        {"get_char_index", (PyCFunction)PyFT2Font_get_char_index, METH_VARARGS, PyFT2Font_get_char_index__doc__},

Which is the point at which I started mentally buckling myself in, because now we’re out of Python and into C++. Glorious C++! I should note at this point that, while I’m probably a half-decent Python coder at this point, I am still pretty awful at C(++). I may be somewhat or very wrong in anything I say about it. Corrections welcome.

So I buckled myself in and went for a look at this ft2font_wrapper.cpp thing. I’ve seen this kind of thing a couple of times before, so by squinting at it a bit sideways, I could more or less see that this is what Python calls an extension module: basically, it’s a Python module written in C or C++. This gets done if you need to create a new built-in type, or for speed, or – as in this case – because the Python project wants to work directly with a system shared library (in this case, freetype), either because it doesn’t have Python bindings or because the project doesn’t want to use them for some reason.
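You can actually see this pure-Python vs. extension split from the Python side without reading any C. Here's an illustrative sketch using only stdlib modules (not matplotlib-specific):

```python
import importlib.machinery
import json

# json is pure Python: its module file is a .py source file
print(json.__file__)

# C extension modules are compiled shared libraries instead; these are
# the filename suffixes the interpreter accepts for them (".so" on Linux):
print(importlib.machinery.EXTENSION_SUFFIXES)
```

matplotlib's `ft2font` is the second kind: importing it loads a compiled shared object built from that `ft2font_wrapper.cpp` file.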

This code pretty much provides a few classes for working with Freetype fonts. It defines a class called matplotlib.ft2font.FT2Font with a method get_char_index, and that’s what the code back up in mathtext.py is dealing with: that font we were dealing with is an FT2Font instance, and we’re using its get_char_index method to try and ‘find’ our ‘symbol’.

Fortunately, this get_char_index method is actually simple enough that even I can figure out what it’s doing:

static PyObject *PyFT2Font_get_char_index(PyFT2Font *self, PyObject *args, PyObject *kwds)
{
    FT_UInt index;
    FT_ULong ccode;

    if (!PyArg_ParseTuple(args, "I:get_char_index", &ccode)) {
        return NULL;
    }

    index = FT_Get_Char_Index(self->x->get_face(), ccode);

    return PyLong_FromLong(index);
}

(If you’re playing along at home for MEGA BONUS POINTS, you now have all the necessary information and you can try to figure out what the bug is. If you just want me to explain it, keep reading!)

There’s really not an awful lot there. It’s calling FT_Get_Char_Index with a couple of args and returning the result. Not rocket science.

In fact, this seemed like a good point to start just doing a bit of experimenting to identify the precise problem, because we’ve reduced the problem to a very small area. So this is where I stopped just reading the code and started hacking it up to see what it did.

First I tweaked the relevant block in mathtext.py to just log the values it was feeding in and getting out:

        font = self._get_font(new_fontname)
        if font is not None:
            glyphindex = font.get_char_index(uniindex)
            warn("uniindex: %s, glyphindex: %s" % (uniindex, glyphindex))
            if glyphindex != 0:
                found_symbol = True

Sidenote: how exactly to just print something out to the console when you’re building or running tests can vary quite a bit depending on the codebase in question. What I usually do is just look at how the project already does it – find some message that is being printed when you build or run the tests, and then copy that. Thus in this case we can see that the code is using this warn function (it’s actually warnings.warn), and we know those messages are appearing in our build logs, so…let’s just copy that.

Then I ran the test suite on both x86_64 and ppc64, and compared. This told me that the Python code was passing the same uniindex values to the C code on both x86_64 and ppc64, but getting different results back – that is, I got the same recorded uniindex values, but on x86_64 the resulting glyphindex value was always something larger than 0, but on ppc64, it was sometimes 0.

The next step should be pretty obvious: log the input and output values in the C code.

index = FT_Get_Char_Index(self->x->get_face(), ccode);
printf("ccode: %lu index: %u\n", ccode, index);

Another sidenote: one of the more annoying things with this particular issue was just being able to run the tests with modifications and see what happened. First, I needed an actual ppc64 environment to use. The awesome Patrick Uiterwijk of Fedora release engineering provided me with one. Then I built a .src.rpm of the python-matplotlib package, ran a mock build of it, and shelled into the mock environment. That gives you an environment with all the necessary build dependencies and the source and the tests all there and prepared already. Then I just copied the necessary build, install and test commands from the spec file. For a simple pure-Python module this is all usually pretty easy and you can just check the source out and do it right in your regular environment or in a virtualenv or something, but for something like matplotlib, which has this C++ extension module too, it’s more complex. The spec builds the code, then installs it, then runs the tests out of the source directory with PYTHONPATH=BUILDROOT/usr/lib64/python2.7/site-packages, so the code that was actually built and installed is used for the tests. When I wanted to modify the C part of matplotlib, I edited it in the source directory, then re-ran the ‘build’ and ‘install’ steps, then ran the tests; if I wanted to modify the Python part I just edited it directly in the BUILDROOT location and re-ran the tests. When I ran the tests on ppc64, I noticed that several hundred of them failed with exactly the bug we’d seen in the python-deap package build – this infinite recursion problem. Several others failed due to not being able to find the glyph, without hitting the recursion. (It turned out the package maintainer had disabled the tests on ppc64, and so Fedora 24+’s python-matplotlib has been broken on ppc64 since about April.)

So anyway, with that modified C code built and used to run the test suite, I finally had a smoking gun. Running this on x86_64 and ppc64, the logged ccode values were totally different. The values logged on ppc64 were huge. But as we know from the previous logging, there was no difference in the value when the Python code passed it to the C code (the uniindex value logged in the Python code).

So now I knew: the problem lay in how the C code took the value from the Python code. At this point I started figuring out how that worked. The key line is this one:

if (!PyArg_ParseTuple(args, "I:get_char_index", &ccode)) {

That PyArg_ParseTuple function is what the C code is using to read in the value that mathtext.py calls uniindex and it calls ccode, the one that’s somehow being messed up on ppc64. So let’s read the docs!

This is one unusual example where the Python docs, which are usually awesome, are a bit difficult, because that’s a very thin description which doesn’t provide the references you usually get. But all you really need to do is read up – go back to the top of the page, and you get a much more comprehensive explanation. Reading carefully through the whole page, we can see pretty much what’s going on in this call. It basically means that args is expected to be a structure representing a single Python object, a number, which we will store into the C variable ccode. The tricky bit is that second arg, "I:get_char_index". This is the ‘format string’ that the Python page goes into a lot of helpful detail about.

As it tells us, PyArg_ParseTuple “use[s] format strings which are used to tell the function about the expected arguments…A format string consists of zero or more “format units.” A format unit describes one Python object; it is usually a single character or a parenthesized sequence of format units. With a few exceptions, a format unit that is not a parenthesized sequence normally corresponds to a single address argument to these functions.” Next we get a list of the ‘format units’, and I is one of those:

 I (integer) [unsigned int]
    Convert a Python integer to a C unsigned int, without overflow checking.

You might also notice that the list of format units include several for converting Python integers to other things, like i for ‘signed int’ and h for ‘short int’. This will become significant soon!
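The distinction between those format units matters because they map to C types of different sizes. You can check the sizes from Python with ctypes – a quick sketch (sizes are platform-dependent; the comments assume a typical 64-bit Linux system):

```python
import ctypes

# 'I' stores into a C unsigned int, 'k' into a C unsigned long:
print(ctypes.sizeof(ctypes.c_uint))   # usually 4 bytes
print(ctypes.sizeof(ctypes.c_ulong))  # usually 8 bytes on 64-bit Linux
```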

The :get_char_index bit threw me for a minute, but it’s explained further down:

“A few other characters have a meaning in a format string. These may not occur inside nested parentheses. They are: … : The list of format units ends here; the string after the colon is used as the function name in error messages (the “associated value” of the exception that PyArg_ParseTuple() raises).” So in our case here, we have only a single ‘format unit’ – I – and get_char_index is just a name that’ll be used in any error messages this call might produce.

So now we know what this call is doing. It’s saying “when some Python code calls this function, take the args it was called with and parse them into C structures so we can do stuff with them. In this case, we expect there to be just a single arg, which will be a Python integer, and we want to convert it to a C unsigned integer, and store it in the C variable ccode.”

(If you’re playing along at home but you didn’t get it earlier, you really should be able to get it now! Hint: read up just a few lines in the C code. If not, go refresh your memory about architectures…)

And once I understood that, I realized what the problem was. Let’s read up just a few lines in the C code:

FT_ULong ccode;

Unlike Python, C and C++ are ‘typed languages’. That just means that all variables must be declared to be of a specific type, unlike Python variables, which you don’t have to declare explicitly and which can change type any time you like. This is a variable declaration: it’s simply saying “we want a variable called ccode, and it’s of type FT_ULong“.

If you know anything at all about C integer types, you should know what the problem is by now (you probably worked it out a few paragraphs back). But if you don’t, now’s a good time to learn!

There are several different types you can use for storing integers in C: short, int, long, and possibly long long (depends on your arch). This is basically all about efficiency: you can only put a small number in a short, but if you only need to store small numbers, it might be more efficient to use a short than a long. Theoretically, when you use a short the compiler will allocate less memory than when you use an int, which uses less memory again than a long, which uses less than a long long. Practically speaking some of them wind up being the same size on some platforms, but the basic idea’s there.

All the types have signed and unsigned variants. The difference there is simple: signed numbers can be negative, unsigned ones can’t. Say an int is big enough to let you store 101 different values: a signed int would let you store any number between -50 and +50, while an unsigned int would let you store any number between 0 and 100.
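(The real numbers are powers of two, of course, not 101.) You can see the signed/unsigned distinction from Python too, again via ctypes – a tiny sketch:

```python
import ctypes

# An unsigned int can't represent -1; the bit pattern is simply
# reinterpreted as a large positive number (2**32 - 1 with a 32-bit int):
print(ctypes.c_uint(-1).value)
```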

Now look at that ccode declaration again. What is its type? FT_ULong. That ULong…sounds a lot like unsigned long, right?

Yes it does! Here, have a cookie. C code often declares its own aliases for standard C types like this; we can find Freetype’s in its API documentation, which I found by the cunning technique of doing a web search for FT_ULong. That finds us this handy definition: “A typedef for unsigned long.”

Aaaaaaand herein lies our bug! Whew, at last. As, hopefully, you can now see, this ccode variable is declared as an unsigned long, but we’re telling PyArg_ParseTuple to convert the Python object such that we can store it as an unsigned int, not an unsigned long.

But wait, you think. Why does this seem to work OK on most arches, and only fail on ppc64? Again, some of you will already know the answer, good for you, now go read something else. 😉 For the rest of you, it’s all about this concept called ‘endianness’, which you might have come across and completely failed to understand, like I did many times! But it’s really pretty simple, at least if we skate over it just a bit.

Consider the number “forty-two”. Here is how we write it with numerals: 42. Right? At least, that’s how most humans do it, these days, unless you’re a particularly hardy survivor of the fall of Rome, or something. This means we humans are ‘big-endian’. If we were ‘little-endian’, we’d write it like this: 24. ‘Big-endian’ just means the most significant element comes ‘first’ in the representation; ‘little-endian’ means the most significant element comes last.
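Python’s struct module lets you see both byte orders directly. Here’s forty-two laid out each way (42 is 0x2a, which happens to be the ‘*’ character):

```python
import struct

print(struct.pack('>I', 42))  # big-endian: b'\x00\x00\x00*', most significant byte first
print(struct.pack('<I', 42))  # little-endian: b'*\x00\x00\x00', least significant byte first
```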

All the arches Fedora supports except for ppc64 are little-endian. On little-endian arches, this error doesn’t actually cause a problem: even though we used the wrong format unit, the value winds up being correct. On (64-bit) big-endian arches, however, it does cause a problem – when you tell PyArg_ParseTuple to convert to an unsigned long, but store the result into a variable that was declared as an unsigned int, you get a completely different value (it’s multiplied by 2^32). The reasons for this involve getting into a more technical understanding of little-endian vs. big-endian (we actually have to get into the icky details of how values are really represented in memory), which I’m going to skip since this post is already long enough.
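Though I’m skipping the full explanation, the effect itself is easy to model with struct: write a 4-byte unsigned int into the start of an 8-byte slot, then read all 8 bytes back as an unsigned long. (This sketch zero-fills the slot; the real uninitialized C variable could contain garbage instead.)

```python
import struct

value = 120  # stands in for a glyph code; any small number shows the effect

# Little-endian machine: the 4 bytes written land in the low half, so the
# value survives being read back as an 8-byte unsigned long:
buf = bytearray(8)
struct.pack_into('<I', buf, 0, value)
print(struct.unpack('<Q', bytes(buf))[0])  # 120

# Big-endian machine: the 4 bytes land in the *high* half of the long:
buf = bytearray(8)
struct.pack_into('>I', buf, 0, value)
print(struct.unpack('>Q', bytes(buf))[0])  # 120 * 2**32
```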

But you don’t really need to understand it completely, certainly not to be able to spot problems like this. All you need to know is that there are little-endian and big-endian arches, and little-endian are far more prevalent these days, so it’s not unusual for low-level code to have weird bugs on big-endian arches. If something works fine on most arches but not on one or two, check if the ones where it fails are big-endian. If so, then keep a careful eye out for this kind of integer type mismatch problem, because it’s very, very likely to be the cause.

So now all that remained to do was to fix the problem. And here we go, with our one character patch:

diff --git a/src/ft2font_wrapper.cpp b/src/ft2font_wrapper.cpp
index a97de68..c77dd83 100644
--- a/src/ft2font_wrapper.cpp
+++ b/src/ft2font_wrapper.cpp
@@ -971,7 +971,7 @@ static PyObject *PyFT2Font_get_char_index(PyFT2Font *self, PyObject *args, PyObj
     FT_UInt index;
     FT_ULong ccode;

-    if (!PyArg_ParseTuple(args, "I:get_char_index", &ccode)) {
+    if (!PyArg_ParseTuple(args, "k:get_char_index", &ccode)) {
         return NULL;
     }

There’s something I just love about a one-character change that fixes several hundred test failures. 🙂 As you can see, we simply change the I – the format unit for unsigned int – to k – the format unit for unsigned long. And with that, the bug is solved! I applied this change on both x86_64 and ppc64, re-built the code and re-ran the test suite, and observed that several hundred errors disappeared from the test suite on ppc64, while the x86_64 tests continued to pass.

So I was able to send that patch upstream, apply it to the Fedora package, and once the package build went through, I could finally build python-deap successfully, two days after I’d first tried it.

Bonus extra content: even though I’d fixed the python-deap problem, as I’m never able to leave well enough alone, it wound up bugging me that there were still several hundred other failures in the matplotlib test suite on ppc64. So I wound up looking into all the other failures, and finding several other similar issues, which got the failure count down to just two sets of problems that are too domain-specific for me to figure out, and actually also happen on aarch64 and ppc64le (they’re not big-endian issues). So to both the people running matplotlib on ppc64…you’re welcome 😉

Seriously, though, I suspect without these fixes, we might have had some odd cases where a noarch package’s documentation would suddenly get messed up if the package happened to get built on a ppc64 builder.

QA protip of the day: make sure your test runner fails properly

Posted by Adam Williamson on January 01, 2017 01:44 AM

Just when you thought you were safe…it’s time for a blog post!

For the last few days I’ve been working on fixing Rawhide packages that failed to build as part of the Python 3.6 mass rebuild. In the course of this, I’ve been enabling test suites for packages where there is one, we can plausibly run it, and we weren’t doing so before, because tests are great and running them during package builds is great. (And it’s in the guidelines).

I’ve now come across two projects which have a unittest-based test script which does something like this:


import unittest

class SomeTests(unittest.TestCase):
    [tests here]

def main():
    suite = unittest.TestLoader().loadTestsFromTestCase(SomeTests)
    unittest.TextTestRunner(verbosity=3).run(suite)

if __name__ == '__main__':
    main()

Now if you just run this script manually all the time and inspect its output, you’ll be fine, because it’ll tell you whether the tests passed or not. However, if you try and use it in any kind of automated way you’re going to have trouble, because this script will always exit 0, even if some or all the tests fail. This, of course, makes it rather useless for running during a package build, because the build will never fail even if all the tests do.

If you’re going to write your own test script like this (which…seriously consider if you should just rely on unittest’s ‘gathering’ stuff instead, or use nose(2), or use pytest…), then it’s really a good idea to make sure your test script actually fails if any of the tests fail. Thus:


import sys
import unittest

class SomeTests(unittest.TestCase):
    [tests here]

def main():
    suite = unittest.TestLoader().loadTestsFromTestCase(SomeTests)
    ret = unittest.TextTestRunner(verbosity=3).run(suite)
    if not ret.wasSuccessful():
        sys.exit("Test(s) failed!")

if __name__ == '__main__':
    main()

(note: just doing sys.exit() will exit 0; doing sys.exit('any string') prints the string and exits 1).
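If you’d rather structure this so it’s directly testable, the same logic can return a shell-style exit code instead of calling sys.exit() inline – a sketch with a hypothetical trivial test case:

```python
import unittest

class SomeTests(unittest.TestCase):
    def test_addition(self):
        self.assertEqual(2 + 2, 4)

def run_tests():
    """Run the suite and return a shell-style exit code."""
    suite = unittest.TestLoader().loadTestsFromTestCase(SomeTests)
    result = unittest.TextTestRunner(verbosity=0).run(suite)
    # wasSuccessful() is False if any test failed or errored
    return 0 if result.wasSuccessful() else 1
```

And of course, if your script’s only job is to run the tests, plain unittest.main() already exits non-zero on failure with no extra code at all.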

Packagers, look out for this kind of bear trap when packaging…if the package doesn’t use a common test pattern or system but has a custom script like this, check it and make sure it behaves sanely.

Oooh, look! A shiny thing!

Posted by Adam Williamson on November 02, 2016 01:27 AM


Today I was supposed to be finalizing the test cases for Thursday’s switchable graphics Test Day.

Instead, thanks to this:

(jeff) adamw: Hey, I’ve been approached by 2 different people telling me dnf system-upgrade was failing. In both cases they had to import the F24 key manually. And in both cases they were going F21->F24. Do you know if that’s a documented limitation somewhere?

I somehow wound up spending the day using yum and dnf to upgrade a virtual machine from Fedora 13 to 15, to 16, to 17, to 23, to 24. And making a bunch of improvements to:

Easily distracted?


Oh, sorry, I saw something shiny over there…(wanders off)

Fedora 25 switchable graphics Test Day this Thursday, 2016-11-03

Posted by Adam Williamson on October 31, 2016 09:30 PM

Yep, it’s Test Day time again – most likely the final Test Day of the Fedora 25 cycle. This Thursday, 2016-11-03, will be switchable graphics Test Day!

‘Switchable graphics’ refers to the fairly common current practice of laptops having two graphics adapters, one low-power one for general purpose use, one more powerful one for use with applications that require more oomph (e.g. games or 3D rendering applications). NVIDIA brands this as ‘Optimus’, and AMD just as ‘Switchable Graphics’ or ‘Dynamic Switchable Graphics’.

There are some enhancements to Fedora’s support for such systems in Fedora 25, and part of the Test Day’s purpose is to test those enhancements. The other part of the Test Day’s purpose is to ensure that support for switchable graphics on Fedora 25 Workstation with Wayland by default is as good as it can be, and that in some cases where we know Wayland support is not sufficient, fallback to X.org works as expected.

If you’re not sure whether you have a system with switchable graphics, you can run xrandr --listproviders. If the output from this command lists more than one ‘provider’, you likely have switchable graphics. If you do, please come along to the Test Day and help us test, if you can spare the time. If you believe your system has switchable graphics, but the command only lists one provider, please come join the Test Day chat on the day – we may be able to investigate and figure out what’s going on!

The Test Day page and the test cases are still being revised and tweaked as I write this, but as the week goes along, all the instructions you need to run the tests will be present there. As always, the event will be in #fedora-test-day on Freenode IRC. If you don’t know how to use IRC, you can read these instructions, or just use WebIRC.

What just happened?

Posted by Adam Williamson on October 27, 2016 08:08 AM

4pm: “Well, guess it’s time to write the F25 Final blocker status mail.”

4:10pm: “Yeesh, I guess I’d better figure out which of the three iSCSI blocker bugs is actually still valid, and maybe take a quick look at what the problem is.”

1:06am: “Well, I think I’m done fixing iSCSI now. But I seem to have sprouted four new openQA action items. Blocker status mail? What blocker status mail?”

Fedora 25 Workstation Wayland-by-default Test Day report

Posted by Adam Williamson on October 14, 2016 10:30 PM

Hi folks! As yesterday’s Test Day was pretty popular and widely-covered, I thought I’d blog the report as well as emailing it out.

We had a great Test Day! 49 testers combined ran a total of 341 tests and filed or referenced 35 bugs. 9 of those have since been closed as duplicates, leaving 26 open reports:

  • #1299505 gnome-calculator prints “Currency LTL is not provided by IMF or ECB” in Financial mode
  • #1330034 [abrt] calibre: QObject::disconnect() : python2.7 killed by SIGSEGV
  • #1367846 Scrolling is way too fast in writer
  • #1376471 global menu can’t be used for 15-30 seconds, startup notification stuck, missing item in alt+tab (on wayland)
  • #1379098 [Regression] Gnome-shell crashes on switching back from tty ([abrt] gnome-shell: wl_resource_post_event() : gnome-shell killed by SIGSEGV)
  • #1383471 [abrt] WARNING: CPU: 3 PID: 12176 at ./include/linux/swap.h:276 page_cache_tree_insert+0x1cc/0x1e0
  • #1384431 activities screen shows applications and search results at the same time
  • #1384440 dragging gnome dash application to a specific workplace doesn’t open the application to this workspace (on wayland)
  • #1384489 Music not recognizing/importing files
  • #1384502 Recent tab not available
  • #1384537 Opening a new gnome-software window creates a new entry without a proper icon
  • #1384546 Removing application does not bring back the install icon immediately
  • #1384551 Printing directions using maps do not show the marked path
  • #1384560 Screenshot of gnome-maps does not show the map part at all
  • #1384569 Places dropdown search does not function if weather is open on secondary monitor
  • #1384570 gnome-initial-setup does not exit at the end
  • #1384572 Places dropdown search does not function if clocks is open on secondary monitor
  • #1384590 [abrt] gnome-photos: babl_get_name() : gnome-photos killed by SIGABRT
  • #1384596 gnome-boxes: starting fails without any feedback
  • #1384599 gnome-calculator currency conversion is hard to use
  • #1384616 thumbnail “border” seems misaligned in activities overview
  • #1384651 Selecting city should automatically be added on Clock Application
  • #1384665 [abrt] authconfig-gtk: gdk_window_enable_synchronized_configure() : python2.7 killed by SIGSEGV
  • #1384671 system-config-language does not work under Wayland
  • #1384675 system-config-users does not work under Wayland
  • #1384678 Missing top-left icon (and full application name) on Wayland

Some of these are not Wayland bugs, but it’s not a bad thing that people found some non-Wayland bugs as well while testing! We did find several new Wayland issues, but on the positive side, no really big bugs that weren’t already known and on the radar for the final release.

So the event looks like a success on all fronts: we found some new bugs to squish, but it also gives us a decent indication that the Workstation-on-Wayland experience is in a good enough condition for a first stable release. We also confirmed that the Workstation-on-X11 session is available as a fallback and that works properly, for anyone who can’t use Wayland for any reason.

Many thanks to all the testers for their hard work!