Fedora Quality Planet

Tales from GNOME Asia 2023

Posted by sumantro on December 05, 2023 02:44 PM

Prologue


The GNOME Asia 2023 Summit, a key event in the open-source technology calendar, took place in Kathmandu, Nepal, from December 1 to 3, 2023. This conference is the premier annual event for the GNOME community in Asia, focusing on the GNOME desktop, applications, and development tools. The GNOME Foundation, which oversees the summit, aims to bring together various stakeholders, including users, developers, foundation leaders, government representatives, and businesses, to discuss present technologies and future developments. This iteration of the summit, as covered in this report, had two parts: the community and the actual event proceedings. The GNOME Asia Summit brought the key FOSS stakeholders of Nepal back into the limelight and also collaborated with the Fedora Project, helping host a release party for Fedora Linux 39 and celebrate the 20th year of the Fedora Project.


Day 0 

Like all "great" FOSS projects, Day 0 was all about intense collaboration over the Signal group. Everyone was traveling, sharing travel tips and hacks, and mostly laying out the path for others to follow. The plan was for everyone to land and reach the accommodation booked for the next couple of days safely, and that part was taken care of nicely. We had a meet and greet on the evening of November 30 at the Marriott Fairfield, then went for dinner and settled into our comfy rooms for the day.


Day 1


The event venue was less than 5 km away, and Justin approached the hotel to get us a bus so everyone could fit in. The agenda had a few sessions that were of interest and, of course, the release party!
The keynote was by Justin Flory. Justin talked extensively about Open Source and drew parallels between the methods and the culture of a community where Open Source thrives. He also shared stories of his time with the community and how that has helped him grow as a professional.









The session following this was by Matthias Clasen, who works on the GNOME desktop and GTK at Red Hat. He spoke in depth about "How GNOME works", and this was one of the most insightful sessions for me. As a Quality Engineer for the Fedora Project, it is crucial that I know the primary release-blocking desktop environment (DE) in depth. Matthias also shared his knowledge of D-Bus and the compositor, and the audience enjoyed it.





Next was a session on GNOME Extensions. I don't have much experience with extensions, and this session gave me a basic idea of how they work.




Jens Petersen from Fedora gave a talk about declarative GTK programming, which was also very interesting for me and taught me a few neat things about GTK.



After this, we had the most-awaited event: the Fedora Linux 39 Release Party. The session was slotted for three hours.




Here are some photos.





Day 2


On the second day, we had a lot of sessions around GNOME, but the most important one, and the one I loved the most, was definitely Nikita speaking on Krita. That talk taught me a lot about brushes and gave me, as a novice, a good overview of why and how Krita can do better than many closed-source tools.




Dinner and Socials 





#wearefedora :)


The move forward

These are changing times. Socially, we are now safe from COVID, and we are at the juncture where we need to bet on building a stronger Fedora community in the emerging markets of the coming years. Every community event like this will have a deep and lasting impact on the generations who will contribute and make meaningful strides in furthering our mission and vision.
 
Nepal has always been a strong advocate of Linux in local languages, and I think Fedora can help reboot some of those initiatives. I believe there is a lot of potential in having such events: more than 80% of our audience inquired about getting started and building packages.

Bisecting Fedora kernel

Posted by Kamil Páral on August 15, 2023 03:07 PM

This post shows how to bisect a Fedora kernel to find the source of a regression. I needed that recently and I found no good guide, so I’m at least capturing my notes here, perhaps you find it useful. This approach can be used to identify which exact commit caused a bad kernel behavior on your hardware, and then report it to kernel maintainers. Note, you need to have a reliable way of reproducing the problem. If it happens randomly and infrequently, it’s much harder to debug.

0. Try the latest Rawhide kernel

Before you spend too much time on this, it’s always worth a shot to test the latest Rawhide kernel. Perhaps the bug is fixed already?

Usually the kernel consists of these installed packages: kernel, kernel-core, kernel-modules, kernel-modules-core, kernel-modules-extra. But see what you have installed on your system, e.g. with: rpm -qa | grep ^kernel | sort .

Install the latest Rawhide kernel:

sudo dnf update --setopt=installonly_limit=0 --repo fedora --releasever rawhide kernel{,-core,-modules,-modules-core,-modules-extra}

You want to use --setopt=installonly_limit=0 throughout this exercise to make sure you don’t accidentally remove a working kernel from your system and don’t end up with just broken ones (there’s a limit of three kernels installed at the same time by default). But it means you’ll need to remove tested kernels manually from time to time, otherwise you run out of space in /boot.

Reboot and keep pressing F8 during startup to display the GRUB boot menu. Make sure to select the newly installed kernel, boot it, test it. Note down whether it’s good or bad. If the problem is still there, we’ll need to continue debugging.

Note: When you want to remove that tested kernel, obviously you can’t be currently running from it. Then use standard dnf remove to get rid of it, or use dnf history for a more convenient way (e.g. dnf history undo last).
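For example, while booted into a different kernel (the version here just reuses the example build from the next section; substitute the kernel you want to remove):

sudo dnf remove kernel{,-core,-modules,-modules-core,-modules-extra}-6.5.0-0.rc6.43.fc39
# or, if installing it was your most recent dnf transaction:
sudo dnf history undo last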

I. Narrow down the issue in Fedora-packaged kernels

As the first step, it’s useful to figure out which Fedora-packaged kernel is the last one with good behavior (a “good kernel“), and which one is the first one with bad behavior (a “bad kernel“). That will help you narrow down the scope. It’s much faster to download and install already built kernels than to compile your own (which we’ll do later).

Most probably you’re currently running a bad kernel (because you’re reading this). So reboot, display the GRUB boot menu and boot an older kernel. See if it’s good or bad, note it down. Unless the problem is very recent, all available kernels (usually three) in the GRUB menu will be bad. It’s time to start downloading older kernels from Koji. Use a reasonable strategy, e.g. install a month old kernel, or several months old, and gradually halve the intervals and narrow down until you find the latest good kernel. You don’t need to worry about using kernels from other Fedora releases (as you can see in their .fcNN suffix), they are standalone and work in any release. You can download the kernel subpackages manually, or use koji command (from the koji package), e.g.:

koji download-build --arch x86_64 kernel-6.5.0-0.rc6.43.fc39

That downloads many more subpackages than you need, so install just those needed (see the previous section), e.g. like this:

sudo dnf --setopt=installonly_limit=0 install ./kernel{,-core,-modules,-modules-core,-modules-extra}-6.5*.rpm

For each picked kernel, install it, boot into it, test it, note down whether it’s good or bad. Continue until you’ve found the latest good packaged kernel and the first bad packaged kernel.

II. Find git commits used for building identified good and bad kernels

Now that you have the closest good and bad packaged kernel, we need to figure out which git commits from the upstream Linux kernel were used to build them. In some cases, the git commit hash is included directly in the RPM filename. For example in my case, I reported that kernel-6.4.0-0.rc0.20230427git6e98b09da931.5.fc39 is the last good kernel, and kernel-6.4.0-0.rc0.20230428git33afd4b76393.7.fc39 is the first bad kernel. From those filenames, you can see that git commit 6e98b09da931 is good and git commit 33afd4b76393 is bad.

The commit hash is not always part of the filename, e.g. in the case of kernel-6.5.0-0.rc6.43.fc39. In that case, you need to download the .src.rpm file from that build, either manually from Koji or using:

koji download-build --arch src kernel-6.5.0-0.rc6.43.fc39

Unpack that .src.rpm (my favorite decompress tool is deco), find linux-*.tar.xz archive and run the following command (adjust the archive filename):

$ xzcat -qq linux-6.5-rc6.tar.xz | git get-tar-commit-id
2ccdd1b13c591d306f0401d98dedc4bdcd02b421

(This command is documented in the kernel.spec file, also in that directory). Now you know the git commit hash used for that kernel build. Figure out commits for both the good and bad kernel you identified.

III. Use git bisect to find the exact commit that broke it

It’s time to clone the upstream Linux kernel repo:

git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git ~/src/linux

And also the Fedora distgit kernel repo:

fedpkg clone -a kernel ~/distgit/kernel

We’ll now use git bisect to arrive at the breaking commit which caused the problem. After each step, we’ll need to build the kernel, test it, and mark it as good or bad. Let’s start:

cd ~/src/linux
git bisect start
git bisect good YOUR_GOOD_COMMIT
git bisect bad YOUR_BAD_COMMIT

Git now prints a commit hash to be tested (and switches the repository to that commit), and an estimate of how many steps remain. We now need to take the current contents of the source code and build our own kernel.

Note: When building the kernel, I was advised to avoid the overhead of packaging, to speed up the process. I'm sure it's good advice, but I didn't find a good guide on how to do that (including how to retrieve the Fedora kernel config, build the kernel manually, copy it to the right places, create the initramfs, create a boot option in GRUB, etc). So I just ran the whole process including packaging. On my machine, the compilation took about 40 minutes and packaging took 10 minutes, and I needed to do about 11 rounds, so it was an OK tradeoff for me. (If you can write a guide on how to do that without packaging, please do and link it in the comments, I'd love to read it.)

Let’s create a tarball of the current source code like this:

git archive --prefix=linux-local/ HEAD | xz -0 -T0 > linux-local.tar.xz

Usually the tarballs have a version number in both the filename and the included directory (which is then also matched in the spec file). You can do that if you wish; I didn't want to spend too much time on throwaway builds, so I just used a static filename and overwrote it each time.

Let’s move the tarball to the distgit repo:

mv ~/src/linux/linux-local.tar.xz ~/distgit/kernel/

Now we need to adjust the distgit spec file a bit:

cd ~/distgit/kernel
# edit kernel.spec

I made the following changes to the spec file:

-# define buildid .local
+%define buildid .local
-%define specrpmversion 6.4.9
+%define specrpmversion 6.4.0
-%define specversion 6.4.9
+%define specversion 6.4.0
-%define tarfile_release 6.4.9
+%define tarfile_release local
-%define specrelease 200%{?buildid}%{?dist}
+%define specrelease 0.gitYOUR_TESTED_COMMIT%{?buildid}%{?dist}
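If you end up doing many bisect rounds, you can script the one field that changes each round. A minimal sketch, assuming the spec layout above and the bisect checkout in ~/src/linux:

COMMIT=$(git -C ~/src/linux rev-parse --short=12 HEAD)
sed -i "s/^%define specrelease .*/%define specrelease 0.git${COMMIT}%{?buildid}%{?dist}/" kernel.spec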

Now we can start the build:

nice fedpkg mockbuild --with baseonly --with vanilla --without debuginfo

Options --with baseonly and --without debuginfo make sure we don’t build unnecessary stuff. --with vanilla was needed, because Fedora-specific patches didn’t apply to the older source code.

After a long time, your results should be available in results_kernel/ and look something like this:

$ ls -1 results_kernel/6.4.0/0.git6e98b09da931.local.fc38/
build.log
hw_info.log
installed_pkgs.log
kernel-6.4.0-0.git6e98b09da931.local.fc38.src.rpm
kernel-6.4.0-0.git6e98b09da931.local.fc38.x86_64.rpm
kernel-core-6.4.0-0.git6e98b09da931.local.fc38.x86_64.rpm
kernel-devel-6.4.0-0.git6e98b09da931.local.fc38.x86_64.rpm
kernel-devel-matched-6.4.0-0.git6e98b09da931.local.fc38.x86_64.rpm
kernel-modules-6.4.0-0.git6e98b09da931.local.fc38.x86_64.rpm
kernel-modules-core-6.4.0-0.git6e98b09da931.local.fc38.x86_64.rpm
kernel-modules-extra-6.4.0-0.git6e98b09da931.local.fc38.x86_64.rpm
kernel-modules-internal-6.4.0-0.git6e98b09da931.local.fc38.x86_64.rpm
kernel-uki-virt-6.4.0-0.git6e98b09da931.local.fc38.x86_64.rpm
root.log
state.log

See that all the RPMs have the git commit hash identifier that you specified in the spec file. Now you just need to install the kernel (as shown in a previous section), boot it (make sure to display the GRUB menu and verify that the correct kernel is selected), and test it.
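For example, with the build output shown above, the install step could look like this (adjust the version and commit hash to match your own build):

cd results_kernel/6.4.0/0.git6e98b09da931.local.fc38/
sudo dnf --setopt=installonly_limit=0 install ./kernel{,-core,-modules,-modules-core,-modules-extra}-6.4.0-0.git6e98b09da931.local.fc38.x86_64.rpm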

Note: If you have Secure Boot enabled, you’ll need to disable it in order to boot your own kernel (or figure out how to sign it yourself). Don’t forget to re-enable it once this is all over.

Once you’ve determined whether this kernel is good or bad, tell it to git bisect:

cd ~/src/linux
git bisect good   # or bad

And now the whole cycle repeats. Create a new archive using git archive, move it to the distgit directory, adjust the specrelease field in kernel.spec to match the new commit hash, and use fedpkg to build another kernel. Eventually, git bisect will print out the exact commit that caused the problem.
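Condensed, each iteration of that loop looks roughly like this (paths match the layout used earlier):

cd ~/src/linux
git archive --prefix=linux-local/ HEAD | xz -0 -T0 > linux-local.tar.xz
mv linux-local.tar.xz ~/distgit/kernel/
cd ~/distgit/kernel
# edit kernel.spec: set specrelease to the new commit hash
nice fedpkg mockbuild --with baseonly --with vanilla --without debuginfo
# install the built kernel from results_kernel/, reboot into it, test it, then:
cd ~/src/linux
git bisect good   # or: git bisect bad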

IV. Report your findings

Report the problem and the identified breaking commit into Red Hat Bugzilla under the kernel component. Please also save and attach the bisect log:

cd ~/src/linux
git bisect log > git-bisect-log.txt

Then also report this problem (possibly a regression) to the kernel upstream and mention it in the RH Bugzilla ticket. Thanks and good luck.

DevConf.CZ 2023, Rawhide update test gating, ELN testing and more!

Posted by Adam Williamson on June 20, 2023 11:17 AM

I'm in Brno, working from the office for a few days after the end of DevConf.CZ. It was a great conference, good to see people and feel some positive energy after all the stuff with RH layoffs and so on. It was really well attended, and there were a lot of useful talks. I presented on the current state of openQA and Fedora CI, with Miroslav Vadkerti kindly covering the Fedora CI stuff (thanks to Miro for that). The segmented talk video hasn't been updated yet, but you can watch it from the recorded live stream starting here (at 6:04:32). I think it went pretty well, I'm much happier with this latest version of the talk than the versions I did at DevConf.US and LinuxCon (sorry to anyone who saw those ones!)

The talk by Aoife Moloney and Michal Konecny on how the CPE team (which handles Fedora infra and apps) has started doing organized pre-scoping for projects instead of just diving in was really interesting and informative. The anaconda meetup wound up being just the anaconda team and myself and Sumantro from QA, but it was very useful as we were able to talk about the plans for moving forward with the new anaconda webUI and how we can contribute testing for that - look out for Test Weeks coming soon. Davide Cavalca's talk on Fedora ELN usage at Meta was great, and inspired me to work on stuff (more on that later).

There were a lot of random conversations as always - thanks to it being June, the "hallway track" mostly evolved into the "shadow track", under the shade of a big tree in the courtyard, with beanbags and ice cream! That's a definite improvement. The social event was in a great location - around an outdoor swimming pool (although we couldn't swim - apparently we couldn't serve drinks if swimming was allowed, so that seems like the best choice!) All in all, a great conference. I'm very much looking forward to Flock in Cork now, and will be doing my talk there again if it's accepted.

Tomorrow will be an exciting day, because (barring any unforeseen issues) we'll be turning on gating of Rawhide updates! I've been working towards this for some time now - improving the reliability of the tests, implementing test re-run support from Bodhi, implementing the critical path group stuff, and improving the Bodhi web UI display of test results and gating status - so I'm really looking forward to getting it done (and hoping it goes well). This should mean Rawhide's stability improves even more, and Kevin and I don't have to scramble quite so much to "shadow gate" Rawhide any more (by untagging builds that fail the tests).

Davide mentioned during his ELN talk that they ran into an issue that openQA would have caught if it ran on ELN, so I asked if that would be useful, and he said yes. So, yesterday I did it. This required changes to fedfind, the openQA tests, and the openQA scheduler - and then after that all worked out well and I deployed it, I realized it also needed changes to the result reporting code and a couple of other things too, which I had to do in rather a hurry! But it's all sorted out now, and we have new ELN composes automatically tested in production when they land. Initially only a couple of default-install-and-boot tests were running, I'm now working to extend the test set and the tested images.

Other than that I've been doing a lot of work on the usual things - keeping openQA updated and running smoothly, investigating and fixing test failures, improving stuff in Bodhi and Greenwave, and reviewing new tests written by lruzicka. I'll be on vacation for a week or so from Friday, which will be a nice way to decompress from DevConf, then back to work on a bunch of ideas that came out of it!

Thoughts on a pile of laptops

Posted by Adam Williamson on February 02, 2023 09:42 PM

Hi folks! For the first post of 2023, I thought I'd do something a bit different. If you want to keep up with what I've been working on, these days Mastodon is the best place - I've been posting a quick summary at the end of every working day there. Seems to be working out well so far. The biggest thing lately is that "grouped critical path", which I wrote about in my last post, is deployed in production now. This has already reduced the amount of tests openQA has to run, and I'm working on some further changes to optimize things more.

So instead of that, I want to rhapsodize on this pile of laptops:

A pile of laptops

On the top is the one I used as my main laptop for the last six years, and my main system for the last couple, since I got rid of my desktop. It's a Dell XPS 13 9360, the "Kaby Lake" generation. Not pictured (as it's over here being typed on, not in the pile) is its replacement, a 2022 XPS 13 (9315), which I bought in December and have been pretty happy with so far. On the bottom of the pile is a Lenovo tester (with AMD Ryzen hardware) which I tried to use as my main system for a bit, but it didn't work out as it only has 8G of RAM and that turns out to be...not enough. Second from bottom is a terrible budget Asus laptop with Windows on it that I keep around for the occasional time I need to use Windows - mainly to strip DRM from ebooks. Not pictured is the older XPS 13 I used before the later two, which broke down after a few years.

But the hidden star of the show is the one second from top. It has a high-resolution 13" display with pretty slim bezels and a built-in webcam. It has dual NVIDIA and Intel GPUs. It has 8G of RAM, SSD storage and a multicore CPU, and runs Fedora 36 just fine, with decent (3-4hr) battery life. It weighs 3.15lb (1.43kg) and has USB, HDMI and ethernet outs.

It also has a built-in DVD drive, VGA out and an ExpressCard slot (anyone remember those?) That's because it's from 2010.

It's a Sony Vaio Z VPC-Z11, and I still use it as a backup/test system. It barely feels outdated at all (until you remember about the DVD drive, which is actually pretty damn useful sometimes still). Every time I open it I'm still amazed at what a ridiculous piece of kit it is/was. Just do an image search for "2010 laptop" and you'll see stuff like, well, this. That's what pretty much every laptop looked like in 2010. They had 4G of RAM if you were lucky, and hard disks. They weighed 2kg+. They had huge frickin' bezels. The Macbook Air had come out in 2008, but it was an underpowered thing with a weak CPU and HDD storage. The 2010 models had SSDs, but maxed out at 4G RAM and still had pretty weak CPUs (and way bigger bezels, and worse screens, and they certainly didn't have DVD drives). They'd probably feel pretty painful to use now, but the Vaio still feels fine. Here's a glamour shot:

One very cool laptop

I've only had to replace its battery twice and its SSDs (it came from the factory with two SSDs configured RAID-0, because weird Sony is like that) once in 12 years. Probably one day it will finally not be really usable any more, but who the heck knows how long that will be.

Fedora 37, openQA news, Mastodon and more

Posted by Adam Williamson on November 18, 2022 10:34 PM

Hey, time for my now-apparently-annual blog post, I guess? First, a quick note: I joined the herd showing up on Mastodon, on the Fosstodon server, as @adamw@fosstodon.org. So, you know, follow me or whatever. I posted to Twitter even less than I post here, but we'll see what happens!

The big news lately is of course that Fedora 37 is out. Pulling this release together was a bit more painful than has been the norm lately, and it does have at least one bug I'm sad we didn't sort out, but unless you have one of a very few motherboards from six years ago and want to do a reinstall, everything should be great!

Personally I've been running Fedora Silverblue this cycle, as an experiment to see how it fares as a daily driver and a dogfooding base. Overall it's been working fine; there are still some awkward corners if you are strict about avoiding RPM overlays, though. I'm definitely interested in Colin's big native container rework proposal, which would significantly change how the rpm-ostree-based systems work and make package layering a more 'accepted' thing to do. I also found that sourcing apps feels slightly odd - I'd kinda like to use Fedora Flatpaks for everything, from a dogfooding perspective, but not everything I use is available as one, so I wound up with kind of a mix of things sourced from Flathub and from Fedora Flatpaks. I was also surprised that Fedora Flatpaks aren't generally updated terribly often, and don't seem to have 'development' branches - while Fedora 37 was in development, I couldn't get Flatpak builds of apps that matched the Fedora 37 RPM builds, I was stuck running Fedora 36-based Flatpaks. So it actually impeded my ability to test the latest versions of everything. It'd be nice to see some improvement here going forward.

My biggest project this year has been working towards gating Rawhide critical path updates on the openQA tests, as we do for stable and Branched releases. This has been a deceptively large effort; ensuring all the tests work OK on Rawhide was a relatively small job, but the experience of actually having the tests running has been interesting. There are, overall, a lot more updates for Rawhide than any other release, and obviously, they tend to break things more often. First I turned the tests on for the staging instance, then after a few months trying to get on top of things there, turned them on for the production instance. I planned to run this way for a month or two to see if I could stay on top of keeping the tests running smoothly and passing when they should, and dealing with breakage. On the whole, it's been possible...but just barely. The increased workload means tests can take several hours to complete after an update is submitted, which isn't ideal. Because we don't have the gating turned on, when somebody does submit an update that breaks the tests, I have to ensure it gets fixed right away or else get it untagged before the next Rawhide compose happens, or else the test will fail for every subsequent update too; that can be stressful. We also have had quite a lot of 'fun' with intermittent problems like systemd-oomd killing things it shouldn't. This can result in a lot of time spent manually restarting failed tests, coming up with awkward workarounds, and trying to debug the problems.

So, I kinda felt like things aren't quite solid enough yet to turn the gating on, and I wound up working down a path intended to help with the "too many jobs take too long" and "intermittent failures" angles. This actually started out when I added a proper critical path definition for KDE. This rather increased the openQA workload, as it added a bunch of packages to critical path that weren't there before. There was especially a fun moment when a couple hundred KDE package updates got submitted separately as Rawhide updates, and openQA spent a day running 55 tests on all of them, including all the GNOME and Server tests.

As part of getting the KDE stuff added to the critical path, I wound up doing a big update to the script that actually generates the critical path definition, and working on that made me realize it wouldn't be difficult to track the critical path package set by group, not just as one big flat list. That, in turn, could allow us to only run "relevant" openQA tests for an update: if the update is only in the KDE critical path, we don't need to run the GNOME and Server tests on it, for instance. So for the last few weeks I've been working on what turned out to be quite a lot of pieces relevant to that.

First, I added the fundamental support in the critical path generation script. Then I had to make Bodhi work with this. Bodhi decides whether an update is critical path or not, and openQA gets that information from Bodhi. Bodhi, as currently configured, actually gets this information from PDC, which seems to me an unnecessary layer of indirection, especially as we're hoping to retire PDC; Bodhi could just as easily itself be the 'source of truth' for the critical path. So I made Bodhi capable of reading critpath information directly from the files output by the script, then made it use the group information for Greenwave queries and show it in the web UI and API query results. That's all a hard requirement for running fewer tests on some updates, because without that, we would still always gate on all the openQA tests for every critical path update - so if we didn't run all the tests for some update, it would always fail gating. I also changed the Greenwave policies accordingly, to only require the appropriate set of tests to pass for each critical path group, once our production Bodhi is set up to use all this new stuff - until then, the combined policy for the non-grouped decision contexts Bodhi still uses for now winds up identical to what it was before.

Once a new Bodhi release is made and deployed to production, and we configure it to use the new grouped-critpath stuff instead of the flat definition from PDC, all of the groundwork is in place for me to actually change the openQA scheduler to check which critical path group(s) an update is in, and only schedule the appropriate tests. But along the way, I noticed this change meant Bodhi was querying Greenwave for even more decision contexts for each update. Right now for critical path updates Bodhi usually sends two queries to Greenwave (if there are more than seven packages in the update, it sends 2*((number of packages in update+1)/8) queries). With these changes, if an update was in, say, three critical path groups, it would send 4 (or more) queries. This slows things down, and also produces rather awkward and hard-to-understand output in the web UI. So I decided to fix that too. I made it so the gating status displayed in the web UI is combined from however many queries Bodhi has to make, instead of just displaying the result of each query separately. Then I tweaked greenwave to allow querying multiple decision contexts together, and had Bodhi make use of that. With those changes combined, Bodhi should only have to query once for most updates, and for updates with more than seven packages, the displayed gating status won't be confusing any more!

I'm hoping all those Bodhi changes can be deployed to stable soon, so I can move forward with the remaining work needed, and ultimately see how much of an improvement we see. I'm hoping we'll wind up having to run rather fewer tests, which should reduce the wait time for tests to complete and also mitigate the problem of intermittent failures a bit. If this works out well enough, we might be able to move ahead with actually turning on the gating for Rawhide updates, which I'm really looking forward to doing.

AdamW's Debugging Adventures: Bootloaders and machine IDs

Posted by Adam Williamson on January 11, 2022 10:08 PM

Hi folks! Well, it looks like I forgot to blog for...checks watch....checks calendar...a year. Wow. Whoops. Sorry about that. I'm still here, though! We released, uh, lots of Fedoras since the last time I wrote about that. Fedora 35 is the current one. It's, uh, mostly great! Go get a copy, why don't you?

And while that's downloading, you can get comfy and listen to another of Crazy Uncle Adam's Debugging Adventures. In this episode, we'll be uncomfortably reminded just how much of the code that causes your system to actually boot at all consists of fragile shell script with no tests, so this'll be fun!

Last month, booting a system installed from Rawhide live images stopped working properly. You could boot the live image fine, run the installation fine, but on rebooting, the system would fail to boot with an error: dracut: FATAL: Don't know how to handle 'root=live:CDLABEL=Fedora-WS-Live-rawh-20211229-n-1'. openQA caught this, and so did one of our QA community members - Ahed Almeleh - who filed a bug. After the end-of-year holidays, I got to figuring out what was going wrong.

As usual, I got a bit of a head start from pre-existing knowledge. I happen to know that error message is referring to kernel arguments that are set in the bootloader configuration of the live image itself. dracut is the tool that handles an early phase of boot where we boot into a temporary environment that's loaded entirely into system memory, set up the real system environment, and boot that. This early environment is contained in the initrd files you can find alongside the kernel on most Linux distributions; that's what they're for. Part of dracut's job is to be run when a kernel is installed to produce this environment, and then other parts of dracut are included in the environment itself to handle initializing things, finding the real system root, preparing it, and then switching to it. The initrd environments on Fedora live images are built to contain a dracut 'module' (called 90dmsquash-live) that knows to interpret root=live:CDLABEL=Fedora-WS-Live-rawh-20211229-n-1 as meaning 'go look for a live system root on the filesystem with that label and boot that'. Installed systems don't contain that module, because, well, they don't need to know how to do that, and you wouldn't really ever want an installed system to try and do that.

So the short version here is: the installed system has the wrong kernel argument for telling dracut where to find the system root. It should look something like root=/dev/mapper/fedora-root (where we're pointing to a system root on an LVM volume that dracut will set up and then switch to). So the obvious next question is: why? Why is our installed system getting this wrong argument? It seemed likely that it 'leaked' from the live system to the installed system somehow, but I needed to figure out how.

From here, I had kinda two possible ways to investigate. The easiest and fastest would probably be if I happened to know exactly how we deal with setting up bootloader configuration when running a live install. Then I'd likely have been able to start poking the most obvious places right away and figure out the problem. But, as it happens, I didn't at the time remember exactly how that works. I just remembered that I wind up having to figure it out every few years, and it's complicated and scary, so I tend to forget again right afterwards. I kinda knew where to start looking, but didn't really want to have to work it all out again from scratch if I could avoid it.

So I went with the other possibility, which is always: figure out when it broke, and figure out what changed between the last time it worked and the first time it broke. This usually makes life much easier because now you know one of the things on that list is the problem. The shorter and simpler the list, the easier life gets.

I looked at the openQA result history and found that the bug was introduced somewhere between 20211215.n.0 and 20211229.n.1 (unfortunately kind of a wide range). The good news is that only a few packages could plausibly be involved in this bug; the most likely are dracut itself, grub2 (the bootloader), grubby (a Red Hat / Fedora-specific grub configuration...thing), anaconda (the Fedora installer, which obviously does some bootloader configuration stuff), the kernel itself, and systemd (which is of course involved in the boot process itself, but also - perhaps less obviously - is where kernel-install, a script used (on Fedora and many other distros) to 'install' kernels, lives; this was another handy thing I happened to know already, but really - it's always a safe bet to include systemd on the list of potential suspects for anything boot-related).

Looking at what changed between 2021-12-15 and 2021-12-29, we could rule out grub2 and grubby, as they didn't change. There were some kernel builds, but nothing in the scriptlets changed in any way that could be related. dracut got a build with one change, but again it seemed clearly unrelated. So I was down to anaconda and systemd as suspects. On an initial quick check during the vacation, I thought anaconda had not changed, and took a brief look at systemd, but didn't see anything immediately obvious.

When I came back to look at it more thoroughly, I realized anaconda did get a new version (36.12) on 2021-12-15, so that initially interested me quite a lot. I spent some time going through the changes in that version, and there were some that really could have been related - it changed how running things during install inside the installed system worked (which is definitely how we do some bootloader setup stuff during install), and it had interesting commit messages like "Remove the dracut_args attribute" and "Remove upd-kernel". So I spent an afternoon fairly sure it'd turn out to be one of those, reviewed all those changes, mocked up locally how they worked, examined the logs of the actual image composes, and...concluded that none of those seemed to be the problem at all. The installer seemed to still be doing things the same as it always had. There weren't any tell-tale missing or failing bootloader config steps. However, this time wasn't entirely wasted: I was reminded of exactly what anaconda does to configure the bootloader when installing from a live image.

When we install from a live image, we don't do what the 'traditional' installer does and install a bunch of RPM packages using dnf. The live image does not contain any RPM packages. The live image itself was built by installing a bunch of RPM packages, but it is the result of that process. Instead, we essentially set up the filesystems on the drive(s) we're installing to and then just dump the contents of the live image filesystem itself onto them. Then we run a few tweaks to adjust anything that needs adjusting for this now being an installed system, not a live one. One of the things we do is re-generate the initrd file for the installed system, and then re-generate the bootloader configuration. This involves running kernel-install (which places the kernel and initrd files onto the boot partition, and writes some bootloader configuration 'snippet' files), and then running grub2-mkconfig. The main thing grub2-mkconfig does is produce the main bootloader configuration file, but that's not really why we run it at this point. There's a very interesting comment explaining why in the anaconda source:

# Update the bootloader configuration to make sure that the BLS
# entries will have the correct kernel cmdline and not the value
# taken from /proc/cmdline, that is used to boot the live image.

Which is exactly what we were dealing with here. The "BLS entries" we're talking about here are the things I called 'snippet' files above, they live in /boot/loader/entries on Fedora systems. These are where the kernel arguments used at boot are specified, and indeed, that's where the problematic root=live:... arguments were specified in broken installs - in the "BLS entries" in /boot/loader/entries. So it seemed like, somehow, this mechanism just wasn't working right any more - we were expecting this run of grub2-mkconfig in the installed system root after live installation to correct those snippets, but it wasn't. However, as I said, I couldn't establish that any change to anaconda was causing this.

So I eventually shelved anaconda at least temporarily and looked at systemd. And it turned out that systemd had changed too. During the time period in question, we'd gone from systemd 250~rc1 to 250~rc3. (If you check the build history of systemd the dates don't seem to match up - by 2021-12-29 the 250-2 build had happened already, but in fact the 250-1 and 250-2 builds were untagged for causing a different problem, so the 2021-12-29 compose had 250~rc3). By now I was obviously pretty focused on kernel-install as the most likely related part of systemd, so I went to my systemd git checkout and ran:

git log v250-rc1..v250-rc3 src/kernel-install/

which shows all the commits under src/kernel-install between 250-rc1 and 250-rc3. And that gave me another juicy-looking, yet thankfully short, set of commits:

641e2124de6047e6010cd2925ea22fba29b25309 kernel-install: replace 00-entry-directory with K_I_LAYOUT in k-i
357376d0bb525b064f468e0e2af8193b4b90d257 kernel-install: Introduce KERNEL_INSTALL_MACHINE_ID in /etc/machine-info
447a822f8ee47b63a4cae00423c4d407bfa5e516 kernel-install: Remove "Default" from list of suffixes checked

So I went and looked at all of those. And again...I got it wrong at first! This is I guess a good lesson from this Debugging Adventure: you don't always get the right answer at first, but that's okay. You just have to keep plugging, and always keep open the possibility that you're wrong and you should try something else. I spent time thinking the cause was likely a change in anaconda before focusing on systemd, then focused on the wrong systemd commit first. I got interested in 641e212 first, and had even written out a whole Bugzilla comment blaming it before I realized it wasn't the culprit (fortunately, I didn't post it!) I thought the problem was that the new check for $BOOT_ROOT/$MACHINE_ID would not behave as it should on Fedora and cause the install scripts to do something different from what they should - generating incorrect snippet files, or putting them in the wrong place, or something.

Fortunately, I decided to test this before declaring it was the problem, and found out that it wasn't. I did this using something that turned out to be invaluable in figuring out the real problem.

You may have noticed by this point - harking back to our intro - that this critical kernel-install script, key to making sure your system boots, is...a shell script. That calls other shell scripts. You know what else is a big pile of shell scripts? dracut. You know, that critical component that both builds and controls the initial boot environment. Big pile of shell script. The install script - the dracut command itself - is shell. All the dracut modules - the bits that do most of the work - are shell. There's a bit of C in the source tree (I'm not entirely sure what that bit does), but most of it's shell.

Critical stuff like this being written in shell makes me shiver, because shell is very easy to get wrong, and quite hard to test properly (and in fact neither dracut nor kernel-install has good tests). But one good thing about it is that it's quite easy to debug, thanks to the magic of sh -x. If you run some shell script via sh -x (whether that's really sh, or bash or some other alternative pretending to be sh), it will run as normal but print out most of the logic (variable assignments, tests, and so on) that happen along the way. So on a VM where I'd run a broken install, I could do chroot /mnt/sysimage (to get into the root of the installed system), find the exact kernel-install command that anaconda ran from one of the logs in /var/log/anaconda (I forget which), and re-run it through sh -x. This showed me all the logic going on through the run of kernel-install itself and all the scripts it sources under /usr/lib/kernel/install.d. Using this, I could confirm that the check I suspected had the result I suspected - I could see that it was deciding that layout="other", not layout="bls", here. But I could also figure out a way to override that decision, confirm that it worked, and find that it didn't solve the problem: the config snippets were still wrong, and running grub2-mkconfig didn't fix them. In fact the config snippets got wronger - it turned out that we do want kernel-install to pick 'other' rather than 'bls' here, because Fedora doesn't really implement BLS according to the upstream specs, so if we let kernel-install think we do, the config snippets we get are wrong.
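For illustration, the re-run described above looks roughly like this; the kernel version and arguments here are hypothetical, you want the exact command line recorded in the anaconda logs:

chroot /mnt/sysimage
# the version below is a made-up example; use the exact kernel-install invocation from /var/log/anaconda
sh -x /usr/bin/kernel-install add 5.16.0-0.rc8.55.fc36.x86_64 /lib/modules/5.16.0-0.rc8.55.fc36.x86_64/vmlinuz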

So now I'd been wrong twice! But each time, I learned a bit more that eventually helped me be right. After I decided that commit wasn't the cause after all, I finally spotted the problem. I figured this out by continuing with the sh -x debugging, and noticing an inconsistency. By this point I'd thought to find out what bit of grub2-mkconfig should be doing the work of correcting the key bit of configuration here. It's in a Fedora-only downstream patch to one of the scriptlets in /etc/grub.d. It replaces the options= line in any snippet files it finds with what it reckons the kernel arguments "should be". So I got curious about what exactly was going wrong there. I tweaked grub2-mkconfig slightly to run those scriptlets using sh -x by changing these lines in grub2-mkconfig:

echo "### BEGIN $i ###"
"$i"
echo "### END $i ###"

to read:

echo "### BEGIN $i ###"
sh -x "$i"
echo "### END $i ###"

Now I could re-run grub2-mkconfig and look at what was going on behind the scenes of the scriptlet, and I noticed that it wasn't finding any snippet files at all. But why not?

The code that looks for the snippet files reads the file /etc/machine-id as a string, then looks for files in /boot/loader/entries whose names start with that string (and end in .conf). So I went and looked at my sample system and...found that the files in /boot/loader/entries did not start with the string in /etc/machine-id. The files in /boot/loader/entries started with a69bd9379d6445668e7df3ddbda62f86, but the ID in /etc/machine-id was b8d80a4c887c40199c4ea1a8f02aa9b4. This is why everything was broken: because those IDs didn't match, grub2-mkconfig couldn't find the files to correct, so the argument was wrong, so the system didn't boot.

Now I knew what was going wrong and I only had two systemd commits left on the list, it was pretty easy to see the problem. It was in 357376d. That changes how kernel-install names these snippet files when creating them. It names them by finding a machine ID to use as a prefix. Previously, it used whatever string was in /etc/machine-id; if that file didn't exist or was empty, it just used the string "Default". After that commit, it also looks for a value specified in /etc/machine-info. If there's a /etc/machine-id but not /etc/machine-info when you run kernel-install, it uses the value from /etc/machine-id and writes it to /etc/machine-info.

When I checked those files, it turned out that on the live image, the ID in both /etc/machine-id and /etc/machine-info was a69bd9379d6445668e7df3ddbda62f86 - the problematic ID on the installed system. When we generate the live image itself, kernel-install uses the value from /etc/machine-id and writes it to /etc/machine-info, and both files wind up in the live filesystem. But on the installed system, the ID in /etc/machine-info was that same value, but the ID in /etc/machine-id was different (as we saw above).

Remember how I mentioned above that when doing a live install, we essentially dump the live filesystem itself onto the installed system? Well, one of the 'tweaks' we make when doing this is to re-generate /etc/machine-id, because that ID is meant to be unique to each installed system - we don't want every system installed from a Fedora live image to have the same machine ID as the live image itself. However, as this /etc/machine-info file is new, we don't strip it from or re-generate it in the installed system, we just install it. The installed system has a /etc/machine-info with the same ID as the live image's machine ID, but a new, different ID in /etc/machine-id. And this (finally) was the ultimate source of the problem! When we run them on the installed system, the new version of kernel-install writes config snippet files using the ID from /etc/machine-info. But Fedora's patched grub2-mkconfig scriptlet doesn't know about that mechanism at all (since it's brand new), and expects the snippet files to contain the ID from /etc/machine-id.

There are various ways you could potentially solve this, but after consulting with systemd upstream, the one we chose is to have anaconda exclude /etc/machine-info when doing a live install. The changes to systemd here aren't wrong - it does potentially make sense that /etc/machine-id and /etc/machine-info could both exist and specify different IDs in some cases. But for the case of Fedora live installs, it doesn't make sense. The sanest result is for those IDs to match and both be the 'fresh' machine ID that's generated at the end of the install process. By just not including /etc/machine-info on the installed system, we achieve this result, because now when kernel-install runs at the end of the install process, it reads the ID from /etc/machine-id and writes it to /etc/machine-info, and both IDs are the same, grub2-mkconfig finds the snippet files and edits them correctly, the installed system boots, and I can move along to the next debugging odyssey...

On inclusive language: an extended metaphor involving parties because why not

Posted by Adam Williamson on June 17, 2020 12:57 AM

So there's been some discussion within Red Hat about inclusive language lately, obviously related to current events and the worldwide protests against racism, especially anti-Black racism. I don't want to get into any internal details, but in one case we got into some general debate about the validity of efforts to use more inclusive language. I thought up this florid party metaphor, and I figured instead of throwing it at an internal list, I'd put it up here instead. If you have constructive thoughts on it, go ahead and mail me or start a twitter thread or something. If you have non-constructive thoughts on it, keep 'em to yourself!

Before we get into my pontificating, though, here's some useful practical resources if you just want to read up on how you can make the language in your projects and docs more inclusive:

To provide a bit of context: I was thinking about a suggestion that people promoting the use of more inclusive language are "trying to be offended". And here's where my mind went!

Imagine you are throwing a party. You send out the invites, order in some hors d'ouevres (did I spell that right? I never spell that right), queue up some Billie Eilish (everyone loves Billie Eilish, it's a scientific fact), set out the drinks, and wait for folks to arrive. In they all come, the room's buzzing, everyone seems to be having a good time, it's going great!

But then you notice (or maybe someone else notices, and tells you) that most of the people at your party seem to be straight white dudes and their wives and girlfriends. That's weird, you think, I'm an open minded modern guy, I'd be happy to see some Black folks and maybe a cute gay couple or something! What gives? I don't want people to think I'm some kind of racist or sexist or homophobe or something!

So you go and ask some non-white folks and some non-straight folks and some non-male folks what's going on. What is it? Is it me? What did I do wrong?

Well, they say, look, it's a hugely complex issue, I mean, we could be here all night talking about it. And yes, fine, that broken pipeline outside your house might have something to do with it (IN-JOKE ALERT). But since you ask, look, let us break this one part of it down for you.

You know how you've got a bouncer outside, and every time someone rolls up to the party he looks them up and down and says "well hi there! What's your name? Is it on the BLACKLIST or the WHITELIST?" Well...I mean...that might put some folks off a bit. And you know how you made the theme of the party "masters and slaves"? You know, that might have something to do with it too. And, yeah, you see how you sent all the invites to men and wrote "if your wife wants to come too, just put her name in your reply"? I mean, you know, that might speak to some people more than others, you hear what I'm saying?

Now...this could go one of two ways. On the Good Ending, you might say "hey, you know what? I didn't think about that. Thanks for letting me know. I guess next time I'll maybe change those things up a bit and maybe it'll help. Hey thanks! I appreciate it!"

and that would be great. But unfortunately, you might instead opt for the Bad Ending. In the Bad Ending, you say something like this:

"Wow. I mean, just wow. I feel so attacked here. It's not like I called it a 'blacklist' because I'm racist or something. I don't have a racist bone in my body, why do you have to read it that way? You know blacklist doesn't even MEAN that, right? And jeez, look, the whole 'masters and slaves' thing was just a bit of fun, it's not like we made all the Black people the slaves or something! And besides that whole thing was so long ago! And I mean look, most people are straight, right? It's just easier to go with what's accurate for most people. It's so inconvenient to have to think about EVERYBODY all the time. It's not like I'm homophobic or anything. If gay people would just write back and say 'actually I have a husband' or whatever they'd be TOTALLY welcome, I'm all cool with that. God, why do you have to be so EASILY OFFENDED? Why do you want to make me feel so guilty?"

So, I mean. Out of Bad Ending Person and Good Ending Person...whose next party do we think is gonna be more inclusive?

So obviously, in this metaphor, Party Throwing Person is Red Hat, or Google, or Microsoft, or pretty much any company that says "hey, we accept this industry has a problem with inclusion and we're trying to do better", and the party is our software and communities and events and so on. If you are looking at your communities and wondering why they seem to be pretty white and male and straight, and you ask folks for ideas on how to improve that, and they give you some ideas...just listen. And try to take them on board. You asked. They're trying to help. They are not saying you are a BAD PERSON who has done BAD THINGS and OFFENDED them and you must feel GUILTY for that. They're just trying to help you make a positive change that will help more folks feel more welcome in your communities.

You know, in a weird way, if our Party Throwing Person wasn't quite Good Ending Person or Bad Ending person but instead said "hey, you know what, I don't care about women or Black people or gays or whatever, this is a STRAIGHT WHITE GUY PARTY! WOOOOO! SOMEONE TAP THAT KEG!"...that's almost not as bad. At least you know where you stand with that. You don't feel like you're getting gaslit. You can just write that idiot and their party off and try and find another. The kind of Bad Ending Person who keeps insisting they're not racist or sexist or homophobic and they totally want more minorities to show up at their party but they just can't figure out why they all seem to be so awkward and easily offended and why they want to make poor Bad Ending Person feel so guilty...you know...that gets pretty tiring to deal with sometimes.

Fedora CoreOS Test Day coming up on 2020-06-08

Posted by Adam Williamson on June 03, 2020 10:45 PM

Mark your calendars for next Monday, folks: 2020-06-08 will be the very first Fedora CoreOS test day! Fedora QA and the CoreOS team are collaborating to bring you this event. We'll be asking participants to test the bleeding-edge next stream of Fedora CoreOS, run some test cases, and also read over the documentation and give feedback.

All the details are on the Test Day page. You can join in on the day on Freenode IRC, we'll be using #fedora-coreos rather than #fedora-test-day for this event. Please come by and help out if you have the time!

Taskotron is EOL (end of life) today

Posted by Kamil Páral on April 30, 2020 07:49 AM

As previously announced, Taskotron (project page) will be shut down today. See the announcement and its discussion for more details and some background info.

As a result, certain tests (beginning with “dist.“) will no longer appear for new updates in Bodhi (in Automated Tests tab). Some of those tests (and even new ones) will hopefully come back in the future with the help of Fedora CI.

Thank you to everyone who contributed to Taskotron in the past or found our test reports helpful.


Fedora 32 release and Lenovo announcement

Posted by Adam Williamson on April 28, 2020 11:18 PM

It's been a big week in Fedora news: first came the announcement of Lenovo planning to ship laptops preloaded with Fedora, and today Fedora 32 is released. I'm happy this release was again "on time" (at least if you go by our definition and not Phoronix's!), though it was kinda chaotic in the last week or so. We just changed the installer, the partitioning library, the custom partitioning tool, the kernel and the main desktop's display manager - that's all perfectly normal stuff to change a day before you sign off the release, right? I'm pretty confident this is fine!

But seriously folks, I think it turned out to be a pretty good sausage, like most of the ones we've put on the shelves lately. Please do take it for a spin and see how it works for you.

I'm also really happy about the Lenovo announcement. The team working on that has been doing an awful lot of diplomacy and negotiation and cajoling for quite a while now and it's great to see it pay off. The RH Fedora QA team was formally brought into the plan in the last month or two, and Lenovo has kindly provided us with several test laptops which we've distributed around. While the project wasn't public we were clear that we couldn't do anything like making the Fedora 32 release contingent on test results on Lenovo hardware purely for this reason or anything like that, but both our team and Lenovo's have been running tests and we did accept several freeze exceptions to fix bugs like this one, which also affected some Dell systems and maybe others too. Now this project is officially public, it's possible we'll consider adding some official release criteria for the supported systems, or something like that, so look out for proposals on the mailing lists in future.

Automatically shrink your VM disk images when you delete files (Fedora 32 update)

Posted by Kamil Páral on April 24, 2020 03:09 PM

I’ve already written about this in 2017, but things got simpler since then. Time for an update!

If you use virtual machines, your disk images just grow and grow, but never shrink – deleted files inside a VM never free up the space on the host. But you can configure the VM to handle TRIM (discard) commands, and then your disk images will reflect deleted files and shrink as well. Here’s how (with Fedora 32 using qemu 4.2 and virt-manager 2.2).

Adjust VM configuration

  1. When creating a new VM, use qcow2 disk images (that’s the default), not raw.
  2. Your new VM should have VirtIO disks (that’s the default).
  3. In virt-manager, in the VM configuration, select your VirtIO disk, go to Advanced -> Performance, and set Discard mode: unmap (the equivalent domain XML is sketched below).
    (screenshot: the Discard mode setting in virt-manager)
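If you prefer editing the domain XML directly (e.g. with virsh edit) instead of using the virt-manager UI, the same setting is the discard attribute on the disk's driver element; a minimal sketch, with the rest of the disk element left out:

<disk type='file' device='disk'>
  <driver name='qemu' type='qcow2' discard='unmap'/>
  ...
</disk>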

Test it

Now boot your VM and try to issue a TRIM command:

$ sudo fstrim -av
/boot: 908.5 MiB (952631296 bytes) trimmed on /dev/vda1
/: 6.8 GiB (7240171520 bytes) trimmed on /dev/mapper/fedora-root

You should see some output printed, even if it’s just 0 bytes trimmed, not an error.

Let’s see if the disk image actually shrinks. You need to list its size using du (or ls -s) to see the disk allocated size, not the apparent file size (because the disk image is sparse):

$ du -h discardtest.qcow2 
1.4G discardtest.qcow2

Now create a file inside the VM:

$ dd if=/dev/urandom of=file bs=1M count=500

We created a 500 MB file inside the VM and the disk image grew accordingly (give it a few seconds):

$ du -h discardtest.qcow2
1.9G discardtest.qcow2

Now, remove the file inside the VM and issue a TRIM:

$ rm file -f
$ sudo fstrim -av

And the disk image size should shrink back (give it a few seconds):

$ du -h discardtest.qcow2
1.4G discardtest.qcow2

If you configure your system to send TRIM in real-time (see below), it should shrink right after rm and no fstrim should be needed.

Issue TRIM automatically

With Fedora 32, fstrim.timer is automatically enabled and will trim your system once per week. You can reconfigure it to run more frequently, if you want. You can check the timer using:

$ sudo systemctl list-timers
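
For example, one way to make it run daily instead of weekly (just a sketch, the default is fine for most people) is a drop-in created with sudo systemctl edit fstrim.timer containing:

[Timer]
# clear the packaged schedule, then set a new one
OnCalendar=
OnCalendar=daily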

If you want a real-time TRIM, edit /etc/fstab in the VM and add a discard mount option to the filesystem in question, like this:

UUID=6d368798-f4c2-44f9-8334-6be3c64cc449 / ext4 defaults,discard 1 1

This has some performance impact (they say), but the disk image will shrink right after a file is deleted. (Note: XFS as a root filesystem doesn’t issue TRIM commands without additional tweaking, read more here).

Stay informed about QA events

Posted by Kamil Páral on April 09, 2020 02:50 PM

Hello, this is a reminder that you can easily stay informed about important upcoming QA events and help with testing Fedora, especially now during the Fedora 32 development period.

The first obvious option for existing Fedora contributors is to subscribe to the test-announce mailing list. We announce all our QA meetings, test days, composes nominated for testing and other important information in there.

A second, less well-known option, which I want to highlight today, is to add our QA calendar to your calendar software (Google Calendar, Thunderbird, etc). You’ll see our QA meetings (including blocker review meetings) and test days right next to your personal events, so they will be hard to miss. A guide on how to do that is here on our QA homepage.

Thank you to everyone who joins our efforts and helps us make Fedora better.

Do not upgrade to Fedora 32, and do not adjust your sets

Posted by Adam Williamson on February 14, 2020 05:30 PM

If you were unlucky today, you might have received a notification from GNOME in Fedora 30 or 31 that Fedora 32 is now available for upgrade.

This might have struck you as a bit odd, it being rather early for Fedora 32 to be out and there not being any news about it or anything. And if so, you'd be right! This was an error, and we're very sorry for it.

What happened is that a particular bit of data which GNOME Software (among other things) uses as its source of truth about Fedora releases was updated for the branching of Fedora 32...but by mistake, 32 was added with status 'Active' (meaning 'stable release') rather than 'Under Development'. This fooled poor GNOME Software into thinking a new stable release was available, and telling you about it.

Kamil Paral spotted this very quickly and releng fixed it right away, but if your GNOME Software happened to check for updates during the few minutes the incorrect data was up, it will have cached it, and you'll see the incorrect notification for a while.

Please DO NOT upgrade to Fedora 32 yet. It is under heavy development and is very much not ready for normal use. We're very sorry for the incorrect notification and we hope it didn't cause too much disruption.

Using Zuul CI with Pagure.io

Posted by Adam Williamson on February 12, 2020 06:15 PM

I attended Devconf.cz again this year - I'll try and post a full blog post on that soon. One of the most interesting talks, though, was CI/CD for Fedora packaging with Zuul, where Fabien Boucher and Matthieu Huin introduced the work they've done to integrate a specific Zuul instance (part of the Software Factory effort) with the Pagure instance Fedora uses for packages and also with Pagure.io, the general-purpose Pagure instance that many Fedora groups use to host projects, including us in QA.

They've done a lot of work to make it as simple as possible to hook up a project in either Pagure instance to run CI via Zuul, and it looked pretty cool, so I thought I'd try it on one of our projects and see how it compares to other options, like the Jenkins-based Pagure CI.

I wound up more or less following the instructions on this Wiki page, but it does not give you an example of a minimal framework in the project repository itself to actually run some checks. However, after I submitted the pull request for fedora-project-config as explained on the wiki page, Tristan Cacqueray was kind enough to send me this as a pull request for my project repository.

So, all that was needed to get a kind of 'hello world' process running was:

  1. Add the appropriate web hook in the project options
  2. Add the 'zuul' user as a committer on the project in the project options
  3. Get a pull request merged to fedora-project-config to add the desired project
  4. Add a basic Zuul config which runs a single job
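
For the record, that 'hello world' config (step 4 above) can be as small as a .zuul.yaml like this - a sketch which assumes the instance's check pipeline and Zuul's built-in noop job:

# .zuul.yaml - minimal sketch, not necessarily exactly what I used
- project:
    check:
      jobs:
        - noop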

After that, the next step was to have it run useful checks. I set the project up such that all the appropriate checks could be run just by calling tox (which is a great test runner for Python projects) - see the tox configuration. Then, with a bit more help from Tristan, I was able to tweak the Zuul config to run it successfully. This mainly required a couple of things:

  1. Adding nodeset: fedora-31-vm to the Zuul config - this makes the CI job run on a Fedora 31 VM rather than the default CentOS 7 VM (CentOS 7's tox is too old for a modern Python 3 project)
  2. Modifying the job configuration to ensure tox is installed (there's a canned role for this, called ensure-tox) and also all available Python interpreters (using the package module)
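
To illustrate the first point, the nodeset change is just one line in the job definition (the job name here is made up for illustration):

- job:
    name: myproject-tox
    nodeset: fedora-31-vm   # run on a Fedora 31 VM instead of the default CentOS 7 one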

This was all pretty small and easy stuff, and we had the whole thing up and running in a few hours. Now it all works great, so whenever a pull request is submitted for the project, the tests are automatically run and the results shown on the pull request.

You can set up more complex workflows where Zuul takes over merging of pull requests entirely - an admin posts a comment indicating a PR is ready to merge, whereupon Zuul will retest it and then merge it automatically if the test succeeds. This can also be used to merge series of PRs together, with proper testing. But for my small project, this simple integration is enough so far.

It's been a positive experience working with the system so far, and I'd encourage others to try it for their packages and Pagure projects!

AdamW's Debugging Adventures: "dnf is locked by another application"

Posted by Adam Williamson on October 18, 2019 01:45 PM

Gather round the fire, kids, it's time for another Debugging Adventure! These are posts where I write up the process of diagnosing the root cause of a bug, where it turned out to be interesting (to me, anyway...)

This case - Bugzilla #1750575 - involved dnfdragora, the package management tool used on Fedora Xfce, which is a release-blocking environment for the ARM architecture. It was a pretty easy bug to reproduce: any time you updated a package, the update would work, but then dnfdragora would show an error "DNF is locked by another process. dnfdragora will exit.", and immediately exit.

The bug sat around on the blocker list for a while; Daniel Mach (a DNF developer) looked into it a bit but didn't have time to figure it out all the way. So I got tired of waiting for someone else to do it, and decided to work it out myself.

Where's the error coming from?

As a starting point, I had a nice error message - so the obvious thing to do is figure out where that message comes from. The text appears in a couple of places in dnfdragora - in an exception handler and also in a method for setting up a connection to dnfdaemon. So, if we didn't already know (I happened to) this would be the point at which we'd realize that dnfdragora is a frontend app to a backend - dnfdaemon - which does the heavy lifting.

So, to figure out in more detail how we were getting to one of these two points, I hacked both the points where that error is logged. Both of them read logger.critical(errmsg). I changed this to logger.exception(errmsg). logger.exception is a very handy feature of Python's logging module which logs whatever message you specify, plus a traceback to the current state, just like the traceback you get if the app actually crashes. So by doing that, the dnfdragora log (it logs to a file dnfdragora.log in the directory you run it from) gave us a traceback showing how we got to the error:

2019-10-14 17:53:29,436 ERROR dnfdragora dnfdaemon client error: g-io-error-quark: GDBus.Error:org.baseurl.DnfSystem.LockedError: dnf is locked by another application (36)
Traceback (most recent call last):
  File "/usr/bin/dnfdragora", line 85, in <module>
    main_gui.handleevent()
  File "/usr/lib/python3.7/site-packages/dnfdragora/ui.py", line 1273, in handleevent
    if not self._searchPackages(filter, True) :
  File "/usr/lib/python3.7/site-packages/dnfdragora/ui.py", line 949, in _searchPackages
    packages = self.backend.search(fields, strings, self.match_all, self.newest_only, tags )
  File "/usr/lib/python3.7/site-packages/dnfdragora/misc.py", line 135, in newFunc
    rc = func(*args, **kwargs)
  File "/usr/lib/python3.7/site-packages/dnfdragora/dnf_backend.py", line 464, in search
    newest_only, tags)
  File "/usr/lib/python3.7/site-packages/dnfdaemon/client/__init__.py", line 508, in Search
    fields, keys, attrs, match_all, newest_only, tags))
  File "/usr/lib/python3.7/site-packages/dnfdaemon/client/__init__.py", line 293, in _run_dbus_async
    result = self._get_result(data)
  File "/usr/lib/python3.7/site-packages/dnfdaemon/client/__init__.py", line 277, in _get_result
    self._handle_dbus_error(user_data['error'])
  File "/usr/lib/python3.7/site-packages/dnfdaemon/client/__init__.py", line 250, in _handle_dbus_error
    raise DaemonError(str(err))
dnfdaemon.client.DaemonError: g-io-error-quark: GDBus.Error:org.baseurl.DnfSystem.LockedError: dnf is locked by another application (36)
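
(Quick aside: the logger.exception() trick is easy to try out on its own. Here's a tiny standalone sketch - nothing to do with the real dnfdragora code - showing what it buys you over logger.critical():)

import logging

logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger("example")

def do_something():
    raise RuntimeError("dnf is locked by another application")

try:
    do_something()
except RuntimeError:
    # logger.critical() would record only the message; logger.exception()
    # records the message plus a traceback of how we got here
    logger.exception("dnfdaemon client error")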

So, this tells us quite a bit of stuff. We know we're crashing in some sort of 'search' operation, and dbus seems to be involved. We can also see a bit more of the architecture here. Note how we have dnfdragora/dnf_backend.py and dnfdaemon/client/__init__.py included in the trace, even though we're only in the dnfdragora executable here (dnfdaemon is a separate process). Looking at that and then looking at those files a bit, it's quite easy to see that the dnfdaemon Python library provides a sort of framework for a client class called (oddly enough) DnfDaemonBase which the actual client - dnfdragora in our case - is expected to subclass and flesh out. dnfdragora does this in a class called DnfRootBackend, which inherits from both dnfdragora.backend.Backend (a sort of abstraction layer for dnfdragora to have multiple of these backends, though at present it only actually has this one) and dnfdaemon.client.Client, which is just a small extension to DnfDaemonBase that adds some dbus signal handling.

So now we know more about the design we're dealing with, and we can also see that we're trying to do some sort of search operation which looks like it works by the client class communicating with the actual dnfdaemon server process via dbus, only we're hitting some kind of error in that process, and interpreting it as 'dnf is locked by another application'. If we dig a little deeper, we can figure out a bit more. We have to read through all of the backtrace frames and examine the functions, but ultimately we can figure out that DnfRootBackend.Search() is wrapped by dnfdragora.misc.ExceptionHandler, which handles dnfdaemon.client.DaemonError exceptions - like the one that's ultimately getting raised here! - by calling the base class's own exception_handler() on them...and for us, that's BaseDragora.exception_handler, one of the two places we found earlier that ultimately produces this "DNF is locked by another process. dnfdragora will exit" text. We also now have two indications (the dbus error itself, and the code in exception_handler()) that the error we're dealing with is "LockedError".

A misleading error...

At this point, I went looking for the text LockedError, and found it in two files in dnfdaemon that are kinda variants on each other - daemon/dnfdaemon-session.py and daemon/dnfdaemon-system.py. I didn't actually know offhand which of the two is used in our case, but it doesn't really matter, because the codepath to LockedError is the same in both. There's a function called check_lock() which checks that self._lock == sender, and if it doesn't, raises LockedError. That sure looks like where we're at.

So at this point I did a bit of poking around into how self._lock gets set and unset in the daemon. It turns out to be pretty simple. The daemon is basically implemented as a class with a bunch of methods that are wrapped by @dbus.service.method, which makes them accessible as DBus methods. (One of them is Search(), and we can see that the client class's own Search() basically just calls that). There are also methods called Lock() and Unlock(), which - not surprisingly - set and release this lock, by setting the daemon class' self._lock to be either an identifier for the DBus client or None, respectively. And when the daemon is first initialized, the value is set to None.
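
To make the shape of that clearer, here's a minimal sketch of the pattern (my own toy code, not the actual dnfdaemon implementation):

class ToyDaemon:
    def __init__(self):
        self._lock = None                      # DBus sender id, or None when unlocked

    def Lock(self, sender):
        if self._lock is None:
            self._lock = sender

    def Unlock(self, sender):
        if self._lock == sender:
            self._lock = None

    def check_lock(self, sender):
        # passes only if this exact sender holds the lock: it fails both when
        # someone else holds it and when nobody holds it at all
        if self._lock != sender:
            raise RuntimeError("dnf is locked by another application")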

At this point, I realized that the error we're dealing with here is actually a lie in two important ways:

  1. The message claims that the problem is the lock being held "by another application", but that's not what check_lock() checks, really. It passes only if the caller holds the lock. It does fail if the lock is held "by another application", but it also fails if the lock is not held at all. Given all the code we looked at so far, we can't actually trust the message's assertion that something else is holding the lock. It is also possible that the lock is not held at all.
  2. The message suggests that the lock in question is a lock on dnf itself. I know dnf/libdnf do have locking, so up to now I'd been assuming we were actually dealing with the locking in dnf itself. But at this point I realized we weren't. The dnfdaemon lock code we just looked at doesn't actually call or wrap dnf's own locking code in any way. This lock we're dealing with is entirely internal to dnfdaemon. It's really a lock on the dnfdaemon instance itself.

So, at this point I started thinking of the error as being "dnfdaemon is either locked by another DBus client, or not locked at all".

So what's going on with this lock anyway?

My next step, now I understood the locking process we're dealing with, was to stick some logging into it. I added log lines to the Lock() and Unlock() methods, and I also made check_lock() log what sender and self._lock were set to before returning. Because it sets self._lock to None, I also added a log line to the daemon's init that just records that we're in it. That got me some more useful information:

2019-10-14 18:53:03.397784 XXX In DnfDaemon.init now!
2019-10-14 18:53:03.402804 XXX LOCK: sender is :1.1835
2019-10-14 18:53:03.407524 XXX CHECK LOCK: sender is :1.1835
XXX CHECK LOCK: self._lock is :1.1835
2019-10-14 18:53:07.556499 XXX CHECK LOCK: sender is :1.1835
XXX CHECK LOCK: self._lock is :1.1835
[...snip a bunch more calls to check_lock where the sender is the same...]
2019-10-14 18:53:13.560828 XXX CHECK LOCK: sender is :1.1835
XXX CHECK LOCK: self._lock is :1.1835
2019-10-14 18:53:13.560941 XXX CHECK LOCK: sender is :1.1835
XXX CHECK LOCK: self._lock is :1.1835
2019-10-14 18:53:16.513900 XXX In DnfDaemon.init now!
2019-10-14 18:53:16.516724 XXX CHECK LOCK: sender is :1.1835
XXX CHECK LOCK: self._lock is None

so we could see that when we started dnfdragora, dnfdaemon started up and dnfdragora locked it almost immediately, then throughout the whole process of reproducing the bug - run dnfdragora, search for a package to be updated, mark it for updating, run the transaction, wait for the error - there were several instances of DBus method calls where everything worked fine (we see check_lock() being called and finding sender and self._lock set to the same value, the identifier for dnfdragora), but then suddenly we see the daemon's init running again for some reason, not being locked, and then a check_lock() call that fails because the daemon instance's self._lock is None.

After a couple of minutes, I guessed what was going on here, and the daemon's service logs confirmed it - dnfdaemon was crashing and automatically restarting. The first attempt to invoke a DBus method after the crash and restart fails, because dnfdragora has not locked this new instance of the daemon (it has no idea it just crashed and restarted), so check_lock() fails. So as soon as a DBus method invocation is attempted after the dnfdaemon crash, dnfdragora errors out with the confusing "dnf is locked by another process" error.

The crash was already mentioned in the bug report, but until now the exact interaction between the crash and the error had not been worked out - we just knew the daemon crashed and the app errored out, but we didn't really know what order those things happened in or how they related to each other.

OK then...why is dnfdaemon crashing?

So, the question now became: why is dnfdaemon crashing? Well, the backtrace we had didn't tell us a lot; really it only told us that something was going wrong in libdbus, which we could also tell from the dnfdaemon service log:

Oct 14 18:53:15 adam.happyassassin.net dnfdaemon-system[226042]: dbus[226042]: arguments to dbus_connection_unref() were incorrect, assertion "connection->generation == _dbus_current_generation" failed in file ../../dbus/dbus-connection.c line 2823.
Oct 14 18:53:15 adam.happyassassin.net dnfdaemon-system[226042]: This is normally a bug in some application using the D-Bus library.
Oct 14 18:53:15 adam.happyassassin.net dnfdaemon-system[226042]:   D-Bus not built with -rdynamic so unable to print a backtrace

that last line looked like a cue, so of course, off I went to figure out how to build DBus with -rdynamic. A bit of Googling told me - thanks "the3dfxdude"! - that the trick is to compile with --enable-asserts. So I did that and reproduced the bug again, and got a bit of a better backtrace. It's a long one, but by picking through it carefully I could spot - in frame #17 - the actual point at which the problem happened, which was in dnfdaemon.server.DnfDaemonBase.run_transaction(). (Note, this is a different DnfDaemonBase class from dnfdaemon.client.DnfDaemonBase; I don't know why they have the same name, that's just confusing).

So, the daemon's crashing on this self.TransactionEvent('end-run', NONE) call. I poked into what that does a bit, and found a design here that kinda mirrors what happens on the client side: this DnfDaemonBase, like the other one, is a framework for a full daemon implementation, and it's subclassed by a DnfDaemon class here. That class defines a TransactionEvent method that emits a DBus signal. So...we're crashing when trying to emit a dbus signal. That all adds up with the backtrace going through libdbus and all. But, why are we crashing?

At this point I tried to make a small reproducer (which basically just set up a DnfDaemon instance and called self.TransactionEvent in the same way, I think) but that didn't work - I didn't know why at the time, but figured it out later. Continuing to trace it out through code wouldn't be that easy because now we're in DBus, which I know from experience is a big complex codebase that's not that easy to just reason your way through. We had the actual DBus error to work from too - "arguments to dbus_connection_unref() were incorrect, assertion "connection->generation == _dbus_current_generation" failed" - and I looked into that a bit, but there were no really helpful leads there (I got a bit more understanding about what the error means exactly, but it didn't help me understand why it was happening at all).

Time for the old standby...

So, being a bit stuck, I fell back on the most trusty standby: trial and error! Well, also a bit of logic. It did occur to me that the dbus broker is itself a long-running daemon that other things can talk to. So I started just wondering if something was interfering with dnfdaemon's connection with the dbus broker, somehow. This was in my head as I poked around at stuff - wherever I wound up looking, I was looking for stuff that involved dbus.

But to figure out where to look, I just started hacking up dnfdaemon a bit. Now this first part is probably pure intuition, but that self._reset_base() call on the line right before the self.TransactionEvent call that crashes bugged me. It's probably just long experience telling me that anything with "reset" or "refresh" in the name is bad news. :P So I thought, hey, what happens if we move it?

I stuck some logging lines into this run_transaction so I knew where we got to before we crashed - this is a great dumb trick, btw, just stick lines like self.logger('XXX HERE 1'), self.logger('XXX HERE 2') etc. between every significant line in the thing you're debugging, and grep the logs for "XXX" - and moved the self._reset_base() call down under the self.TransactionEvent call...and found that when I did that, we got further, the self.TransactionEvent call worked and we crashed the next time something else tried to emit a DBus signal. I also tried commenting out the self._reset_base() call entirely, and found that now we would only crash the next time a DBus signal was emitted after a subsequent call to the Unlock() method, which is another method that calls self._reset_base(). So, at this point I was pretty confident in this description: "dnfdaemon is crashing on the first interaction with DBus after self._reset_base() is called".

So my next step was to break down what _reset_base() was actually doing. Turns out all of the detail is in the DnfDaemonBase skeleton server class: it has a self._base which is a dnf.base.Base() instance, and that method just calls that instance's close() method and sets self._base to None. So off I went into dnf code to see what dnf.base.Base.close() does. Turns out it basically does two things: it calls self._finalize_base() and then calls self.reset(True, True, True).

Looking at the code it wasn't immediately obvious which of these would be the culprit, so it was all aboard the trial and error train again! I changed the call to self._reset_base() in the daemon to self._base.reset(True, True, True)...and the bug stopped happening! So that told me the problem was in the call to _finalize_base(), not the call to reset(). So I dug into what _finalize_base() does and kinda repeated this process - I kept drilling down through layers and splitting up what things did into individual pieces, and doing subsets of those pieces at a time to try and find the "smallest" thing I could which would cause the bug.

To take a short aside...this is what I really like about these kinds of debugging odysseys. It's like being a detective, only ultimately you know that there's a definite reason for what's happening and there's always some way you can get closer to it. If you have enough patience there's always a next step you can take that will get you a little bit closer to figuring out what's going on. You just have to keep working through the little steps until you finally get there.

Eventually I lit upon this bit of dnf.rpm.transaction.TransactionWrapper.close(). That was the key, as close as I could get to it: reducing the daemon's self._reset_base() call to just self._base._priv_ts.ts = None (which is what that line does) was enough to cause the bug. That was the one thing out of all the things that self._reset_base() does which caused the problem.

So, of course, I took a look at what this ts thing was. Turns out it's an instance of rpm.TransactionSet, from RPM's Python library. So, at some point, we're setting up an instance of rpm.TransactionSet, and at this point we're dropping our reference to it, which - point to ponder - might trigger some kind of cleanup on it.

Remember how I was looking for things that deal with dbus? Well, that turned out to bear fruit at this point...because what I did next was simply to go to my git checkout of rpm and grep it for 'dbus'. And lo and behold...this showed up.

Turns out RPM has plugins (TIL!), and in particular, it has this one, which talks to dbus. (What it actually does is try to inhibit systemd from suspending or shutting down the system while a package transaction is happening). And this plugin has a cleanup function which calls something called dbus_shutdown() - aha!

This was enough to get me pretty suspicious. So I checked my system and, indeed, I had a package rpm-plugin-systemd-inhibit installed. I poked at dependencies a bit and found that python3-dnf recommends that package, which means it'll basically be installed on nearly all Fedora installs. Still looking like a prime suspect. So, it was easy enough to check: I put the code back to a state where the crash happened, uninstalled the package, and tried again...and bingo! The crash stopped happening.

So at this point the case was more or less closed. I just had to do a bit of confirming and tidying up. I checked and it turned out that indeed this call to dbus_shutdown() had been added quite recently, which tied in with the bug not showing up earlier. I looked up the documentation for dbus_shutdown() which confirmed that it's a bit of a big cannon which certainly could cause a problem like this:

"Frees all memory allocated internally by libdbus and reverses the effects of dbus_threads_init().

libdbus keeps internal global variables, for example caches and thread locks, and it can be useful to free these internal data structures.

...

You can't continue to use any D-Bus objects, such as connections, that were allocated prior to dbus_shutdown(). You can, however, start over; call dbus_threads_init() again, create new connections, and so forth."

and then I did a scratch build of rpm with the commit reverted, tested, and found that indeed, it solved the problem. So, we finally had our culprit: when the rpm.TransactionSet instance went out of scope, it got cleaned up, and that resulted in this plugin's cleanup function getting called, and dbus_shutdown() happening. The RPM devs had intended that call to clean up the RPM plugin's DBus handles, but this is all happening in a single process, so the call also cleaned up the DBus handles used by dnfdaemon itself, and that was enough (as the docs suggest) to cause any further attempts to communicate with DBus in dnfdaemon code to blow up and crash the daemon.

So, that's how you get from dnfdragora claiming that DNF is locked by another process to a stray RPM plugin crashing dnfdaemon on a DBus interaction!

Converting fedmsg consumers to fedora-messaging

Posted by Adam Williamson on June 10, 2019 12:24 PM

So in case you hadn't heard, the Fedora infrastructure team is currently trying to nudge people in the direction of moving from fedmsg to fedora-messaging.

Fedmsg is the Fedora project-wide messaging bus we've had since 2012. It backs FMN / Fedora Notifications and Badges, and is used extensively within Fedora infrastructure for the general purpose of "have this one system do something whenever this other system does something else". For instance, openQA job scheduling and result reporting are both powered by fedmsg.

Over time, though, there have turned out to be a few issues with fedmsg. It has a few awkward design quirks, but most significantly, it's designed such that message delivery can never be guaranteed. In practice it's very reliable and messages almost always are delivered, but for building critical systems like Rawhide package gating, the infrastructure team decided we really needed a system where message delivery can be formally guaranteed.

There was initially an idea to build a sort of extension to fedmsg allowing for message delivery to be guaranteed, but in the end it was decided instead to replace fedmsg with a new AMQP-based system called fedora-messaging. At present both fedmsg and fedora-messaging are live and there are bridges in both directions: all messages published as fedmsgs are republished as fedora-messaging messages by a 0MQ->AMQP bridge, and all messages published as fedora-messaging messages are republished as fedmsgs by an AMQP->0MQ bridge. This is intended to ease the migration process by letting you migrate a publisher or consumer of fedmsgs to fedora-messaging at any time without worrying about whether the corresponding consumers and/or publishers have also been migrated.

This is just the sort of project I usually work on in the 'quiet time' after one release comes out and before the next one really kicks into high gear, so since Fedora 30 just came out, last week I started converting the openQA fedmsg consumers to fedora-messaging. Here's a quick write-up of the process and some of the issues I found along the way!

I found these three pages in the fedora-messaging docs to be the most useful:

  1. Consumers
  2. Messages
  3. Configuration (especially the 'consumer-config' part)

Another important bit you might need are the sample config files for the production broker and stable broker.

All the fedmsg consumers I wrote followed this approach, where you essentially write consumer classes and register them as entry points in the project's setup.py. Once the project is installed, the fedmsg-hub service provided by fedmsg runs all these registered consumers (as long as a configuration setting is set to turn them on).
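
For reference, such a consumer looked roughly like this (a from-memory sketch with made-up names, not one of the real openQA consumers):

import fedmsg.consumers

class ExampleConsumer(fedmsg.consumers.FedmsgConsumer):
    # the topic(s) to subscribe to were defined on the consumer class itself
    topic = "org.fedoraproject.prod.buildsys.build.state.change"
    # fedmsg-hub only ran the consumer if this key was set to True in /etc/fedmsg.d
    config_key = "example.consumer.enabled"

    def consume(self, message):
        # message arrives as a plain dict; do the actual work here
        pass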

This exact pattern does not exist in fedora-messaging - there is no hub service. But fedora-messaging does provide a somewhat-similar pattern which is the natural migration path for this type of consumer. In this approach you still have consumer classes, but instead of registering them as entry points, you write configuration files for them and place them in /etc/fedora-messaging. You can then run an instantiated systemd service that runs fedora-messaging consume with the configuration file you created.

So to put it all together with a specific example: to schedule openQA jobs, we had a fedmsg consumer class called OpenQAScheduler which was registered as a moksha.consumer called fedora_openqa.scheduler.prod in setup.py, and had a config_key named "fedora_openqa.scheduler.prod.enabled". As long as a config file in /etc/fedmsg.d contained 'fedora_openqa.scheduler.prod.enabled': True, the fedmsg-hub service then ran this consumer. The consumer class itself defined what messages it would subscribe to, using its topic attribute.

In a fedora-messaging world, the OpenQAScheduler class is tweaked a bit to handle an AMQP-style message, and the entrypoint in setup.py and the config_key in the class are removed. Instead, we create a configuration file /etc/fedora-messaging/fedora_openqa_scheduler.toml and enable and start the fm-consumer@fedora_openqa_scheduler.service systemd service. Note that all the necessary bits for this are shipped in the fedora-messaging package, so you need that package installed on the system where the consumer will run.

That configuration file looks pretty much like the sample I put in the repository. This is based on the sample files I mentioned above.

The amqp_url specifies which AMQP broker to connect to and what username to use: in this sample we're connecting to the production Fedora broker and using the public 'fedora' identity. The callback specifies the Python path to the consumer callback class (our OpenQAScheduler class). The [tls] section points to the CA certificate, certificate and private key to be used for authenticating with the broker: since we're using the public 'fedora' identity, these are the files shipped in the fedora-messaging package itself which let you authenticate as that identity. For production use, I think the intent is that you request a separate identity from Fedora infra (who will generate certs and keys for it) and use that instead - so you'd change the amqp_url and the paths in the [tls] section appropriately.

The other key things you have to set are the queue name - which appears twice in the sample file as 00000000-0000-0000-0000-000000000000, for each consumer you are supposed to generate a random UUID with uuidgen and use that as the queue name, each consumer should have its own queue - and the routing_keys in the [[bindings]] section. Those are the topics the consumer will subscribe to - unlike in the fedmsg system, this is set in configuration rather than in the consumer class itself. Another thing you may wish to take advantage of is the consumer_config section: this is basically a freeform configuration store that the consumer class can read settings from. So you can have multiple configuration files that run the same consumer class but with different settings - you might well have different 'production' and 'staging' configurations. We do indeed use this for the openQA job scheduler consumer: we use a setting in this consumer_config section to specify the hostname of the openQA instance to connect to.
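
To give a rough idea of the shape of the file, here's an illustrative sketch based on the upstream samples - the callback path, topics and consumer_config values are placeholders, not the real values from my config:

amqp_url = "amqps://fedora:@rabbitmq.fedoraproject.org/%2Fpublic_pubsub"
callback = "fedora_openqa.consumer:OpenQAScheduler"

[tls]
ca_cert = "/etc/fedora-messaging/cacert.pem"
keyfile = "/etc/fedora-messaging/fedora-key.pem"
certfile = "/etc/fedora-messaging/fedora-cert.pem"

[queues.00000000-0000-0000-0000-000000000000]
durable = true

[[bindings]]
queue = "00000000-0000-0000-0000-000000000000"
exchange = "amq.topic"
routing_keys = ["org.fedoraproject.prod.bodhi.update.request.testing"]

[consumer_config]
openqa_hostname = "openqa.example.org"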

So, what needs changing in the actual consumer class itself? For me, there wasn't a lot. For a start, the class should now just inherit from object - there is no base class for consumers in the fedora-messaging world, there's no equivalent to fedmsg.consumers.FedmsgConsumer. You can remove things like the topic attribute (that's now set in configuration) and validate_signatures. You may want to set up an __init__(), which is a good place to read in settings from consumer_config and set up a logger (more on logging in a bit). The method for actually reading a message should be named __call__() (so yes, fedora-messaging just calls the consumer instance itself on the message, rather than explicitly calling one of its methods). And the message object the method receives is slightly different: it will be an instance of fedora_messaging.api.Message or a subclass of it, not just a dict. The topic, body and other bits of the message are available as attributes, not dict items. So instead of message['topic'], you'd use message.topic. The message body is message.body.
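
Put together, a converted consumer class ends up looking something like this (again a sketch with invented names, not the real OpenQAScheduler):

import logging

from fedora_messaging import config

class ExampleConsumer(object):
    def __init__(self):
        # settings come from the consumer_config section of the TOML file
        self.openqa_hostname = config.conf["consumer_config"].get("openqa_hostname", "localhost")
        self.logger = logging.getLogger(self.__class__.__name__)

    def __call__(self, message):
        # fedora-messaging calls the instance itself with each message;
        # topic and body are attributes now, not dict keys
        self.logger.info("received message on topic %s", message.topic)
        body = message.body
        # ...do the actual work with body and self.openqa_hostname here...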

Here I ran into a significant wrinkle. If you're consuming a native fedora-messaging message, the message.body will be the actual body of the message. However, if you're consuming a message that was published as a fedmsg and has been republished by the fedmsg->fedora-messaging bridge, message.body won't be what you'd probably expect. Looking at an example fedmsg, we'd probably expect the message.body of the converted fedora-messaging message to be just the msg dict, right? Just a dict with keys repo and agent. However, at present, the bridge actually publishes the entire fedmsg as the message.body - what you get as message.body is that whole dict. To get to the 'true' body, you have to take message.body['msg']. This is a problem because whenever the publisher is converted to fedora-messaging, there won't be a message.body['msg'] any more, and your consumer will likely break. It seems that the bridge's behavior here will likely be changed soon, but for now, this is a bit of a problem.

Once I figured this out, I wrote a little helper function called _find_true_body to fudge around this issue. You are welcome to steal it for your own use if you like. It should always find the 'true' body of any message your consumer receives, whether it's native or converted, and it will work when the bridge is fixed in future too so you won't need to update your consumer when that happens (though later on down the road it'll be safe to just get rid of the function and use message.body directly).
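
The idea behind it is roughly this (a paraphrase of the approach, not the exact code):

def _find_true_body(message):
    # Messages republished by the fedmsg->fedora-messaging bridge currently
    # carry the whole fedmsg dict as the body, with the real payload nested
    # under 'msg'; native fedora-messaging messages do not.
    body = message.body
    if isinstance(body, dict) and "msg" in body and "topic" in body:
        return body["msg"]
    return body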

Those things, plus rejigging the logging a bit, were all I needed to do to convert my consumers - it wasn't really that much work in the end.

To dig into logging a bit more: fedmsg consumer class instances had a log() method you could use to send log messages, you didn't have to set up your own logging infrastructure. (Although a problem of this system was that it gave no indication which consumer a log message came from). fedora-messaging does not have this. If you want a consumer to log, you have to set up the logging infrastructure within the consumer, and tweak the configuration file a bit.

The pattern I chose was to import logging and then init a logger instance for each consumer class in its __init__(), like this:

self.logger = logging.getLogger(self.__class__.__name__)

Then you can log messages with self.logger.info("message") or whatever. I thought that would be all I'd need, but actually, if you just do that, there's nothing set up to actually receive the messages and log them anywhere. So you have to add a bit to the TOML config file that looks like this:

[log_config.loggers.OpenQAScheduler]
level = "INFO"
propagate = false
handlers = ["console"]

the OpenQAScheduler there is the class name; change it to the actual name of the consumer class. That will have the messages logged to the console, which - when you run the consumer as a systemd service - means they wind up in the system journal, which was enough for me. You can also configure a handler to send email alerts, for instance, if you like - you can see an example of this in Bodhi's config file.

One other wrinkle I ran into was with authenticating to the staging broker. The sample configuration file has the right URL and [tls] section for this, but the files referenced in the [tls] section aren't actually in the fedora-messaging package. To successfully connect to the staging broker, as fedora.stg, you need to grab the necessary files from the fedora-messaging git repo and place them into /etc/fedora-messaging.

To see the whole of the changes I had to make to the openQA consumers, you can look at the commits on the fedora-messaging branch of the repo and also this set of commits to the Fedora infra ansible repo.

New openQA tests: update live image build/install

Posted by Adam Williamson on February 08, 2019 10:20 AM

Hot on the heels of adding installer image build/install tests to openQA, I've now added tests which do just the same, but for the Workstation live image.

That means that, when running the desktop tests for an update, openQA will also run a test that builds a Workstation live image and a test that boots and installs it. The packages from the update will be used - if relevant - in the live image creation environment, and included in the live image itself. This will allow us to catch problems in updates that relate to the build and basic functionality of live images.

Here's an update where you can see that both the installer and live image build tests ran successfully and passed - see the updates-everything-boot-iso and updates-workstation-live-iso flavors.

I'm hoping this will help us catch compose issues much more easily during the upcoming Fedora 30 release process.

Devconf.cz 2019 trip report

Posted by Adam Williamson on February 06, 2019 11:01 AM

I've just got back from my Devconf.cz 2019 trip, after spending a few days after the conference in Red Hat's Brno office with other Fedora QA team members, then a few days visiting family.

I gave both my talks - Don't Move That Fence 'Til You Know Why It's There and Things Fedora QA Robots Do - and both were well-attended and, I think, well-received. The slide decks are up on the talk pages, and recordings should I believe go up on the Devconf Youtube channel soon.

I attended many other talks, my personal favourite being Stef Walter's Using machine learning to find Linux bugs. Stef noticed something I also have noticed in our openQA work - that "test flakes" are very often not just some kind of "random blip" but genuine bugs that can be investigated and fixed with a little care - and ran with it, using the extensive amount of results generated by the automated test suite for Cockpit as input data for a machine learning-based system which clusters "test flakes" based on an analysis of key data from the logs for each test. In this way they can identify when a large number of apparent "flakes" seem to have significant common features and are likely to be occurrences of the same bug, allowing them then to go back and analyze the commonalities between those cases and identify the underlying bug. We likely aren't currently running enough tests in openQA to utilize the approach Stef outlined in full, but the concept is very interesting and may be useful in future with more data, and perhaps for Fedora CI results.

Other useful and valuable talks I saw included Dan Walsh on podman, Lennart Poettering on portable services, Daniel Mach and Jaroslav Mracek on the future of DNF, Kevin Fenzi and Stephen Smoogen on the future of EPEL, Jiri Benc and Marian Šámal on a summer coding camp for kids, Ben Cotton on community project management, the latest edition of Will Woods' and Stephen Gallagher's crusade to kill all scriptlets, and the Fedora Council BoF.

There were also of course lots of useful "hallway track" sessions with Miroslav Vadkerti, Kevin Fenzi, Mohan Boddu, Patrick Uiterwijk, Alexander Bokovoy, Dominik Perpeet, Matthew Miller, Brian Exelbierd and many more - it is invaluable to be able to catch up with people in person and discuss things that are harder to do in tickets and IRC.

As usual it was an enjoyable and productive event, and the rum list at the Bar That Doesn't Exist remains as impressive as ever...;)

Devconf.cz 2019

Posted by Adam Williamson on January 24, 2019 12:31 PM

For anyone who - inexplicably - hasn't already had it in their social calendar in pink sharpie for months, I will be at Devconf.cz 2019 this weekend, at FIT VUT in Brno. I'll be doing two talks: Things Fedora QA Robots Do on Friday at 3pm (which is basically a brain dump about the pile of little fedmsg consumers that do quite important jobs that probably no-one knows about but me), and Don't Move That Fence 'Til You Know Why It's There on Saturday at 11am, which is a less QA-specific talk that's about how I reckon you ought to go about changing code. The slides for both talks are up now, if you want a sneak preview (though if you do, you're disqualified from the audience participation section of the "fence" talk!)

Do come by to the talks, if you're around and there's nothing more interesting in that timeslot. Otherwise feel free to buttonhole me around the conference any time.

New openQA tests: update installer tests and desktop app start/stop test

Posted by Adam Williamson on January 23, 2019 02:20 AM

It's been a while since I wrote about significant developments in Fedora openQA, so today I'll be writing about two! I wrote about one of them a bit in my last post, but that was primarily about a bug I ran into along the way, so now let's focus on the changes themselves.

Testing of install media built from packages in updates-testing

We have long had a problem in Fedora testing that we could not always properly test installer changes. This is most significant during the period of development after a new release has branched from Rawhide, but before it is released as the new stable Fedora release (we use the name 'Branched' to refer to a release in this state; in a month or so, Fedora 30 will branch from Rawhide and become the current Branched release).

During most of this time, the Bodhi update system is enabled for the release. New packages built for the release do not immediately appear in any repositories, but - as with stable releases - must be submitted as "updates", sometimes together with related packages. Once submitted as an update, the package(s) are sent to the "updates-testing" repository for the release. This repository is enabled on installed Branched systems by default (this is a difference from stable releases), so testers who have already installed Branched will receive the package(s) at this point (unless they disable the "updates-testing" repository, which some do). However, the package is still not truly a part of the release at this point. It is not included in the nightly testing composes, nor will it be included in any Beta or Final candidate composes that may be run while it is in updates-testing. That means that if the actual release media were composed while the package was still in updates-testing, it would not be a part of the release proper. Packages only become part of these composes once they pass through Bodhi and are 'pushed stable'.

This system allows us to back out packages that turn out to be problematic, and hopefully to prevent them from destabilizing the test and release composes by not pushing them stable if they turn out to cause problems. It also means more conservative testers have the option to disable the "updates-testing" repository and avoid some destabilizing updates, though of course if all the testers did this, no-one would be finding the problems. In the last few years we have also been running several automated tests on updates (via Taskotron, openQA and the CI pipeline) and reporting results from those to Bodhi, allowing packagers to pull the update if the tests find problems.

However, there has long been a bit of a problem in this process: if the update works fine on an installed system but causes problems if included in (for example) an installer image or live image, we have no very good way to find this out. There was no system for automatically building media like this that include the updates currently in testing so they could be tested. The only way to find this sort of problem was for testers to manually create test media - a process that is not widely understood, is time consuming, and can be somewhat difficult. We also of course could not do automated testing without media to test.

We've looked at different ways of addressing this in the past, but ultimately none of them came to much (yet), so last year I decided to just go ahead and do something. And after a bit of a roadblock (see that last post), that something is now done!

Our openQA now has two new tests it runs on all the updates it tests. The first test - here's an example run - builds a network install image, and the second - example run - tests it. Most importantly, any packages from the update under testing are both used in the process of building the install image (if they are relevant to that process) and included in the installer image (if they are packages which would usually be in such an image). Thus if the update breaks the production of the image, or the basic functionality of the image itself, this will be caught. This (finally) means that we have some idea whether a new anaconda, lorax, pykickstart, systemd, dbus, blivet, dracut or any one of dozens of other key packages might break the installer. If you're a packager and you see that one of these two tests has failed for your update, we should look into that! If you're not sure how to go about that, you can poke me, bcl, or the anaconda developers in Freenode #anaconda, and we should be able to help.

It is also possible for a human tester to download the image produced by the first test and run more in-depth tests on it manually; I haven't yet done anything to make that possibility more visible or easier, but will try to look into ways of doing that over the next few weeks.

GNOME application start/stop testing

My colleague Lukáš Růžička has recently been looking into what we might be able to do to streamline and improve our desktop application testing, something I'd honestly been avoiding because it seemed quite intractable! After some great work by Lukáš, one major fruit of this work is now visible in Fedora openQA: a GNOME application start/stop test suite. Here's an example run of it - note that more recent runs have a ton of failures caused by a change in GNOME, Lukáš has proposed a change to the test to address that but I have not yet reviewed it.

This big test suite just tests starting and then exiting a large number of the default installed applications on the Fedora Workstation edition, making sure they both launch and exit successfully. This is of course pretty easy for a human to do - but it's extremely tedious and time-consuming, so it's something we don't do very often at all (usually only a handful of times per release cycle), meaning we may not notice that an application which perhaps we don't commonly use has a very critical bug (like failing to launch at all) for some time.

Making an automated system like openQA do this is actually quite a lot of work, so it was a great job by Lukáš to get it working. Now, by closely monitoring the results of this test on the nightly composes, we should find out much more quickly if one of the tested applications is completely broken (or has gone missing entirely).

AdamW's Debugging Adventures: The Mysterious Disappearing /proc

Posted by Adam Williamson on January 17, 2019 07:15 PM

Yep, folks, it's that time again - time for one of old Grandpa Adam's tall tales of root causing adventure...

There's a sort of catch-22 situation in Fedora that has been a personal bugbear for a very long time. It mainly affects Branched releases - each new Fedora release, when it has branched from Rawhide, but before it has been released. During this period the Bodhi update system is in effect, meaning all new packages have to go through Bodhi review before they are included in the composes for the release. This means, in theory, we should be able to make sure nothing really broken lands in the release. However, there's a big class of really important updates we have never been able to test properly at all: updates that affect the installer.

The catch-22 is this - release engineering only builds install media from the 'stable' package set, those packages that have gone through review. So if a package under review breaks the installer, we can't test whether it breaks the installer unless we push it stable. Well, you can, but it's quite difficult - you have to learn how to build an installer image yourself, then build one containing the packages from the update and test it. I can do that, but most other people aren't going to bother.

I've filed bugs and talked to people about ways to resolve this multiple times over many years, but a few months back I just got sick of the problem and decided to fix it myself. So I wrote an openQA update test which automates the process: it builds an installer image, with the packages from the update available to the installer image build tool. I also included a subsequent test which takes that image and runs an install with it. Since I already had the process for doing this manually down pat, it wasn't actually very difficult.

Only...when I deployed the test to the openQA staging instance and actually tried it out, I found the installer image build would frequently fail in a rather strange way.

The installer image build process works (more or less) by creating a temporary directory, installing a bunch of packages to it (using dnf's feature of installing to an alternative 'root'), fiddling around with that environment a bit more, creating a disk image whose root is that temporary directory, then fiddling with the image a bit to make it into a bootable ISO. (HANDWAVE HANDWAVE). However, I was finding it would commonly fail in the 'fiddling around with the environment' stage, because somehow some parts of the environment had disappeared. Specifically, it'd show this error:

FileNotFoundError: [Errno 2] No such file or directory: '/var/tmp/lorax.q8xfvc0p/installroot//proc/modules'

lorax was, at that point, trying to touch that directory (never mind why). That's the /proc/modules inside the temporary root, basically. The question was, why was it disappearing? And why had neither myself nor bcl (the lorax maintainer) seen it happening previously in manual use, or in official composes?

I tried reproducing it in a virtual machine...and failed. Then I tried again, and succeeded. Then I ran the command again...and it worked! That pattern turned out to repeat: I could usually get it to happen the first time I tried it in a VM, but any subsequent attempts in the same VM succeeded.

So this was seeming really pretty mysterious. Brian couldn't get it to happen at all.

At this point I wrote a dumb, short Python script which just constantly monitored the disappearing location and told me when it appeared and disappeared. I hacked up the openQA test to run this script, and upload the result. Using the timestamps, I was able to figure out exactly what bit of lorax was running when the directory suddenly disappeared. But...I couldn't immediately see why anything in that chunk of lorax would wind up deleting the directory.
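
The script was along these lines (a reconstruction of the idea, not the exact script I used; the path here is a placeholder):

import os
import time

PATH = "/var/tmp/watch-me"  # placeholder - the real target was the installroot's /proc/modules

def watch(path, interval=0.5):
    existed = None
    while True:
        exists = os.path.exists(path)
        if exists != existed:
            # log a timestamped line every time the path appears or disappears
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            print("{0} {1}".format(stamp, "appeared" if exists else "disappeared"))
            existed = exists
        time.sleep(interval)

if __name__ == "__main__":
    watch(PATH)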

At this point, other work became more important, and I wound up leaving this on the back burner for a couple of months. Then I came back to it a couple days ago. I picked back up where I left off, and did a custom build of lorax with some debug logging statements strewn around the relevant section, to figure out really precisely where we were when things went wrong. But this turned out to be a bit of a brick wall, because it turned out that at the time the directory disappeared, lorax was just...running mksquashfs. And I could not figure out any plausible reason at all why a run of mksquashfs would cause the directory to vanish.

After a bit, though, the thought struck me - maybe it's not lorax itself wiping the directory out at all! Maybe something else is doing it. So I thought to look at the system logs. And lo and behold, I found my smoking gun. At the exact time my script logged that the directory had disappeared, this message appeared in the system log:

Jan 18 01:57:30 ibm-p8-kvm-03-guest-02.virt.pnr.lab.eng.rdu2.redhat.com systemd[1]: Starting Cleanup of Temporary Directories...

now, remember our problem directory is in /var/tmp. So this smells very suspicious indeed! So I figured out what that service actually is - to do this, you just grep for the description ("Cleanup of Temporary Directories") in /usr/lib/systemd/system - and it turned out to be /usr/lib/systemd/system/systemd-tmpfiles-clean.service, which is part of systemd's systemd-tmpfiles mechanism, which you can read up on in great detail in man systemd-tmpfiles and man tmpfiles.d.

I had run into it a few times before, so I had a vague idea what I was dealing with and what to look for. It's basically a mechanism for managing temporary files and directories: you can write config snippets which systemd will read and do stuff like creating expected temporary files or directories on boot (this lets packages manage temporary directories without doing it themselves in scriptlets). I poked through the docs again and, sure enough, it turns out another thing the system can do is delete temporary files that reach a certain age:

Age
The date field, when set, is used to decide what files to delete when cleaning. If a file or directory is
older than the current time minus the age field, it is deleted. The field format is a series of integers
each followed by one of the following suffixes for the respective time units: s, m or min, h, d, w, ms, and
us, meaning seconds, minutes, hours, days, weeks, milliseconds, and microseconds, respectively. Full names
of the time units can be used too.

This systemd-tmpfiles-clean.service does that job. So I went looking for tmpfiles.d snippets that cover /var/tmp, and sure enough, found one, in Fedora's stock config file /usr/lib/tmpfiles.d/tmp.conf:

q /var/tmp 1777 root root 30d

The 30d there is the 'age' field. So this tells the tmpfiles mechanism that it's fine to wipe anything under /var/tmp which is older than 30 days.

Of course, naively we might think our directory won't be older than 30 days - after all, we only just ran lorax! But remember, lorax installs packages into this temporary directory, and files and directories in packages get some of their time attributes from the package. So we (at this point, Brian and I were chatting about the problem as I poked it) looked into how systemd-tmpfiles defines age, precisely:

The age of a file system entry is determined from its last modification timestamp (mtime), its last access
timestamp (atime), and (except for directories) its last status change timestamp (ctime). Any of these three
(or two) values will prevent cleanup if it is more recent than the current time minus the age field.

So since our thing is a directory, its mtime and atime are relevant. So Brian and I both looked into those. He did it manually, while I hacked up my check script to also print the mtime and atime of the directory when it existed. And sure enough, it turned out these were several months in the past - they were obviously related to the date the filesystem package (from which /proc/modules comes) was built. They were certainly longer than 30 days ago.

Finally, I looked into what was actually running systemd-tmpfiles-clean.service; it's run on a timer, systemd-tmpfiles-clean.timer. That timer is set to run the service 15 minutes after the system boots, and every day thereafter.
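
If you're curious, the timer is easy to inspect, and the relevant settings should look roughly like this (output trimmed):

$ systemctl cat systemd-tmpfiles-clean.timer
[Timer]
OnBootSec=15min
OnUnitActiveSec=1d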

So all of this hooked up nicely into a convincing story. openQA kept running into this problem because it always runs the test in a freshly-booted VM - that '15 minutes after boot' was turning out to be right in the middle of the image creation. My manual reproductions were failing on the first try for the same reason - but then succeeding on the second and subsequent tries because the cleaner would not run again until the next day. And Brian and I had either never or rarely seen this when we ran lorax manually for one reason or another because it was pretty unlikely the "once a day" timer would happen to wake up and run just when we had lorax running (and if it did happen, we'd try again, and when it worked, we'd figure it was just some weird transient failure). The problem likely never happens in official composes, I think, because the tmpfiles timer isn't active at all in the environment lorax gets run in (haven't double-checked this, though).

Brian now gets to deal with the thorny problem of trying to fix this somehow on the lorax side (so the tmpfiles cleanup won't remove bits of the temporary tree even if it does run while lorax is running). Now that I know what's going on, it was easy enough to work around in the openQA test - I just have the test do systemctl stop systemd-tmpfiles-clean.timer before running the image build.

AdamW's Debugging Adventures: Python 3 Porting 201

Posted by Adam Williamson on January 08, 2019 08:12 PM

Hey folks! Time for another edition of AdamW's Debugging Adventures, wherein I boast about how great I am at fixin' stuff.

Today's episode is about a bug in the client for Fedora's Koji buildsystem which has been biting more and more Fedora maintainers lately. The most obvious thing it affects is task watching. When you do a package build with fedpkg, it will by default "watch" the build task - it'll update you when the various subtasks start and finish, and not quit until the build ultimately succeeds or fails. You can also directly watch tasks with koji watch-task. So this is something Fedora maintainers see a lot. There's also a common workflow where you chain something to the successful completion of a fedpkg build or koji watch-task, which relies on the task watch completing successfully and exiting 0, if the build actually completed.

However, recently, people noticed that this task watching seemed to be just...failing, quite a lot. While the task was still running, it'd suddenly exit, usually showing this message:

ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response',))

After a while, nirik realized that this seemed to be associated with the client going from running under Python 2 by default to running under Python 3 by default. This seems to happen when running on Python 3; it doesn't seem to happen when running on Python 2.

Today I finally decided it had got annoying enough that I'd spend some time trying to track it down.

It's pretty obvious that the message we see relates to an exception, in some way. But ultimately something is catching that exception and printing it out and then exiting (we're not actually getting a traceback, as you do if the exception is ultimately left to reach the interpreter). So my first approach was to dig into the watch-task code from the top down, and try and find something that handles exceptions that looks like it might be the bit we were hitting.

And...I failed! This happens, sometimes - in fact, I still haven't found the exact bit of code that prints the message and exits. It's OK. Don't give up. Try something else!

So what I did next was kind of a long shot - I just grepped the code for the exception text. I wasn't really expecting this to work, as there's nothing to suggest the actual exception is part of Koji; it's most likely the code doesn't contain any of that text at all. But hey, it's easy to do, so why not? And as it happened, I got lucky and hit paydirt: there happens to be a comment with some of the text from the error we're hitting. And it sure looks like it might be relevant to the problem we're having! The comment itself, and the function it's in, looked so obviously promising that I went ahead and dug a little deeper.

That function, is_conn_error(), is used by only one other thing: this _sendCall() method in the same file. And that seems very interesting, because what it does can be boiled down to: "hey, we got an error! OK, send it to is_conn_error(). If that returns True, then just log a debug message and kick the session. If that returns False, then raise an exception". That behaviour obviously smells a lot like it could be causing our problem. So, I now had a working theory: for some reason, given some particular server behaviour, is_conn_error() returns True on Python 2 but False on Python 3. That causes this _sendCall() to raise an exception instead of just resetting the session and carrying on, and some other code - which we no longer need to find - catches that exception, prints it, and quits.

The next step was to test this theory - because at this point it's only a theory, it could be entirely wrong. I've certainly come up with entirely plausible theories like this before which turned out to be not what was going on at all. So, like a true lazy shortcut enthusiast, I hacked up my local copy of Koji's __init__.py and sprinkled a bunch of lines like print("HERE 1!") and print("HERE 2!") through the whole of is_conn_error(). Then I just ran koji watch-task commands on random tasks until one failed.

This is fine. When you're just trying to debug the problem you don't need to be super elegant about it. You don't need to do a proper git patch and rebuild the Koji package for your system and use proper logging methods and all the rest of it. Dumping some print lines in a working copy of the file is fine, if it works. Just remember to put everything back as it was before later. :)

So, as it happened the god of root causing was on my side today, and it turned out I was right on the money. When one of the koji watch-task commands failed, it hit my HERE 1! and HERE 3! lines right when it died. Those told me we were indeed running through is_conn_error() right before the error, and further, where we were coming out of it. We were entering the if isinstance(e, socket.error) block at the start of the function, and returning False because the exception (e) did appear to be an instance of socket.error, but either did not have an errno attribute, or it was not one of errno.ECONNRESET, errno.ECONNABORTED, or errno.EPIPE.

Obviously, this made me curious as to what the exception actually is, whether it has an errno at all, and if so, what it is. So I threw in a few more debugging lines - to print out type(e), and getattr(e, 'errno', 'foobar'). The result of this was pretty interesting. The second print statement gave me 'foobar', meaning the exception doesn't have an errno attribute at all. And the type of the exception was...requests.exceptions.ConnectionError.

That's a bit curious! You wouldn't necessarily expect requests.exceptions.ConnectionError to be an instance of socket.error, would you? So why are we in a block that only handles instances of socket.error? Also, it's clear the code doesn't expect this, because there's a block later in the function that explicitly handles instances of requests.exceptions.ConnectionError - but because this earlier block that handles socket.error instances always returns, we will never reach that block if requests.exceptions.ConnectionError instances are also instances of socket.error. So there's clearly something screwy going on here.

So of course the next thing to do is...look up socket.error in the Python 2 and Python 3 docs. ANY TIME you're investigating a mysterious Python 3 porting issue, remember this can be useful. Here's the Python 2 socket.error entry, and the Python 3 socket.error entry. And indeed there's a rather significant difference! The Python 2 docs talk about socket.error as an exception that is, well, its own unique thing. However, the Python 3 docs say: "A deprecated alias of OSError." - and even tell us specifically that this changed in Python 3.3: "Changed in version 3.3: Following PEP 3151, this class was made an alias of OSError." Obviously, this is looking an awful lot like one more link in the chain of what's going wrong here.

With a bit of Python knowledge you should be able to figure out what's going on now. Think: if socket.error is now just an alias of OSError, what does if isinstance(e, socket.error) mean, in Python 3.3+ ? It means just the same as if isinstance(e, OSError). And guess what? requests.exceptions.ConnectionError happens to be a subclass of OSError. Thus, if e is an instance of requests.exceptions.ConnectionError, isinstance(e, socket.error) will return True in Python 3.3+. In Python 2, it returns False. It's easy to check this in an interactive Python shell or with a test script, to confirm.
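
For instance, a quick check along these lines (assuming both interpreters and python-requests are installed) shows the difference:

$ python2 -c "import socket, requests.exceptions; print(issubclass(requests.exceptions.ConnectionError, socket.error))"
False
$ python3 -c "import socket, requests.exceptions; print(issubclass(requests.exceptions.ConnectionError, socket.error))"
True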

Because of this, when we run under Python 3 and e is a requests.exceptions.ConnectionError, we're unexpectedly entering this block intended for handling socket.error exceptions and - because that block always returns, having the return False line that gets hit if the errno attribute check fails - we never actually reach the later block that's intended to handle requests.exceptions.ConnectionError instances at all; we return False before we get there.

There are a few different ways you could fix this - you could just drop the return False short-circuit line in the socket.error block, for instance, or change the ordering so the requests.exceptions.ConnectionError handling is done first. In the end I sent a pull request which drops the return False, but also drops the if isinstance(e, socket.error) checks (there's another, for nested exceptions, later) entirely. Since socket.error is meant to be deprecated in Python 3.3+ we shouldn't really use it, and we probably don't need to - we can just rely on the errno attribute check alone. Whatever type the exception is, if it has an errno attribute and that attribute is errno.ECONNRESET, errno.ECONNABORTED, or errno.EPIPE, I think we can be pretty sure this is a connection error.

What's the moral of this debugging tale? I guess it's this: when porting from Python 2 to Python 3 (or doing anything similar to that), fixing the things that outright crash or obviously behave wrong is sometimes the easy part. Even if everything seems to be working fine on a simple test, it's certainly possible that subtler issues like this could be lurking in the background, causing unexpected failures or (possibly worse) subtly incorrect behaviour. And of course, that's just another reason to add to the big old "Why To Have A Really Good Test Suite" list!

There's also a 'secondary moral', I guess, and that's this: predicting all the impacts of an interface change like this is hard. Remember the Python 3 docs mentioned a PEP associated with this change? Well, here it is. If you read it, it's clear the proposers actually put quite a lot of effort into thinking about how existing code might be affected by the change, but it looks like they still didn't consider a case like this. They talk about "Careless (or "naïve") code" which "blindly catches any of OSError, IOError, socket.error, mmap.error, WindowsError, select.error without checking the errno attribute", and about "Careful code is defined as code which, when catching any of the above exceptions, examines the errno attribute to determine the actual error condition and takes action depending on it" - and claim that "useful compatibility doesn't alter the behaviour of careful exception-catching code". However, Koji's code here clearly matches their definition of "careful" code - it considers both the exception's type, and the errno attribute, in making decisions - but because it is not just doing except socket.error as e or similar, but catching the exception elsewhere and then passing it to this function and using isinstance, it still gets tripped up by the change.

So...the ur-moral, as always, is: software is hard!

PSA: System update fails when trying to remove rtkit-0.11-19.fc29

Posted by Kamil Páral on October 15, 2018 11:20 AM
Recently a bug in rtkit packaging has been fixed, but the update will fail on all Fedora 29 pre-release installations that have rtkit installed (Workstation has it for sure). The details and the workaround are described here:

 

Whitelisting rpmlint errors in Taskotron/Bodhi

Posted by Kamil Páral on March 05, 2018 01:14 PM

If you submit a new Fedora update into Bodhi, you’ll see an Automated Tests tab on that update page (an example), and one of the test results (once it’s done) will be from rpmlint. If you click on it, you’ll get a full log with rpmlint output.

If you wish to whitelist some errors which are not relevant for your package or are clearly a mistake (like spelling issues, etc), it is now possible. The steps for doing this are described at:

https://fedoraproject.org/wiki/Taskotron/Tasks/dist.rpmlint#Whitelisting_errors

This has been often requested, so hopefully this will help you have the automated test results all in green, instead of being bothered by invalid errors. If something doesn’t work, and it seems to be our bug in how we execute rpmlint (instead of a bug in rpmlint itself), please file a bug in task-rpmlint or contact us (qa-devel mailing list, #fedora-qa IRC channel on Freenode).

Automatically shrink your VM disk images when you delete files

Posted by Kamil Páral on October 06, 2017 04:02 PM

Update: This got significantly simpler with newer qemu and virt-manager, read an updated post.

If you use VMs a lot, you know that with the most popular qcow2 disk format, the disk image starts small, but grows with every filesystem change happening inside the VM. Deleting files inside the VM doesn’t shrink it. Of course that wastes a lot of disk space on your host – the VMs often contain gigabytes of freed space inside the VM, but not on the host. Shrinking the VM images is possible, but tedious and slow. Well, recently I learned that’s actually not true anymore. You can use the TRIM command, used to signal to SSD drives that some space can be freed, to do the same in the virtualization stack – signal from the VM to the host that some space can be freed, and the disk image shrunk. How do you do that? As usual, this is a shameless copy of instructions found elsewhere on the Internet. The instructions assume you’re using virt-manager or libvirt directly.

First, you need to be using qcow2 images, not raw images (you can configure this when adding new disks to your VM).

Second, you need to set your disk bus to SCSI (not VirtIO, which is the default).


Third, you need to set your SCSI Controller to VirtIO SCSI (not hypervisor default).


Fourth, you need to edit your VM configuration file using virsh edit vmname and adjust your hard drive’s driver line to include discard='unmap', e.g. like this:

<disk type='file' device='disk'>
 <driver name='qemu' type='qcow2' discard='unmap'/>
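
For context, the whole disk element ends up looking something like this (the source path and target name here are just examples and will differ for your VM):

<disk type='file' device='disk'>
 <!-- example paths/names; adjust for your own VM -->
 <driver name='qemu' type='qcow2' discard='unmap'/>
 <source file='/var/lib/libvirt/images/myvm.qcow2'/>
 <target dev='sda' bus='scsi'/>
</disk>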

And that’s it. Now you boot your VM and try to issue:

$ sudo fstrim -av
/boot: 319.8 MiB (335329280 bytes) trimmed
/: 101.5 GiB (108928946176 bytes) trimmed

You should see some output printed, even if it’s just 0 bytes trimmed, and not an error.

If you’re using LVM, you’ll also need to edit /etc/lvm/lvm.conf and set:

issue_discards = 1

Then it should work, after a reboot.

Now, if you want trimming to occur automatically in your VM, you have two options (I usually do both):

Enable the fstrim timer that trims the system once a week by default:

$ sudo systemctl enable fstrim.timer

And configure the root filesystem (and any other one you’re interested in) to issue discard command automatically after each file is deleted. Edit /etc/fstab and add a discard mount option, like this:

UUID=6d368798-f4c2-44f9-8334-6be3c64cc449 / ext4 defaults,discard 1 1

And that’s it. Try to create a big file using dd, watch your VM image grow. Then delete the file, watch the image shrink. Awesome. If only we had this by default.
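
If you want to watch it happen, something along these lines works (the image path is just an example; run qemu-img on the host and the dd/rm/fstrim inside the guest):

# on the host: note the 'disk size' value
$ qemu-img info /var/lib/libvirt/images/myvm.qcow2
# inside the guest: grow the image, then free the space again
$ dd if=/dev/zero of=bigfile bs=1M count=2048
$ rm bigfile && sudo fstrim -v /
# back on the host: 'disk size' should have dropped again
$ qemu-img info /var/lib/libvirt/images/myvm.qcow2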

SSH to your VMs without knowing their IP address

Posted by Kamil Páral on October 06, 2017 03:31 PM

This is a shameless copy of this blog post, but I felt like I need to put it here as well, so that I can find it the next time I need it 🙂

libvirt approach

When you run a lot of VMs, especially for testing, every time with a fresh operating system, connecting to them is a pain, because you always need to figure out their IP address first. Turns out that is no longer true. I simply added this snippet to my ~/.ssh/config:

# https://penguindroppings.wordpress.com/2017/09/20/easy-ssh-into-libvirt-vms-and-lxd-containers/
# NOTE: doesn't work with uppercase VM names
Host *.vm
 CheckHostIP no
 Compression no
 UserKnownHostsFile /dev/null
 StrictHostKeyChecking no
 ProxyCommand nc $(virsh domifaddr $(echo %h | sed "s/\.vm//g") | awk -F'[ /]+' '{if (NR>2 && $5) print $5}') %p

and now I can simply execute ssh test.vm for a VM named test and I’m connected! A huge time saver. It doesn’t work with uppercase letters in VM names and I didn’t bother to try to fix that. Also, since I run VMs just for testing purposes, I disabled all ssh security checks (you should not do that for important machines).

avahi approach

There’s also a second approach I used for persistent VMs (those that survive for longer than a single install&reboot cycle). You can use Avahi to search for a hostname on the .local domain to find the IP address. Fedora has this enabled by default (if you have the nss-mdns package installed, I believe, which should be the default). So, in the VM, set a custom hostname, for example f27:

$ sudo hostnamectl set-hostname f27
$ reboot

Now, you can run ssh f27.local and it should connect you to the VM automatically.

Taskotron: depcheck task replaced by rpmdeplint

Posted by Kamil Páral on June 22, 2017 10:27 AM

If you are a Fedora packager, you might be interested to know that in Taskotron we replaced the depcheck task with rpmdeplint task. So if there are any dependency issues with the new update you submit to Bodhi, you’ll see that as dist.rpmdeplint failure (in the Automated Tests tab). The failure logs should look very similar to the depcheck ones (basically, the logs contain the errors dnf would spit out if it tried to install that package), so there should be no transitioning effort needed.

If you listen for depcheck results somehow, e.g. in FMN, make sure to update your rules to listen for dist.rpmdeplint instead. We have updated the default filters in FMN, so if you haven’t changed them, you should receive notifications for failures in rpmdeplint (and also upgradepath and abicheck) for submitted updates owned by you.

The reason for this switch is that we wanted to get rid of custom dependency checking (done directly on top of libsolv), and use an existing tool for that instead. That saves us time, we don’t need to study all the pitfalls of dependency resolution, and we benefit from someone else maintaining and developing the tool (that doesn’t mean we won’t send patches if needed). rpmdeplint offered exactly what we were looking for.

We will decommission depcheck task from Taskotron execution in the next few days, if there are no issues. Rpmdeplint results are already being published for all proposed updates.

If you have any questions, please ask in comments or reach us at #fedora-qa freenode irc channel or qa-devel (or test or devel) mailing list.

Kernel Performance Testing on ARM

Posted by sumantro on June 06, 2017 12:54 AM
This post talks about how you can do kernel regression, stress, and performance testing on the ARM architecture.

Setup:

To set up your ARM device, you need an image to get started. I was intending to test the latest compose (Fedora 26 1.4 Beta on a Raspberry Pi 3 Model B). Download the file (Workstation raw-xz for armhfp) or any variant that you want to test.

Once the file is downloaded, all you need to do is get an SD card and write the image to it.

There are two ways of doing it: using Fedora Media Writer, which can now write images for ARM devices, or the old dd. Here is roughly how you do it using dd:
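
(A minimal sketch - the image filename is just an example of the armhfp raw-xz download, and /dev/sdX must be replaced with your actual SD card device, so double-check it before running dd:)

# replace the filename and /dev/sdX with your own values (check with lsblk first!)
$ xzcat Fedora-Workstation-armhfp-26_Beta-1.4-sda.raw.xz | sudo dd of=/dev/sdX bs=4M status=progress conv=fsync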

Once dd has finished successfully, it's time to plug the SD card into the ARM device and boot it up. Once the ARM device is booted, all you need to do is clone the kernel test suite, as shown below.
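
Assuming the suite is the kernel-tests repository on Pagure, cloning it is a one-liner:

$ git clone https://pagure.io/kernel-tests.git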


Dependencies and Execution:

You will need 2 packages:
1. gcc
2. fedora-python

You can install them by executing "sudo dnf install fedora-python gcc".

Executing test cases:

Each test should be contained in a unique directory within the appropriate top level. The directory must contain an executable 'runtest.sh' which will drive the specific test. There is no guarantee on the order of execution. Each test should be fully independent, and have no dependency on other tests. The top level directories are reflective of how the master test suite is called. Each option is a super-set of the options before it. At this time we have:
  • minimal: This directory should include small, fast, and important tests which should be run on every system.
  • default: This directory will include most tests which are not destructive, or particularly long to run. When a user runs with no flags, all tests in both default and minimal will be run.
  • stress: This directory will include longer running and more resource intensive tests which a user might not want to run in the common case due to time or resource constraints.
  • destructive: This directory contains tests which have a higher probability of causing harm to a system even in the pass case. This would include things like potential for data loss.
  • performance: This directory contains longer running performance tests. These tests should typically be the only load on a system to get an accurate result.

To run the performance tests, execute:

$ sudo ./runtests.sh -t performance

Each test is executed by the control script by calling its runtest.sh. stdout and stderr are both redirected to the log. Any user running with default flags should see nothing but the name of the directory and pass/fail/skip. The runtest.sh should manage the full test run, including compiling any necessary source, checking for any specific dependencies, and skipping if they are not met.



View the log file with cat <file path>; the log file will give the device information and the test result.

The test is complete and the result is "PASS" in this case.

Near the end of the log file you will find the data and values from the tests.

[Retrospection] Fedora QA Global Onboarding Call 2017-06-03

Posted by sumantro on June 05, 2017 07:11 AM

Retrospection


We had a Fedora QA onboarding call on 2017-06-03 and it was successful. The agenda and the feedback can be found on this etherpad. People from different countries and regions found the call useful.

A few changes which made things better:

1. Using bluejeans was smooth and better than Hangouts.
2. Starting the doodle 2 weeks before the call and giving enough time to vote.
3. Using a bunch of quick links as reference points and then explaining.

Action Items

1. Consistency is the key to success; doing the onboarding call every 2 months will be more engaging. It also gives new contributors the assurance that they can simply plug themselves into one of these calls and start from there, and even if they miss one, they would still be able to contribute to the release cycle. The proposal is to create a wiki page and link it up to Fedora QA Join, where people will benefit from it.

2. Feedback, FAQs, quick links, logs and recordings should be kept in a wiki page, which will constantly tell us where we need improvement and maybe answer a few general questions for new contributors.

Proposed Timeline

After Branched

This is the time we generally plan the test days and start off with pre-Alpha testing. An onboarding call during this time will help us gather community ideas by which we can drive the test day planning, and if someone wants to run a test day, we can help them plan accordingly.


Before Beta

After the Alpha, we are mostly in a phase where the test days are happening, blocker bugs are being filed, and a lot of test coverage (release validation) needs to happen. Having an onboarding call at this time will help new contributors work on something specific which is aligned to our goals. This will be a place where we can have more and more people participating, helping us test the ISOs and the features in test days.



Before Final
 
This is a good time because we are done with most of the test days that are part of the change set, and we can conduct a few more tests, namely the system upgrade test day and the kernel test day. This call will help us test on most off-the-shelf hardware and ensure that a whole band of hardware is covered. This is also the time when we need the most validation to be done across most architectures, so it will help us keep the contributors engaged.
 

Fedora QA Onboarding Call 2017-06-03 1400-1500 UTC

Posted by sumantro on June 02, 2017 05:50 PM



There is going to be a Fedora QA onboarding call on 2017-06-03, 1400-1500 UTC, over BlueJeans. While release validation for Fedora Beta 1.3 is underway, this is a good time to get started. In this call we will be talking about how you, as a contributor, can get started with Fedora QA. The agenda can be found on this etherpad.

Hope to see you all!

Test Day DNF 2.0

Posted by sumantro on May 08, 2017 11:02 AM


Tuesday, 2017-05-09, is the DNF 2.0 Test Day! As part of this planned Change for Fedora 26, we need your help to test DNF 2.0!

Why test DNF 2.0?

DNF-2 is the upstream DNF version, and the only version actively developed. Currently, upstream contains many user-requested features, increased compatibility with yum, and over 30 bug fixes. Backporting patches from upstream to DNF-1 is difficult, so only critical security and usability fixes will be cherry-picked into Fedora.

With DNF 2.0 in place, users will notice usability improvements like better messages during resolution errors, showing whether a package was installed as a weak dependency, better handling of obsolete packages, fewer tracebacks, etc. One command-line option and one configuration option changed semantics, so DNF could behave differently in some cases (these changes are compatible with yum but incompatible with DNF-1). We hope to see whether it’s working well enough and catch any remaining issues.

We need your help!

All the instructions are on the wiki page, so please read through and come help us test! As always, the event will be in #fedora-test-day on Freenode IRC.

[Fedora Classroom] Fedora QA 101 and 102

Posted by sumantro on April 17, 2017 09:39 PM
This post sums up what we will be covering as part of the Fedora QA classroom this season. The idea is to understand how to do things the right way and to grow the number of contributors.

The topics covered will be:
1. Working with Fedora
2. Installing on VM(s)
3. Configuring and Installing fedora
4. Fedora Release Cycle
5. Live boot and Fedora Media Writer
6. Setting up accounts
7. Types of testing
8. Finding Test Cases
9. Writing Test Cases for Packages
10. Github
11. Bugzilla
12. Release Validation Testing
13. Update Testing
14. Manual Testing
14.1 Release validation
14.2 Update Testing


The 102 session will cover automated testing and how to host your own test days during the release cycle.

To make the workflow smooth, we have made a book which will act as a reference even after the classrooms are over.

https://drive.google.com/file/d/0Bzphf6h7upukTDlrczVEb0l3TmM/view?usp=sharing

Fedora Media Writer Test Day 2017-04-20

Posted by sumantro on April 16, 2017 08:11 AM
Fedora Media Writer is a very handy tool for creating live USB media. It became the primary downloadable in Fedora 25. We ran a test day installment to check it on the 3 major OSes: Windows, macOS and Fedora. That test day focused on writing Fedora images (Workstation/Server/spins) to a flash drive.

This installment of the test day will focus on out-of-the-box support for the ARM v7 architecture, in addition to Intel 64-bit and 32-bit. Testers can download an image of their choice, verify the image by checksum, and boot it on KVM and of course bare metal.

We will be running this test day on 2017-04-20. Grab a blank SD card or USB stick; with a good internet connection it will take roughly 30 minutes to complete the test case.

Details will be published in Fedora community blog and @test-announce list. 

The wiki page says it all https://fedoraproject.org/wiki/Test_Day:2017-04-20_Fedora_Media_Writer

[Test Day Report]Anaconda BlivetGUI test day

Posted by sumantro on April 10, 2017 07:26 AM
Hey Testers,

I just wanted to share the test day report for the Anaconda BlivetGUI Test Day. It was a huge success and we had about 28 testers (many new faces).

Testers: 28

Bugs filed: 12
1.https://bugzilla.redhat.com/show_bug.cgi?id=1430349
2.https://bugzilla.redhat.com/show_bug.cgi?id=1439538
3.https://bugzilla.redhat.com/show_bug.cgi?id=1439591
4.https://bugzilla.redhat.com/show_bug.cgi?id=1439744
5.https://bugzilla.redhat.com/show_bug.cgi?id=1439717
6.https://bugzilla.redhat.com/show_bug.cgi?id=1439684
7.https://bugzilla.redhat.com/show_bug.cgi?id=1439572
8.https://bugzilla.redhat.com/show_bug.cgi?id=1439111
9.https://bugzilla.redhat.com/show_bug.cgi?id=1439592
10.https://bugzilla.redhat.com/show_bug.cgi?id=1440143
11.https://bugzilla.redhat.com/show_bug.cgi?id=1440150
12.https://bugzilla.redhat.com/show_bug.cgi?id=1439729


Blog of the test day : https://communityblog.fedoraproject.org/anaconda-blivetgui-test-day/

I would like to thank each and every tester and the change owners for helping us test this crucial feature!

If you are one of those people who couldn't make it to the test day, you can go ahead and grab a copy of Fedora Alpha 1.7 and start installing Fedora using Blivet GUI. If something breaks, make sure to file a bug under blivet-gui.

Thanks
Sumantro
On behalf of Fedora QA team

[Test Day Announcement] Anaconda Blivet GUI

Posted by sumantro on April 04, 2017 09:44 AM

Thursday 2017-04-06 will be Anaconda Blivet-GUI Test Day!
It is part of this planned Change for Fedora 26, so this is an important Test Day!

We'll be testing the new detailed bottom-up configuration screen, which has long been requested by users; the inclusion of blivet-gui into Anaconda finally makes this a reality. It just adds a new option without changing the existing advanced storage configuration, so users who prefer the top-down configuration can still use it. The goal is to see whether it's working well enough and to catch any remaining issues.
It's also pretty easy to join in: all you'll need is Alpha 1.7 (which you can grab from the wiki page).
Anaconda grew a rather important new option in F26: as well as the two existing partitioning choices (automatic, and the existing Anaconda custom partitioning interface), there's now a *third* choice. You can now do custom partitioning with blivet-gui run within Anaconda, as well as using Anaconda's own interface (because there just weren't enough ways for custom partitioning to go wrong already). So we'll have a test day for using that interface, to try and shake out whatever problems it inevitably has. As always, the event will be in #fedora-test-day on Freenode IRC.

Fedora Activity Day, Bangalore 2017

Posted by Kanika Murarka on March 13, 2017 01:53 PM

The Fedora Activity Day (FAD) is a regional event (either one-day or a multi-day) that allows Fedora contributors to gather together in order to work on specific tasks related to the Fedora Project.

On February 25th 2017, a FAD was conducted at one of the admirable universities of Bangalore, University Visvesvaraya College of Engineering (UVCE). It was not a typical “hackathon” or “DocSprint” but a series of productive and interactive sessions on different tools.

The goal of this FAD was to make students aware of Fedora so that they can test, develop and contribute. It was a one-day event, starting at 10:30 in the morning and concluding at 3 in the afternoon.

The first talk was on Ansible, which is an IT automation tool. It can configure systems, deploy software, and orchestrate more advanced IT tasks such as continuous deployments or zero downtime rolling updates. The session was taken up by Vipul Siddharth and Prakash Mishra, who are Fedora contributors. They discussed about the importance of such automation tool and gave a small demo for getting started with Ansible.

The Ansible session was followed by a session on contributing to the Linux kernel, given by our esteemed guest Vaishali Thakkar (@kernel_girl). Vaishali is a Linux kernel developer at Oracle; she works in the kernel security engineering group and is associated with open source internship programs and some community groups. Vaishali highlighted every aspect of the kernel one should know before contributing. She discussed the lifecycle and the how-where-when of a pull request. The session was a 3-hour session with a short lunch break. The first part focused on the theoretical aspects of sending your first patch to the kernel community, and the second part was a demo session where she sent a patch from scratch (Slides).

The last session was taken up by Sumantro Mukherjee (Fedora Ambassador) and me, on pathways to contribute to Fedora with a short interactive session.

The speakers were awarded t-shirts as a mark of respect. I would like to thank Sumantro Mukherjee, the Fedora community and the IEEE subchapter of UVCE for making the FAD possible.


Kernel testing made easy!

Posted by sumantro on February 21, 2017 11:37 PM
Hey folks, this is a sincere effort to help people who want to stay on top of the game in terms of running bleeding-edge kernels. The most important part is to check whether the kernel version supports your system fine. If it does, that's awesome, but if it doesn't, you might want to report it to the team with the proper failure logs, which might be helpful for future reference.

To get started, you need a bleeding-edge kernel. You can get the latest kernel from Bodhi.

Most of these kernels land as updates, so you need to enable the updates-testing repo to install them. Once you have enabled the updates-testing repo, you can disable it again later by executing "dnf config-manager --set-disabled updates-testing". While I'm writing this, the latest kernel in updates-testing for F25 was "kernel-4.9.10-200.fc25".
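
A minimal sketch of that flow, assuming you use dnf config-manager (the counterpart of the disable command above) and only want the kernel from updates-testing:

$ sudo dnf config-manager --set-enabled updates-testing
$ sudo dnf update kernel
# and when you are done testing:
$ sudo dnf config-manager --set-disabled updates-testing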

Once you are done installing, now comes the part of checking whether all the vital features of your machine work smoothly. Of course, after a deep manual inspection you can trigger the test suite, which will test the major parts for you.

First, you need to install some packages which are important, although many of you might already have all of them.


Once you have the required packages, you need to clone the Pagure repo: "git clone https://pagure.io/kernel-tests.git".

After cloning you can simply start the test suite. Switch to the cloned folder and execute "cp config.example .config". Then open the .config file in vi (or any text editor) and set "submit=authenticated" and "username=<fas username>". Once you are done, just run the tests by executing "sudo ./runtests.sh"; for performance testing you can run "sudo ./runtests.sh -t performance". Both of these tests are most likely to pass; if they don't, you need to send/update the log and post karma on Bodhi so people can note the regressions. Put together, the steps look roughly like the sketch below.
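
A minimal sketch of that whole flow (assuming the clone landed in a kernel-tests directory and you edit with vi):

$ git clone https://pagure.io/kernel-tests.git
$ cd kernel-tests
$ cp config.example .config
$ vi .config    # set submit=authenticated and username=<fas username>
$ sudo ./runtests.sh
$ sudo ./runtests.sh -t performance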

For any changes refer to :
https://fedoraproject.org/wiki/QA:Testcase_kernel_regression

Bluetooth in Fedora

Posted by Nathaniel McCallum on February 16, 2017 08:53 PM

So… Bluetooth. It’s everywhere now. Well, everywhere except Fedora. Fedora does, of course, support Bluetooth. But even the most common workflows are somewhat spotty. We should improve this.

To this end, I’ve enlisted the help of Don Zickus, kernel developer extraordinaire, and Adam Williamson, the inimitable Fedora QA guru. The plan is to create a set of user tests for the most common bluetooth tasks. This plan has several goals.

First, we’d like to know when stuff is broken. For example, the recent breakage in linux-firmware. Catching this stuff early is a huge plus.

Second, we’d like to get high quality bug reports. When things do break, vague bug reports often cause things to sit in limbo for a while. Making sure we have all the debugging information up front can make reports actionable.

Third, we’d (eventually) like to block a new Fedora release if major functionality is broken. We’re obviously not ready for this step yet. But once the majority of workflows work on the hardware we care about, we need to ensure that we don’t ship a Fedora release with broken code.

To this end we are targeting three workflows which cover the most common cases:

  • Keyboards
  • Headsets
  • Mice

For more information, or to help develop the user testing, see the Fedora QA bug. Here’s to a better future!

Fedorahosted to Pagure

Posted by Kanika Murarka on February 12, 2017 06:48 AM

Fedorahosted.org was established in late 2007 using Trac for issues and wiki pages, Fedora Account System groups for access control and source uploads, and offering a variety of Source Control Management tools (git, svn, hg, bzr). With the rise of new workflows and source repositories, fedorahosted.org has ceased to grow, adding just one new project this year and a handful the year before.

As we all know, Fedorahosted is shutting down at the end of this month, so it’s time to migrate your projects from Fedorahosted to one of the following:

  1. Pagure
  2. Hosting and managing own Trac instance on OpenShift
  3. JIRA
  4. Phabricator
  5. GitHub
  6. Taiga

Pagure is the brainchild of Pierre-Yves Chibon, a member of the Fedora Engineering team. We will be looking into Pagure migration because Pagure is a new, full-featured git repository service, it’s open source, and we ❤ open source.

Pagure provides a test instance where we can create projects and test importing data. Note: from time to time it gets cleared out, so do not use it for anything long-term.

Here is how Pagure will support Fedorahosted projects:

Feature | Fedorahosted | Pagure
Priorities | We can add as many priority levels as required, with weights | Same
Priorities | We can assign a default priority | No such option
Priorities | Custom priority tags | Same
Milestones | Ability to add as many milestones as we want | Same
Milestones | Option to add a due date | Same
Milestones | Keeps track of completed time | Does not record completed time
Milestones | Option to select a default milestone | No such option
Resolutions | Ability to add as many resolutions as we want | Same
Resolutions | Can set a default resolution type | By default it is closed as ‘None’
Other things | Separate columns for Severity, Component, Version | Here it is easy, it has just Tags
Navigation and searching | Difficult | Easy
Permissions | Different types of permission exist | Only an ‘admin’ permission exists
Creating and maintaining tickets | Difficult | Easy
Enabling plug-ins | Easy | Easy

So, let’s try importing something into the staging Pagure instance. I will show a demo using the Fedora QA repo, which has recently been moved from Fedorahosted to Pagure.

  1. You should have XML_RPC permission or admin rights for fedorahosted repo.
  2. We will use Pagure-importer to do migration.
  3. Install it using pip: python3 -m pip install pagure_importer
  4. Create a new repo, e.g. Test-fedoraqa.
  5. Go to Settings and make the repo ticket-friendly by adding new milestones and priorities.
  6. Clone the issue tracker for issues from Pagure. Use: pgimport clone ssh://git@stg.pagure.io/tickets/Test-fedoraqa.git. This will clone the Pagure ticket repository into the default /tmp directory as /tmp/Test-fedoraqa.git.
  7. Activate the pagure tickets hook from project settings. This is a necessary step to also get the Pagure database updated for ticket repository changes.
  8. Deactivate the pagure Fedmsg hook from project settings. This will avoid the issues import to spam the fedmsg bus. The Hook can be reactivated once the import has completed.
  9. The fedorahosted command can be used to import issues from a fedorahosted project to pagure.
    $ pgimport fedorahosted --help
        Usage: pgimport fedorahosted [OPTIONS] PROJECT_URL
    
        Options:
        --tags  Import pagure tags as well.
        --private By default make all issues private.
        --username TEXT FAS username
        --password TEXT FAS password
        --offset INTEGER Number of issue in pagure before import
        --help  Show this message and exit.
        --nopush Do not push the result of pagure-importer back
    
    
    $ pgimport fedorahosted https://fedorahosted.org/fedora-qa --tags

    This command will import all the ticket information, with all tags, into the /tmp/Test-fedoraqa.git repository. If you are getting this error: ERROR: Error in response: {u'message': u'XML_RPC privileges are required to perform this operation', u'code': 403, u'name': u'JSONRPCError'}, it means you don't have XML_RPC permission.

  10. You will be prompted for your FAS username and password.
  11. Go to the /tmp folder: cd /tmp/
  12. Now we need to push the tickets to the new repo. The push command can be used to push a cloned Pagure ticket repo back to Pagure:
    $ pgimport push Test-fedoraqa.git
  13. Refresh your repo and the imported tickets will show up.
  14. Now you can edit tickets in any way you want.

Stuck somewhere? Feel free to comment and contact. Thanks for reading this 🙂

Welcome Fedora Quality Planet

Posted by Kamil Páral on January 31, 2017 10:31 AM

Hello, I’d like to introduce a new sub-planet of Fedora Planet to you, located at http://fedoraplanet.org/quality/ (you don’t need to remember the URL, there’s a sub-planet picker in the top right corner of Fedora Planet pages that allows you to switch between sub-planets).

Fedora Quality Planet will contain news and useful information about QA tools and processes present in Fedora, updates on our quality automation efforts, guides for package maintainers (and other teams) how to interact with our tools and checks or understand the reported failures, announcements about critical issues in Fedora releases, and more.

Our goal is to have a single place for you to visit (or subscribe to) and get a good overview of what’s happening in the Fedora Quality space. Of course all Fedora Quality posts should also show up in the main Fedora Planet feed, so if you’re already subscribed to that, you shouldn’t miss our posts either.

If you want to join our effort and publish some interesting quality-related posts into Fedora Quality Planet, you’re more than welcome! Please see the instructions on how to syndicate your blog. If you have any questions or need help, ask in the test mailing list or ping kparal or adamw on the #fedora-qa freenode IRC channel. Thanks!

Fedora 25 i18n test day 2016-09-28

Posted by sumantro on September 27, 2016 04:23 PM
Hey all, this is a call for action for the internationalization (i18n) test day, which is happening tomorrow, and we would like to have people testing the keyboard layouts of different languages and emoji.

How to test

If you test with an installed system, make sure you have all the current updates installed using the update manager, especially if you use the Alpha images. Grab a copy of the latest nightly; you can get it here.
Once you are done, you can start testing.

Reporting

Result page: http://testdays.fedorainfracloud.org/events/10

Found bugs? Report them here.

Retrospection of Fedora QA Global Onboarding Call

Posted by sumantro on August 23, 2016 07:18 AM
Building on the premise that the Fedora QA mailing list started getting loads of new contributors, we decided to kick off onboarding calls to help the new joiners understand the Fedora QA process in the right way. Last Saturday (2016-08-20), we got the votes from all the new joiners and started off with the Fedora QA onboarding call.

Last time we ran it, we faced a few roadblocks, the major one being the participation limit of Hangouts. Since we wanted the call to be more interactive, we thought it would be best to have it over Hangouts. The major roadblock was that although we had a very interactive call with a load of questions and answers, participation was restricted because we reached the maximum number of participants.

In the last call, we went for "Hangouts on Air"; this is where we traded interactivity for viewership. As this call technically had no limitation in terms of participation, the communication pattern was mostly simplex, although the pirate pad helped a lot in making the conversation duplex with its chat box feature.

Adding to the silver lining, we had the recording up with zero downtime. People with lower bandwidth can always watch the video or download it from youtube for reference.

Adamw's post for the call: https://www.happyassassin.net/2016/08/19/fedora-qa-onboarding-call-tomorrow-2016-08-20-at-1700-utc/

YouTube recording: https://www.youtube.com/watch?v=ASQmkOrB_DY

Having said that, we will be conducting onboarding calls whenever ~30 new contributors join Fedora QA.

UEFI for QEMU now in Fedora repositories

Posted by Kamil Páral on June 27, 2016 12:55 PM

I haven’t seen any announcement, but I noticed the Fedora repositories now contain the edk2-ovmf package. That is the package that is necessary to emulate UEFI in QEMU/KVM virtual machines. It seems all licensing issues have finally been resolved, and now you can easily run UEFI systems in your virtual machines!

I have updated Using_UEFI_with_QEMU wiki page accordingly.

Enjoy.

‘Package XXX is not signed’ error during upgrade to Fedora 24

Posted by Kamil Páral on June 22, 2016 11:54 AM

Many people hit issues like this when trying to upgrade to Fedora 24:

 Error: Package a52dec-0.7.4-19.fc24.x86_64.rpm is not signed

You can easily see that this is a very widespread issue if you look at comments section under our upgrade guide on fedora magazine. In fact, this issue probably affects everyone who has rpmfusion repository enabled (which is a very popular third-party repository). Usually the a52dec package is mentioned, because it’s early in the alphabet listing, but it can be a different one (depending on what you installed from rpmfusion).

The core issue is that even though their Fedora 24 repository is available, the packages in it are not signed yet – they simply did not have time to do that yet. However, rpmfusion repository metadata from Fedora 23 demand that all packages are signed (which is a good thing, package signing is crucial to prevent all kinds of nasty security attacks). The outcome is that DNF rejects the transaction for being unsecure.

According to rpmfusion maintainers, they are working on signing their repositories and it should be done hopefully soon. So if you’re not in a hurry with your upgrade, just wait a while and the problem will disappear soon (hopefully).

But, if you insist that you want to upgrade now, what are your options?

Some people suggest you can add the --nogpgcheck option to the command line. Please don’t do that! That completely bypasses any security checks, even for proper Fedora packages! It will leave you vulnerable to security attacks.

A much better option is to temporarily remove rpmfusion repositories:

$ sudo dnf remove 'rpmfusion-*-release'

and run the upgrade command again. You’ll likely need to add the --allowerasing option, because it will probably want to remove some packages that you installed from rpmfusion (like vlc):

$ sudo dnf system-upgrade download --releasever=24 --allowerasing

This is OK, after you upgrade your system, you can enable rpmfusion repositories again, and install the packages that were removed prior to upgrade.

(I recommend to really remove rpmfusion repositories and not just disable them, because they manage their repos in a non-standard way, enabling and disabling their updates and updates-testing repos during the system lifecycle according to their needs, so it’s hard to know which repos to enable after the system upgrade – they are not the same as were enabled before the system upgrade. What they are doing is really rather ugly and it’s much better to perform a clean installation of their repos.)

After the system upgrade finishes, simply visit their website, install the repos again, and install any packages that you’re missing. This way, your upgrade was performed in a safe way. The packages installed from rpmfusion might still be installed unsafely (depending whether they manage to sign the repo by that time or not), but it’s much better than to upgrade your whole system unsafely.

To close this up, I’m sorry that people are hit by these complications, but it’s not something Fedora project can directly influence (except for banning third-party repos during system upgrades completely, or some similar drastic measure). This is in hands of those third-party repos. Hopefully lots of this pain will go away once we start using Flatpak.

Hosting your own Fedora Test Day

Posted by sumantro on June 07, 2016 03:08 AM
This post talks about how to host your own test day. Most of the time you won't have to do all the work single-handed, but below is a draft of how you can go ahead and host your own test day, or at least get started.


Procedure:

1. Decide which change you want to test.
2. Create a ticket.
3. Find out whether that type of test case has run before; if yes, you can easily re-use the previous wiki page of test cases, but if not, you have to write a fresh test day wiki and test case wiki.
4. Once you are done setting up the wiki(s), you need to think about the metadata page for the Test Day app.
5. The results will be shown in the Test Day app.

How to set it up:

1. For Fedora 24 you can find the change set here.

2. Create a Trac ticket. One example is below.

3. Once you are sure you have picked at least one change, check whether that type of test ever ran before; if not, start off by grabbing the test day wiki template.

Actual wiki test day page:

4. Next, and the most crucial part, is setting up the test case page. This page contains instructions on what is supposed to be done and how, and it also explains which test case should yield what result(s). One example is below.

5. Set up the meta page so that the app works properly: create a wiki page exactly like the one given below.

6. Once done, you need to add the meta link here.

7. From this moment on, your test day is live and you can find its Test Day app result page, which will look much like this.

That completes all the procedures required; needless to say, you should announce it on the @test list and the @test-announce list!

Fedora Media Writer testday [Report]

Posted by sumantro on April 21, 2016 11:06 AM
This is a proud moment, as we saw a good number of people participating in the event.

Testers: 22, Tests: 40
Bugs filed: 16, New: 16, Duplicates: 0, Fixed: 0




Participants
  1. Ayan
  2. Karthik Subrahmanya
  3. achembar
  4. arehtykitna
  5. frantisekz
  6. kparal
  7. lbrabec
  8. pschindl
  9. priynag
  10. renault
  11. satellit
  12. sayak
  13. sumantro
  14. swarnava
  15. cpanceac
  16. deaddrift
  17. lsatenstein
  18. roger.k.wells
  19. cmurf
  20. jsedlak
  21. juliuxpigface
  22. qqqqqqq


Bugs and Github issues Reported

  1. moving liveusb-creator to a different display runs out of memory and kills desktop - 1328452
  2. Luc doesn't reuse already downloaded image and stuck - 1328369
  3. AttributeError: 'NoneType' object has no attribute 'device' - 1328337
  4. liveusb-creator: pycurl.error: cannot invoke setopt() - perform() is currently running -1328340
  5. crash when choosing ISO on samba server- 1328560
  6.  Windows 10 install usb stick not recognized-1328794
  7. liveusb-creator on Windows 10 opens diskpart.exe but does nothing-1328484
  8. partitions are not unmounted before writing image to a device, can lead to corrupted data written-1328498
  9. do not leave partially downloaded files on disk-1328789
  10. no writing progress is displayed-1328462
  11. USB stick light flashes continuously while LUC is running, whether it's read/writing or not- 1328563
  12. cpu is over 100% for liveusb-creator process while application is doing nothing - 1318491
  13. Open button doesn't work in file selection dialog-1328457
  14. https://github.com/lmacken/liveusb-creator/issues/43
  15. https://github.com/lmacken/liveusb-creator/issues/44
  16. https://github.com/lmacken/liveusb-creator/issues/45