Fedora Quality Planet

On inclusive language: an extended metaphor involving parties because why not

Posted by Adam Williamson on June 17, 2020 12:57 AM

So there's been some discussion within Red Hat about inclusive language lately, obviously related to current events and the worldwide protests against racism, especially anti-Black racism. I don't want to get into any internal details, but in one case we got into some general debate about the validity of efforts to use more inclusive language. I thought up this florid party metaphor, and I figured instead of throwing it at an internal list, I'd put it up here instead. If you have constructive thoughts on it, go ahead and mail me or start a twitter thread or something. If you have non-constructive thoughts on it, keep 'em to yourself!

Before we get into my pontificating, though, here are some useful practical resources if you just want to read up on how you can make the language in your projects and docs more inclusive:

To provide a bit of context: I was thinking about a suggestion that people promoting the use of more inclusive language are "trying to be offended". And here's where my mind went!

Imagine you are throwing a party. You send out the invites, order in some hors d'ouevres (did I spell that right? I never spell that right), queue up some Billie Eilish (everyone loves Billie Eilish, it's a scientific fact), set out the drinks, and wait for folks to arrive. In they all come, the room's buzzing, everyone seems to be having a good time, it's going great!

But then you notice (or maybe someone else notices, and tells you) that most of the people at your party seem to be straight white dudes and their wives and girlfriends. That's weird, you think, I'm an open minded modern guy, I'd be happy to see some Black folks and maybe a cute gay couple or something! What gives? I don't want people to think I'm some kind of racist or sexist or homophobe or something!

So you go and ask some non-white folks and some non-straight folks and some non-male folks what's going on. What is it? Is it me? What did I do wrong?

Well, they say, look, it's a hugely complex issue, I mean, we could be here all night talking about it. And yes, fine, that broken pipeline outside your house might have something to do with it (IN-JOKE ALERT). But since you ask, look, let us break this one part of it down for you.

You know how you've got a bouncer outside, and every time someone rolls up to the party he looks them up and down and says "well hi there! What's your name? Is it on the BLACKLIST or the WHITELIST?" Well...I mean...that might put some folks off a bit. And you know how you made the theme of the party "masters and slaves"? You know, that might have something to do with it too. And, yeah, you see how you sent all the invites to men and wrote "if your wife wants to come too, just put her name in your reply"? I mean, you know, that might speak to some people more than others, you hear what I'm saying?

Now...this could go one of two ways. On the Good Ending, you might say "hey, you know what? I didn't think about that. Thanks for letting me know. I guess next time I'll maybe change those things up a bit and maybe it'll help. Hey thanks! I appreciate it!"

and that would be great. But unfortunately, you might instead opt for the Bad Ending. In the Bad Ending, you say something like this:

"Wow. I mean, just wow. I feel so attacked here. It's not like I called it a 'blacklist' because I'm racist or something. I don't have a racist bone in my body, why do you have to read it that way? You know blacklist doesn't even MEAN that, right? And jeez, look, the whole 'masters and slaves' thing was just a bit of fun, it's not like we made all the Black people the slaves or something! And besides that whole thing was so long ago! And I mean look, most people are straight, right? It's just easier to go with what's accurate for most people. It's so inconvenient to have to think about EVERYBODY all the time. It's not like I'm homophobic or anything. If gay people would just write back and say 'actually I have a husband' or whatever they'd be TOTALLY welcome, I'm all cool with that. God, why do you have to be so EASILY OFFENDED? Why do you want to make me feel so guilty?"

So, I mean. Out of Bad Ending Person and Good Ending Person...whose next party do we think is gonna be more inclusive?

So obviously, in this metaphor, Party Throwing Person is Red Hat, or Google, or Microsoft, or pretty much any company that says "hey, we accept this industry has a problem with inclusion and we're trying to do better", and the party is our software and communities and events and so on. If you are looking at your communities and wondering why they seem to be pretty white and male and straight, and you ask folks for ideas on how to improve that, and they give you some ideas...just listen. And try to take them on board. You asked. They're trying to help. They are not saying you are a BAD PERSON who has done BAD THINGS and OFFENDED them and you must feel GUILTY for that. They're just trying to help you make a positive change that will help more folks feel more welcome in your communities.

You know, in a weird way, if our Party Throwing Person wasn't quite Good Ending Person or Bad Ending person but instead said "hey, you know what, I don't care about women or Black people or gays or whatever, this is a STRAIGHT WHITE GUY PARTY! WOOOOO! SOMEONE TAP THAT KEG!"...that's almost not as bad. At least you know where you stand with that. You don't feel like you're getting gaslit. You can just write that idiot and their party off and try and find another. The kind of Bad Ending Person who keeps insisting they're not racist or sexist or homophobic and they totally want more minorities to show up at their party but they just can't figure out why they all seem to be so awkward and easily offended and why they want to make poor Bad Ending Person feel so guilty...you know...that gets pretty tiring to deal with sometimes.

Fedora CoreOS Test Day coming up on 2020-06-08

Posted by Adam Williamson on June 03, 2020 10:45 PM

Mark your calendars for next Monday, folks: 2020-06-08 will be the very first Fedora CoreOS test day! Fedora QA and the CoreOS team are collaborating to bring you this event. We'll be asking participants to test the bleeding-edge next stream of Fedora CoreOS, run some test cases, and also read over the documentation and give feedback.

All the details are on the Test Day page. You can join in on the day on Freenode IRC; we'll be using #fedora-coreos rather than #fedora-test-day for this event. Please come by and help out if you have the time!

Taskotron is EOL (end of life) today

Posted by Kamil Páral on April 30, 2020 07:49 AM

As previously announced, Taskotron (project page) will be shut down today. See the announcement and its discussion for more details and some background info.

As a result, certain tests (beginning with “dist.”) will no longer appear for new updates in Bodhi (in the Automated Tests tab). Some of those tests (and even new ones) will hopefully come back in the future with the help of Fedora CI.

Thank you to everyone who contributed to Taskotron in the past or found our test reports helpful.


Fedora 32 release and Lenovo announcement

Posted by Adam Williamson on April 28, 2020 11:18 PM

It's been a big week in Fedora news: first came the announcement of Lenovo planning to ship laptops preloaded with Fedora, and today Fedora 32 is released. I'm happy this release was again "on time" (at least if you go by our definition and not Phoronix's!), though it was kinda chaotic in the last week or so. We just changed the installer, the partitioning library, the custom partitioning tool, the kernel and the main desktop's display manager - that's all perfectly normal stuff to change a day before you sign off the release, right? I'm pretty confident this is fine!

But seriously folks, I think it turned out to be a pretty good sausage, like most of the ones we've put on the shelves lately. Please do take it for a spin and see how it works for you.

I'm also really happy about the Lenovo announcement. The team working on that has been doing an awful lot of diplomacy and negotiation and cajoling for quite a while now and it's great to see it pay off. The RH Fedora QA team was formally brought into the plan in the last month or two, and Lenovo has kindly provided us with several test laptops, which we've distributed around. While the project wasn't public, we were clear that we couldn't do anything like making the Fedora 32 release contingent on test results on Lenovo hardware purely for its sake, or anything like that - but both our team and Lenovo's have been running tests, and we did accept several freeze exceptions to fix bugs like this one, which also affected some Dell systems and maybe others too. Now that this project is officially public, it's possible we'll consider adding some official release criteria for the supported systems, or something like that, so look out for proposals on the mailing lists in future.

Automatically shrink your VM disk images when you delete files (Fedora 32 update)

Posted by Kamil Páral on April 24, 2020 03:09 PM

I’ve already written about this in 2017, but things got simpler since then. Time for an update!

If you use virtual machines, your disk images just grow and grow, but never shrink – deleted files inside a VM never free up the space on the host. But you can configure the VM to handle TRIM (discard) commands, and then your disk images will reflect deleted files and shrink as well. Here’s how (with Fedora 32 using qemu 4.2 and virt-manager 2.2).

Adjust VM configuration

  1. When creating a new VM, use qcow2 disk images (that’s the default), not raw.
  2. Your new VM should have VirtIO disks (that’s the default).
  3. In virt-manager in VM configuration, select your VirtIO disk, go to Advanced -> Performance, and set Discard mode: unmap.
    [Screenshot: the virt-manager disk Performance options, with Discard mode set to "unmap"]
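If you prefer editing the domain XML directly (e.g. with virsh edit), the same setting lives in the discard attribute of the disk driver element. A sketch for reference - the image path and target name here are just examples matching the defaults from steps 1 and 2:

```xml
<!-- Illustrative libvirt disk definition; paths and dev names will differ. -->
<disk type='file' device='disk'>
  <driver name='qemu' type='qcow2' discard='unmap'/>
  <source file='/var/lib/libvirt/images/discardtest.qcow2'/>
  <target dev='vda' bus='virtio'/>
</disk>
```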

Test it

Now boot your VM and try to issue a TRIM command:

$ sudo fstrim -av
/boot: 908.5 MiB (952631296 bytes) trimmed on /dev/vda1
/: 6.8 GiB (7240171520 bytes) trimmed on /dev/mapper/fedora-root

You should see some output printed (even if it's just 0 bytes trimmed), rather than an error.

Let’s see if the disk image actually shrinks. You need to list its size using du (or ls -s) to see the disk allocated size, not the apparent file size (because the disk image is sparse):

$ du -h discardtest.qcow2 
1.4G discardtest.qcow2
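As a quick aside, the apparent-vs-allocated distinction is easy to see with a throwaway sparse file (the file name here is just an example):

```shell
# Create a 100 MiB sparse file: full apparent size, zero allocated blocks.
truncate -s 100M sparse.img
ls -lh sparse.img   # apparent size: 100M
du -h sparse.img    # allocated size: 0
rm -f sparse.img
```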

Now create a file inside the VM:

$ dd if=/dev/urandom of=file bs=1M count=500

We created a 500 MiB file inside the VM and the disk image grew accordingly (give it a few seconds):

$ du -h discardtest.qcow2
1.9G discardtest.qcow2

Now, remove the file inside the VM and issue a TRIM:

$ rm file -f
$ sudo fstrim -av

And the disk image size should shrink back (give it a few seconds):

$ du -h discardtest.qcow2
1.4G discardtest.qcow2

If you configure your system to send TRIM in real-time (see below), it should shrink right after rm and no fstrim should be needed.

Issue TRIM automatically

With Fedora 32, fstrim.timer is automatically enabled and will trim your system once per week. You can reconfigure it to run more frequently, if you want. You can check the timer using:

$ sudo systemctl list-timers
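If you do want it to run more often, a systemd drop-in override is enough. A sketch (requires root; `sudo systemctl edit fstrim.timer` creates this file for you):

```ini
# /etc/systemd/system/fstrim.timer.d/override.conf
[Timer]
OnCalendar=
OnCalendar=daily
```

Run `sudo systemctl daemon-reload` afterwards if you edited the file by hand.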

If you want a real-time TRIM, edit /etc/fstab in the VM and add a discard mount option to the filesystem in question, like this:

UUID=6d368798-f4c2-44f9-8334-6be3c64cc449 / ext4 defaults,discard 1 1

This has some performance impact (they say), but the disk image will shrink right after a file is deleted. (Note: XFS as a root filesystem doesn’t issue TRIM commands without additional tweaking, read more here).

Stay informed about QA events

Posted by Kamil Páral on April 09, 2020 02:50 PM

Hello, this is a reminder that you can easily stay informed about important upcoming QA events and help with testing Fedora, especially now during Fedora 32 development period.

The first obvious option for existing Fedora contributors is to subscribe to the test-announce mailing list. We announce all our QA meetings, test days, composes nominated for testing and other important information in there.

A second, not-so-well-known option, which I want to highlight today, is to add our QA calendar to your calendar software (Google Calendar, Thunderbird, etc.). You’ll see our QA meetings (including blocker review meetings) and test days right next to your personal events, so they will be hard to miss. A guide on how to do that is on our QA homepage.

Thank you to everyone who joins our efforts and helps us make Fedora better.

Do not upgrade to Fedora 32, and do not adjust your sets

Posted by Adam Williamson on February 14, 2020 05:30 PM

If you were unlucky today, you might have received a notification from GNOME in Fedora 30 or 31 that Fedora 32 is now available for upgrade.

This might have struck you as a bit odd, it being rather early for Fedora 32 to be out and there not being any news about it or anything. And if so, you'd be right! This was an error, and we're very sorry for it.

What happened is that a particular bit of data which GNOME Software (among other things) uses as its source of truth about Fedora releases was updated for the branching of Fedora 32...but by mistake, 32 was added with status 'Active' (meaning 'stable release') rather than 'Under Development'. This fooled poor GNOME Software into thinking a new stable release was available, and telling you about it.

Kamil Paral spotted this very quickly and releng fixed it right away, but if your GNOME Software happened to check for updates during the few minutes the incorrect data was up, it will have cached it, and you'll see the incorrect notification for a while.

Please DO NOT upgrade to Fedora 32 yet. It is under heavy development and is very much not ready for normal use. We're very sorry for the incorrect notification and we hope it didn't cause too much disruption.

Using Zuul CI with Pagure.io

Posted by Adam Williamson on February 12, 2020 06:15 PM

I attended Devconf.cz again this year - I'll try and post a full blog post on that soon. One of the most interesting talks, though, was CI/CD for Fedora packaging with Zuul, where Fabien Boucher and Matthieu Huin introduced the work they've done to integrate a specific Zuul instance (part of the Software Factory effort) with the Pagure instance Fedora uses for packages and also with Pagure.io, the general-purpose Pagure instance that many Fedora groups use to host projects, including us in QA.

They've done a lot of work to make it as simple as possible to hook up a project in either Pagure instance to run CI via Zuul, and it looked pretty cool, so I thought I'd try it on one of our projects and see how it compares to other options, like the Jenkins-based Pagure CI.

I wound up more or less following the instructions on this Wiki page, but it does not give you an example of a minimal framework in the project repository itself to actually run some checks. However, after I submitted the pull request for fedora-project-config as explained on the wiki page, Tristan Cacqueray was kind enough to send me this as a pull request for my project repository.

So, all that was needed to get a kind of 'hello world' process running was:

  1. Add the appropriate web hook in the project options
  2. Add the 'zuul' user as a committer on the project in the project options
  3. Get a pull request merged to fedora-project-config to add the desired project
  4. Add a basic Zuul config which runs a single job

After that, the next step was to have it run useful checks. I set the project up such that all the appropriate checks could be run just by calling tox (which is a great test runner for Python projects) - see the tox configuration. Then, with a bit more help from Tristan, I was able to tweak the Zuul config to run it successfully. This mainly required a couple of things:

  1. Adding nodeset: fedora-31-vm to the Zuul config - this makes the CI job run on a Fedora 31 VM rather than the default CentOS 7 VM (CentOS 7's tox is too old for a modern Python 3 project)
  2. Modifying the job configuration to ensure tox is installed (there's a canned role for this, called ensure-tox) and also all available Python interpreters (using the package module)
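Putting those pieces together, the Zuul config can stay very small. A sketch of what such a .zuul.yaml might look like - the job name and the use of the canned tox parent job are assumptions for illustration, not copied from the actual repo:

```yaml
# Illustrative minimal .zuul.yaml: one tox job on a Fedora 31 node.
- job:
    name: myproject-tox
    parent: tox
    nodeset: fedora-31-vm

- project:
    check:
      jobs:
        - myproject-tox
```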

This was all pretty small and easy stuff, and we had the whole thing up and running in a few hours. Now it all works great, so whenever a pull request is submitted for the project, the tests are automatically run and the results shown on the pull request.
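For reference, the tox side of a setup like this can also be tiny. A hypothetical tox.ini in the same spirit (not the project's actual file):

```ini
# Minimal illustrative tox.ini: run pytest under the available interpreters.
[tox]
envlist = py37,py38

[testenv]
deps =
    pytest
commands =
    pytest
```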

You can set up more complex workflows where Zuul takes over merging of pull requests entirely - an admin posts a comment indicating a PR is ready to merge, whereupon Zuul will retest it and then merge it automatically if the test succeeds. This can also be used to merge series of PRs together, with proper testing. But for my small project, this simple integration is enough so far.

It's been a positive experience working with the system so far, and I'd encourage others to try it for their packages and Pagure projects!

AdamW's Debugging Adventures: "dnf is locked by another application"

Posted by Adam Williamson on October 18, 2019 01:45 PM

Gather round the fire, kids, it's time for another Debugging Adventure! These are posts where I write up the process of diagnosing the root cause of a bug, where it turned out to be interesting (to me, anyway...)

This case - Bugzilla #1750575 - involved dnfdragora, the package management tool used on Fedora Xfce, which is a release-blocking environment for the ARM architecture. It was a pretty easy bug to reproduce: any time you updated a package, the update would work, but then dnfdragora would show an error "DNF is locked by another process. dnfdragora will exit.", and immediately exit.

The bug sat around on the blocker list for a while; Daniel Mach (a DNF developer) looked into it a bit but didn't have time to figure it out all the way. So I got tired of waiting for someone else to do it, and decided to work it out myself.

Where's the error coming from?

As a starting point, I had a nice error message - so the obvious thing to do is figure out where that message comes from. The text appears in a couple of places in dnfdragora - in an exception handler and also in a method for setting up a connection to dnfdaemon. So, if we didn't already know (I happened to) this would be the point at which we'd realize that dnfdragora is a frontend app to a backend - dnfdaemon - which does the heavy lifting.

So, to figure out in more detail how we were getting to one of these two points, I hacked both the points where that error is logged. Both of them read logger.critical(errmsg). I changed this to logger.exception(errmsg). logger.exception is a very handy feature of Python's logging module which logs whatever message you specify, plus a traceback to the current state, just like the traceback you get if the app actually crashes. So by doing that, the dnfdragora log (it logs to a file dnfdragora.log in the directory you run it from) gave us a traceback showing how we got to the error:

2019-10-14 17:53:29,436 [ERROR] dnfdragora: dnfdaemon client error: g-io-error-quark: GDBus.Error:org.baseurl.DnfSystem.LockedError: dnf is locked by another application (36)
Traceback (most recent call last):
  File "/usr/bin/dnfdragora", line 85, in <module>
    main_gui.handleevent()
  File "/usr/lib/python3.7/site-packages/dnfdragora/ui.py", line 1273, in handleevent
    if not self._searchPackages(filter, True) :
  File "/usr/lib/python3.7/site-packages/dnfdragora/ui.py", line 949, in _searchPackages
    packages = self.backend.search(fields, strings, self.match_all, self.newest_only, tags )
  File "/usr/lib/python3.7/site-packages/dnfdragora/misc.py", line 135, in newFunc
    rc = func(*args, **kwargs)
  File "/usr/lib/python3.7/site-packages/dnfdragora/dnf_backend.py", line 464, in search
    newest_only, tags)
  File "/usr/lib/python3.7/site-packages/dnfdaemon/client/__init__.py", line 508, in Search
    fields, keys, attrs, match_all, newest_only, tags))
  File "/usr/lib/python3.7/site-packages/dnfdaemon/client/__init__.py", line 293, in _run_dbus_async
    result = self._get_result(data)
  File "/usr/lib/python3.7/site-packages/dnfdaemon/client/__init__.py", line 277, in _get_result
    self._handle_dbus_error(user_data['error'])
  File "/usr/lib/python3.7/site-packages/dnfdaemon/client/__init__.py", line 250, in _handle_dbus_error
    raise DaemonError(str(err))
dnfdaemon.client.DaemonError: g-io-error-quark: GDBus.Error:org.baseurl.DnfSystem.LockedError: dnf is locked by another application (36)

So, this tells us quite a bit of stuff. We know we're crashing in some sort of 'search' operation, and dbus seems to be involved. We can also see a bit more of the architecture here. Note how we have dnfdragora/dnf_backend.py and dnfdaemon/client/__init__.py included in the trace, even though we're only in the dnfdragora executable here (dnfdaemon is a separate process). Looking at that and then looking at those files a bit, it's quite easy to see that the dnfdaemon Python library provides a sort of framework for a client class called (oddly enough) DnfDaemonBase which the actual client - dnfdragora in our case - is expected to subclass and flesh out. dnfdragora does this in a class called DnfRootBackend, which inherits from both dnfdragora.backend.Backend (a sort of abstraction layer for dnfdragora to have multiple of these backends, though at present it only actually has this one) and dnfdaemon.client.Client, which is just a small extension to DnfDaemonBase that adds some dbus signal handling.

So now we know more about the design we're dealing with, and we can also see that we're trying to do some sort of search operation which looks like it works by the client class communicating with the actual dnfdaemon server process via dbus, only we're hitting some kind of error in that process, and interpreting it as 'dnf is locked by another application'. If we dig a little deeper, we can figure out a bit more. We have to read through all of the backtrace frames and examine the functions, but ultimately we can figure out that DnfRootBackend.Search() is wrapped by dnfdragora.misc.ExceptionHandler, which handles dnfdaemon.client.DaemonError exceptions - like the one that's ultimately getting raised here! - by calling the base class's own exception_handler() on them...and for us, that's BaseDragora.exception_handler, one of the two places we found earlier that ultimately produces this "DNF is locked by another process. dnfdragora will exit" text. We also now have two indications (the dbus error itself, and the code in exception_handler()) that the error we're dealing with is "LockedError".

A misleading error...

At this point, I went looking for the text LockedError, and found it in two files in dnfdaemon that are kinda variants on each other - daemon/dnfdaemon-session.py and daemon/dnfdaemon-system.py. I didn't actually know offhand which of the two is used in our case, but it doesn't really matter, because the codepath to LockedError is the same in both. There's a function called check_lock() which checks that self._lock == sender, and if it doesn't, raises LockedError. That sure looks like where we're at.

So at this point I did a bit of poking around into how self._lock gets set and unset in the daemon. It turns out to be pretty simple. The daemon is basically implemented as a class with a bunch of methods that are wrapped by @dbus.service.method, which makes them accessible as DBus methods. (One of them is Search(), and we can see that the client class's own Search() basically just calls that). There are also methods called Lock() and Unlock(), which - not surprisingly - set and release this lock, by setting the daemon class' self._lock to be either an identifier for the DBus client or None, respectively. And when the daemon is first initialized, the value is set to None.

At this point, I realized that the error we're dealing with here is actually a lie in two important ways:

  1. The message claims that the problem is the lock being held "by another application", but that's not what check_lock() checks, really. It passes only if the caller holds the lock. It does fail if the lock is held "by another application", but it also fails if the lock is not held at all. Given all the code we looked at so far, we can't actually trust the message's assertion that something else is holding the lock. It is also possible that the lock is not held at all.
  2. The message suggests that the lock in question is a lock on dnf itself. I know dnf/libdnf do have locking, so up to now I'd been assuming we were actually dealing with the locking in dnf itself. But at this point I realized we weren't. The dnfdaemon lock code we just looked at doesn't actually call or wrap dnf's own locking code in any way. This lock we're dealing with is entirely internal to dnfdaemon. It's really a lock on the dnfdaemon instance itself.
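Both points are visible if we condense the locking scheme into a standalone sketch (heavily simplified; the real methods are DBus-exported via @dbus.service.method and the sender ids come from the bus):

```python
class LockedError(Exception):
    pass

class Daemon:
    def __init__(self):
        # No client holds the lock when the daemon starts (or restarts!).
        self._lock = None

    def Lock(self, sender):
        if self._lock is None:
            self._lock = sender

    def Unlock(self, sender):
        if self._lock == sender:
            self._lock = None

    def check_lock(self, sender):
        # This fails when *another* client holds the lock, but also
        # when nobody holds the lock at all.
        if self._lock != sender:
            raise LockedError("dnf is locked by another application")

daemon = Daemon()
daemon.Lock(":1.1835")
daemon.check_lock(":1.1835")   # fine: this sender holds the lock
```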

So, at this point I started thinking of the error as being "dnfdaemon is either locked by another DBus client, or not locked at all".

So what's going on with this lock anyway?

My next step, now I understood the locking process we're dealing with, was to stick some logging into it. I added log lines to the Lock() and Unlock() methods, and I also made check_lock() log what sender and self._lock were set to before returning. Because it sets self._lock to None, I also added a log line to the daemon's init that just records that we're in it. That got me some more useful information:

2019-10-14 18:53:03.397784 XXX In DnfDaemon.init now!
2019-10-14 18:53:03.402804 XXX LOCK: sender is :1.1835
2019-10-14 18:53:03.407524 XXX CHECK LOCK: sender is :1.1835
XXX CHECK LOCK: self._lock is :1.1835
2019-10-14 18:53:07.556499 XXX CHECK LOCK: sender is :1.1835
XXX CHECK LOCK: self._lock is :1.1835
[...snip a bunch more calls to check_lock where the sender is the same...]
2019-10-14 18:53:13.560828 XXX CHECK LOCK: sender is :1.1835
XXX CHECK LOCK: self._lock is :1.1835
2019-10-14 18:53:13.560941 XXX CHECK LOCK: sender is :1.1835
XXX CHECK LOCK: self._lock is :1.1835
2019-10-14 18:53:16.513900 XXX In DnfDaemon.init now!
2019-10-14 18:53:16.516724 XXX CHECK LOCK: sender is :1.1835
XXX CHECK LOCK: self._lock is None

so we could see that when we started dnfdragora, dnfdaemon started up and dnfdragora locked it almost immediately, then throughout the whole process of reproducing the bug - run dnfdragora, search for a package to be updated, mark it for updating, run the transaction, wait for the error - there were several instances of DBus method calls where everything worked fine (we see check_lock() being called and finding sender and self._lock set to the same value, the identifier for dnfdragora), but then suddenly we see the daemon's init running again for some reason, not being locked, and then a check_lock() call that fails because the daemon instance's self._lock is None.

After a couple of minutes, I guessed what was going on here, and the daemon's service logs confirmed it - dnfdaemon was crashing and automatically restarting. The first attempt to invoke a DBus method after the crash and restart fails, because dnfdragora has not locked this new instance of the daemon (it has no idea it just crashed and restarted), so check_lock() fails. So as soon as a DBus method invocation is attempted after the dnfdaemon crash, dnfdragora errors out with the confusing "dnf is locked by another process" error.

The crash was already mentioned in the bug report, but until now the exact interaction between the crash and the error had not been worked out - we just knew the daemon crashed and the app errored out, but we didn't really know what order those things happened in or how they related to each other.

OK then...why is dnfdaemon crashing?

So, the question now became: why is dnfdaemon crashing? Well, the backtrace we had didn't tell us a lot; really it only told us that something was going wrong in libdbus, which we could also tell from the dnfdaemon service log:

Oct 14 18:53:15 adam.happyassassin.net dnfdaemon-system[226042]: dbus[226042]: arguments to dbus_connection_unref() were incorrect, assertion "connection->generation == _dbus_current_generation" failed in file ../../dbus/dbus-connection.c line 2823.
Oct 14 18:53:15 adam.happyassassin.net dnfdaemon-system[226042]: This is normally a bug in some application using the D-Bus library.
Oct 14 18:53:15 adam.happyassassin.net dnfdaemon-system[226042]:   D-Bus not built with -rdynamic so unable to print a backtrace

that last line looked like a cue, so of course, off I went to figure out how to build DBus with -rdynamic. A bit of Googling told me - thanks "the3dfxdude"! - that the trick is to compile with --enable-asserts. So I did that and reproduced the bug again, and got a bit of a better backtrace. It's a long one, but by picking through it carefully I could spot - in frame #17 - the actual point at which the problem happened, which was in dnfdaemon.server.DnfDaemonBase.run_transaction(). (Note, this is a different DnfDaemonBase class from dnfdaemon.client.DnfDaemonBase; I don't know why they have the same name, that's just confusing).

So, the daemon's crashing on this self.TransactionEvent('end-run', NONE) call. I poked into what that does a bit, and found a design here that kinda mirrors what happens on the client side: this DnfDaemonBase, like the other one, is a framework for a full daemon implementation, and it's subclassed by a DnfDaemon class here. That class defines a TransactionEvent method that emits a DBus signal. So...we're crashing when trying to emit a dbus signal. That all adds up with the backtrace going through libdbus and all. But, why are we crashing?

At this point I tried to make a small reproducer (which basically just set up a DnfDaemon instance and called self.TransactionEvent in the same way, I think) but that didn't work - I didn't know why at the time, but figured it out later. Continuing to trace it out through code wouldn't be that easy because now we're in DBus, which I know from experience is a big complex codebase that's not that easy to just reason your way through. We had the actual DBus error to work from too - "arguments to dbus_connection_unref() were incorrect, assertion "connection->generation == _dbus_current_generation" failed" - and I looked into that a bit, but there were no really helpful leads there (I got a bit more understanding about what the error means exactly, but it didn't help me understand why it was happening at all).

Time for the old standby...

So, being a bit stuck, I fell back on the most trusty standby: trial and error! Well, also a bit of logic. It did occur to me that the dbus broker is itself a long-running daemon that other things can talk to. So I started just wondering if something was interfering with dnfdaemon's connection with the dbus broker, somehow. This was in my head as I poked around at stuff - wherever I wound up looking, I was looking for stuff that involved dbus.

But to figure out where to look, I just started hacking up dnfdaemon a bit. Now this first part is probably pure intuition, but that self._reset_base() call on the line right before the self.TransactionEvent call that crashes bugged me. It's probably just long experience telling me that anything with "reset" or "refresh" in the name is bad news. :P So I thought, hey, what happens if we move it?

I stuck some logging lines into this run_transaction so I knew where we got to before we crashed - this is a great dumb trick, btw, just stick lines like self.logger('XXX HERE 1'), self.logger('XXX HERE 2') etc. between every significant line in the thing you're debugging, and grep the logs for "XXX" - and moved the self._reset_base() call down under the self.TransactionEvent call...and found that when I did that, we got further, the self.TransactionEvent call worked and we crashed the next time something else tried to emit a DBus signal. I also tried commenting out the self._reset_base() call entirely, and found that now we would only crash the next time a DBus signal was emitted after a subsequent call to the Unlock() method, which is another method that calls self._reset_base(). So, at this point I was pretty confident in this description: "dnfdaemon is crashing on the first interaction with DBus after self._reset_base() is called".

So my next step was to break down what _reset_base() was actually doing. Turns out all of the detail is in the DnfDaemonBase skeleton server class: it has a self._base which is a dnf.base.Base() instance, and that method just calls that instance's close() method and sets self._base to None. So off I went into dnf code to see what dnf.base.Base.close() does. Turns out it basically does two things: it calls self._finalize_base() and then calls self.reset(True, True, True).

Looking at the code it wasn't immediately obvious which of these would be the culprit, so it was all aboard the trial and error train again! I changed the call to self._reset_base() in the daemon to self._base.reset(True, True, True)...and the bug stopped happening! So that told me the problem was in the call to _finalize_base(), not the call to reset(). So I dug into what _finalize_base() does and kinda repeated this process - I kept drilling down through layers and splitting up what things did into individual pieces, and doing subsets of those pieces at a time to try and find the "smallest" thing I could which would cause the bug.

To take a short aside...this is what I really like about these kinds of debugging odysseys. It's like being a detective, only ultimately you know that there's a definite reason for what's happening and there's always some way you can get closer to it. If you have enough patience there's always a next step you can take that will get you a little bit closer to figuring out what's going on. You just have to keep working through the little steps until you finally get there.

Eventually I lit upon this bit of dnf.rpm.transaction.TransactionWrapper.close(). That was the key, the closest I could get to the problem: reducing the daemon's self._reset_base() call to just self._base._priv_ts.ts = None (which is what that line does) was enough to cause the bug. That was the one thing, out of everything self._reset_base() does, which caused the problem.

So, of course, I took a look at what this ts thing was. Turns out it's an instance of rpm.TransactionSet, from RPM's Python library. So, at some point, we're setting up an instance of rpm.TransactionSet, and at this point we're dropping our reference to it, which - point to ponder - might trigger some kind of cleanup on it.
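That "point to ponder" is worth spelling out: in CPython, dropping the last reference to an object runs its finalizer immediately. A toy illustration (not the real rpm binding, whose destructor lives in C) of how a plain assignment can trigger cleanup side effects:

```python
class Handle:
    """Stand-in for rpm.TransactionSet: an object whose destruction has
    side effects (in the real case, running RPM plugin cleanup)."""
    cleaned_up = False

    def __del__(self):
        # In the actual bug, cleanup like this wound up calling
        # dbus_shutdown() via an RPM plugin.
        Handle.cleaned_up = True

ts = Handle()
ts = None  # drop the last reference; CPython finalizes immediately
print(Handle.cleaned_up)  # True in CPython, where refcounting frees right away
```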

Remember how I was looking for things that deal with dbus? Well, that turned out to bear fruit at this point...because what I did next was simply to go to my git checkout of rpm and grep it for 'dbus'. And lo and behold...this showed up.

Turns out RPM has plugins (TIL!), and in particular, it has this one, which talks to dbus. (What it actually does is try to inhibit systemd from suspending or shutting down the system while a package transaction is happening). And this plugin has a cleanup function which calls something called dbus_shutdown() - aha!

This was enough to get me pretty suspicious. So I checked my system and, indeed, I had a package rpm-plugin-systemd-inhibit installed. I poked at dependencies a bit and found that python3-dnf recommends that package, which means it'll basically be installed on nearly all Fedora installs. Still looking like a prime suspect. So, it was easy enough to check: I put the code back to a state where the crash happened, uninstalled the package, and tried again...and bingo! The crash stopped happening.

So at this point the case was more or less closed. I just had to do a bit of confirming and tidying up. I checked and it turned out that indeed this call to dbus_shutdown() had been added quite recently, which tied in with the bug not showing up earlier. I looked up the documentation for dbus_shutdown() which confirmed that it's a bit of a big cannon which certainly could cause a problem like this:

"Frees all memory allocated internally by libdbus and reverses the effects of dbus_threads_init().

libdbus keeps internal global variables, for example caches and thread locks, and it can be useful to free these internal data structures.

...

You can't continue to use any D-Bus objects, such as connections, that were allocated prior to dbus_shutdown(). You can, however, start over; call dbus_threads_init() again, create new connections, and so forth."

and then I did a scratch build of rpm with the commit reverted, tested, and found that indeed, it solved the problem. So, we finally had our culprit: when the rpm.TransactionSet instance went out of scope, it got cleaned up, and that resulted in this plugin's cleanup function getting called, and dbus_shutdown() happening. The RPM devs had intended that call to clean up the RPM plugin's DBus handles, but this is all happening in a single process, so the call also cleaned up the DBus handles used by dnfdaemon itself, and that was enough (as the docs suggest) to cause any further attempts to communicate with DBus in dnfdaemon code to blow up and crash the daemon.
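The failure mode generalizes beyond RPM and libdbus: a plugin that tears down process-wide library state takes the host process's own handles down with it. Here's a toy model (pure illustration, not libdbus code) of why the daemon's existing connection then trips the generation assertion from the crash:

```python
class FakeLibDBus:
    """Stand-in for libdbus's process-wide internal state."""
    current_generation = 1

    @classmethod
    def shutdown(cls):
        # Like dbus_shutdown(): free all internal state. Any connection
        # created before this point is now invalid.
        cls.current_generation += 1

class Connection:
    def __init__(self):
        self.generation = FakeLibDBus.current_generation

    def emit_signal(self):
        # Mirrors the failed assertion from the crash:
        # connection->generation == _dbus_current_generation
        if self.generation != FakeLibDBus.current_generation:
            raise RuntimeError(
                "connection->generation == _dbus_current_generation failed")

daemon_conn = Connection()  # the daemon's own connection to the broker
FakeLibDBus.shutdown()      # plugin cleanup runs in the same process
# daemon_conn.emit_signal() would now raise: this is where the daemon crashed
```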

So, that's how you get from dnfdragora claiming that DNF is locked by another process to a stray RPM plugin crashing dnfdaemon on a DBus interaction!

Converting fedmsg consumers to fedora-messaging

Posted by Adam Williamson on June 10, 2019 12:24 PM

So in case you hadn't heard, the Fedora infrastructure team is currently trying to nudge people in the direction of moving from fedmsg to fedora-messaging.

Fedmsg is the Fedora project-wide messaging bus we've had since 2012. It backs FMN / Fedora Notifications and Badges, and is used extensively within Fedora infrastructure for the general purpose of "have this one system do something whenever this other system does something else". For instance, openQA job scheduling and result reporting are both powered by fedmsg.

Over time, though, there have turned out to be a few issues with fedmsg. It has a few awkward design quirks, but most significantly, it's designed such that message delivery can never be guaranteed. In practice it's very reliable and messages almost always are delivered, but for building critical systems like Rawhide package gating, the infrastructure team decided we really needed a system where message delivery can be formally guaranteed.

There was initially an idea to build a sort of extension to fedmsg allowing for message delivery to be guaranteed, but in the end it was decided instead to replace fedmsg with a new AMQP-based system called fedora-messaging. At present both fedmsg and fedora-messaging are live and there are bridges in both directions: all messages published as fedmsgs are republished as fedora-messaging messages by a 0MQ->AMQP bridge, and all messages published as fedora-messaging messages are republished as fedmsgs by an AMQP->0MQ bridge. This is intended to ease the migration process by letting you migrate a publisher or consumer of fedmsgs to fedora-messaging at any time without worrying about whether the corresponding consumers and/or publishers have also been migrated.

This is just the sort of project I usually work on in the 'quiet time' after one release comes out and before the next one really kicks into high gear, so since Fedora 30 just came out, last week I started converting the openQA fedmsg consumers to fedora-messaging. Here's a quick write-up of the process and some of the issues I found along the way!

I found these three pages in the fedora-messaging docs to be the most useful:

  1. Consumers
  2. Messages
  3. Configuration (especially the 'consumer-config' part)

Another important bit you might need are the sample config files for the production broker and stable broker.

All the fedmsg consumers I wrote followed this approach, where you essentially write consumer classes and register them as entry points in the project's setup.py. Once the project is installed, the fedmsg-hub service provided by fedmsg runs all these registered consumers (as long as a configuration setting is set to turn them on).

This exact pattern does not exist in fedora-messaging - there is no hub service. But fedora-messaging does provide a somewhat-similar pattern which is the natural migration path for this type of consumer. In this approach you still have consumer classes, but instead of registering them as entry points, you write configuration files for them and place them in /etc/fedora-messaging. You can then run an instantiated systemd service that runs fedora-messaging consume with the configuration file you created.

So to put it all together with a specific example: to schedule openQA jobs, we had a fedmsg consumer class called OpenQAScheduler which was registered as a moksha.consumer called fedora_openqa.scheduler.prod in setup.py, and had a config_key named "fedora_openqa.scheduler.prod.enabled". As long as a config file in /etc/fedmsg.d contained 'fedora_openqa.scheduler.prod.enabled': True, the fedmsg-hub service then ran this consumer. The consumer class itself defined what messages it would subscribe to, using its topic attribute.
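For reference, the old-style registration looked something like this in setup.py (a sketch: the entry point name is from above, but the module path on the right-hand side is my assumption for illustration):

```python
# Fragment of a fedmsg-era setup.py. This mapping would be passed to
# setuptools.setup(entry_points=ENTRY_POINTS); the fedmsg-hub service
# discovers and runs consumers registered under "moksha.consumer".
ENTRY_POINTS = {
    "moksha.consumer": [
        # name = module:class -- the module path here is illustrative
        "fedora_openqa.scheduler.prod = fedora_openqa.scheduler:OpenQAScheduler",
    ],
}
```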

In a fedora-messaging world, the OpenQAScheduler class is tweaked a bit to handle an AMQP-style message, and the entrypoint in setup.py and the config_key in the class are removed. Instead, we create a configuration file /etc/fedora-messaging/fedora_openqa_scheduler.toml and enable and start the fm-consumer@fedora_openqa_scheduler.service systemd service. Note that all the necessary bits for this are shipped in the fedora-messaging package, so you need that package installed on the system where the consumer will run.

That configuration file looks pretty much like the sample I put in the repository, which is based on the sample files I mentioned above.

The amqp_url specifies which AMQP broker to connect to and what username to use: in this sample we're connecting to the production Fedora broker and using the public 'fedora' identity. The callback specifies the Python path to the consumer callback class (our OpenQAScheduler class). The [tls] section points to the CA certificate, certificate and private key to be used for authenticating with the broker: since we're using the public 'fedora' identity, these are the files shipped in the fedora-messaging package itself which let you authenticate as that identity. For production use, I think the intent is that you request a separate identity from Fedora infra (who will generate certs and keys for it) and use that instead - so you'd change the amqp_url and the paths in the [tls] section appropriately.

The other key things you have to set are the queue name and the routing_keys in the [[bindings]] section. The queue name appears twice in the sample file as 00000000-0000-0000-0000-000000000000; for each consumer you are supposed to generate a random UUID with uuidgen and use that as the queue name, so that each consumer has its own queue. The routing_keys are the topics the consumer will subscribe to - unlike in the fedmsg system, this is set in configuration rather than in the consumer class itself. Another thing you may wish to take advantage of is the consumer_config section: this is basically a freeform configuration store that the consumer class can read settings from. So you can have multiple configuration files that run the same consumer class but with different settings - you might well have different 'production' and 'staging' configurations. We do indeed use this for the openQA job scheduler consumer: we use a setting in this consumer_config section to specify the hostname of the openQA instance to connect to.
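Putting those pieces together, a consumer config file looks roughly like this. This is a sketch assembled from the elements described above and the linked sample files; the callback path and the consumer_config key are illustrative assumptions, and you'd replace the zero UUID with one from uuidgen:

```toml
# Illustrative /etc/fedora-messaging/fedora_openqa_scheduler.toml
amqp_url = "amqps://fedora:@rabbitmq.fedoraproject.org/%2Fpublic_pubsub"
callback = "fedora_openqa.consumer:OpenQAScheduler"

[tls]
ca_cert = "/etc/fedora-messaging/cacert.pem"
certfile = "/etc/fedora-messaging/fedora-cert.pem"
keyfile = "/etc/fedora-messaging/fedora-key.pem"

[queues.00000000-0000-0000-0000-000000000000]
durable = false
auto_delete = true
exclusive = false
arguments = {}

[[bindings]]
queue = "00000000-0000-0000-0000-000000000000"
exchange = "amq.topic"
routing_keys = ["org.fedoraproject.prod.bodhi.update.request.testing"]

[consumer_config]
openqa_hostname = "localhost"
```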

So, what needs changing in the actual consumer class itself? For me, there wasn't a lot. For a start, the class should now just inherit from object - there is no base class for consumers in the fedora-messaging world, no equivalent to fedmsg.consumers.FedmsgConsumer. You can remove things like the topic attribute (that's now set in configuration) and validate_signatures. You may want to set up an __init__(), which is a good place to read in settings from consumer_config and set up a logger (more on logging in a bit). The method for actually reading a message should be named __call__() (so yes, fedora-messaging just calls the consumer instance itself on the message, rather than explicitly calling one of its methods). And the message object itself the method receives is slightly different: it will be an instance of fedora_messaging.api.Message or a subclass of it, not just a dict. The topic, body and other bits of the message are available as attributes, not dict items. So instead of message['topic'], you'd use message.topic. The message body is message.body.
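Put together, a converted consumer class can be as small as this (a sketch with illustrative names; the real OpenQAScheduler does rather more):

```python
import logging

class ExampleConsumer:
    """Sketch of a fedora-messaging consumer callback. No base class is
    needed; fedora-messaging calls the instance itself per message."""

    def __init__(self):
        # settings could also be read from consumer_config here
        self.logger = logging.getLogger(self.__class__.__name__)

    def __call__(self, message):
        # message is a fedora_messaging.api.Message (or subclass):
        # topic and body are attributes, not dict items
        self.logger.info("received message on topic %s", message.topic)
        self.last_body = message.body  # stand-in for real handling
```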

Here I ran into a significant wrinkle. If you're consuming a native fedora-messaging message, the message.body will be the actual body of the message. However, if you're consuming a message that was published as a fedmsg and has been republished by the fedmsg->fedora-messaging bridge, message.body won't be what you'd probably expect. Looking at an example fedmsg, we'd probably expect the message.body of the converted fedora-messaging message to be just the msg dict, right? Just a dict with keys repo and agent. However, at present, the bridge actually publishes the entire fedmsg as the message.body - what you get as message.body is that whole dict. To get to the 'true' body, you have to take message.body['msg']. This is a problem because whenever the publisher is converted to fedora-messaging, there won't be a message.body['msg'] any more, and your consumer will likely break. It seems that the bridge's behavior here will likely be changed soon, but for now, this is a bit of a problem.

Once I figured this out, I wrote a little helper function called _find_true_body to fudge around this issue. You are welcome to steal it for your own use if you like. It should always find the 'true' body of any message your consumer receives, whether it's native or converted, and it will work when the bridge is fixed in future too so you won't need to update your consumer when that happens (though later on down the road it'll be safe to just get rid of the function and use message.body directly).
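The helper amounts to a small unwrapping check; something along these lines (a reconstruction of the idea, not the actual code from the repo):

```python
def find_true_body(message):
    """Return the 'true' body of a fedora-messaging message, whether it
    is native or was republished from fedmsg by the bridge (which
    currently ships the whole fedmsg envelope as message.body)."""
    body = message.body
    # A republished fedmsg envelope carries the original body under
    # 'msg', alongside envelope keys like 'topic'; a native body won't.
    if isinstance(body, dict) and "msg" in body and "topic" in body:
        return body["msg"]
    return body
```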

Those things, plus rejigging the logging a bit, were all I needed to do to convert my consumers - it wasn't really that much work in the end.

To dig into logging a bit more: fedmsg consumer class instances had a log() method you could use to send log messages, you didn't have to set up your own logging infrastructure. (Although a problem of this system was that it gave no indication which consumer a log message came from). fedora-messaging does not have this. If you want a consumer to log, you have to set up the logging infrastructure within the consumer, and tweak the configuration file a bit.

The pattern I chose was to import logging and then set up a logger instance for each consumer class in its __init__(), like this:

self.logger = logging.getLogger(self.__class__.__name__)

Then you can log messages with self.logger.info("message") or whatever. I thought that would be all I'd need, but actually, if you just do that, there's nothing set up to actually receive the messages and log them anywhere. So you have to add a bit to the TOML config file that looks like this:

[log_config.loggers.OpenQAScheduler]
level = "INFO"
propagate = false
handlers = ["console"]

The OpenQAScheduler there is the class name; change it to the actual name of the consumer class. That will have the messages logged to the console, which - when you run the consumer as a systemd service - means they wind up in the system journal, which was enough for me. You can also configure a handler to send email alerts, for instance, if you like - you can see an example of this in Bodhi's config file.

One other wrinkle I ran into was with authenticating to the staging broker. The sample configuration file has the right URL and [tls] section for this, but the files referenced in the [tls] section aren't actually in the fedora-messaging package. To successfully connect to the staging broker, as fedora.stg, you need to grab the necessary files from the fedora-messaging git repo and place them into /etc/fedora-messaging.

To see the whole of the changes I had to make to the openQA consumers, you can look at the commits on the fedora-messaging branch of the repo and also this set of commits to the Fedora infra ansible repo.

New openQA tests: update live image build/install

Posted by Adam Williamson on February 08, 2019 10:20 AM

Hot on the heels of adding installer image build/install tests to openQA, I've now added tests which do just the same, but for the Workstation live image.

That means that, when running the desktop tests for an update, openQA will also run a test that builds a Workstation live image and a test that boots and installs it. The packages from the update will be used - if relevant - in the live image creation environment, and included in the live image itself. This will allow us to catch problems in updates that relate to the build and basic functionality of live images.

Here's an update where you can see that both the installer and live image build tests ran successfully and passed - see the updates-everything-boot-iso and updates-workstation-live-iso flavors.

I'm hoping this will help us catch compose issues much more easily during the upcoming Fedora 30 release process.

Devconf.cz 2019 trip report

Posted by Adam Williamson on February 06, 2019 11:01 AM

I've just got back from my Devconf.cz 2019 trip, after spending a few days after the conference in Red Hat's Brno office with other Fedora QA team members, then a few days visiting family.

I gave both my talks - Don't Move That Fence 'Til You Know Why It's There and Things Fedora QA Robots Do - and both were well-attended and, I think, well-received. The slide decks are up on the talk pages, and recordings should I believe go up on the Devconf Youtube channel soon.

I attended many other talks, my personal favourite being Stef Walter's Using machine learning to find Linux bugs. Stef noticed something I also have noticed in our openQA work - that "test flakes" are very often not just some kind of "random blip" but genuine bugs that can be investigated and fixed with a little care - and ran with it, using the extensive amount of results generated by the automated test suite for Cockpit as input data for a machine learning-based system which clusters "test flakes" based on an analysis of key data from the logs for each test. In this way they can identify when a large number of apparent "flakes" seem to have significant common features and are likely to be occurrences of the same bug, allowing them then to go back and analyze the commonalities between those cases and identify the underlying bug. We likely aren't currently running enough tests in openQA to utilize the approach Stef outlined in full, but the concept is very interesting and may be useful in future with more data, and perhaps for Fedora CI results.

Other useful and valuable talks I saw included Dan Walsh on podman, Lennart Poettering on portable services, Daniel Mach and Jaroslav Mracek on the future of DNF, Kevin Fenzi and Stephen Smoogen on the future of EPEL, Jiri Benc and Marian Šámal on a summer coding camp for kids, Ben Cotton on community project management, the latest edition of Will Woods' and Stephen Gallagher's crusade to kill all scriptlets, and the Fedora Council BoF.

There were also of course lots of useful "hallway track" sessions with Miroslav Vadkerti, Kevin Fenzi, Mohan Boddu, Patrick Uiterwijk, Alexander Bokovoy, Dominik Perpeet, Matthew Miller, Brian Exelbierd and many more - it is invaluable to be able to catch up with people in person and discuss things that are harder to do in tickets and IRC.

As usual it was an enjoyable and productive event, and the rum list at the Bar That Doesn't Exist remains as impressive as ever...;)

Devconf.cz 2019

Posted by Adam Williamson on January 24, 2019 12:31 PM

For anyone who - inexplicably - hasn't already had it in their social calendar in pink sharpie for months, I will be at Devconf.cz 2019 this weekend, at FIT VUT in Brno. I'll be doing two talks: Things Fedora QA Robots Do on Friday at 3pm (which is basically a brain dump about the pile of little fedmsg consumers that do quite important jobs that probably no-one knows about but me), and Don't Move That Fence 'Til You Know Why It's There on Saturday at 11am, which is a less QA-specific talk that's about how I reckon you ought to go about changing code. The slides for both talks are up now, if you want a sneak preview (though if you do, you're disqualified from the audience participation section of the "fence" talk!)

Do come by to the talks, if you're around and there's nothing more interesting in that timeslot. Otherwise feel free to buttonhole me around the conference any time.

New openQA tests: update installer tests and desktop app start/stop test

Posted by Adam Williamson on January 23, 2019 02:20 AM

It's been a while since I wrote about significant developments in Fedora openQA, so today I'll be writing about two! I wrote about one of them a bit in my last post, but that was primarily about a bug I ran into along the way, so now let's focus on the changes themselves.

Testing of install media built from packages in updates-testing

We have long had a problem in Fedora testing that we could not always properly test installer changes. This is most significant during the period of development after a new release has branched from Rawhide, but before it is released as the new stable Fedora release (we use the name 'Branched' to refer to a release in this state; in a month or so, Fedora 30 will branch from Rawhide and become the current Branched release).

During most of this time, the Bodhi update system is enabled for the release. New packages built for the release do not immediately appear in any repositories, but - as with stable releases - must be submitted as "updates", sometimes together with related packages. Once submitted as an update, the package(s) are sent to the "updates-testing" repository for the release. This repository is enabled on installed Branched systems by default (this is a difference from stable releases), so testers who have already installed Branched will receive the package(s) at this point (unless they disable the "updates-testing" repository, which some do). However, the package is still not truly a part of the release at this point. It is not included in the nightly testing composes, nor will it be included in any Beta or Final candidate composes that may be run while it is in updates-testing. That means that if the actual release media were composed while the package was still in updates-testing, it would not be a part of the release proper. Packages only become part of these composes once they pass through Bodhi and are 'pushed stable'.

This system allows us to back out packages that turn out to be problematic, and hopefully to prevent them from destabilizing the test and release composes by not pushing them stable if they turn out to cause problems. It also means more conservative testers have the option to disable the "updates-testing" repository and avoid some destabilizing updates, though of course if all the testers did this, no-one would be finding the problems. In the last few years we have also been running several automated tests on updates (via Taskotron, openQA and the CI pipeline) and reporting results from those to Bodhi, allowing packagers to pull the update if the tests find problems.

However, there has long been a bit of a problem in this process: if the update works fine on an installed system but causes problems when included in (for example) an installer image or live image, we had no good way to find that out. There was no system for automatically building media like this that include the updates currently in testing so they could be tested. The only way to find this sort of problem was for testers to manually create test media - a process that is not widely understood, is time consuming, and can be somewhat difficult. We also of course could not do automated testing without media to test.

We've looked at different ways of addressing this in the past, but ultimately none of them came to much (yet), so last year I decided to just go ahead and do something. And after a bit of a roadblock (see that last post), that something is now done!

Our openQA now has two new tests it runs on all the updates it tests. The first test - here's an example run - builds a network install image, and the second - example run - tests it. Most importantly, any packages from the update under testing are both used in the process of building the install image (if they are relevant to that process) and included in the installer image (if they are packages which would usually be in such an image). Thus if the update breaks the production of the image, or the basic functionality of the image itself, this will be caught. This (finally) means that we have some idea whether a new anaconda, lorax, pykickstart, systemd, dbus, blivet, dracut or any one of dozens of other key packages might break the installer. If you're a packager and you see that one of these two tests has failed for your update, we should look into that! If you're not sure how to go about that, you can poke me, bcl, or the anaconda developers in Freenode #anaconda, and we should be able to help.

It is also possible for a human tester to download the image produced by the first test and run more in-depth tests on it manually; I haven't yet done anything to make that possibility more visible or easier, but will try to look into ways of doing that over the next few weeks.

GNOME application start/stop testing

My colleague Lukáš Růžička has recently been looking into what we might be able to do to streamline and improve our desktop application testing, something I'd honestly been avoiding because it seemed quite intractable! After some great work by Lukáš, one major fruit of this work is now visible in Fedora openQA: a GNOME application start/stop test suite. Here's an example run of it - note that more recent runs have a ton of failures caused by a change in GNOME, Lukáš has proposed a change to the test to address that but I have not yet reviewed it.

This big test suite just tests starting and then exiting a large number of the default installed applications on the Fedora Workstation edition, making sure they both launch and exit successfully. This is of course pretty easy for a human to do - but it's extremely tedious and time-consuming, so it's something we don't do very often at all (usually only a handful of times per release cycle), meaning we may not notice that an application which perhaps we don't commonly use has a very critical bug (like failing to launch at all) for some time.

Making an automated system like openQA do this is actually quite a lot of work, so it was a great job by Lukas to get it working. Now by monitoring the results of this test on the nightly composes closely, we should find out much more quickly if one of the tested applications is completely broken (or has gone missing entirely).

AdamW's Debugging Adventures: The Mysterious Disappearing /proc

Posted by Adam Williamson on January 17, 2019 07:15 PM

Yep, folks, it's that time again - time for one of old Grandpa Adam's tall tales of root causing adventure...

There's a sort of catch-22 situation in Fedora that has been a personal bugbear for a very long time. It mainly affects Branched releases - each new Fedora release, when it has branched from Rawhide, but before it has been released. During this period the Bodhi update system is in effect, meaning all new packages have to go through Bodhi review before they are included in the composes for the release. This means, in theory, we should be able to make sure nothing really broken lands in the release. However, there's a big class of really important updates we have never been able to test properly at all: updates that affect the installer.

The catch-22 is this - release engineering only builds install media from the 'stable' package set, those packages that have gone through review. So if a package under review breaks the installer, we can't test whether it breaks the installer unless we push it stable. Well, you can, but it's quite difficult - you have to learn how to build an installer image yourself, then build one containing the packages from the update and test it. I can do that, but most other people aren't going to bother.

I've filed bugs and talked to people about ways to resolve this multiple times over many years, but a few months back I just got sick of the problem and decided to fix it myself. So I wrote an openQA update test which automates the process: it builds an installer image, with the packages from the update available to the installer image build tool. I also included a subsequent test which takes that image and runs an install with it. Since I already had the process for doing this manually down pat, it wasn't actually very difficult.

Only...when I deployed the test to the openQA staging instance and actually tried it out, I found the installer image build would frequently fail in a rather strange way.

The installer image build process works (more or less) by creating a temporary directory, installing a bunch of packages to it (using dnf's feature of installing to an alternative 'root'), fiddling around with that environment a bit more, creating a disk image whose root is that temporary directory, then fiddling with the image a bit to make it into a bootable ISO. (HANDWAVE HANDWAVE). However, I was finding it would commonly fail in the 'fiddling around with the environment' stage, because somehow some parts of the environment had disappeared. Specifically, it'd show this error:

FileNotFoundError: [Errno 2] No such file or directory: '/var/tmp/lorax.q8xfvc0p/installroot//proc/modules'

lorax was, at that point, trying to touch that file (never mind why). That's the /proc/modules inside the temporary root, basically. The question was, why was it disappearing? And why had neither myself nor bcl (the lorax maintainer) seen it happening previously in manual use, or in official composes?

I tried reproducing it in a virtual machine...and failed. Then I tried again, and succeeded. Then I ran the command again...and it worked! That pattern turned out to repeat: I could usually get it to happen the first time I tried it in a VM, but any subsequent attempts in the same VM succeeded.

So this was seeming really pretty mysterious. Brian couldn't get it to happen at all.

At this point I wrote a dumb, short Python script which just constantly monitored the disappearing location and told me when it appeared and disappeared. I hacked up the openQA test to run this script, and upload the result. Using the timestamps, I was able to figure out exactly what bit of lorax was running when the directory suddenly disappeared. But...I couldn't immediately see why anything in that chunk of lorax would wind up deleting the directory.
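The monitor was essentially just a polling loop; a reconstruction of the idea (not the original script, and the timings are illustrative):

```python
import os
import time

def watch(path, interval=0.1, iterations=None):
    """Poll path, printing a timestamped line whenever it appears or
    disappears; returns the list of observed state changes."""
    events = []
    present = os.path.exists(path)
    count = 0
    while iterations is None or count < iterations:
        now = os.path.exists(path)
        if now != present:
            state = "appeared" if now else "disappeared"
            print(time.asctime(), path, state)
            events.append(state)
            present = now
        time.sleep(interval)
        count += 1
    return events

# e.g. watch('/var/tmp/lorax.q8xfvc0p/installroot/proc/modules')
```

Matching the printed timestamps against the test's own logs is what pinpointed which bit of lorax was running when the path vanished.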

At this point, other work became more important, and I wound up leaving this on the back burner for a couple of months. Then I came back to it a couple days ago. I picked back up where I left off, and did a custom build of lorax with some debug logging statements strewn around the relevant section, to figure out really precisely where we were when things went wrong. But this turned out to be a bit of a brick wall, because it turned out that at the time the directory disappeared, lorax was just...running mksquashfs. And I could not figure out any plausible reason at all why a run of mksquashfs would cause the directory to vanish.

After a bit, though, the thought struck me - maybe it's not lorax itself wiping the directory out at all! Maybe something else is doing it. So I thought to look at the system logs. And lo and behold, I found my smoking gun. At the exact time my script logged that the directory had disappeared, this message appeared in the system log:

Jan 18 01:57:30 ibm-p8-kvm-03-guest-02.virt.pnr.lab.eng.rdu2.redhat.com systemd[1]: Starting Cleanup of Temporary Directories...

Now, remember that our problem directory is in /var/tmp. So this smells very suspicious indeed! So I figured out what that service actually is - to do this, you just grep for the description ("Cleanup of Temporary Directories") in /usr/lib/systemd/system - and it turned out to be /usr/lib/systemd/system/systemd-tmpfiles-clean.service, which is part of systemd's systemd-tmpfiles mechanism, which you can read up on in great detail in man systemd-tmpfiles and man tmpfiles.d.

I had run into it a few times before, so I had a vague idea what I was dealing with and what to look for. It's basically a mechanism for managing temporary files and directories: you can write config snippets which systemd will read, and it will do things like creating expected temporary files or directories on boot (this lets packages manage temporary directories without doing it themselves in scriptlets). I poked through the docs again and, sure enough, it turns out another thing the mechanism can do is delete temporary files that reach a certain age:

Age
The date field, when set, is used to decide what files to delete when cleaning. If a file or directory is older than the current time minus the age field, it is deleted. The field format is a series of integers each followed by one of the following suffixes for the respective time units: s, m or min, h, d, w, ms, and us, meaning seconds, minutes, hours, days, weeks, milliseconds, and microseconds, respectively. Full names of the time units can be used too.

This systemd-tmpfiles-clean.service does that job. So I went looking for tmpfiles.d snippets that cover /var/tmp, and sure enough, found one, in Fedora's stock config file /usr/lib/tmpfiles.d/tmp.conf:

q /var/tmp 1777 root root 30d

The 30d there is the 'age' field. So this tells the tmpfiles mechanism that it's fine to wipe anything under /var/tmp which is older than 30 days.
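Incidentally, tmpfiles.d also supports an x line type for excluding paths from cleaning, so a drop-in like the following (the filename and path are hypothetical, just for illustration) would exempt a matching tree from the 30-day sweep:

```
# /etc/tmpfiles.d/lorax-workaround.conf (hypothetical)
x /var/tmp/lorax*
```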

Of course, naively we might think our directory won't be older than 30 days - after all, we only just ran lorax! But remember, lorax installs packages into this temporary directory, and files and directories in packages get some of their time attributes from the package. So we (at this point, Brian and I were chatting about the problem as I poked it) looked into how systemd-tmpfiles defines age, precisely:

The age of a file system entry is determined from its last modification timestamp (mtime), its last access timestamp (atime), and (except for directories) its last status change timestamp (ctime). Any of these three (or two) values will prevent cleanup if it is more recent than the current time minus the age field.

So since our thing is a directory, its mtime and atime are relevant. So Brian and I both looked into those. He did it manually, while I hacked up my check script to also print the mtime and atime of the directory when it existed. And sure enough, it turned out these were several months in the past - they were obviously related to the date the filesystem package (from which /proc/modules comes) was built. They were certainly more than 30 days in the past.
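In code terms, the check that matters for a directory boils down to something like this (a simplified sketch of the rule quoted above, not systemd's actual implementation - the real logic has more cases):

```python
import os
import time

MAX_AGE = 30 * 24 * 3600  # a tmpfiles.d age of "30d", in seconds

def would_be_cleaned(path, max_age=MAX_AGE, now=None):
    """Return True if a directory at `path` would be eligible for cleanup:
    for directories, only mtime and atime count (ctime is excluded), and
    either one being recent enough saves the entry."""
    st = os.stat(path)
    now = time.time() if now is None else now
    newest = max(st.st_mtime, st.st_atime)
    return newest < now - max_age
```

With timestamps inherited from a package built months ago, such a check comes back True even though the directory was created seconds earlier.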

Finally, I looked into what was actually running systemd-tmpfiles-clean.service; it's run on a timer, systemd-tmpfiles-clean.timer. That timer is set to run the service 15 minutes after the system boots, and every day thereafter.

So all of this hooked up nicely into a convincing story. openQA kept running into this problem because it always runs the test in a freshly-booted VM - that '15 minutes after boot' was turning out to be right in the middle of the image creation. My manual reproductions were failing on the first try for the same reason - but then succeeding on the second and subsequent tries because the cleaner would not run again until the next day. And Brian and I had either never or rarely seen this when we ran lorax manually for one reason or another because it was pretty unlikely the "once a day" timer would happen to wake up and run just when we had lorax running (and if it did happen, we'd try again, and when it worked, we'd figure it was just some weird transient failure). The problem likely never happens in official composes, I think, because the tmpfiles timer isn't active at all in the environment lorax gets run in (haven't double-checked this, though).

Brian now gets to deal with the thorny problem of trying to fix this somehow on the lorax side (so the tmpfiles cleanup won't remove bits of the temporary tree even if it does run while lorax is running). Now that I knew what was going on, it was easy enough to work around in the openQA test - I just have the test do systemctl stop systemd-tmpfiles-clean.timer before running the image build.

AdamW's Debugging Adventures: Python 3 Porting 201

Posted by Adam Williamson on January 08, 2019 08:12 PM

Hey folks! Time for another edition of AdamW's Debugging Adventures, wherein I boast about how great I am at fixin' stuff.

Today's episode is about a bug in the client for Fedora's Koji buildsystem which has been biting more and more Fedora maintainers lately. The most obvious thing it affects is task watching. When you do a package build with fedpkg, it will by default "watch" the build task - it'll update you when the various subtasks start and finish, and not quit until the build ultimately succeeds or fails. You can also directly watch tasks with koji watch-task. So this is something Fedora maintainers see a lot. There's also a common workflow where you chain something to the successful completion of a fedpkg build or koji watch-task, which relies on the task watch completing successfully and exiting 0, if the build actually completed.

However, recently, people noticed that this task watching seemed to be just...failing, quite a lot. While the task was still running, it'd suddenly exit, usually showing this message:

ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response',))

After a while, nirik realized that this seemed to be associated with the client switching from running under Python 2 by default to running under Python 3 by default: the failure shows up when running on Python 3, but not when running on Python 2.

Today I finally decided it had got annoying enough that I'd spend some time trying to track it down.

It's pretty obvious that the message we see relates to an exception, in some way. But ultimately something is catching that exception and printing it out and then exiting (we're not actually getting a traceback, as you do if the exception is ultimately left to reach the interpreter). So my first approach was to dig into the watch-task code from the top down, and try and find something that handles exceptions that looks like it might be the bit we were hitting.

And...I failed! This happens, sometimes. In fact I still haven't found the exact bit of code that prints the message and exits. It's OK. Don't give up. Try something else!

So what I did next was kind of a long shot - I just grepped the code for the exception text. I wasn't really expecting this to work, as there's nothing to suggest the actual exception is part of Koji; it's most likely the code doesn't contain any of that text at all. But hey, it's easy to do, so why not? And as it happened, I got lucky and hit paydirt: there happens to be a comment with some of the text from the error we're hitting. And it sure looks like it might be relevant to the problem we're having! The comment itself, and the function it's in, looked so obviously promising that I went ahead and dug a little deeper.

That function, is_conn_error(), is used by only one other thing: this _sendCall() method in the same file. And that seems very interesting, because what it does can be boiled down to: "hey, we got an error! OK, send it to is_conn_error(). If that returns True, then just log a debug message and kick the session. If that returns False, then raise an exception". That behaviour obviously smells a lot like it could be causing our problem. So, I now had a working theory: for some reason, given some particular server behaviour, is_conn_error() returns True on Python 2 but False on Python 3. That causes this _sendCall() to raise an exception instead of just resetting the session and carrying on, and some other code - which we no longer need to find - catches that exception, prints it, and quits.

The next step was to test this theory - because at this point it's only a theory, it could be entirely wrong. I've certainly come up with entirely plausible theories like this before which turned out to be not what was going on at all. So, like a true lazy shortcut enthusiast, I hacked up my local copy of Koji's __init__.py and sprinkled a bunch of lines like print("HERE 1!") and print("HERE 2!") through the whole of is_conn_error(). Then I just ran koji watch-task commands on random tasks until one failed.

This is fine. When you're just trying to debug the problem you don't need to be super elegant about it. You don't need to do a proper git patch and rebuild the Koji package for your system and use proper logging methods and all the rest of it. Just dumping some print lines in a working copy of the file is just fine, if it works. Just remember to put everything back as it was before later. :)

So, as it happened the god of root causing was on my side today, and it turned out I was right on the money. When one of the koji watch-task commands failed, it hit my HERE 1! and HERE 3! lines right when it died. Those told me we were indeed running through is_conn_error() right before the error, and further, where we were coming out of it. We were entering the if isinstance(e, socket.error) block at the start of the function, and returning False because the exception (e) did appear to be an instance of socket.error, but either did not have an errno attribute, or it was not one of errno.ECONNRESET, errno.ECONNABORTED, or errno.EPIPE.

Obviously, this made me curious as to what the exception actually is, whether it has an errno at all, and if so, what it is. So I threw in a few more debugging lines - to print out type(e), and getattr(e, 'errno', 'foobar'). The result of this was pretty interesting. The second print statement gave me 'foobar', meaning the exception doesn't have an errno attribute at all. And the type of the exception was...requests.exceptions.ConnectionError.

That's a bit curious! You wouldn't necessarily expect requests.exceptions.ConnectionError to be an instance of socket.error, would you? So why are we in a block that only handles instances of socket.error? Also, it's clear the code doesn't expect this, because there's a block later in the function that explicitly handles instances of requests.exceptions.ConnectionError - but because this earlier block that handles socket.error instances always returns, we will never reach that block if requests.exceptions.ConnectionError instances are also instances of socket.error. So there's clearly something screwy going on here.

So of course the next thing to do is...look up socket.error in the Python 2 and Python 3 docs. ANY TIME you're investigating a mysterious Python 3 porting issue, remember this can be useful. Here's the Python 2 socket.error entry, and the Python 3 socket.error entry. And indeed there's a rather significant difference! The Python 2 docs talk about socket.error as an exception that is, well, its own unique thing. However, the Python 3 docs say: "A deprecated alias of OSError." - and even tell us specifically that this changed in Python 3.3: "Changed in version 3.3: Following PEP 3151, this class was made an alias of OSError." Obviously, this is looking an awful lot like one more link in the chain of what's going wrong here.

With a bit of Python knowledge you should be able to figure out what's going on now. Think: if socket.error is now just an alias of OSError, what does if isinstance(e, socket.error) mean in Python 3.3+? It means exactly the same as if isinstance(e, OSError). And guess what? requests.exceptions.ConnectionError happens to be a subclass of OSError. Thus, if e is an instance of requests.exceptions.ConnectionError, isinstance(e, socket.error) will return True in Python 3.3+. In Python 2, it returns False. It's easy to check this in an interactive Python shell or with a test script, to confirm.
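You can see both halves of this in a Python 3 shell. Since requests may not be installed everywhere, the sketch below uses a stand-in OSError subclass where the real-world case would be requests.exceptions.ConnectionError:

```python
import socket

# In Python 3.3+, socket.error is literally the same class as OSError.
print(socket.error is OSError)  # True on Python 3.3+

class FakeConnectionError(OSError):
    """Stand-in for requests.exceptions.ConnectionError, which is an
    OSError subclass in Python 3."""

# So an isinstance(e, socket.error) check now matches it too.
e = FakeConnectionError("Connection aborted.")
print(isinstance(e, socket.error))  # True on Python 3.3+
```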

Because of this, when we run under Python 3 and e is a requests.exceptions.ConnectionError, we unexpectedly enter the block intended for handling socket.error exceptions, and - because that block always returns, via the return False line that gets hit when the errno attribute check fails - we never actually reach the later block that's intended to handle requests.exceptions.ConnectionError instances at all; we return False before we get there.

There are a few different ways you could fix this - you could just drop the return False short-circuit line in the socket.error block, for instance, or change the ordering so the requests.exceptions.ConnectionError handling is done first. In the end I sent a pull request which drops the return False, but also drops the if isinstance(e, socket.error) checks (there's another, for nested exceptions, later) entirely. Since socket.error is deprecated in Python 3.3+ we shouldn't really use it, and we probably don't need to - we can just rely on the errno attribute check alone. Whatever type the exception is, if it has an errno attribute and that attribute is errno.ECONNRESET, errno.ECONNABORTED, or errno.EPIPE, I think we can be pretty sure this is a connection error.
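The errno-only check can be sketched roughly like this (my paraphrase of the approach, not Koji's exact code):

```python
import errno

CONN_ERRNOS = (errno.ECONNRESET, errno.ECONNABORTED, errno.EPIPE)

def is_conn_error(e):
    """Treat any exception carrying a connection-related errno as a
    connection error, regardless of its exact type."""
    return getattr(e, 'errno', None) in CONN_ERRNOS
```

This sidesteps the type-hierarchy question entirely: it behaves the same whether e is an OSError, a socket.error (whatever that aliases to), or a requests exception.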

What's the moral of this debugging tale? I guess it's this: when porting from Python 2 to Python 3 (or doing anything similar to that), fixing the things that outright crash or obviously behave wrong is sometimes the easy part. Even if everything seems to be working fine on a simple test, it's certainly possible that subtler issues like this could be lurking in the background, causing unexpected failures or (possibly worse) subtly incorrect behaviour. And of course, that's just another reason to add to the big old "Why To Have A Really Good Test Suite" list!

There's also a 'secondary moral', I guess, and that's this: predicting all the impacts of an interface change like this is hard. Remember the Python 3 docs mentioned a PEP associated with this change? Well, here it is. If you read it, it's clear the proposers actually put quite a lot of effort into thinking about how existing code might be affected by the change, but it looks like they still didn't consider a case like this. They talk about "Careless (or "naïve") code" which "blindly catches any of OSError, IOError, socket.error, mmap.error, WindowsError, select.error without checking the errno attribute", and about "Careful code is defined as code which, when catching any of the above exceptions, examines the errno attribute to determine the actual error condition and takes action depending on it" - and claim that "useful compatibility doesn't alter the behaviour of careful exception-catching code". However, Koji's code here clearly matches their definition of "careful" code - it considers both the exception's type, and the errno attribute, in making decisions - but because it is not just doing except socket.error as e or similar, but catching the exception elsewhere and then passing it to this function and using isinstance, it still gets tripped up by the change.

So...the ur-moral, as always, is: software is hard!

PSA: System update fails when trying to remove rtkit-0.11-19.fc29

Posted by Kamil Páral on October 15, 2018 11:20 AM
Recently a bug in rtkit packaging has been fixed, but the update will fail on all Fedora 29 pre-release installations that have rtkit installed (Workstation has it for sure). The details and the workaround are described here:

 

Whitelisting rpmlint errors in Taskotron/Bodhi

Posted by Kamil Páral on March 05, 2018 01:14 PM

If you submit a new Fedora update into Bodhi, you’ll see an Automated Tests tab on that update page (an example), and one of the test results (once it’s done) will be from rpmlint. If you click on it, you’ll get a full log with rpmlint output.

If you wish to whitelist some errors which are not relevant for your package or are clearly a mistake (like spelling issues, etc), it is now possible. The steps to do this are described at:

https://fedoraproject.org/wiki/Taskotron/Tasks/dist.rpmlint#Whitelisting_errors

This has been often requested, so hopefully it will help you get the automated test results all in green, instead of being bothered by invalid errors. If something doesn't work, and it seems to be our bug in how we execute rpmlint (rather than a bug in rpmlint itself), please file a bug in task-rpmlint or contact us (qa-devel mailing list, #fedora-qa IRC channel on Freenode).

Automatically shrink your VM disk images when you delete files

Posted by Kamil Páral on October 06, 2017 04:02 PM

Update: This got significantly simpler with newer qemu and virt-manager, read an updated post.

If you use VMs a lot, you know that with the most popular qcow2 disk format, the disk image starts small, but grows with every filesystem change happening inside the VM. Deleting files inside the VM doesn't shrink it. Of course that wastes a lot of disk space on your host - the VMs often contain gigabytes of freed space inside the VM, but not on the host. Shrinking the VM images is possible, but tedious and slow. Well, recently I learned that's actually not true anymore. You can use the TRIM command, used to signal to SSD drives that some space can be freed, to do the same in the virtualization stack - signal from the VM to the host that some space can be freed, so the disk image can be shrunk. How to do that? As usual, this is a shameless copy of instructions found elsewhere on the Internets. The instructions assume you're using virt-manager or libvirt directly.

First, you need to be using qcow2 images, not raw images (you can configure this when adding new disks to your VM).

Second, you need to set your disk bus to SCSI (not VirtIO, which is the default).

disk-scsi

Third, you need to set your SCSI Controller to VirtIO SCSI (not hypervisor default).

controller-scsi

Fourth, you need to edit your VM configuration file using virsh edit vmname and adjust your hard drive’s driver line to include discard='unmap', e.g. like this:

<disk type='file' device='disk'>
 <driver name='qemu' type='qcow2' discard='unmap'/>
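For context, a fuller (hypothetical) disk stanza after the edit might look like the following - only the discard='unmap' attribute on the driver line is the change here; the source path and target details will differ per VM:

```xml
<disk type='file' device='disk'>
  <driver name='qemu' type='qcow2' discard='unmap'/>
  <source file='/var/lib/libvirt/images/example.qcow2'/>
  <target dev='sda' bus='scsi'/>
</disk>
```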

And that’s it. Now you boot your VM and try to issue:

$ sudo fstrim -av
/boot: 319.8 MiB (335329280 bytes) trimmed
/: 101.5 GiB (108928946176 bytes) trimmed

You should see some output printed, even if it’s just 0 bytes trimmed, and not an error.

If you’re using LVM, you’ll also need to edit /etc/lvm/lvm.conf and set:

issue_discards = 1

Then it should work, after a reboot.

Now, if you want trimming to occur automatically in your VM, you have two options (I usually do both):

Enable the fstrim timer that trims the system once a week by default:

$ sudo systemctl enable fstrim.timer

And configure the root filesystem (and any other one you’re interested in) to issue discard command automatically after each file is deleted. Edit /etc/fstab and add a discard mount option, like this:

UUID=6d368798-f4c2-44f9-8334-6be3c64cc449 / ext4 defaults,discard 1 1

And that’s it. Try to create a big file using dd, watch your VM image grow. Then delete the file, watch the image shrink. Awesome. If only we had this by default.
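A quick way to try that out inside the VM (file sizes and paths are arbitrary; the fstrim line needs root and the discard setup described above):

```shell
# Create a 100 MiB file; on the host, the qcow2 image grows.
dd if=/dev/zero of=/tmp/bigfile bs=1M count=100 2>/dev/null
du -h /tmp/bigfile

# Delete it; with the discard mount option (or after an fstrim),
# the freed blocks are passed down and the host image shrinks.
rm /tmp/bigfile
# sudo fstrim -av    # run this in the VM if not using the discard mount option
```

On the host, compare the image size (e.g. with du -h on the qcow2 file) before and after to see the effect.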

SSH to your VMs without knowing their IP address

Posted by Kamil Páral on October 06, 2017 03:31 PM

This is a shameless copy of this blog post, but I felt like I needed to put it here as well, so that I can find it the next time I need it 🙂

libvirt approach

When you run a lot of VMs, especially for testing, every time with a fresh operating system, connecting to them is a pain, because you always need to figure out their IP address first. Turns out that is no longer true. I simply added this snippet to my ~/.ssh/config:

# https://penguindroppings.wordpress.com/2017/09/20/easy-ssh-into-libvirt-vms-and-lxd-containers/
# NOTE: doesn't work with uppercase VM names
Host *.vm
 CheckHostIP no
 Compression no
 UserKnownHostsFile /dev/null
 StrictHostKeyChecking no
 ProxyCommand nc $(virsh domifaddr $(echo %h | sed "s/\.vm//g") | awk -F'[ /]+' '{if (NR>2 && $5) print $5}') %p

and now I can simply execute ssh test.vm for a VM named test and I’m connected! A huge time saver. It doesn’t work with uppercase letters in VM names and I didn’t bother to try to fix that. Also, since I run VMs just for testing purposes, I disabled all ssh security checks (you should not do that for important machines).

avahi approach

There’s also a second approach I used for persistent VMs (those that survive longer than a single install&reboot cycle). You can use Avahi to search for a hostname on the .local domain to find the IP address. Fedora has this enabled by default (if you have the nss-mdns package installed, I believe, which should be the case by default). So, in the VM, set a custom hostname, for example f27:

$ sudo hostnamectl set-hostname f27
$ reboot

Now, you can run ssh f27.local and it should connect you to the VM automatically.

Taskotron: depcheck task replaced by rpmdeplint

Posted by Kamil Páral on June 22, 2017 10:27 AM

If you are a Fedora packager, you might be interested to know that in Taskotron we replaced the depcheck task with rpmdeplint task. So if there are any dependency issues with the new update you submit to Bodhi, you’ll see that as dist.rpmdeplint failure (in the Automated Tests tab). The failure logs should look very similar to the depcheck ones (basically, the logs contain the errors dnf would spit out if it tried to install that package), so there should be no transitioning effort needed.

If you listen for depcheck results somehow, e.g. in FMN, make sure to update your rules to listen for dist.rpmdeplint instead. We have updated the default filters in FMN, so if you haven’t changed them, you should receive notifications for failures in rpmdeplint (and also upgradepath and abicheck) for submitted updates owned by you.

The reason for this switch is that we wanted to get rid of custom dependency checking (done directly on top of libsolv), and use an existing tool for that instead. That saves us time, we don’t need to study all the pitfalls of dependency resolution, and we benefit from someone else maintaining and developing the tool (that doesn’t mean we won’t send patches if needed). rpmdeplint offered exactly what we were looking for.

We will decommission depcheck task from Taskotron execution in the next few days, if there are no issues. Rpmdeplint results are already being published for all proposed updates.

If you have any questions, please ask in comments or reach us at #fedora-qa freenode irc channel or qa-devel (or test or devel) mailing list.

Kernel Performance Testing on ARM

Posted by sumantro on June 06, 2017 12:54 AM
This post will talk about how you can do kernel regression, stress, and performance testing on the ARM architecture.

Setup:

To set up your ARM device, you need an image to get started. I was intending to test the latest compose (Fedora 26 Beta 1.4 on a Raspberry Pi 3 Model B). Download the file (Workstation raw-xz for armhfp) or any variant that you want to test.

Once the file is downloaded, all you need to do is get an SD card and write the image to it.

There are two ways of doing it. One is "Fedora Media Writer", which can now write images for ARM devices. The other way is the old dd; here is how you do it using dd:
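The dd command itself didn't survive the formatting here, so this is a hedged sketch: the image filename and /dev/sdX are placeholders, not from the original post, and dd will destroy whatever device you point it at, so triple-check the target.

```shell
# Placeholder names -- substitute your downloaded image and your SD card device:
# xz -dc Fedora-Workstation-armhfp-*.raw.xz | sudo dd of=/dev/sdX bs=4M status=progress conv=fsync

# Harmless demonstration of the same dd invocation against a plain file:
dd if=/dev/zero of=/tmp/demo.img bs=1M count=8 conv=fsync 2>/dev/null
du -h /tmp/demo.img
```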

Once dd has finished successfully, it's time to plug the SD card into the ARM device and boot it up. Once the ARM device is booted, all you need to do is clone the kernel test suite from here.


Dependencies and Execution:

You will need 2 packages:
1. gcc
2. fedora-python

You can install them by executing "sudo dnf install fedora-python gcc"

Executing test cases:

Each test should be contained in a unique directory within the appropriate top level. The directory must contain an executable 'runtest.sh' which will drive the specific test. There is no guarantee on the order of execution. Each test should be fully independent, and have no dependency on other tests. The top level directories are reflective of how the master test suite is called. Each option is a super-set of the options before it. At this time we have:
  • minimal: This directory should include small, fast, and important tests which should be run on every system.
  • default: This directory will include most tests which are not destructive, or particularly long to run. When a user runs with no flags, all tests in both default and minimal will be run.
  • stress: This directory will include longer running and more resource intensive tests which a user might not want to run in the common case due to time or resource constraints.
  • destructive: This directory contains tests which have a higher probability of causing harm to a system even in the pass case. This would include things like potential for data loss.
  • performance: This directory contains longer running performance tests. These tests should typically be the only load on a system to get an accurate result.

After executing:
$ sudo ./runtests.sh -t performance
Each test is executed by the control script by calling runtest.sh. stdout and stderr are both redirected to the log. Any user running with default flags should see nothing but the name of the directory and pass/fail/skip. The runtest.sh should manage the full test run. This includes compiling any necessary source, checking for any specific dependencies, and skipping if they are not met. 
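As an illustration only (the contents below are my assumptions, not copied from the suite), a minimal runtest.sh honouring that contract might look like:

```shell
# Write and run a toy runtest.sh: check a dependency, skip if unmet,
# run the test body, and report PASS or FAIL on stdout.
cat > /tmp/runtest.sh <<'EOF'
#!/bin/sh
command -v gcc >/dev/null 2>&1 || { echo "SKIP: gcc not installed"; exit 0; }
# ...compile and run the real test here; 'true' stands in for it...
if true; then echo "PASS"; else echo "FAIL"; fi
EOF
sh /tmp/runtest.sh
```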



View the log file with cat <file path>; the log file will give the device information and the test result.

The test is complete and the result is "PASS" in this case.

Near the end of the log file you will find the data and values of the tests.

[Retrospection] Fedora QA Global Onboarding Call 2017-06-03

Posted by sumantro on June 05, 2017 07:11 AM

Retrospection


We had a Fedora QA onboarding call on 2017-06-03 and it was successful. The agenda and the feedback can be found on this etherpad. People from different countries and regions found the call useful.

A few changes which made things better:

1. Using Bluejeans was smooth and better than Hangouts.
2. Starting the Doodle 2 weeks before the call gave people enough time to vote.
3. Using a bunch of quick links as reference points and then explaining them.

Action Items

1. Consistency is the key to success; doing the onboarding call every 2 months will be more engaging. It also gives new contributors a sense of assurance that they can simply plug themselves into one of these calls and start from there - even if they miss one, they will still be able to contribute to the release cycle. The proposal is to create a wiki page and link it from Fedora QA Join, where people will benefit from it.

2. Feedback, FAQs, quick links, logs and recordings should be marked and kept on a wiki page, which will constantly tell us where we need improvement and maybe answer a few general questions for new contributors.

Proposed Timeline

After Branched

This is the time we generally plan for the test days and start off with pre-alpha testing. An onboarding call during this time will help us gather community ideas with which we can drive the test day planning, and if someone wants to run a test day we can help them plan accordingly.


Before Beta

After Alpha, we are mostly in a phase where the test days are happening, blocker bugs are being filed, and a lot of test coverage (release validation) needs to happen. Having an onboarding call at this time will help new contributors work on something specific which is aligned with our goals. This will be a place where we can have more and more people participating and helping us test the ISOs and features in test days.



Before Final
 
This is a good time, as we are done with most of the test days that are part of the change set; we can conduct a few more tests, namely the system upgrade test day and the kernel test day. This call will help us test on most off-the-shelf hardware and ensure that a whole range of hardware is covered. This is also the time when we need the most validation to be done across most of the architectures, and hence it will help us keep the contributors engaged.
 

Fedora QA Onboarding Call 2017-06-03 1400-1500 UTC

Posted by sumantro on June 02, 2017 05:50 PM



There is going to be a Fedora QA Onboarding Call on 2017-06-03, 1400-1500 UTC, over Bluejeans. While release validation for Fedora Beta 1.3 is underway, this is a good time to get started. In this call we will be talking about how you as a contributor can get started with Fedora QA. The agenda can be found on this etherpad.

Hope to see you all!

Test Day DNF 2.0

Posted by sumantro on May 08, 2017 11:02 AM


Tuesday, 2017-05-09, is the DNF 2.0 Test Day! As part of this planned Change for Fedora 26, we need your help to test DNF 2.0!

Why test DNF 2.0?

DNF-2 is the upstream DNF version, the only version actively developed. Currently the upstream contains many user-requested features, increased compatibility with yum, and over 30 bug fixes. Back-porting patches from upstream to DNF-1 is difficult, and only critical security and usability fixes will be cherry-picked into Fedora.

With DNF 2.0 in place, users will notice usability improvements like better messages during resolution errors, showing whether a package was installed as a weak dependency, better handling of obsolete packages, fewer tracebacks, etc. One command line option and one configuration option changed semantics, so DNF could behave differently in some ways (these changes are compatible with yum but incompatible with DNF-1). We hope to see whether it's working well enough and catch any remaining issues.

We need your help!

All the instructions are on the wiki page, so please read through and come help us test! As always, the event will be in #fedora-test-day on Freenode IRC.

[Fedora Classroom] Fedora QA 101 and 102

Posted by sumantro on April 17, 2017 09:39 PM
This post will sum up what we will be covering as part of the Fedora QA classroom this season. The idea is to understand how to do things the right way and to grow the number of contributors.

The topics covered will be:
1. Working with Fedora
2. Installing on VM(s)
3. Configuring and Installing fedora
4. Fedora Release Cycle
5. Live boot and Fedora Media Writer
6. Setting up accounts
7. Types of testing
8. Finding Test Cases
9. Writing Test Cases for Packages
10. Github
11. Bugzilla
12. Release Validation Testing
13. Update Testing
14. Manual Testing
14.1 Release validation
14.2 Update Testing


The 102 will cover Automated Testing and How to Host your own test days during the release cycle.

To make the workflow smooth, we have made a book which will act as a reference even after the classrooms are over.

https://drive.google.com/file/d/0Bzphf6h7upukTDlrczVEb0l3TmM/view?usp=sharing

Fedora Media Writer Test Day 2017-04-20

Posted by sumantro on April 16, 2017 08:11 AM
Fedora Media Writer is a very handy tool to create live USB media. It became the primary download method in Fedora 25. We ran a test day installment to check it on the 3 major OSes: Windows, macOS, and Fedora. The test day focused on writing Fedora images (Workstation/Server/spins) to a flash drive.

This installment of the test day will focus on out-of-the-box support for the ARMv7 architecture, in addition to Intel 64-bit and 32-bit. Testers can download an image of their choice and then verify the image by checksum and by booting it on KVM and, of course, bare metal.

We will be holding this test day on 2017-04-20; grab a blank SD card or USB stick, and with a good internet connection it will take roughly 30 minutes to complete the test cases.

Details will be published on the Fedora Community Blog and the @test-announce list.

The wiki page says it all: https://fedoraproject.org/wiki/Test_Day:2017-04-20_Fedora_Media_Writer

[Test Day Report]Anaconda BlivetGUI test day

Posted by sumantro on April 10, 2017 07:26 AM
Hey Testers,

I just wanted to pitch in with the Test Day report for the Anaconda Blivet-GUI Test Day. It was a huge success and we had about 28 testers (many new faces).

Testers: 28

Bugs filed (12):
1. https://bugzilla.redhat.com/show_bug.cgi?id=1430349
2. https://bugzilla.redhat.com/show_bug.cgi?id=1439538
3. https://bugzilla.redhat.com/show_bug.cgi?id=1439591
4. https://bugzilla.redhat.com/show_bug.cgi?id=1439744
5. https://bugzilla.redhat.com/show_bug.cgi?id=1439717
6. https://bugzilla.redhat.com/show_bug.cgi?id=1439684
7. https://bugzilla.redhat.com/show_bug.cgi?id=1439572
8. https://bugzilla.redhat.com/show_bug.cgi?id=1439111
9. https://bugzilla.redhat.com/show_bug.cgi?id=1439592
10. https://bugzilla.redhat.com/show_bug.cgi?id=1440143
11. https://bugzilla.redhat.com/show_bug.cgi?id=1440150
12. https://bugzilla.redhat.com/show_bug.cgi?id=1439729


Blog of the test day : https://communityblog.fedoraproject.org/anaconda-blivetgui-test-day/

I would like to thank each and every tester and the change owners for helping us test this crucial feature!

If you are one of those people who couldn't make it to the Test Day, you can still grab a copy of Fedora 26 Alpha 1.7 and start installing Fedora using Blivet-GUI. If something breaks, make sure to file a bug against blivet-gui.

Thanks
Sumantro
On behalf of Fedora QA team

[Test Day Announcement] Anaconda Blivet GUI

Posted by sumantro on April 04, 2017 09:44 AM

Thursday 2017-04-06 will be Anaconda Blivet-GUI Test Day! It covers a planned Change for Fedora 26, so this is an important Test Day!

Anaconda grew a rather important new option in F26: as well as the two existing partitioning choices (automatic, and the existing Anaconda custom partitioning interface), there is now a *third* choice: you can do custom partitioning with blivet-gui run within Anaconda, as well as using Anaconda's own interface (because there just weren't enough ways for custom partitioning to go wrong already). The detailed bottom-up configuration screen has long been requested by users, and the inclusion of blivet-gui in Anaconda finally makes this a reality. On the other hand, it just adds a new option without changing the existing advanced storage configuration, so users who prefer the top-down configuration can still use it.

We'll be testing the new interface to see whether it's working well enough and to shake out whatever problems it inevitably has. It's also pretty easy to join in: all you'll need is Alpha 1.7 (which you can grab from the wiki page). As always, the event will be in #fedora-test-day on Freenode IRC.

Fedora Activity Day, Bangalore 2017

Posted by Kanika Murarka on March 13, 2017 01:53 PM

The Fedora Activity Day (FAD) is a regional event (either one-day or a multi-day) that allows Fedora contributors to gather together in order to work on specific tasks related to the Fedora Project.

On February 25th '17, a FAD was conducted at one of the admired universities of Bangalore, University Visvesvaraya College of Engineering (UVCE). It was not a typical “hackathon” or “DocSprint” but a series of productive and interactive sessions on different tools.

The goal of this FAD was to make students aware of Fedora so that they can test, develop and contribute. It was a one-day event, starting at 10:30 in the morning and concluding at 3 in the evening.

The first talk was on Ansible, which is an IT automation tool. It can configure systems, deploy software, and orchestrate more advanced IT tasks such as continuous deployments or zero-downtime rolling updates. The session was led by Vipul Siddharth and Prakash Mishra, who are Fedora contributors. They discussed the importance of such automation tools and gave a small demo of getting started with Ansible.

The Ansible session was followed by a session on contributing to the Linux kernel, given by our esteemed guest Vaishali Thakkar (@kernel_girl). Vaishali is a Linux kernel developer at Oracle; she works in the kernel security engineering group and is associated with open source internship programs and some community groups. Vaishali covered every aspect of the kernel one should know before contributing, and discussed the lifecycle and the how, where and when of a patch submission. The session was 3 hours long, with a short lunch break. The first part focused on the theoretical aspects of sending your first patch to the kernel community, and the second part was a demo session where she sent a patch from scratch (Slides).

The last session, on pathways to contributing to Fedora, was given by Sumantro Mukherjee (Fedora Ambassador) and me, followed by a short interactive session.

The speakers were awarded t-shirts as a token of appreciation. I would like to thank Sumantro Mukherjee, the Fedora community, and the IEEE sub-chapter of UVCE for making this FAD possible.


Kernel testing made easy!

Posted by sumantro on February 21, 2017 11:37 PM
Hey folks, this is a sincere effort to help people who want to stay on top of the game with bleeding-edge kernels. The most important part is to check whether a new kernel version supports your system fine. If it does, that's awesome, but if it doesn't, you might want to report it to the team with the proper failure logs, which might be helpful for future reference.

To get started, you need a bleeding-edge kernel. You can get the latest kernel from Bodhi.





Most new kernels arrive as updates, so you need to enable the updates-testing repo to install the kernel from it. Once you have the kernel installed, you can disable the repo again by executing "dnf config-manager --set-disabled updates-testing". At the time of writing, the latest kernel in updates-testing for F25 was "kernel-4.9.10-200.fc25".

Once you are done installing, it's time to check that all the vital features of your machine work smoothly. Of course, after a manual inspection you can trigger the test suite, which will test the major parts for you.

First, you need to install some important packages, although many of you might already have all of them.


Once you have the required packages, you need to clone the Pagure repo: "git clone https://pagure.io/kernel-tests.git".




After cloning, you can start the test suite. Switch to the cloned folder and execute "cp config.example .config". After that succeeds, open the .config file in vi (or any text editor) and edit the values to "submit=authenticated" and "username=<fas username>". Once you are done, run the tests by executing "sudo ./runtests.sh"; for performance testing, run "sudo ./runtests.sh -t performance". Both of these tests are most likely to pass; if they don't, you need to send or update the log and post karma on Bodhi so that people can take note of any regressions.
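
The whole flow above can be sketched as a short shell session. This is a minimal sketch, not an official script: the privileged dnf/git/runtests.sh steps are shown as comments since they need root and a Fedora host, and the FAS username "jdoe" is a placeholder.

```shell
# Sketch of the kernel-tests workflow described above.

# 1. Enable updates-testing, install the candidate kernel, then disable
#    the repo again (run as root):
#      sudo dnf config-manager --set-enabled updates-testing
#      sudo dnf install kernel
#      sudo dnf config-manager --set-disabled updates-testing

# 2. Fetch the test suite:
#      git clone https://pagure.io/kernel-tests.git
#      cd kernel-tests

# 3. Create .config from the example and set the submission options
#    (equivalent to "cp config.example .config" plus editing two keys;
#    "jdoe" is a placeholder FAS username):
printf 'submit=authenticated\nusername=jdoe\n' > .config

# 4. Run the suite, then optionally the performance tests:
#      sudo ./runtests.sh
#      sudo ./runtests.sh -t performance
```

If the run fails, the suite leaves logs you can attach when posting karma on Bodhi.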

For any changes refer to :
https://fedoraproject.org/wiki/QA:Testcase_kernel_regression

Bluetooth in Fedora

Posted by Nathaniel McCallum on February 16, 2017 08:53 PM

So… Bluetooth. It’s everywhere now. Well, everywhere except Fedora. Fedora does, of course, support Bluetooth. But even the most common workflows are somewhat spotty. We should improve this.

To this end, I’ve enlisted the help of Don Zickus, kernel developer extraordinaire, and Adam Williamson, the inimitable Fedora QA guru. The plan is to create a set of user tests for the most common Bluetooth tasks. This plan has several goals.

First, we’d like to know when stuff is broken. For example, the recent breakage in linux-firmware. Catching this stuff early is a huge plus.

Second, we’d like to get high quality bug reports. When things do break, vague bug reports often cause things to sit in limbo for a while. Making sure we have all the debugging information up front can make reports actionable.

Third, we’d (eventually) like to block a new Fedora release if major functionality is broken. We’re obviously not ready for this step yet. But once the majority of workflows work on the hardware we care about, we need to ensure that we don’t ship a Fedora release with broken code.

To this end we are targeting three workflows which cover the most common cases:

  • Keyboards
  • Headsets
  • Mice

For more information, or to help develop the user testing, see the Fedora QA bug. Here’s to a better future!

Fedorahosted to Pagure

Posted by Kanika Murarka on February 12, 2017 06:48 AM

Fedorahosted.org was established in late 2007 using Trac for issues and wiki pages, Fedora Account System groups for access control and source uploads, and offering a variety of Source Control Management tools (git, svn, hg, bzr). With the rise of new workflows and source repositories, fedorahosted.org has ceased to grow, adding just one new project this year and a handful the year before.

As we all know, Fedorahosted is shutting down at the end of this month, so it's time to migrate your projects from Fedorahosted to one of the following:

  1. Pagure
  2. Hosting and managing own Trac instance on OpenShift
  3. JIRA
  4. Phabricator
  5. GitHub
  6. Taiga

Pagure is the brainchild of Pierre-Yves Chibon, a member of the Fedora Engineering team. We will be looking into migration to Pagure, because Pagure is a new, full-featured git repository service, and it's open source and we ❤ open source.

Pagure provides a test instance where we can create projects and test importing data. Note: it is cleared out from time to time, so do not use it for any long-term purpose.

Here is how Pagure supports Fedorahosted projects:

Feature                            Fedorahosted                                   Pagure
Priorities                         Add as many priority levels as required,       Same
                                   with weights
                                   Can assign a default priority                  No such option
                                   Custom priority tags                           Same
Milestones                         Add as many milestones as we want              Same
                                   Option to add a due date                       Same
                                   Keeps track of completed time                  Does not record completed time
                                   Option to select a default milestone           No such option
Resolutions                        Add as many resolutions as we want             Same
                                   Can set a default resolution type              Closed as 'None' by default
Other fields                       Separate columns for Severity, Component,      Just Tags, which is easier
                                   Version
Navigation and searching           Difficult                                      Easy
Permissions                        Different types of permission exist            Only an 'admin' permission exists
Creating and maintaining tickets   Difficult                                      Easy
Enabling plug-ins                  Easy                                           Easy

So, let's try importing something into the staging Pagure instance. I will show a demo using the Fedora QA repo, which has recently been moved from Fedorahosted to Pagure.

  1. You should have XML_RPC permission or admin rights on the Fedorahosted repo.
  2. We will use pagure-importer to do the migration.
  3. Install it using pip: python3 -m pip install pagure_importer
  4. Create a new repo, e.g. Test-fedoraqa.
  5. Go to Settings and make the repo ticket-friendly by adding new milestones and priorities.
  6. Clone the issue tracker for issues from Pagure. Use: pgimport clone ssh://git@stg.pagure.io/tickets/Test-fedoraqa.git. This will clone the Pagure Test-fedoraqa ticket repository into the default /tmp directory as /tmp/Test-fedoraqa.git
  7. Activate the Pagure tickets hook in the project settings. This step is necessary so that the Pagure database is also updated on changes to the tickets repository.
  8. Deactivate the Pagure Fedmsg hook in the project settings. This prevents the issue import from spamming the fedmsg bus. The hook can be reactivated once the import has completed.
  9. The fedorahosted command can be used to import issues from a fedorahosted project to pagure.
    $ pgimport fedorahosted --help
        Usage: pgimport fedorahosted [OPTIONS] PROJECT_URL
    
        Options:
        --tags  Import pagure tags as well.
        --private By default make all issues private.
        --username TEXT FAS username
        --password TEXT FAS password
        --offset INTEGER Number of issue in pagure before import
        --help  Show this message and exit.
        --nopush Do not push the result of pagure-importer back
    
    
    $ pgimport fedorahosted https://fedorahosted.org/fedora-qa --tags

    This command will import all the ticket information with all tags into the /tmp/Test-fedoraqa.git repository. If you get this error: ERROR: Error in response: {u'message': u'XML_RPC privileges are required to perform this operation', u'code': 403, u'name': u'JSONRPCError'}, it means you don't have XML_RPC permission.

  10. You will be prompted for your FAS username and password.
  11. Go to the tmp folder: cd /tmp
  12. Now we need to push the tickets to the new repo. The push command can be used to push a cloned Pagure ticket repo back to Pagure:
    $ pgimport push Test-fedoraqa.git
  13. Refresh your repo to see the imported tickets.
  14. Now you can edit tickets in any way you want.
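
Condensed, the whole migration is the short shell session below. The repository names are the demo values from this post (substitute your own), and the commands are printed for review rather than executed, since a real run needs FAS credentials and XML_RPC permission:

```shell
# Condensed pagure-importer migration flow from the steps above.
# Repo names are the demo values used in this post; substitute your own.
MIGRATION_STEPS="python3 -m pip install pagure_importer
pgimport clone ssh://git@stg.pagure.io/tickets/Test-fedoraqa.git
pgimport fedorahosted https://fedorahosted.org/fedora-qa --tags
cd /tmp
pgimport push Test-fedoraqa.git"

# Printed for review rather than executed:
echo "$MIGRATION_STEPS"
```

Remember to toggle the tickets and Fedmsg hooks in the project settings (steps 7 and 8 above) around the import.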

Stuck somewhere? Feel free to comment and get in touch. Thanks for reading this 🙂

Welcome Fedora Quality Planet

Posted by Kamil Páral on January 31, 2017 10:31 AM

Hello, I’d like to introduce a new sub-planet of Fedora Planet to you, located at http://fedoraplanet.org/quality/ (you don’t need to remember the URL, there’s a sub-planet picker in the top right corner of Fedora Planet pages that allows you to switch between sub-planets).

Fedora Quality Planet will contain news and useful information about QA tools and processes present in Fedora, updates on our quality automation efforts, guides for package maintainers (and other teams) how to interact with our tools and checks or understand the reported failures, announcements about critical issues in Fedora releases, and more.

Our goal is to have a single place for you to visit (or subscribe to) and get a good overview of what’s happening in the Fedora Quality space. Of course all Fedora Quality posts should also show up in the main Fedora Planet feed, so if you’re already subscribed to that, you shouldn’t miss our posts either.

If you want to join our effort and publish some interesting quality-related posts into Fedora Quality Planet, you're more than welcome! Please see the instructions on how to syndicate your blog. If you have any questions or need help, ask on the test mailing list or ping kparal or adamw in the #fedora-qa Freenode IRC channel. Thanks!

Fedora 25 i18n test day 2016-09-28

Posted by sumantro on September 27, 2016 04:23 PM
Hey all, this is a call for action for the internationalization (i18n) Test Day happening tomorrow. We would like people to test the keyboard layouts of different languages, as well as emoji.

How to test

If you test with an installed system, make sure you have all the current updates installed using the update manager, especially if you use the Alpha images. Alternatively, grab a copy of the latest nightly; you can get it here.
Once you are done, you can start testing.

Reporting

Result page: http://testdays.fedorainfracloud.org/events/10

Found bugs? Report them here.

Retrospection of Fedora QA Global Onboarding Call

Posted by sumantro on August 23, 2016 07:18 AM
The Fedora QA mailing list has started seeing loads of new contributors, so we decided to kick off onboarding calls to help new joiners understand the Fedora QA process the right way. Last Saturday (2016-08-20), we got the votes from all the new joiners and held the Fedora QA onboarding call.

The last time we ran one, we faced a few roadblocks, the major one being the participant limit of Hangouts. Since we wanted the call to be interactive, we thought it would be best to have it over Hangouts, and although we had a very interactive call with lots of questions and answers, participation was restricted because we reached the maximum number of participants.

For this call, we went with "Hangouts on Air", trading some interactivity for viewership. As this call technically had no limit on participation, the communication pattern was mostly simplex. The chatbox feature of the pirate pad did, however, help a lot in keeping the conversation duplex.

As a silver lining, we had the recording up with zero downtime. People with lower bandwidth can always watch the video later or download it from YouTube for reference.

Adamw's post about the call: https://www.happyassassin.net/2016/08/19/fedora-qa-onboarding-call-tomorrow-2016-08-20-at-1700-utc/

Youtube recording : https://www.youtube.com/watch?v=ASQmkOrB_DY

Having said that, we will be conducting onboarding calls whenever around 30 new contributors join Fedora QA.

UEFI for QEMU now in Fedora repositories

Posted by Kamil Páral on June 27, 2016 12:55 PM

I haven’t seen any announcement, but I noticed Fedora repositories now contain the edk2-ovmf package. That is the package necessary to emulate UEFI in QEMU/KVM virtual machines. It seems all licensing issues have finally been resolved, and now you can easily run UEFI systems in your virtual machines!

I have updated Using_UEFI_with_QEMU wiki page accordingly.

Enjoy.

‘Package XXX is not signed’ error during upgrade to Fedora 24

Posted by Kamil Páral on June 22, 2016 11:54 AM

Many people hit issues like this when trying to upgrade to Fedora 24:

 Error: Package a52dec-0.7.4-19.fc24.x86_64.rpm is not signed

You can easily see that this is a very widespread issue if you look at the comments section under our upgrade guide on Fedora Magazine. In fact, this issue probably affects everyone who has the rpmfusion repository enabled (a very popular third-party repository). Usually the a52dec package is mentioned, because it’s early in the alphabetical listing, but it can be a different one (depending on what you installed from rpmfusion).

The core issue is that even though their Fedora 24 repository is available, the packages in it are not signed yet – they simply have not had time to do that. However, rpmfusion repository metadata from Fedora 23 demand that all packages are signed (which is a good thing, package signing is crucial to prevent all kinds of nasty security attacks). The outcome is that DNF rejects the transaction as insecure.

According to rpmfusion maintainers, they are working on signing their repositories and it should be done hopefully soon. So if you’re not in a hurry with your upgrade, just wait a while and the problem will disappear soon (hopefully).

But, if you insist that you want to upgrade now, what are your options?

Some people suggest you can add the --nogpgcheck option to the command line. Please don’t do that! It completely bypasses all security checks, even for proper Fedora packages, and leaves you vulnerable to security attacks.

A much better option is to temporarily remove rpmfusion repositories:

$ sudo dnf remove 'rpmfusion-*-release'

and run the upgrade command again. You’ll likely need to add --allowerasing option, because it will probably want to remove some packages that you installed from rpmfusion (like vlc):

$ sudo dnf system-upgrade download --releasever=24 --allowerasing

This is OK, after you upgrade your system, you can enable rpmfusion repositories again, and install the packages that were removed prior to upgrade.
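
Put together, the safe sequence might look like the sketch below. The final "system-upgrade reboot" step, which actually performs the upgrade, is the standard dnf system-upgrade plugin step; it is not spelled out in this post, so treat it as an assumption about your setup. The commands are printed for review rather than executed:

```shell
# The safe-upgrade sequence from this post, collected in one place.
# The last step ("system-upgrade reboot") is the standard dnf plugin
# step that performs the actual upgrade; it is added for completeness.
UPGRADE_STEPS="sudo dnf remove 'rpmfusion-*-release'
sudo dnf system-upgrade download --releasever=24 --allowerasing
sudo dnf system-upgrade reboot"

# Printed for review rather than executed:
echo "$UPGRADE_STEPS"
```

After the reboot completes the upgrade, reinstall the rpmfusion release packages from their website and re-add any removed packages.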

(I recommend really removing the rpmfusion repositories, not just disabling them, because they manage their repos in a non-standard way, enabling and disabling their updates and updates-testing repos during the system lifecycle according to their needs. It is therefore hard to know which repos to enable after the system upgrade – they are not the same ones that were enabled before the upgrade. What they are doing is rather ugly, and it’s much better to perform a clean installation of their repos.)

After the system upgrade finishes, simply visit their website, install the repos again, and install any packages that you’re missing. This way, your upgrade was performed in a safe way. The packages installed from rpmfusion might still be installed unsafely (depending whether they manage to sign the repo by that time or not), but it’s much better than to upgrade your whole system unsafely.

To close this up, I’m sorry that people are hit by these complications, but it’s not something Fedora project can directly influence (except for banning third-party repos during system upgrades completely, or some similar drastic measure). This is in hands of those third-party repos. Hopefully lots of this pain will go away once we start using Flatpak.

Hosting your own Fedora Test Day

Posted by sumantro on June 07, 2016 03:08 AM
This post will talk about how to hold your own Test Day. Most of the time the situation will not arise where you have to do all the work single-handed, but below is a draft of how you can go ahead and host your own Test Day, or at least get started.


Procedure:

1. Decide the change you want to test for
2. Create a ticket
3. Find out whether that type of test case has run before; if yes, you can easily re-use the previous wiki page of test cases, but if not, you have to write a fresh Test Day wiki page and test case wiki page
4. Once you are done setting up the wiki page(s), you need to set up the Test Day app metadata page
5. The results will then be shown in the Test Day app.

How to set up:

1. For Fedora 24, you can find the change set here.

2. Create a Trac ticket.

3. Once you have picked at least one change, check whether that type of test ever ran before; if not, start off by grabbing the Test Day wiki template and creating the actual Test Day wiki page.

4. Next, and most crucial, is setting up the test case page. This page contains instructions on what is supposed to be done and how, and it also explains which test case should yield what result(s).

5. Set up the meta page so the app works fine: create a wiki page in the standard metadata format.

6. Once done, you need to add the meta link so the Test Day app can pick it up.

7. From this moment on, your Test Day is live and you can find its result page in the Test Day app.

That completes all the required procedures. Needless to say, you should announce it on the @test list and @test-announce list!

Fedora Media Writer testday [Report]

Posted by sumantro on April 21, 2016 11:06 AM
This is a proud moment, as we saw a good number of people participating in the event.

Testers: 22, Tests: 40
Bugs filed: 16, New: 16, Duplicates: 0, Fixed: 0




Participants
  1. Ayan
  2. Karthik Subrahmanya
  3. achembar
  4. arehtykitna
  5. frantisekz
  6. kparal
  7. lbrabec
  8. pschindl
  9. priynag
  10. renault
  11. satellit
  12. sayak
  13. sumantro
  14. swarnava
  15. cpanceac
  16. deaddrift
  17. lsatenstein
  18. roger.k.wells
  19. cmurf
  20. jsedlak
  21. juliuxpigface
  22. qqqqqqq


Bugs and Github issues Reported

  1. moving liveusb-creator to a different display runs out of memory and kills desktop - 1328452
  2. LUC doesn't reuse already downloaded image and stuck - 1328369
  3. AttributeError: 'NoneType' object has no attribute 'device' - 1328337
  4. liveusb-creator: pycurl.error: cannot invoke setopt() - perform() is currently running - 1328340
  5. crash when choosing ISO on samba server - 1328560
  6. Windows 10 install usb stick not recognized - 1328794
  7. liveusb-creator on Windows 10 opens diskpart.exe but does nothing - 1328484
  8. partitions are not unmounted before writing image to a device, can lead to corrupted data written - 1328498
  9. do not leave partially downloaded files on disk - 1328789
  10. no writing progress is displayed - 1328462
  11. USB stick light flashes continuously while LUC is running, whether it's read/writing or not - 1328563
  12. cpu is over 100% for liveusb-creator process while application is doing nothing - 1318491
  13. Open button doesn't work in file selection dialog - 1328457
  14. https://github.com/lmacken/liveusb-creator/issues/43
  15. https://github.com/lmacken/liveusb-creator/issues/44
  16. https://github.com/lmacken/liveusb-creator/issues/45