May 02, 2016

Self-Signed SSL/TLS Certificates: Why They are Terrible and a Better Alternative

A Primer on SSL/TLS Certificates

Many of my readers (being technical folks) are probably already aware of the purpose and value of certificates, but in case you are not familiar with them, here’s a quick overview of what they are and how they work.

First, we'll discuss public-key encryption and public-key infrastructure (PKI). It was realized very early on in human history that sometimes you want to communicate with other people in a way that prevents unauthorized people from listening in. All throughout time, people have been devising mechanisms for obfuscating communication so that only the intended recipient of the message would be able to understand it. This obfuscation is called encryption, the data being encrypted is called plaintext, and the encrypted data is called ciphertext. The cipher is the mathematical transformation that is used to turn the plaintext into the ciphertext and relies upon one or more keys known only to trusted individuals to get the plaintext back.

Early forms of encryption were mainly “symmetric” encryption, meaning that the cipher used the same key for both encryption and decryption. If you’ve ever added a password to a PDF document or a ZIP file, you have been using symmetric encryption. The password is a human-understandable version of a key. For a visual metaphor, think about the key to your front door. You may have one or more such keys, but they’re all exactly alike and each one of them can both lock and unlock the door and let someone in.

Nowadays we also have forms of encryption that are “asymmetric”. What this means is that one key is used to encrypt the message and a completely different key is used to decrypt it. This is a bit harder for many people to grasp, but it works on the basic mathematical principle that some actions are much more complicated to reverse than others. (A good example I’ve heard cited is that it’s pretty easy to figure out the square of any number with a pencil and a couple minutes, but most people can’t figure out a square-root without a modern calculator). This is harder to visualize, but the general idea is that once you lock the door with one key, only the other one can unlock it. Not even the one that locked it in the first place.

So where does the “public” part of public-key infrastructure come in? What normally happens is that once an asymmetric key-pair is generated, the user will keep one of those two keys very secure and private, so that only they have access to it. The other one will be handed out freely through some mechanism to anyone at all that wants to talk to you. Then, if they want to send you a message, they simply encrypt their message using your public key and they know you are the only one who can decrypt it. On the flip side, if the user wanted to send a public message but provide assurance that it came from them, they can also sign a message with the private key, so that the message will contain a special signature that can be decrypted with their public key. Since only one person should have that key, recipients can trust it came from them.
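
To make the asymmetric case concrete, here is a minimal sketch of both operations using the OpenSSL command line (the file names are placeholders, and this elides details like padding modes and key sizes):

# Generate a key pair: the private key stays secret, the public key is shared.
openssl genpkey -algorithm RSA -out private.pem
openssl rsa -in private.pem -pubout -out public.pem

# Anyone can encrypt a (small) message with the public key...
openssl pkeyutl -encrypt -pubin -inkey public.pem -in message.txt -out message.enc
# ...but only the holder of the private key can decrypt it.
openssl pkeyutl -decrypt -inkey private.pem -in message.enc

# Conversely, sign with the private key; anyone can verify with the public key.
openssl dgst -sha256 -sign private.pem -out message.sig message.txt
openssl dgst -sha256 -verify public.pem -signature message.sig message.txt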

Astute readers will see the catch here: how do users know for certain that your public key is in fact yours? The answer is that they need to have a way of verifying it. We call this establishing trust and it’s exceedingly important (and, not surprisingly, the basis for the rest of this blog entry). There are many ways to establish trust, with the most foolproof being to receive the public key directly from the other party while looking at two forms of picture identification. Obviously, that’s not convenient for the global economy, so there needs to be other mechanisms.

Let’s say the user wants to run a webserver at “www.mydomain.com”. This server might handle private user data (such as their home address), so a wise administrator will set the server up to use HTTPS (secure HTTP). This means that they need a public and private key (which in this case we call a certificate). The common way to do this is for the user to contact a well-known certificate authority and purchase a signature from them. The certificate authority will do the hard work of verifying the user’s identity and then sign their webserver certificate with the CA’s own private key, thus providing trust by way of a third-party. Many well-known certificate authorities have their public keys shipped by default in a variety of operating systems, since the manufacturers of those systems have independently verified the CAs in turn. Now everyone who comes to the site will see the nice green padlock on their URL bar that means their communications are encrypted.
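
In practice, the administrator generates the key pair and a certificate signing request (CSR) locally and sends only the CSR to the CA; roughly something like this (the names here are illustrative):

# Create a private key and a CSR for the site; only the CSR leaves the machine.
openssl req -new -newkey rsa:2048 -nodes \
    -keyout www.mydomain.com.key -out www.mydomain.com.csr \
    -subj "/CN=www.mydomain.com"
# The CSR goes to the CA, which verifies your identity and returns a signed certificate.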

A Primer on Self-Signed Certificates

One of the major drawbacks to purchasing a CA signature is that it isn’t cheap: the CAs (with the exception of Let’s Encrypt) are out there to make money. When you’re developing a new application, you’re going to want to test that everything works with encryption, but you probably aren’t going to want to shell out cash for every test server and virtual machine that you create.

The solution to this has traditionally been to create what is called a self-signed certificate. What this means is that instead of having your certificate signed by a certificate authority, you use the certificate's own private key to sign itself. The problem with this approach is that web browsers and other clients that verify the security of the connection will be unable to verify that the server is who it says it is. In most cases, the user will be presented with a warning page that informs them that the server's identity cannot be verified (in other words, that it may be pretending to be the site you meant to visit). When setting up a test server, this is expected. Unfortunately, however, clicking through and saying "I'm sure I want to connect" has a tendency to form bad habits in users and often results in them eventually clicking through when they shouldn't.
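
For reference, a typical self-signed certificate is generated in a single step, roughly like this (the validity period and names are arbitrary):

# Key and certificate in one shot; the certificate is signed by its own private key,
# so no third party vouches for it.
openssl req -x509 -newkey rsa:2048 -nodes -days 365 \
    -keyout test.key -out test.crt -subj "/CN=test.example.com"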

It should be pretty obvious, but I’ll say it anyway: Never use a self-signed certificate for a production website.

One of the problems we need to solve is how to avoid training users to ignore those warnings. One way that people often do this is to load their self-signed certificate into their local trust store (the list of certificate authorities that are trusted, usually provided by the operating system vendor but available to be extended by the user). This can have some unexpected consequences, however. For example, if the test machine is shared by multiple users (or is breached in a malicious attack), then the private key for the certificate might fall into other hands that would then use it to sign additional (potentially malicious) sites. And your computer wouldn’t try to warn you because the site would be signed by a trusted authority!
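
On a Fedora or RHEL style system, "loading it into the trust store" usually looks something like the following (paths and tooling differ on other platforms):

# Copy the certificate into the system anchors directory and rebuild the bundle.
sudo cp test.crt /etc/pki/ca-trust/source/anchors/
sudo update-ca-trust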

So now it seems like we're in a Catch-22 situation: if we load the certificate into the trusted authorities list, we run the risk of a compromised private key for that certificate tricking us into a man-in-the-middle attack somewhere and stealing valuable data. If we don't load it into the trust store, then we are constantly bombarded by a warning page that we have to ignore (or, in the case of non-browser clients, we may have to pass an option to skip verifying the server certificate), in which case we could still end up in a man-in-the-middle attack, because we're blindly trusting the connection. Neither of those seems like a great option. What's a sensible person to do?

Two Better Solutions

So, let’s take both of the situations we just learned about and see if we can locate a middle ground somewhere. Let’s go over what we know:

  • We need to have encryption to protect our data from prying eyes.
  • Our clients need to be able to trust that they are talking to the right system at the other end of the conversation.
  • If the certificate isn’t signed by a certificate in our trust store, the browser or other clients will warn or block us, training the user to skip validation.
  • If the certificate is signed by a certificate in our trust store, then clients will silently accept it.
  • Getting a certificate signed by a well-known CA can be too expensive for an R&D project, but we don’t want to put developers’ machines at risk.

So there are two better ways to deal with this. One is to have an organization-wide certificate authority rather than a public one. This should be managed by the Information Technology staff. Then R&D can submit their certificates to the IT department for signing, and all company systems will implicitly trust that signature. This approach is powerful, but it can also be difficult to set up (particularly in companies with a bring-your-own-device policy in place). So let's look at another solution that's closer to the self-signed approach.

The other way to deal with it would be to create a simple site-specific certificate authority for use just in signing the development/test certificate. In other words, instead of generating a self-signed certificate, you would generate two certificates: one for the service and one to sign that certificate. Then (and this is the key point – pardon the pun), you must delete and destroy the private key for the certificate that did the signing. As a result, only the public key of that private CA will remain in existence, and it will only have ever signed a single service. Then you can provide the public key of this certificate authority to anyone who should have access to the service and they can add this one-time-use CA to their trust store.
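
A minimal sketch of that workflow with plain OpenSSL might look like the following; sscg (described below) automates the same idea without ever writing the CA key to disk:

# 1. Create a throwaway CA key and certificate.
openssl req -x509 -newkey rsa:2048 -nodes -days 90 \
    -keyout ca.key -out ca.crt -subj "/CN=Throwaway dev CA"

# 2. Create the service key and CSR, then sign the CSR with the throwaway CA.
openssl req -new -newkey rsa:2048 -nodes \
    -keyout service.key -out service.csr -subj "/CN=dev.example.com"
openssl x509 -req -in service.csr -CA ca.crt -CAkey ca.key \
    -CAcreateserial -out service.crt -days 90

# 3. Destroy the CA private key so nothing else can ever be signed by it.
shred -u ca.key ca.srl

# ca.crt (the public half) is what you hand to clients for their trust stores.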

Now, I will stress that the same rule holds true here as for self-signed certificates: do not use this setup for a production system. Use a trusted signing authority for such sites. It’s far easier on your users.

A Tool and a Tale

I came up with this approach while I was working on solving some problems for the Fedora Project. Specifically, we wanted to come up with a way to ensure that we could easily and automatically generate a certificate for services that should be running on initial start-up (such as Cockpit or OpenPegasus). Historically, Fedora had been using self-signed certificates, but the downsides I listed above gnawed at me, so I put some time into it and came up with the private-CA approach.

In addition to the algorithm described above, I've also built a proof-of-concept tool called sscg (the Self-Signed Certificate Generator) to easily enable the creation of these certificates (and to do so in a way that never drops the CA's private key onto a filesystem; it remains in memory). I originally wrote it in Python 3, and that version is packaged for use in Fedora today. This past week, as a self-assigned exercise to improve my knowledge of Go, I rewrote sscg in that language. It was a fun project and had the added benefit of removing the fairly heavyweight Python 3 dependency. I plan to package the Go version for Fedora 25 at some point in the near future, but if you'd like to try it out, you can clone my GitHub repository. Patches and suggestions for functionality are most welcome.


Trusting, Trusting Trust
A long time ago Ken Thompson wrote something called Reflections on Trusting Trust. If you've never read this, go read it right now. It's short and it's something everyone needs to understand. The paper basically explains how Ken backdoored the compiler on a UNIX system in such a way that it was extremely hard to get rid of the backdoors (yes, more than one). His conclusion was that you can only trust code you wrote yourself. Given the nature of the world today, that's no longer an option.

Every now and then I have someone ask me about Debian's Reproducible Builds. There are other groups working on similar things, but these guys seem to be the furthest along. I want to make clear right away that this work being done is really cool and super important, but not exactly for the reasons people assume. The Debian page is good about explaining what's going on but I think it's easy to jump to some false conclusions on this one.

Firstly, the point of a reproducible build is to allow two different systems to build the exact same binary. This tells us that the resulting binary was not tampered with. It does not tell us the compiler is trustworthy or the thing we built is trustworthy. Just that the system used to build it was clean and the binary wasn't meddled with before it got to you.

A lot of people assume a reproducible build means there can't be a backdoor in the binary. There can be, due to how the supply chain works. Let's break this down into a few stages. In the universe of software creation and distribution there are literally thousands to millions of steps happening, from each commit, to releases, to builds, to consumption. It's pretty wild. We'll keep it high level.

Here are the places I will talk about. Each one of these could be a book, but I'll keep it short on purpose.
  1. Development: Creation of the code in question
  2. Release: Sending the code out into the world
  3. Build: Turning the code into a binary
  4. Compose: Including the binary in some larger project
  5. Consumption: Using the binary to do something useful
Development
The development stage of anything is possibly the hardest to control. We have reached a point in how we build software that development is now really fast. I would expect any healthy project to have hundreds or thousands of commits every day. Even with code reviews and sign offs, bugs can sneak in. A properly managed project will catch egregious attempts to insert a backdoor.

Release
This is the stage where the project in question cuts a release and puts it somewhere it can be downloaded. A good project will include a detached signature, which almost nobody checks. This stage of the trust chain has been attacked in the past: there are many instances of hacked mirrors serving up backdoored content. The detached signature ensures the release is trustworthy. We have mostly solved trust here, which is why those signatures are so important.
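
Checking a detached signature is cheap; for a typical release it is something like this (assuming you have already fetched and verified the project's signing key out of band):

gpg --verify project-1.0.tar.gz.asc project-1.0.tar.gz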

Build
This is the stage where we take the source code and turn it into a binary. This is the step that a reproducible build project injects trust into. Without a reproducible build stage, there was no real trust here. It's still sort of complicated though. If you've ever looked at the rules that trigger these builds, it wouldn't be very hard to violate trust there, so it's not bulletproof. It is a step in the right direction though.

Compose
This step is where we put a bunch of binaries together to make something useful. It's pretty rare for a single build to output the end result. I won't say it never happens, but it's a bit outside what we're worried about, so let's not dwell on it. The threat we see during this stage is the various libraries you bundle with your application. Do you know where they came from? Do they have some level of trust built in? At this point you could have a totally trustworthy chain of trust, but if you include a single bad library, it can undo everything. If you want to be as diligent as possible, you won't ship things built by any third parties. If you build it all yourself, you can ensure some level of trust up to this point. Of course building everything yourself generally isn't practical. I think this is the next stage where we'll end up adding more trust. Various code scanners are trying to help here.

Consumption
Here is where whatever you put together is used. In general nobody is looking for software, they want a solution to a problem they have. This stage can be the most complex and dangerous though. Even if you have done everything perfectly up to here, if whoever does the deployment makes a mistake it can open up substantial security problems. Better management tools can help this step a lot.

The point of this article isn't to try to scare anyone (even though it is pretty scary if you really think about it). The real point is to stress that nobody can do this alone. There was once a time when a single group could plausibly try to own their entire development stack, but those times are long gone now. What you need to do is look at the above steps and decide where you want to draw your line. Do you have a supplier you can trust all the way to consumption? Do you only trust them for development and release? If you can't draw that line, you shouldn't be using that supplier. In most cases you have to draw the line at compose. If you don't trust what your supplier does beneath that stage, you need a new supplier. Demanding they give you reproducible builds isn't going to help you; they could backdoor things during development or release. It's the old saying: trust, but verify.

Let me know what you think. I'm @joshbressers on Twitter.

April 24, 2016

Can we train our way out of security flaws?
I had a discussion about training developers with some people I work with who are smarter than I am. The usual training suggestions came up, but at the end of the day, and this will no doubt enrage some of you, we can't train developers to write secure code.

It's OK, my twitter handle is @joshbressers, go tell me how dumb I am, I can handle it.

So anyhow, training. It's a great idea in theory. It works in many instances, but security isn't one of them. If you look at where training is really successful it's for things like how to use a new device, or how to work with a bit of software. Those are really single purpose items, that's the trick. If you have a device that really only does one thing, you can train a person how to use it; it has a finite scope. Writing software has no scope. To quote myself from this discussion:

You have a Turing complete creature, using a Turing complete machine, writing in a Turing complete language, you're going to end up with Turing complete bugs.

The problem with training in this situation is that you can't train for infinite permutations. By its very definition, training can only cover a finite amount of content. Programming by definition requires you to draw on an infinite amount of content. The two are mutually exclusive.

Since you've made it this far, let's come to an understanding. Firstly, training, even training on how to write software, is not a waste of time. Just because you can't train someone to write secure software doesn't mean you can't teach them to understand the problem (or a subset of it). The tech industry is notorious for seeing everything as all or none. It's a sliding scale.

So what's the point?

My thoughts on this matter are about how we can think about the challenge in a different way. Sometimes you have to understand the problem and the tools you have in order to find better solutions. We love to worry about how to teach everyone to be more secure, when in reality it's all about many layers with small bits of security in each spot.

I hate car analogies, but this time it sort of makes sense.

We don't proclaim the way to stop people getting killed in road accidents is to train them to be better drivers. In fact I've never heard anyone claim this is the solution. We have rules that dictate how the road is to be used (which humans ignore). We have cars with lots of safety features (which humans love to disable). We have humans on the road to ensure the rules are being followed. We have safety built into lots of roads, like guard rails and rumble strips. At the end of the day, even with layers of safety built in, there are accidents, lots of accidents, and almost no calls for more training.

You know what's currently the talk about how to make things safer? Self driving cars. It's ironic that software may be the solution to human safety. The point though is that every system reaches a point where the best you can ever do is marginal improvements. Cars are there, software is there. If we want to see substantial change we need new technology that changes everything.

In the meantime, we can continue to add layers of safety for software, this is where most effort seems to be today. We can leverage our existing knowledge and understanding of problems to work on making things marginally better. Some of this could be training, some of this will be technology. What we really need to do is figure out what's next though.

Just as humans are terrible drivers, we are terrible developers. We won't fix auto safety with training any more than we will fix software security with training. Of course there are basic rules everyone needs to understand, which is why some training is useful. We're not going to see any significant security improvements without some sort of new technology breakthrough. I don't know what that is, nobody does yet. What is self-driving software development going to look like?

Let me know what you think. I'm @joshbressers on Twitter.

April 22, 2016

Remotely calling certmonger's local signer

It is really hard to make remote calls securely without a minimal Public Key Infrastructure. For a single server development deployment, you can use a self-signed certificate, but once you have multiple servers that need to intercommunicate, you want to have a single signing cert used for all the services. I’m investigating an approach which chains multiple Certmonger instances together.

When Certmonger needs a certificate signed, it generates a Certificate Signing Request (CSR) and then calls a helper application. For local signing, this executable is

/usr/libexec/certmonger/local-submit

If I want to sign a certificate without going through certmonger, I can first create a local cert database, generate a CSR, and manually sign it:

mkdir ~/certs
certutil -N -d ~/certs
certutil -R -s "CN=www.younglogic.net, O=Younglogic, ST=MA, C=USA" -o ~/mycert.req -a -g 2048 -d ~/certs
/usr/libexec/certmonger/local-submit ~/mycert.req > mycert.pem

To get a remote machine to sign it, I used the following bash script:

#!/bin/sh -x

REMOTE_HOST=keycloak.younglogic.net
REMOTE_USER=dhc-user
SSH="ssh $REMOTE_USER@$REMOTE_HOST"
CERTMONGER_CSR=`cat ~/mycert.req`

# Copy the CSR into a temporary directory on the remote signer.
remotedir=`$SSH mktemp -d -p /home/dhc-user`
echo "$CERTMONGER_CSR" | $SSH tee $remotedir/mycert.req

# Sign it with the remote machine's local certmonger signer and save the result.
new_cert=$( $SSH /usr/libexec/certmonger/local-submit $remotedir/mycert.req )
echo "$new_cert" > ~/mycert.pem

# Clean up the temporary files on the remote machine.
$SSH rm $remotedir/mycert.req
$SSH rmdir $remotedir

The /usr/libexec/certmonger/local-submit helper complies with the interface for Certmonger helper apps, which means it can also accept the CSR via the environment variable CERTMONGER_CSR; as you can see, it accepts it as an argument as well. If I drop the explicit definition of this variable, my script should work as a certmonger helper app.
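
In other words, on the signing machine itself, either of these invocations should produce a signed certificate (a sketch based on the helper interface described above):

# CSR passed as a command-line argument...
/usr/libexec/certmonger/local-submit ~/mycert.req > mycert.pem
# ...or supplied the way certmonger itself would pass it, via the environment.
CERTMONGER_CSR="$(cat ~/mycert.req)" /usr/libexec/certmonger/local-submit > mycert.pem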

In ~/.config/certmonger/cas/remote

id=remote
ca_is_default=0
ca_type=EXTERNAL
ca_external_helper=/home/ayoung/bin/remote_certmonger.sh

Of course, this will not honor any of the other getcert commands. But we should be able to list the certs.

Call it with:

getcert request -n remote   -c remote -s -d ~/certs/  -N "uid=ayoung,cn=users,cn=accounts,dc=openstack,dc=freeipa,dc=org"
New signing request "20160422020445" added.

getcert list -s

Request ID '20160422020445':
	status: SUBMITTING
	stuck: no
	key pair storage: type=NSSDB,location='/home/ayoung/certs',nickname='remote',token='NSS Certificate DB'
	certificate: type=NSSDB,location='/home/ayoung/certs',nickname='remote'
	signing request thumbprint (MD5): 5D1D5881 12952298 073F1DF6 48B10CB9
	signing request thumbprint (SHA1): A30FAEDE 1917DD4D 4FA3AAFC C704329E C7783B46
	CA: remote
	issuer: 
	subject: 
	expires: unknown
	pre-save command: 
	post-save command: 
	track: yes
	auto-renew: yes

So, not yet. More on this later.

April 20, 2016

Running Keystone Unit Tests against older Versions of RDO Etc

Just because upstream is no longer supporting Essex doesn't mean that someone out there is not running it. So, if you need to backport a patch, you might find yourself in the position of having to run unit tests against an older version of Keystone (or another project) that does not run cleanly against the files installed by tox. For example, I tried running against an Icehouse-era checkout and got a slew of errors like this:

AssertionError: Environmental variable PATH_INFO is not a string: <type> (value: u’/v2.0/tokens/e6aed0a188f1402d9ad3586bc0e35758/endpoints’)

The basic steps are:

  1. Install the packages for the version closest to the one you want to test
  2. Check out your source from git and apply your patch
  3. Install any extra RPMs required to run the tests
  4. Run the test using python -m unittest $TESTNAME (see the example below)
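
For the last step, the invocation looks roughly like this (the test module and class here are hypothetical; substitute whatever your patch touches):

cd /path/to/keystone
python -m unittest keystone.tests.test_v3_auth.TestTokenAPI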

For RDO, the main RPMS can be installed from :

https://repos.fedorapeople.org/repos/openstack/

You might need additional RPMs as packaged in EPEL. You don't, however, need to use an installer; you can use yum to install just the Keystone package.
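
For example, assuming the RDO repository above is already configured, something like this pulls in Keystone and its packaged dependencies:

sudo yum install openstack-keystone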

The dependencies are a little tricky to solve. Tox uses the test-requirements.txt file in the Keystone repo to install them, but those names do not match up with the RPM package names. Often the RPM will be the name of the Python package with a "python-" prefix.

Not all of the dependencies are in Fedora, RDO, or EPEL. Many were built just for CI, and are in https://copr.fedorainfracloud.org/coprs/abregman/.

For later releases, you can check out the jobs running in : https://ci.centos.org/view/rdo/view/promotion-pipeline/ and fetch the set of packages in “all_rpms.txt” but be aware that these are not the set of packages for unit tests. You might need more.

Not every package can be installed this way. For example, python-pysaml2 requires a bunch of additional RPMs that I had trouble pulling in. These can still be installed via pip.

April 17, 2016

Software end of life matters!
Anytime you work on a software project, the big events are always new releases. We love to get our update and see what sort of new and exciting things have been added. New versions are exciting, they're the result of months or years of hard work. Who doesn't love to talk about the new cool things going on?

There's a side of software that rarely gets talked about though, and honestly in the past it just wasn't all that important or exciting. That's the end of life. When is it time to kill off the old versions? Or sometimes even kill an entire project? When you do, what happens to the people using it? These are hard things to decide, there aren't good answers usually, it's just not a topic we're good at yet.

I bring this up now because apparently Apple has decided that QuickTime on Windows is no longer a thing. I think everyone can agree that expecting users to find some obscure message on the Internet to know they should uninstall something is pretty far-fetched.

The conversation is way bigger than just Apple though. Google is going to brick some old Nest hardware. What about all those old tablets that still work but have no security updates? What about all those Windows XP machines still out there? I bet there are people still using Windows 95!

In some instances, the software and hardware can be decoupled. If you're running XP you can probably upgrade to something slightly better (maybe). Generally speaking though, you have some level of control. If you think about tablets or IoT-style devices, the software and hardware are basically the same thing. The software will likely reach end of life before the hardware stops working. So what does that mean? In the case of pure software, if you need it to get work done, you're not going to uninstall it. It's all really complex, unfortunately, which is why nobody has figured this out yet.

In the past, you could keep most "hardware" working almost forever. There are cars out there nearly 100 years old. They still work and can be fixed. That's crazy. The thought of 100 year old software should frighten you to your core. They may have stopped making your washing machine years ago, but it still works and you can get it fixed. We've all seen the power tools our grandfathers used.

Now what happens when we decide to connect something to the Internet? Now we've chained the hardware to the software. Software has a defined lifecycle. It is born, it lives, it reaches end of life. Physical goods do not have a predetermined end of life (I know, it's complicated, let's keep it simple), they break, you get a new one. If we add software to this mix, software that creates a problem once it's hit the end of life stage, what do we do? There are two options really.

1) End the life of the hardware (brick it)
2) Let the hardware continue to run with the known bad software.

Neither is ideal. Now, there are some devices where you could just cut off features. A refrigerator, for example: instead of knowing when to order more pickles, it reverts to only keeping things cold. While this could create confusion in the pickle industry, at least you still have a working device. Other things would be tricky. An internet-connected smart house isn't very useful if the things can't talk to each other. A tablet without internet isn't good for much.

I don't have any answers, just questions. We're still trying to sort out what this all means I suspect. If you think you know the answer I imagine you don't understand the question. This one is turtles all the way down.

What do you think? Tell me: @joshbressers

April 13, 2016

Getting Started with Puppet for Keystone

Tripleo uses Puppet to manage the resources in a deployment. Puppet has a command line tool to look at resources.

On my deployed Overcloud, I have:

ls /etc/puppet/modules/keystone/lib/puppet/provider
keystone         keystone_domain_config      keystone_paste_ini  keystone_service  keystone_user_role
keystone_config  keystone_endpoint           keystone.rb         keystone_tenant
keystone_domain  keystone_identity_provider  keystone_role       keystone_user

So I can use the puppet CLI to query the state of my system, or make changes:

To look at the config:

sudo puppet resource keystone_config
keystone_config { 'DEFAULT/admin_bind_host':
  ensure => 'present',
  value  => '10.149.2.13',
}
keystone_config { 'DEFAULT/admin_port':
  ensure => 'present',
  value  => '35357',
}
keystone_config { 'DEFAULT/admin_token':
  ensure => 'present',
  value  => 'vtNheM6drk4mgKgbAtWQPrYJe',
}
keystone_config { 'DEFAULT/log_dir':
  ensure => 'present',
  value  => '/var/log/keystone',
}
...

OK, Admin Token is gross.

$ sudo puppet resource keystone_config DEFAULT/admin_token
keystone_config { 'DEFAULT/admin_token':
  ensure => 'present',
  value  => 'vtNheM6drk4mgKgbAtWQPrYJe',
}

Let’s get rid of that:

sudo puppet resource keystone_config DEFAULT/admin_token ensure=absent
Notice: /Keystone_config[DEFAULT/admin_token]/ensure: removed
keystone_config { 'DEFAULT/admin_token':
  ensure => 'absent',
}

Let’s add a user:

$ sudo puppet resource keystone_users
Error: Could not run: Could not find type keystone_users
[heat-admin@overcloud-controller-0 ~]$ 

Uh oh…what did I do?

[heat-admin@overcloud-controller-0 ~]$ sudo puppet resource keystone_config DEFAULT/admin_token ensure=present value=vtNheM6drk4mgKgbAtWQPrYJe
Notice: /Keystone_config[DEFAULT/admin_token]/ensure: created
keystone_config { 'DEFAULT/admin_token':
  ensure => 'present',
  value  => 'vtNheM6drk4mgKgbAtWQPrYJe',
}
[heat-admin@overcloud-controller-0 ~]$ sudo puppet resource keystone_user
keystone_user { 'admin':
  ensure  => 'present',
  email   => 'admin@example.com',
  enabled => 'true',
  id      => '7cbc569993ae41e7b2736ed2aa727644',
}
...

So it looks like the Puppet modules use the Admin token to do operations.

But I really want to get rid of that admin token…

Back on the undercloud, I have created a Keystone V3 RC file. I’m going to copy that to /root/openrc on the overcloud controller.

[stack@undercloud ~]$ scp overcloudrc.v3 heat-admin@10.149.2.13:
[stack@undercloud ~]$ ssh heat-admin@10.149.2.13
[heat-admin@overcloud-controller-0 ~]$ sudo puppet resource keystone_config DEFAULT/admin_token ensure=absent
keystone_config { 'DEFAULT/admin_token':
  ensure => 'absent',
}
[heat-admin@overcloud-controller-0 ~]$ sudo puppet resource keystone_user
Error: Could not run: Insufficient credentials to authenticate
[heat-admin@overcloud-controller-0 ~]$ sudo cp  overcloudrc.v3 /root/openrc
[heat-admin@overcloud-controller-0 ~]$ sudo puppet resource keystone_user
keystone_user { 'admin':
  ensure  => 'present',
  email   => 'admin@example.com',
  enabled => 'true',
  id      => '7cbc569993ae41e7b2736ed2aa727644',
}
...

Now let’s add a user:

$ sudo puppet resource keystone_user ayoung ensure=present email=ayoung@redhat.com enabled=true password=FreeIPA4All
Notice: /Keystone_user[ayoung]/ensure: created
keystone_user { 'ayoung':
  ensure  => 'present',
  email   => 'ayoung@redhat.com',
  enabled => 'false',
}

Big Shout out to Emilien Macchi who is the Master of Keystone Puppets and taught me about the openrc file.

April 12, 2016

What happened with Badlock?
Unless you live under a rock, you've heard of the Badlock security issue. It went public on April 12. Then things got weird.

I wrote about this a bit in a previous post. I mentioned there that this had better be good. If it's not, people will get grumpy. People got grumpy.

The thing is, this is a nice security flaw. Whoever found it is clearly bright, and if you look at the Samba patchset, it wasn't trivial to fix. Hats off to those two groups.
$ diffstat -s samba-4.4.0-security-2016-04-12-final.patch 
 227 files changed, 14582 insertions(+), 5037 deletions(-)

Here's the thing though. It wasn't nearly as good as the hype claimed. It probably couldn't ever be as good as the hype claimed. This is like waiting for a new Star Wars movie. You have memories from being a child and watching the first few. They were like magic back then. Nothing that ever comes out again will be as good. Your brain has created ideas and memories that are too amazing to even describe. Nothing can ever beat the reality you built in your mind.

Badlock is a similar concept.

Humans are squishy, irrational creatures. When we know something is coming, one of two things happens. We imagine the most amazing thing ever, which nothing will ever live up to (the end result here is being disappointed). Or we imagine something stupid, which almost anything will be better than (the end result here is being pleasantly surprised).

I think most of us were expecting the most amazing thing ever. We had weeks to imagine what the worst possible security flaw affecting Samba and Windows could be. Most of us can imagine some pretty amazing things. We didn't get that though. We didn't get amazing. We got a pretty good security flaw, but not one that will change the world. We expected amazing, we got OK, now we're angry. If you look at Twitter, the poor guy who discovered this is probably having a bad day. Honestly, there probably wouldn't have been anything that could have lived up to the elevated expectations that were set.

All that said, I do think that announcing this weeks in advance created this atmosphere. If this had all been quiet until today, we would have been impressed, even if it had a name. Hype isn't something you can usually control. Some try, but by its very nature things get out of hand quickly and easily.

I'll leave you with two bits of wisdom you should remember.

  1. Name your pets, not your security flaws
  2. Never over-hype security. Always underpromise and overdeliver.

What do you think? Tell me: @joshbressers

April 11, 2016

A TFTP Server in Rust

Rust is Pedantic. I'm Pedantic. We get along wonderfully. Since HTTP is way too overdone, I wanted to try something at the byte-twiddling level. I got a very, very basic TFTP server to run and fetch a larger binary file without corrupting it. Time to celebrate with a brag post.

The code is on Github, and I went full GPL on it.

Some comments are certainly called for. Here is the main loop, which:

  • reads a single packet from a UDP socket
  • extracts the OP code
  • calls the appropriate handler function
fn read_message(socket: &net::UdpSocket) {
    let mut file_streams = HashMap::new();

    let mut buf: [u8; 100] = [0; 100];
    loop{
        let result = socket.recv_from(&mut buf);

        match result {
            Ok((amt, src)) => {
                let data = Vec::from(&buf[0..amt]);
                let connection = Connection{socket: socket, src: &src};
                let mut rdr = Cursor::new(&data);

                if amt < 2{
                    panic!("Not enough data in packet")
                }
                let opcode = rdr.read_u16::<BigEndian>().unwrap();

                match opcode {
                    // Opcode 1 is RRQ (read request); open the file and start a stream.
                    1 => {
                        file_streams.insert(src, handle_read_request(
                            &mut rdr, &amt, &connection));
                    },
                    // Opcode 2 is WRQ (write request); not implemented yet.
                    2 => println!("Write"),
                    // Opcode 3 is DATA; not expected by a read-only server.
                    3 => println!("Data"),
                    // Opcode 4 is ACK; the client acknowledged a block, so send the next one.
                    4 => {
                        let chunk = rdr.read_u16::<BigEndian>().unwrap() + 1;
                        file_streams.get_mut(&src).unwrap().send_chunk(&chunk, &connection);
                    },
                    // Opcode 5 is ERROR.
                    5 => println!("ERROR"),
                    _ => println!("Illegal Op code"),
                }
            },
            Err(err) => panic!("Read error: {}", err)
        }
    }
}

My first hack at reading the opcode just looked at the second byte as a u8, since with big-endian network byte order that byte alone was enough to match against. However, later on I need to marshal larger and larger numbers into the outgoing buffer, and the byteorder crate handles that for both reading and writing.

However, reading ASCII strings this way was not so clean:

 pub fn new(data: &mut Cursor<&Vec <u8>>, amt: &usize,) -> FileStream {
        let mut index = 2;
        for x in 2..20 {
            if data.get_ref().as_slice()[x] == 0{
                index = x;
                break;
            }
        }
        let mut full_path = String::from("/home/ayoung/tftp/");
        let filename = match str::from_utf8(&data.get_ref().as_slice()[2..index]) {
            Ok(file_name) => file_name,
            Err(why) => panic!("couldn't read filename: {}",
                               Error::description(&why)),
        };
        full_path.push_str(filename);
        println!("filename: {}", filename);

I originally tried using the read_to_string method on the Cursor, but it did not identify the null, and I ended up with an invalid string. str::from_utf8 worked properly, once the index is set to skip the start of the buffer.

I really like the form of error handling where the success value is used for assignment, like this:

        let file = match File::open(full_path){
            Err(err) => panic!("Can't open file: {}", err),
            Ok(file) => file,
        };

This server reads files out of $HOME/tftp. To test it out, I did:

[ayoung@ayoung541 tmp]$ tftp 127.0.0.1 8888
tftp> binary
tftp> get Minecraft.jar
tftp> quit
[ayoung@ayoung541 tmp]$ diff ~/tftp/Minecraft.jar Minecraft.jar 
[ayoung@ayoung541 tmp]$ 

I think I need a syntax highlighter (for WordPress) that understands Rust.

Special thanks to this post that got me started.

April 10, 2016

Cybersecurity education isn't good, nobody is shocked
There was a news story published last week about the almost total lack of cybersecurity attention in undergraduate education. Most people in the security industry won't be surprised by this. In the majority of cases when the security folks have to talk to developers, there is a clear lack of understanding about security.

Every now and then I run across someone claiming that our training and education is going great. Sometimes I believe them for a few seconds, then I remember the state of things. Here's the thing: while there are a lot of good training and education opportunities, the ratio of competent security people to developers is without doubt going down. Software engineering positions are growing at more than double the rate of other positions. By definition it's significantly harder to educate a security person, so the math says there's a problem here (this disregards the fact that as an industry we do a horrible job of passing on knowledge).

While it's clear students don't care about security, the question is should they?

It's always easy to pull out an analogy here, comparing this to car safety, or maybe architects vs civil engineers. Those analogies never really work though, the rules are just too different. The fundamental problem really boils down to the fact that a 12 year old kid in his basement has access to the exact same tools and technology the guy working on his PhD at MIT does. I'm not sure there has ever been an industry with a similar situation. Generally those in large organizations had access to significant resources that a normal person doesn't. Like building a giant rocket, or a bridge.

Here is what we need to think about.

Would we expect a kid learning how to build a game on his Dad's computer to also learn security? If I was that kid, I would say no. I want to build a game, security sounds dumb.

What if you're a college kid interested in computer algorithms? Security sounds uninteresting and is probably a waste of time. Remember when they made you take that phys ed class and all the jocks laughed at you while you whispered to yourself about how they'll all be working at a gas station someday? Yeah, that's us now.

Let's assume that normal people don't care about security and don't want to care about security, what does that mean?

The simple answer would be to "fix the tools", but that's sort of chicken and egg. Developers build their own tools at a rather impressive speed these days, you can't really secure that stuff.

What if we sandbox everything? That really only protects the underlying system, most everything interesting these days is in the database, you can still steal all of that from a sandbox.

Maybe we could ... NO, just stop.
So how can we fix this?
We can't.

It's not that the problems are unfixable, it's that we don't understand them well enough. My best comparison here is when futurists wondered how New York could possibly deal with all the horse manure if the city kept growing. Clearly they were thinking only in the context of what was available to them at the time. We think in this way too. It's not that we're dumb, I'm certain we don't really understand the problems. The problems aren't insecure code or bad tools. It's something more fundamental than that. Did we expect the people cleaning up after the horses to solve the manure problem?

If we start to think about the fundamentals, what's the level below our current development models? With the above example it was really about transportation, not horses, but horses are what everyone obsessed over. Our problems aren't really developers, code, and education. It's something more fundamental. What is it though? I don't know.

Do you think you know? Tell me: @joshbressers

April 08, 2016

It's a good thing SELinux blocks access to the docker socket.
I have seen lots of SELinux bugs being reported where users are running a container that volume mounts docker.sock into the container.  The container then uses a docker client to do something with docker. While I appreciate that a lot of these containers probably need this access, I am not sure people realize that this is equivalent to giving the container full root outside of the container on the host system.  I just execute the following command and I have full root access on the host.

docker run -ti --privileged -v /:/host fedora chroot /host

SELinux definitely shows its power in this case by blocking the access.  From a security point of view, we definitely want to block all confined containers from talking to docker.sock.  Sadly the other security mechanisms on by default in containers do NOT block this access.  If a process somehow breaks out of a container and gets write access to docker.sock on an SELinux-disabled system, your system is pwned. (User namespaces, if enabled, will also block this access going forward.)
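
To see why the socket is so dangerous, here is a sketch of what any process that can write to it can do (the image name is arbitrary); it simply asks the host's Docker daemon to start a new privileged container:

# From inside the container, point a docker client at the host's socket that was
# volume mounted in, and request a privileged container with the host's root
# filesystem mounted.
docker -H unix:///run/docker.sock run -ti --privileged -v /:/host fedora chroot /host
# The resulting shell is root on the host, no matter how confined the original
# container was.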

If you have to run a container that talks to docker.sock, you need to turn off the SELinux protection. There are two ways to do this.

You can turn off all container security separation by using the --privileged flag. Since you are giving the container full access to your system from a security point of view, you probably should just do this.

docker run --privileged -v /run/docker.sock:/run/docker.sock POWERFULLCONTAINER

If you want to just disable SELinux you can do this by using the --security-opt label:disable flag.

docker run --security-opt label:disable -v /run/docker.sock:/run/docker.sock POWERFULLCONTAINER

Note: in the future, if you are using user namespaces and have this problem, a new --userns=host flag is being developed, which will turn off the user namespace within the container.

April 05, 2016

Abuse

I had an interesting exchange last week. We had someone in the chatroom asking for help, morgan was doing his part, and I chimed in and proceeded to get attacked.

Me: I’m not 100% buying this, why are you using numerics?
Other: thats because you are stupid.

Um. OK. That was unexpected, but… a couple lines later there was a “;)” so I shrugged it off. I can take a bit of ribbing, and chose to respond by being self-deprecating…

Other: you dont know unix
Me: Nah, not at all. Isn;’t that what happens when a guy starts singing soprano?

As a core, part of my job is to review code, change requests, and challenge even the assumptions already baked into the code. In this case, the code in question was something that originally existed to support PKI tokens, which are on their way out. Now the code was being used for Fernet key rotation. As you can see, it was an interesting technical discussion. But in focusing on the technical side, we all let the attack go unanswered. Then later:

Me: you called me stupid. Now you need to justify yourself

It might not be apparent, but the other person had gotten under my skin, and I responded. This, too, was a mistake. However, at the time I thought I was being levelheaded. We had not had any attacks like this in our chat room yet, and I was not prepared for it. Now, if it is going to happen to anyone, I am glad it was addressed at me; I can take it. But it is not about me.

Other: you have just proved over 100 lines that you truly are stupid, I will frame it and put it on the wall in austin
Me: go for it.

Me: feel free to insult me all you want. Do not insult other people in this channel.

Yep…I’m still too cocky. And you can see that the tension in the room rises. Other people were excusing themselves from the conversation. But you can see, I am finally starting to switch into oversight mode.

The discussion turns technical again. I’m now fully engaged. The problem set has to do with getpwnam calls inside containers…a lot like the issues we had with the BProc nsswitch module. Since I was irate, I was being detail oriented. But I think that the whole conversation about user identity inside versus outside the container is an important one to have. And then this:

Other: I will seriously kill you in Austin!

This was over the line. Based on the context, I did not take it as a real threat, and I still don’t. It was an expression of frustration from someone who thinks they know the answer and is having to justify it yet again. I know that frustration. But I also know not to respond like this.

Other: seriously, are you lobotomized?

Then back to sane technical conversation, and later:

Other: what planet are you from?

To be honest, this is back to borderline-acceptable ribbing to me. I responded with a link to my alma mater.

I needed to go do family things. As I left, I finally said what I should have said at the beginning.

Me: let me make one thing clear. We are all professionals here. You’ve picked on the most mellow of people to insult, which is why you have gotten away with it. The rest of the core developers here are getting very antsy at your attitude. I am willing to help,. but if you keep this up, it will be a kickban, and I do have perms in this room to enforce that. Have I made myself clear?

Later, the PTL, coming back from dinner, had this to say:

I’m the keystone project lead. I’ve read the scroll back earlier and have determined that you have violated the code of conduct (https://www.openstack.org/legal/community-code-of-conduct/), and lost the privilege to be a community member. There was a threat of violence, even if it was meant as hyperbole, that is unacceptable; it chases away other community members and makes for a hostile environment. As a result i am performing a kickban. For information on how to proceed please see the code of conduct website.

Let me be clear, I am not blaming myself as the victim here. I am blaming myself as a community member that did not stop an abusive conversation at the start. It does not matter that I was the target of the abuse: I am core on the project, and it is within my role to maintain a positive environment in the chat room. I made a mistake in letting the other person off simply because it was me. I would not let someone abuse anyone else in my community that way, but didn’t realize that, in not addressing it, I was encouraging a hostile atmosphere. It is about the community, and the environment in the chat room, and the project.

So, we were all caught unprepared. Fortunately, no real harm done. If something like this happens again, here is how I think we should react.

The first time someone says something that is a personal attack, the conversation stops.
It should not be the attacked person’s responsibility to respond, although they are certainly welcome to do so. The response should come from the other community members, especially those that have operator privileges in the room.

Post a link to the code of conduct. Here it is: https://www.openstack.org/legal/community-code-of-conduct/

If I am there, and you are the person attacked, and do not feel comfortable responding, send me a private message and I will step in. If not, find the operator of the room and ask them to respond. If they don’t, please send them a link to this blog, but also send a message to the Foundation member listed on the Code of Conduct page.

It’s easy to feel sorry for the person that we kickbanned. “Oh he was probably a dumb kid” or whatever, but remember, there are many, many people that get abused and just disappear. You would not take your kids to a party at a place that had a history of brawls in the parking lot. You don’t want a summer intern, new to open source and OpenStack and programming, to witness a hostile environment.

And I am still not certain it makes sense to allow calls that take numeric user ids from inside a container when the Name Service Switch module that backs them is not available. I am willing to be proved wrong, though, and look forward to discussing it with anyone that cares to discuss it civilly.

Adding a new filename transition rule.
Way back in 2012 we added File Name Transition Rules.  These rules allow us to create content with the correct label
in a directory with a different label.  Prior to file name transition rules, administrators and other tools like init scripts creating content in a directory would have to remember to execute restorecon on the new content.  In a lot of cases they would forget,
and we would end up with mislabeled content; in some cases this would open up a race condition where the data would be
temporarily mislabeled and could cause security problems.

I recently received this email and figured I should write a blog post.

Hiya everyone. I'm an SELinux noob.

I love the newish file name transition feature. I was first made aware of it some time after RHEL7 was released (http://red.ht/1VhtaHI), probably thanks to some mention from Simon or one of the rest of you on this list. For things that can't be watched with restorecond, this feature is so awesome.

Can someone give me a quick tutorial on how I could add a custom rule? For example:


filetrans_pattern(unconfined_t, httpd_sys_content_t, httpd_sys_rw_content_t, dir, "rwstorage")


Of course the end goal is that if someone creates a dir named "rwstorage" in /var/www/html, that dir will automatically get the httpd_sys_rw_content_t type. Basically I'm trying to make a clone of the existing rule that does the same thing for "/var/www/html(/.*)?/uploads(/.*)?".
Thanks for reading.

First you need to create a source file myfiletrans.te

policy_module(myfiletrans, 1.0)
gen_require(`
    type unconfined_t, httpd_sys_content_t, httpd_sys_rw_content_t;
')
filetrans_pattern(unconfined_t, httpd_sys_content_t, httpd_sys_rw_content_t, dir, "rwstorage")


Quickly looking at the code we added: when writing policy, if you are using type fields (unconfined_t, httpd_sys_content_t, httpd_sys_rw_content_t) that are defined in other policy packages, you need to specify this in a gen_require block.  This is similar to declaring extern variables in "C".  Then we call the filetrans_pattern interface.  This tells the kernel that if a process running as unconfined_t creates a directory named rwstorage in a directory labeled httpd_sys_content_t, the kernel should create the directory as httpd_sys_rw_content_t.

Now we need to compile and install the code; note that you need to have the selinux-policy-devel package installed.

make -f /usr/share/selinux/devel/Makefile myfiletrans.pp
semodule -i myfiletrans.pp
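
If you want to confirm the rule is loaded, sesearch from the setools package can display named file transition rules; something like:

sesearch -T -s unconfined_t -t httpd_sys_content_t -c dir | grep rwstorage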


Let's test it out.

# mkdir /var/www/html/rwstorage
# ls -ldZ /var/www/html/rwstorage
drwxr-xr-x. 2 root root unconfined_u:object_r:httpd_sys_rw_content_t:s0 4096 Apr  5 08:02 /var/www/html/rwstorage


Let's make sure the old behavior still works.

# mkdir /var/www/html/rwstorage1
# ls -lZ /var/www/html/rwstorage1 -d
drwxr-xr-x. 2 root root unconfined_u:object_r:httpd_sys_content_t:s0 4096 Apr  5 08:04 /var/www/html/rwstorage1


This is an excellent way to customize your policy, if you continuously see content being created with the incorrect label.

April 03, 2016

Security is really about Risk vs Reward
Every now and then the conversation erupts about what security really is. There's the old saying that the only secure computer is one that's off (or fill in your favorite quote here, there are hundreds). But the thing is, security isn't a binary concept where you are either secure or insecure. That's not how anything works. Everything is a sliding scale: you are never fully secure, you are never fully insecure, you're somewhere in the middle. Rather than bumble around about your risk though, you need to understand what's going on and plan for the risk.

So this brings us to the idea of risk and reward. Rather than just thinking about security, you have to think about how everything fits together. It doesn't matter if your infrastructure is super secure if nobody can do their jobs. As we've all seen over and over, if security gets in the way, security loses. Every. Single. Time.

I think about this a lot, and I've come up with a graph that I think can explain this nicely.


Don't think in the context of secure or insecure. Think in the context of how much risk do I have? Once you understand what your risks are, you can decide if the level of risk you're taking on can be justified by what the result of that risk will be. This of course holds true for nearly all decisions, not just security, but we'll just focus on security.

The above graph puts things into 4 groups. If you have a high level of risk with minimal reward (the Why box), you're making a bad decision. Anything you have in that "Why" box probably needs to go away ASAP, you will regret it someday.

Additionally, if your sustaining operations are of high risk, you're probably doing something wrong. Risk is hard and drains an organization; you should be conducting your day-to-day operations in a manner that poses low risk, as the day-to-day is generally not where the high reward is.

The place you want to be is in the "Innovation" or "No Brainer" boxes. Accepting a high level of risk isn't always a bad thing, assuming that risk comes with significant rewards. You can imagine a situation where you are deploying a new and untested technology, but the benefits to conducting business could change everything, or perhaps using a new, untested vendor for the first time.

We have to be careful with risk. Risk can be crippling if you don't understand and manage it. It can also destroy everything you've done if you let it get out of hand. Many of us find ourselves in situations where all risk is seen as bad. Risk isn't always bad, risk is never zero. It's up to everyone to determine what their acceptable level of risk is. Never forget though, that sometimes we need to bump up our level of risk to get to the next level of reward. Just make sure you can bring that risk back under control once you start seeing the outcomes.

What do you think? Let me know: @joshbressers
FreeIPA for Tripleo

My last post showed how to allocate an additional VM for Tripleo. Now I’m going to go through the steps to deploy FreeIPA on it. However, since I went through all of the effort to write Ossipee and Rippowam, I am going to use those to do the heavy lifting.

This one is pretty grungy. I’m going to generate a punch-list from it, and will continue to clean up the steps as I go, but first I want to just get it working.

To start, turn the ironic node into a server:

openstack server create  --flavor compute  --image overcloud-full  --key-name default idm

Now, in order to run Ansible, we need a custom inventory. I’ve done a small hack to Ossipee to get it to generate the appropriate inventory from the Nova servers in Tripleo’s undercloud.

Ossipee needs to use the V3 version of the Keystone API, so let’s convert the V2 stackrc into a V3 one and source that. Grab the script from http://adam.younglogic.com/2016/03/v3fromv2/ and run:

./v3fromv2.sh stackrc > stackrc.v3
. ./stackrc.v3 

A good way to check that you are using V3 is to do a V3-only operation, like listing domains:

openstack domain list
+----------------------------------+------------+---------+--------------------+
| ID                               | Name       | Enabled | Description        |
+----------------------------------+------------+---------+--------------------+
| 33c86e573f094787adb2e808c723dcca | heat_stack | True    |                    |
| default                          | Default    | True    | The default domain |
+----------------------------------+------------+---------+--------------------+

Grab Ossipee and run the inventory generator:

git clone https://github.com/admiyo/ossipee.git
python ./ossipee/ossipee-inventory.py > inventory.ini

This produces an inventory file that looks roughly like the ones Ossipee created before, but uses the same host group names as the rest of Tripleo:

[idm]
10.149.2.15
[idm:vars]
ipa_realm=AYOUNG.DELLT1700.TEST
cloud_user=heat-admin
ipa_server_password=FreeIPA4All
ipa_domain=ayoung.dellt1700.test
ipa_forwarder=192.168.23.1
ipa_admin_user_password=FreeIPA4All
ipa_nova_join=False
nameserver=192.168.52.4

[overcloud-controller-0]
10.149.2.13
[overcloud-controller-0:vars]
ipa_realm=AYOUNG.DELLT1700.TEST
cloud_user=heat-admin
ipa_server_password=FreeIPA4All
ipa_domain=ayoung.dellt1700.test
ipa_forwarder=192.168.23.1
ipa_admin_user_password=FreeIPA4All
ipa_nova_join=False
nameserver=192.168.52.4

[overcloud-novacompute-0]
10.149.2.12
[overcloud-novacompute-0:vars]
ipa_realm=AYOUNG.DELLT1700.TEST
cloud_user=heat-admin
ipa_server_password=FreeIPA4All
ipa_domain=ayoung.dellt1700.test
ipa_forwarder=192.168.23.1
ipa_admin_user_password=FreeIPA4All
ipa_nova_join=False
nameserver=192.168.52.4

In addition, I think the ipa_forwarder values are deployment-specific, and I have them wrong. Look in the controller’s resolv.conf to see what they should be:

$ ssh heat-admin@10.149.2.12 cat /etc/resolv.conf
; generated by /usr/sbin/dhclient-script
nameserver 10.149.2.5

The nameserver value should be the IP address of the newly created idm VM.

openstack server list
+--------------------------------------+-------------------------+--------+----------------------+
| ID                                   | Name                    | Status | Networks             |
+--------------------------------------+-------------------------+--------+----------------------+
| f81e7122-5d8e-4377-b855-80c28116197d | idm                     | ACTIVE | ctlplane=10.149.2.15 |
| c1bf48cb-659f-4f9f-aa9d-1c0d4bcae06d | overcloud-controller-0  | ACTIVE | ctlplane=10.149.2.13 |
| c1e2069f-4ef1-461b-86a9-2fd2bb321a8a | overcloud-novacompute-0 | ACTIVE | ctlplane=10.149.2.12 |
+--------------------------------------+-------------------------+--------+----------------------+

Obviously, Ossipee needs some work here, but this is likely going to be replaced by Heat work shortly. Anyway, adjust the IP addresses accordingly.

Now grab Rippowam:

git clone https://github.com/admiyo/rippowam.git

And install Ansible from EPEL. Note that this needs to be two calls, as the first installs the repo used by the second.

sudo yum -y install epel-release
sudo yum -y install ansible

Rippowam still has the host name as ipa in the ipa playbook. You can change either the inventory or the Rippowam code to match. I changed Rippowam like this:

diff --git a/ipa.yml b/ipa.yml
index e0ea50c..c17c2b5 100644
--- a/ipa.yml
+++ b/ipa.yml
@@ -1,10 +1,10 @@
 
-- hosts: ipa
+- hosts: idm
   remote_user: "{{ cloud_user }}"
   tags: all
   tasks: []
 
-- hosts: ipa
+- hosts: idm
   sudo: yes
   remote_user: "{{ cloud_user }}"
   tags:

The inventory file is set up for later when it needs to talk to the Overcloud controllers. Heat changes the cloud user to heat-admin. Create this user on the idm machine:

ssh centos@10.149.2.15 sudo useradd -m  heat-admin  -G wheel
ssh centos@10.149.2.15 sudo mkdir /home/heat-admin/.ssh
ssh centos@10.149.2.15 sudo chown heat-admin:heat-admin /home/heat-admin/.ssh
ssh centos@10.149.2.15 sudo cp /home/centos/.ssh/authorized_keys /home/heat-admin/.ssh/
ssh centos@10.149.2.15 sudo chown heat-admin:heat-admin /home/heat-admin/.ssh/authorized_keys
ssh heat-admin@10.149.2.15 ls
ssh heat-admin@10.149.2.15 pwd

I also manually went in and tweaked the /etc/sudoers values to let password-less sudo work for heat-admin. Not an approach I would suggest long term, but these are just my current development notes.
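
For reference, the change boils down to giving heat-admin password-less sudo. A drop-in along these lines would do it (the file name is just an illustration, and you should edit it with visudo):

# /etc/sudoers.d/heat-admin  (illustrative)
heat-admin ALL=(ALL) NOPASSWD: ALL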

Make sure that ansible works:

 ansible -i inventory.ini --user heat-admin --sudo idm -m setup

Output not pasted here for brevity.

The machine needs an FQDN to deploy. I am going to continue the pattern from before, where the cluster name has some aspect of the user name. Since the baremetal host is ayoung-dell-t1700, this cluster will be ayoung-dell-t1700.test and the FQDN for this host will be idm.ayoung-dell-t1700.test.

sudo vi /etc/hostname
sudo hostname `cat /etc/hostname`

Run the ipa playbook.

 ansible-playbook -i inventory.ini rippowam/ipa.yml 

Assuming that runs successfully, do the same kind of thing with the keycloak.yml playbook: edit it to change the host group to idm, and run.

Also, it seems I have some typos in roles/keycloak/tasks/main.yml:

index 59f67c7..cce462d 100644
--- a/roles/keycloak/tasks/main.yml
+++ b/roles/keycloak/tasks/main.yml
@@ -114,7 +114,7 @@
     - keycloak
   copy: src=keycloak-proxy.conf 
         dest=/etc/httpd/conf.d/keycloak-proxy.conf 
-        owner=root group=rootmode="u=rw,g=r,o=r"
+        owner=root group=root mode="u=rw,g=r,o=r"
@@ -122,5 +122,5 @@
     - keycloak
   service: name=httpd
            enabled=yes
-           state=irestarted
+           state=restarted
 

Fix those and then:

 ansible-playbook -i inventory.ini rippowam/ipa.yml 

Ugh, that was messy. I need to clean it up. But it did work.

Now, how to go look at our newly deployed servers? The best bet seems to be to use sshuttle.

From the desktop (not the undercloud)

sshuttle -e "ssh -F $HOME/.quickstart/ssh.config.ansible" -r undercloud -v 10.149.2.0/24

In order to point a browser at it, you need to have an entry in the hosts file. For me:

10.149.2.15 idm.ayoung-dell-t1700.test

Keycloak needs to be initialized. Start by SSHing to the idm machine, and then:

$ cd /var/lib/keycloak/keycloak-1.9.0.Final  
$ sudo bin/add-user.sh -u admin
Press ctrl-d (Unix) or ctrl-z (Windows) to exit
Password: 
Added 'admin' to '/var/lib/keycloak/keycloak-1.9.0.Final/standalone/configuration/keycloak-add-user.json', restart server to load user
$ sudo systemctl restart keycloak.service
Extra node on Tripleo Quickstart

I’ve switched my Tripleo development to using tripleo quickstart. While the steps to create an additional VM for the IdM server are roughly what I posted before, it is different enough to warrant description.

When creating the undercloud, you can tell the quickstart script to use an alternative configuration. In my case, I have one based on “minimal” that has the additional node defined:

in tripleo-quickstart/playbooks/centosci/ipa.yml

control_memory: 8192
compute_memory: 8192

overcloud_nodes:
  - name: control_0
    flavor: control
  - name: compute_0
    flavor: compute
  - name: idm_0
    flavor: compute

# FIXME(trown) This is only temporarily set to false so we can
# change CI to use these settings without changing what is run.
# Will be changed to true in a follow-up patch.
introspect: false

extra_args: ""
tempest: false
pingtest: true

Now when kicking off the quickstart:

 ./quickstart.sh -c playbooks/centosci/ipa.yml -t all ayoung-dell-t1700

Note that I am using tags, and this one does the complete undercloud and overcloud deployment.

March 29, 2016

Ransomware is scary, but not for the reasons you think it is
If you've been paying any attention for the past few weeks, you know what ransomware is. It's a pretty massive pain for anyone who gets it, and in some cases, it was a matter of life and death.

It's easy to understand what makes this stuff scary, but there's another angle most haven't caught on to yet, and it's not a pleasant train of thought.

Firstly, let's consider a few things.

  1. Getting rid of malware is expensive
  2. Recovering from a compromise is even more expensive
  3. Ransomware has a clear and speedy ROI
  4. Normal people don't have a ton of important data
So let's start with #1 and #2. If you are compromised in some way, even if it's just some malware, it's going to cost a lot to clean up the mess. Probably orders of magnitude more than the current ransom. It's cheaper to pay than to clean up the mess. This will remain true as there isn't an incentive for the authors to price themselves out of business. The ransomware universe is econ 101. If you're an economics PhD student and you want to look impressive, write your thesis about this stuff; you'll probably win some sort of award. We'll get back to the economics of this shortly.

If we think about #3 it's pretty obvious. You write some malware, it literally pays you money. This means there is going to be more and more of this showing up on the market. Regular old malware can't compete with this. Ransomware has a business model, a really good one, except for that whole being illegal and really unethical part. Non ransomware doesn't have such an impressive business model. This is a turning point in the malware industry.

To date most of the ransomware seems to have been targeted at normal people. The price was a bit too high I thought, $400 is probably more than the average person will or can pay. The last few we've heard about hit hospitals though, and they charged a higher fee. This is basic economics. A hospital has more money than a person, and the data and infrastructure means the difference between life and death. Paying the fee will cost less than hiring a security professional. And when you're in the business of keeping people alive, you'll pay that fee if it means getting back to whatever it is you do.

If the ransomware knows where it is and what sort of data it has, the price can fluctuate based on economics. Some businesses can afford a few days of downtime, some can't. The more critical the data and system is to your business, the more you'll be willing to pay. Of course there is a ceiling on this, if the cost of hiring some security folks is less than the cost of paying the ransom, anyone with a clue is going to pay the expert to clean up the mess. This is the next logical step in the evolution of this business model.

If we keep thinking about this and bring the ransomware to its logical conclusion, the future versions are going to request a constant ongoing payment. Not a one time get out of jail free event. Why charge them once when you can charge them over and over again? Most modern infrastructures are complex enough it will be hard to impossible to remove an extremely clever bit of malware. It's going to be time for the good guys to step it up here, more thoughts on that some other day though.

There is even a silly angle that's fun to ponder. We could imagine ransomware that attacks other known malware. If the ransomware is getting a constant ongoing payment, it would be bad if anything else could remove it, from legitimate software to other ransomware. While I don't think antivirus and ransomware will ever converge on the same point, it's still fun to think about.

What do you think? Let me know: @joshbressers

March 28, 2016

Who can +2 a patch?

You are trying to push along a patch…and it dawns on you that you have no idea who to ask. The answer is out there.

Assuming it is an OpenStack project, the configuration for it is stored in the ACLs section of the project-config repo. For example, the Keystone project is managed by the keystone-core group.

Once you have the name of the group, you can look it up on Gerrit and list its members. Here is Keystone.

March 24, 2016

Identifying the message sender with Rabbit MQ and Kombu

Yesterday I showed how to identify a user when using the Pika library. However, Oslo Messaging still relies on the Kombu library. This, too, supports matching the user_id in the message to the username used to authenticate to the broker.

Again, a modification of an example from the documentation.

The important modification is to add the user_id to the publish call.

producer.publish(
    {'name': '/tmp/lolcat1.avi', 'size': 1301013},
    exchange=media_exchange, routing_key='video',
    declare=[video_queue], user_id=rabbit_userid)

This Kombu-based code does not raise an exception if the message is rejected. However, looking at the message count and the dump of the properties on the consumer, only the messages where the usernames match actually get through.

The sender sends two messages, one that matches, one that does not. Count the messages in the video queue:

$ sudo rabbitmqctl list_queues | grep video
video	0

Queue is empty.

Send two messages, one where the name matches, one where it does not.

$ python kombu-sender.py
owned message sent
misowned message sent

Now check there is only one message in the queue:

$ sudo rabbitmqctl list_queues | grep video
video	1

And receive the message:

$ python kombu-recv.py
recved
{u'name': u'/tmp/lolcat1.avi', u'size': 1301013}
sent by 
a5f56bdb395f53864a80b95f45dc395e94c546c7
$ sudo rabbitmqctl list_queues | grep video
video	0

Only the one where the name matches is passed through.

Here is the full code.
kombu-sender.py

from kombu import Connection, Exchange, Queue

media_exchange = Exchange('media', 'direct', durable=True)
video_queue = Queue('video', exchange=media_exchange, routing_key='video')

def process_media(body, message):
    print body
    message.ack()


rabbit_host = '10.149.2.1'
rabbit_userid = 'a5f56bdb395f53864a80b95f45dc395e94c546c7'
rabbit_password = '06814091f31ad50b55a3509e9e3916082cce556d'


# connections                                                                                   
with Connection('amqp://%s:%s@%s//' % (rabbit_userid, rabbit_password, rabbit_host)) as conn:

    # produce                                                                                   
    producer = conn.Producer(serializer='json')
    try:
        producer.publish({'name': '/tmp/lolcat1.avi', 'size': 1301013},
                         exchange=media_exchange, routing_key='video',
                         declare=[video_queue], user_id=rabbit_userid)
        print("owned message sent")
    except Exception as e:
        print(e)
        raise e
    try:
        producer.publish({'name': '/tmp/phish.avi', 'size': 1301013},
                         exchange=media_exchange, routing_key='video',
                         declare=[video_queue], user_id='fake_user')
	print("misowned message sent")
    except Exception as e:
        print(e)
        raise e

kombu-recv.py

from kombu import Connection, Exchange, Queue

media_exchange = Exchange('media', 'direct', durable=True)
video_queue = Queue('video', exchange=media_exchange, routing_key='video')

def process_media(body, message):
    print ('recved')
    print body
    print ('sent by ')
    print message.properties.get('user_id','no user id in message')
    message.ack()


rabbit_host = '10.149.2.1'
rabbit_userid = 'a5f56bdb395f53864a80b95f45dc395e94c546c7'
rabbit_password = '06814091f31ad50b55a3509e9e3916082cce556d'


# connections                                                                                   
with Connection('amqp://%s:%s@%s//' % (rabbit_userid, rabbit_password, rabbit_host)) as conn:

    with conn.Consumer(video_queue, callbacks=[process_media]) as consumer:
        # Process messages and handle events on all channels
        while True:
            conn.drain_events()


March 23, 2016

I'm going to do something really cool in 3 weeks! ... Probably.
If you pay attention to the security news, there is something coming called Badlock. It just set off a treasure hunt for security flaws in Samba. Rather than link to the web site (I'd rather not support this sort of behavior), let's think about this as reasonable people.

I can imagine three possible outcomes to the events that have been set in motion.
  1. On April 12 a truly impressive security flaw will be disclosed. We will all be impressed.
  2. Someone will figure this out before April 12; they have no incentive to act responsibly and will publish what they know right away. Better to be first than to be right!
  3. Whatever happens on April 12 won't be nearly as interesting or exciting as we've been led to believe. The world will say a collective 'meh' and we'll go back to looking at pictures of cats.
Numbers 1 and 2 rely on the flaw being quite serious. If it is serious, I suspect there is a far greater chance of #2 happening than #1. As an industry we should hope for #3, we don't need more terrible flaws.

The really crazy thing to think about is if the issue isn't actually serious, it probably won't be found. Everyone is looking for a giant problem. They're going to pass up minor issues (if you do find these, please report them, it's still useful work). The prize is a pot of gold, we've been told, not some proverbial "the journey is the reward" nonsense.

The thing everyone always should remember in a situation like this is there are a lot of really smart people on the planet. If you think of something clever or discover something new, there are huge odds someone else did too. 3 weeks almost guarantees someone else can figure out whatever it is you found. It's especially interesting in this case since we have a name "Badlock" so we know it probably involves locking. We know it affects Samba and Windows. And we know who it was found by so we can look at which bits of Samba they've been working on lately. That's a lot of information for a clever person.

The real thing we need to think about here though is what's actually happening. There is a bigger story for us to think about around all these named issues.

If you name an issue, you are making a claim that it's very serious. There are literally thousands of security issues per year, and maybe ten get fancy names. A name suggests this is something we should care about. That this issue is special. Except that's not really the case all the time. There have been a lot of named issues that weren't very impressive.

What happens in situations like this, when there is a near constant flow of information that's not really important? People stop listening. The human brain is really good at filtering out noise. Named security issues are going to become noise at the current rate things are going. I'm not opposed to this, I think you should name your pets not your security issues.

Send your comments to Twitter: @joshbressers
Identifying the message sender with Rabbit MQ (and Pika)

When sending a message via Rabbit MQ, a sender can choose to identify itself or hide its identity, but it cannot lie.

I modified the Pika examples to work with a hard-coded user id and password. Specifically, I added:

properties = pika.BasicProperties(user_id=rabbit_userid)

And used that as a parameter in:

channel.basic_publish(exchange='',
                      properties=properties,
                      routing_key='hello',
                      body='Hello World!')

On the receiving side, in the callback, make use of the properties:

def callback(ch, method, properties, body):
    print(" [x] Received %r" % body)
    print("Message user_id is %s " % properties.user_id)

And the output looks like this:

$ python recv.py &
[1] 5062
$  [*] Waiting for messages. To exit press CTRL+C
$ python sender.py 
 [x] Sent 'Hello World!'
 [x] Received 'Hello World!'
Message user_id is a5f56bdb395f53864a80b95f45dc395e94c546c7 

Modify the sender so the ids don’t match and:

    raise exceptions.ChannelClosed(method.reply_code, method.reply_text)
pika.exceptions.ChannelClosed: (406, "PRECONDITION_FAILED - user_id property set to 'rabbit_userid' but authenticated user was 'a5f56bdb395f53864a80b95f45dc395e94c546c7'")
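
If you would rather have the sender fail gracefully than die with a traceback, one option is to wrap the publish in a try/except. This is just a sketch of the idea; depending on the Pika version, the rejection may only surface on a later operation such as connection.close() rather than on the publish itself:

import pika.exceptions

try:
    channel.basic_publish(exchange='',
                          properties=properties,
                          routing_key='hello',
                          body='Hello World!')
except pika.exceptions.ChannelClosed as err:
    # The broker closed the channel because the user_id property did not
    # match the authenticated user; log it instead of crashing.
    print("Message rejected by broker: %s" % err)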

Here is the complete code.

sender.py

#!/usr/bin/env python
import pika

rabbit_host = '10.149.2.1'
rabbit_userid = 'a5f56bdb395f53864a80b95f45dc395e94c546c7'
rabbit_password = '06814091f31ad50b55a3509e9e3916082cce556d'
credentials = pika.PlainCredentials(rabbit_userid, rabbit_password)

connection = pika.BlockingConnection(pika.ConnectionParameters(
    host=rabbit_host, credentials=credentials))
channel = connection.channel()
channel.queue_declare(queue='hello')
properties = pika.BasicProperties(user_id=rabbit_userid)
channel.basic_publish(exchange='',
                      properties=properties,
                      routing_key='hello',
                      body='Hello World!')
print(" [x] Sent 'Hello World!'")
connection.close()

recv.py

#!/usr/bin/env python
import pika

rabbit_host = '10.149.2.1'
rabbit_userid = 'a5f56bdb395f53864a80b95f45dc395e94c546c7'
rabbit_password = '06814091f31ad50b55a3509e9e3916082cce556d'
credentials = pika.PlainCredentials(rabbit_userid, rabbit_password)

connection = pika.BlockingConnection(pika.ConnectionParameters(
    host=rabbit_host, credentials=credentials))
channel = connection.channel()

channel.queue_declare(queue='hello')

def callback(ch, method, properties, body):
    print(" [x] Received %r" % body)
    print("Message user_id is %s " % properties.user_id)

channel.basic_consume(callback,
                      queue='hello',
                      no_ack=True)

print(' [*] Waiting for messages. To exit press CTRL+C')
channel.start_consuming()

March 20, 2016

Everything is fine, nothing to see here!
As anyone who reads this blog knows, I've been talking about soft skills in security for quite some time now. I'm willing to say it's one of our biggest issues at the moment, a point which I get a lot of disagreement on. I have sympathy for anyone who thinks this stuff doesn't matter, I used to be there. Until I had to start talking to people. As soon as you talk to most anyone outside the security echo chamber, you see what's actually going on, and it's not great.

I won't say the security industry is on fire, but nobody is going to disagree that many of the things we're looking after aren't in great shape. Outside of a few very large successful companies, most organizations have serious and significant security problems that could result in a massive breach; it's just that nobody has tried, yet. I see a few reasons for many of our troubles, but I always seem to come back to soft skills.

There is a skills shortage
But there's training, look at all the training, there's so much training everything is fine!

There is training. Some is good, some is bad (like anything). It's not that training in itself is bad, I would encourage anyone to go get training. It's not great though either. Most training today focuses on the symptoms of our problems. Things like pen testing, secure coding (which doesn't exist), network defense. Things that while important, aren't the real problems. I'll talk more about this in a future post, but chew on this. There are about 96,000 CISSP holders, and about 5 million security jobs. That's messed up.

Today everyone who is REALLY, I mean REALLY REALLY good at security got there through blood sweat and tears. Nobody taught them what they know, they learned it on their own. Many of us didn't have training when we were learning these things. Regardless of this though, if training is fantastic, why does it seem there is a constant march toward things getting worse instead of better? That tells me we're not teaching the right skills to the right people. The skills of yesterday don't help you today, and especially don't help tomorrow. By its very definition, training can only cover the topics of yesterday.

How do we skill up for the needs of today and tomorrow? The first thing we have to do is listen to the people running, building, and using the technology of today. They know things we don't just as we know things they don't. Security is still almost always an afterthought, even with everyone claiming it's the most important thing ever. This is our failing, not theirs.

We build our skills by being an industry that doesn't complain and belittle everyone who tries anything. We are notorious for being brutal to the new guys. Everyone starts somewhere, don't be a jerk. I know a lot of people who are afraid to do almost anything in the security space because they know if they're not 100% correct, they will have to deal with a torrent of negative comments. It's not worth talking to us in many instances.

As an industry we are failing our customers
Things aren't that bad, sure there are some breaches but in general everything is going pretty good!

If you read any news stories, you know things aren't OK. There are loads of breaches and high profile security issues. Totally broken devices, phones that can't be updated, light bulbs that can join a botnet. As an industry we like to stick to our echo chamber circles where we spin news and events into something that isn't our fault. We laugh at the stupid people doing stupid things. We find a person or event that can explain away the incident as a singular event, not a systematic problem. The problems are growing exponentially while our resources are growing linearly, which means that, relative to the problems, our resources are effectively shrinking every year.

Most organizations don't have proper security and won't even have a proper conversation until they end up on the wrong side of a major compromise. It's our fault nobody is talking about this stuff, even if the breach isn't technically our fault.

What advice are we giving people they can actually use? In almost every organization the security group is feared and hated. We're not peers, we're enemies, and they are ours. This isn't helpful to anyone. How many of you actually sit down and have honest, real discussions with those you are supposed to help? Do you actually understand their problems (not our problems with them, their actual problems, the ones they have to route around security to solve)? Security shouldn't be something bolted on later; we're lucky if it's even that in most cases.

Security is seen as a business prohibitor, not a business enabler
I know what needs to be done, nobody wants to listen!

We've all been here before. We suggest something to the group, they ignore us. We are the problem here, not the people we are supposed to help. We blame them for not listening when the real issue is we're not talking to them properly. We throw information at people, complex hard to understand information, then rather than hold their hand when they don't understand, we declare them stupid and go find someone who agrees with us, then we complain about how dumb everyone else is and how smart we are.

They aren't stupid.

Neither are we.

The disconnect is one of talking. We have to talk to people, we have to engage with them. We have to build a relationship. You can't expect to show up and be listened to if you're not respected. People trust those they respect. If you're not in that circle of respect, you won't be taken seriously. On a regular basis I hear security people tell me "they'll know I was right when we get hacked!" That doesn't even make sense. It's your failure for not creating a level of understanding for the issue, not their fault for ignoring you.

Soft skills are hard
You don't even know what you're talking about, my skills are fine!

Maybe. I won't say I'm an expert. I am constantly thinking about the state of things and how interactions go. What I do know though is the things I discuss here are based on my real world lessons. Every day is a new journey into being a new and better security person. I know how the technology works, what I don't know is how people work. It's a journey to figure this out. I'm pretty sure I'm on to something because people I respect are encouraging, yet there are some who are trying very hard to discourage this conversation. As the old saying goes, if nobody is complaining about what you're doing, you're not doing anything interesting.


Here's what I do honestly believe. You can disagree with me or anyone you want. The industry isn't solving the problems it needs to solve. Those problems will be solved eventually; there are many industry groups forming to start talking about some of them, though those groups mostly just talk, and talking isn't a skill we're good at. Even then I see a lot of criticism toward those groups. Problems won't be solved quickly by doing the same thing we do today. I'm confident a big part of our future is humanizing security. Security today isn't for humans, security tomorrow needs to be. We get there by cooperating, not by arguing and insulting.

Think I'm an idiot, let me know: @joshbressers

March 19, 2016

Convert a keystone.rc from V2 to V3

Everything seems to produce V2 versions of the necessary variables for Keystone, and I am more and more dependent on the V3 setup. Converting from one to the other is trivial, especially if the setup uses the default domain.

#!/bin/bash
if [ "$#" -ne 1 ]
then
    echo "Usage $0 <keystone.rc>"
    exit 1
fi

. $1

NEW_OS_AUTH_URL=`echo $OS_AUTH_URL | sed 's!v2.0!v3!'`

cat << EOF
export OS_AUTH_URL=$NEW_OS_AUTH_URL
export OS_USERNAME=$OS_USERNAME
export OS_PASSWORD=$OS_PASSWORD
export OS_USER_DOMAIN_NAME=Default
export OS_PROJECT_DOMAIN_NAME=Default
export OS_PROJECT_NAME=$OS_TENANT_NAME
export OS_IDENTITY_API_VERSION=3
EOF

And to run it:

[stack@undercloud ~]$ ./v3fromv2.sh stackrc > stackrc.v3 
[stack@undercloud ~]$ . ./stackrc.v3 
[stack@undercloud ~]$ openstack domain list
+----------------------------------+------------+---------+--------------------+
| ID                               | Name       | Enabled | Description        |
+----------------------------------+------------+---------+--------------------+
| d702b42eb3694279bdd0cc74a848a103 | heat_stack | True    |                    |
| default                          | Default    | True    | The default domain |
+----------------------------------+------------+---------+--------------------+

March 18, 2016

Sausage Factory: Multiple Edition Handling in Fedora

First off, let me be very clear up-front: normally, I write my blog articles to be approachable by readers of varying levels of technical background (or none at all). This will not be one of those. This will be a deep dive into the very bowels of the sausage factory.

The Problem

Starting with the Fedora.next initiative, the Fedora Project embarked on a journey to reinvent itself. A major piece of that effort was the creation of different “editions” of Fedora that could be targeted at specific user personas. Instead of having a One-Size-Fits-Some Fedora distribution, we would produce an operating system for “doers” (Fedora Workstation Edition), for traditional infrastructure administrators (Fedora Server Edition) and for new, cloudy/DevOps folks (Fedora Cloud Edition).

We made the decision early on that we did not want to produce independent distributions of Fedora. We wanted each of these editions to draw from the same collective set of packages as the classic Fedora. There were multiple reasons for this, but the most important of them was this: Fedora is largely a volunteer effort. If we started requiring that package maintainers had to do three or four times more work to support the editions (as well as the traditional DIY deployments), we would quickly find ourselves without any maintainers left.

However, differentiating the editions solely by the set of packages that they deliver in a default install isn’t very interesting. That’s actually a problem that could have been solved simply by having a few extra choices in the Anaconda installer. We also wanted to solve some classic arguments between Fedora constituencies about what the installed configuration of the system looks like. For example, people using Fedora as a workstation or desktop environment in general do not want OpenSSH running on the system by default (since their access to the system is usually by sitting down physically in front of a keyboard and monitor) and therefore don’t want any potential external access available. On the other hand, most Fedora Server installations are “headless” (no input devices or monitor attached) and thus having SSH access is critical to functionality. Other examples include the default firewall configuration of the system: Fedora Server needs to have a very tightened default firewall allowing basically nothing in but SSH and management features, whereas a firewall that restrictive proves to be harmful to usability of a Workstation.

Creating Per-Edition Default Configuration

The first step to managing separate editions is having a stable mechanism for identifying what edition is installed. This is partly aesthetic, so that the user knows what they’re running, but it’s also an important prerequisite (as we’ll see further on) to allowing the packaging system and systemd to make certain decisions about how to operate.

The advent of systemd brought with it a new file that describes the installed system called os-release. This file is considered to be authoritative for information identifying the system. So this seemed like the obvious place for us to extend to include information about the edition that was running as well. We therefore needed a way to ensure that the different editions of Fedora would produce a unique (and correct) version of the os-release file depending on the edition being installed. We did this by expanding the os-release file to include two new values: VARIANT and VARIANT_ID. VARIANT_ID is a machine-readable unique identifier that describes which version of Fedora is installed. VARIANT is a human-readable description.
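
As an illustration, the Server Edition ends up with entries along these lines in os-release (the values shown are representative of the scheme, not copied from any particular release):

VARIANT="Server Edition"
VARIANT_ID=server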

In Fedora, the os-release file is maintained by a special RPM package called fedora-release. The purpose of this package is to install the files onto the system that guarantee this system is Fedora. Among other things, this includes os-release, /etc/fedora-release, /etc/issue, and the systemd preset files. (All of those will become interesting shortly).

So the first thing we needed to do was modify the fedora-release package such that it included a series of subpackages for each of the individual Fedora editions. These subpackages would be required to carry their own version of os-release that would supplant the non-edition version provided by the fedora-release base package. I’ll circle back around to precisely how this is done later, but for now accept that this is true.

So now that the os-release file on the system is guaranteed to contain the appropriate VARIANT_ID, we needed to design a mechanism by which individual packages could make different decisions about their default configurations based on this. The full technical details of how to do this are captured in the Fedora Packaging Guidelines, but the basic gist of it is that any package that wants to behave differently between two or more editions must read the VARIANT_ID from os-release during its %posttrans (post-transaction) phase of package installation and place a symlink to the correct default configuration file in place. This needs to be done in the %posttrans phase because, due to the way that yum/dnf processes the assorted RPMs, there is no other way to guarantee that the os-release file has the right values until that time. This is because it’s possible for a package to install and run its %post script between the time that the fedora-release package and the fedora-release-EDITION package get installed.
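
To make that concrete, here is a rough Python sketch of the decision such a scriptlet makes. Real packages do this in shell or in the embedded Lua described below, and the configuration paths here are made up purely for illustration:

def read_variant_id(path='/etc/os-release'):
    """Return VARIANT_ID from os-release, or None if it is absent."""
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line.startswith('VARIANT_ID='):
                # The value may or may not be quoted.
                return line.split('=', 1)[1].strip('"')
    return None

# Illustrative only: choose an edition-specific default configuration.
defaults = {
    'server': '/usr/share/example/config.server',
    'workstation': '/usr/share/example/config.workstation',
}
variant = read_variant_id()
config = defaults.get(variant, '/usr/share/example/config.default')
# A real %posttrans scriptlet would now symlink the chosen file into place.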

That all assumes that the os-release file is correct, so let’s explore how that is made to happen. First of all, we created a new directory in /usr/lib called /usr/lib/os.release.d/ which will contain all of the possible alternate versions of os-release (and some other files as well, as we’ll see later). As part of the %install phase of the fedora-release package, we generate all of the os-release variants and then drop them into os.release.d. We will then later symlink the appropriate one into /usr/lib/os-release and /etc/os-release during %post.

There’s an important caveat here: the /usr/lib/os-release file must be present and valid in order for any package to run the %systemd_post scripts to set up their unit files properly. As a result, we need to take a special precaution. The fedora-release package will always install its generic (non-edition) os-release file during its %post section, to ensure that the %systemd_post scripts will not fail. Then later if a fedora-release-EDITION package is installed, it will overwrite the fedora-release one with the EDITION-specific version.

The more keen-eyed reader may have already spotted a problem with this approach as currently described: What happens if a user installs another fedora-release-EDITION package later? The short answer was that in early attempts at this: “Bad Things Happened”. We originally had considered that installation of a fedora-release-EDITION package atop a system that only had fedora-release on it previously would result in converting the system to that edition. However, that turned out to A) be problematic and B) violate the principle of least surprise for many users.

So we decided to lock the system to the edition that was first installed by adding another file: /usr/lib/variant, which is essentially just a copy of the VARIANT_ID line from /etc/os-release. The %post script of each of the fedora-release subpackages (including the base subpackage) checks this file’s contents. If it does not exist, the %post script of a fedora-release-EDITION package will create it with the appropriate value for that edition. If processing reaches all the way to the %posttrans script of the fedora-release base package (meaning no edition package was part of the transaction), then it will write the variant file at that point to lock it into the non-edition variant.

There remains a known bug with this behavior, in that if the *initial* transaction actually includes two or more fedora-release-EDITION subpackages, whichever one is processed first will “win” and write the variant. In practice, this is unlikely to happen since all of the install media are curated to include at most one fedora-release-EDITION package.

I said above that this “locks” the system into the particular edition, but that’s not strictly true. We also ship a script along with fedora-release that will allow an administrator to manually convert between editions by running `/usr/sbin/convert-to-edition -e <edition>`. Effectively, this just reruns the steps that the %post of that edition would run, except that it skips the check for whether the variant file is already present.

Up to now, I’ve talked only about the os-release file, but the edition-handling also addresses several other important files on the system, including /etc/issue and the systemd presets. /etc/issue is handled identically to the os-release file, with the symlink being created by the %post scripts of the fedora-release-EDITION subpackages or the %posttrans of the fedora-release package if it gets that far.

The systemd presets are a bit of a special case, though. First of all, they do not replace the global default presets, but they do supplement them. This means that what we do is symlink an edition-specific preset into the /usr/lib/systemd/system-preset/ directory. These presets can either enable new services (as in the Server Edition, where it turns on Cockpit and rolekit) or disable them (as in Workstation Edition where it shuts off OpenSSH). However, due to the fact that systemd only processes the preset files during its %post phase, we need to force systemd to reread them after we add the new values.
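
For illustration, an edition preset dropped into that directory is just a list of enable/disable lines, something like the following (file names and exact unit lists are illustrative, not the shipped presets):

# 80-server.preset (illustrative)
enable cockpit.socket

# 80-workstation.preset (illustrative)
disable sshd.service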

We need to be careful when doing this, because we only want to apply the new presets if the current transaction is the initial installation of the fedora-release-EDITION package. Otherwise, an upgrade could override choices that the user themselves have made (such as disabling a service that defaults to enabled). This could lead to unexpected security issues, so it has to be handled carefully.

In this implementation, instead of just calling the command to reprocess all presets, we instead parse the preset files and just process only those units that are mentioned in them. (This is to be overcautious in case any other package is changing the default enabled state besides systemd, such as some third-party RPMs that might have `systemctl enable httpd.service` in their %post section, for example.)

Lastly, due to the fact that we are using symlinks to manage most of this, we had to write the %post and %posttrans scripts in the built-in Lua implementation carried by RPM. This allowed us to call posix.symlink() without having to add a dependency on coreutils to do so in bash (which resulted in a circular dependency and broken installations). We wrote this as a single script that is imported by the RPM during the SRPM build phase. This script is actually copied by rpmbuild into the scriptlet sections verbatim, so the script must be present in the dist-git checkout on its own, not merely as part of the exploded tarball. So when modifying the Lua script, it’s important to make sure to modify the copy in dist-git as well as the copy upstream.


March 17, 2016

Dependency Injection in Python applied to Ossipee

I reworked my OpenStack API based cluster builder Ossipee last weekend. It makes heavy use of dependency resolution now, and breaks apart the super-base class into properly scoped components.

work.py contains the worker classes. These are designed to be reusable components.
plan.py is a merger of the config and plan objects from before. This killed the majority of the copying. It is the least cleaned up of any of the code; I might continue to rework it.

ossipee.py has the factories which determine how to build the components. Python’s lack of type support is really apparent here, leading to boilerplate code.

I particularly like how the Session and client factories now work.

The session factory:

def session_factory(resolver):
    parser = resolver.resolve(argparse.ArgumentParser)
    args = parser.parse_args()
    auth_plugin = ksc_auth.load_from_argparse_arguments(args)
    try:
        if not auth_plugin.auth_url:
            logging.error('OS_AUTH_URL not set.  Aborting.')
            sys.exit(-1)
    except AttributeError:
        pass

    session = ksc_session.Session.load_from_cli_options(
        args, auth=auth_plugin)

    return session

The Nova client factory:

def nova_client_factory(resolver):
    session = resolver.resolve(ksc_session.Session)
    nova_client = novaclient.Client('2', session=session)
    return nova_client

They are registered like this:

depend.register(ksc_session.Session, session_factory)
depend.register(novaclient.Client, nova_client_factory)

https://github.com/admiyo/ossipee/blob/master/ossipee.py#L306

So, the worker object to create a host declares its dependencies in the constructor.

class Server(object):
    def __init__(self, nova, neutron, spec):
        self.name = spec.name
        self.nova = nova
        self.neutron = neutron
        self.spec = spec

Ideally, the parameters to the __init__ function would have documentation about types. While that can make use of ABC, it does not help for all the code out there that does not use ABC. ABC would be useful for providing a means to automate dependency resolution.

I pulled the resolver code I wrote a few years ago into the tree for now, for ease of development. I’ll probably merge it back to the original project. The biggest addition is the ability to name components, to be able to distinguish between two components that implement the same contract. Without this, I had subclass proliferation. Python tuples really make sense here: a factory is registered via the tuple of the class and the (optional) name, and is resolved the same way.

Mixing named and unnamed components is still a little grungy, but it makes it nice to have a component that can both be a top level worker, and a piece of another workflow.

An instance is resolved via the scope, the class, and the name. We can cheat, and pass in a string as the Class for the name of the “worker”, but I don’t think I want to encourage that.

I only have a single scope for Ossipee, the global scope, as it was fairly short lived. I’d like to try the depend.py code in a web app with both request and session scope to see how well it works to organize things.
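
To show the shape of the pattern, here is a minimal sketch of a resolver keyed on (class, name) with a single scope. This is not the actual depend.py code, just an illustration of the idea, with made-up components for the usage example:

class Resolver(object):
    """Toy resolver: factories keyed by (cls, name), instances cached in one scope."""

    def __init__(self):
        self._factories = {}
        self._instances = {}  # a single, global scope for this sketch

    def register(self, cls, factory, name=None):
        self._factories[(cls, name)] = factory

    def resolve(self, cls, name=None):
        key = (cls, name)
        if key not in self._instances:
            # Factories receive the resolver so they can pull in their own deps.
            self._instances[key] = self._factories[key](self)
        return self._instances[key]

# Illustrative usage:
class Config(object):
    host = '10.149.2.1'

class Client(object):
    def __init__(self, config):
        self.config = config

resolver = Resolver()
resolver.register(Config, lambda r: Config())
resolver.register(Client, lambda r: Client(r.resolve(Config)))
print(resolver.resolve(Client).config.host)

Adding request or session scopes would mostly be a matter of keeping one instance cache per scope instead of the single global one above.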

March 16, 2016

Tie Your Rabbit Down

I’ve been running the Tripleo Quickstart to setup my development deployments. While looking into the setup, I noticed that the default Rabbit deployment is wide open. I can’t see anything other than firewall port blocking in place. I dug deeper.

All of the services use the following values to talk to the Queues

  RabbitUserName:  guest
  RabbitPassword: guest

The Access Control List (ACL) allows all powers over all queues. There is no Transport Layer Security on the network communication.

I was able to address the first issue by editing the openstack-deploy.sh script that Tripleo Quickstart generates. There is a heredoc section that sets many of the defaults that go into the yaml config file used as the input for openstack overcloud create. I added:

  RabbitUserName:  fubar
  RabbitPassword: fumtu

And confirmed that the cloud worked with these changes by running

git clone https://git.openstack.org/openstack-infra/tripleo-ci
tripleo-ci/scripts/tripleo.sh  --overcloud-pingtest

As well as sshing to the controller and running

$ sudo rabbitmqctl list_users
Listing users ...
fubar	[administrator]
...done.
$ sudo grep -i rabbit_password /etc/nova/nova.conf 
# Deprecated group;name - DEFAULT;rabbit_password
#rabbit_password=guest
rabbit_password=fumtu

While I was tempted to tackle this in Quickstart, I think it is better to leave the issue visible there and instead tackle it in the Tripleo library.

We deploy all of Rabbit in a single vhost:

$ sudo rabbitmqctl list_vhosts
Listing vhosts ...
/
...done.

But we do allow for the separation of the RPC mechanism from the Notifications:

In the Nova config file:

# The topic compute nodes listen on (string value)
#compute_topic=compute
...
[cells]
#  (string value)
#topic=cells
#rpc_driver_queue_base=cells.intercell
...
[conductor]
#topic=conductor

[oslo_messaging_notifications]
#topics=notifications

The Keystone config file only has the notifications section. All have the Rabbit Userid and Password in the clear.

The Oslo RPC call is based on creating a response Queue. I would like to permit only the intended RPC target to write to this response Queue. However, these queues are generated using a Random UUID.

def _get_reply_q(self):
        with self._reply_q_lock:
            if self._reply_q is not None:
                return self._reply_q

            reply_q = 'reply_' + uuid.uuid4().hex

            conn = self._get_connection(rpc_common.PURPOSE_LISTEN)

            self._waiter = ReplyWaiter(reply_q, conn,
                                       self._allowed_remote_exmods)

            self._reply_q = reply_q
            self._reply_q_conn = conn

This makes it impossible to write a regular expression to limit the set of accessible queues.

What services actually have presence on the compute nodes? (some lines removed for clarity)

$ sudo lsof -i tcp:amqp
COMMAND     PID       USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
neutron-o 17236    neutron    8u  IPv4  40581      0t0  TCP overcloud-novacompute-0.localdomain:53049->overcloud-controller-0.localdomain:amqp (ESTABLISHED)
...
neutron-o 17236    neutron   19u  IPv4  40590      0t0  TCP overcloud-novacompute-0.localdomain:53058->overcloud-controller-0.localdomain:amqp (ESTABLISHED)
nova-comp 17269       nova    4u  IPv4  40572      0t0  TCP overcloud-novacompute-0.localdomain:53047->overcloud-controller-0.localdomain:amqp (ESTABLISHED)
...
nova-comp 17269       nova   19u  IPv4 130115      0t0  TCP overcloud-novacompute-0.localdomain:53157->overcloud-controller-0.localdomain:amqp (ESTABLISHED)
ceilomete 17682 ceilometer   12u  IPv4 130381      0t0  TCP overcloud-novacompute-0.localdomain:53162->overcloud-controller-0.localdomain:amqp (ESTABLISHED)

In order to trace the connections, I created rabbit users with uuidgen-based passwords:

sudo rabbitmqctl add_user overcloud-ceil-0 28d90d7c-1ebb-47a6-b58b-3df7aef1f6bf
sudo rabbitmqctl add_user overcloud-neutron-0 1290a77d-35a1-4afa-b5ea-cbc8f9387754
sudo rabbitmqctl add_user overcloud-novacompute-0 53493010-37b3-4188-bd88-b933b9322c7c
sudo rabbitmqctl add_user keystone 4810a2c6-60f0-4014-8fbb-d628ad9d52f9
sudo rabbitmqctl set_permissions overcloud-ceil-0 ".*" ".*" ".*"
sudo rabbitmqctl set_permissions overcloud-neutron-0 ".*" ".*" ".*"
sudo rabbitmqctl set_permissions overcloud-novacompute-0 ".*" ".*" ".*"
sudo rabbitmqctl set_permissions keystone ".*" ".*" ".*"

First, I tested by editing the Keystone configuration on the controller, and was able to see the user change from guest to keystone.

Then, I used the appropriate values on the compute node for the rabbit_user_id and rabbit_password values in the files:

/etc/ceilometer/ceilometer.conf
/etc/nova/nova.conf 
/etc/neutron/neutron.conf

Then I restarted the node. After reboot, Nova and Neutron came back, but Ceilometer was not happy (even after cycling the services on both the control node and the compute node).

$ sudo lsof -i tcp:amqp
COMMAND    PID    USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
neutron-o 1680 neutron    8u  IPv4  23125      0t0  TCP overcloud-novacompute-0.localdomain:49085->overcloud-controller-0.localdomain:amqp (ESTABLISHED)
...
neutron-o 1680 neutron   19u  IPv4  23449      0t0  TCP overcloud-novacompute-0.localdomain:49096->overcloud-controller-0.localdomain:amqp (ESTABLISHED)
nova-comp 1682    nova    4u  IPv4  24066      0t0  TCP overcloud-novacompute-0.localdomain:49097->overcloud-controller-0.localdomain:amqp (ESTABLISHED)
...
nova-comp 1682    nova   20u  IPv4 487795      0t0  TCP overcloud-novacompute-0.localdomain:49582->overcloud-controller-0.localdomain:amqp (ESTABLISHED)

Going back to the controller: there is obviously a 1-to-1 relationship between the connections from the compute node and the entities that rabbitmqctl allows us to list:

$sudo rabbitmqctl list_connections
keystone	192.0.2.21	43714	running
keystone	192.0.2.21	43921	running
overcloud-neutron-0	192.0.2.20	49085	running
...
overcloud-neutron-0	192.0.2.20	49096	running
overcloud-novacompute-0	192.0.2.20	49097	running
...
overcloud-novacompute-0	192.0.2.20	49582	running
...done.

With this information we should be able to put together a map of which service talks on which channel.
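
As a rough sketch of how that map might be assembled, a few lines of Python can tally the connections per broker user. This assumes the default list_connections columns of user, peer host, peer port, and state, and is purely illustrative:

import collections
import subprocess

# Group broker connections by the user that opened them.
out = subprocess.check_output(['sudo', 'rabbitmqctl', 'list_connections']).decode('utf-8')

per_user = collections.defaultdict(list)
for line in out.splitlines():
    fields = line.split('\t')
    if len(fields) < 4:
        continue  # skips the "Listing connections ..." and "...done." lines
    user, peer_host, peer_port, state = fields[:4]
    per_user[user].append('%s:%s (%s)' % (peer_host, peer_port, state))

for user, conns in sorted(per_user.items()):
    print('%s: %d connection(s)' % (user, len(conns)))
    for conn in conns:
        print('    %s' % conn)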

This is a complex system. I’m going to do some more digging, and see if I can come up with an approach to lock things down a bit better.

March 13, 2016

Containers are like sandwiches
During the RSA conference, I was talking about containers and it occurred to me we can think about them like a sandwich. Not so much that they're tasty, but rather in terms of where your container comes from. I was pleased that almost all of the security people I spoke with understand the current security nightmare containers are. The challenge of course is how do we explain what's going on to everyone else. Security is hard and we're bad at talking about it. They also didn't know what Red Hat was doing, which is totally our own fault, but we'll talk about that somewhere else.

But containers are sandwiches. What does that mean? Let's think about them in this context. You can pick up a sandwich. You can look at it, you can tell basically what's going on inside. Are there tomatoes? Lettuce? Ham? Turkey? It's not that hard. There can be things hiding, but for the most part you can get the big details. This is just like a container. Fedora? Red Hat? Ubuntu? It has httpd, great. What about a shell? systemd? Cool. There can be scary bits hidden in there too. Someone decided to replace /bin/sh with a python script? That's just like hiding the olives under the lettuce. What sort of monster would do such a thing!

So now that we have the image of a sandwich in our minds, let's think about a few scenarios.

Find it on a bench
If you're walking through the park and you see a sandwich just laying on a bench what would you do? You might look around, wondering who left this tasty delight, but you're not going to eat it. Most people wouldn't even touch it, who put it there, where did it come from, how old is it, does it have onions? So many questions and you honestly can't get a decent answer. Even if someone could answer the questions, would you eat that sandwich? I certainly wouldn't.

Finding a sandwich on a bench is the public container registry. If this is all you know, you wouldn't think there's anything wrong with doing this, but like the public registry, you don't always know what you're getting. I wonder how many of those containers saw an update for the glibc flaw from a few weeks ago? It's probably easier not knowing.

Get it from a scary shop with questionable ingredients
A long time ago I was walking around in New York and decided to hop into a sandwich shop for a quick bite. As I reached for the door, there was a notice from the health department. I decided to keep walking. Even if you can get your sandwich from a shop, if the shop is scary, you could find yourself in trouble.

There are loads of containers available out there you can download that aren't trusted sources. Don't download random containers from random places. It's no different than trying to buy a sandwich from a filthy shop that has to shoo the rats out of the kitchen with a broom.

Get it from a nice shop that uses old ingredients
We've all seen those places selling sandwiches that look nice. The sign is painted, the windows are clean. When you walk in the tables are clean enough to eat off of! But then you order and it's pretty clear everything is old and dried out. You might be able to sneak out the back door before the old man putting it together notices you're not there anymore.

This is currently a huge danger in the container space. Containers are super hip right now so there are plenty of people doing work in this space. Many of these groups don't even know they have a problem. The software in your containers is a lot like sandwich meat. After a few weeks it probably will start to smell, and after a month it's going to do some serious damage to anyone who consumes it.

Be sure to ask your container supplier what they're shipping, where it came from and how fresh it is. It would not be unreasonable to ask "If this container was a sandwich would you eat it?"

Get it from a nice shop that uses nice ingredients
This is the dream. You walk into a nice shop. The nice person behind the counter takes your order and using the freshest ingredients possible constructs a sandwich shaped work of art. You take pictures and post them to all your friends explaining this sandwich is what your life was always missing and you didn't know it before now.

This is why you need a partner you can trust when it comes to container content. The closer to the source you can get the better. Ask questions about the content. Where did it come from? Who is taking care of it? How can I prove any of this? Who is updating it? Containers are a big deal, they're new and exciting. They're also very misunderstood. Only use fresh containers. If the content is more than a few months old, you're eating a sandwich off a park bench. Don't eat sandwiches off park benches. Ask hard questions. If your vendor can't answer them, you need to try the shop across the street. Part of the magic of containers is they are the result of truly commoditizing the operating system, you can get container content from a lot of sources, find a good one.

If we think about our infrastructure like we think about public health, you don't want to be responsible for making everyone sick. You need to know what you're using, where it came from, how fresh it is, who put it together, and what's in it. It's not enough to pretend everything is fine. Everything is not fine.

March 07, 2016

The interesting things from RSA are what didn't happen, and containers are sandwiches
The RSA conference is done. It was a very long and busy show, there were plenty of interesting people there and lots of clever ideas and things to do.

I think the best part is what didn't happen though. We love talking about the exciting things from the show, but I'm going to talk about the unexciting non-events I was waiting for (and which, thankfully, never happened).

The DROWN issue came and went. It wasn't very exciting, and it got the appropriate amount of attention. Basically SSLv2 is still broken, don't use it for any reason. If you use SSLv2, it's like licking the handrail at the airport. Nobody is going to feel bad for you.

There were keynotes by actors. The world continues to turn (pun intended). But really, these keynotes are about being entertaining. I didn't go, because, well, they're actors :) But I suspect they were entertaining. No doubt this will happen more and more; as there are more and more security conferences, finding good keynotes will only get harder. They should hire that guy from the Hackers movie next.

There weren't any exciting hacking events. Not that stunt hacking is a thing for RSA, but I'm glad nobody tried anything new. I'm sure Blackhat will be a very different story. We shall wait and see.

And most importantly, I wasn't booed off the stage :P
I was pleased with how my talk went. Attendance was light but that's expected on a Friday morning. The thing that made me the happiest is that they had to kick our group out of the room for the next talk, not because I rambled on but because I got everyone in the room talking to each other. It was fantastic.

On to the interesting bit of the trip though. I found the most interest when I was talking about Red Hat's concept of a trusted container registry. Today if you're using the public registry it's comparable to finding a sandwich on a bench at the park. You can look at it, you can tell it has ham and lettuce, but I mean, it's a sandwich you found on a bench. Are you going to eat that?

If you want a nice sandwich you're going to go to a sandwich shop, order a sandwich, and watch someone make it for you. You can then go and sit on the bench if you want.

The idea behind Red Hat's trusted registry is we have a container registry for Red Hat customers. We control all the content in the registry, we know exactly what it is. We know where it came from. We control the sandwich supply chain from start to finish. No mystery meats here!

All the security people I talked to know that containers are currently a bit of a security circus. None of them knew what Red Hat was doing. This is of course a great opportunity for Red Hat to spread the word. Stay tuned for more clever sandwich jokes.

March 04, 2016

What Can Talk To What on the OpenStack Message Broker

If a hypervisor is compromised, the Nova Compute instance running on that node is also compromised. If the compute instance is compromised, then its access to the Message Queue has to be considered tainted as well. What degree of risk does this pose?

I mention the compute node, but really, any service that has access to the broker is a vector for attack. This includes any third party application that listens for, say, Keystone notifications for audit purposes.

At the bottom of this article I have posted an inventory from a recent Tripleo deployment. There are a lot of exchanges and queues, and reading through them is informative.

What we need is a table showing who can read from and who can write to each of these elements.

My first hack at an ACL approach:

  • The default rule should be “read only”.
  • If a service is responsible for creating an exchange or a queue, it should get write access.
  • Beyond that, the owning service should grant explicit write access to specific services for a given queue/exchange.

What is the start state?

$ sudo rabbitmqctl list_users
Listing users ...
guest	[administrator]
...done.
$ sudo rabbitmqctl list_permissions
Listing permissions in vhost "/" ...
guest	.*	.*	.*
...done.

So, by default, all the services connect as the same user, and have full permissions to read and write on everything.

I will state that only the Keystone server should be able to write to the keystone topic, and, by default, only Ceilometer should be reading from it.
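As a rough sketch of where this could go, here is what per-service broker credentials might look like with rabbitmqctl. The user names, passwords, and regexes below are hypothetical and only cover the keystone exchange and notification queues; a real policy would also need to account for reply queues and the fanout exchanges. rabbitmqctl set_permissions takes three regexes: configure, write, and read.

# Create dedicated users instead of sharing "guest" (passwords are placeholders)
$ sudo rabbitmqctl add_user keystone KEYSTONE_PASS
$ sudo rabbitmqctl add_user ceilometer CEILOMETER_PASS

# Keystone may declare and publish to the keystone exchange, but not consume from it
$ sudo rabbitmqctl set_permissions -p / keystone "^keystone$" "^keystone$" "^$"

# Ceilometer may declare and bind its notification queues and consume from them;
# binding also needs read access on the keystone exchange (simplified here)
$ sudo rabbitmqctl set_permissions -p / ceilometer "^notifications.*" "^notifications.*" "^(keystone|notifications.*)$"

The broker evaluates those regexes against resource names whenever a connection declares, publishes to, or consumes from an exchange or queue, which maps directly onto the read/write table described above.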

$ sudo rabbitmqctl list_exchanges
Listing exchanges ...
	direct
amq.direct	direct
amq.fanout	fanout
amq.headers	headers
amq.match	headers
amq.rabbitmq.log	topic
amq.rabbitmq.trace	topic
amq.topic	topic
ceilometer	topic
central	topic
cert_fanout	fanout
cinder	topic
cinder-scheduler_fanout	fanout
cinder-volume_fanout	fanout
compute_fanout	fanout
conductor_fanout	fanout
consoleauth_fanout	fanout
dhcp_agent_fanout	fanout
engine_fanout	fanout
glance	topic
heat	topic
heat-engine-listener_fanout	fanout
ironic	topic
keystone	topic
l3_agent_fanout	fanout
magnetodb	topic
magnum	topic
neutron	topic
neutron-vo-QosPolicy-1.0_fanout	fanout
nova	topic
openstack	topic
q-agent-notifier-dvr-update_fanout	fanout
q-agent-notifier-network-update_fanout	fanout
q-agent-notifier-port-delete_fanout	fanout
q-agent-notifier-port-update_fanout	fanout
q-agent-notifier-security_group-update_fanout	fanout
q-agent-notifier-tunnel-delete_fanout	fanout
q-agent-notifier-tunnel-update_fanout	fanout
q-l3-plugin_fanout	fanout
q-plugin_fanout	fanout
q-reports-plugin_fanout	fanout
reply_1cbc785538484554850f69dda902c537	direct
reply_748d4640dbab4284bae19fe086af14e8	direct
reply_ab42e35c548d48b48c9ba0fc3ac93ec7	direct
reply_b37538409ae84436804ccd1b1c0a3bdd	direct
reply_c6bebd23c7e24a5c9a06730b42d317cf	direct
reply_f34034fd84e347e8b6aeedc49f97282d	direct
sahara	topic
sahara-ops_fanout	fanout
scheduler_fanout	fanout
swift	topic
trove	topic
zaqar	topic
...done.

Here are the Queues

$ sudo rabbitmqctl list_queues
Listing queues ...
cert	0
cert.overcloud-controller-0.localdomain	0
cert_fanout_c8d9d81c87d84e728cb498a0d434c825	0
cinder-scheduler	0
cinder-scheduler.hostgroup	0
cinder-scheduler_fanout_7969a98120ca4f2097af3ade0ba159ef	0
cinder-volume	0
cinder-volume.hostgroup@tripleo_iscsi	0
cinder-volume_fanout_1520069c024c4c6490fdbb6f336819cc	0
compute	0
compute.overcloud-novacompute-0.localdomain	0
compute_fanout_7dc21bc0422b4d4c9addb151e9e2d8ba	0
conductor	0
conductor.overcloud-controller-0.localdomain	0
conductor_fanout_9f3ff7a1e8b146fc9b5dccb1aa80f119	0
consoleauth	0
consoleauth.overcloud-controller-0.localdomain	0
consoleauth_fanout_4b36518037784e7aad7ce7049b89d089	0
dhcp_agent	0
dhcp_agent.overcloud-controller-0.localdomain	0
dhcp_agent_fanout_8776747599464cc3b80a56b731841fd7	0
engine	0
engine.overcloud-controller-0.localdomain	0
engine_fanout_1030eeeec4644022b5a9f7259f7e0018	0
engine_fanout_2ffb137908934072af6a15d3a6b9e616	0
engine_fanout_bac7897eb7ac43f0a561a0c12c408e26	0
engine_fanout_f439912a1d80484ea38ab784a95fb656	0
heat-engine-listener	0
heat-engine-listener.31d42df9-f64f-451d-b9d6-7ef46229c929	0
heat-engine-listener.8730caa4-4104-4d71-bcc1-08ae17a41420	0
heat-engine-listener.b1dd3b6e-d085-4005-a4c9-a29b6f91c3f6	0
heat-engine-listener.ea60f788-af0c-49be-9325-8cefe60cc53a	0
heat-engine-listener_fanout_3b2879946f754cd9bd4becc6b8448071	0
heat-engine-listener_fanout_725990e3081f4ddc839a1bbf78520873	0
heat-engine-listener_fanout_aa1ec5483825470797e11b73cddaf223	0
heat-engine-listener_fanout_cb251c73b0f64d64ac3e38b529e1de30	0
l3_agent	0
l3_agent.overcloud-controller-0.localdomain	0
l3_agent_fanout_83ec229461dd4bd68d4e0debc7f9a39d	0
metering.sample	0
neutron-vo-QosPolicy-1.0	0
neutron-vo-QosPolicy-1.0.overcloud-controller-0.localdomain	0
neutron-vo-QosPolicy-1.0.overcloud-novacompute-0.localdomain	0
neutron-vo-QosPolicy-1.0_fanout_5f54eaed13cb47da8d80b82223f87e47	0
neutron-vo-QosPolicy-1.0_fanout_f57c66031cb4437ea75a23ec1698b287	0
notifications.info	0
notifications.sample	0
q-agent-notifier-dvr-update	0
q-agent-notifier-dvr-update.overcloud-controller-0.localdomain	0
q-agent-notifier-dvr-update.overcloud-novacompute-0.localdomain	0
q-agent-notifier-dvr-update_fanout_33c92818f86644899082458f893c6157	0
q-agent-notifier-dvr-update_fanout_82d2beb050dc4dde956a86cc6e2e5562	0
q-agent-notifier-network-update	0
q-agent-notifier-network-update.overcloud-controller-0.localdomain	0
q-agent-notifier-network-update.overcloud-novacompute-0.localdomain	0
q-agent-notifier-network-update_fanout_0ef20a72234a45718ece2328d230e2c6	0
q-agent-notifier-network-update_fanout_737fb57587f3453cb14d41b01c5fcdcc	0
q-agent-notifier-port-delete	0
q-agent-notifier-port-delete.overcloud-controller-0.localdomain	0
q-agent-notifier-port-delete.overcloud-novacompute-0.localdomain	0
q-agent-notifier-port-delete_fanout_03a026eb000c4efd89e15dc7834b8fdd	0
q-agent-notifier-port-delete_fanout_acd74597e74041abace267f898a2ce31	0
q-agent-notifier-port-update	0
q-agent-notifier-port-update.overcloud-controller-0.localdomain	0
q-agent-notifier-port-update.overcloud-novacompute-0.localdomain	0
q-agent-notifier-port-update_fanout_28a72273f7234c3b9c4cb4d4f64854c1	0
q-agent-notifier-port-update_fanout_b8ccb7d92aa64bfb9106ecd10c59cfea	0
q-agent-notifier-security_group-update	0
q-agent-notifier-security_group-update.overcloud-controller-0.localdomain	0
q-agent-notifier-security_group-update.overcloud-novacompute-0.localdomain	0
q-agent-notifier-security_group-update_fanout_008a11d67bc54f12bce4a03387a64000	0
q-agent-notifier-security_group-update_fanout_a9f578980b6f4c1ca65629e887bff76e	0
q-agent-notifier-tunnel-delete	0
q-agent-notifier-tunnel-delete.overcloud-controller-0.localdomain	0
q-agent-notifier-tunnel-delete.overcloud-novacompute-0.localdomain	0
q-agent-notifier-tunnel-delete_fanout_1769bab276d44b34a6db34498db522c8	0
q-agent-notifier-tunnel-delete_fanout_cb6f4fd56c8f40b9b2f3b0a6484b70ad	0
q-agent-notifier-tunnel-update	0
q-agent-notifier-tunnel-update.overcloud-controller-0.localdomain	0
q-agent-notifier-tunnel-update.overcloud-novacompute-0.localdomain	0
q-agent-notifier-tunnel-update_fanout_47164fceef534b298b8ea4ee34f9282b	0
q-agent-notifier-tunnel-update_fanout_be1a1e9cc37c4f94921131c3346eed48	0
q-l3-plugin	0
q-l3-plugin.overcloud-controller-0.localdomain	0
q-l3-plugin_fanout_bf639b0aebe6466dba97fb88151ee8b7	0
q-l3-plugin_fanout_eeac107aa8374f87afbddbf6aafcd65c	0
q-plugin	0
q-plugin.overcloud-controller-0.localdomain	0
q-plugin_fanout_38617a666c6c46fd91c6eada520f0303	0
q-reports-plugin	0
q-reports-plugin.overcloud-controller-0.localdomain	0
q-reports-plugin_fanout_4feee95d061f40b2906c22268c79a626	0
q-reports-plugin_fanout_c6123bf05ab24ddaa12adca88b920215	0
reply_1cbc785538484554850f69dda902c537	0
reply_748d4640dbab4284bae19fe086af14e8	0
reply_ab42e35c548d48b48c9ba0fc3ac93ec7	0
reply_b37538409ae84436804ccd1b1c0a3bdd	0
reply_c6bebd23c7e24a5c9a06730b42d317cf	0
reply_f34034fd84e347e8b6aeedc49f97282d	0
sahara-ops	0
sahara-ops.2baf790d-3cfe-42b7-b8bf-49611ecc9639	0
sahara-ops_fanout_91b35b7138284165b4f274f5221d6d89	0
scheduler	0
scheduler.overcloud-controller-0.localdomain	0
scheduler_fanout_0888632b036840849e04edc68d4df200	0
...done.


March 02, 2016

Creating an additional host for a Tripleo overcloud

I’ve been successful following the steps to get a Tripleo deployment. I now need to add another server to host the Identity Management and Federation services. Here are the steps:

The easiest way is to start back at the environment setup and tell instack to create an extra node:

export NODE_MEM=8192
export NODE_COUNT=3
instack-virt-setup

The default creates two nodes: one for the controller, one for compute. By increasing this to 3, instack will provide a third virtual machine and register it with Ironic.

I then ran through the steps to deploy Tripleo using Tripleo-common.

Note that I did not run the all-in-one. I ran each of the commands in turn, made sure that it succeeded, and moved on to the next step. Running tripleo.sh with no parameters gives the following output:

Options:
      --repo-setup         -- Perform repository setup.
      --delorean-setup     -- Install local delorean build environment.
      --delorean-build     -- Build a delorean package locally
      --undercloud         -- Install the undercloud.
      --overcloud-images   -- Build and load overcloud images.
      --register-nodes     -- Register and configure nodes.
      --introspect-nodes   -- Introspect nodes.
      --overcloud-deploy   -- Deploy an overcloud.
      --overcloud-update   -- Update a deployed overcloud.
      --overcloud-delete   -- Delete the overcloud.
      --use-containers     -- Use a containerized compute node.
      --enable-check       -- Enable checks on update.
      --overcloud-pingtest -- Run a tenant vm, attach and ping floating IP.
      --all, -a            -- Run all of the above commands.

From this list, I ran these commands in this order:

  1. --repo-setup
  2. --undercloud
  3. --overcloud-images
  4. --register-nodes
  5. --introspect-nodes
  6. --overcloud-deploy

If anything goes wrong (usually at the overcloud deploy stage) I’ve used Steve Hardy’s blog post to troubleshoot.
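For the record, that sequence amounts to roughly the following invocations, run one at a time and checked before moving on (the path to tripleo.sh depends on where your tripleo-common checkout lives):

./tripleo.sh --repo-setup
./tripleo.sh --undercloud
./tripleo.sh --overcloud-images
./tripleo.sh --register-nodes
./tripleo.sh --introspect-nodes
./tripleo.sh --overcloud-deploy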

To then provision an operating system on the virtual machine, we can use the undercloud.

 openstack server create  --flavor baremetal --image overcloud-full --key-name default idm

When that finished:

$ openstack server list
+----------------------------+-------------------------+--------+---------------------+
| ID                         | Name                    | Status | Networks            |
+----------------------------+-------------------------+--------+---------------------+
| 099b0784-6591-4aba-90ad-   | idm                     | ACTIVE | ctlplane=192.0.2.10 |
| d5d93bf78745               |                         |        |                     |
| d4ac0792-e70c-4710-9997-b9 | overcloud-controller-0  | ACTIVE | ctlplane=192.0.2.9  |
| 32a67c500b                 |                         |        |                     |
| 45b74adf-447f-             | overcloud-novacompute-0 | ACTIVE | ctlplane=192.0.2.8  |
| 45c8-b308-c694b6d45862     |                         |        |                     |
+----------------------------+-------------------------+--------+---------------------+

Log in and do work:

 ssh centos@192.0.2.10
The authenticity of host '192.0.2.10 (192.0.2.10)' can't be established.
ECDSA key fingerprint is 28:40:4f:a0:70:94:ef:ed:31:87:d1:37:b7:eb:8b:5d.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '192.0.2.10' (ECDSA) to the list of known hosts.
[centos@idm ~]$ hostname
idm

February 29, 2016

Let's talk about soft skills at RSA, plus some other things
It's been no secret that I think the lack of soft skills in the security space is one of our biggest problems. While I usually only write about the world's problems and how to fix them here, during RSA I'm going to take a somewhat different approach.

I'm giving a talk on Friday titled Why Won't Anyone Listen to Us?

I'm going to talk about how a security person can talk to a normal person without turning them against us. We're a group that doesn't like talking to anyone, even each other. We need to start talking to people. I'm not saying we should stand around and accept abuse, I am saying the world wants help with security. We're not really in a place to give it because we don't like people. But they need our help, most of them know it even!

We've all had countless interactions where we give someone good, real advice, and they just completely ignore us. It's infuriating sometimes. Part of the problem is we're not talking to people, we're throwing verbal information at them, and they ignore it. They listen to Oprah, if she told them about two factor auth everyone would be using it by the end of the week!

That's just it, they listen to Oprah. They're going to listen to anyone who talks to them in a way they understand. If it's not us, it will be someone else, probably someone we don't want talking about security.

I can't teach you to like people (there are limits to my abilities), but I can help teach you how to talk to them. And of course a talk like this will need to have plenty of fun sprinkled in. How to talk to someone, while very important, can also be an extremely boring topic.

I touched on some of this during my Devconf.cz talk.

Red Hat is also putting on a Breakfast on Wednesday morning. I'm going to keynote it (I'll keep it short and sweet for those of you attending, there's nothing worse than a speaker at 8am who talks too much).

A coworker, Richard Morrell, is running a podcast from RSA called Locked Down. Be sure to give it a listen. I may even be able to convince him to let me on his show.

I have no doubt the RSA conference will be a great time. If you're there come find me, Red Hat has a booth, North Expo #N3038. Come say hi, or not if you don't like talking to people ;)

There or not, feel free to start a conversation on Twitter. I'm @joshbressers

February 24, 2016

Keystone on Port 80 For Tripleo

Many services assume that Keystone listens on ports 5000 and 35357. I’d prefer to have Keystone listen on the standard HTTP(s) ports of 80 and 443. We can’t remove the non-standard ports without a good deal of rewriting. But there is nothing preventing us from running Keystone on port 80 or 443 in addition to those ports.

I was trying to get this to work for a Tripleo deployment where I needed to ssh in and port forward through several levels. I didn’t want to have to do this for more ports than absolutely necessary.

I did need to backport one change to make this work with the current Tripleo, but I suspect that, come Milestone 3 of Mitaka, we’ll have it via a rebase of the RDO packages.

In Tripleo, Horizon is run on port 80, and shows up under the /dashboard URI. So, I put Keystone under /keystone (yeah yeah, it should have been /identity. I’ll do that next time.)

UPDATE 1: decreased threads to 1, as oslo-config complains on multiple.
UPDATE 2: changed Location from /keystone/main/ to /keystone/main and /keystone/admin/ to /keystone/admin to match WSGIDaemonProcess

in /etc/httpd/conf.d/11-keystone_wsgi_main.conf

WSGIApplicationGroup %{GLOBAL}
WSGIDaemonProcess keystone_main_11 display-name=keystone-main group=keystone processes=1 threads=1 user=keystone
WSGIProcessGroup keystone_main_11
WSGIScriptAlias /keystone/main "/var/www/cgi-bin/keystone/main"
<Location "/keystone/main">
WSGIProcessGroup keystone_main_11
</Location>

And in /etc/httpd/conf.d/11-keystone_wsgi_admin.conf

WSGIApplicationGroup %{GLOBAL}
WSGIDaemonProcess keystone_admin_11 display-name=keystone-admin group=keystone processes=1 threads=1 user=keystone
WSGIProcessGroup keystone_admin_11
WSGIScriptAlias /keystone/admin "/var/www/cgi-bin/keystone/admin"
<Location "/keystone/admin">
WSGIProcessGroup keystone_admin_11
</Location>

I have an adapted version of the overcloud rc file set up for Keystone V3:

export OS_NO_CACHE=True
export OS_CLOUDNAME=overcloud
#export OS_AUTH_URL=http://192.0.2.6:5000/
export OS_AUTH_URL=http://192.0.2.6/keystone/main/
export NOVA_VERSION=1.1
export COMPUTE_API_VERSION=1.1
export OS_USERNAME=admin
export no_proxy=,192.0.2.18
export OS_PASSWORD=`uuidgen -r`
export PYTHONWARNINGS="ignore:Certificate has no, ignore:A true SSLContext object is not available"
export OS_PROJECT_NAME=admin
export OS_USER_DOMAIN_NAME=Default
export OS_PROJECT_DOMAIN_NAME=Default
export OS_IDENTITY_API_VERSION=3

To Test:

$ . ./overcloudv3.rc 
[heat-admin@overcloud-controller-0 ~]$ openstack token issue
+------------+----------------------------------+
| Field      | Value                            |
+------------+----------------------------------+
| expires    | 2016-02-24T05:40:10.017354Z      |
| id         | 53c5ba8766034ee39a3918cc51082f2c |
| project_id | 42fddae694cb4bd29c0911b64c95440b |
| user_id    | 627727a981f149e2a9ae50422738e659 |
+------------+----------------------------------+
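As one more sanity check (a hypothetical command, but using the same controller address as the rc file above), the Keystone version document should also be reachable directly under the new path:

$ curl -s http://192.0.2.6/keystone/main/v3/ | python -m json.tool

If that returns the v3 version description, the WSGI aliases are wired up correctly.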

February 23, 2016

Change direction, increase speed! (or why glibc changes nothing)
The glibc issue has had me thinking. What will we learn from this?

I'm pretty sure the answer is "nothing", which then made me wonder why this is.

The conclusion I came up with is we are basically the aliens from space invaders. Change direction, increase speed! While this can give the appearance of doing something, we are all very busy all the time. It's not super useful when you really think about it. Look at Shellshock, Heartbleed, GHOST, LOGJAM, Venom, pick an issue with a fancy name. After the flurry of news stories and interviews, did anything change, or did everyone just go back to business as usual? Business as usual pretty much.

Dan Kaminsky explains glibc nicely and has some advice. But let's look at this honestly. Is anything going to change? No. Dan found a serious DNS issue back in the day. Did we fix DNS or did we bandage it up as best as we could? We bandaged it. What Dan found was without a doubt as bad as or worse than this glibc issue, and nothing changed.

I've said this before, I'll say it again. I'm going to say it at RSA next week. We don't know how to fix this. We think we do, you're thinking about it right now, about how we can fix everything! We just have to do that one ... Except you can't. We don't really know what's wrong. Security bugs aren't the problem, they are the result of the problem. We can't fix all the security bugs. I'd be surprised if we've even fixed 10% of security bugs that exist. Even mitigation technologies aren't going to get us there (they are better than constantly fixing bugs, but that's a story for another day).

It's like being obsessed about your tire pressure when there is a hole in the tire. If you only worry about one detail, the tunnel vision makes you miss what's actually going on. Our tires are losing air faster than we can fix them, so we're looking for a bigger pump instead of new tires.

We say things all the time about not using C anymore, or training developers, or writing better documentation. There's nothing wrong with these ideas exactly, but the fact is they've all been tried more than once and none of them work. If we started the largest developer education program ever and made every developer sit through a week of training, I bet it would be optimistic to expect our bug rate to decrease by 5%. Think about that for a while.

We first have to understand our problem. We have lots of solutions, solutions to problems that don't really exist. Solutions without problems tend to turn into new problems. We need to understand our security problem. It's probably more like hundreds or thousands of problems. Every group, every company, every person has different problems. We understand none of them.

We start by listening. We're not going to fix any of this with code. We need to see what's happening, some big picture, some in the weeds. Today we show up, yell at people (if they're lucky), then we leave. We don't know what's really happening. We don't tell anyone what they need to know. We don't even know what they need to know. The people we're not talking to know what the problems are though. They don't know they know, we just have to give them time to explain it to us.

If you're at RSA next week, come talk to me. If not, hit me up on twitter @joshbressers
Thinking about glibc and Heartbleed, how do we fix things
After my last blog post Change direction, increase speed! (or why glibc changes nothing) it really got me thinking about how can we start to fix some of this. The sad conclusion is that nothing can be fixed in the short term. Rather than trying to make up some nonsense about how to fix this, I want to explain what's happening and why this can't be fixed anytime soon.

Let's look at Heartbleed first.

There was a rather foul flaw found in OpenSSL, and after Heartbleed the Linux Foundation collected a lot of money to help work on core infrastructure projects. If we look at the state of things, it basically hasn't changed outside of money moving around. OpenSSL cannot be fixed, for a number of reasons:

  1. Old codebase
  2. Backward compatibility
  3. Difficult API
  4. It is "general purpose"

The reality is that the only way to get what could be considered a safe library would be to throw everything out and start over with some very specific ideas in mind. Things like sun-setting algorithms didn't exist when OpenSSL was created. There is no way you're going to get even a small number of projects to move from using OpenSSL to some new "better" library. It would have to be so much better they couldn't ignore it. As anyone who has ever written software knows, you don't build a library like that overnight. I think 5 years would be a conservative estimate for double digit adoption rates.

While I'm picking on OpenSSL here, the story is the same in virtually every library and application that exists. OpenSSL isn't special, it just gets a lot of attention.

Let's think about glibc.

Glibc is the C library used by most Linux distributions. If the Kernel is the heart, glibc is the soul. Nothing can even exist without this library. Glibc even strives to be POSIX compliant, for good reason. POSIX has given us years of compatibility and cooperation.

Glibc probably has plenty more horrible bugs hiding in the code. It's wicked complex and really large. If you ever need some nightmare fuel, look at the glibc source code. Everything we do in C relies on a libc being around, glibc doesn't have that luxury.

Replacing libc is probably out of the question; it's just not practical. So let's think about something like golang. What if everything was written using golang? It's not totally insane, there are substantial benefits. It's not as fast as C though, and that will be the argument most people use. Golang will probably never beat C; the things that make it safer also make it slower. But now if we think about replacing UNIX utilities with golang, why would we want to do that? Why not throw out all the mistakes UNIX made and do something else?

Now we're back to the legacy and compatibility arguments. Linux has more than twenty years of effort put into it. You can't just replace that. Even if you had the best team in the world I bet 10 years would be wishful thinking for having half the existing features.

So what does this mean? It means we don't know where to start yet. We are trying to solve our problems using what we know and the tools that exist. We have to solve this using new tools and new thinking. The way we fix security is by doing something nobody has ever thought of before. In 1920 the concept of the Internet didn't exist, people couldn't imagine how to even solve some of the problems we can easily solve using it. Don't try to solve our problems with what you know. Solve the problems using new ideas, find out what you don't know, that's where the solution lives.

February 19, 2016

glibc for humans
Unless you've been living under a rock, you've heard about the latest glibc issue.
CVE-2015-7547 - glibc stack-based buffer overflow in getaddrinfo()

It's always hard to understand some of these issues, so I'm going to do my best to explain it using simple language. Making security easy to understand is something I've been talking about for a long time now, it's time to do something about it.

What is it?
The fundamental problem here is that glibc has a bug that could allow a DNS response from an attacker to run the command of that attacker's choosing on your system. The final goal of course would be to become the root user.

The problem is that this glibc function is used by almost everything that talks to the network. In today's hyperconnected world, this means basically everything is vulnerable to this bug because almost everything can connect to the network. As of this writing we have not seen this attack being used on the Internet. Just because there are no known attacks is no reason to relax though, constant vigilance is key for issues like this.

Am I vulnerable?
If you run Linux (most distributions use glibc), and you haven't installed an update from your vendor, yes, you are vulnerable.
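If you are unsure what you are running, a quick check on an RPM-based distribution looks something like the following; compare the reported version against your vendor's advisory for CVE-2015-7547:

$ rpm -q glibc
$ ldd --version | head -n 1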

Are there workarounds?
No, there is no way to stop this issue. You have to install an update to glibc. Even the stack protector technology that is built into gcc and glibc will not stop this bug. While it is a stack overflow bug, the stack protector checks do not run before the exploit would gain control.

What about containers, VMs, or other confinement technology?
It is possible that a container, VM, or other technology such as SELinux could limit the possible damage from this bug. However, it affects so many binaries on the system that it should be expected that an attacker able to gain access to one application could continue to exploit this bug to eventually become root and take over the entire machine.

Do I only need to be worried if I run a webserver or mailserver?
As stated previously, this bug affects virtually everything that talks to the network. Even if you think your webserver or mailserver are safe, everything from bash to your ssh client will use this library. Updating glibc is the only way to ensure you'll be OK.

What if I run my own DNS server?
This point is currently under investigation. It is thought that it may be possible for a bad DNS request to be able to make it through a DNS server to a vulnerable host. Rather than find out, you should update your glibc.

What about ...
No, just update your glibc :)


Do you have other questions? Ask me on twitter and I'll be sure to update this article if I know the answer.
@joshbressers

February 11, 2016

Introduction to Tang and Clevis

In this post I continue the discussion of network-bound decryption and introduce Tang and Clevis, new unlock tools that supersede Deo (which was covered in an earlier post).

Deo is dead. Long live Tang.

Nathaniel McCallum discovered a key encryption protocol based on ElGamal with a desirable security characteristic: no one but the party decrypting the secret can learn the secret. It was reviewed and refined into McCallum-Relyea (MR) exchange. With Deo, the server decrypted (thus learned) the key and sent it back to the client (through an encrypted channel). McCallum-Relyea exchange avoids this. A new protocol based on MR was developed, called Tang.

Another perceived drawback of Deo was its use of X.509 certificates for TLS and for encryption, making it complex to deploy. The Tang protocol is simpler and avoids X.509.

I will avoid going into the details of the cryptography or the protocol in this post, but will include links at the end.

Clevis

Using Tang to bind data to a network is great, but there are many other things we might want to bind our data to, such as passwords, TPM, biometrics, Bluetooth LE beacons, et cetera. It would also be nice to define policies – possibly nested – about how many of what data binders must succeed in order to decrypt or "unlock" a secret. The point here is that unlock policy should be driven by business and/or user needs, not by technology. The technology must enable but not constrain the policy.

Enter Clevis, the pluggable client-side unlock framework. Plugins, which are called pins, implement different kinds of bindings. Clevis comes with a handful of pins including pwd (password) and https (PUT and GET the secret; a kind of escrow). The tang pin is provided by Tang to avoid circular dependencies.

The sss pin provides a way to "nest" pins, and also provides k of n threshold unlocking. "SSS" stands for Shamir’s Secret Sharing, the algorithm that makes this possible.

LUKS volume decryption, which was implemented in Deo, has not yet been implemented in Clevis, but it is a high priority.

By the way, if you were wondering about the terminology, a clevis, clevis pin and tang together form a kind of shackle.


TLS private key decryption

Let’s revisit the TLS private key decryption use case from my earlier Deo post, and update the solution to use Clevis and Tang.

Recall the encryption command, which required the user to input the TLS private key’s passphrase, then encrypted it with Deo, storing it at a location determined by convention:

# (stty -echo; read LINE; echo -n "$LINE") \
  | deo encrypt -a /etc/ipa/ca.pem deo.ipa.local \
  > /etc/httpd/deo.d/f22-4.ipa.local:443

We will continue to use the same file storage convention. Clevis, unlike Deo, does not receive a secret to be encrypted but instead generates one and tells us what it is. Let’s run clevis provision with the Tang pin and see what it gives us:

# clevis provision -P '{"type": "tang", "host": "f23-1.ipa.local"}' \
  -O /etc/httpd/tang.d/f22-4.ipa.local:443

The server advertised the following signing keys:

        0300AF3BF089D8D896DBE7CCE5E2BEC342C5A107
        B4A47300CA5819C34C537098D53CF9392AF06866
        1B581235DCA09D920EE5E31D5EFB44406A441DF5
        E750E646EBB0DC

Do you wish to trust the keys? [yn] y
709DAFCBC8ACF879D1AC386798783C7E

Breaking down the command, the -P argument is a JSON tang pin configuration object, specifying the Tang server’s host name. The argument to -O specifies the output filename.

The program prints the signing key(s) and asks if we want to trust them. Tang is a trust on first use (TOFU) protocol. Out-of-band validation is possible but not yet implemented (there is a ticket for DNSSEC support).

Having trusted the keys, the program performs the Tang encryption, saves the metadata in the specified output file, and finally prints the secret: 709DAFCBC8ACF879D1AC386798783C7E.

We now need to update the passphrase on the TLS private key with the secret that Clevis generated:

# openssl rsa -aes128 < key.pem > newkey.pem && mv newkey.pem key.pem
Enter pass phrase:
writing RSA key
Enter PEM pass phrase:
Verifying - Enter PEM pass phrase:

OpenSSL first asked for the original passphrase to decrypt the private key, then asked (twice) for a new passphrase, which should be the secret Clevis told us.

Now we must change the helper script that unlocks the private key. Recall the definition of the Deo helper:

#!/bin/sh
DEO_FILE="/etc/httpd/deo.d/$1"
[ -f "$DEO_FILE" ] && deo decrypt < "$DEO_FILE" && echo && exit
exec /bin/systemd-ask-password "Enter SSL pass phrase for $1 ($2) : "

The Clevis helper is similar:

#!/bin/sh
CLEVIS_FILE="/etc/httpd/clevis.d/$1"
[ -f "$CLEVIS_FILE" ] && clevis acquire -I "$CLEVIS_FILE" && echo && exit
exec /bin/systemd-ask-password "Enter SSL pass phrase for $1 ($2) : "

The clevis acquire -I "$CLEVIS_FILE" invocation is the only substantive change. Now we can finally systemctl restart httpd and observe that the key is decrypted automatically, without prompting the operator.
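For completeness: mod_ssl hands the helper the server name and key type, which is the standard SSLPassPhraseDialog mechanism. Wiring it up presumably looks something like the following, with the path being a placeholder for wherever you install the script:

# in the httpd TLS configuration (e.g. ssl.conf)
SSLPassPhraseDialog exec:/usr/local/bin/clevis-askpass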

What are the possible downsides to this approach? First, due to limitations in Apache’s passphrase acquisition at present it is possible only to use Clevis pins that do not interact with the user or write to standard output. Second, the secret is no longer controlled by the user doing the provisioning – the TLS private key must be re-encrypted under the new passphrase generated by Clevis, and if the Tang server is unavailable, that is the passphrase that must be entered at the fallback prompt. A lot more work needs to be done to make Clevis a suitable general solution for key decryption in Apache or other network servers, but for this simple case, Clevis and Tang work very well, as long as the Tang server is available.

Conclusion

This has been a very quick and shallow introduction to Clevis and Tang. For a deeper overview and a demonstration of Tang server deployment and more advanced Clevis policies, I recommend watching Nathaniel McCallum’s talk from DevConf.cz 2016.

Other useful links:

February 10, 2016

OpenStack Keystone Q and A with the Boston University Distributed Systems Class Part 1

Dr. Jonathan Appavoo was kind enough to invite me to be a guest lecturer in his distributed systems class at Boston University. The students provided a list of questions, and I only got a chance to address a handful of them during the class. So, I’ll try to address the rest here.

Page 1 of 1 so far. (I’ll update this line as I post the others.)

When do tokens expire? If they don’t expire, isn’t it potentially dangerous since attackers can use old tokens to gain access to privileged information?
Tokens have a set expiry. The default was originally set to be 12 hours. We shortened that to 1 hour a couple years back, but it turns out that some workloads use a token all the way through, and those workloads last longer than 1 hour. Those deployments either have lengthened the life of the token in the configuration or had users explicitly request tokens that last longer than an hour.
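For reference, the setting in question lives in keystone.conf; the value is in seconds, and a deployment that needs longer-lived tokens simply raises it:

[token]
expiration = 3600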

Can users share roles with in the same project?
Yes, and this is the norm. A role assignment is a many-to-many-to-many association between users (or groups of users), projects, and roles (see the example after this list). This means:

  • One user may have multiple roles on the same project
  • One user may have the same role on multiple projects
  • Multiple users may have the same role on a project
  • Any mix of the above
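A small illustration with hypothetical names: the same user can hold two roles on one project and share one of those roles with another user on a second project.

openstack role add --user alice --project widgets admin
openstack role add --user alice --project widgets _member_
openstack role add --user alice --project gadgets _member_
openstack role add --user bob --project gadgets _member_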

It’s interesting that one of the key components is a standardized GUI (Horizon). Wouldn’t it be more useful for there to be a handful of acceptable GUIs tailored to the service a particular set of OpenStack instances is providing?

This is the standardized GUI as you point out. While many companies have deployed OpenStack with custom GUIs, this one is the one that is designed to be the most generally applicable. Each user gets a service catalog along with their tokens, and the Horizon UI can use this to determine what services the user can see, and customize the UI displayed accordingly. So, from that perspective, the UI is tailored to the user.

The individual project teams are not composed of UI or UX folks. You really don’t want us designing UI. As soon as you realize the tough problem is getting a consistent look, feel, and set of rules, and all the things that keep users from running away screaming, you realize that it really is its own effort and project.

That said, I did propose a few years back that the Keystone project should be able to render to HTML, and not just JSON (and XML). This would be the start of a progressive enhancement approach that would also make the Keystone team aware of the gaps in the coverage of the UI: it’s really easy to see them when you click through. But, again, this was for test-ability, completeness, and rapid implementation of a UI for new features, not the rich user experience that is the end product. It would still be the source for a follow-on UX effort.

Since that time, the Horizon team has embraced a single-page-app effort based on a proxy (to avoid CORS issues) to all of the endpoints. The proxy converts the URLs in the JSON requests and responses to the proxy, but otherwise lets things pass unchanged. I would love to see HTML rendering as an option on this proxy.

Can You elaborate on some examples of OpenStack being used by companies on a large scale?

Here is the official list. If you click through the Case studies, some of them have numbers.

A question about the philosophy behind open source. Do the  problems that arise in distributed systems lend themselves well to an open source approach? Or does Brooks’s Law apply?

Brooks’s Law states: “adding manpower to a late software project makes it later.” Open source projects are not immune to Brooks’s Law. OpenStack is not driven by any one company, and it has a “release every 6 months” rule that means that if a feature is not going to make a release, it will have another chance six months later. We have not slipped a release since I’ve been on the project, and I don’t think they did before.

The Keystone team is particularly cautious. New features happen, but they are well debated, beat on, and often deferred a release or more.  Getting code into Keystone is a serious undertaking, with lots of back and forth for even small changes, and some big changes have gone through 70+ revisions before getting merged.  I have a page and a half of links to code that I have submitted and later abandoned.

Adding more people to a project under OpenStack (like Keystone) can’t happen without the approval of the developers. I mean, anyone can submit and review code, but to be accepted as a core requires a vote of confidence from the existing cores, and that vote won’t take place if you’ve not already proved yourself. So the worst that could happen is that one company goes commit happy and gets a bunch of people to try to submit patches, and we would ignore them. It hasn’t happened yet.

Does the public source code make authentication a much more difficult project for OpenStack than it is for a closed-source Identity-as-a-Service?

So, the arguments for why Open Source is good for security are well established, and I won’t repeat them here. To me, there are Open Source projects, and then there is the Open Source development model. The first means that the code is free software, you can contribute, etc. But it means that the project might be run by one person or company. The Open Source development model is more distributed by default. It means that no one person can ram through changes. Even the project technical leaders can’t get away with approving code without at least another core also approving. Getting anything done is difficult. So, from that perspective, yes. But there is a lot of benefit to offset it: we get a wider array of inputs, and we get public discussion of features and bugs. We get contributions from people that are interested in solving their own problems, and, in doing so, solve ours.

Why can we not force, or why has there not been, more standardization for Identity Providers (IdPs) in a federation?

Adoption of federated protocols has been happening, slow but steady. SAML has gone from “oh, that’s neat” to “we have to have that” in my time on this project. SAML is a pretty good standard, and many of the IdPs are implementing it. There is a little wiggle room in the standard, as my coworker who is working on the client protocol (ECP) can tell you, but the more implementations we see, the more we can iron out the details. So, I think, at least for SAML, we do have a good degree of standardization.

The other protocols are also picking up steam, and I think they will play out similarly to SAML. I suspect that OpenID Connect will end up just as well standardized in implementation as SAML is starting to be. The process is really iterative, and you don’t know the issues you are going to have to deal with until you find them in a deployment.

What makes OpenStack better than other Cloud Computing services?

Short answer: I don’t know.

Too Long answer with a lot of conjecture:  I think that the relative strength of OpenStack depends on which of the other services you compare it to.

Amazon’s services are more mature, so there the OpenStack benefits are that it is Open Source and that you can actually implement it on premise, not just have it hosted for you. Control of hardware is still a big deal. I think the Open Source aspect really helped OpenStack compete with vCloud as well.

I think that the open source development model for the cloud mirrors the success of the Linux Kernel. The majority of the activity on the Kernel is device drivers. In OpenStack, there is a real push by vendors to support their devices. In a proprietary solution, a hardware vendor is dependent on working with that proprietary software vendor to get their device supported. In OpenStack, any device manufacturer that wants to be part of Cinder, Nova, or Neutron can get the code and make it work.

This means that even the big vendors get interested. The software becomes a marketplace of sorts, and if you can’t inter-operate, you miss out on potential sales. Thus, we have Cisco interested in Neutron and VMware interested in Nova, where it might have initially appeared against their interests to have that competition.

I think part of its success was due to the choice of the Python programming language. It’s a language that system administrators don’t tend to react to negatively like they do with Java. I pick on Java because it was the language used for Eucalyptus. I personally like working in Java, but I can really see the value in Python for OpenStack. The fact that source code is essentially shipped by default overcomes the Apache license’s potential for closing off code: end users can see what is actually running on their systems. System administrators for Linux systems are likely to already have some familiarity with Python.

I think the micro project approach has also allowed OpenStack to scale.  It lets people interested in identity focus on identity, and block storage people get to focus on block storage.  The result has been an explosion of interest in contributing.

I think OpenStack got lucky with timing:  the world realized it needed a cloud management solution when OpenStack got mature enough to start filing that role.

A Holla out to the Kolla devs

Devstack uses Pip to install packages, which conflict with the RPM versions on my Fedora system. Since I still need to get work done, and want to run tests on Keystone running against a live database, I’ve long wondered if I should go with a container based approach. Last week, I took the plunge and started messing around with Docker. I got the MySQL Fedora container to run, then found Lars’ Keystone container using SQLite, and was stumped. I poked around for a way to get the two containers talking to each other, and realized that we had a project dedicated to exactly that in OpenStack: Kolla. While it did not work for me right out of a git-clone, several of the Kolla devs worked with me to get it up and running. Here are my notes, distilled.

I started by reading the quickstart guide, which got me oriented (I suggest you start there, too), but I found a couple things I needed to learn. First, I needed a patch that has not quite landed, in order to make calls as a local user, instead of as root. I still ended up creating /etc/kolla and chowning it to ayoung. That proved necessary, as the work done in that patch is “necessary but not sufficient.”

I am not super happy about this, but I needed to make docker run without a deliberate sudo. So I added the docker group, added myself to it, and restarted the docker service via systemd. I might end up doing all this as a separate developer user, not as ayoung, so at least I would need to su - developer before the docker stuff. I may be paranoid, but that does not mean they are not out to get me.
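Roughly, that amounted to the following (the docker group may already exist on Fedora, and you need to log out and back in for the membership to take effect):

sudo groupadd docker
sudo usermod -aG docker $USER
sudo systemctl restart docker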

Created a dir named ~/kolla/ and put in there:

~/kolla/globals.yml

kolla_base_distro: "centos"
kolla_install_type: "source"

# This is the interface with an ip address you want to bind mariadb and keystone to
network_interface: "enp0s25"
# Set this to an ip address that currently exists on interface "network_interface"
kolla_internal_address: "10.0.0.13"

# Easy way to change debug to True, though not required
openstack_logging_debug: "True"

# For your information, but these default to "yes" and can technically be removed
enable_keystone: "yes"
enable_mariadb: "yes"

# Builtins that are normally yes, but we set to no
enable_glance: "no"
enable_haproxy: "no"
enable_heat: "no"
enable_memcached: "no"
enable_neutron: "no"
enable_nova: "no"
enable_rabbitmq: "no"
enable_horizon: "no"

I also copied the file ./etc/kolla/passwords.yml from the repo into that directory, as it was needed during the deploy.

To build the images, I wanted to work inside the kolla venv (I didn’t want to install pip packages on my system), so I ran:

tox -epy27

Which, along with running the unit tests, created a venv. I activated that venv for the build command:

. .tox/py27/bin/activate
./tools/build.py --type source keystone mariadb rsyslog kolla-toolbox

Note that I had first built the binary versions using:

./tools/build.py keystone mariadb rsyslog kolla-toolbox

But then I tried to deploy the source version. The source versions are downloaded from tarballs on http://tarballs.openstack.org/ whereas the binary versions are the Delorean RPMS, and they trail the source versions by a little bit (not a lot).

I’ve been told “if you tox gen the config you will get a kolla-build.conf config. You can change that to git instead of url and point it to a repo.” But I have not tried that yet.

I had to downgrade to the pre 2.0 version of Ansible, as I was playing around with 2.0’s support for the Keystone V3 API. Kolla needs 1.9:

dnf downgrade ansible

There is an SELinux issue. I worked around it for now by setting SELinux into permissive mode, but we’ll revisit that shortly. It was only needed for deploy; once the containers were running, I was able to switch back to enforcing mode. We will deal with it here.

./tools/kolla-ansible --configdir /home/ayoung/kolla   deploy
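After the deploy finishes, a quick way to confirm the containers actually came up is simply to list them; with the globals.yml above you should only see the handful of services that were enabled:

$ docker ps --format '{{.Names}}: {{.Status}}'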

Once that ran, I wanted to test Keystone. I needed a keystone RC file. To get it:

./tools/kolla-ansible post-deploy

It put it in /etc/kolla/.

. /etc/kolla/admin-openrc.sh 
[ayoung@ayoung541 kolla]$ openstack token issue
+------------+----------------------------------+
| Field      | Value                            |
+------------+----------------------------------+
| expires    | 2016-02-08T05:51:39.447112Z      |
| id         | 4a4610849e7d45fdbd710613ff0b3138 |
| project_id | fdd0b0dcf45e46398b3f9b22d2ec1ab7 |
| user_id    | 47ba89e103564db399ffe83d8351d5b8 |
+------------+----------------------------------+

Success

I have to admit that I removed the warning:

usr/lib/python2.7/site-packages/keyring/backends/Gnome.py:6: PyGIWarning: GnomeKeyring was imported without specifying a version first. Use gi.require_version('GnomeKeyring', '1.0') before import to ensure that the right version gets loaded.
  from gi.repository import GnomeKeyring

Huge thanks to SamYaple and inc0 (Michal Jastrzebski) for their help in getting me over the learning hump.

I think Kolla is fantastic. It will be central to my development for Keystone moving forward.

February 08, 2016

Devconf.cz
I spent last week at Devconf in the Czech Republic. I didn't have time to write anything new and compelling, but I did give a talk about why everything seems to be on fire.

https://www.youtube.com/watch?v=zmDm7J7V7aw

I explore what's going on right now, why things look like they're on fire, and how we can start to fix this. Our problem isn't technology, it's the people. We're good at technology problems, we're bad at people problems.

Give the talk a listen. Let me know what you think, I hope to peddle this message as far and wide as possible.

Join the conversation, hit me up on twitter, I'm @joshbressers
Dealing with Duplicate SSL certs from FreeIPA

I reinstalled https://ipa.younglogic.net. My browser started complaining when I tried to visit it: the serial number of the TLS certificate is a duplicate. If I am seeing this, anyone else that looked at the site in the past is going to see it too, so I don’t want to just hack my browser setup to ignore it. Here’s how I fixed it:

FreeIPA uses Certmonger to request and monitor certificates. The Certmonger daemon runs on the server that owns the certificate, performs the tricky request format generation, then waits for an answer. So, in order to update the IPA server, I am going to tell Certmonger to request a renewal of the HTTPS TLS certificate.

The tool to talk to certmonger is called getcert. First, find the certificate. We know it is going to be stored in the Apache HTTPD config directory:

sudo getcert list
Number of certificates and requests being tracked: 8.
Request ID '20160201142947':
	status: MONITORING
	stuck: no
	key pair storage: type=NSSDB,location='/etc/pki/pki-tomcat/alias',nickname='auditSigningCert cert-pki-ca',token='NSS Certificate DB',pin set
	certificate: type=NSSDB,location='/etc/pki/pki-tomcat/alias',nickname='auditSigningCert cert-pki-ca',token='NSS Certificate DB'
	CA: dogtag-ipa-ca-renew-agent
	issuer: CN=Certificate Authority,O=YOUNGLOGIC.NET
	subject: CN=CA Audit,O=YOUNGLOGIC.NET
	expires: 2018-01-21 14:29:08 UTC
	key usage: digitalSignature,nonRepudiation
	pre-save command: /usr/lib64/ipa/certmonger/stop_pkicad
	post-save command: /usr/lib64/ipa/certmonger/renew_ca_cert "auditSigningCert cert-pki-ca"
	track: yes
	auto-renew: yes
...
Request ID '20160201143116':
	status: MONITORING
	stuck: no
	key pair storage: type=NSSDB,location='/etc/httpd/alias',nickname='Server-Cert',token='NSS Certificate DB',pinfile='/etc/httpd/alias/pwdfile.txt'
	certificate: type=NSSDB,location='/etc/httpd/alias',nickname='Server-Cert',token='NSS Certificate DB'
	CA: IPA
	issuer: CN=Certificate Authority,O=YOUNGLOGIC.NET
	subject: CN=ipa.younglogic.net,O=YOUNGLOGIC.NET
	expires: 2018-02-01 14:31:15 UTC
	key usage: digitalSignature,nonRepudiation,keyEncipherment,dataEncipherment
	eku: id-kp-serverAuth,id-kp-clientAuth
	pre-save command: 
	post-save command: /usr/lib64/ipa/certmonger/restart_httpd
	track: yes
	auto-renew: yes

There are many in there, but the one we care about is the last one, with the Request ID of 20160201143116. It is in the NSS database stored in /etc/httpd/alias. To request a new certificate, use the command:

sudo ipa-getcert resubmit -i 20160201143116

While this is an ipa-specific command, it is essentially telling certmonger to renew the certificate. After we run it, I can look at the list of certificates again and see that the “expires” value has been updated:

Request ID '20160201143116':
	status: MONITORING
	stuck: no
	key pair storage: type=NSSDB,location='/etc/httpd/alias',nickname='Server-Cert',token='NSS Certificate DB',pinfile='/etc/httpd/alias/pwdfile.txt'
	certificate: type=NSSDB,location='/etc/httpd/alias',nickname='Server-Cert',token='NSS Certificate DB'
	CA: IPA
	issuer: CN=Certificate Authority,O=YOUNGLOGIC.NET
	subject: CN=ipa.younglogic.net,O=YOUNGLOGIC.NET
	expires: 2018-02-07 02:29:42 UTC
	principal name: HTTP/ipa.younglogic.net@YOUNGLOGIC.NET
	key usage: digitalSignature,nonRepudiation,keyEncipherment,dataEncipherment
	eku: id-kp-serverAuth,id-kp-clientAuth
	pre-save command: 
	post-save command: /usr/lib64/ipa/certmonger/restart_httpd

Now when I refresh my browser window, Firefox no longer complains about the repeated serial number. Now it complains that “the site administrator has incorrectly configured the Security for this site” because I am using a CA cert that it does not know about. But now I can move on and re-install the CA cert.
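For anyone following along: the CA certificate can be fetched from the FreeIPA server's standard location and then imported into the browser (the URL below assumes the default layout):

$ curl -o ipa-ca.crt http://ipa.younglogic.net/ipa/config/ca.crt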

February 06, 2016

Keystone Implied roles with CURL

Keystone now has Implied Roles. What does this mean? Let’s say we define the role Admin to imply the Member role. Now, if you assign someone Admin on a project, they are automatically assigned the Member role on that project implicitly.

Let’s test it out:

Since we don’t yet have client or CLI support, we’ll have to make do with curl and jq for now.

This uses the same approach as Keystone V3 Examples.

#!/bin/sh 
. ~/adminrc

export TOKEN=`curl -si -d @token-request.json -H "Content-type: application/json" $OS_AUTH_URL/auth/tokens | awk '/X-Subject-Token/ {print $2}'`

export ADMIN_ID=`curl -H"X-Auth-Token:$TOKEN" -H "Content-type: application/json" $OS_AUTH_URL/roles?name=admin | jq --raw-output '.roles[] | {id}[]'`

export MEMBER_ID=`curl -H"X-Auth-Token:$TOKEN" -H "Content-type: application/json" $OS_AUTH_URL/roles?name=_member_ | jq --raw-output '.roles[] | {id}[]'`

curl -X PUT -H"X-Auth-Token:$TOKEN" -H "Content-type: application/json" $OS_AUTH_URL/roles/$ADMIN_ID/implies/$MEMBER_ID

curl  -H"X-Auth-Token:$TOKEN" -H "Content-type: application/json" $OS_AUTH_URL/role_inferences 

Now, create a new user and assign them only the admin role.

openstack user create Phred
openstack user show Phred
+-----------+----------------------------------+
| Field     | Value                            |
+-----------+----------------------------------+
| domain_id | default                          |
| enabled   | True                             |
| id        | 117c6f0055a446b19f869313e4cbfb5f |
| name      | Phred                            |
+-----------+----------------------------------+
$ openstack  user set --password-prompt Phred
User Password:
Repeat User Password:
$ openstack project list
+----------------------------------+-------+
| ID                               | Name  |
+----------------------------------+-------+
| fdd0b0dcf45e46398b3f9b22d2ec1ab7 | admin |
+----------------------------------+-------+
openstack role add --user 117c6f0055a446b19f869313e4cbfb5f --project fdd0b0dcf45e46398b3f9b22d2ec1ab7 e3b08f3ac45a49b4af77dcabcd640a66

Copy token-request.json and modify the values for the new user.
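
Modified for the new user, the request body might look roughly like the following sketch (the domain, project name, and password here are placeholders):

cat > token-request-phred.json <<'EOF'
{
    "auth": {
        "identity": {
            "methods": ["password"],
            "password": {
                "user": {
                    "domain": {"name": "Default"},
                    "name": "Phred",
                    "password": "changeme"
                }
            }
        },
        "scope": {
            "project": {
                "domain": {"name": "Default"},
                "name": "admin"
            }
        }
    }
}
EOF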

 curl  -d @token-request-phred.json -H "Content-type: application/json" $OS_AUTH_URL/auth/tokens | jq '.token | {roles}'
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1643  100  1098  100   545  14742   7317 --:--:-- --:--:-- --:--:-- 14837
{
  "roles": [
    {
      "id": "9fe2ff9ee4384b1894a90878d3e92bab",
      "name": "_member_"
    },
    {
      "id": "e3b08f3ac45a49b4af77dcabcd640a66",
      "name": "admin"
    }
  ]
}
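
The inference rule can later be removed with a DELETE against the same URL used to create it (a sketch, reusing the variables from the script above):

curl -X DELETE -H"X-Auth-Token:$TOKEN" $OS_AUTH_URL/roles/$ADMIN_ID/implies/$MEMBER_ID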

January 31, 2016

Does the market care about security?
I had some discussions this week about security and the market. When I say the market, I mean what sort of products people will or won't buy based on requirements centered around security. This usually ends up as a discussion about regulation. That got me wondering: are there any industries that are unregulated, have high safety requirements, and aren't completely unsafe?

After a little research, it seems SCUBA is the industry I was looking for. If you read the linked article (which you should, it's great) the SCUBA story is an important lesson for the security industry. Our industry moves fast, too fast to regulate. Regulation would either hurt innovation or be useless due to too much change. Either way it would be very expensive. SCUBA is a place where the lack of regulation has allowed for dramatic innovation over the past 50 years. The article compares the personal aircraft industry which has substantial regulation and very little innovation (but the experimental aircraft industry is innovating due to lax regulation).

I don't think all regulation is bad, it certainly has its place, but in a fast moving industry it can bring innovation to a halt. And in the context of security, what could you even regulate that would actually matter? Given the knowledge gaps we have today any regulation would just end up being a box ticking exercise.

Market forces are what have kept SCUBA safe, divers and dive shops won't use or stock bad gear. Security today has no such bar, there are lots of products that would fall under the "unsafe" category that are stocked and sold by many. Can this market driven approach work for our security industry?

It's of course not that simple for security. Security isn't exactly an industry in itself. There are security products, then there are other products. If you're writing a web app security probably takes a back seat to features. Buyers don't usually ask about security, they ask about features. People buying SCUBA gear don't ask about safety, they just assume it's OK. When you run computer software today you either know it's insecure, or you're oblivious to what's going on. There's not really a happy middle.

Even if we had an industry body everyone joined, it wouldn't make a huge difference today. There is no software that exists without security problems. It's a wide spectrum of course, there are examples that are terrible and examples that do everything right. Today both groups are rewarded equally because security isn't taken into account in many instances. Even if you do everything right, you will still have security flaws in your software.

Getting the market to drive security is going to be tricky, security isn't a product, it's part of everything. I don't think it's impossible, just really hard. SCUBA has the advantage of a known and expected use case. Imagine if that gear was expected to work underwater, in space, in a fire, in the arctic, and you have to be able to eat pizza while wearing it? Nobody would even try to build something like that. The flexibility of software is also its curse.

In the early days of SCUBA there were a lot of accidents; by moving faster than the regulators could, they not only made the sport extremely safe, but probably saved what we know as SCUBA today. If it was heavily regulated I suspect much of the technology wouldn't look all that different from what was used 30+ years ago. Software regulation would probably keep things looking a lot like they do today, just with a lot of voodoo to tick boxes.

Our great challenge is how do we apply this lesson from SCUBA to security? Is there a way we can start creating real positive change that can be market driven innovation and avoid the regulation quagmire?

Join the conversation, hit me up on twitter, I'm @joshbressers

January 28, 2016

Remote group merging for Fedora

The Problem

One of the major features of the Fedora Server Edition is the Cockpit administrative console. This web-based interface provides administrators with a powerful set of tools for controlling their system. Cockpit relies upon low-level tools like polkit and sudo to make authorization decisions to determine what a user is permitted to do. By default, most operations on a Fedora system are granted to users in the ‘wheel’ group. People granted administrator access to Cockpit (and other tools through shell access) are generally added to the wheel group in the /etc/group file.

This works reasonably well for single-user systems or very small environments where manual edits to /etc/group are maintainable, but in larger deployments, it becomes very unwieldy to manage lots of entries in /etc/group. In these cases, most environments switch over to using some form of a domain controller (such as FreeIPA, Microsoft Active Directory or a custom LDAP setup). These domain controllers allow users to be managed centrally, allowing administrators to make changes in a single place and have this be automatically picked up by all enrolled systems.

However, there is a problem: historically, the group processing on Fedora (provided by glibc) has forced users to choose between using centrally managed groups (such as those provided by a domain and maintained by SSSD) or groups maintained on the local system in the /etc/group file. The behavior is specified in /etc/nsswitch.conf, which decides which of the two mechanisms will “win” in the event of a conflict. This means that administrators need to decide up front whether their groups should all come from the domain controller or all be maintained locally.

The Solution

Over the last few months, I worked on adding a new feature to the glibc name-service functionality to enable “group merging”. The net effect is that now for all lookups of a group, glibc can be configured to check both the local files and the remote service and (if the group appears in both), combine the list of member users for both representations of the group into a single response.
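
Concretely, enabling this for the group database means adding a merge action to the group line in /etc/nsswitch.conf. A minimal sketch, assuming SSSD is the remote source (check the glibc documentation for the release you are running for the exact action name):

group: files [SUCCESS=merge] sss

With that in place, a lookup such as getent group wheel returns the union of the members listed in /etc/group and those defined on the domain controller.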

Thus, it becomes possible to place both local and centrally managed administrators in the wheel group. This can come in handy, for example, if an administrator wants to keep one or more local accounts available for disaster recovery in the event that the machine loses access to the remote users (such as a bad update resulting in SSSD not starting).

Of course, this functionality does not come without a cost: because all merging lookups will try both data sources, it can result in a performance hit when operating against groups that otherwise would have been answered only by the local /etc/group file. With caching services like SSSD, this impact should be minimized.

Fedora and glibc upstream

The group merging patch has been submitted to the upstream glibc project but has not yet been merged into a release. It narrowly missed the 2.23 merge window, so it is currently slated for inclusion into glibc 2.24.

However, Carlos O’Donell has taken the patch and applied it to glibc in Fedora Rawhide (which will become Fedora 24), so it will be possible to take advantage of these features first in Fedora 24, before anyone else. (For anyone interested from other distributions, the patch should apply cleanly on 2.23 and likely with minimal effort atop 2.22 as well, since little changed besides this.)


January 25, 2016

Security and Tribal Knowledge
I've noted a few times in the past that the whole security industry is run by magicians. I don't mean this in a bad way, it's just how things work. Long term this will have to change, but it's not going to be an easy path.

When I say everything is run by magicians I speak of extremely smart people who are so smart they don't need or have process (they probably don't want it either so there's no incentive). They can do whatever needs to be done whenever it needs doing. The folks in the center are incredibly smart but they learned their skills on their own and don't know how to pass on knowledge. We have no way to pass knowledge on to others, many don't even know this is a problem. Magicians can be awesome if you have one, until they quit. New industries are created by magicians but no industry succeeds with magicians. There are a finite number of these people and an infinite number of problems.

This got me thinking a bit, and it reminded me of the Internet back in the early 90's.

If you were involved in the Internet back in the 90's, it was all magic back then. The number of people who knew how things worked was incredibly small. There were RFCs and books and product documents, but at the end of the day, it was all magic. If your magician quit, you were screwed until you could find and hire a new magician. The bar was incredibly high.

Sounds a lot like security today.

Back then if you had a web page, it was a huge deal. If you could write CGI scripts, you were amazing, and if you had root on a server you were probably a magician. A lot of sysadmins knew C (you had to), a lot of software was built from source. Keeping anything running was a lot of work, infrastructure barely held together and you had to be an expert at literally everything.

Today getting a web site, running server-side scripts, or having root isn't impressive. You can get much of this for free. How did we get here? The bar used to be really high. The bar is pretty low now, but also a lot more people understand how much of this works. They're not experts, but they know enough to get things done.

How does this apply to security?

Firstly we need to lower the bar. It's not that anyone really plans to do this, it just sort of happens via better tooling. I think the Linux distribution communities helped a lot making this happen back in the day. The tools got a lot better. If you configured a server in 1995 it was horrible, everything was done by hand. Now 80% of the work just sort of happens, you don't need super deep knowledge. Almost all security work done these days is manual. I see things like AFL and LLVM as the start but we have a long way to go. As of right now we don't know which tools are actually useful. There are literally thousands of security products on the market. Only the useful ones will really make a difference in the long term.

The second thing we need to do is transfer relevant knowledge. What that knowledge is will take time to figure out. Does everyone need to know how a buffer overflow exploit works? Probably not, but the tools will really determine who needs to know what. Today you need to know everything. In the future you'll need to know how to use the tools, interpret the output, and fill in some of the gaps. Think of it as the tools having 80% of the knowledge, you just need to bring the missing 20%. Only the tool writers need to know that missing knowledge. Today people have 100% or 0% of knowledge, this is a rough learning curve.

If you look at the Internet today, there is a combination of tons of howtos and much better software to set up and run your infrastructure. There are plenty of companies that can help you build the solution you need. It's not nearly as important to know how to configure your router anymore; there are better tools that do a lot of this for you. This is where security needs to go. We need tools and documents that are useful and helpful. Unfortunately we don't yet really know how to make useful tools, or how to transfer knowledge. We have a long way to go before we can even start that conversation.

The next step security needs to make is to create and pass on tribal knowledge. It's still a bad place to be in, but it's better than magicians. We'll talk about tribal knowledge in the future.

Join the conversation, hit me up on twitter, I'm @joshbressers

January 21, 2016

Resize disks in a Centos 7 Install

The default layout for disks in a CentOS deployment may make sense for the average use case, but not for using the machine as a TripleO all-in-one development box. I have 500 GB of disk space, and the default installer puts 400 GB into /home and 50 GB into /. However, since most of the work here is going to be done in virtual machines, the majority of the /home space is wasted, and I found I filled up the 50 GB partition on / on a regular basis. So, I want to remove /home and put all the space under /.

Here is my start state.

# df -h
Filesystem               Size  Used Avail Use% Mounted on
/dev/mapper/centos-root   50G   18G   33G  35% /
devtmpfs                  16G     0   16G   0% /dev
tmpfs                     16G     0   16G   0% /dev/shm
tmpfs                     16G   33M   16G   1% /run
tmpfs                     16G     0   16G   0% /sys/fs/cgroup
/dev/mapper/centos-home  411G  1.9G  409G   1% /home
/dev/sda1                497M  167M  331M  34% /boot
tmpfs                    3.2G     0  3.2G   0% /run/user/0

Thus far, only 1.9 GB is used under /home, and 33 of the 50 GB under / is still free, so I have enough space to work with. I start by backing up the /home subdirectories to space on the partition that holds /.

mkdir /home-alt
df -h
mv /home/stack/ /home-alt/
umount /home

Edit the filesystem table (/etc/fstab) so that /home is no longer mounted in the future.

#
# /etc/fstab
# Created by anaconda on Wed Jan 20 14:27:36 2016
#
# Accessible filesystems, by reference, are maintained under '/dev/disk'
# See man pages fstab(5), findfs(8), mount(8) and/or blkid(8) for more info
#
/dev/mapper/centos-root /                       xfs     defaults        0 0
UUID=3347d9ba-bb62-44cf-8dfc-1b961279f428 /boot                   xfs     defaults        0 0
#/dev/mapper/centos-home /home                   xfs     defaults        0 0
/dev/mapper/centos-swap swap                    swap    defaults        0 0

From the above, we can see that the partitions for / and /home are /dev/mapper/centos-root and /dev/mapper/centos-home.

Using the pvs command, I can see one physical volume:

  PV         VG     Fmt  Attr PSize   PFree 
  /dev/sda2  centos lvm2 a--  476.45g 64.00m

Using vgs, I can see a single volume group:

  VG     #PV #LV #SN Attr   VSize   VFree 
  centos   1   3   0 wz--n- 476.45g 64.00m

And finally, using lvs I see the three logical volumes that appeared in my fstab:

  LV   VG     Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  home centos -wi-a----- 410.70g                                                    
  root centos -wi-ao----  50.00g                                                    
  swap centos -wi-ao----  15.69g 

Remove the centos-home volume:

lvremove /dev/mapper/centos-home
Do you really want to remove active logical volume home? [y/n]: y
  Logical volume "home" successfully removed

Extend the centos-root volume by 410GB. I can resize the underlying file system at the same time by passing -r.

lvextend -r /dev/mapper/centos-root /dev/sda2
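
Depending on the LVM version, lvextend may insist on an explicit size being given. An equivalent form (a sketch) that claims all of the volume group's remaining free space and resizes the filesystem in one step would be:

lvextend -r -l +100%FREE /dev/mapper/centos-root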

Check if it worked:

# df -h
Filesystem               Size  Used Avail Use% Mounted on
/dev/mapper/centos-root  461G   20G  442G   5% /
devtmpfs                  16G     0   16G   0% /dev
tmpfs                     16G     0   16G   0% /dev/shm
tmpfs                     16G   33M   16G   1% /run
tmpfs                     16G     0   16G   0% /sys/fs/cgroup
/dev/sda1                497M  167M  331M  34% /boot
tmpfs                    3.2G     0  3.2G   0% /run/user/0

I’ll admit, that was easier than I expected.

Return the /home subdirectories to their correct positions in the directory tree.

# mv /home-alt/ayoung/ /home/
# mv /home-alt/stack/ /home/
# rmdir /home-alt/

For references I used:

  1. How to Extend/Reduce LVM’s (Logical Volume Management) in Linux – Part II
  2. Resize your disks on the fly with LVM
  3. and the man pages for the commands listed.

January 18, 2016

OpenSSH, security, and everyone else
If you pay attention at all, this week you heard about a security flaw in OpenSSH.

Of course nothing is going to change because of this. We didn't make any real changes after Heartbleed or Shellshock, this isn't nearly as bad, it's business as usual.

Trying to force change isn't the important part though. The important thing to think about is the context this bug exists in. The folks who work on OpenSSH are some of the brightest security minds in the world. We're talking well above average here, not just bright. If they can't avoid security mistakes, is there any hope for the normal people?

The answer is no.

What do we do now?

For the moment we will continue to operate just like we have been. Things aren't great, but they're not terrible. Part of our problem is things aren't broken enough yet, we're managing to squeak by in most situations.

The next step will be developing some sort of tribal knowledge model. It will develop in a mostly organic way. Long term security will be a teachable and repeatable thing, but we can't just jump to that point, we have to grow into it.

If you look at most of the security conference content today it sort of falls into two camps.

  1. Look at my awesome research
  2. Everything is broken and we can't fix it

Both of these content sets are taught by magicians. They're not really teaching knowledge, they're showing off. How do we teach? Teaching is really hard to do, it's not easy to figure out.

Many people believe security can't be learned, it's just sort of something you have. This is nonsense. There are many possible levels of skill, there is a point where you have to be especially gifted to move on, but there is also a useful place a large number of people can reach.

Perhaps the best place to start is to think about the question "I want to learn security, where do I start?"

I've been asked that many times. I've never had a good answer.

If we want to move our industry forward that's what we have to figure out. If someone came to you asking how to learn security, we have to have an answer. Remember no idea is too crazy, if you have thoughts, let's start talking about it.

Join the conversation, hit me up on twitter, I'm @joshbressers

January 10, 2016

What the lottery and security have in common

If you live in the US you can't escape the news about the Powerball lottery. The jackpot has grown to $1.3 Billion (with a capital B). Everyone is buying tickets and talking about what they'll do when they win enough money to ruin their life.

This made me realize the unfortunate truth about security we like to ignore. Humans are bad at reality. Here is how most of my conversations go.

"You won't win. The odds are zero percent"
"I might! You don't know!"
GOTO 10

I'm of course labeled as being some sort of party pooper because I'm not creating stories about how I will burn through hundreds of millions of dollars in a few short weeks.

What does this have to do with security? It's because people are bad at reality. Let's find out why.

Firstly, remember that as a species, evolution has built us to survive on the African Savannah. We are good at looking for horrible beasts in the grass, and at being able to quickly notice other humans (even if they appear in toast). We are bad at things like math and science because math rarely hides in the grass and eats people. The vast majority of people live their lives unaware of this as a problem. What we call "intuition" is simply "don't get eaten by things with big teeth".

Keeping this in mind, let's use the context of the lottery. The odds are basically zero percent once you take the margin of error into account. We don't care though, we want to believe that there's a chance to win. Our brain says "anything is possible" then marketing helps back that up. Almost nobody knows how bad their odds really are and since you see a winner on TV every now and then, you know it's possible, you could be next! The lottery ticket is our magic gateway to fixing all our problems.

Now switch to security. People are bad at understanding the problems. They don't grasp any of the math involved with risk, they want to do something or buy something that is the equivalent of a lottery ticket. They want a magic ticket that will solve all their problems. There are people selling these tickets. The tickets of course don't work.

How we fix this is the question. Modern medicine is a nice example. Long ago it was all magic (literally). Then, by creating the scientific method and properly training doctors, things got better. People stopped listening to the magicians (well, most people) and now they listen to doctors who use science to make things better. There is still plenty of quack medicine though; we want to believe in the magic cures. In general, though, most of humanity goes to doctors when they're sick.

Today all security is magic. We need to find a way to create security science so methods and ideas can be taught.

Between thinking about how to best blow my lottery winnings, I'll probably find some time to think about what security science looks like. Once I win though you'll all be on your own. You've been warned!

Join the conversation, hit me up on twitter, I'm @joshbressers

January 07, 2016

Deploying Keycloak via Ansible

Keystone needs to work with multiple federation sources. Keycloak is a JBoss-based project that provides, among other things, the SAML and OpenID Connect protocols. As part of my work in getting the two integrated, I needed to deploy Keycloak. The rest of my development setup is done via Ansible and I wanted to handle Keycloak the same way.

Unlike Ipsilon, Keycloak is not deployed via RPMs and Yum. Instead, the most common deployment method is to download and expand the tarball. This provides a great deal of flexibility to the deployer. While I am not going for a full live-deployment approach here, I did want to use best practices. Here were the decisions I made:

  • Use the system-deployed Java runtime
  • Run as a non-root dedicated user named keycloak
  • Manage the process via systemd
  • Put the majority of the files under /var/lib/keycloak
  • Have all code and configuration owned by root and not be editable by the keycloak user
  • Use firewalld to open only the ports necessary (8080 and 9990) to communicate with the Keycloak server itself

Here is the roles/keycloak/tasks/main.yml file that has the majority of the logic:


---
- name: install keycloak prerequisites
  tags:
    - keycloak
  yum: name={{ item }} state=present
  with_items:
    - java-1.7.0-openjdk.x86_64
    - firewalld

- name: create keycloak user
  tags:
  - keycloak
  user: name=keycloak

- name: keycloak target directory
  tags:
  - keycloak
  file: dest={{ keycloak_dir }}
        mode=755
        owner=root
        group=root
        state=directory


- name: get Keycloak distribution tarball
  tags:
    - keycloak
  get_url: url={{ keycloak_url }}
           dest={{ keycloak_dir }}

- name: unpack keycloak
  tags:
    - keycloak
  unarchive: src={{ keycloak_dir }}/{{keycloak_archive}}
             dest={{ keycloak_dir }}
             copy=no

- name: keycloak log directory
  tags:
  - keycloak
  file: dest={{ keycloak_log_dir }}
        mode=755
        owner=keycloak
        group=keycloak
        state=directory

- name: keycloak data directory
  tags:
  - keycloak
  file: dest={{ keycloak_jboss_home }}/standalone/data
        mode=755
        owner=keycloak
        group=keycloak
        state=directory


- name: keycloak tmp directory
  tags:
  - keycloak
  file: dest={{ keycloak_jboss_home }}/standalone/tmp
        mode=755
        owner=keycloak
        group=keycloak
        state=directory

- name: make keycloak configuration directory readable
  tags:
  - keycloak
  file: dest={{ keycloak_jboss_home }}/standalone/configuration
        mode=755
        owner=keycloak
        group=keycloak
        state=directory
        recurse=yes

- name: keycloak systemd setup
  tags:
    - keycloak
  template: src=keycloak.service.j2
            dest=/etc/systemd/system/keycloak.service
            owner=root group=root mode=0644
  notify:
    - reload systemd

- name: enable firewalld
  tags:
    - keycloak
  service: enabled=yes
           state=started
           name=firewalld

- name: Open Firewall for services
  tags:
    - keycloak
  firewalld: port={{ item }}
             permanent=true
             state=enabled
             immediate=yes
  with_items:
    - 8080/tcp
    - 9990/tcp

- name: keycloak systemd service enable and start
  tags:
    - keycloak
  service: name=keycloak
           enabled=yes
           state=started

It makes use of some variables that I expect to have to tweak as package versions increase. Here is the roles/keycloak/vars/main.yml file:

---
keycloak_version: 1.6.1.Final
keycloak_dir: /var/lib/keycloak
keycloak_archive: keycloak-{{ keycloak_version }}.tar.gz
keycloak_url: http://downloads.jboss.org/keycloak/{{ keycloak_version }}/{{keycloak_archive }}
keycloak_jboss_home: "{{ keycloak_dir }}/keycloak-{{ keycloak_version }}"
keycloak_log_dir: "{{ keycloak_jboss_home }}/standalone/log"

For systemd I started with the configuration suggested by Jens Krämer, which I tailored to reference Keycloak explicitly and also to listen on 0.0.0.0. Here is the template file roles/keycloak/templates/keycloak.service.j2:

[Unit]
Description=Jboss Application Server
After=network.target

[Service]
Type=idle
Environment=JBOSS_HOME={{ keycloak_jboss_home }} JBOSS_LOG_DIR={{ keycloak_log_dir }} "JAVA_OPTS=-Xms1024m -Xmx20480m -XX:MaxPermSize=768m"
User=keycloak
Group=keycloak
ExecStart={{ keycloak_jboss_home }}/bin/standalone.sh -b 0.0.0.0
TimeoutStartSec=600
TimeoutStopSec=600

[Install]
WantedBy=multi-user.target

The top level playbook for this is somewhat muddied by having other roles, not relevant for this post. It looks like this:

- hosts: keycloak
  remote_user: "{{ cloud_user }}"
  tags: all
  tasks: []

- hosts: keycloak
  sudo: yes
  remote_user: "{{ cloud_user }}"
  tags:
    - ipa
  roles:
    - common
    - ipaclient
    - keycloak
  vars:
    hostname: "{{ ansible_fqdn }}"
    ipa_admin_password: "{{ ipa_admin_user_password }}"

And I call it using:

 ansible-playbook -i ~/.ossipee/deployments/ayoung.os1/inventory.ini keycloak.yml
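
Once the playbook finishes, a quick sanity check (a sketch; substitute your own cloud user and Keycloak hostname) is to confirm the service is running and answering on port 8080:

ssh centos@keycloak.example.com 'sudo systemctl status keycloak --no-pager'
curl -I http://keycloak.example.com:8080/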

I’m tempted to split this code off into its own repository; right now I have it as part of Rippowam.

January 05, 2016

Boolean: virt_use_execmem What? Why? Why not Default?
In a recent bugzilla, the reporter was asking about the virt_use_execmem boolean.

  • What is it?

  • What did it allow?

  • Why was it not on by default?


What is it?

Well, let's first look at the AVC:

type=AVC msg=audit(1448268142.167:696): avc:  denied  { execmem } for  pid=5673 comm="qemu-system-x86" scontext=system_u:system_r:svirt_t:s0:c679,c730 tcontext=system_u:system_r:svirt_t:s0:c679,c730 tclass=process permissive=0

If you run this under audit2allow it gives you the following message:


#============= svirt_t ==============

#!!!! This avc can be allowed using the boolean 'virt_use_execmem'
allow svirt_t self:process execmem;


Setroubleshoot also tells you to turn on the virt_use_execmem boolean.

# setsebool -P virt_use_execmem 1

What does the virt_use_execmem boolean do?

# semanage boolean -l | grep virt_use_execmem
virt_use_execmem               (off  ,  off)  Allow confined virtual guests to use executable memory and executable stack


Ok, what does that mean? Uli Drepper, back in 2006, added a series of memory checks to the SELinux kernel to handle common attack vectors against programs using executable memory. Basically, these memory checks allow us to stop a hacker from taking over confined applications using buffer overflow attacks.

If qemu needs this access, why is this not enabled by default?

Standard KVM VMs do not require qemu to have the execmem privilege. Denying execmem blocks certain attack vectors, such as buffer overflow attacks where the hacked process overwrites memory and then executes the code the hacked program wrote.

When using qemu emulators that do not use KVM, the emulators require execmem to work. If you look at the AVC above, I highlighted that the user was running qemu-system-x86. In order for this emulator to work, it needs execmem, so we have to loosen the policy slightly to allow the access. Turning on the virt_use_execmem boolean could allow a qemu process that is susceptible to a buffer overflow attack to be hacked; SELinux would not block this attack.

Note: lots of other SELinux blocks would still be in effect.

Since most people use KVM for VMs, we disable it by default.
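
To see where a given host stands, you can check the boolean's current state and which emulator (and SELinux context) the guests are actually running under, for example:

# Is the boolean currently enabled?
getsebool virt_use_execmem
# Which emulator binaries are running, and in which SELinux context?
ps -eZ | grep qemu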



In a perfect world, libvirt would be changed to launch different emulators with different SELinux types, based on whether or not the emulator requires execmem. For example, svirt_tcg_t is defined, which allows this access.

Then you could run svirt_t kvm/qemu and svirt_tcg_t/qemu-system-x86 VMs on the same machine at the same time without having to lower the security. I am not sure if this is a common situation, and no one has done the work to make this happen.

January 04, 2016

A security analogy that works
Over the holiday break I spent a lot of time reading and thinking about what the security problem really is. It's really hard to describe, no analogies work, and things just seem to keep getting worse.

Until now!

Maybe.

Well, things will probably keep getting worse, but I think I've found a way to describe this that almost anyone can understand. We can't really talk about our problems today, which makes it impossible to fix anything.

Security is the same problem as World Hunger. Unfortunately we can't solve either, but in theory we can make things better. Let's look at the comparisons.

First, the problem we talk about isn't just one thing. It's really hundreds or thousands of other problems we lump together into one group and give it a simple yet mostly meaningless name. The real purpose of the name is to give humans a single idea they can relate to. It's not meant to make the problem more fixable, it just makes it so we can talk about it.

Security includes things like application security, operational security, secure development, secure documentation, pen testing, hacking, DDoS, and hundreds of other things.

World hunger includes homelessness, hunger, malnutrition, lack of education, clean water, and hundreds of other things.

Lots of little things.

Second, the name isn't really the problem. It's what we can see. It's a symptom of other problems. The other problems are what you have to fix, you can't fix the name.

What we call "security" is really other things. The real problem is rarely security itself; security is the symptom we can see, while the real problem is less obvious and harder to see.

In the context of world hunger the real problems are things like clean water, education, equality, corruption, crime, and the list goes on. Hunger is what we see, but to fix hunger, we have to fix those other problems.

We can give people food, but that doesn't fix the real problem, it makes things better for a day or a week. This is exactly how security works today. We run from fire to fire, fixing a single easy to see problem, then run off to the next thing. We never solve any problems, we put out fires.

So assuming this analogy holds, the sort of good news is that world hunger is slowly getting better. The bad news is progress is measured in decades. This is where my thinking starts to falter. Trade can help bring more progress to a given area. What is the equivalent in security? Are there things that can help make the situation better for a localized area? Will progress take decades?

If I had to guess, which I will, I suspect we're in the dark ages of security. We don't approach problems with a scientific mind, we try random things until something works, and then decide that spinning around while holding a chicken is what fixed that buffer overflow.

What we need is "security science". This means we need ways to approach security in a formal, reproducible manner; a practice that can be taught and learned. Today it's all magic: some people have magic, most don't. Remember when the world had magicians instead of doctors? Things weren't better back then, no matter what those forwards from your uncle claim.

This all leaves a lot of unanswered questions, but I think it's a starting point. Today we have no starting point, we have people complaining everything is broken, people selling magic, some have given up and assume this is how everything will just always be.

What will our Security Renaissance be? What will security science look like?

Join the conversation, hit me up on twitter, I'm @joshbressers