Whilst writing a follow-up to my last post, I noticed that Ansible was failing to connect to a newly spun-up Linux server on the Rackspace Cloud, and I spent a bit of time troubleshooting the connection. None of the articles I'd read about using Rackspace with Ansible mentioned much about SSH connection timeouts, so I thought I'd put this together.

I started by modifying my previous playbook: I added another play to add the server into a group, set 'wait' to yes, and added a 'wait_timeout'. These will be explained in further detail in my next post.

The resulting playbook looked like this:

- name: Build a Cloud Server
  hosts: localhost # target local host
  connection: local # run actions locally
  gather_facts: False
  tasks:
    - name: Server build request
      local_action:
        module: rax
        credentials: ~/creds # rackspace cloud credentials
        name: test-server
        group: myapp
        flavor: general1-1 # 1GB server
        image: 892b9bab-a11e-4d16-b4bd-ab9494ed7b78 # Image ID for Debian Sid
        region: LON # London Region
        key_name: ansible-bastion # keypair to use
        wait: yes # wait for server to build
        wait_timeout: 50000
        state: present
        networks:
          - private
          - public
      register: rax

    - name: Add the instance to an ansible group called myapp
      local_action:
        module: add_host
        hostname: "{{ item.name }}"
        ansible_ssh_host: "{{ item.rax_accessipv4 }}"
        ansible_ssh_pass: "{{ item.rax_adminpass }}"
        ansible_ssh_user: root
        groupname: myapp
      with_items: rax.success
      when: rax.action == 'create'

- name: Install Packages
  hosts: myapp
  user: root
  gather_facts: False
  tasks:
    - name: Update apt cache
      apt: update_cache=yes

    - name: Install Packages
      action: apt state=installed pkg={{ item }}
      with_items:
        - vim
        - git
        - tmux

I noticed upon running this playbook that the connection could fail: on a number of occasions it was fine, but on others I got the error:

SSH encountered an unknown error during the connection. We recommend you re-run the command using -v, which will enable SSH debugging output to help diagnose the issue

I immediately tried connecting to the newly spun-up server manually, and it appeared to be running fine. I'm well versed with Rackspace and other cloud providers, so I know this can happen: the API call returns once the server has been created but while it is still booting, so there is a slight race condition in which Ansible tries to connect before networking or the SSH daemon is fully running.

I also put Ansible into debug mode to get the above output:

ansible-playbook <playbook-name> -vvv

Rackspace server launch times can vary between regions (I used the LON region). This sometimes means the majority of users aren't affected by a slower boot time and never see this race condition; I didn't see it mentioned in the Ansible Rackspace Guide.

Fortunately the docs for Ansible are pretty good, and in this instance the wait_for module documentation page provided some very useful information. This included details of a feature, added in a previous version, to search for the 'OpenSSH' banner whilst testing a connection on SSH port 22. I could have just checked that port 22 was listening, but I found this still wasn't 100% reliable at avoiding the race condition: the port can be open before the daemon is actually ready.

The page lists:

# wait 300 seconds for port 22 to become open and contain "OpenSSH", don't assume the inventory_hostname is resolvable
# and don't start checking for 10 seconds
- local_action: wait_for port=22 host="{{ ansible_ssh_host | default(inventory_hostname) }}" search_regex=OpenSSH delay=10
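To make the difference between the two checks concrete, here is a rough, self-contained Python sketch. It is not Ansible's actual implementation; `fake_sshd`, `port_open` and `banner_matches` are made-up names for illustration. The fake server accepts TCP connections straight away but only sends an OpenSSH-style banner after a delay, mimicking a booting server: a bare port check passes immediately, while a banner check (what `search_regex=OpenSSH` effectively does) keeps polling until sshd is really talking.

```python
import re
import socket
import threading
import time

def port_open(host, port, timeout=2.0):
    """Naive check: succeeds as soon as the TCP port accepts connections."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def banner_matches(host, port, pattern, timeout=30.0, poll=0.5):
    """Roughly what wait_for's search_regex does: keep reconnecting and
    reading until the received data matches the pattern, or we time out."""
    deadline = time.monotonic() + timeout
    regex = re.compile(pattern)
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=poll + 1) as sock:
                sock.settimeout(poll + 1)
                data = sock.recv(1024).decode("ascii", errors="replace")
                if regex.search(data):
                    return True
        except OSError:
            pass
        time.sleep(poll)
    return False

def fake_sshd(server_sock, banner_delay=1.0):
    """Test double: accepts connections immediately, but only starts
    sending an OpenSSH-style banner after banner_delay seconds -- the
    window in which the naive port check gives a false positive."""
    start = time.monotonic()
    while True:
        try:
            conn, _ = server_sock.accept()
        except OSError:
            return  # listening socket closed; stop
        with conn:
            if time.monotonic() - start >= banner_delay:
                conn.sendall(b"SSH-2.0-OpenSSH_8.9\r\n")

if __name__ == "__main__":
    server = socket.socket()
    server.bind(("127.0.0.1", 0))
    server.listen(5)
    port = server.getsockname()[1]
    threading.Thread(target=fake_sshd, args=(server,), daemon=True).start()

    print(port_open("127.0.0.1", port))                  # True straight away
    print(banner_matches("127.0.0.1", port, "OpenSSH"))  # True once the banner appears
    server.close()
```

The banner check costs at most a few extra polling rounds over the bare port check, which matches what I saw in practice.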

So, with a bit of trial and error I ended up adding a new play before the Install Packages play:

- name: Wait for port 22 to be ready
  hosts: myapp
  gather_facts: False
  tasks:
    - local_action: wait_for port=22 host="{{ ansible_ssh_host }}" search_regex=OpenSSH delay=10

After this I spun up 5 servers with the same playbook (adding count: 5 to the original server build request) and they all completed successfully; I've not encountered the same error since. I'm not sure of the overhead of the regex search in wait_for versus just testing whether port 22 is up, but it seems to be minimal.
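For reference, the five-server run above was the same build task with one extra line; the other values are as in the earlier play (this is a sketch, so comments mark the assumption):

```yaml
- name: Server build request
  local_action:
    module: rax
    credentials: ~/creds
    name: test-server
    group: myapp
    flavor: general1-1
    image: 892b9bab-a11e-4d16-b4bd-ab9494ed7b78
    region: LON
    key_name: ansible-bastion
    count: 5 # build five identical servers instead of one
    wait: yes
    wait_timeout: 50000
    state: present
  register: rax
```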

Hopefully that will help someone! If you ask me, it's worth doing even if you aren't experiencing the timeouts, just in case the Rackspace Cloud is having a busy day.

Stay tuned for the next post in which I'll cover the full end-to-end playbook.
