Whilst writing a follow-up to my last post, I noticed that Ansible was failing to connect to a newly spun up Linux server on the Rackspace Cloud, and I spent a bit of time troubleshooting the connection. None of the articles I've read on using Rackspace with Ansible said much about SSH connection timeouts, so I thought I'd put this together.
I started by modifying my previous playbook: I added another play, added the server into a group, and set 'wait' to yes along with a 'wait_timeout'. These will be explained in further detail in my next post.
The resulting playbook looked like this:
- name: Build a Cloud Server
  hosts: localhost        # target local host
  connection: local       # run actions locally
  gather_facts: False
  tasks:
    - name: Server build request
      local_action:
        module: rax
        credentials: ~/creds          # rackspace cloud credentials
        name: test-server
        group: myapp
        flavor: general1-1            # 1GB server
        image: 892b9bab-a11e-4d16-b4bd-ab9494ed7b78   # Image ID for Debian Sid
        region: LON                   # London Region
        key_name: ansible-bastion     # keypair to use
        wait: yes                     # wait for server to build
        wait_timeout: 50000
        state: present
        networks:
          - private
          - public
      register: rax
    - name: Add the instance to an ansible group called myapp
      local_action:
        module: add_host
        hostname: "{{ item.name }}"
        ansible_ssh_host: "{{ item.rax_accessipv4 }}"
        ansible_ssh_pass: "{{ item.rax_adminpass }}"
        ansible_ssh_user: root
        groupname: myapp
      with_items: rax.success
      when: rax.action == 'create'
- name: Install Packages
  hosts: myapp
  user: root
  gather_facts: False
  tasks:
    - name: Update apt cache
      apt: update_cache=yes
    - name: Install Packages
      action: apt state=installed pkg={{ item }}
      with_items:
        - vim
        - git
        - tmux
Upon running this playbook I noticed that the connection could fail. On a number of occasions it was fine, but on others I got the error:
SSH encountered an unknown error during the connection. We recommend you re-run the command using -v, which
will enable SSH debugging output to help diagnose the issue
I immediately tried connecting manually to the newly built server and it appeared to be running fine. I'm well versed in Rackspace and other cloud providers, so I know this can happen when the API call returns saying the server has been created while it is still booting. That leaves a slight race condition in which Ansible tries to connect before the SSH daemon or networking is fully up.
I also ran Ansible in verbose mode to get more detail on the error above:
ansible-playbook <playbook-name> -vvv
Rackspace server launch times can vary between regions (I used the LON region). This may mean that the majority of users aren't affected by a slower boot time and never see this race condition. I didn't see it mentioned in the Ansible Rackspace Guide either.
Fortunately the Ansible docs are pretty good, and in this instance the wait_for module documentation page provided some very useful information, including details of a feature added in an earlier version that searches for the 'OpenSSH' banner whilst testing a connection on SSH port 22. I could just check that port 22 is listening, but I found this still wasn't 100% reliable as a way of making sure everything is up and running and avoiding race conditions with the connection.
The page lists:
# wait 300 seconds for port 22 to become open and contain "OpenSSH", don't assume the inventory_hostname is resolvable
# and don't start checking for 10 seconds
- local_action: wait_for port=22 host="{{ ansible_ssh_host | default(inventory_hostname) }}" search_regex=OpenSSH delay=10
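For contrast, the plain "is port 22 listening" check mentioned earlier might look roughly like the sketch below. It is the documented example with search_regex removed, so it only confirms that something accepts TCP connections on port 22 and not that the SSH daemon has sent its banner. Treat it as an illustration of the less reliable variant rather than a recommendation.

```yaml
# Sketch only: waits for TCP port 22 to accept connections, but does not
# look at what the service on that port sends back.
- local_action: wait_for port=22 host="{{ ansible_ssh_host | default(inventory_hostname) }}" delay=10
```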
So, with a bit of trial and error I ended up adding a new play before the Install Packages play:
- name: Wait for port 22 to be ready
  hosts: myapp
  gather_facts: False
  tasks:
    - local_action: wait_for port=22 host="{{ ansible_ssh_host }}" search_regex=OpenSSH delay=10
After this I spun up 5 servers with the same playbook (by adding count: 5 to the original server launch request). They all completed successfully, and I've not encountered the same error since. I'm not sure how much overhead the regex search in wait_for adds compared with just testing whether port 22 is up, but it seems to be minimal.
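As a rough sketch of that change, the build task with count added would look something like this. Everything other than the count line is copied from the playbook above, and the rest of the play is assumed to stay the same.

```yaml
    - name: Server build request
      local_action:
        module: rax
        credentials: ~/creds          # rackspace cloud credentials
        name: test-server
        group: myapp
        count: 5                      # launch five servers instead of one
        flavor: general1-1            # 1GB server
        image: 892b9bab-a11e-4d16-b4bd-ab9494ed7b78   # Image ID for Debian Sid
        region: LON                   # London Region
        key_name: ansible-bastion     # keypair to use
        wait: yes                     # wait for servers to build
        wait_timeout: 50000
        state: present
        networks:
          - private
          - public
      register: rax
```

Because the add_host task loops over rax.success, each of the new servers should end up in the myapp group, and the wait_for play then runs once per host.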
Hopefully that will help someone! If you ask me, it's worth doing even if you aren't experiencing the timeouts, just in case the Rackspace Cloud is having a busy day.
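On a particularly slow day it may also be worth giving wait_for more time than the 300 seconds mentioned in the documentation example. A minimal sketch, assuming the same play as above: the timeout parameter belongs to wait_for, but the value of 600 and the task name are my own illustrative choices, not something I have tuned.

```yaml
- name: Wait for port 22 to be ready
  hosts: myapp
  gather_facts: False
  tasks:
    - name: Wait for the OpenSSH banner on port 22
      # timeout=600 is an illustrative value; adjust it to how long
      # your servers actually take to boot.
      local_action: wait_for port=22 host="{{ ansible_ssh_host }}" search_regex=OpenSSH delay=10 timeout=600
```

If the banner doesn't appear within the timeout, the task fails for that host rather than letting the Install Packages play hit the SSH error.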
Stay tuned for the next post in which I'll cover the full end-to-end playbook.