Best Practices for Writing Efficient and Scalable Ansible Playbooks

Ansible's a critical tool when you're wrangling complicated, large-scale network gear. Doesn't matter if it's Cisco, Juniper, Arista, or a whole mix – how you write and structure your Ansible playbooks totally impacts how well they scale, how easy they are to read later (by you or someone else!), and how fast they run. As your automation footprint grows, following some solid best practices becomes, well, essential.


This guide shares proven Ansible playbook best practices based on real-world network automation headaches and wins. It's for network engineers who want to build Ansible projects the right way – robust and easy to maintain. We'll cover how to structure your project, handle inventory and variables, keep your code clean, and keep those sensitive passwords safe.

Organize Your Directory Structure

Having a well-organized project directory is just plain necessary for scalable, maintainable Ansible. Trust me, trying to find stuff in a messy folder of playbooks, roles, and inventory files is a nightmare. A clear structure helps teams work together, makes it easier for new folks to jump in, and keeps things modular.
Here’s a layout that works for a lot of people:

network-automation/
├── inventories/ # Keep environments separate here
│   ├── production/
│   │   ├── hosts.yml # Or config for your dynamic inventory
│   │   ├── group_vars/ # Variables for groups in production
│   │   │   ├── all.yml # Global vars for production
│   │   │   └── cisco_routers.yml # Specific vars for production Cisco routers
│   │   └── host_vars/ # Variables for specific production hosts
│   │       └── core-router1.yml # Vars just for core-router1
│   └── staging/
│       ├── hosts.yml
│       └── ... # Same deal for staging group_vars/host_vars
├── playbooks/ # Your main workflow playbooks go here
│   ├── network_backup.yml
│   └── configure_vlans.yml
├── roles/ # Reusable pieces of automation
│   └── cisco_common_config/ # Example role
│       ├── tasks/ # Steps the role performs
│       │   └── main.yml
│       ├── defaults/ # Default variable values for the role
│       │   └── main.yml
│       ├── vars/ # Role-specific variables (higher precedence than defaults)
│       │   └── main.yml
│       └── templates/ # Jinja2 templates for configs/files
│           └── motd.j2
├── files/ # Static files to copy to hosts (firmware, scripts)
├── vault/ # Dedicated spot for encrypted secrets
│   └── credentials.yml # Encrypted sensitive data
├── .ansible-lint # Config file for your linter
├── .gitignore # Tell Git what files to ignore (like temp files)
├── requirements.yml # List of roles and collections to pull from Galaxy
└── README.md # Explain what this project does and how to use it!

A clean structure makes things much easier to manage as your infrastructure and automation grow. Put environment-specific inventory and variables in those dedicated inventories/ folders, playbooks in playbooks/, and anything you plan to reuse goes in roles/. Keep encrypted secrets in that vault/ directory.

For bigger setups, you might sort roles inside roles/ by vendor, tech type, or function (e.g., roles/cisco_ios/, roles/security/). If the logic changes a lot between environments (not just variable values), use separate subdirectories in playbooks/ (e.g., playbooks/production/, playbooks/dev/). If you're on a large team, keep the most reusable, generic stuff in a separate Git repo that gets pulled in using requirements.yml. And always, use requirements.yml to manage external roles and collections – don't just ansible-galaxy install randomly.
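As a rough idea of what that requirements.yml can look like, here is a minimal sketch. The two collection names are the ones used elsewhere in this guide; the version pin is only a placeholder, so swap in whatever you have actually tested against.

```yaml
# requirements.yml (sketch)
collections:
  - name: cisco.ios
  - name: netbox.netbox
  # Pin versions once you know what you've tested, for example:
  # - name: cisco.ios
  #   version: "x.y.z"   # placeholder, not a recommendation
```

Install everything in the file with ansible-galaxy collection install -r requirements.yml, so every engineer and every CI job ends up with the same content.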

Benefits: Organizing your directory structure enhances scalability, maintainability, and readability, making it easier to manage and update your Ansible projects.

Action: Set up this structure and store playbooks in playbooks/ and roles in roles/.

Use Dynamic Inventory (like with NetBox)

Trying to keep static inventory files (hosts.yml, etc.) updated manually for a large or constantly changing network? Forget it, that doesn't scale. Dynamic inventory is key here. It lets Ansible grab current device info from a single Source of Truth (SoT) – maybe NetBox (a common IPAM/DCIM tool), your CMDB, or a cloud provider's API. This means your inventory is always current, you save a ton of manual effort, and you can easily group hosts based on their attributes (like role, site, vendor).

(Using NetBox requires that you actually have NetBox running, obviously.)

First step is usually installing the relevant Ansible Collection (here, the NetBox one):

ansible-galaxy collection install netbox.netbox

Next, create a configuration file for the inventory plugin (put this inside your environment folder, like inventories/production/netbox_inventory.yml):

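plugin: netbox.netbox.nb_inventory
api_endpoint: http://netbox.local   # Base URL only; the plugin appends /api/ itself
token: your_netbox_token            # Better: leave this out and export NETBOX_TOKEN instead
validate_certs: false               # Lab only; use HTTPS with verification in production
group_names_raw: true               # Name groups "routers" rather than "device_roles_routers"
group_by:
  - device_roles
  - site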

Verify it's working and see how it groups your devices:

ansible-inventory -i netbox_inventory.yml --graph

That command connects to NetBox, fetches the data, and shows you the groups Ansible creates. You'll see groups based on your group_by settings.

@all:
  |--@routers:
  |  |--rtr1
  |  |--rtr2
  |--@switches:
  |  |--sw1
  |  |--sw2

Benefits: Dynamic inventory makes your Ansible setup way more scalable and flexible. As your network changes (devices added, removed, roles updated), your inventory automatically pulls the latest data from your SoT, cutting down on manual updates and potential errors.

Run a Playbook Using NetBox Inventory

Now, you can just target those groups in your playbooks. Here’s a playbook to back up Cisco device configs:

# playbooks/backup_config.yml
- name: Backup Cisco Device Configs
  hosts: routers
  gather_facts: false
  tasks:
    - name: Fetch running config
      cisco.ios.ios_command:
        commands:
          - show running-config
      register: output

    # The backups/ directory must already exist
    - name: Save config to file
      ansible.builtin.copy:
        content: "{{ output.stdout[0] }}"
        dest: "backups/{{ inventory_hostname }}.cfg"

Run the playbook, pointing to your dynamic inventory config:

ansible-playbook -i netbox_inventory.yml playbooks/backup_config.yml

Explanation:

● -i netbox_inventory.yml: Uses NetBox API to fetch inventory dynamically.

● hosts: routers: Matches the device_role group routers from NetBox.

● inventory_hostname: Automatically populated from NetBox device name.

● Backup files are saved per device under a backups/ directory.

Test with ansible-lint

ansible-lint is a command-line tool that checks your playbooks, roles, and collections against a set of rules for best practices, common mistakes, and coding style. Running it regularly saves you debugging time and helps prevent bad code from ever getting deployed.

You can run ansible-lint on your playbooks to check for common issues and enforce standards. Here’s how to do it:

ansible-lint playbooks/deploy_vlan.yml

ansible-lint flags things like YAML syntax errors, deprecated modules, inefficient loops, bad task names, potential idempotency issues, security red flags (like hardcoded passwords – use Vault!), and style inconsistencies.

For example:

[ANSIBLE0006] git latest version should not be used
playbooks/deploy_vlan.yml:10
Task/Handler: Install the latest version of the package

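This output flags a Git task that pulls the latest version instead of a pinned one, which means two runs of the same playbook can produce different results. Update the task to check out a specific tag or commit instead. The exact rule IDs and output format depend on your ansible-lint version; recent releases use named rules (such as latest[git]) rather than numeric IDs.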

Benefits: Using it early in your workflow catches errors before they cause problems and helps enforce coding standards across your team. Integrate it into CI/CD pipelines for automated checks, or use pre-commit hooks locally. Set up a project-specific .ansible-lint config file to customize the rules.
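A project-level config might look something like the sketch below. The keys (exclude_paths, warn_list, skip_list) are standard ansible-lint settings, but the specific paths and rule choices here are illustrative assumptions, not recommendations from this guide.

```yaml
# .ansible-lint (sketch; paths and rule choices are illustrative)
exclude_paths:
  - .git/
  - vault/              # encrypted files, nothing useful to lint
warn_list:
  - yaml[line-length]   # report as a warning instead of failing the run
skip_list: []           # add rule IDs here only with a documented reason
```

Keeping skip_list empty by default is a deliberate choice: it forces the team to discuss a rule before silencing it.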

 

Use Group and Host Vars to Separate Data from Logic

Keeping your automation code (playbooks, roles) separate from your configuration data (variables) is absolutely key to having flexible, easy-to-maintain Ansible. group_vars and host_vars directories are how you do this. They let you define variables specifically for groups of hosts or individual hosts.

Group Variables (group_vars/): Define variables that apply to all hosts in a specific inventory group.

Example: group_vars/cisco.yml

In this example, the group_vars/cisco.yml file defines variables that apply to all hosts in the cisco group. These variables include the network operating system, connection type, username, and password.

ansible_network_os: cisco.ios.ios
ansible_connection: ansible.netcommon.network_cli
ansible_user: admin
ansible_password: "{{ vault_password }}"

Host Variables (host_vars/): Define variables just for one host. These override group variables for that specific host.

Example: host_vars/sw1.yml

In this example, the host_vars/sw1.yml file defines variables that apply to the host sw1. These variables include a list of VLANs with their respective IDs and names.

vlans:
- id: 10
  name: STAFF
- id: 20
  name: GUEST

Using Group and Host Variables in Playbooks:

Here’s an example playbook that uses group and host variables to configure VLANs on Cisco devices:

- name: Configure VLANs
  hosts: cisco
  gather_facts: false
  tasks:
    - name: Create VLANs
      cisco.ios.ios_vlans:
        config:
          - vlan_id: "{{ item.id }}"
            name: "{{ item.name }}"
        state: merged
      loop: "{{ vlans }}"

Explanation:

  • hosts: cisco: Specifies that the playbook applies to all hosts in the cisco group.
  • gather_facts: false: Disables fact gathering, which is not needed for this task.
  • tasks: Defines a list of tasks to be executed on the target hosts.
  • cisco.ios.ios_vlans: Uses the ios_vlans resource module to create VLANs on Cisco devices (it replaces the older, deprecated ios_vlan module).
  • vlan_id and name: Specify the VLAN ID and name, respectively, using variables defined in the host_vars/sw1.yml file.
  • state: merged: Adds or updates the listed VLANs without touching VLANs that aren't listed.
  • loop: "{{ vlans }}": Iterates over the list of VLANs defined in the host_vars/sw1.yml file, creating a VLAN for each item in the list.

The Power of Jinja2 Templating

Ansible's templating engine, Jinja2, is how you inject these variable values into task parameters or templates. Use Jinja2 filters (like | ansible.utils.ipaddr, | default, | mandatory) and control structures ({% if %}, {% for %}) within your tasks and templates. It's a good idea to build smaller, modular Jinja2 templates for specific network features (e.g., ospf_interface.j2, vlan_config.j2) rather than one giant, monolithic template for a whole device config.
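Here is a small sketch of two of those filters inside a task. It reuses the vlans list from the host_vars example and only prints a message through the debug module, so it is safe to run against a lab. The fallback name built by default() is an assumption made up for the example.

```yaml
- name: Show Jinja2 filters in action (sketch)
  hosts: cisco
  gather_facts: false
  tasks:
    - name: Print each VLAN, failing early if an ID is missing
      ansible.builtin.debug:
        # mandatory: stop with an error if item.id is undefined
        # default: fall back to a generated name when item.name is missing
        msg: "vlan {{ item.id | mandatory }} name {{ item.name | default('VLAN' ~ item.id) }}"
      loop: "{{ vlans }}"
```

Failing on a missing ID while tolerating a missing name is a design choice: the ID is what the device needs, the name is a nicety.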

Benefits: Using group_vars and host_vars decouples your configuration data from your automation logic. You can reuse the same playbooks/roles for different environments or sets of devices just by changing the variable files.

 

Understand Variable Precedence

Ansible loads variables from a bunch of different places and overrides them in a specific order. Knowing this order is pretty important for understanding why a variable has a certain value and for troubleshooting weird behavior.

Here’s a simplified look at the order (things higher on the list override things lower down):

  1. Extra vars: Variables passed directly to the ansible-playbook command using the -e flag. These always win.
  2. Task vars: Variables defined on an individual task (they apply only to that task).
  3. Block vars: Variables defined on a block (they apply only to the tasks in that block).
  4. Role vars: Variables in a role's vars/main.yml.
  5. Play vars: Variables defined in the play itself (vars, vars_prompt, vars_files).
  6. Inventory vars: Variables from host_vars and group_vars, with host_vars overriding group_vars.
  7. Role defaults: Variables in a role's defaults/main.yml. These have the lowest precedence, so almost anything overrides them.

(Note: The official docs have the complete, detailed list, including where set_fact, registered variables, and include_vars fit in; all three rank above task vars. This covers most common cases.)

Example Usage:

Let’s say you have a playbook that defines a variable debug with a default value of false. You can override this variable using extra vars when running the playbook:

ansible-playbook play.yml -e 'debug=true'
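A minimal play.yml to try this with could look like the following sketch. The file name and the debug variable come from the example above; everything else is kept as small as possible.

```yaml
# play.yml (sketch)
- name: Show which value wins
  hosts: localhost
  gather_facts: false
  vars:
    debug: false   # play-level value; extra vars passed with -e override it
  tasks:
    - name: Print the effective value
      ansible.builtin.debug:
        var: debug
```

One thing to watch: values passed as key=value on the command line arrive as strings, so use the | bool filter in conditionals, or pass JSON (-e '{"debug": true}') when you need a real boolean.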

You can also run the debug module as an ad hoc command to check a value. Ad hoc commands don't run your play, so they only see inventory variables and anything you pass with -e:

ansible localhost -m ansible.builtin.debug -a 'var=debug' -e 'debug=true'

Benefits: Understanding variable precedence ensures that the correct values are used in your tasks, enhancing the reliability and maintainability of your automation.

 

Read more: Learn about Ansible Navigator and how it can help you debug playbooks.

Use Roles and Collections

Roles and Collections are absolutely key for organizing your code, reusing logic, and using shared content from others (like vendors!).

  • Roles: They bundle up tasks, variables, files, templates, and handlers to achieve a specific, defined outcome (like "configure NTP" or "apply standard hardening").
  • Collections: These are bigger distribution packages that can contain multiple roles, custom modules, plugins, and documentation. They're often used by vendors (cisco.ios) or communities (community.general) to distribute their content.

Benefits: Using roles and collections enhances the modularity, reusability, and maintainability of your Ansible playbooks, making it easier to manage complex automation tasks and share code across projects.

Example Role Structure (roles/cisco_vlan_config/):

Here’s an example directory structure that uses roles to organize your playbooks. In this example, the cisco_vlan_config role contains tasks, defaults, and templates for configuring VLANs on Cisco devices.

roles/
└── cisco_vlan_config/
    ├── tasks/
    │   └── main.yml
    ├── defaults/
    │   └── main.yml
    └── templates/

Example: tasks/main.yml

This task uses the ios_vlans resource module (the replacement for the deprecated ios_vlan module) to create VLANs on Cisco devices. The vlans variable is defined in the defaults/main.yml file.

- name: Create VLAN
  cisco.ios.ios_vlans:
    config:
      - vlan_id: "{{ item.id }}"
        name: "{{ item.name }}"
    state: merged
  loop: "{{ vlans }}"
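
To actually run the role, a playbook has to call it. Here is a sketch of what that might look like. The VLAN values are made up, and it assumes Ansible can find the project-level roles/ directory (for a playbook stored under playbooks/, that usually means setting roles_path in ansible.cfg).

```yaml
# playbooks/configure_vlans.yml (sketch)
- name: Configure VLANs with the role
  hosts: cisco
  gather_facts: false
  roles:
    - role: cisco_vlan_config
      vars:
        vlans:              # overrides the role default for this play
          - id: 30
            name: PRINTERS  # illustrative value
```

In day-to-day use you would normally leave the vars out here and let group_vars or host_vars supply vlans, which keeps the playbook free of data.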

Using Collections: Installing a Collection

Install collections using ansible-galaxy collection install. Then, use modules from those collections with their full name.

ansible-galaxy collection install cisco.ios

Read more: To dive deeper into modules and their types, check out our detailed guide: Ansible Modules Types Explained.

Use block and rescue for Error Handling

The block, rescue, and always directives let you group tasks and define what should happen if something fails (rescue), or after the block runs regardless of success or failure (always). This makes your playbooks more resilient: they can handle failures gracefully, log issues, and potentially try to fix things.

Example: Error Handling in Playbooks

In this example, the block section contains the tasks that you want to execute. If any task within the block fails, the rescue section is executed, allowing you to handle the error gracefully.

- name: Configure Interface
  hosts: routers
  gather_facts: false
  tasks:
    - name: Configure interface with error handling
      block:
        - name: Shut interface
          cisco.ios.ios_interfaces:
            config:
              - name: GigabitEthernet0/1
                enabled: false
            state: merged
      rescue:
        - name: Log failure
          ansible.builtin.debug:
            msg: "Interface configuration failed on {{ inventory_hostname }}"

For really robust automation, add pre-flight checks (verify device state first using ping or fact modules, use ansible.builtin.assert) and post-validation (check the desired state after applying changes). Plan how to rollback critical changes if something goes wrong – maybe develop idempotent rollback tasks.
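
Putting a pre-flight check together with rescue and always could look roughly like this. Treat it as a sketch: target_interface is a hypothetical variable you would supply yourself (for example with -e), and the debug tasks stand in for whatever logging or rollback you really need.

```yaml
- name: Change an interface with pre-check, rescue, and always (sketch)
  hosts: routers
  gather_facts: false
  tasks:
    - name: Pre-flight check that the interface name was supplied
      ansible.builtin.assert:
        that:
          - target_interface is defined        # hypothetical variable
          - target_interface | length > 0
        fail_msg: "Set target_interface before running this play"

    - name: Apply the change with error handling
      block:
        - name: Shut interface
          cisco.ios.ios_interfaces:
            config:
              - name: "{{ target_interface }}"
                enabled: false
            state: merged
      rescue:
        - name: Log failure
          ansible.builtin.debug:
            msg: "Change failed on {{ inventory_hostname }}"
      always:
        - name: Runs whether the block succeeded or failed
          ansible.builtin.debug:
            msg: "Finished processing {{ inventory_hostname }}"
```

The assert sits outside the block on purpose: a failed pre-check should stop the host before any change is attempted, not get swallowed by rescue.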

Benefits: Using block and rescue for error handling enhances the robustness and reliability of your Ansible playbooks, allowing them to gracefully handle and recover from errors.

Use Handlers for Idempotent Changes

Handlers are special tasks that only run if another task explicitly tells them to ("notifies" them) and that task actually reported a "changed" state on the target device. They're perfect for actions that should only happen if a configuration actually changed, like saving the config on a network device or restarting a service on a server.

Example: Using Handlers in Playbooks

The Upload config task notifies the Save config handler. The handler will only run if the Upload config task reports a change, ensuring that the configuration is saved only when necessary.

tasks:
- name: Upload config
  cisco.ios.ios_config:
    src: running.cfg
  notify: Save config

handlers:
- name: Save config
  cisco.ios.ios_command:
    commands:
      - write memory

Other Useful Directives:

  • Tags: Apply tags (tags: [vlans, ntp]) to plays, roles, or tasks. Run with ansible-playbook --tags vlans or --skip-tags ntp for granular control. Useful for grouping tasks by function or phase (prechecks, deploy, postchecks).
  • delegate_to: localhost: Run a task on the machine executing the playbook instead of the target device. Handy for interacting with APIs, managing local files, or sending notifications.
  • run_once: true: Run a task on only the first host in the current batch and apply the result to every host in that batch, even if the play targets multiple hosts. Good for setup tasks on the control node or interacting with a shared resource.

Benefits: Using handlers ensures that changes are applied idempotently, enhancing the efficiency and reliability of your Ansible playbooks by avoiding unnecessary modifications.
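
The three directives from the list above often show up together. Here is a sketch that tags a backup task and then reports once from the control node. The tag names are illustrative, and ansible_play_hosts is a built-in variable holding the hosts still active in the play.

```yaml
- name: Back up configs and report once (sketch)
  hosts: routers
  gather_facts: false
  tasks:
    - name: Fetch running config
      cisco.ios.ios_command:
        commands:
          - show running-config
      register: output
      tags: [backup]

    - name: Report completion a single time, from the control node
      ansible.builtin.debug:
        msg: "Backup run finished for {{ ansible_play_hosts | length }} hosts"
      delegate_to: localhost
      run_once: true
      tags: [backup, notify]
```

Running ansible-playbook with --tags backup executes both tasks, while --skip-tags notify leaves out the report.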

Store Credentials Securely with Vault

Putting sensitive stuff (passwords, API keys, secret community strings) directly in your playbooks or variable files? That's a huge security risk. Ansible Vault encrypts these values, keeping them safe even if your files are in a public or shared repository.

Example: Using Ansible Vault

Here’s how you can use Ansible Vault to securely store and manage your credentials:

You can create a new vault file to store your sensitive information.

ansible-vault create vault/vault.yml

This command will prompt you to set a password for the vault and then open an editor where you can define your variables. For example:

vault_password: mysecret

Use Vault in Playbooks:

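In your playbooks, you can reference the variables stored in the vault, as long as the vault file is actually loaded (for example through vars_files, or by keeping it under group_vars/):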

ansible_password: "{{ vault_password }}"
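
One way to load the vault file is vars_files, as in this sketch. It assumes the playbook lives in playbooks/ and the vault file is the vault/vault.yml created above; relative vars_files paths are resolved from the playbook's directory, so adjust the path to your layout.

```yaml
- name: Deploy with vaulted credentials (sketch)
  hosts: cisco
  gather_facts: false
  vars_files:
    - ../vault/vault.yml   # relative to this playbook; adjust as needed
  tasks:
    - name: Confirm the secret was loaded without printing it
      ansible.builtin.assert:
        that:
          - vault_password is defined
        quiet: true
      no_log: true
```

Prefixing vaulted variables with vault_ (as this guide does) makes it obvious in group_vars which values come from an encrypted file.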

Run Playbooks with Vault:

When you run a playbook that uses vaulted variables, you'll need to provide the vault password. The easiest way is to have Ansible prompt you:

ansible-playbook deploy.yml --ask-vault-pass

Benefits: Using Ansible Vault ensures that sensitive information is securely encrypted, enhancing the security and integrity of your Ansible playbooks.

 

Use no_log: true for Sensitive Tasks

Even if you use Vault for inputs, the output of a task might accidentally show sensitive data in the logs (e.g., if a command output included the password). Use no_log: true on any task that deals with sensitive data to prevent its input and output from appearing in logs.

Example: Hiding Sensitive Task Output

The task sets an SNMP community string, which is a sensitive piece of information. By using no_log: true, the output of this task is hidden, ensuring that the community string is not exposed in logs or output.

- name: Set SNMP community
  cisco.ios.ios_snmp_server:
    config:
      communities:
        - name: "{{ vault_snmp_community }}"  # hypothetical vaulted variable
    state: merged
  no_log: true

Benefits: Using no_log: true for sensitive tasks ensures that confidential information is not exposed in logs or output, enhancing the security and privacy of your Ansible playbooks.

Use Git for Version Control (Seriously)

Treat your Ansible code like any other important software project. Use Git for version control from day one. It gives you an audit trail of every change, makes team collaboration possible, lets you easily revert to previous versions if you mess up, and simplifies branching for developing new features.

Use Git for all your Ansible projects. Commit your changes often with clear messages explaining what you did and why. Use a branching strategy (like Git Flow or GitHub Flow) and use Pull/Merge Requests for code reviews before merging changes. Tag production releases so you know exactly which version was deployed.

Make Your Code Readable (Comments and Whitespace)

Nobody likes reading messy, uncommented code – including future you! Readable playbooks, roles, and variable files are much easier to maintain, collaborate on, and troubleshoot when something inevitably goes wrong.

  • Write meaningful comments (#). Explain why you did something, not just what the command does (the command itself shows you that).
  • Use consistent whitespace and indentation (YAML typically uses 2 spaces for indentation).
  • Use blank lines to separate logical sections within files.
  • Use clear, descriptive names for playbooks, roles, tasks, variables, and tags.
  • Follow ansible-lint recommendations for style – it helps keep things consistent.

Next Steps: Automate Smarter

Network automation is absolutely vital for managing today's complex networks. Ansible Galaxy is a huge asset, giving you thousands of ready-to-use roles and collections (like cisco.ios, junipernetworks.junos, arista.eos) that save you serious time and effort. Use this content to automate complex tasks like configuration backups, software upgrades, and compliance checks consistently and at scale.

Pairing these best practices and Galaxy content with CloudMyLab's dedicated automation lab environments is a smart move for practicing. CloudMyLab gives you a safe, realistic sandbox to experiment with Ansible network automations, new Galaxy content, build complex playbooks, and test everything without risking your production networks. You can test playbooks using vendor collections against virtual devices from various manufacturers and hone your skills in a controlled environment before deployment.

Start organizing your projects better, use dynamic inventory where it makes sense, lint your code, separate data from logic with vars, use roles/collections, handle errors properly, secure your secrets, use version control, and keep your code readable. Then, leverage CloudMyLab's hosted lab environments to practice, test, and perfect your automation workflows.

FAQ

Why should I use a specific directory structure in Ansible?

It keeps your project organized, makes it easier for teams to collaborate, and promotes reusability of your automation code (roles, etc.).

What's dynamic inventory and why use it?

It lets Ansible get the list of devices and their details automatically from a Source of Truth (like NetBox or a CMDB) instead of you manually updating static files. It ensures your inventory is always current and scalable.

What does ansible-lint do?

It checks your Ansible code for errors, style issues, and best practice violations before you run it.

Ansible group_vars vs. host_vars?

group_vars applies variables to all hosts in a group. host_vars applies variables to just one specific host (and overrides group vars for that host). They separate configuration data from automation logic.

What is variable precedence in Ansible?

It's the order in which Ansible loads variables from different sources. Higher precedence sources (like command line -e) override lower precedence sources (like group vars or role defaults). Knowing this helps understand variable values.

Why use roles and collections in Ansible?

They are fundamental for organizing your automation, making it reusable (roles), and distributing/consuming content from vendors or communities (collections).

Why is error handling (block, rescue, always) important?

It makes your playbooks more robust, allowing them to handle failures gracefully, log issues, and potentially attempt recovery steps instead of just stopping.

What are Ansible handlers for?

Handlers run when a task that notifies them reports a change. They are used for actions that should only run if a configuration was actually modified, like saving config or restarting a service.

Why use Ansible Vault?

To encrypt sensitive data (passwords, keys) in your files instead of storing them in plain text. It's a critical security practice.

What is no_log: true for in Ansible?

It prevents a task's input and output from showing up in Ansible's logs, which is important for tasks that handle or display sensitive data.

Why use Git for Ansible?

It's essential for version control, tracking changes, collaboration, code review, and easy rollbacks if needed. Treat your automation code like software.