Automated observability with Puppet in a zero-trust environment

Haroon Rafique

Manager
Development and Operations
Student Information Systems, ITS

“If you want everything to be familiar, you will never learn anything new, because it can’t be significantly different from what you already know.”

— Rich Hickey

The Toil Trap

Manual monitoring target updates
Config drift from reality
Onboarding friction from certificate management
YAML sprawl

Zero-Toil Architecture

Self-registration
Automated discovery
Identity-driven
Zero-trust

Implementation Blueprint

Classify nodes automatically during CSR signing
Separate logic from data using Hiera roles
Automate service discovery with exported resources
Secure metrics scraping with Caddy and mTLS
Standardize monitoring with layered exporter profiles

Components in This Reference Stack

OpenVox/OpenVox Server for policy and certificates
OpenVoxDB for exported resource inventory
VictoriaMetrics for metrics storage
Vmagent for metrics scraping
Grafana for visualization

Pattern 1: CSR Classification

Immutable identity requested during provisioning via CSR
Cryptographic proof stored in the certificate extensions
Tamper-proof classification secure against agent-side overrides
Optional autosigning gated by secure policy scripts

What is pp_role, and what values should it contain?

roles
- monitoring_scraper
- webserver

Corresponding Hiera data:

roles
├── monitoring_scraper.yaml
└── webserver.yaml

CSR stands for Certificate Signing Request. At the moment of provisioning, the node requests its specific role via a certificate extension in the CSR.

This embeds the node’s identity cryptographically at birth and creates a tamper-proof source of truth that agents cannot override once the certificate is signed. By keeping node intent attached directly to node identity, we eliminate central merge conflicts in static files and build a pipeline well suited to dynamic, ephemeral fleets.

For demo purposes, I took a shortcut with autosigning, but you can also use a secure policy script to enforce stricter autosigning rules.

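pp_role is one of Puppet’s registered certificate extension names (the pp_ prefix marks Puppet-specific extensions) and holds the node’s role name. Typically, it should contain the role assigned to the individual VM.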

The roles and profiles method is not a feature provided by Puppet itself. It is a best-practice design pattern that provides an interface between your business logic and reusable Puppet modules.

Pattern 2: Pure Data Roles

CSR Extension attributes

/etc/puppetlabs/puppet/csr_attributes.yaml

---
extension_requests:
  pp_role: webserver

Zero logic inside site manifest or node definitions

manifests/site.pp

node default {
  # Lookup classes from Hiera
  lookup('classes', Array[String], 'unique').include
}

Hiera lookup determines config based on pp_role

data/roles/webserver.yaml

---
# Webserver role - Apache with monitoring
# Applied to nodes with pp_role=webserver in csr_attributes.yaml
classes:
- profile::apache
- profile::monitoring::apache_exporter

Hiera configuration

excerpt from hiera.yaml

hierarchy:
- name: "Per-role data (from CSR attributes)"
  path: "roles/%{trusted.extensions.pp_role}.yaml"

Pattern 3: Auto-Discovery

Exported resources publish state to OpenVoxDB

@@file { "/etc/vmagent/targets.d/${facts['networking']['fqdn']}_${name}.yaml":
  ensure  => file,
  content => @("EOT"),
    # Managed by Puppet
    - targets:
        - ${facts['networking']['fqdn']}:9090/${name}/metrics
      labels:
        job: ${name}
        instance: ${facts['networking']['fqdn']}
        exporter: ${name}
    | EOT
  tag     => 'vmagent_target',
}
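
In the exported resource above, `${name}` only has a useful value when the code runs inside a defined type, where it holds the resource title. The following is a minimal sketch of such a wrapper, not the demo's actual code: the type name `profile::monitoring::scrape_target` and the `$proxy_port` parameter are hypothetical, and the default port simply mirrors the 9090 used in the target line.

```puppet
# Hypothetical defined type; the name and parameter are assumptions for illustration.
define profile::monitoring::scrape_target (
  # Port of the local Caddy sidecar (assumed to match the 9090 in the slide).
  Integer[1, 65535] $proxy_port = 9090,
) {
  # $name is the resource title, e.g. 'apache', and doubles as the URL prefix.
  @@file { "/etc/vmagent/targets.d/${facts['networking']['fqdn']}_${name}.yaml":
    ensure  => file,
    content => @("EOT"),
      # Managed by Puppet
      - targets:
          - ${facts['networking']['fqdn']}:${proxy_port}/${name}/metrics
        labels:
          job: ${name}
          instance: ${facts['networking']['fqdn']}
          exporter: ${name}
      | EOT
    tag     => 'vmagent_target',
  }
}
```

An exporter profile could then publish its endpoint with a single declaration such as `profile::monitoring::scrape_target { 'apache': }`, which keeps the target file format in one place.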

Dynamic collection gathers all active targets

# Collect all exported vmagent targets from other nodes
File <<| tag == 'vmagent_target' |>>
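
One way to keep the collected targets in line with reality is to let Puppet own the whole target directory on the scraper. The sketch below assumes a hypothetical class name and reuses the `/etc/vmagent/targets.d` path from the exported resource; it is an illustration, not the demo's implementation.

```puppet
# Hypothetical scraper-side profile; the class name is an assumption.
class profile::monitoring::vmagent_targets {
  # Purge files in the directory that are not in the catalog, so targets
  # that are no longer exported disappear on the next agent run.
  file { '/etc/vmagent/targets.d':
    ensure  => directory,
    purge   => true,
    recurse => true,
  }

  # Collect all exported vmagent targets from other nodes.
  File <<| tag == 'vmagent_target' |>>
}
```

Note that a decommissioned node's exported resources typically stay collectable until the node is deactivated or expired in the database, so purging works best alongside a node cleanup step.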

Pattern 4: Zero-Trust mTLS

Caddy acts as a sidecar reverse proxy and mTLS server

excerpt from Caddyfile

# Authenticate via Puppet CA with mTLS
tls /etc/caddy/node-cert.pem /etc/caddy/node-key.pem {
  client_auth {
    mode require_and_verify
    trust_pool file /etc/caddy/puppet-ca.pem
  }
}
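
The certificate files referenced in this excerpt can be copies of the credentials the agent already holds. The following sketch assumes the default agent ssldir of `/etc/puppetlabs/puppet/ssl`, a `caddy` user and group, and a hypothetical class name; adjust all three to your environment.

```puppet
# Hypothetical profile; ssldir default and 'caddy' ownership are assumptions.
class profile::monitoring::caddy_certs (
  String[1] $ssldir = '/etc/puppetlabs/puppet/ssl',
) {
  $certname = $trusted['certname']

  file {
    default:
      ensure => file,
      owner  => 'caddy',
      group  => 'caddy',
      mode   => '0644',
    ;
    # The node's own certificate, signed by the Puppet CA.
    '/etc/caddy/node-cert.pem':
      source => "${ssldir}/certs/${certname}.pem",
    ;
    # The private key is copied locally on the agent and never enters the catalog.
    '/etc/caddy/node-key.pem':
      source => "${ssldir}/private_keys/${certname}.pem",
      mode   => '0600',
    ;
    # CA certificate used to verify scraper client certificates.
    '/etc/caddy/puppet-ca.pem':
      source => "${ssldir}/certs/ca.pem",
    ;
  }
}
```

Because `source` points at local absolute paths, the copy happens on the node itself, which is what lets the existing Puppet PKI double as the mTLS trust root.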

Import metrics exporter configurations
```
import /etc/caddy/conf.d/*.caddy
```

Proxy the exporter traffic

excerpt from apache.caddy

handle_path /apache* {
  reverse_proxy localhost:9117
}
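
A snippet like this can be delivered by the exporter's own profile, so that adding a role also adds its proxy route. This is a rough sketch under stated assumptions: the class name is hypothetical, the path follows the `/etc/caddy/conf.d/*.caddy` import shown earlier, and reloading Caddy afterwards is left to whatever service handling your environment uses.

```puppet
# Hypothetical class; only the file path and snippet body come from the slides.
class profile::monitoring::apache_exporter_proxy {
  file { '/etc/caddy/conf.d/apache.caddy':
    ensure  => file,
    mode    => '0644',
    # Non-interpolating heredoc: the Caddy braces are written out literally.
    content => @(CADDY),
      handle_path /apache* {
        reverse_proxy localhost:9117
      }
      | CADDY
  }
}
```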

Only allow legitimate scrapers

@authorized_scrapers_snippet {
  expression \
    {tls_client_subject} == "CN=vmagent.local"
}

Block unauthorized scrapers
```
handle {
  abort
}
```

Pattern 5: Layered Observability

Common profile applies to all nodes

data/common.yaml

classes:
- profile::base

Baseline metrics applied to every node

site/profile/manifests/base.pp

class profile::base {
  include profile::monitoring::node_exporter
}

Role-specific exporters are added based on workload

data/roles/webserver.yaml

classes:
- profile::apache
- profile::monitoring::apache_exporter

The Automated Pipeline

Host Lifecycle

Host boots and starts OpenVox agent

Agent submits CSR with role intent

Host applies role-specific profiles

Exporters publish scrape endpoints

System Response

CA auto-signs based on secure policy

Hiera maps data classes to the host

OpenVoxDB stores the exported assets

Vmagent polls the target automatically

Security Guardrails For Production

Strict validation of role-bearing CSRs.
Role restrictions to prevent privilege escalation.
Least-privilege paths for authorized scrapers only.
Continuous auditing of exported resources.
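
The role-restriction guardrail could be backed by a compile-time check along these lines. This is a minimal sketch, not part of the demo: the class name `profile::role_guard` and the `$allowed_roles` parameter are hypothetical, and the default list just reuses the two role names from this deck. It complements validation at signing time and does not replace it.

```puppet
# Hypothetical guard class; name and parameter are assumptions for illustration.
class profile::role_guard (
  # Assumed allowlist, taken from the role names used in this deck.
  Array[String[1]] $allowed_roles = ['monitoring_scraper', 'webserver'],
) {
  # $trusted is derived from the signed certificate, not from agent-supplied facts.
  $role = $trusted.dig('extensions', 'pp_role')

  # Stop catalog compilation for a missing or unexpected role.
  if $role =~ Undef or !($role in $allowed_roles) {
    fail("Refusing to compile for ${trusted['certname']}: pp_role '${role}' is not allowed")
  }
}
```

Listing the class in `data/common.yaml` would apply the check to every node, and overriding `profile::role_guard::allowed_roles` in Hiera keeps the allowlist in data rather than code.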

Does It Work?

Initial State: Verify current targets are up.
The Trigger: Provision a new webserver node.
The Magic: Watch Puppet execute automatic classification.
The Result: The new scrape target populates automatically.
The Payoff: View the live Grafana dashboards.

Demo

Operational Impact

Scales from isolated labs to large production fleets.
Adapts dynamically to volatile node churn and auto-scaling.
Accelerates visibility by establishing instant scrape targets.
Minimizes effort by keeping operational overhead flat as the nodes scale.

Engineering Anti-Patterns

Mutable scripts installing runtime packages
Floating tags using latest in production
Hard-coded credentials in compose files
Copy-pasted logic repeated across services
Ungoverned attributes trusted without policy controls

To wrap things up, when you visit the Git repo, you will notice that I took some shortcuts while building out this live demo environment.

You really don’t want to install packages at runtime in your Docker containers. I did that solely to demonstrate the automation. Ideally, you would use prebuilt images.

In docker-compose, I refer to the latest tag. In production, I would use immutable tags so that I am always running the same version of a service. The third bullet, hard-coded credentials in compose files, is a no-no.

There is a lot of massaging of how services start, stop, and restart. That is all due to the unavailability of systemd in Docker containers.

Treating this slide as a transition into future hardening work helps ensure your automation doesn’t accidentally introduce security blind spots or fragile dependencies.

Key Takeaways

Automate observability from the exact second a node boots
Eliminate bottlenecks by moving classification to CSRs
Drop manual updates using native exported resources
Recycle existing security to achieve zero-trust mTLS
Scale your infrastructure without increasing team toil

Bonus Slide: System Architecture

Questions?