Cloudflare Firewall Rule to block bots from accessing multiple domains

Context

As part of an internal WAF module for Cloudflare in Terraform, I had to implement a rule that blocks bots from accessing and scraping certain domains. Because it is a module, the implementation does not know in advance which domains will be specified, so the resource definition has to be generic enough to accommodate any number of domains.

Solution

After several iterations and failures, I landed on an implementation leveraging the following Terraform string functions:

formatlist: Produces a list of strings by applying a format string to each element of a list of input values
join: Produces a single string by concatenating the elements of a list of strings with a given separator

Building a filter matching a variable number of domains

To build the section of the filter that matches a single domain, we can use:

"http.host contains \"example\""

Since the number of domains is only known when the module is invoked, a single hard-coded condition is not enough: we need one condition per domain, so that the filter matches incoming traffic trying to reach any of them. This is where the formatlist function comes into play:

formatlist("http.host contains \"%s\"", var.domains_without_bots)

The expression above generates a new list in which each string has the form:

http.host contains "<domain>"
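
To make the intermediate value concrete, here is a small sketch of what formatlist returns for two sample domains. The local names example_domains and example_conditions are hypothetical and not part of the module. Note that the backslashes are only HCL escaping, so the resulting strings contain plain double quotes.

```hcl
locals {
  # Hypothetical sample input, standing in for var.domains_without_bots.
  example_domains = ["example.com", "test.example.io"]

  # One condition per domain.
  example_conditions = formatlist("http.host contains \"%s\"", local.example_domains)
  # example_conditions evaluates to a list of two strings:
  #   http.host contains "example.com"
  #   http.host contains "test.example.io"
}
```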

A list of strings is not a valid filter expression on its own. We need to combine these strings into a single coherent filter expression by using the join function:

join(" or ", formatlist("http.host contains \"%s\"", var.domains_without_bots))

For two domains, this would generate:

http.host contains "<domain1>" or http.host contains "<domain2>"

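Lastly, the rule should only apply when the traffic comes from a bot. Cloudflare exposes this as the cf.client.bot field, which is true for requests from known good bots and crawlers, so the filter expression becomes: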

"(cf.client.bot and (${join(" or ", formatlist("http.host contains \"%s\"", var.domains_without_bots))}))"

Final Terraform definition

Combining everything above, the filter definition looks as follows:

resource "cloudflare_filter" "domains_without_bots" {
  zone_id = var.zone_id

  description = "Filter bots trying to access domains: ${join(", ", var.domains_without_bots)}"

  expression = "(cf.client.bot and (${join(" or ", formatlist("http.host contains \"%s\"", var.domains_without_bots))}))"
}
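
The inline expression above is dense, so it may help to build it in steps. The following is a sketch of one way to do that with local values. The names host_conditions, any_host, and bot_filter_expression are hypothetical, and the comments show the result assuming var.domains_without_bots is ["example.com", "test.example.io"].

```hcl
locals {
  # Step 1: one "http.host contains ..." condition per domain.
  host_conditions = formatlist("http.host contains \"%s\"", var.domains_without_bots)

  # Step 2: combine the conditions so that any one of them can match.
  #   http.host contains "example.com" or http.host contains "test.example.io"
  any_host = join(" or ", local.host_conditions)

  # Step 3: only match when the request also comes from a bot.
  #   (cf.client.bot and (http.host contains "example.com" or http.host contains "test.example.io"))
  bot_filter_expression = "(cf.client.bot and (${local.any_host}))"
}
```

With these locals in place, the expression argument could be set to local.bot_filter_expression instead of the inline string. Either way, the result is the same cloudflare_filter resource.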

The filter is then used by the Cloudflare firewall rule definition, which references the filter ID dynamically:

resource "cloudflare_firewall_rule" "domains_without_bots" {
  zone_id = var.zone_id

  description = "Block bots trying to access domains: ${join(", ", var.domains_without_bots)}"

  filter_id = cloudflare_filter.domains_without_bots.id

  action = "block"
}

Invoking the module would then look like:

module "waf" {
  source  = "/path/to/cloudflare/waf"
  zone_id = var.zone_id

  domains_without_bots = [
    "example.com",
    "test.example.io",
  ]
}
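
The invocation above assumes the module declares matching input variables. The module's own declarations are not shown here, so the following is only a sketch of what they might look like. The validation block is an addition of mine: with an empty list, join would return an empty string and leave an empty pair of parentheses in the expression, which is unlikely to be a valid filter, so rejecting that input early seems reasonable. Custom validation rules require Terraform 0.13 or later.

```hcl
variable "zone_id" {
  type        = string
  description = "ID of the Cloudflare zone the rule is created in."
}

variable "domains_without_bots" {
  type        = list(string)
  description = "Domains (or domain fragments) that bots should be blocked from."

  # Assumption: an empty list should be rejected rather than silently
  # producing an expression with nothing inside the inner parentheses.
  validation {
    condition     = length(var.domains_without_bots) > 0
    error_message = "At least one domain must be provided."
  }
}
```

Because the filter uses contains, each entry is matched as a substring of the host, so "example.com" would also match "www.example.com".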