Datadog Setup

Datadog is the monitoring and alerting platform used across NHS BSA services. It collects AWS service metrics, plus application metrics and logs from NHS BSA workloads running on AWS.

Your project services should be integrated with the NHSBSA_BSACloud_V2 Datadog organisation.

Access

To request access, raise a service desk request as an ITSM Halo ticket. A role is assigned based on your requirements. You then receive an invitation link and set a password on your first visit.


NHS BSA Naming Conventions and Useful Variables

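To align with NHS BSA resource naming conventions, define a local value that acts as a format string for resource names. The three %s placeholders are filled in per resource with format():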

locals {
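  # terraform.workspace selects the environment key, as in the other examples in this guide
  name_prefix = "${var.service}-${var.env_name[terraform.workspace]}-%s-%s-%s"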
}

Apply this convention consistently across all resources.
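
As an illustration of how the template expands, the sketch below formats one name. It assumes the default service value (mccloud) and an environment key of stage. The local example_secret_name is hypothetical and exists only for this illustration.

```hcl
locals {
  # Hypothetical local, shown only to illustrate the expansion.
  # format() fills the three %s placeholders left to right. In this guide's
  # examples they hold a short resource type, a descriptive name and a
  # two-digit sequence number.
  example_secret_name = format(local.name_prefix, "sm", "datadog-configuration", "01")

  # With service = "mccloud" and the environment key "stage" (mapped to "stg"),
  # this evaluates to "mccloud-stg-sm-datadog-configuration-01".
}
```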

Define the following reusable variables to avoid duplication:

variable "service" {
  type    = string
  default = "mccloud" # <-- Might be different
}

variable "department" {
  type    = string
  default = "nhs-workforce-services" # <-- Might be different
}

variable "service_line" {
  type    = string
  default = "pensions-services" # <-- Might be different
}

variable "env_name" {
  type = map(string)
  default = {
    dev   = "dev"
    test  = "tst"
    stage = "stg"
    prod  = "pro"
  }
}

# Datadog variables
variable "datadog_settings" {
  type = map(string)
  default = {
    account_id = "464622532012" # Shared Datadog AWS account ID
  }
}

AWS Account Integration

Configure this integration through Terraform to collect CloudWatch metrics and events from AWS services. It operates at the account level, not the service level. If multiple services run in one account, configure the integration once.

API Key

After signing in, you can find API keys for each AWS account at: https://app.datadoghq.com/organization-settings/api-keys

Store the API key in AWS Secrets Manager. After applying the Terraform below, update the secret value manually in the AWS Console.

resource "aws_secretsmanager_secret" "datadog" {
  name        = format(local.name_prefix, "sm", "datadog-configuration", "01")
  description = "Datadog configuration"
}

resource "aws_secretsmanager_secret_version" "datadog" {
  secret_id     = aws_secretsmanager_secret.datadog.id
  secret_string = "placeholder" // gitleaks:allow
  lifecycle {
    ignore_changes = [
      secret_string
    ]
  }
}

Terraform Provider

To use Datadog resources, define the Datadog provider:

terraform {
  required_version = ">= 0.15.0"
  required_providers {
    datadog = {
      source  = "DataDog/datadog"
      version = "~> 3.78.0"
    }
  }
}

provider "datadog" {
  api_key = data.aws_secretsmanager_secret_version.datadog_configuration.secret_string
}

If you have configured the aws_secretsmanager_secret resource above, you can reference it in the provider configuration.
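
The provider block reads the key from a data source named datadog_configuration, which is not declared elsewhere in this guide. A minimal sketch of that data source, assuming the secret is the aws_secretsmanager_secret.datadog resource defined earlier:

```hcl
# Reads the current version of the secret created earlier in this guide.
# The data source name matches the reference used in the provider block.
data "aws_secretsmanager_secret_version" "datadog_configuration" {
  secret_id = aws_secretsmanager_secret.datadog.id
}
```

Until the placeholder value has been replaced with the real API key, this data source returns the placeholder, so Datadog resources are likely to fail authentication on the first run.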

Note: app_key is configured globally at the GitLab level for all projects as the DD_APP_KEY environment variable. Terraform picks it up automatically.

Integration Resources

See Datadog AWS Integration for implementation details.

resource "datadog_integration_aws_external_id" "dd_external_id" {}

resource "datadog_integration_aws_account" "aws_integration" {
  account_tags = [
    "service:${var.service}",
    "env:${var.env_name[terraform.workspace]}",
    "department:${var.department}",
    "service_line:${var.service_line}",
    "business_service:${var.service}",
  ]
  aws_account_id = data.aws_caller_identity.current.account_id
  aws_partition  = "aws"

  aws_regions {
    include_only = [data.aws_region.current.name]
  }

  auth_config {
    aws_auth_config_role {
      role_name   = module.datadog_role.iam_role_name
      external_id = datadog_integration_aws_external_id.dd_external_id.id
    }
  }

  logs_config {
    lambda_forwarder {}
  }

  resources_config {
    cloud_security_posture_management_collection = true
    extended_collection                          = true
  }

  metrics_config {
    automute_enabled          = true
    collect_cloudwatch_alarms = true
    collect_custom_metrics    = true
    enabled                   = true
    namespace_filters {
      exclude_only = []
    }
  }

  traces_config {
    xray_services {
      include_only = []
    }
  }
}
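
The integration resource references two AWS data sources that are not declared in this guide. If your project does not already define them, a minimal sketch is:

```hcl
# Account ID of the AWS account Terraform is currently authenticated against.
data "aws_caller_identity" "current" {}

# Region configured on the AWS provider.
data "aws_region" "current" {}
```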

IAM Role and Policy

The full list of required IAM permissions is documented in the Datadog AWS IAM permissions reference.

module "datadog_iam_policy" {
  source  = "terraform-aws-modules/iam/aws//modules/iam-policy"
  version = "= 5.55.0"

  name        = format(local.name_prefix, "datadog", "AWS-Integration-policy", "01")
  path        = "/"
  description = "Policy for Datadog AWS Integration"

  policy = data.aws_iam_policy_document.datadog_iam_policy.json
}

data "aws_iam_policy_document" "datadog_iam_policy" {
  statement {
    sid    = "DatadogIAMPolicy"
    effect = "Allow"
    actions = [
      "account:GetAccountInformation",
      "airflow:GetEnvironment",
      "airflow:ListEnvironments",
      "apigateway:GET",
      "autoscaling:Describe*",
      "backup:List*",
      "bcm-data-exports:GetExport",
      "bcm-data-exports:ListExports",
      "budgets:ViewBudget",
      "cloudfront:GetDistributionConfig",
      "cloudfront:ListDistributions",
      "cloudtrail:DescribeTrails",
      "cloudtrail:GetTrail",
      "cloudtrail:GetTrailStatus",
      "cloudtrail:ListTrails",
      "cloudtrail:LookupEvents",
      "cloudwatch:Describe*",
      "cloudwatch:Get*",
      "cloudwatch:List*",
      "codedeploy:BatchGet*",
      "codedeploy:List*",
      "cur:DescribeReportDefinitions",
      "directconnect:Describe*",
      "dynamodb:Describe*",
      "dynamodb:List*",
      "ec2:Describe*",
      "ecs:Describe*",
      "ecs:List*",
      "eks:DescribeCluster",
      "eks:ListClusters",
      "elasticache:Describe*",
      "elasticache:List*",
      "elasticfilesystem:DescribeAccessPoints",
      "elasticfilesystem:DescribeFileSystems",
      "elasticfilesystem:DescribeTags",
      "elasticloadbalancing:Describe*",
      "elasticmapreduce:Describe*",
      "elasticmapreduce:List*",
      "es:DescribeElasticsearchDomains",
      "es:ListDomainNames",
      "es:ListTags",
      "events:CreateEventBus",
      "fsx:DescribeFileSystems",
      "fsx:ListTagsForResource",
      "health:DescribeAffectedEntities",
      "health:DescribeEventDetails",
      "health:DescribeEvents",
      "iam:ListAccountAliases",
      "kinesis:Describe*",
      "kinesis:List*",
      "lambda:List*",
      "logs:DeleteSubscriptionFilter",
      "logs:DescribeDeliveries",
      "logs:DescribeDeliverySources",
      "logs:DescribeLogGroups",
      "logs:DescribeLogStreams",
      "logs:DescribeSubscriptionFilters",
      "logs:FilterLogEvents",
      "logs:GetDeliveryDestination",
      "logs:PutSubscriptionFilter",
      "logs:TestMetricFilter",
      "network-firewall:DescribeLoggingConfiguration",
      "network-firewall:ListFirewalls",
      "oam:ListAttachedLinks",
      "oam:ListSinks",
      "organizations:Describe*",
      "organizations:List*",
      "rds:Describe*",
      "rds:List*",
      "redshift-serverless:ListNamespaces",
      "redshift:DescribeClusters",
      "redshift:DescribeLoggingStatus",
      "route53:List*",
      "s3:GetBucketLocation",
      "s3:GetBucketLogging",
      "s3:GetBucketNotification",
      "s3:GetBucketTagging",
      "s3:ListAllMyBuckets",
      "s3:PutBucketNotification",
      "ses:Get*",
      "ses:List*",
      "sns:GetSubscriptionAttributes",
      "sns:List*",
      "sns:Publish",
      "sqs:ListQueues",
      "ssm:GetServiceSetting",
      "ssm:ListCommands",
      "states:DescribeStateMachine",
      "states:ListStateMachines",
      "support:DescribeTrustedAdvisor*",
      "support:RefreshTrustedAdvisorCheck",
      "tag:GetResources",
      "tag:GetTagKeys",
      "tag:GetTagValues",
      "timestream:DescribeEndpoints",
      "wafv2:ListLoggingConfigurations",
      "xray:BatchGetTraces",
      "xray:GetTraceSummaries"
    ]
    resources = [
      "*"
    ]
  }
}

module "datadog_role" {
  source  = "terraform-aws-modules/iam/aws//modules/iam-assumable-role"
  version = "= 5.55.0"

  create_role         = true
  role_name           = format(local.name_prefix, "datadog", "AWS-Integration-role", "01")
  role_description    = "Role assumed by the external Datadog AWS account for the integration"
  role_requires_mfa   = false
  role_sts_externalid = [datadog_integration_aws_external_id.dd_external_id.id]

  trusted_role_arns = [
    "arn:aws:iam::${var.datadog_settings["account_id"]}:root"
  ]

  custom_role_policy_arns = [
    module.datadog_iam_policy.arn,
  ]
}
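
To make the trust relationship explicit, the sketch below shows roughly what the role module configures: the Datadog AWS account may assume the role only when it presents the external ID generated by the integration. This is an illustration and not the module's exact output (the module may allow additional STS actions). The data source name datadog_trust_sketch is hypothetical.

```hcl
# Illustration only: an approximation of the trust policy created by the role module.
data "aws_iam_policy_document" "datadog_trust_sketch" {
  statement {
    effect  = "Allow"
    actions = ["sts:AssumeRole"]

    # The shared Datadog AWS account is the only trusted principal.
    principals {
      type        = "AWS"
      identifiers = ["arn:aws:iam::${var.datadog_settings["account_id"]}:root"]
    }

    # The external ID ties the role to this specific Datadog integration.
    condition {
      test     = "StringEquals"
      variable = "sts:ExternalId"
      values   = [datadog_integration_aws_external_id.dd_external_id.id]
    }
  }
}
```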

NHS BSA Service Integration

The two main AWS service types used across NHS BSA are ECS Fargate and AWS Lambda.

ECS Fargate

ECS tasks are configured with two sidecar containers to ship logs, metrics, and traces to Datadog:

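  • AWS FireLens (a log router running Fluent Bit with its Datadog output plugin): ships logs directly to Datadog.
  • Datadog Agent: collects container metrics via the ECS task metadata endpoint and receives traces (port 8126) and DogStatsD metrics (port 8125) from the application container.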

https://www.datadoghq.com/architecture/using-datadog-with-ecs-fargate/

The task definition below shows a representative example with the application container, FireLens log router, and Datadog Agent sidecar.

[
  {
    "essential": true,
    "image": "amazon/aws-for-fluent-bit:stable",
    "name": "log_router",
    "cpu": 0,
    "user": "0",
    "firelensConfiguration":{
        "type": "fluentbit",
        "options" :{
            "enable-ecs-log-metadata":"true",
            "config-file-type": "file",
            "config-file-value": "/fluent-bit/configs/parse-json.conf"
        }
    },
    "logConfiguration": {
      "logDriver": "awslogs",
      "options": {
        "awslogs-group": "${log_group}",
        "awslogs-region": "${region}",
        "awslogs-stream-prefix": "<app_name>-fluent-bit"
      }
    },
    "readonlyRootFilesystem": true
  },
  {
    "essential": true,
    "image": "app-image",
    "name": "...",
    "memory": "...",
    "cpu": "...",
    "readonlyRootFilesystem": "...",
    "mountPoints": [
      {
        "sourceVolume": "...",
        "containerPath": "...",
        "readOnly": "..."
      }
    ],
    "logConfiguration": {
      "logDriver": "awsfirelens",
      "options": {
          "Name": "datadog",
          "Host": "http-intake.logs.datadoghq.com",
          "TLS": "on",
          "dd_service": "${service}-${env}-<app_name>-ui",
          "dd_source": "nodejs",
          "dd_tags": "env:${env},business_service:${service},component:${service}-${env}-<app_name>-ui,service_line:${service_line},department:${department}",
          "provider": "ecs",
          "retry_limit": "2"
      },
      "secretOptions": [
        {
          "name": "apikey",
          "valueFrom": "${datadog_api_key}"
        }
      ]
    },
    "portMappings": [
      {
        "protocol": "...",
        "appProtocol": "...",
        "name": "...",
        "containerPort": "...",
        "hostPort": "..."
      }
    ],
    "environment": [
      {
        "name": "DD_ENV",
        "value": "${env}"
      },
      {
        "name": "DD_SERVICE",
        "value": "${service}-${env}-<app_name>-ui"
      },
      {
        "name": "DD_PROFILING_ENABLED",
        "value": "true"
      },
      {
        "name": "DD_VERSION",
        "value": "${tag}"
      }
    ],
    "dockerLabels": {
      "com.datadoghq.ad.instances": "[{\"host\": \"%%host%%\", \"port\": ${port}}]",
      "com.datadoghq.ad.check_names": "[\"${service}-${env}-<app_name>-ui\"]",
      "com.datadoghq.ad.init_configs": "[{}]",
      "com.datadoghq.tags.env": "${env}",
      "com.datadoghq.tags.service": "${service}-${env}-<app_name>-ui",
      "com.datadoghq.tags.version": "${tag}"
    }
  },
  {
    "image": "public.ecr.aws/datadog/agent:latest",
    "name": "datadog-agent",
    "essential": true,
    "cpu": 0,
    "readonlyRootFilesystem": true,
    "logConfiguration": {
      "logDriver": "awslogs",
      "options": {
        "awslogs-group": "${log_group}",
        "awslogs-region": "${region}",
        "awslogs-stream-prefix": "<app_name>-datadog-agent"
      }
    },
    "mountPoints": [
      {
        "sourceVolume": "agent_conf",
        "containerPath": "/etc/datadog-agent",
        "readOnly": null
      },
      {
        "sourceVolume": "datadog",
        "containerPath": "/opt/datadog-agent/run",
        "readOnly": false
      },
      {
        "sourceVolume": "datadog",
        "containerPath": "/var/log",
        "readOnly": false
      },
      {
        "sourceVolume": "datadog",
        "containerPath": "/var/lib",
        "readOnly": false
      },
      {
        "sourceVolume": "datadog",
        "containerPath": "/app",
        "readOnly": false
      },
      {
        "sourceVolume": "datadog",
        "containerPath": "/tmp",
        "readOnly": false
      },
      {
        "sourceVolume": "datadog",
        "containerPath": "/root",
        "readOnly": false
      }
    ],
    "environment": [
      {
        "name": "ECS_FARGATE",
        "value": "true"
      },
      {
        "name": "DD_APM_ENABLED",
        "value": "true"
      },
      {
        "name": "DD_APM_NON_LOCAL_TRAFFIC",
        "value": "true"
      },
      {
        "name": "DD_DOGSTATSD_NON_LOCAL_TRAFFIC",
        "value": "true"
      }
    ],
    "portMappings": [
        {
          "hostPort": 8126,
          "protocol": "tcp",
          "containerPort": 8126
        },
        {
          "hostPort": 8125,
          "protocol": "udp",
          "containerPort": 8125
        }
    ],
    "secrets": [
      {
        "name": "DD_API_KEY",
        "valueFrom": "${datadog_api_key}"
      }
    ]
  }
]
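
The JSON above is a template: placeholders such as ${env} and ${datadog_api_key} are filled in by Terraform, and the Datadog Agent container mounts two task-level volumes. The sketch below shows one way to render it. The file name task-definition.json.tpl, the variables ecs_execution_role_arn, ecs_log_group_name and image_tag, the "ecs" name segment, and the sizing and port values are assumptions made for illustration.

```hcl
resource "aws_ecs_task_definition" "app" {
  family                   = format(local.name_prefix, "ecs", "<app_name>", "01")
  requires_compatibilities = ["FARGATE"]
  network_mode             = "awsvpc" # Required for Fargate
  cpu                      = 512      # Illustrative sizing only
  memory                   = 1024

  # Hypothetical variable: the execution role must be able to pull images,
  # write to CloudWatch Logs and read the Datadog secret.
  execution_role_arn = var.ecs_execution_role_arn

  # Every ${...} placeholder used in the JSON template needs a value here.
  container_definitions = templatefile("${path.module}/task-definition.json.tpl", {
    log_group       = var.ecs_log_group_name # hypothetical variable
    region          = data.aws_region.current.name
    service         = var.service
    env             = var.env_name[terraform.workspace]
    service_line    = var.service_line
    department      = var.department
    tag             = var.image_tag # hypothetical variable
    port            = 3000          # illustrative container port
    datadog_api_key = aws_secretsmanager_secret.datadog.arn
  })

  # Task-scoped volumes referenced by the Datadog Agent mountPoints.
  volume {
    name = "agent_conf"
  }

  volume {
    name = "datadog"
  }
}
```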

Lambda Functions

Lambda functions use the Datadog Lambda Library as a layer to ship logs, metrics, and traces to Datadog.

https://www.datadoghq.com/blog/tracing-lambda-datadog-apm/

Layer ARNs are published in the Datadog AWS account. Available versions are listed at https://github.com/DataDog/datadog-lambda-js/releases.

For NHS BSA, the layer ARN is typically:

arn:aws:lambda:eu-west-2:464622532012:layer:Datadog-Node22-x:<version>
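
To avoid repeating the ARN in every function, it can be assembled once from values already defined in this guide. This is a sketch; the local names datadog_layer_version and datadog_node_layer_arn are hypothetical, and it assumes a data "aws_region" "current" data source is declared.

```hcl
locals {
  # Placeholder: choose a layer version from the releases page linked above.
  datadog_layer_version = "<version>"

  # The layer lives in the same shared Datadog AWS account as
  # var.datadog_settings["account_id"]; the region comes from the AWS provider.
  datadog_node_layer_arn = "arn:aws:lambda:${data.aws_region.current.name}:${var.datadog_settings["account_id"]}:layer:Datadog-Node22-x:${local.datadog_layer_version}"
}
```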

See the Node.js instrumentation guide for implementation details.

module "lambda_<name>_handler" {
  source  = "terraform-aws-modules/lambda/aws"
  version = "= 7.20.1" # https://registry.terraform.io/modules/terraform-aws-modules/lambda/aws/latest

  function_name = format(local.name_prefix, "lam", "<name>", "01")

  # Datadog handler — after initialisation it delegates to the handler defined in DD_LAMBDA_HANDLER
  handler = "/opt/nodejs/node_modules/datadog-lambda-js/handler.handler"
  layers  = [
    "arn:aws:lambda:eu-west-2:464622532012:layer:Datadog-Node22-x:<version>",
  ]

  environment_variables = {
    DD_PROFILING_ENABLED  = true
    DD_LOGS_ENABLED       = true
    DD_TRACE_ENABLED      = true
    DD_LAMBDA_HANDLER     = "index.handler" # Application handler
    DD_API_KEY_SECRET_ARN = aws_secretsmanager_secret.datadog.arn
    DD_SITE               = "datadoghq.com"
    DD_ENV                = var.env_name[terraform.workspace]
    DD_SERVICE            = "${var.service}-${var.env_name[terraform.workspace]}-lambda-<name>-api"
    DD_FLUSH_TO_LOG       = true
  }
}
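
DD_API_KEY_SECRET_ARN only works if the function's execution role is allowed to read the secret. The sketch below shows the statement that is needed. The data source name lambda_datadog_secret is hypothetical, and how the policy is attached depends on your module configuration (the Lambda module exposes policy inputs such as attach_policy_json and policy_json for this purpose).

```hcl
# Grants read access to the Datadog API key secret defined earlier in this guide.
data "aws_iam_policy_document" "lambda_datadog_secret" {
  statement {
    sid       = "ReadDatadogApiKey"
    effect    = "Allow"
    actions   = ["secretsmanager:GetSecretValue"]
    resources = [aws_secretsmanager_secret.datadog.arn]
  }
}
```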

DD_* Variables Reference

The table below documents all DD_* variables referenced in this guide.

| Variable | Scope | Required | Purpose | Example |
| --- | --- | --- | --- | --- |
| DD_APP_KEY | CI/CD (Terraform) | Yes | Datadog application key used by the Datadog Terraform provider. | Set as masked CI variable |
| DD_API_KEY | ECS Datadog Agent | Yes | API key used by the Datadog Agent sidecar to authenticate to Datadog. | Loaded from ECS secret datadog_api_key |
| DD_API_KEY_SECRET_ARN | Lambda runtime | Yes | Secret ARN that the Datadog Lambda library reads to obtain the API key. | arn:aws:secretsmanager:...:secret:datadog... |
| DD_SITE | Lambda runtime | Yes | Datadog site endpoint used by serverless instrumentation. | datadoghq.com |
| DD_ENV | ECS app + Lambda runtime | Yes | Reserved Datadog tag for deployment environment. | stg |
| DD_SERVICE | ECS app + Lambda runtime | Yes | Reserved Datadog tag identifying the service name. | mccloud-stg-lambda-example-api |
| DD_VERSION | ECS app | Yes | Reserved Datadog tag for application version/release. | ${tag} |
| DD_PROFILING_ENABLED | ECS app + Lambda runtime | Optional | Enables the Datadog continuous profiler. | true |
| DD_LOGS_ENABLED | Lambda runtime | Optional | Enables Datadog log forwarding/enrichment for Lambda. | true |
| DD_TRACE_ENABLED | Lambda runtime | Optional | Enables Datadog distributed tracing for Lambda. | true |
| DD_LAMBDA_HANDLER | Lambda runtime | Yes | Points the Datadog wrapper handler to the real application handler. | index.handler |
| DD_FLUSH_TO_LOG | Lambda runtime | Optional | Flushes telemetry payloads to CloudWatch logs for forwarding. | true |
| DD_APM_ENABLED | ECS Datadog Agent | Optional | Enables APM collection in the Datadog Agent. | true |
| DD_APM_NON_LOCAL_TRAFFIC | ECS Datadog Agent | Optional | Allows APM intake from other containers/tasks, not only localhost. | true |
| DD_DOGSTATSD_NON_LOCAL_TRAFFIC | ECS Datadog Agent | Optional | Allows DogStatsD metrics from other containers/tasks, not only localhost. | true |

Tags

Consistent tagging is essential for filtering and grouping telemetry in Datadog. Datadog reserved tags (env, service, version) should be set using their corresponding environment variables (DD_ENV, DD_SERVICE, DD_VERSION).

| Source | How tags are set |
| --- | --- |
| DD_ENV | Set in the Lambda handler / ECS task definition |
| DD_SERVICE | Set in the Lambda handler / ECS task definition |
| DD_VERSION | Set at deployment time when uploading the Lambda package |
| Other tags | Sourced automatically from AWS resource tags via the AWS integration |

The table below lists Datadog tag names, examples, and their expected format:

| Datadog Tag | Example | Format |
| --- | --- | --- |
| env | stg | short form |
| department | nhs-workforce-services | long form |
| service_line | pensions-services | long form |
| business_service | mccloud | long form |
| component | web-app | short form |
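
Because the non-reserved tags are picked up from AWS resource tags, one way to keep them consistent is a shared tag map merged into each resource. This is a sketch: the local common_tags, the example log group and its "cwl" name segment are hypothetical, and it assumes the AWS tag keys match the Datadog tag names listed above.

```hcl
locals {
  # Hypothetical local: keys mirror the Datadog tag names used in this guide.
  common_tags = {
    env              = var.env_name[terraform.workspace] # short form, for example "stg"
    department       = var.department
    service_line     = var.service_line
    business_service = var.service
  }
}

# Example use on a resource; component is set per resource.
resource "aws_cloudwatch_log_group" "example" {
  name = format(local.name_prefix, "cwl", "example", "01")
  tags = merge(local.common_tags, { component = "web-app" })
}
```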

Monitors and Alerting

Monitors are managed by the Live Support Team. Monitor Terraform definitions are maintained in a dedicated Datadog repository, separate from the service infrastructure repository. For guidance on defining monitors for your service, contact the Live Support Team; Ryan Menzies (Ryan.Menzies@nhsbsa.nhs.uk) is the suggested first point of contact.

Example Monitor Definition

Define monitors in Terraform in each project's dedicated Datadog repository. Implement them for Lambda and ECS Fargate applications in stage and prod environments.

Implement Synthetic tests to validate key user journeys and detect broken pages or structural issues in the application UI.

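resource "datadog_monitor" "monitors" {
  # The message below uses each.value, which is only valid when for_each is set.
  # Assumption: local.monitors is a hypothetical map of monitor definitions.
  for_each = local.monitors

  name  = "..."
  type  = "..."
  query = "..."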

  monitor_thresholds {
    critical = "..."
  }

  message = <<-EOT
    {{#is_alert}}
    ## "${var.service} - <monitor_name> - <environment> is above {{threshold}} on {{business_service.name}}."

    Please investigate:
    ${local.email_to}

    ${each.value.message_additional_info}

    Users Notified:
    ${local.email_cc}
    {{/is_alert}}

    {{#is_recovery}}
    ## "${var.service} - <monitor_name> - <environment> has recovered on {{business_service.name}}."

    Users Notified:
    ${local.email_to}
    ${local.email_cc}
    {{/is_recovery}}
    EOT

}

Replace the <monitor_name> and <environment> placeholders in the message template with the monitor name and the target environment.
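
The message template references local.email_to and local.email_cc, which are not defined in this guide. A minimal sketch, assuming they hold Datadog notification handles; the addresses shown are placeholders:

```hcl
locals {
  # Placeholder recipients. Datadog notifies an email address when it is
  # written in the monitor message with a leading "@".
  email_to = "@live-support@example.com"
  email_cc = "@service-team@example.com"
}
```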

Reference examples: https://gitlab.com/nhsbsa/platform-services/terraform/datadog

For Synthetic tests, Datadog source IP addresses must be allowlisted. Review the list below periodically against the published ranges, as Datadog may change them:

# Source: https://ip-ranges.datadoghq.com/synthetics.json
variable "datadog_synthetics_ip_ranges" {
  description = "List of IP ranges (CIDR) Datadog uses for Synthetic Monitoring in eu-west-2"
  type        = list(string)
  default = [
    "3.41.0.88/29",
    "18.130.113.168/32",
    "35.176.195.46/32",
    "35.177.43.250/32",
  ]
}
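
How the ranges are allowlisted depends on what fronts the application. As one possibility, the sketch below assumes an AWS WAF web ACL protects a regional resource and publishes the ranges as an IP set that an allow rule can reference. The resource name and the "waf" name segment are hypothetical.

```hcl
# IP set containing the Datadog Synthetic Monitoring source ranges.
resource "aws_wafv2_ip_set" "datadog_synthetics" {
  name               = format(local.name_prefix, "waf", "datadog-synthetics", "01")
  description        = "Datadog Synthetic Monitoring source ranges"
  scope              = "REGIONAL" # Use "CLOUDFRONT" for CloudFront distributions
  ip_address_version = "IPV4"
  addresses          = var.datadog_synthetics_ip_ranges
}
```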