Implementing Cloud Governance as Code using Cloud Custodian

In today’s rapidly scaling cloud infrastructure, it is hard to keep every resource compliant. Every organization has a set of policies for detecting violations and taking remediation actions on its cloud resources. This is generally done by writing multiple custom scripts and using third-party tools and integrations. Many development teams know how hard it is to write, manage, and keep track of such scripts. This is where we can leverage Cloud Custodian's DSL policies to manage our cloud resources with ease.

What is cloud governance?

Cloud governance is a framework which defines how developers can create policies to control costs, minimize security risks, improve efficiency and accelerate deployment.

What are other tools that provide governance as code?

AWS Config

AWS Config records and monitors the configuration of AWS resources, and we can build rules on top of it to enforce compliance. It supports multi-account and multi-region setups. It provides predefined AWS managed rules, and we can also write our own custom rules and take remediation actions on matches. For a custom rule, we need to write our own Lambda function.

However, we can use Cloud Custodian to set up both AWS Config managed rules and custom rules, with multi-account and multi-region support through c7n-org. It can also provision the AWS Lambda function automatically.

Azure Policy

Azure Policy enforces organizational standards across Azure resources (e.g., users are only allowed to create A and B series virtual machines). It provides an aggregated view of the overall state of the environment, with the ability to drill down to per-resource, per-policy granularity. We can turn on built-in policies or create custom policies for all resources. It can also take automatic remediation actions on non-compliant resources.

Azure Policy is reliable and efficient for building a custom validation layer on deployments to prevent deviation from customer defined rules. Cloud Custodian and Azure Policy have significant overlap in scenarios they can accomplish with regard to compliance implementations. When reviewing your requirements, we recommend first identifying the requirements that can be implemented via Azure Policy. Custodian can then be used to implement the remaining requirements. Custodian is also frequently used to add a second layer of protection or mitigation actions to requirements covered by Azure Policy. This way we can ensure that policy is configured correctly.

[Figure: Azure Policy comparison]

So far, we have seen what cloud governance is and which other tools are available in the market. Let’s now see what Cloud Custodian offers for cloud governance.

What is Cloud Custodian?

Cloud Custodian is a CNCF sandbox project for governing public cloud resources in real time. It helps us write governance as code the same way we write infrastructure as code. It detects non-compliant resources and takes action to remediate them. Custodian is a cloud-native tool that can be used with multiple cloud providers (AWS, Azure, GCP, etc.).

We can use Cloud Custodian for the following:

  • Compliance and security as code - We can write policies as code in a simple YAML DSL.

  • Cost savings - Removing unwanted resources and implementing an on/off-hours policy can save costs.

  • Operational efficiency - Governance as code reduces the friction of innovating securely in the cloud and increases developer efficiency.

How does it work?

When we run a Cloud Custodian command, it takes the resources, filters, and actions defined in the policy as input and translates them into API calls for the target cloud provider (e.g., the AWS Boto3 API). There is no need to maintain custom scripts or AWS CLI commands. We get clean, readable policies and numerous common filters and actions that are built into Cloud Custodian. If we need a custom filter, we can always write one using JMESPath.
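To make the resource, filter, and action flow concrete, here is a minimal conceptual sketch in Python. It is not Cloud Custodian's actual implementation: the `lookup`, `matches`, and `run_policy` helpers and the in-memory `instances` list are hypothetical, and `lookup` only handles simple dotted keys rather than full JMESPath.

```python
from typing import Any, Callable

Resource = dict[str, Any]

# A few comparison operators, named after the ones used in this article.
OPS: dict[str, Callable[[Any, Any], bool]] = {
    "eq": lambda actual, expected: actual == expected,
    "ge": lambda actual, expected: actual is not None and actual >= expected,
    "not-in": lambda actual, expected: actual not in expected,
}


def lookup(resource: Resource, key: str) -> Any:
    """Resolve a dotted key such as 'State.Name' (a tiny subset of JMESPath)."""
    value: Any = resource
    for part in key.split("."):
        if not isinstance(value, dict):
            return None
        value = value.get(part)
    return value


def matches(resource: Resource, filters: list[dict]) -> bool:
    """A resource is selected only if every filter matches (implicit AND)."""
    return all(
        OPS[f.get("op", "eq")](lookup(resource, f["key"]), f["value"])
        for f in filters
    )


def run_policy(policy: dict, resources: list[Resource],
               actions: dict[str, Callable[[Resource], None]]) -> list[Resource]:
    """Filter the resources, then apply each named action to the matches."""
    selected = [r for r in resources if matches(r, policy.get("filters", []))]
    for name in policy.get("actions", []):
        for resource in selected:
            actions[name](resource)
    return selected


# Hypothetical in-memory data standing in for a cloud provider API response.
instances = [
    {"InstanceId": "i-1", "ImageId": "ami-approved", "State": {"Name": "running"}},
    {"InstanceId": "i-2", "ImageId": "ami-unknown", "State": {"Name": "running"}},
    {"InstanceId": "i-3", "ImageId": "ami-unknown", "State": {"Name": "stopped"}},
]
policy = {
    "name": "only-approved-ami",
    "filters": [
        {"key": "State.Name", "value": "running"},
        {"key": "ImageId", "op": "not-in", "value": ["ami-approved"]},
    ],
    "actions": ["stop"],
}
stopped: list[str] = []
selected = run_policy(policy, instances, {"stop": lambda r: stopped.append(r["InstanceId"])})
print(stopped)  # ['i-2']
```

Only the running instance with an unapproved image is selected, because both filters must match before the action is applied.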

There can be situations where we need to run a policy periodically or in response to events. For this, Cloud Custodian automatically provisions a Lambda function and a CloudWatch Events rule. CloudWatch Events rules can be scheduled (e.g., every 10 minutes) or triggered by API calls recorded by CloudTrail, EC2 instance state events, etc.

[Figure: Cloud Custodian workflow]

How to install and set up Cloud Custodian?

We can install Cloud Custodian with Python's pip command:

python3 -m venv custodian
source custodian/bin/activate
pip install c7n       # This includes AWS support
pip install c7n_azure # Install Azure package
pip install c7n_gcp   # Install GCP Package

Alternatively, we can use the Cloud Custodian Docker image:

docker run  -it \
  -v $(pwd)/output:/opt/custodian/output \
  -v $(pwd)/policy.yml:/opt/custodian/policy.yml \
  --env-file <(env | grep "^AWS\|^AZURE\|^GOOGLE\|^KUBECONFIG") \
     cloudcustodian/c7n run -v --cache-period 0 -s /opt/custodian/output /opt/custodian/policy.yml

Note: The access key, secret key, default region, and KUBECONFIG are read from environment variables. The credentials must carry the IAM permissions required by the resources and actions used in the policy YAML file. Another option is to mount the credentials file or directory inside the container.

Cloud Custodian policy.yaml explained

A Cloud Custodian policy is a simple YAML file made up of resources, filters, and actions.

  • Resources: Custodian can target several cloud providers (AWS, GCP, Azure), and each provider has its own resource types (e.g., ec2, s3).

  • Filters: Filters are how Custodian targets a specific subset of resources, for example by date or tag. We can write custom filters using JMESPath expressions.

  • Actions: Actions are what Custodian does to the resources that match the filters. An action can be as simple as sending a report to the owner stating that the resource does not comply with a governance rule, or as drastic as deleting the resource.

A policy can combine as many filters and actions as needed to express our requirements.

policies:
  - name: first-policy
    resource: name-of-cloud-resource
    description: Description of policy
    filters:
      - (some filter that will select a subset of resources)
      - (more filters)
    actions:
      - (an action to trigger on the filtered resources)
      - (more actions)

Cloud Custodian sample policy

Although the official docs cover most AWS policy examples, we have picked a few policies that can be used from day one for cost savings and compliance.

ebs-snapshots-month-old.yml

One of the most common problems organizations face is cleaning up old AMIs, snapshots, and volumes that have been lying around in the environment for more than a year and inflating the bill. Eventually we end up writing multiple custom scripts to deal with the situation.

Below is a simple policy that deletes snapshots older than 30 days.

policies:
  - name: ebs-snapshots-month-old
    resource: ebs-snapshot
    filters:
      - type: age
        days: 30
        op: ge
    actions:
      - delete
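The comparison that the age filter expresses can be sketched in a few lines of Python. This is only an illustration of the `days: 30` and `op: ge` semantics under my own assumptions, not Custodian's code; the `matches_age` helper, the snapshot IDs, and the fixed `now` value are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def matches_age(start_time: datetime, days: int, now: datetime) -> bool:
    """Mirror `op: ge`: match when the age is greater than or equal to `days`."""
    return now - start_time >= timedelta(days=days)


# A fixed "now" keeps the example deterministic.
now = datetime(2024, 3, 31, tzinfo=timezone.utc)
snapshots = {
    "snap-old": datetime(2024, 1, 1, tzinfo=timezone.utc),   # 90 days old
    "snap-edge": datetime(2024, 3, 1, tzinfo=timezone.utc),  # exactly 30 days old
    "snap-new": datetime(2024, 3, 20, tzinfo=timezone.utc),  # 11 days old
}
to_delete = [sid for sid, created in snapshots.items() if matches_age(created, 30, now)]
print(to_delete)  # ['snap-old', 'snap-edge']
```

With `ge`, a snapshot that is exactly 30 days old is included; a strict comparison would leave it out.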

Here is an example of how we can run the Cloud Custodian policy.

custodian run -v -s /tmp/output /tmp/ebs-snapshots-month-old.yml

Every time we run the custodian command, it creates or appends to files in a directory named after the policy, under the output directory passed with the -s option (e.g., /tmp/output/ebs-snapshots-month-old/custodian-run.log).

  • custodian-run.log : All console logs are stored here

  • resources.json : The list of filtered resources

  • metadata.json : Metadata about the policy run

  • action-* : The list of resources on which an action was taken

  • $HOME/.cache/cloud-custodian.cache : All cloud API call results are cached here, for 15 minutes by default.
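Because resources.json is a plain JSON list, the output is easy to post-process. The sketch below loads it and summarizes the matched snapshots. The `load_resources` and `summarize_snapshots` helpers are hypothetical, the sample data is fabricated so the example runs without AWS access, and the `SnapshotId` and `VolumeSize` fields are assumed to follow the EC2 snapshot description format.

```python
import json
import tempfile
from pathlib import Path


def load_resources(output_dir: str, policy_name: str) -> list[dict]:
    """Load the resources matched by a policy from its output directory."""
    path = Path(output_dir) / policy_name / "resources.json"
    with path.open() as handle:
        return json.load(handle)


def summarize_snapshots(resources: list[dict]) -> dict:
    """Count matched snapshots and total their size in GiB."""
    return {
        "count": len(resources),
        "total_gib": sum(r.get("VolumeSize", 0) for r in resources),
    }


# Demo with a fabricated output directory standing in for /tmp/output.
with tempfile.TemporaryDirectory() as output_dir:
    policy_dir = Path(output_dir) / "ebs-snapshots-month-old"
    policy_dir.mkdir()
    sample = [
        {"SnapshotId": "snap-0aaa", "VolumeSize": 8},
        {"SnapshotId": "snap-0bbb", "VolumeSize": 100},
    ]
    (policy_dir / "resources.json").write_text(json.dumps(sample))
    summary = summarize_snapshots(load_resources(output_dir, "ebs-snapshots-month-old"))

print(summary)  # {'count': 2, 'total_gib': 108}
```

Against a real run, we would call `load_resources("/tmp/output", "ebs-snapshots-month-old")` instead of building the temporary directory.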

To get a report of the filtered resources, we can run the command below. By default it produces the report in CSV format, but we can change that by passing --format json.

custodian report -s /tmp/output/ --format csv ebs-snapshots-month-old.yml

only-approved-ami.yml

This policy stops running EC2 instances whose AMI is not on the trusted AMI list.

policies:
- name: only-approved-ami
  resource: ec2
  comment: |
    Stop running EC2 instances that are using invalid AMIs
  filters:
    - "State.Name": running
    - type: value
      key: ImageId
      op: not-in
      value:
          - ami-04db49c0fb2215364   # Amazon Linux 2 AMI (HVM)
          - ami-06a0b4e3b7eb7a300  # Red Hat Enterprise Linux 8 (HVM)
          - ami-0b3acf3edf2397475    # SUSE Linux Enterprise Server 15 SP2 (HVM)
          - ami-0c1a7f89451184c8b   # Ubuntu Server 20.04 LTS (HVM)
  actions:
    - stop

Security-group-check.yml

A common issue is that developers allow SSH traffic from everywhere while creating a POC VM, or we open port 22 to all during testing and forget to remove the rule. The policy below takes care of this by automatically removing SSH access from all addresses and adding only the VPN IP range to the security group.

policies:
  - name: sg-remove-permission
    resource: security-group
    filters:
      - or:
          # SSH (TCP port 22) open to any IPv4 address
          - type: ingress
            IpProtocol: tcp
            Ports: [22]
            Cidr: "0.0.0.0/0"
          # SSH (TCP port 22) open to any IPv6 address
          - type: ingress
            IpProtocol: tcp
            Ports: [22]
            CidrV6: "::/0"
    actions:
      - type: set-permissions
        # remove the rules matched by the ingress filters above
        remove-ingress: matched
        # allow SSH only from the VPN range
        add-ingress:
          - IpPermissions:
            - IpProtocol: TCP
              FromPort: 22
              ToPort: 22
              IpRanges:
                - Description: VPN1 Access
                  CidrIp: "10.10.0.0/16"
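The check that this policy expresses, namely whether a rule lets port 22 in from 0.0.0.0/0 or ::/0, can be sketched in plain Python for a single rule. This illustrates the logic only and is not how Custodian evaluates ingress filters; the `opens_port_to_world` helper is hypothetical, and the rule dictionaries are assumed to follow the EC2 IpPermissions shape.

```python
import ipaddress


def opens_port_to_world(permission: dict, port: int = 22) -> bool:
    """Return True if an EC2-style permission exposes `port` to any address."""
    protocol = str(permission.get("IpProtocol", "")).lower()
    if protocol == "-1":
        covers_port = True  # "-1" means all protocols and all ports
    elif protocol == "tcp":
        covers_port = permission.get("FromPort", 0) <= port <= permission.get("ToPort", 65535)
    else:
        return False
    cidrs = [r["CidrIp"] for r in permission.get("IpRanges", [])]
    cidrs += [r["CidrIpv6"] for r in permission.get("Ipv6Ranges", [])]
    # A prefix length of 0 (0.0.0.0/0 or ::/0) covers every address.
    return covers_port and any(ipaddress.ip_network(c).prefixlen == 0 for c in cidrs)


open_ssh = {"IpProtocol": "tcp", "FromPort": 22, "ToPort": 22,
            "IpRanges": [{"CidrIp": "0.0.0.0/0"}]}
open_ssh_v6 = {"IpProtocol": "tcp", "FromPort": 22, "ToPort": 22,
               "Ipv6Ranges": [{"CidrIpv6": "::/0"}]}
vpn_ssh = {"IpProtocol": "tcp", "FromPort": 22, "ToPort": 22,
           "IpRanges": [{"CidrIp": "10.10.0.0/16"}]}
open_https = {"IpProtocol": "tcp", "FromPort": 443, "ToPort": 443,
              "IpRanges": [{"CidrIp": "0.0.0.0/0"}]}

for rule in (open_ssh, open_ssh_v6, vpn_ssh, open_https):
    print(opens_port_to_world(rule))
```

The VPN-only rule that the policy adds is not flagged, since 10.10.0.0/16 is not a catch-all range.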

Kubernetes resource support

We can now also manage Kubernetes resources such as Deployments, Pods, DaemonSets, and Volumes. Below are some examples of policies that we can write with Cloud Custodian.

  • Delete POC and untagged resources

  • Update labels and patch Kubernetes resources

  • Call webhooks based on findings

kubernetes-delete-poc-resource.yml

policies:
  - name: delete-poc-namespace
    resource: k8s.namespace
    filters:
    - type: value
      key: 'metadata.name'
      op: regex
      value: '^.*poc.*$'
    actions:
      - delete

  - name: delete-poc-deployments
    resource: k8s.deployment
    filters:
    - type: value
      key: 'metadata.name'
      op: regex
      value: '^.*poc.*$'
    actions:
      - delete
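Since these policies delete whatever the regex matches, it is worth testing the pattern against real names first. The sketch below uses Python's `re.match` with case-insensitive matching, which is my understanding of how Custodian's `regex` operator behaves and should be treated as an assumption. The resource names and the stricter pattern are hypothetical.

```python
import re

PATTERN = r"^.*poc.*$"                 # the pattern used in the policies above
STRICT_PATTERN = r"^(.*-)?poc(-.*)?$"  # hypothetical: "poc" as a whole dash-separated word


def is_match(pattern: str, name: str) -> bool:
    """Case-insensitive match anchored at the start of the name."""
    return re.match(pattern, name, flags=re.IGNORECASE) is not None


names = ["poc-demo", "team-poc", "epoch-service", "production"]
matched = [n for n in names if is_match(PATTERN, n)]
strict = [n for n in names if is_match(STRICT_PATTERN, n)]
print(matched)  # ['poc-demo', 'team-poc', 'epoch-service']
print(strict)   # ['poc-demo', 'team-poc']
```

The original pattern matches "poc" anywhere in the name, so a resource such as epoch-service would also be deleted. Combining a tighter pattern with a dry run helps avoid surprises.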

Note: Cloud Custodian's Kubernetes support is still a work in progress. We can check the status of the plugin here.

What are the modes in which we can run Cloud Custodian?

  • pull - The default mode, in which the policy is run manually. It is best scheduled from a CI/CD tool or cron.

  • periodic - Provisions cloud resources as per the policy (e.g., an AWS Lambda function with a CloudWatch cron schedule) and executes on that schedule.

  • Custom modes per cloud provider - Execute when a matching event occurs.

Integrate Cloud Custodian with Jenkins CI

For simplicity, we use the Cloud Custodian Docker image and inject the credentials as environment variables.

Note: The secret file must define its keys in upper case and include the default region. For Kubernetes, the KUBECONFIG file should also be mounted inside the container.

Secret file:

export AWS_ACCESS_KEY_ID=<YOUR_AWS_ACCESS_KEY>
export AWS_SECRET_ACCESS_KEY=<YOUR_AWS_SECRET_ACCESS_KEY>
export AWS_DEFAULT_REGION=<YOUR_DEFAULT_REGION>

Jenkinsfile:

pipeline{
    agent{ label 'worker1'}
    stages{
        stage('cloudcustodian-non-prod'){
            steps{
                dir("non-prod"){
                    withCredentials([file(credentialsId: 'secretfile', variable: 'var_secretfile')])
                    {
                    sh '''
                    . "$var_secretfile" > /dev/null 2>&1
                    env | grep "^AWS\\|^AZURE\\|^GOOGLE\\|^KUBECONFIG" > envfile

                    for files in $(ls | grep -E '\\.(yml|yaml)$')
                    do
                        docker run --rm -t \
                        -v $(pwd)/output:/opt/custodian/output \
                        -v $(pwd):/opt/custodian/ \
                        --env-file envfile \
                        cloudcustodian/c7n run -v  -s /opt/custodian/output /opt/custodian/$files
                    done
                    '''
                    }
                }
            }
        }
        stage("cloudcustodian-prod"){
            steps{
                dir("prod"){
                    withCredentials([file(credentialsId: 'secretfile', variable: 'var_secretfile')])
                    {
                    sh '''
                    . "$var_secretfile" > /dev/null 2>&1
                    env | grep "^AWS\\|^AZURE\\|^GOOGLE\\|^KUBECONFIG" > envfile

                    for files in $(ls | grep -E '\\.(yml|yaml)$')
                    do
                        docker run --rm -t \
                        -v $(pwd)/output:/opt/custodian/output \
                        -v $(pwd):/opt/custodian/ \
                        --env-file envfile \
                        cloudcustodian/c7n run -v -s /opt/custodian/output /opt/custodian/$files
                    done
                    '''
                    }
                }
            }
        }
    }
}

[Figure: Jenkins console output]

Tools and Features

Cloud Custodian has a number of add-on tools that have been developed by the community.

Multi Region and Multi Account support

We can use the c7n-org plugin to configure multiple AWS, Azure, and GCP accounts and run policies against them in parallel. The --region all flag runs the same policy across all regions.

Notification

The c7n-mailer plugin provides a lot of flexibility for alert notifications. We can use webhooks, email, queue services, Datadog, Slack, and Splunk for alerts.

Auto-resource-tagging

The c7n_trailcreator script processes CloudTrail records to create a SQLite database of resources and their creators, and then uses that database to tag the resources with their creator’s name.

Logging and Reporting

Custodian provides reports in JSON and CSV formats. We can also send its metrics to cloud-native logging services and build dashboards from them. Logs can be stored locally, in S3, or in CloudWatch. A consistent logging format makes it easy to troubleshoot policies.

Custodian Dry run

In a dry run (--dryrun), the actions of a policy are skipped, and Custodian only shows which resources would be affected. It is best practice to do a dry run before running a policy for real.

Custodian Cache

When we execute a policy, Custodian fetches data from the cloud and caches it locally, for 15 minutes by default, to minimize API calls. We can change the cache lifetime with the --cache-period option (in minutes); --cache-period 0 disables the cache.

Editor integration

It can be integrated with Visual Studio Code for auto-completion and suggestions.

Custodian schema

We can use the custodian schema command to find out which resource types, actions, and filters are available in Cloud Custodian.


custodian schema     #Shows all resource available in custodian
custodian schema aws     #Shows aws resource available in custodian
custodian schema aws.ec2     #Shows aws ec2 action and filters
custodian schema aws.ec2.actions     #Shows aws ec2 actions only
custodian schema aws.ec2.actions.stop     #Shows ec2 stop sample policy and schema

How is Cloud Custodian better than other tools?

  • Simplicity and consistency of writing policies across multiple cloud platforms and Kubernetes.

  • Multi-account and multi-region support using c7n-org.

  • Support for a wide range of notification channels using c7n-mailer.

  • Custodian’s Terraform provider enables writing and evaluating Custodian policies against Terraform IaC modules.

  • Custodian has deep integration with AWS Config. It can deploy any config rule that is supported by AWS Config, and it can automatically provision the AWS Lambda function for a custom AWS Config policy.

  • If needed, we can implement custom policies in Python, as Custodian is built on the cloud providers' SDKs.

  • Cloud Custodian is an open source CNCF sandbox project.

Cloud Custodian Limitations

  • No default dashboard. (Custodian supports AWS-native dashboards, and we can also send metrics to Elasticsearch, Grafana, etc. and build dashboards there.)

  • Cloud Custodian cannot act as a custom validation layer before deployment. It can only run periodically or in response to events.

  • Cloud Custodian does not ship with any built-in policies, so we need to write all policies ourselves. However, it has many good example policies (AWS, Azure, GCP) that we can use as a reference.

Conclusion

Cloud Custodian enables us to define rules and remediation in a single policy to maintain a well-managed cloud infrastructure. We can also use it to write policies for managing Kubernetes resources such as Deployments and Pods. Compared to other cloud governance tools, it provides a very simple DSL for writing policies and is consistent across cloud platforms. Custodian reduces the friction of innovating securely in the cloud and also increases efficiency.

We can use Cloud Custodian to optimize our cloud costs by implementing off-hours and cleanup policies. It also comes with many plugins, such as multi-account/multi-region support and a wide range of notification channels (Slack, SMTP, SQS, Datadog, webhooks, etc.). We can find a list of Cloud Custodian plugins here.

That’s a wrap folks :) Hope the article was informative and you enjoyed reading it. I’d love to hear your thoughts and experience - let’s connect and start a conversation on LinkedIn.
