The Northwestern Cloud Community of Practice recommends these best practices for effective operation and management of public cloud services:
- Perform operations as code
- Use version control for configuration as well as code
- Architect for security
- Learn from operational failures and share learnings across the organization
Perform operations as code
The API-driven, programmable nature of cloud environments allows practices traditionally used in software engineering to be applied to cloud infrastructure.
Automate operational tasks and procedures as far as possible, combining scripting environments such as Bash and PowerShell with configuration management and automation tools such as Terraform and Ansible.
Performing operations as code limits errors, increases productivity, and frees up your time for higher-level projects.
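As an illustration of the idea behind operations as code, the sketch below compares a declared desired state with the current state and computes the changes needed to reconcile them. It is a simplified model and not how Terraform or Ansible are implemented; the names `Change`, `plan`, and `apply`, and the resource dictionaries, are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    action: str  # "create", "update", or "delete"
    name: str    # hypothetical resource name


def plan(desired: dict[str, dict], actual: dict[str, dict]) -> list[Change]:
    """List the changes needed to turn the actual state into the desired one."""
    changes = []
    for name in sorted(desired):
        if name not in actual:
            changes.append(Change("create", name))
        elif desired[name] != actual[name]:
            changes.append(Change("update", name))
    for name in sorted(actual):
        if name not in desired:
            changes.append(Change("delete", name))
    return changes


def apply(desired: dict[str, dict], actual: dict[str, dict]) -> dict[str, dict]:
    """Return a new state with every planned change applied."""
    result = dict(actual)
    for change in plan(desired, actual):
        if change.action == "delete":
            del result[change.name]
        else:
            result[change.name] = dict(desired[change.name])
    return result
```

Because the plan is derived from a declared state, running it a second time against the reconciled state yields no changes, which is what makes the operation safe to repeat.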
Use version control for configuration as well as code
Manage all configuration for your cloud resources in a version control system, with the same tools and discipline as your application code. Doing so also enables continuous delivery practices that greatly enhance efficiency, visibility, consistency, and security.
The Cloud Community of Practice recommends adopting Git as a version control system, along with a hosted Git provider such as GitHub. When Git is used in combination with a continuous integration / continuous deployment (CI/CD) tool such as GitHub Actions or Jenkins, configuration changes can be automatically tested and deployed.
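One way a CI/CD pipeline might test configuration changes is with a small validation script that runs on every commit. The sketch below assumes configuration is stored as JSON files; the required keys (`name`, `region`, `tags`, and an `owner` tag) are a hypothetical schema chosen for illustration, not a Northwestern standard.

```python
import json

# Hypothetical schema: keys every configuration file is assumed to need.
REQUIRED_KEYS = ("name", "region", "tags")


def validate_config(config) -> list[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    if not isinstance(config, dict):
        return ["configuration must be a JSON object"]
    errors = []
    for key in REQUIRED_KEYS:
        if key not in config:
            errors.append(f"missing required key: {key}")
    if "tags" in config:
        tags = config["tags"]
        if not isinstance(tags, dict):
            errors.append("tags must be an object")
        elif "owner" not in tags:
            errors.append("tags must include an owner")
    return errors


def main(paths: list[str]) -> int:
    """Validate each file and return a process exit code (0 means success)."""
    failed = False
    for path in paths:
        with open(path, encoding="utf-8") as handle:
            config = json.load(handle)
        for error in validate_config(config):
            print(f"{path}: {error}")
            failed = True
    return 1 if failed else 0
```

A pipeline step could call `main` with the changed file paths and exit with its return value, so that an invalid configuration fails the build before anything is deployed.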
Architect for security
Implementing basic security practices everywhere can protect you from misconfigurations, mistakes, and successful attacks.
Control access to cloud resources by implementing role-based access control, using IAM in AWS and Azure Active Directory (now Microsoft Entra ID) in Azure. Give each role only the minimum permissions it needs to perform its function, and enable multi-factor authentication (MFA) to further protect access.
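A least-privilege review can be partly automated. The sketch below scans an AWS IAM policy document, represented as a Python dictionary, and flags `Allow` statements whose actions use a wildcard. It is a rough heuristic under stated assumptions: it ignores `NotAction`, conditions, and resource scoping, and the function name `find_wildcard_statements` is hypothetical.

```python
def _as_list(value):
    """IAM allows a single value or a list; normalize to a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def find_wildcard_statements(policy: dict) -> list[dict]:
    """Return Allow statements that grant '*' or 'service:*' actions."""
    flagged = []
    for statement in _as_list(policy.get("Statement")):
        if statement.get("Effect") != "Allow":
            continue
        actions = _as_list(statement.get("Action"))
        if any(action == "*" or action.endswith(":*") for action in actions):
            flagged.append(statement)
    return flagged
```

A flagged statement is a prompt for human review and does not by itself prove that a role is over-privileged.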
Ensure storage accounts do not allow public access and enable encryption across all storage services wherever possible. Additionally, enabling object versioning in cloud storage buckets can protect against accidental data deletion.
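For AWS S3, the three storage settings described above could be applied in one function, as in the sketch below. It assumes the caller passes in a boto3 S3 client, and it chooses SSE-S3 (`AES256`) encryption as an illustrative default; the function name `harden_bucket` and the bucket name in the usage note are hypothetical.

```python
def harden_bucket(s3_client, bucket: str) -> None:
    """Block public access, enable default encryption, and turn on versioning."""
    s3_client.put_public_access_block(
        Bucket=bucket,
        PublicAccessBlockConfiguration={
            "BlockPublicAcls": True,
            "IgnorePublicAcls": True,
            "BlockPublicPolicy": True,
            "RestrictPublicBuckets": True,
        },
    )
    s3_client.put_bucket_encryption(
        Bucket=bucket,
        ServerSideEncryptionConfiguration={
            "Rules": [
                {"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "AES256"}}
            ]
        },
    )
    s3_client.put_bucket_versioning(
        Bucket=bucket,
        VersioningConfiguration={"Status": "Enabled"},
    )
```

With boto3 installed and credentials configured, this might be called as `harden_bucket(boto3.client("s3"), "example-bucket")`. Passing the client in as an argument also lets the function be exercised against a stub without touching a real account.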
Use the cloud provider’s security services to enable automatic, programmatic remediation of security misconfigurations (AWS Config, Azure Policy), get visibility across cloud services (AWS Security Hub, Azure Security Center, now Microsoft Defender for Cloud), and protect web services from attack (AWS Shield, Web Application Firewall).
Learn from operational failures and share learnings across the organization
Failures and incidents will occur no matter how much you work to prevent them, so lay the groundwork to identify incidents quickly and recover from them as fast as possible. Write and share a blameless post-mortem of each incident to help the organization learn and become more resilient.
Regular “Game Days” give teams practice at analyzing failures and identifying lessons learned. Record those lessons in a strategy document and revisit it after each Game Day. Share what is learned across teams and with the Cloud Community of Practice.
Setting up an isolated test environment and adopting the principles of Chaos Engineering can help you identify failures before they become outages.
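A very small fault-injection experiment, in the spirit of Chaos Engineering, might look like the sketch below. It wraps a dependency so that calls fail at a configured rate and then checks how a retrying caller copes. The names `FlakyService` and `call_with_retries` are hypothetical, and a real experiment would target an isolated test environment instead of an in-process wrapper.

```python
import random


class FlakyService:
    """Wrap a callable and inject failures at a configured rate."""

    def __init__(self, func, failure_rate: float, seed: int = 0):
        self._func = func
        self._failure_rate = failure_rate
        self._rng = random.Random(seed)  # seeded so experiments are repeatable
        self.injected = 0                # number of faults injected so far

    def __call__(self, *args, **kwargs):
        if self._rng.random() < self._failure_rate:
            self.injected += 1
            raise ConnectionError("injected fault")
        return self._func(*args, **kwargs)


def call_with_retries(func, attempts: int = 3):
    """Call func, retrying on ConnectionError up to the given number of attempts."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        try:
            return func()
        except ConnectionError as error:
            last_error = error
    raise last_error
```

Raising the failure rate until the retries are exhausted shows where the caller stops recovering, which is the kind of weakness a Game Day is meant to surface.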