Part 2: Setting Technical Challenges
Part 1 of this blog series covered the non-technical challenges for a management team. This post reviews the technical challenges IT management will face, and discusses Infrastructure as a Service (IaaS), as defined in Part 1.
Getting Started and Covering the Basics
What you actually know—and what you think you know—about any Public Cloud may differ from the reality.
Public Clouds are API-driven software platforms. Behind-the-scenes services may have unpublished dependencies. The failure of any service may have an impact on other services. You may not discover this until an actual failure.
Your enterprise team is likely organized around the demands and peculiarities of the workloads it supports in the on-premises (onPrem) data center. That team was assembled over the years and understands the data center, and its weaknesses, intimately.
In addition to its knowledge of the data center, your team brings knowledge of its workloads. To foreshadow a bit, that knowledge of workload requirements will be the most valuable asset the team brings to the table during your Cloud Migration.
Your team’s knowledge of your current onPrem Data Center is often a starting point. Your onPrem Data Center and the Cloud have enough superficial similarities to make the team confident (read: over-confident). However, Cloud organizational entities (AWS Accounts/Azure Subscriptions) and the organization of networks are completely different from their onPrem counterparts, and require a different outlook.
Network organization is one of the fundamental services that must be gotten right, and that includes future-proofing. Failing to treat this as a high priority may require a re-migration to get it right.
Networks should also be fully automated. This is no small matter, nor is it optional.
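One way to make future-proofing concrete is to plan address space in code before anything is deployed. The sketch below uses Python's standard `ipaddress` module to carve a VPC range into equally sized subnets and to hold some blocks back for later growth. The CIDR range, zone names, and the `plan_subnets` helper are illustrative assumptions, not a recommended layout.

```python
import ipaddress


def plan_subnets(vpc_cidr, zones, new_prefix, reserve=0):
    """Assign one subnet per zone from vpc_cidr and list the spare blocks.

    Every name and value here is hypothetical; adapt it to your own plan.
    """
    vpc = ipaddress.ip_network(vpc_cidr)
    blocks = list(vpc.subnets(new_prefix=new_prefix))
    # Refuse a plan that leaves less headroom than requested.
    if len(zones) + reserve > len(blocks):
        raise ValueError("not enough address space for zones plus reserve")
    assigned = {zone: str(block) for zone, block in zip(zones, blocks)}
    spare = [str(block) for block in blocks[len(zones):]]
    return assigned, spare


assigned, spare = plan_subnets(
    "10.20.0.0/16",
    ["us-east-1a", "us-east-1b", "us-east-1c"],
    new_prefix=20,
    reserve=4,
)
print(assigned)
print(len(spare), "blocks held back for growth")
```

Keeping the plan in a function like this means the same logic can be reviewed, tested in a lower environment, and rerun when a new VPC is needed.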
IAM is the second area of consideration.
IAM is boring. That said, do not make the mistake of placing this entirely in Security’s hands (although they should be a key participant). This will determine where and how users and workloads gain access to which resources.
The effort should typically be led by a joint security-operations task force, with workload owners participating at key points in their Cloud-deployment life-cycles. For instance, it’s okay to have little or no automation of a Dev environment, so long as automation is required for QA and higher environments. AWS Organizations should be used to provide guardrails on all environments, especially Dev. Another suggestion would be to monitor for changes in policies and to review them as they’re executed in all environments, so any egregiously bad choices can be corrected quickly.
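As a rough illustration of what a guardrail can look like, the sketch below builds a service control policy (SCP) document as a Python dictionary and prints it as JSON. The two deny statements and the `deny_guardrail` helper are assumptions chosen for the example. Which actions you deny, and where in the organization you attach the policy, depends on your own environments.

```python
import json


def deny_guardrail(sid, actions):
    """Return one SCP statement that denies the given actions everywhere."""
    return {"Sid": sid, "Effect": "Deny", "Action": actions, "Resource": "*"}


# Hypothetical guardrails: keep accounts in the organization and keep
# the audit trail running, even in Dev.
scp = {
    "Version": "2012-10-17",
    "Statement": [
        deny_guardrail("DenyLeavingOrganization", ["organizations:LeaveOrganization"]),
        deny_guardrail(
            "DenyStoppingAuditTrail",
            ["cloudtrail:StopLogging", "cloudtrail:DeleteTrail"],
        ),
    ],
}

scp_json = json.dumps(scp, indent=2)
print(scp_json)
```

Generating the document from code keeps the guardrails in version control, where changes can be reviewed like any other change.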
Beyond networks and IAM, all other errors can usually be recovered from.
Databases are the Achilles heel of a Cloud Migration. My first suggestion is to attempt to migrate your database to RDS. Discover what changes are required to make the migration (notice I didn’t add, “if any”). Only about one in twenty forklifts requires no database changes.
This is an ideal time to loosen the grip your database vendor has on you. If triggers and stored procedures can be transmuted to Lambda functions, do it.
One last note on getting started. The use of the console during a PoC may be agile. However, limit its use outside the PoC construct. Except perhaps in limited situations, the console will negatively impact your overall environment(s). Relying on the console gives you immediate speed at the cost of long-term agility. We jokingly call this “Consolitis.” The most compelling reason for moving to the Cloud, business agility, will be lost or (in a best-case scenario) watered down when the Console is used to create objects on a non-exception basis.
What’s your end goal?
If your end goal isn’t to evolve your application to be cloud-native, you’re probably not interested in driving down your costs. That’s okay if you’ve got money to burn, but the movement for most organizations resembles a bell curve. If you’re not interested in driving down your costs, or if it’s not a primary goal, this should be communicated to the team, as it will drive architectural decisions.
Does your team understand that Automation in Prod is the goal?
Speed to market, in a modern context, can only be accomplished via automation. Automation in all or nearly all environments can only be accomplished when naming conventions have been sorted out and standardized. Networks should be designed and optimized for the Cloud, and should not be just a duplicate of how your OnPrem network is deployed. Jumping ahead—even to show progress—will hurt the ultimate goal.
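A naming convention is easiest to enforce when it lives in code rather than in a wiki page. The sketch below is one hypothetical convention (workload, environment, region, resource kind, joined with hyphens). The pattern, the environment list, and the `resource_name` helper are assumptions for illustration only.

```python
import re

# Hypothetical environment names for this example.
ALLOWED_ENVS = {"dev", "qa", "uat", "prod"}
PART = re.compile(r"^[a-z0-9]+$")


def resource_name(workload, env, region, kind):
    """Build a standardized name, rejecting anything off-convention."""
    if env not in ALLOWED_ENVS:
        raise ValueError(f"unknown environment: {env}")
    # Drop the hyphens in the region so hyphens only separate name parts.
    parts = [workload, env, region.replace("-", ""), kind]
    for part in parts:
        if not PART.match(part):
            raise ValueError(f"invalid name part: {part!r}")
    return "-".join(parts)


name = resource_name("billing", "prod", "us-east-1", "vpc")
print(name)
```

Because every template and script calls the same function, automation can rely on names being predictable across environments.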
Have you chosen your pipeline tool yet?
To the degree that your development team may have chosen a pipeline tool, it’s fine to follow their lead. However, as your DevOps requirements evolve and expand based on actual experience, don’t be afraid to revisit the topic. AWS CodePipeline is a fine choice in certain situations; Jenkins in others. There are any number of CI/CD solutions. Choose the right one for you.
Do your infrastructure teams understand what a pipeline is? Do they understand at least the basics of Git?
A lack of understanding of pipelines and Git will make it difficult to do anything remotely scalable. Gaining the necessary knowledge will take time. Provide your teams with individual/self-training or team training prior to the migration.
Have you determined how you will deploy your Infrastructure as Code (IaC)?
Most clouds offer several choices. In AWS (I have the most production experience with AWS, so we’ll focus on that) the choice boils down to CloudFormation and Terraform. You’ll need to determine which better fits your needs and goals.
Having decided on your templating tool, have you involved the Network Team? Has their piece of the infrastructure been deployed as code, too?
While the IT Ops team will likely have some basic scripting in their background, the Network Team may not. How will you handle this gap in your team?
Bash, PowerShell or Python training will ease them into the mindset required to be successful. This includes the use of variables, the reuse of code, and the concept of adding a subnet, VPC, or other network construct on the fly (having tested it in a lower environment first, of course).
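To give a flavor of what variables and code reuse mean for a network engineer, here is a minimal sketch that generates a CloudFormation template for a VPC and its subnets from a short list of inputs. The helper names, logical IDs, and address ranges are hypothetical. Only the resource types and property names come from CloudFormation.

```python
import json


def subnet_resource(vpc_logical_id, cidr, zone):
    """One reusable definition of a subnet; called once per subnet."""
    return {
        "Type": "AWS::EC2::Subnet",
        "Properties": {
            "VpcId": {"Ref": vpc_logical_id},
            "CidrBlock": cidr,
            "AvailabilityZone": zone,
        },
    }


def network_template(vpc_cidr, subnets):
    """Build a template with one VPC and a subnet per (cidr, zone) pair."""
    resources = {
        "Vpc": {"Type": "AWS::EC2::VPC", "Properties": {"CidrBlock": vpc_cidr}}
    }
    for index, (cidr, zone) in enumerate(subnets, start=1):
        resources[f"Subnet{index}"] = subnet_resource("Vpc", cidr, zone)
    return {"AWSTemplateFormatVersion": "2010-09-09", "Resources": resources}


# Example values only; adding a subnet means adding one line here.
template = network_template(
    "10.20.0.0/16",
    [("10.20.0.0/20", "us-east-1a"), ("10.20.16.0/20", "us-east-1b")],
)
print(json.dumps(template, indent=2))
```

The same output could be produced with Terraform or hand-written YAML. The point is that the subnet definition is written once and reused.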
Most errors occur in account management and VPC interconnectivity. Unfortunately, these are the errors most difficult to recover from.
As a manager, when your network team wants to tie your VPCs together via peering—either within an account or across multiple accounts—will you have the confidence to say no?
Inter-VPC network connectivity should be limited to Authenticated HTTPS.
AWS provides “Best Practice Use Cases.” These may be more or less valid, depending on your situation. We find the transit VPC to be the most flexible, scalable, and easily managed. That said, never deploy any sort of compute in the transit VPC. It becomes a high-risk horizontal breaching vehicle for the “bad guys.”
Limiting yourself to the HTTPS standard may be difficult, especially during an UpLift. So, allowances can be made in the context of tightly defined Security Groups, and other techniques. But an exception should never become the rule, and the exception must be added to the backlog. The exception is an undesirable requirement necessitated by the time the team has to deploy the project, and should be identified on the roadmap as an item to be eliminated.
Driving your application team to this standard is an important aspect of management.
Do your workload teams understand what network connectivity is required to make the workload function correctly?
Some workloads have complex networking requirements. Have they been thoroughly mapped out and documented? Have you disallowed outbound connectivity except when the Instance is actually running as a server, and allowed that connectivity only to the specific types of compute that need it?
Are you controlling Instance-to-Instance connectivity within the VPC by referencing the security group ID itself? By following this pattern, where possible, Instance-to-Instance connectivity can be limited to appropriate targets within the VPC.
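The group-ID pattern can be sketched as a CloudFormation ingress rule whose source is another security group rather than an address range. The group IDs below are placeholders, and the `ingress_from_group` helper is a hypothetical name used only for this example.

```python
def ingress_from_group(target_group_id, source_group_id, port):
    """Allow TCP traffic on one port, but only from members of source_group_id."""
    return {
        "Type": "AWS::EC2::SecurityGroupIngress",
        "Properties": {
            "GroupId": target_group_id,
            "IpProtocol": "tcp",
            "FromPort": port,
            "ToPort": port,
            # No CidrIp here: the source is a security group, not a network range.
            "SourceSecurityGroupId": source_group_id,
        },
    }


# Placeholder IDs: the database group accepts 5432 only from the app group.
rule = ingress_from_group("sg-database-placeholder", "sg-app-placeholder", 5432)
print(rule)
```

Because the rule names a group instead of a subnet, new Instances gain or lose access by group membership, and nothing else in the subnet is let in by accident.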
Is your application’s network access least privilege?
Consider how applications are consumed across organizational boundaries. The preference is to restrict cross-account traffic to HTTPS access, defined by the required number of endpoints and a common mechanism for managing access and credentials.
For example, suppose an application in VPC1 requires access to one of several databases in a second account/VPC2.
Is network traffic for the VPC1 application limited to the single database in VPC2? Can other resources besides the application take advantage of that link without requesting authorization?
Identity and Access Management
Shifting gears from Network concerns, let’s now look at managing the Roles of both people and services in AWS: AWS Identity and Access Management (IAM).
Losing control of IAM Role mappings and associated policies is a mistake that’s very hard to recover from. Be disciplined. Ask questions. Actual access reviews are boring tasks that must be constantly conducted to maintain Cloud security integrity. New AWS tools have made this easier.
Having gotten over the CapEx versus OpEx hump, how are you going to organize yourself financially in the Cloud?
In AWS, the base unit of financial accountability is the “Account.” Even if nothing else is done, the account will be billed as an individual unit. AWS Organizations can aid you in setting a single security standard across multiple accounts (and thus business units and environments).
Having an AWS footprint with multiple accounts has a number of benefits:
- The most important benefit is reduction of the blast radius in the event of a bad software decision.
- It allows teams to more easily comply with the principle of least privilege.
- Related to least privilege, it allows teams to have separate accounts for different environments (dev, qa, uat, prod, etc.).
- It provides a level of granularity for billing purposes and, when used in combination with Tags, allows precise cost allocation.
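To illustrate the billing point, the sketch below rolls up cost line items by a tag. The record shape, the `team` tag key, and the figures are invented for the example. A real billing export will have a different shape, so treat this as the idea rather than a parser.

```python
from collections import defaultdict


def cost_by_tag(line_items, tag_key):
    """Sum cost per tag value; items missing the tag land in 'untagged'."""
    totals = defaultdict(float)
    for item in line_items:
        owner = item.get("tags", {}).get(tag_key, "untagged")
        totals[owner] += item["cost"]
    return dict(totals)


# Made-up line items spanning two accounts.
line_items = [
    {"account": "dev", "cost": 12.5, "tags": {"team": "billing"}},
    {"account": "dev", "cost": 7.5, "tags": {"team": "search"}},
    {"account": "prod", "cost": 30.0, "tags": {"team": "billing"}},
    {"account": "prod", "cost": 5.0, "tags": {}},
]

totals = cost_by_tag(line_items, "team")
print(totals)
```

The "untagged" bucket is worth watching: it shows how much spend cannot yet be attributed to anyone.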
While the positives far outweigh the negatives, there are negatives to a multi-account footprint that need to be managed. AWS Organizations gives your team greater control over Roles and Policies and lets you set up guardrails for sub-accounts. As has been mentioned, your network will be more complex.
Managing Roles and Policies will be more complex, but less complex than without Organizations. Be sure to answer the following:
- Does the Application (whether hosted on EC2, ECS, EKS, Batch, Lambda or Elastic Beanstalk) have IAM policies which restrict it to only the data repositories it requires and nothing more?
- Having uplifted the application, is someone assigned to reviewing access requirements and investigating any seemingly unused access?
- Are policy reviews a part of your deployment code reviews? Mistakes will be made, but if you’re not prepared to examine the policies—both changed and prior—then you’re not performing a required task of managing a Cloud deployment.
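Part of a policy review can be automated so that reviewers spend their time on the judgment calls. The sketch below flags Allow statements that use wildcard actions or a wildcard resource. The `broad_statements` helper, the sample policy, and the bucket name are assumptions for illustration, and the check is deliberately simple. It is a first-pass filter, not a substitute for a real review.

```python
def as_list(value):
    """IAM allows a single string or a list; normalize to a list."""
    return value if isinstance(value, list) else [value]


def broad_statements(policy):
    """Return (statement index, reason) for each overly broad Allow."""
    findings = []
    for index, statement in enumerate(as_list(policy.get("Statement", []))):
        if statement.get("Effect") != "Allow":
            continue
        actions = as_list(statement.get("Action", []))
        resources = as_list(statement.get("Resource", []))
        if any(a == "*" or a.endswith(":*") for a in actions):
            findings.append((index, "wildcard action"))
        if "*" in resources:
            findings.append((index, "wildcard resource"))
    return findings


# Sample policy with one narrow statement and one broad one.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::example-bucket/*",
        },
        {"Effect": "Allow", "Action": ["dynamodb:*"], "Resource": "*"},
    ],
}

findings = broad_statements(policy)
print(findings)
```

Run as a step in the deployment pipeline, a check like this makes the policy diff part of every code review instead of an occasional audit.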
Getting Started, Revisited
Is your team prepared to do the training and shift to the new code-driven culture, moving away from traditional Network Operations and IT Operations roles and toward DevOps roles?
Is it prepared to make the necessary investments in a secure cloud? This includes retraining or hiring new staff to assist in the project, and reducing the exposure of over-provisioned networks, services and users, in order to protect internal and (potentially) customer secrets.
Is it retraining the development staff to leverage the Cloud for both refactoring and greenfield projects?
A Cloud Migration is usually a large, complicated process. Start small if you must, with a single workload. Keep in mind the investment you’re making in people, process and your application workload. Use those lessons to inform follow-on workloads. Build success on success and control how networks interact, as well as how your workloads interact. Lastly, if your organization is currently crafting its Public Cloud migration strategy and hopes to avoid a migration nightmare, please don’t hesitate to reach out to us to learn more about what’s involved and how Anexinet’s expert consultants can help.