A few weeks ago on a Saturday morning I tried to pay a medical bill online and received the following message:
Sorry! In order to serve you better, our website will be down for scheduled maintenance from Friday 6:00 PM to Sunday 6:00 PM.
OK, I get it. Stuff happens. However, the following week I was greeted with the same message. Two 48-hour windows in a row means 96 hours of downtime over two weeks. Even if that’s the only downtime for the year, it works out to an availability of 98.9 percent -- a rate that would be unacceptable for most IT departments and many online businesses. I sensed legacy architecture.
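The arithmetic behind that figure is simple; a quick sketch using the downtime from the scenario above:

```python
# Availability from downtime: a sanity check of the 98.9 percent figure.
HOURS_PER_YEAR = 365 * 24  # 8760

def availability(downtime_hours: float, period_hours: float = HOURS_PER_YEAR) -> float:
    """Fraction of the period the system was up, expressed as a percentage."""
    return 100.0 * (1 - downtime_hours / period_hours)

# Two maintenance windows, each Friday 6:00 PM to Sunday 6:00 PM (48 hours):
print(round(availability(2 * 48), 1))  # -> 98.9
```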
Many successful businesses still use legacy architecture. After all, they were built on it. However, the inherent complexity of legacy servers leads to infrequent releases that bundle large amounts of content. This means more components and dependencies that cannot be individually upgraded. It’s this domino effect that forces many companies to still take down servers for extended maintenance (like the medical billing company I mentioned earlier).
Computer science has known the solution to the complexity problem for a long time: modularity. Design the system not as a small number of large complex parts, but as a large number of small, simple containerized applications. However, the larger number of smaller services means there are inevitably more pieces to manage. Where a legacy system might have had a dozen VMs, now it could have hundreds of software containers.
Microservices security challenges
With a system made up of microservices, each network interaction becomes a potential security issue. Instead of flowing vertically through the system (Web to app to database), much of the network traffic flows sideways, from container to container. Configuring a security system that protects every one of those connections is more than an administrator can realistically understand, much less manage by hand.
This is why recent surveys have outlined security as one of the biggest barriers to container adoption. Why is this such a stumbling block? The fact is, when you’re dealing with containers, security comes down to four key areas:
- Is the code running within the container safe? One danger with developers pulling individual container layers from public repositories is losing control over what code is actually running in the system. Public repositories do not provide any assurance of safety and could potentially become a vector for malware in the future. Even without malicious intent, there is no guarantee that publicly available code does not contain a host of vulnerabilities. The delivery path is important as well: How do you make sure that the code intended by the developers is the actual code deployed in production?
- Where is it running? The ability to use diverse infrastructure sources is a huge boon, but not all providers are appropriate for all jobs. Workloads that handle sensitive data such as personally identifiable information (PII) may need to run on infrastructure that meets certain security requirements. There may be geopolitical considerations, such as laws forbidding offshoring of certain types of data. Development, test, and production systems should have good “walls” separating them, with access to each properly segregated and managed.
- What can it consume? A container instance should get only the resources it needs to handle its normal workload, and no more. Resources should be allocated and distributed for efficiency purposes as well as security. One indicator of a possible breach is a system consuming an abnormal amount of some resource, such as a server using unexpectedly high network bandwidth because of malware that is exfiltrating data. Resource management can help contain and detect such attacks.
- Who can it talk to? Modularizing at the container level implies exposing interfaces that previously would have been only available inside the process. As network services, these interfaces can potentially be called by any entity on the network. If the container is properly designed to provide only a well-defined API on a specific interface, then the overall system is more secure by only allowing that interface to be visible to other processes. Outbound access should also be controlled. Some containers will not require access to any other services; others will talk to fellow containers and potentially to services like an enterprise database or directory. Each of these interactions should be specifically controlled.
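The delivery-path question in the first bullet is usually answered with content addressing: if the digest of the artifact in production matches the digest of what the developers built, the two are byte-for-byte identical. A minimal sketch of the idea -- the helper names are hypothetical, though real registries use sha256 digests in much the same way:

```python
import hashlib

def image_digest(image_bytes: bytes) -> str:
    """Content-addressed digest of an image blob (registries use sha256 similarly)."""
    return "sha256:" + hashlib.sha256(image_bytes).hexdigest()

def verify_deployment(built_blob: bytes, deployed_blob: bytes) -> bool:
    """The code the developers intended is the code in production iff digests match."""
    return image_digest(built_blob) == image_digest(deployed_blob)

print(verify_deployment(b"app-v1", b"app-v1"))          # matching build and deploy
print(verify_deployment(b"app-v1", b"app-v1-tampered"))  # tampered artifact is caught
```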
Clearly, an operations team has many security concerns to worry about, and many moving parts to track. What is needed is a comprehensive management system that also helps with security. One great way to do this is to base the system on policy. A strong policy system could take the full set of deployment information, along with operational security requirements, and control the resulting deployment of jobs. The system could then be assembled and managed on a foundation of policy-based security.
As it happens, I prefer not to spend whole weekends trying to pay bills. Ideally, my creditors would have systems that are persistent and available so that I can quickly pay them, then return to imploring my teenagers to spend time with me. Getting beyond legacy architectures and moving to a policy-based system centered on microservices and containerized applications is, I think, an important step in that direction. This is the problem we’ve been working to solve at Apcera.
Policy-based service management
Apcera is a cloud platform that gives you the control to automate the integration and overall management of all IT resources, across both on-prem servers and multiple clouds. Designed to give developers what they want and operators what they need, the Apcera system lets organizations securely deploy, orchestrate, and govern their workloads (including containers) by declaring and automatically enforcing fine-grained policies across the cloud stack.
Of course, “policy” may mean different things to different people. In our case, it means an extensive rule book that teams can draw on to set up basic rules for applications, containers, IT resources, and developers, to name a few. Apcera streamlines this policy-setting process and automates the enforcement, effectively setting the guardrails so that developers and ops can innovate at speed with complete confidence and trust.
For example, the default Docker container found on one of the biggest public repositories has had several serious vulnerabilities and exploits. Although security tools have been built in, compatibility is not guaranteed, and only one part of the development process may be covered. Dealing with these variables is not impossible, but the more steps that exist in the process, the higher the chance for human error.
Preventing human error is one of the most critical parts of security. Mishandling credentials, not following procedures, misconfiguring firewalls or other features, accidentally exposing data and services to users -- all can make the best security system dangerously porous. Apcera’s policy engine works to eliminate this human error through a focus on isolation, access control, and component logs.
Isolation and access control
Containers exist to modularize and isolate workloads, but they can't be completely isolated. Most containers eventually need to interact with outside sources; it’s why container networking tools exist. However, there are several levels where automated isolation can actually work and be beneficial.
- OS level: Isolating the container runtime can prevent unintentional interaction between workloads. This is achieved by giving each workload its own set of IDs, file system instances, and network stacks.
- Internal networking level: Networking among containers without policy constraints can allow viruses and exploits to get out of control quickly. Ideally, containers should not be able to talk to each other without explicit permission. This can be granted via policy at the port and protocol level and must include requests in both directions, for both input and output. This means that accidental configuration mistakes are much rarer because all networking is deliberate and controlled by policy.
- Component-to-service level: Modern applications are often built via microservices components, so it's common to have a component talking through a connection to another component such as a back-end database. If this database is exposed, anyone can access it and make off with the data. Apcera’s wire protocol-aware semantic pipelines, which sit in the middle of traffic between the client and the server, can be used to obscure the ID of these databases and manage the permissions to the database.
- Application-to-internet level: When the application is ready to connect to the outside world, it’s important to monitor and route all I/O traffic to specific workloads, even when a workload moves to a different cloud or provider.
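The internal networking rules described above amount to a default-deny allow list at the port and protocol level. A toy sketch of that idea, with hypothetical container names and no relation to Apcera’s actual policy syntax:

```python
from typing import NamedTuple

class Rule(NamedTuple):
    src: str       # container initiating the connection
    dst: str       # container receiving it
    port: int
    protocol: str

# Default deny: only connections matching an explicit rule are permitted.
RULES = {
    Rule("web", "app", 8080, "tcp"),
    Rule("app", "db", 5432, "tcp"),
}

def permitted(src: str, dst: str, port: int, protocol: str) -> bool:
    """Every container-to-container connection must be deliberate and in policy."""
    return Rule(src, dst, port, protocol) in RULES

print(permitted("app", "db", 5432, "tcp"))  # allowed: app tier may reach the database
print(permitted("web", "db", 5432, "tcp"))  # denied: web tier may not skip the app tier
```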
Using policy to control who can deploy applications and specialty software in your developer environments has long-term benefits for security control issues. It prevents unsecured applications from accidentally being deployed. While unsecured versions of programs may exist in containers during the development cycle, Apcera’s policy engine can prevent deployment if the application isn’t known to be secure. It can also run compliance checks on existing programs to ensure that their parameters are in line with existing policy or to determine whether they need to be retired. Since different business types may have different compliance and risk constraints, flexible policy mechanisms are invaluable for regulating the deployment pipeline.
Debugging an application is a common task for a developer. When that application is connected to a database, the dev needs to access the logs tracking all transactions from that application. Granting access to those logs while maintaining database security can go beyond merely giving the developer access to the whole server or making a special debugging copy of all the logs. Since a semantic pipeline sits between the component and database, it can provide a log of all data requests for individual applications without needing to compromise the server itself.
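The per-application request log described here can be pictured as a thin wrapper sitting on the wire between the component and the database. A toy sketch with a stand-in backend (all names hypothetical; Apcera’s semantic pipelines are protocol-aware rather than a simple function wrapper):

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("semantic-pipeline")

def make_pipeline(app_name, backend):
    """Wrap a backend query function so every request is logged per application,
    without granting the developer access to the database server itself."""
    def query(sql):
        log.info("app=%s query=%r", app_name, sql)
        return backend(sql)
    return query

# Stand-in for a real database connection:
fake_db = lambda sql: ["row1", "row2"]
billing_db = make_pipeline("billing", fake_db)
print(billing_db("SELECT * FROM invoices"))
```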
A platform designed to enforce
The Apcera Platform takes a holistic approach to implementing policy across network infrastructure. The interaction between your people, your policy, and the platform itself is best described as follows:
- Reason: Your people determine which policies need to be declared inside the system. The most important areas for policy are around workload composition, resource management, scheduling/placement, and the network.
- Decide: The policy decides what the system is permitted to do. Apcera’s policies are granular, deep, and pervasive. The system can be used only in ways granted through policy.
- Enforce: The Apcera Platform enforces the policy through a policy engine. After all, what good is policy without enforcement? When somebody or something attempts to violate a policy, the platform automatically blocks any action that wasn’t expressly permitted through policy.
The Apcera policy engine is embedded in each system component (see diagram) and can enforce policy on virtually any resource in the cluster, including jobs, packages, stagers, services, service gateways, semantic pipelines, Docker images, and even policy itself, to name a few.
Each component has access to all policy information and performs policy-related tasks. Apcera uses a system of resource identifiers known as fully qualified names (FQNs), each consisting of a resource type, a namespace, and a location within the namespace. Apcera policies associate permissions with a realm. The form of the permissions depends on the resource type. The permissions usually allow an operation specific to that resource type for a specified user or class of users. Any operation that is not specifically allowed is disallowed.
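As an illustration only -- the exact FQN syntax is Apcera’s, and this sketch merely assumes a "type::namespace::name" layout matching the three parts named above -- parsing such an identifier might look like:

```python
from typing import NamedTuple

class FQN(NamedTuple):
    resource_type: str  # e.g. "job", "package", "service"
    namespace: str      # e.g. "/prod/web"
    name: str           # location within the namespace

def parse_fqn(fqn: str) -> FQN:
    """Split a fully qualified name of the assumed form 'type::namespace::name'."""
    resource_type, namespace, name = fqn.split("::")
    return FQN(resource_type, namespace, name)

print(parse_fqn("job::/prod/web::billing-api"))
```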
Examples of constraints that you can specify by policy include, but are not limited to:
- Node.js applications running in production must use Node.js 0.10.25
- Applications running in production can only use databases or other services running in production
- Applications running in the testing environment cannot connect to services running in production
- Applications that access the Web must be scanned for vulnerabilities and malware
- Developers can use the runtime of their choice for applications and services running in their own sandbox but not when running in production
- Each developer has a maximum RAM quota
- A specific software version cannot be deployed
- Existing applications that are out of compliance are identified
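Constraints like these can be pictured as predicates evaluated against a deployment request, with anything not expressly allowed denied. A toy sketch encoding two of the rules above (hypothetical attribute names, not Apcera’s policy language):

```python
def check_deploy(app: dict, rules) -> bool:
    """Default deny: a deployment is allowed only if every applicable rule passes."""
    return all(rule(app) for rule in rules)

rules = [
    # Node.js applications running in production must use Node.js 0.10.25
    lambda a: a["runtime_version"] == "0.10.25"
              if a["env"] == "production" and a["runtime"] == "node" else True,
    # A specific (hypothetical) software version is disallowed everywhere
    lambda a: a.get("openssl_version") != "1.0.1f",
]

ok = {"env": "production", "runtime": "node", "runtime_version": "0.10.25"}
bad = {"env": "production", "runtime": "node", "runtime_version": "0.12.0"}
print(check_deploy(ok, rules), check_deploy(bad, rules))  # True False
```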