Scope a microservice extraction
Consul service mesh provides a robust set of features that enable you to migrate from a monolith to microservices in a safe, predictable, and controlled manner.
This tutorial explains the process of scoping your first microservice extraction and preparing it for deployment to the Consul service mesh on Kubernetes. It is designed for practitioners who will be responsible for developing microservices that will be deployed to Consul service mesh running on Kubernetes; however, the concepts discussed should be useful for practitioners developing microservices for any environment.
In this tutorial, you will:
- Document the service API and deployment profile
- Identify considerations when designing services for deployment to Consul service mesh on Kubernetes
- Identify required refactorings
- Define the scope of the initial implementation
If you are already familiar with scoping microservice implementations and deployments, feel free to skip ahead to the next tutorial in the collection, where we discuss how to Extract a microservice from a monolithic code base.
Source code
The tutorials in this collection reference an example application. You can review the source code for the different tiers of the application on GitHub.
Documenting the service
When looking to refactor from a monolith to microservices, you should start by documenting the functionality or component you wish to extract. Below is an example worksheet you can use to document a monolithic feature set that you are targeting for extraction to a microservice. This list is not exhaustive, and you should feel free to expand on it to meet your specific needs. Items on the list that have Consul service mesh-specific concerns are linked to the relevant documentation.
- Name - name of the service
- ServiceAccount - name of the Kubernetes ServiceAccount the service runs under and whether any other services, including multiple versions of the same service, will use that same account
- Service Port - the port(s) you wish to expose your service on
- Namespace - optional namespace to run the service in if not default
- Protocol - what protocol the service uses (e.g. TCP, HTTP, gRPC)
- Statefulness Profile - is the service stateful or stateless
- API Surface - exposed routes, supported HTTP methods for each route, and required parameters per route
- Upstreams - list of upstream services that this service depends on
- Downstreams - list of downstream services that depend on this service
- Environment Variables - list of OS environment variables the service requires to operate
- File System Configuration - list of file system configuration files the service requires to operate
- Runtime Configuration - list of runtime configuration values and sources the service expects to retrieve dynamically at runtime and requires to operate
- Buildtime Secrets - list of secrets, along with their sources, that the service expects to be statically injected as part of the build process
- Runtime Secrets - list of secrets, along with their sources, that the service expects to be dynamically available while the service is running
- Imported Dependencies or Client Libraries - list of all 3rd party libraries the service references or uses dynamically via plugins
- Labels - key value metadata to be applied to the service
- Tags - tag metadata to be applied to the service
Example worksheet
Using the HashiCups case study started earlier in the collection, the following worksheet is an example of how the HashiCups team documented their microservice. They chose to include a notes column to help them remember why they made the decisions they did. The team performs frequent retrospectives, so they wanted to be able to include these notes in future conversations in case they find themselves questioning a previous decision.
Attribute | Value | Notes |
---|---|---|
Name | coffee-service | We currently only sell coffee, but are adding more SKUs. Supply chain and data sources are totally different already. Chose to segment by product type to avoid more complexity. |
ServiceAccount | coffee-service | Plan to use the same service account for all versions |
ServicePort | 9090 | |
Namespace | N/A | |
Protocol | HTTP | |
Statefulness Profile | Stateless | Service is a stateless pass-through to the database. In time we want to add a stateful caching layer in this pod. |
API Surface | GET /coffees, GET /health | |
Upstreams | Postgres | Available at port 5432 |
Downstreams | Ordering | Ordering requires a product SKU when making a purchase |
Environment Variables | USERNAME, PASSWORD, LOG_FORMAT, LOG_LEVEL (INFO), BIND_ADDRESS, VERSION | Postgres secrets, logging advice, static service config. Plan to run multiple versions from the same binary to see how it works out. |
File System Config | N/A | Chose to refactor the app to use environment variables instead of config files |
Runtime Config | N/A | No known dynamic config at this time |
Buildtime Secrets | USERNAME, PASSWORD | Need to find a way to automate this eventually. Taking on debt now since deployment will start with manual operator deploys. |
Runtime Secrets | N/A | |
3rd Party Libs | gorilla/mux, hashicorp/go-hclog, jmoiron/sqlx, lib/pq, opentracing/opentracing-go, stretchr/testify, uber/jaeger-client-go, uber/jaeger-lib | |
Labels | version: v1, tier: api | |
Tags | N/A | |
Note
One thing they realized during their analysis was that calling the service product-service was a misnomer they needed to correct. The service only dealt with coffee data. Also, they already had multiple wholesale coffee suppliers, and the complexity around just coffee data was not trivial. They decided to rename it to coffee-service now, so that they would not be tempted to add other product data to this service in the future, because that would be creating a new monolith.
Design for Consul service mesh on Kubernetes
Some of the design decisions referenced in the worksheet allude to Consul or Kubernetes specific requirements when implementing a microservice that is deployed to the Consul service mesh on Kubernetes. Some are decisions that were made to make the application more cloud native in its implementation. This section will highlight some of those decisions so that you can make the same considerations when designing your services.
Consul Annotations
Consul uses Pod annotations to configure services injected into the mesh. The full list of all available annotations can be found in the documentation. Several of the annotations map directly to the elements of the service profile outlined above, and have been mapped below for your convenience. We recommend you review the linked documentation for each to help you think through how best to configure your service to run in the Consul service mesh on Kubernetes environment.
- Name - consul.hashicorp.com/connect-service
- Service Port - consul.hashicorp.com/connect-service-port
- Protocol - consul.hashicorp.com/connect-service-protocol
- Upstreams - consul.hashicorp.com/connect-service-upstreams
- Metadata - consul.hashicorp.com/service-meta-{key}: {value}
- Tags - consul.hashicorp.com/service-tags: comma,separated,list,of,tags
The annotations for configuring Connect must be on the pod specification. Since higher level constructs such as Deployments wrap pod specification templates, Consul service mesh can be used with all of these higher level constructs, too. See the documentation for more details.
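To make this concrete, here is a minimal sketch of how the coffee-service Pod template metadata might carry these annotations, using values from the example worksheet. The upstream service name postgres and the connect-inject annotation are illustrative assumptions, not a definitive configuration.

```yaml
# Hypothetical Pod template metadata for the coffee-service (values drawn
# from the example worksheet; adjust to match your own service profile).
metadata:
  labels:
    version: v1
    tier: api
  annotations:
    # Inject the Envoy sidecar and register the service in the mesh
    consul.hashicorp.com/connect-inject: "true"
    # Name, port, and protocol from the worksheet
    consul.hashicorp.com/connect-service: "coffee-service"
    consul.hashicorp.com/connect-service-port: "9090"
    consul.hashicorp.com/connect-service-protocol: "http"
    # Upstream dependency: the Postgres database exposed locally on 5432
    consul.hashicorp.com/connect-service-upstreams: "postgres:5432"
```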
Communicate over localhost
Consul service mesh expects applications to connect to upstream services on a
specific port using localhost. Since all containers within a Pod share networking,
Consul's Envoy proxy sidecar is able to intercept requests to ports configured
as upstreams and proxy them to the actual target IP in the mesh. For this
process to work currently, applications must be configured to connect to services
using localhost:<port-number>
. In this example, the coffee-service
has the
Postgres database service defined as an upstream. In order to be able to connect,
the connection string will use localhost
as the host name and the port number
will need to match the port number configured in the connect-service-upstreams
annotation in the Pod metadata.
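As a sketch of what this looks like in application code, assuming the USERNAME and PASSWORD environment variables and the jmoiron/sqlx and lib/pq libraries from the worksheet (the hashicups database name is a hypothetical placeholder):

```go
package main

import (
	"fmt"
	"os"

	"github.com/jmoiron/sqlx"
	_ "github.com/lib/pq" // Postgres driver
)

// connectDB dials the upstream Postgres service. Because the Envoy sidecar
// proxies upstream traffic, the host is always localhost and the port must
// match the port declared in the connect-service-upstreams annotation (5432).
func connectDB() (*sqlx.DB, error) {
	connStr := fmt.Sprintf(
		"host=localhost port=5432 user=%s password=%s dbname=hashicups sslmode=disable",
		os.Getenv("USERNAME"),
		os.Getenv("PASSWORD"),
	)
	return sqlx.Connect("postgres", connStr)
}

func main() {
	db, err := connectDB()
	if err != nil {
		fmt.Fprintln(os.Stderr, "connection to upstream Postgres failed:", err)
		os.Exit(1)
	}
	defer db.Close()
	fmt.Println("connected to Postgres through the local Envoy proxy")
}
```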
Manage Consul
Consul configuration entries are a fundamental element of the service mesh, and are the mechanism by which users can configure the global mesh as well as individual service settings. For example, if you intend to use any of Consul's Layer 7 traffic management features, your service protocol must be either HTTP, HTTPS, or gRPC. Adding a service-defaults config entry is a convenient way to ensure at the mesh level that all new instances of a service with a given name are deployed properly to the mesh.
When using Consul service mesh on Kubernetes, Consul configuration entries can be managed as Kubernetes Custom Resource Definitions. Visit the Manage Consul Service Mesh using Kubernetes Custom Resource Definitions tutorial to learn more.
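For example, a service-defaults entry for the coffee-service could be expressed as a custom resource similar to the following sketch, assuming the consul.hashicorp.com/v1alpha1 CRDs are installed in the cluster:

```yaml
apiVersion: consul.hashicorp.com/v1alpha1
kind: ServiceDefaults
metadata:
  name: coffee-service
spec:
  # Register every instance of coffee-service as an HTTP service so that
  # Layer 7 features such as traffic management can be applied later.
  protocol: http
```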
Health checks
One of the primary benefits of using an orchestrator like Kubernetes or a service mesh like Consul is the ability to observe and manage the mesh from a centralized control plane, and ensure that requests are only routed to services that are available and healthy. This process is called health checking.
Kubernetes uses the concept of probes to provide health checking features. A probe is a configuration applied to a container that the orchestrator can use to ascertain the status of the container relative to a given probe.
Consul service mesh on Kubernetes has its own health check feature. When enabled, a TTL health check will be registered within Consul for each Kubernetes pod that is part of the Consul service mesh. The Consul health check's state will reflect the pod's readiness status, which is the combination of all Kubernetes probes registered with the pod.
The example application will illustrate how to register an HTTP livenessProbe that can then be leveraged by both Kubernetes and Consul.
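As a sketch, assuming the GET /health route and port 9090 documented in the worksheet, the container specification might register a probe like this (the image name is a hypothetical placeholder):

```yaml
# Hypothetical container spec fragment for the coffee-service Pod.
containers:
  - name: coffee-service
    image: hashicups/coffee-service:v1   # placeholder image name
    ports:
      - containerPort: 9090
    livenessProbe:
      httpGet:
        path: /health
        port: 9090
      initialDelaySeconds: 5
      periodSeconds: 10
```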
Observability
Observability is the ability to collect, measure, and analyze telemetry data produced by your system. Telemetry data is an indispensable tool when operating infrastructure in an orchestrated environment. You will not have access to physical or virtual machines, so you must take steps to both collect and publish your telemetry data to a system or platform that you do have access to.
You do not have to have a comprehensive observability solution in place to experiment with a pilot project, but before you deploy your first service, you should have a plan for how much telemetry data you will produce and how you will access it.
The OpenTelemetry project divides telemetry into three different categories: metrics, tracing, and logging. Let's discuss each of those now.
Metrics
Metrics are primarily concerned with aggregating quantitative information over time. Examples include how many requests the service receives per minute, how many non-success status codes the service returns per N requests each hour, or the average CPU load per minute for a container running a specific workload.
That last example is worth delving into. Application developers who are used to building services for on-prem deployment may not be familiar with the need to monitor container resource utilization. Applications deployed to orchestrators introduce multiple layers of abstraction between the operating system and the running application. Because of that, it can be challenging to predict the amount of CPU, RAM, and disk you need to provide to your applications. If you don't provide enough resources, your application will likely suffer from performance issues. If you provide more resources than you need, your cloud costs will be higher than they need to be, and your profit margins correspondingly lower.
Because of these concerns, you should consider having your service measure and report resource utilization as metric data to your observability pipeline at periodic intervals. While you won't be able to see the host operating system, you will gain insight into whether you've sized your containers correctly.
If you don't have a preference already, Prometheus is a well-respected open source solution, and Consul service mesh on Kubernetes has first-class support for publishing application metrics to Prometheus. To learn more about how to monitor your mesh, visit our Layer 7 Observability with Consul Service Mesh, Prometheus, Grafana, and Kubernetes tutorial.
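For illustration, a Go service could expose request counts for Prometheus to scrape with something like the following sketch. It assumes the prometheus/client_golang library, which is not part of the worksheet's dependency list, and a hypothetical coffee_service_requests_total metric name.

```go
package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// requestTotal counts requests by route and status code so that request rates
// and error ratios can be aggregated over time in Prometheus.
var requestTotal = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "coffee_service_requests_total",
		Help: "Total HTTP requests handled, labeled by path and status code.",
	},
	[]string{"path", "code"},
)

func main() {
	prometheus.MustRegister(requestTotal)

	http.HandleFunc("/coffees", func(w http.ResponseWriter, r *http.Request) {
		// ... handle the request ...
		requestTotal.WithLabelValues("/coffees", "200").Inc()
		w.WriteHeader(http.StatusOK)
	})

	// Expose the metrics endpoint for Prometheus to scrape.
	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":9090", nil)
}
```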
Tracing
Tracing, or distributed tracing, is the process of following a request through the mesh so that you can observe its behavior at the various touch points of the request handling pipeline. For example, it is not uncommon for a request to first hit a public endpoint, like a load balancer, which then routes the request to a UI service that calls one or more API services that in turn make one or more DB calls. If you experience performance problems in such a situation, how do you pinpoint which service(s) are causing the slowdown? By adding tracing to your service mesh, you can see how much time a request spends inside each pod or node that represents a leg of the journey.
Distributed tracing typically relies on request header propagation. The presence of well-known headers informs the mesh that the request is part of a distributed trace. The tracing solution will observe traffic to and from the pod, and use the trace id in the header set to correlate requests between pods in the cluster, so that they can be visualized as a sequence of calls, or spans.
Most tracing solutions also provide the ability to instrument your application code and emit custom tracing data to help you debug service internals. This usually involves referencing a 3rd party SDK, and using it to first configure the runtime, and then publish tracing data yourself. In order for the application code to participate in the distributed trace, it will need to have a way of retrieving the trace id from the request headers, so that the service operation can "join" the trace. If handled correctly, the internal spans you generate with your code will appear as spans in the overall request trace.
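As a sketch of what joining a trace might look like, using the opentracing/opentracing-go library from the worksheet's dependency list (the getCoffees handler and get-coffees operation name are hypothetical):

```go
package main

import (
	"net/http"

	"github.com/opentracing/opentracing-go"
	"github.com/opentracing/opentracing-go/ext"
)

// getCoffees joins an incoming distributed trace by extracting the propagated
// span context from the request headers before starting its own server-side span.
func getCoffees(w http.ResponseWriter, r *http.Request) {
	// Pull the trace context (trace id, parent span id) out of the headers.
	wireCtx, _ := opentracing.GlobalTracer().Extract(
		opentracing.HTTPHeaders,
		opentracing.HTTPHeadersCarrier(r.Header),
	)

	// Start a span as a child of the propagated context. If no context was
	// found, this simply starts a new root span.
	span := opentracing.StartSpan("get-coffees", ext.RPCServerOption(wireCtx))
	defer span.Finish()

	// ... query the database and write the response ...
	w.WriteHeader(http.StatusOK)
}

func main() {
	// Tracer setup (for example with uber/jaeger-client-go) is omitted here;
	// opentracing.GlobalTracer() defaults to a no-op tracer.
	http.HandleFunc("/coffees", getCoffees)
	http.ListenAndServe(":9090", nil)
}
```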
Jaeger is another popular open source project that many organizations use as part of their observability pipeline. The Envoy proxy documentation has information on how to configure Envoy proxies to enable distributed tracing with Jaeger. Sample code for configuring Jaeger to monitor services deployed to Consul on Kubernetes can be found in this repository.
Logging
Logging is another indispensable tool for operating microservices. Debugging a failing service in a distributed system may feel extremely foreign to you, especially if you are used to working out of an IDE with a debugger. Logging to stdout is an essential skill to acquire and master when developing microservices. Indeed, many people have suggested that when you develop microservices you should begin to think of stdout as your new debugger.
However, without forethought, logging can lead to either a shortage of actionable insight or an undesirable signal-to-noise ratio. To avoid finding yourself with too much or too little information, you should develop a team standard that includes:
- Types of events to log
- What information a log message should include
- What log level each event should be logged at
- How and where you will export logs out of the cluster
There are many more considerations when developing a logging strategy for your project, but this list and the next section should be enough to inform the conversation with your team. The CNCF Landscape page for logging is also a great place to start looking for both inspiration and solutions that will meet your needs.
Avoiding overreach
By now, you are probably beginning to realize how different developing microservices is from how traditional on-prem applications have typically been built. After reading the observability overview, you may feel a little overwhelmed by all the new technologies you will need to learn and leverage to be successful running microservices at scale. If you are feeling that way, don't worry; you are in good company! If you aren't feeling that way, then either you already have a lot of experience with cloud native development, or you need to reread the trade-offs section of the Migrate to Microservices article earlier in this collection. Pay close attention to the simplicity trade-off discussion.
Earlier in this article we stated that you don't have to implement a comprehensive observability solution for a pilot project. Let's return to our HashiCups case study and see how they balanced their need for observability with their need to limit scope.
HashiCups revisited
After documenting their service, the HashiCups team was eager to get started, and felt like they had some really actionable insight into what they needed to do to get the service up and operating inside a Consul service mesh on Kubernetes. Then one member of the team brought up the issue of observability. She reminded everyone that once the services were containerized and distributed, attaching a debugger and stepping through all the layers of service calls wasn't really an option. The team agreed, but were unanimously against trying to solve all their observability challenges during a pilot project.
After some discussion and a few moments of anxiety they came to some pragmatic realizations. First, they weren't going to be changing the monolith at all. The Consul service mesh was going to allow them to route traffic for their single new microservice without having to modify the monolith code. This meant that they didn't really have a practical need for distributed tracing. So they were able to defer that concern for the pilot project.
Second, they realized that while they wanted application metrics in the long run, they didn't need them to launch a pilot project. They hadn't agreed on a metrics solution, what metrics to collect, at what granularity, and so on. They also knew they needed to do research on the topic before committing to a solution or investing development time. The team simply didn't have the data they needed to move forward, and ultimately realized that was OK for now. They quickly added this issue to their backlog with a high value score and all agreed to assign a resource to it as soon as they got general approval for the project.
That left logging. The HashiCups monolith already used a 3rd party logging library that the team liked, so they decided to continue using it. After some discussion, they reached consensus that for a pilot project, they would forgo addressing the need to export logs out of the cluster. If they started to have issues they would use kubectl logs to inspect the log stream, and if things got really bad, they could always roll back at the service mesh layer by re-routing traffic back to the monolith. Everyone agreed that this was not a scalable approach, but was an acceptable trade-off for a pilot project. So, they added the need for a better long term logging solution to the backlog with a high value score, and felt like that was a defensible, pragmatic decision.
However, they decided one area they did need to improve during the pilot was their logging discipline. Visibility into the monolith when things went wrong wasn't great, and they feared that if they didn't start the microservices migration with a strategy for discipline in this area, the problem could get much worse in a microservices architecture where visibility could be even more of a challenge.
They identified three main areas of deficiency that they needed to address in their logging approach. The first was that if something went wrong on startup or shutdown, they couldn't easily tell where in those processes the errors had occurred. The second was that if something went wrong at runtime, it wasn't always clear where, because some developers had implemented logging and some hadn't. They wanted a way to enforce a consistent must-log policy. The third was that they hadn't set the project up with log levels. There was only one log level throughout, and that wasn't ideal. Developers who did add logging were often very verbose, and that ended up polluting the logs with debug or trace information that they really didn't want in production.
The first problem was the easiest to solve. As a team, they came to an agreement on what constituted a lifecycle event, and then documented it. They also agreed that this should be a living document, and that all team members were encouraged to bring considerations for addition to the list to the team stand-up meetings. After their first pass, they came away with this initial list of lifecycle events.
- Service initializing/initialized
- Environment variables parsing/parsed
- Configuration initializing/initialized
- Middleware initializing/initialized
- Router initializing/initialized
- Route handler registering/registered
At some point, one of the team members pointed out that all of these events could be logged from the main function, and the team liked how that could help speed up debugging lifecycle issues.
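For illustration, a minimal sketch of that idea, assuming the hashicorp/go-hclog library from the worksheet and the LOG_LEVEL, LOG_FORMAT, VERSION, and BIND_ADDRESS environment variables documented in the service profile:

```go
package main

import (
	"os"

	"github.com/hashicorp/go-hclog"
)

func main() {
	// Build the logger from the environment variables documented in the worksheet.
	logger := hclog.New(&hclog.LoggerOptions{
		Name:       "coffee-service",
		Level:      hclog.LevelFromString(os.Getenv("LOG_LEVEL")),
		JSONFormat: os.Getenv("LOG_FORMAT") == "json",
	})

	logger.Info("Service initializing", "version", os.Getenv("VERSION"))

	logger.Info("Environment variables parsing")
	bindAddress := os.Getenv("BIND_ADDRESS")
	logger.Info("Environment variables parsed", "bind_address", bindAddress)

	logger.Info("Configuration initializing")
	// ... initialize configuration, middleware, router, and route handlers here,
	// logging each lifecycle event from the list above at the Info level ...
	logger.Info("Configuration initialized")

	logger.Info("Service initialized")
}
```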
The team felt pretty good about themselves, until they started discussing the second problem: enforcing a must-log policy. This became a complicated conversation. The team agreed that adding some logging middleware for request handling was a good idea, but beyond the level of request received/response returned, defining the must-log policy became very situational. After going through a lot of "what about" tangents, they realized two things.
First, they realized that providing guidelines to developers was about all they could do right now. This was supposed to be a fast experiment on extracting a microservice, not an experiment in writing a framework. They had too many open questions and not enough time.
Second, they realized that the second and the third problem were tightly coupled. Everyone agreed that adding log levels would help improve the value of logging, and when they started to look at the available log levels themselves, they saw a kind of policy arise. These are the developer guidelines they came away with.
Level | Guideline |
---|---|
Info | Lifecycle events, component initialization, inter-component method calls, external service calls |
Debug | Intra-component method calls, flow of control branching, coarse-grained contextual state |
Trace | Flow of control iteration, fine-grained contextual state |
Warn | Expected, handled errors; unexpected but recoverable errors |
Error | Unexpected or unrecoverable errors |
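Applied to a request handler, the guidelines might look something like the following sketch, again assuming hashicorp/go-hclog plus jmoiron/sqlx and lib/pq from the worksheet; the CoffeeService type, Coffee model, and query are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"os"

	"github.com/hashicorp/go-hclog"
	"github.com/jmoiron/sqlx"
	_ "github.com/lib/pq" // Postgres driver
)

// CoffeeService is a hypothetical handler type holding the shared logger and DB.
type CoffeeService struct {
	logger hclog.Logger
	db     *sqlx.DB
}

// Coffee is a minimal placeholder model for this sketch.
type Coffee struct {
	ID   int    `db:"id" json:"id"`
	Name string `db:"name" json:"name"`
}

// getCoffees applies the guidelines: Info for external service calls, Debug for
// intra-component flow, Warn for expected handled conditions, Error for failures.
func (s *CoffeeService) getCoffees(w http.ResponseWriter, r *http.Request) {
	s.logger.Debug("getCoffees called", "remote_addr", r.RemoteAddr)

	s.logger.Info("Querying coffees from database")
	var coffees []Coffee
	if err := s.db.Select(&coffees, "SELECT id, name FROM coffees"); err != nil {
		s.logger.Error("Unexpected error querying coffees", "error", err)
		http.Error(w, "internal error", http.StatusInternalServerError)
		return
	}

	if len(coffees) == 0 {
		// Expected, handled condition rather than a failure.
		s.logger.Warn("No coffees found, returning an empty list")
	}

	s.logger.Debug("Returning coffees", "count", len(coffees))
	json.NewEncoder(w).Encode(coffees)
}

func main() {
	logger := hclog.New(&hclog.LoggerOptions{
		Name:  "coffee-service",
		Level: hclog.LevelFromString(os.Getenv("LOG_LEVEL")),
	})

	// Connect over localhost as described in the earlier upstream sketch.
	db, err := sqlx.Connect("postgres", fmt.Sprintf(
		"host=localhost port=5432 user=%s password=%s dbname=hashicups sslmode=disable",
		os.Getenv("USERNAME"), os.Getenv("PASSWORD")))
	if err != nil {
		logger.Error("Unexpected error connecting to database", "error", err)
		os.Exit(1)
	}

	svc := &CoffeeService{logger: logger, db: db}
	http.HandleFunc("/coffees", svc.getCoffees)
	// BIND_ADDRESS is assumed to hold an address such as ":9090".
	http.ListenAndServe(os.Getenv("BIND_ADDRESS"), nil)
}
```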
They also agreed that code reviewers must include a logging assessment as part of the review, and that the assessment should cover both the presence of logging and the correct application of log levels based on the guidelines. They agreed that these guidelines should also be a living document, and that they should improve them as they learned more from their experiment.
Define the scope
The HashiCups team had done their research and were now more aware of how Consul service mesh on Kubernetes worked. They had come to terms with the amount of work that would have to go into building a robust microservices architecture. They had held themselves accountable in terms of limiting scope and avoiding overreach. They knew they needed to start small and get a quick win.
They decided that the following scope was the absolute minimum they could implement and still have the pilot be considered a success.
- Structural code
- Baseline a bare bones service project in a new git repo
- Refactor logging to conform with guidelines
- Identify required configuration and refactor to environment variables if possible
- Stub out route handlers to allow route configuration
- Configure route handling
- Business logic
- Migrate and refactor the coffees route handler and supporting code
- Write unit tests to gate deployment
- Deployment configuration
- Create the Kubernetes deployment configuration
- Include a livenessProbe
- Add Consul specific annotations
- Document a manual process that covers the following
- Deploying container images
- Injecting secrets
- Deploying to the Kubernetes cluster
They pitched their scope outline to the broader community of stakeholders, and after the other teams signed off on the plan, the development team went to work.
Next steps
In this tutorial, you:
- Documented the service API and deployment profile
- Identified considerations when designing services for deployment to Consul service mesh on Kubernetes
- Identified required refactorings
- Defined the scope of initial implementation
The next tutorial in the collection will show you how to Extract a microservice from the monolith and deploy your first microservice to the Consul service mesh on Kubernetes.