
Envoy

Envoy is a proxy solution. There are many proxies on the market: nginx, HAProxy, Traefik, etc. So why does Envoy stand out? Because of the xDS APIs, or by a fancier name, the xDS protocol. The Envoy project incubated a protocol that allows proxy configuration to be changed dynamically over a channel between the proxy and a management server.

Alex Burnos has a great article talking about the data plane, the control plane, and the universal data plane API. To quote his explanation here:

  • data plane: the proxies
  • control plane: the management server
  • data plane API: the API the proxies and the management server use to talk to each other, i.e., the xDS APIs.

See Envoy’s Go implementation of the data-plane-api: https://github.com/envoyproxy/go-control-plane.
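
To make that channel concrete, here is a minimal sketch of a v3 bootstrap that points Envoy at such a management server: dynamic_resources tells Envoy to fetch its listeners and clusters over an ADS gRPC stream, while the management server itself has to be defined statically. The host my-control-plane, the port 18000, and the node id are placeholders I made up; only the field names come from the bootstrap API.

cat > /tmp/envoy-bootstrap.yaml <<'EOF'
node:
  id: demo-node
  cluster: demo-cluster

dynamic_resources:
  ads_config:
    api_type: GRPC
    transport_api_version: V3
    grpc_services:
    - envoy_grpc:
        cluster_name: xds_cluster
  # Listeners and clusters arrive over the ADS stream configured above.
  lds_config:
    resource_api_version: V3
    ads: {}
  cds_config:
    resource_api_version: V3
    ads: {}

static_resources:
  clusters:
  # The management server itself must be reachable before any xDS happens.
  - name: xds_cluster
    type: STRICT_DNS
    connect_timeout: 1s
    typed_extension_protocol_options:
      envoy.extensions.upstreams.http.v3.HttpProtocolOptions:
        "@type": type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions
        explicit_http_config:
          http2_protocol_options: {}   # xDS is gRPC, so the channel needs HTTP/2
    load_assignment:
      cluster_name: xds_cluster
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address: { address: my-control-plane, port_value: 18000 }
EOF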

Like any other proxy, Envoy manages client and server connections internally and maps a client connection to a server connection. In Envoy’s terminology, the client side is called the downstream and the server side is called the upstream. Accordingly, there are two important concepts: Listener and Cluster. Meanwhile, to make connection management efficient, an event loop is usually needed, and Envoy is no exception: it uses libevent.
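
As a minimal sketch of how Listener and Cluster fit together, the hypothetical static config below accepts downstream connections on port 10000 and has the TCP proxy filter forward them to a cluster with a single upstream endpoint on 127.0.0.1:8080 (the ports and file path are my own choices; with xDS, the same resources would instead be delivered by the management server):

cat > /tmp/envoy-demo.yaml <<'EOF'
static_resources:
  # Listener: the downstream side, where Envoy accepts client connections.
  listeners:
  - name: listener_0
    address:
      socket_address: { address: 0.0.0.0, port_value: 10000 }
    filter_chains:
    - filters:
      - name: envoy.filters.network.tcp_proxy
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.tcp_proxy.v3.TcpProxy
          stat_prefix: demo_tcp
          cluster: demo_upstream
  # Cluster: the upstream side, the endpoints Envoy load balances across.
  clusters:
  - name: demo_upstream
    type: STRICT_DNS
    connect_timeout: 1s
    load_assignment:
      cluster_name: demo_upstream
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address: { address: 127.0.0.1, port_value: 8080 }
EOF

# Run it with the binary built in the Build section below (exact path/name may differ).
envoy -c /tmp/envoy-demo.yaml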

The Envoy codebase is huge. You can draw a lot of inspiration from it, such as event loop usage, a watchdog implementation in C++, virtual inheritance, the WASM sandbox, L4/L7 protocol details, cryptography in C++, etc. It also covers Envoy Mobile, so you can even learn some good practices for reusing code between backend and mobile.

Build

Follow the official doc.

brew install bazelisk
cd ~/code/envoy

# release build. Be cautious: too slow. Use a debug build instead.
bazel build -c opt envoy

# debug build
bazel build --jobs=7 -c dbg envoy

A gentle reminder: it takes hours to build Envoy. There are over 14k source files to compile, and the official doc recommends using a machine with 36+ cores. The build time on my MacBook M1 with 10 logical processors was the following.

INFO: Elapsed time: 4138.131s, Critical Path: 348.04s
INFO: 14847 processes: 3209 internal, 11636 darwin-sandbox, 1 local, 1 worker.

The command to generate the compilation database is as follows.

TEST_TMPDIR=/tmp tools/gen_compilation_database.py

It failed on my MacBook M1. To mitigate this issue, I had to remove the test and contrib targets before running the Python script.

diff --git a/tools/gen_compilation_database.py b/tools/gen_compilation_database.py
index 8153c1ad30..5d1cd470db 100755
--- a/tools/gen_compilation_database.py
+++ b/tools/gen_compilation_database.py
@@ -131,8 +131,6 @@ if __name__ == "__main__":
     parser.add_argument(
         'bazel_targets', nargs='*', default=[
             "//source/...",
-            "//test/...",
-            "//contrib/...",
         ])

A little bit more about the source code structure:

  • All public header files are inside the folder envoy.
  • All implementations are inside the folder source.
  • It uses the impl pattern everywhere (see the example below).
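
For example, an interface and its *Impl counterpart usually mirror each other across the two folders. The ClusterManager paths below come from my checkout and may move between releases:

# Pure-virtual interface header under envoy/:
ls envoy/upstream/cluster_manager.h

# Concrete implementation following the *Impl naming convention under source/:
ls source/common/upstream/cluster_manager_impl.h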

Cluster

A Cluster represents the upstream server info. We can use the admin endpoint to get all the cluster records:

$ curl http://localhost:15000/clusters

outbound|8001||server.production.svc.cluster.local::observability_name::outbound|8001||server.production.svc.cluster.local
outbound|8001||server.production.svc.cluster.local::default_priority::max_connections::4294967295
outbound|8001||server.production.svc.cluster.local::default_priority::max_pending_requests::4294967295
outbound|8001||server.production.svc.cluster.local::default_priority::max_requests::4294967295
outbound|8001||server.production.svc.cluster.local::default_priority::max_retries::4294967295
outbound|8001||server.production.svc.cluster.local::high_priority::max_connections::1024
outbound|8001||server.production.svc.cluster.local::high_priority::max_pending_requests::1024
outbound|8001||server.production.svc.cluster.local::high_priority::max_requests::1024
outbound|8001||server.production.svc.cluster.local::high_priority::max_retries::3
outbound|8001||server.production.svc.cluster.local::added_via_api::true
outbound|8001||server.production.svc.cluster.local::172.31.48.187:8001::cx_active::0
outbound|8001||server.production.svc.cluster.local::172.31.48.187:8001::cx_connect_fail::0
outbound|8001||server.production.svc.cluster.local::172.31.48.187:8001::cx_total::0
outbound|8001||server.production.svc.cluster.local::172.31.48.187:8001::rq_active::0
outbound|8001||server.production.svc.cluster.local::172.31.48.187:8001::rq_error::0
outbound|8001||server.production.svc.cluster.local::172.31.48.187:8001::rq_success::0
outbound|8001||server.production.svc.cluster.local::172.31.48.187:8001::rq_timeout::0
outbound|8001||server.production.svc.cluster.local::172.31.48.187:8001::rq_total::0
outbound|8001||server.production.svc.cluster.local::172.31.48.187:8001::hostname::
outbound|8001||server.production.svc.cluster.local::172.31.48.187:8001::health_flags::healthy
outbound|8001||server.production.svc.cluster.local::172.31.48.187:8001::weight::1
outbound|8001||server.production.svc.cluster.local::172.31.48.187:8001::region::us-east-2
outbound|8001||server.production.svc.cluster.local::172.31.48.187:8001::zone::us-east-2a
outbound|8001||server.production.svc.cluster.local::172.31.48.187:8001::sub_zone::
outbound|8001||server.production.svc.cluster.local::172.31.48.187:8001::canary::false
outbound|8001||server.production.svc.cluster.local::172.31.48.187:8001::priority::0
outbound|8001||server.production.svc.cluster.local::172.31.48.187:8001::success_rate::-1.0
outbound|8001||server.production.svc.cluster.local::172.31.48.187:8001::local_origin_success_rate::-1.0
outbound|8001||server.production.svc.cluster.local::172.31.49.195:8001::cx_active::0
outbound|8001||server.production.svc.cluster.local::172.31.49.195:8001::cx_connect_fail::0
outbound|8001||server.production.svc.cluster.local::172.31.49.195:8001::cx_total::0
outbound|8001||server.production.svc.cluster.local::172.31.49.195:8001::rq_active::0
...

Obviously, this example comes from a k8s cluster, more specifically from an istio-proxy container. A few notes about the output. First, some rows have an IP address while others do not. The former describe detailed per-host information; the code is here. The latter describe cluster-wide information; the code is here.

Second, the IP addresses 172.31.48.187 and 172.31.49.195 are not the service address but the corresponding pod addresses. In the example above, the server service in the production namespace has many backing pods, so we see many different addresses in the output. At runtime Envoy load balances across these available endpoints. If there is no pod backing the service, then there is no IP address in the output and routing will fail.
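
The same endpoint also supports JSON output, which is easier to post-process. As a sketch (the jq filter is just one way to slice it), this prints each cluster name together with its per-host addresses:

$ curl -s 'http://localhost:15000/clusters?format=json' \
    | jq -r '.cluster_statuses[] | .name as $c
             | .host_statuses[]?.address.socket_address
             | "\($c) \(.address):\(.port_value)"'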

Where does this cluster/host info come from? It comes from the CDS (Cluster Discovery Service) subscription. For more details, see Istio’s notes.
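
You can also inspect exactly what CDS delivered through the /config_dump admin endpoint; for example, to list the names of the dynamically received clusters (the jq expression simply filters the cluster section out of the dump):

$ curl -s http://localhost:15000/config_dump \
    | jq -r '.configs[]
             | select(."@type" | endswith("ClustersConfigDump"))
             | .dynamic_active_clusters[]?.cluster.name'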

I once had doubts about the IP addresses returned by this /clusters endpoint: are they the same IP addresses Envoy uses to build connections to the upstream servers? They are. See the code.

How to debug Envoy?

Envoy has a set of admin endpoints for debugging purposes. It even has an admin web UI (see the doc), so we can inspect its internal state.
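
A few endpoints I reach for most often (15000 is the istio-proxy admin port used throughout this post; plain Envoy uses whatever port the bootstrap's admin section configures):

$ curl http://localhost:15000/server_info    # version, uptime, command line
$ curl http://localhost:15000/stats          # counters, gauges, histograms
$ curl http://localhost:15000/listeners      # active listeners
$ curl http://localhost:15000/config_dump    # the full effective configuration
$ curl -X POST 'http://localhost:15000/logging?level=debug'   # raise the log level at runtime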
