Istio — Impacts of namespace sameness with traffic management in a Multi-Cluster environment

Patrick Picard
Published in ITNEXT · Jul 18, 2022

In a recent blog, I covered how global MeshConfig settings can be configured in Anthos Service Mesh. Now it is time to talk about the story that led me there!

We currently run a multi-cluster environment and recently enabled cross-cluster service discovery by adding the Istio remote secrets to the clusters. Once that was accomplished, we started seeing odd behavior in some services, such as ArgoCD and Vault. We run ArgoCD in a Hot/Warm setup and use the same namespace name in both clusters (hint hint). Vault has performance replicas in each regional cluster and uses the same namespace name as well (hint hint x2).
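For context, cross-cluster endpoint discovery in Istio is typically enabled by creating a remote secret for each cluster and applying it to the others. The sketch below shows the upstream istioctl approach with hypothetical kube context names; ASM's own tooling may wrap this step differently.

```sh
# Hypothetical kube contexts for the two regional clusters
export CTX_PRIMARY=gke_my-project_us-east1_primary
export CTX_STANDBY=gke_my-project_us-west1_standby

# Give the primary cluster's control plane read access to the standby
# cluster's API server, so it can discover that cluster's endpoints.
istioctl create-remote-secret \
  --context="${CTX_STANDBY}" \
  --name=standby | \
  kubectl apply -f - --context="${CTX_PRIMARY}"

# Repeat in the other direction for bidirectional discovery.
istioctl create-remote-secret \
  --context="${CTX_PRIMARY}" \
  --name=primary | \
  kubectl apply -f - --context="${CTX_STANDBY}"
```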

For ArgoCD, the odd behavior was related to Dex, which we use for single sign-on. After enabling service discovery, the Dex page would load and accept your credentials, but upon redirect it would not recognize the login, leaving us in a login loop. Being the primary SME on Argo and having made the service discovery change, I had some of the "change parameters" under my control. So I searched online to see if others had similar issues, dug and dug, but didn't find much except this snippet:

Namespace sameness

That made a light bulb pop over my head, and I started looking into whether namespace sameness was at the root of my problems. At first, I scaled the Dex deployment down to zero replicas on the standby cluster… and morale improved! Good news, as I was getting somewhere. Users were able to log in again and monitor their application deployments.
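The scale-down itself is a one-liner. The deployment name and namespace below are the usual ArgoCD defaults, and the context variable is the hypothetical one from the earlier sketch; adjust both to match your installation.

```sh
# On the standby cluster: take Dex out of the picture temporarily.
# "argocd-dex-server" in the "argocd" namespace is the common default name.
kubectl --context="${CTX_STANDBY}" -n argocd \
  scale deployment argocd-dex-server --replicas=0
```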

As I continued investigating in Cloud Logging (we are on GKE), I also noticed an increase in errors from the Redis deployment used by ArgoCD (we run Redis in an HA configuration). The errors indicated membership flapping, which also caught my attention. I repeated the scale-down-to-zero procedure on the standby cluster against Redis and all other components… morale improved further.
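The same approach applies to the Redis HA workloads. ArgoCD's Redis HA setup typically runs a StatefulSet plus an HAProxy Deployment; the names below are the common defaults and may differ in your cluster, so verify them first.

```sh
# On the standby cluster: quiesce the Redis HA members as well.
# Check the actual workload names before scaling:
#   kubectl -n argocd get statefulsets,deployments
kubectl --context="${CTX_STANDBY}" -n argocd \
  scale statefulset argocd-redis-ha-server --replicas=0
kubectl --context="${CTX_STANDBY}" -n argocd \
  scale deployment argocd-redis-ha-haproxy --replicas=0
```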

Re-reading the namespace sameness snippet, it became evident that the behavior change was related. Putting my networking hat on, I recalled that Istio effectively hijacks your network under the covers with the injected Envoy proxy. The topology is no longer a matter of looking at the Kubernetes Service endpoints; it's the Istio topology that rules the world, and it sees beyond the local cluster! Looking at it from Istio's point of view, I saw the following:

```sh
istioctl proxy-config endpoints deploy/istio-ingressgateway.istio-ingress | grep dex
```

(Screenshot) Istio's topology for Dex, with endpoints reaching beyond the local cluster

Well, that made sense, so now I had to figure out an approach to make this permanent. A few things came to mind:

  • (Second best option) Break the namespace sameness — have primary and standby instances with different namespace names
  • Live with namespace sameness, but make the services “independent” by renaming services and other resources to have different names across the clusters (a lot more configuration!)
  • Keep all replicas scaled to zero on the standby cluster
  • (Selected) Attack the problem from Istio’s Mesh perspective

In the end, we opted to attack the problem at Istio's level by marking a few namespaces as cluster-local. This effectively disables cross-cluster service discovery for those namespaces, so their services resolve only within the local cluster. Components that benefitted from this include ArgoCD, Vault, Dynatrace, External Secrets, and I'm sure more will follow.
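The setting lives in Istio's MeshConfig under serviceSettings. The sketch below shows the shape of the configuration as documented for upstream Istio; the ConfigMap name and the exact namespaces are illustrative assumptions (the name and revision depend on your ASM channel), so treat it as an illustration rather than a drop-in manifest.

```yaml
# Mark selected namespaces as cluster-local so their services are
# resolved only from endpoints in the same cluster.
# ConfigMap name/namespace vary by ASM revision (e.g. istio-asm-managed);
# "istio" in "istio-system" is the upstream Istio default.
apiVersion: v1
kind: ConfigMap
metadata:
  name: istio
  namespace: istio-system
data:
  mesh: |-
    serviceSettings:
      - settings:
          clusterLocal: true
        hosts:
          - "*.argocd.svc.cluster.local"
          - "*.vault.svc.cluster.local"
```

Once the change propagates, re-running the earlier istioctl proxy-config endpoints command should show only local endpoints for those namespaces.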

For the details on how to configure this Mesh-wide setting, I recommend checking my other blog entry Anthos Service Mesh — Configuring global settings.

Now that this crisis has been addressed, where does namespace sameness make sense? If you have a stateless application, or one whose state is externalized, and you run it across multiple clusters/regions, this feature is pretty good. For platform components, less so, especially for services where membership/quorum is at play (Redis HA members, Vault's Raft backend, etc.).

This was a painful few days of hypotheses, validation, searching for a fix, and working with Google Support. In the end, the fix was a single ConfigMap, which was easy to deploy to the clusters.
