gRPC connection happens on the wrong subnet when using network segmentation

I’m not sure whether this belongs here (if it’s a configuration error on my part) or in the GitHub issues (if it’s a bug), but I figured I’d try here first…

Steps to reproduce

These are the steps to reproduce from a fully clean, publicly accessible system with Docker and a few other common dependencies (like git) installed, internet access, and ports 80 and 443 exposed. I’m not sure how much of it is relevant, but here you go…

  1. Set up a standard Traefik configuration along the lines of this config [1]. I use DigitalOcean for my DNS, but one could also use standalone/HTTP verification for ACME and set the DNS records manually.
  2. Clone my configuration and checkout the relevant branch
    $ git clone https://git.tams.tech/TWS/ocis-deployment.git
    $ cd ocis-deployment
    $ git checkout feature/office-suite
    
  3. Run the initialization commands
    $ mkdir -p mounts/{config,data}
    $ docker compose run init
    $ sh gen-secrets.sh
    $ sh dns.sh  # if using DigitalOcean DNS; requires `doctl auth init` to have been run on the machine once beforehand
    
  4. Start the service
    $ docker compose up
    

To put it another way…

Since this basically amounts to “go out to some other website and deploy the configuration”, I’ll summarize the relevant, problematic aspect of the configuration here as more generalized “steps”:

  1. Get a working OCIS deployment, behind a Traefik (or other) reverse proxy
  2. Set the $GATEWAY_GRPC_ADDR environment variable to 0.0.0.0:9142 on the ocis service
  3. Add a private network to connect the app provider with OCIS
  4. Add an app-provider configuration, as laid out in this example [2] in the OCIS repo, pointing it at your existing ocis container via DNS. Connect the private app-provider service and the OCIS service to the new app-provider network, not the network Traefik is connected to. (A minimal compose sketch of this setup follows below.)
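
For reference, here is a minimal compose sketch of what steps 2–4 amount to. The image, command, and network names are illustrative rather than copied verbatim from my repo; the app-provider command follows the linked wopi example [2]:

services:
  ocis:
    image: owncloud/ocis:latest
    environment:
      # step 2: listen on all interfaces so the app provider can reach the gateway
      GATEWAY_GRPC_ADDR: "0.0.0.0:9142"
    networks:
      - web               # network Traefik is attached to
      - app-provider-net  # step 3: private network shared with the app provider

  app-provider:
    image: owncloud/ocis:latest
    command: app-provider server
    networks:
      - app-provider-net  # step 4: deliberately NOT attached to the Traefik network

networks:
  web:
    external: true
  app-provider-net: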

Expected behaviour

All services start and eventually stabilize to a state where they are running without error

Actual behaviour

The following error message is produced regularly (approximately every 20 seconds) for as long as the service remains up, even after all the other services have restarted enough times to resolve their dependency errors.

The raw output (hard to read):
ocis-app-provider-1  | {"level":"error","pid":1,"error":"rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp 172.18.0.12:9142: i/o timeout\"","time":"2023-09-19T16:25:18.957971967Z","caller":"github.com/cs3org/reva/v2@v2.16.1-0.20230911153145-a2e2320f3448/internal/grpc/services/appprovider/appprovider.go:164","message":"error registering app provider: error calling add app provider"}
The same message, reformatted to be easier to read:
{
  "level": "error",
  "pid": 1,
  "error": "rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp 172.18.0.12:9142: i/o timeout\"",
  "time": "2023-09-19T16:25:18.957971967Z",
  "caller": "github.com/cs3org/reva/v2@v2.16.1-0.20230911153145-a2e2320f3448/internal/grpc/services/appprovider/appprovider.go:164",
  "message": "error registering app provider: error calling add app provider"
}

Server configuration

Operating system: NixOS+Docker

Web server: N/A (but using Traefik reverse proxy)

Database: postgres

PHP version: N/A

ownCloud version: OCIS latest (image sha256: 8048918ad590a5d02218527abee0570e04e9172776bb90db2c6b83334565d106)

Updated from an older ownCloud or fresh install: Fresh install

Where did you install ownCloud from: docker hub

The content of config/config.php:
N/A (OCIS doesn’t have anything.php)

List of activated apps:
N/A

Are you using external storage, if yes which one: local

Are you using encryption: no

Are you using an external user-backend, if yes which one: no

Client configuration

Browser:

Operating system:

Logs

Web server error log

See above for relevant section

ownCloud log (data/owncloud.log)

N/A

Browser log

N/A

More information

If we inspect the app-provider network configuration…

$ docker inspect ocis-app-provider-1 | jq -r '.[] | .NetworkSettings.Networks | .[] | .IPAddress'
172.26.0.5

We can see the problem: the subnet on which app-provider is trying to reach the ocis container is not one the app-provider container is attached to. Sure enough, if we inspect the ocis container:

$ docker inspect ocis-ocis-1 | jq -r '.[] | .NetworkSettings.Networks | .[] | .IPAddress'
172.26.0.4
172.27.0.3
172.18.0.12

We can see that the ocis container is indeed on the app-provider-net network, but it is also on the web network, which is the subnet the app-provider container is trying to reach it on. This suggests that either the mDNS/service registry system [3] only reports the IP address of the web network, or the client only tries the first IP it gets in response to the mDNS query and discards the others. I don’t know much about how mDNS works. I did try a bit of spelunking in the code, but didn’t find anything I understood.
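
For what it’s worth, the name-to-address mapping can be read out directly with a slightly extended jq filter (same container as above), if you want to confirm which network each address belongs to:

$ docker inspect ocis-ocis-1 | jq -r '.[] | .NetworkSettings.Networks | to_entries[] | "\(.key): \(.value.IPAddress)"'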

Even more frustratingly, within the relevant containers, dig returns the IP addresses on the common subnet for both containers:

$ docker compose run -u 0 --entrypoint sh ocis
[+] Building 0.0s (0/0)                                                                                           
[+] Creating 1/0
 ✔ Container ocis-search-engine-1  Running                                                                   0.0s 
[+] Building 0.0s (0/0)                                                                                           
# apk add --quiet bind-tools
# dig +short ocis
172.26.0.4
# dig +short app-provider
172.26.0.5
$ docker compose run -u 0 --entrypoint sh app-provider
[+] Building 0.0s (0/0)                                                                                           
[+] Creating 2/0
 ✔ Container ocis-search-engine-1  Running                                                                   0.0s 
 ✔ Container ocis-ocis-1           Running                                                                   0.0s 
[+] Building 0.0s (0/0)                                                                                           
# apk add --quiet bind-tools
# dig +short ocis
172.26.0.4
# dig +short app-provider
172.26.0.5
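
For anyone reproducing this, a quick way to check that the gateway port is actually reachable over the shared subnet (assuming the image is Alpine-based, which the apk usage above suggests) is a plain TCP probe from inside the app-provider container:

# apk add --quiet netcat-openbsd
# nc -zv ocis 9142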

See also

The pull request in the config repo: https://git.tams.tech/TWS/ocis-deployment/pulls/1

Footnotes/relevant links

Since Discourse didn’t let me post links inline, I had to put them here as footnotes and obfuscate them.

[1]: https://git.tams.tech/TWS/traefik-config
[2]: https://github.com/owncloud/ocis/blob/3ba6229add9edb6dc99e8733272f15accdcdbbb3/deployments/examples/ocis_wopi/docker-compose.yml#L103-L129
[3]: https://github.com/owncloud/ocis/blob/b0ac9840dff00a2527b2e8df86bebcd12632104c/ocis/README.md

I am not sure; this might be an issue with Docker. I have witnessed similar things with other (non-OCIS) deployments: if you have two networks defined, Docker gets confused and always picks the wrong one… If I read it correctly, you have at least two there, one named ocis-net from our example and another one from your Traefik deployment. Could you try to put them in the same network as an experiment?
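
Something along these lines, i.e. dropping the extra app-provider network and attaching both services only to the network Traefik uses (names here are just an illustration):

services:
  ocis:
    networks:
      - web   # the single network Traefik is attached to
  app-provider:
    networks:
      - web   # same network, no separate app-provider-net

networks:
  web:
    external: true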

Sorry it took me so long to respond; many plates to keep spinning :sweat_smile:

Setting all services to be connected only to the one network the reverse proxy is on does indeed make the DNS issue go away. Now I’m getting a different error:

{
  "level": "error",
  "service": "app-provider",
  "error": "unable to register services: rgrpc: grpc service appprovider could not be started,: Application server at https://office.ocis-test.tams.tech does not match this AppProvider for Collabora",
  "time": "2023-10-10T09:57:21.672679598Z",
  "message": "error starting the grpc server"
}

Also, I would still much prefer a solution that allows me to maintain network segmentation.