gRPC stream is closed after 60 seconds of idle even with timeout annotations set #12434

Open
0x113 opened this issue Nov 29, 2024 · 9 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. priority/backlog Higher priority than priority/awaiting-more-evidence. triage/needs-information Indicates an issue needs more information in order to work on it.

Comments

@0x113

0x113 commented Nov 29, 2024

What happened:

The gRPC bi-directional stream is interrupted after 60 seconds of idle time, even though the necessary timeout annotations are set.
Annotations:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  annotations:
    nginx.ingress.kubernetes.io/proxy-connect-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "300"
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
    nginx.ingress.kubernetes.io/backend-protocol: GRPCS

I verified that these values are set correctly by execing into the pod and checking nginx.conf directly:

proxy_connect_timeout                   300s;
proxy_send_timeout                      300s;
proxy_read_timeout                      300s;
proxy_next_upstream                     error timeout;
proxy_next_upstream_timeout             0;
grpc_connect_timeout                    300s;
grpc_send_timeout                       300s;
grpc_read_timeout                       300s;

However, the bi-directional stream between the server and the agent is still closed after 60 seconds.

What you expected to happen:
I expected the stream to be closed after 5 minutes.

I think the default value of 60s is used whenever annotation values are greater than 60s. If I set these 3 annotations to a value less than 60, then the timeout is applied properly. For instance, I set it to "10" and the stream was interrupted after 10 seconds of idle.

NGINX Ingress controller version (exec into the pod and run /nginx-ingress-controller --version):

ingress-nginx-controller-6df48c5677-cjpgv:/etc/nginx$ /nginx-ingress-controller --version
-------------------------------------------------------------------------------
NGINX Ingress controller
  Release:       v1.11.3
  Build:         0106de65cfccb74405a6dfa7d9daffc6f0a6ef1a
  Repository:    https://github.com/kubernetes/ingress-nginx
  nginx version: nginx/1.25.5

-------------------------------------------------------------------------------

Kubernetes version (use kubectl version): v1.29.10

Environment:

  • Cloud provider or hardware configuration: Managed AKS

  • OS (e.g. from /etc/os-release):

  • Kernel (e.g. uname -a):

  • Install tools:

    • Helm
  • Basic cluster related info:

    • Managed AKS v1.29.10, Public Azure Cloud
  • How was the ingress-nginx-controller installed:

$ helm ls -A | grep -i ingress
ingress-nginx                   	ingress-nginx   	1       	2024-11-29 15:59:19.017422556 +0100 CET	deployed	ingress-nginx-4.11.3                                                     	1.11.3
$ helm -n ingress-nginx get values ingress-nginx
USER-SUPPLIED VALUES:
null
  • Current State of the controller:
$ kubectl describe ingressclasses
Name:         azure-application-gateway
Labels:       addonmanager.kubernetes.io/mode=Reconcile
              app=ingress-appgw
              app.kubernetes.io/component=controller
Annotations:  <none>
Controller:   azure/application-gateway
Events:       <none>

Name:         nginx
Labels:       app.kubernetes.io/component=controller
              app.kubernetes.io/instance=ingress-nginx
              app.kubernetes.io/managed-by=Helm
              app.kubernetes.io/name=ingress-nginx
              app.kubernetes.io/part-of=ingress-nginx
              app.kubernetes.io/version=1.11.3
              helm.sh/chart=ingress-nginx-4.11.3
Annotations:  meta.helm.sh/release-name: ingress-nginx
              meta.helm.sh/release-namespace: ingress-nginx
Controller:   k8s.io/ingress-nginx
Events:       <none>
$ kubectl -n ingress-nginx get all -o wide

NAME                                            READY   STATUS    RESTARTS   AGE    IP            NODE                                NOMINATED NODE   READINESS GATES
pod/ingress-nginx-controller-6df48c5677-cjpgv   1/1     Running   0          124m   10.244.2.16   aks-nodepool1-19682194-vmss000003   <none>           <none>

NAME                                         TYPE           CLUSTER-IP     EXTERNAL-IP      PORT(S)                      AGE    SELECTOR
service/ingress-nginx-controller             LoadBalancer   10.0.217.178   <redacted>   80:32223/TCP,443:31568/TCP   124m   app.kubernetes.io/component=controller,app.kubernetes.io/instance=ingress-nginx,app.kubernetes.io/name=ingress-nginx
service/ingress-nginx-controller-admission   ClusterIP      10.0.24.224    <none>           443/TCP                      124m   app.kubernetes.io/component=controller,app.kubernetes.io/instance=ingress-nginx,app.kubernetes.io/name=ingress-nginx

NAME                                       READY   UP-TO-DATE   AVAILABLE   AGE    CONTAINERS   IMAGES                                                                                                                     SELECTOR
deployment.apps/ingress-nginx-controller   1/1     1            1           124m   controller   registry.k8s.io/ingress-nginx/controller:v1.11.3@sha256:d56f135b6462cfc476447cfe564b83a45e8bb7da2774963b00d12161112270b7   app.kubernetes.io/component=controller,app.kubernetes.io/instance=ingress-nginx,app.kubernetes.io/name=ingress-nginx

NAME                                                  DESIRED   CURRENT   READY   AGE    CONTAINERS   IMAGES                                                                                                                     SELECTOR
replicaset.apps/ingress-nginx-controller-6df48c5677   1         1         1       124m   controller   registry.k8s.io/ingress-nginx/controller:v1.11.3@sha256:d56f135b6462cfc476447cfe564b83a45e8bb7da2774963b00d12161112270b7   app.kubernetes.io/component=controller,app.kubernetes.io/instance=ingress-nginx,app.kubernetes.io/name=ingress-nginx,pod-template-hash=6df48c5677
$ kubectl -n <ingresscontrollernamespace> describe po <ingresscontrollerpodname>

Name:             ingress-nginx-controller-6df48c5677-cjpgv
Namespace:        ingress-nginx
Priority:         0
Service Account:  ingress-nginx
Node:             aks-nodepool1-19682194-vmss000003/10.224.0.6
Start Time:       Fri, 29 Nov 2024 15:59:47 +0100
Labels:           app.kubernetes.io/component=controller
                  app.kubernetes.io/instance=ingress-nginx
                  app.kubernetes.io/managed-by=Helm
                  app.kubernetes.io/name=ingress-nginx
                  app.kubernetes.io/part-of=ingress-nginx
                  app.kubernetes.io/version=1.11.3
                  helm.sh/chart=ingress-nginx-4.11.3
                  pod-template-hash=6df48c5677
Annotations:      <none>
Status:           Running
IP:               10.244.2.16
IPs:
  IP:           10.244.2.16
Controlled By:  ReplicaSet/ingress-nginx-controller-6df48c5677
Containers:
  controller:
    Container ID:    containerd://c9fec8b39fbde912c1f7daf9e151fb32bc9fa4ab754b26908b873476f1a6d6a2
    Image:           registry.k8s.io/ingress-nginx/controller:v1.11.3@sha256:d56f135b6462cfc476447cfe564b83a45e8bb7da2774963b00d12161112270b7
    Image ID:        registry.k8s.io/ingress-nginx/controller@sha256:d56f135b6462cfc476447cfe564b83a45e8bb7da2774963b00d12161112270b7
    Ports:           80/TCP, 443/TCP, 8443/TCP
    Host Ports:      0/TCP, 0/TCP, 0/TCP
    SeccompProfile:  RuntimeDefault
    Args:
      /nginx-ingress-controller
      --publish-service=$(POD_NAMESPACE)/ingress-nginx-controller
      --election-id=ingress-nginx-leader
      --controller-class=k8s.io/ingress-nginx
      --ingress-class=nginx
      --configmap=$(POD_NAMESPACE)/ingress-nginx-controller
      --validating-webhook=:8443
      --validating-webhook-certificate=/usr/local/certificates/cert
      --validating-webhook-key=/usr/local/certificates/key
      --enable-metrics=false
    State:          Running
      Started:      Fri, 29 Nov 2024 15:59:56 +0100
    Ready:          True
    Restart Count:  0
    Requests:
      cpu:      100m
      memory:   90Mi
    Liveness:   http-get http://:10254/healthz delay=10s timeout=1s period=10s #success=1 #failure=5
    Readiness:  http-get http://:10254/healthz delay=10s timeout=1s period=10s #success=1 #failure=3
    Environment:
      POD_NAME:       ingress-nginx-controller-6df48c5677-cjpgv (v1:metadata.name)
      POD_NAMESPACE:  ingress-nginx (v1:metadata.namespace)
      LD_PRELOAD:     /usr/local/lib/libmimalloc.so
    Mounts:
      /usr/local/certificates/ from webhook-cert (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-vtb8k (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   True 
  Initialized                 True 
  Ready                       True 
  ContainersReady             True 
  PodScheduled                True 
Volumes:
  webhook-cert:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  ingress-nginx-admission
    Optional:    false
  kube-api-access-vtb8k:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              kubernetes.io/os=linux
Tolerations:                 node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type    Reason  Age                 From                      Message
  ----    ------  ----                ----                      -------
  Normal  RELOAD  13m (x6 over 125m)  nginx-ingress-controller  NGINX reload triggered due to a change in configuration
$ kubectl -n ingress-nginx describe svc ingress-nginx-controller

Name:                     ingress-nginx-controller
Namespace:                ingress-nginx
Labels:                   app.kubernetes.io/component=controller
                          app.kubernetes.io/instance=ingress-nginx
                          app.kubernetes.io/managed-by=Helm
                          app.kubernetes.io/name=ingress-nginx
                          app.kubernetes.io/part-of=ingress-nginx
                          app.kubernetes.io/version=1.11.3
                          helm.sh/chart=ingress-nginx-4.11.3
Annotations:              meta.helm.sh/release-name: ingress-nginx
                          meta.helm.sh/release-namespace: ingress-nginx
Selector:                 app.kubernetes.io/component=controller,app.kubernetes.io/instance=ingress-nginx,app.kubernetes.io/name=ingress-nginx
Type:                     LoadBalancer
IP Family Policy:         SingleStack
IP Families:              IPv4
IP:                       10.0.217.178
IPs:                      10.0.217.178
LoadBalancer Ingress:     <redacted>
Port:                     http  80/TCP
TargetPort:               http/TCP
NodePort:                 http  32223/TCP
Endpoints:                10.244.2.16:80
Port:                     https  443/TCP
TargetPort:               https/TCP
NodePort:                 https  31568/TCP
Endpoints:                10.244.2.16:443
Session Affinity:         None
External Traffic Policy:  Cluster
Events:
  Type    Reason               Age                From                Message
  ----    ------               ----               ----                -------
  Normal  UpdatedLoadBalancer  33m (x3 over 51m)  service-controller  Updated load balancer with new hosts
  • Current state of ingress object, if applicable:
$ kubectl describe ing -n <ns> <ing-name>
Name:             <ing-name>
Namespace:         <ns>
Address:         <redacted>
Ingress Class:    nginx
Default backend:  <default>
Rules:
  Host        Path  Backends
  ----        ----  --------
  *
                 envoy-grpcapi:443 (10.244.0.29:8080)
Annotations:  nginx.ingress.kubernetes.io/backend-protocol: GRPCS
              nginx.ingress.kubernetes.io/proxy-connect-timeout: 300
              nginx.ingress.kubernetes.io/proxy-read-timeout: 300
              nginx.ingress.kubernetes.io/proxy-send-timeout: 300
              nginx.ingress.kubernetes.io/ssl-redirect: true
Events:
  Type    Reason  Age                From                      Message
  ----    ------  ----               ----                      -------
  Normal  Sync    15m (x7 over 55m)  nginx-ingress-controller  Scheduled for sync

How to reproduce this issue:

Anything else we need to know:

@0x113 0x113 added the kind/bug Categorizes issue or PR as related to a bug. label Nov 29, 2024
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If Ingress contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority labels Nov 29, 2024
@longwuyuan
Contributor

/remove-kind bug

Can you please write a step-by-step guide that someone can copy and paste from to reproduce this on a kind cluster, including the gRPC application?

@k8s-ci-robot k8s-ci-robot added needs-kind Indicates a PR lacks a `kind/foo` label and requires one. and removed kind/bug Categorizes issue or PR as related to a bug. labels Nov 30, 2024
@0x113
Author

0x113 commented Nov 30, 2024

Sure. I will work on a sample app and share the details soon.

@Dunge

Dunge commented Dec 2, 2024

Try setting client-body-timeout

@strongjz
Member

strongjz commented Dec 4, 2024

/kind bug
/priority backlog
/triage needs-information

Let us know if setting the client body timeout works. I also see a client header timeout that should probably be set as well.
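For reference, both timeouts mentioned here are controller-wide ConfigMap options rather than Ingress annotations. The following is a minimal sketch, not a confirmed fix: the ConfigMap name and namespace are taken from the `--configmap=$(POD_NAMESPACE)/ingress-nginx-controller` argument in the report above, and the value of 300 seconds is an assumption chosen to match the Ingress annotations. Nobody in this thread has verified that it removes the 60-second cutoff.

```yaml
# Sketch only: raises the client-side timeouts for the whole controller.
# "300" is an illustrative value matching the proxy-*-timeout annotations above.
apiVersion: v1
kind: ConfigMap
metadata:
  name: ingress-nginx-controller
  namespace: ingress-nginx
data:
  client-body-timeout: "300"    # rendered as client_body_timeout (default 60s)
  client-header-timeout: "300"  # rendered as client_header_timeout (default 60s)
```

Since the controller in the report was installed with Helm, setting the same keys under `controller.config` in the chart values is likely the safer route, because a later `helm upgrade` would otherwise revert a hand-edited ConfigMap.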

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. priority/backlog Higher priority than priority/awaiting-more-evidence. triage/needs-information Indicates an issue needs more information in order to work on it. and removed needs-kind Indicates a PR lacks a `kind/foo` label and requires one. needs-priority labels Dec 4, 2024
@strongjz
Member

@0x113 are you still having issues? If not can you post the resolution and/or close the ticket?

Copy link

This is stale, but we won't close it automatically. Just bear in mind that the maintainers may be busy with other tasks and will get to your issue ASAP. If you have any questions or want to request prioritization, please reach out in #ingress-nginx-dev on Kubernetes Slack.

@github-actions github-actions bot added the lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. label Jan 21, 2025
@jbriones-lumenvox

jbriones-lumenvox commented Jan 25, 2025

I think I've been dealing with the same issue. Unfortunately I don't have a reproduction I can share, but I can share more details.

Brief context: we have been attempting to support idle gRPC streams up to 120s, but we have been experiencing timeouts at 60s.

I'm on a slightly older version of the controller:

NGINX Ingress controller
  Release:       v1.11.0
  Build:         96dea883d6ee3c6261b722896e8638758b7cc4cb
  Repository:    https://github.com/kubernetes/ingress-nginx
  nginx version: nginx/1.25.5

Annotations:

metadata:
  name: lumenvox-api-ingress-grpc
  namespace: {{ default .Release.Namespace .Values.global.defaultNamespace }}
  annotations:
    nginx.ingress.kubernetes.io/backend-protocol: "GRPC"
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
    nginx.ingress.kubernetes.io/proxy-body-size: "0"
    nginx.ingress.kubernetes.io/server-snippet: |
      grpc_read_timeout 120s;
      grpc_send_timeout 120s;
      client_body_timeout 120s;

Here's the server block that is generated from those annotations:

        ## start server lumenvox-api.testmachine.com
        server {
                server_name lumenvox-api.testmachine.com ;

                http2 on;

                listen 80  ;
                listen [::]:80  ;
                listen 443  ssl;
                listen [::]:443  ssl;

                set $proxy_upstream_name "-";

                ssl_certificate_by_lua_block {
                        certificate.call()
                }

                # Custom code snippet configured for host lumenvox-api.testmachine.com
                grpc_read_timeout 120s;
                grpc_send_timeout 120s;
                client_body_timeout 120s;

                location / {

                        set $namespace      "lumenvox";
                        set $ingress_name   "lumenvox-api-ingress-grpc";
                        set $service_name   "lumenvox-api-service";
                        set $service_port   "grpc";
                        set $location_path  "/";
                        set $global_rate_limit_exceeding n;

                        rewrite_by_lua_block {
                                lua_ingress.rewrite({
                                        force_ssl_redirect = false,
                                        ssl_redirect = true,
                                        force_no_ssl_redirect = false,
                                        preserve_trailing_slash = false,
                                        use_port_in_redirects = false,
                                        global_throttle = { namespace = "", limit = 0, window_size = 0, key = { }, ignored_cidrs = { } },
                                })
                                balancer.rewrite()
                                plugins.run()
                        }

                        # be careful with `access_by_lua_block` and `satisfy any` directives as satisfy any
                        # will always succeed when there's `access_by_lua_block` that does not have any lua code doing `ngx.exit(ngx.DECLINED)`
                        # other authentication method such as basic auth or external auth useless - all requests will be allowed.
                        #access_by_lua_block {
                        #}

                        header_filter_by_lua_block {
                                lua_ingress.header()
                                plugins.run()
                        }

                        body_filter_by_lua_block {
                                plugins.run()
                        }

                        log_by_lua_block {
                                balancer.log()

                                plugins.run()
                        }

                        port_in_redirect off;

                        set $balancer_ewma_score -1;
                        set $proxy_upstream_name "lumenvox-lumenvox-api-service-grpc";
                        set $proxy_host          $proxy_upstream_name;
                        set $pass_access_scheme  $scheme;

                        set $pass_server_port    $server_port;

                        set $best_http_host      $http_host;
                        set $pass_port           $pass_server_port;

                        set $proxy_alternative_upstream_name "";

                        client_max_body_size                    0;

                        grpc_set_header Host                   $best_http_host;

                        # Pass the extracted client certificate to the backend

                        # Allow websocket connections
                        grpc_set_header                        Upgrade           $http_upgrade;

                        grpc_set_header                        Connection        $connection_upgrade;

                        grpc_set_header X-Request-ID           $req_id;
                        grpc_set_header X-Real-IP              $remote_addr;

                        grpc_set_header X-Forwarded-For        $remote_addr;

                        grpc_set_header X-Forwarded-Host       $best_http_host;
                        grpc_set_header X-Forwarded-Port       $pass_port;
                        grpc_set_header X-Forwarded-Proto      $pass_access_scheme;
                        grpc_set_header X-Forwarded-Scheme     $pass_access_scheme;

                        grpc_set_header X-Scheme               $pass_access_scheme;

                        # Pass the original X-Forwarded-For
                        grpc_set_header X-Original-Forwarded-For $http_x_forwarded_for;

                        # mitigate HTTPoxy Vulnerability
                        # https://www.nginx.com/blog/mitigating-the-httpoxy-vulnerability-with-nginx/
                        grpc_set_header Proxy                  "";

                        # Custom headers to proxied server

                        proxy_connect_timeout                   5s;
                        proxy_send_timeout                      60s;
                        proxy_read_timeout                      60s;

                        proxy_buffering                         off;
                        proxy_buffer_size                       4k;
                        proxy_buffers                           4 4k;

                        proxy_max_temp_file_size                1024m;

                        proxy_request_buffering                 on;
                        proxy_http_version                      1.1;

                        proxy_cookie_domain                     off;
                        proxy_cookie_path                       off;

                        # In case of errors try the next upstream server before returning an error
                        proxy_next_upstream                     error timeout;
                        proxy_next_upstream_timeout             0;
                        proxy_next_upstream_tries               3;

                        # Grpc settings
                        grpc_connect_timeout                    5s;
                        grpc_send_timeout                       60s;
                        grpc_read_timeout                       60s;

                        # Custom Response Headers

                        grpc_pass grpc://upstream_balancer;

                        proxy_redirect                          off;

                }

        }
        ## end server lumenvox-api.testmachine.com

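The snippet from our annotations can be seen with the correct values just above the start of the location block. However, near the end of the location block, those same timeouts are set to 60s. Because nginx applies a directive set at the location level in preference to the same directive inherited from the server level, I believe the 60s values are overriding the snippet from our annotations.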

I've been able to support longer idle streams by manually updating those values:

  1. open a bash shell in the ingress-nginx-controller pod
  2. edit nginx.conf to change grpc_read_timeout and grpc_send_timeout in the location block from 60s to 120s
  3. run nginx -s reload

However, when the pod restarts, those values are reset back to 60s, and the longer streams start to fail again.

I also tried using the nginx.ingress.kubernetes.io/configuration-snippet annotation:

metadata:
  annotations:
    meta.helm.sh/release-name: lumenvox
    meta.helm.sh/release-namespace: lumenvox
    nginx.ingress.kubernetes.io/backend-protocol: GRPC
    nginx.ingress.kubernetes.io/proxy-body-size: "0"
    nginx.ingress.kubernetes.io/server-snippet: |
      grpc_read_timeout 120s;
      grpc_send_timeout 120s;
      client_body_timeout 120s;
    nginx.ingress.kubernetes.io/configuration-snippet: |
      grpc_read_timeout 120s;
      grpc_send_timeout 120s;
      client_body_timeout 120s;
    nginx.ingress.kubernetes.io/ssl-redirect: "true"

However, this causes an error:

error: ingresses.networking.k8s.io "lumenvox-api-ingress-grpc" could not be patched: admission webhook "validate.nginx.ingress.kubernetes.io" denied the request:
-------------------------------------------------------------------------------
Error: exit status 1
2025/01/25 00:50:38 [emerg] 315#315: "grpc_read_timeout" directive is duplicate in /tmp/nginx/nginx-cfg1038597644:978
nginx: [emerg] "grpc_read_timeout" directive is duplicate in /tmp/nginx/nginx-cfg1038597644:978
nginx: configuration file /tmp/nginx/nginx-cfg1038597644 test failed

-------------------------------------------------------------------------------

So to sum up, it looks like the nginx-ingress-controller pod is autogenerating the configuration, and in that auto-generated configuration, there are values that override the grpc-related timeouts. This can be manually fixed with the steps I described, but I haven't been able to find a good long-term solution.

@Dunge

Dunge commented Jan 25, 2025

I had better luck with proxy_send_timeout/proxy_read_timeout (and the body timeout) than with the gRPC-specific ones.

Also, server snippets are now disabled by default, so you should use the annotations (and the ConfigMap for the body timeout) instead.
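As a rough sketch of that suggestion applied to the Ingress metadata posted earlier: the snippet is dropped and the timeouts are set through the proxy timeout annotations. This relies on an observation from the first report in this thread, where `proxy-read-timeout` and `proxy-send-timeout` annotations of 300 produced both `proxy_*` and `grpc_*` timeouts of 300s in the generated nginx.conf. It is assumed, not verified, that v1.11.0 behaves the same way, and the value of 120 is taken from the stated goal of supporting 120s idle streams.

```yaml
# Sketch only: annotation-based replacement for the server-snippet above.
metadata:
  name: lumenvox-api-ingress-grpc
  annotations:
    nginx.ingress.kubernetes.io/backend-protocol: "GRPC"
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
    nginx.ingress.kubernetes.io/proxy-body-size: "0"
    # In the first report these also set grpc_read_timeout / grpc_send_timeout
    # inside the generated location block.
    nginx.ingress.kubernetes.io/proxy-read-timeout: "120"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "120"
```

The `client_body_timeout` part of the snippet has no per-Ingress annotation in this sketch; it would go in the controller ConfigMap under the `client-body-timeout` key, which applies to every Ingress served by that controller.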

@github-actions github-actions bot removed the lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. label Jan 25, 2025