SSH connectivity issues #498

Open · jrudolph opened this issue Dec 13, 2024 · 57 comments

@jrudolph
Contributor

I recently started having issues with SSH connectivity (again, but this time also with my existing cluster) and looked into it in more detail. This may or may not be a duplicate of / related to #443 and #415, but I wanted to document my findings in case someone else runs into similar problems.

For me the connectivity issues were a complex mix of different things that overlapped each other. All of them led to hetzner-k3s hanging (silently) at "Waiting for successful ssh connectivity".

  1. most importantly, I had recently used another SSH key for a different project and ssh-agent provided the wrong one to hetzner-k3s => in that case, use_agent: true ignores the configured keys, uses the wrong ones, and fails silently
  2. use_agent: false also failed silently, because the key pair was protected by a passphrase (this is documented somewhere, but still a caveat)
  3. with 2.0.x, the master setup does not work out of the box for me because the Hetzner firewall configuration is only created after the master has been set up, so there's no connectivity on port 22 (that's probably caused by having existing firewalls defined in the Hetzner Cloud project)
  4. the retry logic in hetzner-k3s hammers the SSH port with requests (~1/s?), so the SSH server will by default start blocking your IP pretty quickly

Here are the workarounds I used:

  1. with use_agent: true, use ssh-add -D to clear the agent and ssh-add <key> to add the right key (see the check after this list)
  2. with use_agent: false, use a key without a passphrase (not recommended)
  3. set up a custom firewall rule in Hetzner Cloud that matches the cluster label to open port 22 even before hetzner-k3s creates the firewall rules; using a completely clean Hetzner Cloud project might also work
  4. if connecting to the server from the command line is broken as well, use a VPN, restart the server, or restart the sshd process on the nodes (if you can still connect via console)
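
A quick way to confirm which key the agent is actually offering (key paths below are placeholders):

ssh-add -l                        # fingerprints of all keys currently loaded in the agent
ssh-keygen -lf ~/.ssh/my-key.pub  # fingerprint of the key configured for hetzner-k3s

# if the fingerprints don't match, reset the agent and load the right key:
ssh-add -D
ssh-add ~/.ssh/my-key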

My main debugging tool in the end was to log in to a node, set a debug logging level in /etc/ssh/sshd_config, and observe the logs during the connection attempts (this shows blocking, wrong keys, etc.).
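
Roughly, that workflow looks like this (a sketch assuming an Ubuntu/Debian node; adjust paths and service names for other images):

# on the node: raise sshd verbosity (DEBUG3 is the most verbose level)
sudo sed -i 's/^#\?LogLevel.*/LogLevel DEBUG3/' /etc/ssh/sshd_config
sudo systemctl restart ssh

# then watch authentication attempts while hetzner-k3s retries
sudo tail -f /var/log/auth.log    # or: journalctl -u ssh -f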

Suggestions for hetzner-k3s:

  1. with use_agent: true:
    • warn if ssh-agent provides a different key than the one declared
    • surface more detailed error messages from libssh2 (I tried but failed; I'm not a Crystal developer)
  2. with use_agent: false:
    • see 1.
    • explicitly warn when the provided key is passphrase-protected (e.g. check the private key file for encryption headers; see the sketch after this list)
  3. Create firewall rules before/with the servers to be more resilient against existing firewall rules in the project
  4. Be less aggressive with retries when trying to contact SSH, or fail early, e.g. on auth issues
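
For suggestion 2, one possible heuristic (a sketch, not hetzner-k3s code; the key path is hypothetical) is to try deriving the public key with an empty passphrase. Grepping for the "ENCRYPTED" PEM header only works for old-style keys, while this check also covers the newer OpenSSH key format:

KEY=~/.ssh/my-cluster-key   # hypothetical path to the configured private key

# ssh-keygen -y prints the public key; with -P "" it fails if a passphrase is required
if ssh-keygen -y -P "" -f "$KEY" >/dev/null 2>&1; then
  echo "key is not passphrase-protected"
else
  echo "key appears to be passphrase-protected (or unreadable)"
fi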
@vitobotta
Owner

Thanks for sharing your experience!

Just to clear things up about the firewall: by default, when there's no firewall attached to a server, all ports are open. So, as long as you don't have an existing firewall blocking the selected SSH port, everything should be fine.

You're right about the SSH keys, though. That said, I create clusters regularly and haven't run into any issues with SSH connectivity yet, whether due to keys or anything else.

I'll check if it's possible to use the SSH shard for Crystal to verify that the key used by the agent matches the one in your config. I'll also see if it can detect whether the key is protected by a passphrase.

Regarding the firewall, you shouldn't set up an additional one in your Hetzner project. Ideally, the project should be solely dedicated to the cluster managed by hetzner-k3s. I'll make sure this information is clearer in the docs.
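
If you want to double-check that nothing in the project could interfere, and assuming you have the hcloud CLI installed, something like this shows any pre-existing firewalls:

hcloud firewall list
hcloud firewall describe <name-or-id>   # inspect the rules of a specific firewall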

@MarcelHaldimann

I think I have the same problem with a simple example.

After the master node has been created I am able to connect with an SSH client using my private key without a problem.

hetzner_token: <<API-KEY>>
cluster_name: test
kubeconfig_path: "./kubeconfig"
k3s_version: v1.30.8+k3s1 

networking:
  ssh:
    port: 22
    use_agent: true # set to true if your key has a passphrase
    public_key_path: "./Documents/ssh/gl-new/gl-root.pub"
    private_key_path: "./Documents/ssh/gl-new/gl-root"
  allowed_networks:
    ssh:
      - 0.0.0.0/0
    api: # this will firewall port 6443 on the nodes
      - 0.0.0.0/0
  public_network:
    ipv4: true
    ipv6: true
  private_network:
    enabled: true
    subnet: 10.0.0.0/16
    existing_network_name: ""
  cni:
    enabled: true
    encryption: false
    mode: flannel

  # cluster_cidr: 10.244.0.0/16 # optional: a custom IPv4/IPv6 network CIDR to use for pod IPs
  # service_cidr: 10.43.0.0/16 # optional: a custom IPv4/IPv6 network CIDR to use for service IPs. Warning, if you change this, you should also change cluster_dns!
  # cluster_dns: 10.43.0.10 # optional: IPv4 Cluster IP for coredns service. Needs to be an address from the service_cidr range


manifests:
  cloud_controller_manager_manifest_url: "https://github.com/hetznercloud/hcloud-cloud-controller-manager/releases/download/v1.21.0/ccm-networks.yaml"
  csi_driver_manifest_url: "https://raw.githubusercontent.com/hetznercloud/csi-driver/v2.11.0/deploy/kubernetes/hcloud-csi.yml"
#   system_upgrade_controller_deployment_manifest_url: "https://github.com/rancher/system-upgrade-controller/releases/download/v0.13.4/system-upgrade-controller.yaml"
#   system_upgrade_controller_crd_manifest_url: "https://github.com/rancher/system-upgrade-controller/releases/download/v0.13.4/crd.yaml"
#   cluster_autoscaler_manifest_url: "https://raw.githubusercontent.com/kubernetes/autoscaler/master/cluster-autoscaler/cloudprovider/hetzner/examples/cluster-autoscaler-run-on-master.yaml"

datastore:
  mode: etcd # etcd (default) or external
  #external_datastore_endpoint: postgres://....

schedule_workloads_on_masters: false

image: debian-12 # optional: default is ubuntu-24.04
# autoscaling_image: 103908130 # optional, defaults to the `image` setting
# snapshot_os: microos # optional: specified the os type when using a custom snapshot

masters_pool:
  instance_type: cx22
  instance_count: 1
  location: nbg1

worker_node_pools:
- name: test-node-pool
  instance_type: cx22
  instance_count: 3
  location: nbg1
  # image: debian-11
  # labels:
  #   - key: purpose
  #     value: blah
  # taints:
  #   - key: something
  #     value: value1:NoSchedule
# - name: medium-autoscaled
#   instance_type: cpx31
#   instance_count: 2
#   location: nbg1
#   autoscaling:
#     enabled: true
#     min_instances: 0
#     max_instances: 3

embedded_registry_mirror:
  enabled: false # Check if your k3s version is compatible before enabling this option. You can find more information at https://docs.k3s.io/installation/registry-mirror

additional_packages:
- htop

post_create_commands:
- apt update
- apt upgrade -y
- apt autoremove -y

Log output:

mh@ione56 hetzner-k8s % hetzner-k3s create --config ./cluster-config.yml
[Configuration] Validating configuration...
[Configuration] ...configuration seems valid.
[Private Network] Creating private network...
[Private Network] ...private network created
[SSH key] Creating SSH key...
[SSH key] ...SSH key created
[Placement groups] Creating placement group test-masters...
[Placement groups] ...placement group test-masters created
[Placement groups] Creating placement group test-test-node-pool-2...
[Placement groups] ...placement group test-test-node-pool-2 created
[Instance test-master1] Creating instance test-master1 (attempt 1)...
[Instance test-master1] Instance status: starting
[Instance test-master1] Powering on instance (attempt 1)
[Instance test-master1] Waiting for instance to be powered on...
[Instance test-master1] Instance status: running
[Instance test-master1] Waiting for successful ssh connectivity with instance test-master1...
[Instance test-master1] Instance test-master1 already exists, skipping create
[Instance test-master1] Instance status: running
[Instance test-master1] Waiting for successful ssh connectivity with instance test-master1...
[Instance test-master1] Instance test-master1 already exists, skipping create
[Instance test-master1] Instance status: running
[Instance test-master1] Waiting for successful ssh connectivity with instance test-master1...
Error creating instance: timeout after 00:01:00
Instance creation for test-master1 failed. Try rerunning the create command.

Let me know if you need more information or more logs.

Environment

I installed it with brew on macOS Sonoma (14.6.1); after upgrading to Sequoia (15.2) it's still the same problem.

Tested with:
image: debian-12
and
image: ubuntu-24.04

Also with different k3s_version values.

mh@ione56 hetzner-k8s % hetzner-k3s --version
2.0.9

Is there something wrong in my config?

thanks in advance
Marcel

@vitobotta
Owner

@MarcelHaldimann do you have a passphrase on your key?

@MarcelHaldimann

@vitobotta
Yes I do.

@vitobotta
Owner

> @vitobotta Yes I do.

Did you add the SSH key to Keychain?

@MarcelHaldimann

Wow I am an idiot. Now it works like a charm! Thank you!

ssh-add --apple-use-keychain ~/Documents/ssh/gl-new/gl-root

I thought I would be asked for the password.

Is there more documentation than in this example?

Thanks for the work and the support!

@vitobotta
Owner

> Is there more documentation than in this example?

Good point, looks like I forgot to add a mention about this for macOS. Would you mind making a small PR for this? :)

@FernandoJCa

Hi! I'm having the same problem. Using an ssh key with no passphrase. My main difference is that I'm running it from a pipeline.

[screenshot]

It's a dockerfile I created previously.

The first time I ran the same cluster config on a pipeline it worked perfectly without any problems. But this time I get this ssh error

Any idea what could be happening?

@vitobotta
Owner

Hi @FernandoJCa

It's difficult to give you suggestions without knowing more about the setup. What kind of pipeline is it, and how do you supply the key for use with hetzner-k3s?

@FernandoJCa

FernandoJCa commented Jan 25, 2025

Hello @vitobotta

It's a Gitlab CI pipeline.

---

stages:
  - deploy
  - delete
default:
  tags:
    - gitlab-org

before_script:
  - apk update && apk upgrade && apk add openssh-client
  - eval $(ssh-agent -s)
  - chmod 400 "$SSH_PRIVATE_KEY"
  - ssh-add "$SSH_PRIVATE_KEY"
  - mkdir -p ~/.ssh
  - chmod 700 ~/.ssh
  - mv $SSH_PRIVATE_KEY ~/.ssh/id_ed25519
  - mv $SSH_PUBLIC_KEY ~/.ssh/id_ed25519.pub
  - ls ~/.ssh/

deploy-cluster:
  image: registry.gitlab.com/snoopy-group/hetzner-cli-docker:v1.0.1
  stage: deploy
  script:
    - hetzner-k3s create --config cluster_config.yaml | tee create.log
  artifacts:
    paths:
      - kubeconfig.yaml
    expire_in: 2 h

delete-cluster:
  image: registry.gitlab.com/snoopy-group/hetzner-cli-docker:v1.0.1
  stage: delete
  script:
    - hetzner-k3s delete --config cluster_config.yaml | tee delete.log
  when: manual

The Docker image is a custom one I created using the Docker image from your repo as a base; it just installs the dependencies and the hetzner-k3s CLI. The SSH key is provided via CI/CD variables.

As you can see in the before_script section, I add the SSH_PRIVATE_KEY and then move both the public and private keys into the .ssh directory since my cluster_config needs the keys there.

It worked once on one of my tests and now for some reason it does not work anymore. It's nothing critical since I deployed the cluster using my host, but I'm still curious what the problem might be.

If you need more information let me know!

@vitobotta
Owner

You are running the tool via docker but you are not mounting the SSH keypair in the container, right?

@FernandoJCa

FernandoJCa commented Jan 25, 2025

GitLab runs all of the commands in the container, so the SSH key pair is mounted there. I already checked that, just in case.

@vitobotta
Owner

I'll try to add some additional debugging info to the SSH connection. I'll see if I can make a temp release in a bit.

@FernandoJCa

Let me know, so I can do a quick test!

@vitobotta
Owner

I have added some debugging info for when it fails to open an SSH session. Try the build when it's ready https://github.com/vitobotta/hetzner-k3s/actions/runs/12969090467

@FernandoJCa

New test with the new build, no luck.

This is the cluster config I'm using

---
cluster_name: snoopy-cluster
kubeconfig_path: "./kubeconfig.yaml"
k3s_version: v1.32.0+k3s1

networking:
  ssh:
    port: 22
    use_agent: false
    public_key_path: "~/.ssh/id_ed25519.pub"
    private_key_path: "~/.ssh/id_ed25519"
  allowed_networks:
    ssh:
      - 0.0.0.0/0
    api:
      - 0.0.0.0/0
  public_network:
    ipv4: true
    ipv6: true
  private_network:
    enabled: true
    subnet: 10.0.0.0/16
    existing_network_name: ""
  cni:
    enabled: true
    encryption: false
    mode: flannel

schedule_workloads_on_masters: false

masters_pool:
  instance_type: cx22
  instance_count: 1
  location: hel1

worker_node_pools:
  - name: snoopy-house
    instance_type: cx32
    instance_count: 2
    location: hel1

And here is the pipeline log:

[screenshot]

Should I try with a different cluster config or SSH key? I can log in via SSH with that keypair from my host without any issue.

@vitobotta
Owner

Did you try with use_agent: true?

@FernandoJCa

Did a few more tries; here are the results:

  • use_agent: true: fail
  • updated Alpine from 3.20.3 to 3.21.2, then tried use_agent: true with the new version: fail
  • with the new Alpine version, use_agent: false: fail

@vitobotta
Owner

Since I am not familiar with Gitlab I am confused by this:

deploy-cluster:
  image: registry.gitlab.com/snoopy-group/hetzner-cli-docker:v1.0.1
  stage: deploy
  script:
    - hetzner-k3s create --config cluster_config.yaml | tee create.log
  artifacts:
    paths:
      - kubeconfig.yaml
    expire_in: 2 h

This is running a new container inside the main container, I guess, or something like that. Where is it mounting the SSH keypair at ~/.ssh inside the hetzner-k3s container?

@vitobotta
Owner

I read that Gitlab stores files in Secure Files and they are accessible at the $CI_SECURE_FILES_DIR location. How are you handling this?

@FernandoJCa

The mounting part is in the before_script section.

GitLab uses three ways to send commands to the container: before_script, script, and after_script. Essentially they are the same; they run commands in the container, just at different times.

To be 100% sure that the SSH keypair is being added to the container, I did a cat on both the public and private keys just before hetzner-k3s --version.

[screenshot]

So they are in the container and can be accessed from the CLI, as far as I understand.

And just to be clear, the GitLab runners use my Docker container to execute all the commands. As for files, GitLab treats them like normal Linux env variables (https://docs.gitlab.com/ee/ci/variables/#use-file-type-cicd-variables)

@vitobotta
Owner

Since you can otherwise SSH into the hosts, can you monitor /var/log/auth.log while hetzner-k3s is running to see if you can find useful info there?

@vitobotta
Owner

I am gonna try to get rid of the ssh2 library and just use the SSH binary via shell command. This library has been causing several headaches already. Bear with me. I will make a new build soon.

@FernandoJCa

> Since you can otherwise SSH into the hosts, can you monitor /var/log/auth.log while hetzner-k3s is running to see if you can find useful info there?

Well, I found something funny: a couple of Asian IPs trying to reach my master node, but nothing else.

[screenshot]

I'm doing more research and trying things to see if I can get more logs.

@vitobotta
Owner

I'm almost done removing that library.

@FernandoJCa

Take your time, there's no rush!

@vitobotta
Owner

Hopefully this will work better https://github.com/vitobotta/hetzner-k3s/actions/runs/12969882972

Try the new build when it's ready. I have removed the library I mentioned and now I am just using regular ssh via shell. It should no longer run into weird issues with keys.

@FernandoJCa

Looks like one job failed

@vitobotta
Owner

Sorry, this one https://github.com/vitobotta/hetzner-k3s/actions/runs/12969906432

I forgot to remove one reference.

@FernandoJCa

Still failing 😅

@vitobotta
Owner

I had to remove something else https://github.com/vitobotta/hetzner-k3s/actions/runs/12969938981

@vitobotta
Owner

Did you try?

@FernandoJCa

I'm starting to suspect that it's an issue with the GitLab Runner or something like that. It's weird because it worked the first time I tried it last week.

[screenshot]

@vitobotta
Owner

hetzner-k3s now uses the plain ssh binary via shell, no longer a library that was a bit problematic in some cases. If this doesn't work, then it must be something with the environment or settings.

@FernandoJCa

Could be. I will try to polish the Dockerfile a bit and see if I can find a solution. I'll let you know!

Thank you so much for all your help!

@vitobotta
Owner

Np.

@FernandoJCa

FernandoJCa commented Jan 26, 2025

Hi @vitobotta, I came back with new info. I was trying to run the create command on my host and found something interesting in auth.log while hetzner-k3s was running. I did a test with both use_agent: false and use_agent: true.

I did the following tests:

  • Ran create on my host WITHOUT DOCKER. Result: same error as in the logs
  • Ran create on my host USING DOCKER. Result: same error as in the logs

[screenshots]

Same thing happens with v2.1.1.rc6


Trying with a brand new config gives the same result. I can still log in normally with ssh from my host, but hetzner-k3s is not able to log in.

@vitobotta
Owner

@FernandoJCa I'm not sure what else we can try. I've never had issues with SSH connections with hetzner-k3s, though a few people have mentioned problems here and there. I thought it might be due to the library I was using, but now hetzner-k3s just uses the ssh and scp binaries from the host OS. So if you can SSH into the servers normally, hetzner-k3s should work fine too. I've tried with both old and new keys and never had any issues. I'm implementing some changes for work, and in the past couple of days, I've created dozens of clusters without any problems. I'm really running out of ideas for what could be causing your issue.

hetzner-k3s uses these ssh options: -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o LogLevel=ERROR -o ConnectTimeout=5 -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o BatchMode=yes -o PasswordAuthentication=no -o PreferredAuthentications=publickey -o PubkeyAuthentication=yes -o IdentitiesOnly=yes -i #{private_ssh_key_path}.

Can you try SSHing into the nodes manually using the same options? I doubt it will make a difference, but it's worth a shot.
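
Assembled into a one-off command (the node IP and key path are placeholders, and root is assumed as the login user), that manual test would look roughly like:

ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null \
    -o LogLevel=ERROR -o ConnectTimeout=5 -o ServerAliveInterval=5 \
    -o ServerAliveCountMax=3 -o BatchMode=yes -o PasswordAuthentication=no \
    -o PreferredAuthentications=publickey -o PubkeyAuthentication=yes \
    -o IdentitiesOnly=yes -i ~/.ssh/id_ed25519 root@<node-ip> 'echo ok'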

Also, can you remind me which OS you're using on the host? I use macOS but develop hetzner-k3s in an Alpine dev container and have tested it a lot on Ubuntu. Never had any issues with any of these systems.

@vitobotta
Owner

I am gonna test some things and get back to you in a bit.

@vitobotta
Owner

I think I may have just reproduced the issue! Investigating...

@vitobotta
Owner

NVM... I had transferred the wrong key. I am testing from an Ubuntu server now and all is good. If you tell me your host OS I can try to reproduce on the same system.

@FernandoJCa

I'm using Linux Mint 22.

[screenshot]

I'm looking at the SSH configuration of the nodes to see if I can find something that might be causing the issue. I'll let you know if I find something.

@vitobotta
Owner

This is based on Ubuntu so it should be the same. Let me know if you find something.

@FernandoJCa

FernandoJCa commented Jan 26, 2025

No luck; I tried a bunch of stuff but still have the same issue. Could it be a problem with my OS? Not sure; I will test later with a friend on his PC to see if I'm the issue or what is happening.

One thing is sure: hetzner-k3s is trying to log in with ssh but instantly disconnects for a reason I cannot understand.

It's funny that for some reason it only worked one time and never worked again hahaha.


PS: Did a few tries deploying a new cluster with new SSH keys, but no luck either.

@FernandoJCa

@vitobotta Switched back to v2.0.9 and it works.

Any idea why?

@vitobotta
Owner

> @vitobotta Switched back to v2.0.9 and it works. Any idea why?

On the same host, with the same keys and configuration? Is the version of hetzner-k3s the only difference?

@FernandoJCa

Yep, the hetzner-k3s version is the only difference.

Still not working in the pipeline, but that looks like a GitLab issue, since for some reason it is not able to even initiate the connection. I was monitoring the auth logs and there's no connection attempt from any GitLab runner.
But that doesn't matter right now tbh

@vitobotta
Owner

Can you please test again with a new cluster, in a new hetzner project, with both 2.0.9 and 2.1.1.rc6? Ensure you use a new project with nothing inside each time.

@FernandoJCa

New project + v2.0.9: works

New project + v2.1.1.rc6: fails with the same issue

@vitobotta
Owner

Can you try v2.1.1.rc7 with DEBUG=true?

@FernandoJCa

Could you please explain how to pass the DEBUG flag?

I tried hetzner-k3s create --config cluster_config.yaml DEBUG=true but it's the same output as usual.

@vitobotta
Owner

It's just an env variable, so try DEBUG=true hetzner-k3s create --config cluster_config.yaml; the variable needs to precede the command.
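
For the record, the position matters to the shell: a VAR=value prefix sets an environment variable for that single command, while a trailing VAR=value is passed as an ordinary argument:

DEBUG=true hetzner-k3s create --config cluster_config.yaml   # sets the env var for this command
export DEBUG=true                                            # or export it for the whole session
hetzner-k3s create --config cluster_config.yaml

hetzner-k3s create --config cluster_config.yaml DEBUG=true   # does NOT set an env var; DEBUG=true is just an argument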

@FernandoJCa

Well, did the test and this is what I see

[screenshots]

@vitobotta
Owner

Interesting, that seems to suggest that the connection is OK, so there may be a problem with comparing that string with the expected one. It's just a simple check to verify that SSH commands can be executed correctly.

Wait for rc8 to be ready (https://github.com/vitobotta/hetzner-k3s/actions/runs/12978554154) and try again with DEBUG=true. Then paste the logs again. Thanks for the help investigating this!

@FernandoJCa

Don't worry @vitobotta, I'm happy to help.

Unfortunately I'm not good enough with Crystal to help with the code.

I'll test it later today when I have a bit more time, and I'll let you know the results!

@vitobotta
Owner

Sounds good

@vitobotta
Owner

vitobotta commented Jan 26, 2025

I was just reading that there may be an issue with extra newlines/carriage returns in some cases, which would make that string comparison fail. I have made a change in rc9 (https://github.com/vitobotta/hetzner-k3s/actions/runs/12978828792) to remove the extra characters. Please test with this one.
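
To illustrate the failure mode (a generic shell sketch with a made-up check string, not the actual Crystal code): a stray carriage return in the captured output makes an otherwise-equal string comparison fail, so the output has to be normalized before comparing:

expected="ok"
actual=$(ssh root@<node-ip> 'echo ok')   # command substitution strips trailing newlines but keeps a trailing \r

[ "$actual" = "$expected" ] || echo "comparison fails if a carriage return is attached"

# strip carriage returns (and any stray newlines) before comparing
actual=$(printf '%s' "$actual" | tr -d '\r\n')
[ "$actual" = "$expected" ] && echo "comparison succeeds after normalization"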
