CloudStack fails to start more VMs #10205

Open
akrasnov-drv opened this issue Jan 19, 2025 · 8 comments

@akrasnov-drv

Discussed in #10184

Originally posted by akrasnov-drv January 14, 2025
Hi,

I'm struggling to make CloudStack 4.20.0.0 properly start KVM VMs on Ubuntu 22.

We have an isolated network over VLAN.
CloudStack manages to start a single VM and to add several more. But when I ask it to start more (e.g. 10-30), CloudStack starts behaving strangely.
New VMs produce various errors, then CloudStack becomes slow, does not clean up resources, and in the end is left with a number of VMs in the Starting state.

I have 5 KVM servers connected, each able to handle 30 VMs on its own (in KVM without CloudStack). The VMs use local server storage. I do not see any resource problem.
I tried to debug the issue, and it looks like the virtual router stops working properly. I found in its log that it restarts its managing script at some point, yet some of the VMs still do not get a proper network config. Enabling static NAT also returns errors:
Error while enabling static nat. Ip Id: 14
Expunging VMs then also hangs.
In addition, I sometimes see KVM hosts stop communicating with management and stop writing to their local logs.

To recover I need to restart management, delete the virtual router and clean up stuck resources, sometimes directly in the MySQL DB. An agent restart is also sometimes needed.
Any help in understanding and fixing the problem is highly appreciated.
I'll provide logs or other info on request.

Thanks,
Alex.

To summarize:

- under some load, some VMs stay in the Starting state and the UI becomes unresponsive; restarting libvirt revives the UI and the expunging of VMs
- some of the VMs that manage to start do not get IPs
- most fail to get static NAT configured (I have enough free public IPs)
- in the end the primary VR fails, but the backup one is not promoted to primary for some 30-60 min
@akrasnov-drv
Author

More info and logs are available in the linked discussion.
I'll post fresh ones on request.

@akrasnov-drv
Author

akrasnov-drv commented Jan 19, 2025

Just noticed that after the switchover of the backup VR to primary, there are numerous errors in the VR cloud.log:

2025-01-19 10:38:46,432 ERROR    Not able to setup source-nat for a regular router yet
2025-01-19 10:38:46,876 ERROR    Not able to setup source-nat for a regular router yet
2025-01-19 10:48:46,361 ERROR    Not able to setup source-nat for a regular router yet
...
2025-01-19 13:08:46,424 ERROR    Not able to setup source-nat for a regular router yet
2025-01-19 13:08:46,881 ERROR    Not able to setup source-nat for a regular router yet

and the VMs that were started had no outside access.
After a network restart, both VRs were recreated and source NAT started working.
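
To see how long an error like the one above has been repeating, the log can be summarized per message. This is only a diagnostic sketch: the function name `summarize_errors` is hypothetical, and the line format (timestamp, level, message) is assumed from the excerpt above rather than taken from any CloudStack documentation.

```python
import re
from collections import defaultdict
from datetime import datetime

# Assumed line format, inferred from the excerpt:
# "2025-01-19 10:38:46,432 ERROR    <message>"
LINE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3}\s+(\w+)\s+(.*)$")


def summarize_errors(lines):
    """Map each ERROR message to (count, first_seen, last_seen)."""
    seen = defaultdict(list)
    for line in lines:
        match = LINE_RE.match(line.strip())
        # Lines that do not fit the assumed format (e.g. "...") are skipped.
        if match and match.group(2) == "ERROR":
            stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
            seen[match.group(3)].append(stamp)
    return {msg: (len(ts), min(ts), max(ts)) for msg, ts in seen.items()}
```

It can be fed an open file object for cloud.log on the VR; the first and last timestamps show whether the error stopped after the network restart.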

@akrasnov-drv
Author

akrasnov-drv commented Jan 20, 2025

An additional VR issue: VMs stopped getting their hostname from the metadata service, even though the VR is accessible.
Queries return error 500.
In the apache2 log on the VR I found numerous messages like these:

[Mon Jan 20 09:01:58.806961 2025] [core:alert] [pid 2013:tid 2058] [client 10.10.246.146:47742] /var/www/html/latest/.htaccess: RewriteRule: bad flag delimiters
[Mon Jan 20 09:02:04.895031 2025] [core:alert] [pid 2013:tid 2059] [client 10.10.246.146:47744] /var/www/html/latest/.htaccess: RewriteRule: bad flag delimiters
[Mon Jan 20 09:02:11.090009 2025] [core:alert] [pid 2013:tid 2060] [client 10.10.246.146:50412] /var/www/html/latest/.htaccess: RewriteRule: bad flag delimiters
[Mon Jan 20 09:02:17.429128 2025] [core:alert] [pid 2013:tid 2061] [client 10.10.246.146:57298] /var/www/html/latest/.htaccess: RewriteRule: bad flag delimiters
[Mon Jan 20 09:02:24.250405 2025] [core:alert] [pid 2013:tid 2062] [client 10.10.246.146:57308] /var/www/html/latest/.htaccess: RewriteRule: bad flag delimiters
[Mon Jan 20 09:02:30.995303 2025] [core:alert] [pid 2013:tid 2063] [client 10.10.246.146:42222] /var/www/html/latest/.htaccess: RewriteRule: bad flag delimiters

The file is quite long:

cat /var/www/html/latest/.htaccess|wc -l
160688

Even worse, the lines are just repeating:

cat /var/www/html/latest/.htaccess|sort -u |wc -l
38

#are these used?
#http://<routerIP>/latest/foo and .../foo/ (yield metadata/$IP/foo)
#http://<routerIP>/latest/meta-data and .../meta-data/   (dir listing of meta-data)
#http://<routerIP/latest/meta-data/foo and .../foo/  (yield metadata/$IP/foo)
#http://<routerIP>/latest/user-data  and .../user-data/  (both yield user-data file)
Options +FollowSymLinks
RewriteEngine On
RewriteRule ^availability-zone/?$  ../metadata/%{REMOTE_ADDR}/availability-zone [L,NC,QSA]
RewriteRule ^availability-zone$  ../metadata/%{REMOTE_ADDR}/availability-zone [L,NC,QSA]
RewriteRule ^cloud-domain/?$  ../metadata/%{REMOTE_ADDR}/vm-id [L,NC,QSA]
RewriteRule ^cloud-domain-id/?$  ../metadata/%{REMOTE_ADDR}/vm-id [L,NC,QSA]
RewriteRule ^cloud-identifier/?$  ../metadata/%{REMOTE_ADDR}/cloud-identifier [L,NC,QSA]
RewriteRule ^cloud-identifier$  ../metadata/%{REMOTE_ADDR}/cloud-identifier [L,NC,QSA]
RewriteRule ^hypervisor-host-name$  ../metadata/%{REMOTE_ADDR}/hypervisor-host-name [L,NC,QSA]
RewriteRule ^instance-id/?$  ../metadata/%{REMOTE_ADDR}/instance-id [L,NC,QSA]
RewriteRule ^instance-id$  ../metadata/%{REMOTE_ADDR}/instance-id [L,NC,QSA]
RewriteRule ^local-hostname/?$  ../metadata/%{REMOTE_ADDR}/local-hostname [L,NC,QSA]
RewriteRule ^local-hostname$  ../metadata/%{REMOTE_ADDR}/local-hostname [L,NC,QSA]
RewriteRule ^local-ipv4/?$  ../metadata/%{REMOTE_ADDR}/local-ipv4 [L,NC,QSA]
RewriteRule ^local-ipv4$  ../metadata/%{REMOTE_ADDR}/local-ipv4 [L,NC,QSA]
RewriteRule ^locaRewriteRule ^user-data$  ../userdata/%{REMOTE_ADDR}/user-data [L,NC,QSA]
RewriteRule ^meta-data/(.+)$  ../metadata/%{REMOTE_ADDR}/$1 [L,NC,QSA]
RewriteRule ^meta-data/(.+[^/])/?$  ../metadata/%{REMOTE_ADDR}/$1 [L,NC,QSA]
RewriteRule ^meta-data/?$  ../metadata/%{REMOTE_ADDR}/meta-data [L,NC,QSA]
RewriteRule ^meta-data/$  ../metadata/%{REMOTE_ADDR}/meta-data [L,NC,QSA]
RewriteRule ^public-hostname/?$  ../metadata/%{REMOTE_ADDR}/public-hostname [L,NC,QSA]
RewriteRule ^public-hostname$  ../metadata/%{REMOTE_ADDR}/public-hostname [L,NC,QSA]
RewriteRule ^public-ipv4/?$  ../metadata/%{REMOTE_ADDR}/public-ipv4 [L,NC,QSA]
RewriteRule ^public-ipv4$  ../metadata/%{REMOTE_ADDR}/public-ipv4 [L,NC,QSA]
RewriteRule ^public-keys/?$  ../metadata/%{REMOTE_ADDR}/public-keys [L,NC,QSA]
RewriteRule ^public-keys$  ../metadata/%{REMOTE_ADDR}/public-keys [L,NC,QSA]
RewriteRule ^service-offering/?$  ../metadata/%{REMOTE_ADDR}/service-offering [L,NC,QSA]
RewriteRule ^service-offering$  ../metadata/%{REMOTE_ADDR}/service-offering [L,NC,QSA]
RewriteRule ^user-data/?$  ../userdata/%{REMOTE_ADDR}/user-data [L,NC,QSA]
RewriteRule ^user-data$  ../userdata/%{REMOTE_ADDR}/user-data [L,NC,QSA]
RewriteRule ^vm-id/?$  ../metadata/%{REMOTE_ADDR}/vm-id [L,NC,QSA]
RewriteRule ^vm-id$  ../metadata/%{REMOTE_ADDR}/vm-id [L,NC,QSA]

RewriteRule ^locaRewriteRule ^user-data$ ../userdata/%{REMOTE_ADDR}/user-data [L,NC,QSA] looks suspicious, and it is the only line that does not repeat.
Btw, I have no VMs in CloudStack now, except one being started, so all that content was left over from yesterday and has not been cleaned up yet. Note that the VRs were recreated yesterday!
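
The checks above (total lines versus unique lines, plus spotting the one malformed rule) can be combined in one small script. This is a diagnostic sketch only: the name `summarize_htaccess` and the regular expression for a "well-formed" rule are assumptions based on the shape of the rules listed above, not anything taken from CloudStack.

```python
import re
from collections import Counter

# Assumed shape of a valid rule, inferred from the listing above:
# RewriteRule <pattern> <substitution> [FLAGS]
RULE_RE = re.compile(r"^RewriteRule\s+\S+\s+\S+\s+\[[A-Za-z0-9,=]+\]$")


def summarize_htaccess(text):
    """Return (total_lines, unique_lines, malformed_rules) for .htaccess content.

    Blank lines are ignored, so the totals can differ slightly from `wc -l`.
    """
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    counts = Counter(lines)
    malformed = sorted(
        line for line in counts
        if line.startswith("RewriteRule") and not RULE_RE.match(line)
    )
    return len(lines), len(counts), malformed
```

Passing it the contents of /var/www/html/latest/.htaccess should flag the fused `^locaRewriteRule` line, since it has one token too many before the flags.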

@akrasnov-drv
Author

After a network restart (with cleanup), the VR booted with a "clean" .htaccess that already has duplicate lines:

root@r-3167-VM:~# cat /var/www/html/latest/.htaccess|wc -l
49
root@r-3167-VM:~# cat /var/www/html/latest/.htaccess|sort -u|wc -l
37

@weizhouapache
Member

weizhouapache commented Jan 20, 2025

I checked one VR in my testing env; there are a lot of duplicated lines in .htaccess.

This may not be a major issue.

@akrasnov-drv
Author

akrasnov-drv commented Jan 20, 2025

The main issue in my env is that it gets stuck too often.
We are trying to use it with the jclouds plugin for Jenkins. When Jenkins tries to start some 20 VMs it gets only about 5. The rest are created, fail static NAT and are then destroyed. But after some 50 VMs (created and destroyed), new ones just hang in the Starting state. Then if I ask to purge them all, that hangs too. As I wrote initially, a libvirt restart helps at least to revive expunging:

  • under some load part of VMs stays in Starting state, and UI becomes unresponsive, libvirt restart revives UI and expunging of VMs

Also, the reason I got into this: .htaccess was broken and the metadata service stopped working. Most likely, because of the enormous number of writes to the file, two writes happened simultaneously and wrote to the same line.
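
If simultaneous writers are indeed the cause, one common way to avoid both the fused line and the endless duplicates is to serialize writers with a lock, skip rules that are already present, and replace the file atomically. The sketch below is a generic illustration of that technique, not the actual VR code or the planned fix; `add_rules_atomically`, the `.lock` file name and the file mode are all hypothetical.

```python
import fcntl
import os
import tempfile


def add_rules_atomically(path, new_rules):
    """Add rules that are not already in the file; return how many were added.

    Hypothetical helper (not CloudStack code). Unix-only because of fcntl.
    """
    with open(path + ".lock", "w") as lock:
        # Block until no other writer holds the lock.
        fcntl.flock(lock, fcntl.LOCK_EX)
        try:
            existing = []
            if os.path.exists(path):
                with open(path) as fh:
                    existing = fh.read().splitlines()
            seen = set(existing)
            added = []
            for rule in new_rules:
                if rule not in seen:
                    seen.add(rule)
                    added.append(rule)
            if not added:
                return 0
            # Write the whole file to a temp file in the same directory and
            # rename it over the original, so a reader never sees a
            # half-written line.
            directory = os.path.dirname(os.path.abspath(path))
            fd, tmp_path = tempfile.mkstemp(dir=directory)
            with os.fdopen(fd, "w") as tmp:
                tmp.write("\n".join(existing + added) + "\n")
            os.chmod(tmp_path, 0o644)  # assumed mode so the web server can read it
            os.replace(tmp_path, path)
            return len(added)
        finally:
            fcntl.flock(lock, fcntl.LOCK_UN)
```

With this approach, calling the helper repeatedly with the same rules leaves the file unchanged, so the line count stays bounded by the number of distinct rules.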

@akrasnov-drv
Author

Finally I got into a weird state where restarting libvirtd, the agents and management does not help.
I have 68 VMs stuck in the Expunging state.
If I try to run Expunge again I get

Cannot invoke "java.lang.Long.longValue()" because the return value of "com.cloud.utils.db.SequenceFetcher.getNextSequence(java.lang.Class, javax.persistence.TableGenerator, Object)" is null

and the UI becomes very slow again.
I know just one solution for this: a MySQL cleanup.

@weizhouapache
Member

I will fix the issue with /var/www/html/latest/.htaccess

cc @DaanHoogland
