Hi All,
I just had an interesting issue and thought I would share it, as it might save you having to co-ordinate planned downtime that could potentially be avoided.
We had a disconnected host where the guest VMs were all still running and accessible via RDP, but the host was unresponsive through the iLO / DCUI, and SSH was not running and could not be started.
The host logged the following sequential events in vCenter:
The root filesystem's file table is full. As a result, the file tmp:/auto-backup.1481830/etc/hosts could not be created by the application 'tar'.
The root filesystem's file table is full. As a result, the file tmp:/auto-backup.1482016/etc/sfcb/repository/root/interop/cim_listenerdestinationcimxml.idx could not be created by the application 'tar'.
The root filesystem's file table is full. As a result, the file tmp:/auto-backup.1482194/etc/vmware/hostd/vmAutoStart.xml could not be created by the application 'tar'.
The root filesystem's file table is full. As a result, the file /etc/vmware/esx.conf.LOCK.17554 could not be created by the application 'hostd-worker'.
The root filesystem's file table is full. As a result, the file /var/log/ipmi/0/.sensor_threshold.raw could not be created by the application 'sfcb-vmware_raw'.
The root filesystem's file table is full. As a result, the file /var/log/ipmi/0/.sensor_hysteresis.raw could not be created by the application 'sfcb-vmware_raw'.
The root filesystem's file table is full. As a result, the file /var/run/sfcb/52c25dd2-064a-abee-ce4c-cafd051d527c could not be created by the application 'sfcb-CIMXML-Pro'.
The root filesystem's file table is full. As a result, the file /var/log/ipmi/0/.sel_header.raw could not be created by the application 'sfcb-vmware_raw'.
The root filesystem's file table is full. As a result, the file /var/run/sfcb/52ca5a12-1d8d-7902-1e14-170d2c282951 could not be created by the application 'sfcb-CIMXML-Pro'.
The root filesystem's file table is full. As a result, the file /var/log/ipmi/0/.sensor_readings.raw could not be created by the application 'sfcb-vmware_raw'.
The root filesystem's file table is full. As a result, the file /etc/vmware/esx.conf.LOCK.17554 could not be created by the application 'hostd-worker'.
Unable to apply DRS resource settings on host. A general system error occurred: Invalid fault. This can significantly reduce the effectiveness of DRS.
The root filesystem's file table is full. As a result, the file /var/run/sfcb/523777d0-72dc-9e0b-c6b0-9d32a5255317 could not be created by the application 'sfcb-CIMXML-Pro'.
The root filesystem's file table is full. As a result, the file /var/run/sfcb/52fc39a4-62d0-866e-50a3-663209c9ca28 could not be created by the application 'sfcb-CIMXML-Pro'.
The vSphere HA availability state of this host has changed to Unreachable
Host is not responding
Alarm 'Host connection state' on myhost.mydomain changed from Green to Red
Alarm 'Host connection state' on myhost.mydomain sent email to myemail@mydomain
vSphere HA agent for this host has an error: The vSphere HA agent is not reachable from vCenter Server
Alarm 'vSphere HA host status' on myhost.mydomain changed from Green to Red
vSphere HA agent for this host has an error: The vSphere HA agent is not reachable from vCenter Server
Cannot scan the host myhost.mydomain because its power state is unknown.
Host is not responding
I found this KB article, but was unable to start the process as I couldn't SSH onto the host.
Since I knew which guests were running on the affected host, I contacted the business and arranged emergency downtime to shut these guests down so that I could power cycle the host and deal with the issue. After lots of co-ordination we finally agreed on a suitable time which satisfied all business areas, and started the remediation.
Now here is the interesting part ... within seconds of shutting down guest VMs with a simple for loop and the shutdown command, the host status changed to Green and it was connected to vCenter again.
for /f %i in (C:\_temp\targets.txt) do shutdown -s -m \\%i -t 0 -f
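(For anyone not familiar with shutdown.exe: -s shuts down the target, -m \\%i points it at each remote machine listed in targets.txt, -t 0 sets a zero-second countdown, and -f forces running applications to close.)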
I enabled SSH and ran "stat -f /" - results below:
~ # stat -f /
File: "/"
ID: 1 Namelen: 127 Type: visorfs
Block size: 4096
Blocks: Total: 449852 Free: 324368 Available: 324368
Inodes: Total: 8192 Free: 55
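If you want to dig into where the inodes are actually going, something along these lines should work from the ESXi shell (just a rough sketch - the directory list is only an example; going by the events above, /var/run/sfcb looks like the prime suspect here):
~ # for d in /etc /var/run/sfcb /var/log /tmp; do echo -n "$d: "; find $d 2>/dev/null | wc -l; done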
After working through the above-mentioned KB article, the inodes were still exhausted:
/var/run/sfcb # stat -f /
File: "/"
ID: 1 Namelen: 127 Type: visorfs
Block size: 4096
Blocks: Total: 449852 Free: 324565 Available: 324565
Inodes: Total: 8192 Free: 122
Now that the host was available again, I put it into Maintenance Mode, rebooted it, and checked again after the reboot - plenty of free inodes:
~ # stat -f /
File: "/"
ID: 1 Namelen: 127 Type: visorfs
Block size: 4096
Blocks: Total: 449852 Free: 332942 Available: 332942
Inodes: Total: 8192 Free: 5721
All VMs that were shut down were then powered back on using PowerCLI.
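For anyone wanting to script the power-on side, something like this should do it in PowerCLI (a sketch only - myvcenter.mydomain is a placeholder and it assumes the same targets.txt list of VM names):
Connect-VIServer -Server myvcenter.mydomain
Get-Content C:\_temp\targets.txt | ForEach-Object { Start-VM -VM $_ }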
So the interesting point that could potentially be taken from this is that the next time this issue occurs, I might be able to resolve it by shutting down just one or a few running VMs rather than all of them ... perhaps shut down the lowest-priority non-production VMs first to see if that frees up enough inodes to get the host responsive again.
So, two questions:
- Is this logic flawed?
- Is there a method to monitor FREE inodes so that this can be caught in advance of it becoming an issue involving downtime?
Cheers, & happy new year!
Jon