[hetzner] Facts about hardware replacement



Over 1 year of operation, out of a batch of 1000 servers,
20 servers hung most often. On some of them the OS would not install, for example.
Over the last 3 months these 20 servers hung more than 10 times.

A ticket was filed asking for these servers to be checked.
Here is the report, with facts and specifics, on how Hetzner works.
  1. There is no automatic monitoring like at OVH.
  2. You often have to give explicit consent by replying to the tickets.

136.243.92.xx (RAM replaced)
as requested we've checked the server and detect faulty RAM. We replaced the RAM and rebooted the server afterwards.

46.4.69.xx (disks failed)
We regret to inform you that both HDDs are having read failures. You can log in to the Rescue System for performing a backup using root password

144.76.202.xx (no errors)
The hardware check has been finished without any errors. We have booted your server back into the installed system.

144.76.162.xx (no errors)
The hardware check has been finished without any errors. We have booted your server back into the installed system.

144.76.64.xx (no errors)
The hardware check has been finished without any errors. We have booted your server back into the installed system.

176.9.37.xx (RAM replaced)
The hardware check has finished and adjudged the RAM being faulty.
We have replaced it and started up your system again.

46.4.106.xx (RAM replaced)
As requested we've checked the server and the RAM was adjudged being faulty. We've replaced the RAM and started the server afterwards.

176.9.23.xx (no errors)
The hardware check is finished now and your server has passed all tests successfully.
Because of this we have booted your server back into your installed system.

144.76.78.xx (no errors)
The hardware check has been finished on your server now:
-----------------%< —
DMESG: Ok
STRESSTEST-CPU-TEMP: Ok
STRESSTEST: Ok
HDDTEST P6HWZ0WW: Ok
HDDTEST WMC1P0116944: Ok
-----------------%< —
All hardware was working stable and the hard drives are in a good condition, so no replacement is needed.
We have now booted your server back to the installed system.

213.133.123.xx (no errors)
The hardware check is finished now and your server has passed all tests successfully, because of this we have booted your server back into your installed system.

144.76.25.xx (no errors)
the hardware check has been finished on your server now:
-----------------%< —
DMESG: Ok
STRESSTEST-CPU-TEMP: Ok
STRESSTEST: Ok
HDDTEST WCC1P1093225: Ok
HDDTEST WCC1P1094498: Ok
-----------------%< —
All hardware were running stable during our tests and the HDDs are in a good condition.
We have now booted your server back to the installed system.

78.46.62.xx (no errors)
the hardware check has been finished on your server now:
-----------------%< —
DMESG: Ok
STRESSTEST-CPU-TEMP: Ok
STRESSTEST: Ok
HDDTEST WCC1P0939418: Ok
HDDTEST WCC1P0931055: Ok
-----------------%< —
All hardware were running stable during our tests and the HDDs are in a good condition.
We have now booted your server back to the installed system.

144.76.235.xx (no errors)
the hardware check has been finished on your server now:
-----------------%< —
DMESG: Ok
STRESSTEST-CPU-TEMP: Ok
STRESSTEST: Ok
HDDTEST WCC1P1032095: Ok
HDDTEST WCC1P0464855: Ok
-----------------%< —
All hardware were running stable during our tests and the HDDs are in a good condition.
We have now booted your server back to the installed system.

136.243.10.xx (RAM replaced and BIOS flashed)
the hardware check has been finished on your server now:
-----------------%< —
DMESG: Ok
STRESSTEST-CPU-TEMP: Ok
STRESSTEST: Error
HDDTEST WMC1P0D8CWDX: Ok
HDDTEST WMC1P0DDJNS3: Ok
-----------------%< —
All hardware were running stable during our tests and the HDDs are in a good condition, except for the memory which has caused some errors during the test.
Hence we have replaced the memory of your server on suspicion and updated the BIOS firmware.
We have now booted your server back to the installed system.

148.251.151.xx (RAM replaced and BIOS flashed)
the hardware check has been finished on your server now:
-----------------%< —
DMESG: Ok
STRESSTEST-CPU-TEMP: Ok
STRESSTEST: Error
HDDTEST WMC1P0DAJR3Y: Ok
HDDTEST WMC1P0DKCY6X: Ok
-----------------%< —
All hardware were running stable during our tests and the HDDs are in a good condition, except for the memory which has caused some errors during the test.
Hence we have replaced the memory of your server on suspicion and updated the BIOS firmware.
We have now booted your server back to the installed system.

148.251.135.xx (RAM replaced and BIOS flashed)
the hardware check has been finished on your server now:
-----------------%< —
DMESG: Ok
STRESSTEST-CPU-TEMP: Ok
STRESSTEST: Error
HDDTEST WMC1P0221289: Ok
HDDTEST WCC1P1128092: Ok
-----------------%< —
All hardware were running stable during our tests and the HDDs are in a good condition, except for the memory which has caused some errors during the test.
Hence we have replaced the memory of your server on suspicion and updated the BIOS firmware.
We have now booted your server back to the installed system.

5.9.151.xx (no errors)
the hardware check has been finished on your server now:
-----------------%< —
DMESG: Ok
STRESSTEST-CPU-TEMP: Ok
STRESSTEST: Ok
HDDTEST P6JHW40S: Ok
HDDTEST P6J9T46X: Ok
-----------------%< —
All hardware were running stable during our tests and the HDDs are in a good condition.
We have now booted your server back to the installed system.

78.47.15.xx (they asked me to confirm data loss; after my reply they replaced the HDD)
The hardware check has been finished on your server now:
-----------------%< —
DMESG: Error
STRESSTEST-CPU-TEMP: Ok
STRESSTEST: Error
HDDTEST WMC1P0358081: Error
HDDTEST WMC1P0361016: Error
-----------------%< —
It seems as both HDDs are in a bad condition and need to be replaced.
The test also showed some memory problems, so we would like to replace the memory also.
Your server is now booted to the installed system again.
Please confirm the full dataloss for the server for replacing the HDDs and the memory.
For this purpose please also suggest us an appointment, when we may shut down your server for about 20 to 30 Minutes.

As requested we have replaced the defective hardware and started the server into the rescue-system afterwards.

144.76.162.xx (they asked me to confirm data loss; after my reply they replaced the HDD)
the hardware check has been finished on your server now:
-----------------%< —
DMESG: Ok
STRESSTEST-CPU-TEMP: Ok
STRESSTEST: Ok
HDDTEST WCC1P1111318: Error
HDDTEST WCC1P1111237: Ok
-----------------%< —
All hardware were running stable during our tests, except of one HDD, which needs to be replaced.
We have now booted your server back to the installed system.
Please confirm the dataloss for the defective hard drive.
For this purpose please also suggest us an appointment, when we may shut down your server for about 20 to 30 Minutes.

As requested, we have replaced the defective drive and booted the server into the installed system.
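
If you collect many replies like these, the `NAME: Ok` / `NAME: Error` lines between the scissor marks are regular enough to parse automatically. Below is a minimal Python sketch, assuming the layout is exactly as in the replies quoted above. `parse_report` and `failed` are hypothetical helper names, not something Hetzner provides, and the report format may differ in other tickets.

```python
import re

# Matches lines such as "DMESG: Ok" or "HDDTEST WMC1P0358081: Error".
# Assumption: test names are upper-case with dashes, and an optional
# drive serial follows the test name after a single space.
LINE_RE = re.compile(
    r"^(?P<test>[A-Z-]+)(?: (?P<serial>\S+))?: (?P<status>Ok|Error)$"
)


def parse_report(text):
    """Return a list of (test, serial_or_None, status) tuples."""
    results = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            results.append(
                (match.group("test"), match.group("serial"), match.group("status"))
            )
    return results


def failed(results):
    """Keep only the checks whose status is Error."""
    return [(test, serial) for test, serial, status in results if status == "Error"]


# Report block copied from the 78.47.15.xx reply above.
sample = """\
DMESG: Error
STRESSTEST-CPU-TEMP: Ok
STRESSTEST: Error
HDDTEST WMC1P0358081: Error
HDDTEST WMC1P0361016: Error
"""

print(failed(parse_report(sample)))
```

Lines that do not fit the pattern, such as the scissor marks or the free-text sentences, are simply skipped, so a whole ticket reply can be passed in as is.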
