For a long while one of my servers “crashed” weekly. Every Saturday at 3am the load would spiral up and everything would become unresponsive (still working, but very slow): everything was waiting on disk I/O. I kept putting off checking why this happened, and just rebooted the machine manually every Saturday (lazy, lazy). After a while I noticed it always started at 3am, so it had to be a cron job. I then noticed that one of my other servers also had a high load at 3am, but the load was lower there and it recovered.
Today I did some thinking and checking, and to blame was... me! I had configured smartmontools to run a weekly long self-test at 3am on all 4 disks in the software RAID5 at the same time. This apparently took so much I/O capacity away from the disks that the machine couldn’t cope. Whoopsie.
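One way to avoid this is to stagger the long self-tests so that only one disk runs one on any given day. Below is a minimal sketch of that idea, not my actual config: the device names (/dev/sda to /dev/sdd) and the helper `staggered_schedule` are made up for illustration. It relies on the `-s` schedule pattern from smartd.conf (`T/MM/DD/d/HH`, with day of week 1 = Monday to 7 = Sunday) and prints one line per disk.

```python
def staggered_schedule(devices, hour=3, first_day=6):
    """Build smartd.conf lines with one long self-test per device,
    each on a different day of the week.

    devices   -- device paths (hypothetical names, adjust to your system)
    hour      -- hour of day (00-23) at which the test starts
    first_day -- day of week for the first device (1 = Monday ... 7 = Sunday)
    """
    lines = []
    for i, dev in enumerate(devices):
        # Move to the next weekday for each disk, wrapping from Sunday to Monday.
        day = (first_day - 1 + i) % 7 + 1
        # "L" = long self-test; ".." matches any month / any day of month.
        lines.append(f"{dev} -a -s L/../../{day}/{hour:02d}")
    return lines


if __name__ == "__main__":
    # Assumed device names for a 4-disk array.
    for line in staggered_schedule(["/dev/sda", "/dev/sdb", "/dev/sdc", "/dev/sdd"]):
        print(line)
```

With four disks this spreads the tests over Saturday, Sunday, Monday and Tuesday at 3am. With more than seven disks the days wrap around and some tests would land on the same day again, so you would want to vary the hour as well.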
Note: this is a hobby environment, of course; in real production I would have spent more time figuring it out 🙂