disk latency weirdness

I've been seeing occasional slow database updates that correspond closely with disk-latency spikes. You can see a 9-month latency graph (from Datadog) here. While the graph only shows spikes of a few milliseconds, the underlying spikes can be quite large (up to 7 seconds!).

I've contacted Linode support; they have been responsive but haven't said that anything changed at the host level (or pointed to a noisy neighbour). Looking at the graph linked above, you can clearly see that something changed around March 13th and has persisted ever since.

Is this something anyone else has seen? Any suggestions on what to do about it? Does it seem likely that this is a Linode-level issue rather than something I've done? I can't see any provisioning or app changes made around that time.

1 Reply

I tried to locate a ticket on your account to see if I could get a better understanding of what troubleshooting was suggested. I couldn't find one, so some of this may be redundant with what you've already tried.

Some contention is expected in a shared virtual hosting environment, though CPU steal can also be caused by internal factors. By opening a Support ticket, we can check the status of the host your Linode is on. The following Community Questions post provides some helpful commands you can run to get a better sense of what could be causing these performance issues internally:

What is CPU steal and how does it affect my Linode?
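
If you'd like a quick snapshot of steal before opening a ticket, standard tools report it directly (this is a generic check, not specific to Linode):

# Report CPU stats once per second, five times; the last column ("st") is steal
vmstat 1 5

# From the sysstat package; the %steal column shows the same per interval
iostat -c 1 5

Sustained steal above a few percent during one of your latency spikes would point toward host contention; near-zero steal points back at local IO.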

Since you mentioned a database in your question, I also wanted to bring up some issues that may be a factor if you're using MySQL.

One MySQL option, sync_binlog, is set by default to synchronize the binary log to disk before each transaction is committed. Another option, innodb_flush_log_at_trx_commit, causes the contents of the InnoDB log buffer to be written out to the log file at each transaction commit, after which the log file is flushed to disk. Again, this takes place on every single database transaction. While these two options make the server ACID-compliant and minimize the risk of data loss, they can cause serious IO overhead if you have a high volume of database transactions, especially on a journalling filesystem.
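
As a rough illustration of the tradeoff (assuming a my.cnf you're able to edit, and that you've accepted that a crash could lose up to roughly a second of transactions), relaxing those two options would look something like this:

[mysqld]
# Write the InnoDB log buffer at each commit, but flush it to disk only once per second
innodb_flush_log_at_trx_commit = 2
# Let the operating system decide when to sync the binary log, rather than syncing per transaction
sync_binlog = 0

Both settings trade durability for throughput, so they're only appropriate if your application can tolerate that risk.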

The link above regarding steal includes the following command:

for x in `seq 1 1 30`; do ps -eo state,pid,cmd | grep "^D"; echo "-"; sleep 2; done

I'd suggest running that, as it will repeatedly list any processes that may be waiting on IO (processes in a D state).

You may see something like jbd2, mysql, or a combination of them with other processes in this output. jbd2 is a kernel process used to synchronize the filesystem journal to disk. If it's waiting on IO, the OS is having a hard time keeping up with journaling and MySQL is becoming write-bound. Example output showing jbd2 issues:

# for x in `seq 1 1 5`; do ps -eo state,pid,cmd | grep "^D"; echo "-"; sleep 5; done
-
D 2064 [jbd2/sda-8]
-
D 2064 [jbd2/sda-8]
-
D 2064 [jbd2/sda-8]
-
D 2064 [jbd2/sda-8]
-
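
To put numbers on the disk latency itself while that loop runs, iostat from the sysstat package reports per-device wait times:

# Extended per-device statistics every 5 seconds; watch the await column
# (average milliseconds per request) and %util during a spike
iostat -dx 5

High await alongside jbd2 or mysqld sitting in a D state would suggest the Linode is write-bound rather than suffering from steal.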

Here are some links with additional information about specific configuration options for performance tuning, though some of those changes have drawbacks. Looking into one or more of the suggestions in the ServerFault article may help you troubleshoot and tune your database for better IO performance:

Before you make any changes, I highly recommend backing up your data. Our Backup Service is an option, though one of its limitations concerns highly transactional databases. With that in mind, it wouldn't hurt to also use mysqldump to create a dump of your database.
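
For example (assuming InnoDB tables and credentials supplied via ~/.my.cnf; the output path is just illustrative), a dump that avoids locking your tables while it runs might look like this:

# --single-transaction takes a consistent InnoDB snapshot without locking tables
mysqldump --single-transaction --all-databases | gzip > /root/mysql-$(date +%F).sql.gz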
