Help! I Am Out of Disk Space!

How can we fix a nasty out-of-space issue leveraging the flexibility of Percona Operator for MySQL?

When planning a database deployment, one of the most challenging factors to consider is the amount of space we need to dedicate to data on disk.

This is even more cumbersome when working on bare metal, as it is more difficult to add space when using this kind of solution with respect to the cloud.

When using cloud storage like EBS or similar, it is normally easy(er) to extend volumes, which gives us the luxury to plan the space to allocate for data with a good grade of relaxation.

Is this also true when using a solution based on Kubernetes like Percona Operator for MySQL? Well, it depends on where you run it. However, if the platform you choose supports the option to extend volumes, K8s per se gives you the possibility to do so as well.

Nonetheless, if it can go wrong it will, and ending up with a fully filled device with MySQL is not a fun experience.

As you know, on normal deployments, when MySQL has no space left on the device, it simply stops working, ergo it will cause a production down event, which of course is unfortunate and we want to avoid it at any cost.

This blog is the story of what happened, what was supposed to happen, and why.

The story

The case was on AWS using EKS.

Given all the above, I was quite surprised when we had a case in which a deployed solution based on Percona Operator for MySQL went out of space. However, we started to dig in and review what was going on and why.

The first thing we did was to quickly investigate what was really taking space, and that could have been an easy win if most of the space was taken by some log, but unfortunately, this was not the case, as data was really taking all the available space.

The next step was to check what storage class (SC) was used for the PersistentVolumeClaim (PVC):

k get pvc
NAME                         VOLUME    CAPACITY   ACCESS MODES   STORAGECLASS
datadir-mt-cluster-1-pxc-0   pvc-<snip>   233Gi      RWO            io1
datadir-mt-cluster-1-pxc-1   pvc-<snip>   233Gi      RWO            io1
datadir-mt-cluster-1-pxc-2   pvc-<snip>   233Gi      RWO            io1

k get pvc

NAME VOLUME CAPACITY ACCESS MODES STORAGECLASS

datadir-mt-cluster-1-pxc-0 pvc-<snip> 233Gi RWO io1

datadir-mt-cluster-1-pxc-1 pvc-<snip> 233Gi RWO io1

datadir-mt-cluster-1-pxc-2 pvc-<snip> 233Gi RWO io1

Ok we use the io1 SC, it is now time to check if the SC is supporting volume expansion:

kubectl describe sc io1
Name:            io1
IsDefaultClass:  No
Annotations:     kubectl.kubernetes.io/last-applied-configuration={"apiVersion":"storage.k8s.io/v1","kind":"StorageClass","metadata":{"annotations":{"storageclass.kubernetes.io/is-default-class":"false"},"name":"io1"},"parameters":{"fsType":"ext4","iopsPerGB":"12","type":"io1"},"provisioner":"kubernetes.io/aws-ebs"}
,storageclass.kubernetes.io/is-default-class=false
Provisioner:           kubernetes.io/aws-ebs
Parameters:            fsType=ext4,iopsPerGB=12,type=io1
AllowVolumeExpansion:  <unset> <------------
MountOptions:          <none>
ReclaimPolicy:         Delete
VolumeBindingMode:     Immediate
Events:                <none>

kubectl describe sc io1

Name: io1

IsDefaultClass: No

Annotations: kubectl.kubernetes.io/last-applied-configuration={"apiVersion":"storage.k8s.io/v1","kind":"StorageClass","metadata":{"annotations":{"storageclass.kubernetes.io/is-default-class":"false"},"name":"io1"},"parameters":{"fsType":"ext4","iopsPerGB":"12","type":"io1"},"provisioner":"kubernetes.io/aws-ebs"}

,storageclass.kubernetes.io/is-default-class=false

Provisioner: kubernetes.io/aws-ebs

Parameters: fsType=ext4,iopsPerGB=12,type=io1

AllowVolumeExpansion: <unset> <------------

MountOptions: <none>

ReclaimPolicy: Delete

VolumeBindingMode: Immediate

Events: <none>

And no is not enabled, and in this case, we cannot just go and expand the volume, we must change the storage class settings first. To enable volume expansion, you need to delete the storage class and enable it again.

Unfortunately, we were unsuccessful in doing that operation, because the storage class kept staying unset for ALLOWVOLUMEEXPANSION.

As said, this is a production down event, so we cannot invest too much time in digging into why it was not correctly changing the mode, we had to act quickly.

The only option we had to fix it was:

Expand the io1 volumes from the AWS console (or AWS client)
Resize the file system
Patch any K8 file to allow K8 to correctly see the new volume’s dimension

Expanding EBS volumes from the console is trivial, just go to Volumes, select the volume you want to modify, choose modify, and change the size of it with the one desired, and done.

Once that is done, connect to the Node hosting the pod which has the volume mounted like this:

 k get pods -o wide|grep mysql-0
NAME                                        READY     STATUS    RESTARTS   AGE    IP            NODE             
cluster-1-pxc-0                               2/2     Running   1          11d    10.1.76.189     <mynode>.eu-central-1.compute.internal

k get pods -o wide|grep mysql-0

NAME READY STATUS RESTARTS AGE IP NODE

cluster-1-pxc-0 2/2 Running 1 11d 10.1.76.189 <mynode>.eu-central-1.compute.internal

Then we need to get the id of the PVC to identify it on the node:

k get pvc
NAME                         STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS
datadir-cluster-1-pxc-0   Bound    pvc-1678c7ee-3e50-4329-a5d8-25cd188dc0df   233Gi      RWO            io1

k get pvc

NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS

datadir-cluster-1-pxc-0 Bound pvc-1678c7ee-3e50-4329-a5d8-25cd188dc0df 233Gi RWO io1

One note, when doing this kind of recovery with a Percona XtraDB Cluster-based solution, always recover node-0 first, then the others.

So we connect to <mynode> and identify the volume:

lslbk |grep pvc-1678c7ee-3e50-4329-a5d8-25cd188dc0df
nvme1n1      259:4    0  350G  0 disk /var/lib/kubelet/pods/9724a0f6-fb79-4e6b-be8d-b797062bf716/volumes/kubernetes.io~aws-ebs/pvc-1678c7ee-3e50-4329-a5d8-25cd188dc0df <-----

1 2	lslbk \|grep pvc-1678c7ee-3e50-4329-a5d8-25cd188dc0df nvme1n1 259:4 0 350G 0 disk /var/lib/kubelet/pods/9724a0f6-fb79-4e6b-be8d-b797062bf716/volumes/kubernetes.io~aws-ebs/pvc-1678c7ee-3e50-4329-a5d8-25cd188dc0df <-----

At this point we can resize it:

root@ip-<snip>:/# resize2fs  /dev/nvme1n1
resize2fs 1.45.5 (07-Jan-2020)
Filesystem at /dev/nvme1n1 is mounted on /var/lib/kubelet/plugins/kubernetes.io/aws-ebs/mounts/aws/eu-central-1a/vol-0ab0db8ecf0293b2f; on-line resizing required
old_desc_blocks = 30, new_desc_blocks = 44
The filesystem on /dev/nvme1n1 is now 91750400 (4k) blocks long.

root@ip-<snip>:/# resize2fs /dev/nvme1n1

resize2fs 1.45.5 (07-Jan-2020)

Filesystem at /dev/nvme1n1 is mounted on /var/lib/kubelet/plugins/kubernetes.io/aws-ebs/mounts/aws/eu-central-1a/vol-0ab0db8ecf0293b2f; on-line resizing required

old_desc_blocks = 30, new_desc_blocks = 44

The filesystem on /dev/nvme1n1 is now 91750400 (4k) blocks long.

The good thing is that as soon as you do that, the MySQL daemon sees the space and will restart, however, it will happen only on the current pod and K8 will still see the old dimension:

k get pv
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                            STORAGECLASS   REASON
pvc-1678c7ee-3e50-4329-a5d8-25cd188dc0df   333Gi      RWO            Delete           Bound    pxc/datadir-cluster-1-pxc-0   io1

k get pv

NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON

pvc-1678c7ee-3e50-4329-a5d8-25cd188dc0df 333Gi RWO Delete Bound pxc/datadir-cluster-1-pxc-0 io1

To allow K8 to be aligned with the real dimension, we must patch the information stored, and the command is the following:

kubectl patch pvc <pvc-name>  -n <pvc-namespace> -p '{ "spec": { "resources": { "requests": { "storage": "NEW STORAGE VALUE" }}}}'
Ie:
kubectl patch pvc datadir-cluster-1-pxc-0 -n pxc -p '{ "spec": { "resources": { "requests": { "storage": "350" }}}}'

kubectl patch pvc <pvc-name> -n <pvc-namespace> -p '{ "spec": { "resources": { "requests": { "storage": "NEW STORAGE VALUE" }}}}'

Ie:

kubectl patch pvc datadir-cluster-1-pxc-0 -n pxc -p '{ "spec": { "resources": { "requests": { "storage": "350" }}}}'

Remember to use as PVC-name the NAME coming from:

<span style="font-weight: 400;">kubectl get pvc</span><span style="font-weight: 400;">.</span>

1	<span style="font-weight: 400;">kubectl get pvc</span><span style="font-weight: 400;">.</span>

Once this is done, K8 will see the new volume dimension correctly.

Just repeat the process for Node-1 and Node-2 and …done, the cluster is up again.

Finally, do not forget to modify your custom resources file (cr.yaml) to match the new volume size. I.E.:

    volumeSpec:
      persistentVolumeClaim:
        storageClassName: "io1"
        resources:
          requests:
            storage: 350G

volumeSpec:

persistentVolumeClaim:

storageClassName: "io1"

resources:

requests:

storage: 350G

The whole process took just a few minutes, it was time now to investigate why the incident happened and why the storage class was not allowing extension in the first place.

Why it happened

Well, first and foremost the platform was not correctly monitored. As such there was a lack of visibility about space utilization and no alert about disk space.

This was easy to solve by enabling the Percona Monitoring and Management (PMM) feature in the cluster cr and setting the alert in PMM once the nodes join it (see https://docs.percona.com/percona-monitoring-and-management/get-started/alerting.html for details on how to do it).

The second issue was the problem with the storage class. Once we had the time to carefully review the configuration files, we identified that there was an additional tab in the SC class, which was causing K8 to ignore the directive.

It was supposed to be:

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: io1
  annotations:
    storageclass.kubernetes.io/is-default-class: "false"
provisioner: kubernetes.io/aws-ebs
parameters:
  type: io1
  iopsPerGB: "12"
  fsType: ext4 
allowVolumeExpansion: true <----------

It was:
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: io1
  annotations:
    storageclass.kubernetes.io/is-default-class: "false"
provisioner: kubernetes.io/aws-ebs
parameters:
  type: io1
  iopsPerGB: "12"
  fsType: ext4 
  allowVolumeExpansion: true. <---------

kind: StorageClass

apiVersion: storage.k8s.io/v1

metadata:

name: io1

annotations:

storageclass.kubernetes.io/is-default-class: "false"

provisioner: kubernetes.io/aws-ebs

parameters:

type: io1

iopsPerGB: "12"

fsType: ext4

allowVolumeExpansion: true <----------

It was:

kind: StorageClass

apiVersion: storage.k8s.io/v1

metadata:

name: io1

annotations:

storageclass.kubernetes.io/is-default-class: "false"

provisioner: kubernetes.io/aws-ebs

parameters:

type: io1

iopsPerGB: "12"

fsType: ext4

allowVolumeExpansion: true. <---------

What was concerning was the lack of error returned by the Kubernetes API, so in theory the configuration was accepted but not really validated.

In any case, once we had fixed the typo and recreated the SC, the setting for volume expansion was correctly accepted:

kubectl describe sc io1
Name:            io1
IsDefaultClass:  No
Annotations:     kubectl.kubernetes.io/last-applied-configuration={"allowVolumeExpansion":true,"apiVersion":"storage.k8s.io/v1","kind":"StorageClass","metadata":{"annotations":{"storageclass.kubernetes.io/is-default-class":"false"},"name":"io1"},"parameters":{"fsType":"ext4","iopsPerGB":"12","type":"io1"},"provisioner":"kubernetes.io/aws-ebs"}
,storageclass.kubernetes.io/is-default-class=false
Provisioner:           kubernetes.io/aws-ebs
Parameters:            fsType=ext4,iopsPerGB=12,type=io1
AllowVolumeExpansion:  True    
MountOptions:          <none>
ReclaimPolicy:         Delete
VolumeBindingMode:     Immediate
Events:                <none>

kubectl describe sc io1

Name: io1

IsDefaultClass: No

Annotations: kubectl.kubernetes.io/last-applied-configuration={"allowVolumeExpansion":true,"apiVersion":"storage.k8s.io/v1","kind":"StorageClass","metadata":{"annotations":{"storageclass.kubernetes.io/is-default-class":"false"},"name":"io1"},"parameters":{"fsType":"ext4","iopsPerGB":"12","type":"io1"},"provisioner":"kubernetes.io/aws-ebs"}

,storageclass.kubernetes.io/is-default-class=false

Provisioner: kubernetes.io/aws-ebs

Parameters: fsType=ext4,iopsPerGB=12,type=io1

AllowVolumeExpansion: True

MountOptions: <none>

ReclaimPolicy: Delete

VolumeBindingMode: Immediate

Events: <none>

What should have happened instead?

If proper monitoring and alerting were in place, the administrators would have the time to act and extend the volumes without downtime.

However, the procedure for extending volumes on K8 is not complex but also not as straightforward as you may think. My colleague Natalia Marukovich wrote a blog post, Percona Operator Volume Expansion Without Downtime, that gives you the step by step instructions on how to extend the volumes without downtime.

Conclusion

Using the cloud, containers, automation, or more complex orchestrators like Kubernetes, does not solve all, does not prevent mistakes from happening, and more importantly, does not make the right decisions for you.

You must set up a proper architecture that includes backup, monitoring, and alerting. You must set the right alerts and act on them in time.

Finally, automation is cool, however, the devil is in the details and typos are his day-to-day joy. Be careful and check what you put online, do not rush it. Validate, validate, validate…

Great stateful MySQL to all.

Percona Monitoring and Management is a best-of-breed open source database monitoring solution. It helps you reduce complexity, optimize performance, and improve the security of your business-critical database environments, no matter where they are located or deployed.

Download Percona Monitoring and Management Today

0 Comments

Inline Feedbacks

View all comments

MySQL 5.7
End of Life

Compare Percona to Leading Database Solutions

Software
Downloads

Product
Documentation

Resource Hub

Financial Services

Driving Database Success

Percona Blog

Percona Community Hub

Percona Events Hub

About Percona

Percona in the News

Our Customers

Our Partners

Careers

Contact Us

Help! I Am Out of Disk Space!

The story

Why it happened

What should have happened instead?

Conclusion

Related

Related Blog Articles

RECOMMENDED ARTICLES

Why SELECT COUNT(*) FROM TABLE Is Sometimes Very Slow in MySQL or MariaDB

Partially Rolling Back a Transaction in MySQL or PostgreSQL

JSON_TABLE() Will Be in PostgreSQL 17

MOST POPULAR ARTICLES

Auditing login attempts in MySQL

Deploy Django on Kubernetes With Percona Operator for PostgreSQL

MySQL “Got an error reading communication packet”

MySQL 5.7 End of Life

Compare Percona to Leading Database Solutions

Software Downloads

Product Documentation

Resource Hub

Financial Services

Driving Database Success

Percona Blog

Percona Community Hub

Percona Events Hub

About Percona

Percona in the News

Our Customers

Our Partners

Careers

Contact Us

Help! I Am Out of Disk Space!

The story

Why it happened

What should have happened instead?

Conclusion

Related

Share This Post!

Want to get weekly updates listing the latest blog posts?

Related Blog Articles

RECOMMENDED ARTICLES

Why SELECT COUNT(*) FROM TABLE Is Sometimes Very Slow in MySQL or MariaDB

Partially Rolling Back a Transaction in MySQL or PostgreSQL

JSON_TABLE() Will Be in PostgreSQL 17

MOST POPULAR ARTICLES

Auditing login attempts in MySQL

Deploy Django on Kubernetes With Percona Operator for PostgreSQL

MySQL “Got an error reading communication packet”

MySQL 5.7
End of Life

Software
Downloads

Product
Documentation