Fix Bad Blocks in Ubuntu

Some hard drive issues can be fixed!


We all know hard drives can have issues, but some of them can be fixed! Many computer users have heard of bad blocks or filesystem errors, but to most they are a kiss of death unless the OS fixes them automatically or they spontaneously go away. This article deals with fixing read errors reported by the S.M.A.R.T monitoring system. No matter which OS you run, you need to be aware of the status of your hard drives, and S.M.A.R.T tells you just that. This article assumes you have set up your system to monitor the S.M.A.R.T reports of your hard drives and notify you of failures.

My computer is telling me there is an issue, now what?

If your computer is telling you there is an issue, don’t panic! The first step is to see exactly where the issue is occurring. The easiest way is via the terminal. If we know that the drive giving us trouble is located at /dev/sdb, then we can use smartctl to find out exactly what is happening.

owner@pc-0:~# sudo /usr/sbin/smartctl -l selftest /dev/sdb
smartctl 6.5 2016-01-24 r4214 [x86_64-linux-4.4.0-24-generic] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     14289         -
# 2  Short offline       Completed without error       00%     14276         -
# 3  Short offline       Completed: read failure       90%     14276         33443624
# 4  Short offline       Completed: read failure       90%     14276         33443624
# 5  Extended offline    Completed: read failure       90%     14267         33443624
# 6  Short offline       Completed: read failure       90%     14265         33443624
# 7  Short offline       Completed: read failure       90%     14241         33443624
# 8  Short offline       Completed: read failure       90%     14217         33443624
# 9  Short offline       Completed: read failure       90%     14193         33443624
# 10  Short offline       Completed: read failure       90%     14169         33443624
# 11  Short offline       Completed: read failure       90%     14145         33443624
# 12  Short offline       Completed: read failure       90%     14121         33443624
# 13  Extended offline    Completed: read failure       90%     14099         33443624
# 14  Short offline       Completed: read failure       90%     14097         33443624
# 15  Short offline       Completed: read failure       90%     14073         33443624
# 16  Short offline       Completed: read failure       90%     14049         33443624
# 17  Short offline       Completed: read failure       90%     14025         33443624
# 18  Short offline       Completed: read failure       90%     14001         33443624
# 19  Short offline       Completed: read failure       90%     13977         33443624
# 20  Short offline       Completed: read failure       90%     13953         33443624
# 21  Extended offline    Completed: read failure       90%     13931         33443624

As you can see, this drive kept reporting a read failure at the same spot until I was able to fix it; the log lists the newest test first, so the two clean runs at the top came after the repair. There is one crucial piece of information we need to write down from this output: the LBA of the error. We will need this number to find out whether a file is affected by the error and to try to fix it later. In my case the LBA is 33443624.
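
If the log is long, the failing LBA can be pulled out with a small filter instead of by eye. This is only a sketch: failed_lbas is a hypothetical helper name, and it assumes the log format shown above, where a failed test contains the text "read failure" and the LBA is the last column.

```shell
# Print each distinct LBA reported by a failed self-test.
# Reads `smartctl -l selftest` output on stdin.
# Assumption: failed rows start with "#" and end with the LBA, as in the log above.
failed_lbas() {
    awk '/^#/ && /read failure/ { print $NF }' | sort -un
}

# Possible usage (not run here, since it needs a real drive):
#   sudo smartctl -l selftest /dev/sdb | failed_lbas
```

More than one number in the output would mean more than one bad spot, and each needs the steps that follow.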

I’ve got the LBA, now what?

The next thing we need to know is a bit about the partition layout of this drive, because the LBA alone does not tell us which partition the error is on. To find this out we will use the fdisk command.

owner@pc-0:~# sudo fdisk -l /dev/sdb
Disk /dev/sdb: 2.7 TiB, 3000592982016 bytes, 5860533168 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: 8E2D320E-C270-4865-BF8E-B5F3ABC87466

Device     Start        End    Sectors  Size Type
/dev/sdb1   2048 5860532223 5860530176  2.7T Linux filesystem

From here we need to note three pieces of information: the unit size, the partition that contains the error, and the starting sector of that partition. In this case the unit size is 512 bytes. To tell which partition the error is in, take the LBA (you wrote that down, right?) and see which partition's Start-End range it falls in. I only have one partition, with a Start of 2048 and an End of 5860532223, so my LBA of 33443624 definitely falls in this range. This tells me the error is in partition /dev/sdb1, and I can note down that it begins at sector 2048.
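
On a drive with many partitions, the range check can be scripted. The following is a sketch under stated assumptions: partition_for_lba is a hypothetical helper name, and it expects the GPT-style fdisk listing shown above, where the second and third columns are Start and End. An MBR listing with a Boot flag column would shift the columns and need adjusting.

```shell
# Print the partition whose Start..End range contains the given LBA,
# followed by that partition's start sector.
# Reads `fdisk -l` output on stdin; $1 is the LBA.
# Assumption: columns are "Device Start End ..." as in the listing above.
partition_for_lba() {
    awk -v lba="$1" '
        $1 ~ /^\/dev\// && lba + 0 >= $2 + 0 && lba + 0 <= $3 + 0 { print $1, $2 }
    '
}

# Possible usage (not run here, since it needs a real drive):
#   sudo fdisk -l /dev/sdb | partition_for_lba 33443624
```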

Lastly, we need to find out the block size of the filesystem. With ext2/3/4 you can use the tune2fs command.

owner@pc-0:~# sudo tune2fs -l /dev/sdb1 | grep -i "block size"
Block size:               4096

Take note of the block size and keep moving forward!

Now for a bit of Math…

I hope you don’t mind a bit of math as we will need it to find out the exact block we need to inspect for the error. I promise it is not all that hard. To do so we need to use the following formula:

FileSystemBlock = (LBA-StartSector)*(UnitSize/BlockSize)

It is important to note that the final number needs to be a whole number. A fractional result only means the LBA falls partway into a block rather than at its start, so if you get a number like 1342534.5, drop the decimal part and use 1342534 as the block number.

Now, in my case I would fill out the formula as follows:

FileSystemBlock = (33443624-2048)*(512/4096) = 33441576*0.125 = 4180197

So the filesystem block I need to investigate is 4180197.
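
The same arithmetic can be done in the shell. This is a minimal sketch; fs_block is a hypothetical helper name. It multiplies before dividing because shell arithmetic is integer-only, so 512/4096 on its own would evaluate to 0. The integer division also drops the decimal part automatically.

```shell
# FileSystemBlock = (LBA - StartSector) * UnitSize / BlockSize
# Arguments: LBA, partition start sector, unit size, filesystem block size.
# Multiplying first keeps the integer math exact; the division rounds down.
fs_block() {
    echo $(( ($1 - $2) * $3 / $4 ))
}

# Values from this article: LBA 33443624, start 2048, unit 512, block size 4096.
fs_block 33443624 2048 512 4096
```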

Now we find out if there is data in the block or if it is empty.

Whether the block is empty or not, we will need to write to it. This means you need to already have a backup of the file that this block is a part of. Unfortunately, now is not the time to back up; it is too late for that, sorry, and there is no way around it. We will find all this information out using the debugfs command.

OK, let’s find out what is using that block:

owner@pc-0:~# debugfs
debugfs 1.42.13 (17-May-2015)
debugfs:  open /dev/sdb1
debugfs:  testb 4180197
Block 4180197 not in use
debugfs:  quit

In this case, nothing is using the block. If the block is in use, the session looks more like this:

owner@pc-0:~# debugfs
debugfs 1.42.13 (17-May-2015)
debugfs:  open /dev/sdb1
debugfs:  testb 4180197
Block 4180197 marked in use
debugfs:  icheck 4180197
Block   Inode number
4180197 57147410
debugfs:  ncheck 57147410
Inode   Pathname
57147410        /cm/Family Movie.2016.01.01.mkv
^C
debugfs:  quit

In this case the block is part of a file. As mentioned before, the part of this file stored in that block will be lost when we fix the issue, so the file will have to be restored from a backup you already have; now is not the time to make one!

Finally we need to write to that block!

The last step of this process is to write to the block in question to see if it is just a victim of bitrot, the random loss of data that can happen on hard drives over time. If we write the block and then scan the drive again, it should not fail in the same spot; if it does, the hard drive will map the block as bad and not use it again. In either case S.M.A.R.T should no longer detect it as a problem. To write over just the block in question we will use the dd command. Note that we need to use the block size noted earlier to make sure we write over the entire block.

owner@pc-0:~# sudo dd if=/dev/zero of=/dev/sdb1 bs=4096 count=1 seek=4180197
owner@pc-0:~# sudo sync
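
Because dd overwrites whatever it is pointed at, it can help to rehearse the bs, count and seek options on a throwaway file first. The sketch below is not part of the fix itself: it uses a temporary image file and a made-up block number (3) to show that only one 4096-byte block gets zeroed.

```shell
# Rehearse the overwrite on a scratch image file, never on a real disk.
img=$(mktemp)

# Build an 8-block image filled with 0xff bytes so a zeroed block stands out.
dd if=/dev/zero bs=4096 count=8 2>/dev/null | LC_ALL=C tr '\000' '\377' > "$img"

# Zero only block 3. conv=notrunc stops dd from truncating a regular file;
# it is not needed when the target is a block device such as /dev/sdb1.
dd if=/dev/zero of="$img" bs=4096 count=1 seek=3 conv=notrunc 2>/dev/null

# Count the zero bytes left in the image: one full block should be cleared.
zeroed=$(LC_ALL=C tr -d '\377' < "$img" | wc -c | tr -d ' ')
echo "zero bytes: $zeroed"

rm -f "$img"
```

If the count matches the block size, the same bs, count and seek pattern touches exactly one block on the real partition.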

Once this is done, run another self-test of your drive and see if it fails again. If it does, check whether the LBA is the same as before; if it is different, repeat the process for the new number. It is not uncommon for more than one block to go bad at a time, especially adjacent blocks.

I hope this helps you fix a drive or two!

