Recovering Data From A Hard Disk Image

I was given an image of a hard drive that had been reformatted multiple times for use with various operating systems and tried to recover as much data as I could from the reformatted area. This was quite a fun rabbit hole to fall down.

The drive had most recently been used for a Linux install, so it was currently formatted as ext3. Those files could easily be copied off the drive itself or from the image, so they weren’t of particular interest. However, the drive had previously been used with Windows, and several files from that time could be recovered from the hard drive’s “empty space”.

Initial Recovery

Despite the name, PhotoRec can scan the sectors of a hard drive (in this case an image of one) and recover many file types from it. It recognizes each type using a database of file signatures. After running PhotoRec I used jdupes to find and remove any duplicate files, a script to sort all the files into folders based on their extension, and then tikatree to get an idea of what it had found and to look for anything of interest.
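The sorting step can be sketched roughly like this. It is a minimal illustration rather than the script I actually used: the function name and the "no_extension" folder are made up for the example, and it assumes the destination folder sits outside the folder being sorted.

```python
import shutil
from pathlib import Path


def sort_by_extension(source: Path, dest: Path) -> dict[str, int]:
    """Move every file under `source` into dest/<extension>/.

    Returns how many files were moved per extension.
    Assumes `dest` is not inside `source`.
    """
    counts: dict[str, int] = {}
    # Collect the files first so that moving them does not disturb the walk.
    files = [p for p in source.rglob("*") if p.is_file()]
    for path in files:
        # Lower-case the extension so .JPG and .jpg share a folder.
        ext = path.suffix.lower().lstrip(".") or "no_extension"
        folder = dest / ext
        folder.mkdir(parents=True, exist_ok=True)
        # Avoid clobbering a file with the same name from another folder.
        target = folder / path.name
        n = 1
        while target.exists():
            target = folder / f"{path.stem}_{n}{path.suffix}"
            n += 1
        shutil.move(str(path), str(target))
        counts[ext] = counts.get(ext, 0) + 1
    return counts
```

Calling something like sort_by_extension(Path("recovered"), Path("sorted")) would leave one folder per extension, which makes it much easier to see at a glance what was found.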

Discovery and Repair

PhotoRec had recovered quite a few archives, which were of particular interest because the files within would retain their file names. Initially, I wrote a batch script to extract the archives with the WinRAR CLI, but the WinRAR CLI requires a paid version to do this, so I switched to 7z. I then noticed a few issues: a number of the archives were broken in some way, some of the archives used a different character encoding, and my script didn’t extract archives that were nested inside another archive. To handle all this I wound up writing a Python script instead.

The Python script goes through a directory and calls 7z to extract each archive, which it identifies by file extension. If the extraction fails, it logs the error and moves the archive into a folder that I can deal with later. If the extraction succeeds, it looks through the extracted files for any other archives.
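The overall loop looks something like the sketch below. This is a simplified reconstruction, not the real script: the extension list, the "_extracted" folder naming, and the function names are all assumptions for the example, and it assumes a 7z binary on the PATH and a failed-archives folder that lives outside the directory being scanned.

```python
import logging
import shutil
import subprocess
from pathlib import Path

# Illustrative list only; adjust to whatever PhotoRec actually recovered.
ARCHIVE_EXTENSIONS = {".zip", ".rar", ".7z", ".tar", ".gz", ".bz2"}

log = logging.getLogger("extract")


def is_archive(path: Path) -> bool:
    """Decide purely on the file extension, as described above."""
    return path.suffix.lower() in ARCHIVE_EXTENSIONS


def run_7z(archive: Path, out_dir: Path) -> bool:
    """Run `7z x` on one archive; an exit code of 0 means no errors."""
    result = subprocess.run(
        ["7z", "x", "-y", f"-o{out_dir}", str(archive)],
        capture_output=True, text=True, errors="replace",
    )
    if result.returncode != 0:
        log.error("7z failed on %s: %s", archive, result.stderr.strip())
    return result.returncode == 0


def extract_tree(root: Path, failed_dir: Path, extract=run_7z) -> list[Path]:
    """Extract every archive under `root`, including nested ones.

    Archives that fail are moved to `failed_dir` and returned.
    """
    failed = []
    queue = [p for p in root.rglob("*") if p.is_file() and is_archive(p)]
    while queue:
        archive = queue.pop()
        out_dir = archive.with_name(archive.name + "_extracted")
        if extract(archive, out_dir):
            # Queue any archives that were inside this one.
            queue.extend(
                p for p in out_dir.rglob("*") if p.is_file() and is_archive(p)
            )
        else:
            failed_dir.mkdir(parents=True, exist_ok=True)
            shutil.move(str(archive), str(failed_dir / archive.name))
            failed.append(archive)
    return failed
```

Passing the extractor in as a parameter is just a convenience for the sketch: it lets the loop be tried out without 7z installed. A work queue is used instead of recursion so deeply nested archives don't matter.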

To deal with character encoding problems I first tried to determine which code pages had been used and just pass them to 7z as an argument; 7z supports this for .tar and .zip archives even though it’s not in the documentation. However, there isn’t a way to know in advance which code page to use, and different files in an archive might use different encodings. Instead, the script overrides the locale by setting LC_ALL=C when running 7z, so the files it extracts have their file names written as raw byte values. The script can then check whether each file name is ASCII; if not, it uses chardet to try to determine the encoding and renames the file using UTF-8 if the guess meets a certain confidence threshold. Any file name changes get logged, along with any files whose encoding it can’t determine, so I can fix them manually.
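A rough sketch of the renaming logic follows. The function names and the 0.8 threshold are placeholders for the example rather than the values I used, and it assumes a Linux-style filesystem that accepts arbitrary bytes in file names. chardet is a third-party package, so it is only imported when a non-ASCII name actually needs a guess.

```python
import os
import subprocess


def extract_with_c_locale(archive: str, out_dir: str) -> int:
    """Run 7z with LC_ALL=C so file names are written out as raw bytes."""
    env = dict(os.environ, LC_ALL="C")
    cmd = ["7z", "x", "-y", f"-o{out_dir}", archive]
    return subprocess.run(cmd, env=env).returncode


def guess_name(raw: bytes, min_confidence: float = 0.8, detect=None):
    """Return (decoded_name, encoding), or (None, None) if unsure."""
    if raw.isascii():
        return raw.decode("ascii"), "ascii"
    if detect is None:
        import chardet  # third-party: pip install chardet
        detect = chardet.detect
    guess = detect(raw)
    encoding = guess.get("encoding")
    confidence = guess.get("confidence") or 0.0
    if not encoding or confidence < min_confidence:
        return None, None
    try:
        return raw.decode(encoding), encoding
    except (UnicodeDecodeError, LookupError):
        return None, None


def fix_names(directory: bytes, min_confidence: float = 0.8, detect=None):
    """Rename non-ASCII names under `directory` to UTF-8.

    Returns (renamed, unresolved) so both can be logged.
    """
    renamed, unresolved = [], []
    # A bytes path makes os.walk yield bytes names. Walking bottom-up
    # renames children before the directories that contain them.
    for dirpath, dirnames, filenames in os.walk(directory, topdown=False):
        for raw in dirnames + filenames:
            if raw.isascii():
                continue
            old = os.path.join(dirpath, raw)
            name, encoding = guess_name(raw, min_confidence, detect)
            if name is None:
                unresolved.append(old)
                continue
            new = os.path.join(dirpath, name.encode("utf-8"))
            os.rename(old, new)
            renamed.append((old, new, encoding))
    return renamed, unresolved
```

Short file names give chardet very little to work with, which is why the unresolved list matters: anything below the threshold is left untouched for manual review rather than renamed to something wrong.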

After running this script I had the joy of trying to repair/extract what I could from the various broken archives. I used WinRAR to try and fix any .rar or .zip files, bzip2recover for .bz2, gzrecover for .gz, and cpio for .tar.

More Recovery

Now that I was able to look through the files, I found several references to files that might be on the hard drive, but since they were made with somewhat obscure software PhotoRec wouldn’t have recognized the file type and so couldn’t recover them. I was able to make some basic file type definitions based on files that I found in some of the extracted archives; for other files I had to scour the internet for example files. Adding definitions this way is rather crude, but getting better results would require modifying PhotoRec itself, which I didn’t feel like doing.

To test my file definitions I decided to go a step further than using PhotoRec’s fidentify. I created a small NTFS-formatted hard disk image that contained several test files along with a bunch of other random files. I then zeroed out the Master File Table before reformatting the image to ext3 to give PhotoRec a worst-case scenario.

After testing my file definitions I ran PhotoRec on the actual hard disk image looking for these files. I was able to find a number of them, but the easy way to add definitions to PhotoRec only looks for a signature at the start of a file, so it couldn’t determine where a file ended and would put whatever other data was in that sector into the file.

More Discovery

Now that I had recovered most of the data I started to dig into it. Because the drive had been reused and I had done the recovery at the sector level, a number of the recovered files were corrupted where part of the file had been overwritten, for example a PDF that had been partially overwritten by a text document. After better familiarizing myself with what was on the hard drive I decided to write yet another Python script.

This script scans through files looking for various keywords and file structures that are of particular interest. To improve scanning performance I created a dictionary that maps each keyword to its byte values in ASCII and a few other character encodings, so the script can just search each file for those bytes. This of course can lead to a bunch of false positives in things like executables.
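The keyword dictionary idea can be sketched as below. The choice of encodings and the function names are assumptions for the example, and the search is case-sensitive to keep it short.

```python
from pathlib import Path

# Encodings to pre-compute; an illustrative choice, not an exhaustive one.
ENCODINGS = ("ascii", "utf-16-le", "utf-16-be")


def build_patterns(keywords, encodings=ENCODINGS):
    """Map each keyword to the set of byte strings it may appear as."""
    patterns = {}
    for word in keywords:
        variants = set()
        for enc in encodings:
            try:
                variants.add(word.encode(enc))
            except UnicodeEncodeError:
                pass  # keyword cannot be represented in this encoding
        patterns[word] = variants
    return patterns


def scan_bytes(data: bytes, patterns) -> dict:
    """Return {keyword: [offsets]} for every keyword found in `data`."""
    hits = {}
    for word, variants in patterns.items():
        offsets = []
        for needle in variants:
            pos = data.find(needle)
            while pos != -1:
                offsets.append(pos)
                pos = data.find(needle, pos + 1)
        if offsets:
            hits[word] = sorted(offsets)
    return hits


def scan_file(path: Path, patterns) -> dict:
    """Scan one file; reads it whole, so very large files need chunking."""
    return scan_bytes(path.read_bytes(), patterns)
```

Encoding the keywords once up front means each file is only ever treated as raw bytes, with no attempt to decode it. That is what makes it fast, and also why a short keyword can match by accident inside binary data.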

Quite a few programs use the same overall file structure as other programs, so PhotoRec will recover their files but give them the wrong file extension. After determining a couple of file types of interest that might have been misidentified, I added some code to my script to try to identify these.
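As an illustration of the general idea, here is how a check for ZIP-based formats might look. The formats in the table are well-known stand-ins chosen for the example, not the file types I was actually looking for, and the function name is made up.

```python
import io
import zipfile

# Member names that give away a more specific ZIP-based format.
# Stand-in examples only; extend with whatever formats matter to you.
MEMBER_HINTS = {
    "word/document.xml": ".docx",
    "xl/workbook.xml": ".xlsx",
    "META-INF/MANIFEST.MF": ".jar",
}


def refine_zip_type(data: bytes):
    """Return a more specific extension for ZIP-based data, or None."""
    # Cheap check first: a ZIP local file header starts with these bytes.
    if not data.startswith(b"PK\x03\x04"):
        return None
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            names = set(archive.namelist())
    except zipfile.BadZipFile:
        # Truncated or overwritten archives end up here.
        return None
    for member, extension in MEMBER_HINTS.items():
        if member in names:
            return extension
    return None
```

The same pattern of a cheap signature check followed by a look at the internal structure applies to other shared container formats, though each one needs its own parsing.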

I did uncover a couple of interesting files this way and was even able to somewhat repair some of the corrupted files that were found. But by and large, this turned up a bunch of false positives as the search was fairly broad.