MKV header & raw file recovery
- 
				abolibibelot
- Posts: 40
- Joined: Sun Jan 31, 2016 5:45 pm
- Location: France
MKV header & raw file recovery
Working on two hard drives which contained many movies in MKV format, I discovered an issue (probably easy to fix) in R-Studio's raw file detection for that format.
The first HDD has an intact file system, the files have been simply deleted, and R-Studio (v7.7) finds them with their original names and attributes. However, some of the MKVs also appear as "Extra found files" (with a "link" symbol), but not all of them.
The second HDD has a severely corrupted file system (about 25GB have been overwritten with zeros), so no partition structure or file tree can be identified. Surprisingly, R-Studio only manages to find three MKV files, all three truncated at a few KB, whereas Photorec finds many full-length, perfectly readable MKV files. If I examine those files with WinHex, it appears that there are two different headers, and R-Studio only detects one.
[1] 1A 45 DF A3 93 42 82 88 6D 61 74 72 6F 73 6B 61 > detected by R-Studio
[2] 1A 45 DF A3 A3 42 86 81 01 42 F7 81 01 42 F2 81 04 42 F3 81 08 42 82 88 6D 61 74 72 6F 73 6B 61 > not detected by R-Studio
WinHex has a file carving function and recognizes MKV files with both headers by default. The "File Type Signatures Search.txt" file in the WinHex directory does indeed contain both header definitions:
Matroska mkv;mka (matroska|\x01\x42\xF7\x81\x01\x42\xF2\x81) 8 10485760
(Which means: at offset 8 there can be either "matroska" = type [1] or "01 42 F7 81 01 42 F2 81" = type [2], and the default length will be 10485760 bytes.)
And indeed, if I examine the files from the first HDD, those which appear in "Extra found files" all have header type [1], and those which do not all have header type [2].
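To make the WinHex rule concrete, here is a minimal Python sketch that applies the same alternation at offset 8 to a buffer and reports which of the two header types it matches. The function name `header_type` is hypothetical, and the sketch only mirrors the signature line quoted above; it is not WinHex's or R-Studio's actual matching code.

```python
import re

# Same alternation as the WinHex signature line quoted above.
SIGNATURE = re.compile(rb"(matroska|\x01\x42\xF7\x81\x01\x42\xF2\x81)")
SIGNATURE_OFFSET = 8  # offset given in the WinHex definition


def header_type(buf: bytes):
    """Return 1 or 2 for the two headers described in the post, else None."""
    match = SIGNATURE.match(buf, SIGNATURE_OFFSET)
    if match is None:
        return None
    return 1 if match.group(1) == b"matroska" else 2


if __name__ == "__main__":
    type1 = bytes.fromhex("1A 45 DF A3 93 42 82 88 6D 61 74 72 6F 73 6B 61")
    type2 = bytes.fromhex(
        "1A 45 DF A3 A3 42 86 81 01 42 F7 81 01 42 F2 81 "
        "04 42 F3 81 08 42 82 88 6D 61 74 72 6F 73 6B 61"
    )
    print(header_type(type1), header_type(type2))
```

Checking the four magic bytes `1A 45 DF A3` at offset 0 as well would make the test stricter; the WinHex line shown above does not include them, so the sketch leaves them out.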
The other issue is the file length. Apparently R-Studio cuts the MKV files after detecting a certain number of "00" bytes. Yet many files, including those MKVs, can contain a large number of null bytes at any point (in this case right after the header), so this can't be a good way to determine where a file ends. The default behaviour should be: consider that the file continues unless the start of another file is detected. I tried to create custom settings for MKV so as to detect all those files with R-Studio, but so far with no success.
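The "continue until the next file start" rule can be sketched in a few lines of Python. This is only an illustration of the idea under stated assumptions: headers are assumed to begin on sector boundaries, the string "matroska" is assumed to appear within the first 64 bytes (true for both headers shown above), and the names `looks_like_mkv` and `carve_ranges` are hypothetical. It says nothing about how R-Studio or Photorec work internally.

```python
EBML_MAGIC = bytes.fromhex("1A45DFA3")  # first four bytes of both headers above


def looks_like_mkv(block: bytes) -> bool:
    """Heuristic: EBML magic at offset 0 and 'matroska' shortly after it."""
    return block[:4] == EBML_MAGIC and b"matroska" in block[:64]


def carve_ranges(image: bytes, sector: int = 512):
    """Return (start, end) pairs; each file ends where the next header starts."""
    starts = [
        off
        for off in range(0, len(image), sector)
        if looks_like_mkv(image[off:off + 64])
    ]
    # The last file is assumed to run to the end of the scanned area.
    ends = starts[1:] + [len(image)]
    return list(zip(starts, ends))


if __name__ == "__main__":
    header = bytes.fromhex("1A45DFA3934282886D6174726F736B61")
    # Tiny fake "disk": one empty sector, then two files of 1024 and 1536 bytes.
    disk = bytes(512) + header.ljust(1024, b"\0") + header.ljust(1536, b"\0")
    for start, end in carve_ranges(disk):
        print(start, end, end - start)
```

For a real drive the same loop would read from a disk image or device (for example through `mmap`) instead of an in-memory `bytes` object, and it only gives correct sizes when files are stored contiguously, as in the simple case described here.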
(Besides, R-Studio can't preview MKV files as it can for other video formats. In such a case it would be nice to have at least the option of using an external program, like VLC Media Player.)
So, could this issue be fixed in a future update? And how could I create a customized MKV definition that recognizes both types of headers and finds the correct file length, in such a simple case where the files are mostly stored one right after the other on the HDD?
			
									
									
Re: MKV header & raw file recovery
1. If a file appears in the "Extra found files" section, that means that R-Studio couldn't find the parent folders of the file.
2. R-Studio cuts a file when it finds the beginning of the next file.
I'll pass your message to our developers.
			
									
									
- 
				abolibibelot
- Posts: 40
- Joined: Sun Jan 31, 2016 5:45 pm
- Location: France
Re: MKV header & raw file recovery
Quote: 1. If a file appears in the "Extra found files" section, that means that R-Studio couldn't find parent folders of the file.
With the version I currently use, some files appear both in the file tree (i.e. with their parent folders) AND in "Extra found files", and those which appear in both places have a "link" symbol (an arrow on the icon, and there's an option in the context menu to go directly to the corresponding linked file). In this case, as I explained, the MKV files with the "link" symbol on their icon all have the same type of header. This is consistent with the fact that, on the second HDD (where no file system / file tree could be identified), the files with the other type of header couldn't be retrieved at all (and I know they are still there because I could extract a dozen of them with Photorec or WinHex, and they play fine from start to finish). I ran R-Studio twice for about 10 hours on that 4TB HDD, which was almost filled with MKV files, and it only found 3 files of a few KB each; that's quite a disappointment. I was hoping I could use R-Studio for that task, as I don't have enough space to extract everything with Photorec (which runs automatically but doesn't allow scanning only a portion of a drive), and doing it manually with WinHex is going to be very tedious.
Quote: 2. R-Studio cuts a file when it finds the beginning of a next file.
Well, in this case, the MKV metadata (encoding parameters etc., after the header and before the actual video data) is treated as a text file (I can find that metadata in some .txt "Extra found files"). So yes, it seems to stop when it finds the beginning of a next file, but it shouldn't consider this to be another file. The problem, as I found out by examining a few MKV files, is that there is no regular footer, so it must be rather tricky to correctly identify the end of the file. Yet Photorec does a pretty good job. I haven't tried disabling the .txt file type in the options to see if it improves the size detection for those MKVs (in this case I know that there were only video files on that HDD, but even if that works, it's still an issue for general cases of data recovery with mixed contents and unknown file types).
Re: MKV header & raw file recovery
Turning on only the necessary file types is surely a good idea. And if R-Studio doesn't recognize the files, you can write your own Known File Type description file (Creating a Custom Known File Type for R-Studio). Sorry if this sounds like sending you off to fry an egg, but there are too many file types, and we just cannot keep up with all the changes.
			
									
									
						Re: MKV header & raw file recovery
If you send me 3-4 small pieces (a header and short body piece) of those mkv files, I can write that Known File Type description file.
			
									
									
						- 
				abolibibelot
- Posts: 40
- Joined: Sun Jan 31, 2016 5:45 pm
- Location: France
Re: MKV header & raw file recovery
Quote: If you send me 3-4 small pieces (a header and short body piece) of those mkv files, I can write that Known File Type description file.
Thanks for your diligence. I'd be glad to contribute to the improvement of this excellent software.
Here are six 1 MB pieces cut with WinHex (the total sizes indicated in the file names are only approximate, as those files were extracted manually with WinHex as I indicated before, since no residual file system could be identified on that HDD). That's enough for them to be fully recognized by MediaInfo, so you can verify which muxer and which parameters were used for each of them. There are two with the type 1 header (the one currently recognized by R-Studio) and four with the type 2 header. (I found no clear pattern; the type of header doesn't seem to be correlated with an older or newer version of MKVMerge. -- Actually it appears that the files with the type 2 header have an encoding date displayed in MediaInfo, whereas those with type 1 don't.)
http://www.cjoint.com/c/FDmr4YCwLAy
Here's also the "File Type Signature Search" included in WinHex, which I cited in my previous message :
http://www.cjoint.com/c/FDmsjvt8bmy
Re: MKV header & raw file recovery
I downloaded the files. Will look at them ASAP.
			
									
									
						Re: MKV header & raw file recovery
Here is what I created and tested for the mkv files of both types:
<?xml version="1.0" encoding="UTF-8"?>
<FileTypeList version="2.0">
  <FileType id="50001" group="Matroska" description="Matroska_Type1" features="" extension="mkv">
    <Begin combine="AND">
      <Signature offset="8">matroska</Signature>
    </Begin>
  </FileType>
  <FileType id="50002" group="Matroska" description="Matroska_Type2" features="" extension="mkv">
    <Begin combine="AND">
      <Signature offset="24">matroska</Signature>
    </Begin>
  </FileType>
</FileTypeList>
The only difference is that the string "matroska" appears at an offset of 8 bytes for Type 1 and at an offset of 24 bytes for Type 2.
Unfortunately, there is no way to find the end of the file.
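Outside the definition file format, the container itself can often give the length: in Matroska, the EBML header (ID `1A 45 DF A3`) is normally followed by a Segment element (ID `18 53 80 67`) whose size field covers the rest of the file. The Python sketch below reads those two size fields to estimate the total length. It is a sketch under assumptions, not a recovery tool: it assumes the Segment follows the EBML header directly and that the muxer wrote a real size rather than the "unknown size" marker, and the function names are hypothetical.

```python
EBML_ID = bytes.fromhex("1A45DFA3")
SEGMENT_ID = bytes.fromhex("18538067")


def read_vint(buf: bytes, pos: int):
    """Decode an EBML size field at pos; return (value or None, byte length)."""
    if pos >= len(buf) or buf[pos] == 0:
        raise ValueError("missing or invalid size field")
    first = buf[pos]
    length = 9 - first.bit_length()  # leading zero bits give the field width
    if pos + length > len(buf):
        raise ValueError("buffer too short")
    value = first & ((1 << (8 - length)) - 1)  # drop the width marker bit
    for byte in buf[pos + 1:pos + length]:
        value = (value << 8) | byte
    unknown = value == (1 << (7 * length)) - 1  # all ones means "unknown size"
    return (None if unknown else value), length


def estimate_length(buf: bytes):
    """Return the total file length implied by the Segment size, or None."""
    if buf[:4] != EBML_ID:
        return None
    header_size, n = read_vint(buf, 4)
    if header_size is None:
        return None
    pos = 4 + n + header_size  # first byte after the EBML header
    if buf[pos:pos + 4] != SEGMENT_ID:
        return None
    segment_size, m = read_vint(buf, pos + 4)
    if segment_size is None:
        return None
    return pos + 4 + m + segment_size
```

For the type 1 header above, the byte `93` after the magic decodes to a header size of 19 bytes, and for type 2 the byte `A3` decodes to 35 bytes, which is why "matroska" sits at different offsets. Only the first few dozen bytes of each file are needed for this estimate.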
			
									
									
- 
				abolibibelot
- Posts: 40
- Joined: Sun Jan 31, 2016 5:45 pm
- Location: France
Re: MKV header & raw file recovery
Sorry for the late update; I let this issue (and those hard drives) lie for quite a while.
Last night I tested this definition file on the whole used portion of the second hard drive (the one with no remaining file system). The definition is simple and clever in this particular case, but may not be specific enough to be implemented as is in R-Studio: other file types could contain “matroska” (this very web page, for instance), and if that string happens to be at an offset of +8 or +24, the file is going to be recognized as an MKV file. So, it does seem to work for header detection, but the identified file sizes are crazy.
http://www.cjoint.com/c/FIukUJBJaiy (direct link : http://www.cjoint.com/doc/16_09/FIukUJB ... 3%A9es.png)
http://www.cjoint.com/c/FIulKYpw8Jy (direct link : http://www.cjoint.com/doc/16_09/FIulKYp ... ype-1-.png)
It doesn't cut each file where the next one begins. For example, files 0000.mkv and 0001.mkv are reported as 65 793 949 696 and 58 764 492 800 bytes respectively; the difference is 7 029 456 896 bytes, which is exactly the size of the first file extracted manually with WinHex. The same holds for the following files, then 0006.mkv has the correct size (7 314 735 104 bytes, about 7GB), then the sizes jump again to about 70GB... The largest file size shown in the preview panel is 482 484 944 896 bytes (449GB!).
http://www.cjoint.com/c/FIulnBkuXGy (direct link : http://www.cjoint.com/doc/16_09/FIulnBk ... illes-.png)
What could explain such behaviour, and is there a possible fix? (I could still continue extracting those damn files manually with WinHex, but there are about 300 of them, so it's going to be quite a chore... Photorec identifies the headers correctly and doesn't produce such humongous file sizes, but sometimes a file is cut short for no apparent reason. It also doesn't let you select an interval to skip portions known to be empty, or resume the recovery on such a large volume, so it's not ideal either.)
Here are the offsets and sizes of the first 15 files I manually identified, in case it helps:
01 : 36004757504-43034214399 > 7029456896
02 : 43034214400-51874496511 > 8840282112
03 : 51874496512-61262053375 > 9387556864
61262053376-61262069759 > 16384 = remnant of index / folder structure
04 : 61262069760-69479497727 > 8217427968
05 : 69479497728-80038916095 > 10559418368
80038916096-80038920191 > 4096 = remnant of index / folder structure
06 : 80038920192-94483972095 > 14445051904
07 : 94483972096-101798707199 > 7314735104
08 : 101798707200-107662475263 > 5863768064
09 : 107662475264-114702745599 > 7040270336
10 : 114702745600-122918207487 > 8215461888
11 : 122918207488-129958674431 > 7040466944
12 : 129958674432-136995340287 > 7036665856
13 : 136995340288-144033447935 > 7038107648
14 : 144033447936-147548536831 > 3515088896
15 : 147548536832-158103551999 > 10555015168
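The reported sizes can be checked against this list with a little arithmetic. The Python sketch below adds each size reported by R-Studio to the start offset of the corresponding file and looks up which manually identified file begins at the resulting end offset. One assumption is built in and labelled in the code: 0000.mkv, 0001.mkv and 0006.mkv are taken to be files 01, 02 and 07 of the list, which the sizes quoted in the post suggest but do not prove.

```python
# (start offset, size) in bytes for files 01-08, copied from the list above.
manual = {
    1: (36004757504, 7029456896),
    2: (43034214400, 8840282112),
    3: (51874496512, 9387556864),
    4: (61262069760, 8217427968),
    5: (69479497728, 10559418368),
    6: (80038920192, 14445051904),
    7: (94483972096, 7314735104),
    8: (101798707200, 5863768064),
}

# Sizes reported by R-Studio, as quoted in the post.
# ASSUMPTION: 0000.mkv = file 01, 0001.mkv = file 02, 0006.mkv = file 07.
reported = {1: 65793949696, 2: 58764492800, 7: 7314735104}


def reported_end(file_no: int) -> int:
    """Offset just past the end of the file, using the reported size."""
    start, _ = manual[file_no]
    return start + reported[file_no]


def file_starting_at(offset: int):
    """Number of the manually identified file that starts at offset, if any."""
    for number, (start, _) in manual.items():
        if start == offset:
            return number
    return None


if __name__ == "__main__":
    for number in sorted(reported):
        end = reported_end(number)
        print(f"file {number:02d}: reported end {end}, "
              f"start of file {file_starting_at(end)}")
```

With these figures, all three reported ranges end at offset 101798707200, which is where file 08 begins. That suggests the oversized files were all cut at the same later header instead of at the very next one; it is an observation from the numbers, not a confirmed explanation of R-Studio's behaviour.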
			
									
									
- 
				abolibibelot
- Posts: 40
- Joined: Sun Jan 31, 2016 5:45 pm
- Location: France
Re: MKV header & raw file recovery
I'm re-reading this one-year-old thread which I started.
– Could no one come up with an explanation for those weird file sizes obtained with the custom template provided by “Alt”?
– Has there been any progress in later versions of R-Studio regarding MKV file detection and recovery? (Version 7.7 was used at the time. In that particular case, I finally extracted all the files with Photorec, selecting only the Matroska file type, which worked very well, even though it would have been more convenient with R-Studio's GUI.)
			
									
									