...making Linux just a little more fun!
Ben Okopnik [ben at linuxgazette.net]
An amusing example of serendipity: one of our readers just emailed to let me know that a link to a SourceForge project in one of our articles was outdated and needed to point to the new, renamed version of the project. I updated it after taking a moment to verify the SF link - and noticed that some of the project's functionality was relevant to Neil Youngman's question from a couple of months ago.
Pulling down the (small) project tarball and reading the docs supported that impression:
'repeats' searches for duplicate files using a multistage process. Initially, all files in the specified directories (and all of their subdirectories) are listed as potential duplicates. In the first stage, all files with a unique filesize are declared unique and are removed from the list. In the second stage, any files which are actually a hardlink to another file are removed, since they don't actually take up any more disk space. Next, all files for which the first 4096 bytes (adjustable with the -m option) have a unique filehash are declared unique and are removed from the list. Finally, all files which have a unique filehash (for the entire file) are declared unique and are removed from the list. Any remaining files are assumed to be duplicates and are listed on stdout.
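For the curious, the staged winnowing described above is easy to sketch in Python. This is a minimal illustration of the same idea - unique size first, hardlinks skipped, then a partial hash, then a full hash - and not the actual 'repeats' source; the 4096-byte constant mirrors the -m default mentioned in the docs.

```python
#!/usr/bin/env python3
"""Sketch of a multistage duplicate finder, modeled on the stages
described in the 'repeats' documentation (NOT the actual littleutils code)."""

import hashlib
import os
import sys
from collections import defaultdict

HEAD_BYTES = 4096  # like the -m option described above


def sha1(path, limit=None):
    """Hash the first `limit` bytes of a file (or all of it)."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        h.update(f.read(limit) if limit else f.read())
    return h.hexdigest()


def find_duplicates(top):
    # List all regular files, skipping symlinks and extra hardlinks
    # (a hardlink takes no additional disk space).
    seen_inodes = set()
    by_size = defaultdict(list)
    for root, _dirs, files in os.walk(top):
        for name in files:
            path = os.path.join(root, name)
            if not os.path.isfile(path) or os.path.islink(path):
                continue
            st = os.lstat(path)
            key = (st.st_dev, st.st_ino)
            if key in seen_inodes:      # hardlink to a file already listed
                continue
            seen_inodes.add(key)
            by_size[st.st_size].append(path)

    # Stage 1: a unique filesize means a unique file.
    candidates = [g for g in by_size.values() if len(g) > 1]

    # Stage 2: a unique hash of the first HEAD_BYTES means a unique file.
    narrowed = []
    for group in candidates:
        by_head = defaultdict(list)
        for path in group:
            by_head[sha1(path, HEAD_BYTES)].append(path)
        narrowed.extend(g for g in by_head.values() if len(g) > 1)

    # Stage 3: a unique hash of the whole file means a unique file;
    # whatever survives is reported as a group of duplicates.
    dups = []
    for group in narrowed:
        by_full = defaultdict(list)
        for path in group:
            by_full[sha1(path)].append(path)
        dups.extend(g for g in by_full.values() if len(g) > 1)
    return dups


if __name__ == "__main__" and len(sys.argv) > 1:
    for group in find_duplicates(sys.argv[1]):
        print("\n".join(group), end="\n\n")
```

The staging matters for speed: a stat() per file is cheap, so most files are eliminated by size alone before any bytes are read, and most of the rest by the first 4 KB.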
The project is called "littleutils", by Brian Lindholm. There's a number of other handy little utilities in there, all worth exploring.
-- * Ben Okopnik * Editor-in-Chief, Linux Gazette * https://LinuxGazette.NET *
Mulyadi Santosa [mulyadi.santosa at gmail.com]
On Wed, Dec 29, 2010 at 00:58, Ben Okopnik <ben at linuxgazette.net> wrote:
> The project is called "littleutils", by Brian Lindholm. There's a number
> of other handy little utilities in there, all worth exploring.
That reminds me of ssdeep as well (https://ssdeep.sourceforge.net/). The author calls it a "sliding hash" or something like that...
-- regards,
Mulyadi Santosa Freelance Linux trainer and consultant
blog: the-hydra.blogspot.com training: mulyaditraining.blogspot.com
Ben Okopnik [ben at linuxgazette.net]
On Wed, Dec 29, 2010 at 01:25:33AM +0700, Mulyadi Santosa wrote:
> On Wed, Dec 29, 2010 at 00:58, Ben Okopnik <ben at linuxgazette.net> wrote:
> > The project is called "littleutils", by Brian Lindholm. There's a number
> > of other handy little utilities in there, all worth exploring.
>
> Reminds me to ssdeep as well (https://ssdeep.sourceforge.net/). The
> author call it sliding hash or something like that...
I've just taken a look at "ssdeep" (it uses a "rolling hash" method to compute "Context-Triggered Piecewise Hashes"); very interesting, but wouldn't help Neil much, unfortunately.
The point of CTPHs is to allow you to identify small differences (and identify the blocks where the difference occurs) in a large number of files. That's useful because a system attacker may modify a program to add a back door - and then obscure their tracks by randomly changing one harmless bit in thousands of other files (the example given was changing "This program cannot be run in DOS mode" to "This program cannot be run on DOS mode"), which would create a huge list of MD5 mismatches, thus hiding the hacked file.
"ssdeep" is a proof of concept for the CTPH technique. It's interesting to note, though, that CTPHs are not guaranteed to be unique; in fact, because they use a 6-bit hash, there's a 1 in 2^-6 probability of hash collision. CTPH computation is also relatively slow - O(n log n) in the worst case. However, when combined with a more traditional hash, they can quickly sort out files with large modifications from ones with small ones. I can see where, e.g., geneticists would find this useful.
-- * Ben Okopnik * Editor-in-Chief, Linux Gazette * https://LinuxGazette.NET *
Mulyadi Santosa [mulyadi.santosa at gmail.com]
On Wed, Dec 29, 2010 at 02:02, Ben Okopnik <ben at linuxgazette.net> wrote:
> The point of CTPHs is to allow you to identify small differences (and
> identify the blocks where the difference occurs) in a large number of
> files. That's useful because a system attacker may modify a program to
> add a back door - and then obscure their tracks by randomly changing one
> harmless bit in thousands of other files (the example given was changing
> "This program cannot be run in DOS mode" to "This program cannot be run
> on DOS mode"), which would create a huge list of MD5 mismatches, thus
> hiding the hacked file.
Wow, Ben, I am amazed at how fast you deduced all these things, just after my reply a few minutes ago :D
Looks like age hasn't caught up with you :D
-- regards,
Mulyadi Santosa Freelance Linux trainer and consultant
blog: the-hydra.blogspot.com training: mulyaditraining.blogspot.com