Dupedit is a program for finding duplicate files. Give it files and folders, and dupedit will find all sets of exactly identical files. It cares only about the content of files (not filenames). Here is what makes dupedit special:
newest: [ dupedit-5.5.1.tar.bz2 | dupedit-5.5-win32.zip ]
old releases: dupedit/
|operating systems||All versions tested on Linux. Version 5.5 miraculously compiles on Windows (using MSYS/MinGW) and was tweaked to work OK.|
|license||LGPL3 (version 5.5.1 and before), MPL2 (future versions)|
|languages (localized at compile time)||en.ascii, nb-NO.UTF-8|
|author||Andreas Nordal (ɯoɔ˙lıɐɯƃ@lɐpɹou˙sɐǝɹpuɐ)|
The default action is to list identical files in groups. Explanation follows after the example:
    [me@pc:~/dupedit/testdir] touch empty0 empty1
    [me@pc:~/dupedit/testdir] echo -n "hello" > dup0
    [me@pc:~/dupedit/testdir] echo -n "world" > unik
    [me@pc:~/dupedit/testdir] cp dup0 dup1; cp dup0 dup2
    [me@pc:~/dupedit/testdir] dupedit .
    -- #0 -- 0 B --
    0.0	empty1
    0.1	empty0
    -- #1 -- 5 B --
    1.0	dup0
    1.1	dup1
    1.2	dup2
    In total: 5 identical files. Deduplication can save 10 bytes.
    [me@pc:~/dupedit/testdir] #Let's create false copies
    [me@pc:~/dupedit/testdir] ln dup0 hardlink0
    [me@pc:~/dupedit/testdir] ln -s dup1 symlink1
    [me@pc:~/dupedit/testdir] sudo mount --bind . ../mnt
    [me@pc:~/dupedit/testdir] dupedit dup? *link? ../testdir/dup0 ../mnt/dup0
    -- #0 -- 5 B --
    0.0.0	../mnt/dup0
    0.0.1	../testdir/dup0
    0.0.2	hardlink0
    0.0.3	dup0
    0.1.0	symlink1
    0.1.1	dup1
    0.2	dup2
    In total: 7 identical files. Deduplication can save 10 bytes.
explanation of output
The grouping of files is given by 2 or 3 dot-separated numbers (in hex) preceding each file's (relative) path, with a tab in between. Together, these numbers uniquely enumerate the file's place in the 2 levels of grouping:
Files with a common first number (first level grouping) are identical — they contain the same data.
Files with a common first and second number (second level grouping) are physically the same file — writing to one file will affect all.
The third number is there only to enumerate members of the second level group, and is omitted when there is only one member.
For human readability, each first level group is preceded by a comment of the form:
-- #<first level group number> -- <filesize> --
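To illustrate the format described above, here is a minimal sketch of a parser for such a listing. It is not part of dupedit; the function name `parse_listing` and the nested-dictionary result are assumptions made for this example. It relies only on what is stated above: hex numbers separated by dots, then a tab, then the path.

```python
import re
from collections import defaultdict

# "<first>.<second>[.<third>]<TAB><path>", all numbers in hex.
ENTRY = re.compile(
    r"^([0-9a-fA-F]+)\.([0-9a-fA-F]+)(?:\.([0-9a-fA-F]+))?\t(.+)$"
)

def parse_listing(text):
    """Return {first_level: {second_level: [paths]}} for a listing.

    Lines that are not entries (the "-- #n -- size --" comments and the
    summary line) are ignored.
    """
    groups = defaultdict(lambda: defaultdict(list))
    for line in text.splitlines():
        match = ENTRY.match(line)
        if match:
            first = int(match.group(1), 16)
            second = int(match.group(2), 16)
            groups[first][second].append(match.group(4))
    return {first: dict(seconds) for first, seconds in groups.items()}

sample = "-- #0 -- 5 B --\n0.0.0\thardlink0\n0.0.1\tdup0\n0.1\tdup1\n"
print(parse_listing(sample))
# prints {0: {0: ['hardlink0', 'dup0'], 1: ['dup1']}}
```

In the result, each inner list holds the paths of one physical file, and each outer key holds one set of identical content.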
There are 2 ways to automate treatment of duplicate files using dupedit:
The -exec option does the same job as -exec in the find program, but with a different syntax: just terminate the command with a percent sign (%); the filename arguments are implicitly appended at the end.
Let's say you want to delete all except one file in each group of identical files. You could write this script, preservefirst.bash, in your current directory:
    #!/bin/bash
    shift
    rm -f "$@"
You can invoke it from dupedit as follows:
dupedit -exec bash preservefirst.bash % what ever
If you wanted to hardlink the files instead, you would write:
    #!/bin/bash
    winner="$1"
    shift
    for i in "$@"; do
        ln -f "$winner" "$i"
    done
The 2 scripts above are included with dupedit 5.5, along with a program that selects the surviving file based on path prefix.
Admittedly, this is a bit hackish, and loses all control over false copies, but for now, this is how I do it. If for example we delete the symlink target instead of the symlink, or we delete one of two equivalent paths, we are screwed…
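One way to reduce the risk described above is to let the script itself refuse to touch false copies. The following is only a sketch, not one of the scripts shipped with dupedit: the names `safe_to_delete` and `preserve_first` are made up for this example, and it assumes the script receives the paths of one group as arguments, the way the bash scripts above do. It is deliberately conservative: it never deletes a symlink, and never deletes a path that leads to the same file as the survivor.

```python
import os

def safe_to_delete(winner, candidate):
    """True only if removing candidate cannot affect the winner's data."""
    if not os.path.lexists(candidate):
        return False  # already gone, e.g. removed through an equivalent path
    if os.path.islink(candidate):
        return False  # conservative: leave symlinks alone
    if os.path.samefile(winner, candidate):
        return False  # same device and inode: a false copy of the winner
    return True

def preserve_first(paths):
    """Keep paths[0]; delete the others when that is safe. Return deletions."""
    if not paths:
        return []
    winner, rest = paths[0], paths[1:]
    deleted = []
    for path in rest:
        if safe_to_delete(winner, path):
            os.remove(path)
            deleted.append(path)
    return deleted
```

Because `os.path.samefile` compares device and inode numbers, this check inherits the network filesystem caveats described further down.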
The version number is bumped when I change the design significantly. The program changed name from fcmp to dupedit between version 4 and 5. This changelog is incomplete, as I am a bit lazy.
work in progress
Instead of just outputting and forgetting results, this totally different beast is made with interactivity in mind. When presenting each set of duplicates, a primitive shell lets you do whatever you want with them. It works by the same principles, but you won't find many source lines in common with version 5: a never-ending supply of new ideas while studying real-time programming at university led to some experiments and to the boring re-implementation that followed (the file reading subsystem was rewritten 4 times).
Will have to wait for version 6.
Below is a screenshot of my 2 attempts at creating a graphical user interface for version 5.4.
As we can see, they are localized in Norwegian.
Only the first became usable.
Sadly, I was too lazy to fix the last bug, so I never released it.
I'm a n00b at GUI design.
The idea is to open dupedit's output in your editor of choice, so you can delete lines corresponding to files you want deleted (maybe also giving some fancy commands if you want to sym/hardlink files instead). When you save, dupedit finds and acts on your changes. For the impatient, modifying something like vidir to operate on output from dupedit should be straightforward. The format of dupedit's output was created with this in mind.
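The core of that idea, finding which lines the user deleted, can be sketched in a few lines. This is a hypothetical illustration, not something dupedit does today; `removed_paths` is an invented name, and the sketch assumes that every entry line contains a tab between the group numbers and the path while comment and summary lines contain none.

```python
def removed_paths(original, edited):
    """Return paths listed in original whose lines are missing from edited."""
    def entries(text):
        # An entry is "<numbers><TAB><path>"; other lines are commentary.
        return [
            line.split("\t", 1)[1]
            for line in text.splitlines()
            if "\t" in line
        ]
    kept = set(entries(edited))
    return [path for path in entries(original) if path not in kept]
```

A real implementation would also have to decide what to do when the user edits a line instead of deleting it.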
no hashing
This is the strength of dupedit. The algorithm may be described as «compare as you go». Each file is read only once, no matter how many duplicates you have or how big they are.
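One possible reading of «compare as you go» is sketched below. It is not dupedit's actual implementation, just an illustration of the principle under simple assumptions: all candidate files are read block by block in lockstep, and a group is split as soon as its members' blocks differ. The name `identical_sets` and the block size are choices made for this example. Note that it keeps every file open at once, which a real tool would have to avoid for large inputs.

```python
from collections import defaultdict

def identical_sets(paths, block_size=65536):
    """Partition paths into sets of byte-identical files.

    No hashing: blocks are compared directly, and each file is read at
    most once, front to back.
    """
    handles = {path: open(path, "rb") for path in paths}
    try:
        pending = [list(paths)]  # groups whose members still look identical
        finished = []
        while pending:
            group = pending.pop()
            # Read the next block of every member and split the group by it.
            split = defaultdict(list)
            for path in group:
                split[handles[path].read(block_size)].append(path)
            for block, members in split.items():
                if len(members) < 2:
                    continue  # unique so far: no need to read further
                if block == b"":
                    finished.append(members)  # reached end of file together
                else:
                    pending.append(members)
        return finished
    finally:
        for handle in handles.values():
            handle.close()
```

A file that differs from all others early on is abandoned after its first differing block, so unique files cost very little.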
false copies — the «file self-comparison hazard»
There might be many reasons why multiple paths lead to the same underlying file:
Comparing a file against itself is a serious hazard: it can only result in falsely reporting the file as a copy of itself, which users will easily believe when the file is presented under its multiple filenames (not to mention what it would do to automatic deduplication). A file on a UNIX filesystem has only one inode number, and a filesystem has only one device number (even in multiple simultaneous mounts of the filesystem*). Files that share both inode and device numbers are considered by dupedit to be «false copies» and are treated as one file.
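The inode and device check can be illustrated with a short sketch. It is an example of the general technique, not code from dupedit; `group_false_copies` is a name invented here. It uses `os.stat`, which follows symlinks, so a symlink ends up in the same group as its target.

```python
import os
from collections import defaultdict

def group_false_copies(paths):
    """Group paths that lead to the same underlying file."""
    by_identity = defaultdict(list)
    for path in paths:
        st = os.stat(path)  # follows symlinks to their target
        # Device number plus inode number identifies one file on UNIX.
        by_identity[(st.st_dev, st.st_ino)].append(path)
    return list(by_identity.values())
```

Each returned group should be treated as one file: comparing its members against each other would be the self-comparison described above.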
*beware of network filesystems
Network filesystems do not inherit the device ID of the source filesystem. More annoyingly (at least on sshfs), inode numbers do not reflect hardlink relationships. Therefore, you cannot trust dupedit's ability to distinguish false copies from physical copies if some of them are on a network filesystem, which means you must consciously do it yourself.**
**not ideal for network mounts anyway
Running dupedit, or any non-distributed deduplication algorithm, on a network mount is a luxurious way to spend network bandwidth compared to running it locally where the files are. To compare files between computers, make each computer hash its local files and send only the checksums over the net. The nearest thing to such a distributed deduplication program that I know of is rsync, which is not a deduplication program.
Not ideal: Jamming your network & exposing yourself to the «file self-comparison hazard».
    [me@pc:~] mkdir mnt && sshfs email@example.com:/home/user mnt
    [me@pc:~] dupedit mnt
Not any better: Other deduplication programs may be based on checksumming (e.g. fdupes), but that does not solve the underlying problem that network filesystems are made to send file contents over the net, not checksums. It is also the network filesystem that possibly inhibits «samefile detection», thereby causing a «file self-comparison hazard».
    [me@pc:~] mkdir mnt && sshfs firstname.lastname@example.org:/home/user mnt
    [me@pc:~] fdupes mnt
Correct: Run dupedit on the machine where the files are.
    [me@pc:~] ssh email@example.com
    [user@server:/home/user] dupedit .
Algorithmically speaking: Only when checksums match, and you are hash-collision paranoid, might you have an excuse for running a deduplication program on a network mount.
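For comparing files between computers, the approach suggested above (each computer hashes its local files, and only checksums travel over the net) might look roughly like this. It is a sketch under assumptions: SHA-256 is an arbitrary choice, the names `manifest` and `common` are invented, and how the manifests are transferred between machines is left out.

```python
import hashlib
import os

def manifest(root):
    """Map SHA-256 hex digest -> paths under root. Run where the files are."""
    result = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.isfile(path):
                continue  # skip broken symlinks and special files
            digest = hashlib.sha256()
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(65536), b""):
                    digest.update(chunk)
            result.setdefault(digest.hexdigest(), []).append(path)
    return result

def common(local, remote):
    """Digests found in both manifests, with the paths on each side."""
    return {d: (local[d], remote[d]) for d in local.keys() & remote.keys()}
```

Matching checksums only make files candidates for being identical; the hash-collision paranoid would still compare the actual contents of those candidates.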