dupedit
description
Dupedit is a program for finding duplicate files. Give it files and folders, and dupedit will find all sets of exactly identical files. It cares only about the content of files (not filenames). Here is what makes dupedit special:
download
| releases | newest: dupedit-5.5.1.tar.bz2, dupedit-5.5-win32.zip; old releases: dupedit/ |
|---|---|
| forum | cli-apps.org |
| developmental repositories | |
| operating systems | All versions tested on Linux. Version 5.5 miraculously compiles on Windows with MinGW and appears to work OK. |
| license | LGPL3 with expiry date to public domain. |
| languages (localized at compile time) | en.ascii, nb-NO.UTF-8 |
| programming language | C |
| author | Andreas Nordal (ɯoɔ˙lıɐɯƃ@lɐpɹou˙sɐǝɹpuɐ) |
usage
The default action is to list identical files in groups. Explanation follows after the example:
    [me@pc:~/dupedit/testdir] touch empty0 empty1
    [me@pc:~/dupedit/testdir] echo -n "hello" > dup0
    [me@pc:~/dupedit/testdir] echo -n "world" > unik
    [me@pc:~/dupedit/testdir] cp dup0 dup1; cp dup0 dup2
    [me@pc:~/dupedit/testdir] dupedit .
    -- #0 -- 0 B --
    0.0	empty1
    0.1	empty0
    -- #1 -- 5 B --
    1.0	dup0
    1.1	dup1
    1.2	dup2
    In total: 5 identical files. Deduplication can save 10 bytes.
    [me@pc:~/dupedit/testdir] #Let's create false copies
    [me@pc:~/dupedit/testdir] ln dup0 hardlink0
    [me@pc:~/dupedit/testdir] ln -s dup1 symlink1
    [me@pc:~/dupedit/testdir] sudo mount --bind . ../mnt
    [me@pc:~/dupedit/testdir] dupedit dup? *link? ../testdir/dup0 ../mnt/dup0
    -- #0 -- 5 B --
    0.0.0	../mnt/dup0
    0.0.1	../testdir/dup0
    0.0.2	hardlink0
    0.0.3	dup0
    0.1.0	symlink1
    0.1.1	dup1
    0.2	dup2
    In total: 7 identical files. Deduplication can save 10 bytes.
explanation of output
The grouping of files is given by 2 or 3 dot-separated numbers (in hex) preceding each file's (relative) path, separated from it by a tab. Together they enumerate the file's place in the 2 levels of grouping:
Files with a common first number (first level grouping) are identical — they contain the same data.
x.x.x filename
Files with a common first and second number (second level grouping) are physically the same file — writing to one file will affect all.
x.x.x filename
The third number is there only to enumerate members of the second level group, and is omitted when there is only one member.
x.x.x filename
For human readability, each first level group is preceded by a comment of the form:
-- #<first level group number> -- <filesize> --
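Because the listing is line-oriented, it can be post-processed with ordinary text tools. Below is a minimal sketch, assuming the tab-separated format described above and that member numbering starts at 0 as in the example. The helper name `physical_files` is made up for illustration and is not part of dupedit. It prints one path per physical file, together with its first level group number.

```bash
# physical_files: hypothetical helper, not part of dupedit.
# Reads a dupedit listing on stdin and prints "<group><TAB><path>" for one
# path per physical file. Assumes the tab-separated format described above
# and that member numbering starts at 0, as in the example output.
physical_files() {
    awk -F '\t' '
        NF >= 2 {                      # skip "-- #n --" comments and the summary
            n = split($1, id, ".")
            # two numbers: the only member; three numbers: keep member 0
            if (n == 2 || (n == 3 && id[3] == "0"))
                print id[1] "\t" $2
        }'
}
```

It would be used as `dupedit . | physical_files`. Paths that themselves contain tabs or newlines are not handled by this sketch.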
automation
There are 2 ways to automate treatment of duplicate files using dupedit:
-exec
The -exec option does exactly the same as it does in the find program, but the syntax is different. Syntax: just terminate the command with a percent-sign; filename arguments are implicitly at the end.
Let's say you want to delete all except one file for each group of identical files. So you write this script, preservefirst.bash in your current directory:
    #!/bin/bash
    # Keep the first file of the group; delete the rest.
    shift
    rm -f -- "$@"
You can invoke it from dupedit as follows:
dupedit -exec bash preservefirst.bash % what ever
If you wanted to hardlink the files instead, you would write:
    #!/bin/bash
    # Replace every other file in the group with a hardlink to the first.
    winner="$1"
    shift
    for i in "$@"; do
        ln -f -- "$winner" "$i"
    done
The 2 scripts above are included with dupedit 5.5, along with a program that selects the surviving file based on path prefix.
Admittedly, this is a bit hackish, and loses all control over false copies, but for now, this is how I do it. If for example we delete the symlink target instead of the symlink, or we delete one of two equivalent paths, we are screwed…
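For illustration, a prefix-based selector in the same spirit could look like the sketch below. It is not the program bundled with dupedit: the script name, the argument order (prefix first, then the files appended by `-exec`) and the function names are all assumptions. As a partial guard against the hazard just described, it refuses to delete any path that the shell's `-ef` test reports as the same file as the survivor.

```bash
#!/bin/bash
# keepprefix.bash (hypothetical name and interface, not the bundled program).
# Usage: dupedit -exec bash keepprefix.bash PREFIX % what ever

# Print the first file whose path starts with PREFIX, else the first file.
choose_winner() {
    prefix=$1
    shift
    for f in "$@"; do
        case $f in
            "$prefix"*) printf '%s\n' "$f"; return 0 ;;
        esac
    done
    printf '%s\n' "$1"
}

# Delete every file in the group except the winner.
keep_by_prefix() {
    winner=$(choose_winner "$@")
    shift                                  # drop PREFIX, leaving the files
    for f in "$@"; do
        [ "$f" = "$winner" ] && continue
        # Skip false copies: same device and inode as the winner.
        [ "$f" -ef "$winner" ] && continue
        rm -f -- "$f"
    done
}

# Run only when given arguments, so the file can also be sourced.
[ "$#" -lt 2 ] || keep_by_prefix "$@"
```

The guard is conservative: hardlinks and symlinks to the survivor are left alone rather than removed, so the script deletes only true copies.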
changelog
The version number is bumped when I change the design significantly. The program changed name from fcmp to dupedit between version 4 and 5. This changelog is incomplete, as I am a bit lazy.
work in progress
version 6
Instead of just outputting and forgetting results, this totally different beast is made with interactivity in mind. When presenting each set of duplicates, a primitive shell lets you do whatever you want with them. It works by the same principles, but you won't find many source lines in common with version 5: a never-ending supply of new ideas while studying real-time programming at university led to some experiments and then a boring re-implementation (the file reading subsystem was rewritten 4 times).
Qt GUI
Will have to wait for version 6.
Below is a screenshot of my 2 attempts at creating a graphical user interface for version 5.4.
As we can see, they are localized in Norwegian.
Only the first became usable.
Sadly, I was too lazy to fix the last bug, so I never released it.
I'm a n00b at GUI design.


future work
editor
The idea is to open dupedit's output in your editor of choice, so you can delete lines corresponding to files you want deleted (maybe also giving some fancy commands if you want to sym/hardlink files instead). When you save, dupedit finds and acts on your changes. For the impatient, modifying something like vidir to operate on output from dupedit should be straightforward. The format of dupedit's output was created with this in mind.
theory
# no hashing #
This is the strength of dupedit. The algorithm may be described as «compare as you go». Each file is read only once, no matter how many duplicates you have or how big they are.
false copies — the «file self-comparison hazard»
There might be many reasons why multiple paths lead to the same underlying file:
Comparing a file against itself is a serious hazard: it can only result in falsely accusing the file of being a copy of itself, which is easy enough for users to believe when the file is presented under its multiple filenames (not to mention the consequences for auto-deduplication). A file on a UNIX filesystem has only one inode number, and a filesystem has only one device number (even in multiple simultaneous mounts of the filesystem*). Files that share inode and device numbers are considered by dupedit as «false copies» and treated as one file.
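The same criterion can be tried out from a shell. The `-ef` operator of `test` (supported by bash and most other shells) is true when two paths refer to the same device and inode numbers. The sketch below only illustrates the criterion and is not how dupedit implements it. The function name `unique_files` is hypothetical.

```bash
# unique_files: hypothetical helper, not part of dupedit.
# Prints each argument that is not a false copy (same device and inode)
# of an earlier argument. Quadratic in the number of paths.
unique_files() {
    i=0
    for a in "$@"; do
        i=$((i + 1))
        j=0
        dup=no
        for b in "$@"; do
            j=$((j + 1))
            [ "$j" -lt "$i" ] || break     # only compare with earlier paths
            if [ "$a" -ef "$b" ]; then     # same device and inode numbers
                dup=yes
                break
            fi
        done
        [ "$dup" = yes ] || printf '%s\n' "$a"
    done
}
```

Given a file, a hardlink to it, a symlink to it and a real copy, this prints only the file and the real copy. The caveat about network filesystems applies here just as it does to dupedit.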
*beware of network filesystems
Network filesystems do not inherit the device-ID of the source filesystem. More annoyingly, (at least on sshfs) inode numbers do not reflect hardlink relationships. Therefore, you cannot trust dupedit's ability to distinguish false from physical copies if some of them are on a network filesystem, which means you must consciously do it yourself.**
**you're doing it wrong anyway
Running dupedit or any non-distributed deduplication algorithm on a network mount is a total waste of network bandwidth compared to running it locally where the files are or, to compare files between computers, having each computer hash its files locally and send only the checksums over the net. The nearest thing to such a distributed deduplication program that I know of would be rsync, which is not a deduplication program.
Wrong: Jamming your network, slowing down dupedit, exposing yourself to the «file self-comparison hazard».
    [me@pc:~] mkdir mnt && sshfs user@example.com:/home/user mnt
    [me@pc:~] dupedit mnt
Wrong: In case you thought checksumming (like fdupes does) magically saves any bandwidth, think again. It will be hazardous too, read the source or ask Adriano Lopez.
    [me@pc:~] mkdir mnt && sshfs user@example.com:/home/user mnt
    [me@pc:~] fdupes mnt
Correct: Run dupedit on the machine where the files are.
    [me@pc:~] ssh user@example.com
    [user@server:/home/user] dupedit .
Only when checksums match, and you are hash-collision paranoid, might you have an excuse for running a deduplication program on a network mount.
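The cross-machine approach mentioned above (each computer hashes its own files, and only checksums cross the network) might be sketched as follows. This is an illustration, not a feature of dupedit: the function names are hypothetical, `sha256sum` from GNU coreutils is an assumed choice of hash, and the host and paths in the comments are placeholders.

```bash
# Hypothetical helpers; sha256sum (GNU coreutils) is an assumed choice of hash.

# Print "<checksum>  <path>" for every regular file below directory $1.
checksum_tree() {
    (cd "$1" && find . -type f -exec sha256sum {} +)
}

# Print the lines of listing $2 whose checksum also occurs in listing $1.
common_checksums() {
    awk 'FILENAME == ARGV[1] { seen[$1] = 1; next }
         $1 in seen' "$1" "$2"
}

# Example (run by hand; host and paths are placeholders):
#   checksum_tree ~/files > local.sums
#   ssh user@example.com 'cd /home/user && find . -type f -exec sha256sum {} +' > remote.sums
#   common_checksums local.sums remote.sums
```

Only the checksum listings travel over the network. Matching lines are candidates; the hash-collision paranoid can still compare those particular files byte by byte.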
licensing
As usual, source code is the place to express feelings. Code snippet:
    /* dupedit 6.0 pre-alfa
     *
     * Copyright 2009 Andreas Nordal (ɯoɔ˙lıɐɯƃ@lɐpɹou˙sɐǝɹpuɐ)
     *
     * This release of dupedit becomes public domain when the year 2030
     * begins. Until then, redistribution and modification must be done
     * according to the GNU Lesser General Public License as published
     * by the Free Software Foundation, either version 3, or (at your
     * option) any later version. By doing so, you may postpone the
     * transition from LGPL to public domain, but not advance it.
     *
     * This program is distributed in the hope that it will be useful,
     * but WITHOUT ANY WARRANTY; without even the implied warranty of
     * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
     * GNU Lesser General Public License for more details.
     *
     * You should have received a copy of the GNU Lesser General Public License
     * along with this program. If not, see <http://www.gnu.org/licenses/>
     */

    /* Freedom is more than having a fixed set of options. (Richard Stallman)
     * My conclusion:
     * The 4 freedoms of GPL are insufficient; freedom implies no lisence.
     * Solution:
     * Make the lisence last as long as you need or care, just not forever.
     */