dupedit

description


Dupedit is a program for finding duplicate files. Give it files and folders, and dupedit will find all sets of exactly identical files. It cares only about the content of files (not filenames). Here is what makes dupedit special:

  1. Dupedit distinguishes physical copies (duplicates) from multiple names of the same file (false copies). This is important, as deleting a false copy may lead to data loss. For each physical copy, dupedit groups its false copies. Internally, this group of false copies is treated as one file, thereby avoiding the file self-comparison hazard.
  2. Dupedit is fast and reliable. It compares any number of files at once without hashing.
  3. As the name implies, a future goal is to become the ultimate tool for editing duplicates, where editing means managing.

download

releases: newest: [ dupedit-5.5.1.tar.bz2 | dupedit-5.5-win32.zip ]; old releases: dupedit/
forum: cli-apps.org
development repositories:
  • The stable branch of dupedit is part of diverseutils. Clone it from git://nerdvar.com/diverseutils.git to get my latest & greatest usable stuff. This branch will be more up to date than the latest release.
  • The bleeding edge of dupedit development happens at git://nerdvar.com/dupedit.git. Warning: not usable; for curious enthusiasts. What is to become version 6 has been rewritten numerous times over the last couple of years, as my ideas keep getting better, but has seldom been in a compiling state.
operating systems: All versions tested on Linux. Version 5.5 miraculously compiles on Windows with MinGW and appears to work OK.
license: LGPL3 with expiry date to public domain.
languages (localized at compile time): en.ascii, nb-NO.UTF-8
programming language: C
author: Andreas Nordal (ɯoɔ˙lıɐɯƃ@lɐpɹou˙sɐǝɹpuɐ)

usage


The default action is to list identical files in groups. An explanation follows the example:

[me@pc:~/dupedit/testdir] touch empty0 empty1
[me@pc:~/dupedit/testdir] echo -n "hello" > dup0
[me@pc:~/dupedit/testdir] echo -n "world" > unik
[me@pc:~/dupedit/testdir] cp dup0 dup1; cp dup0 dup2
[me@pc:~/dupedit/testdir] dupedit .

-- #0 -- 0 B --
0.0     empty1
0.1     empty0

-- #1 -- 5 B --
1.0     dup0
1.1     dup1
1.2     dup2

In total: 5 identical files. Deduplication can save 10 bytes.
[me@pc:~/dupedit/testdir] #Let's create false copies
[me@pc:~/dupedit/testdir] ln dup0 hardlink0
[me@pc:~/dupedit/testdir] ln -s dup1 symlink1
[me@pc:~/dupedit/testdir] sudo mount --bind . ../mnt
[me@pc:~/dupedit/testdir] dupedit dup? *link? ../testdir/dup0 ../mnt/dup0

-- #0 -- 5 B --
0.0.0   ../mnt/dup0
0.0.1   ../testdir/dup0
0.0.2   hardlink0
0.0.3   dup0
0.1.0   symlink1
0.1.1   dup1
0.2     dup2

In total: 7 identical files. Deduplication can save 10 bytes.

explanation of output


Each file's (relative) path is preceded by 2 or 3 dot-separated numbers (in hex), separated from the path by a tab. These numbers uniquely enumerate the file within the 2 levels of grouping:

  1. Files with a common first number (first level grouping) are identical — they contain the same data.

    a.x.x	filename
  2. Files with a common first and second number (second level grouping) are physically the same file — writing to one file will affect all.

    a.b.x	filename
  3. The third number is there only to enumerate members of the second level group, and is omitted when there is only one member.

    a.b.c	filename
  4. For human readability, each first level group is preceded by a comment of the form:

    -- #<first level group number> -- <filesize> --

automation


There are 2 ways to automate treatment of duplicate files using dupedit:

  1. The easiest is to use the -exec argument to invoke a user-defined script. Your script will be invoked once per group of identical files, with the filenames as arguments. (Versions before 5.5 invoked the script once per file, which was not so useful.) Warning: Dupedit's knowledge of which files are false and which are true copies will not be passed to your script this way.
  2. Have your script parse the output of dupedit. I have not attempted this; rather, I am thinking of this as a possible editor interface. (A minimal parsing sketch follows below.)
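
For illustration, here is a minimal, hypothetical parser sketch in C. It is not part of dupedit; it merely assumes the output format described below under «explanation of output»: group comment lines, blank lines, a summary line, and one «numbers<tab>path» line per file, with the numbers in hex and at least two numbers per line.

/* Hypothetical sketch, not shipped with dupedit: read dupedit's output from
 * standard input and, for every first level group, mark the first physical
 * file (including all its false copies) as «keep» and the remaining
 * physical copies as «remove». */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
	char line[4096];
	long cur_group = -1, keep_physical = -1;

	while (fgets(line, sizeof line, stdin)) {
		char *tab = strchr(line, '\t');
		if (!tab)
			continue; /* group comments, blank lines and the summary have no tab */
		line[strcspn(line, "\n")] = '\0';

		char *end;
		long group = strtol(line, &end, 16);       /* first number (hex) */
		long physical = strtol(end + 1, NULL, 16); /* second number (hex) */

		if (group != cur_group) {
			/* first line of a new group: keep this physical file */
			cur_group = group;
			keep_physical = physical;
		}
		printf("%s\t%s\n",
		       physical == keep_physical ? "keep" : "remove",
		       tab + 1);
	}
	return 0;
}

Note that every name of the kept physical file is printed as «keep», so false copies of the survivor are never marked for removal.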

-exec


The -exec option does the same as it does in the find program, but the syntax is different: just terminate the command with a percent sign; the filename arguments are implicitly appended at the end.

Let's say you want to delete all but one file in each group of identical files. So you write this script, preservefirst.bash, in your current directory:

#!/bin/bash
# Drop the first filename (the survivor), delete the rest.
shift
rm -f "$@"

You can invoke it from dupedit as follows:

dupedit -exec bash preservefirst.bash % what ever

If you wanted to hardlink the files instead, you would write:

#!/bin/bash
# Replace every other filename with a hardlink to the first (the survivor).
winner="$1"
shift
for i in "$@"; do
	ln -f "$winner" "$i"
done

The 2 scripts above are included with dupedit 5.5, along with a program that selects the surviving file based on path prefix.

Admittedly, this is a bit hackish and loses all control over false copies, but for now, this is how I do it. If, for example, we delete the symlink target instead of the symlink, or delete one of two equivalent paths, we are screwed…

changelog


The version number is bumped when I change the design significantly. The program changed its name from fcmp to dupedit between versions 4 and 5. This changelog is incomplete, as I am a bit lazy.

pre version 4
These were experiments that didn't work.
version 4
The first usable version. Used a matrix of booleans for bookkeeping. Ugly matrix-based output.
version 5.0
Use linked lists.
Change name: fcmp → dupedit
New (machine readable) output.
version 5.1 - 5.4
Feature expansions (e.g. elimination by filesize, detection of false copies, directory recursion, the -exec option).
version 5.5
New semantic for the -exec option: Execute the command once with all filenames (a lot more useful than once per filename).
Make -exec work on Windows.
On Windows, use the Windows equivalents of inode number and device ID.
version 5.5.1
Fix the makefile in the 5.5 tarball (the makefile in the git repo is different). I had tried to bundle a library (libwhich), but forgot to instruct gcc to include headers from the current directory, resulting in a compilation error for everyone except me, because I had the library installed…
Thanks go to Asterios Dramis & Tim Beech.
version 6
Invent a crazy datastructure for bookkeeping.
Interactive mode where files can be added and compared.
Read files concurrently.
Opportunistically find subfiles inside other files, and compare them as any other file (useful for archive transparency and to distinguish data from metadata).
In non-interactive mode, be able to take a list of files to compare from standard input (e.g. from the find command; this one was a user request).

work in progress

version 6


Instead of just outputting and forgetting results, this totally different beast is made with interactivity in mind. When presenting each set of duplicates, a primitive shell lets you do whatever you want with them. It works by the same principles, but thanks to a never-ending supply of new ideas while studying real-time programming at university, some experiments, and the boring re-implementations that followed (the file reading subsystem was rewritten 4 times), you won't find many source lines in common with version 5.

Qt GUI


Will have to wait for version 6. Below are screenshots of my 2 attempts at creating a graphical user interface for version 5.4. As you can see, they are localized in Norwegian. Only the first became usable. Sadly, I was too lazy to fix the last bug, so I never released it. I'm a n00b at GUI design.
[Screenshots: dupedit with Qt graphical user interface (initial attempt); dupedit with Qt graphical user interface (with tabs)]

future work

editor


The idea is to open dupedit's output in your editor of choice, so you can delete lines corresponding to files you want deleted (maybe also giving some fancy commands if you want to sym/hardlink files instead). When you save, dupedit finds and acts on your changes. For the impatient, modifying something like vidir to operate on output from dupedit should be straightforward. The format of dupedit's output was created with this in mind.

theory


no hashing


This is the strength of dupedit. The algorithm may be described as «compare as you go». Each file is read only once, no matter how many or how big your duplicates are.

  1. Dupedit is never wrong about which files are identical.
  2. Dupedit is very fast at elimination. Instead of reading whole files in order to hash them, it compares as it goes and stops reading files that no longer look like exact copies. Of course, like other deduplication programs, it only bothers to compare files of equal size. (A minimal sketch of the idea follows below.)
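
To make the idea concrete, here is a minimal sketch of the principle. It is not dupedit's actual source: it only answers whether a set of equally sized, already-opened files are all identical, whereas dupedit groups files. The point is the same, though: blocks are compared as they are read, and reading stops at the first mismatch.

/* Minimal sketch of «compare as you go», not dupedit's real implementation:
 * read all candidate files one block at a time and stop as soon as a block
 * differs, so no file is ever read more than once and mismatching files are
 * abandoned early. Assumes the files have already been matched by size. */
#include <stdio.h>
#include <string.h>

#define BLOCK 4096

/* Returns 1 if all nfiles already-opened streams have identical content. */
static int all_identical(FILE **files, size_t nfiles)
{
	char first[BLOCK], other[BLOCK];

	for (;;) {
		size_t got = fread(first, 1, BLOCK, files[0]);
		for (size_t i = 1; i < nfiles; i++) {
			if (fread(other, 1, BLOCK, files[i]) != got
			 || memcmp(first, other, got) != 0)
				return 0; /* mismatch: skip the rest of these files */
		}
		if (got < BLOCK)
			return 1; /* end of all files reached without a mismatch */
	}
}

A grouping variant would split the candidate set into sub-groups whenever blocks differ instead of giving a yes/no answer; either way, each file is read at most once.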

false copies — the «file self-comparison hazard»


There might be many reasons why multiple paths lead to the same underlying file: hard links, symbolic links, bind mounts, or simply different relative paths to the same place, as demonstrated in the usage example above.

Comparing a file against itself is a serious hazard: it can only result in falsely accusing the file of being a copy of itself, which is easy enough for users to believe when the file is presented with its multiple filenames (not to mention for auto-deduplication purposes). A file on a UNIX filesystem has only one inode number, and a filesystem has only one device number (even across multiple simultaneous mounts of the filesystem*). Files that share inode and device numbers are considered by dupedit to be «false copies» and are treated as one file.
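
To illustrate the principle (this is not dupedit's source code), detecting whether two paths are false copies of each other on a UNIX system boils down to comparing the (device, inode) pair reported by stat(2):

/* Minimal sketch, not dupedit's code: two paths are «false copies» of each
 * other if stat(2) reports the same device and inode number for both,
 * i.e. they name the same underlying file. stat() follows symlinks, so a
 * symlink and its target also count as the same file. */
#include <stdio.h>
#include <sys/stat.h>

static int same_file(const char *path_a, const char *path_b)
{
	struct stat a, b;

	if (stat(path_a, &a) != 0 || stat(path_b, &b) != 0)
		return 0; /* unreadable paths: treat as distinct */
	return a.st_dev == b.st_dev && a.st_ino == b.st_ino;
}

int main(int argc, char **argv)
{
	if (argc != 3) {
		fprintf(stderr, "usage: %s <path> <path>\n", argv[0]);
		return 1;
	}
	puts(same_file(argv[1], argv[2]) ? "false copies (same file)" : "distinct files");
	return 0;
}

This is why, in the usage example above, dup0, hardlink0, ../testdir/dup0 and ../mnt/dup0 all end up in the same second level group.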

*beware of network filesystems


Network filesystems do not inherit the device ID of the source filesystem. More annoyingly (at least on sshfs), inode numbers do not reflect hardlink relationships. Therefore, you cannot trust dupedit's ability to distinguish false from physical copies if some of them are on a network filesystem, which means you must consciously do it yourself.**

**you're doing it wrong anyway


Running dupedit or any non-distributed deduplication algorithm on a network mount is a total waste of network bandwidth compared to running it locally where the files are, or, when comparing files between computers, compared to hashing the files locally on each computer and sending only the checksums over the net. The nearest thing to such a distributed deduplication program that I know of would be rsync, which is not a deduplication program.

Wrong: Jamming your network, slowing down dupedit, exposing yourself to the «file self-comparison hazard».

[me@pc:~] mkdir mnt && sshfs user@example.com:/home/user mnt
[me@pc:~] dupedit mnt

Wrong: In case you thought checksumming (like fdupes does) magically saves any bandwidth, think again. It will be hazardous too; read the source or ask Adriano Lopez.

[me@pc:~] mkdir mnt && sshfs user@example.com:/home/user mnt
[me@pc:~] fdupes mnt

Correct: Run dupedit on the machine where the files are.

[me@pc:~] ssh user@example.com
[user@server:/home/user] dupedit .

Only when checksums match, and you are hash-collision paranoid, might you have an excuse for running a deduplication program on a network mount.

licensing


As usual, source code is the place to express feelings. Code snippet:

/* dupedit 6.0 pre-alfa
 *
 * Copyright 2009 Andreas Nordal (ɯoɔ˙lıɐɯƃ@lɐpɹou˙sɐǝɹpuɐ)
 *
 * This release of dupedit becomes public domain when the year 2030
 * begins. Until then, redistribution and modification must be done
 * according to the GNU Lesser General Public License as published
 * by the Free Software Foundation, either version 3, or (at your
 * option) any later version. By doing so, you may postpone the
 * transition from LGPL to public domain, but not advance it.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 * GNU Lesser General Public License for more details.
 *
 * You should have received a copy of the GNU Lesser General Public License
 * along with this program.  If not, see <http://www.gnu.org/licenses/>
 */

/* Freedom is more than having a fixed set of options. (Richard Stallman)
 * My conclusion:
 * The 4 freedoms of the GPL are insufficient; freedom implies no license.
 * Solution:
 * Make the license last as long as you need or care, just not forever.
 */