dupedit

description


Dupedit is a program for finding duplicate files. Give it files and folders, and dupedit will find all sets of exactly identical files. It cares only about the content of files (not filenames). Here is what makes dupedit special:

  1. Dupedit distinguishes physical copies (duplicates) from multiple names of the same file (false copies). This is important, as deleting a false copy may lead to data loss. For each physical copy, dupedit groups its false copies. Internally, this group of false copies is treated as one file, thereby avoiding the file self-comparison hazard.
  2. Dupedit is fast and reliable. It compares any number of files at once without hashing.
  3. As the name implies, a future goal is to become the ultimate tool for editing duplicates, where editing means managing.

download

releases:
  newest: [ dupedit-5.5.1.tar.bz2 | dupedit-5.5-win32.zip ]
  old releases: dupedit/
forum: cli-apps.org
developmental repositories:
  • The stable branch of dupedit is part of diverseutils. Clone it from git://nerdvar.com/diverseutils.git to get my latest & greatest usable stuff. This branch will be more up to date than the latest release.
  • The bleeding edge of dupedit development happens at git://nerdvar.com/dupedit.git. Warning: Not usable; for curious enthusiasts only. What is to become version 6 has been rewritten numerous times over the last couple of years, as my ideas keep getting better, but has seldom been in a compiling state.
operating systems: All versions tested on Linux. Version 5.5 miraculously compiles on Windows (using msys/minGW) and was tweaked to work ok.
license: LGPL3 (version 5.5.1 and before), MPL2 (future versions)
languages (localized at compile time): en.ascii, nb-NO.UTF-8
programming language: C
author: Andreas Nordal (ɯoɔ˙lıɐɯƃ@lɐpɹou˙sɐǝɹpuɐ)

usage


The default action is to list identical files in groups. An explanation follows the example:

[me@pc:~/dupedit/testdir] touch empty0 empty1
[me@pc:~/dupedit/testdir] echo -n "hello" > dup0
[me@pc:~/dupedit/testdir] echo -n "world" > unik
[me@pc:~/dupedit/testdir] cp dup0 dup1; cp dup0 dup2
[me@pc:~/dupedit/testdir] dupedit .

-- #0 -- 0 B --
0.0     empty1
0.1     empty0

-- #1 -- 5 B --
1.0     dup0
1.1     dup1
1.2     dup2

In total: 5 identical files. Deduplication can save 10 bytes.
[me@pc:~/dupedit/testdir] #Let's create false copies
[me@pc:~/dupedit/testdir] ln dup0 hardlink0
[me@pc:~/dupedit/testdir] ln -s dup1 symlink1
[me@pc:~/dupedit/testdir] sudo mount --bind . ../mnt
[me@pc:~/dupedit/testdir] dupedit dup? *link? ../testdir/dup0 ../mnt/dup0

-- #0 -- 5 B --
0.0.0   ../mnt/dup0
0.0.1   ../testdir/dup0
0.0.2   hardlink0
0.0.3   dup0
0.1.0   symlink1
0.1.1   dup1
0.2     dup2

In total: 7 identical files. Deduplication can save 10 bytes.

explanation of output


Each file's (relative) path is preceded by 2 or 3 dot-separated numbers (in hex), followed by a tab. These numbers uniquely enumerate the file within the 2 levels of grouping:

  1. Files with a common first number (first level grouping) are identical — they contain the same data.

    X.x.x	filename
  2. Files with a common first and second number (second level grouping) are physically the same file — writing to one file will affect all.

    x.X.x	filename
  3. The third number is there only to enumerate members of the second level group, and is omitted when there is only one member.

    x.x.X	filename
  4. For human readability, each first level group is preceded by a comment of the form:

    -- #<first level group number> -- <filesize> --

automation


There are 2 ways to automate treatment of duplicate files using dupedit:

  1. The easiest is to use the -exec argument to invoke a user defined script. Your script will be invoked once per group of identical files, with the filenames as arguments. (Versions before 5.5 invoked the script once per file, which was not so useful.) Warning: Dupedit's knowledge of which files are false and true copies will not be passed to your script this way.
  2. Have your script parse the output of dupedit. I have not attempted this; rather, I am thinking of this as a possible editor interface. A rough sketch of such a parser follows below.
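
For the second approach, here is a minimal parsing sketch in C (not something dupedit ships), assuming the output format described under «explanation of output»: hex group numbers separated by dots, then a tab, then the path. Lines without a tab (comments, blanks, the summary) are simply skipped.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
	char line[4096];

	while (fgets(line, sizeof line, stdin)) {
		/* Enumeration lines look like "0.1.2<TAB>path"; anything
		 * without a tab (comments, blanks, summary) is skipped. */
		char *tab = strchr(line, '\t');
		if (!tab)
			continue;
		*tab = '\0';
		char *path = tab + 1;
		path[strcspn(path, "\n")] = '\0';

		char *end;
		unsigned long group = strtoul(line, &end, 16);
		if (end == line || *end != '.')
			continue;
		unsigned long physical = strtoul(end + 1, &end, 16);

		/* Equal group => same content; equal (group, physical)
		 * => the same underlying file (false copies). */
		printf("content group %lx, physical file %lx: %s\n",
		       group, physical, path);
	}
	return 0;
}

Feeding dupedit's output to such a parser gives a script the same grouping information a human reads from the enumeration, including which entries are false copies of the same physical file.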

-exec


The -exec option does the same as it does in the find program, but the syntax is different: just terminate the command with a percent sign; the filename arguments are implicitly appended at the end.

Let's say you want to delete all except one file from each group of identical files. So you write this script, preservefirst.bash, in your current directory:

#!/bin/bash
# Keep the first file in the group: drop its name from the argument list
# and delete the rest.
shift
rm -f "$@"

You can invoke it from dupedit as follows:

dupedit -exec bash preservefirst.bash % what ever

If you wanted to hardlink the files instead, you would write:

#!/bin/bash
# Keep the first file in the group and replace every other copy
# with a hardlink to it.
winner="$1"
shift
for i in "$@"; do
	ln -f "$winner" "$i"
done

The 2 scripts above are included with dupedit 5.5, along with a program that selects the surviving file based on path prefix.

Admittedly, this is a bit hackish, and it loses all control over false copies, but for now, this is how I do it. If, for example, we delete the symlink target instead of the symlink, or we delete one of two equivalent paths, we are screwed…

changelog


The version number is bumped when I change the design significantly. The program changed name from fcmp to dupedit between version 4 and 5. This changelog is incomplete, as I am a bit lazy.

pre version 4
These were experiments that didn't work.
version 4
The first usable version. Used a matrix of booleans for bookkeeping. Ugly matrix based output.
version 5.0
Use linked lists.
Change name: fcmp → dupedit
New (machine readable) output.
version 5.1 - 5.4
Feature expansions (e.g. elimination by filesize, detection of false copies, directory recursion, the -exec option).
version 5.5
New semantic for the -exec option: Execute the command once with all filenames (a lot more useful than once per filename).
Make -exec work on Windows.
On Windows, use the Windows equivalents of inode number and device ID.
version 5.5.1
Fix the makefile in the 5.5 tarball (the makefile in the git repo is different). I had tried to bundle a library (libwhich), but forgot to instruct gcc to include from the current directory, resulting in a compilation error for everyone except me, because I had the library installed…
Thanks go to Asterios Dramis & Tim Beech.
version 6
Invent a crazy datastructure for bookkeeping.
Interactive mode where files can be added and compared.
Read files concurrently.
Opportunistically find subfiles inside other files, and compare them as any other file (useful for archive transparency and to distinguish data from metadata).
In non-interactive mode, be able to take the list of files to compare from standard input (e.g. from the find command; this one was a user request).

work in progress

version 6


Instead of just outputting and forgetting results, this totally different beast is made with interactivity in mind. When presenting each set of duplicates, a primitive shell lets you do whatever you want with them. It works by the same principles, but thanks to a never-ending supply of new ideas while studying real-time programming at university, plus some experiments and the boring re-implementations that followed (the file reading subsystem was rewritten 4 times), you won't find many source lines in common with version 5.

Qt GUI


Will have to wait for version 6. Below are screenshots of my 2 attempts at creating a graphical user interface for version 5.4. As we can see, they are localized in Norwegian. Only the first became usable. Sadly, I was too lazy to fix the last bug, so I never released it. I'm a n00b at GUI design.
[Screenshots: dupedit with Qt graphical user interface (initial attempt); dupedit with Qt graphical user interface (with tabs)]

future work

editor


The idea is to open dupedit's output in your editor of choice, so you can delete lines corresponding to files you want deleted (maybe also giving some fancy commands if you want to sym/hardlink files instead). When you save, dupedit finds and acts on your changes. For the impatient, modifying something like vidir to operate on output from dupedit should be straightforward. The format of dupedit's output was created with this in mind.

theory


no hashing


This is the strength of dupedit. The algorithm may be described as «compare as you go». Each file is read only once, no matter how many or how big your duplicates are.

  1. Dupedit is never wrong about which files are identical.
  2. Dupedit is very fast at elimination. Instead of reading whole files for hashing, it compares as it goes and skips the rest of any file that no longer looks like an exact copy. Of course, like other deduplication programs, it only bothers comparing files of equal size. A sketch of the idea follows below.
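
To make the idea concrete, here is a minimal sketch in C of «compare as you go» (emphatically not dupedit's actual source): all candidate files are read in lockstep, one block at a time, and the candidate set is split whenever blocks differ. A file left alone in its group is dropped on the spot, so no file is read twice, and most non-duplicates are abandoned after their first block. Elimination by filesize and false copy detection are left out for brevity.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define BLOCK 65536

int main(int argc, char **argv)
{
	int n = argc - 1;
	FILE **f = malloc(n * sizeof *f);
	char (*blk)[BLOCK] = malloc(n * sizeof *blk);
	size_t *len = malloc(n * sizeof *len);
	int *grp = malloc(n * sizeof *grp);      /* current group; -1 = eliminated */
	int *newgrp = malloc(n * sizeof *newgrp);

	for (int i = 0; i < n; i++) {
		f[i] = fopen(argv[i + 1], "rb");
		if (!f[i]) {
			perror(argv[i + 1]);
			return 1;
		}
		grp[i] = 0;    /* every file starts in the same candidate group */
	}

	int more = (n >= 2);
	while (more) {
		more = 0;
		/* Read the next block of every remaining candidate. */
		for (int i = 0; i < n; i++) {
			if (grp[i] < 0)
				continue;
			len[i] = fread(blk[i], 1, BLOCK, f[i]);
			if (len[i] == BLOCK)
				more = 1;
		}
		/* Split groups: a file joins the first earlier file of its old
		 * group whose block matches; otherwise it gets a fresh group. */
		int next = 0;
		for (int i = 0; i < n; i++) {
			if (grp[i] < 0)
				continue;
			newgrp[i] = -1;
			for (int j = 0; j < i; j++) {
				if (grp[j] == grp[i] && newgrp[j] >= 0 &&
				    len[j] == len[i] &&
				    memcmp(blk[j], blk[i], len[i]) == 0) {
					newgrp[i] = newgrp[j];
					break;
				}
			}
			if (newgrp[i] < 0)
				newgrp[i] = next++;
		}
		/* Drop files that are now alone in their group. */
		for (int i = 0; i < n; i++) {
			if (grp[i] < 0)
				continue;
			int alone = 1;
			for (int j = 0; j < n; j++)
				if (j != i && grp[j] >= 0 && newgrp[j] == newgrp[i])
					alone = 0;
			if (alone) {
				fclose(f[i]);
				grp[i] = -1;
			} else {
				grp[i] = newgrp[i];
			}
		}
	}
	/* Whatever survives: files sharing a group number are identical. */
	for (int i = 0; i < n; i++)
		if (grp[i] >= 0)
			printf("%d\t%s\n", grp[i], argv[i + 1]);
	return 0;
}

In the worst case (many large, truly identical files) every byte must still be read once, which is no worse than hashing; the win is that differing files are abandoned as early as possible.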

false copies — the «file self-comparison hazard»


There might be many reasons why multiple paths lead to the same underlying file: hard links, symbolic links, bind mounts, or simply equivalent paths given as separate arguments (all of which appear in the usage example above).

Comparing a file against itself is a serious hazard: it can only result in falsely accusing the file of being a copy of itself, which is easy enough for users to believe when the file is presented under its multiple filenames (not to mention the consequences for automatic deduplication). A file on a UNIX filesystem has only one inode number, and a filesystem has only one device number (even across multiple simultaneous mounts of the filesystem*). Files that share inode and device numbers are considered «false copies» by dupedit and treated as one file.
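
The check itself is simple. Here is a minimal sketch in C (again, not dupedit's source) that detects false copies among the paths given on the command line by comparing device and inode numbers:

#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>

int main(int argc, char **argv)
{
	struct stat *st = malloc(argc * sizeof *st);

	for (int i = 1; i < argc; i++) {
		if (stat(argv[i], &st[i]) != 0) {
			perror(argv[i]);
			return 1;
		}
	}
	/* Two paths with equal device and inode numbers refer to the
	 * same underlying file, however different the paths look. */
	for (int i = 1; i < argc; i++)
		for (int j = 1; j < i; j++)
			if (st[i].st_dev == st[j].st_dev &&
			    st[i].st_ino == st[j].st_ino)
				printf("%s and %s are the same file\n",
				       argv[j], argv[i]);
	return 0;
}

Note that stat() follows symbolic links, so a symlink and its target compare equal here, which matches how symlink1 and dup1 ended up in the same second level group in the usage example above.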

*beware of network filesystems


Network filesystems do not inherit the device ID of the source filesystem. More annoyingly, inode numbers (at least on sshfs) do not reflect hardlink relationships. Therefore, you cannot trust dupedit's ability to distinguish false from physical copies if some of them are on a network filesystem, which means you must consciously do it yourself.**

**not ideal for network mounts anyway


Running dupedit, or any non-distributed deduplication algorithm, on a network mount is a luxurious way to spend network bandwidth compared to running it locally where the files are. To compare files between computers, it would be better to make each computer hash its local files and send only the checksums over the net. The nearest thing to such a distributed deduplication program that I know of would be rsync, which is not a deduplication program.

Not ideal: Jamming your network & exposing yourself to the «file self-comparison hazard».

[me@pc:~] mkdir mnt && sshfs user@example.com:/home/user mnt
[me@pc:~] dupedit mnt

Not any better: Other deduplication programs may be based on checksumming (e.g. fdupes), but that does not solve the underlying problem that network filesystems are made to send file contents over the net, not checksums. It is also the network filesystem that possibly inhibits «samefile detection», thereby causing a «file self-comparison hazard».

[me@pc:~] mkdir mnt && sshfs user@example.com:/home/user mnt
[me@pc:~] fdupes mnt

Correct: Run dupedit on the machine where the files are.

[me@pc:~] ssh user@example.com
[user@server:/home/user] dupedit .

Algorithmically speaking: Only when checksums match, and you are hash-collision paranoid, might you have an excuse for running a deduplication program on a network mount.