18.5 Duplicate Files


A common challenge is finding duplicate files, such as photos, music, or documents. Running low on disk space is a good prompt for such a clean-up.

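A simple trick to find duplicates is to calculate an MD5 signature, or hash, for each file and then use that signature to find other files with the same contents. In general, files with different contents have different signatures, so matching signatures strongly suggest duplicates. The mapping is not strictly unique, though: two different files can occasionally share a signature, which is why a final check of the contents themselves is worthwhile.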
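As a rough sketch of the signature trick, the following shell function groups files by their MD5 signature. It assumes the GNU coreutils versions of md5sum, sort, and uniq are installed, and the function name list_md5_duplicates is made up for this illustration. It only lists candidate duplicates and deletes nothing.

```shell
# Hypothetical helper: list files under a directory that share an MD5 signature.
# Assumes GNU coreutils (md5sum, and uniq with -w / --all-repeated).
list_md5_duplicates() {
    # md5sum prints "<32 hex characters>  <file name>" for each file.
    # Sorting brings equal signatures together; uniq then compares only the
    # first 32 characters and prints every line whose signature is repeated,
    # with a blank line between groups.
    find "${1:-.}" -type f -exec md5sum {} + |
        sort |
        uniq -w 32 --all-repeated=separate
}
```

Calling list_md5_duplicates ~/Pictures, for instance, would print each group of files with matching signatures under that directory.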

The fdupes package provides the fdupes command, which uses the MD5 signature within a more thorough pipeline to guarantee that the files are duplicates. The pipeline compares file sizes first, then partial MD5 signatures, then full MD5 signatures, and finally the files themselves byte by byte.
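The staged idea can be sketched for a single pair of files as below. This is an illustration of the cheap-checks-first approach only, not the way fdupes itself is implemented. The function name same_file and the 4096-byte size of the partial check are assumptions made for the example, and md5sum from GNU coreutils is assumed to be available.

```shell
# Hypothetical helper: succeed only if two files pass every stage.
# Each stage is more expensive than the last, so most non-duplicates
# are rejected early.
same_file() {
    # Stage 1: the sizes must match.
    [ "$(wc -c < "$1")" -eq "$(wc -c < "$2")" ] || return 1
    # Stage 2: MD5 of the first 4096 bytes only (chunk size is an assumption).
    [ "$(head -c 4096 "$1" | md5sum)" = "$(head -c 4096 "$2" | md5sum)" ] || return 1
    # Stage 3: MD5 of the whole contents.
    [ "$(md5sum < "$1")" = "$(md5sum < "$2")" ] || return 1
    # Stage 4: byte-by-byte comparison settles any remaining doubt.
    cmp -s "$1" "$2"
}
```

It can be used in a test such as: if same_file a.jpg b.jpg; then echo duplicate; fi. The final cmp step is what turns a very likely match into a certain one.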

With the --delete option fdupes begins an interactive session that lists each set of duplicated files in the current directory and provides options for resolving them. Adding the --recurse option extends the search to the directories below. This is a quick and effective way to reduce duplicated files:

fdupes --delete --recurse .

The interactive session looks something like the following example:

Set 40 of 1919:

    [+] ./Camera/PXL_20230307_025903772.MP.jpg
    [-] ./Camera/20230307_155903_00.jpg

Set 41 of 1919:

    [ ] ./Camera/20210812_151633_00.jpg
    [ ] ./Camera/PXL_20210812_051633163.PORTRAIT.jpg

Set 42 of 1919:

    [ ] ./Camera/20200516_141257_00.jpg
    [ ] ./Camera/IMG_20200516_141257.jpg

( Preserve files [1 - 2, all, help] ): 
Ready                                                             Set 41 of 1919

You can choose to preserve the first of the duplicated files by entering 1, the second by entering 2, or all of them by entering all. In the above example 1 was typed followed by Enter for Set 40, marking the first file to keep (+) and the second to delete (-). Once the selections have been made, type prune to carry them out, which deletes the files marked with the -.

The interactive session of fdupes provides quite a comprehensive set of commands to mark duplicates automatically. The functionality can be reviewed through the help command.
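Before deleting anything it can be reassuring to preview what fdupes finds. The sketch below wraps two read-only invocations in a function. The function name preview_duplicates is made up for this illustration, while --recurse, --size, and --summarize are options listed in the fdupes manual page. Nothing is removed.

```shell
# Hypothetical helper: report duplicates under a directory without deleting them.
preview_duplicates() {
    if ! command -v fdupes >/dev/null 2>&1; then
        echo "fdupes is not installed" >&2
        return 127
    fi
    # --size shows the size of the files in each set of duplicates.
    fdupes --recurse --size "${1:-.}"
    # --summarize prints a short total of the duplicates found.
    fdupes --recurse --summarize "${1:-.}"
}
```

Running preview_duplicates . in the directory to be cleaned gives a sense of how many sets the interactive session will present.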

Copyright © 1995-2022 Graham.Williams@togaware.com Creative Commons Attribution-ShareAlike 4.0