Finding Duplicate Files using DataGravity FingerPrints

I love it when community feedback brings an idea to life.  I have seen this firsthand many times since joining the DataGravity family, first as an Alpha customer and for the last two years as a Solutions Architect.  The most recent example centers on duplicate files and stems from a conversation at Tech Field Day Extra - VMworld 2014.  Several of the delegates were discussing just how many duplicate files exist within a given file system, and how valuable it would be to identify them in order to save space and improve performance.  In the words of Hans De Leenheer - 'That is 101, finding what is duplicate'.

Imagine for a minute how many duplicate copies of the exact same file live on a department share, virtual machine or home directory:  copies of Office templates, time reporting spreadsheets, company-wide memos, or department PowerPoints.  All the exact same files, saved to different locations by different people on the storage system.  Howard Marks proposed a use case: find just how many copies of the same marketing PowerPoint have been saved.

File Fingerprinting

DataGravity now creates a file fingerprint for every supported file.  A SHA-1 cryptographic hash of the file's contents provides the file's "fingerprint" as a 40-character hexadecimal value.  Because the value is computed from the contents, files with identical contents share the same fingerprint, while for practical purposes any change to the contents produces a different one.  This allows inspection with far more accuracy than looking only at simple file metadata such as file name and size.
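To make the idea concrete, here is a minimal sketch of how a SHA-1 fingerprint can be computed for a local file using Python's standard hashlib module.  This is only an illustration of the hashing concept, not DataGravity's implementation; the function name file_fingerprint and the chunk size are my own assumptions.

```python
import hashlib


def file_fingerprint(path, chunk_size=1024 * 1024):
    """Return the SHA-1 hex digest of a file's contents.

    Hypothetical helper for illustration only; not a DataGravity API.
    """
    sha1 = hashlib.sha1()
    with open(path, "rb") as f:
        # Read in chunks so large files do not have to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            sha1.update(chunk)
    # hexdigest() yields the 40-character hexadecimal form.
    return sha1.hexdigest()
```

Note that the file name never enters the calculation, so two copies with different names produce the same value.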

Because the fingerprint is tied to the contents of a file, it allows you to do the following:

  • Locate a file on any mount point / share / VM based on its unique content.
  • Find all files with identical content, even if the files have different names or reside in different locations.
  • Ensure that a file has not changed over time, by viewing the file fingerprint from different DiscoveryPoints.
  • Ensure that a file containing specific content, as identified by the file SHA-1 value, does not reside on the DataGravity Discovery system.
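The second use in the list, finding all files with identical content, can be sketched outside of DataGravity as well.  The example below walks a directory tree, fingerprints each file with SHA-1, and groups paths that share a fingerprint.  It is a rough approximation of the idea under my own assumptions (the names sha1_of_file and find_duplicates are hypothetical, and the directory is a local path rather than a DataGravity share), not a description of how the product does it.

```python
import hashlib
import os
from collections import defaultdict


def sha1_of_file(path, chunk_size=1024 * 1024):
    """Return the SHA-1 hex digest of a file's contents (hypothetical helper)."""
    sha1 = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            sha1.update(chunk)
    return sha1.hexdigest()


def find_duplicates(root):
    """Map each fingerprint seen more than once under `root` to its file paths."""
    by_fingerprint = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            by_fingerprint[sha1_of_file(path)].append(path)
    # Keep only fingerprints shared by two or more files.
    return {
        fingerprint: sorted(paths)
        for fingerprint, paths in by_fingerprint.items()
        if len(paths) > 1
    }
```

Calling find_duplicates on a directory returns a dictionary keyed by fingerprint, so each entry lists every location where the same content was saved, regardless of file name.  The sketch does no error handling, so an unreadable file will raise an exception.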

Finding Duplicates

DataGravity's search and discovery can now find every file that shares the same fingerprint.  Let's search for all duplicates of the recent marketing presentation using the file's fingerprint.

It is easy to see that there are indeed duplicates of the presentation, saved by multiple people to multiple locations.  In fact, some of these files appear to have been copied by the same user into different directories on their home share, but are the EXACT SAME file.

Using the preview function from the search confirms our duplicates.

 

There is a growing number of examples of how file fingerprinting is useful, many of which I will continue to share here on the blog.  Identifying duplicate files is one of my favorite uses of the feature, mostly because of how useful it is, but also because it demonstrates how DataGravity listens and incorporates feedback to enhance the product.