Finding Duplicate Files with PowerShell

Let's explore a script that leverages DataGravity's file fingerprints to identify the top 10 duplicate files on a given department share or virtual machine.

The Workflow

  1. Export fingerprints and file names to File List (CSV format)
  2. Run the FindDuplicateFiles.ps1 powershell script
  3. List the Top 10 duplicate files and space they are consuming

Files and Fingerprints

DataGravity makes it easy to identify files and their unique SHA-1 fingerprints on a share or virtual machine (VMware or Hyper-V).  In this example we are going to gather the file names and fingerprints in the Sales department share.

The Script:

FindDuplicateFiles.ps1 -csvFilePath "c:\temp\sales.csv" -top 10

Script parameters:

-csvFilePath is the path to the CSV file we downloaded in the first step which contains a list of the files and file fingerprints.  This is an export from DataGravity's Search.

-top optional parameter that if specified will show the top number of duplicate files

Listing and Validating Duplicates

Let's run the script to return the top 10 duplicate files, and their file size.

These can of course be validated as the example below returns duplicate files consuming the most space.

The full powershell script is listed below, and available on my Powershell repo on GitHub.