Initial results of my search for a duplicate image scanner today. The criteria:
- Able to handle large image sets (think Video Astronomy and way too many digital cameras)
- Must compare images using something other than CRC
- This means I cannot use a simple file comparison tool, since some images are in different formats or dimensions although they contain the “same” image.
- Able to control similarity thresholds
- Able to auto-select the images to delete
- I am not sure what criteria are best here: dimensions, image size (larger == better?), quality (how?)
- Able to manually choose from possible duplicates
- Must be able to handle 200,000 images (I expect this many someday) on a 2 GHz system with 1 GB RAM.
- Optional criteria
- Database driven so future images can be easily compared against existing ones
- Relatively fast
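To illustrate the database-driven criterion, here is a minimal sketch of my own (not taken from any tool reviewed below) of caching a per-file signature in SQLite so later scans only reprocess new or changed files. The byte hash used here only catches exact duplicates; a real tool would store a perceptual thumbnail signature instead.

```python
# Sketch only: cache file signatures in SQLite so subsequent scans are
# incremental. A real duplicate finder would store a perceptual
# signature, not a byte hash.
import sqlite3, hashlib, os

def signature(path):
    # Placeholder signature: a byte hash only catches exact duplicates.
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def scan(db_path, image_paths):
    db = sqlite3.connect(db_path)
    db.execute("CREATE TABLE IF NOT EXISTS sigs "
               "(path TEXT PRIMARY KEY, mtime REAL, sig TEXT)")
    for path in image_paths:
        mtime = os.path.getmtime(path)
        row = db.execute("SELECT mtime FROM sigs WHERE path=?",
                         (path,)).fetchone()
        if row and row[0] == mtime:
            continue  # unchanged since the last scan; skip the expensive work
        db.execute("INSERT OR REPLACE INTO sigs VALUES (?,?,?)",
                   (path, mtime, signature(path)))
    db.commit()
    return db
```

With something like this, a second pass over an unchanged collection does no hashing at all, which is the "speedy subsequent scans" behavior I saw in the database-backed tools.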
Theory of Operation
Many of these programs work by creating thumbnails of the images to normalize size and type and then comparing the thumbnails. I suspect the thumbnails are compared bit-wise, and the number of pixels that match indicates the similarity. This works well assuming the thumbnails have appropriate contrast control (otherwise gray scale images can fade and look the same) and small differences in an image are not critical. I am OK with losing images that have very small detail differences, especially on an initial pass. The tedium of doing an initial scan of 100,000+ images warrants accepting some loss of images that may be desired.
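My guess at the mechanics above can be sketched in a few lines. This is my own illustration, not any reviewed tool's actual algorithm: shrink both images to a fixed thumbnail size, then count how many pixels land within a tolerance of each other. Images are plain 2D grids of grayscale values here; decoding real JPEG/PNG files is outside the sketch.

```python
# Sketch of thumbnail-based similarity: normalize both images to the
# same small size, then count near-matching pixels.

def thumbnail(pixels, size=8):
    """Shrink a 2D grid of grayscale values (0-255) to size x size
    via nearest-neighbor sampling."""
    h, w = len(pixels), len(pixels[0])
    return [[pixels[r * h // size][c * w // size] for c in range(size)]
            for r in range(size)]

def similarity(a, b, size=8, tolerance=16):
    """Fraction of thumbnail pixels whose values differ by <= tolerance."""
    ta, tb = thumbnail(a, size), thumbnail(b, size)
    matches = sum(1 for r in range(size) for c in range(size)
                  if abs(ta[r][c] - tb[r][c]) <= tolerance)
    return matches / (size * size)

# The same gradient at two different dimensions scores a perfect match:
big   = [[(r + c) % 256 for c in range(64)] for r in range(64)]
small = [[(r * 4 + c * 4) % 256 for c in range(16)] for r in range(16)]
print(similarity(big, small))  # → 1.0
```

This also shows why contrast matters: two faded gray scale images whose pixels all sit near the same value will land inside the tolerance everywhere and score as duplicates, which matches the gray scale failures I describe below.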
My test bed consists of approx 1,500 images, followed by 200,000. The first test provides feedback on general use of the program. The second test stresses the program with my expected large data set.
Handled 1,600 images relatively quickly. A database made subsequent scans speedy. The auto-marking is based on file size and/or dimensions, but only the LARGER of these can be selected for keeping. Sometimes I want to select the smaller sizes for certain collections (directories). This program also had issues with detecting gray scale images, but not as bad as the others in this list. Sometimes every image in the duplicate list was marked for deletion, which seems odd to me.
Finally, the silly sliding screens and sound effects, ripped off from old Star Trek shows, were annoying. This can be disabled in the
DISCARDED : Gray scale support is poor; anomalies in marking all files in a duplicate set for deletion.
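The auto-marking flexibility I keep wishing for could look something like this. The names here (`DupImage`, `mark_group`) are hypothetical, my own sketch rather than any tool's feature: a per-collection rule that keeps either the largest or the smallest image in a duplicate group and marks the rest for deletion.

```python
# Sketch of a configurable keep rule for a group of duplicates.
from collections import namedtuple

DupImage = namedtuple("DupImage", "path width height filesize")

def mark_group(group, keep="largest"):
    """Return (kept, marked_for_deletion), ranking by pixel count
    and breaking ties with file size."""
    key = lambda img: (img.width * img.height, img.filesize)
    ordered = sorted(group, key=key, reverse=(keep == "largest"))
    return ordered[0], ordered[1:]

group = [
    DupImage("a.jpg", 1024, 768, 350_000),
    DupImage("b.png",  640, 480, 900_000),
    DupImage("c.jpg", 1024, 768, 200_000),
]
kept, to_delete = mark_group(group, keep="largest")
print(kept.path)  # → a.jpg (largest dimensions; larger file breaks the tie)
```

Flipping `keep="smallest"` would select `b.png` instead, which is exactly the "keep the smaller sizes for certain directories" option the reviewed tools lack.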
A very powerful image viewer and organizer with a duplicate scanner built in. It uses a database and is quite speedy once the thumbnails are built. The problem is it lacks any type of auto-marking tool. Every duplicate must be hand examined and deleted. This is fine when I am maintaining the image database but VERY hard to do when working with my initial 100,000 or so images. This program has not been updated since 2004; is it even maintained any longer? It was able to handle my 200,000 images, although it took a long time (as expected). It can take a long time to sort by similarity, and without a progress indicator the program can appear to be frozen.
DISCARDED : No automatic deletion criteria. With that feature the program would be almost perfect!
Duplicate Image Finder (DIF)
This is VERY slow to scan but very fast to analyze, a reasonable trade-off. Gray scale images were often (almost always) matched against other gray scale images that were not even visually similar. Images can be auto-marked by dimensions and/or size, but only the larger can be selected. There is no auto quality check and no way to say to discard all but the smallest. After processing, all duplicate thumbnails appear to be loaded into RAM. I fear what effect this will have on large data sets.
DISCARDED : Gray scale support is very poor, even at a 97% similarity setting
Visual Similarity Duplicate Image Finder
It ran out of RAM and basically came to a halt after a couple of hours, using 2.5 GB of swap space. Apparently my artificial test bed of 200,000 images was just too much. A smaller sample indicated the program had most of the options I wanted, including auto-marking based on similarity, file size and/or dimensions. It often detected similarities between b/w and gray scale images that were visually VERY different.
Does NOT use a database, so the process starts from scratch every time.
DISCARDED : Gray scale support is poor, VERY slow as it does NOT use a database, RAM intensive as all processing appears to occur in RAM.
Uses a database, but is very slow to analyze images. The quality checker is based solely on image size. It cannot find multiple duplicates for a single file: every comparison is 1-to-1, which means multiple scans may be required. It found the most duplicates and the fewest false duplicates. Very slow when deleting files.
DISCARDED : It is too slow when given large datasets and is very RAM intensive. All data appears to be held in RAM, then written to the database.
DISCARDED : Can only search for exact duplicates!