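Preface

Over the past few years I have hoarded nearly 1 TB of anime images through Pixiv crawlers, shared CG packs, and scattered collecting. The collection is large but uneven in quality, and it contains a lot of duplicate or near-duplicate images.

I recently decided to tidy up this stockpile, which means first identifying the duplicate and near-duplicate images and dealing with them.

I tried Duplicate Cleaner Pro, a well-regarded duplicate file finder. Scanning in image mode did turn up some similar images, but a great many slipped through.

So, as usual, if you want it done right you do it yourself. I looked into ways of finding similar images with Python, the results were quite good, and my notes follow.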

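Principle

For duplicate files, the usual approach is to compute a hash of each file with an algorithm such as MD5 or SHA-1 and then compare the hashes. This is fast and has an extremely low error rate.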
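As a point of reference, exact-duplicate detection by file hash can be sketched with the standard library alone. The function name `file_digest` and the chunk size are illustrative choices, not something from the tools mentioned above.

```python
import hashlib

def file_digest(path, algorithm="md5", chunk_size=1 << 20):
    """Return the hex digest of a file, reading it in chunks to bound memory use."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        # iter() with a sentinel keeps calling f.read until it returns b""
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```

Two files are byte-for-byte duplicates (with overwhelming probability) when their digests are equal, so grouping paths by digest is enough to find exact copies.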

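For images, however, differences in file format, metadata, and so on produce different binary content even when the pixels are identical, so the file hashes come out completely different.

It gets worse when searching for similar images produced by changing the resolution, brightness, chroma, contrast, or blur, by scaling, rotating, or cropping, or by small edits to the content. The file hashes of the original and the modified image differ so much that they are useless for this purpose.

Similar-image search therefore has to compare not the binary content of the file but the way color is distributed across the pixels, summarized as an image hash.

Depending on how that distribution is computed, the common image similarity algorithms are aHash, dHash, and pHash. They share the same basic steps: simplify the image, compute the image hash, and compare hashes to get a similarity score.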

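Simplifying the image

A typical 1080p image has a resolution of 1920*1080, which is more than 2 million pixels. Each pixel has red, green, and blue channels (some formats add an alpha channel for transparency), and each channel takes a value from 0 to 255. Computing and comparing pixel by pixel and channel by channel would take a great deal of time.

The image therefore needs to be simplified before the color distribution is computed. The common simplifications are as follows.

Reducing size

The first step is to shrink the image, for example to 32*32 pixels or smaller. This keeps the overall color trend of the image while cutting the amount of computation dramatically. It also brings images of different resolutions to one common size, so the resulting image hashes have the same format and are easy to compare later.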
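To make the shrinking step concrete, here is a minimal pure-Python sketch that downsizes a single-channel grid by averaging blocks. It is a stand-in for a library resize and uses plain area averaging, which is an assumption of this sketch rather than what any particular library does by default; the name `shrink` is hypothetical.

```python
def shrink(gray, out_w, out_h):
    """Downscale a 2D list of brightness values by averaging each source block."""
    in_h, in_w = len(gray), len(gray[0])
    out = []
    for oy in range(out_h):
        # Source rows covered by this output row (at least one row)
        y0 = oy * in_h // out_h
        y1 = max((oy + 1) * in_h // out_h, y0 + 1)
        row = []
        for ox in range(out_w):
            # Source columns covered by this output column (at least one column)
            x0 = ox * in_w // out_w
            x1 = max((ox + 1) * in_w // out_w, x0 + 1)
            block = [gray[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out
```

For example, a 4*4 grid made of four flat 2*2 patches shrinks to a 2*2 grid holding one value per patch.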

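Reducing color

The next step is to collapse the three color channels into a single brightness value, that is, to desaturate the image. The color distribution of the pixels becomes a light-and-dark distribution, which cuts the computation by another two thirds. It also makes the hash less sensitive to color, so images with the same content but slightly different saturation are easier to match.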
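A minimal sketch of the desaturation step, assuming pixels are given as (R, G, B) tuples and using the common 0.299/0.587/0.114 luma weights. The helper name `to_gray` is made up for illustration; note that OpenCV stores channels in B, G, R order, so the tuple order would need swapping for arrays loaded with `cv2.imread`.

```python
def to_gray(pixels):
    """Convert rows of (R, G, B) tuples to rows of brightness values in 0..255."""
    return [
        [round(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
        for row in pixels
    ]
```

Green contributes the most and blue the least, which roughly follows how bright each primary looks to the eye.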

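Reducing bit depth

Bit depth is the range of values a color can take. At the usual 8-bit depth, each channel takes a value from 0 to 255. These values can be mapped onto 0 to 63 or an even smaller range to speed up computation.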
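One simple way to do this mapping, sketched here under the assumption that the values are 8-bit integers, is to drop the low bits with a right shift. The function name `reduce_depth` and the default of 6 bits are illustrative.

```python
def reduce_depth(gray, bits=6):
    """Map 8-bit integer values (0..255) onto 0..2**bits - 1 by dropping low bits."""
    shift = 8 - bits
    return [[v >> shift for v in row] for row in gray]
```

With `bits=6`, every group of four neighboring brightness levels collapses to one value, giving the 0 to 63 range.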

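The aHash algorithm

aHash is the average hash algorithm. It first computes the mean of all pixels and then compares each pixel with the mean: pixels greater than or equal to the mean are marked 1, pixels below the mean are marked 0, and the marks are joined together.

This algorithm usually shrinks the image to 8*8 when simplifying it, which yields a 64-bit sequence of 0s and 1s that can be treated as a digital fingerprint of the image.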
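The rule above fits in a few lines of plain Python. This sketch assumes the image has already been shrunk and desaturated into a small grid of brightness values; both function names are hypothetical, and packing the bits into one integer is just one convenient representation.

```python
def average_hash_bits(gray):
    """Return one bit per pixel: 1 if the pixel is >= the mean brightness, else 0."""
    flat = [v for row in gray for v in row]
    mean = sum(flat) / len(flat)
    return [1 if v >= mean else 0 for v in flat]

def bits_to_int(bits):
    """Pack a sequence of 0/1 bits into a single integer, first bit most significant."""
    value = 0
    for b in bits:
        value = (value << 1) | b
    return value
```

For an 8*8 grid whose left half is dark and right half is bright, every row becomes `00001111`, so the packed fingerprint is `0x0F0F0F0F0F0F0F0F`.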

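To compare two images, XOR their hash sequences bit by bit and count the positions that differ. This count is the Hamming distance: the fewer bits differ, the more similar the two images are.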
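When the hashes are stored as integers, the XOR-and-count comparison is a one-liner. This is a small sketch under that assumption; the `similarity` helper simply rescales the distance to the 0..1 range for a hash of `bits` bits.

```python
def hamming_distance(hash_a, hash_b):
    """Count the bit positions in which two integer hashes differ."""
    return bin(hash_a ^ hash_b).count("1")

def similarity(hash_a, hash_b, bits=64):
    """Fraction of matching bits: 1.0 for identical hashes, 0.0 for fully opposite ones."""
    return 1 - hamming_distance(hash_a, hash_b) / bits
```

For instance, `0b1011` and `0b0010` differ in the first and last positions, so their Hamming distance is 2.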

The advantage of aHash is speed. Its drawbacks are lower precision and sensitivity to changes in the mean.

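The dHash algorithm

dHash is the difference hash algorithm. It compares adjacent elements in each row: if the pixel on the left is brighter than the pixel on the right, the position is marked 1, otherwise 0, and the marks are joined into the hash sequence.

This algorithm usually shrinks the image to 9*8 when simplifying it. Comparing the 9 adjacent elements of each row gives 8 values, and with 8 rows the result is again a 64-bit sequence of 0s and 1s.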
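The neighbor comparison can be sketched in plain Python as follows, assuming the input is an already simplified grid in which each row is one element wider than the number of bits wanted per row. The function name is hypothetical.

```python
def difference_hash_bits(gray):
    """Return one bit per adjacent pair in each row: 1 if left is brighter than right."""
    return [
        1 if left > right else 0
        for row in gray
        for left, right in zip(row, row[1:])
    ]
```

A row of 9 values produces 8 bits, so an 8-row grid with 9 columns produces the 64-bit hash described above.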

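The hash sequences of two images are compared bit by bit in the same way to obtain their similarity.

The advantage of dHash is speed, and it generally judges similarity better than aHash.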

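The pHash algorithm

pHash is the perceptual hash algorithm. It uses the discrete cosine transform (DCT) to keep only the low-frequency content of the image, a form of lossy compression that preserves most of the image's features, and then compares those feature values.

The DCT is a transform closely related to the Fourier transform that converts an image from the pixel domain to the frequency domain. In the DCT matrix, coefficients represent higher and higher frequencies going from the top-left corner to the bottom-right corner, and everywhere except near the top-left corner the coefficients are close to 0. pHash usually shrinks the image to 32*32 when simplifying it, so taking only the top-left 8*8 block of the DCT matrix is enough to capture most of the image's features.

Each value in this 8*8 block is then compared with the block's mean, which again produces a 64-bit sequence of 0s and 1s, and the similarity is obtained by comparing the sequences bit by bit.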
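To show what the DCT step does without any library, here is a slow but self-contained sketch using the orthonormal DCT-II applied to rows and then columns. The function names are illustrative, and the mean is taken over the whole 8*8 block including the top-left (DC) coefficient; some implementations use the median or exclude the DC term, so treat that detail as an assumption of this sketch.

```python
import math

def dct_1d(values):
    """Orthonormal DCT-II of a 1D sequence."""
    n = len(values)
    out = []
    for k in range(n):
        s = sum(v * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                for i, v in enumerate(values))
        scale = math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n)
        out.append(scale * s)
    return out

def dct_2d(grid):
    """2D DCT: transform every row, then every column of the result."""
    rows = [dct_1d(row) for row in grid]
    cols = [dct_1d(col) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]  # transpose back to row-major

def perceptual_hash_bits(gray, keep=8):
    """Bits from the top-left keep*keep DCT block: 1 if coefficient > block mean."""
    freq = dct_2d(gray)
    block = [v for row in freq[:keep] for v in row[:keep]]
    mean = sum(block) / len(block)
    return [1 if v > mean else 0 for v in block]
```

For a completely flat 32*32 image, all of the energy lands in the top-left coefficient, so only the first bit of the hash is set.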

The advantages of pHash are that it is more stable and judges similarity well, but it is somewhat slower.

Implementation

All three image hash algorithms can be implemented with OpenCV. The code is as follows:

# -*- coding: utf-8 -*-
import cv2
import numpy as np

def pHash(img, leng=32, wid=32):
    # cv2.resize takes the target size as (width, height)
    img = cv2.resize(img, (leng, wid))
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    # The DCT moves the image into the frequency domain;
    # the low frequencies sit in the top-left corner
    dct = cv2.dct(np.float32(gray))
    dct_roi = dct[0:8, 0:8]
    average = np.mean(dct_roi)
    # 1 where the coefficient is above the mean, 0 otherwise
    phash_01 = (dct_roi > average) + 0
    # Return a flat list of 0/1, the same format as dHash and aHash
    return phash_01.flatten().tolist()

def dHash(img, leng=9, wid=8):
    img = cv2.resize(img, (leng, wid))
    image = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    # In each row, 1 if a pixel is brighter than the next pixel, otherwise 0
    hash = []
    for i in range(wid):
        # leng columns give leng - 1 adjacent pairs per row
        for j in range(leng - 1):
            if image[i, j] > image[i, j + 1]:
                hash.append(1)
            else:
                hash.append(0)
    return hash

def aHash(img, leng=8, wid=8):
    img = cv2.resize(img, (leng, wid))
    image = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    average = np.mean(image)
    # 1 if a pixel is at or above the mean brightness, otherwise 0
    hash = []
    for i in range(image.shape[0]):
        for j in range(image.shape[1]):
            if image[i, j] >= average:
                hash.append(1)
            else:
                hash.append(0)
    return hash

def Hamming_distance(hash1, hash2):
    # Number of positions at which the two hashes differ
    num = 0
    for index in range(len(hash1)):
        if hash1[index] != hash2[index]:
            num += 1
    return num

if __name__ == '__main__':

    # Placeholder paths: replace with real image files
    image1 = cv2.imread('image1')
    image2 = cv2.imread('image2')
    # cv2.imread returns None instead of raising when a file cannot be read
    if image1 is None or image2 is None:
        raise SystemExit('could not read one of the input images')

    d_dist = Hamming_distance(dHash(image1), dHash(image2))
    p_dist = Hamming_distance(pHash(image1), pHash(image2))
    a_dist = Hamming_distance(aHash(image1), aHash(image2))

    print('a_dist is %d, similarity is %f' % (a_dist, 1 - a_dist / 64.0))
    print('p_dist is %d, similarity is %f' % (p_dist, 1 - p_dist / 64.0))
    print('d_dist is %d, similarity is %f' % (d_dist, 1 - d_dist / 64.0))
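For cleaning up a whole folder rather than comparing two files, the hashes can be computed once per image and then compared pairwise. The following is a minimal sketch of that second step only. It assumes the hashes are already available as equal-length sequences of 0/1 keyed by file name, and both the function name and the threshold of 5 bits are illustrative values that would need tuning on real data.

```python
from itertools import combinations

def find_similar_pairs(hashes, max_distance=5):
    """Return (name_a, name_b, distance) for every pair within max_distance bits.

    hashes: dict mapping a file name to its hash as an equal-length 0/1 sequence.
    """
    pairs = []
    for (name_a, hash_a), (name_b, hash_b) in combinations(hashes.items(), 2):
        # Hamming distance: count positions where the two sequences differ
        distance = sum(a != b for a, b in zip(hash_a, hash_b))
        if distance <= max_distance:
            pairs.append((name_a, name_b, distance))
    return pairs
```

The pairwise loop is quadratic in the number of images, which is fine for a few thousand files but would likely need an index structure or bucketing for a very large collection.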

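Comparing the results

To test the accuracy of the three algorithms, I prepared a set of test images and compared the results.

These image hash algorithms recognize most cases well, but cropping, occlusion, and especially rotation lower the computed similarity. Among the three, dHash is slightly better than aHash. pHash is more tolerant, but it also gives higher similarity scores for completely dissimilar images, so it is more likely to produce false matches.

If higher recognition accuracy is required, more advanced methods such as image feature point matching or deep learning are needed.

For my current needs, though, pHash or dHash is already enough.

If writing the algorithms yourself feels like too much trouble, you can use the third-party Python library imagehash directly. Besides aHash, dHash, pHash, and wHash (a pHash variant that uses the DWT in place of the DCT), it also supports colorHash (HSV) and cropHash (crop-resistant hashing). It is simple to use, so I will not go into detail here.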
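A rough sketch of what using the library looks like, assuming the `ImageHash` and `Pillow` packages are installed. The wrapper function is hypothetical; the imports are placed inside it only so that the sketch can be loaded even where those packages are missing.

```python
def compare_with_imagehash(path_a, path_b):
    """Return the Hamming distance between two images under three hash types."""
    # Third-party packages: Pillow (PIL) and ImageHash
    from PIL import Image
    import imagehash

    with Image.open(path_a) as img_a, Image.open(path_b) as img_b:
        # Subtracting two ImageHash objects yields their Hamming distance
        return {
            "aHash": imagehash.average_hash(img_a) - imagehash.average_hash(img_b),
            "dHash": imagehash.dhash(img_a) - imagehash.dhash(img_b),
            "pHash": imagehash.phash(img_a) - imagehash.phash(img_b),
        }
```

Calling it with two file paths returns a small dictionary of distances, where 0 means the two hashes are identical.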

