
Python: Comparing file similarity with fuzzy hashing (fuzzy hash)

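To compare the similarity of two files in Python, you can use difflib.SequenceMatcher, ssdeep, python_mmdt, or tlsh.
When there are many comparisons to run and the files are large, efficiency matters more, and fuzzy hashing (for example ssdeep or python_mmdt) is worth considering.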
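
As a baseline for what a similarity ratio means here, the following minimal sketch applies `difflib.SequenceMatcher` to short byte strings. The helper name `similarity` and the sample data are made up for illustration, and `autojunk=False` is an assumption chosen to switch off difflib's frequent-element heuristic so the result is a plain ratio.

```python
from difflib import SequenceMatcher


def similarity(a: bytes, b: bytes) -> float:
    """Return difflib's similarity ratio in [0.0, 1.0] for two byte strings."""
    # autojunk=False disables the heuristic that ignores very frequent
    # elements in sequences of 200 or more items.
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


if __name__ == "__main__":
    # Hypothetical sample data, loosely shaped like an fstab line.
    original = b"UUID=1234 / ext4 defaults 0 1\n"
    edited = b"UUID=1234 / ext4 noatime 0 1\n"
    print("identical:", similarity(original, original))
    print("edited:   ", similarity(original, edited))
    print("unrelated:", similarity(b"aaaa", b"zzzz"))
```

The Python documentation describes `SequenceMatcher` as quadratic in the worst case, which is consistent with it becoming very slow on large inputs.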

Findings from testing:

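  • difflib: after both files are read, it outputs a match ratio.
  • ssdeep/mmdt/tlsh: the fuzzy hash can be computed ahead of time, so at verification time the file only has to be read once to complete the comparison. This reduces comparison time as well as memory and CPU usage.
  • tlsh: the value is a distance, so the smaller it is, the higher the similarity. It works poorly when comparing small files.
  • On small files the three methods perform about the same; on large files (81 MB in this test), difflib is unacceptably slow.
  • In a real environment, mmdt is recommended, because ssdeep's results deviate considerably when comparing binary files and lose their value as a reference. Which other file types show this problem remains to be investigated.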
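
The "hash ahead of time, read once at verification" workflow described above could be sketched roughly as follows. This is a minimal illustration, not the author's code: the function names (`build_index`, `save_index`, `load_index`, `verify`) and the JSON index file are assumptions, and the hashing and comparison functions are passed in as parameters so the sketch does not depend on any one library.

```python
import json
from pathlib import Path
from typing import Callable, Dict, Iterable


def build_index(paths: Iterable[str], hash_file: Callable[[str], str]) -> Dict[str, str]:
    """Hash every file once and return {path: fuzzy_hash}."""
    return {str(p): hash_file(str(p)) for p in paths}


def save_index(index: Dict[str, str], index_path: str) -> None:
    """Persist the precomputed hashes as JSON."""
    Path(index_path).write_text(json.dumps(index, indent=2), encoding="utf-8")


def load_index(index_path: str) -> Dict[str, str]:
    """Load hashes that were saved earlier."""
    return json.loads(Path(index_path).read_text(encoding="utf-8"))


def verify(index: Dict[str, str], path: str,
           hash_file: Callable[[str], str],
           compare: Callable[[str, str], object]):
    """Hash `path` once now and compare it with the stored hash."""
    current = hash_file(str(path))
    return compare(index[str(path)], current)


# With ssdeep installed, the callables could be, for example:
#   hash_file = ssdeep.hash_from_file
#   compare   = ssdeep.compare
```

Only the current version of a file is read at verification time; the reference version is represented by its stored hash.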

Test environment:
OS: Ubuntu 20.04
Python: 3.8.10
py-tlsh==4.7.2
python-mmdt==0.3.1
ssdeep==3.4

# -*- coding: utf-8 -*-

import ssdeep
import time
from python_mmdt.mmdt.mmdt import MMDT
from difflib import SequenceMatcher

def difflib_test(file1, file2):
    # Read both files fully into memory and compare the raw bytes.
    start_time = time.time()
    with open(file1, 'rb') as f:
        s1 = f.read()
    with open(file2, 'rb') as f:
        s2 = f.read()
    match_obj = SequenceMatcher(None, s1, s2)
    # ratio() returns a similarity in [0, 1]; 1.0 means identical.
    print("difflib match:", match_obj.ratio())
    end_time = time.time()
    print('difflib_test cost:', end_time - start_time)

def mmdt_test(file1, file2):
    start_time = time.time()
    mmdt = MMDT()
    # Hash each file separately so the hashes could be stored and reused.
    r1 = mmdt.mmdt_hash(file1)
    print(r1)
    r2 = mmdt.mmdt_hash_streaming(file2)
    print(r2)
    # Alternative: compare the two files directly by path.
    # sim1 = mmdt.mmdt_compare(file1, file2)
    # print("mmdt match:", sim1)
    sim2 = mmdt.mmdt_compare_hash(r1, r2)
    print("mmdt match:", sim2)
    end_time = time.time()
    print('mmdt_test cost:', end_time - start_time)

def ssdeep_test(file1, file2):
    start_time = time.time()
    sig1 = ssdeep.hash_from_file(file1)
    sig2 = ssdeep.hash_from_file(file2)
    print(sig1)
    print(sig2)
    # compare() returns an integer match score from 0 to 100.
    print("ssdeep match:", ssdeep.compare(sig1, sig2))
    end_time = time.time()
    print('ssdeep_test cost:', end_time - start_time)

if __name__ == '__main__':
    start_time = time.time()
    # Small-file case
    file1 = '/root/test/fstab'
    file2 = '/root/test/fstab2'
    # Large-file case (uncomment to test)
    # file1 = '/root/test/initrd.img-5.4.0-125-generic'
    # file2 = '/root/test/initrd.img-5.4.0-135-generic'
    mmdt_test(file1, file2)
    ssdeep_test(file1, file2)
    difflib_test(file1, file2)
    end_time = time.time()
    print('total elapsed time:', end_time - start_time)

Results of the small-file and large-file comparisons:

Testing tlsh

import tlsh
import time

def tlsh_test(file1, file2):
    start_time = time.time()
    with open(file1, 'rb') as f:
        s1 = tlsh.hash(f.read())
    with open(file2, 'rb') as f:
        s2 = tlsh.hash(f.read())
    # diff() returns a distance: the lower the value, the more similar the files.
    distance = tlsh.diff(s1, s2)
    print("tlsh match:", distance)
    end_time = time.time()
    print('tlsh_test cost:', end_time - start_time)


if __name__ == '__main__':
    start_time = time.time()
    # Small-file case (uncomment to test)
    # file1 = '/root/test/fstab'
    # file2 = '/root/test/fstab2'
    # Large-file case
    file1 = '/root/test/initrd.img-5.4.0-125-generic'
    file2 = '/root/test/initrd.img-5.4.0-135-generic'
    tlsh_test(file1, file2)
    end_time = time.time()
    print('total elapsed time:', end_time - start_time)
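
Each test function above repeats the same start/stop timing lines. One way to factor that out is a small context manager; the sketch below is only a suggestion. The name `timed` and the optional `results` dictionary are hypothetical, and it uses `time.perf_counter()` instead of `time.time()`, which is an assumption about what is preferable for measuring short intervals.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results=None):
    """Print the elapsed wall-clock time of the enclosed block.

    If `results` is a dict, the elapsed seconds are also stored under `label`.
    """
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        if results is not None:
            results[label] = elapsed
        print(f"{label} cost: {elapsed:.6f}")


if __name__ == "__main__":
    timings = {}
    # Stand-in workload; in the tests above this would wrap the hash/compare calls.
    with timed("sleep_test", timings):
        time.sleep(0.01)
```

Collecting the timings in a dictionary makes it easier to print the methods side by side for the small-file and large-file cases.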

Comparison of small file / large file
