Python: merging docx files with aspose.words + docx and removing the Aspose marks

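Background

For work I needed to merge several Word documents and convert the result to HTML for display on the client, keeping the original style as far as possible. This post mainly addresses the following problems:

Merging multiple Word documents (mainly by appending one to another)
Converting the merged document to an HTML file, with English and Japanese fonts rendered as they appear in Word and the images in the merged document converted to base64
Because Aspose is a commercial product, removing the Aspose marks from the converted output without cracking the library, so that it can be used free of charge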

Installing the main tools

aspose.words.python@6.22
python-docx
docxcompose
bs4
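
Before running the code in this post, it can help to confirm that the libraries are importable. The following is a minimal sketch using only the standard library. The import names (`aspose`, `docx`, `docxcompose`, `bs4`) are the names under which these tools are normally imported, and `lxml` is added as an assumption because the clean-up function later in this post passes `features="lxml"` to BeautifulSoup. The helper name `missing_modules` is made up for illustration.

```python
from importlib.util import find_spec

# Top-level import names (assumed) for the tools listed above, plus lxml,
# which BeautifulSoup needs when it is called with features="lxml".
REQUIRED_MODULES = ["aspose", "docx", "docxcompose", "bs4", "lxml"]


def missing_modules(names):
    """Return the names from `names` that cannot be found for import."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("missing modules:", ", ".join(missing))
    else:
        print("all required modules are importable")
```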

Main code

Importing the dependencies

#! /usr/bin/env python3
# -*- coding: utf-8 -*-
# DESC: 1. Merge multiple docx files with docx
#       2. Convert docx to HTML with aspose
#       3. Add, remove and modify HTML elements and content with bs4
import os
import re
import pandas as pd  # not used in the snippets shown in this post
import aspose.words as aw
import aspose.words.saving as saving
from bs4 import BeautifulSoup
from docx import Document
from docxcompose.composer import Composer

Merging the Word documents

def merge_docx(docx_list: list, docx_merge_tar: str, docx_list_src: str) -> str:
    """
    Merge Word documents.
    For now the documents are only concatenated; no page breaks or similar are added.
    """
    if len(docx_list) == 0:
        raise Exception("input is empty.")
    if len(docx_list) == 1:
        return os.path.join(docx_list_src, docx_list[0])
    # Use the first document as the base document
    base_docx = Document(os.path.join(docx_list_src, docx_list[0]))
    base_docx_composer = Composer(base_docx)
    # Append each remaining document to the base document with composer.append
    for next_docx in docx_list[1:]:
        next_docx_path = os.path.join(docx_list_src, next_docx)
        base_docx_composer.append(Document(next_docx_path))
    base_docx_composer.save(docx_merge_tar)
    print("merge docx list ok.")
    return docx_merge_tar
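
`merge_docx` expects `docx_list` to hold file names relative to `docx_list_src`, in the order in which they should be appended. One way to build such a list is sketched below with the standard library only. The helper name `list_docx_names`, the alphabetical ordering and the skipping of Word's `~$` lock files are assumptions for illustration, not part of the original code.

```python
import os


def list_docx_names(src_dir: str) -> list:
    """Return the .docx file names found in src_dir, sorted alphabetically."""
    names = []
    for name in os.listdir(src_dir):
        # Skip Word's temporary lock files such as "~$report.docx"
        if name.startswith("~$"):
            continue
        is_file = os.path.isfile(os.path.join(src_dir, name))
        if is_file and name.lower().endswith(".docx"):
            names.append(name)
    return sorted(names)
```

The result could then be passed straight on, for example `merge_docx(list_docx_names(src_dir), target_path, src_dir)`, where `src_dir` and `target_path` are placeholders.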

Converting Word to HTML

def aspose_convert_docx_html(docx_file_path: str, html_file_path: str) -> str:
    """
    Convert a Word document to HTML with aspose.words-python
    """
    docx = aw.Document(docx_file_path)
    # Set the conversion options
    save_options = saving.HtmlSaveOptions(aw.SaveFormat.HTML)
    # Embed images as base64
    save_options.export_images_as_base64 = True
    docx.save(html_file_path, save_options)
    return html_file_path
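
With `export_images_as_base64` enabled, images should appear in the HTML as `data:` URIs rather than as links to separate files. The sketch below is a small standard-library check for that. The names `ImgSrcCollector` and `count_images` are hypothetical, and the check only looks at `<img src>` attributes.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src is not None:
                self.sources.append(src)


def count_images(html_text: str):
    """Return (embedded, external) counts of <img> sources in html_text."""
    parser = ImgSrcCollector()
    parser.feed(html_text)
    parser.close()
    embedded = sum(
        1 for s in parser.sources
        if s.startswith("data:") and ";base64," in s
    )
    return embedded, len(parser.sources) - embedded
```

Running `count_images` on the text of the generated HTML file gives a quick way to see whether any image is still referenced externally.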

Removing the Aspose marks

def del_aspose_elemet(html_tar_file: str, to_tar_file: str):
    """
    Remove the Aspose information
    """
    with open(html_tar_file, "r", encoding="utf-8") as html_content:
        soup = BeautifulSoup(html_content, features="lxml")
    # Remove the Aspose-specific content
    for tag in soup.find_all(style=re.compile("-aw-headerfooter-type:")):
        tag.extract()
    word_key_tag = soup.find("p", string=re.compile("Evaluation Only"))
    # find() returns None when no matching paragraph exists
    if word_key_tag is not None:
        word_key_tag.extract()
    with open(to_tar_file, "w", encoding="utf-8") as f:
        f.write(soup.prettify())

Test

if __name__ == '__main__':
    docx_file_path = r"D:\merge_tar\demo.docx"
    html_file_path = r"D:\merge_tar\demo.html"
    aspose_convert_docx_html(docx_file_path, html_file_path)
    process_file_path = r"D:\merge_tar\demo_d.html"
    del_aspose_elemet(html_file_path, process_file_path)

Test results

demo.docx

Word converted to HTML by Aspose

After handling the Aspose marks

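Afterword

Aspose has many option settings for conversion; for details, see the demos in the aspose.words GitHub repository
bs4 is very powerful for processing HTML
This post mainly records practical results from handling documents at work; if it is useful to you, all the better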