About Python: merging docx files with aspose.words + docx and removing the Aspose marks

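Background

For work I needed to merge several Word documents and convert the result to HTML for display on the client side, preserving the original style as far as possible. This article mainly addresses the following problems: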

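Merging multiple Word documents [mainly by appending one document to another]
Converting the merged document to an HTML file, keeping the English and Japanese fonts as they appear in Word, and converting the images in the merged document to base64
Aspose is a commercial product; to use it free of charge, the Aspose marks are stripped from the converted output rather than by cracking the library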

Main tools to install

aspose.words.python@6.22
python-docx
docxcompose
bs4
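
Before running the code below, it can help to confirm that the tools are importable. The following is a minimal sketch, not part of the original script. The mapping from import names to pip package names (`aspose-words`, `python-docx`, `beautifulsoup4`, `lxml`) is an assumption based on the usual PyPI names, and `lxml` is included only because the HTML clean-up step passes `features="lxml"` to BeautifulSoup. The helper name `missing_packages` is hypothetical.

```python
from importlib.util import find_spec

# Import name -> pip package name (assumed PyPI names, adjust if yours differ).
REQUIRED = {
    "aspose": "aspose-words",
    "docx": "python-docx",
    "docxcompose": "docxcompose",
    "bs4": "beautifulsoup4",
    "lxml": "lxml",
}


def missing_packages(required: dict = REQUIRED) -> list:
    """Return the pip names of packages whose top-level module cannot be found."""
    return [pip_name for module, pip_name in required.items()
            if find_spec(module) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("pip install " + " ".join(missing))
    else:
        print("all required packages are importable")
```

The check only looks at top-level module names, so it does not verify versions.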

Main code

Package imports

#! /usr/bin/env python3
# -*- coding: utf-8 -*-
# DESC: 1. Merge multiple docx files with python-docx and docxcompose
#       2. Convert docx to html with aspose
#       3. Add, delete, and modify html elements and content with bs4

import os
import re
import aspose.words as aw
import aspose.words.saving as saving
from bs4 import BeautifulSoup
from docx import Document
from docxcompose.composer import Composer

Merging Word documents

def merge_docx(docx_list: list, docx_merge_tar: str, docx_list_src: str) -> str:
    """
    Merge Word documents.
    For now the documents are simply appended one after another;
    no page breaks or other layout changes are applied.
    """
    if not docx_list:
        raise ValueError("input is empty.")
    # A single document needs no merging; return its path directly
    if len(docx_list) == 1:
        return os.path.join(docx_list_src, docx_list[0])
    # Use the first document as the base document
    base_docx = Document(os.path.join(docx_list_src, docx_list[0]))
    base_docx_composer = Composer(base_docx)
    # Append every remaining document to the base with composer.append
    for next_docx in docx_list[1:]:
        next_docx_path = os.path.join(docx_list_src, next_docx)
        base_docx_composer.append(Document(next_docx_path))
    base_docx_composer.save(docx_merge_tar)
    print("merge docx list ok.")
    return docx_merge_tar
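
Because merge_docx appends documents in list order, the order of docx_list decides the order of the merged content. The sketch below shows one way to build that list from a directory using only the standard library. The helpers `natural_key` and `list_docx` are hypothetical names, not part of the original script. Sorting "naturally" so that `2.docx` comes before `10.docx`, and skipping Word's `~$` lock files, are assumptions about what is wanted.

```python
import re
from pathlib import Path


def natural_key(name: str) -> list:
    """Sort key that compares digit runs as numbers: '2.docx' < '10.docx'."""
    parts = re.split(r"(\d+)", name)
    # With a capturing group, odd indices always hold the digit runs.
    return [int(p) if i % 2 else p.lower() for i, p in enumerate(parts)]


def list_docx(src_dir: str) -> list:
    """Return the .docx file names in src_dir in natural order."""
    names = [
        p.name
        for p in Path(src_dir).iterdir()
        if p.is_file()
        and p.suffix.lower() == ".docx"
        and not p.name.startswith("~$")  # skip Word lock files
    ]
    return sorted(names, key=natural_key)
```

The result can be passed as docx_list, with the same directory as docx_list_src.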

Converting Word to HTML

def aspose_convert_docx_html(docx_file_path: str, html_file_path: str) -> str:
    """Convert a Word document to HTML with aspose.words for Python."""
    docx = aw.Document(docx_file_path)
    # Configure the save options for the conversion
    save_options = saving.HtmlSaveOptions(aw.SaveFormat.HTML)
    # Embed images in the HTML as base64 instead of writing separate files
    save_options.export_images_as_base64 = True
    docx.save(html_file_path, save_options)
    return html_file_path
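
With export_images_as_base64 enabled, images are embedded in the HTML itself rather than saved as separate files. As a rough illustration of the general idea (not of Aspose's internals), the sketch below builds a base64 data URI with the standard library. The helper name `to_data_uri` and the default MIME type are assumptions made for this example.

```python
import base64


def to_data_uri(image_bytes: bytes, mime_type: str = "image/png") -> str:
    """Encode raw image bytes as a data URI usable in an <img src="..."> attribute."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


if __name__ == "__main__":
    # Placeholder bytes only; a real image file would be read with open(path, "rb").
    print(to_data_uri(b"abc"))
```

Embedding images this way keeps the HTML self-contained, at the cost of a larger file.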

Removing the Aspose marks

def del_aspose_elemet(html_tar_file: str, to_tar_file: str):
    """Remove the Aspose marks from the converted HTML."""
    with open(html_tar_file, "r", encoding="utf-8") as html_content:
        soup = BeautifulSoup(html_content, features="lxml")
    # Remove elements whose style carries aspose's -aw-headerfooter-type marker
    for tag in soup.find_all(style=re.compile("-aw-headerfooter-type:")):
        tag.extract()
    # Remove the evaluation notice paragraph; find() returns None when it is absent
    word_key_tag = soup.find("p", string=re.compile("Evaluation Only"))
    if word_key_tag is not None:
        word_key_tag.extract()

    with open(to_tar_file, "w", encoding="utf-8") as f:
        f.write(soup.prettify())
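
The function above removes only the first paragraph that matches "Evaluation Only". If the output contains more than one such paragraph, a variant that walks every paragraph may be needed. The following is a sketch on a hand-written HTML snippet. The sample markup is invented for illustration and is not real Aspose output, `strip_marks` is a hypothetical helper, and the built-in `html.parser` is used here so the example does not depend on lxml.

```python
import re

from bs4 import BeautifulSoup

# Invented sample; real converter output will look different.
SAMPLE_HTML = """<html><body>
<div style="-aw-headerfooter-type:header-primary"><p>header</p></div>
<p><span>Evaluation Only. Sample notice.</span></p>
<p>Real content</p>
<p><span>Evaluation Only. Trailing notice.</span></p>
</body></html>"""


def strip_marks(html: str) -> str:
    """Drop header/footer blocks and every paragraph containing the notice text."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup.find_all(style=re.compile("-aw-headerfooter-type:")):
        tag.decompose()
    for paragraph in soup.find_all("p"):
        if "Evaluation Only" in paragraph.get_text():
            paragraph.decompose()
    return str(soup)


if __name__ == "__main__":
    print(strip_marks(SAMPLE_HTML))
```

Matching on get_text() rather than on a single string also catches paragraphs whose text is split across several child elements.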

Test

if __name__ == '__main__':
    docx_file_path = r"D:\merge_tar\demo.docx"
    html_file_path = r"D:\merge_tar\demo.html"
    aspose_convert_docx_html(docx_file_path, html_file_path)

    process_file_path = r"D:\merge_tar\demo_d.html"
    del_aspose_elemet(html_file_path, process_file_path)
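
After the test run, a quick text check on the processed file can confirm that the two markers targeted by the clean-up step are gone. This is a minimal sketch using only the standard library. The helper name `remaining_markers` is hypothetical, and the marker strings are simply the two patterns the script searches for.

```python
# The two patterns the clean-up step looks for.
MARKERS = ("Evaluation Only", "-aw-headerfooter-type:")


def remaining_markers(html: str, markers: tuple = MARKERS) -> list:
    """Return the markers that still occur in the given HTML text."""
    return [marker for marker in markers if marker in html]


if __name__ == "__main__":
    sample = "<p>Real content</p>"
    print(remaining_markers(sample))
```

For a real check, read the processed HTML file with open(path, encoding="utf-8") and pass its contents to the function; an empty list means neither marker was found.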

Test results

demo.docx

Word converted to HTML by aspose

After handling the Aspose marks

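Postscript

Aspose offers many save options for conversion; see the demos in the aspose.words GitHub repository for details
bs4 is very powerful for processing HTML
This article mainly records the practical results of handling documents at work; if it is useful to you, all the better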
