Excel: importing massive Excel data into MongoDB

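I need to import data from an Excel file (nearly 300,000 rows) so that it can be queried inside our system.
The Excel file is a full data set, not an increment. It is maintained outside the system, and whenever it changes it is re-imported periodically to refresh the data.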

I first tried the usual POI way of reading Excel, but with this much data it runs out of memory (OOM) very easily:

WorkbookFactory.create(file.getInputStream());

My Excel file is only about 30 MB, yet the program used 3 GB of memory and still hit an OOM.

See this question:
https://stackoverflow.com/que…

I used this library, which supports reading Excel files as a stream:
https://github.com/monitorjbl…

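After adding the library I got a Method Not Found exception, probably caused by a dependency conflict, so I excluded the transitive poi dependency manually and declared the poi version myself: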

<dependency>
    <groupId>com.monitorjbl</groupId>
    <artifactId>xlsx-streamer</artifactId>
    <version>2.1.0</version>
    <exclusions>
        <exclusion>
            <groupId>org.apache.poi</groupId>
            <artifactId>poi</artifactId>
        </exclusion>
    </exclusions>
</dependency>
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi</artifactId>
    <version>4.1.2</version>
</dependency>

try (InputStream is = file.getInputStream();
     Workbook wb = StreamingReader.builder()
             // number of rows to keep in memory (defaults to 10)
             .rowCacheSize(100)
             // buffer size to use when reading InputStream to file (defaults to 1024)
             .bufferSize(4096)
             // InputStream or File for XLSX file (required)
             .open(is)) {……}

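Because the data volume is fairly large, mapping every row to a Java object would cost performance for no benefit, so I build the data directly as org.bson.Document instances.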

Going through the spring-data-mongodb API would also add overhead, so I use the MongoDB driver's own API:

com.mongodb.client.MongoCollection   
void insertMany(List<? extends TDocument> var1)

for bulk inserts. I keep a list, fill it with documents built from the Excel rows, and call insertMany once every 1000 rows.
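
The batching step can be sketched independently of MongoDB. The following is a minimal sketch, not the code used in this post: `BatchBuffer` is a hypothetical helper that collects items and hands each full batch to a callback, and a final `flush()` sends whatever is left over.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical helper (not part of any library): buffers items and flushes them in fixed-size batches. */
final class BatchBuffer<T> {
    private final int batchSize;
    private final Consumer<List<T>> sink;
    private final List<T> buffer;
    private long total;

    BatchBuffer(int batchSize, Consumer<List<T>> sink) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        this.batchSize = batchSize;
        this.sink = sink;
        this.buffer = new ArrayList<>(batchSize);
    }

    /** Adds one item and flushes automatically once the buffer holds batchSize items. */
    void add(T item) {
        buffer.add(item);
        if (buffer.size() >= batchSize) {
            flush();
        }
    }

    /** Sends whatever is buffered to the sink; call once after the last row. */
    void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        // Pass a copy so the sink may keep the list after the buffer is cleared.
        sink.accept(new ArrayList<>(buffer));
        total += buffer.size();
        buffer.clear();
    }

    /** Number of items handed to the sink so far. */
    long total() {
        return total;
    }
}
```

In the import itself, the callback would presumably be the driver call, for example `new BatchBuffer<Document>(1000, collection::insertMany)`, with `add(document)` for each row and one `flush()` after the loop.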

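This is a full import, so I do not have to deal with duplicates or updates. I simply create a new collection B, and once the import has succeeded I drop the old collection A and rename B to A. It is simple and saves a lot of effort.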
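
The control flow of this swap can be sketched on its own. The code below is a minimal sketch under stated assumptions: `CollectionOps` is a hypothetical interface standing in for the few collection operations involved (it is not a MongoDB or Spring type), and `FullReload.reload` is an illustrative name. It shows that the live collection is only touched after loading has succeeded, and that a failed load removes the staging collection.

```java
import java.util.function.Consumer;

/** Hypothetical abstraction over the few collection operations the swap needs. */
interface CollectionOps {
    boolean exists(String name);
    void create(String name);
    void drop(String name);
    void rename(String from, String to);
}

final class FullReload {
    private FullReload() {}

    /**
     * Loads data into a staging collection and swaps it in only if loading succeeds.
     * If the loader throws, the staging collection is dropped and the live one is left untouched.
     */
    static void reload(CollectionOps ops, String liveName, String stagingName, Consumer<String> loader) {
        if (ops.exists(stagingName)) {
            ops.drop(stagingName); // leftover from an earlier failed run
        }
        ops.create(stagingName);
        try {
            loader.accept(stagingName); // e.g. the streaming Excel import
        } catch (RuntimeException e) {
            ops.drop(stagingName);
            throw e;
        }
        // Only now replace the live collection with the freshly loaded one.
        if (ops.exists(liveName)) {
            ops.drop(liveName);
        }
        ops.rename(stagingName, liveName);
    }
}
```

One design point: between the drop and the rename the live collection briefly does not exist. The MongoDB Java driver's `renameCollection` also has an overload taking `RenameCollectionOptions`, whose `dropTarget(true)` setting drops the target as part of the rename command, which may be worth considering if that gap matters.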

        // Generate a collection name that will not collide with an existing one
        String collectionName = "invoiceInfo_" + new SimpleDateFormat("yyyyMMddHHmmss").format(new Date());
        if (mongoTemplate.collectionExists(collectionName)) {
            mongoTemplate.dropCollection(collectionName);
        }
        // Create the new collection
        MongoCollection<Document> collection = mongoTemplate.createCollection(collectionName);

        try (InputStream is = file.getInputStream();
             Workbook wb = StreamingReader.builder()
                     // number of rows to keep in memory (defaults to 10)
                     .rowCacheSize(100)
                     // buffer size to use when reading InputStream to file (defaults to 1024)
                     .bufferSize(4096)
                     // InputStream or File for XLSX file (required)
                     .open(is)) {
            Sheet sheet = wb.getSheetAt(0);
            long start = System.currentTimeMillis();
            List<Document> documents = new ArrayList<>();
            int i = 0;        // rows seen so far, including the header
            int imported = 0; // data rows actually written to MongoDB
            for (Row row : sheet) {
                // Data starts at the second row; the first row is assumed to be the header.
                if (i++ == 0 || row == null) {
                    continue;
                }
                int col = 0;

                Document document = new Document();
                Object invoiceCode = ExcelUtil.getCellValue(row, col++);
                // An empty first column is treated as the end of the data.
                if (invoiceCode == null || StringUtils.isEmpty(invoiceCode.toString())) {
                    break;
                }
                document.put("invoiceCode", invoiceCode);
                document.put("invoiceNumber", ExcelUtil.getCellValue(row, col++));
                document.put("billingDate", ExcelUtil.getCellValue(row, col++));
                document.put("buyerName", ExcelUtil.getCellValue(row, col++));
                document.put("buyerTaxId", ExcelUtil.getCellValue(row, col++));
                document.put("sellerName", ExcelUtil.getCellValue(row, col++));
                document.put("sellerTaxPayerId", ExcelUtil.getCellValue(row, col++));
                document.put("noTaxAmount", ExcelUtil.getCellValue(row, col++));
                document.put("taxRate", ExcelUtil.getCellValue(row, col++));
                document.put("taxAmount", ExcelUtil.getCellValue(row, col++));
                document.put("amount", ExcelUtil.getCellValue(row, col++));
                document.put("remark", ExcelUtil.getCellValue(row, col++));
                documents.add(document);

                // Flush to MongoDB once 1000 documents have accumulated
                if (documents.size() >= 1000) {
                    collection.insertMany(documents);
                    imported += documents.size();
                    log.info("Imported {} rows", imported);
                    documents.clear();
                }
            }
            // Insert the final partial batch
            if (!documents.isEmpty()) {
                collection.insertMany(documents);
                imported += documents.size();
                log.info("Imported {} rows", imported);
            }
            log.info("time = {}", System.currentTimeMillis() - start);

            // Drop the old collection
            if (mongoTemplate.collectionExists(TABLE_NAME)) {
                mongoTemplate.dropCollection(TABLE_NAME);
            }
            // Rename the new collection to the live name
            collection.renameCollection(new MongoNamespace(mongoTemplate.getDb().getName(), TABLE_NAME));

        } catch (Throwable e) {
            log.error("Failed to import invoice data", e);
            mongoTemplate.dropCollection(collectionName);
        }

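The JVM settings I used were -Xms512m -Xmx1024m, which cap the heap at 1 GB.

In my tests, importing 300,000 rows took roughly 20 to 25 seconds.

With more memory and a bit more time, a million rows should not be a problem either.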
