Excel: Importing a large Excel file into MongoDB

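My requirements

Import data from an Excel file (close to 300,000 rows) so it can be queried inside the system.
The Excel file is a full snapshot, not an incremental one. It is maintained outside the system and re-imported periodically whenever it changes.

Excel import

I first tried the ordinary POI way of loading the workbook. With this much data it runs out of memory (OOM) very easily: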

WorkbookFactory.create(file.getInputStream());

My Excel file is only about 30 MB, yet the program used 3 GB of memory and still hit an OOM.

See this question:
https://stackoverflow.com/que...

I used this library, which supports reading Excel files as a stream:
https://github.com/monitorjbl...

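Dependencies

After adding the library I got a Method Not Found exception, probably caused by a dependency conflict, so I excluded the transitive POI dependency manually and declared the version I wanted: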

    <dependency>
        <groupId>com.monitorjbl</groupId>
        <artifactId>xlsx-streamer</artifactId>
        <version>2.1.0</version>
        <exclusions>
            <exclusion>
                <groupId>org.apache.poi</groupId>
                <artifactId>poi</artifactId>
            </exclusion>
        </exclusions>
    </dependency>
    <dependency>
        <groupId>org.apache.poi</groupId>
        <artifactId>poi</artifactId>
        <version>4.1.2</version>
    </dependency>

Reading the Excel file with xlsx-streamer:

    try (
            InputStream is = file.getInputStream();
            Workbook wb = StreamingReader.builder()
                    // number of rows to keep in memory (defaults to 10)
                    .rowCacheSize(100)
                    // buffer size to use when reading InputStream to file (defaults to 1024)
                    .bufferSize(4096)
                    // InputStream or File for XLSX file (required)
                    .open(is);
    ) {
        ……
    }

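Importing into MongoDB

With this much data, mapping every row to a Java object would waste time, so I build the records directly as org.bson.Document.

Going through the spring-data-mongodb API would add overhead as well, so I use the MongoDB driver's own API: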

com.mongodb.client.MongoCollection   void insertMany(List<? extends TDocument> var1)

This does the bulk insert. I keep a list, fill it from the rows read from Excel, and call insertMany once every 1,000 rows.
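The batching step can be sketched on its own, independent of MongoDB. The class below is a hypothetical helper (BatchBuffer is not part of any library mentioned here, and the names are assumptions for illustration): it collects items and hands them to a sink every batchSize items, with a final flush for the leftover rows.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical helper: collects items and passes them to a sink in fixed-size batches. */
class BatchBuffer<T> {
    private final int batchSize;
    private final Consumer<List<T>> sink;
    private final List<T> pending = new ArrayList<>();
    private long flushed = 0;

    BatchBuffer(int batchSize, Consumer<List<T>> sink) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        this.batchSize = batchSize;
        this.sink = sink;
    }

    /** Adds one item and flushes automatically once the batch is full. */
    void add(T item) {
        pending.add(item);
        if (pending.size() >= batchSize) {
            flush();
        }
    }

    /** Sends any remaining items to the sink; call once after the last row. */
    void flush() {
        if (pending.isEmpty()) {
            return;
        }
        // Pass a copy so the sink may keep the list after ours is cleared.
        sink.accept(new ArrayList<>(pending));
        flushed += pending.size();
        pending.clear();
    }

    /** Number of items handed to the sink so far. */
    long flushedCount() {
        return flushed;
    }
}
```

In the import described here the sink would presumably be something like docs -> collection.insertMany(docs), with a batch size of 1,000. Counting flushed items separately from the row index keeps the progress log accurate even though the header row is skipped.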

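Because this is a full import, I do not need to worry about duplicates or updates. I simply create a new collection B, and once the import succeeds I drop the old collection A and rename B to A. Simple and low effort.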
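The swap itself is a small pattern: load into a temporary collection, replace the live one only on success, and throw the temporary one away on failure. The sketch below shows that control flow against a hypothetical in-memory stand-in (FakeStore and FullReload are made-up names for illustration, not MongoDB classes), so it runs without a database.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

/** Hypothetical in-memory stand-in for a database holding named collections. */
class FakeStore {
    final Map<String, List<String>> collections = new HashMap<>();

    List<String> create(String name) {
        List<String> rows = new ArrayList<>();
        collections.put(name, rows);
        return rows;
    }

    void drop(String name) {
        collections.remove(name);
    }

    /** Renames a collection, replacing the target if it already exists. */
    void rename(String from, String to) {
        collections.put(to, collections.remove(from));
    }
}

class FullReload {
    /**
     * Loads data into a temporary collection and swaps it in only on success.
     * Returns true if the live collection was replaced.
     */
    static boolean reload(FakeStore store, String liveName, String tempName,
                          Consumer<List<String>> loader) {
        List<String> temp = store.create(tempName);
        try {
            loader.accept(temp);               // may throw half-way through
            store.drop(liveName);              // old data goes away only now
            store.rename(tempName, liveName);  // new data becomes visible
            return true;
        } catch (RuntimeException e) {
            store.drop(tempName);              // discard the partial import
            return false;
        }
    }
}
```

One thing the sketch makes visible: between the drop and the rename there is a short window in which the live collection does not exist. With the MongoDB Java driver, renameCollection also accepts a RenameCollectionOptions with dropTarget(true), which replaces the target as part of the rename; whether that is preferable here depends on how the collection is queried during an import.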

Full code:

    import com.monitorjbl.xlsx.StreamingReader;
    import com.mongodb.MongoNamespace;
    import com.mongodb.client.MongoCollection;
    import org.apache.poi.ss.usermodel.Row;
    import org.apache.poi.ss.usermodel.Sheet;
    import org.apache.poi.ss.usermodel.Workbook;
    import org.bson.Document;

    import java.io.InputStream;
    import java.text.SimpleDateFormat;
    import java.util.ArrayList;
    import java.util.Date;
    import java.util.List;

    // The fragment below lives inside a method. file, mongoTemplate, log, TABLE_NAME,
    // ExcelUtil and StringUtils come from the surrounding project.

    // Generate a collection name that will not collide with an existing one
    String collectionName = "invoiceInfo_" + new SimpleDateFormat("yyyyMMddHHmmss").format(new Date());
    if (mongoTemplate.collectionExists(collectionName)) {
        mongoTemplate.dropCollection(collectionName);
    }
    // Create the temporary collection
    MongoCollection<Document> collection = mongoTemplate.createCollection(collectionName);
    try (
            InputStream is = file.getInputStream();
            Workbook wb = StreamingReader.builder()
                    // number of rows to keep in memory (defaults to 10)
                    .rowCacheSize(100)
                    // buffer size to use when reading InputStream to file (defaults to 1024)
                    .bufferSize(4096)
                    // InputStream or File for XLSX file (required)
                    .open(is);
    ) {
        Sheet sheet = wb.getSheetAt(0);
        long start = System.currentTimeMillis();
        List<Document> documents = new ArrayList<>();
        int i = 0;
        // Counts rows actually written, so the log does not include the header row
        int imported = 0;
        for (Row row : sheet) {
            // Data starts at row 2; the first row is assumed to be the header
            if (i++ == 0 || row == null) {
                continue;
            }
            int col = 0;
            Document document = new Document();
            Object invoiceCode = ExcelUtil.getCellValue(row, col++);
            // An empty first cell marks the end of the data
            if (invoiceCode == null || StringUtils.isEmpty(invoiceCode.toString())) {
                break;
            }
            document.put("invoiceCode", invoiceCode);
            document.put("invoiceNumber", ExcelUtil.getCellValue(row, col++));
            document.put("billingDate", ExcelUtil.getCellValue(row, col++));
            document.put("buyerName", ExcelUtil.getCellValue(row, col++));
            document.put("buyerTaxId", ExcelUtil.getCellValue(row, col++));
            document.put("sellerName", ExcelUtil.getCellValue(row, col++));
            document.put("sellerTaxPayerId", ExcelUtil.getCellValue(row, col++));
            document.put("noTaxAmount", ExcelUtil.getCellValue(row, col++));
            document.put("taxRate", ExcelUtil.getCellValue(row, col++));
            document.put("taxAmount", ExcelUtil.getCellValue(row, col++));
            document.put("amount", ExcelUtil.getCellValue(row, col++));
            document.put("remark", ExcelUtil.getCellValue(row, col++));
            documents.add(document);
            // Flush a full batch of 1,000 rows
            if (documents.size() >= 1000) {
                collection.insertMany(documents);
                imported += documents.size();
                log.info("Imported {} rows", imported);
                documents.clear();
            }
        }
        // Flush the final, partial batch
        if (!documents.isEmpty()) {
            collection.insertMany(documents);
            imported += documents.size();
            log.info("Imported {} rows", imported);
        }
        log.info("time = {}", System.currentTimeMillis() - start);
        // Drop the old collection
        if (mongoTemplate.collectionExists(TABLE_NAME)) {
            mongoTemplate.dropCollection(TABLE_NAME);
        }
        // Rename the new collection to the live name
        collection.renameCollection(new MongoNamespace(mongoTemplate.getDb().getName(), TABLE_NAME));
    } catch (Throwable e) {
        log.error("Failed to import invoice data", e);
        // Discard the partially filled temporary collection
        mongoTemplate.dropCollection(collectionName);
    }

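Import results

The JVM settings I used were -Xms512m -Xmx1024m, which caps the heap at 1 GB.

In testing, importing 300,000 rows took roughly 20 to 25 seconds.

Given more memory and a little more time, a million rows should not be a problem either.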
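As a rough back-of-the-envelope check on the million-row claim, the measured numbers can be extrapolated. The sketch below assumes throughput stays constant as the file grows, which was not measured here; ImportEstimate and its methods are hypothetical names used only for this calculation.

```java
/** Hypothetical helper for extrapolating import time from one measurement. */
class ImportEstimate {
    /** Rows per second observed for a given row count and duration. */
    static double rowsPerSecond(long rows, double seconds) {
        return rows / seconds;
    }

    /** Seconds needed for targetRows at the measured rate (assumes linear scaling). */
    static double estimateSeconds(long targetRows, long measuredRows, double measuredSeconds) {
        return targetRows / rowsPerSecond(measuredRows, measuredSeconds);
    }
}
```

With 300,000 rows in 20 to 25 seconds, the rate works out to 12,000 to 15,000 rows per second, so one million rows would take roughly 67 to 83 seconds if the rate held.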