Data processing: notes on localizing the image sources of an online dataset


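I recently studied a rather nice e-commerce project. Its author provides complete sample data, including product information and images, but the images are referenced by fixed URLs, and the product detail is HTML whose img tags also contain URLs. In my experience, online CDNs like this tend to go offline, so I decided to extract the product images from the data and host them on my own Tencent Cloud server to keep them accessible.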

Sample data

[{
    "ID": "b93e59e214fc4478ac72652a2c87fe54",
    "GOODS_SERIAL_NUMBER": "2300000059885",
    "SHOP_ID": "402880e860166f3c0160167897d60002",
    "SUB_ID": "402880e86016d1b5016016dcd7c50004",
    "GOOD_TYPE": 1,
    "STATE": 0,
    "IS_DELETE": 1,
    "NAME": "云南红提 800g/ 盒",
    "ORI_PRICE": 18,
    "PRESENT_PRICE": 15,
    "AMOUNT": 10000,
    "DETAIL": "<img src=\"http://images.koow.cc/shopGoodsDetailImg/20171225/20171225112029_9395.jpg\"width=\"100%\"height=\"auto\"alt=\"\"/><img src=\"http://images.koow.cc/shopGoodsDetailImg/20171225/20171225112029_3391.jpg\"width=\"100%\"height=\"auto\"alt=\"\"/><img src=\"http://images.koow.cc/shopGoodsDetailImg/20171225/20171225112029_7603.jpg\"width=\"100%\"height=\"auto\"alt=\"\"/><img src=\"http://images.koow.cc/shopGoodsDetailImg/20171225/20171225112029_4718.jpg\"width=\"100%\"height=\"auto\"alt=\"\"/><img src=\"http://images.koow.cc/shopGoodsDetailImg/20171225/20171225112030_778.jpg\"width=\"100%\"height=\"auto\"alt=\"\"/><img src=\"http://images.koow.cc/shopGoodsDetailImg/20171225/20171225112030_2602.jpg\"width=\"100%\"height=\"auto\"alt=\"\"/><img src=\"http://images.koow.cc/shopGoodsDetailImg/20171225/20171225112030_7913.jpg\"width=\"100%\"height=\"auto\"alt=\"\"/><img src=\"http://images.koow.cc/shopGoodsDetailImg/20171225/20171225112030_202.jpg\"width=\"100%\"height=\"auto\"alt=\"\"/><img src=\"http://images.koow.cc/shopGoodsDetailImg/20171225/20171225112030_4296.jpg\"width=\"100%\"height=\"auto\"alt=\"\"/><img src=\"http://images.koow.cc/shopGoodsDetailImg/20171225/20171225112030_6956.jpg\"width=\"100%\"height=\"auto\"alt=\"\"/><img src=\"http://images.koow.cc/shopGoodsDetailImg/20171225/20171225112030_8200.jpg\"width=\"100%\"height=\"auto\"alt=\"\"/><img src=\"http://images.koow.cc/shopGoodsDetailImg/20171225/20171225112031_3967.jpg\"width=\"100%\"height=\"auto\"alt=\"\"/><img src=\"http://images.koow.cc/shopGoodsDetailImg/20171225/20171225112031_5114.jpg\"width=\"100%\"height=\"auto\"alt=\"\"/>",
    "BRIEF": null,
    "SALES_COUNT": 0,
    "IMAGE1": "http://images.koow.cc/shopGoodsImg/20171225/20171225112020_561.jpg",
    "IMAGE2": null,
    "IMAGE3": null,
    "IMAGE4": null,
    "IMAGE5": null,
    "ORIGIN_PLACE": null,
    "GOOD_SCENT": null,
    "CREATE_TIME": 1514172047397,
    "UPDATE_TIME": 1522037064430,
    "IS_RECOMMEND": 0,
    "PICTURE_COMPERSS_PATH": "http://images.koow.cc/compressedPic/20171225112020_561.jpg"
  },
  {
    "ID": "e0ab2f6e2802443ba117b1146cf85fee",
    "GOODS_SERIAL_NUMBER": "4894375014863",
    "SHOP_ID": "402880e860166f3c0160167897d60002",
    "SUB_ID": "2c9f6c94609a62be0160a02d1dc20021",
    "GOOD_TYPE": 1,
    "STATE": 0,
    "IS_DELETE": 1,
    "NAME": "菓子町园道乳酸菌味夹心饼干(抹茶味)540/ 罐",
    "ORI_PRICE": 29.8,
    "PRESENT_PRICE": 29.8,
    "AMOUNT": 10000,
    "DETAIL": "<img src=\"http://images.koow.cc/shopGoodsDetailImg/20180213/20180213110655_230.jpg\"width=\"100%\"height=\"auto\"alt=\"\"/><img src=\"http://images.koow.cc/shopGoodsDetailImg/20180213/20180213110656_329.jpg\"width=\"100%\"height=\"auto\"alt=\"\"/><img src=\"http://images.koow.cc/shopGoodsDetailImg/20180213/20180213110656_2659.jpg\"width=\"100%\"height=\"auto\"alt=\"\"/><img src=\"http://images.koow.cc/shopGoodsDetailImg/20180213/20180213110656_9521.jpg\"width=\"100%\"height=\"auto\"alt=\"\"/><img src=\"http://images.koow.cc/shopGoodsDetailImg/20180213/20180213110656_8611.jpg\"width=\"100%\"height=\"auto\"alt=\"\"/><img src=\"http://images.koow.cc/shopGoodsDetailImg/20180213/20180213110656_1390.jpg\"width=\"100%\"height=\"auto\"alt=\"\"/><img src=\"http://images.koow.cc/shopGoodsDetailImg/20180213/20180213110656_7291.jpg\"width=\"100%\"height=\"auto\"alt=\"\"/><img src=\"http://images.koow.cc/shopGoodsDetailImg/20180213/20180213110657_3919.jpg\"width=\"100%\"height=\"auto\"alt=\"\"/><img src=\"http://images.koow.cc/shopGoodsDetailImg/20180213/20180213110657_2170.jpg\"width=\"100%\"height=\"auto\"alt=\"\"/><img src=\"http://images.koow.cc/shopGoodsDetailImg/20180213/20180213110657_4402.jpg\"width=\"100%\"height=\"auto\"alt=\"\"/><img src=\"http://images.koow.cc/shopGoodsDetailImg/20180213/20180213110657_1926.jpg\"width=\"100%\"height=\"auto\"alt=\"\"/><img src=\"http://images.koow.cc/shopGoodsDetailImg/20180213/20180213110657_9438.jpg\"width=\"100%\"height=\"auto\"alt=\"\"/><img src=\"http://images.koow.cc/shopGoodsDetailImg/20180213/20180213110657_4361.jpg\"width=\"100%\"height=\"auto\"alt=\"\"/><img src=\"http://images.koow.cc/shopGoodsDetailImg/20180213/20180213110657_2730.jpg\"width=\"100%\"height=\"auto\"alt=\"\"/><img src=\"http://images.koow.cc/shopGoodsDetailImg/20180213/20180213110658_314.jpg\"width=\"100%\"height=\"auto\"alt=\"\"/><img src=\"http://images.koow.cc/shopGoodsDetailImg/20180213/20180213110658_8779.jpg\"width=\"100%\"height=\"auto\"alt=\"\"/><img src=\"http://images.koow.cc/shopGoodsDetailImg/20180213/20180213110658_9878.jpg\"width=\"100%\"height=\"auto\"alt=\"\"/><img src=\"http://images.koow.cc/shopGoodsDetailImg/20180213/20180213110658_3471.jpg\"width=\"100%\"height=\"auto\"alt=\"\"/>",
    "BRIEF": null,
    "SALES_COUNT": 0,
    "IMAGE1": "http://images.koow.cc/shopGoodsImg/20180213/20180213110648_2744.jpg",
    "IMAGE2": null,
    "IMAGE3": null,
    "IMAGE4": null,
    "IMAGE5": null,
    "ORIGIN_PLACE": null,
    "GOOD_SCENT": null,
    "CREATE_TIME": 1518491222336,
    "UPDATE_TIME": 1523174099461,
    "IS_RECOMMEND": 0,
    "PICTURE_COMPERSS_PATH": "http://images.koow.cc/compressedPic/20180213110648_2744.jpg"
  }]

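As you can see, the data is fairly complete, including the ID, serial number, name, price, description, and so on.
To extract the image URLs from the JSON objects, the IMAGE1-IMAGE5 fields are easy to handle: just iterate over them. The image URLs in DETAIL are mixed into HTML and cannot be read directly, but they can be obtained with a regular expression match. The steps are as follows: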

Extracting the image URLs from IMAGE1-IMAGE5

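const fs = require("fs");

fs.readFile("./goods_demo.json", "utf8", (err, data) => {
  if (err) {
    console.log("Failed to read the file");
    return;
  }
  // Parse (deserialize) the JSON text into an array of product objects
  const goodsList = JSON.parse(data);
  goodsList.forEach((value) => {
    for (let i = 1; i <= 5; i++) {
      const url = value[`IMAGE${i}`];
      // Skip empty fields and append each URL to result.txt.
      // The appends are asynchronous, so lines may not land in source order.
      if (url) {
        fs.appendFile("./result.txt", `\r\n${url}`, (err) => {
          if (err) console.log("Failed to write to file");
          else console.log("Wrote to file successfully");
        });
      }
    }
  });
});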

Running the code above with Node.js reads the URLs from the IMAGE fields and writes them to result.txt.
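
The per-record loop can also be isolated as a small pure function, which makes it easy to check without touching the file system. The following is only a minimal sketch: the helper name collectImageFields and the trimmed-down sample record are assumptions made for illustration, not part of the original project.

```javascript
// Hypothetical helper: collect the non-empty IMAGE1..IMAGE5 values of one record.
function collectImageFields(record) {
  const urls = [];
  for (let i = 1; i <= 5; i++) {
    const url = record[`IMAGE${i}`];
    // Skip null, missing, and empty fields.
    if (typeof url === "string" && url.length > 0) {
      urls.push(url);
    }
  }
  return urls;
}

// Trimmed-down record that reuses field names and one URL from the sample data.
const sample = {
  NAME: "demo",
  IMAGE1: "http://images.koow.cc/shopGoodsImg/20171225/20171225112020_561.jpg",
  IMAGE2: null,
  IMAGE3: null,
  IMAGE4: null,
  IMAGE5: null,
};

console.log(collectImageFields(sample));
```

Checking for a non-empty string rather than only comparing against null also covers records where one of the IMAGE fields is missing altogether.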

Extracting the image URLs from the DETAIL field

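Analyzing the URLs shows that each image URL consists of the http scheme prefix (part1), the CDN host (part2), the directories that contain the image (part3), and the image file name (part4):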


"http://(part1)images.koow.cc(part2)/shopGoodsImg(part3)/20171225(part3)/20171225112020_561.jpg(part4)"

Based on this structure, the following regular expression can be used for matching:

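// \w matches any letter, digit, or underscore
// the / characters in the URL must be escaped, and so must the dots in the host name
// {2,5} means the group occurs 2 to 5 times
// the trailing g flag enables global matching
const urlReg = /http:\/\/images\.koow\.cc(\/\w+){2,5}\.jpg/g;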
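
To see the pattern at work on its own, here is a small sketch that applies it to an HTML fragment shaped like the DETAIL field. It uses a variant of the expression with the dots in the host name escaped; the helper name extractDetailUrls is an assumption made for illustration.

```javascript
// Variant of the pattern above with the dots in the host name escaped.
const urlReg = /http:\/\/images\.koow\.cc(\/\w+){2,5}\.jpg/g;

// Hypothetical helper: return every matching image URL in an HTML fragment.
function extractDetailUrls(html) {
  // With the g flag, String.prototype.match returns all matches, or null if there are none.
  return html.match(urlReg) || [];
}

// Fragment built from two URLs that appear in the sample data.
const detail =
  '<img src="http://images.koow.cc/shopGoodsDetailImg/20171225/20171225112029_9395.jpg"width="100%"/>' +
  '<img src="http://images.koow.cc/shopGoodsDetailImg/20171225/20171225112029_3391.jpg"width="100%"/>';

console.log(extractDetailUrls(detail));
console.log(extractDetailUrls("<p>no images here</p>"));
```

Falling back to an empty array when match returns null lets callers iterate over the result without a separate null check.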

After adding the code that handles the DETAIL field in the JSON, the complete script looks like this:

const fs = require("fs");

fs.readFile("./goods_demo.json", "utf8", (err, data) => {
  if (err) {
    console.log("Failed to read the file");
    return;
  }
  const goodsList = JSON.parse(data);
  // Regular expression that matches the image URLs
  const urlReg = /http:\/\/images\.koow\.cc(\/\w+){2,5}\.jpg/g;
  goodsList.forEach((value) => {
    if (value.DETAIL) {
      // match() with the g flag returns every match, or null when there is none
      const arrlist = value.DETAIL.match(urlReg);
      // Iterate over the matched image list and append each URL to the file
      if (arrlist && arrlist.length) {
        arrlist.forEach((item) => {
          fs.appendFile("./result.txt", `\r\n${item}`, (err) => {
            if (err) console.log("Failed to write DETAIL record");
            else console.log("Wrote DETAIL record successfully");
          });
        });
      }
    }

    // Append the non-empty IMAGE1-IMAGE5 fields as before
    for (let i = 1; i <= 5; i++) {
      const url = value[`IMAGE${i}`];
      if (url) {
        fs.appendFile("./result.txt", `\r\n${url}`, (err) => {
          if (err) console.log("Failed to write to file");
          else console.log("Wrote to file successfully");
        });
      }
    }
  });
});

The extracted URLs end up in result.txt, ready for later processing.

http://images.koow.cc/shopGoodsDetailImg/20171225/20171225112029_9395.jpg
http://images.koow.cc/shopGoodsDetailImg/20171225/20171225112029_3391.jpg
http://images.koow.cc/shopGoodsDetailImg/20171225/20171225112029_4718.jpg
http://images.koow.cc/shopGoodsDetailImg/20171225/20171225112029_7603.jpg
……
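
In the listing above, the 4718 image appears before the 7603 image although DETAIL contains them in the opposite order, which is consistent with asynchronous appends completing out of order. If a stable order and duplicate-free output matter, one option is to collect everything in memory first and write the file once. This is only a sketch under stated assumptions: the helper name collectAllUrls, the shortened file names such as a_1.jpg, and the temporary output path are made up for illustration.

```javascript
const fs = require("fs");
const os = require("os");
const path = require("path");

const urlReg = /http:\/\/images\.koow\.cc(\/\w+){2,5}\.jpg/g;

// Hypothetical helper: gather DETAIL and IMAGE1..IMAGE5 URLs in source order, without duplicates.
function collectAllUrls(goodsList) {
  const seen = new Set();
  for (const goods of goodsList) {
    const fromDetail =
      typeof goods.DETAIL === "string" ? goods.DETAIL.match(urlReg) || [] : [];
    fromDetail.forEach((url) => seen.add(url));
    for (let i = 1; i <= 5; i++) {
      const url = goods[`IMAGE${i}`];
      if (typeof url === "string" && url.length > 0) seen.add(url);
    }
  }
  // A Set iterates in insertion order, so the list keeps the order of first appearance.
  return [...seen];
}

// Made-up records: the file names are placeholders, not real images.
const goodsList = [
  {
    DETAIL:
      '<img src="http://images.koow.cc/shopGoodsDetailImg/20171225/a_1.jpg"/>' +
      '<img src="http://images.koow.cc/shopGoodsDetailImg/20171225/a_2.jpg"/>',
    IMAGE1: "http://images.koow.cc/shopGoodsImg/20171225/b_1.jpg",
    IMAGE2: null,
  },
  { DETAIL: null, IMAGE1: "http://images.koow.cc/shopGoodsImg/20171225/b_1.jpg" },
];

const urls = collectAllUrls(goodsList);
// Written to the system temp directory here so the demo does not touch a real result.txt.
const outFile = path.join(os.tmpdir(), "result_demo.txt");
fs.writeFileSync(outFile, urls.join("\n") + "\n");
console.log(`wrote ${urls.length} URLs to ${outFile}`);
```

A single synchronous write trades a little memory for deterministic output, which is usually acceptable for a one-off migration script like this one.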

Batch download

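To run a private CDN server, the storage paths of the files must not change; otherwise they would no longer match the paths stored in the database. How can the directory structure be preserved during a batch download? It is simple: just use the wget command: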

wget -nc -r -i ./result.txt

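-nc, --no-clobber: do not overwrite files that already exist
-r, --recursive: download recursively; in this mode wget recreates the host and directory hierarchy of each URL locally
-i, --input-file: download the URLs listed in the given file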
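
To double-check where the files should end up, the sketch below predicts the local path of a downloaded URL and shows how a URL could later be pointed at another host while keeping its path. It assumes wget's default recursive layout of host name followed by URL path (no -nH or --cut-dirs options); the function names and the host cdn.example.com are placeholders, not values from the original project.

```javascript
const path = require("path");

// Predict where `wget -r` would store a URL, assuming wget's default
// layout of <host>/<url path> below the download directory.
function localPathFor(url, downloadDir = ".") {
  const { hostname, pathname } = new URL(url);
  return path.posix.join(downloadDir, hostname, pathname);
}

// Point a URL at another host while keeping the path unchanged.
function swapHost(url, newHost) {
  const u = new URL(url);
  u.hostname = newHost;
  return u.toString();
}

const original = "http://images.koow.cc/shopGoodsImg/20171225/20171225112020_561.jpg";

console.log(localPathFor(original));
// "cdn.example.com" is a placeholder, not a real server.
console.log(swapHost(original, "cdn.example.com"));
```

Because only the host changes, every path stored in the data stays valid once the downloaded directory tree is served from the new host.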

Summary

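Processing JSON or XML data is an essential skill for programmers. Mastering efficient data-processing methods gets more done with less effort and avoids unnecessary time costs. I wrote this article hoping to help others with the same need, and I also hope that you, at your computer, will share your own data-processing tips!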
