记一次微信文章转 h5 遇到的问题及解决方案

微信文章转 h5 探索实践

1 目标

微信文章转成 h5

2 问题分析

2.1 转换要对微信文章做哪些处理

图片：微信图片有防盗链，需要上传到 cdn
样式：class 样式和 style 样式，要全部变成 style 样式
删除不需要的元素：小程序文字、小程序图片、卡券、投票
获取微信文章的渲染高度：特殊原因，需要设置 page 高度
获取微信文章的全局变量：包括标题、描述、封面
保存作品：保存 html

3 技术选型

puppeteer（无头浏览器）：获取微信文章的渲染高度，获取微信文章的全局变量
cheerio：在 node 端使用 jquery 的方式操作节点
juice：用来将 class 样式转换成 inline 样式

4 具体实现

4.1 获取微信文章 html 字符串

1
2
3
4
5
6
7
8


const browser = await puppeteer.launch({ headless: true });
const page = await browser.newPage();
// 仿真 iphone5
await page.emulate(iPhone);
// 跳转页面，wait 页面 load
await page.goto(getProtocolUrl(decodeURIComponent(url)), { waitUntil: "load" });
// 获取页面 html
const html = await page.content();

tips: url 做参数时，如果包含&，&之后的会被截掉。所以前端传的时候需要 encodeURIComponent, 后端接收时要 decodeURIComponent

4.2 获取微信文章的全局变量

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15


const {
  msg_title,
  msg_desc,
  msg_cdn_url,
  create_time,
  appuin
} = await page.$eval("html", () => {
  return {
    msg_title: window.msg_title,
    msg_desc: window.msg_desc,
    msg_cdn_url: window.msg_cdn_url,
    create_time: window.ct,
    appuin: window.appuin
  };
});

tips: appuin 可以设置微信公众号历史文章跳转将链接替换为https://mp.weixin.qq.com/mp/profile_ext?action=home&__biz=`${appuin}`&scene=124#wechat_redirect，即可以实现同样的跳转

4.3 将 css 样式转为 inline 样式，并加在 cheerio

1
2


// cheerio load html
const $ = cheerio.load(juice(html));

接下来就可以像使用 jquery 一样操作 dom 节点了

4.4 预处理

 1
 2
 3
 4
 5
 6
 7
 8
 9
10


// 删除阅读全文
$(".rich_media_tool").remove();
// 删除投票
$(".vote_iframe")
  .parents("p")
  .remove();
// 删除上传的视频
$("iframe[data-mpvid].video_iframe").remove();
// 重置 title 的 margin-top
$(".rich_media_title").css("margin-top", "0px");

4.5 遍历节点

遍历节点的目的有两个：

找到所有的 background-image
找到所有 style 带有 rem 的元素

使用广度优先遍历的方式

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14


const nodeList: Cheerio[] = [$("#js_article")];

  // 广度优先遍历节点
  while (nodeList.length > 0) {
    const currentNode = nodeList.shift();
    const childrens = (currentNode as Cheerio).children();
    if (childrens) {
      childrens.each((i, el) => {
        // 这里处理每一个节点
        // ...
        nodeList.push($(el));
      });
    }
  }

ps: 这里感觉有很大的优化空间，目前很耗费内存。

可以做以下处理

替换 rem

1
2
3
4
5
6
7
8


// 处理 rem
const style = node.attr("style");
const dealStyle =
style &&
style.replace(/([0-9.]+?)rem/g, function(match, p1) {
return `${p1 * 16}px`;
});
$(el).attr("style", dealStyle);

获取背景图片的原始地址

 1
 2
 3
 4
 5
 6
 7
 8
 9
10


// 处理背景图片
// 背景图片 node 和 src 存储
const backgroundImgNodes: Cheerio[] = [];
const backgroundImgSrcs: string[] = [];
if (style && /url\("(.*?)"\)/.test(style)) {
  backgroundImgSrcs.push(
    getProtocolUrl(style.match(/url\("(.*?)"\)/)![1])
  );
  backgroundImgNodes.push($(el));
}

4.6 上传图片与回填

将网络图片直接上传到腾讯 COS，这里使用转换流 Transform，request 图片地址 pipe 到 transform 流。

上传

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41


const transformStream = new Transform({
  transform(chunk, encoding, callback) {
    callback(null, chunk);
  },
});

request
.get(url)
.on('response', function(res) {
// 获取文件扩展名
ext = res.headers['content-type']
  ? res.headers['content-type'].split('/')[1]
  : 'png';
// uuid 命名文件
filename = `${uuid.v4()}.${ext}`;
// 上传到腾讯cos
cos.putObject(
  {
    Bucket: `${bucket}-${bucketAppId}`,
    Region: region,
    Key: prefix ? `${prefix}/${filename}` : filename,
    Body: transformStream,
  },
  (err: Record<string, any>, data: Record<string, any>) => {
    if (err) {
        // 错误处理
    } else {
      const cdnUrl = `https://${data.Location}`;
      // 异步设置图片缓存redis，不用同步
      setImgCacheRedis(url, cdnUrl);
      resolve(cdnUrl);
      );
    }
  }
);
})
.pipe(transformStream)
.on('error', () => {
// 错误处理
});
});

回填上传后的图片

1
2
3


backgroundImgNodes.forEach((el, index) => {
$(el).css("background-image", `url("${uploadBackgroundImgSrcs[index]}")`);
});

4.7 在无头浏览器设置处理后的 html，并获取高度

1
2
3
4
5


await page.setContent($.html(), { waitUntil: "load" });
// 获取高度
const height = await page.evaluate(() => {
  return document.documentElement.scrollHeight;
});

4.8 保存 html

部署问题

puppeteer 在 alpine 中启动问题

解决办法： https://github.com/GoogleChrome/puppeteer/blob/master/docs/troubleshooting.md#running-on-alpine

puppeteer 测试环境获取高度比本地小很多

经过排查，发现是 alpine 系统中没有中文字体，导致在无头浏览器中渲染时，中文是乱码的.

解决办法： https://stackoverflow.com/questions/49067625/how-can-i-use-chinese-in-alpine-headless-chrome

eago

[工作]记一次微信文章转h5