
Title: [Text Processing] [Solved] How can a batch script extract the tag values from a web page's text?

Author: 灯塔彭于晏    Time: 2021-6-1 18:10     Title: [Solved] How can a batch script extract the tag values from a web page's text?

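Hello, and thanks in advance. The text is shown below.
Could someone use a regular expression, together with a bat script, to find the XXX part of every "tag":"XXX" in the text below?
For example, for the content below, the generated text file should be: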

オリジナル
イラストレーション
背景
風景3000users入り
女の子
風景
……

Thanks again!
    master/img/2019/06/17/13/29/28/75269339_p0_master1200.jpg","regular":"https://i.pximg.net/img-master/img/2019/06/17/13/29/28/75269339_p0_master1200.jpg","original":"https://i.pximg.net/img-original/img/2019/06/17/13/29/28/75269339_p0.png"},"tags":{"authorId":"30486331","isLocked":false,"tags":[{"tag":"オリジナル","locked":true,"deletable":false,"userId":"30486331","translation":{"en":"原创"},"userName":"ヨムナシ"},{"tag":"背景","locked":true,"deletable":false,"userId":"30486331","translation":{"en":"background"},"userName":"ヨムナシ"},{"tag":"風景","locked":true,"deletable":false,"userId":"30486331","translation":{"en":"风景"},"userName":"ヨムナシ"},{"tag":"イラストレーション","locked":true,"deletable":false,"userId":"30486331","translation":{"en":"illustration"},"userName":"ヨムナシ"},{"tag":"女の子","locked":true,"deletable":false,"userId":"30486331","translation":{"en":"女孩子"},"userName":"ヨムナシ"},{"tag":"ここに行きたい","locked":false,"deletable":false,"translation":{"en":"好想去这里"}},{"tag":"風景3000users入り","locked":false,"deletable":false,"translation":{"en":"scenery 3000+ bookmarks"}},{"tag":"オリジナル3000users入り","locked":false,"deletable":false,"translation":{"en":"原创3000收藏"}}],"writable":false},"alt":"#オリジナル 逆上がりの世界 - ヨムナシ的插画","storableTags":["RTJMXD26Ak","jhuUT0OJva","uusOs0ipBx","J6HRrOvKcm","Lt-oEicbBr","LpjxMAWKke","t5wuY96p0s","YRDwjaiLZn"],"userId":"30486331","userName":"ヨムナシ","userAccount":"yomunashi333","userIllusts":{"85863283":null,"85713805":null,"85713731":null,"77296333":null,"76298917":null,"75269561":null,"75269463":{"id":"75269463","title":"茄子きらいだからあげるねっ","illustType":0,"xRestrict":0,"restrict":0,"sl":2,"url":"https://i.pximg.net/c/250x250_80_a2/img-master/img/2019/06/17/13/46/39/75269463_p0_square1200.jpg","description":"","tags":["オリジナル","背景","イラストレーション","女の子","パフェ","オリジナル100users入

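Author: hlzj88    Time: 2021-6-1 21:14

This is handled with sed, using plain literal substitutions rather than a real regex match, and it only targets the text above. The code is simple and can be adjusted to your own needs.
    rem Work on a copy so the source file 35.txt stays untouched
    copy /y 35.txt 原文.txt
    rem Put each tag value on its own line, prefixed with the marker mmmmmm
    sed -i "s/{\"tag\":/\nmmmmmm/g;s/,\"locked/\n/g" 原文.txt
    rem Keep only the marked lines (overwrite rather than append, so reruns do not pile up)
    findstr /i "mmmmmm" 原文.txt>结果.txt
    rem Strip the marker and the double quotes
    sed -i "s/mmmmmm//g;s/\"//g" 结果.txt
Key statement: s/old/new/g replaces every occurrence of old with new. A backslash makes the character right after it literal, and \n stands for a line break.
The sed used by this code can be downloaded from http://bcn.bathome.net/s/tool/index.html?down&key=sed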
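
For readers who do not have sed at hand, the same marker-and-filter idea can be sketched in Python. This is an illustration rather than the poster's code: the function name extract_tags_by_marker and the shortened sample string are made up for the example, and Python only stands in for the batch and sed commands.

```python
# Sketch of the marker-and-filter idea in Python (illustrative, not the poster's code).
MARKER = "mmmmmm"  # same throwaway marker as in the sed commands


def extract_tags_by_marker(text):
    # Step 1: start a new line with the marker wherever {"tag": begins.
    text = text.replace('{"tag":', "\n" + MARKER)
    # Step 2: break the line again before ,"locked so the value stands alone.
    text = text.replace(',"locked', "\n")
    # Step 3: keep only the marked lines (the role findstr plays in the batch version).
    marked = [line for line in text.split("\n") if MARKER in line]
    # Step 4: remove the marker and the double quotes.
    return [line.replace(MARKER, "").replace('"', "") for line in marked]


# Hypothetical sample: a shortened piece of the text from the question.
sample = ('"tags":[{"tag":"オリジナル","locked":true,"deletable":false},'
          '{"tag":"背景","locked":true,"deletable":false}],"writable":false')

for tag in extract_tags_by_marker(sample):
    print(tag)
```

Like the sed version, this depends on every tag value being followed directly by ,"locked, so it is tied to this particular text.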
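
Author: hlzj88    Time: 2021-6-1 21:24

This relies on sed for the extraction, not on a real regex match, and it only applies to the text above. The statements are simple, and you can change what they do yourself.
    rem Work on a copy of the source file
    copy /y 35.txt 原文.txt
    rem Mark the start of each tag value and cut the line right after it
    sed -i "s/{\"tag\":/\nmmmmmm/g;s/,\"locked/\n/g" 原文.txt
    rem Keep only the marked lines (overwrite rather than append)
    findstr /i "mmmmmm" 原文.txt>结果.txt
    rem Remove the marker and the double quotes
    sed -i "s/mmmmmm//g;s/\"//g" 结果.txt
    pause
About the statements: s/old/new/g performs a replacement. A backslash makes the character right after it be read as a literal character, and \n is a line break.
The sed used here can be downloaded from http://bcn.bathome.net/s/tool/index.html?down&key=sed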
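
Author: newswan    Time: 2021-6-2 00:17

PowerShell (file stands for the path of the input file):
    # Collect every value that directly follows "tag":" and stops before the next double quote
    $m = Select-String -Path file -Pattern '(?<="tag":")[^"]*' -AllMatches
    foreach ($a in $m.Matches)
    {
        $a.Value
    }

grep with -P (\x22 is the double quote written as a hex escape, so the pattern needs no quote escaping and the output carries no quotes):
    grep -P -o "(?<=\x22tag\x22:\x22)[^\x22]*" file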
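
The lookbehind pattern from the PowerShell version can also be tried out in Python. This is a minimal sketch, assuming the page text is already held in a string; the function name extract_tags and the sample are hypothetical. (?<="tag":") requires the match to be preceded by "tag":" without making that prefix part of the result, and [^"]* then takes everything up to the next double quote.

```python
import re

# Lookbehind: match only what comes right after "tag":" and stop at the next quote.
TAG_VALUE = re.compile(r'(?<="tag":")[^"]*')


def extract_tags(text):
    # findall returns the matched values in the order they appear in the text.
    return TAG_VALUE.findall(text)


# Hypothetical sample shaped like the text in the question.
sample = ('"tags":[{"tag":"風景","locked":true},'
          '{"tag":"女の子","locked":true}],"tags":["パフェ"]')

# Prints 風景 and 女の子, one per line.
print("\n".join(extract_tags(sample)))
```

Plain string arrays such as "tags":["パフェ"] are skipped, because their entries are not preceded by "tag":".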





Welcome to 批处理之家 (http://bbs.bathome.net/) Powered by Discuz! 7.2