
Title: [Text Processing] How can a batch script extract specific content from a single very long string?

Author: cjiabing    Time: 2016-1-14 14:55     Title: How can a batch script extract specific content from a single very long string?

Last edited by pcl_test on 2016-1-14 16:18

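I extracted a very long string from a web page's source (it is a single line; if it is not long enough for testing, append more to it), for example: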
<td align="left" valign="top"><!--begin 1581042-0-9-->  <a href=content/2015-12/17/content_6404737.htm?node=43149 class="f14 blue001" target=_blank><span class="f14 blue001">·</span>标题内容1&nbsp;&nbsp;<span class="f12 black">2015-12-17</span></a> <br>  <a href=content/2015-12/17/content_6404733.htm?node=43149 class="f14 blue001" target=_blank><span class="f14 blue001">·</span>标题内容2&nbsp;&nbsp;<span class="f12 black">2015-12-17</span></a> <br>  <a href=content/2015-12/17/content_6404731.htm?node=43149 class="f14 blue001" target=_blank><span class="f14 blue001">·</span>标题内容3&nbsp;&nbsp;<span class="f12 black">2015-12-17</span></a> <br>  <a href=content/2015-12/17/content_6404726.htm?node=43149 class="f14 blue001" target=_blank><span class="f14 blue001">·</span>标题内容4&nbsp;&nbsp;<span class="f12 black">2015-12-17</span></a> <br>  <a href=content/2015-11/25/content_6371151.htm?node=43149 class="f14 blue001" target=_blank><span class="f14 blue001">·</span>标题内容5&nbsp;&nbsp;<span class="f12 black">2015-11-25</span></a> <br>  <a href=content/2015-10/14/content_6304117.htm?node=43149 class="f14 blue001" target=_blank><span class="f14 blue001">·</span>标题内容6&nbsp;&nbsp;<span class="f12 black">2015-10-14</span></a> <br>  <a href=content/2015-10/14/content_6304094.htm?node=43149 class="f14 blue001" target=_blank><span class="f14 blue001">·</span>标题内容7&nbsp;&nbsp;<span class="f12 black">2015-10-14</span></a> <br>  <a href=content/2015-10/14/content_6304085.htm?node=43149 class="f14 blue001" target=_blank><span class="f14 blue001">·</span>标题内容8&nbsp;&nbsp;<span class="f12 black">2015-10-14</span></a> <br>  <a href=content/2015-10/14/content_6304078.htm?node=43149 class="f14 blue001" target=_blank><span class="f14 blue001">·</span>标题内容9&nbsp;&nbsp;<span class="f12 black">2015-10-14</span></a> <br>  <a href=content/2015-09/11/content_6264794.htm?node=43149 class="f14 blue001" target=_blank><span class="f14 blue001">·</span>标题内容10&nbsp;&nbsp;<span class="f12 black">2015-09-11</span></a> <br> <hr size="1"></hr>   <a href=content/2015-08/17/content_6197492.htm?node=43149 class="f14 blue001" target=_blank><span class="f14 blue001">·</span>标题内容11&nbsp;&nbsp;<span class="f12 black">2015-08-17</span></a> <br> <hr size="1"></hr> <div id="displaypagenum"><p><center> <span>1</span> <a href=node_43149_2.htm>2</a> <a href=node_43149_2.htm>下一页</a> <a href=node_43149_2.htm></a></center></p></div><script language="javascript">function turnpage(page){  document.all("div_currpage").innerHTML = document.all("div_page_roll"+page).innerHTML;}</script><!--end 1581042-0-9--></td>
I want to extract each URL and its title from the string, for example:
content/2015-12/17/content_6404737.htm?node=43149             标题内容1

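Pure batch may be limited by the very long line, so line breaks may need to be inserted first.
Third-party tools such as sed are also acceptable. Thanks!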
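The line-breaking idea mentioned above can be sketched as follows. This is only an illustration in JavaScript, not a batch solution: `splitAtBr` is a hypothetical helper name, and the sample string is a shortened stand-in for the real input. It assumes every entry in the long line ends with a `<br>` tag, as in the string shown in the question.

```javascript
// Hypothetical helper: break one very long HTML line into shorter pieces,
// one per <br>-terminated entry, so line-oriented tools get manageable input.
function splitAtBr(html) {
    var parts = html.split(/<br\s*\/?>/i);   // cut at every <br> or <br/>
    var lines = [];
    for (var i = 0; i < parts.length; i++) {
        var piece = parts[i].replace(/^\s+|\s+$/g, '');   // trim whitespace
        if (piece !== '') {
            lines.push(piece);               // skip empty leftovers
        }
    }
    return lines;
}

// Shortened sample (assumption: real entries carry more attributes).
var sample = '<a href=content/2015-12/17/content_6404737.htm?node=43149>标题内容1</a> <br>' +
             '  <a href=content/2015-12/17/content_6404733.htm?node=43149>标题内容2</a> <br>';
var lines = splitAtBr(sample);
console.log(lines.join('\n'));
```

Each resulting piece holds at most one entry, which keeps every line far below the length at which line-based processing becomes a problem.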
Author: pcl_test    Time: 2016-1-14 16:05

Last edited by pcl_test on 2016-1-14 16:11
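//&cls&cscript -nologo -e:jscript "%~fs0"<"文本.txt"&pause&exit
// Batch/JScript hybrid: cmd runs the line above, which re-runs this file as JScript with 文本.txt on stdin.
// Collect every match of patt in txt as "URL<TAB>title" lines.
function getStr(patt, txt){
    var str, s='';
    while ((str = patt.exec(txt)) != null){
        s += str[1]+'\t'+str[2]+'\r\n';
    }
    return s;
}
// Group 1: href value (quoted or not); group 2: text after the first </span>, without trailing &nbsp;.
var reg = /<a\shref="?([^\s"]+)"?[^>]*><span.+?<\/span>([^<]+?)(&nbsp;)*</g;
var htmltxt = WScript.StdIn.ReadAll();
WSH.echo(getStr(reg, htmltxt));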
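To see what the regular expression above captures without a Windows Script Host environment, here is a minimal sketch that applies the same pattern to a short sample. `extractLinks` and the `sample` string are assumptions made for illustration, and `console.log` stands in for `WSH.echo`. The sample contains one entry copied from the question plus one pagination link, which the pattern should skip because no `<span>` follows it.

```javascript
// Hypothetical wrapper around the same pattern used in the script above.
function extractLinks(html) {
    // Built inside the function so the global regex always starts at index 0.
    var reg = /<a\shref="?([^\s"]+)"?[^>]*><span.+?<\/span>([^<]+?)(&nbsp;)*</g;
    var found = [];
    var m;
    while ((m = reg.exec(html)) !== null) {
        found.push({ url: m[1], title: m[2] });   // group 1 = href, group 2 = title
    }
    return found;
}

// One real entry from the question, followed by a pagination link.
var sample = '<a href=content/2015-12/17/content_6404737.htm?node=43149 class="f14 blue001" target=_blank>' +
             '<span class="f14 blue001">·</span>标题内容1&nbsp;&nbsp;<span class="f12 black">2015-12-17</span></a> <br>' +
             '  <a href=node_43149_2.htm>2</a>';

var results = extractLinks(sample);
for (var i = 0; i < results.length; i++) {
    console.log(results[i].url + '\t' + results[i].title);
}
```

Requiring `<span ...>...</span>` right after the opening `<a>` tag is what filters out the page-navigation links at the end of the string.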





Welcome to 批处理之家 (http://bbs.bathome.net/) Powered by Discuz! 7.2