批处理之家 (Batch Home) - Powered by Discuz! Board

Title: [Network] [Solved] How can a batch script bulk-extract specific text from web pages?

Author: super1129 Time: 2011-12-11 16:39 Title: [Solved] How can a batch script bulk-extract specific text from web pages?

Last edited by super1129 on 2011-12-14 15:35

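Some pages have URLs that follow a numeric pattern. For example, a journal's table-of-contents pages: http://aem.asm.org/content/76/1.toc, then 76/2.toc, 76/3.toc, and so on.
From http://aem.asm.org/content/76/1.toc I want to extract whatever sits between <H4 class="cit-first-element cit-title-group"> and </H4> in the page source. On this page that is the article title, e.g. <H4 class="cit-first-element cit-title-group">Contribution of Microbial Activity to Carbon Chemistry in Clouds </H4>.
I want to save every piece of text found between <H4 class="cit-first-element cit-title-group"> and </H4>, such as Contribution of Microbial Activity to Carbon Chemistry in Clouds. (There may be several matches per page.)
Then do the same for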
http://aem.asm.org/content/76/2.toc
http://aem.asm.org/content/76/3.toc
………………

How can I do this? Thanks.
I can put the URLs in a txt file beforehand. Saving the final result to a single txt file is fine.
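
Since the URLs differ only in the issue number, the list file could also be generated rather than typed by hand. The following is a minimal sketch in Python (not batch); the function name issue_urls and the issue range passed to it are illustrative assumptions, because the thread does not say how many issues exist.

```python
def issue_urls(volume, first_issue, last_issue,
               base="http://aem.asm.org/content"):
    """Build table-of-contents URLs of the form base/volume/issue.toc.

    The issue range is an assumption supplied by the caller; the thread
    only shows issues 1 to 3 of volume 76.
    """
    return [
        f"{base}/{volume}/{issue}.toc"
        for issue in range(first_issue, last_issue + 1)
    ]


def write_url_list(urls, list_path):
    """Write one URL per line, the layout the thread uses for its list file."""
    with open(list_path, "w", encoding="utf-8") as handle:
        for url in urls:
            handle.write(url + "\n")
```

Calling write_url_list(issue_urls(76, 1, 3), "test.txt") would produce the three URLs shown above, one per line.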

Author: sxw Time: 2011-12-11 17:51

type test.txt|geturls

Or extract it with a regular expression.

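Author: super1129 Time: 2011-12-11 18:02

Last edited by super1129 on 2011-12-11 18:06

Or extract it with a regular expression.
sxw posted on 2011-12-11 17:51

In the command type test.txt|geturls,
where does the filter condition go? I need the content between <H4 class="cit-first-element cit-title-group"> and </H4>.
How do I work that into your command? Thanks.

Can wget, curl, and sed all do this? Which is more efficient when there are many pages?
Could you recommend one and write out the code? Thanks.

test.txt contains: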
http://aem.asm.org/content/76/1.toc
http://aem.asm.org/content/76/2.toc
http://aem.asm.org/content/76/3.toc

I need the text between <H4 class="cit-first-element cit-title-group"> and </H4> in each page's source, saved to abc.txt.

Author: CrLf Time: 2011-12-11 18:41

I'll skip the download part; this is the part that extracts the titles:

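@echo off
rem 1.txt is the saved page source; extracted titles are written to abc.txt.
rem Each line gets the marker plus ".=." appended, so the "!str:*marker=!"
rem substitution always finds the marker, the inner for /f gets a second
rem token, and endlocal still runs on lines that contain no title.
(for /f "delims=" %%a in (1.txt) do (
   set "str=%%acit-first-element cit-title-group.=."
   setlocal enabledelayedexpansion
   for /f "tokens=2,3delims==<>" %%c in (
      "!str:*cit-first-element cit-title-group=!"
   ) do endlocal&if /i %%d==/h4 echo;%%~c
))>abc.txt
pause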
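
For comparison, the same extraction can be sketched with a regular expression, which the thread also suggests. This is a hedged example in Python rather than batch; extract_titles is a hypothetical helper, and it assumes the titles always appear as <H4 class="cit-first-element cit-title-group">...</H4> with exactly that class attribute, as in the question. Unlike the line-based batch version, it also picks up several titles on one line and titles that span lines.

```python
import re
from html import unescape

# Assumption: every title is wrapped exactly as shown in the question,
# i.e. <H4 class="cit-first-element cit-title-group"> ... </H4>.
TITLE_RE = re.compile(
    r'<h4\s+class="cit-first-element cit-title-group"\s*>(.*?)</h4\s*>',
    re.IGNORECASE | re.DOTALL,
)
TAG_RE = re.compile(r"<[^>]+>")


def extract_titles(html):
    """Return the text of every matching H4 element, in document order."""
    titles = []
    for raw in TITLE_RE.findall(html):
        # Drop nested tags such as <i>, decode entities, collapse whitespace.
        text = unescape(TAG_RE.sub("", raw))
        titles.append(" ".join(text.split()))
    return titles
```

Given the page source as a string, writing "\n".join(extract_titles(source)) to abc.txt would give one title per line.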

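Author: super1129 Time: 2011-12-11 18:52

I'll skip the download part; this is the part that extracts the titles:
CrLf posted on 2011-12-11 18:41

Should I download with wget -P -i "url.txt"?
Are the downloaded files named automatically? Can I force the names to be 1, 2, 3, ..., or download everything into a single abc.txt? Thanks.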
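
On the wget side, note that -P expects a directory name as its argument, while -i names the URL list and -O writes everything that is downloaded into one file. If numbered file names are wanted, one option is a small script. The sketch below is in Python, not batch; read_urls, fetch_bytes, and download_numbered are hypothetical names, and the .html extension and the "pages" directory in the usage note are arbitrary choices, not something from the thread.

```python
from pathlib import Path
from urllib.request import urlopen


def read_urls(list_path):
    """Return the non-empty lines of the URL list file."""
    with open(list_path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def fetch_bytes(url):
    """Download one URL with the standard library (no retries or resume)."""
    with urlopen(url, timeout=30) as response:
        return response.read()


def download_numbered(urls, out_dir, fetch=fetch_bytes):
    """Save the n-th URL as out_dir/n.html and return the saved paths.

    The fetch parameter exists so the download step can be swapped out,
    for example with a stub when trying the function offline.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved = []
    for index, url in enumerate(urls, start=1):
        target = out / f"{index}.html"
        target.write_bytes(fetch(url))
        saved.append(target)
    return saved
```

A call such as download_numbered(read_urls("url.txt"), "pages") would then leave pages/1.html, pages/2.html, and so on, ready for the title extraction step.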

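Author: sxw Time: 2011-12-11 21:50

Reply to 3# super1129

Save your page content to test.txt; type test.txt then passes it through the pipe to geturls.exe.
Run geturls /? to see the help; it is simple.

For downloading you can use wget, which supports resuming interrupted downloads; curl works too.

For regular expressions you can use Perl.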

Welcome to 批处理之家 (http://bbs.bathome.net/)