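This post was last edited by zhuan1688 on 2022-3-27 14:18
In the end I went with 火车采集器 (LocoySpider) after all; it is much more efficient and much easier to use!
Below is code for batch-extracting web page content, written by one of the experts on this forum. The problem is that my a.txt contains too many URLs, and this code only generates jieguo.txt after every URL has been processed.
Could anyone tell me whether jieguo.txt can be written in real time, that is, updated once for each URL extracted, so that if something goes wrong midway I can restart from the point of failure?
If that is not possible, having the code show its progress as it runs would also do. When I add an echo command, I get either "Conditional compilation is turned off" or "Expected ';'".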
Thank you all for your hard work.

//&cls&cscript -nologo -e:jscript "%~f0"<"a.txt"&pause&exit /b

// Decode a binary response body into text using the given charset.
function BintoStr(strBin, strCharset) {
    var stream = new ActiveXObject('ADODB.Stream');
    stream.Type = 1;                 // adTypeBinary
    stream.Mode = 3;                 // adModeReadWrite
    stream.Open();
    stream.Write(strBin);
    stream.Position = 0;
    stream.Type = 2;                 // adTypeText
    stream.Charset = strCharset;
    var text = stream.ReadText();
    stream.Close();
    return text;
}

// Fetch a page and return its text, or undefined if the request fails.
function getHtmlTxt(strURL) {
    var http = new ActiveXObject('Msxml2.XMLHTTP');
    try {
        http.open('GET', strURL, false);   // synchronous request
        http.send();
        // Prefer the charset declared in the Content-Type header.
        var m = (http.GetResponseHeader('Content-Type') || '').match(/charset\s?=\s?([^\s;]+)/i);
        if (m) return BintoStr(http.ResponseBody, m[1]);
        // Otherwise fall back to a <meta ... charset=...> declaration in the page.
        m = http.ResponseText.match(/<\s?meta.+?charset\s?=\s?["']?([^\s"'>]+)/i);
        if (m) return BintoStr(http.ResponseBody, m[1]);
        return http.ResponseText;
    } catch (e) {}   // on any error the caller records 'access forbidden'
}

var fso = new ActiveXObject('Scripting.FileSystemObject');
var url = WScript.StdIn.ReadAll().split(/\s+/);   // one URL per whitespace-separated token
var titlestr = '';
for (var i = 0; i < url.length; i++) {
    if (!url[i]) continue;   // skip empty entries from blank lines
    var txt = getHtmlTxt(url[i]);
    if (txt) {
        var title = /<guid isPermaLink="false">([^<]+)<\/guid>/i.exec(txt);
        titlestr += (title ? title[1] : 'not found') + '\r\n';
    } else {
        titlestr += 'access forbidden\r\n';
    }
}
// All results are written in one go, only after every URL has been fetched.
fso.CreateTextFile('jieguo.txt', true).Write(titlestr);
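
For what it's worth, the per-URL writing asked about above might look roughly like the sketch below. It has not been tested against the real sites. It assumes it replaces everything from the `var fso = ...` line down in the script above and reuses `getHtmlTxt` from that script; the helper names `resultLine` and `processAll` are made up for illustration. Each result is appended to jieguo.txt and echoed to the console immediately after its URL is fetched.

```javascript
// Turn one fetch result into the line that goes into jieguo.txt.
// Same rules as the original loop: no text -> 'access forbidden',
// text without a <guid> element -> 'not found'.
function resultLine(txt) {
    if (!txt) return 'access forbidden';
    var m = /<guid isPermaLink="false">([^<]+)<\/guid>/i.exec(txt);
    return m ? m[1] : 'not found';
}

// Process the URLs one by one. writeLine and log are called right after
// each fetch, so nothing is held back until the end of the run.
// Returns the number of URLs actually processed.
function processAll(urls, fetch, writeLine, log) {
    var done = 0;
    for (var i = 0; i < urls.length; i++) {
        if (!urls[i]) continue;                 // skip empty entries
        var line = resultLine(fetch(urls[i]));
        writeLine(line);
        log((i + 1) + '/' + urls.length + ' ' + urls[i] + ' -> ' + line);
        done++;
    }
    return done;
}

// Windows Script Host wiring (only runs under cscript/wscript).
// Assumption: getHtmlTxt from the script above is defined in the same file.
if (typeof WScript !== 'undefined') {
    var fso = new ActiveXObject('Scripting.FileSystemObject');
    var urls = WScript.StdIn.ReadAll().split(/\s+/);
    processAll(urls, getHtmlTxt, function (line) {
        // 8 = ForAppending, true = create the file if it does not exist.
        // Opening and closing per URL flushes every result to disk.
        var out = fso.OpenTextFile('jieguo.txt', 8, true);
        out.WriteLine(line);
        out.Close();
    }, function (msg) {
        WScript.Echo(msg);                      // live progress in the console
    });
}
```

Because the file is opened in append mode, an old jieguo.txt should be deleted before a fresh run. To resume after a failure, the number of lines already in jieguo.txt gives the number of URLs finished, so that many URLs can be removed from the top of a.txt before rerunning. Progress is printed with `WScript.Echo` because `echo` is a batch command, while everything after the first line is parsed as JScript, where a leading `@` is treated as the start of a conditional compilation statement.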