批处理之家 - Powered by Discuz! Board

Title: [Text Processing] How to get a web page's title with a batch script

Author: hxx Time: 2020-4-26 22:21 Title: How to get a web page's title with a batch script

From what I have seen, everyone else works from a saved *.htm file.
----------------
Example:

My idea is to see whether this title can be taken straight from the output of curl -vv www.baidu.com:

<title>百度一下，你就知道</title>

Now suppose the "，" were a "-" instead:

<title>百度一下 - 你就知道</title>

I only want to extract the four characters `百度一下`.

Author: maxwell Time: 2020-4-27 11:37

For this you might as well just write a crawler. It would be easier.

Author: Batcher Time: 2020-4-27 18:40

Reply to 1# hxx

curl -vv www.baidu.com 2>nul | grep -Po "<title>.*</title>" | gawk -F "[<>-]" "{print $3}" > "结果.txt"

Tool links:
http://bcn.bathome.net/s/tool/index.html?key=curl
http://bcn.bathome.net/s/tool/index.html?key=grep
http://bcn.bathome.net/s/tool/index.html?key=gawk
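
The pipeline above has three stages: curl downloads the page, grep -Po prints only the text that matches `<title>.*</title>`, and gawk splits that text wherever one of the characters `<`, `>` or `-` occurs and prints the third field. As a rough offline sketch of the last two stages, the same logic can be tried in Python on the sample title from the first post instead of a live download. The function name `third_field` is hypothetical and only used for illustration.

```python
import re

def third_field(html):
    # Stage 1: keep only the matched text, like grep -o with "<title>.*</title>".
    match = re.search(r"<title>.*</title>", html)
    if match is None:
        return None
    # Stage 2: split on any single character from the class [<>-], like gawk -F.
    fields = re.split(r"[<>-]", match.group(0))
    # gawk fields are 1-based, so $3 corresponds to index 2 here.
    return fields[2]

sample = "<head><title>百度一下 - 你就知道</title></head>"
print(repr(third_field(sample)))
```

The text starts with `<`, so the first field is empty, the second is `title`, and the third is the part of the title before the hyphen, including the space in front of it. With the real title, which uses "，" rather than "-", the third field is the whole title, because "，" is not one of the separators.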

Author: xp3000 Time: 2020-4-28 09:12

Last edited by xp3000 on 2020-4-28 18:07

Method 2:
Three tools are needed; put them in C:\Windows\System32:
http://bcn.bathome.net/tool/haxx,7.59.0/curl.exe
http://bcn.bathome.net/tool/iconv.exe
http://bcn.bathome.net/tool/3.0/grep.exe

@echo off
setlocal enabledelayedexpansion
rem Fetch the page, convert UTF-8 to GBK, and keep the title text that precedes the separator.
for /f "delims=" %%i in ('curl "www.baidu.com" ^| iconv -f utf-8 -t gbk ^| grep -oP "(?<=<title>)[^<-]+(?=(，|-).+<\/title>)"') do (
echo %%i
)
pause
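
The grep pattern in method 2 relies on lookarounds: `(?<=<title>)` requires the match to begin right after `<title>`, `[^<-]+` consumes characters other than `<` and `-`, and the lookahead requires a "，" or "-" followed by the rest of the title to come next. A minimal sketch of the same pattern in Python's `re` module, applied to the two sample titles from the first post, might look like this. The helper name `title_prefix` is hypothetical, and the `\/` escape is written as a plain `/`.

```python
import re

# Same pattern as the grep -oP call above.
PATTERN = re.compile(r"(?<=<title>)[^<-]+(?=(，|-).+</title>)")

def title_prefix(html):
    # Return the title text that precedes the separator, or None if nothing matches.
    match = PATTERN.search(html)
    return match.group(0) if match else None

print(repr(title_prefix("<title>百度一下，你就知道</title>")))
print(repr(title_prefix("<title>百度一下 - 你就知道</title>")))
```

In the hyphen case the match keeps the space before the "-", so the result may still need trimming. A title with neither separator produces no match at all.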

Author: xp3000 Time: 2020-4-28 09:19

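@if (0)==(0) @echo off
rem Batch part. JScript treats these lines as a false conditional-compilation block and skips them.
setlocal enabledelayedexpansion
rem Download the page, convert UTF-8 to GBK, and pipe it into the JScript part of this same file.
(curl "http://book.zongheng.com/showchapter/470520.html" | iconv -f utf-8 -t gbk |cscript -nologo -e:jscript "%~f0")>>提取章节名称.txt
pause & goto :EOF
@end
// JScript part: print every chapter heading found on stdin, one per line.
// "|| []" avoids a runtime error when match() finds nothing and returns null.
WSH.echo((WSH.StdIn.ReadAll().match(/第[零一二三四五六七八九十百千廿卅卌0-9]+[章节卷] [^ <>]+/mg) || []).join('\r\n'));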
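
The JScript part reads everything from standard input and collects every substring that looks like a chapter heading: "第", a run of Chinese or Arabic numerals, one of "章", "节" or "卷", a space, and then characters up to the next space or angle bracket. A small sketch of the same regular expression in Python, run on a made-up HTML fragment, is shown here. The sample text and the name `chapter_titles` are assumptions for illustration and are not taken from the real page.

```python
import re

# Same chapter-heading pattern as in the JScript line above.
CHAPTER = re.compile(r"第[零一二三四五六七八九十百千廿卅卌0-9]+[章节卷] [^ <>]+")

def chapter_titles(html):
    # findall returns every non-overlapping match, much like match() with the g flag.
    return CHAPTER.findall(html)

# Hypothetical input standing in for the downloaded page.
sample = '<a href="#">第一章 初入江湖</a>\n<a href="#">第12章 风起</a>\n<p>no chapter here</p>'
print("\n".join(chapter_titles(sample)))
```

Unlike `match()` in JScript, which returns null when nothing is found, `findall` returns an empty list, so joining the result is safe without an extra check.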

Borrowing this thread for a related question: how do I extract first and then replace?

Author: went Time: 2020-4-28 10:00

On Windows 10:

@echo off
set "url=www.baidu.com"
set "char='，','-'"
powershell -c "if((Invoke-WebRequest -UseBasicParsing -Uri '%url%').Content -match '(?<=<title>).*?(?=</title>)'){$Matches[0].Split(@(%char%))[0]}"
pause&exit

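Author: netdzb Time: 2020-4-28 17:15

Last edited by netdzb on 2020-4-28 17:30

Reply to 3# Batcher

Admin, could you explain the command?
grep's non-greedy match yields one string, and then gawk has four separators?
What does the pipe | in the middle mean?

Could this code be written as three lines? I could not follow it here.
What do the square brackets in the separator mean?
Thanks.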

Welcome to 批处理之家 (http://bbs.bathome.net/)