批处理之家 - Powered by Discuz! Board

Title: [Text processing] Problem matching HTML with sed (mostly solved, thanks everyone)

Author: netdzb Time: 2021-5-22 16:50 Title: Problem matching HTML with sed (mostly solved, thanks everyone)

Last edited by netdzb on 2021-5-22 20:09

@echo off
echo "</span></a><a href="http://www.360.com:8099/dl.php?MDEzNXRCT09G^" onclick=^"down_process2(^'159979^');"|sed -r "s/(.*)(href=)(\"http:.*\")(.*)/\3/"
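The code above is what I tested in the Windows console.

The result I want is
"http://www.360.com:8099/dl.php?MDEzNXRCT09G"
but the actual output is an empty line.

I then rewrote the command for a Unix shell (below). There the match is greedy, and I don't know how to solve that.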

echo "</span></a ><a href=\"http://www.360.com:8099/dl.php?MDEzNXRCT09G\" onclick=\"down_process2(\'159979\');"|sed -r "s/(.*)(href=)(\"http:\/\/.*\")(.*)/\3/"

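Author: xp3000 Time: 2021-5-22 17:05

I haven't checked exhaustively, so this list may be incomplete. It applies to third-party commands too:
?<>()! ; ,\|&^=" each need a ^ in front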

Author: netdzb Time: 2021-5-22 17:18

Reply to 2# xp3000

I'll go and try that, thank you.

Author: 1152 Time: 2021-5-22 17:24

Reply to 2# xp3000

" does not need a ^

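Author: netdzb Time: 2021-5-22 17:29

Reply to 2# xp3000

The difficulty now is that sed has no lazy (non-greedy) mode. The result I currently get is

"http://www.360.com:8099/dl.php?MDEzNXRCT09G" onclick="
It matches the longest possible quoted string.
I'm thinking of writing this with VBS regular expressions instead. Would that work?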
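
The greedy behaviour described here can be reproduced with a small POSIX shell sketch. It assumes a sed that accepts `-E` (GNU sed treats it the same as the `-r` used in this thread); the variable names `line` and `greedy` are only for illustration. The sample line is read from a quoted here-document so that none of the embedded quotes need escaping.

```shell
#!/bin/sh
# Sample line from the first post, read verbatim from a quoted here-document.
IFS= read -r line <<'EOF'
</span></a ><a href="http://www.360.com:8099/dl.php?MDEzNXRCT09G" onclick="down_process2('159979');
EOF

# Greedy: the .* inside the group runs on to the LAST double quote on the line,
# so the captured text also swallows ` onclick="`.
greedy=$(printf '%s\n' "$line" | sed -E 's/.*href=("http:.*").*/\1/')
printf '%s\n' "$greedy"
# prints: "http://www.360.com:8099/dl.php?MDEzNXRCT09G" onclick="
```

The capture stops after `onclick="` because that is the last double quote in the input; everything after it is consumed by the trailing `.*`.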

Author: netdzb Time: 2021-5-22 17:32

Reply to 4# 1152

sed's matching does not support lazy mode. Is there a workable alternative?

Author: 1152 Time: 2021-5-22 17:34

Reply to 6# netdzb

Lazy mode?

Author: cutebe Time: 2021-5-22 18:24

Exclude " (the double quote) from the characters being matched

sed -r "s#.*(\"http[[:alnum:]:?/.`~!@\#\$%^^^&-_=+(){},]*\").*#\1#" a.html
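
For readers on a Unix shell, the same idea (keep the match from crossing a double quote) might look like the sketch below. It uses a negated bracket expression `[^"]*` rather than an explicit list of allowed characters, and assumes a sed that accepts `-E`; the variable names are illustrative.

```shell
#!/bin/sh
# Sample line from the first post, read verbatim from a quoted here-document.
IFS= read -r line <<'EOF'
</span></a ><a href="http://www.360.com:8099/dl.php?MDEzNXRCT09G" onclick="down_process2('159979');
EOF

# [^"]* cannot cross a double quote, so the group ends at the first closing quote.
# -n plus the p flag prints only lines where the substitution succeeded.
url=$(printf '%s\n' "$line" | sed -n -E 's/.*href=("http:[^"]*").*/\1/p')
printf '%s\n' "$url"
# prints: "http://www.360.com:8099/dl.php?MDEzNXRCT09G"
```

Note that the leading `.*` is still greedy, so on a line with several `href` attributes this keeps only the last one.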

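Author: netdzb Time: 2021-5-22 18:47

Last edited by netdzb on 2021-5-22 19:10

Reply to 8# cutebe

The output is not correct. I have uploaded the sample file fea.txt; could you take a look? Thanks!
Please extract the http links from it.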

https://javame.lanzoui.com/irgt9pcxl3e

Author: xp3000 Time: 2021-5-22 20:02

VBS:

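' Extract the http:// URLs from every .htm/.html/.txt file in the current folder
Set fso = CreateObject("Scripting.FileSystemObject")
For Each f in fso.GetFolder(".").Files
    ext = LCase(fso.GetExtensionName(f))
    If ext = "htm" or ext = "html" or ext = "txt" Then
        txt = fso.OpenTextFile(f).ReadAll
        ' Write the URLs to <original file name>.txt (2 = ForWriting, true = create if missing)
        fso.OpenTextFile(f & ".txt", 2, true).Write GetUrl(txt)
    End If
Next
MsgBox "OK"

Function GetUrl(str)
    Set re = New RegExp
    ' [^""]+ stops before the next double quote, so no lazy quantifier is needed
    re.Pattern = "http://[^""]+(?="")"
    re.Global = True
    re.IgnoreCase = True
    For Each m in re.Execute(str)
        ' Skip URLs that have already been collected
        If InStr(s, m & vbCrLf) = 0 Then s = s & m & vbCrLf
    Next
    GetUrl = s
End Function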

BAT

//&cls&(type *.txt|cscript -nologo -e:jscript "%~0") 2>nul>>"批量提取%~nx0.txt"&pause&exit
WSH.echo(WScript.StdIn.ReadAll().match(/http:\/\/[^\"]+(?=\")/g).join('\r\n'))
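
A rough Unix-shell counterpart of the two scripts above, for comparison: `grep -o` prints every match on its own line, and awk drops repeats the way the VBS `InStr` check does. This is only a sketch. `-o` is not in POSIX but is available in GNU, BSD and BusyBox grep, and the sample file below is made up for illustration, since the contents of fea.txt are not shown in the thread.

```shell
#!/bin/sh
# Hypothetical sample input standing in for fea.txt.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<a href="http://www.360.com:8099/dl.php?MDEzNXRCT09G" onclick="down_process2('159979');">one</a>
<a href="http://example.com/a">a</a> <a href="http://example.com/b">b</a>
<a href="http://example.com/a">a again</a>
EOF

# grep -o emits each URL separately, even when a line holds several;
# awk keeps only the first occurrence of each URL, preserving order.
urls=$(grep -oE 'http://[^"]+' "$sample" | awk '!seen[$0]++')
printf '%s\n' "$urls"
rm -f "$sample"
```

Unlike the VBS version, this match is case-sensitive; adding `-i` to grep would mirror `re.IgnoreCase = True`.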

Author: cutebe Time: 2021-5-22 21:25

sed -r "s#.*(\"http[[:alnum:]:/.?%%]*\").*#\1#" fea.txt
sed -r "s#.*(\"http[^^\"]*\").*#\1#" fea.txt
:: inside a for loop:
for /f "delims=" %%h in ('sed -r "s#.*(\"http[[:alnum:]:/.?%%]*\").*#\1#" fea.txt')do (
	echo %%h
)
for /f "delims=" %%h in ('sed -r "s#.*(\"http[^^^^\"]*).*#\1#" fea.txt')do (
	echo %%h"
)

sed -r "s#\"#\n#g^" fea.txt|sed -n "/^http/p"
sed -r "s#\"#\n#g^" fea.txt|sed "/^http/!d"
sed -r "s#\"#\n#g^" fea.txt|findstr /i "^https:// ^http://"
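
The last three commands all split the text at every double quote and then keep the pieces that start with http. On a Unix shell the same idea can be sketched with `tr` doing the split instead of sed, since `\n` in a sed replacement is a GNU extension. That swap and the variable names are assumptions for illustration, not part of the original commands.

```shell
#!/bin/sh
# Sample line from the first post, read verbatim from a quoted here-document.
IFS= read -r line <<'EOF'
</span></a ><a href="http://www.360.com:8099/dl.php?MDEzNXRCT09G" onclick="down_process2('159979');
EOF

# Turn every double quote into a newline, then keep only the pieces
# that begin with http:// or https://.
links=$(printf '%s\n' "$line" | tr '"' '\n' | grep -E '^https?://')
printf '%s\n' "$links"
# prints: http://www.360.com:8099/dl.php?MDEzNXRCT09G
```

Because the quotes themselves are consumed by the split, the URL comes out without surrounding quotes, and a line holding several quoted URLs yields one output line per URL.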

Welcome to 批处理之家 (http://bbs.bathome.net/)