Title: [Help] How do I fix a garbled Content-Disposition response header?
Author: tmplinshi Time: 2016-11-28 21:28 Title: How do I fix a garbled Content-Disposition response header?
    Set objWinHttp = CreateObject("WinHttp.WinHttpRequest.5.1")
    objWinHttp.Open "HEAD", "https://www.nyaa.se/?page=download&tid=613616"
    objWinHttp.Send
    MsgBox objWinHttp.GetResponseHeader("Content-Disposition")
The code above displays:
inline; filename="娴疯醇鐜?65.rar.torrent"
In a batch script I could transcode with iconv, but how do I fix this in VBS?
Author: pcl_test Time: 2016-11-28 21:45
You can transcode it with ADODB.Stream.
Author: aa77dd@163.com Time: 2016-11-29 03:23
    Const adTypeBinary = 1
    Const adTypeText = 2

    ' Accept a string and convert it to a byte array in the selected charset.
    Function StringToBytes(Str, Charset)
        Dim Stream
        Set Stream = CreateObject("ADODB.Stream")
        Stream.Type = adTypeText
        Stream.Charset = Charset
        Stream.Open
        Stream.WriteText Str
        Stream.Flush
        ' Rewind the stream and read it back as bytes.
        Stream.Position = 0
        Stream.Type = adTypeBinary
        StringToBytes = Stream.Read
        Stream.Close
        Set Stream = Nothing
    End Function

    ' Accept a byte array and convert it to a string using the selected charset.
    Function BytesToString(Bytes, Charset)
        Dim Stream
        Set Stream = CreateObject("ADODB.Stream")
        Stream.Charset = Charset
        Stream.Type = adTypeBinary
        Stream.Open
        Stream.Write Bytes
        Stream.Flush
        ' Rewind the stream and read it back as text.
        Stream.Position = 0
        Stream.Type = adTypeText
        BytesToString = Stream.ReadText
        Stream.Close
        Set Stream = Nothing
    End Function

    ' Encode Str as bytes in FromCharset, then decode those bytes as ToCharset.
    Function AlterCharset(Str, FromCharset, ToCharset)
        Dim Bytes
        Bytes = StringToBytes(Str, FromCharset)

        ' Uncomment to inspect the intermediate bytes:
        ' HEXS = ""
        ' For i = 1 To LenB(Bytes)
        '     HEXS = HEXS & Hex(AscB(MidB(Bytes, i, 1))) & ","
        ' Next
        ' MsgBox HEXS

        AlterCharset = BytesToString(Bytes, ToCharset)
    End Function


    Set objWinHttp = CreateObject("WinHttp.WinHttpRequest.5.1")
    objWinHttp.Open "HEAD", "https://www.nyaa.se/?page=download&tid=613616"
    objWinHttp.Send
    MsgBox objWinHttp.GetResponseHeader("Content-Disposition")
    ' MsgBox LenB(objWinHttp.GetResponseHeader("Content-Disposition"))
    MsgBox AlterCharset(objWinHttp.GetResponseHeader("Content-Disposition"), "GB2312", "utf-8")
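Author: tmplinshi Time: 2016-11-29 10:10
Last edited by tmplinshi on 2016-11-29 10:17
Thank you both!
The result shown is "海贼�?65.rar.torrent"; the correct name is "海贼王765.rar.torrent".
I wonder whether the string returned by objWinHttp.GetResponseHeader("Content-Disposition") has already lost data by the time the script sees it.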
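One way to reason about that question is to replay the round trip outside VBS. The Python sketch below is only an illustration: it assumes the header string is exactly the garbled text shown earlier, and that AlterCharset amounts to "encode as GBK, then decode as UTF-8". The helper name `alter_charset` is hypothetical.

```python
def alter_charset(text, from_charset, to_charset):
    """Encode text in from_charset, then decode the bytes as to_charset.

    Undecodable bytes are replaced instead of raising, so the damage is visible.
    """
    raw = text.encode(from_charset, errors="replace")
    return raw.decode(to_charset, errors="replace")


# File-name part of the string GetResponseHeader returned (assumed input).
garbled = "娴疯醇鐜?65"

# The GBK bytes behind the garbled text: the '?' is a literal 0x3f.
print(garbled.encode("gbk").hex(" "))

# Decoding those bytes as UTF-8 recovers two characters and breaks on the third.
print(alter_charset(garbled, "gbk", "utf-8"))
```

Under these assumptions the third character cannot be rebuilt: the string holds 3f where the UTF-8 bytes 8b 37 used to be, so the loss happened before AlterCharset ran.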
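Author: 523066680 Time: 2016-11-29 10:44
Last edited by 523066680 on 2016-11-29 11:08
    use Encode;
    use LWP::Simple;

    # Download the torrent and pull the file name out of its "name" field.
    my $all = get("https://www.nyaa.se/?page=download&tid=613616");
    $all =~ /name\d+:(.*?rar)/i;
    # Decode the UTF-8 bytes, then re-encode as GBK for a Chinese console.
    print encode('gbk', decode('utf8', $1));
Output:
海贼王765.rar
A revised version that reads the header instead:
    use Encode;
    use LWP::Simple;

    # In scalar context head() returns the response object.
    my $h = head("https://www.nyaa.se/?page=download&tid=613616");
    # This reaches into the object's internals; $h->header('Content-Disposition')
    # is the public accessor for the same value.
    print encode('gbk', decode('utf8', $h->{'_headers'}->{'content-disposition'}));
Output:
inline; filename="海贼王765.rar.torrent"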
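Author: tmplinshi Time: 2016-11-29 11:02
Reply to #5 523066680
Thanks a lot. Your method seems to read the file name from the file's contents, though. If the URL points to something other than a torrent, an exe for example, it won't work.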
Author: 523066680 Time: 2016-11-29 11:08
Last edited by 523066680 on 2016-11-29 11:31
Reply to #6 tmplinshi
I've updated my post.
As for where the _headers and content-disposition keys come from: the Perl documentation doesn't describe them, but you can print the whole data structure with Data::Dump.
    use LWP::Simple;
    use Data::Dump qw(dump);
    my $h = head("https://www.nyaa.se/?page=download&tid=613616");
    print dump $h;
do {
my $a = bless({
_content => "",
_headers => bless({
"cf-ray" => "3092ded3aec707eb-LAX",
"client-date" => "Tue, 29 Nov 2016 03:11:05 GMT",
"client-peer" => "104.20.74.106:443",
"client-response-num" => 1,
"client-ssl-cert-issuer" => "/C=GB/ST=Greater Manchester/L=Salford/O=COMODO CA Limited/CN=COMODO ECC Domain Validation Secure Server CA 2",
"client-ssl-cert-subject" => "/OU=Domain Control Validated/OU=PositiveSSL Multi-Domain/CN=ssl366349.cloudflaressl.com",
"client-ssl-cipher" => "ECDHE-ECDSA-AES128-GCM-SHA256",
"client-ssl-socket-class" => "IO::Socket::SSL",
"connection" => "close",
"content-disposition" => "inline; filename=\"\xE6\xB5\xB7\xE8\xB4\xBC\xE7\x8E\x8B765.rar.torrent\"",
"content-type" => "application/x-bittorrent",
"date" => "Tue, 29 Nov 2016 03:11:07 GMT",
"last-modified" => "Thu, 23 Oct 2014 12:11:17 GMT",
"server" => "cloudflare-nginx",
"set-cookie" => "__cfduid=d41adfbdcefc8d9c55b9a6c24451c6fb61480389066; expires=Wed, 29-Nov-17 03:11:06 GMT; path=/; domain=.nyaa.se; HttpOnly",
"vary" => "Accept-Encoding",
}, "HTTP::Headers"),
_msg => "OK",
_protocol => "HTTP/1.1",
_rc => 200,
_request => bless({
_content => "",
_headers => bless({ "user-agent" => "LWP::Simple/6.00 libwww-perl/6.04" }, "HTTP::Headers"),
_method => "HEAD",
_uri => bless(do{\(my $o = "https://www.nyaa.se/?page=download&tid=613616")}, "URI::https"),
_uri_canonical => 'fix',
}, "HTTP::Request"),
}, "HTTP::Response");
$a->{_request}{_uri_canonical} = \${$a->{_request}{_uri}};
$a;
}
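I think three languages are well suited to this kind of job (web scraping): Ruby, Python, and Perl.
My recommendation is Ruby.
Author: tmplinshi Time: 2016-11-29 11:46
Last edited by tmplinshi on 2016-11-29 11:50
@523066680 Thank you very much! I'd still like a VBS solution, though. What I really want is to handle this in AHK, and VBS code can be used in AHK directly.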
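Author: 523066680 Time: 2016-11-29 11:46
Last edited by 523066680 on 2016-11-29 11:52
Reply to #4 tmplinshi
Look at the text in the first message box:
inline; filename="娴疯醇鐜?65.rar.torrent"
One character has already turned into a question mark. Encoding "娴疯醇鐜?65" back to GBK bytes (treating the text as GBK) gives:
[e6 b5] [b7 e8] [b4 bc] [e7 8e] 3f 36 35
whereas the original UTF-8 bytes are:
[e6 b5 b7] [e8 b4 bc] [e7 8e 8b] 37 36 35
When the data is read as GBK, each byte above 127 pairs with the following byte to form one wide character.
After [e6 b5] [b7 e8] [b4 bc] [e7 8e] are consumed, 8b 37 36 35 remains.
[8b 37] has no entry in the GBK table, so it becomes a question mark, that is, 3f.
In principle, if the data had been extracted intact, reading it as GBK would only display garbage; nothing should be lost.
So check whether the problem is in AlterCharset or in objWinHttp.GetResponseHeader("Content-Disposition").
The best approach is to print the byte values and look.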
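The regrouping described in this post can be replayed in a few lines of Python. This is a simplified sketch, not a full GBK decoder: the hypothetical helper `split_gbk_units` treats every byte above 0x7f as a lead byte that takes the next byte with it, which is enough to reproduce the pairing shown above.

```python
def split_gbk_units(data):
    """Group bytes the way a double-byte decoder walks them (simplified):
    a byte above 0x7f takes the following byte with it, anything else stands alone."""
    units = []
    i = 0
    while i < len(data):
        step = 2 if data[i] > 0x7F and i + 1 < len(data) else 1
        units.append(data[i:i + step])
        i += step
    return units


utf8_bytes = "海贼王765".encode("utf-8")  # e6 b5 b7 e8 b4 bc e7 8e 8b 37 36 35

for unit in split_gbk_units(utf8_bytes):
    try:
        shown = unit.decode("gbk")
    except UnicodeDecodeError:
        shown = "(no GBK character)"
    print(unit.hex(" "), shown)
```

The fifth unit, 8b 37, is the one with no GBK character, which matches the position of the question mark. The 37 (the digit 7) is swallowed along with it, which is why the garbled name reads "?65" rather than "?765".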
Author: tmplinshi Time: 2016-11-29 12:03
D:\Desktop>cscript /nologo test.vbs
inline; filename="娴疯醇鐜?65.rar.torrent"
D:\Desktop>cscript /nologo test.vbs | xd
000000 69 6e 6c 69 6e 65 3b 20 66 69 6c 65 6e 61 6d 65 inline; filename
000010 3d 22 e6 b5 b7 e8 b4 bc e7 8e 3f 36 35 2e 72 61 ="........?65.ra
000020 72 2e 74 6f 72 72 65 6e 74 22 0d 0a r.torrent"..
Author: tmplinshi Time: 2016-11-29 12:31
curl returns this:
Author: aa77dd@163.com Time: 2016-11-29 12:55
Last edited by aa77dd@163.com on 2016-11-29 16:50
This is the data; I have no idea what encoding it is:
    69,0,6E,0,6C,0,69,0,6E,0,65,0,3B,0,20,0,66,0,69,0,6C,0,65,0,6E,0,61,0,6D,0,65,0,3D,0,22,0,34,5A,AF,75,87,91,1C,94,3F,0,36,0,35,0,2E,0,72,0,61,0,72,0,2E,0,74,0,6F,0,72,0,72,0,65,0,6E,0,74,0,22,0,
Of these bytes, 34,5A,AF,75,87,91,1C,94 decode as UTF-16 LE to 娴疯醇鐜, and the 3F,0 that follows is the question mark.
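For readers who want to check that reading of the dump, here is a small Python sketch. It only covers the non-ASCII slice quoted above; the helper name `bytes_from_dump` is hypothetical.

```python
def bytes_from_dump(dump):
    """Turn a comma-separated hex dump such as '34,5A,3F,0,' into bytes."""
    return bytes(int(item, 16) for item in dump.split(",") if item)


# The non-ASCII slice of the dump, including the 3F,0 that follows it.
raw = bytes_from_dump("34,5A,AF,75,87,91,1C,94,3F,0")

# VBScript strings are UTF-16 LE in memory, so decode the bytes that way.
text = raw.decode("utf-16-le")
print(text)

# Encoding that text as GBK shows which header bytes survived.
print(text.encode("gbk").hex(" "))
```

The second line printed ends in 3f: the string already holds a real question mark where the bytes 8b 37 should be, so no later conversion can restore them.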
AHK can also use
ComObjCreate("Msxml2.XMLHTTP")
and similar objects.
I think the problem is in WinHttp or VBS. The script below reuses StringToBytes, BytesToString, and AlterCharset unchanged from my earlier post and adds a hex dump of the returned string:
    Set objWinHttp = CreateObject("WinHttp.WinHttpRequest.5.1")
    objWinHttp.Open "HEAD", "https://www.nyaa.se/?page=download&tid=613616"
    objWinHttp.Send

    ' Fetch the header once and reuse it.
    header = objWinHttp.GetResponseHeader("Content-Disposition")
    MsgBox header
    MsgBox LenB(header)

    ' Dump the string's in-memory bytes as hex.
    HEXS = ""
    For i = 1 To LenB(header)
        HEXS = HEXS & Hex(AscB(MidB(header, i, 1))) & ","
    Next
    MsgBox HEXS

    MsgBox AlterCharset(header, "GB2312", "utf-8")
Author: 523066680 Time: 2016-11-29 14:33
Last edited by 523066680 on 2016-11-29 14:40
Reply to #12 aa77dd@163.com
Just use the plain Asc function:
    Set objWinHttp = CreateObject("WinHttp.WinHttpRequest.5.1")
    objWinHttp.Open "HEAD", "https://www.nyaa.se/?page=download&tid=613616"
    objWinHttp.Send

    name = objWinHttp.GetResponseHeader("Content-Disposition")

    ' Print the ANSI (GBK) code of each character as hex.
    say = ""
    For i = 1 To Len(name)
        say = say & Hex(Asc(Mid(name, i, 1))) & " "
    Next

    MsgBox say
The message box shows:
69 6E 6C 69 6E 65 3B 20 66 69 6C 65 6E 61 6D 65 3D 22 E6B5 B7E8 B4BC E78E 3F 36 35 2E 72 61 72 2E 74 6F 72 72 65 6E 74 22
You can see that some bytes are already gone, replaced by a question mark (0x3F).
Author: aa77dd@163.com Time: 2016-11-29 14:41
Reply to #13 523066680
The problem may lie in the GetResponseHeader("Content-Disposition") method itself. An AHK test using Microsoft.XMLHTTP:
    req := ComObjCreate("Microsoft.XMLHTTP")
    ; Open a request with async enabled.
    req.open("GET", "https://www.nyaa.se/?page=download&tid=613616", true)
    ; Set our callback function (v1.1.17+).
    req.onreadystatechange := Func("Ready")
    ; Send the request. Ready() will be called when it's complete.
    req.send()
    ; /*
    ; If you're going to wait, there's no need for onreadystatechange.
    ; Setting async=true and waiting like this allows the script to remain
    ; responsive while the download is taking place, whereas async=false
    ; will make the script unresponsive.
    while req.readyState != 4
        sleep 100
    ; */
    #Persistent

    Ready() {
        global req
        if (req.readyState != 4)  ; Not done yet.
            return
        if (req.status == 200 || req.status == 304) {
            MsgBox % "responseText: " req.responseText
            t := req.GetResponseHeader("Content-Disposition")
            MsgBox % "Content-Disposition: " t
        }
        else
            MsgBox 16,, % "Status " req.status
        ExitApp
    }
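Author: 523066680 Time: 2016-11-29 14:54
Last edited by 523066680 on 2016-11-29 15:13
Reply to #14 aa77dd@163.com
Switch languages and the sky's the limit, haha. (If Ruby, Python, and Perl don't appeal either, then C# is the best choice.)
Well, that remark was meant for tmplinshi.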
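Author: aa77dd@163.com Time: 2016-11-29 17:13
Last edited by aa77dd@163.com on 2016-11-29 17:18
Reply to #15 523066680
There is a Tieba thread on this:
http://tieba.baidu.com/p/1618906999
This looks like an encoding-conversion bug rather than lost bytes.
Unicode  GBK   Character
5A34     E6B5  娴
75AF     B7E8  疯
9187     B4BC  醇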
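Author: 523066680 Time: 2016-11-29 17:19
Last edited by 523066680 on 2016-11-29 21:47
Reply to #16 aa77dd@163.com
My description was a bit off :]
There is a poem about encodings:
Two 锟斤拷 in hand, shouting 烫烫烫 aloud.
Treading a thousand 屯屯屯, smiling at a world of 锘锘锘.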
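Author: tmplinshi Time: 2016-11-29 22:37
It looks like there is no solution in VBS. Thanks for the help, everyone.
Author: 523066680 Time: 2016-11-29 23:35
Last edited by 523066680 on 2016-11-29 23:37
Reply to #18 tmplinshi
It may be possible, but if it takes that much fiddling, finding another route is the better choice.
Author: tmplinshi Time: 2016-12-12 21:47
Last edited by tmplinshi on 2017-1-3 23:30
The documentation for the HttpQueryInfo function contains this passage: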
Note The HttpQueryInfoA function represents headers as ISO-8859-1 characters not ANSI characters. The HttpQueryInfoW function represents headers as ISO-8859-1 characters converted to UTF-16LE characters. As a result, it is never safe to use the HttpQueryInfoW function when the headers can contain non-ASCII characters. Instead, an application can use the MultiByteToWideChar and WideCharToMultiByte functions with a Codepage parameter set to 28591 to map between ANSI characters and UTF-16LE characters.
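My guess is that the "WinHttp.WinHttpRequest.5.1" object uses something like HttpQueryInfoW internally.
With HttpQueryInfoA the header bytes come back untouched, so they can be decoded correctly. An AHK example: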
    MsgBox, % GetAllResponseHeaders("https://www.nyaa.se/?page=download&tid=613616")

    GetAllResponseHeaders(Url, RequestHeaders := "", NO_AUTO_REDIRECT := false, NO_COOKIES := false) {
        static INTERNET_OPEN_TYPE_DIRECT := 1
             , INTERNET_SERVICE_HTTP := 3
             , HTTP_QUERY_RAW_HEADERS_CRLF := 22
             , Default_UserAgent := "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko"

        hModule := DllCall("LoadLibrary", "str", "wininet.dll", "ptr")

        if !hInternet := DllCall("wininet\InternetOpen", "ptr", &Default_UserAgent, "uint", INTERNET_OPEN_TYPE_DIRECT
            , "str", "", "str", "", "uint", 0, "ptr")
        {
            DllCall("FreeLibrary", "ptr", hModule)
            return
        }
        ; -----------------------------------------------------------------------------------
        if !InStr(Url, "://")
            Url := "http://" Trim(Url)

        regex := "(?P<protocol>\w+)://((?P<user>\w+):(?P<pwd>\w+)@)?(?P<host>[\w.]+)(:(?P<port>\d+))?(?P<path>.*)"
        RegExMatch(Url, regex, v_)

        if (v_protocol = "ftp") {
            throw Exception("ftp is not supported.")
        }
        if (v_port = "") {
            v_port := (v_protocol = "https") ? 443 : 80
        }
        ; -----------------------------------------------------------------------------------
        Internet_Flags := 0
            | 0x400000   ; INTERNET_FLAG_KEEP_CONNECTION
            | 0x80000000 ; INTERNET_FLAG_RELOAD
            | 0x20000000 ; INTERNET_FLAG_NO_CACHE_WRITE
        if (v_protocol = "https") {
            Internet_Flags |= 0x1000 ; INTERNET_FLAG_IGNORE_CERT_CN_INVALID
                | 0x2000   ; INTERNET_FLAG_IGNORE_CERT_DATE_INVALID
                | 0x800000 ; INTERNET_FLAG_SECURE ; Technically, this is redundant for https
        }
        if NO_AUTO_REDIRECT
            Internet_Flags |= 0x00200000 ; INTERNET_FLAG_NO_AUTO_REDIRECT
        if NO_COOKIES
            Internet_Flags |= 0x00080000 ; INTERNET_FLAG_NO_COOKIES
        ; -----------------------------------------------------------------------------------
        hConnect := DllCall("wininet\InternetConnect", "ptr", hInternet, "ptr", &v_host, "uint", v_port
            , "ptr", &v_user, "ptr", &v_pwd, "uint", INTERNET_SERVICE_HTTP, "uint", Internet_Flags, "uint", 0, "ptr")

        hRequest := DllCall("wininet\HttpOpenRequest", "ptr", hConnect, "str", "HEAD", "ptr", &v_path
            , "str", "HTTP/1.1", "ptr", 0, "ptr", 0, "uint", Internet_Flags, "ptr", 0, "ptr")

        nRet := DllCall("wininet\HttpSendRequest", "ptr", hRequest, "ptr", &RequestHeaders, "int", -1
            , "ptr", 0, "uint", 0)

        ; The first pass fails and reports the required size in bufferLen;
        ; the second pass fills the buffer with the raw header bytes.
        bufferLen := 0
        Loop, 2 {
            DllCall("wininet\HttpQueryInfoA", "ptr", hRequest, "uint", HTTP_QUERY_RAW_HEADERS_CRLF
                , "ptr", &pBuffer, "uint*", bufferLen, "uint", 0)
            if (A_Index = 1)
                VarSetCapacity(pBuffer, bufferLen, 0)
        }
        ; -----------------------------------------------------------------------------------
        ; Interpret the raw bytes as UTF-8, which is what this server sends.
        output := StrGet(&pBuffer, "UTF-8")
        ; -----------------------------------------------------------------------------------
        DllCall("wininet\InternetCloseHandle", "ptr", hRequest)
        DllCall("wininet\InternetCloseHandle", "ptr", hConnect)
        DllCall("wininet\InternetCloseHandle", "ptr", hInternet)
        DllCall("FreeLibrary", "ptr", hModule)

        return output
    }
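The quoted note says header bytes are to be treated as ISO-8859-1 (code page 28591). That mapping is lossless, since every byte value maps to exactly one character and back, which is why it can carry UTF-8 bytes safely where the GBK reading could not. The Python sketch below illustrates the idea under the assumption that the server sent UTF-8; the helper name `fix_header` is hypothetical and is not part of any HTTP library.

```python
def fix_header(value):
    """Undo an ISO-8859-1 reading of a header that was really sent as UTF-8.

    Returns the value unchanged if it cannot be mapped back to bytes
    or if those bytes are not valid UTF-8.
    """
    try:
        raw = value.encode("latin-1")  # lossless: one character per original byte
        return raw.decode("utf-8")
    except UnicodeError:
        return value


# Simulate a client that decodes the wire bytes as ISO-8859-1.
wire = 'inline; filename="海贼王765.rar.torrent"'.encode("utf-8")
as_latin1 = wire.decode("latin-1")

print(fix_header(as_latin1))
```

The try/except keeps plain ASCII headers and genuinely Latin-1 headers usable, at the cost of guessing: a Latin-1 value that happens to be valid UTF-8 would be reinterpreted.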
Welcome to 批处理之家 (http://bbs.bathome.net/)