
Title: [Help] How do I fix a garbled Content-Disposition response header?

Author: tmplinshi    Time: 2016-11-28 21:28     Title: How do I fix a garbled Content-Disposition response header?

  Set objWinHttp = CreateObject("WinHttp.WinHttpRequest.5.1")
  objWinHttp.Open "HEAD", "https://www.nyaa.se/?page=download&tid=613616"
  objWinHttp.Send
  MsgBox objWinHttp.GetResponseHeader("Content-Disposition")

The code above displays:
inline; filename="娴疯醇鐜?65.rar.torrent"


In a batch file I could transcode the text with iconv, but how do I solve this in VBS?


Author: pcl_test    Time: 2016-11-28 21:45

You can transcode with ADODB.Stream.
Author: aa77dd@163.com    Time: 2016-11-29 03:23

  Const adTypeBinary = 1
  Const adTypeText = 2
  ' accept a string and convert it to Bytes array in the selected Charset
  Function StringToBytes(Str,Charset)
    ' Dim Stream
    Set Stream = CreateObject("ADODB.Stream")
    Stream.Type = adTypeText
    Stream.Charset = Charset
    Stream.Open
    Stream.WriteText Str
    Stream.Flush
    Stream.Position = 0
    ' rewind stream and read Bytes
    Stream.Type = adTypeBinary
    StringToBytes= Stream.Read
    Stream.Close
    Set Stream = Nothing
  End Function
  ' accept Bytes array and convert it to a string using the selected charset
  Function BytesToString(Bytes, Charset)
    ' Dim Stream
    Set Stream = CreateObject("ADODB.Stream")
    Stream.Charset = Charset
    Stream.Type = adTypeBinary
    Stream.Open
    Stream.Write Bytes
    Stream.Flush
    Stream.Position = 0
    ' rewind stream and read text
    Stream.Type = adTypeText
    BytesToString= Stream.ReadText
    Stream.Close
    Set Stream = Nothing
  End Function
  ' This will alter charset of a string from 1-byte charset(as windows-1252)
  ' to another 1-byte charset(as windows-1251)
  Function AlterCharset(Str, FromCharset, ToCharset)
    Dim Bytes
    Bytes = StringToBytes(Str, FromCharset)
  ' HEXS=""
  ' for i = 1 to LenB(Bytes)
  ' HEXS = HEXS & hex(ascb(MidB (Bytes, i, 1))) & ","
  ' next
  ' MsgBox HEXS
    AlterCharset = BytesToString(Bytes, ToCharset)
  End Function
  Set objWinHttp = CreateObject("WinHttp.WinHttpRequest.5.1")
  objWinHttp.Open "HEAD", "https://www.nyaa.se/?page=download&tid=613616"
  objWinHttp.Send
  MsgBox objWinHttp.GetResponseHeader("Content-Disposition")
  ' MsgBox LenB ( objWinHttp.GetResponseHeader("Content-Disposition") )
  MsgBox AlterCharset( objWinHttp.GetResponseHeader("Content-Disposition"), "GB2312", "utf-8")

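Author: tmplinshi    Time: 2016-11-29 10:10

Last edited by tmplinshi on 2016-11-29 10:17

Thank you both!

The result displayed is "海贼�?65.rar.torrent", but the correct one should be "海贼王765.rar.torrent".
I wonder whether the string returned by objWinHttp.GetResponseHeader("Content-Disposition") has already lost data by itself.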
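
The suspicion above can be checked outside VBS. The following is a minimal Python sketch, not the ADODB.Stream code itself: `alter_charset` is a hypothetical stand-in for AlterCharset, and the input is the string exactly as the message box showed it, question mark included. It assumes Python's `gbk` codec matches what "GB2312" produced on the poster's machine for these characters.

```python
# Hypothetical stand-in for AlterCharset(Str, "GB2312", "utf-8"):
# re-encode the text in one charset, then decode the bytes in another.
def alter_charset(text: str, from_charset: str, to_charset: str) -> str:
    raw = text.encode(from_charset, errors="replace")
    return raw.decode(to_charset, errors="replace")


# The header value as the first message box displayed it.
garbled = 'inline; filename="娴疯醇鐜?65.rar.torrent"'
repaired = alter_charset(garbled, "gbk", "utf-8")

# The first two characters of the file name come back. The third cannot,
# because one of its three UTF-8 bytes was already "?" (0x3f) on input.
recovered_prefix = repaired.startswith('inline; filename="海贼')
third_char_lost = "王" not in repaired
print(recovered_prefix, third_char_lost)
```

If both flags are true, the round trip itself behaves correctly and the damage happened before the conversion step, which fits the result reported in this post.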
Author: 523066680    Time: 2016-11-29 10:44

Last edited by 523066680 on 2016-11-29 11:08
  use Encode;
  use LWP::Simple;
  # Download the torrent and pull the name out of its content
  my $all = get("https://www.nyaa.se/?page=download&tid=613616");
  $all =~ /name\d+:(.*?rar)/i;
  # Decode the UTF-8 bytes, then encode as GBK for the console
  print encode('gbk', decode('utf8', $1));
海贼王765.rar

A follow-up revision:
  use Encode;
  use LWP::Simple;
  # head() in scalar context returns the HTTP::Response object
  my $h = head("https://www.nyaa.se/?page=download&tid=613616");
  print encode('gbk', decode('utf8',  $h->{'_headers'}->{'content-disposition'} ));
inline; filename="海贼王765.rar.torrent"
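Author: tmplinshi    Time: 2016-11-29 11:02

Reply to 5# 523066680


    Thanks. Your method seems to read the file name out of the file content. If the link pointed to some other file type, such as an exe, rather than a torrent, it would not work.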
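Author: 523066680    Time: 2016-11-29 11:08

Last edited by 523066680 on 2016-11-29 11:31

Reply to 6# tmplinshi


    I have posted a revised version.

As for where the _headers and content-disposition keys come from: the Perl documentation does not describe them specifically, but you can print the whole data structure with Data::Dump.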


do {
  my $a = bless({
    _content => "",
    _headers => bless({
      "cf-ray" => "3092ded3aec707eb-LAX",
      "client-date" => "Tue, 29 Nov 2016 03:11:05 GMT",
      "client-peer" => "104.20.74.106:443",
      "client-response-num" => 1,
      "client-ssl-cert-issuer" => "/C=GB/ST=Greater Manchester/L=Salford/O=COMODO CA Limited/CN=COMODO ECC Domain Validation Secure Server CA 2",
      "client-ssl-cert-subject" => "/OU=Domain Control Validated/OU=PositiveSSL Multi-Domain/CN=ssl366349.cloudflaressl.com",
      "client-ssl-cipher" => "ECDHE-ECDSA-AES128-GCM-SHA256",
      "client-ssl-socket-class" => "IO::Socket::SSL",
      "connection" => "close",
      "content-disposition" => "inline; filename=\"\xE6\xB5\xB7\xE8\xB4\xBC\xE7\x8E\x8B765.rar.torrent\"",
      "content-type" => "application/x-bittorrent",
      "date" => "Tue, 29 Nov 2016 03:11:07 GMT",
      "last-modified" => "Thu, 23 Oct 2014 12:11:17 GMT",
      "server" => "cloudflare-nginx",
      "set-cookie" => "__cfduid=d41adfbdcefc8d9c55b9a6c24451c6fb61480389066; expires=Wed, 29-Nov-17 03:11:06 GMT; path=/; domain=.nyaa.se; HttpOnly",
      "vary" => "Accept-Encoding",
    }, "HTTP::Headers"),
    _msg => "OK",
    _protocol => "HTTP/1.1",
    _rc => 200,
    _request => bless({
      _content => "",
      _headers => bless({ "user-agent" => "LWP::Simple/6.00 libwww-perl/6.04" }, "HTTP::Headers"),
      _method => "HEAD",
      _uri => bless(do{\(my $o = "https://www.nyaa.se/?page=download&tid=613616")}, "URI::https"),
      _uri_canonical => 'fix',
    }, "HTTP::Request"),
  }, "HTTP::Response");
  $a->{_request}{_uri_canonical} = \${$a->{_request}{_uri}};
  $a;
}
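
The dump shows the content-disposition value as raw UTF-8 bytes. As a rough Python counterpart to the decode step in the Perl snippet, the sketch below decodes those bytes and extracts the file name. The byte string is copied from the dump; the regular expression is an illustrative assumption that only handles the simple quoted `filename="..."` form seen here.

```python
import re

# Header value bytes as shown in the Data::Dump output above.
header_bytes = (
    b'inline; filename="\xe6\xb5\xb7\xe8\xb4\xbc\xe7\x8e\x8b765.rar.torrent"'
)

# The server sent UTF-8, so decode with that charset explicitly.
header_text = header_bytes.decode("utf-8")

# Illustrative pattern: quoted filename parameter only.
match = re.search(r'filename="([^"]*)"', header_text)
filename = match.group(1) if match else None
```

Here `filename` ends up as the intended name, 海贼王765.rar.torrent, because the bytes never pass through a different code page on the way.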

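I think three languages suit this kind of job (web crawling): ruby, python, and perl.

I recommend ruby.
Author: tmplinshi    Time: 2016-11-29 11:46

Last edited by tmplinshi on 2016-11-29 11:50

@523066680 Thank you very much! I would still like a VBS solution, though. What I actually want is to handle this in AHK, and VBS code can be used directly in AHK.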
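Author: 523066680    Time: 2016-11-29 11:46

Last edited by 523066680 on 2016-11-29 11:52

Reply to 4# tmplinshi


    From what the first message box shows,
inline; filename="娴疯醇鐜?65.rar.torrent"

you can see that one character has already become a question mark. Encoding "娴疯醇鐜?65" back to GBK (treating it as if it were GBK text)
gives these bytes:
[e6 b5] [b7 e8] [b4 bc] [e7 8e] 3f 36 35

whereas the original bytes (UTF-8) are:
[e6 b5 b7] [e8 b4 bc] [e7 8e 8b] 37 36 35

When the data is read as GBK, bytes above 127 are consumed two at a time, each pair forming one wide character.
After [e6 b5] [b7 e8] [b4 bc] [e7 8e] are taken, 8b 37 36 35 remains.
[8b 37] has no corresponding character in the GBK table, so the pair becomes a question mark, i.e. 3f.

In principle, if the data had been extracted intact, reading it as GBK would only display garbage; nothing should be lost.
Check whether the problem is in AlterCharset or in objWinHttp.GetResponseHeader("Content-Disposition").
It would be best to print out the encoded bytes and take a look.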
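
The byte pairing described above can be reproduced with a short Python sketch. It assumes Python's `gbk` codec is a fair proxy for the Windows code page involved; the variable names are illustrative only.

```python
name = "海贼王765"
raw = name.encode("utf-8")

# Hex dump of the UTF-8 bytes: three bytes per Chinese character.
utf8_hex = " ".join(f"{b:02x}" for b in raw)
print(utf8_hex)  # e6 b5 b7 e8 b4 bc e7 8e 8b 37 36 35

# Read as GBK, the first eight bytes form four valid two-byte characters.
paired = raw[:8].decode("gbk")

# The next pair, 8b 37, is not a valid GBK character: 0x37 cannot be a
# trail byte. A strict decoder refuses it; a lenient one substitutes "?".
try:
    raw[8:10].decode("gbk")
    leftover_is_valid = True
except UnicodeDecodeError:
    leftover_is_valid = False
print(leftover_is_valid)  # False
```

The four characters in `paired` are the ones seen in the message box, and the failed pair is where the question mark appears, taking the leading "7" of "765" with it.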
Author: tmplinshi    Time: 2016-11-29 12:03

D:\Desktop>cscript /nologo test.vbs
inline; filename="娴疯醇鐜?65.rar.torrent"

D:\Desktop>cscript /nologo test.vbs | xd
000000  69 6e 6c 69 6e 65 3b 20 66 69 6c 65 6e 61 6d 65    inline; filename
000010  3d 22 e6 b5 b7 e8 b4 bc e7 8e 3f 36 35 2e 72 61    ="........?65.ra
000020  72 2e 74 6f 72 72 65 6e 74 22 0d 0a                r.torrent"..



Author: tmplinshi    Time: 2016-11-29 12:31

curl returns this:


Author: aa77dd@163.com    Time: 2016-11-29 12:55

Last edited by aa77dd@163.com on 2016-11-29 16:50

Here is the data; I have no idea what encoding this is:
  69,0,6E,0,6C,0,69,0,6E,0,65,0,3B,0,20,0,66,0,69,0,6C,0,65,0,6E,0,61,0,6D,0,65,0,3D,0,22,0,34,5A,AF,75,87,91,1C,94,3F,0,36,0,35,0,2E,0,72,0,61,0,72,0,2E,0,74,0,6F,0,72,0,72,0,65,0,6E,0,74,0,22,0,
Of these, 34,5A,AF,75,87,91,1C,94 decodes as UTF-16 LE to 娴疯醇鐜, and the 3F,0 that follows is "?".
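
To confirm the reading of that dump, here is a small Python sketch that decodes the relevant bytes as UTF-16 LE and lists the resulting code units. The byte values are copied from the dump above, including the trailing 3F,0; everything else is illustrative.

```python
# The non-ASCII part of the dump, plus the "?" that follows it.
dump = "34,5A,AF,75,87,91,1C,94,3F,0"
raw = bytes(int(h, 16) for h in dump.split(","))

# VBS strings are UTF-16 LE internally, which is what LenB/MidB expose.
text = raw.decode("utf-16-le")

# One entry per UTF-16 code unit, low byte first in the dump.
code_units = [f"{ord(ch):04X}" for ch in text]
print(code_units)  # ['5A34', '75AF', '9187', '941C', '003F']
```

So the string held by VBS already contains a real question mark (U+003F) where the last byte pair should be, which suggests the loss happened before the script ever saw the value.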

Also, AHK can use something like
ComObjCreate("Msxml2.XMLHTTP")

I think the problem lies in WinHttp or VBS.
  Const adTypeBinary = 1
  Const adTypeText = 2
  ' accept a string and convert it to Bytes array in the selected Charset
  Function StringToBytes(Str,Charset)
    ' Dim Stream
    Set Stream = CreateObject("ADODB.Stream")
    Stream.Type = adTypeText
    Stream.Charset = Charset
    Stream.Open
    Stream.WriteText Str
    Stream.Flush
    Stream.Position = 0
    ' rewind stream and read Bytes
    Stream.Type = adTypeBinary
    StringToBytes= Stream.Read
    Stream.Close
    Set Stream = Nothing
  End Function
  ' accept Bytes array and convert it to a string using the selected charset
  Function BytesToString(Bytes, Charset)
    ' Dim Stream
    Set Stream = CreateObject("ADODB.Stream")
    Stream.Charset = Charset
    Stream.Type = adTypeBinary
    Stream.Open
    Stream.Write Bytes
    Stream.Flush
    Stream.Position = 0
    ' rewind stream and read text
    Stream.Type = adTypeText
    BytesToString= Stream.ReadText
    Stream.Close
    Set Stream = Nothing
  End Function
  ' This will alter charset of a string from 1-byte charset(as windows-1252)
  ' to another 1-byte charset(as windows-1251)
  Function AlterCharset(Str, FromCharset, ToCharset)
    Dim Bytes
    Bytes = StringToBytes(Str, FromCharset)

    AlterCharset = BytesToString(Bytes, ToCharset)
  End Function
  Set objWinHttp = CreateObject("WinHttp.WinHttpRequest.5.1")
  objWinHttp.Open "HEAD", "https://www.nyaa.se/?page=download&tid=613616"
  objWinHttp.Send
  MsgBox objWinHttp.GetResponseHeader("Content-Disposition")
  MsgBox LenB ( objWinHttp.GetResponseHeader("Content-Disposition") )
  HEXS=""
  for i = 1 to LenB(objWinHttp.GetResponseHeader("Content-Disposition"))
    HEXS = HEXS & hex(ascb(MidB (objWinHttp.GetResponseHeader("Content-Disposition"), i, 1))) & ","
  next
  MsgBox HEXS
  MsgBox AlterCharset( objWinHttp.GetResponseHeader("Content-Disposition"), "GB2312", "utf-8")

Author: 523066680    Time: 2016-11-29 14:33

Last edited by 523066680 on 2016-11-29 14:40

Reply to 12# aa77dd@163.com


    Just use the plain Asc:
  Set objWinHttp = CreateObject("WinHttp.WinHttpRequest.5.1")
  objWinHttp.Open "HEAD", "https://www.nyaa.se/?page=download&tid=613616"
  objWinHttp.Send
  name = objWinHttp.GetResponseHeader("Content-Disposition")
  say = ""
  for i = 1 to len(name)
      say = say & hex( asc(mid(name, i, 1)) ) & " "
  next
  msgbox say
---------------------------

---------------------------
69 6E 6C 69 6E 65 3B 20 66 69 6C 65 6E 61 6D 65 3D 22 E6B5 B7E8 B4BC E78E 3F 36 35 2E 72 61 72 2E 74 6F 72 72 65 6E 74 22
---------------------------
OK
---------------------------

You can see that a byte has already been lost and turned into a question mark (0x3f).
Author: aa77dd@163.com    Time: 2016-11-29 14:41

Reply to 13# 523066680

The problem may lie in the GetResponseHeader("Content-Disposition") method.
  req := ComObjCreate("Microsoft.XMLHTTP")
  ; Open a request with async enabled.
  req.open("GET", "https://www.nyaa.se/?page=download&tid=613616", true)
  ; Set our callback function (v1.1.17+).
  req.onreadystatechange := Func("Ready")
  ; Send the request.  Ready() will be called when it's complete.
  req.send()
  ; /*
  ; If you're going to wait, there's no need for onreadystatechange.
  ; Setting async=true and waiting like this allows the script to remain
  ; responsive while the download is taking place, whereas async=false
  ; will make the script unresponsive.
  while req.readyState != 4
      sleep 100
  ; */
  #Persistent
  Ready() {
      global req
      if (req.readyState != 4)  ; Not done yet.
          return
      if (req.status == 200 || req.status == 304) {
          MsgBox % "responseText: " req.responseText
          t := req.GetResponseHeader("Content-Disposition")
          MsgBox % "Content-Disposition: " t
      }
      else
          MsgBox 16,, % "Status " req.status
      ExitApp
  }

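Author: 523066680    Time: 2016-11-29 14:54

Last edited by 523066680 on 2016-11-29 15:13

Reply to 14# aa77dd@163.com


    Switch languages and the sky's the limit, haha~ (and if you don't like ruby, python, or perl either, then C# is the best)

Yes, that remark was meant for tmplinshi.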
Author: aa77dd@163.com    Time: 2016-11-29 17:13

Last edited by aa77dd@163.com on 2016-11-29 17:18

Reply to 15# 523066680

There is a Tieba thread on this:

http://tieba.baidu.com/p/1618906999

This should be an encoding-conversion bug rather than lost bytes.

Unicode  GBK   Character
5A34     E6B5  娴
75AF     B7E8  疯
9187     B4BC  醇
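Author: 523066680    Time: 2016-11-29 17:19

Last edited by 523066680 on 2016-11-29 21:47

Reply to 16# aa77dd@163.com

My description was a bit off :]

There is a poem about encodings:
Wielding a pair of 锟斤拷 in hand, crying out 烫烫烫.
Treading on a thousand 屯屯屯, smiling at all things 锘锘锘.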

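Author: tmplinshi    Time: 2016-11-29 22:37

It looks like there is no solution in VBS. Thanks for the help, everyone.
Author: 523066680    Time: 2016-11-29 23:35

Last edited by 523066680 on 2016-11-29 23:37

Reply to 18# tmplinshi


It may be possible, but if it takes that much wrangling you are better off finding another route.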
Author: tmplinshi    Time: 2016-12-12 21:47

Last edited by tmplinshi on 2017-1-3 23:30

The documentation for the HttpQueryInfo function contains this passage:
Note The HttpQueryInfoA function represents headers as ISO-8859-1 characters not ANSI characters. The HttpQueryInfoW function represents headers as ISO-8859-1 characters converted to UTF-16LE characters. As a result, it is never safe to use the HttpQueryInfoW function when the headers can contain non-ASCII characters. Instead, an application can use the MultiByteToWideChar and WideCharToMultiByte functions with a Codepage parameter set to 28591 to map between ANSI characters and UTF-16LE characters.


My guess is that the "WinHttp.WinHttpRequest.5.1" object uses the HttpQueryInfoW function.
With HttpQueryInfoA the encoding can be converted correctly. AHK example:

  MsgBox, % GetAllResponseHeaders("https://www.nyaa.se/?page=download&tid=613616")

  GetAllResponseHeaders(Url, RequestHeaders := "", NO_AUTO_REDIRECT := false, NO_COOKIES := false) {
      static INTERNET_OPEN_TYPE_DIRECT := 1
           , INTERNET_SERVICE_HTTP := 3
           , HTTP_QUERY_RAW_HEADERS_CRLF := 22
           , CP_UTF8 := 65001
           , Default_UserAgent := "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko"

      hModule := DllCall("LoadLibrary", "str", "wininet.dll", "ptr")
      if !hInternet := DllCall("wininet\InternetOpen", "ptr", &Default_UserAgent, "uint", INTERNET_OPEN_TYPE_DIRECT
          , "str", "", "str", "", "uint", 0)
          return
      ; -----------------------------------------------------------------------------------
      ; Split the URL into protocol, credentials, host, port and path
      if !InStr(Url, "://")
          Url := "http://" Trim(Url)
      regex := "(?P<protocol>\w+)://((?P<user>\w+):(?P<pwd>\w+)@)?(?P<host>[\w.]+)(:(?P<port>\d+))?(?P<path>.*)"
      RegExMatch(Url, regex, v_)
      if (v_protocol = "ftp") {
          throw, "ftp is not supported."
      }
      if (v_port = "") {
          v_port := (v_protocol = "https") ? 443 : 80
      }
      ; -----------------------------------------------------------------------------------
      Internet_Flags := 0
                      | 0x400000   ; INTERNET_FLAG_KEEP_CONNECTION
                      | 0x80000000 ; INTERNET_FLAG_RELOAD
                      | 0x20000000 ; INTERNET_FLAG_NO_CACHE_WRITE
      if (v_protocol = "https") {
          Internet_Flags |= 0x1000  ; INTERNET_FLAG_IGNORE_CERT_CN_INVALID
                         | 0x2000   ; INTERNET_FLAG_IGNORE_CERT_DATE_INVALID
                         | 0x800000 ; INTERNET_FLAG_SECURE ; Technically, this is redundant for https
      }
      if NO_AUTO_REDIRECT
          Internet_Flags |= 0x00200000 ; INTERNET_FLAG_NO_AUTO_REDIRECT
      if NO_COOKIES
          Internet_Flags |= 0x00080000 ; INTERNET_FLAG_NO_COOKIES
      ; -----------------------------------------------------------------------------------
      hConnect := DllCall("wininet\InternetConnect", "ptr", hInternet, "ptr", &v_host, "uint", v_port
          , "ptr", &v_user, "ptr", &v_pwd, "uint", INTERNET_SERVICE_HTTP, "uint", Internet_Flags, "uint", 0, "ptr")
      hRequest := DllCall("wininet\HttpOpenRequest", "ptr", hConnect, "str", "HEAD", "ptr", &v_path
          , "str", "HTTP/1.1", "ptr", 0, "ptr", 0, "uint", Internet_Flags, "ptr", 0, "ptr")
      nRet := DllCall("wininet\HttpSendRequest", "ptr", hRequest, "ptr", &RequestHeaders, "int", -1
          , "ptr", 0, "uint", 0)
      ; First pass obtains the required buffer size, second pass fills the buffer
      Loop, 2 {
          DllCall("wininet\HttpQueryInfoA", "ptr", hRequest, "uint", HTTP_QUERY_RAW_HEADERS_CRLF
              , "ptr", &pBuffer, "uint*", bufferLen, "uint", 0)
          if (A_Index = 1)
              VarSetCapacity(pBuffer, bufferLen, 0)
      }
      ; -----------------------------------------------------------------------------------
      ; The A function returns the raw header bytes; decode them as UTF-8
      output := StrGet(&pBuffer, "UTF-8")
      ; -----------------------------------------------------------------------------------
      DllCall("wininet\InternetCloseHandle", "ptr", hRequest)
      DllCall("wininet\InternetCloseHandle", "ptr", hConnect)
      DllCall("wininet\InternetCloseHandle", "ptr", hInternet)
      DllCall("FreeLibrary", "Ptr", hModule)
      return output
  }
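
Why a byte-preserving view of the header makes recovery possible can be sketched in Python. This is only an illustration of the idea in the quoted note, not of WinINet itself: `latin-1` stands in for code page 28591, `gbk` stands in for the ANSI code page on the poster's system, and the variable names are hypothetical.

```python
original = 'inline; filename="海贼王765.rar.torrent"'
wire = original.encode("utf-8")  # the bytes the server actually sends

# ISO-8859-1 maps every byte value to exactly one code point, so a string
# built this way can always be turned back into the original bytes.
as_latin1 = wire.decode("latin-1")
recovered = as_latin1.encode("latin-1").decode("utf-8")

# A multi-byte ANSI code page rejects some byte pairs; whatever replaces
# them cannot be mapped back, so the round trip loses information.
as_gbk = wire.decode("gbk", errors="replace")
round_trip_gbk = as_gbk.encode("gbk", errors="replace")

print(recovered == original)    # True
print(round_trip_gbk == wire)   # False
```

The same reasoning applies to any API that hands back the header as raw bytes: as long as no lossy code-page conversion happens first, decoding as UTF-8 yields the intended file name.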





Welcome to 批处理之家 (http://bbs.bathome.net/) Powered by Discuz! 7.2