xp3000

Rank: 5

Posts: 441
Points: 618
Tech: 37
Donations: 0
Registered: 2013-4-25

#1

Posted 2023-1-2 09:46

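[Text Processing] [Solved] How can I auto-detect a file's encoding when opening it, process it, and save it as ANSI?

Last edited by xp3000 on 2023-1-8 08:33

I have some txt files that contain a lot of redundant content that needs to be deleted, for example
PS：……
PS2：……
PS3：……
In the end the files need to be saved in ANSI encoding.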
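The cleanup step described above can be sketched in JavaScript as follows. This is only an illustration under stated assumptions: the name stripPsLines is made up here, the text is assumed to be already decoded into an ordinary string, and only lines that start with PS, optional digits, and a full-width colon are treated as redundant.

```javascript
// Hypothetical helper (not from the thread): remove lines such as "PS：…" or "PS2：…".
// Assumes `text` is an already-decoded JavaScript string.
function stripPsLines(text) {
  return text
    .split(/\r?\n/)                           // split on LF or CRLF
    .filter(line => !/^PS\d*：/.test(line))   // drop lines starting with PS + digits + full-width colon
    .join('\r\n');                            // rejoin with Windows line endings
}

const psSample = '正文第一行\r\nPS：广告\r\nPS2：more\r\n正文第二行';
console.log(JSON.stringify(stripPsLines(psSample)));
```

Detecting the source encoding and writing the result back as ANSI are separate steps that this sketch does not cover.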
I found the following code by searching, but I don't know how to use it.

function detectEncoding(str) {
  // str must be a "binary string": one character per byte of the file (0-255)
  var b0 = str.charCodeAt(0), b1 = str.charCodeAt(1),
      b2 = str.charCodeAt(2), b3 = str.charCodeAt(3);

  // Check the UTF-32 BOMs first, because FF FE 00 00 also begins with the UTF-16LE BOM
  if (b0 === 0x00 && b1 === 0x00 && b2 === 0xFE && b3 === 0xFF) {
    return 'UTF-32BE';
  } else if (b0 === 0xFF && b1 === 0xFE && b2 === 0x00 && b3 === 0x00) {
    return 'UTF-32LE';
  }

  // Check the UTF-16 and UTF-8 BOMs at the start of the data
  if (b0 === 0xFE && b1 === 0xFF) {
    return 'UTF-16BE';
  } else if (b0 === 0xFF && b1 === 0xFE) {
    return 'UTF-16LE';
  } else if (b0 === 0xEF && b1 === 0xBB && b2 === 0xBF) {
    return 'UTF-8BOM';
  }

  // If none of the patterns above is found, assume ASCII or ANSI
  return 'ANSI/ASCII';
}

To guess the character encoding of a file in JavaScript, first read the file as raw bytes, for example as a "binary string" in which each character holds one byte, so that .charCodeAt(i) returns a value from 0 to 255. A string that has already been decoded holds UTF-16 code units, and its original encoding can no longer be recovered from it. The functions below all assume such a binary string.

An ANSI code page cannot be positively identified from the bytes alone. The following check only confirms that the input really is a byte string:

function isANSI(str) {
  for (let i = 0; i < str.length; i++) {
    // A value greater than 255 does not fit in one byte,
    // so the input is not a byte string
    if (str.charCodeAt(i) > 255) return false;
  }

  // Every value fits in one byte, so the data could be ANSI;
  // this is a necessary condition, not proof
  return true;
}

To determine whether the bytes form well-formed UTF-8, you could use the following approach:

function isUTF8(str) {
  for (let i = 0; i < str.length; i++) {
    const c = str.charCodeAt(i);
    let n; // number of continuation bytes that must follow

    // 00-7F: a single-byte (ASCII) character
    if (c <= 0x7F) continue;

    // C2-DF starts a two-byte sequence, E0-EF a three-byte sequence,
    // F0-F4 a four-byte sequence; any other value cannot start a character
    if (c >= 0xC2 && c <= 0xDF) n = 1;
    else if (c >= 0xE0 && c <= 0xEF) n = 2;
    else if (c >= 0xF0 && c <= 0xF4) n = 3;
    else return false;

    // Each continuation byte must be in the range 80-BF
    for (; n > 0; n--) {
      i++;
      if (i >= str.length) return false;
      if ((str.charCodeAt(i) & 0xC0) !== 0x80) return false;
    }
  }

  // Every byte belongs to a well-formed sequence. Overlong three- and
  // four-byte forms and encoded surrogates are not rejected by this check.
  return true;
}

To determine whether the data is UTF-8 with a BOM (Byte Order Mark), you could use the following approach:

function isUTF8BOM(str) {
  // The BOM for UTF-8 is the byte sequence EF BB BF
  if (str.charCodeAt(0) === 0xEF && str.charCodeAt(1) === 0xBB && str.charCodeAt(2) === 0xBF) {
    // If the first three bytes match the BOM for UTF-8,
    // check whether the rest of the data is well-formed UTF-8
    return isUTF8(str.substring(3));
  }

  // If the first three bytes do not match the BOM for UTF-8,
  // the data is not UTF-8 with a BOM
  return false;
}
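As an alternative to hand-written byte checks, a runtime that provides the standard TextDecoder (Node.js and browsers do; Windows Script Host does not) can validate UTF-8 directly. The sketch below assumes the file content is available as a Uint8Array; the name looksLikeUtf8 is invented for this example.

```javascript
// Sketch: validate UTF-8 with the built-in TextDecoder.
// `bytes` is assumed to be a Uint8Array (a Node.js Buffer also works).
function looksLikeUtf8(bytes) {
  try {
    // fatal: true makes decode() throw a TypeError on any malformed sequence
    new TextDecoder('utf-8', { fatal: true }).decode(bytes);
    return true;
  } catch (e) {
    return false;
  }
}

const utf8Bytes = new Uint8Array([0xE4, 0xB8, 0xAD, 0xE6, 0x96, 0x87]); // "中文" in UTF-8
const gbkBytes  = new Uint8Array([0xD6, 0xD0, 0xCE, 0xC4]);             // "中文" in GBK
console.log(looksLikeUtf8(utf8Bytes)); // true
console.log(looksLikeUtf8(gbkBytes));  // false
```

Pure ASCII data also passes this check, since ASCII is a subset of UTF-8. A file with no BOM that fails the check is then usually treated as ANSI, which is the fallback the scripts in the replies use.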

xp3000

Second Lieutenant

#19

Posted 2023-1-8 09:47

I learned a lot. Thanks, everyone.

terse

Lieutenant General

Rank: 8

Posts: 2339
Points: 9739
Tech: 475
Donations: 0
Registered: 2008-2-25

#18

Posted 2023-1-7 19:00

Here is a JS version. For now it handles UTF-32 files in a crude way.

1>1/* :
@echo off
set "ph=%cd%\Result\"
set "enc=gb2312"
md "%ph%" 2>nul
dir /b /a-d *.txt *.jpg| cscript -nologo -e:jscript %0 "%ph%" "%enc%"
pause & exit
*/
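// ADODB's 'Latin1' charset apparently decodes bytes 0x80-0x9F as Windows-1252 characters;
// this table (indexed by byte - 0x80) lets getEncoding map them back to byte values.
var cp1252 = "\u20AC\u0081\u201A\u0192\u201E\u2026\u2020\u2021\u02C6\u2030\u0160\u2039\u0152\u008D\u017D\u008F\u0090\u2018\u2019\u201C\u201D\u2022\u2013\u2014\u02DC\u2122\u0161\u203A\u0153\u009D\u017E\u0178"; 
// reg matches the BOM written as unpadded hex (a 00 byte prints as "0"); re matches lines such as "PS2：..."
var reg = /^(fffe00|00feff|efbbbf|fffe|feff)/i, re = /(?:^|\n)(PS\d*：.*)/g,
charsets = { 'fffe00' : 'unicodefffe', '00feff' : 'unicodefeff', 'efbbbf' : "UTF-8",  'fffe' : 'unicodefffe', 'feff' : 'unicodefeff' };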
function getText(file, enc) {
       var i = 0, stream, bin, count, content, hex='';
       stream = new ActiveXObject("ADODB.Stream");
       stream.type = 2;
       stream.charset = 'Latin1';
       stream.open();
       stream.loadFromFile(file);
       bin= stream.ReadText(-1);
       count = (bin.length > 4096) ? 4096 : bin.length;
       for (;i<4;) hex += bin.charCodeAt(i++).toString(16);
       var bom = reg.test(hex) ? hex.match(reg)[0] : '';
       stream.Position = 0;
       stream.Type = 2;
       stream.charset = bom ? charsets[bom] : getEncoding(bin, count) ? 'UTF-8' : 'gbk';
       content = stream.readText(-1);
       if (/00/.test(bom))
       {
            if (bom == '00feff') { content = content.slice(2) };
            content = content.replace(/\x00/g, '');
       }
       stream.Close();
       return content;
}

function getEncoding(b, len) {
       var n = 1;
       for ( var i = 0; i < len; i++)
       {
               var byt = (b.charCodeAt(i) <= 255) ? b.charCodeAt(i):
                     128+cp1252.indexOf(b.charAt(i));
               if (n == 1)
               {
                      if (byt >= 0x80)
                      {
                          while (((byt <<= 1) & 0x80) != 0) {n++}
                          if (n == 1 || n > 6) { return false }
                      }
               } else {
                   if ((byt & 0xC0) != 0x80) { return false}
                   n--;
               }
       }
       return true;
}
var enc = WScript.Arguments(1);
while (!WScript.StdIn.AtEndOfStream){
       var file = WScript.StdIn.ReadLine();
       var text = getText( file ).replace(re, '');
       var path = WScript.Arguments(0) + file;
       var stream = new ActiveXObject("ADODB.Stream");
       stream.type = 2;
       stream.charset = enc;
       stream.open();
       stream.writetext(text);
       stream.SaveToFile(path, 2);
       stream.Close();
}

Ratings: 1

xp3000: Helpful, Tech +1

czjt1234

Captain

Rank: 5

Posts: 942
Points: 1407
Tech: 101
Donations: 0
Registered: 2010-4-30

#17

Posted 2023-1-6 22:27

Reply to #16 WHY

That is why I said it is such a hassle.

Saved for future use.

QQ 20147578

WHY

Colonel

Rank: 6

Posts: 1442
Points: 3193
Tech: 556
Donations: 0
Registered: 2015-7-19

#16

Posted 2023-1-6 21:03

Here is a VBS version.

On Error ReSume Next
Dim fso, myDir, dstFolder
Set fso = CreateObject("Scripting.FileSystemObject")
myDir = fso.GetFile(WSH.ScriptFullName).ParentFolder.Path           'Folder containing this script
dstFolder = myDir & "\Result"                                       'Output folder
If Not fso.FolderExists(dstFolder) Then fso.CreateFolder(dstFolder) 'Create the output folder

Dim objFile
For Each objFile In fso.GetFolder(myDir).Files
    If LCase(Right(objFile.Name, 4)) = ".txt" Then
        If objFile.Size > 0 Then
            CheckEncoding objFile.Path, dstFolder & "\" & objFile.Name
        End If
    End If
Next

Function DeleteStr(ByRef str)
    Dim reg, arrIn, n, i, arrOut()
    Set reg = New RegExp
    reg.IgnoreCase = True
    reg.Pattern = "PS[0-9]*："      'Delete lines containing "PS" + digits + "："
    str = Replace(str, vbCrLf, vbLf)
    arrIn = Split(str, vbLf)
    n = 0
    For i = 0 To UBound(arrIn)
        If Not reg.Test(arrIn(i)) Then
            ReDim PreServe arrOut(n)
            arrOut(n) = arrIn(i)
            n = n + 1
        End If
    Next
    DeleteStr = Join(arrOut, vbCrLf)
End Function

Function ConvertUtf32ToUtf16(srcFile, dstFile, encName)
    Dim xmlDoc, node
    Set xmlDoc = CreateObject("MSXML2.DOMDocument")
    Set node = xmlDoc.CreateElement("binary")
    node.DataType = "bin.hex"
    Dim ado, sz, i, j, arr()
    Set ado = CreateObject("ADODB.Stream")
    ado.Type = 1
    ado.Open
    ado.LoadFromFile srcFile
    sz = ado.Size
    ReDim arr(sz\4)
    Dim h(3)
    For i = 1 To sz Step 4
        For j = 0 To 3
            h(j) = Right("00" & Hex(AscB(ado.Read(1))), 2)
        Next
        If encName = "UTF32LE" Then
            arr(i\4) = h(0) & h(1)
        ElseIf encName = "UTF32BE" Then
            arr(i\4) = h(2) & h(3)
        End If
    Next
    node.Text = Join(arr, "")
    ado.Position = 0
    ado.Write node.NodeTypedValue
    ado.SetEOS()
    ado.SaveToFile dstFile, 2
    ado.Close()
    SaveFileUtf16ToAnsi dstFile, dstFile
End Function

Function SaveFileUtf16ToAnsi(srcFile, dstFile)
    Dim f, str
    Set f = fso.OpenTextFile(srcFile, 1, True, -1)
    str = DeleteStr(f.ReadAll)
    f.Close
    fso.OpenTextFile(dstFile, 2, True).Write(str)
End Function

Function SaveFileUtf8ToAnsi(srcFile, dstFile, charset)
    Dim ado, str
    Set ado = CreateObject("ADODB.Stream")
    ado.Type = 2
    ado.CharSet = charset
    ado.Open
    ado.LoadFromFile srcFile
    str = ado.ReadText(-1)
    ado.Position = 0
    ado.CharSet = "GB2312"
    ado.WriteText DeleteStr(str)
    ado.SetEOS
    ado.SaveToFile dstFile, 2
    ado.Close
End Function

Function SaveFileAnsiToAnsi(srcFile, dstFile)
    Dim f, str
    Set f = fso.OpenTextFile(srcFile, 1, True)
    str = DeleteStr(f.ReadAll)
    f.Close
    fso.OpenTextFile(dstFile, 2, True).Write(str)
End Function

Function CheckEncoding(srcFile, dstFile)
    Dim ado, i, BOM
    Set ado = CreateObject("ADODB.Stream")
    ado.Type = 1
    ado.Open
    ado.LoadFromFile srcFile
    For i = 0 To 3
        BOM = BOM & Right("00" & Hex(AscB(ado.Read(1))), 2)
    Next
    If BOM = "FFFE0000" Then
        ado.Close
        ConvertUtf32ToUtf16 srcFile, dstFile, "UTF32LE"
    ElseIf BOM = "0000FEFF" Then
        ado.Close
        ConvertUtf32ToUtf16 srcFile, dstFile, "UTF32BE"
    ElseIf Left(BOM, 4) = "FFFE" or Left(BOM, 4) = "FEFF" Then
        ado.Close
        SaveFileUtf16ToAnsi srcFile, dstFile   'UNICODE
    ElseIf Left(BOM, 6) = "EFBBBF" Then
        ado.Close
        SaveFileUtf8ToAnsi srcFile, dstFile, "UTF-8"
    ElseIf Left(BOM, 6) = "2B2F76" Then
        ado.Close
        SaveFileUtf8ToAnsi srcFile, dstFile, "UTF-7"
    Else
        Dim sz, arr()
        ado.Position = 0
        sz = ado.Size
        ReDim arr(sz-1)
        For i = 1 To sz
            arr(i-1) = ChrW(AscB(ado.Read(1)))
        Next
        If isUTF8(arr) Then
            ado.Close
            SaveFileUtf8ToAnsi srcFile, dstFile, "UTF-8"
        Else
            ado.Close
            SaveFileAnsiToAnsi srcFile, dstFile  'ANSI
        End If
    End If
End Function

Function isUTF8(ByRef arr)
    Dim s, reg
    s = "[\xC0-\xDF](?:[^\x80-\xBF]|$)"
    s = s & "|[\xE0-\xEF].{0,1}(?:[^\x80-\xBF]|$)"
    s = s & "|[\xF0-\xF7].{0,2}(?:[^\x80-\xBF]|$)"
    s = s & "|[\xF8-\xFB].{0,3}(?:[^\x80-\xBF]|$)"
    s = s & "|[\xFC-\xFD].{0,4}(?:[^\x80-\xBF]|$)"
    s = s & "|[\xFE-\xFE].{0,5}(?:[^\x80-\xBF]|$)"
    s = s & "|[\x00-\x7F][\x80-\xBF]"
    s = s & "|[\xC0-\xDF].[\x80-\xBF]"
    s = s & "|[\xE0-\xEF]..[\x80-\xBF]"
    s = s & "|[\xF0-\xF7]...[\x80-\xBF]"
    s = s & "|[\xF8-\xFB]....[\x80-\xBF]"
    s = s & "|[\xFC-\xFD].....[\x80-\xBF]"
    s = s & "|[\xFE-\xFE]......[\x80-\xBF]"
    s = s & "|^[\x80-\xBF]"
    Set reg = New RegExp
    reg.Pattern = s
    isUTF8 = Not reg.Test(Join(arr, ""))
End Function

MsgBox "Done"

Ratings: 1

xp3000: Helpful, Tech +1

czjt1234

Captain

#15

Posted 2023-1-3 15:46

Reply to #14 xp3000

I see. Never mind, then. VBS has no COM object that can read or write UTF-32, so you would have to read the binary data and decode it one code unit at a time, which is too tedious.
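For comparison, decoding UTF-32 by hand is short in a JavaScript runtime that has typed arrays (Node.js or a browser, not Windows Script Host). The sketch below is an illustration only: decodeUtf32 is a made-up name, and the byte order is assumed to be known already, for example from the BOM.

```javascript
// Sketch: decode UTF-32 bytes manually. decodeUtf32 is a hypothetical name.
// `bytes` is a Uint8Array; `littleEndian` selects UTF-32LE (true) or UTF-32BE (false).
function decodeUtf32(bytes, littleEndian) {
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  let out = '';
  // Trailing bytes that do not fill a whole 4-byte unit are ignored
  for (let i = 0; i + 4 <= bytes.byteLength; i += 4) {
    const cp = view.getUint32(i, littleEndian);
    if (i === 0 && cp === 0xFEFF) continue;          // skip a leading BOM
    const invalid = cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF);
    out += invalid ? '\uFFFD' : String.fromCodePoint(cp); // replace invalid code points
  }
  return out;
}

const le = new Uint8Array([0xFF, 0xFE, 0, 0, 0x41, 0, 0, 0, 0x2D, 0x4E, 0, 0]);
console.log(decodeUtf32(le, true)); // A中
```

Unlike approaches that keep only two bytes of each unit, this also handles characters outside the Basic Multilingual Plane.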


xp3000

Second Lieutenant

#14

Posted 2023-1-3 11:01

VBS is fine too. It is mainly that I can read a little JS.

czjt1234

Captain

#13

Posted 2023-1-3 10:35

Reply to #12 xp3000

Why does it have to be JS? I tested the file you posted, and VBS can handle it.

xp3000

Second Lieutenant

#12

Posted 2023-1-3 09:21

Can this be done with BAT+JS?

xp3000

Second Lieutenant

#11

Posted 2023-1-3 09:18

Last edited by xp3000 on 2023-1-3 09:20

Reply to #10 czjt1234
https://cowtransfer.com/s/215a0b55e04c4e Click the link to view [ UTF-32.zip ], or visit CowTransfer at cowtransfer.com and enter the transfer code mdty5z.

czjt1234

Captain

#10

Posted 2023-1-3 07:06

Reply to #6 xp3000

When you have time, upload a file to Baidu Netdisk and share the link.
I would like to see what a UTF-32 file looks like; Notepad++ does not support it.
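One way to see what a UTF-32 file looks like without a sample file is to build the bytes yourself. The sketch below (Node.js or a browser; encodeUtf32LE and hexDump are names invented for this example) encodes a short string as UTF-32LE with a BOM and prints it as hex.

```javascript
// Sketch: build the bytes of a UTF-32LE file (with BOM) to inspect its layout.
function encodeUtf32LE(text) {
  const cps = Array.from(text, ch => ch.codePointAt(0)); // iterate by code point, not UTF-16 unit
  const bytes = new Uint8Array(4 + cps.length * 4);
  const view = new DataView(bytes.buffer);
  view.setUint32(0, 0xFEFF, true);                        // BOM, stored as FF FE 00 00
  cps.forEach((cp, i) => view.setUint32(4 + i * 4, cp, true));
  return bytes;
}

// Format bytes as upper-case, space-separated hex pairs
function hexDump(bytes) {
  return Array.from(bytes, b => b.toString(16).padStart(2, '0').toUpperCase()).join(' ');
}

console.log(hexDump(encodeUtf32LE('A中')));
// FF FE 00 00 41 00 00 00 2D 4E 00 00
```

Every character occupies four bytes, so ordinary text is mostly zero bytes.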


hlzj88

Major

Rank: 6

Posts: 820
Points: 1616
Tech: 51
Donations: 20
Registered: 2008-11-3

#9

Posted 2023-1-2 23:57

Reply to #8 WHY

Very impressive.

Goal: learn batch scripting

WHY

Colonel

#8

Posted 2023-1-2 23:41

Last edited by WHY on 2023-1-3 14:15

Test.ps1: right-click it and choose Run with PowerShell. Detection of UTF-7 encoding has been added.

function Get-Encoding($filePath){
    $reg = '[\xC0-\xDF](?:[^\x80-\xBF]|$)';
    $reg += '|[\xE0-\xEF].{0,1}(?:[^\x80-\xBF]|$)';
    $reg += '|[\xF0-\xF7].{0,2}(?:[^\x80-\xBF]|$)';
    $reg += '|[\xF8-\xFB].{0,3}(?:[^\x80-\xBF]|$)';
    $reg += '|[\xFC-\xFD].{0,4}(?:[^\x80-\xBF]|$)';
    $reg += '|[\xFE-\xFE].{0,5}(?:[^\x80-\xBF]|$)';
    $reg += '|[\x00-\x7F][\x80-\xBF]';
    $reg += '|[\xC0-\xDF].[\x80-\xBF]';
    $reg += '|[\xE0-\xEF]..[\x80-\xBF]';
    $reg += '|[\xF0-\xF7]...[\x80-\xBF]';
    $reg += '|[\xF8-\xFB]....[\x80-\xBF]';
    $reg += '|[\xFC-\xFD].....[\x80-\xBF]';
    $reg += '|[\xFE-\xFE]......[\x80-\xBF]';
    $reg += '|^[\x80-\xBF]';
    $byte = [IO.File]::ReadAllBytes($filePath);
    $BOM  = [BitConverter]::ToString($byte[0..3]);
    If ($BOM -eq 'FF-FE-00-00'){
        return (New-Object System.Text.UTf32Encoding $false, $true); #UTF32LE with BOM
    } elseIf ($BOM -eq '00-00-FE-FF'){
        return (New-Object System.Text.UTf32Encoding $true, $true);  #UTF32BE with BOM
    } elseIf ($BOM.StartsWith('FF-FE') -or $BOM.StartsWith('FE-FF')){
        return [Text.Encoding]::GetEncoding('UNICODE');              #UTF16 with BOM
    } elseIf ($BOM.StartsWith('EF-BB-BF')){
        return [Text.Encoding]::GetEncoding('UTF-8');                #UTF8 with BOM
    } elseIf ($BOM.StartsWith('2B-2F-76')){
        return [Text.Encoding]::GetEncoding('UTF-7');                #UTF7 with BOM
    } else {
        $m = [regex]::Match([char[]]$byte -join '', $reg);
        If ($m.Success){
            return [Text.Encoding]::GetEncoding('GB2312');           #ANSI
        } else {
            return [Text.Encoding]::GetEncoding('UTF-8');            #UTF8 without BOM
        }
    }
}

$path = $MyInvocation.MyCommand.Path -replace '\\[^\\]*$', '\';     #Folder containing this script
$dstFolder = $path + 'Result\';                                     #Output folder
if(![IO.Directory]::Exists($dstFolder)){$null = md $dstFolder};     #Create the output folder

forEach( $file In (dir -Literal $path -Filter *.txt) ){
    $enc = Get-Encoding $file.FullName;                             #Detect the encoding
    $arr = [IO.File]::ReadAllLines($file.FullName, $enc);
    $arr = $arr -NotMatch 'PS[0-9]*：';   #Delete lines containing 'PS' + digits + '：'
    #Save as ANSI
    [IO.File]::WriteAllLines($dstFolder + $file.Name, $arr, [Text.Encoding]::GetEncoding('GB2312'));
}

echo 'Done';
[Console]::ReadLine();

Ratings: 2

xp3000: Helpful, Tech +1
hlzj88: Awesome, Tech +1

hlzj88

Major

#7

Posted 2023-1-2 23:19

findstr /i "的" 1.txt&&goto wc || @iconv -c -f utf-8 -t GBK 1.txt>>gb1.txt
findstr /i "的" gb1.txt&&move /y gb1.txt 1.txt&&goto wc || @iconv -c -f utf-32 -t GBK 1.txt>>gb2.txt
findstr /i "的" gb2.txt&&move /y gb2.txt 1.txt&&goto wc || @iconv -c -f UCS-2LE -t GBK 1.txt>>gb3.txt
findstr /i "的" gb3.txt&&move /y gb3.txt 1.txt
:wc
del /q gb*.txt
echo 完成
findstr /iv "ps2 ps3 ps" 1.txt>>2.txt
pause

http://bcn.bathome.net/s/tool/index.html?key=iconv

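This is a clumsy way of handling the file. It tested fine with UTF-8 and Unicode. The iconv tool supports UTF-32, but I do not have such a file and do not know how to save one, so that part is written purely on guesswork.

I do not know what other formats there might be, so I did not write more. The script searches for the character 的 because it is the most frequent character in Chinese text. For English text you could search for the word "the".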
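The same "try an encoding, then look for a very common character" idea can be sketched in JavaScript. This is an illustration, not a port of the batch file: guessByMarker is a made-up name, and only the two encodings that every Node.js build of TextDecoder supports are tried here, whereas the batch file relies on iconv for GBK and UTF-32.

```javascript
// Sketch of the marker-character heuristic. guessByMarker is a hypothetical name.
// `bytes` is a Uint8Array or Buffer; candidates are tried in order.
function guessByMarker(bytes, marker = '的', candidates = ['utf-8', 'utf-16le']) {
  for (const enc of candidates) {
    // Non-fatal decoding: invalid bytes become U+FFFD instead of throwing
    const text = new TextDecoder(enc).decode(bytes);
    if (text.includes(marker)) return enc;
  }
  return null; // no candidate produced the marker character
}

const markerSample = '这是我的文件';
console.log(guessByMarker(Buffer.from(markerSample, 'utf8')));    // utf-8
console.log(guessByMarker(Buffer.from(markerSample, 'utf16le'))); // utf-16le
```

The heuristic fails on short files that happen not to contain the marker, so a BOM check is still worth doing first.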

Ratings: 1

xp3000: Helpful, Tech +1

xp3000

Second Lieutenant

#6

Posted 2023-1-2 22:52

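\x2B\x2F\x76 means the file is UTF-7 encoded
\xEF\xBB\xBF means the file is UTF-8 with BOM
\xFE\xFF means the file is UTF-16 BE encoded
\xFF\xFE means the file is UTF-16 LE encoded
\x00\x00\xFE\xFF means the file is UTF-32 BE encoded
\xFF\xFE\x00\x00 means the file is UTF-32 LE encoded

I cannot upload files. I looked it up, and this is how the files begin: if you open a file in a binary or hex viewer, the first bytes are the values above with the \x removed.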
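The signature list above maps naturally onto a small lookup table. The sketch below is one possible arrangement (sniffBom and BOMS are names invented here); it assumes the first bytes of the file are available as an array of numbers or a Uint8Array.

```javascript
// Sketch: BOM sniffing driven by the signature list. sniffBom is a hypothetical name.
// Longer signatures come first, because FF FE 00 00 (UTF-32LE) starts with FF FE (UTF-16LE).
const BOMS = [
  { name: 'UTF-32BE', sig: [0x00, 0x00, 0xFE, 0xFF] },
  { name: 'UTF-32LE', sig: [0xFF, 0xFE, 0x00, 0x00] },
  { name: 'UTF-8',    sig: [0xEF, 0xBB, 0xBF] },
  { name: 'UTF-7',    sig: [0x2B, 0x2F, 0x76] },
  { name: 'UTF-16BE', sig: [0xFE, 0xFF] },
  { name: 'UTF-16LE', sig: [0xFF, 0xFE] },
];

function sniffBom(bytes) {
  // A signature matches when every one of its bytes equals the byte at the same offset
  const hit = BOMS.find(b => b.sig.every((v, i) => bytes[i] === v));
  return hit ? hit.name : null; // null: no BOM, fall back to content-based guessing
}

console.log(sniffBom([0xFF, 0xFE, 0x00, 0x00])); // UTF-32LE
console.log(sniffBom([0xFF, 0xFE, 0x41, 0x00])); // UTF-16LE
```

A UTF-16LE file whose first character is U+0000 would be misread as UTF-32LE, but that is very unusual for text files.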
