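wincp.exe: a Windows code page conversion tool
[i=s] This post was last edited by happy886rr on 2017-6-1 18:43 [/i][tvcp has been renamed to wincp, several bugs have been fixed, and the version is now 1.1]
wincp is a code page conversion tool. It converts text encodings, rewrites, forges or strips the BOM, auto-corrects parameters, and automatically skips an existing BOM.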
[quote]
Download: save the externally hosted image below as a.zip and unzip it.
[img]http://i1.piimg.com/1949/e42768f0e3f9b852.png[/img]
WINCP.EXE (TEXT CODEPAGE CONVERSION TOOL, BY LEO, VERSION 1.1)
Summary:
=========================================================================
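A code page conversion tool: it converts text encodings, rewrites, forges or strips
the BOM, auto-corrects parameters, and skips an existing BOM automatically.
Beyond plain encoding conversion, it can also be used for code page translation,
forged or custom BOMs, and as a crude form of encryption.
Note: a code_page parameter can be a code page number such as 936 or a code page
abbreviation such as GBK; see Notes (common code page abbreviations) for the mapping.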
=========================================================================
Usage:
-------------------------------------------------------------------------
wincp [input_file] -f [code_page] -t [code_page] -s [skip_number] -b[fill_BOM] -o [out_file]
-------------------------------------------------------------------------
-f From the code page
-t Translate to the code page
-s Skip the number of bytes
-b Filling BOM
-o Output file name
-h Show help information
-------------------------------------------------------------------------
Examples:
-------------------------------------------------------------------------
REM Convert test.txt from BIG5 to UTF-8
wincp test.txt -o out.txt -f BIG5 -t UTF8
REM Convert test.txt from ANSI (here GBK, code page 936) to UTF-8
wincp test.txt -o out.txt -f 936 -t 65001
REM Convert test.txt from UTF-8 to UCS-2LE (what Windows usually calls UNICODE) and write the BOM bytes FF FE
wincp test.txt -f 65001 -t 1200 -s 0 -b 0xFFFE -o out.txt
REM Convert test.txt from big-endian UNICODE to UTF-8
wincp test.txt -o out.txt -f UCS2BE -t UTF8
wincp test.txt -oout.txt -fUNICODEBE -tUTF8
REM Strip the BOM from test.txt
wincp test.txt -oout.txt
REM Forge a BOM
wincp test.txt -oout.txt -b0xADFF0000
...
-------------------------------------------------------------------------
Notes: (common code page abbreviations)
-------------------------------------------------------------------------
ANSI 0
GBK 936
GB18030 54936
BIG5 950
UNICODE UTF16 UCS2 1200
UNICODEBE UTF16BE UCS2BE 1201
UTF8 65001
UTF7 65000
UTF32 12000
UTF32BE 12001
-------------------------------------------------------------------------
Code pages: (general code page reference table)
-------------------------------------------------------------------------
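437 — Original IBM PC code page, implementing the extended ASCII character set
737 — Greek
850 — Latin-1 (Western European languages)
852 — Latin-2 (Central and Eastern European languages)
855 — Cyrillic
857 — Turkish
858 — "Multilingual" with the euro sign
860 — Portuguese
861 — Icelandic
863 — Canadian French
865 — Nordic
866 — Cyrillic
869 — Greek
874 — Thai
932 — Japanese
949 — Korean
936 — GBK Simplified Chinese
950 — BIG5 Traditional Chinese
1200 — UCS-2LE Unicode, little endian
1201 — UCS-2BE Unicode, big endian
1250 — Central and Eastern European Latin
1251 — Cyrillic
1252 — Western European Latin (close to, but not identical with, ISO-8859-1)
1253 — Greek
1254 — Turkish
1255 — Hebrew
1256 — Arabic
1257 — Baltic
1258 — Vietnamese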
10000 — Macintosh Roman encoding (followed by several other Mac character sets)
10007 — Macintosh Cyrillic encoding
10029 — Macintosh Central European encoding
12000 — utf-32 Unicode UTF-32, little endian byte order; available only to managed applications
12001 — utf-32BE Unicode UTF-32, big endian byte order; available only to managed applications
28591 — iso-8859-1 ISO 8859-1 Latin 1; Western European (ISO)
51936 — EUC-CN EUC Simplified Chinese; Chinese Simplified (EUC)
54936 — GB18030
65000 — UTF-7 Unicode
65001 — UTF-8 Unicode
-------------------------------------------------------------------------
BOM: (common byte order marks)
-------------------------------------------------------------------------
UTF-8 EF BB BF
UTF-16 (LE) FF FE
UTF-16 (BE) FE FF
UTF-32 (LE) FF FE 00 00
UTF-32 (BE) 00 00 FE FF
UTF-7 2B 2F 76 +[38|39|2B|2F]
UTF-1 F7 64 4C
UTF-EBCDIC DD 73 66 73
SCSU 0E FE FF
BOCU-1 FB EE 28 (+FF)
GB-18030 84 31 95 33
-------------------------------------------------------------------------
Version:
VERSION 1.1
[/quote]
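The automatic BOM offset mentioned in the summary amounts to matching the first bytes of a file against the byte order marks listed above. The following is a minimal portable sketch of that idea, not the code wincp itself uses: the helper name bom_length is hypothetical, and the UTF-7, UTF-1, UTF-EBCDIC, SCSU and BOCU-1 marks are left out for brevity.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical helper: returns how many leading bytes of `data` form a known
// BOM, or 0 if the buffer does not start with one.
std::size_t bom_length(const unsigned char* data, std::size_t size) {
    assert(data != nullptr || size == 0);
    struct Bom { unsigned char bytes[4]; std::size_t len; };
    // Longer marks come first, so FF FE 00 00 (UTF-32 LE) is not
    // mistaken for FF FE (UTF-16 LE).
    static const Bom known[] = {
        {{0xFF, 0xFE, 0x00, 0x00}, 4},  // UTF-32 (LE)
        {{0x00, 0x00, 0xFE, 0xFF}, 4},  // UTF-32 (BE)
        {{0x84, 0x31, 0x95, 0x33}, 4},  // GB-18030
        {{0xEF, 0xBB, 0xBF, 0x00}, 3},  // UTF-8
        {{0xFF, 0xFE, 0x00, 0x00}, 2},  // UTF-16 (LE)
        {{0xFE, 0xFF, 0x00, 0x00}, 2},  // UTF-16 (BE)
    };
    for (const Bom& b : known) {
        if (size >= b.len && std::memcmp(data, b.bytes, b.len) == 0) {
            return b.len;
        }
    }
    return 0;
}
```

Passing the first four bytes of a file to bom_length gives the number of bytes to skip before conversion. One caveat of this ordering: a UTF-16 (LE) file whose first character after the BOM is U+0000 would be taken for UTF-32 (LE).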
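The source builds in both ANSI (single-byte) and Unicode (wide-character) modes and compiles with the common Windows compilers.[code]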
/*
TEXT CODEPAGE CONVERSION TOOL, COPYRIGHT@2017~2019 BY LEO, VERSION 1.1
WINCP.EXE
UNICODE COMPILATION:
==> G++ wincp.cpp -D _UNICODE -D UNICODE -municode -O2 -static
==> CL wincp.cpp /O2 /Oy- /D "WIN32" /D "NDEBUG" /D "_CONSOLE" /D "_UNICODE" /D "UNICODE" /MD
ANSI COMPILATION:
==> G++ wincp.cpp -O2 -static
==> CL wincp.cpp /O2 /Oy- /D "WIN32" /D "NDEBUG" /D "_CONSOLE" /MD
*/
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <windows.h>
#include <locale.h>
#include <ctype.h>
#include <tchar.h>
#include <time.h>
#if !defined(_MSC_VER) && !defined(bool)
#include <stdbool.h>
#endif
#if !defined(_WIN32) && !defined(WIN32) && !defined(__WIN32__)
#error This program only builds on Windows
#endif
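/*************** Macro definitions ***************/
// Maximum input file size in bytes (1 MB)
#define MAX_FILE_SIZE (1024 * 1024)
// Standard line length
#define BUFF_SIZE 1024
// Size of the BOM buffer
#define BOMS_SIZE 4
// Encoding detection threshold (bytes)
#define CHECK_SIZE 16383
// Help text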
#define HELP_INFORMATION _T("\
wincp v1.1 - Console text codepage conv tool - Copyright (C) 2017-2019 by LEO\n\
Usage: wincp [input_file] -f [code_page] -t [code_page] -s [skip_number] -b[fill_BOM] -o [out_file]\n\
\n\
General options:\n\
-f From the code page\n\
-t Translate to the code page\n\
-s Skip the number of bytes\n\
-b Filling BOM\n\
-o Output file name\n\
-h Show help information\n\
\n\
Official website:\n\
http://www.bathome.net/thread-44343-1-1.html\n\
")
/*
Microsoft code pages:\n\
897 – IBM-PC SBCS Japanese (JIS X 0201-1976)
941 – IBM-PC Japanese DBCS for Open environment
947 – IBM-PC DBCS for (Big5 encoding)
950 – Traditional Chinese MIX (Big5 encoding) (1114 + 947) (same with euro: 1370)
1114 – IBM-PC SBCS (Simplified Chinese; GBK; Traditional Chinese; Big5 encoding)
1126 – IBM-PC Korean SBCS
1162 – Windows Thai (Extension of 874; but still called that in Windows)
1169 – Windows Cyrillic Asian
1250 – Windows Central Europe
1251 – Windows Cyrillic
1252 – Windows Western
1253 – Windows Greek
1254 – Windows Turkish
1255 – Windows Hebrew
1256 – Windows Arabic
1257 – Windows Baltic
1258 – Windows Vietnamese
1361 – Korean (JOHAB)
1362 – Korean Hangul DBCS
1363 – Windows Korean (1126 + 1362) (Windows CP 949)
1372 – IBM-PC MS T Chinese Big5 encoding (Special for DB2)
1373 – Windows Traditional Chinese (extension of 950)
1374 – IBM-PC DB Big5 encoding extension for HKSCS
1375 – Mixed Big5 encoding extension for HKSCS (intended to match 950)
1385 – IBM-PC Simplified Chinese DBCS (Growing CS for GB18030, also used for GBK PC-DATA.)
1386 – IBM-PC Simplified Chinese GBK (1114 + 1385) (Windows CP 936)
1391 – Simplified Chinese 4 Byte (Growing CS for GB18030, also used for GBK PC-DATA.)
1392 – IBM-PC Simplified Chinese MIX (1252 + 1385 + 1391)
...
*/
// Return codes of the option parser
#define _OPT_TEOF -1
#define _OPT_TILL -2
#define _OPT_TERR -3
// Option parser state
int OPTIND=1, OPTOPT, UNOPTIND=-1;
TCHAR* OPTARG;
#if defined(_UNICODE) || defined(UNICODE)
#define TCHARFORMAT WCHAR
#else
#define TCHARFORMAT CHAR
#endif
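// Pack the first four bytes of a BOM buffer into a big-endian unsigned int
#define BOM2UINT(x) (((unsigned int)(unsigned char)(x)[0]<<24)|((unsigned int)(unsigned char)(x)[1]<<16)|((unsigned int)(unsigned char)(x)[2]<<8)|((unsigned int)(unsigned char)(x)[3]))
/*************** Helper functions ***************/
// Check for a plain non-negative integer: returns 0 if it is one, 1 if not, -1 if empty or negative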
int _istPositiveNumber(TCHAR* instr)
{
// Skip leading whitespace
while(_istspace(*instr))
{
instr++;
}
// Reject empty strings and negative numbers
if(*instr == _T('\0') || *instr == _T('-'))
{
return -1;
}
// Consume consecutive digits
while(_istdigit(*(instr)))
{
instr++;
}
// A plain number ends exactly at the terminator
return (*instr == _T('\0')) ?0 :1;
}
// Resolve a code page number or abbreviation; returns -1 on failure
int _tgetCP(TCHAR* instr)
{
// Null pointer
if(instr == NULL)
{
return -1;
}
// Return value
int retCP;
switch(_istPositiveNumber(instr))
{
case -1:
return -1;
case 0:
return _ttoi((TCHARFORMAT*)instr);
case 1:
break;
}
if (_tcsicmp(instr, _T("ANSI") ) ==0)
{
retCP=CP_ACP;
}
else if(_tcsicmp(instr, _T("GBK") ) ==0)
{
retCP=936;
}
else if(_tcsicmp(instr, _T("GB18030")) ==0)
{
retCP=54936;
}
else if(_tcsicmp(instr, _T("BIG5") ) ==0)
{
retCP=950;
}
else if(
_tcsicmp(instr, _T("UNICODE") ) ==0 ||
_tcsicmp(instr, _T("UTF16") ) ==0 ||
_tcsicmp(instr, _T("UCS2") ) ==0
)
{
retCP=1200;
}
else if(
_tcsicmp(instr, _T("UNICODEBE")) ==0 ||
_tcsicmp(instr, _T("UTF16BE") ) ==0 ||
_tcsicmp(instr, _T("UCS2BE") ) ==0
)
{
retCP=1201;
}
else if(_tcsicmp(instr, _T("UTF7") ) ==0)
{
retCP=65000;
}
else if(_tcsicmp(instr, _T("UTF8") ) ==0)
{
retCP=65001;
}
else if(_tcsicmp(instr, _T("UTF32") ) ==0)
{
retCP=12000;
}
else if(_tcsicmp(instr, _T("UTF32BE")) ==0)
{
retCP=12001;
}
else
{
retCP=-1;
}
return retCP;
}
// Convert a hex digit character to its value (-1 if invalid)
int C2HEX(TCHAR intc)
{
int hret=-1;
if (_T('0')<=intc && intc<=_T('9'))
{
hret=intc-48;
}
else if(_T('A')<=intc && intc<=_T('F'))
{
hret=intc-55;
}
else if(_T('a')<=intc && intc<=_T('f'))
{
hret=intc-87;
}
else
{
hret=-1;
}
return hret;
}
// Parse a hex BOM string into bytes; returns the number of complete bytes parsed
int TCHARRAY2BIN(TCHAR* instr, BYTE* &tainer)
{
memset(tainer, 0, BOMS_SIZE);
if(*instr == _T('x') || *instr == _T('X'))
{
instr ++;
}
if(_tcsnicmp(instr, _T("0x"), 2) ==0)
{
instr += 2;
}
int i=-1, hexNUM;
while(++i<BOMS_SIZE)
{
hexNUM=C2HEX(*instr++);
if(hexNUM != -1)
{
tainer[i] |= (hexNUM<<4);
}
else
{
break;
}
hexNUM=C2HEX(*instr++);
if(hexNUM != -1)
{
tainer[i] |= hexNUM;
}
else
{
break;
}
}
return i;
}
// Command-line option parser
int _tgetopt(int nargc, TCHAR* nargv[], TCHAR* ostr)
{
static TCHAR* place = (TCHAR*)_T("");
static TCHAR* lastostr = NULL;
TCHAR* oli = NULL;
if(ostr!=lastostr)
{
lastostr=ostr;
place=(TCHAR*)_T("");
}
if(!*place)
{
if(
(OPTIND>=nargc) ||
(*(place=nargv[OPTIND]) !=(TCHAR)_T('-')) ||
(!*(++place))
)
{
if(*place !=(TCHAR)_T('-') && OPTIND <nargc)
{
place =(TCHAR*)_T("");
if(UNOPTIND == -1)
{
UNOPTIND = OPTIND++;
return _OPT_TILL;
}
else
{
return _OPT_TERR;
}
}
place=(TCHAR*)_T("");
return _OPT_TEOF;
}
if (*place == (TCHAR)_T('-') && *(place+1) == (TCHAR)_T('\0'))
{
++OPTIND;
return _OPT_TEOF;
}
}
if (
(OPTOPT=*place++) == (TCHAR)_T(':') ||
!(oli=(TCHAR*)_tcschr((TCHARFORMAT*)ostr, (TCHAR)OPTOPT))
)
{
if(!*place)
{
++OPTIND;
}
}
if (oli != NULL && *(++oli) !=(TCHAR)_T(':'))
{
OPTARG=NULL;
if(!*place)
{
++OPTIND;
}
}
else
{
if(*place)
{
OPTARG=place;
}
else if(nargc <= ++OPTIND)
{
// Missing argument: do not reuse a stale OPTARG
place=(TCHAR*)_T("");
OPTARG=NULL;
}
else
{
OPTARG=nargv[OPTIND];
}
place=(TCHAR*)_T("");
++OPTIND;
}
return OPTOPT;
}
// Code page conversion
void PageTurnAround(const BYTE* input, int inputSIZE, int inPAGE, int outPAGE, BYTE* &outDATA, int &oLEN)
{
int wLEN;
char* outCACHE=NULL;
wchar_t* wcsCACHE=NULL;
if(inPAGE == outPAGE)
{
outDATA=(BYTE*)input, oLEN=inputSIZE;
return;
}
// Input is already UCS-2, so no decoding step is needed
if(inPAGE == 1200)
{
wcsCACHE=(wchar_t*)input;
wLEN=inputSIZE/2+1;
goto TOMCS;
}
if(inPAGE == 1201)
{
wchar_t* wp=(wchar_t*)input;
while(*wp)
{
*wp = (((*wp)&0x00FF)<<8)|(((*wp)&0xFF00)>>8);
wp ++;
}
wcsCACHE=(wchar_t*)input;
wLEN=inputSIZE/2+1;
goto TOMCS;
}
// Input code page -> intermediate UNICODE (UTF-16)
wLEN=MultiByteToWideChar(inPAGE, 0, (char*)input,-1, NULL, 0);
if(wLEN <1)
{
_ftprintf(stderr, _T("Unable to convert code page\n"));
exit(1);
}
wcsCACHE=(wchar_t*)malloc(wLEN * sizeof(wchar_t));
MultiByteToWideChar(inPAGE, 0, (char*)input, -1, wcsCACHE, wLEN);
TOMCS:
// Output is UCS-2, so no encoding step is needed
if(outPAGE == 1200)
{
outDATA=(BYTE*)wcsCACHE, oLEN=(wLEN-1)*2;
return;
}
if(outPAGE == 1201)
{
wchar_t* wp=(wchar_t*)wcsCACHE;
while(*wp)
{
*wp = (((*wp)&0x00FF)<<8)|(((*wp)&0xFF00)>>8);
wp ++;
}
outDATA=(BYTE*)wcsCACHE, oLEN=(wLEN-1)*2;
return;
}
// Intermediate UNICODE (UTF-16) -> output code page
int uLEN=WideCharToMultiByte(outPAGE, 0, wcsCACHE, -1, NULL, 0, NULL, NULL);
if(uLEN <1)
{
_ftprintf(stderr, _T("Unable to convert code page\n"));
exit(1);
}
outCACHE=(char*)malloc(uLEN);
WideCharToMultiByte(outPAGE, 0, wcsCACHE, -1, outCACHE, uLEN, NULL, NULL);
outDATA=(BYTE*)outCACHE, oLEN=uLEN-1;
return;
}
// Core text file conversion
bool ConveTextFile(TCHAR* inFILE, TCHAR* outFILE, int inPAGE, int outPAGE, int skipNUMBER, int binBOM_SIZE, BYTE* tainerBOM_BIN)
{
// Open the input file
FILE* inFP=_tfopen(inFILE, _T("rb"));
if(inFP == NULL)
{
_ftprintf(stderr, _T("Open input file error\n"));
exit(1);
}
// Get the input file size
fseek(inFP, 0, SEEK_END);
int fsize = ftell(inFP);
if(fsize > MAX_FILE_SIZE)
{
_ftprintf(stderr, _T("The input file is too large, can not be greater than %dKB\n"), MAX_FILE_SIZE/1024);
exit(1);
}
fseek(inFP, (long)skipNUMBER, SEEK_SET);
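// Allocate a zeroed text buffer with room for a wide NUL terminator,
// even when the byte count is odd
int readSIZE=fsize-skipNUMBER;
if(readSIZE < 0)
{
readSIZE=0;
}
BYTE* inDATA=(BYTE*)calloc(readSIZE+2*sizeof(wchar_t), 1);
// Read the text into memory; the untouched zero bytes act as the terminator
readSIZE=(int)fread(inDATA, sizeof(BYTE), readSIZE, inFP);
fclose(inFP);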
// Output buffer and length, filled in by the conversion
int oLEN=0;
BYTE* outDATA=NULL;
// Run the code page conversion
PageTurnAround(inDATA, readSIZE, inPAGE, outPAGE, outDATA, oLEN);
if(oLEN <1)
{
return false;
}
// Open the output file for writing
FILE* outFP=_tfopen(outFILE, _T("wb"));
if(outFP == NULL)
{
_ftprintf(stderr, _T("Open output file error\n"));
exit(1);
}
fwrite(tainerBOM_BIN, sizeof(BYTE), binBOM_SIZE, outFP);
fwrite(outDATA, sizeof(BYTE), oLEN, outFP);
fclose(outFP);
free(inDATA);
return true;
}
#if defined _MSC_VER
#else
extern "C"
#endif
//************* Program entry point *************/
int _tmain(int argc, TCHAR** argv)
{
if(argc<2)
{
// No arguments: print the help text and exit
_ftprintf(stdout, HELP_INFORMATION);
return 0;
}
// Settings parsed from the command line
TCHAR *opeOUTFILE=NULL, *opeINFILE=NULL;
int opeIN_PAGE=CP_ACP, opeOUT_PAGE=CP_ACP, opeSKIP_NUMBER=0, opeBOM_SIZE=0;
BYTE opeFLAG=0x00, *pTAINER=NULL, tainerBOM_BIN[BOMS_SIZE]= {0};
// Parse the switches
int K=_OPT_TEOF;
while( (K=_tgetopt(argc, argv, (TCHAR*)_T("f:t:s:b:o:hF:T:S:B:O:H"))) != _OPT_TEOF)
{
switch(K)
{
case _T('f'):
case _T('F'):
opeIN_PAGE =_tgetCP(OPTARG);
if(opeIN_PAGE == -1)
{
_ftprintf(stderr, _T("The switch '-f' needs a positive number\n"));
exit(1);
}
opeFLAG |= 0x01;
break;
case _T('t'):
case _T('T'):
opeOUT_PAGE =_tgetCP(OPTARG);
if(opeOUT_PAGE == -1)
{
_ftprintf(stderr, _T("The switch '-t' needs a positive number\n"));
exit(1);
}
opeFLAG |= 0x02;
break;
case _T('s'):
case _T('S'):
if(OPTARG == NULL)
{
_ftprintf(stderr, _T("The switch '-s' needs a positive number\n"));
exit(1);
}
opeSKIP_NUMBER = _ttoi((TCHARFORMAT*)OPTARG);
if(! (0<= opeSKIP_NUMBER && opeSKIP_NUMBER <=4 ) )
{
_ftprintf(stderr, _T("The switch '-s' needs a number between {0,4}\n"));
exit(1);
}
opeFLAG |= 0x04;
break;
case _T('b'):
case _T('B'):
if(OPTARG == NULL || _tcslen(OPTARG) > 10)
{
_ftprintf(stderr, _T("The switch '-b' needs a hex number of at most 4 bytes, e.g. 0xFFFE\n"));
exit(1);
}
pTAINER=(BYTE*)tainerBOM_BIN;
opeBOM_SIZE = TCHARRAY2BIN(OPTARG, pTAINER);
opeFLAG |= 0x08;
break;
case _T('o'):
case _T('O'):
if(OPTARG != NULL)
{
opeFLAG |= 0x10;
opeOUTFILE = OPTARG;
}
break;
case _T('h'):
case _T('H'):
_ftprintf(stdout, HELP_INFORMATION);
return 0;
case _OPT_TILL:
// The first argument without a switch is taken as the input file name
opeINFILE = argv[UNOPTIND];
break;
case _OPT_TERR:
_ftprintf(stderr, _T("Extra parameters \"%s\"\n"), argv[OPTIND]);
exit(1);
default:
_ftprintf(stderr, _T("Unknown switch '-%c'\n"), K);
exit(1);
}
}
// No input file: exit
if(opeINFILE == NULL)
{
_ftprintf(stderr, _T("Needs input file name\n"));
exit(1);
}
// No output file: overwrite the input file
if(opeOUTFILE == NULL)
{
opeOUTFILE=opeINFILE;
}
// No -s switch: detect an existing BOM and skip it automatically
if((opeFLAG&0x04) == 0)
{
FILE* inFP=_tfopen(opeINFILE, _T("rb"));
if(inFP == NULL)
{
_ftprintf(stderr, _T("Open input file error\n"));
exit(1);
}
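// Read into a separate buffer so a BOM given with -b is not overwritten
BYTE headBYTES[BOMS_SIZE]= {0};
fread(headBYTES, sizeof(BYTE), BOMS_SIZE, inFP);
fclose(inFP);
UINT uBOM_VALUE = BOM2UINT(headBYTES);
// Check the 4-byte BOMs first, then the shorter ones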
switch(uBOM_VALUE)
{
case 0xFFFE0000:
case 0x0000FEFF:
case 0x2B2F7638:
case 0x84319533:
opeSKIP_NUMBER = 4;
break;
default:
if(
(uBOM_VALUE>>16) == 0xFFFE ||
(uBOM_VALUE>>16) == 0xFEFF
)
{
opeSKIP_NUMBER = 2;
}
else if((uBOM_VALUE>>8) == 0xEFBBBF)
{
opeSKIP_NUMBER = 3;
}
else
{
opeSKIP_NUMBER = 0;
}
break;
}
}
// No -b switch: choose the BOM that matches the output code page
if((opeFLAG&0x08) == 0)
{
TCHAR* tcsBIN =_T("");
switch(opeOUT_PAGE)
{
case 1200:
tcsBIN =_T("0xFFFE");
break;
case 1201:
tcsBIN =_T("0xFEFF");
break;
case 12000:
tcsBIN =_T("0xFFFE0000");
break;
case 12001:
tcsBIN =_T("0x0000FEFF");
break;
case 65001:
tcsBIN =_T("0xEFBBBF");
break;
case 65000:
tcsBIN =_T("0x2B2F7638");
break;
case 54936:
tcsBIN =_T("0x84319533");
break;
default:
break;
}
// Fill the BOM buffer
pTAINER = (BYTE*)tainerBOM_BIN;
opeBOM_SIZE = TCHARRAY2BIN(tcsBIN, pTAINER);
}
// Run the code page conversion
if(! ConveTextFile(opeINFILE, opeOUTFILE, opeIN_PAGE, opeOUT_PAGE, opeSKIP_NUMBER, opeBOM_SIZE, tainerBOM_BIN))
{
_ftprintf(stderr, _T("Convert file error\n"));
return 1;
}
return 0;
}
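[/code] Could you demonstrate converting all the web pages under a local directory named “小说” into txt files?
I would like line breaks that are not paragraph breaks to be removed. For example, a page that the browser displays as:
脸
色当即羞红了起来
should have the two <br/> <br/> tags and everything between them removed from the page source:
“脸<br/> <br/>色当即羞红了起来” becomes “脸色当即羞红了起来”.
If <br/> <br/> has Chinese text on its left and a proper paragraph indent on its right (several full-width or half-width spaces), do not replace it;
If <br/> <br/> has Chinese text on both sides, it must be replaced;
If the character just left of <br/> <br/> is one of ,、“:;, it must be replaced;
If the character just right of <br/> <br/> is one of ,。、“”:;!?…, it must be replaced;
If the character just left of <br/> <br/> is one of 。?!”…… and the right side is ……, do not replace it;
If <br/> <br/> has : on its left and “ on its right, it must be replaced;
Also, sometimes the line breaks are not <br/> <br/> but </P><P>, <br/><br/>, or
<br/>
<br/>
Finally, extract the title and the content between the <br/> tags, replace user-defined ad text, clean up runs of blank characters and empty lines, and produce clean ANSI text.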
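I have a few test pages here. [b]Reply to [url=http://www.bathome.net/redirect.php?goto=findpost&pid=200012&ptid=44343]2#[/url] [i]3518228042[/i] [/b]
I suggest using sed directly; this kind of processing is best done with a script and regular expressions. It can of course be done in C as well, but the code would be very tedious.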
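To make the requested rules concrete, here is a rough C++ sketch of the join decision for the single break form <br/> <br/>. It is only an illustration under several assumptions: the text has already been decoded to UTF-32, the names should_join and join_broken_lines are hypothetical, the other break forms (</P><P>, <br/><br/>) are not handled, and a break after a finished sentence is kept whenever the rules do not say otherwise.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper names; none of this is part of wincp.
static const std::u32string kBreak = U"<br/> <br/>";

// True if c occurs in the NUL-terminated set of code points.
static bool one_of(char32_t c, const char32_t* set) {
    return c != 0 && std::u32string(set).find(c) != std::u32string::npos;
}

// Decide whether the break between `left` and `right` sits in mid-sentence.
// A value of 0 means "no character on that side".
bool should_join(char32_t left, char32_t right) {
    // Half-width space and full-width space (U+3000)
    static const char32_t kIndent[] = U" \u3000";
    // Sentence enders: 。?!”…
    static const char32_t kEnders[] = U"\u3002\uFF1F\uFF01\u201D\u2026";
    // Punctuation that cannot start a paragraph: ,。、“”:;!?…
    static const char32_t kTrailing[] =
        U"\uFF0C\u3002\u3001\u201C\u201D\uFF1A\uFF1B\uFF01\uFF1F\u2026";
    if (right == 0 || one_of(right, kIndent)) return false;  // real paragraph break
    const bool left_ends = one_of(left, kEnders);
    if (left_ends && right == U'\u2026') return false;       // sentence end, then ……
    if (one_of(right, kTrailing)) return true;               // punctuation on the right
    return !left_ends;  // assumption: keep a break that follows a finished sentence
}

// Remove every "<br/> <br/>" that should_join() classifies as mid-sentence.
std::u32string join_broken_lines(const std::u32string& text) {
    std::u32string out;
    std::size_t pos = 0;
    while (pos < text.size()) {
        if (text.compare(pos, kBreak.size(), kBreak) == 0) {
            const std::size_t next = pos + kBreak.size();
            const char32_t left = out.empty() ? 0 : out.back();
            const char32_t right = next < text.size() ? text[next] : 0;
            if (!should_join(left, right)) out.append(kBreak);  // keep the break
            pos = next;
            continue;
        }
        out.push_back(text[pos++]);
    }
    return out;
}
```

Calling join_broken_lines on the decoded page text drops the breaks that split a sentence and leaves paragraph breaks alone. Title extraction, ad removal and whitespace cleanup from the request are outside this sketch, and a sed or regex script as suggested above may well be shorter.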