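Title: [Text processing] How can a batch script extract every line belonging to several specified strings from a txt file?

Author: WILSONMAO    Time: 2021-9-26 12:54

The file content is shown at the end of this post. What follows is one group, and the file repeats about 300,000 such groups.
Each line starts with a field name. I need to extract the full content under the five fields CT, CY, CL, WC and C1 in every group, and the five fields must stay aligned within each group, because a few of them may occasionally be empty.
The final result should look like this: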
  CT CY CL WC C1
1** ** ** ** **
2** ** ** ** **
3** ** ** ** **
.....

FN Clarivate Analytics Web of Science
VR 1.0
PT C
AU Si, D
   Cheng, SC
   Xing, RW
   Liu, C
   Wu, OY
AF Si, Dong
   Cheng, Sunny Chieh
   Xing, Ruiwen
   Liu, Chang
   Wu, Hoi Yan
GP IEEE
TI Scaling up Prediction of Psychosis by Natural Language Processing
SO 2019 IEEE 31ST INTERNATIONAL CONFERENCE ON TOOLS WITH ARTIFICIAL
   INTELLIGENCE (ICTAI 2019)
SE Proceedings-International Conference on Tools With Artificial
   Intelligence
LA English
DT Proceedings Paper
CT 31st IEEE International Conference on Tools with Artificial Intelligence
   (ICTAI)
CY NOV 04-06, 2019
CL Portland, OR
SP IEEE, IEEE Comp Soc
DE Machine learning; Natural language processing; Text classification;
   Prediction of psychosis; Schizophrenia; Word embeddings; Convolutional
   neural networks
ID HIGH-RISK; SCHIZOPHRENIA; PREVALENCE
AB Mental health professionals currently diagnose and treat mental disorders, such as schizophrenia, mainly by analyzing the language and speech of their patients, a method that maybe improved with the usage of artificial intelligence. This study aims to use machine learning to distinguish between the speech of patients who suffer from mental disorders which cause psychosis from that of healthy individuals to improve early detection of schizophrenia. We analyzed forty interview transcripts from patients who have been diagnosed with first episode psychosis. Word embeddings and convolutional neural network were utilized for the classification of patients from healthy individuals. The preliminary test results achieved a prediction rate of 99%, which indicated that our speech classifier was able to discriminate speech in patients from healthy individuals' daily conversations. This suggested that machine learning models can learn and train upon features of natural languages to predict whether or not an individual is beginning to show the first signs of early psychosis based on their speech. This line of inquiry will contribute to the improved identification of individuals at risk for psychiatric symptoms and lead to the development of targeted therapies.
C1 [Si, Dong; Xing, Ruiwen; Liu, Chang; Wu, Hoi Yan] Univ Washington, Comp & Software Syst, Bothell, WA 98011 USA.
   [Cheng, Sunny Chieh] Univ Washington, Nursing & Healthcare Leadership, Tacoma, WA USA.
RP Si, D (corresponding author), Univ Washington, Comp & Software Syst, Bothell, WA 98011 USA.
EM dongsi@uw.edu; ccsunny@uw.edu; ruiwen@uw.edu; chang15@uw.edu;
   hoiyanwu@uw.edu
FU Graduate Research Award of Computing and Software Systems division;
   University of Washington BothellUniversity of Washington [74-0525];
   NVIDIA Corporation (Santa Clara, CA, USA)
FX This research was funded by the Graduate Research Award of Computing and
   Software Systems division and the startup fund 74-0525 of the University
   of Washington Bothell.; We gratefully acknowledge the support of NVIDIA
   Corporation (Santa Clara, CA, USA) with the donation of the GPU used for
   this research.
NR 31
TC 1
Z9 1
U1 0
U2 0
PU IEEE COMPUTER SOC
PI LOS ALAMITOS
PA 10662 LOS VAQUEROS CIRCLE, PO BOX 3014, LOS ALAMITOS, CA 90720-1264 USA
SN 1082-3409
BN 978-1-7281-3798-8
J9 PROC INT C TOOLS ART
PY 2019
BP 339
EP 347
DI 10.1109/ICTAI.2019.00055
PG 9
WC Computer Science, Artificial Intelligence; Computer Science, Theory &
   Methods
SC Computer Science
GA BP4NY
UT WOS:000553441500046
DA 2021-09-15
ER
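Author: qixiaobin0715    Time: 2021-9-26 16:45

Did you swap WC and C1?
Are the fields separated by spaces?
Author: WILSONMAO    Time: 2021-9-26 17:07

Reply to 2# qixiaobin0715

    The order can be changed freely. There do not seem to be spaces between the fields; each one starts on a new line.
Author: qixiaobin0715    Time: 2021-9-26 17:10

I meant between CT CY CL WC C1 on the same line.
Author: qixiaobin0715    Time: 2021-9-26 17:21

In the source file, CT CY CL C1 WC always appear in that fixed order, right?
Author: WILSONMAO    Time: 2021-9-26 17:42

Reply to 5# qixiaobin0715

    Yes.
Author: WILSONMAO    Time: 2021-9-26 17:43

Reply to 4# qixiaobin0715

    Yes.
Author: qixiaobin0715    Time: 2021-9-26 17:45

Last edited by qixiaobin0715 on 2021-9-26 17:48

What does "a few of them may occasionally be empty" mean?
Can CT be empty as well?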
Author: WILSONMAO    Time: 2021-9-26 18:10

Link: https://pan.baidu.com/s/1QX4H6uUy41_ezGPwQuVszw
Extraction code: 1x4z

Attachment link
Author: WILSONMAO    Time: 2021-9-26 18:12

Reply to 8# qixiaobin0715

Link: https://pan.baidu.com/s/1QX4H6uUy41_ezGPwQuVszw
Extraction code: 1x4z

See the attachment for details. Many thanks!
Author: idwma    Time: 2021-9-26 20:47

Last edited by idwma on 2021-9-26 22:22

@echo off
setlocal enabledelayedexpansion
rem Tags still to be found in the current record
set "str=CT CY CL C1 WC"
(for /f "delims=" %%a in (111.txt) do (
    set "strr=%%a"
    rem f is set while a wanted field is open and may have continuation lines
    if defined f (
        set ccc=
        if not "!strr:~0,2!"=="  " (
            rem A line that does not start with spaces closes the open field
            set f=
            set ccc=1
        )
        if not defined ccc (
            rem Append the continuation text, keeping a trailing space as separator
            call set "!ff!=%%!ff!%%!strr:~3! "
        )
    )
    for %%c in (!str!) do (
        if "!strr:~0,2!"=="%%c" (
            set str=!str:%%c=!
            set "ff=%%c"
            rem The trailing space keeps the variable defined even when the field is empty
            set "!ff!=!strr:~3! "
            set f=1
        )
    )
    rem Once all five fields are collected and the last one is closed, write the row
    if defined CT if defined CY if defined CL if defined WC if defined C1 (
        if not defined f (
            set /a n+=1
            echo;!n!##!CT!##!CY!##!CL!##!C1!##!WC!
            for %%c in (CT CY CL C1 WC) do set %%c=
            set "str=CT CY CL C1 WC"
        )
    )
))>222.txt
pause
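
For comparison, the approach in the script above (remember which field is open, append indented continuation lines to it) can be sketched in Python. This is only an illustrative sketch and not part of the original answer. The names `parse_records`, `FIELDS` and `SAMPLE` are made up, the sample lines are abbreviated from the record in the first post, and the sketch assumes every record ends with an `ER` line as that sample does. Unlike the batch script, it writes a row at `ER` even when some of the five fields are missing, leaving those columns empty.

```python
FIELDS = ("CT", "CY", "CL", "C1", "WC")  # tags to keep, in source-file order


def parse_records(lines, fields=FIELDS):
    """Yield one dict per record; fields absent from a record stay ''."""
    record = dict.fromkeys(fields, "")
    current = None  # wanted tag that continuation lines should extend
    for raw in lines:
        line = raw.rstrip("\r\n")
        tag, value = line[:2], line[3:].strip()
        if tag == "ER":  # end of record: emit the row and start afresh
            yield record
            record = dict.fromkeys(fields, "")
            current = None
        elif tag == "  ":  # indented line continues the previous tag
            if current is not None:
                record[current] = (record[current] + " " + value).strip()
        elif tag in record:  # start of a wanted field
            record[tag] = value
            current = tag
        else:  # any other tag (or a blank line) closes the open field
            current = None


# Abbreviated from the sample record in the first post; the second,
# incomplete record is invented to show that missing fields stay empty.
SAMPLE = [
    "PT C",
    "CT 31st IEEE International Conference on Tools with Artificial Intelligence",
    "   (ICTAI)",
    "CY NOV 04-06, 2019",
    "CL Portland, OR",
    "DE Machine learning; Natural language processing; Text classification;",
    "   neural networks",
    "C1 [Si, Dong] Univ Washington, Comp & Software Syst, Bothell, WA 98011 USA.",
    "WC Computer Science, Artificial Intelligence; Computer Science, Theory &",
    "   Methods",
    "ER",
    "",
    "PT J",
    "WC Computer Science",
    "ER",
]

for number, row in enumerate(parse_records(SAMPLE), start=1):
    print(number, *(row[f] for f in FIELDS), sep="##")
```

Ending a record at `ER` rather than at the last wanted tag means the five columns stay aligned even if the tag order changes or a tag is absent.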

Author: qixiaobin0715    Time: 2021-9-27 11:22

Last edited by qixiaobin0715 on 2021-9-27 13:58

Reply to 1# WILSONMAO
The file is large, so please be patient:

@echo off & cls & chcp 65001 >nul
set var=CT CY CL WC C1
setlocal enabledelayedexpansion
(echo,CT,CY,CL,WC,C1
rem Keep only the lines that begin with one of the wanted tags
for /f "tokens=1*" %%a in ('findstr /br "%var%" 2019') do (
    rem A new CT line means the previous record is complete, so write it out
    if "%%a"=="CT" if defined _CT (
        echo,"!_CT!","!_CY!","!_CL!","!_WC!","!_C1!"
        for %%i in (%var%) do set "_%%i="
    )
    set "_%%a=%%b"
)
rem Write out the last record
echo,"!_CT!","!_CY!","!_CL!","!_WC!","!_C1!"
)>test.csv
pause

Author: qixiaobin0715    Time: 2021-9-27 11:50

Last edited by qixiaobin0715 on 2021-9-27 12:18

Reply to 1# WILSONMAO
This version is somewhat more accurate and considerably faster:

@echo off & cls & chcp 65001 >nul
set var=CT CY CL WC C1
rem Filter the tagged lines into a temporary file first
findstr /br "%var%" 2019 >a.txt
setlocal enabledelayedexpansion
(echo,CT,CY,CL,WC,C1
for /f "tokens=1*" %%a in (a.txt) do (
    rem WC comes last among the five tags, so its line completes the record
    if "%%a"=="WC" (
        echo,"!_CT!","!_CY!","!_CL!","%%b","!_C1!"
        for %%i in (%var%) do set "_%%i="
    )
    set "_%%a=%%b"
))>test.csv
del a.txt
pause

The CSV file can be opened in Excel.
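
One detail worth noting about the CSV output: the `echo` lines above wrap each field in double quotes but do not escape a double quote that occurs inside a value, which can shift columns when the file is opened. A minimal Python sketch of the same output step, using the standard `csv` module to do the escaping, might look like this. The function name `write_rows` and the sample row are hypothetical and only serve the illustration.

```python
import csv
import io

FIELDS = ["CT", "CY", "CL", "WC", "C1"]  # column order used in the thread


def write_rows(rows, out):
    """Write a header plus one quoted line per record to the file-like 'out'."""
    writer = csv.writer(out, quoting=csv.QUOTE_ALL, lineterminator="\n")
    writer.writerow(FIELDS)
    for row in rows:
        # Missing fields become empty cells so the columns stay aligned.
        writer.writerow([row.get(name, "") for name in FIELDS])


# Hypothetical record with an embedded quote and no WC value.
demo = [{"CT": 'A "quoted" title', "CY": "NOV 04-06, 2019",
         "CL": "Portland, OR", "C1": "Univ X"}]
buffer = io.StringIO()
write_rows(demo, buffer)
print(buffer.getvalue(), end="")
```

To write a real file, the same function can be given a file opened with `open("test.csv", "w", newline="", encoding="utf-8-sig")`. The `utf-8-sig` encoding adds a byte order mark, which generally helps Excel recognise the file as UTF-8.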
Author: WILSONMAO    Time: 2021-9-30 10:01

Reply to 13# qixiaobin0715

    Many thanks!

Welcome to 批处理之家 (http://bbs.bathome.net/) Powered by Discuz! 7.2