Title: [Original code] PowerShell web crawler for the Mtime (时光网) movie database, continued (parallel processing)
Author: gflrlm  Time: 2018-2-1 12:26
The previous post mentioned using parallel processing to speed things up. For that, the earlier script has to accept parameters, so add the following code at the very top of that script, 000___mtime_抓取电影网页-选数据-逗号分列.ps1:
param(
    [int]$start_id_txt_param = $(throw "Parameter missing: -start_id_txt_param 3042"),
    [int]$end_id_txt_param = $(throw "Parameter missing: -end_id_txt_param 3242"),
    [int]$call_cnt = $(throw "Parameter missing: -call_cnt 132")
)
$USING_PARAM_ENABLE=1
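As a rough illustration of how these `throw` defaults behave, the sketch below wraps the same `param(...)` block in a hypothetical function, `Invoke-CrawlRange` (not part of the original script), so it can be tried without the crawler. A default expression is only evaluated when its argument is omitted, which is what turns a missing parameter into an error.

```powershell
# Hypothetical stand-in for the crawler script: same param block, but it only
# reports the range it was given instead of fetching any pages.
function Invoke-CrawlRange {
    param(
        [int]$start_id_txt_param = $(throw "Parameter missing: -start_id_txt_param 3042"),
        [int]$end_id_txt_param   = $(throw "Parameter missing: -end_id_txt_param 3242"),
        [int]$call_cnt           = $(throw "Parameter missing: -call_cnt 132")
    )
    "task $call_cnt handles ids $start_id_txt_param to $end_id_txt_param"
}

# All three parameters supplied: none of the default expressions run.
$ok = Invoke-CrawlRange -start_id_txt_param 3042 -end_id_txt_param 3242 -call_cnt 1

# -call_cnt omitted: its default expression runs and throws.
$missing = $null
try {
    Invoke-CrawlRange -start_id_txt_param 3042 -end_id_txt_param 3242
} catch {
    $missing = $_.Exception.Message
}
```

After running this, `$ok` holds the report string and `$missing` holds the message from the `throw`, so a caller that forgets an argument fails immediately rather than crawling a wrong range.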
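Save the code below as 并行.ps1, then set the following parameters in it as well:
$start_index = 12500   # id of the first file to process (the largest mtime id is currently 239000)
$step = 5000           # number of file ids each parallel task handles
$repeat_cnt = 5        # number of tasks to launch, 5 here
$end_index = 240000    # upper bound for ids (the largest mtime id is currently 239000)

$throttleLimit = 5     # maximum number of runspaces running at the same time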
$iss = [System.Management.Automation.Runspaces.InitialSessionState]::CreateDefault()
$Pool = [runspacefactory]::CreateRunspacePool(1, $throttleLimit, $iss, $Host)
$Pool.Open()


$log_file = "D:\迅雷下载\fork.log"


# Start every run with a fresh log file.
if (Test-Path $log_file) {
    Remove-Item $log_file
}


$ScriptBlock = {
    param($s, $e, $x)
    #Start-Sleep -Seconds 2
    #[System.Console]::WriteLine("Processing XXX.ps1 -start_id_txt_param $s -end_id_txt_param $e -call_cnt $x")

    # Each task runs the crawler script over its own id range.
    & 'D:\迅雷下载\000___mtime_抓取电影网页-选数据-逗号分列.ps1' -start_id_txt_param $s -end_id_txt_param $e -call_cnt $x

}


$start_index = 12500   # MAX mtime id now is 239000.
$step = 5000
$repeat_cnt = 5
$end_index = 240000    # MAX mtime id now is 239000.
# Refuse to start if the last task would run past $end_index.
if ($start_index + $step * $repeat_cnt -gt $end_index) {
    Write-Host "Error:" -ForegroundColor Red
    Write-Host "Max mtime id for now is $end_index, you are using a number greater than $end_index!!!" -ForegroundColor Red
    cmd /c "pause"
    exit
}
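$jobs = @()
for ($x = 1; $x -le $repeat_cnt; $x++) {
    $start = $start_index + ($x - 1) * $step
    $end = $start + $step
    Write-Host "Processing XXX.ps1 -start_id_txt_param $start -end_id_txt_param $end -call_cnt $x"
    "Processing XXX.ps1 -start_id_txt_param $start -end_id_txt_param $end -call_cnt $x" | Out-File -Append $log_file
    $powershell = [powershell]::Create().AddScript($ScriptBlock).AddArgument($start).AddArgument($end).AddArgument($x)
    $powershell.RunspacePool = $Pool
    # Keep every instance with its handle; a single $handle variable would only remember the last task.
    $jobs += [pscustomobject]@{ PowerShell = $powershell; Handle = $powershell.BeginInvoke() }
    #cmd /c "pause"
}
# Wait for every task to finish, then release its resources.
foreach ($job in $jobs) {
    $job.PowerShell.EndInvoke($job.Handle)
    $job.PowerShell.Dispose()
}
$Pool.Close()
cmd /c "pause"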
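The id ranges handed to the tasks follow directly from `$start_index`, `$step` and `$repeat_cnt`. The sketch below recomputes them in a hypothetical helper, `Get-TaskRange` (a name introduced here for illustration only), so the split can be checked before any pages are fetched.

```powershell
# Hypothetical helper: lists the id range each task would receive,
# using the same arithmetic as the loop in the script.
function Get-TaskRange {
    param([int]$StartIndex, [int]$Step, [int]$RepeatCount)
    for ($x = 1; $x -le $RepeatCount; $x++) {
        $start = $StartIndex + ($x - 1) * $Step
        [pscustomobject]@{ CallCnt = $x; Start = $start; End = $start + $Step }
    }
}

# Values taken from the post: 5 tasks of 5000 ids each, starting at 12500.
$ranges = @(Get-TaskRange -StartIndex 12500 -Step 5000 -RepeatCount 5)
$ranges | Format-Table -AutoSize
```

With these values the first task gets 12500 to 17500 and the last gets 32500 to 37500. Each range ends where the next one begins, so whether a boundary id such as 17500 is fetched twice depends on whether the crawler script treats its end id as inclusive, which the post does not say.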
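To try the runspace pool pattern without the crawler script, here is a minimal self-contained sketch. The script block is a placeholder that only returns a string, and the pool size, the ranges and the names `$work`, `$jobs` and `$results` are assumptions made for the example. It also shows that whatever a task writes to its output comes back from `EndInvoke`.

```powershell
# Pool that runs at most 3 runspaces at once (assumed size for this demo).
$pool = [runspacefactory]::CreateRunspacePool(1, 3)
$pool.Open()

$work = {
    param($s, $e, $x)
    # Placeholder work: report how many ids this task would cover.
    "task $x covers $($e - $s) ids"
}

# Start three tasks and remember each instance together with its handle.
$jobs = foreach ($x in 1..3) {
    $start = 100 * ($x - 1)
    $ps = [powershell]::Create().AddScript($work).AddArgument($start).AddArgument($start + 100).AddArgument($x)
    $ps.RunspacePool = $pool
    [pscustomobject]@{ PowerShell = $ps; Handle = $ps.BeginInvoke() }
}

# EndInvoke blocks until the task has finished and returns its output.
$results = foreach ($job in $jobs) {
    $job.PowerShell.EndInvoke($job.Handle)
    $job.PowerShell.Dispose()
}
$pool.Close()
```

Collecting the results in the order the tasks were started keeps the output predictable even when the tasks finish in a different order.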
Welcome to 批处理之家 (http://bbs.bathome.net/)