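A script that fetches (crawls) information from the web is commonly called a web spider.
PowerShell ships with two commands for this, Invoke-WebRequest and Invoke-RestMethod, but both sometimes return garbled text.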
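The post does not say why the output comes back garbled. One commonly reported cause in Windows PowerShell is that a response whose Content-Type header declares no charset gets decoded as ISO-8859-1 instead of UTF-8. The sketch below only simulates that mismatch offline. The sample string and variable names are illustrative assumptions, and no web request is made.

```powershell
# Illustrative sample text; the accented character is built from its code point
# so the encoding of the script file itself cannot affect the result.
$text  = 'caf' + [char]0x00E9
$bytes = [System.Text.Encoding]::UTF8.GetBytes($text)

# Wrong: decode the UTF-8 bytes as ISO-8859-1. The two-byte sequence C3 A9
# turns into two separate characters (U+00C3 and U+00A9).
$garbled = [System.Text.Encoding]::GetEncoding('iso-8859-1').GetString($bytes)

# Right: decode the same bytes as UTF-8 and get the original text back.
$decoded = [System.Text.Encoding]::UTF8.GetString($bytes)

"Garbled: $garbled (length $($garbled.Length))"
"Decoded: $decoded (length $($decoded.Length))"
```

The bytes are identical in both cases; only the charset chosen for decoding differs, which is why reading the raw response stream yourself avoids the problem.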
This post reposts and shares a script written by a developer abroad. Original source (hosted outside the Great Firewall): https://gist.github.com/angel-vladov/9482676
Core code
function Read-HtmlPage {
    param (
        [Parameter(Mandatory = $true, Position = 0, ValueFromPipeline = $true)]
        [string] $Uri
    )
    # Invoke-WebRequest and Invoke-RestMethod can't work properly with UTF-8 responses, so we need to do things this way.
    [Net.HttpWebRequest] $webRequest = [Net.WebRequest]::Create($Uri)
    [Net.HttpWebResponse] $webResponse = $webRequest.GetResponse()
    # With no encoding argument, StreamReader decodes the stream as UTF-8.
    $reader = New-Object IO.StreamReader($webResponse.GetResponseStream())
    $response = $reader.ReadToEnd()
    $reader.Close()   # also closes the underlying response stream
    # Create the document class
    [mshtml.HTMLDocumentClass] $doc = New-Object -ComObject "HTMLFile"
    $doc.IHTMLDocument2_write($response)
    # Returns an HTMLDocumentClass instance, just like Invoke-WebRequest's ParsedHtml
    $doc
    # Reposted and modified by "PowerShell 传教士", 2016-01-01. Reposting is permitted, but the name and source must be retained; otherwise legal action will be pursued.
}
Description from the original gist: a PowerShell function you can use for reading the content of UTF-8 encoded HTML pages. The built-in Invoke-WebRequest and Invoke-RestMethod fail miserably.
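The function never names an encoding: it relies on StreamReader, which decodes as UTF-8 when none is given. The following is a minimal offline sketch of that one step, not the author's code. A MemoryStream stands in for the real response stream, and the sample HTML and variable names are hypothetical.

```powershell
# Hypothetical sample page; the accented character is built from its code point.
$sample = '<title>caf' + [char]0x00E9 + '</title>'
$bytes  = [System.Text.Encoding]::UTF8.GetBytes($sample)

# MemoryStream stands in for $webResponse.GetResponseStream().
# The leading comma passes the byte array as a single constructor argument.
$stream = New-Object System.IO.MemoryStream(, $bytes)

# No encoding argument: StreamReader falls back to UTF-8, as in Read-HtmlPage.
$reader = New-Object System.IO.StreamReader($stream)
try {
    $html = $reader.ReadToEnd()
}
finally {
    $reader.Dispose()   # also closes the underlying stream
}

"Read back: $html"
```

To use the real function, load it into a Windows PowerShell session and call something like `$doc = Read-HtmlPage 'https://example.com'`, then inspect `$doc.title`. The URL here is a placeholder, and this assumes a Windows machine where the HTMLFile COM object is available.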