Re: A roundup of ways to fetch a web page's HTML with PHP

Category: PHP tools and code

http://hi.baidu.com/%C0%EE%B1%F8/blog/i ... 3b394

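[code]<?php
$url = "http://www.edeng.cn/index.html"; // URL of the page to fetch
$content = file_get_contents($url); // fetch the HTML
if ($content === false) {
    die("Could not fetch $url");
}
$filename = 'index1.htm'; // name of the local output file
$handle = fopen($filename, 'w'); // open for writing (creates or truncates the file)
if ($handle === false) {
    die("Could not open $filename for writing");
}
fwrite($handle, $content);
fclose($handle);

echo "<script>alert('Successfully wrote to file $filename');</script>";
?> [/code]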
Notes on fopen, for reference:

fopen
(PHP 3, PHP 4)

fopen -- Opens a file or URL

Description
[code]resource fopen ( string filename, string mode [, int use_include_path [, resource zcontext]])[/code]

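fopen() binds the named resource specified by filename to a stream. If filename is of the form "scheme://...", it is treated as a URL and PHP searches for a protocol handler (also known as a wrapper) for that scheme. If no wrapper is registered for that protocol, PHP emits a notice to help you track down potential problems in your script and then continues as though filename specified a regular file.

If PHP decides that filename specifies a local file, it tries to open a stream on that file. The file must be accessible to PHP, so make sure the file access permissions allow this. If safe mode or open_basedir is enabled, further restrictions apply.

If PHP decides that filename specifies a registered protocol, and that protocol is registered as a network URL, PHP checks that allow_url_fopen is enabled. If it is disabled, PHP emits a warning and the fopen call fails.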
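
As a rough illustration of the check described above, a script can inspect allow_url_fopen itself before attempting a remote open. The helper name can_open_urls() is made up for this sketch and is not part of PHP.

```php
<?php
// Hypothetical helper: reports whether URL wrappers are enabled
// for fopen() and file_get_contents().
function can_open_urls()
{
    // ini_get() returns the setting as a string such as "1", "0", "On" or "".
    $value = ini_get('allow_url_fopen');
    // FILTER_VALIDATE_BOOLEAN maps "1", "on", "true" and "yes" to true.
    return filter_var($value, FILTER_VALIDATE_BOOLEAN);
}

if (!can_open_urls()) {
    echo "allow_url_fopen is off; remote fopen() calls will fail.\n";
}
```

Checking up front lets the script print a clear message instead of relying on the warning that fopen() would raise.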

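Note: The list of supported protocols is in Appendix J. Some protocols (also called wrappers) support context and php.ini options. See the page for the specific protocol for the options that can be set (for example, the php.ini value user_agent is used by the http wrapper). For a description of contexts and the zcontext parameter, see Reference CV, Stream Functions.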
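
To make the context idea more concrete, here is a minimal sketch that builds a stream context carrying a User-Agent and a timeout for the http wrapper. The function name make_http_context(), the agent string and the timeout value are placeholders chosen for this example.

```php
<?php
// Build a stream context that sets the User-Agent and a timeout
// for the http wrapper. Name and values are illustrative only.
function make_http_context($userAgent, $timeoutSeconds)
{
    return stream_context_create(array(
        'http' => array(
            'method'     => 'GET',
            'user_agent' => $userAgent,
            'timeout'    => $timeoutSeconds,
        ),
    ));
}

$context = make_http_context('MyFetcher/0.1', 10);

// The context goes in as the fourth argument of fopen()
// or the third argument of file_get_contents(), for example:
// $html = file_get_contents('http://www.example.com/', false, $context);

$options = stream_context_get_options($context);
echo $options['http']['user_agent'], "\n";
```

Setting the agent per request this way avoids changing the global user_agent value in php.ini.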

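The mode parameter specifies the type of access you require to the stream. It may be any of the following:

Table 1. Possible values of mode for fopen()

mode Description
'r' Open for reading only; place the file pointer at the beginning of the file.
'r+' Open for reading and writing; place the file pointer at the beginning of the file.
'w' Open for writing only; place the file pointer at the beginning of the file and truncate the file to zero length. If the file does not exist, attempt to create it.
'w+' Open for reading and writing; place the file pointer at the beginning of the file and truncate the file to zero length. If the file does not exist, attempt to create it.
'a' Open for writing only; place the file pointer at the end of the file. If the file does not exist, attempt to create it.
'a+' Open for reading and writing; place the file pointer at the end of the file. If the file does not exist, attempt to create it.
'x' Create and open for writing only; place the file pointer at the beginning of the file. If the file already exists, the fopen() call fails by returning FALSE and generating an error of level E_WARNING. If the file does not exist, attempt to create it. This is equivalent to specifying the O_EXCL|O_CREAT flags for the underlying open(2) system call. This option is supported in PHP 4.3.2 and later, and only works for local files.
'x+' Create and open for reading and writing; place the file pointer at the beginning of the file. If the file already exists, the fopen() call fails by returning FALSE and generating an error of level E_WARNING. If the file does not exist, attempt to create it. This is equivalent to specifying the O_EXCL|O_CREAT flags for the underlying open(2) system call. This option is supported in PHP 4.3.2 and later, and only works for local files.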
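
The differences between the 'x', 'a' and 'w' modes listed above can be seen in a short sketch that works on a scratch file in the system temp directory. The file name is generated for the example and has no special meaning.

```php
<?php
// Work on a scratch file inside the system temp directory.
$path = sys_get_temp_dir() . '/fopen_modes_' . uniqid() . '.txt';

// 'x' creates the file and fails if it already exists.
$handle = fopen($path, 'xb');
fwrite($handle, "first\n");
fclose($handle);

// 'a' appends: the pointer starts at the end of the file.
$handle = fopen($path, 'ab');
fwrite($handle, "second\n");
fclose($handle);
$afterAppend = file_get_contents($path);

// 'w' truncates the file to zero length before writing.
$handle = fopen($path, 'wb');
fwrite($handle, "third\n");
fclose($handle);
$afterWrite = file_get_contents($path);

// A second 'x' open fails because the file now exists;
// the @ operator hides the E_WARNING for this demo.
$second = @fopen($path, 'xb');
$exclusiveFailed = ($second === false);

unlink($path);
```

After the append the file holds both lines; after the 'w' open only the last line remains, and the second exclusive open returns FALSE.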

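Note: Different operating system families have different line-ending conventions. When you write a text file and want to insert a line break, you need to use the correct line-ending characters for your operating system. Unix-based systems use \n, Windows-based systems use \r\n, and Macintosh-based systems use \r.

If you use the wrong line-ending characters when writing your files, other applications may behave oddly when they open those files.

Windows offers a text-mode translation flag ('t') which transparently translates \n to \r\n. In contrast, you can use 'b' to force binary mode, which does not translate your data. To use these flags, specify either 'b' or 't' as the last character of the mode parameter.

The default translation mode depends on the SAPI and the version of PHP you are using, so you are encouraged to always specify the appropriate flag for portability. Use the 't' mode if you are working with plain-text files that use \n as the line delimiter and you expect those files to be readable by other applications such as Notepad. Use 'b' in all other cases.

If you do not specify the 'b' flag when working with binary files, you may run into strange problems, including broken image files and odd problems with \r\n characters.

For portability, it is strongly recommended that you always use the 'b' flag when opening files with fopen().

Again, for portability, it is strongly recommended that you rewrite code that relies on the 't' mode so that it uses the correct line endings and 'b' mode instead.

As of PHP 4.3.2, the default mode is binary on all platforms that distinguish between binary and text mode. If you run into problems with your scripts after upgrading, try using the 't' flag as a temporary workaround until you have made all your scripts more portable as described above.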
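
One possible way to follow that advice is to write the line endings explicitly and always open in binary mode. The sketch below assumes a helper called write_lines(), which is invented for this example.

```php
<?php
// Join lines with an explicit ending and write them in binary mode,
// so no platform-specific translation is applied.
function write_lines($path, array $lines, $eol)
{
    $handle = fopen($path, 'wb');
    if ($handle === false) {
        return false;
    }
    $bytes = fwrite($handle, implode($eol, $lines) . $eol);
    fclose($handle);
    return $bytes;
}

$path = sys_get_temp_dir() . '/eol_demo_' . uniqid() . '.txt';

// Ask for Windows-style endings regardless of the platform we run on.
$written = write_lines($path, array('one', 'two'), "\r\n");
$raw = file_get_contents($path);
unlink($path);
```

Passing PHP_EOL as the third argument instead would use the ending of the platform the script runs on.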

The optional third parameter use_include_path can be set to '1' or TRUE if you also want to search for the file in the include_path.

If the open fails, the function returns FALSE.

Example 1. fopen() examples

[code]<?php
$handle = fopen ("/home/rasmus/file.txt", "r");
$handle = fopen ("/home/rasmus/file.gif", "wb");
$handle = fopen ("http://www.example.com/", "r");
$handle = fopen ("ftp://user:password@example.com/somefile.txt", "w");
?> [/code]

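If you have problems opening or writing files while using the server-module version of PHP, remember to make sure that the files you are using are accessible to the server process.

On the Windows platform, be careful to escape every backslash in a file path, or use forward slashes.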

[code]<?php
$handle = fopen ("c:\\data\\info.txt", "r");
?> [/code]

fwrite example:

Example 1. A simple fwrite example

[code]<?php
$filename = 'test.txt';
$somecontent = "Add this text to the file\n";

// First make sure the file exists and is writable.
if (is_writable($filename)) {

    // In this example we open $filename in append mode,
    // so the file pointer is at the end of the file,
    // which is where $somecontent will go when we fwrite() it.
    if (!$handle = fopen($filename, 'a')) {
        print "Cannot open file $filename";
        exit;
    }

    // Write $somecontent to the file we opened.
    if (fwrite($handle, $somecontent) === false) {
        print "Cannot write to file $filename";
        exit;
    }

    print "Successfully wrote $somecontent to file $filename";

    fclose($handle);

} else {
    print "The file $filename is not writable";
}
?> [/code]

Fetching RSS feeds from MSN Spaces with PHP
http://hi.baidu.com/meloidea/blog/item/ ... 18e92

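RSS (Really Simple Syndication, also described as content syndication) is a format for describing and syndicating website content. It is one of the most widely used applications of XML and makes it easy to share content between sites.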
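
Because RSS is plain XML, a feed can also be read with PHP's bundled SimpleXML extension. The following is only a sketch on a tiny hand-written RSS 2.0 document; the feed contents and example.com URLs are invented for illustration, and it assumes SimpleXML is enabled.

```php
<?php
// A tiny hand-written RSS 2.0 document used only for this sketch.
$xml = <<<XML
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <link>http://www.example.com/</link>
    <description>Demo channel</description>
    <item>
      <title>First post</title>
      <link>http://www.example.com/1</link>
      <description>Hello</description>
    </item>
    <item>
      <title>Second post</title>
      <link>http://www.example.com/2</link>
      <description>World</description>
    </item>
  </channel>
</rss>
XML;

// Parse the document and collect the title of every <item>.
$feed = simplexml_load_string($xml);
$titles = array();
foreach ($feed->channel->item as $item) {
    $titles[] = (string) $item->title;
}

echo (string) $feed->channel->title, ': ', implode(', ', $titles), "\n";
```

The channel element carries the site-level title, link and description, and each item element is one entry, which is the same structure the aggregation classes expose as arrays.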

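There are many ready-made RSS aggregation classes for PHP. The following uses lastRSS as an example to show how to fetch the blog entries of a Microsoft MSN Spaces site.

First download the lastRSS class from http://lastrss.oslab.net/. Unpack the download and you get a single file, lastRSS.php, which is very simple to use. For example, to fetch the RSS feed 'http://meloidea.spaces.live.com/feed.rss', the code is: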

[code]<?php
// Declare the output encoding of this page to avoid garbled characters
header('content-type:text/html;charset=utf-8');

// Load the lastRSS class
include "lastRSS.php";

// Create a lastRSS object
$rss = new lastRSS;

// Set the cache directory
$rss->cache_dir = './temp';
// Keep cached copies for 1200 seconds
$rss->cache_time = 1200;

// Load the RSS feed
if ($rs = $rss->get('http://meloidea.spaces.live.com/feed.rss')) {

    // Show the site's logo image
    if (!empty($rs['image_url'])) {
        echo "<a href=\"$rs[image_link]\"><img src=\"$rs[image_url]\" alt=\"$rs[image_title]\" vspace=\"1\" border=\"0\" /></a><br />\n";
    }

    // Show the site title
    echo "<big><b><a href=\"$rs[link]\">$rs[title]</a></b></big><br />\n";

    // Show the site description
    echo "$rs[description]<br />\n";

    // Show the latest entries (title, link and description)
    echo "<div align=left>";
    foreach ($rs['items'] as $item) {
        // Array keys are quoted here: outside a string, a bare key is an undefined constant
        echo "<li><a href=\"$item[link]\">" . $item['title'] . "</a><br />" . html_entity_decode($item['description']) . "</li>";
    }
    echo "</div>";
}
else {
    echo "Error: could not find the RSS feed\n";
}
?>[/code]
Note: for MSN Spaces you need to call html_entity_decode($item['description']); otherwise the entry content is displayed as raw HTML code.
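
A small sketch of what that call does, using a made-up description string in the entity-escaped form a feed might deliver:

```php
<?php
// A description as it might arrive in a feed: the markup is entity-escaped.
$escaped = '&lt;p&gt;Hello &amp; welcome&lt;/p&gt;';

// html_entity_decode() turns the entities back into real characters,
// so the browser renders a paragraph instead of showing the tags.
$decoded = html_entity_decode($escaped, ENT_QUOTES, 'UTF-8');

echo $decoded, "\n";
```

Without the decode step the browser would display the literal text of the tags rather than a formatted paragraph.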

Besides lastRSS, MagpieRSS and Gregarius can also be used for RSS aggregation.

========================================================

PHP code for fetching data from a remote website
http://hi.baidu.com/kaxi/blog/item/071c ... d97da

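Many programming enthusiasts probably still have the same question: how do you fetch the HTML of someone else's website the way a search engine does, and then collect and organize it into data you can use? Here are a few simple examples.

I. Example: fetching the title of a remote page:

Code snippet: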
[code]<?php
/*
+-------------------------------------------------------------
+ Fetches the title of a web page. Copy this snippet, save it as a .php file and run it.
+-------------------------------------------------------------
*/

error_reporting(7);
$file = fopen("http://www.dnsing.com/", "r");
if (!$file) {
    echo "<font color=red>Unable to open remote file.</font>\n";
    exit;
}
while (!feof($file)) {
    $line = fgets($file, 1024);
    // eregi() was removed in PHP 7; preg_match() with the /i flag does the same job
    if (preg_match("/<title>(.*)<\/title>/i", $line, $out)) {
        $title = $out[1];
        echo $title;
        break;
    }
}
fclose($file);

//End
?>[/code]
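
The snippet above looks at one line at a time, so it would miss a title that is split across lines. A possible variation, sketched here on an in-memory string, matches against the whole document instead. The function name extract_title() and the sample markup are made up for this example.

```php
<?php
// Extract the <title> text from an HTML string.
// The /i flag makes the match case-insensitive and /s lets "." span line breaks.
function extract_title($html)
{
    if (preg_match('/<title[^>]*>(.*?)<\/title>/is', $html, $m)) {
        return trim($m[1]);
    }
    return null;
}

$sample = "<html><head><TITLE>\n  Example page\n</TITLE></head><body></body></html>";
echo extract_title($sample), "\n";
```

To use it on a live page, the string would come from file_get_contents() or from the lines read with fgets() joined together.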

II. Example: fetching the HTML of a remote page:

Code snippet:
[code]<?php
/*
+----------------
+DNSing Spider
+----------------
*/

$fp = fsockopen("www.dnsing.com", 80, $errno, $errstr, 30);
if (!$fp) {
    echo "$errstr ($errno)<br/>\n";
} else {
    // Build the request; header values must not carry trailing spaces
    $out = "GET / HTTP/1.1\r\n";
    $out .= "Host: www.dnsing.com\r\n";
    $out .= "Connection: Close\r\n\r\n";
    fputs($fp, $out);
    while (!feof($fp)) {
        echo fgets($fp, 128);
    }
    fclose($fp);
}
//End
?>
[/code]

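Copy either of the two snippets above and run it to see the result. These examples are only a starting point for fetching web page data; making them fit your own needs depends on your situation, so that part is left for you to explore.

This article covered two fairly basic examples.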
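
The fsockopen() example prints the raw response, headers included. One way to separate the parts is sketched below on a hand-written response string. The function name split_http_response() is invented for this example, and the sketch assumes a plain response body; it does not decode chunked transfer encoding, which an HTTP/1.1 server may use.

```php
<?php
// Split a raw HTTP response into status code, headers and body.
function split_http_response($raw)
{
    // Headers and body are separated by the first blank line.
    $parts = explode("\r\n\r\n", $raw, 2);
    $headerLines = explode("\r\n", $parts[0]);
    $statusLine = array_shift($headerLines);

    $status = null;
    if (preg_match('/^HTTP\/\d\.\d\s+(\d{3})/', $statusLine, $m)) {
        $status = (int) $m[1];
    }

    // Store header names in lowercase because they are case-insensitive.
    $headers = array();
    foreach ($headerLines as $line) {
        $pair = explode(':', $line, 2);
        if (count($pair) === 2) {
            $headers[strtolower(trim($pair[0]))] = trim($pair[1]);
        }
    }

    return array(
        'status'  => $status,
        'headers' => $headers,
        'body'    => isset($parts[1]) ? $parts[1] : '',
    );
}

$raw = "HTTP/1.1 200 OK\r\nContent-Type: text/html\r\nConnection: close\r\n\r\n<html>hi</html>";
$response = split_http_response($raw);
echo $response['status'], ' ', $response['headers']['content-type'], "\n";
```

In the socket loop, the lines read with fgets() would be appended to a string and passed to this function once feof() is reached.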

========================================================

Fetching article titles, content and the category list from Baidu Space with PHP
http://hi.baidu.com/antsnet/blog/item/7 ... 3c636

[code]<?php

// Get the categories
function getSort($url = 'http://hi.baidu.com/antsnet')
{
    $contents = file_get_contents($url);
    if ($contents)
    {
        //$contents = preg_replace("/[\r\n|\n]/", "", $contents);
        $exp_match = "/<div class=\"item\"><a href=\"\/antsnet\/blog\/category\/(.*)\" title=/";
        preg_match_all($exp_match, $contents, $match);
        return $match[1];
    }
    return array(); // nothing could be fetched
}
?>[/code]
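
Regular expressions like the one above break easily when the markup changes. As an alternative sketch, the same links could be collected with PHP's DOM extension. The function name category_links() and the sample markup, including the category names, are assumptions modelled on the pattern in getSort(), not the real page.

```php
<?php
// Collect the href of every <a> directly inside a <div class="item">.
function category_links($html)
{
    $doc = new DOMDocument();
    // The @ operator silences warnings about imperfect real-world markup.
    @$doc->loadHTML($html);
    $xpath = new DOMXPath($doc);

    $links = array();
    foreach ($xpath->query('//div[@class="item"]/a') as $a) {
        $links[] = $a->getAttribute('href');
    }
    return $links;
}

// Hypothetical markup shaped like the fragment the regular expression expects.
$sample = '<div class="item"><a href="/antsnet/blog/category/Php" title="Php">Php</a></div>'
        . '<div class="item"><a href="/antsnet/blog/category/Linux" title="Linux">Linux</a></div>';

print_r(category_links($sample));
```

The XPath query only matches elements whose class attribute is exactly "item", so it would need adjusting for elements carrying several classes.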

Getting the article list and descriptions
[code]<?php
function getDid($url){
    // Prepend the scheme if the URL does not already include it
    if(stripos($url,"http://")===false){
        $url="http://".$url;
    }
    $exp_domain="/^http:\/\/(.*)[\.com|\.cn|\.org|\.com\.cn|\.net\.cn|\.org\.cn]{1}\//";// get the domain
    $exp_header="/<\/head>|<\/HEAD>/";
    preg_match($exp_domain,$url,$match);
    $DOMAIN="http://hi.baidu.com";
    $div_exp="/<div(.*)>(.*)(\r\n|\n)(.*)<\/div>/";
    $div_page="/<div(.*)id=\"page\">(.*)<\/div>/";
    $CONTENTS=@file_get_contents($url);
    if(!$CONTENTS){
        die("This URL does not exist.");
    }
    // Initialise the result containers so they are defined even when nothing matches
    $title=array();
    $sort=array();
    $page="";
    preg_match_all($div_exp,$CONTENTS,$match);
    preg_match_all($div_page,$CONTENTS,$myPage);
    foreach($match[0] as $m){
        if(stristr($m,"class=\"tit\"")!=false){
            $m=str_replace("/antsnet","?act=article&path=/antsnet",$m);
            //$m=str_replace("a href=","a target=\"_blank\" href=",$m);
            //$m=str_replace("title=","target=\"_blank\" title=",$m);
            $title[]=$m;
        }
        if(stristr($m,"class=\"item\"")){
            $m=str_replace("/antsnet","?act=article&?path=antsnet",$m);
            //$m=str_replace("a href=","a target=\"_blank\" href=",$m);
            //$m=str_replace("title=","target=\"_blank\" title=",$m);
            $sort[]=$m;
        }
    }
    // Only rewrite the pager when one was actually found
    if(!empty($myPage[0][0])){
        $page=str_replace("/antsnet","?act=Page&index=/antsnet",$myPage[0][0]);
    }
    array_shift($title);
    if(sizeof($title)==0){
        header("Location: index.php");
        exit();
    }
    $return["title"]=$title;
    $return["sort"]=$sort;
    $return["page"]=$page;
    return $return;
}
?>[/code]

Getting the article content
[code]<?php
function getArticleContents($url = 'http://hi.baidu.com/antsnet/blog/item/f1fefbdc5df36aa4cc1166d8.html')
{
$contents = preg_replace("/[\r\n]/", "", file_get_contents($url)); // strip line breaks; the old class [\r\n|\n] also removed "|" characters
$exp_title = "/<title>(.*)<\/title>/";
$exp_match = "/<table style=\"table-layout:fixed\">(.*)<\/tr><\/table><br>/";
preg_match_all($exp_match, $contents, $match);
preg_match_all($exp_title, $contents, $titles);
return '<p><span style="color:red;font-size:15px;">' . str_replace('_Antsnet.net', '', $titles[1][0]) . '     [snatch at : '.date('Y-m-d H:i:s').']</span></p>' . $match[0][0];
}
?>[/code]
========================================================

Detecting search engine spiders with PHP
http://hi.baidu.com/piea/blog/item/ff82 ... 08903

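Search engine spiders visit a site by fetching its pages remotely. We cannot use JS code to obtain a spider's agent information, but we can use an image tag instead, which gives us the spider's user agent. By analysing the agent string we can determine what kind of spider it is, along with other attributes, and then record the visits in a database or a text file for statistics.
Below are my program and its source code:
Database structure:
#
# Table structure for `naps_stats_bot`
#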
[code]CREATE TABLE `naps_stats_bot` (
`botid` int(10) unsigned NOT NULL auto_increment,
`botname` varchar(100) NOT NULL default '',
`botagent` varchar(200) NOT NULL default '',
`bottag` varchar(100) NOT NULL default '',
`botcount` int(11) NOT NULL default '0',
`botlast` datetime NOT NULL default '0000-00-00 00:00:00',
`botlasturl` varchar(250) NOT NULL default '',
UNIQUE KEY `botid` (`botid`),
KEY `botname` (`botname`)
) ENGINE=MyISAM AUTO_INCREMENT=9 ;[/code]
#
# Dumping data for table `naps_stats_bot`
#
[code]INSERT INTO `naps_stats_bot` VALUES (1, 'Googlebot', 'Googlebot/2.X(+http://www.googlebot.com/bot.html)', 'googlebot', 0, '0000-00-00 00:00:00','');
INSERT INTO `naps_stats_bot` VALUES (2, 'MSNbot', 'MSNBOT/0.1(http://search.msn.com/msnbot.htm)', 'msnbot', 0, '0000-00-00 00:00:00', '');
INSERT INTO `naps_stats_bot` VALUES (3, 'Inktomi Slurp', 'Slurp/2.0', 'slurp',0, '0000-00-00 00:00:00', '');
INSERT INTO `naps_stats_bot` VALUES (4, 'Baiduspider','Baiduspider+(+http://www.baidu.com/search/spider.htm)', 'baiduspider', 0,'0000-00-00 00:00:00', '');
INSERT INTO `naps_stats_bot` VALUES (5, 'Yahoobot','Mozilla/5.0+(compatible;+Yahoo!+Slurp;+http://help.yahoo.com/help/us/ysearch/slurp)','slurp', 0, '0000-00-00 00:00:00', '');
INSERT INTO `naps_stats_bot` VALUES (6, 'Sohubot', 'sohu-search','sohu-search', 0, '0000-00-00 00:00:00', '');
INSERT INTO `naps_stats_bot` VALUES (7, 'Lycos', 'Lycos/x.x', 'lycos', 0,'0000-00-00 00:00:00', '');
INSERT INTO `naps_stats_bot` VALUES (8, 'Robozilla', 'Robozilla/1.0','robozilla', 0, '0000-00-00 00:00:00', '');[/code]

PHP program:
[code]
<?PHP
/***************************************************************************
* NAPS -- Network Article Publish System
* ----------------------------------------------
* bot.php
* -------------------
* begin : 2004-08-15
* copyright :(C) 2004 week9
*
***************************************************************************/
/***************************************************************************
*
* This program is free software; you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation; either version 2 of the License.
*
***************************************************************************/
/***************************************************************************
*
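* NAPS is free software. You may, and must, copy, modify and distribute NAPS
* in accordance with the GNU GPL (GNU General Public License). Derivative
* distributions based on NAPS do not require authorization from 飘飘.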
*
***************************************************************************/
error_reporting(E_ALL & ~E_NOTICE);
function get_naps_bot()
{
$useragent = strtolower($_SERVER['HTTP_USER_AGENT']);

if (strpos($useragent, 'googlebot')!== false){
return 'Googlebot';
}

if (strpos($useragent, 'msnbot') !==false){
return 'MSNbot';
}

if (strpos($useragent, 'slurp') !==false){
return 'Yahoobot';
}

if (strpos($useragent,'baiduspider') !== false){
return 'Baiduspider';
}

if (strpos($useragent,'sohu-search') !== false){
return 'Sohubot';
}

if (strpos($useragent, 'lycos') !==false){
return 'Lycos';
}

if (strpos($useragent, 'robozilla')!== false){
return 'Robozilla';
}
return false;
}
$tlc_thispage = addslashes($_SERVER['HTTP_USER_AGENT']);
// Record the spider's visit
$searchbot = get_naps_bot();
if ($searchbot) {
$DB_naps->query("UPDATE naps_stats_bot SET botcount=botcount+1, botlast=NOW(), botlasturl='$tlc_thispage' WHERE botname='$searchbot'");
}
?>[/code]
How to include it:
[code]
<img src="./bot.php" width="0" height="0">[/code]
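
The chain of if statements in get_naps_bot() could also be written as a lookup table, which makes adding a new spider a one-line change. This is only a sketch of that idea; the function name detect_bot() is made up, and the tags and names are copied from the program above.

```php
<?php
// Map lowercase user-agent fragments to spider names, mirroring get_naps_bot().
function detect_bot($userAgent)
{
    $bots = array(
        'googlebot'   => 'Googlebot',
        'msnbot'      => 'MSNbot',
        'slurp'       => 'Yahoobot',
        'baiduspider' => 'Baiduspider',
        'sohu-search' => 'Sohubot',
        'lycos'       => 'Lycos',
        'robozilla'   => 'Robozilla',
    );

    // Compare in lowercase so the match is case-insensitive.
    $userAgent = strtolower($userAgent);
    foreach ($bots as $tag => $name) {
        if (strpos($userAgent, $tag) !== false) {
            return $name;
        }
    }
    return false;
}

echo detect_bot('Baiduspider+(+http://www.baidu.com/search/spider.htm)'), "\n";
```

The table could equally be loaded from the bottag and botname columns of naps_stats_bot, so that the code and the database stay in step.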

Browsers and groups of users can be classified in the same way.

With a few small changes, this becomes a simple site access log program.
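
As a rough sketch of what such a log might look like when written to a text file rather than a database: the function name format_log_line(), the field layout and the log file location are all assumptions made for this example.

```php
<?php
// Build one log line: timestamp, client address, requested URL, user agent.
function format_log_line(array $server, $timestamp)
{
    $fields = array(
        date('Y-m-d H:i:s', $timestamp),
        isset($server['REMOTE_ADDR']) ? $server['REMOTE_ADDR'] : '-',
        isset($server['REQUEST_URI']) ? $server['REQUEST_URI'] : '-',
        isset($server['HTTP_USER_AGENT']) ? $server['HTTP_USER_AGENT'] : '-',
    );
    // Tabs separate the fields, so replace tabs and line breaks inside values.
    foreach ($fields as $i => $field) {
        $fields[$i] = preg_replace('/[\t\r\n]+/', ' ', $field);
    }
    return implode("\t", $fields) . "\n";
}

// Append to a log file. FILE_APPEND adds to the end of the file and
// LOCK_EX guards against interleaved writes from concurrent requests.
$logFile = sys_get_temp_dir() . '/access_demo.log';
file_put_contents($logFile, format_log_line($_SERVER, time()), FILE_APPEND | LOCK_EX);
```

Each request adds one tab-separated line, which can later be counted or grouped by user agent.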