Forums / General / How to stop spiders?

How to stop spiders?

Author Message

Luis Muñoz

Friday 18 March 2005 7:28:27 am

Any idea about how to stop bad spiders from entering the site? By bad spiders I mean the ones which collect email addresses. The problem is that some of them go so fast that they can make the server stop working if they attack during peak hours. It would also be good to protect people's email addresses. Masking email addresses is a bit useless; spiders learn so fast that masking becomes pointless.

Any idea would be appreciated.
Thanks,
Luis.

Lex 007

Friday 18 March 2005 7:38:28 am

Hello

You can use the wash operator on your emails to obfuscate them.

See this post : http://www.ez.no/community/forum/setup_design/obfuscate_email_addresses

Lex

Luis Muñoz

Friday 18 March 2005 8:03:10 am

My main problem isn't obfuscating email addresses. My main problem is the spider crawling the site at the maximum speed the server supports, which produces a slow site or even blocks the server.

Łukasz Serwatka

Friday 18 March 2005 8:16:54 am

Hi Luis,

Here you can find some info about spider control:
http://www.searchengineworld.com/robots/robots_tutorial.htm

Personal website -> http://serwatka.net
Blog (about eZ Publish) -> http://serwatka.net/blog

Jonathan Dillon-Hayes

Wednesday 23 March 2005 2:57:42 am

Have to second the robots.txt file idea. Basically, you're left with either:

1) hope that they stop
2) put a robots.txt file in there and hope that they stop.

I would do 2.

Jonathan

---------
FireBright provides advanced eZ deployment with root access
http://www.FireBright.com/

Lex 007

Wednesday 23 March 2005 5:42:58 am

Unfortunately, I don't think the robots.txt would do anything.

If I were a spider programmer, the first steps in my program would be:
- check if there is a robots.txt
- then immediately visit the "forbidden" folders, because they must be the most interesting ones ...

Probably if you obfuscate all your e-mail addresses, the spiders won't come back, because they won't see anything interesting on your site.

Tony Wood

Wednesday 23 March 2005 9:32:44 am

Hi Luis,

You might want to log the IP addresses of the <i>bad</i> spiders and then block them. This will only work, however, on spiders that have a fixed IP.
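For example, once an offending address shows up in your access log, blocking it from .htaccess could look like this (the IP below is a placeholder, not a known spider):

```apache
# .htaccess -- block a spider by IP once you've spotted it in the logs
<Limit GET POST>
  Order Allow,Deny
  Allow from all
  Deny from 192.0.2.10
</Limit>
```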

I hope this helps

Tony

Tony Wood : twitter.com/tonywood
Vision with Technology
Experts in eZ Publish consulting & development

Power to the Editor!

Free eZ Training : http://www.VisionWT.com/training
eZ Future Podcast : http://www.VisionWT.com/eZ-Future

Harry Oosterveen

Thursday 24 March 2005 2:02:57 pm

How do you recognize a bad spider? If you can recognize it from the HTTP information, add the following lines to your .htaccess file, or to the Apache httpd.conf if you have access to it.

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (badspider)
RewriteRule !^nocrawl\.html$ /nocrawl.html [L]

'badspider' is a regular expression matching the reported user agent of the bad spider. 'nocrawl.html' is simply a short html page with no links.

Alternatively, you can add the following code in the beginning of the index.php file:

if( preg_match( '/badspider/', $_SERVER['HTTP_USER_AGENT']))
  die( 'Go away' );

Harry Oosterveen

Friday 25 March 2005 4:44:01 am

You can find more info on http://ezinearticles.com/?Invasion-of-the-Email-Snatchers&id=20846. That page also has a list of bad spiders, applied to the .htaccess method I mentioned above. To apply the list to the PHP code for index.php, use this:

$badspiders = array( 
  'EmailSiphon',
  'EmailWolf',
  'ExtractorPro',
  'Mozilla.*NEWT',
  'Crescent',
  'CherryPicker',
  '[Ww]eb[Bb]andit',
  'WebEMailExtrac.*',
  'NICErsPRO',
  'Telesoft',
  'Zeus.*Webster',
  'Microsoft.URL',
  'Mozilla/3.Mozilla/2.01',
  'EmailCollector' );
	
$regex = '#^(' . join( '|', $badspiders ) . ')#'; // use '#' as delimiter: some patterns contain '/'

if( preg_match( $regex, $_SERVER['HTTP_USER_AGENT'])) {
  die( 'Go away' );
}

Note that new robots will keep appearing, so you will have to update this list regularly.

Jonathan Dillon-Hayes

Monday 28 March 2005 2:14:07 am

There is a much easier way...

Just list a directory at the top of your robots.txt file, and put a script in it that adds whoever visits it to a ban list. That way, if a spider ignores robots.txt, it gets locked out as soon as it starts to investigate. Your human traffic will be unaffected.

You could easily adapt that PHP code into a simple script to do it. Just add a database handler and a three-column table with id, name, and ip.
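A minimal file-based sketch of that trap idea (the filenames and the ban-file path are made up for illustration; a real setup would use the database table described above):

```php
<?php
// Illustrative honeypot helpers around a flat ban file.

function ban_ip( $banFile, $ip )
{
    // Append the offending IP to the ban list, one address per line.
    file_put_contents( $banFile, $ip . "\n", FILE_APPEND );
}

function is_banned( $banFile, $ip )
{
    if ( !file_exists( $banFile ) )
        return false;
    $banned = file( $banFile, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES );
    return in_array( $ip, $banned );
}

// trap.php -- placed in the directory that robots.txt "disallows";
// any client requesting it has ignored robots.txt, so ban it:
//   ban_ip( '/path/to/banlist.txt', $_SERVER['REMOTE_ADDR'] );

// index.php -- very early in the request, refuse banned clients:
//   if ( is_banned( '/path/to/banlist.txt', $_SERVER['REMOTE_ADDR'] ) )
//       die( 'Go away' );
```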

J

---------
FireBright provides advanced eZ deployment with root access
http://www.FireBright.com/

Eivind Marienborg

Monday 28 March 2005 4:04:28 am

Your problem is that spiders visit your site at the wrong hours of the day, draining your system's resources. How about a script that replaces the robots.txt file at different times of day? For example, letting robots search your site at night and banning them all during the daytime.
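A sketch of that idea (the hour boundaries and the rewrite rule are arbitrary assumptions, not a tested setup): serve robots.txt from a small PHP script instead of swapping files on disk.

```php
<?php
// robots.php -- serve different robots.txt content depending on the hour.
// Could be mapped onto /robots.txt with something like:
//   RewriteRule ^robots\.txt$ /robots.php [L]

function robots_txt_for_hour( $hour )
{
    // Night (22:00-05:59): allow well-behaved robots to crawl everything.
    if ( $hour >= 22 || $hour < 6 )
        return "User-agent: *\nDisallow:\n";
    // Daytime peak hours: ask all robots to stay away.
    return "User-agent: *\nDisallow: /\n";
}

header( 'Content-Type: text/plain' );
echo robots_txt_for_hour( (int) date( 'G' ) );
```

Of course this only slows down robots that actually honour robots.txt; for the rude ones you still need the IP-banning ideas above.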
