Go to file
2006-04-12 04:59:27 +00:00
cleanup.php Live fix: improved reporting 2006-04-02 03:50:06 +00:00
README load_lists is no longer required 2005-10-31 05:31:57 +00:00
SpamBlacklist.php * Support for $wgExtensionCredits 2005-08-26 14:33:40 +00:00
SpamBlacklist_body.php (bug 5185) Strip out SGML comments before scanning the text for matches so some nutter can't circumvent the lot with a well placed <!-- --> 2006-04-12 04:59:27 +00:00

MediaWiki extension: SpamBlacklist 
----------------------------------

SpamBlacklist is a simple edit filter extension. When someone tries to save the
page, it checks the text against a potentially very large list of "bad"
hostnames. If there is a match, it displays an error message to the user and 
refuses to save the page. 

To enable it, first download a copy of the SpamBlacklist directory and put it
into your extensions directory. Then put the following at the end of your 
LocalSettings.php:

require_once( "$IP/extensions/SpamBlacklist/SpamBlacklist.php" );

The list of bad URLs can be drawn from multiple sources. These sources are
configured with the $wgSpamBlacklistFiles global variable. This global variable
can be set in LocalSettings.php, AFTER including SpamBlacklist.php. 

$wgSpamBlacklistFiles is an array, each value containing either a URL, a filename 
or a database location. Specifying a database location allows you to draw the
blacklist from a page on your wiki. The format of the database location
specifier is "DB: <db name> <title>".

Example:

require_once( "$IP/extensions/SpamBlacklist/SpamBlacklist.php" );
$wgSpamBlacklistFiles = array(
	"$IP/extensions/SpamBlacklist/wikimedia_blacklist", // Wikimedia's list

//          database    title
	"DB: wikidb My_spam_blacklist",    
);

File format
-----------

In simple terms:
   * Everything from a "#" character to the end of the line is a comment
   * Every non-blank line is a regex fragment which will only match inside URLs

Internally, a regex is formed which looks like this:

   !http://[a-z0-9\-.]*(line 1|line 2|line 3|....)!Si

A few notes about this format. It's not necessary to add www to the start of
hostnames, the regex is designed to match any subdomain. Don't add patterns
to your file which may run off the end of the URL, e.g. anything containing 
".*". Unlike in some similar systems, the line-end metacharacter "$" will not
assert the end of the hostname, it'll assert the end of the page.

Performance
-----------

This extension uses a small "loader" file, to avoid loading all the code on 
every page view. This means that page view performance will not be affected 
even if you are not running a PHP bytecode cache such as Turck MMCache. Note 
that a bytecode cache is strongly recommended for any MediaWiki installation.

The regex match itself generally adds an insignificant overhead to page saves,
on the order of 100ms in our experience. However loading the spam file from disk
or the database, and constructing the regex, may take a significant amount of
time depending on your hardware. If you find that enabling this extension slows
down saves excessively, try installing MemCached or another supported data
caching solution. The SpamBlacklist extension will cache the constructed regex 
if such a system is present. 

Stability
---------

This extension has not been widely tested outside Wikimedia. Although it has
been in production on Wikimedia websites since December 2004, it should be 
considered experimental. Its design is simple, with little input validation, so
unexpected behaviour due to incorrect regular expression input or non-standard 
configuration is entirely possible.

Obtaining or making blacklists
------------------------------

The primary source for a MediaWiki-compatible blacklist file is the Wikimedia
spam blacklist on meta:

    http://meta.wikimedia.org/wiki/Spam_blacklist

In the default configuration, the extension loads this list from our site 
once every 10-15 minutes.

The Wikimedia spam blacklist can only be edited by trusted administrators. 
Wikimedia hosts large, diverse wikis with many thousands of external links, 
hence the Wikimedia blacklist is comparatively conservative in the links it 
blocks. You may want to add your own keyword blocks or even ccTLD blocks. 
You may suggest modifications to the Wikimedia blacklist at:

    http://meta.wikimedia.org/wiki/Talk:Spam_blacklist

To make maintenance of local lists easier, you may wish to add a DB: source to
$wgSpamBlacklistFiles and hence create a blacklist on your wiki. If you do this,
it is strongly recommended that you protect the page from general editing.
Besides the obvious danger that someone may add a regex that matches everything,
please note that an attacker with the ability to input arbitrary regular
expressions may be able to generate segfaults in the PCRE library.

Copyright
---------
This extension and this documentation was written by Tim Starling and is 
ambiguously licensed.