MediaWiki extension: SpamBlacklist
----------------------------------

SpamBlacklist is a simple edit filter extension. When someone tries to save the
page, it checks the text against a potentially very large list of "bad"
hostnames. If there is a match, it displays an error message to the user and
refuses to save the page.

To enable it, first download a copy of the SpamBlacklist directory and put it
into your extensions directory. Then put the following at the end of your
LocalSettings.php:

require_once( "$IP/extensions/SpamBlacklist/SpamBlacklist.php" );

The list of bad URLs can be drawn from multiple sources. These sources are
configured with the $wgSpamBlacklistFiles global variable. This global variable
can be set in LocalSettings.php, AFTER including SpamBlacklist.php.

$wgSpamBlacklistFiles is an array; each value contains either a URL, a filename,
or a database location. Specifying a database location allows you to draw the
blacklist from a page on your wiki. The format of the database location
specifier is "DB: <db name> <title>".

Example:

require_once( "$IP/extensions/SpamBlacklist/SpamBlacklist.php" );
$wgSpamBlacklistFiles = array(
"$IP/extensions/SpamBlacklist/wikimedia_blacklist", // Wikimedia's list
// database title
"DB: wikidb My_spam_blacklist",
);

The local pages [[MediaWiki:Spam-blacklist]] and [[MediaWiki:Spam-whitelist]]
will always be used, whatever additional files are listed.

Compatibility
-------------

This extension is primarily maintained to run on the latest release version
of MediaWiki (1.9.x as of this writing) and development versions.

The current version *may* work as far back as 1.6.x, but will *not* work
with 1.5.x or older. You may be able to dig compatible older versions out of
the Subversion repository, but note that old versions of the code cannot
handle a blacklist as large as Wikimedia's, so using that file with them is
likely to fail.

File format
-----------

In simple terms:
* Everything from a "#" character to the end of the line is a comment
* Every non-blank line is a regex fragment which will only match inside URLs
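
For example, a small blacklist file might look like this (the entries are
invented for illustration):

  # Casino spammers, added 2005-06-01
  online-casino-example\.com
  # Keyword block: matches any hostname containing this string
  cheap-pills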

Internally, a regex is formed which looks like this:

!http://[a-z0-9\-.]*(line 1|line 2|line 3|....)!Si

A few notes about this format. It's not necessary to add www to the start of
hostnames; the regex is designed to match any subdomain. Don't add patterns
to your file which may run off the end of the URL, e.g. anything containing
".*". Unlike in some similar systems, the line-end metacharacter "$" will not
assert the end of the hostname; it asserts the end of the page.
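
As a rough sketch of the resulting check (not the extension's actual code),
assume the blacklist contains the single line "cheap-pills":

  $regex = '!http://[a-z0-9\-.]*(cheap-pills)!Si';
  $text = 'Spam link: http://www.cheap-pills-example.com/order';
  if ( preg_match( $regex, $text ) ) {
      // A blacklisted link was found; the save would be refused.
  }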

Performance
-----------

This extension uses a small "loader" file to avoid loading all the code on
every page view. This means that page view performance will not be affected
even if you are not running a PHP bytecode cache such as Turck MMCache. Note
that a bytecode cache is strongly recommended for any MediaWiki installation.

The regex match itself generally adds an insignificant overhead to page saves,
on the order of 100ms in our experience. However, loading the spam file from
disk or the database, and constructing the regex, may take a significant
amount of time depending on your hardware. If you find that enabling this
extension slows down saves excessively, try installing memcached or another
supported data caching solution. The SpamBlacklist extension will cache the
constructed regex if such a system is present.
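
For example, a minimal LocalSettings.php sketch for enabling memcached (the
server address is a placeholder; adjust it for your setup):

  $wgMainCacheType = CACHE_MEMCACHED;
  $wgMemCachedServers = array( '127.0.0.1:11211' );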

Caching behavior
----------------

Blacklist files loaded from remote web sites are cached locally, in the cache
subsystem used for MediaWiki's localization. (This usually means the
objectcache table on a default install.)

By default, the list is cached for 15 minutes (if successfully fetched) or
10 minutes (if the network fetch failed), after which it will be fetched again
when next requested. This should strike a decent balance between staying up to
date and avoiding too-frequent fetches on a busy site.

Fully-processed blacklist data may be cached in memcached or another shared
memory cache if one has been configured in MediaWiki.

Stability
---------

This extension has not been widely tested outside Wikimedia. Although it has
been in production on Wikimedia websites since December 2004, it should be
considered experimental. Its design is simple, with little input validation, so
unexpected behaviour due to incorrect regular expression input or non-standard
configuration is entirely possible.

Obtaining or making blacklists
------------------------------

The primary source for a MediaWiki-compatible blacklist file is the Wikimedia
spam blacklist on meta:

http://meta.wikimedia.org/wiki/Spam_blacklist

In the default configuration, the extension loads this list from our site
once every 10-15 minutes.

The Wikimedia spam blacklist can only be edited by trusted administrators.
Wikimedia hosts large, diverse wikis with many thousands of external links,
hence the Wikimedia blacklist is comparatively conservative in the links it
blocks. You may want to add your own keyword blocks or even ccTLD blocks.
You may suggest modifications to the Wikimedia blacklist at:

http://meta.wikimedia.org/wiki/Talk:Spam_blacklist

To make maintenance of local lists easier, you may wish to add a DB: source to
$wgSpamBlacklistFiles and hence create a blacklist on your wiki. If you do this,
it is strongly recommended that you protect the page from general editing.
Besides the obvious danger that someone may add a regex that matches everything,
please note that an attacker with the ability to input arbitrary regular
expressions may be able to generate segfaults in the PCRE library.

Whitelisting
------------

You may sometimes find that a site listed in a centrally-maintained blacklist
contains something you nonetheless want to link to.

A local whitelist can be maintained by creating a [[MediaWiki:Spam-whitelist]]
page and listing hostnames in it, using the same format as the blacklists.
URLs matching the whitelist will be ignored locally.
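
For example, a [[MediaWiki:Spam-whitelist]] page might contain (the hostname
is invented for illustration):

  # Blacklisted upstream, but we trust this particular site
  goodsite-example\.com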

Copyright
---------

This extension and this documentation were written by Tim Starling and are
ambiguously licensed.