
User:ClueBot NG/Documentation: Difference between revisions

Article snapshot taken from Wikipedia under the Creative Commons Attribution-ShareAlike license.
Revision as of 21:36, 3 November 2010 by Crispy1989 (434 edits). No edit summary.
Revision as of 10:21, 14 November 2010 by 98.222.57.24. Edit summary: Nobody is looking at the config, so people can ask individually for it. Also, dataset review interface instructions have been moved to the interface itself.
Line 1: Line 1:
=== Core Configuration ===


Cluebot-NG's core vandalism detection engine is extensively configurable at run-time using configuration files. The effectiveness and accuracy of the bot depend largely on these config files. You are encouraged to look through them and suggest additions and improvements. Among other things, they include regexes, metric specifications, and words to search for.

To understand these config files, you must read them thoroughly and in order. The comments explain everything you need, but there is no redundant information: anything explained in an earlier comment is not explained again.

Here are the configuration files, in the order you should read them:

# - This is the main configuration file. It includes the other files.
# - Contains basic edit processors including filters, metrics, and word set operations.
# - Word categories and lists. Not for sensitive eyes.
# - Contains Bayesian and ANN edit processors.
# - File generated by script that contains expressions to generate ANN inputs.
# - File generated by script that contains list of ANN inputs.
# - Edit processors related to overall running.
# - Edit processors creating output.
# - Miscellaneous configuration, not edit processors.
# - Script that generates ann_input_expressions.conf and ann_input_list.conf

To suggest improvements or additions, contact ] by email, IRC, or user talk page.

== Team ==
* Christopher Breneman &mdash; {{user|Crispy1989}} &mdash; wrote and maintains the core engine and core configuration.
Line 48: Line 26:
* Selecting a threshold to hold false positives at a maximal rate of 0.25%, the bot catches approximately 63% of all vandalism.


<!--

== Dataset Review Interface ==
One of the keys to Cluebot-NG functioning well is its dataset. The larger and more accurate its dataset is, the better it will function, with fewer false positives and more caught vandalism. It's impossible for just a few people to manually review the thousands of edits necessary, so Cobi wrote a dataset review interface to allow people to review edits and classify them as vandalism or constructive.
Line 69: Line 47:


The review interface can be found . To gain access, email me or contact me somehow, and give me your Google ID. Please thoroughly read the instructions before starting.


=== Core Configuration ===


Cluebot-NG's core vandalism detection engine is extensively configurable at run-time using configuration files. The effectiveness and accuracy of the bot depend largely on these config files. You are encouraged to look through them and suggest additions and improvements. Among other things, they include regexes, metric specifications, and words to search for.

To understand these config files, you must read them thoroughly and in order. The comments explain everything you need, but there is no redundant information: anything explained in an earlier comment is not explained again.

Here are the configuration files, in the order you should read them:

# - This is the main configuration file. It includes the other files.
# - Contains basic edit processors including filters, metrics, and word set operations.
# - Word categories and lists. Not for sensitive eyes.
# - Contains Bayesian and ANN edit processors.
# - File generated by script that contains expressions to generate ANN inputs.
# - File generated by script that contains list of ANN inputs.
# - Edit processors related to overall running.
# - Edit processors creating output.
# - Miscellaneous configuration, not edit processors.
# - Script that generates ann_input_expressions.conf and ann_input_list.conf

To suggest improvements or additions, contact ] by email, IRC, or user talk page.
-->

Revision as of 10:21, 14 November 2010

Team

  • Christopher Breneman — Crispy1989 (talk · contribs) — wrote and maintains the core engine and core configuration.
  • Cobi Carter — Cobi (talk · contribs) — wrote and maintains the Wikipedia interface code and dataset review interface.
  • Tim — Tim1357 (talk · contribs) — wrote some of the dataset generation code and maintains the training dataset.

Questions, comments, contributions, and suggestions regarding:

  • the core engine, algorithms, and configuration should be directed to Crispy1989 (talk · contribs).
  • the bot's operation, whitelists, and interface to Wikipedia should be directed to Cobi (talk · contribs).
  • the bot's dataset should be directed to Tim1357 (talk · contribs).


Languages

  • C / C++ — The core is written in C/C++ from scratch.
  • PHP — The bot shell (Wikipedia interface) is written in PHP, and shares some code with the original ClueBot.
  • Python — Some of the dataset management tools are written in Python.
  • Bash — A few Bash scripts make it easier to train and maintain the bot.
  • Java — The dataset review interface is written in Java using the Google App framework.

Statistics

Because Cluebot-NG requires a dataset to function, the same dataset can be used to give fairly accurate statistics on its accuracy and operation. Separate parts of the dataset are used for training and for trialing, so these statistics are not biased by edits the bot was trained on.
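
The idea of keeping training and trial data apart can be illustrated with a minimal Python sketch. This is not the bot's actual tooling: the function name `split_dataset`, the 25% trial fraction, and the fixed seed are all assumptions made for illustration. It only shows that every edit lands in exactly one of the two sets, so accuracy measured on the trial set is not inflated by edits seen during training.

```python
import random


def split_dataset(edits, trial_fraction=0.25, seed=0):
    """Shuffle the edits and split them into disjoint (training, trial) lists.

    Hypothetical helper: the fraction and seed are illustrative defaults,
    not values taken from Cluebot-NG.
    """
    rng = random.Random(seed)      # fixed seed keeps the split reproducible
    shuffled = list(edits)
    rng.shuffle(shuffled)
    n_trial = int(len(shuffled) * trial_fraction)
    trial = shuffled[:n_trial]     # held out, used only for measuring accuracy
    training = shuffled[n_trial:]  # used only for training
    return training, trial


training_set, trial_set = split_dataset(range(100))
print(len(training_set), len(trial_set))
```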

The exact statistics change and improve frequently as we update the bot. Currently:

  • Selecting a threshold to optimize total accuracy, the bot correctly classifies over 90% of edits.
  • Selecting a threshold to hold false positives at a maximal rate of 0.25%, the bot catches approximately 63% of all vandalism.
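
The two threshold choices above can be sketched in Python. This is an illustrative sketch under stated assumptions, not the bot's code: it assumes each trial edit has a vandalism score where higher means more likely vandalism, that an edit is flagged when its score is at or above the threshold, and that the false positive rate is the fraction of constructive edits that get flagged. The function names and the toy data are hypothetical.

```python
from typing import List, Optional, Tuple


def rates_at(scores: List[float], labels: List[bool], threshold: float):
    """Return (false_positive_rate, catch_rate, accuracy) at a threshold.

    labels[i] is True when edit i is vandalism.
    """
    n_vandal = sum(labels)
    n_good = len(labels) - n_vandal
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
    fpr = fp / n_good if n_good else 0.0
    catch = tp / n_vandal if n_vandal else 0.0
    accuracy = (tp + (n_good - fp)) / len(labels)
    return fpr, catch, accuracy


def threshold_for_max_fpr(scores, labels, max_fpr) -> Optional[Tuple[float, float]]:
    """Lowest threshold whose false positive rate stays within max_fpr.

    Returns (threshold, catch_rate), or None if no threshold qualifies.
    """
    best = None
    # Lowering the threshold can only raise the false positive rate,
    # so scan from strictest to loosest and stop at the first violation.
    for t in sorted(set(scores), reverse=True):
        fpr, catch, _ = rates_at(scores, labels, t)
        if fpr > max_fpr:
            break
        best = (t, catch)
    return best


def threshold_for_best_accuracy(scores, labels) -> Tuple[float, float]:
    """Threshold that maximizes overall accuracy; returns (threshold, accuracy)."""
    candidates = [(t, rates_at(scores, labels, t)[2]) for t in sorted(set(scores))]
    return max(candidates, key=lambda pair: pair[1])


# Toy trial set (made-up scores and labels, for illustration only).
SCORES = [0.95, 0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1, 0.05]
LABELS = [True, True, False, True, True, False, False, False, False, False]

print(threshold_for_max_fpr(SCORES, LABELS, 0.0))
print(threshold_for_best_accuracy(SCORES, LABELS))
```

On the toy data, allowing no false positives catches only half of the vandalism, while the accuracy-optimal threshold catches all of it at the cost of flagging one constructive edit. The figures above reflect the same trade-off on the real trial set.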