
User:ClueBot NG/Documentation: Difference between revisions

Article snapshot taken from Wikipedia, licensed under the Creative Commons Attribution-ShareAlike license.
Revision as of 19:07, 2 November 2010

Core Configuration

Cluebot-NG's core vandalism detection engine is extensively configurable at run-time using configuration files. The bot's effectiveness and accuracy depend largely on these config files. You are encouraged to look through them and suggest additions and improvements. Among other things, they include regexes, metric specifications, and words to search for.
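As a concrete illustration of the kind of metric these files can specify, here is a minimal sketch of a regex-driven word metric. The word list and function name are invented for this example; the real word categories live in word.conf and the real metrics are defined by the engine, not by this code.

```python
import re

# Hypothetical word category, in the spirit of word.conf (the actual lists differ).
VANDAL_WORDS = re.compile(r"\b(stupid|dumb|poop)\b", re.IGNORECASE)

def vandal_word_metric(added_text: str) -> float:
    """Fraction of added words matching a 'vandal word' category --
    one example of the kind of metric a config file could specify."""
    words = added_text.split()
    if not words:
        return 0.0
    hits = sum(1 for w in words if VANDAL_WORDS.search(w))
    return hits / len(words)

print(vandal_word_metric("this article is stupid and dumb"))  # 2 of 6 words match
```

A metric like this would be one of many inputs combined by the trained processors described below.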

To understand these config files, you must read them thoroughly and in order. Comments explain everything you need, but information is not repeated: if something has been explained in an earlier comment, it is not re-explained later.

Here are the configuration files, in the order you should read them:

  1. cluebotng.conf - This is the main configuration file. It includes the other files.
  2. static_processing.conf - Contains basic edit processors including filters, metrics, and word set operations.
  3. word.conf - Word categories and lists. Not for sensitive eyes.
  4. trained_processing.conf - Contains Bayesian and ANN edit processors.
  5. ann_input_expressions.conf - Generated by make_ann_input_expressions.sh; contains the expressions used to generate ANN inputs.
  6. ann_input_list.conf - Generated by make_ann_input_expressions.sh; contains the list of ANN inputs.
  7. main_running.conf - Edit processors related to overall running.
  8. outputs.conf - Edit processors creating output.
  9. misc.conf - Miscellaneous configuration, not edit processors.
  10. make_ann_input_expressions.sh - Script that generates ann_input_expressions.conf and ann_input_list.conf.
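The reading order above can be captured in a short sketch. The file names are taken from the list; the `#` comment syntax and the helper function are assumptions for illustration, not the engine's actual parser.

```python
# The nine run-time config files, in the documented reading order
# (the two ann_input_* files are generated by make_ann_input_expressions.sh).
CONFIG_ORDER = [
    "cluebotng.conf",
    "static_processing.conf",
    "word.conf",
    "trained_processing.conf",
    "ann_input_expressions.conf",
    "ann_input_list.conf",
    "main_running.conf",
    "outputs.conf",
    "misc.conf",
]

def strip_config_comments(text: str) -> list[str]:
    """Drop blank lines and '#' comment lines, keeping the directives.
    The comment syntax here is an assumption for this sketch."""
    lines = (ln.strip() for ln in text.splitlines())
    return [ln for ln in lines if ln and not ln.startswith("#")]
```

In a real run, one would open each file in CONFIG_ORDER and feed the stripped lines to the engine's parser.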

To suggest improvements or additions, contact User:Crispy1989 by email, IRC, or user talk page.

Team

  • Christopher Breneman — Crispy1989 (talk · contribs) — wrote and maintains the core engine and core configuration.
  • Cobi Carter — Cobi (talk · contribs) — wrote and maintains the Wikipedia interface code and the dataset review interface.
  • Tim — Tim1357 (talk · contribs) — wrote some of the dataset generation code and maintains the training dataset.

Questions, comments, contributions, and suggestions regarding:

  • the core engine, algorithms, and configuration should be directed to Crispy1989 (talk · contribs).
  • the bot's operation, whitelists, and interface to Wikipedia should be directed to Cobi (talk · contribs).
  • the bot's dataset should be directed to Tim1357 (talk · contribs).


Languages

  • C / C++ — The core is written in C/C++ from scratch.
  • PHP — The bot shell (Wikipedia interface) is written in PHP and shares some code with the original ClueBot.
  • Python — Some of the dataset management tools are written in Python.
  • Bash — A few scripts to make it easier to train and maintain the bot are Bash scripts.
  • Java — The dataset review interface is written in Java on Google App Engine.

Statistics

Because Cluebot-NG requires a dataset to function, that dataset can also be used to produce fairly accurate statistics on its accuracy and operation. Different parts of the dataset are used for training and for trialing, so these statistics are not biased.

The exact statistics change and improve frequently as we update the bot. Currently:

  • Selecting a threshold to optimize total accuracy, the bot correctly classifies over 90% of edits.
  • Selecting a threshold to hold false positives at a maximum rate of 0.25%, the bot catches approximately 63% of all vandalism.
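The two bullets above describe two operating points on the same score curve: raising the classification threshold lowers the false-positive rate but also lowers the amount of vandalism caught. A minimal sketch of how such a threshold could be chosen from a scored trial set follows; the data layout and function names are mine, and this is not the bot's actual code.

```python
def threshold_for_fp_rate(scored_edits, max_fp_rate=0.0025):
    """Lowest score threshold whose false-positive rate on the trial set
    stays at or below max_fp_rate (0.25% in the statistics above).
    scored_edits: iterable of (score, is_vandalism) pairs."""
    scored_edits = list(scored_edits)
    constructive = sum(1 for _, v in scored_edits if not v)
    for t in sorted({s for s, _ in scored_edits}):
        false_pos = sum(1 for s, v in scored_edits if s >= t and not v)
        if constructive == 0 or false_pos / constructive <= max_fp_rate:
            return t  # raising t further would only lower the catch rate
    return None

def catch_rate(scored_edits, threshold):
    """Fraction of vandalism scoring at or above the threshold."""
    vandal = [s for s, v in scored_edits if v]
    return sum(1 for s in vandal if s >= threshold) / len(vandal) if vandal else 0.0

# Toy trial set: (score, is_vandalism). With so few edits, only a
# false-positive-free threshold can satisfy the 0.25% cap.
edits = [(0.9, True), (0.8, True), (0.7, False), (0.6, True), (0.2, False)]
t = threshold_for_fp_rate(edits)  # 0.8: the lowest FP-free operating point here
```

On the real trial set, the same sweep over thousands of scored edits yields the quoted figures of a 0.25% false-positive cap and roughly 63% vandalism caught.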


Dataset Review Interface

One of the keys to Cluebot-NG functioning well is its dataset. The larger and more accurate its dataset is, the better it will function, with fewer false positives and more caught vandalism. It's impossible for just a few people to manually review the thousands of edits necessary, so Cobi wrote a dataset review interface that lets people review edits and classify them as vandalism or constructive.

This interface is used for a few things. First, it's used to make sure the dataset we already have is accurate. False positives and false negatives from the trial dataset are put in the review queue, because we've found that a small number of edits in the dataset are incorrectly classified, which causes problems in the bot's training and threshold calculations.

Random edits from Wikipedia may also be added to the review queue to grow the overall size of the dataset.

Classifying edits in this review interface can actually help Wikipedia more with your time than hunting vandalism directly. Hunting vandalism manually may catch a small fraction of a percent of the vandalism on Wikipedia; classifying edits in this interface may allow Cluebot-NG to catch 5% or more of additional vandalism.

To use the dataset review interface, you need a Google account, as the interface is built on Google App Engine. To be granted access to the interface, ask Cobi or Crispy1989. Before starting, please thoroughly review the directions below.

In the review interface, you will have a browser window showing Wikipedia articles, and a window sitting on top where you can classify edits. You can click links in the main browser window without interrupting the process. The window on top lets you classify each edit as Vandalism, Constructive, or Skip.

In general, if a human would classify an edit as vandalism, it should be classified as vandalism. Most other edits should be classified as constructive, with a few exceptions (and because many edits in the review queue are borderline, you may encounter these exceptions more often than you might expect). Skipping an edit excludes it from the dataset entirely. An edit may be skipped if:

  • it is borderline vandalism, and it would not be a big deal if the bot classified similar edits as vandalism in production;
  • you cannot tell whether or not it is vandalism; or
  • it is not vandalism, but is a very poor quality edit that shares some attributes of vandalism. Although very poor edits made in good faith technically should not be classified as vandalism, classifying them as constructive could interfere with the bot's training, so they should be skipped.
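The guidance above boils down to a small decision rule. Here is a sketch of it; the parameter names are my own shorthand and are not part of the interface.

```python
def review_label(looks_like_vandalism: bool,
                 borderline: bool = False,
                 unsure: bool = False,
                 good_faith_but_very_poor: bool = False) -> str:
    """Map the review guidance onto the three buttons."""
    # Any of the skip conditions excludes the edit from the dataset entirely.
    if borderline or unsure or good_faith_but_very_poor:
        return "Skip"
    return "Vandalism" if looks_like_vandalism else "Constructive"
```

For example, a very poor but good-faith edit gets "Skip" rather than "Constructive", so it cannot interfere with training.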

In some cases, the interface may ask "Are you sure?" when you select a result. If this happens, double-check that your classification is correct, then click Yes or No.

There is also a Comment box alongside the Vandalism, Constructive, and Skip buttons. It is optional. If you think there is something about the edit that the Cluebot-NG operators should know, such as an edit that is clearly constructive but may look like vandalism based on simple statistics, leave a comment, and the operators will take it into account.