User:ClueBot NG/Documentation: Difference between revisions


Revision as of 10:36, 14 November 2010 by 98.222.57.24 (talk): Some improvements to userpage.
Revision as of 10:58, 14 November 2010 by Crispy1989 (talk | contribs, 434 edits): No edit summary
Line 11:		Line 11:
	== Dataset Review Interface ==
	For the bot to be effective, the dataset needs to be expanded. Our current dataset has some degree of bias, as well as some inaccuracies. We need volunteers to help review edits and classify them as either vandalism or constructive. We hope to eventually completely replace our current dataset with a random sampling of edits, reviewed and classified by volunteers. A list of current contributors, more thorough instructions on how to use the interface, and the interface itself, are at the .

		⚫	== Statistics ==

		⚫	As Cluebot-NG requires a dataset to function, the dataset can also be used to give fairly accurate statistics on its accuracy and operation. Different parts of the dataset are used for training and trialing, so these statistics are not biased.

		⚫	The exact statistics change and improve frequently as we update the bot. Currently:
		⚫	* Selecting a threshold to optimize total accuracy, the bot correctly classifies over 90% of edits.
		⚫	* Selecting a threshold to hold false positives at a maximal rate of 0.25%, the bot catches approximately 63% of all vandalism.

			== Development News/Status ==

			=== Core Engine ===
			* Current version is working well.
			* '''Currently writing a dedicated wiki markup parser for more accurate markup-context-specific metrics.''' (No existing alternative parsers are complete or fast enough)

			=== Dataset Review Interface ===
			* Code to import edits into database is finished.
			* '''Currently changing logic that determines the end result for an edit.'''

			=== Dataset Status ===
			* We found that the Python dataset downloader we used to generate the training dataset does not generate data that is identical to the live downloader. It's possible that this is greatly reducing the effectiveness of the live bot. We're working on writing shared code for live downloading and dataset generation so we can regenerate the dataset.

	== Languages ==
Line 20:		Line 41:
	* ] — Some of the original dataset management and downloader tools were written in Python.

⚫	== Statistics ==

⚫	As Cluebot-NG requires a dataset to function, the dataset can also be used to give fairly accurate statistics on its accuracy and operation. Different parts of the dataset are used for training and trialing, so these statistics are not biased.

⚫	The exact statistics change and improve frequently as we update the bot. Currently:
⚫	* Selecting a threshold to optimize total accuracy, the bot correctly classifies over 90% of edits.
⚫	* Selecting a threshold to hold false positives at a maximal rate of 0.25%, the bot catches approximately 63% of all vandalism.

	<!--		<!--

Revision as of 10:58, 14 November 2010

Team

Christopher Breneman (Crispy1989, talk · contribs) wrote and maintains the core engine and core configuration.
Cobi Carter (Cobi, talk · contribs) wrote and maintains the Wikipedia interface code and dataset review interface.
Tim (Tim1357, talk · contribs) wrote the original dataset downloader code and scripts to generate portions of the original dataset.

Questions, comments, contributions, and suggestions regarding:

the core engine, algorithms, and configuration should be directed to Crispy1989 (talk · contribs).
the bot's interface to Wikipedia and the dataset review interface should be directed to Cobi (talk · contribs).
the bot's original dataset should be directed to Tim1357 (talk · contribs).

Dataset Review Interface

For the bot to be effective, the dataset needs to be expanded. Our current dataset has some degree of bias, as well as some inaccuracies. We need volunteers to help review edits and classify them as either vandalism or constructive. We hope eventually to replace our current dataset entirely with a random sample of edits, reviewed and classified by volunteers. A list of current contributors, more thorough instructions on how to use the interface, and the interface itself are available at the dataset review interface.

Statistics

Because ClueBot NG requires a dataset to function, the same dataset can be used to give fairly accurate statistics on the bot's accuracy and operation. Separate portions of the dataset are used for training and for trials, so the statistics are not skewed by measuring the bot on edits it was trained on.
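The separation between training and trial data can be illustrated with a minimal sketch. This is not the bot's actual tooling: the function name `split_dataset`, the 25% trial fraction, and the `(edit_id, is_vandalism)` record shape are all assumptions made for the example.

```python
import random


def split_dataset(edits, trial_fraction=0.25, seed=0):
    """Shuffle labeled edits and split them into disjoint training and trial sets.

    `edits` is any sequence of records; the fraction and seed are
    illustrative defaults, not values used by the real bot.
    """
    shuffled = list(edits)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_trial = int(len(shuffled) * trial_fraction)
    trial = shuffled[:n_trial]
    training = shuffled[n_trial:]
    return training, trial


# Hypothetical records: (edit_id, is_vandalism)
edits = [(edit_id, edit_id % 4 == 0) for edit_id in range(100)]
training, trial = split_dataset(edits)
print(len(training), len(trial))
```

Because no edit appears in both lists, statistics computed on `trial` describe edits the classifier has not seen during training.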

The exact statistics change and improve frequently as we update the bot. Currently:

With a threshold selected to maximize overall accuracy, the bot correctly classifies over 90% of edits.
With a threshold selected to hold the false positive rate at no more than 0.25%, the bot catches approximately 63% of all vandalism.
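
The two threshold choices above can be sketched as follows. This is an illustration under stated assumptions, not the bot's code: it assumes the classifier emits a score per edit where higher means more likely vandalism, and the function names and toy data are hypothetical.

```python
def rates_at(scores, labels, threshold):
    """Return (accuracy, false_positive_rate, catch_rate) when every edit
    scoring at or above `threshold` is flagged as vandalism."""
    tp = fp = tn = fn = 0
    for score, is_vandalism in zip(scores, labels):
        flagged = score >= threshold
        if flagged and is_vandalism:
            tp += 1
        elif flagged:
            fp += 1
        elif is_vandalism:
            fn += 1
        else:
            tn += 1
    accuracy = (tp + tn) / len(labels)
    fp_rate = fp / (fp + tn) if fp + tn else 0.0
    catch_rate = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, fp_rate, catch_rate


def candidate_thresholds(scores):
    # Every observed score, plus infinity (which flags nothing).
    return sorted(set(scores)) + [float("inf")]


def best_accuracy_threshold(scores, labels):
    """Threshold that maximizes overall accuracy (lowest one on ties)."""
    return max(candidate_thresholds(scores),
               key=lambda t: rates_at(scores, labels, t)[0])


def threshold_for_fp_cap(scores, labels, max_fp_rate):
    """Lowest threshold whose false positive rate is at most `max_fp_rate`.

    The false positive rate never rises as the threshold rises, so the
    lowest qualifying threshold also gives the highest catch rate.
    """
    for t in candidate_thresholds(scores):
        if rates_at(scores, labels, t)[1] <= max_fp_rate:
            return t


# Toy trial set: scores from a hypothetical classifier, True = vandalism.
scores = [0.05, 0.10, 0.20, 0.35, 0.60, 0.70, 0.80, 0.95]
labels = [False, False, False, True, False, True, True, True]

t_acc = best_accuracy_threshold(scores, labels)
t_cap = threshold_for_fp_cap(scores, labels, max_fp_rate=0.0025)
print(t_acc, rates_at(scores, labels, t_acc))
print(t_cap, rates_at(scores, labels, t_cap))
```

On this toy data the capped threshold is higher than the accuracy-optimal one: it flags fewer edits, trading some caught vandalism for fewer false positives, which is the same trade-off the two statistics describe.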

Development News/Status

Core Engine

Current version is working well.
Currently writing a dedicated wiki markup parser for more accurate markup-context-specific metrics. (No existing alternative parser is complete or fast enough.)

Dataset Review Interface

Code to import edits into database is finished.
Currently changing logic that determines the end result for an edit.

Dataset Status

We found that the Python dataset downloader we used to generate the training dataset does not produce data identical to what the live downloader produces. It's possible that this mismatch is greatly reducing the effectiveness of the live bot. We're working on writing shared code for live downloading and dataset generation so we can regenerate the dataset.
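
One way to spot this kind of mismatch is to compare, field by field, the record each downloader produces for the same edit. The sketch below is only an illustration: the function `diff_records` and the field names are hypothetical and do not reflect the real downloaders' output format.

```python
MISSING = "<missing>"


def diff_records(training_record, live_record):
    """Return {field: (training_value, live_value)} for every field whose
    value differs between the two records or is absent from one of them."""
    differences = {}
    for field in sorted(set(training_record) | set(live_record)):
        training_value = training_record.get(field, MISSING)
        live_value = live_record.get(field, MISSING)
        if training_value != live_value:
            differences[field] = (training_value, live_value)
    return differences


# Hypothetical records for the same edit from the two downloaders.
from_training_downloader = {
    "user": "Example",
    "comment": "fix typo",
    "page_age_days": 120,
}
from_live_downloader = {
    "user": "Example",
    "comment": "fix typo ",   # trailing space: a subtle difference
    "page_age_days": 120,
    "num_recent_edits": 3,    # field present only in live data
}

for field, (old, new) in diff_records(from_training_downloader,
                                      from_live_downloader).items():
    print(f"{field}: training={old!r} live={new!r}")
```

Sharing a single code path for both downloads, as described above, removes the need for such a check because both datasets then come from the same code.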

Languages

C / C++: The core is written in C/C++ from scratch.
PHP: The bot shell (Wikipedia interface) is written in PHP, and shares some code with the original ClueBot.
Java: The dataset review interface is written in Java using the Google App framework.
Bash: A few scripts that make it easier to train and maintain the bot are written in Bash.
Python: Some of the original dataset management and downloader tools were written in Python.