Revision as of 13:53, 5 November 2021 by NaomiAmethyst (→ClueBot NG on SqWiki: Reply); previous revision as of 10:52, 5 November 2021 by Klein Muçi (→ClueBot NG on SqWiki: Reply)
This user talk page is watched by friendly talk page watchers, which means that someone other than me might reply to your query. Their input is welcome and their help with messages that I cannot reply to quickly, or to facilitate communication when it's faltering, is appreciated.
This page is archived by ClueBot III.
Tech News: 2021-43
Latest tech news from the Wikimedia technical community. Please tell other users about these changes. Not all changes will affect you. Translations are available.
Recent changes
- The Coolest Tool Award 2021 is looking for nominations. You can recommend tools until 27 October.
Changes later this week
- The new version of MediaWiki will be on test wikis and MediaWiki.org from 26 October. It will be on non-Wikipedia wikis and some Wikipedias from 27 October. It will be on all wikis from 28 October (calendar).
Future changes
- Diff pages will have an improved copy and pasting experience. The changes will allow the text in the diff for before and after to be treated as separate columns and will remove any unwanted syntax.
- The version of the Liberation fonts used in SVG files will be upgraded. Only new thumbnails will be affected. Liberation Sans Narrow will not change.
Meetings
- You can join a meeting about the Community Wishlist Survey. News about the disambiguation and the real-time preview wishes will be shown. The event will take place on Wednesday, 27 October at 14:30 UTC. See how to join.
Tech news prepared by Tech News writers and posted by bot • Contribute • Translate • Get help • Give feedback • Subscribe or unsubscribe.
20:07, 25 October 2021 (UTC)
GWB.
Hi Rich! On the GWB page I do not see the changes in the box on the right side of the page. It is the box that gives a brief description of the bridge. It still says October 24th. How do I make Those changes? Thank you! — Preceding unsigned comment added by Maincable (talk • contribs) 23:49, 25 October 2021 (UTC)
- @Maincable: That would be as part of the 'Infobox bridge' template, the bits you are looking for are begin= and open=. Also, remember to sign your messages on talk pages using 4 ~ signs - Rich 14:34, 26 October 2021 (UTC)
Articles you might like to edit, from SuggestBot
Note: All columns in this table are sortable, allowing you to rearrange the table so the articles most interesting to you are shown at the top. All images have mouse-over popups with more information. For more information about the columns and categories, please consult the documentation and please get in touch on SuggestBot's talk page with any questions you might have.
SuggestBot picks articles in a number of ways based on other articles you've edited, including straight text similarity, following wikilinks, and matching your editing patterns against those of other Wikipedians. It tries to recommend only articles that other Wikipedians have marked as needing work. We appreciate that you have signed up to receive suggestions regularly; your contributions make Wikipedia better. Thanks for helping!
If you have feedback on how to make SuggestBot better, please let us know on SuggestBot's talk page. -- SuggestBot (talk) 12:53, 28 October 2021 (UTC)
The Signpost: 31 October 2021
- From the editor: Different stories, same place
- News and notes: The sockpuppet who ran for adminship and almost succeeded
- Discussion report: Editors brainstorm and propose changes to the Requests for adminship process
- Recent research: Welcome messages fail to improve newbie retention
- Community view: Reflections on the Chinese Wikipedia
- Traffic report: James Bond and the Giant Squid Game
- Technology report: Wikimedia Toolhub, winners of the Coolest Tool Award, and more
- Serendipity: How Wikipedia helped create a Serbian stamp
- Book review: Wikipedia and the Representation of Reality
- WikiProject report: Redirection
- Humour: A very Wiki crossword
ClueBot NG on SqWiki
Hey Rich!
I'm a crat from SqWiki. Recently a user showed me ClueBot NG when I asked him for advice on fighting vandalism. Would it be possible to make ClueBot NG work on wikis other than EnWiki? We (and I believe a lot of other wikis as well) would be really grateful to benefit from it if that were possible. - Klein Muçi (talk) 00:49, 1 November 2021 (UTC)
- @Klein Muçi: it can, however it needs a lot of training data. Pinging @DamianZaremba: to see if he can provide more input on what is required - Rich 07:06, 1 November 2021 (UTC)
- Yeah, I understand that because I read how it worked. I was thinking to maybe keep it on a kind of a "simulation" mode while it learned (maybe just don't give it the bot flag yet?) and later unleash it in full power. - Klein Muçi (talk) 11:32, 1 November 2021 (UTC)
- I don't think it quite works like that, the bot flag is irrelevant. @Cobi: could maybe assist as well? - Rich 11:33, 1 November 2021 (UTC)
- At the very least, the bot needs several tens of thousands of randomly sampled main-space edits categorized as good or bad to even have a chance of being reasonably accurate, but ideally more. I also do not speak Albanian, so I couldn't reasonably offer support for false positives or anything like that. The bot itself is open source, and most of the tooling should be in the repo.
- It seems that DamianZaremba's been reworking some of the training tooling, but the original training tooling is mostly here. It's a bit of a mess since it is mostly a snapshot of some of our working directories back when we were originally training the bot. The basic idea was there was a MySQL database called EditDB, and it had a table called editset.
- Tools like editClassificationToEditDB.php took data in on stdin in the form of "123456 V" or "234567 C" to mark revid 123456 as vandalism and revid 234567 as constructive. Tools like generateXML.php would then emit XML suitable for training the bot's core from the edits in the EditDB. Tools like autodatasetgen.go were built to find other ways of generating classifications like by checking if someone reverted real-world edits. This was not as effective as the smaller (but still large) hand-curated datasets.
- Finally, after using generateXML.php to generate train.xml, trial.xml, and bayestrain.xml in the editsets directory (we used limit clauses to split the files, with 0-16000 in bayestrain.xml, 16000-60000 in train.xml, and the rest in trial.xml), we then ran trainandtrial.sh to train the bot and then get metrics on the efficacy of the bot. There are also tools like autotraintrial.php, which attempts to explore reasonable ANN parameters. Those parameters are stored in localtoolconfig, and the values there are what we believe to be reasonable for training datasets between 50,000 and 100,000 edits.
- If any of that made some sort of sense, you may wish to give it a go. If not, maybe find a bot dev on SqWiki that has time and desire to curate and run a SqWiki version? -- Cobi 03:54, 5 November 2021 (UTC)
- @Cobi, thanks a lot for taking the time to explain the details! I followed every provided link along with your explanations. I saw that there hadn't been any changes for the last decade almost so I do understand that it may appear as an "old project" for you. I have a naïve question I couldn't understand from your explanation though: You say that the bot should use around 50k results (just an example) divided into C and V type to start its training which then gets information added by reverts and more. Then you also mention "hand-curated datasets". Should I understand that those initial 50k results (again, just an example) were divided into C and V type manually? If I'm misunderstanding that, how was that initial division made?
- The reason I ask is that if there's one thing we (and all the small wikis) lack, it's a large active userbase. We struggle so much to keep an active workforce that that was actually what brought me here. Even after setting up strict edit filters and trying to block vandals fast, the number of pages and changes pending review is still so large that it's unmanageable for us. (We lowered it to 0 some time ago but still...) It's therefore unfortunately very common for changes to await review for months, if not years, before someone actually gets to them. Lately we've been attacked by some IP vandals who change just small, trivial details in articles, for example the name of the city where someone was born, the date when someone died, or the number of works someone published. These edits are undetectable by the filters, and the vandals can't be blocked for long periods because they're IPs (more than one) and not in the same IP range. This not only lowers the project's overall integrity but also increases the workload for the already non-existent patrollers, which starts a vicious cycle: new patrollers/reviewers may become interested in helping, see the extremely large number of pending changes, feel like their work won't matter, and leave, which only makes the number grow. When I asked for help here in dealing with this situation, Xaosflux showed me your bot. It is crucial for us in automating vandalism fighting so that we have a chance to review the remaining constructive edits, which may or may not be acceptable by SqWiki standards.
- Currently I'm the only active one dealing with bot development on SqWiki. I run a bot myself which operates on SqWiki, SqQuote and LaWiki, but it's a rather simple one working on the Pywikibot framework plus the occasional AWB changes. I haven't had a chance to work on GitHub yet, even though I have an account there, if I'm not wrong. I can try starting that journey (even though I'm a self-taught coder) but I'd need a lot of guidance along the way. To be honest, what I was expecting was to work on some localization "tables", like I've done with the other imported bots in the past (maybe, most notably, IABot), not to duplicate the code. I fully expected ClueBot's functionality to have been requested by many wikis during its existence and i18n infrastructure to be already implemented in it. I was surprised to learn that I may be one of the few users (if my understanding is correct) making a request like this. - Klein Muçi (talk) 10:52, 5 November 2021 (UTC)
- @Klein Muçi The links I posted are to the original versions of the files since the original training hasn't changed in the last decade or so. The bot itself has been updated more regularly in the bot repo and the core repo. But, yeah, we collected and categorized some of the edits ourselves, and some had been collected by open research projects that have analyzed vandalism on enwiki, and some were crowd-sourced by using a web-interface that let others we trusted categorize edits.
- Essentially at a high level, the bot takes the edits and generates hundreds of statistics about each edit and then compares them against the known good and known bad edits' statistics using an Artificial Neural Network. If it looks like good edits more than bad edits, it leaves it alone, otherwise it reverts it. This is essentially what machine learning is.
- This does, of course, lead to why the bot hasn't been localized yet. It needs a completely new data-set for each new wiki it operates on, and no one has taken on that challenge yet. It's also why ClueBot NG does not operate on English wikis other than en.wikipedia: the data-set needs to be made for the wiki in question, not just the language. For example, an article on the English Wikipedia would look totally different from one on the English Wikinews or Wiktionary, and because the bot works by looking at an edit and trying to determine whether or not it belongs based on its data-set, it would notice the differences. The string tables used for the messages themselves are trivial to update for localization in comparison with the data-set.
- Other projects have asked for ClueBot NG before, but not that often. I've told them essentially what I've told you: The bot is open source, but you have to collect a data-set for it to work. There is also the old version of ClueBot that could potentially be used and updated, but its functionality was limited and largely eclipsed by the Edit Filter, and much less effective than the machine learning approach that ClueBot NG uses. -- Cobi 13:53, 5 November 2021 (UTC)
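For anyone wanting to experiment with the data-preparation steps Cobi describes above, here is a minimal Python sketch of two of them: reading classification lines of the form "123456 V" / "234567 C", and splitting the labelled edits into the three sets (0-16000, 16000-60000, and the rest). This is not the ClueBot NG tooling, which is written in PHP and Go and stores edits in a MySQL database; everything here is kept in memory, and the names `LabeledEdit`, `parse_classifications` and `split_editset` are hypothetical. Treating the ranges as half-open offsets is also an assumption about how the original limit clauses were applied.

```python
from typing import Dict, Iterable, List, NamedTuple


class LabeledEdit(NamedTuple):
    """One hand-classified revision (hypothetical structure for this sketch)."""
    revid: int
    is_vandalism: bool


# "V" marks vandalism and "C" marks a constructive edit, as in the thread above.
LABELS = {"V": True, "C": False}


def parse_classifications(lines: Iterable[str]) -> List[LabeledEdit]:
    """Parse lines such as '123456 V' or '234567 C' into labelled edits."""
    edits: List[LabeledEdit] = []
    for lineno, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        parts = line.split()
        if len(parts) != 2 or parts[1] not in LABELS:
            raise ValueError(f"line {lineno}: expected '<revid> V|C', got {raw!r}")
        try:
            revid = int(parts[0])
        except ValueError:
            raise ValueError(f"line {lineno}: revid is not an integer: {parts[0]!r}")
        edits.append(LabeledEdit(revid, LABELS[parts[1]]))
    return edits


def split_editset(
    edits: List[LabeledEdit],
    bayes_end: int = 16000,
    train_end: int = 60000,
) -> Dict[str, List[LabeledEdit]]:
    """Split edits by position: [0, bayes_end), [bayes_end, train_end), rest.

    The default offsets mirror the ranges quoted in the thread; half-open
    slicing is an assumption made for this sketch.
    """
    return {
        "bayestrain": edits[:bayes_end],
        "train": edits[bayes_end:train_end],
        "trial": edits[train_end:],
    }


if __name__ == "__main__":
    sample = ["123456 V", "234567 C"]
    labelled = parse_classifications(sample)
    for name, subset in split_editset(labelled, bayes_end=1, train_end=2).items():
        print(name, len(subset))
```

Since Cobi asks for randomly sampled edits, the list would presumably need to be in random order before a positional split like this one; the sketch leaves ordering to the caller.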
Tech News: 2021-44
Latest tech news from the Wikimedia technical community. Please tell other users about these changes. Not all changes will affect you. Translations are available.
Recent changes
- There is a limit on the number of emails a user can send each day. This limit is now global instead of per-wiki. This change is to prevent abuse.
Changes later this week
- The new version of MediaWiki will be on test wikis and MediaWiki.org from 2 November. It will be on non-Wikipedia wikis and some Wikipedias from 3 November. It will be on all wikis from 4 November (calendar).
Tech news prepared by Tech News writers and posted by bot • Contribute • Translate • Get help • Give feedback • Subscribe or unsubscribe.
20:27, 1 November 2021 (UTC)