This user is offline, or has forgotten to update this message since starting a wikisession. (If there have been multiple edits from this user in the last 60 minutes and the most recent one wasn't to activate this template, it is safe to assume that this user forgot.)
This signature was designed with a font that has been discontinued. You can, however, download the font pack necessary to see how the sig is supposed to look here and here.
Be sure to download both fonts from both links. The sig has also been designed to look good for people who don't want to download the font packs.
- Hello!! I am Cyberpower678. I am an administrator on Misplaced Pages. Despite that, I'm still your run-of-the-mill user here on Misplaced Pages.
- I specialize in bot work and tools, but I lurk around RfPP, AfD, AIV, and AN/I, as well as RfA. If you have any questions in those areas, please feel free to ask. :-)
- For InternetArchiveBot matters specifically, please see meta:InternetArchiveBot/Problem for common issues or meta:User talk:InternetArchiveBot to leave a message.
- I also serve as a mailing list moderator and account creator over at the Account Creation Center. If you have any questions regarding an account I created for you, or the process itself, feel free to email the WP:ACC team or me personally.
- At present, I have helped to create accounts for 2512 different users and renamed 793 other users.
- Disputes or discussions that appear to have ended or are disputed will be archived.
All the best.—cyberpower
View my talk page Archives.
No current discussions. Recent RfAs, recent RfBs: (successful, unsuccessful)
This user is busy doing other things and would like a {{talkback}} notice at this time.
Cyberpower678, in accordance with the Wikimedia Foundation's Terms of Use, discloses that he has been paid by the Internet Archive for his contributions to Misplaced Pages. This funding is for the ongoing development of InternetArchiveBot.
If you are asking about InternetArchiveBot, you will get a faster response if you post to meta:User talk:InternetArchiveBot.
Talkpages
First, thanks for tagging pages that have blacklisted links, obviously necessary and useful! Second, you seem to be tagging talkpages; is that intentional? Given that we are not generally supposed to edit other users' posts on talkpages not our own, what are we supposed to do about these? (this question asked out of plain ignorance, no desire to criticise) Justlettersandnumbers (talk) 00:17, 26 August 2013 (UTC)
- I came here to make the same point. But I'll just more directly ask that the bot stop tagging talk pages -- there's no value in that. Looie496 (talk) 00:42, 26 August 2013 (UTC)
- In the same vein, not only was Talk:BDSM tagged, but the link it tagged is actually invalid now. (Main site is there, but it gives an error when requesting that particular page.) – RobinHood70 01:11, 26 August 2013 (UTC)
- Ok. I have added the Talk namespace to the exceptions list.—cyberpower Online 10:31, 26 August 2013 (UTC)
Tracing the blacklist
Is there any easy way to determine which rule on the blacklist a URL triggers? For example, this looks legitimate on its face, but obviously someone somewhere has a problem with something whose regex also matches this one. Without knowing whether it's "washington.com" or "nbcwashington" or "petition" or whatever, it's hard to know where to list the whitelist exception and with whom to discuss the origin of the rule that caught it. DMacks (talk) 00:29, 26 August 2013 (UTC)
- Unfortunately, no. However, I know a user who can track the regex matching this link.—cyberpower Offline 01:27, 26 August 2013 (UTC)
- It's the word "petition" that's doing it. There's a discussion about removing it. – RobinHood70 06:56, 26 August 2013 (UTC)
- Not that bug again. The petition regex caused it to flag a whitehouse.gov site as spam. :/—cyberpower Online 10:30, 26 August 2013 (UTC)
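(A minimal sketch of how such a check could work, in PHP like the bot itself: test the URL against each blacklist line one at a time instead of against the batched alternation. The two fragments are the ones quoted in the threads on this page; the petition-style URL and the simple per-line wrapping are illustrative assumptions that omit the extension's batching and prefix logic.)

<?php
// Test a URL against each blacklist fragment individually to find the rule
// that fires. $lines stands in for raw fragments copied from the blacklist.
$lines = array(
    '\bpetition(?:online|s)?\b',
    '\bguy\.com\b',
);
$url = 'https://petitions.whitehouse.gov/petition/example';

foreach ( $lines as $line ) {
    // Wrap each fragment as its own pattern; S and i mirror the filter's flags.
    if ( preg_match( '/' . str_replace( '/', '\/', $line ) . '/Si', $url ) ) {
        echo "Matched by: $line\n";
    }
}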
- It's the word "petition" that's doing it. There's a discussion about removing it. – RobinHood70 06:56, 26 August 2013 (UTC)
Where is the blacklist?
I am questioning why the bot is tagging some pages as blacklisted. This Wired magazine article seems like legitimate journalism to me and this statement from the United States White House seems okay to cite also, but the bot did not like them.
Who is making the blacklist? What are the criteria for inclusion into this list? Blue Rasberry (talk) 02:30, 26 August 2013 (UTC)
- Click the links in the box.—cyberpower Offline 02:31, 26 August 2013 (UTC)
- Help me out a bit more. Here is the article of my concern - Access2Research. Here is the edit the bot made. The bot seems to not like any link with the word "petition" in it. If that is the criterion for tagging, I have to assert that this is not a legitimate criterion, because good sources can have that word in the url. Is that in fact the criterion for tagging links as bad? Blue Rasberry (talk) 14:26, 27 August 2013 (UTC)
- I do not determine what qualifies as spam. Administrators do that. MediaWiki's spam blacklist contains regex fragments that determine what to mark as spam. There is an entry listing the word petition. A discussion has been brought up to remove this badly written fragment from the list. You can find this link in another thread on this talk page.—cyberpower Online 15:37, 27 August 2013 (UTC)
Bad tag on Rachel Carson
The bot recently tagged Rachel Carson--I think that this is the result of a glitch. Here's the edit: http://en.wikipedia.org/search/?title=Rachel_Carson&oldid=570192594 The ref that alerted the bot was a web.archive.org link to an archive page from www.medaloffreedom.com, which is now in the hands of some pissant linkfarm. The archived version is at http://web.archive.org/web/20071018025824/www.medaloffreedom.com/Chronological.htm . I found medaloffreedom.com on the blacklist, but, obviously, not archive.org. Since web.archive.org is used to supply links to old sources that are no longer available, often for similar reasons, the whole domain should be whitelisted or excepted. For now, I'm just reverting the error and hope that this will be addressed before the bot does much more of this.--Hjal (talk) 06:46, 26 August 2013 (UTC)
- There's
\bweb\.archive\.org.{0,50}obsessedwithwrestling\.com
which could be setting it off. I'll take a look. The bot only tags if there's a positive match to a regex scan. Regexes are generated by MediaWiki's own engine.—cyberpower Online 10:39, 26 August 2013 (UTC)
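(A sketch of what is probably happening here, under the assumption that the blacklist entry Hjal found is medaloffreedom\.com: without anchoring to the host part of the URL, a plain domain fragment also matches inside an archive.org path, while the web.archive.org rule quoted above needs obsessedwithwrestling.com within 50 characters and does not fire at all.)

<?php
$url = 'http://web.archive.org/web/20071018025824/www.medaloffreedom.com/Chronological.htm';

// Assumed blacklist entry: matches, because the domain appears in the path.
var_dump( (bool) preg_match( '/medaloffreedom\.com/Si', $url ) );  // true

// The rule quoted above: does not match this URL at all.
var_dump( (bool) preg_match( '/\bweb\.archive\.org.{0,50}obsessedwithwrestling\.com/Si', $url ) );  // false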
Blacklist tag.
Hi, this bot placed a blacklist tag on Trams in Melbourne, I've searched both blacklists and can't find the website listed. Is this a mistake in which case I can remove the tag? Or is the website on the blacklist, and if so, why? Liamdavies (talk) 12:04, 26 August 2013 (UTC)
\bguy\.com\b
is what triggered the bot. You'll need to consult the admin that added it to find out why.—cyberpower Offline 13:30, 26 August 2013 (UTC)
- Does that mean that the site is in fact blacklisted? Or is this a mixup? How would I find out which admin added the site? Liamdavies (talk) 10:09, 27 August 2013 (UTC)
- I can't answer that for you. Sorry.—cyberpower Online 10:35, 27 August 2013 (UTC)
- (talk page stalker) Added here. Maybe it's trying to catch the domain "guy" itself (and any subdomains) rather than any domain that has "guy" as the last of several words? DMacks (talk) 10:45, 27 August 2013 (UTC)
- Then in order to fix that, the regex needs to be made more specific, the link needs to be placed on the whitelist or the exceptions list, or the entry needs to be removed from the blacklist entirely.—cyberpower Online 10:50, 27 August 2013 (UTC)
- Does this mean that I can remove the tag or not? Liamdavies (talk) 14:49, 27 August 2013 (UTC)
- You can, if you want. The bot will simply keep retagging it though until it doesn't see it as blacklisted anymore.—cyberpower Online 15:34, 27 August 2013 (UTC)
- OK, how would one get it un-blacklisted? This really does seem like an issue that should be sorted out. Is this bot new? The link in question has been on the page for years, so I don't see why this is now a problem. Liamdavies (talk) 17:40, 27 August 2013 (UTC)
- Yes, the bot is new. It is still in its trial stages. To get a link to not be marked as spam, you'll need to contact an administrator that manages the MediaWiki:Spam-blacklist. They can then sort it out for you.—cyberpower Online 17:44, 27 August 2013 (UTC)
- If the blacklisting of "guy.com" results in tagging articles with links to "cable-car-guy.com" or any other "foo-guy.com" links, it seems like a bot error. How many sites would have to be entered on the whitelist or exceptions list to make up for that kind of error? Unlike "foo.guy.com," "foo-guy.com" is in a different domain than "guy.com".--Hjal (talk) 17:52, 27 August 2013 (UTC)
- It's not a bot error. The bot only tags pages containing links that match positively against the blacklist regexes. The regex would need to be refined more to reduce false positives.—cyberpower Online 17:55, 27 August 2013 (UTC)
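(To see the false positive DMacks describes concretely: in PCRE, "-" is not a word character, so \b finds a boundary between "-" and "g", and a rule aimed at guy.com also fires on any foo-guy.com domain. foo.guy.com and foo-guy.com are the examples from Hjal's post; fooguy.com is an added illustration.)

<?php
var_dump( (bool) preg_match( '/\bguy\.com\b/Si', 'http://www.guy.com/' ) );   // true (intended)
var_dump( (bool) preg_match( '/\bguy\.com\b/Si', 'http://foo.guy.com/' ) );   // true (subdomain of guy.com, intended)
var_dump( (bool) preg_match( '/\bguy\.com\b/Si', 'http://foo-guy.com/' ) );   // true (different domain: false positive)
var_dump( (bool) preg_match( '/\bguy\.com\b/Si', 'http://fooguy.com/' ) );    // false (no word boundary inside "fooguy")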
- Maybe this bot's full approval should be held back until this happens? I would also suggest that talk and file pages be exempt; the bot has thrown tags on places where they really aren't needed here and here, and is targeting the sources of film posters here. What is the deal with sourcing of non-free images? Is it wise to be removing the source as this user did to get the tag removed? Liamdavies (talk) 18:05, 27 August 2013 (UTC)
- Bot approval is determined by whether the bot is operating as it should, not by whether there are false positives due to faulty regexes. As a matter of fact, this has actually prompted a few discussions on refining the regexes in the blacklist. These false positives the bot is throwing out are also affecting the spam blacklist filter and inhibiting addition of further links. Also, I have already added the file space to the exceptions list. You can see all the pages that are exempt at User:Cyberpower678/spam-exception.js.—cyberpower Limited Access 18:12, 27 August 2013 (UTC)
- (talk page stalker) Folks, this is not the fault of the bot itself. The bot is just doing what it was told. The problem appears to lie with the people who are making up the "blacklist filter", which the bot just enforces. Somebody told the filter to blacklist any domain containing "guy", and that is producing numerous false positives. (In my case it was an innocent academic site hosted by "geology-guy".) Both cases have been called to the attention of the people at the whitelist site, and hopefully they will sort it out eventually. In the meantime, don't beat up on the bot or the bot owner. Computers are very smart, but they are also very stupid: they only do what they are told. --MelanieN (talk) 18:15, 27 August 2013 (UTC)
- Finally someone understands. ;-)—cyberpower Limited Access 18:16, 27 August 2013 (UTC)
- Sorry, I should clarify: I know the bot is simply doing its job. When I said "Maybe this bot's full approval should be held back until this happens?" that was in regards to "The regex would need to be refined more to reduce false positives.", in that there are going to be problems (such as posted here) until that happens. My pointing to the talk and file matters was to draw attention to an issue, not beat up on a fellow contributor. I apologise if it came off as anger; it was probably more frustration, and seeing the future mess on your talk page if these minor issues (mainly to do with a faulty blacklist, rather than a bot doing what it's told) don't get sorted before the bot goes live. Liamdavies (talk) 18:38, 27 August 2013 (UTC)
- Actually false positives can be beneficial to fixing regex issues that weren't apparent earlier.—cyberpower Limited Access 18:45, 27 August 2013 (UTC)
- Several false positives have been identified here; do we know if anyone is trying to fix them? Or what does it take to get that process started? --MelanieN (talk) 23:37, 27 August 2013 (UTC)
- Only administrators can change the blacklist and whitelist. Preferably the administrators that make the changes to the list should be contacted. There is currently a discussion regarding the petition regex that is causing the bot to flag any link with the word petition in it as spam.—cyberpower Online 23:42, 27 August 2013 (UTC)
- I don't think this is what the problem is. The regex engine is identical to the blacklist engine.—cyberpower Online 23:16, 28 August 2013 (UTC)
- Well the two engines are producing different results, so there's obviously something different. This is likely related to this bug.
- What exactly do you attempt to match with the bot? If you look at the documentation for the blacklist filter, it states that it matches "!http://[a-z0-9\-.]*(line 1|line 2|line 3|....)!Si". Unless I'm missing something, that wouldn't pick up anything after a "/". I presume your bot's code doesn't have this line, and instead just searches for the raw regexes from the blacklist, which means it will pick up anything in the url, rather than just in the domain name.
- So for example, the blacklist filter "\bpetition(?:online|s)?\b" doesn't prevent me from saying http://praguemonitor.com/2012/09/24/presidential-candidate-jakl-launches-petition-his-bid, but your bot tags it. If I try and add a url with "petition" before the .com it gets caught by the filter. TDL (talk) 23:50, 28 August 2013 (UTC)
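(TDL's example can be checked directly. The sketch below contrasts the raw fragment with the same fragment behind a filter-style prefix; the prefix is the getRegexStart() string quoted later in this thread. Because the character class cannot cross the "/" that ends the host, the prefixed pattern never reaches "petition" in the path.)

<?php
$fragment = '\bpetition(?:online|s)?\b';
$url = 'http://praguemonitor.com/2012/09/24/presidential-candidate-jakl-launches-petition-his-bid';

// Raw fragment, as the bot was effectively matching: true -- hits the path.
var_dump( (bool) preg_match( "/$fragment/Si", $url ) );

// Filter-style prefix: false -- the match must sit directly after the scheme
// and domain characters, and "/" is not in the character class.
var_dump( (bool) preg_match( '/(?:https?:)?\/\/+[a-z0-9_\-.]*(' . $fragment . ')/Si', $url ) );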
- You are incorrect. The regexes are generated exactly like the extension generates it.—cyberpower Online 00:07, 29 August 2013 (UTC)
- I've copied this over to Misplaced Pages:Bots/Requests_for_approval/Cyberbot_II_4#Trial as that seems like a better place to discuss this. TDL (talk) 00:08, 29 August 2013 (UTC)
- I've reverted it. Cross-posting only causes me confusion. Here's the regex generator:
function buildRegexes( $lines, $batchSize=4096 ) {
    # Make regex
    # It's faster using the S modifier even though it will usually only be run once
    //$regex = 'https?://+[a-z0-9_\-.]*(' . implode( '|', $lines ) . ')';
    //return '/' . str_replace( '/', '\/', preg_replace('|\\\*/|', '/', $regex) ) . '/Sim';
    $regexes = array();
    $regexStart = '/[a-z0-9_\-.]*(';
    $regexEnd = getRegexEnd( $batchSize );
    $build = false;
    foreach( $lines as $line ) {
        if( substr( $line, -1, 1 ) == "\\" ) {
            // Final \ will break silently on the batched regexes.
            // Skip it here to avoid breaking the next line;
            // warnings from getBadLines() will still trigger on
            // edit to keep new ones from floating in.
            continue;
        }
        // FIXME: not very robust size check, but should work. :)
        if( $build === false ) {
            $build = $line;
        } elseif( strlen( $build ) + strlen( $line ) > $batchSize ) {
            $regexes[] = $regexStart . str_replace( '/', '\/', preg_replace('|\\\*/|u', '/', $build) ) . $regexEnd;
            $build = $line;
        } else {
            $build .= '|';
            $build .= $line;
        }
    }
    if( $build !== false ) {
        $regexes[] = $regexStart . str_replace( '/', '\/', preg_replace('|\\\*/|u', '/', $build) ) . $regexEnd;
    }
    return $regexes;
}
I don't speak PHP, but a look at your source code would appear to confirm that my suspicion was correct. Where is the http:// tacked on to the front of all the regexes? The only mention of "http" in your code is commented out. Unless you deal with this later in your code, your bot will find all matches of "[a-z0-9_\-.]*($line)" and not "http://[a-z0-9_\-.]*($line)" as the blacklist filter does. Looks like the error is with the line $regexStart = '/[a-z0-9_\-.]*('; which should read $regexStart = '/http:\/\/[a-z0-9_\-.]*('; to be consistent with what the filter is doing. TDL (talk) 00:51, 29 August 2013 (UTC)
- The documentation is out of date. I will repeat, I downloaded the extension and installed it into my script as is. This is how the extension is configured.—cyberpower Online 00:55, 29 August 2013 (UTC)
- I have to agree that this bot is tagging things beyond what it should. Featured Article William Wilberforce was tagged with this tag simply because a Google Books search string had the word "petitions" in it. The spam filter is not triggered by these search strings... I tested it by trying to reinsert the search string and it saved without problem. Test it yourself and see. I realize that this bot can probably be very helpful, but think some fixes are in order first. --Slp1 (talk) 01:46, 29 August 2013 (UTC)
- What fix do you propose? The http part is omitted for very good reasons.—cyberpower Online 01:50, 29 August 2013 (UTC)
- Hmm, well then why does your code differ from the code provided by the repository? It has the line:
$regexStart = $blacklist->getRegexStart();
- where your problematic line is.
- This function getRegexStart is defined as:
public function getRegexStart() {
    return '/(?:https?:)?\/\/+[a-z0-9_\-.]*(';
}
- All of which suggests that you need to add an "(?:https?:)?\/\/+" to the line as I suggested above to get results consistent with the filter. This error would certainly explain the observed symptoms. TDL (talk) 01:54, 29 August 2013 (UTC)
- Now we are getting somewhere. It would seem that I have an outdated copy of the filter engine. I'll make the change once I have access to labs again. Looks like I made a fuss for nothing. 9_9 My apologies.—cyberpower Online 02:01, 29 August 2013 (UTC)
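(For readers following along, a short sketch of how the pieces fit together once the fix lands: the corrected prefix is folded into the start of each batched regex, and the page text is then scanned with one preg_match() per batch. buildRegexes() is the function shown above; $blacklistLines and $pageText are placeholder names, not the bot's actual variables.)

<?php
// Build the batched patterns from the raw blacklist lines, then scan the page.
$regexes = buildRegexes( $blacklistLines );
$hits = array();
foreach ( $regexes as $regex ) {
    if ( preg_match( $regex, $pageText, $match ) ) {
        $hits[] = $match[0];  // first blacklisted link this batch matched
    }
}
// If $hits is non-empty, the page would be tagged as containing blacklisted links.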
Cyberbot II making multiple edits in a row to the same article
When pending changes protection was changed from level 1 to level 2 on Misplaced Pages:Pending changes/Testing/5, Cyberbot II made two edits in a row to fix the tag, instead of just one. Jackmcbarn (talk) 01:07, 27 August 2013 (UTC)
- I'm working on a rewrite.—cyberpower Online 10:48, 27 August 2013 (UTC)
Hasty tagging
Nothing big because this is still hypothetical, but just notifying you of a potential vulnerability. I just added a {{pp-pc1}} tag in my sandbox and Cyberbot II removed it in 8 seconds. If I were you, I'd add a delay of something like 30 minutes so that a vandal on a spree can't abuse it. (That's the purpose of the long delay in applications like STiki, which see vandalism from far back.) So far it's been easy to mass rollback IP-hopping vandals, but that might not last long. Ginsuloft (talk) 18:59, 28 August 2013 (UTC)
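(A minimal sketch of the suggested delay, assuming the bot can read the latest revision timestamp; fetchLatestRevisionTimestamp() and handleProtectionTemplates() are hypothetical helpers standing in for a MediaWiki API call (prop=revisions) and the bot's normal handling, not Cyberbot II's real code.)

<?php
$minAgeSeconds = 30 * 60;  // the 30-minute delay suggested above

$revTimestamp = fetchLatestRevisionTimestamp( $title );  // hypothetical helper
if ( time() - strtotime( $revTimestamp ) < $minAgeSeconds ) {
    return;  // revision is too fresh; skip and re-check on the next run
}
handleProtectionTemplates( $title );  // hypothetical: proceed as usual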
- (talk page stalker) WP:BEANS! That's the kind of thing you should say privately. But yes, I agree. Jackmcbarn (talk) 00:43, 29 August 2013 (UTC)
RfX template
Is it possible to view the template without the "last updated" message below? Thanks. Mohamed CJ (talk) 06:02, 29 August 2013 (UTC)
Alan Turing
Your bot keeps tagging Alan Turing references as potential spam links. They do not appear to be so. Your bot has been reverted 3 times in a row by 3 separate editors. That's the danger zone, bot or no bot. Seriously, you should implement a check to see if it's already made the exact same edit. Or something. Int21h (talk) 09:48, 29 August 2013 (UTC)
- That's intentional. Read the tag more carefully.—cyberpower Online 10:38, 29 August 2013 (UTC)
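(For what Int21h's suggested check might look like: before re-tagging, scan the last few revisions and back off if the bot has already placed the same tag and a human removed it. getRecentRevisions(), $tagText, and the function itself are hypothetical placeholders, not Cyberbot II's actual code.)

<?php
// Return true if the bot already added $tagText in a recent revision,
// i.e. the tag's current absence means a human reverted it.
function alreadyTaggedRecently( $title, $tagText, $botName, $depth = 5 ) {
    foreach ( getRecentRevisions( $title, $depth ) as $rev ) {  // hypothetical API wrapper
        if ( $rev['user'] === $botName && strpos( $rev['content'], $tagText ) !== false ) {
            return true;
        }
    }
    return false;
}

if ( alreadyTaggedRecently( $title, $tagText, 'Cyberbot II' ) ) {
    return;  // leave it for human review instead of edit warring
}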