Misplaced Pages:Bots/Requests for approval/Cyberbot II 4


This is an old revision of this page, as edited by Cyberpower678 (talk | contribs) at 14:23, 29 August 2013 (Reply). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.


Cyberbot II 4

Operator: Cyberpower678 (talk · contribs · SUL · edit count · logs · page moves · block log · rights log · ANI search)

Time filed: 02:04, Thursday June 27, 2013 (UTC)

Automatic, Supervised, or Manual: Automatic

Programming language(s): PHP

Source code available: No

Function overview: Tag all pages containing links blacklisted in the MediaWiki:Spam-blacklist and the meta:Spam blacklist with {{Spam-links}}

Links to relevant discussions (where appropriate): Misplaced Pages:Bot_requests#Unreliable_source_bot

Edit period(s): Daily

Estimated number of pages affected: Unknown. Probably hundreds or thousands at first

Exclusion compliant (Yes/No): Yes

Already has a bot flag (Yes/No): Yes

Function details: This bot scans the above-mentioned lists and tags any page in the article namespace that contains a blacklisted link with {{Spam-links}}.

Discussion

Since the sites on those lists have been determined to be spam, would it be better to simply remove those links? Would your bot only consider external links, or also references? Thanks! GoingBatty (talk) 02:33, 27 June 2013 (UTC)

I believe it would be better to tag them than to remove them, since removing them may end up breaking something. I can have my bot remove them instead if that is preferred, or if the MediaWiki software turns out to block the bot's edits. As for your questions, it would handle any link matched in article space.—cyberpower _Online 02:40, 27 June 2013 (UTC)

External links are not {{unreliable source}}, they're external links. Also, you should probably skip links listed at MediaWiki:Spam-whitelist. And note there are also links on the blacklist that aren't there because anything using that link is unreliable, e.g. any url shortener is there because the target should be linked directly rather than via a shortener. Anomie ⚔ 11:24, 27 June 2013 (UTC)

Thanks for the input. I could remove external links while tagging refs with {{unreliable source}}.—cyberpower _Offline 12:32, 27 June 2013 (UTC)

Wait, you're tagging external links that are listed on the spam blacklist with {{unreliable source}}? Unless I'm missing something here this won't work. When the bot tries to save the page, it will hit the blacklist and won't save. --Chris 13:55, 27 June 2013 (UTC)

I have considered that possibility, which is why my alternative is to simply remove the link and refs altogether.—cyberpower _Online 14:51, 27 June 2013 (UTC)

In that case, I think this is something better dealt with by a human. Simply removing external links will probably lead to a bit of "brokenness" in the article where the link was, and would need human intervention to clean up after the bot. Also, if the article does have blacklisted links in it, chances are it has other problems (e.g. the entire article could be spam), so it would be preferable to have a human view the article and take action. I think if you want to continue with this task, the best thing to do would be for the bot to create a list of pages that contain blacklisted links, and post that for users to manually review. --Chris 15:18, 27 June 2013 (UTC)

I'm not certain the software will block the bot's edits if the spam link is already there. I was thinking more along the lines that the tags place the page in a category that humans can then review. If it can't tag it next to the link, maybe it can tag the page instead and place it in the same category. What do you think?—cyberpower _Online 15:30, 27 June 2013 (UTC)

As I understand it, if the spam link is already on the page, the software will block the edit anyway. --Chris 16:00, 27 June 2013 (UTC)

hmmm. I'm looking at the extension that is responsible. If the software blocks any edit that has the link in there already, that would likely cause a lot of problems on wiki. But, I'll have more info later tonight.—cyberpower _Offline 16:57, 27 June 2013 (UTC)

{{BAGAssistanceNeeded}} I have tested the spam filter extensively on the peachy wiki. Tagging blacklisted links will not trip the filter, nor will removing it or adding the link if it already exists on the page. Modifying the link, or adding it to a page where the link is not yet present will trip the filter.—cyberpower _Offline
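The behaviour reported in these tests is consistent with a filter that only checks links newly introduced by an edit. A minimal sketch of that rule follows; the function name, the sample blacklist pattern, and the example URLs are hypothetical and are not taken from the bot or from the extension's source.

```php
<?php
// Hypothetical helper: which links would trip the filter for a given edit?
// Assumption (from the test results above): links already present in the
// old revision are ignored; only links absent from it are checked.
function linksTrippingFilter(array $oldLinks, array $newLinks, array $blacklistRegexes): array
{
    // Links present after the edit but not before it.
    $added = array_diff($newLinks, $oldLinks);

    $tripped = [];
    foreach ($added as $link) {
        foreach ($blacklistRegexes as $regex) {
            if (preg_match($regex, $link)) {
                $tripped[] = $link;
                break; // one matching rule is enough for this link
            }
        }
    }
    return $tripped;
}

$blacklist = ['/spam\.example/i'];                        // made-up rule
$old = ['http://spam.example/a', 'http://ok.example/'];   // links before the edit

// Tagging the page leaves the link set unchanged, so nothing is checked.
$whenTagging = linksTrippingFilter($old, $old, $blacklist);

// Removing the blacklisted link adds nothing new either.
$whenRemoving = linksTrippingFilter($old, ['http://ok.example/'], $blacklist);

// Modifying the link is equivalent to adding a link that was not there before.
$whenModifying = linksTrippingFilter($old, ['http://spam.example/b', 'http://ok.example/'], $blacklist);
```

Under these assumptions, only the last case returns a non-empty list, which matches the observation that tagging or removing is safe while modifying a blacklisted link is not.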

Ok, I stand corrected. I'd like to review the source code for this bot. --Chris 12:40, 3 July 2013 (UTC)

Also can you give a bit more detail on exactly how the bot will operate? Will it only be tagging references, or will it remove external links as mentioned above? How will the bot deal with any false positives? Will it skip links listed on MediaWiki:Spam-whitelist? Will it be possible to whitelist other links (e.g. url shorteners as mentioned by Anomie), that shouldn't be tagged as unreliable? --Chris 12:47, 3 July 2013 (UTC)

The bot code is not yet fully completed as of this writing. I seem to be hitting resource barriers. Because it processes an enormous number of external links, I am working on conserving memory. The regex scan is also quite a resource hog, and I am trying to make it more efficient. Yes, it will obey the whitelist. Because there is a risk of breaking things when removing the link, and tagging references can lead to false positives, I thought about placing a tag at the top of the page, listing the links that it found. False positives can be reported to me or to an admin, who will add or remove an exception on a .js page in my userspace that the bot reads before it edits the pages.—cyberpower _Online 14:00, 3 July 2013 (UTC)

The script is now finished. Chris G. has the code and is reviewing it. The task will seek out blacklisted external links and tag the pages containing them. Exceptions can be added for specific cases and it reads the whitelist too.—cyberpower _Online 15:33, 24 July 2013 (UTC)

Although this page states that blacklisted external links will be tagged with {{unreliable source}}, Misplaced Pages:Village_pump_(miscellaneous)#New_Bot states that they will be tagged with {{spam-links}}. Could you please clarify? Thanks! GoingBatty (talk) 23:38, 24 July 2013 (UTC)

Some changes were made since the filing of this BRFA. I have now amended the above.—cyberpower _Offline 05:43, 25 July 2013 (UTC)

Comment maybe post a note at Misplaced Pages talk:WikiProject Spam and possibly also at Misplaced Pages talk:Spam as the editors there might not have seen the note at the VPM. Just thinking. 64.40.54.156 (talk) 04:38, 25 July 2013 (UTC)
Done—cyberpower _Online 08:40, 25 July 2013 (UTC)

Review:

Try and avoid using gotos wherever possible. It makes code hard to read, and often leads to strange bugs. E.g. at line 86 instead of:

    if( empty($blacklistregexarray) ) goto theeasystuff;
    else $blacklistregex = buildSafeRegexes($blacklistregexarray);

You could have written:

    if( !empty($blacklistregexarray) ) {
        $blacklistregex = buildSafeRegexes($blacklistregexarray);
        <LINES 89 - 112>
    }

Done all labels removed.—cyberpower _Online 12:24, 25 July 2013 (UTC)

Line 13 - Why the while loop? Unless there is a continue I am missing somewhere it seems to just run once, and break at line #156

Already done The break command was a remnant from the debugging period. It's removed now.—cyberpower _Online 12:24, 25 July 2013 (UTC)

Line 36 - While str_replace should work 99% of the time, it would be best practice to use substr instead, since only the leading "page=" should be stripped. E.g.:

    substr($exception,strlen("page="))

Done Missed this one.—cyberpower _Online 12:44, 25 July 2013 (UTC)
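To illustrate the difference behind this suggestion: str_replace removes every occurrence of the search string, while substr only drops the leading prefix. The exception string below is a made-up value chosen to show the edge case; it is not taken from the bot's actual exceptions page.

```php
<?php
// Hypothetical exception entry whose page title itself contains "page=".
$exception = "page=Talk:page=1 bug";

// str_replace strips both occurrences and corrupts the title.
$viaReplace = str_replace("page=", "", $exception);   // "Talk:1 bug"

// substr only removes the first strlen("page=") characters.
$viaSubstr = substr($exception, strlen("page="));     // "Talk:page=1 bug"
```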

Lines 127 - 131: you seem to be checking that the API hasn't returned a blank page? This should really be done at the framework level, not in the bot code. Basically, you should check that the HTTP status code is 200; if it isn't, sleep for 1 second and try again. If it fails again, sleep for 2 seconds, and so on. Doing this at the framework level means you don't have to worry about it each time you use "$pageobject->get_text();" (in fact, it should be checked on all API queries).

Already done You reminded me that I programmed that safeguard into the Peachy framework already. :p—cyberpower _Online 12:24, 25 July 2013 (UTC)
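The retry scheme described in the review comment (wait 1 second after the first failure, 2 after the second, and so on) could look roughly like the sketch below. This is not Peachy's implementation: fetchWithBackoff, the shape of the response array, and the injectable sleeper are all assumptions made for illustration.

```php
<?php
// $fetch is a hypothetical callable returning ['code' => int, 'body' => string].
// $sleep defaults to PHP's sleep(); it is injectable so the demo runs instantly.
function fetchWithBackoff(callable $fetch, int $maxAttempts = 5, ?callable $sleep = null): string
{
    $sleep = $sleep ?? 'sleep';
    for ($attempt = 1; $attempt <= $maxAttempts; $attempt++) {
        $response = $fetch();
        if ($response['code'] === 200) {
            return $response['body'];
        }
        if ($attempt < $maxAttempts) {
            // Linear backoff: 1 second, then 2, then 3, ...
            $sleep($attempt);
        }
    }
    throw new RuntimeException("Request failed after $maxAttempts attempts");
}

// Simulated endpoint: fails twice, then succeeds.
$calls = 0;
$fakeFetch = function () use (&$calls): array {
    $calls++;
    return $calls < 3
        ? ['code' => 503, 'body' => '']
        : ['code' => 200, 'body' => 'page text'];
};

// Record the requested delays instead of actually sleeping.
$delays = [];
$recordDelay = function (int $seconds) use (&$delays): void {
    $delays[] = $seconds;
};

$text = fetchWithBackoff($fakeFetch, 5, $recordDelay);
```

Putting the loop in one framework-level function means callers such as a page-text getter do not need their own blank-response checks.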

Bug at line 165 - "else return true;" I think you want "return true;" after the foreach loop. Otherwise it only checks one of the whitelisted links.

        if( preg_match($regex, $link) ) {
            foreach( $whitelistregex as $wregex ) {
                if( preg_match($wregex, $link) ) return false; 
                else return true;
            }
        }

vs.

        if( preg_match($regex, $link) ) {
            foreach( $whitelistregex as $wregex ) {
                if( preg_match($wregex, $link) ) 
                     return false; 
            }
            return true;
        }

Fixed—cyberpower _Online 12:24, 25 July 2013 (UTC)

General comment. Considering how many edits your bot is going to make, you should put a sleep(); somewhere in the code to make sure you don't hammer the servers. At the very least after each edit, if not every http request.

Already done Framework has throttle.—cyberpower _Online 12:24, 25 July 2013 (UTC)

Lines 145ish - Is it possible to get the page ID in the same API request in which you get the transclusions? That way, instead of making 165,000+ API calls (one for each page), you only make about 33 calls.

Done—cyberpower _Online 12:24, 25 July 2013 (UTC)

--Chris 09:29, 25 July 2013 (UTC)
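The batching idea from the last review point can be sketched as follows. The MediaWiki API's list=embeddedin module returns a page ID alongside each title, so no per-page lookup is needed; with up to 5,000 results per request, 165,000 pages take about 33 requests. The function name, the $apiGet callable, and the fake responses are hypothetical, and the continuation handling assumes the API's current "continue" response format rather than whatever the bot used in 2013.

```php
<?php
// $apiGet is a hypothetical callable: takes query parameters, returns the decoded JSON response.
function listTransclusions(callable $apiGet, string $template): array
{
    $params = [
        'action'  => 'query',
        'list'    => 'embeddedin',
        'eititle' => $template,
        'eilimit' => 'max',
        'format'  => 'json',
    ];

    $pages = [];
    do {
        $result = $apiGet($params);
        foreach ($result['query']['embeddedin'] ?? [] as $page) {
            // Each entry already carries the page ID, so no extra call per page.
            $pages[$page['pageid']] = $page['title'];
        }
        // Feed the continuation values back into the next request.
        $continue = $result['continue'] ?? null;
        if ($continue !== null) {
            $params = array_merge($params, $continue);
        }
    } while ($continue !== null);

    return $pages;
}

// Fake API returning two batches, standing in for real HTTP requests.
$batches = [
    [
        'query'    => ['embeddedin' => [
            ['pageid' => 11, 'ns' => 0, 'title' => 'A'],
            ['pageid' => 12, 'ns' => 0, 'title' => 'B'],
        ]],
        'continue' => ['eicontinue' => '0|13', 'continue' => '-||'],
    ],
    [
        'query' => ['embeddedin' => [
            ['pageid' => 13, 'ns' => 0, 'title' => 'C'],
        ]],
    ],
];
$requests = 0;
$fakeApi = function (array $params) use ($batches, &$requests): array {
    return $batches[$requests++];
};

$pages = listTransclusions($fakeApi, 'Template:Example');
```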

AHHH. How did I not see that regex scan bug? D: Thanks for the input. I'll make the appropriate modifications now. I completely forgot that the framework was already designed to handle errors. :D
Modifications finished.—cyberpower _Online 12:44, 25 July 2013 (UTC)

Trial

Ok, we'll start with a small trial to make sure everything runs smoothly, and then we can move onto a much wider trial. Approved for trial (50 edits). Please provide a link to the relevant contributions and/or diffs when the trial is complete. --Chris 10:57, 2 August 2013 (UTC)

It started out ok, but then something went horribly wrong and it started tagging pages with empty tags. I have terminated the bot at the moment and will be looking into what caused the problems.—cyberpower _Online 12:46, 10 August 2013 (UTC)

Bug found. Bot restarted.—cyberpower _Online 19:58, 10 August 2013 (UTC)

Trial complete. I haven't looked at the edits yet as it's currently the middle of the night right now.—cyberpower _Offline 00:32, 12 August 2013 (UTC)

Even after the restart, 2 pages had blank tags added (1, 2). Also, maybe non-article pages should be skipped (3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17), unless there is some reason that these should have the links removed. Jay8g 01:18, 13 August 2013 (UTC)

Thank you. I am already looking into the bug, and I am working on excluding namespaces.

The bugs have been fixed. The exceptions list now supports entire namespaces.—cyberpower _Online 12:14, 15 August 2013 (UTC)

Approved for extended trial (1000 edits). Please provide a link to the relevant contributions and/or diffs when the trial is complete. Although I would ask that you do them in batches (maybe 100 or 200 edits at a time) --Chris 10:16, 24 August 2013 (UTC)

The bugs seem to be fixed. However, upon reactivating the bot, it began tagging spam reports in the Misplaced Pages space. I have added the Misplaced Pages namespace into the exclusions list.—cyberpower _Online 16:06, 24 August 2013 (UTC)
- I think that you should exclude pages in the file namespace and all talk namespaces as well. Jay8g 18:01, 25 August 2013 (UTC)
  - Why?—cyberpower _Online 18:22, 25 August 2013 (UTC)
    - With talk, it is usually discussing the inclusion of the link. With file, it is typically used as a file source. Neither should be removed. Jay8g 02:53, 26 August 2013 (UTC)
      - I've excluded MediaWiki talk, Talk, File talk, and File namespaces.—cyberpower _Online 11:37, 26 August 2013 (UTC)
      - User talk might be useful as well, for the same reason. Jay8g 16:11, 26 August 2013 (UTC)
      - Also, Misplaced Pages Talk should be excluded (see this edit). Jay8g 16:30, 26 August 2013 (UTC)
      - I've already added those.—cyberpower _{Limited Access} 17:06, 26 August 2013 (UTC)
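The namespace exclusions agreed in this thread could be expressed as a simple filter on the namespace number the API returns for each page. The sketch below uses MediaWiki's standard namespace IDs; the constant, the function name, and the page-array shape are hypothetical and do not reflect how the bot's exceptions list actually stores them.

```php
<?php
// Standard MediaWiki namespace IDs for the namespaces excluded above:
// 1 Talk, 3 User talk, 4 project namespace, 5 project talk,
// 6 File, 7 File talk, 9 MediaWiki talk.
const EXCLUDED_NAMESPACES = [1, 3, 4, 5, 6, 7, 9];

// $page is assumed to look like an API result entry: ['ns' => int, 'title' => string].
function shouldScanPage(array $page): bool
{
    return !in_array($page['ns'], EXCLUDED_NAMESPACES, true);
}

$article  = ['ns' => 0, 'title' => 'Example'];
$talkPage = ['ns' => 1, 'title' => 'Talk:Example'];

$scanArticle = shouldScanPage($article);   // true: articles are scanned
$scanTalk    = shouldScanPage($talkPage);  // false: talk pages are skipped
```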
Comment - You may want to change the edit summary from "Tagging page with Spam-links" to "Tagging page with Template:Spam-links" to make it clear that the bot isn't adding links, but adding a template. Giving users a link to the template documentation may also help reduce the number of comments you get on your bot's talk page. GoingBatty (talk) 14:42, 26 August 2013 (UTC)
Good idea.—cyberpower _Online 15:01, 26 August 2013 (UTC)
Comment Frankly, these spamlink tags are massive, too invasive and even disturbing for the readers. They should have a more decent and reasonable size and position. Cavarrone 11:10, 28 August 2013 (UTC)
The intent is to tackle spam links. They are supposed to be very noticeable.—cyberpower _Offline 13:42, 28 August 2013 (UTC)
I understand the intent very well, but the result is distracting and objectively disturbing for readers, and therefore damaging to the encyclopedia, as the main purpose of Misplaced Pages should be serving readers, not punishing spammers (at least not in a way that affects the whole article). With that size and position the template becomes almost the main topic of the article, and this is IMHO unacceptable. I am not even sure it hurts them (if their intent is to publicize a website, I doubt they are bothered by having their links at the head of an article). We have filters and whitelists to prevent further spamming of a blacklisted website, but marking or working on previous (possible) spam requires a finer and more accurate approach than this. I predict a lot of similar complaints when/if the bot runs on a regular basis (e.g., do you have any idea how many articles use the blacklisted New York Times alone as a source?). Cavarrone 15:23, 28 August 2013 (UTC)
This is merely a starter template. If people want it changed, then consensus will change.—cyberpower _{Limited Access} 17:24, 28 August 2013 (UTC)
Cyberpower, you have a serious bug. None of the links on Access2Research are blacklisted. The bot reports a match on a petition rule, but the rule is more specific than just the word 'petition'; the actual rule catches only a couple of domains, not all links with that term. See the local and meta talk pages of the blacklists for requests regarding petitions ... Note that the links there are actually saved, so they are not blacklisted ... --Dirk Beetstra 18:43, 28 August 2013 (UTC)
The petition rule doesn't seem specific to me. I hardly call \bpetition(?:online|s)?\b specific. It's not a bot bug.—cyberpower _Online 19:04, 28 August 2013 (UTC)

Those links are not blacklisted; I just managed to save a page with one of the links ... --Dirk Beetstra 21:20, 28 August 2013 (UTC)

If they exist already, they won't be blocked. Also, I have found that the filter only partially enforces the regex list. Sometimes it blocks links with petition in it and other times it doesn't. The regex generator is the same as MediaWiki's extension. The validation process of these links is identical. If it's really a bug, then it's a bug with PHP.—cyberpower _Online 22:03, 28 August 2013 (UTC)

See diff --Dirk Beetstra 21:23, 28 August 2013 (UTC)

Other point, I would really consider to move the template to a more neutral name, like 'blacklisted-links'. Not all links are spam that are on the blacklist, they are however all blacklisted .. --Dirk Beetstra 18:51, 28 August 2013 (UTC)
That can be sorted out afterwards.—cyberpower _Online 19:04, 28 August 2013 (UTC)

It is also tagging links to Google Books with search strings that contain "forbidden words" like "petitions". That's not a spam link, and it does not trigger the spam filter either, because it isn't forbidden in that context. I'll also second that the box is too large and overwhelming. Slp1 (talk) 01:53, 29 August 2013 (UTC)

Cyberpower678 can you stop the trial until we sort out the above. --Chris 02:00, 29 August 2013 (UTC)

Per my talk page discussion, it has been brought to my attention that these issues are indeed a bot issue, not because it's bugged, but because it's running outdated code. I seem to have downloaded an outdated version of the extension. I'll make the modifications in the next few days. My bot has been shut down for quite some time now. I still recommend that petition regex be removed. There's no need for it. As for the spam links template, that can be fixed later as it's not crucial to the bot's operation.—cyberpower _Online 02:05, 29 August 2013 (UTC)

I am glad that you have found the problem, but I can't say I agree that the template is not crucial to the bot's operation or that it can be fixed later. This is an encyclopedia and the template (rather than the code) is what our readers and editors see and use. The prior template was inappropriately large and overwhelming; it talks about "spam links", which is not the case, and uses the term "external links" in a way that is not consistent with WP's definition at external links. At least on the William Wilberforce page, the promised list of "problematic links" was not shown, which meant we had to dig to try and figure out what the (non)problem was. Given that the bot is in a trial stage and is making mistakes, it seems to me that it would be better to provide information about where to report errors in tagging, rather than the current formulation, which suggests that the bot can do no wrong, that the article and its editors (or the blacklist) are at fault, and that they need to figure out the problem and act on it or the bot will be back. That's not the case, and feedback needs to be given, received and then acted upon with a positive spirit. Slp1 (talk) 11:54, 29 August 2013 (UTC)

I agree with you completely. The problem was that I was convinced that there was no issue with the code, that the regex is generated by the blacklist extension itself, that the bot simply validates the regex against the blacklist. I was right in every aspect. Where I was wrong was that I was running an out of date version of the extension. The newer version has a more refined regex generator. The template layout should be decided by the community. I merely created something for the bot. I'm not going to force this template on the community if they don't want it.—cyberpower _Offline 12:47, 29 August 2013 (UTC)

To expand on what I said, I meant that the template is not crucial to fixing bugs in the bot that may cause mistags. I do agree that the template will need to be fixed and adjusted to reflect consensus, but for now I am only concerned with making sure the bot operates correctly.—cyberpower _Online 14:21, 29 August 2013 (UTC)

I have updated the regex scanner. It should now mirror Misplaced Pages's blacklist filter when scanning regexes. The change will go into effect on the next run.—cyberpower _Online 14:23, 29 August 2013 (UTC)
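The mistags discussed above (for example, Google Books search URLs containing "petitions") are what one would expect if a blacklist fragment is matched against the whole URL rather than only its host part. The sketch below contrasts the two approaches. The host-anchored wrapper is an assumption about how the filter behaves, not a copy of the extension's code, and the example URLs are made up; only the fragment itself is quoted from this discussion.

```php
<?php
// Blacklist fragment quoted earlier in the discussion.
$fragment = '\bpetition(?:online|s)?\b';

// Naive scan: the fragment may match anywhere in the URL,
// including the path and the query string.
$naive = '/' . $fragment . '/i';

// Host-anchored scan (assumed form): only hostname characters may sit between
// "//" and the fragment, so the path and query string are never searched.
$anchored = '/https?:\/\/+[a-z0-9_\-.]*(?:' . $fragment . ')/i';

$bookSearch   = 'http://books.google.com/books?id=abc&q=petitions'; // hypothetical URL
$petitionSite = 'http://www.petitions.example/sign';                // hypothetical URL

$naiveHitsBook      = preg_match($naive, $bookSearch);      // 1: false positive on the query string
$anchoredHitsBook   = preg_match($anchored, $bookSearch);   // 0: host does not contain the fragment
$anchoredHitsDomain = preg_match($anchored, $petitionSite); // 1: fragment appears in the hostname
```

If the filter really anchors rules this way, a scanner that drops the prefix would tag pages whose links the filter itself would allow, which fits the reports from Dirk Beetstra and Slp1.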

Category:

Open Misplaced Pages bot requests for approval