Revision as of 22:26, 6 February 2008 by Jtir (→Hey, you forgot some links !: more on User:Dispenser/Link checker; this is a very nice tool) | Revision as of 17:43, 10 February 2008 by Dispenser (changed to example.com; using {{ambox}}; added link to live refbot; started new section on why we aren't allowed to do metadata)
Revision as of 17:43, 10 February 2008
What is DumZiBoT doing?
He is converting bare external links in references into named external links.
Here are some examples of his work, and here is what he is doing now.
He usually runs every time that a new XML dump is available.
His owner is NicDumZ.
The idea
References like these:
<ref></ref>
<ref>http://www.google.fr</ref>
are converted into this:
<ref></ref>
They look like this:
- The title used as the link title is the HTML title of the linked page (from the <title> tag).
- Newlines, linefeeds, and tabs in titles are converted into a single space to avoid long titles. Extra spaces are also removed.
- Titles containing ], several consecutive } or ' are handled correctly by converting some of those characters to their HTML entities (example: "This title encloses brackets [here]").
- When the content-type is not text/html (media files, .pdf, .doc, etc.), I can't automatically find a title, so I only convert references to <ref>http://lien.org/doc.pdf</ref>.
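The title clean-up described in the list above can be sketched in Python. This is only an illustration, not DumZiBoT's actual code: the function names (`clean_title`, `escape_title`, `name_bare_ref`) are hypothetical, and the exact choice of which characters become entities is an assumption based on the description.

```python
import re


def clean_title(raw_title: str) -> str:
    """Collapse newlines, tabs and repeated spaces into single spaces."""
    return re.sub(r"\s+", " ", raw_title).strip()


def escape_title(title: str) -> str:
    """Escape characters that would break an external link label.

    Assumption: ']' is always escaped (it would close the link early),
    while '}' and "'" are escaped only when two or more appear in a row.
    """
    title = title.replace("]", "&#93;")
    title = re.sub(r"\}{2,}", lambda m: "&#125;" * len(m.group()), title)
    title = re.sub(r"'{2,}", lambda m: "&#39;" * len(m.group()), title)
    return title


def name_bare_ref(url: str, raw_title: str) -> str:
    """Build a named reference from a URL and the page's raw HTML title."""
    return f"<ref>[{url} {escape_title(clean_title(raw_title))}]</ref>"


if __name__ == "__main__":
    print(name_bare_ref("http://www.google.fr", "  Google\n"))
```

Escaping only runs of two or more `}` or `'` keeps ordinary apostrophes readable while still preventing accidental template closers or bold/italic markup.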
Hey, you forgot some links !
If a link is unchanged after an edit by DumZiBoT, please check that you can access the linked page. If you think that a particular link was ignored by DumZiBoT because it's particular, please poke me.
Some links may not be changed, even after DumZiBoT's run. These things may have occurred :
- The HTML linked page has no title (rare, but happens).
- DumZiBoT got an HTTP error while trying to get the page (see 4xx Client Error and 5xx Server Error). The link may be invalid, or the page may no longer be available or may be protected. These links should be removed, but chances are that the error is temporary: that's why I do not remove these links on the basis of a single try! Also, some pages, such as Google cache links and Google Books pages, give bots a 401/403 error although they are available.
- Dispenser provides a tool that checks all external links in an article, generates a detailed table reporting the status of each link, and allows an editor to make changes to the article, including tagging dead links and updating redirects. In addition, he made a crude online reflink.py.
- Either the link or the html title is blacklisted.
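The cases listed above can be summarised as a small decision function. This is a rough Python sketch under stated assumptions: the function name `decide` and the returned labels are hypothetical, and the real bot may check these conditions in a different order.

```python
from typing import Optional


def decide(status: int, content_type: str, title: Optional[str]) -> str:
    """Return what to do with one bare link, given the HTTP response.

    The labels are illustrative only; they are not taken from the bot.
    """
    # 4xx and 5xx responses: leave the link alone, the error may be temporary.
    if 400 <= status < 600:
        return "leave_unchanged_http_error"
    # Non-HTML content (media, .pdf, .doc, ...) has no <title> to read.
    if not content_type.lower().startswith("text/html"):
        return "convert_without_title"
    # HTML page without a usable title (rare, but happens).
    if title is None or not title.strip():
        return "leave_unchanged_no_title"
    return "convert_with_title"


if __name__ == "__main__":
    print(decide(403, "text/html", None))
    print(decide(200, "application/pdf", None))
    print(decide(200, "text/html; charset=utf-8", "Google"))
```

Blacklist checks would come on top of this and are not included here.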
Blacklists
- Link blacklist : for now, only JSTOR links are ignored, since for non-registered users JSTOR gives the message: "JSTOR: Accessing JSTOR". Please contact me if you think that a particular domain should be blacklisted.
- Title blacklist : Based on an original idea from Dispenser, I exclude links whose titles contain register, sign up, 404 not found, and so on.
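A minimal Python sketch of how the two blacklists could be applied. The names `LINK_BLACKLIST`, `TITLE_BLACKLIST` and `is_blacklisted` are hypothetical, the domain `jstor.org` is an assumption about how JSTOR links would be matched, and the title list contains only the three phrases mentioned above.

```python
from urllib.parse import urlparse

# Assumption: JSTOR links are recognised by their domain name.
LINK_BLACKLIST = {"jstor.org"}
# Only the phrases named in the text; the real list is longer ("and so on").
TITLE_BLACKLIST = ("register", "sign up", "404 not found")


def is_blacklisted(url: str, title: str) -> bool:
    """True if the link's domain or the page title is on a blacklist."""
    host = (urlparse(url).hostname or "").lower()
    for domain in LINK_BLACKLIST:
        # Match the domain itself and any of its subdomains.
        if host == domain or host.endswith("." + domain):
            return True
    lowered = title.lower()
    return any(phrase in lowered for phrase in TITLE_BLACKLIST)


if __name__ == "__main__":
    print(is_blacklisted("http://www.jstor.org/stable/1", "JSTOR: Accessing JSTOR"))
    print(is_blacklisted("http://www.google.fr", "Google"))
```

Plain substring matching is the simplest choice, but it also rejects harmless titles that happen to contain a listed word, so a real list would need some care.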
Meta-data
Why doesn't DumZiBoT include extra information like access date, author, or publication, or use the {{cite}} series of templates? Changing an article's citation system is against Wikipedia policy: RefBot was blocked by a ruling of the Arbitration Committee for doing this.
And what about server load ?
The search for pages containing invalid references is made from the last XML dump. DumZiBoT only fetches from the servers pages that needed modifications at the time of the dump. (Some pages are downloaded but eventually do not need changes, because the references were fixed between the dump and the fetch.)
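The two-step approach described above (scan the dump offline, then re-check each live page before editing) might look roughly like this in Python. It is a sketch under assumptions: the regular expression `BARE_REF` only covers plain `<ref>` tags with a bare or bracketed URL, and the function names and the dictionary standing in for the XML dump are hypothetical.

```python
import re
from typing import Dict, List

# Matches <ref>URL</ref> and <ref>[URL]</ref>, but not <ref>[URL title]</ref>.
BARE_REF = re.compile(
    r"<ref>\s*\[?\s*(https?://[^\s\]<]+)\s*\]?\s*</ref>", re.IGNORECASE
)


def find_bare_refs(wikitext: str) -> List[str]:
    """Return the URLs of all bare references in the given wikitext."""
    return BARE_REF.findall(wikitext)


def pages_to_fetch(dump_pages: Dict[str, str]) -> List[str]:
    """Offline step: keep only page titles whose dump text has bare refs."""
    return [title for title, text in dump_pages.items() if find_bare_refs(text)]


def still_needs_edit(live_text: str) -> bool:
    """Online step: the page may have been fixed since the dump was made."""
    return bool(find_bare_refs(live_text))


if __name__ == "__main__":
    dump = {
        "A": "text<ref>http://www.google.fr</ref>",
        "B": "text<ref>[http://www.google.fr Google]</ref>",
    }
    print(pages_to_fetch(dump))
```

Only the pages returned by `pages_to_fetch` would ever be requested from the servers, which is what keeps the load low.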
Where should I report a problem ?
- User Talk:NicDumZ, and not elsewhere =)