User talk:NicDumZ

Article snapshot taken from Wikipedia, available under the Creative Commons Attribution-ShareAlike license.

Revision as of 09:13, 8 February 2008



Your bot request

Hi NicDumZ, I wanted to let you know that Wikipedia:Bots/Requests for approval/DumZiBoT has been approved. Please visit the above link for more information. Thanks! BAGBot (talk) 03:30, 3 February 2008 (UTC)

Your bot is most excellent

Very nice edit at acupuncture. A much-needed service. rock on, Jim Butler (t) 10:15, 3 February 2008 (UTC)

Ah ! Thanks a lot xD NicDumZ ~ 10:16, 3 February 2008 (UTC)
Indeed — good work! (And after reading the approval process page, I really do mean work!) Thanks! — the Sidhekin (talk) 11:09, 3 February 2008 (UTC)
Hear hear! I agree. :) Jmlk17 11:11, 3 February 2008 (UTC)
hey, just adding in my little note of praise. I want to marry your bot and have its babies! :D Mathmo 19:09, 6 February 2008 (UTC)

thanks

Glad to see a bot doing this.

Does it also convert inline exlinks to refs?

I made such a change here. (It's the first change, the second just needed a different name.) --Jtir (talk) 14:16, 3 February 2008 (UTC)

Thanks !
No, it does not convert inline links! I'm afraid this would cause too much trouble: how could a bot be sure that an inline link should be converted into a reference? NicDumZ ~ 14:19, 3 February 2008 (UTC)
OK. I just noticed the helpful link in the bot's edit summary. --Jtir (talk) 14:23, 3 February 2008 (UTC)
Yes. Please feel free to edit that page if you think that it needs some improvements. Some things are obvious to me, but might not be as obvious to others... Also, my English is not that good :) NicDumZ ~ 14:31, 3 February 2008 (UTC)
"He runs every time that a new XML dump is available."
Would it be better to say "He usually runs every time ..."?
I'm not so happy with the phrase "... please check that you can access the pages", but that's the best I could come up with. --Jtir (talk) 15:37, 3 February 2008 (UTC)
Well, I do appreciate your help ! The best that you could come up with is way better than my not-so-good academic English... Thanks a lot !
I've added usually: You are right, the run frequency depends on my availability :)
NicDumZ ~ 15:49, 3 February 2008 (UTC)
Glad to help out. --Jtir (talk) 16:04, 3 February 2008 (UTC)

Your bot is awesome

Your bot edited two pages and cleaned up the reference sections, a job that I really don't like doing. Thank you, your bot is very useful. EconomistBR (talk) 15:40, 3 February 2008 (UTC)

Can I just add to that "Yippee!!!"? This is wonderful to see! Thank you! -- SatyrTN (talk / contribs) 15:56, 3 February 2008 (UTC)
+1. I see there are some issues, but I think the bot is doing a great job. utcursch | talk 04:42, 4 February 2008 (UTC)
Your bot rules, DumZiBoT just edited the GT Interactive article, over 60 references. Now the reference list looks so elegant, clean and easy on the eyes.
I can't wait to see DumZiBoT editing the Vale (mining company) (80 bare references) and Infogrames (58 bare references), the changes on those pages will be huge. EconomistBR (talk) 07:45, 4 February 2008 (UTC)

Amen, amen. Replacing cryptic ref URLs with the corresponding <title> element via a bot is a fantastic idea!! Kudos. — ¾-10 01:53, 5 February 2008 (UTC)

possibly missed exlinks

This edit possibly missed the exlinks in a named ref (<ref name="RFC3092">). The named ref had the bare exlink repeated in three places. I made the corrections in these two edits (it took two edits because I didn't realize the exlink had been repeated in three places). And, yes, the page can be accessed. :-) --Jtir (talk) 16:45, 3 February 2008 (UTC)

Nice catch !
I've just corrected this :)
NicDumZ ~ 16:57, 3 February 2008 (UTC)
Thanks! That's the fastest bug fix I have ever seen when I wasn't also the programmer. :-) (I guess this is your test suite.) --Jtir (talk) 17:09, 3 February 2008 (UTC)
Yes it is ! Feel free to add links over there !! NicDumZ ~ 17:11, 3 February 2008 (UTC)

Your bot is not helpful

You need to turn off this bot, especially on science articles. You are making it difficult if not impossible to watch science articles for trolls, vandals and POV-warriors, because all I see on my watchlist is your useless bot. You are making Wikipedia worse off, not better, because once the POV warriors know how your bot works, they'll just put in links without titles, and your bot will format it, making yours the last change in history. This will take more work using Twinkle or other vandal fighting tools. Either turn the thing off, or I will ask for administrative assistance. OrangeMarlin 17:50, 3 February 2008 (UTC)

Can't you ask nicely, mmh ?
NicDumZ ~ 17:57, 3 February 2008 (UTC)
"especially on science articles": Could you cite a specific example? --Jtir (talk) 17:55, 3 February 2008 (UTC)
I think the bot is great, but Orangemarlin has a point -- shouldn't there be an option to ignore bot edits on watchlists? Just like you can ignore edits marked "minor"? csloat (talk) 18:37, 3 February 2008 (UTC)
There is an option to ignore bots. However, my bot has not been flagged yet, hence is not considered by MediaWiki as a bot. Just wait a few hours :) NicDumZ ~ 19:00, 3 February 2008 (UTC)
I will resume my edits once DumZiBoT gets flagged. But seriously Orangemarlin, adopting such a condescending tone is not the way to go. I expect an apology. NicDumZ ~ 18:38, 3 February 2008 (UTC)
I've changed my mind: since your bot is listed as approved by WP:BAG, you may operate it; just keep it under 3-4 edits per minute until you are flagged. DumZiBoT has been approved (hence is considered useful), and I have reduced the edit rate a bit: I see no reason to stop. NicDumZ ~ 18:57, 3 February 2008 (UTC)
Orangemarlin : Next time, please feed me with some diffs... NicDumZ ~ 18:57, 3 February 2008 (UTC)
I don't feed diffs, because frankly I'd rather edit articles than try to prove anything, since, as I specifically stated, your bot makes my life difficult on Wikipedia. But so do incompetent admins, anti-science editors, and trolls. You got my opinion, you ignored my opinion, I'm fine with that decision. OrangeMarlin 21:47, 3 February 2008 (UTC)
As another editor said below, once it gets caught up with all the untitled references, everything should be fine. --Jim Butler (t) 07:52, 5 February 2008 (UTC)
How can it be a bad thing for the bot to fetch the URL title? Now you will be spared from having to look at the URL to decide whether or not it is useful. It makes my life easier, and I edit science articles too. So what then? --Adoniscik (talk) 03:16, 6 February 2008 (UTC)

In da middle

It looks good to me, so long as the titles are accurate. On the other hand, I can see concerns about bots such as those raised above. In any case, so long as you're amenable to receiving feedback when and if problems arise, I think both you and the bot will be happy together. Clever, by the way.  :) As a used-to-be programmer I wouldn't mind seeing the code. Cheers. &#0149;Jim62sch&#0149; 21:08, 3 February 2008 (UTC)

Thanks ! The code (slightly out of date) is available for now here. I believe that the above problems will be fixed once my bot gets flagged, though... :) NicDumZ ~ 21:10, 3 February 2008 (UTC)
Very, very nice. Logical and well-referenced. Congrats, well done! &#0149;Jim62sch&#0149; 21:21, 3 February 2008 (UTC)

Your bot is wonderful!

What a great idea for a bot. This is something that I manually do all of the time. Your bot is very helpful, and I cannot believe that no one had thought of it sooner. Kudos! нмŵוτнτ 18:01, 3 February 2008 (UTC)

I just don't know what to answer. It makes me xD ! Thanks :) NicDumZ ~ 19:26, 3 February 2008 (UTC)

Ditto. I think DumZiBoT is doing fine. Descriptive text as a link label sure beats just a number () in the references section. -Fnlayson (talk) 20:01, 3 February 2008 (UTC)

More support: even if it picks the wrong phrase it's still an improvement, and any human follow-up is easier than before. --Old Moonraker (talk) 07:42, 4 February 2008 (UTC)

Your bot is a pain...

...but I hope that, when it has caught up with all the untitled references, I might find some reason to look at my watchlist again! TINYMARK 23:08, 3 February 2008 (UTC)

It's now flagged. I think that this should be better now :) NicDumZ ~ 00:20, 4 February 2008 (UTC)

Great bot but...

Hi. Your bot is really useful but it needs some tuning, I think. Can you please exclude JSTOR links? Check here. For non-registered users JSTOR gives the message: "JSTOR: Accessing JSTOR" and doesn't show the real HTML. -- Magioladitis (talk) 01:56, 4 February 2008 (UTC)

Exception added !
Thanks ;) NicDumZ ~ 02:02, 4 February 2008 (UTC)
Is there a way that editors could be informed that such links are present in an article and may need their attention? In analogy with the image fair use notifications, perhaps a brief message on the talk page could say that DumZiBoT had not changed such links. --Jtir (talk) 09:08, 4 February 2008 (UTC)

After looking at Andrew Sullivan, I would say that the links needing attention are obvious. --Jtir (talk) 09:41, 4 February 2008 (UTC)
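The JSTOR fix above amounts to a per-domain exclusion list. A minimal Python 3 sketch of the idea (the host names and the helper are hypothetical illustrations, not DumZiBoT's actual code):

```python
from urllib.parse import urlsplit

# Hypothetical exclusion list in the spirit of the JSTOR fix above;
# these hosts serve placeholder pages to non-subscribers, so their
# <title> is useless as a link label.
EXCLUDED_HOSTS = {"www.jstor.org", "links.jstor.org"}

def should_skip(url):
    """Return True when the bot should leave this link untouched."""
    return urlsplit(url).hostname in EXCLUDED_HOSTS
```

A set lookup keyed on the exact hostname keeps the check cheap even over tens of thousands of links.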

Andrew Sullivan

Hi, can your linkbot be set loose on Andrew Sullivan? Benjiboi 08:14, 4 February 2008 (UTC)

 Done NicDumZ ~ 08:17, 4 February 2008 (UTC)
Thank you! Benjiboi 14:12, 4 February 2008 (UTC)

What a good little bot

Thank you. --Duncan (talk) 09:46, 4 February 2008 (UTC)

Bot

Here is another problem: "" http://en.wikipedia.org/search/?title=Pergolide&curid=622942&diff=188999410&oldid=150088654 Maybe you should add a bad word list that contains error message words... Сасусlе 12:42, 4 February 2008 (UTC)

mmhh, turned that function on from now on. NicDumZ ~ 14:38, 4 February 2008 (UTC)
We have a blacklist... and it matches when I run the regex on the title; just not sure why it's still adding it. The title comes from the cookie error page at . — Dispenser 14:45, 4 February 2008 (UTC)
Because the feature was disabled. I did that for a test (yesterday ?), and I, sigh, forgot to uncomment it. NicDumZ ~ 14:54, 4 February 2008 (UTC)
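The "bad word list" discussed here can be sketched as a blacklist regex run against each fetched title. The patterns below are illustrative guesses at what such a list might contain; the real list lives in the bot's source:

```python
import re

# Illustrative error-page phrases only, not the bot's real blacklist.
JUNK_TITLE_RE = re.compile(
    r"404 not found|access denied|registration required"
    r"|accessing jstor|cookies must be enabled",
    re.IGNORECASE,
)

def title_is_junk(title):
    """True when a fetched <title> looks like an error page, not content."""
    return JUNK_TITLE_RE.search(title) is not None
```

When a title matches, the bot would leave the bare link alone rather than insert a misleading label.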

list of exlinks that are excluded

Do you have a link to a list of exlinks that are excluded? (a blacklist?) I am thinking of adding a third reason to the section in User:DumZiBoT/refLinks that lists reasons an exlink might not be changed. --Jtir (talk) 12:51, 4 February 2008 (UTC)

I've commented it out, and added a banner. Tell me what you think :) NicDumZ ~ 14:19, 4 February 2008 (UTC)
Thanks, that looks good. I widened the banner, because the text was wrapping just before the last word on my display. Maybe there is a better way. (center?) --Jtir (talk) 18:23, 4 February 2008 (UTC)

Great bot

Keep it up. --Arcadian (talk) 13:09, 4 February 2008 (UTC)

Thames dumb barge?

Your bot fixed a bare reference in Landing craft, but it included the gratuitous word "dumb." Is that your idea of humor, or a flaw in your bot, or what? I have removed the word "dumb." Lou Sander (talk) 15:34, 4 February 2008 (UTC)

xD
Look at the title of your browser when opening this page: http://www.naval-history.net/WW2MiscRNLandingBarges.htm Get it :) ? My bot only copies the title from the page, no less, no more. And I have no idea why "dumb" is in the title of this page ?!
NicDumZ ~ 15:37, 4 February 2008 (UTC)
From the page: "one of his first tasks was to requisition 1000 ‘dumb’ (unpowered) Thames barges". Dispenser 16:12, 4 February 2008 (UTC)
Got it! The word wasn't in the title of the article as printed, or very visible when skimming it. I saw "dumb" and "dum" and feared the worst. Thanks for responding. Good bot. Not broken. Lou Sander (talk) 17:13, 4 February 2008 (UTC)
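The bot's core operation, copying a page's <title> verbatim, can be sketched as follows. This is a simplified stand-in for the real pywikipedia code, assuming only that the title is taken as-is with whitespace collapsed:

```python
import re

# <title> may carry attributes and span several lines, hence DOTALL.
TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)

def page_title(html):
    """Extract the <title> text verbatim, collapsing internal whitespace."""
    m = TITLE_RE.search(html)
    if m is None:
        return None
    return re.sub(r"\s+", " ", m.group(1)).strip()
```

Because the text is copied verbatim, anything the site puts in its title, including words like "dumb", ends up in the reference.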

Bug report

Here is a bug for you to fix. The text is supposed to be in Russian, but it is gibberish due to incorrect encoding.—Ëzhiki (Igels Hérissonovich Ïzhakoff-Amursky) • (yo?); 16:51, 4 February 2008 (UTC)

Thanks. Stopped the bot, looking for a fix right now. NicDumZ ~ 16:54, 4 February 2008 (UTC)
I'm afraid there's not much that I can do :
Server: nginx/0.6.25

Date: Mon, 04 Feb 2008 16:56:46 GMT
Content-Type: text/html
Transfer-Encoding: chunked
Connection: close
X-Powered-By: PHP/4.4.7
Content-Language: ru

  • The HTML source of the page does not have any charset information either, so the page does not meet any standard.
  • The characters produced by the fallback decode are valid, so I can't really detect whether a page is in Russian or not...
As a side note, my Firefox is completely lost when opening the page: I have to tell it explicitly that the page is in Russian, or else it won't render correctly.
I've added an exception: the script now tries the windows-1251 encoding when the domain name ends in .ru. It works here, but I'm afraid that this might not work everywhere, or cause some collateral problems... NicDumZ ~ 17:14, 4 February 2008 (UTC)
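That .ru exception can be sketched as a per-TLD fallback table. Only the .ru entry comes from the discussion above; the helper name and the default are assumptions for illustration:

```python
from urllib.parse import urlsplit

# Only the .ru entry is described above; the table stays minimal on purpose.
TLD_FALLBACK = {"ru": "windows-1251"}

def fallback_encoding(url, default="windows-1252"):
    """Guess an encoding from the domain's TLD when headers and meta
    tags give none (or gave one that failed to decode)."""
    host = urlsplit(url).hostname or ""
    tld = host.rsplit(".", 1)[-1]
    return TLD_FALLBACK.get(tld, default)
```

As noted, this is a heuristic: a .ru site serving KOI8-R or UTF-8 would still be misdecoded.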
I don't really know how to fix it myself (or I would have told you :)), but I'll keep an eye on the changes the bot does to Russia-related articles (since I've got about 5,000 of them watchlisted, I am reasonably sure I'll be able to catch cases like this).
On a different note, let me express my gratitude—this kind of bot is something we've been needing for quite a while now! Great job, keep it up. Cheers,—Ëzhiki (Igels Hérissonovich Ïzhakoff-Amursky) • (yo?); 17:29, 4 February 2008 (UTC)
I've examined the issue. On Firefox 3 Beta 2 it will detect the proper encoding if the Character Encoding > Auto-Detect > Universal option is set. It also renders correctly in Opera 9.23. On IE6 it renders as Shift-JIS. Safari 3 doesn't have an auto-detect option. I tried saving the files to disk to get rid of the metadata in the HTTP headers; Firefox still somehow works, but Opera fails. My conclusion is that Opera has different default encodings based on header information, most likely the Content-Language (Update: this appears not to be the case; they're both apparently using statistical methods on the languages to determine the encoding — Dispenser 00:51, 5 February 2008 (UTC)). Also, can't we include one of those nifty Language icons if Content-Language is specified? — Dispenser 22:44, 4 February 2008 (UTC)
Firefox 2.0.0.10, with "Auto-Detect Russian" enabled, detected the encoding as "Cyrillic (Windows-1251)" and rendered the page correctly. --Jtir (talk) 23:15, 4 February 2008 (UTC)
Looking at the source again, it seems to me that it never passes the HTTP header encoding to UnicodeDammit, only the encoding declared in the page's meta tags, which UnicodeDammit looks for anyway. And as it contains chardet, it should be able to identify the encoding correctly. — Dispenser 02:49, 5 February 2008 (UTC)
"Also, can't we include one of those nifty Language icons if Content-Language is specified?" Very nice idea, again. I'm working on that. NicDumZ ~ 21:09, 7 February 2008 (UTC)
Seems operational : http://en.wikipedia.org/search/?title=Future_French_aircraft_carrier&diff=prev&oldid=189814408 :) NicDumZ ~ 21:41, 7 February 2008 (UTC)

DumZ bot

Hi, I noticed that your bot introduced a hidden comment into East Mountain that looks like spam: " TopoZone - The Web's Topographic Map, and more!" Can you explain this?--Pgagnon999 (talk) 18:24, 4 February 2008 (UTC)

Hello Pgagnon !
You'll find your answer at User:DumZiBoT/refLinks :)
Cheers !
NicDumZ ~ 21:38, 4 February 2008 (UTC)
For the record, this is the link. I have reannotated. --Jtir (talk) 19:00, 4 February 2008 (UTC)
Thanks ! :) NicDumZ ~ 21:38, 4 February 2008 (UTC)

Hmmm....interesting. Not sure how I feel about the opportunity for a free hidden advertising plug for companies with clever URL titles. . .or (in this case anyway) if the bot introduced anything of value that wasn't already inherent in the URL syntax itself, but it is what it is. . .and, at the end of the day, not a super big deal.--Pgagnon999 (talk) 23:49, 4 February 2008 (UTC)

I think that it's a wikipedian's responsibility to add titles to external links. When this hasn't been done, I do my best to fix that. If, however, the fix is not that good, well, 1) it's better than a plain hideous URL, and 2) someone would have had to edit the link to add a good title anyway; after DumZiBoT, that someone just has to *fix* the title, which is less work than checking the link and adding a title... NicDumZ ~ 15:16, 5 February 2008 (UTC)

Another issue with the bot: When a URL redirects, the bot is following it to its new destination and blithely listing the title of the new URL. Where I observed this: In List of unaccredited institutions of higher learning, http://www.asiaweek.com/asiaweek/features/universities2000/artic_online.html redirected to the current issue of TIME Magazine, so the bot left a link title of "TIME Magazine - Asia Edition - February 11, 2008 Vol. 171, No. 5". That misdirection was fairly innocuous (although the current issue of the magazine would be useless as a source, at least it's clean), and I've fixed that particular misdirection with a link to the archive.org version of the original AsiaWeek article, but I think that as a general policy the bot process should be generating a list of domains that redirect, rather than generating new titles. --Orlady (talk) 15:02, 5 February 2008 (UTC)

hmm... Right. Though, a lot of websites use soft redirects if, for example, the content has moved, or if you linked to a frame when the navigation menu is in another frame. I'm afraid that logging the redirects would not do, as there would be too many of them. However, I might add some sort of exception for Time... NicDumZ ~ 15:12, 5 February 2008 (UTC)
I hadn't thought of the frames issue... I see multiple problems with following a nonframe-related redirect. One is when the domain registration has expired and a new owner has redirected it to unsavory content. Another is that the bare URL is actually more informative to a user than the description of the new target. A user who clicks on an Asiaweek URL that has the year 2000 in its name will quickly recognize what happened when they see the current issue of Time magazine, but a user who sees a link to Time magazine that makes no sense in the context is likely to assume that the Misplaced Pages contributors were idiots. This problem is by no means unique to Time magazine -- many domains do that kind of thing with old URLs. --Orlady (talk) 15:29, 5 February 2008 (UTC)
Of course this is exactly the reason why I created my tool. By the way that article is a horrid mess with its external links. — Dispenser 01:24, 6 February 2008 (UTC)
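A cheap guard against the expired-domain case described above is to compare the host actually reached with the host requested (with urllib, the response's final URL after redirects is available via geturl()). A hedged sketch of just the comparison; the helper name is hypothetical:

```python
from urllib.parse import urlsplit

def crossed_domains(requested, final):
    """True when a fetch ended up on a different host than the one
    requested; a hint to log the link for review rather than trust
    the destination's <title>."""
    return urlsplit(requested).netloc != urlsplit(final).netloc
```

Same-host soft redirects (moved content, frames) pass this check, which matches the objection that logging every redirect would be far too noisy.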

Perhaps you can turn your attention to the article "Malleus"

While you are about it, NicDumZ, perhaps you can turn your attention to the article malleus. The info box ref to the image of the gestation stage indicated needs fixing as it directs you straight to the UNC University Wiki article, unless you as the potential reader know what you are doing. Not many of our readers might know that though. Unless he (the reader) knows to home in on the template used he is going to be nonplussed. Many thanks, and congratulations on your work. Do you actually look out for unsourced articles, too? Dieter Simon (talk) 01:41, 5 February 2008 (UTC) Dieter Simon (talk) 01:43, 5 February 2008 (UTC)

...?! I'm sorry I really don't understand what you are trying to do. I looked at the history of malleus, and couldn't understand either. {{EmbryologyUNC}} looks fine to me, so I really don't understand?! NicDumZ ~ 11:38, 5 February 2008 (UTC)
I believe he means that it is confusing to have the two links in the UNC ref conjoined without explanation. That is a problem with {{EmbryologyUNC}} and could be fixed by writing a sentence using the link names. Unfortunately, the external links have uninformative names like "subject #231 1044" and "hednk-023". I'm not sure how to fix that. --Jtir (talk) 14:58, 5 February 2008 (UTC)
Okay, understood. I tried fixing that. How is it now, Dieter ? Better ? NicDumZ ~ 15:04, 5 February 2008 (UTC)
Much better. A third argument could be an optional external link name that overrides the default "hednk-023". --Jtir (talk) 15:32, 5 February 2008 (UTC)
Yes,NicDumZ, that's what was needed, as far as I am concerned. Many thanks. Dieter Simon (talk) 23:22, 5 February 2008 (UTC)

More praise for your bot

Hey, I just saw the edits made by your bot at Krav Maga -- great bot, in both concept and performance! Kudos! JDoorjam JDiscourse 19:32, 5 February 2008 (UTC)


Absolutely brilliant - congratulations from me too --Matilda 22:41, 5 February 2008 (UTC)

DumZiBoT

Nice BOT - can you change it to use a basic citation template though ? eg

<ref>
{{Citation
| title =
| url =
}}
</ref>

Cheers -- John (Daytona2 · Talk · Contribs) 23:02, 5 February 2008 (UTC)

Seconded. Mahanga 23:12, 5 February 2008 (UTC)
Thirded :-) --Matilda 23:15, 5 February 2008 (UTC)
Fourthded :) vıdıoman 23:23, 5 February 2008 (UTC)
(edit conflict) Well, I'm not used to the English style guidelines... But it seems to me that {{citation|title=example|url=http://example.com}} gives exactly the same result as , or am I missing something ? If so, why should I complicate things, for the users and for the servers, using this intricate template ? :þ NicDumZ ~ 23:24, 5 February 2008 (UTC)
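For comparison, the two output styles under discussion can be generated side by side. The helper below is hypothetical, purely to illustrate that both forms carry the same minimal title-plus-URL information; it is not the bot's code:

```python
def make_ref(url, title, use_template=False):
    """Build the wikitext for a titled reference, in either style.
    (Hypothetical helper for illustration only.)"""
    if use_template:
        return "<ref>{{Citation | title = %s | url = %s}}</ref>" % (title, url)
    return "<ref>[%s %s]</ref>" % (url, title)
```

The template form's advantage is purely structural: it has named slots (author, publisher, date) that a later editor can fill in.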
The template can be expanded. It provides the groundwork. I'm pretty sure we're supposed to have "retrieved on" tags for references as well. vıdıoman 23:38, 5 February 2008 (UTC)
it is very good groundwork and promotes the use of additional parameters, such as who published it, which is very useful for judging the reliability of the source--Matilda 23:45, 5 February 2008 (UTC)
I'm reluctant about that idea (but some could say that I'm always reluctant about others' ideas; please be bold if you think that it's worth it):
  • DumZiBoT is dealing with tens of thousands of links. I don't think that all these references that have been left alone for so long will be granted any further information in the next few days. Even if 30% of these links get modified in the future, that would leave something like 50,000 unnecessary templates. I don't really think that the servers need that, do they?! And leaving the technical complaint apart, I personally try to use the simplest syntax I can when editing articles. I don't think that using templates when the standard syntax simply works is the way to a "newcomer-friendly" encyclopedia.
  • Also, while I really understand that this might ease the work of some contributors in the future, I'm not sure that every contributor would like to use this template. And the reading of Wikipedia:Citation templates confirmed my doubts: They may be used at the discretion of individual editors, subject to agreement with the other editors on the article. Some editors find them helpful, while other editors find them annoying.
  • Finally, the same page states Because they are optional, editors should not change articles from one style to another without consensus. As I really don't think that there is a consensus on that question, I'm not going to do this, since it could be considered as some sort of "original style guideline pushing", if you see what I mean, despite my poor English...
NicDumZ ~ 00:03, 6 February 2008 (UTC)
I commonly use cite templates to ensure a consistent reference style, but I know of one editor who is an experienced librarian and he never uses them; indeed, he removes them and does what he calls "scratch cataloging". And I agree that "access dates" are not always needed — published scholarly works (e.g. JSTOR) are not going to change, nor are court decisions, newspaper articles more than a few days old, The Bible, the works of Shakespeare, etc. Further, if sources are changing after they are "accessed", they are not verifiable. --Jtir (talk) 00:43, 6 February 2008 (UTC)
It is exactly for reasons of verifiability that access dates are recommended, in the hope that deleted information may be retrieved again using, for example, the Wayback Machine. I am a great fan of citation templates for the above reason of consistency, but I fear it would be expecting too much for a bot to intelligently retrieve the information necessary (how would it decide whether a particular name is the author or the subject of an article?) TINYMARK 01:14, 6 February 2008 (UTC)
Please read the BRfA, as these questions are redundant, although the answers there are a bit more in-depth. And I had come up with an idea of getting meta-data into the links. — Dispenser 01:40, 6 February 2008 (UTC)
I couldn't quite fathom what was the consensus on the BRfA regarding the citation template issue, but I too would be overjoyed to see a bot do the grunt work of creating {{cite}} template stubs, but what we have now is great too. Thank you very much! Adoniscik (talk) 03:10, 6 February 2008 (UTC)
My opinion is that people who are experienced in adding references are welcome to ignore the format of the {{cite}} templates. However, since these links are already poorly cited, the article can't be high in the priority of these editors. By using the {{cite}} templates, this bot could both lay the groundwork for a well-formatted citation, and also bring attention to these templates to inexperienced editors. Bluap (talk) 04:55, 6 February 2008 (UTC)
Thanks for the replies, and the pointer towards the BRfA discussion on the issue. I didn't express myself at all well, but others understood my thinking, which was flawed, because I now realise that I was advocating pushing citation templates because I think that they act to encourage high-quality referencing, and I appreciate the work put in by their constructors. Which method is likely to encourage the highest-quality reference information from the Wikipedia user base? I say citation templates. Since we're not allowed to push them, my argument is with that ruling, and I will see what I can do to challenge it. Cheers -- John (Daytona2 · Talk · Contribs) 21:13, 6 February 2008 (UTC)

Wow...

I just figured out what your bot does, after quite a bit of confusion. As soon as I figured it out, I was quite impressed. Thanks for making such a useful addition to the Wikimunity. Darkage7 (talk) 07:20, 6 February 2008 (UTC)

What a Brilliant Idea Barnstar

What a Brilliant Idea Barnstar
You are awarded this barnstar for programming DumZiBoT to expand bare references. Thanks for helping make Wikipedia a well-referenced resource. Flibirigit (talk) 07:34, 6 February 2008 (UTC)

One more voice in the crowd

I've seen probably fifty pages on my Watched list get (slightly) improved by this bot in the past two days - keep up the work, it's a great idea. Sherurcij 08:52, 6 February 2008 (UTC)

And another

Your bot is doing great work! Thank you so much! Aleta (Sing) 14:23, 6 February 2008 (UTC)

Character set problems

Your bot made rather a mess of the Meishi article by converting the anchor text for one of the references into unintelligible garbage. Does this thing understand Shift JIS? -- Sakurambo 桜ん坊 14:28, 6 February 2008 (UTC)

Well, my bot handles Shift JIS like any other encoding.
Problem is, the page http://www.youmeishi.com/contents/product/paper.html contains a badly encoded character, probably at line 365 of the HTML source, which causes Python's codecs module to raise an error (character #19563 of the HTML source, but since the codecs parser failed, I don't think that this number is reliable). I can't do anything to solve this kind of problem; that's really not my fault, sorry. NicDumZ ~ 16:34, 6 February 2008 (UTC)
The Shift_JIS source is not invalid. The character you're blaming this problem on is the "mm" glyph highlighted in this screenshot of the page's HTML source. This is equivalent to the Unicode character 0x339c ("SQUARE MM", part of the CJK compatibility code block).
The web page in question also clearly identifies itself as Shift_JIS, so it makes no sense to use any other encoding. If your software can't recognise the encoding of a web page, wouldn't it make more sense to just leave it alone? Or do you really think it's better to blame the problem on other people and carry on regardless? -- Sakurambo 桜ん坊 17:19, 6 February 2008 (UTC)
Firefox 2.0.0.11 auto-detects the encoding of this page as Shift JIS, yet the name of the page is still displayed as a string of question marks: "【名刺用紙】名刺用紙販売所". --Jtir (talk) 20:25, 6 February 2008 (UTC)
That was in Windows XP, where I do not have the Japanese language pack installed. With Firefox 2.0.0.10 in Linux, all but one character is displayed correctly, instead of question marks. So never mind, it is my problem with fonts. --Jtir (talk) 20:49, 6 February 2008 (UTC)

Well, feel free to try it yourself, instead of assuming that I'm deliberately using another encoding:

import urllib2

url = 'http://www.youmeishi.com/contents/product/paper.html'
handler = urllib2.urlopen(url)
source = handler.read()
# raises UnicodeDecodeError (illegal multibyte sequence)
to_uni = unicode(source, 'shift_jis')

There must be some problem in the encoding of the HTML source. What you have to understand is that my script first tries to convert using the encoding specified in the "meta" markup of the page. When no UnicodeDecodeError is raised, it assumes that it worked, and uses that encoding. But when an error is raised, it goes on and tries other encodings. When a "fine" codec is found, i.e. a codec that does not raise an error during the conversion, I use it. But there's no way for an automated script to determine whether a character sequence makes sense or not... (Also, some pages actually declare one encoding in their meta tags while using another; and a lot of pages do not send any encoding at all: that's why I try other encodings.)

NicDumZ ~ 09:29, 7 February 2008 (UTC)
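The cascade described above (declared charset first, then trial decodes) can be sketched in modern Python 3. This is a simplified reconstruction of the idea, not the bot's actual code, and the fallback list is an assumption:

```python
import re

# Crude meta-tag charset sniffer over raw bytes.
META_CHARSET_RE = re.compile(
    rb"""<meta[^>]+charset\s*=\s*["']?([\w.-]+)""", re.IGNORECASE)

def decode_page(raw, header_charset=None,
                fallbacks=("utf-8", "windows-1252")):
    """Try the declared charset first (HTTP header, then meta tag),
    then a fallback list; the first decode that raises no error wins."""
    candidates = []
    if header_charset:
        candidates.append(header_charset)
    m = META_CHARSET_RE.search(raw)
    if m:
        candidates.append(m.group(1).decode("ascii"))
    candidates.extend(fallbacks)
    for enc in candidates:
        try:
            return raw.decode(enc), enc
        except (UnicodeDecodeError, LookupError):
            continue
    # latin-1 maps every byte value, so this last resort never raises
    return raw.decode("latin-1"), "latin-1"
```

The objection in this thread is precisely about the loop's blind spot: a decode that raises no error is not necessarily the right one.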

OK, I got the same error (UnicodeDecodeError: 'shift_jis' codec can't decode bytes in position 19563-19564: illegal multibyte sequence).
I guess there are just some characters in Shift JIS that can't be successfully mapped to Unicode. Simply looking for another encoding that doesn't raise any errors isn't going to be safe because any random binary data will work with some 8-bit encodings such as KOI8-R and Mac OS Roman.
So instead of banging square pegs into round holes, I think it would be safer to halt the process as soon as you encounter a UnicodeDecodeError condition. -- Sakurambo 桜ん坊 11:27, 7 February 2008 (UTC)
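Sakurambo's point is easy to demonstrate: single-byte codecs that define all 256 byte values, such as KOI8-R or Mac OS Roman, will "successfully" decode any byte sequence at all, so the absence of a UnicodeDecodeError proves nothing (a standalone illustration, not code from the bot):

```python
junk = bytes([0x87, 0x6F, 0xFF, 0xFE])  # not valid UTF-8 or plain Shift JIS

# Multi-byte codecs reject the sequence...
for codec in ("utf-8", "shift_jis"):
    try:
        junk.decode(codec)
        raise AssertionError("unexpectedly decoded")
    except UnicodeDecodeError:
        pass

# ...but full 8-bit codecs happily turn any bytes into (gibberish) text:
cyrillic_gibberish = junk.decode("koi8_r")
roman_gibberish = junk.decode("mac_roman")
```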
nah, you're not listening :) What should I do then when no encoding is given in the headers, or in the meta tags? Nothing? That would exclude a lot of pages...! Much, much, much more than the hundred or so links that are being given weird titles because of an encoding error...
NicDumZ ~ 11:52, 7 February 2008 (UTC)
No information in the meta tags? Please go back to http://www.youmeishi.com/contents/product/paper.html and take a look at the HTML source. Can you see any meta tags in there? What about this one:
<meta http-equiv="Content-Type" content="text/html; charset=Shift_JIS">
(Hint: It's on line 4) -- Sakurambo 桜ん坊 12:12, 7 February 2008 (UTC)

Okay. Stop this. Read again: my point was not about that particular page, but about others. If I stop at the first UnicodeDecodeError that I get, that means that I will not be able to detect any encoding for pages not specifying their encodings. And I was saying that pages not specifying their encoding are way more common than pages specifying an encoding but badly encoded, hence I made the implementation choice to try to detect an encoding, since the false positives are very rare. (Over 25,000 contributions, fewer than 10 errors have been reported to me. You could say that some errors remain undetected, but even assuming, exaggeratedly, that 500 errors remain, that would make a 0.02% error rate. Come on, give me some space.)

NicDumZ ~ 12:27, 7 February 2008 (UTC)

You're just not paying attention are you? Let me spell it all out once again:
  1. I'm not talking about pages that fail to provide encoding information (either in the HTTP headers or in a meta HTML tag).
  2. The problem with your bot is that it ignores encoding information that has already been provided if it has difficulties converting pages into unicode.
  3. If your bot cannot successfully convert a web page from its declared encoding into unicode, then it should admit defeat.
  4. It makes absolutely no sense to use windows 1252 encoding to convert a page that has explicitly declared itself to use a different encoding.
Now apparently you're having trouble understanding one of the above points (1-4). If you could let me know which one, then I'll try to make the explanation a bit simpler for you. -- Sakurambo 桜ん坊 13:08, 7 February 2008 (UTC)
Just... calm down, would you? The answer to #4 is here. I don't understand #2. When you can't convert a text into Unicode using a charset, you just... can't. That means that one or several byte sequences have no equivalent in the charset, or, better, that the charset does not specify any corresponding Unicode character for that byte sequence. Knowing that I can't use this charset for a particular character, what should I do? Try to convert this byte sequence separately, using another charset? That makes no sense! The only thing that you can do is to try something else, another encoding for the whole text, because it must have been encoded differently; that's it. And no, it definitely should not admit defeat in this case: I saw, several times, pages that were declaring a charset while they were actually using another charset. NicDumZ ~ 13:19, 7 February 2008 (UTC)
I'm perfectly calm, thank you. The bold text was just there to hold your attention, which seems to be in rather short supply.
Anyway, since you are having problems understanding #4, let me elaborate:
  1. If a web page contains a meta tag specifying the character set as "Shift_JIS" (or "GB2312"), then you can be fairly certain that it contains Japanese (or Chinese) characters.
  2. There are some code points in Shift JIS and GB2312 that render correctly but are apparently not directly compatible with the Unicode standard. I have already pointed out two examples for you.
  3. Windows 1252 encoding does not support Japanese or Chinese.
  4. It is therefore meaningless to use Windows 1252 to convert pages encoded as Shift JIS or GB2312.
Again, if you don't understand any of these points, just let me know. I would also appreciate it if you could provide the URLs for some of these "pages that were declaring a charset while they were actually using another charset". -- Sakurambo 桜ん坊 13:35, 7 February 2008 (UTC)
I just noticed you're having problems with #2 as well. What this means is that if your bot has problems converting a page into unicode using the character set stated in a meta tag, it ignores the meta tag and uses windows 1252 instead. Is this not correct? -- Sakurambo 桜ん坊 13:40, 7 February 2008 (UTC)
There are some code points in Shift JIS and GB2312 that render correctly but are apparently not directly compatible with the Unicode standard. No, you're wrong. Yes, of course, you can print them in Unicode. I too, like you, can see these characters perfectly well in Firefox, and that's because it's the point of Unicode: it can print anything. Every codepoint has a meaning in Unicode, but not every codepoint has a meaning in a charset, and that's precisely why it causes problems! A charset is a dictionary: a codepoint, a number, mapped to a Unicode character. But the fact is that it is an encoding, i.e. the same codepoint in Unicode and in a charset doesn't render the same character. For example, C006 is 쀆 in Unicode, while it is ∑ in GB2312. And when you try to use such a dictionary, if a key is missing, i.e. no Unicode character is given for that codepoint, well... you just don't know what it is. I understand perfectly what you mean: Firefox *can* actually print these characters, using some tricks, or complex heuristics to guess what character it is. But the fact is that I don't know what these heuristics are. That's it: I'm not ignoring the charset, or lazily giving it up; I just know no way to find which Unicode character this two-byte sequence is. Do you get this? Because the more we talk, the more it seems that you have trouble understanding how a charset works, and that might be why we just can't find the right questions, or the right answers... NicDumZ ~ 13:52, 7 February 2008 (UTC)
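The general point — that the same byte sequence maps to entirely different characters depending on which codec's table you look it up in — is easy to show (a standalone illustration; the byte pair here is my own example, not one from the discussion):

```python
pair = b"\xc4\xea"

# As GB2312 (a two-byte code point) this is the single character 年...
as_gb = pair.decode("gb2312")   # '年'
# ...while windows-1252 sees two unrelated one-byte characters:
as_win = pair.decode("cp1252")  # 'Äê'

# Same bytes, different dictionaries, different text -- and neither
# lookup raises an error, so decoding "successfully" proves nothing
# about having picked the right charset.
```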
This is getting very tiresome. For someone who claims to be so knowledgeable about character encodings, you do sound rather uninformed.
Take a look at this PDF file (it's in Japanese; the title "Windows と Mac OS X 間での シフト JIS コード非互換文字一覧" translates as "List of Shift JIS code incompatibilities between Windows and Mac OS"). On page 3 you'll see an entry for the Shift JIS code point 0x876F, which corresponds to the "mm" character that caused your bot to fail at youmeishi.com. In case your PDF reader is incompatible with Japanese, here's a screenshot of the relevant section. I've already provided you with a screenshot of the "View Source" window for this page in Firefox, where this character is displayed correctly.
The inability of your bot to successfully convert these code points into Unicode is not a good enough reason for defaulting to Windows 1252. Especially for pages that identify themselves as using non-Roman character sets. -- Sakurambo 桜ん坊 15:07, 7 February 2008 (UTC)
Hey, come on. You have tried to convert the text using Shift JIS yourself, haven't you? It failed, didn't it? Isn't that proof that the text is not Shift JIS-compliant? Then what are you saying? I know that the mm symbol is part of the Shift JIS set... Did I ever say that it wasn't part of it? No! I really don't get why you're writing this here... NicDumZ ~ 15:18, 7 February 2008 (UTC)
*Sigh*
No, it doesn't prove anything of the sort.
It proves that some Shift JIS encoded pages are difficult to convert into Unicode. That's all. -- Sakurambo 桜ん坊 15:24, 7 February 2008 (UTC)
Well, why would it be difficult? If each byte sequence is in the Shift JIS table, hence has a Unicode equivalent, there is no problem. unicode() is basically mapping this byte sequence to the corresponding Unicode codepoint. The only reason it fails is that somewhere, there is a byte sequence that IS NOT Shift JIS-compliant; remember the error message: UnicodeDecodeError: 'shift_jis' codec can't decode bytes in position 19563-19564: illegal multibyte sequence. ILLEGAL MULTIBYTE SEQUENCE. Get it?
It seems that A) you didn't really read what I wrote above about charsets or B) You're showing some bad faith here.
NicDumZ ~ 15:34, 7 February 2008 (UTC)
Wow, it really is difficult getting through to you, isn't it?
I've provided you with ample evidence that the code point 0x876F really does exist in Shift JIS. Yet again, you've missed the point. Characters in this range are frequently used in Japanese web pages, and are handled quite happily by Japanese web browsers.
You are continuing to make this assertion that the inability of your bot to successfully convert between these encodings somehow "proves" that these pages are invalid and would be better off being converted using Windows-1252. That is a ridiculous position to take. All your error messages prove is that the Python character encoding library is inadequate in some cases. Shall I put that in capitals for you? IT'S INADEQUATE. IN SOME CASES.
Take another look at the PDF file I linked to. You'll note that the "Windows" and "Mac OS X" columns often specify different Unicode values. Sometimes these values are absent. Python's inability to process these codes correctly is not a sufficient reason for churning out garbage in Windows-1252. -- Sakurambo 桜ん坊 16:06, 7 February 2008 (UTC)
Python's shift jis did not contain that character. shift jis 2004 does, so I now try shift jis, then shift jis 2004, then cp932. NicDumZ ~ 09:13, 8 February 2008 (UTC)
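The cascade NicDumZ describes can be sketched like this (Python 3 syntax; the codec order is the one stated above, but the helper function is my own illustrative placeholder):

```python
def decode_japanese(raw):
    """Try progressively larger Shift JIS variants, in the order above."""
    for codec in ("shift_jis", "shift_jis_2004", "cp932"):
        try:
            return raw.decode(codec), codec
        except UnicodeDecodeError:
            continue
    return None, None

# 0x876F (the ㎜ character from youmeishi.com) is rejected by plain
# shift_jis, but accepted by the extended variants:
text, codec = decode_japanese(b"\x87\x6f")
```

Because each variant is a superset of the previous one for the characters at issue here, falling through the list only widens the table being consulted.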
Apparently GBK nicely extends GB 2312. I've just tried replacing GB2312 by GBK, and it appears to work. I don't know if there's any similar solution for Shift JIS... NicDumZ ~ 14:15, 7 February 2008 (UTC)
I'm now using Code page 932 to extend Shift JIS. It works for this link (), but there might be other problems... NicDumZ ~ 14:36, 7 February 2008 (UTC)
GBK is a superset of GB2312, so it should be safe to use. For Shift JIS, you could try using Windows-31J (Code page 932), which includes this character. But the point I'm trying to make is this: if your bot is unable to work with the information it's given, then it should do nothing instead of generating garbage for other people to clear up. If your bot really has encountered lots of "pages that were declaring a charset while they were actually using another charset", then please provide some examples. -- Sakurambo 桜ん坊 15:07, 7 February 2008 (UTC)

Idea for your bot

Perhaps you could set up a requests page where people could post articles they would like the bot to fix the references. I was trying to get your bot to have a go at February 2008 tornado outbreak, but there doesn't seem to be a way to add requests. Great job on the bot btw! Cheers, JACOPLANE • 2008-02-6 17:26

It's working through the database and will in time fix the bare references in all articles. I may set up something on the Toolserver that will operate similarly to my other tools (don't actually edit) if NicDumZ thinks it's a good idea. — Dispenser 19:49, 6 February 2008 (UTC)
Ah, come on Dispenser, you have proved many times that you have very good ideas. If you think that something has to be done, just do it :)
An external tool might be an efficient intermediate way to process bare references that have been added after the last dump, or after my last pass. Actually, DumZiBoT will process every bare reference in the meantime, but that might take some time...
NicDumZ ~ 09:33, 7 February 2008 (UTC)
OK, I needed to change the main() function around and add a stub function to web wikipedia.py, but it works. Sort of; I still need to get the edit form into shape. — Dispenser 21:06, 7 February 2008 (UTC)

"<!-- Bot generated title -->" unnecessary

Each reference that is modified gets "<!-- Bot generated title -->" added to it; that's about 29 bytes per reference modified. It may not seem like much, but that can add up quickly. Shouldn't that kind of comment just be put into the edit summary? Gh5046 (talk) 20:35, 6 February 2008 (UTC)

Oh, and I forgot to say, thanks for creating this bot. It's very helpful. Gh5046 (talk) 20:36, 6 February 2008 (UTC)

Well, automatically retrieved titles might be nonsensical, and some editors might not understand why without having to dig deep into the history of the article; that's why I add this comment: if no one actually catches a diff like this, then weeks later someone finding it can easily know that the title was inserted by a bot, and easily understand that DumZiBoT was mistaken. It also allows easy bug reports! (This diff was actually reported just above.)
NicDumZ ~ 09:40, 7 February 2008 (UTC)
WP:AWB could munch down on these comments like a kid with a bag of cookies. :-) It might, however, be useful to put the name of the bot in the comment so that an edit by DumZiBoT could be distinguished from other bot edits: "<!-- DumZiBoT: Bot generated title -->". --Jtir (talk) 17:15, 7 February 2008 (UTC)

Untitled Document

http://en.wikipedia.org/search/?title=Pointy_hat&diff=189487992&oldid=182653326

Look at the diff line around "Gomer". Is "Untitled Document" more useful than the bare URL? --Damian Yerrick (talk | stalk) 21:09, 6 February 2008 (UTC)

Fixed here. I don't know why DumZiBoT didn't convert this link. --Jtir (talk) 21:31, 6 February 2008 (UTC)
Thanks for the fix; yet I need to fix my code. I was thinking about adding an exception directly to the title blacklist. What do you think, Dispenser? Actually, the title blacklist is intended for inaccessible links, and adding an exception for an untitled page/document might seem messy, but that'd work :)
NicDumZ ~ 09:50, 7 February 2008 (UTC)
Maybe you don't. While this case has some merit, ISTM, that once DumZiBoT has done the conversions, an editor should review them and make any further changes. A compromise might be to include both the URL and the title in the link name: http://www.editionhutter.de/german.htm — Untitled Document. --Jtir (talk) 16:32, 7 February 2008 (UTC)
I honestly think that the URL blacklist is more of a hack, as the site might give valid titles sometimes. The title blacklist is more refined and allows specific variations to be covered. Ultimately we implemented it to improve the quality of the titles produced by the bot, and a few days ago I thought about adding this, but doing a google:allintitle:untitled search shows that I was too broad in the matching. I now recommend using untitled *(document|page|$). — Dispenser 21:39, 7 February 2008 (UTC)

converting multiple bare links in one reference

In Meishi, DumZiBoT did not convert three bare exlinks in one of the references.

<ref>See, e.g., http://www.adobe.com/jp/special/creativesuite/portal/guides/cs2_01_52.html, http://www.washiya.com/shop/namecard/index.html, http://www.kenseido.co.jp/shop/kps/namecard.html</ref>

--Jtir (talk) 21:13, 6 February 2008 (UTC)

Well, you read the FAQ :þ
I actually don't convert links with text around, I just convert references made of one link.
This could be some work for DumZiBoT2, along with processing some external links (those contained in an External links section).
NicDumZ ~ 09:46, 7 February 2008 (UTC)
OK. The first sentence of the documentation is misleading then. Maybe it should say something like: "He is converting single bare external links in references …". Converting exlinks in other contexts would be a nice future enhancement. --Jtir (talk) 14:53, 7 February 2008 (UTC)

Suggestion for determining web site name

Hi—First, kudos on a most excellent bot. I was reading your discussion with Dispenser about filling in more of the parameters of template:cite web, and I have a suggestion. The basic idea is to slog through a dump examining occurrences of template:cite web, and correlating the values for the url= and work= parameters. For instance, if 99% of the time, url values with a prefix of http://nytimes.com/ co-occur with work=The New York Times, then you can reliably add the latter to the references you generate for similar urls. You can build up a dictionary of these relationships in a first pass of the bot (or with a separate script). Make sense? —johndburger 01:41, 7 February 2008 (UTC)
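johndburger's idea could be prototyped with a couple of regular expressions over the dump text. A rough, hypothetical sketch (real wikitext template parsing is much messier than this; the function and sample data are mine):

```python
import re
from collections import Counter, defaultdict
from urllib.parse import urlparse

CITE_RE = re.compile(r"\{\{cite web\s*\|([^}]*)\}\}", re.IGNORECASE)

def site_names(wikitext):
    """Map each domain to a Counter of the work= values seen with it."""
    stats = defaultdict(Counter)
    for params in CITE_RE.findall(wikitext):
        fields = dict(
            p.split("=", 1)
            for p in (s.strip() for s in params.split("|"))
            if "=" in p
        )
        url, work = fields.get("url"), fields.get("work")
        if url and work:
            stats[urlparse(url.strip()).netloc][work.strip()] += 1
    return stats

sample = (
    "{{cite web |url=http://nytimes.com/a |work=The New York Times}}\n"
    "{{cite web |url=http://nytimes.com/b |work=The New York Times}}"
)
stats = site_names(sample)
```

Once such a table is built from a dump, the bot could fill in work= for any new bare URL whose domain correlates overwhelmingly with one site name, exactly as suggested above.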

But WP says we should not push the optional use of citation templates - see my earlier request. When I get some time I'm looking to investigate/challenge this, as I believe there is a greater likelihood of getting higher-quality reference info using the templates, and hence educating people about their existence. -- John (Daytona2 · Talk · Contribs) 12:08, 7 February 2008 (UTC)
What WP:CITE says is Because templates are optional and can be contentious, editors should not change an article with a distinctive citation format to another without gaining consensus. If taken literally, that suggests that the bot should not change anything in an article full of nothing but bare links—it already has a "distinctive citation format". But most articles with bare links are, in fact, a mix of formats. If there are any instance of the template:cite family in an article, I think you could make the argument that it's perfectly reasonable to change a bare link to cite web.
But, in fact, my suggestion is actually independent of how the bot inserts the reference—I should have made that more clear. Whether DumZiBoT uses a template, or raw wikimarkup, it can still add the name of the web site in many cases using the approach I described above. —johndburger 01:09, 8 February 2008 (UTC)
I had wanted to add PDF conversion and remembered that I had seen it once somewhere. A quick grep in pywikipedia came up with the old m:standardize_notes.py, which uses {{ref}} templates instead of the newer m:cite.php system. Because of the changing of the templates, it's been in a quasi-block on the en. The script itself does a lot of things. From the quick glance I've taken at the source, it doesn't do as many checks with the titling as reflinks.py does. However, it does do the news cite referencing. In any case it was a good source of code for parsing titles from PDF files. — Dispenser 05:51, 8 February 2008 (UTC)
I just tried adding that feature, using the code from standardize_notes. But apparently it just doesn't work: I can't find a PDF that gives me a title with that code?! NicDumZ ~ 07:31, 8 February 2008 (UTC)
Subprocess doesn't seem to accept streams, only files. Either write to a temp file or reopen using urlretrieve (hackish) like it does in the program. — Dispenser 08:03, 8 February 2008 (UTC)

How do I request that your bot visit a page?

Regulation of acupuncture, as well as acupuncture, could use his talents... again, thank you, very nice work! best regards, Jim Butler (t) 05:46, 7 February 2008 (UTC)

Processed both articles. Eventually, DumZiBoT will fix every bare reference; it just takes time.
NicDumZ ~ 10:39, 7 February 2008 (UTC)
Super, thanks again --Jim Butler (t) 08:44, 8 February 2008 (UTC)

Request for DumZiBoT2

Would it be hard to make a bot that consolidated references with <cite name=X>? Just making a suggestion! --Adoniscik (talk) 06:43, 7 February 2008 (UTC)

erm... I don't know how that tag works, actually. Some documentation might help :þ
Also, how would you retrieve the "X" value ?
Thanks for trying to help,
NicDumZ ~ 09:53, 7 February 2008 (UTC)
Possibly means one ref definition where you give it a name <ref name=BBC080207> which you then refer to for any other occurrences using only <ref name=BBC080207 />? Misplaced Pages:Footnotes#Naming_a_ref_tag_so_it_can_be_used_more_than_once. It would be a sensible addition, although you need to avoid name conflicts. I use dates - SourceYYMMDD. Cheers -- John (Daytona2 · Talk · Contribs) 12:03, 7 February 2008 (UTC)
Ah, okay! You used "cite" instead of "ref" in your first message, so I was lost.
Seems a bit complex to do, but that's really a good idea :)
NicDumZ ~ 12:06, 7 February 2008 (UTC)
Correct...I meant citations with the <ref> tag. Sorry, I was typing late at night. --Adoniscik (talk) 15:42, 7 February 2008 (UTC)
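A rough sketch of what such a DumZiBoT2 pass might do: find byte-identical <ref>…</ref> bodies, name the first occurrence, and collapse the repeats (hypothetical code — a real bot would need a smarter naming scheme and conflict checks, as Daytona2 notes above):

```python
import re

REF_RE = re.compile(r"<ref>(.*?)</ref>", re.DOTALL)

def consolidate_refs(text):
    seen = {}       # ref body -> assigned name
    counter = [0]

    def repl(m):
        body = m.group(1)
        if body in seen:
            # Repeat occurrence: collapse to a self-closing named ref
            return '<ref name="%s" />' % seen[body]
        counter[0] += 1
        name = "auto%d" % counter[0]    # placeholder naming scheme
        seen[body] = name
        return '<ref name="%s">%s</ref>' % (name, body)

    return REF_RE.sub(repl, text)

before = "A.<ref>http://example.org</ref> B.<ref>http://example.org</ref>"
after = consolidate_refs(before)
```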

Again strange edits

What is this ? Please fix the handling of non latin scripts before running the bot again. --jergen (talk) 08:27, 7 February 2008 (UTC)

The source page of http://www.chinascout.org/ claims to be encoded in GB 2312. However, on line 1726, there is the character "深", encoded C389, which is not part of GB 2312: http://demo.icu-project.org/icu-bin/convexp?conv=ibm-1383_P110-1999&b=C3&s=ALL#layout
In other words: this page is not encoded properly, hence DumZiBoT is not able to convert it into Unicode using GB 2312. It then tries without success to use ASCII, then UTF-8, and eventually defaults to windows 1252, which renders these ugly characters. This is not a bug of DumZiBoT.
NicDumZ ~ 10:26, 7 February 2008 (UTC)
This is a bug, and you really should do something about it. For your information, the "illegal" character in this case is not 0xC389 but 0x89C3. It may not appear in all GB2312 standards, but it renders perfectly well in my browser. The equivalent Unicode glyph is 0x5169, as you can see here.
You have absolutely no justification for using windows 1252 encoding to convert a page that explicitly declares itself to be GB 2312. Your bot is broken. Please fix it. -- Sakurambo 桜ん坊 12:06, 7 February 2008 (UTC)
Give me a break. This character is not a standard GB 2312 character, I have no reason to support it. NicDumZ ~ 12:15, 7 February 2008 (UTC)
I'm not saying your bot has to support these characters. It's quite obvious that it doesn't support them.
I'm just saying it should stop trying to interpret them as windows 1252 without any justification.
Is that really so difficult to understand? -- Sakurambo 桜ん坊 12:27, 7 February 2008 (UTC)
I'm just saying that I will not support these characters. Is that really so difficult to understand?
  • Using windows 1252 is a way to convert every character to some printable character, avoiding inserting junk control characters into articles.
  • Also, windows 1252 is a common American/European charset that works well when no special characters are in the string: if, for some reason, the title were made of standard unaccented Latin characters, the conversion would have worked.
  • Finally, a lot of Windows-made web pages use windows 1252 as a charset without specifying it in the meta tags.
That makes three very good reasons to use it, and three very good reasons for you to move along.
NicDumZ ~ 12:36, 7 February 2008 (UTC)
Fixed: when GB 2312 handling fails, I now try GBK. NicDumZ ~ 14:38, 7 February 2008 (UTC)
And when that fails? When invalid characters are found, why not just ignore the character (which was nowhere near the title tag) and turn it into a question mark or � instead of assuming that it's lying about the encoding and falling back on windows-1252? Why even attempt to convert anything outside the title tag? —Random832 19:30, 7 February 2008 (UTC)
Because, again, a lot of pages specify the wrong encoding. Consider a wrongly encoded file: if I only convert the title part into Unicode, from a statistical point of view, chances are that I will be able to convert it without raising any error. Codepoints differ from one charset to another, but I might still be able to convert the title using the specified charset; the title won't make any sense, because I converted it using the wrong charset, but still, I would think that I had converted the document well. If instead I try to convert the whole document using a wrong charset, I'm more likely to raise an error by encountering a bad character that has no correspondence in the charset, hence I have a better chance of detecting a wrong charset.
NicDumZ ~ 20:38, 7 February 2008 (UTC)
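The statistical argument is concrete: a short, mostly-ASCII <title> will often decode without error under the wrong charset, while decoding the whole page is far more likely to hit an illegal sequence and expose the mismatch (a contrived illustration in Python 3; the sample page is mine):

```python
page = ("<html><head><title>Example</title></head>"
        "<body>名刺用紙</body></html>").encode("shift_jis")

title_part = page[:40]     # ASCII-only prefix containing the <title>

# Decoding only the title with the *wrong* charset raises no error,
# so a bad encoding declaration would slip through unnoticed...
title_part.decode("utf-8")

# ...whereas decoding the whole document exposes the mismatch:
try:
    page.decode("utf-8")
    mismatch_detected = False
except UnicodeDecodeError:
    mismatch_detected = True
```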
Please excuse my kibitzing, but I think the (most excellent) bot needs to be extremely resistant to screwed up content, of which there is a lot on the web. I think your "transcode the whole document" heuristic is a very good idea, but I'd suggest that if there is any evidence that the title may not be correctly extracted and transcoded, the bot should bail and not put the title in the reference. —johndburger 01:24, 8 February 2008 (UTC)