Misplaced Pages talk:Link rot: Difference between revisions


Revision as of 23:01, 21 August 2013 by Ark25 (talk | contribs): →Archive.is: Archive.is is very nice
Revision as of 18:59, 25 August 2013 by Lexein (talk | contribs): →Archive.is: followup
:: I am sorry, my bad. I didn't know that Archive.is is making incremental backups and that it started to create backups on all Wikipedias in May-June 2013 - see ] and ]. Sorry for the false alarm! — Ark25 (talk) 22:52, 21 August 2013 (UTC)
:: I think Archive.is is very nice, for making automatic backups for all links in Misplaced Pages. It really deserves to be integrated as a WikiMedia project or at least it deserves to be paid by Misplaced Pages. It's very important to preserve the archives of the newspapers. — Ark25 (talk) 23:00, 21 August 2013 (UTC)
			:::The proprietor of ], ], assures us here and on its FAQ page that it is financially secure. However '']'' has stated on its home page that it will be in financial trouble later this year. This has become a topic of discussion here:]. --] (]) 18:59, 25 August 2013 (UTC)

== Newspaper websites which have undergone link format changes ==

Revision as of 18:59, 25 August 2013

This is the talk page for discussing improvements to the Link rot page.

Put new text under old text. Click here to start a new topic.
New to Misplaced Pages? Welcome! Learn to edit; get help.

Archives: 1, 2, 3, 4, 5

Misplaced Pages essays Top‑impact

	This page is within the scope of WikiProject Misplaced Pages essays, a collaborative effort to organize and monitor the impact of Misplaced Pages essays. If you would like to participate, please visit the project page, where you can join the discussion. For a listing of essays see the essay directory.
Top	This page has been rated as Top-impact on the project's impact scale.
	The above rating was automatically assessed using data on pageviews, watchers, and incoming links.

Misplaced Pages Help B‑class Mid‑importance

	This page is within the scope of the Misplaced Pages Help Project, a collaborative effort to improve Misplaced Pages's help documentation for readers and contributors. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks. To browse help related resources see the Help Menu or Help Directory. Or ask for help on your talk page and a volunteer will visit you there.
B	This page does not require a rating on the project's quality scale.
Mid	This page has been rated as Mid-importance on the project's importance scale.

Archives

Archive 1 (2005-2007)
Archive 2 (2008-2009)

Internet Archive

The Internet Archive doesn't seem to have archived anything since about August 2008. What does this mean for dead links that should have been archived since then? AnemoneProjectors (talk) 14:36, 23 January 2010 (UTC)

See Wayback_Machine#Growth_and_storage: "Snapshots become available 6 to 18 months after they are archived." -- Quiddity (talk) 20:55, 23 January 2010 (UTC)

The Wayback Machine and IA

This has been talked about over there and a huge number of websites are now tagging their sites as "do not archive" which the IA respects. Thus it no longer indexes any information and removes the site from being seen, from the time of the first archive. — Preceding unsigned comment added by 99.180.245.237 (talk) 21:20, 16 January 2013 (UTC)

It is true that almost every news outlet now blocks indexing by Internet Archive, but that is understandable as their sites feature advertising and often a pay-for-archive-view, meaning that free access reduces their potential revenue in a market where the margins are quite thin. Some sites do still have extensive free archives, like the Arizona Daily Star which I included in a table below, which has free archives stretching back to at least 2006. --User:Ceyockey (talk to me) 00:49, 17 January 2013 (UTC)

removing a dead link?

if I find a dead link and I don't feel like fixing it is it cool to remove it, esp if I think the claim it supported was kinda retarded anyway? --n-dimensional §кakkl€ 18:52, 26 January 2010 (UTC)

uh... not with reasoning like that, no. 'kinda retarded' does not qualify as an objective assessment of the merits of the link, since other editors can easily say 'it aint so retarded' - an equally valid statement without any further evidence. if you're just fixing linkrot, fix the link or flag it for others; if you want to get involved with content editing (to remove 'retarded' content) go ahead and do it explicitly as an edit; don't call it a linkrot fix. --Ludwigs2 20:25, 26 January 2010 (UTC)

Tag it with a {{deadlink}}.--Blargh29 (talk) 22:17, 26 January 2010 (UTC)

Dead link vs. linkrot

What's the difference, exactly? 85.76.80.10 (talk) 20:56, 30 January 2010 (UTC)

There isn't one. "Linkrot" is a term used to describe the phenomenon of good links going dead over time. --ThaddeusB (talk) 04:40, 4 February 2010 (UTC)

Linkrot and sustained notability

If an article is at first supported by a series of links to establish notability, and 100% of those links go bad, does that mean that in some cases the subject of the article can be considered not notable and the article be deleted? Sebwite (talk) 16:21, 18 February 2010 (UTC)

It shouldn't happen. Notability is forever, even if all of the links go bad. That's why it's probably a good idea to use a citation template, so that there is plenty of documentation about the former link. Also, check out WP:OFFLINE, which would apply to dead links. --Blargh29 (talk) 19:16, 18 February 2010 (UTC)

I have seen articles get put up for deletion on the basis that all the links have gone bad, and the noms use the WP:PROVEIT argument to support their cause, while those who support keeping cannot prove it. Those favoring deletion do not buy the WP:OFFLINE argument in these cases. Sebwite (talk) 15:23, 23 February 2010 (UTC)

Editors delete content all the time based on dead links. The Orwellian memory hole lives, and it lives here in the Misplaced Pages. I think it is a huge problem. --Marcwiki9 (talk) 23:18, 4 March 2010 (UTC)

Archiving British web pages

The following Wired article explores some of the problems regarding archiving British web pages: Archiving Britain's web: The legal nightmare explored These problems affect the strategy used here. Squideshi (talk) 19:23, 6 March 2010 (UTC)

Does it though? A web archiving service acts on the laws of its resident country, not on those of the site it is archiving (as I understand the law). So archive.org and WebCite are fine, and that is their concern anyway; only if "we" (the WMF) were to set up our own archiving server in the UK would "we" be affected (as I understand it). - Jarry1250 11:22, 7 March 2010 (UTC)

That is not how I understand it. In fact, the article itself mentions that this is a problem for organizations like the Internet Archive, which hosts the Wayback Machine. It affects us because, as part of our strategy, we specifically recommend using tools, such as the Wayback Machine, which are affected by this law. I'm not asking for a change in the article--I just wanted to make people aware that the Wayback Machine isn't a magic bullet in the effort to help stave off linkrot. Squideshi (talk) 21:28, 8 March 2010 (UTC)

External links are not references

Just to explain my recent changes:

You should (almost) never remove this:

==References==
* Long dead reference

You should cheerfully remove this:

==External links==
* Calculator that you'd think was cool, except it no longer exists

It is not possible to justify a dead "External link" under the External links guideline. WhatamIdoing (talk) 18:32, 18 March 2010 (UTC)

You are correct. However, caution should certainly be used since inexperienced users often put stuff they've used as a reference under "External links". --ThaddeusB (talk) 20:30, 29 March 2010 (UTC)

Besides this, it seems to me that an archived copy of an external link may well be a good replacement for the original (as it is for a reference), so linking to such a copy (if available) is preferable to simply removing the link. JudahH (talk) 16:30, 3 January 2012 (UTC)

linkrot vs. stability, e.g. News Corp vs. Fairfax in Australia

I've noticed that links to many articles published by News Corp in Australia are especially susceptible to linkrot, whereas links to articles in the Fairfax papers, The Age and the SMH, are quite solid. If there were enough evidence to support my statement, would WP ever have a guideline such as "Use paper X, Y, Z, if possible, instead of P, Q, as these are less susceptible to linkrot?" cojoco (talk) 20:52, 1 April 2010 (UTC)

We do advise against using Yahoo news stories (which typically decay within weeks), so it is certainly possible. --ThaddeusB (talk) 02:01, 11 April 2010 (UTC)

Of course, nobody reads the directions, so I wouldn't get my hopes up, but you're certainly welcome to include the advice. WhatamIdoing (talk) 03:44, 11 April 2010 (UTC)

Archiving every reference?

Is it suggested that we should archive every reference used in our articles? I see there's a WebCiteBOT, but I've never seen it in action, and certainly not on any article I've worked on. I just recently lost a very important reference and I'm still trying to work on finding a fix (contacting the editors, etc.). This was a great lesson to me about link rot, but now I'm wondering if I'm supposed to archive every reference I use? – Kerαunoςcopia_galaxies 20:48, 9 June 2010 (UTC)

Quite simply put WebCite cannot handle the volume that Misplaced Pages provides, even the small run of 10-50 PDFs a night by Checklinks seems to be contributing to the problem. — Dispenser 22:15, 9 June 2010 (UTC)

That I suppose would explain the bot, but what about manual submissions to the archive? Should I just archive references as I see fit? WayBack's six-month lag seems to be a bit of a long wait considering some website pages disappear in only a few weeks. – Kerαunoςcopia_galaxies 22:18, 9 June 2010 (UTC)

Is this still the case? It could affect issue PYWP-18. — Jeff G. ツ (talk) 04:24, 29 February 2012 (UTC)

Impossible archiving

Some cited sources use various forms of presentation, including streaming audio (sometimes integrated within a written interview), streaming video, and, especially in the case of Billboard's website, flash or some similar method of loading articles. These sites can't be archived at all. Without transcripts published elsewhere, these sites seem to me to be absolutely vulnerable to link rot. – Kerαunoςcopia_galaxies 19:04, 12 June 2010 (UTC)

dafuq? impossible to archive a .mov file? or a .mp4 file? or a .swf file? this is usually no problem... (although most search engines CHOOSE not to do it, but its entirely optional.) 88.88.102.239 (talk) 21:13, 2 May 2012 (UTC)

Most media embeds are not simple .mov/.mp4 references and usually an archival crawler might only be able to grab the .swf that is doing the embedding. It doesn't run the .swf, so it can't grab the video stream itself (which may not even be over HTTP). HTML5 has the potential to improve things somewhat, but progress is slow, and there are still lots of other tricky cases that are difficult to archive. AndyJ 14:54, 2 January 2013 (UTC)

Link Rescue Bots

Two new bots have just been approved to find archives for dead links. User:DASHBot, the first one, is written and operated by User:Tim1357. It has gone through all the featured articles, and has made a large dent in the good articles. However, due to some small technical difficulties, it is down for the moment. User:H3llBot is written and operated by User:H3llkn0wz. It does pretty much the same thing. As the two bots finish up the Featured articles and the Good articles I think we will do articles by request. Any ideas of which articles we could let the bots run on next? (Categories are good) Tim1357 17:12, 15 June 2010 (UTC)

I'd say A-Class articles and then all Vital articles that are B-class and below, that is if the bot is able to make that distinction. -- œ 02:25, 21 July 2010 (UTC)

blogs.nzherald.co.nz

URLs http://blogs.nzherald.co.nz will cease 301 redirecting to URLs on http://www.nzherald.co.nz shortly. Checking my logs I note that a few articles have references/links to articles on blogs.nzherald.co.nz ( such as Gordon Ramsay ). These should be updated as soon as possible. The equivalent articles should still exist but will be harder to find after the redirect is gone. Could somebody please inform a bot operator. I have no idea how many links are in place. - NZH Admin —Preceding unsigned comment added by 203.99.66.20 (talk) 03:46, 10 August 2010 (UTC)

Web Link Checking Bot

Hi, I'm currently running a bot on my server against Misplaced Pages to check the external links, using pywikipediabot and the included weblinkchecker.py script. What this bot does is scan the contents of articles for external links, and then proceeds to check the links for 404s or timeouts, and creates a datafile of the non-working links. After about one week, the bot will then recheck the links, and report on the talk pages of the articles which links are dead, according to the data that the bot collected. In the report submitted, the bot will automagically suggest a link to archive.org, which if it was caught, should be a valid archived version of the link. The reason for my post here is to request input from the community, per the suggestion of Tim1357 in this thread. I am watching both this page, and the BRFA thread, so commenting at either location is ok, and your input is greatly appreciated. Thanks, Phuzion (talk) 14:34, 17 August 2010 (UTC)
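The two-pass check described above can be sketched roughly as follows. This is only an illustration and not the actual weblinkchecker.py code: the function names (classify, confirmed_dead, wayback_suggestion), the status labels and the injectable opener are all made up for the example, and the suggested archive link assumes the Wayback Machine's "list all captures" address of the form web.archive.org/web/*/URL.

```python
import socket
import urllib.error
import urllib.request


def classify(url, opener=urllib.request.urlopen, timeout=10.0):
    """Return "ok", "404", "timeout" or "error" for one external link."""
    try:
        with opener(url, timeout=timeout) as response:
            return "ok" if response.status < 400 else "error"
    except urllib.error.HTTPError as exc:
        return "404" if exc.code == 404 else "error"
    except (socket.timeout, TimeoutError):
        return "timeout"
    except urllib.error.URLError as exc:
        # urlopen often wraps a timeout inside URLError.reason
        if isinstance(exc.reason, (socket.timeout, TimeoutError)):
            return "timeout"
        return "error"
    except OSError:
        return "error"


def confirmed_dead(first_pass, second_pass):
    """Return URLs that failed in both passes (dicts of url -> status)."""
    return sorted(
        url for url, status in first_pass.items()
        if status != "ok" and second_pass.get(url) not in (None, "ok")
    )


def wayback_suggestion(url):
    """Build a link to the Wayback Machine's list of captures for a URL."""
    return "https://web.archive.org/web/*/" + url
```

Only links that fail in the first pass and fail again in the recheck are reported, so a short outage does not produce a talk page notice.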

On dewiki we decided that a delay of at least 4 weeks and 3 tests are required, because many links come back online 2-3 weeks after a change of hosting service. But the script on repository has some bugs you should care about. You could test the script on this page:

which report errors on all four links above. Merl issimo 16:13, 17 August 2010 (UTC)
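The dewiki rule mentioned above (at least 3 tests over at least 4 weeks) could be expressed like this. It is a sketch under one possible reading of that rule, namely that the 3 failed tests must be consecutive and the first and last of them at least 4 weeks apart; the names should_report, MIN_FAILED_CHECKS and MIN_SPAN are invented for the example and do not come from any existing bot.

```python
from datetime import date, timedelta

MIN_FAILED_CHECKS = 3            # assumed reading of "3 tests"
MIN_SPAN = timedelta(weeks=4)    # assumed reading of "4 weeks delay"


def should_report(check_log):
    """Decide whether a link may be reported as dead.

    check_log is a list of (date, ok) tuples sorted by date, one per test.
    """
    # Collect the most recent uninterrupted run of failed tests.
    streak = []
    for day, ok in reversed(check_log):
        if ok:
            break
        streak.append(day)
    if len(streak) < MIN_FAILED_CHECKS:
        return False
    # streak[0] is the latest failure, streak[-1] the earliest in the run.
    return streak[0] - streak[-1] >= MIN_SPAN
```

A single successful test resets the run, which matches the observation that many links come back after a hosting change.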

Thanks for the input. Do you know if there is an updated version of the script that has the bugs fixed? Phuzion (talk) 16:45, 17 August 2010 (UTC)

No, I never used this script. I only know the response from dewiki where we have a template which can be used by users for marking failed dead link bot reports. Merl issimo 17:25, 17 August 2010 (UTC)

What bugs are meant by "the script on repository has some bugs you should care about."? — Jeff G. ツ (talk) 04:57, 17 January 2012 (UTC)

How can I help? I'm interested in helping with any automated deadlink detection/mitigation. Since archive.org stopped archiving as of late 2008, checking it is necessary, but not sufficient. Automated checking of, and pre-emptive archiving with, Webcitation is needed, IMHO (or other service, especially for pages poorly captured by Webcitation - conditionals, Javascript, AJAX, etc have problems). I'm in favor of an on-demand full-rendered-web-page screengrab service, or an as-rendered-html+CSS-only service if one exists - these seem to be the only way to simultaneously guarantee pixel accuracy and actual content presence. Of course, respecting robots.txt. --Lexein (talk) 01:35, 12 September 2010 (UTC)

We mostly need people filling out references. Currently Reflinks is probably the best in filling out references, but I haven't updated it with the feedback/learning mechanisms and the WebCite interface is a bit hard to use. You can also use Checklinks to semi-automatically fix links. — Dispenser 22:37, 12 September 2010 (UTC)

I know and use those tools frequently, but I would certainly participate in revising and betatesting semi-auto tools which help as well. --Lexein (talk) 23:18, 13 September 2010 (UTC)

I have a proposal in for such a bot, and could use some responses at m:Talk:Pywikipediabot/weblinkchecker.py##Questions_from_BRFAs_and_elsewhere_on_English_Wikipedia. — Jeff G. ツ 03:02, 23 March 2011 (UTC)

My request for responses linked above has moved here. — Jeff G. ツ (talk) 20:57, 21 January 2012 (UTC)

Solution against the broken external_links: backup the Internet

Please find the concept description on the Village Pump. JackPotte (talk) 09:53, 3 September 2010 (UTC)

2013 update: NSA does this, and email, IMs, texts, VOIP and phone calls now. New 'pedia: NSApedia.gov. Soon all us editors will be out of work. --Lexein (talk) 02:45, 18 August 2013 (UTC)

Marking a dead link within a citation template

How is one to mark a dead link within a citation template, e.g.:

"Gujrat Police official website, Standard Operating Procedures" (PDF). Retrieved 2009-03-08.

I did a hack by adding |publisher={{Dead link}} into the template, but that may not be the preferred way to do this. __meco (talk) 16:33, 5 September 2010 (UTC)

It's better not to do so, but rather follow the }} with {{dead link|date=August 2010}}.

"Gujrat Police official website, Standard Operating Procedures" (PDF). Retrieved 2009-03-08.

Yes, it seems to look odd, but I believe it's best practice for "deadlink" to always appear as the last text on a citation or link line. Of course, make an attempt to repair with Checklinks, too... --Lexein (talk) 18:51, 5 September 2010 (UTC)
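For anyone doing this with a script, the placement rule above (tag goes directly after the closing }} of the citation) might look like the sketch below. The function name tag_dead_link is hypothetical, and the code assumes the URL sits directly inside the citation template rather than inside a template nested within it.

```python
def tag_dead_link(wikitext, url, month_year):
    """Append {{dead link|date=...}} after the citation template holding url."""
    pos = wikitext.find(url)
    if pos == -1:
        return wikitext
    tag = "{{dead link|date=%s}}" % month_year
    depth = 1  # assumption: the URL is one level deep inside the template
    i = pos
    while i < len(wikitext) - 1:
        pair = wikitext[i:i + 2]
        if pair == "{{":
            depth += 1
            i += 2
        elif pair == "}}":
            depth -= 1
            i += 2
            if depth == 0:
                # Do nothing if the citation is already tagged.
                if wikitext[i:].lstrip().lower().startswith("{{dead link"):
                    return wikitext
                return wikitext[:i] + tag + wikitext[i:]
        else:
            i += 1
    return wikitext
```

Running it twice on the same text leaves a single tag, so the function is safe to re-apply.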

All links eventually go bad

I think that in the fullness of time, on geologic time scales, all links will go bad. This is simply because those who sponsor such web pages will ultimately die off. Web servers will be lost in fires and floods. Misplaced Pages administration needs to recognize this reality. The future expansion of Virtual Servers with NO PERSISTENT STATES will only make this worse. Please see Amazon Virtual Private Cloud. There are many Misplaced Pages editors who delete content that has a dead link, and use WP:proveit to make a point. Most editors are too lazy to go to the library to verify older information, and just delete things. It is hard to maintain "presumption of good faith" when undereducated editors are denying a lot of history. Look at this example: Misplaced Pages:Articles_for_deletion/Event_Driven_Language. We can see that Beeblebrox, by all accounts a good wikipedian, justified a delete because the Library was too far away. Misplaced Pages should not exist at the convenience of the editors, but should exist in the service of truth. Perhaps there can be some kind of "grandfathering" clause on links. Perhaps, I would suggest, that if a link exists for a long enough period of time, that the standard of proof should shift from the creators/maintainers to those who would delete. In other words, if the link was there for a number of years, and then it rotted, then the link would be "presumed valid" instead of the present case, where it seems to be presumed a fabrication of someone's imagination. This way, the content in Misplaced Pages could age gracefully, becoming more authoritative as it got older. This feels more proper to me. This would be a good alternative to the present case where good content is deleted willy nilly by those who would deny history, simply because it is hard to verify. — Preceding unsigned comment added by Marcwiki9 (talk • contribs) 03:30, 20 December 2010 (UTC)

You seem to have declared everyone's opinion on a single incident. The closing administrators should be experienced enough to separate valid reasons from invalid reasons. The content was not lost, it was merged. Verifiability is a principle of Misplaced Pages, and the reader cannot verify the material if the website rotted years ago. That's why we have this page. Given you posted here, is there an actual change/removal/addition you propose to this guideline? The "more authoritative as it gets older" will in my opinion not pass. — HELLKNOWZ ▎TALK 10:55, 20 December 2010 (UTC)

I don't mean to impugn everyone. What I am proposing is not a reduction in verifiability. Misplaced Pages must remain verifiable, of course. But the system we have now is that overzealous and undereducated editors will deny history, simply because the links have rotted. They are too lazy to verify content, so they delete it. They do it because "the library is 250 miles away", and they cannot just pop over there. I am making the suggestion that this is wrong and bad. Misplaced Pages ought to do something about the very long term problem of rotted links, because all links eventually will rot. WP:linkrot seems to show this as an accelerating problem. As links rot through distant time scales, under the present system, the whole of wikipedia will have to be slowly rewritten. I think this is revisionist history, and it is objectionable to me. It can lead to history being manipulated by those who control search engines. Of course, you all might think I'm wrong. Whatever. I intend it only as food for thought. I am not declaring everyone's opinion on a single incident. I see a pattern here of editors denying history and deleting content, simply because they see the verification as too much work. I see it all the time. It is as if the Orwellian memory hole lives. Editors will chuck all content without a valid link, even if the link was good in the past. They do this despite the wikipedia policies expressly forbidding it. --Marcwiki9 (talk) 02:51, 21 December 2010 (UTC)— Preceding unsigned comment added by Marcwiki9 (talk • contribs) 02:40, 21 December 2010 (UTC)

If their actions are against policy, then their edits should be reverted. If their good-faith edits are against policy or guidelines, then they should be educated. If they remove previously undisputed content because a link is bad, they should be informed not to do this. I don't see what solution you propose for the hyperbolic problems you are describing. Misplaced Pages has a strong bias towards electronic sourcing, because frankly websites are easy to access without driving 150 miles to the library. As far as actual record of history is concerned, there is much much written material elsewhere that doesn't "linkrot". — HELLKNOWZ ▎TALK 10:25, 21 December 2010 (UTC)

So, my thoughts are meaningless drivel? To be chucked into the ether? No, the problem is much worse than you are even able to comprehend. Your unshakable defense of the status quo blinds you to even see that there is a problem, much less forge a solution. You admit there is a bias, but yet, fail to point to any solution at all. And when one is put forward as food for thought, not a serious proposal, you dismiss it as hyperbole. And then you make the astonishing claim that Misplaced Pages doesn't matter, because the "actual record of history" lies elsewhere. I guess that Misplaced Pages will overcome all of these problems someday. I was just trying to help.--Marcwiki9 (talk) 00:52, 22 July 2011 (UTC)

It seems you have misinterpreted every sentence I said to the level of personal remarks. Personally, having run a bot that tags and replaces thousands of dead links, I do not see a need to explain my stance or motivation if my replies are misinterpreted anyway. — HELLKNOWZ ▎TALK 07:32, 22 July 2011 (UTC)

Solving link rot problem

We are working to solve the link rot problem here. We would like everybody to voice their concerns here. Thanks - Hydroxonium (H₃O) 14:25, 6 February 2011 (UTC)

Conflict between guidelines

This guideline and WP:DEADREF give conflicting advice about dealing with dead links used to support article content. Please join the conversation at WT:Citing sources#Question_regarding_.22Preventing_and_repairing_dead_links.22. WhatamIdoing (talk) 22:12, 17 February 2011 (UTC)

The lengthy conversation has closed, and I have updated the advice at WP:DEADREF. If anyone wants to check over this page and improve its contents, please feel free. WhatamIdoing (talk) 19:43, 28 March 2011 (UTC)

Proposal for new WikiProject to repair dead links

Just a notice for anyone who's interested. Misplaced Pages:WikiProject Council/Proposals/Dead Link Repair. -- œ 06:39, 20 April 2011 (UTC)

A new WebCiteBOT

Hi all. I'm working in a new WebCiteBOT. I have opened a request for approval. It is free software and written in Python. I hope we can work together on this. Archiving regards. emijrp (talk) 17:15, 21 April 2011 (UTC)

RfC to add dead url parameter for citations

A relevant RfC is in progress at Misplaced Pages:Requests for comment/Dead url parameter for citations. Your comments are welcome, thanks! — HELLKNOWZ ▎TALK 10:49, 21 May 2011 (UTC)

Simple answer

Use more print references...

Obvious really. Misplaced Pages is a joke if it leans too heavily on the web alone.--MacRusgail (talk) 16:32, 10 August 2011 (UTC)

If only more people were aware of the fact that references don't have to be online.. we should promote WP:Offline sources more.. -- œ 15:58, 16 August 2011 (UTC)

But they're so eeeeeeasy! But seriously, in practice, there's a balance to be struck. Some editors such as Cirt have created articles which are fantastically sourced, but completely offline, leaving out all convenience links. I don't know why; it may be due to the research tools he uses, which, though deep, are not at all accessible to non-subscribers. Very annoying.

Over at WP:AN/I I finally twigged to Bare link rot harms verifiability. Seems I don't care so much if a link rots if it has been properly, verifiably expanded. --Lexein (talk) 17:23, 16 August 2011 (UTC)

Extension:ArchiveLinks

http://www.mediawiki.org/Extension:ArchiveLinks

Is it possible to ask WMF to enable (maybe also finish) this wonderful extension? Bulwersator (talk) 10:20, 10 January 2012 (UTC)

Incompatibility with Misplaced Pages:Citing sources#Preventing and repairing dead links (even if that is linked here)

This page (Misplaced Pages:Link rot) states in its lead section that "These strategies should be implemented in accordance with Misplaced Pages:Citing sources#Preventing and repairing dead links, which describes the steps to take when a link cannot be repaired."

But how can we act in accordance with Misplaced Pages:Citing sources#Preventing and repairing dead links if some sentence in this page's lead section (for example "Do not delete factual information solely because the URL to the source does not work any longer. WP:Verifiability does not require that all information be supported by a working link, nor does it require the source to be published online.

Except for URLs in the External links section that have not been used to support any article content, do not delete a URL solely because the URL does not work any longer. Recovery and repair options and tools are available.") and the whole "Keeping dead links" section are incompatible with that page?

Does the explicit instruction to "implement in accordance with Misplaced Pages:Citing sources#Preventing and repairing dead links" mean that that page is predominant? --79.17.150.185 (talk) 22:25, 8 February 2012 (UTC)

I'm not seeing a direct incompatibility. (I'd like to wikify your comment to make readability easier, may I?)

As an encyclopedia, I think part of our mission is preservation. This means not to delete "dead links" just because they're dead. We also should not delete content just because a link goes dead. That's why we archive sources, and why we must, IMHO, always, as fast as possible, expand bare urls in inline citations so that if their links go dead, the title, date, publisher, and author still permit verification of claims.

As for harmonizing the text of various policy, guidelines, essays, and info pages, that's an important ongoing task. It's good to wait for editing flurries to die down before trying to harmonize text. And thanks for discussing before making radical changes, by the way. --Lexein (talk) 02:24, 17 September 2012 (UTC)

Archive.is

I think we should go slow on advocating http://archive.is. The field is littered with defunct archive sites - just look at this article history. Archive.is looks good, very good in fact, and its performance and coverage of essentially all used sources is very encouraging. But IMHO Misplaced Pages can't afford to depend on a brand new site which so far, discloses no public information about its funding, affiliation, or future. I have communicated with the owner, and I am confident the owner is acting in good faith, but it's a solo effort. I'd like to see if the site is here in a year. In the meantime, I would like to advocate using WebCite in parallel with Archive.is, meaning at least archiving at WebCitation, if not citing in ref. I hope this is received as a sensible precaution, in the best interest of Misplaced Pages's future source verifiability. --Lexein (talk) 02:10, 17 September 2012 (UTC)

I agree that we need to be circumspect. Just before seeing your comment above, I asked at http://blog.archive.is/ask :

"Who runs this site? If we're going to trust it (see Misplaced Pages:Link_rot#Repairing_a_dead_link) we need to have good reason to think it's stable/funded/likely to stick around indefinitely. The webcite faq is CC-NC-SA, so consider using it as a starting point for your own faq."

If we don't hear back soon, we should remove it. If Archive.is triggered a WebCite archive, in addition to its own, then I'd support its continued mention here starting now. Also, the IA now supports on-demand archiving. It just doesn't appear online for months. --Elvey (talk) 17:25, 5 October 2012 (UTC)

IA certainly supports on-demand archiving. Traditionally, archived pages took many (three to six?) months to appear. In mid-2012, many pages seemed to be returned within about three to five weeks. By early 2013, this seems to have further reduced to about three to seven days, especially when archiving pages from several well-known sites. In more recent times, some archived pages have been returned in around 200 minutes by IA but this very much depends on the site being archived. -- 31.52.117.100 (talk) 20:27, 29 July 2013 (UTC)

+1 for removing archive.is from the instructions, or at least not promoting it so strongly over sites like archive.org and other institutions that are part of the International Internet Preservation Consortium --Edsu (talk) 16:52, 16 November 2012 (UTC)

+2 for removing http://archive.is from the instructions, until such time as its reliability and persistence is better demonstrated. Beyond the web archives already mentioned, the List of Web archiving initiatives and Memento Project pages may be other useful resources to point to in the instructions. --nullhandle (talk) 21:46, 16 November 2012 (UTC)

Sorry, I did not find your message either in my inbox or in the Tumblr control panel :( Fortunately, I found this conversation by searching for archive.is on Twitter. As I found the questions here, I answer here as well.

About FAQ and more info on the page: a new design is being prepared. It will have more information (both textual and infographic) about how to use the site, how to search for saved pages, etc.

About funding: it was started as a side project, because I had a computational cluster with huge hard drives and those disk space was not used. It was an kind of experiment, to see if the people would need a service like this and choose a ways to develop the service based on how people will use it.

About stability: currently it is hosted on budget hosting providers (ovh.net and hetzner.de) using the Hadoop File System. Although the hardware is cheap, all data is duplicated 3 times in 2 datacenters (in Germany and France), and the system is designed to survive hardware faults with minimal downtime.

Almost all external links of Misplaced Pages (all Wikipedias, not only English) were archived in May 2012 in pursuit of two goals: to preserve the pages which may disappear, and to stress test and find bugs in my software. If you see that your link has rotted, you can check for it on archive.is and change the link to the saved page. If you do not trust archive.is but it is the only site which has preserved your content, you can save the archive.is page on WebCite or another site, thus making more copies and increasing redundancy.

Vice versa, you can save WebCite or IA pages on archive.is to increase redundancy. (IA is not likely to go offline, but a new domain owner may put "Disallow: /" in robots.txt and thus remove the previous domain owner's content from IA, so it may make sense.) --Rotlink (talk) 04:25, 18 November 2012 (UTC)

Also, there are some popular sites IA and WebCite cannot work with. Facebook.com is a big example. --Rotlink (talk) 04:58, 21 November 2012 (UTC)

I've rewritten the archive.is mention as "under evaluation", and emphasized that it should not be used alone until consensus agrees it is reliable. I did not delete it because we have quite a history of suggesting trying out services without advocating them. Back when IA was broken in 2008-2010, I was desperate, and used anything that seemed like it would work. Many of those services later vanished. But WebCitation, as sketchy and unfunded as it first seemed, has survived, Javascript malscripts be damned. So can we AGF for archive.is as "under evaluation"? --Lexein (talk) 23:09, 16 November 2012 (UTC)

It very much looks like Archive.is keeps only the newest shots when it archives external links automatically. It archives the external links once in a while, discarding the old archived versions. In the end, it's archiving dead links. And that is very bad. I detailed the process at Talk:Archive.is#How does automatic archiving work?. The owner of Archive.is probably doesn't realize that the program deletes old versions. — Ark25 (talk) 00:35, 27 July 2013 (UTC)

I've written to archive.is both on the Ask Me Anything form and by email, to ask about this behavior. I have not yet checked out old archive.is links I've used to see if this is a global problem. --Lexein (talk) 02:29, 18 August 2013 (UTC)

I am sorry, my bad. I didn't know that Archive.is is making incremental backups and that it started to create backups on all Wikipedias in May-June 2013 - see Talk:Archive.is#How does automatic archiving work? and User talk:Rotlink#Questions about Archive.is. Sorry for the false alarm! — Ark25 (talk) 22:52, 21 August 2013 (UTC)

I think Archive.is is very nice for making automatic backups of all links in Misplaced Pages. It really deserves to be integrated as a Wikimedia project, or at least it deserves to be paid by Misplaced Pages. It's very important to preserve the archives of the newspapers. — Ark25 (talk) 23:00, 21 August 2013 (UTC)

The proprietor of Archive.is, User:Rotlink, assures us here and on its FAQ page that it is financially secure. However, Webcitation.org has stated on its home page that it will be in financial trouble later this year. This has become a topic of discussion here: meta:WebCite. --Lexein (talk) 18:59, 25 August 2013 (UTC)

Newspaper websites which have undergone link format changes

Might be good to record a list of newspaper websites which have changed their article link format. This would help systematic, albeit manual, review of citations based on said newspapers. For instance:

periodical	old format	new format	change date
Arizona Daily Star	azstarnet.com/{section}/{article id}	azstarnet.com/{section}/{abbreviated title}/article_{identifier}.html	sometime after 2010

Just a thought. --User:Ceyockey (talk to me) 11:47, 16 January 2013 (UTC)

Yes, I find it a very good idea. I am doing that extensively on my native language Misplaced Pages (Romanian). I am putting such information on the talk pages of every newspaper's article. Check for example:

For example, the website of Adevărul changed its link formatting twice:

http://www.adevarulonline.ro/articole/cosmonautul-prunariu-trimis-in-rezerva/302198

==>

http://www.adevarul.ro/actualitate/Cosmonautul-Prunariu-trimis-rezerva_0_43197528.html Cosmonautul Prunariu, trimis în rezervă, 8 februarie 2007, Adevărul

==>

http://adevarul.ro/news/societate/cosmonautul-prunariu-trimis-rezerva-1_50ac14727c42d5a663849b01/index.html Cosmonautul Prunariu, trimis în rezervă, 8 februarie 2007, Adevărul (2012)

It's also an interesting exercise. I learned that many sites (most) changed their formatting like this:

http://www.evz.ro/detalii/stiri/averile-celor-mai-bogati-1000-de-britanici-s-au-redus-la-jumatate-833971.html (having the article index at the end - the change happened in 2010 or so)

In this case, first it was like:

http://www.evenimentulzilei.ro/article.php?artid=79214

and then:

http://www.evz.ro/articole/detalii-articol/833971/Averile-celor-mai-bogati-1000-de-britanici-s-au-redus-la-jumatate/

It's interesting that you can also access the article like this:

http://www.evz.ro/detalii/stiri/abc-833971.html or http://www.evz.ro/detalii/stiri/833971.html

Print version:

http://www.evz.ro/detalii/printeaza-articol/stiri/averile-celor-mai-bogati-1000-de-britanici-s-au-redus-la-jumatate-833971.html

Mobile version:

http://m.evz.ro/news/833971 - doesn't keep them for long it seems. A link that works: http://m.evz.ro/news/1049504

PDF version:

http://www.evz.ro/detalii/printeaza-articol/stiri/averile-celor-mai-bogati-1000-de-britanici-s-au-redus-la-jumatate-833971.html?type=1234
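The newer evz.ro variants listed above all seem to carry the same numeric article id (833971). As a rough sketch of how that could be used when reviewing citations, the following pulls the id out of any of those variants and builds the short form. The regular expressions are guesses based only on the example URLs in this thread, and the function names are made up for illustration; the oldest `article.php?artid=` format is left out because its ids do not obviously match.

```python
import re
from typing import Optional

# Patterns inferred from the example URLs quoted above (assumptions, not a
# documented URL scheme of evz.ro).
_ID_PATTERNS = [
    # /detalii/stiri/<slug>-<id>.html, with or without slug, plus print/PDF variant
    re.compile(r"/detalii/(?:printeaza-articol/)?stiri/(?:.*-)?(\d+)\.html"),
    # /articole/detalii-articol/<id>/<Title>/
    re.compile(r"/articole/detalii-articol/(\d+)/"),
    # mobile site: m.evz.ro/news/<id>
    re.compile(r"//m\.evz\.ro/news/(\d+)"),
]


def evz_article_id(url: str) -> Optional[str]:
    """Return the numeric article id found in an evz.ro URL, or None."""
    for pattern in _ID_PATTERNS:
        match = pattern.search(url)
        if match:
            return match.group(1)
    return None


def evz_short_url(article_id: str) -> str:
    """Build the slug-less form that was reported to work in this thread."""
    return f"http://www.evz.ro/detalii/stiri/{article_id}.html"
```

Whether the short form keeps working is up to the site, so any rewritten link should still be checked by hand.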

It's useful to know all those things. It's a little bit of reverse engineering. But it helps those who try to repair broken links. Such knowledge even helped me to repair broken links with a robot. Links like

http://www.financiarul.com/articol_27075/celrom-severin-a-intrat-in-lichidare-avas-cauta-cumparatori.html

Transformed into:

http://www.incomemagazine.ro/articol_27075/celrom-severin-a-intrat-in-lichidare-avas-cauta-cumparatori.html

Here is the robot at work: — Ark25 (talk) 00:18, 27 July 2013 (UTC)
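For readers wondering what such a repair looks like, the financiarul.com example above amounts to swapping the host and keeping the path. This is only a minimal sketch under that assumption, not the actual robot; the `HOST_MOVES` table and the `move_host` name are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit

# Known host moves, old -> new. Only the pair mentioned above is listed;
# the table itself is an illustrative assumption.
HOST_MOVES = {
    "www.financiarul.com": "www.incomemagazine.ro",
}


def move_host(url: str) -> str:
    """Rewrite the host of a URL if it is a known moved site, else return it unchanged."""
    parts = urlsplit(url)
    new_host = HOST_MOVES.get(parts.netloc.lower())
    if new_host is None:
        return url
    # Scheme, path, query and fragment are kept exactly as they were.
    return urlunsplit(parts._replace(netloc=new_host))
```

A real bot would also need to confirm that the rewritten URL actually resolves to the same article before saving the edit.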

Yahoo news - when they disappear, do they disappear altogether?

I found recently that at least some content at Yahoo News has been captured at archive.org (internet archive). The item which caught my attention ... http://web.archive.org/web/20090502072711/http://news.yahoo.com/s/ap/20090427/ap_on_re_mi_ea/ml_odd_israel_kosher_flu . It might be that after a certain date, Yahoo! blocked the archiving and they've not bothered to reach back and request removal of older content?? --User:Ceyockey (talk to me) 15:09, 26 January 2013 (UTC)

Mementos - cross-archive searching

Hi all,

A colleague of mine has just alerted me to the Mementos interface - it's hosted by the UK Web Archive, but searches across a range of archive sites. Here's an example of a search run for news.bbc.co.uk; as you can see, it picks up a couple of smaller repositories, such as the LoC, as well as the usual suspects.

Any objections to my pointing to this as a resource in the "Web archive services" section? Andrew Gray (talk) 16:15, 29 January 2013 (UTC)

Citations on Misplaced Pages and discussion at meta:WebCite

There is a discussion at meta:WebCite regarding citations on Misplaced Pages that would be of interest to those that watchlist this page. Misplaced Pages currently has 182,368 links to this archive site. Regards. 64.40.54.47 (talk) 11:41, 11 February 2013 (UTC)

suggest removing: Web archiving is especially important when citing web pages that are unstable or prone to changes, like time sensitive news articles or pages hosted by financially distressed organizations.

I believe we should remove "Web archiving is especially important when citing web pages that are unstable or prone to changes, like time sensitive news articles or pages hosted by financially distressed organizations." If a news article changes location, you can just update the link. Most don't change, though. If the link ever stops working, then you can add an archive to replace it. Over at Talk:Garrett_(character), an editor is quoting that sentence as a reason to include archive links all over an article, when it's not needed since links to the content still work fine on their own. Dream Focus 22:34, 12 March 2013 (UTC)

Late reply: I don't agree with removing this sentence. All sources are ephemeral. Some are more ephemeral than others, like news such as AP (with contractual expiration times), UPI, NYT, Google News (15-30 days), and anything Google caches (15-120 days). Many archive links "all over" an article aren't a problem, since they're just an incremental burden when using templates, and can be filled in automatically by some tools, like reflinks. Eventually, all links rot. This is invariant. Archive.org and Webcitation.org don't/can't archive all websites due to robots.txt. I've even seen whole sites which used to be archivable by them completely disappear behind a domain owner's new robots.txt (archive.org respects the current robots.txt, not past ones). My argument is that we should archive early, defensively, redundantly (multiple archives), and often, to avoid being caught flatfooted by such blackouts. I use the webcitation bookmarklet like a Tourette syndrome twitch. --Lexein (talk) 00:55, 30 July 2013 (UTC)
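To make the robots.txt point concrete, here is a small sketch using Python's standard `urllib.robotparser`. It only shows how the same page goes from fetchable to blocked when a new owner publishes "Disallow: /"; the user agent name and the example.com URL are placeholders, and nothing here claims to reproduce how any particular archive evaluates robots.txt.

```python
from urllib.robotparser import RobotFileParser


def is_archivable(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt under the previous owner: nothing is disallowed.
OLD_ROBOTS = "User-agent: *\nDisallow:\n"
# Hypothetical robots.txt under a new domain owner: everything is disallowed.
NEW_ROBOTS = "User-agent: *\nDisallow: /\n"

# "ExampleArchiveBot" is a placeholder crawler name.
before = is_archivable(OLD_ROBOTS, "ExampleArchiveBot", "http://example.com/article.html")
after = is_archivable(NEW_ROBOTS, "ExampleArchiveBot", "http://example.com/article.html")
```

With these inputs `before` is true and `after` is false, which is the kind of blackout described above.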

Pay Wall

The "Web archive services" section implies that the use of a web archiving service is useful in cases where material is moved behind a paywall. This position is troubling. Do we really mean this, and if so, how do we justify it?--SPhilbrick (Talk) 23:45, 19 March 2013 (UTC)

Not likely to get a response over here. That's why I asked that over at User_talk:Jimbo_Wales#violating_copyright_laws_by_linking_to_archived_sites_when_original_site_is_still_live. Also have the discussion still going on at Talk:Garrett (character). Dream Focus 23:53, 19 March 2013 (UTC)

Should the original url= be required when using archiveurl=

People here may be interested in commenting on the issue described at:

Misplaced Pages:Village pump (policy)/Archive_105#Citations: Should the original url.3D be required when using archiveurl.3D. Dragons flight (talk) 18:47, 8 April 2013 (UTC)

Link Rotting Across the Universe

Tmol42 and I have been discussing link archiving, and I'd just like some clarification on a matter. Sorry, I know you've probably had this so many times in so many forms, but I'd be grateful if you'd humour me! I found a dead link to a PDF file at Parish councils in England, and found a backup at the Wayback Machine. I added the archiveurl= and archivedate= parameters, and took the information from Wayback. My revision can be found here. The other user then changed this, removing the archive parameters, and setting the URL to the Wayback archive. His change can be seen here. Which, if either, is correct? drewmunn talk 16:22, 24 May 2013 (UTC)

Yours is correct; we don't link directly to archives, for many reasons. We even have bots to correct this. — HELLKNOWZ ▎TALK 20:24, 24 May 2013 (UTC)
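As an illustration of the preferred form, a Wayback Machine URL already contains both the original URL and the capture timestamp, so the url=, archiveurl= and archivedate= values can be derived from it. The sketch below assumes the `web.archive.org/web/<14-digit timestamp>/<original URL>` shape seen in the Yahoo News example elsewhere on this page; the function name and the ISO date format are illustrative choices, not a statement of what any bot does.

```python
import re
from datetime import datetime

# Assumed shape: http://web.archive.org/web/YYYYMMDDhhmmss/<original URL>
_WAYBACK_RE = re.compile(r"^https?://web\.archive\.org/web/(\d{14})/(.+)$")


def archive_params(wayback_url: str) -> str:
    """Turn a Wayback Machine URL into url=, archiveurl= and archivedate= parameters."""
    match = _WAYBACK_RE.match(wayback_url)
    if match is None:
        raise ValueError("not a Wayback Machine URL of the expected shape")
    timestamp, original_url = match.groups()
    # The first 8 digits of the timestamp are the capture date.
    archive_date = datetime.strptime(timestamp, "%Y%m%d%H%M%S").date().isoformat()
    return f"|url={original_url} |archiveurl={wayback_url} |archivedate={archive_date}"
```

The original link stays in url=, so readers and bots can still see where the content came from.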

Better archiving of external links

On Romanian Misplaced Pages we were trying to use a WikiWix gadget. Each external link is accompanied by its WikiWix cache link. To archive an external link, you just have to click on its corresponding WikiWix link. The first time you click it, it will archive your external link/reference. The next time you click on it, you will get the archived page. This is a much better way to archive web pages than submitting a link on WebCite or even using a bookmarklet. It works very fast: if you want to archive 20 references, you just have to open them all in tabs. However, WikiWix has some issues: it seems to have some daily, weekly, or monthly quota (you can't archive more than, say, 100 links per week), which makes it quite unusable.

You can see the cached links on WikiWix using a gadget that you can activate in your preferences.

The best solution would be something like the WikiWix gadget because you don't have to bother to present the archived link, the gadget will show it automatically. And it's very easy to archive it by just clicking on the archive for the first time. However, we need a better solution: A robot to cache all the external links in Misplaced Pages automatically (I just noticed in the discussion above that Archive.is did just that). And without such quotas like WikiWix, of course.

One solution would be to create a gadget for WebCite, to show the archived (cached) links near each external link. Together with a robot to take care of archiving all external links.

Another solution would be to create a gadget for Archive.is, to show the cached links, since its owner claims it cached all the external links last year. And, if possible, to arrange with him to archive the new external links each month or so.

For those who are not clear on how WikiWix works, check this page: ro:Șantierul Naval Constanța. In the "Note" section, you will see the following link:

Bosanceanu si-a dublat afacerile la Santierul Naval Constanta, 31.08.2009, zf.ro, accesat la 11 februarie 2010

In order to see the WikiWix cached links, you have to activate the WikiWix gadget: http://ro.wikipedia.org/Special:Preferences#mw-prefsection-gadgets - the last checkbox: Versiunea arhivată pentru legăturile externe

Now, near each link you can see a small yellow image (like 10x10 pixels). In this case, right before the date (31.08.2009), the WikiWix archive link: http://archive.wikiwix.com/cache/?url=http://www.zf.ro/burse-fonduri-mutuale/bosanceanu-si-a-dublat-afacerile-la-santierul-naval-constanta-4823139/&title=Bosanceanu%20si-a%20dublat%20afacerile%20la%20Santierul%20Naval%20Constanta . — Ark25 (talk) 19:05, 26 July 2013 (UTC)
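For anyone wanting to experiment, the cache link above appears to be just the original URL and the link title passed as query parameters. The following sketch reproduces that one example; the URL shape is inferred from this single link rather than from any WikiWix documentation, and the function name is made up.

```python
from urllib.parse import quote

# Base address taken from the example link above.
WIKIWIX_CACHE = "http://archive.wikiwix.com/cache/"


def wikiwix_cache_url(url: str, title: str) -> str:
    """Build a WikiWix cache link in the shape of the example above."""
    # The example leaves the target URL unescaped and only percent-encodes the
    # title, so this does the same. A URL containing "&" or "#" would need
    # escaping, which the example does not show.
    return f"{WIKIWIX_CACHE}?url={url}&title={quote(title)}"
```

Calling it with the zf.ro address and the title "Bosanceanu si-a dublat afacerile la Santierul Naval Constanta" gives back the cache link quoted above.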

Memento for Wayback Machine links

I just discovered Memento, a protocol that has been proposed in the past as a way for MediaWiki to provide easier access to historical revisions of pages; and that has a draft extension which could implement such a thing. AFAIK it isn't implemented on any wikis yet. But the idea is a neat one, and according to the bug request asking for the extension to be added to mediawiki core, Archive.org now recognizes memento URLs for referencing cached pages in the wayback machine. So this seems like a good time to revisit setting up a linkbot that caches links with them.

I sent an email to Alexis @ IA and Kevin, who wrote the unfinished ArchiveLinks extension, to see if any recent progress had been made. If so, it would be nice to have a guideline for including a memento timestamp in links to the archive.org cache. – SJ + 22:41, 15 August 2013 (UTC)
