Misplaced Pages

:Misplaced Pages Signpost/2013-02-04/Special report: Difference between revisions - Misplaced Pages

Article snapshot taken from Wikipedia under the Creative Commons Attribution-ShareAlike license.

Revision as of 16:30, 4 February 2013

Last revised 16:30, 4 February 2013 (UTC) by Milowent
The Signpost

Special report

Examining the popularity of Misplaced Pages articles: catalysts, trends, and applications

By West.andrew.g and Milowent
On February 12, 2012, the Whitney Houston page received 425 hits per second at its peak as news of her death spread.
On February 12, 2012, news of Whitney Houston's death brought 425 hits per second to her Misplaced Pages article, the highest peak traffic on any article since at least January 2010. It is broadly known that Misplaced Pages is the sixth most popular website on the Internet, and the English Misplaced Pages now has over 4 million articles and 29 million total pages. Far less attention, however, has been given to traffic patterns and trends in the content that is viewed. The Wikimedia Foundation makes aggregate raw article view data available for all of its projects. This article attempts to convey some of the fascinating phenomena that underlie extremely popular articles and, perhaps more importantly for editors, discusses how this information can be used to improve the project moving forward. While some dismiss view spikes as the manifestation of shallow pop culture interests (e.g., Justin Bieber is the 6th most popular article over the past 3 years; see Tab. 2), these are valuable opportunities to study reader behavior and to shape the public perception of our projects.

Misplaced Pages's most popular articles

We have begun producing two weekly charts of the most popular articles on Misplaced Pages: the WP:5000 list and the moderated WP:5000/Top25Report.

WP:5000, an automated list of the 5,000 most popular pages on Misplaced Pages, is now being compiled weekly. It also identifies how many featured articles, good articles, and lists are included. For the current list covering January 27 - February 2, we find 239 featured articles and 468 good articles in the top 5000 pages. However, this report is based on raw data and includes non-article pages and popularly requested redlinks, like "Com/fluendo/plugin/KateDec.class" at No. 15 on the current list (a script used to stream media content; see Cortado (software)), as well as 18k Gold Watch at position 166, a recurring entry likely fueled by spambots. More information on how the WP:5000 results are computed is found below.

The WP:5000/Top25Report is a manually moderated weekly Top 25 list, started in January 2013, of the most popular articles on English Misplaced Pages. Similar in format to best-selling book or music charts, it is a bit more user-friendly in that it excludes non-article pages, likely DOS attack entries, and the Main page. It also tracks how long an article has remained in the Top 25. For the first three weeks of January 2013, certain American football-related pages have been popular (a yearly trend seen during the playoff season of that sport), as have recently released movies such as Django Unchained and notable recent deaths such as Aaron Swartz.

The origins of heightened popularity

Articles which are "extremely popular" on Misplaced Pages fall into the category of either (1) occasional or isolated popularity, or (2) consistent popularity.

The prime sources of occasional or isolated popularity include:

Tab. 1. The most viewed pages on Misplaced Pages in a one-hour period since January 1, 2010 (excluding duplicate entries and DOS attacks)
Rank | Article | Date | Views/hr | Views/sec | Notes
1 | Whitney Houston | 12 Feb 2012 | 1532302 | 425.6 | Death of subject
2 | Amy Winehouse | 23 Jul 2011 | 1359091 | 377.5 | Death of subject
3 | Steve Jobs | 6 Oct 2011 | 1063665 | 295.5 | Death of subject
4 | Madonna (entertainer) | 6 Feb 2012 | 993062 | 275.9 | Super Bowl halftime
5 | Osama bin Laden | 2 May 2011 | 862169 | 239.5 | Death of subject
6 | The Who | 7 Feb 2010 | 567905 | 157.8 | Super Bowl halftime
7 | Ryan Dunn | 20 Jun 2011 | 522301 | 145.1 | Death of subject
8 | Jodie Foster | 14 Jan 2013 | 451270 | 125.4 | Golden Globes speech
  • Cultural events and deaths: The best way to reach the highest levels of Misplaced Pages popularity is to be a celebrity who (a) dies, or (b) plays the Super Bowl halftime show (see Tab. 1). Indeed, prominent deaths dominate the top-100 traffic events and beyond. However, less morbid events are occasionally on the same scale, such as Jodie Foster following her recent coming out at the 2013 Golden Globes, Bubba Watson upon winning the 2012 Masters Tournament, and Ice hockey at the 2010 Winter Olympics during the final match between the U.S. and Canada (all drew over 250,000 views in a single hour).
  • Google Doodles: Google often replaces its logo to commemorate anniversaries and other events, and clicking on the logo will usually produce the search results for that topic. With Misplaced Pages appearing first for many search engine queries, this can be a tremendous source of traffic. When the 110th birthday of Dennis Gabor was celebrated in this fashion on June 5, 2010, his article peaked at over 55 views per second (this for an article that currently sees only about 140 views per day). There are many other examples, including Winsor McCay on October 15, 2012, Gideon Sundback on April 24, 2012, and the London Underground earlier this month.
  • Non-human views and DOS attacks: Page access data cannot distinguish between human readers and automated traffic. The most dramatic example occurred on March 9, 2010, when the Jyllands-Posten Muhammad cartoons controversy article saw 5.3 million views in a single hour (likely the densest view-hour at any point in Misplaced Pages's history). Given the religious controversy and sensitivity surrounding the topic, this is believed to have been an attack designed to prevent others from viewing the page and its associated imagery. Ironically, the Denial of Service article also appears to be a frequent target. Often, it can be hard to distinguish between malicious attacks, accidental misconfiguration (e.g., bot testing), and undiscovered catalysts of human traffic. In compiling the WP:5000/Top25Report, some discretion is applied in an attempt to remove odd anomalies. For example, Cat anatomy has been a popular article in raw page views for a few months (and not only on Caturdays), after previously being much less popular.
  • Second screen effect: Though not nearly on the scale of the above spikes, we find that television programs and their content are reflected in page view data. This can be as broad as spikes on the Big Bang Theory article when the program airs on popular networks, but is even seen in small traffic bumps when a quiz show like Jeopardy! or Who Wants to be a Millionaire? asks about a particular topic. This phenomenon has recently been more thoroughly investigated on the German Misplaced Pages.
  • Slashdot effect: When extremely popular aggregation sites like Slashdot or Reddit prominently link to Misplaced Pages, traffic follows. Internally, Misplaced Pages's Main page can have much the same effect.
  • Temporal patterns: The Christmas article is popular in December, Easter peaks around that holiday, and Christianity-related articles tend to see unusual amounts of Sunday traffic. This is just the start of the patterns that recur diurnally, annually, and at other predetermined intervals.
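The Views/sec column in Tab. 1 is simply the hourly count averaged over the 3,600 seconds of the peak hour. A minimal sketch of that arithmetic follows; the function and variable names are illustrative and do not come from the authors' code.

```python
SECONDS_PER_HOUR = 3600

# Peak-hour view counts copied from Tab. 1.
PEAK_HOURS = [
    ("Whitney Houston", 1532302),
    ("Amy Winehouse", 1359091),
    ("Jodie Foster", 451270),
]


def views_per_second(views_per_hour: int) -> float:
    """Average views per second over a one-hour window, rounded to 0.1."""
    return round(views_per_hour / SECONDS_PER_HOUR, 1)


for article, hourly in PEAK_HOURS:
    print(f"{article}: {views_per_second(hourly)} views/sec")
```

Because this is an hourly average, the true instantaneous peak within the hour may well have been higher than the figure shown.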
Tab. 2. The most popular articles on Misplaced Pages (2010–2012)
Rank Article
1 Wiki
2 Facebook
3 United States
4 YouTube
5 Google
6 Justin Bieber
7 Glee (TV series)
8 Sex
9 Misplaced Pages
10 Lady Gaga
11 Eminem
12 How I Met Your Mother
13 United Kingdom
14 The Big Bang Theory
15 India
16 World War II

Meanwhile, reasons for long-term popularity are somewhat more intuitive. Tab. 2 shows the most popular articles over the last ~3 years. In addition to the broad underlying cultural and academic interests of Misplaced Pages's audience, we encourage the reader to consider:

  • English Misplaced Pages's readership is not representative of English-speaking populations. Previous studies have shown that Misplaced Pages's readership tends to be somewhat young, male, and educated, and their interests are likely to vary accordingly. Anecdotal evidence suggests significant traffic is driven by primary/secondary/university students in academic contexts, and we find that related topics are frequent vandal targets as well (e.g., classic English literature, trigonometry concepts, etc.).
  • Notice that Google, YouTube, and Facebook are all consistently popular articles. We speculate this is due in part to people accidentally typing these site names/URLs into a Misplaced Pages search box (either in the MediaWiki interface or a web browser) when intending to actually visit the sites themselves. This is related to typosquatting, but is not a case of it.

Applications and use-cases of the data

For anti-vandalism/damage

Fig. 1. CDF plot of survival times and view counts for a corpus of damaging revisions, e.g., about 50% of damage has a lifespan of < 100 seconds, and 90% of damage has < 100 views.

The impetus behind storing these statistics was to better understand damage response on Misplaced Pages (the dissertation topic of author User:West.andrew.g). By storing statistics for every article at the finest granularity possible (hourly), it becomes possible to accurately estimate the number of readers who saw any particular article version. While practical writings have often focused on the time-to-revert of damaging edits, we argue that the number of people who view the damage is the more relevant metric. Vandalism that survives for days on an obscure article is effectively harmless if no one visits that article.
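One way to turn hourly counts into an exposure estimate for a single revision is to prorate each hour's views by the fraction of that hour the revision was live. The sketch below assumes views are spread uniformly within each hour; that assumption, the function name, and the data layout are ours for illustration, not necessarily the method used in the cited work.

```python
from datetime import datetime, timedelta

ONE_HOUR = timedelta(hours=1)


def estimate_exposure(hourly_views, live_from, live_until):
    """Estimate how many views a revision received while it was live.

    hourly_views maps the start of each hour (a datetime) to that hour's
    view count. Views are assumed to be uniform within each hour.
    """
    total = 0.0
    hour = live_from.replace(minute=0, second=0, microsecond=0)
    while hour < live_until:
        overlap_start = max(hour, live_from)
        overlap_end = min(hour + ONE_HOUR, live_until)
        fraction = (overlap_end - overlap_start) / ONE_HOUR
        total += fraction * hourly_views.get(hour, 0)
        hour += ONE_HOUR
    return total


# Illustrative numbers only: a busy article, damage live for 100 seconds.
busy = {datetime(2013, 1, 1, 12): 3600, datetime(2013, 1, 1, 13): 7200}
print(estimate_exposure(busy,
                        datetime(2013, 1, 1, 12, 59, 0),
                        datetime(2013, 1, 1, 13, 0, 40)))
```

With these made-up counts, 100 seconds of damage on a busy article is seen far more often than two days of damage on an article that draws a couple of views per hour, which is the point of measuring views rather than survival time.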

Fig. 1 plots the CDF of both the lifespan and the view count of about 500,000 recent damaging edits. As the graph shows, at the median just 1 person will be exposed to a damaging edit. Such an impressive figure is a testament to the automated (e.g., ClueBot NG) and semi-automated (e.g., Huggle and STiki) mechanisms that have recently been brought to bear on the task. While these tools produce probabilistic measures of damage, STiki alone will soon integrate an article's popularity into its prioritization schema.

Fig. 1 also shows that ~10% of damaging edits are viewed by 100+ persons. Deeper analysis shows that many of the associated survival times are quite short, and these are often the result of damage to extremely popular articles. With human latency already quite minimal (and a certain amount of latency being inherent), new solutions are needed. Consider that spammers could opportunistically target very popular pages to exploit these brief windows of opportunity. Dynamically and autonomously moving articles in and out of "page protection" or "pending changes" based on their traffic patterns is another possible use-case for this data. As Fig. 2 demonstrates, the power-law distribution of views over articles suggests that relatively few articles would need to be protected to have a significant impact.

Spam and vandalism are surface-level issues. Recent analysis of deleted revisions on English Misplaced Pages showed copyright violations, being much harder to detect in casual patrolling work, to have significant lifespans and end-user exposures. This finding has motivated research into autonomous means of copyright violation discovery (see WP:Turnitin).

Improving article quality

Fig. 2. Log-log plot of article rank vs. daily views, e.g., the 100th most popular article receives just over 10k daily views. The Zipf distribution is also plotted for comparison.

Article popularity can also be a measure for deciding which articles to improve, a concept already familiar to WikiProjects that keep tabs on the popularity of articles within their project (e.g., Misplaced Pages:WikiProject Songs has a watchlist for the 1,500 most popular song articles). At the aggregate level, the distribution of page views follows a "power law distribution". Fig. 2 represents one month's views on Misplaced Pages graphed against a Zipf distribution (a distribution in which the most frequent item occurs approximately twice as often as the second item, three times as often as the third item, and so forth). The top 25 most viewed pages represent 4.02% of all views, and the top 5000 represent 19.1% of all views. Though the distribution has an extremely long tail, the top 5000 data provides an opportunity to locate popular but poorly written articles that need attention, as opposed to randomly selecting one of the 4.15 million remaining articles on the project. That is not to say that articles deep in the long tail are less important, but for editors interested in prioritizing article improvement based on popularity and effect on public perception, the WP:5000 data is an important tool.
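For comparison with the observed shares, the share of views held by the top k items under an idealized Zipf distribution can be computed directly: if the item at rank r receives views proportional to 1/r, the top-k share is the ratio of two harmonic sums. The sketch below is only that idealized model. The exponent of 1 and the article total (the 5,000 listed pages plus the 4.15 million remaining articles) are assumptions made here for illustration.

```python
def harmonic(n: int, exponent: float = 1.0) -> float:
    """Sum of 1 / rank**exponent for rank = 1..n."""
    return sum(1.0 / rank ** exponent for rank in range(1, n + 1))


def zipf_top_share(top_k: int, total_items: int, exponent: float = 1.0) -> float:
    """Fraction of all views held by the top_k items under an ideal Zipf law."""
    return harmonic(top_k, exponent) / harmonic(total_items, exponent)


# Assumed total: 5,000 listed pages + 4.15 million remaining articles.
TOTAL_ARTICLES = 4_155_000
denominator = harmonic(TOTAL_ARTICLES)

for k in (25, 5000):
    share = harmonic(k) / denominator
    print(f"ideal Zipf share of top {k}: {share:.1%}")
```

Under these assumptions the idealized shares come out well above the reported 4.02% and 19.1%, so the Zipf curve in Fig. 2 is better read as a comparison line than as an exact fit. Note also that the reported percentages are taken over all views, which include non-article pages.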

Insights into popular culture

These statistics also provide an opportunity to study what is popular in contemporary culture. Before the growth of the Internet, the primary quantitative measures of contemporary popularity included bestselling book and music charts, box office sales, and television and radio ratings. The digital age now gathers vast quantities of data on consumption not previously available, but some observations from the past still hold true. The fact that Justin Bieber was the sixth most popular article from 2010–12, far ahead of more critically appreciated talent, is consistent with what James D. Hart (author of The Oxford Companion to American Literature) observed in 1950 in writing about the most popular books of the mid-19th century:

Misplaced Pages:Misplaced Pages Signpost/Quote

Thus, in the same way, page view statistics permit us to consider that Justin Bieber and One Direction, as maligned as they may be critically, are more popular, and likely more influential on culture, than, say, Kendrick Lamar, whom Pitchfork Media credited with releasing the best album of 2012.

Data details and alternative perspectives

All the statistics in this article were produced by aggregating raw data made available by the WMF. This data contains hourly hit counts on a per-article basis for all WMF language/project combinations. Since Jan. 1, 2010, User:West.andrew.g has been parsing these files nightly and storing the English Misplaced Pages (article namespace) portions in a database hosted at the University of Pennsylvania. This is a non-trivial undertaking, consuming 1TB+ yearly. The database has been the basis for several academic results (and was motivated by earlier third-party work). More recently, he has begun publishing the aforementioned weekly reports of the top 5000 articles, made monthly reports available for 2012, and released the source code behind these computations.
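The nightly parsing step can be sketched roughly as follows. The sketch assumes each line of an hourly file holds four space-separated fields (project code, page title, view count, bytes transferred) and uses a simple prefix check to approximate the article namespace. The field layout, the prefix list, and the function name are assumptions made here for illustration, and the sample lines are invented. This is not the authors' released code.

```python
from collections import Counter

# Hypothetical, incomplete list of non-article namespace prefixes.
NON_ARTICLE_PREFIXES = (
    "Special:", "Talk:", "User:", "User_talk:", "Wikipedia:",
    "File:", "Template:", "Category:", "Help:", "Portal:",
)


def parse_hourly_lines(lines, project="en"):
    """Sum view counts per article title for one project.

    Each line is assumed to look like: "<project> <title> <views> <bytes>".
    Malformed lines and non-article titles are skipped.
    """
    counts = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue
        proj, title, views, _size = parts
        if proj != project or title.startswith(NON_ARTICLE_PREFIXES):
            continue
        try:
            counts[title] += int(views)
        except ValueError:
            continue
    return counts


# Invented sample lines, for illustration only.
sample = [
    "en Example_article 120 3456789",
    "en Example_article 30 987654",
    "de Example_article 999 111",
    "en Talk:Example_article 7 2222",
    "not a valid line",
]
print(parse_hourly_lines(sample).most_common(5))
```

Summing the per-hour counters over a week and taking the 5,000 largest entries would give a raw list in the spirit of WP:5000, before any manual moderation.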

Others have used the same data for alternative purposes: User:Henrik has developed a tool for looking up the traffic history of specific articles. The Wikitrends site concentrates on dramatic popularity increases/decreases. WMF analyst Erik Zachte produces WikiStats, which provides a higher-level perspective on all WMF projects in numerous statistical dimensions. Mr. Zachte also has a fascinating portfolio of his WMF statistical work. These Misplaced Pages/WMF-specific resources complement other Internet-scale observations regarding search and popularity; most famously the Google Zeitgeist.

There are some caveats in interpreting this data. First, this is a raw presentation of traffic and popularity. It is known that English Misplaced Pages traffic has generally been increasing over time (per ). This fact, and the growing Internet connectivity that likely underlies it, biases the results toward more recent events. Second, it should be mentioned that logs may have underreported page view data in early 2010.

References

  1. (17 November 2012). Misplaced Pages-Zugriffszahlen bestätigen Second-Screen-Trend, martinrycak.de (in German; investigates how Misplaced Pages traffic matches German television shows during broadcast times)
  2. West, Andrew G., Sampath Kannan, and Insup Lee. Detecting Misplaced Pages Vandalism via Spatio-Temporal Analysis of Revision Metadata. In EUROSEC '10: Proceedings of the Third European Workshop on System Security, pp. 22–28. Paris, France. April 2010.
  3. West, Andrew G., Jian Chang, Krishna Venkatasubramanian, Oleg Sokolsky, and Insup Lee. Link Spamming Misplaced Pages for Profit. In CEAS '11: Proceedings of the Eighth Annual Collaboration, Electronic Messaging, Anti-Abuse, and Spam Conference, pp. 152–161. Perth, Australia. September 2011.
  4. West, Andrew G. and Insup Lee. What Misplaced Pages Deletes: Characterizing Dangerous Collaborative Content. In WikiSym '11: Proceedings of the Seventh International Symposium on Wikis and Open Collaboration, pp. 25–28. Mountain View, CA, USA. October 2011.
  5. (20 December 2012). The Top 50 Albums of 2012, Pitchfork
  6. Priedhorsky, Reid, Jilin Chen, Shyong (Tony) K. Lam, Katherine Panciera, Loren Terveen, and John Riedl. Creating, Destroying, and Restoring Value in Misplaced Pages. In GROUP '07: Proceedings of the International ACM Conference on Supporting Group Work, pp. 259–268. Sanibel Island, Florida, USA. November 2007.

    Discuss this story



    This is a fantastic article; thank you for sharing. Jujutacular (talk) 02:37, 6 February 2013 (UTC)

    I agree, fascinating stuff.--ukexpat (talk) 03:22, 6 February 2013 (UTC)
    Indeed, very interesting and comprehensive. Great job! --Waldir 03:59, 6 February 2013 (UTC)
    Yes, this is excellent, thanks so much for doing this and writing the article about it. -- phoebe / (talk to me) 04:17, 6 February 2013 (UTC)
    Yes, completely agreed with all the previous comments. (And, for those interested in the topic of high-profile events leading to massive page view spikes pre-2010, there's some coverage here, here, here, and here.) --MZMcBride (talk) 06:19, 6 February 2013 (UTC)
    Some more notes on 2008 election traffic. I didn't get hourly figures for Obama, but I suspect in the low hundreds of thousands per hour - so just off our list above. Andrew Gray (talk) 07:31, 6 February 2013 (UTC)
    I expect the most viewed article ever will come from a celebrity who dies while playing the Super Bowl half time show ... but seriously, very interesting article. Great work. MasterOfHisOwnDomain (talk) 11:55, 6 February 2013 (UTC)
    A significant reason we write the encyclopedia is for people to read it. Misplaced Pages:Did you know/Statistics provides some sense of what readers look for on the Main Page. However, the analysis provide above by West.andrew.g and Milowent is exactly what we need to get a better sense of what our readers desire on a larger scale as well as a sense of how the encyclopedia articles are being used. Great job! - Uzma Gamal (talk) 13:02, 6 February 2013 (UTC)

    Not only is this a great article, it supplies important information -- not just for Misplaced Pages, but for the marketing world in general. Since most people don't know about the Signpost, I highly recommend posting about this article on marketing and social media sites - tweet it up. -- kosboot (talk) 14:26, 6 February 2013 (UTC)

    • Like others above, I welcome the coverage of viewing statistics, which is an area greatly neglected in most wiki-discourse. But I am always less interested in the very top of the charts than the middle and bottom, and more on this in the future would be really great. A few weeks ago I posed a question on the technical pumps, asking how we can generalize about the number or proportion of crawler bot hits in the article stats which, unlike everything else on the page, received no response at all. Yet this is a key question for much current editing, which overwhelmingly concentrates on long tail article with low viewing figures. Also, what are we able to say about how long average "readers" spend on an article, and how much they actually read? I haven't a clue, beyond the overall average figure of a few seconds (which I can't seem to find now). Johnbod (talk) 16:24, 6 February 2013 (UTC)
    • Very good work! A pity that this came out now and that it started in 2010; if it had started in 2009, it would have been able to account for stories about the death of Michael Jackson, and if it had only come a few days later, it would have been able to include the massive spike for hits on Richard III of England, which typically got a few thousand hits daily until Monday, when it got about 800,000, or almost 25× the number of hits for that day's featured article. I'll look forward to future studies! Nyttend (talk) 02:42, 7 February 2013 (UTC)

    Really great article about the power of WP. The effect that WP has had on our world is huge, but unfortunately largely unmeasurable on the individuals' side of things. I thought that the readers here, may likewise enjoy a piece of research that I recently read (that cites WP as an example), that I feel is very fascinating in how it describes the power behind phenomena like WP. It's called "The Theory of Crowd Capital" and you can download it here if you're interested: http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2193115 Enjoy! — Preceding unsigned comment added by 24.85.85.220 (talk) 01:24, 7 February 2013 (UTC)

    Nice work. I'm reminded of this page: Misplaced Pages:Short popular vital articles.

    Azerbaijani places

    • Thanks for starting that AfD on Gasaneri, I presume due to the cool "Ә"s in your name you know your stuff. There are probably more stubs like that for Azerbaijani locations - is there official census information for each rayon that could help us improve our coverage? Regards.--Milowent 13:03, 4 October 2014 (UTC)
      • Hello my friend. Yes there are a lot of villages which abolished a lot of years ago, but today all of them in English and Bahasa wikipedias. I will start to work on them. You also can help me to delete them because I am not an adminstirator and I do not have permission to delete them. If you are adminstirator or you have a friend who is adminstirator help delete them. Thanks you for attention.--Nəcməddin Kəbirli (talk) 13:19, 4 October 2014 (UTC)