
Data roundup, July 3

Neil Ashton - July 3, 2013 in Data Roundup

We’re rounding up data news from the web each week. If you have a data news tip, send it to us at

Photo credit: David O'Leary



The 12th Python in Science conference, #scipy2013, just concluded, and the conference proceedings are now available. How was this superfast turnaround time possible? “For 2013 [the reviewers] followed a very lightweight review process, via comments on GitHub pull-requests.” Hopefully this remarkable publication method will achieve broader currency. If that’s not enough SciPy content for you this week, also check out Brad Chapman’s notes on day one and day two of the conference.

Escuela de Datos has launched. This new project from the School of Data is an example of the School’s efforts “to bring the School of Data methodologies and materials to people in their native languages”, transporting the School’s hands-on teaching approach to the Spanish-speaking world. OKF International Community Manager Zara Rahman reflects on meeting the Latin American open knowledge community.

Abre Latam, “the first unconference on open data and transparency in Latin American governments”, took place in Montevideo, Uruguay, on June 24th and 25th. Learn about what happened at Abre Latam in a La Nación blog post.

Poderopedia is a “data journalism website that uses public data, semantic web technology, and network visualizations to map who’s who in business and politics in Chile”. It is now also a platform. New sites on the Poderopedia model can now be created by forking the Poderopedia GitHub repository.

Open Knowledge Foundation Nepal’s first meetup took place on the 28th of June. The meetup was an informal discussion of the OKFN’s nature and purpose, setting the agenda for future activities. Prakash Neupane provides a summary of the event.

RecordBreaker “turns your text-formatted data (logs, sensor readings, etc) into structured Avro data, without any need to write parsers or extractors”. It aims to reduce that most familiar of all obstacles to data analysis by automatically generating structure for text-embedded data.

dat is a new project—existing just as a mission statement, so far!—that aims to be “a set of tools to store, synchronize, manipulate and collaborate in a decentralized fashion on sets of data, hopefully enabling platforms analogous to GitHub to be built on top of it”. Derek Willis comments on its significance.

Communist is a JavaScript library that makes it easier to make use of the JavaScript threading tools called “workers” (surely such a library should be called Manager or Cadre? anyway…). Communist’s demos include data-pertinent items like parsing a dictionary and creating a census visualization.

If you’re wondering what can be learned about you from your metadata, check out Immersion, a meditative MIT Media Lab project which takes your Gmail metadata and returns “a tool for self-reflection at a time where the zeitgeist is one of self-promotion”.


How is the Brazilian uprising using Twitter? Check out this report for some revealing numbers and insights in the form of charts and network visualizations.

Some initial results from the Phototrails project have been posted. Phototrails mines visual data from Instagram to explore patterns in the photographic life of cities.

What went wrong at the G8 summit with the possibility of “a new global initiative to open up data that is needed to tackle tax havens”? OKFN policy director Jonathan Gray takes a look at what needs to happen in the way of G8 countries connecting “the dots between their commitments to opening up their data and their commitments to tackling tax havens”.

This has been a good month for OpenCorporates. Most recently, OpenCorporates has quietly started releasing visualizations of the network structures of corporate ownership. This visualization of the network of companies connected to Facebook Ireland gives a taste of what is to come.

La Gazette des Communes recently published an app breaking down “les péréquations horizontales” region by region as a first step to evaluating the redistribution project’s success. La Gazette has now published the code and data for the app.

Another new map visualizes Danish companies’ payment of corporate income tax for the year 2011. Built with MapBox, the map highlights a disturbing (albeit, as the authors hasten to point out, potentially legally explicable) amount of tax avoidance.

Data Sources

A new service has launched, providing “open data about crime and policing in England, Wales and Northern Ireland” through both CSV downloads and an API under an Open Government License.

In what the BBC is hailing as “a historic moment”, the British National Health Service has released the first of a series of performance datasets on individual British surgeons, this set covering vascular surgeons. The data is available from the NHS Choices website.

The Global Observatory is a database which aims to document the “large-scale land acquisitions or ‘land grabs’” that have resulted in 32.8 million hectares of land falling into the hands of foreign investors since 2000. It has recently updated its online tool for “the crowdsourcing and visualisation of data as well as the verification of sources of such data”.

Foursquare has “created an authoritative source of polygons around a curated list of places”, merged it with “data licensed from many governments around the world”, and released the result, Quattroshapes, 30 gigabytes of geospatial data, under a Creative Commons license.


Data roundup, June 26

Neil Ashton - June 26, 2013 in Data Roundup


Photo credit: Eduardo M. C.



The eight winners of the 2013 Data Journalism Awards have been announced. Check out the award-winning work on gay rights, Chinese power structure, class in Britain, and more.

Members of Investigative Reporters & Editors can now use Tableau Desktop for free. Tableau is a drag-and-drop data analysis tool popular with journalists who work with data. Now a wider range of journalists can take advantage of Tableau to tell stories with data.

Berlin Open Data Day 2013 was held this past Sunday, preceded by the redesign and addition of new data to Berlin’s open data portal. Projects showcased at the event included Bürger baut Stadt, an interactive map of construction projects, and a map of accessibility of living places by public transport.

If you’re in the Washington D.C. area, you can learn how to turn raw text into data with the magic of the Python Natural Language Toolkit by participating in a workshop on natural language processing basics being held July 27. Registration is still open, and tickets are $150.

Learn how to visualize data with a master class from one of the leading lights in data journalism, the Guardian. This “introduction to visualising data” will cover both the technical and the journalistic side of data visualization. Tickets are £99 and available till July 6.

What can you do with the statistical programming language R? Well, you can build a beer recommendation system, and yhat can show you how. This lesson starts with data from Beer Advocate and finishes with a find_similar_beers function wrapped in an API.
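For a rough feel of what a function like find_similar_beers might do, here is a minimal sketch in Python rather than R. It is not yhat’s code: the ratings are invented for illustration, and cosine similarity is a swapped-in technique chosen for simplicity, so the lesson itself may well compute similarity differently.

```python
import math

# Hypothetical ratings invented for illustration: beer -> {reviewer: score}.
ratings = {
    "Pale Ale": {"ann": 4.5, "bob": 4.0, "cat": 2.0},
    "IPA": {"ann": 4.0, "bob": 4.5, "cat": 1.5},
    "Stout": {"ann": 2.0, "bob": 1.5, "cat": 5.0},
}


def cosine_similarity(a, b):
    """Cosine similarity over the reviewers that two beers have in common."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[r] * b[r] for r in shared)
    norm_a = math.sqrt(sum(a[r] ** 2 for r in shared))
    norm_b = math.sqrt(sum(b[r] ** 2 for r in shared))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def find_similar_beers(beer, ratings, top_n=2):
    """Return the top_n other beers, most similar first, as (name, score) pairs."""
    scores = [
        (other, cosine_similarity(ratings[beer], ratings[other]))
        for other in ratings
        if other != beer
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_n]


similar = find_similar_beers("Pale Ale", ratings)
for name, score in similar:
    print(name, round(score, 3))
```

With this toy data the IPA ranks above the Stout, since the same reviewers scored it much like the Pale Ale.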

If you’re sharing data of a personal nature, you need to take steps to render it anonymous to protect the privacy of the people the data describes. The UK Anonymisation Network is an organization that can help you do this, providing “practical advice and information to anyone who handles personal data and needs to share it”.

“Glue is a Python library to explore relationships within and among related datasets.” Glue, which rests upon the Python numerical computation stack, specializes in links across and relations between datasets, making it easy to juxtapose your data or to use selections in one set to constrain another.

It’s a commonplace of machine learning that a trained classifier is a black box, able to generate predictions but not itself straightforwardly interpretable. New work by a team of Italian researchers shows that this is less than completely true. Higher-order machine learning classifiers—classifiers of classifiers—can be trained “to hack other classifiers, obtaining meaningful information about their training sets”.


The south of the Canadian province of Alberta flooded this past week, requiring the evacuation of thousands of people from Alberta’s largest city. A map of the Alberta floods produced by Google Crisis Response illustrates the extent of the flood and the severity of the damage.

Indonesia, meanwhile, has been on fire. Smoke from forest fires in Sumatra caused unprecedented deterioration of air quality levels and forced closures of schools and airports. The World Resources Institute has compiled and mapped data on Indonesian forest fires in an effort to better understand their patterns and causes.

The summer solstice, June 21, was also Canada’s National Aboriginal Day, “a day of celebration for the Aboriginal Peoples in Canada”. Statistics Canada commemorated the occasion by releasing an annotated compilation of facts drawn from the 2011 National Household Survey.

An interactive map of Bangladesh factory disasters, presenting the past 23 years of industrial deaths in Dhaka, is certainly this week’s most heartbreaking use of CartoDB. The 1,127 dead in Rana Plaza loom large. Each accident is linked to the source of information.

Periscopic, everyone’s favorite data do-gooders, have unveiled two major new pieces this week. The first is an interactive created in partnership with the Economic Policy Institute to illustrate the extent, impact, and origin of income inequality in America. The next is The Wait We Carry, a grim illustration of the long wait times that American veterans experience applying for disability status.

Repetition is one of life’s great pleasures. The rhythmic quality of poetry comes about, first and foremost, through repetition and recurrence of sounds and images. Former English major and present-day natural language processor Will Kurt applies his NLP experience to visualize repetition in T.S. Eliot’s Four Quartets.

Argentine journalists have no access to Freedom of Information legislation or open data in their country, and yet they are killing it at the data-driven journalism game. A new article highlights the award-winning data journalism of Argentina’s La Nación.

How do different machine learning classifiers perform on different datasets? A fascinating new gallery of classifier algorithm outputs provides visual insight into this question, plotting attempts by various classifiers at learning simple two-dimensional patterns.


I have to admit: I just didn’t encounter any major new data sources released this past week. If you want to share any, please leave a comment pointing them out!

As for last week, many, including me, were distracted by the relaunch of the Canadian federal government’s data portal. But the real story, says David Eaves, is the government’s adoption of the Open Data Charter. Meanwhile, with input from Eaves, the government has also unveiled a new Open Government License.


Data roundup, June 19

Neil Ashton - June 19, 2013 in Data Roundup


Photo credit: Mike S



The G8 Open Data Charter was unveiled this past Tuesday at the 39th G8 Summit. The charter reaffirms the G8 countries’ commitment to open data and sets an “open by default” policy for government data. The Telegraph discusses the significance of the charter.

There is, however, still much to be done. The Open Knowledge Foundation launched a preview of the 2013 Open Data Census in time for the G8 Summit, and this preview suggests that “G8 countries still have a long way to go in releasing essential information as open data”.

Also released in time for the G8 Summit is a pilot of the open company data index from OpenCorporates, supported by the World Bank. The index highlights global corporate participation in the move towards transparency in data.

The Open Knowledge Foundation has announced the launch of the Panton Fellowships, awards valued at £8,000 per annum which will reward scientists who actively promote scientific open data. Applications are now being accepted.

The Open Data Institute has announced the beta of Open Data Certificates, a website which allows data publishers to self-report and certify their data’s adherence to openness standards.

The State of New York has published a provisional open data handbook—on GitHub. The handbook is “a general guide for government entities participating in OPEN-NY”. Comments from the public are invited.

Got a spatiotemporal dataset with several billion data points? Want to visualize it interactively in a web browser? Nanocubes are a new, fast data structure that can be used to do exactly that. Nanocubes use so little memory “that you can run a nanocube in a modern-day laptop”. Check out a demo of nanocubes applied to some two hundred million tweets.

There is now a Go library for spatial data operations, and it’s called gogeos. Fans of Google’s programming language can now take advantage of a wide range of powerful spatial data manipulations, as detailed in the announcement blog post.

Learn how a small team of journalists can tackle a big data-journalistic project with David Bauer’s account of how TagesWoche investigated migrant remittances. Bauer explains how teams were structured, data was found, and visualizations were constructed.

Hive is a “data warehouse system” that facilitates the analysis of large datasets. Learn how to process social science data with Hive in a new blog post by John Beieler that shows you how to query a 40+ gigabyte dataset in a matter of minutes.

Become a Knight-Mozilla OpenNews Fellow! Applications for the fellowships, which fund a journalistically inclined programmer for ten months of newsroom participation, will be open until August 17.


Treezilla aims to map every tree in Britain. It is a citizen science platform in the form of an app and a crowdsourced database of British trees, already tens of thousands of trees strong, contributing to awareness of trees and their central ecological importance.

Detention Logs publishes “data, documents and investigations that reveal new perspectives on conditions and events inside immigration detention” in Australia. Called “one of the largest data journalism projects in Australian history” (source), the project aims “to arm the public” with the facts necessary for informed public policy on asylum seekers and immigration.

A new project presents many perspectives on the lives of Parisiens (income, politics, sex…), all from the reference-point of the Métro stations that abut on their lives. It is among the newest, and perhaps richest, transit-oriented infographic takes on urban life; compare the New Yorker on inequality and NYC’s subway.

Social network analysis has now been applied to Homer. The social network of characters in the Odyssey has been analyzed by PJ Miranda and colleagues, who conclude that “this social network bears remarkable similarities to Facebook, Twitter and the like”.

How do cats spend their time? As a cat owner, I know very well how pressing this question is. The BBC, in collaboration with the Royal Veterinary College, has investigated, presenting a day in the life of nine cats in the form of a dynamic map.

Responding to the opening of the trial of Andre Cornet’s alleged killers on Monday, Arnaud Wéry has mapped 15 years of crime stories in the Huy-Waremme region and blogged about how he did it.

How will climate change affect flora and fauna in Spain? The World Wildlife Fund has created an interactive application allowing the projections of two models of climate change to be compared on a map of Spain, showing how living areas for plant and animal species will change in the years to come.


No new data source this week is more exciting than the International Consortium of Investigative Journalists’ release of a database of over 100,000 offshore companies, trusts and funds, the Offshore Leaks Database. The database is “part of a cache of 2.5 million leaked offshore files ICIJ analyzed with 112 journalists in 58 countries” (source). Learn how it was built on the ICIJ blog, and read more about why it matters.

UK Cabinet Office Minister Francis Maude has announced “new commitments on open data that will give citizens detailed information on the operations of charities and companies”. Data held by the UK Charity Commission is slated to be made freely available by March of next year.

In response to the G8’s open data charter, Canada has launched a new data portal. The usefulness of this new portal is likely to be compromised by the serious budget cuts suffered by Statistics Canada under the Harper administration.


Data roundup, June 12

Neil Ashton - June 12, 2013 in Data Roundup


Photo credit: Kris Krüg



The World Wide Web Foundation and its partners (including the OKFN) have launched the Global Open Data Initiative, “a champion for Open Data globally”, aiming to create and promote a unified set of guidelines assisting governments in the use of open data.

Today wraps up the second Open Economics Workshop, an Open Knowledge Foundation event hosted at MIT. As reported on the Open Econ blog, the event brought together some 40 economists and social scientists to discuss research data sharing and transparency in economics.

Data-Crunched Democracy was a conference bringing together journalists and analysts “to cut through the hype and understand the use of voter data in campaigns”. Derek Willis reflects on “the lessons for journalists covering campaigns that engage in the use of data” in an in-depth blog post.

I’ve known more than one graduate student in the social sciences who has described Excel’s pivot tables as “the best thing ever”. Pivot tables are a powerful tool for data exploration. A new blog post by Abbott Katz explains how you can begin using pivot tables in your own work.
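If you have never met a pivot table, the core idea is small enough to sketch in a few lines of Python: group rows by two fields and aggregate a third. This is only an illustration of the concept, not Katz’s Excel walkthrough, and the records and the pivot_sum helper are hypothetical, invented for this example.

```python
from collections import defaultdict

# Hypothetical records invented for illustration: (region, year, amount).
records = [
    ("North", 2012, 10),
    ("North", 2013, 15),
    ("South", 2012, 7),
    ("South", 2013, 3),
    ("North", 2013, 5),
]


def pivot_sum(rows):
    """Sum amounts for each (region, year) pair, laid out as region -> year -> total."""
    table = defaultdict(lambda: defaultdict(int))
    for region, year, amount in rows:
        table[region][year] += amount
    # Convert the nested defaultdicts to plain dicts for easier inspection.
    return {region: dict(years) for region, years in table.items()}


pivot = pivot_sum(records)
for region in sorted(pivot):
    print(region, pivot[region])
```

Regions play the part of the row labels and years the column labels. Swapping the sum for a count or an average gives the other summaries a spreadsheet pivot table offers.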

Real-time and historical data on United States drone strikes is now available through a public API that makes it easy to “build data visualizations about covert war […] in Pakistan, Yemen, and Somalia”.

Learn about pandas, “one of the best, and most important, libraries for data analysis in Python”, and how it can be used to do serious data analysis using SQL queries in a new blog post by John Beieler.

Bayesian Methods for Hackers, an introduction to Bayesian probability theory in practical and Pythonic terms, has appeared on the data roundup before. Now a draft of the PDF version of the book has been released. Check out this “understanding-first” introduction to “the natural approach to inference”.

Check out Source’s journalism code event roundup, June 10, for a worldwide selection of hackathons and conferences in data-driven and computer-assisted journalism.


So the NSA has all your metadata. What can they do with it? German Green Party politician Malte Spitz sued to obtain six months of his own phone data and made it available to Zeit Online, who combined it with publicly available data to reconstruct six months of Spitz’s life. You can read more about the project and download its data. You can also check out a timeline of the NSA’s domestic spying from the Electronic Frontier Foundation.

ProjectPolicy aims to “unify, organize and visualize the world’s government information onto one intuitive web platform”. Its take on San Francisco is available as a demo of what it aims to do.

America’s Worst Charities presents a year’s investigation by the Tampa Bay Times and the Center for Investigative Reporting into the misuse of charity funds by American charities. It prominently features an interactive presentation of the data, some of which is also available for download in CSV form.

The central limit theorem is a statistical theorem whose scientific importance cannot be overstated. A new visualization of the theorem constructed with D3.js, explained in terms of coin flips, makes it easier to develop intuitions about its meaning.
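The same coin-flip intuition can be tried out in a few lines of Python. This is a minimal simulation sketch, unrelated to the D3.js visualization’s own code; the sample_means helper, the seed, and the sample sizes are arbitrary choices made for illustration.

```python
import random
import statistics


def sample_means(n_flips, n_samples, seed=0):
    """Return the fraction of heads in each of n_samples runs of n_flips fair coin flips."""
    rng = random.Random(seed)
    return [
        sum(rng.random() < 0.5 for _ in range(n_flips)) / n_flips
        for _ in range(n_samples)
    ]


means = sample_means(n_flips=100, n_samples=2000)
center = statistics.mean(means)
spread = statistics.stdev(means)

# Theory for a fair coin: the means cluster around 0.5 with a standard
# deviation of sqrt(0.25 / n_flips), which is 0.05 for 100 flips.
within_one_sd = sum(abs(m - center) <= spread for m in means) / len(means)

print("mean of sample means:", round(center, 3))
print("std of sample means:", round(spread, 3))
print("share within one std:", round(within_one_sd, 3))
```

Each individual flip is as un-bell-shaped as data gets, yet the averages pile up around 0.5 in roughly the proportions a normal curve would predict.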

Stamen has put together 3D contour maps of the surface of Mars from data collected by the Mars Orbiter Laser Altimeter. As their blog reports, these maps are “a small gesture of thanks to the scientists who are working hard to do science and communicate with the public despite the stupid sequester”.

The latest work from Accurat presents the lives of ten famous painters in the form of beautiful timelines. Each timeline presents the artist’s personal history in a manner sensitive to the artist’s style.

Check out this roundup of Datenjournalismus im Mai 2013 (German) for a collection of some of last month’s best examples of data-driven journalism.


In a move that is unlikely to distract attention away from the PRISM scandal, the Obama administration has released a portal calling out climate science deniers.

Open Nepal has launched Open Data Nepal, a project “not about creating yet another data repository in the web but an effort to curate and disseminate data that is already available in public domain”.

Canada’s Global News has obtained, with great difficulty, a database of over 61,000 Albertan oil spill incidents spanning the period from 1975 to 2013, and they are “now offering this information to the public for download”. This is certainly one of the most important datasets to see the light of day in Alberta, especially given that Alberta’s open data catalogue has been described as perhaps “the most useless […] in the history of open data catalogues”.

The Los Angeles Times has acquired and released a database of the salaries of Department of Water and Power employees in 2012, finding that their “average total pay … is more than 50% higher than other city employees”. You can download the dataset and see for yourself.

Freddie Mac, a major US mortgage backer, is “standardizing its processes and making raw data more easily accessible to the public”. This move towards “transparency” appears to be part of a process of privatization of government-sponsored mortgages, “using our data to attract private capital”.


Data Roundup – 5th June

John Murtagh - June 5, 2013 in Data Roundup


Tools, Courses, and Events

Booking is now open for the UK Data Service one-day workshop on using large-scale survey data for research, to be held in Manchester on 24 June 2013. The workshop is aimed at those with little or no experience in the secondary analysis of survey data available from the UK Data Service. It will introduce attendees to the skills required to find and access survey data and to carry out basic secondary analyses with it.

European Open Data Week features 3 conferences and 14 workshops, running from Tuesday, June 25 to Friday, June 28, 2013 in Marseille, France, with a specific focus on harmonising open data policies.

The 7th Open Data Ireland Meetup took place in Dublin on May 28 on the theme “Give Us Our Health Data” and was attended by around 40 people. More details are available from the blog of the event.

Open Nepal Week is running in Kathmandu from June 2 to June 6. It is a partnership-driven series of events spread over five days that aims to raise awareness about open data in Nepal and to devise mechanisms to help citizens access such data. Check out the website.

The Policy RECommendations for Open Access to Research Data in Europe (RECODE) project has newly launched, and its work plan and deliverables are available online. #RECODE

Victoria Stodden of Columbia University has given a succinct and useful presentation entitled “Why Public Access to Data is So Important (and why getting the policy right is even more so)”, which is available on their website.

On June 18 the Open Government Partnership (OGP) is holding a webinar entitled “Strengthening the Demand for and use of open data initiatives”, running from 10:00 – 11:00 AM EST | 14:00 – 15:00 GMT.

Community technologists have launched new open government tools for the Oakland, California area.

Siri Anderson has written a useful blog post advising those who have data sets on how to publish them: “3 Guidelines for Publishing Your First Open Data Sets”.

The National Day of Civic Hacking took place on June 1-2, 2013, in cities across the United States. For civic hackers, the Neighborland API is a resource for local ideas and actions.

Lastly, on the research data front: the first workshop on ‘Linking and Contextualizing Publications and datasets’ will be held on September 26 in Malta, in conjunction with the 3rd International Conference on Theory and Practice of Digital Libraries (TPDL).

Data Stories

Read the story of how detailed real-time data was released by the UK’s rail infrastructure owner (PDF). #opendata

The BBC has reported on stunning visualizations of flight paths across the globe, produced using GIS (and in the creator’s spare time, no less).

The Centre for Sustainable Energy’s most popular news story of the past 12 months is ‘Energy Company Obligation data in a usable format’.

The World Bank Development Data Group (DECDG) and the aid data organization Development Gateway have unearthed data on whether 29 developing countries are meeting their education goals, and have visualized their progress.

An important article published on April 4 reasserts that research data and their use in journal articles lead to an “open data citation advantage”. You can read the pre-print on their website.

Jess Denham, an Interactive Journalism MA student at City University London, has interviewed David Ottewell, Head of the Data Journalism Unit at the Trinity Mirror (Regionals) group of UK Newspapers.

Jonathan Stray has written a blog post on a two-day data journalism workshop he gave in Taiwan, which asks: “How does a country get to open data? What Taiwan can teach us about the evolution of access”. He writes: “Assumptions about government openness vary from country to country. Here are a few lessons a cross-national perspective can bring to the open data movement.”

“Fell in love with data” is an interesting blog post by Enrico Bertini, Assistant Professor at NYU-Poly, on data visualisation success stories, which are often in short supply. The comments are equally interesting. Read it on their website.

And the New York Times in its Technophoria blog has a piece about the struggle to gain access to your own data which is stored (and monetized) by commercial companies like telecoms and utilities. It also details who is making this data available to consumers. “If My Data Is an Open Book, Why Can’t I Read It?” is available on their website.

Data Sources

On Friday, May 31, Germany released the first results of its 2011 census, the first in 24 years and the first since East and West Germany were reunified. See the announcement on their website.

In Canada, the Government of Alberta has joined the open data movement by launching its Open Data Portal. There are already more than 280 data sets on the portal, and you can watch a TV news item on it.

In Australia, the Government of New South Wales (NSW) has drafted an Open Data Policy, now open for public comment, as part of the NSW Government ICT Strategy supporting government transparency, accountability and efficiency. They have also re-launched their Open Data platform using CKAN 2.0.

The Registry of Research Data Repositories has launched, allowing the easy identification of appropriate research data repositories, both for data producers and users. The registry covers research data repositories from all academic disciplines. Information icons display the principal attributes of a repository, allowing users to identify the functionalities and qualities of a data repository. These attributes can be used for multi-faceted searches, for instance to find a repository for geoscience data using a Creative Commons licence. By April 2013, 338 research data repositories had been indexed; 171 of these are described by a comprehensive vocabulary, which was developed with the involvement of the data repository community.

The EC Open Data Portal went online just before Christmas 2012. It is designed to be the open data hub for European Institutions, beginning with data from the European Commission. A recent video lecture by Malte Beyer-Katzenberger, entitled “Towards a European open data infrastructure”, introduces the hub and offers a guide through the portal and the policy that is behind it.

The U.S. Government’s new CKAN open data catalog has just launched, and African governments are now opening open data portals too, including those of Kenya and Ghana.

A database drawing on company registries worldwide, called the Open Database of the Corporate World, has been launched. It currently holds information on 54,196,924 companies. The database uses the Google Refine reconciliation service and allows access to the information as JSON or XML.

There are some new datasets on the history of UK websites available via the UK Web Archive. They have also made a few example tools available, showing how the open data might be used, and these are hosted in their GitHub repository.


Data Roundup May 21

John Murtagh - May 22, 2013 in Data Roundup

Data Roundups contain news on new apps, libraries, and other tools for working with data; newly announced conferences, workshops, hackathons, MOOCs, etc. If you have a data news tip, send it to us at This week’s Data Roundup is brought to you by John Murtagh.

Tools, Courses, and Events

The Big Data in Biomedicine conference kicks off at Stanford today. The event, which will be held at the School of Medicine’s Li Ka Shing Center for Learning and Knowledge, is bringing together leading figures from academia, industry, government and philanthropic foundations to discuss the burgeoning opportunities for mining the vast amounts of biomedical data housed in public databases. Here’s a look at the schedule. The event will be webcast via the Big Data in Biomedicine website.

The next OKF Open Data Meetup, part of the wider community across the world, is in Amsterdam on May 30th. Spread the word and join!

A list of all papers received for the April workshop Open Data on the Web, held in London, is now online. The main topics of the workshop were discoverability; transformation (to other formats); combinations of data from different models (e.g. linked data and CSV); quality assessment and self-description; and extracting human-readable “stories” from data.

The new Open Data Tourism Hack at home, part of the Open Cities project, is offering prizes for the best app and best use of Open Data to help European cities find new ways to manage their big challenges and improve their tourism.

The LinkedUp Project has 39,500 EUR in prize money for the LinkedUp Challenge, a series of three consecutive competitions looking for interesting and innovative tools and applications that analyse and/or integrate open web data for educational purposes.

There is now an Open Data Stack Exchange site in public beta. It’s a question and answer site for developers and researchers interested in open data and is built and run by users as part of the Stack Exchange network of Q&A sites. Its users are working together to build a library of detailed answers to every question about open data.

Data Stories

@iainh_z wrote a guest post on the state of #opendata in medicine for the Open Knowledge Foundation (@OKFN).

The @americangut project is a collective of research scientists interested in exploring global gut microbiota. It is billed as the world’s largest open-source, community-driven effort to characterize the microbial diversity of the global gut (they’re asking for your poo). It has embraced open science: open data, open software, and analysis on GitHub.

On May 22, 2012, at the University of North Texas, a group of technologists and librarians, scholars and researchers, university administrators, and other stakeholders gathered to discuss and articulate best practices and emerging trends in research data management. The resulting Denton Declaration bridges the converging interests of these stakeholders and promotes collaboration, transparency, and accountability across organizational and disciplinary boundaries.

The Guardian’s Data Blog has analysed the key points and recommendations of the Shakespeare review, which looked at open government data in the UK. They also ask whether his open data agenda is shrewd but stuck, with a @guardian live discussion on the #OpenData agenda in practice. They’ve also done a piece on Rohan Silva, “…the man who turned David Cameron onto open data”.

@chris_whong has visualised NYC’s subway turnstiles for @NYCEDC and reveals the hard work (finger clicking) that goes into it. Also from New York, @NYStateCIO has just released open datasets behind the ‘Wineries, Breweries, and Distilleries Map’ for the state of New York.

Data Sources

Via IBM Smarter Cities (@IBMSmartCities): a comprehensive new mapping tool from LSE Cities, an urban studies project of the London School of Economics, makes the varying fortunes of Europe’s urban areas clear. Using Oxford Economics’ European Cities and Regional Forecasts database, the Metromonitor measures the employment and economic growth of 150 of the continent’s largest metro areas against metrics like national growth, population size, and urban typology.

The World Bank has launched a much-improved version of its Open Data Catalog, in which all essential information is available in a one-page list, sorted by name, popularity, or date. You can access bulk downloads, APIs or query tools from the same page with a single click. And you can see all the available metadata without having to visit separate pages on various sites.

The Government of South Australia has made its data freely available by releasing more than 100 datasets from 16 organisations, covering a range of topics including transportation, environment, education, crime and even baby names. Also from Down Under, the Australian Archaeological Association has compiled a freely available set of Microsoft® Excel® databases listing radiocarbon, luminescence and uranium series ages from archaeological sites across Australia.

Academics at Stanford University have made a free 437-page PDF of the book “Mining of Massive Datasets” available for download (the hard copy is published by Cambridge University Press).

A new service (ODIN) allows researchers to add their research datasets, and other content with DataCite DOIs, including all figshare content, to their ORCID profile by integrating with the DataCite Metadata Store. (ORCID provides a persistent digital identifier that distinguishes researchers from each other.)

Pan-European datasets have been released via the ESPON Database Portal, which supplies different users (researchers, policy makers and stakeholders at regional and local level) with data, indicators and tools that can be used for European territorial development and cohesion policy formulation, application and monitoring at different geographical levels. The data included in the ESPON Database comes mainly from European institutions such as EUROSTAT and EEA, and from all ESPON projects.


Data roundup, May 15

Neil Ashton - May 15, 2013 in Data Roundup

We’re rounding up data news from the web each week. If you have a data news tip, send it to us at

Photo credit: Josh More



In “the most significant piece of CKAN news since the project began”, version 2.0 of CKAN has been released. CKAN, the Open Knowledge Foundation’s flagship, has been updated with a new API, new design features, overhauled documentation, and more.

Spain’s first data journalism conference, las I Jornadas de Periodismo de Datos y Open Data en España, will be held from May 24 – 26 by the Spanish chapter of the OKFN. Held simultaneously in Barcelona and Madrid, the conference’s events will include workshops and a hackathon.

Last week, the third Data Harvest Conference in Brussels brought together data journalists from across Europe. A blog post from the International Consortium of Investigative Journalists explains what you missed if you weren’t there.

Learn how to find stories in data at a one-day introduction to open data for journalists. The course will take place this September 19 at the Open Data Institute in London. Lisa Evans of the OKFN and Kathryn Corrick of the ODI will preside.

The first issue of Network Science, a new journal for the emerging discipline “using the network paradigm […] to inform research, methodology, and applications from many fields across the natural, social, engineering and informational sciences”, is available for free. For a sense of the growing importance of networks and graphs to data analysis, read a GigaOM article on the rise of the graph in big data.

Markov networks are a powerful way to represent multivariate probability distributions. A new blog post shows you how to work with Markov networks in Haskell, the gloriously algebraic programming language, using the HLearn library. This approach exploits the networks’ algebraic structure to get “online and parallel training algorithms ‘for free’”.
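To give a feel for why algebraic structure yields online and parallel training, here is a minimal Python sketch. It is not HLearn's API and does not model a Markov network; it assumes a much simpler stand-in, a one-dimensional Gaussian whose sufficient statistics form a monoid. The names `GaussianStats`, `merge`, and `train` are hypothetical.

```python
from dataclasses import dataclass
from functools import reduce


@dataclass(frozen=True)
class GaussianStats:
    """Sufficient statistics for a 1-D Gaussian: count, sum, sum of squares."""
    n: int = 0
    total: float = 0.0
    total_sq: float = 0.0

    def merge(self, other):
        # Associative combine with GaussianStats() as identity: a monoid.
        return GaussianStats(self.n + other.n,
                             self.total + other.total,
                             self.total_sq + other.total_sq)

    @property
    def mean(self):
        return self.total / self.n

    @property
    def variance(self):
        # Population variance recovered from the accumulated sums.
        return self.total_sq / self.n - self.mean ** 2


def train(points):
    """Batch training: fold single-point models together with merge."""
    return reduce(GaussianStats.merge,
                  (GaussianStats(1, x, x * x) for x in points),
                  GaussianStats())


data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
whole = train(data)
# "Parallel" training: train on two chunks separately, then merge the models.
parallel = train(data[:3]).merge(train(data[3:]))
# "Online" training: fold one new observation into an existing model.
online = train(data[:-1]).merge(GaussianStats(1, data[-1], data[-1] ** 2))
```

Because `merge` is associative, the data can be split any way you like, and the merged model matches the batch model. That is the sense in which the parallel and online algorithms come for free.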

Ben Frederickson, tackling the twin challenges of learning JavaScript and D3.js, shows how to create “the simplest possible visualization [he] could think of”, the Venn Diagram. His blog post explores the challenges of the task and gives examples of interesting results.

Data for Radicals is on a roll. Lisa Williams has released yet another excellent and “absurdly illustrated” guide to data wrangling, this time explaining sortable, searchable online data tables.


The cicadas have arrived! The East Coast of America is playing host to swarms of 17-year cicadas. The Radiolab Cicada Tracker project, which has been gathering data to predict the cicadas’ arrival from volunteers using $80 sensors, is starting to show results on its map. The cicadas have made it to Manhattan!

The GDELT dataset is beginning to bear fruit. A new interactive from New Scientist uses GDELT data to construct a hexagon-binned map of violent events in Syria since 2011. “The resulting view suggests that the violence has subsided in recent months, from a peak in the third quarter of 2012.”

Following up on their map of slurs on President Obama, Floating Sheep has constructed a map of “a broader swath of discriminatory speech in social media, including the usage of racist, homophobic and ableist slurs”. Their work draws on over 130,000 geotagged tweets, tagged for offensiveness by human annotators. As observed by Jen Lowe, the Twitter conversation around the map “is a fantastic reality check on the data”.

Data on medical provider charges across the United States has been released, showing “significant variation across the country and within communities in what hospitals charge for common inpatient services”. The Washington Post analyzes the data and finds that “even on the same street, hospitals can vary by upwards of 300 percent in price for the same service”.

Also from the Washington Post comes a profile of baseball player Bryce Harper’s swing, annotated with remarkably lucid informational graphics.

How does ESPN discuss white and non-white quarterbacks? This question is investigated by Trey Causey (who you may remember from last week’s investigation of R-help’s cruelty) in an analysis of more than 36,000 ESPN articles that uncovers a number of interesting asymmetries.

In another investigation into discourse asymmetries, UNC’s Neal Caren asks: does the New York Times write differently about men and women? The post shows how to explore this question using Python and its natural language processing toolkit NLTK.
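As a rough illustration of this kind of analysis, and not the post's actual code, the sketch below buckets sentences by the gendered words they contain and counts the remaining words in each bucket. It assumes a crude regular-expression tokenizer in place of NLTK's tokenizers, and the marker word lists and the function name `gendered_counts` are illustrative only.

```python
import re
from collections import Counter

# Illustrative marker lists; a real analysis would use fuller ones.
MALE = {"he", "him", "his", "mr"}
FEMALE = {"she", "her", "hers", "ms", "mrs"}


def tokenize(sentence):
    # Crude lowercase word tokenizer; NLTK's tokenizers are more careful.
    return re.findall(r"[a-z']+", sentence.lower())


def gendered_counts(sentences):
    """Count words in sentences with only male or only female markers."""
    male_words, female_words = Counter(), Counter()
    for sentence in sentences:
        tokens = tokenize(sentence)
        has_male = any(t in MALE for t in tokens)
        has_female = any(t in FEMALE for t in tokens)
        if has_male and not has_female:
            male_words.update(t for t in tokens if t not in MALE)
        elif has_female and not has_male:
            female_words.update(t for t in tokens if t not in FEMALE)
    return male_words, female_words


sample = [
    "He said the budget was final.",
    "She said the budget was late.",
    "Mr. Smith said he would vote.",
    "They met on Tuesday.",
]
male_words, female_words = gendered_counts(sample)
```

Comparing the relative frequency of each word across the two counters then suggests which words skew toward one group, which is the kind of asymmetry such an analysis looks for.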

How much money is China investing in Africa? Aid Data China, in its first application of its “media-based data collection” methodology for “systematically collect[ing] open-source information about development finance flows from suppliers that do not publish their own project-level data”, has created a database of Chinese finance flow into Africa, encompassing over 1,700 projects.

Check out the latest edition of the weekly VisualLoop Data Viz News for a gigantic collection of data visualization news, articles, and resources.


On May 9, the U.S. government issued an executive order and memorandum “making open and machine readable the new default for government information”. To paraphrase Joe Biden, this is a rather big deal. The OKFN’s Rufus Pollock unpacks the executive order in a blog post, Joshua Tauberer takes a close look, and David Eaves offers his thoughts. Remarkably, the US Open Data Policy has been drafted and released on GitHub.

As explained in “Data Stories” above, data on United States medical provider charges is now available for download in Excel and CSV form.

A new portal for Latin American datasets, OpenData Latinoamérica, is a central repository bringing together the continent’s scattered data sources. A blog post (in Spanish) explains the repository’s significance to data journalism in South America.

A portal for the U.S. State of Maryland has been launched, “offering state data not accessible to the public before”, including “handgun permits, vendor payments, vehicle accidents, licensed veterinary clinics and per capita electricity consumption” (source).


Data roundup, May 8

Neil Ashton - May 8, 2013 in Data Roundup

We’re rounding up data news from the web each week. If you have a data news tip, send it to us at

Photo credit: George Lu



“Data is hard. Really, really hard,” says Dan Sinker, “[and] one of the hardest parts is cleaning.” That’s why OpenNews is announcing two code sprints for data cleaning. The sprints will work on developing Dedupe and the FMS parser.

Yes, data is hard. Luckily, the School of Data is now ready to answer your questions about data with a new service that draws on the expertise of the School of Data community to clarify the problems that come with working with data.

The Ghana Data Bootcamp, taking place May 27 to 29 in Accra, aims to bring together journalists, web programmers, and activists to foster the use of public data in Ghana. Participants will build data-driven content using their new data literacy in competition for a seed grant of $1,000. Registration is open, and free seats are extended to “journalists, developers, and digital creatives”.

Reflections are pouring in from the Open Knowledge Foundation’s School of Data Journalism 2013, which ran from April 24 – 27. Moran Barkai has written a piece looking back in awe at the event. Ten ideas to remember (French) from SDJ2013 have also been compiled.

The G8 Conference on Open Data for Agriculture was held last week in Washington, D.C., and included the announcement of a food, agriculture, and rural data portal. Recorded webcasts from the conference are available on the World Bank’s website.

Skeptical about the hype surrounding “big data” and “data science”? Good! Join others at the first-ever NYC Data Skeptics Meetup this June 19. The meetup aims to foster a critical perspective on “mathematical, ethical, and business aspects of data”.

A recording of Jonathan Corum’s much-loved keynote address at the Tapestry Conference is now available online.

Lisa Williams of Data For Radicals celebrates the legalization of gay marriage in Rhode Island by providing you with an absurdly illustrated guide to your first data-driven timeline. The guide walks you through the process of using Timeline.js to create an interactive timeline from start to finish.

Learn Pandas, the Python data analysis library. “Learn Pandas” is a collection of Python notebooks—updated to mark the recent 0.11 release of Pandas—organized into lessons to help you get up and running with Pandas.

Once you’ve learned Pandas, learn bearcart, a Python library “for creating Rickshaw visualizations with Pandas timeseries data structures”. In other words, bearcart does for time series graphs what Vincent does for Vega: it makes it easier to get from code to visualization.


A new paper by Harvard researchers explores the nature of Internet censorship in China (pdf link). Analyzing the content of millions of censored social media posts from over 1,400 different services, the researchers arrive at a surprising new theory of Chinese internet censorship.

This week’s data roundup period begins on May Day. Business Insider CEO Henry Blodget commemorates the occasion by graphing the plight of the worker under modern economic conditions. Felix Salmon reflects on the graphs and their depressing implications.

The UK local elections provide “an opportunity to put some of the open data released by UK local and county council elections to a practical test”. A School of Data blog post provides a detailed first look at “proving the data” with exploratory data visualization and mapping, and blog posts by Tony Hirst round up live election data initiatives and look to see whether election data has a story to tell.

Car2go is a car-sharing service offering one-way rental cars charged by the minute. Disposable Cars tracks these momentary rentals in their last three days of travels around Portland in the form of a time-evolving map.

Bolides is an animated visualization of the last 1,152 years of meteorite sightings, beginning in Nogata, Japan, and ending in Battle Mountain, USA. The rain of destruction unfolding across centuries is strangely relaxing—but watch out for the Sikhote-Alin meteorite of 1947!

The history of San Francisco place names is the subject of a new interactive map made by Noah Veltman. Zoom and click through the map for an amazing guided tour through the onomastics of San Francisco.

Another site is also rounding up data journalism news from the web and posting the results on a monthly basis. Check out its April data journalism roundup, which links back to the School of Data’s roundup.


The CMU movie summary corpus comprises 42,306 movie plot summaries with aligned metadata extracted from Wikipedia and Freebase, accompanied by summaries preprocessed with Stanford CoreNLP. It is the basis of a forthcoming computational linguistics research paper, “Learning Latent Personas of Film Characters”.

The Center for Investigative Reporting has released an API for data related to its backlog of Veterans’ Affairs disability claims, making it easier to reuse the CIR’s data to produce work like its interactive map of claims backlogs.

The Sunlight Foundation has opened a new API user hub to focus more attention on its sizeable base of API users. The hub provides an overview of Sunlight APIs and a showcase of their associated projects.

Norway is slated to release its topographic datasets to the public. These include, according to Bjørn Sandvik, “topographic datasets at 1:50,000 scale […] together with address, road and cadastre data”.

A digital collection of over 38,000 historical maps has been released by the Digital Public Library of America. These maps are accessible through the DPLA API, as well as through the DPLA portal.


Data roundup, May 1

Neil Ashton - May 1, 2013 in Data Roundup

We’re rounding up data news from the web each week. If you have a data news tip, send it to us at

Photo credit: Anonymous



If you missed out on the School of Data’s “School of Data Journalism 2013” workshop at the International Journalism Festival, don’t let that stop you from learning from it. Michael Bauer’s social network analysis tutorial will teach you how to use Gephi for journalism in the comfort of your own home, and Gregor Aisch’s tutorials will teach you how to create charts with Datawrapper and how to analyze datasets with Tableau Public. And there’s more (much, much more) on

In other news, the Spanish translation of the Data Journalism Handbook has been released. The handbook is a free and open introduction to the world of data-driven journalism through case studies and discussions of methods. This translation, produced by Argentina’s La Nación, will help foster data journalism in the Spanish-speaking world.

The 72 finalists for the 2013 Data Journalism Awards were announced on Saturday, selected from a pool of over 300 applicants. The eight winners will be announced in June, and a total of €15,000 will be awarded.

The Global Data on Events, Location, and Tone database (GDELT), a “CAMEO-coded data set containing more than 200-million geolocated events with global coverage for 1979 to the present”, has been receiving a great deal of hype lately. What can be done with GDELT? Recent blog posts by Rolf Fredheim provide preliminary explorations: mapping GDELT in R and experiments in Python and D3.

Much has already been written about D3.js’s mechanism of “selections”. But a new post by D3.js creator Mike Bostock is perhaps the deepest explanation of D3.js selections yet published, promising to “dispel any magic and help you master data-driven documents”.

Are you using Vincent to help you generate data-driven graphics from Python yet? No? Perhaps this new blog post showing you how you can use Vincent to create a map viz in less than 10 lines of Python will convince you to start.


R-help, a mailing list for dealing with problems with the statistical programming language R, has a somewhat scary reputation. Some have wondered: has R-help gotten meaner over time? Trey Causey investigates, with results that are “surprising, but [have] some simple sociological explanations”.

Stephen Wolfram shows off the power of Mathematica and his eponymous programming language by analyzing donated Facebook data. The result is a detailed and illustrated exploration of the demographics of Facebook.

TweetMap ALPHA lets you play with a dataset of 95 million tweets, querying them by time, space, and keyword and viewing the results on an interactive world map. The map both displays individual tweets as dots and aggregates them into a heatmap.

Dzhokhar Tsarnaev’s social media activity continues to haunt investigators. This Digg post features a network graph of Tsarnaev’s Twitter connections and digs into the network’s topology and its implications.

National Geographic published a story on a quest to find and photograph all 39 birds of paradise in December of 2012, accompanied by a spectacular graphic illustrating the birds’ relationships. A new blog post on National Infographic explores the evolution of the graphic.

Explore the density of underwater grasses in Chesapeake Bay with a beautiful new visualization. The map draws on decades of bathymetric data to show the ebb and flow of vegetation living in the Bay.

Where do mathematicians go when they graduate? A guest post by Kaisa Taipale digs into arXiv data using R and Circos to discover trends, across time and by subfield, in the emigration patterns of mathematics PhDs.

Beer Mapper is an iPad app that presents your beer preferences in the form of a heat map of “beer space”. The inner workings of the app, which implements Kevin Jamieson’s research on active ranking, are explained in detail.


A year after its announcement, WikiData has been born. WikiData serves “the over 280 language versions of Wikipedia as a common source of structured data”, published under a Creative Commons public domain license. The Guardian and GigaOM provide further details.

The United Kingdom’s Land Registry has announced that it will be releasing a number of free datasets in the coming months. This includes historical house price index data next month and historical price paid data the month after that.

The city of Buenos Aires has opened a new data portal. The portal, built with CKAN, provides RESTful API access to the city’s transparency data and links to apps built with the data.
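For readers who have not used a CKAN portal's API before, the sketch below builds request URLs for CKAN's action API (`package_list` and `package_search` under `/api/3/action/`). The base address is a hypothetical placeholder rather than the Buenos Aires portal's real URL, the helper name `ckan_action_url` is made up for illustration, and a given portal may expose a different API version.

```python
from urllib.parse import urlencode, urljoin


def ckan_action_url(base_url, action, **params):
    """Build a URL for a CKAN action API call such as package_search."""
    url = urljoin(base_url.rstrip("/") + "/", f"api/3/action/{action}")
    return f"{url}?{urlencode(params)}" if params else url


# Hypothetical portal address; substitute the real portal's base URL.
BASE = "https://data.example.org"

# Search datasets by keyword, limiting the number of results.
search_url = ckan_action_url(BASE, "package_search", q="transporte", rows=5)

# List the names of all datasets on the portal.
list_url = ckan_action_url(BASE, "package_list")
```

Fetching either URL with any HTTP client should return a JSON object; in CKAN's action API the payload sits under a `result` key alongside a `success` flag.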

The city of Raleigh in North Carolina has launched its official data portal, built with Socrata. Popular datasets include fire incidents and crime data from 2012.


Data roundup, April 24

Neil Ashton - April 24, 2013 in Data Roundup

We’re rounding up data news from the web each week. If you have a data news tip, send it to us at

Photo credit: Jason Kuffer



The annual Perugia International Journalism Festival begins today, and the School of Data has helped organize a School of Data Journalism for the festival. Click through to see the School’s panels and workshops, and follow #DDJSchool on Twitter to see discussions of the events.

Crowdcrafting has been launched. Crowdcrafting is a platform built on PyBossa crowdsourcing technology which makes it easy to recruit participants in community science projects en masse.

What are the three elements of successful data visualizations? According to Jim Stikeleather’s post on the Harvard Business Review blog, they reflect understanding of the audience, they set up a clear framework, and they tell a story.

But what does “data storytelling” mean, anyway? Many people have been asking that question lately. Zach Gemignani has put together a comprehensive collection of data storytelling resources which address that question and many, many more.

Marcelo Träsel is crowdsourcing a database of data journalism websites, and he needs your help. Contribute your links and help fix “the absence of a comprehensive database or other resource listing websites and weblogs on visualization, investigative techniques, CAR, and all other newsrooms practices labelled as data journalism”.

Ciara Byrne “signed up to spend a month of immersion in data, hoping to emerge a newly minted data scientist”, and her lessons from a crash course in data science tell the tale of what she learned.

Are you, like many programmers, too lazy to use relational databases for your day-to-day work with small-to-medium datasets? dataset is a Python library that aims to change that. Its motto: “Because managing databases in Python should be as simple as reading and writing JSON files.”
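To make the "as simple as JSON" idea concrete, here is a toy sketch using only Python's standard `sqlite3` module. It is not the dataset library's API or implementation; the class `DictTable` and its methods are hypothetical, and the sketch only illustrates the general approach of storing dicts as rows and growing columns on demand. Identifiers are interpolated into SQL without validation, so this is unsuitable for untrusted input.

```python
import sqlite3


class DictTable:
    """Toy table that grows columns on demand, storing rows given as dicts."""

    def __init__(self, conn, name):
        self.conn, self.name = conn, name
        conn.execute(
            f'CREATE TABLE IF NOT EXISTS "{name}" (id INTEGER PRIMARY KEY)')

    def _columns(self):
        # PRAGMA table_info returns one row per column; index 1 is its name.
        info = self.conn.execute(f'PRAGMA table_info("{self.name}")')
        return {row[1] for row in info}

    def insert(self, row):
        # Add any column the table has not seen yet, then insert the row.
        for key in row.keys() - self._columns():
            self.conn.execute(f'ALTER TABLE "{self.name}" ADD COLUMN "{key}"')
        cols = ", ".join(f'"{k}"' for k in row)
        marks = ", ".join("?" for _ in row)
        self.conn.execute(
            f'INSERT INTO "{self.name}" ({cols}) VALUES ({marks})',
            list(row.values()))

    def find(self, **filters):
        # Return matching rows as dicts; no filters means all rows.
        where = " AND ".join(f'"{k}" = ?' for k in filters) or "1"
        cur = self.conn.execute(
            f'SELECT * FROM "{self.name}" WHERE {where}',
            list(filters.values()))
        names = [d[0] for d in cur.description]
        return [dict(zip(names, r)) for r in cur]


conn = sqlite3.connect(":memory:")
people = DictTable(conn, "people")
people.insert({"name": "Ada", "age": 36})
people.insert({"name": "Grace", "city": "NYC"})  # new column added on the fly
grace = people.find(name="Grace")
```

No schema is declared up front: the second insert introduces a `city` column, and earlier rows simply hold `NULL` there.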

If you like Python and have been enjoying the new ease of creating data-driven graphics with Vega, then you’ll love being able to combine the two using Vincent, a Python library that “takes Python data structures (tuples, lists, dicts, and Pandas DataFrames) and translates them into Vega visualization grammar”.

The PLOS Text Mining Collection has gone live. This collection brings together reviews and research published in PLOS journals on the subject of text mining, the art of the retrieval and analysis of information from unstructured text. This initial launch only covers the past two years of publications but will grow over time.

Lisa Williams of Data For Radicals will be holding a data visualization workshop for beginners in Detroit on June 20 – 23. Participants in the workshop will get a personalized introduction to “maps, charts, graphs, and data visualizations”.


The Boston bombings have generated a large number of interactives and visualizations. Visual Loop has collected some of the better ones in the 32nd edition of Interactive Inspiration.

What, meanwhile, do we know about Dzhokhar Tsarnaev from his social media use? Quartz reflects on the way “we reveal immense amounts of information about ourselves publicly, unthinkingly, and sometimes involuntarily”.

If you haven’t been following Données Fleuries, Nicolas Patte’s weekly (French-language) review of excellence in data visualization, this week’s edition is a good time to start. DF #11 touches on the use of Raphaël.js, Twitter cartograms, and more.

New York City’s subways are an interesting window into wealth inequality in the city. The New Yorker has produced an interactive infographic showing how each subway line winds its way through the peaks and valleys of NYC’s wealth; Noah Veltman has produced a neat visual variant on the New Yorker map. These maps will remind many of another map, which illustrated the impact of fare hikes on different NYC neighborhoods.

Moritz Stefaner takes a close look at gender balance in data visualization conferences. There is, he concludes, still a ways to go before a real balance is achieved.

Canada’s Global News has finally processed 318 PDFs of census survey responses from Toronto students and is producing an ongoing series of works based on the data. One recent piece asks: how safe do Toronto students feel?


The Philippine Agriculture Department has opened a new data portal, the Department of Agriculture Accountability Network (DAAN). DAAN aims to promote public awareness of the DA’s projects and to increase transparency with respect to its funding and other relevant data.

BISON (ostensibly standing for “Biodiversity Information Serving Our Nation”) is a new portal giving access to United States species occurrence data, with more than 100 million species occurrence records in its datasets.

In time for Earth Day, the federal government of Canada and provincial government of Alberta have launched a new portal for environmental data from Alberta’s oilsands. The portal is intended to address criticisms of the government’s secrecy about the environmental impact of the oilsands.

The source code for the data portal of the Spanish government’s Catalogue of Public Information has been made available in an open form. The portal architecture can now be freely reused for new open data projects.

