
Video: School of Data Summer Camp

Anthimos Ioannidis - September 17, 2014 in Data Stories, School_Of_Data

It is with great pleasure that we republish a video directly related to the School of Data. The five-minute video presents the work of the School of Data, along with the members of its network and its partners. Watch it and get a taste of the School of Data Summer Camp 2014.

A few words about the Data Summer Camp

The School of Data Summer Camp brought School of Data partners together with members of the local initiatives, in order to lay the groundwork for a productive and eventful year. Since last year, websites for the local chapters in Greece, Spain, France and Portugal have been set up, and a variety of activities have taken place at the local level around the world. On a more individual level, the School of Data team has worked with partners from all over the world to spread and strengthen data-handling skills in the communities they belong to. The aim was for each participant to return home equipped with new skills and a clear vision of the School of Data's shared path forward!


A News Odyssey

Anastasios Ventouris - April 29, 2014 in Data Journalism, Data Stories


The first online newspaper was published in 1974, yet it is only in the last 10 years that online news has really begun to flourish. Today there are thousands of news websites keeping readers informed around the clock.

Thanks to new technology and the rise of search engines such as Google, anyone can look up the information they need. But how can we trace the real source of a news item? When and where did it all start, and what path did the story follow?

This is News Odyssey!

News Odyssey 1

News Odyssey is a pioneering algorithm for news sites that collects and groups stories on the same topic from many leading news websites around the world, among them CNN, Reuters, the BBC, the Washington Post, The Guardian and others.

News Odyssey is a tool for both research and journalistic purposes. It allows professional and amateur journalists to trace the course of a news item over a given period of time across the various news sites. In this way they can build their own stories, grounded in real events, saving valuable research time and gaining quality, since their data is not based on a single news site.

To visualise the path of a news item, the Timemapper tool was used. It is an interactive timeline that lets the reader browse the headlines of related articles and study how they spread by country and by news site, so you can see which site published a story first and how it travelled from one part of the planet to another. Timemapper was created by OKFN Labs and adapts automatically to desktop computers, tablets and smartphones.

News Odyssey 2

But News Odyssey's functionality does not stop there. Besides the innovation and convenience it offers the journalist, it also improves the reading experience for the reader. Using the DBpedia Spotlight tool, it enriches the text of a news item with links that let the reader jump straight to the corresponding Wikipedia entries, or read an inline summary for words of high semantic weight in the text. There is no longer any need to google terms such as scientific concepts, names, regions, cities and so on.
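
To give a flavour of the kind of enrichment involved, here is a minimal Python sketch that sends a snippet of text to the public DBpedia Spotlight annotation endpoint. This is illustrative only and is not the News Odyssey code; the endpoint URL, the confidence parameter and the response field names are assumptions based on the public API documentation.

# Illustrative sketch, not the News Odyssey code: annotate text with DBpedia Spotlight.
# Assumes the public endpoint below is available; check the Spotlight docs for details.
import requests

def annotate(text, confidence=0.5):
    resp = requests.get(
        "https://api.dbpedia-spotlight.org/en/annotate",
        params={"text": text, "confidence": confidence},
        headers={"Accept": "application/json"},
    )
    resp.raise_for_status()
    # Each resource maps a phrase in the text to a DBpedia (and hence Wikipedia) entry
    return [(r["@surfaceForm"], r["@URI"])
            for r in resp.json().get("Resources", [])]

print(annotate("The European Central Bank raised interest rates in Frankfurt."))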

Spotlight

More information at http://newsodyssey.okfn.gr/


Exploratory Data Analysis – A Short Example Using World Bank Indicator Data

Tony Hirst - July 7, 2013 in Data Stories, HowTo

Knowing how to get started with an exploratory data analysis can often be one of the biggest stumbling blocks if a data set is new to you, or you are new to working with data. I recently came across a powerful example from Al Essa/@malpaso where he illustrates one way in to exploring a new data set – explaining a set of apparent outliers in the data. (Outliers are points that are atypical compared to the rest of the data – in this example, by virtue of taking on extreme values compared to other data points collected at the same time.)

The case refers to an investigation of life expectancy data obtained from the World Bank (World Bank data sets: life expectancy at birth*), and how Al tried to find what might have caused an apparent crash in life expectancy in Rwanda during the 1990s: The Rwandan Tragedy: Data Analysis with 7 Lines of Simple Python Code
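
If you would like a feel for the Python side of the exploration without opening Al's notebook, a minimal pandas sketch along the following lines shows the idea. The file name and column layout are assumptions about how you saved the World Bank indicator, not the notebook's actual code.

# Minimal sketch (not Al Essa's notebook): plot Rwanda's life expectancy over time.
# Assumes life_expectancy.csv has a "Country Name" column plus one column per year.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("life_expectancy.csv")
year_cols = [c for c in df.columns if c.isdigit()]             # e.g. "1960" ... "2011"
rwanda = df.loc[df["Country Name"] == "Rwanda", year_cols].T    # years become the index
rwanda.index = rwanda.index.astype(int)

rwanda.plot(legend=False, title="Life expectancy at birth, Rwanda")
plt.ylabel("Years")
plt.show()   # the sharp dip in the early 1990s is the outlier under investigation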

*if you want to download the data yourself, you will need to go into the Databank page for the indicator, then make an Advanced Selection on the Time dimension to select additional years of data.

world bank data

The environment that Al uses to analyse the data in the case study is the IPython Notebook, an interactive environment for editing Python code within the browser. (You can download the necessary IPython application from here – I installed the Anaconda package to try it – and then follow the IPython Notebook instructions here to get it running. It's all a bit fiddly, and could do with a simpler install and start routine, but if you follow the instructions it should work okay…)

Ipython notebook

IPython is not the only environment that supports this sort of exploratory data analysis, of course. For example, we can do a similar analysis using the statistical programming language R, with the ggplot2 graphics library to help with the chart plotting. To get the data, I used a special R library called WDI that provides a convenient way of interrogating the World Bank Indicators API from within R and makes it easy to download data from the API directly.

I have posted an example of the case study using R, and the WDI library, here: Rwandan Tragedy (R version). The report was generated from a single file written using a markup language called R Markdown in the RStudio environment. R Markdown provides a really powerful workflow for creating "reproducible reports" that combine analysis scripts with interpretive text (RStudio – Using Markdown). You can find the actual R Markdown script used to generate the Rwandan Tragedy report here.

As you have seen, exploratory data analysis can be thought of as having a conversation with data: asking it questions based on the answers it has already given you, or based on hypotheses you have formed using other sources of information or knowledge. If exploratory data analysis is new to you, try walking through the investigation using either IPython or R, and then see if you can take it further… If you do, be sure to let us know how you got on via the comments :-)


Using SQL for Lightweight Data Analysis

Rufus Pollock - March 26, 2013 in Data Blog, Data Cleaning, Data Stories, HowTo, SQL

This article introduces the use of SQL for lightweight data analysis by walking through a small data investigation to answer the question: who were the top recipients of Greater London Authority spending in January 2013?

Along the way, it not only introduces SQL (and SQLite) but illustrates various other skills such as locating and cleaning data and how to load tabular data into a relational database.

Note: if you are intrigued by the question or the data wrangling, do check out the OpenSpending project – the work described here was done by OpenSpending community members at a recent Open Data Maker Night.

Finding the Data

First we need to locate the data online. Let’s start with a web search, e.g.: “London GLA spending” (GLA = greater london authority). This quickly yields the jackpot in the form of this web page:

For our work, we’ll focus on the latest month. So jump in and grab the CSV file for February which is at the top of that page (at the moment!).

Preparing the Data

The data looks like this (using the Chrome CSV Viewer extension):

gla-csv

Unfortunately, it’s clear these files have a fair amount of “human-readable” cruft that make them unsuitable for further processing without some cleaning and preparation. Specifically:

  • There is various "meta" information plus a blank line at the top of each file
  • There are several blank lines at the bottom
  • The leading column is empty

We’ll need to remove these if we want to work with this data properly – e.g. load into OpenSpending, put in a database etc. You could do this by hand in your favourite spreadsheet package but we’ll do this using the classic UNIX command line tools head, tail and sed:

tail -n +7 2012-13-P11-250.csv | head -n -4 | sed "s/^,//g" > 2013-jan.csv

This command takes all lines after the first 6 and before the last 4, strips off the leading “,” and puts it in a new file called 2013-jan.csv. It uses unix pipes to run together these few different operations:

# strip off the first 6 lines
tail -n +7

# strip off the last 4 lines
head -n -4

# remove the lead column in the form of "," at the start of each line
# "^," is a regular expression matching "," at the start of a line ("^"
# matches the start of a line)
sed "s/^,//g"

The result of this is shown in the screenshot below and we’re now ready to move on to the next stage.

gla-csv-cleaned

Analyzing the Data in a Relational Database (SQLite)

Our aim is to work out the top recipients of money. To do this we need to sum up the amounts spent by Vendor (Name). For the small amount of data here you could use a spreadsheet and pivot tables. However, I'm going to take a somewhat different approach and use a proper (relational) database.

We'll be using SQLite, an open-source relational database that is lightweight but fully-featured. So, first check you have this installed (type sqlite or sqlite3 on the command line – if you don't have it, it is easy to download and install).

Loading into SQLite

Now we need to load our CSV into SQLite. Here we can take advantage of a short python csv2sqlite script. As its name suggests, this takes a CSV file and loads it into an SQLite DB (with a little bit of extra intelligence to try and guess types). The full listing for this is in the appendix below and you can also download it from a gist here. Once you have it downloaded we can use it:

# this will load our csv file into a new table named "data"
# in a new sqlite database in a file named gla.sqlite
csv2sqlite.py 2013-jan.csv gla.sqlite
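
The full listing of csv2sqlite.py lives in the gist linked above rather than being reproduced here, but the core idea is straightforward. The following is only a rough sketch using the Python standard library – it assumes a header row, stores everything as text and skips the type guessing that the real script performs:

# Rough sketch of a CSV-to-SQLite loader; the real csv2sqlite.py also guesses column types.
import csv, sqlite3, sys

def load(csv_path, db_path, table="data"):
    with open(csv_path, newline="") as f:
        rows = csv.reader(f)
        header = next(rows)
        conn = sqlite3.connect(db_path)
        cols = ", ".join('"%s"' % h for h in header)
        conn.execute('CREATE TABLE "%s" (%s)' % (table, cols))
        placeholders = ", ".join("?" * len(header))
        conn.executemany(
            'INSERT INTO "%s" VALUES (%s)' % (table, placeholders),
            (row for row in rows if row),   # skip blank lines
        )
        conn.commit()
        conn.close()

if __name__ == "__main__":
    load(sys.argv[1], sys.argv[2])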

Analysis I

Let’s get into the SQLite shell so we can run some SQL:

# note you may need to run sqlite3 rather than sqlite!
sqlite gla.sqlite

Now you will be in the SQLite terminal. Let’s run our query:

sqlite> SELECT "Vendor Name", sum(amount) FROM data
          GROUP BY "Vendor Name"
          ORDER BY SUM(amount) DESC
          LIMIT 20;

How does this work? Well the key thing here is the “GROUP BY” which has a similar function to pivoting in spreadsheets: what it does is group together all the rows with the same value in the “Vendor Name” field. We can then use SELECT to specify fields, or functions of fields that are common or aggregate across all the rows with the same “Vendor Name” value. In this case, we just select the “Vendor Name” and the SUM of the “Amount” field. Lastly, we order the results by the sum (descending – so most first) and limit to only 20 results. The result is as follows:

Vendor Name                          SUM(Amount)
-----------------------------------  -----------
NEWLON HOUSING TRUST                 7540500.0  
ONE HOUSING GROUP                    6655104.0  
L B OF HARINGEY                      6181359.0  
LONDON BOROUGH OF HACKNEY - BSP      5665249.0  
LONDON BOROUGH OF HAVERING           4378650.0  
LONDON BOROUGH OF NEWHAM             3391830.0  
LONDON BOROUGH OF BARKING            2802261.0  
EVERSHEDS                            2313698.54 
METROPOLITAN HOUSING TRUST LIMITED   2296243.0  
BERKELEY PARTNERSHIP HOMES LIMITED   2062500.0  
LONDON BOROUGH OF LAMBETH            1917073.95 
PARADIGM HOUSING GROUP LIMITED       1792068.0  
AMAS LTD                             1673907.5  
VIRIDIAN HOUSING                     1467683.0  
LONDON BOROUGH OF GREENWICH          1350000.0  
CITY OF WESTMINSTER                  1250839.13 
CATALYST HOUSING GROUP LTD            829922.0   
ESTUARY HOUSING ASSOCIATION LIMITED   485157.0   
LOOK AHEAD HOUSING AND CARE           353064.0   
TRANSPORT FOR LONDON                  323954.1   

We could try out some other functions, for example to see the total number of transactions and the average amount we’d do:

sqlite> SELECT "Vendor Name", SUM(Amount), AVG(Amount), COUNT(*)
          FROM data
          GROUP BY "Vendor Name"
          ORDER BY sum(amount) DESC;

Vendor Name                          SUM(Amount)  AVG(Amount)  COUNT(*)  
-----------------------------------  -----------  -----------  ----------
NEWLON HOUSING TRUST                 7540500.0    3770250.0    2         
ONE HOUSING GROUP                    6655104.0    3327552.0    2         
L B OF HARINGEY                      6181359.0    6181359.0    1         
LONDON BOROUGH OF HACKNEY - BSP      5665249.0    1888416.333  3         
LONDON BOROUGH OF HAVERING           4378650.0    4378650.0    1         

This gives us a sense of whether there are many small items or a few big items making up the expenditure.

What we’ve seen so far shows us that (unsurprisingly) GLA’s biggest expenditure is support to other boroughs and to housing associations. One interesting point is the approx £2.3m paid to Eversheds (a City law firm) in January and the £1.7m to Amas Ltd.

Analysis II: Filtering

To get a bit more insight let’s try a crude method to remove boroughs from our list:

sqlite> SELECT "Vendor Name", SUM(Amount) FROM data
          WHERE "Vendor Name" NOT LIKE "%BOROUGH%"
          GROUP BY "Vendor Name"
          ORDER BY sum(amount)
          DESC LIMIT 10;

Here we are using the WHERE clause to filter the results. In this case we are using a "NOT LIKE" clause to exclude all rows where the Vendor Name contains "BOROUGH". This isn't quite enough, so let's also try to exclude housing associations / groups:

SELECT "Vendor Name", SUM(Amount) FROM data
  WHERE ("Vendor Name" NOT LIKE "%BOROUGH%" AND "Vendor Name" NOT LIKE "%HOUSING%")
  GROUP BY "Vendor Name"
  ORDER BY sum(amount)
  DESC LIMIT 20;

This yields the following results:

Vendor Name                          SUM(Amount)
-----------------------------------  -----------
L B OF HARINGEY                      6181359.0  
EVERSHEDS                            2313698.54 
BERKELEY PARTNERSHIP HOMES LIMITED   2062500.0  
AMAS LTD                             1673907.5  
CITY OF WESTMINSTER                  1250839.13 
TRANSPORT FOR LONDON                  323954.1   
VOLKER FITZPATRICK LTD                294769.74  
PEABODY TRUST                         281460.0   
GEORGE WIMPEY MAJOR PROJECTS          267588.0   
ST MUNGOS                             244667.0   
ROOFF LIMITED                         243598.0   
R B KINGSTON UPON THAMES              200000.0   
FOOTBALL FOUNDATION                   195507.0   
NORLAND MANAGED SERVICES LIMITED      172420.75  
TURNER & TOWNSEND PROJECT MAGAG       136024.92  
BARRATT DEVELOPMENTS PLC              108800.0   
INNOVISION EVENTS LTD                 108377.94  
OSBORNE ENERGY LTD                    107248.5   
WASTE & RESOURCES ACTION PROGRAMME     88751.45   
CB RICHARD ELLIS LTD                   87711.45 

We still have a few boroughs left due to abbreviated spellings (Haringey, Kingston, Westminster) but the filter is working quite well. New names are now appearing and we could start to look into these in more detail.

Some Stats

To illustrate a few additional features of SQL, let's get some overall stats.

The number of distinct suppliers: 283

SELECT COUNT(DISTINCT "Vendor Name") FROM data;

Total amount spent in January: approx £60m (60,448,491)

SELECT SUM(Amount) FROM data;

Wrapping Up

We now have an answer to our original question:

  • The biggest recipient of GLA funds in January was Newlon Housing Trust with £7.5m
  • Excluding other governmental or quasi-governmental entities, the biggest recipient was Eversheds, a law firm, with £2.3m

This tutorial has shown we can get these answers quickly and easily using a simple relational database. Of course, there’s much more we could do and we’ll be covering some of these in subsequent tutorials, for example:

  • Multiple tables of data and relations between them (foreign keys and more)
  • Visualization of our results
  • Using tools like OpenSpending to do both of these!

Appendix

Colophon

CSV to SQLite script

Note: this script is intentionally limited by the requirement to have zero dependencies, and its primary purpose is to act as a demonstrator. If you want real CSV-to-SQL power, check out csvsql in the excellent CSVKit, or MessyTables.

SQL

All the SQL used in this article has been gathered together in one script:

.mode column
.header ON
.width 35
-- first sum
SELECT "Vendor Name", SUM(Amount) FROM data GROUP BY "Vendor Name" ORDER BY sum(amount) DESC LIMIT 20;
-- sum with avg etc
SELECT "Vendor Name", SUM(Amount), AVG(Amount), COUNT(*) FROM data GROUP BY "Vendor Name" ORDER BY sum(amount) DESC LIMIT 5;
-- exclude boroughs
SELECT "Vendor Name", SUM(Amount) FROM data
  WHERE "Vendor Name" NOT LIKE "%Borough%"
  GROUP BY "Vendor Name"
  ORDER BY sum(amount) DESC
  LIMIT 10;
-- exclude boroughs plus housing
SELECT "Vendor Name", SUM(Amount) FROM data
  WHERE ("Vendor Name" NOT LIKE "%BOROUGH%" AND "Vendor Name" NOT LIKE "%HOUSING%")
  GROUP BY "Vendor Name"
  ORDER BY sum(amount) DESC
  LIMIT 20;
-- totals
SELECT COUNT(DISTINCT "Vendor Name") FROM data;
SELECT SUM(Amount) FROM data;

Assuming you had this in a file called ‘gla-analysis.sql’ you could run it against the database by doing:

sqlite gla.sqlite < gla-analysis.sql


First Steps in Identifying Climate Change Denial Networks On Twitter

Tony Hirst - March 14, 2013 in Data Stories

A week or two ago, a Guardian article declared that "Secret funding helped build vast network of climate denial thinktanks". The article described how two grant-making trusts were being used to channel funds that were "doled out between 2002 and 2010, [and] helped build a vast network of thinktanks and activist groups working to a single purpose: to redefine climate change from neutral scientific fact to a highly polarising "wedge issue" for hardcore conservatives".

Inspired by this story, we started to ask ourselves whether we could make any inroads into mapping this network based on the friend and follower networks that are built around the Twitter accounts of groups known to have received that funding. (Here is a list of Twitter accounts for the top 20 (by funding) climate change denial organisations who received funding from the Donors Trust (2002-2011).)

As with any data story, there are several steps we need to walk through: finding the data; getting a copy of the data; tidying the data up so you can actually work with it; a period of analysis (which may include visual analysis – using visualisations to help you analyse the data); and then the exposition of the story you want to tell.

This post falls into the “period of analysis” phase, and demonstrates how we can use a network visualisation approach to start exploring the “social positioning” of the Twitter accounts of groups associated with climate change denial. The idea is this: if we assume that people follow people they are interested in on Twitter, or people follow people who share a similar interest, we can look at who the followers of a particular target individual commonly follow to draw up a “social positioning map” that locates the target individual in some sort of social interest space.

Here’s an example in a slightly different context – bookshops. Books in bookshops tend to be categorised in different ways. Firstly, we have fiction and non-fiction; then we have further subdivisions: fiction books may include crime, or science-fiction, or “general”; non-fiction may include topics like history, transport, psychology, for example. If we take one particular author, such as Nobel prize-winning physicist Richard Feynman, we may find that people who buy his books (that is, people who in a certain sense “follow” Richard Feynman) also tend to buy (“follow”) books by Stephen Hawking or James Gleick, William Gibson, or maybe even Steven Hampton… If we map out the authors commonly bought by the purchasers of Feynman’s books, grouping them so that authors who tend to be bought by the same people are close together, we can generate a map that “socially positions” Feynman relative to other authors, based on the shared interests of their common followers. Reading this map might show us how Feynman is close (in interest space) to other physicists, for example, or popular science writers. The map might also reveal niche interests, such as lock-picking! Using Feynman – a known physicist – as a seed, we can generate a map of other people who represent similar interests based on the behaviour of their followers, and through those followers turn up interests we might not otherwise have been aware of. (For an expanded description of this idea in the Twitter context, see the BBC College of Journalism post: How to map your social network.)

So how does this relate to climate change denial networks on Twitter?

Although not published as open data, many social networks do make data publicly available that we can use for "personal research" purposes… Often, this data is published in ways that are amenable to automated collection and analysis through APIs, interfaces that provide programmable access to an application rather than the web site or app we might use. (In fact, the APIs typically provide the plumbing that connects the app or webpage to the application provider's "back end".) The Twitter API provides a means for grabbing the lists of the friends, or followers, of a particular account, and looking up basic, public information about those accounts.

For each of the target Twitter accounts identified above, we grabbed a sample of 997 of their followers, along with the friends lists of each of those sampled followers. We then generated a network file for each target that includes the target account's followers and the people followed by at least 50 followers of the target account. Which is where this post begins… with a quick visual analysis of the accounts commonly followed by the followers of @CFACT. Here's a teaser of the sort of thing we're looking for…
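
The harvesting scripts themselves aren't included in this post, but once the follower sample and each follower's friends list have been saved (for example as JSON files by a separate grabber), building the "commonly followed" network is just a counting exercise. The file names and the threshold of 50 below are assumptions made for this sketch:

# Sketch only: assumes followers.json holds a list of follower ids for the target
# account, and friends.json maps each follower id to the list of ids they follow.
import json
from collections import Counter

followers = json.load(open("followers.json"))    # sample of ~997 follower ids
friends = json.load(open("friends.json"))        # {follower_id: [friend_id, ...]}

counts = Counter()
for follower in followers:
    counts.update(set(friends.get(str(follower), [])))

# Keep only accounts followed by at least 50 of the sampled followers
common = {acct: n for acct, n in counts.items() if n >= 50}

# Edges of the network: each sampled follower -> each commonly followed account
edges = [(follower, acct)
         for follower in followers
         for acct in friends.get(str(follower), [])
         if acct in common]
print(len(common), "commonly followed accounts,", len(edges), "edges")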

CFACT - ESP tease

If you want to play along, you can download the CFACT social positioning network data. You will need to unzip the file in order to make use of it.

The tool we’re going to use to explore the data is the open source, cross-platform application Gephi. (It does require Java, though…) Download Gephi here.

We can load the data file into Gephi using the File-Open menu option – the file dialogue gives us some basic information about the network data, which all looks fine; we can accept the defaults as they stand.

CFACT_file_open

Once loaded in, a preview view of the network, randomly laid out, appears in the central Graph panel. (Mouse wheel or up-down slide on a trackpad can be used to zoom in and out, using the cursor as the focal point. It's a bit fiddly at first, but you soon get used to it… If you lose the graph, the magnifying glass icon in the toolbar along the lefthand side of the Graph panel will centre and resize the graph for you.) The tabs to the right contain tools for working with the graph and analysing the data numerically. Tools are provided for running network statistics as well as filtering the network to highlight certain elements within it. The panels on the left hand side contain tools for working with the visual layout; the toolbar at the bottom of the Graph panel also has tools for working with the display of the graph.

CFACT - data loaded

If you select the Statistics tab on the right hand side of the desktop, you will see a range of network statistics you can apply over the network. We’re going to use two – the Modularity statistic, which tries to group the nodes based on how highly connected they are to each other; and Eigenvector centrality, which gives a measure of how “influential” a node is, based on how many people follow the node and how “influential” they are. You can apply the statistics by clicking the appropriate Run button. A pop-up dialogue will appear for each, but we can just accept the default settings for now. Running each report also results in a report dialogue. We really should look at these, but for now just cancel them… (trust me;-)

CFACT_stats

We’re now going to use the values we calculated by running the statistics to colour the graph (that is, the network). Select the Partition panel, and the node tab. Click on the recycle arrow to load in the grouping parameters we can colour the graph by, and select the Modularity.

CFACT - partition

In this case, the Modularity statistic found three groups, and has coloured them randomly.

(You can change the colours if you want by right-clicking within the panel and selecting Randomise Colours or by click-and-hold over a colour option to pop up a colour selection palette. I often select pastel colours because it makes labels on the graph easier to read.)

If you Apply the colour selection, the nodes will be coloured correspondingly.

CFACT partition colour

We can also change the size of the nodes to reflect their relative importance within the graph, as given by their eigenvector centrality value. Select the Ranking panel and choose the diamond icon. Select the Eigenvector Centrality as the dimension we will use to size the nodes.

CFACT- ranking node size

Now set appropriate minimum and maximum node sizes. (The Spline option defines how the Eigenvector Centrality values (in this case) actually map on to node sizes…) If you Apply the settings, you should see the nodes are resized.

CFACT - node size apply

That’s all very well, but still not that informative. We can now work some magic using the Layout settings:

There are lots of different layout tools, and they each have their own strengths and weaknesses. The Force Atlas 2 algorithm was designed specifically for Gephi, and generates layouts that we can read as maps, grouping nodes that tend to be connected to each other in an effective way.

CFACT - layout selection

The are quite a few parameters associated with this layout algorithm, but the default settings often provide a good start. (If you use Gephi a lot, you soon start to develop your own aesthetic preferences…)

CFACT - force atals2

Now for the magic – click the Run button and watch the nodes fly! Sometimes the layout settles and stops running of its own accord; other times you need to stop it running explicitly.

If you find the nodes are too close together, increase the Scaling value and Run the layout tool again.

So now we have laid out our graph – what does it say? In this case, I read it as you would a geographical map. The colours represent interest countries and the nodes are cities. (The lines (also referred to as edges) show how nodes are connected to each other through following relationships.)

We can label the “city” nodes using a font size proportional to the node size to see, at a glance, which are the more important ones.

CFACT labeling

If you zoom in on the graph, you may notice that a lot of labels overlap and the view is quite cluttered. One way of separating out the nodes is to tweak the ForceAtlas2 settings and rerun it. Increasing the scale separates out the graph, for example. Two other layout tools can also help us: the Expansion tool stretches the layout out (or shrinks it if you choose a scaling factor less than 1), and the Label Adjust tool moves the nodes that are in view (so it can be worth zooming in to cluttered areas and running this tool…) so that their labels don't overlap.

CFACT - layout tweaks

If you do a bit of fiddling with the those two tools, you can start to generate something like this:

CFACT  - layout and increase label size

(If it all goes horribly wrong, just run the ForceAtlas2 layout again!)

To generate a “print quality” version of the network, we can go to the Preview panel:

CFACT - Overview and Preview selectors

(To go back to the original view, click on the Overview button.)

The key Preview settings to know for now are Node labels/Show labels (to display the node labels) and Refresh, which generates a vector view of the graph displayed in the Graph panel. When you generate the preview, you may notice the node labels are the wrong size, overlapping etc.

CFACT preview

You can tweak the node font size and refresh the preview, or go back to the Graph view and play with the layout tools again (expanding the network layout, using the slider to increase the displayed font size, and using the Label Adjust tool to "un-overlap" labels in view). It sounds complicated, but you do soon find an effective workflow – honestly!

With a bit more fiddling we can get a prettier layout that we can export as PDF, SVG or a PNG graphic:

CFACT

Use your judgement when selecting landscape or portrait views. Exporting the file in PDF or SVG format means that you can zoom in to view detail clearly.

CFACT - export PDF

(One thing to bear in mind is that large, highly connected networks can generate large files!)

If you want to see the result of my attempt, here it is: Get the @CFACT followers’ common friends network [PDF].

If you would like to try generating maps for some of the Twitter accounts associated with the other top 20 Donors Trust climate change denial groups, we have the data – please get in touch for a download link.

In the next post, we’ll explore in a little more detail what distinguishes the make up of the different groups and who is most influential in these groups…


Twitter users in Pakistan

Irfan Ahmad - January 25, 2013 in Data Stories

Irfan Ahmad has recently collected data on twitter users in Pakistan. He writes:

Twitter escultura de arena

Twitter has recently become very popular in Pakistan, as lots of journalists, TV anchors, celebrities and politicians have started using it very actively. There are reports of political parties establishing proper social media cells to manage their online identities on sites like Twitter and Facebook. There was a need to find answers to questions like:

  • How many twitter users are there in Pakistan?
  • How are Pakistani twitter users distributed demographically?
  • Who are the most influential Twitter users in Pakistan?

Data Acquisition:

Twitter does not reveal such data, but its API exposes public profiles. Our assumption was that every Pakistani Twitter user would have some specific keywords or geo-coordinates as the location in their profile.

Although this assumption skips many Pakistanis – many never fill in their profile in detail and many live abroad – we were able to find almost all of the commonly known popular Twitter users using it. The Twitter API also provides time zone information along with each profile, and we used this information as well.

Another problem with this logic was that there are many keywords, like Kashmir, Gujrat and Hyderabad, that are common to both India and Pakistan. So another part of the task was to exclude those users from being crawled. For this, the importance of these keywords was minimised and some block keywords were introduced. This approach filtered out many Indian Twitter users from the results.
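
The crawler code isn't published with this post, but the filtering logic described above boils down to keyword matching with a block list. A simplified sketch follows; the keyword sets are illustrative rather than the lists actually used:

# Simplified sketch of the location filter; the real crawler uses much larger lists.
PAKISTAN_KEYWORDS = {"pakistan", "karachi", "lahore", "islamabad", "peshawar", "quetta"}
AMBIGUOUS_KEYWORDS = {"kashmir", "gujrat", "hyderabad"}   # shared with India, low weight
BLOCK_KEYWORDS = {"india", "mumbai", "delhi", "bangalore"}

def looks_pakistani(location, time_zone=None):
    loc = (location or "").lower()
    if any(word in loc for word in BLOCK_KEYWORDS):
        return False
    if any(word in loc for word in PAKISTAN_KEYWORDS):
        return True
    # Ambiguous keywords only count when the time zone also points to Pakistan
    if any(word in loc for word in AMBIGUOUS_KEYWORDS):
        return time_zone in ("Islamabad", "Karachi")
    return False

print(looks_pakistani("Lahore, Pakistan"))    # True
print(looks_pakistani("Hyderabad, India"))    # False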

Using this logic we started a crawler which runs day and night to find more and more Pakistani tweeps. The crawler is quite slow because of the rate limits imposed by Twitter's API. It has been running for a few months and so far it has crawled about 155,000 Pakistani Twitter users into its database, which is a significant dataset for our analysis.

The second part of the project was to find the most influential Twitter users. After analysing the top Twitter users by follower count, it was concluded that follower count alone is not a very reliable measure of influence; there are lots of other factors involved. Fortunately, there is a very nice application, Klout, which measures an influence score. So the next step was to write a crawler which fetches the Klout score of every Pakistani Twitter user. Klout has a very reasonable rate limit, and with the current database size it is possible to update the Klout score of each user within a week. This crawler runs day and night as well.

Both of these crawlers are coded in Python and store their results in Redis. They make use of Python's multiprocessing. The following Python libraries were used:

Data Analysis

Finding Geographical Location:

The first part was to find out where Pakistani Twitter users are located geographically. Our crawler assigns a city or province to each user based on the keyword matched against the location in their Twitter profile.

Finding Gender:

Twitter does not provide the gender of each user, but it can usually be guessed from the user's name. For this analysis gender.c was used with a custom database of 5,000 Pakistani names.
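
gender.c is a small C program; in Python the same idea is simply a lookup of the first name against a names database. A toy sketch, with a placeholder dictionary standing in for the 5,000-name database actually used:

# Toy sketch of name-based gender guessing; the real analysis used gender.c
# with a database of ~5,000 Pakistani names.
NAMES = {"ahmed": "m", "ali": "m", "fatima": "f", "ayesha": "f", "irfan": "m"}

def guess_gender(full_name):
    parts = full_name.strip().split()
    first = parts[0].lower() if parts else ""
    return NAMES.get(first, "unknown")

print(guess_gender("Ayesha Khan"))   # 'f'
print(guess_gender("XYZ123"))        # 'unknown'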

Finding most commonly used names:

This was easy: just a few lines of Python code and the result was compiled.

Finding the most commonly used words in Pakistani tweeps' profiles

Out of 150K users, only about 77K (about 51%) have set the description field in their Twitter profiles. For this analysis the FreqDist and stopwords modules of NLTK were used.
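
As a rough sketch of that analysis (assuming the non-empty description strings have already been collected into a list; the sample strings below are placeholders):

# Sketch: most common words in profile descriptions using NLTK.
import nltk
from nltk import FreqDist
from nltk.corpus import stopwords

# nltk.download("stopwords")  # needed once
stop = set(stopwords.words("english"))

descriptions = ["Journalist and anchor", "Student of politics", "Cricket fan"]  # placeholders
words = [w.lower()
         for text in descriptions
         for w in text.split()
         if w.isalpha() and w.lower() not in stop]

print(FreqDist(words).most_common(10))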

Other results

There were some other results generated using simple queries like:

  • Finding verified twitter accounts
  • Finding users with the most followers
  • When did Pakistanis join Twitter?
  • Finding most influential twitter users in Pakistan

Now the most important part. As we are also crawling users' Klout scores, this was again simply a query against our data.

This is quite interesting data. Based on it, some more facts were collected, like:

  • Out of top 100 most influential Pakistani twitter accounts, 1/3rd are related to traditional media (mostly talk show anchors)
  • There are about 10K Pakistani twitter users with no followers and 12K accounts with just 1 follower (most of them were fake accounts created by social media cells of political parties to increase follower count)
  • On average each Pakistani tweep is followed by 129 tweeps
  • There are 24 Pakistani Twitter users with 50K+ followers and only 11 users with 100K+ followers
  • Out of 150K Pakistani twitter users, almost half have less than 10 followers and a quarter have 11-50 followers

What I learnt from this project

  • NoSQL is great in handling large datasets
  • Whenever using some API or scraping, use unicode.
  • Handling rate limits of APIs with multiprocessing can be challenging.
  • NLTK is a great library if used properly.

The detailed results can be seen here: 1 2 3


How to visualize political coalitions in electoral coverage

Anders Pedersen - January 24, 2013 in Data Stories, Uncategorized

This post was written by Gregor Aisch.

If I was asked for the golden rule of information visualization, it would be:

“Show the most important thing first!”

Not second or third, but first! And what is the most important thing to show about the outcome of an election? Who actually won.

In political systems like Germany’s, where we have no party getting anywhere near 50% of the vote, the usual one-bar-per-party bar charts totally fail to answer this most important question.

For example, in the following chart we can see the number of seats won by different political parties – but this does not tell us who won the election.

wahlergebnis-balken

This is because elections are not won by the party with the most votes, but by the party that manages to get a majority of seats in parliament. And, with the exception of Bavaria and Hamburg, in Germany there's no way to govern without forming a coalition.

The bar chart above makes it tremendously difficult for readers to figure out which coalition has won. To work it out, one must calculate the total number of seats for each coalition and compare it to the number of seats needed for a majority (which is more than half of the total number of seats).
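
This arithmetic is trivial for a computer, which is exactly why it should be done for the reader rather than by the reader. A small sketch with made-up seat counts illustrates the calculation:

# Sketch with illustrative (not real) seat counts.
from itertools import combinations

seats = {"CDU": 54, "SPD": 49, "Greens": 20, "FDP": 14, "Linke": 11}
majority = sum(seats.values()) // 2 + 1   # more than half of all seats

for size in (2, 3):
    for coalition in combinations(seats, size):
        total = sum(seats[p] for p in coalition)
        status = "majority" if total >= majority else "short by %d" % (majority - total)
        print("+".join(coalition), total, status)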

Humans aren’t particularly good at calculating and weighing up these different possibilities on the fly. That’s why most election reporting websites show an additional coalition view. But where is it? Right – often it is the last thing that they show, such as in this recent example from the Zeit Online:

mgl-koalitionen

How coalitions have been visualized in the past

In past elections in Germany coalitions have been visualized in two different ways: either as simple horizontal bar chart or as an interactive coalition calculator.

mgl-koalitionen-2

The simple bar chart (as seen above) usually shows a limited selection of two or three coalitions having a majority. One problem is that sometimes it would be interesting to compare those coalitions with other possible – but politically unlikely – coalitions, such as the CDU and the Greens in Germany.

The second problem is that excluding the coalitions that fail to have a majority eliminates valuable contextual information.

Do it yourself: the coalition calculator

An alternative approach is the coalition calculator. The main idea is to let the users try out their own coalitions and see whether or not a given coalition could have a majority.

ltwnds-spon-koalitionsrechner

However, this puts quite a bit of effort onto the user, who might well be checking back several times during election nights. Also the calculator only shows one or two coalitions at a time, so it’s hard to actually compare different possible coalitions.

A new approach: extended coalition charts

This isn't exactly groundbreaking stuff, but for some reason nobody seems to have ever visualized elections this way before. The idea is to show as many coalitions as possible side by side, including the politically unlikely ones and those that fail to reach a majority.

To visually separate the winning coalitions from the rest, I finally decided to simply pull them apart. Since I needed some more space, I went for vertical bars instead.

koalitionen-940

There’s a nice side effect of showing all coalitions: when new preliminary results are coming in during election nights, the visualization doesn’t show an entirely different picture, but some coalitions simply ‘change sides’.

Since the actual total number of seats depends on the election results, I decided to label the coalitions' seats with relative numbers. This means that instead of saying 'coalition X has 70 seats' we say '3 seats are missing for a majority'.

Coalition maps

It was a small step from extended coalition charts to coalition maps. A coalition map shows in which election districts a coalition holds the majority of votes. To indicate the coalition I decided to go for diagonal stripes, although I don’t recommend looking at them for too long. :-)

koalitionskarten1

It is interesting to compare how coalition maps vary between coalitions. For instance, comparing the two most preferred coalitions, you can see a clear divide within the state of Lower Saxony in the north-west.

koalitionskarten-2

Try it out!

If you want to try out either the extended charts or the mapping mini-apps shown above, then you can grab the code via the following links. All of my examples above were created using the open source Raphaël JavaScript Library.

Note: This post is a translated version of this one.


The Role of Government: Small Public Sector or Big Cuts?

Velichka Dimitrova - October 30, 2012 in Data Stories

Second Presidential Debate 2012

News stories based on statistical arguments emphasise a single fact but may lack the broader context. Would the future involve some more interactive form of media communication? Could tools like Google Fusion Tables allow us to delve into data and make our own data visualisations while discovering aspects of the story we are not told about?

There has rarely been an issue as controversial in economic policy as the role of government. Recently the role of government has been at the heart of the ideological divide of the US presidential debates. While Governor Romney advocates against a government-centred approach (i.e. for small government) and threatens to undo the role of the federal government in national life, President Obama supports the essential function of the state (big government) in promoting economic growth, empowering all societal groups with federal investments in education, healthcare and future competitive technology.


Graph 1: United Kingdom and other major country groups. Data Source: World Economic Outlook 2012. Download data from the DataHub

Two weeks ago the Guardian published an article about how the Tory government plans to shrink the state even below US levels, based on the recently-released data from the IMF’s World Economic Outlook. Let’s take the source data and take a look at the bigger picture. On the DataHub, I uploaded all data for “General Government Total Expenditure to GDP” for all countries as well as country groups [See the dataset]. You could use the Datahub Datastore default visualisation tools to build a line graph (select the dataset, then in Preview choose “Graph”) or try the Google Fusion Table with the all countries dataset to select the countries you are interested in exploring.

According to the data, Britain would have a smaller public sector [1] than the average of all advanced economies by 2017; other country groups are added to show how regions of the world compare (see Graph 1). Even EU countries with staggering public debt – like Greece – would still have a higher ratio of total government expenditure to GDP, according to the projections (see Graph 2).

Graph 2: United Kingdom and other European countries + United States. Data Source: World Economic Outlook 2012. Download data from the DataHub

But what is the bigger picture? And does shrinking the role of government in the economy mean that total government expenditure will fall? Not necessarily, because percentages are relative numbers. The growth or decline of total government spending ultimately depends on economic growth – the increase in the total output of the economy – up to 2017. If we take the data for "General Government Total Expenditure" in national currency and compute the growth rates [2], we see that for the UK the growth rate is above zero, meaning that government expenditure would actually increase over time, despite the diminishing role of government in the economy's total output.

Debt-ridden countries like Greece, Portugal or Spain (dropping the US and Italy to avoid an overcrowded graph) will have to slash spending first before reaching positive growth, despite their larger public sectors. The lesson is that the size of the public sector does not always equate to actual growth in government expenditure.

Graph 3: Growth in government expenditure for United Kingdom and other European countries. Data Source: World Economic Outlook 2012

Despite its limited value for practical interpretation, the ratio of government total expenditure to GDP is often used as an argument in ideological debates or as a measure in policy papers investigating the impact of government spending on consumption and economic growth, or the optimal size of government. Although lumped together in a single figure, government total expenditure varies in composition between high-, middle- and low-income groups: richer societies tend to spend more on social security and welfare, middle- and low-income countries have higher relative capital expenditure, and low-income societies tend to spend a larger share of their government budget on the military (e.g. see some examples from earlier IMF publications).

Even if no cuts are actually made, the public sector will eventually shrink: inflation, for example, could mean that the government actually spends less in real terms. A smaller public sector in the long run will eventually mean that countries like the UK will not be able to support an ageing population or provide the same levels of infrastructure and public services as they are currently used to. Yet a smaller public sector might also provide an opportunity to cut taxes, provide incentives for the business sector and boost economic growth. Policy choices about the size of government following the US presidential elections and the sovereign debt crisis in Europe will partly be choices of ideology, as there is no clear evidence which recipe works in an individual case.

In the next piece in this series, we will look at some of the detailed data available on government staff salaries around the world.



[1] The size of the public sector is measured by the ratio of total government expenditure to GDP.
[2] How do I build growth rates? Add one column where you take the natural logarithm of the absolute expenditure numbers; add another column where you lag all observations by one year (shift the entire column down by one row); in a third column take the difference between the current year's logarithm and the lagged value. Growth rate = ln(x_t) − ln(x_{t-1}).
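
In code the same calculation is a one-liner. A sketch using pandas, with purely illustrative numbers standing in for the WEO expenditure series:

# Sketch: log-difference growth rates; the figures below are illustrative, not WEO data.
import numpy as np
import pandas as pd

expenditure = pd.Series({2010: 670.9, 2011: 681.2, 2012: 692.1, 2013: 701.7})
growth = np.log(expenditure).diff()   # ln(x_t) - ln(x_{t-1})
print(growth.round(4))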


Data at the Karnataka Learning Partnership

Laura Newman - October 11, 2012 in Data Stories

Megha Vishwanath outlines how the Karnataka Learning Partnership has been working to collect and analyse data about public schools in India. Below, Megha talks us through the process of collecting data, how they integrated their information with other data sources, the stories they have already found, and the next steps for the project.

An Introduction to the Karnataka Learning Partnership

The Karnataka Learning Partnership (KLP) was formed as a framework for nonprofits, corporations, academic institutions and citizens to get involved in improving government schools in Karnataka, India. KLP aims to be an independent platform for data collation, visualisation, sharing and data-driven advocacy in Public education.

Infrastructure issues in government schools and preschools have a domino effect on the whole functioning of a school. KLP's reports aggregate data and summarise the status of public education. For more about the KLP, see this post on the OKFN India blog.

KLP’s Reports

At KLP, we try to make information on the status of public schools available, in order to allow elected representatives to better allocate budgets and improve local schools. Last academic year, KLP published overviews on:

  • The demographics of government schools and preschools
  • Financial allocations to government schools
  • Infrastructure of government schools

The reports can be found here. The basic information for a government school or preschool, and the MP and MLA linkages per school, is available as a raw CSV download on the website. Any additional information per school or preschool, either from KLP's database or the DISE database, can be provided quickly on request.

We thank our visualisation experts – Anand S and Rahul Gonsalves – for consulting with us pro bono to make these reports meaningful.

The Data: What do we have?

Much of the data was gathered over the past decade by the Akshara Foundation. This data, collected and cleaned up, feeds KLP’s public database. For every school we have at a minimum a category, medium of instruction and DISE code (a unique identifier by which the Education Department publishes data). Additionally for all schools in Bangalore we have addresses, geo location and photo identifiers.

Akshara’s field staff mapped an MP constituency, MLA constituency and Ward name for every school and preschool in Bangalore. At first, some of the data was clearly erroneous. The exercise of collecting information was repeated several times until the numbers seemed acceptable, and until we had a mapping for every government school in Bangalore.

What kind of stories could be told with this data?

Some of the questions we were able to ask of data for this report were:

  • How many schools are there in this electoral region? How many students?
  • What’s the gender profile?
  • What categories of schools exist? What is enrollment per category?
  • Which local languages are spoken, and are there sufficient schools to meet the needs of a multilingual community?

Within the reports, we published comparative graphs. Because the report is meant to be informative, we haven’t indicated e.g. whether 17 Urdu schools would be sufficient for the 938 Urdu speaking children in BTM Layout, Bangalore – but this data would clearly allow others to ask questions and begin drawing conclusions.

Simple ‘group by’ queries of our school table can yield a pie graph like the one above. On our public KLP database we provide aggregated views of the data. This can be grouped by e.g. MLA Constituency and mother tongue via school IDs.
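
For readers who want to reproduce that kind of aggregation themselves, a hedged pandas sketch is below; the file and column names are assumptions about the schema, not KLP's actual table definitions:

# Sketch of the aggregation behind such a chart; file and column names are assumed.
import pandas as pd

schools = pd.read_csv("schools.csv")   # e.g. columns: school_id, mla_constituency, mother_tongue

counts = (schools
          .groupby(["mla_constituency", "mother_tongue"])["school_id"]
          .count()
          .unstack(fill_value=0))
print(counts.head())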

Linking our data with information about government budgets

In order to report on financial allocations to Government schools, we needed more than our own internal data. Therefore we also began to integrate data from two further sources:

1) PAISA is Accountability Initiative’s flagship project, which works to develop innovative models to track social sector programs. From PAISA, we found that Government schools receive funding according to the following measures:

  • School Maintenance Grant: Rs. 5,000 for up to 3 classrooms & Rs. 10,000 for >3 classrooms.
  • School Grant: Rs. 5,000 for lower primary schools and Rs. 7,000 for upper and model primary schools
  • TLM Grant: Rs. 500 per teacher

2) NUEPA-DISE: The District Information System for Education (DISE) is a public database providing comprehensive information on government schools across India. Year on year, the database is updated and information on basic indicators (teachers, compliance with the Right to Education, enrollment, financial allocation, school facilities etc.) is published. From the DISE database, we could determine the number of classrooms and the number of teaching staff in a school.

More data, more questions..

With more data, there were many more questions that could be answered. For instance, our data now included a table of facilities. So, we could ask:

  • How many schools in this electoral region have a play ground?
  • How many of them have a library? How many actually have books in the library?
  • How many schools have a separate girls toilet?

We did have to apply some of our own assumptions to the flat DISE data in order to assign a binary (0/1) score for each of the facilities for each school (a short code sketch of this scoring follows the list). As an example:

  1. The facilities table has a library column, with the value 1 indicating that a library exists and 2 indicating that it does not, plus a numeric books_in_library count. Only if library = 1 and books_in_library > 0 do we consider the school to have a functional library.
  2. The table also has toilet_common and toilet_girls columns containing numeric counts. If toilet_common > 0 the school has a common toilet; if toilet_girls > 0 it has a separate girls' toilet.
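
A minimal sketch of that scoring in Python, using the column names mentioned above (the actual DISE export may name them differently):

# Sketch of the binary scoring rules described above; column names follow the text,
# not necessarily the real DISE export.
def facility_flags(row):
    return {
        "library": 1 if row["library"] == 1 and row["books_in_library"] > 0 else 0,
        "common_toilet": 1 if row["toilet_common"] > 0 else 0,
        "girls_toilet": 1 if row["toilet_girls"] > 0 else 0,
    }

print(facility_flags({"library": 1, "books_in_library": 120,
                      "toilet_common": 2, "toilet_girls": 0}))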

The 11 parameters were summarised to produce the kind of indication below:

Each of the above indicators is a single or composite representation of the larger set of indicators. The colour coding indicates whether the constituency scores better than the chosen sufficiency condition for a particular indicator or not. The question asked of the aggregated data therefore becomes e.g. do 70% of the schools in this constituency have playgrounds?

Allowing for geographical comparisons

Early on in the design of our reports, we understood that it might be useful to compare constituencies with neighbouring areas. A manual exercise of reviewing which constituencies shared boundaries was carried out on publicly visible (but not downloadable) shape files on bbmpelections.in.

The results of sharing these reports

The feedback from our elected representatives to these reports has been overwhelming. From "wanting to make mine a Model Constituency" to "please keep me informed about the problems in my area", these positive responses keep us motivated to publish information. A write-up on the team's experiences in delivering these reports can be found here.

The Technology Behind it All:

The underlying reporting engine is a Python (web.py) based web app reading from a Postgres database that brings in views from the multiple data sources. Python scripts were written for all the data scraping, database creation and cleanup processes. The charting libraries used are from Google Visualization. The report is an HTML5 web page that is styled with CSS and printed from the browser in high quality.

Unicode compliance, needed to produce a bilingual report, was key to this design. The current code repository is on GitHub.

So what next?

  • We will be reporting particularly on Library Infrastructure and Learning Outcomes in these constituencies.
  • In terms of a tech stack, we need a rigorous reporting tool that seamlessly handles Unicode and print formats/styling, and can zoom in and out on the aggregated data based on a selected geography.
  • We intend to integrate these reports into our map page, should shape files and proper geocodes become available for all these constituencies. We also need a map-maths based method to determine the geographic neighbours of a ward or constituency.

There's a lot of work ahead of us, even though these are seemingly simple wants and needs. We invite programming and visualisation enthusiasts to help us do this task better. We welcome all constructive criticism and input from the data communities at dev [@] klp.org.in


Netneutralitymap.org – converting 700Gb of data to a map.

Michael Bauer - September 6, 2012 in Data Stories


In this post, Michael Bauer explains how he wrangled data from Measurement Lab in order to create this visualisation of net neutrality, which maps interference with end-user internet access.

Net neutrality describes the principle that data packets should be transported as well as possible over the internet, without discrimination as to their origin, application or content. Whether or not we need legal measures to protect this principle is currently under discussion in the European Union and in the US.

The European Commission has frequently stated that there is no data to demonstrate the need for regulation. However, this data does exist: the Measurement Lab initiative has collected large amounts of data from tests designed to detect interference with end-user internet access. I decided to explore whether we could use some of the data and visualize it. The result? netneutralitymap.org.

Data acquisition: Google Storage

Measurement Lab releases all of its data under CC0 licenses. This allows other researchers and curious people to play with the data. The datasets are stored as archives on Google Storage, and Google has created a set of utilities to retrieve them: gsutil. If you are curious about how to use it, look at the recipe in the School of Data handbook.

All the data that I needed was in the archives – alongside a lot of data that I didn't need. So I ended up downloading 2Gb of data for a single day of tests, and actually only using a few megabytes of it.

Data Wrangling: Design decisions, parsing and storage

Having a concept of how to do things when you are starting out always pays. I wanted to keep a slim backend for the data and do the visualizations in the browser. This pretty much reduced my options to either having a good API backend or serving JSON. Since I intend to update the data once a day, I only need to analyze and generate the data for the frontend once a day, so I decided to produce static JSON files using a specific toolchain.

For the toolchain I chose Python and PostgreSQL: Python for parsing, merging and so on, and Postgres for storage and analysis. Using SQL-based databases for analysis pays off as soon as you get a lot of data – and I expected a lot. SQL is considered to be slow, but it is a lot faster than Python.

The first thing to do was parsing. The test I selected was Glasnost, a test suite that emulates different protocols to transfer data and tries to detect whether these protocols are being shaped. Glasnost stores very verbose logfiles which state the results in a nicely human-readable format, so I had to write a custom parser for them. There are many ways of writing parsers – I recently decided to use a more functional style, using Python's itertools and treating the file as a stream. The parser simply fills up the SQL database. But there is one more function: since we want to be able to distinguish countries, the parser also looks up the country belonging to the IP address of each test, using pygeoip and the GeoLite GeoIP database.

Once the table was filled, I wanted to know which providers the IPs of the test clients belonged to. So I added an additional table and started to look up ASNs. Autonomous System Numbers are numbers assigned to a block of internet addresses, which tell us who currently owns the block. To look them up I used the Python module cymruwhois (which queries whois.cymru.com for information). Once this step was complete, I had all the data I needed.
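
Put together, the per-record enrichment looks roughly like the sketch below. It assumes the GeoLite country database file has been downloaded locally; pygeoip and cymruwhois are the modules named above, but the exact calls should be checked against their documentation rather than taken from here:

# Sketch of the per-IP enrichment; assumes the GeoLite country database file
# GeoIP.dat has been downloaded. Check the pygeoip/cymruwhois docs for the exact APIs.
import pygeoip
from cymruwhois import Client

geo = pygeoip.GeoIP("GeoIP.dat")
whois = Client()

def enrich(ip):
    country = geo.country_code_by_addr(ip)
    record = whois.lookup(ip)           # queries whois.cymru.com
    return {"ip": ip, "country": country, "asn": record.asn, "owner": record.owner}

print(enrich("8.8.8.8"))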

Analysis: Postgres for the win!

Once all the data is assembled, the analysis needs to be done. The Glasnost team previously used a quite simple metric to determine whether interference was present or not, and I decided to use the same one. I created some new tables in Postgres and started working on the queries. A good strategy is to do this iteratively – figure out your subqueries and then join them together in a larger query. This way, things like:

INSERT INTO client_results
SELECT id,ip,asn,time,cc,summary.stest,min(rate) FROM client INNER JOIN
(SELECT test.client_id,test.test,max(rxrate)/max(txrate) AS
rate,mode,test.test AS stest FROM result,test WHERE test_id=test.id
GROUP BY test.client_id,mode,test.port,test.test HAVING max(txrate)>0) 
summary ON summary.client_id=id WHERE id NOT IN (SELECT id FROM client_results)
GROUP BY client.id,client.asn,client.time,client.cc,client.ip,summary.stest;

don't seem too scary. In the map I wanted to show the percentage of tests in which interference with a user's internet connection took place, both by country and by provider. The total number of tests, the number of tests in which interference was detected, and the percentage of 'interfered with' tests are thus calculated for each country and for each provider. The Glasnost suite offers multiple tests for different applications, so the results are then further broken down by application. Since this runs once a day I didn't worry too much about performance; with all the data, calculating the results takes a couple of minutes – so no realtime queries here.

The next step is to simply dump the results as JSON files. I used Python's json module for this and it turned out to work beautifully.
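
That last step really is as simple as it sounds. A sketch of the kind of dump involved – the structure and values of results here are illustrative assumptions, not the project's actual output format:

# Sketch: dump per-country interference rates as a static JSON file for the frontend.
import json

results = {"AT": 12.5, "DE": 8.3, "US": 15.1}   # illustrative values, not real results

with open("countries.json", "w") as f:
    json.dump(results, f)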

Visualization: Jvectormap

For the visualization I imagined a simple choropleth map with color-coded values of interference, and I looked around for how to do it. Using OpenStreetMap/Leaflet seemed too cumbersome, but on the way I stumbled across jVectorMap – a jQuery-based map plugin – and decided to use it. It simply takes data in the form of {"countrycode": value} and displays it, and it also takes care of the color-coding etc. A detailed recipe on how to create similar maps can be found in the School of Data Handbook. Once I had the basic functionality down – e.g. displaying country details when a country is clicked – it was time to call in my friends. Getting feedback early helps when developing something like the map. One of my friends is a decent web designer, so he looked at it and immediately replaced my functional design with something much nicer.

Things I’ve learned creating the project:

  • SQL is amazing! Creating queries by iteration makes things easier and results in mind-boggling queries
  • Displaying data on maps is actually easy (this is the first project where I have done so).
  • 700 Gb of data is a logistical challenge (I couldn't have done it from home – thanks to friends at netzfreiheit.org for giving me access to their server)

If you’re interested in the details: check out the github repository.

Michael Bauer is a new data wrangler with the OKFN. He'll be around schoolofdata.org and OKFN Labs. In his free time he likes to wrangle data from advocacy projects and research.

