
Cursive Data: Drawing by Robots

Lucy Chambers - June 24, 2013 in Events

CursiveData.co.uk converts real-time data into slowly growing online visualisations that are simultaneously brought to life by drawing robots. Robot artists use the new website to turn live data into art.

A Bristol-based engineer has found a new way to bring data to life – with robots. Since last year Matthew Venn’s ‘Polarbots’ have been drawing pictures on walls in response to energy use – the pictures change as the power goes up or down. But Venn realised any live data could be visualised this way and now with help from Bristol Council, he and his collaborators have built a website that transforms data into art in real-time: www.CursiveData.co.uk. The ever-changing pictures that result can be drawn by Polarbots or embedded in websites.

“The big difference between cursive data and other types of data visualisation is that cursive data provides continually growing visualisations, ones that respond to change.” – Matthew Venn

Any live data can drive Cursive Data visualisations but Venn’s focus is still on energy. He says, “energy is abstract and invisible but it is important that we notice it”. The team hopes that their live data art will help people waste less energy and be more aware of their greenhouse gas emissions. A grant from Bristol Council has allowed Venn to build a public Polarbot and work with a local artist to develop new ways to respond to energy data. A diverse team of programmers and engineers have collaborated on the project.


A robot drawing numbers of tweets at the Newcastle Maker Faire

Venn will be talking about the project at Dorkbot (18 June 2013) and Pervasive Media Studio (21 June 2013). Programmers and artists will come together to develop new ‘pattern generators’ for www.CursiveData.co.uk at an event on 7 July 2013.

How www.CursiveData.co.uk works

Cursive Data collects data from any web-based data source; Venn's own home energy monitor is connected this way. When enough data has been collected it is passed to a pattern generator that adds new elements to the continually changing image. The pattern generators can be programmed to react in many different ways. One of the early patterns places leaves on a tree, with the size of each leaf corresponding to the energy being generated by a solar power array that Cursive Data is connected to.

How Polarbots work

Polarbots draw by hanging a pen from wires that are controlled by motors. Their movement is determined by a Cursive Data 'pattern generator'. As they move they can pull the pen across the wall, or lift it off the wall.
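The geometry is simple enough to sketch: if the two wires hang from anchors a known distance apart, positioning the pen is just a matter of setting each wire's length. Below is a minimal illustration in Python – not Venn's actual firmware; the anchor width and the coordinate convention are assumptions:

import math

W = 2.0  # distance between the two wire anchors in metres (made-up value)

def wire_lengths(x, y):
    """Wire lengths that place the pen at (x, y), measured from the
    left anchor with y pointing down the wall."""
    left = math.hypot(x, y)        # Pythagoras from the left anchor
    right = math.hypot(W - x, y)   # ... and from the right anchor
    return left, right

print(wire_lengths(1.0, 1.5))  # pen centred, 1.5 m below the anchors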


DDJSchool Tutorial: Analysing Datasets with Tableau Public

Lucy Chambers - April 27, 2013 in Events, HowTo

This tutorial is written by Gregor Aisch, visualization architect and interactive news developer, based on his workshop, Data visualisation, maps and timelines on a shoestring. The workshop is part of the School of Data Journalism 2013 at the International Journalism Festival.


Pre-requisites

  1. Download and install Tableau Public. At the time of writing, only a Windows version is available.
  2. Download the dataset eurostat-youth.csv

Loading a CSV file

  1. Click Open data to open the data import window. From the list on the left pick Text File and select eurostat-youth.csv. Make sure that Field separator is set to "Comma". Click OK to proceed.

[Screenshot: the Tableau CSV import dialog]

  2. Note: if this step fails with an error message, try changing your system region to English in the Windows control panel (see screenshot). It seems that Tableau cannot parse comma-separated values if a comma is set as the decimal separator for numbers in the system settings.
  3. Tableau now lists all the columns of the table in the data panel on the left. The columns are classified into Dimensions and Measures.

[Screenshot: Tableau's initial view with the data panel on the left]

  4. The dataset contains the following columns (all data is from 2011 and aggregated at NUTS-2 level; for a quick way to sanity-check these columns outside Tableau, see the sketch after this list):
  • secondary_edu: percentage of population with secondary education
  • youth_unemployed: percentage of people aged 18 to 24 who are unemployed and not in education or training.
  • unemployed_15_24M: percentage of unemployed males between 15 and 24.
  • unemployed_15_24F: percentage of unemployed females between 15 and 24.
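As referenced above, a couple of lines of pandas let you check the same columns outside Tableau. This is a side note, not part of the workshop; it assumes the CSV parses with the default comma separator:

import pandas as pd

df = pd.read_csv("eurostat-youth.csv")
print(df.columns.tolist())  # should include the columns listed above

# The scatterplot built in the steps below suggests a negative correlation
# between education and youth unemployment; this prints the coefficient.
print(df["secondary_edu"].corr(df["youth_unemployed"]))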

Analysing a dataset

Now we are going to analyze the dataset using Tableau.

  1. Now drag the field youth_unemployed from Measures to Columns. Then drag secondary_edu to Rows.

[Screenshot: youth_unemployed on Columns and secondary_edu on Rows]

  2. As you see, Tableau computes the sums of the columns instead of plotting the individual values. To fix this we need to right-click the green fields and select Dimension.

[Screenshot: switching the fields from measures to dimensions]

  3. Once both fields are set to be treated as dimensions, you should see a scatterplot like the one in the following screenshot. You can see that there is a negative correlation between education and youth unemployment.

[Screenshot: the resulting scatterplot]

  4. Now drag the field country from Dimensions to Color to color the plot symbols by country. You can also drag country to Shape to change the icon.

[Screenshot: the scatterplot colored by country]

  5. Add the fields country and geo_name to the Detail mark to include that piece of information in the tooltips.
  6. Now you can use the color legend and quick filters to highlight and hide certain countries.
  7. Focus on Turkey.
  8. Plotting unemployment by gender.

Bonus: creating a map with Tableau

  1. Now we can create a map easily: select the dimensions lat and lon together with the measure count (while holding the Ctrl key) and click on Show Me to expand the list of suggested visualizations. Then click on the icon of the map with the blue circles. Click on Show Me again to hide the panel.

[Screenshot: the Show Me panel with suggested visualizations]

  2. Now you should already see the complete map. Tableau is smart enough to use the square roots of the counts for the circle radii automatically, so we don't have to take care of this ourselves.
  3. You can make the circles transparent by clicking on Color in the Marks panel and moving the transparency slider. To change the size of the circles, click on Size and adjust the slider.
  4. Now drag the field name to Label in the Marks panel to add the city names as labels to the map.

Note: The final steps of this tutorial are going to be added in the coming days. 

Enjoyed this? Want to stay in touch? Join the School of Data Announce Mailing List for updates on more training activities from the School of Data or the Data Driven Journalism list for discussions and news from the world of Data Journalism.


Creating a Map Using QGis

Lucy Chambers - April 27, 2013 in Events, HowTo

This is the second of the tutorials from the hands-on visualisation session from Gregor Aisch at the School of Data Journalism at the International Journalism Festival in Perugia. In this tutorial we will create a simple map of the Tour de France stations of the last 100 years.

Prerequisites

  1. Install and configure QGIS. Install it from http://qgis.org – on most systems there is a one-click installer that guides you through the process.

  2. Install the following handy plugins:

  • Add Delimited Text Layer allows us to read and plot points from a CSV file.

  • Edit Any Layer allows us to easily edit CSV layers.

  To install them, click Plugins > Fetch Python Plugins in the menu. In the dialog that appears, type edit any into the filter box to narrow down the list. Select the plugin and click Install/upgrade plugin. Repeat the same for Add Delimited Text Layer.

  3. Download the country shapefile from naturalearthdata.com. We are looking for ne_50m_admin_0_countries.

  4. Download our sample dataset from http://vis4.net/perugia13/tour-de-france.csv.

Creating the base map layer

  1. Click Layer > Add Vector Layer > Browse and select the file ne_50m_admin_0_countries.shp. That's the shapefile containing the borders of all countries. Click Open to finally add it to the map.

  2. Filter for countries with the ISO code of France. Right-click the layer and select Query from its context menu. In the text box SQL where clause enter: ISO_A3 = 'FRA'. Make sure to use single quotes, as double quotes are reserved for addressing column names. Click OK to apply the filter.

  3. Zoom to Metropolitan France. You can simply use the Zoom In tool and draw a rectangle around France.

  4. You might have noticed by now that France looks rather compressed. That is because by default QGIS uses the Plate Carrée projection (nerdily referred to by its EPSG code, EPSG:4326). You can change the projection by clicking the projection icon in the lower right of the window:

  5. In the dialog that opens, activate the checkbox next to "Enable 'on the fly' CRS transformation". Then enter France in the filter text field to search for map projections specialised for France. For instance you can pick ED50 / France EuroLambert. Click OK to activate the projection.

  6. Let’s change the default styling. Again, right-click the layer and select Properties. The next dialog should be opened with the Style tab selected by default. Click the button Change… to change the layer style.

  7. Now we are going to disable the filling by selecting No Brush in Fill style drop-down. Change the border color to red and increase the border width to 1. Click OK to apply the styling.

  8. By now the resulting map should look like this:

Adding the Tour de France stations

  1. Add delimited layer (CSV of tour de france stations). Click Layer > Add Delimited Text Layer. (If this option is not available, please make sure you installed the corresponding plugin in the first step.) Then click Browse… and select the file tour-de-france.csv that we downloaded previously.

  2. QGIS is smart enough to recognize the format of the CSV file, and it even detects that the columns named lat and lon probably contain the map coordinates. All we need to do is to click OK.

  3. Now QGIS will ask you in what reference system (=projection) the provided coordinates are given. In most cases you will need to pick WGS 84 or EPSG:4326. Just type 4326 in the filter box and select WGS 84. Click OK to finish.

  4. Now our map contains all the locations of Tour de France stations:

  5. Now we are going to size the stations according to how often they have been part of the tour. Right click the layer tour-de-france and select Properties in the context menu.

  6. Change the value in the Size field to a lower value such as 0.5.

  7. Now click Advanced > Size scale field > count to let QGIS use the values in the column count as radius for the symbols.

  8. You might also want to make the symbols more transparent by moving the Transparency slider to 50%.
    Your map should now look like this:

  9. Since we must always size symbols by area, not radius, we now need to correct our map. As the area of a circle grows with the square of its radius, we need to compute the square roots of the counts to get proper radii.

  10. Usually you would have done this already during the data preparation phase and simply stored another column in the CSV file; you can also just load the CSV into a spreadsheet tool like Excel and add a new column with the square roots of the counts. However, you can do this in QGIS too, using the Edit Any Layer plugin.


  11. In the menu Plugins select Edit Any Layer > Create Editable Layer. Select tour-de-france as input layer and choose a name for the output layer. I will simply use tour-de-france-2 here. Click OK to proceed.

  12. You will be asked for the coordinate system again. WGS84 should be selected by default, so simply clicking OK should work.

  13. Now open the attribute table by right-clicking the new layer and selecting Open Attribute Table in the context menu. You will now see all the data stored in the CSV. Activate editing mode by clicking on the little blue pencil icon (see screenshot). Then open the field calculator by clicking on the little calculator icon.

  14. Make sure that Create a new field is checked and enter a meaningful name for the new column, e.g. radius. As the square roots are going to be decimal numbers, select Decimal number (real) as Output field type. Finally enter the following formula into the Expression text field: sqrt(count). The dialog should now look as shown in the following screenshot. Click OK to proceed.

  15. Back in the attribute table you can take a look at the new column (you may have to scroll the table to the right). Now deactivate editing mode by clicking on the blue pencil icon again. QGIS will ask whether you want to save the changes. Click Save, and close the attribute table.

  16. Now hide the tour-de-france layer we created earlier by deactivating its checkbox in the layer window on the left. Then repeat the sizing steps above with the new layer (tour-de-france-2), but pick the column radius instead of count for sizing the symbols.

  17. If you like, change the color to blue and set the transparency to 50%. Finally the map should look like this:

Exporting to PDF

In the last section we are going to export our map to PDF.

  1. In the menu click File > New Print Composer. The print composer allows us to set up a print layout with our map. Initially the page is empty, but we are going to change this by clicking the icon for Add new map (1) and dragging a rectangle onto the page (2):

  2. Optionally you can disable the black frame by disabling the checkbox General options > Show frame in the panel on the right.

  3. Now in the menu click on File > Export as PDF… to finally save the map as PDF. You can now open the map in other graphic tools such as Illustrator to do some fine tuning (adding title, labels etc).


Enjoyed this? Want to stay in touch? Join the School of Data Announce Mailing List for updates on more training activities from the School of Data or the Data Driven Journalism list for discussions and news from the world of Data Journalism.


Data Wrapper Tutorial – Gregor Aisch – School of Data Journalism – Perugia

Gregor Aisch - April 27, 2013 in Events, HowTo

By Gregor Aisch, visualization architect and interactive news developer, based on his workshop, Data visualisation, maps and timelines on a shoestring. The workshop is part of the School of Data Journalism 2013 at the International Journalism Festival.

This tutorial goes through the basic process of creating simple, embeddable charts using Datawrapper.

Preparing the Dataset

  1. Go to the Eurostat website and download the dataset Unemployment rate by sex and age groups – monthly average as Excel spreadsheet. You can also directly download the file from here.
  2. We now need to clean the spreadsheet. Make a copy of the active sheet to keep the original for reference. Now remove the header and footer rows so that GEO/TIME is stored in the first cell (A1).
  3. It's a good idea to limit the number of shown entries to around ten or fifteen, since otherwise the chart gets too cluttered. Our story will be about how Europe is divided by unemployment rate, so I decided to remove everything but the top-3 and bottom-3 countries plus some reference countries of interest in between. The final dataset contains the countries: Greece, Spain, Croatia, Portugal, Italy, Cyprus, France, United Kingdom, Norway, Austria, Germany.
  4. Let's also try to keep the labels short. For Germany we can remove the appendix "(until 1990 former territory of the FRG)", since it wouldn't fit in our chart.
  5. This is how the final dataset looks in OpenOffice Calc (a scripted version of the same clean-up is sketched below):

[Screenshot: the prepared dataset in OpenOffice Calc]
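For reference, here is a minimal sketch of the same clean-up in Python with pandas. The file name and the numbers of header and footer rows are assumptions – check them against your actual Eurostat download:

import pandas as pd

# The numbers of header/footer rows are guesses; adjust after eyeballing the file.
raw = pd.read_excel("eurostat-unemployment.xls", skiprows=4)
raw = raw[:-5]

# Shorten the long label for Germany first, then keep only our countries.
raw["GEO/TIME"] = raw["GEO/TIME"].str.replace(
    " (until 1990 former territory of the FRG)", "", regex=False)
keep = ["Greece", "Spain", "Croatia", "Portugal", "Italy", "Cyprus",
        "France", "United Kingdom", "Norway", "Austria", "Germany"]
raw[raw["GEO/TIME"].isin(keep)].to_csv("unemployment-clean.csv", index=False)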

Loading the Data into Datawrapper

  1. Now, to load the dataset into Datawrapper you can simply copy and paste it. In your spreadsheet software look for the Select All function (e.g. Edit > Select All in OpenOffice).
  2. Copy the data into the clipboard by either selecting Edit > Copy from the menu or pressing Ctrl + C (for Copy) on your keyboard.
  3. Go to datawrapper.de and click the link Create A New Chart. You can do this either logged in or as a guest. If you create the chart as a guest, you can add it to your collection later by signing up for free.
  4. Now paste the data into the big text area in Datawrapper. Click Upload and continue to proceed to the next step.

[Screenshot: pasting the data into Datawrapper]

Check and Describe the Data

  1. Check whether the data has been recognized correctly. Things to check for are the number format (in our example the decimal separator , has been replaced with .) and whether the row and column headers have been recognized.
  2. Change the number format to one decimal after the point to ensure the data is formatted according to your selected language (e.g. a decimal comma for France).
  3. Now provide information about the data source. The data has been published by Eurostat. Provide the link to the dataset as well. This information will be displayed along with the published charts, so readers can trace back the path to the source themselves.

[Screenshot: the data source fields in Datawrapper]

  4. Click Visualize to proceed to the next step.

 

Selecting a Visualization

  1. Time series are best represented using line charts, so click on the icon for line chart to select this visualization.
  2. Give the chart a title that explains both what the readers are seeing in the chart and why they should care about it. A title like "Youth unemployment rates in Europe" only answers half of the question. A better title would be "Youth unemployment divides Europe" or "Youth unemployment at record high in Greece and Spain".
  3. In the introduction line we should clarify what exactly is shown in the chart. Click Introduction and type "Seasonally adjusted unemployment rates of people aged under 25". Of course you can also provide more details about the story.
  4. Now highlight the data series that are most important for telling the story. The idea is to let one or two countries really pop out of the chart and attract the reader's attention immediately. Click Highlight and select Greece and Spain from the list. You might also want to include your own country for reference.
  5. Activate direct labeling to make it easier to read the chart. Also, since our data is already widely spread, we can force the vertical axis to extend to the zero baseline.
  6. We can let the colors support the story by choosing them appropriately. First, click on the orange field to select it as the base color. Then click on define custom colors and pick red for the high-unemployment countries Greece and Spain. For countries with low youth unemployment such as Germany, Norway and Austria we can pick a green or, even better, a blue tone (out of respect for the color blind). The resulting chart should now look like this:

[Screenshot: the finished line chart]

  7. Click Publish to proceed to the last step.

 

Publishing the Visualization

  1. Now a copy of the chart is being pushed to the content delivery network Amazon S3, which ensures that it loads fast under high traffic.
  2. Meanwhile you can already copy the embed code and paste it into your newsroom's CMS to include the chart in the related news article – just like you would do with a YouTube video.

 

Further tutorials can be found on the Datawrapper website.

Enjoyed this? Want to stay in touch? Join the School of Data Announce Mailing List for updates on more training activities from the School of Data or the Data Driven Journalism list for discussions and news from the world of Data Journalism.


Data visualization guidelines – By Gregor Aisch – International Journalism Festival

Gregor Aisch - April 26, 2013 in Data Blog, Events

The following tips are from Gregor Aisch, visualization architect and interactive news developer. We're delighted he could join us to lead the workshop "Making Data Visualisations: A Survival Guide" here at the International Journalism Festival in Perugia.

Watch the Video

See the slides

Charts

  • Avoid 3D charts at all costs. The perspective distorts the data: what is displayed 'in front' is perceived as more important than what is shown in the background.
  • Use pie charts with care, and only to show part-of-whole relationships. Two is the ideal number of slices; never show more than five. Don't use pie charts if you want to compare values (use bar charts instead).
  • Always extend bar charts to the zero baseline. Order bars by value to make comparison easier.
  • Use line charts to show time series data. That's simply the best way to show how a variable changes over time.
  • Avoid stacked area charts; they are easily misinterpreted.
  • Prefer direct labeling wherever possible. You can save your readers a lot of time by placing labels directly onto the visual elements instead of collecting them in a separate legend. Also remember that we cannot differentiate that many colors.
  • Label your axes! You might think that's obvious, but designers and journalists still forget to label the axes quite often.
  • Tell readers why they should care about your graphic. Don't waste the title line by just saying what data is shown.

Color

Colors are difficult. They might make a boring graphic look pretty, but they really need to be handled with care.

  • Use colors sparingly. If possible, use only one or two colors in your visualization.
  • Double-check your colors for the color blind. You can use tools such as ColorOracle to simulate the effect of different types of color blindness.
  • Say goodbye to red-green color scales. A significant fraction of the male population is color blind and has problems differentiating between red and green tones. Red-blue or purple-green scales are common alternatives.
  • When in doubt, use color scales from colorbrewer2.com.

Maps

  • Don’t use the Mercator projection for world maps. The distortion of area is not acceptable. Use equal-area projections instead.
  • Size symbols by area, not diameter. A common mistake is to map data values to the radius of circles. However, our visual system compares symbols by area. Use the square root to compute radii from data (see the sketch below).
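A two-line check makes the point. If one value is four times another and you scale the radius directly, the bigger circle gets sixteen times the area; taking the square root first keeps the area ratio at the honest four:

import math

values = [10, 40]  # the second value is 4x the first

def area(r):
    return math.pi * r ** 2

r_wrong = values                          # radius proportional to value
r_right = [math.sqrt(v) for v in values]  # radius = sqrt(value)

print(area(r_wrong[1]) / area(r_wrong[0]))  # 16.0 -- exaggerates
print(area(r_right[1]) / area(r_right[0]))  # 4.0  -- faithful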

Recommended reading


Slides, Tools and Other Resources From the School of Data Journalism 2013

Liliana Bounegru - April 26, 2013 in Data Blog, Events

The School of Data Journalism, Europe's biggest data journalism event, brings together around 20 panelists and instructors from Reuters, the New York Times, Spiegel, the Guardian, the Walter Cronkite School of Journalism, Knight-Mozilla OpenNews and others, in a mix of discussions and hands-on sessions focusing on everything from cross-border data-driven investigative journalism to emergency reporting and using spreadsheets, social media data, data visualisation and mapping for journalism.

In this post we will be listing links shared during this training event. The list will be updated as the sessions progress. If you have links shared during the sessions that we missed, post them in the comments section and we will update the list.

Video recordings 

Panel Discussions

Workshops

Slides, tutorials, articles

Coverage

Tools and other resources

  • Source – an index of news developer source code, code walkthroughs and project breakdowns from journalist-coders
  • School of Data – online tutorials for working with data
  • The Data Journalism Handbook – a reference book about how to use data to improve the news, authored by 70 data journalism practitioners and advocates
  • Open Refine – for data cleaning
  • Gephi – for graph visualisations
  • Hashtagify – visualisation of Twitter hashtags related to a particular #tag
  • Investigative Dashboard – methodologies, resources, and links for journalists to track money, shareholders, and company ownership across international borders
  • Tabula – an open-source application that allows users to upload PDFs and extract the data in them in CSV format
  • Topsy – social media analysis tool mentioned in the panel on covering emergencies
  • DataSift – mentioned in the panel on covering emergencies
  • Storyful – the social media newswire for newsrooms
  • GeoFeedia – search and monitor social media by location
  • Spokeo – "a people search engine that organizes White-pages listings, Public Records and Social Network Information to help you safely find & learn about people"
  • The Tor project – useful in environments likely to suffer from censorship. Tor is free software and an open network that helps you defend against traffic analysis, a form of network surveillance that threatens personal freedom and privacy, confidential business activities and relationships, and state security

Projects and organisations

Enjoyed this? Want to stay in touch? Join the School of Data Announce Mailing List for updates on more training activities from the School of Data or the Data Driven Journalism list for discussions and news from the world of Data Journalism.


Social network analysis for journalists using the Twitter API

Lucy Chambers - April 25, 2013 in Events, HowTo

Note: due to changes in the Twitter API, the first part of this course no longer works. We're sorry for this.

Social network analysis allows us to identify the players in a social network and how they are related to each other. For example: I want to identify people who are involved in a certain topic – either to interview them or to understand what different groups are debating.

What you’ll Need:

  1. Gephi (http://gephi.org)
  2. OpenRefine (http://openrefine.org)
  3. The Sample Spreadsheet
  4. Another sample Dataset
  5. Bonus: The twitter search to graph tool

Step 1: Basic Social Networks

Throughout this exercise we will use Gephi for graph analysis and visualization. Let's start by getting a small graph into Gephi.

Take a look at the sample spreadsheet - this is data from a fictional case you are investigating.

In your country the minister of health (Mark Illinger) recently bought 500,000 respiration masks from a company (Clearsky-Health) during a flu scare that turned out to be unsubstantiated. The masks were never used and are rotting away in the basement of the ministry. During your investigation you found that, during the period of this deal, Clearsky-Health was advised by Flowingwater Consulting, a consulting company owned by Adele Meral-Poisson, and paid them a large sum for their services. Adele Meral-Poisson is a well-known lobbyist and the wife of Mark Illinger.

While we don't need to apply network analysis to understand this fictional case, it helps in understanding the sample spreadsheet. Gephi is able to import spreadsheets like this through its CSV import. Let's do this.

Walkthrough: Importing CSV into Gephi

  1. Save the Sample Spreadsheet as CSV (or click Download as → Comma-separated values if you are using a Google spreadsheet)
  2. Start Gephi
  3. Select File → Open
  4. Select the CSV file saved from the sample spreadsheet.
  5. You will get an import report – check whether the number of nodes and edges seems correct and that no errors are reported

  6. The default values are OK for many graphs of this type. If the links between the objects in your spreadsheet are not unilateral but bilateral – e.g. lists of friendships, relationships etc. – select Undirected instead of Directed.
  7. For now we'll go with directed, so click "OK" to import the graph.

Now that we have imported our simple graph and can already see some things on the screen, let's make it a little nicer by playing around with Gephi a bit.

Walkthrough: Basic layout in Gephi

See the grey nodes there? Let's make this graph a little easier to read:

  1. Click on the big fat “T” on the bottom of the graph screen to activate labels

  2. Let's zoom a bit: click on the button on the lower right of the graph window to open the larger menu

  3. You should see a zoom slider now; slide it around to make your graph a little bigger:

  4. You can click on individual nodes and drag them around to arrange them more nicely.

Step 2: Getting data out of Twitter

Now that we have this, let's get some data out of Twitter. We'll be using the Twitter search for a particular hashtag to find out who talks about it, with whom, and what they talk about. Twitter offers loads of information through its search API; the documentation is here: https://dev.twitter.com/docs/api/1/get/search

It basically all boils down to requesting https://search.twitter.com/search.json?q=%23tag (the %23 is the # character URL-encoded, so %23ijf corresponds to #ijf). If you open the link in the browser you will get the data in JSON format – a format that is ideal for computers to read, but rather hard for you. Luckily Refine can help with this and turn the information into a table. (If you've never worked with Refine before, consider having a quick look at the cleaning data with Refine recipe at the School of Data: https://schoolofdata.org/handbook/recipes/cleaning-data-with-refine/)
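For the curious, the same request scripted in Python would look roughly like the sketch below. Keep in mind the note at the top of this post: this old unauthenticated v1 endpoint has since been retired, so treat it as an illustration of the JSON-to-table idea, not working code.

import json
import urllib.request

url = "https://search.twitter.com/search.json?q=%23ijf"  # %23 is '#'
with urllib.request.urlopen(url) as response:
    data = json.load(response)

# In the v1 format the tweets sat under a "results" key.
for tweet in data["results"]:
    print(tweet["from_user"], "->", tweet["text"])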

Walkthrough: Getting JSON data from web APIs into Refine

  1. Open Refine
  2. Click Create Project
  3. Select "Web Addresses"
  4. Enter the following URL: https://search.twitter.com/search.json?q=%23ijf – this searches for the #ijf hashtag on Twitter.
  5. Click on “Next”
  6. You will get a preview window showing you nicely formatted JSON:

  7. Hover over the curly bracket inside results and click it – this selects the results as the data to import into a table.
  8. Now name your project and click "create project" to get the final table

By now we have all the tweets in a table. You see there is a ton of information on each tweet: we're interested in who communicates with whom and about what, so the columns we care about are the "text" column and the "from_user" column – let's delete all the others. (To do so use "All → Edit Columns → Remove/reorder Columns".)

The from_user column is stripped of the characteristic @ in front of the username that is used in tweets – since we want to extract usernames from tweets later, let's add a new column that holds from_user as @username. This will involve a tiny bit of programming – don't be afraid, it's not rocket science.

Walkthrough: Adding a new column in Refine

  1. On your from_user column select "Edit column → Add column based on this column..."

  2. Whoah – Refine wants us to write a little code to tell it what the new column looks like.
  3. Let's program then: later on we'll do something the built-in programming language doesn't let us do; luckily Refine offers two alternatives, Jython (basically Python) and Clojure. We'll go for Clojure as we'll need it later.
  4. Select Clojure as your language.
  5. We want to prepend "@" to each name (here "value" refers to the value in each row).
  6. Enter (str "@" value) into the expression field

  7. See how the value has been changed from peppemanzo to @peppemanzo – what happened? In Clojure, str can be used to combine multiple strings: (str "@" value) therefore combines the string "@" with the string in value – exactly what we wanted to do.
  8. Now simply name your column (e.g. "From") and click OK; you will have a new column

OK, we got the first part of our graph: the from user. Now let's see what the users talk about. While this will get a lot more complicated, don't worry – we'll walk you through it.

Walkthrough: Extracting Users and Hashtags from Tweets

  1. Let's start with adding a new column based on the text column.
  2. The first thing we want to do is to split the tweet into words – we can do so by entering (.split value " ") into the expression field (make sure your language is still Clojure)

  3. Our tweet now looks very different – it has been turned into an "Array" of words (an Array is simply a collection; you can recognize it by the square brackets).
  4. We don't actually want all the words, do we? We only want those starting with @ or # – users and hashtags (so we can see who's talking with whom about what) – so we need to filter our array.
  5. Filtering in Clojure works with the filter function. It takes a filter-function and an array – the filter-function simply determines whether a value should be kept or not. In our case the filter-function is #(contains? #{\# \@} (first %)) – looks like comic-book characters swearing? Don't worry: contains? basically checks whether something is in something else; here, whether the first character of the value ((first %)) is either # or @ (#{\# \@}) – exactly what we want. Let's extend our expression:

  6. Whoohaa, that seems to have worked! Now the only thing we need to do is to create a single value out of it. Remember we can do so by using str as above.
  7. If we do this straight away we run into a problem: before, we used str as (str "1st" "2nd"); now we want to do (str ["1st" "2nd"]) because we have an array. Clojure helps us here with the apply function: (apply str ["1st" "2nd"]) behaves like (str "1st" "2nd"). Let's do so...

  8. That seems to have worked. Do you spot the problem though?
  9. Exactly – the words are joined without a clear separator, so let's add one. The easiest way is to interpose a character (e.g. a comma) between all the elements of the array; Clojure does this with the interpose function: (interpose "," [1 2 3]) turns out to be (1 "," 2 "," 3). Let's extend our formula:

  10. So our final expression is:

(apply str (interpose "," (filter #(contains? #{\# \@} (first %)) (.split value " "))))

Looks complicated but remember, we built this from the ground up.
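If Clojure is not your thing, the same extraction in Python (roughly what you would write if you picked Jython in Refine instead) shows how little is going on:

def mentions_and_tags(value):
    # Keep the words starting with '@' or '#', joined by commas.
    return ",".join(w for w in value.split(" ") if w[:1] in ("@", "#"))

print(mentions_and_tags("RT @peppemanzo: great talks at #ijf today"))
# -> "@peppemanzo:,#ijf" (the stray ':' gets removed in the clean-up below)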

  11. Great – we can now extract who talks to whom! Name your column and click "OK" to continue

Now we have extracted who talks with whom, but the format is still different from what we need in Gephi. So let's clean up to get the data into the right format for Gephi.

Walkthrough: Cleaning up

  1. First, let's remove the two columns we don't need anymore: the text and the original from_user column. Do this with "All → Edit columns → Remove and reorder columns".
  2. Make sure your "from" column is the first column.

  3. Now, let's split up the to column so we have one row per entry: use "to → Edit cells → Split multi-valued cells" and enter "," as the separator

  4. Make sure to switch back to "rows" mode.
  5. Now let's fill the empty rows: select "from → Edit cells → Fill down".
  6. Notice that there are some characters in there that don't belong to names (e.g. ":"). Let's remove them.
  7. Select "to → Edit cells → Transform...".
  8. The transformation to replace them is (.replace value ":" "")

You've now cleaned your CSV and prepared it for Gephi – let's make some graphs! Export the file as CSV and open it in Gephi as above.
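The whole clean-up can also be scripted. A minimal sketch (with made-up rows) that writes the kind of two-column edge list Gephi's CSV importer expects:

import csv

# (from, to) pairs as they come out of the extraction step; sample data is made up.
tweets = [("@alice", "@bob:,#ddj"), ("@carol", "#ddj")]

with open("edges.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["from", "to"])
    for source, targets in tweets:
        for target in targets.split(","):           # split multi-valued cells
            writer.writerow([source, target.replace(":", "")])  # drop stray ':'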

A small network from a Twitter Search

Let's play with the network we got through Google Refine:

  1. Open the CSV file from Google Refine in Gephi
  2. Look around the graph - you’ll see pretty soon that there are several nodes that don’t really make sense: “from” and “to” for example. Let’s remove them
  3. Switch Gephi to the "data laboratory" view

  1. This view will show you nodes and edges found

  5. You can delete nodes by right-clicking on them (you could also add new nodes)
  6. Delete "from", "to" and "#ijf" – since this was the term we searched for, it's going to be mentioned everywhere
  7. Activate the labels: it's pretty messy right now, so let's do some layouting. To layout, simply select the algorithm in Layout and click "play" – see how the graph changes.
  8. Generally, combining "Force Atlas" with "Fruchterman-Reingold" gives nice results. Add "Label Adjust" to make sure text does not overlap.
  9. Now let's make some more adjustments – let's scale the labels by how often things are mentioned. Select label size in the ranking menu

  10. Select "Degree" as the rank parameter

  11. Click on "Apply" – you might need to run the "Label Adjust" layout again to avoid overlapping labels
  12. With this simple trick, we see which topics and persons are frequently mentioned

Great – but this approach has one downside: the data we're able to get via Google Refine is very limited, so let's explore another route.

A larger network from a Twitter search

Now that we have analyzed a small network from a search, let's deal with a bigger one. This one is from a week of searching for the Twitter hashtag #ddj (you can download it here).

The file is in gexf format - a format for exchanging graph data.

Walkthrough: Network analysis using Gephi

  1. Open the sample graph file in gephi
  2. Go to the Data view and remove the #ddj node
  3. Enable Node labels
  4. Scale labels by Degree (number of edges from this node)
  5. Apply "Force Atlas", "Fruchterman-Reingold" and "Label Adjust" (remember to stop the first two after a while).
  6. Now you should have a clear view of the network

  7. Now let's perform some analysis. One thing we are interested in is who is central and who's not – in other words, who is talking and who is talked to.
  8. For this we will run statistics (found in the Statistics tab on the right) – we will use the "Network Diameter" statistics first; they tell us about eccentricity, betweenness centrality and closeness centrality. Betweenness centrality tells us which nodes connect other nodes: in our terms, nodes with high betweenness centrality are communication leaders, while nodes with low betweenness centrality are topics.
  9. Now that we ran our test, we can color the labels accordingly. Select the label color ranking and "Betweenness Centrality"

  10. Pick colors as you like them – I prefer light colors and a dark background.

  11. Now let's do something different: let's try to detect the different groups of people who are involved in the discussion. This is done with the "Modularity" statistic.
  12. Color your labels using the "Modularity Class" – now you see the different clusters of people involved in the discussion

Now we have analyzed a bigger network and found the important players and the different groups active in the discussions – all by searching Twitter and storing the result.
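The same analysis can be reproduced outside Gephi. Here is a sketch using the networkx library – an assumption on our part, it is not used in the workshop, and the file name is hypothetical:

import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

g = nx.read_gexf("ddj.gexf")  # hypothetical file name

# Communication leaders: nodes with high betweenness centrality.
centrality = nx.betweenness_centrality(g)
for node, score in sorted(centrality.items(), key=lambda kv: -kv[1])[:10]:
    print(node, round(score, 3))

# Rough counterpart of Gephi's "Modularity Class" clusters.
for i, group in enumerate(greedy_modularity_communities(g.to_undirected())):
    print("cluster", i, sorted(group)[:5], "...")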

Bonus: Scraping the Twitter search with a small Java utility

If you have downloaded the .jar file mentioned above: it's a scraper extracting persons and hashtags from Twitter – think of what we did previously, but automated. To run it use:

java -jar twsearch.jar "#ijf" 0 ijf.gexf

This will search for #ijf on Twitter every 20 seconds and write the results to the file ijf.gexf – the gexf format is a graph format understood by Gephi. If you want to end data collection, press Ctrl-C. Simple, isn't it? The utility runs on Java but is written entirely in Clojure (the language we used to work with the tweets above).

Enjoyed this? Want to stay in touch? Join the School of Data Announce Mailing List for updates on more training activities from the School of Data or the Data Driven Journalism list for discussions and news from the world of Data Journalism.


Using Excel to do precision journalism, an Update from the School of Data Journalism in Perugia.

Lucy Chambers - April 24, 2013 in Events, HowTo



Our first workshop has just kicked off with Steve Doig leading “Excel for Journalists”.

If you missed it, don’t worry – here’s the breakdown for you!

Download the Data and the Tutorial

You can download the full data for this tutorial and a text version of the tutorial itself via Steve’s website.

The Tutorial

A tutorial by Steve Doig, journalism professor at ASU's Cronkite School and Pulitzer-winning data journalist, based on his workshop, Excel for Journalists. The workshop is part of the School of Data Journalism at the International Journalism Festival.

Microsoft Excel is a powerful tool that will handle most tasks that are useful for a journalist who needs to analyze data to discover interesting patterns. These tasks include:

  • Sorting
  • Filtering
  • Using math and text functions
  • Pivot tables

Introduction to Excel

Excel will handle large amounts of data that is organized in table form, with rows and columns. The columns (which are labeled A, B, C…) list the variables (like Name, Age, Number of Crimes, etc.) Typically, the first row holds the names of the variables. The rest of the rows are for the individual records or cases being analyzed. Each cell (like A1) holds a piece of data.

[Screenshot: a data table in Excel, with variables in columns and records in rows]

Modern versions of Excel will hold as many as 1,048,576 records with as many as 16,384 variables! An Excel spreadsheet also will hold multiple tables on separate sheets, which are tabbed on the bottom of the page.

[Screenshot: sheet tabs at the bottom of an Excel workbook]

Sorting

One of the most useful abilities of Excel is to sort the data into a more revealing order. Too often, we are given lists that are in alphabetical order, which is useful only for finding a particular record in a long list. In journalism, we usually are more interested in extremes: The most, the least, the biggest, the smallest, the best, the worst. 

Consider the data used in this workshop, a list of the provinces of Italy showing the number of various kinds of crimes reported during a recent year.  Here is how it looks sorted in alphabetical order of province name:

[Screenshot: the crime data sorted alphabetically by province]

Far more interesting would be to sort it in descending order of the total number of crimes, with the most crime-ridden city at the top of the list:

[Screenshot: the crime data sorted by total crimes, descending]

There are two methods of sorting. The first method is quick and can be used for sorting by a single variable. Put the cursor in the column you wish to sort by (“Delitti in totale” in this case) and then click the Z-A button:

[Screenshot: the Z-A sort button]

But beware! Put the cursor in the column, but DO NOT select the column letter (C, in this case) and then sort. Consider the example below:

[Screenshot: a single column selected before sorting]

Doing that will sort ONLY the data in that column, thereby disordering your data! Notice well how this can happen!

[Screenshot: the disordered result of sorting only one column]

The other method of sorting is for when you want to sort by more than one variable. For instance, suppose we wish to sort the crime data first by Territorio in alphabetical order, and then by "Delitti in totale" in descending order within each Territorio. To do that, go to the toolbar, click on "Data" and then "Sort…", and choose the variables by which you wish to sort. Then click "OK".

[Screenshot: the multi-variable Sort dialog]

The result will be this:

[Screenshot: the data sorted by Territorio, then by total crimes]

Filtering

Sometimes you want to examine only particular records from a large collection of data. For that, you can use Excel’s Filter tool. On the toolbar, go to “Data…Filter…Autofilter”. Small buttons will appear at the top of each column:

[Screenshot: Autofilter buttons at the top of each column]

Suppose we wish to see only the records from the territorio of Lazio. Click on the button on the Territorio column and choose Lazio from the list. This is the result:

[Screenshot: the data filtered to the territorio of Lazio]

Notice that you now are seeing only rows 36, 44, 78, 80 and 104.

More complicated filters are possible. For instance, suppose you wish to see only records in which “Delitti in totale” is greater than or equal to 50,000. Click on the button and choose “Custom Filter…”:

[Screenshot: the Custom Filter dialog]

You could also, for instance, choose records in which “Delitti in totale” is greater than 50,000 and “Omicidi” is less than or equal to 25.

Functions

Excel has many built-in functions useful for performing math calculations and working with dates and text. For instance, assume that we wish to calculate the total number of crimes in all the provinces. To do this, we would go to the bottom of Column C, skip a row, and then enter this formula in Cell C106: =SUM(C2:C104). The equals sign (=) is necessary for all functions. The colon (:) means "all the numbers from Cell C2 to Cell C104". The result is this:

[Screenshot: the column total computed with =SUM]

(The reason for skipping a row is to separate the sum from the main table so that the table can be sorted without pulling the sum into the table during the sorting operation. This way the sum will stay at the bottom of the column.)

Often you will want to do a calculation on each row of your data table. For instance, you might want to calculate the crime rate (the number of crimes per 100,000 population), which would let you compare the crime problem in cities of different sizes. To do this, we would create a new variable called "Crime Rate" in Column L, the first empty column. Then, in Cell L2, we would enter this formula: =(C2/J2)*100000. This divides the total crimes by the population, then multiplies the result by 100,000. (Notice that there are no spaces and no thousands separators used in the formula.) Here is the result:

[Screenshot: the crime rate formula in Cell L2]

It would be very tedious to repeat that calculation in each of the 103 rows of data. Happily, Excel has a way to rapidly copy a formula down a column of cells. To do that, carefully move the cursor (normally a big fat white cross) to the bottom right corner of the cell containing the formula. When it is in the right spot, the cursor will change to a small black cross. At that point, you can double-click and the formula will be copied down the column until it reaches a blank cell in the column to the left. This would be the result:

[Screenshot: the formula copied down the column]

Notice that the formula changes for each row, so that Row 6 is =(C6/J6)*100000.

Now, if we sort by Crime Rate in descending order, we see the cities with the worst crime problems:

[Screenshot: the data sorted by crime rate, descending]

and sorting in ascending order, the least crime:

[Screenshot: the data sorted by crime rate, ascending]
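If you prefer scripting, the same per-row calculation and sorting take a few lines of pandas. The file and column names below are assumptions (a hypothetical export of the workshop data), not Steve's actual files:

import pandas as pd

df = pd.read_csv("italy-crime.csv")  # hypothetical export of the workshop data

# Crimes per 100,000 population: the Excel formula =(C2/J2)*100000, row by row.
df["crime_rate"] = df["Delitti in totale"] / df["Popolazione"] * 100000

print(df.sort_values("crime_rate", ascending=False).head())  # worst crime first
print(df.sort_values("crime_rate").head())                   # least crime first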

Here are some other useful Excel functions that can be used in similar ways:

(You can add, subtract, multiply or divide by using the symbols +, -, * and /)
=AVERAGE – calculates the arithmetic mean of a column or row of numbers
=MEDIAN – finds the middle value of a column or row of numbers
=COUNT – tells you how many items there are in a column or row
=MAX – tells you the largest value in a column or row
=MIN – tells you the smallest value in a column or row

There are also a variety of text functions that can join and cut apart text strings. For instance:

If "Steve" is in Cell B2 and "Doig" is in Cell C2, then =B2&" "&C2 will produce "Steve Doig", and =C2&", "&B2 will produce "Doig, Steve". Other text functions include:

=SEARCH – this will find the start of a desired string of text in a larger string.
=LEN – this will tell you how many characters are in a text string.
=LEFT – this will extract however many characters you specify starting from the left.
=RIGHT — this will extract characters starting from the right.

You can also do date arithmetic, such as calculating the number of days or years between two dates, or the hours, minutes and/or seconds between two times. For instance, to calculate on April 24, 2010, the age in years of someone whose birth date is in cell B2, you could use this formula: =(DATE(2010,4,24)-B2)/365.25. The first part of the formula calculates the number of days between the two dates; that is then divided by 365.25 (the .25 accounts for leap years) to produce the years. Another useful date function is =WEEKDAY, which will tell you on which day of the week a chosen date falls. For instance =WEEKDAY(DATE(1948,4,21)) returns a 4, which means I was born on a Wednesday.

Excel offers well over 200 functions in a variety of categories beyond just math, dates and text: Financial, engineering, database, logical, statistical, etc. But it is unlikely that you will need to be familiar with more than a dozen or so functions, unless you are a journalist with a very specialized beat such as economics.

Pivot Tables

 

One of Excel’s best tricks is the ability to summarize data that is in categories. The tool that does this is called a pivot table, which creates an interactive cross-tabulation of the data by category.

To create a pivot table, every column of your data must have a variable label; in fact, it is always good practice to put in a variable label any time you insert or add a new column. First, you make sure your cursor is on some cell in the table. Then go to the tool bar and click on “Data…Pivot Table Report”. A window will pop up called the “Pivot Table Wizard”. Just hit “Next…Next…Finish” on the three steps of the wizard.

This will open a new sheet that looks like this:

[Screenshot: an empty pivot table sheet]

To build a pivot table, you should visualize the piece of paper that would answer your question. Our example data shows 103 provinces in the 20 Territorios of Italy. Imagine that you wanted to know the total number of crimes in each Territorio. The piece of paper that would answer that question would list each Territorio, with the total number of crimes next to each name.

To build this pivot table, we would use the mouse to pick up “Territorio” from the list of variables in the floating box to the right, and place it in the “Drop Row Fields Here” box. We would then take the “Delitti in totale” variable and put it in the “Drop Data Items Here” box. This would be the result:

[Screenshot: total crimes by Territorio in the pivot table]

If you click the cursor into the “Total” Column and hit the Z-A button to sort, you will get this:

[Screenshot: the pivot table sorted with the Z-A button]

It is possible to make very complicated pivot tables, with multiple subtotals. But I recommend making a new pivot table for each question you want to answer; several simple tables are easier to understand than one very complicated table that tries to answer many questions at once.
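For readers who script, a pivot table is just a group-and-aggregate operation. Here is the pandas counterpart of the table above, using the same hypothetical file as the earlier sketch:

import pandas as pd

df = pd.read_csv("italy-crime.csv")
totals = (df.groupby("Territorio")["Delitti in totale"]
            .sum()
            .sort_values(ascending=False))  # Z-A sort of the totals
print(totals)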

A button on each variable in the list opens up a box that will let you make a variety of other choices about how to summarize and display the result:

 

[Screenshot: the pivot table field settings dialog]

Other Excel Tips

Excel will import data that comes in a variety of formats other than the native *.xls that Excel uses. For instance, Excel can readily import text files in which the data columns are separated by commas, tabs, or other characters, like this:

[Screenshot: comma-separated text data ready for import]

If you find a web page with data in table format (rows and columns), Excel can open it as a spreadsheet.

Excel also will let you format your data to make it more readable. For instance, “Format…Cells…Number” will allow you to put thousands separators in your numbers, like this:

[Screenshot: numbers formatted with thousands separators]

Finding Data

Government agencies are starting to make some of their data available in Excel or other formats. For instance, ISTAT.IT has very comprehensive data about Italian demographics, economy, crime, etc. Many of their tables can be downloaded directly as Excel files.

One trick to find interesting data would be to use Google and add these search terms: site:.it filetype:xls.

Need Help?
Feel free to send me an email at steve.doig@asu.edu. I will be glad to give you advice if I can.

Enjoyed this? Want to stay in touch? Join the School of Data Announce Mailing List for updates on more training activities from the School of Data or the Data Driven Journalism list for discussions and news from the world of Data Journalism.


One more day to go to the School of Data Journalism at the International Journalism Festival

Lucy Chambers - April 23, 2013 in Events

It’s not too late: To register for the School of Data Journalism workshops please fill in your name and email address in this form.

There’s just one more day left to go to the School of Data Journalism at the International Journalism Festival in Perugia. Here’s a quick reminder of what we’ve had so far!

Last year’s edition attracted hundreds of journalists and featured a stellar team of panelists and instructors from the New York Times, the Guardian, Deutsche Welle, Duke University, Walter Cronkite School of Journalism and ProPublica. This year we return with a leading team of about 20 new and returning panelists and instructors from Reuters, New York Times, Spiegel, Guardian, Walter Cronkite School of Journalism, Knight-Mozilla OpenNews and others, and a mix of discussions and hands-on sessions focusing on everything from cross-border data-driven investigative journalism, to emergency reporting and using Excel, the Twitter API, data visualisation and maps for journalism.

The 2013 edition takes place in the beautiful city of Perugia between 24-28 April. Entry to the School of Data Journalism panels and workshops is free.


PANEL DISCUSSIONS

 

1. The State of Data Journalism in 2013 (24 April)

2. Data and Investigations: Collaborating Across Borders (25 April)

3. Data Journalism in Southern European Countries (26 April, co-organised with Ahref and datajournalism.it)

4. Covering Emergencies in the Age of Big Data (27 April)

Speakers:

  • Anthony de Rosa, Social Media Editor, Reuters
  • Aron Pilhofer, Editor of Interactive News, New York Times
  • Dan Sinker, Director, Knight-Mozilla OpenNews
  • Elisabetta Tola, co-founder Formicablu, data journalism trainer
  • Friedrich Lindenberg, OpenNews Fellow, Spiegel Online
  • Guido Romeo, Science Editor, Wired Italy, Ahref
  • Jack Thurston, writer, broadcaster and co-founder of Farmsubsidy.org and Fishsubsidy.org
  • James Ball, data journalist, Guardian
  • Mar Cabra, multimedia investigative journalist
  • Marko Rakar, president of Windmill, blogger and data journalist
  • Paul Radu, Executive Director of the Organized Crime and Corruption Reporting  Project, Co-founder of the Investigative Dashboard concept

Moderators:

  • Guido Romeo, Science Editor, Wired Italy, Ahref
  • Liliana Bounegru, Project lead Data Driven Journalism, European Journalism Centre
  • Lucy Chambers, Head of Knowledge, Open Knowledge Foundation
  • Rina Tsubaki, Project lead Emergency Journalism, European Journalism Centre


WORKSHOPS

 

1. Excel for Journalism with Steve Doig  (24 April)

2. Using the Twitter API for Journalism (25 April)

3. Making Data Visualisations: A Survival Guide (26 April)

4. Data Visualisation, Maps and Timelines on a Shoestring (27 April)

Instructors:

  • Steve Doig, Knight Chair in Journalism, Professor, Walter Cronkite School of Journalism
  • Michael Bauer, School of Data, Open Knowledge Foundation
  • Gregor Aisch, award-winning freelance data visualisation expert

The full description of the sessions can be found on the International Journalism Festival website.

How to register

There is no fee to attend the workshops, but there is a limited number of available seats and they will be given out on a first-come, first-served basis. More information about the registration process for the four workshops will be available in the coming days. Registration is not necessary for attending the panel discussions.

What do you need to bring?

Enthusiasm and a laptop are required for the workshop sessions. Please note for hands-on workshops tablet devices will not be appropriate.

Questions?

If you have questions about the School of Data Journalism get in touch with the coordinators: Liliana Bounegru or Lucy Chambers.

 


When Wranglers meet Watchdogs

Michael Bauer - April 9, 2013 in Events



How can you draw insight from what local governments do and involve the community? This and other questions were eagerly discussed at the School of Data's "Wrangling for Watchdogs" workshop last week at Supermarkt Berlin.

Fifteen members of watchdog organizations in central and eastern Europe met up with data nerds from the School of Data and OpenSpending to discuss how to use data in campaigns, among other issues. The focus of the two-day event was to improve the participants' workflows around data and enable them to base their campaigns on evidence.

See what happened there:

Wrangling For Watchdogs – A Workshop with School of Data from Open Knowledge Foundation on Vimeo.

Video by Sam Muirhead

The workshop left us at the School of Data inspired and reinforced our idea of hands-on learning as the way to approach data problems. If you are interested in something similar, contact us or subscribe to our Mailing List.

