Steps in Creating a Data Feature
Data journalism begins in one of two ways: either you have a question that needs data, or a dataset that needs questioning.
#1 Seek inspiration: find a topic and report, what data is relevant?
#2 Find the data: review available sources, perform searches
#3 Download, clean, and organize the data
#4 Interview and analyze the data: show change over time, identify patterns, make comparisons, investigate outliers, personalize with databases
#5 Execution: What type of format or visualization is best suited to your story: charts, maps, interactive graphics.
Source: Carl V Lewis
Another process: The inverted pyramid of data journalism
Compile
Data journalism begins in one of two ways: either you have a question that needs data, or a dataset that needs questioning. Whichever it is, the compilation of data is what defines it as an act of data journalism.
Compiling data can take various forms. At its most simple the data might be:
- supplied directly to you by an organisation
- found through using advanced search techniques to plough into the depths of government websites;
- compiled by scraping databases hidden behind online forms or pages of results
- by converting documents into something that can be analysed
- or by collecting the data yourself through observation, surveys, online forms or crowdsourcing.
Clean
Cleaning typically takes two forms: removing human error; and converting the data into a format that is consistent with other data you are using.
For example, datasets will often include some or all of the following: duplicate entries; empty entries; the use of default values to save time or where no information was held; incorrect formatting (e.g. words instead of numbers); corrupted entries or entries with HTML code; multiple names for the same thing (e.g. BBC and B.B.C. and British Broadcasting Corporation); missing data ; mixed data in the same column; or data in the wrong shape (e.g. swapping columns and rows).
There are simple ways to clean up data in Excel or Google Sheets such as find and replace, sorting to find unusually high, low, or empty entries, and using filters so that only duplicate entries (i.e. those where a piece of data occurs more than once) are shown.
Google Refine adds a lot more power: its ‘common transforms’ function will, for example, convert all entries to lowercase, uppercase or titlecase. It can remove HTML, remove spaces before and after entries (which you can’t see but which computers will see as different to the same data without a space), remove double spaces, join and split cells, and format them consistently. It will also ‘cluster’ entries and allow you to merge those which should be the same. Note: this will work for BBC and B.B.C. but not BBC and British Broadcasting Corporation, so some manual intervention is often needed.
Context
Like any source, data cannot always be trusted. It comes with its own histories, biases, and objectives. So like any source, you need to ask questions of it: who gathered it, when, and for what purpose? How was it gathered? (The methodology). What exactly do they mean by that?
You will also need to understand jargon, such as codes that represent categories, classifications or locations, and specialist terminology.
All the above will most likely lead you to compile further data. For example, knowing the number of crimes in a city is interesting, but only becomes meaningful when you contextualise that alongside the population, or the numbers of police, or the levels of crime 5 years ago, or perceptions of crime, or levels of unemployment, and so on.
Combine
Good stories can be found in a single dataset, but often you will need to combine two together. After all, given the choice between a single-source story and a multiple-source one, which would you prefer?
The classic combination is the maps mashup: taking one dataset and combining it with map data to provide an instant visualisation of how something is distributed in space.
A more mundane combination is to combine two or more datasets with a common data point. This often means ensuring that the particular data point is formatted in the same name across each dataset.
In one, for example, the first and last names might have separate columns, but not in the other (you can concatenate or split cells to solve this).
Or you might have local authority names in one, but local authority codes in another (find another dataset that has both together.
Process
Step #1: Story Needs Data, Data Reveals Story
– Reporting on a story
– Breaking news
– News folo
– Browsing data sets
Examples:
You are doing a story about uneven distribution of vaccines and you need data to illustrate the nuances of this trend.
You discover a data set on NYC Open Data of NYC Trees. Using the you create a visualization shows the variety and quantity of street trees in all five New York City boroughs.
An Interactive Visualization of NYC Street Trees
Step #2: Find Relevant Data Sources
In addition to NYC Open Data, Census and American Community Survey data are frequently used for news stories. The NYC Factfinder and Department of City Planning are good places to begin for local data.
Government (city, county, state and national) and nonprofits
– NYC Open Data NYC
– NYC Factfinder
– US Census/ACS
CUNY J-school List of Data Sources
This is a very thorough, constantly updated, list of data sources.
Data collections, portals, hubs
– DataPortal.org: Data portals around the world
– Data.gov: U.S. data portal
– World Bank
– United Nations
– Guardian World Government Data
– The Data Hub
Data is Plural – Collection of interesting data sets for journalists
Searching for data using Google
Tips from the Data Journalism Handbook
When searching for data, make sure that you include both search terms relating to the content of the data you’re trying to find as well as some information on the format or source that you would expect it to be in. Google and other search engines allow you to search by file type. For example, you can look only for spreadsheets (by appending your search with ‘type:XLS type:CSV’), geodata (‘filetype:shp’), or database extracts (‘type:MDB, filetype:SQL, type:DB’). If you’re so inclined, you can even look for PDFs (‘type:pdf’).
You can also search by part of a URL. Googling for ‘inurl:downloads type:xls’ will try to find all Excel files that have “downloads” in their web address (if you find a single download, it’s often worth just checking what other results exist for the same folder on the web server). You can also limit your search to only those results on a single domain name, by searching for, e.g. ‘site:agency.gov’.
Another popular trick is not to search for content directly, but for places where bulk data may be available. For example, ‘site:agency.gov Directory Listing’ may give you some listings generated by the web server with easy access to raw files, while ‘site:agency.gov Database Download’ will look for intentionally created listings.
Data formats
.CSV: CSV stands for Comma Separated Values – this means that the value in each column is separated by a comma like so: Name, Date of birth, Address. When the spreadsheet software opens this, it replaces each comma with a new column.
.TSV: Tab Separated Values – values are separated by tabs
Fixed width: values are separated by spaces
.XLS: Excel document
XML: Extensible markup language – values arranged using open/closing tags like HTML
JSON: Used for exporting data for use in Web javascript
PDF: If data is in PDF, there are tools to extricate it like Tabula.
.KML or .SHP: Geographic data
Verify the source
- Who collected the data?
- What was the methodology?
- Any comparable dataset?
- What does an expert think?
- Are the numbers plausible?
- All data is dirty and…
- Almost always incomplete!
Step #3: Download data, import into Excel, clean
Many data sets can be downloaded in Excel or CSV formats. NYC Open Data allows you to filter out the data sets for particular information.
Watch NYC Open Data – Video Guides
Spreadsheet Basics
At its most basic, a spreadsheet acts as a calculator. They can add, subtract, divide, and multiply, and through a combination of those can calculate averages, proportions, rates, and various other things which can provide an insight into the data. One of the most common of those stories is about change. Change is, almost by definition, newsworthy, and an integral part of storytelling. When things change, it makes the news. — Finding Stories with Spreadsheets, Paul Bradshaw
All about Excel with exercises by Miguel Paz
Functions are words that have special meanings in Excel and other spreadsheet software. They are shortcuts for a series of instructions. Functions are always followed by parentheses containing the ingredients it needs. These are called arguments (specifically) or parameters (in general). The function for adding up a series of cells is called SUM. =SUM(A1:A5) or =SUM(A:A) or =SUM(A1:B500) or =SUM(1:1)
– SUM: Add all these figures up’
– AVERAGE: Calculate an average for these figures
– COUNT: Count how many numbers I have in this column
Resource: Excel Spreadsheet Basics
Formulas
– Calculations are called formulas in Excel and other spreadsheet software
– A formula begins with an equals operator: =
– Formulas are most likely to use cell references like A2 or B300.
Cleaning Data
Workflow – Tutorial: Intro To Cleaning Data, Len de Groot
– Make sure data was imported properly and column data matches the column headers
– Save a copy immediately
– Delete blank rows within the data
– Consolidate Column headers to a single row
– Format all integer columns to eliminate commas
– Find and replace special characters (&, ! , ~, etc) When in doubt, replace with a space, under-stroke or hyphen
– Check spelling (especially in columns you plan to Join)
– Check capitalization (especially in columns you plan to Join)
– Check abbreviations (especially in columns you plan to Join)
– Keep totals from original data to check against. This ensures data is not lost.
– Manipulate data in columns last.
– Delete what you don’t need
EXERCISE: Intro to Cleaning Data
Step #4: Interview and Analyze Data
Interviewing data is very much like interviewing human sources — your goal is to find patterns that generate additional interview questions or story ideas.
– How much has something changed?
– Who or where experienced the most change? Who changed the least?
– Which company or person got the most money?
– What is typical – and who stands out the most as being atypical?
– How often does a name appear? Where?
– How much was spent in total on something?
— Finding Stories with Spreadsheets, Paul Bradshaw
How reliable, credible is the data set?
– Where did the data come from?
– Have the data been peer-reviewed?
– How were the data collected?
– If the data come from a survey, for example, you want to know that the people who responded to the survey were selected at random.
– Is it a “self-selected sample?” People who feel really passionately about one side often flood a web site poll, skewing the results from what they would have been had you polled only a random sample of people in the community.
– Have results been cherry-picked? This is the social-science equivalent of gerrymandering, where you draw up a legislative district so that all the people who are going to vote for your candidate are included in your district and everyone else is scattered among a bunch of other districts. (See Awash in Data, Thirsting for Truth.)
– When in doubt, plot the subjects of a study on map and look for yourself to see if the boundaries make sense.
– Are the numbers taken out of context? The example of weather reports citing the number of fender benders on the road in a snow storm. How do these numbers compare to a sunny day?
– 10 Foundational Quantitative Reasoning Questions
Once you have spent some time answering these questions, you should consult an expert who is familiar with both the subject and the data set.
– Have I got things right?
– Am I missing information missing?
– Where should I be looking?
– How would you evaluate this data?