Steps to Creating a Data Feature
#1 Seek inspiration: find a topic and report, what data is relevant?
#2 Find the data: review available sources, perform searches
#3 Download, clean, and organize the data
#4 Interview and analyze the data: show change over time, identify patterns, make comparisons, investigate outliers, personalize with databases
#5 Execution: What type of format or visualization is best suited to your story: charts, maps, interactive graphics.
Source: Carl V Lewis
Another process: The inverted pyramid of data journalism
Step #1: Seek Inspiration
- Reporting on a story
- Breaking news
- News folo
- Browsing data sets
An Interactive Visualization of NYC Street Trees
Using data provided by NYC Open Data, this visualization shows the variety and quantity of street trees in all five New York City boroughs.
Step #2: Find Relevant Data
CUNY J-school List of Data Sources
This is a very thorough, constantly updated, list of data sources.
Government (city, county, state and national) and nonprofits
- NYC Open Data NYC
- NYC Factfinder
- Census/ACS – American Factfinder
- Data.gov: U.S. data portal
- World Bank
- United Nations
Data collections, portals, hubs
- DataPortal.org: Data portals around the world
- Guardian World Government Data
- The Data Hub
- Census Reporter

Data is Plural – Collection of interesting data sets for journalists
Searching for data using Google
Tips from the Data Journalism Handbook
When searching for data, make sure that you include both search terms relating to the content of the data you’re trying to find as well as some information on the format or source that you would expect it to be in. Google and other search engines allow you to search by file type. For example, you can look only for spreadsheets (by appending your search with ‘filetype:XLS filetype:CSV’), geodata (‘filetype:shp’), or database extracts (‘filetype:MDB, filetype:SQL, filetype:DB’). If you’re so inclined, you can even look for PDFs (‘filetype:pdf’).
You can also search by part of a URL. Googling for ‘inurl:downloads filetype:xls’ will try to find all Excel files that have “downloads” in their web address (if you find a single download, it’s often worth just checking what other results exist for the same folder on the web server). You can also limit your search to only those results on a single domain name, by searching for, e.g. ‘site:agency.gov’.
Another popular trick is not to search for content directly, but for places where bulk data may be available. For example, ‘site:agency.gov Directory Listing’ may give you some listings generated by the web server with easy access to raw files, while ‘site:agency.gov Database Download’ will look for intentionally created listings.
Data formats
.CSV: CSV stands for Comma Separated Values – this means that the value in each column is separated by a comma like so: Name, Date of birth, Address. When the spreadsheet software opens this, it replaces each comma with a new column.
.TSV: Tab Separated Values – values are separated by tabs
Fixed width: values are separated by spaces
.XLS: Excel document
XML: Extensible markup language – values arranged using open/closing tags like HTML
JSON: Used for exporting data for use in Web javascript
PDF: If data is in PDF, there are tools to extricate it like Tabula.
.KML or .SHP: Geographic data
Step #3: Download data, import into Excel, clean
Many data sets can be downloaded in Excel or CSV formats. NYC Open Data allows you to filter out the data sets for particular information.
Watch NYC Open Data – Video Guides
At its most basic, a spreadsheet acts as a calculator. They can add, subtract, divide, and multiply, and through a combination of those can calculate averages, proportions, rates, and various other things which can provide an insight into the data. One of the most common of those stories is about change. Change is, almost by definition, newsworthy, and an integral part of storytelling. When things change, it makes the news. — Finding Stories with Spreadsheets, Paul Bradshaw
All about Excel with exercises by Miguel Paz
EXERCISE: Intro to Cleaning Data
- Calculations are called formulas in Excel and other spreadsheet software
- A formula begins with an equals operator: =
- Formulas are most likely to use cell references like A2 or B300.
Functions
Functions are words that have special meanings in Excel and other spreadsheet software. They are shortcuts for a series of instructions. Functions are always followed by parentheses containing the ingredients it needs. These are called arguments (specifically) or parameters (in general). The function for adding up a series of cells is called SUM. =SUM(A1:A5) or =SUM(A:A) or =SUM(A1:B500) or =SUM(1:1)
- SUM: Add all these figures up’
- AVERAGE: Calculate an average for these figures
- COUNT: Count how many numbers I have in this column
Resource: Excel Spreadsheet Basics
Workflow – Tutorial: Intro To Cleaning Data, Len de Groot
- If importing CSV, retain leading zeros
- Make sure data was imported properly and column data matches the column headers
- Save a copy immediately
- Delete blank rows within the data
- Consolidate Column headers to a single row
- Format all integer columns to eliminate commas
- Find and replace special characters (&, ! , ~, etc) When in doubt, replace with a space, under-stroke or hyphen
- Check spelling (especially in columns you plan to Join)
- Check capitalization (especially in columns you plan to Join)
- Check abbreviations (especially in columns you plan to Join)
- Keep totals from original data to check against. This ensures data is not lost.
- Manipulate data in columns last.
- Delete what you don’t need
Step #4: Interview and Analyze Data

Interviewing data is very much like interviewing human sources — your goal is to find patterns that generate additional interview questions or story ideas.
- How much has something changed?
- Who or where experienced the most change? Who changed the least?
- Which company or person got the most money?
- What is typical – and who stands out the most as being atypical?
- How often does a name appear? Where?
- How much was spent in total on something?
– Finding Stories with Spreadsheets, Paul Bradshaw
How reliable, credible is the data set?
- Where did the data come from?
- Have the data been peer-reviewed?
- How were the data collected?
- If the data come from a survey, for example, you want to know that the people who responded to the survey were selected at random.
- Is it a “self-selected sample?” People who feel really passionately about one side often flood a web site poll, skewing the results from what they would have been had you polled only a random sample of people in the community.
- Have results been cherry-picked? This is the social-science equivalent of gerrymandering, where you draw up a legislative district so that all the people who are going to vote for your candidate are included in your district and everyone else is scattered among a bunch of other districts. (See Awash in Data, Thirsting for Truth.)
- When in doubt, plot the subjects of a study on map and look for yourself to see if the boundaries make sense.
- Are the numbers taken out of context? The example of weather reports citing the number of fender benders on the road in a snow storm. How do these numbers compare to a sunny day?
- 10 Foundational Quantitative Reasoning Questions
Once you have spent some time answering these questions, you should consult an expert who is familiar with both the subject and the data set.
- Have I got things right?
- Am I missing information missing?
- Where should I be looking?
- How would you evaluate this data?





