6 Ways to get News Data Online

2020 · Posted by Sebastien Lemieux-Codere

Content

  1. News APIs
  2. Prepared News Datasets
  3. RSS Feeds
  4. Web Scraping / Crawling
  5. Common Crawl Archive
  6. GDELT

1. News APIs

The easiest way to get up-to-date news data is to use a news API. These APIs often support searching and filtering by criteria such as news source and publication date. There are many APIs to choose from, including NewsAPI and Bing News Search API. Many of these APIs are paid, but some have free trials.
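As a rough sketch of what searching and filtering through a news API can look like, the Python snippet below builds a query URL and fetches the matching articles. The endpoint and the parameter names (q, sources, from, to, apiKey) are modeled on NewsAPI's "everything" search and should be treated as assumptions to check against the provider's documentation. The function names and the YOUR_API_KEY placeholder are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint, modeled on NewsAPI's "everything" search; verify before use.
NEWS_API_URL = "https://newsapi.org/v2/everything"


def build_query_url(api_key, query, source=None, from_date=None, to_date=None):
    """Build a search URL filtered by optional source and publication dates."""
    params = {"q": query, "apiKey": api_key}
    if source:
        params["sources"] = source
    if from_date:
        params["from"] = from_date  # ISO 8601 date, e.g. "2020-01-01"
    if to_date:
        params["to"] = to_date
    return NEWS_API_URL + "?" + urlencode(params)


def fetch_articles(url):
    """Download the JSON response and return its list of articles (network call)."""
    with urlopen(url, timeout=30) as response:
        payload = json.load(response)
    return payload.get("articles", [])


if __name__ == "__main__":
    url = build_query_url("YOUR_API_KEY", "climate change",
                          source="bbc-news", from_date="2020-01-01")
    print(url)  # pass this URL to fetch_articles() once a real key is set
```

Keeping URL construction separate from the network call makes the filtering logic easy to inspect without spending any of a free trial's request quota.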

Tutorials and Examples

2. Prepared News Datasets

There are many prepared news datasets that can be readily downloaded. If you're lucky, you might find one that provides the type of news data you need, from a set of sources and a time period that suit your needs.

Tutorials and Examples

3. RSS Feeds

RSS feeds use a standardized, computer-readable format that allows users and applications to access updates to websites. Many news publications provide RSS feeds that include information about their latest articles. This information often includes headlines, links and publish dates. Some news sources even provide RSS feeds organized by news category. News RSS feeds can be an easy way to get some data on the latest news from the sources that provide them. Open source libraries like feedparser make it easy to parse RSS feeds.
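To make the headline, link and publish date fields concrete, here is a minimal sketch that parses a small RSS 2.0 document using only the Python standard library. The sample feed and the parse_rss_items helper are made up for illustration. A library such as feedparser copes with many more feed variants and malformed feeds than this sketch does.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

# Hypothetical feed content standing in for a downloaded RSS 2.0 document.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example News</title>
    <item>
      <title>First headline</title>
      <link>https://example.com/first</link>
      <pubDate>Mon, 06 Jan 2020 10:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Second headline</title>
      <link>https://example.com/second</link>
      <pubDate>Tue, 07 Jan 2020 08:30:00 GMT</pubDate>
    </item>
  </channel>
</rss>"""


def parse_rss_items(xml_text):
    """Return one dict per <item> with its headline, link and publish date."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        published = item.findtext("pubDate")
        items.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            # RSS 2.0 dates use the RFC 822 style, which email.utils can parse.
            "published": parsedate_to_datetime(published) if published else None,
        })
    return items


if __name__ == "__main__":
    for entry in parse_rss_items(SAMPLE_RSS):
        print(entry["published"], entry["title"], entry["link"])
```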

Tutorials and Examples

4. Web Scraping / Crawling

Another way to get news data is to crawl online news websites and extract the desired data. Open source libraries such as Newspaper3k make this relatively easy for news data. There are also open source libraries that are not specialized for news data, like Scrapy and Beautiful Soup. One important thing to note is that the differences between news websites can be difficult to account for. Code that works for one news website may need to be adjusted to work on others. For example, Newspaper3k works well on many but not all news websites.
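The sketch below illustrates why site differences matter. It is a deliberately naive extractor, written with the standard library's html.parser rather than any of the libraries named above, that takes the page title and the text of every <p> element. The ArticleExtractor class, the extract_article helper and the sample page are all hypothetical.

```python
from html.parser import HTMLParser

# Hypothetical page standing in for a downloaded news article.
SAMPLE_HTML = (
    "<html><head><title>Example headline</title></head><body>"
    "<nav><a href='/'>Home</a></nav>"
    "<article><p>First <b>paragraph</b>.</p><p>Second paragraph.</p></article>"
    "</body></html>"
)


class ArticleExtractor(HTMLParser):
    """Collect the <title> text and the text inside each <p> element."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.paragraphs = []
        self._in_title = False
        self._in_paragraph = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "p":
            self._in_paragraph = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "p" and self._in_paragraph:
            self._in_paragraph = False
            # Join the text fragments and collapse runs of whitespace.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.paragraphs.append(text)

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._in_paragraph:
            self._buffer.append(data)


def extract_article(html):
    """Return the title and paragraph text found in an HTML string."""
    parser = ArticleExtractor()
    parser.feed(html)
    parser.close()
    return {"title": parser.title.strip(), "text": "\n".join(parser.paragraphs)}


if __name__ == "__main__":
    print(extract_article(SAMPLE_HTML))
```

A site that places its article body in <div> elements instead of <p> elements would return empty text from this extractor, which is the kind of per-site adjustment the paragraph above warns about.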

Tutorials and Examples

5. Common Crawl Archive

Common Crawl is an organization that builds and maintains a petabyte-scale open repository of web crawl data that anyone can access and analyze for free. Common Crawl provides a monthly general crawl that includes all types of web pages and a separate daily news-only crawl.

The web crawl data is stored in WARC (Web ARChive) files. This format stores the raw HTML of each crawled page along with other information, including the download date and HTTP headers. These WARC files are gzipped and stored in AWS S3 as a public dataset. The first general monthly crawl is from 2008, but the crawls include articles written far earlier than that. Some of the earlier months were skipped and have no crawl available. Earlier crawls use a slightly different ARC format. The news-only crawl started in 2016.
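To show roughly how the download date, HTTP headers and raw HTML sit together, the sketch below splits a simplified, already decompressed WARC response record into its parts. The sample record is hand-written and omits fields that real records carry (such as a record ID and Content-Length), and parse_warc_response is a hypothetical helper. A dedicated WARC library would be more robust for real crawl files.

```python
# Hand-written, simplified record: WARC headers, then the HTTP response.
SAMPLE_RECORD = (
    b"WARC/1.0\r\n"
    b"WARC-Type: response\r\n"
    b"WARC-Date: 2020-01-06T10:00:00Z\r\n"
    b"WARC-Target-URI: https://example.com/first\r\n"
    b"\r\n"
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: text/html\r\n"
    b"\r\n"
    b"<html><body><p>Hello</p></body></html>"
)


def _split_header_block(block):
    """Split a header block into its first line and a dict of 'Name: value' fields."""
    lines = block.decode("utf-8", errors="replace").split("\r\n")
    fields = {}
    for line in lines[1:]:
        name, separator, value = line.partition(":")
        if separator:
            fields[name.strip()] = value.strip()
    return lines[0], fields


def parse_warc_response(record):
    """Separate WARC headers, HTTP headers and the HTML payload of one record."""
    # A blank line (CRLF CRLF) ends the WARC headers, and another ends the HTTP headers.
    warc_block, _, remainder = record.partition(b"\r\n\r\n")
    http_block, _, body = remainder.partition(b"\r\n\r\n")
    version, warc_headers = _split_header_block(warc_block)
    status_line, http_headers = _split_header_block(http_block)
    return {
        "version": version,
        "warc_headers": warc_headers,
        "status_line": status_line,
        "http_headers": http_headers,
        "html": body.decode("utf-8", errors="replace"),
    }


if __name__ == "__main__":
    parsed = parse_warc_response(SAMPLE_RECORD)
    print(parsed["warc_headers"]["WARC-Date"], parsed["status_line"])
```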

Common Crawl holds a large quantity of data, and a large fraction of it is likely not relevant to your use case, so you may not want to download and process the entire dataset. Instead, you can find the locations of the subset of WARC files you are targeting and download only those. The Common Crawl Index Server allows searching for WARC file locations using filters such as page crawl/download date, domain name or normalized URL. Alternatively, an index stored in the columnar Parquet format can be used for efficient aggregation and filtering on any field/column of the index. Each crawl is stored in multiple crawl files, and each of those crawl files is the result of concatenating many gzipped WARC files. Thus, the location of the WARC file for a single crawled web page consists of the path to a crawl file plus a start and end offset within that file.
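Because each crawl file is a concatenation of independently gzipped records, one page can be recovered from just its byte range. The sketch below demonstrates this on an in-memory stand-in for a crawl file and shows how the same range could be requested over HTTP. The download host in DATA_PREFIX and the offset/length convention (an index entry giving a start offset and a record length) are assumptions to verify against Common Crawl's documentation, and the function names are hypothetical.

```python
import gzip
from urllib.request import Request, urlopen

# Assumed public download host for crawl files; verify before use.
DATA_PREFIX = "https://data.commoncrawl.org/"


def byte_range_header(offset, length):
    """HTTP Range header covering `length` bytes starting at `offset` (inclusive end)."""
    return {"Range": "bytes={}-{}".format(offset, offset + length - 1)}


def extract_record(crawl_file_bytes, offset, length):
    """Decompress a single gzipped record sliced out of a concatenated crawl file."""
    return gzip.decompress(crawl_file_bytes[offset:offset + length])


def fetch_record(filename, offset, length):
    """Download only one record's bytes from a crawl file and decompress it (network call)."""
    request = Request(DATA_PREFIX + filename, headers=byte_range_header(offset, length))
    with urlopen(request, timeout=60) as response:
        return gzip.decompress(response.read())


if __name__ == "__main__":
    # Local stand-in for a crawl file: two gzipped records back to back.
    record_a = gzip.compress(b"WARC/1.0 record A")
    record_b = gzip.compress(b"WARC/1.0 record B")
    crawl_file = record_a + record_b
    # The second record starts where the first one ends.
    print(extract_record(crawl_file, len(record_a), len(record_b)))
    print(byte_range_header(len(record_a), len(record_b)))
```

Requesting only the needed byte range avoids downloading an entire crawl file to read a single page.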

Tutorials and Examples

6. GDELT

The GDELT Project monitors the world's broadcast, print, and web news from nearly every corner of every country in over 100 languages and identifies the people, locations, organizations, themes, sources, emotions, counts, quotes, images and events driving our global society every second of every day. This data can be accessed using GDELT Analysis Service, Google BigQuery or Raw Data Files Download.

Tutorials and Examples

The Author

Sebastien Lemieux-Codere

Sebastien is a Data Scientist and Software Developer.