Newspaper Archive API

Overview

Newspaper Archive stores and exposes archival newspapers for Aftonbladet, Aftenposten, Bergens Tidende, Svenska Dagbladet, Stavanger Aftenblad and VG. It contains historical material from 1835 onwards and is continuously updated with the day-to-day stream of new releases from those brands. The main goal of this API is to expose archival newspapers in a consistent way across all years of their publication. It supports searching by text or by other criteria such as date and product code. Results returned by the API make it possible to highlight the text match on the page, and include both low-resolution and high-resolution versions of the pages.

A detailed list of the newspapers whose archival data is exposed via the API can be found on the page List of available products.

Data structure

Structure of data in the archive and description of concepts used in the documentation:

resources/paper_archive-b2df2179-61d5-495f-8860-63aea394f284.png

  • Product is the main element of data stored in the archive. A product is, for example, "SvD Perfect Guide". Its defining characteristic is that, for a given issue of the product on a given day, consumers see it as a list of pages from 1 to N.
  • Product code is the code that identifies a type of product, for example SVPG ("SvD Perfect Guide"), SVNY ("SvD Main newspaper") or SARB ("Stavanger Aftenblad editorial appendix"). The full list of product codes supported by this API is available on the page List of available products. The codes are based mostly on those in use in 2024. Whenever possible, older materials that used historical product codes were converted to the current codes, so that the same product has a consistent code throughout the years. For example, if the main newspaper product of Aftonbladet used three different codes over the years (AB34, ABD8, ABNY), then all data from those years is accessible in the archive under one product code, ABNY, which is the code in use in 2024. Note that for older materials (generally before 2010; for some newspapers earlier) product codes were not available, or are only sparsely available. Because of that, all pages on a given day may be under one code, for example SVNY ("SvD Main newspaper"), even though the pages may include, for example, "SvD Perfect Guide". In addition, for some newer material (post 2010) it was possible to determine that given pages formed a separate product, but not which specific product code applied. In those cases special codes with the suffix "_SUP" were used, which can be understood as "appendix number N" (for example APHA_SUP1 - APHA_SUP9). The years and brands for which those special codes were used are listed on the page List of available products.
  • ProductIssueId - because there may be multiple issues of a product with one product code on one day, productIssueId may be used to identify one specific issue of a product on a given day. For example, there may be multiple advertisement appendixes on a given day, all with the same product code (for example APAB), but each with a different productIssueId.
  • Bundle - a group of products sold together on a given day. For example Aftenposten main newspaper (APHA) may be sold on a given day together with "Aftenposten A-magasinet" (APAM)
  • PageNumber - page number within a product. For a given product issue on a given day it always increases from 1 to N. It generally corresponds to the page number printed on the page, but for older material (before 2010) there may be some exceptions to this rule.
  • Edition - in some cases there may be more than one page with a given page number for one day and one product. This can happen, for example, when different pages were printed for different parts of the country (such as local news or advertisements). To distinguish those pages, the edition field has either the value "1", which means the "ordinary" (main) edition, or another value such as "2" or "3" identifying other local editions. Only "1" has a special meaning; other values have no special meaning and shouldn't be interpreted. For example, when you fetch all pages for a given triple (productCode = "SVNY", date = "1947-05-03", edition = "1") you get a list of all pages that were available somewhere for customers to buy in stores, and the result list contains no duplicated pages (pages with the same page number). Note that for edition "1" you can generally expect continuous pages from 1 to N (with a small number of exceptions for days with low-quality data), but there is no such rule for other edition numbers.
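
The concepts above can be sketched as a small client-side data model. This is only an illustration, not the API's own schema: the class name Page and the helper missing_page_numbers are hypothetical, and only the field meanings come from this documentation. The helper checks the "continuous pages from 1 to N" expectation for one product issue.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Page:
    """One archived page (hypothetical client-side model)."""
    product_code: str      # e.g. "SVNY"
    date: str              # ISO date, e.g. "1947-05-03"
    edition: int           # 1 is the ordinary (main) edition
    page_number: int       # 1..N within one product issue
    product_issue_id: str  # identifies one issue of a product on a given day


def missing_page_numbers(pages: list[Page]) -> list[int]:
    """Return page numbers absent from the expected 1..N run of one product issue."""
    present = {p.page_number for p in pages}
    if not present:
        return []
    return [n for n in range(1, max(present) + 1) if n not in present]


# Made-up sample: an issue where page 3 is missing from the archive.
pages = [Page("SVNY", "1947-05-03", 1, n, "PRD1") for n in (1, 2, 4)]
print(missing_page_numbers(pages))  # [3]
```

A check like this is most meaningful for edition 1, since other editions are not expected to be continuous.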

Search functionality

Page search request

The main way to search the archive is to use the endpoint:

GET /paper-archive/api/v2/page/search

Example request:

curl "https://api.schibsted.com/paper-archive/api/v2/page/search?q=SPECIALISTREKRYTERING&newspaperBrand=SVENSKA_DAGBLADET&productCode=SVNY&productCode=SVMD&startDate=2000-01-05&endDate=2000-01-05&offset=0&sort=DATE&size=50"
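
The same request URL can be assembled programmatically. The sketch below uses only the Python standard library and the parameter names from the example request; passing the parameters as a list of pairs is one way to repeat productCode. It builds the URL without sending it.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.schibsted.com/paper-archive/api/v2/page/search"

# A list of pairs (rather than a dict) lets a parameter appear more than once.
params = [
    ("q", "SPECIALISTREKRYTERING"),
    ("newspaperBrand", "SVENSKA_DAGBLADET"),
    ("productCode", "SVNY"),
    ("productCode", "SVMD"),
    ("startDate", "2000-01-05"),
    ("endDate", "2000-01-05"),
    ("offset", 0),
    ("sort", "DATE"),
    ("size", 50),
]

url = f"{BASE_URL}?{urlencode(params)}"
print(url)
```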

The endpoint accepts the following parameters:

  • q - (text) - the query as entered by the user. The query can contain multiple words or a phrase in quotes. If it is empty, all pages matching the other criteria are returned.
  • startDate - (ISO.DATE, for example "2011-12-03") - start of a timespan that the search should cover (including given date)
  • endDate - (ISO.DATE, for example "2011-12-03") - end date of a timespan that the search should cover (including given date)

Additionally, the following parameters can be provided to limit the results:

  • newspaperBrand (text)

The brand of the newspaper. If it is not specified, the search is performed in all newspaper brands. The parameter may be specified multiple times to search in multiple brands. The parameter accepts one of the following values:

  • AFTENPOSTEN
  • AFTONBLADET
  • BERGENS_TIDENDE
  • SVENSKA_DAGBLADET
  • STAVANGER_AFTENBLAD
  • VG
  • SCHN - Schibsted non-branded Norwegian
  • EXTN - External Norwegian

For one newspaper brand and day we may have multiple products available.

  • productCode (text)

One or more product codes to which the search should be limited. The list of all available product codes is on the page List of available products. If this parameter is not specified, the search is performed in all products. The parameter may be specified multiple times to search in multiple products. This parameter also supports "extended product codes", meaning product codes with a number at the end such as "SAAB1", "SAAB2" etc. You can use either a short code, for example "SAAB" to find all Stavanger Aftenblad advertisements, or a specific extended code such as "SAAB1" to find only advertisements with that code. All products that support extended codes are listed in List of available products.

  • imageUrlValidity (int)

This parameter specifies how long the returned image URLs should be valid, in minutes. The default is 30. The maximum allowed value is 7 days (10080 minutes).

  • edition (int)

In some cases there may be more than one page with a given page number for one day and one product. This can happen, for example, when different pages were printed for different parts of the country (such as local news or advertisements). To distinguish those pages, the edition field has either the value "1", which means the "ordinary" (main) edition, or another value such as "2" or "3" identifying other local editions. Only "1" has a special meaning; other values have no special meaning and shouldn't be interpreted.

  • pageNumber (int)

Limits the search to pages with the given page number.

  • sort, sortOrder (text)

The sort parameter must have one of the values DATE or RELEVANCE. RELEVANCE means that the most relevant results (as scored by the search engine) are at the top. sortOrder must have one of the values ASC or DESC.

  • offset, size (int)

Parameters used for paging. size determines how many result items are returned, and offset how many result items are skipped (starting from the top of the list). The maximum allowed value of size is 1000 and the default is 10. In addition, the sum offset + size cannot be greater than 10000.

You can also use the following parameters to find pages related to a page that you have already fetched:
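
The paging limits can be respected on the client side with a small generator. This is a sketch under the limits stated above (size at most 1000, offset + size at most 10000); the function name page_windows is hypothetical, and total is assumed to come from the total field of a previous response.

```python
from typing import Iterator

MAX_SIZE = 1000      # maximum allowed value of size
MAX_WINDOW = 10000   # offset + size must not exceed this


def page_windows(total: int, size: int = MAX_SIZE) -> Iterator[tuple[int, int]]:
    """Yield (offset, size) pairs that cover as many results as the limits allow."""
    if not 1 <= size <= MAX_SIZE:
        raise ValueError(f"size must be between 1 and {MAX_SIZE}")
    reachable = min(total, MAX_WINDOW)  # results past the window cannot be paged to
    offset = 0
    while offset < reachable:
        step = min(size, reachable - offset)
        yield offset, step
        offset += step


print(list(page_windows(2500)))
# [(0, 1000), (1000, 1000), (2000, 500)]
```

Results beyond the 10000 window are not reachable by paging alone; one option is to narrow the query, for example with a shorter startDate/endDate range.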

  • bundleId (text)

All pages that have the given bundleId will be returned. This is useful when you have one page from a previous search result and want to find all other pages of the products that were sold together with that page on the same day. bundleId is globally unique, so you don't have to specify the day or brand when you pass it. You may (but don't have to) pass pageNumber, edition or any other filter together with bundleId if you need to limit the results.

  • productIssueId (text)

All pages that have the given productIssueId will be returned. This is useful when you have one page from a previous search result and want to find all other pages of the product that were published on the same day as the original page. productIssueId is globally unique, so you don't have to specify the day, productCode or brand when you pass it. You may (but don't have to) pass pageNumber, edition or any other filter together with productIssueId if you need to limit the results.
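
A follow-up query for related pages can be derived from a hit returned by an earlier search. The helper below is a hypothetical sketch: the function name and its scope argument are assumptions, and the hit is assumed to be a dictionary with the bundleId, productIssueId and edition fields shown in the example response.

```python
from urllib.parse import urlencode


def related_pages_params(hit: dict, scope: str = "issue",
                         same_edition: bool = True) -> list[tuple[str, str]]:
    """Build query parameters for pages related to an already fetched hit."""
    if scope == "issue":
        params = [("productIssueId", hit["productIssueId"])]
    elif scope == "bundle":
        params = [("bundleId", hit["bundleId"])]
    else:
        raise ValueError("scope must be 'issue' or 'bundle'")
    if same_edition:
        # Optional filter: stay within the edition of the original page.
        params.append(("edition", str(hit["edition"])))
    return params


hit = {"productIssueId": "PRD2343", "bundleId": "B233", "edition": 1}
print(urlencode(related_pages_params(hit, scope="bundle")))
# bundleId=B233&edition=1
```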

Page search response

Example response:

{
    "total": 1,
    "hits": [
        {
            "id": "20160726_0038",
            "thumbnailUrl": "https://some_domain/20160726_0038_thumb.jpg?someparam",
            "imageUrl": "https://some_domain/20160726_0038_large.jpg?someparam",
            "date": "2016-07-26",
            "newspaperBrand": "Aftonbladet",
            "productCode": "ABAB",
            "pageNumber": 38,
            "maxPageNumber": 47,
            "height": 4450,
            "width": 3000,
            "topic": "NEWS",
            "edition": 1,
            "bundleId": "B233",
            "productPositionInBundle": 1,
            "productIssueId": "PRD2343",
            "productCodeExtended": "ABAB1",
            "highlightText": "Min förhoppning är att satsningen på elbilar tyder på nya tider, säger <em>Mattias</em> <em>Goldmann</em> och syftar på",
            "highlightArea": {
                "textBlocks": [
                    {
                        "hpos": "1070",
                        "vpos": "2982",
                        "height": "433",
                        "width": "222"
                    }
                ],
                "words": [
                    {
                        "height": "43",
                        "hpos": "1070",
                        "vpos": "3482",
                        "width": "122",
                        "word": "Mattias"
                    },
                    {
                        "height": "43",
                        "hpos": "669",
                        "vpos": "3524",
                        "width": "180",
                        "word": "Goldmann"
                    }
                ]
            }
        }
    ]
}

Description of fields in the response:

  • total - total number of matching items (may be capped at some value)
  • hits - list of matching items - contains pages of newspaper matching search request
  • id - unique identifier of result item i.e. page
  • thumbnailUrl, imageUrl - URLs of images representing the page. thumbnailUrl points to a low-resolution image intended to be shown as a thumbnail, and imageUrl points to a high-resolution image intended to be shown full-screen. These URLs are valid only for a limited time that depends on the request parameter imageUrlValidity (maximum 7 days).
  • date - date on which given page was published
  • newspaperBrand - brand of newspaper like for example SVD (for a list of available values - see search request)
  • productCode - code of a product like for example SVPG ("SvD Perfect Guide"). Product has a characteristic that page numbers for a product on given day always start with 1 and continuously go up. If a page has got page number printed "1" it means that it is a first page of product on given day. It often doesn't correspond with what we think as a "newspaper" that customer buys in store, because what is bought often contains multiple products bundled together like for example SvD News product + SvD Business product.
  • pageNumber - page number that corresponds with the number that is printed on a page
  • maxPageNumber - total number of pages that are available for given product on given day
  • height, width - numbers defining the size of the page. The units of these numbers are not defined. They should only be used to calculate the relative positioning of words and text areas from the highlightArea field.
  • topic - defines a topic of a page. Has got one of predefined values like NEWS, ADVERTISEMENT, BUSINESS, CULTURE etc. (for a full list of available values - see search request). May be empty for some subset of archival data.
  • edition - in some cases for one day, one product may have more than one page with given page number. It may come for example from the fact there were different pages for different parts of the country (like for example local news or advertisements). To distinguish those pages field edition may have either value "1" which means it's the "ordinary" edition or any other value like "2", "3" identifying other local editions etc. "1" has got the special meaning of being the "ordinary" edition (the main one) and other values don't have any special meaning and shouldn't be interpreted. See search request to read more about edition.
  • highlightText - fragment of page text matching the search query (it contains searched word in em tag and a few words around)
  • bundleId - identifier of the bundle to which the product to which given page belongs. Bundle is a group of products sold together on a given day (like for example when you buy SvD newspaper you can get SvD News product together with SvD Business product and two advertisement appendixes). It's a globally unique identifier. It can be used to find all other pages and products that were part of a bundle on a given day when for example you have only one of those pages.
  • productPositionInBundle - position, within the bundle, of the product to which the given page belongs. It determines the order of products in a bundle. The first product in a bundle has the value 1 in this field. It can be used to sort the resulting products (and pages) in the same order as in the bundle that was delivered to the customers.
  • productIssueId - identifier of the product issue to which the given page belongs. All pages of a given product on a given day have the same productIssueId (for example, all pages of the first advertisement appendix for SvD on a given day have the same productIssueId, but the second advertisement appendix on the same day has a different productIssueId). It's a globally unique identifier. It can be used when you have one page from search results and want to find all other pages of that product on the same day.
  • highlightArea.words - contains coordinates of the words matching the search query (coordinates are relative to the page size defined in the width and height fields). This field is always present in search results when the q parameter was used. Coordinates of words are available for the majority of content in the archive, but in some exceptional cases, due to lacking source data, they may not be present.
  • highlightArea.textBlocks - contains coordinates of the text blocks that contain words matching the search query (coordinates are relative to the page size defined in the width and height fields). Text blocks are only returned when the q parameter was used. Text blocks are not available for all archival content, so it should not be expected that textBlocks will always contain data, but when they are available, usually all words in highlightArea.words are within the area of highlightArea.textBlocks.
  • productCodeExtended - this field is for the specific use case where product codes that include an index, like "SAAB2", are needed. This is mainly for compatibility with external systems that accept only this kind of code. If that compatibility is not needed, it's recommended to use productCode and other fields such as productIssueId and productPositionInBundle. The field always contains a value: either the short productCode (for products that don't have indexes, like "SAHA") or the "extended product code" with an index for products that support them.
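
Because the units of width and height are not defined, highlight coordinates have to be scaled to the size at which the page image is actually displayed. The sketch below shows one way to do that. It assumes that hpos is the horizontal offset from the left edge and vpos the vertical offset from the top edge, which this documentation does not state explicitly; the function name scale_box and the output keys are hypothetical.

```python
def scale_box(box: dict, page_width: int, page_height: int,
              rendered_width: int, rendered_height: int) -> dict:
    """Map a highlight box from page units to pixels of a rendered image."""
    sx = rendered_width / page_width
    sy = rendered_height / page_height
    # Coordinates arrive as strings in the example response, so convert first.
    return {
        "left": round(int(box["hpos"]) * sx),
        "top": round(int(box["vpos"]) * sy),
        "width": round(int(box["width"]) * sx),
        "height": round(int(box["height"]) * sy),
    }


# First word from the example response, page size 3000 x 4450,
# image displayed at 600 x 890 pixels.
word = {"height": "43", "hpos": "1070", "vpos": "3482", "width": "122", "word": "Mattias"}
print(scale_box(word, 3000, 4450, 600, 890))
# {'left': 214, 'top': 696, 'width': 24, 'height': 9}
```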

Typical use cases

  1. Search for given text on a page

In that case you usually have to pass q parameter with the query and then you can optionally limit the search by providing startDate and endDate and/or limit to one newspaperBrand or specific productCode. You can also add edition = 1 parameter if you want to search only in main editions of the newspapers.

  2. Find related pages

When you have a page from search results and want to show it in context, you can fetch all the surrounding pages that were part of the product on a given day by passing the productIssueId parameter. Similarly, if you want to find all the pages that were part of a bundle (i.e. products sold together on a given day), you can pass the bundleId parameter. If you want to limit the search results to one edition (the same edition as the page, or the main edition), add the edition parameter. It is not necessary to add newspaperBrand, dates or productCode, as productIssueId and bundleId already include that information. When fetching pages in a bundle, it may be useful to use productPositionInBundle for sorting the results.
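
Ordering the pages of a bundle can be done on the client once the hits are fetched, since the sort parameter described earlier only lists DATE and RELEVANCE. A minimal sketch, assuming each hit is a dictionary with the fields from the example response; the function name sort_bundle_pages and the sample hits are made up for illustration.

```python
def sort_bundle_pages(hits: list[dict]) -> list[dict]:
    """Order hits as delivered: by product position in the bundle, then page number."""
    return sorted(
        hits,
        key=lambda h: (h["productPositionInBundle"], h["pageNumber"], h["edition"]),
    )


hits = [
    {"productCode": "APAM", "productPositionInBundle": 2, "pageNumber": 1, "edition": 1},
    {"productCode": "APHA", "productPositionInBundle": 1, "pageNumber": 2, "edition": 1},
    {"productCode": "APHA", "productPositionInBundle": 1, "pageNumber": 1, "edition": 1},
]
ordered = sort_bundle_pages(hits)
print([(h["productCode"], h["pageNumber"]) for h in ordered])
# [('APHA', 1), ('APHA', 2), ('APAM', 1)]
```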

Authentication and access control

Endpoints are protected by HTTP BASIC authentication. All authenticated users will have unlimited access to all data in Newspaper Archive.
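
With HTTP Basic authentication, the client sends an Authorization header containing the Base64-encoded "username:password" pair. The sketch below builds that header with the Python standard library; the helper name and the credentials are placeholders, not real values.

```python
import base64


def basic_auth_header(username: str, password: str) -> dict[str, str]:
    """Build the Authorization header used by HTTP Basic authentication."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return {"Authorization": f"Basic {token}"}


# Placeholder credentials for illustration only.
print(basic_auth_header("user", "pass"))
# {'Authorization': 'Basic dXNlcjpwYXNz'}
```

Most HTTP clients can also build this header themselves when given a username and password, for example curl with its -u option.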

