---
title: "Analysing and visualising a literature"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Analysing and visualising a literature}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
has_ggplot2 <- requireNamespace("ggplot2", quietly = TRUE)
```

```{r setup}
library(scopusflow)
```

Once a set of records is in hand, the package offers a small analysis layer that
turns it into the figures a bibliometric study usually needs. The steps that
contact the API are shown but not run; the rest use the bundled `example_records`
or small stand-in tables and run offline.

## What is in a record set

`scopus_top()` tallies the most frequent sources or authors. Author strings that
hold several names are split, so each contributor is counted once per record.

```{r}
scopus_top(example_records, by = "source")
scopus_top(example_records, by = "author", n = 5)
```

```{r eval = has_ggplot2, fig.alt = "A horizontal bar chart of the most frequent sources", fig.width = 7, fig.height = 3.5}
plot_scopus_top(scopus_top(example_records, by = "source"))
```

The same plot works on the author tally.

```{r eval = has_ggplot2, fig.alt = "A horizontal bar chart of the most frequent authors", fig.width = 7, fig.height = 3.5}
plot_scopus_top(scopus_top(example_records, by = "author", n = 5))
```
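
The once-per-record rule for authors can be sketched in base R. This is a
minimal illustration rather than the package's implementation: the helper
`count_authors()`, the semicolon separator, and the toy names are all
assumptions made for the example.

```{r}
# Hypothetical helper: tally authors, counting each name once per record.
# Assumes the names within a record are separated by semicolons.
count_authors <- function(author_strings, n = 5) {
  parts <- strsplit(author_strings, ";", fixed = TRUE)
  # Trim whitespace, drop empty pieces, and de-duplicate within each record
  per_record <- lapply(parts, function(x) {
    x <- trimws(x)
    unique(x[nzchar(x)])
  })
  tally <- sort(table(unlist(per_record)), decreasing = TRUE)
  utils::head(tally, n)
}

# Toy author strings; the second record repeats a name on purpose
toy <- c("Ada L.; Babbage C.", "Ada L.; Ada L.; Turing A.", "Turing A.")
count_authors(toy)
```

The repeated name in the second toy record contributes one count, not two,
because names are de-duplicated before the records are pooled.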

A record set also has a default view: `autoplot()` draws its records per year.
The same `autoplot()` generic dispatches on `scopus_trend` and `scopus_top`
objects too, delegating to `plot_scopus_trend()` and `plot_scopus_top()`.

```{r eval = has_ggplot2, fig.alt = "A bar chart of records per year", fig.width = 7, fig.height = 3.5}
ggplot2::autoplot(example_records)
```
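
For readers new to S3, the dispatch-and-delegate pattern can be sketched with a
toy generic. Nothing here is package code: `describe()` and the `toy_*` classes
are hypothetical stand-ins for `autoplot()` and the package's classes.

```{r}
# Toy generic standing in for autoplot(); not part of the package
describe <- function(x, ...) UseMethod("describe")

# One method per class; each would delegate to a class-specific worker
describe.toy_trend <- function(x, ...) paste("trend with", nrow(x), "years")
describe.toy_top   <- function(x, ...) paste("tally with", nrow(x), "entries")

# The class attribute decides which method runs
tr_toy <- structure(data.frame(year = 2004:2006, n = c(5, 8, 13)),
                    class = c("toy_trend", "data.frame"))
describe(tr_toy)
```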

## How a literature grows

`scopus_trend()` counts how many records match a query in each year, which traces
the size of a literature over time. It issues one count request per year, so it
needs API access: the call below, spanning 2004 to 2022, makes 19 requests.

```{r eval = FALSE}
tr <- scopus_trend("graphene", years = 2004:2022, field = "TITLE-ABS-KEY")
plot_scopus_trend(tr)
```

The result has a fixed shape: one row per year, with the columns `query`, `year`
and `n`. We reproduce that shape here with synthetic counts, purely to show the
plot.

```{r eval = has_ggplot2, fig.alt = "A line and area chart of records per year rising over time", fig.width = 7.5, fig.height = 4}
years <- 2004:2022
tr <- tibble::tibble(query = "TITLE-ABS-KEY(graphene)", year = years,
                     n = round(exp(seq(log(50), log(28000), length.out = length(years)))))
class(tr) <- c("scopus_trend", class(tr))
plot_scopus_trend(tr)
```
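
The one-request-per-year pattern behind `scopus_trend()` can be sketched without
the API. This is an illustration under stated assumptions, not the package's
code: `fake_count()` is a hypothetical stand-in that returns made-up numbers
where a real version would issue a count request.

```{r}
# Hypothetical stand-in for a single count request; it returns a
# deterministic fake number instead of contacting any service.
fake_count <- function(query, year) {
  as.integer(100 * (year - 2003))
}

# Sketch of the per-year loop: one call per year, one row per year
trend_sketch <- function(query, years, count_fun = fake_count) {
  n <- vapply(years, function(y) count_fun(query, y), integer(1))
  data.frame(query = query, year = years, n = n)
}

trend_sketch("TITLE-ABS-KEY(graphene)", 2004:2006)
```

Because the number of requests equals the number of years, the width of the
year range is what determines the cost of a trend.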

## Reading the fuller record

The Search API returns a few fields per record. To read the abstract and the
fuller metadata for a record you already know, `scopus_abstract()` calls the
Abstract Retrieval API, by DOI or 'Scopus' identifier. A batch is resilient, so
an identifier that cannot be found yields a row of `NA`s with a warning rather
than stopping the run.

```{r eval = FALSE}
ab <- scopus_abstract(c("10.1038/s41586-019-0001-1", "10.1103/PhysRevLett.116.061102"))
```

The result is a tibble of class `scopus_abstracts`, one row per identifier. To
show its shape without a key, here is a stand-in with the same columns and
illustrative values.

```{r}
ab <- tibble::tibble(
  id          = c("10.1038/s41586-019-0001-1", "10.1103/PhysRevLett.116.061102"),
  scopus_id   = c("85060000001", "84960000002"),
  doi         = c("10.1038/s41586-019-0001-1", "10.1103/PhysRevLett.116.061102"),
  title       = c("A single-cell atlas of gene expression",
                  "Observation of gravitational waves from a binary black hole merger"),
  abstract    = c("Here we present a comprehensive single-cell survey of ...",
                  "On 14 September 2015 the two detectors of LIGO observed ..."),
  publication = c("Nature", "Physical Review Letters"),
  year        = c(2019L, 2016L),
  citations   = c(420L, 5400L)
)
class(ab) <- c("scopus_abstracts", class(ab))
ab

ab[, c("doi", "title", "year")]
substr(ab$abstract[2], 1, 40)
```
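
The resilient-batch behaviour described above can be sketched with `tryCatch()`.
This is a minimal illustration, not the package's implementation: `lookup_one()`
and `lookup_many()` are hypothetical helpers, the identifiers are toy values, and
keeping the requested identifier in the `NA` row is a choice made for this sketch.

```{r}
# Hypothetical single-record lookup: errors for identifiers it does not know
lookup_one <- function(id) {
  known <- c(a = "First title", b = "Second title")
  if (!id %in% names(known)) stop("not found: ", id)
  data.frame(id = id, title = unname(known[id]), stringsAsFactors = FALSE)
}

# Sketch of a resilient batch: a failure becomes an NA row plus a warning,
# so the remaining identifiers are still processed
lookup_many <- function(ids) {
  rows <- lapply(ids, function(id) {
    tryCatch(
      lookup_one(id),
      error = function(e) {
        warning("Skipping ", id, ": ", conditionMessage(e), call. = FALSE)
        data.frame(id = id, title = NA_character_, stringsAsFactors = FALSE)
      }
    )
  })
  do.call(rbind, rows)
}

lookup_many(c("a", "zzz", "b"))
```

Catching the error per identifier, rather than around the whole loop, is what
lets the identifiers after a failure still be looked up.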

## Beyond five thousand records

A single Search API query returns at most its first 5000 records under ordinary
offset paging. When you need the whole of a larger result set in one pass,
`scopus_fetch(cursor = TRUE)` follows the API's cursor instead, which has no
such ceiling.

```{r eval = FALSE}
recs <- scopus_fetch("TITLE-ABS-KEY(microplastics)", cursor = TRUE)
nrow(recs)
```

The records then arrive in the API's deep-paging order rather than sorted by
relevance, an acceptable trade-off for a complete harvest. This is the one-call
alternative to the year-partitioned plan in the *Search plans and quota-aware
retrieval* article: a plan keeps each cell under the ceiling and preserves
relevance order, whereas `cursor = TRUE` gives up that order to fetch everything
in a single pass. Reach for the plan when you want cached, resumable cells, and
for the cursor when you want the complete set at once.
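
Cursor following itself can be sketched offline. This is an illustration under
stated assumptions, not what `scopus_fetch()` does internally: `fake_page()` is a
hypothetical stand-in for a paged endpoint, and its cursors are simple position
strings, whereas a real service would hand back opaque tokens.

```{r}
# Hypothetical paged endpoint over 12 fake records. It takes a cursor and
# returns one page plus the cursor for the next page.
fake_page <- function(cursor, page_size = 5, total = 12) {
  start <- if (identical(cursor, "*")) 1L else as.integer(cursor)
  end <- min(start + page_size - 1L, total)
  list(
    ids = seq.int(start, end),
    # NULL signals that nothing is left to fetch
    next_cursor = if (end < total) as.character(end + 1L) else NULL
  )
}

# Sketch of cursor following: keep requesting until no next cursor comes back
fetch_all <- function(page_fun = fake_page) {
  cursor <- "*"
  out <- list()
  while (!is.null(cursor)) {
    page <- page_fun(cursor)
    out[[length(out) + 1L]] <- page$ids
    cursor <- page$next_cursor
  }
  unlist(out)
}

length(fetch_all())
```

The loop never computes an offset of its own; it only passes back whatever
cursor the previous page returned, which is why no offset ceiling applies.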