---
title: "Extract Social Media Meta Data"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{extract_meta_data}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

We start with loading the required packages. `mediacloudr` is used to download
an article based on a article id received from the 
[Media Cloud Topic Mapper](https://topics.mediacloud.org/). Furthermore, 
`mediacloudr` provides a function to extract social media meta data from 
HTML documents. `httr` is used to turn R into a HTTP client to download and 
process the responding article page. We use `xml2` to parse - make it readable
for R - the HTML document and `rvest` to find elements of interest within
the HTML document.

```{r setup, eval=FALSE}
# load required packages
library(mediacloudr)
library(httr)
library(xml2)
library(rvest)
```

In the first step, we request the article with the id `1126843780`. It is 
important to add the upper case `L` to the number to turn the numeric type into
an integer type. Otherwise the function will throw an error. The article was 
selected with help of the 
[Media Cloud Topic Mapper](https://topics.mediacloud.org/) online tool. If you
created an account, you can create and analyze your own topics.

```{r get_article, eval=FALSE}
# define media id as integer
story_id <- 1126843780L
# download article
article <- get_story(story_id = story_id)
```

The USA Today [news article](https://eu.usatoday.com/story/news/2018/12/27/ice-drops-off-migrants-phoenix-greyhound-bus-station/2429545002/) comes with an URL which we can use to download the 
complete article using the `httr` package. We use the `GET` function to 
download the article. Afterwards, we extract the website using the `content`
function. It is important to provide the `type` argument to extract the text 
only. Otherwise, the function tries to guess the type and will automatically 
parse the content based on the `content-type` HTTP header. The author of the
`httr` package 
[suggests](https://CRAN.R-project.org/package=httr/vignettes/quickstart.html) 
to manually parse the content. In this case, we use the `read_html` function
which is provided in the `xml2` package.

```{r download_website, eval=FALSE}
# download article
response <- GET(article$url[1])
# extract article html
html_document <- content(response, type = "text", encoding = "UTF-8")
# parse website 
parsed_html <- read_html(html_document)
```

After parsing the response into a R readable format, we extract the actual body 
of the article. Therefore, we use the `html_nodes` function to find the html
tags with defined in the `css` argument. A useful open source tool to find the 
corresponding tags or css classes is the 
[Selector Gadget](https://selectorgadget.com/). Alternatively, you can use the
developer tools of the browser you are usually using. The `html_text` provides
us with a character vector. Each element contains a paragraph of the article. We
use the `paste` function to merge the paragraph into one closed text. We could
analyze the text using different metrics such as word frequencies or sentiment
analysis.

```{r body_content, eval=FALSE}
# extract article body
article_body_nodes <- html_nodes(x = parsed_html, css = ".content-well div p")
article_body <- html_text(x = article_body_nodes)
# paste character vector to one text
article_body <- paste(article_body, collapse = " ")
```

In the last step, we extract the social media meta data from the article. Social
media meta data are shown if the article URL is shared on social media. The
article representation usually include a heading, summary and a small 
image/thumbnail. The `extract_meta_data` expects a raw HTML document and 
provides [Open Graph](http://ogp.me/) (a standard introduced by Facebook), [Twitter](https://developer.twitter.com/en/docs/tweets/optimize-with-cards/overview/markup.html) 
and native meta data.

```{r get_meta_data, eval=FALSE}
# extract meta data from html document
meta_data <- extract_meta_data(html_doc = html_document)
```

Open Graph Title: *"ICE drops off migrants at Phoenix bus station"*

Article Title (provided by mediacloud.org): *"Arizona churches working to help 
migrants are 'at capacity' or 'tapped out on resources'"*

The meta data can be compared to the original content of the article. A short
analysis reveals that USA Today chose a different heading to advertise the 
article on Facebook. Larger analysis can use quantitative tools such as string
similarity measures, such as provided by the `stringdist` package.
