---
title: "Web Scraping"
subtitle: "Statistical Programming"
author: "Shawn Santo"
institute: ""
date: "10-01-19"
output:
  xaringan::moon_reader:
    css: "slides_update.css"
    lib_dir: libs
    nature:
      highlightStyle: github
      highlightLines: true
      countIncrementalSlides: false
editor_options:
  chunk_output_type: console
---

```{r include=FALSE}
knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE,
                      comment = "#>", highlight = TRUE,
                      fig.align = "center")
```

class: inverse, center, middle

# HTML

---

## Hypertext Markup Language

- HTML describes the structure of a web page; your browser interprets the structure and contents and displays the results.

- The basic building blocks include elements, tags, and attributes.
    - an element is a component of an HTML document
    - elements are generally wrapped in tags (start and end tag)
    - attributes provide additional information about HTML elements
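
As a minimal sketch of these building blocks, `rvest` can split a single element into its tag name, attributes, and content. The markup string and its `href` and `class` values are made up for illustration and are not part of the original slides.

```{r}
library(rvest)

# One element: <a> start tag, content, </a> end tag.
# The href and class attribute values are hypothetical.
snippet <- '<a href="https://example.com" class="link">Example site</a>'

# read_html() parses the fragment; html_nodes() selects the <a> element
link <- html_nodes(read_html(snippet), "a")

html_name(link)   # the tag's name
html_attrs(link)  # all attributes of the tag
html_text(link)   # the content between the start and end tags
```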
---

## Simple HTML document

```html
<html>
  <head>
    <title>Web Scraping</title>
  </head>
  <body>

    <h1>Using rvest</h1>
    <p>To get started...</p>

  </body>
</html>
```

We can visualize this in a tree-like structure...

---

## HTML as a tree
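
One way to see the tree from R is sketched below, assuming the `xml2` package (which `rvest` builds on) is installed. The tags in the markup are an assumed layout for the simple document: a `<title>` inside `<head>`, and an `<h1>` and a `<p>` inside `<body>`.

```{r}
library(xml2)

# Assumed markup for the simple document (illustrative tags)
doc <- read_html("<html>
  <head><title>Web Scraping</title></head>
  <body>
    <h1>Using rvest</h1>
    <p>To get started...</p>
  </body>
</html>")

# Print the nesting of the elements, indented by depth
xml_structure(doc)

# Walk the tree by hand: children of the root, then children of <body>
xml_name(xml_children(doc))
xml_name(xml_children(xml_child(doc, "body")))
```

Each element is a node; `<head>` and `<body>` are children of `<html>`, and the text sits at the leaves.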
If we have access to an HTML document, then how can we easily extract information?

---

class: inverse, center, middle

# `rvest`

---

## Package `rvest`

`rvest` is a package from Hadley Wickham that makes basic processing and manipulation of HTML data easy.

```{r}
library(rvest)
```

Core functions:

- `read_html()` - read HTML data from a URL or character string

- `html_nodes()` - select specified nodes from the HTML document using CSS selectors

- `html_table()` - parse an HTML table into a data frame

- `html_text()` - extract tag pairs' content

- `html_name()` - extract tags' names

- `html_attrs()` - extract all of each tag's attributes

- `html_attr()` - extract tags' attribute value by name

---

## `html_document`

```{r}
simple_html <- "<html>
  <head>
    <title>Web Scraping</title>
  </head>
  <body>

    <h1>Using rvest</h1>
    <p>To get started...</p>

  </body>
</html>"

html_doc <- read_html(simple_html)

attributes(html_doc)
```

---

```{r}
html_doc
```

---

## CSS selectors

To extract components out of HTML documents use `html_nodes()` and CSS selectors. In CSS, selectors are patterns used to select elements you want to style.

We can determine the necessary CSS selectors we need via the point-and-click tool [selector gadget](https://selectorgadget.com/). More on this in a moment.

.small-text[

Selector          |  Example         | Description
:-----------------|:-----------------|:--------------------------------------------------
element           |  `p`             | Select all `<p>` elements
element element   |  `div p`         | Select all `<p>` elements inside a `<div>` element
element>element   |  `div > p`       | Select all `<p>` elements with `<div>` as a parent
.class            |  `.title`        | Select all elements with class="title"
#id               |  `#name`         | Select all elements with id="name"
[attribute]       |  `[class]`       | Select all elements with a class attribute
[attribute=value] |  `[class=title]` | Select all elements with class="title"

]

For more CSS selector references click [here](https://www.w3schools.com/cssref/css_selectors.asp).

???

- CSS stands for Cascading Style Sheets.
- CSS describes how HTML elements are to be displayed on screen, paper, or in other media.
- CSS can be added to HTML elements in 3 ways:
    - Inline - by using the style attribute in HTML elements
    - Internal - by using a