
Assignment #4, due Jan 13, 9am: Scrape Wikipedia data for current S&P 500 constituents #16

joachim-gassen opened this issue Dec 16, 2019 · 5 comments

@joachim-gassen (Owner)

Your task is to collect and tidy Wikipedia data for the companies that constitute the Standard & Poor’s 500 index. You can find a convenient list here. The idea is to scrape some data from each company’s Wikipedia page and to prepare a tidy dataset that contains that data. You can decide for yourself what data you want to collect for each constituent, but things that come to mind are:

• The info in the top right infobox
• The length of the Wikipedia article
• Some info on its revision history

Clearly, this is not an exhaustive list. Collect whatever data you find interesting and can obtain in a standardized way for a reasonable subset of firms. The tidy datasets should be stored in the “data” directory. If you feel like it, you can also prepare an informative visual based on your scraped data.
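If you want a starting point for the infobox, a rough sketch along the following lines could work (the "table.infobox" selector reflects Wikipedia's current markup and is an assumption on my part, so double-check it):

library(tidyverse)
library(rvest)

# Rough sketch: read one article and return its top-right infobox as a
# two-column tibble (field/value), dropping rows without a header cell.
scrape_infobox <- function(url) {
  rows <- read_html(url) %>%
    html_node("table.infobox") %>%
    html_nodes("tr")
  tibble(
    field = rows %>% html_node("th") %>% html_text(trim = TRUE),
    value = rows %>% html_node("td") %>% html_text(trim = TRUE)
  ) %>%
    filter(!is.na(field), !is.na(value))
}

scrape_infobox("https://en.wikipedia.org/wiki/3M")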

You can use whatever packages or resources you find helpful for the task. As always, please make reference to all used resources in the code. Ideally, your code runs in the docker container. For Python users: please submit plain Python code, not Jupyter notebooks.

The deadline for this task is Monday, January 13th, 2020, 9am. Feel free to use this issue to discuss things that need clarification or to help each other.

Please note that I will be offline from Friday, 20th Dec, until Sunday, 5th Jan, 2020. Enjoy the break!

@fengzhi22 (Contributor) commented Dec 31, 2019

I am struggling to make R read the S&P 500 companies' Wikipedia pages. I have tried the following method to get each firm's page but failed, so I wonder if someone has figured out a more efficient way of doing this and could share some thoughts.

I started from the source code of this S&P 500 list site, which contains the link to each company's wikipage. But I found that for many companies the wikipages are redirected, which means that if I take the URL strings literally from this list, they will not take me directly to the company's wikipage. For example, the company Amcor plc (official name) has a wikipage at https://en.wikipedia.org/wiki/Amcor, but in the source code of the list site its link appears like this:
[screenshot of the HTML source for the Amcor link]
However, the source code for Linde plc is
[screenshot of the HTML source for the Linde plc link]
Without being redirected, the string "Linde_plc" gets me to its wikipage correctly.
Also, the source code for Eli Lilly and Company is
[screenshot of the HTML source for the Eli Lilly and Company link]
In sum, I found that most of the redirected wikipages have irregular links that do not match the company names. How can I read the wikipages without adjusting each company by hand? Thank you in advance.

@joachim-gassen (Owner)

Happy New Year to you all, and thank you Fengzhi for your question! The URLs that you provide are not technically incorrect but permanent redirections. Permanent redirections are very common on the web, and it is important that you use tools that simply follow them instead of just providing you with the HTML of the redirect link. read_html(), for example, does that.

See:

library(tidyverse)
library(rvest)

# Read the page (following any redirects) and extract the article title
get_title_from_wiki_url <- function(url) {
  read_html(url) %>%
    html_node(xpath = '//*[@id="firstHeading"]/text()') %>%
    html_text()
}
  
sp500_urls <- c(
  "https://en.wikipedia.org/wiki/Amcor",
  "https://en.wikipedia.org/wiki/Linde_plc",
  "https://en.wikipedia.org/wiki/Eli_Lilly_and_Company"
)

sapply(sp500_urls, get_title_from_wiki_url)

The code returns (at least for me)

                https://en.wikipedia.org/wiki/Amcor 
                                            "Amcor" 
            https://en.wikipedia.org/wiki/Linde_plc 
                                        "Linde plc" 
https://en.wikipedia.org/wiki/Eli_Lilly_and_Company 
                            "Eli Lilly and Company" 

indicating that it successfully accessed the actual Wiki pages behind your URLs, regardless of whether they pointed to redirects or to the 'correct' URL directly.

Does this help?

@fengzhi22 (Contributor)

Hi @joachim-gassen, thank you for the tips. However, the function I actually need is get_wiki_url_from_title, which takes the company name (which may or may not match the suffix of the company's wikipage URL) as input and returns the wikipage URL. Still, I find the idea of using tools that simply follow redirections very useful and I'll try to incorporate it into my code.
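Something along these lines might do it (a rough sketch only; relying on the canonical <link> element of the rendered page is just one idea for resolving to the final article URL, and I have not tested it for all constituents):

library(tidyverse)
library(rvest)

# Rough sketch: build a candidate URL from the company name, let Wikipedia
# resolve any redirect, and read the final article URL from the canonical
# <link> element (assumption: this element points to the target page).
get_wiki_url_from_title <- function(title) {
  candidate <- paste0("https://en.wikipedia.org/wiki/",
                      utils::URLencode(gsub(" ", "_", title)))
  read_html(candidate) %>%
    html_node("link[rel='canonical']") %>%
    html_attr("href")
}

get_wiki_url_from_title("Amcor plc")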

@yagmurdalman (Contributor)

Hi Joachim,

My first question is: can we use the Wikipedia API, or should we only scrape the HTML source code?

Second, can we use XTools? It has some nice page statistics and can be reached by clicking View history > Page statistics. Here is the link for 3M Company:

https://xtools.wmflabs.org/articleinfo/en.wikipedia.org/3M

Thanks.

@joachim-gassen (Owner) commented Jan 11, 2020

Hi Yagmur:

In principle you can use whatever gets the job done, so using the Wikipedia API is fine. You can also use secondary data as long as it is current and the way it is being collected is transparent. By the way: Wikipedia also provides summary page statistics. See: https://en.wikipedia.org/wiki/Help:Page_information
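If you go the API route, a quick sketch of a MediaWiki Action API query for basic page information could look like this (the parameters follow the documented prop=info module; please verify against the API documentation before relying on specific fields):

library(httr)
library(jsonlite)

# Rough sketch: query the MediaWiki Action API for basic page information
# (article length, last revision id, etc.) for a single article.
resp <- GET(
  "https://en.wikipedia.org/w/api.php",
  query = list(
    action = "query",
    prop   = "info",
    titles = "3M",
    format = "json"
  )
)
page_info <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))$query$pages
str(page_info)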
