
Assignment #4, due Jan 13, 9am: Scrape Wikipedia data for current S&P 500 constituents #16

joachim-gassen opened this issue Dec 16, 2019 · 5 comments

@joachim-gassen (Owner)

Your task is to collect and tidy Wikipedia data for the companies that constitute the Standard & Poor’s 500 index. You can find a convenient list here. The idea is to scrape some data from each company’s Wikipedia page and to prepare a tidy dataset that contains that data. You can decide for yourself what data you want to collect for each constituent, but things that come to mind are:

• The info in the top right infobox
• The length of the Wikipedia article
• Some info on its revision history

Clearly, this is not an exhaustive list. Collect whatever data you find interesting and can obtain in a standardized way for a reasonable subset of firms. The tidy datasets should be stored in the “data” directory. If you feel like it, you can also prepare an informative visual based on your scraped data.
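If you want a starting point for the infobox, a rough sketch along the following lines could work (the "table.infobox" selector reflects Wikipedia's current markup and is an assumption on my part, so double-check it):

library(tidyverse)
library(rvest)

# Rough sketch: read one article and return its top-right infobox as a
# two-column tibble (field/value), dropping rows without a header cell.
scrape_infobox <- function(url) {
  rows <- read_html(url) %>%
    html_node("table.infobox") %>%
    html_nodes("tr")
  tibble(
    field = rows %>% html_node("th") %>% html_text(trim = TRUE),
    value = rows %>% html_node("td") %>% html_text(trim = TRUE)
  ) %>%
    filter(!is.na(field), !is.na(value))
}

scrape_infobox("https://en.wikipedia.org/wiki/3M")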

You can use whatever packages or resources you find helpful for the task. As always, please make reference to all used resources in the code. Ideally, your code runs in the docker container. For Python users: please submit plain Python code, not Jupyter notebooks.

The deadline for this task is Monday, January 13th, 2020, 9am. Feel free to use this issue to discuss things that need clarification or to help each other.

Please note that I will be offline from Friday, 20th Dec, until Sunday, 5th Jan, 2020. Enjoy the break!

@fengzhi22 (Contributor) commented Dec 31, 2019

I am struggling to make R read the S&P 500 companies' Wikipedia pages. I have tried the following method to get each firm's page but failed, so I wonder if someone has figured out a more efficient way of doing this and could share some thoughts.

I started from the source code of this S&P 500 list site, which contains the link to each company's wikipage. But I found that for many companies the wikipages are redirected, which means that if I take the URL strings literally from this list, they will not take me directly to the company's wikipage. For example, the company Amcor plc (official name) has a wikipage at https://en.wikipedia.org/wiki/Amcor, but in the source code of the list site its link appears like this:
[screenshot of the HTML source for the Amcor link]
However, the source code for Linde plc is
[screenshot of the HTML source for the Linde plc link]
Without being redirected, the string "Linde_plc" gets me to its wikipage correctly.
Also, the source code for Eli Lilly and Company is
[screenshot of the HTML source for the Eli Lilly and Company link]
In sum, I found that most of the redirected wikipages have irregular links that do not match the company names. How can I read the wikipages without adjusting each company by hand? Thank you in advance.

@joachim-gassen (Owner)

Happy New Year to you all, and thank you Fengzhi for your question! The URLs that you provide are not technically incorrect but permanent redirections. Permanent redirections are very common on the web, and it is important that you use tools that simply follow them instead of just providing you with the HTML of the redirect link. read_html(), for example, does that.

See:

library(tidyverse)
library(rvest)

# Read the page (following any redirects) and extract the article title
get_title_from_wiki_url <- function(url) {
  read_html(url) %>%
    html_node(xpath = '//*[@id="firstHeading"]/text()') %>%
    html_text()
}
  
sp500_urls <- c(
  "https://en.wikipedia.org/wiki/Amcor",
  "https://en.wikipedia.org/wiki/Linde_plc",
  "https://en.wikipedia.org/wiki/Eli_Lilly_and_Company"
)

sapply(sp500_urls, get_title_from_wiki_url)

The code returns (at least for me)

                https://en.wikipedia.org/wiki/Amcor 
                                            "Amcor" 
            https://en.wikipedia.org/wiki/Linde_plc 
                                        "Linde plc" 
https://en.wikipedia.org/wiki/Eli_Lilly_and_Company 
                            "Eli Lilly and Company" 

indicating that it successfully accessed the actual Wiki pages behind your URLs, regardless of whether they pointed to redirects or to the 'correct' URL directly.

Does this help?

@fengzhi22 (Contributor)

Hi @joachim-gassen, thank you for the tips. However, the function I actually need is get_wiki_url_from_title, which takes the company name (which may or may not match the suffix of the company's wikipage URL) as input and returns the wikipage URL. Still, I find the idea of using tools that simply follow redirections very useful and I'll try to incorporate it into my code.
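Something along these lines might do it (a rough sketch only; relying on the canonical <link> element of the rendered page is just one idea for resolving to the final article URL, and I have not tested it for all constituents):

library(tidyverse)
library(rvest)

# Rough sketch: build a candidate URL from the company name, let Wikipedia
# resolve any redirect, and read the final article URL from the canonical
# <link> element (assumption: this element points to the target page).
get_wiki_url_from_title <- function(title) {
  candidate <- paste0("https://en.wikipedia.org/wiki/",
                      utils::URLencode(gsub(" ", "_", title)))
  read_html(candidate) %>%
    html_node("link[rel='canonical']") %>%
    html_attr("href")
}

get_wiki_url_from_title("Amcor plc")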

@yagmurdalman (Contributor)

Hi Joachim,

My first question is: can we use the Wikipedia API, or should we only scrape the HTML source code?

Second, can we use XTools? It has some nice page statistics and can be reached by clicking View history > Page statistics. Here is the link for 3M Company:

https://xtools.wmflabs.org/articleinfo/en.wikipedia.org/3M

Thanks.

@joachim-gassen (Owner) commented Jan 11, 2020

Hi Yagmur:

In principle you can use whatever gets the job done, so using the Wikipedia API is fine. You can also use secondary data as long as it is current and the way it is being collected is transparent. By the way: Wikipedia also provides summary page statistics. See: https://en.wikipedia.org/wiki/Help:Page_information
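If you go the API route, a quick sketch of a MediaWiki Action API query for basic page information could look like this (the parameters follow the documented prop=info module; please verify against the API documentation before relying on specific fields):

library(httr)
library(jsonlite)

# Rough sketch: query the MediaWiki Action API for basic page information
# (article length, last revision id, etc.) for a single article.
resp <- GET(
  "https://en.wikipedia.org/w/api.php",
  query = list(
    action = "query",
    prop   = "info",
    titles = "3M",
    format = "json"
  )
)
page_info <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))$query$pages
str(page_info)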
