Assignment #4, due Jan 13, 9am: Scrape Wikipedia data for current S&P 500 constituents #16
Comments
I am struggling to make R read the S&P 500 companies' Wikipedia pages. I have tried the following approach to get each firm's wikipage but failed, so I wonder if someone has figured out a more efficient way of doing this and could share some thoughts. I started from the source code of the S&P 500 list page, which contains a link to each company's wikipage. But I found that for many companies these links are redirects, which means that if I take the URL strings literally from the list, they will not take me directly to the company's wikipage. For example, the company Amcor plc (official name) has the wikipage https://en.wikipedia.org/wiki/Amcor, but in the source code of the list page the link points to a redirect instead.
Happy New Year to you all, and thank you Fengzhi for your question! The URLs that you provide are not technically incorrect but permanent redirects. Permanent redirects are very common on the Web, and it is important that you use tools that simply follow them instead of just providing you with the HTML of the redirect link. See:
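As a minimal sketch of that idea in R, assuming the httr package (GET() follows redirects by default), with the Amcor URL taken from your post:

```r
# Minimal sketch, assuming the httr package: GET() follows permanent
# redirects by default, so a redirect URL resolves to the actual article.
library(httr)

# URL from the post above; swap in the redirect-style URL from the list's
# source code to see it resolve to the same page.
resp <- GET("https://en.wikipedia.org/wiki/Amcor")

status_code(resp)  # 200 if the page was reached
resp$url           # the URL the request finally resolved to
```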
The code returns (at least for me) results indicating that it successfully accessed the actual Wiki pages of your URLs, regardless of whether they pointed to redirects or the 'correct' URL directly. Does this help?
Hi @joachim-gassen, thank you for the tips. However, the function I actually need is get_wiki_url_from_title, which takes the company name (which may or may not agree with the suffix of the company's wikipage URL) as input and returns its wikipage URL. I found the idea of using tools that simply follow redirections very useful, and I'll try to incorporate it into my code.
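Something along these lines is what I have in mind, building the URL from the title and letting httr resolve any redirect; this is only a rough sketch and I have not tested it against all constituents:

```r
# Rough sketch of the helper: build a Wikipedia URL from a title and let
# httr follow any redirect; returns NA if the page cannot be reached.
library(httr)

get_wiki_url_from_title <- function(title) {
  url <- paste0("https://en.wikipedia.org/wiki/", utils::URLencode(title))
  resp <- GET(url)
  if (status_code(resp) == 200) resp$url else NA_character_
}

get_wiki_url_from_title("Amcor")  # should return https://en.wikipedia.org/wiki/Amcor
```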
Hi Joachim, my first question is: can we use the Wikipedia API, or should we only scrape the HTML source code? Second, can we use XTools? It provides some nice page statistics and can be reached by clicking View history > Page statistics. Here is the link for 3M Company: https://xtools.wmflabs.org/articleinfo/en.wikipedia.org/3M Thanks.
Hi Yagmur: In principle, you can use whatever gets the job done, so using the Wikipedia API is fine. You can also use secondary data as long as it is current and the way it is collected is transparent. By the way, Wikipedia also provides summary page statistics. See: https://en.wikipedia.org/wiki/Help:Page_information
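As a rough sketch (not a requirement, and the parameter choices here are just one option), querying the MediaWiki API from R needs nothing more than httr and jsonlite; prop=info already returns the article length in bytes:

```r
# Sketch: basic page info for 3M via the MediaWiki API.
library(httr)
library(jsonlite)

resp <- GET(
  "https://en.wikipedia.org/w/api.php",
  query = list(action = "query", prop = "info", titles = "3M", format = "json")
)
info <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))
str(info$query$pages)  # includes fields such as 'length' (bytes) and 'touched'
```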
Your task is to collect and tidy Wikipedia data for the companies that constitute the Standard & Poor’s 500 index. You can find a convenient list here. The idea is to scrape some data from each company’s Wikipedia page and to prepare a tidy dataset that contains that data. You can decide yourself what data you want to collect for each constituent, but things that come to mind are:
• The info in the top right infobox
• The length of the Wikipedia article
• Some info on its revision history
Clearly, this is not an exhaustive list. Collect whatever data you find interesting and can obtain in a standardized way for a reasonable subset of firms. The tidy datasets should be stored in the “data” directory. If you feel like it, you can also prepare an informative visual based on your scraped data.
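If you want a place to start, here is a rough sketch, not a prescribed solution; it assumes the constituent table is the first wikitable on the list page and that company pages use the standard 'infobox' table class:

```r
# Rough sketch: read the constituent table from the list page and the
# infobox of one constituent with rvest.
library(rvest)

list_url <- "https://en.wikipedia.org/wiki/List_of_S%26P_500_companies"
list_page <- read_html(list_url)
constituents <- html_table(html_element(list_page, "table.wikitable"))

# Infobox of a single company page (assumes the usual 'infobox' table class)
company_page <- read_html("https://en.wikipedia.org/wiki/3M")
infobox <- html_table(html_element(company_page, "table.infobox"))
```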
You can use whatever packages or resources you find helpful for the task. As always, please make reference to all used resources in your code. Ideally, your code runs in the docker container. For Python users: please submit plain Python code, no Jupyter notebooks.
The deadline for this task is Monday, January 13th, 2020, 9am. Feel free to use this issue to discuss things that need clarification or to help each other.
Please note that I will be offline from Friday, 20th Dec, until Sunday, 5th Jan, 2020. Enjoy the break!