HTML encoding is not autodetected properly #777
Comments
I'm pretty sure this is a problem with the terminal, not with Colly. Most terminals do not support Cyrillic output. If you put everything in a database, everything should look fine (I've crawled Cyrillic pages before and I know that it works). If you really need the output displayed in the terminal, try something like Windows PowerShell ISE, which has fairly good support for displaying Unicode.
It's not about the terminal; this example is just to reproduce the error. The data sent on to the API is also incorrect.
Yeah, I can reproduce it with colly/v2, too.
I solved the problem by adding a check for meta[http-equiv='Content-Type'] in the body when the header contains "text/html" but no "charset". I don't know if this is the correct approach, but it solves the problem. The change is in response.go.
There's a specific algorithm for detecting the encoding of an HTML document, defined here: https://html.spec.whatwg.org/multipage/parsing.html#determining-the-character-encoding. It also handles encoding declarations in meta elements. It's implemented in Go here: https://pkg.go.dev/golang.org/x/net/html/charset#DetermineEncoding. There's even a recipe for integrating it into goquery: https://github.com/PuerkitoBio/goquery/wiki/Tips-and-tricks/7fad3f848d40fbc4504912e57fb52f8fcee7e348. We really should incorporate it into Colly.
Just did some testing. Apparently the default Colly charset detection thinks the encoding is actually ISO-8859-1. I checked that by having the "fixCharset" function in the response file print out the encoding. Maybe we can try to implement a new type of encoding detection, or try to fix any bugs in the current one?
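To show why an ISO-8859-1 misdetection produces this kind of output, here is a small standard-library-only sketch. The byte sequence is chosen to match one of the garbled strings from the issue report; both decoder functions are hypothetical helpers written for this example, and the windows-1251 one covers only ASCII and the basic Cyrillic letters (0xC0-0xFF), not the full code page.

```go
package main

import "fmt"

// decodeLatin1 maps each byte to the code point with the same value,
// which is how ISO-8859-1 interprets bytes.
func decodeLatin1(b []byte) string {
	r := make([]rune, len(b))
	for i, c := range b {
		r[i] = rune(c)
	}
	return string(r)
}

// decodeCyrillic1251 decodes ASCII plus the windows-1251 range 0xC0-0xFF,
// which maps linearly onto U+0410-U+044F. Other high bytes become U+FFFD
// because this sketch omits the rest of the table.
func decodeCyrillic1251(b []byte) string {
	r := make([]rune, len(b))
	for i, c := range b {
		switch {
		case c < 0x80:
			r[i] = rune(c)
		case c >= 0xC0:
			r[i] = 0x0410 + rune(c-0xC0)
		default:
			r[i] = '\uFFFD'
		}
	}
	return string(r)
}

func main() {
	// The same eight bytes, read two different ways.
	raw := []byte{0xCF, 0xF0, 0xEE, 0xE4, 0xF3, 0xEA, 0xF2, 0xFB}
	fmt.Println(decodeLatin1(raw))       // Ïðîäóêòû
	fmt.Println(decodeCyrillic1251(raw)) // Продукты
}
```

The bytes themselves are intact in both cases; only the choice of decoding table differs, which is why fixing the detection step is enough to fix the output.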
Hi! When I try to recognize the encoding on sites with windows-1251, I get:
2023/08/23 21:45:10 ÑÄÎ «Ïðîìåòåé» | ÎÎÎ «Âèðòóàëüíûå òåõíîëîãèè â îáðàçîâàíèè»
2023/08/23 21:45:10 Ýëåêòðîííûå êóðñû
2023/08/23 21:45:10 Ïðîäóêòû
Example:
Neither colly.DetectCharset() nor c.DetectCharset = true works.