Hi, I'm facing two problems:

1 - I need to use Ocropy to extract text from documents in Portuguese. So far, I have added the required characters in chars.py and I am training the network (starting from a previously trained model) following https://github.com/tmbdev/ocropy/wiki/Working-with-Ground-Truth; the command pattern I'm using is sketched below.

2 - I know about the document quality requirements (300 dpi), but some of the images I have are bad scans. I've tried the same images with other APIs (like Google Vision) and got better results, but I prefer ocropy. I'm wondering if there are preprocessing techniques that can improve the results; a possible starting point is also sketched below.

So, what can I do? What is the best way to generate training data for the ocropy network?

Edit: does ocropy training support multithreading?

Thanks!
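For problem 1, the workflow I'm following is roughly this (a sketch based on the wiki; the paths and the model name pt-model are placeholders):

```sh
# binarize the raw scans (ocropy expects clean, ~300 dpi, black-on-white input)
ocropus-nlbin scans/*.png -o book
# segment the binarized pages into text lines
ocropus-gpageseg 'book/????.bin.png'
# fine-tune from the pretrained English model instead of training from scratch
ocropus-rtrain --load en-default.pyrnn.gz -o pt-model 'book/????/??????.bin.png'
```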
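For problem 2, would a cleanup pass before binarization be a reasonable start? For example, with ImageMagick (the exact flags and values here are my guesses at what might help, not ocropy recommendations):

```sh
# straighten, despeckle and stretch contrast before handing the page to ocropus-nlbin
convert scan.png -colorspace Gray -deskew 40% -despeckle -level 15%,85% scan-clean.png
```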
Python version:
Python 2.7.14 :: Anaconda, Inc.

Git revision of ocropy:
commit e9b6121 (Merge: 43381c4 289a58f; Konstantin Baierer, Mon Feb 19 19:24:12 2018 +0100; merge of PR #236 from lehzwo/master: "ocropus-gpageseg: Enable usage of masks to specify column separators / ignore areas of scan")

Operating System and version:
Linux ubuntu-virtual 4.10.0-28-generic #32~16.04.2-Ubuntu SMP Thu Jul 20 10:19:48 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
What do your confusions currently look like, i.e. the output of ocropus-econf? In general it is hard to say what will improve the accuracy. Can you share 2-3 of your documents here?
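For reference, the error and confusion reports can be generated like this (a sketch; it assumes ocropus-rpred has already written its .txt predictions alongside the .gt.txt ground truth):

```sh
# overall character error rate
ocropus-errs 'book/????/??????.gt.txt'
# most frequent character confusions (predicted vs. ground truth)
ocropus-econf 'book/????/??????.gt.txt'
```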
The confusions on the test data are:
errors 233
missing 0
total 4894
err 4.761 %
errnomiss 4.761 %
28 ÇÆ çã
15 8 S
14 Ä á
13 Æ ã
12 Ë í
11 Ï ó
7 È é
7 0A ÇÃ
5 ÇÔ çõ
5 , .
(overall error rate: 0.0476093175317)
I left my model training all night on Portuguese texts and line images generated by ocropus-linegen. The training error is decreasing, but the test error is worse than the default (English) model's.

Last 4 test errors:
0.04298535663675012
0.050070854983467174
0.0547945205479452
0.05550307038261691

It is currently at 19,000 iterations.
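The line generation step looks roughly like this (a sketch; the corpus and font files are placeholders, and I'm relying on linegen writing matching line-image/ground-truth pairs):

```sh
# render synthetic Portuguese text lines plus ground truth for training
ocropus-linegen -t portuguese-corpus.txt -f DejaVuSans.ttf
```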
I'll see if I can share some files and come back here.
Thanks for your reply!

Edit:
Files:
These are good-quality files. For now, I'm trying to get better results with the Portuguese characters rather than worrying about scan quality.