
Revert the scrape implementation back to selenium #9

Merged · 9 commits · Aug 27, 2024
30 changes: 1 addition & 29 deletions README.md
@@ -1,38 +1,10 @@
# myscrapers

## myscrapers download sbi
## myscrapers-sbi
- Saves your SBI portfolio.
- Logs in to https://site1.sbisec.co.jp/ETGate/ automatically and saves each portfolio table.
- Output inside the container: /data/YYYYMMDD_x.csv
  - x: sequence number
- If the outputDir option is given, output goes to ${outputDir}/YYYYMMDD_x.csv
- Can upload the fetched data to S3 storage.
- If the environment variable `BUCKET_NAME` is set, the fetched data is uploaded to `s3://${BUCKET_NAME}/${REMOTE_DIR}/YYYYMMDD/` (see the run sketch below).
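
A minimal run sketch, assuming the image has been built and tagged `myscrapers-sbi:latest` (the tag used in deployment/compose.yml) and that a Selenium Chrome endpoint is reachable at `chromeAddr`; the S3 variables are optional and are only read by main.sh when `BUCKET_NAME` is set:

```
docker run --rm \
  -e user="<your sbi user>" \
  -e pass="<your sbi pass>" \
  -e chromeAddr="http://<selenium-host>:4444/wd/hub" \
  -e BUCKET_NAME="<bucket name>" \
  -e BUCKET_URL="https://s3.ap-northeast-1.wasabisys.com" \
  -e BUCKET_DIR="myscrapers/sbi/" \
  -e AWS_REGION="ap-northeast-1" \
  -e AWS_ACCESS_KEY_ID="<your key id>" \
  -e AWS_SECRET_ACCESS_KEY="<your secret>" \
  -v "$(pwd)/data:/data" \
  myscrapers-sbi:latest
```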

## myscrapers download moneyforward

### output CSV
Example:
```
計算対象,日付,内容,金額(円),保有金融機関,大項目,中項目,メモ,振替,削除
,07/16(火),ローソン,-291,三井住友カード,食費,食料品,,,
,07/16(火),GITHUB,-158,JCBカード,通信費,情報サービス,,,
,07/10(水),マクドナルド,-600,三井住友カード,食費,外食,,,
```

### Output location
- Default inside the container: `/data/YYYYMM/YYYYMMDD/cf.csv`; when `--lastmonth` is given, `/data/YYYYMM/YYYYMMDD/cf_lastmonth.csv` is also written.

## Quick start (binary)

```
docker run --rm -p 7327:7327 ghcr.io/go-rod/rod:v0.116.2
```

```
user=<your id> \
pass=<your pass> \
outputDir="." \
wsAddr="localhost:7327"
build/bin/myscrapers download moneyforward
```
20 changes: 7 additions & 13 deletions build/sbi/Dockerfile
@@ -1,13 +1,5 @@
FROM golang:1.21.0 as builder
FROM python:3.9-bookworm

COPY . /app/
WORKDIR /app

# Go build
RUN go mod download
RUN make bin-linux-amd64

FROM debian:bookworm-slim as runner
# Required Packages
RUN apt-get update && \
apt-get install -y curl unzip && \
@@ -18,7 +10,9 @@ RUN apt-get update && \
RUN curl -o /var/tmp/awscli.zip https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip && \
unzip -d /usr/local/bin/ /var/tmp/awscli.zip

RUN mkdir -p /usr/local/bin && mkdir -p /data/
COPY --from=builder /app/build/bin/myscrapers /usr/local/bin/myscrapers
COPY --chmod=755 build/sbi/main.sh /usr/local/bin/main.sh
ENTRYPOINT ["/usr/local/bin/main.sh"]
COPY /src/sbi/requirements.txt /tmp/
RUN pip install --upgrade pip && pip install -r /tmp/requirements.txt && mkdir -p /data
COPY --chmod=755 build/sbi/main.sh /src/main.sh
COPY src/sbi/ /src/

ENTRYPOINT ["/src/main.sh"]
5 changes: 2 additions & 3 deletions build/sbi/main.sh
@@ -9,7 +9,6 @@ YYYYMMDD=`date '+%Y%m%d'`
# AWS_REGION # from env (ex: ap-northeast-1)
# AWS_ACCESS_KEY_ID # from env
# AWS_SECRET_ACCESS_KEY # from env
# wsAddr # from env (ex: localhost:7327)

SCRAPERS_BIN="/usr/local/bin/myscrapers"
AWS_BIN="/usr/local/bin/aws/dist/aws"
@@ -19,11 +18,11 @@ REMOTE_DIR="${BUCKET_DIR}/${YYYYMM}"

function fetch () {
echo "fetcher start"
${SCRAPERS_BIN} download sbi
python3 -u /src/main.py
echo "fetcher complete"
}

function create_s3_credentials () {
function create_s3_credentials {
echo "s3 credentials create start"
mkdir -p ~/.aws/

24 changes: 24 additions & 0 deletions build/sbigo/Dockerfile
@@ -0,0 +1,24 @@
FROM golang:1.21.0 as builder

COPY . /app/
WORKDIR /app

# Go build
RUN go mod download
RUN make bin-linux-amd64

FROM debian:bookworm-slim as runner
# Required Packages
RUN apt-get update && \
apt-get install -y curl unzip && \
apt-get clean && \
rm -rf /var/lib/apt/lists/*

# AWS Setup
RUN curl -o /var/tmp/awscli.zip https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip && \
unzip -d /usr/local/bin/ /var/tmp/awscli.zip

RUN mkdir -p /usr/local/bin && mkdir -p /data/
COPY --from=builder /app/build/bin/myscrapers /usr/local/bin/myscrapers
COPY --chmod=755 build/sbi/main.sh /usr/local/bin/main.sh
ENTRYPOINT ["/usr/local/bin/main.sh"]
59 changes: 59 additions & 0 deletions build/sbigo/main.sh
@@ -0,0 +1,59 @@
#!/bin/bash
set -e
YYYYMM=`date '+%Y%m'`
YYYYMMDD=`date '+%Y%m%d'`

# BUCKET_URL # from env (ex: "https://s3.ap-northeast-1.wasabisys.com")
# BUCKET_NAME # from env (ex: hoge-system-stg-bucket)
# BUCKET_DIR # from env (ex: fetcher/sbi)
# AWS_REGION # from env (ex: ap-northeast-1)
# AWS_ACCESS_KEY_ID # from env
# AWS_SECRET_ACCESS_KEY # from env
# wsAddr # from env (ex: localhost:7327)

SCRAPERS_BIN="/usr/local/bin/myscrapers"
AWS_BIN="/usr/local/bin/aws/dist/aws"
DATA_DIR="/data"

REMOTE_DIR="${BUCKET_DIR}/${YYYYMM}"

function fetch () {
echo "fetcher start"
${SCRAPERS_BIN} download sbi
echo "fetcher complete"
}

function create_s3_credentials () {
echo "s3 credentials create start"
mkdir -p ~/.aws/

echo "[default]" >> ~/.aws/config
echo "region = ${AWS_REGION}" >> ~/.aws/config

echo "[default]" >> ~/.aws/credentials
echo "aws_access_key_id = ${AWS_ACCESS_KEY_ID}" >> ~/.aws/credentials
echo "aws_secret_access_key = ${AWS_SECRET_ACCESS_KEY}" >> ~/.aws/credentials

chmod 400 ~/.aws/config
chmod 400 ~/.aws/credentials
ls -la ~/.aws/
echo "s3 credentials create complete"
}

function s3_upload () {
echo "s3 upload start"
mkdir -p ${DATA_DIR}/${YYYYMM}
cp -f ${DATA_DIR}/*.csv ${DATA_DIR}/${YYYYMM}/ # ex. $DATA_DIR/YYYYMMDD_1.csv -> $DATA_DIR/$YYYYMM/YYYYMMDD_1.csv
rm ${DATA_DIR}/*.csv
${AWS_BIN} s3 cp ${DATA_DIR}/${YYYYMM}/ "s3://${BUCKET_NAME}/${REMOTE_DIR}" --recursive --endpoint-url="${BUCKET_URL}"
echo "s3 upload complete"
}

fetch

if [ -z "${BUCKET_NAME}" ]; then
exit 0
fi

create_s3_credentials
s3_upload
11 changes: 11 additions & 0 deletions deployment/.compose.yml
@@ -0,0 +1,11 @@
services:
  myscrapers-sbi-test:
    image: myscrapers-sbi:latest
    container_name: myscrapers-sbi-test
    environment:
      - wsAddr=katarina.int.azuki.blue:7317 # your browser value
      - TZ="JST-9"
    env_file:
      - sbi-token.env
    volumes:
      - ./browser/:/data/
16 changes: 5 additions & 11 deletions deployment/compose.yml
@@ -1,17 +1,11 @@
services:
  myscrapers-sbi-test:
    image: myscrapers:latest
    container_name: myscrapers-sbi-test
    image: myscrapers-sbi:latest
    container_name: myscrapers-sbi
    environment:
      - wsAddr=127.0.0.1:7317 # your browser value
      - chromeAddr=http://example.com:4444/wd/hub # your value
      - TZ="JST-9"
      # - BUCKET_NAME=azk-system-stg-bucket # from env (ex: hoge-system-stg-bucket)
      # - BUCKET_URL=https://s3.ap-northeast-1.wasabisys.com # if you set, fetch data will be updated
      # - BUCKET_DIR=myscrapers/sbi/
      # - AWS_REGION=ap-northeast-1
      # - AWS_ACCESS_KEY_ID=*** # your value
      # - AWS_SECRET_ACCESS_KEY=*** # your value
      - user=*** # your sbi user
      - pass=*** # your sbi pass
    env_file:
      - sbi-token.env
    volumes:
      - ./browser/:/data/
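
Both compose files read credentials from sbi-token.env, which is not part of this diff; a hypothetical layout, assuming it carries the user/pass values that were previously set inline in the environment block:

```
user=<your sbi user>
pass=<your sbi pass>
# Optional S3 upload settings read by main.sh:
# BUCKET_NAME=<bucket name>
# BUCKET_URL=https://s3.ap-northeast-1.wasabisys.com
# BUCKET_DIR=myscrapers/sbi/
# AWS_REGION=ap-northeast-1
# AWS_ACCESS_KEY_ID=<your key id>
# AWS_SECRET_ACCESS_KEY=<your secret>
```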
1 change: 1 addition & 0 deletions internal/scenario/sbi.go
@@ -205,6 +205,7 @@ func (s *ScenarioSBI) getPortfolio(ctx context.Context) error {
func (s *ScenarioSBI) Start(ctx context.Context) error {
	slog.Info("connect to browser")
	if err := s.getBrowser(ctx); err != nil {
		slog.Error("get browser error", "err", err.Error())
		return err
	}
	defer s.browser.Close()
19 changes: 19 additions & 0 deletions src/sbi/driver.py
@@ -0,0 +1,19 @@
from selenium import webdriver
import os

def get_remote_driver():
    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    options.add_argument("--no-sandbox")
    options.add_argument("--disable-gpu")
    options.add_argument("--lang=ja-JP")
    options.add_argument("--disable-dev-shm-usage")
    UA = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36"
    options.add_argument("--user-agent=" + UA)
    driver = webdriver.Remote(
        command_executor=os.getenv("chromeAddr"),
        options=options
    )

    driver.implicitly_wait(10)
    return driver
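
A local test sketch for this Selenium path, assuming the stock selenium/standalone-chrome image (not part of this repo) as the remote endpoint that chromeAddr points at; main.py writes to /data, so that directory must exist and be writable:

```
# start a disposable remote Chrome node on port 4444
docker run --rm -d -p 4444:4444 --shm-size=2g selenium/standalone-chrome

pip install -r src/sbi/requirements.txt

user="<your sbi user>" \
pass="<your sbi pass>" \
chromeAddr="http://localhost:4444/wd/hub" \
python3 src/sbi/main.py
```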
117 changes: 117 additions & 0 deletions src/sbi/main.py
@@ -0,0 +1,117 @@
import driver
import os
import datetime
import logging
from pythonjsonlogger import jsonlogger
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from bs4 import BeautifulSoup

lg = logging.getLogger(__name__)
lg.setLevel(logging.DEBUG)
h = logging.StreamHandler()
h.setLevel(logging.DEBUG)
json_fmt = jsonlogger.JsonFormatter(
    fmt="%(asctime)s %(levelname)s %(name)s %(message)s", json_ensure_ascii=False
)
h.setFormatter(json_fmt)
lg.addHandler(h)

SBI_USER = os.getenv("user")
SBI_PASS = os.getenv("pass")
SAVE_DIR = "/data"
LOGIN_URL = "https://site1.sbisec.co.jp/ETGate/"
PORT_URL = "https://site1.sbisec.co.jp/ETGate/?_ControlID=WPLETpfR001Control&_PageID=DefaultPID&_DataStoreID=DSWPLETpfR001Control&_ActionID=DefaultAID&getFlg=on"

def main():
    web_driver = None
    try:
        web_driver = driver.get_remote_driver()
        run_scenario(driver=web_driver)
    except Exception as e:
        lg.error("failed to run fetch program: %s", e, stack_info=True)
    finally:
        # Close the browser (only if the remote driver was actually created)
        if web_driver is not None:
            web_driver.quit()

def run_scenario(driver):
    # Open the login URL
    driver.get(LOGIN_URL)
    lg.info("Move Login page")
    element = driver.find_element(by=By.NAME, value="ACT_login")
    input_user_id = driver.find_element(by=By.NAME, value="user_id")
    input_user_id.send_keys(SBI_USER)
    input_user_password = driver.find_element(by=By.NAME, value="user_password")
    input_user_password.send_keys(SBI_PASS)

    # Press the login button
    driver.find_element(by=By.NAME, value="ACT_login").click()
    lg.info("Login")

    # Move to the portfolio page
    driver.get(PORT_URL)
    lg.info("Move portfolio page")

    soup = BeautifulSoup(driver.page_source, "html.parser")

    # Get the portfolio tables
    table_data = soup.find_all(
        "table", bgcolor="#9fbf99", cellpadding="4", cellspacing="1", width="100%"
    )

    # Save the fetched tables top to bottom as YYYYMMDD_x.csv, numbered #1, #2, ...
    for i in range(len(table_data)):
        fetch_data = createCSV(table_data[i])
        lg.info("create CSV: #{}".format(i + 1))
        writeCSV(fetch_data, i + 1)
        lg.info("write CSV")

# Build a raw CSV string from an HTML table element
def createCSV(table_data):
    outputCSV = ""
    m = []
    tbody = table_data.find("tbody")
    trs = tbody.find_all("tr")
    for tr in trs:
        r = []
        for td in tr.find_all("td"):
            # cell text may contain newlines; reshapeCSV cleans these up afterwards
            td_text_without_comma = td.text.replace(",", "")
            r.append(td_text_without_comma)
        m.append(r)
    for r in m:
        outputCSV += ",".join(r)

    return outputCSV


# Remove blank lines etc. from the raw string and tidy up the CSV format
def reshapeCSV(rawoutputCSV):
    outputCSV = rawoutputCSV.replace(",\n", ",")
    outputCSV = outputCSV.replace("\n\n", "\n")
    return outputCSV

# Write the generated CSV string to the target path
def writeCSV(rawoutputCSV, index):
    filepath = get_file_path(index)
    outputCSV = reshapeCSV(rawoutputCSV)

    with open(filepath, mode="w") as f:
        f.write(outputCSV)

    print(outputCSV)


# Get the output file path
def get_file_path(index):
    today = datetime.date.today()  # e.g. datetime.date(2020, 3, 22)
    yyyymm = "{0:%Y%m}".format(today)  # 202003
    yyyymmdd = "{0:%Y%m%d}".format(today)  # 20200322

    filepath = SAVE_DIR + "/" + yyyymmdd + "_" + str(index) + ".csv"
    return filepath


if __name__ == "__main__":
    main()
4 changes: 4 additions & 0 deletions src/sbi/requirements.txt
@@ -0,0 +1,4 @@
beautifulsoup4==4.12.2
selenium==4.12.0
webdriver-manager==4.0.0
python-json-logger>=2.0.7