GCP Python Client Library for Data Science

GCP-DSClient (Google Cloud Platform Data Science Client) supports interactive data science by coordinating data science libraries (pandas, IPython) with Google Cloud infrastructure.

This client supports the following Google Cloud Platform services:

- Google BigQuery
- Google Cloud Storage
- Google Cloud Datastore
- Google Compute Engine

Quick Start

Installation

$ pip install -U git+https://github.com/orfeon/gcp-python-dsclient
$ pip3 install -U git+https://github.com/orfeon/gcp-python-dsclient

Initialize a Client instance

import dsclient

# When running on a GCE instance that has GCP access permissions,
# you only need to set the project name.
client = dsclient.Client("your project name")

# In other cases (e.g. running on a local PC), you need to set
# both the project name and the path to an access key file.
client = dsclient.Client("your project name", "./keyfile.json")

Using Google BigQuery with pandas

Google BigQuery (BigQuery API docs)

# Query and read the result as a pandas.DataFrame.
query_string = """
    SELECT date, year
    FROM aaa
    WHERE year = 2016
"""
df = client.query(query_string)  # Use lquery() for large results (a sketch follows this block).

# Upload (append) a pandas.DataFrame to an existing table on BigQuery.
client.load(df, "your_dataset.your_table")
# Overwrite the existing table instead.
client.load(df, "your_dataset.your_table", append=False)

# Insert the query result into a table (overwrites the table if it exists).
client.query(query_string, table_name="your_dataset.your_table_2")
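
The query() comment above mentions lquery() for large results; a minimal sketch of how it might be called, assuming it accepts the same query string as query() and returns a pandas.DataFrame (the exact signature is not documented here):

# Hypothetical usage: lquery() is assumed to mirror query()'s
# interface, trading speed for the ability to handle large results.
big_df = client.lquery(query_string)
print(big_df.shape)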

Using Google Cloud Storage with pandas

Google Cloud Storage (Storage API docs)

import pandas as pd
from sklearn.linear_model import LinearRegression

# Write a local pandas.DataFrame to Cloud Storage.
df1 = pd.DataFrame(...somedata...)
client.write_csv(df1, "gs://your_bucket/your_file1.csv")

# Read a pandas.DataFrame from a CSV file on Cloud Storage.
df2 = client.read_csv("gs://your_bucket/your_file2.csv")

# Write blob data (e.g. an ML model) to Cloud Storage.
reg = LinearRegression()  # e.g. a scikit-learn estimator
reg.fit(df1[["attr1","attr2",...]], df1["target"])
client.write_blob(reg, "gs://your_bucket/your_file.model")

# Read blob data back from Cloud Storage.
reg = client.read_blob("gs://your_bucket/your_file.model")
prd = reg.predict(df2[["attr1","attr2",...]])
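
For a self-contained illustration of the CSV round trip, a small sketch with concrete toy data (the bucket path and column names are hypothetical placeholders):

import pandas as pd

# Hypothetical toy data and bucket path, purely for illustration.
df = pd.DataFrame({"attr1": [1.0, 2.0, 3.0], "target": [2.1, 3.9, 6.2]})
client.write_csv(df, "gs://your_bucket/toy.csv")

# Reading it back should yield an equivalent DataFrame.
df_back = client.read_csv("gs://your_bucket/toy.csv")
print(df_back.head())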

Using Google Cloud Datastore with pandas

Google Cloud Datastore (Cloud Datastore API docs)

# Query and read data using GQL as a pandas.DataFrame.
df = client.gql("SELECT * FROM SomeKind WHERE date = '20170101'")
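
Because gql() returns an ordinary pandas.DataFrame, the result composes directly with standard pandas operations; a brief sketch continuing from the query above (the SomeKind properties are hypothetical):

# "date" is a hypothetical property of SomeKind, for illustration only.
daily_counts = df.groupby("date").size()
print(daily_counts)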

Using Google Compute Engine with IPython and IPyParallel

# Create a snapshot of the current instance on GCE.
client.create_current_snapshot("snapshot_name")

# Deploy an ipcluster from the snapshot.
client.deploy_ipcluster(profile="mycluster", snapshot="snapshot_name",
                        itype="standard", core=4, num=4)

# Execute some tasks on the ipcluster.
import ipyparallel
rc = ipyparallel.Client(profile='mycluster')
lview = rc.load_balanced_view()
results = lview.map_async(some_func, task_list)
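
# Collect the outputs once all engines finish; wait() and get()
# are standard ipyparallel AsyncMapResult methods.
results.wait()
outputs = results.get()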

# Add ipengines to the existing ipcluster.
client.add_ipengine(profile="mycluster", snapshot="snapshot_name",
                    itype="small", num=4)

# Delete the ipcluster.
client.delete_ipcluster(profile="mycluster")

# Stop the current instance when all tasks are finished.
client.stop_current_instance()

License

Apache 2.0 - See LICENSE for more information.
