EcommerceTools
EcommerceTools is a data science toolkit for people working in technical ecommerce, marketing science, and technical SEO. It includes a wide range of features to aid analysis and model building. The package is written in Python, is designed to be used with Pandas, and works within a Jupyter notebook environment or in standalone Python projects.
Installation
You can install EcommerceTools and its dependencies via PyPI by entering pip3 install ecommercetools
in your terminal, or !pip3 install ecommercetools
within a Jupyter notebook cell.
Modules
Transactions
1. Load sample transaction items data
If you want to get started with the transactions, products, and customers features, you can use the load_sample_data()
function to load a set of real-world data. This imports the transaction items from the widely used Online Retail dataset and reformats them for use by EcommerceTools.
from ecommercetools import utilities
transaction_items = utilities.load_sample_data()
transaction_items.head()
| order_id | sku | description | quantity | order_date | unit_price | customer_id | country | line_price |
---|
0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 2010-12-01 08:26:00 | 2.55 | 17850.0 | United Kingdom | 15.30 |
---|
1 | 536365 | 71053 | WHITE METAL LANTERN | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom | 20.34 |
---|
2 | 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 2010-12-01 08:26:00 | 2.75 | 17850.0 | United Kingdom | 22.00 |
---|
3 | 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom | 20.34 |
---|
4 | 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom | 20.34 |
---|
2. Create a transaction items dataframe
The utilities
module includes a range of tools that allow you to format data, so it can be used within other EcommerceTools functions. The load_transaction_items()
function is used to create a Pandas dataframe of formatted transactional item data. When loading your transaction items data, all you need to do is define the column mappings, and the function will reformat the dataframe accordingly.
import pandas as pd
from ecommercetools import utilities
transaction_items = utilities.load_transaction_items('transaction_items_non_standard_names.csv',
date_column='InvoiceDate',
order_id_column='InvoiceNo',
customer_id_column='CustomerID',
sku_column='StockCode',
quantity_column='Quantity',
unit_price_column='UnitPrice'
)
transaction_items.to_csv('transaction_items.csv', index=False)
print(transaction_items.head())
| order_id | sku | description | quantity | order_date | unit_price | customer_id | country | line_price |
---|
0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 2010-12-01 08:26:00 | 2.55 | 17850.0 | United Kingdom | 15.30 |
---|
1 | 536365 | 71053 | WHITE METAL LANTERN | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom | 20.34 |
---|
2 | 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 2010-12-01 08:26:00 | 2.75 | 17850.0 | United Kingdom | 22.00 |
---|
3 | 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom | 20.34 |
---|
4 | 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom | 20.34 |
---|
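As a rough illustration of what the column mapping achieves, the sketch below renames non-standard column names to the standard ones shown in the table above and derives line_price as quantity multiplied by unit_price. It uses plain Python dictionaries rather than Pandas, and both COLUMN_MAP and standardise_row() are hypothetical names for illustration, not part of EcommerceTools.

```python
# Hypothetical mapping from source column names to the standard names
# shown in the output above.
COLUMN_MAP = {
    'InvoiceNo': 'order_id',
    'StockCode': 'sku',
    'Quantity': 'quantity',
    'InvoiceDate': 'order_date',
    'UnitPrice': 'unit_price',
    'CustomerID': 'customer_id',
}


def standardise_row(row):
    """Rename mapped columns and add a line_price field."""
    renamed = {COLUMN_MAP.get(name, name): value for name, value in row.items()}
    # line_price is assumed to be quantity * unit_price, as in the sample data.
    renamed['line_price'] = round(renamed['quantity'] * renamed['unit_price'], 2)
    return renamed


raw_row = {
    'InvoiceNo': '536365',
    'StockCode': '85123A',
    'Quantity': 6,
    'InvoiceDate': '2010-12-01 08:26:00',
    'UnitPrice': 2.55,
    'CustomerID': 17850.0,
}
standard_row = standardise_row(raw_row)
print(standard_row)
```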
3. Create a transactions dataframe
The get_transactions()
function takes the formatted Pandas dataframe of transaction items and returns a Pandas dataframe of aggregated transaction data, which includes features identifying the order number.
import pandas as pd
from ecommercetools import transactions
transaction_items = pd.read_csv('transaction_items.csv')
# Use a separate variable name so the transactions module is not overwritten
transactions_df = transactions.get_transactions(transaction_items)
transactions_df.to_csv('transactions.csv', index=False)
print(transactions_df.head())
| order_id | order_date | customer_id | skus | items | revenue | replacement | order_number |
---|
0 | 536365 | 2010-12-01 08:26:00 | 17850.0 | 7 | 40 | 139.12 | 0 | 1 |
---|
1 | 536366 | 2010-12-01 08:28:00 | 17850.0 | 2 | 12 | 22.20 | 0 | 2 |
---|
2 | 536367 | 2010-12-01 08:34:00 | 13047.0 | 12 | 83 | 278.73 | 0 | 1 |
---|
3 | 536368 | 2010-12-01 08:34:00 | 13047.0 | 4 | 15 | 70.05 | 0 | 2 |
---|
4 | 536369 | 2010-12-01 08:35:00 | 13047.0 | 1 | 3 | 17.85 | 0 | 3 |
---|
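The aggregation described above can be sketched in plain Python. This is an illustration under stated assumptions, not the library's implementation: it groups toy transaction items (made-up rows, not the sample dataset) by order, counts unique SKUs, sums items and revenue, and numbers each customer's orders in date order.

```python
from collections import defaultdict

# Toy transaction items:
# (order_id, customer_id, order_date, sku, quantity, line_price)
items = [
    ('A1', 'C1', '2021-01-01 09:00:00', 'SKU-1', 6, 15.30),
    ('A1', 'C1', '2021-01-01 09:00:00', 'SKU-2', 6, 20.34),
    ('A2', 'C1', '2021-01-02 10:00:00', 'SKU-1', 2, 5.10),
    ('A3', 'C2', '2021-01-03 11:00:00', 'SKU-3', 1, 9.99),
]

# Aggregate line items into one record per order.
orders = {}
for order_id, customer_id, order_date, sku, quantity, line_price in items:
    order = orders.setdefault(order_id, {
        'order_id': order_id, 'customer_id': customer_id,
        'order_date': order_date, 'sku_set': set(), 'items': 0, 'revenue': 0.0,
    })
    order['sku_set'].add(sku)
    order['items'] += quantity
    order['revenue'] += line_price

# Number each customer's orders in date order (1 = first order).
seen = defaultdict(int)
transactions = []
for order in sorted(orders.values(), key=lambda o: (o['order_date'], o['order_id'])):
    seen[order['customer_id']] += 1
    transactions.append({
        'order_id': order['order_id'],
        'order_date': order['order_date'],
        'customer_id': order['customer_id'],
        'skus': len(order['sku_set']),
        'items': order['items'],
        'revenue': round(order['revenue'], 2),
        'order_number': seen[order['customer_id']],
    })

for row in transactions:
    print(row)
```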
Products
1. Get product data from transaction items
from ecommercetools import products
products_df = products.get_products(transaction_items)
products_df.head()
| sku | first_order_date | last_order_date | customers | orders | items | revenue | avg_unit_price | avg_quantity | avg_revenue | avg_orders | product_tenure | product_recency |
---|
0 | 10002 | 2010-12-01 08:45:00 | 2011-04-28 15:05:00 | 40 | 73 | 1037 | 759.89 | 1.056849 | 14.205479 | 10.409452 | 1.82 | 3749 | 3600 |
---|
1 | 10080 | 2011-02-27 13:47:00 | 2011-11-21 17:04:00 | 19 | 24 | 495 | 119.09 | 0.376667 | 20.625000 | 4.962083 | 1.26 | 3660 | 3393 |
---|
2 | 10120 | 2010-12-03 11:19:00 | 2011-12-04 13:15:00 | 25 | 29 | 193 | 40.53 | 0.210000 | 6.433333 | 1.351000 | 1.16 | 3746 | 3380 |
---|
3 | 10123C | 2010-12-03 11:19:00 | 2011-07-15 15:05:00 | 3 | 4 | -13 | 3.25 | 0.487500 | -3.250000 | 0.812500 | 1.33 | 3746 | 3522 |
---|
4 | 10123G | 2011-04-08 11:13:00 | 2011-04-08 11:13:00 | 0 | 1 | -38 | 0.00 | 0.000000 | -38.000000 | 0.000000 | inf | 3620 | 3620 |
---|
2. Calculate product consumption and repurchase rate
from ecommercetools import products
repurchase_rates = products.get_repurchase_rates(transaction_items)
repurchase_rates.head(3).T
| 0 | 1 | 2 |
---|
sku | 10002 | 10080 | 10120 |
---|
revenue | 759.89 | 119.09 | 40.53 |
---|
items | 1037 | 495 | 193 |
---|
orders | 73 | 24 | 29 |
---|
customers | 40 | 19 | 25 |
---|
avg_unit_price | 1.05685 | 0.376667 | 0.21 |
---|
avg_line_price | 10.4095 | 4.96208 | 1.351 |
---|
avg_items_per_order | 14.2055 | 20.625 | 6.65517 |
---|
avg_items_per_customer | 25.925 | 26.0526 | 7.72 |
---|
purchased_individually | 0 | 0 | 9 |
---|
purchased_once | 34 | 17 | 22 |
---|
bulk_purchases | 73 | 24 | 20 |
---|
bulk_purchase_rate | 1 | 1 | 0.689655 |
---|
repurchases | 39 | 7 | 7 |
---|
repurchase_rate | 0.534247 | 0.291667 | 0.241379 |
---|
repurchase_rate_label | Moderate repurchase | Low repurchase | Low repurchase |
---|
bulk_purchase_rate_label | Very high bulk | Very high bulk | High bulk |
---|
bulk_and_repurchase_label | Moderate repurchase_Very high bulk | Low repurchase_Very high bulk | Low repurchase_High bulk |
---|
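The rates in the output above appear consistent with repurchase_rate being repurchases divided by orders, and bulk_purchase_rate being bulk_purchases divided by orders. That relationship is inferred from the sample figures rather than taken from the library source; the short check below reproduces the three columns shown.

```python
# Values copied from the sample output above for SKUs 10002, 10080 and 10120.
rows = [
    {'sku': '10002', 'orders': 73, 'repurchases': 39, 'bulk_purchases': 73},
    {'sku': '10080', 'orders': 24, 'repurchases': 7, 'bulk_purchases': 24},
    {'sku': '10120', 'orders': 29, 'repurchases': 7, 'bulk_purchases': 20},
]

for row in rows:
    # Assumed definitions: both rates are shares of the SKU's orders.
    row['repurchase_rate'] = round(row['repurchases'] / row['orders'], 6)
    row['bulk_purchase_rate'] = round(row['bulk_purchases'] / row['orders'], 6)
    print(row['sku'], row['repurchase_rate'], row['bulk_purchase_rate'])
```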
Customers
1. Create a customers dataset
from ecommercetools import customers
customers_df = customers.get_customers(transaction_items)
customers_df.head()
| customer_id | revenue | orders | skus | items | first_order_date | last_order_date | avg_items | avg_order_value | tenure | recency | cohort |
---|
0 | 12346.0 | 0.00 | 2 | 1 | 0 | 2011-01-18 10:01:00 | 2011-01-18 10:17:00 | 0.00 | 0.00 | 3701 | 3700 | 20111 |
---|
1 | 12347.0 | 4310.00 | 7 | 7 | 2458 | 2010-12-07 14:57:00 | 2011-12-07 15:52:00 | 351.14 | 615.71 | 3742 | 3377 | 20104 |
---|
2 | 12348.0 | 1797.24 | 4 | 4 | 2341 | 2010-12-16 19:09:00 | 2011-09-25 13:13:00 | 585.25 | 449.31 | 3733 | 3450 | 20104 |
---|
3 | 12349.0 | 1757.55 | 1 | 1 | 631 | 2011-11-21 09:51:00 | 2011-11-21 09:51:00 | 631.00 | 1757.55 | 3394 | 3394 | 20114 |
---|
4 | 12350.0 | 334.40 | 1 | 1 | 197 | 2011-02-02 16:01:00 | 2011-02-02 16:01:00 | 197.00 | 334.40 | 3685 | 3685 | 20111 |
---|
2. Create a customer cohort analysis dataset
from ecommercetools import customers
cohorts_df = customers.get_cohorts(transaction_items, period='M')
cohorts_df.head()
| customer_id | order_id | order_date | acquisition_cohort | order_cohort |
---|
0 | 17850.0 | 536365 | 2010-12-01 08:26:00 | 2010-12 | 2010-12 |
---|
7 | 17850.0 | 536366 | 2010-12-01 08:28:00 | 2010-12 | 2010-12 |
---|
9 | 13047.0 | 536367 | 2010-12-01 08:34:00 | 2010-12 | 2010-12 |
---|
21 | 13047.0 | 536368 | 2010-12-01 08:34:00 | 2010-12 | 2010-12 |
---|
25 | 13047.0 | 536369 | 2010-12-01 08:35:00 | 2010-12 | 2010-12 |
---|
3. Create a customer cohort analysis matrix
from ecommercetools import customers
cohort_matrix_df = customers.get_cohort_matrix(transaction_items, period='M', percentage=True)
cohort_matrix_df.head()
periods | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
---|
acquisition_cohort | | | | | | | | | | | | | |
---|
2010-12 | 1.0 | 0.381857 | 0.334388 | 0.387131 | 0.359705 | 0.396624 | 0.379747 | 0.354430 | 0.354430 | 0.394515 | 0.373418 | 0.500000 | 0.274262 |
---|
2011-01 | 1.0 | 0.239905 | 0.282660 | 0.242280 | 0.327791 | 0.299287 | 0.261283 | 0.256532 | 0.311164 | 0.346793 | 0.368171 | 0.149644 | NaN |
---|
2011-02 | 1.0 | 0.247368 | 0.192105 | 0.278947 | 0.268421 | 0.247368 | 0.255263 | 0.281579 | 0.257895 | 0.313158 | 0.092105 | NaN | NaN |
---|
2011-03 | 1.0 | 0.190909 | 0.254545 | 0.218182 | 0.231818 | 0.177273 | 0.263636 | 0.238636 | 0.288636 | 0.088636 | NaN | NaN | NaN |
---|
2011-04 | 1.0 | 0.227425 | 0.220736 | 0.210702 | 0.207358 | 0.237458 | 0.230769 | 0.260870 | 0.083612 | NaN | NaN | NaN | NaN |
---|
from ecommercetools import customers
cohort_matrix_df = customers.get_cohort_matrix(transaction_items, period='M', percentage=False)
cohort_matrix_df.head()
periods | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
---|
acquisition_cohort | | | | | | | | | | | | | |
---|
2010-12 | 948.0 | 362.0 | 317.0 | 367.0 | 341.0 | 376.0 | 360.0 | 336.0 | 336.0 | 374.0 | 354.0 | 474.0 | 260.0 |
---|
2011-01 | 421.0 | 101.0 | 119.0 | 102.0 | 138.0 | 126.0 | 110.0 | 108.0 | 131.0 | 146.0 | 155.0 | 63.0 | NaN |
---|
2011-02 | 380.0 | 94.0 | 73.0 | 106.0 | 102.0 | 94.0 | 97.0 | 107.0 | 98.0 | 119.0 | 35.0 | NaN | NaN |
---|
2011-03 | 440.0 | 84.0 | 112.0 | 96.0 | 102.0 | 78.0 | 116.0 | 105.0 | 127.0 | 39.0 | NaN | NaN | NaN |
---|
2011-04 | 299.0 | 68.0 | 66.0 | 63.0 | 62.0 | 71.0 | 69.0 | 78.0 | 25.0 | NaN | NaN | NaN | NaN |
---|
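The two matrices are related in a simple way: each percentage is the cohort's customer count in a period divided by its count in period 0. The sketch below checks this using the first four periods of two rows from the output above; to_retention() is a hypothetical helper, not a library function.

```python
# Customer counts for periods 0 to 3, copied from the output above.
counts = {
    '2010-12': [948, 362, 317, 367],
    '2011-04': [299, 68, 66, 63],
}


def to_retention(row):
    """Divide each period's count by the period 0 count; keep gaps as None."""
    base = row[0]
    return [None if value is None else round(value / base, 6) for value in row]


retention = {cohort: to_retention(row) for cohort, row in counts.items()}
for cohort, row in retention.items():
    print(cohort, row)
```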
4. Create a customer "retention" dataset
from ecommercetools import customers
retention_df = customers.get_retention(transactions_df)
retention_df.head()
| acquisition_cohort | order_cohort | customers | periods |
---|
0 | 2010-12 | 2010-12 | 948 | 0 |
---|
1 | 2010-12 | 2011-01 | 362 | 1 |
---|
2 | 2010-12 | 2011-02 | 317 | 2 |
---|
3 | 2010-12 | 2011-03 | 367 | 3 |
---|
4 | 2010-12 | 2011-04 | 341 | 4 |
---|
5. Create an RFM (H) dataset
This is an extension of the regular Recency, Frequency, Monetary value (RFM) model that includes an additional parameter "H" for heterogeneity. This shows the number of unique SKUs purchased by each customer. While typically unassociated with targeting, this value can be very useful in identifying which customers should probably be buying a broader mix of products than they currently are, as well as spotting those who may have stopped buying certain items.
from ecommercetools import customers
rfm_df = customers.get_rfm_segments(customers_df)
rfm_df.head()
| customer_id | acquisition_date | recency_date | recency | frequency | monetary | heterogeneity | tenure | r | f | m | h | rfm | rfm_score | rfm_segment_name |
---|
0 | 12346.0 | 2011-01-18 10:01:00 | 2011-01-18 10:17:00 | 3700 | 2 | 0.00 | 1 | 3701 | 1 | 1 | 1 | 1 | 111 | 3 | Risky |
---|
1 | 12350.0 | 2011-02-02 16:01:00 | 2011-02-02 16:01:00 | 3685 | 1 | 334.40 | 1 | 3685 | 1 | 1 | 1 | 1 | 111 | 3 | Risky |
---|
2 | 12365.0 | 2011-02-21 13:51:00 | 2011-02-21 14:04:00 | 3666 | 3 | 320.69 | 2 | 3666 | 1 | 1 | 1 | 1 | 111 | 3 | Risky |
---|
3 | 12373.0 | 2011-02-01 13:10:00 | 2011-02-01 13:10:00 | 3686 | 1 | 364.60 | 1 | 3686 | 1 | 1 | 1 | 1 | 111 | 3 | Risky |
---|
4 | 12377.0 | 2010-12-20 09:37:00 | 2011-01-28 15:45:00 | 3690 | 2 | 1628.12 | 2 | 3730 | 1 | 1 | 1 | 1 | 111 | 3 | Risky |
---|
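To make the four raw inputs concrete, here is a minimal sketch that derives recency, frequency, monetary value, and heterogeneity from toy transaction items. The data and the as_of date are made up, and the sketch stops before the scoring and segment naming that get_rfm_segments() performs, since those rules are not described here.

```python
from datetime import datetime

# Toy transaction items: (customer_id, order_id, order_date, sku, line_price).
items = [
    ('C1', 'O1', '2011-01-10', 'SKU-A', 20.0),
    ('C1', 'O1', '2011-01-10', 'SKU-B', 5.0),
    ('C1', 'O2', '2011-03-01', 'SKU-A', 20.0),
    ('C2', 'O3', '2011-02-15', 'SKU-C', 12.5),
]
as_of = datetime(2011, 3, 31)  # Assumed date from which recency is measured.

summary = {}
for customer_id, order_id, order_date, sku, line_price in items:
    entry = summary.setdefault(customer_id, {
        'orders': set(), 'skus': set(), 'monetary': 0.0, 'last_order': None,
    })
    placed = datetime.strptime(order_date, '%Y-%m-%d')
    entry['orders'].add(order_id)
    entry['skus'].add(sku)
    entry['monetary'] += line_price
    if entry['last_order'] is None or placed > entry['last_order']:
        entry['last_order'] = placed

rfmh = {
    customer_id: {
        'recency': (as_of - entry['last_order']).days,  # days since last order
        'frequency': len(entry['orders']),              # number of orders
        'monetary': entry['monetary'],                  # total revenue
        'heterogeneity': len(entry['skus']),            # unique SKUs bought
    }
    for customer_id, entry in summary.items()
}
print(rfmh)
```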
6. Create a purchase latency dataset
from ecommercetools import customers
latency_df = customers.get_latency(transactions_df)
latency_df.head()
| customer_id | frequency | recency_date | recency | avg_latency | min_latency | max_latency | std_latency | cv | days_to_next_order | label |
---|
0 | 12680.0 | 4 | 2011-12-09 12:50:00 | 3388 | 28 | 16 | 73 | 30.859898 | 1.102139 | -3329.0 | Order overdue |
---|
1 | 13113.0 | 24 | 2011-12-09 12:49:00 | 3388 | 15 | 0 | 52 | 12.060126 | 0.804008 | -3361.0 | Order overdue |
---|
2 | 15804.0 | 13 | 2011-12-09 12:31:00 | 3388 | 15 | 1 | 39 | 11.008261 | 0.733884 | -3362.0 | Order overdue |
---|
3 | 13777.0 | 33 | 2011-12-09 12:25:00 | 3388 | 11 | 0 | 48 | 12.055274 | 1.095934 | -3365.0 | Order overdue |
---|
4 | 17581.0 | 25 | 2011-12-09 12:21:00 | 3388 | 14 | 0 | 67 | 21.974293 | 1.569592 | -3352.0 | Order overdue |
---|
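Purchase latency is the gap in days between a customer's consecutive orders. In the output above, cv matches std_latency divided by avg_latency (for example, 30.859898 / 28 is roughly 1.102139). The sketch below computes these statistics for one customer with made-up order dates; it assumes the sample standard deviation, which may differ from the library's choice.

```python
from datetime import date
from statistics import mean, stdev

# Toy order dates for a single customer, in chronological order.
order_dates = [date(2011, 1, 1), date(2011, 1, 11), date(2011, 1, 31), date(2011, 3, 2)]

# Days between each pair of consecutive orders.
gaps = [(later - earlier).days for earlier, later in zip(order_dates, order_dates[1:])]

avg_latency = mean(gaps)
min_latency = min(gaps)
max_latency = max(gaps)
std_latency = stdev(gaps)        # sample standard deviation (an assumption)
cv = std_latency / avg_latency   # coefficient of variation

print(gaps, avg_latency, min_latency, max_latency, std_latency, cv)
```

A low cv suggests a customer orders at regular intervals, which makes the average latency a more useful guide to when the next order is due.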
7. Customer ABC segmentation
from ecommercetools import customers
abc_df = customers.get_abc_segments(customers_df, months=12, abc_class_name='abc_class_12m', abc_rank_name='abc_rank_12m')
abc_df.head()
| customer_id | abc_class_12m | abc_rank_12m |
---|
0 | 12346.0 | D | 1.0 |
---|
1 | 12347.0 | D | 1.0 |
---|
2 | 12348.0 | D | 1.0 |
---|
3 | 12349.0 | D | 1.0 |
---|
4 | 12350.0 | D | 1.0 |
---|
8. Predict customer AOV, CLV, and orders
EcommerceTools allows you to predict the AOV, Customer Lifetime Value (CLV) and expected number of orders via the Gamma-Gamma and BG/NBD models from the excellent Lifetimes package. By passing the dataframe of transactions from get_transactions()
to the get_customer_predictions()
function, EcommerceTools will fit the BG/NBD and Gamma-Gamma models and predict the AOV, order quantity, and CLV for each customer in the defined number of future days after the end of the observation period.
from ecommercetools import customers
customer_predictions = customers.get_customer_predictions(transactions_df,
observation_period_end='2011-12-09',
days=90)
customer_predictions.head(10)
| customer_id | predicted_purchases | aov | clv |
---|
0 | 12346.0 | 0.188830 | NaN | NaN |
---|
1 | 12347.0 | 1.408736 | 569.978836 | 836.846896 |
---|
2 | 12348.0 | 0.805907 | 333.784235 | 308.247354 |
---|
3 | 12349.0 | 0.855607 | NaN | NaN |
---|
4 | 12350.0 | 0.196304 | NaN | NaN |
---|
5 | 12352.0 | 1.682277 | 376.175359 | 647.826169 |
---|
6 | 12353.0 | 0.272541 | NaN | NaN |
---|
7 | 12354.0 | 0.247183 | NaN | NaN |
---|
8 | 12355.0 | 0.262909 | NaN | NaN |
---|
9 | 12356.0 | 0.645368 | 324.039419 | 256.855226 |
---|
Advertising
1. Create paid search keywords
from ecommercetools import advertising
product_names = ['fly rods', 'fly reels']
keywords_prepend = ['buy', 'best', 'cheap', 'reduced']
keywords_append = ['for sale', 'price', 'promotion', 'promo', 'coupon', 'voucher', 'shop', 'suppliers']
campaign_name = 'fly_fishing'
keywords = advertising.generate_ad_keywords(product_names, keywords_prepend, keywords_append, campaign_name)
keywords.head()
| product | keywords | match_type | campaign_name |
---|
0 | fly rods | [fly rods] | Exact | fly_fishing |
---|
1 | fly rods | [buy fly rods] | Exact | fly_fishing |
---|
2 | fly rods | [best fly rods] | Exact | fly_fishing |
---|
3 | fly rods | [cheap fly rods] | Exact | fly_fishing |
---|
4 | fly rods | [reduced fly rods] | Exact | fly_fishing |
---|
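The first rows of the output above suggest a simple pattern: the bare product name, then each prepend word in front of it, wrapped in square brackets for exact match. A minimal sketch of that idea follows. build_exact_keywords() is a hypothetical helper, and both the handling of the append words and the ordering of rows are assumptions, since the output shown does not reach them.

```python
def build_exact_keywords(product_names, prepends, appends, campaign_name):
    """Combine product names with prepend/append words as exact-match keywords."""
    rows = []
    for product in product_names:
        phrases = [product]
        phrases += [f'{word} {product}' for word in prepends]
        phrases += [f'{product} {word}' for word in appends]  # assumed ordering
        for phrase in phrases:
            rows.append({
                'product': product,
                'keywords': f'[{phrase}]',  # square brackets denote exact match
                'match_type': 'Exact',
                'campaign_name': campaign_name,
            })
    return rows


rows = build_exact_keywords(
    ['fly rods', 'fly reels'], ['buy', 'best'], ['for sale', 'price'], 'fly_fishing'
)
for row in rows[:5]:
    print(row)
```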
2. Create paid search ad copy using Spintax
from ecommercetools import advertising
text = "Fly Reels from {Orvis|Loop|Sage|Airflo|Nautilus} for {trout|salmon|grayling|pike}"
spin = advertising.generate_spintax(text, single=False)
spin
['Fly Reels from Orvis for trout',
'Fly Reels from Orvis for salmon',
'Fly Reels from Orvis for grayling',
'Fly Reels from Orvis for pike',
'Fly Reels from Loop for trout',
'Fly Reels from Loop for salmon',
'Fly Reels from Loop for grayling',
'Fly Reels from Loop for pike',
'Fly Reels from Sage for trout',
'Fly Reels from Sage for salmon',
'Fly Reels from Sage for grayling',
'Fly Reels from Sage for pike',
'Fly Reels from Airflo for trout',
'Fly Reels from Airflo for salmon',
'Fly Reels from Airflo for grayling',
'Fly Reels from Airflo for pike',
'Fly Reels from Nautilus for trout',
'Fly Reels from Nautilus for salmon',
'Fly Reels from Nautilus for grayling',
'Fly Reels from Nautilus for pike']
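For readers unfamiliar with Spintax, each {a|b|c} group lists alternatives, and expanding the text means taking every combination across the groups. The standalone sketch below shows one way to do this with the standard library. expand_spintax() is a hypothetical function, not the library's implementation, and it does not handle nested groups.

```python
import itertools
import re


def expand_spintax(text):
    """Return every combination of the {a|b|c} groups in text (no nesting)."""
    # With one capture group, re.split alternates: literal, group, literal, ...
    parts = re.split(r'\{([^{}]*)\}', text)
    options = [
        part.split('|') if index % 2 else [part]
        for index, part in enumerate(parts)
    ]
    return [''.join(combo) for combo in itertools.product(*options)]


variants = expand_spintax(
    "Fly Reels from {Orvis|Loop|Sage|Airflo|Nautilus} for {trout|salmon|grayling|pike}"
)
print(len(variants))
print(variants[0])
```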
Operations
1. Create an ABC inventory classification
from ecommercetools import operations
inventory_classification = operations.get_inventory_classification(transaction_items)
inventory_classification.head()
| sku | abc_class | abc_rank |
---|
0 | 10002 | A | 1 |
---|
1 | 10080 | A | 2 |
---|
2 | 10120 | A | 3 |
---|
3 | 10123C | A | 4 |
---|
4 | 10123G | A | 4 |
---|
Marketing
1. Get ecommerce trading calendar
from ecommercetools import marketing
trading_calendar_df = marketing.get_trading_calendar('2021-01-01', days=365)
trading_calendar_df.head()
| date | event |
---|
0 | 2021-01-01 | January sale |
---|
1 | 2021-01-02 | |
---|
2 | 2021-01-03 | |
---|
3 | 2021-01-04 | |
---|
4 | 2021-01-05 | |
---|
2. Get ecommerce trading events
from ecommercetools import marketing
trading_events_df = marketing.get_trading_events('2021-01-01', days=365)
trading_events_df.head()
| date | event |
---|
0 | 2021-01-01 | January sale |
---|
1 | 2021-01-29 | January Pay Day |
---|
2 | 2021-02-11 | Valentine's Day [last order date] |
---|
3 | 2021-02-14 | Valentine's Day |
---|
4 | 2021-02-26 | February Pay Day |
---|
NLP
1. Generate text summaries
The get_summaries()
function of the nlp
module takes a Pandas dataframe containing text and returns a machine-generated summary of the content using a Hugging Face Transformers pipeline via PyTorch. To use this feature, first load your Pandas dataframe and import the nlp
module from ecommercetools
.
import pandas as pd
from ecommercetools import nlp
pd.set_option('max_colwidth', 1000)
df = pd.read_csv('text.csv')
df.head()
Specify the name of the Pandas dataframe, the column containing the text you wish to summarise (e.g. product_description
), and specify a column name in which to store the machine-generated summary. The min_length
and max_length
arguments control the length of the generated summary (measured in tokens rather than words), while the do_sample
argument controls how the text is decoded: deterministically (do_sample=False
) or by sampling, which can give a different summary on each run (do_sample=True
).
df = nlp.get_summaries(df, 'product_description', 'sampled_summary', min_length=50, max_length=100, do_sample=True)
df = nlp.get_summaries(df, 'product_description', 'unsampled_summary', min_length=50, max_length=100, do_sample=False)
df = nlp.get_summaries(df, 'product_description', 'unsampled_summary_20_to_30', min_length=20, max_length=30, do_sample=False)
Since the model used for text summarisation is very large (1.2 GB plus), this function will take some time to complete. Once loaded, summaries are generated within a second or two per piece of text, so it is advisable to try smaller volumes of data initially.
SEO
1. Discover XML sitemap locations
The get_sitemaps()
function takes the location of a robots.txt
file (always stored at the root of a domain), and returns the URLs of any XML sitemaps listed within.
from ecommercetools import seo
sitemaps = seo.get_sitemaps("http://www.flyandlure.org/robots.txt")
print(sitemaps)
2. Get an XML sitemap
The get_dataframe()
function allows you to download the URLs in an XML sitemap to a Pandas dataframe. If the sitemap contains child sitemaps, each of these will be retrieved. You can save the Pandas dataframe to CSV in the usual way.
from ecommercetools import seo
df = seo.get_sitemap("http://flyandlure.org/sitemap.xml")
print(df.head())
| loc | changefreq | priority | domain | sitemap_name |
---|
0 | http://flyandlure.org/ | hourly | 1.0 | flyandlure.org | http://www.flyandlure.org/sitemap.xml |
---|
1 | http://flyandlure.org/about | monthly | 1.0 | flyandlure.org | http://www.flyandlure.org/sitemap.xml |
---|
2 | http://flyandlure.org/terms | monthly | 1.0 | flyandlure.org | http://www.flyandlure.org/sitemap.xml |
---|
3 | http://flyandlure.org/privacy | monthly | 1.0 | flyandlure.org | http://www.flyandlure.org/sitemap.xml |
---|
4 | http://flyandlure.org/copyright | monthly | 1.0 | flyandlure.org | http://www.flyandlure.org/sitemap.xml |
---|
3. Get Core Web Vitals from PageSpeed Insights
The get_core_web_vitals()
function retrieves the Core Web Vitals metrics for a list of sites from the Google PageSpeed Insights API and returns results in a Pandas dataframe. The function requires a Google PageSpeed Insights API key.
from ecommercetools import seo
pagespeed_insights_key = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
urls = ['https://www.bbc.co.uk', 'https://www.bbc.co.uk/iplayer']
df = seo.get_core_web_vitals(pagespeed_insights_key, urls)
print(df.head())
4. Get Google Knowledge Graph data
The get_knowledge_graph()
function returns the Google Knowledge Graph data for a given search term. This requires the use of a Google Knowledge Graph API key. By default, the function returns output in a Pandas dataframe, but you can pass the output="json"
argument if you wish to receive the JSON data back.
from ecommercetools import seo
knowledge_graph_key = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
knowledge_graph = seo.get_knowledge_graph(knowledge_graph_key, "tesla", output="dataframe")
print(knowledge_graph)
5. Get Google Search Console API data
The query_google_search_console()
function runs a search query on the Google Search Console API and returns data in a Pandas dataframe. This function requires a JSON client secrets key with access to the Google Search Console API.
from ecommercetools import seo
key = "google-search-console.json"
site_url = "http://flyandlure.org"
payload = {
'startDate': "2019-01-01",
'endDate': "2019-12-31",
'dimensions': ["page", "device", "query"],
'rowLimit': 100,
'startRow': 0
}
df = seo.query_google_search_console(key, site_url, payload)
print(df.head())
| page | device | query | clicks | impressions | ctr | position |
---|
0 | http://flyandlure.org/articles/fly_fishing_gea... | MOBILE | simms freestone waders review | 56 | 217 | 25.81 | 3.12 |
---|
1 | http://flyandlure.org/ | MOBILE | fly and lure | 37 | 159 | 23.27 | 3.81 |
---|
2 | http://flyandlure.org/articles/fly_fishing_gea... | DESKTOP | orvis encounter waders review | 35 | 134 | 26.12 | 4.04 |
---|
3 | http://flyandlure.org/articles/fly_fishing_gea... | DESKTOP | simms freestone waders review | 35 | 200 | 17.50 | 3.50 |
---|
4 | http://flyandlure.org/ | DESKTOP | fly and lure | 32 | 170 | 18.82 | 3.09 |
---|
Fetching all results from Google Search Console
To fetch all results, set fetch_all
to True
. This will automatically paginate through your Google Search Console data and return all results. Be aware that if you do this you may hit Google's quota limit if you run a query over an extended period, or have a busy site with lots of page
or query
dimensions.
from ecommercetools import seo
key = "google-search-console.json"
site_url = "http://flyandlure.org"
payload = {
'startDate': "2019-01-01",
'endDate': "2019-12-31",
'dimensions': ["page", "device", "query"],
'rowLimit': 25000,
'startRow': 0
}
df = seo.query_google_search_console(key, site_url, payload, fetch_all=True)
print(df.head())
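Pagination of this kind generally works by requesting rowLimit rows at a time and advancing startRow until a page comes back short. The sketch below illustrates that loop against a stand-in function instead of the real API. fetch_all_rows() and fake_fetch_page() are hypothetical names, and the library's own loop may differ in detail.

```python
def fetch_all_rows(fetch_page, payload):
    """Request pages until one returns fewer rows than rowLimit."""
    rows = []
    request = dict(payload)  # copy so the caller's payload is not modified
    while True:
        page = fetch_page(request)
        rows.extend(page)
        if len(page) < request['rowLimit']:
            break
        request['startRow'] += request['rowLimit']
    return rows


# Stand-in for the API: seven fake rows served in slices.
FAKE_ROWS = [{'query': f'keyword {i}'} for i in range(7)]


def fake_fetch_page(request):
    start = request['startRow']
    return FAKE_ROWS[start:start + request['rowLimit']]


all_rows = fetch_all_rows(fake_fetch_page, {'rowLimit': 3, 'startRow': 0})
print(len(all_rows))
```

Each loop iteration corresponds to one API request, which is why long date ranges or many dimensions can consume quota quickly.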
Comparing two time periods in Google Search Console
payload_before = {
'startDate': "2021-07-21",
'endDate': "2021-08-10",
'dimensions': ["page","query"],
}
payload_after = {
'startDate': "2021-08-11",
'endDate': "2021-08-31",
'dimensions': ["page","query"],
}
df = seo.query_google_search_console_compare(key, site_url, payload_before, payload_after, fetch_all=False)
df.sort_values(by='clicks_change', ascending=False).head()
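Conceptually, a period comparison joins the two result sets on their dimensions and subtracts one period's metrics from the other's. The sketch below shows this for clicks only, using made-up figures and plain dictionaries. It assumes clicks_change is the later period minus the earlier one, and that a page and query pair missing from a period counts as zero; neither is confirmed by the text above.

```python
# Toy click counts keyed by (page, query) for two periods.
clicks_before = {('/a', 'fly rods'): 10, ('/b', 'fly reels'): 4}
clicks_after = {('/a', 'fly rods'): 15, ('/c', 'waders'): 3}

changes = []
for page, query in sorted(set(clicks_before) | set(clicks_after)):
    before = clicks_before.get((page, query), 0)  # missing pairs count as zero
    after = clicks_after.get((page, query), 0)
    changes.append({
        'page': page,
        'query': query,
        'clicks_before': before,
        'clicks_after': after,
        'clicks_change': after - before,  # assumed: later minus earlier
    })

# Largest gains first, mirroring the sort in the example above.
changes.sort(key=lambda row: row['clicks_change'], reverse=True)
for row in changes:
    print(row)
```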
6. Get the number of "indexed" pages
The get_indexed_pages()
function uses the "site:" prefix to search Google for the number of pages "indexed". The figure is very approximate, but it is usually a good guide to site "size" in the absence of other data.
from ecommercetools import seo
urls = ['https://www.bbc.co.uk', 'https://www.bbc.co.uk/iplayer', 'http://flyandlure.org']
df = seo.get_indexed_pages(urls)
print(df.head())
| url | indexed_pages |
---|
2 | http://flyandlure.org | 2090 |
---|
1 | https://www.bbc.co.uk/iplayer | 215000 |
---|
0 | https://www.bbc.co.uk | 12700000 |
---|
7. Get keyword suggestions from Google Autocomplete
The google_autocomplete()
function returns a set of keyword suggestions from Google Autocomplete. The include_expanded=True
argument allows you to expand the number of suggestions shown by appending prefixes and suffixes to the search terms.
from ecommercetools import seo
suggestions = seo.google_autocomplete("data science", include_expanded=False)
print(suggestions)
suggestions = seo.google_autocomplete("data science", include_expanded=True)
print(suggestions)
| term | relevance |
---|
0 | data science jobs | 650 |
---|
1 | data science jobs chester | 601 |
---|
2 | data science course | 600 |
---|
3 | data science masters | 554 |
---|
4 | data science salary | 553 |
---|
5 | data science internship | 552 |
---|
6 | data science jobs london | 551 |
---|
7 | data science graduate scheme | 550 |
---|
8. Retrieve robots.txt content
The get_robots()
function returns the contents of a robots.txt file in a Pandas dataframe so it can be parsed and analysed.
from ecommercetools import seo
robots = seo.get_robots("http://www.flyandlure.org/robots.txt")
print(robots)
| directive | parameter |
---|
0 | User-agent | * |
---|
1 | Disallow | /signin |
---|
2 | Disallow | /signup |
---|
3 | Disallow | /users |
---|
4 | Disallow | /contact |
---|
5 | Disallow | /activate |
---|
6 | Disallow | /*/page |
---|
7 | Disallow | /articles/search |
---|
8 | Disallow | /search.php |
---|
9 | Disallow | *q=* |
---|
10 | Disallow | *category_slug=* |
---|
11 | Disallow | *country_slug=* |
---|
12 | Disallow | *county_slug=* |
---|
13 | Disallow | *features=* |
---|
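A robots.txt file is a list of "directive: parameter" lines, which is why it maps neatly onto the two columns above. The sketch below parses a robots.txt string into that shape using only the standard library. parse_robots() is a hypothetical helper, and the sample text reuses the first few rules from the output above.

```python
def parse_robots(robots_txt):
    """Split each non-empty, non-comment line into directive and parameter."""
    rows = []
    for line in robots_txt.splitlines():
        line = line.strip()
        if not line or line.startswith('#'):
            continue
        # Split on the first colon only, so URLs in parameters stay intact.
        directive, separator, parameter = line.partition(':')
        if separator:
            rows.append({'directive': directive.strip(), 'parameter': parameter.strip()})
    return rows


sample = """
# Example rules
User-agent: *
Disallow: /signin
Disallow: /signup
Sitemap: https://example.com/sitemap.xml
"""
rules = parse_robots(sample)
for rule in rules:
    print(rule)
```

Filtering the result for the Sitemap directive gives the same kind of list that get_sitemaps() returns.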
9. Get Google SERPs
The get_serps()
function returns a Pandas dataframe containing the Google search engine results for a given search term. Note that this function is not suitable for large-scale scraping and currently includes no features to prevent it from being blocked.
from ecommercetools import seo
serps = seo.get_serps("data science blog")
print(serps)
| title | link | text |
---|
0 | 10 of the best data science blogs to follow - ... | https://www.tableau.com/learn/articles/data-sc... | 10 of the best data science blogs to follow. T... |
---|
1 | Best Data Science Blogs to Follow in 2020 | by... | https://towardsdatascience.com/best-data-scien... | 14 Jul 2020 — 1. Towards Data Science · Joined... |
---|
2 | Top 20 Data Science Blogs And Websites For Dat... | https://medium.com/@exastax/top-20-data-scienc... | Top 20 Data Science Blogs And Websites For Dat... |
---|
3 | Data Science Blog – Dataquest | https://www.dataquest.io/blog/ | Browse our data science blog to get helpful ti... |
---|
4 | 51 Awesome Data Science Blogs You Need To Chec... | https://365datascience.com/trending/51-data-sc... | Blog name: DataKind · datakind data science bl... |
---|
5 | Blogs on AI, Analytics, Data Science, Machine ... | https://www.kdnuggets.com/websites/blogs.html | Individual/small group blogs · Ai4 blog, featu... |
---|
6 | Data Science Blog – Applied Data Science | https://data-science-blog.com/ | ... an Bedeutung – DevOps for Data Science. De... |
---|
7 | Top 10 Data Science and AI Blogs in 2020 - Liv... | https://livecodestream.dev/post/top-data-scien... | Some of the best data science and AI blogs for... |
---|
8 | Data Science Blogs: 17 Must-Read Blogs for Dat... | https://www.thinkful.com/blog/data-science-blogs/ | Data scientists could be considered the magici... |
---|
9 | rushter/data-science-blogs: A curated list of ... | https://github.com/rushter/data-science-blogs | A curated list of data science blogs. Contribu... |
---|
You can set the Google domain and the host language with the domain and host_language parameters. The example below searches for "bmw" on the German Google domain and returns the results in German.
df = seo.get_serps("bmw", pages=1, domain="google.de", host_language="de")
10. Create an ABCD classification of Google Search Console data
The classify_pages()
function returns an ABCD classification of Google Search Console data. This calculates the cumulative sum of clicks and then categorises pages using the ABC algorithm (the first 80% are classed A, the next 10% are classed B, and the final 10% are classed C, with the zero click pages classed D).
from ecommercetools import seo
key = "client_secrets.json"
site_url = "example-domain.co.uk"
start_date = '2022-10-01'
end_date = '2022-10-31'
df_classes = seo.classify_pages(key, site_url, start_date, end_date, output='classes')
print(df_classes.head())
df_summary = seo.classify_pages(key, site_url, start_date, end_date, output='summary')
print(df_summary)
page clicks impressions ctr position clicks_cumsum clicks_running_pc pc_share class class_rank
0 https://practicaldatascience.co.uk/machine-lea... 3890 36577 10.64 12.64 3890 8.382898 8.382898 A 1
1 https://practicaldatascience.co.uk/data-scienc... 2414 16618 14.53 14.30 6304 13.585036 5.202138 A 2
2 https://practicaldatascience.co.uk/data-scienc... 2378 71496 3.33 16.39 8682 18.709594 5.124558 A 3
3 https://practicaldatascience.co.uk/data-scienc... 1942 14274 13.61 15.02 10624 22.894578 4.184984 A 4
4 https://practicaldatascience.co.uk/data-scienc... 1738 23979 7.25 11.80 12362 26.639945 3.745367 A 5
class pages impressions clicks avg_ctr avg_position share_of_clicks share_of_impressions
0 A 63 747643 36980 5.126349 22.706825 79.7 43.7
1 B 46 639329 4726 3.228043 31.897826 10.2 37.4
2 C 190 323385 4698 2.393632 38.259368 10.1 18.9
3 D 36 1327 0 0.000000 25.804722 0.0 0.1
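The ABCD rule described above can be sketched in a few lines: rank pages by clicks, keep a running percentage of total clicks, and assign classes at the 80% and 90% boundaries, with zero-click pages classed D. The example uses made-up click counts. classify_abcd() is a hypothetical helper, and treating the boundaries as inclusive is an assumption.

```python
def classify_abcd(clicks_by_page):
    """Assign A/B/C by cumulative share of clicks; zero-click pages get D."""
    total = sum(clicks_by_page.values())
    ranked = sorted(clicks_by_page.items(), key=lambda item: item[1], reverse=True)
    classes = {}
    running = 0
    for page, clicks in ranked:
        running += clicks
        running_pc = 100 * running / total if total else 0
        if clicks == 0:
            classes[page] = 'D'
        elif running_pc <= 80:   # first 80% of clicks
            classes[page] = 'A'
        elif running_pc <= 90:   # next 10% of clicks
            classes[page] = 'B'
        else:                    # final 10% of clicks
            classes[page] = 'C'
    return classes


page_classes = classify_abcd({'/a': 50, '/b': 30, '/c': 10, '/d': 6, '/e': 4, '/f': 0})
print(page_classes)
```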
Reports
The Reports module creates weekly, monthly, quarterly, or yearly reports for customers and orders and calculates a range of common ecommerce metrics to show business performance.
1. Customers report
The customers_report()
function takes a formatted dataframe of transaction items (see above) and a desired frequency (D for daily, W for weekly, M for monthly, Q for quarterly) and calculates aggregate metrics for each period.
The function returns the number of orders, the number of customers, the number of new customers, the number of returning customers, and the acquisition rate (or proportion of new customers). For monthly reporting, I would recommend a 13-month period so you can compare the last month with the same month the previous year.
from ecommercetools import reports
df_customers_report = reports.customers_report(transaction_items, frequency='M')
print(df_customers_report.head(13))
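To show how these metrics relate, the sketch below computes them by month for a handful of made-up orders. It treats a customer as new in the month of their first order and takes the acquisition rate to be new customers divided by all customers in the period. This is an interpretation of the description above, not the library's code.

```python
from collections import defaultdict

# Toy orders: (customer_id, order_date).
orders = [
    ('C1', '2021-01-05'),
    ('C2', '2021-01-20'),
    ('C1', '2021-02-03'),
    ('C3', '2021-02-10'),
    ('C3', '2021-02-25'),
]

first_month = {}  # month of each customer's first order
by_month = defaultdict(lambda: {'orders': 0, 'customers': set()})
for customer_id, order_date in sorted(orders, key=lambda order: order[1]):
    month = order_date[:7]
    first_month.setdefault(customer_id, month)
    by_month[month]['orders'] += 1
    by_month[month]['customers'].add(customer_id)

report = []
for month in sorted(by_month):
    active = by_month[month]['customers']
    new_customers = sum(1 for customer in active if first_month[customer] == month)
    report.append({
        'period': month,
        'orders': by_month[month]['orders'],
        'customers': len(active),
        'new_customers': new_customers,
        'returning_customers': len(active) - new_customers,
        'acquisition_rate': round(new_customers / len(active), 2),
    })

for row in report:
    print(row)
```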
2. Transactions report
The transactions_report()
function takes a formatted dataframe of transaction items (see above) and a desired frequency (D for daily, W for weekly, M for monthly, Q for quarterly) and calculates aggregate metrics for each period.
The metrics returned are: customers, orders, revenue, SKUs, units, average order value, average SKUs per order, average units per order, and average revenue per customer.
from ecommercetools import reports
df_orders_report = reports.transactions_report(transaction_items, frequency='M')
print(df_orders_report.head(13))
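The averages in this report are ratios of the period totals. A quick worked example with made-up totals for a single period follows. The exact definitions are assumed from the metric names; in particular, skus is taken here to be the sum of unique SKUs per order.

```python
# Made-up totals for one reporting period.
period = {'customers': 40, 'orders': 50, 'revenue': 2500.0, 'skus': 150, 'units': 400}

metrics = {
    'avg_order_value': period['revenue'] / period['orders'],
    'avg_skus_per_order': period['skus'] / period['orders'],
    'avg_units_per_order': period['units'] / period['orders'],
    'avg_revenue_per_customer': period['revenue'] / period['customers'],
}
print(metrics)
```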