Search Tutorial
1.1 Load packages
import sys
import pandas as pd
import ECAUGT
import time
import multiprocessing
import numpy as np
1.2 Connect to server
# set parameters
endpoint = "https://HCAd-Datasets.cn-beijing.ots.aliyuncs.com"
access_id = "LTAI5t7t216W9amUD1crMVos" #enter your id and keys
access_key = "ZJPlUbpLCij5qUPjbsU8GnQHm97IxJ"
instance_name = "HCAd-Datasets"
table_name = 'HCA_d'
# set up client
ECAUGT.Setup_Client(endpoint, access_id, access_key, instance_name, table_name)
Connected to the server, find the table.
HCA_d
TableName: HCA_d
PrimaryKey: [('cid', 'INTEGER')]
Reserved read throughput: 0
Reserved write throughput: 0
Last increase throughput time: 1605795297
Last decrease throughput time: None
table options's time to live: -1
table options's max version: 1
table options's max_time_deviation: 86400
0
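The comment in the cell above asks readers to enter their own ID and key. One common way to avoid hard-coding them is to read them from environment variables. The sketch below is only an illustration: the variable names `ECA_ACCESS_ID` and `ECA_ACCESS_KEY` and the helper `load_credentials` are hypothetical and are not part of ECAUGT.

```python
import os


def load_credentials(env=os.environ):
    """Return (access_id, access_key) read from environment variables.

    ECA_ACCESS_ID and ECA_ACCESS_KEY are hypothetical names chosen for
    this sketch; use whatever names suit your own setup.
    """
    access_id = env.get("ECA_ACCESS_ID")
    access_key = env.get("ECA_ACCESS_KEY")
    if not access_id or not access_key:
        raise RuntimeError("Set ECA_ACCESS_ID and ECA_ACCESS_KEY first")
    return access_id, access_key


# Demonstration with an explicit mapping instead of the real environment.
demo_env = {"ECA_ACCESS_ID": "my-id", "ECA_ACCESS_KEY": "my-key"}
access_id, access_key = load_credentials(demo_env)
```

The returned values can then be passed to `ECAUGT.Setup_Client` in place of the literal strings.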
1.3 Build index
We should check if the index has been built.
ECAUGT.build_index()
index already exist.
2. Search cells with metadata conditions
A condition is written as a structured string that combines one or more logical expressions.
Each logical expression should take one of the following forms:
field_name1 == value1, here '==' means equal
field_name2 <> value2, here '<>' means unequal
Three symbols are used for logical operation between expressions:
logical_expression1 && logical_expression2, here '&&' means AND operation
logical_expression1 || logical_expression2, here '||' means OR operation
! logical_expression1, here '!' means NOT operation
Brackets are allowed, and the logical operations follow the usual order of precedence. The metadata condition string also tolerates extra space characters.
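To make the grammar concrete, here is a small local evaluator for condition strings of this shape. It is a sketch for illustration only, not ECAUGT's implementation: the functions `tokenize` and `matches` are hypothetical, and the precedence used (NOT binds tightest, then AND, then OR) is an assumption about what "usual" means here.

```python
import re

# Operators and brackets of the condition grammar described above.
TOKEN_RE = re.compile(r"(&&|\|\||==|<>|!|\(|\))")


def tokenize(condition):
    """Split a condition string into operators and operands, dropping spaces."""
    parts = (p.strip() for p in TOKEN_RE.split(condition))
    return [p for p in parts if p]


def matches(condition, record):
    """Evaluate a condition string against one metadata record (a dict)."""
    tokens = tokenize(condition)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def parse_or():  # lowest precedence
        value = parse_and()
        while peek() == "||":
            take()
            rhs = parse_and()
            value = value or rhs
        return value

    def parse_and():
        value = parse_not()
        while peek() == "&&":
            take()
            rhs = parse_not()
            value = value and rhs
        return value

    def parse_not():  # highest precedence, also handles brackets
        if peek() == "!":
            take()
            return not parse_not()
        if peek() == "(":
            take()
            value = parse_or()
            take()  # consume the closing ")"
            return value
        field, op, expected = take(), take(), take()
        actual = str(record.get(field))
        return actual == expected if op == "==" else actual != expected

    return parse_or()


cell = {"organ": "Lung", "cell_type": "T cell"}
print(matches("organ == Lung && cell_type == T cell ", cell))
print(matches("!(organ == Lung) || cell_type == B cell", cell))
```

Because operands are taken as the stripped text between operators, a value such as `T cell` may contain inner spaces, and spacing around operators does not matter.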
# get primary keys
rows_to_get = ECAUGT.search_metadata("organ == Lung && cell_type == T cell ")
14894 cells found
We found 14894 cells here, and the variable rows_to_get is a list containing their primary keys.
3. Download data
We first download three columns of the queried cells and return them as a DataFrame. (The first column in the result holds the primary keys.)
For illustration, we only download the first 20 cells.
rows_to_get_2 = rows_to_get[0:20]
3.1 Download columns of interest
# download data in pandas DataFrame form
ECAUGT.get_columnsbycell_para(rows_to_get = rows_to_get_2, cols_to_get=['cl_name','hcad_name','cell_type'], col_filter=None, do_transfer = True, thread_num = multiprocessing.cpu_count()-1)
cid | cell_type | cl_name | uHAF_name
---|---|---|---
2000932 | T cell | T cell | Lung-Connective tissue-T cell-CD3D CD8A
2000962 | T cell | T cell | Lung-Connective tissue-T cell-CD3D IL32
2000971 | T cell | T cell | Lung-Connective tissue-T cell-CD3D IL32
2000978 | T cell | T cell | Lung-Connective tissue-T cell-CD3D CD8A
2000987 | T cell | T cell | Lung-Connective tissue-T cell-CD3D IL32
2000994 | T cell | T cell | Lung-Connective tissue-T cell-CD3D IL32
2001027 | T cell | T cell | Lung-Connective tissue-T cell-CD3D IL32
2001030 | T cell | T cell | Lung-Connective tissue-T cell-CD3D CD8A
2001031 | T cell | T cell | Lung-Connective tissue-T cell-CD3D IL32
2001032 | T cell | T cell | Lung-Connective tissue-T cell-CD3D CD8A
2001038 | T cell | T cell | Lung-Connective tissue-T cell-CD3D IL32
2001050 | T cell | T cell | Lung-Connective tissue-T cell-CD3D CD8A
2001059 | T cell | T cell | Lung-Connective tissue-T cell-CD3D CD8A
2001065 | T cell | T cell | Lung-Connective tissue-T cell-CD3D IL32
2001086 | T cell | T cell | Lung-Connective tissue-T cell-CD3D IL32
2001091 | T cell | T cell | Lung-Connective tissue-T cell-CD3D IL32
2001099 | T cell | T cell | Lung-Connective tissue-T cell-CD3D IL32
2001105 | T cell | T cell | Lung-Connective tissue-T cell-CD3D IL32
2001106 | T cell | T cell | Lung-Connective tissue-T cell-CD3D CD8A
2001112 | T cell | T cell | Lung-Connective tissue-T cell-CD3D CD8A
Next, we show what the result looks like when do_transfer is set to False, so the data are not converted to a DataFrame.
# download data in list form
ECAUGT.get_columnsbycell_para(rows_to_get = rows_to_get_2, cols_to_get=['cl_name','hcad_name','cell_type'], col_filter=None, do_transfer = False, thread_num = multiprocessing.cpu_count()-1)
[[('cid', 2000932),
('cell_type', 'T cell'),
('cl_name', 'T cell'),
('hcad_name', 'Lung-Connective tissue-T cell-CD3D CD8A')],
[('cid', 2000962),
('cell_type', 'T cell'),
('cl_name', 'T cell'),
('hcad_name', 'Lung-Connective tissue-T cell-CD3D IL32')],
[('cid', 2000971),
('cell_type', 'T cell'),
('cl_name', 'T cell'),
('hcad_name', 'Lung-Connective tissue-T cell-CD3D IL32')],
[('cid', 2000978),
('cell_type', 'T cell'),
('cl_name', 'T cell'),
('hcad_name', 'Lung-Connective tissue-T cell-CD3D CD8A')],
[('cid', 2000987),
('cell_type', 'T cell'),
('cl_name', 'T cell'),
('hcad_name', 'Lung-Connective tissue-T cell-CD3D IL32')],
[('cid', 2000994),
('cell_type', 'T cell'),
('cl_name', 'T cell'),
('hcad_name', 'Lung-Connective tissue-T cell-CD3D IL32')],
[('cid', 2001027),
('cell_type', 'T cell'),
('cl_name', 'T cell'),
('hcad_name', 'Lung-Connective tissue-T cell-CD3D IL32')],
[('cid', 2001030),
('cell_type', 'T cell'),
('cl_name', 'T cell'),
('hcad_name', 'Lung-Connective tissue-T cell-CD3D CD8A')],
[('cid', 2001031),
('cell_type', 'T cell'),
('cl_name', 'T cell'),
('hcad_name', 'Lung-Connective tissue-T cell-CD3D IL32')],
[('cid', 2001032),
('cell_type', 'T cell'),
('cl_name', 'T cell'),
('hcad_name', 'Lung-Connective tissue-T cell-CD3D CD8A')],
[('cid', 2001038),
('cell_type', 'T cell'),
('cl_name', 'T cell'),
('hcad_name', 'Lung-Connective tissue-T cell-CD3D IL32')],
[('cid', 2001050),
('cell_type', 'T cell'),
('cl_name', 'T cell'),
('hcad_name', 'Lung-Connective tissue-T cell-CD3D CD8A')],
[('cid', 2001059),
('cell_type', 'T cell'),
('cl_name', 'T cell'),
('hcad_name', 'Lung-Connective tissue-T cell-CD3D CD8A')],
[('cid', 2001065),
('cell_type', 'T cell'),
('cl_name', 'T cell'),
('hcad_name', 'Lung-Connective tissue-T cell-CD3D IL32')],
[('cid', 2001086),
('cell_type', 'T cell'),
('cl_name', 'T cell'),
('hcad_name', 'Lung-Connective tissue-T cell-CD3D IL32')],
[('cid', 2001091),
('cell_type', 'T cell'),
('cl_name', 'T cell'),
('hcad_name', 'Lung-Connective tissue-T cell-CD3D IL32')],
[('cid', 2001099),
('cell_type', 'T cell'),
('cl_name', 'T cell'),
('hcad_name', 'Lung-Connective tissue-T cell-CD3D IL32')],
[('cid', 2001105),
('cell_type', 'T cell'),
('cl_name', 'T cell'),
('hcad_name', 'Lung-Connective tissue-T cell-CD3D IL32')],
[('cid', 2001106),
('cell_type', 'T cell'),
('cl_name', 'T cell'),
('hcad_name', 'Lung-Connective tissue-T cell-CD3D CD8A')],
[('cid', 2001112),
('cell_type', 'T cell'),
('cl_name', 'T cell'),
('hcad_name', 'Lung-Connective tissue-T cell-CD3D CD8A')]]
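The raw form above is a list with one entry per cell, each entry being a list of (column name, value) pairs. If you want to work with it without pandas, a minimal sketch of reshaping it is shown below. The helper names `rows_to_records` and `index_by_cid` are hypothetical, and only the first two rows of the output are copied in as sample data.

```python
# First two rows of the raw output shown above.
raw_rows = [
    [("cid", 2000932),
     ("cell_type", "T cell"),
     ("cl_name", "T cell"),
     ("hcad_name", "Lung-Connective tissue-T cell-CD3D CD8A")],
    [("cid", 2000962),
     ("cell_type", "T cell"),
     ("cl_name", "T cell"),
     ("hcad_name", "Lung-Connective tissue-T cell-CD3D IL32")],
]


def rows_to_records(rows):
    """Turn each list of (name, value) pairs into a plain dict."""
    return [dict(row) for row in rows]


def index_by_cid(rows):
    """Build a {cid: {column: value}} lookup from the raw rows."""
    lookup = {}
    for row in rows:
        fields = dict(row)
        cid = fields.pop("cid")  # the primary key becomes the dict key
        lookup[cid] = fields
    return lookup


records = rows_to_records(raw_rows)
by_cid = index_by_cid(raw_rows)
print(by_cid[2000962]["hcad_name"])
```

A list of dicts like `records` can also be passed directly to `pandas.DataFrame` if a table is needed later.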
3.2 Download all columns
We also compare the time taken by the parallel and non-parallel download functions for the first 20 cells. In the run shown here, the parallel version took roughly one fifth of the time (about 5.4 s versus 29.6 s).
# the parallel version
start_time = time.time()
result = ECAUGT.get_columnsbycell_para(rows_to_get = rows_to_get_2, cols_to_get=None, col_filter=None, do_transfer = False, thread_num = multiprocessing.cpu_count()-1)
time.time()-start_time
5.3618810176849365
# the non-parallel version
start_time = time.time()
result = ECAUGT.get_columnsbycell(rows_to_get = rows_to_get_2, cols_to_get=None,col_filter=None,do_transfer = False)
time.time()-start_time
29.58238458633423
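To illustrate why issuing requests concurrently can shorten the wall-clock time, the sketch below times a serial loop against a thread pool. It is an assumption-laden stand-in: `fake_fetch` simulates a network request with `time.sleep` and is not an ECAUGT function, and ECAUGT's own parallel implementation may work differently. The last line computes the speedup from the two timings recorded above.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_fetch(cid):
    """Hypothetical stand-in for one network request (not an ECAUGT call)."""
    time.sleep(0.05)  # pretend network latency
    return (cid, "row data")


def timed(func, *args):
    """Run func(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


def fetch_serial(cids):
    return [fake_fetch(cid) for cid in cids]


def fetch_parallel(cids, workers=4):
    # pool.map keeps the results in the same order as the input keys.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fake_fetch, cids))


cids = list(range(8))
serial_rows, serial_time = timed(fetch_serial, cids)
parallel_rows, parallel_time = timed(fetch_parallel, cids)
print(f"serial: {serial_time:.2f}s, parallel: {parallel_time:.2f}s")

# Speedup in the tutorial's recorded run (timings copied from above).
tutorial_speedup = 29.58238458633423 / 5.3618810176849365
print(f"tutorial speedup: {tutorial_speedup:.1f}x")
```

Actual timings against the server depend on network conditions and the number of workers, so the ratio will vary between runs.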
4. Search cells with both metadata and gene conditions
Now we show how to add gene conditions when downloading cells. Here we download some genes of the queried cells and select the cells whose expression level of PTPRC is larger than 0.1 and whose expression level of CD3D is no less than 0.1.
# add col_filter on gene
gene_condition = ECAUGT.set_gene_condition("PTPRC > 0.1 && CD3D>=0.1")
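The condition string above combines a strict and a non-strict threshold. As a plain-Python illustration of which cells such a filter keeps, consider the sketch below. It does not call ECAUGT; the helper `passes` is hypothetical, and all rows except the first use made-up values chosen to exercise the boundaries.

```python
# Expression values per cell. The first row is copied from the tutorial
# output; the 9000xxx rows are hypothetical values for illustration.
cells = {
    2000962: {"CD3D": 2.598072, "PTPRC": 2.229140},
    9000001: {"CD3D": 0.0, "PTPRC": 1.5},   # fails CD3D >= 0.1
    9000002: {"CD3D": 0.5, "PTPRC": 0.1},   # fails PTPRC > 0.1 (strict)
    9000003: {"CD3D": 0.1, "PTPRC": 0.2},   # passes: CD3D equals the bound
}


def passes(expr):
    """Local equivalent of 'PTPRC > 0.1 && CD3D>=0.1' for one cell."""
    return expr["PTPRC"] > 0.1 and expr["CD3D"] >= 0.1


kept = [cid for cid, expr in cells.items() if passes(expr)]
print(kept)
```

Note the difference at the boundary: a PTPRC value of exactly 0.1 is rejected, while a CD3D value of exactly 0.1 is accepted.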
4.1 Download some of the columns
df_result = ECAUGT.get_columnsbycell_para(rows_to_get = rows_to_get, cols_to_get=['CD3D','PTPRC','donor_id','hcad_name'], col_filter=gene_condition, do_transfer = True, thread_num = multiprocessing.cpu_count()-1)
df_result
cid | CD3D | PTPRC | donor_id | uHAF_name
---|---|---|---|---
2000962 | 2.598072 | 2.229140 | 343B | Lung-Connective tissue-T cell-CD3D IL32
2000987 | 2.138744 | 1.790511 | 343B | Lung-Connective tissue-T cell-CD3D IL32
2000994 | 3.055748 | 2.269341 | 343B | Lung-Connective tissue-T cell-CD3D IL32
2001099 | 2.682663 | 1.864017 | 343B | Lung-Connective tissue-T cell-CD3D IL32
2001112 | 2.417518 | 1.482966 | 343B | Lung-Connective tissue-T cell-CD3D CD8A
... | ... | ... | ... | ...
2115395 | 2.729593 | 2.729593 | FetalLung1_12W | Lung-Connective tissue-T cell-CD3D CD1C
2115433 | 3.851911 | 3.179780 | FetalLung1_12W | Lung-Connective tissue-T cell-CD3D CD1C
2115441 | 3.591656 | 2.546684 | FetalLung1_12W | Lung-Connective tissue-T cell-CD3D CD1C
2115483 | 3.181991 | 3.181991 | FetalLung1_12W | Lung-Connective tissue-T cell-CD3D CD1C
2115498 | 3.124087 | 3.124087 | FetalLung1_12W | Lung-Connective tissue-T cell-CD3D CD1C
7403 rows × 4 columns
We found that 7403 of the 14894 queried cells have expression levels that satisfy PTPRC > 0.1 && CD3D>=0.1. We can download selected columns of these cells with the parameter cols_to_get; the genes involved in the condition must be included in cols_to_get.
4.2 Download all columns of these cells
We can get all expression levels and metadata of these cells by setting the parameter cols_to_get to None.
df_result = ECAUGT.get_columnsbycell_para(rows_to_get = rows_to_get, cols_to_get=None, col_filter=gene_condition, do_transfer = True, thread_num = multiprocessing.cpu_count()-1)