ECAUGT module
-
ECAUGT.Setup_Client(endpoint, access_key_public, access_key_secret, instance_name, table_name)
Sets up an OTSClient connected to the OTS server
This function builds an OTSClient object from the given parameters and connects to the corresponding server. On success, it prints the description of the selected table on the server. Run this function as the first step of any use of the package. The client only needs to be set up once.
- Parameters
endpoint (str) – The address of the OTS server on the public network
access_key_public (str) – The public access key of a user with access to the server
access_key_secret (str) – The secret key paired with the given public key
instance_name (str) – The name of the instance which the table belongs to
table_name (str) – The name of the selected table
- Returns
0 if the connection to the server succeeds, -1 if it fails.
- Return type
int
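Because Setup_Client reports its result as a status code rather than an exception, a caller may want to stop early when the connection fails. The sketch below is one possible wrapper, not part of ECAUGT: `connect_or_raise` and `fake_setup_client` are hypothetical names, and the stand-in function only imitates the documented 0 / -1 return values so that the example runs without a server.

```python
def connect_or_raise(setup_client, endpoint, access_key_public, access_key_secret,
                     instance_name, table_name):
    """Call a Setup_Client-style function and turn a non-zero status into an exception."""
    status = setup_client(endpoint, access_key_public, access_key_secret,
                          instance_name, table_name)
    if status != 0:
        raise ConnectionError(
            f"could not connect to table {table_name!r} (status {status})")
    return status


# Stand-in used only for this demonstration; it is NOT the ECAUGT implementation.
def fake_setup_client(endpoint, access_key_public, access_key_secret,
                      instance_name, table_name):
    return 0 if endpoint.startswith("https://") else -1


ok_status = connect_or_raise(fake_setup_client, "https://example.invalid",
                             "PUBLIC_KEY", "SECRET_KEY", "my_instance", "my_table")
```

In real use, `ECAUGT.Setup_Client` would be passed in place of `fake_setup_client`, together with the credentials for the server.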
-
ECAUGT.build_index()
Builds the index on the metadata columns for cell searching
This function builds an index on all metadata columns in the table on the OTS server. The index contains 17 fields, each representing a metadata column in the OTS table. The 17 fields are user_id, study_id, cell_id, organ, region, subregion, seq_tech, sample_status, donor_id, donor_gender, donor_age, original_name, cl_name, hcad_name, tissue_type, cell_type and marker_gene.
The index is necessary for any process that involves a search operation. It only needs to be built once per table and updates automatically when the table changes. Hence a warning is printed if the index already exists.
- Returns
Nothing is returned; a warning is printed if the index has already been built.
- Return type
None
-
ECAUGT.check_data(genenum_chk=True)
Checks whether the loaded data satisfies the standards for metadata and gene number
This function checks whether the loaded data satisfies the following standards:
The number of columns in the matrix equals 43896 (43878 genes and 18 metadata columns)
The names of the metadata columns are correct
Donors’ genders are given in the correct form (‘Female’, ‘Male’, ‘NA’)
The cids in the matrix are correct given the user_id and the current quota value
- Parameters
genenum_chk (bool) – Whether to check the number of columns in the dataset. The default value is True; set it to False when working with datasets from other species.
- Returns
Whether the dataset passes the check
- Return type
bool
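Two of the standards listed above, the column count and the donor gender values, can be illustrated with plain Python. This is an independent sketch and not the check_data implementation: `precheck` is a hypothetical helper, and it takes column names and gender values as lists rather than reading the DataFrame loaded by the package.

```python
EXPECTED_GENES = 43878       # number of gene columns stated above
EXPECTED_METADATA = 18       # number of metadata columns stated above
ALLOWED_GENDERS = {"Female", "Male", "NA"}


def precheck(column_names, donor_genders, genenum_chk=True):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    expected_total = EXPECTED_GENES + EXPECTED_METADATA  # 43896
    if genenum_chk and len(column_names) != expected_total:
        problems.append(
            f"expected {expected_total} columns, found {len(column_names)}")
    unexpected = sorted(set(donor_genders) - ALLOWED_GENDERS)
    if unexpected:
        problems.append(f"unexpected donor_gender values: {unexpected}")
    return problems


# Placeholder column names, used only to get the right count.
columns = [f"col{i}" for i in range(43896)]
print(precheck(columns, ["Female", "Male", "NA"]))  # []
```

Passing `genenum_chk=False` skips the column count, mirroring how the parameter is described for datasets from other species.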
-
ECAUGT.get_all_rows(cols_to_get=[])
Gets the values of the selected columns for all cells in the OTS table
This function downloads all values of the selected columns from the whole OTS table.
- Parameters
cols_to_get (list) –
cols_to_get is a list of strings containing the names of the selected columns which can be either metadata columns or genes.
The default value of this parameter is an empty list, which means the function will only download the primary keys.
- Returns
A list of tablestore::Row objects, each representing a cell
- Return type
list
-
ECAUGT.get_column_set(rows_to_get=None, col_to_get=None, col_filter=None, exlude_Unclassified=False)
Gets all unique values of a selected column in the given cells
This function calls the get_columnsbycell_para function to download the values of a selected column in the given cells and returns the set of all unique values.
- Parameters
rows_to_get (list) – rows_to_get is a list of primary keys of the cells to download. Each element in the list is a list containing a primary key tuple like [('cid', XXXXXXX)]. The default value of this parameter is None, which means all rows in the database are retrieved.
col_to_get (list) – col_to_get is a list of length 1 whose single element is the name of the selected column. If the length of the list is not 1, an error is raised.
col_filter (tablestore::CompositeColumnCondition) –
col_filter is a composite column condition from the tablestore package. It can be generated by the seq2filter function, which takes a structured string as input.
Once a filter is set, the function first filters the given cells and then downloads the selected columns of the cells that pass the filter. A message is printed if no cell passes the given filter. The default value of this parameter is None, which means no filtering is performed.
exlude_Unclassified (bool) – This parameter is only used when rows_to_get is None. It decides whether cells with an Unclassified cell_type are included in the result.
- Returns
A set of all unique values in the selected column
- Return type
set
-
ECAUGT.get_columnsbycell(rows_to_get=None, cols_to_get=None, col_filter=None, do_transfer=True, exlude_Unclassified=False)
Downloads the selected columns of the given cells from the OTS table on the server (without parallelism)
This function gets the cells in the given primary key list and downloads some or all of their columns. Further filtering on gene expression levels can be applied through a column filter.
The result can be returned either as a pandas::DataFrame or as an untransformed list.
- Parameters
rows_to_get (list) – rows_to_get is a list of primary keys of the cells to download. Each element in the list is a list containing a primary key tuple like [('cid', XXXXXXX)]. The default value of this parameter is None, which means all rows in the database are retrieved.
cols_to_get (list) –
cols_to_get is a list of strings containing the names of the selected columns, which can be either metadata columns or genes.
The default value of this parameter is None, which means the function downloads all columns.
col_filter (tablestore::CompositeColumnCondition) –
col_filter is a composite column condition from the tablestore package. It can be generated by the seq2filter function, which takes a structured string as input.
Once a filter is set, the function first filters the given cells and then downloads the selected columns of the cells that pass the filter. A message is printed if no cell passes the given filter. The default value of this parameter is None, which means no filtering is performed.
do_transfer (bool) –
If do_transfer is True, the output of this function is converted into a pandas::DataFrame; otherwise it is returned as a list.
The default value of this parameter is True.
exlude_Unclassified (bool) – This parameter is only used when rows_to_get is None. It decides whether cells with an Unclassified cell_type are included in the result.
- Returns
If do_transfer == True, the return will be a pandas::DataFrame where each row represents a cell.
If do_transfer == False, the return will be a list. Each element is a list which represents a cell.
- Return type
pandas::DataFrame/list
-
ECAUGT.get_columnsbycell_para(rows_to_get=None, cols_to_get=None, col_filter=None, do_transfer=True, thread_num=15, exlude_Unclassified=False)
Downloads the selected columns of the given cells from the OTS table on the server in parallel
This function gets the cells in the given primary key list and downloads some or all of their columns. Further filtering on gene expression levels can be applied through a column filter. The result can be returned either as a pandas::DataFrame or as an untransformed list.
- Parameters
rows_to_get (list) – rows_to_get is a list of primary keys of the cells to download. Each element in the list is a list containing a primary key tuple like [('cid', XXXXXXX)]. The default value of this parameter is None, which means all rows in the database are retrieved.
cols_to_get (list) –
cols_to_get is a list of strings containing the names of the selected columns, which can be either metadata columns or genes.
The default value of this parameter is None, which means the function downloads all columns.
col_filter (tablestore::CompositeColumnCondition) –
col_filter is a composite column condition from the tablestore package. It can be generated by the seq2filter function, which takes a structured string as input.
Once a filter is set, the function first filters the given cells and then downloads the selected columns of the cells that pass the filter. A message is printed if no cell passes the given filter. The default value of this parameter is None, which means no filtering is performed.
do_transfer (bool) –
If do_transfer is True, the output of this function is converted into a pandas::DataFrame; otherwise it is returned as a list.
The default value of this parameter is True.
thread_num (int) – thread_num is the number of threads used in the parallel download process. The default value of this parameter is multiprocessing.cpu_count()-1, which means the function uses as many threads as are available. The value 15 in the signature is presumably this expression as evaluated when the documentation was generated.
exlude_Unclassified (bool) – This parameter is only used when rows_to_get is None. It decides whether cells with an Unclassified cell_type are included in the result.
- Returns
If do_transfer == True, the return will be a pandas::DataFrame where each row represents a cell.
If do_transfer == False, the return will be a list. Each element is a list which represents a cell.
- Return type
pandas::DataFrame/list
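The idea of downloading a primary key list with several threads can be sketched as follows: split the list into chunks, fetch each chunk in its own thread, and concatenate the results in order. This is a generic illustration using the standard library, not the get_columnsbycell_para implementation. `split_into_chunks`, `fetch_parallel` and `fake_fetch_chunk` are hypothetical names, and the fake fetcher only echoes the cid values so that the example runs without a server.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(items, n_chunks):
    """Split items into at most n_chunks contiguous, nearly equal parts."""
    if not items:
        return []
    n_chunks = max(1, min(n_chunks, len(items)))
    size, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def fetch_parallel(fetch_chunk, rows_to_get, thread_num=4):
    """Fetch every chunk in a worker thread and concatenate the results in order."""
    chunks = split_into_chunks(rows_to_get, thread_num)
    with ThreadPoolExecutor(max_workers=thread_num) as pool:
        results = list(pool.map(fetch_chunk, chunks))  # map preserves chunk order
    return [row for chunk_result in results for row in chunk_result]


# Stand-in fetcher for the demonstration: returns the cid of each primary key.
def fake_fetch_chunk(chunk):
    return [primary_key[0][1] for primary_key in chunk]


rows = [[("cid", n)] for n in range(2000001, 2000011)]
fetched = fetch_parallel(fake_fetch_chunk, rows, thread_num=3)
```

Threads suit this kind of work because each request mostly waits on the network; whether the package itself uses threads or processes is not stated here.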
-
ECAUGT.insert_matrix_para(thread_num=15)
Uploads the loaded DataFrame to the server in parallel
This function uploads the DataFrame stored in the global variable df to the server. We recommend running the check_data function before using this function to ensure the DataFrame is standardized and ready for upload.
This function prints how many rows failed to upload and returns the result of each upload operation.
- Parameters
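thread_num (int) – thread_num is the number of threads used in the parallel upload process. The default value of this parameter is multiprocessing.cpu_count()-1, which means the function uses as many threads as are available. The value 15 in the signature is presumably this expression as evaluated when the documentation was generated.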
- Returns
If all upload operations in the parallel upload process succeed, this function returns 0.
If any operation fails, this function returns the list of all upload statuses. In the i-th element, 0 means success and -1 means failure of the i-th operation.
- Return type
int/list
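Since the return value is either the integer 0 or a list of per-operation statuses, calling code has to handle both shapes. A minimal sketch of one way to normalize them, where `failed_indices` is a hypothetical helper and not part of ECAUGT:

```python
def failed_indices(result):
    """Return the positions of failed operations for an int-or-list result.

    An integer 0 means every operation succeeded; a list holds one status
    per operation, where 0 is success and -1 is failure.
    """
    if isinstance(result, int):
        if result == 0:
            return []
        raise ValueError(f"unexpected integer status: {result}")
    return [i for i, status in enumerate(result) if status != 0]


print(failed_indices(0))               # []
print(failed_indices([0, -1, 0, -1]))  # [1, 3]
```

The returned indices identify which operations to retry; how an operation index maps back to rows of the DataFrame depends on how the package batches the upload, which is not specified here.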
-
ECAUGT.query_cells(metadata_conditions='', include_children=False, exlude_Unclassified=False, print_message=True)
Queries cells and returns the list of their primary keys
This function queries the cells that satisfy the metadata conditions in the OTS table on the server. The conditions on the metadata should be given as a structured string that combines several logical expressions.
Note that this function only supports queries on the metadata columns in the index; conditions on gene expression should be set in the get_columnsbycell or get_columnsbycell_para function.
- Parameters
metadata_conditions (str) –
It is a structured string that combines several logical expressions.
- Each logical expression should take one of the following forms:
field_name1 == value1, here ‘==’ means equal
field_name2 <> value2, here ‘<>’ means not equal
- Three symbols are used for logical operations between expressions:
logical_expression1 && logical_expression2, here ‘&&’ means the AND operation
logical_expression1 || logical_expression2, here ‘||’ means the OR operation
! logical_expression1, here ‘!’ means the NOT operation
Brackets are allowed, and the logical operators follow the usual precedence. The metadata condition string also tolerates extra space characters.
- Here are some examples of legal condition strings:
'(user_id == 2 && organ == Heart) || (user_id == 3 && organ <> Brain)'
'organ == Lung && !seq_tech == 10X'
'organ == Heart &&(cell_type == Fibrocyte || region <> atria)'
The default value of this parameter is '', which gets all rows in the database.
include_children (bool) – Whether subtypes are included when the query contains a condition on cell type. For example, when querying “cell_type == T cell” with include_children set to True, the result contains T cells and their subtypes, such as CD4 T cell and CD8 T cell; otherwise, only cells annotated as T cell appear in the result.
- Returns
A list of the primary keys of the queried cells. Each element is a list containing a tuple with the name of the primary key and its value: [('cid', 2021119)].
Here is an example return value with 2000 queried cells:
[[('cid', 2000001)], [('cid', 2000002)], [('cid', 2000003)], [('cid', 2000004)], …, [('cid', 2002000)]]
The primary key list can be used for downstream data downloading or table updating.
- Return type
list
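The nested primary key format described above is easy to get wrong when combining several query results. The following sketch converts between the nested form and plain cid integers; `cids_from_keys` and `keys_from_cids` are hypothetical helpers, not ECAUGT functions, and the two key lists are made-up stand-ins for query results.

```python
def cids_from_keys(primary_keys):
    """Extract the cid value from each [('cid', value)] entry."""
    return [dict(primary_key)["cid"] for primary_key in primary_keys]


def keys_from_cids(cids):
    """Wrap plain cid values in the [('cid', value)] form used for primary keys."""
    return [[("cid", cid)] for cid in cids]


# Two made-up query results in the documented format.
heart_keys = keys_from_cids([2000001, 2000002, 2000003])
t_cell_keys = keys_from_cids([2000002, 2000003, 2000004])

# Cells present in both results, converted back to the primary key format.
shared = sorted(set(cids_from_keys(heart_keys)) & set(cids_from_keys(t_cell_keys)))
shared_keys = keys_from_cids(shared)
print(shared_keys)  # [[('cid', 2000002)], [('cid', 2000003)]]
```

A list in this form is what the rows_to_get parameters of the download functions are documented to accept.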
-
ECAUGT.search_metadata(metadata_conditions='', include_children=False, exlude_Unclassified=False, print_message=True)
Queries cells and returns the list of their primary keys
This function queries the cells that satisfy the metadata conditions in the OTS table on the server. The conditions on the metadata should be given as a structured string that combines several logical expressions.
Note that this function only supports queries on the metadata columns in the index; conditions on gene expression should be set in the get_columnsbycell or get_columnsbycell_para function.
- Parameters
metadata_conditions (str) –
It is a structured string that combines several logical expressions.
- Each logical expression should take one of the following forms:
field_name1 == value1, here ‘==’ means equal
field_name2 <> value2, here ‘<>’ means not equal
- Three symbols are used for logical operations between expressions:
logical_expression1 && logical_expression2, here ‘&&’ means the AND operation
logical_expression1 || logical_expression2, here ‘||’ means the OR operation
! logical_expression1, here ‘!’ means the NOT operation
Brackets are allowed, and the logical operators follow the usual precedence. The metadata condition string also tolerates extra space characters.
- Here are some examples of legal condition strings:
'(user_id == 2 && organ == Heart) || (user_id == 3 && organ <> Brain)'
'organ == Lung && !seq_tech == 10X'
'organ == Heart &&(cell_type == Fibrocyte || region <> atria)'
The default value of this parameter is '', which gets all rows in the database.
include_children (bool) – Whether subtypes are included when the query contains a condition on cell type. For example, when querying “cell_type == T cell” with include_children set to True, the result contains T cells and their subtypes, such as CD4 T cell and CD8 T cell; otherwise, only cells annotated as T cell appear in the result.
- Returns
A list of the primary keys of the queried cells. Each element is a list containing a tuple with the name of the primary key and its value: [('cid', 2021119)].
Here is an example return value with 2000 queried cells:
[[('cid', 2000001)], [('cid', 2000002)], [('cid', 2000003)], [('cid', 2000004)], …, [('cid', 2002000)]]
The primary key list can be used for downstream data downloading or table updating.
- Return type
list
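Condition strings of the form described above can be assembled programmatically instead of by hand. The helpers below (`eq`, `ne`, `all_of`, `any_of`) are hypothetical and only concatenate text according to the documented operators. They add brackets around every group, which assumes that redundant outer brackets are accepted by the parser.

```python
def eq(field, value):
    """field == value"""
    return f"{field} == {value}"


def ne(field, value):
    """field <> value"""
    return f"{field} <> {value}"


def all_of(*expressions):
    """Join expressions with && inside one pair of brackets."""
    return "(" + " && ".join(expressions) + ")"


def any_of(*expressions):
    """Join expressions with || inside one pair of brackets."""
    return "(" + " || ".join(expressions) + ")"


condition = any_of(
    all_of(eq("user_id", 2), eq("organ", "Heart")),
    all_of(eq("user_id", 3), ne("organ", "Brain")),
)
print(condition)
# ((user_id == 2 && organ == Heart) || (user_id == 3 && organ <> Brain))
```

Apart from the outer brackets, the printed string matches the first example condition listed above.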
-
ECAUGT.seq2filter(gene_condition)
Transforms a gene-condition string into an OTS column condition
This function parses the logical expression string of the gene condition used for cell searching and generates a composite column condition for the OTS database.
- Parameters
gene_condition (str) –
It is a structured string that combines several logical expressions.
- Each logical expression should take one of the following forms:
gene_name1 == value1, here ‘==’ means equal
gene_name2 <> value2, here ‘<>’ means not equal
gene_name3 > value3, here ‘>’ means larger than
gene_name4 < value4, here ‘<’ means smaller than
gene_name5 >= value5, here ‘>=’ means not smaller than
gene_name6 <= value6, here ‘<=’ means not larger than
- Three symbols are used for logical operations between expressions:
logical_expression1 && logical_expression2, here ‘&&’ means the AND operation
logical_expression1 || logical_expression2, here ‘||’ means the OR operation
! logical_expression1, here ‘!’ means the NOT operation
Brackets are allowed, and the logical operators follow the usual precedence. The gene condition string also tolerates extra space characters.
- Here is an example of a legal condition string:
'(CD3D > 2 && CD3E >= 0.1) || (PTPRC <= 3 && CD8A >= 0.01)'
- Returns
The column condition used by tablestore to search cells
- Return type
tablestore.CompositeColumnCondition
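To make the grammar above concrete, the sketch below evaluates a gene-condition string against a single cell held in a dictionary. It is an independent illustration of the documented syntax, not how seq2filter builds a tablestore condition. `tokenize` and `evaluate` are hypothetical names, and the precedence used (`!` binds tighter than `&&`, which binds tighter than `||`) is an assumption about what the usual precedence means.

```python
import operator
import re

TOKEN = re.compile(r"\s*(&&|\|\||<>|>=|<=|==|!|\(|\)|>|<|[^\s&|!()<>=]+)")
COMPARE = {"==": operator.eq, "<>": operator.ne, ">": operator.gt,
           "<": operator.lt, ">=": operator.ge, "<=": operator.le}


def tokenize(text):
    """Split a condition string into operators, brackets, names and numbers."""
    tokens, pos, text = [], 0, text.strip()
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character {text[pos]!r} at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def evaluate(condition, cell):
    """Evaluate a gene-condition string for one cell (dict: gene name -> value)."""
    tokens, pos = tokenize(condition), 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of condition")
        pos += 1
        return tokens[pos - 1]

    def parse_or():
        value = parse_and()
        while peek() == "||":
            take()
            rhs = parse_and()  # parse first, so tokens are always consumed
            value = value or rhs
        return value

    def parse_and():
        value = parse_unary()
        while peek() == "&&":
            take()
            rhs = parse_unary()
            value = value and rhs
        return value

    def parse_unary():
        if peek() == "!":
            take()
            return not parse_unary()
        if peek() == "(":
            take()
            value = parse_or()
            if take() != ")":
                raise ValueError("missing closing bracket")
            return value
        gene, op, number = take(), take(), take()
        if op not in COMPARE:
            raise ValueError(f"unknown comparison operator {op!r}")
        return COMPARE[op](cell[gene], float(number))

    result = parse_or()
    if peek() is not None:
        raise ValueError(f"unexpected token {peek()!r}")
    return result


example = "(CD3D > 2 && CD3E >= 0.1) || (PTPRC <= 3 && CD8A >= 0.01)"
print(evaluate(example, {"CD3D": 3.0, "CD3E": 0.5, "PTPRC": 5.0, "CD8A": 0.0}))  # True
```

On the server, the equivalent filtering is performed by the column condition that seq2filter returns; this evaluator is only useful for checking the intent of a condition locally.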
-
ECAUGT.set_gene_condition(gene_condition)
Transforms a gene-condition string into an OTS column condition
This function parses the logical expression string of the gene condition used for cell searching and generates a composite column condition for the OTS database.
- Parameters
gene_condition (str) –
It is a structured string that combines several logical expressions.
- Each logical expression should take one of the following forms:
gene_name1 == value1, here ‘==’ means equal
gene_name2 <> value2, here ‘<>’ means not equal
gene_name3 > value3, here ‘>’ means larger than
gene_name4 < value4, here ‘<’ means smaller than
gene_name5 >= value5, here ‘>=’ means not smaller than
gene_name6 <= value6, here ‘<=’ means not larger than
- Three symbols are used for logical operations between expressions:
logical_expression1 && logical_expression2, here ‘&&’ means the AND operation
logical_expression1 || logical_expression2, here ‘||’ means the OR operation
! logical_expression1, here ‘!’ means the NOT operation
Brackets are allowed, and the logical operators follow the usual precedence. The gene condition string also tolerates extra space characters.
- Here is an example of a legal condition string:
'(CD3D > 2 && CD3E >= 0.1) || (PTPRC <= 3 && CD8A >= 0.01)'
- Returns
The column condition used by tablestore to search cells
- Return type
tablestore.CompositeColumnCondition
-
ECAUGT.update_batch(rows_to_update, update_sets, thread_num=5)
Updates cells in the OTS table with the given column values in parallel
This function updates the cells in the given primary key list with the given column values.
- Parameters
rows_to_update (list) – rows_to_update is a list of primary keys of the cells to be updated. Each element in the list is a list containing a primary key tuple like [('cid', XXXXXXX)].
update_sets (list) – update_sets is a list of the same length as rows_to_update. Each element in this list is a list of tuples, where each tuple contains the name of a column and the value to set: [(column_name1, value1), (column_name2, value2), …]
thread_num (int) – thread_num is the number of threads used in the parallel update process. The default value of this parameter is 5.
- Returns
If all update operations in the parallel update process succeed, this function returns 0.
If any operation fails, this function returns the list of all update statuses. In the i-th element, 0 means success and -1 means failure of the i-th operation.
- Return type
int/list
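Because rows_to_update and update_sets must have the same length and the same order, it can be convenient to build both from one mapping. The sketch below shows one way to do that; `build_batch` is a hypothetical helper, not part of ECAUGT, and the cid values are made up.

```python
def build_batch(updates_by_cid):
    """Turn {cid: {column: value}} into aligned (rows_to_update, update_sets) lists."""
    rows_to_update, update_sets = [], []
    for cid, columns in updates_by_cid.items():
        rows_to_update.append([("cid", cid)])
        update_sets.append(list(columns.items()))
    return rows_to_update, update_sets


rows_to_update, update_sets = build_batch({
    2000001: {"organ": "Heart"},
    2000002: {"organ": "Lung", "cell_type": "T cell"},
})
print(rows_to_update)  # [[('cid', 2000001)], [('cid', 2000002)]]
print(update_sets)     # [[('organ', 'Heart')], [('organ', 'Lung'), ('cell_type', 'T cell')]]
```

The two lists can then be passed as the first two arguments of update_batch; building them in a single loop keeps the i-th update set aligned with the i-th primary key.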
-
ECAUGT.update_row(primary_key, update_data)
Updates a cell in the OTS table with the given column values
This function first checks whether the given cell is in the OTS table. If the cell is found, the given columns are updated with the given values; otherwise, a warning message is issued.
- Parameters
primary_key (list) – primary_key is a list which contains a tuple like: [('cid', XXXXXXX)]
update_data (list) –
update_data is a list of tuples. Each tuple contains the name of a column and the value to set.
Here is an example of update_data:
[("organ", "Heart"), ("user_id", 2), ("cell_type", "T cell")]
- Returns
The status of the update operation: 0 means success and -1 means failure.
- Return type
int