ECAUGT module

ECAUGT.Setup_Client(endpoint, access_key_public, access_key_secret, instance_name, table_name)

Sets up an OTSClient connected to the OTS server

This function builds an OTSClient object from the given parameters and connects to the corresponding server. If the connection succeeds, it prints the description of the selected table on the server. Run this function first in any use of the package; the client only needs to be set up once.

Parameters
  • endpoint (str) – The address of the OTS server on the public network

  • access_key_public (str) – The public access key of a user who has access to the server

  • access_key_secret (str) – The secret key paired with the given public key

  • instance_name (str) – The name of the instance which the table belongs to

  • table_name (str) – The name of the selected table

Returns

0 if the connection to the server succeeds, -1 if it fails.

Return type

int
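
Because the return code is the only failure signal, it can be convenient to turn -1 into an exception. The following is a minimal sketch under assumptions: the connect helper and the environment variable names are hypothetical, not part of the package, and the ecaugt argument stands for the imported ECAUGT module.

```python
import os


def connect(ecaugt, instance_name, table_name, env=os.environ):
    """Call Setup_Client once and raise if it reports failure (-1)."""
    status = ecaugt.Setup_Client(
        env["ECA_ENDPOINT"],            # hypothetical variable names
        env["ECA_ACCESS_KEY_PUBLIC"],
        env["ECA_ACCESS_KEY_SECRET"],
        instance_name,
        table_name,
    )
    if status != 0:
        raise ConnectionError("Setup_Client returned %r" % status)
    return status
```

Reading the keys from the environment keeps credentials out of scripts and notebooks; any other secret store would work the same way.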

ECAUGT.build_index()

build the index on the metadata columns for cell searching

This function builds an index on all metadata columns in the table on the OTS server. The index contains 17 fields, each representing a metadata column in the OTS table. The 17 fields are user_id, study_id, cell_id, organ, region, subregion, seq_tech, sample_status, donor_id, donor_gender, donor_age, original_name, cl_name, hcad_name, tissue_type, cell_type and marker_gene.

The index is necessary for any process that involves a search operation. It only needs to be built once per table and updates automatically when the table changes, so a warning is printed if the index already exists.

Returns

Nothing is returned; a warning is printed if the index has already been built.

Return type

None
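
Since searches can only use the indexed fields, a misspelled field name is an easy mistake to make. The sketch below copies the 17 field names from the description above and checks candidate names against them; the unknown_fields helper is hypothetical and not part of the package.

```python
# The 17 indexed metadata fields listed in the build_index description.
INDEX_FIELDS = (
    "user_id", "study_id", "cell_id", "organ", "region", "subregion",
    "seq_tech", "sample_status", "donor_id", "donor_gender", "donor_age",
    "original_name", "cl_name", "hcad_name", "tissue_type", "cell_type",
    "marker_gene",
)


def unknown_fields(fields):
    """Return the given names that are not indexed metadata fields."""
    return [name for name in fields if name not in INDEX_FIELDS]
```

For example, unknown_fields(["organ", "oragn"]) reports the misspelled "oragn" before any query is sent.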

ECAUGT.check_data(genenum_chk=True)

check whether the loaded data satisfies the standards for metadata and gene number

This function checks whether the loaded data satisfies the following standards:

  1. If the column number of the matrix equals 43896 (43878 genes and 18 metadata columns)

  2. If the names of the metadata columns are correct

  3. If donors’ genders are given in the correct form (‘Female’, ‘Male’, ‘NA’)

  4. If the cids in the matrix are correct based on the user_id and current quota value

Parameters

genenum_chk (bool) – Decides whether the column number of the dataset is checked. The default value is True; it can be set to False when working with datasets from other species.

Returns

Whether the dataset passes the check

Return type

bool
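
To make standards 1 and 3 concrete, here is a standalone illustration on plain Python data. It is not the package's implementation: the precheck helper and its arguments are hypothetical, and it works on lists rather than on the loaded DataFrame.

```python
EXPECTED_COLUMNS = 43878 + 18          # genes + metadata columns = 43896
VALID_GENDERS = {"Female", "Male", "NA"}


def precheck(column_names, donor_genders, genenum_chk=True):
    """Return a list of problems found; an empty list means both checks pass."""
    problems = []
    # Standard 1: total column count (skipped when genenum_chk is False).
    if genenum_chk and len(column_names) != EXPECTED_COLUMNS:
        problems.append("expected %d columns, got %d"
                        % (EXPECTED_COLUMNS, len(column_names)))
    # Standard 3: donor genders must use one of the three accepted spellings.
    bad = sorted(set(donor_genders) - VALID_GENDERS)
    if bad:
        problems.append("unexpected donor_gender values: %s" % bad)
    return problems
```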

ECAUGT.get_all_rows(cols_to_get=[])

get values in the selected columns in all cells in the OTS table

This function downloads all values of the selected columns in the whole OTS table

Parameters

cols_to_get (list) –

cols_to_get is a list of strings containing the names of the selected columns which can be either metadata columns or genes.

The default value of this parameter is an empty list, which means the function only downloads the primary keys.

Returns

A list of tablestore::Row objects, each representing a cell

Return type

list

ECAUGT.get_column_set(rows_to_get=None, col_to_get=None, col_filter=None, exlude_Unclassified=False)

get all unique values in a selected column in the given cells

This function calls the get_columnsbycell_para function to download the values of a selected column in the given cells and returns a set of all unique values.

Parameters
  • rows_to_get (list) – rows_to_get is a list of primary keys of the cells to download. Each element in the list is itself a list containing one primary key tuple, like [('cid', XXXXXXX)]. The default value is None, which means all rows in the database are fetched.

  • col_to_get (list) – col_to_get is a list of length 1 whose single element is the name of the selected column. If the length of the list is not 1, an error is raised.

  • col_filter (tablestore::CompositeColumnCondition) –

    col_filter is a composite column condition from the tablestore package. It can be generated by the seq2filter function, which takes a structured string as input.

    Once a filter is set, the function first filters the given cells and then downloads the selected columns of the cells that pass the filter. A message is printed if no cells pass the given filter. The default value of this parameter is None, which means no filtering is performed.

  • exlude_Unclassified (bool) – This parameter is only used when rows_to_get is None. It decides whether the cells with Unclassified cell_type will be included in the result.

Returns

A set of all unique values in the selected column

Return type

set

ECAUGT.get_columnsbycell(rows_to_get=None, cols_to_get=None, col_filter=None, do_transfer=True, exlude_Unclassified=False)

download the selected columns of the given cells from the OTS table on the server (without parallelism)

This function gets the cells in the given primary key list and downloads some or all of their columns. The cells can be further filtered on gene expression levels through the given column filter.

The result can be returned either as a pandas::DataFrame or as an untransformed list.

Parameters
  • rows_to_get (list) – rows_to_get is a list of primary keys of the cells to download. Each element in the list is itself a list containing one primary key tuple, like [('cid', XXXXXXX)]. The default value is None, which means all rows in the database are fetched.

  • cols_to_get (list) –

    cols_to_get is a list of strings containing the names of the selected columns, which can be either metadata columns or genes.

    The default value of this parameter is None, which means the function downloads all columns.

  • col_filter (tablestore::CompositeColumnCondition) –

    col_filter is a composite column condition from the tablestore package. It can be generated by the seq2filter function, which takes a structured string as input.

    Once a filter is set, the function first filters the given cells and then downloads the selected columns of the cells that pass the filter. A message is printed if no cells pass the given filter. The default value of this parameter is None, which means no filtering is performed.

  • do_transfer (bool) –

    If do_transfer is True, the output of this function is converted into a pandas::DataFrame; otherwise it is returned as a list.

    The default value of this parameter is True.

  • exlude_Unclassified (bool) – This parameter is only used when rows_to_get is None. It decides whether the cells with Unclassified cell_type will be included in the result.

Returns

If do_transfer == True, the return will be a pandas::DataFrame where each row represents a cell.

If do_transfer == False, the return will be a list. Each element is a list which represents a cell.

Return type

pandas::DataFrame/list
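
The rows_to_get format, a list of single-tuple lists, is easy to get wrong when building it by hand. The two helpers below are a small sketch for converting between plain cid values and that format; the names to_primary_keys and to_cids are hypothetical and not part of the package.

```python
def to_primary_keys(cids):
    """Wrap each cid as [('cid', value)], the shape rows_to_get expects."""
    return [[("cid", cid)] for cid in cids]


def to_cids(primary_keys):
    """Extract the cid values back out of a primary-key list."""
    return [dict(key)["cid"] for key in primary_keys]
```

For instance, to_primary_keys([2000001, 2000002]) gives [[('cid', 2000001)], [('cid', 2000002)]], and to_cids reverses it.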

ECAUGT.get_columnsbycell_para(rows_to_get=None, cols_to_get=None, col_filter=None, do_transfer=True, thread_num=15, exlude_Unclassified=False)

download the selected columns of the given cells from the OTS table on the server in parallel

This function gets the cells in the given primary key list and downloads some or all of their columns. The cells can be further filtered on gene expression levels through the given column filter. The result can be returned either as a pandas::DataFrame or as an untransformed list.

Parameters
  • rows_to_get (list) – rows_to_get is a list of primary keys of the cells to download. Each element in the list is itself a list containing one primary key tuple, like [('cid', XXXXXXX)]. The default value is None, which means all rows in the database are fetched.

  • cols_to_get (list) –

    cols_to_get is a list of strings containing the names of the selected columns which can be either metadata columns or genes.

    The default value of this parameter is None, which means the function downloads all columns.

  • col_filter (tablestore::CompositeColumnCondition) –

    col_filter is a composite column condition from the tablestore package. It can be generated by the seq2filter function, which takes a structured string as input.

    Once a filter is set, the function first filters the given cells and then downloads the selected columns of the cells that pass the filter. A message is printed if no cells pass the given filter. The default value of this parameter is None, which means no filtering is performed.

  • do_transfer (bool) –

    If do_transfer is True, the output of this function is converted into a pandas::DataFrame; otherwise it is returned as a list.

    The default value of this parameter is True.

  • thread_num (int) – thread_num is the number of threads used in the parallel download process. The default value of this parameter is multiprocessing.cpu_count()-1, which means the function uses nearly all available threads.

  • exlude_Unclassified (bool) – This parameter is only used when rows_to_get is None. It decides whether the cells with Unclassified cell_type will be included in the result.

Returns

If do_transfer == True, the return will be a pandas::DataFrame where each row represents a cell.

If do_transfer == False, the return will be a list. Each element is a list which represents a cell.

Return type

pandas::DataFrame/list
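
Putting the pieces together, a typical download combines a metadata query, a gene filter and the parallel download. The sketch below shows one possible arrangement: download_expression is a hypothetical helper, the ecaugt argument stands for the imported ECAUGT module, and returning None when the query result is empty is a choice made for this example only.

```python
def download_expression(ecaugt, metadata_conditions, gene_condition, genes,
                        thread_num=4):
    """Query cells by metadata, then download selected genes in parallel."""
    rows = ecaugt.query_cells(metadata_conditions)
    if not rows:
        return None                      # nothing matched the metadata query
    col_filter = ecaugt.seq2filter(gene_condition)
    return ecaugt.get_columnsbycell_para(
        rows_to_get=rows,
        cols_to_get=genes,
        col_filter=col_filter,
        do_transfer=True,                # ask for a pandas::DataFrame
        thread_num=thread_num,
    )
```

Passing the module in as an argument keeps the helper easy to exercise with a stand-in object when no server is available.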

ECAUGT.insert_matrix_para(thread_num=15)

upload the loaded DataFrame to the server in parallel

This function uploads the DataFrame stored in the global variable df to the server. We recommend running the check_data function before using this function to ensure the DataFrame is standardized and ready for upload.

This function will print how many rows failed to be uploaded and return the upload results of each operation.

Parameters

thread_num (int) – thread_num is the number of threads used in the parallel upload process. The default value of this parameter is multiprocessing.cpu_count()-1, which means the function uses nearly all available threads.

Returns

If all upload operations in the parallel upload process succeed, this function returns 0.

If any operation fails, this function returns the list of all upload statuses. The i-th element is 0 if the i-th operation succeeded and -1 if it failed.

Return type

int/list
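
Because the return value is either the integer 0 or a list of per-operation statuses, callers need to handle both shapes. A minimal sketch, assuming only the return convention described above (failed_operations is a hypothetical helper name):

```python
def failed_operations(result):
    """Return indices of failed operations given the int-or-list result."""
    if result == 0:          # a bare 0 means every operation succeeded
        return []
    return [i for i, status in enumerate(result) if status == -1]
```

The same helper applies to any function in this module that follows the same int/list convention.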

ECAUGT.query_cells(metadata_conditions='', include_children=False, exlude_Unclassified=False, print_message=True)

query cells and return the list of their primary keys

This function queries the cells that satisfy the metadata conditions in the OTS table on the server. The conditions on the metadata should be given as a structured string that combines several logical expressions.

Note that this function only supports conditions on the metadata columns in the index; conditions on gene expression should be set in the get_columnsbycell function or the get_columnsbycell_para function.

Parameters
  • metadata_conditions (str) –

    It is a structured string which is a combination of several logical expressions.

    Each logical expression should be in the following forms:

    field_name1 == value1, here ‘==’ means equal

    field_name2 <> value2, here ‘<>’ means unequal

    Three symbols are used for logical operation between expressions:

    logical_expression1 && logical_expression2, here ‘&&’ means AND operation

    logical_expression1 || logical_expression2, here ‘||’ means OR operation

    ! logical_expression1, here ‘!’ means NOT operation

    Brackets are allowed and the logical operators follow the usual precedence. Extra spaces in the condition string are tolerated.

    Here are some examples of legal condition strings:

    '(user_id == 2 && organ == Heart) || (user_id == 3 && organ <> Brain)'

    'organ == Lung && !seq_tech == 10X'

    'organ == Heart &&(cell_type == Fibrocyte || region <> atria)'

    The default value of this parameter is '', which gets all rows in the database.

  • include_children (bool) – include_children is the parameter deciding if subtypes should be included in the query with condition on cell type. For example, when we query “cell_type == T cell” and set include_children as True, cells in the query result would contain T cell and the subtypes of T cells like CD4 T cell, CD8 T cell and so on; otherwise, only T cells would exist in the query result.

Returns

The return is a list of the primary keys of the queried cells. Each element is a list containing a tuple with the name of the primary key and its value: [('cid', 2021119)].

Here is an example return with 2000 queried cells:

[[('cid', 2000001)], [('cid', 2000002)], [('cid', 2000003)], [('cid', 2000004)], …, [('cid', 2002000)]]

The primary key list can be used for downstream data downloading or table updating.

Return type

list
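
Condition strings can be typed by hand, but assembling them from small pieces reduces bracket and operator mistakes. The helpers below are a sketch that only does string formatting according to the syntax described above; the names eq, ne, and_, or_ and paren are hypothetical and not part of the package.

```python
def eq(field, value):
    """field == value"""
    return "%s == %s" % (field, value)


def ne(field, value):
    """field <> value"""
    return "%s <> %s" % (field, value)


def and_(*exprs):
    """Join expressions with the AND symbol."""
    return " && ".join(exprs)


def or_(*exprs):
    """Join expressions with the OR symbol."""
    return " || ".join(exprs)


def paren(expr):
    """Wrap an expression in brackets to make grouping explicit."""
    return "(%s)" % expr


# Rebuilds the first example condition string from the parameter description.
condition = or_(
    paren(and_(eq("user_id", 2), eq("organ", "Heart"))),
    paren(and_(eq("user_id", 3), ne("organ", "Brain"))),
)
```

The resulting condition can be passed as metadata_conditions.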

ECAUGT.search_metadata(metadata_conditions='', include_children=False, exlude_Unclassified=False, print_message=True)

query cells and return the list of their primary keys

This function queries the cells that satisfy the metadata conditions in the OTS table on the server. The conditions on the metadata should be given as a structured string that combines several logical expressions.

Note that this function only supports conditions on the metadata columns in the index; conditions on gene expression should be set in the get_columnsbycell function or the get_columnsbycell_para function.

Parameters
  • metadata_conditions (str) –

    It is a structured string which is a combination of several logical expressions.

    Each logical expression should be in the following forms:

    field_name1 == value1, here ‘==’ means equal

    field_name2 <> value2, here ‘<>’ means unequal

    Three symbols are used for logical operation between expressions:

    logical_expression1 && logical_expression2, here ‘&&’ means AND operation

    logical_expression1 || logical_expression2, here ‘||’ means OR operation

    ! logical_expression1, here ‘!’ means NOT operation

    Brackets are allowed and the logical operators follow the usual precedence. Extra spaces in the condition string are tolerated.

    Here are some examples of legal condition strings:

    '(user_id == 2 && organ == Heart) || (user_id == 3 && organ <> Brain)'

    'organ == Lung && !seq_tech == 10X'

    'organ == Heart &&(cell_type == Fibrocyte || region <> atria)'

    The default value of this parameter is '', which gets all rows in the database.

  • include_children (bool) – include_children is the parameter deciding if subtypes should be included in the query with condition on cell type. For example, when we query “cell_type == T cell” and set include_children as True, cells in the query result would contain T cell and the subtypes of T cells like CD4 T cell, CD8 T cell and so on; otherwise, only T cells would exist in the query result.

Returns

The return is a list of the primary keys of the queried cells. Each element is a list containing a tuple with the name of the primary key and its value: [('cid', 2021119)].

Here is an example return with 2000 queried cells:

[[('cid', 2000001)], [('cid', 2000002)], [('cid', 2000003)], [('cid', 2000004)], …, [('cid', 2002000)]]

The primary key list can be used for downstream data downloading or table updating.

Return type

list

ECAUGT.seq2filter(gene_condition)

transform a gene-condition string into an OTS column condition

This function parses the logical expression string of the gene condition used for cell searching and generates a composite column condition for the OTS database.

Parameters

gene_condition (str) –

It is a structured string which is a combination of several logical expressions.

Each logical expression should be in the following forms:

gene_name1 == value1, here ‘==’ means equal

gene_name2 <> value2, here ‘<>’ means unequal

gene_name3 > value3, here ‘>’ means larger than

gene_name4 < value4, here ‘<’ means smaller than

gene_name5 >= value5, here ‘>=’ means not smaller than

gene_name6 <= value6, here ‘<=’ means not larger than

Three symbols are used for logical operation between expressions:

logical_expression1 && logical_expression2, here ‘&&’ means AND operation

logical_expression1 || logical_expression2, here ‘||’ means OR operation

! logical_expression1, here ‘!’ means NOT operation

Brackets are allowed and the logical operators follow the usual precedence. Extra spaces in the gene condition string are tolerated.

Here is an example of a legal condition string:

'(CD3D > 2 && CD3E >= 0.1) || (PTPRC <= 3 && CD8A >= 0.01)'

Returns

the column condition for tablestore to search cells

Return type

tablestore.CompositeColumnCondition
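
Gene conditions accept six comparison operators, so a small formatter that rejects anything else can catch mistakes such as writing != instead of <>. This is a sketch of string construction only, under the syntax described above; gene_expr and GENE_OPERATORS are hypothetical names, and the resulting string would still have to be passed to seq2filter.

```python
# The six comparison operators listed in the parameter description.
GENE_OPERATORS = ("==", "<>", ">", "<", ">=", "<=")


def gene_expr(gene, op, value):
    """Format one comparison, rejecting operators outside the documented six."""
    if op not in GENE_OPERATORS:
        raise ValueError("unsupported operator: %r" % op)
    return "%s %s %s" % (gene, op, value)


# Rebuilds the example condition string from the parameter description.
condition = "(%s && %s) || (%s && %s)" % (
    gene_expr("CD3D", ">", 2), gene_expr("CD3E", ">=", 0.1),
    gene_expr("PTPRC", "<=", 3), gene_expr("CD8A", ">=", 0.01),
)
```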

ECAUGT.set_gene_condition(gene_condition)

transform a gene-condition string into an OTS column condition

This function parses the logical expression string of the gene condition used for cell searching and generates a composite column condition for the OTS database.

Parameters

gene_condition (str) –

It is a structured string which is a combination of several logical expressions.

Each logical expression should be in the following forms:

gene_name1 == value1, here ‘==’ means equal

gene_name2 <> value2, here ‘<>’ means unequal

gene_name3 > value3, here ‘>’ means larger than

gene_name4 < value4, here ‘<’ means smaller than

gene_name5 >= value5, here ‘>=’ means not smaller than

gene_name6 <= value6, here ‘<=’ means not larger than

Three symbols are used for logical operation between expressions:

logical_expression1 && logical_expression2, here ‘&&’ means AND operation

logical_expression1 || logical_expression2, here ‘||’ means OR operation

! logical_expression1, here ‘!’ means NOT operation

Brackets are allowed and the logical operators follow the usual precedence. Extra spaces in the gene condition string are tolerated.

Here is an example of a legal condition string:

'(CD3D > 2 && CD3E >= 0.1) || (PTPRC <= 3 && CD8A >= 0.01)'

Returns

the column condition for tablestore to search cells

Return type

tablestore.CompositeColumnCondition

ECAUGT.update_batch(rows_to_update, update_sets, thread_num=5)

update cells in the OTS table with the given columns’ values in parallel

This function updates the cells in the given primary key list with the given column values.

Parameters
  • rows_to_update (list) – rows_to_update is a list of primary keys of the cells to be updated. Each element in the list is a list containing a primary key tuple like [(‘cid’,XXXXXXX)]

  • update_sets (list) – update_sets is a list whose length is the same as the parameter rows_to_update. Each element in this list is a list which contains several tuples where each tuple contains the name of a column and the value to update: [(column_name1, value1),(column_name2, value2),…]

  • thread_num (int) – thread_num is the number of threads used in the parallel update process. The default value of this parameter is 5.

Returns

If all update operations in the parallel update process succeed, this function returns 0.

If any operation fails, this function returns the list of all update statuses. The i-th element is 0 if the i-th operation succeeded and -1 if it failed.

Return type

int/list
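
Since update_sets must line up index by index with rows_to_update, it can help to derive it from a lookup keyed by cid. The sketch below assumes a hypothetical values_by_cid dictionary that maps each cid to a {column_name: value} dictionary; build_update_sets is not part of the package.

```python
def build_update_sets(rows_to_update, values_by_cid):
    """Return update_sets aligned index-by-index with rows_to_update."""
    update_sets = []
    for key in rows_to_update:
        cid = dict(key)["cid"]           # key looks like [('cid', 1234567)]
        # A missing cid raises KeyError, so misaligned input fails early.
        update_sets.append(list(values_by_cid[cid].items()))
    return update_sets
```

The result has the same length as rows_to_update, and each element is a list of (column_name, value) tuples as described above.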

ECAUGT.update_row(primary_key, update_data)

update a cell in the OTS table with the given columns’ values

This function first checks whether the given cell is in the OTS table. If the cell is found, the given columns are updated with the given values; otherwise, a warning message is raised.

Parameters
  • primary_key (list) – primary_key is a list which contains a tuple like: [(‘cid’,XXXXXXX)]

  • update_data (list) –

    update_data is a list which contains several tuples. Each tuple contains the name of a column and the value to update.

    Here is an example of update_data:

    [("organ", "Heart"), ("user_id", 2), ("cell_type", "T cell")]

Returns

The status of the update operation: 0 means success and -1 means failure.

Return type

int