Documentation

The `Budoids_class` class

class MidlineIdentifier.Budoids_class.Budoid(args)[source]

Bases: object

Budoid object class

Attributes:

img (Img): Image object
data (Adata): Single cell object
sample (str): Sample identifier. Will be stored in adata.obs['sample']. Useful when concatenating multiple Adata objects.
outdir (str): Output directory where files will be saved

Methods

`FindPath`(**kwargs)	Identify the morphological midline of the structure.
`ADProcess`()	Preprocessing of single cell dataset.
`RMOutliers`([plot])	Remove cells that fall out of the structure segmentation.
`ProjectCells`([alpha, plot])	Project cells onto the nearest coordinate on the morphological midline.
`FindOrientation`(**kwargs)	Orient the coords based on the provided genelists.
`run_wrapper`([save])	A wrapper function to process the dataset.
`FindDEG`(groupby, condition[, method, ...])	Finds differentially expressed genes (DEGs) for each of the identity classes in a dataset.
`FindSVG`(coords[, sample])	Finds spatially variable genes (SVGs) for each of the identity classes in a dataset.
`Concat`(object_list)	Merge multiple objects.

ADProcess()[source]: Preprocessing of single cell dataset. A wrapper function of Preprocessing(). See Preprocessing() for more detail.

Concat(object_list)[source]

Merge multiple objects.

Parameters:

object_list (list of Budoid): A list of Budoid objects to merge

FindDEG(groupby, condition, method='DESeq2', corr_method='benjamini-hochberg', **kwargs)[source]

Finds differentially expressed genes (DEGs) for each of the identity classes in a dataset. A wrapper function of FindDEG()

Parameters:

groupby (str)

The key of the observations grouping to consider.

method (str, default: 'DESeq2')

Method used to calculate DEGs. 'DESeq2' and 'DESeq2_pb' use pydeseq2, the Python implementation of the DESeq2 method. 'DESeq2' calculates DEGs at the single-cell level, while 'DESeq2_pb' generates pseudobulk expression based on condition. 't-test' uses a t-test, 't-test_overestim_var' overestimates the variance of each group, 'wilcoxon' uses the Wilcoxon rank-sum test, and 'logreg' uses logistic regression.

If method is one of ['logreg', 't-test', 'wilcoxon', 't-test_overestim_var'], this function directly calls scanpy.tl.rank_genes_groups().

condition (str)

Required for the 'DESeq2_pb' method.

kwargs

Additional arguments to pass to scanpy.tl.rank_genes_groups()

Examples

>>> import PSUils as ps
>>> budoid = ps.io.ReadObj('testdata/Budoid_1A/Budoids.pkl')

>>> groupby = 'condition'
>>> cond = 'loc'
>>> budoid.data.adata.obs

>>> # test DESeq2_pb method
>>> budoid.FindDEG(groupby, cond, method = 'DESeq2_pb', groups = 'P', reference = 'D')

>>> # test wilcoxon method
>>> budoid.FindDEG(groupby, cond, method = 'wilcoxon')  # condition is a required positional argument

FindOrientation(**kwargs)[source]

Orient the coords based on the provided genelists. A wrapper function of FindOrientation(). See FindOrientation() for more detail.

Parameters:

kwargs: Additional arguments to pass to FindOrientation()

Examples

Define the proximal (start) and distal (end) ends of the midline using an example dataset. If both the proximal score and the distal score are greater than a self-defined threshold (Thre = 0.01), the dataset is considered polarized; otherwise, it is considered non-polarized.

>>> import PSUils as ps
>>> budoid = ps.io.ReadObj('testdata/Budoid_1A/Budoids.pkl')

>>> start_genes = ['Sox9','Acan','Col2a1','Col9a1','Col9a2','Col11a1']
>>> end_genes = ['Col1a1', 'Col3a1']
>>> coords = 'major_coor_scaled' # previously stored midline coordinates

>>> budoid.FindOrientation(coords=coords, start_genes=start_genes, end_genes=end_genes)

>>> adata = budoid.data.adata
>>> Thre = 0.01 # self-defined threshold
>>> max_s, max_e = adata.uns['start_score'], adata.uns['end_score']

>>> if max_s > Thre and max_e > Thre:
...     idx = adata.obs['major_coor_used'] > 0.5
...     adata.obs.loc[idx, 'loc'] = 'Proximal'
...     adata.obs.loc[~idx, 'loc'] = 'Distal'
... else:
...     adata.obs['loc'] = 'Round'

FindPath(**kwargs)[source]

Identify the morphological midline of the structure. A wrapper function of FindPath(). See FindPath() for more detail.

Parameters:

kwargs: Additional arguments to pass to FindPath()

FindSVG(coords, sample='sample', **kwargs)[source]

Finds spatially variable genes (SVGs) for each of the identity classes in a dataset. This should be done on the sample level. A wrapper function of FindSVG().

Parameters:

sample (str, default: 'sample'): Sample identifier. Must be one of the .obs.columns
kwargs: Additional arguments to pass to FindSVG()

ProjectCells(alpha=0.01, plot=True)[source]

Project cells onto the nearest coordinate on the morphological midline. We developed a scoring scheme which takes into account the distance between coordinates and cells and the number of cells associated with the coordinates. The score of coordinate-cell pair \((i,c)\) is defined as

\[S_{ic} = D_{ic} \, e^{\alpha N_{i}}\]

where \(D_{ic}\) represents the Euclidean distance between coordinate \(i\) and cell \(c\), \(N_i\) is the number of cells associated with \(i\), and \(\alpha\) is the scaling factor. Each cell is then projected to the coordinate with the lowest score, i.e. the nearest coordinate once the penalty is applied.

Parameters:

alpha (float, default: 0.01): The scaling factor \(\alpha\) that controls the level of penalty.
plot (bool, default: True): If True, save the plot into 'Cells_remove.pdf' in the output directory (.outdir)
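
The scoring scheme can be illustrated with a small plain-Python sketch. It is not the package's implementation: the function name `project_cells`, the one-cell-at-a-time update of \(N_i\), and the choice of the lowest-scoring coordinate are assumptions made for this example.

```python
import math

def project_cells(cells, midline, alpha=0.01):
    """Assign each cell to a midline coordinate using S_ic = D_ic * exp(alpha * N_i).

    Assumptions (not taken from the package): cells are processed one at a
    time, N_i counts the cells already assigned to coordinate i, and the
    coordinate with the lowest score wins.
    """
    counts = [0] * len(midline)          # N_i for every midline coordinate
    assignment = []
    for cx, cy in cells:
        scores = [
            math.hypot(cx - mx, cy - my) * math.exp(alpha * counts[i])
            for i, (mx, my) in enumerate(midline)
        ]
        best = min(range(len(midline)), key=scores.__getitem__)
        counts[best] += 1                # crowded coordinates become less attractive
        assignment.append(best)
    return assignment

midline = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
cells = [(0.1, 0.5), (1.1, -0.4), (1.9, 0.2)]
assignment = project_cells(cells, midline)
print(assignment)
```

With a small alpha the result is close to a nearest-coordinate assignment; a larger alpha spreads cells more evenly along the midline.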

RMOutliers(plot=True)[source]

Remove cells that fall out of the structure segmentation.

Parameters:

plot (bool, default: True): If True, save the plot into 'Cells_remove.pdf' in the output directory (.outdir)

run_wrapper(save=True, **kwargs)[source]

A wrapper function to process the dataset.

Parameters:

save (bool, default: True): If True, save the processed data into a pickle file.
kwargs: Additional arguments to pass to SaveObj()

The `adata_class` class

class MidlineIdentifier.adata_class.Adata(fad, sample, outdir)[source]

Bases: object

Adata object to wrap anndata

Attributes:

adata (anndata.AnnData): AnnData object that stores the single cell data. Compatible with all scanpy functions.
outdir (str): Output directory to save files

Methods

`Preprocessing`()	Preprocessing of single cell dataset.
`EnrichBins`(genes, coords[, nbin, score_name])	Perform gene set enrichment in bins.
`FindOrientation`([coords, start_genes, ...])	Orient the coords based on the provided genelists.
`FindDEG`(groupby, condition, method, **kwargs)	Finds differentially expressed genes (DEGs) for each of the identity classes in a dataset.
`FindSVG`(coords, sample[, layer, ...])	Finds spatially variable genes (SVGs) for each of the identity classes in a dataset.

EnrichBins(genes, coords, nbin=4, score_name='score', **kwargs)[source]

Perform gene set enrichment in bins. This function calls scanpy.tl.score_genes(). Result will be stored into .uns[score_name].

Parameters:

genes (list | str): The list of gene names used for score calculation
coords (str): The key of the observations to consider. Must be one of the .obs.columns
nbin (int, default: 4): The number of bins
score_name (str, default: 'score'): Name of the field to be added in .uns
kwargs: Additional arguments to pass to scanpy.tl.score_genes()
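
As a rough illustration of scoring in bins, the sketch below averages an already computed per-cell score within bins of a coordinate. The function name `mean_score_per_bin` and the equal-width binning are assumptions for this example; the package may bin cells differently.

```python
def mean_score_per_bin(scores, coords, nbin=4):
    """Average a per-cell score within equal-width bins of a coordinate.

    Assumption: bins are equal-width over [min(coords), max(coords)].
    """
    lo, hi = min(coords), max(coords)
    width = (hi - lo) / nbin
    if width == 0:
        raise ValueError("coords must span a non-zero range")
    totals = [0.0] * nbin
    counts = [0] * nbin
    for s, c in zip(scores, coords):
        b = min(int((c - lo) / width), nbin - 1)   # the maximum falls in the last bin
        totals[b] += s
        counts[b] += 1
    # Empty bins are reported as NaN rather than zero
    return [t / n if n else float("nan") for t, n in zip(totals, counts)]

coords = [0.0, 0.1, 0.3, 0.6, 0.9, 1.0]
scores = [2.0, 4.0, 1.0, 0.5, 0.0, 1.0]
binned = mean_score_per_bin(scores, coords, nbin=4)
print(binned)
```

A gene set whose binned score peaks in the first bins is enriched towards the start of the coordinate axis, which is the kind of signal an orientation step can use.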

FindDEG(groupby, condition, method, **kwargs)[source]

Finds differentially expressed genes (DEGs) for each of the identity classes in a dataset.

Parameters:

adata (anndata.AnnData)

Annotated data matrix.

groupby (str)

The key of the observations grouping to consider.

method (str, default: 'DESeq2')

Method used to calculate DEGs. 'DESeq2' and 'DESeq2_pb' use pydeseq2, the Python implementation of the DESeq2 method. 'DESeq2' calculates DEGs at the single-cell level, while 'DESeq2_pb' generates pseudobulk expression based on condition. 't-test' uses a t-test, 't-test_overestim_var' overestimates the variance of each group, 'wilcoxon' uses the Wilcoxon rank-sum test, and 'logreg' uses logistic regression.

If method is one of ['logreg', 't-test', 'wilcoxon', 't-test_overestim_var'], this function directly calls scanpy.tl.rank_genes_groups().

kwargs

Additional arguments to pass to scanpy.tl.rank_genes_groups()

Returns:

pandas.DataFrame

FindOrientation(coords='major_coor_scaled', start_genes=['Sox9', 'Acan', 'Col2a1', 'Col9a1', 'Col9a2', 'Col11a1'], end_genes=['Col1a1', 'Col3a1'], plot=True, **kwargs)[source]

Orient the coords based on the provided gene lists. This allows cross-dataset and cross-structure comparisons. Results are stored in .uns['start_score'] and .uns['end_score'].

Parameters:

coords (str, default: 'major_coor_scaled'): The key of the observations to consider. Must be one of the .obs.columns
start_genes (list): The list of gene names used to calculate the start
end_genes (list): The list of gene names used to calculate the end
plot (bool, default: True): If True, save the plot into 'Orientation_score.pdf' in the output directory (.outdir)
kwargs: Additional arguments to pass to EnrichBins()

FindSVG(coords, sample, layer='counts', min_exp_gene=0, min_exp_cell=0)[source]

Finds spatially variable genes (SVGs) for each of the identity classes in a dataset. This function incorporates SpatialDE(). Raw counts should be used.

Parameters:

sample (str): Sample identifier. Must be one of the .obs.columns
coords (str): Spatial coordinates for each cell. Can be one of the .obs.columns or a pandas.DataFrame with rows as cells and columns as spatial dimensions.
layer (str, default: 'counts'): Key from adata.layers whose values will be used. If None, `adata.layers['counts']` is used.
min_exp_gene (int, default: 0): Filter out genes whose expression is lower than this
min_exp_cell (int, default: 0): Filter out cells whose total expression is lower than this

PolarizationScoring(genes, norm=False, coords='major_coor_used', bootstrapping=False, n_bs=1000, random_state=1234)[source]

Calculate the polarization score for genes.

Parameters:

genes (list | str): Genes for which to calculate the score
bootstrapping (bool, default: False): Whether to perform bootstrapping
n_bs (int, default: 1000): The number of bootstrap iterations to perform
random_state (int, default: 1234): Random seed for reproducibility

Preprocessing()[source]

Preprocessing of single cell dataset. This function calls scanpy.pp.normalize_total() and scanpy.pp.log1p().

Raw data, normalized data and log data will be stored into .layers['counts'], .layers["norm_counts"] and .layers["lognorm_counts"] respectively.
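
The three layers can be illustrated without scanpy on a tiny cells x genes matrix. This is a sketch under stated assumptions: the function name `preprocess` is hypothetical, and cells are scaled to the median total count, which is what scanpy.pp.normalize_total() does when `target_sum` is None. The arguments the package actually passes are not documented here.

```python
import math
from statistics import median

def preprocess(counts):
    """Return the three layers described above for a cells x genes count matrix.

    Assumption: each cell is scaled to the median total count
    (normalize_total with target_sum=None), then log1p is applied.
    """
    totals = [sum(row) for row in counts]
    target = median(totals)
    norm = [
        [v * target / t if t else 0.0 for v in row]
        for row, t in zip(counts, totals)
    ]
    lognorm = [[math.log1p(v) for v in row] for row in norm]
    return {"counts": counts, "norm_counts": norm, "lognorm_counts": lognorm}

layers = preprocess([[1, 3], [2, 2], [10, 10]])
print(layers["norm_counts"])
```

After normalization every cell in the example has the same total count, so differences in sequencing depth no longer dominate the comparison between cells.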

The `image_class` class

class MidlineIdentifier.image_class.Img(fimg, dc, do, outdir)[source]

Bases: object

Image object

Attributes:

img (numpy.ndarray): Original image
imgb (numpy.ndarray): Binary image
dc (int): Disk size for image closing
do (int): Disk size for image opening
paddings (numpy.ndarray): Records whether padding has been performed on each of the four borders of the image.
outdir (str): Output directory to save files
ar (float): The aspect ratio of the major structure
measure (pandas.DataFrame)
seg (numpy.ndarray): The segmentation of the major structure
ridge (numpy.ndarray): The ridge image of the major structure, as computed by FindRidge()
path (numpy.ndarray): The morphological midline of the major structure in the image
start (tuple): The start of the morphological midline
end (tuple): The end of the morphological midline

Methods

`Padding`([tolerance, num_pads])	Image padding.
`Segmentation`()	Get a closed segmentation of the major structure
`Image_measurement`([plot])	Measure properties of all image regions.
`RMSmallRegion`()	Remove small regions (noise) by setting the corresponding pixels to false.
`GetStartEnd`()	Return the start and end of the morphological midline.
`GetAspectRatio`()	Return the aspect ratio of the structure.
`FindRidge`(**kwargs)	Filter the Euclidean distance transform of the image with the Meijering neuriteness filter.
`FindPath`([plot])	Identify the morphological midline of the major structure in the image. The midline is stored in `.path`.

FindPath(plot=True)[source]

Identify the morphological midline of the major structure in the image. The midline is stored in .path.

Parameters:

plot (bool, default: True): If True, save the plot into 'Major_axis_sk_on_ridge.pdf' in the output directory (.outdir)
kwargs: Additional arguments to pass to skimage.filters.meijering()

FindRidge(**kwargs)[source]

Filter the Euclidean distance transform of the image with the Meijering neuriteness filter. This function calls scipy.ndimage.distance_transform_edt() and skimage.filters.meijering().

Parameters:

kwargs: Additional arguments to pass to skimage.filters.meijering()

GetAspectRatio()[source]

Return the aspect ratio of the structure.

Returns:

AspectRatio (float): Aspect ratio of the major structure

GetStartEnd()[source]

Return the start and end of the morphological midline.

Returns:

start (tuple of (int, int)): Start pixel coordinates of the morphological midline
end (tuple of (int, int)): End pixel coordinates of the morphological midline

Image_measurement(plot=True)[source]

Measure properties of all image regions. Measurements will be stored in .measure

Parameters:

plot (bool, default: True): If True, save the plot into 'region_measure.pdf' in the output directory (.outdir)

Padding(tolerance=50, num_pads=100)[source]

Image padding. Padding is performed if the target structure is located too close to the image borders, which allows better segmentation of the structure.

If padding is performed, the paddings attribute will be modified accordingly.

Parameters:

tolerance (int, default: 50): The minimum distance, in pixels, tolerated between the structure and an image border.
num_pads (int, default: 100): The number of pixels to pad
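
The padding rule can be sketched on a small binary image stored as a list of rows. The function name `pad_if_near_border`, the order of the returned flags (top, bottom, left, right) and the exact distance test are assumptions made for this illustration, not the package's behaviour.

```python
def pad_if_near_border(img, tolerance=50, num_pads=100):
    """Pad a binary image (list of rows) on every side the foreground approaches.

    Returns the padded image and four flags in the (assumed) order
    top, bottom, left, right.
    """
    rows = [r for r, row in enumerate(img) if any(row)]
    cols = [c for c in range(len(img[0])) if any(row[c] for row in img)]
    if not rows:                       # no foreground, nothing to pad
        return [list(row) for row in img], [False] * 4
    height, width = len(img), len(img[0])
    top = rows[0] < tolerance
    bottom = height - 1 - rows[-1] < tolerance
    left = cols[0] < tolerance
    right = width - 1 - cols[-1] < tolerance
    new_width = width + num_pads * (left + right)
    # Background rows above, the widened original rows, background rows below
    padded = [[0] * new_width for _ in range(num_pads * top)]
    for row in img:
        padded.append([0] * (num_pads * left) + list(row) + [0] * (num_pads * right))
    padded += [[0] * new_width for _ in range(num_pads * bottom)]
    return padded, [top, bottom, left, right]

img = [
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
padded, paddings = pad_if_near_border(img, tolerance=2, num_pads=3)
print(paddings, len(padded), len(padded[0]))
```

Recording which borders were padded makes it possible to map coordinates in the padded image back to the original image later.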

RMSmallRegion()[source]: Remove small regions (noise) by setting the corresponding pixels to false. Only the largest segment was kept for structural segmentation.

Segmentation()[source]: Get a closed segmentation of the major structure

The `MidlineIdentifier.io` module

MidlineIdentifier.io.ReadObj(filename)[source]

Read a pickle (.pkl) file.

Parameters:

filename (str): File name of the data file.

MidlineIdentifier.io.SaveObj(obj, filename=None)[source]

Save an object into a pickle (.pkl) file.

Parameters:

obj (Budoid): Budoid object
filename (str, default: None): File name of the data file.
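
A save/load round trip with the standard pickle module might look like the sketch below. The helper names `save_obj` and `read_obj` and the dictionary standing in for a Budoid object are hypothetical; how SaveObj() chooses a file name when `filename` is None is not shown.

```python
import os
import pickle
import tempfile

def save_obj(obj, filename):
    """Serialize obj to a .pkl file with the standard pickle module."""
    with open(filename, "wb") as handle:
        pickle.dump(obj, handle)

def read_obj(filename):
    """Load an object previously written by save_obj()."""
    with open(filename, "rb") as handle:
        return pickle.load(handle)

# A plain dictionary stands in for a Budoid object in this example
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "Budoids.pkl")
    save_obj({"sample": "Budoid_1A", "path": [(0, 0), (1, 1)]}, path)
    restored = read_obj(path)

print(restored)
```

Unpickling can execute arbitrary code, so only load .pkl files from sources you trust.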

The `MidlineIdentifier.plotting` module

MidlineIdentifier.plotting.trend_plot(budoid, features, groupby, coords='major_coor_used', save=False, **kwargs)[source]

Makes a trend plot of the expression values of the given features as a function of coords.

For each var_name and each groupby category a dot is plotted. Each dot represents two values: mean expression within each category (visualized by color) and fraction of cells expressing the var_name in the category (visualized by the size of the dot). If groupby is not given, the dotplot assumes that all data belongs to a single category.

This function uses seaborn.lmplot(). If you need more flexibility, use seaborn.lmplot() directly.

Parameters:

features (str | list): Column name in .var DataFrame that stores gene symbols. By default var_names refer to the index column of the .var DataFrame.
groupby (str): The key of the observation grouping to consider. Must be one of obs.columns
coords (str, default: 'major_coor_used'): The coordinate against which gene expression is plotted. Must be one of obs.columns.
save (bool, default: False): If True or a str, save the figure. A string is appended to the default filename. The file type is inferred if the string ends in one of {'.pdf', '.png', '.svg'}.
kwargs: Additional arguments to pass to seaborn.lmplot()

Returns:

seaborn.lmplot() object.

Examples

Create a trend plot of the given markers for an example dataset, grouped by the category 'batch'.

import PSUils as ps

budoid1 = ps.io.ReadObj('testdata/Budoid_1A/Budoids.pkl')
budoid2 = ps.io.ReadObj('testdata/Budoid_3H/Budoids.pkl')
budoid1.Concat([budoid2])  # Concat expects a list of Budoid objects

markers = ['Col9a2', 'Col3a1']
ps.plotting.trend_plot(budoid1, markers, groupby='batch')

The `MidlineIdentifier.utilis` module

MidlineIdentifier.utilis.EuclideanDist(pts, pt)[source]

Calculate the Euclidean distance between one point and other point(s).

Parameters:

pts (array_like)
pt (array_like of size one)

Returns:

dist (float or numpy.ndarray): Euclidean distance
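
A plain-Python sketch of the same idea, returning a float for a single point and a list for several points. The name `euclidean_dist` and the list-based return type are assumptions for this example; the documented function returns a float or a numpy.ndarray.

```python
import math

def euclidean_dist(pts, pt):
    """Distance from pt to one point or to each point in a sequence of points."""
    if pts and not isinstance(pts[0], (list, tuple)):
        return math.dist(pts, pt)               # a single point was passed
    return [math.dist(p, pt) for p in pts]      # one distance per point

single = euclidean_dist((3.0, 4.0), (0.0, 0.0))
many = euclidean_dist([(3.0, 4.0), (0.0, 1.0)], (0.0, 0.0))
print(single, many)
```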

MidlineIdentifier.utilis.ParseArgs(args)[source]

Parse arguments from the commandline.

Parameters:

args (list): List of arguments

Returns:

fad, fimg, args.diskClosing, args.diskOpening, outdir, sample

MidlineIdentifier.utilis.ScaleMinMax(x)[source]

Scale the input vector into the range between zero and one.

Parameters:

x (array_like)

Returns:

array_like: Scaled x
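
Min-max scaling maps each value v to (v - min) / (max - min). A minimal sketch follows; the name `scale_min_max` and the decision to raise an error for a constant vector are assumptions, since the documented function's behaviour in that case is not stated.

```python
def scale_min_max(x):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(x), max(x)
    if hi == lo:
        # Assumption: a constant vector has no defined scaling
        raise ValueError("cannot scale a constant vector")
    return [(v - lo) / (hi - lo) for v in x]

scaled = scale_min_max([2.0, 4.0, 6.0, 10.0])
print(scaled)
```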

MidlineIdentifier.utilis.grouped_obs(adata, groupby, method, layer=None, gene_symbols=None)[source]

Get the aggregated (summed or averaged) expression for each group.

Parameters:

adata (anndata.AnnData): Annotated data matrix.
groupby (str): The key of the observations grouping to consider.
method (str): Method used to aggregate the expression. Must be one of ['sum', 'mean']
layer (str, default: None): Key from adata.layers whose values will be used. If None, adata.X is used.
gene_symbols (list | None, default: None): Genes to aggregate. If None, the calculation is done for all genes

Returns:

pandas.DataFrame: A gene-by-group DataFrame
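
The aggregation can be sketched on a small cells x genes matrix with plain Python. The function name `aggregate_by_group` and the nested-dictionary result are assumptions for this illustration; the documented function works on an AnnData object and returns a pandas.DataFrame.

```python
from collections import defaultdict

def aggregate_by_group(matrix, groups, genes, method="mean"):
    """Aggregate a cells x genes matrix per group; returns {gene: {group: value}}."""
    if method not in ("sum", "mean"):
        raise ValueError("method must be 'sum' or 'mean'")
    sums = defaultdict(lambda: [0.0] * len(genes))
    sizes = defaultdict(int)
    for row, group in zip(matrix, groups):
        sizes[group] += 1
        sums[group] = [s + v for s, v in zip(sums[group], row)]
    result = {}
    for j, gene in enumerate(genes):
        result[gene] = {
            g: sums[g][j] / sizes[g] if method == "mean" else sums[g][j]
            for g in sums
        }
    return result

matrix = [[1, 0], [3, 2], [5, 4]]        # three cells, two genes
groups = ["P", "P", "D"]                 # group label of each cell
table = aggregate_by_group(matrix, groups, genes=["Sox9", "Col1a1"], method="mean")
print(table)
```

Summing rather than averaging gives pseudobulk-style counts per group.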
