[090c8c]: / src / __pycache__ / scotv2.cpython-38.pyc

U

ùqþc¼8ã@sldZddlZddlZddlZddlZddlmZddlm	Z	ddl
mZddlm
Z
mZGdd„deƒZdS)	a
Author: Pinar Demetci
Principal Investigator: Ritambhara Singh, Ph.D. from Brown University
08 August 2021
Updated: 23 February 2023
SCOTv2 algorithm: Single Cell alignment using Optimal Transport version 2
Correspondence: pinar_demetci@brown.edu, ritambhara@brown.edu
éN)Údijkstra)Ú
csr_matrix)Úkneighbors_graph)ÚStandardScalerÚ	normalizec@steZdZdZdd„Zdd„Zd$dd	„Zd%d
d„Zdd„Zdd„Z	d&dd„Z
d'dd„Zdd„Zd(dd „Z
d)d"d#„ZdS)*ÚSCOTv2a›

	SCOT algorithm for unsupervised alignment of single-cell multi-omic data.
	https://www.biorxiv.org/content/10.1101/2020.04.28.066787v2 (original preprint)
	https://www.liebertpub.com/doi/full/10.1089/cmb.2021.0446 (Journal of Computational Biology publication through RECOMB 2021 conference)

	Input: domain1, domain2 in form of numpy arrays/matrices, where the rows correspond to samples and columns correspond to features.
	Returns: aligned domain 1, aligned domain 2 in form of numpy arrays/matrices projected on domain 1

	Example use:
	# Given two numpy matrices, domain1 and domain2, where the rows are cells and columns are different genomic features:
	scot = SCOTv2([domain1, domain2])
	aligned_domain1, aligned_domain2 = scot.align(k=20, eps=1e-3)

	#If you can't pick the parameters k and e, you can try out our unsupervised self-tuning heuristic by running:
	scot= SCOT(domain1, domain2)
	aligned_domain1, aligned_domain2 = scot.align(selfTune=True)

	Required parameters:
	- k: Number of neighbors to be used when constructing kNN graphs. Default= min(min(n_1, n_2), 50), where n_i, for i=1,2 corresponds to the number of samples in the i^th domain.
	- e: Regularization constant for the entropic regularization term in entropic Gromov-Wasserstein optimal transport formulation. Default= 1e-3 
   
	Optional parameters:

	- normalize: Determines whether to normalize input data ahead of alignment. True or False (boolean parameter). Default = True.
	- norm: Determines what sort of normalization to run: "l2", "l1", "max", or "zscore". Default="l2"
	- mode: "connectivity" or "distance". Determines whether to use a connectivity graph (adjacency matrix of 1s/0s based on whether nodes are connected) or a distance graph (adjacency matrix entries weighted by distances between nodes). Default="connectivity"  
	- metric: Sets the metric to use while constructing nearest neighbor graphs. Some possible choices are "correlation" and "minkowski". "correlation" is Pearson's correlation and "minkowski" is equivalent to Euclidean distance in its default form (p=2). Default= "correlation".
	- verbose: Prints loss while optimizing the optimal transport formulation. Default=True
	- XontoY: Determines the direction of barycentric projection. True or False (boolean parameter). If True, projects domain1 onto domain2. If False, projects domain2 onto domain1. Default=True.

	Note: If you want to specify the marginal distributions of the input domains and not use uniform distribution, please set the attributes p and q to the distributions of your choice (for domain 1, and 2, respectively) 
			after initializing a SCOT class instance and before running alignment and set init_marginals=False in .align() parameters
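
The barycentric projection mentioned in the parameter list can be illustrated with a small, dependency-free sketch. It assumes a coupling matrix with one row per domain 1 sample and one column per domain 2 sample, and it maps each domain 2 sample to the coupling-weighted average of the domain 1 samples. The function name `barycentric_projection` and the toy values below are illustrative assumptions; the class itself works on numpy arrays and torch tensors.

```python
def barycentric_projection(coupling, anchor):
    """Project the second domain onto the first (illustrative sketch).

    coupling: n1 x n2 nested list of non-negative transport weights.
    anchor:   n1 x d nested list holding the first domain's samples.
    Returns an n2 x d nested list in the first domain's feature space.
    """
    n1, n2, d = len(coupling), len(coupling[0]), len(anchor[0])
    projected = []
    for j in range(n2):
        column = [coupling[i][j] for i in range(n1)]
        total = sum(column)  # total mass transported to sample j
        if total == 0:
            raise ValueError("sample %d receives no mass in the coupling" % j)
        # Weighted average of the anchor samples, one feature at a time.
        projected.append([
            sum(column[i] * anchor[i][f] for i in range(n1)) / total
            for f in range(d)
        ])
    return projected


# Toy data (hypothetical): 3 samples in domain 1, 2 samples in domain 2.
coupling = [[0.5, 0.0], [0.0, 0.25], [0.0, 0.25]]
domain1 = [[0.0, 0.0], [2.0, 0.0], [0.0, 4.0]]
projected_domain2 = barycentric_projection(coupling, domain1)
print(projected_domain2)  # [[0.0, 0.0], [1.0, 2.0]]
```

The first domain 2 sample receives mass only from the first domain 1 sample, so it lands exactly on it; the second is placed halfway between the other two.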
	cCsTt|ƒtkrt|ƒdks tdƒ‚||_g|_g|_g|_g|_g|_	g|_
g|_dS)NéayAs input, SCOTv2 requires a list, containing at least two numpy arrays to be aligned.  				Each numpy array/matrix corresponds to a dataset, with samples (cells) in rows and features (latent representations or genomic features) in columns. 				We recommend using latent representations (e.g. principal components for RNA-seq and topics - via cisTopic- for ATAC-seq/Methyl-seq).)ÚtypeÚlistÚlenÚAssertionErrorÚdataÚ	marginalsÚgraphsÚ
graphDistsÚ	couplingsZgwdistsÚflagsZaligned_data)Úselfr
©rú&/Users/CCMB/Desktop/SCOT/src/scotv2.pyÚ__init__;s zSCOTv2.__init__cCsDtt|jƒƒD].}|j|jd}t |¡|}|j |¡q|jS)Nr)Úrangerr
ÚshapeÚtorchÚonesrÚappend)rÚiZ	num_cellsZmarginalDistrrrÚ_init_marginalsJs
zSCOTv2._init_marginalsÚl2TcCs„|dkstdƒ‚tt|jƒƒD]^}|dkrHtƒ}| |j|¡|j|<q|dksX|dkr^d}nd}t|j|||d|j|<q|jS)N)Úl1rÚmaxÚzscorea~Norm argument has to be either one of 'max', 'l1', 'l2' or 'zscore'.		 If you would like to perform another type of normalization, please give SCOT the normalized data and set the argument 'normalize=False' when running the algorithm. 		 We have found l2 normalization to empirically perform better with single-cell sequencing datasets, including when using latent representations. r!Tér)ÚnormÚaxis)rrrr
rÚ
fit_transformr)rr#ÚbySamplerZscalerr$rrrÚ
_normalizeRszSCOTv2._normalizeéÚconnectivityÚcorrelationc
Cs\|dkstdƒ‚|dkrd}nd}tt|jƒƒD]$}|j t|j|||||d¡q0|jS)N)r)ÚdistancezENorm argument has to be either one of 'connectivity', or 'distance'. r)TF)Zn_neighborsÚmodeÚmetricÚinclude_self)rrrr
rrr)rÚkr,r-r.rrrrÚconstruct_graphcs"zSCOTv2.construct_graphcCsftt|jƒƒD]P}tt|j|ƒddd}t ||tjk¡}||||k<|j	 
|| ¡¡q|j	S)NF)ÚcsgraphÚdirectedÚreturn_predecessors)rrr
rrrÚnpÚnanmaxÚinfrrr )rrZshortestPathZMax_distrrrÚinit_graph_distancesoszSCOTv2.init_graph_distancescCs¶|dks|dkr&t |¡t |¡}}|||| ¡|||| ¡}|d|dd…df|ddd…f||dd…df|ddd…f ¡}d||dd||d||	}
||
}||||}}t|
ƒD]z}| ¡}t d|||¡dd||}t d|||¡dd||	}| ¡| ¡ ¡ ¡ 	¡||krìqhqì|dd…df|ddd…f||dd…df|ddd…f}|||fS)	a}
			Parameters
			----------
			- ecost: torch.Tensor of size [size_X, size_Y]
					 Exponential kernel generated from the local cost based on the current coupling.  
			- u: torch.Tensor of size [size_X[0]].
				 First dual potential defined on X.
			- v: torch.Tensor of size [size_Y[0]].
				 Second dual potential defined on Y. 
			- mass: torch.Tensor of size [1]. 
					Mass of the current coupling.
			- nits_sinkhorn: int. 
							 Maximum number of iterations to update Sinkhorn potentials in inner loop.
			- tol_sinkhorn: float
							Tolerance on convergence of Sinkhorn potentials.

			Returns
			----------
			u: torch.Tensor of size [size_X[0]]
			   First dual potential of Sinkhorn algorithm
			v: torch.Tensor of size [size_Y[0]]
			   Second dual potential of Sinkhorn algorithm
			logpi: torch.Tensor of size [size_X, size_Y]
				   Optimal transport plan in log-space.
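
For intuition about the dual potentials `u` and `v` described above, here is a minimal sketch of the standard balanced Sinkhorn scaling loop in pure Python. This is a simplified stand-in, not the solver in this file: the method documented here handles the unbalanced case with relaxed marginals, whereas the sketch enforces both marginals exactly. The function name `sinkhorn`, its defaults, and the toy inputs are assumptions made for illustration.

```python
import math


def sinkhorn(cost, a, b, eps=0.5, n_iters=500, tol=1e-12):
    """Balanced entropic optimal transport via Sinkhorn scaling (sketch).

    cost: n x m nested list of pairwise costs.
    a, b: marginal weights of length n and m, each summing to 1.
    Returns the scaling vectors u, v and the transport plan.
    """
    n, m = len(a), len(b)
    # Gibbs kernel derived from the cost matrix.
    kernel = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        u_prev = u
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
        # Stop once the row scalings have stopped changing.
        if max(abs(x - y) for x, y in zip(u, u_prev)) < tol:
            break
    plan = [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]
    return u, v, plan


# Toy problem (hypothetical values): two points on each side.
cost = [[0.0, 1.0], [1.0, 0.0]]
a = [0.3, 0.7]
b = [0.6, 0.4]
u, v, plan = sinkhorn(cost, a, b)
row_sums = [sum(row) for row in plan]
col_sums = [sum(plan[i][j] for i in range(len(a))) for j in range(len(b))]
print(row_sums, col_sums)
```

At convergence the row sums of the plan approximate `a` and the column sums approximate `b`; a smaller `eps` gives a sharper plan at the cost of slower convergence.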
			Nrgà?g@zij,i->jgð¿çð?úij,j->i)
rÚ	ones_likeÚsumrÚcloneÚeinsumÚlogÚabsr Úitem)rÚecostÚuÚvÚaÚbZmassÚepsÚrhoÚrho2Ú
nits_sinkhornÚtol_sinkhornr/ÚzÚjZu_prevÚpirrrÚ_exp_sinkhorn_solver{s,P(""$DzSCOTv2._exp_sinkhorn_solverç{®Gáz„?r8Né¸çíµ ÷ư>cCs4|dkr|}|dd…df|ddd…f| ¡| ¡ ¡}t |¡}
d\}}t|ƒD]Î}| ¡}
| ¡}t d|t d||¡¡}t |||dd…df|ddd…fd ¡¡}tj|ddtj|dd}}t d|d	|¡}t d
|d	|¡}|dd…df|ddd…fd	|||}|tdƒkr`||t |||d ¡¡}|tdƒkr||t |||d ¡¡}||| 	¡}|ddkrºt
d
|ƒ| ||||||||||
|¡\}}}d}t t 
|¡¡rød}|| ¡ ¡|}||
 ¡ ¡ ¡|	krZq,qZ||fS)N)NNz	ij,kj->ikz	kl,jl->kjg»½×Ùß|Û=r")Údimrr9rzkl,l->kÚInfé
zUnbalanced GW step:TF)r;ÚsqrtrÚ
zeros_likerr<r=r>ÚfloatÚexpÚprintrNÚanyÚisnanr?r r@)rrDÚdxrEÚdyrFrGrHÚ	nits_planÚtol_planrIrJrMZpi_prevÚupZvprÚmpZdistxyZkl_piÚmuÚnuZdistxxZdistyyZlcostrAÚflagrrrÚexp_unbalanced_gw¨s:4
60""
$zSCOTv2.exp_unbalanced_gwc
Cs|r|j||d| ¡tdƒ|j|||d| ¡tt|jƒdƒD]²}
tdƒt 	|j
d¡t 	|j
|
d¡}}t 	|jd¡t 	|j|
d¡}
}|j||
|||||	ddddd	\}}|j
 |¡|j |¡|d
krLtd|||	f›dƒ‚qL|j
S)
N)r#r&z&computing intra-domain graph distances)r/r,r-r"z#running pairwise dataset alignmentsrrPrQ)rFrGrHr^r_rIrJFz4Solver got NaN plan with params (eps, rho, rho2)  = z. Try increasing argument eps)r'rrYr0r7rrr
rZTensorrrrerrrÚ	Exception)rrr#r&r/r,r-rFrGrHrrDrEr\r]ÚcouplingrdrrrÚfind_correspondencesÒs$&&$ÿzSCOTv2.find_correspondencescCst|jdg}tdt|jƒƒD]R}t |j| ¡¡}tj|dd}t ||dd…df|jd¡}| 	|¡q|S)Nrr")r$)
r
rrrr4Ú	transposeÚnumpyr;Úmatmulr)rZaligned_datasetsrrgÚweightsZprojected_datarrrÚbarycentric_projectionés"zSCOTv2.barycentric_projectionrTcCsvt|jƒ}g}g}t|dƒD](}|j|t |j|¡d|j|<qt|ƒD]°}|j||j|j |j|j|j|k¡|j| |j|j|j|k¡}t 	| 
¡¡}t |dk¡}	d||	||	<t t 
|t t |¡d¡¡¡}
| |
|¡qPg}g}t|dƒD]t}| t t 
t t t |j|¡d¡¡|j|¡¡¡| t t 
|j|t t |j|¡d¡¡¡¡qtd}
|d||d}|d||d}td|dƒD]D}t |
|j|f¡}
t||||||ƒ}||||}qÊtj |¡\}}|d}t |d¡}t 
|t 
|t |¡¡¡}tj |¡\}}|d}t |d¡}t 
|t 
|t |¡¡¡}t 
|t 
|
|¡¡}tj |¡\}}}dg}t|dƒD] }| ||tt|ƒ¡qÈ|dd…dt…ft |¡dd…dt…f}}t 
||¡}t 
||¡}g}t|dƒD]$}| |||||d…¡qB| |¡|S)z`
		Co-embeds datasets in a shared space.
		Implementation is based on Cao et al 2022 (Pamona)
		r"réÿÿÿÿgê-™—q=gà¿N)rr
rrr4rrÚTÚmultiplyÚarrayÚtodenseÚwhereÚdiagÚdotrrrirgÚvstackÚ
block_diagÚlinalgÚeigÚsvdZ
output_dim)rÚLambdaÚout_dimÚ
n_datasetsZH0ÚLrZ
graph_dataÚWZ	index_posÚDZSigma_xZSigma_yZS_xyZS_xxZS_yyrCÚQÚVZH_xZH_yÚHÚUÚsigmaÚnumÚfxÚfyÚintegrated_datarrrÚcoembed_datasetsòs^
&, ÿ":80"
zSCOTv2.coembed_datasetsÚ	embeddingc
Csb|
dkstdƒ‚|j|||||||||	d	td|jƒ|
dkrP|j||d}
n| ¡}
|
|_|
S)N)r‹Zbarycentricz®The input to the parameter 'projMethod' needs to be one of 								'embedding' (if co-embedding them in a new shared space) or 'barycentric' (if using barycentric projection))	rr#r&r/r,r-rFrGrHÚFLAGSr‹)r{r|)rrhrYrrŠrmr‰)rrr#r&r/r,r-rFrGrHÚ
projMethodr{r|r‰rrrÚalign2szSCOTv2.align)rT)r(r)r*)rOr8NrPrQrPrQ)	TrTr(r)r*rOr8N)r8rT)TrTr(r)r*rOr8Nr‹r8rT)Ú__name__Ú
__module__Ú__qualname__Ú__doc__rrr'r0r7rNrerhrmrŠrŽrrrrrs"

-
*
	
@r)r’rjr4rZotÚscipyZscipy.sparse.csgraphrÚscipy.sparserÚsklearn.neighborsrÚsklearn.preprocessingrrÚobjectrrrrrÚ<module>s