Briefings in Bioinformatics
Yiru Pan et al.
Corresponding author. Email: mjwang@mail.hzau.edu.cn, zhangzeyu@mail.hzau.edu.cn
‡ Yiru Pan, Xingyu Ji and Jiaqi You contributed equally to this work.
CSGDN: Contrastive Signed Graph Diffusion Network for Predicting Crop Gene-phenotype Associations
Abstract
Predicting positive and negative associations between genes and phenotypes helps to elucidate the mechanisms underlying complex traits in organisms. The transcription and regulation activity of specific genes is adjusted across cell types, developmental stages, and physiological states. Obtaining positive/negative associations between genes and traits faces two problems: 1) high-throughput DNA/RNA sequencing and phenotyping are expensive and time-consuming because large sample sizes must be processed; 2) experiments introduce both random and systematic errors, and calculations or predictions made with software or models may add further noise. To address these two issues, we propose a Contrastive Signed Graph Diffusion Network, CSGDN, which learns robust node representations from fewer training samples to achieve higher link prediction accuracy. CSGDN employs a signed graph diffusion method to uncover the underlying regulatory associations between genes and phenotypes. Then, stochastic perturbation strategies are used to create two views of both the original and the diffusion graph. Lastly, a multi-view contrastive learning loss is designed to unify the node representations learned from the two views, resisting interference and reducing noise. We conduct experiments to validate the performance of CSGDN on three crop datasets: Gossypium hirsutum, Brassica napus, and Triticum turgidum. The results demonstrate that the proposed model outperforms state-of-the-art methods by up to 9.28% in AUC for link sign prediction on the G. hirsutum dataset. The source code of our model is available at https://github.com/Erican-Ji/CSGDN.
keywords:
gene-phenotype associations, graph neural networks, signed bipartite networks, signed graph neural networks
1 INTRODUCTION
Predicting positive/negative associations between genes and phenotypes has long been recognized as important in biology, with broad applications in crop breeding li2017natural . By constructing associations between genomic variations (such as single nucleotide polymorphisms, SNPs) and phenotypes within a large population, genome-wide association studies (GWAS) uncover the genetic basis underlying the formation of phenotypes visscher201710 ; pasaniuc2017dissecting ; tam2019benefits ; qi2024genetic . In many crops such as rice, maize, wheat, cotton and rapeseed, GWAS has been widely applied to identify quantitative trait loci (QTLs) related to yield liu2023high ; lin2024systematic ; zhang2023ggamma ; miller2019variation ; ma2018resequencing , quality ma2018resequencing ; you2023regulatory ; zhao2022genomic ; wang2022genomic , resistance zhang2023ggamma ; zhao2022genomic ; tian2023allelic ; wang2024tabhlh27 ; li2020phenomics ; ma2021combination , and other agronomic phenotypes miller2019variation ; zhang2024integrating ; tang2021genome , and to further help identify candidate genes. In sorghum, an alkaline-tolerance-related gene named Alkali Tolerance 1 (AT1), which encodes a heterotrimeric G protein subunit (Gγ), was located through GWAS analysis of 352 representative accessions [10]. The nonfunctional mutant can improve yield in sorghum, rice and maize in alkaline soils. In rapeseed, Zhang et al. (2024) located a previously unknown gene on chromosome A02, named BnaA02.SE zhang2023ggamma . Overexpression of this gene contributes to longer and larger siliquae, indicating that the gene is positively associated with rapeseed yield.
Compared with validating candidate genes through molecular biology approaches, strategies that integrate transcriptome data, such as GWAS-eQTL colocalization methods (e.g., coloc, eCAVIAR) giambartolomei2014bayesian ; hormozdiari2016colocalization and Mendelian randomization methods (e.g., SMR/HEIDI) zhu2016integration ; commonmind2017erratum , help evaluate gene-phenotype associations at an earlier stage. They narrow down the number of candidate genes and greatly reduce the time and cost required to create mutant materials. Transcriptome-wide association studies (TWAS) build expression prediction models based on BLUP (Best Linear Unbiased Prediction) robinson1991blup and BSLMM (Bayesian Sparse Linear Mixed Model) zhou2013polygenic , enabling the prediction of large-scale expression levels from small-scale transcriptomes, which further addresses challenges such as difficult or costly sampling for RNA sequencing zhu2016integration ; wainberg2019opportunities . However, the number of genes that can be accurately imputed is still limited by the training cohort size and the quality of the training data gusev2016integrative . Moreover, the TWAS strategy still relies on the framework of association analysis and lacks accurate predictive ability for genes with low variant density.
Additionally, traditional methods can introduce various types of errors and noise. For instance, systematic bias, coverage bias, and batch effects often affect Next-Generation Sequencing (NGS) data slatko2018overview . These biases are introduced by the sequencing platform, genome content and experimental variability taub2010overcoming . Experimental design and the choice of sample replication depend on the species, which to some extent affects the bias of quantitative results conesa2016survey . Accurate TWAS results gusev2016integrative also depend on high-quality training data. Therefore, it is crucial to develop models that predict positive/negative gene-phenotype associations while minimizing the biases and errors produced by traditional experimental methods.
As mentioned, there are two major issues in predicting the positive/negative regulation associations between gene and phenotype:
(1) The high cost associated with large sample sizes. Traditional methods like GWAS and TWAS require large sample sizes to detect associations, which increases costs for sequencing and data analysis. Larger samples also demand more time and resources, making the process expensive and slow.
(2) Data noise from experiments and methods. Experiments often introduce biases such as systematic errors and batch effects. Noise makes it difficult to accurately detect the associations between gene and phenotype.
In response to these two issues, we propose a Contrastive Signed Graph Diffusion Network, CSGDN, for predicting positive/negative regulation associations. In this approach, gene-phenotype associations are modeled as a signed bipartite graph, where gene and phenotype nodes are connected by either positive or negative edges, indicating positive or negative regulation of phenotypes by genes. For the first issue, we apply signed graph diffusion to uncover hidden associations between genes and phenotypes. The diffusion method eases the requirement for large sample sizes by capturing complex associations from a smaller dataset, reducing the overall cost. For the second issue, we employ stochastic perturbation techniques to generate two views of both the original and the diffusion graph. A multi-view contrastive learning loss unifies the node representations from both views, helping to reduce interference and noise. In summary, our model addresses both the cost and the noise challenges described above. CSGDN can predict gene-phenotype associations across crop species using only small-sample associations, and shows strong robustness against interference from experimental noise.
To evaluate the effectiveness of CSGDN, we conduct extensive experiments on three crop datasets: Gossypium hirsutum, Brassica napus, and Triticum turgidum. Our results demonstrate that CSGDN consistently outperforms baseline models from both unsigned GNNs and signed GNNs. For unsigned GNNs (GCN, GAT, GRACE), CSGDN achieves improvements in AUC of up to 6.66%, 7.13%, and 7.26%, respectively. For signed GNNs (SGCN, SGCL, SGNNMD), CSGDN outperforms the baselines with AUC gains of 5.96%, 5.29%, and 9.28%, respectively. These results underscore CSGDN’s superior ability to enhance link sign prediction tasks across diverse crop datasets. We randomly sample 80% of the datasets to evaluate CSGDN’s performance in addressing the challenge of small sample sizes. The results show that our model outperforms the baselines on AUC, Binary-F1, Micro-F1, and Macro-F1, with improvements of up to approximately 10% across all metrics. Specifically, CSGDN enhances Binary-F1 by up to 9.51% with 10% perturbed edges and improves Micro-F1 by up to 12.64% with 20% perturbed edges compared to SGNNMD. These experiments clearly highlight the effectiveness of our model CSGDN.
Overall, our contributions are summarized as follows:
• We propose a model for gene-phenotype association prediction, which outperforms baselines in terms of link sign prediction performance.
• By combining the diffusion method and the contrastive learning framework, our model effectively addresses the challenges of high cost and noise in positive/negative regulation association prediction.
2 RELATED WORK
Association Prediction. In recent years, significant progress has been made in the field of association prediction. SGNNMD proposed by Zhang et al. zhang2022sgnnmd , extracts subgraphs around miRNA-disease pairs from the signed bipartite networks and utilizes biological features to accurately predict deregulation types of miRNA-disease associations. The framework HGATLDA proposed by Zhao et al. zhao2022heterogeneous , effectively integrates meta-path-based heterogeneous graphs and attention mechanisms to improve the prediction of lncRNA–disease associations, addressing limitations in feature fusion, complex associations, and data imbalance. MTRD proposed by Zhang et al. zhang2022learning integrates multi-scale topology embeddings and node attributes with advanced learning mechanisms, achieving superior performance in predicting drug-disease associations and effectively identifying potential candidate diseases for specific drugs. BGF-CMAP proposed by Guo et al. guo2024biolinguistic integrates advanced techniques like Word2vec and graph embedding algorithms to extract both sequence and interaction features, significantly enhancing the prediction of complex circRNA–miRNA associations. MDGF-MCEC proposed by Wu et al. wu2022mdgf , utilizes a multi-view dual attention GCN and cooperative ensemble learning to predict circRNA-disease associations, achieving enhanced feature representation and classification through multi-similarity relation graphs and attention-based feature adjustment. Although neural networks have shown great potential in various association prediction fields, there is still insufficient progress in developing direct prediction methods for gene-phenotype associations, particularly due to the high costs associated with large sample sizes and the noise introduced throughout the experimental processes.
Signed Graph Neural Networks. Recently, neural approaches have gained traction in signed graph representation learning; we refer to such methods collectively as Signed Graph Neural Networks (SGNNs) derr2018signed ; zhang2023contrastive ; li2024se ; he2024mitigating ; zhang2024dropedge . SiNE wang2017signed pioneered the use of deep learning in this area by leveraging triangle structures with mixed edge polarities, optimizing an objective grounded in balance theory to generate embeddings. SGCN derr2018signed extended GCN kipf2016semi to signed graphs, incorporating balance theory to discern multi-hop relationships. Models like SiGAT huang2019signed , SNEA li2020learning , SDGNN huang2021sdgnn , and SGCL shu2021sgcl , rooted in attention mechanisms vaswani2017attention , further enrich this landscape.
Beyond the aforementioned approaches, some SGNNs claim robustness against noise. SGCL shu2021sgcl was the first to adopt the contrastive learning paradigm in signed graph analysis; it learns node representations that are invariant under minor random perturbations, which helps to enhance the robustness of the model. RSGNN zhang2023rsgnn is another attempt to enhance the robustness of SGNNs. It enables the model to learn a cleaner structure that is less influenced by noise. However, because learning a new adjacency matrix involves substantial computational overhead, this approach is only feasible for smaller node sets. Given the vast number of genes, it is impractical for gene-phenotype datasets.
3 MATERIALS
3.1 Datasets
We use datasets from three species, G. hirsutum you2023regulatory , B. napus tang2021genome and T. turgidum yang2024combined , as input to our model. For each species, we run a TWAS pipeline to obtain associations between genes and phenotypes. First, phenotype data are standardized using the qqnorm function in R, and Principal Component Analysis (PCA) is conducted on the population structure data using the TASSEL software bradbury2007tassel . The FaST-LMM software lippert2011fast is then used to perform GWAS, taking phenotype data and population structure into account. We use the Fusion software gusev2016integrative to perform TWAS and obtain TWAS results for associations between a subset of genes and phenotypes. The sign of a TWAS result indicates positive or negative regulation. Note that the TWAS pipeline can only compute associations for a subset of gene-phenotype pairs. Within each species, we separate the genes that TWAS can associate with a phenotype from those it cannot, and input the two groups into the model separately.
G. hirsutum you2023regulatory . The phenotype data include four phenotypes at the 4 DPA (days post anthesis) stage: fiber length (FL), fiber strength (FS), fiber elongation rate (FE), and fiber uniformity (FU). The reference genome is from Wang et al. wang2019reference . The TWAS results contain 523 associations between genes and FL, 1129 associations between genes and FS, 1521 associations between genes and FE, and 509 associations between genes and FU.
B. napus tang2021genome . The B. napus dataset covers the 20 DPA and 40 DPA stages. The only phenotype is seed oil content (SOC). The reference genome (B. napus v4.1) can be downloaded from Genoscope (http://www.genoscope.cns.fr/brassicanapus/). There are 605 associations between B. napus genes and SOC at the 20 DPA stage and 148 at the 40 DPA stage. We regard the different stages as distinct phenotypes.
T. turgidum yang2024combined . The T. turgidum dataset includes four phenotypes: biomass (BM), root length (RL), root area (RA), and root volume (RV). We use WEW v2.1 as the reference genome muslu2021comparative . The TWAS results include 36 associations with these four phenotypes, with BM having 1 association, RA having 2, RL having 17, and RV having 15.
3.2 Gene Sequences Similarity
To complement our model, we compute the similarity between gene sequences and use it as node features. We use BLAST+ 2.9.0 camacho2009blast+ to align the sequences of gene pairs and obtain a similarity score for each pair. These scores form a similarity matrix whose $(i,j)$ entry is the similarity score between genes $i$ and $j$. We use the similarity matrix as the initial feature input for genes. To be specific, the input feature of gene $i$ is the $i$-th row of the similarity matrix.
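As a minimal sketch of how such row features could be assembled, the snippet below fills a dense matrix from pairwise scores. The function name `similarity_features`, the `(query, subject, score)` tuple layout, the choice to keep the best score per pair, and the default of 0 for pairs without a hit are all assumptions for illustration; the source does not specify them.

```python
def similarity_features(gene_ids, pairwise_scores):
    """Build a gene-by-gene similarity matrix; row i is the feature of gene i.

    gene_ids: ordered list of gene identifiers.
    pairwise_scores: iterable of (query, subject, score) tuples
        (hypothetical layout, e.g. parsed from alignment output).
    """
    index = {gene: i for i, gene in enumerate(gene_ids)}
    n = len(gene_ids)
    matrix = [[0.0] * n for _ in range(n)]  # assumption: no hit -> 0.0
    for query, subject, score in pairwise_scores:
        if query in index and subject in index:
            i, j = index[query], index[subject]
            # Assumption: when a pair has several hits, keep the best score.
            matrix[i][j] = max(matrix[i][j], float(score))
    return matrix


# Toy usage with made-up identifiers and scores.
genes = ["geneA", "geneB", "geneC"]
hits = [("geneA", "geneA", 100.0), ("geneA", "geneB", 42.5), ("geneB", "geneA", 40.0)]
features = similarity_features(genes, hits)
feature_of_geneA = features[0]
```

Each row has one entry per gene, so the feature dimension equals the number of genes in the matrix.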
4 PRELIMINARY
Notations | Descriptions |
---|---|
A signed graph | |
A set of gene nodes | |
A set of phenotype nodes | |
A set of edges, where denotes positive edges and denotes negative edges | |
A subset of edges calculated through TWAS processes, defined as such that , where and | |
The set of gene nodes associated with TWAS, defined as | |
A similarity matrix of genes | |
The adjacency matrix of the graph | |
The outdegree diagonal matrix of the graph | |
The graphs obtained by the graph augmentation method | |
The representation of the -th node for the -th layer in the -th augmented graph | |
The final representation of the -th node in the positive graph of the -th augmented graph | |
The final representation of the -th node in the negative graph of the -th augmented graph |
We construct a signed bipartite graph , where a matrix set of gene and a set of phenotypes are given. The edge set represents deregulation associations, where and . Each edge belongs to one regulation type: a positive edge indicates that the specific gene up-regulates the phenotype, while a negative edge represents down-regulation.
We define a subset of the edge set as , which consists of edges that can be calculated through TWAS processes. Consequently, the node set is defined as the set of genes that correspond to the edges in : , where . The gene set associated with TWAS can then be expressed as:
Given , the goal is to learn a function to map nodes and to low-dimensional embeddings and , where is the dimension of node embeddings. These embeddings should be useful for downstream tasks such as link sign prediction. This setup forms a crucial framework for studying the associations between genes and phenotypes, allowing for the prediction of up-regulation or down-regulation patterns between genes and phenotypes.
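The signed bipartite graph described above can be sketched as a small data structure. This is an illustrative sketch only: the function name `build_signed_bipartite_graph`, the `(gene, phenotype, score)` record layout, and the rule of skipping zero-valued scores are assumptions, not part of the original pipeline.

```python
def build_signed_bipartite_graph(twas_records):
    """Split signed gene-phenotype associations into positive and negative edges.

    twas_records: iterable of (gene, phenotype, score) tuples (hypothetical
        layout); the sign of `score` gives the regulation direction.
    """
    genes, phenotypes = set(), set()
    positive_edges, negative_edges = set(), set()
    for gene, phenotype, score in twas_records:
        if score == 0:
            continue  # assumption: a zero score carries no sign information
        genes.add(gene)
        phenotypes.add(phenotype)
        if score > 0:
            positive_edges.add((gene, phenotype))  # up-regulation
        else:
            negative_edges.add((gene, phenotype))  # down-regulation
    return {
        "genes": genes,            # genes that have at least one signed edge
        "phenotypes": phenotypes,
        "positive_edges": positive_edges,
        "negative_edges": negative_edges,
    }


# Toy usage with made-up gene names and scores.
records = [("g1", "FL", 3.2), ("g2", "FL", -2.7), ("g1", "FS", -1.1), ("g3", "FS", 0.0)]
graph = build_signed_bipartite_graph(records)
```

Genes that never appear in a signed edge (here `g3`) stay outside the graph, which mirrors the separation between genes with and without TWAS associations.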
5 PROPOSED METHOD

In this section, we introduce the Contrastive Signed Graph Diffusion Network (CSGDN) model to tackle the aforementioned challenges, including enhancing structural information for small samples to reduce the expense in TWAS analysis and reducing noise from the whole process. As mentioned in MATERIALS section, we divide genes in each species into two parts: one part is associated with phenotypes through the TWAS process, and the other part lacks such associations.
The main framework of the proposed CSGDN model is to encode genes associated with TWAS and phenotypes, which is illustrated in Fig.2. The main framework includes four key components: (1) graph diffusion; (2) graph augmentation; (3) the encoder; and (4) contrastive learning. To be specific, we first use a graph diffusion method to uncover the potential relationships between genes and phenotypes, resulting in a diffusion graph. Then, we randomly remove edges from the original graph and the diffusion graph to obtain augmented graphs. After encoding these augmented graphs with the encoder, we obtain node representations to calculate the contrastive loss.
In addition, we train an MLP to mimic the encoding that the main framework produces for TWAS genes, so that genes lacking TWAS associations can also be encoded.
5.1 Signed Graph Diffusion
Collecting large-scale crop sample data is costly. It is therefore necessary to develop methods that maintain high prediction accuracy even with few training samples. In biology, the effect of genes on phenotypes involves both positive and negative regulation, meaning that genes can either promote or inhibit phenotypes through complex mechanisms. This forms a bipartite signed graph, which consists of positive and negative edge types and two disjoint sets of nodes. When crop sample data are scarce, the resulting signed graph is relatively sparse.
Traditional ranking models, such as PageRank page1999pagerank , can mine potential relationships between nodes to increase the number of edges, but they are only suitable for graphs with a single, positive edge type. Wu et al. wu2016troll proposed the Troll-Trust model (TR-TR), a variant of PageRank, but it does not explain the complex relationships between negative and positive edges. To better handle signed graphs with two edge types, we use a diffusion operation based on the previously proposed Signed Random Walk with Restart (SRWR) algorithm to uncover the potential associations between genes and phenotypes, obtaining a diffusion graph. The diffusion graph provides richer structural information, which alleviates the sparsity of the graph data and also provides a new graph structure for the subsequent contrastive learning operations.
To be specific, the diffusion method is based on balance theory. In balance theory, there are four types of relationships between nodes: friend’s friend, friend’s enemy, enemy’s friend, and enemy’s enemy. On this basis, we introduce a signed random surfer for the bipartite signed graph. The surfer randomly surfs between nodes and traverses their relationships. When the surfer encounters a positive edge, it keeps its current sign; when it encounters a negative edge, it flips its sign (from + to -, or from - to +). Initially, the surfer carries a positive sign + at the seed node $s$. If the surfer is at node $u$ and the restart probability is $c$, the surfer moves to a randomly chosen neighboring node with probability $1-c$ and restarts at the seed node with probability $c$.
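$r_u^{+}$ is the probability that a positive-sign surfer is at node $u$ when starting from the seed node $s$, and $r_u^{-}$ is the probability that a negative-sign surfer is at node $u$ when starting from $s$. Writing $c$ for the restart probability, we formulate $r_u^{+}$ and $r_u^{-}$ by Equation 1, where $\overleftarrow{N}_u^{+}$ and $\overleftarrow{N}_u^{-}$ are the sets of in-neighbors of node $u$ connected through positive and negative edges, respectively, and $\overrightarrow{N}_v$ is the set of out-neighbors of node $v$. Note that node $s$ is the seed node, and $\mathbb{1}(u=s)$ is 1 if $u$ is the seed node $s$ and 0 otherwise.

$$
\begin{aligned}
r_u^{+} &= (1-c)\left(\sum_{v\in\overleftarrow{N}_u^{+}}\frac{r_v^{+}}{|\overrightarrow{N}_v|}+\sum_{v\in\overleftarrow{N}_u^{-}}\frac{r_v^{-}}{|\overrightarrow{N}_v|}\right)+c\,\mathbb{1}(u=s),\\
r_u^{-} &= (1-c)\left(\sum_{v\in\overleftarrow{N}_u^{-}}\frac{r_v^{+}}{|\overrightarrow{N}_v|}+\sum_{v\in\overleftarrow{N}_u^{+}}\frac{r_v^{-}}{|\overrightarrow{N}_v|}\right).
\end{aligned}
\tag{1}
$$

We use the adjacency matrix $A$ of the signed graph to vectorize Equation 1. Suppose $A_{uv}$ is positive or negative for the signed edge $u \to v$ and 0 otherwise. $D$ is the out-degree diagonal matrix, where $D_{uu} = \sum_{v}|A_{uv}|$. The semi-row normalized matrix is $\tilde{A} = D^{-1}A$. We decompose $\tilde{A}$ into two matrices: a positive semi-row normalized matrix ($\tilde{A}_{+}$) and a negative semi-row normalized matrix ($\tilde{A}_{-}$), such that $\tilde{A} = \tilde{A}_{+} - \tilde{A}_{-}$.

Based on the above equations, we can vectorize Equation 1 as follows:

$$
\begin{aligned}
\mathbf{r}^{+} &= (1-c)\left(\tilde{A}_{+}^{\top}\mathbf{r}^{+}+\tilde{A}_{-}^{\top}\mathbf{r}^{-}\right)+c\,\mathbf{q},\\
\mathbf{r}^{-} &= (1-c)\left(\tilde{A}_{-}^{\top}\mathbf{r}^{+}+\tilde{A}_{+}^{\top}\mathbf{r}^{-}\right),
\end{aligned}
\tag{2}
$$

where $\tilde{A}_{+}^{\top}$ and $\tilde{A}_{-}^{\top}$ are the transposed semi-row normalized adjacency matrices for positive and negative edges, respectively, and $\mathbf{q}$ is the seed vector for node $s$, whose $s$-th entry is 1 and whose other entries are 0. Initially, set $\mathbf{r}^{+} = \mathbf{q}$, $\mathbf{r}^{-} = \mathbf{0}$, and define $\mathbf{r}' = [\mathbf{r}^{+}; \mathbf{r}^{-}]$.

Since the simple balance theory with four types of relationships cannot explain complex imbalanced relationships, it is beneficial to introduce the balance attenuation factors $\beta$ and $\gamma$. When a negative walker encounters a negative edge at a node, its sign switches to positive with probability $\beta$, or remains negative with probability $1-\beta$. Similarly, when the negative walker encounters a positive edge at a node, its sign remains negative with probability $\gamma$, or switches to positive with probability $1-\gamma$. The diffusion model with balance attenuation factors becomes:

$$
\begin{aligned}
\mathbf{r}^{+} &= (1-c)\left(\tilde{A}_{+}^{\top}\mathbf{r}^{+}+\beta\,\tilde{A}_{-}^{\top}\mathbf{r}^{-}+(1-\gamma)\,\tilde{A}_{+}^{\top}\mathbf{r}^{-}\right)+c\,\mathbf{q},\\
\mathbf{r}^{-} &= (1-c)\left(\tilde{A}_{-}^{\top}\mathbf{r}^{+}+\gamma\,\tilde{A}_{+}^{\top}\mathbf{r}^{-}+(1-\beta)\,\tilde{A}_{-}^{\top}\mathbf{r}^{-}\right).
\end{aligned}
\tag{3}
$$

By repeatedly computing $\mathbf{r}^{+}$ and $\mathbf{r}^{-}$, we can concatenate them into $\mathbf{r} = [\mathbf{r}^{+}; \mathbf{r}^{-}]$. We then compute the error $\delta$ between the current iteration result $\mathbf{r}$ and the previous iteration result $\mathbf{r}'$, where $\delta = \lVert \mathbf{r} - \mathbf{r}' \rVert$, and update $\mathbf{r}' \leftarrow \mathbf{r}$ for the next iteration. The iteration stops when the error $\delta$ is smaller than a tolerance $\epsilon$.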
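The iteration described above can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the released implementation: the function name `srwr`, the parameter names `c`, `beta` and `gamma`, the default values, the edge-list input format, and the use of an L1 error are all illustrative choices.

```python
def srwr(edges, nodes, seed, c=0.15, beta=0.5, gamma=0.5, tol=1e-9, max_iter=1000):
    """Signed random walk with restart from one seed node.

    edges: list of directed (source, target, sign) with sign +1 or -1;
        an undirected bipartite edge is passed in both directions.
    Returns two dicts: positive and negative scores for every node.
    """
    out_degree = {u: 0 for u in nodes}
    for source, _, _ in edges:
        out_degree[source] += 1

    r_pos = {u: 1.0 if u == seed else 0.0 for u in nodes}  # surfer starts positive
    r_neg = {u: 0.0 for u in nodes}
    for _ in range(max_iter):
        new_pos = {u: (c if u == seed else 0.0) for u in nodes}  # restart term
        new_neg = {u: 0.0 for u in nodes}
        for source, target, sign in edges:
            step = (1.0 - c) / out_degree[source]
            if sign > 0:
                # Positive edge: a positive surfer stays positive; a negative
                # surfer stays negative w.p. gamma and flips w.p. 1 - gamma.
                new_pos[target] += step * (r_pos[source] + (1.0 - gamma) * r_neg[source])
                new_neg[target] += step * gamma * r_neg[source]
            else:
                # Negative edge: a positive surfer turns negative; a negative
                # surfer flips to positive w.p. beta and stays w.p. 1 - beta.
                new_neg[target] += step * (r_pos[source] + (1.0 - beta) * r_neg[source])
                new_pos[target] += step * beta * r_neg[source]
        # L1 distance between consecutive iterates (assumed error measure).
        delta = sum(abs(new_pos[u] - r_pos[u]) + abs(new_neg[u] - r_neg[u]) for u in nodes)
        r_pos, r_neg = new_pos, new_neg
        if delta < tol:
            break
    return r_pos, r_neg


# Toy graph: g1 up-regulates p1, g2 down-regulates p1 (edges in both directions).
toy_edges = []
for gene, phenotype, sign in [("g1", "p1", +1), ("g2", "p1", -1)]:
    toy_edges.append((gene, phenotype, sign))
    toy_edges.append((phenotype, gene, sign))
toy_nodes = ["g1", "g2", "p1"]
r_pos, r_neg = srwr(toy_edges, toy_nodes, seed="g1", beta=1.0, gamma=1.0)
```

In this toy graph, with `beta = gamma = 1`, gene `g2` is reachable from `g1` only through an odd number of negative edges, so all of its score mass ends up in `r_neg`. This is the "friend's enemy" case of balance theory.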
For each node, the algorithm return an and an . We combine all and into a positive matrix and a negative matrix . Note that the relationship between edges () and () may differ, and the sign may even be opposite. This means may occur.
To address this and generate an undirected graph for subsequent process, we transpose and to and respectively where ( or ). For each node, we take the maximum value between and to form and . Then, we calculate for each node at the corresponding position to generate the symmetric matrix, which can also be called the diffuse matrix .
5.2 Graph Augmentation
Generating different views is crucial in contrastive learning methods. In this study, we primarily focus on randomly removing edges on both the original graph and the diffusion graph to obtain different views. For each graph, we can construct a matrix , where each column represents one edge (such as u->v) containing only existing edges. Then we generate a random masking matrix drawn from a uniform distribution over , denoted as . Setting a threshold , we reset to denote the deletion of the corresponding edge if . The resulting matrix can be computed as: , where is Hadamard product.
We perform this process twice on both the original graph and the diffusion graph separately, randomly removing 10% of the edges each time. Therefore, we can obtain four augmented graphs, denoted as , , , and , respectively.
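A minimal sketch of this edge-dropping augmentation follows. The helper names `drop_edges` and `make_views` and the edge-tuple format are hypothetical. Note that independent masking with threshold 0.1 removes 10% of the edges in expectation, not exactly 10% in every draw.

```python
import random


def drop_edges(edges, drop_prob=0.1, rng=None):
    """Keep each edge unless its uniform random mask value falls below drop_prob."""
    rng = rng or random.Random()
    kept = []
    for edge in edges:
        mask_value = rng.random()  # drawn uniformly from [0, 1)
        if mask_value >= drop_prob:
            kept.append(edge)
    return kept


def make_views(original_edges, diffusion_edges, drop_prob=0.1, seed=0):
    """Return four augmented edge lists: two from each input graph."""
    rng = random.Random(seed)
    return [
        drop_edges(original_edges, drop_prob, rng),
        drop_edges(original_edges, drop_prob, rng),
        drop_edges(diffusion_edges, drop_prob, rng),
        drop_edges(diffusion_edges, drop_prob, rng),
    ]


# Toy usage with made-up signed edges (gene, phenotype, sign).
original = [("g1", "p1", 1), ("g2", "p1", -1), ("g3", "p2", 1)]
diffusion = original + [("g1", "p2", 1)]
views = make_views(original, diffusion, drop_prob=0.1, seed=0)
```

Passing one shared random generator keeps the four views reproducible while still giving each view its own mask.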
5.3 Graph Encoder
After data augmentation, we obtain four augmented graphs including , , , and . For convenience, we define the set of augmented graphs as , where k=1,2,3,4. Edge types are classified into positive and negative, meaning the two types of effects of genes on specific phenotypes, namely up-regulation and down-regulation. Consequently, it is imperative to design two distinct GNNs to separately aggregate information from positive and negative neighbors. We split each graph into two graphs containing only positive or negative edges, where and . Utilizing a design akin to shu2021sgcl , node representations are learned from positive graphs using a positive GNN, and from negative graphs using a negative GNN. Parameters are shared in the same perspective for positive (negative) GNNs. GAT model velivckovic2017graph is used as the graph encoder and it computes as follows:
(4) |
(5) |
where , represents the -th augmented graph, denotes the number of GNN layers and is a learnable transformation matrix. denotes the input feature vector of the -th node and is the representation of the -th node for the -th layer. represents the final representation of -th node in the -th augmented graph. Note that we use two GAT layers here.
5.4 Objective Loss
5.4.1 Contrastive Loss
Inter-view Contrastive Learning. As mentioned earlier, we obtained four augmented graphs and further divided each augmented graph into positive and negative graphs for separate encoding. Since the positive and negative graphs contain different semantic properties, we define inter-view contrastive losses for both the positive and negative augmented graphs. Below, we discuss the positive graphs in detail and define the losses for the negative graphs in a similar manner.
For any positive graph, in order to obtain more robust representations, CSGDN maximizes the agreements of the representations between the same node across different positive graphs while minimizing the representation similarities between different nodes. For example, the representation of the -th node in graph of the inter-set perspective, i.e., , should be consistent with representations generated from the same node in the other positive augmented graph in the same perspective, i.e., . Therefore, we treat the representations of the same node from other positive graphs within the same perspective as positive samples. Also, we want the representation of a node to be distinct from those of different nodes so we consider the representations of different nodes from other positive graphs within the same perspective as negative samples. Given a mini-batch containing I nodes, the inter-view contrastive loss for positive augmented graphs is defined as follows, inspired by the InfoNCE loss shu2021sgcl ; sohn2016improved :
(6) |
where represents the representation of node in the -th augmented positive graph, sim(·,·) represents the similarity function between the two representations and denotes the preset temperature parameter.
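To make the structure of such an InfoNCE-style objective concrete, the sketch below computes it for two views of the same set of nodes. It is a sketch under stated assumptions: cosine similarity is assumed for sim(·,·), the denominator is assumed to run over all nodes of the other view (including the positive sample), and the names `info_nce` and `tau` are illustrative. The exact form used in the paper may differ.

```python
import math


def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def info_nce(view_a, view_b, tau=0.5):
    """Mean InfoNCE loss; view_a[i] and view_b[i] embed the same node i."""
    n = len(view_a)
    total = 0.0
    for i in range(n):
        logits = [cosine(view_a[i], view_b[j]) / tau for j in range(n)]
        # Numerically stable log-sum-exp over the positive and all negatives.
        top = max(logits)
        log_denominator = top + math.log(sum(math.exp(v - top) for v in logits))
        total += log_denominator - logits[i]  # -log softmax of the positive pair
    return total / n


# Toy usage: agreeing views give a lower loss than views with nodes swapped.
view_1 = [[1.0, 0.0], [0.0, 1.0]]
view_2_aligned = [[1.0, 0.0], [0.0, 1.0]]
view_2_swapped = [[0.0, 1.0], [1.0, 0.0]]
loss_aligned = info_nce(view_1, view_2_aligned)
loss_swapped = info_nce(view_1, view_2_swapped)
```

A lower temperature `tau` sharpens the softmax, so mismatched pairs are penalized more strongly.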
Similarly, for the representation of the -th node in the graph of the inter-set perspective , its inter-view positive samples are the representations generated from the same node from other negative graphs , and its inter-view negative samples are the ones generated from different nodes in other negative graphs. As the same, the inter-view contrastive loss for negative augmented graphs is defined as:
(7) |
Combining the above two losses, we obtain the perspective specific contrastive loss:
(8) |
Intra-view Contrastive Learning. In addition to maximizing the consistency of representations for the same node across different positive graphs or different negative graphs, we also design intra-view contrastive losses so that the final representation of each node is close to the representations of the same node in positive graphs and far from the representations of the same node in negative graphs.
For node , we generate the representation by concatenating all representations containing different information of diverse views, which is formulated as follows:
(9) |
where is a two-layer MLP, and represents the final representation of node .
To be specific, we treat the representations of the same node from other positive graphs as positive samples while the representations of the same node from other negative graphs as negative samples. Given a mini-batch containing I nodes, the intra-view contrastive objective is formally defined as:
(10) |
where denotes the number of graph views, which equals to 4 in this paper.
Contrastive Loss. We combine the inter-view and intra-view contrastive learning objectives into a single contrastive learning objective, formulated as follows:
(11) |
where is the weight coefficient that controls the significance between two losses.
5.4.2 Label Loss
After obtaining the final node representations using Equation 9, we utilize a two-layer MLP to compute the sign scores between nodes from different sets:
(12) |
where represents the predicted sign score of the link between nodes and . The dimension of is 3, representing the probabilities of predicting the edge sign as positive, negative, or neutral, respectively.
Following existing methodologies, the cross-entropy loss function is used for link sign prediction:
(13) |
where is the ground truth label. The labels are defined as follows:
• -1 indicates a negative relationship (down-regulation).
• 1 indicates a positive relationship (up-regulation).
• 0 indicates an undefined relationship which is not from the TWAS analysis process.
Note that the ground truth labels are converted into one-hot encoding, with a value of 1 in the corresponding category position and 0 elsewhere.
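The label encoding and the cross-entropy computation can be sketched as follows. The class order (positive, negative, neutral) follows the description of the three-dimensional score above. The softmax normalization and the names `CLASS_INDEX`, `one_hot` and `cross_entropy` are assumptions made for this sketch.

```python
import math

# Position of each label in the 3-dimensional score: positive, negative, neutral.
CLASS_INDEX = {1: 0, -1: 1, 0: 2}


def one_hot(label):
    """Encode a label in {1, -1, 0} as a one-hot vector of length 3."""
    vector = [0.0, 0.0, 0.0]
    vector[CLASS_INDEX[label]] = 1.0
    return vector


def log_softmax(scores):
    """Numerically stable log-softmax of a list of raw scores."""
    top = max(scores)
    log_total = top + math.log(sum(math.exp(s - top) for s in scores))
    return [s - log_total for s in scores]


def cross_entropy(score_rows, labels):
    """Mean cross-entropy between raw sign scores and one-hot labels."""
    total = 0.0
    for scores, label in zip(score_rows, labels):
        log_probs = log_softmax(scores)
        target = one_hot(label)
        total -= sum(t * lp for t, lp in zip(target, log_probs))
    return total / len(labels)


# Toy usage: uninformative scores cost log(3); confident correct scores cost less.
uniform_loss = cross_entropy([[0.0, 0.0, 0.0]], [1])
confident_loss = cross_entropy([[5.0, 0.0, 0.0], [0.0, 5.0, 0.0]], [1, -1])
```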
5.4.3 Total Loss
Finally, our model is trained using a joint loss function that integrates the link sign prediction loss and the contrastive learning loss:
(14) |
where is a weight parameter that controls the relative importance of the contrastive loss.
5.5 MLP for Genes without TWAS associations

For those genes that lack TWAS associations, due to the lack of supervision from TWAS association information, the encoder trained specifically for genes in TWAS associations cannot be directly used for their encoding. To address this issue, we propose to train a multi-layer perceptron (MLP) to transform these genes that lack TWAS associations into a shared space with TWAS-associated genes:
(15) |
where represents the input feature of gene in TWAS associations, and represents the final representation of gene after being encoded by the aforementioned TWAS framework. By minimizing the above Mean Squared Error (MSE) loss, we can learn the encoding capability of the main framework for TWAS-associated genes through MLP, which can be used for the encoding of genes lacking TWAS associations. Then we can obtain the representation of genes lacking TWAS associations through this MLP:
(16) |
where . The specific process is shown in Fig. 3.
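The idea of fitting a map from input features to the encoder's output by minimizing an MSE loss can be sketched as below. For brevity, a single linear layer trained by plain gradient descent stands in for the MLP; this substitution, the function names, the learning rate, and the toy data are all assumptions for illustration only.

```python
def predict(weights, bias, x):
    """Apply the linear map to one input feature vector."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def mse(weights, bias, inputs, targets):
    """Mean over samples of the squared error summed over output dimensions."""
    total = 0.0
    for x, z in zip(inputs, targets):
        total += sum((p - t) ** 2 for p, t in zip(predict(weights, bias, x), z))
    return total / len(inputs)


def fit_linear_map(inputs, targets, lr=0.1, epochs=2000):
    """Fit weights and bias by full-batch gradient descent on the MSE loss."""
    d_in, d_out, n = len(inputs[0]), len(targets[0]), len(inputs)
    weights = [[0.0] * d_in for _ in range(d_out)]
    bias = [0.0] * d_out
    for _ in range(epochs):
        grad_w = [[0.0] * d_in for _ in range(d_out)]
        grad_b = [0.0] * d_out
        for x, z in zip(inputs, targets):
            pred = predict(weights, bias, x)
            for o in range(d_out):
                err = 2.0 * (pred[o] - z[o]) / n
                grad_b[o] += err
                for i in range(d_in):
                    grad_w[o][i] += err * x[i]
        for o in range(d_out):
            bias[o] -= lr * grad_b[o]
            for i in range(d_in):
                weights[o][i] -= lr * grad_w[o][i]
    return weights, bias


# Toy data: features of "TWAS genes" and the embeddings an encoder gave them.
twas_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
twas_embeddings = [[2.0], [3.0], [5.0]]
weights, bias = fit_linear_map(twas_features, twas_embeddings)

# Encode a gene that has features but no TWAS association.
new_gene_embedding = predict(weights, bias, [2.0, 1.0])
```

Once fitted, the map only needs a gene's input features, which is why it can be applied to genes that the TWAS-based encoder never saw.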
6 EXPERIMENTS
Datasets | Metric | Unsigned GNNs | | | Signed GNNs | | | |
---|---|---|---|---|---|---|---|---|
 | | GCN | GAT | GRACE | SGCN | SGCL | SGNNMD | Our CSGDN |
G. hirsutum | AUC | 0.7145 ± 0.0294 | 0.7098 ± 0.0093 | 0.7085 ± 0.0068 | 0.7215 ± 0.0123 | 0.7282 ± 0.0191 | 0.6883 ± 0.1783 | 0.7811 ± 0.0116 |
 | Binary-F1 | 0.6231 ± 0.0633 | 0.5950 ± 0.0131 | 0.5934 ± 0.0098 | 0.6325 ± 0.0190 | 0.6307 ± 0.0365 | 0.7894 ± 0.1128 | 0.7458 ± 0.0103 |
 | Micro-F1 | 0.7444 ± 0.0082 | 0.7624 ± 0.0102 | 0.7609 ± 0.0074 | 0.7624 ± 0.0122 | 0.7759 ± 0.0146 | 0.7203 ± 0.1578 | 0.7804 ± 0.0240 |
 | Macro-F1 | 0.7102 ± 0.0139 | 0.7134 ± 0.0104 | 0.7120 ± 0.0076 | 0.7284 ± 0.0136 | 0.7349 ± 0.0221 | 0.6634 ± 0.2151 | 0.7981 ± 0.0447 |
B. napus | AUC | 0.5000 ± 0.0000 | 0.5000 ± 0.0000 | 0.5000 ± 0.0000 | 0.5821 ± 0.0449 | 0.4409 ± 0.1262 | 0.6026 ± 0.1397 | 0.6615 ± 0.0495 |
 | Binary-F1 | 0.9130 ± 0.0000 | 0.9130 ± 0.0000 | 0.8773 ± 0.0000 | 0.8560 ± 0.0385 | 0.5133 ± 0.3839 | 0.2456 ± 0.3027 | 0.8608 ± 0.0334 |
 | Micro-F1 | 0.8400 ± 0.0000 | 0.8400 ± 0.0000 | 0.7815 ± 0.0000 | 0.7627 ± 0.0574 | 0.4800 ± 0.2918 | 0.8233 ± 0.0304 | 0.7815 ± 0.0437 |
 | Macro-F1 | 0.4565 ± 0.0000 | 0.4565 ± 0.0000 | 0.4387 ± 0.0000 | 0.5828 ± 0.0550 | 0.3368 ± 0.1848 | 0.5719 ± 0.1558 | 0.6649 ± 0.0478 |
T. turgidum | AUC | 0.5000 ± 0.0000 | 0.5000 ± 0.0000 | 0.5000 ± 0.0000 | 0.4714 ± 0.0350 | 0.5000 ± 0.0000 | 0.5810 ± 0.1160 | 0.5982 ± 0.0787 |
 | Binary-F1 | 0.8235 ± 0.0000 | 0.8235 ± 0.0000 | 0.8636 ± 0.0000 | 0.7941 ± 0.0360 | 0.8235 ± 0.0000 | 0.3352 ± 0.2201 | 0.8231 ± 0.0946 |
 | Micro-F1 | 0.7000 ± 0.0000 | 0.7000 ± 0.0000 | 0.7600 ± 0.0000 | 0.6600 ± 0.0490 | 0.7000 ± 0.0000 | 0.7886 ± 0.1022 | 0.5925 ± 0.1027 |
 | Macro-F1 | 0.4118 ± 0.0000 | 0.4118 ± 0.0000 | 0.4318 ± 0.0000 | 0.3971 ± 0.0180 | 0.4118 ± 0.0000 | 0.6043 ± 0.1411 | 0.5925 ± 0.1027 |

Table 3. Link sign prediction results on the G. hirsutum dataset reduced to 80% of its samples.

Method | AUC | Binary-F1 | Micro-F1 | Macro-F1 |
---|---|---|---|---|
GCN | 0.5883 ± 0.0182 | 0.4084 ± 0.0932 | 0.6562 ± 0.0393 | 0.5792 ± 0.0317 |
GAT | 0.5549 ± 0.0244 | 0.3042 ± 0.1045 | 0.6315 ± 0.1417 | 0.4811 ± 0.1058 |
GRACE | 0.6967 ± 0.0208 | 0.6203 ± 0.0330 | 0.7098 ± 0.0504 | 0.6849 ± 0.0378 |
SGCN | 0.5832 ± 0.0478 | 0.4134 ± 0.0885 | 0.6472 ± 0.0208 | 0.5793 ± 0.0431 |
SGCL | 0.6271 ± 0.0429 | 0.4980 ± 0.0975 | 0.6360 ± 0.0464 | 0.5926 ± 0.0174 |
SGNNMD | 0.5988 ± 0.1210 | 0.5035 ± 0.2912 | 0.5964 ± 0.1242 | 0.5231 ± 0.1843 |
CSGDN | 0.7495 ± 0.0339 | 0.6884 ± 0.0407 | 0.7574 ± 0.0515 | 0.7406 ± 0.0420 |

Table 4. Link sign prediction results on the G. hirsutum dataset with 10% and 20% of edge signs perturbed (Ptb). GCN, GAT, and GRACE are unsigned GNNs; SGCN, SGCL, SGNNMD, and CSGDN are signed GNNs.

Ptb (%) | Metric | GCN | GAT | GRACE | SGCN | SGCL | SGNNMD | CSGDN |
---|---|---|---|---|---|---|---|---|
10 | AUC | 0.6714 ± 0.0116 | 0.6649 ± 0.0169 | 0.6759 ± 0.0163 | 0.6994 ± 0.0138 | 0.6850 ± 0.0221 | 0.6258 ± 0.1102 | 0.7066 ± 0.0300 |
10 | Binary-F1 | 0.5424 ± 0.0302 | 0.5166 ± 0.0387 | 0.5435 ± 0.0309 | 0.6046 ± 0.0330 | 0.5663 ± 0.0429 | 0.5547 ± 0.2151 | 0.6498 ± 0.0562 |
10 | Micro-F1 | 0.7203 ± 0.0129 | 0.7203 ± 0.0129 | 0.7278 ± 0.0138 | 0.7368 ± 0.0106 | 0.7323 ± 0.0169 | 0.6045 ± 0.1409 | 0.6987 ± 0.0253 |
10 | Macro-F1 | 0.6702 ± 0.0150 | 0.6598 ± 0.0220 | 0.6748 ± 0.0196 | 0.7033 ± 0.0159 | 0.6863 ± 0.0259 | 0.5891 ± 0.1427 | 0.6987 ± 0.0253 |
20 | AUC | 0.6030 ± 0.0454 | 0.6313 ± 0.0176 | 0.6515 ± 0.0061 | 0.6225 ± 0.0312 | 0.6518 ± 0.0198 | 0.6063 ± 0.1025 | 0.7209 ± 0.0298 |
20 | Binary-F1 | 0.4939 ± 0.0901 | 0.4863 ± 0.0317 | 0.5320 ± 0.0047 | 0.5576 ± 0.0403 | 0.5840 ± 0.0338 | 0.6420 ± 0.1803 | 0.6501 ± 0.0530 |
20 | Micro-F1 | 0.6286 ± 0.0400 | 0.6797 ± 0.0162 | 0.6932 ± 0.0111 | 0.6316 ± 0.0293 | 0.6602 ± 0.0573 | 0.6180 ± 0.1165 | 0.7444 ± 0.0190 |
20 | Macro-F1 | 0.5968 ± 0.0468 | 0.6268 ± 0.0207 | 0.6517 ± 0.0053 | 0.6205 ± 0.0306 | 0.6364 ± 0.0408 | 0.5847 ± 0.1244 | 0.7236 ± 0.0289 |
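
Table 5. Ablation results (AUC) for CSGDN and its three variants, listed in the order in which the variants are defined in Section 6.3.

Dataset | CSGDN | w/o diffusion | w/o augmentation | w/o contrastive learning |
---|---|---|---|---|
G. hirsutum | 0.7811 ± 0.0116 | 0.7624 ± 0.0179 | 0.7712 ± 0.0551 | 0.7297 ± 0.0208 |
B. napus | 0.6615 ± 0.0495 | 0.6485 ± 0.0400 | 0.6700 ± 0.0206 | 0.6204 ± 0.0017 |
T. turgidum | 0.5982 ± 0.0787 | 0.5175 ± 0.0351 | 0.5333 ± 0.0887 | 0.4956 ± 0.0428 |
80% G. hirsutum | 0.7495 ± 0.0339 | 0.6998 ± 0.0365 | 0.7355 ± 0.0184 | 0.6870 ± 0.0327 |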

Table 6. Case study on G. hirsutum: TWAS results and CSGDN predictions for four fiber phenotypes (FE, FU, FS, FL). The last three columns are the probabilities predicted by CSGDN for the down-regulation, irrelevant, and up-regulation association types.

Phenotype | Gene | TWAS | CSGDN | P(Down) | P(Irrelevant) | P(Up) |
---|---|---|---|---|---|---|
FE | Ghir_A13G012290 | Up | Down | 0.672 | 0.033 | 0.294 |
FE | Ghir_D02G002560 | Up | Up | 0.012 | 0.007 | 0.982 |
FE | Ghir_A11G026810 | Up | Up | 0.012 | 0.007 | 0.981 |
FE | Ghir_A09G016050 | Up | Up | 0.452 | 0.035 | 0.513 |
FE | Ghir_A12G025670 | Up | Down | 0.483 | 0.035 | 0.482 |
FE | Ghir_A09G014910 | Down | Down | 0.971 | 0.011 | 0.018 |
FE | Ghir_A01G013950 | Down | Down | 0.989 | 0.006 | 0.005 |
FE | Ghir_D01G000640 | Down | Down | 0.928 | 0.018 | 0.054 |
FU | Ghir_A02G005400 | Down | Up | 0.043 | 0.013 | 0.944 |
FU | Ghir_A05G041310 | Up | Up | 0.014 | 0.007 | 0.979 |
FU | Ghir_A04G002650 | Up | Up | 0.031 | 0.011 | 0.958 |
FU | Ghir_D03G012470 | Up | Up | 0.012 | 0.007 | 0.981 |
FU | Ghir_A12G019480 | Down | Down | 0.990 | 0.006 | 0.004 |
FU | Ghir_A12G019760 | Down | Up | 0.019 | 0.009 | 0.972 |
FU | Ghir_A01G018190 | Down | Down | 0.990 | 0.006 | 0.004 |
FU | Ghir_A01G017780 | Down | Down | 0.991 | 0.005 | 0.004 |
FS | Ghir_D09G001870 | Up | Up | 0.014 | 0.007 | 0.979 |
FS | Ghir_D07G018780 | Up | Up | 0.089 | 0.019 | 0.892 |
FS | Ghir_D05G001070 | Up | Up | 0.012 | 0.007 | 0.982 |
FS | Ghir_D07G018070 | Down | Up | 0.018 | 0.008 | 0.973 |
FS | Ghir_D07G002590 | Down | Down | 0.493 | 0.035 | 0.472 |
FS | Ghir_D06G018900 | Down | Down | 0.991 | 0.005 | 0.004 |
FS | Ghir_D13G017280 | Down | Down | 0.991 | 0.005 | 0.004 |
FS | Ghir_A01G016960 | Down | Down | 0.989 | 0.006 | 0.005 |
FL | Ghir_A09G017330 | Up | Up | 0.010 | 0.006 | 0.984 |
FL | Ghir_D07G002590 | Up | Up | 0.010 | 0.006 | 0.984 |
FL | Ghir_D01G006880 | Down | Up | 0.282 | 0.031 | 0.686 |
FL | Ghir_D05G006550 | Down | Down | 0.990 | 0.006 | 0.004 |
FL | Ghir_A01G012860 | Down | Down | 0.589 | 0.035 | 0.376 |
FL | Ghir_A01G017490 | Down | Down | 0.981 | 0.009 | 0.011 |
FL | Ghir_A01G013320 | Down | Down | 0.987 | 0.007 | 0.006 |
FL | Ghir_A11G001240 | Down | Up | 0.038 | 0.012 | 0.949 |

In this section, we present experiments on real-world datasets to evaluate the effectiveness of CSGDN in link sign prediction. We also compare its performance with leading methods in both unsigned and signed graph neural networks. Specifically, we aim to address the following questions:
- Q1: Does CSGDN outperform the advanced baselines?
- Q2: How does CSGDN perform with a small sample size and random noise?
- Q3: How do different model components affect the performance of CSGDN?
- Q4: Is CSGDN sensitive to hyperparameters?
6.1 Experiment Settings
Baselines. To validate the effectiveness of CSGDN, we compare our proposed model with several common methods from the fields of unsigned and signed graph neural networks.
Unsigned GNNs. GCN [42] is a pioneering GNN model tailored for unsigned graphs, featuring an effective layer-wise propagation mechanism. GAT [57] uses masked self-attentional layers, allowing nodes to attend to their neighbors' features with varying weights without costly matrix operations or prior knowledge of the graph structure. GRACE [59] is an unsupervised graph representation learning framework that uses contrastive objectives at the node level: it creates two graph views through corruption at both the structure and attribute levels and maximizes the agreement of node representations between these views.
Signed GNNs. SGCN [36] generalizes GCN to signed graphs, using balance theory to correctly aggregate and propagate information across the layers of a signed GCN model. SGCL [46] introduces a graph contrastive representation learning technique tailored for signed graphs, leveraging balance theory and dual contrastive strategies to learn node representations across diverse datasets, including social and online gaming networks. SGNNMD [31] uses signed graph neural networks to predict deregulation types of miRNA-disease associations, achieving competitive performance by integrating structural and biological features from a signed bipartite network.
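Hyper-parameters setting. We analyze the sensitivity of CSGDN to six key hyperparameters: the weight balancing the inter-view and intra-view contrastive losses, the weight of the contrastive loss in the joint loss function, the node embedding dimension, the edge mask ratio, the number of layers of the downstream MLP predictor, and the temperature of the contrastive loss. The default value of each hyperparameter was chosen as the setting that gave the model's highest overall AUC score. AUC is used as the primary metric to assess the sensitivity of the model to changes in these hyperparameters.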
Reducing sample size. To test how CSGDN copes with a small sample size, we randomly sample 80% of the G. hirsutum dataset and run the full training and evaluation procedure on this subset.
Random noise. To test how well CSGDN resists interference, we simulate noise by randomly flipping the signs of a fixed proportion of edges in the G. hirsutum dataset. In this experiment, we set the proportion to 10% and 20%.
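The perturbation described above can be sketched as follows. This is an illustrative version only: the edge format `(gene, phenotype, sign)` and the function name `flip_edge_signs` are assumptions, not taken from the CSGDN code base.

```python
import random


def flip_edge_signs(edges, ratio, seed=0):
    """Return a copy of `edges` with the sign of a random `ratio` of edges flipped.

    `edges` is a list of (gene, phenotype, sign) tuples with sign in {+1, -1}.
    """
    rng = random.Random(seed)
    n_flip = round(len(edges) * ratio)
    flip_idx = set(rng.sample(range(len(edges)), n_flip))
    return [
        (g, p, -s) if i in flip_idx else (g, p, s)
        for i, (g, p, s) in enumerate(edges)
    ]


# Hypothetical toy graph: 20 positive gene-phenotype edges.
edges = [("gene_%d" % i, "FL", 1) for i in range(20)]
noisy_10 = flip_edge_signs(edges, 0.10)
noisy_20 = flip_edge_signs(edges, 0.20)
```

Fixing the seed keeps the perturbed dataset reproducible, so all compared models can be given the same noisy edges.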
Task and evaluation metrics. We use AUC, Binary-F1, Micro-F1, and Macro-F1 to evaluate the results on the link sign prediction task. For each dataset, we randomly split the edges into a training set and a testing set with a ratio of 8:2. For all four metrics, higher values indicate better performance.
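To make the four metrics concrete, the sketch below computes them from scratch for binary sign labels. The function names are hypothetical, and the example data (100 test edges, 84 of them positive) are an assumption chosen for illustration. It shows why the metrics should be read together: a predictor that labels every edge as positive scores a high Binary-F1 while its AUC stays at chance level and its Macro-F1 is low.

```python
def f1(tp, fp, fn):
    """F1 score of one class from its true positives, false positives, false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def sign_prediction_f1(y_true, y_pred):
    """Binary-, Micro- and Macro-F1 for labels in {0, 1} (1 = positive edge)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    binary = f1(tp, fp, fn)            # F1 of the positive class only
    negative = f1(tn, fn, fp)          # F1 of the negative class
    micro = (tp + tn) / len(y_true)    # equals accuracy for single-label data
    macro = (binary + negative) / 2    # unweighted mean over the two classes
    return binary, micro, macro


def auc(y_true, scores):
    """AUC as the probability that a positive edge outscores a negative one (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Assumed toy test set: 84 positive and 16 negative edges,
# scored by a constant predictor that calls every edge positive.
y_true = [1] * 84 + [0] * 16
y_pred = [1] * 100
scores = [1.0] * 100

binary, micro, macro = sign_prediction_f1(y_true, y_pred)
chance_auc = auc(y_true, scores)
```

In this toy case the metrics evaluate to about 0.9130 (Binary-F1), 0.84 (Micro-F1), 0.4565 (Macro-F1), and 0.5 (AUC). These coincide with the values reported for GCN and GAT on B. napus, which is consistent with those baselines predicting a single class on that dataset, although the text does not state this.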
6.2 Experiment Results
Performance of CSGDN compared with baselines (Q1). To answer Q1, we compare CSGDN with current state-of-the-art methods from two families of GNN frameworks: unsigned GNNs (GCN, GAT, and GRACE) and signed GNNs (SGCN, SGCL, and SGNNMD). We use link sign prediction as the evaluation task, with AUC, Binary-F1, Micro-F1, and Macro-F1 as the evaluation metrics. As shown in Table 2, across the three crop datasets (G. hirsutum, B. napus, and T. turgidum), CSGDN generally outperforms the state-of-the-art baselines and attains the highest AUC on every dataset.
Performance of CSGDN when addressing small sample size and random noise (Q2). As shown in Tables 3 and 4, we address Q2 by verifying that CSGDN can cope with two issues: the cost of samples and noise. For the first issue, we randomly reduce the G. hirsutum dataset to 80% of its samples and then divide the reduced dataset into training and testing sets. The results in Table 3 show that CSGDN outperforms all baselines on every metric on this reduced dataset, demonstrating its effectiveness on small-sample data. This suggests that experimental cost and duration can be lowered by reducing the sample size while still achieving strong prediction results. For noise, we use two G. hirsutum datasets in which 10% and 20% of the edge signs are perturbed. As shown in Table 4, CSGDN outperforms the vast majority of baselines. This indicates that contrastive learning gives the model strong resistance to this kind of noise.
6.3 Ablation Study
We conduct an ablation study to assess the effectiveness of the different components of our proposed model and answer Q3. In this subsection, we employ sign perturbation as the graph augmentation method. Specifically, we compare CSGDN with three variants, defined as follows:
- Without graph diffusion: the graph diffusion step in contrastive learning is removed; instead, two copies of the original graph, without diffusion, are passed to the augmentation step.
- Without graph augmentation: the graph augmentation step is removed, so the original and diffusion graphs, rather than their augmented views, are used during training.
- Without contrastive learning: the contrastive learning component is removed and the contrastive loss is ignored.
As shown in Table 5, the full model achieves the highest score in every setting except B. napus, where one variant is marginally higher (0.6700 vs. 0.6615). Each component therefore plays a role in enhancing the model's ability.
6.4 Hyper-parameters Analysis

To answer Q4, we analyze the sensitivity of CSGDN to six key hyperparameters, shown in Fig. 4(a)-(f).
The first hyperparameter, shown in Fig. 4(a), is the weight coefficient that balances the inter-view contrastive loss against the intra-view contrastive loss. The model performs better when this coefficient is between 0.5 and 0.8, and performance decreases outside this range. Therefore, for this dataset the intra-view objective, i.e., learning a consistent representation across augmented graphs, is more important.
The second hyperparameter, shown in Fig. 4(b), controls the trade-off in the joint loss function between the link sign prediction loss and the contrastive learning loss. As this weight grows, model performance first increases, reaches its best value, and then decreases. This illustrates the importance of contrastive learning in the model: performance suffers when the influence of the contrastive loss is extreme, i.e., when the weight is too small or too large.
The third hyperparameter, shown in Fig. 4(c), is the node embedding dimension. We test six different node embedding dimensions, ranging from 8 to 128, and evaluate their impact on model performance. Performance drops sharply at the larger dimensions, which may be due to feature sparsity in the higher-dimensional feature space preventing the model from effectively establishing correlations between nodes.
The fourth hyperparameter, shown in Fig. 4(d), is the mask ratio, i.e., the ratio of edges randomly dropped from both the original and diffusion graphs. The model performs best when the mask ratio is set to 0.4. Performance fluctuates as the mask ratio increases but is generally better when the ratio is small; when the ratio is large, too much information is lost and performance declines.
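Random edge dropping with a given mask ratio can be sketched as below. The function name `drop_edges`, the edge tuples, and the use of different seeds to obtain two views are illustrative assumptions rather than details confirmed by the text.

```python
import random


def drop_edges(edges, mask_ratio, seed=0):
    """Randomly remove a fraction `mask_ratio` of edges to build one augmented view."""
    rng = random.Random(seed)
    n_keep = len(edges) - round(len(edges) * mask_ratio)
    keep_idx = sorted(rng.sample(range(len(edges)), n_keep))
    return [edges[i] for i in keep_idx]


# Hypothetical signed edges (gene, phenotype, sign).
graph = [("gene_%d" % i, "FS", 1 if i % 2 else -1) for i in range(10)]

# Two stochastic views of the same graph, each keeping 60% of the edges.
view_1 = drop_edges(graph, 0.4, seed=1)
view_2 = drop_edges(graph, 0.4, seed=2)
```

A larger mask ratio makes the two views differ more but leaves each view with less structure, which matches the trade-off discussed above.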
The fifth hyperparameter, shown in Fig. 4(e), selects the architecture of the predictor for the downstream link prediction task: a 1-layer, 2-layer, 3-layer, or 4-layer MLP. The number of layers directly influences the model's prediction accuracy. As the number of layers increases, model performance first increases and then decreases, peaking with the 3-layer MLP. The drop with the 4-layer MLP may be due to overfitting caused by the excessive number of layers, which shows up as lower performance on the test set.
Finally, the temperature parameter, shown in Fig. 4(f), is used in the contrastive loss function to regulate the similarity contrast between positive and negative edges. The default value yields the best performance. Performance is better at lower temperatures and generally lower at higher ones, which also indicates that this dataset requires representations that are as consistent as possible in contrastive learning, and that the smoothing introduced by a higher temperature reduces performance.
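The role of the temperature can be illustrated with a generic InfoNCE-style loss. This sketch is not CSGDN's exact loss (the inter-view and intra-view terms are not reproduced here); the function names and toy embeddings are assumptions. It shows only that dividing similarities by a small temperature sharpens the softmax, so well-aligned pairs are rewarded more strongly.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def info_nce(view_a, view_b, tau):
    """Mean InfoNCE loss: node i in view_a should match node i in view_b."""
    losses = []
    for i, u in enumerate(view_a):
        logits = [cosine(u, v) / tau for v in view_b]
        log_denom = math.log(sum(math.exp(l) for l in logits))
        losses.append(log_denom - logits[i])  # -log softmax of the positive pair
    return sum(losses) / len(losses)


# Toy embeddings of two nodes in two views (illustrative values).
view_a = [[1.0, 0.0], [0.0, 1.0]]
view_b = [[0.9, 0.1], [0.2, 0.8]]

loss_sharp = info_nce(view_a, view_b, tau=0.05)
loss_smooth = info_nce(view_a, view_b, tau=1.0)
loss_misaligned = info_nce(view_a, list(reversed(view_b)), tau=1.0)
```

With these aligned toy views, the loss at the lower temperature is smaller than at the higher one, and swapping the rows of the second view, which breaks the node correspondence, increases the loss.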
6.5 Case Study
In this section, we present a case study focusing on G. hirsutum genes associated with four phenotypes: Fiber Elongation rate (FE), Fiber Uniformity (FU), Fiber Strength (FS), and Fiber Length (FL). Using the optimal hyperparameter configuration described above, we take the associations between G. hirsutum genes and the four phenotypes predicted by TWAS as the training set for CSGDN's TWAS framework. Randomly selected gene-phenotype pairs for which TWAS cannot compute an association in G. hirsutum are used as irrelevant associations. CSGDN then predicts probabilities for three association types: up-regulation, down-regulation, and irrelevant. We take the type with the highest predicted probability as the association type of the gene-phenotype pair. In this way, we can predict the association types between target G. hirsutum genes and the four target phenotypes. Table 6 lists eight randomly selected associations for each phenotype, with the TWAS results used as the reference for our predictions. For example, for phenotype FE, 2 of the 8 listed predictions disagree with the TWAS results, and the same holds for 2 of the 8 listed predictions for phenotype FU. The case study therefore shows that CSGDN's predictions largely agree with TWAS results, which supports its usefulness for discovering novel gene-phenotype associations.
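The decision rule used in the case study (take the association type with the highest predicted probability, then compare with TWAS) can be sketched on the FE rows of Table 6. The order of the three probability columns (down-regulation, irrelevant, up-regulation) is an assumption inferred from the listed predictions, and the variable names are hypothetical.

```python
# Assumed order of the three probability columns in Table 6.
LABELS = ("Down", "Irrelevant", "Up")

# (gene, TWAS direction, predicted probabilities) for the FE rows of Table 6.
fe_rows = [
    ("Ghir_A13G012290", "Up", (0.672, 0.033, 0.294)),
    ("Ghir_D02G002560", "Up", (0.012, 0.007, 0.982)),
    ("Ghir_A11G026810", "Up", (0.012, 0.007, 0.981)),
    ("Ghir_A09G016050", "Up", (0.452, 0.035, 0.513)),
    ("Ghir_A12G025670", "Up", (0.483, 0.035, 0.482)),
    ("Ghir_A09G014910", "Down", (0.971, 0.011, 0.018)),
    ("Ghir_A01G013950", "Down", (0.989, 0.006, 0.005)),
    ("Ghir_D01G000640", "Down", (0.928, 0.018, 0.054)),
]


def predict_label(probs):
    """Return the association type with the highest predicted probability."""
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best]


predictions = [(gene, twas, predict_label(p)) for gene, twas, p in fe_rows]
mismatches = [gene for gene, twas, pred in predictions if pred != twas]
agreement = 1 - len(mismatches) / len(fe_rows)
```

For these eight rows the rule disagrees with TWAS on two genes, Ghir_A13G012290 and Ghir_A12G025670. In the second case the two leading probabilities differ by only 0.001, so a confidence margin could be used to flag such borderline calls.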
7 CONCLUSION
Predicting associations between genes and phenotypes plays a crucial role in understanding complex biological and genetic processes in crops. We propose a novel model, CSGDN, to address two major issues: cost and noise. CSGDN employs a signed graph diffusion method to capture potential associations from a small number of samples. Contrastive learning strategies unify the node representations from the two views created by stochastic perturbation, and the multi-view contrastive loss makes the model robust to interference and noise. Extensive experiments show that CSGDN generally outperforms state-of-the-art baselines on crop datasets. The case study shows that CSGDN can predict positive, negative, and irrelevant associations between genes and phenotypes, and that its predictions are largely correct when TWAS results are used as a reference. CSGDN thus mitigates the two issues mentioned above and is of practical biological value. In addition to TWAS results, the model can also accept other types of data inputs. For instance, researchers engaged in functional genomics can input gene-phenotype associations obtained from CRISPR, RNAi, or overexpression experiments into CSGDN, allowing the model to suggest new candidate genes for researchers in these fields. However, the current model lacks interpretability, which limits its direct application in crop breeding. In the future, we plan to apply advanced mechanisms to improve interpretability and provide clearer insights into the underlying gene-phenotype associations.
8 ACKNOWLEDGEMENTS
This study was supported by the Finance science and technology project of Xinjiang Uyghur Autonomous Region (2023A01). This study was also supported by the National Key Research and Development Program of China (2021YFF1000900) and the National Natural Science Foundation of China (W2411020, 32170645). We thank the high-performance computing platform at the National Key Laboratory of Crop Genetic Improvement in Huazhong Agricultural University.
References
- [1] Weitao Li, Ziwei Zhu, Mawsheng Chern, Junjie Yin, Chao Yang, Li Ran, Mengping Cheng, Min He, Kang Wang, Jing Wang, et al. A natural allele of a transcription factor in rice confers broad-spectrum blast resistance. Cell, 170(1):114–126, 2017.
- [2] Peter M Visscher, Naomi R Wray, Qian Zhang, Pamela Sklar, Mark I McCarthy, Matthew A Brown, and Jian Yang. 10 years of gwas discovery: biology, function, and translation. The American Journal of Human Genetics, 101(1):5–22, 2017.
- [3] Bogdan Pasaniuc and Alkes L Price. Dissecting the genetics of complex traits using summary association statistics. Nature reviews genetics, 18(2):117–127, 2017.
- [4] Vivian Tam, Nikunj Patel, Michelle Turcotte, Yohan Bossé, Guillaume Paré, and David Meyre. Benefits and limitations of genome-wide association studies. Nature Reviews Genetics, 20(8):467–484, 2019.
- [5] Ting Qi, Liyang Song, Yazhou Guo, Chang Chen, and Jian Yang. From genetic associations to genes: methods, applications, and challenges. Trends in Genetics, 2024.
- [6] Yangyang Liu, Jun Chen, Changbin Yin, Ziying Wang, He Wu, Kuocheng Shen, Zhiliang Zhang, Lipeng Kang, Song Xu, Aoyue Bi, et al. A high-resolution genotype–phenotype map identifies the taspl17 controlling grain number and size in wheat. Genome Biology, 24(1):196, 2023.
- [7] Xuelei Lin, Yongxin Xu, Dongzhi Wang, Yiman Yang, Xiaoyu Zhang, Xiaomin Bie, Lixuan Gui, Zhongxu Chen, Yiliang Ding, Long Mao, et al. Systematic identification of wheat spike developmental regulators by integrated multi-omics, transcriptional network, gwas, and genetic analyses. Molecular Plant, 17(3):438–459, 2024.
- [8] Huili Zhang, Feifei Yu, Peng Xie, Shengyuan Sun, Xinhua Qiao, Sanyuan Tang, Chengxuan Chen, Sen Yang, Cuo Mei, Dekai Yang, et al. A g protein regulates alkaline sensitivity in crops. Science, 379(6638):eade8416, 2023.
- [9] Charlotte Miller, Rachel Wells, Neil McKenzie, Martin Trick, Joshua Ball, Abdelhak Fatihi, Bertrand Dubreucq, Thierry Chardot, Loic Lepiniec, and Michael W Bevan. Variation in expression of the hect e3 ligase upl3 modulates lec2 levels, seed size, and crop yields in brassica napus. The Plant Cell, 31(10):2370–2385, 2019.
- [10] Zhiying Ma, Shoupu He, Xingfen Wang, Junling Sun, Yan Zhang, Guiyin Zhang, Liqiang Wu, Zhikun Li, Zhihao Liu, Gaofei Sun, et al. Resequencing a core collection of upland cotton identifies genomic variation and loci influencing fiber quality and yield. Nature genetics, 50(6):803–813, 2018.
- [11] Jiaqi You, Zhenping Liu, Zhengyang Qi, Yizan Ma, Mengling Sun, Ling Su, Hao Niu, Yabing Peng, Xuanxuan Luo, Mengmeng Zhu, et al. Regulatory controls of duplicated gene expression during fiber development in allotetraploid cotton. Nature genetics, 55(11):1987–1997, 2023.
- [12] Nan Zhao, Weiran Wang, Corrinne E Grover, Kaiyun Jiang, Zhuanxia Pan, Baosheng Guo, Jiahui Zhu, Ying Su, Meng Wang, Hushuai Nie, et al. Genomic and gwas analyses demonstrate phylogenomic relationships of gossypium barbadense in china and selection for fibre length, lint percentage and fusarium wilt resistance. Plant Biotechnology Journal, 20(4):691–710, 2022.
- [13] Maojun Wang, Jianying Li, Zhengyang Qi, Yuexuan Long, Liuling Pei, Xianhui Huang, Corrinne E Grover, Xiongming Du, Chunjiao Xia, Pengcheng Wang, et al. Genomic innovation and regulatory rewiring during evolution of the cotton genus gossypium. Nature Genetics, 54(12):1959–1971, 2022.
- [14] Geng Tian, Shubin Wang, Jianhui Wu, Yanxia Wang, Xiutang Wang, Shuwei Liu, Dejun Han, Guangmin Xia, and Mengcheng Wang. Allelic variation of tawd40-4b. 1 contributes to drought tolerance by modulating catalase activity in wheat. Nature Communications, 14(1):1200, 2023.
- [15] Dongzhi Wang, Xiuxiu Zhang, Yuan Cao, Aamana Batool, Yongxin Xu, Yunzhou Qiao, Yongpeng Li, Hao Wang, Xuelei Lin, Xiaomin Bie, et al. Tabhlh27 orchestrates root growth and drought tolerance to enhance water use efficiency in wheat. Journal of Integrative Plant Biology, 2024.
- [16] Baoqi Li, Lin Chen, Weinan Sun, Di Wu, Maojun Wang, Yu Yu, Guoxing Chen, Wanneng Yang, Zhongxu Lin, Xianlong Zhang, et al. Phenomics-based gwas analysis reveals the genetic architecture for drought resistance in cotton. Plant biotechnology journal, 18(12):2533–2544, 2020.
- [17] Yizan Ma, Ling Min, Junduo Wang, Yaoyao Li, Yuanlong Wu, Qin Hu, Yuanhao Ding, Maojun Wang, Yajun Liang, Zhaolong Gong, et al. A combination of genome-wide and transcriptome-wide association studies reveals genetic elements leading to male sterility during high temperature stress in cotton. New Phytologist, 231(1):165–181, 2021.
- [18] Liyuan Zhang, Bo Yang, Xiaodong Li, Si Chen, Chao Zhang, Sirou Xiang, Tingting Sun, Ziyan Yang, Xizeng Kong, Cunmin Qu, et al. Integrating gwas, rna-seq and functional analysis revealed that bnaa02. se mediates silique elongation by affecting cell proliferation and expansion in brassica napus. Plant Biotechnology Journal, 2024.
- [19] Shan Tang, Hu Zhao, Shaoping Lu, Liangqian Yu, Guofang Zhang, Yuting Zhang, Qing-Yong Yang, Yongming Zhou, Xuemin Wang, Wei Ma, et al. Genome-and transcriptome-wide association studies provide insights into the genetic basis of natural variation of seed oil content in brassica napus. Molecular Plant, 14(3):470–487, 2021.
- [20] Claudia Giambartolomei, Damjan Vukcevic, Eric E Schadt, Lude Franke, Aroon D Hingorani, Chris Wallace, and Vincent Plagnol. Bayesian test for colocalisation between pairs of genetic association studies using summary statistics. PLoS genetics, 10(5):e1004383, 2014.
- [21] Farhad Hormozdiari, Martijn Van De Bunt, Ayellet V Segre, Xiao Li, Jong Wha J Joo, Michael Bilow, Jae Hoon Sul, Sriram Sankararaman, Bogdan Pasaniuc, and Eleazar Eskin. Colocalization of gwas and eqtl signals detects target genes. The American Journal of Human Genetics, 99(6):1245–1260, 2016.
- [22] Zhihong Zhu, Futao Zhang, Han Hu, Andrew Bakshi, Matthew R Robinson, Joseph E Powell, Grant W Montgomery, Michael E Goddard, Naomi R Wray, Peter M Visscher, et al. Integration of summary data from gwas and eqtl studies predicts complex trait gene targets. Nature genetics, 48(5):481–487, 2016.
- [23] CommonMind Consortium et al. Erratum: Large-scale identification of common trait and disease variants affecting gene expression (The American Journal of Human Genetics (2017) 100(6) (885–894) (S0002929717301611) (10.1016/j.ajhg.2017.04.016)). American Journal of Human Genetics, 101(1):157, 2017.
- [24] George K Robinson. That blup is a good thing: the estimation of random effects. Statistical science, pages 15–32, 1991.
- [25] Xiang Zhou, Peter Carbonetto, and Matthew Stephens. Polygenic modeling with bayesian sparse linear mixed models. PLoS genetics, 9(2):e1003264, 2013.
- [26] Michael Wainberg, Nasa Sinnott-Armstrong, Nicholas Mancuso, Alvaro N Barbeira, David A Knowles, David Golan, Raili Ermel, Arno Ruusalepp, Thomas Quertermous, Ke Hao, et al. Opportunities and challenges for transcriptome-wide association studies. Nature genetics, 51(4):592–599, 2019.
- [27] Alexander Gusev, Arthur Ko, Huwenbo Shi, Gaurav Bhatia, Wonil Chung, Brenda WJH Penninx, Rick Jansen, Eco JC De Geus, Dorret I Boomsma, Fred A Wright, et al. Integrative approaches for large-scale transcriptome-wide association studies. Nature genetics, 48(3):245–252, 2016.
- [28] Barton E Slatko, Andrew F Gardner, and Frederick M Ausubel. Overview of next-generation sequencing technologies. Current protocols in molecular biology, 122(1):e59, 2018.
- [29] Margaret A Taub, Hector Corrada Bravo, and Rafael A Irizarry. Overcoming bias and systematic errors in next generation sequencing data. Genome medicine, 2:1–5, 2010.
- [30] Ana Conesa, Pedro Madrigal, Sonia Tarazona, David Gomez-Cabrero, Alejandra Cervera, Andrew McPherson, Michał Wojciech Szcześniak, Daniel J Gaffney, Laura L Elo, Xuegong Zhang, et al. A survey of best practices for rna-seq data analysis. Genome biology, 17:1–19, 2016.
- [31] Guangzhan Zhang, Menglu Li, Huan Deng, Xinran Xu, Xuan Liu, and Wen Zhang. Sgnnmd: signed graph neural network for predicting deregulation types of mirna-disease associations. Briefings in Bioinformatics, 23(1):bbab464, 2022.
- [32] Xiaosa Zhao, Xiaowei Zhao, and Minghao Yin. Heterogeneous graph attention network based on meta-paths for lncrna–disease association prediction. Briefings in bioinformatics, 23(1):bbab407, 2022.
- [33] Hongda Zhang, Hui Cui, Tiangang Zhang, Yangkun Cao, and Ping Xuan. Learning multi-scale heterogenous network topologies and various pairwise attributes for drug–disease association prediction. Briefings in Bioinformatics, 23(2):bbac009, 2022.
- [34] Lu-Xiang Guo, Lei Wang, Zhu-Hong You, Chang-Qing Yu, Meng-Lei Hu, Bo-Wei Zhao, and Yang Li. Biolinguistic graph fusion model for circrna–mirna association prediction. Briefings in Bioinformatics, 25(2):bbae058, 2024.
- [35] Qunzhuo Wu, Zhaohong Deng, Xiaoyong Pan, Hong-Bin Shen, Kup-Sze Choi, Shitong Wang, Jing Wu, and Dong-Jun Yu. Mdgf-mcec: a multi-view dual attention embedding model with cooperative ensemble learning for circrna-disease association prediction. Briefings in Bioinformatics, 23(5):bbac289, 2022.
- [36] Tyler Derr, Yao Ma, and Jiliang Tang. Signed graph convolutional networks. In 2018 IEEE International Conference on Data Mining (ICDM), pages 929–934. IEEE, 2018.
- [37] Zeyu Zhang, Jiamou Liu, Kaiqi Zhao, Song Yang, Xianda Zheng, and Yifei Wang. Contrastive learning for signed bipartite graphs. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 1629–1638, 2023.
- [38] Lu Li, Jiale Liu, Xingyu Ji, Maojun Wang, and Zeyu Zhang. Se-sgformer: A self-explainable signed graph transformer for link sign prediction. arXiv preprint arXiv:2408.08754, 2024.
- [39] Fang He, Jinhai Deng, Ruizhan Xue, Maojun Wang, and Zeyu Zhang. Mitigating degree bias in signed graph neural networks. arXiv preprint arXiv:2408.08508, 2024.
- [40] Zeyu Zhang, Lu Li, Shuyan Wan, Sijie Wang, Zhiyi Wang, Zhiyuan Lu, Dong Hao, and Wanli Li. Dropedge not foolproof: Effective augmentation method for signed graph neural networks. arXiv preprint arXiv:2409.19620, 2024.
- [41] Suhang Wang, Jiliang Tang, Charu Aggarwal, Yi Chang, and Huan Liu. Signed network embedding in social media. In Proceedings of the 2017 SIAM international conference on data mining, pages 327–335. SIAM, 2017.
- [42] Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.
- [43] Junjie Huang, Huawei Shen, Liang Hou, and Xueqi Cheng. Signed graph attention networks. In International Conference on Artificial Neural Networks, pages 566–577. Springer, 2019.
- [44] Yu Li, Yuan Tian, Jiawei Zhang, and Yi Chang. Learning signed network embedding via graph attention. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 4772–4779, 2020.
- [45] Junjie Huang, Huawei Shen, Liang Hou, and Xueqi Cheng. Sdgnn: Learning node representation for signed directed networks. arXiv preprint arXiv:2101.02390, 2021.
- [46] Lin Shu, Erxin Du, Yaomin Chang, Chuan Chen, Zibin Zheng, Xingxing Xing, and Shaofeng Shen. Sgcl: Contrastive representation learning for signed graphs. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management, pages 1671–1680, 2021.
- [47] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
- [48] Zeyu Zhang, Jiamou Liu, Xianda Zheng, Yifei Wang, Pengqian Han, Yupan Wang, Kaiqi Zhao, and Zijian Zhang. Rsgnn: A model-agnostic approach for enhancing the robustness of signed graph neural networks. In Proceedings of the ACM Web Conference 2023, pages 60–70, 2023.
- [49] Guang Yang, Yan Pan, Wenqiu Pan, Qingting Song, Ruoyu Zhang, Wei Tong, Licao Cui, Wanquan Ji, Weining Song, Baoxing Song, et al. Combined gwas and egwas reveals the genetic basis underlying drought tolerance in emmer wheat (triticum turgidum l.). New Phytologist, 242(5):2115–2131, 2024.
- [50] Peter J Bradbury, Zhiwu Zhang, Dallas E Kroon, Terry M Casstevens, Yogesh Ramdoss, and Edward S Buckler. Tassel: software for association mapping of complex traits in diverse samples. Bioinformatics, 23(19):2633–2635, 2007.
- [51] Christoph Lippert, Jennifer Listgarten, Ying Liu, Carl M Kadie, Robert I Davidson, and David Heckerman. Fast linear mixed models for genome-wide association studies. Nature methods, 8(10):833–835, 2011.
- [52] Maojun Wang, Lili Tu, Daojun Yuan, De Zhu, Chao Shen, Jianying Li, Fuyan Liu, Liuling Pei, Pengcheng Wang, Guannan Zhao, et al. Reference genome sequences of two cultivated allotetraploid cottons, gossypium hirsutum and gossypium barbadense. Nature genetics, 51(2):224–229, 2019.
- [53] Tugdem Muslu, Bala Ani Akpinar, Sezgi Biyiklioglu-Kaya, Meral Yuce, and Hikmet Budak. Comparative analysis of coding and non-coding features within insect tolerance loci in wheat with their homologs in cereal genomes. International Journal of Molecular Sciences, 22(22):12349, 2021.
- [54] Christiam Camacho, George Coulouris, Vahram Avagyan, Ning Ma, Jason Papadopoulos, Kevin Bealer, and Thomas L Madden. Blast+: architecture and applications. BMC bioinformatics, 10:1–9, 2009.
- [55] Lawrence Page. The pagerank citation ranking: Bringing order to the web. Technical report, Technical Report, 1999.
- [56] Zhaoming Wu, Charu C Aggarwal, and Jimeng Sun. The troll-trust model for ranking in signed networks. In Proceedings of the Ninth ACM international conference on Web Search and Data Mining, pages 447–456, 2016.
- [57] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. Graph attention networks. arXiv preprint arXiv:1710.10903, 2017.
- [58] Kihyuk Sohn. Improved deep metric learning with multi-class n-pair loss objective. Advances in neural information processing systems, 29, 2016.
- [59] Yanqiao Zhu, Yichen Xu, Feng Yu, Qiang Liu, Shu Wu, and Liang Wang. Deep graph contrastive representation learning. arXiv preprint arXiv:2006.04131, 2020.