Identification of intrachromosomal communities

1. cluster_for_hic

After preparing the Hi-C interaction matrices, clustering algorithm (ModularityOptimizer.jar) is used to identify network communities.

dna.cluster_for_hic(modularity_function = 1, resolution_parameter = 1.0, optimization_algorithm = 3, n_random_starts = 10, n_iterations = 100, random_seed = 0, print_output = 1)

Arguments

Optional arguments:

  • modularity_function: The modularity function for clustering method where 1 = standard and 2 = alternative. The default is 1.

  • resolution_parameter: The resolution of clustering method. The default is 1.0.

  • optimization_algorithm: The selection of optimization algorithm where 1 = original Louvain algorithm, 2 = Louvain algorithm with multilevel refinement, and 3 = SLM algorithm. The default is 3.

  • n_random_starts: The number of random starts for clustering. The default is 10.

  • n_iterations: The number of iterations per random start. The default is 100.

  • random_seed: The seed of the random number generator. The default is 0.

  • print_output: Whether or not to print output to the console (0 = no; 1 = yes). The default is 1.

Data path

output data path: Then, cluter results like nodes and communities are organized into .tsv format, while networks are transformed into .json format. All these results can be found in /out_data_folder/hic_data/resolution/hic_community_data.

dna.cluster_to_json()

2. estimate_community_size

The next function is to find the cutoff value for valid community.

Note

This function requires the first two functions finished (dna.cluster_for_hic() and dna.cluster_to_json()). It is recommended to get the optimal cutoff by using all 23 chromosomes, but again, this is more time consuming.

dna.estimate_community_size(cohort, chromosome, cutoff4Proportion = 0.02, fig_dpi = 300)

Arguments

Required arguments:

  • cohort: The dataset you want to investigate.

  • chromosome: The chromosome you want to investigate, i.e. [‘chr1’, ‘chr2’, …]. If you want to check all the chromosome, please use ‘whole_genome’. Whole genome estimation is recommended but time consuming.

Optional arguments:

  • cutoff4Proportion: The proportion is used to determine the minimal size of valid community. The default is 0.02.

  • fig_dpi: Figure resolution in dots per inch. The default is 300.

3. cluster_community_structure

Next, valid communities should be identified in all clusters. Hi-C interactions and other genomic features within each community need to be analyzed.

dna.cluster_community_structure(minCluster_size = 20, fig_dpi = 300)

Arguments

Optional arguments:

  • minCluster_size: The minimum size (or number of interactions) of valid communities. It is recommeded to use module estimate_community_size to identify the optimal value at a specific window resolution. The default is 20. Value smaller than 3 is not accepted.

  • fig_dpi: Figure resolution in dots per inch. The default is 300.

Data path

output data path: Through this, a summary of interactions and modularity score of network clustering and valid communities are exported to a table located in /out_data_folder/hic_data/resolution/hic_community_figures.

4. cluster_super_network

Furthermore, a super network is constructed according to the clustering results, where the node size and the edge width indicate the community size and the number of community interactions.

dna.cluster_super_network(minCluster_size = 20, fig_dpi = 300)

Arguments

Optional arguments:

  • minCluster_size: The minimum size (or number of interactions) of valid communities. It should be consistent with the minCluster_size in cluster_community_structure. The default is 20. Value smaller than 3 is not accepted.

  • fig_dpi: Figure resolution in dots per inch. The default is 300.

Data path

output data path: The exported super networks can be found in /out_data_folder/hic_data/resolution/hic_community_figures.