Datasets - Spektral

<!DOCTYPE html>   <html class="no-js" lang="en" >  <head> <meta charset="utf-8"> <meta http-equiv="X-UA-Compatible" content="IE=edge"> <meta name="viewport" content="width=device-width, initial-scale=1.0"> <meta name="author" content="Daniele Grattarola"> <link rel="canonical" href="https://graphneural.network/datasets/"> <link rel="shortcut icon" href="../img/favicon.ico"> <title>Datasets - Spektral</title> <link rel="stylesheet" href="https://fonts.googleapis.com/css?family=Lato:400,700|Roboto+Slab:400,700|Inconsolata:400,700" /> <link rel="stylesheet" href="../css/theme.css" /> <link rel="stylesheet" href="../css/theme_extra.css" /> <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/highlight.js/10.5.0/styles/github.min.css" /> <link href="../stylesheets/extra.css" rel="stylesheet" /> <script> // Current page data var mkdocs_page_name = "Datasets"; var mkdocs_page_input_path = "datasets.md"; var mkdocs_page_url = "/datasets/"; </script> <script src="../js/jquery-2.1.1.min.js" defer></script> <script src="../js/modernizr-2.8.3.min.js" defer></script> <script src="https://cdnjs.cloudflare.com/ajax/libs/highlight.js/10.5.0/highlight.min.js"></script> <script>hljs.initHighlightingOnLoad();</script> <script> (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){ (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o), m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m) })(window,document,'script','https://www.google-analytics.com/analytics.js','ga'); ga('create', 'UA-125823175-1', 'auto'); ga('send', 'pageview'); </script> </head> <body class="wy-body-for-nav" role="document"> <div class="wy-grid-for-nav"> <nav data-toggle="wy-nav-shift" class="wy-nav-side stickynav"> <div class="wy-side-scroll"> <div class="wy-side-nav-search"> <a href=".." 
class="icon icon-home"> Spektral</a> <div role="search"> <form id ="rtd-search-form" class="wy-form" action="../search.html" method="get"> <input type="text" name="q" placeholder="Search docs" title="Type search term here" /> </form> </div> </div> <div class="wy-menu wy-menu-vertical" data-spy="affix" role="navigation" aria-label="main navigation"> <ul> <li class="toctree-l1"><a class="reference internal" href="..">Home</a> </li> </ul> <p class="caption"><span class="caption-text">Tutorials</span></p> <ul> <li class="toctree-l1"><a class="reference internal" href="../getting-started/">Getting started</a> </li> <li class="toctree-l1"><a class="reference internal" href="../data-modes/">Data modes</a> </li> <li class="toctree-l1"><a class="reference internal" href="../creating-dataset/">Creating a dataset</a> </li> <li class="toctree-l1"><a class="reference internal" href="../creating-layer/">Creating a layer</a> </li> <li class="toctree-l1"><a class="reference internal" href="../examples/">Examples</a> </li> </ul> <p class="caption"><span class="caption-text">Layers</span></p> <ul> <li class="toctree-l1"><a class="reference internal" href="../layers/convolution/">Convolutional layers</a> </li> <li class="toctree-l1"><a class="reference internal" href="../layers/pooling/">Pooling layers</a> </li> <li class="toctree-l1"><a class="reference internal" href="../layers/base/">Base layers</a> </li> <li class="toctree-l1"><a class="reference internal" href="../models/">Models</a> </li> </ul> <p class="caption"><span class="caption-text">Data</span></p> <ul class="current"> <li class="toctree-l1"><a class="reference internal" href="../data/">Containers</a> </li> <li class="toctree-l1 current"><a class="reference internal current" href="./">Datasets</a> <ul class="current"> <li class="toctree-l2"><a class="reference internal" href="#citation">Citation</a> </li> <li class="toctree-l2"><a class="reference internal" href="#dblp">DBLP</a> </li> <li class="toctree-l2"><a 
class="reference internal" href="#flickr">Flickr</a> </li> <li class="toctree-l2"><a class="reference internal" href="#graphsage">GraphSage</a> </li> <li class="toctree-l2"><a class="reference internal" href="#ppi">PPI</a> </li> <li class="toctree-l2"><a class="reference internal" href="#reddit">Reddit</a> </li> <li class="toctree-l2"><a class="reference internal" href="#mnist">MNIST</a> </li> <li class="toctree-l2"><a class="reference internal" href="#modelnet">ModelNet</a> </li> <li class="toctree-l2"><a class="reference internal" href="#ogb">OGB</a> </li> <li class="toctree-l2"><a class="reference internal" href="#qm7">QM7</a> </li> <li class="toctree-l2"><a class="reference internal" href="#qm9">QM9</a> </li> <li class="toctree-l2"><a class="reference internal" href="#tudataset">TUDataset</a> </li> </ul> </li> <li class="toctree-l1"><a class="reference internal" href="../loaders/">Loaders</a> </li> <li class="toctree-l1"><a class="reference internal" href="../transforms/">Transforms</a> </li> </ul> <p class="caption"><span class="caption-text">Utils</span></p> <ul> <li class="toctree-l1"><a class="reference internal" href="../utils/convolution/">Convolution</a> </li> <li class="toctree-l1"><a class="reference internal" href="../utils/sparse/">Sparse</a> </li> <li class="toctree-l1"><a class="reference internal" href="../utils/misc/">Miscellaneous</a> </li> </ul> <p class="caption"><span class="caption-text">Other</span></p> <ul> <li class="toctree-l1"><a class="reference internal" href="../external/">External resources</a> </li> <li class="toctree-l1"><a class="reference internal" href="../about/">About</a> </li> </ul> </div> </div> </nav> <section data-toggle="wy-nav-shift" class="wy-nav-content-wrap"> <nav class="wy-nav-top" role="navigation" aria-label="top navigation"> <i data-toggle="wy-nav-top" class="fa fa-bars"></i> <a href="..">Spektral</a> </nav> <div class="wy-nav-content"> <div class="rst-content"> <div role="navigation" aria-label="breadcrumbs 
navigation"> <ul class="wy-breadcrumbs"> <li><a href="..">Docs</a> »</li> <li>Data »</li> <li>Datasets</li> <li class="wy-breadcrumbs-aside"> </li> </ul> <hr/> </div> <div role="main"> <div class="section"> <h2 id="datasets">Datasets</h2> <p>This module provides benchmark datasets for graph-level and node-level prediction. Datasets are automatically downloaded and saved locally on first use. You can configure the path where the data are stored by creating a <code>~/.spektral/config.json</code> file with the following content:</p> <pre><code class="language-json">{ "dataset_folder": "/path/to/dataset/folder" } </code></pre> <p><span style="float:right;"><a href="https://github.com/danielegrattarola/spektral/blob/master/spektral/datasets/citation.py#L15">[source]</a></span></p> <h4 id="citation">Citation</h4> <pre><code class="language-python">spektral.datasets.citation.Citation(name, random_split=False, normalize_x=False, dtype=&lt;class 'numpy.float32'&gt;) </code></pre> <p>The citation datasets Cora, Citeseer and Pubmed.</p> <p>Node attributes are bag-of-words vectors representing the most common words in the text document associated with each node. Two papers are connected if either one cites the other. Labels represent the subject area of the paper.</p> <p>The train, validation, and test splits are given as binary masks and are accessible via the <code>mask_tr</code>, <code>mask_va</code>, and <code>mask_te</code> attributes, respectively.</p> <p><strong>Arguments</strong></p> <ul> <li><code>name</code>: name of the dataset to load (<code>'cora'</code>, <code>'citeseer'</code>, or <code>'pubmed'</code>);</li> <li><code>random_split</code>: if True, return a randomized split (20 nodes per class for training, 30 nodes per class for validation and the remaining nodes for testing, as recommended by <a href="https://arxiv.org/abs/1811.05868">Shchur et al. (2018)</a>). 
If False (default), return the "Planetoid" public splits defined by <a href="https://arxiv.org/abs/1603.08861">Yang et al. (2016)</a>.</li> <li><code>normalize_x</code>: if True, normalize the features.</li> <li><code>dtype</code>: numpy dtype of graph data.</li> </ul> <hr /> <p><span style="float:right;"><a href="https://github.com/danielegrattarola/spektral/blob/master/spektral/datasets/dblp.py#L17">[source]</a></span></p> <h4 id="dblp">DBLP</h4> <pre><code class="language-python">spektral.datasets.dblp.DBLP(normalize_x=False, dtype=&lt;class 'numpy.float32'&gt;) </code></pre> <p>A subset of the DBLP computer science bibliography website, as collected in the <a href="https://arxiv.org/abs/2002.01680">Fu et al. (2020)</a> paper.</p> <p><strong>Arguments</strong></p> <ul> <li><code>normalize_x</code>: if True, normalize the features.</li> <li><code>dtype</code>: numpy dtype of graph data.</li> </ul> <hr /> <p><span style="float:right;"><a href="https://github.com/danielegrattarola/spektral/blob/master/spektral/datasets/flickr.py#L14">[source]</a></span></p> <h4 id="flickr">Flickr</h4> <pre><code class="language-python">spektral.datasets.flickr.Flickr(normalize_x=False, dtype=&lt;class 'numpy.float32'&gt;) </code></pre> <p>The Flickr dataset from the <a href="https://arxiv.org/abs/1907.04931">Zeng et al. 
(2019)</a> paper, containing descriptions and common properties of images.</p> <p><strong>Arguments</strong></p> <ul> <li><code>normalize_x</code>: if True, normalize the features.</li> <li><code>dtype</code>: numpy dtype of graph data.</li> </ul> <hr /> <p><span style="float:right;"><a href="https://github.com/danielegrattarola/spektral/blob/master/spektral/datasets/graphsage.py#L15">[source]</a></span></p> <h4 id="graphsage">GraphSage</h4> <pre><code class="language-python">spektral.datasets.graphsage.GraphSage(name) </code></pre> <p>The datasets used in the paper</p> <blockquote> <p><a href="https://arxiv.org/abs/1706.02216">Inductive Representation Learning on Large Graphs</a><br> William L. Hamilton et al.</p> </blockquote> <p>The PPI dataset (originally <a href="https://www.ncbi.nlm.nih.gov/pubmed/16381927">Stark et al. (2006)</a>) for inductive node classification uses positional gene sets, motif gene sets and immunological signatures as features and gene ontology sets as labels.</p> <p>The Reddit dataset consists of a graph made of Reddit posts in the month of September, 2014. The label for each node is the community that a post belongs to. The graph is built by sampling 50 large communities and two nodes are connected if the same user commented on both. 
Node features are obtained by concatenating the average GloVe CommonCrawl vectors of the title and comments, the post's score and the number of comments.</p> <p>The train, validation, and test splits are given as binary masks and are accessible via the <code>mask_tr</code>, <code>mask_va</code>, and <code>mask_te</code> attributes, respectively.</p> <p><strong>Arguments</strong></p> <ul> <li><code>name</code>: name of the dataset to load (<code>'ppi'</code> or <code>'reddit'</code>).</li> </ul> <hr /> <p><span style="float:right;"><a href="https://github.com/danielegrattarola/spektral/blob/master/spektral/datasets/graphsage.py#L119">[source]</a></span></p> <h4 id="ppi">PPI</h4> <pre><code class="language-python">spektral.datasets.graphsage.PPI() </code></pre> <p>Alias for <code>GraphSage('ppi')</code>.</p> <hr /> <p><span style="float:right;"><a href="https://github.com/danielegrattarola/spektral/blob/master/spektral/datasets/graphsage.py#L128">[source]</a></span></p> <h4 id="reddit">Reddit</h4> <pre><code class="language-python">spektral.datasets.graphsage.Reddit() </code></pre> <p>Alias for <code>GraphSage('reddit')</code>.</p> <hr /> <p><span style="float:right;"><a href="https://github.com/danielegrattarola/spektral/blob/master/spektral/datasets/mnist.py#L11">[source]</a></span></p> <h4 id="mnist">MNIST</h4> <pre><code class="language-python">spektral.datasets.mnist.MNIST(p_flip=0.0, k=8) </code></pre> <p>The MNIST images used as node features for a grid graph, as described by <a href="https://arxiv.org/abs/1606.09375">Defferrard et al. (2016)</a>.</p> <p>This dataset is a graph signal classification task, where graphs are represented in mixed mode: one adjacency matrix, many instances of node features.</p> <p>For efficiency, the adjacency matrix is stored in a special attribute of the dataset and the Graphs only contain the node features. 
You can access the adjacency matrix via the <code>a</code> attribute.</p> <p>The node features of each graph are the MNIST digits vectorized and rescaled to [0, 1]. Two nodes are connected if they are neighbours on the grid. Labels represent the MNIST class associated with each sample.</p> <p><strong>Note:</strong> the last 10000 samples are the default test set of the MNIST dataset.</p> <p><strong>Arguments</strong></p> <ul> <li><code>p_flip</code>: if greater than 0, edges are randomly flipped from 0 to 1 or vice versa with that probability.</li> <li><code>k</code>: number of neighbours of each node.</li> </ul> <hr /> <p><span style="float:right;"><a href="https://github.com/danielegrattarola/spektral/blob/master/spektral/datasets/modelnet.py#L14">[source]</a></span></p> <h4 id="modelnet">ModelNet</h4> <pre><code class="language-python">spektral.datasets.modelnet.ModelNet(name, test=False, n_jobs=-1) </code></pre> <p>The ModelNet10 and ModelNet40 CAD models datasets from the paper:</p> <blockquote> <p><a href="https://arxiv.org/abs/1406.5670">3D ShapeNets: A Deep Representation for Volumetric Shapes</a><br> Zhirong Wu et al.</p> </blockquote> <p>Each graph represents a CAD model belonging to one of 10 (or 40) categories.</p> <p>The models are polygon meshes: the node attributes are the 3D coordinates of the vertices, and edges are computed from each face. 
Duplicate edges are ignored and the adjacency matrix is binary.</p> <p>The datasets are pre-split into training and test sets: the <code>test</code> flag controls which split is loaded.</p> <p><strong>Arguments</strong></p> <ul> <li><code>name</code>: name of the dataset to load ('10' or '40');</li> <li><code>test</code>: if True, load the test set instead of the training set.</li> <li><code>n_jobs</code>: number of CPU cores to use for reading the data (-1, to use all available cores).</li> </ul> <hr /> <p><span style="float:right;"><a href="https://github.com/danielegrattarola/spektral/blob/master/spektral/datasets/ogb.py#L7">[source]</a></span></p> <h4 id="ogb">OGB</h4> <pre><code class="language-python">spektral.datasets.ogb.OGB(dataset) </code></pre> <p>Wrapper for datasets from the <a href="https://ogb.stanford.edu/">Open Graph Benchmark (OGB)</a>.</p> <p><strong>Arguments</strong></p> <ul> <li><code>dataset</code>: an OGB library-agnostic dataset.</li> </ul> <hr /> <p><span style="float:right;"><a href="https://github.com/danielegrattarola/spektral/blob/master/spektral/datasets/qm7.py#L12">[source]</a></span></p> <h4 id="qm7">QM7</h4> <pre><code class="language-python">spektral.datasets.qm7.QM7() </code></pre> <p>The QM7b dataset of molecules from the paper:</p> <blockquote> <p><a href="https://arxiv.org/abs/1703.00564">MoleculeNet: A Benchmark for Molecular Machine Learning</a><br> Zhenqin Wu et al.</p> </blockquote> <p>The dataset has no node features. 
Edges and edge features are obtained from the Coulomb matrices of the molecules.</p> <p>Each graph has a 14-dimensional label for regression.</p> <hr /> <p><span style="float:right;"><a href="https://github.com/danielegrattarola/spektral/blob/master/spektral/datasets/qm9.py#L17">[source]</a></span></p> <h4 id="qm9">QM9</h4> <pre><code class="language-python">spektral.datasets.qm9.QM9(amount=None, n_jobs=1) </code></pre> <p>The QM9 chemical data set of small molecules.</p> <p>In this dataset, nodes represent atoms and edges represent chemical bonds. There are 5 possible atom types (H, C, N, O, F) and 4 bond types (single, double, triple, aromatic).</p> <p>Node features represent the chemical properties of each atom and include:</p> <ul> <li>The atomic number, one-hot encoded;</li> <li>The atom's position in the X, Y, and Z dimensions;</li> <li>The atomic charge;</li> <li>The mass difference from the monoisotope.</li> </ul> <p>The edge features represent the type of chemical bond between two atoms, one-hot encoded.</p> <p>Each graph has a 19-dimensional label for regression.</p> <p><strong>Arguments</strong></p> <ul> <li><code>amount</code>: int, load this many molecules instead of the full dataset (useful for debugging).</li> <li><code>n_jobs</code>: number of CPU cores to use for reading the data (-1, to use all available cores).</li> </ul> <hr /> <p><span style="float:right;"><a href="https://github.com/danielegrattarola/spektral/blob/master/spektral/datasets/tudataset.py#L16">[source]</a></span></p> <h4 id="tudataset">TUDataset</h4> <pre><code class="language-python">spektral.datasets.tudataset.TUDataset(name, clean=False) </code></pre> <p>The Benchmark Data Sets for Graph Kernels from TU Dortmund (<a href="https://chrsmrrs.github.io/datasets/docs/datasets/">link</a>).</p> <p>Node features are computed by concatenating the following features for each node:</p> <ul> <li>node attributes, if available;</li> <li>node labels, if available, one-hot encoded.</li> </ul> 
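<p>As a rough illustration of the concatenation described above, the following sketch builds a node feature matrix from optional attributes and optional integer labels. It is not Spektral's implementation: the helper names <code>one_hot</code> and <code>build_node_features</code> and the toy inputs are hypothetical, and plain Python lists are used to keep the example dependency-free.</p>
<pre><code class="language-python"># Hypothetical helpers, for illustration only (not part of Spektral).

def one_hot(label, n_classes):
    """Return a one-hot list of length n_classes with a 1.0 at index label."""
    return [1.0 if i == label else 0.0 for i in range(n_classes)]


def build_node_features(attributes=None, labels=None, n_classes=None):
    """Concatenate, per node, the attributes and the one-hot encoded label.

    Either input may be None; if both are None, no features can be built
    and None is returned.
    """
    if attributes is None and labels is None:
        return None
    n_nodes = len(attributes) if attributes is not None else len(labels)
    if labels is not None and n_classes is None:
        # Assume labels are integers in the range 0..max(labels).
        n_classes = max(labels) + 1
    features = []
    for i in range(n_nodes):
        row = []
        if attributes is not None:
            row.extend(float(v) for v in attributes[i])
        if labels is not None:
            row.extend(one_hot(labels[i], n_classes))
        features.append(row)
    return features


# Toy graph with 3 nodes, 2 attributes per node, and 3 label classes.
attributes = [[0.5, 1.0], [0.1, 0.2], [0.3, 0.9]]
labels = [0, 2, 1]
x = build_node_features(attributes, labels)
print(len(x), len(x[0]))  # 3 nodes, 2 + 3 = 5 features each
</code></pre>
<p>Calling the helper with neither attributes nor labels returns <code>None</code>, which corresponds to a dataset that provides no node features.</p>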
<p>Some datasets might not have node features at all. In this case, attempting to use the dataset with a Loader will result in a crash. You can create node features using some of the transforms available in <code>spektral.transforms</code> or you can define your own features by accessing the individual samples in the <code>graph</code> attribute of the dataset (which is a list of <code>Graph</code> objects).</p> <p>Edge features are computed by concatenating the following features for each edge:</p> <ul> <li>edge attributes, if available;</li> <li>edge labels, if available, one-hot encoded.</li> </ul> <p>Graph labels are provided for each dataset.</p> <p>Specific details about each individual dataset can be found in <code>~/spektral/datasets/TUDataset/&lt;dataset name&gt;/README.md</code>, after the dataset has been downloaded locally (datasets are downloaded automatically upon calling <code>TUDataset('&lt;dataset name&gt;')</code> the first time).</p> <p><strong>Arguments</strong></p> <ul> <li><code>name</code>: str, name of the dataset to load (see <code>TUD.available_datasets</code>).</li> <li><code>clean</code>: if <code>True</code>, load a version of the dataset with no isomorphic graphs.</li> </ul> </div> </div> <footer> <div class="rst-footer-buttons" role="navigation" aria-label="footer navigation"> <a href="../loaders/" class="btn btn-neutral float-right" title="Loaders">Next <span class="icon icon-circle-arrow-right"></span></a> <a href="../data/" class="btn btn-neutral" title="Containers"><span class="icon icon-circle-arrow-left"></span> Previous</a> </div> <hr/> <div role="contentinfo">  </div> Built with <a href="https://www.mkdocs.org/">MkDocs</a> using a <a href="https://github.com/snide/sphinx_rtd_theme">theme</a> provided by <a href="https://readthedocs.org">Read the Docs</a>. 
</footer> </div> </div> </section> </div> <div class="rst-versions" role="note" aria-label="versions"> <span class="rst-current-version" data-toggle="rst-current-version"> <span> <a href="https://github.com/danielegrattarola/spektral/" class="fa fa-github" style="color: #fcfcfc"> GitHub</a> </span> <span><a href="../data/" style="color: #fcfcfc">« Previous</a></span> <span><a href="../loaders/" style="color: #fcfcfc">Next »</a></span> </span> </div> <script>var base_url = '..';</script> <script src="../js/theme_extra.js" defer></script> <script src="../js/theme.js" defer></script> <script src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.0/MathJax.js?config=TeX-AMS-MML_HTMLorMML" defer></script> <script src="../js/macros.js" defer></script> <script src="../search/main.js" defer></script> <script defer> window.onload = function () { SphinxRtdTheme.Navigation.enable(true); }; </script> </body> </html>
