{
  "metadata": {
  },
  "nbformat": 4,
  "nbformat_minor": 5,
  "cells": [
    {
      "id": "metadata",
      "cell_type": "markdown",
      "source": "<div style=\"border: 2px solid #8A9AD0; margin: 1em 0.2em; padding: 0.5em;\">\n\n# Python - Globbing\n\nby [Helena Rasche](https://training.galaxyproject.org/hall-of-fame/hexylena/), [Donny Vrins](https://training.galaxyproject.org/hall-of-fame/dirowa/), [Bazante Sanders](https://training.galaxyproject.org/hall-of-fame/bazante1/)\n\nCC-BY licensed content from the [Galaxy Training Network](https://training.galaxyproject.org/)\n\n**Questions**\n\n- How can I collect a list of files?\n\n**Objectives**\n\n- Use glob to collect a list of files\n- Learn about the potential pitfalls of glob\n\n**Time Estimation: 15M**\n</div>\n",
      "metadata": {
        "editable": false,
        "collapsed": false
      }
    },
    {
      "id": "cell-0",
      "source": "<p>Globbing is the computer science term for listing all the files whose names match some pattern.</p>\n<blockquote class=\"agenda\" style=\"border: 2px solid #86D486;display: none; margin: 1em 0.2em\">\n<div class=\"box-title\" aria-label=\"agenda box: Agenda\" style=\"font-size: 150%\"> Agenda</div>\n<p>In this tutorial, we will cover:</p>\n<ol id=\"markdown-toc\">\n<li><a href=\"#setup\" id=\"markdown-toc-setup\">Setup</a></li>\n</ol>\n</blockquote>\n<h1 id=\"setup\">Setup</h1>\n<p>We’ll start by creating some files for use in the rest of this tutorial.</p>\n",
      "cell_type": "markdown",
      "metadata": {
        "editable": false,
        "collapsed": false
      }
    },
    {
      "id": "cell-1",
      "source": [
        "import os\n",
        "\n",
        "dirs = ['a', 'a/b', 'c', 'c/e', 'd', '.']\n",
        "files = ['a.txt', 'a.csv', 'b.csv', 'b.txt', 'e.glm']\n",
        "\n",
        "for d in dirs:\n",
        "    # Create the directory (and any missing parent directories)\n",
        "    os.makedirs(d, exist_ok=True)\n",
        "    # Create an empty file for each name; append mode leaves existing files untouched\n",
        "    # and, unlike calling the `touch` command, also works on Windows\n",
        "    for f in files:\n",
        "        open(os.path.join(d, f), 'a').close()"
      ],
      "cell_type": "code",
      "execution_count": null,
      "outputs": [

      ],
      "metadata": {
        "attributes": {
          "classes": [
            "python"
          ],
          "id": ""
        }
      }
    },
    {
      "id": "cell-2",
      "source": "<p>Now we should have a pretty full folder!</p>\n<h1 id=\"finding-files\">Finding Files</h1>\n<p>We can use the glob module to find files:</p>\n",
      "cell_type": "markdown",
      "metadata": {
        "editable": false,
        "collapsed": false
      }
    },
    {
      "id": "cell-3",
      "source": [
        "import glob\n",
        "print(glob.glob('*.csv'))\n",
        "print(glob.glob('*.txt'))"
      ],
      "cell_type": "code",
      "execution_count": null,
      "outputs": [

      ],
      "metadata": {
        "attributes": {
          "classes": [
            "python"
          ],
          "id": ""
        }
      }
    },
    {
      "id": "cell-4",
      "source": "<p>Here we use an asterisk (<code class=\"language-plaintext highlighter-rouge\">*</code>) as a wildcard: it matches any run of characters within a single file or folder name, but it never crosses into folders. The two calls above list every <code style=\"color: inherit\">csv</code> and <code style=\"color: inherit\">txt</code> file in the current directory, which makes glob a convenient way to find files matching a pattern.</p>\n<p>The asterisk can go anywhere in the pattern; it doesn’t have to replace the part before the extension:</p>\n",
      "cell_type": "markdown",
      "metadata": {
        "editable": false,
        "collapsed": false
      }
    },
    {
      "id": "cell-5",
      "source": [
        "print(glob.glob('a*'))"
      ],
      "cell_type": "code",
      "execution_count": null,
      "outputs": [

      ],
      "metadata": {
        "attributes": {
          "classes": [
            "python"
          ],
          "id": ""
        }
      }
    },
    {
      "id": "cell-6",
      "source": "<p>Here we even see a third entry: the directory.</p>\n<h1 id=\"finding-files-in-directories\">Finding files in directories</h1>\n<p>Until now we’ve found only files in a single top level directory, but what if we wanted to find files in subdirectories?</p>\n<p>Only need a single directory? Just include that!</p>\n",
      "cell_type": "markdown",
      "metadata": {
        "editable": false,
        "collapsed": false
      }
    },
    {
      "id": "cell-7",
      "source": [
        "print(glob.glob('a/*.csv'))"
      ],
      "cell_type": "code",
      "execution_count": null,
      "outputs": [

      ],
      "metadata": {
        "attributes": {
          "classes": [
            "python"
          ],
          "id": ""
        }
      }
    },
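    {
      "id": "cell-8",
      "source": "<p>But if you need more levels, or want to look in <em>all</em> folders, then you need the double wildcard! With two asterisks <code style=\"color: inherit\">**</code> and the <code style=\"color: inherit\">recursive=True</code> argument, we can search recursively through directories for files. Without <code style=\"color: inherit\">recursive=True</code>, <code style=\"color: inherit\">**</code> behaves just like a single <code style=\"color: inherit\">*</code>.</p>\n",
      "cell_type": "markdown",
      "metadata": {
        "editable": false,
        "collapsed": false
      }
    },
    {
      "id": "cell-9",
      "source": [
        "print(glob.glob('**/a.csv', recursive=True))"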
      ],
      "cell_type": "code",
      "execution_count": null,
      "outputs": [

      ],
      "metadata": {
        "attributes": {
          "classes": [
            "python"
          ],
          "id": ""
        }
      }
    },
    {
      "id": "cell-10",
      "source": "<h1 id=\"exercise\">Exercise</h1>\n<blockquote class=\"question\" style=\"border: 2px solid #8A9AD0; margin: 1em 0.2em\">\n<div class=\"box-title\" aria-label=\"question box: Question: Where in the world is the CSV?\" style=\"font-size: 150%\">❓ Question: Where in the world is the CSV?</div>\n<ol>\n<li>How would you find all <code style=\"color: inherit\">.csv</code> files?</li>\n<li>How would you find all <code style=\"color: inherit\">.txt</code> files?</li>\n<li>How would you find all files starting with the letter ‘e’?</li>\n</ol>\n<br/><details style=\"border: 2px solid #B8C3EA; margin: 1em 0.2em; padding: 0.5em;\"><summary>👁 View solution</summary>\n<div class=\"box-title\" aria-label=\"solution box: Solution\" style=\"font-size: 150%\">👁 Solution</div>\n<ol>\n<li><code style=\"color: inherit\">glob.glob('**/*.csv', recursive=True)</code></li>\n<li><code style=\"color: inherit\">glob.glob('**/*.txt', recursive=True)</code></li>\n<li><code style=\"color: inherit\">glob.glob('**/e*', recursive=True)</code></li>\n</ol>\n</details>\n</blockquote>\n",
      "cell_type": "markdown",
      "metadata": {
        "editable": false,
        "collapsed": false
      }
    },
    {
      "id": "cell-11",
      "source": [
        "# Try things out here!"
      ],
      "cell_type": "code",
      "execution_count": null,
      "outputs": [

      ],
      "metadata": {
        "attributes": {
          "classes": [
            "python"
          ],
          "id": ""
        }
      }
    },
    {
      "id": "cell-12",
      "source": "<h1 id=\"pitfalls\">Pitfalls</h1>\n<p>Some analyses (especially simulations) can depend on the order in which their input data arrives. This was recently seen in <span class=\"citation\"><a href=\"#Bhandari_Neupane_2019\">Neupane <i>et al.</i> 2019</a></span> where the data files used were sorted one way on Windows, and another on Linux, resulting in different results for the same code and the same datasets! Yikes!</p>\n<p><code style=\"color: inherit\">glob</code> makes no promise about the order of the list it returns. If you know your analyses are dependent on file ordering, then you can use <code style=\"color: inherit\">sorted()</code> to make sure the data is provided in a uniform way every time.</p>\n",
      "cell_type": "markdown",
      "metadata": {
        "editable": false,
        "collapsed": false
      }
    },
    {
      "id": "cell-13",
      "source": [
        "print(sorted(glob.glob('**/a.csv', recursive=True)))"
      ],
      "cell_type": "code",
      "execution_count": null,
      "outputs": [

      ],
      "metadata": {
        "attributes": {
          "classes": [
            "python"
          ],
          "id": ""
        }
      }
    },
    {
      "id": "cell-14",
      "source": "<p>If you’re not sure whether your results depend on file order, you can sort anyway. Or better yet, randomise the list of inputs to make sure your code behaves properly in any scenario.</p>\n",
      "cell_type": "markdown",
      "metadata": {
        "editable": false,
        "collapsed": false
      }
    },
    {
      "cell_type": "markdown",
      "id": "final-ending-cell",
      "metadata": {
        "editable": false,
        "collapsed": false
      },
      "source": [
        "# Key Points\n\n",
        "- If your analysis depends on file order, sort your globs!\n",
        "\n# Congratulations on successfully completing this tutorial!\n\n",
        "Please [fill out the feedback on the GTN website](https://training.galaxyproject.org/training-material/topics/data-science/tutorials/python-glob/tutorial.html#feedback) and check there for further resources!\n"
      ]
    }
  ]
}