Outputs and deliverables from the Galaxy project. WIP! Mostly stubs. Help wanted

Author(s): ggsc, Ross Lazarus
Overview
Creative Commons License: CC-BY
Questions:
  • What are the main identifiable project outputs?

  • How can Galaxy offer ‘free’ analysis computational resources?

  • How does the open source community support Galaxy?

  • How are new project outputs or resources created?

Objectives:
  • Understand Galaxy outputs and open science services

  • Understand how different communities work together to make things happen

  • Understand opportunities for engagement and contributing your skills

Time estimation: 10 minutes
Last modification: May 22, 2023
License: Tutorial Content is licensed under Creative Commons Attribution 4.0 International License. The GTN Framework is licensed under MIT
Comment: Note to contributors
  • Work in progress!! First draft to try to get a structure to make sense.
  • Needs many contributors to make it useful. What would you like to have known when you first tried getting things done in Galaxy? Please add what’s missing and fix what’s broken. Headings are mostly stubs waiting to be edited and extended.
  • Trying to describe the big picture will necessarily be big. Will probably need to break this already very wordy module into separate parts.
  • Add your story or stories to the stories tutorial too please!
  • Ross has strong opinions.
  • Many of them are probably wrong but he doesn’t know which ones yet.
  • Please feel free to contribute your own, to make this more useful to future readers.
Comment: Note to readers
  • The most important outputs of the Galaxy project are grouped arbitrarily here, and there are many overlaps, because Galaxy grows organically through collaborations, rather than by design.
  • The project continues to expand rapidly, so this module will need updating regularly.
  • The Hub provides much more detail about many of the same structures and their activities, but this material is designed to provide simplified views of the project, so the Hub becomes easier to navigate.
  • This is an attempt at a kind of field guide to the ecosystem generating those Hub activities, for the use of participants trying to navigate it.
Agenda: Field Guide Part 3. Project outputs and impact
  1. 3. Project outputs
    1. Source code
    2. Open science analysis services
    3. Capacity building: training resources and services
    4. Downstream impacts: Normalising transparent, shareable open science analyses
    5. Further reading

3. Project outputs

Galaxy source code is the core deliverable. Many other project activities build on it to add value. This makes the code an essential resource for the project as well as an output, but for clarity it is arbitrarily grouped as a deliverable in this guide.

The project-supported resources that matter most to researchers come from running the project source code as a web-accessible scientific workflow application. There are an unknown number of private deployments, and 100+ specialised public servers are listed on the Hub. For many researchers, the most important examples are the free usegalaxy.* services. These, in turn, depend on all the other parts of the project, including the source code, the people and the resources described in this guide.

The source code includes a pluggable, shareable tool infrastructure that allows third-party open source command-line analysis packages to be integrated. These tools can be installed to give each framework deployment a specific local “flavour”. In this way, generic project deliverables can be tailored to serve different kinds of science. The GTN supports integrated training resources that make the framework and tools even more useful and valuable to researchers. The Galaxy communities of practice illustrate how efficiently the project can be extended to serve the analysis needs of whole communities of researchers in new fields.
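To make the idea of a tool wrapper more concrete, here is a minimal sketch. A wrapper is an XML file that tells the framework how to build a command line for a packaged program and which inputs and outputs to expect. The tool id, name, version and the choice of `wc -l` below are invented for illustration and do not come from any real ToolShed repository. The Python code only parses the wrapper to show what it declares; it does not run anything in Galaxy.

```python
import xml.etree.ElementTree as ET

# Hypothetical minimal wrapper around the Unix `wc -l` command.
# The id, name and version are made up for this illustration.
WRAPPER_XML = """
<tool id="line_count_demo" name="Count lines (demo)" version="0.1.0">
    <description>count the lines in a text file</description>
    <command><![CDATA[wc -l < '$input' > '$output']]></command>
    <inputs>
        <param name="input" type="data" format="txt" label="Text file"/>
    </inputs>
    <outputs>
        <data name="output" format="txt"/>
    </outputs>
    <help>Counts the lines in the selected text dataset.</help>
</tool>
"""


def summarise_wrapper(xml_text):
    """Return the id, version, command template and input/output names."""
    root = ET.fromstring(xml_text.strip())
    return {
        "id": root.get("id"),
        "version": root.get("version"),
        "command": root.findtext("command").strip(),
        "inputs": [p.get("name") for p in root.findall("./inputs/param")],
        "outputs": [d.get("name") for d in root.findall("./outputs/data")],
    }


summary = summarise_wrapper(WRAPPER_XML)
print(summary)
```

Real wrappers are usually richer than this, with tests, dependency requirements and citations, and are typically checked with planemo before being shared.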

Source code

The core framework source code is supported by many other project repositories. Examples include tools for system administrators and developers, and the ToolShed code that provides tool distribution services. These are not very useful outside the project, since they are specific to Galaxy. Their impact on open science comes through their support for Galaxy.

Comment: Do we need to enumerate these?

Will most readers care about the details?

  • Generic analysis framework source code
  • ToolShed source code
  • Developer and system administrator utilities
    • planemo, ephemeris, ansible…
  • Tool wrappers to “flavour” framework instances
    • 8k+ tools of variable quality in the public ToolShed
    • Click to install from ToolShed, in any framework server
    • Many well maintained tools from IUC and communities of practice.
  • Your ideas here please?

Open science analysis services

  • “Free” project supported services
    • usegalaxy.*
    • Large Australian, European and US research infrastructure allocations
    • Tens of thousands of users
    • Professional user support
    • Stress testing framework code and tools
  • 100+ specialised public instances
  • Unknown number of private installations
  • Your ideas here please?

Capacity building: training resources and services

Providing training to build community capacity is an essential activity for the project, to ensure wide, well managed deployment and long term sustainability.

  • GTN user training, integrated directly into the Galaxy user interface, helps new users gain the skills they need to be productive and efficient.
  • Training system administrators helps support the public usegalaxy.* and the many private servers that operate in academic and commercial laboratories.
  • Training for external developers makes it easier for them to contribute efficiently, improving Galaxy code and wrapping new tools.
  • The Galaxy Training Network (GTN) is central to building community capability.
  • Offers free training to enhance global open science research capacity.
    • Generic aspects of using Galaxy for new users
    • Specific kinds of analyses with common types of open data.
    • System administrators are key to running reliable framework services
    • Software developers can contribute more efficiently with appropriate training
    • Trainers can learn how to prepare material for new GTN topics
  • Your ideas here please?

Downstream impacts: Normalising transparent, shareable open science analyses

Scientific scrutiny and trustworthiness: Problems with “black box” analyses.

An analysis is effectively a “black box” unless all of its scripts, package source code, assumptions, settings and methods can be readily shared and made accessible for independent replication. Without effective transparency, the scientific trustworthiness of the results cannot be routinely tested. It is said that many eyes make bugs shallow, but commercial or otherwise unshareable and opaque analyses cannot easily be scrutinised, so their results must be taken on trust.

Commercial or other “black box” analysis code and settings may be perfect. Unfortunately, experience suggests that all complex software contains errors, many of which can only be found after widespread and thorough independent scrutiny. This is as true of expensive commercial software as it is of open source software. The difference is that open source package assumptions, methods and code are readily accessible for review, testing and improvement. Open projects encourage and facilitate scrutiny and replication, in order to decrease the risks to scientific integrity and trustworthiness from hidden coding or methodological errors. Making any new analysis transparent and reproducible is a hard technical problem that is largely solved for Galaxy users, without requiring any special effort on their part.
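As a rough illustration of what “transparent and reproducible” requires in practice, the sketch below records the details someone else would need to rerun one analysis step: which tool, which version, which settings, and exactly which input data. It is a simplified stand-in written for this guide, not Galaxy's own data model, and the tool name and parameters are hypothetical.

```python
import hashlib
import json


def provenance_record(tool_id, tool_version, parameters, input_bytes):
    """Describe one analysis step so that someone else could rerun it."""
    return {
        "tool_id": tool_id,
        "tool_version": tool_version,
        "parameters": dict(sorted(parameters.items())),
        # A checksum identifies the exact input without copying the data.
        "input_sha256": hashlib.sha256(input_bytes).hexdigest(),
    }


def same_analysis(record_a, record_b):
    """Two steps are comparable only if every recorded detail matches."""
    return json.dumps(record_a, sort_keys=True) == json.dumps(record_b, sort_keys=True)


# Hypothetical tool name, versions and settings, for illustration only.
data = b"ACGT\nTTGA\n"
original = provenance_record("example_tool", "0.1.0", {"skip_header": False}, data)
replica = provenance_record("example_tool", "0.1.0", {"skip_header": False}, data)
changed = provenance_record("example_tool", "0.1.1", {"skip_header": False}, data)

print(same_analysis(original, replica))  # True
print(same_analysis(original, changed))  # False
```

In a “black box” analysis, one or more of these fields is unavailable to outside reviewers, so a comparison like this cannot be made at all.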

Transparent open science analysis for any researcher.

The downstream impact of Galaxy on open science analysis practice is important and probably large, but it is hard to measure because open science outputs are hard to identify.

  • Research outputs from Galaxy users in open science indirectly represent increased analysis productivity for researchers. This is a process measure, and it suggests that Galaxy is useful to them. If scientific discovery is the desired output, this measure is far removed from it, but it is at least a hopeful indicator of activity.
  • More than 10,000 publications of all types are another, more tangible and direct measure of project impact.
  • Access to efficient and reliable analysis methods for large, complex data resources probably increases their application in research, and thus their scientific value. Galaxy enables this for very large data collections in any scientific field, with configurable remote data sources. Measuring this impact is challenging, just as the opportunity costs of data lying idle because it is too hard to analyse are unknown.
  • Improved trustworthiness of shareable, replicable analyses has an important and lasting impact on open science. Again, this is very challenging to measure.
  • Analyses of provable scientific integrity are arguably the most important project deliverable
  • 10k+ publications
  • Tens of thousands of scientists trained
  • Millions of jobs run.
  • Your ideas here please?

Further reading

For more on the main components, and stories of how people get things done in the project, choose from the other lessons in this Topic: