Git is a command-line tool (and a file-based version control database). As we have seen in previous pracs, you can use it entirely on your local computer.
GitHub is a website that runs Git in the cloud to help you host a repository. There are other similar sites, like GitLab and Bitbucket.
A centrally hosted repo on the internet (with authorization) is useful as you can access it from many computers.
These sites also provide many other features to help you run software projects, like issue trackers, wikis and more.
Here are some repositories of famous bioinformatics tools.
Go and view the pages, skim the README and the issue trackers. Open some random issues.
IGV - paper in 2013, last commit 2 days ago (as of Oct 5 2023)
Samtools - paper in 2009, last commit 2 weeks ago
BWA - paper in 2009, last commit September 23, 2022. New development is being done on BWA-MEM2
Compare the release date of the paper vs the last commit.
Consider how different it must be from the first release, and whether the authors realised someone (or they themselves) would be working on the code 13 years later!
Note: The commit frequency and the issue tracker are a good way to evaluate a project’s health.
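As a rough illustration of that comparison, here is a small Python sketch that works out the gap between paper and last commit for the three tools. The dates are approximations read off the list above ("2 days ago" and "2 weeks ago" are converted relative to Oct 5 2023, and papers are only known by year), and the names `PROJECTS` and `summarise` are made up for this example.

```python
from datetime import date

# Snapshot date used in the list above.
AS_OF = date(2023, 10, 5)

# (paper year, approximate last-commit date), taken from the list above.
# Relative dates ("2 days ago", "2 weeks ago") are converted using AS_OF.
PROJECTS = {
    "IGV": (2013, date(2023, 10, 3)),
    "Samtools": (2009, date(2023, 9, 21)),
    "BWA": (2009, date(2022, 9, 23)),
}


def summarise(paper_year, last_commit, as_of=AS_OF):
    """Return (years from paper to last commit, days since last commit)."""
    years_maintained = last_commit.year - paper_year
    days_idle = (as_of - last_commit).days
    return years_maintained, days_idle


for name, (paper_year, last_commit) in PROJECTS.items():
    years, idle = summarise(paper_year, last_commit)
    print(f"{name}: ~{years} years from paper to last commit, "
          f"last commit {idle} days ago")
```

A long gap since the last commit is not automatically bad (BWA development moved to BWA-MEM2), so treat the numbers as a prompt to go and look, not a verdict.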
On one of the bioinformatics project pages above, look in the top right for the licence link. If LICENCE is a standard file (as per IGV), GitHub will show its name, e.g. “MIT Licence”.
Click on the IGV licence link, open it up and skim it quickly... it’s a lot of legalese.
There’s a great site called TLDRLegal that simplifies things, e.g.:
MIT - A short, permissive software license. Basically, you can do whatever you want as long as you include the original copyright and license notice in any copy of the software/source
GNU General Public License v3 (GPL-3) - You may copy, distribute and modify the software as long as you track changes/dates in source files. Any modifications to or software including (via compiler) GPL-licensed code must also be made available under the GPL along with build & install instructions.
Task: Please open the links above, and read through them.
GitHub has excellent help: GitHub Docs.
However, I tend to just Google “GitHub X” to find what I’m looking for.
GitHub is a place to find software, but if you’re doing it right, GitHub is a social media site for programmers.
Some use the term “free software”, but you should think of free as in “free speech”, not “free beer”. The most important part is that you are free to see how it works, and free to modify it if it’s broken.
You can get paid a lot of money writing code… so why do people work on code then give it away?
Often corporate programming is not that fun, and people want to do something different, solve a problem “the right way” (where the engineering effort does not have an immediate source of income), or just work with personal freedom.
“There’s a reason why open source projects are called “projects,” rather than just code. While code is the final output of a project, the term “project” refers to the entire bundle of community, code, and communication and developer tools that support its underlying production.”
Software is very iterative. If you make something and put it out there, people use it (not in the way you expect!) and then start making requests, asking for help, etc.
If your project is successful you can often end up spending a lot of time managing it, and doing boring things like user support rather than the fun stuff (programming cool new features and doing science)!
“Running a successful open source project is just Good Will Hunting in reverse, where you start out as a respected genius and end up being a janitor who gets into fights”
But it is rewarding, and from a scientific career perspective - people are more likely to use and cite good, working tools!
Professionally - it also helps with exposure and networking. When hiring for jobs, I love to see someone list their GitHub account. You can see what they’ve done, and also get a glimpse into how they work (are they helpful and polite?).
Generally - fun driven development means boring jobs don’t get done.
Discussion: Has anyone had good or bad experiences with open source software?
If you find a bug in some software, please report it by raising an issue.
Check that it hasn’t been fixed (i.e. try the latest version) and hasn’t already been reported. Ideally, work on it for a bit to narrow it down and make it simple to reproduce.
Raising an issue if you find a bug is a useful contribution!
Suggesting a new feature you’d like is also useful, though it’s a lot easier to think of new features than to implement them, and not everyone shares your priorities.
There are a lot more people raising issues than fixing bugs and closing them.
Thus, the number of issues continues to grow while the software is being used (and it will probably be abandoned before they are all closed).
If a developer says “PR welcome”, that’s a polite way of saying “I have other things to do, if you want this, do it yourself!”
In order for a bug to be fixed, the following has to happen:
The first steps are valuable, but developers are busy.
You can wait a loooooong time for a bug fix.
Here’s a bug I raised that took 812 days to fix. I am still super pumped.
Why? I haven’t written Perl in over 10 years, so I had to wait for someone else to do it.
Two people encountered the same bug, which had a simple 15-line fix. One had to wait 2,467 days for a fix, while the other only had to wait 5.
What’s the lesson here? If you want a fix fast, do it yourself!
Who fixes open source code? People like me, and people like you!
You generally don’t have permission to modify other people’s projects, so how do you modify one, or send the authors your changes? Start by forking.
Forking is a way to create a personal copy of someone else’s repository on GitHub. When you fork a repo, you’re essentially duplicating it under your GitHub account. This allows you to freely experiment with changes without affecting the original project.
Steps for Forking on GitHub:
Navigate to the repository you want to fork.
Click the "Fork" button, usually located at the top right corner of the repo's page.
Choose where you want to fork the repo. Usually, it's your personal GitHub account.
What Happens After Forking:
You get a new repository under your account that's identical to the original repo at the time of forking.
Your fork exists independently, meaning changes to the fork won't affect the original repo, and vice versa.
Common Use Cases:
Contributing to a Project: Fork the repo, clone your fork locally, make changes, push back to your fork on GitHub, and then create a pull request to the original repo.
Personal Experiments: You can freely modify your fork without worrying about the original project.
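Here is a minimal sketch of the "Contributing to a Project" cycle, written as a Python function that only builds and prints the git commands (it does not run them, so nothing touches your machine or GitHub). The user name, repository name, branch name and the helper `fork_workflow` are all placeholders invented for this example. It also assumes a reasonably recent Git (for `git switch`) and that you clone over HTTPS.

```python
import shlex


def fork_workflow(your_user, upstream_owner, repo, branch, message):
    """Return the git commands for fork -> clone -> commit -> push as argv lists."""
    fork_url = f"https://github.com/{your_user}/{repo}.git"
    upstream_url = f"https://github.com/{upstream_owner}/{repo}.git"
    return [
        # Clone YOUR fork (not the original repo); it becomes the "origin" remote.
        ["git", "clone", fork_url],
        # Optional: add the original project as "upstream" so you can fetch its updates.
        ["git", "-C", repo, "remote", "add", "upstream", upstream_url],
        # Do the work on its own branch.
        ["git", "-C", repo, "switch", "-c", branch],
        # (edit files here) then stage and commit the changes.
        ["git", "-C", repo, "add", "--all"],
        ["git", "-C", repo, "commit", "-m", message],
        # Push the branch to your fork on GitHub.
        ["git", "-C", repo, "push", "--set-upstream", "origin", branch],
    ]


# Placeholder values: swap in your own account and the project you forked.
for cmd in fork_workflow("your-username", "some-lab", "some-tool",
                         "fix-readme-typo", "Fix typo in README"):
    print(shlex.join(cmd))
```

The last step, creating the pull request, is done on the GitHub website: after the push, your fork's page offers to open a pull request against the original repo.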
Task: Fork a bioinformatics project of your choice, then clone it to your local VM.
I always raise an issue before starting a fix, to confirm it is real, and
When making bug fixes, I like to write automated unit tests that reproduce the bug, and would fail with the current code.
Then I make the fix, and verify that the new tests pass. I make sure all of the existing tests still pass, to show I haven’t accidentally broken anything.
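To make the test-first approach concrete, here is a small made-up example using Python's built-in `unittest`. The function `gc_content` and its lower-case bug are hypothetical (they are not from any of the projects above). The buggy version is kept alongside the fix only so you can see the regression test really would have failed on the old code.

```python
import unittest


def gc_content_buggy(seq):
    """Hypothetical original code: only counts upper-case G and C."""
    return (seq.count("G") + seq.count("C")) / len(seq)


def gc_content(seq):
    """Fixed version: normalise the case before counting."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


class TestGcContent(unittest.TestCase):
    def test_upper_case_still_works(self):
        # Existing behaviour: must keep passing after the fix.
        self.assertEqual(gc_content("GGCC"), 1.0)
        self.assertEqual(gc_content("ATGC"), 0.5)

    def test_lower_case_regression(self):
        # Reproduces the bug report: the old code returned 0.0 here.
        self.assertEqual(gc_content_buggy("atgc"), 0.0)
        # The fix gives the same answer as for upper-case input.
        self.assertEqual(gc_content("atgc"), 0.5)


# Run the tests explicitly so the example works in a script or a notebook.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestGcContent)
result = unittest.TextTestRunner(verbosity=2).run(suite)
```

In a real fix you would not ship the buggy copy. You would write the regression test first, watch it fail, change the code, then watch it pass along with the rest of the test suite.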
I like to use the fix myself for a few days under real-world conditions. This means installing it on the system, so I can use the library from my code, or my modified tool in my pipelines.
Then, when I’m happy, I’ll try to get it merged into the original project.
Question: I have selfish reasons to get my bug fixes merged into the project. What are they?
Since you don’t have permission to write to other people’s projects (i.e. you can’t push), you need them to pull your changes. This is called a pull request.
If a pull request is still open and you push more commits to the same branch of your fork, they are added to the open pull request.
I like to create a branch in my repository for just that fix, then make the pull request from there.
This allows me to make multiple pull requests, separated out per issue. This makes them easier to validate (rather than a giant pull request that fixes 4 independent issues).
Unit tests drastically increase the chance of your pull request being accepted.
I also put all of the fixes in the main/master branch of my fork, so if I need to replace the library/tool, this is what I use until the original project merges the fixes and makes a release.
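The branch-per-fix habit could look roughly like the sketch below. As before, it only builds and prints git commands and does not run them. The branch names, the default `main` branch name and the helper `branch_per_fix` are assumptions made up for this example, and it is meant to be read as "commands you would type inside your clone of the fork".

```python
import shlex


def branch_per_fix(fix_branches, main="main"):
    """Return git commands: one branch per fix, then all fixes merged into main."""
    cmds = []
    for branch in fix_branches:
        # Start every fix from main so each pull request stays independent.
        cmds.append(["git", "switch", "-c", branch, main])
        # (make the fix and commit it here) then push the branch to your fork;
        # each pushed branch can back its own pull request.
        cmds.append(["git", "push", "--set-upstream", "origin", branch])
    # For personal use: collect every fix in your fork's main branch.
    cmds.append(["git", "switch", main])
    for branch in fix_branches:
        cmds.append(["git", "merge", "--no-edit", branch])
    return cmds


# Placeholder branch names, one per issue being fixed.
for cmd in branch_per_fix(["fix-issue-12", "fix-issue-34"]):
    print(shlex.join(cmd))
```

Keeping the branches separate means the maintainers can accept one fix and ask for changes on another, while your own main branch still has everything you need day to day.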
Developers have busy lives, so it can take a while for merges to get in. I try to gently remind people, and fulfill any requests they have.
Be patient, don’t be rude, and just keep using your forks for personal use.
Cutadapt:
HTSeq:
Pronto
HGVS
More pull requests: