Local Version Control practical

Outline

Git help
Git setup / config
Local repository
Changelog / History
Show / Diffs
Creating a new repository
Add/Commit
Staging area
Second commmit
Git blame
Tagging
Bash scripts + git
Git ignore
Moving through history
Restoring deleted files
Branches
Pull / Push
Bonus: Workflows
Bonus: R studio git integration
Bonus: GitHub preview

Git help

Like most command line programs, git has command line help. Run:

git --help

This shows an overview of sub commands. It’s short, read it all!

start a working area (see also: git help tutorial)
   clone     Clone a repository into a new directory
   init      Create an empty Git repository or reinitialize an existing one

work on the current change (see also: git help everyday)
   add       Add file contents to the index
   mv        Move or rename a file, a directory, or a symlink
   restore   Restore working tree files
   rm        Remove files from the working tree and from the index

examine the history and state (see also: git help revisions)
   bisect    Use binary search to find the commit that introduced a bug
   diff      Show changes between commits, commit and working tree, etc
   grep      Print lines matching a pattern
   log       Show commit logs
   show      Show various types of objects
   status    Show the working tree status

grow, mark and tweak your common history
   branch    List, create, or delete branches
   commit    Record changes to the repository
   merge     Join two or more development histories together
   rebase    Reapply commits on top of another base tip
   reset     Reset current HEAD to the specified state
   switch    Switch branches
   tag       Create, list, delete or verify a tag object signed with GPG

collaborate (see also: git help workflows)
   fetch     Download objects and refs from another repository
   pull      Fetch from and integrate with another repository or a local branch
   push      Update remote refs along with associated objects

You can get help for a subcommand via:

git clone --help

This opens a man (manual) page, and is the equivalent of man git-clone

Scroll through the man page with up/down keys, then page up/page down keys.

Press q to quit the man page.

Git setup / config

Git tracks which user did what, so you need to identify yourself.

If you didn’t do this in the tutorial, please run this:

git config --global user.name "Your name"
git config --global user.email "your.name@student.adelaide.edu.au"
git config --global core.editor "nano -w"

This is stored in your home directory and so works on all git repositories (global)

To see it:

cat ~/.gitconfig

Local repository

If you have already done this in a previous tutorial, you need to update it:

cd BIOTECH-7005-BIOINF-3000 # Wherever you cloned it before
git pull  # fetch changes from GitHub and update your local repository 
# If you get errors due to changes from the last tutorial, perhaps it is easier to just delete it and clone it again.

If you haven’t cloned the repo before, do so via:

# Clone a repo
git clone https://github.com/University-of-Adelaide-Bx-Masters/BIOTECH-7005-BIOINF-3000

When you clone a repo - you make a local copy of the whole version control database (history, deleted files etc)

Changelog / History

A git repository is a sequence of changes, you can see a history of these via:

git log

This opens an interactive log (shows a “:” at the bottom) of commits in reverse chronological order (ie latest commits first). Press the up/down arrow keys to scroll, or page up and down. Press q to quit.

Notice that at the top of every commit is a hash - which is the unique label you can use to refer to that change.

There are lots of options, eg to show a summary of changes.

git log --stat --summary

Or to see a visual of branching/merging:

git log --graph

Sample output:

| * |   commit 4ac3e07643488adad8a5a4462c756503cc5e975a
| |\ \  Merge: 6dea320 bbef0f4
| | | | Author: Dave Adelson <david.adelson@adelaide.edu.au>
| | | | Date:   Wed Sep 22 09:41:55 2021 +0930
| | | | 
| | | |     Merge pull request #27 from University-of-Adelaide-Bx-Masters/NewTranscriptome
| | | |     
| | | |     update link to merged major project document
| |\ \  Merge: 6dea320 bbef0f4
| | | | Author: Dave Adelson <david.adelson@adelaide.edu.au>
| | | | Date:   Wed Sep 22 09:41:55 2021 +0930
| | | | 
| | | |     Merge pull request #27 from University-of-Adelaide-Bx-Masters/NewTranscriptome
| | | |     
| | | |     update link to merged major project document
| | | |   
| * | |   commit 6dea320691f4e3d5d5ec1e3b02c1c84db46f5ec6
| |\ \ \  Merge: 8a40db2 5b9a8b1
| | | | | Author: Dave Adelson <david.adelson@adelaide.edu.au>
| | | | | Date:   Wed Sep 22 09:33:33 2021 +0930
| | | | | 
| | | | |     Merge pull request #26 from University-of-Adelaide-Bx-Masters/NewTranscriptome
| | | | |     
| | | | |     New transcriptome
| | | | |   
* | | | |   commit cff75d103c70161e44decc9bdf59f4cd8274cfeb
|\ \ \ \ \  Merge: 8a40db2 9d41f5e
| |/ / / /  Author: David Adelson <david.adelson@adelaide.edu.au>
|/| | | /   Date:   Wed Sep 22 09:48:38 2021 +0930
| | |_|/    
| |/| |         Merge branch 'NewTranscriptome'
| | | | 
| * | | commit 9d41f5e071da6e179660adb8a124beb2b190c2c6
| | |/  Author: David Adelson <david.adelson@adelaide.edu.au>
| |/|   Date:   Wed Sep 22 09:44:34 2021 +0930
| | |   
| | |       deleted html
| | | 
| * | commit bbef0f42d9e7efd2be26cd3d281f55dfcfcfef01
| |/  Author: David Adelson <david.adelson@adelaide.edu.au>
| |   Date:   Wed Sep 22 09:36:20 2021 +0930
| |   
| |       update link to merged major project document

Or a more concise graph:

git log --all --decorate --oneline --graph

Show / Diffs

To see the details of a particular commit, copy/paste a hash, eg:

git show 23d004b56503d2aaccc5f6e39993a8d10aee334c

To see the differences between two commits, you can use diff:

Show is the same as the difference between a commit and the previous commit:

git diff 11a25fea8e15523049896381de9312501053843d 23d004b56503d2aaccc5f6e39993a8d10aee334c

but you can eg look across 3 commits:

git diff 141d48db593f24783169062b1e94c4259bacfaeb f6cec9c27eea6ef854f754f1adbe3c3310cb708d

Exercise:

Use git log to find 2 commits, and run git diff between them.

Now add a few commits before/after it to see how the size of the diff grows

Shortcuts

HEAD is a shortcut to the latest commit.

If you don’t specify a second hash, it will default to HEAD, to show you all the changes since that commit.

Every commit usually has one “parent” commit which points to the previous state, some shortcuts:

# These also work on normal hashes not just HEAD
git show HEAD^  # to see the parent of HEAD
git show HEAD^^ # to see the grandparent of HEAD
git show HEAD~4 # to see the great-great grandparent of HEAD

# See last 2 commits:
git diff HEAD^^  # Implicitly comparin to HEAD

Creating a new repository

# Go to where you want to put a new git repository
mkdir bioinfo_tute_wk8
git init

There is now a hidden directory called “.git”

ls -R .git

Add/Commit

Edit a file, eg “README.md”
Paste in:

Apples
Watermelon

Then save the file, and exit to the shell. Run:

git status

This should show you that you have 1 untracked file called ‘README.md’.

To see this as a diff:

git diff

We need to tell Git about this change - run:

git add README.md

Now run:

git status

It should show you have a new file README.md, now run:

git commit

Because you didn’t specify a messae, this will open up your editor (which we specified as nano earlier) to make one. Enter something like “initial shopping list” then exit. It will apply a commit with the message you entered.

Staging area

Why did we have to do add and then commit? Ie why have a staging area instead of comitting directly?

It allows you to carefully add/remove individual files to build a commit - ie you can change a lot of files, then commit a few together, then others in a separate commit.

Think of it as a draft area where you can carefully build exactly what you want to commit.

Second commmit

Edit README.md again, adding 2 more lines:

Bananas
Mango

Run

git commit

Again, it should say: “no changes added to commit” - because you haven’t added the file. Let’s do that and commit:

git add README.md 
git commit -m "add stuff"

We used the -m parameter to provide the commit message on the command line, so we didn’t need to open up nano to type one.

Run:

git log

You can see your comments will stored against a commit for all time - make sure you spend a bit of time making them clear and useful!

Perhaps we should make that last message better, to edit it:

git commit --amend -m "Add Cyan monkeys's fruit"

Git blame

You can also see which was the last commit to touch a line in a file - this git command is actually easy to remember!

git blame README.md

Now edit README.md again, and add a “*” to the front of each line so it looks like:

* Apples
* Bananas
* Mango
* Watermelon

Add/commit again:

git add README.md 
git commit -m "formatting - markdown bullet points"

Now run git blame again:

git blame README.md

You can see that since we touched every line in the file, git blame is all that commit, and not very useful. So you can ignore a particular commit:

git blame README.md --ignore-rev HEAD  # Or the commit hash of your formatting change

You can see why logically grouping changes in commits is useful, eg separating formatting changes vs adding new things.

I will often break up a change into multiple commits, eg “clean up the area - keep existing functionality the same but make it easier to extend/change” and then a separate commit of “add the new functionality”.

If you run an auto-linter / code formatting tool (and you should!), you can keep track of commits in a file and use --ignore-revs-file

Tagging

All commits have unique hashes as labels, however these don’t mean anything and are hard to remember.

You can give commits your own names, which is called tagging.

git tag my_tag <hash from git log>  # need to change to your hash

A common use for tags is to label a release. This makes it easier to know the state of code in a release, find what releases are affected by bugs, and for instance show what changed between releases, eg:

CDot change diff

This is linked from the CHANGELOG file in one of my GitHub projects:

https://github.com/SACGF/cdot/blob/main/CHANGELOG.md

Bash scripts + git

Tip from scientific reproducibility lecture. If you always use Git, you can keep track of what version of a script was run.

For instance, copy/paste this into a file pipeline.sh

#!/bin/bash

DATA_DIR=data
FASTA_FILE=${DATA_DIR}/dna.fasta
TODAY=$(date --iso)
GIT_COMMIT=$(git rev-parse HEAD)
RUN_LOG=pipeline_run_${TODAY}_${GIT_COMMIT}.log

mkdir -p ${DATA_DIR}

if [[ ! -e ${FASTA_FILE} ]]; then
    echo ">DNA sequence, ~25% accuracy" > ${FASTA_FILE}
    cat /dev/urandom | tr -dc 'GATC' | fold -w ${1:-50} | head -1 >> ${FASTA_FILE}
fi

echo "Script ${0} started $(date) in '$(pwd)'" > ${RUN_LOG}
echo "git: ${GIT_COMMIT}" >> ${RUN_LOG}
echo "Input: $(md5sum ${FASTA_FILE})" >> ${RUN_LOG}

Now add/commit it:

git add pipeline.sh 
git commit -m "reproducible pipeline"

And run it, then examine the logs:

bash pipeline.sh
cat *.log

This should have saved the output:

Script pipeline.sh started Fri 16 Sep 2022 01:16:32 ACST in '/home/dlawrence/localwork/bioinfo_tute_wk8'
git: 2af18ad1033bbd5efd90a0aa88acf59dc1d9c4c4
Input: 482b49fe8627a6a300b51f46f7e5c6ff  data/dna.fasta

Exercise

Modify this script to generate a longer DNA sequence with a different filename, then checking in the changes, then re-running it.

Git ignore

Sometimes you deliberately don’t want to track certain files.

Examples include binary files, logs, temporary files or editor configuration (so each person can set their own preferences)

Run

git status

The log and data files are here. Let’s say that in a real experiment they are enormous, and you don’t want to check them in.

echo "data/" > .gitignore
echo "*.log" >> .gitignore

Now run:

git status

Exercise

Check in .gitignore

(optional) Write a 2000 word philosophy paper describing what you think should happen if you add .gitignore to the .gitignore file.

Moving through history

Git stores (in .git) all of the changes ever applied to the files in the repo.

It is worth thinking of the “normal files” in the directory as temporary - “working files”, they can be reconstructed at various historical points by replaying certain git changes. Re-applying all of the changes will create the “latest” versions, and you can selectively apply patches to move through historical versions.

Cat the first log file from your pipeline.sh runs above, and find the commit hash

To move the working files back to the state of this commit, you can run:

git checkout <commit_hash>  # Need to copy/paste from your git history

This will mention having a “detached head” - HEAD keeps track of the current point in a repository. Usually, it is attached to a branch (eg master), while moving through history, it’s “detached”

You can also see this if you run

git status

It will say “head detached”

Exercise

Run

git log

and pick a commit, ie “initial commit”

Move the repo back to this commit, and examine the README.md file

To move back to the latest commit, run:

git checkout master

To check you’re back to normal - run git log and ensure that the last commit has HEAD next to it

Restoring deleted files

Delete the README.md, ie:

rm README.md

Now run:

git status  # Shows file was deleted
git diff  # Shows patch

Retrieve the file from the repo - to undelete it in the working files

git checkout README.md

Branches

So far in your new repository we’ve been working in a linear way.

But remember Git is a graph, and you can have multiple paths (think back to the git log --graph diagrams above) that split apart.

The default branch is main/master, but you can diverge from this, and also merge together different branches.

A common task when working with multiple people is to create a branch from current HEAD to do new development, so you can store your code in the repository without breaking master.

You can create a branch then switch to it:

git branch blue_monkey # Creates a new branch from HEAD
git checkout blue_monkey

Or you could do this in 1 line via:

git checkout -b blue_monkey # if you run this after creating the branch above it will error and do nothing

Now edit the README.md and add a line “* Grapes”

git add README.md
git commit -m "blue monkey change"

Now, go back to the master branch.

git checkout master
cat README.md  # Double check that grapes are not there
git merge blue_monkey # Merge branch back into master

Because there were no conflicts - the merge was done automatically.

But imagine that the blue and cyan monkeys went off and did separate work. Find the commit before the merge, and create a branch from that, eg:

git checkout -b cyan_monkey HEAD^  # 1 commit back

Edit the README.md, say add “* Caffeinated bananas” and delete “* Watermelon”

Add/commit

git add README.md
git commit -m "cyan monkey shopping list changes"

Then back to main again:

git merge cyan_monkey

This can’t be auto-merged and is flagged as a conflict.

Auto-merging README.md
CONFLICT (content): Merge conflict in README.md
Automatic merge failed; fix conflicts and then commit the result.

If you look at the README.md it was merged, but with deliberate errors inserted around the merge conflict, so you have to manually edit it.

* Apples
* Bananas
<<<<<<< HEAD
* Grapes
=======
* Chocolate covered bananas
>>>>>>> cyan_monkey
* Mango

Edit this file, to put the entries in alphabetical order, then delete the lines with “<<<<” and “>>>” on them

git add README.md
git commit # editor will now suggest a 'merge' commit message

Now run:

git log --graph

Notice the branch and how it was merged back

Exercise

Create a new branch (and switch to it)
Create 2 modifications against that branch. Look at the log and pick 1 of the changes.
Go back to master/main
Run

git cherry-pick <the hash you want>

to only apply one change. This is useful to apply bug fixes to historical branches

Pull / Push

When you clone a repository (eg from GitHub) you make a local copy (with all commits and history)

When you edit files and make make changes to this, you

and also track where you got it from (remote origin) - so it is easy to keep them in sync.

Fetch - retrieves commits from another repo, but doesn’t apply them Pull - runs fetch, then applies the commits, in effect “syncing” your local repo to the remote Push - send your changes to the remote server (we will cover this in either the bonus section below, and GitHub prac in Week 9)

Bonus: Workflows

When working with others, you need to decide on how to use Git in your day to day work - these work patterns are often called git workflows

For instance - is it ok to have broken code in master/main? If you are working by yourself, or are making very small changes you have tested well, it may be OK to commit directly into the main/master branch.

If you are running an open source project and random people download your code - maybe not!

If I am working on something larger than 1 day’s work, I like to create a feature branch so that I can push my code to Github (to have a backup + work on different computers) then merge back to master when done.

Some teams only do work in feature branches, then use pull requests to review the code by someone else before it is merged into master.

Bonus: R studio git integration

Note: This may only work on the desktop version of R studio - haven’t tested on web version.

Many IDEs integrate with source control.

In R studio, select via top menu:

file -> new project
Select Existing project
Browse to the directory of the BIOTECH-7005-BIOINF-3000 repository
Click Create Project

There is now a git button in the toolbar (just below the top menu)

Question

Creating a new project caused a file “BIOTECH-7005-BIOINF-3000.Rproj” to appear in the repository. Why isn’t this visible as an untracked file in the git panel?

Bonus: GitHub

Github is a website that hosts git repositories on the internet, and also provides a number of services such as authentication, code browsing, issue tracking etc.

GitHub uses “Persoanl access tokens” rather than passwords. To generate one of these for the VM:

Click on your user icon in the top right
Click “Settings”
In bottom left menu option - “Developer Settings”, then “Personal Access token”
Generate new token, label it something like “bioinformatics VM”, it needs repository access

If you have password access to your VM, and you are happy to store your password in plain text on the VM, then you can do so via:

git config --global credential.helper store

Otherwise, just leave the tab open so you can copy/paste it, or copy/paste it into a temporary text editor

Create a new repository (eg “bioinfo_tute_wk8”)
Follow the instructions labelled …or push an existing repository from the command line

When questioned for your password, paste in the personal access token (if you store credentials, it will save it now to your VM disk)

Explore your repository on the GitHub site!

We will be going into GitHub in depth in the week 9 practical. Have a good mid-semester break!

Local Version Control practical

David Lawrence

2022-09-15

Outline

Git help

Git setup / config

Local repository

Changelog / History

Show / Diffs

Creating a new repository

Add/Commit

Staging area

Second commmit

Git blame

Tagging

Bash scripts + git

Git ignore

Moving through history

Restoring deleted files

Branches

Pull / Push

Bonus: Workflows

Bonus: R studio git integration

Bonus: GitHub