In this blog post, I summarize some tricks and new concepts I have learned, in the hope that they will be helpful to others.

Tricks

1. GitHub Large File Storage

Reset the local commits: link

2. Make PyTorch choose the free GPUs

        
      
import os

def find_gpus(nums=6):
    # Write the "Free" memory line of every GPU to a temporary file
    os.system('nvidia-smi -q -d Memory |grep -A4 GPU|grep Free >~/.tmp_free_gpus')
    # expanduser replaces ~ with the home directory (a path without ~ is returned unchanged)
    with open(os.path.expanduser('~/.tmp_free_gpus'), 'r') as lines_txt:
        frees = lines_txt.readlines()
    # (GPU index, free memory) pairs; the number is the third field of each "Free" line
    idx_free_memory_pairs = [(idx, int(x.split()[2])) for idx, x in enumerate(frees)]
    # GPUs with the most free memory come first
    idx_free_memory_pairs.sort(key=lambda pair: pair[1], reverse=True)
    using_gpus = ','.join(str(idx) for idx, _ in idx_free_memory_pairs[:nums])
    print('using GPU idx: #', using_gpus)
    return using_gpus

# Must be set before importing torch
os.environ['CUDA_VISIBLE_DEVICES'] = find_gpus(nums=6)
import torch
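
The same selection can also be sketched without grep or a temporary file. This is only an illustration, not the original method: it assumes nvidia-smi's CSV query mode (--query-gpu=index,memory.free --format=csv,noheader,nounits) is available, and the helper names pick_free_gpus and query_free_memory are made up for this example.

```python
import subprocess

def pick_free_gpus(csv_text, nums=6):
    """Return a comma-separated string of the `nums` GPU indices with the most free memory.

    `csv_text` is expected to hold one "index, free_MiB" row per GPU.
    """
    pairs = []
    for line in csv_text.strip().splitlines():
        idx, free = (field.strip() for field in line.split(','))
        pairs.append((int(idx), int(free)))
    # GPUs with the most free memory come first
    pairs.sort(key=lambda pair: pair[1], reverse=True)
    return ','.join(str(idx) for idx, _ in pairs[:nums])

def query_free_memory():
    """Ask nvidia-smi for "index, free memory in MiB" rows (needs an NVIDIA driver)."""
    return subprocess.check_output(
        ['nvidia-smi', '--query-gpu=index,memory.free',
         '--format=csv,noheader,nounits'],
        text=True,
    )

# Usage on a machine with nvidia-smi, before importing torch:
#   os.environ['CUDA_VISIBLE_DEVICES'] = pick_free_gpus(query_free_memory(), nums=4)
```

Keeping the parsing separate from the nvidia-smi call means the selection logic can be checked on a machine without a GPU.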

3. Query GPU memory usage in Python

Package: nvidia-ml-py3 (the leading ! below runs the command from a notebook cell)

        
      
! pip install nvidia-ml-py3

import nvidia_smi

nvidia_smi.nvmlInit()

deviceCount = nvidia_smi.nvmlDeviceGetCount()
for i in range(deviceCount):
    handle = nvidia_smi.nvmlDeviceGetHandleByIndex(i)
    info = nvidia_smi.nvmlDeviceGetMemoryInfo(handle)
    print("Device {}: {}, Memory : ({:.2f}% free): {}(total), {} (free), {} (used)".format(i, nvidia_smi.nvmlDeviceGetName(handle), 100*info.free/info.total, info.total, info.free, info.used))

nvidia_smi.nvmlShutdown()

4. Fixing the large-file error after committing on GitHub

Basically, we need to undo the commit that wrongly included the large files, and then add them again after configuring LFS.

# undo the last commit but keep its changes staged
git reset --soft HEAD~1

# remove the file from every commit reachable from HEAD
git filter-branch --tree-filter 'rm -rf path/to/your/file' HEAD
git push

5. How to download a video from a URL

Step 1: Get the m3u8 URL by following the steps from the link.

Step 2: Use ffmpeg to download the video and save it under a specific name (from the link):

ffmpeg -protocol_whitelist file,http,https,tcp,tls,crypto -i "https://<url>.m3u8" -c copy video.mp4

6. NumPy softmax overflow issue

Some details about this issue are available from Stanford. Solution:

        
      
import numpy as np

def softmax(self, x):
    after_softmax = []
    for single_x in x:
        # subtract the row maximum so that np.exp cannot overflow
        r = np.exp(single_x - np.max(single_x))
        p = r / np.sum(r)
        after_softmax.append(p)
    after_softmax = np.array(after_softmax)
    print(after_softmax.shape)
    return after_softmax

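Another version, which can still cause nan: it subtracts the global maximum instead of each row's maximum, so a row whose values are all far below the global maximum underflows to zeros and ends up dividing 0 by 0: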

        
      
def softmax(self, x):
	r=np.exp(x - np.max(x))
	return r/r.sum(axis=1).reshape(x.shape[0], 1)
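
A vectorized variant of the first solution is sketched below. It subtracts each row's own maximum (axis=1, keepdims=True), so every row contains at least one exp(0) = 1 and the denominator can never be zero. The name stable_softmax and the toy logits are made up for this example, and x is assumed to be a 2-D array.

```python
import numpy as np

def stable_softmax(x):
    """Row-wise softmax of a 2-D array, shifted by each row's own maximum."""
    x = np.asarray(x, dtype=np.float64)
    # The largest entry in every row becomes 0, so np.exp never overflows
    shifted = x - np.max(x, axis=1, keepdims=True)
    r = np.exp(shifted)
    # Every row contains exp(0) = 1, so each row sum is at least 1
    return r / np.sum(r, axis=1, keepdims=True)

# Rows far apart in magnitude: a global-maximum shift would turn the second row into 0/0
logits = np.array([[1000.0, 1001.0], [-1000.0, -1001.0]])
probs = stable_softmax(logits)
print(probs)
```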

7. Change the timestamp of the current Git commit

        
      
git commit --amend --date="Wed Feb 16 14:00 2011 +0100" --no-edit
git commit --amend --date="now"
git commit --amend --date="2022-07-31T09:51:07"

The supporting link.

8. How to use wandb

wandb (Weights & Biases) is a good tool for tracking model training and logging its progress. GitHub: link.

A useful template from the video and the link:

        
      
import wandb

# GPU, MACHINE, HM_GPU and NVLINK describe your own setup and must be defined beforehand
conf_dict = {
    "GPU": GPU,
    "Machine": MACHINE,
    "HM_GPU": HM_GPU,
    "NVLINK": NVLINK,
}

wandb.init(
    project='project name',
    entity="author name",
    config=conf_dict,
    # sync_tensorboard=True,  # not yet verified to work
    name=f'{MACHINE}-{GPU}-{HM_GPU}GPU',
)

An example to build a logger for wandb: link.

9. PyTorch GPU memory cost

Useful functions:

torch.cuda.memory_allocated()
torch.cuda.max_memory_allocated()
torch.cuda.memory_reserved()

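memory_reserved already includes memory_allocated (it is the memory held by PyTorch's caching allocator). The memory cost shown by nvidia-smi is roughly memory_reserved plus the CUDA context overhead.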

A useful discussion: link

10. Mount an external drive automatically on reboot in Ubuntu

Useful Link: https://askubuntu.com/a/165462

[IMPORTANT] sudo cp /etc/fstab /etc/fstab.old - Create a backup of the fstab file in case something goes wrong. If it does, you will need a bootable (live) USB to recover. If you do not have one, use the GUI method instead.
sudo blkid - Note the UUID of the partition you want to automount.
mkdir /my/path/tomount - Create the mount point first. To quote: “you must create the mount point before you mount the partition.” See https://help.ubuntu.com/community/Fstab
sudo nano /etc/fstab - Copy the following line to the end of the file, save it, and reboot afterwards to check whether it worked.
- UUID=<uuid> <pathtomount> <filesystem> defaults 0 0

11. Run an application from another Mac

Try this command if you get the “App is damaged and can’t be opened. You should move it to the Trash.” error: xattr -cr /path/to/application.app

12. Install CUDA/cuDNN with conda

Install CUDA 10.1: conda install -c miniconda cudatoolkit=10.1

Install cuDNN: conda install -c conda-forge cudnn

For Tensorflow 1: link

13. Update the default g++/gcc version in Ubuntu

If an error such as configure: error: This libpqxx version needs at least C++17. appears, it can be fixed with a command like this: ./configure CXXFLAGS="-std=c++20 -O3". Some useful links:

Change the default versions of gcc/g++: link.

14. How to record the time and memory cost of a script or command

Run /usr/bin/time -v command, where command is the script or command to measure (from link).
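
A similar measurement can be sketched from inside Python with the standard library. This is not a replacement for /usr/bin/time -v, only a minimal illustration for Unix-like systems; the helper name run_and_measure and the toy child command are made up for this example.

```python
import resource
import subprocess
import sys
import time

def run_and_measure(cmd):
    """Run `cmd` (a list of arguments) and return (elapsed_seconds, peak_rss).

    peak_rss is ru_maxrss of the finished child processes: kilobytes on Linux,
    bytes on macOS. It is the largest value over all children waited for so far.
    """
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    elapsed = time.perf_counter() - start
    peak_rss = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    return elapsed, peak_rss

# Toy child process: a Python one-liner that allocates a list of one million integers
elapsed, peak_rss = run_and_measure([sys.executable, '-c', 'x = list(range(10**6))'])
print(f'elapsed: {elapsed:.3f} s, peak RSS: {peak_rss}')
```

Because RUSAGE_CHILDREN reports the maximum over every child waited for so far, measure one command per Python process if the numbers need to be attributed precisely.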

15. Debugging TensorFlow 1 on 30-series NVIDIA GPUs

link

        
pip install nvidia-pyindex
pip install nvidia-tensorflow

Some useful links (though I have not found a clear answer yet):

https://stackoverflow.com/questions/38303974/tensorflow-running-error-with-cublas
https://stackoverflow.com/questions/43990046/tensorflow-blas-gemm-launch-failed?noredirect=1&lq=1

16. PyTorch memory issue

I ran into the error RuntimeError: CUDA error: an illegal memory access was encountered when running my PyTorch code. I found that the error did not appear when I used the standard cross-entropy loss function. So far the main reason is unclear. Some potential solutions:

decrease the batch_size
torch.cuda.empty_cache()

New Concepts

1. GAN

2. VAE

3. Diffusion Model

The basic idea of diffusion models is to build a long Markov chain that gradually adds Gaussian noise to the data at each step, and then to train a model to reverse that process.

Conclusion: Diffusion models are both analytically tractable and flexible. But generating a sample requires many steps along a long Markov chain, which makes them expensive; even with newer, faster sampling methods, they are still slower than GANs.
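
The forward (noising) half of this idea can be sketched in a few lines of NumPy. This is a toy illustration only: the update x_t = sqrt(1 - beta_t) * x_{t-1} + sqrt(beta_t) * noise follows the formulation in the referenced posts, while the linear schedule from 1e-4 to 0.02 over 1000 steps, the constant toy data and the name forward_diffusion are assumptions made for this example. The learned reverse process is not shown.

```python
import numpy as np

def forward_diffusion(x0, betas, rng):
    """Apply x_t = sqrt(1 - beta_t) * x_{t-1} + sqrt(beta_t) * noise for every beta_t."""
    x = x0
    for beta in betas:
        noise = rng.standard_normal(x.shape)
        x = np.sqrt(1.0 - beta) * x + np.sqrt(beta) * noise
    return x

rng = np.random.default_rng(0)
x0 = np.full(10_000, 3.0)               # toy "data": every sample equals 3
betas = np.linspace(1e-4, 0.02, 1000)   # assumed linear noise schedule
x_T = forward_diffusion(x0, betas, rng)

# After the full chain the samples are close to standard Gaussian noise
print(x_T.mean(), x_T.std())
```

The number of steps in the chain is what makes sampling costly: the reverse process has to walk back through the same number of steps, one model evaluation per step.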

4. ChatGPT

Dec-20-2022

4.1 Three main abilities of the original GPT-3:

Language Generation
In-context Learning
World knowledge

4.2 What GPT-3.5 cannot do

Overwriting the model’s beliefs on the fly
Formal reasoning
Retrieval from the Internet

References:

https://lilianweng.github.io/posts/2021-07-11-diffusion-models/
https://yang-song.github.io/blog/2021/score/
https://yaofu.notion.site/GPT-3-5-360081d91ec245f29029d37b54573756

Tricks Summary 2022