Tricks Summary 2022

In this post, I summarize some tricks and new concepts I learned, in the hope that they will be helpful to others.

Tricks

1. GitHub Large File Storage

Usage

Reset the local commits: link

2. PyTorch: choose the free GPUs

import os

def find_gpus(nums=6):
    # Write the "Free" memory lines reported by nvidia-smi to a temporary file
    os.system('nvidia-smi -q -d Memory | grep -A4 GPU | grep Free > ~/.tmp_free_gpus')
    # expanduser replaces the leading ~ with the home directory
    with open(os.path.expanduser('~/.tmp_free_gpus'), 'r') as lines_txt:
        frees = lines_txt.readlines()
    # Pair each line index (used as the GPU index) with its free memory, the third field
    idx_free_memory_pairs = [(idx, int(x.split()[2])) for idx, x in enumerate(frees)]
    # Sort by free memory, largest first
    idx_free_memory_pairs.sort(key=lambda pair: pair[1], reverse=True)
    using_gpus = ','.join(str(pair[0]) for pair in idx_free_memory_pairs[:nums])
    print('using GPU idx: #', using_gpus)
    return using_gpus

# Must be set before importing torch
os.environ['CUDA_VISIBLE_DEVICES'] = find_gpus(nums=6)
import torch
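
A variant of the same idea, as a minimal sketch: instead of grepping the human-readable report, it asks nvidia-smi for CSV output through its --query-gpu option and parses that. The function names parse_free_memory, pick_free_gpus, and query_nvidia_smi are hypothetical, and the sample text stands in for real nvidia-smi output so the example runs without a GPU.

```python
import subprocess


def parse_free_memory(csv_text):
    """Parse 'index, memory.free' CSV rows into (index, free_MiB) pairs."""
    pairs = []
    for line in csv_text.strip().splitlines():
        idx, free = (field.strip() for field in line.split(","))
        pairs.append((int(idx), int(free)))
    return pairs


def pick_free_gpus(csv_text, nums=1):
    """Return the indices of the `nums` GPUs with the most free memory, comma-joined."""
    pairs = sorted(parse_free_memory(csv_text), key=lambda p: p[1], reverse=True)
    return ",".join(str(idx) for idx, _ in pairs[:nums])


def query_nvidia_smi():
    """Ask nvidia-smi for per-GPU free memory as headerless, unitless CSV."""
    return subprocess.check_output(
        ["nvidia-smi", "--query-gpu=index,memory.free", "--format=csv,noheader,nounits"],
        text=True,
    )


# Canned output (an assumption for illustration) so this runs without a GPU
sample = "0, 1024\n1, 8192\n2, 4096\n"
print(pick_free_gpus(sample, nums=2))  # 1,2
```

On a machine with GPUs, pick_free_gpus(query_nvidia_smi(), nums=4) would give a value suitable for CUDA_VISIBLE_DEVICES, still set before importing torch.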

3. Calculate the GPU memory cost in Python

package

! pip install nvidia-ml-py3

import nvidia_smi

nvidia_smi.nvmlInit()

deviceCount = nvidia_smi.nvmlDeviceGetCount()
for i in range(deviceCount):
    handle = nvidia_smi.nvmlDeviceGetHandleByIndex(i)
    info = nvidia_smi.nvmlDeviceGetMemoryInfo(handle)
    print("Device {}: {}, Memory : ({:.2f}% free): {}(total), {} (free), {} (used)".format(i, nvidia_smi.nvmlDeviceGetName(handle), 100*info.free/info.total, info.total, info.free, info.used))

nvidia_smi.nvmlShutdown()

4. Large file error after committing to GitHub

Basically, we need to reset the wrongly committed files, configure LFS, and then add the files again.

# reset the commit
git reset --soft HEAD~1

# remove the file from every commit reachable from HEAD
git filter-branch --tree-filter 'rm -rf path/to/your/file' HEAD
git push
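
To catch the problem before committing, a small check like the following can help. This is only a sketch: find_large_files is a hypothetical helper, and the 100 MiB default is an assumption based on GitHub's per-file limit at the time of writing.

```python
import os


def find_large_files(root, limit_bytes=100 * 1024 * 1024):
    """Return (path, size) pairs for files under `root` larger than `limit_bytes`.

    The .git directory is skipped; results are sorted largest first.
    """
    large = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune .git in place so os.walk does not descend into it
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue  # ignore symlinks, which may be broken
            size = os.path.getsize(path)
            if size > limit_bytes:
                large.append((path, size))
    return sorted(large, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for path, size in find_large_files("."):
        print(f"{size / 1024 / 1024:.1f} MiB  {path}")
```

Any file it reports is a candidate for LFS tracking before the first commit.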

5. How to download a video from URL

Step 1: Get the m3u8 URL by following the steps from the link.

Step 2: Use ffmpeg to download the video and save it under a specific name (command from the link):

ffmpeg -protocol_whitelist file,http,https,tcp,tls,crypto -i "https://<url>.m3u8" -c copy video.mp4
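
If the download needs to be scripted, the same ffmpeg invocation can be assembled from Python. This is a minimal sketch: build_ffmpeg_command and download_video are hypothetical helper names, and the URL stays a placeholder exactly as in the command above.

```python
import subprocess


def build_ffmpeg_command(m3u8_url, output_path):
    """Build the ffmpeg argument list that copies an m3u8 stream into one file."""
    return [
        "ffmpeg",
        "-protocol_whitelist", "file,http,https,tcp,tls,crypto",
        "-i", m3u8_url,
        "-c", "copy",  # copy the streams without re-encoding
        output_path,
    ]


def download_video(m3u8_url, output_path):
    """Run ffmpeg and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_ffmpeg_command(m3u8_url, output_path), check=True)


# Only builds and prints the command; nothing is downloaded here
cmd = build_ffmpeg_command("https://<url>.m3u8", "video.mp4")
print(" ".join(cmd))
```

Passing the arguments as a list avoids shell quoting problems when the URL contains characters such as & or ?.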

6. NumPy softmax overflow issue

Some details about it are available from Stanford. Solution:

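import numpy as np

def softmax(self, x):
    # Written as a method; self is unused. Applies softmax to each row of x.
    after_softmax = []
    for single_x in x:
        # Subtracting the row maximum keeps np.exp from overflowing
        r = np.exp(single_x - np.max(single_x))
        p = r / np.sum(r)
        after_softmax.append(p)
    after_softmax = np.array(after_softmax)
    print(after_softmax.shape)
    return after_softmax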

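Another solution, which can still produce NaN in the middle of a run. It subtracts the global maximum rather than each row's maximum, so a row whose values are all far below the global maximum underflows to zeros and the division becomes 0/0: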

def softmax(self, x):
	r=np.exp(x - np.max(x))
	return r/r.sum(axis=1).reshape(x.shape[0], 1)
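
For comparison, here is a vectorized sketch of the row-wise approach. It assumes x is a 2-D array with one sample per row and is written as a plain function rather than a method. Subtracting each row's own maximum guarantees that every row contains at least one exp(0) = 1, so the denominator is never zero.

```python
import numpy as np


def softmax(x):
    """Row-wise softmax for a 2-D array, stabilized by subtracting each row's max."""
    x = np.asarray(x, dtype=float)
    shifted = x - np.max(x, axis=1, keepdims=True)  # largest entry in each row becomes 0
    exps = np.exp(shifted)
    return exps / np.sum(exps, axis=1, keepdims=True)


# Rows on very different scales: a global-max shift would underflow the second row
x = np.array([[1000.0, 1001.0], [-1000.0, -1001.0]])
probs = softmax(x)
print(probs)
```

Using keepdims=True keeps the reduced axis so the subtraction and division broadcast without an explicit reshape.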

7. Change the current Git commit timestamp

git commit --amend --date="Wed Feb 16 14:00 2011 +0100" --no-edit
git commit --amend --date="now"
git commit --amend --date="2022-07-31T09:51:07"

The supporting link.

8. How to use wandb

Wandb is a good tool for tracking model training and logging its progress. GitHub: link.

A useful template from the video and the link:

import wandb

# GPU, MACHINE, HM_GPU, and NVLINK are placeholders to define for your own setup
conf_dict = {
    "GPU": GPU,
    "Machine": MACHINE,
    "HM_GPU": HM_GPU,
    "NVLINK": NVLINK,
}

wandb.init(
    project="project name",
    entity="author name",
    config=conf_dict,
    # sync_tensorboard=True,  # see if it works ig
    name=f"{MACHINE}-{GPU}-{HM_GPU}GPU",
)

An example to build a logger for wandb: link.

9. PyTorch GPU memory cost

Useful functions:

  • torch.cuda.memory_allocated()
  • torch.cuda.max_memory_allocated()
  • torch.cuda.memory_reserved()

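memory_reserved already includes memory_allocated, so the two should not be added together. The memory cost shown by nvidia-smi is roughly memory_reserved plus the CUDA context overhead.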

A useful discussion: link
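
A small helper for reading these counters together, as a sketch: cuda_memory_report is a hypothetical name, and the function returns None when PyTorch or a CUDA device is not available. The reserved figure comes from PyTorch's caching allocator, which is why it is at least as large as the allocated one.

```python
try:
    import torch
except ImportError:  # allow the sketch to run on machines without PyTorch
    torch = None


def cuda_memory_report(device=0):
    """Return PyTorch's CUDA memory counters for `device` in MiB, or None without CUDA."""
    if torch is None or not torch.cuda.is_available():
        return None
    mib = 1024 * 1024
    return {
        "allocated": torch.cuda.memory_allocated(device) / mib,          # live tensors
        "max_allocated": torch.cuda.max_memory_allocated(device) / mib,  # peak of the above
        "reserved": torch.cuda.memory_reserved(device) / mib,            # held by the caching allocator
    }


report = cuda_memory_report()
print(report if report is not None else "CUDA is not available")
```

Calling it before and after a forward pass gives a rough view of how much memory that step costs.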

10. Ubuntu: mount an external drive automatically on reboot

Useful Link: https://askubuntu.com/a/165462

  1. [IMPORTANT] sudo cp /etc/fstab /etc/fstab.old - Create a backup of the fstab file just in case something unwanted happens. If something happens, you will need a bootable (live) usb. If you do not have one, use the GUI method instead.
  2. sudo blkid - Note the UUID of the partition you want to automount.
  3. sudo nano /etc/fstab - Copy the following line to the end of the file, save it and reboot afterwards to check if it worked.
    • UUID=<uuid> <pathtomount> <filesystem> defaults 0 0
  4. mkdir /my/path/tomount - Create the mount point, and do so before rebooting. To quote: “you must create the mount point before you mount the partition.” See https://help.ubuntu.com/community/Fstab

11. Run an application from another Mac OS

Try this command if you encounter the “App is damaged and please move it to the trash” error: xattr -cr /path/to/application.app

12. Install Cuda/Cudnn by conda

Install CUDA 10.1: conda install -c miniconda cudatoolkit=10.1

Install cuDNN: conda install -c conda-forge cudnn

For TensorFlow 1: link

13. g++/gcc default version update in Ubuntu

If an error such as configure: error: This libpqxx version needs at least C++17. appears, it can be fixed by passing a newer C++ standard to configure, for example: ./configure CXXFLAGS="-std=c++20 -O3". Some useful links:

  • change the default versions of gcc/g++: link.

14. How to record the time and memory cost of a script or command

Run /usr/bin/time -v command, replacing command with the script or command to measure (from link).
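
A similar measurement can be taken from inside Python on Unix-like systems. This is a minimal sketch using the standard resource module: measure is a hypothetical helper, and ru_maxrss is reported in kilobytes on Linux but in bytes on macOS. Note that RUSAGE_CHILDREN reports the largest peak among all child processes waited for so far, not only the most recent one.

```python
import resource
import subprocess
import sys
import time


def measure(cmd):
    """Run `cmd` and return (return code, wall-clock seconds, peak child RSS)."""
    start = time.perf_counter()
    completed = subprocess.run(cmd)
    elapsed = time.perf_counter() - start
    # Peak resident set size over all children this process has waited for
    usage = resource.getrusage(resource.RUSAGE_CHILDREN)
    return completed.returncode, elapsed, usage.ru_maxrss


# Measure a trivial Python child process as a demonstration
returncode, seconds, max_rss = measure([sys.executable, "-c", "pass"])
print(f"exit={returncode} time={seconds:.3f}s max_rss={max_rss}")
```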

15. Debugging TensorFlow 1 on NVIDIA 30-series GPUs

link

pip install nvidia-pyindex
pip install nvidia-tensorflow

Some useful links (but have not found a clear answer yet):

  • https://stackoverflow.com/questions/38303974/tensorflow-running-error-with-cublas
  • https://stackoverflow.com/questions/43990046/tensorflow-blas-gemm-launch-failed?noredirect=1&lq=1

16. PyTorch memory issue

I hit the error RuntimeError: CUDA error: an illegal memory access was encountered when running my PyTorch code. When I used the standard cross-entropy loss function instead, the error did not appear. So far the main reason is unclear. Some potential solutions:

  • decrease the batch_size
  • torch.cuda.empty_cache()

New Concepts

1. GAN

2. VAE

3. Diffusion Model

The basic idea of diffusion models is to build a long Markov chain that gradually adds Gaussian noise to the data at each step, and then to train a model to reverse this process.
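
The forward (noising) half of that chain can be sketched in a few lines of NumPy. The assumptions here are not from this post: a linear noise schedule from 1e-4 to 0.02 over 1000 steps, as commonly used for DDPM, and the closed form x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise described in the first reference. The names make_schedule and q_sample are hypothetical.

```python
import numpy as np


def make_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Linear noise schedule (an assumed choice) and cumulative products of 1 - beta."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    alpha_bars = np.cumprod(1.0 - betas)
    return betas, alpha_bars


def q_sample(x0, t, alpha_bars, rng):
    """Jump straight to step t of the forward chain: scale x0 down and mix in Gaussian noise."""
    noise = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * noise
    return xt, noise


rng = np.random.default_rng(0)
betas, alpha_bars = make_schedule()
x0 = np.ones((4, 8))  # stand-in for a batch of data
xt, noise = q_sample(x0, t=999, alpha_bars=alpha_bars, rng=rng)
print(alpha_bars[0], alpha_bars[-1])  # close to 1 at the start, close to 0 at the end
```

In the usual training setup, a network receives xt and t and is trained to predict noise, which is what lets it run the chain in reverse at sampling time.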

Conclusion: Diffusion models are both analytically tractable and flexible. But they need many steps along a long Markov chain to generate a sample, which makes sampling expensive; even with faster sampling methods, they are still slower than GANs.

4. ChatGPT

Dec-20-2022

4.1 Three main abilities of the original GPT-3:

  • Language Generation
  • In-context Learning
  • World knowledge

4.2 What GPT-3.5 cannot do

  • Overwriting the model’s beliefs on the fly
  • Formal reasoning
  • Retrieval from the Internet

References:

  1. https://lilianweng.github.io/posts/2021-07-11-diffusion-models/
  2. https://yang-song.github.io/blog/2021/score/
  3. https://yaofu.notion.site/GPT-3-5-360081d91ec245f29029d37b54573756
This post is licensed under CC BY 4.0 by the author.
