In this blog post, I summarize some tricks and new concepts I have learned, in the hope that they will be helpful to others.
Tricks
1. GitHub Large File Storage
Reset the local commits: link
2. Make PyTorch choose the free GPUs
import os

def find_gpus(nums=6):
    # Write the "Free" memory line of every GPU to a temporary file
    os.system('nvidia-smi -q -d Memory |grep -A4 GPU|grep Free >~/.tmp_free_gpus')
    # expanduser replaces ~ with the home directory; if there is no ~ in the path, it returns the path unchanged
    with open(os.path.expanduser('~/.tmp_free_gpus'), 'r') as lines_txt:
        frees = lines_txt.readlines()
        idx_freeMemory_pair = [(idx, int(x.split()[2]))
                               for idx, x in enumerate(frees)]
    # Sort by free memory, largest first, and keep the first `nums` GPUs
    idx_freeMemory_pair.sort(key=lambda my_tuple: my_tuple[1], reverse=True)
    usingGPUs = [str(idx_memory_pair[0])
                 for idx_memory_pair in idx_freeMemory_pair[:nums]]
    usingGPUs = ','.join(usingGPUs)
    print('using GPU idx: #', usingGPUs)
    return usingGPUs

# os.environ['CUDA_VISIBLE_DEVICES'] = find_gpus(nums=4)  # must come before import torch
os.environ['CUDA_VISIBLE_DEVICES'] = find_gpus(nums=6)  # must come before import torch
import torch
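The same selection can also be sketched without the temporary file, by asking nvidia-smi for CSV output and parsing it in Python. This is only a sketch under assumptions: the helper names `parse_free_memory`, `pick_freest_gpus` and `query_free_memory` are made up for illustration, the sample text is invented, and `query_free_memory` needs an NVIDIA driver to run.

```python
import subprocess


def parse_free_memory(csv_text):
    """Parse lines like '0, 1234' into a list of (gpu_index, free_MiB) tuples."""
    pairs = []
    for line in csv_text.strip().splitlines():
        index, free = line.split(",")
        pairs.append((int(index), int(free)))
    return pairs


def pick_freest_gpus(pairs, nums):
    """Return the indices of the `nums` GPUs with the most free memory as 'i,j,k'."""
    ranked = sorted(pairs, key=lambda pair: pair[1], reverse=True)
    return ",".join(str(index) for index, _ in ranked[:nums])


def query_free_memory():
    """Ask nvidia-smi for per-GPU free memory (requires an NVIDIA driver)."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,memory.free",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_free_memory(out)


# Parsing an invented sample of the CSV output (no GPU needed for this part)
sample = "0, 1024\n1, 8192\n2, 4096\n"
selected = pick_freest_gpus(parse_free_memory(sample), nums=2)
print(selected)
```

On a real machine, the result of `pick_freest_gpus(query_free_memory(), nums=...)` would be assigned to `os.environ['CUDA_VISIBLE_DEVICES']`, still before `import torch`.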
3. Calculate the GPU memory cost in Python
! pip install nvidia-ml-py3
import nvidia_smi

nvidia_smi.nvmlInit()
deviceCount = nvidia_smi.nvmlDeviceGetCount()
for i in range(deviceCount):
    handle = nvidia_smi.nvmlDeviceGetHandleByIndex(i)
    info = nvidia_smi.nvmlDeviceGetMemoryInfo(handle)
    print("Device {}: {}, Memory : ({:.2f}% free): {}(total), {} (free), {} (used)".format(i, nvidia_smi.nvmlDeviceGetName(handle), 100*info.free/info.total, info.total, info.free, info.used))
nvidia_smi.nvmlShutdown()
4. Fixing the large-file error after committing to GitHub
Basically, we need to undo the commit that contains the wrongly committed files, configure LFS, and then add the files again.
# reset the commit
git reset --soft HEAD~1
git filter-branch --tree-filter 'rm -rf path/to/your/file' HEAD
git push
5. How to download a video from a URL
Step 1: Get the m3u8 URL by following the steps from the link.
Step 2: Use ffmpeg to download the video and save it under a specific name (from the link):
ffmpeg -protocol_whitelist file,http,https,tcp,tls,crypto -i "https://<url>.m3u8" -c copy video.mp4
6. NumPy softmax overflow issue
Some details about it are available from Stanford.
Solution:
import numpy as np

def softmax(self, x):
    after_softmax = []
    for single_x in x:
        # Subtract the row's own maximum so that np.exp never overflows
        r = np.exp(single_x - np.max(single_x))
        p = r / np.sum(r)
        after_softmax.append(p)
    after_softmax = np.array(after_softmax)
    print(after_softmax.shape)
    return after_softmax
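Another solution, which can produce nan in intermediate results. It subtracts the global maximum, so a row whose entries are all far below that maximum underflows to zeros, and dividing by the row's zero sum gives nan: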
def softmax(self, x):
    r = np.exp(x - np.max(x))
    return r / r.sum(axis=1).reshape(x.shape[0], 1)
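One way to avoid the nan while staying vectorised is to subtract each row's own maximum instead of the global one. The sketch below assumes a 2-D input with one sample per row; the name `stable_softmax` and the example scores are made up for illustration.

```python
import numpy as np


def stable_softmax(x):
    """Row-wise softmax of a 2-D array, stabilised by subtracting each row's max."""
    x = np.asarray(x, dtype=float)
    # After the shift, the largest entry of every row is exactly 0
    shifted = x - np.max(x, axis=1, keepdims=True)
    # Every row therefore contains exp(0) = 1, so its sum is at least 1
    r = np.exp(shifted)
    return r / np.sum(r, axis=1, keepdims=True)


# Rows on very different scales: a global-max shift would underflow the second row
scores = np.array([[1000.0, 1001.0], [-1000.0, -1001.0]])
probs = stable_softmax(scores)
print(probs)
```

Because each row is shifted independently, no row can underflow entirely, and the result for a row does not depend on the values in other rows.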
7. Change the timestamp of the current Git commit
git commit --amend --date="Wed Feb 16 14:00 2011 +0100" --no-edit
git commit --amend --date="now"
git commit --amend --date="2022-07-31T09:51:07"
The supporting link.
8. How to use wandb
Wandb is a good tool for tracking model training and logging its progress. GitHub: link.
A useful template from the video and the link:
import wandb

# GPU, MACHINE, HM_GPU and NVLINK are assumed to be defined earlier in the script
conf_dict = {
    "GPU": GPU,
    "Machine": MACHINE,
    "HM_GPU": HM_GPU,
    "NVLINK": NVLINK,
}
wandb.init(
    project='project name',
    entity="author name",
    config=conf_dict,
    # sync_tensorboard=True,  # see if it works
    name=f'{MACHINE}-{GPU}-{HM_GPU}GPU'
)
An example to build a logger for wandb: link.
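9. PyTorch GPU memory cost
Useful functions:
- torch.cuda.memory_allocated()
- torch.cuda.max_memory_allocated()
- torch.cuda.memory_reserved()
memory_reserved already includes memory_allocated (it is the allocated memory plus the allocator's cache). The memory cost shown by nvidia-smi is roughly memory_reserved plus the overhead of the CUDA context.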
A useful discussion: link
10. Ubuntu: mount an external drive automatically on reboot
Useful link: https://askubuntu.com/a/165462
- [IMPORTANT] sudo cp /etc/fstab /etc/fstab.old
Create a backup of the fstab file just in case something unwanted happens. If something happens, you will need a bootable (live) USB. If you do not have one, use the GUI method instead.
- sudo blkid
Note the UUID of the partition you want to automount.
- sudo nano /etc/fstab
Copy the following line to the end of the file, save it and reboot afterwards to check if it worked.
UUID=<uuid> <pathtomount> <filesystem> defaults 0 0
- mkdir /my/path/tomount
To quote: "you must create the mount point before you mount the partition." See https://help.ubuntu.com/community/Fstab
11. Run an application copied from another Mac
Try this command if you get the "App is damaged and please move it to the trash" error:
xattr -cr /path/to/application.app
12. Install CUDA/cuDNN with conda
Install CUDA 10.1: conda install -c miniconda cudatoolkit=10.1
Install cuDNN: conda install -c conda-forge cudnn
For Tensorflow 1: link
13. Update the default g++/gcc version in Ubuntu
If an error like this appears:
configure: error: This libpqxx version needs at least C++17.
it can be fixed with a command like this:
./configure CXXFLAGS="-std=c++20 -O3"
Some useful links:
- change the default versions of gcc/g++: link.
14. How to record the time and memory cost of a script or command
/usr/bin/time -v command
from link
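When the measurement is needed from inside a Python script, a rough equivalent can be sketched with the standard library. This is a minimal sketch for Unix-like systems only; the function name `measure` is made up, and the peak-memory figure is the largest value among all child processes waited for so far, not strictly the last command. Its unit follows the operating system (kilobytes on Linux, bytes on macOS).

```python
import resource
import subprocess
import sys
import time


def measure(cmd):
    """Run cmd (a list of arguments); return (wall_seconds, peak_rss, returncode)."""
    start = time.perf_counter()
    completed = subprocess.run(cmd, check=False)
    elapsed = time.perf_counter() - start
    # Peak resident set size over all children this process has waited for
    peak = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    return elapsed, peak, completed.returncode


# Example: time a short Python one-liner run as a child process
elapsed, peak, returncode = measure(
    [sys.executable, "-c", "x = list(range(100000))"]
)
print(f"wall time: {elapsed:.3f} s, peak RSS: {peak}, exit code: {returncode}")
```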
15. Running TensorFlow 1 on NVIDIA 30-series GPUs
pip install nvidia-pyindex
pip install nvidia-tensorflow
Some useful links (but have not found a clear answer yet):
- https://stackoverflow.com/questions/38303974/tensorflow-running-error-with-cublas
- https://stackoverflow.com/questions/43990046/tensorflow-blas-gemm-launch-failed?noredirect=1&lq=1
16. PyTorch memory issue
I ran into the error RuntimeError: CUDA error: an illegal memory access was encountered when running my PyTorch code. The error did not appear when I used the standard cross entropy loss function. So far the main reason is unclear.
Some potential solutions:
- Decrease the batch_size
- Call torch.cuda.empty_cache()
New Concepts
1. GAN
2. VAE
3. Diffusion Model
The basic idea of diffusion models is to build a long Markov chain that gradually adds Gaussian noise at each step, and then to train a model to reverse that process.
Conclusion: Diffusion models are both analytically tractable and flexible. But generating a sample requires many steps along a long Markov chain, which makes them expensive: even with faster sampling methods, they are still slower than GANs.
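The forward (noising) half of that chain can be illustrated with a few lines of NumPy. This is a toy sketch under assumptions: it uses the per-step update x_t = sqrt(1 - beta_t) * x_{t-1} + sqrt(beta_t) * noise described in the first reference below, a linear beta schedule chosen only for illustration, and made-up 1-D data. The name `forward_diffusion` is hypothetical, and no reverse model is trained here.

```python
import numpy as np


def forward_diffusion(x0, betas, rng):
    """Return [x_0, x_1, ..., x_T], each one noisier than the previous."""
    xs = [x0]
    x = x0
    for beta in betas:
        noise = rng.standard_normal(x.shape)
        # One Markov step: shrink the signal slightly and add Gaussian noise
        x = np.sqrt(1.0 - beta) * x + np.sqrt(beta) * noise
        xs.append(x)
    return xs


rng = np.random.default_rng(0)
x0 = np.full(10_000, 3.0)               # toy "data": every value equals 3
betas = np.linspace(1e-4, 0.02, 1000)   # assumed linear noise schedule
trajectory = forward_diffusion(x0, betas, rng)
print(trajectory[-1].mean(), trajectory[-1].std())
```

After enough steps the samples are close to standard Gaussian noise regardless of the starting data, which is why the reverse process can start from pure noise.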
4. ChatGPT
Dec-20-2022
4.1 Three main abilities of the original GPT-3:
- Language Generation
- In-context Learning
- World knowledge
4.2 What GPT-3.5 cannot do
- Overwriting the model's beliefs on the fly
- Formal reasoning
- Retrieval from the Internet
References:
- https://lilianweng.github.io/posts/2021-07-11-diffusion-models/
- https://yang-song.github.io/blog/2021/score/
- https://yaofu.notion.site/GPT-3-5-360081d91ec245f29029d37b54573756