What I think is true of these four somewhat related things.
Model compression
- Make models smaller or faster
- Why do we care about smaller models?
- Less space
- Less computation for inference
- Can deploy models on edge devices
- Lower latency
- Maybe we don't need all the juice the larger model has
Model quantisation
- A method of model compression
- Use fewer bits for the model parameters
- e.g. 32-bit floats -> 16-bit floats, or even 8-bit integers (see the sketch below)
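A minimal NumPy sketch of post-training affine quantization to int8, just to make the "fewer bits" idea concrete. The function names and the toy 256x256 weight matrix are illustrative; real frameworks (e.g. PyTorch's quantization tooling) do this per-channel with calibration data and quantized kernels, but the core scale/zero-point mapping is the same.

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Affine-quantize float32 weights to int8, returning the quantized
    values plus the scale/zero-point needed to dequantize them."""
    w_min, w_max = weights.min(), weights.max()
    scale = (w_max - w_min) / 255.0                 # map the float range onto 256 int8 buckets
    zero_point = np.round(-w_min / scale) - 128     # so that w_min lands on -128
    q = np.clip(np.round(weights / scale) + zero_point, -128, 127).astype(np.int8)
    return q, scale, zero_point

def dequantize(q: np.ndarray, scale: float, zero_point: float) -> np.ndarray:
    """Recover approximate float32 weights from the int8 representation."""
    return (q.astype(np.float32) - zero_point) * scale

# Toy example: a layer's weights drop from 4 bytes/param to 1 byte/param.
w = np.random.randn(256, 256).astype(np.float32)
q, scale, zp = quantize_int8(w)
w_hat = dequantize(q, scale, zp)
print("max abs error:", np.abs(w - w_hat).max())
print("size: %d bytes -> %d bytes" % (w.nbytes, q.nbytes))
```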
Model distillation
- A method of model compression
- Transfer knowledge from a larger "teacher" model to a smaller "student" one (see the sketch below)
- Distillation can also be used beyond compression. For example, to transfer capabilities between model architectures.
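A rough PyTorch sketch of the classic soft-target distillation loss: the student is trained against a blend of the hard labels and the teacher's temperature-softened output distribution. The tiny teacher/student networks, the random batch, and the `T`/`alpha` values are placeholders for illustration, not a real training setup.

```python
import torch
import torch.nn.functional as F

# Hypothetical teacher/student pair just to show the shapes;
# in practice the teacher is a large pretrained model, kept frozen.
teacher = torch.nn.Sequential(torch.nn.Linear(784, 1024), torch.nn.ReLU(), torch.nn.Linear(1024, 10))
student = torch.nn.Sequential(torch.nn.Linear(784, 64), torch.nn.ReLU(), torch.nn.Linear(64, 10))

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend the usual hard-label loss with a soft-label loss that pushes the
    student's temperature-softened distribution toward the teacher's."""
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # rescale so the soft-loss gradients keep a comparable magnitude
    return alpha * hard + (1 - alpha) * soft

opt = torch.optim.Adam(student.parameters(), lr=1e-3)
x, y = torch.randn(32, 784), torch.randint(0, 10, (32,))  # stand-in batch
with torch.no_grad():
    t_logits = teacher(x)          # teacher provides soft targets only
s_logits = student(x)
opt.zero_grad()
loss = distillation_loss(s_logits, t_logits, y)
loss.backward()
opt.step()
```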
Model fine-tuning
- Adapt a pre-trained model to do even better on your specific task (see the sketch below)
- Can sometimes avoid fine-tuning with prompt engineering alone
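A minimal fine-tuning sketch in PyTorch, assuming torchvision >= 0.13 is available: freeze a pretrained backbone and train only a new task-specific head. The 5-class task and the random batch are stand-ins for a real labelled dataset.

```python
import torch
from torchvision import models

# Take a pretrained image classifier, freeze its backbone,
# and retrain only a new head for a hypothetical 5-class task.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

for p in model.parameters():
    p.requires_grad = False                           # freeze the pretrained backbone

model.fc = torch.nn.Linear(model.fc.in_features, 5)   # new task-specific head

opt = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
loss_fn = torch.nn.CrossEntropyLoss()

# Stand-in batch; in practice this comes from your labelled dataset.
x, y = torch.randn(8, 3, 224, 224), torch.randint(0, 5, (8,))
opt.zero_grad()
loss = loss_fn(model(x), y)
loss.backward()
opt.step()
```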