In theory, mixing channels is just adding the samples and dividing by the channel count. That leaves enough headroom to avoid clipping, but with 32 channels the output signal becomes very quiet; in that case we have far too much headroom. So which formula or method is useful in these cases? Hard limiting? Compression? Is there an easy, good-sounding way?
For file formats or situations that specify an actual global volume or amplification factor, the problem of reaching the desired target volume is basically transferred to the user.
In all other cases, a player generally resorts to some sort of heuristic, of which "divide by channel count" is one possibility, but not a very good one: for roughly uncorrelated sources, doubling the number of channels raises the mixed signal's average level by only about 3 dB, not the 6 dB that the worst-case peaks would suggest, so dividing by the channel count attenuates the mix far more than necessary.
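For illustration, here is a minimal sketch of that heuristic next to the common 1/sqrt(N) ("equal power") variant, assuming mono float buffers of equal length; all names are made up for the example and this is not any particular player's code.

```cpp
// Minimal sketch of two mixing heuristics for float sample buffers.
// All names are illustrative; nothing here is a specific player's code.
#include <cmath>
#include <cstddef>
#include <vector>

// Mix `channels` (each at least `frames` samples long) into one buffer.
// gain = 1/N      : can never clip, but becomes very quiet for large N.
// gain = 1/sqrt(N): keeps the average (RMS) level roughly constant for
//                   uncorrelated sources, at the cost of occasional peaks
//                   above 1.0 that a later limiter has to handle.
std::vector<float> mixChannels(const std::vector<std::vector<float>>& channels,
                               std::size_t frames,
                               bool equalPower)
{
    const float n = static_cast<float>(channels.size());
    const float gain = equalPower ? 1.0f / std::sqrt(n) : 1.0f / n;

    std::vector<float> out(frames, 0.0f);
    for (const auto& ch : channels)
        for (std::size_t i = 0; i < frames; ++i)
            out[i] += ch[i] * gain;
    return out;
}
```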
Dynamic range compression or hard limiting is a separate concern that can be applied after mixing in any case, in order to mitigate the clipping/distortion caused by going over 0 dBFS.
Also, if the number of mixed channels varies over time (as with the Windows system audio mixer, for example), you cannot even apply a heuristic based on channel count. In that case no channel gets attenuated at all and the result is just a plain addition of the source signals; Windows then applies a hard limiter to the result in order to avoid clipping.
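A crude sketch of that "just add everything, then limit" idea might look like the following; the clamp merely stands in for a real limiter (the actual Windows limiter uses gain smoothing rather than a plain clamp), and the function name is only illustrative.

```cpp
// Crude sketch of "sum everything, then limit": samples are added without
// any per-channel attenuation and the result is clamped to [-1, 1].
// A real system limiter smooths gain over time instead of hard-clipping,
// so this only demonstrates the structure, not production behaviour.
#include <algorithm>
#include <cstddef>
#include <vector>

void sumAndHardLimit(const std::vector<std::vector<float>>& sources,
                     std::vector<float>& out)
{
    std::fill(out.begin(), out.end(), 0.0f);
    for (const auto& src : sources)
        for (std::size_t i = 0; i < out.size() && i < src.size(); ++i)
            out[i] += src[i];

    for (float& s : out)
        s = std::clamp(s, -1.0f, 1.0f);   // hard limit at 0 dBFS
}
```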
And why do many players use floating-point sample data (-1.0 to 1.0) rather than, e.g., 32-bit integer data for their calculations? Is there a big difference in sound quality, or is it just simpler to convert the floating-point data to the actual output format?
Floating-point data is generally easier to work with because you can largely ignore data-type overflow if the signal gets too loud, as well as quantization distortion if it gets too quiet.
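A tiny, hypothetical example of what that forgiveness buys you: an intermediate float sum above full scale is still a perfectly valid number that can simply be scaled back down later, whereas the same operation on 32-bit integers has to be guarded against overflow at every step.

```cpp
// Illustration only: float tolerates intermediate values above full scale,
// 32-bit integer arithmetic does not.
#include <cstdint>
#include <cstdio>

int main()
{
    // Float: two nearly full-scale samples sum to 1.6, still a valid value,
    // which can be scaled back below 0 dBFS with no information lost.
    float a = 0.8f, b = 0.8f;
    float mixed = a + b;          // 1.6, above full scale but representable
    float fixed = mixed * 0.5f;   // bring it back into range afterwards

    // Int32: the same sum would overflow (undefined behaviour for signed
    // types in C++), so the mixer must widen the type or pre-attenuate.
    std::int32_t ia = 1'800'000'000, ib = 1'800'000'000;
    std::int64_t safe = static_cast<std::int64_t>(ia) + ib;

    std::printf("float mix: %f -> %f, int mix needs 64 bits: %lld\n",
                mixed, fixed, static_cast<long long>(safe));
}
```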
Sound-quality-wise, 32-bit float vs. 32-bit integer makes no practical difference; both provide more dynamic range than the human ear can resolve.
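As a back-of-the-envelope check (using the usual ~6 dB-per-bit rule): 32-bit integers give roughly 193 dB of dynamic range, while 32-bit float has a 24-bit mantissa, so about 144 dB at any one exponent, with the exponent sliding that window over an enormous range; either figure is well beyond the roughly 120 dB span of human hearing.

```cpp
// Rough dynamic-range figures, computed as 20*log10(2^bits).
#include <cmath>
#include <cstdio>

int main()
{
    std::printf("32-bit integer : %.1f dB\n", 20.0 * std::log10(std::pow(2.0, 32)));
    std::printf("float mantissa : %.1f dB\n", 20.0 * std::log10(std::pow(2.0, 24)));
}
```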