DeepSeek R1 botando el tablero

Hace unos días DeepSeek lanzó su modelo R1 y movió el piso. Tanto como para que las acciones de Nvidia cayeran cerca de un 17%.

Todo esto parece ser parte de una nueva guerra fría en la geopolítica y la lucha tecnológica entre China y EE.UU.

DeepSeek entrenó su modelo R1 usando las GPUs H800 (de Nvidia), que tienen menos ancho de banda y son menos potentes que las H100 –a las que China no tiene acceso debido a restricciones por sanciones económicas impuestas por EE.UU.

Para hacerlo (y compensar estas limitaciones) tuvieron que optimizar al máximo el uso de las H800 programando en PTX, un lenguaje de bajo nivel como assembly, para manejar la comunicación entre GPUs. Algo MUY poco común.

Con esto lograron desarrollar un modelo tan competente como el mejor de OpenAI disponible actualmente en el mercado, como es el caso de o1.

Pero además:

Liberaron toda la investigación y trabajo que hicieron para conseguirlo
Lo consiguieron a una fracción del costo (US $5,6 millones)
Puedes usar R1 gratis

No sabemos cuánto le costó a OpenAI entrenar o1, pero usar DeepSeek R1 vía API es 27 veces más barato que usar o1 de OpenAI.

Esto impactó el mercado, con las acciones de Nvidia cayendo hoy casi un 17%.

Así que ahí donde EE.UU. vio las sanciones y restricciones económicas como una oportunidad para frenar el desarrollo de China, aparece DeepSeek R1 como respuesta.

Si tienes bolsillos grandes hay innovaciones que no necesitas hacer

Además, las declaraciones de Sam Altman en India el 2023 envejecieron mal cuando dijo: “Es totalmente inútil competir con nosotros en el entrenamiento de modelos de base”. Esto cuando le preguntaron si era posible hacer algo importante con un equipo pequeño e inteligente con un presupuesto de US $10 millones (de nuevo, entrenar R1 costó US $5,6 millones).

El impacto del Open Source

Pero esto es más que solo una cuestión geopolítica y de la eficiencia que tuvieron que lograr por tener peores GPUs disponibles. Esto tiene otra arista interesante.

DeepSeek se ha beneficiado de la investigación abierta, como es el caso de PyTorch, Llama de Meta y de otros modelos abiertos como puedes ver en huggingface. Idearon y crearon sobre el trabajo de otras personas, y a la vez publicaron y dejaron disponible su trabajo como código abierto.

DeepSeek R1 nace en medio de restricciones, respondiendo con eficiencia y capacidad técnica, pero también haciéndose parte de la investigación abierta. El campo está abierto, la competencia sigue y esto acelerará no solo el avance de esta tecnología, sino también la reducción de costos y barreras para entrenar, crear y usar inteligencia artificial.

Recomiendo este artículo:

DeepSeek FAQ

That seems impossibly low.

DeepSeek is clear that these costs are only for the final training run, and exclude all other expenses; from the V3 paper:

Lastly, we emphasize again the economical training costs of DeepSeek-V3, summarized in Table 1, achieved through our optimized co-design of algorithms, frameworks, and hardware. During the pre-training stage, training DeepSeek-V3 on each trillion tokens requires only 180K H800 GPU hours, i.e., 3.7 days on our cluster with 2048 H800 GPUs. Consequently, our pre-training stage is completed in less than two months and costs 2664K GPU hours. Combined with 119K GPU hours for the context length extension and 5K GPU hours for post-training, DeepSeek-V3 costs only 2.788M GPU hours for its full training. Assuming the rental price of the H800 GPU is $2 per GPU hour, our total training costs amount to only $5.576 million. Note that the aforementioned costs include only the official training of DeepSeek-V3, excluding the costs associated with prior research and ablation experiments on architectures, algorithms, or data.

So no, you can’t replicate DeepSeek the company for $5.576 million.

I still don’t believe that number.

Actually, the burden of proof is on the doubters, at least once you understand the V3 architecture. Remember that bit about DeepSeekMoE: V3 has 671 billion parameters, but only 37 billion parameters in the active expert are computed per token; this equates to 333.3 billion FLOPs of compute per token. Here I should mention another DeepSeek innovation: while parameters were stored with BF16 or FP32 precision, they were reduced to FP8 precision for calculations; 2048 H800 GPUs have a capacity of 3.97 exoflops, i.e. 3.97 billion billion FLOPS. The training set, meanwhile, consisted of 14.8 trillion tokens; once you do all of the math it becomes apparent that 2.8 million H800 hours is sufficient for training V3. Again, this was just the final run, not the total cost, but it’s a plausible number.

Here’s the thing: a huge number of the innovations I explained above are about overcoming the lack of memory bandwidth implied in using H800s instead of H100s. Moreover, if you actually did the math on the previous question, you would realize that DeepSeek actually had an excess of computing; that’s because DeepSeek actually programmed 20 of the 132 processing units on each H800 specifically to manage cross-chip communications. This is actually impossible to do in CUDA. DeepSeek engineers had to drop down to PTX, a low-level instruction set for Nvidia GPUs that is basically like assembly language. This is an insane level of optimization that only makes sense if you are using H800s.

Again, just to emphasize this point, all of the decisions DeepSeek made in the design of this model only make sense if you are constrained to the H800; if DeepSeek had access to H100s, they probably would have used a larger training cluster with much fewer optimizations specifically focused on overcoming the lack of bandwidth.

So what about the chip ban?

The easiest argument to make is that the importance of the chip ban has only been accentuated given the U.S.’s rapidly evaporating lead in software. Software and knowhow can’t be embargoed — we’ve had these debates and realizations before — but chips are physical objects and the U.S. is justified in keeping them away from China.

At the same time, there should be some humility about the fact that earlier iterations of the chip ban seem to have directly led to DeepSeek’s innovations. Those innovations, moreover, would extend to not just smuggled Nvidia chips or nerfed ones like the H800, but to Huawei’s Ascend chips as well. Indeed, you can very much make the case that the primary outcome of the chip ban is today’s crash in Nvidia’s stock price.

What concerns me is the mindset undergirding something like the chip ban: instead of competing through innovation in the future the U.S. is competing through the denial of innovation in the past. Yes, this may help in the short term — again, DeepSeek would be even more effective with more computing — but in the long run it simply sews the seeds for competition in an industry — chips and semiconductor equipment — over which the U.S. has a dominant position.

Si tienes bolsillos grandes hay innovaciones que no necesitas hacer#

El impacto del Open Source#

DeepSeek FAQ#

Si tienes bolsillos grandes hay innovaciones que no necesitas hacer

El impacto del Open Source

DeepSeek FAQ