Getting Started#

What is Modal?#

Modal lets you run code in the cloud without having to think about infrastructure. It takes your code, puts it in a container, and executes it in Modal’s cloud. You don’t need to mess with Kubernetes, Docker, or even an AWS account.

  • Run any code remotely within seconds.

  • Scale up horizontally to thousands of containers.

  • Deploy and monitor persistent cron jobs.

  • Attach GPUs with a single line of code (see the sketch after this list).

  • Serve your functions as web endpoints.
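
To make these points concrete, here is a minimal sketch of an ordinary Modal function (not Flywheel-specific). The app and function names are arbitrary, and it assumes a reasonably recent modal client (`modal.App` is called `modal.Stub` on older releases):

import modal

# Arbitrary app name for this sketch.
app = modal.App("getting-started-sketch")

@app.function(gpu=modal.gpu.A10G())  # attaching a GPU is a single argument
def square(x: int) -> int:
    # This body runs inside a container in Modal's cloud.
    return x * x

@app.local_entrypoint()
def main():
    # .remote() ships the call to Modal and returns the result locally.
    print(square.remote(7))

Running the file with modal run builds a container, executes square remotely, and prints 49.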

Account Setup#

The key thing about using Modal is that you don’t have to set up any infrastructure. Simply…

  • Create an account at modal.com

  • Set up your environment

    pip install modal
    python3 -m modal setup
    

…and you’re ready to go.

Minimal Example#

The following code shows a minimal example that invokes Llama 2 7B Chat on an NVIDIA A10G GPU. Just copy it into a .py file and run it to get your Flywheel endpoint up and running.

Note

The first time you attempt to run an MK1 model, you will be prompted on the command line to accept our terms and conditions. This is a one-time process.

import modal

# Look up MK1's deployed Flywheel class and attach an A10G GPU to it.
Model = modal.Cls.lookup(
    "mk1-flywheel-latest-llama2-7b-chat", "Model", workspace="mk1"
).with_options(
    gpu=modal.gpu.A10G(),
)

model = Model()
prompt = "[INST] What is the difference between a llama and an alpaca? [/INST] "

print(f"Prompt:\n{prompt}\n")

# Run generation remotely on Modal and pull the first completion out of the result.
responses = model.generate.remote(text=prompt, max_tokens=512, eos_token_ids=[1, 2])
response = responses["responses"][0]["text"]

print(f"Response:\n{response}")

In the following sections you will find more information about other images with pre-populated models and how to use them.

Next Steps#

Bring-Your-Own-Model

Serve your own models (perhaps fine-tuned) with Flywheel on Modal.

Example: Batch Document Summarization on Modal

Summarize a large batch of news articles with Flywheel in half the time compared to vLLM.

Example: Endpoint

Set up your own endpoint with Flywheel and bootstrap any inference application. Experience up to 2x throughput at the same latency compared to other leading inference solutions.