Template execution is expensive, particularly if the page contains nested loops (to render a table, for example). In one admittedly extreme example, a 100x100 table took about 500 ms to render.

Profiling shows no single hotspot to optimise; the time is spread fairly evenly across many small per-tag calls:
{{{
         432324 function calls (392124 primitive calls) in 9.497 CPU seconds

   Ordered by: internal time, call count
   List reduced from 38 to 20 due to restriction <20>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    80202    0.895    0.000    0.895    0.000 context.py:147(write_content)
    20101    0.781    0.000    0.967    0.000 <albatross>:0(?)
  20101/1    0.655    0.000    9.497    9.497 template.py:209(to_html)
    10000    0.628    0.000    2.141    0.000 tags.py:87(get_name)
    20000    0.589    0.000    0.589    0.000 tags.py:19(escape)
    10000    0.582    0.000    4.490    0.000 tags.py:126(generic_to_html)
    10000    0.573    0.000    0.685    0.000 template.py:127(write_attribs_except)
    10000    0.510    0.000    8.299    0.001 tags.py:1178(to_html)
    20101    0.438    0.000    1.405    0.000 context.py:385(eval_expr)
    20101    0.424    0.000    1.829    0.000 template.py:145(eval_attrib)
    101/1    0.400    0.004    9.497    9.497 tags.py:668(to_html)
    10000    0.393    0.000    1.399    0.000 tags.py:1043(to_html)
    20402    0.295    0.000    0.295    0.000 tags.py:617(has_value)
    10000    0.283    0.000    4.991    0.000 tags.py:120(to_html)
    10000    0.281    0.000    1.513    0.000 tags.py:32(get_name)
    10000    0.237    0.000    2.378    0.000 tags.py:44(get_name_and_value)
    20202    0.213    0.000    0.439    0.000 template.py:43(to_html)
    10100    0.206    0.000    0.403    0.000 tags.py:630(next)
20100/100    0.204    0.000    9.490    0.095 template.py:251(to_html)
    30100    0.186    0.000    0.186    0.000 tags.py:611(value)
}}}
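
For anyone wanting to repeat the measurement, a listing in the same format (ordered by internal time and call count, limited to 20 rows) can be produced with the standard `cProfile` and `pstats` modules. The following is only a minimal sketch: `render_table` is a hypothetical stand-in for the real template execution, not Albatross code, and should be replaced with the actual rendering call.

```python
import cProfile
import io
import pstats


def render_table(rows, cols):
    """Hypothetical stand-in for template execution: nested loops emitting cells."""
    out = ["<table>"]
    for r in range(rows):
        out.append("<tr>")
        for c in range(cols):
            out.append("<td>%d</td>" % (r * cols + c))
        out.append("</tr>")
    out.append("</table>")
    return "".join(out)


def profile_render(rows=100, cols=100, limit=20):
    """Profile one render and return (html, text report)."""
    profiler = cProfile.Profile()
    # runcall() profiles the call and passes back its return value.
    html = profiler.runcall(render_table, rows, cols)
    buf = io.StringIO()
    stats = pstats.Stats(profiler, stream=buf)
    # Sort by internal time, then call count, and keep the top `limit` rows.
    stats.sort_stats("time", "calls").print_stats(limit)
    return html, buf.getvalue()


if __name__ == "__main__":
    _, report = profile_render()
    print(report)
```

Timings taken under the profiler include its own overhead, so they are only useful for comparing functions against each other, not as absolute render times.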

One option would be to compile the template into Python code, although this would introduce semantic differences in execution (emulating the current globals() and ctx.locals may be tricky).
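
To make the idea concrete, here is a rough sketch of what compilation could look like. It does not use the real Albatross tag classes: the tuple-based node format (`"text"`, `"expr"`, `"for"`) and the function names `emit`, `compile_template` and `load_renderer` are all invented for illustration. The template tree is turned into the source of a single `render(write)` function, so the nested loops run as plain Python loops instead of one `to_html` call per tag.

```python
import html


def emit(nodes, lines, depth):
    """Append Python source for `nodes` to `lines` at the given indent depth."""
    pad = "    " * depth
    if not nodes:
        lines.append(pad + "pass")  # keep empty bodies syntactically valid
    for node in nodes:
        kind = node[0]
        if kind == "text":
            # Literal markup is written out unchanged.
            lines.append("%swrite(%r)" % (pad, node[1]))
        elif kind == "expr":
            # Expressions are inlined as source and escaped at run time.
            lines.append("%swrite(escape(str(%s)))" % (pad, node[1]))
        elif kind == "for":
            _, var, iterable, body = node
            lines.append("%sfor %s in %s:" % (pad, var, iterable))
            emit(body, lines, depth + 1)
        else:
            raise ValueError("unknown node kind: %r" % (kind,))


def compile_template(nodes):
    """Return Python source defining render(write) for the node tree."""
    lines = ["def render(write):"]
    emit(nodes, lines, 1)
    return "\n".join(lines) + "\n"


def load_renderer(source, namespace):
    """Exec the generated source with `namespace` as its globals."""
    scope = dict(namespace)  # copy, so the caller's dict is not modified
    scope["escape"] = html.escape
    exec(compile(source, "<compiled template>", "exec"), scope)
    return scope["render"]


def render_to_string(nodes, namespace):
    out = []
    load_renderer(compile_template(nodes), namespace)(out.append)
    return "".join(out)


# Hypothetical node tree for a rows x cols table.
TABLE = [
    ("text", "<table>"),
    ("for", "r", "range(rows)", [
        ("text", "<tr>"),
        ("for", "c", "range(cols)", [
            ("text", "<td>"), ("expr", "r * c"), ("text", "</td>"),
        ]),
        ("text", "</tr>"),
    ]),
    ("text", "</table>"),
]

if __name__ == "__main__":
    print(compile_template(TABLE))
    print(render_to_string(TABLE, {"rows": 2, "cols": 2}))
```

The sketch also shows where the semantic differences come from: loop variables such as `r` and `c` become locals of the generated function and are never written back to anything like ctx.locals, and names in expressions resolve against a copied namespace rather than the live one.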