Template execution is expensive, particularly if the page contains nested loops (to render a table, for example). In one admittedly extreme example, a 100x100 table took about 500 ms to render.
Profiling shows no single hotspot worth optimising; the time is spread across many small calls:
432324 function calls (392124 primitive calls) in 9.497 CPU seconds

   Ordered by: internal time, call count
   List reduced from 38 to 20 due to restriction <20>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    80202    0.895    0.000    0.895    0.000 context.py:147(write_content)
    20101    0.781    0.000    0.967    0.000 <albatross>:0(?)
  20101/1    0.655    0.000    9.497    9.497 template.py:209(to_html)
    10000    0.628    0.000    2.141    0.000 tags.py:87(get_name)
    20000    0.589    0.000    0.589    0.000 tags.py:19(escape)
    10000    0.582    0.000    4.490    0.000 tags.py:126(generic_to_html)
    10000    0.573    0.000    0.685    0.000 template.py:127(write_attribs_except)
    10000    0.510    0.000    8.299    0.001 tags.py:1178(to_html)
    20101    0.438    0.000    1.405    0.000 context.py:385(eval_expr)
    20101    0.424    0.000    1.829    0.000 template.py:145(eval_attrib)
    101/1    0.400    0.004    9.497    9.497 tags.py:668(to_html)
    10000    0.393    0.000    1.399    0.000 tags.py:1043(to_html)
    20402    0.295    0.000    0.295    0.000 tags.py:617(has_value)
    10000    0.283    0.000    4.991    0.000 tags.py:120(to_html)
    10000    0.281    0.000    1.513    0.000 tags.py:32(get_name)
    10000    0.237    0.000    2.378    0.000 tags.py:44(get_name_and_value)
    20202    0.213    0.000    0.439    0.000 template.py:43(to_html)
    10100    0.206    0.000    0.403    0.000 tags.py:630(next)
20100/100    0.204    0.000    9.490    0.095 template.py:251(to_html)
    30100    0.186    0.000    0.186    0.000 tags.py:611(value)
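For anyone wanting to repeat the measurement, a listing sorted in this way can be produced with the standard library profiler. The sketch below is only an illustration: `render_table` and `profile_render` are hypothetical names, and `render_table` is a plain-Python stand-in for a nested-loop page, not the Albatross template that was actually profiled.

```python
import cProfile
import io
import pstats
from html import escape


def render_table(rows, cols):
    """Hypothetical stand-in for a template containing two nested loops."""
    out = ["<table>"]
    for r in range(rows):
        out.append("<tr>")
        for c in range(cols):
            out.append("<td>%s</td>" % escape(str(r * cols + c)))
        out.append("</tr>")
    out.append("</table>")
    return "".join(out)


def profile_render(rows=100, cols=100, limit=20):
    """Profile one render and return (html, report text)."""
    profiler = cProfile.Profile()
    html = profiler.runcall(render_table, rows, cols)

    # Sort by internal time, then call count, and keep the top `limit` rows.
    buf = io.StringIO()
    stats = pstats.Stats(profiler, stream=buf)
    stats.sort_stats("time", "calls").print_stats(limit)
    return html, buf.getvalue()


if __name__ == "__main__":
    _, report = profile_render()
    print(report)
```

To profile a real page, the call to `render_table` would be replaced with whatever executes the template; absolute timings under the profiler will be considerably higher than in normal execution.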
One option would be to compile the template into Python code, although this would introduce semantic differences in execution (emulating the current globals() and ctx.locals may be tricky).
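To make the idea concrete, here is a minimal sketch of compiling a template to Python source. Everything in it is an assumption made for illustration: the tuple-based node tree, and the names `compile_template`, `render` and `_render`, are hypothetical and do not reflect how Albatross represents templates internally.

```python
from html import escape


def compile_template(nodes):
    """Translate a toy node tree into a code object defining _render(write).

    Node shapes (hypothetical):
      ("text", literal)
      ("expr", python_expression)
      ("for", var_name, iterable_expression, [child nodes])
    """
    lines = ["def _render(write):"]

    def emit(node, depth):
        pad = "    " * depth
        kind = node[0]
        if kind == "text":
            lines.append(f"{pad}write({node[1]!r})")
        elif kind == "expr":
            # The expression is inlined, so it is compiled once, not per call.
            lines.append(f"{pad}write(_escape(str({node[1]})))")
        elif kind == "for":
            _, var, iterable, body = node
            lines.append(f"{pad}for {var} in {iterable}:")
            if not body:
                lines.append(f"{pad}    pass")
            for child in body:
                emit(child, depth + 1)
        else:
            raise ValueError(f"unknown node type: {kind!r}")

    for node in nodes:
        emit(node, 1)
    if len(lines) == 1:
        lines.append("    pass")
    return compile("\n".join(lines), "<template>", "exec")


def render(code, ctx_locals):
    """Run compiled template code against a copy of the context's names."""
    # A copy of ctx_locals serves as the globals of the generated function.
    namespace = dict(ctx_locals)
    namespace["_escape"] = escape
    exec(code, namespace)
    out = []
    namespace["_render"](out.append)
    return "".join(out)
```

The sketch also shows where the semantics diverge. Loop variables become locals of the generated function, so they are never written back into the context, and names are resolved through a copied dictionary rather than through the live globals() and ctx.locals. A real implementation would have to decide how much of the existing lookup behaviour to preserve.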