FlatFileAdvantages

PmWiki stores pages in flat files instead of using a relational database such as MySQL. This page explains why that design decision was made.

Pm's Explanation

Pm: I chose flat files to store PmWiki pages because I haven't seen any real advantages to using a database, and there are definitely some disadvantages. For the standard operations (view, edit, page revisions), holding the information in flat files is clearly faster than accessing it in a database, and with page caching abilities (coming soon) it'll be even faster. The only operations that really benefit from a database are searches, but I've always believed that for fast, flexible search capabilities it's much better to use existing search programs such as ht://Dig or Google than to reinvent another search engine. PmWiki's Site.Search is functional and fast enough for most purposes, and if more performance is needed it's better to switch to a real search engine.

Indeed, as of January 2004 Wikipedia uses a MySQL database to store its 190K+ entries, but even with the database Wikipedia has disabled its online search because of performance issues and just forwards search queries directly to Google. (See the talk page.)

And there are big disadvantages to using a database: we'd have to write a bunch of "administrative" tools/scripts to handle things such as mass page deletions in the database, backups/restores of the pages, recovering pages that have been wrongly deleted, etc. Much of that administrative programming overhead is eliminated by using a flat file system, as admins can use existing tools (FTP clients, web-based file/directory managers, shell commands) that they are already comfortable with. It's also much easier to build sophisticated and customized page management tools and scripts for specialized applications.
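As a rough illustration of how little tooling a flat-file backup and restore needs, here is a minimal Python sketch. It assumes that every page is a single file directly inside a page directory such as wiki.d/; the function names backup_pages and restore_page are made up for this example and are not part of PmWiki.

```python
import tarfile
from pathlib import Path


def backup_pages(wiki_dir: Path, archive: Path) -> int:
    """Pack every page file in wiki_dir into a gzip-compressed tar archive.

    Returns the number of files archived. Assumes one file per page,
    stored directly in wiki_dir (hypothetical layout for this sketch).
    """
    count = 0
    with tarfile.open(archive, "w:gz") as tar:
        for page in sorted(wiki_dir.iterdir()):
            if page.is_file():
                # Store only the file name so the archive is relocatable.
                tar.add(page, arcname=page.name)
                count += 1
    return count


def restore_page(archive: Path, page_name: str, wiki_dir: Path) -> Path:
    """Recover a single (for example, wrongly deleted) page from the archive."""
    with tarfile.open(archive, "r:gz") as tar:
        member = tar.getmember(page_name)  # raises KeyError if not archived
        source = tar.extractfile(member)
        target = wiki_dir / page_name
        target.write_bytes(source.read())
    return target
```

The same result can be had with an FTP client or a shell command; the point is only that no database-specific dump or restore tooling is involved.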

Finally, PmWiki is already structured such that the flat file storage can easily be replaced by a database if it ever proves necessary. However, even PmWiki sites with more than 40,000 pages function well on flat files without any noticeable performance problems.

PmWiki supports the ability to subdivide the wiki.d/ directory into separate subdirectories for each group, avoiding the "too large" directory problem. See Cookbook:PerGroupSubDirectories for more information.
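To make the idea concrete, here is a small Python sketch of the path mapping. It is an illustration only: the exact layout used by the Cookbook recipe is an assumption here, and flat_path and per_group_path are hypothetical names, not PmWiki functions.

```python
from pathlib import PurePosixPath


def flat_path(pagename: str, wiki_dir: str = "wiki.d") -> PurePosixPath:
    """Single-directory layout: every page file sits directly in wiki_dir."""
    return PurePosixPath(wiki_dir) / pagename


def per_group_path(pagename: str, wiki_dir: str = "wiki.d") -> PurePosixPath:
    """Per-group layout (assumed): one subdirectory per group.

    A page name is expected in the form "Group.Name".
    """
    group, sep, name = pagename.partition(".")
    if not sep or not group or not name:
        raise ValueError(f"expected 'Group.Name', got {pagename!r}")
    return PurePosixPath(wiki_dir) / group / pagename
```

With this mapping, a page such as Main.HomePage ends up in wiki.d/Main/, so no single directory has to hold every page of the site.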

Comments:

  • Flat files are indeed much easier to manage, and my experience shows that they cause no problems at all for PmWiki. Still, I had problems convincing my boss to use PmWiki since it does not use a "real" database. Ever thought of using subdirectories for each group, like in Uploads? There are known issues on Solaris for directories containing more than 20,000 files. -- Uli
    PmWiki already supports the ability to subdivide the wiki.d/ directory into separate subdirectories for each group, avoiding the "too large" directory problem. Contact me via email or pmwiki-users. --Pm
    This is now specified in Cookbook:PerGroupSubDirectories. Thanks, Ben and Patrick! --Sproaticus
  • On a Linux-based operating system, with a filesystem like ReiserFS that can handle directories with huge numbers of file entries, performance should not be a problem and should even be better than with a database. -- Pouik
  • There is a lot of prejudice out there in favor of using database engines instead of flat files. Choosing which to use in a project ought to be similar to choosing what programming language to use. Some of the questions to ask are:
    • Which choice fits the problem domain best? (Databases best fit random queries against a very large set of records; flat files fit wikis best.)
    • What are the programmers familiar with; what do they like?
    • What is available; what does the corporate culture allow; how much do they cost? -- David Spector
  • Personally, I like to store unstructured data in flat files. However, I do believe a database has its advantages for structured data. I felt this way when I was using other wikis (Tiki, Wikipedia, phpWiki...): I always wanted to extend them to support flat files. So, how about a common API? -- Duncan Hsu
    • PmWiki already has a common API, implemented via the PageStore class in pmwiki.php. Cookbook authors can create a class with the same interface as PageStore that saves pages in alternate locations such as a database. --Pm
  • I've got a question: wouldn't there be a problem with simultaneous multi-user access to a file? (I mean writing, and possibly losing others' changes.)
    • That is one problem, I guess. Another is the administration side of it. Of course I can dive into FTP and work with the flat files there, but I would like an admin interface for restoring articles, mainly because I have editors who are not as familiar with FTP as I am. --sjoerd
    • PmWiki handles any locking necessary to make sure that multiple accesses to a file don't cause any changes to be lost. PmWiki also supports automatic merging of simultaneous edits. --Pm
  • I created an 8000-file wiki for fun and testing. Basic page handling is fine, with no performance issues. Search is acceptable. However, creating the .linkindex file from scratch is a problem. The host I run the site on (and my test machine) has a timeout of 30 seconds. I disabled the link index; however, now backlinks (pagelist link={$FullName}) are too slow. --BrBrBr
    Re-enable the link index and run a few backlink searches (even if they time out). PmWiki will incrementally build the link index. Once the link index is built, everything will be fast and there won't be a big cost in keeping the link index up-to-date. --Pm
  • Another BIG advantage of flat files is that they are easy to edit directly. -- Babak
    • Exactly! I know many scenarios where data loss caused by hardware or transfer failure (storage medium crashes, power dropouts and the like) was easy to fix by simply using an editor on the (flat-file) server's command line and changing back whatever was causing errors. I've never been able to do this with similar ease for MySQL (and in such cases I hate my job). -- SomeSysAdmin
  • Maybe the reason flat files work so well is that a file system IS a hierarchical database -- William
  • Another advantage of flat files: you can install PmWiki on a server that doesn't offer a database (e.g., a barebones academic server with PHP, but no MySQL). As someone who has long used plain text files and simple version control, I like having all my diffs in a plain text file. -- Matthew
  • Is a database more secure? That extra password protection needed to access MySQL databases must mean something... Right? -- Xen
    • Then why have no sites running PmWiki with flat files (that I know of) ever been compromised? ;-) -- Julius
    • If you can get access to PmWiki's flat files, you could also get access to the PHP script containing the database password. So it doesn't really provide any extra security. -- Andrew
      • Exactly. But one should never store the PHP file containing the (non-flat-file) database password in a location accessible through the web server. Instead, do an include and put the PHP file somewhere outside of the web/doc root. -- Julius
    • Most hosts won't let you access the database server from outside the webserver itself, so having READ access to the database password is not enough to do anything with it, not even read the password-protected pages. -- Spyro
  • I think the biggest disadvantage of using a flat-file system is that it takes the programmer too much time to design it and to keep it stable, especially as more and more new features are added to the project and more and more requirements come up. This also adds risk to users' data, as bugs are more likely to be introduced by program updates, and it makes compatibility problems harder to resolve. On the other hand, a flat-file system does work more efficiently than a database in most situations. -- Adam
    • I would have to disagree (with part of that). Programming something to speak to (and read from) MySQL for example can be just as painful, precisely because it is not your own code or design. That can be a huge disadvantage: You never know when an updated MySQL needs changed queries, when it will do what, if it will do what you need and so on. -- Julius
  • I think this could be an endless debate, because the line between advantage and disadvantage is often thin. IMHO the safe bet will always be to offer the option and let people choose according to their own needs. Cheers. -- h3
    • I don't think the line is that thin. With a separate database you will always have a much bigger chance of crashes and downtime. You make yourself more dependent by needing yet another service that has to be kept running, backed up separately, etc. Just count the times you see sites that don't work and show MySQL errors online; I have rarely, if ever, seen that happen with flat-file databases. -- Ben
      • Many people already have a copy of MySQL running, so that isn't a problem. The MySQL problems come from sites that run too many or too slow queries for their hardware. Something as simple as retrieving a wiki page isn't going to have trouble like that.
        • More people don't have a copy of MySQL running. In fact, I know more people who don't run it on their servers, precisely because it is such a resource monster for its purpose: merely a text-file storage system. -- Steven
  • Flat files have a very important advantage: you can diff and merge pages with merge tools. With that, you can keep copies of a wiki site on all your computers and merge them periodically. I think lots of people need this function. At least, I switched to DokuWiki from MediaWiki just because of this. -- Edward
  • Databases always sit on top of a filesystem. At least all of the "real" databases store their data on a filesystem. They provide an abstraction layer for purposes such as authentication, transactions or simply convenience across different operating systems, and they have a common query syntax (SQL). Therefore performance depends mainly on the following factors:
    • Performance of the filesystem
    • Efficient caching strategies (for data, queries, ...)
    • Efficient internal file organization
    • Efficient code (client and server) -- Heiko
  • Most file systems map files to hard sectors on a disk. Databases offer a level of virtualisation: the sectors can be on any disk or server. The result is that you can use one server/disk for the DB, another server for PHP and a third for the web server. You can share out load and get better overall performance even under very heavy usage. Of course that may not be the goal of PmWiki ;-). -- Peter
    • Well, you can always use NFS if you want your files on another server. But in both cases, NFS or a DB, running them on another server is actually likely to increase your latency and not necessarily increase your throughput. The advantage of a separate DB is more apparent when you need more than one client accessing it at the same time, which, of course, you can also do with NFS. The DB might provide better locking mechanisms, but they are not likely to be important to PmWiki (it is not write-heavy enough). How do you suggest running PHP on a different server from your web server? And, whatever your solution for this, wouldn't it also be available without a DB? -- Martin Fick
  • Just to say: I prefer flat files in this case simply because my home server is an MMX. But aren't SQL servers loaded in memory? Hard disk access is much slower than memory access, not to mention the really old disks (mine is a 2 GB ATA100). Of course not all the pages should be loaded in memory all the time, but the most accessed ones could be... Also, it is easier to provide a single download file with all the wiki data for a user who wants to have it offline; they will just need a way to read it. And my third point is that flat files suit a wiki better because no JOIN is needed.
  • The only disadvantage I find in using flat files compared to storing data in a database is that you cannot perform certain data operations such as search and replace (which is easily done with an SQL query). Especially if you have a large site, that is quite a tedious job in PmWiki (but perhaps I should check the latest Cookbook recipes for a similar function). Other than that, no complaints - I am sticking to PmWiki. =) -- Bien
  • I agree w/ the file-based approach. Of course, if you throw in good SQLite support (an amazing tool) many of the arguments fade. Nevertheless, I think the actual file format used by PmWiki is close to being binary, since it requires inbound and outbound filters (see Cookbook.AdminByShell) before you can do anything very useful w/ the data, e.g. edit it w/ Vim. Despite how good PmWiki is, I had to choose DokuWiki for work, essentially for this reason.
  • I prefer flat files so that if I decide not to use the wiki software in the future, or just want the documents fast, I can simply grab them. I'm not reliant on a database and its software.
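
On the question raised above about simultaneous writes to the same file: the general technique of serializing writers with an advisory lock can be sketched in Python as follows. This is a generic illustration, not PmWiki's actual code; the lock file name .lock and the helpers wiki_lock and append_line are hypothetical, and fcntl is available on Unix-like systems only.

```python
import fcntl
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def wiki_lock(wiki_dir: Path):
    """Hold an exclusive advisory lock for the duration of the block."""
    lock_path = wiki_dir / ".lock"  # hypothetical lock file name
    with open(lock_path, "w") as lock_file:
        # Blocks until no other process holds the lock.
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        try:
            yield
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)


def append_line(wiki_dir: Path, page_name: str, line: str) -> None:
    """Read-modify-write a page file while holding the lock."""
    with wiki_lock(wiki_dir):
        page = wiki_dir / page_name
        old = page.read_text() if page.exists() else ""
        page.write_text(old + line + "\n")
```

Because each writer reads and rewrites the page only while holding the lock, two writers cannot interleave and silently drop each other's changes. Merging simultaneous edits, as mentioned in the reply above, is a separate step that this sketch does not cover.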


This page may have a more recent version on pmwiki.org: PmWiki:FlatFileAdvantages, and a talk page: PmWiki:FlatFileAdvantages-Talk.