Initial commit: Add CoderAI OpenAI-compatible API server
- Add main server script with FastAPI and memory-aware model loading
- Add requirements.txt with dependencies and platform-specific PyTorch options
- Add comprehensive README.md with installation, usage, and troubleshooting
- Add LICENSE.md with GPLv3 license
Files added (new, mode 100644):

- LICENSE.md
- README.md
- coderai
- requirements.txt
requirements.txt:

# FastAPI and server dependencies
fastapi>=0.104.0
uvicorn[standard]>=0.24.0
pydantic>=2.5.0

# PyTorch - Uncomment the appropriate version for your system:
# For NVIDIA (CUDA):
# torch>=2.0.0
# torchvision>=0.15.0
# torchaudio>=2.0.0

# For AMD (ROCm):
# --index-url https://download.pytorch.org/whl/rocm5.4.2
# torch>=2.0.0
# torchvision>=0.15.0
# torchaudio>=2.0.0

# For CPU only:
torch>=2.0.0

# ML dependencies
transformers>=4.35.0
accelerate>=0.24.0

# System resource detection
psutil>=5.9.0
procname>=0.3.0

# Optional: for better performance
# bitsandbytes>=0.41.0   # for 4-bit/8-bit quantization
# sentencepiece>=0.1.99  # for some tokenizers
# protobuf>=3.20.0       # for some models

# Optional: Flash Attention 2 for faster inference on supported GPUs
# Requires specific CUDA/ROCm versions and may need manual installation
# Install with: pip install flash-attn --no-build-isolation
# flash-attn>=2.5.0

# Installation instructions:
# 1. For NVIDIA GPUs: pip install torch torchvision torchaudio
# 2. For AMD GPUs: pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm5.4.2
# 3. For CPU only: pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
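The commit message mentions memory-aware model loading, with psutil listed under "System resource detection". A minimal sketch of what such a selection step might look like follows; the model names and memory thresholds here are hypothetical, not taken from the actual server script:

```python
def pick_model(available_gb: float) -> str:
    """Pick the largest model tier that fits in the given available memory.

    Tiers are ordered largest-first; the memory figure would come from
    psutil at startup (hypothetical tiers, for illustration only).
    """
    tiers = [
        (24.0, "codellama-13b"),   # needs roughly 24 GB free
        (12.0, "codellama-7b"),    # needs roughly 12 GB free
        (6.0, "tinyllama-1.1b"),   # smallest tier
    ]
    for min_gb, name in tiers:
        if available_gb >= min_gb:
            return name
    # Fall back to the smallest model if nothing fits comfortably.
    return tiers[-1][1]


# In the server, the available-memory figure could be obtained via psutil
# (as listed in requirements.txt):
#   import psutil
#   available_gb = psutil.virtual_memory().available / 1024**3
```

Keeping the threshold check separate from the psutil call makes the selection logic easy to test without touching real system memory.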